EFI-EST and Cytoscape Tutorials

EFI-EST Start Screen

Note 1: The EFI-EST uses the UniProtKB protein sequence database (maintained by EMBL-EBI) for its annotations because, unlike GenBank, UniProtKB provides the ability to add to and/or correct functional annotations. In addition, the EFI-EST uses the Pfam and InterPro databases (also maintained by EMBL-EBI) to provide easy access to the complete memberships of a large number of curated protein families/superfamilies (16,230 for Pfam 28.0; 27,463 for InterPro 53.0). The InterPro database collects signature sequences from 11 different databases, including Pfam, to define its families. Because the different databases may define the “same” family with slightly different signature sequences, InterPro families almost always are larger than Pfam families.

Note 2: The sequence similarity networks generated by this webserver utilize the full‑length sequences of the proteins that are identified by BLAST (Option A) or members of specified Pfam and/or InterPro families (Options B and C). As a result, the clusters that are generated and visualized in the networks will result from sequence similarities for the entire sequence. However, many proteins have multiple domains; for these proteins the alignments used to calculate the alignment scores will not necessarily be for the domain in which you may be interested. The EFI plans to provide the capability to excise structure-based domains as defined by CATH/SCOP from multidomain proteins; however, this has not yet been incorporated. So, you must be aware of this limitation when interpreting the networks provided by EFI-EST.

Note 3: Many proteins have multiple domains; for these proteins the alignments used to calculate the alignment scores will not necessarily be for the domain in which you may be interested. You must be aware of this limitation when interpreting the networks provided by EFI-EST.

Note 4: We now provide an “Advanced Option” for Option B that provides the capability to trim the full-length sequences of multidomain proteins to generate SSNs using domain boundaries defined by Pfam for the Pfam family that you enter. We recommend that you use this advanced option carefully: Pfam families “always” contain fragments of full-length sequences, and domains often are interrupted by insertions; both can complicate the interpretation of the SSN.


Three options for generating networks are available from the start screen. Select the option you want to use by clicking on the circle that precedes the Option and then entering the required information. Note that for each Option, an “Advanced Options” link is provided that will allow you to modify the default parameters used to generate the SSNs.



Note: Options A, B and C are also available to users of the Unix terminal scripts:
Option A: http://enzymefunction.org/resources/tutorials/command-line-ssn/from-sequence
Option B: http://enzymefunction.org/resources/tutorials/command-line-ssn
Option C: http://enzymefunction.org/resources/tutorials/command-line-ssn/ssn-from-fasta


Option A: Networks for "neighbors" to a user-supplied sequence. Paste a protein sequence (without a FASTA header) into the input box (red arrow). A sequence data set will be built containing the most closely related sequences retrieved from the UniProtKB database using a BLAST e-value upper limit threshold of 10⁻⁵. A default of 5,000 sequences is used, but the data set may be smaller if < 5,000 sequences are found using the BLAST e-value upper limit of 10⁻⁵. A default of ≤ 5,000 sequences is used because in most cases a full network with all sequences (nodes) will be viewable without having to collapse nodes into representative nodes (explained here). Use this option if you are only interested in those proteins that are most similar to your protein of interest.

You can use Option A to explore specific clusters in networks for entire (super)families generated with Option B by entering the sequence from a cluster in an Option B network (below).



Advanced Options (magenta arrows): By clicking on the Advanced Options tab below the input box, you can enter “custom” values for the maximum number of sequences that will be collected (default is 5,000; maximum is 10,000). You can also change the e-value used to collect the sequences and to perform the all-by-all BLAST.

Maximum BLAST Sequences: Option A was intended to allow the user to collect a subset of sequences within a large Pfam or InterPro family so that a full network (a node for each sequence) can be opened with the user’s computer. In some cases the default of 5,000 sequences may not allow the full network to be opened. In this case, you can decrease the number of sequences that will be collected by entering an integer < 5,000 in the input box.

Note that you can enter an integer as large as 10,000 if you want to collect more sequences to explore sequence-function space. You likely will need to download a representative node network to visualize the network.

E-Value: The all-by-all BLAST used to calculate the edges for the SSN returns a result only if the e-value is ≤ 10⁻⁵. For short sequences, e.g., < 100 residues, this default may be too small to allow an alignment score corresponding to 30% or less to be specified in the Analyze Data step. In these cases, you can select a larger upper limit for the e-value used in the all-by-all BLAST by entering an integer ≤ 5 (the negative log of the e-value); the lower limit for the input is 0. We recommend that you first generate the SSN with the 10⁻⁵ default and examine the percent identity quartile plot to determine whether you should change the e-value.
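The relationship between the integer entered in the E-Value box and the BLAST e-value limit can be sketched in Python. This is only an illustration of the arithmetic described above, not the code used by the EFI-EST server; the function names evalue_threshold and keeps_edge are hypothetical.

```python
def evalue_threshold(neg_log_evalue: int) -> float:
    """Convert the integer entered in the E-Value box into an e-value limit.

    The input is the negative log10 of the e-value, so 5 -> 1e-5 and 0 -> 1.
    """
    if neg_log_evalue < 0:
        raise ValueError("the lower limit for the input is 0")
    return 10.0 ** -neg_log_evalue


def keeps_edge(hit_evalue: float, neg_log_evalue: int = 5) -> bool:
    """Return True if a BLAST hit would be kept as an edge (hypothetical helper)."""
    return hit_evalue <= evalue_threshold(neg_log_evalue)


# A hit with an e-value of 1e-4 is discarded at the default (5)
# but kept if the limit is relaxed by entering 3.
print(keeps_edge(1e-4))
print(keeps_edge(1e-4, neg_log_evalue=3))
```

Entering a smaller integer therefore gives a larger (more permissive) e-value limit, which is what short sequences may require.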


Option B: Networks for an entire (super)family. This option requires that you know the Pfam and/or InterPro family identifier(s) for your family of interest. If you do not know the Pfam and/or InterPro family IDs, you can find them by using InterProScan, which can be accessed by clicking on the link to the InterPro homepage provided in Option B (red arrow).



Paste your sequence in the box on the InterPro home page (red arrow) and click “Search” (green arrow):



The signature sequences in the InterPro database will be searched; the output is a graphic showing the matches to the signature sequences in the 11 databases (and the regions of the protein in which each match occurs).



The Pfam and InterPro family ID(s) is(are) given for each match (red and blue boxes, respectively). You can click on the database ID to access the detailed description of the family in that database. In this example, the input sequence identifies one Pfam family and five InterPro families. Please refer to Figure 6 in the BBA article for additional details.

At present, EFI-EST accepts only Pfam and InterPro families. For Pfam families, the format is PFxxxxx (five digits); for InterPro families, the format is IPRxxxxxx (six digits). Option B usually will result in a much larger data set than Option A because all of the members of (super)families are included. Because entire (super)families are in the dataset, it is likely that you'll need to view representative node SSNs instead of full SSNs (see below).

Enter the Pfam and/or InterPro family number(s) in the input box for Option B in a comma-separated list (red arrow). The number of sequences that can be used in Option B is limited to 250,000. This limit is set to ensure that assembling the dataset/performing the all-by-all BLAST as well as generating the networks can be completed within several hours.
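If you prepare family lists in a script before pasting them into the input box, the ID formats described above (PF plus five digits, IPR plus six digits) can be checked with a short Python sketch. The function name parse_family_ids is hypothetical, and the strict upper-case matching is an assumption; the server may be more or less permissive.

```python
import re

# Formats stated in the tutorial: PFxxxxx (five digits), IPRxxxxxx (six digits).
PFAM_RE = re.compile(r"PF\d{5}")
INTERPRO_RE = re.compile(r"IPR\d{6}")


def parse_family_ids(text):
    """Split a comma-separated list into Pfam IDs, InterPro IDs, and rejects."""
    pfam, interpro, invalid = [], [], []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        if PFAM_RE.fullmatch(token):
            pfam.append(token)
        elif INTERPRO_RE.fullmatch(token):
            interpro.append(token)
        else:
            invalid.append(token)
    return pfam, interpro, invalid


# "PF123" has too few digits and is reported as invalid.
print(parse_family_ids("PF00668, IPR001242, PF123"))
```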



Since the all-by-all BLAST may take several hours (depending on the size and sequence divergence of the dataset), you may close the running window. When the dataset is complete, you’ll receive an e-mail with a link to analyze the dataset. This link will be active for 7 days so that you may return at your convenience.

Advanced Options (magenta arrows): By clicking on the Advanced Options tab below the input box, you can enter a “custom” e-value used in the all-by-all BLAST. You also can select a fraction of the sequences in the input Pfam and/or InterPro family(ies) so that you can generate a “representative” network for families with > 250,000 sequences. Finally, Pfam defines domain boundaries for members of its families; you can choose to generate the SSN with the domains instead of the full-length sequences.

E-Value: The all-by-all BLAST used to calculate the edges for the SSN returns a result only if the e-value is ≤ 10⁻⁵. For short sequences, e.g., < 100 residues, this default may be too small to allow an alignment score corresponding to 30% or less to be specified in the Analyze Data step. In these cases, you can select a larger upper limit for the e-value used in the all-by-all BLAST by entering an integer ≤ 5 (the negative log of the e-value); the lower limit for the input is 0. You should first generate the SSN with the 10⁻⁵ default and then inspect the percent identity quartile plot to see if the e-value limit for the all-by-all BLAST should be increased (by entering a smaller integer).

Fraction: Although the number of sequences that can be used to generate a SSN is limited to ≤ 250,000, with the advanced option you can select a fraction of the total number of sequences in larger sequence sets to generate a network. You can enter an integer that specifies the fraction of the sequences that you would like included in the BLAST and graphs (default is 1). The integer represents the divisor by which you wish to fractionate the dataset, e.g., 10 = only every 10th sequence in the total sequence dataset is used. The UniProt sequence dataset is not preorganized, so the sampling is “random”. If the dataset you initially select is too large (you will receive an e-mail if/when this happens), you can select the same dataset and enter an integer fraction to decrease the number of sequences to ≤ 250,000.
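The effect of the Fraction divisor can be sketched in Python as follows. This is a minimal illustration, assuming that "every Nth sequence" means taking the first sequence and then every Nth one after it; the server may start at a different offset. The function names fractionate and minimum_fraction are hypothetical.

```python
SEQUENCE_LIMIT = 250_000  # Option B limit stated in the tutorial


def fractionate(sequence_ids, fraction=1):
    """Keep every `fraction`-th sequence (fraction=1 keeps everything)."""
    if fraction < 1:
        raise ValueError("fraction must be an integer >= 1")
    return sequence_ids[::fraction]


def minimum_fraction(total_sequences, limit=SEQUENCE_LIMIT):
    """Smallest integer divisor that brings a dataset to <= limit sequences."""
    # Ceiling division without floating point; never return less than 1.
    return max(1, (total_sequences + limit - 1) // limit)


# With a divisor of 10, a 25-member list is reduced to 3 members.
print(fractionate(list(range(25)), 10))
# A family with 1,000,000 sequences needs a divisor of at least 4.
print(minimum_fraction(1_000_000))
```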

Domains: A SSN for full-length multidomain proteins may not group sequences into similar domain structures, so it may be difficult/impossible to use the SSN to infer functional relationships. Pfam defines N- and C-terminal domain boundaries for members of its families. Using these domain definitions, it is possible to trim full-length sequences of multidomain proteins to obtain only the domain specified by the Pfam family ID. For example, in nonribosomal peptide synthetases (NRPSs), the domain definitions can be used to extract the individual domains (e.g., condensation domains, PF00668) and use these to generate a SSN. If the full-length sequence has multiple homologues of the same domain, all of the domains will be extracted and used to generate the SSN.

If you would like to generate the SSN with domains instead of full-length sequences, click the check box. In the networks, the N- and C-boundaries of the domain are appended to the UniProt accession ID for the full-length sequence (ID:N-terminus:C-terminus).
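When you export node tables from a domain network, the ID:N-terminus:C-terminus label can be split back into its parts with a few lines of Python. This is a sketch only: the accession in the example is a placeholder, and treating the boundaries as 1-based, inclusive residue numbers is an assumption based on common sequence numbering, not something stated by EFI-EST.

```python
def parse_domain_label(label):
    """Split 'ID:N-terminus:C-terminus' into (accession, start, end)."""
    accession, start, end = label.split(":")
    return accession, int(start), int(end)


def trim_domain(sequence, start, end):
    """Return residues start..end, assuming 1-based inclusive boundaries."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("domain boundaries fall outside the sequence")
    return sequence[start - 1:end]


# Placeholder accession and boundaries, for illustration only.
accession, start, end = parse_domain_label("P12345:3:6")
print(accession, start, end)
print(trim_domain("ABCDEFGHIJ", start, end))
```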

Please be aware that Pfam families “always” include at least some fragments of full-length sequences as the result of sequencing errors, so these may complicate the analyses of networks for domains. In addition, in some proteins the domain belonging to one family may be inserted in the domain for a second family, resulting in two pieces of the second domain in the network.


Option C: Networks for a user-supplied FASTA file. Two sub-options are available for generating networks with Option C.


For the first sub-option (red arrow), upload a FASTA file (text format) in which the header for each sequence provides a description of the sequence. Because the sequences will not be associated with a UniProt ID, only the description you provide and the sequence length will be used as node attributes for these sequences. The network will include only the sequences in the FASTA file.
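To preview what the first sub-option will use from your file, the following Python sketch reads FASTA text and lists the description and length for each sequence. It is not the EFI-EST parser; the function names are hypothetical, and the attribute name "Sequence Length" is an assumed label used here only for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def node_attributes(records):
    """Build one attribute dictionary per sequence (attribute names assumed)."""
    return [
        {"Description": description, "Sequence Length": len(sequence)}
        for description, sequence in records
    ]


example = ">my enzyme A\nMKT\nAAV\n>my enzyme B\nMK\n"
print(node_attributes(read_fasta(example)))
```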

For the second sub-option, upload a FASTA file (red arrow) and also include the Pfam and/or InterPro family number(s) for family(ies) to which you would like to add your sequences for generating the network (orange arrow). For these networks, the node attributes for the Pfam/InterPro family members will be those provided in Option B; the “Description” node attributes for the sequences in the FASTA file will be the same as in the first sub-option. This sub-option places the sequences in the user-supplied FASTA file in the context of curated Pfam and/or InterPro families to enable inference of functions.

In the networks generated with Option C, the shared name and name attributes for the sequences in the FASTA file will have a total of six characters. The sequences in the FASTA file will be numbered sequentially starting with 0. The preceding characters (to make 6) will be "z", e.g., zzz123. You will find this useful: in the "Select" window of the Control Panel in Cytoscape, you can filter on the shared name or name node attributes; if you enter "z" in the search window and check "case sensitive", the nodes for the sequences in the FASTA file will be selected/highlighted.
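The naming scheme described above can be reproduced in Python if you need to match your FASTA entries to nodes in an exported node table. This is a sketch of the stated convention (sequential numbers starting at 0, left-padded with "z" to six characters); the function names are hypothetical.

```python
def fasta_node_name(index, width=6):
    """Return the node name for the index-th FASTA sequence, e.g. 123 -> 'zzz123'."""
    digits = str(index)
    if index < 0 or len(digits) > width:
        raise ValueError("index does not fit in the name width")
    return digits.rjust(width, "z")


def select_fasta_nodes(names):
    """Mimic a case-sensitive search for 'z' on the name attribute."""
    # UniProt accessions are upper case, so a lower-case "z" picks out
    # only the user-supplied FASTA sequences.
    return [name for name in names if "z" in name]


print(fasta_node_name(0))
print(fasta_node_name(123))
# "P12345" is a placeholder accession used only for illustration.
print(select_fasta_nodes(["P12345", "zzzzz0", "zzz123"]))
```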

Advanced Options (magenta arrows): By clicking on the Advanced Options tab below the input box, you can enter a “custom” e-value used in the all-by-all BLAST. You also can select a fraction of the sequences in the input Pfam and/or InterPro family(ies) so that you can generate a “representative” network for families with > 250,000 sequences.

E-Value: The all-by-all BLAST used to calculate the edges for the SSN returns a result only if the e-value is ≤ 10⁻⁵. For short sequences, e.g., < 100 residues, this default may be too small to allow an alignment score corresponding to 30% or less to be specified in the Analyze Data step. In these cases, you can select a larger upper limit for the e-value used in the all-by-all BLAST by entering an integer ≤ 5 (the negative log of the e-value); the lower limit for the input is 0. You should first generate the SSN with the 10⁻⁵ default and then inspect the percent identity quartile plot to see if the e-value limit for the all-by-all BLAST should be increased (by entering a smaller integer).

Fraction: This advanced option applies ONLY to the sequences in the Pfam or InterPro family, if one is specified, and not to the sequences in the user-supplied FASTA file. As in Option B, although the number of sequences that can be used to generate a SSN is limited to ≤ 250,000, with this advanced option you can select a fraction of the total number of sequences for larger sequence sets to generate a network. You can enter an integer that specifies the fraction of the sequences that you would like included in the BLAST and graphs (default is 1). The integer represents the divisor by which you wish to fractionate the dataset, e.g., 10 = only every 10th sequence in the total sequence dataset is used. The UniProt dataset is not preorganized, so the sampling is “random”. If the dataset you initially select is too large (you will receive an e-mail if/when this happens), you can select the same dataset and enter an integer fraction with this advanced option to decrease the number of sequences to ≤ 250,000.


After the input has been entered for any of the three options, enter your e-mail address (for data retrieval only; blue arrow), and hit “Go” at the bottom of the screen (green arrow). EFI-EST will assemble the sequence dataset and perform the all-by-all BLAST. The all-by-all BLAST will return alignment scores/edges for those sequence pairs for which the BLAST e-values are at or below an upper limit threshold of 10⁻⁵ (or a different threshold specified in the “Advanced Options”). For most (super)families, the default threshold should provide sufficient internode connections (edges) in the networks that inferences about divergent evolution of protein function are possible.


If you are interested in detailed exploration of sequence-function relationships in (super)families with > 250,000 sequences, please send an e-mail to efi@enzymefunction.org with a brief summary of your interests. We will provide access to the Unix terminal scripts for generating networks on Biocluster at the Institute for Genomic Biology at the University of Illinois, Urbana‑Champaign as well as assistance in using the scripts.

 
