EFI - Enzyme Similarity Tool

A sequence similarity network (SSN) allows researchers to visualize relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters. The Enzyme Similarity Tool (EFI-EST) is a web tool that allows researchers to easily generate SSNs that can be visualized in Cytoscape (3).

The EST database has been updated to use UniProt 2018_10 and InterPro 71.
For users with a login, a selected set of results is available for generating family SSNs.

For any input format, when the protein families (Pfam and/or InterPro) selected for generating an SSN contain more than 25,000 sequences, the SSN is generated using the UniRef90 database. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a single seed sequence (the longest sequence in the group). This reduces computing time and the size of the output SSN without losing information. The output SSN is equivalent to a 90% Representative Node Network in which each node corresponds to a seed sequence, and the node attribute "UniRef90 Cluster IDs" lists all the sequences represented by that node, so any sequence in the input family can be located by searching this attribute. UniRef90 SSNs are compatible with the Color SSN utility and with the EFI-GNT tool: the UniRef90 groups are automatically expanded when needed.
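
As a rough illustration of how the "UniRef90 Cluster IDs" attribute can be used to locate a sequence, the following Python sketch searches a small in-memory node table. The node table layout, the function name `find_node`, and the identifiers `SEED_A`, `MEMBER_1`, and so on are hypothetical placeholders, not real accession IDs or an EFI-EST API.

```python
from typing import Dict, List, Optional

# Hypothetical miniature SSN: each node is a UniRef90 seed sequence, and its
# "UniRef90 Cluster IDs" attribute lists every sequence the node represents.
nodes: Dict[str, Dict[str, List[str]]] = {
    "SEED_A": {"UniRef90 Cluster IDs": ["SEED_A", "MEMBER_1", "MEMBER_2"]},
    "SEED_B": {"UniRef90 Cluster IDs": ["SEED_B"]},
}


def find_node(node_table: Dict[str, Dict[str, List[str]]],
              accession: str) -> Optional[str]:
    """Return the ID of the node whose cluster list contains `accession`."""
    for node_id, attributes in node_table.items():
        if accession in attributes["UniRef90 Cluster IDs"]:
            return node_id
    return None  # the sequence is not represented by any node


print(find_node(nodes, "MEMBER_2"))
```

In Cytoscape, the equivalent operation is a search on the node attribute rather than a script; the sketch only shows why no sequence is lost when nodes are seed sequences.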

A listing of new features and other information pertaining to EST is available on the release notes page.

Information on Pfam families and clans and InterPro family sizes is now available on the Family Information page.

The provided sequence is used as the query for a BLAST search of the UniProt database; the similarities between the retrieved sequences are then calculated and used to generate the SSN. Submit only one protein sequence, without a FASTA header. The default maximum number of retrieved sequences is 1,000.

UniProt BLAST Query E-value: Negative log of e-value for retrieving similar sequences (≥ 1; default: 5)
Maximum BLAST Sequences: Maximum number of sequences retrieved (≤ 10,000; default: 1,000)
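
Because the E-value field takes a negative logarithm, the default of 5 would correspond to an E-value threshold of 10^-5, assuming a base-10 logarithm (the page does not state the base). The sketch below checks the two parameters against the limits listed above; the function names and the lower bound of 1 for the sequence count are assumptions made for illustration.

```python
def blast_evalue_threshold(neg_log_evalue: int = 5) -> float:
    """Convert the form value (negative log of the E-value) to an E-value.

    Assumes a base-10 logarithm; the form requires a value >= 1.
    """
    if neg_log_evalue < 1:
        raise ValueError("negative log of E-value must be >= 1")
    return 10.0 ** -neg_log_evalue


def check_max_sequences(max_sequences: int = 1000) -> int:
    """Validate the maximum number of retrieved sequences (<= 10,000)."""
    # The lower bound of 1 is an assumption; the form only states the upper bound.
    if not 1 <= max_sequences <= 10000:
        raise ValueError("maximum number of sequences must be between 1 and 10,000")
    return max_sequences


print(blast_evalue_threshold())   # threshold for the default value of 5
print(check_max_sequences(1000))  # default sequence limit
```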
If desired, include Pfam and/or InterPro families in the analysis of your sequence. For Pfam families, the format is a comma-separated list of PFxxxxx (five digits); for InterPro families, the format is IPRxxxxxx (six digits); for Pfam clans, the format is CLxxxx (four digits). For Pfam families, InterPro families, and Pfam clans with more than 25,000 sequences, UniRef90 seed sequences are used instead of the full family.
When the sequence has been uploaded and processed, you will receive an e-mail containing a link to analyze the data.

The sequences from the Pfam families, InterPro families, and/or Pfam clans (superfamilies) are retrieved; the similarities between the sequences are then calculated and used to generate the SSN. For Pfam families, the format is a comma-separated list of PFxxxxx (five digits); for InterPro families, the format is IPRxxxxxx (six digits); for Pfam clans, the format is CLxxxx (four digits). Lists of Pfam families, InterPro families, and Pfam clans are included in the release notes.
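
The identifier formats described above can be checked before submitting a job. The following Python sketch classifies each entry of a comma-separated list using regular expressions built directly from the stated formats; the function name `classify_family_ids` is hypothetical, and treating the prefixes as uppercase-only is an assumption.

```python
import re
from typing import List, Optional, Tuple

# Patterns taken from the formats stated above: PF + five digits,
# IPR + six digits, CL + four digits. Uppercase prefixes are assumed.
PATTERNS = {
    "Pfam family": re.compile(r"PF\d{5}"),
    "InterPro family": re.compile(r"IPR\d{6}"),
    "Pfam clan": re.compile(r"CL\d{4}"),
}


def classify_family_ids(text: str) -> List[Tuple[str, Optional[str]]]:
    """Split a comma-separated list and label each ID, or None if malformed."""
    results = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        kind = next(
            (name for name, pattern in PATTERNS.items() if pattern.fullmatch(token)),
            None,
        )
        results.append((token, kind))
    return results


# Example input used only to show the formats; "PF123" is deliberately malformed.
for identifier, kind in classify_family_ids("PF00001, IPR000001, CL0001, PF123"):
    print(identifier, "->", kind or "unrecognized format")
```

A format check of this kind only confirms that an identifier is well formed; whether the family exists in the current database release still has to be verified against the lists in the release notes.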

InterProScan sequence search can be used to find matches within the InterPro database for a given sequence.

The maximum number of retrieved sequences is 200,000. For Pfam families, InterPro families, and Pfam clans with a size greater than 25,000, UniRef90 seed sequences will be utilized instead of the full family.

The similarities between the provided sequences are calculated and used to generate the SSN. Input a list of protein sequences in FASTA format with headers, or upload a FASTA file.
Read FASTA headers: When selected, recognized UniProt or GenBank identifiers from FASTA headers are used to retrieve corresponding node attributes from the UniProt database.
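
When headers are not matched to UniProt, the only node attributes available for a FASTA input are the sequence length and the header text. The sketch below shows how those two attributes can be derived from FASTA-formatted text; it is a minimal parser written for illustration, not the code EFI-EST runs, and the sequences in the example are made up.

```python
from typing import Dict, List, Tuple, Union


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def node_attributes(records: List[Tuple[str, str]]) -> List[Dict[str, Union[str, int]]]:
    """Build the two default attributes: Description and Sequence Length."""
    return [
        {"Description": header, "Sequence Length": len(sequence)}
        for header, sequence in records
    ]


# Made-up sequences, used only to exercise the parser.
example = ">seq1 hypothetical protein\nMKT\nAAL\n>seq2\nMK\n"
for attributes in node_attributes(parse_fasta(example)):
    print(attributes)
```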

If desired, include Pfam and/or InterPro families in the analysis of your FASTA file. For Pfam families, the format is a comma-separated list of PFxxxxx (five digits); for InterPro families, the format is IPRxxxxxx (six digits); for Pfam clans, the format is CLxxxx (four digits). For Pfam families, InterPro families, and Pfam clans with more than 25,000 sequences, UniRef90 seed sequences are used instead of the full family.
Input a list of UniProt, NCBI, or GenBank sequence accession IDs, and/or upload a text file containing the accession IDs.

If desired, include Pfam and/or InterPro families in the analysis of your list of IDs. For Pfam families, the format is a comma-separated list of PFxxxxx (five digits); for InterPro families, the format is IPRxxxxxx (six digits); for Pfam clans, the format is CLxxxx (four digits). For Pfam families, InterPro families, and Pfam clans with more than 25,000 sequences, UniRef90 seed sequences are used instead of the full family.
Color a previously generated SSN and return associated cluster data.
Independent sequence clusters in the uploaded SSN are identified, numbered, and colored. Summary tables, sets of IDs and sequences for specific clusters are provided. A Cytoscape-edited SSN can serve as input for this utility. For all of the new features to work correctly, use SSNs generated by EFI-EST 2.0 (released 8/17/2017).
SSN to color and analyze: an uncompressed or zipped XGMML file.

Overview of possible inputs for EFI-EST

The EFI - Enzyme Similarity Tool (EFI-EST) is a web server for the generation of SSNs. Four options for user-initiated generation of an SSN are available. In addition, a utility to enhance SSN interpretation is available.

  • Option A: Single sequence query. The provided sequence is used as the query for a BLAST search of the UniProt database. The retrieved sequences are used to generate the SSN.

    Option A allows the user to explore local sequence-function space for the query sequence. Homologs are collected and used to generate the SSN. By default, 1,000 sequences are collected as this number often allows a “full” SSN to be generated and viewed with Cytoscape.

  • Option B: Pfam and/or InterPro families; Pfam clans (superfamilies). Defined protein families are used to generate the SSN.

    Option B allows the user to explore sequence-function space from defined protein families. A limit of 200,000 sequences is imposed. Generation of an SSN for more than one family is allowed.

  • Option C: User-supplied FASTA file. An SSN is generated from a set of defined sequences.

    Option C allows the user to generate an SSN for a provided set of FASTA-formatted sequences. By default, the provided sequences cannot be associated with sequences in the UniProt database, and only two node attributes are provided for the SSNs generated: the number of residues as the “Sequence Length”, and the FASTA header as the “Description”.

    An option allows the FASTA headers to be read; if UniProt or NCBI identifiers are recognized, the corresponding UniProt information is presented as node attributes.

  • Option D: List of UniProt and/or NCBI IDs. The SSN is generated after fetching the information from the corresponding databases.

    Option D allows the user to provide a list of UniProt IDs, NCBI IDs, and/or NCBI GI numbers (now “retired”). UniProt IDs are used to retrieve sequences and annotation information from the UniProt database. When recognized, NCBI IDs and GI numbers are used to retrieve the “equivalent” UniProt IDs and information. Sequences with NCBI IDs that cannot be recognized will not be included in the SSN and a “nomatch” file listing these IDs is available for download.

  • Utility for the identification and coloring of independent clusters within an SSN.

    Independent clusters in the uploaded SSN are identified, numbered, and colored. Summary tables, sets of IDs and sequences for specific clusters are provided. A manually edited SSN can serve as input for this utility.
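
The cluster identification performed by the coloring utility can be understood as finding the connected components of the network. The Python sketch below illustrates that idea on a toy graph; it is not the utility's actual implementation. Numbering clusters by decreasing size, the two-color palette, and the function name `color_clusters` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def color_clusters(nodes: Iterable[str],
                   edges: Iterable[Tuple[str, str]],
                   palette: List[str]) -> Dict[str, Tuple[int, str]]:
    """Assign each node a (cluster number, color) pair.

    Clusters are connected components, numbered from 1 by decreasing size
    (an assumed convention); colors cycle through `palette`.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters: List[List[str]] = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            members.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(members))

    clusters.sort(key=lambda members: (-len(members), members))
    assignment = {}
    for number, members in enumerate(clusters, start=1):
        color = palette[(number - 1) % len(palette)]
        for node in members:
            assignment[node] = (number, color)
    return assignment


# Toy network: one three-node cluster, one pair, and one singleton.
result = color_clusters(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
    ["#FF0000", "#0000FF"],
)
for node in sorted(result):
    print(node, result[node])
```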

Please see our recent review in BBA Proteins for examples of EFI-EST use.

UniProt Version: 2018_10
