EFI - Enzyme Similarity Tool

A sequence similarity network (SSN) allows for visualization of relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters. The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs. Cytoscape is used to explore SSNs.

The EFI web tool interface has been updated to improve user experience.
All functions remain unchanged.

The EST database has been updated to use UniProt 2019_06 and InterPro 75.

A listing of new features and other information pertaining to EST is available on the release notes page.

InterProScan sequence search can be used to find matches within the InterPro database for a given sequence.

Information on Pfam families and clans and InterPro family sizes is available on the Family Information page.

Generate a SSN for a single protein and its closest homologues in the UniProt database.

The input sequence is used as the query for a search of the UniProt database using BLAST. Sequences that are similar to the query in UniProt are retrieved. An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Query Sequence:
Input a single protein sequence only. The default maximum number of retrieved sequences is 1,000.

BLAST Retrieval Options

UniProt BLAST query e-value: Negative log of e-value for retrieving similar sequences (≥ 1; default: 5)
Input an alternative e-value for BLAST to retrieve sequences from the UniProt database. We suggest using a larger e-value (smaller negative log) for retrieving homologues if the query sequence is short and a smaller e-value (larger negative log) if there is no need to retrieve divergent homologues.
Maximum number of sequences retrieved: (≤ 10,000, default: 1,000)
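
The e-value option above is entered as a negative logarithm rather than as the e-value itself. The short Python sketch below shows the conversion in both directions. It assumes the logarithm is base 10, so the default of 5 corresponds to an e-value of 10^-5; the function names are illustrative and are not part of EST.

```python
import math

def neg_log_to_evalue(neg_log: float) -> float:
    """Convert the form value (negative log, assumed base 10) to an e-value."""
    if neg_log < 1:
        raise ValueError("the form only accepts values >= 1")
    return 10.0 ** (-neg_log)

def evalue_to_neg_log(evalue: float) -> float:
    """Convert a BLAST e-value to the negative log expected by the form."""
    if evalue <= 0:
        raise ValueError("the e-value must be positive")
    return -math.log10(evalue)

if __name__ == "__main__":
    # A larger negative log means a smaller (stricter) e-value.
    for neg_log in (1, 5, 20):
        print(neg_log, neg_log_to_evalue(neg_log))
```

A short query sequence would use a small value such as 1 or 2 (a permissive e-value), while a search that does not need divergent homologues would use a larger value.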

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to 1/N of the full family size (N ≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.
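
The sampling rule described above can be sketched in Python as follows. This is an illustration only: it assumes the family is available as an ordered list of (accession, is_swissprot) pairs and that sampling starts at the first sequence; the function name and data layout are hypothetical and do not describe how EST stores or samples sequences internally.

```python
def sample_family(sequences, fraction=1):
    """Keep every Nth sequence, plus every Swiss-Prot annotated sequence.

    sequences: list of (accession, is_swissprot) tuples in database order.
    fraction:  N, an integer >= 1 (1 keeps the full family).
    """
    if fraction < 1:
        raise ValueError("fraction must be >= 1")
    kept = []
    for index, (accession, is_swissprot) in enumerate(sequences):
        # Swiss-Prot entries are retained even when they are not an Nth entry.
        if index % fraction == 0 or is_swissprot:
            kept.append(accession)
    return kept

if __name__ == "__main__":
    # Ten made-up accessions; only S4 carries a Swiss-Prot annotation.
    family = [(f"S{i}", i == 4) for i in range(10)]
    print(sample_family(family, fraction=3))
```

With ten sequences and a fraction of 3, four sequences are selected by position and one more is kept because of its Swiss-Prot annotation, which is why the resulting set can be slightly larger than the requested fraction.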

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥ 1; default: 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate a SSN for a protein family.

The sequences from the Pfam families, InterPro families, and/or Pfam clans (superfamilies) input are retrieved. An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Pfam and/or InterPro Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
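
As an illustration of the family input format described above, the following Python sketch splits a comma/space separated list and checks each identifier against the three stated patterns. It assumes uppercase prefixes; the names `FAMILY_PATTERNS` and `parse_families` are hypothetical and are not part of EST.

```python
import re

# Identifier formats as stated in the help text (uppercase prefixes assumed).
FAMILY_PATTERNS = {
    "Pfam family": re.compile(r"PF\d{5}"),
    "InterPro family": re.compile(r"IPR\d{6}"),
    "Pfam clan": re.compile(r"CL\d{4}"),
}

def parse_families(text):
    """Return a list of (identifier, kind) pairs from a comma/space separated list."""
    results = []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue
        for kind, pattern in FAMILY_PATTERNS.items():
            if pattern.fullmatch(token):
                results.append((token, kind))
                break
        else:
            # No pattern matched this token.
            raise ValueError(f"unrecognized family identifier: {token!r}")
    return results

if __name__ == "__main__":
    print(parse_families("PF00001, IPR000001 CL0001"))
```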

Family Domain Boundary Option

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:

Protein Family Option

Fraction: Reduce the number of sequences used to 1/N of the full family size (N ≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥ 1; default: 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate a SSN from provided sequences.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.
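
The idea of turning all-by-all BLAST results into network edges can be sketched in Python as below. This is a simplified illustration, not EST's implementation: it assumes the hits are already available as (query, subject, e-value) tuples, uses -log10 of the e-value as the edge value, and keeps the best value when a pair is reported in both directions. All three choices are assumptions made for the example.

```python
import math

def build_edges(hits, neg_log_threshold=5):
    """Build undirected edges from pairwise hits that pass the e-value threshold.

    hits: iterable of (query, subject, evalue) tuples.
    Returns a dict mapping a sorted (node, node) pair to its edge value.
    """
    max_evalue = 10.0 ** (-neg_log_threshold)
    edges = {}
    for query, subject, evalue in hits:
        # Skip self-hits and hits weaker than the threshold.
        if query == subject or evalue > max_evalue:
            continue
        pair = tuple(sorted((query, subject)))
        # An e-value of 0 cannot be logged; treat it as an unbounded score.
        score = -math.log10(evalue) if evalue > 0 else float("inf")
        edges[pair] = max(score, edges.get(pair, score))
    return edges

if __name__ == "__main__":
    example_hits = [
        ("A", "B", 1e-30),
        ("B", "A", 1e-28),
        ("A", "C", 1e-3),   # weaker than the default threshold, so no edge
        ("A", "A", 0.0),    # self-hit, ignored
    ]
    print(build_edges(example_hits))
```

Lowering `neg_log_threshold` makes the filter more permissive, which is the adjustment suggested for short sequences.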

Input a list of protein sequences in FASTA format or upload a FASTA-formatted sequence file.

Sequences:

When selected, recognized UniProt or GenBank identifiers from FASTA headers are used to retrieve node attributes from the UniProt database.
FASTA File:
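
To illustrate what can be read from FASTA-formatted input, the Python sketch below parses records into a header and a sequence length, and looks for a UniProt-style accession in each header. The accession pattern follows the format published by UniProt; how EST itself recognizes identifiers is not described here, so the matching step and all function names are assumptions for illustration.

```python
import re

# UniProt accession format as documented by UniProt, wrapped in word boundaries.
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def node_attributes(text):
    """Describe each record by header, length, and any UniProt-style accession."""
    rows = []
    for header, sequence in read_fasta(text):
        match = UNIPROT_ACCESSION.search(header)
        rows.append({
            "Description": header,
            "Sequence Length": len(sequence),
            "UniProt accession": match.group(0) if match else None,
        })
    return rows

if __name__ == "__main__":
    example = ">sp|P12345|EXAMPLE_PROT hypothetical header\nMKT\nAAG\n>my_seq_1\nMK\n"
    for row in node_attributes(example):
        print(row)
```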

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to 1/N of the full family size (N ≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥ 1; default: 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate a SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Input a list of UniProt, NCBI, or GenBank (protein) accession IDs, or upload a text file.

Accession IDs:
Accession ID File:

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Accession IDs:
Accession ID File:
Input accession IDs are:

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to 1/N of the full family size (N ≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥ 1; default: 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Clusters in the submitted SSN are identified, numbered and colored. Summary tables, sets of IDs and sequences per cluster are provided.
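
Identifying clusters amounts to finding the connected components of the network. The Python sketch below does this with a breadth-first search over node and edge lists. It is a minimal illustration: numbering clusters from largest to smallest is an assumption, and the function name and input layout are hypothetical rather than taken from the utility.

```python
from collections import defaultdict, deque

def number_clusters(nodes, edges):
    """Assign a cluster number to every node of an undirected network.

    nodes: iterable of node IDs; edges: iterable of (node, node) pairs.
    Clusters are numbered from 1, largest first (ties broken by member IDs).
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(members))

    clusters.sort(key=lambda members: (-len(members), members))
    return {
        node: number
        for number, members in enumerate(clusters, start=1)
        for node in members
    }

if __name__ == "__main__":
    example_nodes = ["A", "B", "C", "D", "E", "F"]
    example_edges = [("A", "B"), ("B", "C"), ("D", "E")]
    print(number_clusters(example_nodes, example_edges))
```

A node with no edges (a singleton) forms its own cluster in this sketch.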

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

Overview of possible inputs for EFI-EST

The EFI - Enzyme Similarity Tool (EFI-EST) is a service for the generation of SSNs. Four options are available to generate SSNs. A utility to enhance SSN interpretation is also available.

  • Sequence BLAST (Option A): Single sequence query. The provided sequence is used as the query for a BLAST search of the UniProt database. The retrieved sequences are used to generate the SSN.

    Option A allows exploration of local sequence-function space for the query sequence. By default, 1,000 sequences are collected. This allows a small "full" SSN to be generated and viewed with Cytoscape. This option is intended for local, high-resolution SSNs.

  • Families (Option B): Pfam and/or InterPro families; Pfam clans (superfamilies). Defined protein families are used to generate the SSN.

    Option B allows exploration of sequence-function space from defined protein families. A limit of 100,000 sequences is imposed. Generation of a SSN for more than one family is allowed. Using UniRef90 and UniRef50 databases allows the creation of SSNs for very large Pfam and/or InterPro families, but at lower resolution.

  • FASTA (Option C): User-supplied FASTA file. A SSN is generated from a set of defined sequences.

    Option C allows generation of a SSN for a provided set of FASTA-formatted sequences. By default, EST cannot associate the provided sequences with sequences in the UniProt database, and only two node attributes are provided for the SSNs generated: the number of residues as the "Sequence Length", and the FASTA header as the "Description". An option allows the FASTA headers to be read; if UniProt or NCBI identifiers are recognized, the corresponding UniProt information will be presented as node attributes.

  • Accession IDs (Option D): List of UniProt and/or NCBI IDs. The SSN is generated after fetching the information from the corresponding databases.

    Option D allows for a list of UniProt IDs, NCBI IDs, and/or NCBI GI numbers (now "retired"). UniProt IDs are used to retrieve sequences and annotation information from the UniProt database. When recognized, NCBI IDs and GI numbers are used to retrieve the "equivalent" UniProt IDs and information. Sequences with NCBI IDs that cannot be recognized will not be included in the SSN and a "no match" file listing these IDs is available for download.

  • Color SSNs: Utility for the identification and coloring of independent clusters within a SSN.

    Independent clusters in the uploaded SSN are identified, numbered and colored. Summary tables, sets of IDs and sequences per cluster are provided. A Cytoscape-edited SSN can serve as input for this utility.

Recommended Reading

R Zallot, NO Oberg, JA Gerlt. 'Democratized' genomic enzymology web tools for functional assignment. Current Opinion in Chemical Biology, 2018. Elsevier.

JA Gerlt. Genomic enzymology: Web tools for leveraging protein family sequence–function space and genome context to discover novel functions. Biochemistry, 2017. ACS Publications.

Gerlt JA, Bouvier JT, Davidson DB, Imker HJ, Sadkhin B, Slater DR, Whalen KL. Enzyme function initiative-enzyme similarity tool (EFI-EST): A web tool for generating protein sequence similarity networks. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 2015. Elsevier.

UniProt Version: 2019_06
InterPro Version: 75

If you use the EFI web tools, please cite us.
