EFI - Enzyme Similarity Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

A sequence similarity network (SSN) allows for visualization of relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters. The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs. Cytoscape is used to explore SSNs.

As the UniProt database increases in size, selecting sequences from specific taxonomic categories may be useful for generating SSNs. Use the "Filter by Taxonomy" option to specify taxonomic categories in the input sequences. The "Taxonomy" tool (top of page) provides a preview of the taxonomic distribution of user-specified sequences.

Also, the upper limit on the number of sequences from the UniProt and UniRef90 databases for EFI-EST has been increased from 25,000 to 50,000 to allow the generation of SSNs for larger families. Visualization of these with Cytoscape will require more RAM, but access to increased amounts of sequence-function space may be useful. The time required to generate an SSN increases with the square of the number of sequences.
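The quadratic scaling mentioned above follows from the number of sequence pairs an all-by-all comparison has to consider. The sketch below is a back-of-the-envelope illustration only; the function name `pair_count` is hypothetical and is not part of EFI-EST.

```python
def pair_count(n: int) -> int:
    """Number of unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2


# Doubling the number of sequences roughly quadruples the number of pairs.
for n in (25_000, 50_000):
    print(f"{n:>6} sequences -> {pair_count(n):,} pairs")
```

Going from 25,000 to 50,000 sequences therefore means roughly four times as many pairwise comparisons, not twice as many.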

A listing of new features and other information pertaining to EST is available on the release notes page.

InterProScan sequence search can be used to find matches within the InterPro database for a given sequence.

Information on Pfam families and clans and InterPro family sizes is available on the Family Information page.

The EST database uses UniProt 2022_02 and InterPro 89.

Generate an SSN for a single protein and its closest homologues in the UniProt database.

The input sequence is used as the query for a search of the UniProt database using BLAST. Sequences that are similar to the query in UniProt are retrieved. An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Query Sequence:
Input a single protein sequence only. The default maximum number of retrieved sequences is 1,000.

Filter by Taxonomy

Sequences retrieved from the UniProt, UniRef90, and UniRef50 databases can be restricted to those that match specified taxonomic categories (superkingdom, kingdom, phylum, class, order, family, genus, species). Multiple conditions are combined as a union: a sequence is retained if it matches any one of them.

Alternatively, the user can select "Bacteria, Archaea, Fungi" from the "Preselected conditions" to restrict the retrieved sequences to those from organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences from the UniRef90 database include 1) UniRef90 clusters for which the cluster ID matches the specified taxonomic categories and 2) UniRef90 clusters for which the cluster ID does not match the specified taxonomic categories but UniProt IDs within the cluster do match.

The retrieved sequences from the UniRef50 database include 1) UniRef50 clusters for which the cluster ID matches the specified taxonomic categories and 2) UniRef50 clusters for which the cluster ID does not match the specified taxonomic categories but UniRef90 cluster IDs within the cluster do match.

UniRef90 clusters contain sequences that share ≥90% sequence identity so usually are taxonomically homogeneous. However, UniRef50 clusters contain sequences that share ≥50% sequence identity so often are taxonomically heterogeneous. When possible (determined by the RAM available to Cytoscape), users should generate taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.
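The filtering rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EFI-EST implementation: the `Cluster` class, the `matches_any` and `keep_cluster` functions, and the representation of a lineage as a rank-to-name dictionary are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: str
    lineage: dict  # taxonomy of the sequence that serves as the cluster ID
    member_lineages: list = field(default_factory=list)  # taxonomy of member IDs


def matches_any(lineage, conditions):
    """Union semantics: True if at least one (rank, name) condition matches."""
    return any(lineage.get(rank) == name for rank, name in conditions)


def keep_cluster(cluster, conditions):
    """Keep a cluster if its cluster ID matches, or if any member ID matches."""
    if matches_any(cluster.lineage, conditions):
        return True
    return any(matches_any(m, conditions) for m in cluster.member_lineages)


conditions = [("superkingdom", "Bacteria"), ("kingdom", "Fungi")]
mixed = Cluster(
    "hypothetical_cluster_1",
    {"superkingdom": "Eukaryota", "kingdom": "Metazoa"},
    [{"superkingdom": "Eukaryota", "kingdom": "Fungi"}],
)
print(keep_cluster(mixed, conditions))  # kept because a member matches
```

The second rule is why a taxonomically heterogeneous UniRef50 cluster can appear in a filtered SSN even though its cluster ID comes from an organism outside the requested categories.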

Preselected conditions:

BLAST Retrieval Options

UniProt BLAST query e-value: Negative log of e-value for retrieving similar sequences (≥ 1; default: 5)
Input an alternative e-value for BLAST to retrieve sequences from the UniProt database. We suggest using a larger e-value (smaller negative log) for retrieving homologues if the query sequence is short and a smaller e-value (larger negative log) if there is no need to retrieve divergent homologues.
Maximum number of sequences retrieved: (≤ 10,000, default: 1,000)
Sequence database: (UniProt, UniRef90, or UniRef50; default UniProt)
Select the sequence database to BLAST against.
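The e-value fields on this page take the negative log of the e-value rather than the e-value itself. The sketch below shows the conversion, assuming a base-10 logarithm; the function names are hypothetical.

```python
import math


def evalue_from_neg_log(neg_log: float) -> float:
    """Convert the form value (negative log10) to a BLAST e-value."""
    return 10.0 ** (-neg_log)


def neg_log_from_evalue(evalue: float) -> float:
    """Convert a BLAST e-value to the value entered in the form."""
    return -math.log10(evalue)


# The default of 5 corresponds to an e-value of about 1e-5.
# A smaller form value (e.g. 3) is a larger, more permissive e-value.
for value in (3, 5, 20):
    print(value, evalue_from_neg_log(value))
```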

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 50,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.
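The Fraction option can be pictured as the sampling sketched below. This is an illustration under stated assumptions rather than the EFI-EST code: the function name `sample_family`, the `(accession, is_swissprot)` tuple layout, and the choice to start counting at the first sequence are all hypothetical.

```python
def sample_family(sequences, n):
    """Keep every n-th sequence, plus every Swiss-Prot annotated sequence.

    sequences: list of (accession, is_swissprot) tuples in database order.
    n: the Fraction value (an integer >= 1).
    """
    if n < 1:
        raise ValueError("Fraction must be >= 1")
    return [
        accession
        for index, (accession, is_swissprot) in enumerate(sequences)
        if index % n == 0 or is_swissprot
    ]


family = [("A", False), ("B", True), ("C", False), ("D", False), ("E", False)]
print(sample_family(family, 2))  # ['A', 'B', 'C', 'E']
```

In this example a Fraction of 2 would nominally keep three of five sequences, but the Swiss-Prot entry "B" is retained as well, which is why the resulting data set can be slightly larger than the specified fraction.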

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN for a protein family.

The sequences from the Pfam families, InterPro families, and/or Pfam clans (superfamilies) input are retrieved. An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Pfam and/or InterPro Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 50,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
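The family identifier formats described above (PF plus five digits, IPR plus six digits, CL plus four digits) can be checked before submission with a short script. This is a minimal sketch; the function name `parse_families` is hypothetical, and the identifiers in the example only illustrate the format.

```python
import re

# PFxxxxx (Pfam family), IPRxxxxxx (InterPro family), CLxxxx (Pfam clan)
FAMILY_ID = re.compile(r"^(PF\d{5}|IPR\d{6}|CL\d{4})$")


def parse_families(text):
    """Split a comma/space separated list and validate each identifier."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    invalid = [t for t in tokens if not FAMILY_ID.match(t)]
    if invalid:
        raise ValueError(f"Unrecognized family identifiers: {invalid}")
    return tokens


print(parse_families("PF00001, IPR000001 CL0001"))
```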

Filter by Taxonomy

Sequences retrieved from the UniProt, UniRef90, and UniRef50 databases can be restricted to those that match specified taxonomic categories (superkingdom, kingdom, phylum, class, order, family, genus, species). Multiple conditions are combined as a union: a sequence is retained if it matches any one of them.

Alternatively, the user can select "Bacteria, Archaea, Fungi" from the "Preselected conditions" to restrict the retrieved sequences to those from organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences from the UniRef90 database include 1) UniRef90 clusters for which the cluster ID matches the specified taxonomic categories and 2) UniRef90 clusters for which the cluster ID does not match the specified taxonomic categories but UniProt IDs within the cluster do match.

The retrieved sequences from the UniRef50 database include 1) UniRef50 clusters for which the cluster ID matches the specified taxonomic categories and 2) UniRef50 clusters for which the cluster ID does not match the specified taxonomic categories but UniRef90 cluster IDs within the cluster do match.

UniRef90 clusters contain sequences that share ≥90% sequence identity so usually are taxonomically homogeneous. However, UniRef50 clusters contain sequences that share ≥50% sequence identity so often are taxonomically heterogeneous. When possible (determined by the RAM available to Cytoscape), users should generate taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.

Preselected conditions:

Family Domain Boundary Option

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Region:
N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.
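The three Region choices can be sketched as slices of the full-length sequence. The sketch assumes 1-based, inclusive domain coordinates; the function name `select_region` and the example sequence and boundaries are hypothetical.

```python
def select_region(sequence, domain_start, domain_end, region="domain"):
    """Return the part of the sequence selected by the Region option.

    domain_start and domain_end are 1-based, inclusive positions.
    """
    if region == "domain":
        return sequence[domain_start - 1:domain_end]
    if region == "n-terminal":
        return sequence[:domain_start - 1]
    if region == "c-terminal":
        return sequence[domain_end:]
    raise ValueError(f"Unknown region: {region}")


seq = "MKTAYIAKQR"
print(select_region(seq, 4, 7, "domain"))      # AYIA
print(select_region(seq, 4, 7, "n-terminal"))  # MKT
print(select_region(seq, 4, 7, "c-terminal"))  # KQR
```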

Protein Family Option

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN from provided sequences.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Input a list of protein sequences in FASTA format or upload a FASTA-formatted sequence file.

Sequences:

When selected, recognized UniProt or GenBank identifiers from FASTA headers are used to retrieve node attributes from the UniProt database.
FASTA File:

Filter by Taxonomy

Input sequences (by default, from the UniProt database) can be restricted to those that match specified taxonomic categories (superkingdom, kingdom, phylum, class, order, family, genus, species). Multiple conditions are combined as a union: a sequence is retained if it matches any one of them.

Alternatively, the user can select "Bacteria, Archaea, Fungi" from the "Preselected conditions" to restrict the retrieved sequences to those from organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

Preselected conditions:

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 50,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Input a list of UniProt, NCBI, or GenBank (protein) accession IDs, or upload a text file.

Accession IDs:
Accession ID File:

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Accession IDs:
Accession ID File:
Input accession IDs are:

Filter by Taxonomy

Input sequences from the UniProt, UniRef90, and UniRef50 databases can be restricted to those that match specified taxonomic categories (superkingdom, kingdom, phylum, class, order, family, genus, species). Multiple conditions are combined as a union: a sequence is retained if it matches any one of them.

Alternatively, the user can select "Bacteria, Archaea, Fungi" from the "Preselected conditions" to restrict the retrieved sequences to those from organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences from the UniRef90 database include 1) UniRef90 clusters for which the cluster ID matches the specified taxonomic categories and 2) UniRef90 clusters for which the cluster ID does not match the specified taxonomic categories but UniProt IDs within the cluster do match.

The retrieved sequences from the UniRef50 database include 1) UniRef50 clusters for which the cluster ID matches the specified taxonomic categories and 2) UniRef50 clusters for which the cluster ID does not match the specified taxonomic categories but UniRef90 cluster IDs within the cluster do match.

UniRef90 clusters contain sequences that share ≥90% sequence identity so usually are taxonomically homogeneous. However, UniRef50 clusters contain sequences that share ≥50% sequence identity so often are taxonomically heterogeneous. When possible (determined by the RAM available to Cytoscape), users should generate taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.

Preselected conditions:

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).
Region:
A specified InterPro family must be defined by a single database. N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 50,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Clusters in the submitted SSN are identified, numbered and colored. Summary tables, sets of IDs and sequences per cluster are provided.

The clusters are numbered and colored using two conventions: 1) Sequence Count Cluster Number assigned in order of decreasing number of UniProt IDs in the cluster; 2) Node Count Cluster Number assigned in order of decreasing number of nodes in the cluster.
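The two numbering conventions can differ when nodes represent more than one sequence (for example, UniRef nodes). The sketch below illustrates this with hypothetical cluster labels and counts; the function name `number_clusters` and the handling of ties (input order is kept) are assumptions.

```python
def number_clusters(clusters):
    """Assign cluster numbers under both conventions.

    clusters: dict mapping a cluster label to (node_count, sequence_count).
    Returns (sequence_count_numbers, node_count_numbers), each a dict
    mapping the cluster label to its number, starting at 1.
    """
    by_sequences = sorted(clusters, key=lambda c: clusters[c][1], reverse=True)
    by_nodes = sorted(clusters, key=lambda c: clusters[c][0], reverse=True)
    seq_numbers = {label: i for i, label in enumerate(by_sequences, start=1)}
    node_numbers = {label: i for i, label in enumerate(by_nodes, start=1)}
    return seq_numbers, node_numbers


clusters = {"x": (10, 50), "y": (20, 30), "z": (5, 5)}
seq_numbers, node_numbers = number_clusters(clusters)
print(seq_numbers)   # {'x': 1, 'y': 2, 'z': 3}
print(node_numbers)  # {'y': 1, 'x': 2, 'z': 3}
```

Here cluster "x" has fewer nodes than "y" but represents more UniProt IDs, so it is number 1 by sequence count and number 2 by node count.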

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

As in the Color SSN utility, clusters in the submitted SSN are identified, numbered and colored.

The SSN clusters are numbered and colored using two conventions: Sequence Count Cluster Numbers are assigned in order of decreasing number of UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of decreasing number of nodes in the cluster.

The convergence ratio for each cluster is also calculated. The convergence ratio is the ratio of the number of edges in a cluster to the number of sequence pairs in that cluster. The value decreases from 1.0 for a cluster of very similar sequences (same function?) to <<1.0 for clusters of distantly related sequences (different functions?).
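As a worked example of the convergence ratio, the sketch below divides the number of edges in a cluster by the number of possible pairs. It assumes that each node stands for a single sequence, so the number of pairs is n(n-1)/2 for n nodes; the function name `convergence_ratio` is hypothetical.

```python
def convergence_ratio(num_nodes: int, num_edges: int) -> float:
    """Edges present in a cluster divided by the number of possible pairs."""
    if num_nodes < 2:
        raise ValueError("A cluster needs at least two nodes")
    pairs = num_nodes * (num_nodes - 1) // 2
    return num_edges / pairs


print(convergence_ratio(4, 6))  # 1.0: every pair is connected
print(convergence_ratio(4, 3))  # 0.5: half of the pairs are connected
```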

Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs), length histograms, and consensus residues are computed for each cluster.

Options are available in the tabs below to select the desired analyses:

The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster containing greater than the "Minimum Node Count" specified in the Sequence Filter tab. The percent identity matrix for the MSA is also provided on this tab.

The Consensus Residues tab provides a tab-delimited text file with the number of the conserved residues and their MSA positions for each specified residue in each cluster containing greater than the "Minimum Node Count". Note the default residue is "C" and the percent identity levels that are displayed are from 90 to 10% in intervals of 10%; a residue is counted as "conserved" if it occurs with ≥80% identity.

The HMMs tab provides the HMM for each cluster containing greater than the specified "Minimum Node Count".

The Length Histograms tab provides length histograms for each cluster containing greater than the specified "Minimum Node Count".

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
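XGMML is an XML format, so a file can be inspected before upload with a general-purpose XML parser. The sketch below uses Python's standard library and assumes the usual XGMML layout of `node` elements with an `id` attribute and `edge` elements with `source` and `target` attributes; the sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}node' down to 'node'."""
    return tag.rsplit("}", 1)[-1]


def read_xgmml(xml_text: str):
    """Return (node_ids, edges) from an XGMML document given as a string."""
    root = ET.fromstring(xml_text)
    nodes = [el.get("id") for el in root.iter() if local_name(el.tag) == "node"]
    edges = [
        (el.get("source"), el.get("target"))
        for el in root.iter()
        if local_name(el.tag) == "edge"
    ]
    return nodes, edges


SAMPLE = """<graph label="demo" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="n1" label="n1"/>
  <node id="n2" label="n2"/>
  <node id="n3" label="n3"/>
  <edge source="n1" target="n2" label="n1,n2"/>
</graph>"""

nodes, edges = read_xgmml(SAMPLE)
print(len(nodes), "nodes;", len(edges), "edges")
```

For a real SSN, the file contents would be read from disk (or from the zip archive) and passed to `read_xgmml` in place of `SAMPLE`; large networks may need a streaming parser instead.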

Sequence Filter

The MSA is generated with MUSCLE using the node IDs. Clusters containing fewer nodes than the Minimum Node Count will be excluded from the analyses. Since MUSCLE can fail with a "large" number of sequences (variable; anywhere from >750 to 1500), the Maximum Node Count parameter can be used to limit the number of sequences that MUSCLE uses.
Minimum Node Count:
Maximum Node Count:

WebLogos

A MSA for the (length-filtered) node IDs is generated using MUSCLE; the WebLogo is generated with the http://weblogo.threeplusone.com code.
Make Weblogo:

Consensus Residues

The positions and selected percent identities of the selected residues in the MSA are determined.
Compute Consensus Residues:
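The idea behind the consensus residue count can be sketched as follows: for each column of the MSA, compute the fraction of sequences that carry the selected residue and report the columns at or above a threshold. This is an illustration only; the function name `conserved_positions`, the toy alignment, and the choice to count gapped sequences in the denominator are assumptions and may differ from what the tool does.

```python
def conserved_positions(msa, residue="C", threshold=0.8):
    """Return 1-based MSA columns where `residue` occurs in at least
    `threshold` of the aligned sequences."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("All aligned sequences must have the same length")
    total = len(msa)
    return [
        col + 1
        for col in range(length)
        if sum(row[col] == residue for row in msa) / total >= threshold
    ]


msa = ["ACDC-", "ACEC-", "GCDCA", "ACDAA", "ACDCC"]
print(conserved_positions(msa, "C", 0.8))  # [2, 4]
print(conserved_positions(msa, "C", 0.9))  # [2]
```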

HMMs

The MSA for the (length-filtered) node IDs is used to generate the HMM with hmmbuild from HMMER3 (http://hmmer.org).
Make HMMs:

Length Histograms

Length histograms are generated for the node IDs (where applicable, UniProt, UniRef90, and UniRef50 IDs).
Make Length Histograms:
E-mail address:

You will be notified by e-mail when your submission has been processed.

Nodes in the submitted SSN are colored according to neighborhood connectivity (number of edges to other nodes).

The nodes for unresolved families can be difficult to identify in SSNs generated with low alignment scores. Coloring the nodes according to the number of edges to other nodes (Neighborhood Connectivity, NC) helps identify families with highly connected nodes (https://doi.org/10.1016/j.heliyon.2020.e05867). Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold can be chosen in Cytoscape to separate the SSN into families.

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

Convergence ratio is calculated per cluster.

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).

Alignment Score

Alignment Score: The alignment score to calculate convergence ratio per cluster (should be the same as the original SSN alignment score).
E-mail address:

You will be notified by e-mail when your submission has been processed.

Retrieve taxonomy for families.

The sequences from the Pfam families, InterPro families, and/or Pfam clans (superfamilies) input are retrieved. The taxonomic distribution of the sequences is displayed as a "sunburst" with the levels of classification (superkingdom, kingdom, phylum, class, order, family, genus, species) displayed radially, with superkingdom at the center and species in the outermost ring.

The sunburst is interactive, providing the ability to zoom to the selected taxonomic level.

The UniProt/UniRef90/UniRef50 IDs and/or UniProt/UniRef90/UniRef50 FASTA-formatted sequences can be downloaded.

This preview of the taxonomic distribution also can be used to guide taxonomic restrictions in the "Filter by Taxonomy" option of Option B.

Pfam and/or InterPro Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

Filter by Taxonomy

Taxonomic conditions can be set to restrict the set of sequences to those that match the specified taxonomic categories. Multiple conditions are combined as a union: a sequence is retained if it matches any one of them.
Preselected conditions:

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

UniProt Version: 2022_02
InterPro Version: 89

Click here to contact us for help, reporting issues, or suggestions.