EFI - Enzyme Similarity Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

A sequence similarity network (SSN) allows for visualization of relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters. The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs. Cytoscape is used to explore SSNs.

The Taxonomy Tool allows the preview of the taxonomic distribution of sequences in input datasets, and includes the Filter by Taxonomy feature to restrict sequences in input datasets to specific taxonomy categories. Results can be transferred to EFI-EST to generate an SSN.

The EFI-EST options now provide Filter by Taxonomy and Filter by Family, in the initial step and in the Analysis step (SSN node filtering). This filtering enables/simplifies generation of higher resolution SSNs for specific regions of sequence-function space in families.

A manuscript describing these features has been submitted (preprint available here).

A listing of new features and other information pertaining to EST is available on the release notes page.

InterProScan sequence search can be used to find matches within the InterPro database for a given sequence.

Information on Pfam families and clans and InterPro family sizes is available on the Family Information page.

The EST database uses UniProt 2022_04 and InterPro 91.



Generate an SSN for a single protein and its closest homologues in the UniProt, UniRef90, or UniRef50 database.

The input sequence is used as the query for a search of the UniProt, UniRef90, or UniRef50 database using BLAST. For the UniRef90 and UniRef50 databases, the sequence of the cluster ID (representative sequence) is used for the BLAST.

The database is selected using the BLAST Retrieval Options.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs; these similarities are used to calculate the edge values for the SSN.

Query Sequence:
Input a single protein sequence only. The default maximum number of retrieved sequences is 1,000.

BLAST Retrieval Options

UniProt BLAST query e-value: Negative log of e-value for retrieving similar sequences (≥ 1; default: 5)
Input a larger e-value (smaller negative log) to retrieve homologues if the query sequence is short. Input a smaller e-value (larger negative log) to retrieve more similar homologues.
Maximum number of sequences retrieved: (≤ 10,000, default: 1,000)
Sequence database: (UniProt, UniRef90, or UniRef50; default UniProt)
Select the sequence database to BLAST against.
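
As an illustration of the negative-log convention used for the e-value fields, the sketch below converts a form value such as 5 into the corresponding e-value. It assumes the logarithm is base 10 (the page only says "negative log"), and the function name is hypothetical.

```python
def evalue_from_negative_log(neg_log: int) -> float:
    """Convert a negative-log setting (e.g. 5) into an e-value (1e-05).

    Base 10 is an assumption; the form only says "negative log".
    """
    if neg_log < 1:
        raise ValueError("the form requires a value of at least 1")
    # Parsing the literal avoids floating-point rounding from 10 ** -n.
    return float(f"1e-{neg_log}")


if __name__ == "__main__":
    # A smaller negative log means a larger (more permissive) e-value.
    for setting in (3, 5, 20):
        print(setting, evalue_from_negative_log(setting))
```

Under this assumption, lowering the setting from 5 to 3 raises the e-value from 1e-05 to 1e-03, which is the direction suggested above for short query sequences.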

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.
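
The two fragment rules above can be sketched as a small filter. This is only an illustration under stated assumptions: clusters are represented as a dictionary from cluster ID to member UniProt IDs, the set of fragment IDs is already known, and all names and IDs are hypothetical.

```python
from typing import Dict, List, Set


def drop_fragments(
    clusters: Dict[str, List[str]], fragments: Set[str]
) -> Dict[str, List[str]]:
    """Apply the two fragment rules to UniRef-style clusters.

    clusters maps a cluster ID (the representative sequence) to the
    UniProt IDs in that cluster; fragments holds IDs flagged as Fragment.
    """
    kept: Dict[str, List[str]] = {}
    for cluster_id, members in clusters.items():
        # Rule 1: exclude the whole cluster if its cluster ID is a fragment.
        if cluster_id in fragments:
            continue
        # Rule 2: remove fragment members from the remaining clusters.
        kept[cluster_id] = [m for m in members if m not in fragments]
    return kept


if __name__ == "__main__":
    example = {"A1": ["A1", "A2", "A3"], "B1": ["B1", "B2"]}
    print(drop_fragments(example, {"A2", "B1"}))
```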

Filter by Taxonomy

A taxonomy filter is applied to the list of UniProt, UniRef90, or UniRef50 cluster IDs retrieved by the BLAST.

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the retrieved sequences to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a sequence is kept if it matches any one of them.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified taxonomy categories. The UniProt members in these clusters that do not match the specified taxonomy categories are removed from the cluster.
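
The union behaviour of multiple taxonomy conditions can be illustrated with a minimal sketch. The record layout (a dictionary of rank to name per accession), the function names, and the example accessions are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

Condition = Tuple[str, str]  # (rank, name), e.g. ("Kingdom", "Fungi")


def matches_any(lineage: Dict[str, str], conditions: Iterable[Condition]) -> bool:
    """Return True if the lineage satisfies at least one condition (union)."""
    return any(lineage.get(rank) == name for rank, name in conditions)


def filter_by_taxonomy(
    records: Dict[str, Dict[str, str]], conditions: List[Condition]
) -> List[str]:
    """Keep the accessions whose lineage matches any of the conditions."""
    if not conditions:
        # No filter selected: keep everything (an assumption of this sketch).
        return list(records)
    return [acc for acc, lineage in records.items() if matches_any(lineage, conditions)]


if __name__ == "__main__":
    records = {
        "X1": {"Superkingdom": "Bacteria"},
        "X2": {"Superkingdom": "Eukaryota", "Kingdom": "Fungi"},
        "X3": {"Superkingdom": "Eukaryota", "Kingdom": "Metazoa"},
    }
    print(filter_by_taxonomy(records, [("Superkingdom", "Bacteria"), ("Kingdom", "Fungi")]))
```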

Preselected conditions:

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN for a protein family.

The members of the input Pfam families, InterPro families, and/or Pfam clans are selected from the UniProt, UniRef90, or UniRef50 database.

Pfam and/or InterPro Families and/or Pfam clans:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
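
The accepted identifier formats can be checked with a short sketch before a job is submitted. The regular expression follows the formats stated above; the function name and the choice to reject (rather than skip) invalid entries are assumptions.

```python
import re
from typing import List

# PFxxxxx (five digits), IPRxxxxxx (six digits), CLxxxx (four digits).
FAMILY_PATTERN = re.compile(r"(PF[0-9]{5}|IPR[0-9]{6}|CL[0-9]{4})")


def parse_family_list(text: str) -> List[str]:
    """Split a comma/space separated list and validate each family ID."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    invalid = [t for t in tokens if not FAMILY_PATTERN.fullmatch(t)]
    if invalid:
        raise ValueError(f"not a Pfam, InterPro, or clan ID: {invalid}")
    return tokens


if __name__ == "__main__":
    print(parse_family_list("PF00001, IPR000001 CL0001"))
```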

UniRef90 clusters contain UniProt IDs that share ≥90% sequence identity and have 80% overlap with the longest sequence in the cluster ("seed sequence"); as a result, the UniProt IDs in the cluster usually are functionally homogeneous, i.e., orthologues. UniRef50 clusters contain UniProt IDs that share ≥50% sequence identity and have 80% overlap with the seed sequence; as a result, the UniProt IDs in the cluster often are functionally heterogeneous, e.g., paralogues.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified families. The UniProt members in these UniRef90 and UniRef50 clusters that do not match the specified families are removed from the cluster.

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Taxonomy

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the retrieved sequences to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a sequence is kept if it matches any one of them.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified taxonomy categories. The UniProt members in these clusters that do not match the specified taxonomy categories are removed from the cluster.

Preselected conditions:

Protein Family Size Options

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.
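
The every-Nth selection with retained SwissProt entries can be sketched as follows. This is an illustration only: the starting offset of the selection is not stated above, so the sketch assumes it starts with the first sequence, and the input layout and names are hypothetical.

```python
from typing import List, Tuple


def sample_family(entries: List[Tuple[str, bool]], fraction: int) -> List[str]:
    """Keep every Nth entry plus all SwissProt-annotated entries.

    entries is a list of (accession, is_swissprot) pairs in database order;
    fraction is N (1 keeps the full family).
    """
    if fraction < 1:
        raise ValueError("fraction must be at least 1")
    kept = []
    for index, (accession, is_swissprot) in enumerate(entries):
        # SwissProt entries are always included, so the result can be
        # slightly larger than len(entries) / fraction.
        if is_swissprot or index % fraction == 0:
            kept.append(accession)
    return kept


if __name__ == "__main__":
    family = [(f"ID{i}", i == 3) for i in range(10)]
    print(sample_family(family, 5))
```

With ten entries, a fraction of 5, and one SwissProt entry outside the sampled positions, three sequences are kept rather than two, which matches the note above about the data set being slightly larger than the fraction specified.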

Family Domain Boundary Option

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Region:
N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.
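
The three Region choices can be illustrated with a minimal sketch. It assumes domain boundaries are given as 1-based, inclusive residue positions; the function name and the lowercase region labels are hypothetical.

```python
def select_region(sequence: str, start: int, end: int, region: str) -> str:
    """Return the part of a sequence selected by the Region option.

    start and end are assumed to be 1-based, inclusive domain boundaries.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("domain boundaries fall outside the sequence")
    if region == "domain":
        return sequence[start - 1:end]
    if region == "n-terminal":
        # Everything before the first residue of the domain.
        return sequence[:start - 1]
    if region == "c-terminal":
        # Everything after the last residue of the domain.
        return sequence[end:]
    raise ValueError(f"unknown region: {region}")


if __name__ == "__main__":
    for choice in ("n-terminal", "domain", "c-terminal"):
        print(choice, select_region("ABCDEFGHIJ", 4, 6, choice))
```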

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN from FASTA-formatted UniProt sequences.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs; these similarities are used to calculate the edge values for the SSN.

Input a list of sequences in the FASTA format or upload a FASTA-formatted sequence file.

Two options are available for generating the SSN:

1) The sequences are used "as is", with the node attributes including only the information in the header as the description and the number of residues in the sequence.

2) The ID in the header that immediately follows the ">" is used to retrieve node attribute information. Acceptable IDs include UniProt IDs, PDB IDs, and NCBI GenBank IDs that have equivalent entries in the UniProt database. To use this option, check the "Read FASTA headers" box.
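
To show what "the ID in the header that immediately follows the '>'" could look like in practice, the sketch below splits FASTA text into ID, description, and sequence. It assumes the ID ends at the first whitespace in the header; how the tool itself delimits the ID is not stated here, and the function names and accessions are placeholders.

```python
from typing import List, Tuple


def _finish(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    """Split a header into (ID, description) and join the sequence lines."""
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


def read_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Parse FASTA text into (ID, description, sequence) records."""
    records = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records


if __name__ == "__main__":
    print(read_fasta(">P12345 example protein\nMKT\nAAA\n>Q67890\nGGG\n"))
```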

Sequences:

FASTA File:

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Family

Input a list of Pfam families, InterPro families, and/or Pfam clans to restrict the UniProt and/or UniRef IDs in the SSN to these families.
Family(s):
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

Filter by Taxonomy

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the input UniProt sequences to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The input UniProt sequences also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a sequence is kept if it matches any one of them.

Preselected conditions:

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

UniRef90 clusters contain UniProt IDs that share ≥90% sequence identity and have 80% overlap with the longest sequence in the cluster ("seed sequence"); as a result, the UniProt IDs in the cluster usually are functionally homogeneous, i.e., orthologues. UniRef50 clusters contain UniProt IDs that share ≥50% sequence identity and have 80% overlap with the seed sequence; as a result, the UniProt IDs in the cluster often are functionally heterogeneous, e.g., paralogues.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified families. The UniProt members in these UniRef90 and UniRef50 clusters that do not match the specified families are removed from the cluster.

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).
Region:
A specified InterPro family must be defined by a single database. N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs; these similarities are used to calculate the edge values for the SSN.

Input a list of UniProt, NCBI, or GenBank (protein) accession IDs, or upload a text file.

Accession IDs:
Accession ID File:

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Accession IDs:
Accession ID File:
Input accession IDs are:

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Family

The input list of UniRef90 or UniRef50 cluster IDs must be filtered with the same list of Pfam families, InterPro families, and/or Pfam clans used to generate the IDs if the list was obtained from 1) the Color SSN or Cluster Analysis utility for a Families option (Option B) EFI-EST SSN, 2) the Families option of the Taxonomy Tool, or 3) the Accession IDs option of the Taxonomy Tool.

Input a list of Pfam families, InterPro families, and/or Pfam clans to restrict the UniProt and/or UniRef IDs in the SSN to these families.
Family(s):
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

For input lists of UniRef90 and UniRef50 clusters, the cluster ID (representative sequence) is used to identify those that match the list of families and are included in the SSN. The UniProt members in these clusters that do not match the input families are removed from the cluster and are not included in the SSN node attributes.

Filter by Taxonomy

The input list of UniRef90 or UniRef50 cluster IDs must be filtered with the same taxonomy categories used to generate the IDs if the list was obtained from 1) the Color SSN or Cluster Analysis utility for a Families option (Option B) EFI-EST SSN, 2) the Families option of the Taxonomy Tool, or 3) the Accession IDs option of the Taxonomy Tool.

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the UniProt IDs in the SSN to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The UniProt IDs also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: an ID is kept if it matches any one of them.

Preselected conditions:

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

UniRef90 clusters contain UniProt IDs that share ≥90% sequence identity and have 80% overlap with the longest sequence in the cluster ("seed sequence"); as a result, the UniProt IDs in the cluster usually are functionally homogeneous, i.e., orthologues. UniRef50 clusters contain UniProt IDs that share ≥50% sequence identity and have 80% overlap with the seed sequence; as a result, the UniProt IDs in the cluster often are functionally heterogeneous, e.g., paralogues.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified families. The UniProt members in these UniRef90 and UniRef50 clusters that do not match the specified families are removed from the cluster.

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).
Region:
A specified InterPro family must be defined by a single database. N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Clusters in the submitted SSN are identified, numbered and colored. Summary tables, sets of IDs and sequences per cluster are provided.

The clusters are numbered and colored using two conventions: 1) Sequence Count Cluster Number assigned in order of decreasing number of UniProt IDs in the cluster; 2) Node Count Cluster Number assigned in order of decreasing number of nodes in the cluster.
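
The two numbering conventions can be illustrated with a minimal sketch. It assumes each cluster is given as a list with one entry per node, holding the number of UniProt IDs that node represents; the names are hypothetical, and ties are broken by input order, which is an assumption.

```python
from typing import Dict, List


def number_clusters(clusters: Dict[str, List[int]]) -> Dict[str, Dict[str, int]]:
    """Assign Sequence Count and Node Count cluster numbers.

    clusters maps a cluster label to the UniProt ID count of each node.
    """
    # Decreasing number of UniProt IDs in the cluster.
    by_sequences = sorted(clusters, key=lambda c: sum(clusters[c]), reverse=True)
    # Decreasing number of nodes in the cluster.
    by_nodes = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    return {
        label: {
            "sequence_count_number": by_sequences.index(label) + 1,
            "node_count_number": by_nodes.index(label) + 1,
        }
        for label in clusters
    }


if __name__ == "__main__":
    print(number_clusters({"a": [1, 1, 1], "b": [10, 5], "c": [2]}))
```

In this example, cluster "b" has only two nodes but fifteen UniProt IDs, so it is first by sequence count and second by node count. The two numberings can differ in this way for UniRef SSNs, where one node can stand for many UniProt IDs.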

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

As in the Color SSN utility, clusters in the submitted SSN are identified, numbered, and colored.

The SSN clusters are numbered and colored using two conventions: Sequence Count Cluster Numbers are assigned in order of decreasing number of UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of decreasing number of nodes in the cluster.

Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs), length histograms, and consensus residues are computed for each cluster.

Options are available in the tabs below to select the desired analyses:

The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster containing more than the "Minimum Node Count" specified in the Sequence Filter tab. The percent identity matrix for the MSA is also provided on this tab.

The Consensus Residues tab provides a tab-delimited text file with the number of conserved residues and their MSA positions for each specified residue in each cluster containing more than the "Minimum Node Count". Note that the default residue is "C" and that the percent identity levels displayed range from 90 to 10% in intervals of 10%; a residue is counted as "conserved" if it occurs with ≥80% identity.

The HMMs tab provides the HMM for each cluster containing more than the specified "Minimum Node Count".

The Length Histograms tab provides length histograms for each cluster containing more than the specified "Minimum Node Count".

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).

Sequence Filter

The MSA is generated with MUSCLE using the node IDs. Clusters containing fewer than the Minimum Node Count will be excluded from the analyses. Since MUSCLE can fail with a "large" number of sequences (the limit varies, anywhere from >750 to 1500), the Maximum Node Count parameter can be used to limit the number of sequences that MUSCLE uses.
Minimum Node Count:
Maximum Node Count:
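
The effect of the two node count parameters can be sketched as below. How the utility picks which sequences to pass to MUSCLE when a cluster exceeds the maximum is not stated here; the sketch simply keeps the first entries, which is an assumption, and the names are hypothetical.

```python
from typing import Dict, List


def select_clusters_for_msa(
    clusters: Dict[str, List[str]], min_nodes: int, max_nodes: int
) -> Dict[str, List[str]]:
    """Drop small clusters and cap the number of node IDs per cluster."""
    selected: Dict[str, List[str]] = {}
    for cluster_id, node_ids in clusters.items():
        # Clusters with fewer nodes than the minimum are excluded.
        if len(node_ids) < min_nodes:
            continue
        # Assumption: keep the first max_nodes IDs to limit the MSA input.
        selected[cluster_id] = node_ids[:max_nodes]
    return selected


if __name__ == "__main__":
    example = {"1": ["n1", "n2", "n3", "n4", "n5"], "2": ["n6", "n7"]}
    print(select_clusters_for_msa(example, min_nodes=3, max_nodes=4))
```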

WebLogos

An MSA for the (length-filtered) node IDs is generated using MUSCLE; the WebLogo is generated with the http://weblogo.threeplusone.com code.
Make WebLogo:

Consensus Residues

The positions and selected percent identities of the selected residues in the MSA are determined.
Compute Consensus Residues:
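
One way to picture the consensus residue calculation is the sketch below, which lists the MSA positions where a chosen residue reaches each percent identity level from 90 down to 10%. It is an illustration under assumptions: every row of the alignment (including rows with a gap at that position) counts toward the denominator, positions are reported 1-based, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def conserved_positions(
    msa: List[str], residue: str = "C", thresholds: Iterable[int] = range(90, 0, -10)
) -> Dict[int, List[int]]:
    """Map each percent threshold to the 1-based MSA columns that meet it."""
    levels = list(thresholds)
    if not msa:
        return {level: [] for level in levels}
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    total = len(msa)
    # Number of rows carrying the residue in each column.
    counts = [sum(1 for row in msa if row[col] == residue) for col in range(length)]
    return {
        level: [col + 1 for col, count in enumerate(counts) if count * 100 >= level * total]
        for level in levels
    }


if __name__ == "__main__":
    alignment = ["ACDC", "ACEC", "GCDA", "ACDC"]
    for level, positions in conserved_positions(alignment).items():
        print(level, positions)
```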

HMMs

The MSA for the (length-filtered) node IDs is used to generate the HMM with hmmbuild from HMMER3 (http://hmmer.org).
Make HMMs:

Length Histograms

Length histograms are generated for the node IDs (where applicable, for UniProt, UniRef90, and UniRef50 IDs).
Make Length Histograms:
E-mail address:

You will be notified by e-mail when your submission has been processed.

Nodes in the submitted SSN are colored according to neighborhood connectivity (number of edges to other nodes).

The nodes for unresolved families can be difficult to identify in SSNs generated with low alignment scores. Coloring the nodes according to the number of edges to other nodes (Neighborhood Connectivity, NC) helps identify families with highly connected nodes (https://doi.org/10.1016/j.heliyon.2020.e05867). Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold can be chosen in Cytoscape to separate the SSN into families.
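
Following the description above, where nodes are colored by their number of edges to other nodes, the per-node count can be sketched from an edge list. The utility's actual calculation may differ; the function name and the assumption that the edge list has no duplicate edges are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def edge_counts(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the edges that connect each node to other nodes."""
    counts: Counter = Counter()
    for source, target in edges:
        if source == target:
            # Self-loops do not connect a node to another node.
            continue
        counts[source] += 1
        counts[target] += 1
    return dict(counts)


if __name__ == "__main__":
    network = [("n1", "n2"), ("n1", "n3"), ("n2", "n3"), ("n3", "n4")]
    print(edge_counts(network))
```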

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

The convergence ratio is calculated for each cluster.

SSN File:
A Color SSN (from either the Color SSN or Cluster Analysis utility) is the required input (cluster numbers are required).
Alignment Score: The alignment score to calculate convergence ratio per cluster (should be the same as the original SSN alignment score).

The "convergence ratio" is the ratio of the actual number of edges in the cluster to the maximum possible number of edges (each node connected to every other node). For UniRef SSNs, two convergence ratios are calculated, one for the edges connecting the UniRef nodes in the input SSN and the second for the "hypothetical" edges that would connect the internal UniProt IDs in the cluster. The user specifies the value of the alignment score to be used (usually the same as the alignment score used to generate the SSN).

The value of the convergence ratio ranges from 1.0 for sequences that are very similar ("identical") to 0.0 for sequences that are unrelated at the specified alignment score. The convergence ratio can be used as a criterion to infer whether an SSN cluster is isofunctional—the convergence ratio of a cluster containing orthologous sequences is expected to be close to 1.0 even at large alignment scores.
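
The definition of the convergence ratio can be written out as a short sketch: for a cluster of n nodes, the maximum possible number of edges is n(n-1)/2. The sketch assumes an edge counts when its alignment score is at or above the chosen value, and it treats single-node clusters as undefined; both choices, and the function name, are assumptions.

```python
from typing import Iterable


def convergence_ratio(
    node_count: int, edge_scores: Iterable[float], alignment_score: float
) -> float:
    """Ratio of actual edges to the maximum possible edges in a cluster.

    edge_scores holds one alignment score per node pair found by BLAST.
    """
    if node_count < 2:
        raise ValueError("a cluster needs at least two nodes")
    # Assumption: an edge exists when its score meets the chosen threshold.
    actual = sum(1 for score in edge_scores if score >= alignment_score)
    # Every node connected to every other node: n * (n - 1) / 2 edges.
    possible = node_count * (node_count - 1) // 2
    return actual / possible


if __name__ == "__main__":
    # Four nodes allow six edges; three of the scored pairs pass at 40.
    print(convergence_ratio(4, [50, 60, 30, 80], 40))
```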

E-mail address:

You will be notified by e-mail when your submission has been processed.

UniProt Version: 2022_04
InterPro Version: 91

Click here to contact us for help, reporting issues, or suggestions.