EFI - Enzyme Similarity Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023. https://doi.org/10.1016/j.jmb.2023.168018

A sequence similarity network (SSN) allows for visualization of relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters. The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs. Cytoscape is used to explore SSNs.

A listing of new features and other information pertaining to EST is available on the release notes page.

InterProScan sequence search can be used to find matches within the InterPro database for a given sequence.

Information on Pfam families and clans and InterPro family sizes is available on the Family Information page.

EFI database version: 2024_01 / 98

Generate an SSN for a single protein and its closest homologues in the UniProt, UniRef90, or UniRef50 database.

The input sequence is used as the query for a search of the UniProt, UniRef90, or UniRef50 database using BLAST. For the UniRef90 and UniRef50 databases, the sequence of the cluster ID (representative sequence) is used for the BLAST.

The database is selected using the BLAST Retrieval Options.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Query Sequence:
Input a single protein sequence only. The default maximum number of retrieved sequences is 1,000.

BLAST Retrieval Options

UniProt BLAST query e-value: Negative log of e-value for retrieving similar sequences (≥ 1; default: 5)
Input a larger e-value (smaller negative log) to retrieve homologues if the query sequence is short. Input a smaller e-value (larger negative log) to retrieve more similar homologues.
Maximum number of sequences retrieved: (≤ 10,000, default: 1,000)
Sequence database: (UniProt, UniRef90, or UniRef50; default UniProt)
Select the sequence database to BLAST against.
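The e-value fields on this page take the negative log of the e-value rather than the e-value itself. The conversion can be sketched as follows, assuming a base-10 logarithm (the page does not state the base); the function name is hypothetical and is not part of the EFI tools.

```python
def evalue_from_negative_log(neg_log: int) -> float:
    """Convert a 'negative log of e-value' entry into an e-value.

    Assumption: the logarithm is base 10, so an entry of 5 means 1e-5.
    """
    if neg_log < 1:
        raise ValueError("the form accepts values of 1 or greater")
    return 10.0 ** (-neg_log)


default_cutoff = evalue_from_negative_log(5)     # the default entry
permissive_cutoff = evalue_from_negative_log(2)  # larger e-value, e.g. for a short query
```

A smaller entry gives a larger e-value and therefore a more permissive search, which is why short query sequences call for a smaller negative log.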

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.
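The two fragment rules for UniRef clusters can be sketched as below. This is an illustration only, assuming that clusters are held as a mapping from the cluster ID to its member UniProt IDs and that the set of fragment IDs is already known; the function and variable names are hypothetical.

```python
def exclude_fragments(clusters: dict, fragment_ids: set) -> dict:
    """Apply the two fragment rules to UniRef-style clusters.

    clusters: cluster ID (representative sequence) -> list of member UniProt IDs.
    fragment_ids: UniProt IDs whose Sequence Status is Fragment.
    """
    kept = {}
    for cluster_id, members in clusters.items():
        # Rule 1: drop the whole cluster if its cluster ID is a fragment.
        if cluster_id in fragment_ids:
            continue
        # Rule 2: in the remaining clusters, remove members that are fragments.
        kept[cluster_id] = [m for m in members if m not in fragment_ids]
    return kept
```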

Filter by Taxonomy

A taxonomy filter is applied to the list of UniProt, UniRef90, or UniRef50 cluster IDs retrieved by the BLAST.

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the retrieved sequences to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a sequence is kept if it matches any one of them.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified taxonomy categories. The UniProt members in these clusters that do not match the specified taxonomy categories are removed from the cluster.
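Combining multiple taxonomy conditions as a union means that a sequence is kept when it matches at least one condition. A minimal sketch, assuming each sequence carries a lineage stored as a rank-to-name mapping; this data layout and the function name are hypothetical.

```python
def matches_any_condition(lineage: dict, conditions: list) -> bool:
    """Return True if the lineage satisfies at least one (rank, name) condition."""
    return any(lineage.get(rank) == name for rank, name in conditions)


conditions = [("Phylum", "Actinomycetota"), ("Genus", "Escherichia")]
lineage = {"Superkingdom": "Bacteria", "Genus": "Escherichia"}
keep = matches_any_condition(lineage, conditions)  # matches the Genus condition only
```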

Preselected conditions:

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the pairwise similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.
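The Fraction option can be pictured with the sketch below: every Nth ID is kept, and IDs with SwissProt annotations are kept regardless. The starting offset and ordering used by the tool are not stated on this page, so starting at the first ID is an assumption, and the function name is hypothetical.

```python
def sample_family(ids: list, swissprot_ids: set, fraction: int) -> list:
    """Keep every Nth ID (N = fraction) plus all SwissProt-annotated IDs.

    Assumption: sampling starts with the first ID in the list.
    """
    if fraction < 1:
        raise ValueError("fraction must be 1 or greater")
    return [
        acc
        for index, acc in enumerate(ids)
        if index % fraction == 0 or acc in swissprot_ids
    ]


family = [f"A{n}" for n in range(10)]             # ten placeholder IDs
sampled = sample_family(family, {"A4"}, fraction=3)
```

In this example four IDs are kept by the every-third rule and one more because it is SwissProt-annotated, which shows why the final data set can be slightly larger than the fraction specified.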
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN for a protein family.

The members of the input Pfam families, InterPro families, and/or Pfam clans are selected from the UniProt, UniRef90, or UniRef50 database.

Pfam and/or InterPro Families and/or Pfam clans:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
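The family input format described above can be checked before submission with a short script. This is a sketch only: the regular expression follows the stated digit counts, upper-case prefixes are assumed, and the function name is hypothetical.

```python
import re

# PF + five digits, IPR + six digits, or CL + four digits (Pfam clans).
FAMILY_PATTERN = re.compile(r"PF\d{5}|IPR\d{6}|CL\d{4}")


def parse_family_list(text: str) -> list:
    """Split a comma/space separated list and validate each family ID."""
    tokens = [token for token in re.split(r"[,\s]+", text.strip()) if token]
    invalid = [token for token in tokens if not FAMILY_PATTERN.fullmatch(token)]
    if invalid:
        raise ValueError(f"unrecognized family IDs: {invalid}")
    return tokens


families = parse_family_list("PF00001, IPR000001 CL0001")
```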

UniRef90 clusters contain UniProt IDs that share ≥90% sequence identity and have 80% overlap with the longest sequence in the cluster ("seed sequence"); as a result, the UniProt IDs in the cluster usually are functionally homogeneous, i.e., orthologues. UniRef50 clusters contain UniProt IDs that share ≥50% sequence identity and have 80% overlap with the seed sequence; as a result, the UniProt IDs in the cluster often are functionally heterogeneous, e.g., paralogues.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified families. The UniProt members in these UniRef90 and UniRef50 clusters that do not match the specified families are removed from the cluster.

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Taxonomy

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the retrieved sequences to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The retrieved sequences also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a sequence is kept if it matches any one of them.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified taxonomy categories. The UniProt members in these clusters that do not match the specified taxonomy categories are removed from the cluster.

Preselected conditions:

Protein Family Size Options

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

Family Domain Boundary Option

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Region:
N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.
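The three Region choices correspond to simple slices of the sequence around the domain boundaries. A minimal sketch, assuming the boundaries are given as 1-based inclusive positions (the coordinate convention is not stated on this page); the function name and the example sequence are hypothetical.

```python
def select_region(sequence: str, start: int, end: int, region: str) -> str:
    """Return the part of the sequence chosen by the Region option.

    start and end are assumed to be 1-based, inclusive domain boundaries.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("domain boundaries fall outside the sequence")
    if region == "domain":
        return sequence[start - 1:end]
    if region == "n-terminal":
        return sequence[:start - 1]   # everything before the domain
    if region == "c-terminal":
        return sequence[end:]         # everything after the domain
    raise ValueError(f"unknown region: {region}")


protein = "MKTAYIAKQR"
domain_only = select_region(protein, 4, 7, "domain")
```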

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the pairwise similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN from FASTA-formatted UniProt sequences.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Input a list of sequences in the FASTA format or upload a FASTA-formatted sequence file.

The sequences in the FASTA file are used to calculate edge values.

The ID in the header that immediately follows the ">" is used to retrieve node attribute information. Acceptable IDs include UniProt IDs, PDB IDs, and NCBI GenBank IDs that have equivalent entries in the UniProt database.

If the header for a sequence does not contain an acceptable ID for retrieving node attribute information, the SSN provides node attributes for only the sequence, sequence length, and the header as the Description.
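To check which ID would be read from each header before uploading a file, the FASTA text can be scanned as sketched below. This assumes the ID is the first whitespace-delimited token after the ">"; the tool's actual header parsing may differ, and the function names are hypothetical.

```python
def read_fasta(text: str) -> list:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def leading_id(header: str) -> str:
    """Return the token that immediately follows the '>' (assumed to be the ID)."""
    parts = header.split()
    return parts[0] if parts else ""


example = ">P12345 example protein\nMKT\nAYI\n>Q9XYZ1\nMAA\n"
ids = [leading_id(header) for header, _ in read_fasta(example)]
```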

If the user identifies the input sequences as UniRef50 or UniRef90, the node attributes will include the UniRef Cluster Size and UniRef Cluster IDs node attributes. The other node attributes will be lists of the values for UniRef cluster IDs in the node.

Input FASTA-formatted sequences with UniProt accession IDs in the header or upload a file.

Sequences:
FASTA File:

Input FASTA-formatted sequences with UniRef50 or UniRef90 accession IDs in the header or upload a file.

Sequences:
FASTA ID File:
Input accession IDs are:

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Family

Input a list of Pfam families, InterPro families, and/or Pfam clans to restrict the UniProt and/or UniRef IDs in the SSN to these families.
Family(s):
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

Filter by Taxonomy

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the input UniProt sequences to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The input UniProt sequences also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a sequence is kept if it matches any one of them.

Preselected conditions:

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

UniRef90 clusters contain UniProt IDs that share ≥90% sequence identity and have 80% overlap with the longest sequence in the cluster ("seed sequence"); as a result, the UniProt IDs in the cluster usually are functionally homogeneous, i.e., orthologues. UniRef50 clusters contain UniProt IDs that share ≥50% sequence identity and have 80% overlap with the seed sequence; as a result, the UniProt IDs in the cluster often are functionally heterogeneous, e.g., paralogues.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified families. The UniProt members in these UniRef90 and UniRef50 clusters that do not match the specified families are removed from the cluster.

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).
Region:
A specified InterPro family must be defined by a single database. N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the pairwise similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate an SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.

An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.

Input a list of UniProt, NCBI, or GenBank (protein) accession IDs, or upload a text file.

Accession IDs:
Accession ID File:

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Accession IDs:
Accession ID File:
Input accession IDs are:

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.
Fragments:

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Family

The input list of UniRef90 or UniRef50 cluster IDs should (must!) be filtered with the same list of Pfam families, InterPro families, and/or Pfam clans used to generate the IDs, if:

The input list of UniRef90 or UniRef50 IDs is obtained from 1) the Color SSN or Cluster Analysis utility for a Families option (Option B) EFI-EST SSN, 2) the Families option of the Taxonomy Tool, or 3) the Accession IDs option of the Taxonomy Tool.

Input a list of Pfam families, InterPro families, and/or Pfam clans to restrict the UniProt and/or UniRef IDs in the SSN to these families.
Family(s):
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

For input lists of UniRef90 and UniRef50 clusters, the cluster ID (representative sequence) is used to identify those that match the list of families and are included in the SSN. The UniProt members in these clusters that do not match the input families are removed from the cluster and are not included in the SSN node attributes.

Filter by Taxonomy

The input list of UniRef90 or UniRef50 cluster IDs should (must!) be filtered with the same taxonomy categories used to generate the IDs, if:

The input list of UniRef90 or UniRef50 IDs is obtained from 1) the Color SSN or Cluster Analysis utility for a Families option (Option B) EFI-EST SSN, 2) the Families option of the Taxonomy Tool, or 3) the Accession IDs option of the Taxonomy Tool.

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the UniProt IDs in the sunburst to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The UniProt IDs also can be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: an ID is kept if it matches any one of them.

Preselected conditions:

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.

UniRef90 clusters contain UniProt IDs that share ≥90% sequence identity and have 80% overlap with the longest sequence in the cluster ("seed sequence"); as a result, the UniProt IDs in the cluster usually are functionally homogeneous, i.e., orthologues. UniRef50 clusters contain UniProt IDs that share ≥50% sequence identity and have 80% overlap with the seed sequence; as a result, the UniProt IDs in the cluster often are functionally heterogeneous, e.g., paralogues.

The sequences from the UniRef90 and UniRef50 databases are the UniRef90 and UniRef50 clusters for which the cluster ID ("representative sequence") matches the specified families. The UniProt members in these UniRef90 and UniRef50 clusters that do not match the specified families are removed from the cluster.

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProtKB, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with SwissProt annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).
Region:
A specified InterPro family must be defined by a single database. N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternate e-value for BLAST to calculate the pairwise similarities used as edge alignment scores. The default parameter (5) is useful for most sequences. However, a larger e-value/smaller negative log should be used for short sequences or when low pairwise identities may be useful for separating functionally distinct SSN clusters.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Clusters in the submitted SSN are identified, numbered and colored. Summary tables, sets of IDs and sequences per cluster are provided for sequences identified by a UniProt ID.

The clusters are numbered and colored using two conventions: 1) Sequence Count Cluster Number assigned in order of decreasing number of UniProt IDs in the cluster; 2) Node Count Cluster Number assigned in order of decreasing number of nodes in the cluster.
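The two numbering conventions differ only in what is counted per cluster. A minimal sketch, assuming each cluster is described by the number of UniProt IDs held in each of its nodes (a UniRef node can hold several); how ties are ordered is not stated on this page, and the names are hypothetical.

```python
def number_clusters(clusters: dict) -> tuple:
    """Assign both cluster numbers.

    clusters: cluster key -> list with the UniProt ID count of each node.
    Returns (sequence_count_numbers, node_count_numbers), both starting at 1.
    """
    by_sequences = sorted(clusters, key=lambda k: sum(clusters[k]), reverse=True)
    by_nodes = sorted(clusters, key=lambda k: len(clusters[k]), reverse=True)
    sequence_numbers = {key: rank + 1 for rank, key in enumerate(by_sequences)}
    node_numbers = {key: rank + 1 for rank, key in enumerate(by_nodes)}
    return sequence_numbers, node_numbers


# Cluster "a": three nodes with one ID each; cluster "b": two nodes with 10 and 5 IDs.
sequence_numbers, node_numbers = number_clusters({"a": [1, 1, 1], "b": [10, 5]})
```

In this example cluster "b" is number 1 by sequence count (15 IDs) while cluster "a" is number 1 by node count (3 nodes), so the two conventions can disagree for UniRef SSNs.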

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

An input SSN from the EFI-EST FASTA option should be generated using "Read FASTA headers" from FASTA files with UniProt IDs in the headers. Otherwise, sets of IDs and sequences, MSAs, WebLogos, HMMs, consensus residues, and length histograms will not be generated.

As in the Color SSN utility, clusters in the submitted SSN are identified, numbered and colored.

The SSN clusters are numbered and colored using two conventions: Sequence Count Cluster Numbers are assigned in order of decreasing number of UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of decreasing number of nodes in the cluster.

Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs), length histograms, and consensus residues are computed for each cluster.

Options are available in the tabs below to select the desired analyses:

The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster containing greater than the "Minimum Node Count" specified in the Sequence Filter tab. The percent identity matrix for the MSA is also provided on this tab.

The Consensus Residues tab provides a tab-delimited text file with the number of the conserved residues and their MSA positions for each specified residue in each cluster containing greater than the "Minimum Node Count". Note the default residue is "C" and the percent identity levels that are displayed are from 90 to 10% in intervals of 10%; a residue is counted as "conserved" if it occurs with ≥80% identity.

The HMMs tab provides the HMM for each cluster containing greater than the specified "Minimum Node Count".

The Length Histograms tab provides length histograms for each cluster containing greater than the specified "Minimum Node Count".

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).

Sequence Filter

The MSA is generated with MUSCLE using the node IDs. Clusters containing fewer nodes than the Minimum Node Count will be excluded from the analyses. Since MUSCLE can fail with a "large" number of sequences (the limit varies, anywhere from >750 to 1500), the Maximum Node Count parameter can be used to limit the number of sequences that MUSCLE uses.
Minimum Node Count:
Maximum Node Count:
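The effect of the two node count parameters can be sketched as below. How the tool chooses which sequences to keep when a cluster exceeds the maximum is not stated on this page; taking the first nodes in the list is an assumption made only for illustration, and the function name is hypothetical.

```python
def clusters_for_msa(clusters: dict, min_nodes: int, max_nodes: int) -> dict:
    """Select the node IDs passed to the alignment step.

    clusters: cluster number -> list of node IDs.
    Clusters below min_nodes are skipped; larger clusters are capped at max_nodes.
    """
    selected = {}
    for number, node_ids in clusters.items():
        if len(node_ids) < min_nodes:
            continue  # too small: excluded from all analyses
        # Assumption: keep the first max_nodes IDs when the cluster is too large.
        selected[number] = node_ids[:max_nodes]
    return selected
```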

WebLogos

An MSA for the (length-filtered) node IDs is generated using MUSCLE; the WebLogo is generated with the http://weblogo.threeplusone.com code.
Make Weblogo:

Consensus Residues

The positions and selected percent identities of the selected residues in the MSA are determined.
Compute Consensus Residues:
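One way to picture the consensus residue calculation is to count, for each percent identity level, the MSA columns in which the selected residue reaches that level. The sketch below assumes gap characters count toward the column total and reports 1-based positions; the tool's exact procedure and output format may differ, and the function name is hypothetical.

```python
def conserved_positions(msa: list, residue: str = "C", levels=range(90, 0, -10)) -> dict:
    """Map each percent identity level to the MSA columns meeting it.

    msa: aligned sequences of equal length. Positions are 1-based.
    """
    if not msa:
        return {}
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    result = {}
    for level in levels:
        columns = []
        for col in range(length):
            hits = sum(1 for row in msa if row[col] == residue)
            # Integer comparison avoids floating-point rounding at the threshold.
            if 100 * hits >= level * len(msa):
                columns.append(col + 1)
        result[level] = columns
    return result


alignment = ["ACDC", "ACDE", "GCDC", "ACD-"]
positions = conserved_positions(alignment)
```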

HMMs

The MSA for the (length-filtered) node IDs is used to generate the HMM with hmmbuild from HMMER3 (http://hmmer.org).
Make HMMs:

Length Histograms

Length histograms for the node IDs (where applicable, UniProt, UniRef90, and UniRef50 IDs).
Make Length Histograms:
E-mail address:

You will be notified by e-mail when your submission has been processed.

Nodes in the submitted SSN are colored according to neighborhood connectivity (number of edges to other nodes).

The nodes for unresolved families can be difficult to identify in SSNs generated with low alignment scores. Coloring the nodes according to the number of edges to other nodes (Neighborhood Connectivity, NC) helps identify families with highly connected nodes (https://doi.org/10.1016/j.heliyon.2020.e05867). Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold can be chosen in Cytoscape to separate the SSN into families.
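Following the description above, the value used for coloring can be pictured as the number of edges each node has to other nodes. A minimal sketch, assuming the SSN edges are available as a list of node ID pairs with each pair listed once; this does not reproduce the utility's actual computation, and the function name is hypothetical.

```python
from collections import Counter


def edge_counts(edges: list) -> dict:
    """Count the edges each node has to other nodes.

    edges: list of (node_a, node_b) pairs, each undirected edge listed once.
    """
    counts = Counter()
    for node_a, node_b in edges:
        if node_a == node_b:
            continue  # ignore self-loops: only edges to other nodes count
        counts[node_a] += 1
        counts[node_b] += 1
    return dict(counts)


counts = edge_counts([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
```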

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or compressed XGMML as zip).
E-mail address:

You will be notified by e-mail when your submission has been processed.

Convergence ratio is calculated per cluster.

SSN File:
A Color SSN (from either the Color SSN or Cluster Analysis utility) is the required input (cluster numbers are required).
Alignment Score: The alignment score to calculate convergence ratio per cluster (should be the same as the original SSN alignment score).

The "convergence ratio" is the ratio of the actual number of edges in the cluster to the maximum possible number of edges (each node connected to every other node). For UniRef SSNs, two convergence ratios are calculated, one for the edges connecting the UniRef nodes in the input SSN and the second for the "hypothetical" edges that would connect the internal UniProt IDs in the cluster. The user specifies the value of the alignment score to be used (usually the same as the alignment score used to generate the SSN).

The value of the convergence ratio ranges from 1.0 for sequences that are very similar ("identical") to 0.0 for sequences that are unrelated at the specified alignment score. The convergence ratio can be used as a criterion to infer whether an SSN cluster is isofunctional—the convergence ratio of a cluster containing orthologous sequences is expected to be close to 1.0 even at large alignment scores.
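For a cluster with n nodes, the maximum possible number of edges is n(n-1)/2, so the convergence ratio can be sketched as below. The sketch assumes undirected edges listed once and counts an edge when its alignment score is at or above the chosen value; it covers only the node-level ratio, not the second ratio for internal UniProt IDs, and the function name is hypothetical.

```python
def convergence_ratio(node_count: int, edge_scores: list, alignment_score: float) -> float:
    """Ratio of edges present at the alignment score to all possible edges.

    edge_scores: alignment score of each edge within the cluster, one per node pair.
    """
    if node_count < 2:
        raise ValueError("a cluster needs at least two nodes to have edges")
    present = sum(1 for score in edge_scores if score >= alignment_score)
    possible = node_count * (node_count - 1) // 2
    return present / possible


# Four nodes can have at most six edges.
scores = [50, 60, 70, 80, 90, 100]
fully_connected = convergence_ratio(4, scores, alignment_score=50)
half_connected = convergence_ratio(4, scores, alignment_score=75)
```

Raising the alignment score from 50 to 75 in this example removes three of the six edges, so the ratio drops from 1.0 to 0.5.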

E-mail address:

You will be notified by e-mail when your submission has been processed.

UniProt Version: 2024_01
InterPro Version: 98

Click here to contact us for help, reporting issues, or suggestions.