This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
Please cite your use of the EFI tools:
Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735
We apologize for the inconvenience during the unplanned interruption in the EFI tools. Nearly all of the capabilities of the site are restored.
Access to EFI-CGFP will be restored in the coming days. Please use the tools; should you encounter any issues, please email efi@enzymefunction.org with a description of the issue and how you encountered it.
A sequence similarity network (SSN) allows for visualization of relationships among
protein sequences. In SSNs, the most related proteins are grouped together in
clusters.
The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs.
Cytoscape is used to explore SSNs.
A listing of new features and other information pertaining to EST is available on the
release notes page.
Generate a SSN for a single protein and its closest homologues in the UniProt, UniRef90, or UniRef50 database.
The input sequence is used as the query for a search of the UniProt, UniRef90,
or UniRef50 database using BLAST. For the UniRef90 and UniRef50 databases, the
sequence of the cluster ID (representative sequence) is used for the BLAST.
The database is selected using the BLAST Retrieval Options.
An all-by-all BLAST is performed to obtain the similarities between sequence
pairs to calculate edge values to generate the SSN.
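As a rough illustration of how pairwise similarities become a network, the sketch below assumes the all-by-all comparison has already produced one score per sequence pair. The function name `build_ssn`, the sequence labels, the score values, and the threshold are all hypothetical; this is not the EFI-EST implementation. An edge is kept when its score meets the threshold, and clusters are taken to be the connected components of the resulting graph.

```python
from collections import defaultdict


def build_ssn(pair_scores, threshold):
    """Keep pairs scoring >= threshold as edges and group nodes into clusters.

    pair_scores: dict mapping (id_a, id_b) -> similarity score (hypothetical input).
    Returns (edges, clusters); clusters are sorted largest first.
    """
    adjacency = defaultdict(set)
    nodes = set()
    edges = []
    for (a, b), score in pair_scores.items():
        nodes.update((a, b))
        if score >= threshold:
            edges.append((a, b, score))
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Connected components found with an iterative depth-first search.
    clusters = []
    seen = set()
    for start in sorted(nodes):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))

    clusters.sort(key=len, reverse=True)
    return edges, clusters


# Made-up scores for five placeholder sequences.
pair_scores = {
    ("P1", "P2"): 120, ("P1", "P3"): 95, ("P2", "P3"): 110,
    ("P3", "P4"): 12, ("P4", "P5"): 80, ("P1", "P5"): 5,
}
edges, clusters = build_ssn(pair_scores, threshold=50)
print(clusters)
```

Raising the threshold removes weaker edges, so a single large cluster tends to break into smaller ones.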
Generate a SSN for a protein family.
The members of the input Pfam families, InterPro families, and/or Pfam
clans are selected from the UniProt, UniRef90, or UniRef50 database.
Generate a SSN from FASTA-formatted UniProt sequences.
An all-by-all BLAST is performed to obtain the similarities between sequence
pairs to calculate edge values to generate the SSN.
Input a list of sequences in the FASTA format or upload a FASTA-formatted
sequence file.
Two options are available for generating the SSN:
1) The sequences are used "as is", with the node attributes including only the
information in the header as the description and the number of residues in the
sequence.
2) The ID in the header that immediately follows the ">" is used to
retrieve node attribute information. Acceptable IDs include UniProt IDs, PDB
IDs, and NCBI GenBank IDs that have equivalent entries in the UniProt database.
To use this option, check the "Read FASTA headers" box.
Generate a SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs to calculate edge values to generate the SSN.
An input SSN from the EFI-EST FASTA option should be generated using "Read
FASTA headers" from FASTA files with UniProt IDs in the headers. Otherwise,
the summary tables and sets of IDs and sequences per cluster will not be
generated.
Clusters in the submitted SSN are identified, numbered and colored.
Summary tables, sets of IDs and sequences per cluster are provided.
The clusters are numbered and colored using two conventions: 1)
Sequence Count Cluster Numbers are assigned in order of decreasing number of
UniProt IDs in the cluster; 2) Node Count Cluster Numbers are assigned in order
of decreasing number of nodes in the cluster.
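The two numbering conventions can give different results when a node represents more than one sequence. The sketch below is a minimal illustration under assumed inputs: each cluster is a dictionary mapping a node ID to the list of UniProt IDs it represents, and the function name `number_clusters`, the node names, and the placeholder IDs are hypothetical. Ties are broken by input order here, which is an assumption rather than documented behavior.

```python
def number_clusters(clusters):
    """Assign both cluster numbers to every cluster.

    clusters: list of dicts, each mapping node_id -> list of UniProt IDs
    represented by that node (hypothetical input layout).
    """
    def rank(size_of):
        # Largest cluster gets number 1; sorted() is stable, so ties keep input order.
        order = sorted(range(len(clusters)),
                       key=lambda i: size_of(clusters[i]),
                       reverse=True)
        return {index: number for number, index in enumerate(order, start=1)}

    by_sequences = rank(lambda c: sum(len(ids) for ids in c.values()))
    by_nodes = rank(len)
    return [
        {"sequence_count_number": by_sequences[i],
         "node_count_number": by_nodes[i]}
        for i in range(len(clusters))
    ]


# Placeholder IDs: cluster 0 has 2 nodes covering 10 sequences,
# cluster 1 has 3 nodes covering 3 sequences.
clusters = [
    {"node_a1": [f"SEQ{i:02d}" for i in range(1, 6)],
     "node_a2": [f"SEQ{i:02d}" for i in range(6, 11)]},
    {"node_b1": ["SEQ11"], "node_b2": ["SEQ12"], "node_b3": ["SEQ13"]},
]
numbers = number_clusters(clusters)
print(numbers)
```

In this example the first cluster is number 1 by sequence count but number 2 by node count, which is why both numbers are reported.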
An input SSN from the EFI-EST FASTA option should be generated using "Read
FASTA headers" from FASTA files with UniProt IDs in the headers. Otherwise,
sets of IDs and sequences, MSAs, WebLogos, HMMs, consensus residues, and length
histograms will not be generated.
As in the Color SSN utility, clusters in the submitted SSN are identified,
numbered and colored.
The SSN clusters are numbered and colored using two conventions:
Sequence Count Cluster Numbers are assigned in order of decreasing number of
UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of
decreasing number of nodes in the cluster.
Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs),
length histograms, and consensus residues are computed for each cluster.
Options are available in the tabs below to select the desired analyses:
The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster
containing greater than the "Minimum Node Count" specified in the
Sequence Filter tab.
The percent identity matrix for the MSA is also provided on this tab.
The Consensus Residues tab provides a tab-delimited text file with the number
of conserved residues and their MSA positions for each specified residue in
each cluster containing greater than the "Minimum Node Count". Note that the
default residue is "C" and the percent identity levels that are displayed are
from 90 to 10% in intervals of 10%; a residue is counted as "conserved" if it
occurs with ≥80% identity.
The HMMs tab provides the HMM for each cluster containing greater than the
specified "Minimum Node Count".
The Length Histograms tab provides length histograms for each cluster
containing greater than the specified "Minimum Node Count".
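To make the Consensus Residues calculation more concrete, the sketch below counts the MSA columns in which a chosen residue reaches each percent identity level from 90 to 10%. It is only an illustration: the function name `conserved_positions` and the toy alignment are hypothetical, and dividing by the total number of sequences (gaps included) is an assumption, not a documented detail of the tool.

```python
def conserved_positions(msa, residue="C", thresholds=range(90, 0, -10)):
    """Return {threshold: [1-based MSA positions]} for one residue.

    msa: list of aligned sequences of equal length.
    A position is listed under a threshold when the residue occurs in at
    least that percentage of the sequences (assumed definition).
    """
    if not msa:
        raise ValueError("the alignment is empty")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("aligned sequences must have the same length")

    percent = [
        100.0 * sum(seq[col] == residue for seq in msa) / len(msa)
        for col in range(length)
    ]
    return {
        t: [col + 1 for col, p in enumerate(percent) if p >= t]
        for t in thresholds
    }


# Toy alignment: "C" is in every sequence at position 2 and in half at position 4.
msa = ["ACDC", "ACEC", "GCDA", "ACD-"]
result = conserved_positions(msa)
for threshold, positions in result.items():
    print(threshold, len(positions), positions)
```

Each printed row gives the identity level, the number of positions that meet it, and the positions themselves, mirroring the count-plus-positions layout described for the tab-delimited file.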
Nodes in the submitted SSN are colored according to neighborhood connectivity
(number of edges to other nodes).
The nodes for unresolved families can be difficult to identify in SSNs
generated with low alignment scores. Coloring the nodes according to the
number of edges to other nodes (Neighborhood Connectivity, NC) helps identify
families with highly connected nodes
(https://doi.org/10.1016/j.heliyon.2020.e05867).
Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold
can be chosen in Cytoscape to separate the SSN into families.
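Following the definition given above (the number of edges from a node to other nodes), neighborhood connectivity can be sketched as a simple edge count per node. The function name `neighborhood_connectivity` and the node labels are hypothetical, and the calculation the tool actually performs may differ in detail; duplicate edges and self-loops are ignored here as an assumption.

```python
def neighborhood_connectivity(nodes, edges):
    """Count, for every node, its distinct edges to other nodes.

    nodes: iterable of node IDs; edges: iterable of (node_a, node_b) pairs.
    """
    # Treat (a, b) and (b, a) as the same edge and drop self-loops.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = {node: 0 for node in nodes}
    for edge in unique_edges:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1
    return degree


nodes = ["n1", "n2", "n3", "n4", "n5"]
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3"), ("n3", "n4"), ("n2", "n1")]
degree = neighborhood_connectivity(nodes, edges)

# Most highly connected nodes first, e.g. to drive a color gradient.
ranked = sorted(degree, key=degree.get, reverse=True)
print(degree, ranked[0])
```

Nodes with high counts sit inside densely connected regions, which is the signal the coloring is meant to expose when choosing an alignment score threshold.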
Convergence ratio is calculated per cluster.
UniProt Version: 2022_04
InterPro Version: 91
The number of proteins in the selected family(ies) is greater than
the maximum allowed (1,000,000,000). To reduce computing time and the size of
the output SSN, UniRef90 cluster ID sequences will automatically be used.
In UniRef90, sequences that share ≥90% sequence identity over 80% of the
sequence length are grouped together and represented by an accession ID known
as the cluster ID. The output SSN is equivalent to a 90% Representative Node
Network, with each node corresponding to a UniRef cluster ID; the node
attribute "UniRef90 Cluster IDs" lists all the sequences represented by a
node. UniRef90 SSNs are compatible with the Color SSN utility as well as the
EFI-GNT tool.
Press Ok to continue with UniRef90.