This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
Please cite your use of the EFI tools:
Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735
Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023. https://doi.org/10.1016/j.jmb.2023.168018
RadicalSAM.org, our resource for investigating sequence-function space in the radical SAM superfamily, has been updated with sequences from the UniProt Release 2024_01 and InterPro Release 98 databases (January 24, 2024).
A sequence similarity network (SSN) allows for visualization of relationships among
protein sequences. In SSNs, the most related proteins are grouped together in
clusters.
The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs.
Cytoscape is used to explore SSNs.
A listing of new features and other information pertaining to EST is available on the
release notes page.
Generate an SSN for a single protein and its closest homologues in the UniProt, UniRef90, or UniRef50 database.
The input sequence is used as the query for a search of the UniProt, UniRef90,
or UniRef50 database using BLAST. For the UniRef90 and UniRef50 databases, the
sequence of the cluster ID (representative sequence) is used for the BLAST.
The database is selected using the BLAST Retrieval Options.
An all-by-all BLAST is performed to obtain the similarities between sequence
pairs; these similarities are used to calculate the edge values of the SSN.
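The relationship between pairwise similarities, edges, and clusters can be illustrated with a minimal Python sketch. It is not the EFI-EST implementation: the scores are made-up placeholders rather than real BLAST-derived edge values, the threshold is arbitrary, and the function names `filter_edges` and `find_clusters` are hypothetical. Clusters are taken here to be the connected components that remain after low-scoring edges are dropped.

```python
from collections import defaultdict


def filter_edges(pair_scores, threshold):
    """Keep only the sequence pairs whose score meets the threshold."""
    return {pair: score for pair, score in pair_scores.items() if score >= threshold}


def find_clusters(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in nodes:
        groups[find(node)].add(node)
    # Largest clusters first; ties are broken by member names (an assumption).
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Hypothetical pairwise scores standing in for all-by-all BLAST results.
pair_scores = {("A", "B"): 50, ("A", "C"): 45, ("B", "C"): 40, ("C", "D"): 5}
edges = filter_edges(pair_scores, threshold=30)
clusters = find_clusters(["A", "B", "C", "D"], edges)
```

With these placeholder values, sequences A, B, and C end up in one cluster and D becomes a singleton; lowering the threshold to 5 would merge all four.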
Generate an SSN for a protein family.
The members of the input Pfam families, InterPro families, and/or Pfam
clans are selected from the UniProt, UniRef90, or UniRef50 database.
Generate an SSN from FASTA-formatted UniProt sequences.
An all-by-all BLAST is performed to obtain the similarities between sequence
pairs; these similarities are used to calculate the edge values of the SSN.
Input a list of sequences in the FASTA format or upload a FASTA-formatted
sequence file.
The sequences in the FASTA file are used to calculate edge values.
The ID in the header that immediately follows the ">" is used to
retrieve node attribute information. Acceptable IDs include UniProt IDs, PDB
IDs, and NCBI GenBank IDs that have equivalent entries in the UniProt database.
If the header for a sequence does not contain an acceptable ID for retrieving
node attribute information, the SSN provides node attributes for only the
sequence, sequence length, and the header as the Description.
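The header handling described above can be sketched in Python as follows. This is an illustration only, not the EFI-EST parser: `parse_fasta` is a hypothetical helper, the "ID" is assumed to be the first whitespace-delimited token after ">", and no lookup against UniProt is performed to check whether that token is an acceptable ID. Every record keeps the three attributes that are always available (sequence, sequence length, and the header as the description).

```python
def _make_record(header, chunks):
    """Assemble one record from a header line and its sequence lines."""
    sequence = "".join(chunks)
    tokens = header.split()
    return {
        # Candidate ID: the token immediately following ">" (assumption:
        # whitespace-delimited; pipe-delimited headers are not unpacked).
        "id": tokens[0] if tokens else "",
        "description": header,
        "sequence": sequence,
        "length": len(sequence),
    }


def parse_fasta(text):
    """Split FASTA-formatted text into a list of record dictionaries."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = ">P12345 some protein\nMKV\nLLA\n>my custom seq\nACDE\n"
records = parse_fasta(example)
```

In this example the first record carries a candidate ID that looks like a UniProt accession, while the second would fall back to the sequence, length, and description attributes only.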
If the user identifies the input sequences as UniRef50 or UniRef90, the node
attributes will include UniRef Cluster Size and UniRef Cluster IDs. The other
node attributes will be lists of the values for the UniRef cluster IDs in the
node.
Generate an SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs; these similarities are used to calculate the edge values of the SSN.
Clusters in the submitted SSN are identified, numbered and colored.
Summary tables, sets of IDs and sequences per cluster are provided for sequences identified by a UniProt ID.
The clusters are numbered and colored using two conventions: 1)
Sequence Count Cluster Number, assigned in order of decreasing number of UniProt IDs in the
cluster; 2) Node Count Cluster Number, assigned in order of decreasing number of
nodes in the cluster.
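The two numbering conventions can give different orderings when nodes represent more than one sequence. The following Python sketch shows the idea under stated assumptions: each cluster is modeled as a list of nodes, each node as a list of the UniProt IDs it represents, and ties are broken by cluster label. The function name `number_clusters`, the data layout, and the tie-breaking rule are all hypothetical and are not taken from the EFI tools.

```python
def number_clusters(clusters):
    """Assign both cluster numbers to every cluster.

    clusters maps an arbitrary label to a list of nodes; each node is a
    list of the UniProt IDs that the node represents.
    """

    def rank(size_of):
        # Largest first; ties broken by label (an assumption for this sketch).
        ordered = sorted(clusters, key=lambda c: (-size_of(clusters[c]), c))
        return {label: number for number, label in enumerate(ordered, start=1)}

    by_sequences = rank(lambda nodes: sum(len(ids) for ids in nodes))
    by_nodes = rank(len)
    return {
        label: {
            "sequence_count_number": by_sequences[label],
            "node_count_number": by_nodes[label],
        }
        for label in clusters
    }


# Cluster "x": one node representing five IDs; cluster "y": three single-ID nodes.
example = {
    "x": [["A", "B", "C", "D", "E"]],
    "y": [["F"], ["G"], ["H"]],
}
numbers = number_clusters(example)
```

Here cluster "x" is first by sequence count (five IDs) but second by node count (one node), which is why both numbers are reported.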
An input SSN from the EFI-EST FASTA option should be generated using "Read
FASTA headers" from FASTA files with UniProt IDs in the headers. Otherwise,
sets of IDs and sequences, MSAs, WebLogos, HMMs, consensus residues, and length
histograms will not be generated.
As in the Color SSN utility, clusters in the submitted SSN are identified,
numbered, and colored.
The SSN clusters are numbered and colored using two conventions:
Sequence Count Cluster Numbers are assigned in order of decreasing number of
UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of
decreasing number of nodes in the cluster.
Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs),
length histograms, and consensus residues are computed for each cluster.
Options are available in the tabs below to select the desired analyses:
The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster
containing more nodes than the "Minimum Node Count" specified in the
Sequence Filter tab.
The percent identity matrix for the MSA is also provided on this tab.
The Consensus Residues tab provides a tab-delimited text file with the number
of the conserved residues and their MSA positions for each specified residue in
each cluster containing more nodes than the "Minimum Node Count". Note that the
default residue is "C" and that the percent identity levels displayed range
from 90 to 10% in intervals of 10%; a residue is counted as "conserved" if it
occurs with ≥80% identity.
The HMMs tab provides the HMM for each cluster containing more nodes than the
specified "Minimum Node Count".
The Length Histograms tab provides length histograms for each cluster
containing more nodes than the specified "Minimum Node Count".
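The Consensus Residues calculation can be approximated with a short Python sketch. It assumes that "percent identity" at a position means the percentage of aligned sequences carrying the specified residue in that MSA column, with gaps counted in the denominator; the tool may define this differently. The function names `conserved_positions` and `consensus_table` and the toy alignment are hypothetical.

```python
def conserved_positions(msa, residue="C", min_percent=80):
    """Return 1-based MSA columns where `residue` reaches `min_percent`."""
    if not msa:
        return []
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    positions = []
    for column in range(width):
        matches = sum(1 for row in msa if row[column].upper() == residue)
        # Integer comparison avoids floating-point rounding at the threshold.
        if matches * 100 >= min_percent * len(msa):
            positions.append(column + 1)
    return positions


def consensus_table(msa, residue="C"):
    """Conserved positions at each level from 90% down to 10%."""
    return {pct: conserved_positions(msa, residue, pct) for pct in range(90, 0, -10)}


# Toy alignment: column 2 is always C, column 4 is C in four of five rows.
msa = ["ACDC-", "ACEC-", "ACDCA", "ACDGA", "TCDCA"]
table = consensus_table(msa)
```

For the toy alignment, only column 2 qualifies at the 90% level, while columns 2 and 4 both qualify at the 80% level.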
Nodes in the submitted SSN are colored according to neighborhood connectivity
(number of edges to other nodes).
The nodes for unresolved families can be difficult to identify in SSNs
generated with low alignment scores. Coloring the nodes according to the
number of edges to other nodes (Neighborhood Connectivity, NC) helps identify
families with highly connected nodes
(https://doi.org/10.1016/j.heliyon.2020.e05867).
Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold
can be chosen in Cytoscape to separate the SSN into families.
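As a rough illustration of the quantities involved, the Python sketch below computes, for a small hypothetical edge list, both the number of edges per node (the description given above) and the average degree of each node's neighbors. The second measure reflects how neighborhood connectivity is commonly defined in network analysis; whether the utility uses exactly that definition is an assumption here. The function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def node_degrees(adjacency):
    """Number of edges to other nodes, per node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


def neighborhood_connectivity(adjacency):
    """Average degree of each node's neighbors (assumed definition)."""
    degree = node_degrees(adjacency)
    return {
        node: sum(degree[n] for n in neighbors) / len(neighbors)
        for node, neighbors in adjacency.items()
        if neighbors
    }


# Hypothetical network: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adjacency = build_adjacency(edges)
degree = node_degrees(adjacency)
nc = neighborhood_connectivity(adjacency)
```

Either value could then be mapped to a color gradient, so that densely connected regions stand out from sparsely connected ones.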
Convergence ratio is calculated per cluster.
UniProt Version: 2024_05
InterPro Version: 102
The selected family(ies) contain more proteins than the maximum allowed
(75,000). To reduce computing time and the size of the output SSN, UniRef90
cluster ID sequences will automatically be used.
In UniRef90, sequences that share ≥90% sequence identity over 80% of the
sequence length are grouped together and represented by an accession ID known
as the cluster ID. The output SSN is equivalent to a 90% Representative Node
Network in which each node corresponds to a UniRef cluster ID and the node
attribute "UniRef90 Cluster IDs" lists all the sequences represented by that
node. UniRef90 SSNs are compatible with the Color SSN utility as well as the
EFI-GNT tool.
Press Ok to continue with UniRef90.