This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
Please cite your use of the EFI tools:
Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735
Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023. https://doi.org/10.1016/j.jmb.2023.168018
Reorganization of UniProtKB
With the current 2026_02 release, the UniProtKB database is reorganized to include an
expanded number of Reference Proteomes to better capture biodiversity.
This includes the removal of proteins from taxonomically unclassified organisms,
i.e., those without a binomial species name (genus and species). The total number of
accessions in UniProtKB has been reduced from 253,635,358 in the “legacy” 2025_03
release to 149,810,139 in the current 2026_02 release.
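As a rough check on the scale of this change, the two accession counts quoted above can be compared directly. The snippet below is plain arithmetic on those two numbers; the variable names are illustrative only.

```python
# Accession counts quoted in the text above.
LEGACY_2025_03 = 253_635_358
CURRENT_2026_02 = 149_810_139

# Number of accessions no longer present, and that number as a share
# of the legacy release.
removed = LEGACY_2025_03 - CURRENT_2026_02
fraction_removed = removed / LEGACY_2025_03

print(f"Accessions removed: {removed:,}")
print(f"Share of legacy release removed: {fraction_removed:.1%}")
```

By these numbers, roughly two fifths of the accessions in the legacy release are absent from the current one, which is why SSNs generated from the two databases can differ noticeably.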
We are providing the option to select either the “legacy” 2025_03 database or the
current UniProtKB database (now 2026_02) when generating SSNs. You can select the
database in the “Database” accordion on the pages for the EFI-EST options, the EFI-GNT
tool, and the Taxonomy Tool. We suggest that you compare the SSNs, GNNs, and GNDs
generated from both databases as you explore the information you are seeking.
Because the “legacy” 2025_03 release contains UniProt IDs that are no longer active
on the UniProt web site, we provide the Metadata Tool, which gives access to the
node attribute metadata for the UniProt IDs in the “legacy” 2025_03 release.
As the UniProt database increases in size, users may encounter difficulties in
opening and visualizing SSNs with Cytoscape (too many nodes/edges for the RAM
available on the user’s computer). As a result, low resolution SSNs (UniRef50
clusters and/or representative node SSNs) may be necessary to survey
sequence-function space in large protein families. A solution for generating
higher resolution SSNs is to restrict the input sequences to specific taxonomy
categories (within the Superkingdom, Kingdom, Phylum, Class, Order, Family,
Genus, or Species), thereby reducing the number of nodes/edges and/or allowing
the use of UniRef90 clusters or UniProt members to generate the SSN.
This Taxonomy Tool provides a preview of the taxonomy distribution of the
UniProt IDs in datasets, with three input options: Families (a list of
families); FASTA (FASTA-formatted sequences); and Accession IDs (UniProt,
UniRef90 cluster, or UniRef50 cluster IDs). These are analogous to the Families
option (Option B), FASTA option (Option C), and Accession IDs option (Option D)
of EFI-EST.
The taxonomy distribution of the UniProt IDs in the input dataset is displayed
as a "sunburst" in which the ranks of classification (Superkingdom, Kingdom,
Phylum, Class, Order, Family, Genus, Species) are displayed radially, with
Superkingdom at the center and Species in the outer ring. The sunburst is
interactive, providing the ability to zoom to a selected taxonomy category. The
numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs in the
selected category are displayed.
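The per-category counts shown on the sunburst can be thought of as distinct-ID counts at every prefix of each sequence's lineage. The following is a minimal sketch under that assumption, not the tool's actual implementation; the `Record` type, the function name, and all IDs and taxon names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Record(NamedTuple):
    uniprot_id: str
    uniref90_id: str
    uniref50_id: str
    lineage: Tuple[str, ...]  # names ordered from Superkingdom outward


def sunburst_counts(
    records: Iterable[Record],
) -> Dict[Tuple[str, ...], Tuple[int, int, int]]:
    """Count distinct UniProt, UniRef90, and UniRef50 IDs per lineage prefix."""
    uniprot: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    uniref90: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    uniref50: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for rec in records:
        # Every prefix of the lineage is one segment of the sunburst.
        for depth in range(1, len(rec.lineage) + 1):
            node = rec.lineage[:depth]
            uniprot[node].add(rec.uniprot_id)
            uniref90[node].add(rec.uniref90_id)
            uniref50[node].add(rec.uniref50_id)
    return {
        node: (len(uniprot[node]), len(uniref90[node]), len(uniref50[node]))
        for node in uniprot
    }


# Made-up example data (two ranks only, for brevity).
RECORDS = [
    Record("U1", "R90_1", "R50_1", ("Bacteria", "Phylum_X")),
    Record("U2", "R90_1", "R50_1", ("Bacteria", "Phylum_X")),
    Record("U3", "R90_2", "R50_1", ("Bacteria", "Phylum_Y")),
    Record("U4", "R90_3", "R50_2", ("Archaea", "Phylum_Z")),
]
COUNTS = sunburst_counts(RECORDS)
```

Zooming to a category then corresponds to looking up one key, for example `COUNTS[("Bacteria", "Phylum_X")]`.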
UniRef90 clusters contain sequences that share ≥90% sequence identity, so they
usually are taxonomically homogeneous. UniRef50 clusters contain sequences that
share ≥50% sequence identity, so they often are taxonomically heterogeneous. When
possible (determined by the RAM available to Cytoscape), users should generate
taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.
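To illustrate what taxonomic heterogeneity means for a cluster, the sketch below flags clusters whose members fall into more than one taxon at a chosen rank. The function name and the example IDs and taxa are hypothetical; this is an illustration of the idea, not a check performed by the tool.

```python
from typing import Dict, List


def mixed_taxonomy_clusters(
    cluster_members: Dict[str, List[str]],
    taxon_of: Dict[str, str],
) -> List[str]:
    """Return the cluster IDs whose members span more than one taxon."""
    mixed = []
    for cluster_id, members in cluster_members.items():
        taxa = {taxon_of[member] for member in members}
        if len(taxa) > 1:
            mixed.append(cluster_id)
    return sorted(mixed)


# Made-up example: one tight cluster and one looser cluster.
CLUSTER_MEMBERS = {
    "C_tight": ["U1", "U2"],
    "C_loose": ["U3", "U4", "U5"],
}
TAXON_OF = {
    "U1": "Genus_A",
    "U2": "Genus_A",
    "U3": "Genus_A",
    "U4": "Genus_B",
    "U5": "Genus_C",
}
MIXED = mixed_taxonomy_clusters(CLUSTER_MEMBERS, TAXON_OF)
```

A cluster flagged this way can contribute members from outside the selected taxonomy category, which is the practical reason to prefer UniProt or UniRef90 IDs for taxonomy-specific SSNs when memory allows.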
Files with the UniProt, UniRef90 cluster, and UniRef50 cluster IDs and
FASTA-formatted sequences at the selected taxonomy category can be downloaded.
The UniProt, UniRef90, or UniRef50 cluster IDs can be transferred to the
"Accession IDs" option (Option D) of EFI-EST to generate the SSN.
The Sequence BLAST, Families, FASTA, and Accession IDs options of EFI-EST also
include Filter by Taxonomy in both the Generate and Database Completed/Analysis
steps so that the user can select specific taxonomy categories when generating
SSNs.
Information on Pfam families and clans and InterPro family sizes is available on
the Family Information page.
The UniProt IDs for family members are identified in UniProtKB with a list of
Pfam families, InterPro families, and/or Pfam clans.
Retrieve taxonomy for FASTA files.
The input is a list of FASTA-formatted sequences in which the headers contain a
UniProt ID. The UniProt ID is required because it is used to retrieve the
taxonomy from the UniProt database (FASTA header "reading").
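Header "reading" can be sketched as a search for a UniProt accession in each header line. The pattern below is the accession format published in UniProt's documentation; how the tool itself parses headers is not stated here, so treat the function names and the search strategy as assumptions.

```python
import re
from typing import Iterable, List, Optional

# UniProt accession format (6 or 10 characters), bounded so it is not
# matched inside a longer token.
ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def accession_from_header(header: str) -> Optional[str]:
    """Return the first UniProt-style accession in a header, or None."""
    match = ACCESSION.search(header)
    return match.group(0) if match else None


def accessions_from_fasta(lines: Iterable[str]) -> List[str]:
    """Collect accessions from the header lines of FASTA-formatted text."""
    found = []
    for line in lines:
        if line.startswith(">"):
            accession = accession_from_header(line)
            if accession is not None:
                found.append(accession)
    return found


# Illustrative input; the sequences are placeholders.
EXAMPLE = [
    ">sp|P12345|EXAMPLE_ORG example description",
    "MKTAYIAK",
    ">A0A023GPI8 another example",
    "MSSHEGGK",
    ">header without an accession",
    "MAAAK",
]
IDS = accessions_from_fasta(EXAMPLE)
```

Headers without a recognizable accession are skipped in this sketch, which mirrors the requirement that a UniProt ID be present for taxonomy to be retrieved.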
The UniProt IDs for the family members are retrieved; these are used to
calculate the sunburst.
Retrieve taxonomy for accession IDs.
The input is a list of UniProt, UniRef90 cluster or UniRef50 cluster IDs. For
the UniRef90 and UniRef50 clusters, the UniProt IDs in the clusters are
retrieved using the lookup table provided by UniProt/UniRef.
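Expanding cluster IDs into their member UniProt IDs can be sketched as an inverted lookup. The table layout below (UniProt ID mapped to its UniRef90 and UniRef50 cluster IDs) is an assumption made for illustration, and all IDs are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical lookup table: UniProt ID -> cluster ID at each UniRef level.
LOOKUP: Dict[str, Dict[str, str]] = {
    "U1": {"UniRef90": "U1", "UniRef50": "U1"},
    "U2": {"UniRef90": "U1", "UniRef50": "U1"},
    "U3": {"UniRef90": "U3", "UniRef50": "U1"},
    "U4": {"UniRef90": "U4", "UniRef50": "U4"},
}


def expand_clusters(
    cluster_ids: Iterable[str],
    lookup: Dict[str, Dict[str, str]],
    level: str,
) -> List[str]:
    """Return the UniProt IDs contained in the given clusters at one level."""
    # Invert the table: cluster ID -> member UniProt IDs.
    members_by_cluster: Dict[str, List[str]] = defaultdict(list)
    for uniprot_id, clusters in lookup.items():
        members_by_cluster[clusters[level]].append(uniprot_id)

    expanded = set()
    for cluster_id in cluster_ids:
        expanded.update(members_by_cluster.get(cluster_id, []))
    return sorted(expanded)


MEMBERS_90 = expand_clusters(["U1"], LOOKUP, "UniRef90")
MEMBERS_50 = expand_clusters(["U1"], LOOKUP, "UniRef50")
```

The same cluster ID expands to more members at the UniRef50 level than at the UniRef90 level, since the looser identity threshold groups more sequences together.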
Filter by Family and/or Filter by Taxonomy can be used to remove UniProt IDs
that do not match a list of Pfam families, InterPro families, and/or Pfam clans
and/or specified taxonomy categories. This may be desirable/necessary if the
input list is obtained from 1) the Color SSN or Cluster Analysis utility for a
Families option (Option B) EFI-EST SSN or 2) the Families option of the
Taxonomy Tool.
The remaining UniProt IDs are used to generate the sunburst.
UniRef90 and UniRef50 clusters that contain the UniProt IDs are retrieved from
the UniRef90 and UniRef50 databases using the lookup table provided by
UniProt/UniRef. Clusters for which the cluster ID (representative sequence)
matches the list of families are retained.
The numbers of UniProt IDs and both UniRef90 cluster and UniRef50 cluster IDs
are displayed on the sunburst; the UniProt IDs and both UniRef90 cluster and
UniRef50 cluster IDs are available for download and/or transfer to the
Accession IDs option (Option D) of EFI-EST to generate SSNs.
If the lists of UniRef90 or UniRef50 cluster IDs are used to generate SSNs with
the Accession IDs option (Option D) of EFI-EST, the lists should (must!) be
filtered with the same list of families (Filter by Family) and any specified
taxonomy categories (Filter by Taxonomy) used to generate the lists.
This filtering removes the UniRef90 and UniRef50 clusters with cluster IDs
("representative sequences") or internal UniProt IDs that are not members of
the specified families or do not fall within the selected taxonomy categories.
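One reading of this filtering step is sketched below: a cluster is kept only if its representative passes both filters, and within a kept cluster only the members that also pass are retained. This is an interpretation of the description, not the tool's code; the function name and all IDs are hypothetical.

```python
from typing import Dict, List, Set


def filter_clusters(
    clusters: Dict[str, List[str]],
    family_members: Set[str],
    taxonomy_members: Set[str],
) -> Dict[str, List[str]]:
    """Keep clusters and members that match both family and taxonomy filters."""
    allowed = family_members & taxonomy_members
    kept: Dict[str, List[str]] = {}
    for representative, members in clusters.items():
        if representative not in allowed:
            continue  # cluster ID itself fails a filter: drop the cluster
        kept[representative] = [m for m in members if m in allowed]
    return kept


# Made-up example: cluster ID -> member UniProt IDs.
CLUSTERS = {
    "U1": ["U1", "U2", "U3"],
    "U4": ["U4", "U5"],
}
FAMILY_MEMBERS = {"U1", "U2", "U5"}
TAXONOMY_MEMBERS = {"U1", "U2", "U3", "U4", "U5"}
FILTERED = filter_clusters(CLUSTERS, FAMILY_MEMBERS, TAXONOMY_MEMBERS)
```

Applying the same family and taxonomy lists at both stages keeps the SSN consistent with the lists that were downloaded or transferred.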
UniProt Version: 2026_02
InterPro Version: 109
The number of proteins in the selected family(ies) is greater than
the maximum allowed (75,000). To reduce computing time and the size of the
output SSN, UniRef90 cluster ID sequences will automatically be used.
In UniRef90, sequences that share ≥90% sequence identity over 80% of the
sequence length are grouped together and represented by an accession ID known
as the cluster ID. The output SSN is equivalent to a 90% Representative Node
Network with each node corresponding to a UniRef cluster ID, and for which the
node attribute "UniRef90 Cluster IDs" lists all the sequences represented by a
node. UniRef90 SSNs are compatible with the Color SSN utility as well as the
EFI-GNT tool.
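The size-based fallback described above amounts to a simple threshold test. The sketch below uses the 75,000 limit quoted in the text; the function name and return values are illustrative, not part of the tool.

```python
# Maximum family size quoted in the text above.
MAX_PROTEINS = 75_000


def sequence_set_for(n_proteins: int, max_allowed: int = MAX_PROTEINS) -> str:
    """Pick the sequence set: UniRef90 cluster IDs once the limit is exceeded."""
    return "UniRef90" if n_proteins > max_allowed else "UniProt"


SMALL = sequence_set_for(10_000)
AT_LIMIT = sequence_set_for(75_000)
LARGE = sequence_set_for(200_000)
```

A family exactly at the limit is not "greater than the maximum allowed", so this sketch leaves it on UniProt sequences.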