EFI - Taxonomy Tool
This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
As the UniProt database increases in size, users may have difficulty opening
and visualizing SSNs with Cytoscape because the networks contain too many
nodes/edges for the RAM available on the user’s computer. As a result,
low-resolution SSNs (UniRef50 clusters and/or representative node SSNs) may be
necessary to survey sequence-function space in large protein families. One way
to generate higher-resolution SSNs is to restrict the input sequences to
specific taxonomy categories (at the Superkingdom, Kingdom, Phylum, Class,
Order, Family, Genus, or Species rank), thereby reducing the number of
nodes/edges and/or allowing the use of UniRef90 clusters or UniProt members to
generate the SSN.
This Taxonomy Tool provides a preview of the taxonomy distribution of the
UniProt IDs in a dataset. It accepts three input options: Families (a list of
families); FASTA (FASTA-formatted sequences); and Accession IDs (UniProt,
UniRef90 cluster, or UniRef50 cluster IDs). These are analogous to the Families
option (Option B), FASTA option (Option C), and Accession IDs option (Option D)
of EFI-EST.
The taxonomy distribution of the UniProt IDs in the input dataset is displayed
as a "sunburst" in which the ranks of classification (Superkingdom, Kingdom,
Phylum, Class, Order, Family, Genus, Species) are displayed radially, with
Superkingdom at the center and Species in the outer ring. The sunburst is
interactive, providing the ability to zoom to a selected taxonomy category. The
numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs in the
selected category are displayed.
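The counts shown for each sunburst category can be pictured as a tally over lineage prefixes. The following is a minimal sketch, not the tool's actual implementation: the IDs, the three-rank lineages, and the function names `sunburst_counts` and `children_of` are all hypothetical, chosen only to illustrate how zooming to a category exposes the counts of its child categories.

```python
from collections import Counter

def sunburst_counts(lineages):
    """Count IDs under every taxonomy category (every prefix of a lineage)."""
    counts = Counter()
    for lineage in lineages.values():
        for depth in range(1, len(lineage) + 1):
            counts[tuple(lineage[:depth])] += 1
    return counts

def children_of(counts, selected):
    """Return the child categories of `selected` with their ID counts (a 'zoom')."""
    selected = tuple(selected)
    depth = len(selected)
    return {
        path[-1]: n
        for path, n in counts.items()
        if len(path) == depth + 1 and path[:depth] == selected
    }

# Hypothetical toy data: ID -> lineage, ordered from the center of the
# sunburst outward and truncated to three ranks for brevity.
lineages = {
    "ID1": ("Bacteria", "Pseudomonadota", "Gammaproteobacteria"),
    "ID2": ("Bacteria", "Pseudomonadota", "Alphaproteobacteria"),
    "ID3": ("Bacteria", "Bacillota", "Bacilli"),
    "ID4": ("Archaea", "Euryarchaeota", "Methanococci"),
}

counts = sunburst_counts(lineages)
print(counts[("Bacteria",)])
print(children_of(counts, ("Bacteria",)))
```

The same tally could be run separately over UniProt, UniRef90, and UniRef50 ID sets to obtain the three numbers reported for a selected category.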
UniRef90 clusters contain sequences that share ≥90% sequence identity, so they
are usually taxonomically homogeneous. UniRef50 clusters contain sequences that
share ≥50% sequence identity, so they are often taxonomically heterogeneous.
When possible (as determined by the RAM available to Cytoscape), users should
generate taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.
Files with the UniProt, UniRef90 cluster, and UniRef50 cluster IDs and
FASTA-formatted sequences at the selected taxonomy category can be downloaded.
The UniProt, UniRef90, or UniRef50 cluster IDs can be transferred to the
"Accession IDs" option (Option D) of EFI-EST to generate the SSN.
The Sequence BLAST, Families, FASTA, and Accession IDs options of EFI-EST also
include Filter by Taxonomy in both the Generate and Database Completed/Analysis
steps so that the user can select specific taxonomy categories when generating
the SSN.
Information on Pfam families and clans and InterPro family sizes is available on
the Family Information page.
Retrieve taxonomy for families.
The UniProt IDs for family members are identified in UniProtKB with a list of
Pfam families, InterPro families, and/or Pfam clans.
Retrieve taxonomy for FASTA files.
The input is a list of FASTA-formatted sequences in which the headers contain a
UniProt ID. The UniProt ID is required because it is used to retrieve the
taxonomy from the UniProt database (FASTA header "reading").
The UniProt IDs for the family members are retrieved; these are used to
calculate the sunburst.
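FASTA header "reading" amounts to finding a UniProt accession in each header line. The sketch below is one way to do this, assuming the accession pattern published in the UniProt help pages; the function name `ids_from_fasta`, the sample headers, and the choice to report headers without an accession are illustrative assumptions, not the tool's documented behavior.

```python
import re

# UniProt accession pattern (6 or 10 characters), wrapped in word boundaries
# so that it is not matched inside a longer token.
ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

def ids_from_fasta(text):
    """Return (accessions found, headers with no recognizable accession)."""
    ids, missing = [], []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = ACCESSION.search(line)
        if match:
            ids.append(match.group(0))
        else:
            missing.append(line)
    return ids, missing

# Hypothetical input: two headers carry an accession, one does not.
fasta = """>sp|P12345|ENTRY_ONE hypothetical description
MALLHSARVL
>tr|A0A023GPI8|ENTRY_TWO hypothetical description
MKTAYIAKQR
>seq3 unlabeled sequence
MSDNELKQ
"""

ids, missing = ids_from_fasta(fasta)
print(ids)
print(missing)
```

Headers that end up in `missing` have no UniProt ID, so no taxonomy could be looked up for those sequences.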
Retrieve taxonomy for accession IDs.
The input is a list of UniProt, UniRef90 cluster or UniRef50 cluster IDs. For
the UniRef90 and UniRef50 clusters, the UniProt IDs in the clusters are
retrieved using the lookup table provided by UniProt/UniRef.
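Expanding cluster IDs into their member UniProt IDs is a table lookup. The sketch below assumes the lookup table is available as a mapping from cluster ID to member IDs; the table contents, the IDs, and the function name `expand_to_uniprot` are hypothetical. Treating an ID that is absent from the table as a plain UniProt ID is also an assumption made for illustration.

```python
def expand_to_uniprot(input_ids, cluster_members):
    """Replace each cluster ID with its member UniProt IDs.

    IDs not present in `cluster_members` are kept unchanged (assumption).
    Duplicates are dropped and first-seen order is preserved.
    """
    seen = {}
    for input_id in input_ids:
        for member in cluster_members.get(input_id, [input_id]):
            seen.setdefault(member, None)
    return list(seen)

# Hypothetical lookup table: cluster ID (representative) -> member UniProt IDs.
uniref90_members = {
    "P00001": ["P00001", "P00002", "P00003"],
    "Q00010": ["Q00010"],
}

expanded = expand_to_uniprot(["P00001", "Q00010", "P00002", "O00099"], uniref90_members)
print(expanded)
```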
Filter by Family and/or Filter by Taxonomy can be used to remove UniProt IDs
that do not match a list of Pfam families, InterPro families, and/or Pfam clans
and/or specified taxonomy categories. This may be desirable/necessary if the
input list is obtained from 1) the Color SSN or Cluster Analysis utility for a
Families option (Option B) EFI-EST SSN or, 2) the Families option of the
The remaining UniProt IDs are used to generate the sunburst.
UniRef90 and UniRef50 clusters that contain the UniProt IDs are retrieved from
the UniRef90 and UniRef50 databases using the lookup table provided by
UniProt/UniRef. Clusters for which the cluster ID (representative sequence)
matches the list of families are retained.
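The retrieval and retention steps just described can be sketched as a reverse lookup followed by a membership test on the cluster ID. This is a hedged illustration only: the mapping `id_to_uniref90`, the set `family_member_ids`, the IDs, and both function names are hypothetical stand-ins for the lookup table and family annotations.

```python
def clusters_for_ids(uniprot_ids, id_to_cluster):
    """Collect the clusters that contain any of the given UniProt IDs."""
    return {id_to_cluster[u] for u in uniprot_ids if u in id_to_cluster}

def retain_by_family(cluster_ids, family_member_ids):
    """Keep clusters whose cluster ID (representative) is itself a family member."""
    return sorted(c for c in cluster_ids if c in family_member_ids)

# Hypothetical lookup table: UniProt ID -> UniRef90 cluster ID.
id_to_uniref90 = {
    "P00001": "P00001",
    "P00002": "P00001",
    "P00003": "Q00010",  # member of a cluster represented by Q00010
}

# Hypothetical set of UniProt IDs annotated with the requested families.
family_member_ids = {"P00001", "P00002", "P00003"}

clusters = clusters_for_ids(["P00001", "P00002", "P00003"], id_to_uniref90)
retained = retain_by_family(clusters, family_member_ids)
print(retained)
```

In this toy data the cluster represented by `Q00010` is dropped because its representative is not in the family set, even though one of its members is.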
The numbers of UniProt IDs and both UniRef90 cluster and UniRef50 cluster IDs
are displayed on the sunburst; the UniProt IDs and both UniRef90 cluster and
UniRef50 cluster IDs are available for download and/or transfer to the
Accession IDs option (Option D) of EFI-EST to generate SSNs.
If the lists of UniRef90 or UniRef50 cluster IDs are used to generate SSNs with
the Accession IDs option (Option D) of EFI-EST, the lists must be
filtered with the same list of families (Filter by Family) and any specified
taxonomy categories (Filter by Taxonomy) used to generate the lists.
This filtering removes the UniRef90 and UniRef50 clusters whose cluster IDs
("representative sequences") or internal UniProt IDs are not members of the
specified families or do not belong to the selected taxonomy categories.
UniProt Version: 2023_05
InterPro Version: 97
The selected family(ies) contain more proteins than the maximum allowed
(75,000). To reduce computing time and the size of the output SSN, UniRef90
cluster ID sequences will automatically be used. In UniRef90, sequences that
share ≥90% sequence identity over 80% of the sequence length are grouped
together and represented by an accession ID known as the cluster ID. The output
SSN is equivalent to a 90% Representative Node Network in which each node
corresponds to a UniRef cluster ID and the node attribute "UniRef90 Cluster
IDs" lists all the sequences represented by that node. UniRef90 SSNs are
compatible with the Color SSN utility as well as the EFI-GNT tool.
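The size limit can be read as a simple threshold rule. The sketch below uses the 75,000 figure quoted in the text; the function name `choose_id_type` is hypothetical, and returning "UniProt" at or below the limit is an assumption, since the text only states what happens above it.

```python
MAX_FAMILY_SIZE = 75_000  # maximum quoted in the text

def choose_id_type(n_proteins, limit=MAX_FAMILY_SIZE):
    """Pick the ID type for the SSN based on family size.

    Above the limit, UniRef90 cluster IDs are used, as the text describes.
    At or below it, this sketch assumes UniProt IDs.
    """
    return "UniRef90" if n_proteins > limit else "UniProt"

print(choose_id_type(120_000))
print(choose_id_type(75_000))
```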