EFI - Taxonomy Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

As the UniProt database grows, users may have difficulty opening and visualizing SSNs with Cytoscape because the networks contain too many nodes/edges for the RAM available on the user’s computer. As a result, low-resolution SSNs (UniRef50 clusters and/or representative node SSNs) may be necessary to survey sequence-function space in large protein families. One way to generate higher-resolution SSNs is to restrict the input sequences to specific taxonomy categories (within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, or Species ranks), thereby reducing the number of nodes/edges and/or allowing the use of UniRef90 clusters or UniProt members to generate the SSN.

This Taxonomy Tool provides a preview of the taxonomy distribution of the UniProt IDs in a dataset. It accepts three input options: Families (a list of families), FASTA (FASTA-formatted sequences), and Accession IDs (UniProt, UniRef90 cluster, or UniRef50 cluster IDs). These are analogous to the Families option (Option B), FASTA option (Option C), and Accession IDs option (Option D) of EFI-EST.

The taxonomy distribution of the UniProt IDs in the input dataset is displayed as a "sunburst" in which the ranks of classification (Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, Species) are displayed radially, with Superkingdom at the center and Species in the outer ring. The sunburst is interactive, providing the ability to zoom to a selected taxonomy category. The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs in the selected category are displayed.
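The counts behind each sunburst segment can be pictured as counts of lineage prefixes. The following is a minimal sketch, not the tool's actual implementation; the function name, the placeholder IDs, and the shortened example lineages are all assumptions made for illustration.

```python
from collections import Counter

def sunburst_counts(lineages):
    """Count how many IDs fall under every lineage prefix.

    lineages maps an ID to its lineage, listed from the center of the
    sunburst outward. Each prefix corresponds to one sunburst segment.
    """
    counts = Counter()
    for lineage in lineages.values():
        for depth in range(1, len(lineage) + 1):
            counts[tuple(lineage[:depth])] += 1
    return counts

# Hypothetical IDs with shortened, illustrative lineages.
example = {
    "ID_1": ("Bacteria", "Pseudomonadota", "Gammaproteobacteria"),
    "ID_2": ("Bacteria", "Pseudomonadota", "Alphaproteobacteria"),
    "ID_3": ("Archaea", "Euryarchaeota", "Halobacteria"),
}
counts = sunburst_counts(example)
```

Zooming to a category then amounts to looking up one prefix, for example `counts[("Bacteria", "Pseudomonadota")]`.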

UniRef90 clusters contain sequences that share ≥90% sequence identity, so they are usually taxonomically homogeneous. UniRef50 clusters contain sequences that share ≥50% sequence identity, so they are often taxonomically heterogeneous. When possible (as determined by the RAM available to Cytoscape), users should generate taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.

Files with the UniProt, UniRef90 cluster, and UniRef50 cluster IDs and FASTA-formatted sequences at the selected taxonomy category can be downloaded.

The UniProt, UniRef90, or UniRef50 cluster IDs can be transferred to the "Accession IDs" option (Option D) of EFI-EST to generate the SSN.

The Sequence BLAST, Families, FASTA, and Accession IDs options of EFI-EST also include Filter by Taxonomy in both the Generate and Database Completed/Analysis steps so that the user can select specific taxonomy categories when generating SSNs.

Retrieve taxonomy for families.

The UniProt IDs for family members are identified in UniProtKB with a list of Pfam families, InterPro families, and/or Pfam clans.

Pfam and/or InterPro Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
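The identifier formats above can be checked before a job is submitted. This is a minimal sketch that assumes only the three formats stated in the text; the function and variable names are hypothetical and are not part of the EFI tools.

```python
import re

# Patterns from the stated format: PF + 5 digits, IPR + 6 digits, CL + 4 digits.
FAMILY_PATTERNS = {
    "pfam": re.compile(r"PF\d{5}"),
    "interpro": re.compile(r"IPR\d{6}"),
    "pfam_clan": re.compile(r"CL\d{4}"),
}

def parse_family_list(text):
    """Split a comma/space separated string and classify each family ID.

    Returns (valid, invalid): valid is a list of (family_id, kind) tuples,
    invalid is a list of tokens that match none of the patterns.
    """
    valid, invalid = [], []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue
        for kind, pattern in FAMILY_PATTERNS.items():
            if pattern.fullmatch(token):
                valid.append((token, kind))
                break
        else:
            # No pattern matched this token.
            invalid.append(token)
    return valid, invalid

valid, invalid = parse_family_list("PF00155, IPR004839 CL0061 PF123")
```

Here `PF123` is reported as invalid because it has three digits rather than five.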

Filter by Taxonomy can be used to remove UniProt IDs that do not match the specified taxonomy categories.

The remaining UniProt IDs are used to generate the sunburst.

UniRef90 and UniRef50 clusters that contain the UniProt IDs are retrieved from the UniRef90 and UniRef50 databases using the lookup table provided by UniProt/UniRef. Clusters for which the cluster ID (representative sequence) matches the list of families are retained.
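The retrieval and retention step can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the lookup table is modeled as a plain dictionary from UniProt ID to cluster ID, the cluster ID is taken to be the accession of the representative sequence (as the text implies), and all names and IDs are hypothetical.

```python
def clusters_for_family(uniprot_ids, cluster_of, family_members):
    """Return cluster IDs that contain the given UniProt IDs and whose
    representative sequence is itself a member of the requested families.

    uniprot_ids: UniProt IDs remaining after filtering.
    cluster_of: dict mapping a UniProt ID to its cluster ID (assumed layout).
    family_members: set of UniProt IDs that belong to the requested families.
    """
    cluster_ids = {cluster_of[u] for u in uniprot_ids if u in cluster_of}
    # Retain only clusters whose ID (the representative) matches the families.
    return sorted(c for c in cluster_ids if c in family_members)

# Hypothetical data: ID_C is clustered under ID_X, which is outside the family.
cluster_of = {"ID_A": "ID_A", "ID_B": "ID_A", "ID_C": "ID_X"}
family_members = {"ID_A", "ID_B", "ID_C"}
retained = clusters_for_family(["ID_A", "ID_B", "ID_C"], cluster_of, family_members)
```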

The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs are displayed on the sunburst; all three sets of IDs are available for download and/or transfer to the Accession IDs option (Option D) of EFI-EST to generate SSNs.

If the lists of UniRef90 or UniRef50 cluster IDs are used to generate SSNs with the Accession IDs option (Option D) of EFI-EST, the lists must be filtered with the same list of families (Filter by Family) and any specified taxonomy categories (Filter by Taxonomy) that were used to generate the lists.

This filtering removes the UniRef90 and UniRef50 clusters with cluster IDs ("representative sequences") or internal UniProt IDs that are not members of the specified families or do not belong to the selected taxonomy categories.

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.
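The two fragment rules above can be sketched as a single pass over the clusters. This is a minimal sketch assuming fragments are being excluded; the data layout (a dictionary of clusters and a dictionary of fragment flags), the function name, and the IDs are hypothetical.

```python
def drop_fragments(clusters, is_fragment):
    """Apply the two fragment rules to a set of clusters.

    clusters: dict mapping a cluster ID (the representative sequence)
              to the list of UniProt IDs in that cluster.
    is_fragment: dict mapping a UniProt ID to True if its Sequence Status
                 is Fragment; missing IDs are treated as Complete.
    """
    kept = {}
    for cluster_id, members in clusters.items():
        if is_fragment.get(cluster_id, False):
            # Rule 1: the representative is a fragment, so drop the cluster.
            continue
        # Rule 2: keep the cluster but remove fragment members.
        kept[cluster_id] = [m for m in members if not is_fragment.get(m, False)]
    return kept

clusters = {
    "ID_A": ["ID_A", "ID_B", "ID_C"],
    "ID_D": ["ID_D", "ID_E"],
}
is_fragment = {"ID_B": True, "ID_D": True}
result = drop_fragments(clusters, is_fragment)
```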

Filter by Taxonomy

This filter is applied to the UniProt IDs after they have been identified using the list of Pfam families, InterPro families, and/or Pfam clans. The remaining UniProt IDs are used to generate the sunburst.

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the UniProt IDs in the sunburst to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The UniProt IDs can also be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a UniProt ID is retained if it matches any one of the conditions.
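Combining conditions as a union means an ID is kept when at least one condition matches. The sketch below illustrates that logic only; the representation of a condition as a (rank, name) pair, the function names, and the example IDs are assumptions, not the tool's internal format.

```python
def matches_any(lineage, conditions):
    """True if the lineage satisfies at least one (rank, name) condition."""
    return any(lineage.get(rank) == name for rank, name in conditions)

def filter_by_taxonomy(lineages, conditions):
    """Return the sorted IDs whose lineage matches any condition (union)."""
    return sorted(uid for uid, lineage in lineages.items()
                  if matches_any(lineage, conditions))

# Hypothetical IDs; each lineage maps a rank to a category name.
lineages = {
    "ID_1": {"Superkingdom": "Bacteria"},
    "ID_2": {"Superkingdom": "Archaea"},
    "ID_3": {"Superkingdom": "Eukaryota", "Kingdom": "Fungi"},
}
conditions = [("Superkingdom", "Archaea"), ("Kingdom", "Fungi")]
selected = filter_by_taxonomy(lineages, conditions)
```

With these two conditions, the archaeal and fungal IDs are both retained even though neither satisfies both conditions at once.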

Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Retrieve taxonomy for FASTA files.

The input is a list of FASTA-formatted sequences in which the headers contain a UniProt ID. The UniProt ID is required because it is used to retrieve the taxonomy from the UniProt database (FASTA header "reading").
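Header "reading" can be illustrated with a short parser. This is a sketch of one possible approach, not a description of how the tool parses headers: it searches each header line for the first token that matches the accession-number pattern published by UniProt, and the function name and example header are hypothetical.

```python
import re

# Accession-number pattern as published in the UniProt documentation.
ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

def uniprot_ids_from_fasta(text):
    """Return one entry per FASTA header: the first accession found, or None."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            match = ACCESSION.search(line)
            ids.append(match.group(0) if match else None)
    return ids

fasta = """>sp|P12345|EXAMPLE_ENTRY hypothetical description
MKTAYIAKQR
>my_sequence without an accession
MSTNPKPQRK
"""
ids = uniprot_ids_from_fasta(fasta)
```

A header that yields `None` has no recognizable UniProt ID, so no taxonomy could be retrieved for that sequence.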

The UniProt IDs in the headers are retrieved; these are used to generate the sunburst.

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Taxonomy

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the UniProt IDs in the sunburst to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The UniProt IDs can also be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a UniProt ID is retained if it matches any one of the conditions.

Filter by Family

Input a list of Pfam families, InterPro families, and/or Pfam clans to restrict the UniProt IDs in the sunburst to these families.
Family(s):
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Retrieve taxonomy for accession IDs.

The input is a list of UniProt, UniRef90 cluster or UniRef50 cluster IDs. For the UniRef90 and UniRef50 clusters, the UniProt IDs in the clusters are retrieved using the lookup table provided by UniProt/UniRef.
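Expanding cluster IDs into their member UniProt IDs can be sketched as a scan of the lookup table. The three-column row layout used here (UniProt ID, UniRef90 cluster ID, UniRef50 cluster ID) is an assumption for illustration, as are the function name and the IDs; the actual table format is not described in this document.

```python
def expand_to_uniprot(input_ids, id_type, lookup_rows):
    """Return the sorted UniProt IDs covered by the input IDs.

    input_ids: UniProt, UniRef90 cluster, or UniRef50 cluster IDs.
    id_type: "uniprot", "uniref90", or "uniref50".
    lookup_rows: iterable of (uniprot_id, uniref90_id, uniref50_id) tuples
                 (assumed layout of the lookup table).
    """
    column = {"uniprot": 0, "uniref90": 1, "uniref50": 2}[id_type]
    wanted = set(input_ids)
    return sorted({row[0] for row in lookup_rows if row[column] in wanted})

# Hypothetical lookup table.
lookup_rows = [
    ("ID_A", "ID_A", "ID_A"),
    ("ID_B", "ID_A", "ID_A"),
    ("ID_C", "ID_C", "ID_A"),
    ("ID_D", "ID_D", "ID_D"),
]
from_uniref90 = expand_to_uniprot(["ID_A"], "uniref90", lookup_rows)
from_uniref50 = expand_to_uniprot(["ID_A"], "uniref50", lookup_rows)
```

The same cluster ID expands to more members at the UniRef50 level than at the UniRef90 level, which reflects the looser identity threshold.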

Filter by Family and/or Filter by Taxonomy can be used to remove UniProt IDs that do not match a list of Pfam families, InterPro families, and/or Pfam clans and/or specified taxonomy categories. This may be desirable or necessary if the input list is obtained from (1) the Color SSN or Cluster Analysis utility for a Families option (Option B) EFI-EST SSN or (2) the Families option of the Taxonomy Tool.

The remaining UniProt IDs are used to generate the sunburst.

UniRef90 and UniRef50 clusters that contain the UniProt IDs are retrieved from the UniRef90 and UniRef50 databases using the lookup table provided by UniProt/UniRef. Clusters for which the cluster ID (representative sequence) matches the list of families are retained.

The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs are displayed on the sunburst; all three sets of IDs are available for download and/or transfer to the Accession IDs option (Option D) of EFI-EST to generate SSNs.

If the lists of UniRef90 or UniRef50 cluster IDs are used to generate SSNs with the Accession IDs option (Option D) of EFI-EST, the lists must be filtered with the same list of families (Filter by Family) and any specified taxonomy categories (Filter by Taxonomy) that were used to generate the lists.

This filtering removes the UniRef90 and UniRef50 clusters with cluster IDs ("representative sequences") or internal UniProt IDs that are not members of the specified families or do not belong to the selected taxonomy categories.

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Accession IDs:
Accession ID File: ?
Input accession IDs are: ?

Fragment Option

UniProt designates a Sequence Status for each member: Complete if the encoding DNA sequence has both start and stop codons; Fragment if the start and/or stop codon is missing. Approximately 10% of the entries in UniProt are fragments.

For the UniRef90 and UniRef50 databases, clusters are excluded if the cluster ID ("representative sequence") is a fragment.

UniProt IDs in UniRef90 and UniRef50 clusters with complete cluster IDs are removed from the clusters if they are fragments.

Filter by Taxonomy

This filter is applied to the UniProt IDs identified in the input dataset.

From preselected conditions, the user can select "Bacteria, Archaea, Fungi", "Eukaryota, no Fungi", "Fungi", "Viruses", "Bacteria", "Eukaryota", or "Archaea" to restrict the UniProt IDs in the sunburst to these taxonomy groups.

"Bacteria, Archaea, Fungi", "Bacteria", "Archaea", and "Fungi" select organisms that may provide genome context (gene clusters/operons) useful for inferring functions.

The UniProt IDs can also be restricted to taxonomy categories within the Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, and Species ranks. Multiple conditions are combined as a union: a UniProt ID is retained if it matches any one of the conditions.

Filter by Family

This filter is applied to the UniProt IDs identified in the input dataset.

Input a list of Pfam families, InterPro families, and/or Pfam clans to restrict the UniProt IDs in the sunburst to these families.
Family(s):
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

UniProt Version: 2022_04
InterPro Version: 91
