EFI - Taxonomy Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

As the UniProt database grows, users may have difficulty opening and visualizing SSNs with Cytoscape because the SSNs contain too many nodes and edges for the RAM available on the user’s computer. Also, low-resolution SSNs (UniRef50 clusters and/or representative node SSNs) may be necessary to survey sequence-function space. A solution is to restrict the input sequences to specific taxonomic categories (superkingdom, kingdom, phylum, class, order, family, genus, species). Options A, B, C, and D include a “Filter by Taxonomy” option so that the user can select specific taxonomic categories to include in their SSNs.

This Taxonomy Tool provides a preview of the taxonomic distribution of user-provided sequences (Option B, families; Option C, FASTA files; Option D, accession IDs).

The taxonomic distribution of the user-provided sequences is displayed as a "sunburst" in which the levels of classification (superkingdom, kingdom, phylum, class, order, family, genus, species) are displayed radially, with superkingdom at the center and species in the outermost ring. The sunburst is interactive, providing the ability to zoom to a selected taxonomic level. The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs at the selected level are displayed.
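As a rough illustration of how the counts behind such a sunburst could be tallied, the sketch below counts sequences at every level of the hierarchy. It is not the tool's implementation: the function name `count_by_level`, the dictionary-based lineage format, and the "unassigned" placeholder for a missing rank are all assumptions made for this example.

```python
from collections import Counter

# Ranks in the order described above, center of the sunburst first.
RANKS = ["superkingdom", "kingdom", "phylum", "class",
         "order", "family", "genus", "species"]

def count_by_level(lineages):
    """Count sequences under each lineage prefix (hypothetical helper).

    Each lineage is a dict mapping rank -> taxon name. A missing rank is
    filled with the placeholder "unassigned" (an assumption of this sketch).
    """
    counts = Counter()
    for lineage in lineages:
        path = []
        for rank in RANKS:
            path.append(lineage.get(rank, "unassigned"))
            # One key per ring segment: the path from the center outward.
            counts[tuple(path)] += 1
    return counts

example = [
    {"superkingdom": "Bacteria", "phylum": "Proteobacteria"},
    {"superkingdom": "Bacteria", "phylum": "Firmicutes"},
]
counts = count_by_level(example)
bacteria_total = counts[("Bacteria",)]
```

Zooming to a taxonomic level then corresponds to looking up the count stored under that level's path.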

UniRef90 clusters contain sequences that share ≥90% sequence identity, so they are usually taxonomically homogeneous. UniRef50 clusters, however, contain sequences that share ≥50% sequence identity, so they are often taxonomically heterogeneous. When possible (as determined by the RAM available to Cytoscape), users should generate taxonomy-specific SSNs with UniProt IDs or UniRef90 cluster IDs.

Retrieve taxonomy for families.

The UniProt sequences from user-specified Pfam families, InterPro families/domains, and/or Pfam clans are retrieved.

The taxonomic distribution of the UniProt IDs is displayed as a "sunburst" in which the levels of classification (superkingdom, kingdom, phylum, class, order, family, genus, species) are displayed radially, with superkingdom at the center and species in the outermost ring. The sunburst is interactive, providing the ability to zoom to a selected taxonomic level. The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs at the selected taxonomic level are provided.

The UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs as well as FASTA-formatted sequences at the selected level can be downloaded.

The UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs can be transferred to EFI-EST to generate an SSN and/or to the Retrieve Neighborhood Diagrams/Sequence ID Lookup option of EFI-GNT to generate genome neighborhood diagrams (GNDs).

Pfam and/or InterPro Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
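The input format above can be checked before submission with a short script. This is a minimal sketch, assuming identifiers are case-sensitive and written exactly as stated (PF plus five digits, IPR plus six digits, CL plus four digits); the function name `parse_family_input` is hypothetical and not part of the EFI tools.

```python
import re

# Patterns follow the formats stated above; case sensitivity is an assumption.
FAMILY_PATTERNS = {
    "Pfam": re.compile(r"PF\d{5}"),
    "InterPro": re.compile(r"IPR\d{6}"),
    "Pfam clan": re.compile(r"CL\d{4}"),
}

def parse_family_input(text):
    """Split a comma/space separated string and classify each token.

    Returns (valid, invalid): valid is a list of (identifier, kind) pairs,
    invalid is a list of tokens that match none of the formats.
    """
    valid, invalid = [], []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue  # skip empty pieces from leading or trailing separators
        for kind, pattern in FAMILY_PATTERNS.items():
            if pattern.fullmatch(token):
                valid.append((token, kind))
                break
        else:
            invalid.append(token)
    return valid, invalid

valid, invalid = parse_family_input("PF00001, IPR000001 CL0001 PF123")
```

In this example `PF123` ends up in `invalid` because it has only three digits.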

Filter by Taxonomy

Conditions on the taxonomy can be set to further restrict the set of sequences so that only sequences matching the specified taxonomic categories are included. Multiple conditions are combined as a union: a sequence is included if it matches any one of the conditions.
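The way multiple conditions combine can be sketched as follows. This is an illustration only: the sequence IDs are placeholders, and the (rank, name) pair format and the function name `matches_any` are assumptions made for this example, not the tool's internal representation.

```python
def matches_any(lineage, conditions):
    """Return True if the lineage satisfies at least one (rank, name) condition."""
    return any(lineage.get(rank) == name for rank, name in conditions)

# Placeholder IDs with illustrative lineages.
sequences = {
    "ID_A": {"superkingdom": "Bacteria", "phylum": "Proteobacteria"},
    "ID_B": {"superkingdom": "Archaea", "phylum": "Euryarchaeota"},
    "ID_C": {"superkingdom": "Eukaryota", "phylum": "Chordata"},
}

# Two conditions: keep anything in Archaea OR anything in Proteobacteria.
conditions = [("superkingdom", "Archaea"), ("phylum", "Proteobacteria")]

kept = sorted(seq_id for seq_id, lineage in sequences.items()
              if matches_any(lineage, conditions))
```

Because the conditions form a union rather than an intersection, adding a condition can only enlarge the set of retained sequences.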
Preselected conditions:

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Retrieve taxonomy for FASTA files.

The user provides a list/file of FASTA-formatted sequences in which the headers contain the UniProt ID. The UniProt ID is required because it is read from the FASTA header and used to retrieve the taxonomy from the UniProt database.
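To check in advance whether a FASTA file carries recognizable UniProt IDs, a header can be scanned as in the sketch below. It assumes headers are split on `|` and whitespace and uses the accession pattern published in the UniProt documentation; the tool's actual header-reading rules are not described here and may differ. The function name `accession_from_header` is hypothetical.

```python
import re

# UniProt accession pattern (6 or 10 characters), as given in UniProt's help pages.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def accession_from_header(header):
    """Return the first UniProt-style accession found in a FASTA header, or None."""
    for token in re.split(r"[|\s>]+", header):
        if ACCESSION.fullmatch(token):
            return token
    return None

# Illustrative headers: UniProt-style, bare accession, and one with no accession.
first = accession_from_header(">sp|P12345|AATM_RABIT Aspartate aminotransferase")
second = accession_from_header(">A0A023GPI8 example description")
third = accession_from_header(">seq1 unknown protein")
```

A header like the third one yields `None`, so no taxonomy could be looked up for that sequence. Note that the pattern is purely syntactic: any token shaped like an accession will match, whether or not it exists in UniProt.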

The taxonomic distribution of the UniProt IDs is displayed as a "sunburst" in which the levels of classification (superkingdom, kingdom, phylum, class, order, family, genus, species) are displayed radially, with superkingdom at the center and species in the outermost ring. The sunburst is interactive, providing the ability to zoom to a selected taxonomic level. The number of UniProt IDs at the selected taxonomic level is provided.

The UniProt IDs and their FASTA-formatted sequences at the selected level can be downloaded.

The UniProt IDs can be transferred to EFI-EST to generate an SSN and/or to the Retrieve Neighborhood Diagrams/Sequence ID Lookup option of EFI-GNT to generate genome neighborhood diagrams (GNDs).

Sequences:
FASTA File:

Filter by Taxonomy

Conditions on the taxonomy can be set to further restrict the set of sequences so that only sequences matching the specified taxonomic categories are included. Multiple conditions are combined as a union: a sequence is included if it matches any one of the conditions.
Preselected conditions:

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Retrieve taxonomy for accession IDs.

The user provides a list/file of UniProt IDs, UniRef90 cluster IDs, or UniRef50 cluster IDs.

UniRef90 cluster IDs and UniRef50 cluster IDs are expanded to UniProt IDs. For a curated family, the number of UniProt IDs obtained by expanding UniRef90 cluster IDs may be larger than the number of UniProt IDs identified by protein databases, e.g., Pfam. Likewise, the numbers of UniProt IDs and UniRef90 cluster IDs obtained by expanding UniRef50 cluster IDs may both be larger than the numbers identified by protein databases. This behavior arises because 1) the UniRef90 clusters can contain divergent UniProt IDs that are not members of the family and 2) the UniRef50 clusters can contain divergent UniRef90 clusters that are not members of the family. Users should be aware of this behavior when SSNs are generated using UniProt IDs from expanded UniRef90 cluster IDs or using UniProt IDs or UniRef90 cluster IDs from expanded UniRef50 cluster IDs. This problem does not occur when UniRef90 clusters are identified using UniProt IDs or when UniRef50 clusters are identified using UniRef90 cluster IDs, i.e., the UniRef90 and UniRef50 cluster IDs identified by the Families option and Option B in EFI-EST.
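The expansion effect described above can be reproduced with a toy example. All IDs and cluster memberships below are made up for illustration, and the helper names `clusters_for` and `expand` are hypothetical; the sketch only shows why expanding cluster IDs can return more UniProt IDs than the family contains.

```python
# Toy data: a family of three UniProt IDs and two UniRef90 clusters.
family = {"U1", "U2", "U3"}
uniref90_members = {
    "R90_A": {"U1", "U2"},
    "R90_B": {"U3", "U4"},  # U4 is a divergent member outside the family
}

def clusters_for(ids, members):
    """Cluster IDs that contain at least one of the given UniProt IDs."""
    return {cluster for cluster, contents in members.items() if contents & ids}

def expand(cluster_ids, members):
    """All UniProt IDs contained in the given clusters."""
    return set().union(*(members[cluster] for cluster in cluster_ids))

clusters = clusters_for(family, uniref90_members)  # clusters identified from the family
expanded = expand(clusters, uniref90_members)      # cluster IDs expanded back to UniProt IDs
extra = expanded - family                          # IDs gained through expansion
```

Here the family has three members, but expanding its two clusters returns four UniProt IDs, because `R90_B` also contains `U4`.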

The taxonomic distribution of the UniProt IDs is displayed as a "sunburst" in which the levels of classification (superkingdom, kingdom, phylum, class, order, family, genus, species) are displayed radially, with superkingdom at the center and species in the outermost ring. The sunburst is interactive, providing the ability to zoom to a selected taxonomic level. The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs at the selected taxonomic level are provided.

The UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs as well as FASTA-formatted sequences at the selected level can be downloaded.

The UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs can be transferred to EFI-EST to generate an SSN and/or to the Retrieve Neighborhood Diagrams/Sequence ID Lookup option of EFI-GNT to generate genome neighborhood diagrams (GNDs).

Accession IDs:
Accession ID File:

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Input accession IDs are: ?

Filter by Taxonomy

Conditions on the taxonomy can be set to further restrict the set of sequences so that only sequences matching the specified taxonomic categories are included. Multiple conditions are combined as a union: a sequence is included if it matches any one of the conditions.
Preselected conditions:

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

UniProt Version: 2022_02
InterPro Version: 89

Click here to contact us for help, reporting issues, or suggestions.