EFI - Enzyme Similarity Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).

The tools are available without charge or license to both academic and commercial users.

Reorganization of UniProtKB

With the current 2026_02 release, the UniProtKB database is reorganized to include an expanded number of Reference Proteomes to better capture biodiversity. This includes the removal of proteins from taxonomically unclassified organisms, i.e., those without a binomial species name (genus and species). The total number of accessions in UniProtKB has been reduced from 253,635,358 in the “legacy” 2025_03 release to 149,810,139 in the current 2026_02 release.

We are providing the option to select either the “legacy” 2025_03 database or the current UniProtKB database (now 2026_02) when generating SSNs. You can select the database in the “Database” accordion on the pages for the EFI-EST options, the EFI-GNT tool, and the Taxonomy Tool. We suggest that you compare the SSNs, GNNs, and GNDs generated from both databases as you explore the information you are seeking.

Because the “legacy” 2025_03 release contains UniProt IDs that are no longer active on the UniProt web site, we provide the Metadata Tool, which gives access to the node attribute metadata for the UniProt IDs in that release.

Dataset Completed

Submission Name: IP91_IPR004184_NoFragments Bacteria UniRef90_NoFragments_IPR004184_Bacteria

A minimum sequence similarity threshold that specifies the sequence pairs connected by edges is needed to generate the SSN. This threshold also determines the segregation of proteins into clusters. The threshold is applied to the edges in the SSN using the alignment score, an edge attribute that measures the similarity between sequence pairs.
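The effect of the threshold on clustering can be illustrated with a minimal sketch. It assumes that an edge is kept when its alignment score is greater than or equal to the threshold and that clusters are the connected components of the remaining network; the function name, node names, and scores are hypothetical and are not taken from the EFI-EST code.

```python
from collections import defaultdict


def clusters_at_threshold(nodes, edges, threshold):
    """Return clusters (connected components) using edges with score >= threshold."""
    # Union-find: every node starts as its own cluster root.
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in edges:
        if score >= threshold:  # assumed rule: keep edges at or above the threshold
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Largest clusters first; ties broken by member names for a stable order.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Hypothetical four-sequence dataset: (node, node, alignment score).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 80), ("B", "C", 30), ("C", "D", 75)]

for threshold in (20, 50, 90):
    print(threshold, clusters_at_threshold(nodes, edges, threshold))
```

In this toy example, raising the threshold from 20 to 50 removes the weak B-C edge and splits one cluster into two; at 90 no edges remain and every sequence is a singleton.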

Dataset Summary
Taxonomy Sunburst
Dataset Analysis

The parameters for generating the initial dataset are summarized in the table.

Job Number	26146
Input Option	Accession IDs (Option D)
Job Name	IP91_IPR004184_NoFragments Bacteria UniRef90_NoFragments_IPR004184_Bacteria
Input Sequence Source	UniRef90
E-Value for SSN Edge Calculation	5
Number of IDs in Uploaded File	6,429 (6,429 UniProt ID matches and 0 unmatched)
Taxonomy Categories	Bacteria
Family Filter	IPR004184
Exclude Fragments	Yes
Total Number of Sequences in Dataset	6,418
Total Number of Edges	18,077,707
Number of Unique Sequences	6,418
Convergence Ratio	0.878
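As a worked check on the table, the sketch below assumes that the convergence ratio is the number of edges retained divided by the number of possible sequence pairs, n(n-1)/2. That definition is not stated on this page, but it reproduces the reported value: 6,418 sequences give 20,592,153 possible pairs, and 18,077,707 edges divided by that number is approximately 0.878.

```python
def convergence_ratio(num_edges, num_sequences):
    """Edges retained divided by all possible sequence pairs (assumed definition)."""
    total_pairs = num_sequences * (num_sequences - 1) // 2
    return num_edges / total_pairs


# Values taken from the parameter table above.
ratio = convergence_ratio(num_edges=18_077_707, num_sequences=6_418)
print(round(ratio, 3))  # prints 0.878
```

Under this definition, a ratio near 1 means that nearly every pair of sequences in the dataset produced an edge at the chosen E-value.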

The taxonomy distribution for the UniProt IDs in the input dataset is displayed. For UniRef90 and UniRef50 cluster datasets, these are retrieved from the lookup table provided by UniProt/UniRef.

The UniRef90 and UniRef50 clusters containing the UniProt IDs then are identified using the lookup table provided by UniProt/UniRef. These UniRef90 and UniRef50 clusters may contain UniProt IDs from other families; in addition, the UniRef90 and UniRef50 clusters in the selected taxonomy category may contain UniProt IDs from other categories. This results from conflation of UniProt IDs in UniRef90 and UniRef50 clusters that share ≥90% and ≥50% sequence identity, respectively.

The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs for the selected category are displayed.

The sunburst is interactive, providing the ability to zoom to a selected taxonomy category by clicking on that category; clicking on the center circle will zoom the display to the next highest rank.

This tab provides histograms and box plots with statistics about the sequences in the input dataset as well as the BLAST all-by-all pairwise comparisons that were computed.

The descriptions for the histograms and plots guide the choice of the values for the "Alignment Score Threshold" and the Minimum and Maximum "Sequence Length Restrictions" that are applied to the sequences and edges to generate the SSN.

Sequences as a Function of Full Length Histogram (First Step for Alignment Score Threshold Selection)

This histogram describes the length distribution for all sequences (UniProt IDs) in the input dataset.

Inspection of the histogram permits identification of fragments, single domain proteins, and multidomain fusion proteins. This histogram is used to select Minimum and Maximum "Sequence Length Restrictions" in the "SSN Finalization" tab to remove fragments, select only single domain proteins, or select multidomain proteins. The sequences in the "Sequences as a Function of Full-Length Histogram (UniRef90 Cluster IDs)" (last histogram) are used to calculate the edges.
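A minimal sketch of how Minimum and Maximum length restrictions act on a dataset is shown here. The accession names, lengths, and cutoffs are hypothetical, and treating both limits as inclusive is an assumption.

```python
def filter_by_length(lengths, min_len=None, max_len=None):
    """Keep sequences whose length lies within [min_len, max_len] (inclusive)."""
    kept = {}
    for seq_id, length in lengths.items():
        if min_len is not None and length < min_len:
            continue  # shorter than the minimum, likely a fragment
        if max_len is not None and length > max_len:
            continue  # longer than the maximum, likely a multidomain fusion
        kept[seq_id] = length
    return kept


# Hypothetical lengths: one fragment, two single domain proteins, one fusion.
lengths = {"SEQ1": 95, "SEQ2": 310, "SEQ3": 325, "SEQ4": 780}

single_domain = filter_by_length(lengths, min_len=250, max_len=400)
print(single_domain)  # prints {'SEQ2': 310, 'SEQ3': 325}
```

Setting only `min_len` removes fragments while keeping fusion proteins; setting only `max_len` does the reverse.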

Alignment Length vs Alignment Score Box Plot (Second Step for Alignment Score Threshold Selection)

This box plot shows the query-subject alignment lengths that BLAST used to calculate the alignment scores (y-axis) as a function of alignment score (x-axis).

This plot shows a monophasic increase in alignment length to a constant value for single domain proteins; this plot shows multiphasic increases in alignment length for datasets with multidomain proteins (one phase for each fusion length). The value of the "Alignment Score Threshold" for generating the SSN (entered in the "SSN Finalization" tab) should be selected (from the "Percent Identity vs Alignment Score Box Plot"; next box plot) at an alignment length ≥ the minimum length of single domain proteins in the dataset (determined by inspection of the "Sequences as a Function of Full-Length Histogram"; previous histogram). In that region, the "Alignment Length" should be independent of the "Alignment Score".

Percent Identity vs Alignment Score Box Plot (Third Step for Alignment Score Threshold Selection)

This box plot describes the pairwise percent sequence identity as a function of alignment score.

Complementing the "Alignment Length vs Alignment Score Box Plot" (previous box plot), this box plot describes a monophasic increase in sequence identity for single domain proteins or a multiphasic increase in sequence identity for datasets with multidomain proteins (one phase for each fusion length). Where the "Alignment Length vs Alignment Score" box plot (previous box plot) shows a constant alignment length, sequence identity increases monophasically as the alignment score increases; multiphasic increases occur as the alignment score increases at additional, longer constant alignment lengths.

For the initial SSN, we recommend that an alignment score corresponding to 35 to 40% pairwise identity be entered in the "SSN Finalization" tab (for the first phase in multiphasic plots).
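Reading an alignment score off the percent identity box plot can be sketched as follows. The sketch assumes that the median of each box is the value compared with the target identity (the page does not say which statistic to use), and the scores and identities are hypothetical.

```python
from statistics import median


def score_for_identity(identity_by_score, target_identity=35.0):
    """Return the lowest alignment score whose median identity reaches the target."""
    for score in sorted(identity_by_score):
        if median(identity_by_score[score]) >= target_identity:
            return score
    return None  # no alignment score reaches the target identity


# Hypothetical box plot data: alignment score -> pairwise percent identities.
identity_by_score = {
    10: [22.0, 25.0, 27.0],
    20: [30.0, 31.0, 33.0],
    30: [34.0, 36.0, 38.0],
    40: [41.0, 44.0, 45.0],
}

print(score_for_identity(identity_by_score, target_identity=35.0))  # prints 30
```

For a multiphasic plot, the same lookup would be restricted to the alignment scores of the first phase.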

Edge Count vs Alignment Score Plot (Preview of Full SSN Size)


This plot shows the number of edges in the full SSN for the input dataset (one node for each sequence) as a function of alignment score. Moving the cursor over the plot displays the number of edges at each alignment score.
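The quantity plotted can be sketched as a cumulative count: for each candidate threshold, the number of edges whose alignment score is at least that value. The inclusive comparison is an assumption, and the scores below are hypothetical.

```python
from bisect import bisect_left


def edge_counts(scores, thresholds):
    """Map each threshold to the number of edges with score >= threshold."""
    ordered = sorted(scores)
    # bisect_left gives the index of the first score >= threshold,
    # so everything from that index onward is retained.
    return {t: len(ordered) - bisect_left(ordered, t) for t in thresholds}


# Hypothetical alignment scores for six edges.
scores = [5, 12, 12, 30, 47, 60]

print(edge_counts(scores, thresholds=[1, 12, 31, 61]))
# prints {1: 6, 12: 5, 31: 2, 61: 0}
```

The count can only stay the same or fall as the threshold rises, which is why a higher alignment score always gives a smaller full SSN.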

This plot helps determine if the full SSN generated using the initial alignment score can be opened with Cytoscape on the user’s computer. As a rough guide, SSNs with ~2M edges can be opened with 16GB RAM, ~4M edges with 32GB RAM, ~8M edges with 64GB RAM, ~15M edges with 128GB RAM, and ~30M edges with 256GB RAM.
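The rough guide above can be turned into a small lookup. This sketch treats the approximate edge counts as hard upper limits, which is a simplification; the function and table names are hypothetical.

```python
# (approximate maximum number of edges, RAM in GB), from the rough guide above.
RAM_GUIDE = [
    (2_000_000, 16),
    (4_000_000, 32),
    (8_000_000, 64),
    (15_000_000, 128),
    (30_000_000, 256),
]


def suggested_ram_gb(num_edges):
    """Return the smallest RAM tier in the guide that covers num_edges, or None."""
    for max_edges, ram_gb in RAM_GUIDE:
        if num_edges <= max_edges:
            return ram_gb
    return None  # beyond the guide; a rep node SSN is the likely alternative


print(suggested_ram_gb(1_500_000))   # prints 16
print(suggested_ram_gb(18_077_707))  # prints 256
```

The second call uses the total number of edges reported for this dataset before any alignment score threshold is applied; the full SSN at the chosen threshold will contain fewer edges.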

If the number of edges for the full SSN is too large to be opened, a representative node (rep node) SSN can be opened. In a rep node SSN, sequences are grouped into metanodes based on pairwise sequence identity (from 40 to 100% identity, in 5% intervals). The download tables on the "Download Network Files" page provide the numbers of metanodes and edges in rep node SSNs. The rep node SSNs are lower resolution than full SSNs; clusters of interest in rep node SSNs can be expanded to provide the full SSNs.

Edges as a Function of Alignment Score Histogram (Preview of SSN Diversity)

This histogram describes the number of edges calculated at each alignment score. This plot is not used to select the alignment score for the initial SSN; however, it provides an overview of the functional diversity within the input dataset.

In the histogram, edges with low alignment scores typically are those between isofunctional clusters; edges with large alignment scores typically are those connecting nodes within isofunctional clusters.

The histogram for a dataset with a single isofunctional SSN cluster is a single distribution centered at a "large" alignment score; the histogram for a dataset with many isofunctional SSN clusters will be dominated by the edges that connect the clusters, with the number of edges decreasing as the alignment score increases.

Sequences as a Function of Full Length Histogram (UniRef90 Cluster IDs)

This histogram describes the length distribution of the UniRef90 cluster ID sequences in the input dataset. Because only the cluster ID sequences are displayed, the histogram does not accurately reflect the distribution of fragments, single domain proteins, and multidomain full-length proteins in the input dataset.

Portions of these data are derived from the Universal Protein Resource (UniProt) databases.

