EFI - Enzyme Similarity Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
Reorganization of UniProtKB

With the current 2026_02 release, the UniProtKB database is reorganized to include an expanded number of Reference Proteomes to better capture biodiversity. This includes the removal of proteins from taxonomically unclassified organisms, i.e., those without a binomial species name (genus and species). The total number of accessions in UniProtKB has been reduced from 253,635,358 in the “legacy” 2025_03 release to 149,810,139 in the current 2026_02 release.

We are providing the option to select either the “legacy” 2025_03 database or the current UniProtKB database (now 2026_02) when generating SSNs. You can select the database in the “Database” accordion on the pages for the EFI-EST options, the EFI-GNT tool, and the Taxonomy Tool. We suggest that you compare the SSNs, GNNs, and GNDs generated from both databases as you explore the information you are seeking.

Because the “legacy” 2025_03 release contains UniProt IDs that are no longer active on the UniProt web site, we offer the Metadata Tool, which gives access to the node attribute metadata for the UniProt IDs in the “legacy” 2025_03 release.

EFI-EST and Cytoscape Tutorials

Network File Download

The network file download page includes three tables.

The first table summarizes the input you chose and is intended for record keeping.

Summary of input for SSN generation

The other two tables list, for each network, a download link, the representative node %ID, the number of nodes, the number of edges, and the file size.

Download of SSNs

The top table contains the "full" network created at your specified alignment score threshold. By default, this network contains all of the sequences/nodes in your input sequence set. However, this frequently results in very large files (~500 MB or larger) that will open and/or run very slowly, or not at all, on most laptop/desktop computers. As a very rough guide, Cytoscape networks with a few thousand nodes (protein sequences) and fewer than ~500,000 edges can be viewed, although this will depend on your computer. View the "full" network whenever possible, because it provides access to annotation information for each node in your dataset. Full networks with more than 10 million edges will not be generated.

In cases where the full network file is too large to open, the bottom table provides the ability to download “representative node” networks. In a representative node (rep node) network, sequences that share at least a specified %ID are grouped into the same node using a program called CD-HIT (4, 5). For example, 90% ID rep node means that each node in the network will contain sequences that share ≥ 90% identity over ANY length of their amino acid sequences. The edges are drawn as for a full network, except that the longest sequence in each rep node is used to determine the alignment score between rep nodes. For example, if your specified alignment score for the network output was 28, then edges are drawn only between rep nodes whose representative sequences share an alignment score of 28 or larger. Rep node networks are automatically calculated at 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, and 100% sequence identity to ensure that you will be able to open one or more of the networks on your computer. The number of sequences contained within each rep node, as well as the UniProt IDs for those sequences, can be viewed in the Cytoscape node attributes panel.
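The edge rule for rep node networks can be illustrated with a minimal Python sketch. It is not the EFI-EST implementation: the function name, the cluster labels, the sequence IDs, and the toy scores are all hypothetical, and the grouping that CD-HIT performs is assumed to have been done already. The sketch only shows how the longest sequence in each rep node stands in for the whole node when edges are decided.

```python
from itertools import combinations


def rep_node_edges(clusters, pair_score, threshold):
    """Return (edges, representatives) for a set of rep nodes.

    clusters   -- dict: rep-node label -> {sequence ID: amino acid sequence}
    pair_score -- hypothetical function (id_a, id_b) -> alignment score
    threshold  -- minimum alignment score required to draw an edge
    """
    # The representative of each rep node is its longest sequence.
    reps = {
        label: max(members, key=lambda seq_id: len(members[seq_id]))
        for label, members in clusters.items()
    }
    edges = []
    for a, b in combinations(sorted(reps), 2):
        score = pair_score(reps[a], reps[b])
        if score >= threshold:  # "that alignment score or larger"
            edges.append((a, b, score))
    return edges, reps


# Made-up clusters and scores, for illustration only.
clusters = {
    "node1": {"SEQ_A": "MKTAYIAKQR", "SEQ_B": "MKTAYI"},
    "node2": {"SEQ_C": "MSLLTEVETPIR"},
    "node3": {"SEQ_D": "MAGW", "SEQ_E": "MAGWNAYIDNL"},
}
toy_scores = {
    frozenset(("SEQ_A", "SEQ_C")): 35,
    frozenset(("SEQ_A", "SEQ_E")): 12,
    frozenset(("SEQ_C", "SEQ_E")): 28,
}


def lookup_score(id_a, id_b):
    # Stand-in for a real pairwise alignment score.
    return toy_scores.get(frozenset((id_a, id_b)), 0)


edges, reps = rep_node_edges(clusters, lookup_score, threshold=28)
print(reps)
print(edges)
```

With a threshold of 28, the pair scoring 12 gets no edge, while the pairs scoring 35 and 28 are connected. The shorter sequences SEQ_B and SEQ_D never take part in the comparison.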

Downloaded files are in the xgmml format and can be imported and viewed in Cytoscape by choosing File → Import → Network and selecting an xgmml file once you have started the Cytoscape program. For more information on using Cytoscape, please see the tutorials here.
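Before opening a downloaded file in Cytoscape, it can help to compare its node and edge counts with the rough guide given above. The following Python sketch is not part of the EFI tools. It assumes that the file is plain XML with one node element per node and one edge element per edge. The function name and the tiny sample network are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def count_nodes_and_edges(xgmml_path):
    """Stream through an XGMML file and count its node and edge elements."""
    nodes = edges = 0
    for _event, elem in ET.iterparse(xgmml_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "node":
            nodes += 1
            elem.clear()  # release attribute data to keep memory use low
        elif tag == "edge":
            edges += 1
            elem.clear()
    return nodes, edges


# Tiny made-up network so the sketch runs without a real download.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<graph label="demo" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="A" label="A"/>
  <node id="B" label="B"/>
  <node id="C" label="C"/>
  <edge id="A,B" label="A,B" source="A" target="B"/>
  <edge id="B,C" label="B,C" source="B" target="C"/>
</graph>
"""

with tempfile.NamedTemporaryFile(
    "w", suffix=".xgmml", delete=False, encoding="utf-8"
) as handle:
    handle.write(sample)
    demo_path = handle.name

demo_nodes, demo_edges = count_nodes_and_edges(demo_path)
os.remove(demo_path)

# ~500,000 edges is the very rough guide quoted on this page, not a hard limit.
print(demo_nodes, demo_edges, demo_edges < 500_000)
```

For a real file, pass the path of the downloaded .xgmml file to `count_nodes_and_edges`. The same counts are also listed in the download tables.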

Click here to contact us for help, reporting issues, or suggestions.