EFI - Chemically-Guided Functional Profiling

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.

Important Notice

The UniProtKB database used by the EFI tools is undergoing a major reorganization, starting with the just-released version 2025_04 (https://www.uniprot.org/help/refprot_only_changes). When the reorganization is fully implemented (2026_02 release, Spring 2026), the number of proteins in UniProtKB will have decreased from ~253M accessions in the 2025_03 release to ~141M accessions in the 2026_02 release.

In response to these changes, we will provide the previous 2025_03 release until the 2026_02 release is available.

The current 2025_04 release removed 82M UniProt IDs; the UniProt pages providing functional annotation for these IDs are no longer active. A new Metadata Tool provides access to the node attribute metadata for all UniProt IDs in the 2025_03 release, which the tools continue to use during the UniProtKB reorganization. The tool is available from the tab at the top of each page.

More information about the reorganization is located here.

EFI-CGFP Web Tool Overview

The input for EFI-CGFP is a colored sequence similarity network (SSN) for a protein family. We recommend SSNs that segregate the protein family into functional clusters, each representing a different function; one indication is that the curated SwissProt-annotated entries are separated into different clusters.

To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar with both EFI-EST (https://efi.igb.illinois.edu/efi-est/), which generates SSNs for protein families, and Cytoscape (http://www.cytoscape.org/), which is used to visualize, analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool (https://efi.igb.illinois.edu/efi-gnt/), which colors SSNs and collects, analyzes, and represents genome neighborhoods for bacterial and fungal sequences in SSN clusters.

EFI-CGFP uses the ShortBRED algorithm described by Huttenhower and colleagues in two successive steps: 1) identify sequence markers that are unique to groups of sequences in the input SSN that share 85% sequence identity, as grouped by ShortBRED using the CD-HIT algorithm (CD-HIT 85 clusters), and 2) quantify the abundances of these markers in metagenome datasets and map the abundances to the SSN clusters.
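
The mapping at the end of step 2 can be pictured with a minimal Python sketch. It does not call ShortBRED or CD-HIT; the function name, the identifiers, and the abundance values are hypothetical, and summing the CD-HIT 85 cluster abundances within an SSN cluster is an assumption made for illustration, not a description of how EFI-CGFP aggregates values.

```python
from collections import defaultdict


def map_to_ssn_clusters(marker_abundance, cdhit_to_ssn):
    """Aggregate per-CD-HIT-85-cluster abundances into SSN clusters.

    marker_abundance: dict of CD-HIT 85 cluster id -> abundance (hypothetical values)
    cdhit_to_ssn:     dict of CD-HIT 85 cluster id -> SSN cluster number
    Summation is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for cdhit_id, abundance in marker_abundance.items():
        ssn_cluster = cdhit_to_ssn.get(cdhit_id)
        if ssn_cluster is None:
            # Skip markers whose CD-HIT 85 cluster is not in the SSN mapping.
            continue
        totals[ssn_cluster] += abundance
    return dict(totals)


# Hypothetical example data for one metagenome sample.
marker_abundance = {"cdhit_A": 1.5, "cdhit_B": 0.5, "cdhit_C": 2.0}
cdhit_to_ssn = {"cdhit_A": 1, "cdhit_B": 1, "cdhit_C": 2}

cluster_abundance = map_to_ssn_clusters(marker_abundance, cdhit_to_ssn)
print(cluster_abundance)
```

In this sketch, CD-HIT 85 clusters A and B both belong to SSN cluster 1, so their abundances are combined into a single value for that cluster.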

EFI-CGFP provides heat maps that make it easy to identify the clusters that have metagenome hits and that report metagenome abundance (hits per microbial genome in the metagenome sample). EFI-CGFP also outputs several additional files: tab-delimited text files (which can be opened in Excel) that provide actual and normalized values of both protein and cluster abundances, and an enriched SSN that provides a visual summary of the markers that were identified (in the CD-HIT 85 clusters) and the abundance mapping results.
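
Because the abundance files are tab-delimited text, they can also be read with a short script instead of Excel. The following Python sketch lists, for each cluster, the metagenomes in which it has a nonzero abundance, which is the same information the heat maps convey. The table layout, the column name "Cluster Number", and the metagenome names are assumptions made for this example; check the header of the actual file before adapting it.

```python
import csv
import io

# Hypothetical table contents; the real files may use different columns.
SAMPLE_TSV = (
    "Cluster Number\tmetagenome_1\tmetagenome_2\n"
    "1\t0.0\t0.25\n"
    "2\t0.0\t0.0\n"
    "3\t1.5\t0.5\n"
)


def clusters_with_hits(handle, id_column="Cluster Number"):
    """Return {cluster id: [metagenomes with abundance > 0]} from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = {}
    for row in reader:
        cluster = row[id_column]
        found = [
            name
            for name, value in row.items()
            if name != id_column and float(value) > 0
        ]
        if found:
            hits[cluster] = found
    return hits


hits = clusters_with_hits(io.StringIO(SAMPLE_TSV))
print(hits)
```

To read a downloaded file instead of the inline sample, pass an open file object, for example `open(path, newline="")`, in place of the `io.StringIO` wrapper.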
