EFI - Chemically-Guided Functional Profiling

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
Important Notice

The UniProtKB database used by the EFI tools is undergoing major reorganization starting with the 2025_04 release (https://www.uniprot.org/release-notes/forthcoming-changes). When the reorganization is fully implemented (2026_02 release, Spring 2026), the number of proteins in UniProtKB is expected to decrease from ~253M accessions in the current 2025_03 release to ~141M accessions in the 2026_02 release.

In response to these changes, we are planning to provide the current 2025_03 release until the 2026_02 release is available.

More information about the changes is located here.

EFI-CGFP Web Tool Overview

The input for EFI-CGFP is a colored sequence similarity network (SSN) for a protein family. We recommend SSNs that segregate the protein family into functional clusters, in which each cluster represents a different function. One indication of such segregation is that the curated SwissProt-annotated entries fall into different clusters.
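As a rough illustration of the recommendation above, the following sketch checks whether any cluster contains curated entries with more than one function. The node records, cluster numbers, and function labels are hypothetical and do not reflect an actual EFI file format; in practice this information would come from the SSN node table viewed in Cytoscape.

```python
from collections import defaultdict

# Hypothetical node records: (sequence ID, SSN cluster number, curated function or None)
nodes = [
    ("seq1", 1, "enzyme A"),
    ("seq2", 1, None),          # unannotated entry
    ("seq3", 2, "enzyme B"),
    ("seq4", 3, "enzyme A"),
    ("seq5", 3, "enzyme C"),
]


def mixed_function_clusters(nodes):
    """Return clusters whose curated entries carry more than one function."""
    functions = defaultdict(set)
    for _seq_id, cluster, function in nodes:
        if function is not None:
            functions[cluster].add(function)
    return {c: sorted(f) for c, f in functions.items() if len(f) > 1}


mixed = mixed_function_clusters(nodes)
print(mixed)
```

A cluster reported here mixes curated functions, which suggests that the SSN may need further segregation before it is used as input.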

To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar with both EFI-EST (https://efi.igb.illinois.edu/efi-est/), which generates SSNs for protein families, and Cytoscape (http://www.cytoscape.org/), which is used to visualize, analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool (https://efi.igb.illinois.edu/efi-gnt/), which colors SSNs and collects, analyzes, and represents genome neighborhoods for bacterial and fungal sequences in SSN clusters.

EFI-CGFP uses the ShortBRED algorithm described by Huttenhower and colleagues in two successive steps: 1) identify sequence markers that are unique to families of sequences in the input SSN, where ShortBRED defines each family by grouping sequences that share 85% sequence identity using the CD-HIT algorithm (CD-HIT 85 clusters), and 2) quantify the marker abundances in metagenome datasets and then map these abundances to the SSN clusters.
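The two steps above can be sketched with toy data. This is a simplified illustration, not the ShortBRED implementation: exact k-mer matching stands in for marker identification, summing family abundances per cluster is an assumed mapping rule, and all family names, sequences, and abundance values are hypothetical.

```python
from collections import defaultdict


def unique_kmers(families, k=4):
    """Toy stand-in for step 1: k-mers that occur in exactly one family."""
    kmers = {
        name: {seq[i:i + k] for seq in seqs for i in range(len(seq) - k + 1)}
        for name, seqs in families.items()
    }
    unique = {}
    for name, own in kmers.items():
        others = set().union(*(v for n, v in kmers.items() if n != name))
        unique[name] = own - others
    return unique


def cluster_abundance(family_abundance, family_to_cluster):
    """Toy stand-in for step 2: sum family abundances into SSN clusters."""
    totals = defaultdict(float)
    for family, abundance in family_abundance.items():
        totals[family_to_cluster[family]] += abundance
    return dict(totals)


# Hypothetical CD-HIT 85 families and their member sequences
families = {"famA": ["MKTAYIAK"], "famB": ["MKTAGGGK"], "famC": ["MSLLQWER"]}
markers = unique_kmers(families)

# Hypothetical family-to-cluster map and per-family abundances in one metagenome
family_to_cluster = {"famA": 1, "famB": 1, "famC": 2}
family_abundance = {"famA": 0.5, "famB": 0.25, "famC": 0.0}
per_cluster = cluster_abundance(family_abundance, family_to_cluster)
print(per_cluster)
```

In this toy input, the k-mer shared by famA and famB (MKTA) is excluded from both marker sets, which mirrors the requirement that markers be unique to a family.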

EFI-CGFP provides heat maps that allow easy identification of the clusters that have metagenome hits, together with measures of metagenome abundance (hits per microbial genome in the metagenome sample). EFI-CGFP also outputs several additional files: tab-delimited text files (which can be opened in Excel) that provide actual and normalized values of both protein and cluster abundances, and an enriched SSN that provides a visual summary of the markers that were identified (in the CD-HIT 85 clusters) and of the abundance mapping results.
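The tab-delimited abundance files can also be processed with a script instead of Excel. The sketch below lists the clusters that have hits in at least one metagenome. The column names and values are hypothetical placeholders; the headers in an actual EFI-CGFP output file may differ and should be checked before adapting the code.

```python
import csv
import io

# Hypothetical cluster abundance table; real files may use different headers
toy_table = (
    "Cluster Number\tSample1\tSample2\n"
    "1\t0.0\t0.12\n"
    "2\t0.0\t0.0\n"
    "3\t0.4\t0.05\n"
)


def clusters_with_hits(handle, id_column="Cluster Number"):
    """Map each cluster to the metagenome samples with non-zero abundance."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = {}
    for row in reader:
        samples = [
            name for name, value in row.items()
            if name != id_column and float(value) > 0
        ]
        if samples:
            hits[row[id_column]] = samples
    return hits


result = clusters_with_hits(io.StringIO(toy_table))
print(result)
```

To read a downloaded file, the `io.StringIO` object would be replaced with an open file handle, for example `open(path, newline="")`.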
