EFI - Chemically-Guided Functional Profiling

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).

The tools are available without charge or license to both academic and commercial users.

Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023. https://doi.org/10.1016/j.jmb.2023.168018

Important Notice

The UniProtKB database used by the EFI tools is undergoing major reorganization starting with the just-released version 2025_04 (https://www.uniprot.org/help/refprot_only_changes). When the reorganization is fully implemented (2026_02 release, Spring 2026), the number of proteins in UniProtKB will decrease from ~253M accessions in the previous 2025_03 release to ~141M accessions in the 2026_02 release.

In response to these changes, we will provide the previous 2025_03 release until the 2026_02 release is available.

The current 2025_04 release removed 82M UniProt IDs; the UniProt pages providing functional annotation for these IDs are no longer active. A new Metadata Tool provides access to the node attribute metadata for all UniProt IDs in the 2025_03 release that the tools continue to use during the UniProtKB reorganization. The Tool is available using the tab at the top of each page.

More information about the reorganization is located here.

Chemically guided functional profiling (CGFP) maps metagenome protein abundance to clusters in sequence similarity networks (SSNs) generated by the EFI-EST web tool.

EFI-CGFP uses the ShortBRED software package developed by Huttenhower and colleagues in two successive steps: 1) identify sequence markers that are unique to members of families in the input SSN, where ShortBRED defines the families with the CD-HIT algorithm as groups of sequences that share 85% sequence identity (CD-HIT 85 clusters), and 2) quantify the marker abundances in metagenome datasets and then map these to the SSN clusters.

Currently, a library of 380 metagenomes is available for analysis. The dataset originates from the Human Microbiome Project (HMP) and consists of metagenomes from healthy adult women and men from six body sites [stool, buccal mucosa (lining of cheek and mouth), supragingival plaque (tooth plaque), anterior nares (nasal cavity), tongue dorsum (surface), and posterior fornix (vagina)].

The EFI-CGFP database has been updated to use UniProt 2025_03.

Run CGFP/ShortBRED
Tutorial
Example

Upload the SSN for which you want to run CGFP/ShortBRED. The initial identify step will be performed: unique markers in the input SSN will be identified.

SSN File:

The input SSN MUST be a Colored SSN generated with the Color SSN utility of EFI-EST or the colored SSN generated by EFI-GNT. The accepted format is XGMML (or compressed XGMML as zip).
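
Before uploading, it can help to confirm that a file is readable XGMML, whether plain or zipped. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the check covers the file format only: it does not verify that the SSN was colored by EFI-EST or EFI-GNT.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def load_xgmml_bytes(data: bytes) -> bytes:
    """Return raw XGMML bytes, unpacking a zip archive if one was given."""
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        with zipfile.ZipFile(buf) as zf:
            names = [n for n in zf.namelist() if n.lower().endswith(".xgmml")]
            if not names:
                raise ValueError("zip archive contains no .xgmml file")
            return zf.read(names[0])
    return data


def count_nodes_and_edges(xgmml: bytes) -> tuple:
    """Stream through the XML and count <node> and <edge> elements."""
    nodes = edges = 0
    for _, elem in ET.iterparse(io.BytesIO(xgmml), events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "node":
            nodes += 1
            elem.clear()  # free memory; SSN files can be large
        elif tag == "edge":
            edges += 1
            elem.clear()
    return nodes, edges
```

A file that parses without error and reports a non-zero node count is at least structurally valid XGMML.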

Sequence Length Restriction Options

If the submitted SSN was generated using the UniRef90 or UniRef50 option, it is recommended to specify a minimum sequence length to eliminate fragments that may be included in UniRef clusters. A maximum length can also be specified.
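
The effect of the length restriction can be pictured with a small sketch. The function name and the dictionary input are hypothetical, and treating both bounds as inclusive is an assumption; the server-side implementation is not described here.

```python
from typing import Dict, Optional


def filter_by_length(
    seqs: Dict[str, str],
    min_len: Optional[int] = None,
    max_len: Optional[int] = None,
) -> Dict[str, str]:
    """Keep sequences whose length lies within the optional bounds.

    A bound of None means "no restriction", mirroring the form defaults.
    Bounds are treated as inclusive (an assumption).
    """
    kept = {}
    for accession, seq in seqs.items():
        n = len(seq)
        if min_len is not None and n < min_len:
            continue  # likely a fragment
        if max_len is not None and n > max_len:
            continue
        kept[accession] = seq
    return kept
```

With a minimum of 500, a 120-residue fragment would be dropped while a 600-residue sequence would be retained.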

Minimum: (default: none)

Maximum: (default: none)

Marker Identification Options

Several parameters are available for the identify step.

Reference database:

ShortBRED uses the UniProt, UniRef90 or UniRef50 databases to evaluate markers in order to eliminate those that could give false positives during the quantify step. The default database used in this process is UniRef90.

CD-HIT sequence identity (default 85%):

This is the sequence identity parameter that will be used for determining the ShortBRED consensus sequence families.

Sequence search type:

This is the search algorithm that will be used to remove false positives and identify unique markers.

E-mail address:

You will receive an email when the markers have been generated.

Multiple SSNs may be submitted, but due to resource constraints only one computation will run at any given time. Submitted jobs will be queued and executed when any running job completes.

Chemically-Guided Functional Profiling Overview

Experimental assignment of functions to uncharacterized enzymes in predicted pathways is expensive and time-consuming. Therefore, targets that are 'worth the effort' must be selected. Balskus, Huttenhower and their coworkers described 'chemically guided functional profiling' (CGFP). CGFP identifies SSN clusters that are abundant in metagenome datasets to prioritize targets for functional characterization.

EFI-CGFP Acceptable Input

The input for EFI-CGFP is a colored sequence similarity network (SSN). To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar with both EFI-EST (https://efi.igb.illinois.edu/efi-est/) to generate SSNs for protein families, and Cytoscape (http://www.cytoscape.org/) to visualize, analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool (https://efi.igb.illinois.edu/efi-gnt/) that colors SSNs, and collects, analyzes, and represents genome neighborhoods for bacterial and fungal sequences in SSN clusters.

Principle of CGFP Analysis

EFI-CGFP Output

After the "identify" step has been performed, several files are available: an SSN enhanced with the identified markers and their types as node attributes, and additional files that describe the markers and the ShortBRED families that were used to identify them.

After the "quantify" step has been performed, heatmaps summarizing the quantification of metagenome hits per SSN cluster are available. Several additional files are provided: the SSN enhanced with the metagenome hits that have been identified, and quantification results given as abundance within metagenomes, per protein and per cluster.

Recommended Reading

Rémi Zallot, Nils Oberg, John A. Gerlt, "Democratized" genomic enzymology web tools for functional assignment, Current Opinion in Chemical Biology, Volume 47, 2018, Pages 77-85, https://doi.org/10.1016/j.cbpa.2018.09.009

John A. Gerlt, Genomic enzymology: Web tools for leveraging protein family sequence–function space and genome context to discover novel functions, Biochemistry, 2017

This example recreates the CGFP analysis for the GRE family (IPR004184) as it was initially described by Levin, et al (2017; full reference below).

The SSN was generated on EFI-EST with InterPro 71 and UniProt 2018_10, with UniRef90 seed sequences. An alignment score of 300 and a minimum length filter of 500 AA were applied. As required, the resulting SSN was colored using the EFI-EST Color SSN utility prior to submission to EFI-CGFP for analysis.

Submission Summary
Quantify Results
Heatmaps and Boxplots

Submission Summary Table

Input filename	EFI-CGFP_IPR004184_quantify.xgmml.zip
Minimum sequence length	500
Identify search type	DIAMOND
Reference database	UNIREF90
CD-HIT identity for ShortBRED family definition	85
Quantify search type	USEARCH
Number of SSN clusters	261
Number of SSN singletons	446
SSN sequence source	UniRef90
Number of SSN (meta)nodes	3,842
Number of accession IDs in SSN	14,341
Number of sequences after length filter	13,923
Number of unique sequences in SSN after length filter	11,351
Number of CD-HIT ShortBRED families	2,925
Number of markers	14,541
Number of consensus sequences with hits	794

Metagenomes Submitted to Quantification Step

SRS011061: stool
SRS011090: buccal mucosa
SRS011098: supragingival plaque
SRS011126: supragingival plaque
SRS011132: anterior nares
SRS011134: stool
SRS011140: tongue dorsum
SRS011144: buccal mucosa
SRS011152: supragingival plaque
SRS011239: stool
SRS011243: tongue dorsum
SRS011247: buccal mucosa
SRS011255: supragingival plaque
SRS011263: anterior nares
SRS011269: posterior fornix
SRS011271: stool
SRS011302: stool
SRS011306: tongue dorsum
SRS011310: buccal mucosa
SRS011343: supragingival plaque
SRS011355: posterior fornix
SRS011397: anterior nares
SRS011405: stool
SRS011452: stool
SRS011529: stool
SRS011584: posterior fornix
SRS011586: stool
SRS012273: stool
SRS012279: tongue dorsum
SRS012281: buccal mucosa
SRS012285: supragingival plaque
SRS012291: anterior nares
SRS012294: posterior fornix
SRS012663: anterior nares
SRS012902: stool
SRS013155: anterior nares
SRS013158: stool
SRS013164: tongue dorsum
SRS013170: supragingival plaque
SRS013215: stool
SRS013234: tongue dorsum
SRS013239: buccal mucosa
SRS013252: supragingival plaque
SRS013269: anterior nares
SRS013476: stool
SRS013502: tongue dorsum
SRS013506: buccal mucosa
SRS013521: stool
SRS013533: supragingival plaque
SRS013542: posterior fornix
SRS013637: anterior nares
SRS013687: stool
SRS013705: tongue dorsum
SRS013711: buccal mucosa
SRS013723: supragingival plaque
SRS013800: stool
SRS013818: tongue dorsum
SRS013825: buccal mucosa
SRS013836: supragingival plaque
SRS013876: anterior nares
SRS013879: tongue dorsum
SRS013881: buccal mucosa
SRS013945: buccal mucosa
SRS013949: supragingival plaque
SRS013951: stool
SRS013956: anterior nares
SRS014124: tongue dorsum
SRS014126: buccal mucosa
SRS014235: stool
SRS014271: tongue dorsum
SRS014287: stool
SRS014313: stool
SRS014459: stool
SRS014464: anterior nares
SRS014470: tongue dorsum
SRS014472: buccal mucosa
SRS014476: supragingival plaque
SRS014494: posterior fornix
SRS014573: tongue dorsum
SRS014575: buccal mucosa
SRS014578: supragingival plaque
SRS014613: stool
SRS014629: posterior fornix
SRS014682: anterior nares
SRS014683: stool
SRS014684: tongue dorsum
SRS014686: buccal mucosa
SRS014690: supragingival plaque
SRS014888: tongue dorsum
SRS014890: buccal mucosa
SRS014894: supragingival plaque
SRS014901: anterior nares
SRS014923: stool
SRS014979: stool
SRS015038: tongue dorsum
SRS015040: buccal mucosa
SRS015044: supragingival plaque
SRS015051: anterior nares
SRS015054: posterior fornix
SRS015133: stool
SRS015154: buccal mucosa
SRS015158: supragingival plaque
SRS015168: posterior fornix
SRS015190: stool
SRS015209: tongue dorsum
SRS015215: supragingival plaque
SRS015217: stool
SRS015225: posterior fornix
SRS015264: stool
SRS015269: anterior nares
SRS015272: tongue dorsum
SRS015274: buccal mucosa
SRS015278: supragingival plaque
SRS015369: stool
SRS015374: buccal mucosa
SRS015378: supragingival plaque
SRS015395: tongue dorsum
SRS015425: posterior fornix
SRS015430: anterior nares
SRS015434: tongue dorsum
SRS015436: buccal mucosa
SRS015440: supragingival plaque
SRS015450: anterior nares
SRS015470: supragingival plaque
SRS015537: tongue dorsum
SRS015574: supragingival plaque
SRS015578: stool
SRS015640: anterior nares
SRS015644: tongue dorsum
SRS015646: buccal mucosa
SRS015650: supragingival plaque
SRS015663: stool
SRS015745: buccal mucosa
SRS015752: anterior nares
SRS015755: supragingival plaque
SRS015762: tongue dorsum
SRS015782: stool
SRS015893: tongue dorsum
SRS015895: buccal mucosa
SRS015899: supragingival plaque
SRS015921: buccal mucosa
SRS015937: anterior nares
SRS015941: tongue dorsum
SRS015947: supragingival plaque
SRS015960: stool
SRS015989: supragingival plaque
SRS015996: anterior nares
SRS016002: tongue dorsum
SRS016018: stool
SRS016033: anterior nares
SRS016037: tongue dorsum
SRS016039: buccal mucosa
SRS016043: supragingival plaque
SRS016056: stool
SRS016086: tongue dorsum
SRS016088: buccal mucosa
SRS016092: supragingival plaque
SRS016095: stool
SRS016111: posterior fornix
SRS016188: anterior nares
SRS016191: posterior fornix
SRS016196: buccal mucosa
SRS016200: supragingival plaque
SRS016203: stool
SRS016225: tongue dorsum
SRS016267: stool
SRS016292: anterior nares
SRS016297: buccal mucosa
SRS016319: tongue dorsum
SRS016331: supragingival plaque
SRS016335: stool
SRS016342: tongue dorsum
SRS016349: buccal mucosa
SRS016360: supragingival plaque
SRS016434: anterior nares
SRS016495: stool
SRS016501: tongue dorsum
SRS016503: buccal mucosa
SRS016513: anterior nares
SRS016516: posterior fornix
SRS016529: tongue dorsum
SRS016533: buccal mucosa
SRS016553: anterior nares
SRS016559: posterior fornix
SRS016569: tongue dorsum
SRS016575: supragingival plaque
SRS016581: anterior nares
SRS016585: stool
SRS016600: buccal mucosa
SRS016746: supragingival plaque
SRS016752: anterior nares
SRS016753: stool
SRS016954: stool
SRS016989: stool
SRS017013: buccal mucosa
SRS017025: supragingival plaque
SRS017044: anterior nares
SRS017080: buccal mucosa
SRS017103: stool
SRS017120: tongue dorsum
SRS017127: buccal mucosa
SRS017139: supragingival plaque
SRS017156: anterior nares
SRS017191: stool
SRS017209: tongue dorsum
SRS017215: buccal mucosa
SRS017227: supragingival plaque
SRS017244: anterior nares
SRS017247: stool
SRS017304: supragingival plaque
SRS017307: stool
SRS017433: stool
SRS017439: tongue dorsum
SRS017441: buccal mucosa
SRS017445: supragingival plaque
SRS017451: anterior nares
SRS017497: posterior fornix
SRS017511: supragingival plaque
SRS017520: posterior fornix
SRS017521: stool
SRS017533: tongue dorsum
SRS017537: buccal mucosa
SRS017687: buccal mucosa
SRS017691: supragingival plaque
SRS017697: anterior nares
SRS017700: posterior fornix
SRS017701: stool
SRS017713: tongue dorsum
SRS017808: tongue dorsum
SRS017810: buccal mucosa
SRS017814: supragingival plaque
SRS017820: anterior nares
SRS017821: stool
SRS018133: stool
SRS018145: tongue dorsum
SRS018149: buccal mucosa
SRS018157: supragingival plaque
SRS018300: tongue dorsum
SRS018312: anterior nares
SRS018329: buccal mucosa
SRS018337: supragingival plaque
SRS018351: stool
SRS018357: tongue dorsum
SRS018359: buccal mucosa
SRS018369: anterior nares
SRS018394: supragingival plaque
SRS018427: stool
SRS018439: tongue dorsum
SRS018463: anterior nares
SRS018573: supragingival plaque
SRS018575: stool
SRS018585: anterior nares
SRS018591: tongue dorsum
SRS018656: stool
SRS018661: buccal mucosa
SRS018665: supragingival plaque
SRS018671: anterior nares
SRS018739: tongue dorsum
SRS018769: posterior fornix
SRS018778: supragingival plaque
SRS018784: anterior nares
SRS018791: tongue dorsum
SRS018817: stool
SRS019215: anterior nares
SRS019219: tongue dorsum
SRS019221: buccal mucosa
SRS019225: supragingival plaque
SRS019267: stool
SRS019327: tongue dorsum
SRS019329: buccal mucosa
SRS019333: supragingival plaque
SRS019339: anterior nares
SRS019379: posterior fornix
SRS019386: anterior nares
SRS019387: supragingival plaque
SRS019389: tongue dorsum
SRS019391: buccal mucosa
SRS019397: stool
SRS019587: buccal mucosa
SRS019591: supragingival plaque
SRS019597: anterior nares
SRS019600: posterior fornix
SRS019601: stool
SRS019607: tongue dorsum
SRS019968: stool
SRS019974: tongue dorsum
SRS019976: buccal mucosa
SRS019980: supragingival plaque
SRS019986: anterior nares
SRS019989: posterior fornix
SRS020220: tongue dorsum
SRS020226: supragingival plaque
SRS020232: anterior nares
SRS020233: stool
SRS020328: stool
SRS020334: tongue dorsum
SRS020336: buccal mucosa
SRS020340: supragingival plaque
SRS020349: posterior fornix
SRS020386: anterior nares
SRS020856: tongue dorsum
SRS020858: buccal mucosa
SRS020862: supragingival plaque
SRS020868: anterior nares
SRS020869: stool
SRS022137: stool
SRS022143: tongue dorsum
SRS022145: buccal mucosa
SRS022149: supragingival plaque
SRS022158: posterior fornix
SRS022530: tongue dorsum
SRS022532: buccal mucosa
SRS022536: supragingival plaque
SRS022719: tongue dorsum
SRS022721: buccal mucosa
SRS022725: supragingival plaque
SRS022734: posterior fornix
SRS023346: stool
SRS023352: tongue dorsum
SRS023354: buccal mucosa
SRS023358: supragingival plaque
SRS042428: posterior fornix
SRS042457: buccal mucosa
SRS042643: tongue dorsum
SRS043001: stool
SRS043646: buccal mucosa
SRS043663: tongue dorsum
SRS043755: supragingival plaque
SRS044373: tongue dorsum
SRS045004: stool
SRS045049: buccal mucosa
SRS045254: buccal mucosa
SRS045262: buccal mucosa
SRS045313: supragingival plaque
SRS045713: stool
SRS046344: anterior nares
SRS047824: tongue dorsum
SRS048164: stool
SRS048719: buccal mucosa
SRS049389: tongue dorsum
SRS049712: stool
SRS049900: stool
SRS049959: stool
SRS050007: buccal mucosa
SRS050025: anterior nares
SRS050029: buccal mucosa
SRS050184: posterior fornix
SRS050244: tongue dorsum
SRS050628: buccal mucosa
SRS050752: stool
SRS051244: supragingival plaque
SRS051505: posterior fornix
SRS051613: anterior nares
SRS051941: supragingival plaque
SRS052227: tongue dorsum
SRS052330: posterior fornix
SRS052590: anterior nares
SRS052604: supragingival plaque
SRS052697: stool
SRS052876: supragingival plaque
SRS053335: stool
SRS053398: stool
SRS053437: anterior nares
SRS053854: tongue dorsum
SRS054061: anterior nares
SRS054590: stool
SRS054653: supragingival plaque
SRS054687: tongue dorsum
SRS054956: stool
SRS055118: buccal mucosa
SRS055401: supragingival plaque
SRS055426: tongue dorsum
SRS056323: tongue dorsum
SRS056695: posterior fornix
SRS057539: tongue dorsum
SRS057791: tongue dorsum
SRS057807: posterior fornix
SRS058053: supragingival plaque
SRS058213: anterior nares
SRS058808: supragingival plaque

The markers that uniquely define clusters in the submitted SSN have been quantified in the metagenomes selected for analysis.

Files are provided that contain details about the markers that were detected in the metagenomes and their abundances.

SSN and CD-HIT Files
CGFP Output (using median method)
CGFP Output (using mean method)

SSN With Quantify Results

The submitted SSN has been edited so that the markers and their abundances in the selected metagenomes are included as node attributes.

	File	Size
	SSN with quantify results (ZIP)	5 MB

CGFP Family and Marker Data

The CD-HIT ShortBRED families by cluster file contains mappings of ShortBRED families to SSN cluster number as well as a color that is assigned to each unique ShortBRED family. The ShortBRED marker data file lists the markers that were identified. Finally, the Description of selected metagenomes file provides available metadata associated with the selected metagenomes.

	File	Size
	CD-HIT ShortBRED families by cluster	<1 MB
	ShortBRED marker data	1 MB
	Description of selected metagenomes	<1 MB

The default is for ShortBRED to report the abundance of metagenome hits for CD-HIT families using the "median method." The numbers of metagenome hits identified by all of the markers for a CD-HIT consensus sequence are arranged in increasing numerical order; the value for the median marker is used as the abundance. This method assumes that the distribution of hits across the markers for a CD-HIT consensus sequence is uniform (expected if the metagenome sequencing is "deep," i.e., multiple coverage). For seed sequences with an even number of markers, the average of the two "middle" markers is used as the abundance.
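
The median rule described above can be sketched in a few lines. This is an illustration, not ShortBRED's own code; the function name and the example hit counts are made up.

```python
from statistics import median


def median_abundance(marker_hits):
    """Abundance for one consensus sequence from its per-marker hit values.

    statistics.median sorts the values and, for an even count,
    averages the two middle values, matching the rule in the text.
    """
    return median(marker_hits)


odd_case = median_abundance([6, 0, 40, 4, 5])   # sorted: 0, 4, 5, 6, 40
even_case = median_abundance([8, 2, 6, 4])      # sorted: 2, 4, 6, 8
```

Here `odd_case` is 5 (the middle marker) and `even_case` is 5.0 (the average of 4 and 6). The outlying value of 40 has no influence on the result.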

Files detailing the abundance information are available for download.

Raw Abundance Data

Raw results for the individual proteins in the SSN (Protein abundance data (median)) as well as summarized by SSN cluster (Cluster abundance data (median)) are provided. Units are in reads per kilobase of sequence per million sample reads (RPKM).
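
As a reminder of what the RPKM unit means, the definition can be written as a short function. This is a sketch of the unit itself; the parameter names are hypothetical and it is not taken from the ShortBRED source.

```python
def rpkm(hit_reads: float, sequence_length_bp: float, total_sample_reads: float) -> float:
    """Reads per kilobase of sequence per million sample reads."""
    kilobases = sequence_length_bp / 1_000
    millions_of_reads = total_sample_reads / 1_000_000
    return hit_reads / kilobases / millions_of_reads


# 50 reads hitting a 1,000 bp sequence in a sample of 10 million reads
example = rpkm(50, 1_000, 10_000_000)
```

In this example, 50 reads over 1 kilobase in a 10-million-read sample gives an RPKM of 5.0.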

	File	Size
	Protein abundance data (median)	2 MB
	Cluster abundance data (median)	1 MB

Average Genome Size-Normalized Abundance Data

Data are provided using Average Genome Size (AGS) normalization for individual proteins in the SSN as well as summarized by SSN cluster. Units have been converted from RPKM to counts per microbial genome, using AGS estimated by MicrobeCensus.
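
The text states that RPKM is converted to counts per microbial genome but does not give the formula. One way to see how such a conversion can work, assuming that copies per genome equal the ratio of gene coverage to genome coverage, is that the read length cancels and the result is RPKM multiplied by AGS (in base pairs) and divided by 10^9. The sketch below encodes that assumption; it is not confirmed to be the EFI-CGFP implementation, and the function name is hypothetical.

```python
def copies_per_genome(rpkm_value: float, ags_bp: float) -> float:
    """Convert RPKM to gene copies per microbial genome (assumed formula).

    Derivation under the stated assumption:
      gene coverage   = hit_reads * read_len / gene_len_bp
      genome coverage = total_reads * read_len / ags_bp
      ratio           = hit_reads * ags_bp / (gene_len_bp * total_reads)
      RPKM            = hit_reads * 1e9 / (gene_len_bp * total_reads)
      so ratio        = RPKM * ags_bp / 1e9
    """
    return rpkm_value * ags_bp / 1e9


# RPKM of 5.0 in a sample whose average genome size is 4 Mbp
example_copies = copies_per_genome(5.0, 4_000_000)
```

Under this assumption, an RPKM of 5.0 in a sample with a 4 Mbp average genome size corresponds to about 0.02 copies per genome, i.e. roughly one genome in fifty carries the gene.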

	File	Size
	Average genome size (AGS) normalized protein abundance data (median)	2 MB
	Average genome size (AGS) normalized cluster abundance data (median)	1 MB

In the mean method for reporting abundances, the average of the abundances identified by the markers for each CD-HIT consensus sequence is used to report abundance. This method reports the presence of "any" hit for a marker for a seed sequence. An asymmetric distribution of hits across a seed sequence with multiple markers is expected for "false positives," so the mean method should be used with caution.
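
The caution about false positives can be illustrated by comparing the two summaries on an asymmetric set of hits. The numbers are invented for illustration and the helper name is hypothetical.

```python
from statistics import mean, median


def summarize(marker_hits):
    """Return (mean, median) of per-marker hit values for one seed sequence."""
    return mean(marker_hits), median(marker_hits)


# Five markers for one seed sequence; only one marker picks up any reads,
# the asymmetric pattern the text associates with false positives.
asymmetric = summarize([0, 0, 0, 0, 30])

# Hits spread evenly across the markers, as expected for a true hit
# in a deeply sequenced metagenome.
uniform = summarize([5, 6, 6, 7, 6])
```

For the asymmetric case the mean is 6 while the median is 0, so only the mean method reports a hit. For the uniform case both methods report 6.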

Files detailing the abundance information are available for download.

Raw Abundance Data

Raw results for the individual proteins in the SSN (Protein abundance data (mean)) as well as summarized by SSN cluster (Cluster abundance data (mean)) are provided. Units are in reads per kilobase of sequence per million sample reads (RPKM).

	File	Size
	Protein abundance data (mean)	3 MB
	Cluster abundance data (mean)	1 MB

Average Genome Size-Normalized Abundance Data

	File	Size
	Average genome size (AGS) normalized protein abundance data (mean)	3 MB
	Average genome size (AGS) normalized cluster abundance data (mean)	1 MB

Heatmaps representing the quantification of sequences from SSN clusters per metagenome are available.

The y-axis lists the SSN cluster numbers for which metagenome hits were identified; the x-axis lists the metagenome datasets selected on the Identify Results page. A color scale is located on the right that displays the AGS normalized abundance of the number of gene copies for the "hit" per microbial genome in the metagenome sample.

The metagenomes are grouped according to body site so that trends/consensus across the six body sites can be easily discerned. The default heatmap is calculated using the median method to report abundances.

Cluster Heatmap and Boxplots
Singleton Heatmap and Boxplots
Combined Heatmap and Boxplots

This heatmap presents information for SSN cluster/metagenome hit pairs.

This heatmap presents information for SSN singleton/metagenome hit pairs instead of SSN cluster/metagenome hit pairs.

This heatmap combines the information obtained for SSN cluster and singleton/metagenome hit pairs.

Tools for downloading and manipulating the heatmap can be accessed by hovering and clicking above and to the right of the plot.

Several filters are available for manipulating the heatmap.

Show specific clusters: input individual cluster numbers separated by commas and/or a range of cluster numbers. Only these input clusters are displayed in the heatmap.
Abundance to display: hide any data values that are outside of the minimum and/or maximum. These hidden values appear as a zero value cell (i.e. the lowest color range).
Use mean: display the heatmap using the mean method for reporting abundances instead of the default median method.
Display hits only: show a black and white heatmap showing presence/absence of "hits" (which makes it easier to see low abundance hits).
Body Sites: checkboxes are provided for each body site in the heatmap; selecting one or more of these checkboxes will show data for those body sites only.
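
The first, second, and fourth filters above can be mimicked offline on the downloaded abundance tables. The sketch below shows one possible reading of their behavior. All function names are hypothetical, and the input syntax for ranges (for example "5-8") is an assumption about how the web form is filled in.

```python
def parse_cluster_spec(spec: str) -> list:
    """Expand a string such as "1,3,5-8" into a sorted list of cluster numbers."""
    clusters = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            clusters.update(range(int(first), int(last) + 1))
        else:
            clusters.add(int(part))
    return sorted(clusters)


def mask_outside(values, vmin=None, vmax=None) -> list:
    """Replace values outside [vmin, vmax] with 0, as hidden cells are shown."""
    masked = []
    for v in values:
        too_low = vmin is not None and v < vmin
        too_high = vmax is not None and v > vmax
        masked.append(0.0 if (too_low or too_high) else v)
    return masked


def hits_only(values) -> list:
    """Presence/absence view: True wherever any abundance was detected."""
    return [v > 0 for v in values]
```

For example, `parse_cluster_spec("1,3,5-8")` selects clusters 1, 3, 5, 6, 7, and 8, and `mask_outside([0.01, 0.5, 3.0], vmin=0.1, vmax=1.0)` keeps only the middle value.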

UniProt Version: 2025_03

This site uses the CGFP-ShortBRED programs (https://github.com/biobakery/shortbred and http://huttenhower.sph.harvard.edu/shortbred).

For more information on CGFP-ShortBRED, see

Levin, B. J., Huang, Y. Y., Peck, S. C., Wei, Y., Martínez-del Campo, A., Marks, J. A., Franzosa, E. A., Huttenhower, C., Balskus, E. P. A prominent glycyl radical enzyme in human gut microbiomes metabolizes trans-4-hydroxy-l-proline. Science 355, eaai8386 (2017). (DOI: 10.1126/science.aai8386)

For more information on ShortBRED, see

Kaminski J., Gibson M. K., Franzosa E. A., Segata N., Dantas G., Huttenhower C. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol. 2015 Dec 18;11(12):e1004557. DOI: 10.1371/journal.pcbi.1004557

These programs use data computed by MicrobeCensus.

Nayfach, S. and Pollard, K.S. Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biology 2015;16(1):51.

Portions of the metagenome data used on this site come from the Human Microbiome Project.

The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207-214 (14 June 2012). DOI: 10.1038/nature11234

The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215-221 (14 June 2012). DOI: 10.1038/nature11209

Click here to contact us for help, reporting issues, or suggestions.
