This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.
Please cite your use of the EFI tools:
Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735
Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023. https://doi.org/10.1016/j.jmb.2023.168018
RadicalSAM.org, our resource for investigating sequence-function space in the radical SAM superfamily, has been updated with sequences from the UniProt Release 2024_01 and InterPro Release 98 databases (January 24, 2024).
Chemically guided functional profiling (CGFP) maps metagenome protein abundance
to clusters in sequence similarity networks (SSNs) generated by the EFI-EST web
tool.
Currently, a library of 380 metagenomes is available for analysis. The dataset
originates from the Human Microbiome Project (HMP) and consists of metagenomes
from healthy adult women and men from six body sites [stool, buccal mucosa
(lining of cheek and mouth), supragingival plaque (tooth plaque), anterior
nares (nasal cavity), tongue dorsum (surface), and posterior fornix (vagina)].
The EFI-CGFP database has been updated to use UniProt 2024_05.
Upload the SSN for which you want to run CGFP/ShortBRED.
The initial identify step will be performed: unique markers in the input SSN will be identified.
Chemically-Guided Functional Profiling Overview
Experimental assignment of functions to uncharacterized enzymes in predicted
pathways is expensive and time-consuming. Therefore, targets that are 'worth
the effort' must be selected. Balskus, Huttenhower and their coworkers
described 'chemically guided functional profiling' (CGFP). CGFP identifies SSN
clusters that are abundant in metagenome datasets to prioritize targets for
functional characterization.
EFI-CGFP Acceptable Input
The input for EFI-CGFP is a colored sequence similarity network (SSN).
To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar
with both EFI-EST
(https://efi.igb.illinois.edu/efi-est/)
to generate SSNs for protein families, and Cytoscape
(http://www.cytoscape.org/) to visualize,
analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool
(https://efi.igb.illinois.edu/efi-gnt/)
that colors SSNs, and collects, analyzes, and represents genome neighborhoods for bacterial and fungal
sequences in SSN clusters.
Principle of CGFP Analysis
EFI-CGFP uses the ShortBRED software package developed by Huttenhower and
colleagues in two successive steps: 1) identify: ShortBRED uses the CD-HIT
algorithm to group the sequences in the input SSN into families that share
85% sequence identity (CD-HIT 85 clusters) and identifies sequence markers
that are unique to the members of each family; and 2) quantify: the marker
abundances are measured in metagenome datasets and then mapped to the SSN
clusters.
EFI-CGFP Output
After the "identify" step has been performed, several files are available. They
include an SSN enhanced with the identified markers and their
types as node attributes, and additional files that describe the markers and
the ShortBRED families that were used to identify them.
After the "quantify" step has been performed, heatmaps summarizing the
quantification of metagenome hits per SSN cluster are available. Several
additional files are provided: the SSN enhanced with the metagenome hits
that have been identified, and quantification results given as abundance within
metagenomes, per protein and per cluster.
Recommended Reading
Rémi Zallot, Nils Oberg, John A. Gerlt, "Democratized" genomic enzymology web
tools for functional assignment, Current Opinion in Chemical Biology, Volume
47, 2018, Pages 77-85,
https://doi.org/10.1016/j.cbpa.2018.09.009
John A. Gerlt, Genomic enzymology: Web tools for leveraging protein family
sequence–function space and genome context to discover novel functions,
Biochemistry, 2017
This example recreates the CGFP analysis for the glycyl radical enzyme (GRE) family (IPR004184) as it was initially described by
Levin et al. (2017; full reference below).
The SSN was generated on EFI-EST with
InterPro 71 and UniProt 2018_10, with UniRef90 seed sequences.
An alignment score of 300 and a minimum length filter of 500 AA were applied.
As required, the resulting SSN was colored using the EFI-EST Color SSN utility prior to submission
to EFI-CGFP for analysis.
The submitted SSN has been edited so that the markers and their abundances in the
selected metagenomes are included as node attributes.
File | Size
SSN with quantify results (ZIP) | 5 MB
CGFP Family and Marker Data
The CD-HIT ShortBRED families by cluster file contains mappings of ShortBRED
families to SSN cluster number as well as a color that is assigned to each unique
ShortBRED family. The ShortBRED marker data file lists the markers
that were identified. Finally, the Description of selected metagenomes file
provides available metadata associated with the selected metagenomes.
File | Size
CD-HIT ShortBRED families by cluster | <1 MB
ShortBRED marker data | 1 MB
Description of selected metagenomes | <1 MB
The default is for ShortBRED to report the abundance of metagenome hits for
CD-HIT families using the "median method." The numbers of metagenome hits
identified by all of the markers for a CD-HIT consensus sequence are arranged
in increasing numerical order; the value for the median marker is used as the
abundance. This method assumes that the distribution of hits across the markers
for a CD-HIT consensus sequence is uniform (expected if the metagenome sequencing
is "deep," i.e., multiple coverage). For seed sequences with an even number of
markers, the average of the two "middle" markers is used as the abundance.
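The median method described above can be illustrated with a minimal sketch. It assumes the per-marker hit values for one CD-HIT consensus sequence are already available as a list of numbers; the function name and the input values are hypothetical and are not part of ShortBRED or EFI-CGFP.

```python
from statistics import median


def median_abundance(marker_hits):
    """Median-method abundance for one CD-HIT consensus sequence.

    marker_hits: one hit value per marker (hypothetical input).
    statistics.median sorts the values and, for an even count,
    averages the two middle values, matching the description above.
    """
    if not marker_hits:
        # Assumption: a sequence without markers contributes no abundance.
        return 0.0
    return float(median(marker_hits))


# Odd number of markers: the middle value after sorting is used.
print(median_abundance([4.0, 1.0, 3.0]))       # 3.0
# Even number of markers: the two middle values (2.0 and 3.0) are averaged.
print(median_abundance([4.0, 1.0, 3.0, 2.0]))  # 2.5
```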
Files detailing the abundance information are available for download.
Raw Abundance Data
Raw results for the individual proteins in the SSN (Protein abundance data (median))
as well as summarized by SSN cluster (Cluster abundance data (median))
are provided. Units are in reads per kilobase of sequence per million sample reads (RPKM).
File | Size
Protein abundance data (median) | 2 MB
Cluster abundance data (median) | 1 MB
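For readers unfamiliar with the unit, the generic RPKM definition can be sketched as below. This is only the textbook formula; how ShortBRED counts reads and which sequence length it normalizes by are not described here, and the function and parameter names are hypothetical.

```python
def rpkm(mapped_reads, sequence_length_bp, total_sample_reads):
    """Reads per kilobase of sequence per million sample reads.

    All three parameters are hypothetical names for illustration:
    mapped_reads        - reads assigned to the sequence
    sequence_length_bp  - length of the sequence in base pairs
    total_sample_reads  - total number of reads in the metagenome sample
    """
    if sequence_length_bp <= 0 or total_sample_reads <= 0:
        raise ValueError("length and total reads must be positive")
    kilobases = sequence_length_bp / 1_000
    millions_of_reads = total_sample_reads / 1_000_000
    return mapped_reads / (kilobases * millions_of_reads)


# 50 reads on a 500 bp sequence in a sample of 10 million reads.
print(rpkm(50, 500, 10_000_000))  # 10.0
```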
Average Genome Size-Normalized Abundance Data
Data are provided using Average Genome Size (AGS) normalization for
individual proteins in the SSN
as well as summarized by SSN cluster.
Units have been converted from RPKM to counts per microbial genome, using AGS estimated by MicrobeCensus.
File | Size
Average genome size (AGS) normalized protein abundance data (median) | 2 MB
Average genome size (AGS) normalized cluster abundance data (median) | 1 MB
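One way to reason about the conversion from RPKM to copies per microbial genome is sketched below. The exact formula applied by EFI-CGFP is not given here; the sketch assumes that the read coverage of a gene is divided by the average per-genome coverage (total sequenced bases divided by AGS), in which case the read length cancels and the result is RPKM × AGS / 10^9. The function and parameter names are hypothetical.

```python
def copies_per_genome(rpkm_value, ags_bp):
    """Convert an RPKM value to estimated gene copies per microbial genome.

    Assumed derivation (not taken from the EFI-CGFP documentation):
      gene coverage   = reads * read_len / gene_len_bp
      genome coverage = total_reads * read_len / ags_bp
      copies/genome   = reads * ags_bp / (gene_len_bp * total_reads)
      RPKM            = reads * 1e9 / (gene_len_bp * total_reads)
    so copies/genome  = RPKM * ags_bp / 1e9.

    ags_bp: average genome size in base pairs, e.g. as estimated by
    MicrobeCensus for the metagenome sample.
    """
    if ags_bp <= 0:
        raise ValueError("average genome size must be positive")
    return rpkm_value * ags_bp / 1e9


# An RPKM of 10 in a sample with an average genome size of 3 Mbp.
print(copies_per_genome(10.0, 3_000_000))  # 0.03
```

Under this assumption, a value near 1 would mean roughly one copy of the gene per genome in the sample.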
In the mean method for reporting abundances, the average of the abundances
identified by the markers for each CD-HIT consensus sequence is used to
report abundance. This method reports the presence of "any" hit for a marker
for a seed sequence. An asymmetric distribution of hits for a seed sequence with
multiple markers is expected for "false positives," so the mean method should
be used with caution.
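The caution above can be made concrete with a small comparison. The sketch below uses made-up per-marker hit values (not real CGFP output) to show how a single marker with hits raises the mean while the median stays at zero; the function name is hypothetical.

```python
from statistics import mean, median


def summarize(marker_hits):
    """Return (mean, median) of the per-marker hit values for one sequence."""
    return float(mean(marker_hits)), float(median(marker_hits))


# Hits spread evenly across the markers: both methods agree.
even_hits = [10.0, 12.0, 11.0]
# Hits on only one of three markers, the asymmetric pattern
# expected for a false positive.
skewed_hits = [0.0, 0.0, 30.0]

print(summarize(even_hits))    # (11.0, 11.0)
print(summarize(skewed_hits))  # (10.0, 0.0)
```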
Files detailing the abundance information are available for download.
Raw Abundance Data
Raw results for the individual proteins in the SSN (Protein abundance data (mean))
as well as summarized by SSN cluster (Cluster abundance data (mean))
are provided. Units are in reads per kilobase of sequence per million sample reads (RPKM).
File | Size
Protein abundance data (mean) | 3 MB
Cluster abundance data (mean) | 1 MB
Average Genome Size-Normalized Abundance Data
Data are provided using Average Genome Size (AGS) normalization for
individual proteins in the SSN
as well as summarized by SSN cluster.
Units have been converted from RPKM to counts per microbial genome, using AGS estimated by MicrobeCensus.
File | Size
Average genome size (AGS) normalized protein abundance data (mean) | 3 MB
Average genome size (AGS) normalized cluster abundance data (mean) | 1 MB
Heatmaps representing the quantification of sequences from SSN clusters per
metagenome are available.
The y-axis lists the SSN cluster numbers for which metagenome hits were
identified; the x-axis lists the metagenome datasets selected on the Identify
Results page. A color scale on the right displays the AGS-normalized
abundance of the "hit," expressed as the number of gene copies per microbial
genome in the metagenome sample.
The metagenomes are grouped according to body site so that trends/consensus
across the six body sites can be easily discerned. The default heatmap is
calculated using the median method to report abundances.
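The grouping of heatmap columns by body site can be sketched as a simple sort. The metagenome identifiers below are hypothetical, and the order in which the body sites appear on the actual heatmap is an assumption; only the six site names come from the description of the metagenome library.

```python
# Assumed display order of the six body sites.
BODY_SITE_ORDER = [
    "stool",
    "buccal mucosa",
    "supragingival plaque",
    "anterior nares",
    "tongue dorsum",
    "posterior fornix",
]


def group_columns_by_body_site(metagenome_sites):
    """Order metagenome IDs so that samples from one body site are adjacent.

    metagenome_sites: dict mapping a metagenome ID to its body site.
    Returns the IDs sorted by body site, then by ID within a site.
    """
    rank = {site: i for i, site in enumerate(BODY_SITE_ORDER)}
    return sorted(
        metagenome_sites,
        key=lambda mg_id: (rank[metagenome_sites[mg_id]], mg_id),
    )


# Hypothetical metagenome IDs for illustration only.
sites = {"MG1": "tongue dorsum", "MG2": "stool", "MG3": "stool"}
print(group_columns_by_body_site(sites))  # ['MG2', 'MG3', 'MG1']
```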
This heatmap presents information for SSN cluster/metagenome hit pairs.
This heatmap presents information for SSN singleton/metagenome hit pairs instead of SSN cluster/metagenome hit pairs.
This heatmap combines the information obtained for SSN cluster and singleton/metagenome hit pairs.
Tools for downloading and manipulating the heatmap can be accessed by hovering over
the area above and to the right of the plot and clicking.
Several filters are available for manipulating the heatmap.
Show specific clusters: input individual cluster numbers separated by
commas and/or a range of cluster numbers. Only these input clusters are displayed
in the heatmap.
Abundance to display: hide any data values that are outside of the minimum
and/or maximum. These hidden values appear as zero-value cells (i.e., the lowest
color in the range).
Use mean: display the heatmap using the mean method for reporting abundances
instead of the default median method.
Display hits only: show a black and white heatmap showing presence/absence
of "hits" (which makes it easier to see low abundance hits).
Body Sites: checkboxes are provided for each body site in the heatmap;
selecting one or more of these checkboxes will show data for those body sites only.
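To make the behavior of the first few filters concrete, here is a minimal sketch. It is not the web tool's implementation: the function names are hypothetical, and the use of a hyphen to write a range (for example "5-8") is an assumption about the accepted input syntax.

```python
def parse_cluster_spec(spec):
    """Parse input such as '1, 3, 5-8' into a sorted list of cluster numbers.

    Assumes ranges are written with a hyphen; the real form may differ.
    """
    clusters = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                start, end = end, start
            clusters.update(range(start, end + 1))
        else:
            clusters.add(int(part))
    return sorted(clusters)


def apply_abundance_window(value, minimum=None, maximum=None):
    """Return the value to display; values outside the window are shown as zero."""
    if minimum is not None and value < minimum:
        return 0.0
    if maximum is not None and value > maximum:
        return 0.0
    return value


def is_hit(value):
    """Presence/absence view used by a 'hits only' display."""
    return value > 0


print(parse_cluster_spec("1, 3, 5-8"))           # [1, 3, 5, 6, 7, 8]
print(apply_abundance_window(0.2, minimum=0.5))  # 0.0
print(is_hit(0.001))                             # True
```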
Levin, B. J., Huang, Y. Y., Peck, S. C., Wei, Y., Martínez-del Campo, A., Marks, J. A., Franzosa, E. A., Huttenhower, C., Balskus, E. P.
A prominent glycyl radical enzyme in human gut microbiomes metabolizes trans-4-hydroxy-l-proline.
Science 355, eaai8386 (2017).
(DOI: 10.1126/science.aai8386)
For more information on ShortBRED, see
Kaminski J., Gibson M. K., Franzosa E. A., Segata N., Dantas G., Huttenhower C.
High-specificity targeted functional profiling in microbial communities with ShortBRED.
PLoS Comput Biol. 2015 Dec 18;11(12):e1004557. DOI: 10.1371/journal.pcbi.1004557
Nayfach, S. and Pollard, K.S.
Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome.
Genome Biology 2015;16(1):51.
The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome.
Nature 486, 207-214 (14 June 2012). DOI: 10.1038/nature11234
The Human Microbiome Project Consortium.
A framework for human microbiome research.
Nature 486, 215-221 (14 June 2012). DOI: 10.1038/nature11209