Due to the computationally-heavy nature of CGFP, we are granting access to the EFI-CGFP tool on a request basis.
If you wish to use EFI-CGFP, you will need a user account and must also be an approved member of the EFI-CGFP user group. If you do not have a user account, please create one, log in, and return to this page. If you have an account and are not logged in, please log in and return to this page to submit an application to become a member of the EFI-CGFP user group.
Chemically guided functional profiling (CGFP) maps metagenome protein abundance to clusters in sequence similarity networks generated by the EFI-EST web tool (https://efi.igb.illinois.edu/efi-est/).
The glycyl radical enzyme (GRE) superfamily is functionally and mechanistically diverse with many uncharacterized members. CGFP was developed to focus experimental studies for assigning novel functions to uncharacterized members of the GRE superfamily that are detected in the human gut microbiome [B. J. Levin*, Y. Y. Huang* et al. Science 355, eaai8386 (2017)]. More generally, CGFP provides a powerful approach to prioritizing uncharacterized members of protein families for functional assignment based on their abundance in metagenomes.
From the CGFP Tutorial on the Balskus laboratory website (https://www.microbialchemist.com/metagenomic-profiling/):
"The human gut contains trillions of microbial inhabitants, making it one of the most densely populated environments on the planet. The symbiosis between these organisms and the human host is extremely complex, and we are only beginning to understand the impact of the gut microbiota on human biology. Knowledge of the chemical reactions performed and compounds produced by gut microbes will provide new insights into their roles in influencing human health. By studying the gene content of the human gut microbiome and the enzymes encoded by these genes, we hope to better understand the chemical capabilities of this microbial community. However, the activities of the vast majority of enzymes found in microbiomes are unknown.
We have developed a bioinformatics workflow to guide studies of genes and enzymes in microbiomes, including enzymes of unknown function. Our approach, which we call "chemically guided functional profiling", uses a molecular understanding of a large enzyme superfamily to guide the identification and quantitation of different family members in metagenomes and metatranscriptomes. To begin, a "sequence similarity network" (SSN) analysis is used to computationally divide a large number of enzyme sequences into groups that are likely to share the same activity. The quantitative metagenomics program ShortBRED can then identify short peptide markers that are unique to highly similar enzyme sequences and quantify the abundance of these markers in raw metagenomic datasets. The markers are then mapped back to clusters from the SSN to assess the abundance of individual enzymes in that metagenome. Because this approach provides information about the relative abundance of enzyme family members with both known and unknown activities, it can provide new insights about important microbial functions and it can prioritize uncharacterized enzymes for further study based on their distribution and abundance in microbial communities. We have used chemically guided functional profiling to identify members of the glycyl radical enzyme family in Human Microbiome Project sequencing datasets, and we anticipate that this approach will be readily extended to additional enzyme families and microbial communities."
In its original form, the CGFP pipeline described by Balskus and Huttenhower (https://bitbucket.org/biobakery/cgfp/overview) required both knowledge of Unix command line programs and access to a computer cluster. The EFI-CGFP web tool was developed to "democratize" the use of CGFP by experimentalists by making it both accessible and "user friendly".
The input for EFI-CGFP is a colored sequence similarity network (SSN) for a protein family. SSNs that segregate protein families into functional clusters, in which each cluster represents a different function, e.g., by separating the curated SwissProt-annotated entries into different clusters, are recommended.
To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar with both EFI-EST (https://efi.igb.illinois.edu/efi-est/) to generate SSNs for protein families and Cytoscape (http://www.cytoscape.org/) to visualize, analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool (https://efi.igb.illinois.edu/efi-gnt/) that colors SSNs, and collects, analyzes, and represents genome neighborhoods for bacterial and fungal sequences in SSN clusters.
EFI-CGFP uses the ShortBRED algorithm described by Huttenhower and colleagues in two successive steps: 1) identify sequence markers that are unique to the ShortBRED families in the input SSN, i.e., groups of sequences that share 85% sequence identity as determined with the CD-HIT algorithm (CD-HIT 85 clusters), and 2) quantify the marker abundances in metagenome datasets and then map these to the SSN clusters.
EFI-CGFP provides heat maps that allow easy identification of the clusters that have metagenome hits as well as measures of metagenome abundance (hits per microbial genome in the metagenome sample). EFI-CGFP also outputs several additional files, including tab-delimited text files (can be opened in Excel) that provide actual and normalized values of both protein and cluster abundances and an enriched SSN that provides a visual summary of the markers that were identified (in the CD-HIT 85 clusters) and the abundance mapping results.
In the "Run CGFP/ShortBRED" tab, the user uploads the colored SSN xgmml file (red arrow; either compressed/zipped or uncompressed) for which the sequence markers will be identified and the metagenome abundances determined and mapped to the SSN clusters and singletons. Ideally, the input SSN should have been segregated into isofunctional clusters by the user so that EFI-CGFP will allow orthologues of the members of these clusters to be identified in the metagenome datasets. If the user later decides that the SSN clustering can be improved, the user can upload a "child" SSN derived from the initial SSN generated with a different alignment score and/or containing a subset of the sequences/clusters in the original SSN and use the abundances determined with the original SSN to generate heatmaps and abundance statistics for the child SSN.
The input SSN MUST be a Colored SSN generated with either the Color SSN utility of EFI-EST or EFI-GNT. The node attributes for a Colored SSN include "Cluster Number" and "Singleton Number" for each (meta)node [the former if the (meta)node is located in a cluster with ≥2 accession IDs; the latter if the (meta)node is a singleton (a single accession ID)]. These numbers are used to identify the clusters and singletons with metagenome hits in the input SSN, heat maps, and tables generated by the EFI-CGFP quantify step ("Quantify Results" page, vide infra).
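As an illustration of these node attributes, the following minimal sketch tallies (meta)nodes by "Cluster Number" and "Singleton Number" in an xgmml file. The attribute names come from the description above; the toy network, the node identifiers, and the helper names `local` and `tally_nodes` are made up for this example, and a real Colored SSN contains many more attributes per node.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy XGMML fragment (hypothetical node ids); real Colored SSNs are far larger.
XGMML = """<graph label="toy SSN" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="A0A000" label="A0A000">
    <att name="Cluster Number" type="integer" value="1"/>
  </node>
  <node id="A0A001" label="A0A001">
    <att name="Cluster Number" type="integer" value="1"/>
  </node>
  <node id="A0A002" label="A0A002">
    <att name="Singleton Number" type="integer" value="1"/>
  </node>
</graph>"""


def local(tag):
    """Strip an XML namespace prefix, e.g. '{uri}node' -> 'node'."""
    return tag.rsplit("}", 1)[-1]


def tally_nodes(xgmml_text):
    """Count (meta)nodes per cluster number and per singleton number."""
    clusters, singletons = Counter(), Counter()
    root = ET.fromstring(xgmml_text)
    for node in root.iter():
        if local(node.tag) != "node":
            continue
        for att in node:
            if local(att.tag) != "att":
                continue
            name, value = att.get("name"), att.get("value")
            if name == "Cluster Number" and value:
                clusters[int(value)] += 1
            elif name == "Singleton Number" and value:
                singletons[int(value)] += 1
    return clusters, singletons


clusters, singletons = tally_nodes(XGMML)
print(dict(clusters), dict(singletons))
```

A file in which no node carries either attribute would produce two empty tallies, which is a quick sign that the SSN has not been colored.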
Before they are colored, the SSNs can be of two types: 1) SSNs generated with Options A, B, or D of EFI-EST with sequences derived entirely from the UniProt database (https://www.uniprot.org/) and 2) SSNs generated with Option C of EFI-EST that include sequences provided in a user-provided FASTA file so that they need not be included in the UniProt database, e.g., in the NCBI database or proprietary sequences obtained from in-house sequencing projects. In generating the original SSN, the user may elect to read the FASTA headers (e.g., for sequences from NCBI) to obtain node attributes if an equivalent accession ID can be found in the UniProt database; however, these node attributes are not used by EFI-CGFP.
For SSNs generated with Option C, the sequences provided in the FASTA file are used by EFI-CGFP; the SSNs also may include nodes/sequences for a user-specified UniProt or InterPro family (using the Advanced Options of Option C). EFI-CGFP uses the sequences in the FASTA files and/or those associated with the UniProt IDs in user-specified UniProt and/or InterPro family/families to identify the markers used to interrogate the metagenome datasets in the Quantify step; it does not use any of the node attributes.
The input SSN can be a full SSN with a node for each sequence/accession ID or a representative node (rep node) SSN in which the sequences/accession IDs are grouped in metanodes that share specified levels of sequence identity by EFI-EST. The SSN can be generated using either the complete set of family sequences (for families with ≤25,000 sequences) or UniRef90 seed sequences/clusters (for families with >25,000 sequences, as required by EFI-EST).
For large protein families (>25K sequences), the user may find it useful to generate the input SSN using UniRef50 seed sequences/clusters; this feature is now available on the EFI-EST web tool. The accession IDs in UniRef50 clusters share ≥50% sequence identity over ≥80% of the length of the seed sequence. In many cases, UniRef50 clusters are isofunctional. SSNs generated using UniRef50 seed sequences/clusters are "equivalent" to the clustering in 50% rep node SSNs.
SSNs generated with UniRef50 seed sequences/clusters should be generated using alignment scores that correspond to <50% pairwise sequence identity; we recommend alignment scores that correspond to 30-45% pairwise sequence identity. These SSNs will contain fewer (meta)nodes than SSNs generated using UniProt sequences or UniRef90 seed sequences/clusters; the Family Information Page on the EFI-EST tools provides tables that detail the number of (seed) sequences for Pfam families/clans and InterPro families for the UniProt, UniRef90, and UniRef50 databases. The node attributes for the SSNs contain the accession IDs that are present in the UniRef90 and UniRef50 clusters, so the user can locate any UniProt accession ID in SSNs generated with the UniRef databases.
Depending on the RAM available on the user's computer, the user may not be able to view a full SSN [complete set of UniProt, UniRef90, or UniRef50 (seed) sequences], but the user should be able to view a rep node SSN. We recommend the highest resolution rep node SSN that can be manipulated with Cytoscape on the user's computer (largest possible sequence identity) so that proteins with different functions can be expected to be located in separate SSN clusters.
We recommend that users apply a minimum length filter to their sequences to ensure that the sequences are "full length" when the input SSN is generated with EFI-EST. The marker identification step in ShortBRED involves initial clustering of proteins into groups of sequences that share a specified level of sequence identity (default is 85%; CD-HIT 85 ShortBRED families); these sequences then are aligned (multiple sequence alignment using MUSCLE) to generate a consensus sequence. Finally, the consensus sequence is used by the ShortBRED programs to identify unique markers for each CD-HIT 85 ShortBRED family. The presence of sequence "fragments" may bias the multiple sequence alignment/identification of the consensus sequence, so fragments should be absent from the input SSN.
If the input SSN was generated using UniRef90 or UniRef50 clusters, the minimum/maximum length filters in EFI-EST are applied to the seed sequences for the UniRef clusters (used in the EFI-EST BLAST) and not the sequences in the clusters. [The seed sequence is the longest sequence; shorter sequences that share 90% (UniRef90) or 50% (UniRef50) sequence identity over at least 80% of the length of the seed sequence are located within the cluster.] The user can choose maximum lengths for generation of SSNs, but it should be remembered that some of the sequences in UniRef clusters likely will be shorter than the seed sequence.
The "Run CGFP/ShortBRED" page has an Advanced Option section that provides minimum and maximum length filters that can be applied so that sequences of desired minimum and maximum lengths can be selected from UniRef90 or UniRef50 clusters (see below).
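The effect of applying length filters to the members of an expanded UniRef cluster, rather than only to the seed sequence, can be sketched as follows. This is an illustration only, not the EFI-CGFP implementation: the function name `passes_length_filter`, the accession labels, and the toy sequences are hypothetical.

```python
def passes_length_filter(sequence, min_len=None, max_len=None):
    """Return True if the sequence length lies within the optional bounds."""
    length = len(sequence)
    if min_len is not None and length < min_len:
        return False
    if max_len is not None and length > max_len:
        return False
    return True


# Hypothetical UniRef cluster: the seed is the longest sequence, and the
# cluster also holds a slightly shorter member and a short fragment.
cluster_members = {
    "SEED_1": "MKTAYIAKQRQISFVK",    # 16 residues
    "MEMBER_2": "MKTAYIAKQRQISF",    # 14 residues
    "FRAGMENT_3": "MKTAYI",          # 6 residues
}

kept = {
    acc: seq
    for acc, seq in cluster_members.items()
    if passes_length_filter(seq, min_len=10)
}
print(sorted(kept))
```

Here the seed passes a minimum length of 10, but the fragment inside the same cluster is removed only because the filter is applied to every member.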
EFI-CGFP identifies and uses only unique sequences (100% sequence identity over 100% of the length of each sequence) in the input SSN so that the consensus sequence is not biased by multiple occurrences of the same sequence; metanodes in rep node SSNs and UniRef90/UniRef50 clusters are expanded so that all sequences included in the metanodes/clusters are used in the identification of unique sequences.
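Because "unique" here means 100% identity over 100% of the length, the selection amounts to exact string matching. A minimal sketch, assuming the metanodes/clusters have already been expanded into (accession ID, sequence) pairs; the function name, the accession labels, and the case normalization are assumptions made for this example.

```python
def unique_sequences(records):
    """Group accession IDs by exact sequence.

    records: iterable of (accession_id, sequence) pairs.
    Returns a dict mapping each distinct sequence to the accession IDs
    that carry it, in input order.
    """
    groups = {}
    for accession, sequence in records:
        # Upper-casing is an assumption so that 'mkt' and 'MKT' compare equal.
        groups.setdefault(sequence.upper(), []).append(accession)
    return groups


# Hypothetical records: ACC_1 and ACC_2 are identical; ACC_3 is one residue longer.
records = [
    ("ACC_1", "MKTAYIAK"),
    ("ACC_2", "MKTAYIAK"),
    ("ACC_3", "MKTAYIAKQ"),
]

groups = unique_sequences(records)
print(len(groups), groups["MKTAYIAK"])
```

ACC_3 differs from the others by a single extra residue, so it counts as a separate unique sequence.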
Sequence markers specific to the CD-HIT 85 ShortBRED family consensus sequences are identified by subjecting the consensus sequences to pairwise alignment among themselves and then to pairwise alignment with the sequences in a reference database.
The default parameters for marker identification are 1) those used for clustering the unique sequences into clusters that share 85% sequence identity (CD-HIT 85 ShortBRED families), 2) DIAMOND (normal sensitivity) for the pairwise comparisons of the consensus sequences for the CD-HIT 85 ShortBRED families with one another and a reference sequence database, and 3) the UniRef90 seed sequences for the reference sequence database.
After the quantify (second) step of ShortBRED, the metagenomes identified by the markers for the CD-HIT 85 ShortBRED family seed sequences in an SSN cluster are merged when the cluster abundance is calculated; these are included in a downloadable file as well as summarized in the heatmaps. The output files also include the metagenomes identified by the markers for each of the CD-HIT 85 ShortBRED families.
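The merging of metagenome hits from ShortBRED families into their parent SSN cluster can be pictured as a set union. The sketch below is a simplified illustration, not the EFI-CGFP code; the function name, the family and metagenome identifiers, and the two input mappings are hypothetical.

```python
from collections import defaultdict


def merge_hits_by_cluster(family_to_cluster, family_to_metagenomes):
    """Union the metagenomes hit by each ShortBRED family into its SSN cluster."""
    merged = defaultdict(set)
    for family, metagenomes in family_to_metagenomes.items():
        merged[family_to_cluster[family]].update(metagenomes)
    # Sorted lists make the result easy to read and compare.
    return {cluster: sorted(hits) for cluster, hits in merged.items()}


# Hypothetical data: two families belong to SSN cluster 1, one to cluster 2.
family_to_cluster = {"FAM_A": 1, "FAM_B": 1, "FAM_C": 2}
family_to_metagenomes = {
    "FAM_A": {"MG_01", "MG_02"},
    "FAM_B": {"MG_02", "MG_03"},
    "FAM_C": {"MG_01"},
}

cluster_hits = merge_hits_by_cluster(family_to_cluster, family_to_metagenomes)
print(cluster_hits)
```

The per-family mapping is kept as input, mirroring the fact that the output files report hits both per cluster and per CD-HIT 85 ShortBRED family.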
The Advanced Options section allows the user to change the default parameters used to identify markers for the CD-HIT ShortBRED families.
The quantify step then maps the abundance of metagenome hits to the markers and then to the SSN clusters that contain the CD-HIT ShortBRED families with the markers (next section).
After the colored input SSN is uploaded, marker identification is initiated (with the "Upload SSN" button; blue arrow). The execution time depends on the number of unique sequences. Using DIAMOND the execution time ranges from ~40 minutes for the GRE superfamily (InterPro family IPR004184 with a minimum length filter of 500 residues; ~9000 unique sequences) to ~5 hours for the radical SAM superfamily (InterPro families IPR007197 and IPR006638 without a minimum length filter; ~320K unique sequences). Using BLAST the times are ~24 hrs for the GRE superfamily and three weeks for the radical SAM superfamily.
An e-mail is sent to the user when the input SSN has been uploaded and the marker identification has been initiated. The "Previous Jobs" tab will display the job in black font as soon as it is received and its status will be indicated as "PENDING"; when marker identification has been initiated, the job status will be changed to "RUNNING". When the job is finished, the job name will change to a green-colored link.
When the marker identification step is completed, EFI-CGFP sends an e-mail to the user. In the "Previous Jobs" tab, the job name will be changed to a green-colored link to the "Markers Computation Results" page.
A table summarizing the Job Information is provided at the top of the page. The information includes the parameters used for marker identification as well as job statistics about the input SSN, e.g., number of clusters, number of singletons, number of (meta)nodes, total number of accession IDs, number of unique sequences, number of CD-HIT 85 ShortBRED families, and number of markers. This information can be downloaded as a text file.
This page provides seven files that can be downloaded: 1) the input Colored SSN to which node attributes have been added containing the type and number of markers that were generated ("SSN with marker results"); 2) the zipped file for the same SSN; 3) a tab-delimited text file ("Marker data") that provides the identities of and information about the markers that were generated; 4) a tab-delimited text file that identifies the seed sequences for the CD-HIT 85 ShortBRED families that were generated and the sequences that are contained in each cluster ("CD-HIT mapping file"); 5) the number of sequences in the SSN clusters (from the input colored SSN); 6) the SwissProt annotations associated with clusters (and the accession ID for the SwissProt description from the input colored SSN); and 7) the SwissProt annotations associated with singletons.
The identify job SSN adds four additional node attributes to the input SSN: 1) "Seed Sequence" with the accession ID if the (meta)node is/contains the seed (longest) sequence for a CD-HIT 85 ShortBRED family; 2) "Seed Sequence Cluster(s)" with the accession ID of the seed sequence for the ShortBRED family with which the accession ID(s) was(were) grouped; 3) "Marker Types" for the markers identified in the seed sequences (True, Quasi, or Junction; see the ShortBRED article for details); and 4) "Number of Markers" with the number of markers identified in the seed sequence.
This page also enables the user to select the Human Microbiome Project (HMP) metagenome datasets from a library of 380 datasets for healthy adult men and women from six body sites [stool, buccal mucosa (lining of cheek and mouth), supragingival plaque (tooth plaque), anterior nares (nasal cavity), tongue dorsum (surface), and posterior fornix (vagina)]. Subsets of the library can be selected using the "Search" boxes.
We plan to add additional libraries of metagenome datasets, e.g., inflammatory bowel disease metagenomes and transcriptomes from HMP2.
After the datasets are selected (red arrow; transferred from the left panel to the right panel), metagenome abundance quantification is initiated with the "Quantify Marker" button (blue arrow). On the "Previous Jobs" tab, the quantify job will appear with a list of the selected metagenomes. Quantitation of metagenome abundance may require >24 hrs for execution.
The "Existing Quantify Jobs" button is provided to access the pages for the metagenome quantification jobs/results generated with the markers.
In addition, after both the Identify and Quantify steps have been completed for an input SSN, this page allows the user to upload a different Colored SSN ("child" SSN) and map the protein/cluster abundances to the new SSN (using the "Upload SSN with Different Alignment Score" section; red arrow). The new SSN must have been generated using the same EFI-EST job as the original SSN uploaded to EFI-CGFP and segregated with a different alignment score and/or a subset of the clusters. This upload eliminates the need to re-run the Identify and Quantify steps so the results will be obtained in a much shorter period of time (minutes instead of days!). Submitting an SSN will create a new job in the "Previous Jobs" panel (indicated as "Child" of the initial marker identification/metagenome quantification job) which functions independently of the original job but shares the Identify and Quantify data. The results for the "Child" job will include heat maps for the clusters and singletons in the uploaded SSN as well as the SSN with marker and metagenome hit node attributes (next section).
When the abundance quantitation is completed, EFI-CGFP sends an e-mail to the user. In the "Previous Jobs" tab, the job name will be changed to a link to the "Quantify Results" page.
A table summarizing the Job Information is provided at the top of the page. The information is that provided on the "Markers Computation Results" page plus the number of consensus sequences with markers that identified metagenome hits. This information can be downloaded as a text file. The next section provides three tabs to access files that can be downloaded:
SSN and CDHIT files: Eight files can be downloaded, including: 1) the SSN provided by the marker identification job to which a node attribute has been added for CD-HIT 85 seed sequence (meta)nodes ("Metagenomes Identified by Markers") that identified a match to its markers in one or more metagenome datasets ("SSN with quantify results"); 2) the zipped file for the same SSN; 3) a list/description of the metagenomes that were selected for abundance determination; 4) a tab-delimited text file that identifies the seed sequences for the CD-HIT ShortBRED families that were generated and the sequences that are contained in each cluster ("CD-HIT ShortBRED families by cluster"; identical to the file provided on the Identify Results page); 5) a tab-delimited text file ("ShortBRED Marker data") that provides the identities of and information about the markers that were generated (identical to the file provided on the Identify Results page); 6) the number of sequences in the SSN clusters ("Cluster sizes"; identical to the file provided on the Identify Results page); 7) the "SwissProt annotations by cluster" (and the accession ID for the SwissProt description; identical to the file provided on the Identify Results page); and 8) the "SwissProt annotations by singletons" (identical to the file provided on the Identify Results page).
CGFP Output (using median method): The default is for ShortBRED to report the abundance of metagenome hits for CD-HIT clusters using the "median method". The numbers of metagenome hits identified by all of the markers for a CD-HIT consensus sequence are arranged in increasing numerical order; the value for the median marker is used as the abundance. For seed sequences with an even number of markers, the average of the two "middle" markers is used as the abundance. This method assumes that the distribution of hits across the markers for a CD-HIT consensus sequence is uniform (expected if the metagenome sequencing is "deep", i.e., multiple coverage). The files for download include: 1) a tab-delimited text file that provides raw protein abundance for all metagenome datasets; 2) a tab-delimited text file that provides raw cluster abundance for all metagenome datasets; 3) a tab-delimited text file that provides sum-normalized protein abundance for all metagenome datasets; 4) a tab-delimited text file that provides sum-normalized cluster abundance for all metagenome datasets; 5) a tab-delimited text file that provides average genome size (AGS) normalized protein abundance for all metagenome datasets; and 6) a tab-delimited text file that provides average genome size (AGS) normalized cluster abundance for all metagenome datasets. Sum-normalization converts the raw values in reads per kilobase of sequence per million sample reads (RPKM) to relative abundances (sum=1). Alternatively, AGS normalized outputs have been converted from raw RPKM to counts per microbial genome.
CGFP Output (using mean method): In the alternate mean method for reporting abundances, the average value of the abundances identified by the markers for each CD-HIT consensus sequence is used to report abundance. This method reports the presence of "any" hit for a marker for a seed sequence. An asymmetric distribution of hits across a seed sequence with multiple markers is expected for "false positives", so the mean method should be used with caution.
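The difference between the two reporting methods can be seen with a small numerical sketch. The per-marker hit values below are made up for illustration and are not ShortBRED output; the example uses only Python's standard `statistics` module, whose `median` averages the two middle values for an even number of markers, as described above.

```python
from statistics import mean, median

# Hypothetical per-marker hit values for two consensus sequences with four markers each.
uniform_hits = [4, 5, 5, 6]     # hits spread evenly across the markers
lopsided_hits = [0, 0, 0, 12]   # a single marker collects every hit

for label, hits in (("uniform", uniform_hits), ("lopsided", lopsided_hits)):
    ordered = sorted(hits)
    print(label, "median:", median(ordered), "mean:", mean(ordered))
```

For the evenly covered sequence the two methods agree (both give 5). For the lopsided sequence the median is 0 while the mean is 3, so only the mean method reports a hit; this is the asymmetric pattern the text associates with possible false positives.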
The files for download include: 1) a tab-delimited text file that provides raw protein abundance for all metagenome datasets; 2) a tab-delimited text file that provides raw cluster abundance for all metagenome datasets; 3) a tab-delimited text file that provides sum-normalized protein abundance for all metagenome datasets; 4) a tab-delimited text file that provides sum-normalized cluster abundance for all metagenome datasets; 5) a tab-delimited text file that provides average genome size (AGS) normalized protein abundance for all metagenome datasets; and 6) a tab-delimited text file that provides average genome size (AGS) normalized cluster abundance for all metagenome datasets. Sum-normalization converts the raw values in reads per kilobase of sequence per million sample reads (RPKM) to relative abundances (sum=1). Alternatively, AGS normalized outputs have been converted from raw RPKM to counts per microbial genome.
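Sum-normalization as described (relative abundances that sum to 1) can be sketched in a few lines. The sketch assumes that the normalization is applied to the raw values of one metagenome sample at a time; the function name, the cluster labels, and the numbers are hypothetical. AGS normalization is not sketched because its formula is not given here.

```python
def sum_normalize(raw_abundances):
    """Convert raw abundances for one metagenome into relative abundances (sum = 1)."""
    total = sum(raw_abundances.values())
    if total == 0:
        # No hits in this sample: keep zeros rather than dividing by zero.
        return {key: 0.0 for key in raw_abundances}
    return {key: value / total for key, value in raw_abundances.items()}


# Hypothetical raw (RPKM-like) cluster abundances for a single metagenome.
raw = {"cluster_1": 30.0, "cluster_2": 10.0, "cluster_3": 0.0}
relative = sum_normalize(raw)
print(relative)
```

Relative abundances allow comparison of the composition within one sample, whereas values expressed per microbial genome are the ones displayed in the heat maps.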
The next section provides three tabs to access the heat maps:
Cluster Heatmap: a heat map in which each SSN cluster/metagenome hit pair is colored according to average genome abundance. The metagenomes are grouped according to body site so that trends/consensus across the six body sites can be easily discerned. The default heat map is calculated using the median method to report abundances.
The y-axis lists the SSN cluster numbers for which metagenome hits were identified; the x-axis lists the metagenome datasets selected on the Identify Results page. A color scale on the right gives the AGS-normalized abundance, i.e., the number of gene copies for the "hit" per microbial genome in the metagenome sample.
Filters for manipulating the heat map are provided below the heat map: 1) "Show specific clusters" (individual cluster numbers separated by commas and/or a range of cluster numbers); 2) "Abundance to display" to specify minimum and maximum values for the abundance; and 3) "Maximum abundance to display" to specify a maximum value for the abundance.
Also, several check boxes are provided: 1) "Use mean" to display the heat map using the mean method for reporting abundances; and 2) "Display hits only" to display a black/white map showing presence/absence of "hits" (easier to see low abundance hits). Finally, check boxes are provided to display the heat map regions associated with specific body sites.
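The "Show specific clusters" and "Display hits only" options can be mimicked on downloaded abundance data with a short sketch. This is not the web tool's code: the hyphen syntax for ranges (e.g., "5-7"), the function names, and the nested-dictionary layout of the abundance table are assumptions made for illustration.

```python
def parse_cluster_spec(spec):
    """Parse e.g. '1, 3, 5-7' into the set {1, 3, 5, 6, 7}."""
    wanted = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(piece) for piece in part.split("-", 1))
            wanted.update(range(low, high + 1))
        else:
            wanted.add(int(part))
    return wanted


def select_rows(abundance, clusters, hits_only=False):
    """Keep the requested clusters; optionally reduce values to presence (1) / absence (0)."""
    selected = {}
    for cluster, row in abundance.items():
        if cluster not in clusters:
            continue
        if hits_only:
            row = {metagenome: int(value > 0) for metagenome, value in row.items()}
        selected[cluster] = dict(row)
    return selected


# Hypothetical abundance table: cluster number -> metagenome -> abundance.
abundance = {
    1: {"MG_01": 0.02, "MG_02": 0.0},
    2: {"MG_01": 1.50, "MG_02": 0.4},
    6: {"MG_01": 0.0, "MG_02": 0.001},
}

view = select_rows(abundance, parse_cluster_spec("1, 5-7"), hits_only=True)
print(view)
```

The presence/absence view makes the very low value for cluster 6 in MG_02 as visible as a high-abundance hit, which is the stated purpose of the "Display hits only" box.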
Tools for downloading and manipulating the heat map can be accessed by hovering and clicking above and to the right of the heat map.
Singleton Heatmap: a heat map in which each SSN singleton/metagenome hit pair is colored according to average genome abundance; the default heat map is calculated using the median method to report abundances.
The same display filters described for the Cluster Heatmap are available.
Combined Heatmap: a heat map in which each SSN cluster or singleton/metagenome hit pair is colored according to average genome abundance; the default heat map is calculated using the median method to report abundances.
The same display filters described for the Cluster Heatmap are available.
Finally, below the heat maps a list of the metagenomes that were used in the abundance quantification is provided.
A link to the marker identification results is provided in the job information table at the top of the page.
This example recreates the CGFP analysis for the GRE family (IPR004184) as it was initially described by Levin et al. (2017; full reference below).
The SSN was generated on EFI-EST with InterPro 71 and UniProt 2018_10, with UniRef90 seed sequences. An alignment score of 300 and a minimum length filter of 500 AA were applied. As required, the obtained SSN was colored using the EFI-EST Color SSN utility prior to submission to EFI-CGFP for analysis.
|Minimum sequence length||500|
|Identify search type||DIAMOND|
|CD-HIT identity for ShortBRED family definition||85|
|Quantify search type||USEARCH|
|Number of SSN clusters||261|
|Number of SSN singletons||446|
|SSN sequence source||UniRef 90|
|Number of SSN (meta)nodes||3,842|
|Number of accession IDs in SSN||14,341|
|Number of sequences after length filter||13,923|
|Number of unique sequences in SSN after length filter||11,351|
|Number of CD-HIT ShortBRED families||2,925|
|Number of markers||14,541|
|Number of consensus sequences with hits||794|
|SSN with quantify results (ZIP)||5 MB|
|Description of selected metagenomes||<1 MB|
|CD-HIT ShortBRED families by cluster||<1 MB|
|ShortBRED marker data||1 MB|
|Cluster sizes||<1 MB|
|SwissProt annotations by cluster||<1 MB|
|SwissProt annotations by singletons||<1 MB|
|Protein abundance data (median)||2 MB|
|Cluster abundance data (median)||1 MB|
|Normalized protein abundance data (median)||2 MB|
|Normalized cluster abundance data (median)||1 MB|
|Average genome size (AGS) normalized protein abundance data (median)||2 MB|
|Average genome size (AGS) normalized cluster abundance data (median)||1 MB|
|Protein abundance data (mean)||3 MB|
|Cluster abundance data (mean)||1 MB|
|Normalized protein abundance data (mean)||3 MB|
|Normalized cluster abundance data (mean)||1 MB|
|Average genome size (AGS) normalized protein abundance data (mean)||3 MB|
|Average genome size (AGS) normalized cluster abundance data (mean)||1 MB|
UniProt Version: 2019_01
For more information on CGFP-ShortBRED, see
For more information on ShortBRED, see
These programs use data computed by MicrobeCensus.
Portions of the metagenome data used on this site come from the Human Microbiome Project.