EFI-CGFP/ShortBRED

Due to the computationally-heavy nature of CGFP, we are granting access to the EFI-CGFP tool on a request basis.

If you wish to use EFI-CGFP, you will need a user account and also be an approved member of the EFI-CGFP user group. If you do not have a user account, please create a user account, login, and return to this page. If you have an account and are not logged in, please login and return to this page to submit an application to become a member of the EFI-CGFP user group.

Chemically guided functional profiling

Chemically guided functional profiling (CGFP) maps metagenome protein abundance to clusters in sequence similarity networks generated by the EFI-EST web tool (https://efi.igb.illinois.edu/efi-est/).

The glycyl radical enzyme (GRE) superfamily is functionally and mechanistically diverse, with many uncharacterized members. CGFP was developed to focus experimental studies on assigning novel functions to uncharacterized GRE superfamily members that are detected in the human gut microbiome [B. J. Levin*, Y. Y. Huang* et al. Science 355, eaai8386 (2017)]. CGFP provides a powerful approach for prioritizing uncharacterized members of protein families for functional assignment based on their abundance in metagenomes.

From the CGFP Tutorial on the Balskus laboratory website (https://www.microbialchemist.com/metagenomic-profiling/):

"The human gut contains trillions of microbial inhabitants, making it one of the most densely populated environments on the planet. The symbiosis between these organisms and the human host is extremely complex, and we are only beginning to understand the impact of the gut microbiota on human biology. Knowledge of the chemical reactions performed and compounds produced by gut microbes will provide new insights into their roles in influencing human health. By studying the gene content of the human gut microbiome and the enzymes encoded by these genes, we hope to better understand the chemical capabilities of this microbial community. However, the activities of the vast majority of enzymes found in microbiomes are unknown.

We have developed a bioinformatics workflow to guide studies of genes and enzymes in microbiomes, including enzymes of unknown function. Our approach, which we call "chemically guided functional profiling", uses a molecular understanding of a large enzyme superfamily to guide the identification and quantitation of different family members in metagenomes and metatranscriptomes. To begin, a "sequence similarity network" (SSN) analysis is used to computationally divide a large number of enzyme sequences into groups that are likely to share the same activity. The quantitative metagenomics program ShortBRED can then identify short peptide markers that are unique to highly similar enzyme sequences and quantify the abundance of these markers in raw metagenomic datasets. The markers are then mapped back to clusters from the SSN to assess the abundance of individual enzymes in that metagenome. Because this approach provides information about the relative abundance of enzyme family members with both known and unknown activities, it can provide new insights about important microbial functions and it can prioritize uncharacterized enzymes for further study based on their distribution and abundance in microbial communities. We have used chemically guided functional profiling to identify members of the glycyl radical enzyme family in Human Microbiome Project sequencing datasets, and we anticipate that this approach will be readily extended to additional enzyme families and microbial communities."

In its original form, the CGFP pipeline described by Balskus and Huttenhower (https://bitbucket.org/biobakery/cgfp/overview) required both knowledge of Unix command line programs and access to a computer cluster. The EFI-CGFP web tool was developed to "democratize" the use of CGFP by experimentalists by making it both accessible and "user friendly".

EFI-CGFP Web Tool Overview

The input for EFI-CGFP is a colored sequence similarity network (SSN) for a protein family. SSNs that segregate protein families into functional clusters, in which each cluster represents a different function, e.g., by separating the curated SwissProt-annotated entries into different clusters, are recommended.

To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar with both EFI-EST (https://efi.igb.illinois.edu/efi-est/) to generate SSNs for protein families and Cytoscape (http://www.cytoscape.org/) to visualize, analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool (https://efi.igb.illinois.edu/efi-gnt/), which colors SSNs and collects, analyzes, and represents genome neighborhoods for bacterial and fungal sequences in SSN clusters.

EFI-CGFP uses the ShortBRED algorithm described by Huttenhower and colleagues in two successive steps: 1) identify sequence markers that are unique to the ShortBRED families in the input SSN, i.e., groups of sequences that share 85% sequence identity as determined with the CD-HIT algorithm (CD-HIT 85 clusters), and 2) quantify the marker abundances in metagenome datasets and then map these to the SSN clusters.

EFI-CGFP provides heat maps that allow easy identification of the clusters that have metagenome hits as well as measures of metagenome abundance (hits per microbial genome in the metagenome sample). EFI-CGFP also outputs several additional files, including tab-delimited text files (can be opened in Excel) that provide actual and normalized values of both protein and cluster abundances and an enriched SSN that provides a visual summary of the markers that were identified (in the CD-HIT 85 clusters) and the abundance mapping results.

"Run CGFP/ShortBRED" tab: EFI-CGFP Input SSNs

Image of the input form.

In the "Run CGFP/ShortBRED" tab, the user uploads the colored SSN xgmml file for which the sequence markers are identified (red arrow; either compressed/zipped or uncompressed) and metagenome abundances determined and mapped to the SSN clusters and singletons. Ideally, the input SSN should have been segregated into isofunctional clusters by the user so that EFI-CGFP will allow orthologues of the members of these clusters to be identified in the metagenome datasets. If the user later decides that the SSN clustering can be improved, the user can upload a "child" SSN derived from the initial SSN generated with a different alignment score and/or containing a subset of the sequences/clusters in the original SSN and use the abundances determined with the original SSN to generate heatmaps and abundance statistics for the child SSN.

The input SSN MUST be a Colored SSN generated with the Color SSN utility of either EFI-EST or EFI-GNT. The node attributes for a Colored SSN include "Cluster Number" and "Singleton Number" for each (meta)node [the former if the (meta)node is located in a cluster with ≥2 accession IDs; the latter if the (meta)node is a singleton (a single accession ID)]. These numbers are used to identify the clusters and singletons with metagenome hits in the input SSN, heat maps, and tables generated by the EFI-CGFP quantify step ("Quantify Results" page, vide infra).
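As an illustration of how these node attributes can be read outside of Cytoscape, the following is a minimal Python sketch. It assumes the common XGMML layout in which each node element carries att children with name and value attributes; the sample fragment and the function names are hypothetical, and only the attribute names "Cluster Number" and "Singleton Number" come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal XGMML fragment; a real Colored SSN carries many more attributes.
SAMPLE_XGMML = """<graph label="example" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="P1" label="P1">
    <att name="Cluster Number" type="integer" value="1"/>
  </node>
  <node id="P2" label="P2">
    <att name="Cluster Number" type="integer" value="1"/>
  </node>
  <node id="P3" label="P3">
    <att name="Singleton Number" type="integer" value="1"/>
  </node>
</graph>"""


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}node' becomes 'node'."""
    return tag.rsplit("}", 1)[-1]


def read_cluster_assignments(xgmml_text):
    """Return {node_id: (kind, number)} where kind is 'cluster' or 'singleton'."""
    assignments = {}
    root = ET.fromstring(xgmml_text)
    for node in root.iter():
        if local_name(node.tag) != "node":
            continue
        for att in node:
            if local_name(att.tag) != "att":
                continue
            name = att.get("name")
            if name == "Cluster Number":
                assignments[node.get("id")] = ("cluster", int(att.get("value")))
            elif name == "Singleton Number":
                assignments[node.get("id")] = ("singleton", int(att.get("value")))
    return assignments


assignments = read_cluster_assignments(SAMPLE_XGMML)
print(assignments)
```

For a real file, the text would be read from the uncompressed xgmml file instead of the inline sample; very large SSNs may call for a streaming parser rather than loading the whole document.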

Before they are colored, the SSNs can be of two types: 1) SSNs generated with Options A, B, or D of EFI-EST with sequences derived entirely from the UniProt database (https://www.uniprot.org/) and 2) SSNs generated with Option C of EFI-EST that include sequences provided in a user-provided FASTA file so that they need not be included in the UniProt database, e.g., in the NCBI database or proprietary sequences obtained from in-house sequencing projects. In generating the original SSN, the user may elect to read the FASTA headers (e.g., for sequences from NCBI) to obtain node attributes if an equivalent accession ID can be found in the UniProt database; however, these node attributes are not used by EFI-CGFP.

For SSNs generated with Option C, the sequences provided in the FASTA file are used by EFI-CGFP; the SSNs also may include nodes/sequences for a user-specified UniProt or InterPro family (using the Advanced Options of Option C). EFI-CGFP uses the sequences in the FASTA files and/or those associated with the UniProt IDs in user-specified UniProt and/or InterPro family/families to identify the markers used to interrogate the metagenome datasets in the Quantify step; it does not use any of the node attributes.

The input SSN can be a full SSN with a node for each sequence/accession ID or a representative node (rep node) SSN in which the sequences/accession IDs are grouped in metanodes that share specified levels of sequence identity by EFI-EST. The SSN can be generated using either the complete set of family sequences (for families with ≤25,000 sequences) or UniRef90 seed sequences/clusters (for families with >25,000 sequences, as required by EFI-EST).

For large protein families (>25K sequences), the user may find it useful to generate the input SSN using UniRef50 seed sequences/clusters; this feature is now available on the EFI-EST web tool. The accession IDs in UniRef50 clusters share ≥50% sequence identity over ≥80% of the length of the seed sequence. In many cases, UniRef50 clusters are isofunctional. The clustering in SSNs generated using UniRef50 seed sequences/clusters is "equivalent" to the clustering in 50% rep node SSNs.

SSNs generated with UniRef50 seed sequences/clusters should be generated using alignment scores that correspond to <50% pairwise sequence identity; we recommend alignment scores that correspond to 30-45% pairwise sequence identity. These SSNs will contain fewer (meta)nodes than SSNs generated using UniProt sequences or UniRef90 seed sequences/clusters; the Family Information Page on the EFI-EST tool provides tables that detail the number of (seed) sequences for Pfam families/clans and InterPro families for the UniProt, UniRef90, and UniRef50 databases. The node attributes for the SSNs contain the accession IDs that are present in the UniRef90 and UniRef50 clusters, so the user can locate any UniProt accession ID in SSNs generated with the UniRef databases.

Depending on the RAM available on the user's computer, the user may not be able to view a full SSN [complete set of UniProt, UniRef90, or UniRef50 (seed) sequences], but the user should be able to view a rep node SSN. We recommend the highest resolution rep node SSN that can be manipulated with Cytoscape on the user's computer (largest possible sequence identity) so that proteins with different functions can be expected to be located in separate SSN clusters.

We recommend that users apply a minimum length filter to their sequences to ensure that the sequences are "full length" when the input SSN is generated with EFI-EST. The marker identification step in ShortBRED involves initial clustering of proteins into groups of sequences that share a specified level of sequence identity (default is 85%; CD-HIT 85 ShortBRED families); these sequences are then aligned (multiple sequence alignment using MUSCLE) to generate a consensus sequence. Finally, the consensus sequence is used by the ShortBRED programs to identify unique markers for each CD-HIT 85 ShortBRED family. The presence of sequence "fragments" may bias the multiple sequence alignment and the identification of the consensus sequence, so fragments should be absent from the input SSN.

If the input SSN was generated using UniRef90 or UniRef50 clusters, the minimum/maximum length filters in EFI-EST are applied to the seed sequences for the UniRef clusters (used in the EFI-EST BLAST) and not the sequences in the clusters. [The seed sequence is the longest sequence; shorter sequences that share 90% (UniRef90) or 50% (UniRef50) sequence identity over at least 80% of the length of the seed sequence are located within the cluster.] The user can choose maximum lengths for generation of SSNs, but it should be remembered that some of the sequences in UniRef clusters likely will be shorter than the seed sequence.

The "Run CGFP/ShortBRED" page has an Advanced Option section that provides minimum and maximum length filters that can be applied so that sequences of desired minimum and maximum lengths can be selected from UniRef90 or UniRef50 clusters (see below).

EFI-CGFP identifies and uses only unique sequences (100% sequence identity over 100% of the length of each sequence) in the input SSN so that the consensus sequence is not biased by multiple occurrences of the same sequence; metanodes in rep node SSNs and UniRef90/UniRef50 clusters are expanded so that all sequences included in the metanodes/clusters are used in the identification of unique sequences.
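The expansion and deduplication described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EFI-CGFP implementation: the data layout (a mapping from (meta)node ID to accession ID/sequence pairs), the function name, and the example sequences are all hypothetical.

```python
def unique_sequences(metanodes):
    """Expand metanodes and keep one accession ID per distinct sequence.

    `metanodes` maps a (meta)node ID to a list of (accession_id, sequence) pairs.
    The first accession ID seen for a sequence is the one that is kept.
    """
    first_seen = {}
    for members in metanodes.values():
        for accession_id, sequence in members:
            # Identical strings are 100% identical over 100% of their length.
            first_seen.setdefault(sequence.upper(), accession_id)
    return {accession_id: sequence for sequence, accession_id in first_seen.items()}


# Hypothetical example: A1 and A2 carry the same sequence, B1 differs by one residue.
metanodes = {
    "node1": [("A1", "MKTAYIAK"), ("A2", "MKTAYIAK")],
    "node2": [("B1", "MKTAYIAR")],
}
unique = unique_sequences(metanodes)
print(unique)
```

Keeping the first accession ID encountered is an arbitrary choice made for the sketch; any single representative per distinct sequence would serve the same purpose of not counting a sequence twice.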

Sequence markers specific to the CD-HIT 85 ShortBRED family consensus sequences are identified by subjecting the consensus sequences to pairwise alignment among themselves and then to pairwise alignment with the sequences in a reference database.

The default parameters for marker identification are 1) those used for clustering the unique sequences into clusters that share 85% sequence identity (CD-HIT 85 ShortBRED families), 2) DIAMOND (normal sensitivity) for the pairwise comparisons of the consensus sequences for the CD-HIT 85 ShortBRED families with one another and a reference sequence database, and 3) the UniRef90 seed sequences for the reference sequence database.

After the quantify (second) step of ShortBRED, the metagenomes identified by the markers for the CD-HIT 85 ShortBRED family seed sequences in an SSN cluster are merged when the cluster abundance is calculated; these are included in a downloadable file as well as summarized in the heatmaps. The output files also include the metagenomes identified by the markers for each of the CD-HIT 85 ShortBRED families.

The Advanced Options section allows the user to change the default parameters used to identify markers for the CD-HIT ShortBRED families:

  1. The user can select minimum and maximum sequence lengths. As noted previously, UniRef90 and UniRef50 clusters used by EFI-EST to generate SSNs contain sequences shorter than the seed sequence so this option allows minimum and maximum lengths to be specified.
  2. As an alternative to the UniRef90 seed sequences for the reference sequence dataset, the user can select either the sequences in the full UniProt database or the UniRef50 seed sequences. The UniProt sequences may produce fewer false positive metagenome hits than the default UniRef90 database with a 2-fold longer execution time; UniRef50 may produce more false positive metagenome hits than the default UniRef90 database with a 2-fold shorter execution time.
  3. The user can specify an alternate sequence identity value for generating the ShortBRED families with CD-HIT, e.g., to ensure that different functions are represented by sequences in different CD-HIT clusters. Proteins that share >60% sequence identity usually have the same functions, but exceptions to this "rule" are known.
  4. The user can specify BLAST instead of DIAMOND as the algorithm for the pairwise comparisons. BLAST is more "accurate" than DIAMOND but DIAMOND is ~100x faster than BLAST when used by ShortBRED. For large protein families, we recommend the initial use of the DIAMOND default followed, if desired, by more "accurate" analyses of selected SSN clusters using BLAST.
  5. CGFP retrieves the sequences used for constructing the CD-HIT ShortBRED families from the local UniProt database used by EFI-EST. If the SSN was generated with a previous database, that database can be selected to ensure that the sequences will be available (UniProt "retires" sequences so the database used to generate the SSN should be used). The database is given in the Network Information table provided by EFI-EST. The default database is the current database used by EFI-EST.

The quantify step then maps the abundance of metagenome hits to the markers and then to the SSN clusters that contain the CD-HIT ShortBRED families with the markers (next section).

After the colored input SSN is uploaded, marker identification is initiated (with the "Upload SSN" button; blue arrow). The execution time depends on the number of unique sequences. Using DIAMOND the execution time ranges from ~40 minutes for the GRE superfamily (InterPro family IPR004184 with a minimum length filter of 500 residues; ~9000 unique sequences) to ~5 hours for the radical SAM superfamily (InterPro families IPR007197 and IPR006638 without a minimum length filter; ~320K unique sequences). Using BLAST the times are ~24 hrs for the GRE superfamily and three weeks for the radical SAM superfamily.

An e-mail is sent to the user when the input SSN has been uploaded and the marker identification has been initiated. The "Previous Jobs" tab will display the job in black font as soon as it is received and its status will be indicated as "PENDING"; when marker identification has been initiated, the job status will be changed to "RUNNING". When the job is finished, the job name will change to a green-colored link.

Marker Quantification/Metagenome Abundance

Image of the identify results page.

When the marker identification step is completed, EFI-CGFP sends an e-mail to the user. In the "Previous Jobs" tab, the job name will be changed to a green-colored link to the "Markers Computation Results" page.

A table summarizing the Job Information is provided at the top of the page. The information includes the parameters used for marker identification as well as job statistics about the input SSN, e.g., number of clusters, number of singletons, number of (meta)nodes, total number of accession IDs, number of unique sequences, number of CD-HIT 85 ShortBRED families, and number of markers. This information can be downloaded as a text file.

This page provides seven files that can be downloaded: 1) the input Colored SSN to which node attributes have been added containing the type and number of markers that were generated ("SSN with marker results"); 2) the zipped file for the same SSN; 3) a tab-delimited text file ("Marker data") that provides the identities of and information about the markers that were generated; 4) a tab-delimited text file that identifies the seed sequences for the CD-HIT 85 ShortBRED families that were generated and the sequences that are contained in each cluster ("CD-HIT mapping file"); 5) the number of sequences in the SSN clusters (from the input colored SSN); 6) the SwissProt annotations associated with clusters (and the accession ID for the SwissProt description from the input colored SSN); and 7) the SwissProt annotations associated with singletons.

The identify job SSN adds four additional node attributes to the input SSN: 1) "Seed Sequence" with the accession ID if the (meta)node is/contains the seed (longest) sequence for a CD-HIT 85 ShortBRED family; 2) "Seed Sequence Cluster(s)" with the accession ID of the seed sequence for the ShortBRED family with which the accession ID(s) was(were) grouped; 3) "Marker Types" for the markers identified in the seed sequences (True, Quasi, or Junction; see the ShortBRED article for details); and 4) "Number of Markers" with the number of markers identified in the seed sequence.

This page also enables the user to select the Human Microbiome Project (HMP) metagenome datasets from a library of 380 datasets for healthy adult men and women from six body sites [stool, buccal mucosa (lining of cheek and mouth), supragingival plaque (tooth plaque), anterior nares (nasal cavity), tongue dorsum (surface), and posterior fornix (vagina)]. Subsets of the library can be selected using the "Search" boxes.

We plan to add additional libraries of metagenome datasets, e.g., inflammatory bowel disease metagenomes and transcriptomes from HMP2.

After the datasets are selected (red arrow; transferred from the left panel to the right panel), metagenome abundance quantification is initiated with the "Quantify Marker" button (blue arrow). On the "Previous Jobs" tab, the quantify job will appear with a list of the selected metagenomes. Quantitation of metagenome abundance may require >24 hrs for execution.

The "Existing Quantify Jobs" button is provided to access the pages for the metagenome quantification jobs/results generated with the markers.

Image of the button for uploading new SSNs for the same identify results.

In addition, after both the Identify and Quantify steps have been completed for an input SSN, this page allows the user to upload a different Colored SSN ("child" SSN) and map the protein/cluster abundances to the new SSN (using the "Upload SSN with Different Alignment Score" section; red arrow). The new SSN must have been generated using the same EFI-EST job as the original SSN uploaded to EFI-CGFP and segregated with a different alignment score and/or a subset of the clusters. This upload eliminates the need to re-run the Identify and Quantify steps so the results will be obtained in a much shorter period of time (minutes instead of days!). Submitting an SSN will create a new job in the "Previous Jobs" panel (indicated as "Child" of the initial marker identification/metagenome quantification job) which functions independently of the original job but shares the Identify and Quantify data. The results for the "Child" job will include heat maps for the clusters and singletons in the uploaded SSN as well as the SSN with marker and metagenome hit node attributes (next section).

Quantify Results

Image of the identify results page.

When the abundance quantitation is completed, EFI-CGFP sends an e-mail to the user. In the "Previous Jobs" tab, the job name will be changed to a link to the "Quantify Results" page.

A table summarizing the Job Information is provided at the top of the page. The information is that provided on the "Markers Computation Results" page plus the number of consensus sequences with markers that identified metagenome hits. This information can be downloaded as a text file.

The next section provides three tabs to access files that can be downloaded:

SSN and CDHIT files: Eight files can be downloaded, including: 1) the SSN provided by the marker identification job to which a node attribute ("Metagenomes Identified by Markers") has been added for each CD-HIT 85 seed sequence (meta)node whose markers identified a match in one or more metagenome datasets ("SSN with quantify results"); 2) the zipped file for the same SSN; 3) a list/description of the metagenomes that were selected for abundance determination; 4) a tab-delimited text file that identifies the seed sequences for the CD-HIT ShortBRED families that were generated and the sequences that are contained in each cluster ("CD-HIT ShortBRED families by cluster"; identical to the file provided on the Identify Results page); 5) a tab-delimited text file ("ShortBRED Marker data") that provides the identities of and information about the markers that were generated (identical to the file provided on the Identify Results page); 6) the number of sequences in the SSN clusters ("Cluster sizes"; identical to the file provided on the Identify Results page); 7) the "SwissProt annotations by cluster" (and the accession ID for the SwissProt description; identical to the file provided on the Identify Results page); and 8) the "SwissProt annotations by singletons" (identical to the file provided on the Identify Results page).

CGFP Output (using median method): The default is for ShortBRED to report the abundance of metagenome hits for CD-HIT clusters using the "median method". The numbers of metagenome hits identified by all of the markers for a CD-HIT consensus sequence are arranged in increasing numerical order; the value for the median marker is used as the abundance. This method assumes that the distribution of hits across the markers for a CD-HIT consensus sequence is uniform (expected if the metagenome sequencing is "deep", i.e., multiple coverage). For seed sequences with an even number of markers, the average of the two "middle" markers is used as the abundance. The files for download include: 1) a tab-delimited text file that provides raw protein abundance for all metagenome datasets; 2) a tab-delimited text file that provides raw cluster abundance for all metagenome datasets; 3) a tab-delimited text file that provides sum-normalized protein abundance for all metagenome datasets; 4) a tab-delimited text file that provides sum-normalized cluster abundance for all metagenome datasets; 5) a tab-delimited text file that provides average genome size (AGS) normalized protein abundance for all metagenome datasets; and 6) a tab-delimited text file that provides average genome size (AGS) normalized cluster abundance for all metagenome datasets. Sum-normalization converts the raw values in reads per kilobase of sequence per million sample reads (RPKM) to relative abundances (sum=1). Alternatively, AGS normalized outputs have been converted from raw RPKM to counts per microbial genome.
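The median method described above amounts to sorting the per-marker values and taking the middle one, or the average of the two middle ones. The Python sketch below illustrates only that arithmetic; the function name and the example values are hypothetical, and it is not ShortBRED's own code.

```python
def median_abundance(marker_values):
    """Median of the per-marker hit values for one CD-HIT consensus sequence."""
    if not marker_values:
        raise ValueError("at least one marker value is required")
    ordered = sorted(marker_values)  # increasing numerical order
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of markers: the value of the median marker is the abundance.
        return ordered[middle]
    # Even number of markers: average the two "middle" markers.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_abundance([5, 0, 4])        # three markers
even_case = median_abundance([100, 2, 6, 4])  # four markers, one outlier
print(odd_case, even_case)
```

In the second call the single very large value does not move the result, which is the practical reason a median is less sensitive to one marker that attracts spurious hits.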

CGFP Output (using mean method): In the alternate mean method for reporting abundances, the average of the abundances identified by the markers for each CD-HIT consensus sequence is used to report abundance. This method reports the presence of "any" hit for a marker for a seed sequence. An asymmetric distribution of hits across the markers for a seed sequence with multiple markers is expected for "false positives", so the mean method should be used with caution.
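To make the caution concrete, the following Python sketch compares the mean and the median for two sets of per-marker values, one evenly distributed and one in which a single marker accounts for all of the hits. The numbers and the function name are hypothetical and chosen only for illustration.

```python
from statistics import mean, median


def summarize_markers(marker_values):
    """Return (mean, median) of the per-marker values for one consensus sequence."""
    return mean(marker_values), median(marker_values)


# Hypothetical per-marker values for two consensus sequences with four markers each.
even_hits = [10, 11, 9, 10]     # hits spread evenly across the markers
lopsided_hits = [0, 0, 0, 12]   # all hits on one marker, as expected for a false positive

even_mean, even_median = summarize_markers(even_hits)
lopsided_mean, lopsided_median = summarize_markers(lopsided_hits)
print(even_mean, even_median, lopsided_mean, lopsided_median)
```

For the evenly covered sequence the two methods agree, whereas for the lopsided one the mean reports a nonzero abundance and the median reports zero. A large gap between the two values is therefore a reasonable prompt to inspect the individual markers before trusting a hit.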

The files for download include: 1) a tab-delimited text file that provides raw protein abundance for all metagenome datasets; 2) a tab-delimited text file that provides raw cluster abundance for all metagenome datasets; 3) a tab-delimited text file that provides sum-normalized protein abundance for all metagenome datasets; 4) a tab-delimited text file that provides sum-normalized cluster abundance for all metagenome datasets; 5) a tab-delimited text file that provides average genome size (AGS) normalized protein abundance for all metagenome datasets; and 6) a tab-delimited text file that provides average genome size (AGS) normalized cluster abundance for all metagenome datasets. Sum-normalization converts the raw values in reads per kilobase of sequence per million sample reads (RPKM) to relative abundances (sum=1). Alternatively, AGS normalized outputs have been converted from raw RPKM to counts per microbial genome.
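Sum-normalization as described above divides each raw value by the total so that the results add up to 1. The Python sketch below assumes that the normalization is applied separately to each metagenome dataset, which the text does not state explicitly; the function name and the values are hypothetical. AGS normalization is not sketched because the conversion factors are not given here.

```python
def sum_normalize(raw_abundances):
    """Convert raw abundances for one metagenome into relative abundances (sum = 1).

    `raw_abundances` maps a cluster (or protein) identifier to its raw value.
    """
    total = sum(raw_abundances.values())
    if total == 0:
        # No hits at all in this metagenome: nothing to scale.
        return {key: 0.0 for key in raw_abundances}
    return {key: value / total for key, value in raw_abundances.items()}


# Hypothetical raw values for three SSN clusters in one metagenome.
raw = {"1": 2.0, "2": 6.0, "3": 0.0}
relative = sum_normalize(raw)
print(relative)
```

Relative abundances make samples of different sequencing depth comparable, at the cost of losing the absolute scale that the AGS-normalized files retain.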

Heatmaps

The next section provides three tabs to access the heat maps:

Cluster Heatmap: a heat map in which each SSN cluster/metagenome hit pair is colored according to average genome abundance. The metagenomes are grouped according to body site so that trends/consensus across the six body sites can be easily discerned. The default heat map is calculated using the median method to report abundances.

The y-axis lists the SSN cluster numbers for which metagenome hits were identified; the x-axis lists the metagenome datasets selected on the Identify Results page. A color scale on the right gives the AGS-normalized abundance, i.e., the number of gene copies for the "hit" per microbial genome in the metagenome sample.

Three filters for manipulating the heat map are provided below the heat map: 1) "Show specific clusters" (individual cluster numbers separated by commas and/or a range of cluster numbers); 2) "Abundance to display" to specify minimum and maximum values for the abundance; and 3) "Maximum abundance to display" to specify a maximum value for the abundance.
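Users who work with the downloaded abundance tables can apply the same kind of filtering offline. The Python sketch below parses a cluster selection and keeps only the values inside an abundance window. The hyphen syntax for ranges, the function names, the data layout, and the abundance values are assumptions made for illustration; the web tool's own parsing rules may differ.

```python
def parse_cluster_selection(text):
    """Parse e.g. '1, 3, 5-7' into a sorted list of cluster numbers.

    A hyphen is assumed to denote an inclusive range.
    """
    selected = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


def filter_cells(abundance, clusters, minimum, maximum):
    """Keep the selected clusters and the values within [minimum, maximum]."""
    return {
        cluster: {
            metagenome: value
            for metagenome, value in abundance[cluster].items()
            if minimum <= value <= maximum
        }
        for cluster in clusters
        if cluster in abundance
    }


# Hypothetical abundances: {cluster number: {metagenome ID: abundance}}.
abundance = {
    1: {"SRS011061": 0.5, "SRS011090": 0.0},
    2: {"SRS011061": 3.0},
    5: {"SRS011061": 0.02},
}
clusters = parse_cluster_selection("1, 5-7")
filtered = filter_cells(abundance, clusters, 0.01, 1.0)
print(clusters, filtered)
```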

Also, several check boxes are provided: 1) "Use mean" to display the heat map using the mean method for reporting abundances; and 2) "Display hits only" to display a black/white map showing presence/absence of "hits" (easier to see low abundance hits). Finally, check boxes are provided to display the heat map regions associated with specific body sites.

Map tools for downloading and manipulating the heat map can be accessed by hovering and clicking above and to the right of the heat map.

Singleton Heatmap: a heat map in which each SSN singleton/metagenome hit pair is colored according to average genome abundance; the default heat map is calculated using the median method to report abundances.

The same display filters described for the Cluster Heatmap are available.

Combined Heatmap: a heat map in which each SSN cluster or singleton/metagenome hit pair is colored according to average genome abundance; the default heat map is calculated using the median method to report abundances.

The same display filters described for the Cluster Heatmap are available.

Image of the heatmap.

Finally, below the heat maps a list of the metagenomes that were used in the abundance quantification is provided.

The job information table at the top of the page provides a link to the marker identification results.

This example recreates the CGFP analysis for the GRE family (IPR004184) as it was initially described by Levin, et al (2017; full reference below).

The SSN was generated on EFI-EST with InterPro 71 and UniProt 2018_10, with UniRef90 seed sequences. An alignment score of 300 and a minimum length filter of 500 AA were applied. As required, the resulting SSN was colored using the EFI-EST Color SSN utility prior to submission to EFI-CGFP for analysis.

Job Information

Input filename: EFI-CGFP_IPR004184_quantify.xgmml
Minimum sequence length: 500
Identify search type: DIAMOND
Reference database: UNIREF90
CD-HIT identity for ShortBRED family definition: 85
Quantify search type: USEARCH
Number of SSN clusters: 261
Number of SSN singletons: 446
SSN sequence source: UniRef 90
Number of SSN (meta)nodes: 3,842
Number of accession IDs in SSN: 14,341
Number of sequences after length filter: 13,923
Number of unique sequences in SSN after length filter: 11,351
Number of CD-HIT ShortBRED families: 2,925
Number of markers: 14,541
Number of consensus sequences with hits: 794

Downloadable Data

File: Size
SSN with quantify results (ZIP): 5 MB
Description of selected metagenomes: <1 MB
CD-HIT ShortBRED families by cluster: <1 MB
ShortBRED marker data: 1 MB
Cluster sizes: <1 MB
SwissProt annotations by cluster: <1 MB
SwissProt annotations by singletons: <1 MB

File: Size
Protein abundance data (median): 2 MB
Cluster abundance data (median): 1 MB
Normalized protein abundance data (median): 2 MB
Normalized cluster abundance data (median): 1 MB
Average genome size (AGS) normalized protein abundance data (median): 2 MB
Average genome size (AGS) normalized cluster abundance data (median): 1 MB

File: Size
Protein abundance data (mean): 3 MB
Cluster abundance data (mean): 1 MB
Normalized protein abundance data (mean): 3 MB
Normalized cluster abundance data (mean): 1 MB
Average genome size (AGS) normalized protein abundance data (mean): 3 MB
Average genome size (AGS) normalized cluster abundance data (mean): 1 MB

Heatmaps

In Firefox version 64, the initial heatmap view doesn't show all of the data. The missing data can be exposed by moving the mouse over the heatmap, or by scrolling the page. This problem does not occur in earlier versions of Firefox, or in the Chrome, Safari, and Edge web browsers. The URL from Firefox can be copied and pasted into another browser for visualization. We are working on addressing the problem.


Metagenomes:
SRS011061: stool
SRS011090: buccal mucosa
SRS011098: supragingival plaque
SRS011126: supragingival plaque
SRS011132: anterior nares
SRS011134: stool
SRS011140: tongue dorsum
SRS011144: buccal mucosa
SRS011152: supragingival plaque
SRS011239: stool
SRS011243: tongue dorsum
SRS011247: buccal mucosa
SRS011255: supragingival plaque
SRS011263: anterior nares
SRS011269: posterior fornix
SRS011271: stool
SRS011302: stool
SRS011306: tongue dorsum
SRS011310: buccal mucosa
SRS011343: supragingival plaque
SRS011355: posterior fornix
SRS011397: anterior nares
SRS011405: stool
SRS011452: stool
SRS011529: stool
SRS011584: posterior fornix
SRS011586: stool
SRS012273: stool
SRS012279: tongue dorsum
SRS012281: buccal mucosa
SRS012285: supragingival plaque
SRS012291: anterior nares
SRS012294: posterior fornix
SRS012663: anterior nares
SRS012902: stool
SRS013155: anterior nares
SRS013158: stool
SRS013164: tongue dorsum
SRS013170: supragingival plaque
SRS013215: stool
SRS013234: tongue dorsum
SRS013239: buccal mucosa
SRS013252: supragingival plaque
SRS013269: anterior nares
SRS013476: stool
SRS013502: tongue dorsum
SRS013506: buccal mucosa
SRS013521: stool
SRS013533: supragingival plaque
SRS013542: posterior fornix
SRS013637: anterior nares
SRS013687: stool
SRS013705: tongue dorsum
SRS013711: buccal mucosa
SRS013723: supragingival plaque
SRS013800: stool
SRS013818: tongue dorsum
SRS013825: buccal mucosa
SRS013836: supragingival plaque
SRS013876: anterior nares
SRS013879: tongue dorsum
SRS013881: buccal mucosa
SRS013945: buccal mucosa
SRS013949: supragingival plaque
SRS013951: stool
SRS013956: anterior nares
SRS014124: tongue dorsum
SRS014126: buccal mucosa
SRS014235: stool
SRS014271: tongue dorsum
SRS014287: stool
SRS014313: stool
SRS014459: stool
SRS014464: anterior nares
SRS014470: tongue dorsum
SRS014472: buccal mucosa
SRS014476: supragingival plaque
SRS014494: posterior fornix
SRS014573: tongue dorsum
SRS014575: buccal mucosa
SRS014578: supragingival plaque
SRS014613: stool
SRS014629: posterior fornix
SRS014682: anterior nares
SRS014683: stool
SRS014684: tongue dorsum
SRS014686: buccal mucosa
SRS014690: supragingival plaque
SRS014888: tongue dorsum
SRS014890: buccal mucosa
SRS014894: supragingival plaque
SRS014901: anterior nares
SRS014923: stool
SRS014979: stool
SRS015038: tongue dorsum
SRS015040: buccal mucosa
SRS015044: supragingival plaque
SRS015051: anterior nares
SRS015054: posterior fornix
SRS015133: stool
SRS015154: buccal mucosa
SRS015158: supragingival plaque
SRS015168: posterior fornix
SRS015190: stool
SRS015209: tongue dorsum
SRS015215: supragingival plaque
SRS015217: stool
SRS015225: posterior fornix
SRS015264: stool
SRS015269: anterior nares
SRS015272: tongue dorsum
SRS015274: buccal mucosa
SRS015278: supragingival plaque
SRS015369: stool
SRS015374: buccal mucosa
SRS015378: supragingival plaque
SRS015395: tongue dorsum
SRS015425: posterior fornix
SRS015430: anterior nares
SRS015434: tongue dorsum
SRS015436: buccal mucosa
SRS015440: supragingival plaque
SRS015450: anterior nares
SRS015470: supragingival plaque
SRS015537: tongue dorsum
SRS015574: supragingival plaque
SRS015578: stool
SRS015640: anterior nares
SRS015644: tongue dorsum
SRS015646: buccal mucosa
SRS015650: supragingival plaque
SRS015663: stool
SRS015745: buccal mucosa
SRS015752: anterior nares
SRS015755: supragingival plaque
SRS015762: tongue dorsum
SRS015782: stool
SRS015893: tongue dorsum
SRS015895: buccal mucosa
SRS015899: supragingival plaque
SRS015921: buccal mucosa
SRS015937: anterior nares
SRS015941: tongue dorsum
SRS015947: supragingival plaque
SRS015960: stool
SRS015989: supragingival plaque
SRS015996: anterior nares
SRS016002: tongue dorsum
SRS016018: stool
SRS016033: anterior nares
SRS016037: tongue dorsum
SRS016039: buccal mucosa
SRS016043: supragingival plaque
SRS016056: stool
SRS016086: tongue dorsum
SRS016088: buccal mucosa
SRS016092: supragingival plaque
SRS016095: stool
SRS016111: posterior fornix
SRS016188: anterior nares
SRS016191: posterior fornix
SRS016196: buccal mucosa
SRS016200: supragingival plaque
SRS016203: stool
SRS016225: tongue dorsum
SRS016267: stool
SRS016292: anterior nares
SRS016297: buccal mucosa
SRS016319: tongue dorsum
SRS016331: supragingival plaque
SRS016335: stool
SRS016342: tongue dorsum
SRS016349: buccal mucosa
SRS016360: supragingival plaque
SRS016434: anterior nares
SRS016495: stool
SRS016501: tongue dorsum
SRS016503: buccal mucosa
SRS016513: anterior nares
SRS016516: posterior fornix
SRS016529: tongue dorsum
SRS016533: buccal mucosa
SRS016553: anterior nares
SRS016559: posterior fornix
SRS016569: tongue dorsum
SRS016575: supragingival plaque
SRS016581: anterior nares
SRS016585: stool
SRS016600: buccal mucosa
SRS016746: supragingival plaque
SRS016752: anterior nares
SRS016753: stool
SRS016954: stool
SRS016989: stool
SRS017013: buccal mucosa
SRS017025: supragingival plaque
SRS017044: anterior nares
SRS017080: buccal mucosa
SRS017103: stool
SRS017120: tongue dorsum
SRS017127: buccal mucosa
SRS017139: supragingival plaque
SRS017156: anterior nares
SRS017191: stool
SRS017209: tongue dorsum
SRS017215: buccal mucosa
SRS017227: supragingival plaque
SRS017244: anterior nares
SRS017247: stool
SRS017304: supragingival plaque
SRS017307: stool
SRS017433: stool
SRS017439: tongue dorsum
SRS017441: buccal mucosa
SRS017445: supragingival plaque
SRS017451: anterior nares
SRS017497: posterior fornix
SRS017511: supragingival plaque
SRS017520: posterior fornix
SRS017521: stool
SRS017533: tongue dorsum
SRS017537: buccal mucosa
SRS017687: buccal mucosa
SRS017691: supragingival plaque
SRS017697: anterior nares
SRS017700: posterior fornix
SRS017701: stool
SRS017713: tongue dorsum
SRS017808: tongue dorsum
SRS017810: buccal mucosa
SRS017814: supragingival plaque
SRS017820: anterior nares
SRS017821: stool
SRS018133: stool
SRS018145: tongue dorsum
SRS018149: buccal mucosa
SRS018157: supragingival plaque
SRS018300: tongue dorsum
SRS018312: anterior nares
SRS018329: buccal mucosa
SRS018337: supragingival plaque
SRS018351: stool
SRS018357: tongue dorsum
SRS018359: buccal mucosa
SRS018369: anterior nares
SRS018394: supragingival plaque
SRS018427: stool
SRS018439: tongue dorsum
SRS018463: anterior nares
SRS018573: supragingival plaque
SRS018575: stool
SRS018585: anterior nares
SRS018591: tongue dorsum
SRS018656: stool
SRS018661: buccal mucosa
SRS018665: supragingival plaque
SRS018671: anterior nares
SRS018739: tongue dorsum
SRS018769: posterior fornix
SRS018778: supragingival plaque
SRS018784: anterior nares
SRS018791: tongue dorsum
SRS018817: stool
SRS019215: anterior nares
SRS019219: tongue dorsum
SRS019221: buccal mucosa
SRS019225: supragingival plaque
SRS019267: stool
SRS019327: tongue dorsum
SRS019329: buccal mucosa
SRS019333: supragingival plaque
SRS019339: anterior nares
SRS019379: posterior fornix
SRS019386: anterior nares
SRS019387: supragingival plaque
SRS019389: tongue dorsum
SRS019391: buccal mucosa
SRS019397: stool
SRS019587: buccal mucosa
SRS019591: supragingival plaque
SRS019597: anterior nares
SRS019600: posterior fornix
SRS019601: stool
SRS019607: tongue dorsum
SRS019968: stool
SRS019974: tongue dorsum
SRS019976: buccal mucosa
SRS019980: supragingival plaque
SRS019986: anterior nares
SRS019989: posterior fornix
SRS020220: tongue dorsum
SRS020226: supragingival plaque
SRS020232: anterior nares
SRS020233: stool
SRS020328: stool
SRS020334: tongue dorsum
SRS020336: buccal mucosa
SRS020340: supragingival plaque
SRS020349: posterior fornix
SRS020386: anterior nares
SRS020856: tongue dorsum
SRS020858: buccal mucosa
SRS020862: supragingival plaque
SRS020868: anterior nares
SRS020869: stool
SRS022137: stool
SRS022143: tongue dorsum
SRS022145: buccal mucosa
SRS022149: supragingival plaque
SRS022158: posterior fornix
SRS022530: tongue dorsum
SRS022532: buccal mucosa
SRS022536: supragingival plaque
SRS022719: tongue dorsum
SRS022721: buccal mucosa
SRS022725: supragingival plaque
SRS022734: posterior fornix
SRS023346: stool
SRS023352: tongue dorsum
SRS023354: buccal mucosa
SRS023358: supragingival plaque
SRS042428: posterior fornix
SRS042457: buccal mucosa
SRS042643: tongue dorsum
SRS043001: stool
SRS043646: buccal mucosa
SRS043663: tongue dorsum
SRS043755: supragingival plaque
SRS044373: tongue dorsum
SRS045004: stool
SRS045049: buccal mucosa
SRS045254: buccal mucosa
SRS045262: buccal mucosa
SRS045313: supragingival plaque
SRS045713: stool
SRS046344: anterior nares
SRS047824: tongue dorsum
SRS048164: stool
SRS048719: buccal mucosa
SRS049389: tongue dorsum
SRS049712: stool
SRS049900: stool
SRS049959: stool
SRS050007: buccal mucosa
SRS050025: anterior nares
SRS050029: buccal mucosa
SRS050184: posterior fornix
SRS050244: tongue dorsum
SRS050628: buccal mucosa
SRS050752: stool
SRS051244: supragingival plaque
SRS051505: posterior fornix
SRS051613: anterior nares
SRS051941: supragingival plaque
SRS052227: tongue dorsum
SRS052330: posterior fornix
SRS052590: anterior nares
SRS052604: supragingival plaque
SRS052697: stool
SRS052876: supragingival plaque
SRS053335: stool
SRS053398: stool
SRS053437: anterior nares
SRS053854: tongue dorsum
SRS054061: anterior nares
SRS054590: stool
SRS054653: supragingival plaque
SRS054687: tongue dorsum
SRS054956: stool
SRS055118: buccal mucosa
SRS055401: supragingival plaque
SRS055426: tongue dorsum
SRS056323: tongue dorsum
SRS056695: posterior fornix
SRS057539: tongue dorsum
SRS057791: tongue dorsum
SRS057807: posterior fornix
SRS058053: supragingival plaque
SRS058213: anterior nares
SRS058808: supragingival plaque
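
The list above pairs each metagenome identifier with a body site. A minimal sketch of parsing such a list and tallying samples per body site follows. It assumes the list has been copied into plain text in the same `ID: body site` form; the function names are hypothetical and are not part of the EFI-CGFP site.

```python
from collections import Counter


def parse_metagenomes(text):
    """Parse lines of the form 'ID: body site' into a dict of ID -> body site."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not an 'ID: site' pair
        sample_id, site = (part.strip() for part in line.split(":", 1))
        if not sample_id or not site:
            continue  # skip header lines such as 'Metagenomes:'
        mapping[sample_id] = site
    return mapping


def count_by_site(mapping):
    """Count how many metagenomes belong to each body site."""
    return Counter(mapping.values())


# A few entries copied from the list above.
sample = """\
Metagenomes:
SRS011061: stool
SRS011090: buccal mucosa
SRS011098: supragingival plaque
SRS011134: stool
"""

mapping = parse_metagenomes(sample)
counts = count_by_site(mapping)
for site, n in counts.most_common():
    print(f"{site}: {n}")
```

Grouping identifiers by body site in this way can help when comparing heatmap columns across sample types.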

UniProt Version: 2018_10

This site uses the CGFP-ShortBRED programs (https://bitbucket.org/biobakery/cgfp/src and http://huttenhower.sph.harvard.edu/shortbred).

For more information on CGFP-ShortBRED, see

Levin, B. J., Huang, Y. Y., Peck, S. C., Wei, Y., Martínez-del Campo, A., Marks, J. A., Franzosa, E. A., Huttenhower, C., Balskus, E. P. A prominent glycyl radical enzyme in human gut microbiomes metabolizes trans-4-hydroxy-L-proline. Science 355, eaai8386 (2017). (DOI: 10.1126/science.aai8386)

For more information on ShortBRED, see

Kaminski, J., Gibson, M. K., Franzosa, E. A., Segata, N., Dantas, G., Huttenhower, C. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput. Biol. 11, e1004557 (2015). (DOI: 10.1371/journal.pcbi.1004557)

These programs use data computed by MicrobeCensus.

Nayfach, S., Pollard, K. S. Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biol. 16, 51 (2015).

Portions of the metagenome data used on this site come from the Human Microbiome Project.

The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207-214 (2012). (DOI: 10.1038/nature11234)
The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215-221 (2012). (DOI: 10.1038/nature11209)


If you use the EFI web tools, please cite us.
