The input for EFI-CGFP is a colored sequence similarity network (SSN) for a protein family. The recommended input is an SSN that segregates the protein family into functional clusters, each representing a different function, e.g., one in which the curated SwissProt-annotated entries are separated into different clusters.
To obtain SSNs compatible with EFI-CGFP analysis, users need to be familiar with both EFI-EST (https://efi.igb.illinois.edu/efi-est/) to generate SSNs for protein families and Cytoscape (http://www.cytoscape.org/) to visualize, analyze, and edit SSNs. Users should also be familiar with the EFI-GNT web tool (https://efi.igb.illinois.edu/efi-gnt/), which colors SSNs and collects, analyzes, and represents genome neighborhoods for bacterial and fungal sequences in SSN clusters.
EFI-CGFP uses the ShortBRED algorithm described by Huttenhower and colleagues in two successive steps: 1) identify sequence markers that are unique to the families that ShortBRED defines in the input SSN, i.e., groups of sequences that share 85% sequence identity as determined with the CD-HIT algorithm (CD-HIT 85 clusters), and 2) quantify the marker abundances in metagenome datasets and then map these abundances to the SSN clusters.
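The mapping in step 2 can be pictured with a minimal Python sketch. It is an illustration only, not the EFI-CGFP or ShortBRED implementation: the function name `map_to_clusters`, the family and sample identifiers, the abundance values, and the choice to sum family abundances within a cluster are all assumptions made for this example.

```python
from collections import defaultdict


def map_to_clusters(family_abundance, family_to_cluster):
    """Sum per-family marker abundances into per-cluster totals.

    family_abundance: {family_id: {sample_id: abundance}}
    family_to_cluster: {family_id: ssn_cluster_number}
    Both structures are hypothetical stand-ins for the real outputs.
    """
    cluster_abundance = defaultdict(lambda: defaultdict(float))
    for family, per_sample in family_abundance.items():
        cluster = family_to_cluster[family]
        for sample, value in per_sample.items():
            # Assumption: a cluster's abundance is the sum over its families.
            cluster_abundance[cluster][sample] += value
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {c: dict(s) for c, s in cluster_abundance.items()}


# Made-up abundances for three CD-HIT 85 families in two metagenomes.
family_abundance = {
    "famA": {"sample1": 0.5, "sample2": 0.0},
    "famB": {"sample1": 0.25, "sample2": 1.0},
    "famC": {"sample1": 0.0, "sample2": 2.0},
}
# Made-up assignment of families to SSN cluster numbers.
family_to_cluster = {"famA": 1, "famB": 1, "famC": 2}

result = map_to_clusters(family_abundance, family_to_cluster)
print(result)
```

Here families `famA` and `famB` both belong to cluster 1, so their abundances are combined per metagenome, while `famC` alone determines the values for cluster 2.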
EFI-CGFP provides heat maps that allow easy identification of the clusters that have metagenome hits as well as measures of metagenome abundance (hits per microbial genome in the metagenome sample). EFI-CGFP also outputs several additional files, including tab-delimited text files (which can be opened in Excel) that provide actual and normalized values of both protein and cluster abundances, and an enriched SSN that provides a visual summary of the markers that were identified (in the CD-HIT 85 clusters) and of the abundance mapping results.
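As an alternative to Excel, the tab-delimited abundance files can also be read with a short script. The sketch below is a hedged example that assumes a hypothetical layout (first column holds the cluster number, each remaining column holds the abundance for one metagenome); the column header, sample names, and values are invented for illustration and may not match the actual EFI-CGFP files.

```python
import csv
import io

# Invented stand-in for a cluster abundance file; the real column names
# and layout may differ.
SAMPLE_TSV = (
    "Cluster Number\tmetagenome_A\tmetagenome_B\n"
    "1\t0.75\t1.0\n"
    "2\t0.0\t0.0\n"
    "3\t0.0\t2.0\n"
)


def clusters_with_hits(handle):
    """Return {cluster: [metagenomes with abundance > 0]} from a TSV handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumption: every column after the first is a sample
    hits = {}
    for row in reader:
        cluster = row[0]
        values = [float(v) for v in row[1:]]
        present = [s for s, v in zip(samples, values) if v > 0]
        if present:
            hits[cluster] = present
    return hits


hits = clusters_with_hits(io.StringIO(SAMPLE_TSV))
print(hits)
```

To use this with a downloaded file, replace the `io.StringIO(SAMPLE_TSV)` argument with an open file handle, e.g. `open(path, newline="")`, after checking that the file's columns match the assumed layout.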