What is a Genome Neighborhood Network?
While sequence homology alone may be sufficient to allow correct assignment of protein function in some cases, the combination of sequence homology and genome neighborhood analysis may increase the confidence of these predictions for bacterial and fungal proteins as the sequence identity decreases. Because the genes that encode metabolic pathways in bacteria and fungi often are colocalized in the genome, analysis of the genome neighborhood for an uncharacterized enzyme may provide insights into its in vitro activity and in vivo metabolic function.
For efficient regulation of transcription, bacterial and fungal genes often are organized in operons and/or gene clusters. An operon may contain several genes under the transcriptional regulation of a single promoter. Their gene products, usually enzymes, constitute a metabolic pathway. In the example below, the product of Enzyme A is the substrate for Enzyme B, which produces a product that is the substrate for Enzyme C. If the functions of Enzyme A and Enzyme C are known, but the function of Enzyme B is unknown, the colocation of their genes can provide insights into the possible function of Enzyme B. Enzyme B most likely catalyzes a reaction that utilizes the product of Enzyme A to generate the substrate for Enzyme C.
Figure 1. Genome context may allow prediction of a metabolic pathway.
Sometimes genes that encode the enzymes in a pathway are organized in neighboring clusters of two or more transcriptional units that are controlled by the same transcriptional regulator. Their gene products may be similarly analyzed to deduce biochemical pathways and the functions of unknown proteins.
Using the sequences in an input Sequence Similarity Network (SSN) as queries, the Genome Neighborhood Network (GNN) organizes the proteins encoded by the genome neighborhood for each query sequence according to Pfam family. Unlike manual analysis of individual genome neighborhoods, which can be extremely time-consuming when conducted on more than a handful of genes, EFI-GNT can rapidly acquire and organize genome neighborhood information for thousands of query genes in a high-throughput fashion. Because the genome contexts for orthologous enzymes (same in vitro activity and in vivo metabolic function) often are not conserved phylogenetically, the large-scale collection and organization of genome context enabled by EFI-GNT may allow the identification of the enzymes in metabolic pathways that are not co-organized in the user’s “target” organism.
The GNNs generated by EFI-GNT identify the protein families (using Pfam-defined homology-based classifications) that are encoded by the genes proximal to genes that encode the proteins in the input/query SSN dataset. The identities of these families often provide valuable information about the types of reactions catalyzed by the genome neighbors.
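The idea of collecting the genes proximal to a query and organizing them by Pfam family can be illustrated with a minimal Python sketch. This is not EFI-GNT's implementation: the toy genome, the placeholder family names (PfamW, PfamX, ...), the function name neighbors_by_pfam, and the window parameter are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy genome: genes in chromosomal order, each with a
# placeholder Pfam family label (not real Pfam accessions).
genome = [
    ("g1", "PfamW"),
    ("g2", "PfamX"),
    ("g3", "PfamQ"),  # the query gene
    ("g4", "PfamY"),
    ("g5", "PfamZ"),
]


def neighbors_by_pfam(genome, query_id, window=1):
    """Group genes within `window` positions of the query by Pfam family.

    `window` is an illustrative assumption, not a documented setting.
    """
    ids = [gene_id for gene_id, _ in genome]
    pos = ids.index(query_id)
    lo = max(0, pos - window)
    hi = min(len(genome), pos + window + 1)
    grouped = defaultdict(list)
    for i in range(lo, hi):
        if i == pos:
            continue  # skip the query gene itself
        gene_id, pfam = genome[i]
        grouped[pfam].append(gene_id)
    return dict(grouped)


result = neighbors_by_pfam(genome, "g3", window=1)
print(result)
```

Repeating this for every query sequence in an SSN, and pooling the results per SSN cluster, would give the kind of cluster-to-family associations that a GNN displays.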
Two formats for the GNN information are provided:
1. Each SSN cluster with queries that found neighbors is depicted as the hub-node in a cluster in the GNN; the identities of the Pfam families of the neighbors are depicted as the spoke-nodes. This format enables identification of potential pathway members that are functionally linked to the query sequences in the cluster and, with the identities of the Pfam families, inference of the reactions in the pathway. In this format, “over-fractionation” of the SSN may result in the identification of incomplete pathways: the power of the large-scale analysis is that phylogenetically diverse genome organizations can be identified for orthologues, and that power is reduced when orthologues are split across several clusters. Synergistic interpretation of both formats may allow this situation to be identified.
2. Each neighborhood Pfam family that was found is depicted as the hub-node in a cluster in the GNN; the identities of the SSN clusters with queries that “found” neighbors in the family are depicted as the spoke-nodes in the cluster. This format enables an assessment of whether the clusters in the query SSN are isofunctional, i.e., if multiple clusters find the same Pfam family, the SSN may be “over-fractionated” so that orthologues are found in multiple clusters. Alternatively, the Pfam family may contain members with different functions that are found by different clusters in the input SSN.
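The two formats are two views of the same set of cluster-to-family associations, differing only in which side serves as the hub. The following Python sketch illustrates this under stated assumptions: the observations, the cluster and family names, and the helper hub_spoke are hypothetical and do not reflect how EFI-GNT builds its networks.

```python
from collections import defaultdict

# Hypothetical (SSN cluster, neighbor Pfam family) associations.
observations = [
    ("cluster1", "PfamX"),
    ("cluster1", "PfamY"),
    ("cluster2", "PfamX"),
    ("cluster3", "PfamZ"),
]


def hub_spoke(pairs, hub_index):
    """Map each hub to its sorted spokes.

    hub_index=0 makes SSN clusters the hubs (format 1);
    hub_index=1 makes Pfam families the hubs (format 2).
    """
    spoke_index = 1 - hub_index
    hubs = defaultdict(set)
    for pair in pairs:
        hubs[pair[hub_index]].add(pair[spoke_index])
    return {hub: sorted(spokes) for hub, spokes in hubs.items()}


cluster_hubs = hub_spoke(observations, hub_index=0)  # format 1
pfam_hubs = hub_spoke(observations, hub_index=1)     # format 2

# Families found by more than one cluster: candidates for either an
# over-fractionated SSN or a functionally diverse Pfam family.
shared = {pfam: cl for pfam, cl in pfam_hubs.items() if len(cl) > 1}
print(cluster_hubs)
print(pfam_hubs)
print(shared)
```

In this toy data, PfamX is found by both cluster1 and cluster2, which is the situation the second format is meant to expose; deciding between the two explanations still requires inspecting the sequences and their neighborhoods.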
The GNNs from both formats can be filtered using Cytoscape to extract information involving specific Pfam families and/or specific query clusters from the input SSN. Given the large amount of information in a GNN, such simplification often is desirable. However, the considerable utility of GNNs is made possible by that same large amount of information, all of which is accessible to the user.
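The logic of this kind of filtering can be sketched in a few lines of Python. This is only an analogy for what a filter achieves, not Cytoscape's interface; the edge list and the function filter_edges are hypothetical.

```python
# Hypothetical GNN edges as (SSN cluster, Pfam family) pairs.
edges = [
    ("cluster1", "PfamX"),
    ("cluster1", "PfamY"),
    ("cluster2", "PfamX"),
    ("cluster3", "PfamZ"),
]


def filter_edges(edges, pfams=None, clusters=None):
    """Keep edges matching the requested families and clusters.

    Passing None for either argument means "no restriction" on that side.
    """
    return [
        (cluster, pfam)
        for cluster, pfam in edges
        if (pfams is None or pfam in pfams)
        and (clusters is None or cluster in clusters)
    ]


only_x = filter_edges(edges, pfams={"PfamX"})
only_c1 = filter_edges(edges, clusters={"cluster1"})
print(only_x)
print(only_c1)
```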