Introduction to Gene Clustering
The genes that encode metabolic pathways in bacteria and fungi often are
co-localized in the genome. Analysis of the genome neighborhood of an
uncharacterized enzyme may therefore provide genomic context that offers
insights into its activity and metabolic function.
While sequence homology alone may be sufficient to allow correct assignment of
protein function in some cases, the combination of sequence homology and genome
neighborhood information increases the confidence of predictions.
For efficient regulation of transcription, bacterial and fungal genes often are
organized in operons and/or gene clusters. An operon may contain several genes
under the transcriptional regulation of a single promoter. Their gene products,
usually enzymes, constitute a metabolic pathway.
Sometimes genes that encode the enzymes in a pathway are organized in
neighboring clusters of two or more transcriptional units that are controlled
by the same transcriptional regulator. Their gene products may be similarly
analyzed to deduce biochemical pathways and the functions of unknown proteins.
Figure 1. Genome context may allow prediction of a metabolic pathway.
Advantages of using a Genome Neighborhood Network (GNN)
Unlike manual analysis of individual genome neighborhoods, which can be
extremely time-consuming when conducted on more than a handful of genes,
EFI-GNT can rapidly acquire and organize genome neighborhood information for
thousands of query genes in a high-throughput fashion. Because the genome
contexts for orthologous enzymes (same in vitro activity and in vivo metabolic
function) often are not conserved phylogenetically, the large-scale collection
and organization of genome context enabled by EFI-GNT may allow the
identification of the enzymes in metabolic pathways that are not co-localized
in the user’s "target" organism.
Creating a GNN
Using the sequences in an input Sequence Similarity Network (SSN) as queries,
the Genome Neighborhood Network (GNN) organizes the proteins encoded by the
genome neighborhood for each query sequence according to Pfam family.
The GNNs generated by EFI-GNT identify the protein families (using Pfam-defined
homology-based classifications) that are encoded by the genes proximal to genes
that encode the proteins in the input/query SSN dataset. The identities of
these families often provide valuable information about the types of reactions
catalyzed by the genome neighbors.
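
The idea of collecting the neighbors of a query gene and organizing them by Pfam family can be illustrated with a minimal sketch. It is not the EFI-GNT implementation: the gene names, the placeholder Pfam labels (PF_A, PF_B, ...), the flat ordered-list representation of a genome, and the window size are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical genome: genes listed in chromosomal order, each with a
# placeholder Pfam label. None of these identifiers are real annotations.
genome = [
    ("geneA", "PF_A"),
    ("geneB", "PF_B"),
    ("query1", "PF_Q"),
    ("geneC", "PF_D"),
    ("geneD", "PF_B"),
    ("geneE", "PF_E"),
]


def neighbors(genome, query_id, window=2):
    """Return the (gene, pfam) pairs within `window` genes of the query.

    The window size is an illustrative assumption, not a documented default.
    """
    ids = [gene_id for gene_id, _ in genome]
    i = ids.index(query_id)
    start = max(0, i - window)
    # Genes upstream of the query, then genes downstream; the query is excluded.
    return genome[start:i] + genome[i + 1:i + 1 + window]


def pfam_counts(genome, query_id, window=2):
    """Tally the Pfam families found in the query's genome neighborhood."""
    return Counter(pfam for _, pfam in neighbors(genome, query_id, window))


counts = pfam_counts(genome, "query1")
# PF_B is seen twice (geneB and geneD); PF_A and PF_D once each.
```

Repeating this tally for every query sequence in an SSN cluster gives the kind of per-cluster list of neighboring Pfam families that a GNN organizes.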
Two formats for the GNN information are provided
The GNNs from both formats can be filtered using Cytoscape to extract
information involving specific Pfam families and/or specific query clusters
from the input SSN. Given the large amount of information in a
GNN, simplification often is desirable. However, the considerable utility of
GNNs is made possible by the large amount of information that is accessible to
1 - SSN cluster Hub-Nodes
Each SSN cluster with queries that found neighbors is depicted as the hub-node
(center) in a cluster in the GNN; the identities of the Pfam families of the
neighbors are depicted as the spoke-nodes. This format enables identification
of potential pathway members that are functionally linked to the query
sequences in the cluster and, with the identities of the Pfam families,
inference of the reactions in the pathway. In this format, "over-fractionation"
of the SSN may result in the identification of incomplete pathways: the power
of the large-scale analysis is that phylogenetically diverse genome
organizations can be identified for orthologues, and this is lost when
orthologues are split across several SSN clusters. Synergistic interpretation
of both formats may allow this situation to be identified.
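
As a rough illustration of this hub-and-spoke layout, the sketch below groups neighbor observations so that each SSN cluster is a hub and each neighboring Pfam family is a spoke. The cluster names, query names, and Pfam labels are hypothetical, and the spoke value (the number of queries that found the family) is simply one convenient choice for the example.

```python
from collections import defaultdict

# Hypothetical observations: (SSN cluster, query sequence, neighbor Pfam family).
observations = [
    ("cluster1", "q1", "PF_A"),
    ("cluster1", "q2", "PF_A"),
    ("cluster1", "q2", "PF_B"),
    ("cluster2", "q3", "PF_A"),
    ("cluster2", "q4", "PF_C"),
]


def cluster_hubs(observations):
    """Map each SSN cluster (hub) to its neighbor Pfam families (spokes).

    Each spoke carries the number of distinct queries in the cluster that
    found that Pfam family in their genome neighborhood.
    """
    hubs = defaultdict(lambda: defaultdict(set))
    for cluster, query, pfam in observations:
        hubs[cluster][pfam].add(query)
    return {
        cluster: {pfam: len(queries) for pfam, queries in spokes.items()}
        for cluster, spokes in hubs.items()
    }


cluster_view = cluster_hubs(observations)
# cluster_view["cluster1"] == {"PF_A": 2, "PF_B": 1}
```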
2 - Pfam family Hub-Nodes
Each neighborhood Pfam family that was found is depicted as the hub-node
(center) of a cluster in the GNN; the identities of the SSN clusters with
queries that "found" the Pfam family among their neighbors are depicted as the
spoke-nodes in the cluster. This format enables an assessment of whether the
clusters in the query SSN are isofunctional: if multiple clusters find
the same Pfam family, the SSN may be "over-fractionated" so that orthologues
are found in multiple clusters. Alternatively, the Pfam family may contain
members with different functions that are found by different clusters in the
input SSN.
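
The inverted view can be sketched in the same spirit: each Pfam family becomes a hub, and the SSN clusters whose queries found it become the spokes. Families with more than one spoke are the ones worth inspecting. The data and names below are hypothetical, and the check only flags candidates; it cannot by itself distinguish over-fractionation from a functionally diverse Pfam family.

```python
from collections import defaultdict

# Hypothetical observations: (SSN cluster, query sequence, neighbor Pfam family).
observations = [
    ("cluster1", "q1", "PF_A"),
    ("cluster1", "q2", "PF_B"),
    ("cluster2", "q3", "PF_A"),
    ("cluster2", "q4", "PF_C"),
]


def pfam_hubs(observations):
    """Map each neighbor Pfam family (hub) to the SSN clusters that found it."""
    hubs = defaultdict(set)
    for cluster, _query, pfam in observations:
        hubs[pfam].add(cluster)
    return dict(hubs)


def shared_pfams(observations):
    """Return Pfam families found by more than one SSN cluster."""
    return {
        pfam: sorted(clusters)
        for pfam, clusters in pfam_hubs(observations).items()
        if len(clusters) > 1
    }


shared = shared_pfams(observations)
# shared == {"PF_A": ["cluster1", "cluster2"]}
```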