EFI - Genome Neighborhood Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.

Introduction to Gene Clustering

The genes that encode metabolic pathways in bacteria and fungi are often co-localized in the genome. Analyzing the genome neighborhood of an uncharacterized enzyme can therefore provide genomic context that offers insight into its activity and metabolic function.

While sequence homology alone may be sufficient to allow correct assignment of protein function in some cases, the combination of sequence homology and genome neighborhood information increases the confidence of predictions.

For efficient regulation of transcription, bacterial and fungal genes are often organized in operons and/or gene clusters. An operon may contain several genes under the transcriptional regulation of a single promoter. The products of these genes, usually enzymes, constitute a metabolic pathway.

Sometimes genes that encode the enzymes in a pathway are organized in neighboring clusters of two or more transcriptional units that are controlled by the same transcriptional regulator. Their gene products may be similarly analyzed to deduce biochemical pathways and the functions of unknown proteins.

Figure 1. Genome context may allow prediction of a metabolic pathway.

Advantages of using a Genome Neighborhood Network (GNN)

Unlike manual analysis of individual genome neighborhoods, which can be extremely time-consuming for more than a handful of genes, EFI-GNT can rapidly acquire and organize genome neighborhood information for thousands of query genes in a high-throughput fashion. Because the genome contexts for orthologous enzymes (same in vitro activity and in vivo metabolic function) often are not conserved phylogenetically, the large-scale collection and organization of genome context enabled by EFI-GNT may allow the identification of the enzymes in metabolic pathways that are not co-localized in the user’s "target" organism.

Creating a GNN

Using the sequences in an input Sequence Similarity Network (SSN) as queries, the Genome Neighborhood Network (GNN) organizes the proteins encoded by the genome neighborhood for each query sequence according to Pfam family.

The GNNs generated by EFI-GNT identify the protein families (using Pfam-defined homology-based classifications) that are encoded by the genes proximal to genes that encode the proteins in the input/query SSN dataset. The identities of these families often provide valuable information about the types of reactions catalyzed by the genome neighbors.
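The grouping described above can be sketched in a few lines of Python. This is only an illustration and not EFI-GNT's implementation: the toy genome, the made-up gene and Pfam identifiers, the `window` parameter (how many genes on each side of a query count as neighbors), and the function name `neighbors_by_pfam` are all assumptions introduced for the example.

```python
from collections import defaultdict

# Hypothetical toy genome: genes in chromosomal order, each with a Pfam family.
# Gene names and Pfam labels are made up for illustration.
GENOME = [
    ("geneA", "PF00001"),
    ("geneB", "PF00002"),
    ("query1", "PF00003"),
    ("geneC", "PF00004"),
    ("geneD", "PF00002"),
    ("geneE", "PF00005"),
]


def neighbors_by_pfam(genome, query_ids, window=2):
    """Group genes within `window` positions of each query by Pfam family.

    `window` is an assumed parameter for this sketch.
    """
    index = {gene_id: i for i, (gene_id, _) in enumerate(genome)}
    grouped = defaultdict(list)
    for query in query_ids:
        pos = index[query]
        start = max(0, pos - window)
        stop = min(len(genome), pos + window + 1)
        for i in range(start, stop):
            if i == pos:
                continue  # skip the query gene itself
            gene_id, pfam = genome[i]
            grouped[pfam].append((query, gene_id))
    return dict(grouped)


result = neighbors_by_pfam(GENOME, ["query1"], window=2)
for pfam, hits in sorted(result.items()):
    print(pfam, hits)
```

In this toy case the family PF00002 is encountered twice near the query, which is the kind of recurrence that draws attention to a family in a real GNN.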

Two formats for the GNN information are provided

The GNNs in both formats can be filtered using Cytoscape to extract information about specific Pfam families and/or specific query clusters from the input SSN. Given the large amount of information in a GNN, such simplification is often desirable. However, the considerable utility of GNNs comes from the large amount of information that is accessible to the user.

1 - SSN cluster Hub-Nodes

Each SSN cluster with queries that found neighbors is depicted as the hub-node (center) in a cluster in the GNN; the Pfam families of the neighbors are depicted as the spoke-nodes. This format enables identification of potential pathway members that are functionally linked to the query sequences in the cluster and, with the identities of the Pfam families, inference of the reactions in the pathway. In this format, "over-fractionation" of the SSN may result in the identification of incomplete pathways: the power of the large-scale analysis is that phylogenetically diverse genome organizations can be identified for orthologues, and that power is lost when orthologues are split across several clusters. Interpreting both formats together may allow this situation to be identified.

2 - Pfam family Hub-Nodes

Each neighborhood Pfam family that was found is depicted as the hub-node (center) in a cluster in the GNN; the SSN clusters whose queries "found" that Pfam family among their neighbors are depicted as the spoke-nodes. This format enables an assessment of whether the clusters in the query SSN are isofunctional: if multiple clusters find the same Pfam family, the SSN may be "over-fractionated" so that orthologues are found in multiple clusters. Alternatively, the Pfam family may contain members with different functions that are found by different clusters in the input SSN.
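The relationship between the two formats can be illustrated with a short Python sketch: both are hub-and-spoke views of the same set of (SSN cluster, neighbor Pfam family) observations, with the roles of hub and spoke swapped. The observation list, the identifiers, and the helper `build_hub_spoke` are hypothetical and serve only to show the idea; they do not reflect how EFI-GNT stores or builds its networks.

```python
from collections import defaultdict

# Hypothetical (SSN cluster, neighbor Pfam family) observations.
OBSERVATIONS = [
    ("cluster1", "PF00002"),
    ("cluster1", "PF00004"),
    ("cluster2", "PF00002"),
    ("cluster2", "PF00005"),
    ("cluster1", "PF00002"),  # repeated observation, collapsed by the set below
]


def build_hub_spoke(pairs, hub_index):
    """Return {hub: sorted spokes}. hub_index 0 = cluster hubs, 1 = Pfam hubs."""
    spoke_index = 1 - hub_index
    hubs = defaultdict(set)
    for pair in pairs:
        hubs[pair[hub_index]].add(pair[spoke_index])
    return {hub: sorted(spokes) for hub, spokes in hubs.items()}


# Format 1: SSN clusters as hubs, Pfam families as spokes.
cluster_hubs = build_hub_spoke(OBSERVATIONS, hub_index=0)

# Format 2: Pfam families as hubs, SSN clusters as spokes.
pfam_hubs = build_hub_spoke(OBSERVATIONS, hub_index=1)

# Pfam families found by more than one SSN cluster: candidates for either
# an over-fractionated SSN or a functionally heterogeneous Pfam family.
shared = {pfam: clusters for pfam, clusters in pfam_hubs.items() if len(clusters) > 1}
print(shared)
```

A Pfam hub with several cluster spokes, such as PF00002 here, is the situation described above; the sketch cannot tell the two explanations apart, which still requires inspecting the clusters themselves.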
