The EFI-Genome Neighborhood Tool (EFI-GNT) allows the exploration of the physical association of genes on genomes, i.e.
gene clustering. EFI-GNT enables a user to retrieve, display, and interact with genome neighborhood information for
large datasets of sequences.
EFI-Genome Neighborhood Tool Overview
Other tools, e.g., IMG (https://img.jgi.doe.gov) and PATRIC (https://www.patricbrc.org), compare gene neighborhoods
among multiple prokaryotic genomes in order to infer phylogenetic relationships. In contrast, EFI-GNT compares the
genome neighborhoods for clusters of similar protein sequences in order to facilitate the assignment of function within
protein families and superfamilies.
EFI-GNT is focused on placing protein families and superfamilies into genomic context. A sequence similarity network
(SSN) with defined protein clusters is used as the input. Each sequence within an SSN is used as a query for
interrogation of its genome neighborhood.
EFI-GNT acceptable input
The sequence datasets are generated from an SSN produced by the EFI-Enzyme Similarity Tool (EFI-EST). Acceptable
SSNs are those generated for an entire Pfam and/or InterPro protein family (from Option B of EFI-EST), a focused region
of a family (from Option A of EFI-EST), a set of protein sequences that can be identified from FASTA headers (from
Option C of EFI-EST with header reading), or a list of recognizable UniProt and/or NCBI IDs (from Option D of EFI-EST).
An SSN that originated from any of the EST options and was manually modified within Cytoscape is also acceptable, as is
an SSN from any of these options that has been colored using the "Color SSN Utility" of EFI-EST.
Principle of GNT analysis
Protein-encoding genes that are neighbors of input queries (within a defined window on either side) are collected from
sequence files for prokaryotic (bacterial and archaeal) and fungal genomes in the European Nucleotide Archive (ENA)
database. Two quantities are then calculated: the co-occurrence frequencies of the identified neighboring sequences
with the input queries, and the absolute values of the distances in open reading frames (orfs) between the queries and
neighbors. The calculated information is provided as Genome Neighborhood Networks (GNNs), in addition to a colored
version of the input SSN that aids analysis of the GNNs.
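The two calculated quantities can be illustrated with a minimal Python sketch. This is not the EFI-GNT implementation. It assumes that co-occurrence frequency means the fraction of queries that have at least one neighbor from a given Pfam family inside the window, and that each neighbor is recorded as a signed offset in orfs from its query. The function name, the input layout, the default window of 10, and the example IDs are all hypothetical.

```python
from collections import defaultdict


def summarize_neighbors(neighborhoods, window=10):
    """Summarize neighbors per Pfam family (illustrative only).

    neighborhoods: dict mapping a query ID to a list of
        (pfam_family, signed_offset_in_orfs) tuples (hypothetical layout).
    window: maximum absolute distance, in orfs, on either side of the query.
    """
    queries_with_family = defaultdict(set)  # family -> queries that have it nearby
    distances = defaultdict(list)           # family -> absolute query-neighbor distances

    for query, neighbors in neighborhoods.items():
        for pfam, offset in neighbors:
            dist = abs(offset)  # direction is discarded, only distance is kept
            if dist == 0 or dist > window:
                continue  # skip the query itself and anything outside the window
            queries_with_family[pfam].add(query)
            distances[pfam].append(dist)

    total_queries = len(neighborhoods)
    return {
        pfam: {
            # assumed definition: fraction of queries with this family nearby
            "cooccurrence": len(queries) / total_queries,
            "distances": sorted(distances[pfam]),
        }
        for pfam, queries in queries_with_family.items()
    }


# Made-up example data: four queries from one SSN cluster.
neighborhoods = {
    "Q1": [("PF00001", -2), ("PF00002", 3)],
    "Q2": [("PF00001", 1), ("PF00003", 12)],  # PF00003 lies outside the window
    "Q3": [("PF00001", -4)],
    "Q4": [("PF00002", -1)],
}

summary = summarize_neighbors(neighborhoods, window=10)
for pfam, stats in sorted(summary.items()):
    print(pfam, stats["cooccurrence"], stats["distances"])
```

Collecting query IDs in a set means that a query with two neighbors from the same family is counted once toward the co-occurrence frequency, while both distances are still recorded.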
EFI-GNT generates the GNN in two formats.
The GNNs provide the UniProt accession IDs for the queries and the neighbors, the Pfam families for the neighbors, and
both the query-neighbor distances (in orfs) and co-occurrence frequencies. The GNNs and colored SSN are downloaded and
then visualized and analyzed in Cytoscape.
The user can use Cytoscape to filter the GNNs for a range of query-neighbor distances and/or co-occurrence frequencies
to enable the identification of functionally related proteins/enzymes, with shorter distances and greater co-occurrence
frequencies suggesting functional linkage in a metabolic pathway. With the identities of the Pfam families for the
neighbors, the user may be able to infer the in vitro enzymatic activities of the queries and neighbors and predict the
reactions in the metabolic pathway in which they participate.
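In practice this filtering is done interactively in Cytoscape. For readers who prefer to see the logic written out, the following Python sketch applies the same kind of filter to a list of query-neighbor records. The record layout, the key names (`median_distance`, `cooccurrence`), and the threshold values are assumptions chosen for illustration. They are not the attribute names used in an actual GNN file.

```python
def filter_gnn_edges(edges, min_distance=1, max_distance=5, min_cooccurrence=0.5):
    """Keep records within a distance range and above a co-occurrence cutoff.

    edges: list of dicts with hypothetical keys 'median_distance' (in orfs)
        and 'cooccurrence' (a fraction between 0 and 1).
    """
    return [
        edge
        for edge in edges
        if min_distance <= edge["median_distance"] <= max_distance
        and edge["cooccurrence"] >= min_cooccurrence
    ]


# Made-up records: one per (SSN cluster, neighbor Pfam family) pair.
edges = [
    {"cluster": 1, "pfam": "PF00001", "median_distance": 2, "cooccurrence": 0.75},
    {"cluster": 1, "pfam": "PF00002", "median_distance": 8, "cooccurrence": 0.60},
    {"cluster": 2, "pfam": "PF00003", "median_distance": 1, "cooccurrence": 0.10},
]

kept = filter_gnn_edges(edges, min_distance=1, max_distance=5, min_cooccurrence=0.5)
for edge in kept:
    print(edge["cluster"], edge["pfam"], edge["median_distance"], edge["cooccurrence"])
```

With these example thresholds only the first record is kept: the second is too far from the query, and the third co-occurs with too few queries. Suitable cutoffs depend on the protein family under study.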
Figure 1: Examples of a colored SSN (left) and a hub-and-spoke cluster from a GNN (right).