Sequence Similarity Networks Tool

Genome Neighborhood Networks Tool

The EFI-Genome Neighborhood Tool (EFI-GNT) allows the exploration of the physical association of genes on genomes, i.e. gene clustering. EFI-GNT enables a user to retrieve, display, and interact with genome neighborhood information for large datasets of sequences.

Tutorial

EFI-Genome Neighborhood Tool Overview

Other tools, e.g., IMG (https://img.jgi.doe.gov) and PATRIC (https://www.patricbrc.org), compare gene neighborhoods among multiple prokaryotic genomes in order to infer phylogenetic relationships. EFI-GNT instead compares the genome neighborhoods for clusters of similar protein sequences in order to facilitate the assignment of function within protein families and superfamilies.

EFI-GNT is focused on placing protein families and superfamilies into a genomic context. A sequence similarity network (SSN) with defined protein clusters is used as the input. Each sequence within the SSN is used as a query for interrogation of its genome neighborhood.

EFI-GNT acceptable input

The sequence datasets are generated from an SSN produced by the EFI-Enzyme Similarity Tool (EFI-EST). Acceptable SSNs are generated for an entire Pfam and/or InterPro protein family (from Option B of EFI-EST), a focused region of a family (from Option A of EFI-EST), a set of protein sequences that can be identified from FASTA headers (from Option C of EFI-EST with header reading), or a list of recognizable UniProt and/or NCBI IDs (from Option D of EFI-EST). An SSN that was manually modified within Cytoscape and that originated from any of the EFI-EST options is also acceptable, as is an SSN that was colored using the "Color SSN Utility" of EFI-EST and that originated from any of the acceptable options.

Principle of GNT analysis

Protein-encoding genes that are neighbors of the input queries (within a defined window on either side) are collected from sequence files for prokaryotic (bacterial and archaeal) and fungal genomes in the European Nucleotide Archive (ENA) database. For the identified neighbors, the co-occurrence frequencies with the input queries are calculated, as are the absolute values of the query-neighbor distances in open reading frames (orfs). The calculated information is provided as Genome Neighborhood Networks (GNNs), in addition to a colored version of the input SSN that aids analysis of the GNNs.
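The calculation described above can be illustrated with a minimal Python sketch. It is not the EFI-GNT implementation: the function name, the data layout (each genome as an ordered list of protein ID and Pfam family pairs), and the toy IDs are hypothetical. It also assumes that co-occurrence frequency means the fraction of queries with at least one neighbor in a given Pfam family inside the window, and it treats every genome as a linear list of orfs.

```python
from collections import defaultdict
from statistics import median


def neighborhood_stats(genomes, queries, window=10):
    """Summarize the Pfam families found near query genes.

    genomes: dict of genome ID -> ordered list of (protein_id, pfam) tuples
             (hypothetical layout; one tuple per orf, in genome order).
    queries: set of protein IDs taken from the input SSN.
    window:  number of orfs inspected on either side of each query.
    """
    distances = defaultdict(list)      # pfam -> absolute orf distances
    queries_with = defaultdict(set)    # pfam -> queries having such a neighbor
    n_queries = 0

    for genes in genomes.values():
        for i, (protein_id, _) in enumerate(genes):
            if protein_id not in queries:
                continue
            n_queries += 1
            start = max(0, i - window)
            stop = min(len(genes), i + window + 1)
            for j in range(start, stop):
                if j == i:
                    continue
                pfam = genes[j][1]
                if pfam is None:       # neighbor without a Pfam assignment
                    continue
                distances[pfam].append(abs(j - i))
                queries_with[pfam].add(protein_id)

    return {
        pfam: {
            # Assumed definition: fraction of queries with this neighbor family.
            "co_occurrence": len(queries_with[pfam]) / n_queries,
            "median_distance": median(dists),
        }
        for pfam, dists in distances.items()
    }


# Toy data with made-up identifiers.
genomes = {
    "g1": [("A1", "PF1"), ("Q1", "PF0"), ("B1", "PF2"), ("C1", "PF3")],
    "g2": [("Q2", "PF0"), ("X2", "PF9"), ("B2", "PF2")],
}
stats = neighborhood_stats(genomes, queries={"Q1", "Q2"}, window=2)
for pfam, values in sorted(stats.items()):
    print(pfam, values)
```

In this toy example the family PF2 neighbors both queries, so it receives the highest co-occurrence frequency, while the other families each neighbor only one query.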

EFI-GNT output

EFI-GNT generates two formats of the Genome Neighborhood Network (GNN) as well as a colored version of the input SSN that aids analysis of the GNNs.

The UniProt accession IDs for the queries and the neighbors, the Pfam families for the neighbors, and both the query-neighbor distances (in orfs) and co-occurrence frequencies are provided in the GNNs. The GNNs and colored SSN can be downloaded and then visualized and analyzed using Cytoscape.

The user can use Cytoscape to filter the GNNs for a range of query-neighbor distances and/or co-occurrence frequencies to enable the identification of functionally related proteins/enzymes, with shorter distances and greater co-occurrence frequencies suggesting functional linkage in a metabolic pathway. With the identities of the Pfam families for the neighbors, the user may be able to infer the in vitro enzymatic activities of the queries and neighbors and predict the reactions in the metabolic pathway in which they participate.
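The filtering logic itself is simple and can be sketched outside Cytoscape. The following Python example is only an illustration: the record fields (cluster, pfam, co_occurrence, median_distance), the function name, and the threshold values are hypothetical and do not correspond to actual GNN attribute names.

```python
def filter_gnn(edges, max_distance, min_co_occurrence):
    """Keep query-cluster/neighbor-family pairs that are both close and frequent."""
    return [
        edge
        for edge in edges
        if edge["median_distance"] <= max_distance
        and edge["co_occurrence"] >= min_co_occurrence
    ]


# Toy records with made-up values.
edges = [
    {"cluster": 1, "pfam": "PF2", "co_occurrence": 0.90, "median_distance": 1},
    {"cluster": 1, "pfam": "PF3", "co_occurrence": 0.15, "median_distance": 2},
    {"cluster": 1, "pfam": "PF9", "co_occurrence": 0.80, "median_distance": 8},
]

kept = filter_gnn(edges, max_distance=3, min_co_occurrence=0.5)
for edge in kept:
    print(edge["cluster"], edge["pfam"])
```

Only the PF2 record passes both thresholds here: PF3 is close to the queries but rarely co-occurs with them, and PF9 co-occurs often but lies too far away.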

Figure 1: Examples of colored SSN (left) and a hub-and-spoke cluster from a GNN (right).
