EFI - Genome Neighborhood Tool

EFI-GNT allows exploration of the genome neighborhoods for sequence similarity network (SSN) clusters in order to facilitate the assignment of function within protein families and superfamilies.

In GNT Submission, each sequence within an SSN is used as a query for interrogation of its genome neighborhood. The outputs are a colored SSN identifying clusters, Genome Neighborhood Networks (GNNs) providing statistical analysis of neighboring Pfam families, Genome Neighborhood Diagrams (GNDs), sets of IDs and sequences per cluster, and additional files. For the Retrieve Neighborhood Diagrams option, only GNDs are created.

The EFI web tool interface has been updated to improve user experience.
All functions remain unchanged.

The GNT database has been updated to use UniProt 2019_06 and ENA 140.

A listing of new features and other information pertaining to GNT is available on the release notes page.

In a submitted SSN, each sequence is considered as a query. Information associated with protein-encoding genes that are neighbors of input queries (within a defined window on either side) is collected from sequence files for prokaryotic (bacterial and archaeal) and fungal genomes in the European Nucleotide Archive (ENA) database. The neighboring genes are sorted into neighbor Pfam families. For each cluster, the co-occurrence frequencies of the identified neighboring Pfam families with the input queries are calculated.
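
The per-cluster calculation described above can be illustrated with a minimal Python sketch. It assumes that the co-occurrence frequency of a Pfam family is the fraction of queries in a cluster that have at least one neighbor in that family; the exact definition used by EFI-GNT may differ, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def cooccurrence(clusters, neighbors):
    """Return {cluster_id: {pfam: fraction of queries with that neighbor Pfam}}.

    clusters:  {cluster_id: [query_id, ...]}
    neighbors: {query_id: iterable of Pfam IDs found in the query's window}
    Assumption: a Pfam is counted once per query, however often it occurs.
    """
    result = {}
    for cluster_id, queries in clusters.items():
        total = len(queries)
        if total == 0:
            result[cluster_id] = {}
            continue
        counts = defaultdict(int)
        for query in queries:
            # set() ensures each Pfam is counted at most once per query
            for pfam in set(neighbors.get(query, ())):
                counts[pfam] += 1
        result[cluster_id] = {pfam: n / total for pfam, n in counts.items()}
    return result


# Made-up example data: one cluster with four queries
clusters = {1: ["Q1", "Q2", "Q3", "Q4"]}
neighbors = {
    "Q1": ["PF00001", "PF00002"],
    "Q2": ["PF00001"],
    "Q3": ["PF00001", "PF00001"],
    "Q4": [],
}
frequencies = cooccurrence(clusters, neighbors)
```

In this made-up example, PF00001 neighbors three of the four queries and PF00002 neighbors one of them.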

SSN File:
SSNs generated by EFI-EST are compatible with GNT analysis, even when they have been modified in Cytoscape. The exception is SSNs generated from FASTA sequences without the "Read FASTA header" option. The accepted format is XGMML (or XGMML compressed as zip).
Neighborhood Size:
The Pfam families for N neighboring genes upstream and downstream will be collected and analyzed. The default value is 10 and the minimum and maximum are 3 and 20, respectively.
Minimal Co-occurrence Percentage Lower Limit:
Filters out the neighboring Pfams for which the co-occurrence percentage is lower than the set value (noise filter). The default value is 20 and valid values are 0-100.
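
As an illustration of the two numeric settings above, the following Python sketch validates them against the stated ranges and applies the co-occurrence filter. The class and function names are hypothetical, and the sketch assumes that a Pfam whose percentage equals the threshold is kept, since only values lower than the set value are filtered out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GntParams:
    """Hypothetical container for the two GNT submission settings."""
    neighborhood_size: int = 10        # documented range: 3 to 20
    cooccurrence_percent: float = 20   # documented range: 0 to 100

    def __post_init__(self):
        if not 3 <= self.neighborhood_size <= 20:
            raise ValueError("neighborhood size must be between 3 and 20")
        if not 0 <= self.cooccurrence_percent <= 100:
            raise ValueError("co-occurrence percentage must be between 0 and 100")


def filter_pfams(percent_by_pfam, params):
    """Drop neighbor Pfams whose co-occurrence percentage is below the limit."""
    return {
        pfam: pct
        for pfam, pct in percent_by_pfam.items()
        if pct >= params.cooccurrence_percent
    }


params = GntParams()  # documented defaults: 10 neighbors, 20 percent
kept = filter_pfams({"PF00001": 75.0, "PF00002": 20.0, "PF00003": 5.0}, params)
```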

E-mail address:

You will receive an e-mail when your network has been processed.

Clicking on the headers below provides access to various ways of generating genomic neighborhood diagrams.

The provided sequence is used as the query for a BLAST search of the UniProt database. The retrieved sequences are used to generate genomic neighborhood diagrams.

Optional job title:
Maximum number of sequences retrieved (≤ 500; default: 200)
E-Value: Negative log of e-value for all-by-all BLAST (≥ 1; default: 5)
Neighborhood window size: Number of neighbors to retrieve on either side of the query sequence for each BLAST result (default: 10)
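
Because the E-Value field takes the negative log of the e-value, the default of 5 would correspond to an e-value of 10^-5 if the logarithm is base 10. The Python sketch below makes that conversion explicit and checks the documented limits. The function name, the base-10 assumption, and the lower bound of 1 on the number of sequences are assumptions for illustration.

```python
def blast_options(max_sequences=200, evalue_exponent=5):
    """Validate the single-sequence BLAST settings and convert the exponent.

    Assumptions: the negative log is base 10, and at least one sequence
    must be requested (the form only states the upper limit of 500).
    """
    if not 1 <= max_sequences <= 500:
        raise ValueError("maximum number of sequences must be between 1 and 500")
    if evalue_exponent < 1:
        raise ValueError("negative log of e-value must be at least 1")
    return {
        "max_sequences": max_sequences,
        "evalue": 10.0 ** -evalue_exponent,  # e.g. exponent 5 gives about 1e-5
    }


options = blast_options()
```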

E-mail address:

You will receive an e-mail when your network has been processed.

The genomic neighborhoods are retrieved for the UniProt, NCBI, EMBL-EBI ENA, and PDB identifiers that are provided in the input box below. Not all identifiers may exist in the EFI-GNT database, so the results will only include diagrams for sequences that were identified.

Alternatively, a file containing a list of IDs can be uploaded:
The acceptable format is text.
Optional job title:
Neighborhood window size: Number of neighbors to retrieve on either side of the query sequence for each BLAST result (default: 10)
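
Before uploading a text file of IDs, it can help to normalize the list locally. The Python sketch below assumes that IDs are separated by whitespace, commas, or semicolons; the page only states that the format is text, so the separators and the function name are assumptions.

```python
import re


def read_ids(text):
    """Split a block of text into unique IDs, keeping first-seen order.

    Assumption: IDs are separated by whitespace, commas, or semicolons.
    """
    seen = {}
    for token in re.split(r"[\s,;]+", text.strip()):
        if token and token not in seen:
            seen[token] = None  # dict keys preserve insertion order
    return list(seen)


# Made-up identifiers, with one duplicate
ids = read_ids("P12345\nQ9XYZ1, P12345\nWP_000000001.1\n")
```

Removing duplicates up front does not change which identifiers are submitted; it only makes the list easier to review.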

E-mail address:

You will receive an e-mail when your network has been processed.

The genomic neighborhoods are retrieved for the UniProt, NCBI, EMBL-EBI ENA, and PDB identifiers that are identified in the FASTA headers. Not all identifiers may exist in the EFI-GNT database, so the results will only include diagrams for sequences that were identified.

Alternatively, a file containing FASTA headers and sequences can be uploaded:
The acceptable format is text.
Optional job title:
Neighborhood window size: Number of neighbors to retrieve on either side of the query sequence for each BLAST result (default: 10)
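
To preview which identifiers a FASTA file is likely to contribute, the headers can be scanned locally. The Python sketch below handles only two cases: UniProt-style headers of the form `>sp|ACCESSION|NAME` or `>tr|ACCESSION|NAME`, and headers whose first token is itself the identifier. The rules EFI-GNT actually applies are not described here, so this is an approximation with a hypothetical function name.

```python
def ids_from_fasta(text):
    """Collect one candidate ID per FASTA header line.

    Assumption: the ID is either the accession field of a UniProt-style
    header (sp|ACC|NAME or tr|ACC|NAME) or the first token of the header.
    """
    ids = []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split(None, 1)
        if not fields:
            continue  # header with no content
        token = fields[0]
        parts = token.split("|")
        if len(parts) >= 3 and parts[0] in ("sp", "tr"):
            ids.append(parts[1])
        else:
            ids.append(token)
    return ids


# Made-up FASTA content
fasta = (
    ">sp|P12345|ABC_HUMAN Example protein\n"
    "MKVLAA\n"
    ">WP_000000001.1 hypothetical protein\n"
    "MAAGT\n"
)
header_ids = ids_from_fasta(fasta)
```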

E-mail address:

You will receive an e-mail when your network has been processed.

Upload a saved diagram data file for visualization.

Select a File to Upload:
The acceptable format is sqlite.
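
A saved diagram data file can be checked locally before upload. The Python sketch below opens a SQLite file read-only and lists its tables using the standard `sqlite_master` catalog. The schema of the EFI-GNT diagram file is not documented on this page, so the sketch makes no assumption about table names; the function name is hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(path):
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro avoids creating an empty database if the path is wrong
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

If the file is not a valid SQLite database, the query raises `sqlite3.DatabaseError`, which is a quick way to catch a wrong or corrupted file.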

E-mail address:

You will receive an e-mail when your network has been processed.

EFI-Genome Neighborhood Tool Overview

The EFI-GNT (EFI Genome Neighborhood Tool) is focused on placing protein families and superfamilies into a genomic context. A sequence similarity network (SSN) is used as an input. Each sequence within an SSN is used as a query for interrogation of its genome neighborhood.

EFI-GNT enables exploration of the genome neighborhoods for sequences in SSN clusters in order to facilitate their assignment of function.

EFI-GNT Acceptable Input

EFI-GNT is compatible with SSNs generated by the EFI-Enzyme Similarity Tool (EFI-EST). Acceptable SSNs are those generated for an entire Pfam and/or InterPro protein family (EFI-EST option B), a focused region of a family (option A), a set of protein sequences that can be identified from FASTA headers (option C with “Header Reading” activated), or a list of recognizable UniProt and/or NCBI IDs (option D). SSNs manually modified within Cytoscape are accepted, as are SSNs that have been colored using the "Color SSN Utility". SSNs generated from FASTA sequences (option C) without the "Read Header" option activated are not accepted.
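
The submission form accepts SSNs as XGMML, which is an XML graph format made of `node` and `edge` elements. The Python sketch below counts nodes and edges in an XGMML document as a quick sanity check before upload. It matches elements by local tag name so that it works with or without the XGMML namespace. The function names are hypothetical, and the attributes EFI-EST writes on each node are not assumed beyond `label`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]


def summarize_xgmml(xml_text):
    """Return (node labels, edge count) for an XGMML document."""
    root = ET.fromstring(xml_text)
    labels = [
        el.get("label") for el in root.iter() if local_name(el.tag) == "node"
    ]
    edge_count = sum(1 for el in root.iter() if local_name(el.tag) == "edge")
    return labels, edge_count


# Made-up minimal network: two sequences joined by one edge
sample = """
<graph label="demo" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="P12345"/>
  <node id="2" label="Q9XYZ1"/>
  <edge source="1" target="2" label="P12345,Q9XYZ1"/>
</graph>
"""
labels, edge_count = summarize_xgmml(sample)
```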

Principle of GNT Analysis

EFI-GNT provides statistical analysis, per SSN cluster, of genome context for bacterial, archaeal and fungal sequences, in order to identify possible functional linkage. Sequences from the analyzed SSN are used as queries for retrieval of their genome neighborhoods. The user specifies the neighborhood size (±N ORFs from the SSN query) and the minimum query-neighbor co-occurrence frequency for the outputs.
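
The ±N window can be pictured as a slice of an ordered gene list around the query. The Python sketch below assumes a linear contig in which genes are listed in genomic order and the window is simply truncated at either end; circular genomes and strand orientation are ignored, and the function name is hypothetical.

```python
def neighborhood(genes, query_index, n=10):
    """Return (upstream, downstream) genes within n positions of the query.

    Assumptions: genes are in genomic order on a linear contig, and the
    window is truncated at the contig ends rather than wrapping around.
    """
    if not 0 <= query_index < len(genes):
        raise IndexError("query_index is outside the gene list")
    start = max(0, query_index - n)
    upstream = genes[start:query_index]
    downstream = genes[query_index + 1:query_index + 1 + n]
    return upstream, downstream


# Made-up contig of seven genes; the query is gene "D"
genes = ["A", "B", "C", "D", "E", "F", "G"]
upstream, downstream = neighborhood(genes, genes.index("D"), n=2)
```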

EFI-GNT Output

EFI-GNT identifies each SSN cluster, assigns it a unique color, and produces a colored SSN. It then interrogates the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena) to obtain the genome context of each sequence, sorts neighbors into Pfam families, and provides three specific outputs. First, a GNN in which each SSN cluster is a hub node whose spoke nodes are the identified neighboring Pfam families (for identifying candidates for pathway enzymes); second, a GNN in which each neighbor Pfam family is a hub node whose spoke nodes are the SSN clusters that identify this Pfam as a neighbor (for identifying divergent clusters that are orthologues); and third, genome neighborhood diagrams (GNDs) for visual representations of the neighborhoods for the sequences in each SSN cluster (for visual inspection of synteny and the presence/absence of functionally linked proteins).
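
The two GNN layouts are two views of the same cluster-to-Pfam associations. The Python sketch below builds both hub-and-spoke views from a list of (cluster, Pfam) pairs. The function name and input format are hypothetical and say nothing about how EFI-GNT stores its networks.

```python
from collections import defaultdict


def hub_and_spoke(pairs):
    """Build both GNN views from (cluster_id, pfam) associations.

    Returns (cluster_hubs, pfam_hubs):
      cluster_hubs: {cluster_id: set of neighbor Pfams}   cluster as hub
      pfam_hubs:    {pfam: set of cluster_ids}            Pfam as hub
    """
    cluster_hubs = defaultdict(set)
    pfam_hubs = defaultdict(set)
    for cluster_id, pfam in pairs:
        cluster_hubs[cluster_id].add(pfam)
        pfam_hubs[pfam].add(cluster_id)
    return dict(cluster_hubs), dict(pfam_hubs)


# Made-up associations that passed the co-occurrence filter
pairs = [(1, "PF00001"), (1, "PF00002"), (2, "PF00001")]
cluster_hubs, pfam_hubs = hub_and_spoke(pairs)
```

In this made-up example, PF00001 is a spoke of both clusters in the first view and a hub with two cluster spokes in the second, which is the pattern the second GNN is meant to surface.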

Direct Genomic Neighborhood Diagrams (GND) Generation

The "Retrieve neighborhood diagrams" option allows exploration of neighboring genes for specific queries. You can submit a single sequence that is used as the query for a BLAST search of the UniProt database; the retrieved sequences are then used to generate GNDs. GNDs can also be generated from a provided list of IDs, or from FASTA sequences by collecting IDs from the FASTA headers.

Figure 1: Examples of a colored SSN (left) and a hub-and-spoke cluster from a GNN (right).

Recommended Reading

Rémi Zallot, Nils Oberg, John A. Gerlt, "Democratized" genomic enzymology web tools for functional assignment, Current Opinion in Chemical Biology, Volume 47, 2018, Pages 77-85, https://doi.org/10.1016/j.cbpa.2018.09.009

John A. Gerlt, Genomic enzymology: Web tools for leveraging protein family sequence–function space and genome context to discover novel functions, Biochemistry, 2017

UniProt Version: 2019_06
InterPro Version: 75
ENA Version: 140

If you use the EFI web tools, please cite us.
