EFI - Genome Neighborhood Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).

The tools are available without charge or license to both academic and commercial users.

Please cite your use of the EFI tools:

Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182. https://doi.org/10.1021/acs.biochem.9b00735

Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023. https://doi.org/10.1016/j.jmb.2023.168018

Reorganization of UniProtKB

With the current 2026_02 release, the UniProtKB database is reorganized to include an expanded number of Reference Proteomes to better capture biodiversity. This includes the removal of proteins from taxonomically unclassified organisms, i.e., those without a binomial species name (genus and species). The total number of accessions in UniProtKB has been reduced from 253,635,358 in the “legacy” 2025_03 release to 149,810,139 in the current 2026_02 release.

We are providing the option to select either the “legacy” 2025_03 database or the current UniProtKB database (now 2026_02) when generating SSNs. You can select the database in the “Database” accordion on the pages for the EFI-EST options, the EFI-GNT tool, and the Taxonomy Tool. We suggest that you compare the SSNs, GNNs, and GNDs generated from both databases as you explore the information you are seeking.

Because the “legacy” 2025_03 release contains UniProt IDs that are no longer active on the UniProt web site, we provide the Metadata Tool, which gives access to the node attribute metadata for the UniProt IDs in the “legacy” 2025_03 release.

EFI-GNT allows exploration of the genome neighborhoods for sequence similarity network (SSN) clusters in order to facilitate the assignment of function within protein families and superfamilies.

In GNT Submission, each sequence within an SSN is used as a query for interrogation of its genome neighborhood. The outputs are a colored SSN identifying clusters, Genome Neighborhood Networks (GNNs) providing statistical analysis of neighboring Pfam families, Genome Neighborhood Diagrams (GNDs), sets of IDs and sequences per cluster, and additional files. For the Retrieve Neighborhood Diagrams option, only GNDs are created.

A listing of new features and other information pertaining to GNT is available on the release notes page.

The GNT database uses UniProt 2026_02 and ENA data downloaded in June 2026.

GNT Submission
Retrieve Neighborhood Diagrams
View Saved Diagrams
Tutorial

In a submitted SSN, each sequence is considered as a query. Information associated with protein-encoding genes that are neighbors of the input queries (within a defined window on either side) is collected from sequence files for prokaryotic (bacterial and archaeal) and fungal genomes in the European Nucleotide Archive (ENA) database. The neighboring genes are sorted into neighbor Pfam families. For each cluster, the co-occurrence frequencies of the identified neighboring Pfam families with the input queries are calculated.
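The per-cluster calculation described above can be sketched as follows. This is only an illustration, not the EFI-GNT implementation: it assumes the co-occurrence frequency of a Pfam family is the fraction of queries in a cluster that have at least one neighbor in that family, and the function name, input layout, and IDs are hypothetical.

```python
from collections import defaultdict

def cooccurrence_by_cluster(neighborhoods):
    """Return {cluster: {pfam: fraction of queries with that Pfam nearby}}.

    `neighborhoods` is an iterable of (cluster_id, query_id, neighbor_pfams),
    where neighbor_pfams is the set of Pfam families found in the window
    around the query. This layout is an assumption made for illustration.
    """
    queries = defaultdict(set)                      # cluster -> all query IDs
    hits = defaultdict(lambda: defaultdict(set))    # cluster -> pfam -> query IDs
    for cluster, query, pfams in neighborhoods:
        queries[cluster].add(query)
        for pfam in pfams:
            hits[cluster][pfam].add(query)
    return {
        cluster: {pfam: len(qs) / len(queries[cluster]) for pfam, qs in fams.items()}
        for cluster, fams in hits.items()
    }

# Made-up example: cluster 1 has four queries, cluster 2 has one.
example = [
    (1, "Q1", {"PF00001", "PF00002"}),
    (1, "Q2", {"PF00001"}),
    (1, "Q3", set()),
    (1, "Q4", {"PF00001"}),
    (2, "Q5", {"PF00003"}),
]
freqs = cooccurrence_by_cluster(example)
print(freqs[1]["PF00001"])  # 3 of 4 queries -> 0.75
```

A query with no Pfam neighbors (Q3 here) still counts toward the cluster size in this sketch, which lowers the frequencies for that cluster.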

SSN File:

SSNs generated by EFI-EST are compatible with GNT analysis, even when they have been modified in Cytoscape. The exception is SSNs generated from FASTA sequences without the "Read FASTA header" option. The accepted format is XGMML (or compressed XGMML as zip).
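Because XGMML is an XML format, a quick local check of a network file before upload can be done with a standard XML parser. The sketch below is a minimal example, assuming the usual XGMML layout of `node` elements carrying `att` children; the sample document, node labels, and attribute name are made up and are not taken from a real EFI-EST network.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]

def read_xgmml_nodes(xml_text):
    """Return {node label: {attribute name: value}} for every node element."""
    root = ET.fromstring(xml_text)
    nodes = {}
    for element in root.iter():
        if local_name(element.tag) != "node":
            continue
        attributes = {
            child.get("name"): child.get("value")
            for child in element
            if local_name(child.tag) == "att"
        }
        nodes[element.get("label")] = attributes
    return nodes

# Tiny made-up network with two nodes and one edge.
SAMPLE = """<graph label="demo" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="A0A000" label="A0A000">
    <att name="Sequence Length" type="integer" value="312"/>
  </node>
  <node id="B1B111" label="B1B111"/>
  <edge source="A0A000" target="B1B111" label="A0A000,B1B111"/>
</graph>"""

nodes = read_xgmml_nodes(SAMPLE)
print(sorted(nodes))  # ['A0A000', 'B1B111']
```

For a zip-compressed network, the XGMML member would first need to be extracted (for example with the standard `zipfile` module) before parsing.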

If an SSN that contains UniRef sequences is uploaded to the GNT, the resulting GNDs will also include GNDs for UniRef90 cluster IDs that group together UniProt sequences by 90% sequence identity. For networks that also contain UniRef50 sequences, GNDs will also include UniProt sequences that are grouped by 50% sequence identity.

Neighborhood Size:

The Pfam families for N neighboring genes upstream and downstream will be collected and analyzed. The default value is 10 and the minimum and maximum are 3 and 20, respectively.
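To make the window concrete, the following sketch selects up to N genes on each side of a query from an ordered gene list and enforces the 3 to 20 limits stated above. It assumes a linear contig in which the window is simply truncated at the ends; the function and gene names are hypothetical, and how EFI-GNT treats contig boundaries or circular genomes is not specified here.

```python
def neighborhood(genes, query_index, n=10):
    """Return (upstream, downstream) lists of up to n genes around the query."""
    if not 3 <= n <= 20:
        raise ValueError("neighborhood size must be between 3 and 20")
    start = max(0, query_index - n)                 # truncate at contig start
    upstream = genes[start:query_index]
    downstream = genes[query_index + 1:query_index + 1 + n]
    return upstream, downstream

genes = [f"gene{i}" for i in range(30)]
up, down = neighborhood(genes, 5, n=10)
print(len(up), len(down))  # 5 10: only five genes exist upstream of gene5
```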

Minimal Co-occurrence Percentage Lower Limit:

Filters out the neighboring Pfams for which the co-occurrence percentage is lower than the set value (noise filter). The default value is 20 and valid values are 0-100.
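As an illustration of the noise filter, the sketch below keeps only the Pfam families whose co-occurrence percentage is at or above the chosen limit. The function name and example percentages are hypothetical, and treating a value exactly equal to the limit as retained follows the wording above ("lower than the set value" is removed).

```python
def filter_pfams(cooccurrence_percent, lower_limit=20):
    """Drop Pfam families whose co-occurrence percentage is below lower_limit."""
    if not 0 <= lower_limit <= 100:
        raise ValueError("lower limit must be between 0 and 100")
    return {
        pfam: percent
        for pfam, percent in cooccurrence_percent.items()
        if percent >= lower_limit
    }

# Made-up percentages for one SSN cluster.
cluster_pfams = {"PF00001": 75.0, "PF00002": 20.0, "PF00003": 5.0}
kept = filter_pfams(cluster_pfams)
print(sorted(kept))  # ['PF00001', 'PF00002']
```

Setting the limit to 0 retains every neighboring Pfam family, while higher values leave only the most consistently co-occurring ones.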

Database version:

Due to the UniProtKB database reorganization, the EFI tools will continue to provide access to the UniProt 2025_03/InterPro 106 database ("Legacy_UniProt_2025_03_InterPro_106") in addition to the most recent UniProt release.

E-mail address:

You will receive an e-mail when your network has been processed.

Clicking on the headers below provides access to various ways of generating genomic neighborhood diagrams.

Single Sequence BLAST
Sequence ID Lookup
FASTA Sequence Lookup

The provided sequence is used as the query for a BLAST search of the UniProt database. The retrieved sequences are used to generate genomic neighborhood diagrams.

If the Sequence Database is set to UniRef90, the resulting GNDs will also include GNDs for UniRef90 cluster IDs that group together UniProt sequences by 90% sequence identity. For UniRef50, the GNDs will also include UniProt sequences that are grouped by 50% sequence identity. Any of the "Exclude Fragments" options will exclude UniProt-defined sequence fragments.

Optional job title:
Maximum BLAST Sequences:		Maximum number of sequences retrieved (≤ 500; default: 200)
E-Value:		Negative log of e-value for all-by-all BLAST (≥ 1; default: 5)
Neighborhood window size:		Number of neighbors to retrieve on either side of the query sequence for each BLAST result (default: 10)
Database version:		Due to the UniProtKB database reorganization, the EFI tools will continue to provide access to the UniProt 2025_03/InterPro 106 database ("Legacy_UniProt_2025_03_InterPro_106") in addition to the most recent UniProt release.
Sequence database:		Sequence database for retrieving sequences (default: UniProt)
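
The E-Value field above takes an exponent rather than the e-value itself. As a small worked example, and assuming the logarithm is base 10 (the usual convention, but an assumption here), the default of 5 corresponds to an e-value threshold of 1e-5. The helper names are hypothetical.

```python
import math

def evalue_from_setting(neg_log):
    """Convert the form value (negative log10 of the e-value) to an e-value."""
    if neg_log < 1:
        raise ValueError("the setting must be at least 1")
    return 10.0 ** -neg_log

def setting_from_evalue(evalue):
    """Inverse conversion: e-value back to its negative log10."""
    return -math.log10(evalue)

threshold = evalue_from_setting(5)   # default setting
print(threshold)
```

Larger settings therefore mean stricter thresholds: a setting of 20 admits only hits with e-values at or below roughly 1e-20.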

E-mail address:

You will receive an e-mail when your network has been processed.

The genomic neighborhoods are retrieved for the UniProt, NCBI, EMBL-EBI ENA, and PDB identifiers that are provided in the input box below. Not all identifiers may exist in the EFI-GNT database, so the results will only include diagrams for sequences that were identified.

Alternatively, a file containing a list of IDs can be uploaded:

The acceptable format is text.

Optional job title:
Neighborhood window size:		Number of neighbors to retrieve on either side of the query sequence for each BLAST result (default: 10)
Database version:		Due to the UniProtKB database reorganization, the EFI tools will continue to provide access to the UniProt 2025_03/InterPro 106 database ("Legacy_UniProt_2025_03_InterPro_106") in addition to the most recent UniProt release.
Sequence database:		Sequence database for retrieving sequences (default: UniProt)

E-mail address:

You will receive an e-mail when your network has been processed.

The genomic neighborhoods are retrieved for the UniProt, NCBI, EMBL-EBI ENA, and PDB identifiers that are identified in the FASTA headers. Not all identifiers may exist in the EFI-GNT database, so the results will only include diagrams for sequences that were identified.

Alternatively, a file containing FASTA headers and sequences can be uploaded:

The acceptable format is text.
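
To check in advance which headers in a FASTA file are likely to yield an identifier, something like the following sketch can be run locally. It only looks for UniProt accessions, using the accession pattern published by UniProt; NCBI, ENA, and PDB identifiers are not handled, and this is not the parser EFI-GNT itself uses. The example headers and descriptions are illustrative.

```python
import re

# UniProt accession pattern (6 or 10 characters), wrapped in word boundaries.
UNIPROT_RE = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

def uniprot_ids_from_fasta(text):
    """Return the first UniProt-like accession found in each FASTA header."""
    ids = []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue                      # skip sequence lines
        match = UNIPROT_RE.search(line)
        if match:
            ids.append(match.group(0))
    return ids

fasta = """>sp|P12345|EXAMPLE_ONE Example protein
MALLHSARVL
>tr|A0A024R161|EXAMPLE_TWO Example protein
MGSS
>my_unlabelled_sequence
MKKK
"""
found = uniprot_ids_from_fasta(fasta)
print(found)  # ['P12345', 'A0A024R161']
```

Headers without a recognizable identifier, such as the third one here, would produce no diagram under the behavior described above.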

Optional job title:
Neighborhood window size:		Number of neighbors to retrieve on either side of the query sequence for each BLAST result (default: 10)
Database version:		Due to the UniProtKB database reorganization, the EFI tools will continue to provide access to the UniProt 2025_03/InterPro 106 database ("Legacy_UniProt_2025_03_InterPro_106") in addition to the most recent UniProt release.

E-mail address:

You will receive an e-mail when your network has been processed.

EFI-Genome Neighborhood Tool Overview

The EFI-GNT (EFI Genome Neighborhood Tool) is focused on placing protein families and superfamilies into a genomic context. A sequence similarity network (SSN) is used as the input. Each sequence within the SSN is used as a query for interrogation of its genome neighborhood.

EFI-GNT enables exploration of the genome neighborhoods for sequences in SSN clusters in order to facilitate their assignment of function.

EFI-GNT Acceptable Input

EFI-GNT is compatible with SSNs generated by the EFI-Enzyme Similarity Tool (EFI-EST). Acceptable SSNs are generated for an entire Pfam and/or InterPro protein family (EFI-EST option B), a focused region of a family (option A), a set of protein sequences that can be identified from FASTA headers (from option C with “Header Reading” activated), or a list of recognizable UniProt and/or NCBI IDs (from option D). SSNs manually modified within Cytoscape are accepted. SSNs that have been colored using the "Color SSN Utility" are also accepted. SSNs generated from FASTA sequences (option C) without the "Read Header" option activated are not accepted.

Principle of GNT Analysis

EFI-GNT provides statistical analysis, per SSN cluster, of genome context for bacterial, archaeal, and fungal sequences in order to identify possible functional linkage. Sequences from the analyzed SSN are used as queries for retrieval of their genome neighborhoods. The user specifies the neighborhood size (±N ORFs from the SSN query) and the minimum query-neighbor co-occurrence frequency for the outputs.

EFI-GNT Output

EFI-GNT identifies each SSN cluster, assigns it a unique color, and produces a colored SSN. It then interrogates the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena) to obtain the genome context of each sequence, sorts neighbors into Pfam families, and provides three specific outputs. First, a GNN in which each SSN cluster is a hub node whose spoke nodes are the identified neighboring Pfam families (for identifying candidates for pathway enzymes). Second, a GNN in which each neighbor Pfam family is a hub node whose spoke nodes are the SSN clusters that identify this Pfam family as a neighbor (for identifying divergent clusters that are orthologues). Third, genome neighborhood diagrams (GNDs) that represent the neighborhoods for the sequences in each SSN cluster (for visual inspection of synteny and the presence/absence of functionally linked proteins).
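
The two GNN outputs are two views of the same cluster-to-Pfam relationships, which the following sketch illustrates with plain dictionaries. It is a conceptual example only: the function name and the (cluster, Pfam) pairs are made up, and the real GNNs carry additional node attributes that are not modeled here.

```python
from collections import defaultdict

def build_gnn_views(pairs):
    """Build both hub-and-spoke views from (cluster, pfam) neighbor pairs."""
    cluster_hubs = defaultdict(set)   # SSN cluster hub -> Pfam family spokes
    pfam_hubs = defaultdict(set)      # Pfam family hub -> SSN cluster spokes
    for cluster, pfam in pairs:
        cluster_hubs[cluster].add(pfam)
        pfam_hubs[pfam].add(cluster)
    return dict(cluster_hubs), dict(pfam_hubs)

# Made-up relationships that passed the co-occurrence filter.
pairs = [(1, "PF00001"), (1, "PF00002"), (2, "PF00001"), (3, "PF00003")]
by_cluster, by_pfam = build_gnn_views(pairs)
print(sorted(by_cluster[1]))        # ['PF00001', 'PF00002']
print(sorted(by_pfam["PF00001"]))   # [1, 2]
```

In the second view, a Pfam hub with several cluster spokes (PF00001 with clusters 1 and 2 here) is the kind of pattern that points to divergent clusters sharing a genome context.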

Direct Genomic Neighborhood Diagrams (GND) Generation

The "Retrieve Neighborhood Diagrams" option allows exploration of the neighboring genes for specific queries. You can submit a single sequence that is used as the query for a BLAST search of the UniProt database; the retrieved sequences are used to generate GNDs. GNDs can also be generated from a provided list of IDs or from FASTA sequences, by collecting IDs from the FASTA headers.

Figure 1: Examples of colored SSN (left) and a hub-and-spoke cluster from a GNN (right).
