Genome Neighborhood Networks Tool


EFI-GNT Files and Node Attributes

As described in the previous section, EFI-GNT generates a colored version of the input SSN as well as two formats of genome neighborhood networks (GNNs) for download (all three in the XGMML format) and subsequent analysis with Cytoscape.

This section provides a detailed description of the colored SSN and both GNN formats, with emphasis on the node attributes that are provided for the GNNs; these include the query-neighbor distances, co-occurrence frequencies, and the identities of the neighbor’s Pfam family that are used for pathway predictions. In the descriptions that follow, the names of the node attributes are highlighted with Bold font.

Colored SSN

The colored SSN assists the user in analyzing the GNNs by allowing color-guided association of the cluster nodes in the GNNs with the clusters in the input SSN (Figure 1).

Figure 1. Colored SSN

EFI-GNT assigns a unique number and color to each multi-node cluster in the input SSN. EFI-GNT generates a colored version of the input SSN in which the nodes in each cluster are assigned the cluster number and colored with the unique color. Node attributes are added for the cluster number (Cluster Number) and color (node.fillColor).

Singletons in the input SSN are excluded from the GNN analysis. However, they are included in the colored SSN. The nodes do not have a cluster number (the Cluster Number node attribute is blank) and have the Cytoscape default color (cyan; the node.fillColor node attribute is blank).

In addition, text files are available for download (nomatch.txt and noneigh.txt) that contain the accession IDs for the “No Match” and “No Neighbor” queries. The user may find these useful for identifying these sequences in the input and colored SSNs.

If the nodes are not automatically colored when the SSN is opened in Cytoscape, the Style Control Panel can be used to color the nodes in the SSN: In the Fill Color property, select “node:node.fillColor” for the Column value and “Passthrough Mapping” for the Mapping Type value.
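The Cluster Number and node.fillColor node attributes can also be inspected outside Cytoscape. The following is a minimal sketch, assuming the colored SSN stores each attribute as an XGMML att element with name and value fields; the sample network, accession IDs, and function names are hypothetical, and the exact encoding in a real EFI-GNT file may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, hand-written XGMML fragment (not taken from a real EFI-GNT file).
SAMPLE_XGMML = """
<graph label="colored_ssn" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="A0A000" label="A0A000">
    <att name="Cluster Number" type="integer" value="1"/>
    <att name="node.fillColor" type="string" value="#FF0000"/>
  </node>
  <node id="A0A001" label="A0A001">
    <att name="Cluster Number" type="integer" value="1"/>
    <att name="node.fillColor" type="string" value="#FF0000"/>
  </node>
  <node id="A0A002" label="A0A002">
    <att name="Cluster Number" type="integer" value=""/>
    <att name="node.fillColor" type="string" value=""/>
  </node>
</graph>
"""

XGMML_NS = "{http://www.cs.rpi.edu/XGMML}"


def read_node_attributes(xgmml_text):
    """Return {node id: {attribute name: attribute value}} for every node."""
    root = ET.fromstring(xgmml_text.strip())
    nodes = {}
    for node in root.iter(XGMML_NS + "node"):
        nodes[node.get("id")] = {
            att.get("name"): att.get("value", "")
            for att in node.findall(XGMML_NS + "att")
        }
    return nodes


def group_by_cluster(nodes):
    """Split nodes into numbered clusters and singletons (blank Cluster Number)."""
    clusters = defaultdict(list)
    singletons = []
    for node_id, atts in nodes.items():
        number = atts.get("Cluster Number", "")
        if number:
            clusters[int(number)].append(node_id)
        else:
            singletons.append(node_id)
    return dict(clusters), singletons


nodes = read_node_attributes(SAMPLE_XGMML)
clusters, singletons = group_by_cluster(nodes)
print(clusters)
print(singletons)
```

Nodes with a blank Cluster Number are reported separately, mirroring the treatment of singletons described above.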

GNNs: Two formats

A GNN contains clusters (hub-node and ≥ 1 spoke‑node) that provide genome neighborhood information (query-neighbor co-occurrence frequencies and distances) for the sequences in the clusters in the input SSN.

EFI-GNT generates the GNN in two formats that provide different query-neighbor perspectives to assist predictions of pathways. The formats differ in the identities of the cluster hub-node (neighbor Pfam family or query SSN cluster) and spoke-nodes (query SSN cluster or Pfam family, respectively).

In the first GNN format (Figure 2, below), a cluster is present for each query SSN cluster (hub-node) that was used to identify genome neighbors (spoke-nodes). This format allows the user to identify functionally linked enzymes, as deduced from genome proximity, that constitute the metabolic pathway in which the sequences in the query SSN cluster participate.

In the second GNN format (Figure 3, below), a cluster is present for each Pfam family (hub-node) that was identified as a neighbor to queries in the SSN clusters (spoke-nodes). This format allows the user to assess whether queries in multiple SSN clusters are neighbors to members of the same Pfam family and, therefore, may have the same in vitro activities and in vivo metabolic functions.

In both formats, the node attributes for the Pfam family and SSN cluster nodes contain the same information; they describe the query-neighbor relationships (co-occurrence frequency and distance) that can be used to infer functional relationships and so enable the prediction of in vitro activities and in vivo metabolic pathways.
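The relationship between the two formats can be pictured as two groupings of the same query-neighbor records. The sketch below is only an illustration of that idea: the records, the co-occurrence values, and the function names are hypothetical and do not reflect how EFI-GNT builds the networks internally.

```python
from collections import defaultdict

# Hypothetical (SSN cluster number, Pfam short name, co-occurrence) records.
records = [
    (1, "Aldedh", 0.80),
    (1, "DAO", 0.25),
    (2, "DAO", 0.60),
]


def cluster_hub_view(records):
    """First format: each SSN cluster is a hub; Pfam families are its spokes."""
    view = defaultdict(dict)
    for cluster, pfam, cooc in records:
        view[cluster][pfam] = cooc
    return dict(view)


def pfam_hub_view(records):
    """Second format: each Pfam family is a hub; SSN clusters are its spokes."""
    view = defaultdict(dict)
    for cluster, pfam, cooc in records:
        view[pfam][cluster] = cooc
    return dict(view)


print(cluster_hub_view(records))
print(pfam_hub_view(records))
```

In the second grouping, the DAO hub has spokes for clusters 1 and 2, which is the situation the second format is meant to expose: queries in several SSN clusters that are neighbors to members of the same Pfam family.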

SSN Cluster Hub-Nodes and Pfam Family Spoke-Nodes

In the first GNN format, a cluster is present for each query SSN cluster (hub-node) that was used to retrieve genome neighbors (spoke-nodes; Figure 2). This format allows the user to identify functionally linked enzymes, as deduced from genome proximity, that constitute the metabolic pathway in which the query participates.

Figure 2. SSN Cluster Hub-Nodes and Pfam Family Spoke-Nodes GNN

The hub-node for each cluster, a hexagon, is the cluster number. The hub-nodes are colored with the unique color assigned for the Colored SSN and labeled with the unique cluster number that was assigned.

A spoke-node, colored grey and represented as a hexagon, is present for each Pfam family that was identified as a neighbor of a query sequence in the cluster represented by the hub-node. These are labeled with the short name for the Pfam family (e.g., Aldedh for PF00171, aldehyde dehydrogenase). For multidomain proteins, the spoke-node name is a composite of the short names for the component domains (e.g., HTH_1-LysR for PF00126-PF03466, the N-terminal HTH DNA-binding and C-terminal ligand binding domains of LysR transcriptional regulators). A spoke-node (labeled "none") will also be present for neighbors that have not been assigned to a Pfam family; these “no Pfam” neighbors are described below.

The size of the spoke-node (node.size) is determined by the value of the Co-occurrence node attribute [decimal value of the ratio of the number of queries in the cluster that found neighbors in the Pfam family to the number of queriable sequences in the cluster (Queries with Pfam Neighbors/Queriable SSN Sequences); see below]—the larger the co-occurrence frequency of the SSN cluster queries and their genome neighbors, the larger the spoke-node. The value of node.size [calculated as (Co-occurrence * 100)] is used by Cytoscape to draw the node.
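The arithmetic behind the spoke-node size can be sketched as follows. The formulas are the ones given in this section (including the definition of Queriable SSN Sequences in the node attribute descriptions); the function names and the example counts are hypothetical.

```python
def queriable_sequences(total, no_match, no_neighbor):
    """Queriable SSN Sequences = Total SSN Sequences - no-match - no-neighbor."""
    return total - no_match - no_neighbor


def co_occurrence(queries_with_pfam_neighbors, queriable):
    """Decimal Co-occurrence: Queries with Pfam Neighbors / Queriable SSN Sequences."""
    if queriable == 0:
        return 0.0  # assumption: a cluster with no queriable sequences scores zero
    return queries_with_pfam_neighbors / queriable


def node_size(cooc):
    """node.size as described above: Co-occurrence * 100."""
    return cooc * 100


# Hypothetical cluster: 50 sequences, 6 without an ENA match, 4 without neighbors.
queriable = queriable_sequences(total=50, no_match=6, no_neighbor=4)
cooc = co_occurrence(queries_with_pfam_neighbors=30, queriable=queriable)
print(queriable, cooc, node_size(cooc))  # 40 0.75 75.0
```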

The shape of the spoke-node (node.shape) is determined by the values of several node attributes for the neighbors that were identified: a triangle if an EC number is assigned in UniProtKB; a square if a Protein Data Bank (PDB) code is available; a diamond if both an EC number and a PDB code are available; or a circle if neither an EC number nor a PDB code is available. The availability of an EC number and/or a PDB code suggests that the function of the neighbor may be known. The value of the node.shape node attribute is used by Cytoscape to draw the node.

The network can be filtered with the Select Control Panel to select specific node shapes, i.e., different levels of confidence about the functions of the neighbors.
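The shape rule can be written as a small decision function. This is a sketch of the rule as stated above, not the EFI-GNT implementation; in particular, how the EC and PDB information of several neighbors is combined into one shape is not specified here, so the "any neighbor" aggregation below is an assumption, as are the function names and sample values.

```python
def spoke_node_shape(has_ec_number, has_pdb_code):
    """Map the availability of an EC number and a PDB code to a node shape."""
    if has_ec_number and has_pdb_code:
        return "diamond"
    if has_ec_number:
        return "triangle"
    if has_pdb_code:
        return "square"
    return "circle"


def shape_for_neighbors(neighbors):
    """neighbors: list of (ec_number, pdb_code) strings; '' means not available.

    Assumption: one annotated neighbor is enough to set the shape.
    """
    has_ec = any(ec for ec, _ in neighbors)
    has_pdb = any(pdb for _, pdb in neighbors)
    return spoke_node_shape(has_ec, has_pdb)


# Hypothetical neighbors: one with an EC number only, one with a PDB code only.
print(shape_for_neighbors([("1.2.1.3", ""), ("", "1ABC")]))  # diamond
print(shape_for_neighbors([("", ""), ("", "")]))             # circle
```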

[These determinants of size and shape for the Pfam spoke-nodes are the same as those used in the other GNN format for the SSN cluster spoke-nodes; see the section on that format.]

Neighbors not assigned to Pfam families: Approximately 20% of the sequences in UniProt are not associated with one of the 16,306 families in Pfam release 30.0. Therefore, genome neighbors not associated with a Pfam family will almost always be identified in the generation of the GNN; these are designated as members of the “no Pfam” family.

In this GNN format, a spoke-node for the “no Pfam” family will be present (labeled "none") if an input cluster found “no Pfam” neighbors.

The identities of the neighbors not associated with a Pfam family are provided in the Query-Neighbor Accessions and Query-Neighbor Arrangement node attributes (described in detail in the descriptions of the node attributes).

Node attributes for each SSN cluster hub-node

The order of the node attributes in the Table Panel cannot be pre-specified, but you can rearrange them with your cursor to make them more “user friendly”.

  • shared name: the unique number assigned to each cluster in the input SSN (singletons are not included)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  • name: the unique number assigned to each cluster in the input SSN (singletons are not included)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  • Cluster Number: the unique number assigned to each cluster in the input SSN (singletons are not included)

    This node attribute has a numerical value; a specific Cluster Number or a range of Cluster Numbers can be selected with the Select Control Panel.

  • Total SSN Sequences: The total number of sequences in the SSN cluster.

    This node attribute has a numerical value; a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  • Queriable SSN Sequences: The number of sequences in the SSN cluster that have genome neighbors in the bacterial and fungal ENA sequence files. The value of this node attribute is calculated by:

    Total number of sequences in the SSN cluster (value of Total SSN Sequences) – number of sequences that did not have a match in the ENA sequence files (list provided in the nomatch file that can be downloaded) – number of sequences for which the ENA sequence files did not provide genome neighborhoods (list provided in the noneigh(bor) file that can be downloaded).

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  • Hub Queries with Pfam Neighbors: A summary for all spoke-node Pfam families of the number of queriable sequences in the hub-node SSN cluster (value of Queriable SSN Sequences) for which a neighbor in the Pfam family was found, in the following format:

    cluster#:Pfam#:#Queries with Pfam Neighbors

    where

    “cluster#” is the cluster# for the query,

    “Pfam#” is the spoke-node Pfam family number, and

    “#Queries with Pfam Neighbors” is the number of queriable sequences in the SSN cluster (value of Queriable SSN Sequences) for which a neighbor in the spoke-node Pfam family was found. A query may find multiple members of the Pfam family, but this node attribute reports only the number of queries that found any neighbor in the Pfam family.

    The neighbors in the Pfam family need not be orthologues (share the same function).

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  • Hub Pfam Neighbors: A summary for all spoke-node Pfam families of the number of neighbors in the Pfam family that were found by the queries in the hub-node SSN cluster in the following format:

    cluster#:Pfam#:#Pfam Neighbors

    where

    “cluster#” is the cluster# for the query,

    “Pfam#” is the spoke-node Pfam family, and

    “#Pfam Neighbors” is the number of neighbors in the spoke-node Pfam family found by the queries in hub-node SSN cluster.

    The value of "#Pfam Neighbors" will be greater than the value of "#Queries with Pfam Neighbors" in the Hub Queries with Pfam Neighbors (previous) node attribute if a query found more than one neighbor in the Pfam family.

    Again, the neighbors in the Pfam family need not be orthologues (share the same function).

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  • Hub Average and Median Distances: A summary for all spoke-node Pfam families of the values of the Average Distance and Median Distance node attributes in the following format:

    cluster#:Pfam#:average absolute value of distances:median absolute value of distances

    where

    “cluster#” is the cluster# for the query,

    “Pfam#” is the Pfam# for the neighbor,

    “average absolute value of distances” is the average of the absolute values of distances between the hub-node queries and spoke-node neighbors, and

    “median absolute value of distances” is the median absolute value of distances between the hub-node queries and spoke-node neighbors.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  • Hub Co-occurrence and Ratio: A summary for all spoke-node Pfam families of the values for Co-occurrence and Co-occurrence Ratio node attributes in the following format:

    cluster#:Pfam#:Co-occurrence:Co-occurrence Ratio

    where

    “cluster#” is the cluster# for the query,

    “Pfam#” is the Pfam# for the neighbor, and

    “Co-occurrence” is the decimal value of the ratio of the number of queries that found neighbors in the Pfam family to the number of queriable sequences in the hub-node SSN cluster (Queries with Pfam Neighbors/Queriable SSN Sequences), and

    “Co-occurrence Ratio” is the same ratio of the number of queries that found neighbors in the Pfam family to the number of queriable sequences in the hub-node SSN cluster (Queries with Pfam Neighbors/Queriable SSN Sequences), reported as a ratio rather than as a decimal value.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  • Node.fillColor: the hexadecimal number of the unique color assigned to each cluster in the input SSN (singletons are not included). This number is used by the pass-through mapping “Fill Color” style of Cytoscape to color the nodes in the network
  • Node.shape: hexagon (used by Cytoscape but can be used in searches to select hub-nodes)
  • Node.size: 70.0 (used by Cytoscape)
  • Pfam: empty (a spoke-node attribute)
  • Pfam description: empty (a spoke-node attribute)
  • Queries with Pfam Neighbors: empty (a spoke-node attribute)
  • Pfam Neighbors: empty (a spoke-node attribute)
  • Query-Accessions: empty (a spoke-node attribute)
  • Query-Neighbor Accessions: empty (a spoke-node attribute)
  • Query-Neighbor Arrangement: empty (a spoke-node attribute)
  • Average Distance: empty (a spoke-node attribute)
  • Median Distance: empty (a spoke-node attribute)
  • Co-occurrence: empty (a spoke-node attribute)
  • Co-occurrence Ratio: empty (a spoke-node attribute)
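As an alternative to separating the colon-delimited hub-node entries in Excel, they can be split programmatically. The sketch below parses entries of the Hub Co-occurrence and Ratio attribute in the cluster#:Pfam#:Co-occurrence:Co-occurrence Ratio format described above. The sample entries are hypothetical, and the "30/40" style of the Co-occurrence Ratio field is an assumption; the field is kept as text, so other styles pass through unchanged.

```python
from typing import NamedTuple


class HubCoOccurrence(NamedTuple):
    cluster: int
    pfam: str            # Pfam number, hyphen-separated for multidomain proteins
    co_occurrence: float
    ratio: str           # kept as text; exact style assumed, e.g. "30/40"


def parse_hub_co_occurrence(entry):
    """Split one cluster#:Pfam#:Co-occurrence:Co-occurrence Ratio entry."""
    cluster, pfam, cooc, ratio = entry.split(":")
    return HubCoOccurrence(int(cluster), pfam, float(cooc), ratio)


# Hypothetical entries copied from a hub-node's Hub Co-occurrence and Ratio attribute.
entries = [
    "1:PF00171:0.75:30/40",
    "1:PF00126-PF03466:0.10:4/40",
]

parsed = [parse_hub_co_occurrence(e) for e in entries]

# Keep only Pfam families found by at least half of the queriable sequences.
frequent = [p.pfam for p in parsed if p.co_occurrence >= 0.5]
print(frequent)  # ['PF00171']
```

The same split-on-colon approach applies to the other Hub summary attributes, with the field names changed to match their formats.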

Node attributes for each Pfam family spoke-node

The order of the node attributes in the Table Panel cannot be pre-specified, but you can move them with your cursor to make them more “user friendly”.

  1. shared name: the Pfam family short name (or hyphen-separated short names for multidomain proteins)

    This node attribute is text; strings of characters can be selected with the Select Control Panel.

  2. name: the Pfam family short name (hyphen-separated names for multidomain proteins)

    This node attribute is text; strings of characters can be selected with the Select Control Panel.

    [The shared name and name node attributes are redundant and required by Cytoscape.]

  3. Cluster Number: the hub-node SSN cluster that found neighbors in the Pfam family spoke-node.

  4. Pfam: the Pfam family number (PFxxxxx) (or hyphen-separated numbers for multidomain proteins)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  5. Pfam description: the Pfam family long name (or hyphen-separated names for multidomain proteins)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  6. Queries with Pfam Neighbors: The total number of queriable sequences in the hub-node SSN cluster (value of Queriable SSN Sequences in SSN cluster hub-node) for which any neighbor in this Pfam family was found. A query may find multiple members of the Pfam family, but this node attribute reports only the number of queries that found any neighbor.

  7. [Queriable SSN Sequences: Total number of sequences in the SSN cluster (value of Total SSN Sequences) – number of sequences that did not have a match in the ENA sequence files (list provided in the nomatch file that can be downloaded) – number of sequences for which the ENA sequence files did not provide genome neighborhoods (list provided in the noneigh(bor) file that can be downloaded).]

    The neighbors in the Pfam family need not be orthologues (share the same function).

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  8. Pfam Neighbors: The total number of neighbors in the Pfam family found by the queries in the hub-node SSN cluster. The value of this node attribute will be greater than the value of the Queries with Pfam Neighbors (previous) node attribute if a query found more than one neighbor in the Pfam family.

    The neighbors in the Pfam family need not be orthologues (share the same function)—this can be evaluated by mapping the neighbors to the SSN for the Pfam family using the spreadsheet/custom node attribute files that can be downloaded.

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  9. Query-Accessions: A list of the query accession IDs in the SSN cluster that found neighbors.

  10. Query-Neighbor Accessions: Information about the query-neighbor pairs in the Pfam family in the following format:

    Pfam#:Query ID:Neighbor ID:EC#:ClosestPDB:PDB-E-value:Status

    where

    “Query ID” is the query accession ID,

    “Pfam#” is the Pfam# for the neighbor,

    “Neighbor ID” is the neighbor accession ID,

    “EC#” is the E.C. number, if any, assigned to the neighbor in the UniProt database,

    “ClosestPDB” is the Protein Databank (PDB) identifier for the most similar sequence to the neighbor with a structure in the PDB database,

    “PDB-E-value” is the BLAST e-value for the neighbor-ClosestPDB pair, and

    “Status” (Reviewed/Unreviewed) reports if the in vitro activity of the neighbor has been reviewed by SwissProt.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  11. Query-Neighbor Arrangement: Genome context information for neighbors in the Pfam family in the following format:

    Pfam#:Query ID:noncomplement/complement:Neighbor ID:noncomplement/complement:Distance

    where

    “Pfam#” is the Pfam# for the neighbor,

    “Query ID” is the query accession ID,

    “noncomplement/complement” is the direction of transcription of the gene encoding the query (from the ENA sequence file),

    “Neighbor ID” is the neighbor accession ID,

    “noncomplement/complement” is the direction of transcription of the gene encoding the neighbor (from the ENA sequence file), and

    “Distance” is the distance in orfs between the genes encoding the query and neighbor.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  12. Average Distance: The average of the absolute values of the distances between the queries in the hub-node SSN cluster and their neighbors in the Pfam family.

    This node attribute has a numerical value; a specific distance or a range of distances can be selected with the Select Control Panel.

  13. Median Distance: The median value of the absolute values of the distances between the queries in the hub-node SSN cluster and their neighbors in the Pfam family.

    This node attribute has a numerical value; a specific distance or a range of distances can be selected with the Select Control Panel.

  14. Co-occurrence: The decimal value of the ratio of the number of queries in the hub-node SSN cluster that found neighbors in the Pfam family to the number of queriable sequences in the SSN cluster (Queries with Pfam Neighbors/Queriable SSN Sequences).

    This node attribute has a numerical value; a specific value or a range of values can be selected with the Select Control Panel.

  15. Co-occurrence Ratio: The numerical ratio of the number of queries in the hub-node SSN cluster that found neighbors in the Pfam family to the number of queriable sequences in the SSN cluster (Queries with Pfam Neighbors/Queriable SSN Sequences).

    This node attribute is text; strings of characters can be selected with the Select Control Panel.

  16. Node.fillColor: #EEEEEE (grey in hexadecimal); used by Cytoscape.

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  17. Node.shape: ellipse, triangle, diamond, or square (explained above); used by Cytoscape but can be used in searches.

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  18. Node.size: calculated as (Co-occurrence * 100); used by Cytoscape.

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  19. Total SSN Sequences: empty (a hub-node attribute)
  20. Queriable SSN Sequences: empty (a hub-node attribute)
  21. Hub Queries with Pfam Neighbors: empty (a hub-node attribute)
  22. Hub Pfam Neighbors: empty (a hub-node attribute)
  23. Hub Average and Median Distances: empty (a hub-node attribute)
  24. Hub Co-occurrence and Ratio: empty (a hub-node attribute)
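The Average Distance and Median Distance values can be reproduced from the Query-Neighbor Arrangement entries of a spoke-node. The sketch below assumes each entry follows the Pfam#:Query ID:direction:Neighbor ID:direction:Distance format described above and that Distance is a signed integer; the accession IDs and distances are hypothetical.

```python
from statistics import mean, median


def parse_arrangement(entry):
    """Split one Query-Neighbor Arrangement entry into named fields."""
    pfam, query_id, query_dir, neighbor_id, neighbor_dir, distance = entry.split(":")
    return {
        "pfam": pfam,
        "query_id": query_id,
        "query_direction": query_dir,
        "neighbor_id": neighbor_id,
        "neighbor_direction": neighbor_dir,
        "distance": int(distance),  # assumption: signed distance in orfs
    }


def distance_summary(entries):
    """Return (average, median) of the absolute query-neighbor distances."""
    distances = [abs(parse_arrangement(e)["distance"]) for e in entries]
    return mean(distances), median(distances)


# Hypothetical entries for one Pfam family spoke-node.
entries = [
    "PF00171:A0A000:complement:B1B111:complement:-1",
    "PF00171:A0A001:noncomplement:B1B222:noncomplement:2",
    "PF00171:A0A002:complement:B1B333:noncomplement:6",
]

average, middle = distance_summary(entries)
print(average, middle)
```

For these three entries the absolute distances are 1, 2, and 6, so the average is 3 and the median is 2; the gap between the two values shows why both attributes are reported.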

Pfam Family Hub-Nodes and SSN Cluster Spoke-Nodes

In the second GNN format, a cluster is present for each Pfam family (hub-node) that was identified as a neighbor to queries in the SSN clusters (spoke-nodes; Figure 3). This format allows the user to assess whether queries in multiple SSN clusters are neighbors to members of the same Pfam family and, therefore, may have the same in vitro activities and in vivo metabolic functions.

Figure 3. Pfam Family Hub-Nodes and SSN Cluster Spoke-Nodes GNN

The hub-node, a hexagon colored white, represents the neighbor Pfam family. These are labeled with the Pfam short name for the family (e.g., DAO for PF01266, FAD-dependent oxidoreductase). For multidomain proteins, the label is a composite of the Pfam short names for the component domains (e.g., FGGY_N-FGGY_C for PF00370-PF02782, the N- and C-terminal domains in the FGGY family of carbohydrate kinases).

A spoke-node is present for each SSN cluster that identified a member of the Pfam family. The color of the spoke-node is the unique color assigned in the colored SSN; its label is the unique cluster number.

The size of the spoke-node (node.size) is determined by the value of the Co-occurrence node attribute [decimal value of the ratio of the number of queries in the cluster that found neighbors in the Pfam family to the number of queriable sequences in the cluster (Queries with Pfam Neighbors/Queriable SSN Sequences); see below]—the larger the co-occurrence frequency of the SSN cluster queries and their genome neighbors, the larger the spoke-node. The value of node.size [calculated as (Co-occurrence * 100)] is used by Cytoscape to draw the node.

The shape of the spoke-node (node.shape) is determined by the values of several node attributes for the neighbors that were identified: a triangle if an EC number is assigned in UniProtKB; a square if a Protein Data Bank (PDB) code is available; a diamond if both an EC number and a PDB code are available; or a circle if neither an EC number nor a PDB code is available. The availability of an EC number and/or a PDB code suggests that the function of the neighbor may be known. The value of the node.shape node attribute is used by Cytoscape to draw the node.

The network can be filtered with the Select Control Panel to select specific node shapes, i.e., different levels of confidence about the functions of the neighbors.

Neighbors not assigned to Pfam families: Approximately 20% of the sequences in UniProt are not associated with one of the 16,306 families in Pfam release 30.0. Therefore, genome neighbors not associated with a Pfam family almost always will be identified in the generation of the GNN; these are designated as members of the “no Pfam” family.

If the spoke-node queries find any "no Pfam" neighbors, a hub-node cluster for the “no Pfam” family (labeled "none") will be present, with a spoke-node for each cluster that found a “no Pfam” neighbor.

The identities of the neighbors not associated with a Pfam family are provided in the Query-Neighbor Accessions and Query-Neighbor Arrangement node attributes for the Pfam family nodes (described in detail in the descriptions of the node attributes).

Node attributes for Pfam family hub-nodes

The order of the node attributes in the Table Panel cannot be pre-specified, but you can rearrange them with your cursor to make them more “user friendly”.

  1. shared name: the Pfam family short name (or hyphen-separated short names for multidomain proteins)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  2. name: the Pfam family short name (hyphen-separated names for multidomain proteins)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

    [The shared name and name node attributes are redundant and required by Cytoscape.]

  3. Pfam: the Pfam family number (PFxxxxx) (or hyphen-separated numbers for multidomain proteins)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  4. Pfam description: the Pfam family long name (or hyphen-separated names for multidomain proteins)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  5. Total SSN Sequences: The total number of sequences in the spoke-node SSN clusters.

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  6. Queriable SSN Sequences: The total number of sequences in the spoke-node SSN clusters that have genome neighbors in the bacterial and fungal ENA sequence files. The value of this node attribute is calculated by:

    Total number of sequences in the spoke-node SSN clusters (value of Total SSN Sequences) – number of sequences that did not have a match in the ENA sequence files (list provided in the nomatch file that can be downloaded) – number of sequences for which the ENA sequence files did not provide genome neighborhoods (list provided in the noneigh(bor) file that can be downloaded).

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  7. Queries with Pfam Neighbors: The total number of queriable sequences in the spoke-node SSN clusters (value of Queriable SSN Sequences) for which a neighbor in the hub-node Pfam family was found. A query may find multiple members of the Pfam family, but this node attribute reports only the number of queries that found any neighbor in the Pfam family.

    The neighbors in the Pfam family need not be orthologues (share the same function)—this can be evaluated by mapping the neighbors to the SSN for the Pfam family using the spreadsheet/custom node attribute files that can be downloaded.

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  8. Pfam Neighbors: The total number of neighbors in the Pfam family found by the queries in the spoke-node SSN clusters. The value of this node attribute will be greater than the value of the Queries with Pfam Neighbors (previous) node attribute if a query found more than one neighbor in the Pfam family.

    Again, the neighbors in the Pfam family need not be orthologues (share the same function)— this can be evaluated by mapping the neighbors to the SSN for the Pfam family using the spreadsheet/custom node attribute files that can be downloaded.

    This node attribute has a numerical value—a specific number of neighbors or a range of neighbors can be selected with the Select Control Panel.

  9. Query Accessions: A summary of queries for all spoke-node SSN clusters in the following format:

    cluster#:Query ID

    where

    “cluster#” is the cluster# for the query and

    “Query ID” is the query accession ID.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  10. Query-Neighbor Accessions: A summary for all spoke-node SSN clusters of information about the query-neighbor pairs in the Pfam family in the following format:

    cluster#:Query ID:Neighbor ID:EC#:ClosestPDB:PDB-E-value:Status

    where

    “cluster#” is the cluster# for the query,

    “Query ID” is the query accession ID,

    “Neighbor ID” is the neighbor accession ID,

    “EC#” is the E.C. number, if any, assigned to the neighbor in the UniProt database,

    “ClosestPDB” is the Protein Databank (PDB) identifier for the most similar sequence to the neighbor with a structure in the PDB database,

    “PDB-E-value” is the BLAST e-value for the neighbor-ClosestPDB pair, and

    “Status” (Reviewed/Unreviewed) reports if the in vitro activity of the neighbor has been reviewed by SwissProt.

    This node attribute is text; strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

    [ClosestPDB:PDB-E-value is a novel node attribute that indicates whether a sequence shares significant (e-value < e-30) homology with a protein for which an X-ray crystal structure has been deposited in the PDB. The format of this information is “PDB code:e‑value”. This information is valuable to computational chemists wanting to construct a structure model using a known structure as a template from a protein similar in sequence. Ideally, the neighbor sequence itself would have a deposited X-ray crystal structure, but this is most often not the case. Nonetheless, confident structure models have been employed successfully in pathway docking to determine the substrates of unknown enzymes.]

  11. Query-Neighbor Arrangement: A summary for all spoke-node SSN clusters of genome context information for the neighbors in the Pfam family in the following format:

    cluster#:Query ID:noncomplement/complement:Neighbor ID:noncomplement/complement:Distance

    where

    “cluster#” is the cluster# for the query,

    “Query ID” is the query accession ID,

    “noncomplement/complement” is the direction of transcription of the gene encoding the query (from the ENA sequence file),

    “Neighbor ID” is the neighbor accession ID,

    “noncomplement/complement” is the direction of transcription of the gene encoding the neighbor (from the ENA sequence file), and

    “Distance” is the distance in orfs between the genes encoding the query and neighbor.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  12. Hub Average and Median Distances: A summary for all spoke-node SSN clusters of the values of the Average Distance and Median Distance node attributes in the following format:

    cluster#:Average Distance:Median Distance

    where

    “cluster#” is the cluster# for the query,

    “Average Distance” is the average of the absolute values of the distances between the queries and neighbors in the cluster, and

    “Median Distance” is the median of the absolute values of the distances between the queries and neighbors in the cluster.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  13. Hub Co-occurrence and Ratio: A summary for all spoke-node SSN clusters of the values of the Co-occurrence and Co-occurrence Ratio node attributes in the following format:

    cluster#:Co-occurrence:Co-occurrence Ratio

    where

    “cluster#” is the cluster# for the query,

    “Co-occurrence” is the decimal value of the ratio of the number of queries that found neighbors in the Pfam family to the number of queriable sequences in the cluster (Queries with Pfam Neighbors / Queriable SSN Sequences), and

    “Co-occurrence Ratio” is the ratio of the number of queries that found neighbors in the Pfam family to the number of queriable sequences in the cluster (Queries with Pfam Neighbors / Queriable SSN Sequences).

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  14. Node.fillColor: #FFFFFF (white in hexadecimal; used by Cytoscape)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  15. Node.shape: hexagon (used by Cytoscape but can be used in searches to select types of nodes)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  16. Node.size: 70.0 (used by Cytoscape)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  17. Cluster Number: empty (a spoke-node attribute)
  18. Average Distance: empty (a spoke-node attribute)
  19. Median Distance: empty (a spoke-node attribute)
  20. Co-occurrence: empty (a spoke-node attribute)
  21. Co-occurrence Ratio: empty (a spoke-node attribute)
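
The colon-delimited hub-node summaries described above can also be separated outside Excel. The following is a minimal Python sketch, assuming that each entry has been copied out of Cytoscape as one string per entry, that distances are signed integers, and that the Co-occurrence Ratio is written as a fraction such as 30/40. The function names and the example accession IDs are hypothetical and are not part of EFI-GNT.

```python
from typing import List, Tuple


def split_entry(entry: str, expected: int) -> List[str]:
    """Split one colon-delimited entry and check the number of fields."""
    fields = [field.strip() for field in entry.split(":")]
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}: {entry!r}")
    return fields


def parse_hub_arrangement(entry: str) -> Tuple[int, str, str, str, str, int]:
    """cluster#:Query ID:strand:Neighbor ID:strand:Distance"""
    cluster, query, q_strand, neighbor, n_strand, distance = split_entry(entry, 6)
    return int(cluster), query, q_strand, neighbor, n_strand, int(distance)


def parse_hub_distances(entry: str) -> Tuple[int, float, float]:
    """cluster#:Average Distance:Median Distance"""
    cluster, average, median = split_entry(entry, 3)
    return int(cluster), float(average), float(median)


def parse_hub_cooccurrence(entry: str) -> Tuple[int, float, str]:
    """cluster#:Co-occurrence:Co-occurrence Ratio (ratio kept as text)"""
    cluster, cooccurrence, ratio = split_entry(entry, 3)
    return int(cluster), float(cooccurrence), ratio


# Hypothetical entries; the accession IDs are placeholders.
arrangement = parse_hub_arrangement("7:QUERY01:complement:NEIGH01:noncomplement:-3")
distances = parse_hub_distances("7:1.75:1.5")
cooccurrence = parse_hub_cooccurrence("7:0.75:30/40")
```

Because the cluster number is the first field of every entry, the parsed tuples can be grouped by cluster to compare genome context across the spoke-node clusters of a hub.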

Node attributes for SSN cluster spoke-nodes

The order of the node attributes in the Table Panel cannot be pre-specified, but you can move them with your cursor to make them more “user friendly”.

  1. shared name: the unique number assigned to each cluster in the input SSN (singletons are not included)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  2. name: the unique number assigned to each cluster in the input SSN (singletons are not included)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  3. Cluster Number: the unique number assigned to each cluster in the input SSN (singletons are not included)

    This node attribute has a numerical value—a specific Cluster Number or a range of Cluster Numbers can be selected with the Select Control Panel.

  4. Total SSN Sequences: The total number of sequences in the SSN cluster.

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  5. Queriable SSN Sequences: The number of sequences in the SSN cluster that have genome neighbors in the bacterial and fungal ENA sequence files. The value of this node attribute is calculated by:

    Total number of sequences in the cluster (value of Total SSN Sequences) – number of sequences that did not have a match in the ENA sequence files (list provided in the nomatch file that can be downloaded) – number of sequences for which the ENA sequence files did not provide genome neighborhoods (list provided in the noneigh(bor) file that can be downloaded).

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  6. Queries with Pfam Neighbors: The total number of queriable sequences (value of Queriable SSN Sequences) in the SSN cluster for which a neighbor in the hub-node Pfam family was found. A query may find multiple members of the Pfam family, but this node attribute reports only the number of queries that found any neighbor in the Pfam family.

    The neighbors in the Pfam family need not be orthologues (share the same function)—this can be evaluated by mapping the neighbors to the SSN for the Pfam family using the spreadsheet/custom node attribute files that can be downloaded.

    This node attribute has a numerical value—a specific number of sequences or a range of sequences can be selected with the Select Control Panel.

  7. Pfam Neighbors: The total number of neighbors in the Pfam family that were found by the queriable sequences in the SSN cluster. The value of this node attribute will be greater than the value of the Queries with Pfam Neighbors (previous) node attribute if a query found more than one neighbor in the Pfam family.

    Again, the neighbors in the Pfam family need not be orthologues (share the same function); this can be evaluated by mapping the neighbors to the SSN for the Pfam family using the spreadsheet/custom node attribute files that can be downloaded.

    This node attribute has a numerical value—a specific number of neighbors or a range of neighbors can be selected with the Select Control Panel.

  8. Query-Neighbor Accessions: A list of information about the query-neighbor pairs in the Pfam family in the following format:

    Query ID:Neighbor ID:EC#:ClosestPDB:PDB-E-value:Status

    where

    “Query ID” is the query UniProt accession ID,

    “Neighbor ID” is the neighbor UniProt accession ID,

    “EC#” is the E.C. number, if any, assigned to the neighbor in the UniProt database,

    “ClosestPDB” is the Protein Databank (PDB) identifier for the most similar sequence to the neighbor with a structure in the PDB database,

    “PDB-E-value” is the BLAST e-value for the neighbor-ClosestPDB pair, and

    “Status” (Reviewed/Unreviewed) reports if the in vitro activity of the neighbor has been reviewed by SwissProt.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  9. Query-Neighbor Arrangement: A list of genome context information for the neighbors in the Pfam family in the following format:

    Query ID:noncomplement/complement:Neighbor ID:noncomplement/complement:Distance

    where

    “Query ID” is the query UniProt accession ID,

    “noncomplement/complement” is the direction of transcription of the gene encoding the query (from the ENA sequence file),

    “Neighbor ID” is the neighbor UniProt accession ID,

    “noncomplement/complement” (second occurrence) is the direction of transcription of the gene encoding the neighbor (from the ENA sequence file), and

    “Distance” is the distance in orfs between the genes encoding the query and neighbor.

    This node attribute is text—strings of characters can be selected with the Select Control Panel. By right clicking on the node attribute, the entries can be copied and pasted into Excel or a text file for further analyses. In Excel, the colon-delimited entries can be easily separated into separate columns.

  10. Average Distance: The average of the absolute values of the distances between the queries in the cluster and their neighbors in the Pfam family.

    This node attribute has a numerical value—a specific distance or a range of distances can be selected with the Select Control Panel.

  11. Median Distance: The median of the absolute values of the distances between the queries in the cluster and their neighbors in the Pfam family.

    This node attribute has a numerical value; a specific distance or a range of distances can be selected with the Select Control Panel.

  12. Co-occurrence: The decimal value of the ratio of the number of queries that found neighbors in the Pfam family to the number of queriable sequences in the cluster (Queries with Pfam Neighbors / Queriable SSN Sequences).

    This node attribute has a numerical value—a specific co-occurrence or a range of co-occurrences can be selected with the Select Control Panel.

  13. Co-occurrence Ratio: The ratio of the number of queries that found neighbors in the Pfam family to the number of queriable sequences in the cluster (Queries with Pfam Neighbors / Queriable SSN Sequences).

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  14. Node.fillColor: the hexadecimal number of the unique color assigned to each cluster in the input SSN (singletons are not included). This number is used by the pass-through mapping “Fill Color” style of Cytoscape to color the nodes in the network.

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  15. Node.shape: ellipse, diamond, or square (explained above; used by Cytoscape but can be used in searches to select hub-nodes)

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  16. Node.size: calculated as (Co-occurrence * 100); used by Cytoscape to draw the node.

    This node attribute is text—strings of characters can be selected with the Select Control Panel.

  17. Pfam: empty (a hub-node attribute)
  18. Pfam description: empty (a hub-node attribute)
  19. Hub Average and Median Distance: empty (a hub-node attribute)
  20. Hub Co-occurrence and Co-occurrence Ratio: empty (a hub-node attribute)
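
The numerical spoke-node attributes are related by simple arithmetic. As a sketch only, the Python below recomputes them from made-up counts and distances. The function names are hypothetical, the input numbers are invented for illustration, and writing the Co-occurrence Ratio as "queries/queriable" text is an assumption about how the ratio is displayed.

```python
from statistics import mean, median
from typing import List, Tuple


def queriable_sequences(total: int, no_match: int, no_neighbor: int) -> int:
    """Total SSN Sequences minus the "No Match" and "No Neighbor" queries."""
    return total - no_match - no_neighbor


def distance_summary(distances: List[int]) -> Tuple[float, float]:
    """Average and median of the absolute query-neighbor distances (in orfs)."""
    absolute = [abs(d) for d in distances]
    return mean(absolute), median(absolute)


def co_occurrence(queries_with_neighbors: int, queriable: int) -> Tuple[float, str]:
    """Decimal Co-occurrence and a text form of the Co-occurrence Ratio."""
    value = queries_with_neighbors / queriable
    ratio = f"{queries_with_neighbors}/{queriable}"  # assumed display format
    return value, ratio


def node_size(co_occurrence_value: float) -> float:
    """Node.size as described above: Co-occurrence * 100."""
    return co_occurrence_value * 100


# Invented example: 50 sequences, 6 without a match, 4 without neighbors.
queriable = queriable_sequences(total=50, no_match=6, no_neighbor=4)  # 40
value, ratio = co_occurrence(queries_with_neighbors=30, queriable=queriable)
average, middle = distance_summary([-2, 1, 3, -1])
size = node_size(value)
print(queriable, value, ratio, average, middle, size)
```

In this example, 30 of the 40 queriable sequences found a neighbor in the Pfam family, so Co-occurrence is 0.75 and the node is drawn with size 75.0.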

Additional Files for Download

1. Text file with list of query accession IDs not found in the bacterial and fungal ENA files (nomatch.tab), i.e., not in the STD (annotated assembled sequences), CON (high level constructed sequences), and WGS (whole genome shotgun sequencing with intermediate level of assembly) files for bacterial and fungal proteins. [The No Match node attribute in the colored SSN can be used to identify these in the SSN.]

2. Text file with list of query accession IDs that do not have genome neighbors in the bacterial and fungal ENA files (noneighb.tab), i.e., the ENA files contain single orfs. [The No Neighbors node attribute in the colored SSN can be used to identify these in the SSN.]
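
These two lists can be used to work out which sequences in a cluster were actually queriable. The sketch below assumes that each downloaded file holds one accession ID per line, which is not stated above, and uses hypothetical function names and placeholder IDs.

```python
from typing import Iterable, List, Set


def read_accessions(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty accession IDs, assuming one ID per line."""
    return {line.strip() for line in lines if line.strip()}


def queriable_ids(cluster_ids: List[str], no_match: Set[str], no_neighbor: Set[str]) -> List[str]:
    """Keep the cluster members that appear in neither exclusion list."""
    excluded = no_match | no_neighbor
    return [accession for accession in cluster_ids if accession not in excluded]


# Placeholder data; with real downloads, pass an open file object instead,
# e.g. read_accessions(open("nomatch.tab")).
no_match = read_accessions(["ID0001\n", "\n", "ID0002\n"])
no_neighbor = read_accessions(["ID0003\n"])
kept = queriable_ids(["ID0001", "ID0003", "ID0004", "ID0005"], no_match, no_neighbor)
```

The length of the returned list corresponds to the Queriable SSN Sequences value for that cluster.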

In the near future, we will be providing several additional files that will assist the user in the downstream analysis of the information contained in the GNNs.
