EFI-EST and Cytoscape Tutorials

What is a Sequence Similarity Network?

Prof. Patricia Babbitt’s group at UCSF first developed sequence similarity networks (SSNs) as a way to deal with the ever-increasing deluge of sequences deposited in public databases. A general tutorial is provided here, but please see their seminal papers for a thorough description of the SSN technique (1) and their program Pythoscape (2), which were the inspiration for the EFI’s EFI-EST web server.

Sequence similarity networks (SSNs) are a quick and easy way to visualize sequence relationships within groups of proteins, especially large groups. In the simplest form, each protein sequence is represented as a circle (referred to as a “node”). A line (referred to as an “edge”) connecting one sequence to another indicates that the two are related (Figure 1).

Figure 1. Example of a very simple sequence similarity network as a function of e-value.

For our purposes, the relatedness is described by sequence similarity. Users choose a threshold at which they’d like to examine the similarity within a set of protein sequences. The sequence set is subjected to an all-by-all BLAST, and the resulting pairwise scores are used to determine which protein sequences should or should not be connected in a network at the selected threshold metric.

NOTE: EFI-EST calculates an alignment score, based on the bit score, that is similar to, but not the same as, the e-value.

For example, if the alignment score threshold is specified as 28, then edges are drawn only between nodes (protein sequences) whose pairwise alignment score is 28 or greater. If two proteins are not connected, their sequences are less similar than the threshold of 28 requires. If the network is recalculated at a more permissive (smaller) alignment score, relationships may become apparent that were not evident at a more stringent (larger) alignment score. Groups of highly similar proteins remain highly interconnected even as the alignment score is increased. These “clusters” are often very useful for the interrogation of enzyme function.
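The thresholding described above can be sketched in a few lines of Python. This is an illustration only, not EFI-EST's implementation: the sequence names and scores are invented, and the helper functions `build_network` and `find_clusters` are hypothetical. It keeps an edge when the pairwise score meets the threshold, then reports clusters as connected components.

```python
from collections import defaultdict

# Hypothetical pairwise alignment scores from an all-by-all comparison.
# Sequence names and score values are invented for illustration only.
PAIRWISE_SCORES = {
    ("seqA", "seqB"): 75,
    ("seqA", "seqC"): 60,
    ("seqB", "seqC"): 82,
    ("seqC", "seqD"): 30,
    ("seqD", "seqE"): 45,
}


def build_network(scores, threshold):
    """Keep an edge only if its alignment score meets the threshold."""
    return [pair for pair, score in scores.items() if score >= threshold]


def find_clusters(nodes, edges):
    """Group nodes into connected components (the "clusters")."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk collecting every node reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


NODES = {name for pair in PAIRWISE_SCORES for name in pair}

# A larger (more stringent) threshold removes edges and splits clusters.
for threshold in (28, 40, 50):
    edges = build_network(PAIRWISE_SCORES, threshold)
    print(threshold, find_clusters(NODES, edges))
```

With these made-up scores, all five sequences form one cluster at a threshold of 28. At 40 the weak seqC-seqD edge is dropped and the network splits in two. At 50 seqD and seqE become unconnected single nodes, while the tightly related seqA, seqB and seqC stay together.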

Although not as rigorous as traditional phylogenetic trees, SSNs typically display the same topology (Figure 2). However, SSNs offer an advantage over trees in that large sequence sets (e.g. many thousands of proteins) can be analyzed much more quickly and visualized easily using the network visualization program Cytoscape (3).

Figure 2. Rooted phylogenetic tree (UPGMA) created with ClustalW (A) using the same sequence set as shown in the network in Figure 1 (B). Proteins in the tree are identified by their six-character UniProt accession numbers.

Besides speed, another major advantage of SSNs is the ability to include pertinent information for each individual protein (such as species, annotation, length, and PDB deposition). This information is included as “node attributes”, which are searchable and sortable within a sequence similarity network displayed in Cytoscape (Figure 3). For a complete list of node attributes available via this tool, click here.

Figure 3. Representative node attributes for the example data set as seen in a Cytoscape session.
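
To make “searchable and sortable” concrete, node attributes can be pictured as one record per protein. The sketch below is an illustration only: the records, the key names (`species`, `annotation`, `length`, `pdb`) and the helpers `search` and `sort_by` are all hypothetical. They are not the attribute names EFI-EST writes or an interface Cytoscape provides.

```python
# Hypothetical node attribute records; every name and value is invented.
NODE_ATTRIBUTES = [
    {"id": "seqA", "species": "Species one", "annotation": "hydrolase",
     "length": 312, "pdb": True},
    {"id": "seqB", "species": "Species two", "annotation": "hydrolase",
     "length": 298, "pdb": False},
    {"id": "seqC", "species": "Species one", "annotation": "unknown",
     "length": 405, "pdb": False},
]


def search(nodes, key, value):
    """Return the nodes whose attribute `key` equals `value`."""
    return [node for node in nodes if node.get(key) == value]


def sort_by(nodes, key):
    """Return the nodes ordered by attribute `key`, smallest first."""
    return sorted(nodes, key=lambda node: node[key])


# Search: which proteins come from "Species one"?
print([node["id"] for node in search(NODE_ATTRIBUTES, "species", "Species one")])

# Sort: order the proteins by sequence length.
print([node["id"] for node in sort_by(NODE_ATTRIBUTES, "length")])
```

In Cytoscape the same operations are carried out interactively on the node table rather than in code.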
