Prof. Patricia Babbitt’s group at UCSF first developed sequence similarity networks (SSNs) as a way to deal with the ever-increasing deluge of sequences deposited in public databases. Their seminal papers describing the SSN technique (1) and their program Pythoscape (2) were the inspiration for the EFI’s EFI-EST webserver.
A sequence similarity network (SSN) allows relationships among protein sequences to be visualized. In SSNs, the most closely related proteins are grouped together in clusters.
The generation of an SSN involves two steps. First, a set of sequences to analyze is chosen, and an all-by-all BLAST is performed to determine the similarity of each pair of sequences in the data set, which serves as a measure of their relatedness. The second step involves filtering the sequences into clusters, based on a user-defined similarity threshold.
When visualizing an SSN, protein sequences are represented as “nodes”. The line connecting two nodes is an “edge”, which indicates the relatedness between the nodes. An edge is drawn between two nodes only if their BLAST pairwise similarity score is above a user-defined threshold (Figure 1).
The user defines the threshold at which sequences are connected in a network. For families that contain non-isofunctional enzymes, a threshold score that separates the different functions into independent clusters is a good starting point. There is no predefined threshold: each protein set has its own optimal threshold, which must be determined empirically.
Groups of highly similar proteins remain highly interconnected even as the alignment score threshold is increased. These “clusters” are often very useful for the interrogation of enzyme function. Experienced users generate and compare several SSNs at various thresholds to see how the interconnectivity changes.
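The edge rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the EFI-EST implementation: the protein names, the scores, and the `build_edges` helper are hypothetical, and treating a score equal to the threshold as connected is an assumption.

```python
# Hypothetical pairwise alignment scores (illustrative values, not real BLAST output).
pairwise_scores = {
    ("P1", "P2"): 60,
    ("P1", "P3"): 30,
    ("P2", "P3"): 32,
    ("P3", "P4"): 15,
}


def build_edges(scores, threshold):
    """Return the sorted node pairs whose score meets the threshold.

    Assumption: a score equal to the threshold still produces an edge.
    """
    return sorted(pair for pair, score in scores.items() if score >= threshold)


# With a threshold of 28, the weak P3-P4 relationship (score 15) is not drawn.
edges = build_edges(pairwise_scores, threshold=28)
print(edges)
```

Here P4 ends up with no edges at all, so it would appear as an isolated node in the network.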
Figure 1. Example of a simple sequence similarity network as a function of e-value.
In Figure 1, if the alignment score threshold is specified as 28 (center), edges are drawn only between nodes (protein sequences) that share that level of similarity or greater. If two proteins are not connected, their sequences are less similar than the threshold value of 28 requires. If the network is recalculated at a more stringent (greater) alignment score (right, threshold 56), the network is segregated into clusters of highly similar proteins. If the network is recalculated at a more permissive (lesser) alignment score (left, threshold 14), relationships between previously segregated proteins become apparent.
Although not as rigorous as traditional phylogenetic trees, SSNs typically display the same topology (Figure 2). The advantage of SSNs over trees is that large sequence sets (e.g., many thousands of proteins) can be analyzed much more quickly and visualized easily using the network visualization program Cytoscape (3).
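The effect of the threshold on clustering can be reproduced with a small sketch. The thresholds 14, 28, and 56 come from Figure 1, but the six nodes and their scores are invented for illustration, and `find_clusters` is a hypothetical helper that groups nodes into connected components with a simple union-find. It is a sketch of the idea, not the method used by EFI-EST or Cytoscape.

```python
def find_clusters(nodes, scores, threshold):
    """Group nodes into connected components, keeping edges with score >= threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical proteins and alignment scores (not taken from Figure 1).
nodes = ["A", "B", "C", "D", "E", "F"]
scores = {
    ("A", "B"): 70, ("B", "C"): 60, ("A", "C"): 58,
    ("D", "E"): 65,
    ("C", "D"): 30,
    ("E", "F"): 16,
}

for threshold in (14, 28, 56):
    print(threshold, find_clusters(nodes, scores, threshold))
```

With these invented scores, the permissive threshold of 14 yields a single cluster, 28 splits off node F, and 56 separates the remaining nodes into two tight groups plus the singleton.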
Figure 2. Rooted phylogenetic tree (UPGMA) created with ClustalW (A) using the same sequence set as shown in the network in Figure 1 (B). Proteins in the tree are identified by their six character UniProt accession numbers.
Besides speed, another major advantage of SSNs is the ability to include pertinent information for each individual protein (such as species, annotation, length, PDB deposition, etc.). This information is included as “node attributes” which are searchable and sortable within a sequence similarity network displayed in Cytoscape (3).
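To make the idea of searchable and sortable node attributes concrete, the following sketch stores attributes in plain Python dictionaries. The accession-style names, attribute keys, and values are all hypothetical. In practice this searching and sorting is done interactively in Cytoscape rather than in code.

```python
# Hypothetical node attributes keyed by a made-up protein identifier.
node_attributes = {
    "PROT01": {"species": "Escherichia coli", "length": 312, "has_pdb": True},
    "PROT02": {"species": "Bacillus subtilis", "length": 298, "has_pdb": False},
    "PROT03": {"species": "Escherichia coli", "length": 405, "has_pdb": False},
}

# Search: which nodes come from a given species?
e_coli_nodes = sorted(
    name for name, attrs in node_attributes.items()
    if attrs["species"] == "Escherichia coli"
)

# Sort: order all nodes by sequence length, shortest first.
by_length = sorted(node_attributes, key=lambda name: node_attributes[name]["length"])

print(e_coli_nodes)
print(by_length)
```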
SSNs have proven useful for examining the sequence relationships between proteins and have aided functional assignment.
SSN typical usage:
In either case, SSNs allow the user to quickly and easily view sequence relationships and gather information about proteins of known and unknown function. As sequence databases grow, along with the need to combine disparate information into a single visual aid, SSNs are increasingly valuable for developing hypotheses.