A sequence similarity network (SSN) allows for visualization of relationships among
protein sequences. In SSNs, the most related proteins are grouped together in
clusters.
The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs.
Cytoscape is used to explore SSNs.
The Color SSNs and Cluster Analysis tabs are now included on the SSN Utilities tab. Neighborhood Connectivity (NC) is a new tool on the SSN Utilities tab. NC colors the input SSN according to the number of internode connections. NC coloring helps identify families in SSNs generated with low alignment scores. The EST database has been updated to use UniProt 2021_01 and InterPro 84.
A listing of new features and other information pertaining to EST is available on the
release notes page.
Generate an SSN for a single protein and its closest homologues in the UniProt database.
The input sequence is used as the query for a search of the UniProt database using BLAST.
Sequences that are similar to the query in UniProt are retrieved.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs to
calculate edge values to generate the SSN.
Generate an SSN for a protein family.
The sequences from the Pfam families, InterPro families, and/or Pfam clans (superfamilies) input are retrieved.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs to
calculate edge values to generate the SSN.
Generate an SSN from provided sequences.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs to
calculate edge values to generate the SSN.
Input a list of protein sequences in FASTA format or upload a FASTA-formatted sequence file.
Generate an SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs to
calculate edge values to generate the SSN.
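The options above all end with the same step: pairwise similarities become edge values, and the edges define the network. A minimal sketch of that last step follows. It does not run BLAST; the pair scores, the IDs, and the `build_ssn_edges` helper are hypothetical, and the cutoff stands in for whatever alignment score threshold is chosen.

```python
def build_ssn_edges(pair_scores, min_score):
    """Keep only the sequence pairs whose score reaches the cutoff.

    pair_scores: dict mapping (id_a, id_b) -> similarity score
                 (hypothetical values; a real run would derive them from BLAST).
    Returns a list of (id_a, id_b, score) edges.
    """
    edges = []
    for (a, b), score in pair_scores.items():
        if a != b and score >= min_score:
            edges.append((a, b, score))
    return edges


# Hypothetical all-by-all scores for three sequences.
pair_scores = {
    ("P1", "P2"): 120.0,
    ("P1", "P3"): 15.0,
    ("P2", "P3"): 80.0,
}

edges = build_ssn_edges(pair_scores, min_score=50.0)
for a, b, score in edges:
    print(a, b, score)
```

Raising `min_score` removes edges, which is how a network breaks apart into smaller clusters.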
Clusters in the submitted SSN are identified, numbered and colored.
Summary tables, sets of IDs and sequences per cluster are provided.
The clusters are numbered and colored using two conventions: 1)
Sequence Count Cluster Numbers are assigned in order of decreasing number of
UniProt IDs in the cluster; 2) Node Count Cluster Numbers are assigned in order
of decreasing number of nodes in the cluster.
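The two numbering conventions can give different results when nodes represent more than one sequence. The sketch below illustrates this with a toy network; the node names, the `ids_per_node` counts, and both helper functions are hypothetical and are not taken from the tool itself.

```python
from collections import defaultdict


def find_clusters(nodes, edges):
    """Group nodes into clusters (connected components) with a depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        members = []
        while stack:
            current = stack.pop()
            members.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(members))
    return clusters


def number_clusters(clusters, size_of):
    """Assign cluster numbers 1, 2, ... in order of decreasing size."""
    ranked = sorted(clusters, key=size_of, reverse=True)
    return {number: cluster for number, cluster in enumerate(ranked, start=1)}


# Hypothetical network: node D is a representative node standing for 10 UniProt IDs.
ids_per_node = {"A": 1, "B": 1, "C": 1, "D": 10, "E": 1}
edges = [("A", "B"), ("B", "C"), ("D", "E")]

clusters = find_clusters(list(ids_per_node), edges)
by_node_count = number_clusters(clusters, size_of=len)
by_sequence_count = number_clusters(
    clusters, size_of=lambda c: sum(ids_per_node[n] for n in c)
)
print(by_node_count)
print(by_sequence_count)
```

Here the three-node cluster is number 1 by node count, while the two-node cluster is number 1 by sequence count because it represents 11 UniProt IDs.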
Like the Color SSN utility, clusters in the submitted SSN are identified,
numbered and colored.
The SSN clusters are numbered and colored using two conventions:
Sequence Count Cluster Numbers
are assigned in order of decreasing number of UniProt IDs in
the cluster; Node Count Cluster Numbers are assigned in order of decreasing
number of nodes in the cluster.
The convergence ratio for each cluster is also calculated. The convergence
ratio is the ratio of the number of edges in a cluster to the number of
sequence pairs in that cluster. The value decreases from 1.0 for a cluster of
very similar sequences (same function?) to <<1.0 for clusters of distantly
related sequences (different functions?).
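As a worked illustration of the convergence ratio, the sketch below assumes that the "number of sequence pairs" is the number of distinct pairs among the n sequences in a cluster, n(n-1)/2. That reading, and the `convergence_ratio` helper, are assumptions made for this example.

```python
def convergence_ratio(num_sequences, num_edges):
    """Edges present in a cluster divided by the number of possible sequence pairs.

    Assumes the number of pairs is n * (n - 1) / 2. Returns None for a
    cluster with fewer than two sequences, where no pair exists.
    """
    possible_pairs = num_sequences * (num_sequences - 1) // 2
    if possible_pairs == 0:
        return None
    return num_edges / possible_pairs


# Four sequences, every pair connected: 6 edges out of 6 possible pairs.
fully_connected = convergence_ratio(num_sequences=4, num_edges=6)

# Four sequences connected only in a chain: 3 edges out of 6 possible pairs.
chain = convergence_ratio(num_sequences=4, num_edges=3)

print(fully_connected, chain)
```

A cluster in which every pair shares an edge scores 1.0; the sparser the edges, the lower the ratio.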
Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs),
length histograms, and consensus residues are computed for each cluster.
Options are available in the tabs below to select the desired analyses:
The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster
containing more nodes than the "Minimum Node Count" specified in the
Sequence Filter tab.
The Consensus Residues tab provides a tab-delimited text file with the number
of conserved residues and their MSA positions for each specified residue in
each cluster containing more nodes than the "Minimum Node Count". Note that the
default residue is "C" and that the percent identity levels displayed range
from 90% to 10% in intervals of 10%; a residue is counted as "conserved" if it
occurs with ≥80% identity.
The HMMs tab provides the HMM for each cluster containing more nodes than the
specified "Minimum Node Count".
The Length Histograms tab provides length histograms for each cluster
containing more nodes than the specified "Minimum Node Count".
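The kind of count reported by the Consensus Residues tab can be sketched as follows: for one residue, find the MSA columns where it occurs at or above each percent identity level from 90% down to 10%. The toy alignment and the `conserved_positions` helper are hypothetical, and the choice to count gap characters in the denominator is an assumption of this example.

```python
def conserved_positions(msa, residue="C", levels=range(90, 0, -10)):
    """For each percent level, list the 1-based MSA columns where `residue`
    occurs in at least that percentage of the aligned sequences.

    Gaps ('-') count toward the total number of sequences in a column
    (an assumption of this sketch).
    """
    num_sequences = len(msa)
    num_columns = len(msa[0])
    positions_by_level = {}
    for level in levels:
        positions = []
        for column in range(num_columns):
            count = sum(1 for sequence in msa if sequence[column] == residue)
            # Integer comparison avoids floating-point rounding at the boundary.
            if count * 100 >= level * num_sequences:
                positions.append(column + 1)
        positions_by_level[level] = positions
    return positions_by_level


# Hypothetical five-sequence alignment of equal-length rows.
msa = ["ACDC", "ACEC", "ACD-", "GCDA", "ACDC"]

result = conserved_positions(msa)
for level, positions in result.items():
    print(f"{level}%\t{len(positions)}\t{positions}")
```

In this toy alignment, column 2 is "C" in every sequence, while column 4 is "C" in three of five sequences and so only appears at the 60% level and below.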
Nodes in the submitted SSN are colored according to neighborhood connectivity
(number of edges to other nodes).
The nodes for unresolved families can be difficult to identify in SSNs
generated with low alignment scores. Coloring the nodes according to the
number of edges to other nodes (Neighborhood Connectivity, NC) helps identify
families with highly connected nodes
(https://www.biorxiv.org/content/10.1101/2020.04.16.045138v1.full).
Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold
can be chosen in Cytoscape to separate the SSN into families.
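To make the coloring idea concrete, the sketch below takes the description above literally and scores each node by its number of edges, then scales the scores to the range 0 to 1 so they could drive a color gradient. The helper names and the toy network are hypothetical, and the calculation used by the tool itself may differ.

```python
from collections import Counter


def node_connectivity(nodes, edges):
    """Count the edges attached to each node, including nodes with none."""
    degree = Counter({node: 0 for node in nodes})
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def scale_for_coloring(connectivity):
    """Scale connectivity values to 0..1 relative to the most connected node."""
    highest = max(connectivity.values(), default=0)
    if highest == 0:
        return {node: 0.0 for node in connectivity}
    return {node: value / highest for node, value in connectivity.items()}


# Hypothetical network: A, B and C are tightly connected; D hangs off C.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

connectivity = node_connectivity(nodes, edges)
shades = scale_for_coloring(connectivity)
print(connectivity)
print(shades)
```

Nodes with high values sit in densely connected regions, which is the visual cue used to pick out families before raising the alignment score threshold.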
UniProt Version: 2021_01
InterPro Version: 84
The family(ies) selected has proteins—this is greater than
the maximum allowed (25,000). To reduce computing time and the size of the
output SSN, UniRef90 cluster ID sequences will automatically be used.
In UniRef90, sequences that share
≥90% sequence identity over 80% of the sequence
length are grouped together and represented by an accession ID known as the cluster ID. The output
SSN is equivalent to a 90% Representative Node
Network with each node corresponding to a UniRef cluster ID, and for which the node attribute
"UniRef90 Cluster IDs" lists
all the sequences represented by a node. UniRef90
SSNs are compatible with the Color SSN utility as well as the EFI-GNT tool.
Press Ok to continue with UniRef90.