Generate an SSN for a single protein and its closest homologues in the UniProt, UniRef90, or UniRef50 database.
The input sequence is used as the query for a BLAST search of the UniProt, UniRef90, or UniRef50 database. For the UniRef90 and UniRef50 databases, the sequence of the cluster ID (the representative sequence) is used for the BLAST.
The database is selected using the BLAST Retrieval Options.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs, from which the edge values for the SSN are calculated.
Generate an SSN for a protein family.
The members of the input Pfam families, InterPro families, and/or Pfam clans are selected from the UniProt, UniRef90, or UniRef50 database.
Generate an SSN from FASTA-formatted UniProt sequences.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs, from which the edge values for the SSN are calculated.
Input a list of sequences in the FASTA format or upload a FASTA-formatted sequence file.
The sequences in the FASTA file are used to calculate edge values.
The ID in the header that immediately follows the ">" is used to retrieve node attribute information. Acceptable IDs include UniProt IDs, PDB IDs, and NCBI GenBank IDs that have equivalent entries in the UniProt database.
If the header for a sequence does not contain an acceptable ID for retrieving node attribute information, the SSN provides node attributes for only the sequence, sequence length, and the header as the Description.
If the user identifies the input sequences as UniRef50 or UniRef90, the node attributes will include UniRef Cluster Size and UniRef Cluster IDs. The other node attributes will be lists of the values for the UniRef cluster IDs in the node.
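The header rule described for the FASTA option can be illustrated with a minimal sketch. It assumes the ID is the whitespace-delimited token that starts directly after the ">"; the function names `header_id` and `read_fasta` are hypothetical and are not part of the tool, and the sketch does not check whether the token is actually a valid UniProt, PDB, or GenBank ID.

```python
def header_id(header_line):
    """Return the token that immediately follows '>' in a FASTA header, or None.

    Assumption: the ID ends at the first whitespace character.
    """
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line")
    rest = header_line[1:]
    # An ID must start right after '>'; a leading space means no ID is present there.
    if not rest or rest[0].isspace():
        return None
    return rest.split()[0]


def read_fasta(text):
    """Yield (id_or_None, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header_id(header), header[1:].strip(), "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header_id(header), header[1:].strip(), "".join(chunks)


records = list(read_fasta(">P12345 some protein\nMKT\nAAA\n> no id here\nGGG\n"))
```

A record whose ID is `None` corresponds to the case where only the sequence, the sequence length, and the header (as the Description) are available as node attributes.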
Generate an SSN from a list of UniProt, UniRef, NCBI, or GenBank IDs.
An all-by-all BLAST is performed to obtain the similarities between sequence pairs, from which the edge values for the SSN are calculated.
Clusters in the submitted SSN are identified, numbered, and colored. Summary tables and per-cluster sets of IDs and sequences are provided for sequences identified by a UniProt ID.
The clusters are numbered and colored using two conventions: 1) Sequence Count Cluster Numbers are assigned in order of decreasing number of UniProt IDs in the cluster; 2) Node Count Cluster Numbers are assigned in order of decreasing number of nodes in the cluster.
An input SSN from the EFI-EST FASTA option should be generated using "Read FASTA headers" from FASTA files with UniProt IDs in the headers. Otherwise, sets of IDs and sequences, MSAs, WebLogos, HMMs, consensus residues, and length histograms will not be generated.
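The two numbering conventions can be sketched as below. This is only an illustration: the input layout (a dict of clusters, each a list of nodes, each node a list of the UniProt IDs it represents) and the name `number_clusters` are assumptions, and ties are broken by input order, which the description above does not specify.

```python
def number_clusters(clusters):
    """Assign cluster numbers under both conventions.

    clusters: dict mapping an arbitrary cluster key to a list of nodes,
              where each node is a list of the UniProt IDs it represents.
    Returns (sequence_count_numbers, node_count_numbers), each a dict
    mapping cluster key -> cluster number (1 = largest).
    """
    def rank(size_of):
        # sorted() is stable, so clusters of equal size keep their input order.
        ordered = sorted(clusters, key=lambda k: size_of(clusters[k]), reverse=True)
        return {key: number for number, key in enumerate(ordered, start=1)}

    by_sequences = rank(lambda nodes: sum(len(ids) for ids in nodes))
    by_nodes = rank(len)
    return by_sequences, by_nodes


example = {
    "a": [["P1", "P2", "P3", "P4", "P5"]],   # 1 node, 5 UniProt IDs
    "b": [["Q1"], ["Q2"], ["Q3"]],           # 3 nodes, 3 UniProt IDs
}
seq_numbers, node_numbers = number_clusters(example)
```

In the example, cluster "a" is number 1 by sequence count but number 2 by node count, which is why the two numberings can differ for networks whose nodes represent several sequences.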
As in the Color SSN utility, clusters in the submitted SSN are identified, numbered, and colored.
The SSN clusters are numbered and colored using two conventions: Sequence Count Cluster Numbers are assigned in order of decreasing number of UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of decreasing number of nodes in the cluster.
Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs), length histograms, and consensus residues are computed for each cluster.
Options are available in the tabs below to select the desired analyses:
The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster containing greater than the "Minimum Node Count" specified in the Sequence Filter tab. The percent identity matrix for the MSA is also provided on this tab.
The Consensus Residues tab provides a tab-delimited text file with the number of the conserved residues and their MSA positions for each specified residue in each cluster containing greater than the "Minimum Node Count". Note that the default residue is "C" and that the percent identity levels displayed run from 90 to 10% in intervals of 10%; a residue is counted as "conserved" if it occurs with ≥80% identity.
The HMMs tab provides the HMM for each cluster containing greater than the specified "Minimum Node Count".
The Length Histograms tab provides length histograms for each cluster containing greater than the specified "Minimum Node Count".
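The kind of tally reported on the Consensus Residues tab can be illustrated with a short sketch. It assumes the MSA is given as a list of equal-length aligned strings and that percent identity at a position is the fraction of all sequences (gaps included) that carry the chosen residue there; the function name `conserved_positions` is hypothetical and the tool's actual calculation may differ.

```python
def conserved_positions(msa, residue="C", thresholds=range(90, 0, -10)):
    """Return {threshold: [1-based MSA positions]} for one residue.

    A position is listed under a threshold when `residue` occurs there in at
    least that percentage of the aligned sequences. Default thresholds are
    90, 80, ..., 10, matching the levels named in the text.
    """
    if not msa:
        return {t: [] for t in thresholds}
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("aligned sequences must all have the same length")
    n = len(msa)
    # Percentage of sequences carrying `residue` at each alignment column.
    percent = [100.0 * sum(seq[i] == residue for seq in msa) / n
               for i in range(length)]
    return {t: [i + 1 for i, p in enumerate(percent) if p >= t]
            for t in thresholds}


toy_msa = ["ACD", "AC-", "GCD", "ACC"]
table = conserved_positions(toy_msa)
```

For the toy alignment, column 2 is "C" in every sequence and column 3 in one of four, so column 3 appears only at the 20% and 10% levels.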
Nodes in the submitted SSN are colored according to neighborhood connectivity (number of edges to other nodes).
The nodes for unresolved families can be difficult to identify in SSNs generated with low alignment scores. Coloring the nodes according to the number of edges to other nodes (Neighborhood Connectivity, NC) helps identify families with highly connected nodes (https://doi.org/10.1016/j.heliyon.2020.e05867). Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold can be chosen in Cytoscape to separate the SSN into families.
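As a rough illustration of the coloring input, the sketch below counts edges per node from an edge list, following the description given here (number of edges to other nodes), and scales the counts to the range 0 to 1 so they could be mapped onto a color gradient. The function names and the linear scaling are assumptions for illustration; the utility's actual computation and color scale are not specified in this text.

```python
from collections import Counter


def edge_counts(edges):
    """Count, for each node, the edges that connect it to other nodes.

    edges: iterable of (node_a, node_b) pairs; self-edges are ignored.
    Nodes with no edges do not appear in the result.
    """
    counts = Counter()
    for a, b in edges:
        if a == b:
            continue  # a self-edge is not an edge to another node
        counts[a] += 1
        counts[b] += 1
    return counts


def scale_counts(counts):
    """Map edge counts linearly onto 0..1 (1 = most connected node)."""
    top = max(counts.values(), default=0)
    return {node: (c / top if top else 0.0) for node, c in counts.items()}


toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
counts = edge_counts(toy_edges)
shades = scale_counts(counts)
```

In the toy network, node "A" has the most edges and receives the top of the scale, while "D", attached by a single edge, receives the lowest value.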
Convergence ratio is calculated per cluster.