Generating the SSN requires a minimum sequence similarity threshold that specifies which sequence pairs are connected by edges. This threshold also determines how the proteins segregate into clusters. The threshold is applied to the edges in the SSN using the alignment score, an edge attribute that measures the similarity between a pair of sequences.
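The effect of the threshold on clustering can be illustrated with a minimal sketch. It assumes that an edge is kept when its alignment score is at or above the threshold and that clusters are the connected components of the remaining graph; the node names, scores, and the function name `cluster_by_threshold` are hypothetical and are not part of the tool.

```python
from collections import defaultdict


def cluster_by_threshold(nodes, edges, threshold):
    """Keep edges whose alignment score is >= threshold and return the
    connected components (clusters), largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= threshold:  # assumption: the threshold is inclusive
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Hypothetical toy network: (node, node, alignment score)
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 80), ("B", "C", 30), ("C", "D", 75)]

loose = cluster_by_threshold(nodes, edges, 20)   # every edge kept
strict = cluster_by_threshold(nodes, edges, 50)  # the B-C edge is dropped
```

In this toy example, raising the threshold from 20 to 50 removes the weak B-C edge and splits one cluster into two.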
The parameters for generating the initial dataset are summarized in the table.
|Database Version||UniProt: 2022-04 / InterPro: 91|
|Input Option||Families (Option B)|
|E-Value for SSN Edge Calculation||5|
|Pfam / InterPro Family||IPR000385, IPR001989, IPR002684, IPR003698, IPR003739, IPR004383, IPR004558, IPR004559, IPR005839, IPR005840, IPR005909, IPR005911, IPR005980, IPR006463, IPR006466, IPR006467, IPR006638, IPR007197, IPR010505, IPR010722, IPR010723, IPR011101, IPR011843, IPR012726, IPR012837, IPR012838, IPR012839, IPR013483, IPR013704, IPR013848, IPR013917, IPR014191, IPR016431, IPR016771, IPR016779, IPR016863, IPR017200, IPR017672, IPR017742, IPR017833, IPR017834, IPR019939, IPR019940, IPR020050, IPR020612, IPR022431, IPR022432, IPR022447, IPR022459, IPR022462, IPR022881, IPR022946, IPR023404, IPR023805, IPR023807, IPR023819, IPR023820, IPR023821, IPR023822, IPR023858, IPR023862, IPR023863, IPR023867, IPR023868, IPR023874, IPR023880, IPR023885, IPR023886, IPR023891, IPR023897, IPR023904, IPR023912, IPR023913, IPR023930, IPR023969, IPR023979, IPR023980, IPR023984, IPR023992, IPR023993, IPR023995, IPR024001, IPR024007, IPR024016, IPR024017, IPR024018, IPR024021, IPR024023, IPR024025, IPR024032, IPR024177, IPR024521, IPR024560, IPR024924, IPR025895, IPR026322, IPR026332, IPR026335, IPR026344, IPR026346, IPR026351, IPR026357, IPR026401, IPR026404, IPR026407, IPR026412, IPR026423, IPR026426, IPR026429, IPR026447, IPR026482, IPR027492, IPR027526, IPR027527, IPR027559, IPR027564, IPR027570, IPR027583, IPR027586, IPR027596, IPR027604, IPR027608, IPR027609, IPR027621, IPR027622, IPR027626, IPR027633, IPR030801, IPR030837, IPR030894, IPR030896, IPR030905, IPR030915, IPR030933, IPR030950, IPR030969, IPR030977, IPR030989, IPR031003, IPR031004, IPR031010, IPR031012, IPR031014, IPR031015, IPR031019, IPR031691, IPR032432, IPR033971, IPR033974, IPR033975, IPR033976, IPR034165, IPR034386, IPR034391, IPR034405, IPR034422, IPR034428, IPR034436, IPR034438, IPR034457, IPR034462, IPR034465, IPR034466, IPR034471, IPR034474, IPR034479, IPR034480, IPR034485, IPR034491, IPR034497, IPR034498, IPR034505, IPR034508, IPR034514, IPR034515, IPR034519, IPR034529, IPR034530, IPR034531, IPR034532, IPR034534, IPR034547, IPR034556, IPR034557, IPR034559, IPR034560, IPR034687, IPR038135, IPR039661, IPR040063, IPR040072, IPR040074, IPR040081, IPR040082, IPR040085, IPR040086, IPR040087, IPR040088, IPR041582, IPR045375, IPR045567, IPR045784, PF04055, PF06969, PF08497, PF12345, PF13186, PF16199, PF16881, PF19238, PF19288, PF19864|
|Number of IDs in Pfam / InterPro Family||773,531|
|Number of Cluster IDs in UniRef90 Family||3,870|
|Taxonomy Categories||Class: epsilonproteobacteria|
|Total Number of Sequences in Dataset||3,870|
|Total Number of Edges||566,085|
|Number of Unique Sequences||3,870|
The taxonomy distribution for the UniProt IDs in the input dataset is displayed. For UniRef90 and UniRef50 cluster datasets, the UniProt IDs in the clusters are retrieved from the lookup table provided by UniProt/UniRef.
The UniRef90 and UniRef50 clusters containing the UniProt IDs are then identified using the lookup table provided by UniProt/UniRef. These UniRef90 and UniRef50 clusters may contain UniProt IDs from other families; in addition, the UniRef90 and UniRef50 clusters in the selected taxonomy category may contain UniProt IDs from other categories. This occurs because UniRef90 and UniRef50 clusters group UniProt IDs that share ≥90% and ≥50% sequence identity, respectively, regardless of family or taxonomy category.
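The cluster lookup described above can be sketched as follows. This is an illustration only: the lookup table is modeled as a plain dictionary from UniProt ID to UniRef90 cluster ID, and the IDs, the cluster names, and the function name `uniref_clusters` are hypothetical rather than the actual UniProt/UniRef file format.

```python
from collections import defaultdict


def uniref_clusters(family_ids, lookup):
    """Map each cluster that contains a family member to a pair of sets:
    (members inside the family, members outside the family)."""
    members = defaultdict(set)
    for uniprot_id, cluster_id in lookup.items():
        members[cluster_id].add(uniprot_id)

    family = set(family_ids)
    hit_clusters = {lookup[i] for i in family if i in lookup}
    return {c: (members[c] & family, members[c] - family) for c in hit_clusters}


# Hypothetical lookup table: UniProt ID -> UniRef90 cluster ID
lookup = {
    "P1": "UniRef90_P1",
    "P2": "UniRef90_P1",
    "P3": "UniRef90_P3",
    "P4": "UniRef90_P3",  # shares a cluster with P3 but is not in the family
}
family_ids = ["P1", "P2", "P3"]

clusters = uniref_clusters(family_ids, lookup)
```

Here `UniRef90_P3` is selected because it contains the family member P3, yet it also brings in P4, which is how IDs from other families or taxonomy categories can appear in a cluster dataset.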
The numbers of UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs for the selected category are displayed.
The sunburst is interactive: clicking on a taxonomy category zooms the display to that category, and clicking on the center circle zooms the display out to the next higher rank.
This tab provides histograms and box plots with statistics about the sequences in the input dataset as well as the BLAST all-by-all pairwise comparisons that were computed.
The descriptions for the histograms and plots guide the choice of the values for the "Alignment Score Threshold" and the Minimum and Maximum "Sequence Length Restrictions" that are applied to the sequences and edges to generate the SSN.
This histogram describes the length distribution for all sequences (UniProt IDs) in the input dataset.
Inspection of the histogram permits identification of fragments, single domain proteins, and multidomain fusion proteins. This histogram is used to select Minimum and Maximum "Sequence Length Restrictions" in the "SSN Finalization" tab to remove fragments, select only single domain proteins, or select multidomain proteins. The sequences in the "Sequences as a Function of Full-Length Histogram (UniRef90 Cluster IDs)" (last histogram) are used to calculate the edges.
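As a rough illustration of how Minimum and Maximum length restrictions act on a dataset, the sketch below filters a dictionary of sequence lengths. The sequence IDs, lengths, cutoff values, and the function name `apply_length_restrictions` are hypothetical; in practice the cutoffs would be read from the histogram.

```python
def apply_length_restrictions(lengths, min_len=None, max_len=None):
    """Return the sequences whose length lies within [min_len, max_len].
    A bound of None means that side is unrestricted."""
    kept = {}
    for seq_id, length in lengths.items():
        if min_len is not None and length < min_len:
            continue  # shorter than the minimum: likely a fragment
        if max_len is not None and length > max_len:
            continue  # longer than the maximum: likely a multidomain fusion
        kept[seq_id] = length
    return kept


# Hypothetical sequence lengths (residues)
lengths = {"frag1": 85, "single1": 310, "single2": 342, "fusion1": 720}

no_fragments = apply_length_restrictions(lengths, min_len=250)
single_domain = apply_length_restrictions(lengths, min_len=250, max_len=450)
```

Setting only a minimum removes fragments, while setting both bounds keeps only the single domain proteins in this toy dataset.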
This box plot describes the relationship between the query-subject alignment lengths used by BLAST (y-axis) to calculate the alignment scores (x-axis).
This plot shows a monophasic increase in alignment length to a constant value for single domain proteins; this plot shows multiphasic increases in alignment length for datasets with multidomain proteins (one phase for each fusion length). The value of the "Alignment Score Threshold" for generating the SSN (entered in the "SSN Finalization" tab) should be selected (from the "Percent Identity vs Alignment Score Box Plot"; next box plot) at an alignment length ≥ the minimum length of single domain proteins in the dataset (determined by inspection of the "Sequences as a Function of Full-Length Histogram"; previous histogram). In that region, the "Alignment Length" should be independent of the "Alignment Score".
This box plot describes the pairwise percent sequence identity as a function of alignment score.
Complementing the "Alignment Length vs Alignment Score Box Plot" (previous box plot), this box plot describes a monophasic increase in sequence identity for single domain proteins or a multiphasic increase in sequence identity for datasets with multidomain proteins (one phase for each fusion length). In the "Alignment Length vs Alignment Score" box plot (previous box plot), a monophasic increase in sequence identity occurs as the alignment score increases at a constant alignment length; multiphasic increases occur as the alignment score increases at additional longer constant alignment lengths.
For the initial SSN, we recommend that an alignment score corresponding to 35 to 40% pairwise identity be entered in the "SSN Finalization" tab (for the first phase in multiphasic plots).
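One way to read this recommendation off the data, rather than off the plot, is sketched below. It assumes that the median percent identity at each alignment score stands in for the box in the box plot and that the lowest qualifying score corresponds to the first phase; the input pairs and the function name `suggest_alignment_score` are hypothetical.

```python
from statistics import median


def suggest_alignment_score(pairs, low=35.0, high=40.0):
    """pairs: iterable of (alignment_score, percent_identity).
    Return the lowest alignment score whose median percent identity
    falls within [low, high], or None if no score qualifies."""
    by_score = {}
    for score, identity in pairs:
        by_score.setdefault(score, []).append(identity)

    for score in sorted(by_score):
        if low <= median(by_score[score]) <= high:
            return score
    return None


# Hypothetical pairwise results: (alignment score, percent identity)
pairs = [
    (10, 25.0), (10, 28.0),
    (20, 33.0), (20, 34.0),
    (30, 36.0), (30, 38.0),
    (40, 45.0), (40, 47.0),
]

initial_score = suggest_alignment_score(pairs)
```

With these toy values the median identity first enters the 35 to 40% window at an alignment score of 30, so that value would be the starting point for the initial SSN.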
This plot shows the number of edges in the full SSN for the input dataset (one node for each sequence) as a function of alignment score. Moving the cursor over the plot displays the number of edges at each alignment score.
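The relationship between alignment score and edge count can be sketched as a cumulative count. This assumes the plotted value at a given alignment score is the number of edges that would remain if that score were used as the threshold (edges with a score at or above it); the scores and the function name `edges_at_or_above` are hypothetical.

```python
from collections import Counter


def edges_at_or_above(scores):
    """Return {alignment_score: number of edges with score >= alignment_score}."""
    counts = Counter(scores)
    running = 0
    result = {}
    # Walk from the highest score down, accumulating the edge count.
    for score in sorted(counts, reverse=True):
        running += counts[score]
        result[score] = running
    return result


# Hypothetical alignment scores, one per edge
scores = [5, 5, 10, 20, 20, 20]

edge_counts = edges_at_or_above(scores)
```

The counts fall as the alignment score rises, which is why a higher threshold yields a smaller, more fragmented network.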
This plot helps determine whether the full SSN generated using the initial alignment score can be opened with Cytoscape on the user's computer. As a rough guide, SSNs with ~2M edges can be opened with 16 GB RAM, ~4M edges with 32 GB RAM, ~8M edges with 64 GB RAM, ~15M edges with 128 GB RAM, and ~30M edges with 256 GB RAM.
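The rough guide above can be turned into a small lookup. This is a sketch only: the tiers are the approximate figures quoted in the text, treated here as hard upper bounds for simplicity, and the function name `ram_needed_gb` is hypothetical.

```python
# (approximate maximum number of edges, RAM in GB), from the rough guide
RAM_GUIDE = [
    (2_000_000, 16),
    (4_000_000, 32),
    (8_000_000, 64),
    (15_000_000, 128),
    (30_000_000, 256),
]


def ram_needed_gb(edge_count):
    """Return the smallest RAM tier (GB) whose edge limit covers edge_count,
    or None if the network exceeds the largest tier in the guide."""
    for max_edges, ram_gb in RAM_GUIDE:
        if edge_count <= max_edges:
            return ram_gb
    return None


# The dataset summarized above has 566,085 edges in total.
dataset_ram = ram_needed_gb(566_085)
```

A `None` result suggests that the full SSN is unlikely to open on common hardware and that a lower-resolution network should be considered instead.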
If the number of edges for the full SSN is too large to be opened, a representative node (rep node) SSN can be opened. In a rep node SSN, sequences are grouped into metanodes based on pairwise sequence identity (from 40 to 100% identity, in 5% intervals). The download tables on the "Download Network Files" page provide the numbers of metanodes and edges in rep node SSNs. The rep node SSNs are lower resolution than full SSNs; clusters of interest in rep node SSNs can be expanded to provide the full SSNs.
This histogram describes the number of edges calculated at each alignment score. This plot is not used to select the alignment score for the initial SSN; however, it provides an overview of the functional diversity within the input dataset.
In the histogram, edges with low alignment scores typically are those between isofunctional clusters; edges with large alignment scores typically are those connecting nodes within isofunctional clusters.
The histogram for a dataset with a single isofunctional SSN cluster is a single distribution centered at a "large" alignment score; the histogram for a dataset with many isofunctional SSN clusters will be dominated by the edges that connect the clusters, with the number of edges decreasing as the alignment score increases.