EFI - Enzyme Similarity Tool

This web resource is supported by a Research Resource from the National Institute of General Medical Sciences (R24GM141196-01).
The tools are available without charge or license to both academic and commercial users.

Dataset Completed

Submission Name: IPR004184_IP74_UniRef50

Generating the SSN requires a minimum sequence similarity threshold that specifies which sequence pairs are connected by edges. This threshold also determines how proteins segregate into clusters. The threshold is applied to the edges in the SSN using the alignment score, an edge attribute that measures the similarity between a pair of sequences.
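The effect of the threshold on clustering can be illustrated with a minimal sketch. It assumes an edge is kept when its alignment score is greater than or equal to the threshold and that clusters are the connected components of the remaining graph; the function name and the toy data are hypothetical and are not part of the tool.

```python
from collections import defaultdict


def clusters_at_threshold(nodes, edges, threshold):
    """Return clusters (connected components) using only edges whose
    alignment score is >= threshold (assumed comparison)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical toy dataset: (node, node, alignment score).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 80), ("B", "C", 30), ("C", "D", 75)]

low = clusters_at_threshold(nodes, edges, 20)   # every edge kept
high = clusters_at_threshold(nodes, edges, 50)  # the weak B-C edge is dropped
```

In this toy case, raising the threshold removes the single low-scoring edge and splits one cluster into two, which is the segregation behavior described above.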

The parameters for generating the initial dataset are summarized in the table.

Job Number: 29537
Database Version: UniProt: 2019-04 / InterPro: 74
Input Option: Families (Option B)
Job Name: IPR004184_IP74_UniRef50
E-Value for SSN Edge Calculation: 5
Pfam / InterPro Family: IPR004184
Number of IDs in Pfam / InterPro Family: 20,232
Domain Option: off
UniRef Version: 50
Number of Cluster IDs in UniRef50 Family: 1,365
Exclude Fragments: No
Total Number of Sequences in Dataset: 1,365
Total Number of Edges: 449,488
Convergence Ratio: 0.483
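
The page does not define the convergence ratio. As a worked check, the sketch below assumes it is the number of edges divided by the number of possible sequence pairs, n(n-1)/2; under that assumption the values in the table reproduce the reported ratio. The function name is hypothetical.

```python
def convergence_ratio(num_sequences, num_edges):
    """Edges retrieved divided by all possible sequence pairs
    (assumed definition, not stated on the page)."""
    possible_pairs = num_sequences * (num_sequences - 1) // 2
    return num_edges / possible_pairs


# Values taken from the parameter table above.
ratio = convergence_ratio(1365, 449488)
print(round(ratio, 3))
```

With 1,365 sequences there are 930,930 possible pairs, so 449,488 edges correspond to a ratio of about 0.483.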

This tab provides histograms and box plots with statistics about the sequences in the input dataset as well as the BLAST all-by-all pairwise comparisons that were computed.

The descriptions for the histograms and plots guide the choice of the values for the "Alignment Score Threshold" and the Minimum and Maximum "Sequence Length Restrictions" that are applied to the sequences and edges to generate the SSN.

Sequences as a Function of Full Length Histogram (First Step for Alignment Score Threshold Selection)

This histogram describes the length distribution for all sequences (UniProt IDs) in the input dataset.

Inspection of the histogram permits identification of fragments, single domain proteins, and multidomain fusion proteins. This histogram is used to select Minimum and Maximum "Sequence Length Restrictions" in the "SSN Finalization" tab to remove fragments, select only single domain proteins, or select multidomain proteins. The sequences in the "Sequences as a Function of Full Length Histogram (UniRef50 Cluster IDs)" (last histogram) are used to calculate the edges.
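
A minimal sketch of how Minimum and Maximum length restrictions act on a dataset is shown here. The function name, the sequence IDs, and the cutoff values are hypothetical; actual cutoffs should come from inspecting the histogram.

```python
def filter_by_length(lengths, min_len=None, max_len=None):
    """Keep sequences whose length lies within [min_len, max_len].
    A bound of None means that side is unrestricted."""
    kept = {}
    for seq_id, length in lengths.items():
        if min_len is not None and length < min_len:
            continue  # shorter than the minimum, likely a fragment
        if max_len is not None and length > max_len:
            continue  # longer than the maximum, likely a multidomain fusion
        kept[seq_id] = length
    return kept


# Hypothetical lengths for three sequences.
lengths = {"fragment": 80, "single_domain": 310, "fusion": 720}

single_only = filter_by_length(lengths, min_len=200, max_len=400)
no_fragments = filter_by_length(lengths, min_len=200)
```

Setting only the minimum removes fragments while retaining fusion proteins; setting both bounds isolates the single domain proteins.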

Alignment Length vs Alignment Score Box Plot (Second Step for Alignment Score Threshold Selection)

This box plot describes the relationship between the query-subject alignment lengths used by BLAST (y-axis) to calculate the alignment scores (x-axis).

This plot shows a monophasic increase in alignment length to a constant value for single domain proteins; this plot shows multiphasic increases in alignment length for datasets with multidomain proteins (one phase for each fusion length). The value of the "Alignment Score Threshold" for generating the SSN (entered in the "SSN Finalization" tab) should be selected (from the "Percent Identity vs Alignment Score Box Plot"; next box plot) at an alignment length ≥ the minimum length of single domain proteins in the dataset (determined by inspection of the "Sequences as a Function of Full-Length Histogram"; previous histogram). In that region, the "Alignment Length" should be independent of the "Alignment Score".

Percent Identity vs Alignment Score Box Plot (Third Step for Alignment Score Threshold Selection)

This box plot describes the pairwise percent sequence identity as a function of alignment score.

Complementing the "Alignment Length vs Alignment Score Box Plot" (previous box plot), this box plot describes a monophasic increase in sequence identity for single domain proteins or a multiphasic increase in sequence identity for datasets with multidomain proteins (one phase for each fusion length). In the "Alignment Length vs Alignment Score" box plot (previous box plot), a monophasic increase in sequence identity occurs as the alignment score increases at a constant alignment length; multiphasic increases occur as the alignment score increases at additional longer constant alignment lengths.

For the initial SSN, we recommend that an alignment score corresponding to 35 to 40% pairwise identity be entered in the "SSN Finalization" tab (for the first phase in multiphasic plots).
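
The recommendation above can be sketched as a lookup over the box plot data. This is an illustration only: it assumes the per-pair (alignment score, percent identity) values are available and uses the median identity at each score as the summary statistic; the function name and the toy values are hypothetical.

```python
from statistics import median


def score_for_identity(pairs, target_identity=35.0):
    """Return the lowest alignment score whose median percent identity
    is >= target_identity, or None if no score reaches it."""
    by_score = {}
    for score, identity in pairs:
        by_score.setdefault(score, []).append(identity)
    for score in sorted(by_score):
        if median(by_score[score]) >= target_identity:
            return score
    return None


# Hypothetical (alignment score, percent identity) pairs.
pairs = [
    (10, 22.0), (10, 25.0),
    (20, 30.0), (20, 33.0),
    (30, 36.0), (30, 38.0),
    (40, 45.0),
]

initial_score = score_for_identity(pairs, target_identity=35.0)
```

For multiphasic plots, the same lookup would be restricted to the scores in the first phase.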

Edge Count vs Alignment Score Plot (Preview of Full SSN Size)

This plot shows the number of edges in the full SSN for the input dataset (one node for each sequence) as a function of alignment score. Moving the cursor over the plot displays the number of edges for each alignment score.
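
The relationship plotted here can be sketched as a cumulative count, assuming an edge is included in the full SSN when its alignment score is greater than or equal to the chosen threshold. The function name and the toy scores are hypothetical.

```python
from collections import Counter


def edges_at_or_above(scores):
    """Map each observed alignment score to the number of edges
    whose score is >= that value (assumed inclusion rule)."""
    counts = Counter(scores)
    remaining = len(scores)
    result = {}
    for score in sorted(counts):
        result[score] = remaining
        remaining -= counts[score]
    return result


# Hypothetical alignment scores, one per edge.
scores = [5, 5, 10, 20, 20, 20]
edge_counts = edges_at_or_above(scores)
```

The count can only decrease as the alignment score increases, so a higher threshold always yields a smaller or equal full SSN.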

This plot helps determine if the full SSN generated using the initial alignment score can be opened with Cytoscape on the user’s computer. As a rough guide, SSNs with ~2M edges can be opened with 16GB RAM, ~4M edges with 32GB RAM, ~8M edges with 64GB RAM, ~15M edges with 128GB RAM, and ~30M edges with 256GB RAM.
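
The rough guide above can be expressed as a small lookup. The edge limits and RAM sizes are the approximate values quoted in the text; the function name is hypothetical, and the result is an estimate rather than a guarantee.

```python
# (approximate maximum edges, RAM in GB), from the rough guide above.
RAM_GUIDE = [
    (2_000_000, 16),
    (4_000_000, 32),
    (8_000_000, 64),
    (15_000_000, 128),
    (30_000_000, 256),
]


def suggested_ram_gb(edge_count):
    """Return the smallest RAM tier in the guide that covers edge_count,
    or None if the SSN is larger than the guide covers."""
    for max_edges, ram_gb in RAM_GUIDE:
        if edge_count <= max_edges:
            return ram_gb
    return None


# Total number of edges reported for this dataset.
ram_for_dataset = suggested_ram_gb(449_488)
```

A result of None suggests that a rep node SSN may be the more practical choice.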

If the number of edges for the full SSN is too large to be opened, a representative node (rep node) SSN can be opened. In a rep node SSN, sequences are grouped into metanodes based on pairwise sequence identity (from 40 to 100% identity, in 5% intervals). The download tables on the "Download Network Files" page provide the numbers of metanodes and edges in rep node SSNs. The rep node SSNs are lower resolution than full SSNs; clusters of interest in rep node SSNs can be expanded to provide the full SSNs.

Edges as a Function of Alignment Score Histogram (Preview of SSN Diversity)

This histogram describes the number of edges calculated at each alignment score. This plot is not used to select the alignment score for the initial SSN; however, it provides an overview of the functional diversity within the input dataset.

In the histogram, edges with low alignment scores typically are those between isofunctional clusters; edges with large alignment scores typically are those connecting nodes within isofunctional clusters.

The histogram for a dataset with a single isofunctional SSN cluster is a single distribution centered at a "large" alignment score; the histogram for a dataset with many isofunctional SSN clusters will be dominated by the edges that connect the clusters, with the number of edges decreasing as the alignment score increases.

Sequences as a Function of Full Length Histogram (UniRef50 Cluster IDs)

This histogram describes the length distribution of the full-length sequences for the UniRef cluster IDs in the input dataset. Because only the cluster ID sequences are displayed, the histogram does not accurately reflect the distribution of fragments, single domain proteins, and multidomain full-length proteins in the input dataset.

Portions of these data are derived from the Universal Protein Resource (UniProt) databases.
