EFI - Enzyme Similarity Tool

Dataset Completed

Submission Name: IPR004184_IP74_UniProt

A minimum sequence similarity threshold that specifies the sequence pairs connected by edges is needed to generate the SSN. This threshold also determines the segregation of proteins into clusters. The threshold is applied to the edges in the SSN using the alignment score, an edge node attribute that is a measure of the similarity between sequence pairs.
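How a threshold both selects edges and segregates proteins into clusters can be illustrated with a small sketch. It is not the EFI implementation: the function name, the toy node labels, and the scores are hypothetical, and treating the threshold as inclusive (score >= threshold) is an assumption.

```python
from collections import defaultdict

def cluster_by_threshold(nodes, scored_edges, threshold):
    """Keep edges whose score is >= threshold (assumed inclusive) and
    return the resulting connected components, largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_edges:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))

# Hypothetical toy data: four sequences, three scored pairs.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 80), ("B", "C", 30), ("C", "D", 75)]
low = cluster_by_threshold(nodes, edges, 25)   # one cluster
high = cluster_by_threshold(nodes, edges, 50)  # two clusters
```

Raising the threshold from 25 to 50 drops the weak B-C edge, so the single cluster splits into two.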

The parameters for generating the initial dataset are summarized in the table.

Job Number: 29535
Time Started -- Finished: 6/20 03:10 PM -- 6/20 09:20 PM
Database Version: UniProt: 2019-04 / InterPro: 74
Input Option: Families (Option B)
Job Name: IPR004184_IP74_UniProt
Pfam / InterPro Family: IPR004184
Number of IDs in Pfam / InterPro Family: 20,232
Fraction Option: off
Domain Option: off
Exclude Fragments: No
Total Number of Sequences in Dataset: 20,232
Total Number of Edges: 166,127,802
Convergence Ratio: 0.812

This tab provides histograms and box plots with statistics about the sequences in the input dataset as well as the BLAST all-by-all pairwise comparisons that were computed.

The descriptions for the histograms and plots guide the choice of the values for the "Alignment Score Threshold" and the Minimum and Maximum "Sequence Length Restrictions" that are applied to the sequences and edges to generate the SSN. These values are entered using the "SSN Finalization" tab on this page.

Sequences as a Function of Full Length Histogram (First Step for Alignment Score Threshold Selection)

This histogram describes the length distribution for all sequences (UniProt IDs) in the input dataset; the sequences in this histogram are used to calculate the edges.

Inspection of the histogram permits identification of fragments, single domain proteins, and multidomain fusion proteins. The dataset can be length-filtered using the Minimum and Maximum "Sequence Length Restrictions" in the "SSN Finalization" tab to remove fragments, select single domain proteins, or select multidomain fusion proteins.
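The effect of the length restrictions can be sketched as a simple filter. This is an illustration only: the function name and the sequence IDs are hypothetical, the defaults mirror the form defaults shown on this page (0 and 50000), and inclusive bounds are an assumption.

```python
def filter_by_length(lengths, minimum=0, maximum=50000):
    """Return only the sequences whose length lies within
    [minimum, maximum] (bounds assumed inclusive)."""
    return {seq_id: n for seq_id, n in lengths.items()
            if minimum <= n <= maximum}

# Hypothetical lengths: a fragment, two single domain proteins, a fusion.
lengths = {"seq1": 95, "seq2": 740, "seq3": 765, "seq4": 1480}
single_domain = filter_by_length(lengths, minimum=600, maximum=900)
```

Here the 95-residue fragment and the 1480-residue fusion are excluded, leaving only the two sequences near the main peak of the histogram.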

Alignment Length vs Alignment Score Box Plot (Second Step for Alignment Score Threshold Selection)

This box plot describes the relationship between the query-subject alignment lengths used by BLAST (y-axis) and the alignment scores calculated from them (x-axis).

This plot shows a monophasic increase in alignment length to a constant value for single domain proteins; this plot shows multiphasic increases in alignment length for datasets with multidomain proteins (one phase for each fusion length). The value of the "Alignment Score Threshold" for generating the SSN (entered in the "SSN Finalization" tab) should be selected (from the "Percent Identity vs Alignment Score Box Plot"; next box plot) at an alignment length ≥ the minimum length of single domain proteins in the dataset (determined by inspection of the "Sequences as a Function of Full-Length Histogram"; previous histogram). In that region, the "Alignment Length" should be independent of the "Alignment Score".

Percent Identity vs Alignment Score Box Plot (Third Step for Alignment Score Threshold Selection)

This box plot describes the pairwise percent sequence identity as a function of alignment score.

Complementing the "Alignment Length vs Alignment Score Box Plot" (previous box plot), this box plot shows a monophasic increase in sequence identity for single domain proteins or a multiphasic increase in sequence identity for datasets with multidomain proteins (one phase for each fusion length). Compared against the "Alignment Length vs Alignment Score Box Plot", a monophasic increase in sequence identity occurs as the alignment score increases at a constant alignment length; multiphasic increases occur as the alignment score increases at additional, longer constant alignment lengths.

For the initial SSN, we recommend that an alignment score corresponding to 35 to 40% pairwise identity be entered in the "SSN Finalization" tab (for the first phase in multiphasic plots).
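If the per-score percent identity values behind the box plot are available, the recommendation can be approximated programmatically. The sketch below is hypothetical: it assumes a mapping from alignment score to a list of pairwise percent identities, and it uses the median of each box as the criterion, which is a choice made here rather than something the page specifies.

```python
from statistics import median

def suggest_score(identity_by_score, target_low=35.0, target_high=40.0):
    """Return the lowest alignment score whose median pairwise identity
    falls inside [target_low, target_high], or None if there is none."""
    for score in sorted(identity_by_score):
        if target_low <= median(identity_by_score[score]) <= target_high:
            return score
    return None

# Hypothetical percent identity values for five alignment scores.
identity_by_score = {
    10: [25, 28, 30],
    20: [30, 33, 34],
    30: [35, 36, 38],
    40: [39, 41, 44],
    50: [45, 50, 55],
}
suggested = suggest_score(identity_by_score)
```

For multiphasic plots the same idea would be applied only to the scores belonging to the first phase.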

Edge Count vs Alignment Score Plot (Preview of Full SSN Size)

This plot shows the number of edges in the full SSN for the input dataset (one node for each sequence) as a function of alignment score. By moving the cursor over the plot, the number of edges for each alignment score is displayed.
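The quantity plotted can be thought of as a cumulative count: the edges that survive a given threshold are those with an alignment score at or above it. A minimal sketch, assuming a hypothetical mapping from alignment score to the number of edges computed at that score and an inclusive threshold:

```python
def edges_in_full_ssn(edges_per_score, threshold):
    """Count the edges whose alignment score is >= threshold
    (inclusive threshold is an assumption)."""
    return sum(count for score, count in edges_per_score.items()
               if score >= threshold)

# Hypothetical per-score edge counts.
edges_per_score = {10: 5_000_000, 20: 1_200_000, 30: 400_000, 40: 90_000}
at_20 = edges_in_full_ssn(edges_per_score, 20)
at_30 = edges_in_full_ssn(edges_per_score, 30)
```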

This plot helps determine if the full SSN generated using the initial alignment score can be opened with Cytoscape on the user’s computer. As a rough guide, SSNs with ~2M edges can be opened with 16GB RAM, ~4M edges with 32GB RAM, ~8M edges with 64GB RAM, ~15M edges with 128GB RAM, and ~30M edges with 256GB RAM.
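The rough guide above can be expressed as a lookup. The figures come from the preceding paragraph; treating each approximate edge count as a hard upper bound, and the function and constant names, are assumptions made for this sketch.

```python
# (approximate maximum edges, RAM in GB), taken from the rough guide above.
RAM_GUIDE = [
    (2_000_000, 16),
    (4_000_000, 32),
    (8_000_000, 64),
    (15_000_000, 128),
    (30_000_000, 256),
]

def estimate_ram_gb(edge_count):
    """Return the smallest RAM tier in the guide that covers edge_count,
    or None if the network exceeds the largest tier."""
    for max_edges, ram_gb in RAM_GUIDE:
        if edge_count <= max_edges:
            return ram_gb
    return None

medium = estimate_ram_gb(3_000_000)
too_large = estimate_ram_gb(166_127_802)  # total edges in this dataset
```

A result of None suggests that a higher alignment score or a rep node SSN is needed before the network can be opened.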

If the number of edges for the full SSN is too large to be opened, a representative node (rep node) SSN can be opened. In a rep node SSN, sequences are grouped into metanodes based on pairwise sequence identity (from 40 to 100% identity, in 5% intervals). The download tables on the "Download Network Files" page provide the numbers of metanodes and edges in rep node SSNs. The rep node SSNs are lower resolution than full SSNs; clusters of interest in rep node SSNs can be expanded to provide the full SSNs.

Edges as a Function of Alignment Score Histogram (Preview of SSN Diversity)

This histogram describes the number of edges calculated at each alignment score. This plot is not used to select the alignment score for the initial SSN; however, it provides an overview of the functional diversity within the input dataset.

In the histogram, edges with low alignment scores typically are those between isofunctional clusters; edges with large alignment scores typically are those connecting nodes within isofunctional clusters.

The histogram for a dataset with a single isofunctional SSN cluster is a single distribution centered at a "large" alignment score; the histogram for a dataset with many isofunctional SSN clusters will be dominated by the edges that connect the clusters, with the number of edges decreasing as the alignment score increases.

Enter chosen Sequence Length Restriction and Alignment Score Threshold in the SSN Finalization tab.

This tab is used to specify the minimum "Alignment Score Threshold" (a measure of the minimum sequence similarity) for drawing the edges that connect the proteins (nodes) in the SSN. This tab is also used to specify the Minimum and Maximum "Sequence Length Restriction Options" that exclude fragments and/or domain architectures.

Alignment Score Threshold:

This value corresponds to the lower limit for which an edge will be present in the SSN. The alignment score is similar in magnitude to the negative base-10 logarithm of a BLAST e-value.
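To get a feel for the scale, a BLAST e-value can be converted to its negative base-10 logarithm. This is only the rough correspondence described above, not the formula the tool uses to compute alignment scores; the function name is hypothetical.

```python
import math

def approx_alignment_score(e_value):
    """Return -log10(e_value), a rough stand-in for the magnitude of
    the alignment score. Not the tool's actual calculation."""
    if e_value <= 0:
        raise ValueError("e-value must be positive to take a logarithm")
    return -math.log10(e_value)

weak = approx_alignment_score(1e-5)
strong = approx_alignment_score(1e-50)
```

An e-value near 1e-50 therefore corresponds to an alignment score of about 50 on this scale.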

Sequence Length Restriction Options

Allows restriction of sequences in the generated SSN based on their length.

Minimum: (default: 0)
Maximum: (default: 50000)

Neighborhood Connectivity Option

Neighborhood Connectivity:

The nodes for unresolved families can be difficult to identify in SSNs generated with low alignment scores. Coloring the nodes according to the number of edges to other nodes (Neighborhood Connectivity, NC) helps identify families with highly connected nodes (https://www.biorxiv.org/content/10.1101/2020.04.16.045138v1.full). Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold can be chosen in Cytoscape to separate the SSN into families.

Network name: This name will be displayed in Cytoscape.

You will be notified by e-mail when the SSN is ready for download.

A list of the SSNs generated from this initial dataset.

Portions of these data are derived from the Universal Protein Resource (UniProt) databases.

If you use the EFI web tools, please cite us.
