EFI - Enzyme Similarity Tool

A sequence similarity network (SSN) allows for visualization of relationships among protein sequences. In SSNs, the most related proteins are grouped together in clusters. The Enzyme Similarity Tool (EFI-EST) makes it possible to easily generate SSNs. Cytoscape is used to explore SSNs.

The Color SSNs and Cluster Analysis tabs are now included on the SSN Utilities tab.
Neighborhood Connectivity (NC) is a new tool on the SSN Utilities tab. NC colors the input SSN according to the number of internode connections. NC coloring helps identify families in SSNs generated with low alignment scores.
The EST database has been updated to use UniProt 2020_04 and InterPro 81.

A listing of new features and other information pertaining to EST is available on the release notes page.

InterProScan sequence search can be used to find matches within the InterPro database for a given sequence.

Information on Pfam families and clans and InterPro family sizes is available on the Family Information page.

Generate a SSN for a single protein and its closest homologues in the UniProt database.

The input sequence is used as the query for a BLAST search of the UniProt database, and sequences in UniProt that are similar to the query are retrieved. An all-by-all BLAST is then performed to obtain pairwise sequence similarities, from which the edge values of the SSN are calculated.

Query Sequence:
Input a single protein sequence only. The default maximum number of retrieved sequences is 1,000.

BLAST Retrieval Options

UniProt BLAST query e-value: Negative log of e-value for retrieving similar sequences (≥ 1; default: 5)
Input an alternative e-value for BLAST to retrieve sequences from the UniProt database. We suggest using a larger e-value (smaller negative log) for retrieving homologues if the query sequence is short and a smaller e-value (larger negative log) if there is no need to retrieve divergent homologues.
Maximum number of sequences retrieved: (≤ 10,000, default: 1,000)
Sequence database: (UniProt, UniRef90, or UniRef50; default UniProt)
Select the sequence database to BLAST against.
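
The negative-log convention used by the e-value fields can be illustrated with a minimal sketch. It assumes the logarithm is base 10, which the page does not state explicitly; the function name is hypothetical and is not part of EFI-EST.

```python
def evalue_from_neg_log(neg_log: int) -> float:
    """Convert the form value N into an e-value of 10**-N.

    Assumption: "negative log" means -log10(e-value).
    """
    if neg_log < 1:
        raise ValueError("the negative log of the e-value must be >= 1")
    return 10.0 ** -neg_log


default_cutoff = evalue_from_neg_log(5)      # the default setting (N = 5)
permissive_cutoff = evalue_from_neg_log(2)   # larger e-value, e.g. for a short query
strict_cutoff = evalue_from_neg_log(20)      # smaller e-value, fewer divergent hits

print(default_cutoff, permissive_cutoff, strict_cutoff)
```

A smaller N therefore means a larger, more permissive e-value, which is why it is suggested for short query sequences.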

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate a SSN for a protein family.

The sequences from the input Pfam families, InterPro families, and/or Pfam clans (superfamilies) are retrieved. An all-by-all BLAST is then performed to obtain pairwise sequence similarities, from which the edge values of the SSN are calculated.

Pfam and/or InterPro Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
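
The family input format described above can be checked with a short sketch like the one below. The patterns follow the stated digit counts; treating identifiers as uppercase-only is an assumption, and `parse_families` is a hypothetical helper, not the validation EFI-EST itself performs.

```python
import re

# Patterns taken from the stated formats: PFxxxxx, IPRxxxxxx, CLxxxx.
FAMILY_PATTERNS = {
    "Pfam family": re.compile(r"PF\d{5}"),
    "InterPro family": re.compile(r"IPR\d{6}"),
    "Pfam clan": re.compile(r"CL\d{4}"),
}


def parse_families(text):
    """Split a comma/space separated list and classify each identifier."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    parsed = []
    for token in tokens:
        for kind, pattern in FAMILY_PATTERNS.items():
            if pattern.fullmatch(token):
                parsed.append((token, kind))
                break
        else:
            # No pattern matched this token.
            raise ValueError(f"unrecognized family identifier: {token!r}")
    return parsed


# The identifiers below only illustrate the three formats.
print(parse_families("PF00069, IPR000719 CL0016"))
```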

Family Domain Boundary Option

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Region:
N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.
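
The three Region choices can be pictured as slices of the full sequence. The sketch below assumes 1-based, inclusive domain coordinates, as is common in domain annotations; the function and argument names are hypothetical.

```python
def select_region(sequence, domain_start, domain_end, region):
    """Return the part of `sequence` selected by the Region option.

    Assumption: domain_start and domain_end are 1-based and inclusive.
    """
    if not 1 <= domain_start <= domain_end <= len(sequence):
        raise ValueError("domain boundaries fall outside the sequence")
    if region == "domain":
        return sequence[domain_start - 1:domain_end]
    if region == "n-terminal":
        # Everything before the first residue of the domain.
        return sequence[:domain_start - 1]
    if region == "c-terminal":
        # Everything after the last residue of the domain.
        return sequence[domain_end:]
    raise ValueError(f"unknown region: {region!r}")


toy_sequence = "MKTAYIAKQR"  # made-up sequence with a domain at residues 4-7
for choice in ("n-terminal", "domain", "c-terminal"):
    print(choice, select_region(toy_sequence, 4, 7, choice))
```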

Protein Family Option

Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.
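
The Fraction option can be sketched as follows: keep every Nth sequence and always keep Swiss-Prot entries. Starting the count at the first sequence is an assumption, and `sample_family` is a hypothetical name; the actual EFI-EST selection may differ in detail.

```python
def sample_family(ids, swissprot_ids, fraction):
    """Keep every `fraction`-th ID plus every ID with a Swiss-Prot annotation."""
    if fraction < 1:
        raise ValueError("fraction must be >= 1")
    kept = []
    for index, seq_id in enumerate(ids):
        # Assumption: sampling starts at the first sequence (index 0).
        if index % fraction == 0 or seq_id in swissprot_ids:
            kept.append(seq_id)
    return kept


family = [f"ID{i}" for i in range(10)]   # made-up identifiers
reviewed = {"ID4"}                       # pretend this one is in Swiss-Prot
print(sample_family(family, reviewed, 3))
```

With a fraction of 3, four of the ten sequences are sampled and the Swiss-Prot entry is added on top, which is why the resulting set can be slightly larger than the requested fraction.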

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate a SSN from provided sequences.

An all-by-all BLAST is performed to obtain pairwise sequence similarities, from which the edge values of the SSN are calculated.

Input a list of protein sequences in FASTA format or upload a FASTA-formatted sequence file.

Sequences:

When selected, recognized UniProt or Genbank identifiers from FASTA headers are used to retrieve node attributes from the UniProt database.
FASTA File:

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Generate a SSN from a list of UniProt, UniRef, NCBI, or Genbank IDs.

An all-by-all BLAST is performed to obtain pairwise sequence similarities, from which the edge values of the SSN are calculated.

Input a list of UniProt, NCBI, or Genbank (protein) accession IDs, or upload a text file.

Accession IDs:
Accession ID File:

Input a list of UniRef50 or UniRef90 cluster accession IDs, or upload a text file.

Accession IDs:
Accession ID File:
Input accession IDs are:

Family Domain Boundary Options

Pfam and InterPro databases define domain boundaries for members of their families.
Domain:
Family: Use domain boundaries from the specified family (enter only one family).
Region:
A specified InterPro family must be defined by a single database. N-terminal will select the portion of the sequence that is N-terminal to the specified domain to generate the SSN. C-terminal will select the portion of the sequence that is C-terminal to the specified domain to generate the SSN. Domain will use the specified domain.

Protein Family Addition Options

Add sequences belonging to Pfam and/or InterPro families to the sequences used to generate the SSN.
Families:
The input format is a single family or comma/space separated list of families. Families should be specified as PFxxxxx (five digits), IPRxxxxxx (six digits) or CLxxxx (four digits) for Pfam clans.
The EST provides access to the UniRef90 and UniRef50 databases to allow the creation of SSNs for very large Pfam and/or InterPro families. For families that contain more than 25,000 sequences, the SSN will be generated using the UniRef50 or UniRef90 databases. In UniRef90, sequences that share ≥90% sequence identity over 80% of the sequence length are grouped together and represented by a sequence known as the cluster ID. UniRef50 is similar except that the sequence identity is ≥50%. If one of the UniRef databases is used, the output SSN is equivalent to a 90% (for UniRef90) or 50% (for UniRef50) Representative Node Network with each node corresponding to a UniRef cluster ID; in this case an additional node attribute is provided which lists all of the sequences represented by the UniRef node.
Fraction: Reduce the number of sequences used to a fraction of the full family size (≥ 1; default: 1)
Selects every Nth sequence in the family; the sequences are assumed to be added randomly to UniProt, so the selected sequences are assumed to be a representative sampling of the family. This allows reduction of the size of the SSN. Sequences in the family with Swiss-Prot annotations will always be included; this may result in the size of the resulting data set being slightly larger than the fraction specified.

SSN Edge Calculation Option

E-Value: Negative log of e-value for all-by-all BLAST (≥1; default 5)
Input an alternative e-value for BLAST to calculate similarities between sequences defining edge values. Default parameters are permissive and are used to obtain edges even between sequences that share low similarities. We suggest using a larger e-value (smaller negative log) for short sequences.

Fragment Option

Fragments:
The UniProt database designates a sequence as a fragment if it is translated from a gene missing a start and/or a stop codon (Sequence Status). Fragments are included in the SSNs by default; checking this box will exclude fragmented sequences from computations. This results in an approximately 10% smaller SSN.
Job name: (required)
E-mail address:

You will be notified by e-mail when your submission has been processed.

Clusters in the submitted SSN are identified, numbered and colored. Summary tables, sets of IDs and sequences per cluster are provided.

The clusters are numbered and colored using two conventions: 1) Sequence Count Cluster Number assigned in order of decreasing number of UniProt IDs in the cluster; 2) Node Count Cluster Number assigned in order of decreasing number of nodes in the cluster.
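
The two numbering conventions can differ when a node stands for several UniProt IDs (for example, a UniRef node). The sketch below is illustrative only: the input layout, the tie-breaking by label, and the function name are assumptions.

```python
def number_clusters(clusters):
    """Assign cluster numbers under both conventions.

    `clusters` maps an arbitrary cluster label to a list of nodes; each node
    is the list of UniProt IDs it represents.
    Returns (sequence_count_numbers, node_count_numbers).
    """
    seq_counts = {label: sum(len(node) for node in nodes)
                  for label, nodes in clusters.items()}
    node_counts = {label: len(nodes) for label, nodes in clusters.items()}

    def rank(counts):
        # Largest count gets number 1; ties are broken by label (assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return {label: number for number, label in enumerate(ordered, start=1)}

    return rank(seq_counts), rank(node_counts)


toy = {
    "a": [["P1", "P2", "P3", "P4", "P5"]],   # 1 node representing 5 UniProt IDs
    "b": [["Q1"], ["Q2"], ["Q3"]],           # 3 nodes, 1 UniProt ID each
}
by_sequences, by_nodes = number_clusters(toy)
print(by_sequences, by_nodes)
```

In this toy input, cluster "a" is number 1 by sequence count, while cluster "b" is number 1 by node count.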

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or zip-compressed XGMML).
E-mail address:

You will be notified by e-mail when your submission has been processed.

As with the Color SSN utility, clusters in the submitted SSN are identified, numbered and colored.

The SSN clusters are numbered and colored using two conventions: Sequence Count Cluster Numbers are assigned in order of decreasing number of UniProt IDs in the cluster; Node Count Cluster Numbers are assigned in order of decreasing number of nodes in the cluster.

The convergence ratio for each cluster is also calculated. The convergence ratio is the ratio of the number of edges in a cluster to the number of sequence pairs in that cluster. The value decreases from 1.0 for a cluster of very similar sequences (same function?) to <<1.0 for clusters with distantly related sequences (different functions?).
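
As a worked illustration of the convergence ratio, the sketch below assumes that the number of sequence pairs in a cluster of n sequences is n(n-1)/2, that is, every unordered pair is counted once. The function name is hypothetical.

```python
def convergence_ratio(sequence_count, edge_count):
    """Edges present in a cluster divided by the number of possible pairs."""
    pairs = sequence_count * (sequence_count - 1) // 2
    if pairs == 0:
        raise ValueError("a cluster needs at least two sequences")
    return edge_count / pairs


# Four sequences, all six possible pairs connected: ratio of 1.0.
print(convergence_ratio(4, 6))
# Five sequences joined by only four edges out of ten possible pairs.
print(convergence_ratio(5, 4))
```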

Multiple sequence alignments (MSAs), WebLogos, hidden Markov models (HMMs), length histograms, and consensus residues are computed for each cluster.

Options are available in the tabs below to select the desired analyses:

The WebLogos tab provides the WebLogo and MSA for the node IDs in each cluster containing greater than the "Minimum Node Count" specified in the Sequence Filter tab.

The Consensus Residues tab provides a tab-delimited text file with the number of conserved residues and their MSA positions for each specified residue in each cluster containing greater than the "Minimum Node Count". The default residue is "C"; the percent identity levels displayed range from 90% to 10% in intervals of 10%, and a residue is counted as "conserved" if it occurs with ≥80% identity.

The HMMs tab provides the HMM for each cluster containing greater than the specified "Minimum Node Count".

The Length Histograms tab provides length histograms for each cluster containing greater than the specified "Minimum Node Count".

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or zip-compressed XGMML).

Sequence Filter

The MSA is generated with MUSCLE using the node IDs. Clusters containing fewer nodes than the Minimum Node Count are excluded from the analyses. Because MUSCLE can fail with a "large" number of sequences (the limit varies, anywhere from >750 to 1500), the Maximum Node Count parameter can be used to limit the number of sequences that MUSCLE uses.
Minimum Node Count:
Maximum Node Count:
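
The effect of the two node-count parameters can be sketched as a simple filter. How sequences are chosen when a cluster exceeds the maximum is not stated on this page; taking the first `max_nodes` entries is an assumption made only for illustration, and the function name is hypothetical.

```python
def filter_clusters(clusters, min_nodes, max_nodes):
    """Drop small clusters and cap the number of node IDs passed to the MSA.

    `clusters` maps a cluster number to its list of node IDs.
    """
    selected = {}
    for number, node_ids in clusters.items():
        if len(node_ids) < min_nodes:
            continue  # too small: excluded from the analyses
        # Assumption: keep the first max_nodes IDs of an oversized cluster.
        selected[number] = node_ids[:max_nodes]
    return selected


toy = {1: ["A", "B", "C", "D", "E"], 2: ["F", "G"]}
print(filter_clusters(toy, min_nodes=3, max_nodes=4))
```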

WebLogos

A MSA for the (length-filtered) node IDs is generated using MUSCLE; the WebLogo is generated with the http://weblogo.threeplusone.com code.
Make Weblogo:

Consensus Residues

The positions and selected percent identities of the selected residues in the MSA are determined.
Compute Consensus Residues:
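
One way to picture the consensus-residue computation is to scan each MSA column and record where the chosen residue reaches each percent identity level. This is a sketch under stated assumptions: gap characters count toward the column total, positions are 1-based, and the function name is hypothetical. It is not the EFI-EST implementation.

```python
def conserved_positions(msa, residue="C", thresholds=range(90, 0, -10)):
    """Map each percent identity level to the MSA columns that reach it.

    `msa` is a list of aligned sequences of equal length.
    """
    if not msa or len({len(row) for row in msa}) != 1:
        raise ValueError("the MSA must contain rows of equal length")
    total = len(msa)
    result = {level: [] for level in thresholds}
    for column in range(len(msa[0])):
        count = sum(1 for row in msa if row[column] == residue)
        percent = 100.0 * count / total  # gaps are included in the total
        for level in result:
            if percent >= level:
                result[level].append(column + 1)  # 1-based MSA position
    return result


toy_msa = ["ACDC", "ACEC", "GCDC", "ACD-"]  # made-up alignment
for level, positions in conserved_positions(toy_msa).items():
    print(level, positions)
```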

HMMs

The MSA for the (length-filtered) node IDs is used to generate the HMM with hmmbuild from HMMER3 (http://hmmer.org).
Make HMMs:

Length Histograms

Length histograms are generated for the node IDs (where applicable: UniProt, UniRef90, and UniRef50 IDs).
Make Length Histograms:
E-mail address:

You will be notified by e-mail when your submission has been processed.

Nodes in the submitted SSN are colored according to neighborhood connectivity (number of edges to other nodes).

The nodes for unresolved families can be difficult to identify in SSNs generated with low alignment scores. Coloring the nodes according to the number of edges to other nodes (Neighborhood Connectivity, NC) helps identify families with highly connected nodes (https://www.biorxiv.org/content/10.1101/2020.04.16.045138v1.full). Using Neighborhood Connectivity Coloring as a guide, the alignment score threshold can be chosen in Cytoscape to separate the SSN into families.
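
Following the description above, in which neighborhood connectivity is the number of edges from a node to other nodes, the per-node value used for coloring can be sketched from an edge list. The sketch assumes each edge is listed once and ignores self-loops; the function name is hypothetical.

```python
from collections import Counter


def edge_counts(edges):
    """Count, for every node, the edges connecting it to other nodes."""
    counts = Counter()
    for source, target in edges:
        if source == target:
            continue  # a self-loop is not an edge to another node
        counts[source] += 1
        counts[target] += 1
    return counts


toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]  # made-up network
for node, count in sorted(edge_counts(toy_edges).items()):
    print(node, count)
```

Nodes with high counts, such as "C" here, would be colored as the most highly connected.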

SSN File:
A Cytoscape-edited SSN can serve as input. The accepted format is XGMML (or zip-compressed XGMML).
E-mail address:

You will be notified by e-mail when your submission has been processed.

UniProt Version: 2020_04
InterPro Version: 81

If you use the EFI web tools, please cite us.

Click here to contact us for help, reporting issues, or suggestions.