DOMAINATION information



DOMAINATION description

DOMAINATION infers structural domains in your query sequence from alignments generated from PSI-BLAST results. In addition, DOMAINATION improves the detection of distant homologies compared with PSI-BLAST alone. A complete description of the DOMAINATION method has been published and is available as PDF. The method, input and output are briefly described below.

DOMAINATION steps

DOMAINATION consists of two main parts: PSI-BLAST searches to find similar sequences, and domain delineation based on the distribution of the sequences found over the query sequence. Details and parameters used are given below.
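To make "the distribution of the sequences found over the query sequence" concrete, the sketch below counts how many hit fragments cover each query position. It is only an illustration and not the CHOP algorithm: the function name, the (start, end) hit representation and the 1-based inclusive coordinates are assumptions made for this example.

```python
def coverage_profile(query_length, hits):
    """Count, for each query position, how many hit fragments cover it.

    hits is an iterable of (start, end) pairs in 1-based, inclusive
    query coordinates (an assumed representation, not DOMAINATION's own).
    """
    # Difference array: +1 where a hit starts, -1 just past where it ends.
    delta = [0] * (query_length + 2)
    for start, end in hits:
        if not (1 <= start <= end <= query_length):
            raise ValueError(f"hit ({start}, {end}) lies outside the query")
        delta[start] += 1
        delta[end + 1] -= 1

    # A running sum over the difference array gives the per-position coverage.
    profile = []
    running = 0
    for position in range(1, query_length + 1):
        running += delta[position]
        profile.append(running)
    return profile


if __name__ == "__main__":
    # Two hypothetical hits on a query of length 10.
    print(coverage_profile(10, [(1, 5), (3, 10)]))
```

Positions where many hit fragments start or end are the kind of signal a boundary-finding method could look at; how CHOP actually uses this information is described in the article.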

DOMAINATION starts by searching the NCBI non-redundant protein database with your query sequence using PSI-BLAST (with options -j4 -e0.001 -h0.0005). The resulting "hit" sequence fragments are filtered for low complexity using SEG: sequences with more than 15% low-complexity regions are discarded. Next, domains are delineated using CHOP (as described in the article). CHOP assigns domain boundaries based on the distribution of the hit sequences over the query sequence. The minimum domain segment size is 10 residues.

CHOP produces a file with multiple FASTA sequences. OBSTRUCT filters these sequences to find the largest subset within a range of 20% to 60% sequence identity, so all sequences with less than 20% or more than 60% sequence identity are removed. The remaining sequences are aligned with Praline, a multiple sequence alignment program, to generate a "true" multiple sequence alignment instead of a master-slave alignment. This alignment is supplied to the subsequent PSI-BLAST runs (also with options -j4 -e0.001 -h0.0005).

Figure 1 summarizes the steps. The maximum number of iterations is set to 10, although it is extremely unlikely that this number is reached. If PSI-BLAST finds a large number of hits for your query sequence, a DOMAINATION run can take over 15 minutes.
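The two thresholds mentioned above (15% low complexity, 20% to 60% sequence identity) can be written down as simple checks. This is a minimal sketch, not the code used by SEG or OBSTRUCT. It assumes that low-complexity residues have already been masked with a lowercase "x" and that a percent identity value has already been computed elsewhere; all function names are hypothetical.

```python
def low_complexity_fraction(masked_sequence, mask_char="x"):
    """Fraction of residues masked as low complexity.

    Assumes masking was done beforehand and that masked residues are
    written as mask_char (an assumption made for this sketch).
    """
    if not masked_sequence:
        return 0.0
    return masked_sequence.count(mask_char) / len(masked_sequence)


def passes_low_complexity_filter(masked_sequence, max_fraction=0.15):
    """Keep a hit unless more than 15% of it is low complexity."""
    return low_complexity_fraction(masked_sequence) <= max_fraction


def in_identity_range(percent_identity, low=20.0, high=60.0):
    """Keep a sequence whose identity is neither below 20% nor above 60%."""
    return low <= percent_identity <= high


if __name__ == "__main__":
    print(passes_low_complexity_filter("MKTAYIAKQR"))   # no masked residues
    print(passes_low_complexity_filter("MKxxxxLLAA"))   # 40% masked
    print(in_identity_range(35.0), in_identity_range(75.0))
```

Note that OBSTRUCT selects the largest subset of sequences within the identity range, which is more involved than the per-value check shown here.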

Figure 1: Flow diagram of DOMAINATION.

DOMAINATION input

You can supply a protein sequence in FASTA format via the text field or via file upload. Note that only a single sequence should be supplied. You can also choose to perform a sequence similarity search, which locally runs NCBI's PSI-BLAST and FastA's SSEARCH (producing high-quality Smith-Waterman alignments) on all sequences found by DOMAINATION. This provides an overview of the similarities between your query sequence and all sequences found during DOMAINATION's iterations.
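If you prepare input with a script, it may help to check beforehand that it holds exactly one FASTA record. The function below is a hypothetical helper written for this purpose; it is not part of DOMAINATION and does not reproduce the server's own validation.

```python
def read_single_fasta(text):
    """Return (header, sequence) for input holding exactly one FASTA record."""
    # Drop empty lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input must start with a FASTA header line ('>')")

    # Every header line marks one record, so there must be exactly one.
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError(f"expected exactly one sequence, found {len(headers)}")

    # Join the remaining lines into a single sequence string.
    sequence = "".join(lines[1:]).replace(" ", "")
    if not sequence:
        raise ValueError("the FASTA record contains no sequence")
    return lines[0][1:], sequence


if __name__ == "__main__":
    header, sequence = read_single_fasta(">query example\nMKTAYIAKQR\nQISFVKSHFS\n")
    print(header, len(sequence))
```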

DOMAINATION output

The output contains two parts: the domain layout and the sequences found by DOMAINATION. If PSI-BLAST did not find any hits, you will see an error message (PSI-BLAST did not find any hits for your query sequence!).

Domain layout

The overall domain layout and the domain layout per iteration are shown as a graphic. Mousing over a domain displays the domain boundaries (only on the output page). Clicking on a domain highlights its sequence in the original query FASTA sequence, and clicking on the highlighted domain extracts its sequence. A graph with annotated domain boundaries is also available. Each domain of a query sequence is sent as a query to further iterations. DOMAINATION works on FASTA sequences only. This means that you may find domains shorter than 10 residues, since iterations are combined in our output. If a query is not split further into domains, it drops out of the iterations and is absent from the tracks for higher iterations. For example, domain 3 in the graphic is not split in the second iteration and is therefore absent in the third iteration.
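For off-line work, the domain boundaries shown in the graphic can be used to cut the query sequence yourself. The sketch below assumes boundaries are given as 1-based, inclusive (start, end) pairs; that convention, the function name and the example values are assumptions for illustration, so check them against the boundaries reported for your own run.

```python
def extract_domains(sequence, boundaries):
    """Cut a query sequence into domain sequences.

    boundaries is a list of (start, end) pairs, assumed here to be
    1-based and inclusive.
    """
    domains = []
    for start, end in boundaries:
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError(f"boundary ({start}, {end}) lies outside the sequence")
        # Convert 1-based inclusive coordinates to a Python slice.
        domains.append(sequence[start - 1:end])
    return domains


if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVKSHFS"          # hypothetical query sequence
    for domain in extract_domains(query, [(1, 10), (11, 20)]):
        print(domain)
```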

Sequences found by DOMAINATION

This part reports the total number of sequences found by all DOMAINATION iterations together. A file with all sequences found by DOMAINATION can be downloaded (in FASTA format). If you have chosen to perform the sequence similarity search, you will find links here to the BLASTP and SSEARCH reports, both in HTML and plain text formats. The report in HTML format contains a graphic which shows the positions of the top 100 hits on your query sequence. If you mouse over a hit, its name, score and E-value appear both in the text box at the top of the graph and as a pop-up. These similarity searches are corrected for the size of the original non-redundant database. The command line options used are:

blastpgp -F T -z 1418577927 : low-complexity filter is on and the size of the original database is used to calculate the E-values

ssearch35 -z 11 -s BL62 -B -Z 4113801 : estimate the statistical parameters from shuffled copies of each library sequence (-z); use BLOSUM62 (-s); show normalized score as a z-score (-B) and use the number of sequences from the original database to calculate E-values (-Z).
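The effect of correcting for the size of the original database can be illustrated with a small calculation. The sketch assumes that, for a fixed query and alignment score, the E-value grows in proportion to the size of the database searched, which is the usual approximation and ignores finer corrections made by the programs themselves. The function name and the example numbers (other than the -Z value quoted above) are hypothetical.

```python
def rescale_evalue(evalue, searched_size, reference_size):
    """Rescale an E-value from the database actually searched to a
    reference database size, assuming linear scaling with size."""
    if searched_size <= 0 or reference_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * reference_size / searched_size


if __name__ == "__main__":
    # Hypothetical: a hit with E-value 1e-6 in a local set of 500 sequences,
    # expressed relative to the 4113801 sequences given with -Z above.
    print(rescale_evalue(1e-6, 500, 4113801))
```

Because the sequences found by DOMAINATION form a much smaller set than the original database, E-values computed without this correction would look far more significant than they are.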

For more information, see the explanation on BLAST and FASTA.

Data download

The multiple sequence alignments generated by DOMAINATION can be opened in Jalview. To facilitate additional off-line analysis, the generated data is available for download. This includes the domain sequences, multiple alignments and PSI-BLAST reports.

Example output

Complete example output is available here. The query protein was NP_543124.1, WD repeat domain 5 [Mus musculus].
