Scooby-domain (Sequence hydrophobicity predicts domains)
is a fast and simple method to identify globular domains in protein
sequence, based on the observed lengths and hydrophobicities of
domains from proteins with known tertiary structure. The prediction
method successfully identifies sequence regions that will form a
globular structure and those that are likely to be unstructured. The
method does not rely on homology searches and, therefore, can identify
previously unknown domains for structural elucidation. Scooby-domain is
available as a Java
applet (version 1.0). It may be used to visualise local properties within a
protein sequence, such as average hydrophobicity, secondary structure
propensity and domain boundaries, and it also serves as a method for fast
domain assignment of large sequence sets.
Distribution of domain size and average hydrophobicity
Figure 1a shows the 3D histogram of average
hydrophobicity and domain length for the CATH
domains. There are clear limits to the number of hydrophobic residues
within a domain, depending on its sequence length. Figure 1a also shows
that the average hydrophobicity and length distributions differ
markedly from those measured for a set of sequences built from a random
selection of residues from the CATH domains. Shorter domain sequences,
of fewer than 100 residues, require a smaller proportion of hydrophobic
residues than larger domains. This follows from the constraint that
the number of non-polar residues must be limited so that they can be
completely protected from the solvent exterior by a fixed shell of
polar residues. Larger domains have a larger proportion of hydrophobic
residues, which levels out at 55% for domains larger than 200
residues.
Generating a domain probability matrix for a query sequence
Scooby-domain uses a multilevel smoothing window to
predict the location of domains in a novel sequence (Figure 1b). The
window size, representing the length of a putative domain, is
incremented starting from the smallest domain size observed in the
database to the largest domain size. Each smoothing window calculates
the fraction of hydrophobic residues it encapsulates along a sequence,
and places the value at its central position. This leads to a 2D
matrix, where the value at position (i, j) is the average hydrophobicity
encapsulated by a window of size j that is centred at residue position
i. The matrix has a triangular shape, the apex of which
corresponds to a window size equal to the length of the sequence or to
the maximum window size (the largest observed domain), whichever is
smaller.
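The multilevel smoothing window can be sketched in a few lines of
Python. This is an illustration only, not the Scooby-domain source
code: the function name, the dictionary layout and, in particular, the
set of residues counted as hydrophobic are assumptions, since the text
above does not list them. The default minimum window of 30 residues is
the smallest domain size quoted in the boundary assignment section.

```python
# Assumption: which residues count as hydrophobic is not stated in the text.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobicity_matrix(seq, min_window=30, max_window=None):
    """Return {(i, j): fraction} for a window of size j centred at 0-based i."""
    n = len(seq)
    if max_window is None or max_window > n:
        max_window = n  # the apex of the triangle is capped by sequence length
    # Prefix sums let every window fraction be computed in constant time.
    prefix = [0]
    for residue in seq.upper():
        prefix.append(prefix[-1] + (residue in HYDROPHOBIC))
    matrix = {}
    for j in range(min_window, max_window + 1):
        for start in range(0, n - j + 1):
            centre = start + j // 2
            matrix[(centre, j)] = (prefix[start + j] - prefix[start]) / j
    return matrix


if __name__ == "__main__":
    toy = hydrophobicity_matrix("AAAAKKKK", min_window=4)
    print(len(toy), "windows; full-length window value:", toy[(4, 8)])
```

Each row j holds one value fewer than the row below it, which is what
gives the matrix its triangular shape.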
All values in the matrix are converted into
probability scores by referring to the observed distribution of domain
sizes and hydrophobicities described earlier, i.e. given an average
hydrophobicity and window length, the probability that the window can
fold into a domain is read directly from the observed data. The
resulting Scooby plot can be inspected visually to identify regions
that are likely to fold into domains, as well as unstructured regions
(Figure 1c).
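The conversion step amounts to a table lookup. The sketch below
assumes the observed CATH distribution has been stored as a dictionary
keyed by (length bin, hydrophobicity bin); the bin widths, the function
name and the table layout are hypothetical choices made for
illustration and are not taken from Scooby-domain itself.

```python
def to_probabilities(matrix, observed, length_bin=10, hydro_bin_pct=5):
    """Map each (centre, size) hydrophobic fraction to an observed probability.

    matrix   : {(centre, size): hydrophobic fraction between 0 and 1}
    observed : {(length_bin_index, hydro_bin_index): probability}; a
               hypothetical pre-computed table from domains of known structure
    """
    probabilities = {}
    for (centre, size), fraction in matrix.items():
        key = (size // length_bin, round(fraction * 100) // hydro_bin_pct)
        # Combinations never seen among known domains get probability 0.
        probabilities[(centre, size)] = observed.get(key, 0.0)
    return probabilities
```

Windows whose length and hydrophobicity fall outside anything observed
for known domains score zero, which is how unstructured regions would
show up as cold areas in the plot.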
Automatic domain boundary assignment
The highest probability in the Scooby plot
represents the first predicted domain (Figure 1d). The corresponding
sequence stretch for this domain is removed from the
sequence. The first predicted domain is therefore always
continuous in sequence, whereas subsequent predictions can encompass
discontinuous domains. If the excised domain is at a central position
in the sequence, the resulting N- and C-terminal fragments are
rejoined and the probability matrix is recalculated as before. The next
highest probability is then found and the corresponding subsequence
removed. The process is repeated until fewer than 30 residues (the
size of the smallest domain) are left from the original sequence, or
until no probability in the matrix is greater than 0.33; this cut-off
avoids error-prone predictions.
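The excision loop can be outlined as follows. This is a minimal sketch
under stated assumptions rather than the program's actual
implementation: best_window is a hypothetical callback standing in for
the recalculation of the probability matrix, and it returns the best
probability together with the start and end indices of that window
within the residues that remain. The zeroing of windows around the
join, described in the caption of Figure 1d, would live inside that
callback and is not modelled here.

```python
def assign_domains(seq_len, best_window, min_len=30, cutoff=0.33):
    """Greedy domain excision; returns a list of original residue indices per domain.

    best_window(remaining) -> (probability, start, end) is a hypothetical
    scoring callback; start and end index into the list `remaining`.
    """
    remaining = list(range(seq_len))  # original 0-based residue positions
    domains = []
    while len(remaining) >= min_len:
        probability, start, end = best_window(remaining)
        if probability <= cutoff:
            break  # nothing left that is reliable enough to report
        domains.append(remaining[start:end])
        # Deleting the slice rejoins the N- and C-terminal fragments.
        del remaining[start:end]
    return domains
```

Because the fragments are rejoined before the next round, a later
window that spans the join maps back to two separate stretches of the
original sequence, which is how a discontinuous domain arises.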

Figure 1 (a) Histogram of CATH domains as a
function of their hydrophobicity and domain length. The colour bar to
the right of the figure shows the scale of the distribution (0 to 1):
red areas represent regions that have a high frequency of domain
occurrence. The second plot shows the average CATH domain
hydrophobicity minus the average hydrophobicity for randomised
sequences (generated from a random selection of residues from
sequences in the CATH database). (b) Multilevel smoothing window. The
horizontal axis corresponds to the sequence position, i, and the
vertical axis represents the window length used in the smoothing of
sequence hydrophobicity, j. Each position in the matrix corresponds to
the average hydrophobicity assigned to the centre of a window during
smoothing. (c) Each position in the matrix is converted to a
probability that it will fold into a domain, based on the lengths and
hydrophobicities observed in the distribution of CATH domains. (d)
i. The highest scoring window (first predicted domain) is identified
in the probability matrix and the sequence region it encapsulates
(blue triangle) is removed from the sequence. ii. The resulting
sequence fragments are rejoined and the probability matrix
recalculated. iii. The smoothing windows that encapsulate the last 15
residues of the N-terminal fragment and the first 15 residues of the
C-terminal fragment have their probabilities set to zero (white
bands). If the next highest scoring region is found in the red region
then the excised domain will be discontinuous, otherwise it will be
continuous.
Download Scooby binaries...
ScoobyDo.bin for Linux.
ScoobyDo.exe for Windows.
How to use Scooby-Domain with the command line?
The program requires a FASTA sequence file as the
first argument (e.g. ./ScoobyDo_linux 2pia.seq). The program prints
ten results ranked by score (the score is given in brackets). If you
want just one prediction for a sequence, take the first
prediction, which has rank 0, as in the example below.
>0-0(0.742) 114 231
>1-6(0.742) 113 226
>2-2(0.557) 95 204 234
>3-7(0.538) 113 232
>4-8(0.538) 104 223
>5-5(0.497) 130 197 296
>6-4(0.407) 137 199 296
>7-9(0.403) 94 213 240
>8-1(0.325) 34 104 213 240
>9-3(0.325) 39 146 199 296
Each number following the bracketed score is a predicted domain boundary.
The first number after the '>' is the rank of the prediction.
Rank 0 is the best prediction to use, followed by rank 1 and so on.
The second number, after the '-' sign, is the prediction number.
It represents the order in which the prediction was produced by the algorithm, and
it does not indicate the benchmarked quality of the prediction.
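For scripted use, the ranked lines can be parsed with a short Python
helper. The sketch below is not part of the Scooby-domain
distribution; the names Prediction and parse_predictions are made up
for illustration, and the line format is inferred solely from the
example output shown above.

```python
import re
from typing import List, NamedTuple


class Prediction(NamedTuple):
    rank: int              # 0 is the best prediction
    number: int            # order in which the algorithm produced it
    score: float
    boundaries: List[int]  # predicted domain boundary positions


# Matches lines such as ">2-2(0.557) 95 204 234".
LINE_RE = re.compile(r"^>(\d+)-(\d+)\(([0-9.]+)\)\s*(.*)$")

EXAMPLE_OUTPUT = """\
>0-0(0.742) 114 231
>1-6(0.742) 113 226
>2-2(0.557) 95 204 234
"""


def parse_predictions(text):
    """Parse ranked prediction lines and return them sorted by rank."""
    predictions = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a prediction line
        rank, number, score, rest = match.groups()
        boundaries = [int(field) for field in rest.split()]
        predictions.append(Prediction(int(rank), int(number), float(score), boundaries))
    return sorted(predictions, key=lambda p: p.rank)


if __name__ == "__main__":
    best = parse_predictions(EXAMPLE_OUTPUT)[0]
    print(best.rank, best.score, best.boundaries)
```

Taking the first element of the returned list corresponds to using the
rank 0 prediction.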
The *.out file that is produced gives each residue position a domain number for each prediction, so you can keep track of
discontinuous domains.
The *.dom file is similar to the *.out file. Each domain is marked with a lowercase letter from 'a' to 'z'.
Each 'U' marks a possible linker region that is unstructured. These regions are not likely to be part of the
hydrophobic core of domains.
The *.ps file is a PostScript file (not always printed
correctly) showing a probability matrix (Figure 1c). Each hot spot
represents a central region along the sequence that will fold into a
domain.
How to use linker prediction scores?
The command line usage for linker prediction scores is:
./ScoobyDo [input sequence in FASTA format] [DOMCUT or DOMCUT_B] [optional: file name with linker scores]
DOMCUT_B = Use DomCut on first sequence only (recommended)
DOMCUT = Use DomCut on all sequences
Instead of using DomCut, you may use your own linker prediction scores via an input file:
The input file
1 1.98609687084548
2 1.77344019624752
3 1.78786992621695
4 1.36715273915053
5 1.20387679635056
6 1.33729683458531
7 0.71342730104684
8 0.851118053061024
9 0.732165206833348
10 0.552918905787393
11 0.299159313296559
....etc....
The first column of each line is the amino acid residue position, and the second column is the linker score. The
higher the score for an amino acid position, the more likely that position is to be part of a linker. The columns are separated by a tab.
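A file in this two-column layout can be written and read back with a
few lines of Python. The helper names below are hypothetical, and the
sketch assumes that positions are numbered from 1, as in the example
above, with one score per residue.

```python
def format_linker_scores(scores):
    """Return tab-separated 'position<TAB>score' lines, positions starting at 1."""
    return "".join(
        f"{position}\t{score!r}\n"
        for position, score in enumerate(scores, start=1)
    )


def parse_linker_scores(text):
    """Read the two-column layout back into {position: score}."""
    parsed = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        parsed[int(fields[0])] = float(fields[1])
    return parsed


if __name__ == "__main__":
    text = format_linker_scores([1.98609687084548, 1.77344019624752])
    print(text, end="")
    print(parse_linker_scores(text))
```

The text returned by format_linker_scores can be saved to a file and
passed as the optional third argument in the usage line above.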
If DomCut is used, two files, *.domcut and *.dcs, will be produced. The *.domcut file records the linker score for each position
along the query protein sequence. The *.dcs file records the linker positions predicted by DomCut.