1. Prediction Method

DNA binding specificity predictions are based on the binding profiles of a reference set of 84 homeodomain proteins. This reference set consists of 83 Drosophila melanogaster proteins and 1 human protein (Oct), each containing a homeodomain DNA binding domain. These proteins were chosen because they contain only one known DNA binding domain. A bacterial one-hybrid system (Meng 2005) was used to determine a set of high-affinity DNA sites for each reference protein.

Up to N of the top-ranked reference proteins (default: N = 3) are chosen based on similarity to the query protein. The sites associated with these reference proteins are then pooled to estimate the DNA specificity of the query protein. First, the query protein is aligned to a multiple sequence alignment of the 60-residue core of the homeodomain region of the 84 reference proteins. These sequence-profile alignments are produced using the program MUSCLE (Edgar 2004).

Only a subset of the amino acid residues in the protein contact DNA directly, and these residues presumably play the largest role in determining DNA specificity. Reference proteins are therefore required to match the query at a set of 'critical' and 'key' residues thought to be directly involved in binding DNA, based on X-ray crystallographic and other evidence. Critical residues (default: Asn 51) must match the query exactly, while a specified number of mismatches (default: 1 mismatch) is allowed at key residues (default: residues 5, 47, 50, 54, 55). Next, the overall degree of similarity between the DNA binding domain of the query protein and that of each reference protein is determined from the sequence-profile alignment and the BLOSUM45 substitution matrix.

All of the reference proteins are then sorted, first by the number of exact matches at key residues and second by the total similarity score. The reference protein at the top of this sorted list matches the query protein at the greatest number of key residues. If it passes the similarity score threshold, it is selected and used to estimate the specificity of the query. Any other reference proteins that match the query at the same number of key residues as the top reference protein and also meet the similarity score threshold are selected as well, until up to N (default: 3) of the top proteins have been chosen. This holds regardless of the value of the 'number of required key residue matches' parameter, which is only a minimum. The DNA binding sites associated with the selected reference proteins are then combined to estimate the set of sites bound by the query protein.
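
The selection procedure described above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the names Reference, select_references, CRITICAL_POSITIONS and KEY_POSITIONS are hypothetical, the similarity scores are assumed to be precomputed, and the score_threshold default of 0.0 is a placeholder rather than the tool's setting. Exact ties in rank, which may yield more than N proteins (see the Parameters section), are not modeled.

```python
from dataclasses import dataclass

# Defaults quoted in the text; the constant names are hypothetical.
CRITICAL_POSITIONS = (51,)
KEY_POSITIONS = (5, 47, 50, 54, 55)


@dataclass
class Reference:
    name: str
    residues: dict      # profile position -> one-letter amino acid
    similarity: float   # precomputed similarity score against the query


def select_references(query, references, min_key_matches=4,
                      score_threshold=0.0, max_refs=3):
    """Pick the reference proteins used to estimate the query's specificity.

    query maps profile positions to one-letter amino acids.
    min_key_matches=4 corresponds to 1 allowed mismatch at 5 key residues.
    score_threshold=0.0 is a placeholder, not the tool's default.
    """
    ranked = []
    for ref in references:
        # Critical residues must match the query exactly.
        if any(ref.residues[p] != query[p] for p in CRITICAL_POSITIONS):
            continue
        matches = sum(ref.residues[p] == query[p] for p in KEY_POSITIONS)
        if matches >= min_key_matches:
            ranked.append((matches, ref.similarity, ref))
    if not ranked:
        return []
    # Sort by key-residue matches first, then by similarity, both descending.
    ranked.sort(key=lambda item: (item[0], item[1]), reverse=True)
    top_matches = ranked[0][0]
    # Keep only references tied with the top one on key-residue matches
    # that also meet the similarity score threshold.
    selected = [ref for matches, score, ref in ranked
                if matches == top_matches and score >= score_threshold]
    return selected[:max_refs]
```

Note that a reference with fewer key-residue matches than the top-ranked one is never selected in this sketch, even if its overall similarity score is higher.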

2. Reference Alignment

The reference profile consists of an alignment of 83 Drosophila melanogaster homeodomain DNA binding domain sequences and 1 human sequence (Oct). These proteins were chosen because they contain only one known DNA binding domain. The initial alignment was trimmed to yield the final reference profile, which contains no gaps.


Figure: A portion of the original reference multiple sequence alignment, before trimming the gap-containing N-terminal and C-terminal columns and the TALE region. Columns outlined in red are those that were removed.


Figure: The final, trimmed version of the reference multiple sequence alignment shown above. Position 1, the TALE region (positions 21-23 in the initial MSA), and positions greater than 64 were all removed from the initial alignment.

Note: You do not need to pre-process your query sequences by removing residues. Although gap-containing columns were removed from the initial reference alignment, MUSCLE aligns the unmodified input sequence to the reference profile. Columns of the sequence-profile alignment that contain gaps in the profile (insertions in the query relative to the profile, which itself contains no gaps) are removed when the program assigns residue position indices.
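
The position indexing described in this note can be sketched as follows. This is a minimal illustration under stated assumptions: the function name index_query_residues is hypothetical, the two inputs are assumed to be equal-length rows taken from a sequence-profile alignment (one row of the profile and the query row), and gaps are assumed to be written as "-".

```python
def index_query_residues(profile_row, query_row, gap="-"):
    """Map 1-based reference-profile positions to query residues."""
    if len(profile_row) != len(query_row):
        raise ValueError("aligned rows must have equal length")
    positions = {}
    position = 0
    for profile_char, query_char in zip(profile_row, query_row):
        if profile_char == gap:
            # Insertion in the query relative to the gap-free profile:
            # the column is dropped and receives no position index.
            continue
        position += 1
        # The query character may itself be a gap (a deletion in the query).
        positions[position] = query_char
    return positions
```

For example, with profile row "AB-CD" and query row "ABXC-", the inserted residue X is skipped and the query residue C is assigned position 3.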

3. Parameters

By default, the parameters are set to those used to make the human predictions. The user is free to change these parameters, but should do so with caution.
  1. Critical Residues:
    The set of amino acid residues which are deemed critical to determining the DNA binding specificity. By default, only residue Asn51 is in the critical residue set (residues are numbered according to the reference profile alignment). Position 51 is completely conserved in the reference set, so by default, no predictions will be made for query proteins that do not have Asn at position 51.

    While you may add residues to this set, it is recommended that you always include residue 51 in the set since it plays a critical role in binding DNA.

  2. Key Residues:
    The set of amino acid residues that are important determinants of DNA specificity (residues are numbered according to the reference profile alignment). The default residues (5, 47, 50, 54, 55) are all thought to bind DNA directly.

  3. Number of required key residue matches:
    Some mismatches may be allowed at key residues. This parameter determines the minimum number of key residue matches required for a reference protein to be selected. By default, 1 mismatch is allowed.

  4. Substitution matrix:
    The overall degree of similarity between the query protein and each reference protein is determined using a substitution matrix. Currently, only the BLOSUM45 substitution matrix is available.

  5. Similarity score threshold:
    Every reference protein used to predict specificity must receive a similarity score greater than or equal to this cutoff to be considered.

  6. Number of reference sequences:
    This parameter sets the maximum number, N, of reference proteins used to make predictions. Frequently, fewer than N reference proteins meet the requisite criteria. In the case of ties, where multiple proteins receive the same rank, more than N sequences may be selected.

  7. Similarity score range:
    Every selected reference protein must score within this many similarity score units of the maximum similarity score received by any reference protein. This parameter ensures that a reference protein is not selected if it matches the query protein exactly at the critical and key residues but has a much lower overall similarity score than other reference proteins.
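
How the substitution matrix, similarity score threshold, and similarity score range might interact can be sketched as follows. This is an illustration only: the function names are hypothetical, and toy_score is a stand-in that does not contain real BLOSUM45 values. An actual implementation would look up each residue pair in the BLOSUM45 table.

```python
def similarity_score(query, reference, score_pair):
    """Sum pairwise substitution scores over aligned, gap-free positions."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(score_pair(q, r) for q, r in zip(query, reference))


def toy_score(a, b):
    """Toy stand-in for a BLOSUM45 lookup: +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


def apply_score_filters(scores, threshold, score_range):
    """Keep references that meet the threshold and lie within
    score_range units of the best score among all references."""
    if not scores:
        return {}
    best = max(scores.values())
    return {name: score for name, score in scores.items()
            if score >= threshold and best - score <= score_range}
```

With scores of 200, 180 and 120, a threshold of 100 and a range of 30, the reference scoring 120 passes the threshold but is dropped because it lies 80 units below the best score.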

4. Results

  1. Sites:
    Once a set of reference proteins is selected, all of the sites associated with these proteins are combined to form one set. This set is used to estimate the set of sites bound by the query protein.

  2. Logos:
    Sequence logos for the set of predicted sites are produced using WebLogo version 2.8 (Crooks 2004).

  3. Matrix:
    The predicted set of sites is used to generate a count matrix, which records how many times each base occurs at each position of the sites.

  4. Motif Quality Score:
    Each predicted motif is assigned a quality score ranging from 1 to 4, with 1 indicating the highest quality. The method for determining quality scores is based on the results of a leave-one-out cross-validation analysis. The degree of overall similarity, the number of mismatches at key residues, and the number of reference proteins that are similar to the query are all taken into account.
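
The pooling of sites and the construction of a count matrix can be sketched as follows. This is a minimal illustration, assuming the sites are already aligned DNA strings of equal length. The function names pool_sites and count_matrix and the dictionary layout are hypothetical, not the tool's output format.

```python
def pool_sites(sites_by_reference, selected_names):
    """Combine the sites of all selected reference proteins into one list."""
    return [site for name in selected_names
            for site in sites_by_reference[name]]


def count_matrix(sites, alphabet="ACGT"):
    """Count how often each base occurs at each position of the sites."""
    if not sites:
        return {base: [] for base in alphabet}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * width for base in alphabet}
    for site in sites:
        for i, base in enumerate(site.upper()):
            if base not in counts:
                raise ValueError(f"unexpected character {base!r} in site")
            counts[base][i] += 1
    return counts
```

Each column of the resulting matrix sums to the number of pooled sites, so dividing by that number gives per-position base frequencies.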

5. Reference

    Noyes, M.B., R.G. Christensen, A. Wakabayashi, G.D. Stormo, M.H. Brodsky, S.A. Wolfe.
    Analysis of Homeodomain Specificities Allows the Family-wide Prediction of Preferred Recognition Sites.
    Cell. In press.