1. Prediction Method
DNA binding specificity predictions are based on the binding profiles associated with a reference set of 84 homeodomain proteins. This reference set consists of 83 Drosophila melanogaster proteins and 1 human protein (Oct), each containing the homeodomain DNA binding domain. These proteins were chosen because they contain only one known DNA binding domain. A bacterial one-hybrid system (Meng 2005) was used to determine a set of high-affinity DNA sites for each reference protein.
A small number of the top reference proteins (default: 3) are chosen based on their similarity to the query protein. The sites associated with these top reference proteins are then pooled to estimate the DNA specificity of the query protein.
First, the query protein is aligned to a multiple sequence alignment of the 60-residue core of the homeodomain region of the 84 reference proteins. This sequence-profile alignment is produced using the program MUSCLE (Edgar 2004).
Only a subset of the amino acid residues in the protein bind DNA directly. Presumably, these residues play the largest role in determining DNA specificity. Thus, matches are required at a set of 'critical' and 'key' residues thought to be directly involved in binding DNA, based on X-ray crystallographic and other evidence. Critical residues (default: Asn 51) must match the query exactly, while a specified number of mismatches (default: 1 mismatch) is allowed at key residues (default: residues 5, 47, 50, 54, 55).
Then, the overall degree of similarity between the DNA binding domain of the query protein and each reference protein is determined based on the sequence-profile alignment and the BLOSUM45 substitution matrix. All of the reference proteins are sorted (1) first according to the number of exact matches at key residues and (2) second according to the total similarity score.
The reference protein at the top of this sorted list matches the query protein at the greatest number of key residues. If it passes the similarity score threshold, it will be selected and used to estimate the specificity of the query. Other reference proteins that match the query at the same number of key residues as the top reference protein and that meet the similarity score threshold will be selected as well, until up to N (default: 3) of the top proteins have been chosen. This holds regardless of the value of the 'number of required key residue matches' parameter, which is only a minimum value. The DNA binding sites associated with the selected reference proteins are then combined to estimate the set of sites bound by the query protein.
2. Reference Alignment
The reference profile consists of an alignment of 83 Drosophila melanogaster homeodomain DNA binding domain sequences and 1 human sequence (Oct). These proteins were chosen because they contain only one known DNA binding domain. The initial alignment was trimmed to yield the final reference profile, which contains no gaps.
Note: You do not need to pre-process your query sequences by removing residues. Although gap-containing columns were removed from the initial reference alignment, query sequences do not need to be modified by the user. MUSCLE will align the input sequence to the reference profile. Columns of the resulting sequence-profile alignment that contain gaps in the profile (that is, insertions in the query relative to the profile, which contains no gaps itself) will be removed when the program assigns residue position indices.
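The index assignment described in the note can be illustrated with a minimal sketch. It assumes the sequence-profile alignment is available as two equal-length strings, one row from the profile and the aligned query, with '-' marking gaps. The function name assign_positions and this input layout are hypothetical and are not taken from the program itself.

```python
def assign_positions(profile_row, query_row):
    """Map 1-based profile positions to the aligned query residues.

    Hypothetical helper: columns where the profile row has a gap are
    insertions in the query relative to the profile and are dropped
    before positions are numbered.
    """
    if len(profile_row) != len(query_row):
        raise ValueError("aligned rows must have the same length")

    positions = {}
    index = 0
    for profile_char, query_char in zip(profile_row, query_row):
        if profile_char == "-":
            # Insertion relative to the gap-free profile: skip the column.
            continue
        index += 1
        # A '-' in the query is kept, marking a deletion at this position.
        positions[index] = query_char
    return positions
```

With a full homeodomain alignment, the resulting dictionary would cover positions 1 to 60, so that a residue such as position 51 can be looked up directly whatever insertions the query carries.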
3. Parameters
By default, the parameters are set to those used to make the human predictions. The user is free to change these parameters, but should do so with caution.
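To show how the default parameters interact, the selection step from section 1 can be sketched as follows. This is an illustrative reading of the description, not the program's actual code. The names select_references, score_fn and min_score are hypothetical, the value of the similarity score threshold is not stated here and must be supplied by the caller, and identity_score is a simple stand-in for the BLOSUM45-based similarity score. Sequences are assumed to be 60-residue strings already indexed to the reference profile.

```python
CRITICAL_POSITIONS = (51,)            # default critical residue: Asn 51
KEY_POSITIONS = (5, 47, 50, 54, 55)   # default key residues


def identity_score(seq_a, seq_b):
    """Stand-in similarity score: number of identical positions.

    The method described above uses BLOSUM45; this placeholder only
    keeps the sketch self-contained.
    """
    return sum(a == b for a, b in zip(seq_a, seq_b))


def select_references(query, references, score_fn, min_score,
                      max_key_mismatches=1, top_n=3):
    """Return the names of up to top_n reference proteins for a query."""
    candidates = []
    for name, seq in references.items():
        # Critical residues must match the query exactly.
        if any(seq[p - 1] != query[p - 1] for p in CRITICAL_POSITIONS):
            continue
        # A limited number of mismatches is allowed at key residues.
        key_matches = sum(seq[p - 1] == query[p - 1] for p in KEY_POSITIONS)
        if len(KEY_POSITIONS) - key_matches > max_key_mismatches:
            continue
        candidates.append((key_matches, score_fn(query, seq), name))

    if not candidates:
        return []

    # Sort by key-residue matches first, then by similarity score.
    candidates.sort(key=lambda c: (-c[0], -c[1]))

    # Keep only proteins tied with the top protein on key-residue
    # matches that also meet the similarity score threshold.
    best_key_matches = candidates[0][0]
    chosen = [name for key_matches, score, name in candidates
              if key_matches == best_key_matches and score >= min_score]
    return chosen[:top_n]
```

Under this reading, a reference protein with one key-residue mismatch is only used when no reference protein matches the query at all five key residues, which is why the minimum number of required key matches acts as a lower bound rather than a target.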
4. Results
5. Reference
Analysis of Homeodomain Specificities Allows the Family-wide Prediction of Preferred Recognition Sites. Cell. In press.