###########################
# README FOR MATALIGN-V2A #
###########################

Copyright 2002--2005 Ting Wang and Gary Stormo
May be copied for noncommercial purposes.

Authors:
  Ting Wang and Gary Stormo
  Department of Genetics
  Washington University in St. Louis
  Campus Box 8232
  St. Louis, MO 63110
  stormo@ural.wustl.edu
  twang@ural.wustl.edu

MatrixAligner-v2a

#################
# BASIC OPTIONS #
#################

-f1 -f2 -t1 -t2 [-h ] [-n ] [-c0 ] [-c1 ] [-g ] [-a ] [-A ] [-CS ]

#######################
# GENERAL INFORMATION #
#######################

Matrix Aligner is a program that compares two position-specific scoring
matrices (PSSMs). The precursor of this program is "CompareTwo".
Matalign-v2a makes several improvements over the previous version of the
program.

The two input PSSMs should be formatted according to consensus output. They
can be either count matrices or frequency matrices. If a frequency matrix is
used, a user-defined "total count" is assumed (default = 10).

The scoring function between two positions of the two matrices is the ALLR
statistic. The alignment algorithm can be either "local alignment" or
"global alignment". Neither option allows gaps.

The comparison between the two matrices results in the following output:

1) An alignment between the two input matrices.

2) An ALLR score of the comparison. In general, the higher the ALLR score,
   the more similar the two matrices are.

3) A distance score. The distance between A and B is defined as:
       ALLR(A,A) + ALLR(B,B) - 2 x ALLR(A,B)
   In general, the smaller the distance, the more similar the two matrices
   are.

4) P-value and E-value of the observed ALLR score. These significance values
   are calculated based on Karlin-Altschul statistics. The p-value is the
   probability of observing an equal or higher ALLR score when comparing two
   random PSSMs.

5) The aligned parts of the two matrices are merged into one new matrix,
   which is printed as output.
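
The distance definition in item 3 can be sketched in a few lines of Python.
This is an illustration only, not the code used by matalign-v2a. It assumes
the usual definition of the ALLR (average log likelihood ratio) statistic for
a pair of count columns, a uniform background of 0.25 per base, natural
logarithms, and a small pseudocount to avoid log(0). It scores two
equal-width matrices column by column at offset 0 and does not search
offsets or the reverse complement. The function names and the pseudocount
value are hypothetical.

```python
import math

BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumed uniform prior for A, C, G, T


def allr_column(n1, n2, background=BACKGROUND, pseudocount=1.0):
    """ALLR score between two count columns given as (A, C, G, T) tuples."""
    tot1, tot2 = sum(n1), sum(n2)
    num = 0.0
    for b, p in enumerate(background):
        # Pseudocount-smoothed frequencies (an assumption of this sketch).
        f1 = (n1[b] + pseudocount * p) / (tot1 + pseudocount)
        f2 = (n2[b] + pseudocount * p) / (tot2 + pseudocount)
        # Each column's counts are scored against the other column's
        # log likelihood ratios.
        num += n2[b] * math.log(f1 / p)
        num += n1[b] * math.log(f2 / p)
    return num / (tot1 + tot2)


def allr_ungapped(m1, m2):
    """Sum of column scores for two equal-width matrices, no gaps, offset 0."""
    if len(m1) != len(m2):
        raise ValueError("this sketch only handles equal-width matrices")
    return sum(allr_column(c1, c2) for c1, c2 in zip(m1, m2))


def distance(m1, m2):
    """ALLR(A,A) + ALLR(B,B) - 2 x ALLR(A,B), as defined in item 3."""
    return (allr_ungapped(m1, m1) + allr_ungapped(m2, m2)
            - 2 * allr_ungapped(m1, m2))


if __name__ == "__main__":
    # Two short count matrices, one (A, C, G, T) tuple per position.
    a = [(1, 0, 15, 0), (0, 14, 2, 0)]
    b = [(0, 3, 18, 3), (0, 22, 1, 1)]
    print("ALLR(A,B) =", allr_ungapped(a, b))
    print("Distance  =", distance(a, b))
```

Because of the assumptions above, the numbers this sketch prints are not
expected to match the scores reported by the program.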

########################
# COMMAND LINE OPTIONS #
########################

0) -h: print these directions.

1) Required options
   -f1 filename: "filename" contains the first matrix.
   -f2 filename: "filename" contains the second matrix.
   -t1 integer: the type of matrix 1.
       0 means count matrix, 1 means frequency matrix.
   -t2 integer: the type of matrix 2.
       0 means count matrix, 1 means frequency matrix.
   -s number: the number of standard deviations to lower the information
       content at each position before identifying information peaks
       (required) during generation of the initial alignment. A range of
       values should be tried. For example, try values of 0.5, 1, 1.5,
       and 2. The overall best alignment is the one having the lowest
       expected frequency.

2) Alphabet options
   -a filename: file containing the alphabet and normalization information.

      Each line contains a letter (a symbol in the alphabet) followed by an
      optional normalization number (default: 1.0). The normalization is
      based on the relative prior probabilities of the letters. For nucleic
      acids, this might be the genomic frequency of the bases; however, if
      the "-d" option is not used, the frequencies observed in your own
      sequence data are used. In nucleic acid alphabets, a letter and its
      complement appear on the same line, separated by a colon (a letter can
      be its own complement, e.g. when using a dimer alphabet).
      Complementary letters may use the same normalization number. Only the
      standard 26 letters are permissible; however, when the "-CS" option is
      used, the alphabet is case sensitive, so that a total of 52 different
      characters are possible.

      POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
      letter
      letter normalization

      POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
      letter:complement
      letter:complement normalization
      letter:complement normalization:complement's_normalization

      Example alphabet file 1:
      A:T
      C:G

      Example alphabet file 2:
      A:T 0.3
      C:G 0.2

3) Options for handling the complement of the matrices
   -c0: ignore the complement
   -c1: compare both orientations (the default option)

4) Algorithm options
   -n: the user-defined total count assumed when a frequency matrix is
       compared. Default is 10.
   -g: global alignment. Default is local alignment.

###########
# Example #
###########

(bifrost)[3:11pm]src_v2a 207>>more matrix1
A |   1   0   0   0   0   0   0
C |   0  14   2   0   0   0   9
G |  15   2  14  16  16  16   2
T |   0   0   0   0   0   0   5

(bifrost)[3:23pm]src_v2a 208>>more matrix2
A |   0   0   1   0   1   0   0
C |   3  22   0   0   0   1  15
G |  18   1  23  24  23  23   1
T |   3   1   0   0   0   0   8

(bifrost)[3:23pm]src_v2a 209>>./matalign-v2a -f1 matrix1 -f2 matrix2 -t1 0 -t2 0 -g
COMMAND LINE: ./matalign-v2a -f1 matrix1 -f2 matrix2 -t1 0 -t2 0 -g
***** PID: 4450 *****

Algorithm options:
Compare both orientations.
Global alignment.
Matrix 1 type: count matrix
Matrix 2 type: count matrix

Matrices Alignment:

MATRIX:
A |         1   0   0   0   0   0   0
C |         0  14   2   0   0   0   9
G |        15   2  14  16  16  16   2
T |         0   0   0   0   0   0   5
Consensus:  G   C   G   G   G   G   c
MATCH:      |   |   |   |   |   |   |
MATRIX:
A |         0   0   1   0   1   0   0
C |         3  22   0   0   0   1  15
G |        18   1  23  24  23  23   1
T |         3   1   0   0   0   0   8
Consensus:  G   C   G   G   G   G   c

Comparison scores:
ALLR = 8.3894
Distance = 1.9606
E_value = 3.01e-07
P_value = 3.01e-07

NEW MATRIX:
number of sequences = 40
width = 7
crude information = 9.7450 (bits)
A |   1   0   1   0   1   0   0
C |   3  36   2   0   0   1  24
G |  33   3  37  40  39  39   3
T |   3   1   0   0   0   0  13
Consensus: GCGGGGc
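
In the example above, every entry of NEW MATRIX equals the sum of the
corresponding entries of the two aligned count matrices. The Python sketch
below reproduces that merge for this one case. It is an illustration, not
the program's own code: it assumes both matrices are count matrices aligned
over their full width in the same orientation, as in the global alignment
shown. The function names are hypothetical.

```python
def parse_count_matrix(text):
    """Parse rows such as 'A | 1 0 0' into {letter: [counts]}."""
    matrix = {}
    for line in text.strip().splitlines():
        letter, _, counts = line.partition("|")
        matrix[letter.strip()] = [int(x) for x in counts.split()]
    return matrix


def merge_counts(m1, m2):
    """Add two equal-width count matrices position by position."""
    return {b: [x + y for x, y in zip(m1[b], m2[b])] for b in m1}


MATRIX1 = """
A | 1 0 0 0 0 0 0
C | 0 14 2 0 0 0 9
G | 15 2 14 16 16 16 2
T | 0 0 0 0 0 0 5
"""

MATRIX2 = """
A | 0 0 1 0 1 0 0
C | 3 22 0 0 0 1 15
G | 18 1 23 24 23 23 1
T | 3 1 0 0 0 0 8
"""

merged = merge_counts(parse_count_matrix(MATRIX1), parse_count_matrix(MATRIX2))

# The number of sequences is the total count in any single column.
n_sequences = sum(row[0] for row in merged.values())
width = len(merged["A"])

print("number of sequences =", n_sequences)
print("width =", width)
for letter, row in merged.items():
    print(letter, "|", " ".join(str(n) for n in row))
```

For a local alignment or a reverse-complement match, only the aligned
columns would be added, so the merge would need the alignment offset and
orientation as additional inputs.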