########################### # README FOR MATALIGN-V4a # ########################### Copyright 2002--2006 Ting Wang and Gary Stormo May be copied for noncommercial purposes. Author: Ting Wang and Gary Stormo Department of Genetics Washington University in St. Louis Campus Box 8232 St. Louis, MO 63110 stormo@ural.wustl.edu twang@ural.wustl.edu MatrixAligner-v4a, or "MegaMatAlign". NOTES ON THIS UPDATE: MAIN DIFFERENCE FROM V2: batch process of many matrix comparisons. This version allows comparison among many matrices, and fix some bugs. Credit should also be given to Alok Saldanha [alok@caltech.edu] who did lots of initial experimenting with this idea. Strategy: user now provides one or two files that contains file names of matrices to compare. The first line of the file always contains the directory information of the matrices (therefore all matrices should be under the same directory). If one list is provided, all pair-wise comparison will be performed; if two lists are provided, all matrices in list file 1 will be compared to all matrices in list file 2. I also did some software engineering to speed up the comparison. In this version, all matrices should have the format of a count vector. I will allow more variations to the format in later modifications. To illustrate the speed gain, I compared all TRANSFAC matrices against themselves (636 matrices, 201930 comparisons). It took MegaMatAlign about 30 seconds to complete the comparison, while if I run a perl script that calls MatAlign that many times, it took almost an hour. So, there is at least a 50~100x gain in speed. I should also point out that another trick for speeding up the process is to calculate a ALLR lookup table upfront. However, since pseudocount treatment is different between the lookup table and real calculation, the ALLR scores are slightly different. I don't think this will cause significant different, but have to watch out. I may turn this off. Then it would run for twice amount the time, but the logrithm calculation would be more consistent. ################# # BASIC OPTIONS # ################# Usage: ./matalign-v4a Use the "-h" option for detailed directions -f1 [-h ] [-f2 ] [-c0 ] [-c1 ] [-g ] [-ez ] [-a ] [-i ] [-A ] [-CS ] ####################### # GENERAL INFORMATION # ####################### Matrix Aligner is a program to compare two positional specific matrices. The precursor of this program is "CompareTwo". Matalign-v4a made several improvements over the previous establishment of the program. "MegaMatAlign" enables comparison of many matrices. The input file contains matrix names (and their location, i.e. directory information). If there is one input list file, MegaMatAlign pair-wise compares all matrices listed in this file; if there are two input list files, then MegaMatAlign compares every matrix listed in file 1 to every matrix listed in file 2. Input PSSMs should be formatted according to consensus output. They should be count matrix like this: (bifrost)[11:06am]src_v4a 574>>more matrix1 A | 1 0 0 0 0 0 0 C | 0 14 2 0 0 0 9 G | 15 2 14 16 16 16 2 T | 0 0 0 0 0 0 5 The scoring function between two positions of the two matrices is ALLR statistic. The alignment algorithm can be either "local alignment" or "global alignment". Both options do not allow gaps. The program generates a data table as output. The table contains the following columns, tab-delimited: Matrix_1: Name of the first matrix in comparison Consensus: Consensus pattern of the first matrix Matrix_2: Name of the second matrix in comparison Consensus: Consensus pattern of the second matrix ALLR: ALLR score of the comparison. In general, the higher the ALLR score, the more similar are the two matrices. Overlap: The overlap length (bp) of the aligned part Distance: A distance score. Distance between A and B is defined as: ALLR(A,A) + ALLR(B,B) - 2xALLR(A,B). In general, the smaller the distance, the more similar are the two matrices. Aligned_Dist: Similar defined as Distance, but only calculated for the aligned part. Shared_Consensus: The aligned part of the two matrices are merged, and a new matrix generated. This is the consensus of the new matrix. E_value: P_value: P-value and E-value of the observed ALLR score. These significance values are calculated based on Karlin-Altschul statistics. The meaning of the p-value is, given two random PSSMs, the probability of observing an equal or higher ALLR score. If "-ez" is defined at command line, the output table will only contain Matrix_1, Matrix_2, ALLR, and Overlap for simplicity. ######################## # COMMAND LINE OPTIONS # ######################## 0) -h: print these directions. 1) Required options -f1 filename "filename" contains information about matrices: the first line of the file contains directory, after which every line contains a matrix file name. -f2 filename "filename" contains the second list of matrices. 2) Alphabet options -a filename: file containing the alphabet and normalization information. Each line contains a letter (a symbol in the alphabet) followed by an optional normalization number (default: 1.0). The normalization is based on the relative prior probabilities of the letters. For nucleic acids, this might be be the genomic frequency of the bases; however, if the "-d" option is not used, the frequencies observed in your own sequence data are used. In nucleic acid alphabets, a letter and its complement appear on the same line, separated by a colon (a letter can be its own complement, e.g. when using a dimer alphabet). Complementary letters may use the same normalization number. Only the standard 26 letters are permissible; however, when the "-CS" option is used, the alphabet is case sensitive so that a total of 52 different characters are possible. POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS: letter letter normalization POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS: letter:complement letter:complement normalization letter:complement normalization:complement's_normalization Example alphabet file 1: A:T C:G Example alphabet file 2: A:T 0.3 C:G 0.2 3) Options for handling the complement of the matrices --- -c0: ignore the complement -c1: compare both orientation (the default option) 4) Algorithm options -g: global alignment. Default is local alignment. -ez: only generate ALLR score and Overlap for each comparison. ########### # Example # ########### (bifrost)[11:32am]src_v4a 598>>more sample_list /home3/twang/CompareTwo/src_v4a/sample GCN4_01.matrix GCN4_C.matrix HSF_01.matrix HSF_02.matrix HSF_03.matrix HSF_04.matrix HSF_05.matrix (bifrost)[11:32am]src_v4a 599>>more /home3/twang/CompareTwo/src_v4a/sample/GCN4_01.matrix A | 4 2 7 2 3 5 5 4 6 19 0 0 43 0 2 1 43 1 12 4 12 6 7 4 5 8 6 C | 6 7 6 7 6 4 11 12 9 6 0 0 0 43 0 42 0 11 16 14 10 9 10 8 10 7 7 G | 2 5 2 9 6 10 7 9 11 15 0 43 0 0 0 0 0 5 2 13 4 11 4 11 6 3 4 T | 2 1 3 2 8 6 4 5 11 2 43 0 0 0 41 0 0 26 8 3 8 7 10 5 6 7 6 (bifrost)[11:33am]src_v4a 600>>more /home3/twang/CompareTwo/src_v4a/sample/GCN4_C.matrix A | 13 25 21 0 0 38 0 0 0 3 C | 12 0 0 0 0 0 30 0 38 1 G | 8 12 12 0 38 0 0 0 0 0 T | 4 0 4 38 0 0 8 38 0 1 (bifrost)[11:33am]src_v4a 601>>./matalign-v4a -f1 sample_list -a ~/alphabet -g COMMAND LINE: ./matalign-v4a -f1 sample_list -a /home3/twang/alphabet -g ***** PID: 19688 ***** Pair-wise comparison among matrices specified in sample_list. Algorithm options: Compare both orientations. Global alignment. Output option: Complete output, including all statistics. Number of pair-wise comparisons: 21 Matrix_1 Consensus Matrix_2 Consensus ALLR Overlap Distance Aligned_Dist Shared_Consensus E_value P_value GCN4_01.matrix nnnnnnnnnRTGACTCAtnSnnnnnnn GCN4_C.matrix naaTGACTCa 9.362 10 8.874 7.115 nnaTGACTCA 2.165e-07 2.165e-07 GCN4_01.matrix nnnnnnnnnRTGACTCAtnSnnnnnnn HSF_01.matrix aGAAn -0.018 1 20.037 0.138 n 36.51 1 GCN4_01.matrix nnnnnnnnnRTGACTCAtnSnnnnnnn HSF_02.matrix aGAAnaGAAnaGAAn -0.018 1 29.985 0.138 n 109.5 1 GCN4_01.matrix nnnnnnnnnRTGACTCAtnSnnnnnnn HSF_03.matrix aGAAnaGAAnnTTCt -0.215 1 30.379 0.743 n 165.4 1 GCN4_01.matrix nnnnnnnnnRTGACTCAtnSnnnnnnn HSF_04.matrix aGAAnnTTCtaGAAn -0.018 1 29.985 0.138 n 109.5 1 GCN4_01.matrix nnnnnnnnnRTGACTCAtnSnnnnnnn HSF_05.matrix nTTCtaGAAnaGAAn -0.018 1 29.985 0.138 n 109.5 1 GCN4_C.matrix naaTGACTCa HSF_01.matrix aGAAn -0.044 1 17.633 0.210 n 14.3 1 GCN4_C.matrix naaTGACTCa HSF_02.matrix aGAAnaGAAnaGAAn -0.044 1 27.581 0.210 n 42.89 1 GCN4_C.matrix naaTGACTCa HSF_03.matrix aGAAnaGAAnnTTCt -0.225 1 27.943 1.079 a 62.66 1 GCN4_C.matrix naaTGACTCa HSF_04.matrix aGAAnnTTCtaGAAn -0.044 1 27.581 0.210 n 42.89 1 GCN4_C.matrix naaTGACTCa HSF_05.matrix nTTCtaGAAnaGAAn -0.044 1 27.581 0.210 n 42.89 1 HSF_01.matrix aGAAn HSF_02.matrix aGAAnaGAAnaGAAn 4.974 5 9.948 -0.000 aGAAn 0.000587 0.0005868 HSF_01.matrix aGAAn HSF_03.matrix aGAAnaGAAnnTTCt 4.974 5 9.948 -0.000 aGAAn 0.000587 0.0005868 HSF_01.matrix aGAAn HSF_04.matrix aGAAnnTTCtaGAAn 4.974 5 9.948 -0.000 aGAAn 0.000587 0.0005868 HSF_01.matrix aGAAn HSF_05.matrix nTTCtaGAAnaGAAn 4.974 5 9.948 -0.000 aGAAn 0.000587 0.0005868 HSF_02.matrix aGAAnaGAAnaGAAn HSF_03.matrix aGAAnaGAAnnTTCt 9.948 10 9.948 -0.000 aGAAnaGAAn 5.288e-08 5.288e-08 HSF_02.matrix aGAAnaGAAnaGAAn HSF_04.matrix aGAAnnTTCtaGAAn 4.974 5 19.896 -0.000 aGAAn 0.001761 0.001759 HSF_02.matrix aGAAnaGAAnaGAAn HSF_05.matrix nTTCtaGAAnaGAAn 9.948 10 9.948 -0.000 aGAAnaGAAn 5.288e-08 5.288e-08 HSF_03.matrix aGAAnaGAAnnTTCt HSF_04.matrix aGAAnnTTCtaGAAn 9.948 10 9.948 -0.000 aGAAnnTTCt 5.288e-08 5.288e-08 HSF_03.matrix aGAAnaGAAnnTTCt HSF_05.matrix nTTCtaGAAnaGAAn 9.948 10 9.948 -0.000 aGAAnaGAAn 5.288e-08 5.288e-08 HSF_04.matrix aGAAnnTTCtaGAAn HSF_05.matrix nTTCtaGAAnaGAAn 9.948 10 9.948 -0.000 nTTCtaGAAn 5.288e-08 5.288e-08 (bifrost)[11:33am]src_v4a 603>>./matalign-v4a -f1 sample_list -f2 sample_list -a ~/alphabet -g -ez COMMAND LINE: ./matalign-v4a -f1 sample_list -f2 sample_list -a /home3/twang/alphabet -g -ez ***** PID: 32161 ***** Compare all matrices specified in sample_list to all matrices specified in sample_list. Algorithm options: Compare both orientations. Global alignment. Output option: Simple output, only names of matrices, ALLR and Overlap. Number of pair-wise comparisons: 49 Matrix_1 Matrix_2 ALLR Overlap GCN4_01.matrix GCN4_01.matrix 15.027 27 GCN4_01.matrix GCN4_C.matrix 9.362 10 GCN4_01.matrix HSF_01.matrix -0.018 1 GCN4_01.matrix HSF_02.matrix -0.018 1 GCN4_01.matrix HSF_03.matrix -0.215 1 GCN4_01.matrix HSF_04.matrix -0.018 1 GCN4_01.matrix HSF_05.matrix -0.018 1 GCN4_C.matrix GCN4_01.matrix 9.362 10 GCN4_C.matrix GCN4_C.matrix 12.570 10 GCN4_C.matrix HSF_01.matrix -0.044 1 GCN4_C.matrix HSF_02.matrix -0.044 1 GCN4_C.matrix HSF_03.matrix -0.225 1 GCN4_C.matrix HSF_04.matrix -0.044 1 GCN4_C.matrix HSF_05.matrix -0.044 1 HSF_01.matrix GCN4_01.matrix -0.018 1 HSF_01.matrix GCN4_C.matrix -0.044 1 HSF_01.matrix HSF_01.matrix 4.974 5 HSF_01.matrix HSF_02.matrix 4.974 5 HSF_01.matrix HSF_03.matrix 4.974 5 HSF_01.matrix HSF_04.matrix 4.974 5 HSF_01.matrix HSF_05.matrix 4.974 5 HSF_02.matrix GCN4_01.matrix -0.018 1 HSF_02.matrix GCN4_C.matrix -0.044 1 HSF_02.matrix HSF_01.matrix 4.974 5 HSF_02.matrix HSF_02.matrix 14.922 15 HSF_02.matrix HSF_03.matrix 9.948 10 HSF_02.matrix HSF_04.matrix 4.974 5 HSF_02.matrix HSF_05.matrix 9.948 10 HSF_03.matrix GCN4_01.matrix -0.215 1 HSF_03.matrix GCN4_C.matrix -0.225 1 HSF_03.matrix HSF_01.matrix 4.974 5 HSF_03.matrix HSF_02.matrix 9.948 10 HSF_03.matrix HSF_03.matrix 14.922 15 HSF_03.matrix HSF_04.matrix 9.948 10 HSF_03.matrix HSF_05.matrix 9.948 10 HSF_04.matrix GCN4_01.matrix -0.018 1 HSF_04.matrix GCN4_C.matrix -0.044 1 HSF_04.matrix HSF_01.matrix 4.974 5 HSF_04.matrix HSF_02.matrix 4.974 5 HSF_04.matrix HSF_03.matrix 9.948 10 HSF_04.matrix HSF_04.matrix 14.922 15 HSF_04.matrix HSF_05.matrix 9.948 10 HSF_05.matrix GCN4_01.matrix -0.018 1 HSF_05.matrix GCN4_C.matrix -0.044 1 HSF_05.matrix HSF_01.matrix 4.974 5 HSF_05.matrix HSF_02.matrix 9.948 10 HSF_05.matrix HSF_03.matrix 9.948 10 HSF_05.matrix HSF_04.matrix 9.948 10 HSF_05.matrix HSF_05.matrix 14.922 15 (bifrost)[11:34am]src_v4a 604>>