###########################
# README FOR PHYLOCON-V3A #
###########################

Copyright 2002--2003 Ting Wang and Gary Stormo
May be copied for noncommercial purposes.

Authors:
Ting Wang and Gary Stormo
Department of Genetics
Washington University in St. Louis
Campus Box 8232
St. Louis, MO 63110
stormo@ural.wustl.edu
twang@ural.wustl.edu

PhyloCon (version 3a)

#################
# BASIC OPTIONS #
#################

-s [-h ] [-f ] [-q ] [-iq ] [-cq ] [-th ] [-a ] [-CS ] [-d ]
[-c0 ] [-c1 ] [-c2 ] [-l ] [-u0 ] [-u1 ] [-u2 ] [-pc ] [-o1 ]

#######################
# GENERAL INFORMATION #
#######################

PhyloCon is a motif discovery program. PhyloCon stands for
"Phylogenetic Consensus". It is derived from a well established
program, "Consensus", with comparative genomic features added.
PhyloCon assumes a "multiple gene, multiple species" model. It takes
into account both conservation among orthologous genes and
co-regulation of genes within a species. The algorithm first aligns
conserved regions of orthologous sequences into multiple sequence
alignments, or profiles, then compares profiles representing
non-orthologous sequences. Motifs emerge as common regions in these
profiles. The program will determine the width of the pattern being
sought.

The algorithm is based on a matrix representation of a consensus
pattern. Each row corresponds to one of the letters of the relevant
alphabet---e.g., 4 rows in the case of DNA. Each column corresponds
to one of the positions within the pattern. The elements of the
matrix are determined by the number of times that the indicated
letter occurs at the indicated position based on the words summarized
by the pattern.

The input of the program should be a set of promoters that are
co-regulated (or at least some of them share the same regulatory
mechanism). Each promoter should also have at least one orthologous
promoter from a reference genome. Therefore, the program takes in a
set of "co-regulated orthologous groups".
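
The matrix representation described above can be illustrated with a
minimal sketch. It assumes a DNA alphabet and words of equal width;
the function name "count_matrix" and the dictionary layout are
hypothetical choices for illustration, not PhyloCon's internal data
structures.

```python
ALPHABET = "ACGT"  # assumed DNA alphabet: one matrix row per letter


def count_matrix(words):
    """Return {letter: [count at position 0, count at position 1, ...]}.

    Each list is one row of the matrix; each index is one column
    (one position within the pattern).
    """
    if not words:
        raise ValueError("need at least one word")
    width = len(words[0])
    if any(len(word) != width for word in words):
        raise ValueError("all words must have the same width")
    matrix = {letter: [0] * width for letter in ALPHABET}
    for word in words:
        for position, letter in enumerate(word.upper()):
            if letter not in matrix:
                raise ValueError("letter not in alphabet: " + letter)
            matrix[letter][position] += 1
    return matrix


if __name__ == "__main__":
    example = count_matrix(["ACGT", "ACGA", "TCGT"])
    for letter in ALPHABET:
        print(letter, example[letter])
```

Each printed row lists, for one letter, how many of the three words
carry that letter at each of the four positions.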
PhyloCon first treats each group separately, aligning the orthologous
sequences and generating many multiple sequence alignments within
each group. It does so with a previously described algorithm called
"Wconsensus". These initial alignments are constructed by
sequentially adding words to previously saved alignments. During each
cycle, only the most significant alignments are saved. The maximum
number of alignments to save at each cycle is determined by the "-q"
option (see below).

To identify an overall best initial alignment, PhyloCon inherits from
Wconsensus the use of various multiples of the standard-deviation
correction to the information content (set with the "-s" option). As
the standard-deviation correction is increased, fewer positions will
tend to be in the resulting alignments. The overall best initial
alignment is the one having the lowest expected frequency. In
practice, the "-s" option controls the quality of the initial
alignments. A larger s value (such as 2) produces shorter, tighter
alignments, and a smaller s value (such as 0.5) produces longer,
looser alignments. Therefore, if the species you use are closely
related (such as human and mouse), you may want to use a large s
value; if they are far apart in evolution, you may want to use a
small s value.

After initial alignments are generated for each orthologous group, a
number of top, distinct alignments are saved for each group; this
number is determined by the command-line option "-iq". Many
sub-optimal alignments are saved. Each alignment represents a
conserved region in the original sequences.

PhyloCon then performs a pair-wise comparison between each initial
alignment and every alignment from a different orthologous group.
This comparison uses the ALLR statistic, a new statistic recently
developed in our lab. PhyloCon then saves and sorts some number of
high scoring pairs (HSPs) that exceed a threshold ALLR score.
It merges the profile components of an HSP into a new profile by
simply summing them together, trimming off the sections not contained
in the HSP. New profiles generated at this step contain sequences
from two groups. They are ranked by their corresponding ALLR scores.
The number of profiles to save at this step is determined by the
command-line option "-cq".

Then, PhyloCon compares each new profile saved from the last step to
the initial alignments that it does not already contain. It saves
HSPs and creates new profiles, up to a user-defined number. This
number is also determined by "-cq". New profiles generated at this
step contain sequences from three groups.

PhyloCon repeats this cycle after cycle. At cycle N it compares
profiles from previous cycles that do not share a common orthologous
group and would contain N orthologous groups if merged. It saves HSPs
and creates new profiles. New profiles generated at cycle N contain
sequences from N groups and are sorted by their corresponding ALLR
scores.

At each cycle, PhyloCon prints to the standard output a number of top
profiles (matrices). This number is determined by the command-line
option "-pc".

PhyloCon stops and prints a report when either of the following
holds:
1) all orthologous groups are included at the current cycle;
2) no comparison gives an HSP higher than a threshold ALLR value at
   the current cycle. This threshold is determined by the
   command-line option "-th".

In the program's output, the words contained in each matrix are
listed in the order of their occurrence in the input sequences. The
order is indicated by "integer|integer". The first integer is simply
a sequential count of the words, and the second integer indicates
during which cycle the word was added to the matrix. The location of
a word is indicated by "integer/integer". The first integer indicates
which sequence contains the word, and the second integer indicates
where in that sequence the word is located. If the first integer is
preceded by a minus sign, then the complementary word is the one
included in the matrix.
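
The merging step described earlier (summing the two profile
components of an HSP and trimming everything outside it) can be
sketched as follows. This is only an illustration under stated
assumptions: a profile is modelled as a list of columns, each column
a dictionary of letter counts, and the HSP is given by 0-based start
columns and a length. The names "merge_profiles", "start_a",
"start_b", and "length" are hypothetical and do not come from the
PhyloCon source.

```python
def merge_profiles(profile_a, profile_b, start_a, start_b, length):
    """Sum the HSP columns of two profiles; columns outside the HSP are dropped.

    profile_a, profile_b: lists of columns, each column a dict of letter counts.
    start_a, start_b: assumed 0-based first column of the HSP in each profile.
    length: number of aligned columns in the HSP.
    """
    if start_a < 0 or start_b < 0 or length < 1:
        raise ValueError("HSP coordinates must be non-negative, length >= 1")
    if start_a + length > len(profile_a) or start_b + length > len(profile_b):
        raise ValueError("HSP extends past the end of a profile")
    merged = []
    for offset in range(length):
        column_a = profile_a[start_a + offset]
        column_b = profile_b[start_b + offset]
        letters = sorted(set(column_a) | set(column_b))
        # Counts from both profiles are simply added letter by letter.
        merged.append({x: column_a.get(x, 0) + column_b.get(x, 0) for x in letters})
    return merged


if __name__ == "__main__":
    a = [{"A": 2}, {"C": 2}, {"G": 2}]
    b = [{"C": 3}, {"G": 2, "T": 1}]
    # HSP covers columns 1-2 of a and columns 0-1 of b.
    print(merge_profiles(a, b, start_a=1, start_b=0, length=2))
```

The merged profile has exactly as many columns as the HSP, and each
column holds the counts contributed by the sequences of both groups.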
The output of the program is sent to the standard output.

The input files---those containing the actual sequences and those
indicated by the "-f", "-a", and "-i" options---can contain comments
according to the following convention. The portion of a line
following a ';', '%', or '#' is considered a comment and is ignored.
Comments can begin anywhere in a line and always end at the end of
the line. The one minor exception is that, to avoid ambiguity,
comments in the list of sequences (see the "-f" option below) must be
preceded by a blank space when not occurring at the beginning of a
line.

###########################
# FORMAT OF THE SEQUENCES #
###########################

The sequence file should follow the structure described below.
Suppose there are 3 genes, each with 2 orthologous sequences:
Seq1-1, Seq1-2; Seq2-1, Seq2-2; Seq3-1, Seq3-2. Then the sequence
file should look like this (where [ ] contains optional modifiers):

[ modifiers.. ] Seq1-1 ; any description of the seq
\ AACC.... the actual sequence \
[ modifiers.. ] Seq1-2 ; any description of the seq
\ AACC.... the actual sequence \
\\
[ modifiers.. ] Seq2-1 ; any description of the seq
\ AACC.... the actual sequence \
[ modifiers.. ] Seq2-2 ; any description of the seq
\ AACC.... the actual sequence \
\\
[ modifiers.. ] Seq3-1 ; any description of the seq
\ AACC.... the actual sequence \
[ modifiers.. ] Seq3-2 ; any description of the seq
\ AACC.... the actual sequence \
\\

The rules are:
1) Each sequence has two components: a description line, where you
   can add modifiers, and the actual sequence, wrapped by "\".
2) At the end of each orthologous group, use "\\" to mark the end of
   the group.
3) The order of orthologous groups, and the order of sequences within
   each orthologous group, are not important.
4) Sequence modifiers appear in front of the name of the relevant
   sequence. They are:

   -s integer-integer integer-integer: the positions in the sequence
      indicated by the integer pairs, inclusive, are seed sequences.
      If the "-s" modifier is used anywhere in the input file, then
      the initial set of matrices will only be constructed (i.e.,
      seeded) from the sequences within the marked regions. If this
      modifier is not used anywhere in the input file, then all the
      sequences will be used to seed matrices. One or more integer
      pairs can be indicated for a single sequence. However, if no
      integer pairs are given, the whole sequence will be used for
      seeding matrices.

   -i integer-integer integer-integer: the positions in the sequence
      indicated by the integer pairs, inclusive, are the only
      positions to be analyzed.

   -e integer-integer integer-integer: the positions in the sequence
      indicated by the integer pairs, inclusive, are to be excluded
      from the analysis.

   When both the "-i" and "-e" modifiers are used, the intersection
   of permissible positions is analyzed. When a sequence name is not
   marked by either the "-i" or "-e" modifier, then the whole
   sequence is included in the analysis.

Do not explicitly give the complements of nucleic acid sequences. The
complementary sequence is determined by the program.

Whitespace, periods, dashes (unless part of an integer when the "-i"
option is used), and comments beginning with ';', '%', or '#' are
ignored. When using letter characters (i.e., with the "-a" and "-A"
alphabet options), integers are also ignored so that the sequence
file can contain positional information.

COMMAND LINE OPTIONS:

0) -h: print these directions.

1) General information

   -f filename: "filename" contains sequences formatted as described
      above.

   -q integer: the maximum number of matrices to save between cycles
      when generating initial alignments (default: save 200
      matrices).

   -s number: the number of standard deviations by which to lower the
      information content at each position before identifying
      information peaks during generation of initial alignments
      (required). A range of values should be tried. For example, try
      values of 0.5, 1, 1.5, and 2.
      The overall best alignment is the one having the lowest
      expected frequency.

2) Alphabet options

   -d: use the designated prior probabilities of the letters to
      override the observed frequencies. By default, the program uses
      the frequencies observed in your own sequence data for the
      prior probabilities of the letters.

   -a filename: file containing the alphabet and normalization
      information. Each line contains a letter (a symbol in the
      alphabet) followed by an optional normalization number
      (default: 1.0). The normalization is based on the relative
      prior probabilities of the letters. For nucleic acids, this
      might be the genomic frequency of the bases; however, if the
      "-d" option is not used, the frequencies observed in your own
      sequence data are used. In nucleic acid alphabets, a letter and
      its complement appear on the same line, separated by a colon (a
      letter can be its own complement, e.g. when using a dimer
      alphabet). Complementary letters may use the same normalization
      number. Only the standard 26 letters are permissible; however,
      when the "-CS" option is used, the alphabet is case sensitive
      so that a total of 52 different characters are possible.

      POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
         letter
         letter normalization

      POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
         letter:complement
         letter:complement normalization
         letter:complement normalization:complement's_normalization

      Example alphabet file 1:
         A:T
         C:G

      Example alphabet file 2:
         A:T 0.3
         C:G 0.2

3) Alphabet modifier indicating whether ascii alphabets are case
   sensitive---

   -CS: ascii alphabets are case sensitive.

4) Options for handling the complement of nucleic acid sequences---
   the 3 options in this section are mutually exclusive.

   -c0: ignore the complement (the default option)
   -c1: include both strands as separate sequences
   -c2: include both strands as a single sequence (i.e., orientation
        unknown)

   These options are inherited from Consensus and will be removed in
   the next version of PhyloCon.
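
The alphabet-file line formats listed under the "-a" option can be
read with a short parser. The following is a minimal sketch, not
PhyloCon's own reader: the function name "parse_alphabet" and the
returned dictionary layout are hypothetical, and it assumes that a
single normalization number on a "letter:complement" line applies to
both letters.

```python
import re


def parse_alphabet(text):
    """Return {letter: (complement or None, normalization)}.

    Handles the five documented line formats and strips comments
    that start with ';', '%', or '#'.
    """
    table = {}
    for raw_line in text.splitlines():
        line = re.split(r"[;%#]", raw_line, maxsplit=1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        fields = line.split()
        if len(fields) > 2:
            raise ValueError("too many fields: " + raw_line)
        letters = fields[0].split(":")
        norms = fields[1].split(":") if len(fields) == 2 else ["1.0"]
        if len(letters) > 2 or len(norms) > len(letters):
            raise ValueError("malformed line: " + raw_line)
        if len(norms) < len(letters):
            # Assumption: one number is shared by a letter and its complement.
            norms = norms * len(letters)
        letter = letters[0]
        complement = letters[1] if len(letters) == 2 else None
        table[letter] = (complement, float(norms[0]))
        if complement is not None and complement != letter:
            table[complement] = (letter, float(norms[1]))
    return table


if __name__ == "__main__":
    example = "A:T 0.3  ; AT-rich genome\nC:G 0.2\n"
    for letter, entry in sorted(parse_alphabet(example).items()):
        print(letter, entry)
```

Case handling is left to the caller, since whether "a" and "A" are
distinct letters depends on the "-CS" option.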
5) Algorithm options

   -l: (lowercase L) This option is inherited from Consensus. Seed
      with the first sequence and proceed linearly through the list.
      This option results in a significant speed up in the program,
      but the algorithm becomes dependent on the order of the
      sequence-file names. This option corresponds to the original
      "consensus" algorithm (Stormo and Hartzell, 1989, PNAS,
      86:1183-1187; Hertz et al., 1990, CABIOS, 6:81-92).

   -iq integer: the maximum number of initial profiles saved for each
      orthologous group [default: 50]

   -cq integer: the maximum number of intermediate profiles saved at
      each cycle during profile comparison [default: 200]. This is
      the size of the queue.

   -th positive number: the threshold value for ALLR statistic --
      only consider HSPs that have a higher value than this during
      profile comparison.

   -o1: This is a recommended option. When this is used, profile
      comparison automatically considers the "mirror" profile, or the
      reverse complement. However, it is going to take longer for the
      program to finish.

6) Output options

   -pc integer: the number of matrices to print to the standard
      output at each cycle. [default: 4]
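
The "mirror" profile considered under the "-o1" option is the reverse
complement of a profile. For a DNA count matrix this amounts to
reversing the column order and swapping each row with the row of its
complementary letter. The sketch below assumes a DNA alphabet and a
matrix stored as a dictionary of rows; "mirror_profile" is a
hypothetical name, and how PhyloCon implements this internally is not
described in this README.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}  # assumed DNA alphabet


def mirror_profile(matrix):
    """Return the reverse-complement of a count matrix.

    matrix: {letter: [count at position 0, count at position 1, ...]}.
    The row of each letter becomes the row of its complement, with the
    positions read in reverse order.
    """
    return {COMPLEMENT[letter]: counts[::-1] for letter, counts in matrix.items()}


if __name__ == "__main__":
    # Count matrix for the single word "AAC"; its reverse complement is "GTT".
    profile = {"A": [1, 1, 0], "C": [0, 0, 1], "G": [0, 0, 0], "T": [0, 0, 0]}
    mirrored = mirror_profile(profile)
    for letter in "ACGT":
        print(letter, mirrored[letter])
```

Applying the function twice returns the original matrix, which is a
quick sanity check for any implementation of this step.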