Homework Assignment #3

Due in class 4/29

You are to implement a hidden Markov model for gene finding and apply it to E. coli data.

The Model

Your HMM should have the architecture shown in this figure. This model has four groups of states plus a silent start state and a silent end state. The four groups of states are intended to represent noncoding sequences, start codons, internal codons (i.e. codons in genes excluding the start and stop codons of the genes), and stop codons.

Each state in this model is able to emit only a single DNA base. Therefore, you don't need to represent emission probabilities.

The blue arrows show the transitions among the states of each group. The orange arrows show the transitions that cut across groups. The figure uses thick orange arrows as shorthand for several sets of transitions:

the start state has a transition to each of the four states in the noncoding group,
each of the noncoding states has a transition to each of the first four (i.e. leftmost) states in the start codon group,
each of the the last four (i.e. leftmost) states in the internal codons group has a transition to each of the first four states in this group,
the two last (i.e. leftmost) states in the stop codon group have transitions to all four noncoding states,
each of the noncoding states has a transition to the end state.

Training

The training data for estimating the parameters of your model can be found in http://www.biostat.wisc.edu/~craven/776/hw3/. The file sequenceOfEcoliStrainM54.txt contans the complete sequence sequence of the main laboratory strain of E. coli. When reading in this file, be sure to discard any whitespace characters.

The file train.seqs contains a list of 100 subsequences that you should use for training. Each line in this file describes a single sequence. The first and second columns list the start and end coordinates for the sequence, respectively. These values can be used to extract the actual sequence from the sequence file. For example the coordinates of the first sequence are [1, 2800]. This means that the first training sequence comprises the first 2800 characters of the E. coli sequence. Note that the index of the first character in the sequence file is 1, not 0. The subsequent columns list the coordinates of the genes contained in each sequence. Each gene is described by a pair of coordinates inside of brackets. For example, the first sequence contains two genes, one that extends from the 190th position of the genome sequence to the 255th position, and one which extends from position 337 to position 2799.

All of the sequences in train.seqs start and end with noncoding sequences and have noncoding sequences interspersed between their genes. Moreover, all of the genes are on the given strand.

You won't need to use the Forward-Backward (Baum-Welch) algorithm for training, since for each of the training sequences it is clear which state "explains" each base. Instead training will consist of counting transition frequencies in the training set and converting these into probabilities. You should use a pseudocount value = 1 for each of the transitions.

Testing

To use your model to predict genes in new sequences, you should implement the Viterbi algorithm. Given a test sequence, use the Viterbi algorithm to predict the sequence coordinates of any genes in the sequence. In particular, for each predicted gene, you should report the coordinate of the first base in the start codon and the last base in the stop codon.

You should test your model using the sequences in test.seqs. The format of this file is the same as the format of train.seqs.

What to Turn In

You should turn in the following

A listing of your program source code.
The path to your program's executable (in case we want to run it).
For each test sequence, the sequence coordinates of the first and last base in each gene you predict.
A brief description (one-paragraph) of how accurate your model was on the test sequences.