Homework Assignment #3
Due in class 4/29
You are to implement a hidden Markov model for gene finding and
apply it to E. coli data.
The Model
Your HMM should have the architecture shown in
this figure.
This model has four groups of states plus a silent start state and a silent end state.
The four groups of states are intended to represent noncoding sequences,
start codons, internal codons (i.e. codons in genes excluding the
start and stop codons of the genes), and stop codons.
Each state in this model is able to emit only a single DNA base.
Therefore, you don't need to represent emission probabilities.
The blue arrows show the transitions among the states of each group.
The orange arrows show the transitions that cut across groups.
The figure uses thick orange arrows as shorthand for several
sets of transitions:
- the start state has a transition to each of the four states
in the noncoding group,
- each of the noncoding states has a transition to
each of the first four (i.e. leftmost) states in the start codon
group,
- each of the the last four (i.e. leftmost) states in the
internal codons group has a transition to each of the
first four states in this group,
- the two last (i.e. leftmost) states in the stop codon
group have transitions to all four noncoding states,
- each of the noncoding states has a transition to
the end state.
Training
The training data for estimating the parameters of your model can be
found in http://www.biostat.wisc.edu/~craven/776/hw3/.
The file sequenceOfEcoliStrainM54.txt
contans the complete sequence sequence of the main laboratory strain
of E. coli. When reading in this file, be sure to discard any whitespace characters.
The file train.seqs
contains a list of 100 subsequences that you should use for training.
Each line in this file describes a single sequence. The first and
second columns list the start and end coordinates for the sequence,
respectively. These values can be used to extract the actual sequence
from the sequence file. For example the coordinates of the first
sequence are [1, 2800]. This means that the first training sequence
comprises the first 2800 characters of the E. coli sequence. Note
that the index of the first character in the sequence file is 1, not
0. The subsequent columns list the coordinates of the genes contained
in each sequence. Each gene is described by a pair of coordinates
inside of brackets. For example, the first sequence contains two
genes, one that extends from the 190th position of the genome sequence
to the 255th position, and one which extends from position 337 to
position 2799.
All of the sequences in train.seqs start and end with noncoding sequences
and have noncoding sequences interspersed between their genes.
Moreover, all of the genes are on the given strand.
You won't need to use the Forward-Backward (Baum-Welch) algorithm
for training, since for each of the training sequences it is
clear which state "explains" each base. Instead training will
consist of counting transition frequencies in the training set
and converting these into probabilities. You should use
a pseudocount value = 1 for each of the transitions.
Testing
To use your model to predict genes in new sequences, you should
implement the Viterbi algorithm. Given a test sequence, use the
Viterbi algorithm to predict the sequence coordinates of any genes in
the sequence. In particular, for each predicted gene, you should
report the coordinate of the first base in the start codon and the
last base in the stop codon.
You should test your model using the sequences in
test.seqs.
The format of this file is the same as the format of train.seqs.
What to Turn In
You should turn in the following
- A listing of your program source code.
- The path to your program's executable (in case we want to run it).
- For each test sequence, the sequence coordinates of the
first and last base in each gene you predict.
- A brief description (one-paragraph) of how accurate your model was on the
test sequences.