Homework Assignment #2
Due in class 4/8
You are to implement two algorithms for motif discovery:
- The EM algorithm for motif discovery described in
the Bailey and Elkan paper (and discussed in class). Your code needs
to handle only the "one occurrence per sequence" (OOPS) model, and you
can assume that it will be run only on DNA sequences. Additionally,
you do not need to worry about finding multiple motifs in a given
training set.
- The Gibbs sampling algorithm described in the Lawrence et al.
paper (and discussed in class). You can assume that your code will
be run only on DNA sequences.
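To make the two procedures concrete, here is a minimal Python sketch of one EM iteration under the OOPS model and one full Gibbs sampling sweep. All function and variable names are my own, and the uniform background and pseudocount of 1 inside `gibbs_sweep` are simplifying assumptions for the sketch, not part of the assignment spec.

```python
import random
import numpy as np

BASES = "ACGT"

def em_iteration(seqs, p, background, W):
    """One E-step/M-step pass under OOPS; p is a 4 x W probability matrix."""
    counts = np.zeros((4, W))
    for seq in seqs:
        idx = [BASES.index(c) for c in seq]
        n_starts = len(seq) - W + 1
        loglik = np.empty(n_starts)
        for j in range(n_starts):            # E-step: log P(seq | start = j)
            s = 0.0
            for k, b in enumerate(idx):
                s += np.log(p[b, k - j] if j <= k < j + W else background[b])
            loglik[j] = s
        z = np.exp(loglik - loglik.max())
        z /= z.sum()                         # posterior over start positions
        for j in range(n_starts):            # M-step: expected base counts
            for w in range(W):
                counts[idx[j + w], w] += z[j]
    counts += 1.0                            # pseudocounts so no entry is 0
    return counts / counts.sum(axis=0)

def gibbs_sweep(seqs, starts, W, rng):
    """Resample the motif start in each sequence in turn (one full sweep)."""
    bg = np.full(4, 0.25)                    # uniform background for brevity
    for i, seq in enumerate(seqs):
        counts = np.ones((4, W))             # model built from all but seq i
        for j, other in enumerate(seqs):
            if j != i:
                for w in range(W):
                    counts[BASES.index(other[starts[j] + w]), w] += 1
        p = counts / counts.sum(axis=0)
        n_starts = len(seq) - W + 1
        weights = np.empty(n_starts)
        for s in range(n_starts):            # score each candidate start
            r = 1.0
            for w in range(W):
                b = BASES.index(seq[s + w])
                r *= p[b, w] / bg[b]
            weights[s] = r
        probs = (weights / weights.sum()).tolist()
        starts[i] = rng.choices(range(n_starts), weights=probs)[0]
    return starts
```

In a real run you would iterate `em_iteration` to convergence, and repeat `gibbs_sweep` many times while tracking the best-scoring alignment seen.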
You are to run experiments in which you compare these two algorithms
on a data set of sequences that contain donor splice sites
(exon->intron boundaries). The file of training sequences is here
and the file of test sequences is here.
You should use the training sequences to discover a motif and then
scan each of the test sequences to find the most likely position of
the motif in each.
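The scanning step can be sketched as a log-odds scan: score every window of width W against the learned motif versus the background, and report the highest-scoring start. Names here are mine, and the assumption is a 4 x W `p` matrix with rows in A, C, G, T order.

```python
import numpy as np

BASES = "ACGT"

def best_start(seq, p, background):
    """Return the most probable motif start position in seq."""
    W = p.shape[1]
    best, best_score = 0, -np.inf
    for s in range(len(seq) - W + 1):
        # Log-odds of this window under the motif vs. the background.
        score = 0.0
        for w in range(W):
            b = BASES.index(seq[s + w])
            score += np.log(p[b, w]) - np.log(background[b])
        if score > best_score:
            best, best_score = s, score
    return best
```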
For your experiments, you should do the following:
- Set the motif width parameter to 12.
- Run your EM program three times, with different starting
points for each training run. For these three runs, the p matrix should be
initialized using the subsequence that starts at the 25th base in
the first, second, and third sequences in the file. Use the initialization
method we discussed in class with X = 0.5. Initialize the
background probabilities to be uniform (i.e. 0.25 for all four).
- After each training run, scan your motif through the test sequences
and identify the most probable motif starting position in each. For each
run, make a histogram showing the number of test sequences vs. predicted
starting positions.
- Run your Gibbs sampling program three times, randomly selecting a
different starting point for each training run.
- After each training run, scan your motif through the test sequences
and identify the most probable motif starting position in each. For each
run, make a histogram showing the number of test sequences vs. predicted
starting positions.
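Two pieces of the procedure above can be sketched briefly: initializing the p matrix from a seed subsequence (my reading of the in-class method, so treat it as an assumption: the seed base at each position gets probability X and the remaining 1 - X is split evenly among the other three bases), and tallying predicted start positions for the histograms. Names are mine.

```python
from collections import Counter
import numpy as np

BASES = "ACGT"

def init_p(subseq, X=0.5):
    # Seed base gets X; the other three bases share 1 - X equally.
    W = len(subseq)
    p = np.full((4, W), (1.0 - X) / 3.0)
    for w, c in enumerate(subseq):
        p[BASES.index(c), w] = X
    return p

def start_histogram(predicted_starts):
    # Number of test sequences per predicted motif start position.
    return Counter(predicted_starts)
```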
Your program should take the following as command-line arguments:
- the name of a file containing the sequences to be used as
a training set
- the name of a file containing the sequences to be used as
a test set
- a specification of the algorithm to be used, "em" or "gibbs"
- an integer value for the motif width parameter (W)
- an integer indicating the index (starting from 0) of the
training sequence to use to determine the starting point
for EM (see below)
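One way to wire up these arguments is with argparse; this is only a sketch, and the positional-argument names below are my choices, not part of the requirements.

```python
import argparse

def parse_args(argv=None):
    ap = argparse.ArgumentParser(description="Motif discovery: EM or Gibbs")
    ap.add_argument("train_file", help="file of training sequences")
    ap.add_argument("test_file", help="file of test sequences")
    ap.add_argument("algorithm", choices=["em", "gibbs"],
                    help="which algorithm to run")
    ap.add_argument("width", type=int, help="motif width parameter W")
    ap.add_argument("seed_index", type=int,
                    help="0-based index of the training sequence used to "
                         "determine the starting point for EM")
    return ap.parse_args(argv)
```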
Here are some details on implementing the algorithms:
- You should use pseudocounts to keep any probability
in the p matrix from going to 0.
- We will operationalize the notion of convergence by
saying that a method has converged when no probability in the
p matrix has changed by more than 0.01 since the previous
iteration (i.e. abs(p_new[i][j] - p_last[i][j]) < 0.01 for all i and j).
- When possible, you should
sum logarithms of probabilities instead of multiplying
probabilities directly.
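These three details can be illustrated in a few lines. The 0.01 tolerance comes from the convergence criterion above; the pseudocount value of 1 and the function names are my own choices for the sketch.

```python
import math
import numpy as np

def normalize_with_pseudocounts(counts, pseudo=1.0):
    # Adding a pseudocount to every cell keeps every p-matrix entry > 0.
    c = counts + pseudo
    return c / c.sum(axis=0)

def converged(p_new, p_last, tol=0.01):
    # Converged when no entry has changed by more than tol.
    return np.abs(p_new - p_last).max() < tol

# Why sum logs: the direct product of many small probabilities underflows.
direct = 1.0
for _ in range(100):
    direct *= 1e-5                    # underflows to 0.0 in double precision
log_sum = 100 * math.log(1e-5)        # equivalent log score stays finite
```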
You should turn in the following:
- A listing of your program source code.
- The path to your program's executable (in case we want to run it).
- For each run, the resulting values in the p matrix
and the histograms of predicted motif starting positions (in
the test sequences).
- A one- to two-paragraph write-up discussing the results of your
experiments. What can you say about the motif discovered?
How stable are the results across different starting points?
Which algorithm seemed to be better at finding a significant motif?