Homework Assignment #2
Due in class 4/8
You are to implement two algorithms for motif discovery:
- The EM algorithm for motif discovery described in
the Bailey and Elkan paper (and discussed in class). Your code needs
to handle only the "one occurrence per sequence" (OOPS) model, and you
can assume that it will be run only on DNA sequences. Additionally,
you do not need to worry about finding multiple motifs in a given
training set.
- The Gibbs sampling algorithm described in the Lawrence et al.
paper (and discussed in class). You can assume that your code will
be run only on DNA sequences.
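To make the two procedures concrete, here is a minimal Python sketch of one EM iteration under the OOPS model and one full Gibbs sampling sweep. All function and variable names are my own, and the uniform background and pseudocount of 1 inside `gibbs_sweep` are simplifying assumptions for the sketch, not part of the assignment spec.

```python
import random
import numpy as np

BASES = "ACGT"

def em_iteration(seqs, p, background, W):
    """One E-step/M-step pass under OOPS; p is a 4 x W probability matrix."""
    counts = np.zeros((4, W))
    for seq in seqs:
        idx = [BASES.index(c) for c in seq]
        n_starts = len(seq) - W + 1
        loglik = np.empty(n_starts)
        for j in range(n_starts):            # E-step: log P(seq | start = j)
            s = 0.0
            for k, b in enumerate(idx):
                s += np.log(p[b, k - j] if j <= k < j + W else background[b])
            loglik[j] = s
        z = np.exp(loglik - loglik.max())
        z /= z.sum()                         # posterior over start positions
        for j in range(n_starts):            # M-step: expected base counts
            for w in range(W):
                counts[idx[j + w], w] += z[j]
    counts += 1.0                            # pseudocounts so no entry is 0
    return counts / counts.sum(axis=0)

def gibbs_sweep(seqs, starts, W, rng):
    """Resample the motif start in each sequence in turn (one full sweep)."""
    bg = np.full(4, 0.25)                    # uniform background for brevity
    for i, seq in enumerate(seqs):
        counts = np.ones((4, W))             # model built from all but seq i
        for j, other in enumerate(seqs):
            if j != i:
                for w in range(W):
                    counts[BASES.index(other[starts[j] + w]), w] += 1
        p = counts / counts.sum(axis=0)
        n_starts = len(seq) - W + 1
        weights = np.empty(n_starts)
        for s in range(n_starts):            # score each candidate start
            r = 1.0
            for w in range(W):
                b = BASES.index(seq[s + w])
                r *= p[b, w] / bg[b]
            weights[s] = r
        probs = (weights / weights.sum()).tolist()
        starts[i] = rng.choices(range(n_starts), weights=probs)[0]
    return starts
```

In a real run you would iterate `em_iteration` to convergence, and repeat `gibbs_sweep` many times while tracking the best-scoring alignment seen.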
You are to run experiments in which you compare these two algorithms
on a data set of sequences that contain donor splice sites
(exon->intron boundaries). The file of training sequences is here
and the file of test sequences is here.
You should use the training sequences to discover a motif and then
scan each of the test sequences to find the most likely position of
the motif in each.
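The scanning step can be sketched as a log-odds scan: score every window of width W against the learned motif versus the background, and report the highest-scoring start. Names here are mine, and the assumption is a 4 x W `p` matrix with rows in A, C, G, T order.

```python
import numpy as np

BASES = "ACGT"

def best_start(seq, p, background):
    """Return the most probable motif start position in seq."""
    W = p.shape[1]
    best, best_score = 0, -np.inf
    for s in range(len(seq) - W + 1):
        # Log-odds of this window under the motif vs. the background.
        score = 0.0
        for w in range(W):
            b = BASES.index(seq[s + w])
            score += np.log(p[b, w]) - np.log(background[b])
        if score > best_score:
            best, best_score = s, score
    return best
```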
For your experiments, you should do the following:
- Set the motif width parameter to 12.
- Run your EM program three times, with different starting
points for each training run. For these three runs, the p matrix should be
initialized using the subsequence that starts at the 25th base in
the first, second, and third sequences in the file. Use the initialization
method we discussed in class with X = 0.5. Initialize the
background probabilities to be uniform (i.e. 0.25 for all four).
- After each training run, scan your motif through the test sequences
and identify the most probable motif starting position in each. For each
run, make a histogram showing the number of test sequences vs. predicted
starting positions.
- Run your Gibbs sampling program three times, randomly selecting a
different starting point for each training run.
- After each training run, scan your motif through the test sequences
and identify the most probable motif starting position in each. For each
run, make a histogram showing the number of test sequences vs. predicted
starting positions.
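Two pieces of the procedure above can be sketched briefly: initializing the p matrix from a seed subsequence (my reading of the in-class method, so treat it as an assumption: the seed base at each position gets probability X and the remaining 1 - X is split evenly among the other three bases), and tallying predicted start positions for the histograms. Names are mine.

```python
from collections import Counter
import numpy as np

BASES = "ACGT"

def init_p(subseq, X=0.5):
    # Seed base gets X; the other three bases share 1 - X equally.
    W = len(subseq)
    p = np.full((4, W), (1.0 - X) / 3.0)
    for w, c in enumerate(subseq):
        p[BASES.index(c), w] = X
    return p

def start_histogram(predicted_starts):
    # Number of test sequences per predicted motif start position.
    return Counter(predicted_starts)
```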
Your program should take the following as command-line arguments:
- the name of a file containing the sequences to be used as
a training set
- the name of a file containing the sequences to be used as
a test set
- a specification of the algorithm to be used, "em" or "gibbs"
- an integer value for the motif width parameter (W)
- an integer indicating the index (starting from 0) of the
training sequence to use to determine the starting point
for EM (see below)
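One way to wire up these arguments is with argparse; this is only a sketch, and the positional-argument names below are my choices, not part of the requirements.

```python
import argparse

def parse_args(argv=None):
    ap = argparse.ArgumentParser(description="Motif discovery: EM or Gibbs")
    ap.add_argument("train_file", help="file of training sequences")
    ap.add_argument("test_file", help="file of test sequences")
    ap.add_argument("algorithm", choices=["em", "gibbs"],
                    help="which algorithm to run")
    ap.add_argument("width", type=int, help="motif width parameter W")
    ap.add_argument("seed_index", type=int,
                    help="0-based index of the training sequence used to "
                         "determine the starting point for EM")
    return ap.parse_args(argv)
```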
Here are some details on implementing the algorithms:
- You should use pseudocounts to keep any probability
in the p matrix from going to 0.
- We will operationalize the notion of convergence by
saying that a method has converged when no probability in the
p matrix has changed by more than 0.01 since the previous
iteration (i.e. abs(p_new[i][j] - p_last[i][j]) < 0.01 for all i and j).
- When possible, you should
sum logarithms of probabilities instead of multiplying
probabilities directly.
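These three details can be illustrated in a few lines. The 0.01 tolerance comes from the convergence criterion above; the pseudocount value of 1 and the function names are my own choices for the sketch.

```python
import math
import numpy as np

def normalize_with_pseudocounts(counts, pseudo=1.0):
    # Adding a pseudocount to every cell keeps every p-matrix entry > 0.
    c = counts + pseudo
    return c / c.sum(axis=0)

def converged(p_new, p_last, tol=0.01):
    # Converged when no entry has changed by more than tol.
    return np.abs(p_new - p_last).max() < tol

# Why sum logs: the direct product of many small probabilities underflows.
direct = 1.0
for _ in range(100):
    direct *= 1e-5                    # underflows to 0.0 in double precision
log_sum = 100 * math.log(1e-5)        # equivalent log score stays finite
```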
You should turn in the following:
- A listing of your program source code.
- The path to your program's executable (in case we want to run it).
- For each run, the resulting values in the p matrix
and the histograms of predicted motif starting positions (in
the test sequences).
- A one- to two-paragraph write-up discussing the results of your
experiments. What can you say about the motif discovered?
How stable are the results across different starting points?
Which algorithm seemed to be better at finding a significant motif?