Learning Parameters for Sequence Alignment from Examples with Missing Data
Eagu Kim
Computer Science Department,
University of Arizona
Monday, May 12th, 2008
11:00 pm
3265 MSC
ABSTRACT
For as long as biologists have been computing alignments of sequences,
the question of what values to use for scoring substitutions and gaps
has persisted. In practice, substitution scores are usually chosen by
convention, and gap parameters are often found by trial and error. In
contrast, a rigorous way to determine parameter values that are
appropriate for aligning biological sequences is by solving the
problem of Inverse Parametric Sequence Alignment: given examples of
correct alignments, find parameter values that make the examples score
as close as possible to optimal alignments of their sequences. The
examples that are currently available contain regions where the
alignment is not specified, which leads to a version of the problem
with missing data.
In this talk, we present a new polynomial-time algorithm for Inverse
Parametric Sequence Alignment that is simple to implement, fast in
practice, and can learn hundreds of parameters simultaneously from
hundreds of examples with missing data. Computational results show
that best-possible values for all 212 parameters of the standard
protein sequence alignment model can be computed from 200 examples in
4 hours of computation. Experiments on benchmark biological
alignments show we can find parameters that generalize across protein
families and boost the accuracy of multiple sequence alignment by as
much as 25%. If time permits, we will also discuss how to use
predicted secondary structure to improve the accuracy of protein
sequence alignment even further.
This is joint research with John Kececioglu.
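
To make the problem statement in the abstract concrete, here is a minimal
sketch of the quantity that Inverse Parametric Sequence Alignment tries to
make small: how far an example alignment is from optimal under a given
choice of parameters. It is not the algorithm presented in the talk. The
function names, the cost-minimizing convention, and the toy model with
only two parameters (one mismatch cost and one per-column gap cost) are
all assumptions made for illustration; the protein model mentioned above
has 212 parameters.

```python
def alignment_cost(row_a, row_b, mismatch, gap):
    """Cost of a given alignment, written as two equal-length gapped rows.

    Assumed toy model: matches cost 0, mismatches cost `mismatch`,
    and every column containing a gap character '-' costs `gap`.
    """
    assert len(row_a) == len(row_b)
    cost = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            cost += gap
        elif x != y:
            cost += mismatch
    return cost


def optimal_cost(s, t, mismatch, gap):
    """Minimum alignment cost of s and t, by standard dynamic programming."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0.0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def example_error(row_a, row_b, mismatch, gap):
    """How much worse than optimal the example scores (0 means optimal)."""
    s = row_a.replace("-", "")
    t = row_b.replace("-", "")
    return alignment_cost(row_a, row_b, mismatch, gap) - optimal_cost(s, t, mismatch, gap)


# Hypothetical example alignment with two mismatched columns.
print(example_error("ACGT", "AGCT", mismatch=1.0, gap=2.0))
print(example_error("ACGT", "AGCT", mismatch=3.0, gap=1.0))
```

Under the first parameter choice the example is itself an optimal
alignment, so its error is zero; under the second, an alignment with two
gaps is cheaper, and the example no longer scores optimally. The inverse
problem searches for parameter values that keep this error small across
all examples at once. Setting every parameter to zero would make the
error trivially zero, so a real formulation needs some normalization of
the parameters; this sketch does not attempt that.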
Biography
Eagu Kim is currently a Ph.D. candidate in the Computer Science
Department at the University of Arizona, writing a dissertation on local
similarity and inverse parametric sequence alignment. His research
interests include combinatorial optimization and the design and analysis
of algorithms for multiple sequence alignment and whole-genome
alignment.