According to recent estimates, as much as 3-4% of the human genome may be involved in functions other than coding for proteins. The investigation of functional non-coding sequences is a strong focus of current genomics and bioinformatics research, with comparative approaches made possible by the availability of whole-genome alignments of an increasing number of species. In the first part of this talk, we describe a strategy that exploits training data and phylogenetic relationships to construct parsimonious encodings of multiple alignments. Once encoded, alignments are strings of symbols over a relatively small alphabet, whose characteristic patterns can be discerned using variable-order Markov models. Log-odds classification under these models forms the basis for prediction, and guides the search for parsimonious and effective encodings. Phylogenetic relationships provide a biologically principled way to frame this search, and play a crucial role because training data (experimentally verified functional elements) are scarce. Our strategy proves effective for predicting several classes of functional elements using five-way alignments of human, chimpanzee, mouse, rat and dog. In particular, prediction of cis-regulatory elements forms the basis for the five-way mammalian Regulatory Potential (RP) score, available as a track at the UCSC Human Genome Browser.
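As a rough illustration of log-odds classification on encoded alignments, the following minimal sketch fits two Markov models, one on positive and one on negative training strings, and scores a new string by the difference of its log-likelihoods. It is not the implementation behind the RP score: it uses a fixed-order model with additive smoothing in place of variable-order models, and the alphabet, training strings, and function names (fit_markov, log_odds) are made up for the example.

```python
import math
from collections import defaultdict


def fit_markov(seqs, order, alphabet, pseudo=1.0):
    """Fit a fixed-order Markov model; return a log-probability function.

    Additive smoothing (pseudo) keeps unseen transitions at nonzero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def logp(context, sym):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudo * len(alphabet)
        return math.log((c.get(sym, 0.0) + pseudo) / total)

    return logp


def log_odds(s, logp_pos, logp_neg, order):
    """Sum of per-symbol log-likelihood ratios (positive vs. negative model)."""
    return sum(
        logp_pos(s[i - order:i], s[i]) - logp_neg(s[i - order:i], s[i])
        for i in range(order, len(s))
    )


# Hypothetical encoded alignments: each letter stands for one alignment column.
alphabet = "ABC"
positives = ["ABABABAB", "BABABABA"]  # toy "functional" training strings
negatives = ["AACCAACC", "CCAACCAA"]  # toy "background" training strings

lp_pos = fit_markov(positives, order=1, alphabet=alphabet)
lp_neg = fit_markov(negatives, order=1, alphabet=alphabet)

score_like_pos = log_odds("ABABAB", lp_pos, lp_neg, order=1)
score_like_neg = log_odds("AACCAA", lp_pos, lp_neg, order=1)
```

A string is predicted as functional when its score exceeds a chosen threshold; in this toy setting, strings resembling the positive set score above zero and strings resembling the negative set score below zero.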
Because of the limited training data, even after reduction to parsimonious encodings, considering alignment patterns of non-trivial length leads to so-called large-p-small-n problems. If, instead of fitting Markov models on encoded alignments and constructing log-odds, we were to use pattern frequencies as predictors in a regression, we would typically have p = thousands of predictors and only n = hundreds of observations. Large-p-small-n regressions are ubiquitous in many different kinds of genomic analyses, and could be handled using sufficient dimension reduction (SDR) to identify a small number (d < n) of relevant linear combinations of the predictors. In the second part of this talk, we briefly introduce basic SDR concepts, and use the encoded training data for our cis-regulatory element prediction problem to demonstrate a novel iterative SDR method. Unlike standard SDR techniques, this method is suited to large-p-small-n regressions because it does not require inversion of the sample predictor covariance matrix, which is singular when p > n, and instead leverages the structure of that matrix by means of iterative powers.
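To make the idea of reducing predictors without inverting the covariance matrix concrete, here is a minimal sketch assuming NumPy. It is not the method presented in the talk, whose details are not given here; it is a generic Krylov-type construction, closely related to partial least squares, that builds d directions from repeated multiplication by the sample covariance S applied to the sample covariance u between predictors and response. The function name krylov_directions and the simulated data are hypothetical.

```python
import numpy as np


def krylov_directions(X, y, d):
    """Orthonormal basis of span{u, S u, ..., S^(d-1) u}.

    S is the sample covariance of X and u the sample covariance between
    X and y. Only products with S are used; S is never inverted.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    v = Xc.T @ yc / (n - 1)              # seed vector u
    cols = []
    for _ in range(d):
        v = v / np.linalg.norm(v)        # rescale for numerical stability
        cols.append(v)
        v = Xc.T @ (Xc @ v) / (n - 1)    # S @ v without forming the p x p matrix
    Q, _ = np.linalg.qr(np.column_stack(cols))
    return Q


# Simulated large-p-small-n regression (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
n, p, d = 50, 500, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

B = krylov_directions(X, y, d)   # p x d matrix of directions
Z = X @ B                        # n x d reduced predictors, with d < n
```

The reduced predictors Z can then be used in an ordinary regression of y, since d is much smaller than n. The centered predictor matrix has rank at most n - 1, so S cannot be inverted when p > n, which is why the sketch relies on matrix-vector products alone.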
The work on alignment encodings and RP scores is in collaboration with James Taylor, Webb Miller, Ross Hardison and others at the Center for Comparative Genomics and Bioinformatics of Penn State. The work on SDR methodology for large-p-small-n regressions is in collaboration with R. Dennis Cook (Statistics, Univ. of Minnesota) and Bing Li (Statistics, Penn State).