According to recent estimates, as much as 34% of the human genome may be involved in functions other than coding for proteins. The investigation of functional noncoding sequences is a strong focus of current genomics and bioinformatics research, with comparative approaches made possible by the availability of whole-genome alignments of an increasing number of species. In the first part of this talk, we describe a strategy that exploits training data and phylogenetic relationships to construct parsimonious encodings of multiple alignments. Once encoded, alignments are strings of symbols in a relatively small alphabet, whose characteristic patterns can be discerned using variable-order Markov models. Log-odds classification under these models forms the basis for prediction, and guides the search for parsimonious and effective encodings. Phylogenetic relationships are incorporated as a biologically principled way to frame this search, and play a crucial role because of the limited availability of training data (experimentally verified functional elements). Our strategy proves effective for predicting several classes of functional elements using five-way alignments of human, chimpanzee, mouse, rat and dog. In particular, prediction of cis-regulatory elements forms the basis for the five-way mammalian Regulatory Potential score (RP), available as a track at the UCSC Human Genome Browser.
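To make the log-odds idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the RP implementation: it uses a fixed-order Markov model with add-one smoothing in place of the variable-order models described above, and the two-letter alphabet, the toy training strings and the function names `fit_markov` and `log_odds` are all hypothetical.

```python
import math
from collections import defaultdict


def fit_markov(seqs, alphabet, order):
    """Fit a fixed-order Markov model; return a log-probability function.

    Simplification: the abstract refers to variable-order models; a single
    fixed order with add-one smoothing is used here to keep the sketch short.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    k = len(alphabet)

    def logp(context, symbol):
        ctx = counts.get(context, {})
        total = sum(ctx.values())
        # Add-one smoothing so unseen transitions keep nonzero probability.
        return math.log((ctx.get(symbol, 0) + 1) / (total + k))

    return logp


def log_odds(seq, logp_pos, logp_neg, order):
    """Sum of per-position log-odds: positive model versus negative model."""
    return sum(
        logp_pos(seq[i - order:i], seq[i]) - logp_neg(seq[i - order:i], seq[i])
        for i in range(order, len(seq))
    )


# Hypothetical encoded alignments over a toy alphabet {A, B}.
ALPHABET = "AB"
ORDER = 1
positives = ["ABABABAB", "ABABAB"]      # stand-ins for functional elements
negatives = ["AABBAABB", "BBAABBAA"]    # stand-ins for background sequence

pos_model = fit_markov(positives, ALPHABET, ORDER)
neg_model = fit_markov(negatives, ALPHABET, ORDER)

score_like_pos = log_odds("ABABAB", pos_model, neg_model, ORDER)
score_like_neg = log_odds("AABB", pos_model, neg_model, ORDER)
print(score_like_pos, score_like_neg)
```

A score above zero means the encoded string looks more like the positive training set than the negative one. In a search over encodings, classification performance of such scores would be the quantity to compare across candidate alphabets.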
Because of the limited training data, even after reduction to parsimonious encodings, considering alignment patterns of nontrivial length leads to so-called large-p-small-n problems. If instead of fitting Markov models on encoded alignments and constructing log-odds, we were to use pattern frequencies as predictors in a regression, we would typically have p = thousands of predictors and only n = hundreds of observations. Large-p-small-n regressions are ubiquitous in many different kinds of genomic analyses, and could be handled using sufficient dimension reduction (SDR) to identify a small number (d < n) of relevant linear combinations of the predictors. In the second part of this talk, we briefly introduce basic SDR concepts, and use the encoded training data for our cis-regulatory elements prediction problem to demonstrate a novel iterative SDR method. Unlike standard SDR techniques, this method is suited for large-p-small-n regressions because it does not require inversion of the sample predictor covariance matrix, and leverages its structure by means of iterative powers.
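As a rough illustration of how directions can be built from powers of the sample covariance without inverting it, the sketch below forms an orthonormal basis of the span of v, Sv, ..., S^(d-1)v, where S is the sample predictor covariance and v is the sample covariance between predictors and response. This Krylov-type construction is an assumption made for illustration and is not a transcription of the method presented in the talk; the function name `krylov_directions` and the simulated data are hypothetical.

```python
import math
import random


def krylov_directions(X, y, d):
    """Orthonormal basis of span{v, S v, ..., S^(d-1) v}.

    S (p x p) is never formed or inverted: S u is computed as
    Xc^T (Xc u) / n, where Xc is the column-centered predictor matrix.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    yc = [yi - y_mean for yi in y]

    def xt_times(u):
        # Xc^T u / n, a length-p vector
        return [sum(Xc[i][j] * u[i] for i in range(n)) / n for j in range(p)]

    def x_times(w):
        # Xc w, a length-n vector
        return [sum(a * b for a, b in zip(row, w)) for row in Xc]

    basis = []
    v = xt_times(yc)  # sample covariance between predictors and response
    for _ in range(d):
        w = v[:]
        for b in basis:  # modified Gram-Schmidt against earlier directions
            c = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:  # the Krylov sequence has stopped growing
            break
        basis.append([wi / norm for wi in w])
        v = xt_times(x_times(basis[-1]))  # apply S to the newest direction
    return basis


# Simulated large-p-small-n data: p = 50 predictors, n = 20 observations.
random.seed(0)
n, p, d = 20, 50, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [1.0, 1.0, 1.0] + [0.0] * (p - 3)
y = [sum(b * x for b, x in zip(beta, row)) + random.gauss(0, 0.1) for row in X]

basis = krylov_directions(X, y, d)
reduced = [[sum(b * x for b, x in zip(direction, row)) for direction in basis]
           for row in X]  # n x d matrix of reduced predictors
```

Each step applies S to the most recent orthonormalized direction rather than to the raw power, which spans the same subspace while keeping the vectors well scaled. The `reduced` matrix has d columns, so an ordinary regression of y on it is well posed even though p exceeds n.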
Credits:
The work on alignment encodings and RP scores is in collaboration with James Taylor, Webb Miller, Ross Hardison and others at the Center for Comparative Genomics and Bioinformatics of Penn State. The work on SDR methodology for large-p-small-n regressions is in collaboration with R. Dennis Cook (Statistics, Univ. of Minnesota) and Bing Li (Statistics, Penn State).
