### Stat 877: Statistical methods for molecular biology (Spring, 2019) ### Homework #3: QTL mapping, solutions ```{r knitr_chunk_options, include=FALSE} knitr::opts_chunk$set(results="hide", fig.width=10, warning=FALSE, message=FALSE) set.seed(84268976) ``` #### Analyses I'll do all of the calculations first, and then summarize them, and then discuss them. I load the R/qtl package and the data, as follows: ```{r load} library(qtl) hw <- read.cross("csv", "https://www.biostat.wisc.edu/~kbroman/teaching/uwstatgen", "hw3.csv") ``` I'll first perform standard interval mapping (IM) and Haley-Knott regression. ```{r scanone} hw <- calc.genoprob(hw, step=1) out.em <- scanone(hw, method="em") out.hk <- scanone(hw, method="hk") ``` I then perform a standard permutation test by each method. ```{r permutation, cache=TRUE} num_cores <- parallel::detectCores() n.perm <- 100 operm.em <- scanone(hw, method="em", n.perm=n.perm, n.cluster=num_cores) operm.hk <- scanone(hw, method="hk", n.perm=n.perm, n.cluster=num_cores) ``` I then perform stratified permutation tests. ```{r permutation_strat, cache=TRUE, depends="permutation"} nt <- ntyped(hw) strat <- as.numeric(nt > mean(nt)) operm.em.strat <- scanone(hw, method="em", n.perm=n.perm, n.cluster=num_cores, perm.strat=strat) operm.hk.strat <- scanone(hw, method="hk", n.perm=n.perm, n.cluster=num_cores, perm.strat=strat) ``` I'll also do analyses when I omit the individuals that were not genotyped. ```{r scanone_genotyped} hw_sub <- subset(hw, ind=(nt > 0)) out.em.typed <- scanone(hw_sub, method="em") out.hk.typed <- scanone(hw_sub, method="hk") ``` Finally, I'll do permutation tests using just the genotyped individuals. ```{r permutation_genotyped, cache=TRUE, depends="permutation"} operm.em.typed <- scanone(hw_sub, method="em", n.perm=n.perm, n.cluster=num_cores) operm.hk.typed <- scanone(hw_sub, method="hk", n.perm=n.perm, n.cluster=num_cores) ``` #### Summaries of the results First, a plot of the LOD curves, by each of standard interval mapping (with the EM algorithm) and Haley-Knott regression, using all individuals or just the genotyped individuals. ```{r plot_scanone} plot(out.em, out.hk, col=c("blue", "red")) plot(out.em.typed, out.hk.typed, col=c("blue", "red"), lty=2, add=TRUE) legend("topright", lwd=2, lty=c(1,1,2,2), col=rep(c("blue","red"), 2), c("EM, all", "HK, all", "EM, typed", "HK, typed")) ``` Haley-Knott regression with all individuals gives a very large LOD score on chromosome 1, and much higher LOD scores than the other methods on chromosomes 2, 4, and 5. Haley-Knott and standard IM give similar results when we use only the genotyped individuals (in fact, _at the markers_, the two methods give _exactly the same results_ in this case); on chromosome 1, the LOD score is slightly larger when we focus on the genotyped individuals than we get from standard IM with all individuals. On the other five chromosomes, standard IM and HK with only the genotyped individuals give results that are very similar to standard IM with all individuals. Let's now look at the permutation thresholds. ```{r perm_thresholds, results="markup"} alpha <- c(0.05, 0.20) thr.em <- summary(operm.em, alpha) thr.hk <- summary(operm.hk, alpha) thr.em.strat <- summary(operm.em.strat, alpha) thr.hk.strat <- summary(operm.hk.strat, alpha) thr.em.typed <- summary(operm.em.typed, alpha) thr.hk.typed <- summary(operm.hk.typed, alpha) thr <- cbind(thr.em, thr.em.strat, thr.em.typed, thr.hk, thr.hk.strat, thr.hk.typed) colnames(thr) <- paste(rep(c("em", "hk"), each=3), rep(c("unstrat", "strat", "typed"), 2)) thr ``` The unstratified and stratified permutation tests give similar thresholds for standard IM, but for Haley-Knott regression, the thresholds from the stratified permutation test are considerably higher than for the unstratified permutation test. Perhaps the biggest surprise in threshold for standard interval mapping when we use just the genotyped individuals. Haley-Knott regression with just the genotyped individuals gives a threshold that's similar to what we get when we use all individuals, but standard IM is giving a much larger 5% threshold (and you might have gotten a _very_ large threshold). Let's look at a histograms of all of those results. ```{r perm_em_typed_hist, fig.height=10, dev.args=list(pointsize=18)} par(mfrow=c(3,2)) br <- seq(0, 35, by=0.5) hist(operm.em, main="em, all", breaks=br) hist(operm.hk, main="hk, all", breaks=br) hist(operm.em.strat, main="em, strat", breaks=br) hist(operm.hk.strat, main="hk, strat", breaks=br) hist(operm.em.typed, main="em, typed", breaks=br) hist(operm.hk.typed, main="hk, typed", breaks=br) ``` Four of the six histograms look basically the same. The distribution for Haley-Knott with a stratified permutation test is uniformly larger. The distribution for EM with just the genotyped individuals looks about the same as the others on the left, but has a second mode out near 25-30. This is a common problem with standard IM when the phenotype distribution is not normal (and here, using only the genotyped individuals, it will be decidely non-normal): it has a tendency to give spuriously large LOD scores, occasionally. I'd asked you to indicate the significant QTL and 1.5-LOD scores by each approach. There's a trick to this: ```{r summaries, results="markup"} out <- cbind(out.em, out.hk, out.em, out.hk, out.em.typed, out.hk.typed, labels = paste(rep(c("em", "hk"), 3), rep(c("unstrat", "strat ", "typed "), each=2))) operm <- cbind(operm.em, operm.hk, operm.em.strat, operm.hk.strat, operm.em.typed, operm.hk.typed) summary(out, perms=operm, alpha=0.05, format="tabByChr", pvalues=TRUE) ``` Four out of the six approaches are giving the same answer. Standard IM with only the genotyped individuals gave basically the same answer, though with a much larger p-value (`r round(mean(operm.em.typed >= max(out.em.typed[,3])), 3)`). Haley-Knott regression using all individuals and with the _unstratified_ permutation test also indicated that the QTL on chr 2 and 5 were significant. The 1.5-LOD interval for the chr 1 locus is more narrow by Haley-Knott regression using all individuals: since the LOD scores are inflated, the 1.5 drop probably needs to be inflated, too. #### Discussion Haley-Knott regression gives inflated LOD scores when there is selective genotyping, as in this case. This can largely be corrected by the stratified permutation test (which will give an appropriately inflated significance threshold). It might be better to just focus on the genotyped individuals, but that is usually not an option: generally all individuals have _some_ genotype data, in the end. Standard interval mapping gives reasonable results with selectively genotyped individuals, and it doesn't make much difference whether a stratified or unstratified permutation test is used, though it would be best to use the _stratified_ permutation test, to conform to the nature of the data. Standard interval mapping can give crazy results when only the genotyped individuals are considered; it's best to use the full set of data. #### Here is my R setup ```{r sessionInfo, results="markup"} sessionInfo() ```