Cleaning the genotype data for the Attie project

1. Load the data

First we load the R/qtl package and print the version we’re using.

> library(qtl)
> qtlversion()
[1] "1.25-15"

We load the raw genotype data, which includes data on the parental strains and F₁. We drop the control mouse with a blank for “strain.”

> rawg <- read.cross("csv", "Data", "genotypes4rqtl.csv", genotypes=0:2,
+                    convertXdata=FALSE, alleles=c("B","R"))
> rawg <- subset(rawg, ind=(rawg$pheno$Strain!=""))

The mouse with missing strain also had missing sex, we clean up the sex and strain phenotypes as follows:

> rawg$pheno$Sex <- factor(as.character(rawg$pheno$Sex), levels=c("Female", "Male"))
> rawg$pheno$Strain <- factor(as.character(rawg$pheno$Strain), levels=unique(as.character(rawg$pheno$Strain)))

A quick summary:

> summary(rawg)
    F2 intercross

    No. individuals:    565

    No. phenotypes:     5
    Percent phenotyped: 100 100 100 100 100

    No. chromosomes:    21
        Autosomes:      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 un
        X chr:          X

    Total markers:      4853
    No. markers:        386 343 340 282 256 274 234 248 216 219 257 218 230 240 197 243 183 194 117 169 7
    Percent genotyped:  98.6
    Genotypes (%):      BB:40  BR:21.3  RR:38.8  not RR:0  not BB:0

Note that the chromosome "un" is a group of 7 markers that I couldn’t find in the mouse genome build 37.

2. Missing data pattern

We first look at the number of missing genotypes at each mouse and each marker.

> par(mfrow=c(2,1), las=1)
> hist(nmissing(rawg), main="No. missing genotypes by mouse",
+      xlab="No. missing genotypes", breaks=140)
> rug(nmissing(rawg), col="blue")
> hist(nmissing(rawg, what="mar"), main="No. missing genotypes by marker",
+      xlab="No. missing genotypes", breaks=50)
> rug(nmissing(rawg, what="mar"), col="blue")

There are three mice with no genotype data.

> rawg$pheno[ntyped(rawg)==0,]
                          longid  MouseNum    Sex Strain         PlatePos
265   tid28116_Mouse3377_F2_Male Mouse3377   Male     F2 SMP4_0001630:C04
311   tid28077_Mouse3338_F2_Male Mouse3338   Male     F2 SMP4_0001632:B10
498 tid28004_Mouse3265_F2_Female Mouse3265 Female     F2 SMP4_0001632:A01

I might as well drop these.

> rawg <- subset(rawg, ind=(ntyped(rawg)>0))

We should look at the histogram of the number of missing genotypes per mouse again.

> par(las=1)
> hist(nmissing(rawg), main="No. genotypes by mouse",
+      xlab="No. markers genotyped", breaks=140)
> rug(nmissing(rawg), col="blue")

There remain a number of mice with a good amount of missing data, which might suggest poor-quality DNAs or other problems with the genotyping arrays. There are 13 mice missing genotypes at more than 5% of the markers (that is, at more than 242 of the 4853 markers). There are also a number of markers with elevated rates of missing data. There are 188 markers with more than 5% missing data (that is, at more than 28 of the 562 mice). But none of these seem terrible.

3. Study X chromosome genotypes

Let’s start by looking at the X chromosome genotypes. The cross was (BTBR x B6) x (BTBR x B6), with females listed first, and so the F₁ males should all be hemizygous B6 or BTBR while the F₁ females should be heterozygous. The F₂ males should be hemizygous B6 or BTBR and the F₂ females should be heterozygous or homozygous BTBR.

We pull out the X chromosome genotypes and the “strain” and sex for each mouse.

> gx <- pull.geno(rawg, chr="X")
> dim(gx)
[1] 562 169
> strain <- rawg$pheno$Strain
> sex <- rawg$pheno$Sex

3.1. Not segregating

First, let’s look at which markers seem to segregate in the F₂. We create a table of the three possible genotypes (1, 2, 3) and look at the number of mice with the least-frequent genotype. Since there are 289 F₂ males and 262 F₂ females, all markers should have an appreciable number of mice with each of the three genotypes.

> ming <- apply(apply(gx[strain=="F2",], 2, function(a) table(factor(a, levels=1:3))), 2, min)
> table(ming)
ming
  0   1   2   3   6 120 125 126 131 132 133 134 136 138 143
138   6   2   2   1   1   2   1   5   2   3   3   1   1   1

There are 149 markers that appear to not be segregating. Let’s go ahead and omit those. We’ll create a new cross object cleang that will contain the cleaned genotypes.

> todrop <- names(ming[ming<120])
> cleang <- drop.markers(rawg, todrop)
> gx <- pull.geno(cleang, chr="X")
> dim(gx)
[1] 562  20

We have just 20 markers left on the X chromosome.

3.2. Genotypes in B6 and BTBR

Let’s first look at the B6 mice. There are 3 such; while 2 are male and 1 is female, they should all be homozygous for the same allele.

> all(apply(gx[strain=="B6",], 2, function(a) length(unique(a[!is.na(a)]))) == 1)
[1] TRUE
> b6xg <- apply(gx[strain=="B6",], 2, function(a) unique(a[!is.na(a)]))

All of the B6 mice are homozygous for the same allele, on the X chromosome.

We repeat this for the 4 BTBR mice, of which 2 are male and 2 are female.

> all(apply(gx[strain=="BTBR",], 2, function(a) length(unique(a[!is.na(a)])) == 1))
[1] TRUE
> btbrxg <- apply(gx[strain=="BTBR",], 2, function(a) unique(a[!is.na(a)]))

All of the BTBR mice are homozygous for the same allele, on the X chromosome.

We further check that the B6 and BTBR genotypes are really homozygous and for different alleles at these 20 markers.

> all((b6xg == 1 & btbrxg==3) | (b6xg == 3 & btbrxg==1))
[1] TRUE

Let’s swap the genotype codes for the markers where the B6 mice have the 3 genotype (to make it so that 1 is homozygous B6 and 3 is homozygous BTBR).

> toswap <- names(b6xg)[b6xg==3]
> for(mar in toswap)
+   cleang$geno[["X"]]$data[,mar] <- 4 - cleang$geno[["X"]]$data[,mar]
> gx <- pull.geno(cleang, chr="X")

3.3. Genotypes in F₁

Let’s look at the 4 F₁ mice, of which 2 are male and 2 are female. The males should all by homozygous BTBR and the females should all by heterozygous.

> f1male.gx <- gx[strain=="F1" & sex=="Male",]
> f1female.gx <- gx[strain=="F1" & sex=="Female",]
> all(is.na(f1male.gx) | f1male.gx==3)
[1] TRUE
> all(is.na(f1female.gx) | f1female.gx==2)
[1] TRUE

Everything looks good.

3.4. Genotypes in F₂ males

We now turn to the F₂ males. They should all be hemizgous B6 (1) or BTBR (3).

> f2male.gx <- gx[strain=="F2" & sex=="Male",]
> sum(!is.na(f2male.gx) & f2male.gx==2)
[1] 171

There are 171 heterozygous genotypes in the 289 male mice. Let’s look at the males that have heterozygous calls.

> wh <- apply(f2male.gx, 1, function(a) any(!is.na(a) & a==2))
> sum(wh)
[1] 18
> tab <- apply(f2male.gx[wh,], 1, function(a) table(factor(a[!is.na(a)], levels=1:3)))
> tab[,order(tab[2,])]
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
1    7    5   13   15    0    0    0    0    0     0     0     0     0     0     0     0     0     0
2    1    1    1    1    4    4    5    7    8    10    11    13    14    15    16    20    20    20
3   12   13    5    3   15   16   14   13   12     9     9     7     6     5     4     0     0     0

The first 4 of these look to be male but with a single genotyping error. The other 14 look to be really female. Let’s save the IDs for these mice.

> malef2id <- as.character(cleang$pheno$MouseNum)[strain=="F2" & sex=="Male"]
> notmale <- malef2id[apply(f2male.gx, 1, function(a) sum(!is.na(a) & a==2)>1)]

Note that there are additional male mice that might really be females; if all of their genotypes on the X chromosome are homozygous/hemizygous BTBR, we can’t tell.

> sum(apply(f2male.gx, 1, function(a) all(is.na(a) | a==3)))
[1] 50

3.5. Genotypes in F₂ females

We turn to the F₂ females. They should all be heterozygous (2) or homozygous BTBR (3).

> f2female.gx <- gx[strain=="F2" & sex=="Female",]
> sum(!is.na(f2female.gx) & f2female.gx==1)
[1] 270

There are 270 homozygous B6 genotypes in the 262 female mice. Let’s look at the females that have homozygous B6 calls.

> wh <- apply(f2female.gx, 1, function(a) any(!is.na(a) & a==1))
> sum(wh)
[1] 21
> tab <- apply(f2female.gx[wh,], 1, function(a) table(factor(a[!is.na(a)], levels=1:3)))
> tab[,order(tab[1,])]
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
1    5    5    5    7    8    9   10   11   11    13    13    13    15    16    16    16    18    19    20    20    20
2    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0
3   15   15   15   13    8   10   10    9    9     7     7     7     5     4     4     4     2     0     0     0     0

These all look to be clearly males. Let’s save the IDs for these mice.

> femalef2id <- as.character(cleang$pheno$MouseNum)[strain=="F2" & sex=="Female"]
> notfemale <- femalef2id[apply(f2female.gx, 1, function(a) sum(!is.na(a) & a==1)>1)]

Note that there are additional female mice that might really be males; if all of their genotypes on the X chromosome are homozygous/hemizygous BTBR, we can’t tell.

> sum(apply(f2female.gx, 1, function(a) all(is.na(a) | a==3)))
[1] 53

3.6. Swap sexes

Let’s go ahead and swap the sexes of these mice whose X chromosome genotypes indicate they are the opposite sex. We’ll change the phenotype Sex to SexID and make a new “phenotype” Sex.

> names(cleang$pheno)[names(cleang$pheno)=="Sex"] <- "SexID"
> cleang$pheno$Sex <- cleang$pheno$SexID
> cleang$pheno$Sex[match(notmale, cleang$pheno$MouseNum)] <- "Female"
> cleang$pheno$Sex[match(notfemale, cleang$pheno$MouseNum)] <- "Male"
> table(cleang$pheno$SexID, cleang$pheno$Sex)
         Female Male
  Female    246   21
  Male       14  281

I’ll re-create my sex vector to match the new values.

> sex <- cleang$pheno$Sex

3.7. Re-code genotypes

Now let’s recode the X chromosome genotypes. F₂ females should have genotype codes 1 for homozygous BTBR and 2 for heterozygous. F₂ males should have genotype codes 1 for homozygous B6 and 2 for homozygous BTBR.

First, we handle the males.

> gx <- pull.geno(cleang, chr="X")
> whmale <- strain=="F2" & sex=="Male"
> gxmale <- gx[whmale,]
> sum(!is.na(gxmale) & gxmale==2)
[1] 4
> gxmale[!is.na(gxmale) & gxmale==2] <- NA
> gxmale[!is.na(gxmale) & gxmale==3] <- 2
> gx[whmale,] <- gxmale

Now, we handle the females.

> whfemale <- strain=="F2" & sex=="Female"
> gxfemale <- gx[whfemale,]
> sum(!is.na(gxfemale) & gxfemale==1)
[1] 0
> gxfemale[!is.na(gxfemale) & gxfemale==1] <- NA
> gxfemale[!is.na(gxfemale) & gxfemale==3] <- 1
> gx[whfemale,] <- gxfemale

Now we paste the genotypes back into the cross object.

> cleang$geno[["X"]]$data <- gx

Let’s further add a pgm phenotype, indicating cross direction, which will be 1 for all F₂ mice.

> cleang$pheno$pgm <- rep(NA, nind(cleang))
> cleang$pheno$pgm[strain=="F2"] <- 1

The summary of the cross (with just the F₂ mice) should be clean. (But then, the warnings don’t get printed here anyway!)

> summary(subset(cleang, ind=(strain=="F2")))
    F2 intercross

    No. individuals:    551

    No. phenotypes:     7
    Percent phenotyped: 100 100 100 100 100 100 100

    No. chromosomes:    21
        Autosomes:      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 un
        X chr:          X

    Total markers:      4704
    No. markers:        386 343 340 282 256 274 234 248 216 219 257 218 230 240 197 243 183 194 117 20 7
    Percent genotyped:  99.1
    Genotypes (%):      BB:39.6  BR:22.2  RR:38.3  not RR:0  not BB:0

3.8. Segregation patterns

Let’s look at the segregation of the genotypes on the X chromosome. First, let’s look at the output of geno.table. What we’re looking for here are departures from 1:1 segregation that might indicate markers with high rates of genotyping errors.

> geno.table(subset(cleang, ind=(strain=="F2")), chr="X")
           chr missing BB  BR  RR not.RR not.BB  BY  RY P.value
rs13483704   X       2  0 120 134      0      0 150 145  0.6517
rs13483715   X       1  0 125 130      0      0 149 146  0.9377
rs13483716   X       3  0 125 128      0      0 150 145  0.9416
rs13483718   X       0  0 126 129      0      0 155 141  0.7056
rs13483737   X       1  0 131 123      0      0 151 145  0.8296
rs13483738   X       1  0 131 124      0      0 149 146  0.8946
rs13483741   X       3  0 131 123      0      0 148 146  0.8757
rs13483746   X       1  0 134 121      0      0 147 148  0.7167
rs13483750   X      19  0 132 120      0      0 146 134  0.5811
rs13483765   X       2  0 131 123      0      0 154 141  0.6620
rs13483766   X       7  0 132 122      0      0 152 138  0.5858
rs13483767   X       1  0 132 122      0      0 154 142  0.6440
rs13483770   X       1  0 134 121      0      0 148 147  0.7167
rs13484015   X      12  0 138 112      0      0 151 138  0.1931
rs13484018   X      25  0 141 107      0      0 144 134  0.0812
rs13459160   X       2  0 133 122      0      0 152 142  0.6654
rs3715531    X       3  0 133 122      0      0 150 143  0.7255
rs13484105   X       6  0 133 121      0      0 152 139  0.5634
rs13459161   X       3  0 131 124      0      0 147 146  0.9068
rs13484113   X       1  0 136 119      0      0 156 139  0.3477

Everything looks okay. Let’s make a couple of plots, of genotype frequencies along the X chromosome for each sex.

> par(mfrow=c(2,1), las=1)
> f2male <- subset(cleang, ind=(strain=="F2" & sex=="Male"))
> plot(geno.table(f2male, chr="X", scanone=TRUE), lod=8:9,
+      main="X chr genotype frequencies; F2 males",
+      ylab="Genotype frequency", col=c("blue","red"),
+      ylim=c(0.3, 0.7))
> abline(h=0.5, lty=2, col="gray")
> legend("topleft", lwd=2, col=c("blue","red"),
+        legend=c("hem B6", "hem BTBR"))
> f2female <- subset(cleang, ind=(strain=="F2" & sex=="Female"))
> plot(geno.table(f2female, chr="X", scanone=TRUE), lod=4:5,
+      main="X chr genotype frequencies; F2 females",
+      ylab="Genotype frequency", col=c("green", "red"),
+      ylim=c(0.3, 0.7))
> abline(h=0.5, lty=2, col="gray")
> legend("topleft", lwd=2, col=c("green","red"),
+        legend=c("het", "hom BTBR"))

The departures from 1:1 segregation seem not too worrisome.

3.9. Counts of crossovers

Let’s look at the number of crossovers in the F₂ mice on the X chromosome.

> nxo <- countXO(subset(cleang, ind=(strain=="F2")), chr="X")
> table(nxo)
nxo
  0   1   2   3   4   6
216 261  65   3   3   3

There are 9 mice with >2 crossovers. Let’s plot the genotype data for these mice. We run calc.errorlod to avoid a warning message. (The genotyping error LOD scores are used in plot.geno to flag potential genotyping errors.)

> temp <- subset(cleang, ind=(strain=="F2"))
> temp <- subset(temp, ind=(nxo>2), chr="X")
> temp <- calc.errorlod(temp, error.prob=0.002, map.function="c-f")
> plot.geno(temp, chr="X")

These look unusual, but there are no obvious problems.

3.10. Re-estimate map

Finally, let’s re-estimate the genetic map for the X chromosome and see how it differs from the map from Cox et al. (Genetics, 182:1335-1344, 2009). I’ll use a genotype error rate of 2/1000, and the Carter-Falconer map function (which corresponds, approximately, to the high level of crossover interference in the mouse).

> f2g <- subset(cleang, ind=(strain=="F2"), chr="X")
> newmap <- est.map(f2g, err=0.002, map.function="c-f")

Let’s look at summaries of the original and the newly estimated maps.

> summary.map(f2g)[1,,drop=FALSE]
      n.mar      length ave.spacing max.spacing
       20.0        75.5         4.0        23.9
> summary.map(newmap)[1,,drop=FALSE]
      n.mar      length ave.spacing max.spacing
       20.0        77.7         4.1        24.0

Overall, the map hasn’t changed much.

Let’s plot the map next to the one embedded in the cross (which corresponds to that of Cox et al. 2009). The map on the bottom is the newly estimated one

> plot.map(f2g, newmap, horizontal=TRUE)

It sure looks similar. It might be better to study the individual interval lengths.

> dom <- diff(pull.map(f2g)[[1]])
> dnm <- diff(newmap[[1]])
> par(las=1)
> plot(dom, dnm-dom, type="n", xlab="Original interval length (cM)",
+      ylab="Change in interval length (cM)")
> abline(h=0, lty=2, col="gray")
> text(dom, dnm-dom, seq(along=dnm))

Note that the numbers indicate the intervals. Some of the smaller intervals have changed a great deal in length, but overall it looks pretty good.

4. Study autosomal genotypes

We now turn to the autosomal genotypes.

> ga <- pull.geno(cleang, chr="-X")
> dim(ga)
[1]  562 4684

4.1. Not segregating

First, let’s look at which markers seem to segregate in the F₂. We create a table of the three possible genotypes (1, 2, 3) and look at the proportion of mice with the least-frequent genotype. Since there are 4684 markers, we’ll look at a histogram.

> ming <- apply(apply(ga[strain=="F2",], 2, function(a) table(factor(a, levels=1:3))/sum(!is.na(a))), 2, min)
> par(las=1)
> hist(ming, breaks=seq(0, ceiling(max(ming)*100)/100, by=0.005),
+      xlab="Frequency of least-frequent genotype", main="")
> rug(ming, col="blue")

Note the big gap between 4% and 17%. There are 2640 markers that appear not to be segregating, as the least-frequent genotype is present in <10% of mice. There are 2044 markers that do appear to be segregating, as the least-frequent genotype is present in >10% of mice.

Let’s go ahead and omit the non-segregating markers.

> todrop <- names(ming[ming<0.1])
> cleang <- drop.markers(cleang, todrop)
> ga <- pull.geno(cleang, chr="-X")
> dim(ga)
[1]  562 2044

4.2. Genotypes in B6 and BTBR

Let’s look at the B6 mice. There are 3 such; they should all be homozygous for the same allele.

> bad.b6 <- apply(ga[strain=="B6",], 2, function(a) (length(unique(a[!is.na(a)]))) != 1 || all(is.na(a) | a==2))
> sum(bad.b6)
[1] 8
> ga[strain=="B6",bad.b6]
     rs6163246 rs13478222 rs13478387 rs4226443 rs6188470 rs8259478 rs13481107 rs6358703
[1,]         3          2          2         2         3         2          2         2
[2,]         2         NA          3         3         2        NA          1         3
[3,]        NA         NA          3         3         2         3          1         3

There are 8 markers with heterozygous calls, including one with no homozygous calls.

Let’s grab the inferred genotype (1, 3, or NA) for B6 mice.

> b6ga <- apply(ga[strain=="B6",], 2, function(a) if(all(is.na(a) | a==2)) return(NA) else return(unique(a[!is.na(a) & a!=2])))
> sum(is.na(b6ga))
[1] 1
> table(b6ga)
b6ga
   1    3
1036 1007

We repeat this for the 4 BTBR mice.

> bad.btbr <- apply(ga[strain=="BTBR",], 2, function(a) (length(unique(a[!is.na(a)]))) != 1 || all(is.na(a) | a==2))
> sum(bad.btbr)
[1] 7
> ga[strain=="BTBR",bad.btbr]
     rs3022901 rs6183172 rs6173859 rs8259478 rs13481105 rs6302629 rs13483472
[1,]         1         1         1         1          1         3          2
[2,]         1         1         1         1          1         3          3
[3,]         1         1         1         1          1         3          3
[4,]         2         2         2         2          2         2          3

There are 7 markers with heterozygous calls.

Again, we pull out the inferred genotype for BTBR mice.

> btbrga <- apply(ga[strain=="BTBR",], 2, function(a) if(all(is.na(a) | a==2)) return(NA) else return(unique(a[!is.na(a) & a!=2])))
> sum(is.na(btbrga))
[1] 0
> table(btbrga)
btbrga
   1    3
1007 1037

We further check that the B6 and BTBR genotypes are homozygous for different alleles at these 2044 markers.

> all(is.na(b6ga) | (b6ga==1 & btbrga==3) | (b6ga==3 & btbrga==1))
[1] TRUE

Let’s swap the genotype codes for the markers where the BTBR mice have the 1 genotype (to make it so that 1 is homozygous B6 and 3 is homozygous BTBR).

> toswap <- names(btbrga)[btbrga==1]
> for(chr in names(cleang$geno)[-20]) {
+   thisswap <- toswap[!is.na(match(toswap, markernames(cleang, chr=chr)))]
+   cleang$geno[[chr]]$data[,thisswap] <- 4 - cleang$geno[[chr]]$data[,thisswap]
+ }
> ga <- pull.geno(cleang, chr="-X")

4.3. Genotypes in F₁

Let’s look at the 4 F₁ mice. They should all be heterozygous.

> f1ga <- ga[strain=="F1",]
> sum(!is.na(f1ga) & f1ga != 2)
[1] 18
> f1ga[,apply(f1ga, 2, function(a) !all(is.na(a) | a==2))]
     rs13476089 rs13476669 rs4223511 rs6183172 rs13477660 rs6355453 rs13478004 rs13478983 rs13479059 rs13479197 rs13479440
[1,]          1          1         2         2          3         2          2          2          2          2          2
[2,]          2          2         1         2          2         3         NA          2          3          1          2
[3,]          2          2         2         1          2         2          2          3          2          2          2
[4,]          2          2         2         2          2         2          3          2          2          2          3
     rs13480554 rs3023809 rs13481105 rs3698807 rs13481939 rs13483373 rs3090325
[1,]          2         1          2         2          2          1         1
[2,]          3         2          2         3          1          2        NA
[3,]          2         2          2         2          2          2         2
[4,]          2         2          1         2          2          2         2

There are 18 markers with a homozygous F₁ mouse, but at each of these markers there is just one such mouse.

4.4. Segregation in F₂

We now turn to the segregation patterns of these autosomal markers in the F₂ mice. As in our study of the X-linked markers, we are looking for markers whose aberrant segregation indicates a high genotyping error rate. We use geno.table to inspect the segregation patterns; the p-values are for tests of 1:2:1 segregation. We do the calculation for all markers, and then pull out the rows that show significant departure at a 5% level, after a Bonferroni correction for the multiple tests.

> gt <- geno.table(subset(cleang, chr="-X", ind=(strain=="F2")))
> bad <- gt$P.value < 0.05/nrow(gt)
> gt[bad,]
           chr missing  BB  BR  RR not.RR not.BB  P.value
rs3689700    3      29 196 197 129      0      0 2.82e-11
rs13477546   4      46 120 212 173      0      0 5.80e-06
rs13478004   4      73 116 196 166      0      0 2.34e-06
rs13459096   6       0 219 228 104      0      0 1.05e-14
rs13478602   6       0 219 227 105      0      0 1.12e-14
rs6172481    6       0 220 226 105      0      0 5.17e-15
rs13478606   6       0 221 225 105      0      0 2.37e-15
rs13478610   6       1 224 230  96      0      0 7.32e-17
rs13478667   6       1 197 243 110      0      0 2.55e-08
rs13481814  13      46 174 207 124      0      0 1.95e-06

The markers on chromosome 6 show real segregation distortion rather than genotyping errors, due to the study design (that’s where leptin sits). The other 4 markers are probably error-prone. (The two markers on chromosome 4 are not near each other.) It’s probably best to remove them.

> cleang <- drop.markers(cleang, rownames(gt)[bad & gt$chr != 6])

Let’s look at plots of the segregation patterns. First, we re-run geno.table with the argument scanone=TRUE (more formally, scanone.output=TRUE), so that the output is as that from a genome scan.

> gt <- geno.table(subset(cleang, chr="-X", ind=(strain=="F2")), scanone=TRUE)

Now let’s plot the -log (base 10) p-values and the genotype frequencies, for each chromosome. For the p-values, I’ll use a different y-axis for chromosome 6 than for the others.

This is a very long set of figures; click here to jump to below them.

> par(mfrow=c(20,2))
> ym1a <- max(gt$neglog10P)
> ym1b <- max(gt$neglog10P[gt$chr != 6])
> ym2 <- max(gt$BR)
> for(i in c(1:19,"un")){
+       plot(gt, chr=i, main=paste("Chr", i), ylim=c(0,ifelse(i==6,ym1a,ym1b)))
+       plot(gt, chr=i, lod=3:5, main=paste("Chr", i), ylab="Genotype frequency",
+            ylim=c(0, ym2), col=c("green","blue","red"))
+       legend("bottomleft", lwd=2, col=c("green","blue","red"), c("BB","BR","RR"))
+ }

(Click here to jump back to the top of those figures.)

There are lots of little spikes in these figures (e.g., chromosome 13 at 46.4 cM): single markers with somewhat aberrant segregation patterns relative to the surrounding markers. These likely correspond to a small proportion of genotyping errors.

The bad marker on chromosome 13 is quite clearly indicated to be due to a large batch of double-crossovers (indicating likely genotyping errors). To show that, we look at two-locus genotype tables for the problem marker and the adjacent markers to the left and right.

> f2g <- subset(cleang, ind=(strain=="F2"))
> themar <- rownames(gt)[gt$chr==13 & gt[,3]>4]
> c13mar <- markernames(cleang, chr=13)
> themarnum <- which(c13mar == themar)
> c13mar[themarnum+c(-1,0,1)]
[1] "rs6322774"  "rs13481939" "rs6288319"
> geno.crosstab(f2g, c13mar[themarnum+c(-1,1)])
         rs6288319
rs6322774 -  BB  BR  RR
       -  0   1   0   0
       BB 1 135   0   0
       BR 0   0 264   1
       RR 0   0   0 149
> geno.crosstab(f2g, c13mar[themarnum+c(-1,0)])
         rs13481939
rs6322774  -  BB  BR  RR
       -   0   1   0   0
       BB  1 134   1   0
       BR 22  31 212   0
       RR  9   0   1 139
> geno.crosstab(f2g, c13mar[themarnum+c(0,1)])
          rs6288319
rs13481939 -  BB  BR  RR
        -  0   1  22   9
        BB 1 134  30   1
        BR 0   1 212   1
        RR 0   0   0 139

As seen in the above tables, the markers to the left and right of rs13481939 show just one recombinant, but there are about 30 double-recombination events. This marker appears to have a genotyping error rate of about 5%. It is interesting to note that it also has 5% missing data. Probably I should look at the quality scores for this and other markers, and perhaps also the intensities from the genotyping arrays.

For now, it is probably okay to leave this and similar markers in the data, as if we run calc.genoprob with error.prob=0.002, the errors will be smoothed over. We can check that, for this marker, by looking at the two-locus genotype tables after imputing in the genotypes with fill.geno.

> f2gi <- fill.geno(f2g, error.prob=0.002, map.function="c-f", method="argmax")
> geno.crosstab(f2gi, c13mar[themarnum+c(-1,1)])
         rs6288319
rs6322774 -  BB  BR  RR
       -  0   0   0   0
       BB 0 137   0   0
       BR 0   0 264   1
       RR 0   0   0 149
> geno.crosstab(f2gi, c13mar[themarnum+c(-1,0)])
         rs13481939
rs6322774 -  BB  BR  RR
       -  0   0   0   0
       BB 0 137   0   0
       BR 0   0 265   0
       RR 0   0   0 149
> geno.crosstab(f2gi, c13mar[themarnum+c(0,1)])
          rs6288319
rs13481939 -  BB  BR  RR
        -  0   0   0   0
        BB 0 137   0   0
        BR 0   0 264   1
        RR 0   0   0 149

As seen in these tables, in the imputed data, the genotypes that led to double-crossovers were deemed to be errors.

4.5. Linkage to other chromosomes

We next study whether the markers are all linked to their respective chromosomes and not to other chromosomes. We do so by first considering each pair of markers and calculating the estimated recombination fractions and LOD scores indicating evidence for linkage. This is accomplished with est.rf. The calculation can take quite a long time, since with have a total of 2040 autosomal markers (including the 3 markers on chromosome "un"). We use pull.rf to pull out matrices of the LOD scores and estimated recombination fractions.

> temp <- est.rf(f2g)
> lod <- pull.rf(temp, "lod")
> rf <- pull.rf(temp, "rf")

We now calculate the maximum LOD score between a marker another on its chromosome, and then the maximum LOD score between a marker and another on some other chromosome.

> start <- cumsum(c(1, nmar(f2g)))
> end <- cumsum(nmar(f2g))
> maxlodsame <- maxloddiff <- rep(NA, sum(nmar(f2g)))
> names(maxlodsame) <- names(maxloddiff) <- markernames(f2g)
> for(i in seq(along=end)) {
+   maxlodsame[start[i]:end[i]] <- apply(lod[start[i]:end[i],start[i]:end[i]],1,max, na.rm=TRUE)
+   maxloddiff[start[i]:end[i]] <- apply(lod[start[i]:end[i],-c(start[i]:end[i],start[21]:end[21])],1,max, na.rm=TRUE)
+ }
> max(maxloddiff[-(start[21]:end[21])])
[1] 3.94

For unlinked marker pairs (ignoring the markers on chromosome "un"), the biggest LOD score is 3.94. So everything looks clean.

The three markers on the "un" chromosome map quite cleanly to chromosomes 7 and 8, but since we don’t have reliable build 37 positions for them, it’s probably best to just ignore them.

Let me clarify that a bit. Let’s find, for each of the markers on chromosome "un", the marker with the strongest evidence for linkage.

> unmar <- markernames(f2g, chr="un")
> linkedmar <- unmar
> for(i in 1:3)
+   linkedmar[i] <- colnames(lod)[!is.na(lod[,end[20]+i]) & lod[,end[20]+i]==max(lod[,end[20]+i], na.rm=TRUE)]
> find.markerpos(f2g, linkedmar)
           chr  pos
rs13479902   8 43.1
rs13479254   7 33.1
rs13479825   8 39.6

Now let’s plot the LOD score for each of those markers against each other marker in the genome.

> unmar <- markernames(f2g, chr="un")
> par(mfrow=c(3,1))
> for(i in unmar)
+   qtl:::plot.rfmatrix(lod, i, chr="-un", ylim=c(0,260), main=i)

4.6. Counts of crossovers

Let’s look again at the observed number of crossovers in each individual. We’ll first consider just the autosomes (and ignore the "un" chromosome). These will be greatly affected by apparent genotyping errors, and so we’ll do this both with the “raw” data as well as with imputed genotypes.

> nxo.raw <- countXO(f2g, chr=c("-X", "-un"))
> nxo.clean <- countXO(fill.geno(f2g, method="argmax", error.prob=0.002, map.function="c-f"),
+                      chr=c("-X", "-un"))
> sort(nxo.clean, decreasing=TRUE)[1:10]
235 312 419 512 356 132 245 298  43 242
376 347 342 334 327 320  52  39  38  38

There are a handful of individuals with a large number of crossovers. Let’s look at scatter plots of the raw counts against those from the imputed data.

> par(mfrow=c(1,2), las=1)
> plot(nxo.clean, nxo.raw, xlab="No. crossovers (imputed)",
+      ylab="No. crossovers (raw data)")
> plot(nxo.clean, nxo.raw, xlab="No. crossovers (imputed)",
+      ylab="No. crossovers (raw data)", xlim=c(0,60), ylim=c(0,340))

4.7. Apparent genotyping errors

Let’s look more closely at the apparent genotyping errors. We’ll look at all of the markers except those in the "un" group, and we’ll compare the observed genotypes to the inferred genotypes allowing a small amount of genotyping error (using fill.geno with method="argmax").

> gi <- pull.geno(f2gi, chr="-un")
> g <- pull.geno(f2g, chr="-un")
> rownames(g) <- rownames(gi) <- as.character(f2g$pheno$MouseNum)

Now let’s count the number of apparent errors (that is, where the observed genotype doesn’t match the imputed genotype) by marker and by mouse.

> err.by.mar <- apply((!is.na(g) & gi != g), 2, function(a) sum(a))
> err.by.mar.prop <- err.by.mar/nind(f2g)
> err.by.ind <- apply((!is.na(g) & gi != g), 1, function(a) sum(a))
> tot.mar <- totmar(subset(f2g, chr="-un"))
> err.by.ind.prop <- err.by.ind/tot.mar

Now let’s plot histograms of the proportion of errors, by marker and by mouse.

> par(las=1, mfrow=c(2,1))
> hist(err.by.mar.prop, breaks=seq(-0.25, max(err.by.mar)+ 0.25, by=0.5)/nind(f2g),
+      main="Errors by marker", xlab="Proportion of apparent genotyping errors")
> rug(err.by.mar.prop, col="blue")
> u <- par("usr")
> text(mean(u[1:2]), mean(u[3:4]), paste(sum(err.by.mar.prop > 0.02), "markers with > 2% errors"), col="red")
> text(mean(u[1:2]), mean(u[3:4])*0.7, paste(sum(err.by.mar.prop < 0.01), "markers with < 1% errors"), col="blue")
> hist(err.by.ind.prop, breaks=seq(-1, max(err.by.ind)+ 1, by=2)/tot.mar,
+      main="Errors by mouse", xlab="Proportion of apparent genotyping errors")
> rug(err.by.ind.prop, col="blue")
> u <- par("usr")
> text(mean(u[1:2]), mean(u[3:4]), paste(sum(err.by.ind.prop > 0.05), "mice with > 5% errors"), col="red")

Let’s now look at the proportion of genotyping errors by marker, after we omit the 7 mice with error rates > 5%.

> err.by.mar.dropbadmice <- apply((!is.na(g) & gi != g)[err.by.ind.prop < 0.05,], 2, function(a) sum(a))
> err.by.mar.dropbadmice.prop <- err.by.mar.dropbadmice/sum(err.by.ind.prop < 0.05)
> par(las=1)
> hist(err.by.mar.dropbadmice.prop, breaks=seq(-0.25, max(err.by.mar.dropbadmice)+ 0.25, by=0.5)/sum(err.by.ind.prop < 0.05),
+      main="Errors by marker, dropping bad mice", xlab="Proportion of apparent genotyping errors")
> rug(err.by.mar.dropbadmice.prop, col="blue")
> u <- par("usr")
> text(mean(u[1:2]), mean(u[3:4]), paste(sum(err.by.mar.dropbadmice.prop > 0.02), "markers with > 2% errors"), col="red")
> text(mean(u[1:2]), mean(u[3:4])*0.7, paste(sum(err.by.mar.dropbadmice.prop < 0.01), "markers with < 1% errors"), col="blue")

5. Study the raw genotype array information

In the last section above, we saw that there are a number of mice with very high error rates, and also a number of markers with very high error rates.

In order to figure out what’s going on with these markers and mice, it seemed best to look at the raw genotype information. The big genotype data file from Rosetta includes the allele intensities for each mouse at each marker on the Affymetrix genotyping arrays, as well as a quality score (which itself is of dubious quality).

Via a Perl script, I pulled out just the 2060 segregating markers for all mice (except the control mouse). At each marker, I grabbed the intensities for the two alleles, and also all of the other quality-related columns. The resulting file is still quite large (about 160 Mb) and takes a while to load (about 2 min). After loading the file, we’ll split it into separate data frames, one for each marker.

> allgeno <- read.csv("Data/detailed_genotypes.csv", as.is=TRUE)
> allgeno <- split(allgeno, allgeno$snp)

Let’s now sort the markers and mice by their position in cleang.

> allgeno <- allgeno[markernames(cleang)]
> allgeno <- lapply(allgeno, function(a,b) a[match(b,a$Sample.Name),], cleang$pheno$longid)

Since these data include not just the F₂ mice but also the F₁, B6 and BTBR mice, let’s get imputed genotypes for all mice.

> cleangi <- fill.geno(cleang, err=0.002, map.function="c-f", method="argmax")
> cg <- pull.geno(cleang, chr="-un")
> cgi <- pull.geno(cleangi, chr="-un")

5.1. The bad marker on chromosome 13

Let’s first look at that badly behaved marker on chromosome 13, rs13481939, and the adjacent markers to the left and right.

> mnam <- markernames(cleang)
> wh <- which(mnam == themar)
> c13mar <- mnam[wh+c(-1,0,1)]
> par(mfrow=c(3,1), las=1)
> for(i in c13mar) {
+   plot(A2.Normalized.Signal ~ A1.Normalized.Signal, data=allgeno[[i]],
+        main=i, xlab="Allele 1 intensity", ylab="Allele 2 intensity",
+        col=c("blue","red","green","","","orange")[allgeno[[i]]$Measured.Genotype+1],
+        lwd=2)
+   legend("topright", pt.lwd=2, pch=c(1,1,1,1,16),
+   col=c("blue","red","green","orange", "hotpink"),
+          legend=c("11","12","22","missing","error"))
+   points(A2.Normalized.Signal ~ A1.Normalized.Signal,
+          data=allgeno[[i]], col="hotpink", pch=16,
+          subset=!is.na(cg[,i]) & cg[,i] != cgi[,i])
+ }

It’s clear, from this, why marker rs13481939 is badly behaved. The two adjacent markers show clear separation of the three clusters of genotypes, while for rs13481939, the clusters blend together. The inferred genotyping errors, highlighted in pink in the middle figure above, were called homozygous but were probably heterozygotes that melded into one of the clusters of homoyzgotes.

5.2. The best and worst markers

Let’s look at the same sort of figures for the top five and bottom five markers.

> worst5 <- names(sort(err.by.mar.dropbadmice, decreasing=TRUE)[1:5])
> best5 <- names(sort(err.by.mar.dropbadmice, decreasing=FALSE)[1:5])
> par(mfrow=c(5,2), las=1)
> for(i in 1:5) {
+   plot(A2.Normalized.Signal ~ A1.Normalized.Signal, data=allgeno[[worst5[i]]],
+        main=worst5[i], xlab="Allele 1 intensity", ylab="Allele 2 intensity",
+        col=c("blue","red","green","","","orange")[allgeno[[worst5[i]]]$Measured.Genotype+1],
+        lwd=2)
+   legend("topright", pt.lwd=2, pch=c(1,1,1,1,16),
+   col=c("blue","red","green","orange", "hotpink"),
+          legend=c("11","12","22","missing","error"))
+   points(A2.Normalized.Signal ~ A1.Normalized.Signal,
+          data=allgeno[[worst5[i]]], col="hotpink", pch=16,
+          subset=!is.na(cg[,worst5[i]]) & cg[,worst5[i]] != cgi[,worst5[i]])
+
+   plot(A2.Normalized.Signal ~ A1.Normalized.Signal, data=allgeno[[best5[i]]],
+        main=best5[i], xlab="Allele 1 intensity", ylab="Allele 2 intensity",
+        col=c("blue","red","green","","","orange")[allgeno[[best5[i]]]$Measured.Genotype+1],
+        lwd=2)
+   legend("topright", pt.lwd=2, pch=c(1,1,1,1,16),
+   col=c("blue","red","green","orange", "hotpink"),
+          legend=c("11","12","22","missing","error"))
+   points(A2.Normalized.Signal ~ A1.Normalized.Signal,
+          data=allgeno[[best5[i]]], col="hotpink", pch=16,
+          subset=!is.na(cg[,best5[i]]) & cg[,best5[i]] != cgi[,best5[i]])
+ }

The figures on the left, above, correspond to the bad markers (as defined by their apparent genotyping error rate); the figures on the right correspond to the good markers. The genotyping errors all seem to derive from the same problem: the intensities for heterozygotes overlap those for the two homozygotes.

5.3. Those seven bad mice

Now let’s try to figure out what’s going on with those seven bad mice. I’ll grab the seven bad mice and a random seven of the other “good” mice. We’ll first look at the "contrast" column, which I take to be a measure distinguishing between the two homozygous clusters.

> badmice <- match(as.character(f2g$pheno$longid)[err.by.ind.prop > 0.05], allgeno[[1]]$Sample.Name)
> f2g$pheno$MouseNum[badmice]
[1] Mouse3635 Mouse3121 Mouse3225 Mouse3188 Mouse3580 Mouse3450 Mouse3566
566 Levels: 755356 Mouse133 Mouse134 Mouse135 Mouse136 Mouse2000 Mouse2001 Mouse2011 Mouse2012 Mouse2031 Mouse2272 ... Mouse3670
> goodmice <- sample(match(as.character(f2g$pheno$longid)[err.by.ind.prop < 0.05], allgeno[[1]]$Sample.Name), length(badmice))
> contr <- sapply(allgeno, function(a,b) a$Contrast[b], c(badmice, goodmice))

> par(mfrow=c(7,2), las=1)
> for(i in 1:7) {
+   hist(contr[i,], breaks=seq(-1.5, 1.5, len=101), xlab="Contrast",
+        main=cleang$pheno$MouseNum[badmice[i]])
+   hist(contr[i+7,], breaks=seq(-1.5, 1.5, len=101), xlab="Contrast",
+        main=cleang$pheno$MouseNum[goodmice[i]])
+ }

It looks like the “bad” mice are enriched for apparent homozygous genotypes. Let’s look at that more directly by plotting the number of genotyping errors against the proportion of markers with "contrast" < 0.5.

> allcontr <- sapply(allgeno, function(a) a$Contrast)
> plot(apply(allcontr, 1, function(a) mean(abs(a) < 0.5)),
+      apply(!is.na(cg) & (cg != cgi), 1, sum),
+      col=c("black", "blue", "red","green")[as.numeric(strain)],
+      ylab="Number of errors", xlab="Prop contrast < 0.5")
> legend("topright", col=c("black","blue","red","green"), pch=1,
+        pt.lwd=2, c("F2","B6","BTBR","F1"))

Six of the seven “bad” mice are at the lower range of the proportion of heterozygous genotypes.

We can use the a triangle plot (aka Holmans' triangle) to depict the genotype frequencies for each mouse, as another way to indicate this aspect of the problem in these mice.

> library(broman)
> gf <- apply(pull.geno(cleang, chr=c("-un", "-X")), 1,
+             function(a) table(factor(a, levels=1:3))/sum(!is.na(a)))
> triplot(labels=c("BB","BR","RR"))
> tripoints(gf[,-badmice], col=c("black", "blue", "red", "green")[as.numeric(strain)][-badmice], lwd=2)
> tripoints(gf[,badmice], col="orange", lwd=2)
> legend("topright", col=c("black","blue","red","green","orange"), pch=1,
+        pt.lwd=2, c("F2","B6","BTBR","F1","bad"))

6. How to deal with the genotyping errors?

I’m inclined to drop the seven mice with high genotyping error rates and so high numbers of observed crossovers. We might further seek to omit particularly bad markers, and even omit particular genotypes that look to be errors, but I think for the markers and the individual genotyping errors, it is sufficient to rely on the hidden Markov model (HMM) with assumed genotyping error rate.

So, let’s just drop the seven bad mice (having genotyping error rate > 5%).

> badmice <- names(err.by.ind.prop)[err.by.ind.prop > 0.05]
> badg <- subset(cleang, ind=!is.na(match(cleang$pheno$MouseNum, badmice)))
> cleang <- subset(cleang, ind=is.na(match(cleang$pheno$MouseNum, badmice)))
> f2g <- subset(cleang, ind=(cleang$pheno$Strain=="F2"))

7. Duplicate individuals

Let’s look for F₂ individuals with nearly duplicate genotype data.

> cg <- comparegeno(f2g)
> cgu <- cg[lower.tri(cg)]
> par(las=1, mar=c(5.1,4.1,0.6,0.6))
> hist(cgu, main="", xlab="Proportion of matching genotypes",
+      breaks=seq(0, 1, len=201))
> rug(c(sort(cgu)[1:100], sort(cgu, dec=TRUE)[1:100]), col="blue")

There are five pairs of mice with nearly matching genotypes.

> wh <- which(cg > 0.8, arr=TRUE)
> wh <- wh[wh[,1] < wh[,2],]
> cbind(as.character(f2g$pheno$MouseNum)[wh[,1]], as.character(f2g$pheno$MouseNum)[wh[,2]])
     [,1]        [,2]
[1,] "Mouse3318" "Mouse3317"
[2,] "Mouse3287" "Mouse3290"
[3,] "Mouse3354" "Mouse3353"
[4,] "Mouse3559" "Mouse3553"
[5,] "Mouse3259" "Mouse3269"

It’s safe to assume that these are sample mix-ups, and that any mismatches are genotyping errors in one or the other cases. We’ll omit genotypes that mismatch.

> n.omit <- rep(0,nrow(wh))
> for(i in 1:nrow(wh)) {
+   for(j in 1:nchr(f2g)) {
+     g1 <- f2g$geno[[j]]$data[wh[i,1],]
+     g2 <- f2g$geno[[j]]$data[wh[i,2],]
+     toomit <- !is.na(g1) & !is.na(g2) & g1 != g2
+     f2g$geno[[j]]$data[wh[i,1], toomit] <- NA
+     f2g$geno[[j]]$data[wh[i,2], toomit] <- NA
+     n.omit[i] <- n.omit[i] + sum(toomit)
+   }
+ }
> n.omit
[1] 32  4  5 10  5

As seen, the five pairs have between 4 and 32 genotyping errors.

8. Estimated genetic map

Let’s re-estimate the genetic map to see how it differs from the map from Cox et al. (Genetics, 182:1335-1344, 2009). I’ll use a genotype error rate of 2/1000, and the Carter-Falconer map function (which corresponds, approximately, to the high level of crossover interference in the mouse).

First we should drop one mouse from each of the 5 duplicate pairs.

> f2g.nodup <- subset(f2g, ind=(-wh[,2]))

Now we pull out the current map and re-estimate the map with the current data. In est.map(), I use n.cluster=8 to run multiple chromosomes in parallel (a feature added in R/qtl version 1.20-4).

> origmap <- pull.map(f2g.nodup, chr="-un")
> newmap <- est.map(f2g.nodup, chr="-un", err=0.002,
+                   map.function="c-f", n.cluster=8)
 -Running est.map via a cluster of 8 nodes.

Let’s look at summaries of the original and the newly estimated maps.

> cbind(summary.map(origmap), summary.map(newmap))[,c(1,5,2,6,3,7,4,8)]
        n.mar n.mar.1 length length.1 ave.spacing ave.spacing.1 max.spacing max.spacing.1
1         156     156   96.8     98.2       0.625         0.634        8.18          5.89
2         135     135   98.1    100.8       0.732         0.752       11.06          9.78
3         157     157   79.9     75.0       0.512         0.481        5.35          4.77
4         126     126   80.6     83.0       0.645         0.664        9.31          7.65
5         125     125   87.0     92.7       0.701         0.747       18.44         18.40
6         102     102   76.7     96.0       0.759         0.950       12.42         18.15
7         109     109   86.7     85.3       0.803         0.789       12.03         14.34
8          91      91   73.3     76.6       0.814         0.851        8.67         10.80
9          93      93   69.6     70.0       0.756         0.760        6.40          5.56
10        123     123   74.7     84.2       0.612         0.690        5.15          6.59
11        124     124   81.6     77.8       0.664         0.632        7.73          9.37
12        116     116   50.1     50.4       0.436         0.439        3.09          3.69
13        116     116   65.2     66.7       0.567         0.580        5.37          5.47
14         91      91   57.5     60.6       0.639         0.674        5.27          6.00
15        102     102   56.2     57.6       0.556         0.571        5.65          5.25
16         66      66   55.2     53.3       0.849         0.820       11.45         15.99
17         60      60   55.3     54.2       0.938         0.919        4.65          5.83
18         95      95   56.8     53.2       0.604         0.566        5.05          4.08
19         50      50   43.7     46.5       0.891         0.950        6.91          8.67
X          20      20   75.5     73.5       3.972         3.868       23.90         23.77
overall  2057    2057 1420.4   1455.6       0.697         0.715       23.90         23.77

Overall, the map hasn’t changed much. Let’s plot the chromosome lengths against each other.

> par(las=1, mar=c(5.1,4.1,0.6,0.6))
> plot(summary.map(origmap)[-21,2], summary.map(newmap)[-21,2],
+      type="n", xlab="Original chr length (cM)",
+      ylab="New chr length (cM)")
> abline(0,1, lty=2, col="gray")
> text(summary.map(origmap)[-21,2], summary.map(newmap)[-21,2], c(1:19,"X"))

Let’s plot the map next to the one embedded in the cross (which corresponds to that of Cox et al. 2009). The map on the bottom is the newly estimated one.

> par(mar=c(5.1,4.1,0.6,0.6))
> plot.map(origmap, newmap, horizontal=TRUE, main="")

We can also plot the individual interval lengths against each other. It’s better to plot the difference against either the average or against one of the two lengths. We’ll plot the difference versus the original length (that is, in the Cox et al. maps).

I’ve highlighted in red those intervals that increased in length by more than 4 cM; the numbers indicate the chromosome for the interval.

> par(las=1, mar=c(5.1,4.1,0.6,0.6))
> do <- unlist(lapply(origmap, diff))
> dn <- unlist(lapply(newmap, diff))
> thechr <- rep(c(1:19,"X"), lapply(newmap, function(a) length(a)-1))
> plot(do, dn-do,
+      type="n", xlab="Original interval length (cM)",
+      ylab="Change (new - old) in interval length (cM)", ylim=c(-5,15))
> abline(h=0, lty=2, col="gray")
> points(do, (dn-do), col=c("black","red")[(dn-do>4)+1])
> text(do[dn-do>4]+0.25, (dn-do)[dn-do>4], thechr[dn-do>4], col="red", adj=c(0,0.5))

9. Load and reformat the physical map

Load and reformat the physical map, so that we can save it with the cross data.

> temp <- read.csv("Maps/finalmap.csv", as.is=TRUE)
> temp <- temp[!is.na(match(temp$marker, markernames(f2g))),]
> pmap <- vector("list", 20)
> names(pmap) <- c(1:19,"X")
> for(i in c(1:19,"X")) {
+   pmap[[i]] <- temp[temp[,2]==i,3]/10^6
+   names(pmap[[i]]) <- temp[temp[,2]==i,1]
+   class(pmap[[i]]) <- ifelse(i=="X", "X", "A")
+ }
> class(pmap) <- "map"

Are there really the same number of markers per chromosome and are the markers in the same order?

> all(sapply(pmap, length) == nmar(f2g["-un",]))
[1] TRUE
> all(unlist(lapply(pmap, names)) == markernames(f2g["-un",]))
[1] TRUE

10. Save the clean cross object

> save(cleang, f2g, newmap, pmap, file="Data/clean_cross.RData")
> save(badg, file="Data/bad_mice.RData")

Cleaning the genotype data for the Attie project

1. Load the data

2. Missing data pattern

3. Study X chromosome genotypes

3.1. Not segregating

3.2. Genotypes in B6 and BTBR

3.3. Genotypes in F1

3.4. Genotypes in F2 males

3.5. Genotypes in F2 females

3.6. Swap sexes

3.7. Re-code genotypes

3.8. Segregation patterns

3.9. Counts of crossovers

3.10. Re-estimate map

4. Study autosomal genotypes

4.1. Not segregating

4.2. Genotypes in B6 and BTBR

4.3. Genotypes in F1

4.4. Segregation in F2

4.5. Linkage to other chromosomes

4.6. Counts of crossovers

4.7. Apparent genotyping errors

5. Study the raw genotype array information

5.1. The bad marker on chromosome 13

5.2. The best and worst markers

5.3. Those seven bad mice

6. How to deal with the genotyping errors?

7. Duplicate individuals

8. Estimated genetic map

9. Load and reformat the physical map

10. Save the clean cross object

3.3. Genotypes in F₁

3.4. Genotypes in F₂ males

3.5. Genotypes in F₂ females

4.3. Genotypes in F₁

4.4. Segregation in F₂