Overview

The data set consists of observations of the number of colony-forming units (CFUs) for 35 mice, as well as genotype information for each mouse at 13 different locations in the genome (called loci; singular: locus).

Details

Dr. Julie Wilder (Lovelace Respiratory Research Institute) studies the response of the lung to the introduction of pathogens. In particular, she has examined the genetic characteristics of immune response to infection with the pathogen Cryptococcus neoformans in mice. A preliminary study indicated that differences in the ability of two strains of mice, C57BL/6 and C.B-17, to clear the pathogen from the lung were likely due to genetic causes.

One component of Dr. Wilder’s overall study was to measure the ability of mice with either the C57BL/6 background genotype or the C.B-17 background genotype to clear the pathogen from their lungs. This was measured in each mouse by examining the average number of colony-forming units (CFUs) found in the lung. The data set gives these measurements, as well as the genotype of the mouse (A or B) at each of 13 loci. The goal is to determine whether any of these loci are associated with the average number of CFUs.

Data Description

Variable Description
Column 1: List of locus names
Columns 2-26: Data for each mouse
Row 1: Mouse ID
Row 2: CFU values for each mouse
Rows 3-15: Genotype (A or B) for each mouse at each locus
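
# Read the raw file; header = FALSE because row 1 holds mouse IDs, not variable names
mouse_cfus <- read.csv("data/qtl2C.csv", header = FALSE)
# Preview the first rows: mouse IDs, CFU values, then genotypes
head(mouse_cfus)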
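
Because the file stores one mouse per column, it may be convenient to transpose it so that each mouse is a row before running any tests. The following is a minimal sketch of that reshaping on a small made-up table that mimics the layout described above. The object names (`toy_raw`, `toy_long`, `locus_cols`) and all values are hypothetical, and treating empty cells as missing genotypes is an assumption about how the real file encodes missing data.

```r
# Toy stand-in for the file layout described above (values are made up,
# not taken from qtl2C.csv): column 1 holds row labels, the remaining
# columns hold one mouse each.
toy_raw <- data.frame(
  V1 = c("ID", "CFU", "locus1", "locus2"),
  V2 = c("m1", "5.1", "A", "B"),
  V3 = c("m2", "6.3", "B", "B"),
  V4 = c("m3", "4.8", "A", ""),
  stringsAsFactors = FALSE
)

# Transpose so that each mouse becomes a row and each label a column.
toy_t <- t(toy_raw[, -1])
colnames(toy_t) <- toy_raw[[1]]
toy_long <- as.data.frame(toy_t, stringsAsFactors = FALSE)
rownames(toy_long) <- NULL

# CFU arrives as text because each original column mixed numbers and letters.
toy_long$CFU <- as.numeric(toy_long$CFU)

# Assumption: blank genotype cells mean "missing"; recode them as NA.
locus_cols <- setdiff(names(toy_long), c("ID", "CFU"))
toy_long[locus_cols] <- lapply(toy_long[locus_cols], function(g) {
  g[g == ""] <- NA
  factor(g, levels = c("A", "B"))
})

str(toy_long)
```

The same steps should carry over to `mouse_cfus` once the actual coding of missing genotypes in the file has been checked.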

Data Files

Objectives

At each locus, a two-sample t-test can be used to compare the CFU values for genotypes A and B. This amounts to carrying out 13 different hypothesis tests. Another complication is that some of the data are missing for some loci. This project will explore these issues.

  1. When carrying out 13 tests at a fixed significance level (for example, α = 0.05), what proportion of the tests would be expected to be significant if no loci were actually associated with CFU values?
  2. For how many tests did you observe significant differences in CFU values for the two genotypes?
  3. Read (online) about methods for correcting for multiple testing. Define the following terms: familywise error rate and false discovery rate.
  4. Apply at least one multiple testing correction to this data set. You can use (with justification) any correction you wish.
  5. How many loci are significant after applying the correction you selected in part (4)?
  6. Imputation is a common approach to handle missing data. Read about imputation online. Can you suggest an approach for imputation in this data set? Is missing data likely to cause problems in this data set?
  7. Describe how you would report your results to Dr. Wilder. Which genes would you suggest that she pursue for future work in this problem?
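
The workflow behind these questions can be sketched in R as follows. This is only an illustration on simulated data, not the study's measurements: the sample size, the object names (`cfu`, `geno`, `p_raw`, and so on), the choice of α = 0.05, and the use of Welch's unequal-variance test (the default in `t.test()`) are all assumptions made for the example. Mice with a missing genotype at a locus are simply left out of that locus's test, which is one of several possible ways to handle missing data.

```r
set.seed(1)  # simulated data only; not the study's measurements

n_mice <- 30
n_loci <- 13
alpha  <- 0.05

# Hypothetical data: CFU values unrelated to genotype, random A/B genotypes,
# and a handful of missing genotype calls.
cfu  <- rnorm(n_mice, mean = 5, sd = 1)
geno <- matrix(sample(c("A", "B"), n_mice * n_loci, replace = TRUE),
               nrow = n_mice,
               dimnames = list(NULL, paste0("locus", seq_len(n_loci))))
geno[sample(length(geno), 10)] <- NA

# One two-sample t-test per locus, skipping mice with a missing genotype.
p_raw <- apply(geno, 2, function(g) {
  keep <- !is.na(g)
  t.test(cfu[keep & g == "A"], cfu[keep & g == "B"])$p.value
})

# Expected number of false positives if no locus has any effect.
expected_false_pos <- n_loci * alpha  # 13 * 0.05 = 0.65

# Two standard corrections available through p.adjust():
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # targets the familywise error rate
p_bh   <- p.adjust(p_raw, method = "BH")          # targets the false discovery rate

# Count how many loci fall below alpha before and after correction.
c(raw = sum(p_raw < alpha),
  bonferroni = sum(p_bonf < alpha),
  BH = sum(p_bh < alpha))
```

Bonferroni is the more conservative of the two corrections shown, so its adjusted p-values are never smaller than the Benjamini-Hochberg ones. Which correction is appropriate depends on whether the aim is to avoid any false positive at all or to tolerate a controlled fraction of them among the loci flagged for follow-up.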