Project 2I: AnthroKids Data

Overview

The data set consists of anthropometric data collected on 3,900 children in 1977 for use in consumer product safety studies.

Details

These data are a subset of a dataset that resulted from a Consumer Product Safety Commission (CPSC) effort to collect anthropometric data on children in the mid-seventies. A total of 87 traditional and functional body measurements were taken on a sample of 4127 infants, children and youths representing the U.S. population aged 2 weeks through 18 years. Measurements were taken throughout the United States by two teams of anthropometrists using an automated anthropometric data acquisition system. Standard anthropometers, calipers, and tape devices were modified to read electronically and input dimensional data directly to a mini-computer for data processing and storage. The goal in collecting such data was to provide guidance in consumer product safety for the design of items that would be used by children.

More information can be found here: http://stat.pugetsound.edu/hoard/datasetDetails.aspx?id=10

Data Description

Variable        Units                     Description
id              (number)                  numerical id assigned to each sampled child
mass            kilograms                 mass of child
height          centimeters               height of child
waist           centimeters               waist circumference of child
foot            centimeters               child’s foot length
sittingHeight   millimeters               sitting height of child
upperLegLength  millimeters               length of child’s upper leg
kneeHeight      millimeters               height of child’s knee
forearmLength   millimeters               length of child’s forearm
age             years                     age of child
gender          F (female) or M (male)    gender of child
handedness      both, left, or right      handedness of child
birthOrder      (number)                  child’s numerical ranking by age among siblings (1 being first)

Objectives

Sometimes data such as these are used to estimate reference growth curves and charts. A simple growth chart relating, for example, age and height, would present values of age along the x-axis, values of height along the y-axis, and curves marking out quantiles of height across values of age. When visiting the doctor, a child’s age and height might be located on the chart, and the doctor would then tell the child and his or her parents that the child is at the 70th percentile of height for their age. This sort of growth chart is often used as a screening tool, helping doctors and parents identify problems if one biometric measurement is extreme given the child’s age and other biometric measurements. More sophisticated growth charts may be predictive in intent – for example, predicting a child’s future adult height from his or her current biometric values.
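
As a rough illustration of what such a chart contains, the sketch below computes empirical quantiles of height within age bins and connects them across ages. It uses synthetic stand-in data rather than the AnthroKids measurements, and the names toy, breaks, chart, and the example child are all hypothetical choices made for this illustration only.

```r
# Synthetic stand-in data (NOT the AnthroKids measurements), used only so
# that this sketch runs on its own.
set.seed(1)
n <- 500
toy <- data.frame(age = runif(n, min = 2, max = 18))
toy$height <- 80 + 6 * toy$age + rnorm(n, sd = 2 + 0.4 * toy$age)

# Group children into two-year age bins.
breaks <- seq(2, 18, by = 2)
toy$age_bin <- cut(toy$age, breaks = breaks, include.lowest = TRUE)

# Empirical 5th, 50th, and 95th percentiles of height within each bin
# (one row per bin, one column per percentile).
probs <- c(0.05, 0.50, 0.95)
chart <- t(sapply(split(toy$height, toy$age_bin), quantile, probs = probs))

# Plot the raw data and connect the per-bin quantiles at the bin midpoints.
bin_mid <- head(breaks, -1) + 1
plot(height ~ age, data = toy, pch = 20, col = "grey")
matlines(bin_mid, chart, col = "red", lty = c(2, 1, 2), lwd = 2)

# Locate one hypothetical child on the chart: the share of children in the
# same age bin who are no taller than this child.
child_age <- 9.5
child_height <- 140
child_bin <- cut(child_age, breaks = breaks, include.lowest = TRUE)
same_bin <- toy$height[toy$age_bin == child_bin]
child_pct <- mean(same_bin <= child_height)
```

Binning is only one crude way to estimate the curves; the regression-based approaches discussed in this project replace the bins with a model for how each quantile changes with age.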

In this project, you will consider producing simple growth charts relating two variables of your choice (separately) to age. Choose (at least) two of the biometric measurements: mass, height, waist, and foot. If the biometric measurement is approximately normal, one approach might be to fit a standard linear regression model and plot the mean function and prediction intervals. For example, if we wanted to see the 5th percentile, mean/median, and 95th percentile, we could try proceeding as follows:

# read the data and drop rows with missing values
anthro_data = read.table("Data/anthrokids2I.csv", header = TRUE, sep = ",")
anthro_data = na.omit(anthro_data)

# grid of ages at which to evaluate the fitted curves
age_grid = data.frame(age = seq(from = min(anthro_data$age), 
                                to = max(anthro_data$age), 
                                length.out = 10))

# simple linear regression of mass on age
anthro_model = lm(mass ~ age, data = anthro_data)
plot(mass ~ age, data = anthro_data, pch = 20)
grid()
# a 90% prediction interval gives the 5th and 95th percentile curves
anthro_pred = data.frame(predict(anthro_model, newdata = age_grid, interval = "prediction", level = 0.90))
lines(age_grid$age, anthro_pred$fit, col = "red", lty = 1, lwd = 2)
lines(age_grid$age, anthro_pred$lwr, col = "red", lty = 2, lwd = 2)
lines(age_grid$age, anthro_pred$upr, col = "red", lty = 2, lwd = 2)

However, we can see some problems with these curves: primarily, the underlying relationship is curvilinear rather than linear, and the variability changes with age. You will explore some ways to improve on this approach. Specifically, for your two variables of interest (separately) and for the 5th percentile, mean/median, and 95th percentile:

  1. Consider a standard linear regression model with both a linear and a quadratic effect for age.
  2. Consider a standard linear model where first the outcome variable (e.g., mass in the above example) is log-transformed.
  3. Learn about quantile regression online, and explore the R package “quantreg”. In quantile regression, instead of modeling the mean function, you can model the median function – or any quantile function. Thus, by using the function “rq” in the “quantreg” package, you can estimate models for the 0.05 quantile, the 0.5 quantile, and the 0.95 quantile, and use these as growth curves. Explore using this function with just a linear effect, as well as with a linear and a quadratic effect.
  4. Finally, in the quantreg package, there is a simple function called lprq() which can estimate a nonlinear relationship between the predictor (e.g., age) and a particular quantile of the outcome (e.g., the 0.95 quantile of mass). The fit is a “local linear” fit, which basically strings together a bunch of quantile linear regression models fit to points nearby (in age) rather than to the entire data set. This function takes as an argument a bandwidth, h, which controls how nearby points need to be in order to influence the local fit. Try fits with h=1/12 (one month), h=1/2, h=1, and h=3.
  5. Which of these methods do you think is producing the most reasonable results? Using that method, present “final” growth curves for each of your variables of interest as a function of age, including the 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.95 quantiles.

For help implementing the methods in the quantreg package, make sure to look at the CRAN page (https://cran.r-project.org/web/packages/quantreg/index.html). In particular, look at both the reference manual and the vignette, which provides a more expository discussion of the functions in the package.
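
As a starting point, the first four approaches above might be set up roughly as in the sketch below. It is a minimal illustration under stated assumptions, not a worked solution: it uses synthetic stand-in data rather than the AnthroKids file, the names toy, grid, and the fit_/pred_ objects are hypothetical, and the quantile regression part runs only if the quantreg package is installed.

```r
# Synthetic stand-in data (NOT the AnthroKids measurements), used only so
# that this sketch runs on its own.
set.seed(2)
n <- 400
toy <- data.frame(age = runif(n, min = 2, max = 18))
toy$mass <- exp(2.3 + 0.11 * toy$age + rnorm(n, sd = 0.15))
grid <- data.frame(age = seq(min(toy$age), max(toy$age), length.out = 50))

# 1. Linear plus quadratic effect for age. Under normal errors, a 90%
#    prediction interval approximates the 5th and 95th percentile curves.
fit_quad <- lm(mass ~ age + I(age^2), data = toy)
pred_quad <- predict(fit_quad, newdata = grid,
                     interval = "prediction", level = 0.90)

# 2. Log-transformed outcome. Exponentiating returns to the original scale;
#    the back-transformed fit then estimates the median of mass, not the mean.
fit_log <- lm(log(mass) ~ age, data = toy)
pred_log <- exp(predict(fit_log, newdata = grid,
                        interval = "prediction", level = 0.90))

plot(mass ~ age, data = toy, pch = 20, col = "grey")
matlines(grid$age, pred_quad, col = "red", lty = c(1, 2, 2), lwd = 2)
matlines(grid$age, pred_log, col = "blue", lty = c(1, 2, 2), lwd = 2)

# 3 and 4. Quantile regression, run only if quantreg is available.
if (requireNamespace("quantreg", quietly = TRUE)) {
  # rq() with several taus fits one model per quantile.
  taus <- c(0.05, 0.50, 0.95)
  fit_rq <- quantreg::rq(mass ~ age + I(age^2), tau = taus, data = toy)
  pred_rq <- predict(fit_rq, newdata = grid)  # one column per tau
  matlines(grid$age, pred_rq, col = "darkgreen", lty = c(2, 1, 2), lwd = 2)

  # Local linear fit of the 0.95 quantile with bandwidth h = 1 (one year).
  fit_lp <- quantreg::lprq(toy$age, toy$mass, h = 1, tau = 0.95)
  lines(fit_lp$xx, fit_lp$fv, col = "purple", lwd = 2)
}
```

Overlaying the curves on one plot makes it easier to compare how each method handles the curvature and the changing spread; repeating the lprq() call with other values of h shows how the bandwidth trades smoothness against local detail.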