Overview

The data consist of two word lists: a list of English words and a list of the words appearing in George Orwell’s novel 1984.

Details

Computer systems often include word lists for spell-checking and similar tasks. For this project you will work with one such list for English. Note that the list may not be quite what you expect: for example, it starts out “a, aa, aaa” and also contains some words that are really proper names (e.g., “aaron”). You will also work with the list of words from the novel 1984 by George Orwell.

english = scan("data/english3.txt", what = "character")
english[1:5]

## [1] "a"        "aa"       "aaa"      "aachen"   "aardvark"

orwell = scan("data/1984.txt", what = "character")
orwell[1:5]

## [1] "A"        "ABOLISH"  "AFTER"    "AGITPROP" "ALL"
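
Once the lists are loaded, word lengths can be computed with nchar(). The following is a minimal sketch that uses only the first five entries printed above as stand-ins for the full files. The names english_head, orwell_head, english_len and orwell_len are made up for illustration.

```r
# Toy stand-ins: the first five entries of each list as printed above,
# not the full files.
english_head = c("a", "aa", "aaa", "aachen", "aardvark")
orwell_head  = c("A", "ABOLISH", "AFTER", "AGITPROP", "ALL")

# nchar() returns the number of characters in each string.
english_len = nchar(english_head)
orwell_len  = nchar(orwell_head)

english_len
orwell_len

# Simple numerical summaries of the lengths.
summary(english_len)
table(orwell_len)
```

On the full lists, the same calls applied to english and orwell give the complete length vectors, and something like barplot(table(nchar(english))) is one possible graphical summary.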

Data Files

Objectives

We will consider modeling the lengths of words in English (as accessible through this computer list) and the lengths of words in the novel 1984. The goal is to find distributions that fit these data well, and to estimate the associated parameters, as well as to compare features of the two word lists.
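
As a rough illustration of what fitting a distribution and estimating its parameters can look like, here is a sketch under stated assumptions. The two models (a Poisson shifted by one, and a gamma) are example choices only, not the required ones, and the vector len holds the lengths of the ten sample words printed in the Details section rather than the real data.

```r
# Illustrative word lengths only (the ten sample words shown in Details),
# not the real data.
len = c(1, 2, 3, 6, 8, 1, 7, 5, 8, 3)
n = length(len)

# Model A (assumption): len - 1 ~ Poisson(lambda), since every word has
# at least one letter. The MLE and the MOM estimate coincide: mean(len) - 1.
lambda_hat = mean(len) - 1

# Model B (assumption): len ~ Gamma(shape, rate), fitted by method of moments
# using the divide-by-n second central moment.
m1 = mean(len)
m2 = mean((len - m1)^2)
shape_hat = m1^2 / m2
rate_hat  = m1 / m2

# Theoretical quantiles for a q-q plot under each fitted model.
p = ppoints(n)
q_pois  = qpois(p, lambda_hat) + 1
q_gamma = qgamma(p, shape = shape_hat, rate = rate_hat)

# Draw the q-q plot for Model A only when running interactively.
if (interactive()) {
  plot(q_pois, sort(len),
       xlab = "Shifted Poisson quantiles", ylab = "Observed word length")
  abline(0, 1)
}
```

Points close to the line y = x suggest a reasonable fit. Note that the gamma is continuous while word length is discrete, so Model B is at best an approximation.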

Calculate the word lengths of the words in the two lists, and provide numerical and graphical summaries of the word lengths. Are the distributions of word length similar across the two lists?
Consider fitting some of the known distributions discussed in class, both this semester and last semester, to these variables. For each distribution you consider, explain how you are estimating the relevant parameters (e.g., by maximum likelihood (MLE), by the method of moments (MOM), or in some other way). Consider at least two different distributions for each of the variables.
Provide some assessment of fit of the distributions to the data. For example, q-q plots are useful for this purpose.
Take a look at the words that are used in 1984 but are not in the English word list. Start by converting the words in 1984 to lowercase, since some are capitalized but the English word list has no capital letters. Comment on these words.
Draw some conclusions about modeling the word length variables.
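
The comparison of the two lists can be sketched with tolower() and setdiff(). This example uses toy vectors: orwell_head is the five-word sample printed in Details, and english_toy is a hypothetical short list in which "abolish", "after" and "all" were added by hand for illustration. The result therefore says nothing about what the real english3.txt contains.

```r
# Toy inputs (assumptions for illustration, not the real files).
orwell_head = c("A", "ABOLISH", "AFTER", "AGITPROP", "ALL")
english_toy = c("a", "aa", "aaa", "aachen", "aardvark",
                "abolish", "after", "all")

# Convert to lowercase so the comparison ignores capitalization.
orwell_lower = tolower(orwell_head)

# Words that appear in the 1984 sample but not in the toy English list.
not_in_english = setdiff(orwell_lower, english_toy)
not_in_english
```

With the full data, the same two calls applied to orwell and english give the set of words to comment on.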

Project 1N: Word Length in George Orwell’s 1984

STAT 3202: Group Project I
