Overview

The data consists of an English word list, as well as a word list from George Orwell’s novel 1984.

Details

Computer systems often include word lists for tasks such as spell-checking, and for this project you will work with one such list for English. Note that the list may not be quite what you expect: for example, the English list starts with “a, aa, aaa” and also contains some entries that are really proper names (e.g., “aaron”). You will also consider the list of words from the novel 1984 by George Orwell.

english = scan("data/english3.txt", what = "character")
english[1:5]
## [1] "a"        "aa"       "aaa"      "aachen"   "aardvark"
orwell = scan("data/1984.txt", what = "character")
orwell[1:5]
## [1] "A"        "ABOLISH"  "AFTER"    "AGITPROP" "ALL"
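
Since the project is about word lengths, a natural first step is to convert each word to its character count. The following is a minimal sketch, not part of the original materials: it uses only the five entries printed above as stand-ins for the full files, and the names `english_sample`, `orwell_sample`, `english_len`, and `orwell_len` are hypothetical.

```r
# Stand-ins for the full lists: the first five entries printed above.
english_sample <- c("a", "aa", "aaa", "aachen", "aardvark")
orwell_sample <- c("A", "ABOLISH", "AFTER", "AGITPROP", "ALL")

# nchar() returns the number of characters in each string.
english_len <- nchar(english_sample)
orwell_len <- nchar(orwell_sample)

# Frequency table of word lengths and a simple summary for comparison.
print(table(english_len))
print(c(english = mean(english_len), orwell = mean(orwell_len)))
```

With the real data, the same calls would be applied to the `english` and `orwell` vectors read in by `scan()`.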

Data Files

Objectives

We will model the lengths of words in English (as represented by this computer word list) and the lengths of words in the novel 1984. The goals are to find distributions that fit these data well, to estimate the associated parameters, and to compare features of the two word lists.
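
As one illustration of what "fitting a distribution and estimating its parameters" could look like, the sketch below fits a Poisson distribution shifted by one (so that the minimum length is 1) using maximum likelihood. The choice of a shifted Poisson is an assumption made here for illustration only, not the model the project prescribes; the helper `fit_shifted_poisson` and the variable `lengths_demo` are hypothetical names, and the demo data are just the five English entries printed earlier.

```r
# Hypothetical helper: fit X - 1 ~ Poisson(lambda) to word lengths X >= 1.
# The shifted Poisson is an illustrative assumption, not a recommended model.
fit_shifted_poisson <- function(lengths) {
  stopifnot(all(lengths >= 1))
  # MLE of lambda for a Poisson is the sample mean (here, of lengths - 1).
  lambda_hat <- mean(lengths) - 1
  k <- 1:max(lengths)
  # Observed proportion of each length, including lengths with zero count.
  observed <- as.vector(table(factor(lengths, levels = k))) / length(lengths)
  # Fitted probability of each length under the shifted Poisson.
  fitted <- dpois(k - 1, lambda_hat)
  list(lambda = lambda_hat,
       table = data.frame(length = k, observed = observed, fitted = fitted))
}

# Demo on the five English entries shown earlier (far too few for a real fit).
lengths_demo <- nchar(c("a", "aa", "aaa", "aachen", "aardvark"))
fit <- fit_shifted_poisson(lengths_demo)
print(fit$lambda)
print(fit$table)
```

Comparing the `observed` and `fitted` columns (or plotting them side by side) gives a quick visual check of fit; other candidate distributions could be compared in the same way.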