Overview

The data consist of two word lists, one English and one French, of the kind computer systems use for tasks such as spell-checking.

Details

Computer systems sometimes keep word lists for spell-checking and similar tasks. For this project you will consider two such lists, one for French and one for English. Note that accents have been removed from the French word list. Also note that the lists may not be quite what you expect: for example, the English list starts out “a, aa, aaa” and also contains some entries that are really proper names (e.g., “aaron”).

english = read.table("data/english3.txt", stringsAsFactors = FALSE)
french = read.table("data/francais_edited.txt", stringsAsFactors = FALSE)
english[1:5,]
## [1] "a"        "aa"       "aaa"      "aachen"   "aardvark"
french[1:5,]
## [1] "a"           "ab"          "abaissa"     "abaissai"    "abaissaient"
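
Since the project concerns word lengths, a natural first step is to turn each word into its character count. The sketch below is only illustrative: instead of reading the data files, it uses the first five entries of each list as printed above, and the names english_words, french_words, english_len and french_len are hypothetical. On the full data, the same nchar() call could presumably be applied to the single column of each data frame.

```r
# Stand-ins for english[1:5, ] and french[1:5, ], copied from the output above
english_words = c("a", "aa", "aaa", "aachen", "aardvark")
french_words = c("a", "ab", "abaissa", "abaissai", "abaissaient")

# nchar() counts the characters in each string
english_len = nchar(english_words)
french_len = nchar(french_words)

english_len
french_len
```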

Data Files

Objectives

We will model the lengths of words in French and English, as represented in these computer word lists. The goals are to find distributions that fit the word-length data well, to estimate the associated parameters, and to compare features of the two languages.
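
As a rough sketch of what such a fit might look like, the code below tabulates word lengths and compares the observed proportions with one candidate distribution. The choice of a Poisson shifted by one (since every word has at least one letter) is an assumption made purely for illustration, not a distribution the project prescribes; the five-word sample and all variable names are likewise hypothetical stand-ins for the full lists.

```r
# Toy sample: lengths of the first five English entries shown earlier
word_len = nchar(c("a", "aa", "aaa", "aachen", "aardvark"))

# Empirical distribution of word length
rel_freq = prop.table(table(word_len))

# Simple summaries that can later be compared across languages
len_mean = mean(word_len)
len_var = var(word_len)

# Assumed candidate model: (length - 1) ~ Poisson(lambda).
# The maximum likelihood estimate of lambda is the sample mean of (length - 1).
lambda_hat = len_mean - 1

# Fitted probabilities at the observed lengths, next to the observed proportions
observed_lengths = as.integer(names(rel_freq))
fitted_prob = dpois(observed_lengths - 1, lambda = lambda_hat)
comparison = data.frame(length = observed_lengths,
                        observed = as.numeric(rel_freq),
                        fitted = fitted_prob)
print(comparison)
```

Repeating the same steps on the French lengths would give a second set of estimates to set beside the English ones. With only five words the fit itself is meaningless; the point is the workflow.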