Overview

The data consists of English and French word lists for use by computer systems.

Details

Computer systems often include word lists for tasks such as spell-checking. For this project you will consider two such lists – one for French and one for English. Note that accents have been removed from the French word list. Also note that the lists may not be quite what you expect – for example, the English list starts out “a, aa, aaa” and also contains some entries that are really proper names (e.g., “aaron”).

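# Read each word list as a one-column data frame of character strings
english = read.table("data/english3.txt", stringsAsFactors = FALSE)
french = read.table("data/francais_edited.txt", stringsAsFactors = FALSE)
# Show the first five entries of the English list
english[1:5,]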

## [1] "a"        "aa"       "aaa"      "aachen"   "aardvark"

french[1:5,]

## [1] "a"           "ab"          "abaissa"     "abaissai"    "abaissaient"

Data Files

Objectives

We will consider modeling the lengths of words in French and English (as accessible through these computer dictionaries). The goal is to find distributions that fit these data well, and to estimate the associated parameters, as well as to compare features of the two languages.
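
As a starting point, here is a minimal sketch of how word lengths might be computed and summarized in R. It uses a small hypothetical vector `words` (the first five English entries shown above) rather than the full data; with the real data, `english[, 1]` would presumably take its place. The variable names are illustrative assumptions, not part of the assignment.

```r
# Hypothetical stand-in for the word column of the `english` data frame;
# with the real data one would likely use english[, 1] instead.
words <- c("a", "aa", "aaa", "aachen", "aardvark")

# nchar() returns the number of characters in each string.
word_lengths <- nchar(words)

# Numerical summaries: min/quartiles/mean/max, and a frequency table.
length_summary <- summary(word_lengths)
length_table <- table(word_lengths)

print(length_summary)
print(length_table)

# Graphical summary: a bar plot suits discrete data such as word length.
barplot(length_table, xlab = "Word length", ylab = "Count")
```

Repeating the same steps for the French list and placing the two bar plots side by side is one simple way to compare the languages.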

Calculate the word lengths of the words in the two dictionaries, and provide numerical and graphical summaries of the word lengths. Are the distributions of word length similar across the two languages?
Consider fitting some of the known distributions discussed in class, both this semester and last semester, to these variables. For each distribution you consider, explain how you are estimating the relevant parameters (e.g., are you using maximum likelihood estimates (MLEs), method-of-moments (MOM) estimates, or something else?). Consider at least two different distributions for each of the variables.
Provide some assessment of fit of the distributions to the data. For example, q-q plots are useful for this purpose.
Draw some conclusions about modeling the word length variables.
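
The fitting and assessment steps above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (not the real word lengths), and the two candidate distributions, a shifted Poisson fitted by maximum likelihood and a gamma fitted by the method of moments, are arbitrary choices, not the ones the project requires. All variable names are hypothetical.

```r
# Simulated stand-in for a vector of word lengths (NOT the real data).
# Word lengths are at least 1, so we simulate 1 + Poisson(7).
set.seed(3202)
x <- 1 + rpois(2000, lambda = 7)
n <- length(x)

# Candidate 1: shifted Poisson, i.e. (X - 1) ~ Poisson(lambda).
# The MLE of lambda is the sample mean of the shifted data.
lambda_hat <- mean(x - 1)

# Candidate 2: gamma(shape, rate), fitted by the method of moments,
# using mean = shape / rate and variance = shape / rate^2.
m <- mean(x)
v <- mean((x - m)^2)   # second central sample moment
shape_hat <- m^2 / v
rate_hat <- m / v

# Theoretical quantiles at the plotting positions ppoints(n).
p <- ppoints(n)
q_pois <- 1 + qpois(p, lambda_hat)
q_gamma <- qgamma(p, shape = shape_hat, rate = rate_hat)

# Q-Q plots: sorted data against theoretical quantiles.
# Points close to the line y = x suggest a reasonable fit.
op <- par(mfrow = c(1, 2))
plot(q_pois, sort(x), main = "Shifted Poisson (MLE)",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)
plot(q_gamma, sort(x), main = "Gamma (MOM)",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)
par(op)
```

Because word length is discrete, a Q-Q plot against a continuous distribution such as the gamma will show a staircase pattern; the overall trend relative to the reference line is what matters.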

Project 1M: English Versus French Word Length

STAT 3202: Group Project I
