Tidy Data

Working with Data Frames

David Dalpiaz

February 5, 2024

Tabular Data

  • A variable is a quantity, quality, or property that you can measure.
  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
  • An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.

Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

Tabular Data

  • Rows represent observations (samples, records, etc.).
    • Often, the first row is the header, which stores the names of the columns, not values.
  • Columns represent variables (features, attributes, etc.).
    • Sometimes, the first column is an index.
  • Cells are row-column intersections that store values.
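As a rough illustration of these terms, here is a minimal pandas sketch. The column names and values are hypothetical and chosen only to show where the header, the index, and a single cell live.

```python
import pandas as pd

# Hypothetical measurements: each dict key is a variable (column),
# and each position across the lists is one observation (row).
df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "height_cm": [150.0, 162.5, 171.0],
        "age": [31, 45, 27],
    }
)

header = list(df.columns)      # column names, stored apart from the values
index = list(df.index)         # default integer index: 0, 1, 2
n_rows, n_cols = df.shape      # (observations, variables)
cell = df.loc[1, "height_cm"]  # the value at row 1, column "height_cm"
```

Note that the header and the index are labels, not data: `df.shape` counts only the cells that hold values.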

Palmer Penguins

     species     island  bill_length_mm  bill_depth_mm  flipper_length_mm
  0   Adelie  Torgersen            39.1           18.7              181.0
  1   Adelie  Torgersen            39.5           17.4              186.0
  2   Adelie  Torgersen            40.3           18.0              195.0
  3   Adelie  Torgersen             NaN            NaN                NaN
  4   Adelie  Torgersen            36.7           19.3              193.0
...      ...        ...             ...            ...                ...
339   Gentoo     Biscoe             NaN            NaN                NaN
340   Gentoo     Biscoe            46.8           14.3              215.0
341   Gentoo     Biscoe            50.4           15.7              222.0
342   Gentoo     Biscoe            45.2           14.8              212.0
343   Gentoo     Biscoe            49.9           16.1              213.0

344 rows × 5 columns
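The first five rows printed above can be rebuilt by hand to see how missing values behave. This is only a sketch: the variable names are my own, and only the rows shown in the printout are used, not the full 344-row dataset.

```python
import numpy as np
import pandas as pd

# The first five rows from the printout, typed in by hand.
penguins_head = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "island": ["Torgersen"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
        "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
        "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    }
)

# Count missing values in each column.
missing_per_column = penguins_head.isna().sum()

# Keep only the observations with no missing measurements.
complete_rows = penguins_head.dropna()
```

Row 3 is still an observation (a penguin was recorded), even though none of its measurements are available.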

Tidy Data

Tidy Data Definition

There are three interrelated rules that make a dataset tidy:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value.
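One common way these rules get broken is a "wide" table where a variable hides in the column names. The sketch below uses made-up data (the `country`, `year`, and `cases` names are hypothetical) and reshapes it with pandas `melt` so each variable gets its own column.

```python
import pandas as pd

# Hypothetical wide table: the "year" variable is stored in the column names,
# so each row holds two observations instead of one.
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2022": [10, 20],
        "2023": [11, 22],
    }
)

# Move the year labels into a column of their own: one row per (country, year).
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)
```

After reshaping, every row is one country-year observation and every column is one variable.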

Tidy Data Literature

Tidy Data and The Relational Model

The tidy data formulation is essentially Codd’s third normal form from relational database theory, restated in statistical language.

  • Row = Tuple
  • Column = Attribute
  • Table = Relation
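The correspondence can be seen with Python's built-in `sqlite3` module. This is a small sketch, not part of the original material; the table name and the choice of three columns are assumptions, and the two rows reuse values from the penguins printout.

```python
import sqlite3

# An in-memory relational database with one relation (table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE penguins (species TEXT, island TEXT, bill_length_mm REAL)"
)

# Each inserted row is a tuple, with one value per attribute.
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?, ?)",
    [("Adelie", "Torgersen", 39.1), ("Gentoo", "Biscoe", 46.8)],
)

cursor = conn.execute("SELECT * FROM penguins ORDER BY species")
attributes = [col[0] for col in cursor.description]  # column names
tuples = cursor.fetchall()                           # rows as Python tuples
conn.close()
```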

Data Frames

From Arrays to Frames

  • Arrays: Homogeneous Data
    • Elements all have the same type.
    • Stored in a contiguous block of memory.
    • Ability to locate elements via indexing.
  • Data Frames: Heterogeneous Data
    • Collection of one-dimensional arrays.
    • Row and column structure gives meaning and defines how values are related.
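The array properties listed above can be checked directly in NumPy, and a data frame column can be pulled out as a one-dimensional array. This is a minimal sketch with made-up values.

```python
import numpy as np
import pandas as pd

# Mixing ints and a float: NumPy upcasts so every element is float64.
values = np.array([1, 2.5, 3])
same_type = values.dtype                    # a single dtype for the whole array
contiguous = values.flags["C_CONTIGUOUS"]   # stored as one block of memory
second = values[1]                          # locate an element by position

# A data frame holds columns of different types side by side.
frame = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "c"]})
count_column = frame["count"].to_numpy()    # one column as a 1-D array
```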

Data Frames: Collections of Columns

A data frame is usually implemented as a collection of columns.

  • R: A list of vectors.
  • Python: A dictionary of arrays.

Each column has a single data type, but different columns can have different types. All columns have the same length.
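In pandas, this idea can be sketched by passing a dictionary of arrays to `pd.DataFrame`. The second half checks the equal-length requirement; the column names `a` and `b` are placeholders.

```python
import numpy as np
import pandas as pd

# A dictionary of arrays: keys become column names, arrays become columns.
columns = {
    "species": np.array(["Adelie", "Gentoo"]),
    "bill_length_mm": np.array([39.1, 46.8]),
}
df = pd.DataFrame(columns)
column_types = df.dtypes  # one data type per column

# Columns of unequal length are rejected when the frame is built.
try:
    pd.DataFrame({"a": np.array([1, 2, 3]), "b": np.array([1.0, 2.0])})
    lengths_enforced = False
except ValueError:
    lengths_enforced = True
```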

The Matrix

Neo, sooner or later you’re going to realize, just as I did, that there’s a difference between knowing the path and walking the path.

Data frames are not matrices!!!

  • Row operations are often nonsensical or impossible.
  • Row operations that are possible can be misleading or slow!
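A small pandas sketch of why rows are awkward, using hypothetical columns. Pulling out a row forces values of different types into one container, and looping over rows does the same work as a single column operation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "weight": [0.5, 1.5, 2.5],
        "label": ["a", "b", "c"],
    }
)

# A row mixes types, so pandas falls back to a generic object container.
row = df.iloc[0]

# Even an all-numeric row silently converts the integer count to a float.
numeric_row = df[["count", "weight"]].iloc[0]

# Column operation: one call on one typed array.
col_total = df["weight"].sum()

# Row-by-row loop: same answer, but a new row object is built at every step.
loop_total = 0.0
for _, r in df.iterrows():
    loop_total += r["weight"]
```

The loop and the column sum agree here, but the loop does far more work per value, which is where the slowdown on large frames comes from.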

Data Frames for Jupyter Languages

Data Frame DSLs

Disclaimer

Dave is about to give some opinions.

Core Operations

  • Create
    • From language objects
    • From external sources
  • Group Operations
  • Combine
  • Modify
    • Make new columns
    • Select columns
    • Filter rows
    • Summarize columns
    • Arrange by columns
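Here is one way these operations might look in pandas. It is a sketch, not a recommended style: the measurements come from the penguins printout, while the `islands` lookup table and the `bill_ratio` column are made up for illustration.

```python
import pandas as pd

# Create: from language objects (a dictionary of lists).
penguins = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_length_mm": [39.1, 39.5, 46.8, 50.4],
        "bill_depth_mm": [18.7, 17.4, 14.3, 15.7],
    }
)

# Modify: make a new column (bill_ratio is a hypothetical derived variable).
with_ratio = penguins.assign(
    bill_ratio=lambda d: d["bill_length_mm"] / d["bill_depth_mm"]
)

# Modify: select columns, filter rows, arrange by a column.
selected = with_ratio[["species", "bill_ratio"]]
long_bills = penguins[penguins["bill_length_mm"] > 40]
arranged = penguins.sort_values("bill_length_mm", ascending=False)

# Group operations plus summarize: one mean per species.
summary = penguins.groupby("species", as_index=False).agg(
    mean_bill_length=("bill_length_mm", "mean")
)

# Combine: join against a second (hypothetical) lookup table.
islands = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "island": ["Torgersen", "Biscoe"]}
)
combined = penguins.merge(islands, on="species", how="left")
```

Each step returns a new data frame and leaves `penguins` unchanged, so the steps can be chained or inspected one at a time.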

Benchmarks

In-Memory / In-Process Analytics

Remember: Big Data is dead!

Pandas

Python for Data Analysis

Data Representations and Transformation

That’s All Folks!