MP 03: ML System

Be aware of the general Machine Problem Policy document.


The goal of this MP is to create a machine learning system for classifying pitches thrown in MLB games.

It’s tough to make predictions, especially about the future.

– Yogi Berra


  • Due: Monday, April 22, 11:59 PM

GitHub Repository

To set up your repository for this MP, use the following link:


Pitch Classification

What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:

While the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.

Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:

That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:

But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:

Long story short, they:

  • Have advanced tracking technology that can instantly record speed, spin, and other characteristics for each pitch.
  • Have a trained classifier for pitch type based on the pitch’s characteristics.
  • In real time, make predictions of pitch type as soon as the pitch’s characteristics are recorded.
  • Display the result in the stadium and on the broadcast!

There are several ways to accomplish this task, but one decision we’ll make is to train one model per pitcher.
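As a rough illustration of the one-model-per-pitcher idea, the sketch below fits a separate classifier for each pitcher from speed and spin alone. It uses a simple nearest-centroid rule as a stand-in, which is not necessarily the model used in the example repository. The pitcher names, the sample pitches, and the function names are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(pitches):
    """Return {pitch_type: (mean_speed, mean_spin)} for one pitcher's pitches."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for speed, spin, pitch_type in pitches:
        totals = sums[pitch_type]
        totals[0] += speed
        totals[1] += spin
        totals[2] += 1
    return {pt: (t[0] / t[2], t[1] / t[2]) for pt, t in sums.items()}


def predict(centroids, speed, spin):
    """Return the pitch type whose centroid is closest to (speed, spin).

    Features are left unscaled here, so spin (rpm) dominates the distance;
    a real model would likely standardize them first.
    """
    return min(centroids, key=lambda pt: dist(centroids[pt], (speed, spin)))


# Hypothetical training data: pitcher -> list of (speed_mph, spin_rpm, pitch_type).
training = {
    "pitcher_a": [(95.0, 2300.0, "FF"), (96.0, 2350.0, "FF"),
                  (85.0, 2500.0, "SL"), (84.0, 2550.0, "SL")],
    "pitcher_b": [(90.0, 2200.0, "FF"), (91.0, 2150.0, "FF"),
                  (80.0, 1700.0, "CH"), (81.0, 1750.0, "CH")],
}

# One fitted model per pitcher, looked up by name at prediction time.
models = {name: fit_centroids(rows) for name, rows in training.items()}


def classify(pitcher, speed, spin):
    """Classify a pitch using only the model fitted for that pitcher."""
    return predict(models[pitcher], speed, spin)
```

Keeping a separate model per pitcher means each classifier only has to distinguish the pitch types that one pitcher actually throws.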

Baseball Data

Machine Learning


Video Walkthroughs

Example Repository

Directory Structure

Your completed MP should contain only the following directories and files:

├── data/
│   ├── ____.parquet
│   ├── ...
│   └── ____.parquet
├── models/
│   ├── ____.joblib
│   ├── ...
│   └── ____.joblib
├── .gitignore
├── requirements.txt

The ____.parquet and ____.joblib files correspond to the relevant pitchers listed below. You may commit a .scratch.ipynb if you’d like.
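Before submitting, it may help to confirm that every pitcher has both files. The sketch below is one way to check, assuming each file is named after some per-pitcher identifier; the actual naming scheme comes from the example repository, and `missing_artifacts` is a hypothetical helper, not part of the provided code.

```python
from pathlib import Path


def missing_artifacts(root, pitchers):
    """List expected data/model files that are absent under root.

    Assumes files are named <pitcher>.parquet and <pitcher>.joblib, where
    <pitcher> is whatever identifier the example repository uses.
    """
    root = Path(root)
    expected = []
    for name in pitchers:
        expected.append(root / "data" / f"{name}.parquet")
        expected.append(root / "models" / f"{name}.joblib")
    # Report paths relative to the repository root, with forward slashes.
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]
```

An empty list means every expected file was found.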


Starting from the structure and code given in the example repository, all students should re-run the relevant pipelines to produce data and models for the following pitchers.

Data for each pitcher should be gathered for the following dates, inclusive:

  • 2022-01-01
  • 2024-04-17
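Because the range is inclusive, pitches thrown on either endpoint date should be kept. A minimal sketch of that check follows; the `game_date` key and the `filter_rows` helper are assumptions for illustration, and the example repository's pipeline may handle dates differently.

```python
from datetime import date

START = date(2022, 1, 1)
END = date(2024, 4, 17)


def in_range(game_date, start=START, end=END):
    """True if game_date falls within [start, end], endpoints included."""
    return start <= game_date <= end


def filter_rows(rows, key="game_date"):
    """Keep rows whose ISO-formatted date string falls in the range.

    The "game_date" key is an assumed column name.
    """
    return [r for r in rows if in_range(date.fromisoformat(r[key]))]
```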

Chicago Cubs

Chicago White Sox

DDG Only Requirements

In addition to the above, students in the DDG section should modify the predict POST request to return the full name of pitches rather than the two character shortcode. For example, instead of FF, the API should return 4-Seam Fastball. To accomplish this, only the and files should be modified.
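One possible approach is a lookup table from shortcode to full name, applied just before the response is returned. The sketch below assumes a handful of common shortcodes and is not a complete list; `PITCH_NAMES` and `full_pitch_name` are hypothetical names, and where this logic belongs depends on the example repository's code.

```python
# Partial mapping of pitch shortcodes to full names (assumed, not exhaustive).
PITCH_NAMES = {
    "FF": "4-Seam Fastball",
    "SL": "Slider",
    "CH": "Changeup",
    "CU": "Curveball",
    "SI": "Sinker",
    "FC": "Cutter",
}


def full_pitch_name(code):
    """Return the full pitch name, or the code itself if it is not mapped."""
    return PITCH_NAMES.get(code, code)
```

Falling back to the original code keeps the API from failing on a pitch type that is missing from the table.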

Grading Rubric

For simplicity, the grading rubric will consist of many items, each scored 0, 1, or 2. These scores will generally take the meaning:

  • 2: Item completed fully and successfully.
  • 1: Item completed with minor issues.
  • 0: Item incomplete or completed with major issues.

As much as possible, grading of each item will be done independently. However, some items are clearly dependent on others, and often those items are of extra importance. As such, we reserve the right to, when appropriate, allow important items to (negatively) affect the grading of other items.

After cloning your repository, we will grade your package based on the following items:

  • All necessary packages to run the system are listed in a requirements.txt.
  • Grader is able to successfully run uv pip install -r requirements.txt.
  • Grader is able to successfully activate the virtual environment.
  • Provided file runs and starts the relevant server without error.
  • With the server running, provided runs without error.
  • A .parquet file in the data directory exists for each pitcher.
  • The parquet files are named appropriately.
  • A .joblib file in the models directory exists for each pitcher.
  • The .joblib files are named appropriately.
  • Running the provided calculates metrics for all requested pitchers.

DDG Grading Rubric

  • POST requests to the predict endpoint provide pitch classifications using the full pitch name.


Two forms of submission are required:

  • Push your code to GitHub.
    • This is how we will access your code.
  • Submit your repository URL to the Canvas assignment named MP 03.
    • This is how we will know your code is ready for grading, and will allow us to track late submissions.
    • You may only submit to Canvas once. Once you have submitted, we will grade your MP.
      • Once you have submitted to Canvas, you should make no further changes to the code pushed to GitHub.
    • Students in the DDG section will make an additional submission on Canvas to the assignment named MP 03 DDG.
      • Failure to submit to the DDG version in addition to the regular version will result in significant point loss.