MP 03: ML System
Goal
The goal of this MP is to create a machine learning system for classifying pitches thrown in MLB games.
It’s tough to make predictions, especially about the future.
– Yogi Berra
Deadlines
- Due: Monday, April 22, 11:59 PM
GitHub Repository
To set up your repository for this MP, use the following link:
Context
Pitch Classification
What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:
While the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.
Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:
That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:
But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:
The long story short is:
- Have advanced tracking technology that can instantly record speed, spin, and other characteristics for each pitch.
- Have a trained classifier for pitch type based on the pitch’s characteristics.
- In real time, make predictions of pitch type as soon as the pitch’s characteristics are recorded.
- Display the result in the stadium and on the broadcast!
There are several ways to accomplish this task, but one decision we’ll make is to train one model per pitcher.
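To see why one model per pitcher can help, here is a minimal sketch in plain Python. It is not the classifier from the example repository: it swaps in a simple nearest-centroid rule, and the pitcher names, pitch records, and the spin_scale value are all made up for illustration. The idea is only that the same speed and spin can point to different pitch types depending on who threw the pitch.

```python
from collections import defaultdict
from math import hypot


def fit_centroids(pitches):
    """Average (speed, spin) per pitch type for ONE pitcher.

    pitches: iterable of (speed_mph, spin_rpm, pitch_type) tuples.
    Returns {pitch_type: (mean_speed, mean_spin)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for speed, spin, label in pitches:
        total = sums[label]
        total[0] += speed
        total[1] += spin
        total[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(centroids, speed, spin, spin_scale=100.0):
    """Return the pitch type whose centroid is closest to (speed, spin).

    spin_scale is an assumed constant that keeps rpm from swamping mph.
    """
    return min(
        centroids,
        key=lambda label: hypot(
            speed - centroids[label][0],
            (spin - centroids[label][1]) / spin_scale,
        ),
    )


# Hypothetical history for two made-up pitchers.
history = {
    "pitcher_a": [(95.1, 2300, "FF"), (94.7, 2250, "FF"),
                  (84.2, 2500, "SL"), (85.0, 2550, "SL")],
    "pitcher_b": [(90.3, 2100, "FF"), (89.8, 2150, "FF"),
                  (78.5, 2700, "CU"), (79.1, 2650, "CU")],
}

# One model per pitcher: each pitcher is compared only to their own pitch mix.
models = {name: fit_centroids(rows) for name, rows in history.items()}

for name in models:
    print(name, predict(models[name], speed=85.0, spin=2400))
```

With this made-up data, an 85 mph pitch at 2400 rpm is closest to pitcher_a’s slider (SL) but closest to pitcher_b’s fastball (FF), which is the situation a single shared model would struggle with.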
Baseball Data
Machine Learning
Web API
Video Walkthroughs
Example Repository
Directory Structure
Your completed MP should contain only the following directories and files:
./your-github-repo-name/
│
├── data/
│ ├── ____.parquet
│ ├── ...
│ └── ____.parquet
│
├── models/
│ ├── ____.joblib
│ ├── ...
│ └── ____.joblib
│
├── .gitignore
├── README.md
├── client.py
├── make-data.py
├── make-metrics.py
├── make-models.py
├── requirements.txt
├── run.py
├── server.py
└── utils.py
The ____.parquet and ____.joblib files correspond to the relevant pitchers listed below. You may commit a .scratch.ipynb if you’d like.
Requirements
Starting from the structure and code given in the example repository, all students should re-run the relevant pipelines to produce data and models for the following pitchers.
Data for each pitcher should be gathered for the following dates, inclusive:
2022-01-01
2024-04-17
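Because the range is inclusive, pitches thrown on 2022-01-01 and on 2024-04-17 both belong in the data. A minimal sketch of that check, assuming each pitch record carries a game_date field (a hypothetical name; the example repository’s pipeline may handle dates differently):

```python
from datetime import date

START = date(2022, 1, 1)
END = date(2024, 4, 17)


def in_window(game_date, start=START, end=END):
    """True when game_date falls in [start, end], both endpoints included."""
    return start <= game_date <= end


# Hypothetical records; only the date matters for this check.
rows = [
    {"game_date": date(2021, 12, 31)},
    {"game_date": date(2022, 1, 1)},
    {"game_date": date(2024, 4, 17)},
    {"game_date": date(2024, 4, 18)},
]
kept = [row for row in rows if in_window(row["game_date"])]
print(len(kept))
```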
Chicago Cubs
Chicago White Sox
DDG Only Requirements
In addition to the above, students in the DDG section should modify the predict POST request to return the full name of pitches rather than the two-character shortcode. For example, instead of FF, the API should return 4-Seam Fastball. To accomplish this, only the server.py and utils.py files should be modified.
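One way to approach the shortcode translation is a plain lookup table. The sketch below is an assumption about how such a helper might look, not code from the example repository: the names PITCH_NAMES and full_pitch_name are hypothetical, and only the FF entry comes from this page. The other entries are common pitch names that you should check against the codes that actually appear in your data.

```python
# Hypothetical lookup table. Only the FF entry is taken from the assignment text;
# verify the rest against the pitch codes present in your own data.
PITCH_NAMES = {
    "FF": "4-Seam Fastball",
    "SI": "Sinker",
    "FC": "Cutter",
    "SL": "Slider",
    "CU": "Curveball",
    "CH": "Changeup",
}


def full_pitch_name(code):
    """Translate a shortcode to its full name, falling back to the code itself."""
    return PITCH_NAMES.get(code, code)


print(full_pitch_name("FF"))
```

A helper like this could live in utils.py and be applied in server.py to the model’s prediction just before the response is returned. Falling back to the original code means an unexpected pitch type still produces a response instead of an error.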
Grading Rubric
For simplicity, the grading rubric will consist of many items, each scored one of 0, 1, or 2. These scores will generally take the meaning:
- 2: Item completed fully and successfully.
- 1: Item completed with minor issues.
- 0: Item incomplete or completed with major issues.
As much as possible, grading of each item will be done independently. However, some items are clearly dependent on others, and often those items are of extra importance. As such, we reserve the right to, when appropriate, allow important items to (negatively) affect the grading of other items.
After cloning your repository, we will grade your package based on the following items:
- All necessary packages to run the system are listed in a requirements.txt.
- Grader is able to successfully run uv pip install -r requirements.txt.
- Grader is able to successfully activate the virtual environment.
- Provided run.py file runs and starts the relevant server without error.
- With the server running, provided client.py runs without error.
- A .parquet file in the data directory exists for each pitcher.
- The parquet files are named appropriately.
- A .joblib file in the models directory exists for each pitcher.
- The .joblib files are named appropriately.
- Running the provided make-metrics.py calculates metrics for all requested pitchers.
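Before submitting, you may want to check the file-related items yourself. The sketch below is a hypothetical self-check, not part of the provided code. It assumes each file is named after its pitcher (for example, data/pitcher-a.parquet and models/pitcher-a.joblib); the real naming scheme is whatever the example repository uses, so adjust the names accordingly.

```python
from pathlib import Path


def missing_files(repo_root, pitchers):
    """List expected data and model files that do not exist under repo_root.

    Assumes one data/<name>.parquet and one models/<name>.joblib per pitcher,
    which is a hypothetical naming scheme.
    """
    root = Path(repo_root)
    expected = [root / "data" / f"{name}.parquet" for name in pitchers]
    expected += [root / "models" / f"{name}.joblib" for name in pitchers]
    return [path.relative_to(root).as_posix() for path in expected if not path.is_file()]


# Placeholder pitcher names; replace them with the names your files actually use.
print(missing_files(".", ["pitcher-a", "pitcher-b"]))
```

An empty list means every expected file was found. It does not check that the files load correctly or that make-metrics.py runs.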
DDG Grading Rubric
- POST requests to the predict endpoint provide pitch classifications using the full pitch name.
Submission
Two forms of submission are required:
- Push your code to GitHub.
  - This is how we will access your code.
- Submit your repository URL to the Canvas assignment named MP 03.
  - This is how we will know your code is ready for grading, and will allow us to track late submissions.
  - You may only submit to Canvas once. Once you have submitted, we will grade your MP.
  - Once you have submitted to Canvas, you should make no further changes to the code pushed to GitHub.
  - Students in the DDG section will make an additional submission on Canvas to the assignment named MP 03 DDG.
    - Failure to submit to the DDG version in addition to the regular version will result in significant point loss.