Case study · 2026

NYC Subway Events from Ridership Data

A detector that recovers NYC’s 2024 event calendar (Knicks, Rangers, Mets and Yankees games, MSG and Barclays concerts, the US Open, parades, NYE) directly from MTA hourly subway ridership, without ever being shown an event calendar. It reaches 96.5% recall against an answer key of 513 known 2024 events.

Penn Station / MSG ridership during the 2024 Knicks vs Pacers Game 7, showing a sharp post-game spike above the event-aware baseline

View source on GitHub

Role

Solo: data engineering, modeling, signature analysis

Timeline

Several weeks of evenings, spring 2026

Scope

Pipeline, baselines, ground-truth scraping, clustering, viz

Stack

Python, pandas, NumPy, scikit-learn, SciPy, UMAP, holidays, beautifulsoup4, requests, matplotlib

Ingesting the data

The ingest stage downloads 2024 hourly ridership counts for five station complexes from the MTA’s Open Data portal. The download happens once and is cached locally as a Parquet file, so re-runs are fast and don’t depend on the MTA’s servers.

Building the baseline

The baseline is a table of 168 cells, one per hour of the week (24 hours × 7 days), each holding the median ridership for that slot. Monday 1pm is a cell; Friday 10pm is a cell. A separate table is built for each season, because a single table for the whole year would blend low-ridership winter periods with high-ridership summer periods, and the result wouldn’t represent either accurately.
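A minimal sketch of that table in pandas, for a single station. The column names (`timestamp`, `ridership`), the meteorological season split, and the synthetic data are all assumptions made for illustration; the project’s actual season boundaries aren’t stated here.

```python
import numpy as np
import pandas as pd

def season_of(month: int) -> str:
    """Map a calendar month to a meteorological season (an assumed split)."""
    return {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer"}.get(month, "fall")

def build_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Median ridership per (season, day-of-week, hour) slot."""
    keyed = df.assign(
        season=df["timestamp"].dt.month.map(season_of),
        dow=df["timestamp"].dt.dayofweek,  # 0 = Monday
        hour=df["timestamp"].dt.hour,
    )
    return (keyed.groupby(["season", "dow", "hour"])["ridership"]
                 .median()
                 .rename("baseline")
                 .reset_index())

# Synthetic year of hourly counts for one station, for illustration only.
rng = np.random.default_rng(0)
ts = pd.date_range("2024-01-01", "2024-12-31 23:00", freq="h")
demo = pd.DataFrame({"timestamp": ts, "ridership": rng.poisson(500, len(ts))})

baseline = build_baseline(demo)  # one 168-cell table per season, stacked
```

The median is used rather than the mean so that a few extreme hours in a slot don’t drag the cell upward.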

The baseline is then refit. When an event is detected, the program flags a 3-hour window around it and rebuilds the baseline with those windows excluded, so “normal” is built only from quiet, non-event days. Without this, frequent events at a venue get learned into the baseline as if they were normal.
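The refit can be sketched like this, assuming the 3-hour window is centered on the detected hour (one hour either side) and ignoring the season split for brevity. The function names and the synthetic data are hypothetical. The demo plants a surge on five of eight Friday 10pm slots so the effect on the median is easy to see.

```python
import numpy as np
import pandas as pd

def slot_medians(df: pd.DataFrame) -> pd.Series:
    """Median ridership per (day-of-week, hour) slot."""
    keys = [df["timestamp"].dt.dayofweek.rename("dow"),
            df["timestamp"].dt.hour.rename("hour")]
    return df.groupby(keys)["ridership"].median()

def event_mask(timestamps: pd.Series, detections, half_width="1h") -> pd.Series:
    """True for hours inside a window centered on any detection (assumed +/- 1 hour)."""
    pad = pd.Timedelta(half_width)
    mask = pd.Series(False, index=timestamps.index)
    for t in detections:
        mask |= timestamps.between(t - pad, t + pad)
    return mask

# Synthetic data: eight flat weeks, with a surge on five Friday 10pm slots.
ts = pd.date_range("2024-01-01", periods=8 * 168, freq="h")  # starts on a Monday
demo = pd.DataFrame({"timestamp": ts, "ridership": 100.0})
fridays_22 = demo.index[(ts.dayofweek == 4) & (ts.hour == 22)]
event_rows = fridays_22[:5]
demo.loc[event_rows, "ridership"] = 400.0

v1 = slot_medians(demo)  # first pass: events included
detections = demo.loc[event_rows, "timestamp"]
quiet = demo[~event_mask(demo["timestamp"], detections)]
v2 = slot_medians(quiet)  # refit: event windows excluded
```

In this toy data the first-pass Friday 10pm cell absorbs the surge because events make up most of that slot’s samples, while the refit cell falls back to the quiet-night level.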

Detecting anomalies

The program flags an unusual hour in two ways. The ratio flag triggers when ridership is at least 1.5× the baseline. The z-score flag measures how many standard deviations an hour sits above normal, where “normal” is the same hour-of-week slot in the previous four weeks (a rolling four-week window). The window is kept short so the comparison stays local in time and doesn’t average across genuinely different periods.
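Both flags can be sketched on a gap-free hourly series as follows. The 1.5× ratio and the four-week window come from the description above; the z-score cutoff of 3.0 is a placeholder assumption, and the function name, flat baseline, and synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_hours(ridership: pd.Series, baseline: pd.Series,
               ratio_min: float = 1.5, z_min: float = 3.0,
               weeks: int = 4) -> pd.DataFrame:
    """Flag hours by ratio to baseline and by z-score against prior same-slot weeks.

    Both series must share the same gap-free hourly index.
    """
    # Column k holds the value from k weeks earlier at the same hour of the week.
    prior = pd.concat([ridership.shift(168 * k) for k in range(1, weeks + 1)], axis=1)
    mean = prior.mean(axis=1, skipna=False)  # NaN until four prior weeks exist
    std = prior.std(axis=1, skipna=False)
    z = (ridership - mean) / std.replace(0, np.nan)
    return pd.DataFrame({
        "ratio": ridership / baseline,
        "z": z,
        "ratio_flag": ridership >= ratio_min * baseline,
        "z_flag": z >= z_min,
    })

# Synthetic data: six weeks of noise around 100 riders, one planted surge.
idx = pd.date_range("2024-03-04", periods=6 * 168, freq="h")
rng = np.random.default_rng(1)
rides = pd.Series(rng.normal(100, 5, len(idx)), index=idx)
spike_at = pd.Timestamp("2024-04-12 22:00")
rides.loc[spike_at] = 300.0
flat_baseline = pd.Series(100.0, index=idx)  # stand-in for the 168-cell table

flags = flag_hours(rides, flat_baseline)
```

A standard deviation estimated from only four samples is noisy, which is one reason to pair the z-score with the ratio flag rather than rely on it alone.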

Validating against known events

The detector works only from ridership and never sees a calendar. A separate list of 513 known 2024 events is the answer key, used after detection to check whether the flagged anomalies line up with real events.
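Scoring against the answer key might look like the sketch below. An event counts as recovered if any flagged hour at the same station falls within a tolerance of the event’s start time. The 3-hour tolerance, the column names, and the made-up rows are assumptions for illustration, not the project’s actual matching rule.

```python
import pandas as pd

def recall(events: pd.DataFrame, flagged: pd.DataFrame, tolerance: str = "3h") -> float:
    """Share of known events with at least one flagged hour nearby at the same station."""
    tol = pd.Timedelta(tolerance)
    hits = 0
    for event in events.itertuples(index=False):
        same_station = flagged.loc[flagged["station"] == event.station, "timestamp"]
        if ((same_station - event.start).abs() <= tol).any():
            hits += 1
    return hits / len(events)

# Made-up rows: two of the three events have a flagged hour within tolerance.
events = pd.DataFrame({
    "station": ["station_a", "station_b", "station_a"],
    "start": pd.to_datetime(["2024-05-19 15:30", "2024-03-02 20:00", "2024-01-10 19:30"]),
})
flagged = pd.DataFrame({
    "station": ["station_a", "station_b", "station_a"],
    "timestamp": pd.to_datetime(["2024-05-19 18:00", "2024-03-02 23:00", "2024-02-01 08:00"]),
})

score = recall(events, flagged)
```

Keeping the event list out of the detection step and using it only here is what makes the recall figure a fair test.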

That event list was built in two ways. Structured, predictable sources were scraped: Sports Reference lists every game in clean tables, and setlist.fm offers an official API. Irregular events like parades, which happen only once or twice a year and follow no schedule, were curated by hand, because building a scraper for a handful of one-off events isn’t worth it.

Fingerprinting and clustering events

Each event is reduced to five features. Before clustering, the features are z-scored, putting all five on the same scale so a large-number feature doesn’t dominate the smaller ones.

Events are then grouped with k-means. k-means builds whatever number of clusters it’s told to and can’t tell you whether that number was a good choice. Silhouette score is the grading tool: it measures how tight and well-separated the clusters are, higher being better. I swept k from 2 to 8 and compared silhouette scores.
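A sketch of the sweep with scikit-learn, z-scoring the features first with `StandardScaler`. The synthetic blobs stand in for the real five-feature event table, and `n_init` and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def sweep_k(features: np.ndarray, k_values=range(2, 9), seed: int = 0) -> dict:
    """Silhouette score for each candidate k on z-scored features."""
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
        scores[k] = silhouette_score(scaled, labels)
    return scores

# Synthetic stand-in: three well-separated blobs in five dimensions.
rng = np.random.default_rng(0)
centers = np.array([[8, 0, 0, 8, 0],
                    [0, 8, 0, 0, 8],
                    [0, 0, 8, 8, 8]], dtype=float)
demo = np.vstack([c + rng.normal(0, 1, size=(60, 5)) for c in centers])

scores = sweep_k(demo)
best_k = max(scores, key=scores.get)
```

On real event data the curve is rarely this clean, so the scores are better read as one input to the choice of k than as the final word.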

Silhouette favored k=2, but k=2 forces every indoor-arena event into a single bucket and hides the venue-versus-sport distinction the project was built to find. I used k=6 instead, which surfaces that structure. This was a deliberate trade: a lower silhouette score in exchange for an answer to the actual question. Ward hierarchical clustering was run as a cross-check on k-means, and UMAP was used only to visualize the data in 2D, never in the analysis itself, since collapsing five dimensions to two loses information.
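One way to run that cross-check is to fit both methods at the same k and compare the two labelings. Using the adjusted Rand index as the agreement measure is an assumption here, not something the write-up specifies, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

def cross_check(features: np.ndarray, k: int, seed: int = 0) -> float:
    """Agreement between k-means and Ward labels (1.0 means identical partitions)."""
    scaled = StandardScaler().fit_transform(features)
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
    ward_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(scaled)
    return adjusted_rand_score(km_labels, ward_labels)

# Synthetic stand-in: three well-separated blobs in five dimensions.
rng = np.random.default_rng(0)
centers = np.array([[8, 0, 0, 8, 0],
                    [0, 8, 0, 0, 8],
                    [0, 0, 8, 8, 8]], dtype=float)
demo = np.vstack([c + rng.normal(0, 1, size=(60, 5)) for c in centers])

agreement = cross_check(demo, k=3)
```

High agreement suggests the grouping reflects structure in the data rather than a quirk of one algorithm; low agreement is a cue to look harder at the chosen k.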

Testing the findings

Differences between event types were checked with p-values. A small p-value means a measured difference is unlikely to be luck; a large one means luck can’t be ruled out. The Knicks-versus-Rangers peak intensity difference came back at p = 0.0041: if there were no real difference, a gap at least that large would appear by chance only about 0.4% of the time, so luck is a very weak explanation.
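The “how often would chance alone produce this gap” reading maps directly onto a permutation test, sketched below. This is an illustrative stand-in: the write-up doesn’t say which test produced p = 0.0041, and the two samples here are synthetic, not the Knicks and Rangers measurements.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and re-measure the gap.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    # The +1 terms count the observed split itself, so p is never exactly 0.
    return (count + 1) / (n_permutations + 1)

# Synthetic peak-intensity samples for two event types (made up).
rng = np.random.default_rng(42)
group_a = rng.normal(2.4, 0.4, size=40)
group_b = rng.normal(2.0, 0.4, size=40)

p_demo = permutation_p_value(group_a, group_b)
```

The returned value is the fraction of label shuffles that produce a gap at least as large as the observed one, which is the quantity the paragraph above describes in words.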

What the refit actually did

The event-aware refit moved baseline cells in both directions. At MSG, early-evening hours rose and late-evening hours fell. The late hours fell because the first-pass baseline (v1) had treated post-game exit crowds as normal ridership, inflating those cells; the refit baseline (v2) excludes those event hours, so “normal” reflects genuinely quiet nights. That correction is where most of the detection gain came from: once the late-evening baseline is accurate, a real game’s exit surge stands out clearly.