On an average weekday, over 750,000 riders use the Chicago Transit Authority’s iconic ‘L’ rail system to get around the city. It is a critical piece of Chicago’s infrastructure (and one that we’ve written about before), allowing residents to quickly and cheaply navigate the country’s third-largest city.
The CTA’s daily logistics operations manage 8 interconnected rail lines (colorfully named the Red, Blue, Green, Brown, Pink, Orange, Purple, and Yellow lines), 145 entry/exit stations, and over 2,300 combined railcar trips every day. This logistics team is responsible for scheduling trains, employees, railcar maintenance, etc., in order to meet rider demand as efficiently as possible (and as a daily rider myself in pre-pandemic times, I must say they already do a wonderful job!).
Can we leverage our ML/AI tools to help the CTA in this endeavor? Specifically, can we build a forecasting tool to accurately predict daily ridership to help optimize CTA operational logistics?
In contrast to a single univariate forecasting problem (such as “How much revenue do we expect to make over the next year?” or “What is the anticipated high temperature two weeks from today?”), for this problem we want to predict multiple outcomes simultaneously - namely, the daily ridership of each individual train line. However, we expect the daily ridership of these lines to be related, as common factors such as day of the week or temperature will simultaneously affect all of the train lines. As such, we can increase our forecasting accuracy by incorporating external information about these common factors, and making simultaneous predictions for the ridership of each train line.
Here at Strong Analytics, we are often tasked with solving problems involving forecasting - so much so that we’ve developed our own Python package, torchcast, for this specific purpose. torchcast is a powerful forecasting tool for building state-space models (including exponential smoothing and the Kalman filtering algorithm). In addition to being able to predict multiple outcomes, this package also has the advantage of allowing for batch training/forecasting and allowing the user to combine forecasting models with custom neural-network models for making predictions. It’s a public, open-source package built on top of the robust PyTorch package - try it for yourself!
The CTA began publishing daily ridership data at the individual station level in 2001. This will be our foundational dataset that we use to train our model. In order to transform this station-level data to line-level data, we will use a supplementary CTA dataset of station information that details which of these stations are associated with which train lines. For stations which are transfer stations (i.e. multiple train lines connect at a single station), we split that station’s ridership evenly between all of its associated lines.
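To make the even-split rule concrete, here is a minimal sketch in plain Python. The station names, line assignments, and ride counts are illustrative placeholders rather than values from the CTA datasets, and `rides_per_line` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical station-to-line mapping and one day of station-level entries.
station_lines = {
    "Belmont": ["Red", "Brown", "Purple"],  # a transfer station
    "Addison": ["Red"],
    "Damen": ["Blue"],
}
station_rides = {"Belmont": 9000, "Addison": 6000, "Damen": 5000}


def rides_per_line(station_rides, station_lines):
    """Split each station's rides evenly across the lines it serves."""
    totals = defaultdict(float)
    for station, rides in station_rides.items():
        lines = station_lines[station]
        share = rides / len(lines)  # equal share for every associated line
        for line in lines:
            totals[line] += share
    return dict(totals)


line_totals = rides_per_line(station_rides, station_lines)
```

Because every station's rides are fully distributed, the line-level totals always sum back to the station-level total, which is a handy sanity check on the transformation.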
So, what does this ridership data look like?
We see some clear periodic trends. Ridership slowly rises and falls over the course of a calendar year, and oscillates more rapidly over the course of each week. On top of these periodic trends, we see the effects of other factors: ridership dips sharply around the holiday season each year, and it was disrupted by a long-term construction project during the summer/fall of 2013, which affected a large portion of the Red Line on the south side of the city.
So, as we can see even from this one slice of one train line’s data, many external factors influence whether or not Chicagoans ride the train on any particular day. Capturing these predictors of ridership would allow us to infer daily trends. Some of the most obvious factors include:
Daily weather, via publicly available historical weather data from NOAA. We can envision a couple of potentially competing factors here:
- When it’s warmer out, people are more likely to venture out and do things.
- But if it’s nice enough outside, they may seek other means of transport (biking, walking, etc.).
Weekday/weekend/holidays/events. The CTA is primarily used by commuters traveling to/from the downtown Loop area for work, and as such weekday vs. weekend trends are expected to differ quite dramatically.
Furthermore, ridership is expected to differ on holidays, which may potentially suppress ridership (since people aren’t commuting to work), but also may potentially increase ridership (as people flock to various celebrations around the city). We can see this effect by looking at the 5 lowest ridership days in the whole dataset - they’re all Christmas Day:
However, the 5 highest days in our dataset represent moments of celebration in recent Chicago history:
- Chicago Cubs World Series Championship parade
- World Series Game 3 (1st Cubs home game in WS)
- Chicago Blackhawks Stanley Cup Championship parade
- Grant Park July 4th fireworks celebration
- The day after the Cubs won the World Series (they won at 11:30pm the previous night; everyone partied for a while and then went home)
Clearly, this city loves its sports! Days with Chicago Cubs home games boost the average ridership on the Red Line (which services the Cubs’ ballpark, Wrigley Field) by over 13,000 people!
So, we clearly need to capture information about these external factors in order to effectively predict ridership. To do this, I’ve added a few features to our dataset to serve as predictors:
- Daily temperature highs.
- Daily precipitation measurements.
- Flag for whether each day is a weekday or a weekend day.
- Flag for whether each day is a holiday.
- Flag for whether a Cubs home game occurred on that day.
(Ideally, we would also incorporate flags for if construction was happening along a particular train line, but historical data on when construction projects occurred is surprisingly difficult to find.)
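As a rough illustration of how the predictors listed above might be assembled for a single day, here is a small sketch in plain Python. The `HOLIDAYS` and `CUBS_HOME_GAMES` sets are tiny stand-ins for a real holiday calendar and game schedule, the weather values passed in are placeholders, and `build_features` is a hypothetical helper.

```python
from datetime import date

# Illustrative stand-ins; in practice these would be loaded from a holiday
# calendar and the Cubs' published schedule.
HOLIDAYS = {date(2016, 7, 4), date(2016, 12, 25)}
CUBS_HOME_GAMES = {date(2016, 10, 28)}


def build_features(day, temp_high_f, precip_in):
    """Return one row of predictors for a given calendar day."""
    return {
        "temp_high": temp_high_f,                    # daily high temperature
        "precip": precip_in,                         # daily precipitation
        "is_weekend": int(day.weekday() >= 5),       # Saturday=5, Sunday=6
        "is_holiday": int(day in HOLIDAYS),
        "is_cubs_home_game": int(day in CUBS_HOME_GAMES),
    }


# Placeholder weather values, not actual NOAA measurements.
christmas = build_features(date(2016, 12, 25), temp_high_f=35.0, precip_in=0.0)
game_day = build_features(date(2016, 10, 28), temp_high_f=60.0, precip_in=0.0)
```

Encoding the calendar effects as 0/1 flags keeps them on a comparable footing, so each one ends up with a single coefficient that can be read as "the change in ridership when this condition holds."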
Building a Model
Our goal is to build a model that can predict ridership across each of the CTA train lines on any given day. In the realm of predictive analytics, this is referred to as a hierarchical model, indicating that this problem consists of predictions that could potentially be aggregated/disaggregated into other predictions. In the framework of this problem, we could imagine a ‘bottom-up’ hierarchical approach of predicting the ridership of individual train lines, which can then be aggregated into a single prediction of total daily ridership on the CTA.
Alternatively, we could take a ‘top-down’ hierarchical approach, in which we make a single prediction of total daily ridership, and then use some other assumptions (such as the average distribution of total riders among the individual train lines) to disaggregate that single prediction into lower-level predictions for the individual train lines. (For a more in-depth discussion of these different flavors of hierarchical modeling, we recommend this excellent chapter from Forecasting: Principles and Practice by Hyndman & Athanasopoulos.)
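The difference between the two approaches can be sketched in a few lines of plain Python. The forecasts and historical shares below are made-up round numbers for three lines, and `bottom_up` / `top_down` are hypothetical helpers used only for illustration.

```python
# Hypothetical per-line forecasts for a single day.
line_forecasts = {"Red": 200_000.0, "Blue": 150_000.0, "Brown": 50_000.0}

# Hypothetical average share of total ridership carried by each line.
historical_shares = {"Red": 0.5, "Blue": 0.375, "Brown": 0.125}


def bottom_up(line_forecasts):
    """Aggregate per-line forecasts into a system-wide total."""
    return sum(line_forecasts.values())


def top_down(total_forecast, shares):
    """Disaggregate a system-wide forecast using fixed per-line shares."""
    return {line: total_forecast * share for line, share in shares.items()}


total = bottom_up(line_forecasts)
split = top_down(total, historical_shares)
```

In this toy case the two directions agree exactly, because the shares were chosen to match the per-line forecasts. With real data they generally will not: top-down assumes the split between lines is stable, while bottom-up lets each line respond to its own predictors.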
For this problem, I’m going to take the ‘bottom-up’ approach and train our model on per-line ridership. torchcast’s approach to batch training allows the model to share some information across lines and use it to avoid overfitting within each line. torchcast also supports multivariate forecasting, in which the model learns how the ridership of each train line is correlated with that of all other train lines (in addition to the external predictors discussed above).
Defining our Model Framework
Having collected our data on the daily ridership per train line, and incorporated the external factors listed above, we’re ready to define our model.
The torchcast package described above allows us to define an arbitrary number of separate ‘state-space’ processes that each contribute to the variations over time observed in the data. For example, we can define periodic variations of arbitrary frequency, such as the (low-frequency) seasonal ebbs and flows of ridership, or the (high-frequency) weekly cycles that arise from the work week. In addition to that, we can define non-periodic trends, like the steady increase in CTA usage over time. We are essentially telling the model “We expect any or all of these trends to be present in this data, try to identify and quantify them.” The Kalman-filtering algorithm then combines the information from these individual processes to generate its forecasts. One further advantage of the torchcast package is that, because it is built on top of PyTorch, it lets us combine these ‘state-space’ processes with arbitrary PyTorch neural networks. This allows us to inject prior knowledge where applicable (via these processes) while still leveraging the power of deep learning throughout our model.
For each train line we’re predicting, we’ll define two periodic trends: a yearly component, and a weekly component (each defined as a Fourier series).
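For readers unfamiliar with Fourier-series seasonality, the idea is to represent a repeating pattern as a weighted sum of sine and cosine waves whose periods divide the season length. The sketch below builds those sine/cosine terms in plain Python; it is a generic illustration rather than torchcast's internal implementation, and the number of harmonics `K` is an assumption, since the choice used for this model isn't stated here.

```python
import math


def fourier_features(t, period, K):
    """Return [sin, cos] pairs for harmonics k = 1..K at time index t."""
    feats = []
    for k in range(1, K + 1):
        angle = 2.0 * math.pi * k * t / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


# t is measured in days since the start of the series.
weekly = fourier_features(t=3, period=7, K=2)        # 4 weekly terms
yearly = fourier_features(t=3, period=365.25, K=3)   # 6 yearly terms
```

The model then learns one weight per term. More harmonics allow a more jagged seasonal shape (useful for the sharp weekday/weekend contrast), while fewer harmonics give the smooth rise and fall seen over a calendar year.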
Furthermore, we’ll also include two non-periodic components: a “local level,” which attempts to capture brief, rapid deviations away from the typical values (for example, perhaps some road construction in the downtown area leads to 10,000 extra people riding the trains for a few days, rather than driving to work), and a “local trend,” which attempts to capture longer-term trends in the overall magnitude of the values of interest (for example, total yearly ridership has steadily risen over the span of this dataset).
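To give a feel for how a Kalman filter tracks a “local level,” here is a minimal textbook-style sketch for a single series in plain Python. It models the level as a random walk observed with noise; it is not torchcast's implementation, and the variances and observations are hypothetical values chosen for illustration.

```python
def local_level_filter(observations, process_var, measure_var,
                       init_mean=0.0, init_var=1e6):
    """Scalar Kalman filter for a random-walk level observed with noise.

    Returns a list of (filtered_mean, filtered_variance) per observation.
    """
    mean, var = init_mean, init_var
    filtered = []
    for y in observations:
        # Predict: the level carries over, but our uncertainty grows.
        var = var + process_var
        # Update: move toward the observation in proportion to the gain.
        gain = var / (var + measure_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append((mean, var))
    return filtered


# Hypothetical series with a sudden shift in level partway through.
observations = [10.0, 10.4, 9.7, 10.2, 14.8, 15.1, 15.3]
states = local_level_filter(observations, process_var=0.1, measure_var=1.0)
```

A larger `process_var` lets the level chase sudden shifts quickly, while a smaller one smooths them out. A local trend extends the same idea by also tracking a slope that drifts slowly over time.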
Lastly, we’ll add in some simple linear model components that attempt to map the predictors outlined above onto these trends. For each of the individual train lines we are making predictions for, we’ll have a quantified relationship between that train line and each of our predictor variables.
Fitting the Model
To fit this model, I’m going to take a 10-year chunk of this dataset, spanning from the Summer of 2008 to the Summer of 2018. I’ll set aside the final year of this data as validation data to test our final model, which leaves me with 9 years of training data.
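A date-based split like this one can be sketched as follows. The exact boundary dates are assumptions (“Summer” is taken to mean June 1 here), the ridership values are placeholders, and `split_by_date` is a hypothetical helper.

```python
from datetime import date, timedelta


def split_by_date(records, split_date):
    """Split (date, value) records into train (before) and validation (on/after)."""
    train = [(d, y) for d, y in records if d < split_date]
    val = [(d, y) for d, y in records if d >= split_date]
    return train, val


# Assumed boundaries: 10 years of data, with the final year held out.
start, end, split = date(2008, 6, 1), date(2018, 6, 1), date(2017, 6, 1)

# Placeholder ridership of 0.0 for every day in the window.
records = [(start + timedelta(days=i), 0.0) for i in range((end - start).days)]
train, val = split_by_date(records, split)
```

Splitting on a date, rather than sampling days at random, matters for forecasting: the validation year must come strictly after the training period so that the model is never evaluated on days it could have "seen around."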
Training the model on this data yields the following:
Not bad! We’re clearly capturing the primary cyclical trends, even correctly fitting some weird local effects like the bump in Green Line ridership when portions of the Red Line were under construction. We can also see that we do a good job predicting the year’s worth of validation data. It’s a little hard to see the details since we’re spanning a full decade here, so let’s zoom in on one of these lines (the Blue Line) and select a 2-year region, containing 1 year of training data and 1 year of validation data.
We can see that while our model correctly predicts the long-term general behavior over the course of the year, we still have a few sources of error:
- Small, high-frequency noise: this represents small daily errors in our predictions - no model will ever be perfectly accurate, so we expect these small deviations between our predictions and the actual values.
- Larger, less-frequent spikes: These could be due to either isolated, one-off days with increased ridership (such as the days of celebration discussed earlier), or some other effect that we are currently not accounting for, but could be predicted based on additional external predictors.
Kalman filtering also provides us with credible regions on these predictions (seen as the gray band on either side of our colored prediction lines). In predictive modeling, this uncertainty is often just as valuable as the raw prediction - we know that we’re not going to be 100% accurate all of the time, so having an accurate estimate of just how wrong we can reasonably expect to be is extremely important! In this case, our validation data falls within these bands over 94% of the time; since we’re showing 95% intervals, this suggests our model’s uncertainty is accurate and well-calibrated.
Lastly, torchcast lets us visualize how each of the predictive features we included in our model affects our predictions - a few of those are shown here:
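The calibration check described here boils down to counting how often the actual values land inside the predicted band. Below is a minimal sketch in plain Python; it assumes Gaussian predictions (so a 95% interval is the mean plus or minus 1.96 standard deviations), and the numbers are toy values, not our model's output.

```python
def prediction_interval(mean, std, z=1.96):
    """Return (lower, upper) bounds of a symmetric interval around the mean."""
    return mean - z * std, mean + z * std


def interval_coverage(actuals, means, stds, z=1.96):
    """Fraction of actual values that fall inside their prediction interval."""
    inside = 0
    for y, m, s in zip(actuals, means, stds):
        lower, upper = prediction_interval(m, s, z)
        inside += lower <= y <= upper
    return inside / len(actuals)


# Toy example: the last actual value is far outside its interval.
actuals = [100.0, 130.0, 95.0, 160.0]
means = [105.0, 120.0, 100.0, 110.0]
stds = [10.0, 10.0, 10.0, 10.0]
coverage = interval_coverage(actuals, means, stds)
```

Coverage well below the nominal level suggests the model is overconfident, while coverage well above it suggests the bands are wider than they need to be.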
We can see that the qualitative behavior for each of these predictors is as we expect:
- The higher temperatures during the summertime months increase ridership, while the lower temperatures in the winter suppress it.
- Days with precipitation always see decreased ridership.
- Federal holidays also see decreased ridership.
- Each summer, we see increased ridership on days when the Cubs have a home game.
While these effects are all intuitively what we expect, the advantage now is that we can leverage this model (and its quantitative knowledge of the relationship between these predictors and ridership) to reason about how our predictor variables combine, allowing us to ask questions such as “How do we expect ridership on the Red Line to change for a Cubs game on a holiday with 0.5 inches of predicted rainfall?” This is in addition to the usual benefits of predictive modeling, such as quantifying relationships like “every 1 degree increase in temperature increases average ridership by 0.5%.”
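Because the predictors enter the model through linear components, answering a “what if” question like the one above amounts to adding up each predictor's contribution. The sketch below shows the arithmetic in plain Python; the coefficients are hypothetical round numbers chosen purely for illustration, not values estimated by our model.

```python
# Hypothetical coefficients for one line (riders per unit of each predictor).
coefs = {
    "is_cubs_home_game": 10_000.0,  # riders added on a home-game day
    "is_holiday": -20_000.0,        # riders lost on a holiday
    "precip": -4_000.0,             # riders lost per inch of precipitation
}


def predictor_effect(coefs, features):
    """Sum of coefficient * value over the supplied predictors."""
    return sum(coefs[name] * value for name, value in features.items())


# A Cubs home game on a holiday with 0.5 inches of rain.
scenario = {"is_cubs_home_game": 1, "is_holiday": 1, "precip": 0.5}
effect = predictor_effect(coefs, scenario)
```

With these placeholder coefficients the holiday effect outweighs the game-day boost. The sign and size of the real answer depend entirely on the fitted coefficients, and on the assumption that the effects are additive, with no interaction between, say, game days and rain.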
Forecasting models may not be able to perfectly predict the future, but they can allow organizations to anticipate and prepare for future circumstances in order to optimize their operations. Even with this simple prototype model built using a few publicly available datasets, we were able to show that predictive modeling can provide an organization as complex as the CTA with valuable information about the daily operation of each of their individual train lines, allowing them to more effectively serve the residents of Chicago. Furthermore, we’ve shown that state-space models, and specifically torchcast, are well-suited to these types of problems. These models allow the user to quickly and effectively leverage intuition about a problem and turn that intuition into quantifiable relationships, enabling smarter, data-driven business decisions.