Predicting the outcome of a presidential election is a fascinating challenge that has attracted the attention of data scientists, statisticians, and political analysts for decades. The 2024 U.S. Presidential Election is no exception. In 2008, Nate Silver popularized a data-driven approach to forecasting elections by aggregating polls and applying rigorous statistical analysis. These methods have proven to be quite effective, but they come with inherent limitations, such as severe biases in polling and difficulty accounting for unexpected events. Silver has tried to address these limitations with pollster ratings that adjust for each pollster's bias, but what happens when nearly the entire universe of pollsters is biased in the same direction? The result is a catastrophic model failure, as in the historic 2016 election, when FiveThirtyEight gave Hillary Clinton a 71% probability of winning.
Instead, I decided to model the underlying fundamentals that affect poll results by harnessing the power of machine learning to predict the election with even greater accuracy. Unlike traditional statistical methods, ML can handle vast and complex datasets, learning and adapting as new information comes in. This adaptability makes it particularly well-suited for elections’ dynamic and often unpredictable nature.
This model will be integrated via Allora Network, as part of a new topic aimed at aggregating multiple models to produce the most accurate election predictions. With Allora, we can assimilate multiple predictive models and datasets to create a model that is greater than the sum of its parts. This integration enhances the robustness and accuracy of our election forecasts by utilizing diverse data sources and modeling techniques within a unified framework.
Defining the Problem
Predicting the presidential election outcome isn’t as simple as guessing who will win the popular vote. The Electoral College system in the United States designates 51 separate races (50 states plus Washington D.C.). Each race contributes to the national result, making it a complex puzzle where every piece matters.
Key Considerations:
- Individual State Races: Each state has its unique political landscape, demographic makeup, economy, and voting laws, making predicting each state's result a distinct challenge.
- National Aggregation: The overall election result depends on the electoral college system, not just the popular vote, adding a layer of complexity to the prediction task.
Choosing the Data Sources
I incorporated diverse data sources to build a robust ML model for predicting election outcomes. These include:
- Historical Polling Data: I collected Presidential approval data going back to the 1940s. Here is the spread of “Approve - Disapprove” for the last 15 Presidents:
Then, by flipping the sign for one party, we can get a “national sentiment indicator” that shows whether the nation as a whole leans to the left or right:
- Macroeconomic Data: “It’s the economy, stupid!”
Bill Clinton's campaign coined this phrase in the 1992 election cycle. Election years, especially those where the incumbent is running for re-election, are judgments of how the incumbent party did in its first term. No matter how much politicians try to use wedge issues to get voters on their side, voters respond to the pain in their pocketbooks.
Here are the Presidential approval ratings plotted alongside 12-month inflation. These two series have a correlation of -0.28, making inflation a significant determinant in election results. In particular, when inflation reaches extreme values (>7%), the mean approval spread is 22 points lower than when inflation is under control. This is particularly significant in the 2024 election, as inflation reached 40-year highs during the current presidential term.
Besides inflation, we are also looking at each state’s unemployment and housing affordability metrics.
- Historic State Elections
Past election results at the state level provide valuable insights into voting patterns. States tend to vote along historical lines with minor deviations. We calculate the long-term means for each state as well as trends that might tell us whether a battleground state will flip.
- Historic House Elections
While Presidential elections run every four years, House elections run every two years. This gives us advance insight into trends that might manifest in the Presidential election cycle.
- Racial Demographics
Voter preferences vary widely across different racial groups. White voters favor the Republican Party by 15%, whereas Black voters favor the Democratic Party by roughly 60% (although this trend is starting to change). I collected demographic data to capture these dynamics as well as underlying trends.
Across multiple states, we see a general trend of the white population decreasing while Black, Latino, and Asian populations are increasing.
- Geography
States share values and cultural similarities with others in their geographical area. For instance, Maine, Connecticut, New Hampshire, and Vermont share a common New England cultural background of being Pilgrim colonies. In contrast, Deep South states like Mississippi, Alabama, and Georgia have another cultural background based on their agrarian economy and racial histories.
- Illegal Immigration Count, Voter ID Laws, and Vote by Mail Status
Despite constant claims by mainstream media that voter fraud is nonexistent, it doesn’t take a data genius to see how a lack of voter ID plus a large illegal immigrant population could have an illegal influence on election results. Indeed, when we plot the illegal immigrant population against election results, we get the following:
With a correlation of -0.24, the illegal immigrant population is almost as powerful a predictor as inflation.
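Two of the feature constructions described in this section, the sign-flipped national sentiment indicator and the per-state long-term means and trends, can be sketched in a few lines of Python. This is a minimal illustration rather than the actual pipeline: the function names, the choice that positive values mean a right-leaning mood, the least-squares trend, and all sample numbers are assumptions made for the example.

```python
def national_sentiment(approval_spread, president_party):
    """Sign the approval spread so positive values mean a right-leaning mood.

    Assumption: the sign is flipped for Democratic presidents; the opposite
    convention would work equally well as long as it is applied consistently.
    """
    if president_party not in ("R", "D"):
        raise ValueError("president_party must be 'R' or 'D'")
    sign = 1.0 if president_party == "R" else -1.0
    return sign * approval_spread


def long_term_mean(spreads):
    """Average R-D spread for one state across past elections."""
    return sum(spreads) / len(spreads)


def linear_trend(years, spreads):
    """Least-squares slope of spread versus year (points per year)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(spreads) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, spreads))
    var = sum((x - mean_x) ** 2 for x in years)
    return cov / var


# Hypothetical values for illustration only, not real data.
approval_samples = [("R", 10.0), ("D", 10.0), ("D", -20.0)]
sentiment = [national_sentiment(spread, party) for party, spread in approval_samples]

years = [2008, 2012, 2016, 2020]
state_spreads = [-4.0, -2.0, 0.0, 2.0]  # R-D spread in one made-up state
state_mean = long_term_mean(state_spreads)
state_trend = linear_trend(years, state_spreads)
```

In this toy state, the long-term mean still leans Democratic while the trend is moving toward Republicans, which is the kind of signal that might flag a potential flip.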
Choosing the Target Variable
The choice of the target variable is crucial for building an effective ML model. For this task, we consider several potential target variables:
- Spread (R-D): The difference in vote share between the Republican and Democratic candidates.
- Deviation from the National Average: How each state's spread deviates from the national polling average.
- Deviation from Long-Term Means: How current results compare to historical voting patterns in each state.
So which target do we choose? Ultimately, I wanted to capture the election dynamics from as many angles as possible, so I created models for all these responses and then took an average across all the resulting predictions.
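As a rough sketch of how the three targets relate, the snippet below derives each one from a state's spread and then maps per-target predictions back to a common spread scale before averaging. The function and key names are hypothetical, and converting back to a spread before averaging is an assumption about how predictions on different scales could be combined.

```python
def build_targets(state_spread, national_spread, state_long_term_mean):
    """Derive the three candidate targets from a state's R-D spread."""
    return {
        "spread": state_spread,
        "dev_from_national": state_spread - national_spread,
        "dev_from_long_term_mean": state_spread - state_long_term_mean,
    }


def to_spread(target_name, prediction, national_spread, state_long_term_mean):
    """Convert a prediction for any target back to an R-D spread."""
    if target_name == "spread":
        return prediction
    if target_name == "dev_from_national":
        return prediction + national_spread
    if target_name == "dev_from_long_term_mean":
        return prediction + state_long_term_mean
    raise ValueError(f"unknown target: {target_name}")


def average_spread(predictions, national_spread, state_long_term_mean):
    """Simple average of all model predictions on the spread scale."""
    spreads = [
        to_spread(name, value, national_spread, state_long_term_mean)
        for name, value in predictions
    ]
    return sum(spreads) / len(spreads)


# Hypothetical numbers for one state, for illustration only.
targets = build_targets(state_spread=5.0, national_spread=2.0, state_long_term_mean=8.0)
model_predictions = [
    ("spread", 4.0),
    ("dev_from_national", 3.0),
    ("dev_from_long_term_mean", -2.0),
]
combined = average_spread(model_predictions, national_spread=2.0, state_long_term_mean=8.0)
```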
During the model training, we must be mindful of weighting the samples because not every state is created equal. More populous states such as California and Texas have more electoral votes than sparsely populated ones such as Wyoming or North Dakota, which means their samples must be weighted higher in the training. Also, regime changes occur in the data as the nation reacts to emergent issues (such as technological innovation, geopolitical tensions, demographic change, or illegal immigration). Because of this, we have to give a larger sample weight to recent elections than those in the distant past. I chose to do this using an exponential weighting scheme.
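One way to express this weighting is sketched below. It assumes that the state-size weight is simply the state's electoral vote count, that recency decays with a half-life measured in years, and that the two factors are multiplied together. The function name and the half-life parameterization are illustrative choices, not necessarily the scheme used here.

```python
def sample_weight(electoral_votes, election_year, current_year, half_life_years):
    """Weight = electoral votes x exponential recency decay.

    A sample that is `half_life_years` old counts half as much as a
    current one (assumed parameterization of the exponential scheme).
    """
    age = current_year - election_year
    recency = 0.5 ** (age / half_life_years)
    return electoral_votes * recency


# Illustrative calls: a large state in the current cycle versus a small
# state two cycles back, with an assumed 8-year half-life.
w_large_recent = sample_weight(54, 2024, 2024, half_life_years=8.0)
w_small_older = sample_weight(3, 2016, 2024, half_life_years=8.0)
```

Sweeping `half_life_years` over several values is one way to produce the family of models with different exponential weighting parameters.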
At this point, I have 3 different targets, plus various choices for the exponential weighting parameter, yielding a large number of potential models. By taking a simple average across all the predictions, we get the following results:
It’s important to note that we have to diligently remove any look-ahead bias where it might creep in. Results were generated using out-of-sample data to understand better how the model will perform live.
Calculating the Probability
However, it’s not enough to make a point prediction for the election winner. We want to quantify the win probabilities, so I used quantile regression. This method allows us to predict the spread at various probability levels. By examining where the predicted quantiles for Democrats and Republicans cross, we can determine the probability of each state swinging one way or the other. This probabilistic approach gives us a more detailed and robust prediction than a simple binary win/lose model.
Steps in Quantile Regression:
- Model Training: Train the quantile regression model on historical data to predict the conditional quantiles between 0 and 1 for each state’s election result spread (R-D).
- National Result Aggregation: At each quantile, calculate the winner in each state and the national election.
- Find The Crossover Quantile: Find the quantile at which probabilities cross from Democrat to Republican victory.
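The aggregation and crossover steps above can be sketched as follows, taking the per-quantile state predictions as given rather than training a model. The sketch assumes that the same quantile level is applied to every state at once (that is, states move together), that a positive spread means a Republican win, and that the win probability is read off as one minus the crossover quantile. All names and numbers are hypothetical.

```python
def republican_electoral_votes(spreads, electoral_votes):
    """Sum electoral votes of states whose predicted R-D spread is positive."""
    return sum(ev for state, ev in electoral_votes.items() if spreads[state] > 0)


def crossover_quantile(quantile_spreads, electoral_votes):
    """Lowest quantile on the grid at which Republicans win a majority.

    Returns None if no quantile on the grid produces a Republican majority.
    """
    majority = sum(electoral_votes.values()) // 2 + 1
    for q in sorted(quantile_spreads):
        if republican_electoral_votes(quantile_spreads[q], electoral_votes) >= majority:
            return q
    return None


def republican_win_probability(quantile_spreads, electoral_votes):
    """Approximate P(Republican win) as 1 - crossover quantile.

    Precision is limited by the spacing of the quantile grid.
    """
    q_star = crossover_quantile(quantile_spreads, electoral_votes)
    return 0.0 if q_star is None else 1.0 - q_star


# Toy three-state country with made-up predictions, for illustration only.
toy_votes = {"A": 10, "B": 6, "C": 5}  # 21 votes in total, 11 needed to win
toy_quantiles = {
    0.25: {"A": -3.0, "B": -1.0, "C": 2.0},
    0.50: {"A": -1.0, "B": 1.0, "C": 4.0},
    0.75: {"A": 2.0, "B": 3.0, "C": 6.0},
}
q_star = crossover_quantile(toy_quantiles, toy_votes)
p_republican = republican_win_probability(toy_quantiles, toy_votes)
```

With the real 538 electoral votes, the same majority rule gives the familiar 270-vote threshold. A finer quantile grid narrows the interval in which the crossover can fall.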
Result
On Jun 3, 2024, my model predicted a 62.5% chance of Republican victory and a 37.5% chance of Democratic victory. It’s important to note that these probabilities reflect the entire party’s chance of winning, not a single candidate. Because Trump is the only candidate in the Republican Party, the entire 62.5% goes to Trump. In the Democratic Party, however, a Vice President and multiple governors are waiting in the shadows like a pack of hungry vultures in case a geriatric Biden doesn’t make it. All of them combined got a predicted 37.5% chance of victory.
At that time, Polymarket shares of Trump were trading at $0.53, while Biden was sitting at around $0.38, and Michelle Obama and Gavin Newsom had a few cents each. According to our model, Trump would have been undervalued, and all of the Democrats would have been overvalued. Over the month of June, Trump's shares rose to $0.60 to catch up with our model prediction. That would have been a 13% return in less than four weeks! Nice.
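The arithmetic behind these figures is simple enough to check directly. The helper names below are made up for the example, and the calculation ignores fees and slippage.

```python
def simple_return(entry_price, exit_price):
    """Fractional return from buying at entry_price and selling at exit_price."""
    return (exit_price - entry_price) / entry_price


def model_edge(model_probability, market_price):
    """Gap between the model's probability and the market-implied probability."""
    return model_probability - market_price


june_return = simple_return(0.53, 0.60)  # about 0.132, i.e. roughly 13%
trump_edge = model_edge(0.625, 0.53)     # model above market: undervalued
dem_edge = model_edge(0.375, 0.38)       # Biden alone, before adding the other Democrats
```

Biden's price alone already slightly exceeds the model's 37.5% for all Democrats combined, so adding the few cents held by the other candidates only widens the overvaluation.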
(Incidentally, the model’s outputs remain 62.5% for Trump and 37.5% for all Democrats at the time of this writing on Jul 2, 2024. However, I expect this to change after poll results come in that reflect how Biden got bodied in the first CNN debate.)
Conclusion
If there were one job that I think everyone could agree should be replaced by AI, it would be political pundits on network television. Imagine a world where election season is no longer filled with countless hours of talking heads shouting over each other on TV, but where ML models come up with impartial analyses of who will win and why. What a relief it would be to many of us suffering from election fatigue!
So, whether you’re a machine learning enthusiast, a political junkie, or just someone who loves seeing technology push the boundaries of what’s possible, join us on this journey. Let’s turn the 2024 Presidential Election into a data-driven spectacle that’s as thrilling as it is insightful.
About the Author
Alexander Huang is a Senior ML Engineer at Allora Labs. With a background in data science in fintech, traditional finance, and trading, he most recently served as VP of Data Science on J.P. Morgan's AI Acceleration team. Alex holds an MS in Financial Mathematics from Stanford University.