Forecasting the 2020 US Election Using Multilevel Regression with Post-stratification

Presidential election, Trump, Political Polls, PYMC3, Python

Susan Li

Published in Towards Data Science · 8 min read · Jul 18, 2021

How to Estimate Public Opinion in the States

The most commonly used method for estimating state-level opinion is called disaggregation. The process is simple and easy to implement: after combining a set of national polls, you calculate the opinion percentages separately for each state.
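A minimal sketch of disaggregation, assuming the pooled responses sit in a pandas DataFrame with a `state` column and a 0/1 `support` column. Both column names and the toy rows are illustrative and are not taken from the polls used later in this post.

```python
import pandas as pd

# Toy pooled poll responses; values are made up for illustration.
polls = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "VT"],
    "support": [1, 0, 0, 1, 1, 0],
})

# Disaggregation: the raw share of supporters within each state,
# alongside the number of respondents behind each share.
by_state = polls.groupby("state")["support"].agg(share="mean", n="count")
print(by_state)
```

A state with a single respondent, such as VT in this toy data, still receives an estimate, which is why small within-state samples are a concern.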

The problem with disaggregation is that it requires a large number of national surveys, collected over 10 years or more, to build a sufficient sample size within each state. In addition, disaggregation does not correct for sampling issues and may obscure temporal dynamics in state opinion.

To overcome these drawbacks, Multilevel Regression with Post-stratification (MRP) was developed to estimate American state-level opinions from national polls.

What is Multilevel Regression with Post-stratification (MRP)?

MRP begins by using multilevel regression to model individual survey responses as a function of demographic and geographic predictors, partially pooling respondents across states to an extent determined by the data. The final step is post-stratification, in which the estimates for each demographic-geographic respondent type are weighted (post-stratified) by the percentages of each type in the actual state populations.

Let’s say we run a survey asking people if they support gay marriage. The survey has 1,000 respondents, 700 of whom are male and 300 of whom are female. We could use the survey results to calculate the proportion of people who support gay marriage. However, we know that the population is split roughly 50%/50% between men and women, rather than the 70%/30% implied by our survey. So taking the raw result from our survey over-represents the opinions of men.

Because we know the gender distribution of the broader population, we can reweight (post-stratify) our results. In this example, the estimate of the proportion of people who support gay marriage would be:

0.5 × (proportion of men who support) + 0.5 × (proportion of women who support)

This technique is called post-stratification. It requires survey data and reliable census data that give us the population weights. Data can be re-weighted on many different categories, such as age, education, gender and race.
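To make the gender example concrete, here is a small worked sketch. The sample sizes and population shares come from the example above; the support rates for men and women are made-up numbers chosen only for illustration.

```python
# Hypothetical survey: 700 men and 300 women, with made-up support rates.
sample_size = {"male": 700, "female": 300}
support_rate = {"male": 0.40, "female": 0.60}    # assumed, for illustration
population_share = {"male": 0.5, "female": 0.5}  # known from census data

# Raw estimate: every respondent counts equally, so men dominate.
n_total = sum(sample_size.values())
raw = sum(sample_size[g] * support_rate[g] for g in sample_size) / n_total

# Post-stratified estimate: each group counts by its population share.
post_stratified = sum(population_share[g] * support_rate[g] for g in support_rate)

print(f"raw: {raw:.2f}, post-stratified: {post_stratified:.2f}")
```

With these assumed rates the raw estimate is 0.46, while the post-stratified estimate is 0.50, because the over-sampled group is the less supportive one.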

Forecasting the 2020 US election

In this post, we will follow examples here and here to forecast the 2020 US election using both disaggregation and MRP, with two data sets (one polling data set and one census data set). Let’s go!

The Elections Polls Data

I can’t share the data set, but here are the instructions for getting it:

Go to website: https://www.voterstudygroup.org.
Click “Nationscape”.
Click “GET THE LATEST NATIONSCAPE DATA”.
Fill in your name and email address, then click “SUBMIT REQUEST”.
You will receive an email with a download link; click this link.
The download is a zip archive of .dta files.
Unzip the downloaded “Nationscape-DataRelease_WeeklyMaterials_DTA_200910_161918”, then open the folder “phase_2_v20200814”.
There are many .dta files; I am using “ns20200625.dta”, which refers to 25 June 2020.

poll_data.py

There are many features in the data, but this analysis needs only seven: “vote_2020”, “race_ethnicity”, “education”, “state”, “gender”, “age” and “census_region”.

EDA

df.vote_2020.value_counts(normalize=True)

In the data, roughly 42% said they would vote for Biden and 38% said they would vote for Trump, 10% were not sure, and so on.

Number of Poll Respondents by States

respondents_by_state.py

Obviously, the bigger the state, the more poll respondents. Notably, every state had respondents.

Respondents by State and Race

We will simplify race to four categories: “White”, “Black”, “Asian” and “Others”.
Because we are interested in Trump’s voters, we will re-code the “vote_2020” variable.
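A rough sketch of these two recoding steps follows. The raw answer labels in the toy data are assumptions about how the survey codes responses, and `simplify_race`, `race` and `vote_trump` are hypothetical names introduced here.

```python
import pandas as pd

def simplify_race(label):
    """Collapse a detailed race label into White / Black / Asian / Others."""
    text = str(label)
    for group in ("White", "Black", "Asian"):
        if text.startswith(group):
            return group
    return "Others"

# Toy rows; the raw labels are assumed, not copied from the survey codebook.
df = pd.DataFrame({
    "race_ethnicity": ["White", "Black, or African American",
                       "Asian (Chinese)", "Some other race"],
    "vote_2020": ["Donald Trump", "Joe Biden",
                  "I am not sure/don't know", "Donald Trump"],
})

df["race"] = df["race_ethnicity"].map(simplify_race)
# 1 if the respondent named Trump, 0 for every other answer.
df["vote_trump"] = (df["vote_2020"] == "Donald Trump").astype(int)
print(df[["race", "vote_trump"]])
```

Coding every non-Trump answer as 0 lumps undecided respondents together with supporters of other candidates; that is a modelling choice made in this sketch.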

state_race.py

All the states had white respondents. Several states had no black and/or Asian respondents, and even more states had very few.

Respondents by State and Gender

state_gender.py

One state (WY) had no female respondents; all the other states had both male and female respondents.

Disaggregation Estimate of Support for Trump

disa_trump.py

Arkansas had the highest support, and Vermont had the lowest support for Trump.

Vote for Trump by Age and Gender

age_gender.py

In this poll, men aged roughly 35–45 were the largest voting bloc for Trump.

Simplify education and age variables

sim_edu_age.py

We’re going to explore Trump’s voters by combining all possible combinations of gender(2 categories), race (4 categories), age (4 categories) and education (4 categories).
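As a quick check on how many demographic cells this produces, the combinations can be enumerated directly. The category labels below are placeholders; only the category counts (2, 4, 4, 4) come from the text.

```python
from itertools import product

# Placeholder labels; only the number of categories per factor is from the text.
genders = ["Female", "Male"]
races = ["White", "Black", "Asian", "Others"]
ages = ["age_1", "age_2", "age_3", "age_4"]
educations = ["edu_1", "edu_2", "edu_3", "edu_4"]

# Every possible (gender, race, age, education) combination.
cells = list(product(genders, races, ages, educations))
print(len(cells))  # 2 * 4 * 4 * 4 = 128
```

Multiplying by the number of states gives the number of demographic-geographic cells that post-stratification later has to weight.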

Vote for Trump by Gender

gender_group.py

Among voters who said they would vote for Trump, roughly 43% are female, and 57% are male.

Vote for Trump by Age Category

age_cat.py

Respondents aged 30–44 are the largest voting bloc, followed by those aged 60 and over.

Vote for Trump by Age Category and Gender

ageCat_genderCat.py

Vote for Trump by Race

race.py

Vote for Trump by Race & Gender

race_gender.py

Vote for Trump by Race & Age

race_age.py

Vote for Trump by Education

edu_cat.py

Vote for Trump by Education & Gender

edu_gender.py

Vote for Trump by Education & Race

edu_race.py

Vote for Trump by Education & Age

edu_age.py

Multilevel Model

Our multilevel polling model will include factors for state, race, gender, education and age, with the vote as the outcome. To accelerate inference, we count the number of respondents in each unique combination of these factors, along with how many of them said they would vote for Trump.
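The aggregation described above might be sketched as follows. The frame names mirror the `dt_df` and `uniq_dt_df` used in this post, but the columns and rows here are toy values, and only two factors are shown to keep the example short.

```python
import pandas as pd

# Toy respondent-level data; column names and values are illustrative.
dt_df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX"],
    "gender": ["Male", "Male", "Female", "Male", "Male"],
    "vote_trump": [1, 0, 1, 1, 1],
})

# One row per unique predictor combination, with the number of
# respondents (n) and the number who chose Trump (trump_votes).
uniq_dt_df = (
    dt_df.groupby(["state", "gender"], as_index=False)
         .agg(n=("vote_trump", "count"), trump_votes=("vote_trump", "sum"))
)
print(uniq_dt_df)
```

Modelling the counts with a binomial likelihood gives the same posterior as modelling each respondent with a Bernoulli likelihood, so nothing is lost by collapsing the rows.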

encode_uniq.py

This aggregation cuts the number of rows in the data set by almost 70%.

uniq_dt_df.shape[0] / dt_df.shape[0]

multilevel.py

Now we’re ready to specify the model with PyMC3. We start by wrapping the predictors in theano.shared.

shared.py

We specify the multilevel (hierarchical) model for α_state.
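For readers who prefer notation, one common way to write such a model is sketched below. It is a generic MRP specification, not a transcription of the code in this post: the symbols (sigma_state, beta_0, and the index functions s[i], r[i], g[i], e[i], a[i]) are assumptions introduced for illustration, and the priors on beta_0 and the sigma terms are left unspecified. Here i indexes the unique factor combinations, n_i is the number of respondents in combination i, and y_i is the number of them who chose Trump.

```latex
\begin{aligned}
\alpha^{\text{state}}_{s} &\sim \mathcal{N}\!\left(0,\ \sigma_{\text{state}}^{2}\right), \qquad s = 1, \dots, S \\
\eta_{i} &= \beta_{0} + \alpha^{\text{state}}_{s[i]} + \alpha^{\text{race}}_{r[i]} + \alpha^{\text{gender}}_{g[i]} + \alpha^{\text{edu}}_{e[i]} + \alpha^{\text{age}}_{a[i]} \\
y_{i} &\sim \operatorname{Binomial}\!\left(n_{i},\ \operatorname{logit}^{-1}(\eta_{i})\right)
\end{aligned}
```

Because sigma_state is estimated from the data, states with few respondents are pulled towards the overall mean while well-sampled states keep estimates closer to their own data. This is the partial pooling described earlier.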

hierarchical_normal.py

We specify the parameters as follows:

parameter.py

Finally, we specify the likelihood and sample from the model using NUTS.

sample.py

We can check the convergence of the chains formally using the Gelman-Rubin statistic (R-hat). Values close to 1.0 indicate convergence.
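To show what the statistic measures, here is a from-scratch sketch of the classic Gelman-Rubin calculation for a single scalar parameter. It is illustrative only: `gelman_rubin` is a hypothetical helper, the chains are simulated, and Bayesian libraries typically compute more refined variants of this statistic.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor for one scalar parameter.

    chains: array of shape (n_chains, n_draws).
    """
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: between-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                       # four chains, same target
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains offset by 3

print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Chains that explore the same distribution give a value near 1.0, while chains centred in different places inflate the between-chain term and push the value well above 1.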

Gelman-Rubin.py

Post-stratification Data

We will use IPUMS US Census & American Community Survey data. Again, I can’t share the data set, but here are the steps for getting it:

Go to IPUMS website: https://ipums.org, then click “VISIT SITE” on “IPUMS USA”.
Click “Get Data”.
If you haven’t done so before, you will need to create an account, then wait for a confirmation.
Once you have an account, go to “Select Samples” and de-select everything apart from the 2019 ACS.
We need to select the variables that we’re interested in. From the household variables we want “STATEICP”, and from the person variables we want “SEX”, “AGE”, “EDUC” and “RACE”.
Once everything is selected, click “view cart” and change the data format to “.dta”. The extract should be less than 300MB. Then submit the request.
Within a few minutes, you should get an email saying that your data can be downloaded.
I downloaded and saved “usa_00002.dta” in my data folder.

We can now start cleaning the data. This includes re-organizing the age, education and race categories, and changing state names to two-letter abbreviations so that they align with the poll data we analyzed earlier.
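As one example of this clean-up, the age variable could be binned as follows. The bin edges and labels are assumptions chosen to give four groups, and the sketch assumes “AGE” has already been converted to integers.

```python
import pandas as pd

# Toy census rows; AGE is assumed to be an integer column here.
census = pd.DataFrame({"AGE": [17, 18, 29, 30, 44, 45, 59, 60, 85]})

# Keep voting-age people only.
adults = census[census["AGE"] >= 18].copy()

# Assumed bin edges giving four age groups; right=False makes each
# interval closed on the left, e.g. [18, 30) is labelled "18-29".
bins = [18, 30, 45, 60, 200]
labels = ["18-29", "30-44", "45-59", "60+"]
adults["age_cat"] = pd.cut(adults["AGE"], bins=bins, labels=labels, right=False)
print(adults)
```

Whatever edges are chosen, the same categories must be used for the poll data and the census data, otherwise the cells will not line up at the post-stratification step.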

census_df = pd.read_stata('data/usa_00002.dta')
census_df.head()

post_stra.py

After creating counts of each sub-cell and their proportions within each state, we merge the result with the previous state data. We now finally have our post-stratification data.
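The counting step can be sketched like this, using toy person-level rows and only two demographic factors. Column names and values are illustrative, and raw row counts are used; if a person weight is part of the extract, summing it instead would be the natural variant.

```python
import pandas as pd

# Toy person-level census rows; values are illustrative.
census = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "TX", "TX"],
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "age_cat": ["18-29", "18-29", "18-29", "60+", "60+", "60+"],
})

# Number of people in each (state, gender, age) cell.
cells = (
    census.groupby(["state", "gender", "age_cat"])
          .size()
          .reset_index(name="n")
)
# Share of each cell within its own state; shares sum to 1 per state.
cells["prop"] = cells["n"] / cells.groupby("state")["n"].transform("sum")
print(cells)
```

These within-state proportions are the weights applied to the model’s cell-level predictions.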

Again, we encode the categorical variables in this combined data as numeric codes, in the same way as before.

ps_encode.py

We now set the values of the theano.shared variables in our PyMC3 model to the post-stratification data and sample from the posterior predictive distribution.

ps_mean.py

We complete the post-stratification step by taking a weighted sum across the demographic cells within each state, which produces posterior predictive samples from the state-level polling distribution. The simplest summary of state-level support is the posterior expected mean.
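The weighted sum can be sketched with NumPy as follows. The random draws are a stand-in for posterior predictive samples from the fitted model, and all array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy inputs: 3 cells in state 0 and 2 cells in state 1.
state_idx = np.array([0, 0, 0, 1, 1])
cell_prop = np.array([0.5, 0.3, 0.2, 0.6, 0.4])  # sums to 1 within each state

# Stand-in for posterior predictive draws of Trump support per cell,
# shape (n_draws, n_cells); real draws would come from the fitted model.
cell_draws = rng.uniform(0.2, 0.8, size=(1000, 5))

n_states = state_idx.max() + 1
state_draws = np.zeros((cell_draws.shape[0], n_states))
for s in range(n_states):
    in_state = state_idx == s
    # Weighted sum across the state's cells, one value per posterior draw.
    state_draws[:, s] = cell_draws[:, in_state] @ cell_prop[in_state]

state_mean = state_draws.mean(axis=0)  # posterior expected mean per state
print(state_mean)
```

Keeping the full `state_draws` array, rather than only the means, also makes it possible to report an interval for each state.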

The following map shows the MRP estimates of support for Trump by state.

mrp_trump.py

We can also compare how the estimate for each state differs between disaggregation and MRP.

mrp_disa.py

How Much Did Black Men Support Trump?

Let’s try to answer this question with two estimates.

First we plot the disaggregation estimates of support for Trump among black men. The pollsters polled few or no black men in many states. Therefore, support for Trump among black men in these states cannot be measured using disaggregation.
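The gap is easy to see by counting the respondents behind each subgroup estimate. The sketch below uses toy rows and hypothetical column names (`race`, `gender`, `vote_trump`).

```python
import pandas as pd

# Toy poll rows; values are illustrative.
df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "VT", "WY"],
    "race": ["Black", "White", "Black", "White", "White"],
    "gender": ["Male", "Male", "Male", "Female", "Male"],
    "vote_trump": [0, 1, 1, 0, 1],
})

subgroup = df[(df["race"] == "Black") & (df["gender"] == "Male")]
summary = subgroup.groupby("state")["vote_trump"].agg(share="mean", n="count")

# Reindex over every state in the poll so empty states show up as n = 0
# with a missing (NaN) share rather than silently disappearing.
summary = summary.reindex(sorted(df["state"].unique()))
summary["n"] = summary["n"].fillna(0).astype(int)
print(summary)
```

States with n = 0 have no disaggregation estimate at all, and states with n = 1 have an estimate of exactly 0 or 1.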

black_men_disa.py