After the 2020 election, I (like everyone) heard allegations of fraud and “cheating”. I wanted to see for myself how absurd the claims were. What I found back then, I am sharing publicly here for the first time. What I saw in this data (and the cascade of events around it) fundamentally changed how I see the world.
The Findings
I founded a VC-backed Machine Learning and Data Analytics startup focused on the Information Security market. So in December 2020 (after hearing all the controversy), I fired up some of my data analytics tools and wrote some simple custom code to take a closer look at voter data from a contested battleground state: Pennsylvania.
The TL;DR
This is some of what I found:
- Ballots were returned (filled out and sent back to the state) with a DoB showing the voter as under 18 years old. Count: 2
- Ballots were returned (filled out and sent back to the state) with a DoB showing the voter as over 100 years old. Count: 1,558
- Ballots were returned (filled out and sent back to the state) with a DoB showing the voter as over 120 years old. Count: 93
- Ballots were marked as received by the state with dates BEFORE they were mailed out by the state. Count: 23,305
- Ballots were marked as received by the state the SAME DAY they were mailed to the voter by the state. Count: 34,916
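Each of the checks above comes down to a couple of comparisons per row. Here is a minimal sketch of what such checks can look like, using only the Python standard library. It is an illustration under stated assumptions, not the original analysis code: the column names (`Date of Birth`, `Ballot Mailed Date`, `Ballot Returned Date`), the `MM/DD/YYYY` date format, and the choice of Election Day (November 3, 2020) as the reference date for ages are all assumptions that should be checked against the header of the downloaded CSV. The function names are hypothetical.

```python
import csv
import io
from datetime import date, datetime

ELECTION_DAY = date(2020, 11, 3)  # assumed reference date for computing ages

# Assumed column names and date format; verify against the real CSV header.
DOB = "Date of Birth"
MAILED = "Ballot Mailed Date"
RETURNED = "Ballot Returned Date"
DATE_FORMAT = "%m/%d/%Y"


def parse_date(text):
    """Return a date, or None when the field is missing, blank, or unparseable."""
    try:
        return datetime.strptime(text.strip(), DATE_FORMAT).date()
    except (ValueError, AttributeError):
        return None


def age_on(dob, ref):
    """Whole years elapsed between dob and ref."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))


def count_anomalies(rows):
    """Count age and date-order anomalies among rows that have a return date."""
    counts = {"under_18": 0, "over_100": 0, "over_120": 0,
              "returned_before_mailed": 0, "returned_same_day": 0}
    for row in rows:
        returned = parse_date(row.get(RETURNED))
        if returned is None:
            continue  # only returned ballots are considered
        dob = parse_date(row.get(DOB))
        if dob is not None:
            age = age_on(dob, ELECTION_DAY)
            counts["under_18"] += age < 18
            counts["over_100"] += age > 100  # also includes the over-120 rows
            counts["over_120"] += age > 120
        mailed = parse_date(row.get(MAILED))
        if mailed is not None:
            counts["returned_before_mailed"] += returned < mailed
            counts["returned_same_day"] += returned == mailed
    return counts


# Tiny made-up sample (not real records) to show the shape of the input.
SAMPLE = """Date of Birth,Ballot Mailed Date,Ballot Returned Date
01/15/1980,10/01/2020,10/10/2020
06/30/2005,10/01/2020,10/05/2020
01/01/1900,10/05/2020,10/05/2020
03/03/1915,10/09/2020,10/02/2020
07/04/1970,10/01/2020,
"""
counts = count_anomalies(csv.DictReader(io.StringIO(SAMPLE)))
print(counts)
```

To run it against a real file, pass `csv.DictReader(open(path, newline=""))` in place of the sample. Rows with blank or unparseable dates are skipped rather than counted, which is a design choice worth revisiting depending on what question you are asking.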
Ok, now that you’ve seen some of it, you’ve gotta be thinking what I thought:
“Surely there is a rational explanation for this…”
or
“These are just clerical errors…”
If you are thinking the way I did, you should know that it was these simple questions that began a long “walkabout”, searching for explanations from public officials and news outlets.
It’s been nearly FOUR years of continuous homework. I’m no longer the same person, but it’d be nice to know what happened.
I have some longer thoughts here on this.
If you don’t see the big deal with the above data, this is where cognition plays a key role. You have to think a bit like a risk analyst or an auditor. If you aren’t accustomed to this, you might have to slow your mind down a bit, because it can be subtle. We have to pause and think critically about the data because there is no guide:
- Why would someone build a process to collect this data only to also allow it to be inaccurate?
- What are some possible circumstances where inaccurate data of this kind can arise?
- Processes must be built and staff must be trained to interact with this data.
- Software and database schemas have to be written to support this data. Why would you make the effort to store all this data and then not validate it?
As you mentally play with the data, fields, and scenarios, more things come to mind. Smarter people will think of better ways to interrogate the data than I did. The above are just a few of the initial things that alarmed me. This is why I share my code and the data below. Other queries have interesting results:
- Duplicate entry detection (using various combinations of columns)
- Birthday Paradox stats on Dates of Birth
- etc
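Both of these follow-up queries are easy to sketch. The example below is an illustration, not the original analysis code. It groups rows by a chosen combination of columns to surface repeated keys, and it computes how many same-birthday pairs one would expect if dates of birth were spread uniformly over a given number of days, as a rough baseline to compare observed repeats against. The column names and the uniform model are assumptions made for illustration; real birth dates are not uniformly distributed, so the baseline is only approximate.

```python
from collections import Counter
from math import comb


def duplicate_keys(rows, columns):
    """Return {key: count} for column-value combinations seen more than once."""
    keys = Counter(
        tuple((row.get(c) or "").strip() for c in columns) for row in rows
    )
    return {key: n for key, n in keys.items() if n > 1}


def observed_matching_pairs(dobs):
    """Number of record pairs that share an identical date of birth."""
    return sum(comb(n, 2) for n in Counter(dobs).values())


def expected_matching_pairs(n_people, n_days):
    """Expected same-date pairs if birth dates were uniform over n_days."""
    return comb(n_people, 2) / n_days


# Made-up rows (not real records); column names are assumptions.
rows = [
    {"County Name": "ADAMS", "Date of Birth": "01/15/1980", "Ballot Mailed Date": "10/01/2020"},
    {"County Name": "ADAMS", "Date of Birth": "01/15/1980", "Ballot Mailed Date": "10/01/2020"},
    {"County Name": "YORK", "Date of Birth": "01/15/1980", "Ballot Mailed Date": "10/02/2020"},
    {"County Name": "YORK", "Date of Birth": "02/20/1975", "Ballot Mailed Date": "10/02/2020"},
]
dupes = duplicate_keys(rows, ["County Name", "Date of Birth", "Ballot Mailed Date"])
pairs = observed_matching_pairs(r["Date of Birth"] for r in rows)
baseline = expected_matching_pairs(len(rows), 365 * 80)  # ~80 years of possible dates
print(dupes, pairs, baseline)
```

Because the public file carries no names, a repeated key is a prompt for closer inspection rather than proof of a duplicate record. Likewise, an observed pair count far above the uniform baseline is a signal to look at which specific dates are over-represented, not a conclusion in itself.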
For brevity’s sake, I’ll stop with the analysis here. I have more thoughts on a dedicated page.
The Data
You’re probably curious about where I got it and how I processed it. As of the time of publishing, and for the last 4 years (I have checked periodically), it has been available through Archive.org’s archive of the Pennsylvania OpenData page (the original page was briefly moved or removed shortly after the election controversy).
Using the PAOpenData interactive visualization tool:
You can see all of Archive.org’s snapshots of the site here:
https://web.archive.org/web/20240000000000*/https://data.pa.gov/Government-Efficiency-Citizen-Engagement/2020-General-Election-Mail-Ballot-Requests-Departm/mcba-yywm
Here is a specific snapshot:
Click “View Data” near the top.
NOTE: Keep in mind this is Archive.org’s archive of the site, so it may take a long time to load. Once loaded, it will look like this:
The original PAOpenData link is here: https://data.pa.gov/Government-Efficiency-Citizen-Engagement/2020-General-Election-Mail-Ballot-Requests-Departm/mcba-yywm
Downloading the data yourself:
Here is a direct link: https://web.archive.org/web/20201118054516mp_/https://data.pa.gov/api/views/mcba-yywm/rows.csv?accessType=DOWNLOAD
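If you would rather fetch the file from a script than from a browser, a minimal sketch with the Python standard library could look like this. The URL is the Archive.org snapshot quoted above; the function names, the local filename, the timeout, and the UTF-8 encoding are assumptions for illustration. The actual download call is left commented out because it makes a network request and the archived file can be slow to arrive.

```python
import csv
import io
import shutil
import urllib.request

# Archive.org snapshot URL quoted in the text above.
CSV_URL = (
    "https://web.archive.org/web/20201118054516mp_/"
    "https://data.pa.gov/api/views/mcba-yywm/rows.csv?accessType=DOWNLOAD"
)


def download(url, dest, timeout=120):
    """Stream the response body to a local file and return its path."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


def summarize(fileobj):
    """Return (column names, number of data rows) for an open CSV text stream."""
    reader = csv.reader(fileobj)
    header = next(reader, [])
    return header, sum(1 for _ in reader)


# Usage against the real file (hypothetical filename, assumed UTF-8 encoding):
# path = download(CSV_URL, "pa2020_mail_ballots.csv")
# with open(path, newline="", encoding="utf-8") as f:
#     print(summarize(f))

# Offline demonstration on a made-up two-column sample.
header, n_rows = summarize(io.StringIO("County Name,Date of Birth\nADAMS,01/15/1980\n"))
print(header, n_rows)
```

Printing the header first is a cheap way to confirm the real column names before writing any queries against them.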
My code is available here:
You can even download my code and data here using Docker:
docker run -it -p 8888:8888 -p 6006:6006 sa7ori/pa2020 bash
For more technical information on the dataset and using my code see this dedicated page with more details.
A personal note
Before I started on all this, I had no idea how election administration worked. I had no idea that the data was even publicly available. I didn’t know what to expect. I assumed election data was treated with the same protections, tracking, and care as physical cash. I assumed all related records of ballots/votes would be as meticulously guarded as “account balances” or ledgers are for money. I thought there was a “Brinks truck” for votes. (I’ve spent the last 4 years getting mugged by reality.)
It isn’t the findings in the raw data that I found so alarming. As time passed, what bothered me most (and shifted my worldview) was that NONE of my trusted sources of information offered a “thinking person’s” explanation of what had happened. All I found (from my trusted sources) were generalizations, vapid proclamations, slogans, and repetitive mantras.
Furthermore, friends, family, and colleagues couldn’t care less.
I have more thoughts here on a dedicated page.
My Politics (if you’re curious):
Before 2020, I didn’t care about politics or think critically about world events. I thought it gauche to discuss politics or money in mixed company (or even on the internet).
I did not vote for any major party candidate in the 2020 General Election. Historically, I economically leaned Free Market/AnCap and I politically leaned libertarian (small “L”, with affinity for Mises Caucus, Ron Paul-types, and for a short time, 2016 Bernie-types).
Although I did not vote for a major party candidate in 2020, I plan to do so in 2024.
Brief Legal Information
All data archived herein (and in the accompanying technical resources) was obtained through Open Public Records, and is not considered PII (Personally Identifiable Information). All sources of data have been provided in this document to corroborate and substantiate the findings presented here. Additionally, all data was not only publicly provided by the Pennsylvania Department of State through the OpenDataPA web portal, but was obtained through explicit DOWNLOAD links (not screen-scraped) and as such was cleansed of all PII by PA OpenData.
Other Pages here
There are a few other dedicated pages on this site, linked above:
- Technical Information: about the Data and the Code
- Aftermath: Longer personal exposition about the aftermath
- Scrapbook: Some picture clippings from my notes
- My PA2020 Jupyter Docker
- Archive.org OpenDataPA Page
- OpenDataPA Direct Link