A race condition...I think?
I’ve been working on a side project for a couple months now, having started it during quarantine. The project deals with analyzing some data that I collect daily, and running some (half-baked) agent-based models against that, and seeing what choices they make and how those choices perform. My end goal was to stand up an application that was constantly running this agent modeling against new intake data.
I focused on data collection first for two reasons. First, the data was temporary, and if I didn’t collect it one day it wouldn’t be readily available the next day. Second, I figured that I could always run a backfill against the data.
I hadn’t really focused too much on the backfill, thinking of it as the capstone on the other work of the project. The backfill would use all of the data I had collected up until the current moment, and once it completed, my thought was that the project would be out of its “beta” phase, since I’d have a few months of data and the project would be running by itself.
I’d been slowly building the project up over the past few months. I first built the data collection cron jobs, pulling from data sources and persisting the data into a data store. I then hand-wrote some “agents” to interpret the data, make certain choices with it, and record their results in the data store. I wanted to better visualize the data (and to share it with friends), so I wrote a pretty simple API and booted up a web application to read from the API and represent the data in a barebones way. I also set up CI/CD around the project to automatically run tests and deploy it onto servers running in AWS, so that I could just merge into the main branch on GitHub and have my code automatically deployed.
Finally, I got the project in a decent enough state where I figured now would be the time to run the backfill and then let the project run itself for the future. I wrote a script to generate about 10,000 agents each with their own parameters, wrote a backfill process (simulating what the world would look like at each point in time for each agent), and set off the script.
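The shape of that backfill loop can be sketched out; everything below is my stand-in naming (the post doesn’t show the actual code), with toy callbacks in place of the real decision and tabulation logic:

```python
import random

def make_agents(n: int = 10_000):
    # Hypothetical: each agent gets its own randomized parameters
    # (one placeholder knob here; the real agents surely have more).
    return [{"id": i, "risk": random.random()} for i in range(n)]

def backfill(agents, timeline, decide, close_out):
    # Replay history: at each point in time, every agent makes its
    # decision as of that moment, then that point's decisions are
    # closed out before advancing to the next point.
    for t in timeline:
        for agent in agents:
            decide(agent, t)
        close_out(t)

# Toy run with 3 agents over 2 points in time, recording call order.
log = []
backfill(make_agents(3), range(2),
         decide=lambda a, t: log.append(("decide", a["id"], t)),
         close_out=lambda t: log.append(("close", t)))
```

The key structural point (which matters later) is that the backfill closes out each point in time itself, inside the loop.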
It was slow going at first. My first attempt used an EC2 instance, but that was excruciatingly slow (obvious in retrospect, but I didn’t think I was doing all that much processing!). Bumping up to an r5a.xlarge made things faster, but I had originally written everything to run sequentially, and there was no reason each agent couldn’t run independently! So I spent some time parallelizing things.
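Since each agent is independent, the parallelization amounts to fanning the per-agent work out over a pool. A sketch of the pattern (my naming, not the actual code; for CPU-bound simulation a `ProcessPoolExecutor` would be the better fit, but the shape is identical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(agent):
    # Stand-in for simulating one agent across the whole timeline;
    # the real work would go here.
    return agent["id"], agent["risk"] * 2  # placeholder result

def run_all(agents, max_workers=8):
    # Agents don't depend on each other, so they can run concurrently.
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_agent, agents))

results = run_all([{"id": i, "risk": 0.5} for i in range(4)])
```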
Once I had scaled up performance and parallelized things, I kicked off the backfill. It ran for a bit, and then…boom. It blew up. There was a unique constraint issue!
ERROR: duplicate key value violates unique constraint
That’s interesting. What was going on? I wrote some additional integration tests to try and find the problem, but couldn’t nail anything down.
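For context, this is the error a relational datastore throws when a second row arrives with a key that’s already taken. A minimal reproduction, with sqlite standing in for whatever datastore is actually in use (the post doesn’t name it):

```python
import sqlite3

# Two inserts with the same primary key: the second one violates
# the table's unique constraint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (decision_id TEXT PRIMARY KEY, outcome TEXT)")
conn.execute("INSERT INTO results VALUES ('decision-1', 'win')")

error = None
try:
    conn.execute("INSERT INTO results VALUES ('decision-1', 'loss')")
except sqlite3.IntegrityError as e:
    error = str(e)  # "UNIQUE constraint failed: results.decision_id"
```

On its own, this just says “something wrote the same key twice” — it doesn’t say *what*.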
Next, I removed some of the parallelization. While (I thought) it had been pretty straightforward to add, even seemingly trivial parallelization has been known to cause unintended consequences. And while I had hoped to make the backfill much faster, I wasn’t overly concerned with performance, since this was going to be a one-time operation. I undid my parallelization changes, redeployed, and ran the script again.
Boom - same thing. A unique constraint violation. Super weird! One interesting thing to note was that it happened at a different point in time in the backfill, a few points after where the original issue had occurred. The issue didn’t seem to be deterministic, and it didn’t seem to be caused by a specific piece of bad data. I would roll back the latest changes in the database and run all of the queries the process should have been running, and couldn’t find anything that would’ve led to this issue.
I added more logging around where I thought this issue would be happening. I redeployed and ran it again. And the issue still happened at yet a different point in time. Some of the logging that I thought would have hit never triggered, yet the issue kept happening!
I downloaded the dataset to my personal computer and ran the backfill there. Weirdly, the backfill worked without issue! The error never occurred. This made me think it was some race condition with talking to the datastore, where the latency on my personal computer was negligible but the latency across the network was wreaking havoc on the backfill.
I was banging my head against my keyboard. I hadn’t felt this perplexed about a piece of code in a long time. It was one of those moments where I felt like something impossible must’ve been happening. I was questioning a lot of the truths I thought I knew about how this particular program worked, how the datastore I was working with worked, etc. I was reevaluating a lot of the principles I wrote code by.
I tried as many permutations as I could to try and figure out what was happening. I thought about it when I was away from the computer, trying to take a different angle to discover the issue.
This morning, I finally figured it out (or at least, I think I figured it out!). In retrospect, like almost all bugs, it seems incredibly obvious. I’m happy to have figured it out, but frustrated that I spent so long chasing it that it took the wind out of my sails on this side project for a few weeks.
The issue was a cron job that I had set up as one of the first things I did for the project. The cron job determined the results of all the choices the agents had previously made whose results had not already been calculated. It was one of a handful of cron jobs that allowed the project to run on its own; others would ingest the data, spur the agents to make their choices, etc.
This cron was special in that it ran frequently, every 15 minutes (at :05, :20, :35, and :50). It would load up all the pending decisions, tabulate the results, and close out each decision. Often, its run would be a no-op, because agents only made decisions a certain number of times per day.
However, while the backfill was running, this cron job would pick up all of the pending decisions made by the backfill. The backfill had its own mechanism for handling pending decisions: it would make decisions, advance one point in time, do the job of the cron job within the process itself (closing out all of the decisions from the previous point in time), and then start over at the current point in time.
What was happening was that the backfill would collect all of the decisions it needed to close out and begin working its way through them. At the same time, the cron job would kick off and do its thing: it would fetch those same pending decisions (since the backfill hadn’t closed them all out yet) and work through them itself. The variability came from when I kicked off the backfill relative to when the cron job ran, which is why the issue appeared at different points in time. It also seems that the issue didn’t happen every single time the cron job ran, but only when conditions lined up so that the backfill had fetched the decisions it was supposed to process around the same time the cron job ran.
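That interleaving can be shown in miniature. Everything here is an in-memory stand-in for the datastore, and the bad ordering is forced explicitly so the failure reproduces every time (in the real system it depended on timing):

```python
# Stand-ins for the datastore tables: a pending set and a uniquely
# keyed "closed decisions" table.
pending = {"decision-1", "decision-2"}
closed = {}  # decision_id -> which worker closed it

def fetch_pending():
    # Each process reads its own snapshot of the pending decisions.
    return set(pending)

def close_out(decision_id, worker):
    if decision_id in closed:  # the "unique constraint"
        raise RuntimeError(
            f"duplicate key value violates unique constraint ({decision_id})")
    closed[decision_id] = worker
    pending.discard(decision_id)

# Both the backfill and the cron job fetch the same pending decisions...
backfill_batch = fetch_pending()
cron_batch = fetch_pending()

# ...the backfill closes them out first...
for d in backfill_batch:
    close_out(d, "backfill")

# ...and the cron job, working from its stale snapshot, then blows up.
cron_error = None
try:
    for d in cron_batch:
        close_out(d, "cron")
except RuntimeError as e:
    cron_error = str(e)
```

This is the classic check-then-act race: the check (fetch pending) and the act (close out) aren’t atomic across the two processes.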
If I had noticed that it was happening around the same times within each hour, I might have been tipped off (all of the exceptions happened at the cron job’s 15-minute marks). If I had thought a little more about why my personal computer could run the process with no problem, I might have been tipped off. If I had studied the data in the data store a little more closely, I might have been tipped off. If I had looked at the logs of the cron jobs and not just of the backfill, I might have been tipped off (those same exceptions were being thrown in the cron job, but since I don’t have any alerting or monitoring set up for this side project, I wasn’t aware it was happening - that’s on me).
My issue was that I wasn’t thinking about the system as a whole. I was solely focused on the backfill process I had written, debugging and refining and changing that in an attempt to solve the problem. I was forgetting that the code did not run in isolation, and that the box it was running on had its own environment and other processes, etc.
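One common defense against this kind of double-processing (a sketch of a general pattern, not necessarily the right fix for this project) is to make the close-out write idempotent, so whichever process loses the race becomes a harmless no-op instead of a crash. With sqlite again standing in for the real datastore, that’s `INSERT OR IGNORE` (Postgres spells it `ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (decision_id TEXT PRIMARY KEY, outcome TEXT)")

def close_out(decision_id, outcome):
    # INSERT OR IGNORE: if the row already exists, the statement
    # silently does nothing rather than violating the constraint.
    conn.execute("INSERT OR IGNORE INTO results VALUES (?, ?)",
                 (decision_id, outcome))

close_out("decision-1", "win")   # the backfill wins the race
close_out("decision-1", "win")   # the cron's attempt quietly no-ops
count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

The other obvious fix, of course, is operational: pause the cron job while a backfill is running.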
In sum, it was…sort of fun to figure this out? I really haven’t had to deal with a nasty bug like this in a while. It reminded me how nice it is to have clean, understandable code, an understandable system, and proper alerting, messaging, etc. Additionally, it’s sort of nice to know that I care about a side project enough for it to be “big” enough to have a system issue like this, that I didn’t totally lose motivation on the project, and that I spent a good amount of time figuring it out so that I can keep working on it.