Write-up
Scheduled cluster outage

As you may be aware, we have recently been changing how we roll out code and how we route traffic to new servers after a deployment (forum post). This has already started paying dividends: a large chunk of custom logic is no longer required, which has prevented a few would-be outages, and deployments are now faster and more reliable.

However, we are still in an intermediate state in which our legacy systems are responsible for updating the new load balancers with fresh lists of server targets, and a corner case in this logic surfaced last night. The logic for updating server targets ran periodically unless a maintenance operation was ongoing, and, crucially, it did not run proactively when an engineer deployed code.


Since a big portion of the deploy operation counted as a "maintenance operation", our logic relied on always having a few seconds between deploys in which to update the list of servers our load balancer should point to.
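For illustration, here is a minimal sketch (in Python, with made-up names such as maintenance_in_progress, current_server_targets, and set_targets) of the shape of that legacy logic: a periodic loop that skips its update whenever maintenance is in progress, with nothing running when a deploy finishes. It is a sketch of the behaviour described above, not our actual code.

```python
import time

UPDATE_INTERVAL_SECONDS = 30  # illustrative interval, not the real value

def target_update_loop(cluster, load_balancer):
    """Periodically push the cluster's current server list to the load balancer."""
    while True:
        # Skip the update while any maintenance operation is in progress;
        # most of a deploy counts as maintenance.
        if not cluster.maintenance_in_progress():
            load_balancer.set_targets(cluster.current_server_targets())
        # Crucially, nothing here triggers an update when a deploy finishes;
        # we simply wait for the next periodic tick.
        time.sleep(UPDATE_INTERVAL_SECONDS)
```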


In the November 3rd incident, a few compounding factors (a transient failure in a build process, some of the aforementioned speedups, and bad luck) caused several deploys to be queued and to occur in such quick succession that the load balancer had not received updated targets by the time the second deployment began and triggered a new "maintenance" operation.

As a result, for the duration of the second deployment, our load balancer directed traffic to instances that had already been torn down, and this broke the immediate cluster.

This morning (November 4th), an almost identical issue occurred in our scheduled cluster, which, for different reasons, was deployed twice in very short succession, quickly enough that our background process could not update the load balancer's list of targets in between.

We fixed this by adding a further guarantee: the list of targets is now updated at the end of every successful deployment, which should ensure this cannot happen again.
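As a rough sketch of what that guarantee looks like (again with illustrative, made-up names, and the same assumptions as the earlier sketch), the deploy flow now pushes the fresh target list itself as its final step rather than waiting for the periodic loop:

```python
def deploy(cluster, load_balancer, build):
    """Sketch of the deploy flow with the added guarantee (illustrative names)."""
    cluster.begin_maintenance()
    try:
        # Roll out the build: tear down old instances and bring up new ones.
        new_servers = cluster.roll_out(build)
        # Added guarantee: push the fresh target list as part of every
        # successful deployment, instead of relying on the periodic loop
        # getting a chance to run between deploys.
        load_balancer.set_targets(new_servers)
    finally:
        cluster.end_maintenance()
```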