Issues with caching layer

Incident Report for Bubble

Postmortem

While doing some essential cache scaling, 20% of our cache shards with higher than average data sets hit 100% CPU, meaning that they were unable to both migrate over the existing cache data and keep up with incoming writes.

Given that shards pegged at 100% CPUs were causing operations to fail, we made the decision to turn off traffic to the main Bubble cluster to give the cache cluster the opportunity to completely replicate all the existing data between nodes under minimal new write load.

Reverting to a smaller cluster would have extended the outage; adding nodes would have extended the outage because the cluster would have had to re-distribute keys; and doing nothing would have resulted in a partial service outage for a much longer time.

We have successfully performed this type of operation without downtime in the past, but due to changes in read/write patterns across our cluster, our cache data is no longer balanced between nodes.

Thus, in the future, we will only perform this kind of migration during a scheduled maintenance window, and in the meantime we will be analyzing how our data is getting distributed across the cluster to make the workloads more even.

Posted Jan 26, 2022 - 16:15 EST

Resolved

This incident has been resolved.

Posted Jan 26, 2022 - 13:08 EST

Monitoring

The underlying systems appear to have stabilized. We are continuing to monitor.

Posted Jan 26, 2022 - 12:40 EST

Identified

We have temporarily taken the bubble main cluster offline to try to effect repairs.

Posted Jan 26, 2022 - 12:10 EST

This incident affected: Bubble Core (Main Bubble Environment).