The three incidents on Jan 6 were caused by one of our database shards experiencing heavier-than-normal load. As a result, applications hosted on that shard, which represent roughly 1/7 of our main cluster applications, may have seen issues loading pages or data.
The root cause was a single app performing a very expensive deletion operation. In response, we’re making the following short-term and longer-term changes:
- We temporarily blocked the operation causing the issues and reached out to the app owner to discuss workarounds.
- We discovered that the app was missing some critical database indexes, which we created to prevent the problem from recurring with that app.
- We adjusted the rate at which we delete items that are heavily referenced elsewhere in the database, which should protect against other apps causing the same issue.
- Longer-term, we are continuing to migrate off a legacy stored procedure framework (targeting completion by the end of Q1). Had that migration already been complete, this incident would not have occurred.
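
For readers curious what rate-adjusted deletion can look like in general, here is a minimal sketch of deleting rows in small batches with a pause between batches. It is not our production implementation: the function name `delete_in_batches`, the `items` table, and the `batch_size` and `pause_seconds` values are all hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3
import time


def delete_in_batches(conn, table, where_sql, params=(), batch_size=500, pause_seconds=0.0):
    """Delete matching rows a batch at a time instead of in one large statement.

    Hypothetical helper for illustration. `table` and `where_sql` are
    interpolated into the SQL, so they must come from trusted code,
    never from user input.
    """
    total = 0
    while True:
        # Pick at most `batch_size` matching rows and delete only those.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {where_sql} LIMIT ?)",
            (*params, batch_size),
        )
        # Commit each batch so locks are released between batches.
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # Pausing gives other queries on the same database room to run.
        time.sleep(pause_seconds)
    return total
```

Smaller batches and longer pauses make a large deletion take longer overall, but they limit how much load any single app's deletion can place on a shared database at one time.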