Write-up published
Resolved
The three incidents on Jan 6 were caused by one of our database shards experiencing heavier-than-normal load. As a result, applications hosted on that shard, which represent roughly 1/7 of our main cluster applications, may have seen issues loading pages or data.
The root cause was one app performing a very expensive deletion operation. In response, we have made, or are making, the following short-term and longer-term changes:
We temporarily blocked the operation causing the issues and reached out to the app owner to discuss workarounds.
We discovered that the app was missing some critical database indexes, which we created to prevent the problem from recurring with that app.
We adjusted the rate at which we delete items that are heavily referenced elsewhere in the database, which should protect against other apps causing the same issue.
Longer-term, we are continuing to migrate off a legacy stored procedure framework (targeting completion by end of Q1). Had that migration been complete, this incident would not have occurred.
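
For readers curious what rate-limited deletion of a heavily referenced item can look like, here is a minimal sketch. It is an illustration only, not our production code: the `items` and `refs` tables, the `delete_in_batches` and `build_demo_db` functions, and the batch size and pause values are all hypothetical, and SQLite stands in for the real database. The idea is to remove referencing rows in small, separately committed batches with a pause between them, and to index the referencing column so each batch is cheap to locate.

```python
import sqlite3
import time


def delete_in_batches(conn, item_id, batch_size=500, pause_seconds=0.0):
    """Delete rows that reference item_id in small batches, then the item.

    Hypothetical schema: refs.item_id points at items.id.
    Returns the number of referencing rows removed.
    """
    total = 0
    while True:
        # Remove at most batch_size referencing rows per transaction.
        cur = conn.execute(
            "DELETE FROM refs WHERE rowid IN "
            "(SELECT rowid FROM refs WHERE item_id = ? LIMIT ?)",
            (item_id, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # Pause between batches so other queries on the shard can proceed.
        time.sleep(pause_seconds)
    # No references remain, so deleting the item itself is now cheap.
    conn.execute("DELETE FROM items WHERE id = ?", (item_id,))
    conn.commit()
    return total


def build_demo_db(ref_count=1200):
    """Create an in-memory database with one item and ref_count references."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE refs (id INTEGER PRIMARY KEY, item_id INTEGER)")
    # Index on the referencing column; without it each batch scans the table.
    conn.execute("CREATE INDEX refs_item_id ON refs (item_id)")
    conn.execute("INSERT INTO items (id) VALUES (1)")
    conn.executemany("INSERT INTO refs (item_id) VALUES (?)", [(1,)] * ref_count)
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    removed = delete_in_batches(demo, item_id=1, batch_size=500)
    print("referencing rows removed:", removed)
```

Smaller batches and longer pauses lower the peak load at the cost of a slower overall deletion; suitable values depend on the workload.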
Resolved
Our systems are functional and we are closing out this incident.
Investigating
We are investigating reports of issues with our systems.