Increased latency impacting some main cluster apps
Updates

Write-up published

Read it here

Resolved

The three incidents on Jan 6 were caused by one of our database shards experiencing heavier-than-normal load. As a result, applications hosted on that shard, which represent roughly 1/7 of our main cluster applications, may have seen issues loading pages or data.

The root cause was a single app performing a very expensive deletion operation. In response, we have made, or are making, the following short-term and longer-term changes:

  • We temporarily blocked the operation causing the issues and reached out to the app owner to discuss workarounds.

  • We discovered that the app was missing some critical database indexes, and we created them to prevent the problem from recurring with that app.

  • We adjusted the rate at which we delete items that are heavily referenced elsewhere in the database, which should protect against other apps causing the same issue.

  • Longer-term, we are continuing to migrate off a legacy stored procedure framework (targeting end of Q1). Had that migration already been complete, it would have prevented this incident.
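For readers curious what the index and deletion-rate changes above might look like in practice, here is a minimal sketch. It is not our production implementation: the `refs` table, the `item_id` column, the index name, and the `delete_in_batches` helper are all hypothetical, and SQLite is used only so the example is self-contained. The idea is to delete heavily referenced rows in small batches with a pause between them, so that one large deletion cannot monopolize a shard.

```python
import sqlite3
import time


def delete_in_batches(conn, table, where_sql, params=(), batch_size=500, pause_s=0.05):
    """Delete rows matching where_sql in small batches, pausing between batches.

    Hypothetical helper for illustration only. `table` and `where_sql` are
    interpolated into the SQL text, so they must come from trusted code,
    never from user input.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {where_sql} LIMIT ?)",
            (*params, batch_size),
        )
        conn.commit()  # keep each transaction short so other queries can proceed
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        time.sleep(pause_s)  # throttle: leave headroom for other apps on the shard
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE refs (id INTEGER PRIMARY KEY, item_id INTEGER)")
    # An index on the referencing column lets each batch find its rows
    # without scanning the whole table.
    conn.execute("CREATE INDEX idx_refs_item_id ON refs (item_id)")
    conn.executemany(
        "INSERT INTO refs (item_id) VALUES (?)", [(i % 3,) for i in range(3000)]
    )
    conn.commit()
    deleted = delete_in_batches(conn, "refs", "item_id = ?", (0,), batch_size=200)
    print(f"deleted {deleted} rows")
```

The batch size and pause are tuning knobs: smaller batches and longer pauses reduce load on the shard at the cost of a slower overall deletion.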

Thu, Jan 9, 2025, 09:36 PM

Resolved

Our systems are functional and we are closing out this incident.

Mon, Jan 6, 2025, 11:19 PM

Investigating

We are investigating reports of issues with our systems.

Mon, Jan 6, 2025, 11:15 PM