Resolved -
We have resolved the incident. We will conduct an internal debrief and plan to share more details.
Update from our internal debrief on this incident:
From approximately 3:00 AM to 11:00 AM ET on August 14, less than 1% of traffic on our main cluster experienced elevated error rates. The impact was spread across both the Bubble editor and live apps, but mostly affected the editor. A Bubble app with an edge-case, severely inefficient workflow was able to consume more CPU resources than it should have, which placed unusually high demand on the main cluster resources.
Once the problematic workflow was identified and paused, system performance returned to normal. We’re now making improvements to detect and isolate this type of issue faster, including tighter alerting. We’re also working on improvements to how processing resources are allocated across apps to prevent any similar issues from impacting overall platform performance in the future.
Aug 15, 14:42 EDT
Monitoring -
A fix has been implemented and we are monitoring the results.
Aug 15, 11:56 EDT
Update -
We have stabilized our systems and are monitoring. Our previous hypothesis of an AWS networking layer issue turned out not to be the root cause. We have narrowed down the cause and will publish our findings.
Aug 15, 11:22 EDT
Update -
We believe there's an issue at the AWS networking layer and have opened a ticket with their support. Approximately 1% of requests are failing. These failures primarily affect the editor.
Aug 15, 09:36 EDT
Update -
We are continuing to investigate this issue.
Aug 15, 05:52 EDT
Investigating -
We have detected that various operations, both in the main environment and in the editor, are timing out. We are investigating the cause.
Aug 15, 05:15 EDT