At 06:15 UTC on August 9, 2021, we experienced a spike in CPU usage on our primary database servers. A very large number of simultaneous operations in a single workspace, specifically users joining the workspace and content being moved between workspaces, generated write conflicts that caused the primary database server to lock up. Users who were logged in at the time had their sessions terminated, and no new login requests could be processed.
Details and corrective actions
We identified the cause of the incident and initiated a fail-over to a new primary database server at 06:44 UTC. The fail-over completed at 07:35 UTC, at which point full service was restored. We then immediately began investigating the root cause of the write conflicts and optimizing the workflows for joining workspaces, so that this kind of load cannot impact system availability again.
The outage resulted in 1 hour and 20 minutes of downtime, from 06:15 UTC to 07:35 UTC. No data from before the outage was lost.
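The downtime figure follows directly from the timestamps above. As a minimal sketch of that arithmetic (the variable names and the `utc` helper are illustrative and not part of our tooling), the intervals can be checked like this:

```python
from datetime import datetime, timezone


def utc(hour, minute):
    """Build a UTC timestamp on the day of the incident (August 9, 2021)."""
    return datetime(2021, 8, 9, hour, minute, tzinfo=timezone.utc)


# Timestamps taken from the report; the names are illustrative only.
incident_start = utc(6, 15)     # CPU spike, primary database locks up
failover_started = utc(6, 44)   # fail-over to a new primary initiated
service_restored = utc(7, 35)   # fail-over complete, full service restored

time_to_failover = failover_started - incident_start
failover_duration = service_restored - failover_started
total_downtime = service_restored - incident_start

print("Time until fail-over started:", time_to_failover)   # 0:29:00
print("Fail-over duration:          ", failover_duration)  # 0:51:00
print("Total downtime:              ", total_downtime)     # 1:20:00
```

Of the 1 hour and 20 minutes, 29 minutes passed before the fail-over was initiated and 51 minutes were spent completing it.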
What we’ve done to avoid this happening again
As an immediate action, we implemented an optimization to the workflow for joining the impacted workspace. We are working towards applying this optimization to all workspaces in an upcoming release.