What happened:
At 08:20 UTC on May 18, 2021, CPU usage spiked on our primary database servers. The spike was caused by an unexpectedly large number of concurrent requests. As a result, the MURAL application became unresponsive and users were unable to log in.
Details and corrective actions:
At 09:01 UTC we initiated a manual failover to a new primary database server. The failover completed at 09:05 UTC, at which point full service was restored with no data loss.
What we’ve done to avoid this happening again:
Over the weekend of May 22-23, we performed a series of maintenance operations to significantly improve our database performance. We are also improving our automated internal alerts so that we can identify potential issues earlier and react faster if a similar event occurs in the future.