At 21:51 UTC on September 27, 2021, we executed a routine operation on our database that inadvertently caused database writes to fail. This in turn overloaded the database, and access requests began failing as well. Once the original operation finished at 22:42, the database returned to normal operation and access to the MURAL application was restored.
Details and corrective actions
The specific way the operation was performed triggered a little-known bug in our database server. We now understand the steps that trigger it and will update our processes to avoid repeating this scenario.
The outage resulted in approximately 50 minutes of downtime (21:51 to 22:42). No data saved before the outage was lost.
What we've done to avoid this happening again
We have updated our processes for running database operations to avoid triggering the now-known bug, and will be performing an upgrade to fully mitigate the issue. This upgrade will be announced on the MURAL status page as routine maintenance with the appropriate notice period.