At 09:43 UTC on September 23rd 2022 our automated monitoring indicated that users were receiving intermittent errors when trying to access MURAL. We determined that the issues were being caused by a recent code change which unexpectedly impacted our infrastructure capacity. We were able to provisionally mitigate this by increasing capacity, resolving the issue definitively through an updated code change.
This incident impacted service availability from 09:43 UTC to 10:42 UTC, for a total of 1 hour of intermittent downtime. No data was lost, however some members would have experienced troubles accessing their MURAL data.
What we’ve done to avoid this happening again
We are documenting the constraints identified through this outage as guidance for future changes. In addition to this we are reviewing our automated infrastructure capacity management to ensure we stay within acceptable parameters.
We are also reviewing protocols to maintain system stability, should this specific issue occur again.