Summary:
On Saturday, November 11th at 03:00 UTC, Mural performed scheduled maintenance on our production clusters. Post-migration checks indicated all functions were performing as expected. On Monday, November 13th, some Mural customers reported difficulty logging into the Mural web application. Mural’s incident response team was immediately engaged in troubleshooting these reports.
Initial investigations revealed that the platform upgrade over the preceding weekend had incorrect settings for the DNS infrastructure and a key backend application's auto-scaling. This resulted in unstable connections for some users.
During the course of this investigation, we also discovered that load balancing improvements for clients with specific network and application configurations altered how the client’s IP address was interpreted by our system, preventing access for such clients.
Our incident response team addressed the auto-scaling configuration, resolving DNS-related issues and restored access for the majority of users. Next, a new load-balancing configuration underwent adjustments and testing to restore stable connections for the previously-impacted users.
The total time from when our incident response team started working on this incident, to deploying the final fix, was 9 hours 40 minutes.
What we’ve done to prevent this happening again:
As part of Mural’s post-incident procedure, our engineering teams conducted a thorough review to identify the root cause and outline necessary improvements. 8 separate changes have been identified and will be implemented in the coming weeks. These changes cover monitoring to detect this scenario sooner, enhanced post-migration checks to ensure this scenario and others are included in our use cases and reviewing our migration process to reduce the risks.
We apologize for any inconvenience this incident may have caused and sincerely thank your patience whilst we worked through this incident.