Real time collaboration is down
Incident Report for MURAL
Postmortem

Summary

On March 11th, 2020, at 11:20 Pacific Time, our infrastructure monitoring systems showed an unusual load on the realtime collaboration component of our platform. This component enables different users to work together seamlessly on a mural at the same time. Some of our realtime collaboration servers became unresponsive, effectively bringing down realtime collaboration for users who were working together in murals at that time.

As soon as we identified the increased load on these servers, we acknowledged the situation on our status page at 11:33 and began applying remediation measures by increasing our dedicated capacity. The service gradually recovered responsiveness, and 15 minutes later, at 11:48, we saw the load return to regular levels. We continued monitoring the situation for over an hour, and at 12:55 we marked the incident as resolved.

Details and Corrective actions

Once the issue was mitigated, we began a forensic analysis to determine its root cause. The outage was caused by one of our ingress network services being unable to cope with an unprecedented rate of increase in traffic throughput. This single network service failed, and the load assigned to it was automatically redistributed to the other services in the same ingress network layer. The sudden excess load saturated those services in turn, leaving the whole layer partially unresponsive until we deployed additional servers to increase overall throughput.

Since this event took place, we have doubled our permanent capacity and enhanced our monitoring with more relevant metrics, so that we can detect the conditions leading to similar events across our infrastructure before they cause an outage. We are also working on an auto-scaling mechanism for the realtime collaboration subsystem, to be released later this month.
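The cascade described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not a model of our actual ingress layer: the function name surviving_nodes, the node count, the capacities and the load figures are all hypothetical, and load is assumed to be shared evenly among healthy services.

```python
def surviving_nodes(capacities, total_load, initially_failed=()):
    """Return indices of services still healthy once redistribution settles.

    Assumption: load is split evenly across healthy services, and a service
    fails as soon as its share exceeds its capacity.
    """
    alive = [i for i in range(len(capacities)) if i not in initially_failed]
    while alive:
        share = total_load / len(alive)
        still_ok = [i for i in alive if capacities[i] >= share]
        if len(still_ok) == len(alive):
            break  # every remaining service can absorb its share
        alive = still_ok  # overloaded services drop out, load is re-split
    return alive


# Hypothetical figures: four services, one of which fails first.
before = surviving_nodes([100, 100, 100, 100], 320, initially_failed={0})
after = surviving_nodes([200, 200, 200, 200], 320, initially_failed={0})
print(before)  # []
print(after)   # [1, 2, 3]
```

With little headroom, the failure of one service pushes every remaining service over its limit. With doubled capacity, the same failure is absorbed. A real ingress layer degrades more gradually than this all-or-nothing model, which is consistent with the partial unresponsiveness we observed.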

Conclusion

We are as concerned as you are about recent events surrounding the COVID-19 pandemic. As a result of this outbreak and the recent prevention measures taken by governments across the world, many companies and teams find themselves suddenly working remotely, and every day brings more challenges to solve. MURAL helps teams collaborate remotely, and in these past days we have seen unprecedented usage. Being prepared to handle these unexpected events is something we take extremely seriously. Our commitment to providing the best possible remote collaboration tools and practices to all imagination workers is stronger than ever, and we acknowledge that service interruptions of this kind are not acceptable. Just as you and your community are learning to adapt to recent developments, we too are learning how to adapt and thrive in these trying times.

We apologize for any inconvenience caused by this incident and thank you for your understanding.

This won’t happen again.

The MURAL Team

Posted Mar 13, 2020 - 00:23 GMT-03:00

Resolved
The issue in our real-time collaboration service was mitigated and the platform is behaving normally.
Posted Mar 11, 2020 - 16:55 GMT-03:00
Monitoring
The service is now stable again and we are continuing to monitor it.
Posted Mar 11, 2020 - 15:48 GMT-03:00
Identified
We are currently experiencing issues with our real-time collaboration service. Some clients may experience intermittent connectivity issues. We are actively monitoring and working on it, and we will keep the status.mural.co page updated with any changes.
Posted Mar 11, 2020 - 15:33 GMT-03:00
This incident affected: Canvas.