Realtime collaboration latency
Incident Report for MURAL
Postmortem

Summary

On March 17th, 2020, at around 8 AM Pacific Time, our infrastructure monitoring systems alerted us to increasing load times on our realtime collaboration servers. We saw that some of the clients connected to our realtime servers were entering a reconnection loop. The visible effect was that users on those clients were unable to collaborate in a mural.

As soon as we identified the increased load on these servers, we acknowledged the situation on our status page at 8:09 and took swift action to alleviate server load and bring the service back to normal. Latency levels began returning to normal at around 8:15, and at 8:25 we could confirm that all services were operating at their regular load and latency levels. We continued monitoring the situation for the following half hour and at 9:16 we marked the incident as resolved.

Details and Corrective actions

After analyzing logs and metrics, we concluded that the cause was one of our pub/sub servers being overloaded by a sudden request surge about 20 times above normal. The realtime servers rely on this pub/sub infrastructure to exchange realtime collaboration messages. The overload caused them to get disconnected from it, which in turn disconnected user devices from realtime collaboration. Once we detected this was happening, we moved all realtime pub/sub traffic to new, more powerful instances, and traffic load went back to normal after a few minutes.
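
As an illustration of how a surge like the one described above could be flagged automatically, here is a minimal sketch in Python. It is not MURAL's monitoring code. The class name `SurgeDetector`, the rolling-window size, and the 10x threshold are all hypothetical choices made for this example. The idea is simply to compare each new request-rate sample against the average of recent samples.

```python
from collections import deque


class SurgeDetector:
    """Hypothetical helper: flags a request rate far above a rolling baseline."""

    def __init__(self, window=12, threshold=10.0):
        # window: number of recent samples that form the baseline (assumed value)
        # threshold: multiple of the baseline that counts as a surge (assumed value)
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, requests_per_second):
        """Record one sample and return True if it looks like a surge."""
        surge = False
        # Only judge once the window is full, so the baseline is meaningful.
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            surge = baseline > 0 and requests_per_second >= self.threshold * baseline
        # Keep surge samples out of the baseline so it is not inflated.
        if not surge:
            self.samples.append(requests_per_second)
        return surge


if __name__ == "__main__":
    detector = SurgeDetector(window=3, threshold=10.0)
    for rate in [100, 110, 90, 120, 2000]:
        print(rate, detector.observe(rate))
```

In this sketch, a sample roughly 20 times above the recent average would be flagged well before the threshold is a close call. A real deployment would also need to account for normal daily traffic patterns.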

Conclusion

With the recent preventive measures taken against the COVID-19 pandemic by governments across the world, many companies and teams find themselves suddenly working remotely, and MURAL is helping an increasing number of teams collaborate. In fact, we have seen unprecedented usage in the last few days. Being prepared to handle these unexpected events is something we take extremely seriously. Our commitment to providing the best possible remote collaboration tools and practices to all imagination workers is stronger than ever, and we acknowledge that service interruptions of this kind are not acceptable. Although we fixed the issue in a matter of minutes this morning, we are now building a long-term solution to make sure this never happens again.

We apologize for any inconvenience caused by this incident and thank you for your understanding.

The MURAL Team

Posted Mar 17, 2020 - 19:06 GMT-03:00

Resolved
The issue affecting realtime collaboration was solved. All systems green.
Posted Mar 17, 2020 - 13:16 GMT-03:00
Monitoring
We identified the issue affecting realtime collaboration and applied a fix. We're monitoring the platform situation.
Posted Mar 17, 2020 - 12:36 GMT-03:00
Investigating
Some of our customers have reported high latency in realtime collaboration. We're investigating the issue.
Posted Mar 17, 2020 - 12:09 GMT-03:00
This incident affected: Web Application.