Elevated API response times
Incident Report for MURAL
Postmortem

At 14:19 UTC on April 22nd, part of our realtime collaboration service began processing requests significantly slower than usual, and a few minutes later our monitoring systems reported that part of the service as unavailable.

Our realtime service was experiencing high processing times due to increased latency in a backing pub-sub service we use to synchronize a portion of the realtime collaboration events among users on the same murals. The rest of the servers were behaving properly, and most users were able to collaborate normally.
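As a rough illustration of why only some users were affected (the names and structure below are our assumptions for this sketch, not MURAL's actual architecture), events can be fanned out through a per-mural pub-sub channel, so a slow channel only impacts collaborators on that mural:

```python
from collections import defaultdict

class MuralPubSub:
    """Minimal per-mural pub-sub sketch: each mural gets its own channel,
    so degraded delivery on one channel leaves other murals unaffected."""

    def __init__(self):
        # mural_id -> list of subscriber callbacks
        self.subscribers = defaultdict(list)

    def subscribe(self, mural_id, callback):
        self.subscribers[mural_id].append(callback)

    def publish(self, mural_id, event):
        # Fan the event out only to users on the same mural.
        for callback in self.subscribers[mural_id]:
            callback(event)

# Usage: an event on mural "m1" reaches only "m1" subscribers.
bus = MuralPubSub()
received_m1, received_m2 = [], []
bus.subscribe("m1", received_m1.append)
bus.subscribe("m2", received_m2.append)
bus.publish("m1", {"type": "sticky-note-moved"})
```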

The cause of the increased response times in this pub-sub service was unplanned server patching performed by our cloud provider outside of our requested maintenance window. Some API servers saw increased load on requests that relied on this service.

We immediately triggered a rotation of the affected servers. Even as those servers were rotated, our API remained slow because the backing pub-sub service continued to exhibit high latency.

Once we noticed that our cloud provider had performed unannounced patching on one of our pub-sub servers, we rotated that server as well. This process takes a while, but once it completed, dependent services returned to expected latency.
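The rotation described above can be sketched as a latency-based health check: instances whose latency breaches a threshold are flagged for replacement. The instance names, metric, and threshold below are illustrative assumptions, not our actual tooling:

```python
# Hypothetical rotation selector: flag instances whose p95 latency
# exceeds a threshold. Values here are made up for illustration.
LATENCY_THRESHOLD_MS = 500

def select_for_rotation(instances):
    """Return the IDs of instances whose p95 latency breaches the threshold."""
    return [
        inst["id"]
        for inst in instances
        if inst["p95_latency_ms"] > LATENCY_THRESHOLD_MS
    ]

fleet = [
    {"id": "api-1", "p95_latency_ms": 120},
    {"id": "pubsub-1", "p95_latency_ms": 2300},  # patched outside the window
    {"id": "api-2", "p95_latency_ms": 140},
]
to_rotate = select_for_rotation(fleet)
```

In this sketch only the degraded pub-sub instance is selected, mirroring how the healthy API servers kept serving users while the slow backing server was replaced.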

We understand the frustration this may have caused, and we are deeply sorry for the inconvenience.

Posted Apr 24, 2020 - 19:33 GMT-03:00

Resolved
API latency is back to normal and all systems are behaving as expected.
Posted Apr 22, 2020 - 12:53 GMT-03:00
Update
We continue to monitor service recovery; API latency is back to normal.
Posted Apr 22, 2020 - 12:19 GMT-03:00
Monitoring
We identified a cause for the increased response times and rolled out a fix. We're monitoring recovery.
Posted Apr 22, 2020 - 11:59 GMT-03:00
Investigating
We identified increased response times in some of our API servers. We're currently investigating.
Posted Apr 22, 2020 - 11:43 GMT-03:00
This incident affected: Canvas.