Yesterday afternoon, MURAL users were unable to edit their murals or collaborate for approximately one hour. Please accept our sincerest apologies. We strive to keep all MURAL services fast and available for you at all times, but over the last 30 days we fell short of the 99.99% uptime we've maintained for almost two years. 😞
Fortunately, the issue has been resolved, and we’re now focused on correcting the bug that caused the outage, as well as putting more checks and monitors in place to ensure that this doesn’t happen again. If you’re interested in the technical explanation for what happened and how it was fixed, read on.
On November 14th at 03:35 pm UTC, a new version of the product was deployed (see https://mural.co/changelog). For a reason still unknown, the CDN continued to serve old versions of the product from a few edge servers, which caused glitches for the users who received the old version. By 04:10 pm the issue had been identified, and since we don't control the CDN (Akamai), our solution was to push a change to our load balancer configuration that temporarily disabled the CDN for everyone until it returned to normal. Our DevOps team pushed the change into production at 04:15 pm.

The load balancer configuration change backfired on us: every connected user had to reestablish a secure websocket connection (a compute-intensive task) at the same time, during a high-traffic time of day, and our API response time rose well above average. The elevated response time caused clients to time out and retry their websocket reconnections, which in turn generated even more load on our servers, causing the real-time service to fail. The Product Engineering team quickly engaged and shipped a hotfix to the server code to relieve the system load, while additional server resources were provisioned at the same time. By 05:20 pm, all systems were back to normal.
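To see why simultaneous reconnections hurt so much, here is a toy model of the feedback loop (all numbers are hypothetical, purely for illustration, and not drawn from our actual traffic): a server that can complete a fixed number of handshakes per tick, and timed-out clients that immediately retry. When everyone reconnects at once, the server sees the entire herd; when reconnections are spread out, peak load stays within capacity.

```typescript
// Toy model: `capacity` handshakes complete per tick; timed-out clients
// stay in the backlog and retry. Returns the peak backlog the server sees.
function peakLoadPerTick(clients: number, capacity: number, spreadTicks: number): number {
  const perTick = Math.ceil(clients / spreadTicks); // arrivals per tick
  let remaining = clients;
  let backlog = 0;
  let peak = 0;
  for (let t = 0; remaining > 0 || backlog > 0; t++) {
    const arrivals = Math.min(perTick, remaining);
    remaining -= arrivals;
    backlog += arrivals; // retries keep timed-out clients in the queue
    peak = Math.max(peak, backlog);
    backlog -= Math.min(backlog, capacity); // handshakes completed this tick
  }
  return peak;
}

// All 10,000 clients reconnect at once: the server sees the full herd.
console.log(peakLoadPerTick(10_000, 500, 1));  // → 10000
// The same clients spread over 20 ticks never exceed capacity.
console.log(peakLoadPerTick(10_000, 500, 20)); // → 500
```

This is the "thundering herd" pattern: disabling the CDN disconnected everyone at once, and immediate retries kept the offered load pinned at its peak instead of letting it drain.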
With services once again working normally, our work is now focused on (a) removing the source of failure that caused yesterday’s outage, and (b) speeding up recovery when a problem does occur. We'll be taking the following steps in the next few days:
1. Improving the websocket authentication and connection process on the server side to make it more performant while keeping it as secure as it is today.
2. Improving the client's exponential backoff reconnection strategy so that mass websocket disconnections (for example, caused by a load balancer change) can't trigger a self-inflicted surge in server load.
3. Adding dedicated infrastructure resources to the real-time websocket server cluster so it can cope with potential mass-reconnection scenarios in the future.
4. Adding validation checks to deployments so that a bad or stale version of the web client bundle served from edge servers can no longer cause a service disruption.
5. Adding targeted monitoring and alerts to detect and diagnose the cause of a service failure more quickly.
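For step 2, the usual approach is exponential backoff with "full jitter": each retry waits a random amount of time up to an exponentially growing (and capped) bound, so clients that disconnected together don't all retry at the same instant. A minimal sketch, with constants and the `connect` callback purely hypothetical (this is not MURAL's actual client code):

```typescript
// Illustrative constants, not MURAL's actual parameters.
const BASE_MS = 500;   // bound for the first retry delay
const CAP_MS = 30_000; // never wait longer than this

// Delay before the `attempt`-th retry (attempt = 1, 2, 3, ...).
// `random` is injectable so the policy can be tested deterministically.
function reconnectDelayMs(attempt: number, random: () => number = Math.random): number {
  const bound = Math.min(CAP_MS, BASE_MS * 2 ** (attempt - 1));
  // "Full jitter": pick uniformly in [0, bound) to spread retries out.
  return Math.floor(random() * bound);
}

// Example reconnect loop; `connect` is a hypothetical placeholder for
// whatever opens the websocket and rejects on failure or timeout.
async function reconnectWithBackoff(connect: () => Promise<void>, maxAttempts = 10): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await connect();
    } catch {
      await new Promise((resolve) => setTimeout(resolve, reconnectDelayMs(attempt)));
    }
  }
  throw new Error("gave up reconnecting");
}
```

The cap keeps worst-case waits reasonable for users, while the jitter is what actually prevents the synchronized retry waves that amplified yesterday's load.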