Increased load times
Incident Report for MURAL

Postmortem of Service Outage

On Friday, April 17th, 2020, we experienced an extended outage that intermittently blocked users' access to our platform over the course of 14 hours. No data was lost or compromised during this time.

We know many customers were counting on MURAL to support important work with their teams and customers. We sincerely apologize for the downtime.

Root Cause Analysis

MURAL uses a MongoDB cluster as its primary data store, with a replica set distributed between geographically diverse Azure datacenters. We chose this architecture because it best matches the dynamic, schemaless nature of MURAL documents, and because it can provide high availability, data durability, and robust disaster recovery capabilities.
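
To make the topology concrete, the short sketch below inspects replica set member states with PyMongo. The hostnames, replica set name, and connection string are illustrative placeholders, not our actual infrastructure.

    # Sketch: inspecting replica set member states with PyMongo.
    # Hostnames and the replica set name are hypothetical placeholders.
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://node-east.example.net,node-west.example.net,"
        "node-eu.example.net/?replicaSet=rs0"
    )

    # replSetGetStatus reports each member's state (PRIMARY, SECONDARY,
    # DOWN, ...) and health as seen by the node answering the command.
    status = client.admin.command("replSetGetStatus")
    for member in status["members"]:
        print(member["name"], member["stateStr"], member["health"])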

This event was initially triggered by an issue with our database provider that caused our cluster to lose connectivity to a secondary node for an extended period of time. After the secondary node automatically restarted, it could not join the cluster due to a rare networking issue.

Our MongoDB cluster was running a version whose default configuration causes the entire replica set to stop accepting writes when it cannot verify, for an extended period of time, that data has been received and stored by a majority of the data-bearing nodes in the cluster.

This default configuration, in conjunction with the replica set topology we were using, resulted in failed writes under high load when a secondary node was unreachable for an extended period of time and other nodes in the cluster were falling behind in replication.
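
To illustrate the failure mode, here is a minimal PyMongo sketch with placeholder names (not our application code): under a majority write concern, a write is acknowledged only once a majority of data-bearing nodes have stored it, so an explicit wtimeout is what turns an unreachable majority into an error rather than an indefinite wait.

    # Sketch of the failure mode under write concern "majority".
    # Database, collection, and hostname are hypothetical placeholders.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern
    from pymongo.errors import WTimeoutError

    client = MongoClient("mongodb://node-east.example.net/?replicaSet=rs0")
    murals = client.get_database("mural").get_collection(
        "murals",
        write_concern=WriteConcern(w="majority", wtimeout=5000),  # 5 s cap
    )

    try:
        murals.insert_one({"title": "example"})
    except WTimeoutError:
        # MongoDB does not roll the write back on wtimeout; it may still
        # replicate later, so the application must treat it as uncertain.
        print("write was not acknowledged by a majority within 5 s")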

Normal operations were restored when we recovered the faulty node, rejoined it to the cluster, and intentionally severed the link to our failover region, causing the replica set topology to reset to a working configuration.
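
Mechanically, that kind of topology change is a replica set reconfiguration. The following is a hypothetical sketch of removing an unreachable member; it is not the exact procedure we ran, and the hostnames are placeholders.

    # Hypothetical sketch: dropping an unreachable member from the replica
    # set configuration. Not the exact procedure we ran during the incident.
    from pymongo import MongoClient

    # Connect directly to a surviving node (placeholder hostname).
    client = MongoClient(
        "mongodb://node-east.example.net:27017/?directConnection=true"
    )

    config = client.admin.command("replSetGetConfig")["config"]
    config["version"] += 1
    config["members"] = [
        m for m in config["members"]
        if m["host"] != "node-eu.example.net:27017"  # placeholder host
    ]
    # force=True permits the reconfig even when a majority is unavailable.
    client.admin.command("replSetReconfig", config, force=True)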

Why did it take us so long to fix the issue?

  1. Our team was unaware of the implications of MongoDB's default configuration for our cluster topology under node failures. This issue is uncommon and requires a rare set of circumstances to produce the outcome we experienced.
  2. To accelerate a resolution, we escalated the issue to our managed service provider. Their guidance focused our team's attention on possible application errors as the cause of the unusual behavior. After investigating thoroughly, we determined this was not the cause of the issue.
  3. We followed our procedures for diagnosing and repairing this type of issue, including working closely with the relevant providers we rely on. Several of our efforts seemed to resolve the issue temporarily, and each attempt took additional time, since our database footprint is now very large.

What are we changing to avoid this happening again?

We have scaled up all of the hardware in our MongoDB cluster. MURAL is now running on hardware 4x more powerful than what we ran prior to the increase in usage we've seen as a result of COVID-19. This change gives us a comfortable buffer to make changes as usage continues to grow.

We have also audited and updated our cluster topology and configuration to guard against similar edge cases in the future. Our monitoring systems and corresponding operational procedures have been updated and enhanced as well.
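
As one concrete example of the kind of signal involved, replication lag per secondary can be derived from replSetGetStatus. The sketch below is illustrative only (PyMongo, placeholder hostname), not our production monitoring code.

    # Sketch of a replication-lag check of the kind a monitor might run.
    # Hostname is a placeholder; this is not our production monitoring code.
    from pymongo import MongoClient

    client = MongoClient("mongodb://node-east.example.net/?replicaSet=rs0")
    status = client.admin.command("replSetGetStatus")

    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
    for m in status["members"]:
        if m["stateStr"] == "SECONDARY":
            # optimeDate is the wall-clock time of the last applied op.
            lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
            print(f"{m['name']} is {lag:.0f}s behind the primary")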

Additionally, we are continuing to work on new strategic initiatives around scalability and disaster recovery. We will have more to share on those soon.

We deeply appreciate the trust our customers have placed in our software and service, and we will continue to work hard to provide the best and most reliable experience we can.

If you have any questions, you may contact me directly at: pato@mural.co.

Sincerely,

Patricio Jutard
CTO - MURAL

Posted Apr 22, 2020 - 21:09 GMT-03:00

Resolved
This incident has been resolved.
Posted Apr 17, 2020 - 14:16 GMT-03:00
Update
The platform has been performing as expected for the past update period. All systems are operational. We're still monitoring for any unexpected developments.
Posted Apr 17, 2020 - 14:12 GMT-03:00
Update
We applied a fix and are seeing everything going back to normal once again. We'll keep monitoring and will report any developments.
Posted Apr 17, 2020 - 13:33 GMT-03:00
Update
We are still seeing issues for some of our customers. We're monitoring closely and addressing the situation.
Posted Apr 17, 2020 - 11:29 GMT-03:00
Update
We have implemented a fix and are monitoring our platform.
Posted Apr 17, 2020 - 10:39 GMT-03:00
Update
We are applying a remediation while working with our database provider. We will update again in 30 minutes, or as current events develop.
Posted Apr 17, 2020 - 09:55 GMT-03:00
Update
We are applying a remediation while working with our database provider. We will update again in 30 minutes, or as current events develop.
Posted Apr 17, 2020 - 09:13 GMT-03:00
Update
We are working closely with our database provider in order to resume normal operations as quickly as possible. Thank you for your patience.
Posted Apr 17, 2020 - 08:50 GMT-03:00
Update
We're still observing issues related to data load times and are continuing to address them.
Posted Apr 17, 2020 - 07:36 GMT-03:00
Update
We are continuing to monitor for any further issues.
Posted Apr 17, 2020 - 07:36 GMT-03:00
Update
We applied additional remediating measures and services went back to normal. We continue to monitor platform status.
Posted Apr 17, 2020 - 06:35 GMT-03:00
Update
We are continuing to monitor for any further issues.
Posted Apr 17, 2020 - 05:30 GMT-03:00
Update
We are continuing to monitor for any further issues.
Posted Apr 17, 2020 - 05:15 GMT-03:00
Update
Service levels are returning to normal. We are still monitoring to make sure there are no more hiccups.
Posted Apr 17, 2020 - 03:13 GMT-03:00
Update
We are continuing to monitor for any further issues.
Posted Apr 17, 2020 - 02:03 GMT-03:00
Update
We're still observing issues related to data load times and are continuing to address them.
Posted Apr 17, 2020 - 01:42 GMT-03:00
Monitoring
We applied a fix and the service is healthy again. We are still monitoring the platform status.
Posted Apr 17, 2020 - 00:40 GMT-03:00
Identified
We identified the cause of the error. We're working on a fix.
Posted Apr 16, 2020 - 23:52 GMT-03:00
Update
We are continuing to investigate this issue.
Posted Apr 16, 2020 - 22:51 GMT-03:00
Investigating
We identified increased load on one of our database servers, which is affecting load times for some of our customers. We're investigating the situation.
Posted Apr 16, 2020 - 22:50 GMT-03:00