On Friday, April 17th, 2020, we experienced an extended outage that intermittently blocked user access to our platform over the course of 14 hours. No data was lost or compromised during this time.
We know many customers were counting on MURAL to support important work with their teams and customers. We sincerely apologize for the downtime.
MURAL uses a MongoDB cluster as its primary data store, with a replica set distributed between geographically diverse Azure datacenters. We chose this architecture because it best matches the dynamic, schemaless nature of MURAL documents, and because it can provide high availability, data durability, and robust disaster recovery capabilities.
This event was initially triggered by an issue with our database provider that caused our cluster to lose connectivity to a secondary node for an extended period of time. After the secondary node automatically restarted, it could not join the cluster due to a rare networking issue.
Our MongoDB cluster was running a version whose default configuration causes the entire replica set to stop accepting writes when, for an extended period of time, it cannot verify that data was received and stored by a majority of the data nodes in the cluster.
This default configuration, combined with the replica set topology we were using, resulted in failed writes under high load when a secondary node was unreachable for an extended period of time and other nodes in the cluster were falling behind in data replication.
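To illustrate the failure mode described above, here is a minimal sketch in Python of a majority-acknowledgement check. It is a toy model, not MongoDB's actual implementation: the `Node` class, the `majority_can_acknowledge` function, and the `max_lag_seconds` threshold are hypothetical names introduced only for illustration, and the three-node layout is assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical data-bearing member of a replica set (toy model)."""
    name: str
    reachable: bool = True
    lag_seconds: float = 0.0  # how far this node trails the primary


def majority_can_acknowledge(nodes, max_lag_seconds=10.0):
    """Return True if a strict majority of nodes can confirm a write.

    A node counts only if it is reachable and within the assumed lag limit.
    """
    healthy = [n for n in nodes if n.reachable and n.lag_seconds <= max_lag_seconds]
    return len(healthy) > len(nodes) // 2


# Assumed scenario: one secondary unreachable, another falling behind.
degraded = [
    Node("primary"),
    Node("secondary-a", reachable=False),
    Node("secondary-b", lag_seconds=120.0),
]
# Same set after the faulty node is recovered and has caught up.
recovered = [
    Node("primary"),
    Node("secondary-a"),
    Node("secondary-b", lag_seconds=120.0),
]

print(majority_can_acknowledge(degraded))   # False: only 1 of 3 nodes qualifies
print(majority_can_acknowledge(recovered))  # True: 2 of 3 nodes qualify
```

In this model, writes are refused in the degraded case even though the primary itself is healthy, which mirrors the combination of an unreachable secondary and lagging replication described above.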
Normal operations were restored when we recovered the faulty node, joined it to the cluster, and intentionally severed the link to our failover region, causing the replica set topology to reset to a working configuration.
We have scaled up all of the hardware in our MongoDB cluster. MURAL is now running on hardware that is 4x more powerful than it was before the increase in usage we’ve seen as a result of COVID-19. This change gives us a comfortable buffer to make further changes as usage continues to grow.
We have also audited and updated our cluster topology and configuration to make sure it will be immune to similar edge cases in the future. Our monitoring systems and corresponding operational procedures have been updated and enhanced, as well.
Additionally, we are continuing to work on new strategic initiatives around scalability and disaster recovery. We will have more to share on those soon.
We deeply appreciate the trust our customers have placed in our software and service, and we will continue to work hard to provide the best and most reliable experience we can.
If you have any questions, you may contact me directly at: pato@mural.co.
Sincerely,
Patricio Jutard
CTO - MURAL