Unexpected anonymous user appeared in a mural
Incident Report for Mural
Postmortem

Context

As part of an ongoing effort to improve realtime performance and reduce perceived lag in realtime collaboration, we applied some experimental optimizations to our beta environment.

The optimization consisted of refactoring the websocket notification code. The refactored code contained a bug that earlier tests in our testing environment did not catch, because it only occurs in a very low-probability edge case.

To make our tests as accurate as possible, the realtime notification subsystem in the beta environment used for the test was linked to our production pub/sub cluster. As a result, full-cluster broadcasts issued in beta reached production subscribers.

Incident timeline

8:10 PM GMT - We merged the offending code containing the defect described above. This merge triggered an automated rolling deployment to our beta environment.

8:23 PM GMT - The API rolling deployment completed in the beta environment, effectively deploying the defective code to every beta API server.

9:16 PM GMT - We executed an emergency rollback of our beta API, restoring every server to the previous version without the defective code.

Mitigation and preventive measures

• We fully rolled back the defective code, both on our beta servers and in source control

• We added negative broadcast tests to every regression test batch

• We are working on automated regression tests to assert notification broadcasts are not possible

• We will sever the link between our pub/sub cluster in beta and production to prevent any beta code from ever again affecting production collaboration sessions
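As an illustration of what a negative broadcast test could look like, here is a minimal sketch. It is not our actual test suite: the `route` function, the notification fields and the client and mural names are all hypothetical, chosen only to show the shape of the assertion (a notification scoped to one mural must never reach clients connected to another).

```python
def route(notification, connections):
    """Return the set of client ids that should receive a notification.

    `connections` maps client id -> mural id (hypothetical structure).
    A notification without a target mural reaches nobody (fail closed).
    """
    target = notification.get("mural_id")
    if target is None:
        return set()
    return {client for client, mural in connections.items() if mural == target}


def test_notification_does_not_leak_to_other_murals():
    connections = {
        "qa-1": "test-mural",
        "qa-2": "test-mural",
        "customer-1": "customer-mural",
    }
    recipients = route({"mural_id": "test-mural", "type": "widget_added"}, connections)
    # Positive check: collaborators of the target mural are notified.
    assert recipients == {"qa-1", "qa-2"}
    # Negative check: nobody outside the target mural is notified.
    assert "customer-1" not in recipients


def test_unscoped_notification_reaches_nobody():
    connections = {"customer-1": "customer-mural"}
    assert route({"type": "widget_added"}, connections) == set()


test_notification_does_not_leak_to_other_murals()
test_unscoped_notification_reaches_nobody()
```

The second test covers the fail-closed case: if a notification ever loses its mural scope, the safe outcome is that it is delivered to no one rather than to everyone.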

Safety, Security and Consistency considerations

No test content was ever persisted in production murals. Our websocket notification subsystem only reports changes in the database and never modifies persistent data. Because notifications that should have been multicast were broadcast instead, production collaboration sessions received notifications about test changes in our test mural. These notifications were volatile, and a mural reload (a page refresh in the web app, or leaving and re-entering the mural in native apps) fixed the issue.

No unauthorized anonymous user joined the murals. When the mural clients received the extraneous notifications, they incorrectly assumed those notifications were sent by an anonymous user.

This is the reason anonymous avatars appeared in the realtime collaboration sessions. These avatars did not represent a connected user; they were the result of the client code incorrectly mapping unknown notifications to anonymous users.
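The client behavior described above can be sketched as follows. This report does not include the client code, so both functions, the `sender_id` field and the `participants` mapping are hypothetical. The first function mimics the reported behavior, and the second shows one possible stricter alternative that lets the caller drop notifications from senders that are not part of the session.

```python
def resolve_sender_lenient(notification, participants):
    """Mimics the reported behavior: an unknown sender becomes 'anonymous'."""
    return participants.get(notification.get("sender_id"), "anonymous")


def resolve_sender_strict(notification, participants):
    """Possible alternative: return None so the caller can ignore the notification."""
    return participants.get(notification.get("sender_id"))


# Participants the client knows about in the current session (hypothetical data).
participants = {"u1": "Alice"}
stray = {"sender_id": "load-test-bot", "type": "widget_added"}

lenient_result = resolve_sender_lenient(stray, participants)
strict_result = resolve_sender_strict(stray, participants)
```

With the lenient mapping, the stray notification is attributed to an anonymous collaborator, which is what showed up as an avatar. With the strict mapping, the result is `None` and the notification can be discarded.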

No user content was leaked to other users or MURAL employees. The defective code only affected beta notification servers and only MURAL employees have access to beta environments. The content broadcast was produced as part of our internal load test. Because user notifications are issued by production realtime servers not affected by the defect, no user content was ever broadcast or otherwise exposed.

Both the correct multicast and the defective broadcast operations are directed from a notification producer (the notification server) to mural collaborators (in multicast) and to every connected user (in broadcast). During the incident the only notification producers capable of issuing defective broadcasts were the beta notification servers, limiting the scope of broadcast messages to only messages produced internally as part of our load test.
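To make the difference between the two operations concrete, here is a toy in-memory sketch. It is not our notification server: the `NotificationHub` class, its methods and the client and mural names are hypothetical and only model who receives a message in each case.

```python
from collections import defaultdict


class NotificationHub:
    """Toy stand-in for a notification producer (hypothetical)."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # mural id -> connected client ids
        self.inbox = defaultdict(list)       # client id -> received notifications

    def connect(self, client_id, mural_id):
        self.subscribers[mural_id].add(client_id)

    def multicast(self, mural_id, notification):
        """Intended behavior: deliver only to collaborators of one mural."""
        for client_id in self.subscribers.get(mural_id, ()):
            self.inbox[client_id].append(notification)

    def broadcast(self, notification):
        """Defective behavior: deliver to every connected client."""
        for clients in self.subscribers.values():
            for client_id in clients:
                self.inbox[client_id].append(notification)


hub = NotificationHub()
hub.connect("qa-client", "test-mural")
hub.connect("customer-client", "customer-mural")

hub.multicast("test-mural", "sticky-note-added")
after_multicast = list(hub.inbox["customer-client"])

hub.broadcast("sticky-note-added")
after_broadcast = list(hub.inbox["customer-client"])
```

After the multicast, the customer client has received nothing. After the broadcast, it has received the test notification even though it never joined the test mural, which is the effect users saw during the incident.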

Posted Mar 29, 2019 - 10:23 GMT-03:00

Resolved
Today, Thursday, March 28th, while running an internal quality assurance test, a set of load test data scoped to target a single mural was broadcast to all murals that had at least one online user at the time of the tests, from 01:23 PM PST to 02:16 PM PST. Any user who was working in a mural during this time may have seen:

- What looked like an anonymous user joining the mural
- Some sticky notes appearing on those murals

Of course, this wasn’t intended to happen, and we apologize. The issue has now been fixed.

You can rest assured that:
- No content was leaked in any way.
- No external or anonymous user actually accessed any mural. This was a quality simulation. All content was generated by our Quality Assurance team.
- No information inside the mural was altered, edited or changed.
- No external content got saved in your murals. The content disappears if the page is refreshed.

The MURAL team will now run a root-cause analysis and publish a remediation plan to prevent this from happening again.
Posted Mar 28, 2019 - 20:37 GMT-03:00
This incident affected: Mural Application (Canvas).