Microsoft posted a preliminary RCA regarding this incident:
Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have resulted in an error or timeout.
Downstream Azure services that rely on Cosmos DB were also impacted during this window, including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.
What went wrong and why?
Cosmos DB load balances workloads across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression: it did not account for the reduction in available cluster capacity caused by ongoing maintenance. The regression surfaced during a platform maintenance event in one of the frontend clusters in the North Europe region, causing the availability issues described above.
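To illustrate the kind of gap described above, here is a minimal sketch of a placement check that does and does not account for capacity taken out by maintenance. It is not Microsoft's implementation; the `Cluster` type, its fields, the load units, and the 80% headroom threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # All fields are hypothetical; the RCA does not describe the real data model.
    name: str
    total_nodes: int
    nodes_in_maintenance: int
    load_limit_per_node: float  # arbitrary load units
    current_load: float

    @property
    def nominal_capacity(self) -> float:
        # Capacity if every node were in service.
        return self.total_nodes * self.load_limit_per_node

    @property
    def effective_capacity(self) -> float:
        # Capacity of the nodes that are actually serving traffic right now.
        in_service = max(self.total_nodes - self.nodes_in_maintenance, 0)
        return in_service * self.load_limit_per_node


def can_accept_naive(cluster: Cluster, extra_load: float, headroom: float = 0.8) -> bool:
    # Ignores maintenance: the class of mistake the RCA describes.
    return cluster.current_load + extra_load <= cluster.nominal_capacity * headroom


def can_accept(cluster: Cluster, extra_load: float, headroom: float = 0.8) -> bool:
    # Uses only the capacity that is currently in service.
    return cluster.current_load + extra_load <= cluster.effective_capacity * headroom


cluster = Cluster("fe-1", total_nodes=10, nodes_in_maintenance=4,
                  load_limit_per_node=100.0, current_load=500.0)
naive_ok = can_accept_naive(cluster, 50.0)
aware_ok = can_accept(cluster, 50.0)
```

With four of ten nodes in maintenance, the naive check still admits the extra load while the maintenance-aware check rejects it, which is the difference a balancer would need to see before placing more work on the cluster.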
How did we respond?
Our monitors alerted us to the impact on this cluster. We ran two workstreams in parallel – one focused on identifying the reason for the issues themselves, the other on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region.
Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an alternative healthy cluster in North Europe. This load balancing operation allowed the cluster to recover to a healthy operating state.
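The account-by-account move described above can be sketched roughly as follows. This is an assumption-laden illustration, not the procedure Microsoft ran: the function name `plan_moves`, the per-account load figures, and the "most spare capacity first" rule are all hypothetical.

```python
def plan_moves(accounts: dict[str, float],
               spare_capacity: dict[str, float]) -> list[tuple[str, str]]:
    """Plan moves off an impacted cluster (hypothetical greedy sketch).

    accounts: account name -> estimated load on the impacted cluster.
    spare_capacity: healthy cluster name -> load it can still absorb.
    Returns (account, target_cluster) pairs in the order they would be moved.
    """
    spare = dict(spare_capacity)  # work on a copy so the caller's data is untouched
    moves: list[tuple[str, str]] = []
    if not spare:
        return moves
    # Place the heaviest accounts first, while targets still have the most room.
    for account, load in sorted(accounts.items(), key=lambda kv: kv[1], reverse=True):
        target = max(spare, key=spare.get)
        if spare[target] < load:
            continue  # no healthy cluster can take it; leave for manual review
        spare[target] -= load
        moves.append((account, target))
    return moves


moves = plan_moves({"a": 50.0, "b": 30.0, "c": 100.0}, {"x": 80.0, "y": 60.0})
```

Checking each account against the remaining spare capacity before moving it is what keeps the mitigation from overloading the healthy clusters in turn, and it is also one reason such a migration takes time at scale.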
Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover for customers using multiple regions), we decided not to do so during this incident – the majority of the clusters (and therefore customers) in the region were unimpacted.
How are we making incidents like this less likely or less impactful?