[North Europe] Degraded performance of signature management apps

Incident Report for CodeTwo

Postmortem

Microsoft posted a preliminary RCA regarding this incident:

What happened?

Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have resulted in an error or timeout.

Downstream Azure services that rely on Cosmos DB also experienced impact during this window - including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.

What went wrong and why?

Cosmos DB load balances workloads across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression: it did not account for the reduction in available cluster capacity caused by maintenance. The regression surfaced during a platform maintenance event in one of the frontend clusters in the North Europe region, causing the availability issues described above.
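To illustrate the class of regression described above, here is a minimal sketch in Python. It is not Microsoft's implementation: the `Cluster` type, the field names, the node counts, and the 80% headroom threshold are all hypothetical. It only contrasts a placement check that uses a cluster's nominal capacity with one that subtracts nodes taken out for maintenance.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # All fields are hypothetical and chosen only for illustration.
    name: str
    total_nodes: int
    nodes_in_maintenance: int
    load_limit_per_node: float  # load units one node can serve
    current_load: float

    def nominal_capacity(self) -> float:
        # Capacity as if every node were in service.
        return self.total_nodes * self.load_limit_per_node

    def effective_capacity(self) -> float:
        # Capacity after removing nodes that are down for maintenance.
        available = max(self.total_nodes - self.nodes_in_maintenance, 0)
        return available * self.load_limit_per_node


def naive_can_accept(cluster: Cluster, extra_load: float, headroom: float = 0.8) -> bool:
    # Ignores maintenance: the kind of check that over-admits load.
    return cluster.current_load + extra_load <= headroom * cluster.nominal_capacity()


def can_accept(cluster: Cluster, extra_load: float, headroom: float = 0.8) -> bool:
    # Capacity-aware check: only counts nodes that are actually in service.
    return cluster.current_load + extra_load <= headroom * cluster.effective_capacity()


# A frontend cluster with 4 of its 10 nodes under maintenance.
fe1 = Cluster("fe-1", total_nodes=10, nodes_in_maintenance=4,
              load_limit_per_node=100.0, current_load=500.0)

print(naive_can_accept(fe1, 100.0))  # True: 600 <= 0.8 * 1000
print(can_accept(fe1, 100.0))        # False: 600 > 0.8 * 600
```

In this toy example the naive check keeps routing work to a cluster that is already past its safe limit once the maintenance is taken into account.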

How did we respond?

Our monitors alerted us to the impact on this cluster. We ran two workstreams in parallel: one focused on identifying the cause of the issues, while the other focused on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region.

Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an alternative healthy cluster in North Europe. This load balancing operation allowed the cluster to recover to a healthy operating state.
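The account-by-account move described above can be pictured with a short Python sketch. This is an assumption-laden illustration, not the procedure Microsoft used: the function `plan_migration`, the account and cluster names, and the load figures are hypothetical. It places the heaviest accounts first and always picks the healthy cluster with the most spare capacity, leaving an account unplaced instead of overloading a target.

```python
from typing import Dict, List, Tuple


def plan_migration(
    account_loads: Dict[str, float],
    spare_capacity: Dict[str, float],
) -> Tuple[Dict[str, str], List[str]]:
    """Greedy plan: map each account to a healthy cluster with enough room.

    account_loads: load of each account on the impacted cluster (hypothetical units).
    spare_capacity: remaining safe capacity of each healthy cluster.
    Returns (plan, unplaced), where plan maps account -> target cluster.
    """
    remaining = dict(spare_capacity)  # copy so the caller's dict is untouched
    plan: Dict[str, str] = {}
    unplaced: List[str] = []

    # Heaviest accounts first, so large accounts are not stranded at the end.
    for account, load in sorted(account_loads.items(), key=lambda kv: kv[1], reverse=True):
        target = max(remaining, key=remaining.get, default=None)
        if target is not None and remaining[target] >= load:
            plan[account] = target
            remaining[target] -= load
        else:
            # No healthy cluster can take this account safely.
            unplaced.append(account)

    return plan, unplaced


plan, unplaced = plan_migration(
    {"acct-a": 50.0, "acct-b": 30.0, "acct-c": 40.0},
    {"fe-2": 60.0, "fe-3": 70.0},
)
print(plan)      # {'acct-a': 'fe-3', 'acct-c': 'fe-2'}
print(unplaced)  # ['acct-b']
```

A real migration would also need per-account state checks, as the RCA notes; the sketch covers only the placement decision.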

Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover for customers using multiple regions), we decided not to do that during this incident, as the majority of the clusters (and therefore customers) in the region were unimpacted.

How are we making incidents like this less likely or less impactful?

Already completed:

Fixed the regression in our load balancer procedure, to safely factor in capacity fluctuations during maintenance.

In progress:

Improving our monitoring and alerting to detect these issues earlier and apply pre-emptive actions. (Estimated completion: October 2022)
Improving our processes to reduce the impact time with a more structured manual load balancing sequence during incidents. (Estimated completion: November 2022)

Posted Sep 12, 2022 - 13:15 UTC

Resolved

This incident has been resolved.

Posted Sep 07, 2022 - 17:52 UTC

Update

We are continuing to monitor for any further issues.

Posted Sep 07, 2022 - 17:27 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 07, 2022 - 15:58 UTC

Update

We are continuing to work on a fix for this issue.

Posted Sep 07, 2022 - 13:59 UTC

Update

Microsoft wrote:

Current Status: We are currently investigating a potential root cause and are exploring mitigation options. We will provide updates in 60 minutes or as events warrant.

Posted Sep 07, 2022 - 12:29 UTC

Update

We are continuing to work on a fix for this issue.

Posted Sep 07, 2022 - 12:28 UTC

Update

Microsoft confirmed they are currently investigating the problem with Cosmos DB in this region. This problem limits your access to app.codetwo.com and affects the performance of Autoresponder, which might result in auto-responses not being sent for some users. We're actively working with Microsoft to make sure they fix the problem ASAP.

All signatures are imprinted normally.

Posted Sep 07, 2022 - 12:10 UTC

Identified

Currently, a subset of users in North Europe might be unable to manage signatures using app.codetwo.com. We're investigating this problem now. All emails are imprinted and delivered as normal - only designing and creating new signatures is temporarily unavailable.

Posted Sep 07, 2022 - 11:49 UTC

Investigating

We are currently investigating this issue.

Posted Sep 07, 2022 - 11:45 UTC

This incident affected: North Europe (Signature management (app.codetwo.com)).