[North Europe] Email processing delays due to Microsoft power issues
Incident Report for CodeTwo
Postmortem

Email delivery delays that affected a subset of users in North Europe were caused by power issues in the Microsoft datacenter in Dublin. Although the servers hosting CodeTwo services were not directly affected, most of our clusters were indirectly impacted, because connectivity to all clusters in this region was degraded or unavailable. Our high-availability secondary services in this region were also partially impacted. This led to email processing delays for some tenants and created bottlenecks in the mail transport pipeline until our failover systems, hosted on unaffected nodes, took over and mitigated the problem for affected users.

Our failover services mitigated the problem completely within minutes. Once the entire datacenter was fully operational again, we switched back to primary services.

For more information, please read the RCA provided by Microsoft:

Incident Summary:

Between 15:40 and 16:20 UTC on 23 Mar 2020, a subset of customers in North Europe may have seen errors connecting to resources hosted in this region.

Root cause:

During an electrical switching procedure that was being performed on a construction site that shares utility power with one of our operational datacenters, an incorrect process was followed. Due to this improper switching, a large voltage sag was seen by our operational datacenter. While there was no loss of power to server racks, the event led to a subset of servers within a single storage scale unit to experience a reboot event. The rebooting of the various servers led to some of the region’s Storage subscriptions and their associated Azure services to be unreachable while the systems recovered.

Mitigation:

As this was a transient power sag event, the Storage servers were allowed to automatically recover.

We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Evaluate server hardware to determine the cause of rebooting.
  • Partner with the construction company to ensure that they understand the impact they caused and they take steps to ensure that all electrical work on the shared utility service follows correct procedures.

We apologize for any inconvenience this may have caused.

Posted Mar 26, 2020 - 12:18 CET

Resolved
The incident has been resolved. An RCA will be provided later. Please accept our apologies for the problem.
Posted Mar 23, 2020 - 17:45 CET
Monitoring
The issue has been mitigated. All emails are now delivered with no delays. We're actively monitoring the services to confirm that everything is working correctly. We’re also working with the Microsoft Premier Support Team to find out what caused the problem.
Posted Mar 23, 2020 - 17:36 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 23, 2020 - 17:29 CET
Update
We are continuing to investigate the issue with network quality within Microsoft datacenters in North Europe. A subset of users may experience delayed email delivery. Signatures are added correctly.
Posted Mar 23, 2020 - 17:23 CET
Investigating
We are currently investigating this issue.
Posted Mar 23, 2020 - 17:11 CET
This incident affected: North Europe (Mail flow).