[Australia] Email processing delays

Incident Report for CodeTwo

Postmortem

Below is the root cause analysis (RCA) from the Post Incident Report provided by Microsoft for this incident (the full report is available in the Microsoft 365 admin center, ID: MO805755). In short, a change in a routing policy led to a configuration issue within Microsoft’s network routing infrastructure, which impacted multiple Microsoft 365 services in the Asia-Pacific and Australia region. The incorrect routing policy that caused the outage was rolled back on June 27, 2024, at 2:02 AM UTC. All CodeTwo services remained healthy during the incident, and all delayed emails have been delivered with signatures.

RCA FROM MICROSOFT’S POST INCIDENT REPORT (Microsoft 365 admin center Issue ID: MO805755):

Scope of Impact

This issue could have impacted users globally; however, it was mostly experienced by users hosted within Australia and Asia-Pacific due to the timeframe of impact overlapping with core business hours in those regions.

Incident Start Date and Time

Thursday, June 27, 2024, at 1:18 AM UTC

Incident End Date and Time

Thursday, June 27, 2024, at 12:30 PM UTC

Root Cause

We’ve determined that a recent change caused a configuration issue within our network routing infrastructure, causing impact to multiple Microsoft 365 services.

Specifically, in preparation for a planned network upgrade project, a change was made to our automation procedures supporting the upgrade. This change caused the automation to generate an incorrect routing policy that was not captured by our safety test systems in pre-checks for the project. Our WAN consists of two different planes for redundancy. When this incorrect routing policy was applied in the production network, a very large volume of traffic that is usually routed over plane one was routed over to plane two, and then sent back over plane one to reach its destination. This not only induced a very large latency increase, but it also caused congestion on both planes. The incorrect routing policy that caused the outage was rolled back on June 27, 2024, at 2:02 AM UTC.

Due to the severity of the incident, several Microsoft 365 services experienced sustained impact and required further intervention to reach full recovery, which was attained on June 27, 2024, at 12:30 PM UTC.
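For readers who want the intervals behind these timestamps, the following is a minimal sketch that derives them from the three times stated in this report (incident start, policy rollback, full recovery). The variable names and the `fmt` helper are illustrative assumptions and do not come from Microsoft's report.

```python
from datetime import datetime, timedelta, timezone

# Timestamps as stated in the report (all UTC). Names are illustrative.
incident_start = datetime(2024, 6, 27, 1, 18, tzinfo=timezone.utc)
policy_rollback = datetime(2024, 6, 27, 2, 2, tzinfo=timezone.utc)
full_recovery = datetime(2024, 6, 27, 12, 30, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Time the incorrect routing policy was active in production.
time_to_rollback = policy_rollback - incident_start
# Time services needed to fully recover after the rollback.
recovery_tail = full_recovery - policy_rollback
# Overall incident duration as reported.
total_duration = full_recovery - incident_start

print("Start to rollback:   ", fmt(time_to_rollback))
print("Rollback to recovery:", fmt(recovery_tail))
print("Total duration:      ", fmt(total_duration))
```

Under these timestamps, most of the incident window falls after the rollback, which matches the report's statement that several services required further intervention to reach full recovery.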

Extended impact for Exchange Online

A component used by the Exchange Online frontend proxy service became stuck due to the network conditions, preventing it from recovering once the network conditions were restored. Subsequently, manual recovery interventions were required due to code architecture patterns and the temporary exhaustion of automated recovery actions during the initial impact.

Posted Jul 01, 2024 - 16:07 UTC

Resolved

This incident has been resolved. Delayed emails have been delivered with signatures. For more information about the incident with Microsoft 365, please refer to the Microsoft 365 Admin Center (Incident ID: MO805755).

All CodeTwo services are operational. Emails are delivered without any delays and signatures are added as normal.

Posted Jun 27, 2024 - 03:15 UTC

Monitoring

We can see the situation has improved significantly. The queues are almost gone. Microsoft has just reported: “We determined that a recent change within Azure networking infrastructure led to impact. We reverted this change and we're monitoring our telemetry to ensure that affected services recover as expected.”

We will continue to actively monitor the situation.

Posted Jun 27, 2024 - 02:56 UTC

Update

Microsoft has just published a status update about a major Microsoft 365 outage on X: https://x.com/msft365status/status/1806149130663649355?s=46 and in the Microsoft 365 Admin Center (Issue ID: MO805755).

This is a Microsoft issue. Please keep monitoring the communication from Microsoft to stay up to date with the mitigation steps during this outage.

Posted Jun 27, 2024 - 02:28 UTC

Identified

It looks like the performance of Microsoft Exchange Online Protection is degraded at the moment, which means it is unable to process messages sent from CodeTwo and other vendors in a timely manner. We have notified Microsoft Support about the problem. However, we are seeing signs of recovery, which suggests the problem should be mitigated soon.

Posted Jun 27, 2024 - 02:17 UTC

Investigating

We are currently investigating email processing issues in Australia.
A subset of users may experience delayed email delivery. Email signatures are added normally.
The next update will be provided in 30 minutes or as events warrant.

Posted Jun 27, 2024 - 02:09 UTC

This incident affected: Australia East (Mail flow).