Below is the root cause analysis (RCA) from the Post Incident Report that Microsoft provided for this incident (the full report is available in the Microsoft 365 admin center, ID: MO805755). In short, a change to a routing policy led to a configuration issue within Microsoft’s network routing infrastructure, which impacted multiple Microsoft 365 services, mainly in the Asia-Pacific and Australia regions. The incorrect routing policy that caused the outage was rolled back on June 27, 2024, at 2:02 AM UTC. All CodeTwo services remained healthy during the incident, and all delayed emails have been delivered with signatures.
RCA FROM MICROSOFT’S POST INCIDENT REPORT (Microsoft 365 admin center Issue ID: MO805755):
Scope of Impact
This issue could have impacted users globally; however, it was mostly experienced by users hosted within Australia and Asia-Pacific, because the timeframe of impact overlapped with core business hours in those regions.
Incident Start Date and Time
Thursday, June 27, 2024, at 1:18 AM UTC
Incident End Date and Time
Thursday, June 27, 2024, at 12:30 PM UTC
Root Cause
We’ve determined that a recent change caused a configuration issue within our network routing infrastructure, causing impact to multiple Microsoft 365 services.
Specifically, in preparation for a planned network upgrade project, a change was made to our automation procedures supporting the upgrade. This change caused the automation to generate an incorrect routing policy that was not captured by our safety test systems in pre-checks for the project. Our WAN consists of two different planes for redundancy. When this incorrect routing policy was applied in the production network, a very large volume of traffic that is usually routed over plane one was routed over to plane two, and then sent back over plane one to reach its destination. This not only induced a very large latency increase, but it also caused congestion on both planes. The incorrect routing policy that caused the outage was rolled back on June 27, 2024, at 2:02 AM UTC.
Due to the severity of the incident, several Microsoft 365 services experienced sustained impact and required further intervention to reach full recovery, which was attained on June 27, 2024, at 12:30 PM UTC.
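To illustrate why the detour described in the root cause raises latency and congests both planes at once, here is a minimal Python sketch. It is not part of Microsoft's report: the capacities, baseline loads, flow size, per-segment latencies, and all names are hypothetical values chosen for illustration, and the model simplifies each plane to a single aggregate capacity.

```python
# Illustrative model only. Every number and name below is an assumption,
# not data from Microsoft's Post Incident Report.

PLANE_CAPACITY_GBPS = {1: 100.0, 2: 100.0}   # assumed capacity per WAN plane
BASELINE_LOAD_GBPS = {1: 0.0, 2: 60.0}       # assumed unrelated traffic per plane
FLOW_GBPS = 70.0                             # assumed size of the affected flow

# A path is a list of (plane, latency_ms) segments.
normal_path = [(1, 20.0)]                          # stays on plane one
hairpin_path = [(1, 10.0), (2, 25.0), (1, 20.0)]   # plane one -> two -> one


def path_latency_ms(path):
    """Total latency is the sum of the latencies of all segments."""
    return sum(latency for _, latency in path)


def plane_utilization(path):
    """Return load/capacity per plane when the flow follows `path`.

    Simplification: the flow is added to a plane's load once for each
    segment it traverses on that plane.
    """
    load = dict(BASELINE_LOAD_GBPS)
    for plane, _ in path:
        load[plane] += FLOW_GBPS
    return {plane: load[plane] / PLANE_CAPACITY_GBPS[plane] for plane in load}


if __name__ == "__main__":
    for name, path in (("normal", normal_path), ("hairpin", hairpin_path)):
        print(name, path_latency_ms(path), plane_utilization(path))
```

Under these assumed numbers, the normal path keeps both planes below capacity, while the hairpin path adds latency from the extra segments and pushes both planes above a utilization of 1.0, which corresponds to the congestion on both planes that the report describes.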
Extended impact for Exchange Online
A component used by the Exchange Online frontend proxy service became stuck due to the network conditions, preventing it from recovering once the network conditions were restored. Subsequently, manual recovery interventions were required due to code architecture patterns and the temporary exhaustion of automated recovery actions during the initial impact.