On 2022-11-10 at 22:15 UTC, the Mews application started displaying the message "Oops, something went wrong." and failed to load. The system was unavailable for 20 minutes.
After identifying the unavailability of our primary database as the root cause, we initiated a failover to the secondary infrastructure, which is in place for such scenarios. The failover restored the system at 22:35 UTC, with slightly higher latency caused by the extra network traffic to the secondary region.
The following morning, 2022-11-11 at 09:13 UTC, we failed back to the primary region. We soon noticed an elevated error rate caused by a stale database connection in part of the system, which delayed some of the background processing. We restarted the affected application, and the delayed items were reprocessed.
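To illustrate the stale-connection failure mode described above, here is a minimal Python sketch of a holder that checks a connection's liveness before each use and reconnects when the check fails. It is an illustration under stated assumptions, not a description of the Mews codebase: `ValidatingConnectionHolder`, `FakeConnection`, and the `ping` method are hypothetical names, and the fake connection merely simulates a connection that stays bound to a region that is no longer active.

```python
from typing import Callable, Protocol


class Connection(Protocol):
    """Hypothetical minimal interface: a connection that can report liveness."""

    def ping(self) -> bool: ...


class ValidatingConnectionHolder:
    """Holds one connection and replaces it when a liveness check fails."""

    def __init__(self, connect: Callable[[], Connection]) -> None:
        self._connect = connect
        self._conn = connect()
        self.reconnects = 0

    def get(self) -> Connection:
        # Check liveness before handing the connection out; a failed or
        # raising ping is treated as a stale connection.
        try:
            alive = self._conn.ping()
        except Exception:
            alive = False
        if not alive:
            self._conn = self._connect()
            self.reconnects += 1
        return self._conn


class FakeConnection:
    """Stand-in for a real driver connection, bound to one endpoint (assumption)."""

    def __init__(self, endpoint: str, live_endpoint: Callable[[], str]) -> None:
        self.endpoint = endpoint
        self._live_endpoint = live_endpoint

    def ping(self) -> bool:
        # Usable only while its endpoint is still the active one.
        return self.endpoint == self._live_endpoint()


# Simulate a failback: the active region changes while a connection is held.
active = {"name": "secondary"}
holder = ValidatingConnectionHolder(
    lambda: FakeConnection(active["name"], lambda: active["name"])
)
before = holder.get().endpoint
active["name"] = "primary"
after = holder.get().endpoint
```

With a check like this, a connection left over from before a region switch would be replaced on its next use rather than only after an application restart. The trade-off is one extra round trip per checkout.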
The root cause was an Azure SQL Database service outage.
We will further improve the monitoring of our platform to detect such issues faster. We will also introduce further automation to speed up failover to the secondary region.