Problem
The system was unavailable for 48 minutes. After restoration, system performance remained degraded for several hours. Customers also experienced issues with background operations, which were delayed or, in some cases, halted entirely.
Action
Mews monitoring tools raised an automatic highest-priority alert and paged the on-call engineer, who began investigating. After several minutes of degraded system performance, Mews services became unresponsive, causing an outage and dropping website traffic to zero. In the absence of any official communication from Microsoft about the outage, identifying the root cause took approximately 35 minutes, during which the problem spread to an increasing number of Mews services and the system remained unavailable.
Once the root cause was identified as the unavailability of our primary database, we initiated a failover to the secondary infrastructure that is in place for such scenarios. The failover brought the system back up, although with significantly degraded performance, because the failover infrastructure runs on a less powerful configuration.
To mitigate the performance issue, we began upgrading the underpowered services to match the primary region configuration. This took over 5 hours, much longer than expected. Meanwhile, many Mews clients who were unaware of the outage and did not know about the Mews status page began contacting Mews Customer Support, overloading the team with tickets.
Causes
1. Azure SQL Database service outage.
2. The alerts we received did not make it clear that the problem was with Azure SQL Database, so engineers initially pursued a false lead, suspecting an issue with a different service in the Mews infrastructure.
3. The official Azure Status page did not show any outage information for a considerable time.
4. The failover infrastructure in the secondary region runs on a less powerful configuration.
5. The failover region infrastructure took more than 5.5 hours to scale up, much longer than previously experienced during a database tier upgrade.
6. We were not alerted that our background processing was stuck.
7. The system is unable to degrade gracefully under excessive load. In hindsight, it is better to drop non-critical workload when the underlying resources cannot keep up.
8. The database failover to the secondary region is a manual process in the Azure Portal.
9. Multiple background operations remained in an "In progress" state and were never picked up for processing.
10. We are not using zone redundancy for the SQL databases.
11. Mews has limited visibility into the performance of the failover infrastructure.
12. Despite regular status page updates, some clients continued to contact Mews Support and visit the company website for more information.
13. The product content team was later involved in crafting the status page message updates. This was an ad-hoc decision, and their role in the incident resolution process is not documented.
14. The status page was updated at regular intervals, but the incident response team had to remember when the next update was due.
15. We are most likely not using the StatusPage service to its full potential.
16. Individual customer success team members joined the internal incident channel to request information.
17. There is no runbook with automation scripts for basic infrastructure operations, such as scaling up services, for when the Azure Portal itself isn't working.
18. The failover process is a series of manual steps.
19. The status page charts display no data when traffic is served from the failover region.
20. The failover region application instance is hard-wired to process jobs with a delay, which makes it an insufficient job processor after a failover.
Solutions
The numbers below refer to the numbered causes above.
2-3.
- Ensure that a database connection error caused by the instance being unavailable is raised as a critical error, not a regular timeout.
- Set up alerts for Azure service health issues, as sketched below.
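A minimal sketch of such an alert using the Azure CLI, assuming an existing action group that pages the on-call engineer; the resource group, subscription, and action group names are placeholders:

```bash
# Subscribe to Azure Service Health events via an activity log alert.
# "oncall-pager" is an assumed action group in the same resource group.
az monitor activity-log alert create \
  --name "azure-service-health" \
  --resource-group "rg-monitoring" \
  --scope "/subscriptions/<subscription-id>" \
  --condition category=ServiceHealth \
  --action-group "oncall-pager" \
  --description "Azure reported a service health issue"
```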
4-5.
- Migrate all our storage services configuration to infrastructure as code.
- Enforce a "Consistent replica tier" policy on the infrastructure as code level for all storage services wherever applicable.
- Extend the failover drill runbook to include a check of the failover service levels, as sketched below.
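A minimal sketch of the service levels check using the Azure CLI; the server, database, and resource group names are placeholders for illustration:

```bash
# Compare the service objective of the primary database and its failover replica.
primary=$(az sql db show \
  --resource-group "rg-primary" --server "sql-primary" --name "mews-db" \
  --query "currentServiceObjectiveName" --output tsv)
secondary=$(az sql db show \
  --resource-group "rg-secondary" --server "sql-secondary" --name "mews-db" \
  --query "currentServiceObjectiveName" --output tsv)

if [ "$primary" != "$secondary" ]; then
  echo "Replica tier mismatch: primary=$primary, secondary=$secondary" >&2
  exit 1
fi
echo "Replica tiers consistent: $primary"
```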
6.
- Tighten the existing "Queue throughput lower than baseline" alert.
- Add a "No queue throughput" alert, as sketched below.
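A sketch of such an alert using the Azure CLI, under the assumption that the queues run on Azure Service Bus; the resource IDs, metric, and thresholds are placeholders that would need to match the actual queueing technology in use:

```bash
# Assumed Service Bus namespace and action group resource IDs (placeholders).
scope="/subscriptions/<subscription-id>/resourceGroups/rg-app/providers/Microsoft.ServiceBus/namespaces/mews-bus"
action="/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/providers/microsoft.insights/actionGroups/oncall-pager"

# Fire when no messages have left the namespace for 15 minutes.
az monitor metrics alert create \
  --name "no-queue-throughput" \
  --resource-group "rg-monitoring" \
  --scopes "$scope" \
  --condition "total OutgoingMessages < 1" \
  --window-size 15m \
  --evaluation-frequency 5m \
  --action "$action" \
  --description "Background queue has stopped processing entirely"
```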
8.
- Automate both planned and forced failover, ensuring the target service levels are consistent with the primary service levels before proceeding; a sketch follows below.
- Create a runbook for a database failover execution, including especially the various effects and considerations for planned vs. forced failover.
- Update the existing runbook on a failover drill with the automated database failover step.
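A minimal sketch of the failover automation using the Azure CLI, assuming the databases are joined in an Azure SQL failover group; all names are placeholders. A planned failover synchronizes fully before switching, while a forced failover (`--allow-data-loss`) does not, and is meant for when the primary region is down:

```bash
#!/usr/bin/env bash
# Sketch of an automated failover for an assumed Azure SQL failover group.
# Usage: ./failover.sh [planned|forced]
set -euo pipefail

mode="${1:-planned}"
rg="rg-secondary"
server="sql-secondary"       # the server that should become the new primary
fg="mews-failover-group"

if [ "$mode" = "planned" ]; then
  # Refuse a planned failover onto an underpowered replica (cause 4). Skipped
  # for forced failover, when the primary's control plane may be unreachable.
  primary=$(az sql db show -g rg-primary -s sql-primary -n mews-db \
    --query "currentServiceObjectiveName" -o tsv)
  secondary=$(az sql db show -g "$rg" -s "$server" -n mews-db \
    --query "currentServiceObjectiveName" -o tsv)
  if [ "$primary" != "$secondary" ]; then
    echo "Service level mismatch ($primary vs $secondary); scale the replica first." >&2
    exit 1
  fi
  # Planned failover: waits for full synchronization, no data loss.
  az sql failover-group set-primary -g "$rg" -s "$server" -n "$fg"
else
  # Forced failover: does not wait for synchronization; recent writes may be lost.
  az sql failover-group set-primary -g "$rg" -s "$server" -n "$fg" --allow-data-loss
fi
```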
9.
- Improve the performance and load resiliency of the background operations timeout handling mechanism.
- Create a guide for handling stuck background operations processing.
10.
- Enable zone redundancy for the SQL databases; a sketch follows below.
- Enforce a "Zone redundancy" policy on the infrastructure as code level.
11.
- Add a custom Grafana analytics dashboard for the failover database replica.
- Enable Azure SQL Analytics for the failover database replica.
- Enforce the required SQL database diagnostic logs on the infrastructure as code level, as sketched below.
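A minimal sketch using the Azure CLI that routes database diagnostic logs to a Log Analytics workspace, which Azure SQL Analytics reads from; the resource IDs and the selected log categories are placeholders:

```bash
# Route diagnostic logs from the failover replica to a Log Analytics workspace.
# Available categories can be listed first with:
#   az monitor diagnostic-settings categories list --resource <database-resource-id>
az monitor diagnostic-settings create \
  --name "sql-diagnostics" \
  --resource "/subscriptions/<subscription-id>/resourceGroups/rg-secondary/providers/Microsoft.Sql/servers/sql-secondary/databases/mews-db" \
  --workspace "/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/mews-logs" \
  --logs '[{"category":"Errors","enabled":true},{"category":"QueryStoreRuntimeStatistics","enabled":true}]' \
  --metrics '[{"category":"Basic","enabled":true}]'
```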
12.
- Establish a simple way to announce a system outage, with a link to the status page, on www.mews.com.
- Redirect users to the status page when signing in takes too long and there is an ongoing outage incident.
- Adopt a method to push critical in-app notifications to customers.
13.
- Define the role of the product content team in the incident resolution process.
14.
- Automate reminders about next due status updates.
15.
- Review the current status page setup and the other features the service offers.
16.
- Create a blocker resolution guide for the Customer Success team. Appoint a coordinator from the team.
- Link to the Customer Success team's blocker resolution guide from the message sent to the internal system status channel.
- Extend the blocker resolution guide with a step to consider informing about a critical incident in a company-wide channel.
17.
- Create a runbook for basic infrastructure operations using the Azure CLI or PowerShell, as sketched below.
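A minimal sketch of one such runbook entry, scaling a database up with the Azure CLI when the Azure Portal is unavailable; the names and the target service objective are placeholders:

```bash
# Scale the database to a higher service objective.
az sql db update \
  --resource-group "rg-secondary" \
  --server "sql-secondary" \
  --name "mews-db" \
  --service-objective "P6"

# Watch the operation until it completes; scaling can take a long time (cause 5).
az sql db op list \
  --resource-group "rg-secondary" \
  --server "sql-secondary" \
  --database "mews-db" \
  --output table
```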
18.
- Implement the failover as a script that performs all the required changes (see the sketch under 8).
19.
- Have the charts from the secondary region ready to display on the status page, and have the failover script include a reminder to switch to them.
20.
- Make it possible to change the background operations processing delay without redeploying the code, as sketched below.
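One possible approach, sketched with the Azure CLI under the assumption that the application runs on Azure App Service and reads the delay from an app setting at runtime; the app and setting names are hypothetical:

```bash
# Drop the job processing delay to zero after a failover, without a redeploy.
# BACKGROUND_JOB_DELAY_SECONDS is a hypothetical setting the application
# would need to read at runtime instead of a hard-wired constant.
az webapp config appsettings set \
  --resource-group "rg-secondary" \
  --name "mews-app-failover" \
  --settings BACKGROUND_JOB_DELAY_SECONDS=0
```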