Problem
The system was unavailable for 48 minutes. After restoration, system performance remained degraded for several hours. Customers also experienced issues with background operations, which were delayed or, in some cases, halted entirely.
Action
Mews monitoring tools raised an automatic highest-priority alert and paged the on-call engineer, who began investigating. After several minutes of degraded system performance, Mews services became unresponsive, causing an outage and dropping website traffic to zero. In the absence of any official communication from Microsoft about the outage, identifying the root cause took approximately 35 minutes, during which the problem spread to an increasing number of Mews services and the system remained unavailable.
Once the root cause was identified as the unavailability of our primary database, we initiated a failover to the secondary infrastructure that is in place for such scenarios. The failover brought the system back up, although with significantly degraded performance, because the failover infrastructure runs on a less powerful configuration.
To mitigate the performance issue, we began upgrading the underpowered services to match the primary region configuration. This took over 5 hours, much longer than expected. Meanwhile, many Mews clients who were unaware of the outage and did not know about the Mews status page began contacting Mews Customer Support, overloading the team with tickets.
Causes
1. Azure SQL Database service outage.
2. The alerts we received did not make it clear that the problem was with Azure SQL Database, so engineers initially pursued a false lead, suspecting an issue with a different service in the Mews infrastructure.
3. The official Azure Status page did not show any outage information for a considerable time.
4. The failover infrastructure in the secondary region runs on a less powerful configuration.
5. The failover region infrastructure took more than 5.5 hours to scale up, much longer than previously experienced during a database tier upgrade.
6. We were not alerted that our background processing was stuck.
7. The system is unable to degrade gracefully under excessive load. In hindsight, it is better to drop non-critical workload when the underlying resources cannot keep up.
8. The database failover to the secondary region is a manual process in the Azure Portal.
9. Multiple background operations remained in an "In progress" state and were never picked up for processing.
10. We are not using zone redundancy for the SQL databases.
11. Mews has limited visibility into the performance of the failover infrastructure.
12. Despite regular status page updates, some clients continued to contact Mews Support and visit the company website for more information.
13. The product content team was later involved in crafting the status page message updates. This was an ad-hoc decision, and their role in the incident resolution process is not documented.
14. The status page was updated at regular intervals, but the incident response team had to remember when the next update was due.
15. We are most likely not using the StatusPage service to its full potential.
16. Individual customer success team members joined the internal incident channel to request information.
17. There is no runbook with automation scripts for basic infrastructure operations, such as scaling up services, for when the Azure Portal itself isn't working.
18. The failover process is a series of manual steps.
19. The status page charts display no data when traffic is served from the failover region.
20. The failover region application instance is hard-wired to process jobs with a delay, which makes it an insufficient job processor after a failover.
Solutions
The numbers below refer to the numbered causes above.
2-3.
- Ensure that a database connection error caused by the instance being unavailable is raised as a critical error, not a regular timeout.
- Set up alerts for Azure service health issues, as sketched below.
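A minimal sketch of such an alert using the Azure CLI, assuming an existing action group that pages the on-call engineer; the resource group, subscription, and action group names are placeholders:

```bash
# Subscribe to Azure Service Health events via an activity log alert.
# "oncall-pager" is an assumed action group in the same resource group.
az monitor activity-log alert create \
  --name "azure-service-health" \
  --resource-group "rg-monitoring" \
  --scope "/subscriptions/<subscription-id>" \
  --condition category=ServiceHealth \
  --action-group "oncall-pager" \
  --description "Azure reported a service health issue"
```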
4-5.
- Migrate all our storage services configuration to infrastructure as code.
- Enforce a "Consistent replica tier" policy on the infrastructure as code level for all storage services wherever applicable.
- Extend the failover drill runbook to include a check of the failover service levels, as sketched below.
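A minimal sketch of the service levels check using the Azure CLI; the server, database, and resource group names are placeholders for illustration:

```bash
# Compare the service objective of the primary database and its failover replica.
primary=$(az sql db show \
  --resource-group "rg-primary" --server "sql-primary" --name "mews-db" \
  --query "currentServiceObjectiveName" --output tsv)
secondary=$(az sql db show \
  --resource-group "rg-secondary" --server "sql-secondary" --name "mews-db" \
  --query "currentServiceObjectiveName" --output tsv)

if [ "$primary" != "$secondary" ]; then
  echo "Replica tier mismatch: primary=$primary, secondary=$secondary" >&2
  exit 1
fi
echo "Replica tiers consistent: $primary"
```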
6.
- Tighten the existing "Queue throughput lower than baseline" alert.
- Add a "No queue throughput" alert, as sketched below.
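A sketch of such an alert using the Azure CLI, under the assumption that the queues run on Azure Service Bus; the resource IDs, metric, and thresholds are placeholders that would need to match the actual queueing technology in use:

```bash
# Assumed Service Bus namespace and action group resource IDs (placeholders).
scope="/subscriptions/<subscription-id>/resourceGroups/rg-app/providers/Microsoft.ServiceBus/namespaces/mews-bus"
action="/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/providers/microsoft.insights/actionGroups/oncall-pager"

# Fire when no messages have left the namespace for 15 minutes.
az monitor metrics alert create \
  --name "no-queue-throughput" \
  --resource-group "rg-monitoring" \
  --scopes "$scope" \
  --condition "total OutgoingMessages < 1" \
  --window-size 15m \
  --evaluation-frequency 5m \
  --action "$action" \
  --description "Background queue has stopped processing entirely"
```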
8.
- Automate both planned and forced failover, ensuring the target service levels are consistent with the primary service levels before proceeding; a sketch follows below.
- Create a runbook for a database failover execution, including especially the various effects and considerations for planned vs. forced failover.
- Update the existing runbook on a failover drill with the automated database failover step.
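A minimal sketch of the failover automation using the Azure CLI, assuming the databases are joined in an Azure SQL failover group; all names are placeholders. A planned failover synchronizes fully before switching, while a forced failover (`--allow-data-loss`) does not, and is meant for when the primary region is down:

```bash
#!/usr/bin/env bash
# Sketch of an automated failover for an assumed Azure SQL failover group.
# Usage: ./failover.sh [planned|forced]
set -euo pipefail

mode="${1:-planned}"
rg="rg-secondary"
server="sql-secondary"       # the server that should become the new primary
fg="mews-failover-group"

if [ "$mode" = "planned" ]; then
  # Refuse a planned failover onto an underpowered replica (cause 4). Skipped
  # for forced failover, when the primary's control plane may be unreachable.
  primary=$(az sql db show -g rg-primary -s sql-primary -n mews-db \
    --query "currentServiceObjectiveName" -o tsv)
  secondary=$(az sql db show -g "$rg" -s "$server" -n mews-db \
    --query "currentServiceObjectiveName" -o tsv)
  if [ "$primary" != "$secondary" ]; then
    echo "Service level mismatch ($primary vs $secondary); scale the replica first." >&2
    exit 1
  fi
  # Planned failover: waits for full synchronization, no data loss.
  az sql failover-group set-primary -g "$rg" -s "$server" -n "$fg"
else
  # Forced failover: does not wait for synchronization; recent writes may be lost.
  az sql failover-group set-primary -g "$rg" -s "$server" -n "$fg" --allow-data-loss
fi
```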
9.
- Improve the performance and load resiliency of the background operations timeout handling mechanism.
- Create a guide for handling stuck background operations processing.
10.
- Enable zone redundancy for the SQL databases; a sketch follows below.
- Enforce a "Zone redundancy" policy on the infrastructure as code level.
11.
- Add a custom Grafana analytics dashboard for the failover database replica.
- Enable Azure SQL Analytics for the failover database replica.
- Enforce the required SQL database diagnostic logs on the infrastructure as code level, as sketched below.
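A minimal sketch using the Azure CLI that routes database diagnostic logs to a Log Analytics workspace, which Azure SQL Analytics reads from; the resource IDs and the selected log categories are placeholders:

```bash
# Route diagnostic logs from the failover replica to a Log Analytics workspace.
# Available categories can be listed first with:
#   az monitor diagnostic-settings categories list --resource <database-resource-id>
az monitor diagnostic-settings create \
  --name "sql-diagnostics" \
  --resource "/subscriptions/<subscription-id>/resourceGroups/rg-secondary/providers/Microsoft.Sql/servers/sql-secondary/databases/mews-db" \
  --workspace "/subscriptions/<subscription-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/mews-logs" \
  --logs '[{"category":"Errors","enabled":true},{"category":"QueryStoreRuntimeStatistics","enabled":true}]' \
  --metrics '[{"category":"Basic","enabled":true}]'
```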
12.
- Establish a simple way to announce a system outage, with a link to the status page, on www.mews.com.
- Redirect users to the status page when signing in takes too long and there is an ongoing outage incident.
- Adopt a method to push critical in-app notifications to customers.
13.
- Define the role of the product content team in the incident resolution process.
14.
- Automate reminders about next due status updates.
15.
- Review the current status page setup and the other features the service offers.
16.
- Create a blocker resolution guide for the Customer Success team. Appoint a coordinator from the team.
- Link to the Customer Success team's blocker resolution guide from the message sent to the internal system status channel.
- Extend the blocker resolution guide with a step to consider informing about a critical incident in a company-wide channel.
17.
- Create a runbook for basic infrastructure operations using the Azure CLI or PowerShell, as sketched below.
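A minimal sketch of one such runbook entry, scaling a database up with the Azure CLI when the Azure Portal is unavailable; the names and the target service objective are placeholders:

```bash
# Scale the database to a higher service objective.
az sql db update \
  --resource-group "rg-secondary" \
  --server "sql-secondary" \
  --name "mews-db" \
  --service-objective "P6"

# Watch the operation until it completes; scaling can take a long time (cause 5).
az sql db op list \
  --resource-group "rg-secondary" \
  --server "sql-secondary" \
  --database "mews-db" \
  --output table
```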
18.
- Implement the failover as a script that performs all the required changes (see the sketch under 8).
19.
- Have the charts from the secondary region ready to display on the status page, and have the failover script include a reminder to switch to them.
20.
- Make it possible to change the background operations processing delay without redeploying the code, as sketched below.
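One possible approach, sketched with the Azure CLI under the assumption that the application runs on Azure App Service and reads the delay from an app setting at runtime; the app and setting names are hypothetical:

```bash
# Drop the job processing delay to zero after a failover, without a redeploy.
# BACKGROUND_JOB_DELAY_SECONDS is a hypothetical setting the application
# would need to read at runtime instead of a hard-wired constant.
az webapp config appsettings set \
  --resource-group "rg-secondary" \
  --name "mews-app-failover" \
  --settings BACKGROUND_JOB_DELAY_SECONDS=0
```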