Overview
In February 2026, Mews experienced a 47-minute outage and several shorter periods of degraded performance that affected access to the Mews PMS app. There was no data loss or data corruption: the impact was on availability and speed, not on the correctness of your data.
Even short interruptions are unacceptable, especially at critical times when your teams are checking guests in and out, running reports, and handling payments. In this postmortem we explain, in straightforward terms:
- What happened
- How you were affected
- What caused it
- What we are changing to prevent a repeat
We sincerely apologize for the disruption and are treating the root causes of these incidents with the highest priority to ensure they do not happen again.
What happened
5 February 2026 – Major outage
- Between 06:40 and 07:27 UTC, many users could not log in to Mews. Pages loaded very slowly or showed errors, and key functions were unavailable.
- A shorter period of degraded performance occurred at around 10:00 UTC, affecting a smaller group of properties.
- Behind the scenes, several of the servers that handle Mews PMS traffic became overloaded at the same time as a brief issue on our primary database in the Germany West Central region of Microsoft Azure. Together, these made the app unstable until we restarted the affected servers and increased capacity.
12 February 2026 – Short performance drops
- On 12 February, there were four short windows where Mews PMS was noticeably slower or briefly unavailable for some users.
- These were much shorter than the 5 February outage, but still visible as slow page loads and occasional login problems.
17 February 2026 – Intermittent slowness
- On the morning of 17 February, support received multiple reports from different regions that Mews was slow or not loading at all for short periods.
- In our monitoring, one of the servers handling Mews PMS traffic was clearly struggling and needed to be restarted. After a restart, performance returned to normal.
How you were affected
Across the three events, properties experienced:
- Difficulty logging in to Mews PMS
- Slow pages, sometimes ending in timeouts or errors
- An unresponsive app during the periods of degraded performance
Why it happened (high‑level)
Although the three incidents were triggered by different technical details, the overall pattern was the same.
1. The majority of our backend instances were degraded
- Servers became overloaded by traffic
- A background database task put extra pressure on our main database
- Internal checks meant to verify that a server is healthy did more work than they should and became slow themselves
In each case, part of our system was under more strain than it could handle, and some servers started responding very slowly or failing requests.
2. The “recycling” of instances was not responsive enough
We rely on a combination of:
- Internal health checks inside our application
- The cloud provider’s ability to stop sending traffic to unhealthy instances
In this case, that combination did not work as well as it needs to:
- Our health‑check endpoint was too complex and depended on responses from several other downstream dependencies.
- It also continued to say “I’m OK” often enough that the cloud platform - which looks only at that signal - kept the struggling servers in rotation.
- The platform behaved exactly as designed, but our own health signal and thresholds were not strict enough to protect you from partial failures - hence the degraded performance you experienced.
3. Our monitoring focused too much on averages
Our automatic scaling and some of our dashboards look at averages across all servers. Those averages can look acceptable even when a few individual servers are in serious trouble. This made it harder to see, quickly and clearly, that “one or two machines are having a very bad time and need to be taken out of service now”.
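To illustrate the problem with averages (a hypothetical sketch with made-up numbers, not our actual monitoring code): a fleet-wide average can stay under an alert threshold even while one server is in serious trouble, whereas a per-server check flags the bad machine immediately.

```python
# Hypothetical illustration: a fleet-wide average latency can stay below an
# alert threshold even while one individual server is badly degraded.
ALERT_THRESHOLD_MS = 1000

# 19 healthy servers plus one badly degraded one (illustrative numbers).
latencies_ms = {f"server-{i}": 120 for i in range(1, 20)}
latencies_ms["server-20"] = 9500

average = sum(latencies_ms.values()) / len(latencies_ms)
print(f"fleet average: {average:.0f} ms")  # 589 ms: below the threshold

# Average-based alerting misses the problem entirely:
print("average alert fired:", average > ALERT_THRESHOLD_MS)  # False

# Per-server alerting catches it immediately:
unhealthy = [name for name, ms in latencies_ms.items() if ms > ALERT_THRESHOLD_MS]
print("unhealthy servers:", unhealthy)  # ['server-20']
```

This is the essence of the monitoring change: alerting and scaling decisions should also consider each server individually, not only the aggregate.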
In simple terms: local problems on specific servers and supporting systems exposed weaknesses in how we check health and remove bad servers, and that turned into outages and slowness for you.
What we did during the incidents
During each incident, our teams:
- Declared and managed a formal incident, with clear ownership and regular updates.
- Restarted or removed problematic servers and temporarily increased capacity to stabilize performance.
- Worked with our cloud provider where there were signs of underlying platform issues.
- Collected detailed logs and metrics to reconstruct exactly what happened, confirm that there was no data loss, and identify the design gaps that need to be fixed.
These actions restored service each time, but they also showed that we need deeper changes, not just tactical fixes.
What we are changing
We are now implementing structural improvements in three key areas.
1. Simpler, more trustworthy health checks
We are redesigning the internal “health check” that decides whether a server should receive traffic:
- The new check will be much lighter and will not depend on multiple other systems being fully healthy just to answer “can this server safely handle requests?”.
- It will have strict time limits, so a slow response does not get mistaken for a healthy server.
- When a server is clearly unable to serve traffic, the health check will say so clearly and consistently, so the platform can remove it from rotation quickly.
This reduces the chance that a partially broken server continues to handle your traffic.
2. Faster “recycling” of unhealthy instances and smarter scaling
We are improving how we detect and react to unhealthy servers:
- Looking at each server individually (errors, slow responses, health‑check results), not just at global averages.
- Tuning our automatic scaling rules so we react faster when a subset of servers is overloaded.
- Adjusting how we use the cloud provider’s automatic recovery tools so that they help with diagnostics and healing, without introducing extra instability.
Our goal is that if one server misbehaves, you do not notice it because it is taken out of service and replaced before it affects users.
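The removal logic we are tuning can be sketched like this (illustrative thresholds, not our production values): a server is pulled from rotation after a few consecutive failed health probes, rather than waiting for aggregate metrics to move.

```python
# Hypothetical sketch: remove a server from rotation after N consecutive
# failed health probes (the threshold here is illustrative).
FAILURES_BEFORE_REMOVAL = 3

def should_remove(probe_results: list) -> bool:
    """probe_results: recent health probes, oldest first (True = healthy).
    Returns True once the trailing streak of failures reaches the limit."""
    streak = 0
    for healthy in probe_results:
        streak = 0 if healthy else streak + 1
    return streak >= FAILURES_BEFORE_REMOVAL

print(should_remove([True, True, False, False]))          # only 2 failures: keep
print(should_remove([True, False, False, False]))         # 3 in a row: remove
print(should_remove([False, False, True, False, False]))  # streak reset: keep
```

Tuning the threshold is a trade-off: too low and transient blips trigger unnecessary churn; too high and a failing server keeps receiving user traffic, which is what happened here.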
3. Safer behavior around the database
We are also tightening up how we work with the database:
- For the 5 February event, we are working with our cloud provider on the formal root cause analysis for the brief database issue in our primary region, and we are improving our monitoring of that database so we have earlier and clearer warning signals.
- For the 12 February event, we have changed an internal maintenance job so that it no longer makes heavy changes during busy hours. Future adjustments of that type will follow a controlled, manual process.
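The principle behind the maintenance-job change can be sketched as follows (the quiet-window times are hypothetical, not our actual schedule): heavy work only runs inside a defined low-traffic window and is deferred otherwise.

```python
# Hypothetical sketch: heavy maintenance only runs inside a quiet window.
from datetime import time

# Illustrative low-traffic window (UTC); the real schedule is an
# operational detail and is not shown here.
QUIET_START = time(1, 0)
QUIET_END = time(4, 0)

def in_quiet_window(now: time) -> bool:
    return QUIET_START <= now < QUIET_END

def maybe_run_maintenance(now: time) -> str:
    if in_quiet_window(now):
        return "run heavy maintenance"
    return "defer until quiet window"

print(maybe_run_maintenance(time(2, 30)))   # inside the window
print(maybe_run_maintenance(time(10, 0)))   # busy hours: deferred
```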
Looking ahead
These incidents underline how important predictable performance and availability are to your operations. Our focus is on:
- Reducing the chance that local technical issues ever become visible to you
- Limiting the impact and duration if something does go wrong
- Giving ourselves better visibility and clearer signals so we can act quickly and confidently