Overview
In February 2026, Mews experienced a 47-minute outage and several shorter periods of degraded performance that affected access to the Mews PMS app. There was no data loss or data corruption: the impact was on availability and speed, not on the correctness of your data.
Even short interruptions are unacceptable, especially at critical times when your teams are checking guests in and out, running reports, and handling payments. In this postmortem we explain, in straightforward terms:
- What happened
- How you were affected
- What caused it
- What we are changing to prevent a repeat
We sincerely apologize for the disruption and are treating the root causes of these incidents with the highest priority to ensure they do not happen again.
What happened
5 February 2026 – Major outage
- Between 06:40 and 07:27 UTC, many users could not log in to Mews. Pages loaded very slowly or showed errors, and key functions were unavailable.
- A shorter period of degraded performance occurred at around 10:00 UTC, affecting a smaller group of properties.
- Behind the scenes, several of the servers that handle Mews PMS traffic became overloaded at the same time as a brief issue on our primary database in the Germany West Central region of Microsoft Azure. Together, these made the app unstable until we restarted the affected servers and increased capacity.
12 February 2026 – Short performance drops
- On 12 February, there were four short windows where Mews PMS was noticeably slower or briefly unavailable for some users.
- These were much shorter than the 5 February outage, but still visible as slow page loads and occasional login problems.
17 February 2026 – Intermittent slowness
- On the morning of 17 February, support received multiple reports from different regions that Mews was slow or not loading at all for short periods.
- In our monitoring, one of the servers handling Mews PMS traffic was clearly struggling and needed to be restarted. After a restart, performance returned to normal.
How you were affected
Across the three events, properties experienced:
- Difficulty logging in to Mews PMS
- Slow pages, sometimes ending in timeouts or errors
- An unresponsive app during the periods of degraded performance
Why it happened (high‑level)
Although the three incidents were triggered by different technical details, the overall pattern was the same.
1. The majority of our backend instances were degraded
- Servers became overloaded by traffic
- A background database task put extra pressure on our main database
- Internal checks meant to verify that a server is healthy did more work than they should and became slow themselves
In each case, part of our system was under more strain than it could handle, and some servers started responding very slowly or failing requests.
2. The “recycling” of instances was not responsive enough
We rely on a combination of:
- Internal health checks inside our application
- The cloud provider’s ability to stop sending traffic to unhealthy instances
In this case, that combination did not work as well as it needs to:
- Our health‑check endpoint was too complex and depended on responses from several other downstream dependencies.
- It also continued to say “I’m OK” often enough that the cloud platform - which looks only at that signal - kept the struggling servers in rotation.
- The platform behaved exactly as designed, but our own health signal and thresholds were not strict enough to protect you from partial failures - hence the degraded performance you experienced.
3. Our monitoring focused too much on averages
Our automatic scaling and some of our dashboards look at averages across all servers. Those averages can look acceptable even when a few individual servers are in serious trouble. This made it harder to see, quickly and clearly, that “one or two machines are having a very bad time and need to be taken out of service now”.
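To illustrate the problem with averages (a hypothetical sketch with made-up numbers, not our actual monitoring code): a fleet-wide average can stay under an alert threshold even while one server is in serious trouble, whereas a per-server check flags the bad machine immediately.

```python
# Hypothetical illustration: a fleet-wide average latency can stay below an
# alert threshold even while one individual server is badly degraded.
ALERT_THRESHOLD_MS = 1000

# 19 healthy servers plus one badly degraded one (illustrative numbers).
latencies_ms = {f"server-{i}": 120 for i in range(1, 20)}
latencies_ms["server-20"] = 9500

average = sum(latencies_ms.values()) / len(latencies_ms)
print(f"fleet average: {average:.0f} ms")  # 589 ms: below the threshold

# Average-based alerting misses the problem entirely:
print("average alert fired:", average > ALERT_THRESHOLD_MS)  # False

# Per-server alerting catches it immediately:
unhealthy = [name for name, ms in latencies_ms.items() if ms > ALERT_THRESHOLD_MS]
print("unhealthy servers:", unhealthy)  # ['server-20']
```

This is the essence of the monitoring change: alerting and scaling decisions should also consider each server individually, not only the aggregate.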
In simple terms: local problems on specific servers and supporting systems exposed weaknesses in how we check health and remove bad servers, and that turned into outages and slowness for you.
What we did during the incidents
During each incident, our teams:
- Declared and managed a formal incident, with clear ownership and regular updates.
- Restarted or removed problematic servers and temporarily increased capacity to stabilize performance.
- Worked with our cloud provider where there were signs of underlying platform issues.
- Collected detailed logs and metrics to reconstruct exactly what happened, confirm that there was no data loss, and identify the design gaps that need to be fixed.
These actions restored service each time, but they also showed that we need deeper changes, not just tactical fixes.
What we are changing
We are now implementing structural improvements in three key areas.
1. Simpler, more trustworthy health checks
We are redesigning the internal “health check” that decides whether a server should receive traffic:
- The new check will be much lighter and will not depend on multiple other systems being fully healthy just to answer “can this server safely handle requests?”.
- It will have strict time limits, so a slow response does not get mistaken for a healthy server.
- When a server is clearly unable to serve traffic, the health check will say so clearly and consistently, so the platform can remove it from rotation quickly.
This reduces the chance that a partially broken server continues to handle your traffic.
2. Faster “recycling” of unhealthy instances and smarter scaling
We are improving how we detect and react to unhealthy servers:
- Looking at each server individually (errors, slow responses, health‑check results), not just at global averages.
- Tuning our automatic scaling rules so we react faster when a subset of servers is overloaded.
- Adjusting how we use the cloud provider’s automatic recovery tools so that they help with diagnostics and healing, without introducing extra instability.
Our goal is that if one server misbehaves, you do not notice it because it is taken out of service and replaced before it affects users.
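The removal logic we are tuning can be sketched like this (illustrative thresholds, not our production values): a server is pulled from rotation after a few consecutive failed health probes, rather than waiting for aggregate metrics to move.

```python
# Hypothetical sketch: remove a server from rotation after N consecutive
# failed health probes (the threshold here is illustrative).
FAILURES_BEFORE_REMOVAL = 3

def should_remove(probe_results: list) -> bool:
    """probe_results: recent health probes, oldest first (True = healthy).
    Returns True once the trailing streak of failures reaches the limit."""
    streak = 0
    for healthy in probe_results:
        streak = 0 if healthy else streak + 1
    return streak >= FAILURES_BEFORE_REMOVAL

print(should_remove([True, True, False, False]))          # only 2 failures: keep
print(should_remove([True, False, False, False]))         # 3 in a row: remove
print(should_remove([False, False, True, False, False]))  # streak reset: keep
```

Tuning the threshold is a trade-off: too low and transient blips trigger unnecessary churn; too high and a failing server keeps receiving user traffic, which is what happened here.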
3. Safer behavior around the database
We are also tightening up how we work with the database:
- For the 5 February event, we are working with our cloud provider on the formal root cause analysis for the brief database issue in our primary region, and we are improving our monitoring of that database so we have earlier and clearer warning signals.
- For the 12 February event, we have changed an internal maintenance job so that it no longer makes heavy changes during busy hours. Future adjustments of that type will follow a controlled, manual process.
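The principle behind the maintenance-job change can be sketched as follows (the quiet-window times are hypothetical, not our actual schedule): heavy work only runs inside a defined low-traffic window and is deferred otherwise.

```python
# Hypothetical sketch: heavy maintenance only runs inside a quiet window.
from datetime import time

# Illustrative low-traffic window (UTC); the real schedule is an
# operational detail and is not shown here.
QUIET_START = time(1, 0)
QUIET_END = time(4, 0)

def in_quiet_window(now: time) -> bool:
    return QUIET_START <= now < QUIET_END

def maybe_run_maintenance(now: time) -> str:
    if in_quiet_window(now):
        return "run heavy maintenance"
    return "defer until quiet window"

print(maybe_run_maintenance(time(2, 30)))   # inside the window
print(maybe_run_maintenance(time(10, 0)))   # busy hours: deferred
```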
Looking ahead
These incidents underline how important predictable performance and availability are to your operations. Our focus is on:
- Reducing the chance that local technical issues ever become visible to you
- Limiting the impact and duration if something does go wrong
- Giving ourselves better visibility and clearer signals so we can act quickly and confidently