Unplanned Outage - We have identified and resolved the issue
Incident Report for Mews
Postmortem

Incident Overview:

Nature of the Incident: On October 30, 2024, at 18:03:23 CET, a misconfigured ruleset on our Web Application Firewall (WAF) made app.mews.com unreachable for all users. Our team detected the issue within 2 minutes and turned the ruleset off at 18:07:35 CET, restoring access to the application. A significant DDoS attack then hit the platform at 18:13:10 CET, causing further issues and increased latency for customers until 18:18:03 CET.

Start Time and Duration:

  • Incident Start: 18:03:23 CET on October 30, 2024
  • Incident Remediated: 18:07:35 CET
  • DDoS Attack Started: 18:13:10 CET
  • Latency Remediated: 18:18:03 CET

Affected Services and Impacted Users: The primary impact was on app.mews.com, which was unavailable for approximately 4 minutes (18:03:23 to 18:07:35 CET). Customers could not access the application during this time. Degraded performance followed, with increased latency during the DDoS attack until 18:18:03 CET, roughly 15 minutes after the incident began.
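For reference, the durations can be checked directly from the timestamps listed under Start Time and Duration. The short Python sketch below is only an illustration of that arithmetic; the variable names are ours and do not come from any internal tooling.

```python
from datetime import datetime

# Timestamps from the incident timeline (all CET, October 30, 2024).
FMT = "%Y-%m-%d %H:%M:%S"
incident_start = datetime.strptime("2024-10-30 18:03:23", FMT)
access_restored = datetime.strptime("2024-10-30 18:07:35", FMT)
ddos_start = datetime.strptime("2024-10-30 18:13:10", FMT)
latency_remediated = datetime.strptime("2024-10-30 18:18:03", FMT)

# Each subtraction yields a timedelta.
downtime = access_restored - incident_start          # WAF outage window
ddos_latency = latency_remediated - ddos_start       # latency during the attack
total_impact = latency_remediated - incident_start   # first failure to full recovery

print("Downtime:", downtime)
print("DDoS-related latency:", ddos_latency)
print("Total impact window:", total_impact)
```

This gives a downtime of 4 minutes 12 seconds, just under 5 minutes of DDoS-related latency, and a total impact window of 14 minutes 40 seconds.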

Root Cause: The initial outage was caused by the misconfigured WAF ruleset. Frequently changed parts of the rule, such as the list of malicious IPs, were maintained without automation, which increased the risk of errors. Additionally, the platform did not cope well with the sudden traffic spike from the incoming DDoS attack, which caused further disruptions.
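To illustrate the kind of check that automation could run before a malicious-IP list reaches a WAF rule, here is a minimal Python sketch. It is an assumption-laden example, not a description of our tooling: this report does not detail the exact misconfiguration, and the function name `validate_blocklist` and the prefix limits are hypothetical.

```python
import ipaddress

# Hypothetical limits: reject ranges wider than these, because a very broad
# range in a block rule could also match legitimate users.
MIN_PREFIX_V4 = 16
MIN_PREFIX_V6 = 32


def validate_blocklist(entries):
    """Return (networks, errors) for a list of IP or CIDR strings."""
    networks, errors = [], []
    for raw in entries:
        try:
            # strict=False accepts host addresses such as 203.0.113.7
            # and treats them as single-address networks.
            net = ipaddress.ip_network(raw.strip(), strict=False)
        except ValueError:
            errors.append(f"{raw!r}: not a valid IP address or CIDR range")
            continue
        min_prefix = MIN_PREFIX_V4 if net.version == 4 else MIN_PREFIX_V6
        if net.prefixlen < min_prefix:
            errors.append(f"{raw!r}: range too broad (/{net.prefixlen})")
            continue
        networks.append(net)
    return networks, errors


# Example input using documentation-reserved addresses.
candidates = ["203.0.113.7", "198.51.100.0/24", "0.0.0.0/0", "not-an-ip"]
accepted, rejected = validate_blocklist(candidates)
print("accepted:", [str(n) for n in accepted])
print("rejected:", rejected)
```

A check like this would typically run in a deployment pipeline, so that a list containing any rejected entry is never applied.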

Resolution:

  • Actions Taken: The misconfigured ruleset was turned off at 18:07:35 CET, restoring access to the application. The platform experienced a DDoS attack starting at 18:13:10 CET, which was managed through scaling and DDoS protection mechanisms.
  • Resolution Timeline: Full service functionality was restored shortly after the misconfigured ruleset was turned off. Monitoring and additional instance scaling were kept in place to mitigate further impact until the traffic spike subsided.
  • Temporary Solutions: Increasing the instance count served as a temporary stabilization measure, ensuring resilience during the traffic spike.

Preventative Measures:

  • Alert System Review: Alert configurations will be enhanced to trigger incidents specifically during traffic surges indicative of DDoS patterns.
  • Implement a Robust Approval Process: We aim to establish a stringent approval process for any changes to the WAF rules. This process will include multiple levels of review and approval to ensure that any changes are thoroughly vetted before implementation.
  • Automate Configuration Management: Automation tools will be used to manage WAF configurations. This will help reduce the risk of human error and ensure that changes are applied consistently and correctly.
  • Customer Communication Protocol: Clear procedures are being developed for updating the status page and communicating with customers in real time.
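As a rough illustration of the alerting idea in the first measure above, the Python sketch below flags a traffic sample as a surge when it far exceeds a rolling baseline. The class name `SurgeDetector`, the window size, the multiplier and the minimum request count are all hypothetical values chosen for the example, not our production alert settings.

```python
from collections import deque


class SurgeDetector:
    """Flags samples that far exceed the average of recent normal traffic."""

    def __init__(self, window=5, factor=3.0, min_requests=1000):
        self.history = deque(maxlen=window)  # recent non-surge samples
        self.factor = factor                 # multiple of baseline that counts as a surge
        self.min_requests = min_requests     # ignore spikes on very low traffic

    def observe(self, requests_per_minute):
        """Return True if this sample looks like a surge against the baseline."""
        surge = False
        # Only judge once the window is full, so the baseline is meaningful.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            surge = (
                requests_per_minute >= self.min_requests
                and requests_per_minute > self.factor * baseline
            )
        # Keep surge samples out of the baseline so an ongoing attack
        # does not gradually raise the threshold.
        if not surge:
            self.history.append(requests_per_minute)
        return surge


detector = SurgeDetector(window=3, factor=3.0, min_requests=100)
samples = [200, 210, 190, 205, 900, 950, 200]
flags = [detector.observe(s) for s in samples]
print(flags)
```

With these example numbers, only the samples of 900 and 950 requests per minute are flagged. A real alert would also need to account for normal daily traffic patterns.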

Follow-Up Actions:

  • Continued Monitoring and Audits: Regular audits will be conducted on monitoring and alert configurations to detect high-volume traffic scenarios earlier.
  • Review of Incident Response and Training: Further training will focus on improving incident response times and enhancing cross-team coordination.
Posted Nov 04, 2024 - 14:14 CET

Resolved
We experienced an unplanned outage and systems are back to normal now.
Posted Oct 30, 2024 - 19:13 CET
Monitoring
We experienced an unplanned outage.
We resolved this problem.
Please continue checking this page for updates on the status.
Posted Oct 30, 2024 - 18:49 CET