Severely degraded POS service

Incident Report for Mews

Postmortem

Problem

On March 26, 2025, our Point of Sale (POS) system experienced a major incident under heavy load, which rendered the service unavailable to customers. The root cause was identified as insufficient Input/Output Operations Per Second (IOPS) allocated to the storage used by the POS database.

Action

Upon detecting the incident, the team promptly began recovery, carrying out specific operations in AWS and on the database. The service was restored, and monitoring was put in place to ensure stability.

Causes

The incident was caused by a combination of factors:

  • Insufficient IOPS allocated to the database storage.
  • A gradual increase in the number of merchants and users, leading to a higher database load.
  • Execution of various heavy queries in the hours leading up to the incident.

Solutions

To prevent similar incidents in the future, the following actions have been taken or are planned:

  • Review and optimize database configurations, including checkpoint frequency and memory settings.
  • Implement better monitoring and alerting mechanisms to detect potential issues before they escalate.
  • New indexes were added to the database to improve query performance.
  • Storage IOPS were increased from 12k to 24k, and the instance was upgraded to a larger size.
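
As an illustration of the monitoring item above, the following is a minimal sketch of an IOPS headroom check, not the alerting actually deployed. The names `IopsSample`, `utilization`, and `should_alert`, the 80% threshold, and the sample figures are all hypothetical; only the 12k and 24k provisioned limits come from this report. In practice the samples would be read from whatever metrics system is in use.

```python
from dataclasses import dataclass


@dataclass
class IopsSample:
    """One observation of storage activity (hypothetical shape)."""
    read_iops: float
    write_iops: float

    @property
    def total(self) -> float:
        return self.read_iops + self.write_iops


def utilization(sample: IopsSample, provisioned_iops: int) -> float:
    """Fraction of the provisioned IOPS budget consumed by one sample."""
    return sample.total / provisioned_iops


def should_alert(samples, provisioned_iops, threshold=0.8, min_breaches=3):
    """Alert only when the last `min_breaches` samples all reach the threshold.

    Requiring several consecutive breaches avoids paging on a single spike.
    The threshold and breach count are assumed values, not known settings.
    """
    recent = samples[-min_breaches:]
    if len(recent) < min_breaches:
        return False
    return all(utilization(s, provisioned_iops) >= threshold for s in recent)


# Illustrative samples only; these are not measurements from the incident.
samples = [IopsSample(read_iops=7000, write_iops=3500)] * 3

alert_at_12k = should_alert(samples, provisioned_iops=12_000)
alert_at_24k = should_alert(samples, provisioned_iops=24_000)
```

With these made-up samples, the same load that would trigger an alert against a 12k limit sits well below the threshold against a 24k limit, which is the kind of early warning the monitoring item is meant to provide.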
Posted Apr 15, 2025 - 18:36 CEST

Resolved

Severely degraded POS service
Posted Mar 26, 2025 - 20:30 CET