Problem
On 17th May 2025, a change to our production system unintentionally removed key services used for the asynchronous processing of reservations and payments. As a result, around 12,000 reservation messages, particularly from channel partners, were lost or delayed. Some terminal payment events were also affected during this period.
Action
Once the issue was identified, we:
- Reversed the change that caused the problem.
- Rebuilt the services responsible for handling reservations and payments.
- Switched to a backup method for processing reservations to minimize further disruption.
- Notified affected customers and partners, advising them to reprocess any missing data from the affected period.
Causes
This issue was caused by a configuration error in a system update, which mistakenly removed important data services. The error was not caught during review, partly because of how the system presented the changes. The update also deployed automatically to all environments, including production, which made the impact immediate.
Solutions
To prevent this from happening again, we are:
- Adding stronger checks and alerts that run before any large-scale change is made to the asynchronous processing system.
- Improving how changes to the asynchronous processing system are previewed, so that potential issues are clearly flagged during our review process.
- Adjusting how we deploy updates across environments to enable safer testing of changes.
- Strengthening our system’s ability to fall back to other processing methods if something goes wrong.
- Enhancing monitoring and alerting to catch similar problems faster.
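
To illustrate the kind of pre-deployment check described in the list above, here is a minimal Python sketch. It assumes a planned update can be represented as a list of changes, each with an action and a resource name. The `PlannedChange` type, the `review_plan` function, the `PROTECTED` set, and every resource name are hypothetical and chosen only for illustration; they do not describe the actual tooling involved in this incident.

```python
from dataclasses import dataclass

# Hypothetical names of services that must not be removed without sign-off.
PROTECTED = {"reservation-queue", "payment-events"}


@dataclass(frozen=True)
class PlannedChange:
    action: str    # "create", "update", or "delete"
    resource: str  # hypothetical resource name


def review_plan(changes, protected=PROTECTED, max_deletes=0):
    """Return a list of reasons to block a deployment.

    An empty list means the plan passed these checks.
    """
    deletes = [c.resource for c in changes if c.action == "delete"]
    reasons = []
    # Flag every deletion that touches a protected resource.
    for name in sorted(set(deletes) & set(protected)):
        reasons.append(f"deletes protected resource: {name}")
    # Flag plans that delete more resources than the allowed limit.
    if len(deletes) > max_deletes:
        reasons.append(
            f"{len(deletes)} deletion(s) exceed the limit of {max_deletes}"
        )
    return reasons


# Example: a plan that updates one resource and deletes a protected one.
plan = [
    PlannedChange("update", "api-gateway"),
    PlannedChange("delete", "reservation-queue"),
]
for reason in review_plan(plan):
    print("BLOCKED:", reason)
```

In this sketch, any deletion is blocked by default (`max_deletes=0`), so removing a service requires an explicit decision to raise the limit rather than passing silently through review.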