Incident Report: Stuck Event Collector on Mantle Staging Environment
Date: 2023-12-21
Time: 04:36 (GMT+2)
Duration: Approximately 2 days
Description
The event collector for Mantle on staging was found to be stuck. It could not retrieve some transaction or block data for an extended period, which caused delays and repeated alerts.
Root Cause
The issue stemmed from changes pushed to the event collectors on staging that affected Mantle. Additionally, the Reblok node had gotten stuck, further aggravating the issue.
Impact
Event collection was unavailable for Mantle for approximately 2 days, causing delayed or incorrect data reporting and alerts being repeatedly triggered.
Timeline
- 04:36 - First noticed the issue.
- 11:22 - Identified as a stuck event collector.
- 11:25 - Aaron's initial pushed change identified as the potential cause.
- 12:17 (on a later day) - Revert of changes initiated.
- 12:17 - Issue of unreachable database identified and resolved.
- 12:34 - Mantle syncing resumed and operational.
Lessons Learned
Regular checks and validations of changes in the staging environment are crucial to prevent prolonged issues. Immediate action and communication when issues are detected can minimize impact.
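One way to make such checks routine is an automated stall check on the collector's progress. The sketch below is illustrative only and is not the monitoring used in this incident: the `StallDetector` class, its `observe` method, and the 600-second threshold are hypothetical names and values. It flags a collector as stuck when its last processed block has not advanced within a configured time window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Hypothetical helper: flags a collector whose block height stops advancing."""

    max_stall_seconds: float            # assumed threshold, tune per chain
    last_block: Optional[int] = None    # highest block seen so far
    last_progress_at: Optional[float] = None  # time of the last advance

    def observe(self, block: int, now: float) -> bool:
        """Record the latest processed block; return True if the collector looks stuck."""
        if self.last_block is None or block > self.last_block:
            # Progress was made, so reset the stall timer.
            self.last_block = block
            self.last_progress_at = now
            return False
        # No progress: stuck once the window has fully elapsed.
        return (now - self.last_progress_at) >= self.max_stall_seconds


detector = StallDetector(max_stall_seconds=600)
results = [
    detector.observe(100, now=0),    # first observation
    detector.observe(100, now=300),  # no progress, still within window
    detector.observe(100, now=600),  # no progress for the full window
    detector.observe(101, now=700),  # progress resumes
]
```

A check like this could run on a schedule against staging and raise a single alert on the transition to stuck, rather than re-alerting on every poll.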
Actions Taken
- Pushed a change to the event collectors on staging that affected Mantle.
- Identified and reverted changes causing the stuck event collector.
- Redeployed the previous code to clear the immediate error messages.
- Resolved the new issue of the unreachable database.
- Monitored until Mantle resumed normal operations and synced completely.
Related Images/Logs
- Escalation link.
Incident Reviewer(s)
- Aaron (Resolved staging collector issue)
- Jiri Herzan (Identified event collector issue and followed up)
- Andrew Prasaath (Reported the initial and ongoing issue)