Incident Report: Opsgenie Failure

Date: 2024-04-30
Time: 18:07 (GMT+3)
Duration: 16 minutes

Description

Opsgenie experienced issues processing alerts. Alerts were sent, but some were not processed correctly, which risked the loss of critical notifications.

Root Cause

The alert processing issues were caused by an outage of the Opsgenie service itself.

Impact

The impact included delayed or lost alerts, particularly for feed/chain pairs on mainnets. Critical alerts such as MNT/USD and ETH/USD on the Sepolia testnet were lost or delayed, which could have delayed response actions in operational monitoring.

Timeline

18:07 - Issue first reported by Warren.
18:09 - Team members started monitoring the Opsgenie dashboard more closely.
18:20 - Additional reports of lost and delayed alerts were noted.
18:23 - Opsgenie went back online.

Lessons Learned

This incident highlights the need for a robust failover or backup system for alert processing in critical operational environments. It also stresses the importance of real-time monitoring and immediate manual verification processes during system failures.
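
As one illustration of the failover idea, the following is a minimal sketch, not the team's actual tooling. It assumes each alert channel can be wrapped in a function that raises an exception when delivery fails. The names `send_with_failover`, `primary_channel`, and `backup_channel` are hypothetical, and no real Opsgenie API is called.

```python
from typing import Callable, List, Sequence

# A channel is any callable that delivers an alert and raises on failure.
# (Hypothetical abstraction: real channels would wrap an HTTP call, email, etc.)
Channel = Callable[[str], None]


def send_with_failover(alert: str, channels: Sequence[Channel]) -> int:
    """Try each channel in order and return the index of the first that succeeds."""
    errors: List[Exception] = []
    for index, channel in enumerate(channels):
        try:
            channel(alert)
            return index
        except Exception as exc:  # any delivery failure triggers the next channel
            errors.append(exc)
    raise RuntimeError(f"all {len(channels)} channels failed: {errors}")


# Illustration only: a primary channel that is down and a backup that records alerts.
delivered: List[str] = []


def primary_channel(alert: str) -> None:
    raise ConnectionError("primary alerting service unavailable")


def backup_channel(alert: str) -> None:
    delivered.append(alert)


used = send_with_failover("ETH/USD deviation", [primary_channel, backup_channel])
```

Here the call falls through to the backup channel, so `used` is the index of `backup_channel`. Returning the index makes it possible to log which channel actually carried the alert.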

Actions Taken

Increased monitoring of the Opsgenie dashboard.
Initiated manual checks whenever a feed/chain pair deviation persists for three cycles without a value change.
Began closely observing specific incidents and recording notes in Slack for further analysis.
Prepared for deep dives and potential escalation if deviations persist into the fourth cycle.
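
The three-cycle and fourth-cycle rule above could be tracked per feed/chain pair roughly as follows. This is a sketch under stated assumptions: a "cycle" is one monitoring pass, a deviation flag is supplied by the caller, and "without a value change" means the reported value is identical to the previous cycle. `PairState`, `next_action`, and the action labels are hypothetical names, not part of any existing tooling.

```python
from dataclasses import dataclass
from typing import Optional

MANUAL_CHECK_CYCLES = 3  # manual check after three unchanged deviating cycles
ESCALATION_CYCLES = 4    # deep dive / escalation from the fourth cycle onward


@dataclass
class PairState:
    """Per-pair tracking state (hypothetical structure)."""
    last_value: Optional[float] = None
    stale_cycles: int = 0


def next_action(state: PairState, value: float, deviating: bool) -> str:
    """Update the state for one cycle and return the suggested action."""
    if not deviating:
        # Deviation cleared: reset the counter.
        state.last_value = None
        state.stale_cycles = 0
        return "ok"
    if value == state.last_value:
        state.stale_cycles += 1
    else:
        # Value changed while deviating: start counting again.
        state.last_value = value
        state.stale_cycles = 1
    if state.stale_cycles >= ESCALATION_CYCLES:
        return "escalate"
    if state.stale_cycles == MANUAL_CHECK_CYCLES:
        return "manual_check"
    return "watch"


# Illustration: the same value reported while deviating for four cycles.
state = PairState()
actions = [next_action(state, 3012.5, True) for _ in range(4)]
```

Keeping the thresholds as named constants makes the rule easy to adjust if the team changes the number of cycles.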

Escalation link.

Incident Reviewer(s)

Warren
Arda
