📄 Approaching an Alert

1. Alert appears:

When an alert fires, the agent receives a notification and a call through the Opsgenie app installed on the agent's phone, and should Acknowledge the alert. Acknowledging the alert tells others that the agent is investigating it. After acknowledgment, Opsgenie also emails the agent when the alert changes, for example when its description is updated or the alert is closed.

Note: High-priority alerts may automatically notify technical team members if the responsible agent does not acknowledge the alert promptly. To avoid that, always acknowledge the alert. (See Escalation Policies & Routing Rules.)

2. Investigation:

Start the investigation immediately after acknowledging the alert. Each alert requires different handling; the alert description helps decide which tools to use and how to handle the alert. For more detail, see the Alert sheet.

3. Escalation:

When a critical or high-priority alert is triggered, follow the Actions section of the alert. Generally, follow the steps below:

a. Immediate Action:

  • Before anything else, the agent should check the health of the affected chain and its RPC URLs.
  • For P1 (Critical) and P2 (High) alerts: escalate to the appropriate team, either by creating an alert in Opsgenie or by posting a message in the #monitoring-team Slack channel.
    The appropriate team is listed in the alert's description and/or the Alert sheet.
  • For P3 and lower-priority alerts: the main actions are "Investigate" and "Contact."
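As an illustration of the RPC health check in the first bullet, here is a minimal sketch. It assumes the chain exposes an EVM-style JSON-RPC endpoint that answers `eth_blockNumber`; chains with a different RPC interface need a different method. The function names and any URL passed in are hypothetical and not part of the team's tooling.

```python
import json
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Build a JSON-RPC eth_blockNumber request (assumes an EVM-style endpoint)."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def parse_block_number(body: dict) -> int:
    """Extract the block height from a JSON-RPC reply; raise on an error reply."""
    if "error" in body:
        raise ValueError(f"RPC error: {body['error']}")
    return int(body["result"], 16)  # heights are returned as hex strings


def fetch_block_number(url: str, timeout: float = 5.0) -> int:
    """Query the endpoint once; network errors and timeouts propagate to the caller."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return parse_block_number(json.load(resp))


def is_advancing(earlier: int, later: int) -> bool:
    """Treat the chain as progressing if the height grew between two samples."""
    return later > earlier
```

One possible use is to call `fetch_block_number` twice, a short interval apart, and pass both heights to `is_advancing`. A timeout, an error reply, or a height that does not grow suggests that the endpoint or the chain itself needs attention.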

b. Investigate:

  • Cross-reference the alert with other active alerts to identify any correlation.
  • Check Centurion and/or Reblok logs for further details.
  • When unexpected chain issues occur, the monitoring agent can check the chain's Telegram, Twitter, or Discord channels for announcements such as chain maintenance.

c. Contact:

  • For P3 and lower-priority alerts, if the issue persists for more than 15 minutes after the initial investigation, escalate it to the appropriate team.
  • Use the #monitoring-team channel to discuss the issue with monitoring team members.
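The priority rules in the Immediate Action and Contact steps can be summarized as a small decision function. This is only a sketch of the logic: the function name and return values are hypothetical, and it assumes the 15-minute window is measured from the moment the investigation started.

```python
from datetime import datetime, timedelta

# Assumption: the 15-minute window starts when the investigation begins.
ESCALATION_WINDOW = timedelta(minutes=15)


def next_action(priority: str, investigation_started: datetime, now: datetime) -> str:
    """Return "escalate" or "investigate" for an unresolved alert."""
    level = int(priority.upper().lstrip("P"))  # "P1" -> 1, "p3" -> 3
    if level <= 2:
        # P1 (Critical) and P2 (High) are escalated right away.
        return "escalate"
    if now - investigation_started > ESCALATION_WINDOW:
        # P3 and lower: escalate once the issue has persisted past the window.
        return "escalate"
    return "investigate"
```

For example, a P3 alert that is still open 16 minutes after the investigation began yields "escalate", while the same alert at 10 minutes yields "investigate".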

d. Document:

  • Keep a record of all alerts, responses, and actions taken. This aids in post-mortem analysis and can help prevent future issues. See Tasks Documentation.

e. Feedback Loop:

  • Once the issue is resolved, notify all concerned parties and provide a brief summary of the issue and the steps taken to address it.
  • Gather feedback to enhance the monitoring process.

Owner: UNKNOWN