Incident Response Style and Documentation

Conventions for incident triage, communication, and documentation, including timeline formatting, stakeholder updates, and postmortem structure.

Status: experimental

IDE: claude, codex, vscode

Version: 1.0.0

Owner: epic-platform-sre

Tags: incident, sre, ops, communication

Incident Response Style and Documentation Guide

Overview

This guide covers conventions for incident documentation, communication, and postmortem writing. Effective incident documentation enables faster resolution and better learning from failures.

Incident Timeline Format

Timeline Entry Structure

MUST format timeline entries consistently:

## Timeline

| Time (UTC) | Event                 | Actor     | Details                    |
| ---------- | --------------------- | --------- | -------------------------- |
| 14:32:00   | Alert triggered       | PagerDuty | CPU > 95% on app-prod-01   |
| 14:35:00   | Acknowledged          | @jane.doe | On-call engineer paged     |
| 14:38:00   | Investigation started | @jane.doe | Checking metrics dashboard |
| 14:42:00   | Root cause identified | @jane.doe | Memory leak in v2.3.1      |
| 14:45:00   | Rollback initiated    | @jane.doe | Rolling back to v2.3.0     |
| 14:52:00   | Rollback complete     | ArgoCD    | All pods healthy           |
| 14:55:00   | Monitoring confirmed  | @jane.doe | Metrics normalized         |
| 15:00:00   | Incident resolved     | @jane.doe | Customer impact ended      |

Timestamp Rules

MUST use UTC for all timestamps:

# ✅ Correct - UTC timestamps

14:32:00 UTC - Alert triggered

# ❌ Incorrect - Local time without zone

2:32 PM - Alert triggered

# ❌ Incorrect - Ambiguous timezone

14:32:00 EST - Alert triggered

MUST use 24-hour format with seconds:

# ✅ Correct

14:32:45

# ❌ Incorrect

2:32 PM
14:32
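
The two timestamp rules above can be enforced in tooling rather than by eye. The following is a minimal sketch, assuming timeline entries are produced from timezone-aware `datetime` objects; the function names `format_timeline_time` and `is_valid_timeline_time` are hypothetical and not part of any existing tool.

```python
import re
from datetime import datetime, timedelta, timezone

# 24-hour HH:MM:SS, from 00:00:00 through 23:59:59.
_TIME_RE = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d")


def format_timeline_time(moment: datetime) -> str:
    """Convert a timezone-aware datetime to UTC and render it as HH:MM:SS."""
    if moment.tzinfo is None:
        # A naive datetime has no zone, so it cannot be converted reliably.
        raise ValueError("naive datetime: attach a timezone before formatting")
    return moment.astimezone(timezone.utc).strftime("%H:%M:%S")


def is_valid_timeline_time(text: str) -> bool:
    """Return True only for strings that are exactly 24-hour HH:MM:SS."""
    return _TIME_RE.fullmatch(text) is not None


# 09:32:45 at UTC-05:00 is 14:32:45 UTC.
local = datetime(2024, 12, 19, 9, 32, 45, tzinfo=timezone(timedelta(hours=-5)))
print(format_timeline_time(local))        # 14:32:45
print(is_valid_timeline_time("14:32"))    # False (seconds missing)
print(is_valid_timeline_time("2:32 PM"))  # False (12-hour format)
```

Rejecting naive datetimes, rather than guessing a zone, keeps local-time entries from slipping into a timeline unnoticed.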

Incident Classification

Severity Levels

| Severity      | Criteria                                           | Response Time     | Update Frequency |
| ------------- | -------------------------------------------------- | ----------------- | ---------------- |
| P1 - Critical | Production down, data loss risk, security breach   | < 15 min          | Every 15 min     |
| P2 - High     | Major feature unavailable, significant degradation | < 30 min          | Every 30 min     |
| P3 - Medium   | Minor feature impact, workaround available         | < 2 hours         | Every 2 hours    |
| P4 - Low      | Cosmetic issues, no customer impact                | Next business day | Daily            |
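
The update frequencies in the table above lend themselves to a small lookup that tells the incident commander when the next stakeholder update is due. This is a sketch under stated assumptions: `UPDATE_INTERVAL` and `next_update_due` are hypothetical names, and "Daily" is modeled as 24 hours.

```python
from datetime import datetime, timedelta, timezone

# Update cadence per severity, copied from the severity table.
# Assumption: "Daily" for P4 is treated as a 24-hour interval.
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2),
    "P4": timedelta(hours=24),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the latest time by which the next update should be sent."""
    return last_update + UPDATE_INTERVAL[severity]


last = datetime(2024, 12, 19, 14, 50, 0, tzinfo=timezone.utc)
print(next_update_due("P2", last).strftime("%Y-%m-%d %H:%M:%S UTC"))
# 2024-12-19 15:20:00 UTC
```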

Impact Categories

MUST document impact across these dimensions:

impact_assessment:
  customer_impact:
    affected_users: 5000
    affected_features: [login, dashboard]
    error_rate: '15% of requests failing'

  clinical_impact:
    patient_care_affected: false
    clinical_workflows_degraded: false
    phi_at_risk: false

  business_impact:
    revenue_loss_estimate: '$2,500/hour'
    sla_breach_risk: true
    compliance_implications: none

  technical_impact:
    affected_services: [auth-service, api-gateway]
    affected_regions: [eastus2]
    data_integrity_risk: false

Communication Templates

Initial Incident Notification

## 🔴 INCIDENT: [Brief Description]

**Severity:** P2 - High
**Status:** Investigating
**Started:** 2024-12-19 14:32 UTC

### Summary

[One sentence description of what's happening]

### Customer Impact

- [Feature X] is unavailable for [affected users]
- Error rate: [X]% of requests failing
- Affected regions: [list regions]

### Current Actions

- Incident Commander: @jane.doe
- Team investigating root cause
- Next update in 30 minutes

### Stakeholder Contacts

- Technical: #platform-incident
- Customer Support: @support-lead

Status Update Template

## 🟡 UPDATE: [Incident Title]

**Time:** 2024-12-19 14:50 UTC
**Status:** Identified → Mitigating

### Progress Since Last Update

- Root cause identified: [brief description]
- Mitigation in progress: [what's being done]

### Current Customer Impact

- [Updated impact statement]
- Estimated time to resolution: [X minutes/hours]

### Next Steps

1. [Next action] - Owner: @name
2. [Next action] - Owner: @name

### Next Update

In [X] minutes or when status changes

Resolution Notification

## 🟢 RESOLVED: [Incident Title]

**Resolved At:** 2024-12-19 15:00 UTC
**Duration:** 28 minutes
**Severity:** P2 - High

### Summary

[One paragraph summary of what happened and how it was resolved]

### Customer Impact

- Total affected users: ~5,000
- Total duration: 28 minutes
- Services affected: auth-service, api-gateway

### Resolution

[What fixed the issue]

### Follow-Up

- Postmortem scheduled: 2024-12-20 10:00 UTC
- Postmortem doc: [link]
- Action items will be tracked in: [JIRA epic link]

Postmortem Structure

Required Sections

MUST include all sections:

# Postmortem: [Incident Title]

## Metadata

- **Date:** 2024-12-19
- **Duration:** 28 minutes
- **Severity:** P2
- **Authors:** @jane.doe, @john.smith
- **Status:** Draft | In Review | Final

## Executive Summary

[2-3 sentences summarizing the incident for leadership]

## Timeline

[Detailed timeline table]

## Root Cause Analysis

### What Happened

[Technical description of the failure]

### Why It Happened

[Contributing factors - use 5 Whys or similar technique]

### Why It Wasn't Caught Earlier

[Detection gap analysis]

## Impact Assessment

### Customer Impact

[Quantified customer impact]

### Business Impact

[Revenue, SLA, compliance implications]

## Response Analysis

### What Went Well

- [Positive observation]
- [Positive observation]

### What Didn't Go Well

- [Area for improvement]
- [Area for improvement]

### Where We Got Lucky

- [Near-miss that could have been worse]

## Action Items

| ID   | Action   | Owner | Priority | Due Date   | Status |
| ---- | -------- | ----- | -------- | ---------- | ------ |
| AI-1 | [Action] | @name | P1       | 2024-12-26 | Open   |
| AI-2 | [Action] | @name | P2       | 2025-01-02 | Open   |

## Lessons Learned

[Key takeaways for the organization]

## Appendix

- Related tickets: [links]
- Dashboards: [links]
- Runbooks used: [links]

Writing Guidelines

Be Factual, Not Blaming

MUST focus on systems, not individuals:

# ❌ Blameful language

John pushed a bad config that broke production.
The QA team missed this obvious bug.
Operations should have caught this sooner.

# ✅ Systemic language

A configuration change introduced a regression.
The test suite did not cover this edge case.
Monitoring did not alert on this failure mode.

Distinguish Symptoms from Causes

MUST clearly separate symptoms, impact, and root cause:

## Symptoms (What we observed)

- Error rate spiked to 15%
- Response latency increased to 5 seconds
- Database connection pool exhausted

## Impact (What customers experienced)

- Login failures for 5,000 users
- Dashboard loading timeouts
- 28 minutes of degraded service

## Root Cause (Why it happened)

A memory leak in the connection pooling library (v2.3.1) prevented
connections from being released properly, exhausting the pool after
approximately 2 hours of runtime under load.

Quantify Impact

MUST use specific numbers, not vague descriptions:

# ❌ Vague

Many users were affected for a while.
Performance was degraded significantly.

# ✅ Specific

5,247 users experienced login failures over 28 minutes.
P99 latency increased from 200ms to 5,200ms (26x increase).
Error rate increased from 0.1% to 15.3%.
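
Figures like these are easier to trust when they are derived rather than typed by hand. The sketch below recomputes the duration and fold increases from the example values above; the helper names `duration_minutes` and `fold_increase` are hypothetical, and the start and end times are taken from the sample timeline (alert at 14:32:00, resolution at 15:00:00).

```python
from datetime import datetime, timezone


def duration_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timezone-aware datetimes."""
    return int((end - start).total_seconds() // 60)


def fold_increase(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


started = datetime(2024, 12, 19, 14, 32, 0, tzinfo=timezone.utc)
resolved = datetime(2024, 12, 19, 15, 0, 0, tzinfo=timezone.utc)

print(duration_minutes(started, resolved))  # 28
print(f"{fold_increase(200, 5200):.0f}x")   # 26x  (P99 latency, ms)
print(f"{fold_increase(0.1, 15.3):.0f}x")   # 153x (error rate, %)
```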

Action Items Must Be SMART

MUST write actionable, measurable follow-ups:

# ❌ Vague action item

Improve monitoring.
Add more tests.
Fix the bug.

# ✅ SMART action item

| Action                                              | Owner     | Due    | Success Criteria            |
| --------------------------------------------------- | --------- | ------ | --------------------------- |
| Add alert for connection pool utilization > 80%     | @sre-team | Dec 26 | Alert fires in staging test |
| Add integration test for connection pool exhaustion | @dev-team | Dec 30 | Test in CI passes           |
| Upgrade connection-pool library to v2.3.2           | @dev-team | Dec 23 | Deployed to production      |

Audience-Specific Communication

Technical Audience (SRE/Engineering)

## Technical Summary

**Root Cause:** Memory leak in hikari-cp v4.0.3 caused connection
exhaustion under sustained load (>100 req/s for >2 hours).

**Technical Details:**

- Connection objects not released after timeout exception
- Pool size: 20 connections, all exhausted by 14:32 UTC
- Stack trace: ConnectionPool.getConnection() blocking indefinitely

**Mitigation Applied:**

- Rolled back to hikari-cp v4.0.2 (known stable)
- Increased pool size to 50 as interim measure
- Added connection leak detection logging

Leadership Audience

## Executive Summary

**What happened:** Our customer login service was unavailable for 28 minutes.

**Who was affected:** Approximately 5,000 customers could not log in.

**Business impact:** Estimated $1,200 in lost revenue. No SLA breach.

**Root cause:** A software library had a bug that caused resource exhaustion.

**Status:** Fully resolved. Permanent fix scheduled for next release.

**Prevention:** Three action items identified to prevent recurrence.

Customer-Facing Communication

## Service Update

**Issue:** Some customers may have experienced difficulty logging in
between 14:30 and 15:00 UTC on December 19th.

**Resolution:** This issue has been fully resolved. All services are
operating normally.

**What we're doing:** We've identified the cause and are implementing
additional safeguards to prevent similar issues in the future.

**Questions?** Contact [email protected]

We apologize for any inconvenience this may have caused.

Sensitive Information Handling

What to Include

safe_to_include:
  - Error messages (sanitized)
  - Metric values and thresholds
  - Service names and versions
  - Timeline of events
  - Technical root cause
  - Action items

What to Exclude

must_not_include:
  - Patient identifiers (MRN, SSN, DOB)
  - Personal health information (PHI)
  - Customer names or account numbers
  - API keys or credentials
  - Internal IP addresses (use service names)
  - Detailed security vulnerability information

Sanitization Example

# ❌ Unsanitized

User [email protected] (MRN: 12345678) reported error.
Database password '[EXAMPLE_PASSWORD]' was exposed in logs.

# ✅ Sanitized

User [REDACTED] reported error.
Database credentials were exposed in logs (credentials rotated).
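
A first-pass redaction step can catch the most common leaks before a human review. The following is a minimal sketch, not a complete PHI scrubber: the patterns, the `sanitize` function, and the sample address and IP are all hypothetical, and real log formats and identifier schemes will need their own rules.

```python
import re

# Hypothetical patterns for illustration only. Each pair is
# (compiled pattern, replacement text). Order matters: emails first.
_REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED]"),        # email addresses
    (re.compile(r"\bMRN:?\s*\d+\b", re.IGNORECASE), "MRN: [REDACTED]"),  # medical record numbers
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED]"),                # US SSN layout
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED]"),          # IPv4 addresses
]


def sanitize(text: str) -> str:
    """Apply every redaction pattern in turn and return the cleaned text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


raw = "User jane@example.com (MRN: 12345678) reported error from 10.0.0.12."
print(sanitize(raw))
# User [REDACTED] (MRN: [REDACTED]) reported error from [REDACTED].
```

Pattern-based redaction only reduces risk; it does not replace the manual check in the review checklist, since free-text names and account numbers will not match any fixed pattern.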

Links and References

Required References

MUST include links to:

## References

- **Incident Ticket:** [INC-12345](https://jira.internal/INC-12345)
- **Monitoring Dashboard:** [Grafana](https://grafana.internal/d/abc123)
- **Runbook Used:** [Database Failover](../runbooks/database-failover.md)
- **Related PRs:** [#456](https://github.com/org/repo/pull/456)
- **Action Item Epic:** [PLAT-789](https://jira.internal/PLAT-789)

Cross-Reference Format

Related incidents:

- [INC-11111](./inc-11111.md) - Similar root cause (connection pooling)
- [INC-10000](./inc-10000.md) - Same service affected

Related postmortems:

- [PM-2024-11-15](../postmortems/pm-2024-11-15.md) - Connection leak pattern

Review Checklist

Before finalizing incident documentation:

Timeline

- All timestamps in UTC
- 24-hour format with seconds
- Actors identified for each action
- No unexplained gaps longer than 15 minutes

Content

- Symptoms, impact, and root cause clearly separated
- Impact quantified with specific numbers
- No blameful language
- Sensitive information sanitized

Action Items

- Each action has owner and due date
- Actions are specific and measurable
- Priority assigned
- Tracked in ticketing system

Communication

- Appropriate detail level for audience
- Customer-facing updates approved by comms team
- Leadership summary concise and clear

References

- All relevant tickets linked
- Dashboard links included
- Related incidents cross-referenced
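
The timeline items in this checklist can be partly automated. As a sketch, assuming timeline times are `HH:MM:SS` strings in order on a single UTC day, the hypothetical `find_gaps` helper below flags consecutive entries more than 15 minutes apart so the author can explain or fill them.

```python
from datetime import datetime, timedelta

# Threshold from the checklist: gaps longer than this need an explanation.
MAX_GAP = timedelta(minutes=15)


def find_gaps(times: list[str], max_gap: timedelta = MAX_GAP) -> list[tuple[str, str]]:
    """Return consecutive (earlier, later) pairs that are more than max_gap apart.

    Assumes entries are sorted and fall on the same UTC day. strptime raises
    ValueError for entries that are not in HH:MM:SS form.
    """
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [
        (times[i], times[i + 1])
        for i in range(len(parsed) - 1)
        if parsed[i + 1] - parsed[i] > max_gap
    ]


timeline = ["14:32:00", "14:35:00", "14:38:00", "14:42:00", "15:00:00"]
print(find_gaps(timeline))  # [('14:42:00', '15:00:00')]
```

A flagged gap is not necessarily an error; it marks a stretch of the incident that the timeline should account for.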
