Incident Response Style and Documentation
Conventions for incident triage, communication, and documentation including timeline formatting, stakeholder updates, and postmortem structure.
Overview
This guide covers conventions for incident documentation, communication, and postmortem writing. Effective incident documentation enables faster resolution and better learning from failures.
Incident Timeline Format
Timeline Entry Structure
MUST format timeline entries consistently:
## Timeline
| Time (UTC) | Event | Actor | Details |
| ---------- | --------------------- | --------- | -------------------------- |
| 14:32:00 | Alert triggered | PagerDuty | CPU > 95% on app-prod-01 |
| 14:35:00 | Acknowledged | @jane.doe | On-call engineer paged |
| 14:38:00 | Investigation started | @jane.doe | Checking metrics dashboard |
| 14:42:00 | Root cause identified | @jane.doe | Memory leak in v2.3.1 |
| 14:45:00 | Rollback initiated | @jane.doe | Rolling back to v2.3.0 |
| 14:52:00 | Rollback complete | ArgoCD | All pods healthy |
| 14:55:00 | Monitoring confirmed | @jane.doe | Metrics normalized |
| 15:00:00 | Incident resolved | @jane.doe | Customer impact ended |
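One way to keep entries consistent is to generate the table from structured data rather than typing rows by hand. The sketch below is illustrative only: `render_timeline` and the tuple layout are hypothetical names, not part of any existing tooling referenced by this guide.

```python
from typing import Iterable, Tuple

# Each entry is (time_utc, event, actor, details); this layout is an assumption.
TimelineEntry = Tuple[str, str, str, str]


def render_timeline(entries: Iterable[TimelineEntry]) -> str:
    """Render timeline entries as a Markdown table with the guide's four columns."""
    lines = [
        "| Time (UTC) | Event | Actor | Details |",
        "| --- | --- | --- | --- |",
    ]
    for time_utc, event, actor, details in entries:
        lines.append(f"| {time_utc} | {event} | {actor} | {details} |")
    return "\n".join(lines)


entries = [
    ("14:32:00", "Alert triggered", "PagerDuty", "CPU > 95% on app-prod-01"),
    ("14:35:00", "Acknowledged", "@jane.doe", "On-call engineer paged"),
]
print(render_timeline(entries))
```

Generating rows this way fixes the column order and makes a missing actor or detail fail loudly when the tuple is unpacked.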
Timestamp Rules
MUST use UTC for all timestamps:
# ✅ Correct - UTC timestamps
14:32:00 UTC - Alert triggered
# ❌ Incorrect - Local time without zone
2:32 PM - Alert triggered
# ❌ Incorrect - Ambiguous timezone
14:32:00 EST - Alert triggered
MUST use 24-hour format with seconds:
# ✅ Correct
14:32:45
# ❌ Incorrect
2:32 PM
14:32
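Both rules can be applied mechanically when timestamps come from tooling. The following is a minimal sketch, assuming timestamps arrive as timezone-aware `datetime` objects; the function name `format_timeline_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def format_timeline_timestamp(moment: datetime) -> str:
    """Return HH:MM:SS in UTC.

    Naive datetimes are rejected because their timezone is ambiguous.
    """
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("timestamp has no timezone; cannot convert to UTC safely")
    return moment.astimezone(timezone.utc).strftime("%H:%M:%S")


# Example: 09:32:45 at a fixed UTC-5 offset is 14:32:45 UTC.
utc_minus_5 = timezone(timedelta(hours=-5))
local_alert = datetime(2024, 12, 19, 9, 32, 45, tzinfo=utc_minus_5)
print(format_timeline_timestamp(local_alert), "UTC")  # prints "14:32:45 UTC"
```

Rejecting naive values, rather than guessing a zone, mirrors the rule against local times without a zone.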
Incident Classification
Severity Levels
| Severity | Criteria | Response Time | Update Frequency |
|---|---|---|---|
| P1 - Critical | Production down, data loss risk, security breach | < 15 min | Every 15 min |
| P2 - High | Major feature unavailable, significant degradation | < 30 min | Every 30 min |
| P3 - Medium | Minor feature impact, workaround available | < 2 hours | Every 2 hours |
| P4 - Low | Cosmetic issues, no customer impact | Next business day | Daily |
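The table above can also be encoded as data so that tooling computes when the next stakeholder update is due. This is a sketch under assumptions: `SeverityPolicy`, `SEVERITY_POLICIES`, and `next_update_due` are hypothetical names, and "Next business day" is represented as `None` rather than calculated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_within: Optional[timedelta]  # None stands for "next business day"
    update_every: timedelta


# Values copied from the severity table; the structure itself is an assumption.
SEVERITY_POLICIES = {
    "P1": SeverityPolicy("Critical", timedelta(minutes=15), timedelta(minutes=15)),
    "P2": SeverityPolicy("High", timedelta(minutes=30), timedelta(minutes=30)),
    "P3": SeverityPolicy("Medium", timedelta(hours=2), timedelta(hours=2)),
    "P4": SeverityPolicy("Low", None, timedelta(days=1)),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the latest time the next update should be sent for this severity."""
    return last_update + SEVERITY_POLICIES[severity].update_every


last = datetime(2024, 12, 19, 14, 32, tzinfo=timezone.utc)
print(next_update_due("P2", last).strftime("%H:%M:%S UTC"))  # prints "15:02:00 UTC"
```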
Impact Categories
MUST document impact across these dimensions:
impact_assessment:
  customer_impact:
    affected_users: 5000
    affected_features: [login, dashboard]
    error_rate: '15% of requests failing'
  clinical_impact:
    patient_care_affected: false
    clinical_workflows_degraded: false
    phi_at_risk: false
  business_impact:
    revenue_loss_estimate: '$2,500/hour'
    sla_breach_risk: true
    compliance_implications: none
  technical_impact:
    affected_services: [auth-service, api-gateway]
    affected_regions: [eastus2]
    data_integrity_risk: false
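A completeness check for the four dimensions might look like the sketch below. It assumes the assessment has already been loaded into a plain dictionary (for example by a YAML parser); `missing_impact_dimensions` is a hypothetical helper.

```python
from typing import Dict, List

# The four dimensions named in the impact assessment example.
REQUIRED_DIMENSIONS = (
    "customer_impact",
    "clinical_impact",
    "business_impact",
    "technical_impact",
)


def missing_impact_dimensions(impact_assessment: Dict[str, dict]) -> List[str]:
    """Return the required dimensions that are absent or empty."""
    return [name for name in REQUIRED_DIMENSIONS if not impact_assessment.get(name)]


draft = {
    "customer_impact": {"affected_users": 5000},
    "technical_impact": {"affected_regions": ["eastus2"]},
}
print(missing_impact_dimensions(draft))  # prints ['clinical_impact', 'business_impact']
```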
Communication Templates
Initial Incident Notification
## 🔴 INCIDENT: [Brief Description]
**Severity:** P2 - High
**Status:** Investigating
**Started:** 2024-12-19 14:32 UTC
### Summary
[One sentence description of what's happening]
### Customer Impact
- [Feature X] is unavailable for [affected users]
- Error rate: [X]% of requests failing
- Affected regions: [list regions]
### Current Actions
- Incident Commander: @jane.doe
- Team investigating root cause
- Next update in 15 minutes
### Stakeholder Contacts
- Technical: #platform-incident
- Customer Support: @support-lead
Status Update Template
## 🟡 UPDATE: [Incident Title]
**Time:** 2024-12-19 14:50 UTC
**Status:** Identified → Mitigating
### Progress Since Last Update
- Root cause identified: [brief description]
- Mitigation in progress: [what's being done]
### Current Customer Impact
- [Updated impact statement]
- Estimated time to resolution: [X minutes/hours]
### Next Steps
1. [Next action] - Owner: @name
2. [Next action] - Owner: @name
### Next Update
In [X] minutes or when status changes
Resolution Notification
## 🟢 RESOLVED: [Incident Title]
**Resolved At:** 2024-12-19 15:00 UTC
**Duration:** 28 minutes
**Severity:** P2 - High
### Summary
[One paragraph summary of what happened and how it was resolved]
### Customer Impact
- Total affected users: ~5,000
- Total duration: 28 minutes
- Services affected: auth-service, api-gateway
### Resolution
[What fixed the issue]
### Follow-Up
- Postmortem scheduled: 2024-12-20 10:00 UTC
- Postmortem doc: [link]
- Action items will be tracked in: [JIRA epic link]
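The duration in the resolution notice should follow directly from the start and resolution timestamps. A minimal sketch, assuming both are timezone-aware UTC values; `incident_duration_minutes` is a hypothetical helper.

```python
from datetime import datetime, timezone


def incident_duration_minutes(started: datetime, resolved: datetime) -> int:
    """Return whole minutes between start and resolution."""
    if resolved < started:
        raise ValueError("resolution time precedes start time")
    return int((resolved - started).total_seconds() // 60)


started = datetime(2024, 12, 19, 14, 32, tzinfo=timezone.utc)
resolved = datetime(2024, 12, 19, 15, 0, tzinfo=timezone.utc)
print(f"**Duration:** {incident_duration_minutes(started, resolved)} minutes")  # 28 minutes
```

Deriving the figure instead of typing it keeps the notification, the timeline, and the postmortem metadata in agreement.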
Postmortem Structure
Required Sections
MUST include all of the following sections:
# Postmortem: [Incident Title]
## Metadata
- **Date:** 2024-12-19
- **Duration:** 28 minutes
- **Severity:** P2
- **Authors:** @jane.doe, @john.smith
- **Status:** Draft | In Review | Final
## Executive Summary
[2-3 sentences summarizing the incident for leadership]
## Timeline
[Detailed timeline table]
## Root Cause Analysis
### What Happened
[Technical description of the failure]
### Why It Happened
[Contributing factors - use 5 Whys or similar technique]
### Why It Wasn't Caught Earlier
[Detection gap analysis]
## Impact Assessment
### Customer Impact
[Quantified customer impact]
### Business Impact
[Revenue, SLA, compliance implications]
## Response Analysis
### What Went Well
- [Positive observation]
- [Positive observation]
### What Didn't Go Well
- [Area for improvement]
- [Area for improvement]
### Where We Got Lucky
- [Near-miss that could have been worse]
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
| ---- | -------- | ----- | -------- | ---------- | ------ |
| AI-1 | [Action] | @name | P1 | 2024-12-26 | Open |
| AI-2 | [Action] | @name | P2 | 2025-01-02 | Open |
## Lessons Learned
[Key takeaways for the organization]
## Appendix
- Related tickets: [links]
- Dashboards: [links]
- Runbooks used: [links]
Writing Guidelines
Be Factual, Not Blaming
MUST focus on systems, not individuals:
# ❌ Blameful language
John pushed a bad config that broke production.
The QA team missed this obvious bug.
Operations should have caught this sooner.
# ✅ Systemic language
A configuration change introduced a regression.
The test suite did not cover this edge case.
Monitoring did not alert on this failure mode.
Distinguish Symptoms from Causes
MUST clearly separate symptoms, impact, and root cause:
## Symptoms (What we observed)
- Error rate spiked to 15%
- Response latency increased to 5 seconds
- Database connection pool exhausted
## Impact (What customers experienced)
- Login failures for 5,000 users
- Dashboard loading timeouts
- 28 minutes of degraded service
## Root Cause (Why it happened)
A memory leak in the connection pooling library (v2.3.1) prevented
connections from being released, exhausting the pool after
approximately 2 hours of runtime under load.
Quantify Impact
MUST use specific numbers, not vague descriptions:
# ❌ Vague
Many users were affected for a while.
Performance was degraded significantly.
# ✅ Specific
5,247 users experienced login failures over 28 minutes.
P99 latency increased from 200ms to 5,200ms (26x increase).
Error rate increased from 0.1% to 15.3%.
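Multipliers such as "26x increase" are easy to get wrong by hand, so it can help to compute them from the baseline and observed values. The helper name `fold_increase` is hypothetical; the inputs are the figures from the example above.

```python
def fold_increase(baseline: float, observed: float) -> float:
    """Return how many times larger the observed value is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return observed / baseline


# P99 latency: 200 ms baseline, 5,200 ms during the incident.
print(f"P99 latency: {fold_increase(200, 5200):.0f}x increase")  # 26x
# Error rate: 0.1% baseline, 15.3% during the incident.
print(f"Error rate: {fold_increase(0.1, 15.3):.0f}x increase")  # 153x
```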
Action Items Must Be SMART
MUST write actionable, measurable follow-ups:
# ❌ Vague action item
Improve monitoring.
Add more tests.
Fix the bug.
# ✅ SMART action item
| Action | Owner | Due | Success Criteria |
| --------------------------------------------------- | --------- | ------ | --------------------------- |
| Add alert for connection pool utilization > 80% | @sre-team | Dec 26 | Alert fires in staging test |
| Add integration test for connection pool exhaustion | @dev-team | Dec 30 | Test in CI passes |
| Upgrade connection-pool library to v2.3.2 | @dev-team | Dec 23 | Deployed to production |
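Whether an item has the fields that make it trackable can be checked automatically. The sketch below covers only the mechanical part (owner, due date, success criteria) and cannot judge whether the wording is specific enough; `ActionItem` and `smart_gaps` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: str = ""
    due: Optional[date] = None
    success_criteria: str = ""


def smart_gaps(item: ActionItem) -> List[str]:
    """Return the names of fields an action item still needs."""
    gaps = []
    if not item.owner.strip():
        gaps.append("owner")
    if item.due is None:
        gaps.append("due date")
    if not item.success_criteria.strip():
        gaps.append("success criteria")
    return gaps


vague = ActionItem("Improve monitoring.")
specific = ActionItem(
    "Add alert for connection pool utilization > 80%",
    owner="@sre-team",
    due=date(2024, 12, 26),
    success_criteria="Alert fires in staging test",
)
print(smart_gaps(vague))     # prints ['owner', 'due date', 'success criteria']
print(smart_gaps(specific))  # prints []
```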
Audience-Specific Communication
Technical Audience (SRE/Engineering)
## Technical Summary
**Root Cause:** Memory leak in hikari-cp v4.0.3 caused connection
exhaustion under sustained load (>100 req/s for >2 hours).
**Technical Details:**
- Connection objects not released after timeout exception
- Pool size: 20 connections, all exhausted by 14:32 UTC
- Stack trace: ConnectionPool.getConnection() blocking indefinitely
**Mitigation Applied:**
- Rolled back to hikari-cp v4.0.2 (known stable)
- Increased pool size to 50 as interim measure
- Added connection leak detection logging
Leadership Audience
## Executive Summary
**What happened:** Our customer login service was unavailable for 28 minutes.
**Who was affected:** Approximately 5,000 customers could not log in.
**Business impact:** Estimated $1,200 in lost revenue. No SLA breach.
**Root cause:** A software library had a bug that caused resource exhaustion.
**Status:** Fully resolved. Permanent fix scheduled for next release.
**Prevention:** Three action items identified to prevent recurrence.
Customer-Facing Communication
## Service Update
**Issue:** Some customers may have experienced difficulty logging in
between 14:30 and 15:00 UTC on December 19th.
**Resolution:** This issue has been fully resolved. All services are
operating normally.
**What we're doing:** We've identified the cause and are implementing
additional safeguards to prevent similar issues in the future.
**Questions?** Contact [email protected]
We apologize for any inconvenience this may have caused.
Sensitive Information Handling
What to Include
safe_to_include:
- Error messages (sanitized)
- Metric values and thresholds
- Service names and versions
- Timeline of events
- Technical root cause
- Action items
What to Exclude
must_not_include:
- Patient identifiers (MRN, SSN, DOB)
- Personal health information (PHI)
- Customer names or account numbers
- API keys or credentials
- Internal IP addresses (use service names)
- Detailed security vulnerability information
Sanitization Example
# ❌ Unsanitized
User [email protected] (MRN: 12345678) reported error.
Database password '[EXAMPLE_PASSWORD]' was exposed in logs.
# ✅ Sanitized
User [REDACTED] reported error.
Database credentials were exposed in logs (credentials rotated).
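A first pass at sanitization can be automated with pattern-based redaction, followed by human review. The rules below are illustrative assumptions, not a complete PHI filter: they cover email addresses, parenthesized MRN annotations, and US SSN-shaped numbers only. The address `pat@example.com` is a made-up placeholder.

```python
import re

# (pattern, replacement) pairs applied in order. Illustrative only; real
# redaction needs a reviewed rule set plus a manual check.
REDACTION_RULES = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED]"),
    (re.compile(r"\s*\(MRN:\s*\d+\)"), ""),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED]"),
]


def sanitize(text: str) -> str:
    """Apply each redaction rule in turn and return the cleaned text."""
    for pattern, replacement in REDACTION_RULES:
        text = pattern.sub(replacement, text)
    return text


print(sanitize("User pat@example.com (MRN: 12345678) reported error."))
# prints "User [REDACTED] reported error."
```

Pattern matching catches known shapes only, so free-text names and account numbers still need a reviewer.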
Links and References
Required References
MUST include links to:
## References
- **Incident Ticket:** [INC-12345](https://jira.internal/INC-12345)
- **Monitoring Dashboard:** [Grafana](https://grafana.internal/d/abc123)
- **Runbook Used:** [Database Failover](../runbooks/database-failover.md)
- **Related PRs:** [#456](https://github.com/org/repo/pull/456)
- **Action Item Epic:** [PLAT-789](https://jira.internal/PLAT-789)
Cross-Reference Format
Related incidents:
- [INC-11111](./inc-11111.md) - Similar root cause (connection pooling)
- [INC-10000](./inc-10000.md) - Same service affected
Related postmortems:
- [PM-2024-11-15](../postmortems/pm-2024-11-15.md) - Connection leak pattern
Review Checklist
Before finalizing incident documentation:
Timeline
- All timestamps in UTC
- 24-hour format with seconds
- Actors identified for each action
- No unexplained gaps longer than 15 minutes
Content
- Symptoms, impact, and root cause clearly separated
- Impact quantified with specific numbers
- No blameful language
- Sensitive information sanitized
Action Items
- Each action has owner and due date
- Actions are specific and measurable
- Priority assigned
- Tracked in ticketing system
Communication
- Appropriate detail level for audience
- Customer-facing updates approved by comms team
- Leadership summary concise and clear
References
- All relevant tickets linked
- Dashboard links included
- Related incidents cross-referenced
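The mechanical Timeline items in this checklist (24-hour format with seconds, no gap over 15 minutes) could be checked by a script before human review. This is a sketch under assumptions: `check_timeline` is a hypothetical helper, it takes the time column as strings, and it assumes all entries fall on the same day, so incidents crossing midnight are not handled.

```python
import re
from datetime import datetime, timedelta
from typing import List

STRICT_TIME = re.compile(r"\d{2}:\d{2}:\d{2}")


def check_timeline(times: List[str], max_gap: timedelta = timedelta(minutes=15)) -> List[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    parsed = []
    for raw in times:
        try:
            if not STRICT_TIME.fullmatch(raw):
                raise ValueError(raw)
            parsed.append(datetime.strptime(raw, "%H:%M:%S"))
        except ValueError:
            problems.append(f"{raw!r} is not in 24-hour HH:MM:SS format")
    # Compare consecutive valid entries; same-day ordering is assumed.
    for earlier, later in zip(parsed, parsed[1:]):
        if later - earlier > max_gap:
            minutes = int((later - earlier).total_seconds() // 60)
            problems.append(
                f"gap of {minutes} minutes between "
                f"{earlier:%H:%M:%S} and {later:%H:%M:%S} needs an explanation"
            )
    return problems


print(check_timeline(["14:32:00", "14:35:00", "14:52:00", "2:32 PM"]))
```

In this example the script reports the malformed `2:32 PM` entry and the 17-minute gap between 14:35:00 and 14:52:00. The remaining checklist items still require judgment.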
Related Assets
Incident Triage and Timeline Builder
Build comprehensive incident timelines from logs, metrics, and tickets. Produces structured chronological summaries for postmortems and RCAs.
Owner: epic-platform-sre
Incident Triage Assistant
Assist with live incident triage, timeline building, and root cause analysis using logs, metrics, and incident management systems.
Owner: epic-platform-sre
Kubernetes Operations Assistant
Assist with Kubernetes cluster operations, debugging, and troubleshooting using read-only diagnostics and GitOps-safe recommendations.
Owner: epic-platform-sre
Deployment Risk Assessment
Assess deployment risks for releases based on change scope, system criticality, testing coverage, and historical incident patterns to inform go/no-go decisions.
Owner: community
Azure Resource Health Diagnosis
Analyze an Azure resource’s health, diagnose issues using logs and telemetry, and produce a remediation plan for identified problems.
Owner: epic-platform-sre
Dynatrace Kubernetes Service Triage
Systematic triage of a Dynatrace-monitored Kubernetes service using DQL queries for entity discovery, JVM health, thread analysis, pod generation comparison, and Davis problem correlation. Produces structured root cause analysis with Splunk query handoffs for restricted log environments.
Owner: epic-platform-sre

