dynatrace-k8s-triage

Systematic Kubernetes service triage using Dynatrace DQL — entity discovery, JVM health, thread analysis, pod generation comparison, Davis problem correlation, and Splunk SPL query generation for restricted log environments.

Status: active
IDE: codex
Version: 1.0.0
Owner: epic-platform-sre
Tags: dynatrace, kubernetes, troubleshooting, jvm, spring-boot, splunk, sre, observability

Dynatrace Kubernetes Service Triage Skill

Core Competencies

Service entity discovery — resolve service IDs to names, namespaces, technology stacks, and pod inventories
Pod generation analysis — detect rolling deployments, stuck rollouts, and version mismatches across ReplicaSet generations
JVM health assessment — thread count baselines/spikes, heap pressure, GC pause analysis
Davis problem correlation — map problems and events to the service timeline
Splunk SPL generation — produce targeted log queries for human execution in restricted environments
Root cause classification — rolling deployment, startup failure, thread exhaustion, OOM, GC storm, probe failure

When to Apply This Skill

Container crashes or CrashLoopBackOff on a Dynatrace-monitored K8s service
Spring Boot or Java application health degradation
Rolling deployment suspected of causing instability
Need to correlate Dynatrace telemetry with Splunk logs across restricted boundaries
Proactive triage when Davis problems are raised for K8s workloads

Prerequisites

Dynatrace Access

API token with scopes: entities.read, metrics.read, events.read, settings.read
.dtenv file with DT_PLATFORM_URL and DT_API_TOKEN
storage:logs:read is NOT required (Splunk handoff covers log analysis)

Environment Variables

DT_PLATFORM_URL=https://{tenant}.apps.dynatrace.com
DT_API_TOKEN=dt0s16.xxxx  # Platform token (Bearer auth)
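
As a minimal sketch of how these two values might be read and turned into a request header, the Python below parses a .dtenv file. It assumes the file holds plain KEY=VALUE lines with optional inline comments, as in the example above; the function names are hypothetical and not part of any Dynatrace tooling.

```python
from pathlib import Path


def parse_dtenv(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks, comments, and inline '# ...' notes."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "  # Platform token (Bearer auth)".
        value = value.split(" #", 1)[0].strip().strip("\"'")
        values[key.strip()] = value
    return values


def load_dtenv(path: str = ".dtenv") -> dict:
    """Read the file from disk; the default path is an assumption."""
    return parse_dtenv(Path(path).read_text(encoding="utf-8"))


def auth_headers(env: dict) -> dict:
    """Platform tokens are sent with the Bearer scheme, not the Api-Token prefix."""
    return {"Authorization": "Bearer " + env["DT_API_TOKEN"]}
```

The header dictionary contains the secret, so it should be passed straight to the HTTP client and never printed or logged.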

Triage Workflow

Step 1: Service Identity

Resolve the service entity ID to concrete details.

fetch dt.entity.service
| filter id == "{SERVICE_ENTITY_ID}"
| fields entity.name, serviceType, managementZones, tags, softwareTechnologies
| limit 1

Record: service name, type, management zones, technology stack.
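
The queries in this workflow carry {SERVICE_ENTITY_ID} and {TIMERANGE} placeholders, while some of them also use literal DQL braces (for example by:{dt.entity.service}). A generic formatter such as str.format would trip over those braces, so the sketch below substitutes only the two known placeholders. The ID and duration patterns are assumptions about typical values, and render_dql is a hypothetical helper name.

```python
import re

# Assumption: entity IDs look like "SERVICE-" plus 16 uppercase hex characters.
ENTITY_ID = re.compile(r"^[A-Z_]+-[0-9A-F]{16}$")
# Assumption: time ranges are simple DQL durations such as "30m", "2h", or "7d".
TIMERANGE = re.compile(r"^\d+[smhd]$")


def render_dql(template: str, service_entity_id: str, timerange: str = "2h") -> str:
    """Substitute only the two known placeholders; leave DQL's own braces alone."""
    if not ENTITY_ID.match(service_entity_id):
        raise ValueError("unexpected entity ID: %r" % service_entity_id)
    if not TIMERANGE.match(timerange):
        raise ValueError("unexpected time range: %r" % timerange)
    return (
        template
        .replace("{SERVICE_ENTITY_ID}", service_entity_id)
        .replace("{TIMERANGE}", timerange)
    )
```

Validating the inputs before substitution also keeps stray quotes or pipes out of the generated query.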

Step 2: Pod Inventory and Generation Detection

List all process group instances (pods) associated with the service.

fetch dt.entity.process_group_instance
| filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")
| fields entity.name, softwareTechnologies, metadata
| limit 50

Analysis:

Extract ReplicaSet hashes from pod names (deployment-{rs-hash}-{pod-hash})
Group pods by ReplicaSet hash
Compare technology versions across groups
Flag if >1 generation is running (active rolling deployment)
Flag major version jumps (Spring Boot 2→3, 3→4; Tomcat 9→10, 10→11)
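
The grouping step above can be sketched in Python as follows. It assumes the pod name is available as a string and follows the deployment-{rs-hash}-{pod-hash} shape, with a five-character pod suffix; names that do not fit are kept under an "unparsed" key rather than guessed at. Both function names are hypothetical.

```python
import re
from collections import defaultdict

# deployment-{rs-hash}-{pod-hash}; the hash lengths here are an assumption.
POD_NAME = re.compile(
    r"^(?P<deployment>.+)-(?P<rs>[a-z0-9]{5,10})-(?P<pod>[a-z0-9]{5})$"
)


def group_by_replicaset(pod_names):
    """Map each ReplicaSet hash to the pod names that carry it."""
    groups = defaultdict(list)
    for name in pod_names:
        match = POD_NAME.match(name)
        key = match.group("rs") if match else "unparsed"
        groups[key].append(name)
    return dict(groups)


def rollout_in_progress(groups) -> bool:
    """More than one ReplicaSet generation suggests an active rolling deployment."""
    return len([key for key in groups if key != "unparsed"]) > 1
```

Comparing the technology versions reported for each group then shows whether the generations differ only by image or also by framework version.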

Step 3: JVM Thread Analysis

Thread count is usually the highest-signal metric for Spring Boot crashes. A healthy steady-state baseline is typically 40-60 threads, though the exact range varies by service.

timeseries threads = avg(dt.runtime.jvm.threads.count),
  from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
  | filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
  sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)

Interpretation:

Pattern	Meaning
Stable 40-60	Healthy
Gradual rise to 100+	Thread leak or connection pool exhaustion
Spike to 100-150 then null	Thread explosion → JVM killed
Brief rise to 35 then null	Startup failure (never reached steady state)
Repeated spike→null cycles	CrashLoopBackOff with thread-related root cause
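
The interpretation table can be approximated with a small rule set. The sketch below takes one pod's thread series as a list in which None marks a missing data point, and applies the thresholds from the table (100 for a spike, 40 for steady state, 60 as the upper healthy bound). It is a rough heuristic under those assumptions, not a Dynatrace feature, and anything it cannot match is reported as inconclusive.

```python
SPIKE = 100.0       # "100+" in the table
STEADY_MIN = 40.0   # lower bound of the healthy range
STEADY_MAX = 60.0   # upper bound of the healthy range


def split_runs(series):
    """Return (values, ended_in_gap) for each contiguous run of data points."""
    runs, current = [], []
    for value in series:
        if value is None:
            if current:
                runs.append((current, True))
                current = []
        else:
            current.append(value)
    if current:
        runs.append((current, False))
    return runs


def classify_threads(series) -> str:
    runs = split_runs(series)
    if not runs:
        return "no data"
    crashed = [values for values, ended_in_gap in runs if ended_in_gap]
    spike_crashes = [values for values in crashed if max(values) >= SPIKE]
    if len(spike_crashes) >= 2:
        return "crash loop (thread-related)"
    if len(spike_crashes) == 1:
        return "thread explosion"
    if any(max(values) < STEADY_MIN for values in crashed):
        return "startup failure"
    peak = max(max(values) for values, _ in runs)
    if peak >= SPIKE:
        return "thread leak or pool exhaustion"
    if not crashed and peak <= STEADY_MAX:
        return "healthy"
    return "inconclusive"
```

For example, classify_threads([50, 140, None, None]) returns "thread explosion", while a series that never leaves the 40-60 band returns "healthy".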

Step 4: Heap and GC Metrics

timeseries heap = avg(dt.runtime.jvm.memory.pool.used),
  from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
  | filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
  sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)

timeseries gc = avg(dt.runtime.jvm.gc.pause_time),
  from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
  | filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
  sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)

Step 5: Davis Problem and Event Correlation

fetch events, from:now()-{TIMERANGE}
| filter event.kind == "DAVIS_PROBLEM" OR event.kind == "DAVIS_EVENT"
| filter contains(toString(affected_entity_ids), "{SERVICE_ENTITY_ID}")
| filter event.name != "Monitoring not available"
| fields timestamp, event.kind, event.name, event.status, event.category, display_id
| sort timestamp desc
| limit 50

Step 6: Service Error Rate

timeseries errors = avg(dt.service.request.failure_rate),
  from:now()-{TIMERANGE}, by:{dt.entity.service}
| filter dt.entity.service == "{SERVICE_ENTITY_ID}"

Step 7: Deployment Events

fetch events, from:now()-{TIMERANGE}
| filter event.kind == "CUSTOM_DEPLOYMENT"
| filter contains(toString(affected_entity_ids), "{SERVICE_ENTITY_ID}")
| fields timestamp, event.name, deployment.name, deployment.version
| sort timestamp desc
| limit 20

Splunk SPL Query Templates

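When log access is restricted to Splunk (no Dynatrace log ingestion), generate these SPL queries for human execution. Replace {NAMESPACE} and {CONTAINER} with the values discovered in Steps 1-2, and {INDEX} with the Splunk index that holds the cluster's container logs. Field names such as namespace, container_name, and pod_name depend on how the logs were ingested, so adjust them to match the index.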

Application Errors

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("ERROR" OR "FATAL" OR "OOMKilled" OR "CrashLoopBackOff"
   OR "readiness probe failed" OR "liveness probe failed"
   OR "ApplicationContextException" OR "BeanCreationException"
   OR "OutOfMemoryError")
| sort -_time
| head 200
| table _time, pod_name, log_level, message

Pod Lifecycle Events

index={INDEX} namespace="{NAMESPACE}"
  ("Killing" OR "Back-off restarting" OR "Started container"
   OR "Pulled image" OR "Liveness probe failed" OR "Readiness probe failed")
| sort -_time
| head 100
| table _time, pod_name, reason, message

Spring Boot Startup Failures

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("Failed to start" OR "Application run failed"
   OR "BeanDefinitionStoreException" OR "NoSuchBeanDefinitionException"
   OR "UnsatisfiedDependencyException" OR "ClassNotFoundException"
   OR "NoClassDefFoundError" OR "NoSuchMethodError")
| sort -_time
| head 50
| table _time, pod_name, message

Thread and Connection Issues

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("deadlock" OR "thread dump" OR "blocked" OR "WAITING"
   OR "pool-" OR "http-nio-" OR "RejectedExecutionException"
   OR "Connection pool" OR "timeout" OR "refused")
| sort -_time
| head 100
| table _time, pod_name, message

Jakarta/Javax Migration Errors (Framework Upgrade)

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("javax.servlet" OR "javax.persistence" OR "javax.validation"
   OR "jakarta.servlet" OR "ClassCastException"
   OR "IncompatibleClassChangeError" OR "LinkageError")
| sort -_time
| head 50
| table _time, pod_name, message
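
The templates above can be filled in with a small helper before handing them to a person. The sketch below validates the namespace and container name against the Kubernetes DNS-label rule so that no quote or pipe can leak into the query; the index-name pattern is an assumption, and render_spl is a hypothetical name. It only produces text and never contacts Splunk.

```python
import re

# Kubernetes namespaces and container names are DNS labels: lowercase
# alphanumerics and "-", starting and ending with an alphanumeric.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")
# Assumption: index names use only letters, digits, "_" and "-".
INDEX_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def render_spl(template: str, index: str, namespace: str, container=None) -> str:
    """Fill {INDEX}, {NAMESPACE}, and (when present) {CONTAINER} in a template."""
    if not INDEX_NAME.match(index):
        raise ValueError("unexpected index name: %r" % index)
    if not DNS_LABEL.match(namespace):
        raise ValueError("unexpected namespace: %r" % namespace)
    query = template.replace("{INDEX}", index).replace("{NAMESPACE}", namespace)
    if "{CONTAINER}" in query:
        if container is None or not DNS_LABEL.match(container):
            raise ValueError("template needs a valid container name")
        query = query.replace("{CONTAINER}", container)
    return query
```

The pod lifecycle template has no {CONTAINER} placeholder, so it can be rendered with the container argument left out.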

Root Cause Classification

Classification	Key Signals	Confidence
Rolling deployment stuck	Multiple ReplicaSet hashes, version mismatch, deployment events	High if versions differ
Startup failure	Short-lived process instances, sparse JVM metrics, bean exceptions	High if startup pattern matches
Thread exhaustion	Thread spike to 100+, then null, then restart	High if spike→null pattern repeats
OOM kill	Heap at limit, exit code 137, GC pause spike before crash	High if heap data available
GC storm	GC pause >500ms sustained, CPU spike, response time degradation	Medium (need heap data too)
Probe misconfiguration	Restart events with no JVM anomaly, healthy service metrics	Medium (need negative evidence)
External dependency	Response time spike correlates with upstream service issues	Low (need cross-service data)
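
One way to apply the table consistently is to encode each row as a rule over the signals gathered in Steps 2-7. The sketch below is a simplification under stated assumptions: the Signals fields are hypothetical names, each rule fires only when the condition in the Confidence column is met, and the 500 ms and exit code 137 values come from the table itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

RANK = {"High": 0, "Medium": 1, "Low": 2}


@dataclass
class Signals:
    replicaset_generations: int = 1
    versions_differ: bool = False
    startup_pattern: bool = False          # short-lived instances, bean exceptions
    repeated_spike_then_gap: bool = False  # thread spike to 100+, null, restart
    heap_at_limit: bool = False
    exit_code: Optional[int] = None
    sustained_gc_pause_ms: float = 0.0
    restarts: int = 0
    jvm_anomaly: bool = False
    upstream_latency_correlates: bool = False


def classify(s: Signals) -> List[Tuple[str, str]]:
    """Return (classification, confidence) pairs, strongest confidence first."""
    found = []
    if s.replicaset_generations > 1 and s.versions_differ:
        found.append(("Rolling deployment stuck", "High"))
    if s.startup_pattern:
        found.append(("Startup failure", "High"))
    if s.repeated_spike_then_gap:
        found.append(("Thread exhaustion", "High"))
    if s.heap_at_limit and s.exit_code == 137:
        found.append(("OOM kill", "High"))
    if s.sustained_gc_pause_ms > 500:
        found.append(("GC storm", "Medium"))
    if s.restarts > 0 and not s.jvm_anomaly:
        found.append(("Probe misconfiguration", "Medium"))
    if s.upstream_latency_correlates:
        found.append(("External dependency", "Low"))
    return sorted(found, key=lambda item: RANK[item[1]])
```

Several rows can match at once; the ordered list is a starting point for the report, with the first entry as the candidate primary cause and the rest as contributing factors.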

Spring Boot Major Upgrade Checklist

When pod generation analysis reveals a major Spring Boot version change, check these breaking changes:

Spring Boot 3.x → 4.x

Jakarta EE 11 baseline (namespace stays jakarta.*; API levels such as Servlet 6.1 are raised)
Removed deprecated auto-configuration classes
Actuator security defaults changed (endpoints may require authentication)
Tomcat 11 required (not backward compatible with Tomcat 10)
spring.config.import behavior changes
Observation API changes (Micrometer)

Spring Boot 2.x → 3.x

javax.* → jakarta.* namespace migration
Java 17 minimum requirement
Spring Security 6.x (new SecurityFilterChain API)
Removed WebSecurityConfigurerAdapter
Actuator endpoint exposure defaults changed

Tomcat 10.x → 11.x

Jakarta EE 11 Servlet 6.1 API
Removed legacy AJP connector defaults
HTTP/2 configuration changes
Default thread pool sizing changes
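
To decide which of the checklists above applies, the technology versions of two pod generations can be compared by major version. The sketch below assumes each generation is summarised as a mapping from technology name to version string; the names and versions in the usage note are illustrative only.

```python
import re


def major(version: str) -> int:
    """Extract the leading major number from strings such as '3.3.4' or 'v10.1'."""
    match = re.match(r"\D*(\d+)", version)
    if not match:
        raise ValueError("no version number in %r" % version)
    return int(match.group(1))


def major_jumps(old: dict, new: dict) -> list:
    """List technologies whose major version differs between two generations."""
    jumps = []
    for tech in sorted(old.keys() & new.keys()):
        before, after = major(old[tech]), major(new[tech])
        if before != after:
            jumps.append(f"{tech} {before}→{after}")
    return jumps
```

Called with an old generation of {"Spring Boot": "3.3.4", "Apache Tomcat": "10.1.30"} and a new one of {"Spring Boot": "4.0.0", "Apache Tomcat": "11.0.1"}, it reports both the Tomcat and the Spring Boot jump, which points at the 3.x → 4.x and Tomcat 10.x → 11.x lists.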

Output Format

Present findings as a structured triage report:

Service Identity — name, namespace, framework versions
Pod Generation Status — table of generations with version comparison
Crash Classification — type and confidence level
Evidence Timeline — chronological table of events and metric anomalies
JVM Metrics Summary — per-pod thread/heap/GC table
Root Cause — primary cause with evidence citations, contributing factors
Recommended Actions — prioritized by immediate / short-term / prevention
Splunk Queries — generated SPL for human execution with descriptions
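
The Evidence Timeline item amounts to merging rows from several queries and sorting them by time. A minimal sketch, assuming each row is a dictionary with timestamp (ISO 8601, all in the same time zone), source, and detail keys; these key names are hypothetical, and the output uses the same tab-separated layout as the tables in this document.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_timeline(*sources):
    """Merge rows from any number of sources into one chronological list."""
    rows = [row for source in sources for row in source]
    return sorted(rows, key=lambda row: parse_ts(row["timestamp"]))


def render_timeline(rows) -> str:
    """Render the merged rows as a tab-separated table."""
    lines = ["Time\tSource\tDetail"]
    for row in rows:
        lines.append("\t".join((row["timestamp"], row["source"], row["detail"])))
    return "\n".join(lines)
```

Placing deployment events, metric anomalies, and Davis problems in one ordered table makes it easier to see which came first.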

Security Best Practices

Never expose DT_API_TOKEN in output
Never execute Splunk queries programmatically in restricted environments
Never recommend destructive kubectl operations without explicit safety warnings
Always use Bearer auth (not Api-Token prefix) for Platform tokens
Read-only by default; flag any write operations clearly
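
The first rule can be backed by a redaction pass over any text before it is shown. The sketch below masks strings that look like Dynatrace tokens; the pattern (a "dt0" prefix, one letter, two digits, a dot, then the token body) is an assumption based on the dt0s16 example in the Prerequisites and may need widening for other token types.

```python
import re

# Assumption: tokens look like "dt0s16.<body>", where the body may contain dots.
TOKEN = re.compile(r"\b(dt0[a-z]\d{2})\.[A-Za-z0-9._-]+")


def redact(text: str) -> str:
    """Replace the secret part of anything token-shaped, keeping the type prefix."""
    return TOKEN.sub(r"\1.[REDACTED]", text)
```

Keeping the prefix shows which kind of token was present without revealing its value.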

Resources

Related Assets

Dynatrace Kubernetes Service Triage (active; owner: epic-platform-sre)
Systematic triage of a Dynatrace-monitored Kubernetes service using DQL queries for entity discovery, JVM health, thread analysis, pod generation comparison, and Davis problem correlation. Produces structured root cause analysis with Splunk query handoffs for restricted log environments.

Spring Boot Container Crash Triage (active; owner: epic-platform-sre)
Diagnose Spring Boot container crashes in Kubernetes by correlating Dynatrace JVM telemetry, pod lifecycle events, and deployment state. Covers rolling deployment failures, OOM kills, thread exhaustion, startup failures, and major framework upgrades.

Dynatrace Operations Agent (active; owner: platform-infrastructure)
Autonomous Dynatrace Platform agent that executes DQL queries, reads settings, and runs diagnostic workflows against any Grail-based tenant. Discovers credentials automatically (env var, .dtenv file, or prompt), executes live API calls, and presents formatted results. Use for entity inventory, metrics analysis, problem triage, log review, and guided troubleshooting.

Kubernetes Pod Debug Assistant (active; owner: epic-platform-sre)
Diagnose failing or unhealthy Kubernetes pods using cluster state, events, and logs. Produces structured root cause analysis with safe remediation recommendations.

Kubernetes Operations Assistant (active; owner: epic-platform-sre)
Assist with Kubernetes cluster operations, debugging, and troubleshooting using read-only diagnostics and GitOps-safe recommendations.

dynatrace-expert (active; owner: platform-infrastructure)
Dynatrace Platform operations expertise: DQL queries, entity inventory, metrics analysis, problem triage, dashboard management, and Settings API for Grail-based tenants.