
dynatrace-k8s-triage

Systematic Kubernetes service triage using Dynatrace DQL — entity discovery, JVM health, thread analysis, pod generation comparison, Davis problem correlation, and Splunk SPL query generation for restricted log environments.

Status: active
IDE: codex
Version: 1.0.0
Owner: epic-platform-sre
Tags: dynatrace, kubernetes, troubleshooting, jvm, spring-boot, splunk, sre, observability

Dynatrace Kubernetes Service Triage Skill

Core Competencies

  • Service entity discovery — resolve service IDs to names, namespaces, technology stacks, and pod inventories
  • Pod generation analysis — detect rolling deployments, stuck rollouts, and version mismatches across ReplicaSet generations
  • JVM health assessment — thread count baselines/spikes, heap pressure, GC pause analysis
  • Davis problem correlation — map problems and events to the service timeline
  • Splunk SPL generation — produce targeted log queries for human execution in restricted environments
  • Root cause classification — rolling deployment, startup failure, thread exhaustion, OOM, GC storm, probe failure

When to Apply This Skill

  • Container crashes or CrashLoopBackOff on a Dynatrace-monitored K8s service
  • Spring Boot or Java application health degradation
  • Rolling deployment suspected of causing instability
  • Need to correlate Dynatrace telemetry with Splunk logs across restricted boundaries
  • Proactive triage when Davis problems are raised for K8s workloads

Prerequisites

Dynatrace Access

  • Platform token with scopes: entities.read, metrics.read, events.read, settings.read
  • .dtenv file with DT_PLATFORM_URL and DT_API_TOKEN
  • storage:logs:read is NOT required (Splunk handoff covers log analysis)

Environment Variables

DT_PLATFORM_URL=https://{tenant}.apps.dynatrace.com
DT_API_TOKEN=dt0s16.xxxx  # Platform token (Bearer auth)
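A minimal sketch of reading these two settings and building the request header. It assumes the .dtenv file uses plain KEY=VALUE lines with optional # comments, as in the snippet above; the function names parse_dtenv and auth_headers and the example tenant URL are illustrative, not part of the skill.

```python
from typing import Dict


def parse_dtenv(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comment lines, and inline comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "  # Platform token (Bearer auth)".
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        values[key.strip()] = value
    return values


def auth_headers(env: Dict[str, str]) -> Dict[str, str]:
    """Build headers with Bearer auth, which the skill requires for Platform tokens."""
    missing = [k for k in ("DT_PLATFORM_URL", "DT_API_TOKEN") if not env.get(k)]
    if missing:
        raise ValueError(f"missing settings: {', '.join(missing)}")
    return {"Authorization": f"Bearer {env['DT_API_TOKEN']}"}


# Placeholder values only; never commit or print a real token.
sample = """
DT_PLATFORM_URL=https://example.apps.dynatrace.com
DT_API_TOKEN=dt0s16.xxxx  # Platform token (Bearer auth)
"""
env = parse_dtenv(sample)
headers = auth_headers(env)
```

In practice the text would come from the .dtenv file on disk, and the resulting headers would be passed to whichever HTTP client issues the DQL requests.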

Triage Workflow

Step 1: Service Identity

Resolve the service entity ID to concrete details.

fetch dt.entity.service
| filter id == "{SERVICE_ENTITY_ID}"
| fields entity.name, serviceType, managementZones, tags, softwareTechnologies
| limit 1

Record: service name, type, management zones, technology stack.

Step 2: Pod Inventory and Generation Detection

List all process group instances (pods) associated with the service.

fetch dt.entity.process_group_instance
| filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")
| fields entity.name, softwareTechnologies, metadata
| limit 50

Analysis:

  1. Extract ReplicaSet hashes from pod names (deployment-{rs-hash}-{pod-hash})
  2. Group pods by ReplicaSet hash
  3. Compare technology versions across groups
  4. Flag if >1 generation is running (active rolling deployment)
  5. Flag major version jumps (Spring Boot 2→3, 3→4; Tomcat 9→10, 10→11)
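The grouping steps above can be sketched in Python. This is an illustration under stated assumptions: pod names follow deployment-{rs-hash}-{pod-hash}, the ReplicaSet hash is 5 to 10 lowercase alphanumeric characters, the pod suffix is 5, and each pod carries a single version string. The pod names, versions, and function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Assumed pod naming: <deployment>-<rs-hash>-<pod-hash>
POD_NAME = re.compile(r"^(?P<deployment>.+)-(?P<rs>[a-z0-9]{5,10})-(?P<pod>[a-z0-9]{5})$")


def split_pod_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (deployment, replicaset_hash), or None if the name does not fit the pattern."""
    m = POD_NAME.match(name)
    return (m.group("deployment"), m.group("rs")) if m else None


def summarize_generations(pods: List[Tuple[str, str]]) -> Dict[str, object]:
    """Group (pod_name, version) pairs by ReplicaSet hash and flag mixed generations."""
    generations = defaultdict(lambda: {"pods": [], "versions": set()})
    for name, version in pods:
        parsed = split_pod_name(name)
        if parsed is None:
            continue  # not a Deployment-managed pod name; skip it
        _, rs_hash = parsed
        generations[rs_hash]["pods"].append(name)
        generations[rs_hash]["versions"].add(version)
    majors = {v.split(".")[0] for g in generations.values() for v in g["versions"]}
    return {
        "generations": dict(generations),
        "rolling_deployment": len(generations) > 1,   # more than one generation running
        "major_version_jump": len(majors) > 1,        # e.g. 3.x and 4.x side by side
    }


# Hypothetical inventory: two pods on the old generation, one on the new.
report = summarize_generations([
    ("orders-api-7d9f8c6b5d-x2k4q", "3.3.4"),
    ("orders-api-7d9f8c6b5d-m8n2p", "3.3.4"),
    ("orders-api-5c7b9d4f6-q9w3e", "4.0.0"),
])
```

Names that do not match the pattern are skipped rather than guessed at, so StatefulSet or bare pods do not produce false generations.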

Step 3: JVM Thread Analysis

Thread count is the highest-signal metric for Spring Boot crashes. Baseline is typically 30-60 threads.

timeseries threads = avg(dt.runtime.jvm.threads.count),
  from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
  | filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
  sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)

Interpretation:

Pattern | Meaning
Stable 40-60 | Healthy
Gradual rise to 100+ | Thread leak or connection pool exhaustion
Spike to 100-150 then null | Thread explosion → JVM killed
Brief rise to 35 then null | Startup failure (never reached steady state)
Repeated spike→null cycles | CrashLoopBackOff with thread-related root cause
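The interpretation patterns can be approximated with a small classifier. This is a rough sketch, not the skill's actual logic: it assumes the per-pod series arrives as a list of samples with None for missing data, and the exact cutoffs (100 for a spike, 40 for steady state) are taken loosely from the patterns above. All names are hypothetical.

```python
from typing import List, Optional


def runs_ending_in_gap(series: List[Optional[float]]) -> List[List[float]]:
    """Return runs of consecutive samples that are followed by a null sample."""
    runs, current = [], []
    for value in series:
        if value is None:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(value)
    return runs  # a trailing run that still has data at the end is excluded


def classify_threads(series: List[Optional[float]]) -> str:
    """Map a thread-count series onto the interpretation patterns."""
    values = [v for v in series if v is not None]
    if not values:
        return "no data"
    died = runs_ending_in_gap(series)
    spikes = [run for run in died if max(run) >= 100]
    if len(spikes) >= 2:
        return "crash loop"           # repeated spike-then-null cycles
    if spikes:
        return "thread explosion"     # spike to 100+ and then null
    if died and all(max(run) < 40 for run in died):
        return "startup failure"      # never reached steady state
    if max(values) >= 100:
        return "thread leak"          # rise to 100+ without the process dying
    if all(40 <= v <= 60 for v in values):
        return "healthy"
    return "unclassified"
```

For example, classify_threads([20, 35, None, None]) falls into the startup-failure branch, while a series that stays between 40 and 60 is reported as healthy. Anything that fits no pattern is returned as unclassified rather than forced into a category.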

Step 4: Heap and GC Metrics

// Heap usage per pod
timeseries heap = avg(dt.runtime.jvm.memory.pool.used),
  from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
  | filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
  sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)

// GC pause time per pod
timeseries gc = avg(dt.runtime.jvm.gc.pause_time),
  from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
  | filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
  sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)

Step 5: Davis Problem and Event Correlation

fetch events, from:now()-{TIMERANGE}
| filter event.kind == "DAVIS_PROBLEM" or event.kind == "DAVIS_EVENT"
| filter contains(toString(affected_entity_ids), "{SERVICE_ENTITY_ID}")
| filter event.name != "Monitoring not available"
| fields timestamp, event.kind, event.name, event.status, event.category, display_id
| sort timestamp desc
| limit 50

Step 6: Service Error Rate

timeseries errors = avg(dt.service.request.failure_rate),
  from:now()-{TIMERANGE}, by:{dt.entity.service}
| filter dt.entity.service == "{SERVICE_ENTITY_ID}"

Step 7: Deployment Events

fetch events, from:now()-{TIMERANGE}
| filter event.kind == "CUSTOM_DEPLOYMENT"
| filter contains(toString(affected_entity_ids), "{SERVICE_ENTITY_ID}")
| fields timestamp, event.name, deployment.name, deployment.version
| sort timestamp desc
| limit 20

Splunk SPL Query Templates

When log access is restricted to Splunk (no Dynatrace log ingestion), generate these SPL queries for human execution. Replace {INDEX}, {NAMESPACE}, and {CONTAINER} with actual values discovered in Steps 1-2.
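One way to fill the placeholders before handing a query to a human operator is sketched below. It only produces text and never runs the query. The render_spl helper, the allowed-character rule, and the example index, namespace, and container values are assumptions for illustration.

```python
import re
from typing import Dict

PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")
# Conservative allow-list so a value cannot close a quote or add SPL commands.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.-]+$")


def render_spl(template: str, values: Dict[str, str]) -> str:
    """Replace {NAME} placeholders, rejecting missing or unsafe values."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for {{{name}}}")
        value = values[name]
        if not SAFE_VALUE.match(value):
            raise ValueError(f"unsafe value for {{{name}}}: {value!r}")
        return value
    return PLACEHOLDER.sub(substitute, template)


# Shortened form of a startup-failure template, for illustration.
TEMPLATE = (
    'index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"\n'
    '  ("Failed to start" OR "Application run failed")\n'
    '| sort -_time\n'
    '| head 50\n'
    '| table _time, pod_name, message'
)

# Hypothetical values as they might be discovered in Steps 1-2.
rendered = render_spl(
    TEMPLATE,
    {"INDEX": "k8s_prod", "NAMESPACE": "payments", "CONTAINER": "orders-api"},
)
```

Rejecting values outside the allow-list is a design choice: a namespace or container name containing quotes or pipes is more likely a discovery error than a legitimate value.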

Application Errors

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("ERROR" OR "FATAL" OR "OOMKilled" OR "CrashLoopBackOff"
   OR "readiness probe failed" OR "liveness probe failed"
   OR "ApplicationContextException" OR "BeanCreationException"
   OR "OutOfMemoryError")
| sort -_time
| head 200
| table _time, pod_name, log_level, message

Pod Lifecycle Events

index={INDEX} namespace="{NAMESPACE}"
  ("Killing" OR "Back-off restarting" OR "Started container"
   OR "Pulled image" OR "Liveness probe failed" OR "Readiness probe failed")
| sort -_time
| head 100
| table _time, pod_name, reason, message

Spring Boot Startup Failures

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("Failed to start" OR "Application run failed"
   OR "BeanDefinitionStoreException" OR "NoSuchBeanDefinitionException"
   OR "UnsatisfiedDependencyException" OR "ClassNotFoundException"
   OR "NoClassDefFoundError" OR "NoSuchMethodError")
| sort -_time
| head 50
| table _time, pod_name, message

Thread and Connection Issues

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("deadlock" OR "thread dump" OR "blocked" OR "WAITING"
   OR "pool-" OR "http-nio-" OR "RejectedExecutionException"
   OR "Connection pool" OR "timeout" OR "refused")
| sort -_time
| head 100
| table _time, pod_name, message

Jakarta/Javax Migration Errors (Framework Upgrade)

index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
  ("javax.servlet" OR "javax.persistence" OR "javax.validation"
   OR "jakarta.servlet" OR "ClassCastException"
   OR "IncompatibleClassChangeError" OR "LinkageError")
| sort -_time
| head 50
| table _time, pod_name, message

Root Cause Classification

Classification | Key Signals | Confidence
Rolling deployment stuck | Multiple ReplicaSet hashes, version mismatch, deployment events | High if versions differ
Startup failure | Short-lived process instances, sparse JVM metrics, bean exceptions | High if startup pattern matches
Thread exhaustion | Thread spike to 100+, then null, then restart | High if spike→null pattern repeats
OOM kill | Heap at limit, exit code 137, GC pause spike before crash | High if heap data available
GC storm | GC pause >500ms sustained, CPU spike, response time degradation | Medium (need heap data too)
Probe misconfiguration | Restart events with no JVM anomaly, healthy service metrics | Medium (need negative evidence)
External dependency | Response time spike correlates with upstream service issues | Low (need cross-service data)
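Part of this classification can be expressed as simple rules. The sketch below covers only five of the rows (startup failure and external dependency need evidence it does not model). The Signals fields are hypothetical names for evidence gathered in the earlier steps, and the "Unconfirmed" label is an assumption for cases where the listed high-confidence condition is not met.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Signals:
    """Hypothetical container for evidence gathered during triage."""
    replicaset_hashes: int = 1          # distinct ReplicaSet hashes observed
    versions_differ: bool = False       # technology versions differ across generations
    spike_null_cycles: int = 0          # thread spikes to 100+ followed by null
    heap_at_limit: bool = False
    exit_code: Optional[int] = None     # 137 = 128 + SIGKILL
    sustained_gc_pause_ms: float = 0.0
    restarts: int = 0


def classify(s: Signals) -> List[Tuple[str, str]]:
    """Return (classification, confidence) pairs for every rule that matches."""
    findings: List[Tuple[str, str]] = []
    if s.replicaset_hashes > 1:
        findings.append(("Rolling deployment stuck",
                         "High" if s.versions_differ else "Unconfirmed"))
    if s.spike_null_cycles >= 1:
        findings.append(("Thread exhaustion",
                         "High" if s.spike_null_cycles >= 2 else "Unconfirmed"))
    if s.heap_at_limit and s.exit_code == 137:
        findings.append(("OOM kill", "High"))
    if s.sustained_gc_pause_ms > 500:
        findings.append(("GC storm", "Medium"))
    # Probe misconfiguration relies on negative evidence: restarts without a JVM anomaly.
    jvm_anomaly = (s.spike_null_cycles > 0 or s.heap_at_limit
                   or s.sustained_gc_pause_ms > 500)
    if s.restarts > 0 and not jvm_anomaly:
        findings.append(("Probe misconfiguration", "Medium"))
    return findings
```

The function returns every matching row rather than a single winner, because the triage report distinguishes a primary cause from contributing factors.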

Spring Boot Major Upgrade Checklist

When pod generation analysis reveals a major Spring Boot version change, check these breaking changes:

Spring Boot 3.x → 4.x

  • Jakarta EE 11 namespace (some packages moved again)
  • Removed deprecated auto-configuration classes
  • Actuator security defaults changed (endpoints may require authentication)
  • Tomcat 11 required (not backward compatible with Tomcat 10)
  • spring.config.import behavior changes
  • Observation API changes (Micrometer)

Spring Boot 2.x → 3.x

  • javax.* → jakarta.* namespace migration
  • Java 17 minimum requirement
  • Spring Security 6.x (new SecurityFilterChain API)
  • Removed WebSecurityConfigurerAdapter
  • Actuator endpoint exposure defaults changed

Tomcat 10.x → 11.x

  • Jakarta EE 11 Servlet 6.1 API
  • Removed legacy AJP connector defaults
  • HTTP/2 configuration changes
  • Default thread pool sizing changes

Output Format

Present findings as a structured triage report:

  1. Service Identity — name, namespace, framework versions
  2. Pod Generation Status — table of generations with version comparison
  3. Crash Classification — type and confidence level
  4. Evidence Timeline — chronological table of events and metric anomalies
  5. JVM Metrics Summary — per-pod thread/heap/GC table
  6. Root Cause — primary cause with evidence citations, contributing factors
  7. Recommended Actions — prioritized by immediate / short-term / prevention
  8. Splunk Queries — generated SPL for human execution with descriptions

Security Best Practices

  • Never expose DT_API_TOKEN in output
  • Never execute Splunk queries programmatically in restricted environments
  • Never recommend destructive kubectl operations without explicit safety warnings
  • Always use Bearer auth (not Api-Token prefix) for Platform tokens
  • Read-only by default; flag any write operations clearly
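As a guard for the first rule, report text can be passed through a redaction step before it is shown. This is a minimal sketch that assumes tokens look like a dt0 prefix (such as dt0s16) followed by one or two dot-separated alphanumeric segments. The pattern and the redact function are illustrative and may not match every token format.

```python
import re

# Assumed token shape: dt0<letter><2 digits>.<segment>[.<segment>]
TOKEN = re.compile(r"\b(dt0[a-z][0-9]{2})\.[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)?")


def redact(text: str) -> str:
    """Mask anything that looks like a Dynatrace token, keeping only its prefix."""
    return TOKEN.sub(r"\1.***REDACTED***", text)


# Fake token value used purely for demonstration.
masked = redact("Authorization: Bearer dt0s16.ABCD1234.SECRET5678")
```

Keeping the prefix lets a reader see which kind of token was involved without revealing any of its secret material.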
