dynatrace-k8s-triage
Systematic Kubernetes service triage using Dynatrace DQL — entity discovery, JVM health, thread analysis, pod generation comparison, Davis problem correlation, and Splunk SPL query generation for restricted log environments.
Dynatrace Kubernetes Service Triage Skill
Core Competencies
- Service entity discovery — resolve service IDs to names, namespaces, technology stacks, and pod inventories
- Pod generation analysis — detect rolling deployments, stuck rollouts, and version mismatches across ReplicaSet generations
- JVM health assessment — thread count baselines/spikes, heap pressure, GC pause analysis
- Davis problem correlation — map problems and events to the service timeline
- Splunk SPL generation — produce targeted log queries for human execution in restricted environments
- Root cause classification — rolling deployment, startup failure, thread exhaustion, OOM, GC storm, probe failure
When to Apply This Skill
- Container crashes or CrashLoopBackOff on a Dynatrace-monitored K8s service
- Spring Boot or Java application health degradation
- Rolling deployment suspected of causing instability
- Need to correlate Dynatrace telemetry with Splunk logs across restricted boundaries
- Proactive triage when Davis problems are raised for K8s workloads
Prerequisites
Dynatrace Access
- API token with scopes: entities.read, metrics.read, events.read, settings.read
- .dtenv file with DT_PLATFORM_URL and DT_API_TOKEN
- storage:logs:read is NOT required (Splunk handoff covers log analysis)
Environment Variables
DT_PLATFORM_URL=https://{tenant}.apps.dynatrace.com
DT_API_TOKEN=dt0s16.xxxx # Platform token (Bearer auth)
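A minimal sketch of how the two variables above might be loaded, assuming the .dtenv file holds plain KEY=VALUE lines with optional trailing # comments, as in the example. The helper names load_dtenv and auth_headers are hypothetical, and letting environment variables override the file is an assumption rather than documented behavior.

```python
import os
from pathlib import Path


def load_dtenv(path=".dtenv", environ=None):
    """Read KEY=VALUE pairs from a .dtenv file.

    Assumption: values never contain '#', so everything after '#' is a comment.
    Assumption: environment variables take precedence over the file.
    """
    environ = os.environ if environ is None else environ
    values = {}
    file = Path(path)
    if file.exists():
        for raw in file.read_text().splitlines():
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            if "=" in line:
                key, value = line.split("=", 1)
                values[key.strip()] = value.strip()
    for key in ("DT_PLATFORM_URL", "DT_API_TOKEN"):
        if key in environ:
            values[key] = environ[key]
    return values


def auth_headers(token):
    """Build the Bearer header used for Platform tokens."""
    return {"Authorization": f"Bearer {token}"}
```

The returned token should only ever be passed to auth_headers, never printed or logged.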
Triage Workflow
Step 1: Service Identity
Resolve the service entity ID to concrete details.
fetch dt.entity.service
| filter id == "{SERVICE_ENTITY_ID}"
| fields entity.name, serviceType, managementZones, tags, softwareTechnologies
| limit 1
Record: service name, type, management zones, technology stack.
Step 2: Pod Inventory and Generation Detection
List all process group instances (pods) associated with the service.
fetch dt.entity.process_group_instance
| filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")
| fields entity.name, softwareTechnologies, metadata
| limit 50
Analysis:
- Extract ReplicaSet hashes from pod names (deployment-{rs-hash}-{pod-hash})
- Group pods by ReplicaSet hash
- Compare technology versions across groups
- Flag if >1 generation is running (active rolling deployment)
- Flag major version jumps (Spring Boot 2→3, 3→4; Tomcat 9→10, 10→11)
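The analysis steps above can be sketched in Python. This is an illustration under assumptions, not the skill's actual implementation: it assumes entity names follow the deployment-{rs-hash}-{pod-hash} shape with a 5 to 10 character ReplicaSet hash and a 5 character pod hash, and that a version string has already been extracted per pod. The function names and the (name, version) input shape are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pod name shape: {deployment}-{rs-hash}-{pod-hash}
POD_NAME = re.compile(
    r"^(?P<deployment>.+)-(?P<rs>[a-z0-9]{5,10})-(?P<pod>[a-z0-9]{5})$"
)


def summarize_generations(pods):
    """Group (pod_name, version) pairs by ReplicaSet hash.

    Returns {rs_hash: set_of_versions}; names that do not match are skipped.
    """
    generations = defaultdict(set)
    for name, version in pods:
        match = POD_NAME.match(name)
        if match:
            generations[match["rs"]].add(version)
    return dict(generations)


def rollout_in_progress(generations):
    """More than one ReplicaSet generation is running."""
    return len(generations) > 1


def has_major_jump(generations):
    """Pods report more than one major version (for example 3.x and 4.x)."""
    majors = {
        int(version.split(".")[0])
        for versions in generations.values()
        for version in versions
    }
    return len(majors) > 1
```

For example, two pods named orders-api-7d9f8c6b5d-... on 3.2.1 and one named orders-api-5c8d7f9b44-... on 4.0.0 would yield two generations, with both flags set.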
Step 3: JVM Thread Analysis
Thread count is the highest-signal metric for Spring Boot crashes. Baseline is typically 30-60 threads.
timeseries threads = avg(dt.runtime.jvm.threads.count),
from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
| filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)
Interpretation:
| Pattern | Meaning |
|---|---|
| Stable 30-60 | Healthy |
| Gradual rise to 100+ | Thread leak or connection pool exhaustion |
| Spike to 100-150 then null | Thread explosion → JVM killed |
| Brief rise to 35 then null | Startup failure (never reached steady state) |
| Repeated spike→null cycles | CrashLoopBackOff with thread-related root cause |
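The interpretation table can be approximated with a small rule-based classifier. The sketch below assumes each pod's series arrives as a list of samples where None marks a missing bucket. The thresholds 100 and 60 come from the table; STARTUP_MAX_SAMPLES, the function names, and the returned labels are illustrative assumptions. It checks only the peak value, so it does not verify that a rise was gradual.

```python
SPIKE_MIN = 100          # table: "Spike to 100-150" and "rise to 100+"
HEALTHY_MAX = 60         # table: upper edge of the healthy band
STARTUP_MAX_SAMPLES = 5  # assumption: how many samples count as a "brief" run


def split_runs(series):
    """Split a series into runs of consecutive non-null samples.

    Returns a list of (run, ended_by_null) pairs.
    """
    runs, current = [], []
    for value in series:
        if value is None:
            if current:
                runs.append((current, True))
                current = []
        else:
            current.append(value)
    if current:
        runs.append((current, False))
    return runs


def classify_threads(series):
    """Map one pod's thread-count series to a row of the table."""
    runs = split_runs(series)
    if not runs:
        return "no data"
    crashed = [run for run, ended in runs if ended]
    spikes = [run for run in crashed if max(run) >= SPIKE_MIN]
    if len(spikes) >= 2:
        return "crash loop (thread-related)"
    if len(spikes) == 1:
        return "thread explosion"
    if any(len(run) <= STARTUP_MAX_SAMPLES for run in crashed):
        return "startup failure"
    if crashed:
        return "unclassified"  # data stopped without a thread signal
    peak = max(max(run) for run, _ in runs)
    if peak >= SPIKE_MIN:
        return "thread leak or pool exhaustion"
    if peak <= HEALTHY_MAX:
        return "healthy"
    return "unclassified"
```

A result of "unclassified" means the thread series alone is not conclusive and the heap, GC, and event steps should carry more weight.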
Step 4: Heap and GC Metrics
Heap usage per process group instance (pod):
timeseries heap = avg(dt.runtime.jvm.memory.pool.used),
from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
| filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)
GC pause time per process group instance (pod):
timeseries gc = avg(dt.runtime.jvm.gc.pause_time),
from:now()-{TIMERANGE}, by:{dt.entity.process_group_instance}
| lookup [fetch dt.entity.process_group_instance
| filter contains(toString(belongsTo), "{SERVICE_ENTITY_ID}")],
sourceField:dt.entity.process_group_instance, lookupField:id
| filter isNotNull(lookup.id)
Step 5: Davis Problem and Event Correlation
fetch events, from:now()-{TIMERANGE}
| filter event.kind == "DAVIS_PROBLEM" OR event.kind == "DAVIS_EVENT"
| filter contains(toString(affected_entity_ids), "{SERVICE_ENTITY_ID}")
| filter event.name != "Monitoring not available"
| fields timestamp, event.kind, event.name, event.status, event.category, display_id
| sort timestamp desc
| limit 50
Step 6: Service Error Rate
timeseries errors = avg(dt.service.request.failure_rate),
from:now()-{TIMERANGE}, by:{dt.entity.service}
| filter dt.entity.service == "{SERVICE_ENTITY_ID}"
Step 7: Deployment Events
fetch events, from:now()-{TIMERANGE}
| filter event.kind == "CUSTOM_DEPLOYMENT"
| filter contains(toString(affected_entity_ids), "{SERVICE_ENTITY_ID}")
| fields timestamp, event.name, deployment.name, deployment.version
| sort timestamp desc
| limit 20
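Once the results of Steps 5 and 7 are exported, deployments can be lined up against the problems that follow them. The sketch below assumes each record is a dict with an ISO 8601 timestamp string (with an explicit UTC offset) and an event.name key. The record shape, the function name, and the 30-minute default window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def deployments_before_problems(problems, deployments, window_minutes=30):
    """Pair each problem with deployments that preceded it within the window.

    Returns (deployment_name, problem_name, lag) tuples in input order.
    """
    window = timedelta(minutes=window_minutes)
    pairs = []
    for problem in problems:
        problem_time = datetime.fromisoformat(problem["timestamp"])
        for deployment in deployments:
            deploy_time = datetime.fromisoformat(deployment["timestamp"])
            lag = problem_time - deploy_time
            if timedelta(0) <= lag <= window:
                pairs.append((deployment["event.name"], problem["event.name"], lag))
    return pairs
```

A short lag between a deployment and a problem is only circumstantial evidence; it should be weighed together with the pod generation and JVM findings.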
Splunk SPL Query Templates
When log access is restricted to Splunk (no Dynatrace log ingestion), generate these SPL queries for human execution. Replace {INDEX}, {NAMESPACE}, and {CONTAINER} with actual values discovered in Steps 1-2.
Application Errors
index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
("ERROR" OR "FATAL" OR "OOMKilled" OR "CrashLoopBackOff"
OR "readiness probe failed" OR "liveness probe failed"
OR "ApplicationContextException" OR "BeanCreationException"
OR "OutOfMemoryError")
| sort -_time
| head 200
| table _time, pod_name, log_level, message
Pod Lifecycle Events
index={INDEX} namespace="{NAMESPACE}"
("Killing" OR "Back-off restarting" OR "Started container"
OR "Pulled image" OR "Liveness probe failed" OR "Readiness probe failed")
| sort -_time
| head 100
| table _time, pod_name, reason, message
Spring Boot Startup Failures
index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
("Failed to start" OR "Application run failed"
OR "BeanDefinitionStoreException" OR "NoSuchBeanDefinitionException"
OR "UnsatisfiedDependencyException" OR "ClassNotFoundException"
OR "NoClassDefFoundError" OR "NoSuchMethodError")
| sort -_time
| head 50
| table _time, pod_name, message
Thread and Connection Issues
index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
("deadlock" OR "thread dump" OR "blocked" OR "WAITING"
OR "pool-" OR "http-nio-" OR "RejectedExecutionException"
OR "Connection pool" OR "timeout" OR "refused")
| sort -_time
| head 100
| table _time, pod_name, message
Jakarta/Javax Migration Errors (Framework Upgrade)
index={INDEX} namespace="{NAMESPACE}" container_name="{CONTAINER}"
("javax.servlet" OR "javax.persistence" OR "javax.validation"
OR "jakarta.servlet" OR "ClassCastException"
OR "IncompatibleClassChangeError" OR "LinkageError")
| sort -_time
| head 50
| table _time, pod_name, message
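Filling the {INDEX}, {NAMESPACE}, and {CONTAINER} placeholders can be done mechanically before handing a query to a human operator. The following is a sketch under assumptions: the function name render_spl is hypothetical, and the character whitelist is an assumed safeguard so that a discovered value cannot inject quotes or pipes into the query. It only produces text and never executes anything against Splunk.

```python
import re

PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")
# Assumption: index, namespace, and container names use only these characters.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9._-]+$")


def render_spl(template, values):
    """Replace {KEY} placeholders in an SPL template with validated values."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for {{{key}}}")
        value = values[key]
        if not SAFE_VALUE.match(value):
            raise ValueError(f"unsafe value for {key}: {value!r}")
        return value

    return PLACEHOLDER.sub(substitute, template)
```

Raising on a missing or suspicious value is a deliberate choice here: a half-rendered query is easy to overlook when pasted into a search bar.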
Root Cause Classification
| Classification | Key Signals | Confidence |
|---|---|---|
| Rolling deployment stuck | Multiple ReplicaSet hashes, version mismatch, deployment events | High if versions differ |
| Startup failure | Short-lived process instances, sparse JVM metrics, bean exceptions | High if startup pattern matches |
| Thread exhaustion | Thread spike to 100+, then null, then restart | High if spike→null pattern repeats |
| OOM kill | Heap at limit, exit code 137, GC pause spike before crash | High if heap data available |
| GC storm | GC pause >500ms sustained, CPU spike, response time degradation | Medium (need heap data too) |
| Probe misconfiguration | Restart events with no JVM anomaly, healthy service metrics | Medium (need negative evidence) |
| External dependency | Response time spike correlates with upstream service issues | Low (need cross-service data) |
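The table can be encoded as a simple rule list for a first-pass classification. This is a sketch, not the skill's actual decision logic: the signal names are hypothetical labels for the key signals in each row, and requiring every listed signal before a rule matches is an assumption.

```python
# (classification, required signals, confidence) in table order.
# Signal names are hypothetical labels for the "Key Signals" column.
RULES = [
    ("Rolling deployment stuck", {"multiple_replicasets", "version_mismatch"}, "High"),
    ("Startup failure", {"short_lived_instances", "bean_exceptions"}, "High"),
    ("Thread exhaustion", {"thread_spike_then_null_repeats"}, "High"),
    ("OOM kill", {"heap_at_limit", "exit_code_137"}, "High"),
    ("GC storm", {"gc_pause_over_500ms_sustained"}, "Medium"),
    ("Probe misconfiguration", {"restarts_without_jvm_anomaly"}, "Medium"),
    ("External dependency", {"upstream_latency_correlation"}, "Low"),
]


def classify_root_cause(observed):
    """Return every (classification, confidence) whose signals are all observed."""
    observed = set(observed)
    return [
        (name, confidence)
        for name, required, confidence in RULES
        if required <= observed
    ]
```

Several rules can match at once, for example a stuck rollout together with a startup failure on the new generation, so the function returns a list rather than a single answer.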
Spring Boot Major Upgrade Checklist
When pod generation analysis reveals a major Spring Boot version change, check these breaking changes:
Spring Boot 3.x → 4.x
- Jakarta EE 11 namespace (some packages moved again)
- Removed deprecated auto-configuration classes
- Actuator security defaults changed (endpoints may require authentication)
- Tomcat 11 required (not backward compatible with Tomcat 10)
- spring.config.import behavior changes
- Observation API changes (Micrometer)
Spring Boot 2.x → 3.x
- javax.* → jakarta.* namespace migration
- Java 17 minimum requirement
- Spring Security 6.x (new SecurityFilterChain API)
- Removed WebSecurityConfigurerAdapter
- Actuator endpoint exposure defaults changed
Tomcat 10.x → 11.x
- Jakarta EE 11 Servlet 6.1 API
- Removed legacy AJP connector defaults
- HTTP/2 configuration changes
- Default thread pool sizing changes
Output Format
Present findings as a structured triage report:
- Service Identity — name, namespace, framework versions
- Pod Generation Status — table of generations with version comparison
- Crash Classification — type and confidence level
- Evidence Timeline — chronological table of events and metric anomalies
- JVM Metrics Summary — per-pod thread/heap/GC table
- Root Cause — primary cause with evidence citations, contributing factors
- Recommended Actions — prioritized by immediate / short-term / prevention
- Splunk Queries — generated SPL for human execution with descriptions
Security Best Practices
- Never expose DT_API_TOKEN in output
- Never execute Splunk queries programmatically in restricted environments
- Never recommend destructive kubectl operations without explicit safety warnings
- Always use Bearer auth (not Api-Token prefix) for Platform tokens
- Read-only by default; flag any write operations clearly
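To keep tokens out of reports and logs, output can be passed through a redaction step before it is shown. The sketch below assumes tokens start with a prefix shaped like dt0s16 (as in the example earlier) followed by a dot and the secret part. The pattern and the function name redact_tokens are assumptions and may not cover every token format.

```python
import re

# Assumed token shape: "dt0" + letter + two digits, a dot, then the secret part.
TOKEN = re.compile(r"(dt0[a-z]\d{2})\.[A-Za-z0-9._-]+")


def redact_tokens(text):
    """Replace the secret part of anything that looks like a Dynatrace token."""
    return TOKEN.sub(lambda match: match.group(1) + ".[REDACTED]", text)
```

Keeping the prefix visible lets a reader see which kind of token was involved without exposing the secret.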
Resources
Related Assets
Dynatrace Kubernetes Service Triage
Systematic triage of a Dynatrace-monitored Kubernetes service using DQL queries for entity discovery, JVM health, thread analysis, pod generation comparison, and Davis problem correlation. Produces structured root cause analysis with Splunk query handoffs for restricted log environments.
Owner: epic-platform-sre
Spring Boot Container Crash Triage
Diagnose Spring Boot container crashes in Kubernetes by correlating Dynatrace JVM telemetry, pod lifecycle events, and deployment state. Covers rolling deployment failures, OOM kills, thread exhaustion, startup failures, and major framework upgrades.
Owner: epic-platform-sre
Dynatrace Operations Agent
Autonomous Dynatrace Platform agent that executes DQL queries, reads settings, and runs diagnostic workflows against any Grail-based tenant. Discovers credentials automatically (env var, .dtenv file, or prompt), executes live API calls, and presents formatted results. Use for entity inventory, metrics analysis, problem triage, log review, and guided troubleshooting.
Owner: platform-infrastructure
Kubernetes Pod Debug Assistant
Diagnose failing or unhealthy Kubernetes pods using cluster state, events, and logs. Produces structured root cause analysis with safe remediation recommendations.
Owner: epic-platform-sre
Kubernetes Operations Assistant
Assist with Kubernetes cluster operations, debugging, and troubleshooting using read-only diagnostics and GitOps-safe recommendations.
Owner: epic-platform-sre
dynatrace-expert
Dynatrace Platform operations expertise — DQL queries, entity inventory, metrics analysis, problem triage, dashboard management, and Settings API for Grail-based tenants.
Owner: platform-infrastructure

