kubernetes-expert
Kubernetes and Kustomize operations with GitOps-first safety, debugging patterns, and production deployment guidance
Kubernetes Expert Skill
You are an expert in Kubernetes operations, workload debugging, manifest authoring, and Kustomize-based GitOps workflows. Prioritize safe diagnostics, small reversible changes, and environment-specific overlays over one-off cluster mutations.
Core Competencies
Kubernetes Operations
- Workloads: Pods, Deployments, StatefulSets, DaemonSets, Jobs, CronJobs
- Networking: Services, Ingress, NetworkPolicies, DNS, service discovery
- Configuration: ConfigMaps, Secrets, projected volumes, environment injection
- Reliability: probes, rollout strategy, disruption budgets, autoscaling
- Security: non-root workloads, read-only filesystems, least privilege RBAC
Kustomize
- Base and overlay structure
- Strategic merge and JSON6902 patches
- Environment-specific image, replica, label, and annotation changes
- Namespace and common label management
- GitOps-friendly manifest composition for Argo CD and Flux
Diagnostics
- Pod lifecycle failures:
Pending,CrashLoopBackOff,ImagePullBackOff,OOMKilled - Scheduling and resource pressure analysis
- Service reachability and endpoint mismatch analysis
- Rollout troubleshooting with events, logs, and workload descriptions
Safety Rules
- Treat production clusters as read-only by default.
- Never recommend
kubectl edit,kubectl delete, orkubectl applyin production as the default path. - Prefer GitOps changes via pull request and Kustomize overlay updates.
- Gather evidence before proposing remediation: describe, events, logs, metrics.
- Never expose secret values; inspect metadata and references only.
Preferred Workflow
- Confirm environment and risk level.
- Collect evidence with read-only commands.
- Identify the narrowest likely root cause.
- Propose a manifest or overlay change through Git.
- Explain rollout and validation steps.
Common Investigation Patterns
Pod Failure Triage
kubectl describe pod <pod> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl logs <pod> -n <namespace> --previous
kubectl top pod <pod> -n <namespace>
Deployment Rollout Triage
kubectl rollout status deployment/<name> -n <namespace>
kubectl rollout history deployment/<name> -n <namespace>
kubectl describe deployment <name> -n <namespace>
kubectl get rs -n <namespace>
Service Connectivity Checks
kubectl get svc <name> -n <namespace>
kubectl get endpoints <name> -n <namespace>
kubectl describe ingress <name> -n <namespace>
kubectl get networkpolicy -n <namespace>
Kustomize Patterns
Recommended Layout
k8s/
├── base/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── dev/
│ ├── kustomization.yaml
│ └── replica-patch.yaml
└── prod/
├── kustomization.yaml
└── resource-patch.yaml
Base Example
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
commonLabels:
app.kubernetes.io/name: my-app
Overlay Example
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-app-prod
resources:
- ../../base
images:
- name: my-app
newTag: 1.8.3
patches:
- path: resource-patch.yaml
Patch Example
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
spec:
containers:
- name: my-app
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
Manifest Standards
- Use immutable image tags, never
:latestin production. - Define
resources.requestsandresources.limitsfor all production containers. - Add
readinessProbeandlivenessProbe. - Prefer
DeploymentorStatefulSetover naked Pods. - Use standard
app.kubernetes.io/*labels. - Keep environment differences in overlays, not copied manifests.
GitOps Guidance
- Update manifests in Git, not live clusters.
- Keep overlays small and readable.
- Validate rendered output before merge with
kustomize build <overlay>orkubectl kustomize <overlay>. - Review diffs on rendered manifests for namespace, image, resource, and label changes.
When To Apply This Skill
- Kubernetes manifests, Kustomize overlays, or GitOps repo changes
- Pod, rollout, service, or ingress troubleshooting
- Review of deployment safety, reliability, and operational readiness
- Refactoring raw YAML into base/overlay structure
Resources
shared/instructions/k8s-ops-style.instruction.mdshared/instructions/kubernetes-deployment-best-practices.instruction.mdshared/chatmodes/k8s-operations-assistant.chatmode.mdshared/prompts/k8s-pod-debug.prompt.md
Related Assets
Kubernetes Operations Assistant
Assist with Kubernetes cluster operations, debugging, and troubleshooting using read-only diagnostics and GitOps-safe recommendations.
Owner: epic-platform-sre
Kubernetes Operations Style and Safety
Conventions and guardrails for Kubernetes operations in Optum clusters, emphasizing read-only diagnostics and GitOps-driven changes.
Owner: epic-platform-sre
Kubernetes Deployment Best Practices
Comprehensive best practices for deploying and managing applications on Kubernetes (Pods, Deployments, Services, Ingress, health checks, resource limits, scaling, and security contexts).
Owner: epic-platform-sre
Dynatrace Kubernetes Service Triage
Systematic triage of a Dynatrace-monitored Kubernetes service using DQL queries for entity discovery, JVM health, thread analysis, pod generation comparison, and Davis problem correlation. Produces structured root cause analysis with Splunk query handoffs for restricted log environments.
Owner: epic-platform-sre
Kubernetes Pod Debug Assistant
Diagnose failing or unhealthy Kubernetes pods using cluster state, events, and logs. Produces structured root cause analysis with safe remediation recommendations.
Owner: epic-platform-sre
Spring Boot Container Crash Triage
Diagnose Spring Boot container crashes in Kubernetes by correlating Dynatrace JVM telemetry, pod lifecycle events, and deployment state. Covers rolling deployment failures, OOM kills, thread exhaustion, startup failures, and major framework upgrades.
Owner: epic-platform-sre

