Kubernetes Operations Style and Safety

Conventions and guardrails for Kubernetes operations in Optum clusters, emphasizing read-only diagnostics and GitOps-driven changes.

Status: experimental
IDE: claude, codex, vscode
Version: 1.0.0
Owner: epic-platform-sre
Tags: k8s, kubernetes, ops, safety, gitops

Overview

This guide covers conventions and safety guardrails for Kubernetes operations in Optum clusters. Production namespaces are read-only for LLM agents; all changes flow through GitOps pipelines.

Critical Safety Rules

Read-Only Default

MUST treat all environments as read-only by default; write access is an explicit, per-environment exception (see Allowed Operations by Environment):

# ✅ ALLOWED - Read-only operations
kubectl get pods -n production
kubectl describe deployment app -n production
kubectl logs pod/app-xyz123 -n production
kubectl top pods -n production

# ❌ FORBIDDEN - Direct mutations in production
kubectl delete pod app-xyz123 -n production
kubectl scale deployment app --replicas=0 -n production
kubectl apply -f manifest.yaml -n production
kubectl edit deployment app -n production
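
The allowed/forbidden split above can be enforced mechanically rather than by memory. The following is a minimal sketch, assuming a POSIX shell; the function names `kubectl_verb_class` and `kubectl_ro` are hypothetical, and the verb list is taken from the read-only examples in this guide rather than from any official kubectl classification.

```shell
#!/bin/sh
# Sketch of a read-only guard (hypothetical helper names).

# Print "read" for verbs this guide lists as read-only, "write" otherwise.
# Unknown verbs are classified as "write" so the guard fails closed.
kubectl_verb_class() {
  case "$1" in
    get|describe|logs|top|events) echo read ;;
    *) echo write ;;
  esac
}

# Run kubectl only when the first argument is a read-only verb.
kubectl_ro() {
  if [ "$(kubectl_verb_class "$1")" = "read" ]; then
    kubectl "$@"
  else
    echo "refused: '$1' is not a read-only verb; use the GitOps pipeline" >&2
    return 1
  fi
}
```

With this in place, `kubectl_ro get pods -n production` runs normally while `kubectl_ro delete pod app-xyz123 -n production` is refused. The sketch is deliberately strict: `rollout` is treated as a write because `rollout undo` mutates state, even though `rollout status` and `rollout history` do not.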

Change Flow Requirements

ALL changes to production MUST flow through:

  1. Git commit to manifest repository
  2. Pull request with required reviews
  3. GitOps sync (ArgoCD or Flux)
  4. Automated validation before promotion

# Change flow diagram
# Developer → Git → PR Review → Merge → GitOps → Cluster
#                      ↑
#                  CI Validation

Allowed Operations by Environment

Development Environment

allowed_operations:
  read:
    - get, describe, logs, top, events
    - port-forward (for debugging)
  write:
    - apply, delete, scale, rollout
    - exec (for debugging)
  restrictions:
    - No changes to shared infrastructure
    - No changes to istio-system, cert-manager

QA Environment

allowed_operations:
  read:
    - All read operations
  write:
    - Requires approval for mutations
    - GitOps preferred but direct apply allowed for hotfixes
  restrictions:
    - No namespace deletion
    - No PVC deletion
    - Changes logged and audited

Production Environment

allowed_operations:
  read:
    - All read operations
  write:
    - GitOps only (no direct kubectl apply)
    - Emergency break-glass with dual approval
  restrictions:
    - No direct mutations
    - No exec into pods (except break-glass)
    - No port-forward (use ingress/mesh)
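
The three environment policies above amount to a small decision table. The sketch below encodes it as a lookup, assuming the short environment names `dev`, `qa`, and `prod` (hypothetical; the guide does not define identifiers) and a hypothetical function name `op_policy`. It does not model resource types, so the QA bans on namespace and PVC deletion are out of scope here.

```shell
#!/bin/sh
# op_policy ENV VERB [NAMESPACE]
# Prints one of: allow | approval | break-glass | deny
# The function body runs in a subshell so its variables do not leak.
op_policy() (
  env_name=$1; verb=$2; ns=${3:-}

  # Classify the verb the way the tables above group operations.
  case "$verb" in
    get|describe|logs|top|events) class=read ;;
    port-forward) class=forward ;;
    exec) class=exec ;;
    *) class=write ;;
  esac

  case "$env_name:$class" in
    dev:read|dev:forward|qa:read|qa:forward|prod:read) echo allow ;;
    dev:exec|dev:write)
      # Development forbids changes to shared mesh and certificate namespaces.
      case "$ns" in
        istio-system|cert-manager) echo deny ;;
        *) echo allow ;;
      esac ;;
    qa:exec|qa:write) echo approval ;;
    prod:exec|prod:write) echo break-glass ;;
    *) echo deny ;;  # prod port-forward and unknown environments
  esac
)
```

For example, `op_policy prod apply` prints `break-glass`, `op_policy qa scale` prints `approval`, and `op_policy prod port-forward` prints `deny`. Unknown environments fall through to `deny`, which keeps the default aligned with the read-only rule.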

Diagnostic Commands

Pod Investigation

MUST use these patterns for pod diagnostics:

# List pods with status
kubectl get pods -n $NAMESPACE -o wide

# Get pod details
kubectl describe pod $POD_NAME -n $NAMESPACE

# Check pod events (field selector avoids false matches from grep on similar names)
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD_NAME --sort-by='.lastTimestamp'

# View pod logs
kubectl logs $POD_NAME -n $NAMESPACE --tail=100

# View previous container logs (after crash)
kubectl logs $POD_NAME -n $NAMESPACE --previous

# Multi-container pod logs
kubectl logs $POD_NAME -n $NAMESPACE -c $CONTAINER_NAME

# Stream logs
kubectl logs -f $POD_NAME -n $NAMESPACE
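
During an investigation it is often convenient to capture these read-only commands as a single report. The sketch below is one way to do that, assuming a POSIX shell; `pod_diag` and the `KUBECTL` override variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the
# commands without contacting a cluster.
KUBECTL=${KUBECTL:-kubectl}

# pod_diag NAMESPACE POD
# Runs only read-only commands: describe, events, current and previous logs.
pod_diag() {
  ns=$1; pod=$2
  echo "== describe =="
  "$KUBECTL" describe pod "$pod" -n "$ns"
  echo "== events =="
  "$KUBECTL" get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
  echo "== logs (last 100 lines) =="
  "$KUBECTL" logs "$pod" -n "$ns" --tail=100
  echo "== previous container logs =="
  # --previous fails when the container has never restarted; that is not an error here.
  "$KUBECTL" logs "$pod" -n "$ns" --previous --tail=100 \
    || echo "(no previous container)"
}
```

A typical call would be `pod_diag "$NAMESPACE" "$POD_NAME" > pod-report.txt`, which can then be attached to a ticket.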

Resource Investigation

# Deployment status
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE

# Deployment history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# ReplicaSet details
kubectl get rs -n $NAMESPACE -o wide

# Service endpoints
kubectl get endpoints $SERVICE -n $NAMESPACE

# ConfigMap contents
kubectl get configmap $CM_NAME -n $NAMESPACE -o yaml

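# Secret metadata (never output values). Select specific fields instead of all of
# .metadata: the last-applied-configuration annotation can contain the secret data.
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.metadata.name}{"\t"}{.type}{"\t"}{.metadata.creationTimestamp}{"\n"}'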

Cluster Health

# Node status
kubectl get nodes -o wide

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -n $NAMESPACE

# Resource quotas
kubectl describe resourcequota -n $NAMESPACE

# Limit ranges
kubectl describe limitrange -n $NAMESPACE

Forbidden Operations

Never Execute in Production

| Operation | Risk | Alternative |
|---|---|---|
| kubectl delete namespace | Data loss | Archive and recreate via GitOps |
| kubectl delete pvc | Data loss | Backup first, delete via GitOps |
| kubectl scale --replicas=0 | Outage | Use GitOps with canary rollback |
| kubectl apply -f | Drift | Always use GitOps pipeline |
| kubectl edit | Drift | Update manifests in Git |
| kubectl exec -it -- sh | Security | Use ephemeral debug containers |
| kubectl port-forward (prod) | Bypass security | Use proper ingress/mesh |

Dangerous Patterns

NEVER execute these patterns:

# ❌ Force delete - can cause data corruption
kubectl delete pod $POD --force --grace-period=0

# ❌ Delete all pods - will cause outage
kubectl delete pods --all -n $NAMESPACE

# ❌ Patch without review - causes drift
kubectl patch deployment $DEPLOYMENT -p '{"spec":{"replicas":0}}'

# ❌ Run privileged containers
kubectl run debug --image=alpine --privileged

# ❌ Mount host filesystem
kubectl run debug --image=alpine --overrides='{"spec":{"containers":[{"name":"debug","image":"alpine","volumeMounts":[{"mountPath":"/host","name":"host"}]}],"volumes":[{"name":"host","hostPath":{"path":"/"}}]}}'
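
A wrapper or agent can screen a proposed command line for the patterns above before anything is executed. This is a heuristic sketch only: it does plain substring matching, the function name `dangerous_reason` is hypothetical, and it is not a substitute for RBAC or admission control.

```shell
#!/bin/sh
# dangerous_reason "COMMAND LINE"
# Prints a short reason and returns 0 when the command matches one of the
# dangerous patterns; returns 1 when nothing matches.
dangerous_reason() {
  case "$1" in
    *--force*|*--grace-period=0*) echo "force delete" ;;
    *"delete pods --all"*|*"delete pod --all"*) echo "bulk delete" ;;
    *--privileged*) echo "privileged container" ;;
    *hostPath*) echo "host filesystem mount" ;;
    *"kubectl patch"*) echo "unreviewed patch" ;;
    *) return 1 ;;
  esac
}
```

For instance, `dangerous_reason 'kubectl run debug --image=alpine --privileged'` prints `privileged container`, while a plain `kubectl get pods --all-namespaces` matches nothing and returns 1.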

GitOps Change Patterns

Manifest Update Flow

MUST follow this flow for changes:

# 1. Clone manifest repository
git clone https://github.com/org/k8s-manifests.git
cd k8s-manifests

# 2. Create feature branch
git checkout -b feature/update-app-replicas

# 3. Make changes to manifests
vim apps/production/app/deployment.yaml

# 4. Validate locally
kubectl diff -f apps/production/app/

# 5. Commit and push
git add apps/production/app/deployment.yaml
git commit -m "feat(app): increase replicas to 5 for traffic spike"
git push origin feature/update-app-replicas

# 6. Create PR (GitOps will handle apply after merge)

Kustomize Patterns

PREFER Kustomize overlays for environment-specific changes:

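# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.internal/app:v1.2.3 # pin a version; avoid :latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"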

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches: # replaces patchesStrategicMerge, which is deprecated in Kustomize v5
  - path: deployment-patch.yaml

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"

Incident Response Operations

Break-Glass Procedure

For emergency operations requiring direct cluster access:

break_glass_procedure:
  prerequisites:
    - Active P1/P2 incident
    - GitOps too slow for remediation
    - Dual approval from on-call leads

  steps:
    1_document:
      - Create incident ticket
      - Record justification
      - Get verbal approval (record in ticket)

    2_execute:
      - Perform minimum necessary changes
      - Log all commands executed
      - Take screenshots of before/after

    3_reconcile:
      - Create PR to sync manifests with actual state
      - Document changes in incident postmortem
      - Review and refine runbooks

  allowed_break_glass_operations:
    - Restart failing pods
    - Scale deployment (up or down)
    - Rollback to previous revision
    - Update ConfigMap for critical fixes

  still_forbidden:
    - Namespace deletion
    - PVC deletion
    - Security policy changes

Emergency Rollback

# View rollout history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# Rollback to previous revision
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE

# Rollback to specific revision
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE --to-revision=2

# Verify rollback
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE

Manifest Best Practices

Resource Definitions

MUST include resource requests and limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
        version: v1
    spec:
      containers:
        - name: app
          image: registry.internal/app:v1.2.3
          resources:
            requests:
              memory: '256Mi'
              cpu: '100m'
            limits:
              memory: '512Mi'
              cpu: '500m'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Security Context

MUST set security context:

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL

Pod Disruption Budget

MUST define PDB for production services:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: app
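
Choosing `minAvailable` is an arithmetic decision: the number of voluntary disruptions the budget permits is the count of healthy pods minus `minAvailable`, floored at zero. The sketch below works through that calculation; the function name `pdb_allowed` is hypothetical.

```shell
#!/bin/sh
# pdb_allowed HEALTHY_PODS MIN_AVAILABLE
# Prints how many pods may be voluntarily evicted right now.
pdb_allowed() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

pdb_allowed 3 2   # 3 healthy replicas, minAvailable 2: prints 1
pdb_allowed 2 2   # 2 healthy replicas, minAvailable 2: prints 0
```

The second case is the one to watch for: when the replica count equals `minAvailable`, no eviction is permitted and node drains will block until another replica becomes healthy.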

Network Policies

MUST define network policies for production:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-network-policy
spec:
  podSelector:
    matchLabels:
      app: app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
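  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
    # Restricting Egress also blocks DNS unless it is allowed explicitly.
    # Assumes cluster DNS runs in kube-system; adjust for your cluster.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53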

Labeling Standards

MUST apply standard labels:

metadata:
  labels:
    # Required labels
    app.kubernetes.io/name: app
    app.kubernetes.io/instance: app-production
    app.kubernetes.io/version: v1.2.3
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: platform
    app.kubernetes.io/managed-by: argocd

    # Optum required labels
    optum.com/owner: platform-team
    optum.com/environment: production
    optum.com/cost-center: PLAT-001
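
A quick pre-review check can flag manifests that are missing any of the required labels above. This sketch is a textual check with `grep`, not a YAML parser, so it only confirms that each key appears somewhere in the file; the function name `missing_labels` is hypothetical.

```shell
#!/bin/sh
# missing_labels FILE
# Prints each required label key that does not appear in FILE.
missing_labels() {
  for key in \
    app.kubernetes.io/name app.kubernetes.io/instance \
    app.kubernetes.io/version app.kubernetes.io/component \
    app.kubernetes.io/part-of app.kubernetes.io/managed-by \
    optum.com/owner optum.com/environment optum.com/cost-center
  do
    # -F treats the key as a fixed string, so dots are not regex wildcards.
    grep -q -F -- "$key:" "$1" || echo "$key"
  done
}
```

Running `missing_labels apps/production/app/deployment.yaml` prints nothing when every key is present, which makes it straightforward to use as a CI gate by failing when the output is non-empty.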

Observability Requirements

Pod Annotations for Monitoring

metadata:
  annotations:
    # Prometheus scraping
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'

    # Logging configuration
    logging.optum.com/format: 'json'
    logging.optum.com/parser: 'application'

Required Metrics

MUST expose standard metrics:

| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_flight | Gauge | Current requests |
| process_cpu_seconds_total | Counter | CPU usage |
| process_resident_memory_bytes | Gauge | Memory usage |

Review Checklist

When reviewing Kubernetes changes:

Security

  • SecurityContext defined (non-root, read-only fs)
  • NetworkPolicy defined for production
  • No privileged containers
  • No hostPath mounts
  • Secrets not hardcoded

Reliability

  • Resource requests and limits set
  • Liveness and readiness probes configured
  • PodDisruptionBudget defined
  • Replica count appropriate for environment
  • Anti-affinity rules for HA

Observability

  • Prometheus annotations present
  • Logging format documented
  • Standard labels applied

GitOps

  • All changes in manifest repository
  • No direct kubectl apply in PR
  • Kustomize overlays for environment differences
