Kubernetes Operations Style and Safety

Conventions and guardrails for Kubernetes operations in Optum clusters, emphasizing read-only diagnostics and GitOps-driven changes.

Status: experimental
IDE: claude, codex, vscode
Version: 1.0.0
Owner: epic-platform-sre
Tags: k8s, kubernetes, ops, safety, gitops

Kubernetes Operations Style and Safety Guide

Overview

This guide covers conventions and safety guardrails for Kubernetes operations in Optum clusters. Production namespaces are read-only for LLM agents; all changes flow through GitOps pipelines.

Critical Safety Rules

Read-Only Default

MUST treat all environments as read-only by default; mutate only where the per-environment rules in this guide explicitly allow it:

# ✅ ALLOWED - Read-only operations
kubectl get pods -n production
kubectl describe deployment app -n production
kubectl logs pod/app-xyz123 -n production
kubectl top pods -n production

# ❌ FORBIDDEN - Direct mutations in production
kubectl delete pod app-xyz123 -n production
kubectl scale deployment app --replicas=0 -n production
kubectl apply -f manifest.yaml -n production
kubectl edit deployment app -n production
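
The allow/forbid split above can be enforced mechanically before kubectl is ever invoked. The following is a minimal sketch, not an existing Optum tool: `is_readonly_verb` and `kubectl_ro` are hypothetical names, the verb list is taken from the read operations named in this guide, and the wrapper assumes the verb is the first argument (global flags placed before the verb are not handled).

```shell
# Hypothetical helper: succeed only for verbs this guide lists as read operations.
is_readonly_verb() {
  case "${1:-}" in
    get|describe|logs|top|events) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: pass read-only verbs through to kubectl, refuse the rest.
kubectl_ro() {
  if is_readonly_verb "${1:-}"; then
    kubectl "$@"
  else
    echo "refused: '${1:-}' is not a read-only verb; use the GitOps flow" >&2
    return 1
  fi
}
```

With this in place, `kubectl_ro get pods -n production` reaches the cluster, while `kubectl_ro delete pod app-xyz123 -n production` is refused locally. Because only the top-level verb is inspected, `rollout status` and `rollout history` are also refused even though they do not mutate anything.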

Change Flow Requirements

ALL changes to production MUST flow through:

1. Git commit to manifest repository
2. Pull request with required reviews
3. GitOps sync (ArgoCD or Flux)
4. Automated validation before promotion

# Change flow diagram
# Developer → Git → PR Review → Merge → GitOps → Cluster
#                      ↑
#                  CI Validation

Allowed Operations by Environment

Development Environment

allowed_operations:
  read:
    - get, describe, logs, top, events
    - port-forward (for debugging)
  write:
    - apply, delete, scale, rollout
    - exec (for debugging)
  restrictions:
    - No changes to shared infrastructure
    - No changes to istio-system, cert-manager

QA Environment

allowed_operations:
  read:
    - All read operations
  write:
    - Requires approval for mutations
    - GitOps preferred but direct apply allowed for hotfixes
  restrictions:
    - No namespace deletion
    - No PVC deletion
    - Changes logged and audited

Production Environment

allowed_operations:
  read:
    - All read operations
  write:
    - GitOps only (no direct kubectl apply)
    - Emergency break-glass with dual approval
  restrictions:
    - No direct mutations
    - No exec into pods (except break-glass)
    - No port-forward (use ingress/mesh)
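
The three environment blocks above amount to a lookup from (environment, verb) to a decision. A rough sketch of that lookup follows. The function name `policy_for` and the environment identifiers `dev`, `qa`, and `production` are assumptions made for illustration. The sketch also assumes that every non-read verb in QA needs approval, and it does not model namespace restrictions or the break-glass path.

```shell
# Hypothetical policy lookup: prints "allow", "approval", or "deny".
policy_for() {
  local target="$1" verb="$2"
  # Read operations are allowed in every environment.
  case "$verb" in
    get|describe|logs|top|events) echo allow; return 0 ;;
  esac
  case "$target:$verb" in
    dev:port-forward|dev:apply|dev:delete|dev:scale|dev:rollout|dev:exec)
      echo allow ;;
    qa:*)
      echo approval ;;  # assumption: any non-read verb in QA needs approval
    *)
      echo deny ;;      # production and unknown environments
  esac
}
```

For example, `policy_for qa apply` prints `approval` and `policy_for production port-forward` prints `deny`. Unknown environments fall through to `deny`, so a typo in the environment name fails closed.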

Diagnostic Commands

Pod Investigation

MUST use these patterns for pod diagnostics:

# List pods with status
kubectl get pods -n $NAMESPACE -o wide

# Get pod details
kubectl describe pod $POD_NAME -n $NAMESPACE

# Check pod events
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD_NAME --sort-by='.lastTimestamp'

# View pod logs
kubectl logs $POD_NAME -n $NAMESPACE --tail=100

# View previous container logs (after crash)
kubectl logs $POD_NAME -n $NAMESPACE --previous

# Multi-container pod logs
kubectl logs $POD_NAME -n $NAMESPACE -c $CONTAINER_NAME

# Stream logs
kubectl logs -f $POD_NAME -n $NAMESPACE
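
The pod diagnostics above are often run together as a first pass. The sketch below bundles a subset of them; `run`, `pod_diag`, and the `KUBE_DIAG_RUN` variable are hypothetical names. By default it only prints the commands so they can be reviewed; setting `KUBE_DIAG_RUN=1` executes them against the current kubectl context.

```shell
# Print the command by default; execute it only when KUBE_DIAG_RUN=1.
run() {
  if [ "${KUBE_DIAG_RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Hypothetical first-pass bundle of read-only pod diagnostics.
pod_diag() {
  local ns="$1" pod="$2"
  run kubectl get pod "$pod" -n "$ns" -o wide
  run kubectl describe pod "$pod" -n "$ns"
  run kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
  run kubectl logs "$pod" -n "$ns" --tail=100
  run kubectl logs "$pod" -n "$ns" --previous --tail=100
}
```

Calling `pod_diag production app-xyz123` prints five commands, all read-only. When executed, the `--previous` call reports an error if the container has not restarted; that is expected and does not affect the other commands.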

Resource Investigation

# Deployment status
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE

# Deployment history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# ReplicaSet details
kubectl get rs -n $NAMESPACE -o wide

# Service endpoints
kubectl get endpoints $SERVICE -n $NAMESPACE

# ConfigMap contents
kubectl get configmap $CM_NAME -n $NAMESPACE -o yaml

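# Secret metadata (never output values; full .metadata can embed them via the last-applied-configuration annotation)
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.metadata.name}{"\t"}{.type}{"\t"}{.metadata.creationTimestamp}{"\n"}'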

Cluster Health

# Node status
kubectl get nodes -o wide

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -n $NAMESPACE

# Resource quotas
kubectl describe resourcequota -n $NAMESPACE

# Limit ranges
kubectl describe limitrange -n $NAMESPACE

Forbidden Operations

Never Execute in Production

Operation	Risk	Alternative
`kubectl delete namespace`	Data loss	Archive and recreate via GitOps
`kubectl delete pvc`	Data loss	Backup first, delete via GitOps
`kubectl scale --replicas=0`	Outage	Use GitOps with canary rollback
`kubectl apply -f`	Drift	Always use GitOps pipeline
`kubectl edit`	Drift	Update manifests in Git
`kubectl exec -it -- sh`	Security	Use ephemeral debug containers
`kubectl port-forward` (prod)	Bypass security	Use proper ingress/mesh

Dangerous Patterns

NEVER execute these patterns:

# ❌ Force delete - can cause data corruption
kubectl delete pod $POD --force --grace-period=0

# ❌ Delete all pods - will cause outage
kubectl delete pods --all -n $NAMESPACE

# ❌ Patch without review - causes drift
kubectl patch deployment $DEPLOYMENT -p '{"spec":{"replicas":0}}'

# ❌ Run privileged containers
kubectl run debug --image=alpine --privileged

# ❌ Mount host filesystem
kubectl run debug --image=alpine --overrides='{"spec":{"containers":[{"name":"debug","image":"alpine","volumeMounts":[{"mountPath":"/host","name":"host"}]}],"volumes":[{"name":"host","hostPath":{"path":"/"}}]}}'
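
The patterns above are regular enough to screen for with plain string matching. The following is a sketch of such a pre-flight check; `flag_dangerous` is a hypothetical name and the matching is deliberately simple (substring tests on the command line), so it is a safety net rather than a policy engine.

```shell
# Hypothetical pre-flight check: print each dangerous pattern found in a
# kubectl command line and return non-zero if any pattern matched.
flag_dangerous() {
  local cmd=" $* " hits=0
  case "$cmd" in
    *" --force "*|*" --grace-period=0 "*) echo "force delete"; hits=1 ;;
  esac
  case "$cmd" in
    *" delete "*" --all "*) echo "bulk delete"; hits=1 ;;
  esac
  case "$cmd" in
    *" patch "*) echo "patch outside review"; hits=1 ;;
  esac
  case "$cmd" in
    *" --privileged "*) echo "privileged container"; hits=1 ;;
  esac
  case "$cmd" in
    *hostPath*) echo "host filesystem mount"; hits=1 ;;
  esac
  [ "$hits" -eq 0 ]
}
```

For example, `flag_dangerous kubectl delete pods --all -n demo` prints `bulk delete` and returns non-zero, while `flag_dangerous kubectl get pods --all-namespaces` prints nothing and succeeds, because `--all` is matched only as a whole flag on a `delete`.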

GitOps Change Patterns

Manifest Update Flow

MUST follow this flow for changes:

# 1. Clone manifest repository
git clone https://github.com/org/k8s-manifests.git
cd k8s-manifests

# 2. Create feature branch
git checkout -b feature/update-app-replicas

# 3. Make changes to manifests
vim apps/production/app/deployment.yaml

# 4. Validate: preview the diff against the live cluster (nothing is applied)
kubectl diff -f apps/production/app/

# 5. Commit and push
git add apps/production/app/deployment.yaml
git commit -m "feat(app): increase replicas to 5 for traffic spike"
git push origin feature/update-app-replicas

# 6. Create PR (GitOps will handle apply after merge)

Kustomize Patterns

PREFER Kustomize overlays for environment-specific changes:

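# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.internal/app:v1.2.3 # pin a version; avoid :latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"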

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: deployment-patch.yaml

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"

Incident Response Operations

Break-Glass Procedure

For emergency operations requiring direct cluster access:

break_glass_procedure:
  prerequisites:
    - Active P1/P2 incident
    - GitOps too slow for remediation
    - Dual approval from on-call leads

  steps:
    1_document:
      - Create incident ticket
      - Record justification
      - Get verbal approval (record in ticket)

    2_execute:
      - Perform minimum necessary changes
      - Log all commands executed
      - Take screenshots of before/after

    3_reconcile:
      - Create PR to sync manifests with actual state
      - Document changes in incident postmortem
      - Review and refine runbooks

  allowed_break_glass_operations:
    - Restart failing pods
    - Scale deployment (up or down)
    - Rollback to previous revision
    - Update ConfigMap for critical fixes

  still_forbidden:
    - Namespace deletion
    - PVC deletion
    - Security policy changes
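
The "Log all commands executed" step is easy to skip under incident pressure, so it helps to make logging a side effect of running the command. A minimal sketch follows. `bg_run`, `BG_LOG`, and `INCIDENT_ID` are hypothetical names chosen for illustration, not part of any existing break-glass tooling.

```shell
# Hypothetical audit wrapper: record the command before and after it runs.
# Refuses to run unless INCIDENT_ID is set, so every entry maps to a ticket.
bg_run() {
  local log="${BG_LOG:-./break-glass.log}" rc=0
  if [ -z "${INCIDENT_ID:-}" ]; then
    echo "bg_run: set INCIDENT_ID before running break-glass commands" >&2
    return 2
  fi
  printf '%s\t%s\tSTART\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$INCIDENT_ID" "$*" >>"$log"
  "$@" || rc=$?
  printf '%s\t%s\tEXIT=%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$INCIDENT_ID" "$rc" "$*" >>"$log"
  return "$rc"
}
```

A session would then look like `INCIDENT_ID=INC-0001; bg_run kubectl rollout undo deployment/app -n production` (the ticket ID is a placeholder). The resulting tab-separated log can be attached to the incident ticket and used when writing the reconciliation PR.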

Emergency Rollback

# View rollout history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# Rollback to previous revision
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE

# Rollback to specific revision
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE --to-revision=2

# Verify rollback
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE

Manifest Best Practices

Resource Definitions

MUST include resource requests and limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
        version: v1
    spec:
      containers:
        - name: app
          image: registry.internal/app:v1.2.3
          resources:
            requests:
              memory: '256Mi'
              cpu: '100m'
            limits:
              memory: '512Mi'
              cpu: '500m'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Security Context

MUST set security context:

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL

Pod Disruption Budget

MUST define PDB for production services:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: app

Network Policies

MUST define network policies for production:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-network-policy
spec:
  podSelector:
    matchLabels:
      app: app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
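  egress:
    # With Egress in policyTypes, all other egress is denied, including DNS;
    # add a port 53 rule if the app resolves service names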
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432

Labeling Standards

MUST apply standard labels:

metadata:
  labels:
    # Required labels
    app.kubernetes.io/name: app
    app.kubernetes.io/instance: app-production
    app.kubernetes.io/version: v1.2.3
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: platform
    app.kubernetes.io/managed-by: argocd

    # Optum required labels
    optum.com/owner: platform-team
    optum.com/environment: production
    optum.com/cost-center: PLAT-001
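
The required labels above can be checked before a PR is opened. The sketch below lists which of those label keys do not appear in a manifest file; `missing_labels` is a hypothetical name. It uses a fixed-string text search rather than a YAML parser, so it cannot tell whether a key sits under `metadata.labels` or somewhere else in the file.

```shell
# Hypothetical check: print required label keys absent from a manifest file.
# Returns non-zero if any key is missing.
missing_labels() {
  local file="$1" key rc=0
  for key in \
    app.kubernetes.io/name app.kubernetes.io/instance \
    app.kubernetes.io/version app.kubernetes.io/component \
    app.kubernetes.io/part-of app.kubernetes.io/managed-by \
    optum.com/owner optum.com/environment optum.com/cost-center
  do
    if ! grep -Fq "${key}:" "$file"; then
      echo "$key"
      rc=1
    fi
  done
  return "$rc"
}
```

Running `missing_labels apps/production/app/deployment.yaml` prints one missing key per line and exits non-zero, which makes it usable as a CI step or a local pre-commit check.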

Observability Requirements

Pod Annotations for Monitoring

metadata:
  annotations:
    # Prometheus scraping
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'

    # Logging configuration
    logging.optum.com/format: 'json'
    logging.optum.com/parser: 'application'

Required Metrics

MUST expose standard metrics:

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests
`http_request_duration_seconds`	Histogram	Request latency
`http_requests_in_flight`	Gauge	Current requests
`process_cpu_seconds_total`	Counter	CPU usage
`process_resident_memory_bytes`	Gauge	Memory usage
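
One way to verify the table above is to scan a service's metrics output for each required name. The following sketch reads Prometheus text exposition format on stdin and prints the required metrics that have no sample line. `missing_metrics` is a hypothetical name, and histograms are recognized by their `_bucket`, `_sum`, or `_count` series.

```shell
# Hypothetical check: read Prometheus text format on stdin and print the
# required metric names that have no sample. Non-zero exit if any are missing.
missing_metrics() {
  local body name rc=0
  body="$(cat)"
  for name in \
    http_requests_total http_request_duration_seconds \
    http_requests_in_flight process_cpu_seconds_total \
    process_resident_memory_bytes
  do
    # A sample line starts with the name, then labels, a space, or end of line.
    if ! printf '%s\n' "$body" | grep -Eq "^${name}(_bucket|_sum|_count)?([{ ]|\$)"; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

In a development namespace this could be fed from a port-forwarded endpoint, for example `curl -s localhost:8080/metrics | missing_metrics` (the port and path match the annotations in this guide; the local address is an assumption).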

Review Checklist

When reviewing Kubernetes changes:

Security

SecurityContext defined (non-root, read-only fs)
NetworkPolicy defined for production
No privileged containers
No hostPath mounts
Secrets not hardcoded

Reliability

Resource requests and limits set
Liveness and readiness probes configured
PodDisruptionBudget defined
Replica count appropriate for environment
Anti-affinity rules for HA

Observability

Prometheus annotations present
Logging format documented
Standard labels applied

GitOps

All changes in manifest repository
No direct kubectl apply in PR
Kustomize overlays for environment differences
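
Several items in the checklist above can be pre-screened automatically before a human review. The sketch below covers a handful of them with fixed-string searches over a rendered manifest. `lint_manifest` is a hypothetical name and the checks are intentionally crude: they confirm that a string appears somewhere in the file, not that it is set on every container.

```shell
# Hypothetical pre-review lint for a rendered manifest file.
# Prints one line per finding and returns non-zero if there are any.
lint_manifest() {
  local file="$1" item rc=0
  # Strings that should appear somewhere in the manifest.
  for item in 'requests:' 'limits:' 'livenessProbe:' 'readinessProbe:' 'runAsNonRoot: true'; do
    grep -Fq "$item" "$file" || { echo "missing $item"; rc=1; }
  done
  # Strings that should not appear anywhere in the manifest.
  for item in 'privileged: true' 'hostPath:' ':latest'; do
    ! grep -Fq "$item" "$file" || { echo "forbidden $item"; rc=1; }
  done
  return "$rc"
}
```

A passing run does not replace the review; it only removes the most mechanical findings so reviewers can focus on items such as replica counts and anti-affinity rules.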

Related Assets

Kubernetes Pod Debug Assistant (active; owner: epic-platform-sre)
Diagnose failing or unhealthy Kubernetes pods using cluster state, events, and logs. Produces structured root cause analysis with safe remediation recommendations.

Kubernetes Operations Assistant (active; owner: epic-platform-sre)
Assist with Kubernetes cluster operations, debugging, and troubleshooting using read-only diagnostics and GitOps-safe recommendations.

Kubernetes Deployment Best Practices (experimental; owner: epic-platform-sre)
Comprehensive best practices for deploying and managing applications on Kubernetes (Pods, Deployments, Services, Ingress, health checks, resource limits, scaling, and security contexts).

kubernetes-expert (experimental; owner: epic-platform-sre)
Kubernetes and Kustomize operations with GitOps-first safety, debugging patterns, and production deployment guidance.

Dynatrace Kubernetes Service Triage (active; owner: epic-platform-sre)
Systematic triage of a Dynatrace-monitored Kubernetes service using DQL queries for entity discovery, JVM health, thread analysis, pod generation comparison, and Davis problem correlation. Produces structured root cause analysis with Splunk query handoffs for restricted log environments.