uhg-grid-knowledge

Expert knowledge about UHG's Grid multi-cloud service mesh - architecture, IP addressing, DNS, service registration, security model, performance characteristics, and troubleshooting

Status: active
IDE: claude, codex, vscode
Version: 1.0.0
Owner: miverso2_uhg
Tags: grid, service-mesh, consul, haproxy, multi-cloud, networking, uhg

UHG Grid Multi-Cloud Service Mesh Knowledge Base

What is The Grid?

The Grid is UHG's internally-built multi-cloud service mesh that connects isolated cloud accounts (AWS, Azure, GCP) to each other and to on-premise datacenters.

Core Problem Solved:

  • Cloud accounts are isolated "islands" by default
  • Traditional networking takes 1-2 months per connection
  • 8,000+ applications need rapid connectivity
  • Multiple cloud providers need unified service discovery

Grid Solution:

  • 5-10 minute connectivity (vs 1-2 months traditional)
  • Automatic service discovery across all clouds
  • 100% automated (no manual tickets)
  • Supports overlapping IPs (app teams use any IPs they want)

Architecture Components

Core Technologies

┌─────────────────────────────────────────────────────────────────┐
│                         THE GRID                                │
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐       │
│  │   Consul     │   │   HAProxy    │   │     BIND     │       │
│  │ Service Mesh │──>│ Reverse Proxy│──>│  DNS Server  │       │
│  │ + Discovery  │   │ + Routing    │   │ Authoritative│       │
│  └──────────────┘   └──────────────┘   └──────────────┘       │
│         │                   │                   │              │
│         └───────── consul-template ─────────────┘              │
│                  (Auto Config Deployment)                      │
└─────────────────────────────────────────────────────────────────┘

1. Hashicorp Consul

  • Service mesh foundation
  • Service registration and discovery
  • Key-Value store for configuration
  • Health checking
  • Admin partitions for isolation (partition = askid)

2. HAProxy

  • Reverse proxy reading from Consul
  • Dynamically routes based on service discovery
  • Load balancing across instances
  • TLS termination and passthrough

3. consul-template

  • Watches Consul for changes
  • Automatically regenerates HAProxy configs
  • Triggers HAProxy reloads (zero downtime)

4. BIND DNS

  • Authoritative DNS per environment
  • Two patterns: grid.uhg.com and mesh.uhg.com
  • DNS views for different query sources
  • RPZ (Response Policy Zones) with wildcard CNAME

5. Hashicorp Vault

  • Dynamic credential generation
  • PKI secrets engine (per-askid intermediate CA)
  • Consul secrets engine (time-limited tokens)
  • Integration with Venafi for CA issuance

IP Addressing & Network Topology

Grid Infrastructure Address Space

CGNAT Space (RFC6598 Compliant):

100.64.0.0/10 - Designed for service provider use (like Grid)
├─> 100.65.0.0/16 - CIH (Cloud Integration Hub) connectivity
├─> 100.72.0.0/13 - Grid West (us-west-2)
├─> 100.80.0.0/12 - Grid Central (us-east-2, centralus)
└─> 100.112.0.0/12 - Grid East (us-east-1, eastus)

Regional Breakdown:

  • West: 100.72.0.0/13 (prod), 100.74.0.0/16 (prod), 100.75.0.0/16 (stage)
  • Central: 100.80.0.0/12 (prod), 100.92.0.0/15 (prod), 100.94.0.0/15 (stage)
  • East: 100.112.0.0/12 (prod), 100.124.0.0/15 (prod), 100.126.0.0/15 (stage)
  • CIH: 100.65.0.0/16 (connectivity infrastructure, physical layer)
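
The containment relationships in this address plan can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the `REGIONS` mapping and the `region_for` helper are illustrative names, not part of any Grid tooling.

```python
import ipaddress
from typing import Optional

# Address plan as listed above; the dictionary layout is illustrative.
CGNAT = ipaddress.ip_network("100.64.0.0/10")
REGIONS = {
    "CIH": ipaddress.ip_network("100.65.0.0/16"),
    "West": ipaddress.ip_network("100.72.0.0/13"),
    "Central": ipaddress.ip_network("100.80.0.0/12"),
    "East": ipaddress.ip_network("100.112.0.0/12"),
}


def region_for(address: str) -> Optional[str]:
    """Return the name of the block that contains the address, if any."""
    ip = ipaddress.ip_address(address)
    for name, block in REGIONS.items():
        if ip in block:
            return name
    return None


# Every listed block sits inside the RFC 6598 shared address space.
all_inside = all(block.subnet_of(CGNAT) for block in REGIONS.values())

# The stage ranges are carved out of their regional blocks.
west_stage = ipaddress.ip_network("100.75.0.0/16").subnet_of(REGIONS["West"])
central_stage = ipaddress.ip_network("100.94.0.0/15").subnet_of(REGIONS["Central"])
east_stage = ipaddress.ip_network("100.126.0.0/15").subnet_of(REGIONS["East"])
```

An address from an app team VNet, such as `10.100.1.10`, falls outside every block, so `region_for` returns `None` for it.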

Why 100.x.x.x?

  • RFC6598 CGNAT space (Carrier-Grade NAT)
  • Designed specifically for service provider use
  • No conflicts with public internet routing
  • Supports overlapping customer IPs via NAT

Gateway Deployment Model

NOT Hub-and-Spoke:

  • Each askid gets dedicated gateway group (not shared hub)
  • 1:1 VNet/VPC peering between Grid gateway VNet and app team VNet
  • Enables overlapping IP support (app teams can use any IPs)
  • No central bottleneck (distributed architecture)

Gateway Network Sizing:

  • Typically /26 or /27 network per gateway group
  • Peered directly with matching askid+csp+env+region app networks
  • Example: grid-gateways-uhgwm110-022715-azure-prod-eastus

DNS Architecture

Two DNS Patterns

1. mesh.uhg.com - Direct to Consul

Purpose: Within-partition or directly routable connections
Returns: Actual instance IPs (from Consul service registration)
Use case: Services within same partition/trust boundary
Example: myservice.service.uhgwm110-022715.ap.centralus.azu.mesh.uhg.com
Performance: Lower latency (direct routing, no Grid gateway hop)

2. grid.uhg.com - Via Grid Gateways

Purpose: Cross-partition communication (trust boundaries)
Returns: Grid gateway IPs (HAProxy endpoints)
Use case: Cloud-to-cloud, cloud-to-on-prem, cross-askid
Example: myservice.service.uhgwm110-022715.ap.centralus.azu.grid.uhg.com
Performance: Slightly higher latency (Grid gateway hop for security/routing)

DNS Structure

Format:

<service-name>.service.<askid>.ap.<region>.<cloud>.<environment>.grid.uhg.com
                                                      └─ or mesh.uhg.com

Components:
├─> service-name: Your service (e.g., api, database, web)
├─> service: Literal string (always "service")
├─> askid: Application identifier (e.g., uhgwm110-022715, aide-0088590)
├─> ap: Admin partition (always "ap")
├─> region: Cloud region (eastus, us-east-1, central)
├─> cloud: Provider (azu, aws, gcp)
├─> environment: Optional — omit for prod (e.g. "stage" for non-prod)
└─> domain: grid.uhg.com or mesh.uhg.com

Examples:

  • Production: api.service.uhgwm110-022715.ap.eastus.azu.grid.uhg.com
  • Stage: api.service.uhgwm110-022715.ap.eastus.azu.stage.grid.uhg.com
  • Mesh (direct): api.service.uhgwm110-022715.ap.eastus.azu.mesh.uhg.com
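
As a sketch of how the components combine, the helper below assembles a name from its parts. The function name `grid_fqdn` and its parameters are hypothetical; only the label order and the two domains come from the format described above.

```python
from typing import Optional


def grid_fqdn(service: str, askid: str, region: str, cloud: str,
              environment: Optional[str] = None,
              domain: str = "grid.uhg.com") -> str:
    """Assemble a Grid DNS name from the components described above."""
    if domain not in ("grid.uhg.com", "mesh.uhg.com"):
        raise ValueError(f"unsupported domain: {domain}")
    # "service" and "ap" are literal labels in every name.
    labels = [service, "service", askid, "ap", region, cloud]
    if environment:  # the environment label is omitted for production
        labels.append(environment)
    labels.append(domain)
    return ".".join(labels)
```

Calling `grid_fqdn("api", "uhgwm110-022715", "eastus", "azu", environment="stage")` yields the stage example listed above.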

BIND DNS Views

Default View:

Matches: Grid gateways, corporate DNS
Recursion: Enabled
Purpose: Standard DNS resolution for Grid infrastructure

Red View (Future):

Matches: Red network sources
Recursion: Disabled (no recursion for untrusted networks)
Purpose: Isolated DNS for segregated environments

Service Registration

Prerequisites

Before registering any service:

  1. Node IP: IP within your VNet/VPC address space (from PCAM)
  2. Service Port: Port your service listens on (1-65535)
  3. Health Check Endpoint: URL/port indicating health (mandatory)
  4. Terraform Setup: TFE workspace with Consul + Vault providers

Registration Process

Step 1: Register Node

resource "consul_node" "app_node" {
  name    = "my-app-node"  # Alphanumeric, dashes, periods only
  address = "10.100.1.10"  # IP from your VNet/VPC

  meta = {
    external-node  = "true"
    external-probe = "true"
  }
}

Step 2: Register Service

TCP Service:

resource "consul_service" "tcp_service" {
  name = "my-service"  # Alphanumeric, dashes only
  node = consul_node.app_node.name
  port = 3306

  tags = [
    "tag_key:<value>",  # Must use quotes
  ]

  check {
    check_id  = "tcp-health"
    name      = "TCP Connection Check"
    tcp       = "10.100.1.10:3306"
    interval  = "60s"
    timeout   = "5s"
    deregister_critical_service_after = "90m"
  }
}

HTTPS Service:

resource "consul_service" "https_service" {
  name = "my-service"
  node = consul_node.app_node.name
  port = 443

  tags = [
    "https",                    # Protocol designation
    "haproxy_mode: https",      # HAProxy traffic handling
    "haproxy_port_ssl: true",   # Enable SSL functionality
    "tag_key:<value>",
  ]

  check {
    check_id  = "https-health"
    name      = "HTTPS Health Check"
    http      = "https://10.100.1.10:443/health"  # The check attribute is "http", even for an https:// URL
    interval  = "5s"
    timeout   = "2s"
    deregister_critical_service_after = "30m"
  }
}

Step 3: Automatic Propagation

  • consul-template detects new service (watches Consul)
  • HAProxy config regenerated with new backend
  • HAProxy reloaded (zero downtime)
  • BIND DNS updated automatically
  • Service available at: <service-name>.service.<askid>.ap.<region>.<cloud>.grid.uhg.com

Reserved Ports (DO NOT USE)

Grid Infrastructure Ports:

22          - SSH
25          - SMTP
53, 953     - DNS
80          - HTTP (returns 400 - SSL required)
389, 636    - LDAP
443         - HTTPS (service communication)
464         - Kerberos SSL (to ad-ldap-app.uhc.com)
2000        - TCP SNI (gateway-to-gateway)
2001        - TCP Proxy Protocol
3389        - RDP
8200-8202   - Vault
8300-8302   - Consul (coordination)
8500-8501   - Consul HTTP/HTTPS API
8600        - Consul DNS
9000        - HAProxy stats
9443        - mTLS HTTPS (gateway-to-gateway)
9999        - Dynatrace
10000       - HAProxy peers

Safe Application Ports:

  • MySQL: 3306, 3307, 3308
  • PostgreSQL: 5432
  • MSSQL: 1433
  • Kafka: 8083 (TLS)
  • Custom ports outside reserved ranges
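
A port can be checked against the reserved list before registration. This is a minimal sketch: `RESERVED_PORTS` is transcribed from the list above and `is_reserved` is a hypothetical helper, so re-check both against the current Grid documentation before relying on them.

```python
# Reserved Grid infrastructure ports, transcribed from the list above.
RESERVED_PORTS = (
    {22, 25, 53, 80, 389, 443, 464, 636, 953, 2000, 2001, 3389,
     8500, 8501, 8600, 9000, 9443, 9999, 10000}
    | set(range(8200, 8203))  # Vault: 8200-8202
    | set(range(8300, 8303))  # Consul: 8300-8302
)


def is_reserved(port: int) -> bool:
    """Return True if the port collides with a Grid infrastructure port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port in RESERVED_PORTS
```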

Security & Access Control

TLS Requirements

Mandatory:

  • All Grid services MUST use TLS/SSL (HTTPS)
  • Grid gateway → Grid gateway uses mTLS (port 9443)
  • Must use Optum-sanctioned root CAs
  • Self-signed certificates NOT supported

Certificate Sources:

⚠️ Privileged operation — modifies system CA trust. Requires explicit user confirmation; never run autonomously. Verify the certificate fingerprint against the UHG PKI source before installing.

Repository: https://repo1.uhc.com/artifactory/UHG-certificates/
├─> Java: standard_trusts.jks (KeyStore format)
└─> VMs/K8s: standard_trusts.pem (PEM bundle)

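Installation (after downloading standard_trusts.pem and verifying its fingerprint):
# Debian/Ubuntu (update-ca-certificates only picks up files ending in .crt)
sudo cp standard_trusts.pem /usr/local/share/ca-certificates/standard_trusts.crt
sudo update-ca-certificates

# RHEL/CentOS
sudo cp standard_trusts.pem /etc/pki/ca-trust/source/anchors/standard_trusts.pem
sudo update-ca-trust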

PKI Architecture

CA Hierarchy:

UHG-Grid-RootCA1 (PKI team, HSM-backed)
└─> UHG-Grid-PolicyCA1 (6-year intermediate, expires 2/26/2030)
    └─> Per-askid Intermediate CA (path length 0, leaf-only)
        └─> Service certificates (62-day TTL)

Certificate Lifecycle:

  • TTL: 62 days
  • Rotation: Every 21 days (immutable infrastructure)
  • Process: Complete VM replacement (tear down, revoke cert, provision new VM)
  • Venafi: Issues per-askid intermediate CA (3-year validity)
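
The dates implied by this lifecycle are simple to derive. The sketch below assumes the 21-day rotation is counted from the issuance date; the `lifecycle` helper is illustrative only.

```python
from datetime import date, timedelta

CERT_TTL = timedelta(days=62)           # service certificate TTL
ROTATION_INTERVAL = timedelta(days=21)  # VM and certificate replacement interval


def lifecycle(issued: date) -> dict:
    """Return the rotation date, expiry date, and remaining margin in days."""
    rotate_on = issued + ROTATION_INTERVAL
    expires_on = issued + CERT_TTL
    return {
        "rotate_on": rotate_on,
        "expires_on": expires_on,
        "margin_days": (expires_on - rotate_on).days,
    }
```

Under that assumption, a certificate is replaced with 41 days of validity still remaining.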

Access Control Model

Vault + Consul Integration:

Vault namespace = Consul partition = TFE project = askid

Example: askid "uhgwm110-022715"
├─> Vault namespace: uhgwm110-022715
├─> Consul partition: uhgwm110-022715
└─> TFE project: uhgwm110-022715

Dynamic Credentials (Time-Limited):

Consul secrets engine path: consul/{cloud}/{region}
├─> Located within askid's Vault namespace
├─> Different Vault instances per environment
├─> Tokens are time-limited (seconds to hours)
└─> No static credentials

AD Group-Based Access:

Format: ARC_Vault_{ASKID}_{env}_{permission}
(In the examples below, the askid is upper-cased and its dash becomes an underscore.)

Examples:
├─> ARC_Vault_UHGWM110_022715_nonprod_Read
├─> ARC_Vault_UHGWM110_022715_nonprod_Write
├─> ARC_Vault_UHGWM110_022715_prod_Read
└─> ARC_Vault_UHGWM110_022715_prod_Write

Token Roles:
├─> Read group = read-only Consul tokens
└─> Write group = service-registration tokens

Terraform Enterprise:
├─> ARC_Terraform_UHGWM110_022715_Read
└─> ARC_Terraform_UHGWM110_022715_Write
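
The group names can be derived from an askid as sketched below. The normalization (upper-case, dash to underscore) is inferred from the examples above rather than from a documented rule, and both helper names are hypothetical.

```python
def _normalize(askid: str) -> str:
    # Inferred from the examples: "uhgwm110-022715" becomes "UHGWM110_022715".
    return askid.upper().replace("-", "_")


def vault_ad_group(askid: str, env: str, permission: str) -> str:
    """Build a Vault AD group name: ARC_Vault_{ASKID}_{env}_{permission}."""
    return f"ARC_Vault_{_normalize(askid)}_{env}_{permission}"


def terraform_ad_group(askid: str, permission: str) -> str:
    """Build a Terraform Enterprise AD group name for the askid."""
    return f"ARC_Terraform_{_normalize(askid)}_{permission}"
```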

Why This Security Model Works:

Compromised network CANNOT modify routing:
├─> Need AD group membership (can't get without IAM access)
├─> Need to check out Vault token (requires AD group)
├─> Need to modify Consul services (requires valid token)
└─> Network access ≠ routing control (identity-based security)

Cross-Partition Connectivity

CMDB Supply Chain Dependencies (MANDATORY):

For cross-askid communication (different Consul partitions):
├─> Step 1: Service Line Owner (SLO) requests dependency
├─> Step 2: Documented in CI Central (ServiceNow)
├─> Step 3: Relationship: "Critical for Continuous Operation (OP)"
├─> Step 4: Consul KV updated with approved upstreams
├─> Step 5: HAProxy config auto-generated
└─> Step 6: Service accessible (after the 25-hour propagation window)

Verification:
├─> Consul UI → Select askid → Key/Value → grid → upstreams
└─> Target askid should appear in upstream list

Important: Unidirectional (A→B doesn't grant B→A)

Without CMDB approval, cross-partition communication will NOT work.

Platform Governance

Dual TFE Clusters

terraform.uhg.com (App Teams):

  • Application infrastructure
  • App team workspaces
  • Service registration
  • Read/Write AD groups grant access

tfe-arc.uhg.com (Platform Engineers):

  • Grid infrastructure
  • Initializers (onboarding automation)
  • Grid gateways
  • Consul partitions, Vault namespaces, PKI setup
  • Walled off from app teams
  • Only accessible by platform/ops

Zero Customization Policy

Immutable Naming Convention:

Askid naming is immutable:
├─> Format: uhgwm{digits}-{digits} or aide-{digits}
├─> NO customization allowed
├─> NO exceptions (even for platform engineers)
├─> Example rejected: "my-special-app-project"
└─> Must use: "uhgwm110-022715" (as issued)
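
A quick format check for askids, as a sketch: the pattern below encodes only the two formats listed above, and the digit counts are left open because the source does not specify them. `is_valid_askid` is a hypothetical helper.

```python
import re

# uhgwm{digits}-{digits} or aide-{digits}, as listed above.
ASKID_PATTERN = re.compile(r"uhgwm\d+-\d+|aide-\d+")


def is_valid_askid(value: str) -> bool:
    """Return True if the whole string matches one of the askid formats."""
    return ASKID_PATTERN.fullmatch(value) is not None
```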

Everything automated:
├─> Engineers cannot manually create workspaces
├─> Engineers cannot manually modify things
├─> All changes through automation
└─> Grid team follows same rules (no special privileges)

Performance Characteristics

Production Data (Azure)

Grid:

Total traffic: 8-15 Gbps sustained
Single askid: 3+ Gbps (cross-region)
Latency: 7.155 ms average (on-prem to Azure East)
Per-connection throughput: 2.73 Gbps measured (iperf3)
Multi-stream (7): 10.0 Gbps sustained
Path: 12 hops (complete visibility)
Status: 3,000+ Azure subscriptions, 3+ years GA

NGCN (Aviatrix + Palo Alto) for comparison:

Total traffic: 100-420 Mbps (entire fabric)
Latency: 26.189 ms average (on-prem to Azure)
Per-connection: ~1.2 Gbps reported (iperf3 testing disabled)
Path: ~20 hops (destination cannot be confirmed by traceroute)
Status: Minimal adoption despite years available

Grid vs NGCN:

  • Grid handles 19x-150x MORE traffic
  • Grid has 3.66x lower latency (73% latency reduction)
  • Grid uses 40% fewer hops (12 vs ~20)
  • Grid allows performance verification (transparent)
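
The comparison figures follow directly from the measurements quoted above. The sketch below recomputes them; the variable names are illustrative. The hop comparison can be read two ways, and both are computed: Grid's path is 40% shorter than NGCN's, which is the same as NGCN's path being about 67% longer than Grid's.

```python
# Measurements quoted above (Mbps, milliseconds, hop counts).
GRID_TRAFFIC_MBPS = (8_000, 15_000)
NGCN_TRAFFIC_MBPS = (100, 420)
GRID_LATENCY_MS, NGCN_LATENCY_MS = 7.155, 26.189
GRID_HOPS, NGCN_HOPS = 12, 20

# Traffic: lowest Grid figure vs highest NGCN figure, and the reverse.
traffic_ratio_low = GRID_TRAFFIC_MBPS[0] / NGCN_TRAFFIC_MBPS[1]
traffic_ratio_high = GRID_TRAFFIC_MBPS[1] / NGCN_TRAFFIC_MBPS[0]

latency_ratio = NGCN_LATENCY_MS / GRID_LATENCY_MS
latency_reduction_pct = (1 - GRID_LATENCY_MS / NGCN_LATENCY_MS) * 100

fewer_hops_pct = (1 - GRID_HOPS / NGCN_HOPS) * 100  # Grid relative to NGCN
extra_hops_pct = (NGCN_HOPS / GRID_HOPS - 1) * 100  # NGCN relative to Grid

print(f"traffic: {traffic_ratio_low:.0f}x to {traffic_ratio_high:.0f}x")
print(f"latency: {latency_ratio:.2f}x lower ({latency_reduction_pct:.0f}% reduction)")
print(f"hops: {fewer_hops_pct:.0f}% fewer for Grid, {extra_hops_pct:.0f}% more for NGCN")
```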

Expected Latency Ranges

On-Premise to Cloud:

  • Azure East: ~7-10 ms
  • Azure Central: ~10-15 ms
  • Azure West: ~15-25 ms
  • AWS: Similar to Azure (Direct Connect)
  • GCP: Similar to Azure (Cloud Interconnect)

Cloud-to-Cloud (via Grid):

  • Same region: +1-3 ms (Grid gateway hop)
  • Cross-region: Depends on cloud provider backbone
  • Cross-cloud: Via on-prem transit (higher latency)

Deployment Status

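| Cloud Provider | Region      | Non-Production | Production | Notes                       |
|----------------|-------------|----------------|------------|-----------------------------|
| Azure          | Central     | ✅ GA          | ✅ GA      | 3+ years, 8-15 Gbps traffic |
| Azure          | East        | ✅ GA          | ✅ GA      | 3+ years, proven            |
| Azure          | West        | ✅ GA          | ✅ GA      | 3+ years, proven            |
| GCP            | Central     | ✅ GA          | ✅ GA      | 2+ years, proven            |
| GCP            | East        | ✅ GA          | ✅ GA      | 2+ years, proven            |
| AWS            | All Regions | 🚧 Pending     | 🚧 Pending | v2 integration in progress  |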

Note: GCP Grid overlay unavailable for teams using Aviatrix networking in GCP.

Cloud-Specific Implementation

Azure Grid

Status: ✅ Generally Available (Central, East, West)

Azure SQL Managed Instance Integration:

  • ⚠️ Cannot proxy entire SQL MI subnet (not permitted)
  • Must use private endpoints for SQL MI connectivity

Private Endpoint Setup:

  1. Create private endpoint for SQL MI instance
  2. Register private endpoint IP with Grid (port 1433)
  3. Submit ServiceNow request for certificate SAN updates
    • Include: SQL MI FQDN from private endpoint DNS tab
  4. Open DNS change ticket to alias Grid gateway IPs

Why Private Endpoints?

  • SQL MI enforces TLS host name verification
  • Private endpoints enable E2E encryption + authentication
  • Maintains network isolation while enabling connectivity

GCP Grid

Status: ✅ Generally Available (Central, East)

Cloud-Native Service Connectivity (Cloud SQL, etc.):

Private Service Connect (PSC) Required:

psc_config {
  psc_enabled = true
  allowed_consumer_projects = [
    "<APP-PROJECT-ID>",
    "<GRID-PROJECT-ID>"
  ]
}

Grid Project IDs (example — verify current IDs with Grid team):
├─> NonProd: <GRID-NONPROD-PROJECT-ID>
└─> Prod: <GRID-PROD-PROJECT-ID>

PSC Connection Process:

  1. Provision cloud-native service with Grid project ID in PSC allowlist
  2. Notify Grid team with:
    • Connection Name
    • Service attachment
    • Default TCP port
  3. Grid team creates PSC endpoint (automated)
  4. Verify connectivity

Important:

  • ⚠️ Services not provisioned with PSC cannot be retrofitted (must recreate)
  • 🚀 Grid team working on automatic PSC endpoint scanning/creation
  • ❌ Grid overlay unavailable for Aviatrix networking in GCP

AWS Grid

Status: 🚧 v2 Integration Pending (All Regions)

Infrastructure Foundation:

  • Physical layer: 4x 100 Gbps Direct Connect circuits
  • Network fabric: AWS Cloud WAN (us-east-1, us-east-2, us-west-2)
  • Gateway layer: TGW peerings to Cloud WAN
  • Connectivity: Cologix (Minneapolis) + Equinix (Chicago)

When Complete:

  • Same service registration patterns as Azure/GCP
  • Direct Connect provides low-latency on-prem connectivity
  • Cloud WAN enables automatic inter-region routing

Troubleshooting

Common Issues

1. Service Not Accessible

Check:
├─> Is service registered in Consul? (Consul UI → Services)
├─> Is health check passing? (A service left in critical state is deregistered once deregister_critical_service_after elapses)
├─> Is CMDB dependency configured? (For cross-askid)
├─> Is DNS resolving? (dig or nslookup)
└─> Are TLS certificates valid? (openssl s_client)

Resolution:
├─> Failed health check: Fix application health endpoint
├─> Missing CMDB: Request supply chain dependency (25-hour propagation)
├─> DNS not resolving: Check service name format
└─> TLS errors: Install standard_trusts.pem

2. Cross-Partition Communication Fails

Symptom: Can't reach service in different askid
Root cause: Missing CMDB supply chain dependency

Resolution:
├─> Step 1: Verify askid dependency in CI Central
├─> Step 2: Request "Critical for Continuous Operation (OP)" relationship
├─> Step 3: Wait 25 hours for propagation
├─> Step 4: Verify in Consul KV: grid → upstreams
└─> Step 5: Test connectivity again

3. TLS/SSL Verification Failures

⚠️ Privileged operation — modifies system CA trust. Requires explicit user confirmation; never run autonomously. Verify the certificate fingerprint against the UHG PKI source before installing.

Symptom: SSL certificate validation errors
Root cause: Missing Optum standard trust store

Resolution:
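# Download to a scratch location first, then check the fingerprint before trusting it
curl -fsSL -o /tmp/standard_trusts.pem \
  https://repo1.uhc.com/artifactory/UHG-certificates/standard_trusts.pem
# Prints the SHA-256 fingerprint of the first certificate in the bundle
openssl x509 -in /tmp/standard_trusts.pem -noout -fingerprint -sha256

# Debian/Ubuntu (update-ca-certificates only picks up files ending in .crt)
sudo cp /tmp/standard_trusts.pem /usr/local/share/ca-certificates/standard_trusts.crt
sudo update-ca-certificates

# RHEL/CentOS
sudo cp /tmp/standard_trusts.pem /etc/pki/ca-trust/source/anchors/standard_trusts.pem
sudo update-ca-trust

# Verify (stdin is closed so the command exits after the handshake)
openssl s_client -connect service.grid.uhg.com:443 </dev/null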

4. Port Conflicts

Symptom: Service registration works but connection fails
Root cause: Using reserved Grid infrastructure port

Resolution:
├─> Check port against reserved list (see Reserved Ports section)
├─> Use different port (safe ranges: 3306-3308, 5432, 1433, etc.)
└─> Update service registration with new port

Diagnostic Commands

Check Service Registration:

# Via Consul API (requires token)
curl -H "X-Consul-Token: $CONSUL_TOKEN" \
  https://consul.service.consul:8501/v1/catalog/service/my-service

# Via Consul UI (browser)
https://consul-ui.grid.uhg.com → Services → Search

Check DNS Resolution:

# Check grid.uhg.com (via Grid gateways)
# Should return Grid gateway IPs (100.x.x.x range)
dig api.service.uhgwm110-022715.ap.eastus.azu.grid.uhg.com

# Check mesh.uhg.com (direct to instances)
# Should return the registered instance IPs from your VNet/VPC, not gateway IPs
dig api.service.uhgwm110-022715.ap.eastus.azu.mesh.uhg.com

Test Connectivity:

# Ping Grid gateway
ping <GRID-GATEWAY-IP>

# Traceroute to Grid gateway
traceroute <GRID-GATEWAY-IP>

# Test HTTPS connectivity
curl -v https://api.service.uhgwm110-022715.ap.eastus.azu.grid.uhg.com

# Test with cert validation
openssl s_client -connect api.service.uhgwm110-022715.ap.eastus.azu.grid.uhg.com:443

Check Consul KV for Upstreams:

# Via Consul API
# The URL is quoted so the shell does not treat "?" as a glob character
curl -H "X-Consul-Token: $CONSUL_TOKEN" \
  "https://consul.service.consul:8501/v1/kv/grid/upstreams?recurse"

# Via Consul UI
Consul UI → Key/Value → grid → upstreams
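
The Consul KV API returns a JSON array whose `Value` fields are base64-encoded, so the raw response is hard to read by eye. The sketch below decodes such a response and looks for a target askid. The key layout under grid/upstreams is not documented here, so the sample payload and the `has_upstream` matching rule are assumptions for illustration only.

```python
import base64
import json


def decode_kv(response_body: str) -> dict:
    """Map each returned key to its decoded value ("" when the value is null)."""
    decoded = {}
    for entry in json.loads(response_body):
        raw = entry.get("Value")
        decoded[entry["Key"]] = base64.b64decode(raw).decode("utf-8") if raw else ""
    return decoded


def has_upstream(response_body: str, target_askid: str) -> bool:
    """Return True if the askid appears in any key or value (assumed layout)."""
    return any(target_askid in key or target_askid in value
               for key, value in decode_kv(response_body).items())


# Hypothetical response body; the real keys and values may look different.
sample = json.dumps([
    {"Key": "grid/upstreams/", "Value": None},
    {"Key": "grid/upstreams/aide-0088590",
     "Value": base64.b64encode(b"approved").decode("ascii")},
])
```

If the real layout stores upstreams differently, only `has_upstream` needs to change; `decode_kv` relies solely on the `Key` and `Value` fields of the KV response.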

Grid vs NGCN Comparison

Why Grid is Superior

Architecture:

  • ✅ Grid: Distributed (dedicated per-app capacity)
  • ❌ NGCN: Hub-and-spoke (central bottleneck)

IP Addressing:

  • ✅ Grid: RFC6598 CGNAT (100.x.x.x) - designed for this purpose
  • ❌ NGCN: 6.0.0.0/8 (allocated to US DoD per IANA) - misuse of public IPs

Overlapping IPs:

  • ✅ Grid: Fully supported (NAT at gateway)
  • ❌ NGCN: Not supported (unique IPs required)

Performance:

  • ✅ Grid: 7ms latency, 2.73 Gbps per connection, 10 Gbps multi-stream
  • ❌ NGCN: 26ms latency, ~1.2 Gbps per connection (IPSec limit)

Security:

  • ✅ Grid: Identity-based (Consul ACLs, Vault tokens, AD groups)
  • ❌ NGCN: IP-based (firewall rules only)

Connectivity:

  • ✅ Grid: Private circuits (Direct Connect, ExpressRoute, Interconnect)
  • ❌ NGCN: Public internet (IPSec tunnels)

Time to Connect:

  • ✅ Grid: 5-10 minutes (automated)
  • ❌ NGCN: 1-2 weeks (manual firewall rules)

Team Size:

  • ✅ Grid: 3-5 engineers (constant)
  • ❌ NGCN: Scales with app count (10+ for 1,000 apps)

Production Traffic:

  • ✅ Grid: 8-15 Gbps (real usage, proven)
  • ❌ NGCN: 100-420 Mbps (minimal usage, theoretical)

Network Visibility:

  • ✅ Grid: Complete (troubleshootable)
  • ❌ NGCN: Blocked (security by obscurity)

ESRO Concerns Addressed (Grid Team Perspective)

Concern: "Grid bypasses firewalls"

Reality: Grid has MORE security controls than firewalls
├─> Vault dynamic credentials (time-limited)
├─> Consul ACLs (service-level authorization)
├─> AD group membership (IAM integration)
├─> CMDB supply chain (business logic enforcement)
└─> mTLS gateway-to-gateway (encrypted transit)

NGCN's controls:
├─> IP-based firewall rules (network-level only)
└─> Manual approval workflow (slower)

Verdict: Grid has more technical controls

Concern: "NGCN inspects traffic for threats"

Reality: NGCN Palo Alto firewalls CANNOT decrypt mTLS
├─> Do NOT have application private keys
├─> CANNOT inspect encrypted payload
├─> Can only see: source IP, dest IP, port, protocol
└─> NGCN itself does NOT require TLS (allows plain HTTP)

Grid's approach:
├─> REQUIRES TLS for all services (enforced)
├─> mTLS for gateway-to-gateway (mutual auth)
├─> Applications handle encryption (E2E)
└─> No MITM (preserves E2E encryption)

Verdict: Neither solution inspects encrypted payload
         Grid REQUIRES encryption, NGCN doesn't

Concern: "Grid allows data exfiltration"

Reality: Grid blocks internet traffic (same as NGCN)
├─> No internet ingress (nothing in)
├─> No internet egress (nothing out)
├─> Only registered services trusted
└─> 'Internet' is NOT a registered service

Verdict: Grid has same internet restrictions as NGCN

Concern: "Grid has no approval workflow"

Reality: Grid has CMDB-based approval workflow
├─> Service Line Owner (SLO) requests dependency
├─> Documented in CI Central (ServiceNow)
├─> Consul KV updated with approved upstreams
├─> HAProxy auto-configured from approved list
└─> Unidirectional (A→B doesn't grant B→A)

NGCN's workflow:
├─> Manual firewall rule request
├─> Manual approval by firewall team
└─> Manual implementation

Verdict: Grid has formal approval (CMDB vs firewall)

Concern: "NGCN is proven at scale"

Production data:
├─> NGCN: 100-420 Mbps (minimal usage)
├─> Grid: 8-15 Gbps (19x-150x MORE)
├─> NGCN: Available for years (unused)
└─> Grid: 3+ years GA (3,000+ subscriptions)

Verdict: Grid is proven at UHG, NGCN is not

Support & Resources

Documentation:

Teams Channel:

  • Search: "UHG Grid" in Microsoft Teams

Office Hours:

  • Monday, Tuesday, Wednesday, Friday at 9:00 AM CT
  • Join via Teams meeting link in channel

ServiceNow:

  • Workgroup: "UHG GRID"
  • Request Type: P5 Service Request
  • Provide: Subscription/Account ID, regions, VNet/VPC details

Key Repositories:

  • GitHub Organization: https://github.com/uhg-arc
  • uhg-grid/ - Multi-cloud shared components
  • uhg-grid-aws/ - AWS-specific automation
  • uhg-grid-azure/ - Azure-specific automation
  • uhg-grid-gcp/ - GCP-specific automation
  • uhg-grid-packages/ - Building .deb packages for Grid components
  • tfe-initializer/ - Onboarding automation (Initializers)

Summary: When to Use Grid

Use Grid when you need:

  • ✅ Cloud-to-cloud connectivity (AWS, Azure, GCP)
  • ✅ Cloud-to-on-premise connectivity
  • ✅ Cross-subscription/cross-account connectivity
  • ✅ Rapid connectivity (5-10 minutes vs weeks)
  • ✅ Automatic service discovery
  • ✅ Support for overlapping IP addresses
  • ✅ Identity-based security (not just IP-based)
  • ✅ High-performance connectivity (Gbps scale)
  • ✅ Complete network visibility (troubleshooting)

Grid is NOT for:

  • ❌ Public internet traffic (Grid is internal-only)
  • ❌ Interactive user sessions (service-to-service only)
  • ❌ Applications using Aviatrix in GCP (incompatible)
  • ❌ Applications requiring hub-and-spoke (Grid is distributed)

Grid Status:

  • Azure: GA (3+ years), 8-15 Gbps production traffic
  • GCP: GA (2+ years), production-proven
  • 🚧 AWS: v2 integration pending

Key Takeaway: Grid is UHG's proven multi-cloud networking solution, handling 19x-150x more traffic than alternatives with superior performance, security, and operational efficiency.
