Multi-site active-active means multiple sites serving production traffic simultaneously, with the ability to lose any single site without service degradation. It’s the highest tier of availability architecture, and it’s significantly more complex than active-passive (primary with standby) or active-active in a single data center.
The Figment validator infrastructure ran multi-site active-active across 13 providers. The design requirements were stringent: validator operations demand consistent uptime with geographic distribution, and the failure modes of individual providers are correlated enough that single-provider concentration is a meaningful risk. Building that topology provided hands-on experience with the failure modes that occur in practice and the design patterns that survive them.
What Multi-Site Active-Active Actually Requires
Before the implementation details, the requirements that the architecture must satisfy:
Traffic routing must be globally aware. Users (or automated systems) must reach a functioning site regardless of which site is unavailable. This typically requires global DNS load balancing with health checks, or anycast routing where the network layer routes to the nearest healthy site.
State must be consistent or partitioned. If multiple sites are serving writes simultaneously, how does state stay consistent across sites? This is the hardest problem in distributed systems and the one that most multi-site designs get wrong or avoid.
Health checking must be accurate and fast. Routing traffic to a failed site is worse than having no multi-site setup. Health checks must reflect actual service availability, not just host reachability.
Failover must be tested. An untested failover is a theoretical failover. Regular failure injection — deliberately taking a site down during a maintenance window — is the only way to verify that the failover actually works.
DNS-Level Traffic Distribution
The entry point for multi-site Kubernetes is traffic distribution. Route 53 (AWS), Cloud DNS (GCP), and Cloudflare all support geolocation-based routing with health checks.
The health-check-based routing pattern:
- Each site exposes a health endpoint (typically /healthz) that returns 200 when the site is healthy and serving real traffic
- The global DNS health checker polls this endpoint from multiple vantage points
- When a site fails its health check, DNS stops routing to it
- Surviving sites absorb the traffic
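The health endpoint in that flow must exercise real dependencies, not just confirm the process is running. A minimal framework-free sketch (check_database is a hypothetical stand-in for whatever dependency probes your site actually needs):

```python
# Minimal sketch of a dependency-aware health endpoint.
# A check passes by returning normally and fails by raising.

def healthz(checks):
    """Return (status_code, body): 200 only if every dependency check passes."""
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            # A 503 here pulls this site out of DNS rotation.
            return 503, f"unhealthy: {name}: {exc}"
    return 200, "ok"

def check_database():
    # Hypothetical stand-in: a real check would open a connection,
    # run SELECT 1, and raise on failure or timeout.
    return True

status, body = healthz({"database": check_database})
```

The key design point is that the checks probe the same dependencies the real request path uses, so a 200 means "this site can serve traffic," not merely "this host is reachable."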
The propagation delay of DNS changes — typically 30-120 seconds for modern TTLs — means there’s a brief window where clients may route to a failed site. For most use cases this is acceptable; for the strictest availability requirements, anycast provides faster failover.
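The worst-case window is detection time plus cache expiry. A rough worked example, using illustrative numbers (a 30-second check interval, three consecutive failures to mark a site unhealthy, a 60-second TTL — these are assumptions, not universal defaults):

```python
# Rough worst-case client-visible failover window for DNS-based routing.
# All three numbers are illustrative assumptions.
check_interval_s = 30   # health checker polls every 30 seconds
failure_threshold = 3   # consecutive failures before marking unhealthy
dns_ttl_s = 60          # record TTL cached by resolvers

detection_s = check_interval_s * failure_threshold  # time to declare the site down
worst_case_s = detection_s + dns_ttl_s              # plus resolvers aging out the old answer

print(worst_case_s)  # 150 seconds in this configuration
```

Tightening any of the three parameters shortens the window, at the cost of more health-check traffic or more DNS queries.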
ExternalDNS (the Kubernetes controller) automates DNS record management based on Kubernetes Service and Ingress resources. It eliminates the manual DNS updates that otherwise accompany deployments.
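As an illustration, a Service annotated for ExternalDNS (the hostname and port values are placeholders):

```yaml
# Hypothetical Service: ExternalDNS watches these annotations and
# creates/updates the matching record in Route 53, Cloud DNS, or Cloudflare.
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - port: 443
      targetPort: 8443
```

With each site's cluster running its own ExternalDNS instance, every site registers its own endpoint under the shared hostname, and the DNS provider's health-check routing does the rest.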
State Distribution: The Hard Part
Multi-site active-active with read-only workloads is straightforward. Multi-site with writes is the hard problem.
Read-only or read-heavy workloads can run active-active trivially: each site serves reads from a local replica, while a single primary database accepts writes from any site and replicates out to the per-site replicas. This covers many workload categories: CDN-like content serving, API servers that primarily read from a database, static content.
Write-heavy workloads require a database that supports multi-primary writes. The options:
CockroachDB — distributed SQL database designed for multi-region active-active. Handles distributed writes with configurable consistency guarantees. The operational overhead is higher than PostgreSQL, and the query optimization requires understanding distributed execution plans.
Cassandra/ScyllaDB — high-performance distributed databases with eventual consistency semantics. Appropriate for use cases where eventual consistency is acceptable (user activity feeds, time-series data). Not appropriate where strict consistency is required.
Vitess — MySQL sharding and distributed coordination layer. Adds multi-site write capability to MySQL with complex operational overhead.
Application-level partitioning — route different user segments or data partitions to different sites, with each site being authoritative for its partition. A user in Europe writes to the European site; a user in the US writes to the US site. The cross-region consistency problem is avoided by making each site authoritative for a partition.
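The partition-routing idea can be sketched as a lookup from a user's home region to the site that owns their partition (region names and endpoints here are hypothetical):

```python
# Application-level partitioning: each site is authoritative for the
# users homed in its region. The region -> site mapping is hypothetical.
AUTHORITATIVE_SITE = {
    "eu": "https://eu.example.internal",
    "us": "https://us.example.internal",
}

def write_endpoint(user_region: str) -> str:
    """Route a user's writes to the single site that owns their partition."""
    try:
        return AUTHORITATIVE_SITE[user_region]
    except KeyError:
        raise ValueError(f"no authoritative site for region {user_region!r}")

# A European user's writes always land on the European site, so no
# cross-site write conflict can exist for that user's data.
```

The trade-off: a user writing from outside their home region pays cross-region latency, and rebalancing partitions between sites becomes its own migration project.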
For most multi-site Kubernetes deployments, the practical recommendation is: use database read replicas per region for reads, route writes to a primary region, and use active-passive (not active-active) for the write path. True active-active writes are only necessary when write latency from a non-primary region is unacceptable — which is a specific requirement, not a universal one.
Kubernetes Federation and Multi-Cluster Management
Managing multiple Kubernetes clusters requires tooling beyond a single cluster’s scope. The tools in this space:
KubeFed (Kubernetes Cluster Federation) — the former official multi-cluster federation project. It allowed deploying resources to multiple clusters from a single control point, with consistent resource definitions and federated HPA, but it never reached operational maturity and the project has since been archived.
Argo CD — the GitOps tool that manages deployments per cluster. Can be configured to deploy the same application definitions across multiple clusters simultaneously. This is the practical approach most teams take — not federation, but coordinated GitOps across clusters.
Liqo — enables workload offloading between clusters, where a cluster that’s under capacity can schedule pods on another cluster’s nodes. More dynamic than federation; also more operationally complex.
Crossplane — infrastructure composition across providers, useful for managing the provider-level resources (databases, networking) that sit below the Kubernetes layer across multiple clouds.
The recommendation: use Argo CD for multi-cluster application management (it's mature and widely deployed), use Crossplane if you want a Kubernetes-native control plane for cloud provider resources, and evaluate federation tooling only if you have specific requirements that multi-cluster Argo CD doesn't handle.
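As a sketch of the coordinated-GitOps approach, an Argo CD ApplicationSet can use the cluster generator to stamp the same application onto every cluster registered with Argo CD (the repo URL, path, and namespace are placeholders):

```yaml
# Hypothetical ApplicationSet: the cluster generator emits one
# Application per registered cluster, substituting {{name}} and {{server}}.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp
spec:
  generators:
    - clusters: {}          # one Application per cluster known to Argo CD
  template:
    metadata:
      name: 'myapp-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deploy
        targetRevision: main
        path: myapp
      destination:
        server: '{{server}}'
        namespace: myapp
```

Adding a site then reduces to registering the new cluster with Argo CD; the ApplicationSet controller deploys the same application definition there automatically.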
Service Mesh for Cross-Site Communication
When services in different sites need to call each other — or when you want consistent security policy across sites — a service mesh that spans sites provides the control plane for inter-site traffic.
Istio supports multi-cluster topologies with shared control plane or per-cluster control planes with east-west gateway connectivity. The operational complexity of Istio is significant; for teams without existing Istio expertise, the multi-cluster setup is challenging.
Cilium Cluster Mesh connects Kubernetes clusters using Cilium as the CNI, enabling pod-to-pod connectivity across clusters with consistent network policy. Significantly simpler than Istio for the connectivity use case; doesn’t provide the full service mesh feature set (traffic management, retry policies, circuit breaking).
Linkerd is the simpler service mesh option with strong multi-cluster support. Less feature-complete than Istio; operationally much lighter.
For most teams, explicit service boundaries with well-designed APIs between sites are cleaner than a service mesh spanning sites. The mesh adds value when you need consistent security policy and observability across service-to-service calls in ways that application-layer API design can’t provide.
The Runbook That Gets You Through the Failure
Every multi-site active-active design needs documented runbooks for the failure scenarios:
- Single site unavailable — is traffic failing over automatically? What’s the manual trigger if auto-failover fails? Who validates that the surviving site is handling load?
- Database primary failure — what’s the promotion procedure for the replica? What applications need configuration updates to point at the new primary?
- Split-brain scenario — if network partitioning separates sites, which site is authoritative? How is the conflict resolved when connectivity is restored?
Document these before the failure, not after. Test the runbooks in maintenance windows. The failure that follows a well-practiced runbook is a routine operational event. The failure without a runbook is an incident.
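To make the split-brain question concrete: one common (and deliberately lossy) resolution policy is last-write-wins by timestamp. A toy sketch of merging two sites' divergent copies after connectivity is restored (the keys and records are hypothetical):

```python
# Toy last-write-wins merge after a partition heals. Each site holds
# {key: (timestamp, value)}; the higher timestamp wins per key.
# LWW silently discards the losing write, which is acceptable only if
# the runbook decided that trade-off before the failure.

def lww_merge(site_a: dict, site_b: dict) -> dict:
    merged = dict(site_a)
    for key, (ts, value) in site_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

site_eu = {"user:42": (100, "email=old@example.com")}
site_us = {"user:42": (105, "email=new@example.com")}
print(lww_merge(site_eu, site_us)["user:42"])  # (105, 'email=new@example.com')
```

Whatever policy you pick — last-write-wins, site-priority, or manual review — the runbook should name it explicitly so nobody is choosing a conflict-resolution strategy mid-incident.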
Our Kubernetes and containers practice designs multi-site Kubernetes topologies and has operational experience running them at production scale. Related: the cloud infrastructure decisions for the network topology between sites — BGP peering, VPN connectivity, or cloud interconnect — are tightly coupled with the Kubernetes networking design.