Kubernetes networking is where most operators hit a wall. The abstractions — Pods, Services, Ingress, NetworkPolicy, CNI plugins — compose in ways that aren’t obvious from reading the documentation, and the failure modes are subtle enough that debugging networking issues without a mental model of how the layers interact is frustrating.
The goal of this post is to give you that mental model, then walk through the specific networking problems that come up most often in production.
The Three Networking Problems Kubernetes Solves
Kubernetes networking addresses three distinct communication patterns, each with different requirements:
Pod-to-Pod communication — containers running in Kubernetes pods need to communicate with each other. The Kubernetes networking model specifies that all pods can reach all other pods directly (by pod IP) without NAT. The CNI plugin (Cilium, Calico, Flannel, etc.) is responsible for making this possible by managing routing between nodes.
Pod-to-Service communication — services provide stable virtual IPs (ClusterIPs) that route to the underlying pods, including load balancing across multiple pod replicas and automatic removal of unhealthy pods. kube-proxy (or the CNI's eBPF datapath in modern setups) maintains the iptables/IPVS or eBPF rules that implement this.
External-to-Service communication — external traffic entering the cluster through LoadBalancer Services, NodePort Services, or Ingress controllers, which proxy external traffic to the appropriate Services.
Most networking confusion comes from mixing up which layer is relevant to a given problem. Debugging a connection failure requires knowing which of these three layers you’re operating in.
ClusterIP, NodePort, LoadBalancer: The Service Hierarchy
Services have four types, and the relationship between them matters for understanding what each one does:
ClusterIP (the default) — creates a virtual IP within the cluster, reachable only from within the cluster. NodePort and LoadBalancer Services build on ClusterIP; ExternalName is the exception.
NodePort — exposes the Service on every node’s IP at a specific port (30000-32767 range). External traffic can reach the Service at <node-ip>:<node-port>. LoadBalancer Services automatically create a NodePort.
LoadBalancer — provisions a cloud load balancer in front of the NodePort, giving you a stable external IP that routes to the Service. Requires a cloud controller manager that can provision load balancers (EKS, GKE, AKS all do this; bare-metal requires MetalLB or similar).
ExternalName — creates a CNAME DNS record to an external service. Useful for accessing external databases or APIs with a consistent in-cluster DNS name.
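To make the layering concrete, here is a minimal sketch of a LoadBalancer Service manifest. The `api` label, namespace, and ports are hypothetical, not taken from a real cluster:

```yaml
# Hypothetical Service exposing pods labeled app: api.
# type: LoadBalancer implies a NodePort and a ClusterIP underneath.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - name: http
      protocol: TCP
      port: 80         # the Service port (what clients connect to)
      targetPort: 8080 # the containerPort on the pods
```

Changing `type` to `ClusterIP` removes the external IP and node port while leaving the in-cluster virtual IP behavior unchanged.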
The practical implication: if you’re debugging why an external request isn’t reaching a Service, check the Service type first. If it’s ClusterIP, it’s expected to be unreachable externally — that’s not a bug.
Ingress: HTTP Routing Into the Cluster
Ingress is the layer that routes external HTTP/HTTPS traffic to Services based on hostname and path rules. It requires an Ingress controller running in the cluster (ingress-nginx and Traefik are the most common open-source options; cloud providers offer their own).
A typical Ingress resource:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.yourapp.com
      secretName: api-tls
  rules:
    - host: api.yourapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
```
cert-manager paired with Let’s Encrypt handles TLS certificate issuance and renewal automatically. Without cert-manager, certificate management becomes a manual operation that someone eventually forgets. cert-manager should be one of the first cluster-level tools installed.
The most common Ingress debugging mistakes:
- Missing `ingressClassName` when the cluster has multiple Ingress controllers — the Ingress resource doesn't know which controller should handle it
- Mismatched Service port — the backend port in the Ingress must match the Service's port, not the pod's containerPort
- Incorrect path type — `Prefix` matches `api.yourapp.com/` and all paths beneath it; `Exact` requires an exact match
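The port-matching rule is easier to see against a concrete manifest. A hedged sketch (the names and port numbers are illustrative): given the Service below, an Ingress backend must reference port 80, not 8080.

```yaml
# Hypothetical Service backing an Ingress.
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
  ports:
    - port: 80         # reference THIS port in the Ingress backend
      targetPort: 8080 # the pod's containerPort; invisible to the Ingress
```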
NetworkPolicy: The Firewall Inside the Cluster
By default, all Kubernetes pods can communicate with all other pods, across all namespaces. This is intentionally permissive for ease of development and increasingly inappropriate for production environments where you need to limit blast radius.
NetworkPolicy resources define ingress and egress rules for pod communication. A minimal zero-trust baseline: deny all traffic by default, then explicitly allow what’s needed.
```yaml
# Default deny all ingress in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow specific traffic to the API service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```
Critical: NetworkPolicy is only enforced if your CNI plugin supports it. Flannel does not enforce NetworkPolicy; Cilium and Calico do. If you apply NetworkPolicy with Flannel, the policies are stored in etcd but not enforced — all traffic is still permitted. Verify that your CNI enforces NetworkPolicy before relying on it for security.
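A common follow-on gotcha: if you also apply a default-deny egress policy, pods lose DNS resolution unless port 53 to CoreDNS is explicitly allowed. A sketch, assuming a default CoreDNS install (the `kube-system` and `k8s-app: kube-dns` labels match most clusters, but verify yours):

```yaml
# Allow DNS egress to CoreDNS from all pods in the namespace.
# Label selectors below are assumptions based on default installs.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```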
DNS and Service Discovery
Kubernetes includes CoreDNS as the cluster-internal DNS server. Every Service gets a DNS entry: <service-name>.<namespace>.svc.cluster.local. Pods can use the short form <service-name> within the same namespace.
Common DNS debugging steps:
```bash
# Run a debug pod with network tools
kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never

# Inside the debug pod:
dig api-service.production.svc.cluster.local
nslookup api-service.production.svc.cluster.local
curl http://api-service.production.svc.cluster.local/healthz
```
If `dig` returns a valid IP but `curl` fails, the networking layer below DNS is the issue (CNI routing, NetworkPolicy). If `dig` fails to resolve, the problem is in CoreDNS configuration or the Service definition.
CoreDNS configuration problems are less common than CNI routing problems, but they occur. Check that CoreDNS pods are running and have available resources: a CoreDNS pod that is being OOMKilled causes intermittent DNS resolution failures that are frustrating to diagnose.
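If CoreDNS memory pressure turns out to be the cause, raising its limits is the targeted fix. A sketch of the container `resources` block in the coredns Deployment in kube-system; the values are illustrative, so check actual usage with `kubectl top` before choosing numbers:

```yaml
# Illustrative resources block for the coredns container
# (edit via: kubectl -n kube-system edit deployment coredns).
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 256Mi
```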
CNI Selection for Production
The CNI plugin determines how pod networking is implemented. For new production clusters:
Cilium is the modern choice. eBPF-based, excellent performance, NetworkPolicy enforcement, service mesh capabilities without a sidecar proxy (Cilium Service Mesh), and native support for cluster mesh across multiple clusters. The operational model has matured significantly; it’s now the right default for teams willing to invest in understanding it.
Calico is the mature alternative with a longer production track record than Cilium. Supports NetworkPolicy, BGP peering for direct hardware routing, and flexible routing modes. Slightly less cutting-edge than Cilium but more operational documentation for complex scenarios.
Avoid changing CNI on an existing cluster without careful planning — it requires draining nodes and re-provisioning networking, which is essentially cluster replacement.
Debugging Checklist
When a network connection fails, work through the layers:
1. Can the source pod reach the DNS server? (`nslookup kubernetes.default`)
2. Does the target Service resolve? (`nslookup <service-name>.<namespace>`)
3. Does the Service have endpoints? (`kubectl get endpoints <service-name> -n <namespace>`)
4. Are the target pods healthy? (`kubectl get pods -n <namespace> -l app=<target>`)
5. Is there a NetworkPolicy blocking the connection? (`kubectl get networkpolicy -n <namespace>`)
6. Is the Service definition correct — type, ports, selector? (`kubectl describe svc <service-name> -n <namespace>`)
Step 3 — checking endpoints — is the single most useful debugging step. A Service with no endpoints means either no pods are matching the selector, or the pods exist but fail their readiness probe. Either of those is the root cause, not the network.
Our Kubernetes and containers practice includes networking design and troubleshooting as a core competency. Related: the networking model in Kubernetes extends naturally to multi-site active-active deployments, where the same mental model applies across cluster boundaries.