Most of the focus of this section will be on the standard node-local proxy implementation called kube-proxy. It is used by default by most Kubernetes orchestrators and is installed as a daemonset on top of a newly bootstrapped cluster:
$ kubectl get daemonset -n kube-system -l k8s-app=kube-proxy
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-proxy   3         3         3       3            3           kubernetes.io/os=linux   2d16h
The default mode of operation for kube-proxy is iptables, as it provides support for a wider set of operating systems without requiring extra kernel modules and has “good enough” performance characteristics for the majority of small to medium-sized clusters.
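If you want to confirm which mode your cluster is running in, one option in kubeadm-based clusters (including the kind-based lab used below) is to check kube-proxy’s configuration, which is stored in a ConfigMap; the output below is illustrative, and an empty mode value also falls back to iptables:
$ kubectl get configmap kube-proxy -n kube-system -o yaml | grep "mode:"
    mode: iptables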
This area of Kubernetes networking is one of the most poorly documented. On the one hand, there are blog posts that cover parts of the kube-proxy dataplane; on the other, there’s an amazing diagram created by Tim Hockin that shows the complete logical flow of packet forwarding decisions, but it provides very little context and is quite difficult to trace for specific flows. The goal of this article is to bridge the gap between these two extremes: provide a high level of detail while maintaining an easily consumable format.
For demonstration purposes, we’ll use the following topology with a “web” deployment and two pods scheduled on different worker nodes. The packet forwarding logic for ClusterIP-type services has two distinct paths within the dataplane, and these are what we’ll focus on next:
The above diagram shows a slightly simplified sequence of match/set actions implemented inside Netfilter’s NAT table. The lab section below will show a more detailed view of this dataplane along with verification commands.
One key thing to remember is that none of the ClusterIPs implemented this way are visible in the Linux routing table. The whole dataplane is implemented entirely within iptables’ NAT table, which makes it very flexible but, at the same time, extremely difficult to troubleshoot.
To make sure the lab is in the right state, reset it to a blank state:
make up && make reset
Now let’s spin up a new deployment and expose it with a ClusterIP service:
$ kubectl create deploy web --image=nginx --replicas=2
$ kubectl expose deploy web --port 80
The result of the above two commands can be verified like this:
$ kubectl get deploy web
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
web    2/2     2            2           160m
$ kubectl get svc web
NAME   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
web    ClusterIP   10.96.94.225   <none>        80/TCP    31s
The simplest way to test connectivity is to connect to the assigned ClusterIP 10.96.94.225 from one of the nodes, e.g.:
$ docker exec k8s-guide-worker curl -s 10.96.94.225 | grep Welcome
<title>Welcome to nginx!</title>
<h1>Welcome to nginx!</h1>
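Note that this works despite the fact that, as mentioned above, the ClusterIP is absent from the node’s routing table; a quick sanity check with ip route get (output illustrative for this lab) shows the packet would simply follow the default route and only gets steered to the right Pod by the NAT rules examined below:
$ docker exec k8s-guide-worker ip route get 10.96.94.225
10.96.94.225 via 172.18.0.1 dev eth0 src 172.18.0.12 uid 0
    cache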
One last thing before moving on: let’s set up the following bash alias as a shortcut to k8s-guide-worker’s iptables NAT table:
$ alias d="docker exec k8s-guide-worker iptables -t nat -nvL"
According to Tim’s diagram, all Pod-to-Service packets get intercepted by the PREROUTING chain:
$ d PREROUTING
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target         prot opt in   out  source      destination
  313 18736 KUBE-SERVICES  all  --  *    *    0.0.0.0/0   0.0.0.0/0      /* kubernetes service portals */
   36  2242 DOCKER_OUTPUT  all  --  *    *    0.0.0.0/0   172.16.0.190
These packets get redirected to the KUBE-SERVICES chain, where they get matched against all configured ClusterIPs, eventually reaching these lines:
$ d KUBE-SERVICES | grep 10.96.94.225
    3   180 KUBE-MARK-MASQ             tcp  --  *    *    !10.244.0.0/16   10.96.94.225   /* default/web cluster IP */ tcp dpt:80
    3   180 KUBE-SVC-LOLE4ISW44XBNF3G  tcp  --  *    *    0.0.0.0/0        10.96.94.225   /* default/web cluster IP */ tcp dpt:80
Since the source IP of the packet belongs to a Pod (10.244.0.0/16 is the PodCIDR range), the second line gets matched and the lookup continues in the service-specific chain. Here we have two Pods matching the same label selector (--replicas=2), and traffic is distributed equally between the two endpoint chains:
$ d KUBE-SVC-LOLE4ISW44XBNF3G
Chain KUBE-SVC-LOLE4ISW44XBNF3G (1 references)
 pkts bytes target                     prot opt in   out  source      destination
    0     0 KUBE-SEP-MHDQ23KUGG7EGFMW  all  --  *    *    0.0.0.0/0   0.0.0.0/0   /* default/web */ statistic mode random probability 0.50000000000
    0     0 KUBE-SEP-ZA2JI7K7LSQNKDOS  all  --  *    *    0.0.0.0/0   0.0.0.0/0   /* default/web */
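Each KUBE-SEP chain corresponds to one backend endpoint of the Service; you can cross-reference them with the Endpoints object (the Pod IPs below are from this lab and will differ in your environment):
$ kubectl get endpoints web
NAME   ENDPOINTS                     AGE
web    10.244.1.3:80,10.244.2.3:80   160m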
Note that only the first rule carries an explicit match probability; packets that don’t match it fall through to the second rule, which acts as a catch-all, so each endpoint receives roughly half of all new connections. Let’s assume that in this case the first rule gets matched, so our packet continues on to the next chain, where it gets DNAT’ed to the IP of the destination Pod (10.244.1.3):
$ d KUBE-SEP-MHDQ23KUGG7EGFMW
Chain KUBE-SEP-MHDQ23KUGG7EGFMW (1 references)
 pkts bytes target          prot opt in   out  source       destination
    0     0 KUBE-MARK-MASQ  all  --  *    *    10.244.1.3   0.0.0.0/0   /* default/web */
    3   180 DNAT            tcp  --  *    *    0.0.0.0/0    0.0.0.0/0   /* default/web */ tcp to:10.244.1.3:80
From here on our packet remains unmodified and continues along its forwarding path set up by a CNI plugin until it reaches the target Node and gets sent directly to the destination Pod.
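One way to verify this translation (assuming the conntrack tool is available inside the node’s image) is to look the flow up in the connection tracking table; in the illustrative output below, the first tuple holds the original destination (the ClusterIP) and the second one the DNAT’ed reply source:
$ docker exec k8s-guide-worker conntrack -L -d 10.96.94.225
tcp      6 117 TIME_WAIT src=10.244.2.5 dst=10.96.94.225 sport=48816 dport=80 src=10.244.1.3 dst=10.244.2.5 sport=80 dport=48816 [ASSURED] mark=0 use=1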
Now let’s assume that the k8s-guide-worker node itself (IP 172.18.0.12) is sending a packet to our ClusterIP service. This packet gets intercepted in the OUTPUT chain and continues to the KUBE-SERVICES chain, where this time it takes a detour via the KUBE-MARK-MASQ chain:
$ d OUTPUT
Chain OUTPUT (policy ACCEPT 224 packets, 13440 bytes)
 pkts bytes target         prot opt in   out  source      destination
 4540  272K KUBE-SERVICES  all  --  *    *    0.0.0.0/0   0.0.0.0/0      /* kubernetes service portals */
   42  2661 DOCKER_OUTPUT  all  --  *    *    0.0.0.0/0   172.16.0.190
$ d KUBE-SERVICES | grep 10.96.94.225
    3   180 KUBE-MARK-MASQ             tcp  --  *    *    !10.244.0.0/16   10.96.94.225   /* default/web cluster IP */ tcp dpt:80
    3   180 KUBE-SVC-LOLE4ISW44XBNF3G  tcp  --  *    *    0.0.0.0/0        10.96.94.225   /* default/web cluster IP */ tcp dpt:80
Since the source IP of this packet (172.18.0.12) does not belong to the PodCIDR range, the first rule now matches. The purpose of the KUBE-MARK-MASQ chain is to mark all packets that will need to get SNAT’ed before they get sent to their final destination:
$ d KUBE-MARK-MASQ
Chain KUBE-MARK-MASQ (19 references)
 pkts bytes target  prot opt in   out  source      destination
    3   180 MARK    all  --  *    *    0.0.0.0/0   0.0.0.0/0   MARK or 0x4000
Since MARK is not a terminating target, the lookup continues down the KUBE-SERVICES chain, where our packet gets DNAT’ed to one of the randomly selected backend endpoints (as shown above). This time, however, before it gets sent to its final destination, the packet takes another detour via the KUBE-POSTROUTING chain:
$ d POSTROUTING
Chain POSTROUTING (policy ACCEPT 140 packets, 9413 bytes)
 pkts bytes target              prot opt in   out  source      destination
  715 47663 KUBE-POSTROUTING    all  --  *    *    0.0.0.0/0   0.0.0.0/0      /* kubernetes postrouting rules */
    0     0 DOCKER_POSTROUTING  all  --  *    *    0.0.0.0/0   172.16.0.190
  657 44150 KIND-MASQ-AGENT     all  --  *    *    0.0.0.0/0   0.0.0.0/0      ADDRTYPE match dst-type !LOCAL /* kind-masq-agent: ensure nat POSTROUTING directs all non-LOCAL destination traffic to our custom KIND-MASQ-AGENT chain */
Here, all packets carrying the special SNAT mark (0x4000) fall past the first RETURN rule, have the mark cleared, and get SNAT’ed to the IP of the outgoing interface, which in this case is the veth interface connected to the Pod:
$ d KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target      prot opt in   out  source      destination
  463 31166 RETURN      all  --  *    *    0.0.0.0/0   0.0.0.0/0   mark match ! 0x4000/0x4000
    2   120 MARK        all  --  *    *    0.0.0.0/0   0.0.0.0/0   MARK xor 0x4000
    2   120 MASQUERADE  all  --  *    *    0.0.0.0/0   0.0.0.0/0   /* kubernetes service traffic requiring SNAT */ random-fully
The final MASQUERADE action ensures that return packets follow the same path back, even if they originated outside of the Kubernetes cluster.
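This, too, can be observed with conntrack (addresses below are illustrative and the exact SNAT’ed source depends on the CNI plugin): the reply tuple’s destination is no longer the node’s 172.18.0.12 but the node-side veth address, confirming the masquerade:
$ docker exec k8s-guide-worker conntrack -L -d 10.96.94.225 -p tcp
tcp      6 112 TIME_WAIT src=172.18.0.12 dst=10.96.94.225 sport=53406 dport=80 src=10.244.1.3 dst=10.244.2.1 sport=80 dport=53406 [ASSURED] mark=0 use=1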
The above sequence of lookups may look long and inefficient, but bear in mind that it is only performed once, for the first packet of a flow; the remainder of the session gets offloaded to Netfilter’s connection tracking system. This is also why the packet counters in the NAT table outputs above are so low: the NAT table is only consulted for the first packet of each new connection.