Calico is another example of a full-blown Kubernetes “networking solution” with functionality including a network policy controller, a kube-proxy replacement and network traffic observability. CNI functionality is still the core element of Calico, and the focus of this chapter will be on how it satisfies the Kubernetes network model requirements.
Connectivity is set up by creating a veth link and moving one side of that link into a Pod’s namespace. The other side of the link is left dangling in the node’s root namespace. For each local Pod, Calico sets up a PodIP host-route pointing over the veth link.
One oddity of Calico CNI is that the node end of the veth link does not have an IP address. In order to provide Pod-to-Node egress connectivity, each veth link is set up with proxy_arp, which makes the root NS respond to any ARP request coming from the Pod (assuming that the node has a default route itself).
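To make the mechanics more concrete, the per-Pod plumbing described above can be approximated manually with plain ip commands. This is only a sketch: the namespace name pod-ns, the link name cali12345 and the Pod IP are made up, and the real CNI plugin operates directly on the container’s network namespace.
$ ip netns add pod-ns                                    # stand-in for the Pod's network namespace
$ ip link add cali12345 type veth peer name tmp0
$ ip link set tmp0 netns pod-ns
$ ip netns exec pod-ns ip link set tmp0 name eth0
$ ip netns exec pod-ns ip link set eth0 up
$ ip netns exec pod-ns ip addr add 10.244.190.5/32 dev eth0
$ ip netns exec pod-ns ip route add 169.254.1.1 dev eth0 scope link
$ ip netns exec pod-ns ip route add default via 169.254.1.1 dev eth0
$ ip link set cali12345 address ee:ee:ee:ee:ee:ee        # node end gets a MAC but no IP address
$ ip link set cali12345 up
$ ip route add 10.244.190.5/32 dev cali12345 scope link  # PodIP host-route over the veth link
$ echo 1 > /proc/sys/net/ipv4/conf/cali12345/proxy_arp   # respond to the Pod's ARP request for 169.254.1.1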
Pod-to-Pod reachability between nodes can be established in two different ways:
Static routes and overlays – Calico supports IPIP and VXLAN and has an option to only set up tunnels for traffic crossing an L3 subnet boundary.
BGP – the most popular choice for on-prem deployments, it works by configuring a Bird BGP speaker on every node and setting up peerings to ensure that reachability information gets propagated to every node. There are several options for how to set up this peering, including a full mesh between nodes, dedicated route-reflector nodes and external peering with the physical network.
The above two modes are not mutually exclusive; for example, BGP can be used together with IPIP in public cloud environments. For a complete list of networking options for both on-prem and public cloud environments, refer to this guide.
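For reference, the “tunnel only across subnet boundaries” behaviour is configured per IP pool. Below is a hedged sketch of such a pool definition, assuming calicoctl is available; the pool name is an assumption and the CIDR simply matches the lab’s Pod range.
$ calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.128.0/17
  ipipMode: CrossSubnet   # encapsulate only when the next hop is in a different subnet
  vxlanMode: Never
  natOutgoing: true
EOF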
For demonstration purposes, we’ll use a BGP-based configuration option with an external, off-cluster route reflector. The fully converged and populated IP and MAC tables will look like this:
Assuming that the lab environment is already set up, Calico can be enabled with the following commands:
make calico
Check that the calico-node daemonset has all Pods in the READY state:
$ kubectl -n calico-system get daemonset
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-node 3 3 3 3 3 kubernetes.io/os=linux 61s
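An equivalent check that blocks until the rollout has finished:
$ kubectl -n calico-system rollout status daemonset/calico-node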
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
make nuke-all-pods
To make sure kube-proxy and Calico set up the right set of NAT rules, the existing NAT tables need to be flushed and re-populated:
make flush-nat && make calico-restart
Build and start a GoBGP-based route reflector:
make gobgp-build && make gobgp-rr
Finally, reconfigure Calico’s BGP daemonset to peer with the GoBGP route reflector:
make gobgp-calico-patch
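The make target hides the details; one way to express such a peering via the Calico API would be something like the following, where the peer IP and AS number are made-up values and the actual patch applied by the lab may differ.
$ calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false   # rely on the route reflector instead of a full mesh
  asNumber: 64512
EOF
$ calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: gobgp-rr
spec:
  peerIP: 172.18.0.100           # illustrative address of the GoBGP container
  asNumber: 64512
EOF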
Here’s how the information from the diagram can be validated (using worker2 as an example):
$ NODE=k8s-guide-worker2 make tshoot
bash-5.0# ip -4 -br addr show dev eth0
eth0@if2 UP 10.244.190.5/32
bash-5.0# ip route
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
Note how the default route is pointing to the fake next-hop address 169.254.1.1. This will be the same for all Pods, and this IP will resolve to the same MAC address configured on all veth links:
bash-5.0# ip neigh
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee REACHABLE
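The node side can be cross-checked too: every cali* interface on the node is configured with that same MAC, which can be confirmed with something like the following (output omitted):
$ docker exec k8s-guide-worker2 ip -br link show type veth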
The routing table of the node itself ties everything together:
$ docker exec k8s-guide-worker2 ip route
default via 172.18.0.1 dev eth0
10.244.175.0/24 via 172.18.0.4 dev eth0 proto bird
10.244.190.0 dev calid7f7f4e15dd scope link
blackhole 10.244.190.0/24 proto bird
10.244.190.1 dev calid599cd3d268 scope link
10.244.190.2 dev cali82aeec08a68 scope link
10.244.190.3 dev calid2e34ad38c6 scope link
10.244.190.4 dev cali4a822ce5458 scope link
10.244.190.5 dev cali0ad20b06c15 scope link
10.244.236.0/24 via 172.18.0.5 dev eth0 proto bird
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
A few interesting things to note in the above output:
Every local Pod gets a /32 host-route pointing over its cali* veth link.
The blackhole route covers the node’s own PodCIDR (10.244.190.0/24), so traffic to unallocated Pod IPs gets dropped.
The two /24 routes learned via bird are the PodCIDR ranges of the other two nodes.
The same prefixes can be seen in the global RIB of the GoBGP route reflector:
$ docker exec gobgp gobgp global rib
Network Next Hop AS_PATH Age Attrs
*> 10.244.175.0/24 172.18.0.4 00:05:04 [{Origin: i} {LocalPref: 100}]
*> 10.244.190.0/24 172.18.0.3 00:05:04 [{Origin: i} {LocalPref: 100}]
*> 10.244.236.0/24 172.18.0.5 00:05:03 [{Origin: i} {LocalPref: 100}]
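The state of the BGP sessions between the route reflector and the three nodes can be verified in a similar way (output omitted):
$ docker exec gobgp gobgp neighbor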
Let’s track what happens when Pod-1 (actual name is net-tshoot-rg2lp) tries to talk to Pod-3 (net-tshoot-6wszq).
We’ll assume that the ARP and MAC tables are converged and fully populated; in order to do that, issue a ping from Pod-1 to Pod-3’s IP (10.244.236.0).
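For example, assuming the troubleshooting image ships with ping:
$ kubectl -n default exec net-tshoot-rg2lp -- ping -c 1 10.244.236.0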
First, collect the MAC address and the peer interface index of Pod-1’s eth0 interface:
$ kubectl -n default exec net-tshoot-rg2lp -- ip link show dev eth0
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP mode DEFAULT group default
link/ether b2:24:13:ec:77:42 brd ff:ff:ff:ff:ff:ff link-netnsid 0
This information (if14) will be used below to identify the node side of the veth link.
Pod-1 prepares to send a packet to 10.244.236.0. Its network stack performs a route lookup:
$ kubectl -n default exec net-tshoot-rg2lp -- ip route get 10.244.236.0
10.244.236.0 via 169.254.1.1 dev eth0 src 10.244.175.4 uid 0
cache
The next hop is 169.254.1.1 on eth0, so an ARP table lookup is needed to get the destination MAC:
$ kubectl -n default exec net-tshoot-rg2lp -- ip neigh show 169.254.1.1
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee STALE
As mentioned above, the node side of the veth link doesn’t have any IP configured:
$ docker exec k8s-guide-worker ip addr show | grep -A1 '^14:'
14: calic8441ae7134@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-262ff521-1b00-b1c9-f0d5-0943a48a2ddc
So in order to respond to an ARP request for 169.254.1.1, all veth links have proxy ARP enabled:
$ docker exec k8s-guide-worker cat /proc/sys/net/ipv4/conf/calic8441ae7134/proxy_arp
1
The packet then enters the root namespace of the sending node, where another route lookup takes place:
$ docker exec k8s-guide-worker ip route get 10.244.236.0 fibmatch
10.244.236.0/24 via 172.18.0.5 dev eth0 proto bird
The packet gets forwarded to the next-hop node (172.18.0.5, the control-plane in this case), where the route lookup points directly at the target Pod’s veth link:
$ docker exec k8s-guide-control-plane ip route get 10.244.236.0 fibmatch
10.244.236.0 dev cali0ec6986a945 scope link
The target IP is reachable over the veth link, so ARP is used to determine the destination MAC address:
$ docker exec k8s-guide-control-plane ip neigh show 10.244.236.0
10.244.236.0 dev cali0ec6986a945 lladdr de:85:25:60:86:5b STALE
Finally, the packet gets delivered to the eth0 interface of the target Pod:
$ kubectl exec net-tshoot-6wszq -- ip -br addr show dev eth0
eth0@if2 UP 10.244.236.0/32 fe80::dc85:25ff:fe60:865b/64
SNAT functionality for traffic egressing the cluster is done in two stages:
The cali-POSTROUTING chain is inserted at the top of the POSTROUTING chain.
Inside that chain, cali-nat-outgoing SNATs all egress traffic originating from the cali40masq-ipam-pools ipset.
$ iptables -t nat -vnL
<...>
Chain POSTROUTING (policy ACCEPT 5315 packets, 319K bytes)
pkts bytes target prot opt in out source destination
7844 529K cali-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:O3lYWMrLQYEMJtB5 */
<...>
Chain cali-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
7844 529K cali-fip-snat all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:Z-c7XtVd2Bq7s_hA */
7844 529K cali-nat-outgoing all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:nYKhEzDlr11Jccal */
<...>
Chain cali-nat-outgoing (1 references)
pkts bytes target prot opt in out source destination
1 84 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully
Calico configures all IPAM pools as ipsets for more efficient matching within iptables. These pools can be viewed on each individual node:
$ docker exec k8s-guide-control-plane ipset -L cali40masq-ipam-pools
Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 512
References: 1
Number of entries: 1
Members:
10.244.128.0/17
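Membership of this ipset is driven by the natOutgoing flag of each IP pool, which can be cross-checked with calicoctl (assuming the pool is called default-ipv4-ippool):
$ calicoctl get ippool default-ipv4-ippool -o yaml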