Weave Net is one of the “heavyweight” CNI plugins with a wide range of features and its own proprietary control plane to disseminate routing information between nodes. The scope of the plugin extends far beyond the base CNI functionality examined in this chapter and includes Network Policies, Encryption, Multicast and support for other container orchestration platforms (Swarm, Mesos).
Following a similar pattern, let’s examine how weave achieves the base CNI plugin functionality:
Connectivity is set up by the weave-net binary by attaching pods to the weave Linux bridge. The bridge is, in turn, attached to the Open vSwitch’s kernel datapath, which forwards the packets over the vxlan interface towards the target node.
Although it would have been possible to attach containers directly to the OVS datapath (ODP), the Linux bridge plays the role of an egress router for all local pods, so that ODP is only used for pod-to-pod forwarding.
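One quick way to see this wiring is to list bridge and datapath membership from a node’s root namespace. This is only a sketch and assumes the kind-based lab node name k8s-guide-worker used later in this chapter:
# Interfaces attached to the weave Linux bridge: one veth per local Pod plus vethwe-bridge
docker exec -it k8s-guide-worker ip -br link show master weave
# Interfaces that report the OVS kernel datapath ("datapath") as their master
docker exec -it k8s-guide-worker ip -br link show master datapath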
Reachability is established by two separate mechanisms:
The cluster-wide CIDR range is still split into multiple non-overlapping ranges, which may look like node-local Pod CIDRs; however, all Pod IPs retain the same prefix length as the cluster CIDR, effectively making them part of the same L3 subnet.
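To see this in practice, compare the addresses of two Pods scheduled on different nodes. This is a sketch that reuses the Pod names appearing later in this chapter (yours will differ), and the expected values are taken from the outputs shown below:
# Each Pod's address comes from its node's sub-range but keeps the full /12 prefix
kubectl exec -it net-tshoot-22drp -- ip -4 addr show dev eth0   # e.g. 10.32.0.4/12
kubectl exec -it net-tshoot-pbp7z -- ip -4 addr show dev eth0   # e.g. 10.40.0.1/12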
The fully converged and populated IP and MAC tables will look like this:
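These tables can also be dumped directly once weave is up and the Pods are running (again, a sketch using the node and Pod names from this chapter):
# IP-to-MAC (ARP) entries learned inside a Pod
kubectl exec -it net-tshoot-22drp -- ip neigh show
# MAC learning table of the weave bridge on that Pod's node, excluding permanent entries
docker exec -it k8s-guide-worker bridge fdb show br weave | grep -v permanent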
Assuming that the lab is already set up, weave can be enabled with the following commands:
make weave
Check that the weave daemonset has reached the READY state:
$ kubectl -n kube-system get daemonset -l name=weave-net
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
weave-net 3 3 3 3 3 <none> 30s
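Optionally, confirm that the weave peers have discovered each other and established connections. This is a sketch; the exact output format differs between Weave Net versions:
WEAVEPOD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local status connections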
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
make nuke-all-pods
To make sure kube-proxy and weave set up the right set of NAT rules, existing NAT tables need to be flushed and repopulated:
make flush-nat && make weave-restart
Here’s how the information from the diagram can be validated (using worker2 as an example):
$ NODE=k8s-guide-worker2 make tshoot
bash-5.0# ip route
default via 10.44.0.0 dev eth0
10.32.0.0/12 dev eth0 proto kernel scope link src 10.44.0.7
$ docker exec -it k8s-guide-worker2 ip route
default via 172.18.0.1 dev eth0
10.32.0.0/12 dev weave proto kernel scope link src 10.44.0.0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.4
The rest of the forwarding state can be pulled from the local weave report:
WEAVEPOD=$(kubectl get pods -n kube-system -l name=weave-net --field-selector spec.nodeName=k8s-guide-worker2 -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report
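The full report is fairly verbose, so jq can be used to pull out just the sections needed for this walkthrough; the path below is the same one used in later steps:
# Top-level sections of the report
kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report | jq 'keys'
# Fast datapath diagnostics used in the walkthrough below
kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report | jq '.Router.OverlayDiagnostics.fastdp'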
Let’s track what happens when Pod-1 (actual name is net-tshoot-22drp) tries to talk to Pod-3 (net-tshoot-pbp7z).
We’ll assume that the ARP and MAC tables are converged and fully populated; to make sure of that, issue a ping from Pod-1 to Pod-3’s IP (10.40.0.1). Pod-1’s network stack looks up the routing table:
$ kubectl exec -it net-tshoot-22drp -- ip route get 10.40.0.1
10.40.0.1 dev eth0 src 10.32.0.4 uid 0
cache
The returned route is directly connected, so the destination MAC is resolved from the Pod’s ARP table:
$ kubectl exec -it net-tshoot-22drp -- ip neigh show 10.40.0.1
10.40.0.1 dev eth0 lladdr d6:8d:31:c4:95:85 STALE
The frame is sent over the Pod’s veth link to the weave bridge in the root NS, where a L2 lookup is performed:
$ docker exec -it k8s-guide-worker bridge fdb get d6:8d:31:c4:95:85 br weave
d6:8d:31:c4:95:85 dev vethwe-bridge master weave
The lookup points at vethwe-bridge, which connects the weave bridge down to the OVS kernel datapath over a veth link:
$ docker exec -it k8s-guide-worker ip link | grep vethwe-
12: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master datapath state UP mode DEFAULT group default
13: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
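The datapath master seen above is the OVS kernel datapath itself, exposed as a regular network device; as a quick check (the link type should be reported as openvswitch):
docker exec -it k8s-guide-worker ip -d link show dev datapath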
Inside the kernel datapath, the packet matches a programmed flow whose action is to VXLAN-encapsulate it and forward it towards the target node; this flow can be found in the weave report:
$ WEAVEPOD=$(kubectl get pods -n kube-system -l name=weave-net --field-selector spec.nodeName=k8s-guide-worker -o jsonpath='{.items[0].metadata.name}')
$ kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report
<...>
{
"FlowKeys": [
"UnknownFlowKey{type: 22, key: 00000000, mask: 00000000}",
"EthernetFlowKey{src: 0a:75:b7:d0:31:58, dst: d6:8d:31:c4:95:85}",
"UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: 00000000000000000000000000000000}",
"UnknownFlowKey{type: 23, key: 0000, mask: 0000}",
"InPortFlowKey{vport: 1}",
"UnknownFlowKey{type: 24, key: 00000000, mask: 00000000}"
],
"Actions": [
"SetTunnelAction{id: 0000000000ade6da, ipv4src: 172.18.0.3, ipv4dst: 172.18.0.2, ttl: 64, df: true}",
"OutputAction{vport: 2}"
],
"Packets": 2,
"Bytes": 84,
"Used": 258933878
},
<...>
The OutputAction points at vport 2, which turns out to be the vxlan tunnel interface:
$ kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report | jq '.Router.OverlayDiagnostics.fastdp.Vports[2]'
{
"ID": 2,
"Name": "vxlan-6784",
"TypeName": "vxlan"
}
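The vport name hints that the tunnel uses UDP port 6784, so, assuming tcpdump is available inside the kind node image, the encapsulated traffic can be observed directly on the node’s eth0:
# Capture a few encapsulated packets leaving the worker node
docker exec -it k8s-guide-worker tcpdump -ni eth0 -c 5 udp port 6784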
The VXLAN-encapsulated packet traverses the kind bridge and arrives at the control-plane node, where another ODP lookup is performed:
$ WEAVEPOD=$(kubectl get pods -n kube-system -l name=weave-net --field-selector spec.nodeName=k8s-guide-control-plane -o jsonpath='{.items[0].metadata.name}')
$ kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report
<...>
{
"FlowKeys": [
"UnknownFlowKey{type: 22, key: 00000000, mask: 00000000}",
"UnknownFlowKey{type: 24, key: 00000000, mask: 00000000}",
"UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: 00000000000000000000000000000000}",
"TunnelFlowKey{id: 0000000000ade6da, ipv4src: 172.18.0.3, ipv4dst: 172.18.0.2}",
"InPortFlowKey{vport: 2}",
"UnknownFlowKey{type: 23, key: 0000, mask: 0000}",
"EthernetFlowKey{src: 0a:75:b7:d0:31:58, dst: d6:8d:31:c4:95:85}"
],
"Actions": [
"OutputAction{vport: 1}"
],
"Packets": 3,
"Bytes": 182,
"Used": 259264545
},
<...>
The OutputAction points at vport 1, which connects the datapath back to the local weave bridge:
$ kubectl exec -it $WEAVEPOD -n kube-system -- /home/weave/weave --local report | jq '.Router.OverlayDiagnostics.fastdp.Vports[1]'
{
"ID": 1,
"Name": "vethwe-datapath",
"TypeName": "netdev"
}
Following another L2 lookup on the weave bridge, the packet is sent down the veth link connected to the target Pod-3:
$ docker exec -it k8s-guide-control-plane bridge fdb get d6:8d:31:c4:95:85 br weave
d6:8d:31:c4:95:85 dev vethwepl6be12f5 master weave
This veth link terminates at the eth0 interface of the target pod:
$ kubectl exec -it net-tshoot-pbp7z -- ip link show dev eth0
16: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP mode DEFAULT group default
link/ether d6:8d:31:c4:95:85 brd ff:ff:ff:ff:ff:ff link-netnsid 0
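The @if17 suffix above means the Pod’s eth0 is one half of a veth pair whose root-namespace peer has ifindex 17; as a sketch (the index will differ in your lab), this can be matched against the vethwepl6be12f5 interface found in the FDB lookup:
# Should report ifindex 17 and a peer reference of @if16, i.e. the Pod's eth0
docker exec -it k8s-guide-control-plane ip link show vethwepl6be12f5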
SNAT functionality for traffic egressing the cluster is done in two stages:
All packets that don’t match the cluster CIDR range get sent to the IP of the local weave bridge, which sends them down the default route already configured in the root namespace (see the cross-check after the iptables output below).
A new WEAVE chain gets appended to the POSTROUTING chain; it matches all packets from the cluster IP range 10.32.0.0/12 destined to non-cluster IPs (!10.32.0.0/12) and translates all flows leaving the node (MASQUERADE):
iptables -t nat -vnL
<...>
Chain POSTROUTING (policy ACCEPT 6270 packets, 516K bytes)
pkts bytes target prot opt in out source destination
51104 4185K WEAVE all -- * * 0.0.0.0/0 0.0.0.0/0
<...>
Chain WEAVE (1 references)
pkts bytes target prot opt in out source destination
4 336 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 match-set weaver-no-masq-local dst /* Prevent SNAT to locally running containers */
0 0 RETURN all -- * * 10.32.0.0/12 224.0.0.0/4
0 0 MASQUERADE all -- * * !10.32.0.0/12 10.32.0.0/12
2 120 MASQUERADE all -- * * 10.32.0.0/12 !10.32.0.0/12
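The first of the two stages can be cross-checked by comparing a Pod’s default gateway with the address assigned to the weave bridge on its node; this sketch reuses worker2, and the expected values (10.44.0.0) come from the earlier routing outputs:
POD_W2=$(kubectl get pods -n default --field-selector spec.nodeName=k8s-guide-worker2 -o jsonpath='{.items[0].metadata.name}')
kubectl -n default exec $POD_W2 -- ip route show default      # e.g. default via 10.44.0.0 dev eth0
docker exec -it k8s-guide-worker2 ip -4 addr show dev weave   # e.g. inet 10.44.0.0/12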
One of the interesting and unique features of Weave is its ability to function in environments with partial connectivity. This functionality is enabled by Weave Mesh and its use of the gossip protocol, allowing mesh members to dynamically discover each other and build the topology graph that is used to calculate the optimal forwarding path.
One way to demonstrate this is to break the connectivity between two worker nodes and verify that pods are still able to reach each other. Let’s start by checking that ping works under normal conditions:
POD_WORKER2_IP=$(kubectl get pods -n default --field-selector spec.nodeName=k8s-guide-worker2 -o jsonpath='{.items[0].status.podIP}')
POD_WORKER1_NAME=$(kubectl get pods -n default --field-selector spec.nodeName=k8s-guide-worker -o jsonpath='{.items[0].metadata.name}')
kubectl -n default exec $POD_WORKER1_NAME -- ping -q -c 5 $POD_WORKER2_IP
PING 10.40.0.7 (10.40.0.7) 56(84) bytes of data.
--- 10.40.0.7 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4055ms
rtt min/avg/max/mdev = 0.136/0.178/0.278/0.051 ms
Get the IPs of the two worker nodes:
IP_WORKER1=$(docker inspect k8s-guide-worker --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
IP_WORKER2=$(docker inspect k8s-guide-worker2 --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
Add a new DROP rule for the traffic between these two IPs:
sudo iptables -I FORWARD -s $IP_WORKER1 -d $IP_WORKER2 -j DROP
A few seconds later, once the control plane has reconverged, repeat the ping test:
kubectl -n default exec $POD_WORKER1_NAME -- ping -q -c 5 $POD_WORKER2_IP
PING 10.40.0.7 (10.40.0.7) 56(84) bytes of data.
--- 10.40.0.7 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4031ms
rtt min/avg/max/mdev = 0.347/0.489/0.653/0.102 ms
The connectivity still works, although the traffic between the two worker nodes is definitely dropped:
sudo iptables -nvL FORWARD | grep DROP
Chain FORWARD (policy DROP 0 packets, 0 bytes)
312 43361 DROP all -- * * 172.18.0.5 172.18.0.4
One thing worth noting here is that the average RTT has more than doubled compared to the original test. This is because the traffic is now relayed by the control-plane node, the only node that has full connectivity to both workers. In the dataplane, this is achieved with a special UDP-based protocol called sleeve (https://www.weave.works/docs/net/latest/concepts/router-encapsulation/).
The sending node (172.18.0.5) encapsulates ICMP packets for the other worker node (172.18.0.4) in a Sleeve payload and sends them to the control-plane node (172.18.0.2), which relays them on to the correct destination:
12:28:54.056814 IP 172.18.0.5.48052 > 172.18.0.2.6784: UDP, length 106
12:28:54.057599 IP 172.18.0.2.48052 > 172.18.0.4.6784: UDP, length 106
12:28:54.057957 IP 172.18.0.4.48052 > 172.18.0.2.6784: UDP, length 106
12:28:54.058376 IP 172.18.0.2.48052 > 172.18.0.5.6784: UDP, length 106
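A capture like the one above can be taken on the relaying control-plane node; this is a sketch and assumes tcpdump is available inside the kind node image:
# Sleeve relay traffic arrives and leaves on UDP port 6784
docker exec -it k8s-guide-control-plane tcpdump -ni eth0 udp port 6784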
Although it certainly comes with substantial performance trade-offs, this functionality can come in very handy in environments with unreliable network links or where remote nodes are hosted in an isolated network environment with limited or restricted external connectivity.
Don’t forget to remove the drop rule at the end of the testing:
sudo iptables -D FORWARD -s $IP_WORKER1 -d $IP_WORKER2 -j DROP
Weave’s IPAM
Overlay Method Selection
OVS dataplane Implementation Details