weave
Weave Net is one of the “heavyweight” CNI plugins with a wide range of features and its own proprietary control plane to disseminate routing information between nodes. The scope of the plugin extends far beyond the base CNI functionality examined in this chapter and includes Network Policies, Encryption, Multicast and support for other container orchestration platforms (Swarm, Mesos).
Following a similar pattern, let’s examine how weave achieves the base CNI plugin functionality:
- Connectivity is set up by the `weave-net` binary by attaching pods to the `weave` Linux bridge. The bridge is, in turn, attached to Open vSwitch’s kernel datapath, which forwards the packets over the `vxlan` interface towards the target node.
Info
Although it would have been possible to attach containers directly to the OVS datapath (ODP), Linux bridge plays the role of an egress router for all local pods so that ODP is only used for pod-to-pod forwarding.
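This wiring can be inspected on a node with standard tooling. A hedged sketch (interface names are typical for Weave but may differ between versions, and the `ovs-dpctl` utility from the openvswitch package must be installed):

```shell
# List interfaces enslaved to the weave Linux bridge
# (pod veth ends plus the link down to the kernel datapath)
ip link show master weave

# Show the OVS kernel datapath and its ports, including the vxlan port
sudo ovs-dpctl show
```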
- Reachability is established by two separate mechanisms:
  - Weave Mesh helps agents discover each other, check each other’s health and connectivity, and exchange node-local details, e.g. the IPs of VXLAN tunnel endpoints.
  - The OVS datapath acts as a standard learning L2 switch, with flood-and-learn behaviour programmed by the local agent (based on information distributed by the Mesh). All pods get their IPs from a single cluster-wide subnet and see their peers as if they were attached to a single broadcast domain.
Info
The cluster-wide CIDR range is still split into multiple non-overlapping ranges, which may look like node-local pod CIDRs; however, all Pod IPs still have the same prefix length as the cluster CIDR, effectively making them part of the same L3 subnet.
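To illustrate with representative values: assuming a cluster CIDR of `10.32.0.0/12`, two pods on different nodes may be allocated from different node-local ranges, yet both are configured with the cluster-wide /12 prefix and therefore see each other as on-link neighbours:

```python
import ipaddress

cluster_cidr = ipaddress.ip_network("10.32.0.0/12")

# Pods on different nodes, allocated from different node-local ranges,
# but both configured with the cluster-wide /12 prefix length
pod_1 = ipaddress.ip_interface("10.32.0.4/12")  # e.g. a pod on worker
pod_3 = ipaddress.ip_interface("10.40.0.1/12")  # e.g. a pod on worker2

# Both interfaces belong to the same L3 subnet, so neither pod ever
# sends the other's traffic via a gateway
assert pod_1.network == pod_3.network == cluster_cidr
print(pod_1.network)  # 10.32.0.0/12
```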
The fully converged and populated IP and MAC tables will look like this:
Lab
Assuming that the lab is already set up, weave can be enabled with the following commands:
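The exact commands are lab-specific; on a generic cluster, Weave Net is typically installed by applying the DaemonSet manifest published with a release, for example (version pinned purely for illustration):

```shell
# Apply the Weave Net DaemonSet manifest from the project's GitHub releases
kubectl apply -f https://github.com/weaveworks/weave/releases/download/v2.8.1/weave-daemonset-k8s.yaml
```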
Check that the weave daemonset has reached the READY state:
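A sketch of how this can be checked (the DaemonSet is named `weave-net` and lives in `kube-system` in the standard manifest):

```shell
kubectl -n kube-system rollout status daemonset/weave-net
kubectl -n kube-system get pods -l name=weave-net -o wide
```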
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
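One blunt but effective way to do this is to delete all pods and let their controllers recreate them, at which point the new CNI plugin wires them up:

```shell
# Controllers (Deployments, DaemonSets, etc.) will recreate the pods
kubectl delete pods --all --all-namespaces
```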
To make sure kube-proxy and weave set up the right set of NAT rules, existing NAT tables need to be flushed and repopulated:
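A hedged sketch of what this involves, run on each node (kube-proxy uses the standard `k8s-app=kube-proxy` label; deleting its pods forces an immediate rule rebuild):

```shell
# Flush the NAT table on the node; weave and kube-proxy will
# re-install their rules
sudo iptables -t nat -F

# Restart kube-proxy so it repopulates its chains right away
kubectl -n kube-system delete pods -l k8s-app=kube-proxy
```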
Here’s how the information from the diagram can be validated (using worker2 as an example):
- Pod IP and default route
- Node routing table
- ODP configuration and flows (output omitted for brevity)
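A sketch of the kind of commands behind each check (pod names, node names and the pod namespace are lab-specific; `net-tshoot-22drp` is one of the pods used later in this chapter):

```shell
# Pod IP and default route, from inside the pod
kubectl exec net-tshoot-22drp -- ip -4 address show dev eth0
kubectl exec net-tshoot-22drp -- ip route

# Node routing table, on worker2 itself
ip route

# ODP configuration and flows, on worker2 (requires openvswitch tools)
sudo ovs-dpctl show
sudo ovs-dpctl dump-flows
```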
A day in the life of a Packet
Let’s track what happens when Pod-1 (actual name is net-tshoot-22drp) tries to talk to Pod-3 (net-tshoot-pbp7z).
Note
We’ll assume that the ARP and MAC tables are converged and fully populated. In order to do that, issue a ping command from Pod-1 to Pod-3’s IP (10.40.0.1).
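For example (pod name taken from this lab run; yours will differ):

```shell
# Prime the ARP/MAC tables along the path from Pod-1 to Pod-3
kubectl exec net-tshoot-22drp -- ping -c 5 10.40.0.1
```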
- Pod-1 wants to send a packet to `10.40.0.1`. Its network stack looks up the routing table:
- Since the target IP is from a directly-connected network, the next step is to check its local ARP table:
- The packet is sent out of the veth interface and hits the `weave` bridge in the root NS, where an L2 lookup is performed:
- The packet is sent from the `weave` bridge down to the OVS kernel datapath over a veth link:
- The ODP does a flow lookup to determine what actions to apply to the packet (output redacted for brevity):
- ODP encapsulates the original packet into a VXLAN frame and sends it out of its local vxlan port:
- The VXLAN frame gets L2-switched by the `kind` bridge and arrives at the `control-plane` node, where another ODP lookup is performed:
- The output port is the veth link connecting ODP to the `weave` bridge:
- Following another L2 lookup in the `weave` bridge, the packet is sent down the veth link connected to the target Pod-3:
- Finally, the packet gets delivered to the `eth0` interface of the target pod:
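The steps above can each be examined with standard tooling. A hedged sketch, with the pod name from this lab run (tables and flow contents will vary):

```shell
# Steps 1-2: routing and ARP lookups inside Pod-1
kubectl exec net-tshoot-22drp -- ip route get 10.40.0.1
kubectl exec net-tshoot-22drp -- ip neigh

# Step 3: L2 (MAC) lookup in the weave bridge, in the node's root NS
bridge fdb show br weave

# Steps 5-6: ODP flow lookup and the vxlan port on the sending node
sudo ovs-dpctl dump-flows
sudo ovs-dpctl show
```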
SNAT functionality
SNAT functionality for traffic egressing the cluster is done in two stages:
- All packets that don’t match the cluster CIDR range get sent to the IP of the local `weave` bridge, which sends them down the default route already configured in the root namespace.
- A new `WEAVE` chain gets appended to the `POSTROUTING` chain; it matches all packets from the cluster IP range `10.32.0.0/12` destined to all non-cluster IPs (`!10.32.0.0/12`) and translates all flows leaving the node (`MASQUERADE`):
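The resulting rules look roughly like this (a hand-written approximation of the relevant entries, not verbatim `iptables-save` output):

```shell
# Jump from POSTROUTING into the WEAVE chain
iptables -t nat -A POSTROUTING -j WEAVE

# Masquerade cluster-sourced traffic whose destination is outside
# the cluster CIDR
iptables -t nat -A WEAVE -s 10.32.0.0/12 ! -d 10.32.0.0/12 -j MASQUERADE
```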
Partial connectivity
One of the interesting and unique features of Weave is its ability to function in environments with partial connectivity. This functionality is enabled by Weave Mesh and its use of a gossip protocol, allowing mesh members to dynamically discover each other and build a topology graph, which is used to calculate the optimal forwarding path.
One way to demonstrate this is to break the connectivity between two worker nodes and verify that pods are still able to reach each other. Let’s start by checking that ping works under normal conditions:
Get the IPs of the two worker nodes:
Add a new DROP rule for the traffic between these two IPs:
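A sketch of how this can be done, using the worker node IPs seen later in this section (172.18.0.4 and 172.18.0.5), run on the host carrying the inter-node traffic:

```shell
# Drop traffic between the two worker nodes in both directions
sudo iptables -I FORWARD -s 172.18.0.5 -d 172.18.0.4 -j DROP
sudo iptables -I FORWARD -s 172.18.0.4 -d 172.18.0.5 -j DROP
```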
A few seconds later, once the control plane has reconverged, repeat the ping test:
The connectivity still works, although the traffic between the two worker nodes is definitely dropped:
One thing worth noting here is that the average RTT has almost doubled compared to the original test. This is because the traffic is now relayed by the control-plane node, the only node that has full connectivity to both worker nodes. In the dataplane, this is achieved with a special UDP-based protocol called sleeve (https://www.weave.works/docs/net/latest/concepts/router-encapsulation/).
The sending node (172.18.0.5) encapsulates ICMP packets for the other worker node (172.18.0.4) in a Sleeve payload and sends them to the control-plane node (172.18.0.2), which relays them on to the correct destination:
Although it certainly comes with substantial performance trade-offs, this functionality can come in very handy in environments with bad network links, or where remote nodes are hosted in an isolated network environment with limited or restricted external connectivity.
Don’t forget to remove the drop rule at the end of the testing:
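Assuming the rules were inserted as in the sketch above, they can be removed like this:

```shell
# Delete the DROP rules between the two worker nodes
sudo iptables -D FORWARD -s 172.18.0.5 -d 172.18.0.4 -j DROP
sudo iptables -D FORWARD -s 172.18.0.4 -d 172.18.0.5 -j DROP
```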
Caveats and Gotchas
- The official installation guide contains a number of things to watch out for.
- Addition or deletion of nodes, or intermittent connectivity to a node, results in flow invalidation on all nodes which, for a brief period of time, disrupts all connections until flood-and-learn re-populates the forwarding tables.
Additional reading:
Weave’s IPAM
Overlay Method Selection
OVS dataplane Implementation Details