NodePort

NodePort builds on top of the ClusterIP Service and provides a way to expose a group of Pods to the outside world. At the API level, the only difference from ClusterIP is the mandatory Service type, which has to be set to NodePort; the rest of the values can remain the same.

apiVersion: v1
kind: Service
metadata:
  labels:
    app: FE
  name: FE
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: FE
  type: NodePort

Whenever a new Kubernetes cluster gets built, one of the available configuration parameters is service-node-port-range, which defines the range of ports to use for NodePort allocation and usually defaults to 30000-32767. One interesting thing about NodePort allocation is that it is not managed by a controller. The configured port range eventually gets passed to the kube-apiserver as an argument, and allocation happens as the API server saves a Service resource into its persistent storage (e.g. an etcd cluster); a unique port is allocated for both NodePort and LoadBalancer Services. So by the time the Service definition makes it to the persistent storage, it already contains a couple of extra fields:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: FE
  name: FE
spec:
  clusterIP: 10.96.75.104
  ports:
  - nodePort: 30171
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: FE
  type: NodePort
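
How this range is supplied depends on how the cluster is built; with kubeadm, for example, the flag can be passed to the kube-apiserver via extraArgs. The following is a minimal sketch, assuming a kubeadm-managed cluster and a purely hypothetical custom range:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    service-node-port-range: "20000-22767"  # hypothetical range; default is 30000-32767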

One of the side effects of this allocation behaviour is that the ClusterIP and NodePort values are immutable – they cannot be changed throughout the lifecycle of an object. The only way to update an existing Service is to provide the right metadata and omit both the ClusterIP and NodePort values from the spec.
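
If a predictable port is required, it can also be requested explicitly at creation time. Below is a minimal sketch; the nodePort value is hypothetical and must fall inside the configured service-node-port-range and not already be in use:

apiVersion: v1
kind: Service
metadata:
  name: FE
spec:
  type: NodePort
  selector:
    app: FE
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30080  # hypothetical; must be free and within the configured range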

From the networking point of view, NodePort’s implementation is very easy to understand:

  • For each port in the NodePort Service, the API server allocates a unique port from the service-node-port-range.
  • This port is programmed in the dataplane of each Node by kube-proxy (or its equivalent) – the most common implementations (IPTables, IPVS and eBPF) are covered in the Lab section below.
  • Any incoming packet matching one of the configured NodePorts will get destination NAT’ed to one of the healthy Endpoints and source NAT’ed (via masquerade/overload) to the address of the incoming interface.
  • The reply packet coming from the Pod will get reverse NAT’ed using the connection tracking entry set up by the incoming packet.

Both DNAT and SNAT can be avoided by using Direct Server Return (DSR) and service.spec.externalTrafficPolicy respectively; this is discussed in the Optimisations chapter.
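
For reference, the latter is just a field on the Service spec. A minimal sketch is shown below; setting externalTrafficPolicy to Local preserves the client source IP by skipping the masquerade step, at the cost of traffic only being forwarded to Node-local Endpoints:

apiVersion: v1
kind: Service
metadata:
  name: FE
spec:
  type: NodePort
  externalTrafficPolicy: Local  # default is Cluster
  selector:
    app: FE
  ports:
  - port: 80
    targetPort: 80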

The following diagram shows network connectivity for a couple of hypothetical NodePort Services.

One important thing worth remembering is that a NodePort Service is rarely used on its own. Most of the time, you’d use a LoadBalancer-type Service, which builds on top of NodePort. That being said, NodePort Services can be quite useful on their own in environments where a LoadBalancer is not available or in more static setups utilising spec.externalIPs.
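
A minimal sketch of the latter approach is shown below, assuming 192.168.224.200 is a hypothetical address that is routed to one of the Nodes:

apiVersion: v1
kind: Service
metadata:
  name: FE
spec:
  type: NodePort
  selector:
    app: FE
  externalIPs:
  - 192.168.224.200  # hypothetical address routed to one of the Nodes
  ports:
  - port: 80
    targetPort: 80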

Lab

To demonstrate the different modes of dataplane operation, we’ll use three scenarios:

  • IPTables orchestrated by kube-proxy
  • IPVS orchestrated by kube-proxy
  • eBPF orchestrated by Cilium

Preparation

Refer to the respective chapters for the instructions on how to set up the IPTables, IPVS or Cilium eBPF data planes. Once the required data plane is configured, set up a test deployment with 3 Pods and expose it via a NodePort Service:

$ make deployment && make scale-up && make nodeport
kubectl create deployment web --image=nginx
deployment.apps/web created
kubectl scale --replicas=3 deployment/web
deployment.apps/web scaled
kubectl expose deployment web --port=80 --type=NodePort
service/web exposed

Confirm the assigned NodePort (e.g. 30510 in the output below) and take note of the Endpoint addresses:

$ kubectl get svc web
NAME   TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
web    NodePort   10.96.132.141   <none>        80:30510/TCP   43s
$ kubectl get ep
NAME   ENDPOINTS                                   AGE
web    10.244.1.6:80,10.244.2.7:80,10.244.2.8:80   45s

To verify that the NodePort Service is functioning, first determine the IP of each of the cluster Nodes:

$ make node-ip-1
control-plane:192.168.224.3
$ make node-ip-2
worker:192.168.224.2
$ make node-ip-3
worker2:192.168.224.4

Combine each IP with the assigned NodePort value and check that there is external reachability from your host OS:

$ curl -s 192.168.224.3:30510 | grep Welcome
<title>Welcome to nginx!</title>
<h1>Welcome to nginx!</h1>
$ curl -s 192.168.224.2:30510 | grep Welcome
<title>Welcome to nginx!</title>
<h1>Welcome to nginx!</h1>
$ curl -s 192.168.224.4:30510 | grep Welcome
<title>Welcome to nginx!</title>
<h1>Welcome to nginx!</h1>

Finally, set up the following command aliases:

NODE=k8s-guide-worker2
alias ipt="docker exec $NODE iptables -t nat -nvL"
alias ipv="docker exec $NODE ipvsadm -ln"
alias ips="docker exec $NODE ipset list"
alias cilium="kubectl -n cilium exec $(kubectl get -l k8s-app=cilium pods -n cilium --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}') --"

IPTables Implementation

According to Tim’s IPTables diagram, external packets first get intercepted in the PREROUTING chain and redirected to the KUBE-SERVICES chain:

$ ipt PREROUTING
Chain PREROUTING (policy ACCEPT 1 packets, 60 bytes)
 pkts bytes target     prot opt in     out     source               destination
  493 32442 KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
    0     0 DOCKER_OUTPUT  all  --  *      *       0.0.0.0/0            192.168.224.1

The KUBE-NODEPORTS chain is appended to the bottom of the KUBE-SERVICES chain and uses ADDRTYPE to only match packets that are destined to one of the locally configured addresses:

$ ipt KUBE-SERVICES | grep NODEPORT
    1    60 KUBE-NODEPORTS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

Each of the configured NodePort Services will have two entries – one to enable SNAT masquerading in the KUBE-POSTROUTING chain (see ClusterIP walkthrough for more details) and another one for Endpoint-specific DNAT actions:

$ ipt KUBE-NODEPORTS
Chain KUBE-NODEPORTS (1 references)
 pkts bytes target     prot opt in     out     source               destination
    1    60 KUBE-MARK-MASQ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ tcp dpt:30510
    1    60 KUBE-SVC-LOLE4ISW44XBNF3G  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ tcp dpt:30510

Inside the KUBE-SVC-* chain there will be one entry per healthy backend Endpoint, matched with cascading random probabilities (1/3, then 1/2 of the remaining traffic, then everything that is left) so that each Endpoint receives roughly an equal share of connections:

$ ipt KUBE-SVC-LOLE4ISW44XBNF3G
Chain KUBE-SVC-LOLE4ISW44XBNF3G (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-PJHHG4YJTBHVHUTY  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ statistic mode random probability 0.33333333349
    0     0 KUBE-SEP-4OIMBIYGK4QJUGT7  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ statistic mode random probability 0.50000000000
    1    60 KUBE-SEP-R53NX34J3PCIETEY  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */

This is where the final Destination NAT translation takes place: each of the above chains translates the original destination IP and NodePort to the address and port of one of the Endpoints:

$ ipt KUBE-SEP-PJHHG4YJTBHVHUTY
Chain KUBE-SEP-PJHHG4YJTBHVHUTY (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *       10.244.1.6           0.0.0.0/0            /* default/web */
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ tcp to:10.244.1.6:80
$ ipt KUBE-SEP-4OIMBIYGK4QJUGT7
Chain KUBE-SEP-4OIMBIYGK4QJUGT7 (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *       10.244.2.7           0.0.0.0/0            /* default/web */
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ tcp to:10.244.2.7:80
$ ipt KUBE-SEP-R53NX34J3PCIETEY
Chain KUBE-SEP-R53NX34J3PCIETEY (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *       10.244.2.8           0.0.0.0/0            /* default/web */
    1    60 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/web */ tcp to:10.244.2.8:80

You may have noticed the presence of KUBE-MARK-MASQ in the above chains. This rule accounts for a corner case of a Pod talking to its own Service via the ClusterIP (i.e. the Pod itself is a part of the Service it’s trying to talk to) and the random distribution selecting that same Pod as the destination. In this case, both source and destination IPs would be the same, and this rule ensures that such packets get SNAT’ed to prevent them from being dropped.


IPVS Implementation

The IPVS data plane still relies on IPTables for a number of corner cases, which is why we can see a similar rule matching all LOCAL packets and redirecting them to the KUBE-NODE-PORT chain:

$ ipt KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *      !10.244.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
    0     0 KUBE-NODE-PORT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

However, its implementation is slightly different and makes use of IP sets, reducing the time complexity of a lookup across N configured Services from O(N) to O(1):

$ ipt KUBE-NODE-PORT
Chain KUBE-NODE-PORT (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Kubernetes nodeport TCP port for masquerade purpose */ match-set KUBE-NODE-PORT-TCP dst

All configured NodePorts are kept inside the KUBE-NODE-PORT-TCP ipset:

$ ips KUBE-NODE-PORT-TCP
Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Revision: 3
Header: range 0-65535
Size in memory: 8264
References: 1
Number of entries: 1
Members:
30064

Assuming 30064 got allocated as the NodePort, we can see all of the local addresses that accept incoming packets for this Service:

$ ipv | grep 30064
TCP  192.168.224.2:30064 rr
TCP  10.244.1.1:30064 rr
TCP  127.0.0.1:30064 rr

The IPVS configuration for each individual listener is the same and contains a set of backend Endpoint addresses with the default round-robin traffic distribution:

$ ipv
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.224.2:30064 rr
  -> 10.244.1.6:80                Masq    1      0          0
  -> 10.244.2.7:80                Masq    1      0          0
  -> 10.244.2.8:80                Masq    1      0          0
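
Round-robin is only the default; kube-proxy can be told to use a different IPVS scheduler via its configuration. The fragment below is a minimal sketch of the relevant part of a KubeProxyConfiguration, assuming kube-proxy is configured via a config file or ConfigMap and that wrr (weighted round-robin) is the desired scheduler:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "wrr"  # weighted round-robin; rr is the default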

Cilium eBPF Implementation

The way Cilium deals with NodePort Services is quite involved, so we’ll focus only on the relevant “happy” code paths, ignoring corner cases and interactions with other features like firewalling or encryption.

At boot time, Cilium attaches a pair of eBPF programs to a set of the Node’s external network interfaces (they can be picked automatically or defined in the configuration). In our case, we only have one external interface, eth0, and we can see the eBPF programs attached to it using bpftool:

$ cilium bpftool net | grep eth0
eth0(19) clsact/ingress bpf_netdev_eth0.o:[from-netdev] id 6098
eth0(19) clsact/egress bpf_netdev_eth0.o:[to-netdev] id 6104

Let’s focus on the ingress part and walk through the source code of the from-netdev program. During the first few steps, the SKB data structure gets passed to the handle_netdev function (source) and on to the do_netdev function (source), which handles IPSec, security identity and logging operations. At the end, a tail call transfers control to the handle_ipv4 function (source), which is where most of the forwarding decisions take place.

One of the first things that happens inside handle_ipv4 is the following check, which confirms that Cilium was configured to process NodePort Services and that the packet is coming from an external source, in which case the SKB context is passed over to the nodeport_lb4 function:

#ifdef ENABLE_NODEPORT
	if (!from_host) {
		if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC &&
		    !bpf_skip_nodeport(ctx)) {
			ret = nodeport_lb4(ctx, secctx);
			if (ret < 0)
				return ret;
		}
		/* Verifier workaround: modified ctx access. */
		if (!revalidate_data(ctx, &data, &data_end, &ip4))
			return DROP_INVALID;
	}
#endif /* ENABLE_NODEPORT */

The nodeport_lb4 function (source) deals with anything related to NodePort Service load-balancing and address translation. Initially, it builds a 4-tuple which will be used for internal connection tracking and attempts to extract a Service map lookup key:

tuple.nexthdr = ip4->protocol;
tuple.daddr = ip4->daddr;
tuple.saddr = ip4->saddr;

l4_off = l3_off + ipv4_hdrlen(ip4);

ret = lb4_extract_key(ctx, ip4, l4_off, &key, &csum_off, CT_EGRESS);

The key gets built from the destination IP and L4 port of the ingress packet. Similar to Cilium’s ClusterIP implementation (and for the same reasons), the lookup is performed in two stages, and the first one is only used to determine the total number of backend Endpoints (svc->count):

struct lb4_service *lb4_lookup_service(struct lb4_key *key,
				       const bool scope_switch)
{
	struct lb4_service *svc;

	key->scope = LB_LOOKUP_SCOPE_EXT;
	key->backend_slot = 0;
	svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
	if (svc) {
		if (!scope_switch || !lb4_svc_is_local_scope(svc))
			return svc->count ? svc : NULL;
		key->scope = LB_LOOKUP_SCOPE_INT;
		svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
		if (svc && svc->count)
			return svc;
	}

	return NULL;
}

For example, this is what a map lookup for a packet going to 172.18.0.6:30171 looks like – the first four key bytes (0xac 0x12 0x00 0x06) encode the destination IP and the following two (0x75 0xdb) encode port 30171 in network byte order:

$ cilium bpftool map lookup pinned /sys/fs/bpf/tc/globals/cilium_lb4_services_v2 key 0xac 0x12 0x00 0x06 0x75 0xdb 0x00 0x00 0x00 0x00 0x00 0x00
key: ac 12 00 06 75 db 00 00  00 00 00 00  value: 00 00 00 00 03 00 00 08  42 00 00 00

The returned value carries the number of healthy backend Endpoints in its count field (0x03 in our case), which is then used in the second lookup inside the lb4_local function (source):

if (backend_id == 0) {
	/* No CT entry has been found, so select a svc endpoint */
	backend_id = lb4_select_backend_id(ctx, key, tuple, svc);
	backend = lb4_lookup_backend(ctx, backend_id);
	if (backend == NULL)
		goto drop_no_service;
}

This time, the exact backend_id is determined either randomly or using a Maglev hash lookup. The value of backend_id is used to look up the destination IP and port of the target Endpoint:

static __always_inline struct lb4_backend *__lb4_lookup_backend(__u16 backend_id)
{
	return map_lookup_elem(&LB4_BACKEND_MAP, &backend_id);
}
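
Whether the random or the Maglev selection is used is a matter of agent configuration. The fragment below is a minimal sketch of the Helm values that switch the algorithm, assuming Cilium is installed via Helm (the option name may vary between Cilium versions):

loadBalancer:
  algorithm: maglev  # default is "random"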

With this information in hand, the control flow is passed from lb4_local to the lb4_xlate function:

	return lb_skip_l4_dnat() ? CTX_ACT_OK :
	       lb4_xlate(ctx, &new_daddr, &new_saddr, &saddr,
			 tuple->nexthdr, l3_off, l4_off, csum_off, key,
			 backend, has_l4_header, skip_l3_xlate);

As its name suggests, lb4_xlate (source) performs the L4 header rewrites and checksum updates that finish the translation of the original packet, which now carries the destination IP and port of one of the backend Endpoints:

if (likely(backend->port) && key->dport != backend->port &&
    (nexthdr == IPPROTO_TCP || nexthdr == IPPROTO_UDP) &&
    has_l4_header) {
	__be16 tmp = backend->port;

	/* Port offsets for UDP and TCP are the same */
	ret = l4_modify_port(ctx, l4_off, TCP_DPORT_OFF, csum_off,
			     tmp, key->dport);
	if (IS_ERR(ret))
		return ret;
}

return CTX_ACT_OK;

At this point, with the packet fully translated and the connection tracking entries updated, the control flow returns to the handle_ipv4 function, where a Cilium endpoint is looked up and its details are used to call the bpf_redirect_neigh eBPF helper to redirect the packet straight to the target interface, similar to how it was described in the Cilium CNI chapter:

	/* Lookup IPv4 address in list of local endpoints and host IPs */
	ep = lookup_ip4_endpoint(ip4);
	if (ep) {
		/* Let through packets to the node-ip so they are processed by
		 * the local ip stack.
		 */
		if (ep->flags & ENDPOINT_F_HOST)
			return CTX_ACT_OK;

		return ipv4_local_delivery(ctx, ETH_HLEN, secctx, ip4, ep,
					   METRIC_INGRESS, from_host);
	}