Chapter 14: Network Namespaces And Virtual Ethernet
Container networking is built out of three Linux primitives that exist on their own merits. Virtual Ethernet pairs (veth) give the kernel a way to wire two network devices back-to-back in software, like a patch cable that never leaves the kernel. Linux bridges are software Ethernet switches: any frame that comes in on one port leaves on the right port, with MAC learning and broadcast just like the metal box on a rack. Network namespaces give a process its own copy of the entire network stack — its own interfaces, IPs, routes, sockets, port numbers, and firewall rules.
A container is a process inside a network namespace. A container talks to the network because the namespace contains one end of a veth pair, the other end is on a bridge in the host namespace, and the bridge is wired to the rest of the world by routing or NAT. Every CNI plugin in the bridge family is some variation on that theme. The chapter takes the three primitives one at a time, with hands-on examples that work in isolation, and then composes them into a working two-container network from scratch.
Safety: every command below mutates kernel networking state and most need root. Use a disposable Linux VM. Examples were checked on Ubuntu 24.04 with kernel 6.8 and iproute2 6.1.
What A Network Namespace Holds
A network namespace is a separate, independent network stack. The network_namespaces(7) man page enumerates the pieces, but the list is more useful once you know what each piece does and why isolating it matters.
Network devices. Every interface lives in exactly one namespace at a time. eth0 in one namespace and eth0 in another are two unrelated devices; moving a device with ip link set <dev> netns <ns> is a literal handoff, not a copy. This is the foundation of everything else — with no devices, there is nothing to bind to, route through, or filter on.
The IPv4 and IPv6 protocol stacks. Each namespace has its own copy of the kernel's TCP/IP state: open sockets, the connection-tracking table (conntrack), neighbor caches (ARP for v4, NDP for v6), the Path MTU cache, and the tunables under /proc/sys/net. Two namespaces can set tcp_keepalive_time to different values without affecting each other, and a workload that exhausts the connection-tracking table in one does not starve the other. The stack code is the same kernel; the state is per-namespace.
Routing tables. Each namespace has its own forwarding information base. A packet generated or received in namespace A is routed using A's tables only; A's default via 10.0.0.1 says nothing about how B forwards. This is why two containers can each call 10.40.0.1 their default gateway without contradiction — the lookup happens in their own table.
Firewall rules. Netfilter (iptables/nftables) hooks are per-namespace. Rules installed inside a container's namespace filter only that namespace's traffic; the host's iptables -L does not see them, and they do not see the host's. CNI plugins exploit this directly — a per-pod ruleset stays scoped to the pod.
Port numbers. The TCP and UDP port spaces are per-namespace. Two containers can both bind(0.0.0.0:80) because the kernel checks for collisions inside the namespace's own port table; the host's port 80 is a third, independent slot.
The /proc/net, /sys/class/net, and /proc/sys/net views. Userspace tools — ss, ip, iptables, sysctl — read kernel state through these filesystems. The kernel makes them namespace-aware, so a process inside a namespace sees only its own interfaces, sockets, and sysctls. That is what lets ip link inside a container show only the container's devices without the tool needing to know about namespaces at all.
The abstract UNIX domain socket namespace. Abstract sockets — those with a leading null byte in their address, used by D-Bus and a handful of system services — are scoped to the network namespace rather than the filesystem. Path-based UNIX sockets (/run/foo.sock) live in the mount namespace instead. Most application code never notices, but system services that bind abstract names sometimes do.
The consequence of all of this is the rule worth memorizing: two namespaces cannot collide on a port and cannot accidentally route through each other, because they share neither the port table nor the route table.
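The port half of that rule is quick to see for yourself. A sketch, assuming python3 is available as a throwaway listener (any program that binds a port would do, and the exact ss output will vary):
sudo ip netns add p1
sudo ip netns add p2
# Both listeners take port 80 without EADDRINUSE: the kernel checks for
# collisions only within each namespace's own port table.
sudo ip netns exec p1 python3 -m http.server 80 &
sudo ip netns exec p2 python3 -m http.server 80 &
sudo ip netns exec p1 ss -ltn
# LISTEN ... :80 ...    (only p1's listener is visible here)
kill %1 %2
sudo ip netns del p1
sudo ip netns del p2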
What the namespace does not contain is also worth saying. It does not include the resolver — /etc/resolv.conf is a regular file in the mount namespace, and the runtime arranges it. It does not include process credentials, capabilities, or the user namespace's privilege rules; those compose with the network namespace but are independent of it (a process can be inside a network namespace while still being uid 0 in the host user namespace, or vice versa). And it does not, by itself, route anywhere. A fresh network namespace contains only a loopback interface, and that loopback is down. Until something puts an interface in and configures it, the namespace cannot send a packet.
The simplest demonstration is hand-rolled:
sudo ip netns add demo
sudo ip netns exec demo ip link
# 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN ...
sudo ip netns exec demo ip route
# (empty)
ip netns add does two things: it calls unshare(CLONE_NEWNET) to create the namespace, and it bind-mounts the namespace's nsfs inode under /var/run/netns/demo so the namespace persists past the calling process. That bind mount is what lets a CNI plugin configure a namespace before any container process exists inside it.
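The mount is visible with ordinary tools, and any program that can open the file can enter the namespace through it. A quick check, assuming util-linux's findmnt and nsenter are installed (output trimmed; the inode number will differ, and /var/run is a symlink to /run):
findmnt -t nsfs
# /run/netns/demo   nsfs[net:[4026532...]]   nsfs   rw
sudo nsenter --net=/var/run/netns/demo ip link
# the same single, down loopback that ip netns exec demo ip link showed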
The rest of this chapter is about getting useful interfaces into that empty namespace and getting packets to flow through them.
veth Pairs: A Cable In The Kernel
A veth pair is two virtual Ethernet devices joined back-to-back. Whatever transmits on one end is received on the other. The pair has no protocol of its own, no encapsulation, no userspace forwarder; the kernel's veth driver simply hands the skb from one device to the other. The man page (veth(4)) is two paragraphs because there is not much to say beyond "it is a cable."
That simplicity is what makes veth the workhorse of container networking. A pair created on the host can have one end moved into a namespace; once moved, packets sent from inside the namespace come out on the host end, and the host's networking stack can do whatever it likes with them — switch them through a bridge, route them, hand them to an eBPF program, send them to a userspace agent. The container side does not need to know.
Build one and watch it work, no namespaces yet:
sudo ip link add veth-a type veth peer name veth-b
sudo ip addr add 10.20.0.1/24 dev veth-a
sudo ip addr add 10.20.0.2/24 dev veth-b
sudo ip link set veth-a up
sudo ip link set veth-b up
# Linux's reverse-path filter normally drops traffic in this self-connected
# topology because both addresses live on the same host. Loosen it for the
# demo: 2 selects loose mode globally, 0 turns it off on the veth ends.
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
sudo sysctl -w net.ipv4.conf.veth-a.rp_filter=0
sudo sysctl -w net.ipv4.conf.veth-b.rp_filter=0
ping -c1 -I veth-a 10.20.0.2
# 64 bytes from 10.20.0.2: icmp_seq=1 ttl=64 time=0.04 ms
Both ends are on the host, both have IPs, both are up. The ping leaves on veth-a and is received on veth-b. There is no switch in between — it is a literal cable. Tear it down:
sudo ip link del veth-a
# deleting one end removes the pair
The shape of every container network attachment is the same: a veth pair with one end relocated. Move veth-b into a namespace and the pair becomes a wire from inside the namespace out to the host. We will do exactly that in a few sections.
Linux Bridges: A Software Switch
A bridge is a software layer-2 switch. Each device added to the bridge becomes a port. Frames received on one port are forwarded to the appropriate port based on the destination MAC address; the bridge learns which port a MAC lives behind by remembering the source MAC of frames it sees. Broadcasts and unknown unicasts go to every port. This is the same model the rack-mounted switch under your desk uses; the only difference is that the cables are virtual.
Build one in isolation and prove the forwarding works:
# A bridge with two veth pairs plugged into it.
sudo ip link add br-demo type bridge
sudo ip link set br-demo up
sudo ip link add veth1a type veth peer name veth1b
sudo ip link add veth2a type veth peer name veth2b
# Plug the "a" ends into the bridge.
sudo ip link set veth1a master br-demo
sudo ip link set veth2a master br-demo
sudo ip link set veth1a up
sudo ip link set veth2a up
master br-demo enslaves the device to the bridge — the kernel installs a receive handler on the device that hands incoming frames to the bridge's forwarding logic. From here, any frame arriving on veth1a will be sent out veth2a if the destination MAC is on the other side, and broadcasts will go out both. Configure the "b" ends with addresses on the same subnet, bring them up, and the two ends can talk to each other through the bridge:
sudo ip addr add 10.30.0.1/24 dev veth1b
sudo ip addr add 10.30.0.2/24 dev veth2b
sudo ip link set veth1b up
sudo ip link set veth2b up
ping -c1 -I veth1b 10.30.0.2
# 64 bytes from 10.30.0.2: icmp_seq=1 ttl=64 time=0.05 ms
Watch the bridge learn:
bridge fdb show br br-demo
# <mac of veth1a> dev veth1a master br-demo permanent
# <mac of veth2a> dev veth2a master br-demo permanent
# <mac of veth1b> dev veth1a master br-demo
# <mac of veth2b> dev veth2a master br-demo
The first two entries are permanent — the bridge ports' own MACs. The last two are learned dynamically: when veth1b sent its ARP request through the bridge, the bridge saw the source MAC arrive on port veth1a and remembered it. The same thing a hardware switch does, in a kernel module. Tear it down:
sudo ip link del br-demo
sudo ip link del veth1a
sudo ip link del veth2a
Three things to internalize before we start adding namespaces. A bridge does not assign IPs to ports; the devices hold IPs, and the bridge forwards Ethernet frames between them. A bridge can itself hold an IP — ip addr add 10.30.0.254/24 dev br-demo — which makes the bridge a layer-3 entity for its subnet, the role it plays as the gateway in a container network. And a bridge does not by itself reach the outside world: forwarding decisions to anywhere outside the bridge subnet require routes and, usually, NAT.
Composing The Primitives: A Two-Container Network
Now compose. The pattern is one bridge in the host namespace, one veth pair per container with the host end on the bridge and the container end inside the container's network namespace, addresses on the namespaced ends, and a default route via the bridge's gateway IP. That is the shape every CNI bridge plugin produces.
Two namespaces, two veth pairs, one bridge:
# Bridge with a gateway IP for the container subnet.
sudo ip link add br0 type bridge
sudo ip addr add 10.40.0.1/24 dev br0
sudo ip link set br0 up
# Container namespaces.
sudo ip netns add c1
sudo ip netns add c2
# A veth pair per container. The "h" end stays on the host bridge;
# the "c" end goes into the container namespace and is renamed to eth0
# so the container sees a familiar interface name.
sudo ip link add c1-h type veth peer name c1-c
sudo ip link add c2-h type veth peer name c2-c
sudo ip link set c1-h master br0
sudo ip link set c2-h master br0
sudo ip link set c1-h up
sudo ip link set c2-h up
sudo ip link set c1-c netns c1
sudo ip link set c2-c netns c2
sudo ip -n c1 link set c1-c name eth0
sudo ip -n c2 link set c2-c name eth0
# Configure the namespaced ends.
sudo ip -n c1 addr add 10.40.0.2/24 dev eth0
sudo ip -n c2 addr add 10.40.0.3/24 dev eth0
sudo ip -n c1 link set lo up
sudo ip -n c2 link set lo up
sudo ip -n c1 link set eth0 up
sudo ip -n c2 link set eth0 up
# Default route via the bridge gateway.
sudo ip -n c1 route add default via 10.40.0.1
sudo ip -n c2 route add default via 10.40.0.1
Verify the two "containers" can reach each other and the gateway:
sudo ip netns exec c1 ping -c1 10.40.0.3
# 64 bytes from 10.40.0.3: icmp_seq=1 ttl=64 time=0.06 ms
sudo ip netns exec c1 ping -c1 10.40.0.1
# 64 bytes from 10.40.0.1: icmp_seq=1 ttl=64 time=0.05 ms
Trace what just happened. From inside c1, the kernel selected eth0 as the output device based on the route to 10.40.0.0/24, did neighbor resolution for 10.40.0.3, and transmitted the frame on eth0 (the namespaced end of c1's veth pair). The pair delivered the frame to c1-h on the host. c1-h is a port on br0, so the bridge's forwarding logic ran: looked up the destination MAC, found it on port c2-h, transmitted the frame there. The pair delivered it to eth0 inside c2. The namespace, the veth, and the bridge each did exactly the job their man page promises, and together they implement container networking.
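Two optional checks make that path visible (MAC addresses and exact formatting will vary):
sudo ip -n c1 neigh show
# 10.40.0.3 dev eth0 lladdr <mac of c2's eth0> REACHABLE
bridge fdb show br br0 | grep -v permanent
# <mac of c1's eth0> dev c1-h master br0
# <mac of c2's eth0> dev c2-h master br0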
This is the topology:
    c1 namespace                      c2 namespace
    eth0 10.40.0.2/24                 eth0 10.40.0.3/24
         |                                 |
         | veth pair                       | veth pair
         |                                 |
       c1-h                              c2-h
         \                                 /
          +----------- br0 ---------------+
                   10.40.0.1/24
                 (host namespace)
Routes, Forwarding, And Masquerade
The two-container setup so far reaches itself but cannot reach the outside world. Three things are needed. The host must forward packets between interfaces (off by default for safety). The container subnet must be NATed at the host's egress interface, because no router upstream of the host knows how to reach 10.40.0.0/24. And the host needs a return route for traffic heading back to that subnet; that last one it already has, because assigning 10.40.0.1/24 to br0 installed a connected route for 10.40.0.0/24 in the host's table. That leaves two commands:
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 10.40.0.0/24 ! -o br0 -j MASQUERADE
sudo ip netns exec c1 ping -c1 1.1.1.1
# 64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=8.3 ms
That is masquerade: as packets leave the host on its real upstream interface, iptables rewrites their source address to the host's address, and the reply comes back to the host where the conntrack table recognizes it and rewrites the destination back to the container's address. The container has external connectivity without anyone outside knowing the container subnet exists.
Two things this hides that show up at scale. Masquerade defeats source-based policy upstream — every container looks like the host. And the conntrack table is finite: at very high connection rates, container-to-external traffic can exhaust it. Kubernetes' cluster networking promises pod-to-pod without NAT precisely to avoid these problems within the cluster; pod-to-external still typically uses some form of NAT or proxy.
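One practical note before moving on: if the external ping fails even with forwarding and masquerade in place, check the filter table's FORWARD chain. Some hosts set its default policy to DROP (a machine with Docker installed is the common case), and then nothing is forwarded regardless of the routing setup. A hedged fix, scoped to the demo subnet:
sudo iptables -A FORWARD -s 10.40.0.0/24 -j ACCEPT
sudo iptables -A FORWARD -d 10.40.0.0/24 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
If you add these, remove them with the matching -D commands during cleanup.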
CNI Plugins Do Exactly This
A CNI ADD invocation against the bridge plugin produces the same shape we just built by hand. The runtime hands the plugin a path to the container's network namespace; the plugin creates a veth pair, moves one end into that namespace path, attaches the other end to a configured bridge, calls the IPAM plugin to allocate an address, configures the address and a default route inside the namespace, and optionally installs a masquerade rule. The plugin returns JSON describing what it did. Chapter 15 walks the contract; the takeaway here is that the kernel work is what we have already done, and the plugin is mostly orchestration around ip link add type veth and friends.
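For concreteness, here is a sketch of a bridge-plugin configuration that would produce roughly the topology built above. The field names follow the containernetworking bridge and host-local plugin documentation; the network name, bridge name, and subnet are this chapter's values, not defaults:
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "br0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.40.0.0/24",
    "gateway": "10.40.0.1"
  }
}
isGateway puts the gateway address on the bridge, ipMasq installs the masquerade rule, and host-local hands out addresses from the subnet: exactly the hand work of the previous sections.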
When the container exits, the runtime calls CNI DEL. The plugin reverses the operations: deletes the veth pair (which destroys both ends), releases the IP back to IPAM, and removes any masquerade rule it owns. Because the namespace itself is owned by the runtime, the plugin does not delete it; the runtime tears down the namespace when the container's last process exits.
Cleanup
Tear down the demo before moving on:
sudo iptables -t nat -D POSTROUTING -s 10.40.0.0/24 ! -o br0 -j MASQUERADE
sudo ip netns del c1
sudo ip netns del c2
sudo ip link del br0
# The veth pairs are already gone: deleting a namespace destroys the
# virtual devices inside it, and destroying one end of a veth pair
# destroys its peer, so the host-side ends left the bridge with them.
# Only the bridge itself needed explicit removal.
DNS Is A Separate Problem
A network namespace gives a process its own network stack. It does not give it its own resolver. DNS in a container comes from /etc/resolv.conf inside the container's mount namespace — a regular file, written by the runtime. Kubernetes layers cluster DNS on top of that: kubelet writes /etc/resolv.conf to point at the cluster DNS service IP, and the cluster DNS add-on (CoreDNS) answers lookups for services and pods. The CNI plugin can return DNS information in its result, but that does not, by itself, configure resolution; the runtime has to act on it.
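For hand-rolled namespaces, ip netns exec offers a small convenience that mirrors what runtimes do: per ip-netns(8), files under /etc/netns/<name>/ are bind-mounted over their /etc counterparts for the duration of the command. A throwaway illustration (the nameserver address is just an example value):
sudo ip netns add dnsdemo
sudo mkdir -p /etc/netns/dnsdemo
echo "nameserver 10.96.0.10" | sudo tee /etc/netns/dnsdemo/resolv.conf
sudo ip netns exec dnsdemo cat /etc/resolv.conf
# nameserver 10.96.0.10
cat /etc/resolv.conf
# (the host's resolver configuration, unchanged)
sudo ip netns del dnsdemo
sudo rm -r /etc/netns/dnsdemo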
Where This Goes
The next chapter covers the CNI contract in detail — how the runtime invokes plugins, what each plugin field means, and how chains compose. Chapter 16 then covers the Kubernetes pod networking model: how the kubelet uses CNI, how pod IPs are allocated, and what the cluster's reachability promises are.
Sources And Further Reading
- network_namespaces(7): https://man7.org/linux/man-pages/man7/network_namespaces.7.html
- veth(4): https://man7.org/linux/man-pages/man4/veth.4.html
- ip-netns(8): https://man7.org/linux/man-pages/man8/ip-netns.8.html
- ip-link(8): https://man7.org/linux/man-pages/man8/ip-link.8.html
- bridge(8): https://man7.org/linux/man-pages/man8/bridge.8.html
- arp(7): https://man7.org/linux/man-pages/man7/arp.7.html
- rtnetlink(7): https://man7.org/linux/man-pages/man7/rtnetlink.7.html
- Linux kernel veth driver: https://github.com/torvalds/linux/blob/master/drivers/net/veth.c
- Linux kernel bridge code: https://github.com/torvalds/linux/tree/master/net/bridge
- Linux kernel network namespace setup: https://github.com/torvalds/linux/blob/master/net/core/net_namespace.c
- Linux kernel struct net: https://github.com/torvalds/linux/blob/master/include/net/net_namespace.h
- CNI bridge plugin: https://github.com/containernetworking/plugins/blob/main/plugins/main/bridge/bridge.go
- Kubernetes networking model: https://kubernetes.io/docs/concepts/services-networking/