Chapter 15: CNI

When kubelet asks containerd to run a pod, containerd has to attach a network namespace to a network before any workload container starts. It does not perform that work itself. It calls a separate executable on disk — a CNI plugin — and lets the plugin create interfaces, allocate addresses, install routes, and report back. CNI, the Container Network Interface, is the protocol that makes that handoff possible.

The word "plugin" is overloaded here, and it is worth pinning down before anything else. Chapter 10 described containerd as a daemon hosting a plugin graph — content, snapshotters, metadata, runtime v2, CRI, and the rest — where each plugin is a Go object loaded into the daemon at startup. CNI plugins are not part of that graph. They are external executables on disk, spawned as child processes. The connection point between the two worlds is the CRI plugin inside containerd: when kubelet asks it for a pod sandbox, it uses the go-cni library (which wraps the upstream libcni) to find and fork+exec CNI plugin binaries from /opt/cni/bin.

flowchart TB
    subgraph daemon[containerd daemon process]
        subgraph plugins[Plugin graph - chapter 10]
            cri[CRI plugin]
            meta[Metadata plugin]
            runtimev2[Runtime v2]
            content[Content store]
            snap[Snapshotters]
        end
        cnipath["go-cni + libcni<br/>libraries inside CRI plugin"]
        cri --> cnipath
    end
    subgraph bin[CNI binaries in /opt/cni/bin]
        bridge[/bridge/]
        hostlocal[/host-local/]
        portmap[/portmap/]
    end
    cnipath -. fork+exec .-> bridge
    cnipath -. fork+exec .-> hostlocal
    cnipath -. fork+exec .-> portmap

That is the static picture: CNI plugins live outside containerd, and one branch of the CRI plugin reaches them. The dynamic picture — what happens at runtime, in what order, with what data — is this:

sequenceDiagram
    participant K as kubelet
    participant C as containerd CRI
    participant G as go-cni
    participant L as libcni
    participant B as bridge binary
    participant I as host-local binary
    K->>C: RunPodSandbox
    Note over C: create netns at<br/>/var/run/netns/cni-
    C->>G: Setup(sandbox id, netns path)
    G->>L: AddNetworkList(config, runtime args)
    L->>B: fork+exec /opt/cni/bin/bridge<br/>env: CNI_COMMAND=ADD,<br/>CNI_NETNS, CNI_IFNAME, CNI_CONTAINERID<br/>stdin: plugin config JSON
    B->>I: fork+exec /opt/cni/bin/host-local<br/>env: CNI_COMMAND=ADD, ...<br/>stdin: ipam config JSON
    I-->>B: stdout: { ips, routes }
    Note over B: setns into netns,<br/>create veth, configure addr,<br/>install default route,<br/>install masquerade rule
    B-->>L: stdout: { interfaces, ips, dns }
    L-->>G: parsed CNI result
    G-->>C: result
    Note over C: cache result on sandbox,<br/>start pause container in netns
    C-->>K: sandbox ready

Five components, three of them on the containerd side and two of them plugin binaries. containerd CRI is the call site — it received RunPodSandbox from kubelet, created the network namespace as a bind-mounted file under /var/run/netns/, and now needs that namespace populated with an interface, an IP, and routes. go-cni is containerd's adapter, a small library in containerd's own repository that wraps libcni behind an interface shaped for the sandbox lifecycle: Setup, Remove, Check. libcni is the upstream CNI library from containernetworking/cni. It owns plugin discovery, configuration parsing, the invocation protocol, result caching, and GC. The plugin binaries are independent programs in /opt/cni/bin, none of them linked into containerd.
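
The go-cni surface is small enough to show whole. Here is a minimal sketch of a runtime-side caller, assuming containerd's default paths and a namespace that already exists; the sandbox ID and netns path are hypothetical:

package main

import (
	"context"
	"fmt"
	"log"

	gocni "github.com/containerd/go-cni"
)

func main() {
	// Mirror containerd's defaults: configs from /etc/cni/net.d,
	// binaries from /opt/cni/bin.
	l, err := gocni.New(
		gocni.WithPluginConfDir("/etc/cni/net.d"),
		gocni.WithPluginDir([]string{"/opt/cni/bin"}),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Read the *.conflist files off disk. WithDefaultConf picks the
	// first network configuration found in the conf dir.
	if err := l.Load(gocni.WithLoNetwork, gocni.WithDefaultConf); err != nil {
		log.Fatal(err)
	}

	// Attach an existing namespace, as the CRI plugin does for a
	// pod sandbox. Both values here are made up.
	ctx := context.Background()
	id := "example-sandbox"
	netns := "/var/run/netns/cni-example"

	result, err := l.Setup(ctx, id, netns)
	if err != nil {
		log.Fatal(err)
	}
	defer l.Remove(ctx, id, netns) // the DEL half of the pair

	for name, iface := range result.Interfaces {
		for _, ip := range iface.IPConfigs {
			fmt.Printf("%s -> %s\n", name, ip.IP)
		}
	}
}

This is the whole adapter: Load turns files under /etc/cni/net.d into in-memory chains, and Setup and Remove map one-to-one onto the sandbox lifecycle.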

Plugin binaries come in three roles, distinguished by where they sit in a chain. Main plugins like bridge, ipvlan, and macvlan are first in the chain and create the attachment from nothing — they receive an empty namespace and populate it with an interface. IPAM plugins like host-local and dhcp are invoked recursively by a main plugin, not by the runtime; they own address allocation and nothing else. Decorator plugins like portmap and bandwidth come after the main plugin in the chain, read the upstream plugin's result, and add behavior around the existing attachment without creating one of their own. The sequence diagram above shows one main–IPAM pair: bridge is the main plugin and host-local is its IPAM. A chain that also included portmap would run it as a third step after bridge returned.

The data crossing each boundary is small. Between go-cni and libcni it is a single Go function call:

r, err := n.cni.AddNetworkList(ctx, n.config, ns.config(n.ifName))

n.config is the parsed .conflist; ns.config(n.ifName) packages the sandbox ID, the netns path, and the interface name into the runtime arguments libcni expects; the return value is the CNI result containerd CRI caches on the sandbox. Between libcni and a plugin binary the boundary widens into a UNIX process invocation: a handful of environment variables for the per-call arguments, JSON on stdin for the configuration body, JSON on stdout for the result, and an exit code for success or failure. The bridge plugin in the diagram is itself a CNI runtime to the host-local plugin — it re-invokes the same protocol against the address allocator named in its ipam block. Delegation is recursion, not a separate mechanism.

CNI itself is small. It defines three things and stops. A configuration schema says what JSON the runtime hands the plugin. An invocation protocol says how the runtime starts the plugin and what it puts in the environment. A result schema says what the plugin writes back on stdout. The shape of the data path — bridge, overlay, eBPF, cloud routes — is whatever the plugin chooses; CNI does not care. That separation is what lets containerd, CRI-O, and any other runtime ship against dozens of network implementations without linking any of them in.

The rest of this chapter walks each piece of the protocol in turn — config, invocation, operations, chains — and ties the bridge plugin back to the hand-rolled network from chapter 14.

Version pin: this chapter tracks containernetworking/cni v1.3.0, containernetworking/plugins v1.9.1, and containerd/go-cni v1.1.13, which together implement CNI spec 1.1.0. The spec version and the library tags move independently.

What A CNI Configuration Looks Like

A CNI configuration describes a network: a name, plus an ordered chain of plugins that act on every attachment to that name. The runtime does not learn about networks from code — it reads configurations off disk, indexes them by name, and walks the matching chain when something asks to attach a namespace. Multiple networks can coexist on the same node; nothing in the protocol forces a one-to-one mapping between runtimes and networks. The configuration is the deployment contract: change the file and you change the network shape without rebuilding the runtime.

The file lives under /etc/cni/net.d, named *.conflist for a chain or *.conf for a single plugin. A typical containerd-on-Kubernetes node has one .conflist and no other CNI files.

A representative bridge-and-portmap chain:

{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": { "type": "host-local", "subnet": "10.244.0.0/16" }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}

The list-level fields are small. cniVersion is the spec version the runtime should obey when walking this list. name identifies the network — multiple .conflist files can coexist, and the runtime picks by name. plugins is the ordered chain. disableCheck and disableGC opt the list out of those maintenance operations; the optional cniVersions array advertises additional versions the chain supports.

Inside each plugin object, type is the binary name. The runtime resolves it against CNI_PATH (usually /opt/cni/bin) and executes whatever it finds. Every other field is the plugin's business: bridge and isGateway mean something to the bridge plugin and nothing to portmap; ipam is a nested config that the main plugin will delegate to another binary; capabilities is the plugin's request for runtime-supplied data such as port mappings or bandwidth limits. The spec calls out a few well-known keys — ipMasq, ipam, dns, capabilities — but they are conventions, not required fields.
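
Dropping down a level, the same file drives libcni directly. The sketch below loads a conflist and walks its chain; the filename is hypothetical, and the RuntimeConf fields are exactly the values that become the CNI_* environment variables described in the next section:

package main

import (
	"context"
	"log"

	"github.com/containernetworking/cni/libcni"
)

func main() {
	// Parse the .conflist shown above off disk.
	list, err := libcni.ConfListFromFile("/etc/cni/net.d/10-k8s-pod-network.conflist")
	if err != nil {
		log.Fatal(err)
	}

	// A CNIConfig knows where plugin binaries live; the nil exec
	// means the default fork+exec implementation.
	cninet := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

	// The per-attachment runtime arguments. These become
	// CNI_CONTAINERID, CNI_NETNS, and CNI_IFNAME in each
	// plugin's environment.
	rt := &libcni.RuntimeConf{
		ContainerID: "example-sandbox",
		NetNS:       "/var/run/netns/cni-example",
		IfName:      "eth0",
	}

	result, err := cninet.AddNetworkList(context.Background(), list, rt)
	if err != nil {
		log.Fatal(err)
	}
	_ = result // interfaces, ips, routes, dns
}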

The word "container" in the spec is broader than the Linux sense. It means a network isolation domain being attached to a network. On Linux that is a network namespace path, but CNI is not Linux-only and the protocol is not pinned to namespaces.

Invocation: Environment In, JSON Out

CNI's invocation protocol is the shape a CGI handler would have if you took it out of HTTP. The runtime treats the plugin as an opaque executable and reaches it through exactly the surfaces every process already has: environment variables for the per-call arguments, stdin for the configuration body, stdout for the structured reply, exit code for success or failure. Nothing else crosses the boundary — no shared memory, no library linkage, no language requirement, no version negotiation outside the JSON itself.

That choice is what makes CNI portable. A plugin can be written in any language. It can run in a different security context than the runtime, because it is a fresh process. It can be replaced on disk without restarting anything that does not have a call in flight. And the protocol can be reproduced by hand: print the environment, capture the stdin, run the binary in a shell.

libcni packs the call into environment variables:

"CNI_COMMAND="+args.Command,
"CNI_NETNS="+args.NetNS,
"CNI_IFNAME="+args.IfName,

The full set is CNI_COMMAND (the verb), CNI_CONTAINERID (the runtime's opaque ID for the attachment), CNI_NETNS (a path the plugin can setns(2) into, omitted on platforms without network namespaces), CNI_IFNAME (the interface name the plugin should create inside the namespace), CNI_ARGS (a ;-separated bag the runtime can use for free-form labels), and CNI_PATH (where to find further plugin binaries, since IPAM delegation re-invokes this same protocol).
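
Nothing in that list requires libcni. The following Go sketch reproduces the invocation by hand; the container ID and netns path are made up, and actually running it would mutate bridges, routes, and IPAM state on the host:

package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	// The config one plugin reads on stdin. When libcni drives a
	// chain it injects name and cniVersion into each element;
	// invoked by hand, we supply them ourselves.
	stdin := []byte(`{
	  "cniVersion": "1.0.0",
	  "name": "hand-rolled",
	  "type": "bridge",
	  "bridge": "cni0",
	  "isGateway": true,
	  "ipMasq": true,
	  "ipam": { "type": "host-local", "subnet": "10.244.0.0/16" }
	}`)

	cmd := exec.CommandContext(context.Background(), "/opt/cni/bin/bridge")
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID=example",
		"CNI_NETNS=/var/run/netns/cni-example",
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin", // where bridge finds host-local
	)
	cmd.Stdin = bytes.NewReader(stdin)

	// stdout carries either the JSON result or a structured CNI
	// error object; a nonzero exit code surfaces as err.
	out, err := cmd.Output()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}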

After the binary exits, libcni decodes the bytes against the requested result version:

stdoutBytes, err := exec.ExecPlugin(ctx, pluginPath, netconf, args.AsEnv())
return create.Create(resultVersion, fixedBytes)

The result is a structured object — interfaces, IPs, routes, DNS — versioned so that an older runtime can still consume output produced by a newer plugin. Errors are structured CNI error objects, not free-form text on stderr; that is what lets a runtime distinguish "plugin not found" from "IPAM exhausted."

This is why a CNI plugin does not have to be linked into containerd, kubelet, or any runtime. It has to be executable, discoverable, and able to speak environment variables plus JSON.

Operations And Ownership

Every CNI verb is one half of a state-ownership pair. The plugin can create kernel state — interfaces, addresses, routes, firewall rules, IPAM files — but the runtime is the only thing that knows when a namespace has gone away, when the configured chain has changed, or when a cached attachment no longer corresponds to any live pod. The verbs exist so the runtime can tell the plugin which of those things just happened, and the plugin can do the matching mutation. ADD and DEL are the obvious pair; CHECK, STATUS, and GC are there because real systems drift and the cache is not always right.

The verbs are sent in CNI_COMMAND:

| Operation | Purpose |
| --- | --- |
| ADD | Attach the namespace to the network. Creates interfaces, addresses, routes, firewall rules. |
| DEL | Detach. Removes whatever ADD created — including the IPAM allocation. |
| CHECK | Verify the attachment described by the cached result still exists. |
| STATUS | Report whether the plugin is currently able to service requests. |
| VERSION | Report supported CNI versions. |
| GC | Drop attachments that the runtime no longer considers valid. |
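
On the plugin side, the verb dispatch is usually delegated to the upstream skel package, which reads CNI_COMMAND and stdin and calls the matching handler. A minimal skeleton, sketched against the cni v1.3-era entry point with the handlers left empty:

package main

import (
	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	"github.com/containernetworking/cni/pkg/version"
)

func cmdAdd(args *skel.CmdArgs) error {
	// args.Netns, args.ContainerID, and args.IfName come from the
	// environment; args.StdinData is the config JSON. A real plugin
	// creates the attachment here and prints a result on stdout.
	// Errors go back as structured CNI error objects, not free text.
	return types.NewError(types.ErrInternal, "not implemented", "")
}

func cmdDel(args *skel.CmdArgs) error { return nil }

func cmdCheck(args *skel.CmdArgs) error { return nil }

func main() {
	// skel reads CNI_COMMAND and dispatches; VERSION is answered
	// directly from the supported-versions list.
	skel.PluginMainFuncs(skel.CNIFuncs{
		Add:   cmdAdd,
		Del:   cmdDel,
		Check: cmdCheck,
	}, version.All, "example plugin v0.0.1")
}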

ADD is the easy path to understand because it creates visible state. DEL matters just as much: a host veth, a masquerade rule, and an IPAM allocation outlive the process that asked for them unless something removes them, and the runtime has no way to know what the plugin created beyond what came back in the result. The IPAM allocation is the one that bites first — chapter 14's host-local plugin writes a file per address, and a missed DEL leaves that file on disk forever.

CHECK exists because the kernel state can drift. A daemon restart, a manual ip link del, a node reboot — any of these can leave a cached attachment that no longer matches reality. GC is the broader form: libcni reads cached attachments, compares them against a runtime-supplied list of attachments that should still exist, calls DEL on the stragglers, and (for spec 1.1.0 and later) issues a plugin-level GC so plugins can clean state the per-attachment cache does not name.
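
On the runtime side, GC is a single libcni call: hand over the attachments that should survive and let the library DEL the rest. A sketch, assuming the v1.3-era GC types; check the pinned tag before relying on the exact shapes:

package gcdemo

import (
	"context"

	"github.com/containernetworking/cni/libcni"
	"github.com/containernetworking/cni/pkg/types"
)

// gcStale deletes every attachment libcni has cached for this network
// that does not appear in keep. For spec 1.1.0 chains the library also
// issues the plugin-level GC verb afterward.
func gcStale(ctx context.Context, cninet *libcni.CNIConfig, list *libcni.NetworkConfigList, keep []types.GCAttachment) error {
	return cninet.GCNetworkList(ctx, list, &libcni.GCArgs{
		ValidAttachments: keep,
	})
}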

Chains And Reverse-Order Delete

A configuration list is a pipeline. The first plugin creates an attachment; each later plugin reads the cumulative result so far, mutates whatever it intends to mutate, and returns an updated result. By the end of the chain the result describes the union of every plugin's work, and the kernel state matches. The pipeline is the protocol's answer to "how do unrelated plugins compose without knowing about each other" — they communicate by passing a result object down the chain, and the runtime never has to know what any individual plugin does.

Order matters in both directions. Creation has to go front-to-back so decorators see something to decorate: a port-forwarding rule needs an allocated IP to point at, and a traffic shaper needs an interface to attach to. Deletion has to go back-to-front so decorators clean up before the thing they decorated disappears — otherwise you have iptables rules pointing at an IP that has already been re-allocated to a different pod.

On ADD, libcni walks the chain in order, passing each plugin's result into the next:

result, err = c.addNetwork(ctx, list.Name, list.CNIVersion, net, result, rt)

The first plugin attaches the namespace and returns a result describing what it created. The second plugin reads that result through the prevResult field of its own stdin config, decorates it, and returns a new result. The third does the same. By the end of the chain the result describes the union of every plugin's contribution.

On DEL, libcni walks the same list in reverse:

for i := len(list.Plugins) - 1; i >= 0; i-- {
    net := list.Plugins[i]

Reverse-order delete makes ownership the protocol's responsibility rather than the plugin author's. A decorator does not have to defensively check whether the upstream attachment still exists — it can trust that it is being called first.

From spec 1.1.0 onward, libcni passes the cached ADD result into the DEL call, so a plugin tearing down its own state has the same view it had during creation. Older plugins that ran DEL blind — knowing only the container ID — had to recompute their state from whatever they could probe on the host, which is exactly the kind of thing that goes wrong on a partial failure.

The Bridge Plugin Is Chapter 14 In A Binary

The bridge plugin is the canonical main plugin and the one whose source is worth reading first, because every other main plugin follows the same shape. Its ADD performs the same ip link / ip addr / ip route / iptables sequence chapter 14 wrote out by hand, packaged as a binary that consumes JSON instead of taking shell arguments: create or reuse a Linux bridge, create a veth pair with one end in the namespace given by CNI_NETNS, call IPAM, configure the address on the namespaced end, set a default route through the bridge gateway, and — if ipMasq is set — install a POSTROUTING masquerade rule for the allocated subnet.

Two lines in the source show the split:

hostInterface, containerInterface, err := setupVeth(...)
r, err := ipam.ExecAdd(n.IPAM.Type, args.StdinData)

setupVeth does the namespace work. It opens the network namespace from the path in CNI_NETNS, creates the veth pair inside the namespace (so the container end never appears in the host namespace, even briefly), moves the host end back out by name, and attaches it as a port on the configured bridge. The host end keeps the name the plugin chose; the namespaced end is renamed to whatever CNI_IFNAME said — eth0, for pods.
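
The namespace choreography is compact enough to sketch. This version uses the netlink and ns helper libraries from the plugins repository, with names simplified and MTU, address, and bridge-port handling omitted:

package veth

import (
	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

// setupVethSketch creates a veth pair inside the target namespace and
// moves the host end out, mirroring the split the bridge plugin makes.
func setupVethSketch(netnsPath, contIfName, hostIfName string) error {
	netns, err := ns.GetNS(netnsPath)
	if err != nil {
		return err
	}
	defer netns.Close()

	// Do runs the closure inside the target namespace and hands it
	// a handle back to the namespace we started in.
	return netns.Do(func(hostNS ns.NetNS) error {
		// Create the pair inside the container namespace, so the
		// container end never appears on the host, even briefly.
		veth := &netlink.Veth{
			LinkAttrs: netlink.LinkAttrs{Name: contIfName},
			PeerName:  hostIfName,
		}
		if err := netlink.LinkAdd(veth); err != nil {
			return err
		}
		// Move the host end back out to the host namespace by name.
		host, err := netlink.LinkByName(hostIfName)
		if err != nil {
			return err
		}
		return netlink.LinkSetNsFd(host, int(hostNS.Fd()))
	})
}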

ipam.ExecAdd is the second half, and it is the protocol's defining trick: a plugin can be a CNI runtime to another plugin. The bridge plugin does not allocate addresses. It treats the type field inside its ipam block as another plugin binary name and re-invokes the CNI protocol — same environment variables, same stdin/stdout shape — against that binary, passing the IPAM sub-config as stdin. The IPAM plugin has no idea it is nested; it sees the same call shape a top-level runtime would have made. The IPs and routes come back through stdout, and the bridge plugin applies them to the namespaced interface before returning the merged result up to libcni.
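
The delegation half reduces to a few calls against the plugins repository's ipam package. A sketch of the pattern rather than the bridge plugin's actual code; note the deferred ExecDel, which releases the allocation if anything after IPAM fails:

package mainplugin

import (
	"fmt"

	current "github.com/containernetworking/cni/pkg/types/100"
	"github.com/containernetworking/plugins/pkg/ipam"
)

// delegateIPAM is the shape of a main plugin's second half: re-invoke
// the CNI protocol against the binary named in the ipam block, then
// convert the result to the current version for merging.
func delegateIPAM(ipamType string, stdinData []byte) (_ *current.Result, err error) {
	r, err := ipam.ExecAdd(ipamType, stdinData)
	if err != nil {
		return nil, err
	}
	// The allocator wrote its state the moment ExecAdd returned;
	// undo it if the rest of ADD fails, or the address leaks.
	defer func() {
		if err != nil {
			_ = ipam.ExecDel(ipamType, stdinData)
		}
	}()

	result, err := current.NewResultFromResult(r)
	if err != nil {
		return nil, err
	}
	if len(result.IPs) == 0 {
		return nil, fmt.Errorf("IPAM plugin %q returned no IPs", ipamType)
	}
	return result, nil
}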

Delegation is recursion, not a separate concept. That is how CNI keeps the protocol surface small: every problem that looks like "ask another component to do part of this" is solved by another CNI invocation.

IPAM Is A Separate Binary

IP address management is its own problem, and the spec keeps it at arm's length from the main plugin for two reasons. The address-allocation logic — pick from a range, persist the assignment, release on teardown — has nothing to do with bridges, overlays, or any data path; bundling it into every main plugin would mean re-implementing the same allocator. And different deployments need different allocators: a static range on a developer host, a coordinated DHCP server on a bare-metal network, a cloud provider's IPAM API in a managed environment. The recursion-as-delegation pattern lets all three share the same main plugins.

The host-local IPAM plugin is the simplest one shipped. It reads a range from its config, picks the next free address, writes a file per allocation under /var/lib/cni/networks/<network>/, and returns the address and any routes the config asked for.

ipConf, err := allocator.Get(args.ContainerID, args.IfName, requestedIP)

The on-disk state is why DEL matters. host-local names each allocation file after the IP and records the container ID inside; releasing an allocation means finding the file that names the container ID and deleting it. Skip the DEL — because the runtime crashed, because the plugin returned an error, because the user rm -rf'd the wrong directory — and the file stays. Later pods then fail to allocate from a range that the file system says is full, even after every visible process has exited. Production failures from "IPAM exhausted" are almost always state files for containers that no longer exist.
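
A toy allocator makes the failure mode concrete. This is not host-local's code (it drops the locking, ranges, and last-reserved tracking) but it keeps the same file-per-address scheme, and it shows exactly what a missed release leaves behind:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// reserve claims an address by creating a file named after the IP,
// containing the container ID. O_EXCL makes the file itself the
// reservation: two concurrent claims for one address cannot both win.
func reserve(dir, containerID, ip string) error {
	f, err := os.OpenFile(filepath.Join(dir, ip), os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return fmt.Errorf("address %s already allocated: %w", ip, err)
	}
	defer f.Close()
	_, err = f.WriteString(containerID)
	return err
}

// releaseByID scans for the file recording this container ID and
// deletes it. If DEL never runs, the file (and the address) stays
// claimed forever.
func releaseByID(dir, containerID string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(data)) == containerID {
			return os.Remove(filepath.Join(dir, e.Name()))
		}
	}
	return nil // nothing found: DEL must be idempotent
}

func main() {
	dir, _ := os.MkdirTemp("", "toy-ipam")
	fmt.Println(reserve(dir, "example", "10.244.0.5"))
	fmt.Println(releaseByID(dir, "example"))
}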

The dhcp IPAM plugin is the other shape worth knowing about. It runs as a long-lived daemon, holds DHCP leases on behalf of CNI invocations, and answers ADD/DEL from a Unix socket. Lease renewal happens out of band; the plugin lifecycle exists to keep the daemon's state coherent with the runtime's attachment set.

Decorators: portmap And Chained Plugins

Decorators install kernel state that is only meaningful in the presence of state from an earlier plugin: a port-forwarding rule that targets an allocated IP, a traffic shaper that targets a created interface, an iptables rule that names an interface the main plugin produced. They read the upstream plugin's result through prevResult on stdin, optionally combine it with capability data the runtime supplied out of band, and install additional kernel state. The main plugin upstream knows nothing about the decorator; the decorator's contract is with the runtime, not with the plugin it follows. That is what lets one bridge plugin be paired with any combination of portmap, bandwidth, tuning, or third-party decorators without modification.

portmap is the textbook decorator. It expects a previous plugin to have produced an interface and an IP, reads portMappings from the runtime's capability data, and installs host port forwarding rules through an iptables or nftables backend that target the IP it read out of the previous result.

Its guard is blunt:

if netConf.PrevResult == nil {
    return fmt.Errorf("must be called as chained plugin")
}

Two things follow from that one check. CNI plugins are not equal peers — they have positional roles. The plugin that creates the interface has to come first, decorators come after, and a config that gets the order wrong is a configuration error rather than a runtime error. And prevResult is part of the protocol surface, not an internal detail: a third-party plugin that wants to act on the same attachment reads it from stdin and uses the IPs the upstream plugin reported.

The capability handshake is the other half. portmap's config block declares "capabilities": {"portMappings": true}, and the runtime is expected to fill in runtimeConfig.portMappings from out-of-band data — for Kubernetes, the pod spec. The plugin reads its own static config plus the runtime-supplied capabilities and produces rules. Bandwidth shaping (bandwidth), traffic-control sysctls (tuning), and source-based routing (sbr) work the same way.
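
Both halves, prevResult and runtimeConfig, arrive in the same stdin document, so a chained plugin's config struct carries both. A sketch of the parsing, with a hypothetical port-mapping shape modeled on portmap's:

package chained

import (
	"encoding/json"
	"fmt"

	"github.com/containernetworking/cni/pkg/types"
	current "github.com/containernetworking/cni/pkg/types/100"
	"github.com/containernetworking/cni/pkg/version"
)

// portMapEntry is a hypothetical stand-in for one runtime-supplied
// port mapping; the field names follow the portmap convention.
type portMapEntry struct {
	HostPort      int    `json:"hostPort"`
	ContainerPort int    `json:"containerPort"`
	Protocol      string `json:"protocol"`
}

// chainedConf embeds types.NetConf, which carries cniVersion, name,
// type, and the raw prevResult. runtimeConfig is where the runtime
// puts the capability data it supplied out of band.
type chainedConf struct {
	types.NetConf
	RuntimeConfig struct {
		PortMaps []portMapEntry `json:"portMappings,omitempty"`
	} `json:"runtimeConfig,omitempty"`
}

func parseChainedConf(stdin []byte) (*chainedConf, *current.Result, error) {
	conf := &chainedConf{}
	if err := json.Unmarshal(stdin, conf); err != nil {
		return nil, nil, err
	}
	// Decode the raw prevResult against the declared cniVersion.
	if err := version.ParsePrevResult(&conf.NetConf); err != nil {
		return nil, nil, err
	}
	if conf.PrevResult == nil {
		return nil, nil, fmt.Errorf("must be called as chained plugin")
	}
	prev, err := current.GetResult(conf.PrevResult)
	if err != nil {
		return nil, nil, err
	}
	return conf, prev, nil
}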

What CNI Does Not Promise

CNI is a process protocol, not a network model. It does not promise that every pod can reach every other pod, that pod IPs are routable off the node, that policies are enforced, that addresses survive a node reboot, or that the data path uses any particular technology. Kubernetes defines the pod networking model, plugins implement it with different data paths, and the operator picks a plugin whose promises match the workload.

CNI also does not make plugin execution safe to run casually on a developer host. A real ADD mutates kernel interfaces, routes, firewall state, and on-disk IPAM. The previous chapter's hand-rolled topology is reproducible because every command was visible; running an unknown plugin against /var/run/netns/foo is not. Part VI puts plugin invocation into a disposable VM, which is where it belongs.

Where This Goes

The next chapter walks the runtime side: how kubelet, containerd CRI, and go-cni cooperate to produce the namespace path that this chapter assumed already existed, and how the workload containers in a pod join the namespace the sandbox owns.

Sources And Further Reading