Chapter 6: Container Filesystems
A container needs its own /. The processes inside should see an Alpine root or a Debian root, not whatever the host happens to be running, and writes inside should not leak back out. The naive way to deliver that — give every container a private copy of a full root filesystem — is what this chapter exists to avoid.
A 200 MB Alpine-based image starting ten containers would burn 2 GB of disk for ten near-identical trees; a 500 MB Debian-based image starting the same ten would burn 5 GB. Pulling each of those copies once per host, every time the image moves, makes startup a download problem instead of a process-creation problem. Real systems do not work that way. They share the bytes that are common between images, store each unique blob exactly once, and give every running container a thin writable layer that captures only the changes it makes. The whole chapter is about the mechanics of that idea.
Safety: the OverlayFS and `pivot_root` examples mutate kernel mount state and need root. The image-inspection examples are safe and read-only. Use a disposable Linux VM for the privileged sections. Examples were checked on Ubuntu 24.04 with kernel 6.8 and containerd 1.7.
The Mental Model: Stacked, Content-Addressed, Copy-On-Write
Three ideas, each pulling its weight:
Stacked. A container's root filesystem is built by stacking directories. The bottom is the base layer (Alpine's /bin, /etc, /lib, ...). On top of that, each RUN or COPY in a Dockerfile adds another directory containing only the files that step changed. The final stack — read top-to-bottom, with upper entries hiding lower ones of the same name — is what the process sees as /. Two images sharing the same Alpine base share the same bottom directory on disk.
Content-addressed. Every layer is identified by the SHA-256 of its bytes. Two layers with identical content have identical digests, so the storage layer deduplicates by construction: you never store the same layer twice, regardless of which images reference it. The image manifest is just a list of digests.
Copy-on-write. The stacked layers are read-only. The container gets one extra writable directory on top. A read falls through the stack until it finds the file; a write to a previously-untouched file copies it up into the writable layer first, then modifies the copy. The lower layers are never touched. When the container exits, throwing away the writable layer throws away every change.
Linux gives all three of these to userspace through OverlayFS, a kernel filesystem that takes a list of directories and presents them as a merged view. The container runtime arranges the directories; OverlayFS does the merging. Once the mental model is clear, the rest of the chapter is just naming the pieces and showing where each one lives on disk.
How OverlayFS Merges Directories
OverlayFS takes four arguments at mount time:
- `lowerdir` — a colon-separated list of read-only directories. Read right-to-left: the last entry is the bottom of the stack, the first is the top.
- `upperdir` — the single writable directory. All changes land here.
- `workdir` — an empty scratch directory the kernel uses for atomic operations. Must live on the same filesystem as `upperdir`.
- The mount point itself, sometimes called `merged` — what processes actually see.
The merging rule is simple: for any path, the kernel returns the entry from the topmost layer that has it, with the writable upper layer on top of every lower. A write to a file that exists only in a lower layer triggers a copy-up: the kernel copies the file into upperdir and applies the write there. A delete is encoded as a whiteout — a character device with major:minor 0:0 in upperdir, which OverlayFS interprets as "hide the entry of this name in any lower layer." Deleting an entire directory's lower contents uses an opaque marker, the trusted.overlay.opaque="y" xattr on a directory in upperdir, which says "ignore everything below this directory."
You do not have to take this on faith. Mount one yourself.
Build An Overlay By Hand
sudo mkdir -p /tmp/ov/{lower1,lower2,upper,work,merged}
# Lower1 = base layer
echo "from lower1" | sudo tee /tmp/ov/lower1/file-a > /dev/null
echo "lower1 only" | sudo tee /tmp/ov/lower1/file-b > /dev/null
# Lower2 = stacked above lower1
echo "from lower2 (overrides lower1)" | sudo tee /tmp/ov/lower2/file-a > /dev/null
# Stack them. Order is right-to-left: lower1 is the bottom, lower2 is on top of it.
sudo mount -t overlay overlay \
-o lowerdir=/tmp/ov/lower2:/tmp/ov/lower1,upperdir=/tmp/ov/upper,workdir=/tmp/ov/work \
/tmp/ov/merged
cat /tmp/ov/merged/file-a # from lower2 (overrides lower1)
cat /tmp/ov/merged/file-b # lower1 only
file-a exists in both lower layers; the merged view shows the upper of the two. file-b exists only in lower1; it shows through unchanged. Now write through the merged view and watch the change land in upperdir, not in either lower:
echo "appended" | sudo tee -a /tmp/ov/merged/file-a > /dev/null
ls /tmp/ov/upper/
# file-a
diff /tmp/ov/upper/file-a /tmp/ov/lower2/file-a
# differs: upper has the appended line; lower2 is untouched
That is a copy-up. The first write to file-a copied it from lower2 to upper, then appended. The lower layers are byte-for-byte the same as before. Delete file-b from the merged view to see the whiteout:
sudo rm /tmp/ov/merged/file-b
ls /tmp/ov/merged/ # file-a
ls -la /tmp/ov/upper/ # c--------- 0 0 ... file-b
A character device with mode 0 and major:minor 0:0. That is OverlayFS's whiteout. The lower-layer file-b still exists; the upper-layer character device hides it.
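To double-check the device numbers, stat's %t and %T format specifiers print the major and minor numbers directly:

stat -c '%F %t:%T' /tmp/ov/upper/file-b
# character special file 0:0

Clean up: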
sudo umount /tmp/ov/merged
sudo rm -rf /tmp/ov
That is the entire kernel mechanism a container's root filesystem rides on. The rest of the chapter is about how layers get into lowerdir in the first place, and how merged becomes / for the container process.
What An OCI Image Is
An OCI image is the on-the-wire and on-disk format for shipping the layers a snapshotter will eventually unpack. It is a content-addressed bundle: one manifest, one image config, and an ordered list of layer blobs, each addressed by SHA-256 digest. The easiest way to see the structure is to pull an image into an OCI layout directory:
sudo apt-get install -y skopeo jq
skopeo copy docker://alpine:3.20 oci:./alpine-oci:3.20
ls alpine-oci/
# blobs/ index.json oci-layout
oci-layout is a marker file. index.json is the entry point: for a single-arch image it points at one manifest; for a multi-arch image it points at an image index that selects per {architecture, os}. The manifest in turn names a config blob and a list of layers, each as a {mediaType, digest, size} descriptor. Walk the chain:
# index.json -> manifest digest
MANIFEST_DIGEST=$(jq -r '.manifests[0].digest' alpine-oci/index.json | sed 's/sha256://')
# manifest -> config digest and layer digests
jq '{config: .config.digest, layers: [.layers[].digest]}' \
alpine-oci/blobs/sha256/$MANIFEST_DIGEST
# config -> rootfs.diff_ids
CONFIG_DIGEST=$(jq -r '.config.digest' alpine-oci/blobs/sha256/$MANIFEST_DIGEST | sed 's/sha256://')
jq '.rootfs' alpine-oci/blobs/sha256/$CONFIG_DIGEST
# {
# "type": "layers",
# "diff_ids": ["sha256:..."]
# }
Three different SHA-256 digests appear in this chain, and confusing them is the source of most layer-mismatch and snapshot-deduplication bugs:
- Manifest layer digest — SHA-256 of the compressed layer bytes (gzip or zstd). This is what the registry serves and what the manifest references. The pull path verifies it.
- DiffID — SHA-256 of the uncompressed tar. Listed in the image config's `rootfs.diff_ids`. Identifies the layer's content independent of how it was compressed in transit.
- ChainID — a recursive hash defined as `ChainID(L0) = DiffID(L0)` and `ChainID(Ln) = SHA256(ChainID(Ln-1) + " " + DiffID(Ln))`. Identifies a stack of layers. Snapshotters key snapshots by ChainID, which is why two images sharing the same Alpine base share one on-disk snapshot for that stack instead of two. (Computed by hand below.)
To verify a layer's DiffID by hand:
LAYER_DIGEST=$(jq -r '.layers[0].digest' \
alpine-oci/blobs/sha256/$MANIFEST_DIGEST | sed 's/sha256://')
zcat alpine-oci/blobs/sha256/$LAYER_DIGEST | sha256sum
# Should match diff_ids[0] in the image config.
If the registry's compressed-layer digest does not produce a tar whose uncompressed SHA-256 matches the config's DiffID, the snapshotter refuses to use the result. Two independent verification points keep the bytes honest end to end.
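The ChainIDs can be computed the same way. A minimal sketch, assuming only the recursive definition above and the $CONFIG_DIGEST variable from the earlier walk (digests keep their sha256: prefix, per the image spec):

CHAIN=""
for DIFF in $(jq -r '.rootfs.diff_ids[]' alpine-oci/blobs/sha256/$CONFIG_DIGEST); do
  if [ -z "$CHAIN" ]; then
    CHAIN=$DIFF                                       # ChainID(L0) = DiffID(L0)
  else
    # ChainID(Ln) = SHA256(ChainID(Ln-1) + " " + DiffID(Ln))
    CHAIN="sha256:$(printf '%s %s' "$CHAIN" "$DIFF" | sha256sum | cut -d' ' -f1)"
  fi
  echo "$CHAIN"
done
# alpine has one layer, so the single ChainID equals the DiffID; for taller
# images, each printed value is the snapshot key for the stack up to that layer.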
Layer Tar Conventions
A layer is a tar archive representing the diff between this layer and the one below it. Additions and modifications are ordinary tar entries; deletions are encoded by two magic entry names:
- A file `<dir>/.wh.<name>` is a whiteout: it deletes `<name>` from any lower layer.
- A file `<dir>/.wh..wh..opq` is an opaque marker: lower layers' contents of `<dir>` are entirely hidden.
Notice the disconnect: the OCI layer format encodes deletion as a tar entry with a magic name; OverlayFS encodes it as a character device with major:minor 0:0. Neither side knows about the other. The unpacker translates between the two during snapshot creation.
To see whiteouts in a real image, build one that deletes a file:
mkdir build && cat > build/Dockerfile <<'EOF'
FROM alpine:3.20
RUN rm /etc/motd
EOF
docker buildx build --output type=oci,dest=motd.tar build/
mkdir motd-oci && tar -C motd-oci -xf motd.tar
MANIFEST=$(jq -r '.manifests[0].digest' motd-oci/index.json | sed 's/sha256://')
TOP_LAYER=$(jq -r '.layers[-1].digest' motd-oci/blobs/sha256/$MANIFEST | sed 's/sha256://')
zcat motd-oci/blobs/sha256/$TOP_LAYER | tar -tvf - | grep -E '\.wh\.|motd'
# -rw-r--r-- 0/0 0 ... etc/.wh.motd
The zero-byte etc/.wh.motd is RUN rm /etc/motd reduced to a tar entry the snapshotter can apply.
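The translation itself can be watched end to end. Import the archive into containerd, let the default snapshotter unpack it, and look for the character device that replaced the tar entry. A sketch: the --index-name flag and unpack-on-import behavior are as in containerd 1.7's ctr, snapshot ids are assigned locally so the exact path varies, and the snapshots directory is covered in the next sections:

sudo ctr image import --index-name motd-test motd.tar
sudo find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots \
  -type c -name motd 2>/dev/null
# .../snapshots/<id>/fs/etc/motd  <- the 0:0 whiteout, translated from etc/.wh.motd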
Content Store: Where The Bytes Live
When containerd pulls an image, it stores every blob — manifests, configs, and compressed layers alike — in its content store, addressed by digest:
/var/lib/containerd/io.containerd.content.v1.content/
blobs/sha256/<digest> # the actual bytes
ingest/<id>/ # in-flight uploads
sudo ls /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/ | head
sudo ctr content ls | head
The content store does not know what any blob means. It stores opaque bytes and verifies them on read. Meaning lives in the image service, which tracks the human-readable mapping alpine:3.20 -> sha256:<manifest> and follows the manifest to its config and layers. Splitting "store the bytes" from "track what bytes belong to which image" is what lets garbage collection and lazy fetching coexist with normal pulls.
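Both halves are easy to poke at with ctr. Pull an image, read the manifest digest off the image service, then fetch the raw blob from the content store by digest alone (a sketch; it assumes ctr 1.7's `image ls` output layout, where the third column is the digest):

sudo ctr image pull docker.io/library/alpine:3.20
DIGEST=$(sudo ctr image ls | awk '$1 ~ /alpine/ {print $3}')
sudo ctr content get $DIGEST | jq '.mediaType'
# "application/vnd.oci.image.index.v1+json" for a multi-arch pull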
Garbage collection treats content and snapshots together. A blob is reachable if some image, lease, or snapshot references it; everything else is collectable. Leases keep arbitrary content alive during in-flight operations — a pull holds a lease over its blobs until the unpack finishes:
sudo ctr leases ls
Snapshotters: From Blobs To Mountable Directories
Compressed tar blobs are not a filesystem. Some component has to decompress each layer, lay its contents out as a directory, translate .wh.* markers into the filesystem's native deletion form, and produce a lowerdir-ready tree. That component is the snapshotter.
containerd defines a snapshotter interface; plugins implement it. The default on Linux is the OverlayFS snapshotter. The interesting operations are:
- `Prepare(key, parent)` — produce a writable snapshot whose lower layers are the chain rooted at `parent`. Returns the mount config the runtime should apply.
- `View(key, parent)` — same, but read-only.
- `Commit(name, key)` — finalize a `Prepare` result as a new immutable layer named `name` (typically a ChainID).
Unpacking a multi-layer image is a sequence of Prepare → write the layer's tar contents → Commit, repeated for each layer in order. The final container gets one more Prepare on top, and this one stays writable for the lifetime of the container.
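The cycle can be driven by hand with ctr, whose snapshot subcommands map onto the interface (a sketch; argument order and defaults are as in containerd 1.7's ctr, and `mounts` prints the mount command rather than performing it):

sudo ctr snapshot prepare demo-build              # no parent: bottom of a new chain
sudo ctr snapshot mounts /mnt demo-build          # the bind/overlay mount you would apply
# ...mount it, apply a layer tar into it, unmount...
sudo ctr snapshot commit demo-layer0 demo-build   # demo-layer0 is now immutable
sudo ctr snapshot prepare demo-build2 demo-layer0 # the next layer stacks on top
sudo ctr snapshot rm demo-build2 && sudo ctr snapshot rm demo-layer0   # cleanup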
Three kinds of snapshot fall out of this:
- Committed — immutable, named by ChainID, produced by unpacking one layer on top of another.
- Active — writable, used as the upper layer for a running container.
- View — read-only checkout of a committed chain, used by tools that just need to read.
sudo ctr snapshot ls | head
# KEY PARENT KIND
# sha256:abcd... (chainID for layer 0) Committed
# sha256:efgh... (chainID for layer 0+1) sha256:abcd... Committed
# k8s.io/12/<container-id> sha256:efgh... Active
The on-disk layout of the OverlayFS snapshotter:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/
metadata.db
snapshots/<id>/fs/ # the layer's directory, fed to OverlayFS as lower or upper
snapshots/<id>/work/ # OverlayFS work dir (active snapshots only)
fs/ is the data; work/ is the kernel's scratch space for atomic operations, required to be on the same filesystem as the upper layer. metadata.db is a BoltDB file mapping snapshot keys to their parents and disk locations.
Other snapshotter implementations exist for other use cases:
- `native` copies entire layers instead of stacking. Used when OverlayFS is unavailable; very inefficient.
- `btrfs` uses btrfs subvolumes and snapshots; faster commits, requires a btrfs filesystem.
- `zfs` does the same on ZFS.
- `stargz`/`soci` support lazy fetching: the snapshotter mounts a layer before its bytes have finished downloading, by parsing the gzip or zstd index and serving file contents on demand. Important for cold-start with large images.
The interchangeability is the same trick OCI runtimes pull a level up: containerd does not care which snapshotter is configured, as long as it implements the interface.
Mount Setup In runc
When runc starts a container, it has a snapshot from containerd and a config.json with a mounts array. Inside a fresh mount namespace, it executes a fixed sequence:
1. Make the new mount namespace's root mount private to prevent propagation back to the host.
2. Mount the rootfs (the OverlayFS the snapshotter returned).
3. Mount the special filesystems described in `config.json`'s `mounts` array.
4. Create device nodes (bind from host) and link standard ones (`/dev/stdin` → `/proc/self/fd/0`, etc.).
5. Apply `linux.maskedPaths` (bind `/dev/null` over them).
6. Apply `linux.readonlyPaths` (remount as read-only).
7. `pivot_root(2)` to swap the prepared rootfs in as `/`.
8. Unmount and remove the old root.
9. Set propagation on `/` per `linux.rootfsPropagation`.
The order is fixed because pivot_root(2) has constraints: the new and old roots must be on different mounts and neither may be shared, and the old root must be reachable as a directory under the new root. Step 1 makes propagation private, step 2 mounts the new root, and steps 3–6 happen before pivot because the runtime can still name host paths during setup.
Watching it from outside is awkward — the work happens between clone(2) and execve(2) of the user process — but the kernel-side view is easy to inspect after the container has started:
docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/mountinfo
# 1 0 0:34 / / rw,relatime master:1 - overlay overlay rw,lowerdir=...
# 2 1 0:35 / /proc rw,nosuid,nodev,noexec,relatime - proc proc ...
# ...
docker stop demo
The first line is the rootfs: an OverlayFS mount whose `lowerdir` chain corresponds to the image's layers and whose `upperdir` is the container's writable snapshot. (Under Docker's default overlay2 storage driver the directories live beneath /var/lib/docker/overlay2 rather than containerd's snapshot tree, but the structure is the same.) The rest are the special filesystems from step 3. The runc source for this lives in libcontainer/rootfs_linux.go; the OCI spec describes the requirements in runtime-linux.md.
pivot_root Versus chroot
The classic Unix way to change a process's root is chroot(2): it sets the calling process's root directory to a path. It is also famously escapable. A process with an open file descriptor to a directory outside the chroot, or with CAP_SYS_CHROOT, can climb back out. chroot rewrites the process's view of /, but it does not unmount the old root or move it out of reach.
pivot_root(2) is what containers actually use, and the difference matters. pivot_root(new_root, put_old) atomically swaps the mount namespace's root: new_root becomes /, and the previous / is moved to put_old. Once that swap is done, the runtime unmounts put_old so the old root is no longer reachable at all — not by file descriptor, not by .., not by climbing out of a chroot. The container's process literally cannot name the host root through the filesystem after step 8 above.
To see the swap in isolation, without runc in the way:
sudo unshare --mount -- bash
# Inside the new mount namespace:
mkdir -p /tmp/newroot
mount -t tmpfs tmpfs /tmp/newroot          # placeholder rootfs; tmpfs makes it its own mount
mkdir -p /tmp/newroot/{old,bin,proc,sys,dev}
cp /bin/busybox /tmp/newroot/bin/          # needs a static binary: apt-get install busybox-static
# pivot_root requires / to be private and the new root to be a separate mount.
mount --make-rprivate /
cd /tmp/newroot
pivot_root . old
# After this, `.` is the new /; `/old` is the old root.
exec /bin/busybox sh
/bin/busybox ls /old
# usr etc var ... (the host's old root, still reachable)
/bin/busybox umount -l /old
/bin/busybox ls /
# bin old proc sys dev (now without the host's root visible)
This is exactly what runc does, with one important addition: runc unmounts the old root before exec'ing the user process so the container cannot even briefly see the host. The example above leaves it mounted long enough to inspect.
OCI Mounts In Practice
The mounts array in config.json is what the runtime applies in step 3 above. Each entry has destination, type, source, and options. The conventional default set:
| Destination | Type | Source | Notes |
|---|---|---|---|
| `/proc` | `proc` | `proc` | Reflects the PID namespace. Required for `ps`, `/proc/self/`. |
| `/dev` | `tmpfs` | `tmpfs` | `mode=755`. |
| `/dev/pts` | `devpts` | `devpts` | `newinstance`, `ptmxmode=0666`. |
| `/dev/shm` | `tmpfs` | `shm` | `mode=1777`, `size=65536k`. |
| `/dev/mqueue` | `mqueue` | `mqueue` | Matches the IPC namespace. |
| `/sys` | `sysfs` | `sysfs` | Often `ro` for non-privileged containers. |
| `/sys/fs/cgroup` | `cgroup2` | `cgroup` | Read-only bind, view limited by the cgroup namespace. |
Bind mounts (type: bind, source: <host-path>) are how host paths show up inside containers — Docker volumes, Kubernetes hostPath, and secret mounts all use them. The options field controls propagation: rprivate (default) does not propagate; rslave accepts host changes; rshared propagates both ways. Kubernetes' mountPropagation: HostToContainer corresponds to rslave; Bidirectional corresponds to rshared.
docker run --rm -d --name demo -v /tmp:/host-tmp alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/mountinfo | grep host-tmp
# .../tmp /host-tmp rw,relatime - ext4 /dev/...
docker stop demo
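Propagation is just as observable. A sketch of the HostToContainer behavior, assuming the host root is a shared mount (the systemd default), so there is something to propagate:

docker run --rm -d --name prop \
  --mount type=bind,source=/mnt,target=/host-mnt,bind-propagation=rslave \
  alpine:3.20 sleep 600
sudo mkdir -p /mnt/sub && sudo mount -t tmpfs tmpfs /mnt/sub
sudo touch /mnt/sub/made-on-host
docker exec prop ls /host-mnt/sub
# made-on-host: a mount created on the host after container start propagated in.
# With the rprivate default, /host-mnt/sub would appear empty instead.
sudo umount /mnt/sub
docker stop prop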
Rootless OverlayFS
OverlayFS uses trusted.* xattrs for opaque markers and redirects, and writing trusted.* requires CAP_SYS_ADMIN on the host. That is why, until recently, rootless OverlayFS did not work. Linux 5.11 added support for OverlayFS in user namespaces; on older kernels, rootless tooling (rootless Docker, podman) falls back to fuse-overlayfs, a userspace reimplementation that uses user.* xattrs and accepts the performance cost.
podman info --format '{{.Store.GraphDriverName}}'
# overlay (kernel) or overlay (fuse-overlayfs)
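Whether the running kernel permits the real thing is easy to probe from an unprivileged user namespace (expect a permission error instead of the echo on pre-5.11 kernels):

mkdir -p /tmp/ro/{l,u,w,m}
unshare --mount --map-root-user sh -c \
  'mount -t overlay overlay -o lowerdir=/tmp/ro/l,upperdir=/tmp/ro/u,workdir=/tmp/ro/w /tmp/ro/m \
   && echo kernel overlayfs works rootless'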
Where This Goes
The next chapter covers the security controls that compose with the filesystem to form the full container boundary — capabilities, seccomp, MAC, masked paths. The masked paths in particular layer on top of the mount setup described here: they are bind mounts performed after the OCI mount table is in place.
Sources And Further Reading
- OCI Image Specification: https://github.com/opencontainers/image-spec
- OCI image layout: https://github.com/opencontainers/image-spec/blob/main/image-layout.md
- OCI image layer format: https://github.com/opencontainers/image-spec/blob/main/layer.md
- containerd content flow: https://github.com/containerd/containerd/blob/main/docs/content-flow.md
- containerd snapshotter docs: https://containerd.io/docs/2.2/snapshotters/readme/
- Linux OverlayFS: https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html
- pivot_root(2): https://man7.org/linux/man-pages/man2/pivot_root.2.html
- mount(2): https://man7.org/linux/man-pages/man2/mount.2.html
- runc rootfs setup: https://github.com/opencontainers/runc/blob/main/libcontainer/rootfs_linux.go