Appendix B: Containers Inside Containers

"Can you run Docker inside a container?" has a short answer and a long one. The short answer is yes. The rest of this appendix is the long one.

Strip away the branding and the question is mechanical. A container is a process under namespaces and cgroups (chapters 2 and 4), fenced in by the layered controls of chapter 7. "Docker inside a container" means running a second container manager — dockerd, or containerd, or just an image builder — as one of those processes, and then handing it enough privilege back to create namespaces, cgroups, and mounts of its own. Almost everything that is hard here comes from that one tension: the outer container exists to take privilege away, and the inner engine needs some of it returned.

Two unrelated mechanisms both get called "Docker-in-Docker," and conflating them causes most of the confusion. One bind-mounts a socket; the other runs a nested daemon. They are not two flavors of one technique — they fail in different ways and call for different decisions.

Two Things Called "Docker-in-Docker"

Docker-out-of-Docker (DooD): the container runs only the docker client; it talks to the host's daemon over a bind-mounted socket. Containers it launches are siblings on the host.

Docker-in-Docker (DinD): the container runs its own dockerd; the containers it launches are children, nested one level down.

The names are folklore, not spec, but the distinction is real. DooD does not nest anything — it punches a hole back to the host. DinD nests a whole runtime stack and pays for it.

Docker-out-of-Docker: Bind-Mounting The Socket

# --rm discards the throwaway client container once docker ps returns.
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:cli \
  docker ps

The docker CLI is a thin HTTP client over /var/run/docker.sock. Mount that socket in and the in-container client drives the host's dockerd. No second daemon starts. When this container runs docker run, the new container is created by the top-level daemon and lands beside its parent in the host's process tree — a sibling, not a child. For continuous integration, Jérôme Petazzoni — who wrote the first dind image — recommends this socket approach over a nested daemon.

Two consequences follow from "the host daemon does the work," and both bite.

The socket is an unauthenticated root API. dockerd runs as root and applies no authorization to socket clients; anything that can write to the socket can start a container with --privileged, bind-mount / from the host, and read or rewrite any file on the machine. Handing a container the Docker socket is therefore equivalent to handing it root on the host — it removes the entire boundary Part 2 spends four chapters building. Treat a socket mount as a trust decision about the workload, not a convenience.

Paths resolve on the host, not in the container. When the in-container client says docker run -v "$(pwd):/work" ..., that -v is interpreted by the host daemon against the host filesystem. The path the inner client computed inside its own mount namespace usually does not exist on the host, so the mount silently brings in the wrong directory or an empty one. The shared daemon does buy one real advantage: every sibling shares the host's image cache and layer store, so CI pulls are warm.
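One common workaround for the path problem is to tell the container where its workspace lives on the host and translate paths before handing them to -v. The sketch below assumes a hypothetical layout in which the host directory /srv/ci/job-42 is mounted at /workspace; the function name to_host_path and both paths are illustrative, not something Docker provides.

```shell
# to_host_path CONTAINER_PREFIX HOST_PREFIX PATH
# Rewrites PATH, as seen inside the container, into the path the host
# daemon will resolve. Fails if PATH is not under CONTAINER_PREFIX.
to_host_path() {
  case "$3" in
    "$1")
      printf '%s\n' "$2"
      ;;
    "$1"/*)
      rest=${3#"$1"}            # strip the container-side prefix
      printf '%s\n' "$2$rest"   # graft the remainder onto the host prefix
      ;;
    *)
      return 1
      ;;
  esac
}

# Hypothetical layout: host /srv/ci/job-42 is mounted at /workspace.
to_host_path /workspace /srv/ci/job-42 /workspace/src
# prints /srv/ci/job-42/src
```

The translated value is what belongs on the left-hand side of -v when the in-container client starts a sibling. A path outside the shared prefix has no host equivalent, so the function fails instead of guessing.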

True Docker-in-Docker: A Nested Daemon

The socket mount never nested anything — it borrowed the host's daemon through a hole in the wall. True Docker-in-Docker does the opposite: it runs a second, complete runtime stack inside the container, with its own dockerd, its own /var/lib/docker, its own bridge networks, and its own containers a level down.

The word "nested" is where the theory lives, and it is easy to picture wrongly. Nesting containers does not nest kernels. A virtual machine boots a second kernel for the guest; a nested container does not — there is one kernel on the box, and the inner daemon's containers are ordinary host processes whose namespaces are stacked a level deeper than their parent's. A process in the innermost container has a PID in the inner PID namespace, another in the DinD container's, and another on the host: 1 to itself, some four-digit number to the host. It is the clone(2) and unshare(2) machinery from chapter 4, applied twice.
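The stacked PIDs are visible in the NSpid field of /proc/<pid>/status, which lists one PID per namespace level, starting from the namespace of whoever is reading. A minimal sketch of pulling that apart follows; the helper name nspid_levels and the sample PID values are made up for illustration.

```shell
# Reads a /proc/<pid>/status stream on stdin and prints the PID the process
# has at each PID-namespace level, outermost (the reader's namespace) first.
nspid_levels() {
  awk '$1 == "NSpid:" {
    for (i = 2; i <= NF; i++) printf "level %d: pid %s\n", i - 2, $i
  }'
}

# Sample line, as the host might see a process two namespaces down.
printf 'Name:\tsleep\nNSpid:\t4821\t37\t1\n' | nspid_levels
# level 0: pid 4821
# level 1: pid 37
# level 2: pid 1
```

On a live host, nspid_levels < /proc/<pid>/status does the same for a real process. One process, three numbers, one kernel.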

# Disposable VM only. --privileged removes most of chapter 7's boundary.
docker run --privileged --name dind -d docker:dind
docker exec dind docker info   # a second, independent daemon

That second dockerd has to do everything chapter 9 described runc doing, only now it does it from inside a context built specifically to forbid those operations. To start a single container the inner runtime calls clone(2) with the CLONE_NEW* flags, sets up a fresh mount namespace and pivot_root(2)s into the image rootfs, writes a cgroup subtree to bound the child, and wires a veth pair into a bridge it raised in its own network namespace. Every one of those is a privileged kernel operation, and the outer container removed the privilege for precisely those operations — chapter 7 is, in effect, the list of what a runtime needs and what a sandbox takes away. So the inner daemon walks straight into the wall, and docker:dind needs --privileged (the flag Petazzoni added to Docker in 2013 alongside that first dind) to get through it. Where it hits, control by control:

It mounts overlay filesystems and calls pivot_root(2) for every container. The default seccomp profile returns EPERM for mount(2) and umount2(2), and the docker-default AppArmor profile denies mounts outright.
It creates the nested namespaces and configures interfaces, which needs CAP_SYS_ADMIN and CAP_NET_ADMIN — both stripped from the default bounding set.
It writes a cgroup subtree to set limits on its children, against a /sys/fs/cgroup the outer runtime mounted read-only.
It programs iptables/netfilter rules for its bridge, which again needs CAP_NET_ADMIN, and may need loop devices for some storage backends, which the default cgroup device filter (chapter 7's BPF program) denies.

--privileged answers all four the same blunt way: it grants every capability instead of the trimmed default set, drops the seccomp profile, removes the AppArmor or SELinux profile, remounts /sys/fs/cgroup read-write, and gives the device cgroup a wildcard rule. Every operation in that list now succeeds, because every layer that would have stopped it is gone. That is what makes the nested daemon work, and also why --privileged is a security regression rather than a fix: it does not hand the inner runtime the four specific powers it needs, it removes all of chapter 7 at once and lets everything through. Sysbox and rootless mode, both covered below, give the inner runtime much less than that.
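The capability part of that difference can be checked from inside a container by decoding the CapBnd mask in /proc/self/status. The sketch below is illustrative: has_cap is a hypothetical helper, and the sample mask is the value commonly seen in a default Docker container, so treat it as an assumption about your setup.

```shell
# has_cap HEXMASK BIT: succeeds if capability number BIT is set in HEXMASK,
# where HEXMASK is a CapBnd value as printed in /proc/<pid>/status.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_ADMIN=12   # capability numbers from linux/capability.h
CAP_SYS_ADMIN=21

# Print the bounding set of the current shell, where /proc is available.
if [ -r /proc/self/status ]; then
  awk '$1 == "CapBnd:" { print $2 }' /proc/self/status
fi

# Assumed sample: the mask usually seen in a default Docker container.
has_cap 00000000a80425fb "$CAP_SYS_ADMIN" || echo "CAP_SYS_ADMIN missing"
has_cap 00000000a80425fb "$CAP_NET_ADMIN" || echo "CAP_NET_ADMIN missing"
```

In a --privileged container both checks would pass, because the mask has every capability bit the kernel knows about set.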

The Storage Driver Problem

Even with --privileged, the inner daemon's storage is the part that breaks first, and the reason reaches straight back to chapter 6. The inner dockerd wants to stack overlay2 for its own images, but its /var/lib/docker already lives on the outer container's overlay rootfs. The kernel's overlayfs documentation is explicit that a read-only lower layer "can even be another overlayfs," but the writable upperdir and workdir an inner overlay needs cannot themselves sit on an overlayfs: the whiteout and trusted.overlay.* xattr semantics do not compose. containerd's issue #3144 is this exact wall: "overlay2" is not supported over overlayfs.

The historical escape hatch is the vfs storage driver, which docker:dind falls back to when it detects an overlay rootfs. vfs does a full recursive copy of every layer instead of stacking them — correct on any filesystem, and brutally slow and space-hungry. The userspace alternative is fuse-overlayfs (appendix A), which gives overlay semantics through FUSE without needing the kernel to stack overlays.

The clean fix avoids the conflict instead of working around it: give the inner /var/lib/docker a real filesystem.

# A dedicated volume puts the inner daemon's storage on the host's
# filesystem (ext4/xfs), not on the outer container's overlay rootfs,
# so overlay2 stacks normally and vfs is never needed.
docker run --privileged -d \
  -v dind-storage:/var/lib/docker \
  docker:dind
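Whether an inner daemon is in the overlay-on-overlay case comes down to which filesystem backs /var/lib/docker. A rough way to check is to find the mount that contains the directory in a /proc/mounts-style table. The helper name fstype_of and the sample mount table are hypothetical, and the sketch ignores mount points containing escaped characters such as spaces.

```shell
# fstype_of PATH: reads /proc/mounts-format lines on stdin and prints the
# filesystem type of the mount that contains PATH (longest mount point wins;
# on a tie, the later mount shadows the earlier one).
fstype_of() {
  awk -v p="$1" '
    {
      m = $2
      if (m == "/" || p == m || index(p, m "/") == 1) {
        if (length(m) >= best) { best = length(m); fstype = $3 }
      }
    }
    END { print fstype }'
}

# Hypothetical mount table for a DinD container with a storage volume.
sample='overlay / overlay rw 0 0
proc /proc proc rw 0 0
/dev/sda1 /var/lib/docker ext4 rw 0 0'

printf '%s\n' "$sample" | fstype_of /var/lib/docker   # prints ext4
printf '%s\n' "$sample" | fstype_of /etc              # prints overlay
```

Inside a running container, fstype_of /var/lib/docker < /proc/mounts gives the real answer. A result of overlay means the inner daemon's storage is on the outer rootfs, which is the situation the volume avoids.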

Two failure modes here corrupt data rather than just failing loudly. The legacy devicemapper driver is not namespaced — separate daemons sharing a host see and clobber each other's backing devices. And dockerd is designed for exclusive ownership of /var/lib/docker; pointing two daemons at one shared directory, or sharing it across a restart in a way that overlaps, corrupts the layer store. One daemon, one storage tree.

Doing It Without `--privileged`: Sysbox

Sysbox is an OCI runtime — a drop-in replacement for runc in containerd's runtime-v2 slot (chapter 12) — built specifically to run dockerd, systemd, or containerd inside a container without --privileged. Nestybox open-sourced it; Docker acquired Nestybox in May 2022. What separates it from the privileged daemon is what it does not remove: the seccomp filter, the trimmed capability set, the LSM profile, and the device cgroup all stay in force. Sysbox keeps chapter 7's layers and emulates the handful of operations the inner daemon trips on, so the runtime gets what it needs without the sandbox coming down.

The mechanism is the primitives from Part 2 reassembled. The container always runs in a user namespace (chapter 7), so the root the inner daemon holds, and the capabilities that come with it, are real only against resources the namespace owns, and map to an unprivileged UID on the host. Rather than allow mount(2), umount2(2), and pivot_root(2) outright, sysbox-runc registers a seccomp notification (SECCOMP_RET_USER_NOTIF, the userspace-supervisor primitive chapter 7 named for gVisor) and inspects and emulates each call as the kernel traps it. A FUSE daemon, sysbox-fs, mounts over the parts of /proc and /sys the kernel does not namespace, so the inner daemon reads a coherent, container-scoped view instead of the host's; these are the same leaks chapter 7's masked-paths section plugs by hand. On-disk ownership is lined up with the namespace's UID range through idmapped mounts on Linux 5.12+, or the shiftfs module on older kernels, which avoids a recursive chown of the rootfs. A second daemon, sysbox-mgr, hands out the mapping ranges and other per-container resources, and the three pieces (sysbox-runc, sysbox-fs, sysbox-mgr) talk over gRPC.

The result is that docker run --runtime=sysbox-runc -d docker:dind runs a full nested daemon with no --privileged, and chapter 7's layers stay wrapped around it.
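The user-namespace mapping behind this can be read from /proc/<pid>/uid_map, where each line gives an inside start, an outside start, and a length. The sketch below resolves an in-namespace UID to the outer one; host_uid_for is a hypothetical helper and the 165536 range is only an example of the kind of subordinate range an allocator might hand out.

```shell
# host_uid_for UID: reads uid_map lines ("inside-start outside-start length")
# on stdin and prints the outer UID that the in-namespace UID maps to.
# Prints nothing and fails if UID is unmapped.
host_uid_for() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + uid - $1; found = 1; exit }
    END { exit !found }'
}

# Example mapping: namespace UIDs 0..65535 map to 165536..231071 outside.
printf '0 165536 65536\n' | host_uid_for 0      # prints 165536
printf '0 165536 65536\n' | host_uid_for 1000   # prints 166536
```

Run as host_uid_for 0 < /proc/self/uid_map inside a container. A container without a user namespace shows the identity map, so root resolves to 0, which is real host root.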

Rootless: Moving The Whole Stack Off Host Root

Rootless mode attacks a different axis. The approaches so far relax or emulate the sandbox so a runtime can work inside it; rootless moves the entire engine somewhere it needs no host privilege to begin with. Sysbox makes a nested daemon safe; rootless makes the outer engine run as an ordinary user with no root anywhere in the picture. The two compose (rootless plus nesting is how CI runs untrusted builds with no host privilege at all), but they are distinct techniques. The pieces of rootless mode all work around the same constraint: an unprivileged user can create a user namespace and become root inside it, but that root still cannot do the privileged things a container engine needs on the host.

Networking is handled in userspace, because an unprivileged process cannot create a veth pair on the host bridge. slirp4netns (or the newer pasta) attaches a TAP device in the container's network namespace to a usermode TCP/IP stack; RootlessKit sets up the namespace and forwards ports, and is what rootless Docker uses by default.
Storage uses native overlay when the kernel allows it. Mounting overlayfs from inside a user namespace landed in Linux 5.11 (Miklos Szeredi's userxattr work, which moves overlay's bookkeeping from trusted.overlay.* to the user.overlay.* xattr namespace and disables redirect_dir/metacopy to close the escalation path). On older kernels the fallback is again fuse-overlayfs.
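The storage rule reduces to a kernel-version threshold, sketched below. This is a simplification under stated assumptions: the function name rootless_storage_for is hypothetical, overlay2 is Docker's driver name (Podman calls it overlay), distribution kernels sometimes backport the feature, and a real engine is better off probing the mount than comparing version strings.

```shell
# rootless_storage_for KERNEL_RELEASE: prints the storage choice the 5.11
# threshold implies for a rootless engine on that kernel.
rootless_storage_for() {
  major=${1%%.*}            # text before the first dot
  rest=${1#*.}              # text after the first dot
  minor=${rest%%[!0-9]*}    # leading digits of the remainder
  if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 11 ]; }; then
    echo overlay2
  else
    echo fuse-overlayfs
  fi
}

rootless_storage_for "$(uname -r)"          # this machine
rootless_storage_for 5.10.0-28-amd64        # prints fuse-overlayfs
rootless_storage_for 5.15.0-91-generic      # prints overlay2
```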

Podman is rootless and daemonless out of the box (appendix A); rootless dockerd is a supported mode you opt into. Either way, a compromise of the engine lands at an unprivileged host UID instead of host root — chapter 7's user-namespace argument, applied to the runtime itself.

You Probably Want To Build, Not Run

The most common reason people reach for DinD is a CI pipeline that runs docker build, and that needs neither a daemon nor most of the privileges the sections above fought for. Building an image is mechanically simpler than running one: execute each RUN step, capture the filesystem changes it made as a layer, move on. There is no long-lived container to hand its own network namespace or cgroup subtree, so a build never trips most of chapter 7's blocks. Several tools exploit that.

BuildKit / buildx is Docker's own build engine and the default builder in current Docker. It can run as an unprivileged, rootless process on the same user-namespace and fuse-overlayfs stack as rootless Docker, which is what lets it build inside an unprivileged CI container.
Kaniko (from Google) takes the extreme position: it runs as an ordinary container, reads the Dockerfile, and executes each instruction in its own root filesystem, snapshotting the diff after each step to emit a layer. Because it never stands up a nested container, it needs no daemon and no --privileged — built for exactly the Kubernetes-CI case where DinD is tempting.
buildah (appendix A) builds OCI images daemonless, mounting the working rootfs and running build steps against it directly, and runs rootless through a user namespace.

Reach for a nested dockerd only when you genuinely need to run containers in the pipeline — integration tests that start real services — not merely to produce an image.

Nested containerd: kind

The containerd-native version of this whole topic is kind (Kubernetes IN Docker). Each Kubernetes "node" is a single container (the kindest/node image) running systemd, which starts the node's own containerd and the kubelet; the kubelet then drives that containerd to run every pod (chapters 10–13). The container boots to a paused state and waits for SIGUSR1 before systemd takes over, which lets kind fix up mounts and preload images first.

This is containerd inside a container in the fullest sense: a nested containerd with its own content store, its own snapshotter, and its own shims (chapters 11 and 12), all running a level down from the host's runtime. It works for the same reason docker:dind works: the node container is given the namespaces, cgroup access, and mounts its inner runtime needs, whether through privilege or through a runtime like Sysbox. When you build an image on the host and the kind cluster cannot see it, this is why: the node's containerd has a content store entirely separate from your host daemon's, so the image has to be copied across with kind load docker-image.

Choosing

Goal	Reach for
Build images in CI	BuildKit/`buildx`, Kaniko, `buildah` — no daemon
Orchestrate sibling containers from inside one	Socket mount (DooD), as a trust decision
A real nested engine, host stays protected	Sysbox runtime
A real nested engine, disposable VM, you accept the risk	`docker:dind` with `--privileged` and a storage volume
Run the whole stack as a non-root user	Rootless Docker or Podman
A local Kubernetes cluster	`kind` (nested containerd)

The reflex chapter 7 closes on holds here too: add privilege back in named, specific increments instead of clearing every layer at once. DooD and --privileged DinD are the wholesale options; Sysbox, rootless mode, and the daemonless builders are how you give back only what the kernel actually forces you to.

Sources And Further Reading

Jérôme Petazzoni, "Using Docker-in-Docker for your CI or testing environment? Think twice.": https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/
Docker, "Docker advances container isolation and workloads with acquisition of Nestybox": https://www.docker.com/blog/docker-advances-container-isolation-and-workloads-with-acquisition-of-nestybox/
Sysbox design notes: https://github.com/nestybox/sysbox/blob/master/docs/user-guide/design.md
Sysbox security notes: https://github.com/nestybox/sysbox/blob/master/docs/user-guide/security.md
containerd issue #3144, overlay2 over overlayfs: https://github.com/containerd/containerd/issues/3144
Kernel overlayfs documentation: https://docs.kernel.org/filesystems/overlayfs.html
"allow unprivileged overlay mounts" (LWN, Linux 5.11 userxattr): https://lwn.net/Articles/839210/
Christian Brauner, "The Seccomp Notifier — New Frontiers in Unprivileged Container Development": https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
slirp4netns: https://github.com/rootless-containers/slirp4netns
Rootless Containers project: https://rootlesscontaine.rs/
Kaniko: https://github.com/GoogleContainerTools/kaniko
kind initial design: https://kind.sigs.k8s.io/docs/design/initial/