Chapter 4: Namespaces

A namespace is a kernel-maintained, reference-counted set of resources that a process sees as the global instance of those resources. Two processes in the same namespace share the same view; two processes in different namespaces of the same type see disjoint views, even though they are running on the same machine and the same kernel. Every task belongs to exactly one namespace of each type (mount, PID, network, ...); the kernel records that membership in the nsproxy struct attached to the task (the user namespace is reached through the task's credentials instead) and exposes it to userspace under /proc/<pid>/ns/.

Containers exist because Linux generalized this idea across eight different kinds of resource and gave userspace a way to compose them. A process inside a container has its own PID namespace (so it sees init as PID 1 and cannot signal anything outside), its own mount namespace (so its /etc/hosts is not the host's), its own network namespace (its own loopback, route table, sockets, iptables rules), and so on. The container appears to be a small Linux system because, from its own viewpoint, it is one. Nothing about the kernel changes; only the views do.

Two things namespaces do not do are worth saying up front. They do not isolate resource consumption — a process in its own PID namespace can still allocate every byte of host memory, which is what cgroups (chapter 5) are for. And they do not enforce policy beyond visibility — a process that can see a resource and has the capabilities to act on it will succeed, which is what capabilities, seccomp, and MAC (chapter 7) compose with namespaces to constrain.

Safety: the commands below mutate kernel state and most need root. Use a disposable Linux VM. Examples here were checked on Ubuntu 24.04 with kernel 6.8 and the util-linux package supplying unshare, nsenter, and lsns.

The Eight Types

Linux ships eight namespace types, each gated by a CLONE_NEW* flag passed to clone(2) or unshare(2). The order they were added matters: a runtime targeting kernel 6.x can use all eight; older kernels miss the later ones.

Type	Flag	Linux	What it isolates
Mount	`CLONE_NEWNS`	2.4.19 (2002)	The mount table
UTS	`CLONE_NEWUTS`	2.6.19 (2006)	Hostname, NIS domain name
IPC	`CLONE_NEWIPC`	2.6.19 (2006)	System V IPC, POSIX message queues
PID	`CLONE_NEWPID`	2.6.24 (2008)	Process ID number space
Network	`CLONE_NEWNET`	2.6.24 (2008)	Net devices, addresses, routes, ports, netfilter
User	`CLONE_NEWUSER`	3.8 (2013)	UID/GID mappings, namespaced privilege
Cgroup	`CLONE_NEWCGROUP`	4.6 (2016)	View of the cgroup root
Time	`CLONE_NEWTIME`	5.6 (2020)	`CLOCK_MONOTONIC` and `CLOCK_BOOTTIME` offsets
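The version column can be turned into a rough feature check. The following is a minimal sketch, not a definitive test: the function names are made up for illustration, the type names follow the entries in /proc/<pid>/ns/, and a distribution kernel may backport or compile out a namespace type regardless of its version number.

```shell
# ns_min_kernel TYPE: minimum kernel version, copied from the table above.
ns_min_kernel() {
  case "$1" in
    mnt)     echo 2.4.19 ;;
    uts|ipc) echo 2.6.19 ;;
    pid|net) echo 2.6.24 ;;
    user)    echo 3.8 ;;
    cgroup)  echo 4.6 ;;
    time)    echo 5.6 ;;
    *)       return 1 ;;
  esac
}

# version_ge A B: succeed when dotted version A >= B, compared numerically
# field by field (so 3.10 counts as newer than 3.8).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# kernel_supports_ns TYPE [RELEASE]: RELEASE defaults to the output of uname -r.
# Everything after the first "-" (the distribution suffix) is ignored.
kernel_supports_ns() {
  min=$(ns_min_kernel "$1") || return 2
  rel=${2:-$(uname -r)}
  version_ge "${rel%%-*}" "$min"
}
```

For example, kernel_supports_ns time && echo "time namespaces available" checks the running kernel, and kernel_supports_ns time 5.4.0-100-generic checks a release string taken from somewhere else.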

CLONE_NEWNS predates the era when "namespaces" became plural — the flag still creates a mount namespace, despite the historical name. The mount namespace was the first one because mount-table virtualization was the original motivation; everything else followed.

User namespaces are the special one. A process can hold root-equivalent capabilities inside a user namespace it created without holding any host-level privilege, which is the basis of every rootless container runtime. They also redefine what "privilege" means everywhere else — capabilities apply only within the user namespace that owns the resource being acted on. The user namespace section below covers the consequences.

Creating, Joining, Leaving

Three syscalls do almost all the work:

clone(2) / clone3(2): create a child process with one or more CLONE_NEW* flags. The child starts in the new namespace(s); the parent stays where it was. A runtime can create the container init this way, or get the same result by forking first and calling unshare(2) in the child, which is the route runc's C shim takes.
unshare(2): move the current process into freshly created namespaces. Same flag set as clone. Useful when you want to change membership without forking.
setns(2): join an existing namespace via a file descriptor. The fd usually comes from /proc/<pid>/ns/<type> or from a bind mount of one of those symlinks.

A namespace exists as long as something holds a reference to it. References come from three places: a process whose nsproxy points at it, an open file descriptor on its /proc/<pid>/ns/<type> symlink, or a bind mount of that symlink elsewhere in the filesystem. When the last reference goes away, the kernel destroys the namespace and reclaims its resources. This is why ip netns add foo bind-mounts the network namespace under /var/run/netns/foo: the bind mount keeps the namespace alive so a CNI plugin can configure it before any container process exists inside.

Two processes share a namespace exactly when their /proc/<pid>/ns/<type> symlinks resolve to the same inode. The inode number is the namespace's identity.
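That comparison is easy to script. A minimal sketch follows; the helper names are hypothetical, and reading another user's /proc/<pid>/ns/ entries needs ptrace-level access to that process, so expect failures without root.

```shell
# ns_id PID TYPE: print the namespace identity, e.g. "pid:[4026531836]".
ns_id() {
  readlink "/proc/$1/ns/$2"
}

# ns_inode LINKTEXT: extract the inode number from "type:[inode]".
# Prints nothing when the text does not have that shape.
ns_inode() {
  printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# same_ns PID1 PID2 TYPE: succeed when both processes are in the same
# namespace of the given type.
same_ns() {
  a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}
```

A call such as same_ns 1 "$$" net && echo "shares the host network namespace" answers the question for the current shell, provided PID 1's entries are readable.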

What Lives In `/proc/<pid>/ns/`

ls -l /proc/self/ns/
# lrwxrwxrwx 1 me me 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 me me 0 ... ipc    -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 me me 0 ... mnt    -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 me me 0 ... net    -> 'net:[4026531992]'
# lrwxrwxrwx 1 me me 0 ... pid    -> 'pid:[4026531836]'
# lrwxrwxrwx 1 me me 0 ... pid_for_children -> 'pid:[4026531836]'
# lrwxrwxrwx 1 me me 0 ... time   -> 'time:[4026531834]'
# lrwxrwxrwx 1 me me 0 ... time_for_children -> 'time:[4026531834]'
# lrwxrwxrwx 1 me me 0 ... user   -> 'user:[4026531837]'
# lrwxrwxrwx 1 me me 0 ... uts    -> 'uts:[4026531838]'

The _for_children entries apply to processes the current process spawns. After unshare(2), a PID or time namespace change applies to children created afterwards, not to the caller itself, so the kernel exposes both the current value and the value newly forked children will inherit. This is why unshare --pid is almost always paired with --fork: the calling process cannot move into the new PID namespace itself, so it must fork a child that becomes PID 1.

The same information in tabular form:

lsns

lsns walks /proc/*/ns/ and groups processes by namespace; it is the easiest way to figure out which container is using which namespace in a debug session.
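The grouping step can be approximated in a few lines of shell. This is only a sketch of the idea (the function name is made up, and the real lsns reports more, such as a representative PID and command); entries that cannot be read without privilege are skipped silently.

```shell
# ns_census TYPE: print "<count> <namespace-id>" for every namespace of TYPE
# that is visible through /proc, most populated first.
ns_census() {
  for link in /proc/[0-9]*/ns/"$1"; do
    # Processes may exit mid-scan or be unreadable; ignore those.
    readlink "$link" 2>/dev/null
  done | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

Running sudo sh -c with this function and ns_census net on a container host should list one line per network namespace, with the host's namespace usually at the top.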

Creating Namespaces With `unshare`

unshare(1) is the user-space wrapper for unshare(2). It creates new namespaces and execs a command inside them.

A UTS namespace needs CAP_SYS_ADMIN to create directly:

sudo unshare --uts -- bash -c 'hostname inside; hostname'
# inside
hostname
# (unchanged on host)

To get the same behavior unprivileged, wrap it in a user namespace — the one namespace that does not require pre-existing privilege to create:

unshare --user --map-root-user --uts -- bash -c 'hostname inside; hostname'
# inside
hostname
# (unchanged on host)

--map-root-user writes a UID/GID mapping that maps the caller to UID 0 inside the new user namespace. The shell believes it is root; from the host it is the same unprivileged process.

PID Namespace And The PID 1 Problem

A new PID namespace makes the first process inside it PID 1.

sudo unshare --pid --fork --mount-proc -- bash -c 'echo "pid: $$"; ps -ef'
# pid: 1
# UID  PID  PPID ... CMD
#   0    1     0 ... bash -c echo "pid: $$"; ps -ef
#   0    2     1 ... ps -ef

--fork is needed because, as noted earlier, only a child of the unsharing process can enter the new PID namespace. --mount-proc implies --mount and mounts a fresh /proc in the new mount namespace, so ps reflects the new PID space instead of the host's.

PID namespaces are hierarchical: every process in a child namespace also has a PID in the parent and in every ancestor up to the root. The host can see and signal container processes by their host PIDs; the container cannot see host processes at all. PIDs are translated when crossing the boundary — the same task is, say, host-PID 24891 and container-PID 1.

PID 1 is special, and the special treatment is the reason container processes routinely fail to exit on docker stop or kubectl delete pod. Two kernel rules apply only to PID 1 in its namespace:

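Default signal actions do not apply. A signal for which PID 1 has not installed a handler is silently dropped, whether it is sent from inside the namespace or from an ancestor namespace. SIGKILL and SIGSTOP are the exception only when sent from an ancestor PID namespace, where they are delivered forcibly; sent from inside the namespace, even they are ignored.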
Orphaned descendants get reparented to PID 1. When PID 1 fails to wait(2) on them, they become zombies that pile up indefinitely.

To see the failure:

sudo unshare --pid --fork --mount-proc -- bash -c '
  echo "pid 1 = $$"
  while :; do sleep 1; done
'
# In another terminal:
# kill -TERM <pid-of-the-bash-process-on-host>
# bash continues running

The kernel will not deliver a SIGTERM whose disposition is still the default. Container runtimes work around this by injecting a tiny init like tini or dumb-init that installs handlers, forwards signals to its children, and reaps zombies on their behalf. Docker has had --init for this since 1.13. Kubernetes has no equivalent flag, so a pod whose entrypoint neither handles signals nor reaps children usually needs an init in the image, or a sidecar approach.
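To see why an installed handler changes the outcome, here is the loop from the failing example with an explicit SIGTERM trap added. It is a minimal sketch with a made-up function name, not a substitute for tini: it handles one signal and does not forward signals or reap orphans.

```shell
# pid1_loop: the loop from the failing example, plus a SIGTERM handler.
pid1_loop() {
  trap 'echo "caught SIGTERM, exiting"; exit 0' TERM
  while :; do
    # Sleep in the background and wait on it, so the trap runs as soon as
    # the signal arrives instead of after the sleep finishes.
    sleep 1 &
    wait $!
  done
}
```

Pasting the trap line into the bash -c script of the failing example should make kill -TERM from the host terminate it, because PID 1 now has a handler for that signal.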

Mount Namespaces And Propagation

A mount namespace governs which mounts a process sees. To watch a mount appear and disappear:

sudo unshare --mount -- bash
# Inside the new namespace:
mount -t tmpfs tmpfs /mnt
mount | grep /mnt
# tmpfs on /mnt type tmpfs (rw,relatime)
exit
# Back on the host:
mount | grep /mnt
# (nothing)

The host never saw the mount, because the mount namespace's mount table is private. A mount namespace governs mounts, not inodes: bind-mount the host's /etc/passwd into a container's mount namespace and the container reads the host's file regardless of its mount view. The boundary is the mount table, not the data.

The catch is propagation. Each mount in a mount namespace has one of four propagation types:

shared (MS_SHARED) — events propagate to peers in the same propagation group.
private (MS_PRIVATE) — no propagation.
slave (MS_SLAVE) — receives events from a master, does not send.
unbindable (MS_UNBINDABLE) — like private, plus cannot be the source of a bind mount.
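The kernel reports these types in the optional fields of /proc/<pid>/mountinfo, which is where findmnt gets its PROPAGATION column. The sketch below classifies a single mountinfo line; the function name is hypothetical, and tags other than shared:N, master:N, and unbindable are ignored.

```shell
# mount_propagation LINE: classify one /proc/<pid>/mountinfo line.
# Optional fields sit between field 6 (mount options) and the "-" separator:
#   shared:N   -> member of peer group N
#   master:N   -> receives events from peer group N (a slave mount)
#   unbindable -> cannot be the source of a bind mount
# A line with none of these tags describes a private mount.
mount_propagation() {
  printf '%s\n' "$1" | awk '{
    out = ""
    for (i = 7; i <= NF && $i != "-"; i++) {
      tag = ""
      if ($i ~ /^shared:/) tag = "shared"
      else if ($i ~ /^master:/) tag = "slave"
      else if ($i == "unbindable") tag = "unbindable"
      if (tag != "") {
        if (out == "") out = tag
        else out = out "," tag
      }
    }
    if (out == "") out = "private"
    print out
  }'
}
```

For instance, mount_propagation "$(head -n 1 /proc/self/mountinfo)" classifies the first mount in the current namespace. A mount can be both shared and slave at once, which the function prints as shared,slave.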

Many distributions ship / as shared. A new mount namespace starts as a copy of its parent's mount table, propagation types included, so a container's mount events would propagate back to the host's peer group. By default, unshare --mount recursively marks everything in the new namespace private (--propagation private) to prevent that. To check how the host's root is marked:

findmnt -o TARGET,PROPAGATION /
# TARGET PROPAGATION
# /      shared

When runc sets up a container, it first changes the propagation of the new mount namespace's root (recursively slave by default, or whatever linux.rootfsPropagation requests) so that container mounts do not leak back to the host, then mounts the rootfs and special filesystems before pivot_root(2) swaps it in. Chapter 6 walks through that sequence. Kubernetes' mountPropagation field maps directly: HostToContainer is rslave, Bidirectional is rshared, and the default (None) is rprivate.
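The Kubernetes mapping in the previous paragraph amounts to a three-entry lookup. As a small sketch (the function name is made up; the values are the ones stated above):

```shell
# k8s_to_propagation VALUE: translate a Kubernetes mountPropagation value
# into the recursive propagation type it corresponds to.
# An empty or missing argument is treated as the default, None.
k8s_to_propagation() {
  case "${1:-None}" in
    None)            echo rprivate ;;
    HostToContainer) echo rslave ;;
    Bidirectional)   echo rshared ;;
    *) echo "unknown mountPropagation: $1" >&2; return 1 ;;
  esac
}
```

The output names double as mount(8) options, so mount --make-"$(k8s_to_propagation HostToContainer)" /some/mountpoint would apply the equivalent setting by hand (the path here is a placeholder).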

Network Namespaces

Each network namespace is a complete, independent network stack: its own loopback (down by default), its own list of network devices, its own routing table, its own neighbor table, its own netfilter rules, its own sockets, and its own per-namespace sysctls (most of net.* is namespaced in modern kernels). Two namespaces cannot accidentally share a port because they do not share a port table.

ip netns is the convenient subcommand. Unlike unshare, ip netns add keeps the namespace alive after the calling process exits by bind-mounting it to /var/run/netns/<name>:

sudo ip netns add demo
sudo ip netns list
# demo
ls /var/run/netns/
# demo

sudo ip netns exec demo ip link
# 1: lo: <LOOPBACK> mtu 65536 state DOWN ...
sudo ip netns exec demo ip link set lo up
sudo ip netns exec demo ping -c1 127.0.0.1

When a network device is moved into a namespace it loses its address configuration; addresses, routes, and firewall state must be reapplied inside the new namespace.

Connecting two namespaces requires a virtual link. The classic recipe is a veth pair — a kernel-provided cable with two ends, where one end goes into each namespace:

sudo ip netns add a
sudo ip netns add b

sudo ip link add va type veth peer name vb
sudo ip link set va netns a
sudo ip link set vb netns b

sudo ip -n a link set lo up
sudo ip -n b link set lo up
sudo ip -n a link set va up
sudo ip -n b link set vb up

sudo ip -n a addr add 10.10.0.1/24 dev va
sudo ip -n b addr add 10.10.0.2/24 dev vb

sudo ip netns exec a ping -c1 10.10.0.2

This is one bridge away from the standard container networking pattern: replace vb going into namespace b with vb attached to a host bridge, and add NAT rules. Part 6 covers the bridge-based and CNI flows in detail.

Joining An Existing Namespace With `nsenter`

nsenter(1) calls setns(2) on one or more target namespaces and execs a command. To run a command inside a process's mount and PID namespaces:

pgrep -f some-container-process
sudo nsenter -t <pid> -m -p -- ls /proc

kubectl exec, crictl exec, and docker exec all reduce to a setns chain plus an exec. The chain has to run in a specific order because some transitions invalidate state that later steps depend on. Notably, joining a mount namespace changes what /proc/<pid>/ns/<type> paths resolve to, so the descriptors for every target namespace must be opened before the mount namespace switch.

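A subtle gotcha: setns(2) is restricted in multithreaded processes. Joining a user namespace fails outright, and joining a mount namespace fails while the caller shares filesystem attributes (CLONE_FS) with other threads. The Go runtime is multithreaded before main starts and moves goroutines between threads, so Go programs (containerd, runc) cannot safely call it from arbitrary goroutines. runc works around this by doing all its namespace setup in a small C shim (libcontainer/nsenter/nsexec.c) that runs before the Go runtime starts.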

User Namespaces And Privilege Scoping

User namespaces are the conceptual hinge of the chapter. Every non-user namespace is owned by the user namespace in which it was created, and capabilities held by a process apply only within the user namespace that owns the resource being acted on. That is what makes namespaces composable: an unprivileged user can create a user namespace, gain CAP_SYS_ADMIN inside it, and then create mount or network namespaces from inside that user namespace, because those new namespaces are owned by the user namespace where the user has CAP_SYS_ADMIN.

To create one and observe the mapping:

unshare --user --map-root-user -- bash -c '
  echo "id inside: $(id)"
  cat /proc/self/uid_map
'
# id inside: uid=0(root) gid=0(root) groups=0(root)
# 0          1000          1

The mapping reads as <inside-uid> <outside-uid> <length>. UID 0 inside maps to UID 1000 outside, for one UID. The shell believes it is root inside; from the host it is the same unprivileged user.
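The translation a mapping describes is plain range arithmetic. A minimal sketch, with a hypothetical function name, that reads uid_map-style lines and converts an inside UID to the outside UID:

```shell
# map_uid_out INSIDE_UID: read "<inside-uid> <outside-uid> <length>" lines on
# stdin and print the outside UID that INSIDE_UID corresponds to.
# Exits with status 1 when no line covers the UID, i.e. the UID is unmapped.
map_uid_out() {
  awk -v id="$1" '
    id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
    END { exit found ? 0 : 1 }
  '
}
```

Fed the single-line map above, map_uid_out 0 < /proc/self/uid_map prints 1000 inside the new namespace, while any other UID is reported as unmapped.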

A user namespace gives root-equivalent capabilities only over resources owned by that user namespace. Try to do something that requires actual host root:

unshare --user --map-root-user -- bash -c '
  mount -t tmpfs tmpfs /mnt
'
# mount: /mnt: permission denied. (only privileged user can mount)

The mount fails because the host's mount namespace is owned by the initial user namespace, where the calling user is just UID 1000. The same CAP_SYS_ADMIN, scoped to a freshly-created mount namespace owned by the new user namespace, is enough to mount things there:

unshare --user --map-root-user -- bash -c '
  unshare --mount -- bash -c "mount -t tmpfs tmpfs /mnt && echo mounted"
'
# mounted

This distinction — capabilities-scoped-to-the-owning-user-namespace — is the source of most "I have root in the container, why doesn't this work?" confusion. The capability is real; it just does not apply to host-owned objects.
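One way to confirm that the capability is real is to look at the effective capability mask the kernel reports in /proc/<pid>/status. The helpers below are an illustrative sketch (their names are made up); the bit number 21 for CAP_SYS_ADMIN comes from <linux/capability.h>.

```shell
# has_cap HEXMASK BIT: succeed when capability number BIT is set in HEXMASK,
# where HEXMASK is a value such as the CapEff field of /proc/<pid>/status.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# cap_eff [PID]: print the effective capability mask of PID (default: self).
cap_eff() {
  awk '$1 == "CapEff:" { print $2 }' "/proc/${1:-self}/status"
}

CAP_SYS_ADMIN=21   # bit number from <linux/capability.h>
```

As an unprivileged user, has_cap "$(cap_eff)" "$CAP_SYS_ADMIN" fails on the host but should succeed in a shell started with unshare --user --map-root-user. The bit is set in both the failing and the succeeding mount examples above; what differs is which user namespace owns the mount namespace being modified.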

For unprivileged users to map UIDs other than their own, they need newuidmap(1) and newgidmap(1) (setuid helpers from shadow-utils) and entries in /etc/subuid and /etc/subgid:

grep "^$(whoami):" /etc/subuid /etc/subgid
# /etc/subuid:me:100000:65536
# /etc/subgid:me:100000:65536

These say: the user me may map host UIDs (and, from /etc/subgid, GIDs) 100000 through 165535 inside any user namespace they own. This is what gives a rootless container a 64K-ID range to allocate inside. A process that writes gid_map directly, without the privileged helper, must first write deny to /proc/<pid>/setgroups; this kernel-side check was added after a 2014 CVE to prevent privilege escalation via supplementary group manipulation.
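The range arithmetic is start plus count minus one. A small sketch that prints the first and last ID of each range granted to a user (the function name is hypothetical; the file format is name:start:count, as shown above):

```shell
# subid_range USER [FILE]: print "<first> <last>" for each range granted to
# USER. FILE defaults to /etc/subuid; pass /etc/subgid for group IDs.
subid_range() {
  awk -F: -v u="$1" '$1 == u { print $2, $2 + $3 - 1 }' "${2:-/etc/subuid}"
}
```

With the entries above, subid_range me prints 100000 165535.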

Time Namespaces

Time namespaces are the youngest of the eight (Linux 5.6, March 2020) and the most limited. They offset only CLOCK_MONOTONIC and CLOCK_BOOTTIME. CLOCK_REALTIME is shared with the host and cannot be changed per-namespace, which means containers cannot lie about the wall-clock time — only about how long they have been running.

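# Host uptime, for comparison.
cat /proc/uptime

# Offsets can only be written before any process has entered the namespace,
# so let unshare(1) write /proc/self/timens_offsets itself, between
# unshare(2) and the fork (--boottime and --monotonic need a recent util-linux).
# Line format in that file: <clock-id> <secs> <nanosecs>,
# with clock-id monotonic (1) or boottime (7).
# /proc/uptime follows CLOCK_BOOTTIME, so shift that clock by one day.
sudo unshare --time --fork --boottime 86400 --monotonic 86400 -- cat /proc/uptime
# The first field should read about 86400 seconds more than on the host.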

timens_offsets can be written only until the first process has been created in, or has entered, the namespace, which is why runtimes write it in the brief window right after unshare(CLONE_NEWTIME), before any child exists. The main use case is checkpoint/restore (CRIU): a restored process needs CLOCK_MONOTONIC to appear continuous across the freeze, even though wall time has advanced.
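Reading the file back is a convenient way to confirm which offsets a namespace ended up with. The sketch below parses timens_offsets-style lines; the function name is made up, and it assumes the kernel prints the clock names monotonic and boottime in the first column, as time_namespaces(7) shows.

```shell
# timens_offset CLOCK: read "<clock-id> <secs> <nanosecs>" lines on stdin and
# print the seconds offset for CLOCK ("monotonic" or "boottime"; the numeric
# ids 1 and 7 are accepted as aliases). Exits 1 if the clock is not listed.
timens_offset() {
  case "$1" in
    1) want=monotonic ;;
    7) want=boottime ;;
    *) want=$1 ;;
  esac
  awk -v c="$want" '$1 == c { print $2; found = 1 } END { exit found ? 0 : 1 }'
}
```

Inside a time namespace, timens_offset boottime < /proc/self/timens_offsets prints the boot-time shift in seconds; in the initial namespace it prints 0.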

Putting It Together: A Hand-Rolled Container

The same assembly without runc: a shell with its own user, PID, mount, UTS, IPC, and net namespaces, and an Alpine rootfs as /.

# Get a small rootfs.
mkdir alpine-rootfs
docker export $(docker create alpine:3.20) | tar -C alpine-rootfs -xf -

# Run a shell in fresh namespaces with that as /.
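# Mount the new /proc inside the rootfs: the default location (the host path
# /proc) would no longer be visible after chroot, and ps would show nothing.
sudo unshare \
  --user --map-root-user \
  --pid --fork --mount-proc="$PWD/alpine-rootfs/proc" \
  --mount --uts --ipc --net \
  -- chroot alpine-rootfs /bin/sh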

# Inside:
# / # hostname
# (the host's name -- a new UTS namespace starts with a copy; changes stay inside)
# / # ps
# PID   USER     TIME  COMMAND
#     1 root      0:00 /bin/sh
#     2 root      0:00 ps
# / # ls /
# bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var

This is intentionally crude. There is no cgroup, no seccomp, no capability dropping, no pivot_root (just chroot, which is escapable), no IO or signal supervision, and no networking inside the new netns beyond a downed loopback. What it does demonstrate is the dependency order among namespaces: the user namespace comes first (the kernel guarantees this when CLONE_NEWUSER is combined with other flags, so the rest are owned by it and could be created unprivileged), the PID namespace needs --fork (so that a child, not the calling process, becomes PID 1), and the rest follow. unshare(1) passes all the flags in a single unshare(2) call. runc's C shim works along the same lines, although it unshares the user namespace in a step of its own so that the ID mappings are in place before the remaining namespaces are created.

OCI Mapping

The runtime spec's linux.namespaces array names which namespaces to enter or create. Each entry is {type, path}; if path is set, the runtime calls setns(2) on it instead of creating a new one.

This is how Kubernetes shares a pod's network namespace across containers: every container in the pod has its config's network entry pointing at the same /proc/<pid>/ns/net path. The pod sandbox creates the namespace; the application containers join it. The pod sandbox container (the "pause container") exists primarily to hold that namespace open: as long as the pause process is alive it remains a member, so the namespace survives the death of any individual application container.
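The create-versus-join distinction is visible in the entries themselves. The sketch below prints a linux.namespaces entry for a given namespace, with or without a path; the helper names are made up, and the type strings (mount, network, and so on) are the ones the runtime spec uses, which differ from the short names under /proc/<pid>/ns/.

```shell
# oci_ns_type NAME: map a /proc/<pid>/ns/ entry name to the OCI type string.
oci_ns_type() {
  case "$1" in
    mnt) echo mount ;;
    net) echo network ;;
    pid|ipc|uts|user|cgroup|time) echo "$1" ;;
    *) return 1 ;;
  esac
}

# oci_ns_entry NAME [PID]: print one linux.namespaces entry. With PID the
# entry carries a path, so the runtime joins that process's namespace;
# without it the runtime creates a new namespace of that type.
oci_ns_entry() {
  oci_type=$(oci_ns_type "$1") || return 1
  if [ -n "${2:-}" ]; then
    printf '{"type": "%s", "path": "/proc/%s/ns/%s"}\n' "$oci_type" "$2" "$1"
  else
    printf '{"type": "%s"}\n' "$oci_type"
  fi
}
```

For a pod whose sandbox process has PID 4242 (a placeholder), oci_ns_entry net 4242 produces the entry each application container would carry, while oci_ns_entry mnt produces an entry that asks for a fresh mount namespace.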

linux.uidMappings and linux.gidMappings carry the user-namespace ID mappings. linux.timeOffsets carries the time-namespace clock offsets.

Where This Goes

Namespaces by themselves do not constrain CPU or memory and do not prevent a process from forking until the host runs out of process slots. The next chapter — cgroups v2 — covers the resource side.

Sources And Further Reading

namespaces(7): https://man7.org/linux/man-pages/man7/namespaces.7.html
unshare(1): https://man7.org/linux/man-pages/man1/unshare.1.html
nsenter(1): https://man7.org/linux/man-pages/man1/nsenter.1.html
user_namespaces(7): https://man7.org/linux/man-pages/man7/user_namespaces.7.html
pid_namespaces(7): https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
mount_namespaces(7): https://man7.org/linux/man-pages/man7/mount_namespaces.7.html
network_namespaces(7): https://man7.org/linux/man-pages/man7/network_namespaces.7.html
time_namespaces(7): https://man7.org/linux/man-pages/man7/time_namespaces.7.html
cgroup_namespaces(7): https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html
Linux shared subtree docs: https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
OCI Runtime Spec, namespaces: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#namespaces
runc nsenter C source: https://github.com/opencontainers/runc/blob/main/libcontainer/nsenter/nsexec.c