Chapter 5: Cgroups v2

Namespaces decide what a process can see. Cgroups (control groups) decide what it can use. The two pair up because isolation alone is not enough: a container with its own PID, mount, and network namespaces can still allocate every byte of the host's memory, spin every core to 100%, and fork until the kernel runs out of process slots. Cgroups close that gap by attaching kernel resource controllers to a tree of process groups, so every limit and accounting decision applies to a defined subset of the system instead of to the host as a whole.

The production failure mode this exists to prevent is concrete. One pod's Java heap doubles overnight after a config change, the JVM walks the host into swap, every other pod on the node misses its liveness probe, the kubelet declares the node unhealthy, and the workload reschedules onto a peer that promptly suffers the same fate. Without a per-cgroup memory.max, "noisy neighbor" is "all neighbors die." The rest of this chapter is about how the kernel actually accounts for and enforces those limits, why the interface looks the way it does, and what every file under /sys/fs/cgroup is for.

Safety: writing to /sys/fs/cgroup requires root. The example that triggers an OOM kill is harmless on a VM but will cause a real OOM event in the kernel log. Use a disposable Linux VM. Examples were checked on Ubuntu 24.04 with kernel 6.8 and systemd 255 in cgroup v2 unified mode.

v1 And v2: What Changed And Why

Cgroups began as "process containers," a 2006–2007 patch series from Paul Menage and Rohit Seth at Google, renamed at merge to avoid clashing with the Solaris term. The v1 patch landed in Linux 2.6.24 in January 2008. Its design choice was multiple hierarchies: every resource controller (cpu, memory, io, pids, ...) got its own directory tree, and a process could sit at a different position in each tree. You could place process X in a CPU cgroup that capped it at one core while simultaneously placing it in a memory cgroup that gave it 32 GB. Flexible, in principle.

In practice the multi-hierarchy model produced three persistent problems. Incoherent attribution was the worst of them. The page cache and writeback paths need memory and I/O accounting to agree about which group an allocation belongs to: a write that dirties a page is a memory event when the page is allocated and an I/O event when the page is written back, possibly minutes later by an unrelated kernel thread. In v1, a process could be in a memory cgroup of one shape and an io (then blkio) cgroup of a completely different shape. The kernel charged the dirty page to one group and the writeback I/O to another. Throttling either side did not throttle the workload that actually caused the load, and limits could be over- or under-counted. The kernel docs eventually documented this as "v1 cannot do meaningful page-cache writeback accounting," which is a kind way to put it.

Delegation was unsafe. Handing a subtree to an unprivileged manager required reasoning about every controller hierarchy independently, and v1 had no kernel-side notion of what a delegated manager was allowed to do. Single-writer collisions with systemd were the third: once systemd took over the cgroup tree on most distros, two managers writing to the same v1 controllers raced, and the cgroup core had no notion of ownership.

Cgroup v2 merged in Linux 4.5 (March 2016) and became the default unified mode in Fedora 31 (2019), Debian 11 (2021), Ubuntu 21.10 (2021), and RHEL 9 (2022). Kubernetes flipped the v2 default in 1.25 (2022). The redesign is built around three principles. Single hierarchy: every cgroup is a node in one tree, and a process belongs to exactly one cgroup. Top-down enablement: controllers are not "in" a cgroup by default; the parent enables them for children by writing into its cgroup.subtree_control. Coherent accounting: because every controller sees the same hierarchy, the page-cache problem disappears — memory and I/O charges land on the same node. The model trades v1's per-controller flexibility for coherence, and the flexibility was, in practice, almost never useful and almost always confusing.

The rest of this chapter assumes v2. A few v1 oddities (separate devices.allow files, the freezer subsystem, net_cls/net_prio) are gone or replaced; the substitutes are noted as they come up.

Why A Pseudo-Filesystem

The cgroup API could have been a syscall. Instead the kernel exposes the entire interface as a virtual filesystem mounted at /sys/fs/cgroup: every cgroup is a directory, every knob is a file, and configuration happens with mkdir, echo, and cat. The choice is deliberate and pays off in five places.

A directory tree is the natural shape for a hierarchy of process groups, so the API matches the underlying object. UNIX file permissions already exist for restricting who can read or write what, which means delegation comes for free: chown a subtree to an unprivileged user and they own everything below it without any new access-control code in the kernel. The interface is introspectable from any language — no libc binding, no syscall numbers, just read(2) and write(2) against paths. Each kernel-side write is atomic per line, so scripts can edit a knob without partial-state races. And cgroup.events is inotify-watchable, which lets userspace react to a cgroup becoming empty without polling.

The cost is a slightly unusual UX: configuring kernel resource policy with shell redirects feels like a hack the first time. But everything in this chapter is mkdir, echo, and cat, and that is the point.

The Charging Model

Limits and accounting need a verb. The kernel calls it charging. When code on behalf of a process allocates a tracked resource — a page of memory, a process slot, a millisecond of CPU — the kernel walks from the calling task to the cgroup it belongs to (task->cgroups, a per-task pointer) and increments a counter on that cgroup. When the resource is released, the counter decrements. Limits are enforced on the increment: if the new value would exceed the cgroup's *.max, the operation either fails (ENOMEM, EAGAIN) or triggers reclaim, depending on the controller.

Charging is hierarchical. Every charge against a leaf is also charged against every ancestor up to the root. That is what cgroup.subtree_control and the unified hierarchy buy: parents see the sum of their children's usage, and a parent limit constrains the total of everything beneath it. A pod-level memory.max of 4 GB is enforced even if the pod's containers individually have no limit, because every page they allocate is charged to the pod cgroup as well.

A few concrete examples make the abstraction tangible. A page faulted in for a process's heap increments memory.current on its cgroup and on every ancestor. A successful fork(2) increments pids.current the same way. CPU time consumed by any of its tasks accrues to its cpu.stat, and a block I/O submitted on its behalf is counted in io.stat. Each charge lands on the leaf and propagates up.

Two consequences fall out of the charging model that are worth holding in mind. First, a process's cgroup is sticky to its work, not just to its identity: a kernel thread doing writeback on behalf of a user process is charged to the user's cgroup, because the BIO carries the original task's cgroup pointer. Second, moving a process between cgroups does not retroactively re-charge anything. Already-allocated pages stay charged to the original cgroup until they are freed. Restarting a heavy process is sometimes the only way to re-attribute its memory.
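The bookkeeping described above can be sketched as a toy model. This is not kernel code (the kernel uses page_counter and per-cpu caches); it only shows the shape of the walk: a charge checks every ancestor's limit, then lands on every ancestor, so a parent limit constrains the sum of its subtree.

```python
class Cgroup:
    """Toy cgroup: a usage counter, an optional limit, a parent pointer."""

    def __init__(self, name, parent=None, limit=None):
        self.name, self.parent, self.limit = name, parent, limit
        self.usage = 0

    def ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def try_charge(self, amount):
        # Check first: a charge either lands everywhere or nowhere.
        for node in self.ancestors():
            if node.limit is not None and node.usage + amount > node.limit:
                return False  # the kernel would reclaim or fail (ENOMEM) here
        for node in self.ancestors():
            node.usage += amount
        return True

    def uncharge(self, amount):
        for node in self.ancestors():
            node.usage -= amount


root = Cgroup("/")
pod = Cgroup("pod", parent=root, limit=4096)  # pod-level memory.max
app = Cgroup("app", parent=pod)               # container: no limit of its own

assert app.try_charge(3000)       # charged to app, pod, and root
assert not app.try_charge(2000)   # pod's 4096 limit constrains the child
assert pod.usage == 3000 and root.usage == 3000
```

The second assert is the pod-limit example from the text: the container has no memory.max of its own, but the charge walk hits the pod's limit anyway.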

Confirm v2 Is Active

The unified hierarchy is mounted as cgroup2 at /sys/fs/cgroup:

mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

If you also see cgroup mounts at /sys/fs/cgroup/<controller>/, the system is in legacy v1 or hybrid mode and the rest of this chapter does not apply directly. To force unified mode, set systemd.unified_cgroup_hierarchy=1 on the kernel command line and reboot.

The two mount options worth knowing about are nsdelegate (treats a cgroup namespace boundary as a delegation point — the namespace's root cgroup is the limit of what processes inside the namespace can climb above) and memory_recursiveprot (makes memory.low and memory.min apply recursively down the subtree, the version of these knobs everyone actually wants).
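A script can make the same check `mount | grep cgroup2` makes by parsing the mount table. A minimal sketch; the sample line matches the output shown above, and on a live system the same function runs against /proc/mounts:

```python
def find_cgroup2(mounts_text):
    """Return (mountpoint, options) for the cgroup2 mount, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mountpoint, fstype, options, ...
        if len(fields) >= 4 and fields[2] == "cgroup2":
            return fields[1], fields[3].split(",")
    return None

sample = ("cgroup2 /sys/fs/cgroup cgroup2 "
          "rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0")

mountpoint, opts = find_cgroup2(sample)
assert mountpoint == "/sys/fs/cgroup"
assert "nsdelegate" in opts and "memory_recursiveprot" in opts

# Live check: find_cgroup2(open("/proc/mounts").read())
```

If the function returns None, the host is in legacy v1 or hybrid mode.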

Walk The Tree

Every directory under /sys/fs/cgroup is a cgroup. The root is /sys/fs/cgroup itself.

ls /sys/fs/cgroup/ | head
# cgroup.controllers
# cgroup.max.depth
# cgroup.max.descendants
# cgroup.procs
# cgroup.subtree_control
# cgroup.threads
# cpu.pressure
# cpu.stat
# init.scope
# io.pressure
# ...
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory hugetlb pids rdma misc

cgroup.controllers lists what is available to this cgroup; cgroup.subtree_control lists what is enabled for children:

cat /sys/fs/cgroup/cgroup.subtree_control
# cpuset cpu io memory pids

A controller has to be enabled by the parent before child cgroups can use it. That is the top-down constraint: parents enable, children inherit, and a non-root cgroup with processes in it cannot enable controllers in its own subtree_control (more on that below). To see how systemd has shaped the tree:

systemd-cgls --no-pager | head -30

The standard layout: system.slice/ for system services, user.slice/user-<uid>.slice/ for user sessions, and (when present) kubepods.slice/ or machine.slice/ for orchestrated workloads.

Make A Cgroup By Hand

A new cgroup is just a mkdir. The kernel synthesizes a standard set of files inside automatically:

sudo mkdir /sys/fs/cgroup/demo
ls /sys/fs/cgroup/demo/
# cgroup.controllers      cgroup.events           cgroup.freeze    cgroup.kill
# cgroup.max.depth        cgroup.max.descendants  cgroup.procs     cgroup.stat
# cgroup.subtree_control  cgroup.threads          cgroup.type      cpu.pressure
# cpu.stat                io.pressure             memory.pressure

Every cgroup, regardless of which controllers are enabled, gets the cgroup.* files. Each one has a job: cgroup.procs lists (and, on write, moves) member processes; cgroup.threads does the same per thread; cgroup.controllers shows which controllers are available here; cgroup.subtree_control enables controllers for children; cgroup.events reports the populated and frozen flags and is inotify-watchable; cgroup.freeze and cgroup.kill suspend or SIGKILL the whole subtree; cgroup.type distinguishes domain from threaded cgroups; cgroup.stat counts live and dying descendants; cgroup.max.depth and cgroup.max.descendants cap how deep and how wide the subtree may grow.

Four additional files appear unconditionally because the kernel always tracks them, even with no controller enabled: cpu.stat (CPU accounting in microseconds), and cpu.pressure, memory.pressure, io.pressure — the PSI files described later in the chapter.

Controller-specific files (like memory.max, cpu.weight, pids.max) appear only after the parent enables that controller in cgroup.subtree_control. To enable more for demo, write the diff to the parent — in this case /sys/fs/cgroup itself:

echo "+memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
ls /sys/fs/cgroup/demo/ | grep -E '^memory\.|^pids\.' | head
# memory.current
# memory.events
# memory.high
# memory.low
# memory.max
# memory.peak
# memory.stat
# memory.swap.current
# pids.current
# pids.events
# pids.max

Enabling a controller for children does not enable it for the cgroup itself — that is what "top-down" means in practice. The kernel docs phrase it as: a controller is propagated into a cgroup's resource files by the parent's subtree_control, and is propagated out of that cgroup to its descendants by its own subtree_control. A cgroup with no children has no reason to enable anything in its subtree_control.
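The propagation rule reduces to a set intersection, which a few lines make concrete (a sketch of the rule, not the kernel's implementation):

```python
def available_controllers(parent_subtree_control, parent_controllers):
    """A cgroup's cgroup.controllers = what the parent enabled, limited to
    what the parent itself had available."""
    return parent_subtree_control & parent_controllers

root_controllers = {"cpuset", "cpu", "io", "memory", "hugetlb", "pids"}
root_subtree = {"cpu", "memory", "pids"}          # root's cgroup.subtree_control

demo = available_controllers(root_subtree, root_controllers)
assert demo == {"cpu", "memory", "pids"}          # demo gets cpu.*, memory.*, pids.* files

# demo has enabled nothing for its own children yet, so a child gets none:
assert available_controllers(set(), demo) == set()
```

A controller absent at any level stays absent below it: there is no way for a grandchild to get memory.* files if an ancestor never enabled +memory.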

Move A Process In

Put a process into a cgroup by writing its PID to cgroup.procs:

sudo sh -c 'sleep 600 & echo $! > /sys/fs/cgroup/demo/cgroup.procs; wait'
# In another terminal:
cat /sys/fs/cgroup/demo/cgroup.procs
# 12345

A process can only be in one cgroup at a time. Writing its PID to a different cgroup's cgroup.procs moves it atomically. cgroup.threads works the same way for individual threads, on cgroups whose cgroup.type is threaded.

The move is a write of the calling task's membership, not a deep copy of state. As noted in the charging section, already-allocated memory stays charged to the original cgroup until it is freed.

Memory: Limits, Reclaim, OOM

The memory controller is the most consequential one to get right and the easiest to misconfigure. Start with what memory.max actually counts. In v2 the answer is "the cgroup's full memory footprint, including kernel allocations made on its behalf." Concretely, the controller charges: anonymous memory (heap, stacks, anonymous mmaps), page cache pages the cgroup's processes bring in, kernel memory allocated on their behalf (slab, kernel stacks, page tables, per-cpu allocations), socket buffers, and tmpfs/shmem pages they write.

What it does not count: pages already charged to a different cgroup that this one is also reading (the first reader gets the charge), and tmpfs mounts that are charged to whoever wrote the file.

Set a small limit, run a process that allocates more, and watch the kernel kill it within the cgroup:

sudo mkdir -p /sys/fs/cgroup/oom-demo
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
echo 50M | sudo tee /sys/fs/cgroup/oom-demo/memory.max > /dev/null

sudo sh -c 'echo $$ > /sys/fs/cgroup/oom-demo/cgroup.procs; \
  exec python3 -c "x=bytearray(200*1024*1024); print(\"alive\")"'
# Killed
echo $?
# 137  (128 + SIGKILL)

Confirm the kill was scoped to the cgroup:

cat /sys/fs/cgroup/oom-demo/memory.events
# low 0
# high 0
# max 1         <- count of allocations rejected/throttled at memory.max
# oom 1         <- count of OOM events
# oom_kill 1    <- count of processes killed

The sequence is: an allocation request would push memory.current past memory.max; the kernel calls try_charge, which triggers reclaim inside the cgroup; reclaim cannot free enough (this Python is allocating anonymous memory, which can only be reclaimed via swap, and there is none); the cgroup-scoped OOM killer fires and chooses a victim from the cgroup's processes by oom_score_adj. The host's oom_score for unrelated processes is irrelevant.

By default the OOM killer kills one process. Setting memory.oom.group = 1 makes it kill every process in the cgroup, useful for "all-or-nothing" workloads where a partially-killed pod is worse than a fully-killed one — the most common Kubernetes example is a pod whose containers depend on each other and would deadlock if one died. Setting oom_score_adj per-process is how you tell the killer which process to prefer or avoid.

memory.high is the softer counterpart. The kernel throttles the cgroup's allocations and forces reclaim when usage exceeds the threshold, but does not kill anything:

echo 30M | sudo tee /sys/fs/cgroup/oom-demo/memory.high > /dev/null

A typical pattern is memory.high set slightly below memory.max, so reclaim and throttling kick in before the hard limit is hit and the workload has time to react (drop a cache, finish a request, gc) before the OOM killer becomes inevitable.

memory.low and memory.min work the other direction: they protect a cgroup from reclaim. A cgroup below memory.low is skipped by reclaim under normal pressure (other cgroups are reclaimed first); memory.min is a hard floor — the kernel will not reclaim below it even under memory pressure. memory_recursiveprot makes both apply recursively, which is what you want when the protection should hold for the whole subtree, not just the immediate cgroup.

For deeper accounting, read memory.stat:

cat /sys/fs/cgroup/oom-demo/memory.stat | head
# anon                  ...
# file                  ...
# kernel                ...
# kernel_stack          ...
# pagetables            ...
# percpu                ...
# sock                  ...
# vmalloc               ...
# slab                  ...
# slab_reclaimable      ...
# slab_unreclaimable    ...

memory.peak (Linux 5.19, July 2022) records the high-water mark — useful for sizing because memory.current only shows the right-now value. Without it, catching the peak required polling.
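memory.stat, like most *.stat files, is flat "key value" text, so a monitoring script needs only a few lines to use it. A sketch with invented sample values:

```python
def parse_flat_keyed(text):
    """Parse a cgroup v2 flat keyed file (memory.stat, pids.events, ...)."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

# Sample values, made up for illustration:
sample = """anon 41943040
file 8388608
kernel_stack 262144
slab_reclaimable 524288
slab_unreclaimable 262144"""

stats = parse_flat_keyed(sample)
assert stats["anon"] == 41943040

# e.g. the unreclaimable share of slab, a common "why is kernel memory
# growing" question:
slab = stats["slab_reclaimable"] + stats["slab_unreclaimable"]
assert stats["slab_unreclaimable"] / slab == 1 / 3
```

The same parser works on cpu.stat, memory.events, and pids.events, which share the format.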

CPU: Weight And Quota

Two CPU files do almost all the work: cpu.weight, a proportional share (default 100, range 1 to 10000) that matters only under contention, and cpu.max, a hard bandwidth cap written as "<quota> <period>" in microseconds (default "max 100000").

cpu.weight is a scheduler weight. The Linux scheduler (CFS through 6.5; EEVDF from 6.6 onward, October 2023) treats every cgroup with the cpu controller enabled as a hierarchical scheduler entity, and distributes time at each level in proportion to weight. Two cgroups under the same parent with weights 100 and 200 get one-third and two-thirds of the parent's CPU share when both are runnable. With no contention — say only one of them is busy — weight does nothing; the busy one gets all the CPU it can use up to cpu.max.

cpu.max is the CFS bandwidth controller. It enforces a hard cap by tracking how much CPU time the cgroup has consumed in the current period, throttling it once the quota is reached, and unthrottling it at the start of the next period:

sudo mkdir -p /sys/fs/cgroup/cpu-demo
echo "+cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
echo "50000 100000" | sudo tee /sys/fs/cgroup/cpu-demo/cpu.max > /dev/null

sudo sh -c 'echo $$ > /sys/fs/cgroup/cpu-demo/cgroup.procs; \
  exec timeout 5 sh -c "while :; do :; done"'

cat /sys/fs/cgroup/cpu-demo/cpu.stat
# usage_usec     2500000
# user_usec      2500000
# system_usec       0
# nr_periods      ~50
# nr_throttled    ~50
# throttled_usec ~2500000

50000 100000 means "50 ms of CPU time per 100 ms period," i.e. 0.5 cores. nr_throttled shows the cgroup hit its quota every period; throttled_usec is how long it sat paused. Kubernetes' CPU throttling alerts read these counters. The pathology to watch for is tail-latency throttling: a low-cpu.max workload that is mostly idle but bursty can blow through its quota in the middle of a request, sit throttled for the rest of the period (up to 100 ms), and miss its SLO. Latency-sensitive Kubernetes workloads sometimes drop cpu.max entirely and rely on cpu.weight plus capacity planning instead, accepting that one bad neighbor can slow them down rather than guaranteeing one will block them at every period boundary.
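The alerting math those monitoring systems do is simple: sample cpu.stat twice, diff the counters, and compute what fraction of periods hit the quota and what fraction of wall time was spent paused. A sketch with invented sample numbers:

```python
def throttle_metrics(before, after, wall_usec):
    """Derive throttling rates from two cpu.stat samples taken wall_usec apart."""
    d = {k: after[k] - before[k] for k in before}
    periods = d["nr_periods"] or 1  # avoid dividing by zero on an idle cgroup
    return {
        "throttled_period_pct": 100.0 * d["nr_throttled"] / periods,
        "throttled_time_pct": 100.0 * d["throttled_usec"] / wall_usec,
    }

# Two samples 5 seconds apart (values made up for illustration):
before = {"nr_periods": 100, "nr_throttled": 10, "throttled_usec": 400_000}
after  = {"nr_periods": 150, "nr_throttled": 60, "throttled_usec": 2_900_000}

m = throttle_metrics(before, after, wall_usec=5_000_000)
assert m["throttled_period_pct"] == 100.0  # hit the quota in every period
assert m["throttled_time_pct"] == 50.0     # half the window spent paused
```

A workload that is 100% throttled-periods but low throttled-time is merely capped; one that is high on both is the tail-latency pathology described above.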

The Kubernetes CPU model maps directly: requests.cpu becomes cpu.weight (rescaled), limits.cpu becomes cpu.max. Setting requests without limits is what gives you weight without bandwidth throttling.
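The rescale is a linear map from the v1 shares range [2, 262144] onto the v2 weight range [1, 10000]. The sketch below follows the integer arithmetic of runc's ConvertCPUSharesToCgroupV2Value helper; note that under this map the v1 default of 1024 lands on 39, not on v2's default weight of 100 — only the ratios between siblings matter to the scheduler, and the map preserves those:

```python
def shares_to_weight(shares):
    """v1 cpu.shares -> v2 cpu.weight, as runc converts it."""
    if shares == 0:
        return 0  # "unset"
    return 1 + ((shares - 2) * 9999) // 262142

assert shares_to_weight(2) == 1            # minimum maps to minimum
assert shares_to_weight(262144) == 10000   # maximum maps to maximum
assert shares_to_weight(1024) == 39        # v1 default does not land on 100
```

Two pods requesting 1 CPU and 2 CPUs keep a 1:2 weight ratio after conversion, which is the property the kubelet actually needs.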

I/O: Bandwidth And Weighted Sharing

The io controller exposes two knobs, keyed per block device:

io.stat reports per-device usage:

cat /sys/fs/cgroup/system.slice/io.stat
# 8:0 rbytes=... wbytes=... rios=... wios=... dbytes=... dios=...

Two complications make I/O accounting harder than memory or CPU. Writeback is asynchronous: a process dirties a page in memory, and the actual disk write happens later, possibly by a kernel thread on a different CPU. v2's coherent hierarchy is what makes this attributable — the BIO carries the original cgroup, and the writeback charge lands on the same node as the memory charge for the dirty page. v1 famously could not do this. The other complication is that filesystem journal traffic is shared across cgroups; some of it cannot be cleanly attributed to any one cgroup, and the kernel charges it to the root.

Limit Process Count

Forks happen all the time. A bug or attack can exhaust the host's pid space; pids.max defends against it:

sudo mkdir -p /sys/fs/cgroup/pid-demo
echo "+pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
echo 5 | sudo tee /sys/fs/cgroup/pid-demo/pids.max > /dev/null

sudo sh -c 'echo $$ > /sys/fs/cgroup/pid-demo/cgroup.procs; \
  for i in 1 2 3 4 5 6 7 8; do \
    sleep 30 & echo "started $!"; \
  done'
# After the limit:
# sh: fork: retry: Resource temporarily unavailable

pids.events records the rejections:

cat /sys/fs/cgroup/pid-demo/pids.events
# max 4   <- count of fork rejections

The check is inside fork(2): the kernel allocates the new task struct, then attempts to charge +1 against the cgroup's pids.current. If the new value would exceed pids.max, the allocation is undone and fork returns EAGAIN. This is the same charging path described earlier — the cgroup is debited at the moment the kernel commits to the new resource.

Pressure: PSI

Utilization is a misleading metric for resource health. A CPU pegged at 100% is not under pressure if every task that wants the CPU is currently getting it. A CPU at 40% is under pressure if there are queued tasks waiting their turn. PSI (Pressure Stall Information) measures the second condition — time spent stalled, not time spent busy — and exposes it per-cgroup in cpu.pressure, memory.pressure, and io.pressure:

cat /sys/fs/cgroup/oom-demo/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

some is "at least one task in the cgroup was stalled on this resource." full is "every task in the cgroup was stalled" — strictly worse, because the cgroup made no progress. The avg numbers are 10-, 60-, and 300-second moving averages of the percentage of wall time stalled. total is a microsecond counter useful for delta-style monitoring.

PSI was added in Linux 4.20 (December 2018). Facebook's oomd and the kernel's PSI-aware reclaim were the original consumers; the most useful threshold most operators reach for is "memory.pressure full > 5% over 60s = act now," because by the time RSS hits the limit it is already too late.
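The two-line PSI format is easy to parse, and the "full over 5% for 60s" rule from the text is a one-line check on top of it. A sketch; the sample shows a cgroup under real memory pressure, with invented numbers:

```python
def parse_psi(text):
    """Parse a cgroup v2 pressure file into {"some": {...}, "full": {...}}."""
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

sample = """some avg10=12.50 avg60=8.20 avg300=3.10 total=41253960
full avg10=9.80 avg60=6.40 avg300=2.00 total=29862013"""

psi = parse_psi(sample)
assert psi["full"]["avg60"] == 6.40

# The rule of thumb from the text:
under_pressure = psi["full"]["avg60"] > 5.0
assert under_pressure  # act now: reclaim, evict, or page someone
```

For delta-based alerting, diff the total counters between samples instead of reading the averages.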

"No Internal Processes" In Practice

A non-root cgroup cannot both contain processes and have controllers enabled in its subtree_control that propagate to children. To see the rule fire:

sudo mkdir -p /sys/fs/cgroup/parent/child
sudo sh -c 'echo $$ > /sys/fs/cgroup/parent/cgroup.procs; exec sleep 300' &
echo "+memory" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
# tee: /sys/fs/cgroup/parent/cgroup.subtree_control: Device or resource busy

Move the resident process down into the child first, then the write succeeds:

pid=$(cat /sys/fs/cgroup/parent/cgroup.procs)
echo "$pid" | sudo tee /sys/fs/cgroup/parent/child/cgroup.procs
echo "+memory" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
# +memory

The reason is the charging model: if a parent contained both processes and child cgroups with the same controllers enabled, the kernel would have to choose whether the parent's processes count against the parent's limits in isolation, against the children's, or both — and however it chose, some configurations would produce contradictions. v1 tried to be clever and produced exactly the contradictions described earlier. v2 simply forbids the configuration. Containers always sit in leaf cgroups for this reason: the runtime creates a hierarchy of intermediate cgroups (slice → kubepods → pod → container), each of which has children but no processes, and puts the container's processes only in the leaf.

Killing A Cgroup Atomically

cgroup.kill is the right way to terminate a container. Writing 1 to it sends SIGKILL to every process in the cgroup, all at once, in a single kernel transaction:

sudo mkdir -p /sys/fs/cgroup/kill-demo
# Children inherit their parent's cgroup, so moving the shell in first
# places every sleep it forks in kill-demo too:
sudo sh -c 'echo $$ > /sys/fs/cgroup/kill-demo/cgroup.procs; \
  sleep 9999 & sleep 9999 & sleep 9999 & wait' &
cat /sys/fs/cgroup/kill-demo/cgroup.procs
# <four PIDs: the sh and its three sleeps>

# Atomic kill:
echo 1 | sudo tee /sys/fs/cgroup/kill-demo/cgroup.kill
# All sleeps are now gone; cgroup.procs is empty.

Without cgroup.kill, the userspace alternative is "list cgroup.procs, send SIGKILL to each, repeat until empty." The race window in that loop is real: a process inside the cgroup can fork(2) between the listing and the kill, and the new child escapes the first round. If it forks fast enough you have an unkillable cgroup. cgroup.kill is implemented in the kernel as "hold the cgroup mutex, walk every task, signal it, then return," which closes the race entirely. It is one of the small features that made container shutdown in Kubernetes substantially less flaky after Linux 5.14 (August 2021).

cgroup.freeze is the related primitive for the not-quite-kill case: write 1 to suspend every process at the next safe point, do whatever the runtime needs to do, write 0 to thaw. Snapshotting a process tree, moving processes between cgroups without races, and detaching for live migration all use it.

The Cgroup Namespace From Inside A Container

Containers see a different /sys/fs/cgroup from the host. Cgroup namespaces, added in Linux 4.6 (May 2016), virtualize the cgroup tree the same way mount namespaces virtualize the mount table: a process inside a cgroup namespace sees its own cgroup as /, and any path it reads from /proc/self/cgroup is relative to that root.

# Host:
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-3.scope

# Inside a container:
docker run --rm alpine:3.20 cat /proc/self/cgroup
# 0::/

The container does not see "I am at /system.slice/docker-<id>.scope"; it sees "I am at /." The runtime mounts a fresh cgroup2 filesystem at /sys/fs/cgroup inside the namespace, and the kernel renders the tree relative to the namespace's root. From the host, the same cgroup is still at its real path; the namespace is purely a per-process view.

The nsdelegate mount option pairs with this: it makes the cgroup namespace boundary act as a delegation point, so a process inside the namespace cannot move itself or other processes above its own root cgroup, even if it has the file permissions to do so. Without nsdelegate, a privileged process inside a container could in principle write to cgroup.procs of an ancestor and escape — nsdelegate makes that an EPERM.
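The path translation the kernel performs is mechanical enough to sketch (a model of the rendering rule, not kernel code): a cgroup at the namespace root appears as /, a descendant appears at its path relative to that root, and anything outside the root is simply not visible.

```python
def namespaced_path(host_path, ns_root):
    """Render a host-side cgroup path as seen from inside a cgroup namespace
    rooted at ns_root."""
    if host_path == ns_root:
        return "/"
    if host_path.startswith(ns_root + "/"):
        return host_path[len(ns_root):]
    return None  # outside the namespace: not visible

# Hypothetical container cgroup path:
ns_root = "/system.slice/docker-abc123.scope"

assert namespaced_path(ns_root, ns_root) == "/"      # /proc/self/cgroup: 0::/
assert namespaced_path(ns_root + "/sub", ns_root) == "/sub"
assert namespaced_path("/user.slice", ns_root) is None
```

The third case is the point of nsdelegate: paths above the namespace root do not resolve from inside, so there is nothing there to write to.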

systemd Owns The Tree

On systemd-managed hosts (most production Linux), systemd owns the cgroup tree. The kernel enforces a single-writer model per cgroup directory — if two processes both manage the same cgroup, one will lose. systemd's answer is delegation: it creates a parent with Delegate=yes, marks it as owned by another manager, and stops touching what is underneath.

systemd-run is the convenient way to launch a transient unit and observe its cgroup placement:

sudo systemd-run --slice=demo.slice --unit=oneshot sleep 60
systemctl status oneshot.service
# Look for "CGroup: /demo.slice/oneshot.service"
cat /sys/fs/cgroup/demo.slice/oneshot.service/cgroup.procs
# <pid of sleep>

When containerd is configured with the systemd cgroup driver, it does not write cgroup files directly. It asks systemd over D-Bus to create a transient scope for each container shim, and lets systemd populate the resource files. The OCI linux.cgroupsPath in this mode is a systemd path:

kubepods-besteffort.slice:cri-containerd:<container-id>

read as <slice>:<prefix>:<id>. The runtime creates a transient scope cri-containerd-<id>.scope under kubepods-besteffort.slice/.
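The slice name encodes its own nesting: each dash adds a level, so a-b.slice lives inside a.slice. Expanding a <slice>:<prefix>:<id> cgroupsPath into the cgroupfs path it will occupy is a few lines (a sketch of the naming convention, assuming a well-formed input):

```python
def slice_to_path(slice_name):
    """Expand a systemd slice name into its cgroupfs path:
    kubepods-besteffort.slice -> kubepods.slice/kubepods-besteffort.slice"""
    assert slice_name.endswith(".slice")
    base = slice_name[: -len(".slice")]
    parts = base.split("-")
    return "/".join("-".join(parts[: i + 1]) + ".slice"
                    for i in range(len(parts)))

def scope_path(cgroups_path):
    """Expand an OCI systemd-style linux.cgroupsPath into the scope's path."""
    slice_name, prefix, cid = cgroups_path.split(":")
    return f"{slice_to_path(slice_name)}/{prefix}-{cid}.scope"

assert slice_to_path("kubepods-besteffort.slice") == \
    "kubepods.slice/kubepods-besteffort.slice"

# "abc123" is a stand-in for a real container ID:
assert scope_path("kubepods-besteffort.slice:cri-containerd:abc123") == \
    "kubepods.slice/kubepods-besteffort.slice/cri-containerd-abc123.scope"
```

Prepend /sys/fs/cgroup/ to the result to get the directory the runtime's cgroup files land in.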

The "cgroup driver" config in containerd, kubelet, and runc must agree. A common production failure: kubelet uses systemd, containerd uses cgroupfs, and pods fail to start because each side is trying to create cgroups the other does not see.

Delegation Enables Rootless

cgroup v2's delegation model is what gives a rootless container resource limits. Without v2 delegation, only the root user could write to cgroup files, and rootless containers would have no enforced limits. systemd creates a delegated subtree under user.slice/user-<uid>.slice/user@<uid>.service/ and grants the user ownership:

ls -ld /sys/fs/cgroup/user.slice/user-1000.slice/[email protected]
# drwxr-xr-x 2 1000 1000 ... [email protected]

Inside the delegated subtree, the user can mkdir, write +cpu +pids +memory to subtree_control (subject to what systemd enabled at the boundary), and place processes — without root.

# As an unprivileged user:
mkdir /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/me-demo
echo $$ > /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/me-demo/cgroup.procs

Rootless podman, buildah, and rootless containerd put their containers under this kind of subtree.

Device Access Is BPF, Not Files

cgroup v2 does not have devices.allow or devices.deny files. Device access policy is enforced by an eBPF program of type BPF_PROG_TYPE_CGROUP_DEVICE attached to the cgroup. runc compiles the OCI linux.resources.devices list into a BPF program at container start, and the kernel runs that program on every device-class operation: the program returns 0 (deny) or 1 (allow) per syscall.

To see the attached program on a running container's cgroup:

sudo bpftool cgroup tree | head
# CgroupPath
# ID       AttachType      AttachFlags     Name
# /sys/fs/cgroup/system.slice/docker-<id>.scope
#     12  cgroup_device                   <prog-name>

v1 exposed device policy as devices.allow and devices.deny files; v2 exposes nothing in the cgroup directory. Tooling that walked the v1 files has to inspect the attached BPF program with bpftool cgroup show instead.

OCI Resource Mapping

The relevant linux.resources fields in config.json and where they land:

OCI field                                    v2 file
memory.limit                                 memory.max
memory.reservation                           memory.low
memory.swap                                  memory.swap.max
cpu.shares                                   cpu.weight (rescaled from 1024 → 100 default)
cpu.quota / cpu.period                       cpu.max
cpu.cpus / cpu.mems                          cpuset.cpus / cpuset.mems
pids.limit                                   pids.max
blockIO.weight, throttleReadBpsDevice, etc.  io.weight, io.max
devices                                      BPF program attached via BPF_PROG_TYPE_CGROUP_DEVICE

The remap from v1 names to v2 files is the runtime's job, not the user's. The OCI spec keeps the v1-style names for compatibility; runc, crun, and youki translate.

Where This Goes

The next chapter covers the rootfs: content addressing, snapshotters, and how runc gets a process to see a custom /. Cgroups reappear in chapter 7 when the device cgroup BPF program comes up under the security boundary.

Sources And Further Reading