Chapter 7: Security Boundaries

A container is not a hypervisor. It is a process — or a tree of processes — running directly on the host kernel, separated from everything else by a stack of independently-tunable Linux subsystems. The security boundary is the composition of those subsystems: namespaces shape what the process sees, capabilities and seccomp shape what it can ask the kernel to do, MAC policy gates the operations the kernel would otherwise allow, masked paths plug the holes that namespaces cannot, and cgroups bound the damage if something else fails. None of these is "the security model." The model is the layered effect, and the cost of --privileged is removing several layers at once.

The chapter walks the layers in roughly the order the kernel applies them, with one section per control. The framing to keep in mind throughout: every layer is defense in depth. A misconfiguration or a bypass of one layer should leave the others holding.

Safety: most examples need root, so use a disposable Linux VM. The seccomp and capabilities demonstrations are harmless there. The MAC examples assume the relevant LSM is already loaded and configured by the distribution. Examples were checked on Ubuntu 24.04 (AppArmor) and a Fedora 40 VM (SELinux).

What The Controls Are For

Three threat classes shape the controls:

  1. Container → host escape — a process inside the container gaining privileges or visibility on the host.
  2. Container → container interference — one container reading another's data, signaling its processes, or starving its resources.
  3. Container → external resource abuse — a container performing actions outside its intended scope.

Capabilities and seccomp limit (1) and (3). MAC (AppArmor, SELinux) covers (1) and (2). User namespaces strengthen (1) at the cost of operational complexity. Cgroups address resource starvation in (2). No single control covers everything; the next sections cover each one, then the chapter returns to what --privileged actually loosens.

Linux Capabilities

Capabilities split the historical "root or not" privilege model into roughly forty independently-grantable units. CAP_NET_ADMIN lets a process configure interfaces; CAP_SYS_TIME lets it set the system clock; CAP_NET_BIND_SERVICE lets it bind ports below 1024. A non-root process with the right capability can do the corresponding privileged operation; a root-equivalent process without that capability cannot. The model exists because "root" was always a coarse abstraction — a daemon that needs to bind port 80 should not also be able to load kernel modules.

Every process has five capability sets (permitted, effective, inheritable, bounding, and ambient), each a 64-bit bitmap. The rules for how they evolve across execve(2) are the part that takes the most time to internalize.

The bounding set is the bound on a container's lifetime privilege. A container with bounding set {CHOWN, DAC_OVERRIDE, FOWNER, ...} can never gain a capability outside that set, regardless of what setuid binaries or file-capabilities-bearing executables it runs. Setting the bounding set is the runtime's job; reading the current state back is yours.

# What capabilities does the current shell have?
grep ^Cap /proc/self/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000000

Five hex bitmaps, one per set. For an unprivileged shell, the permitted and effective sets are empty, but the bounding set is still "all caps known to this kernel": the bounding set only limits what the process could acquire later, for example by executing a setuid-root binary. For a runc-launched container the bounding set is restricted to the OCI process.capabilities.bounding list.

Decode the bitmaps with capsh:

capsh --decode=000001ffffffffff
# 0x000001ffffffffff=cap_chown,cap_dac_override,...,cap_checkpoint_restore
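
When capsh is not installed, the same decoding can be done with shell arithmetic alone. The sketch below assumes the bit numbering documented in capabilities(7) for Linux 5.9 and later (bit 0 is CAP_CHOWN, bit 40 is CAP_CHECKPOINT_RESTORE); decode_caps and CAP_NAMES are illustrative names, not part of any tool.

```shell
# Capability names in bit order, bit 0 first (capabilities(7), Linux 5.9+).
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEX: print one line per capability whose bit is set in HEX.
decode_caps() (
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            printf 'cap_%s\n' "$name"
        fi
        bit=$((bit + 1))
    done
)

# Decode the bounding set of a default Docker container.
decode_caps 00000000a80425fb
```

A kernel newer than the table would set bits the loop never reaches, so treat a mismatch with capsh as a sign the name list needs extending.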

To see the difference inside a container:

docker run --rm alpine:3.20 sh -c 'grep ^Cap /proc/self/status'
# CapInh: 0000000000000000
# CapPrm: 00000000a80425fb
# CapEff: 00000000a80425fb
# CapBnd: 00000000a80425fb
# CapAmb: 0000000000000000

docker run --rm alpine:3.20 sh -c '
  apk add -q libcap
  capsh --decode=$(grep ^CapBnd /proc/self/status | cut -f2)
'
# 0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
# cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
# cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

Fourteen capabilities: the conventional default container set. Notably absent: CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_TIME, CAP_SYS_MODULE. CAP_NET_RAW is present in Docker's default (bit 13 of the mask above), although some runtimes drop it. The container's "root" cannot configure interfaces, load kernel modules, or set the clock. Compare with --privileged:

docker run --rm --privileged alpine:3.20 sh -c 'grep ^CapBnd /proc/self/status'
# CapBnd: 000001ffffffffff   <- everything

--privileged restores the full bounding set, disables the seccomp profile, runs the container without AppArmor or SELinux confinement, and gives the device cgroup a wildcard rule. It is the easiest way to make a container "work," and it removes most of what made it a container.
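
To put a number on the capability part of that trade, the two CapBnd values shown above can be compared bit by bit. This is a small sketch using only shell arithmetic; count_caps and caps_added are hypothetical helper names.

```shell
# count_caps HEX: number of capability bits set in a bitmap.
count_caps() (
    mask=$((0x$1))
    n=0
    while [ "$mask" -ne 0 ]; do
        n=$(( n + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo "$n"
)

# caps_added FULL DEFAULT: bits present in FULL but missing from DEFAULT,
# printed in the same 16-digit form as /proc/<pid>/status.
caps_added() ( printf '%016x\n' $(( 0x$1 & ~0x$2 )) )

count_caps 00000000a80425fb                    # default bounding set
count_caps 000001ffffffffff                    # --privileged bounding set
caps_added 000001ffffffffff 00000000a80425fb   # what --privileged adds
```

The last value can be passed to capsh --decode to list the added capabilities by name.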

File Capabilities And noNewPrivileges

Capabilities can also live on executables as the security.capability xattr:

sudo apt-get install -y libcap2-bin
getcap -r /usr/bin /usr/sbin 2>/dev/null | head
# /usr/bin/ping cap_net_raw=ep
# /usr/bin/newuidmap cap_setuid=ep
# /usr/bin/newgidmap cap_setgid=ep

When ping is exec'd, its file capabilities become permitted+effective on the new process. This is how a non-root user can ping despite needing CAP_NET_RAW — the binary brings the capability with it. The same mechanism is a privilege-escalation primitive in the wrong hands: a buggy setuid binary inside a container can become a way out.
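
The set algebra behind that transition is short enough to model directly. The sketch below follows the formulas given in capabilities(7) for a non-root process; exec_caps is a hypothetical helper, and it leaves out several things the kernel also does: the special treatment of UID 0, the no_new_privs bit, and the case where execve(2) fails with EPERM because a binary with the effective bit set could not obtain all of its file-permitted capabilities.

```shell
# exec_caps P_INH P_BND P_AMB F_INH F_PERM F_EFF
# P_* are the process's inheritable, bounding, and ambient sets; F_* are the
# file's inheritable and permitted sets (all hex) and its effective bit (0/1).
# Prints the permitted and effective sets of the process after execve(2).
exec_caps() (
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_perm=$((0x$5)); f_eff=$6

    # Ambient capabilities are cleared when the file carries capabilities
    # (the kernel also clears them for setuid/setgid files, not modelled here).
    if [ $(( f_inh | f_perm )) -ne 0 ]; then p_amb=0; fi

    # P'(permitted) = (P(inh) & F(inh)) | (F(perm) & P(bounding)) | P'(ambient)
    new_perm=$(( (p_inh & f_inh) | (f_perm & p_bnd) | p_amb ))

    # P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
    if [ "$f_eff" -eq 1 ]; then new_eff=$new_perm; else new_eff=$p_amb; fi

    printf 'permitted=%016x effective=%016x\n' "$new_perm" "$new_eff"
)

# ping with cap_net_raw=ep (bit 13 = 0x2000), run from an unprivileged shell
# whose bounding set is full:
exec_caps 0 000001ffffffffff 0 0 2000 1

# The same binary under a bounding set that lacks CAP_NET_RAW:
exec_caps 0 00000000a80405fb 0 0 2000 1
```

The second call shows why the bounding set is the lifetime limit: the file's permitted set is ANDed with it, so a capability outside the bounding set never reaches the new process.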

noNewPrivileges is a one-bit prctl(2) switch, set with prctl(PR_SET_NO_NEW_PRIVS, 1), that closes that path. Once set, the process cannot gain privileges via execve: setuid bits are ignored, file capabilities are ignored, AppArmor profile transitions are blocked. The bit is one-way; it cannot be cleared once set, and it is inherited across fork(2) and execve(2).

# A non-privileged shell. ping works because of file capabilities.
ping -c1 127.0.0.1 > /dev/null && echo "ping ok"
# ping ok

# Run ping with no_new_privs set. File capabilities are ignored.
# (setpriv itself execs ping; a leading "exec" would also end this shell.)
setpriv --no-new-privs ping -c1 127.0.0.1
# ping: socktype: SOCK_RAW

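The config.json template that runc spec generates sets noNewPrivileges: true, but higher-level tools differ: Docker sets the bit only with --security-opt no-new-privileges, and Kubernetes sets it when a container specifies allowPrivilegeEscalation: false. It is a near-zero-cost defense against an entire class of escalation, and most container workloads do not need privilege transitions across exec, so it is worth enabling wherever it is not already on.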

Seccomp

Seccomp ("secure computing mode") is a syscall filter. The kernel runs an attached BPF program on every syscall, the program inspects the syscall number and arguments, and returns an action. The actions that matter are SECCOMP_RET_ALLOW (proceed normally), SECCOMP_RET_ERRNO(n) (fail the syscall with errno n), SECCOMP_RET_KILL_PROCESS (kill the process), and SECCOMP_RET_USER_NOTIF (hand the syscall to a user-space supervisor — used by gVisor, podman's user-namespace fallback, and other userland implementations of restricted syscalls).

The architecture is worth holding in mind. Seccomp is a layer between userspace and the syscall entry point: the program runs after the kernel decodes the syscall number but before any syscall logic executes. It cannot block kernel actions taken on the process's behalf (page faults, signal delivery), only deliberate syscalls. It also cannot dereference pointer arguments, because the program sees only raw register values, which is why filters work on syscall numbers and flag arguments rather than on path strings. Pointer arguments such as mount's path are off-limits to seccomp; AppArmor and SELinux are how you filter on paths.

A small example using libseccomp:

sudo apt-get install -y libseccomp-dev gcc

cat > /tmp/seccomp-demo.c <<'EOF'
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(void) {
    /* Default action: allow. One rule: uname(2) fails with EPERM. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL ||
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(uname), 0) < 0 ||
        seccomp_load(ctx) < 0) {
        fprintf(stderr, "failed to install seccomp filter\n");
        return EXIT_FAILURE;
    }
    seccomp_release(ctx);

    /* getpid(2) has no rule, so the default action lets it through. */
    printf("pid: %ld\n", (long)getpid());

    /* uname(2) matches the rule. glibc implements gethostname() on top of
       uname(2), so that call would fail here as well. */
    struct utsname u;
    if (uname(&u) < 0) {
        perror("uname");
    } else {
        printf("uname: %s\n", u.sysname);
    }
    return 0;
}
EOF

gcc /tmp/seccomp-demo.c -lseccomp -o /tmp/seccomp-demo
/tmp/seccomp-demo
# pid: <some pid>
# uname: Operation not permitted

getpid(2) is allowed; uname(2) is forced to return EPERM. The filter sees syscalls, not library functions: glibc implements gethostname() by calling uname(2), so a gethostname() call in this program would fail too. The OCI spec's linux.seccomp field describes the same filter as JSON; runc compiles it to BPF before exec. Docker and containerd ship a default profile that blocks roughly fifty syscalls, including kexec_load, keyctl, add_key, init_module, mount, umount2, swapon, clock_settime, and reboot. The full profile is at containerd/contrib/seccomp/seccomp_default.go (or moby/profiles/seccomp/default.json for the Docker copy).
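
For comparison, the same single-rule filter (allow everything, fail uname with EPERM) can be written as a JSON profile. This is a minimal sketch in the profile format Docker accepts, whose field names follow the OCI linux.seccomp schema; the file path is arbitrary.

```shell
# Write a seccomp profile that allows every syscall except uname(2),
# which fails with errno 1 (EPERM).
cat > /tmp/deny-uname.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["uname"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1
    }
  ]
}
EOF
```

Passing the file with docker run --rm --security-opt seccomp=/tmp/deny-uname.json alpine:3.20 uname -a should make uname fail inside the container while the rest of the container behaves normally. A default-allow profile like this is only useful for demonstration; the shipped defaults work the other way round, denying by default and listing what is allowed.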

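Seccomp interacts with noNewPrivileges: loading a seccomp filter requires either CAP_SYS_ADMIN in the caller's user namespace or the no_new_privs bit. Unprivileged sandboxes rely on the latter, because requiring CAP_SYS_ADMIN to set up a sandbox would defeat the point; a runtime that still holds CAP_SYS_ADMIN during setup can instead load the filter before it drops capabilities. The kernel's reasoning is that without noNewPrivileges, a filter could make a setuid binary's attempt to drop privileges fail silently, leaving it running with privileges it believes it gave up, so the kernel forbids that combination for unprivileged callers.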

To inspect the filter on a running container:

docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
grep ^Seccomp /proc/$PID/status
# Seccomp:        2
# Seccomp_filters: 1
docker stop demo

Seccomp: 2 is SECCOMP_MODE_FILTER. Seccomp_filters: 1 is the number of attached BPF programs.

Mandatory Access Control: Path-Based Versus Label-Based

Capabilities and seccomp constrain the calling process. MAC — Mandatory Access Control — constrains every operation against system policy, regardless of the calling process's capabilities and regardless of the file's UNIX permissions. MAC is the layer that says "even if DAC and capabilities would have allowed this, the policy disallows it."

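Linux distributions ship one of two main MAC implementations, and the two take fundamentally different approaches to the same problem. AppArmor is path-based: a profile lists the file paths and operations a confined program may use. SELinux is label-based: every process and file carries a security context, and the policy decides which process types may perform which operations on which object types.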

The trade-off is the usual one: paths are obvious and easy to reason about but break under bind mounts and renames; labels are precise and stable but require labeling every file and a policy to manage them. AppArmor is easier to pick up and deploy; SELinux is stricter and harder to bypass with creative filesystem manipulation. Distros pick one because the kernel's LSM framework historically allowed only one major MAC LSM at a time (modern stacking changes this for some LSM combinations, but path-versus-label is still one or the other in practice).

AppArmor (Ubuntu/Debian)

# Confirm AppArmor is enabled.
sudo aa-status | head
# apparmor module is loaded.
# 70 profiles are loaded.

# Find the profile a running container is using.
docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/attr/current
# docker-default (enforce)
docker stop demo

Inside docker-default, writes to /proc/sys, /proc/sysrq-trigger, /sys/kernel, and most of /sys are denied. Mount operations are blocked except for the ones the runtime itself sets up before the profile attaches. Try writing to a kernel parameter from inside a default-profile container:

docker run --rm alpine:3.20 sh -c 'echo 1 > /proc/sys/kernel/sysrq'
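# sh: can't create /proc/sys/kernel/sysrq: Permission denied
# (reads "Read-only file system" instead when the read-only /proc/sys mount
# rejects the write before AppArmor is consulted)

The container's process is root as far as DAC is concerned, so UNIX permissions alone would not stop this write. Two independent layers do: the read-only remount of /proc/sys covered under Masked And Read-Only Paths, and the AppArmor profile, which denies the write even if that mount is missing.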

Per-pod AppArmor profiles in Kubernetes use securityContext.appArmorProfile (the field was added in 1.30, and AppArmor support became GA in 1.31). The named profile must already be loaded on the node: Kubernetes does not ship profiles, only references them.

SELinux (RHEL/Fedora)

# On a Fedora/RHEL host with SELinux in enforcing mode:
getenforce
# Enforcing

# Process contexts.
ps -eZ | head
# system_u:system_r:init_t:s0   1 ?  init
# ...

# Container process context.
podman run --rm -d --name demo registry.access.redhat.com/ubi9/ubi-minimal sleep 600
PID=$(podman inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/attr/current
# system_u:system_r:container_t:s0:c123,c456
podman stop demo

The process type container_t is the policy bucket for normal containers. The :s0:c123,c456 suffix is MCS (Multi-Category Security): each container gets a unique pair of categories, and the policy permits access only when the categories of subject and object match. Two containers running as the same container_t cannot read each other's files because their MCS labels differ — one of the cleanest examples of label-based MAC's strengths, because the kernel does not have to track which paths belong to which container, only which categories.

To see the file labels under a container's rootfs:

sudo ls -lZ /var/lib/containers/storage/overlay/<id>/diff/etc/ | head
# system_u:object_r:container_file_t:s0:c123,c456 ...

container_file_t is the policy bucket for container-managed files. The MCS pair matches the process's, which is what makes the access decision come out "allowed."

When SELinux denies an action that DAC and capabilities would allow, the audit log records it:

sudo ausearch -m AVC -ts recent | tail
# type=AVC msg=audit(...): avc:  denied  { read } for  pid=...
#   scontext=...:container_t:s0:c123,c456
#   tcontext=...:container_file_t:s0:c789,c012
#   tclass=file

An AVC denial line tells you which container (the categories), which kind of object (the type), and which permission the policy was missing (the action).
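
Reading those three pieces out of a record can be scripted. The sketch below is a simplified illustration: explain_avc and mcs_of are hypothetical helpers, the sample record uses made-up pid and category values, and categories are compared as plain strings, whereas the real policy check is a dominance relation between the two levels.

```shell
# mcs_of CONTEXT: print the MCS categories (fifth colon-separated field).
mcs_of() { printf '%s\n' "$1" | cut -d: -f5; }

# explain_avc RECORD: extract the denied permission, the target type, and
# whether the source and target categories are identical.
explain_avc() (
    rec=$1
    perm=$(printf '%s\n' "$rec" | sed -n 's/.*{ \([^}]*\) }.*/\1/p')
    scon=$(printf '%s\n' "$rec" | sed -n 's/.*scontext=\([^ ]*\).*/\1/p')
    tcon=$(printf '%s\n' "$rec" | sed -n 's/.*tcontext=\([^ ]*\).*/\1/p')
    ttype=$(printf '%s\n' "$tcon" | cut -d: -f3)
    if [ "$(mcs_of "$scon")" = "$(mcs_of "$tcon")" ]; then
        verdict=match
    else
        verdict=differ
    fi
    printf 'perm=%s target_type=%s categories=%s\n' "$perm" "$ttype" "$verdict"
)

# Hypothetical record: one container reading a file labeled for another.
explain_avc 'type=AVC msg=audit(0.0:0): avc:  denied  { read } for  pid=4242 scontext=system_u:system_r:container_t:s0:c123,c456 tcontext=system_u:object_r:container_file_t:s0:c789,c012 tclass=file'
```

A result of categories=differ on a container_file_t target usually means one container touched another container's files; categories=match points at the type rules instead.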

User Namespaces As A Security Boundary

User namespaces let a process hold root-equivalent capabilities inside the namespace without holding any host-level privilege. The chapter on namespaces showed how to create one; here is what the security implication looks like.

# Inside a user namespace where I am "root":
unshare --user --map-root-user -- bash -c '
  # Try to mount the host /proc.
  mount -t proc proc /mnt 2>&1
  # mount: /mnt: permission denied. (only privileged user can mount)

  # But create a new mount namespace - mount in there works:
  unshare --mount -- bash -c "mount -t tmpfs tmpfs /mnt && echo mounted"
  # mounted
'

The CAP_SYS_ADMIN the inner shell holds applies to resources owned by its user namespace. Mount namespaces created from inside that user namespace are owned by it; the host's mount namespace is not. A compromise that escalates to root inside such a container lands at an unprivileged host UID instead of host root — the threat-model improvement rootless containers exist for.

Kubernetes has spec.hostUsers: false (beta in 1.30, on track for GA) to give every pod its own user namespace. Inside, the pod's processes appear to run as their requested UIDs; outside, those UIDs map to a high-numbered host UID range (e.g. 100000–165535). A compromise that escalates to "root" inside the pod gets host UID 100000, which on the host is unprivileged.
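
The mapping itself is plain offset arithmetic over the three fields of a /proc/<pid>/uid_map line (first UID inside, first UID outside, length). A minimal sketch, assuming a single-line map and the example range above; map_uid is a hypothetical helper.

```shell
# map_uid INSIDE_UID MAP_LINE: translate a UID inside the namespace to the
# host UID using one "inside-start outside-start length" map line.
# Prints "unmapped" when the UID falls outside the mapped range.
map_uid() (
    uid=$1
    set -- $2            # split the map line into its three fields
    inside=$1; outside=$2; length=$3
    if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + length)) ]; then
        echo $(( outside + uid - inside ))
    else
        echo unmapped
    fi
)

map_uid 0     "0 100000 65536"   # container root
map_uid 1000  "0 100000 65536"   # an ordinary container user
map_uid 70000 "0 100000 65536"   # beyond the 65536-UID range
```

Container root lands on host UID 100000 and UID 1000 on 101000. A UID with no mapping cannot be represented in the namespace at all.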

Caveats:

Masked And Read-Only Paths

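/proc and /sys aggregate host-wide information that the PID and mount namespaces do not isolate. Two OCI fields plug the leaks: linux.maskedPaths hides a path entirely (the runtime bind-mounts /dev/null over a file, or a read-only tmpfs over a directory), and linux.readonlyPaths leaves a path visible but remounts it read-only.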

The default masked set in most runtimes:

/proc/asound
/proc/acpi
/proc/kcore
/proc/keys
/proc/latency_stats
/proc/timer_list
/proc/timer_stats
/proc/sched_debug
/proc/scsi
/sys/firmware
/sys/devices/virtual/powercap

Default read-only:

/proc/bus
/proc/fs
/proc/irq
/proc/sys
/proc/sysrq-trigger

To verify:

docker run --rm alpine:3.20 sh -c 'cat /proc/kcore' 2>&1 | head
# (no output -- /dev/null is mounted over it)

docker run --rm alpine:3.20 sh -c 'echo 1 > /proc/sysrq-trigger' 2>&1
# sh: can't create /proc/sysrq-trigger: Read-only file system
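
The masking is also visible in the mount table. The sketch below looks a path up in mountinfo and reports what is mounted over it; masked_by is a hypothetical helper, and the optional second argument exists only so the function can be pointed at a saved copy of the table.

```shell
# masked_by PATH [MOUNTINFO]: print the mount that covers PATH, if any.
# In a mountinfo line, field 4 is the root of the mount within its
# filesystem, field 5 is the mount point, and the filesystem type is the
# first field after the "-" separator. Exit status is 1 if PATH is not a
# mount point.
masked_by() (
    awk -v p="$1" '
        $5 == p {
            for (i = 7; i <= NF; i++) if ($i == "-") fstype = $(i + 1)
            printf "%s root=%s fstype=%s\n", p, $4, fstype
            found = 1
        }
        END { exit !found }
    ' "${2:-/proc/self/mountinfo}"
)

# Usage, inside a container:
#   masked_by /proc/kcore    # a masked file shows root=/null
#   masked_by /proc/acpi     # a masked directory shows fstype=tmpfs
```

A path from the masked list that reports nothing here is exposed, which is what to expect under --privileged.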

The list looks arbitrary at first glance and reads more naturally as a history of disclosures. Each entry was added in response to something specific:

Any new /proc or /sys interface that exposes host state is a candidate for the masked list. Treat the list as defense in depth atop seccomp and capabilities: a well-behaved container never touches these paths, so masking them costs nothing, and it removes the failure mode where some other layer slips.

Device Access (cgroup v2 + eBPF)

cgroup v2 has no devices.allow file. Device policy is enforced by an eBPF program of type BPF_PROG_TYPE_CGROUP_DEVICE attached to the cgroup. runc compiles the OCI linux.resources.devices list into BPF and attaches it. The kernel runs that program on every device-class operation (open, mknod, etc.) and uses the program's return value to allow or deny.

docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
# On cgroup v2, /proc/<pid>/cgroup holds a single "0::<path>" line.
CGROUP=$(awk -F: '$1 == "0" {print $3}' /proc/$PID/cgroup)

sudo bpftool cgroup tree /sys/fs/cgroup$CGROUP
# /sys/fs/cgroup/.../docker-<id>.scope
# ID  AttachType      AttachFlags     Name
# X   cgroup_device                   sd_devices
docker stop demo

Try to access a device that is not in the allow list:

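docker run --rm alpine:3.20 sh -c '
  cat /dev/null > /dev/null     # allowed
  echo "null ok"
  # /dev/sda does not exist in the container, so create a node for block
  # device 8:0 first; the default rules permit mknod for any device.
  mknod /tmp/sda b 8 0
  cat /tmp/sda 2>&1 | head -1   # open is not allowed
'
# null ok
# cat: can't open '/tmp/sda': Operation not permitted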

The kernel returns EPERM because the BPF program denies the open. Privileged containers attach a BPF program that allows everything (a *:* rwm).

The default device set is small: /dev/null, /dev/zero, /dev/full, /dev/random, /dev/urandom, /dev/tty, /dev/console, /dev/ptmx, all rwm. Anything else has to be explicitly added in linux.resources.devices.
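
The decision the attached program makes can be modelled as a lookup against an allow list. The sketch below is a simplified model, not the BPF that runc generates: the rule list is an illustrative excerpt (/dev/null and /dev/zero with full access, plus wildcard rules that grant mknod only), and dev_allowed is a hypothetical helper.

```shell
# Rules in "type major:minor access" form; "*" matches any number.
RULES="c 1:3 rwm
c 1:5 rwm
c *:* m
b *:* m"

# dev_allowed TYPE MAJOR MINOR ACCESS: exit 0 if any rule grants the single
# access letter (r, w, or m) for that device, 1 otherwise (default deny).
dev_allowed() (
    while read -r type dev access; do
        major=${dev%%:*}
        minor=${dev##*:}
        [ "$type" = "$1" ] || continue
        [ "$major" = "*" ] || [ "$major" = "$2" ] || continue
        [ "$minor" = "*" ] || [ "$minor" = "$3" ] || continue
        case "$access" in *"$4"*) exit 0 ;; esac
    done <<EOF
$RULES
EOF
    exit 1
)

dev_allowed c 1 3 r && echo "read /dev/null (c 1:3): allow"
dev_allowed b 8 0 m && echo "mknod b 8:0: allow"
dev_allowed b 8 0 r || echo "read b 8:0: deny"
```

Granting a container one extra device amounts to adding one line to the list, which is the narrow alternative to the wildcard rwm rule that --privileged installs.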

Putting It Together

A non-privileged container's actual boundary, in roughly the order it is built:

  1. Namespaces create separate views.
  2. Mount setup with masked and read-only paths closes /proc and /sys leaks.
  3. Capability bounding set strips kitchen-sink privileges.
  4. noNewPrivileges prevents privilege gain across exec (and is what allows loading the seccomp filter without CAP_SYS_ADMIN).
  5. Seccomp filters dangerous syscalls.
  6. AppArmor or SELinux denies operations that DAC and capabilities would allow.
  7. Cgroup device BPF restricts which devices work.
  8. Cgroups bound resource consumption.
  9. User namespace mapping (when enabled) makes container root unprivileged on the host.

Each layer is independently configurable. --privileged clears most of these layers at once, which is why it is rarely the right answer when a container does not work. The right reflex when a container needs more than the defaults allow is to add the specific capability, the specific device, the specific AppArmor profile transition — keeping every other layer in place.
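
Before reaching for a bigger hammer, it helps to see which layers a process currently has. The sketch below gathers the per-process state the kernel exposes under /proc; boundary_report is a hypothetical helper, and the NoNewPrivs field assumes a kernel recent enough to report it.

```shell
# boundary_report PID: print the bounding set, no_new_privs bit, seccomp
# mode, and LSM label of a process. Run as root to inspect processes that
# belong to other users.
boundary_report() (
    pid=$1
    grep -E '^(CapBnd|NoNewPrivs|Seccomp):' "/proc/$pid/status"
    # AppArmor profile name or SELinux context, when an LSM provides one.
    if [ -r "/proc/$pid/attr/current" ]; then
        label=$(tr -d '\0' < "/proc/$pid/attr/current" 2>/dev/null)
        printf 'Label:\t%s\n' "$label"
    fi
)

boundary_report self
```

For a container, pass the PID from docker inspect -f '{{.State.Pid}}' instead of self and compare the output with and without --privileged.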

Where This Goes

Part 3 picks up the OCI runtime side: how an OCI bundle is laid out, what config.json looks like in detail, and how runc translates the spec into the kernel state we have just spent four chapters cataloguing.

Sources And Further Reading