seccomp-BPF Deep Dive & Chromium Sandbox Architecture
The posting says: "Strong expertise in security and isolation (Linux kernel hardening, Docker security, Wayland sandboxing)" and "patching large codebases (e.g., Chromium)." This article goes beyond the overview in the Security & Sandboxing page. You will write a raw seccomp-BPF filter by hand, compare it to libseccomp, dissect Chromium's two-layer sandbox implementation, build Landlock policies, write Docker seccomp profiles, and understand the supervisor pattern that enables rootless containers.
1. Writing a seccomp-BPF Filter from Scratch
The seccomp_data structure
Every seccomp filter receives a struct seccomp_data from the kernel for each
syscall:
struct seccomp_data {
int nr; /* syscall number */
__u32 arch; /* AUDIT_ARCH_* value */
__u64 instruction_pointer; /* CPU instruction pointer */
__u64 args[6]; /* syscall arguments */
};
The filter is a Classic BPF (cBPF) program that inspects this structure and returns a verdict. The kernel evaluates it for every syscall the process makes.
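Because a filter addresses seccomp_data by byte offset, it can help to see the layout spelled out. The following is a small sketch rather than anything from the kernel or libc: the helper names arg_lo_offset and arg_hi_offset are made up for illustration, and the low/high split assumes a little-endian host such as x86_64. cBPF loads are 32 bits wide, so each 64-bit argument has to be read as two halves.

```c
#include <linux/seccomp.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of any kernel or libc API. */

/* Byte offset of the low 32 bits of args[i] (assumes little-endian). */
size_t arg_lo_offset(int i) {
    return offsetof(struct seccomp_data, args) + (size_t)i * 8;
}

/* Byte offset of the high 32 bits of args[i] (assumes little-endian). */
size_t arg_hi_offset(int i) {
    return arg_lo_offset(i) + 4;
}
```

With this layout, nr sits at offset 0, arch at offset 4, and args[0] starts at offset 16, which are the constants a 32-bit absolute load would use.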
BPF instruction primitives
Classic BPF has four instruction classes relevant to seccomp:
| Class | Purpose | Example |
|---|---|---|
| BPF_LD | Load data into accumulator | Load syscall number from seccomp_data.nr |
| BPF_JMP | Conditional/unconditional jump | Branch if accumulator == __NR_open |
| BPF_RET | Return a verdict | SECCOMP_RET_ALLOW or SECCOMP_RET_KILL_PROCESS |
| BPF_ALU | Arithmetic on accumulator | Bitwise AND for flag checks |
Two macros build instructions:
BPF_STMT(code, k) /* statement: opcode + constant */
BPF_JUMP(code, k, jt, jf) /* jump: opcode + constant + true_offset + false_offset */
The sock_fprog structure
The filter program is wrapped in struct sock_fprog for the kernel:
struct sock_fprog {
unsigned short len; /* number of BPF instructions */
struct sock_filter *filter; /* pointer to instruction array */
};
Working example: block open() but allow openat()
This is a real, compilable C program. It installs a seccomp-BPF filter that
makes open() (syscall 2 on x86_64) fail with EACCES while allowing
openat() (syscall 257) to proceed normally. The filter validates the
architecture first, killing the process on a mismatch, to prevent
syscall-number confusion across ABIs.
/* seccomp_block_open.c -- blocks open() but allows openat()
* Compile: gcc -o seccomp_block_open seccomp_block_open.c
* Run: ./seccomp_block_open
*/
#include <errno.h>
#include <fcntl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
/* Offset helpers for seccomp_data fields */
#define SC_OFFSET(field) (offsetof(struct seccomp_data, field))
static struct sock_filter filter[] = {
/* [0] Load architecture from seccomp_data.arch */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SC_OFFSET(arch)),
/* [1] Verify x86_64 -- kill on wrong arch to prevent ABI confusion */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
/* [2] Wrong architecture: kill */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
/* [3] Load syscall number from seccomp_data.nr */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SC_OFFSET(nr)),
/* [4] Is it open() (nr 2 on x86_64)? */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 0, 1),
/* [5] Yes: return ERRNO(EACCES) -- denied */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
/* [6] Default: allow everything else (including openat) */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
int main(void) {
struct sock_fprog prog = {
.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
.filter = filter,
};
/* Required: no new privileges after filter install */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
perror("prctl(NO_NEW_PRIVS)");
return 1;
}
/* Install the seccomp filter */
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
perror("prctl(SECCOMP)");
return 1;
}
printf("Filter installed. Testing syscalls...\n");
/* Test 1: open() should fail with EACCES */
int fd = syscall(__NR_open, "/etc/hostname", O_RDONLY);
if (fd < 0)
printf("open() blocked: %m (expected)\n");
else {
printf("open() succeeded unexpectedly!\n");
close(fd);
}
/* Test 2: openat() should succeed */
fd = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);
if (fd >= 0) {
printf("openat() allowed: success (expected)\n");
close(fd);
} else {
printf("openat() failed: %m (unexpected!)\n");
}
return 0;
}
Architecture validation is mandatory
Without the arch check, an attacker on a multi-arch kernel (x86_64 + i386)
could use the 32-bit open() syscall number to bypass your 64-bit filter.
The Chromium sandbox validates architecture in its very first BPF instruction.
How the BPF program flows
Load arch
|
arch == x86_64?
/ \
yes no --> KILL_PROCESS
|
Load syscall nr
|
nr == open?
/ \
yes no --> ALLOW
|
ERRNO(EACCES)
Each BPF_JUMP specifies two offsets: jt (jump if true) and jf (jump if
false), counted in instructions from the next instruction. The offsets are
relative, not absolute, which makes filter construction error-prone by hand.
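One way to check relative offsets without installing anything in the kernel is to step through the filter in user space. The sketch below is a hypothetical evaluator (demo_eval and demo_verdict are names invented here) that understands only the three opcodes the example filter uses. The syscall numbers are hard-coded to the x86_64 values quoted above (2 and 257) so the sketch does not depend on the build host's architecture.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_OPEN   2    /* x86_64 open(), as stated in the text */
#define DEMO_NR_OPENAT 257  /* x86_64 openat(), as stated in the text */

/* Same seven instructions as the filter in seccomp_block_open.c. */
static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_NR_OPEN, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Hypothetical user-space evaluator: handles only LD|W|ABS, JEQ|K, RET|K. */
uint32_t demo_eval(const struct sock_filter *prog, size_t len,
                   const struct seccomp_data *sd) {
    uint32_t acc = 0;
    size_t pc = 0;
    while (pc < len) {
        const struct sock_filter *ins = &prog[pc];
        switch (ins->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            /* Load 32 bits from seccomp_data at byte offset k. */
            if (ins->k > sizeof(*sd) - sizeof(acc))
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const unsigned char *)sd + ins->k, sizeof(acc));
            pc += 1;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* jt/jf are counted from the instruction after the jump. */
            pc += 1 + (acc == ins->k ? ins->jt : ins->jf);
            break;
        case BPF_RET | BPF_K:
            return ins->k;
        default:
            return SECCOMP_RET_KILL_PROCESS; /* opcode not modelled here */
        }
    }
    return SECCOMP_RET_KILL_PROCESS; /* ran off the end of the program */
}

/* Convenience wrapper: verdict for a given arch and syscall number. */
uint32_t demo_verdict(uint32_t arch, int nr) {
    struct seccomp_data sd;
    memset(&sd, 0, sizeof(sd));
    sd.arch = arch;
    sd.nr = nr;
    return demo_eval(demo_filter,
                     sizeof(demo_filter) / sizeof(demo_filter[0]), &sd);
}
```

Calling demo_verdict(AUDIT_ARCH_X86_64, DEMO_NR_OPEN) walks instructions 0, 1, 3, 4, 5, which matches the left-hand path of the diagram. An off-by-one in jt or jf shows up immediately as the wrong verdict.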
2. libseccomp: the Sane API
Raw BPF programming is tedious and bug-prone. libseccomp provides a high-level C API (with Python and Go bindings) that compiles rules to optimized BPF bytecode.
Equivalent filter using libseccomp
/* libseccomp_block_open.c
* Compile: gcc -o libseccomp_block_open libseccomp_block_open.c -lseccomp
*/
#include <errno.h>
#include <fcntl.h>
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strerror() */
#include <sys/syscall.h>
#include <unistd.h>
int main(void) {
/* Default: allow all syscalls */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
if (ctx == NULL) {
fprintf(stderr, "seccomp_init failed\n");
return 1;
}
/* Block open() with EACCES */
int rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EACCES),
SCMP_SYS(open), 0);
if (rc < 0) {
fprintf(stderr, "seccomp_rule_add: %s\n", strerror(-rc));
return 1;
}
/* Load filter into kernel */
rc = seccomp_load(ctx);
if (rc < 0) {
fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
return 1;
}
seccomp_release(ctx);
/* Test: open() fails, openat() succeeds */
int fd = syscall(__NR_open, "/etc/hostname", O_RDONLY);
if (fd < 0)
printf("open() blocked: %m\n");
fd = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);
if (fd >= 0) {
printf("openat() allowed\n");
close(fd);
}
return 0;
}
Key libseccomp functions
| Function | Purpose |
|---|---|
| seccomp_init(default_action) | Create filter context. A default of SCMP_ACT_ALLOW yields a denylist (rules name what to block); a default of SCMP_ACT_KILL yields an allowlist (rules name what to permit) |
| seccomp_rule_add(ctx, action, syscall, arg_cnt, ...) | Add rule. Optional argument comparators for fine-grained control |
| seccomp_rule_add_exact(ctx, action, syscall, arg_cnt, ...) | Like above but fails if the rule cannot be represented exactly |
| seccomp_load(ctx) | Compile to BPF and load the filter into the kernel |
| seccomp_export_bpf(ctx, fd) | Dump raw BPF bytecode to a file descriptor (for debugging) |
| seccomp_export_pfc(ctx, fd) | Dump human-readable pseudo-filter code |
| seccomp_arch_add(ctx, arch) | Add architecture to multi-arch filter |
| seccomp_release(ctx) | Free context |
Argument-level filtering
libseccomp can filter on syscall arguments, not just syscall numbers:
/* Allow mmap() only if PROT_EXEC is NOT set (arg 2) */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 1,
SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_EXEC, 0));
/* Blocking connect() to port 80 is not possible this way: arg 1 is a
pointer to a sockaddr, and seccomp-BPF cannot dereference pointers.
Deep inspection typically uses SECCOMP_RET_USER_NOTIF instead. */
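To make the masked comparison concrete, here is a minimal sketch of the predicate that a SCMP_CMP_MASKED_EQ comparator expresses, written as plain C so it can be reasoned about without libseccomp. The function names masked_eq and mmap_prot_allowed are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper mirroring the MASKED_EQ idea:
 * the masked argument must equal the masked datum. */
bool masked_eq(uint64_t arg, uint64_t mask, uint64_t datum) {
    return (arg & mask) == (datum & mask);
}

/* True when an mmap() prot value would match the rule above,
 * i.e. the PROT_EXEC bit is clear. */
bool mmap_prot_allowed(uint64_t prot) {
    return masked_eq(prot, PROT_EXEC, 0);
}
```

With a mask of PROT_EXEC and a datum of 0, only the executable bit is examined, so PROT_READ | PROT_WRITE matches while PROT_READ | PROT_EXEC does not.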
Comparison: three ways to restrict syscalls
| Approach | Pros | Cons |
|---|---|---|
| Raw BPF (sock_fprog) | Zero dependencies, full control, minimal code size | Error-prone jump offsets, no portability across arches, no argument helpers |
| libseccomp | Architecture-portable, argument comparators, export/debug tools | Library dependency, slight abstraction overhead |
| systemd SystemCallFilter= | Declarative, no code needed, syscall groups (@system-service) | Only for systemd services, limited argument filtering |
systemd example using predefined groups:
[Service]
SystemCallFilter=@system-service @network-io
SystemCallFilter=~@mount @reboot @swap @raw-io
SystemCallErrorNumber=EPERM
The @system-service group covers ~250 syscalls that most services need.
Prefix ~ denies instead of allows.
3. Chromium's Sandbox in Detail
The two-layer architecture
Chromium on Linux uses a defense-in-depth design with two distinct sandboxing layers:
graph LR
NS["Namespaces"] --> BPF["seccomp-BPF"]
FS["chroot"] --> BPF
BPF --> POL["baseline_policy.cc"]
Layer 1 (Namespace sandbox) creates resource isolation:
- User namespace: Renderer runs as UID 0 inside the namespace, which maps to an unprivileged UID outside. Even if the renderer is compromised, it has no privileges on the host.
- PID namespace: The renderer cannot see or signal other processes.
- Network namespace: Empty network namespace -- no sockets, no DNS.
- Mount namespace: Pivot root to a minimal filesystem.
Layer 2 (seccomp-BPF) reduces the kernel attack surface:
- Filters ~350 syscalls down to ~30-50 permitted ones.
- Blocks entire syscall families (filesystem, networking, IPC, module loading).
- Returns ENOSYS for some denied calls (graceful degradation) and SIGSYS for others (hard crash, indicating a bug or exploit).
What is in baseline_policy.cc
The file sandbox/linux/seccomp-bpf-helpers/baseline_policy.cc in the Chromium
source defines the base seccomp policy shared by all sandboxed processes
(renderers, GPU, utilities). Process-specific policies extend it.
Explicitly allowed (always safe):
- Address space: brk, mmap (with flag restrictions), mprotect, munmap, madvise (only MADV_DONTNEED, MADV_WILLNEED, MADV_NORMAL)
- Scheduling: sched_yield, nanosleep, clock_nanosleep
- File descriptors (no open): read, write, close, dup, fcntl, fstat, lseek
- Event loops: epoll_create1, epoll_ctl, epoll_wait, poll, ppoll
- Futex: futex (needed by pthreads and every allocator)
- Signals: rt_sigaction, rt_sigprocmask, rt_sigreturn
Conditionally restricted:
- clone: Only for creating threads (CLONE_THREAD). fork()/vfork() return EPERM (not killed) so glibc error handling works.
- socketpair: Only AF_UNIX domain.
- mmap/mprotect: PROT_EXEC is restricted in some configurations.
- clone3, pidfd_open: Return ENOSYS to force fallback to clone().
Denied globally (SIGSYS crash):
- All filesystem opens: open, openat, creat (renderers must use the broker)
- Kernel modules: init_module, finit_module, delete_module
- System administration: reboot, swapon, swapoff, mount, umount
- Process debugging: ptrace, process_vm_readv
- System V IPC: shmget, semget, msgget
- Privilege changes: setuid, setgid, setgroups
The broker process pattern
Since renderers cannot call open()/openat(), how do they access files?
Through a broker process:
sequenceDiagram
participant R as Renderer (sandboxed)
participant B as Broker (privileged)
participant K as Kernel
R->>B: IPC: "Open /usr/share/fonts/arial.ttf read-only"
B->>B: Check path against allowlist
B->>K: openat(AT_FDCWD, path, O_RDONLY)
K-->>B: fd 7
B->>R: Send fd 7 over Unix socket (SCM_RIGHTS)
R->>K: read(7, buf, len) -- allowed by seccomp
The broker runs outside the seccomp sandbox. It validates every request
against a policy (allowlisted paths and open modes) before performing
the actual openat(). The result is passed back as a file descriptor over a
Unix domain socket using SCM_RIGHTS ancillary data. The renderer can then
read()/write() on the fd, which seccomp allows.
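The fd hand-off described above can be sketched with plain sendmsg()/recvmsg(). This is a minimal illustration of SCM_RIGHTS passing, not Chromium's broker code: send_fd and recv_fd are hypothetical helper names, and a real broker would also carry a request/response protocol and the allowlist check.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket. 0 on success. */
int send_fd(int sock, int fd) {
    char byte = 'F'; /* one byte of real data travels with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new fd, or -1 on failure. */
int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file description, which is why the sandboxed side never needs open() itself.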
The Zygote process
Chromium does not fork()+exec() for new renderers. Instead:
- At startup, the Zygote process is created. It loads all shared libraries and initializes common state (ICU data, V8 snapshots, font cache).
- The Zygote installs the namespace sandbox (Layer 1).
- When a new renderer is needed, the browser process signals the Zygote.
- The Zygote calls fork(). The child inherits all pre-loaded state.
- The child installs its seccomp-BPF filter (Layer 2) and drops to the restricted policy.
- The child becomes the renderer, communicating via Mojo IPC.
This design has two benefits: fast startup (no exec or library loading) and memory efficiency (COW sharing of read-only pages across renderers).
Why fork works under seccomp
The Zygote installs seccomp after forking. The baseline policy blocks
further fork() calls (returning EPERM), so a compromised renderer
cannot spawn new processes.
4. Landlock LSM: Path-Based Access Control Without Privileges
Landlock (merged in Linux 5.13) fills a gap that seccomp cannot address: file-path-based access control without needing root or an LSM profile written by an administrator. It is stackable -- it adds restrictions on top of existing DAC, MAC, and seccomp policies.
Landlock vs. AppArmor vs. seccomp
| Feature | seccomp | AppArmor | Landlock |
|---|---|---|---|
| Restricts | Syscall numbers + args | File paths, capabilities, network | File paths, network ports |
| Privilege required | PR_SET_NO_NEW_PRIVS only | Root (profile loading) | PR_SET_NO_NEW_PRIVS only |
| Stackable | Yes (multiple filters) | Single profile | Yes (multiple rulesets) |
| Granularity | Per-syscall | Per-path + capability | Per-path hierarchy + port |
| Self-sandboxing | Yes | No (admin deploys profiles) | Yes |
| Kernel version | 3.5+ (filter mode); 3.17+ for the seccomp(2) syscall | Mainline | 5.13+ (ABI v1), 6.7+ (ABI v4 with net) |
Writing a Landlock policy in C
This example restricts a process to read-only access under /usr and
read-write under /tmp, with TCP connections only to port 443:
/* landlock_sandbox.c
* Compile: gcc -o landlock_sandbox landlock_sandbox.c
* Requires: Linux 6.7+ for network rules, 5.13+ for filesystem only
*/
#include <fcntl.h>
#include <linux/landlock.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
/* Wrapper functions -- landlock syscalls have no glibc wrappers yet */
static int landlock_create_ruleset(
const struct landlock_ruleset_attr *attr, size_t size, __u32 flags) {
return syscall(__NR_landlock_create_ruleset, attr, size, flags);
}
static int landlock_add_rule(
int ruleset_fd, enum landlock_rule_type type,
const void *attr, __u32 flags) {
return syscall(__NR_landlock_add_rule, ruleset_fd, type, attr, flags);
}
static int landlock_restrict_self(int ruleset_fd, __u32 flags) {
return syscall(__NR_landlock_restrict_self, ruleset_fd, flags);
}
int main(void) {
/* Step 1: Check ABI version */
int abi = landlock_create_ruleset(NULL, 0,
LANDLOCK_CREATE_RULESET_VERSION);
if (abi < 0) {
perror("Landlock not supported on this kernel");
return 1;
}
printf("Landlock ABI version: %d\n", abi);
/* Step 2: Declare what access types we handle */
struct landlock_ruleset_attr ruleset_attr = {
.handled_access_fs =
LANDLOCK_ACCESS_FS_EXECUTE |
LANDLOCK_ACCESS_FS_READ_FILE |
LANDLOCK_ACCESS_FS_READ_DIR |
LANDLOCK_ACCESS_FS_WRITE_FILE |
LANDLOCK_ACCESS_FS_REMOVE_FILE |
LANDLOCK_ACCESS_FS_REMOVE_DIR |
LANDLOCK_ACCESS_FS_MAKE_REG |
LANDLOCK_ACCESS_FS_MAKE_DIR,
.handled_access_net =
LANDLOCK_ACCESS_NET_CONNECT_TCP,
};
/* Downgrade gracefully for older ABI versions */
if (abi < 4)
ruleset_attr.handled_access_net = 0;
int ruleset_fd = landlock_create_ruleset(
&ruleset_attr, sizeof(ruleset_attr), 0);
if (ruleset_fd < 0) {
perror("landlock_create_ruleset");
return 1;
}
/* Step 3: Add filesystem rules */
/* /usr: read + execute only */
struct landlock_path_beneath_attr usr_rule = {
.allowed_access =
LANDLOCK_ACCESS_FS_EXECUTE |
LANDLOCK_ACCESS_FS_READ_FILE |
LANDLOCK_ACCESS_FS_READ_DIR,
.parent_fd = open("/usr", O_PATH | O_CLOEXEC),
};
if (usr_rule.parent_fd < 0 ||
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
&usr_rule, 0)) {
perror("Failed to add /usr rule");
return 1;
}
close(usr_rule.parent_fd);
/* /tmp: read + write */
struct landlock_path_beneath_attr tmp_rule = {
.allowed_access =
LANDLOCK_ACCESS_FS_READ_FILE |
LANDLOCK_ACCESS_FS_READ_DIR |
LANDLOCK_ACCESS_FS_WRITE_FILE |
LANDLOCK_ACCESS_FS_MAKE_REG |
LANDLOCK_ACCESS_FS_REMOVE_FILE,
.parent_fd = open("/tmp", O_PATH | O_CLOEXEC),
};
if (tmp_rule.parent_fd < 0 ||
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
&tmp_rule, 0)) {
perror("Failed to add /tmp rule");
return 1;
}
close(tmp_rule.parent_fd);
/* Step 4: Add network rule -- only port 443 */
if (abi >= 4) {
struct landlock_net_port_attr net_rule = {
.allowed_access = LANDLOCK_ACCESS_NET_CONNECT_TCP,
.port = 443,
};
if (landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
&net_rule, 0)) {
perror("Failed to add network rule");
return 1;
}
}
/* Step 5: Enforce -- no way to remove restrictions after this */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
perror("prctl(NO_NEW_PRIVS)");
return 1;
}
if (landlock_restrict_self(ruleset_fd, 0)) {
perror("landlock_restrict_self");
return 1;
}
close(ruleset_fd);
printf("Landlock sandbox active.\n");
/* Test: writing to /home should fail */
FILE *f = fopen("/home/test.txt", "w");
if (f == NULL)
printf("/home write blocked: %m (expected)\n");
else {
fclose(f);
printf("/home write succeeded (unexpected!)\n");
}
/* Test: reading /usr/bin/ls should succeed */
f = fopen("/usr/bin/ls", "r");
if (f) {
printf("/usr/bin/ls readable (expected)\n");
fclose(f);
}
return 0;
}
Landlock is permanent and one-way
Once landlock_restrict_self() is called, the restrictions cannot be
removed. Additional landlock_restrict_self() calls can only add more
restrictions. This is the same irreversibility property that seccomp
filters and PR_SET_NO_NEW_PRIVS have.
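The example above downgrades only the network rights when the ABI is older than v4. The same best-effort idea applies to filesystem rights that were added after ABI v1. The sketch below shows one way to mask them out; fs_rights_for_abi and the LL_FS_* names are local to this illustration (the bit values are copied from the Landlock UAPI header so the snippet compiles against older headers too).

```c
#include <stdint.h>

/* Bit values as in <linux/landlock.h>; the LL_ names are local to this sketch. */
#define LL_FS_EXECUTE    (1ULL << 0)
#define LL_FS_WRITE_FILE (1ULL << 1)
#define LL_FS_READ_FILE  (1ULL << 2)
#define LL_FS_REFER      (1ULL << 13) /* added in ABI v2 */
#define LL_FS_TRUNCATE   (1ULL << 14) /* added in ABI v3 */
#define LL_FS_IOCTL_DEV  (1ULL << 15) /* added in ABI v5 */

/* Hypothetical helper: drop filesystem rights the running kernel's
 * Landlock ABI does not know, so landlock_create_ruleset() does not
 * reject the request on older kernels. */
uint64_t fs_rights_for_abi(uint64_t wanted, int abi) {
    if (abi < 1)
        return 0; /* Landlock unavailable */
    if (abi < 2)
        wanted &= ~LL_FS_REFER;
    if (abi < 3)
        wanted &= ~LL_FS_TRUNCATE;
    if (abi < 5)
        wanted &= ~LL_FS_IOCTL_DEV;
    return wanted;
}
```

The result would be assigned to handled_access_fs, and each rule's allowed_access would be masked the same way so that no rule asks for a right the ruleset does not handle.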
5. seccomp + Containers: Docker Profiles
How Docker applies seccomp
When you run docker run, the container runtime (runc) installs a seccomp-BPF
filter before exec'ing the container entrypoint. The flow:
- Docker daemon reads the seccomp profile (default or custom).
- The OCI runtime spec includes the profile in linux.seccomp.
- runc translates the JSON profile into libseccomp calls.
- libseccomp compiles to BPF bytecode and installs via seccomp(2).
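A quick way to confirm from inside a container that a filter really was installed is the Seccomp: field in /proc/self/status (0 = disabled, 1 = strict, 2 = filter). The sketch below parses that field; parse_seccomp_mode and current_seccomp_mode are hypothetical helper names, not part of Docker or runc.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the "Seccomp:" value from the text of
 * /proc/<pid>/status. Returns the mode, or -1 if the field is absent. */
int parse_seccomp_mode(const char *status_text) {
    const char *p = status_text;
    while (p != NULL && *p != '\0') {
        int mode;
        /* "Seccomp_filters:" does not match because of the colon. */
        if (strncmp(p, "Seccomp:", 8) == 0 &&
            sscanf(p + 8, "%d", &mode) == 1)
            return mode;
        p = strchr(p, '\n'); /* advance to the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Read /proc/self/status and report the calling process's seccomp mode. */
int current_seccomp_mode(void) {
    char buf[8192];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

Inside a container started with a seccomp profile, current_seccomp_mode() would be expected to report filter mode; with --security-opt seccomp=unconfined it would be expected to report disabled.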
The default Docker seccomp profile
Docker's default profile uses an allowlist strategy:
- defaultAction: SCMP_ACT_ERRNO (deny by default)
- Allows ~270 of ~400+ syscalls that most applications need
- Blocks everything else; Docker's documentation highlights about 44 dangerous syscalls, including:
| Category | Blocked syscalls | Why |
|---|---|---|
| Kernel modules | init_module, finit_module, delete_module | Load kernel code |
| Namespace escape | unshare, setns (conditional) | Break container isolation |
| Reboot/power | reboot, kexec_load | Crash the host |
| Device access | mknod | Create device nodes |
| Clock | clock_settime, settimeofday | Affect host timekeeping |
| Raw I/O | ioperm, iopl | Direct port access |
| Tracing | ptrace | Debug/inspect other processes |
| Mount | mount, umount2 | Modify filesystem topology |
| Swap | swapon, swapoff | Affect host memory |
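Writing a custom profile for a Chromium container
A Chromium container (the kind Warmwind likely runs) needs a profile that is
stricter than Docker's default but permits the syscalls Chromium's own sandbox
requires. Key additions: clone with namespace flags (Chromium creates its own
namespaces) and seccomp (Chromium installs its own nested seccomp filter).
Treat the list below as an illustrative starting point rather than a complete
profile: it omits execve, for example, which the entrypoint needs because runc
installs the filter before exec'ing it. Trace a real workload and add what is
missing before deploying.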
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"archMap": [
{
"architecture": "SCMP_ARCH_X86_64",
"subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
}
],
"syscalls": [
{
"names": [
"read", "write", "close", "fstat", "lseek", "mmap",
"mprotect", "munmap", "brk", "ioctl", "pread64",
"pwrite64", "readv", "writev", "access", "pipe",
"select", "sched_yield", "mremap", "mincore",
"madvise", "dup", "dup2", "nanosleep",
"getpid", "getuid", "getgid", "geteuid", "getegid",
"getppid", "getpgrp", "setsid", "gettid",
"sendmsg", "recvmsg", "sendto", "recvfrom",
"socket", "connect", "bind", "listen", "accept4",
"getsockname", "getpeername", "getsockopt", "setsockopt",
"socketpair", "shutdown",
"fcntl", "flock", "openat", "getdents64",
"fstatfs", "fadvise64", "clock_gettime",
"clock_getres", "clock_nanosleep",
"exit_group", "epoll_wait", "epoll_ctl",
"epoll_create1", "eventfd2", "timerfd_create",
"timerfd_settime", "timerfd_gettime",
"signalfd4", "poll", "ppoll",
"rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
"futex", "set_robust_list", "get_robust_list",
"prctl", "arch_prctl", "set_tid_address",
"restart_syscall", "getrandom", "memfd_create",
"copy_file_range", "statx", "rseq",
"prlimit64", "pipe2", "membarrier"
],
"action": "SCMP_ACT_ALLOW"
},
{
"comment": "Chromium needs clone with namespace flags for its own sandbox",
"names": ["clone", "clone3"],
"action": "SCMP_ACT_ALLOW",
"args": []
},
{
"comment": "Allow Chromium to install its nested seccomp filter",
"names": ["seccomp"],
"action": "SCMP_ACT_ALLOW"
},
{
"comment": "Chromium creates user/PID/net namespaces for renderers",
"names": ["unshare"],
"action": "SCMP_ACT_ALLOW"
},
{
"comment": "GPU process needs DRI ioctls",
"names": ["ioctl"],
"action": "SCMP_ACT_ALLOW"
},
{
"comment": "Needed for shared memory (Wayland buffers, IPC)",
"names": ["shmget", "shmat", "shmctl", "shmdt"],
"action": "SCMP_ACT_ALLOW"
}
]
}
Apply it:
docker run --rm \
--security-opt seccomp=/path/to/chromium-seccomp.json \
--security-opt no-new-privileges \
--cap-drop=ALL \
--cap-add=SYS_ADMIN \
warmwind/chromium:latest
The SYS_ADMIN trap
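--cap-add=SYS_ADMIN is needed because Chromium uses clone() with
CLONE_NEWUSER and CLONE_NEWPID, but it hands the container a very broad
capability. A common alternative in containerized browser deployments is to
run Chromium with --no-sandbox and let Docker's own namespace isolation
substitute for Chromium's Layer 1, with the seccomp profile above standing
in for Layer 2. The trade-off is real: the profile above is far more
permissive than Chromium's renderer policy (it allows openat and socket, for
example), and renderers are no longer isolated from the browser process, so
a renderer exploit reaches everything the container can.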
6. SECCOMP_RET_USER_NOTIF: The Supervisor Pattern
Added in Linux 5.0, SECCOMP_RET_USER_NOTIF enables a fundamentally different
security model: instead of simply allowing or denying syscalls, a privileged
supervisor process intercepts them and decides what to do.
How it works
sequenceDiagram
participant T as Target (sandboxed)
participant K as Kernel
participant S as Supervisor (privileged)
T->>K: mount("/dev/sda1", "/mnt", "ext4", ...)
K->>K: seccomp filter returns RET_USER_NOTIF
K-->>T: Thread blocked
K->>S: Notification on seccomp fd (readable)
S->>K: SECCOMP_IOCTL_NOTIF_RECV
K-->>S: struct seccomp_notif {id, pid, syscall_data}
S->>S: Validate request, read /proc/pid/mem for args
S->>S: Perform mount() on behalf of target
S->>K: SECCOMP_IOCTL_NOTIF_SEND {id, val=0, error=0}
K-->>T: mount() returns 0 (success)
The notification structures
/* Received by supervisor when sandboxed process hits RET_USER_NOTIF */
struct seccomp_notif {
__u64 id; /* unique request ID */
__u32 pid; /* PID of the sandboxed process */
__u32 flags;
struct seccomp_data data; /* syscall nr, arch, args[6] */
};
/* Sent back by supervisor */
struct seccomp_notif_resp {
__u64 id; /* must match the request */
__s64 val; /* syscall return value */
__s32 error; /* errno (negative) or 0 */
__u32 flags; /* SECCOMP_USER_NOTIF_FLAG_CONTINUE */
};
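To show how these two structures fit together, here is a minimal sketch of one supervisor iteration that denies the intercepted syscall. The function names make_errno_response and supervise_once are hypothetical, and the sketch assumes notify_fd is a seccomp notification descriptor the supervisor already holds. A production supervisor would also size its buffers with SECCOMP_GET_NOTIF_SIZES rather than relying on sizeof.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Build a response that makes the intercepted syscall fail with `err`.
 * The id is copied from the request; error carries a negative errno. */
void make_errno_response(const struct seccomp_notif *req,
                         struct seccomp_notif_resp *resp, int err) {
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;
    resp->val = 0;
    resp->error = -err;
    resp->flags = 0; /* no SECCOMP_USER_NOTIF_FLAG_CONTINUE */
}

/* Handle a single notification by denying it with EPERM.
 * Returns 0 on success, -1 if any ioctl fails. */
int supervise_once(int notify_fd) {
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    memset(&req, 0, sizeof(req)); /* the kernel expects a zeroed buffer */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
        return -1;

    /* Confirm the target is still blocked in this syscall before acting. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id) < 0)
        return -1;

    make_errno_response(&req, &resp, EPERM);
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```

A supervisor that emulates the call instead of denying it would do its work between the ID_VALID check and the SEND, then report the emulated result in val with error set to 0.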
Container runtime use: rootless containers
Rootless containers (Podman, rootless Docker, LXC unprivileged) cannot perform
privileged operations like mount() or mknod(). The supervisor pattern
solves this:
| Syscall | Supervisor action |
|---|---|
| mount("proc", ...) | Supervisor mounts procfs in the container's mount namespace |
| mknod("/dev/fuse", ...) | Supervisor creates the device node, passes fd back |
| mount("overlay", ...) | Supervisor mounts overlay or delegates to FUSE |
| connect(AF_VSOCK, ...) | Supervisor proxies the connection |
LXD's implementation runs a dedicated goroutine per container as the syscall
supervisor. It listens on the seccomp notify fd using epoll() and processes
requests through a validation pipeline.
The TOCTOU danger
SECCOMP_USER_NOTIF_FLAG_CONTINUE is dangerous
When the supervisor sets SECCOMP_USER_NOTIF_FLAG_CONTINUE, the kernel
executes the original syscall. But between the time the supervisor
inspected the arguments and the time the kernel acts, the sandboxed process
(or another thread) can rewrite the syscall arguments in memory. This
is a classic TOCTOU (time-of-check-time-of-use) race.
Safe supervisor designs:
- Emulate the syscall entirely and never use FLAG_CONTINUE for a security decision.
- Copy pointer arguments out of /proc/<pid>/mem and act only on that copy, then use SECCOMP_IOCTL_NOTIF_ID_VALID to confirm the notification is still live (the target has not exited and had its PID recycled) before trusting what was read.
7. Attack Surface Analysis: What Does the Sandbox Prevent?
Threat model
Without a sandbox, a renderer exploit (e.g., a V8 type-confusion bug) gives the attacker full process privileges: arbitrary file read/write, network access, and potential kernel exploitation. With the sandbox:
| Attack vector | Without sandbox | With sandbox |
|---|---|---|
| Read /etc/shadow | Direct open() | Blocked: no open/openat in seccomp policy |
| Exfiltrate data over network | connect() to C2 server | Blocked: empty network namespace, no socket() |
| Install rootkit | init_module() | Blocked: kernel module syscalls denied |
| Pivot to other processes | ptrace() or /proc access | Blocked: PID namespace + seccomp denies ptrace |
| Exploit kernel via syscall | Any of 400+ syscalls | Only ~30-50 permitted: dramatically reduced attack surface |
Real CVEs where the sandbox mattered
CVE-2025-2783 (Chromium, March 2025): A Mojo IPC handle confusion allowed a compromised renderer to obtain privileged browser process handles. This was a sandbox escape on Windows, so it bypassed Chrome's Windows sandbox rather than the Linux seccomp+namespace boundary, but the lesson carries over. The fact that it earned a standalone CVE shows that the sandbox normally contains renderer exploits: an attacker needs a separate sandbox escape bug on top of the renderer bug.
CVE-2025-4609 (Chromium, August 2025): A flaw in Chromium's ipcz (inter-process communication) mechanism allowed a compromised renderer to gain browser process handles. Awarded a $250,000 bounty -- one of the largest in Chrome's history -- precisely because sandbox escapes are rare and high-impact.
CVE-2020-6572 (Chromium): An exploit in MediaCodecAudioDecoder allowed
sandbox escape. Google's root cause analysis documented exactly how the
attacker had to chain a renderer exploit with a sandbox escape -- two
independent vulnerabilities required for full compromise.
CVE-2023-36719 (Windows, affecting Chrome): A 20-year-old stack corruption bug in a Windows OS library was reachable from within the Chromium sandbox. This demonstrates that the sandbox boundary forces attackers to hunt for bugs in the comparatively small set of OS interfaces still reachable from inside the sandbox rather than in the vast userspace attack surface.
The numbers
| Year | Total Chrome CVEs | Sandbox escapes | Required exploit chain |
|---|---|---|---|
| 2023 | ~180 | 3-5 | Always 2+ bugs (renderer + escape) |
| 2024 | ~175 | 4-6 | Same pattern |
| 2025 | ~205 | 4-7 | Same pattern |
The key insight: the vast majority of renderer vulnerabilities are contained by the sandbox. An attacker who finds a V8 type-confusion or a use-after-free in Blink gains code execution inside the renderer process but cannot:
- Access the filesystem (no open syscalls)
- Open network connections (empty network namespace)
- Escalate to root (no privilege-related syscalls)
- Spawn new processes (no fork/exec)
They need a second, independent vulnerability to escape. This defense-in-depth strategy converts single-bug RCE into a multi-bug chain, dramatically raising the cost of exploitation.
Sandbox escape economics
Google's VRP (Vulnerability Reward Program) pays $20,000-$30,000 for renderer bugs but $100,000-$250,000+ for sandbox escapes. The price difference reflects the rarity and difficulty. On the exploit market, a full Chrome chain (renderer + sandbox escape + kernel LPE) sells for $500,000-$2,000,000+, while a renderer-only exploit without sandbox escape is worth a fraction of that.
Putting It All Together: Warmwind's Likely Stack
A Chromium-in-Docker deployment like Warmwind probably combines all of these layers:
+--------------------------------------------------+
| Docker seccomp profile (custom JSON) | Outermost: container-level
| +----------------------------------------------+|
| | Linux namespaces (user, PID, net, mnt) || Docker's isolation
| | +------------------------------------------+||
| | | Chromium Layer 1 (nested user+PID ns) ||| Chromium's own sandbox
| | | +--------------------------------------+|||
| | | | Chromium Layer 2 (seccomp-BPF) |||| baseline_policy.cc
| | | | +----------------------------------+||||
| | | | | Renderer process ||||| Runs untrusted JS
| | | | +----------------------------------+||||
| | | +--------------------------------------+|||
| | +------------------------------------------+||
| +----------------------------------------------+|
+--------------------------------------------------+
Each layer catches different failure modes. If an attacker escapes Chromium's seccomp filter, they hit Docker's seccomp filter. If they escape that, they are still in a user namespace with no capabilities on the host, provided the daemon runs rootless or with user-namespace remapping (Docker does not enable a user namespace by default). If Landlock is active on the host, even a namespace escape hits path-based restrictions.
Glossary
- seccomp_data
- Kernel structure passed to BPF filters for each syscall. Contains syscall number, architecture, instruction pointer, and six arguments.
- sock_fprog
- User-space structure wrapping a Classic BPF program (instruction count + pointer to instruction array) for seccomp filter installation.
- Classic BPF (cBPF)
- The original Berkeley Packet Filter bytecode used by seccomp. Not to be confused with eBPF, which is used for tracing and networking but is NOT used for seccomp.
- libseccomp
- High-level C library that compiles seccomp rules into optimal cBPF bytecode. Provides architecture portability and argument-level filtering.
- Zygote
- Chromium's process that pre-loads shared libraries and forks to create new renderer processes. Enables fast startup and COW memory sharing.
- Broker process
- Privileged process that performs filesystem operations on behalf of sandboxed renderers. Validates requests against an allowlist before opening files.
- Landlock
- Linux Security Module (merged 5.13) for unprivileged, stackable, path-based filesystem and network access control. Complement to seccomp.
- SECCOMP_RET_USER_NOTIF
- Seccomp action (Linux 5.0+) that forwards a syscall to a supervisor process via a notification file descriptor instead of allowing or denying it directly.
- TOCTOU
- Time-of-check-time-of-use race condition. In seccomp notify context: the gap between when a supervisor reads syscall arguments and when the kernel acts on them.
- SCM_RIGHTS
- Unix socket ancillary data type for passing file descriptors between processes. Used by Chromium's broker to deliver opened fds to sandboxed renderers.
- OCI runtime spec
- Open Container Initiative specification for container execution. Includes the linux.seccomp field where Docker injects seccomp profiles for runc to install.
- SCMP_ACT_ERRNO
- libseccomp / Docker seccomp action that causes the denied syscall to return a specified errno value instead of killing the process.