seccomp-BPF Deep Dive & Chromium Sandbox Architecture

The posting says: "Strong expertise in security and isolation (Linux kernel hardening, Docker security, Wayland sandboxing)" and "patching large codebases (e.g., Chromium)." This article goes beyond the overview in the Security & Sandboxing page. You will write a raw seccomp-BPF filter by hand, compare it to libseccomp, dissect Chromium's two-layer sandbox implementation, build Landlock policies, write Docker seccomp profiles, and understand the supervisor pattern that enables rootless containers.

1. Writing a seccomp-BPF Filter from Scratch

The seccomp_data structure

Every seccomp filter receives a struct seccomp_data from the kernel for each syscall:

struct seccomp_data {
    int   nr;                    /* syscall number */
    __u32 arch;                  /* AUDIT_ARCH_* value */
    __u64 instruction_pointer;   /* CPU instruction pointer */
    __u64 args[6];               /* syscall arguments */
};

The filter is a Classic BPF (cBPF) program that inspects this structure and returns a verdict. The kernel evaluates it for every syscall the process makes.

BPF instruction primitives

Classic BPF has four instruction classes relevant to seccomp:

Class	Purpose	Example
`BPF_LD`	Load data into accumulator	Load syscall number from `seccomp_data.nr`
`BPF_JMP`	Conditional/unconditional jump	Branch if accumulator == `__NR_open`
`BPF_RET`	Return a verdict	`SECCOMP_RET_ALLOW` or `SECCOMP_RET_KILL_PROCESS`
`BPF_ALU`	Arithmetic on accumulator	Bitwise AND for flag checks

Two macros build instructions:

BPF_STMT(code, k)          /* statement: opcode + constant */
BPF_JUMP(code, k, jt, jf)  /* jump: opcode + constant + true_offset + false_offset */

The sock_fprog structure

The filter program is wrapped in struct sock_fprog for the kernel:

struct sock_fprog {
    unsigned short      len;    /* number of BPF instructions */
    struct sock_filter *filter; /* pointer to instruction array */
};

Working example: block `open()` but allow `openat()`

This is a real, compilable C program. It installs a seccomp-BPF filter that makes open() (syscall 2 on x86_64) fail with EACCES while openat() (syscall 257) proceeds normally. The filter validates the architecture first, killing the process on a mismatch, to prevent syscall-number confusion across ABIs.

/* seccomp_block_open.c -- blocks open() but allows openat()
 * Compile: gcc -o seccomp_block_open seccomp_block_open.c
 * Run:     ./seccomp_block_open
 */
#include <errno.h>
#include <fcntl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Offset helpers for seccomp_data fields */
#define SC_OFFSET(field) (offsetof(struct seccomp_data, field))

static struct sock_filter filter[] = {
    /* [0] Load architecture from seccomp_data.arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SC_OFFSET(arch)),

    /* [1] Verify x86_64 -- kill on wrong arch to prevent ABI confusion */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] Wrong architecture: kill */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

    /* [3] Load syscall number from seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SC_OFFSET(nr)),

    /* [4] Is it open() (nr 2 on x86_64)? */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 0, 1),
    /* [5] Yes: return ERRNO(EACCES) -- denied */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),

    /* [6] Default: allow everything else (including openat) */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

int main(void) {
    struct sock_fprog prog = {
        .len    = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required: no new privileges after filter install */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl(NO_NEW_PRIVS)");
        return 1;
    }

    /* Install the seccomp filter */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl(SECCOMP)");
        return 1;
    }

    printf("Filter installed. Testing syscalls...\n");

    /* Test 1: open() should fail with EACCES */
    int fd = syscall(__NR_open, "/etc/hostname", O_RDONLY);
    if (fd < 0)
        printf("open() blocked: %m  (expected)\n");
    else {
        printf("open() succeeded unexpectedly!\n");
        close(fd);
    }

    /* Test 2: openat() should succeed */
    fd = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);
    if (fd >= 0) {
        printf("openat() allowed: success (expected)\n");
        close(fd);
    } else {
        printf("openat() failed: %m  (unexpected!)\n");
    }

    return 0;
}

Architecture validation is mandatory

Without the arch check, an attacker on a multi-arch kernel (x86_64 + i386) could use the 32-bit open() syscall number to bypass your 64-bit filter. The Chromium sandbox validates architecture in its very first BPF instruction.

How the BPF program flows

         Load arch
            |
      arch == x86_64?
       /          \
     yes           no --> KILL_PROCESS
      |
   Load syscall nr
      |
   nr == open?
    /       \
  yes        no --> ALLOW
   |
 ERRNO(EACCES)

Each BPF_JUMP specifies two offsets: jt (jump if true) and jf (jump if false), counted in instructions from the next instruction. The offsets are relative, not absolute, which makes filter construction error-prone by hand.

2. libseccomp: the Sane API

Raw BPF programming is tedious and bug-prone. libseccomp provides a high-level C API (with Python and Go bindings) that generates the BPF bytecode, including the architecture check and the jump offsets, for you.

Equivalent filter using libseccomp

/* libseccomp_block_open.c
 * Compile: gcc -o libseccomp_block_open libseccomp_block_open.c -lseccomp
 */
#include <errno.h>
#include <fcntl.h>
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strerror() */
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* Default: allow all syscalls */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL) {
        fprintf(stderr, "seccomp_init failed\n");
        return 1;
    }

    /* Block open() with EACCES */
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EACCES),
                              SCMP_SYS(open), 0);
    if (rc < 0) {
        fprintf(stderr, "seccomp_rule_add: %s\n", strerror(-rc));
        return 1;
    }

    /* Load filter into kernel */
    rc = seccomp_load(ctx);
    if (rc < 0) {
        fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
        return 1;
    }

    seccomp_release(ctx);

    /* Test: open() fails, openat() succeeds */
    int fd = syscall(__NR_open, "/etc/hostname", O_RDONLY);
    if (fd < 0)
        printf("open() blocked: %m\n");

    fd = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);
    if (fd >= 0) {
        printf("openat() allowed\n");
        close(fd);
    }

    return 0;
}

Key libseccomp functions

Function	Purpose
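`seccomp_init(default_action)`	Create filter context. The argument is the default action: `SCMP_ACT_ALLOW` for a denylist (rules name what to block) or `SCMP_ACT_KILL` for an allowlist (rules name what to permit)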
`seccomp_rule_add(ctx, action, syscall, arg_cnt, ...)`	Add rule. Optional argument comparators for fine-grained control
`seccomp_rule_add_exact(ctx, action, syscall, arg_cnt, ...)`	Like above but fails if the rule cannot be represented exactly
`seccomp_load(ctx)`	Compile to BPF and install via `seccomp(2)`, falling back to `prctl()` on kernels without that syscall
`seccomp_export_bpf(ctx, fd)`	Dump raw BPF bytecode to a file descriptor (for debugging)
`seccomp_export_pfc(ctx, fd)`	Dump human-readable pseudo-filter code
`seccomp_arch_add(ctx, arch)`	Add architecture to multi-arch filter
`seccomp_release(ctx)`	Free context

Argument-level filtering

libseccomp can filter on syscall arguments, not just syscall numbers:

/* Allow mmap() only if PROT_EXEC is NOT set (arg 2) */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 1,
    SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_EXEC, 0));

/* Blocking connect() to port 80 is not possible this way: arg 1 is a
   pointer to a sockaddr, and a seccomp filter cannot dereference pointers.
   Deep inspection needs SECCOMP_RET_USER_NOTIF and a supervisor. */

Comparison: three ways to restrict syscalls

Approach	Pros	Cons
Raw BPF (`sock_fprog`)	Zero dependencies, full control, minimal code size	Error-prone jump offsets, no portability across arches, no argument helpers
libseccomp	Architecture-portable, argument comparators, export/debug tools	Library dependency, slight abstraction overhead
systemd `SystemCallFilter=`	Declarative, no code needed, syscall groups (`@system-service`)	Only for systemd services, limited argument filtering

systemd example using predefined groups:

[Service]
SystemCallFilter=@system-service @network-io
SystemCallFilter=~@mount @reboot @swap @raw-io
SystemCallErrorNumber=EPERM

The @system-service group covers ~250 syscalls that most services need. Prefix ~ denies instead of allows.

3. Chromium's Sandbox in Detail

The two-layer architecture

Chromium on Linux uses a defense-in-depth design with two distinct sandboxing layers:

graph LR
    NS["Namespaces"] --> BPF["seccomp-BPF"]
    FS["chroot"] --> BPF
    BPF --> POL["baseline_policy.cc"]

Layer 1 (Namespace sandbox) creates resource isolation:

User namespace: Renderer runs as UID 0 inside the namespace, which maps to an unprivileged UID outside. Even if the renderer is compromised, it has no privileges on the host.
PID namespace: The renderer cannot see or signal other processes.
Network namespace: Empty network namespace with no usable interfaces, so no outbound connections and no DNS.
Mount namespace: Pivot root to a minimal filesystem.

Layer 2 (seccomp-BPF) reduces the kernel attack surface:

Filters ~350 syscalls down to ~30-50 permitted ones.
Blocks entire syscall families (filesystem, networking, IPC, module loading).
Returns ENOSYS for some denied calls (graceful degradation) and SIGSYS for others (hard crash, indicating a bug or exploit).

What is in `baseline_policy.cc`

The file sandbox/linux/seccomp-bpf-helpers/baseline_policy.cc in the Chromium source defines the base seccomp policy shared by all sandboxed processes (renderers, GPU, utilities). Process-specific policies extend it.

Explicitly allowed (always safe):

Address space: brk, mmap (with flag restrictions), mprotect, munmap, madvise (only MADV_DONTNEED, MADV_WILLNEED, MADV_NORMAL)
Scheduling: sched_yield, nanosleep, clock_nanosleep
File descriptors (no open): read, write, close, dup, fcntl, fstat, lseek
Event loops: epoll_create1, epoll_ctl, epoll_wait, poll, ppoll
Futex: futex (needed by pthreads and every allocator)
Signals: rt_sigaction, rt_sigprocmask, rt_sigreturn

Conditionally restricted:

clone: Only for creating threads (CLONE_THREAD). fork()/vfork() return EPERM (not killed) so glibc error handling works.
socketpair: Only AF_UNIX domain.
mmap/mprotect: PROT_EXEC is restricted in some configurations.
clone3, pidfd_open: Return ENOSYS to force fallback to clone().

Denied globally (SIGSYS crash):

All filesystem opens: open, openat, creat (renderers must use the broker)
Kernel modules: init_module, finit_module, delete_module
System administration: reboot, swapon, swapoff, mount, umount
Process debugging: ptrace, process_vm_readv
System V IPC: shmget, semget, msgget
Privilege changes: setuid, setgid, setgroups

The broker process pattern

Since renderers cannot call open()/openat(), how do they access files? Through a broker process:

sequenceDiagram
    participant R as Renderer (sandboxed)
    participant B as Broker (privileged)
    participant K as Kernel
    R->>B: IPC: "Open /usr/share/fonts/arial.ttf read-only"
    B->>B: Check path against allowlist
    B->>K: openat(AT_FDCWD, path, O_RDONLY)
    K-->>B: fd 7
    B->>R: Send fd 7 over Unix socket (SCM_RIGHTS), received as fd 5
    R->>K: read(5, buf, len)  -- allowed by seccomp

The broker runs outside the seccomp sandbox. It validates every request against a policy (allowlisted paths and open modes) before performing the actual openat(). The result is passed back as a file descriptor over a Unix domain socket using SCM_RIGHTS ancillary data. The renderer can then read()/write() on the fd, which seccomp allows.

The Zygote process

Chromium does not fork()+exec() for new renderers. Instead:

At startup, the Zygote process is created. It loads all shared libraries and initializes common state (ICU data, V8 snapshots, font cache).
The Zygote installs the namespace sandbox (Layer 1).
When a new renderer is needed, the browser process signals the Zygote.
The Zygote calls fork(). The child inherits all pre-loaded state.
The child installs its seccomp-BPF filter (Layer 2) and drops to the restricted policy.
The child becomes the renderer, communicating via Mojo IPC.

This design has two benefits: fast startup (no exec or library loading) and memory efficiency (COW sharing of read-only pages across renderers).

Why fork works under seccomp

The seccomp filter is installed in the child after the Zygote's fork() returns, so the Zygote itself is never subject to it. The baseline policy then blocks further fork() calls (returning EPERM), so a compromised renderer cannot spawn new processes.

4. Landlock LSM: Path-Based Access Control Without Privileges

Landlock (merged in Linux 5.13) fills a gap that seccomp cannot address: file-path-based access control without needing root or an LSM profile written by an administrator. It is stackable -- it adds restrictions on top of existing DAC, MAC, and seccomp policies.

Landlock vs. AppArmor vs. seccomp

Feature	seccomp	AppArmor	Landlock
Restricts	Syscall numbers + args	File paths, capabilities, network	File paths, network ports
Privilege required	`PR_SET_NO_NEW_PRIVS` only	Root (profile loading)	`PR_SET_NO_NEW_PRIVS` only
Stackable	Yes (multiple filters)	Single profile	Yes (multiple rulesets)
Granularity	Per-syscall	Per-path + capability	Per-path hierarchy + port
Self-sandboxing	Yes	No (admin deploys profiles)	Yes
Kernel version	3.5+ (filter mode via `prctl()`), 3.17+ (`seccomp(2)` syscall)	Mainline (2.6.36+)	5.13+ (ABI v1), 6.7+ (ABI v4 with net)

Writing a Landlock policy in C

This example restricts a process to read-only access under /usr and read-write under /tmp, with TCP connections only to port 443:

/* landlock_sandbox.c
 * Compile: gcc -o landlock_sandbox landlock_sandbox.c
 * Requires: Linux 6.7+ for network rules, 5.13+ for filesystem only
 */
#define _GNU_SOURCE   /* O_PATH is only exposed by glibc with this defined */
#include <fcntl.h>
#include <linux/landlock.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wrapper functions -- landlock syscalls have no glibc wrappers yet */
static int landlock_create_ruleset(
    const struct landlock_ruleset_attr *attr, size_t size, __u32 flags) {
    return syscall(__NR_landlock_create_ruleset, attr, size, flags);
}

static int landlock_add_rule(
    int ruleset_fd, enum landlock_rule_type type,
    const void *attr, __u32 flags) {
    return syscall(__NR_landlock_add_rule, ruleset_fd, type, attr, flags);
}

static int landlock_restrict_self(int ruleset_fd, __u32 flags) {
    return syscall(__NR_landlock_restrict_self, ruleset_fd, flags);
}

int main(void) {
    /* Step 1: Check ABI version */
    int abi = landlock_create_ruleset(NULL, 0,
                                      LANDLOCK_CREATE_RULESET_VERSION);
    if (abi < 0) {
        perror("Landlock not supported on this kernel");
        return 1;
    }
    printf("Landlock ABI version: %d\n", abi);

    /* Step 2: Declare what access types we handle */
    struct landlock_ruleset_attr ruleset_attr = {
        .handled_access_fs =
            LANDLOCK_ACCESS_FS_EXECUTE |
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR |
            LANDLOCK_ACCESS_FS_WRITE_FILE |
            LANDLOCK_ACCESS_FS_REMOVE_FILE |
            LANDLOCK_ACCESS_FS_REMOVE_DIR |
            LANDLOCK_ACCESS_FS_MAKE_REG |
            LANDLOCK_ACCESS_FS_MAKE_DIR,
        .handled_access_net =
            LANDLOCK_ACCESS_NET_CONNECT_TCP,
    };

    /* Downgrade gracefully for older ABI versions */
    if (abi < 4)
        ruleset_attr.handled_access_net = 0;

    int ruleset_fd = landlock_create_ruleset(
        &ruleset_attr, sizeof(ruleset_attr), 0);
    if (ruleset_fd < 0) {
        perror("landlock_create_ruleset");
        return 1;
    }

    /* Step 3: Add filesystem rules */

    /* /usr: read + execute only */
    struct landlock_path_beneath_attr usr_rule = {
        .allowed_access =
            LANDLOCK_ACCESS_FS_EXECUTE |
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR,
        .parent_fd = open("/usr", O_PATH | O_CLOEXEC),
    };
    if (usr_rule.parent_fd < 0 ||
        landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
                          &usr_rule, 0)) {
        perror("Failed to add /usr rule");
        return 1;
    }
    close(usr_rule.parent_fd);

    /* /tmp: read + write */
    struct landlock_path_beneath_attr tmp_rule = {
        .allowed_access =
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR |
            LANDLOCK_ACCESS_FS_WRITE_FILE |
            LANDLOCK_ACCESS_FS_MAKE_REG |
            LANDLOCK_ACCESS_FS_REMOVE_FILE,
        .parent_fd = open("/tmp", O_PATH | O_CLOEXEC),
    };
    if (tmp_rule.parent_fd < 0 ||
        landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
                          &tmp_rule, 0)) {
        perror("Failed to add /tmp rule");
        return 1;
    }
    close(tmp_rule.parent_fd);

    /* Step 4: Add network rule -- only port 443 */
    if (abi >= 4) {
        struct landlock_net_port_attr net_rule = {
            .allowed_access = LANDLOCK_ACCESS_NET_CONNECT_TCP,
            .port = 443,
        };
        if (landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
                              &net_rule, 0)) {
            perror("Failed to add network rule");
            return 1;
        }
    }

    /* Step 5: Enforce -- no way to remove restrictions after this */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl(NO_NEW_PRIVS)");
        return 1;
    }
    if (landlock_restrict_self(ruleset_fd, 0)) {
        perror("landlock_restrict_self");
        return 1;
    }
    close(ruleset_fd);

    printf("Landlock sandbox active.\n");

    /* Test: writing to /home should fail */
    FILE *f = fopen("/home/test.txt", "w");
    if (f == NULL)
        printf("/home write blocked: %m  (expected)\n");
    else {
        fclose(f);
        printf("/home write succeeded (unexpected!)\n");
    }

    /* Test: reading /usr/bin/ls should succeed */
    f = fopen("/usr/bin/ls", "r");
    if (f) {
        printf("/usr/bin/ls readable (expected)\n");
        fclose(f);
    }

    return 0;
}

Landlock is permanent and one-way

Once landlock_restrict_self() is called, the restrictions cannot be removed. Additional landlock_restrict_self() calls can only add more restrictions. Installed seccomp filters and the no_new_privs flag set by PR_SET_NO_NEW_PRIVS are irreversible in the same way.

5. seccomp + Containers: Docker Profiles

How Docker applies seccomp

When you run docker run, the container runtime (runc) installs a seccomp-BPF filter before exec'ing the container entrypoint. The flow:

Docker daemon reads the seccomp profile (default or custom).
The OCI runtime spec includes the profile in linux.seccomp.
runc translates the JSON profile into libseccomp calls.
libseccomp compiles to BPF bytecode and installs via seccomp(2).

The default Docker seccomp profile

Docker's default profile uses an allowlist strategy:

defaultAction: SCMP_ACT_ERRNO (deny by default)
Allows ~270 of ~400+ syscalls that most applications need
Blocks ~44 dangerous syscalls including:

Category	Blocked syscalls	Why
Kernel modules	`init_module`, `finit_module`, `delete_module`	Load kernel code
Namespace escape	`unshare`, `setns` (conditional)	Break container isolation
Reboot/power	`reboot`, `kexec_load`	Crash the host
Device access	`mknod`	Create device nodes
Clock	`clock_settime`, `settimeofday`	Affect host timekeeping
Raw I/O	`ioperm`, `iopl`	Direct port access
Tracing	`ptrace`	Debug/inspect other processes
Mount	`mount`, `umount2`	Modify filesystem topology
Swap	`swapon`, `swapoff`	Affect host memory

Writing a custom profile for a Chromium container

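A Chromium container (the kind Warmwind likely runs) needs a profile that is stricter than Docker's default but permits the syscalls Chromium's own sandbox requires. Key additions: clone with namespace flags (Chromium creates its own namespaces) and seccomp (Chromium installs its own nested seccomp filter). The profile below is abridged for readability: a working profile must also allow execve (runc installs the filter before exec'ing the entrypoint), along with whatever else tracing the real workload shows it needs.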

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": [
        "read", "write", "close", "fstat", "lseek", "mmap",
        "mprotect", "munmap", "brk", "pread64",
        "pwrite64", "readv", "writev", "access", "pipe",
        "select", "sched_yield", "mremap", "mincore",
        "madvise", "dup", "dup2", "nanosleep",
        "getpid", "getuid", "getgid", "geteuid", "getegid",
        "getppid", "getpgrp", "setsid", "gettid",
        "sendmsg", "recvmsg", "sendto", "recvfrom",
        "socket", "connect", "bind", "listen", "accept4",
        "getsockname", "getpeername", "getsockopt", "setsockopt",
        "socketpair", "shutdown",
        "fcntl", "flock", "openat", "getdents64",
        "fstatfs", "fadvise64", "clock_gettime",
        "clock_getres", "clock_nanosleep",
        "exit_group", "epoll_wait", "epoll_ctl",
        "epoll_create1", "eventfd2", "timerfd_create",
        "timerfd_settime", "timerfd_gettime",
        "signalfd4", "poll", "ppoll",
        "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "futex", "set_robust_list", "get_robust_list",
        "prctl", "arch_prctl", "set_tid_address",
        "restart_syscall", "getrandom", "memfd_create",
        "copy_file_range", "statx", "rseq",
        "prlimit64", "pipe2", "membarrier"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "Chromium needs clone with namespace flags for its own sandbox",
      "names": ["clone", "clone3"],
      "action": "SCMP_ACT_ALLOW",
      "args": []
    },
    {
      "comment": "Allow Chromium to install its nested seccomp filter",
      "names": ["seccomp"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "Chromium creates user/PID/net namespaces for renderers",
      "names": ["unshare"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "GPU process needs DRI ioctls",
      "names": ["ioctl"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "Needed for shared memory (Wayland buffers, IPC)",
      "names": ["shmget", "shmat", "shmctl", "shmdt"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Apply it:

docker run --rm \
    --security-opt seccomp=/path/to/chromium-seccomp.json \
    --security-opt no-new-privileges \
    --cap-drop=ALL \
    --cap-add=SYS_ADMIN \
    warmwind/chromium:latest

The SYS_ADMIN trap

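--cap-add=SYS_ADMIN is needed because Chromium uses clone() with CLONE_NEWUSER and CLONE_NEWPID, but it hands the container one of the broadest capabilities in the kernel. A common alternative in containerized browser deployments is to run Chromium with --no-sandbox and let Docker's own namespace isolation substitute for Chromium's Layer 1, with the seccomp profile above standing in for Layer 2. The trade-off is that --no-sandbox also turns off Chromium's own seccomp-BPF filter, so the container boundary becomes the only barrier between a compromised renderer and everything else in the container.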

6. SECCOMP_RET_USER_NOTIF: The Supervisor Pattern

Added in Linux 5.0, SECCOMP_RET_USER_NOTIF enables a fundamentally different security model: instead of simply allowing or denying syscalls, a privileged supervisor process intercepts them and decides what to do.

How it works

sequenceDiagram
    participant T as Target (sandboxed)
    participant K as Kernel
    participant S as Supervisor (privileged)

    T->>K: mount("/dev/sda1", "/mnt", "ext4", ...)
    K->>K: seccomp filter returns RET_USER_NOTIF
    K-->>T: Thread blocked
    K->>S: Notification on seccomp fd (readable)
    S->>K: SECCOMP_IOCTL_NOTIF_RECV
    K-->>S: struct seccomp_notif {id, pid, syscall_data}
    S->>S: Validate request, read /proc/pid/mem for args
    S->>S: Perform mount() on behalf of target
    S->>K: SECCOMP_IOCTL_NOTIF_SEND {id, val=0, error=0}
    K-->>T: mount() returns 0 (success)

The notification structures

/* Received by supervisor when sandboxed process hits RET_USER_NOTIF */
struct seccomp_notif {
    __u64 id;                   /* unique request ID */
    __u32 pid;                  /* PID of the sandboxed process */
    __u32 flags;
    struct seccomp_data data;   /* syscall nr, arch, args[6] */
};

/* Sent back by supervisor */
struct seccomp_notif_resp {
    __u64 id;                   /* must match the request */
    __s64 val;                  /* syscall return value */
    __s32 error;                /* errno (negative) or 0 */
    __u32 flags;                /* SECCOMP_USER_NOTIF_FLAG_CONTINUE */
};

Container runtime use: rootless containers

Rootless containers (Podman, rootless Docker, LXC unprivileged) cannot perform privileged operations like mount() or mknod(). The supervisor pattern solves this:

Syscall	Supervisor action
`mount("proc", ...)`	Supervisor mounts procfs in the container's mount namespace
`mknod("/dev/fuse", ...)`	Supervisor validates the device type and creates the node inside the container on the target's behalf
`mount("overlay", ...)`	Supervisor mounts overlay or delegates to FUSE
`connect(AF_VSOCK, ...)`	Supervisor proxies the connection

LXD's implementation runs a dedicated goroutine per container as the syscall supervisor. It listens on the seccomp notify fd using epoll() and processes requests through a validation pipeline.

The TOCTOU danger

SECCOMP_USER_NOTIF_FLAG_CONTINUE is dangerous

When the supervisor sets SECCOMP_USER_NOTIF_FLAG_CONTINUE, the kernel executes the original syscall. But between the time the supervisor inspected the arguments and the time the kernel acts, the sandboxed process (or another thread) can rewrite the syscall arguments in memory. This is a classic TOCTOU (time-of-check-time-of-use) race.

Safe supervisor designs either:

Emulate the syscall entirely (never use FLAG_CONTINUE), or
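Copy pointer arguments out of /proc/<pid>/mem into supervisor-owned memory, call SECCOMP_IOCTL_NOTIF_ID_VALID afterwards to confirm the notification is still live (the PID has not been recycled), and act only on that private copy. FLAG_CONTINUE is then reserved for decisions that do not depend on anything the target can still rewrite.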

7. Attack Surface Analysis: What Does the Sandbox Prevent?

Threat model

Without a sandbox, a renderer exploit (e.g., a V8 type-confusion bug) gives the attacker full process privileges: arbitrary file read/write, network access, and potential kernel exploitation. With the sandbox:

Attack vector	Without sandbox	With sandbox
Read `/etc/shadow`	Direct `open()`	Blocked: no `open`/`openat` in seccomp policy
Exfiltrate data over network	`connect()` to C2 server	Blocked: empty network namespace, no `socket()`
Install rootkit	`init_module()`	Blocked: kernel module syscalls denied
Pivot to other processes	`ptrace()` or `/proc` access	Blocked: PID namespace + seccomp denies `ptrace`
Exploit kernel via syscall	Any of 400+ syscalls	Only ~30-50 permitted: dramatically reduced attack surface

Real CVEs where the sandbox mattered

CVE-2025-2783 (Chromium, March 2025): A Mojo IPC handle confusion allowed a compromised renderer to obtain privileged browser process handles. This was a sandbox escape -- it bypassed the renderer sandbox boundary (the bug was in Mojo on Windows, where the sandbox is built from Windows primitives rather than seccomp and namespaces). The fact that this earned a standalone CVE demonstrates that normally the sandbox contains renderer exploits. An attacker needs a separate sandbox escape bug on top of the renderer bug.

CVE-2025-4609 (Chromium, August 2025): A flaw in Chromium's ipcz (inter-process communication) mechanism allowed a compromised renderer to gain browser process handles. Awarded a $250,000 bounty -- one of the largest in Chrome's history -- precisely because sandbox escapes are rare and high-impact.

CVE-2020-6572 (Chromium): An exploit in MediaCodecAudioDecoder allowed sandbox escape. Google's root cause analysis documented exactly how the attacker had to chain a renderer exploit with a sandbox escape -- two independent vulnerabilities required for full compromise.

CVE-2023-36719 (Windows, affecting Chrome): A 20-year-old stack corruption bug in a Windows OS library was reachable from within the Chromium sandbox. This demonstrates that the sandbox boundary forces attackers to find bugs in the comparatively small set of OS interfaces still reachable from inside it, rather than in the vast userspace attack surface.

The numbers

Year	Total Chrome CVEs	Sandbox escapes	Required exploit chain
2023	~180	3-5	Always 2+ bugs (renderer + escape)
2024	~175	4-6	Same pattern
2025	~205	4-7	Same pattern

The key insight: the vast majority of renderer vulnerabilities are contained by the sandbox. An attacker who finds a V8 type-confusion or a use-after-free in Blink gains code execution inside the renderer process but cannot:

Access the filesystem (no open syscalls)
Open network connections (empty network namespace)
Escalate to root (no privilege-related syscalls)
Spawn new processes (no fork/exec)

They need a second, independent vulnerability to escape. This defense-in-depth strategy converts single-bug RCE into a multi-bug chain, dramatically raising the cost of exploitation.

Sandbox escape economics

Google's VRP (Vulnerability Reward Program) pays $20,000-$30,000 for renderer bugs but $100,000-$250,000+ for sandbox escapes. The price difference reflects the rarity and difficulty. On the exploit market, a full Chrome chain (renderer + sandbox escape + kernel LPE) sells for $500,000-$2,000,000+, while a renderer-only exploit without sandbox escape is worth a fraction of that.

Putting It All Together: Warmwind's Likely Stack

A Chromium-in-Docker deployment like Warmwind probably combines all of these layers:

+--------------------------------------------------+
| Docker seccomp profile (custom JSON)             |  Outermost: container-level
|  +----------------------------------------------+|
|  | Linux namespaces (user, PID, net, mnt)       ||  Docker's isolation
|  |  +------------------------------------------+||
|  |  | Chromium Layer 1 (nested user+PID ns)    |||  Chromium's own sandbox
|  |  |  +--------------------------------------+|||
|  |  |  | Chromium Layer 2 (seccomp-BPF)       ||||  baseline_policy.cc
|  |  |  |  +----------------------------------+||||
|  |  |  |  | Renderer process                 |||||  Runs untrusted JS
|  |  |  |  +----------------------------------+||||
|  |  |  +--------------------------------------+|||
|  |  +------------------------------------------+||
|  +----------------------------------------------+|
+--------------------------------------------------+

Each layer catches different failure modes. If an attacker escapes Chromium's seccomp filter, they hit Docker's seccomp filter. If they escape that, they are still in a user namespace with no capabilities on the host. If Landlock is active on the host, even a namespace escape hits path-based restrictions.

Glossary

seccomp_data: Kernel structure passed to BPF filters for each syscall. Contains syscall number, architecture, instruction pointer, and six arguments.
sock_fprog: User-space structure wrapping a Classic BPF program (instruction count + pointer to instruction array) for seccomp filter installation.
Classic BPF (cBPF): The original Berkeley Packet Filter bytecode used by seccomp. Not to be confused with eBPF, which is used for tracing and networking but is NOT used for seccomp.
libseccomp: High-level C library that compiles seccomp rules into cBPF bytecode. Provides architecture portability and argument-level filtering.
Zygote: Chromium's process that pre-loads shared libraries and forks to create new renderer processes. Enables fast startup and COW memory sharing.
Broker process: Privileged process that performs filesystem operations on behalf of sandboxed renderers. Validates requests against an allowlist before opening files.
Landlock: Linux Security Module (merged 5.13) for unprivileged, stackable, path-based filesystem and network access control. Complement to seccomp.
SECCOMP_RET_USER_NOTIF: Seccomp action (Linux 5.0+) that forwards a syscall to a supervisor process via a notification file descriptor instead of allowing or denying it directly.
TOCTOU: Time-of-check-time-of-use race condition. In seccomp notify context: the gap between when a supervisor reads syscall arguments and when the kernel acts on them.
SCM_RIGHTS: Unix socket ancillary data type for passing file descriptors between processes. Used by Chromium's broker to deliver opened fds to sandboxed renderers.
OCI runtime spec: Open Container Initiative specification for container execution. Includes linux.seccomp field where Docker injects seccomp profiles for runc to install.
SCMP_ACT_ERRNO: libseccomp / Docker seccomp action that causes the denied syscall to return a specified errno value instead of killing the process.