seccomp-BPF Deep Dive & Chromium Sandbox Architecture

The posting says: "Strong expertise in security and isolation (Linux kernel hardening, Docker security, Wayland sandboxing)" and "patching large codebases (e.g., Chromium)." This article goes beyond the overview in the Security & Sandboxing page. You will write a raw seccomp-BPF filter by hand, compare it to libseccomp, dissect Chromium's two-layer sandbox implementation, build Landlock policies, write Docker seccomp profiles, and understand the supervisor pattern that enables rootless containers.


1. Writing a seccomp-BPF Filter from Scratch

The seccomp_data structure

Every seccomp filter receives a struct seccomp_data from the kernel for each syscall:

struct seccomp_data {
    int   nr;                    /* syscall number */
    __u32 arch;                  /* AUDIT_ARCH_* value */
    __u64 instruction_pointer;   /* CPU instruction pointer */
    __u64 args[6];               /* syscall arguments */
};

The filter is a Classic BPF (cBPF) program that inspects this structure and returns a verdict. The kernel evaluates it for every syscall the process makes.
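One practical consequence of the layout above: cBPF loads 32 bits at a time, so a filter that inspects a 64-bit argument has to load it as two halves. The helpers below are a small sketch of the offset arithmetic; the names arg_lo_offset and arg_hi_offset are made up for illustration and are not part of any kernel header.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the least-significant 32 bits of args[i].
 * On little-endian hosts the low half comes first in memory. */
static inline uint32_t arg_lo_offset(unsigned i) {
    uint32_t base = (uint32_t)(offsetof(struct seccomp_data, args) +
                               i * sizeof(uint64_t));
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return base + 4;
#else
    return base;
#endif
}

/* Byte offset of the most-significant 32 bits of args[i]. */
static inline uint32_t arg_hi_offset(unsigned i) {
    uint32_t base = (uint32_t)(offsetof(struct seccomp_data, args) +
                               i * sizeof(uint64_t));
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return base;
#else
    return base + 4;
#endif
}
```

A filter would pass these values as the k operand of BPF_LD | BPF_W | BPF_ABS, checking the high half as well whenever the value being compared does not fit in 32 bits.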

BPF instruction primitives

Classic BPF has four instruction classes relevant to seccomp:

| Class | Purpose | Example |
| --- | --- | --- |
| BPF_LD | Load data into accumulator | Load syscall number from seccomp_data.nr |
| BPF_JMP | Conditional/unconditional jump | Branch if accumulator == __NR_open |
| BPF_RET | Return a verdict | SECCOMP_RET_ALLOW or SECCOMP_RET_KILL_PROCESS |
| BPF_ALU | Arithmetic on accumulator | Bitwise AND for flag checks |

Two macros build instructions:

BPF_STMT(code, k)          /* statement: opcode + constant */
BPF_JUMP(code, k, jt, jf)  /* jump: opcode + constant + true_offset + false_offset */

The sock_fprog structure

The filter program is wrapped in struct sock_fprog for the kernel:

struct sock_fprog {
    unsigned short      len;    /* number of BPF instructions */
    struct sock_filter *filter; /* pointer to instruction array */
};
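To make the two macros and the wrapper concrete, here is a minimal sketch of what they produce. Each macro expands to a brace initializer for struct sock_filter, whose fields are code, jt, jf and k. The allow_all program and the make_prog helper are illustrative names, not kernel API.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Smallest useful program: load the syscall number, then allow.
 * (A real filter would check seccomp_data.arch first.) */
static struct sock_filter allow_all[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Wrap an instruction array in the structure the kernel expects. */
static struct sock_fprog make_prog(struct sock_filter *insns, size_t count) {
    struct sock_fprog prog = {
        .len    = (unsigned short)count,
        .filter = insns,
    };
    return prog;
}
```

BPF_STMT always leaves jt and jf at zero; only BPF_JUMP fills them in. Computing len from sizeof, as the working example below does, avoids the classic mistake of adding an instruction and forgetting to update the count.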

Working example: block open() but allow openat()

The program below compiles as-is on x86_64 Linux. It installs a seccomp-BPF filter that makes open() (syscall 2 on x86_64) fail with EACCES while openat() (syscall 257) proceeds normally. The filter validates the architecture first, killing the process on a mismatch, to prevent syscall-number confusion across ABIs.

/* seccomp_block_open.c -- blocks open() but allows openat()
 * Compile: gcc -o seccomp_block_open seccomp_block_open.c
 * Run:     ./seccomp_block_open
 */
#include <errno.h>
#include <fcntl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Offset helpers for seccomp_data fields */
#define SC_OFFSET(field) (offsetof(struct seccomp_data, field))

static struct sock_filter filter[] = {
    /* [0] Load architecture from seccomp_data.arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SC_OFFSET(arch)),

    /* [1] Verify x86_64 -- kill on wrong arch to prevent ABI confusion */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] Wrong architecture: kill */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

    /* [3] Load syscall number from seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SC_OFFSET(nr)),

    /* [4] Is it open() (nr 2 on x86_64)? */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 0, 1),
    /* [5] Yes: return ERRNO(EACCES) -- denied */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),

    /* [6] Default: allow everything else (including openat) */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

int main(void) {
    struct sock_fprog prog = {
        .len    = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required before installing a filter unless the caller has CAP_SYS_ADMIN */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl(NO_NEW_PRIVS)");
        return 1;
    }

    /* Install the seccomp filter */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl(SECCOMP)");
        return 1;
    }

    printf("Filter installed. Testing syscalls...\n");

    /* Test 1: open() should fail with EACCES */
    int fd = syscall(__NR_open, "/etc/hostname", O_RDONLY);
    if (fd < 0)
        printf("open() blocked: %m  (expected)\n");
    else {
        printf("open() succeeded unexpectedly!\n");
        close(fd);
    }

    /* Test 2: openat() should succeed */
    fd = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);
    if (fd >= 0) {
        printf("openat() allowed: success (expected)\n");
        close(fd);
    } else {
        printf("openat() failed: %m  (unexpected!)\n");
    }

    return 0;
}

Architecture validation is mandatory

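Without the arch check, an attacker on a multi-arch kernel (x86_64 + i386) could enter the kernel through the 32-bit ABI, where the numbers mean different things (on i386, nr 2 is fork and open is nr 5), and slip past a filter written against 64-bit numbers. The x32 ABI needs separate handling: it also reports AUDIT_ARCH_X86_64 and is distinguished only by __X32_SYSCALL_BIT (0x40000000) being set in nr, so a denylist like the one above should additionally reject syscall numbers at or above that bit. The Chromium sandbox validates architecture at the very start of its BPF program.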

How the BPF program flows

         Load arch
            |
      arch == x86_64?
       /          \
     yes           no --> KILL_PROCESS
      |
   Load syscall nr
      |
   nr == open?
    /       \
  yes        no --> ALLOW
   |
 ERRNO(EACCES)

Each BPF_JUMP specifies two offsets: jt (jump if true) and jf (jump if false), counted in instructions from the next instruction. The offsets are relative, not absolute, which makes filter construction error-prone by hand.
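One way to check relative offsets without installing anything is to step through the program in userspace. The sketch below re-declares the same seven instructions and runs them through a deliberately tiny evaluator that understands only the three opcodes this filter uses. run_filter is an illustrative helper, not a kernel or libseccomp function, and the x86_64 syscall numbers are hard-coded so the sketch builds on any host.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_64_NR_OPEN   2
#define X86_64_NR_OPENAT 257

/* Same instructions as the working example above. */
static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, X86_64_NR_OPEN, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
#define DEMO_LEN (sizeof(demo_filter) / sizeof(demo_filter[0]))

/* Evaluate a filter against one seccomp_data record.
 * Supports only LD|W|ABS, JMP|JEQ|K and RET|K; anything else,
 * or running off the end, is treated as KILL_PROCESS. */
static uint32_t run_filter(const struct sock_filter *f, size_t len,
                           const struct seccomp_data *sd) {
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &f[pc];
        switch (in->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (in->k + sizeof(acc) > sizeof(*sd))
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const char *)sd + in->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Offsets count from the NEXT instruction; the loop's
             * pc++ supplies the "+1". */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:
            return in->k;
        default:
            return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;
}
```

Feeding it a record with nr set to 2 and arch set to AUDIT_ARCH_X86_64 follows the left-hand path of the diagram. Changing a single jt or jf value and re-running is a quick way to see how easily a hand-written offset lands on the wrong verdict.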


2. libseccomp: the Sane API

Raw BPF programming is tedious and bug-prone. libseccomp provides a high-level C API (with Python and Go bindings) that compiles rules to BPF bytecode and handles architecture checks and jump offsets for you.

Equivalent filter using libseccomp

/* libseccomp_block_open.c
 * Compile: gcc -o libseccomp_block_open libseccomp_block_open.c -lseccomp
 */
#include <errno.h>
#include <fcntl.h>
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strerror() */
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* Default: allow all syscalls */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL) {
        fprintf(stderr, "seccomp_init failed\n");
        return 1;
    }

    /* Block open() with EACCES */
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EACCES),
                              SCMP_SYS(open), 0);
    if (rc < 0) {
        fprintf(stderr, "seccomp_rule_add: %s\n", strerror(-rc));
        return 1;
    }

    /* Load filter into kernel */
    rc = seccomp_load(ctx);
    if (rc < 0) {
        fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
        return 1;
    }

    seccomp_release(ctx);

    /* Test: open() fails, openat() succeeds */
    int fd = syscall(__NR_open, "/etc/hostname", O_RDONLY);
    if (fd < 0)
        printf("open() blocked: %m\n");

    fd = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);
    if (fd >= 0) {
        printf("openat() allowed\n");
        close(fd);
    }

    return 0;
}

Key libseccomp functions

| Function | Purpose |
| --- | --- |
| seccomp_init(default_action) | Create filter context. A default of SCMP_ACT_ALLOW gives a denylist (rules name what to block); a default of SCMP_ACT_KILL or SCMP_ACT_ERRNO gives an allowlist (rules name what to permit) |
| seccomp_rule_add(ctx, action, syscall, arg_cnt, ...) | Add rule. Optional argument comparators for fine-grained control |
| seccomp_rule_add_exact(ctx, action, syscall, arg_cnt, ...) | Like above but fails if the rule cannot be represented exactly |
| seccomp_load(ctx) | Compile to BPF and install it in the kernel (seccomp(2) where available, otherwise prctl()) |
| seccomp_export_bpf(ctx, fd) | Dump raw BPF bytecode to a file descriptor (for debugging) |
| seccomp_export_pfc(ctx, fd) | Dump human-readable pseudo-filter code |
| seccomp_arch_add(ctx, arch) | Add architecture to multi-arch filter |
| seccomp_release(ctx) | Free context |

Argument-level filtering

libseccomp can filter on syscall arguments, not just syscall numbers:

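/* Allow mmap() only if PROT_EXEC is NOT set in prot (arg 2, counting
   from 0). Meaningful only when the context's default action denies. */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 1,
    SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_EXEC, 0));

/* Blocking connect() to port 80 is NOT possible this way: arg 1 is a
   pointer to a sockaddr, and a BPF filter cannot dereference pointers.
   Deep inspection needs SECCOMP_RET_USER_NOTIF (see section 6). */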
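For comparison, this is roughly what a masked-equality test looks like in raw cBPF. The fragment is a sketch of the mmap branch only (no architecture or syscall-number check) and is not the bytecode libseccomp actually emits. It assumes a little-endian host, so the low 32 bits of args[2] sit at the start of the slot. The eval helper is a toy evaluator included only so the fragment can be exercised in userspace.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Low 32 bits of args[2] (prot) on a little-endian host. */
#define PROT_ARG_LO \
    (offsetof(struct seccomp_data, args) + 2 * sizeof(uint64_t))

static const struct sock_filter mmap_branch[] = {
    /* [0] A = low word of prot */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, PROT_ARG_LO),
    /* [1] A &= PROT_EXEC */
    BPF_STMT(BPF_ALU | BPF_AND | BPF_K, PROT_EXEC),
    /* [2] A == 0 ? fall through to allow : skip to deny */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
    /* [3] PROT_EXEC not requested */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [4] PROT_EXEC requested */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
};
#define MMAP_BRANCH_LEN (sizeof(mmap_branch) / sizeof(mmap_branch[0]))

/* Toy evaluator covering just the four opcodes used above. */
static uint32_t eval(const struct sock_filter *f, size_t len,
                     const struct seccomp_data *sd) {
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        switch (f[pc].code) {
        case BPF_LD | BPF_W | BPF_ABS:
            memcpy(&acc, (const char *)sd + f[pc].k, sizeof(acc));
            break;
        case BPF_ALU | BPF_AND | BPF_K:
            acc &= f[pc].k;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            pc += (acc == f[pc].k) ? f[pc].jt : f[pc].jf;
            break;
        case BPF_RET | BPF_K:
            return f[pc].k;
        default:
            return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;
}
```

Because PROT_EXEC fits in the low word and the expected value is zero, the high word needs no check here. A general 64-bit masked comparison has to test both halves, which is part of the bookkeeping libseccomp takes off your hands.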

Comparison: three ways to restrict syscalls

| Approach | Pros | Cons |
| --- | --- | --- |
| Raw BPF (sock_fprog) | Zero dependencies, full control, minimal code size | Error-prone jump offsets, no portability across arches, no argument helpers |
| libseccomp | Architecture-portable, argument comparators, export/debug tools | Library dependency, slight abstraction overhead |
| systemd SystemCallFilter= | Declarative, no code needed, syscall groups (@system-service) | Only for systemd services, limited argument filtering |

systemd example using predefined groups:

[Service]
SystemCallFilter=@system-service @network-io
SystemCallFilter=~@mount @reboot @swap @raw-io
SystemCallErrorNumber=EPERM

The @system-service group covers ~250 syscalls that most services need. Prefix ~ denies instead of allows.


3. Chromium's Sandbox in Detail

The two-layer architecture

Chromium on Linux uses a defense-in-depth design with two distinct sandboxing layers:

graph LR
    NS["Namespaces"] --> BPF["seccomp-BPF"]
    FS["chroot"] --> BPF
    BPF --> POL["baseline_policy.cc"]

Layer 1 (Namespace sandbox) creates resource isolation:

  • User namespace: Renderer runs as UID 0 inside the namespace, which maps to an unprivileged UID outside. Even if the renderer is compromised, it has no privileges on the host.
  • PID namespace: The renderer cannot see or signal other processes.
  • Network namespace: Empty network namespace -- no sockets, no DNS.
  • Mount namespace: Pivot root to a minimal filesystem.

Layer 2 (seccomp-BPF) reduces the kernel attack surface:

  • Filters ~350 syscalls down to ~30-50 permitted ones.
  • Blocks entire syscall families (filesystem, networking, IPC, module loading).
  • Returns ENOSYS for some denied calls (graceful degradation) and SIGSYS for others (hard crash, indicating a bug or exploit).

What is in baseline_policy.cc

The file sandbox/linux/seccomp-bpf-helpers/baseline_policy.cc in the Chromium source defines the base seccomp policy shared by all sandboxed processes (renderers, GPU, utilities). Process-specific policies extend it.

Explicitly allowed (always safe):

  • Address space: brk, mmap (with flag restrictions), mprotect, munmap, madvise (only MADV_DONTNEED, MADV_WILLNEED, MADV_NORMAL)
  • Scheduling: sched_yield, nanosleep, clock_nanosleep
  • File descriptors (no open): read, write, close, dup, fcntl, fstat, lseek
  • Event loops: epoll_create1, epoll_ctl, epoll_wait, poll, ppoll
  • Futex: futex (needed by pthreads and every allocator)
  • Signals: rt_sigaction, rt_sigprocmask, rt_sigreturn

Conditionally restricted:

  • clone: Only for creating threads (CLONE_THREAD). fork()/vfork() return EPERM (not killed) so glibc error handling works.
  • socketpair: Only AF_UNIX domain.
  • mmap/mprotect: PROT_EXEC is restricted in some configurations.
  • clone3, pidfd_open: Return ENOSYS to force fallback to clone().

Denied globally (SIGSYS crash):

  • All filesystem opens: open, openat, creat (renderers must use the broker)
  • Kernel modules: init_module, finit_module, delete_module
  • System administration: reboot, swapon, swapoff, mount, umount
  • Process debugging: ptrace, process_vm_readv
  • System V IPC: shmget, semget, msgget
  • Privilege changes: setuid, setgid, setgroups

The broker process pattern

Since renderers cannot call open()/openat(), how do they access files? Through a broker process:

sequenceDiagram
    participant R as Renderer (sandboxed)
    participant B as Broker (privileged)
    participant K as Kernel
    R->>B: IPC: "Open /usr/share/fonts/arial.ttf read-only"
    B->>B: Check path against allowlist
    B->>K: openat(AT_FDCWD, path, O_RDONLY)
    K-->>B: fd 7
    B->>R: Send fd 7 over Unix socket (SCM_RIGHTS)
    R->>K: read(7, buf, len)  -- allowed by seccomp

The broker runs outside the seccomp sandbox. It validates every request against a policy (allowlisted paths and open modes) before performing the actual openat(). The result is passed back as a file descriptor over a Unix domain socket using SCM_RIGHTS ancillary data. The renderer can then read()/write() on the fd, which seccomp allows.
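The fd-passing step can be sketched in a few lines of C. This is a generic SCM_RIGHTS example, not Chromium's broker code; send_fd and recv_fd are illustrative names, and the path allowlist check that a real broker performs before opening anything is left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd) {
    char byte = 'F';  /* at least one byte of ordinary payload is required */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;                  /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new fd, or -1 on error. */
static int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS || cm->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

The receiver gets a fresh descriptor number that refers to the same open file description, which is why the renderer can read() from it even though it could never have opened the path itself.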

The Zygote process

Chromium does not fork()+exec() for new renderers. Instead:

  1. At startup, the Zygote process is created. It loads all shared libraries and initializes common state (ICU data, V8 snapshots, font cache).
  2. The Zygote installs the namespace sandbox (Layer 1).
  3. When a new renderer is needed, the browser process signals the Zygote.
  4. The Zygote calls fork(). The child inherits all pre-loaded state.
  5. The child installs its seccomp-BPF filter (Layer 2) and drops to the restricted policy.
  6. The child becomes the renderer, communicating via Mojo IPC.

This design has two benefits: fast startup (no exec or library loading) and memory efficiency (COW sharing of read-only pages across renderers).

Why fork works under seccomp

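The Zygote itself never runs under the renderer's seccomp policy, so its own fork() calls are unrestricted. Each child installs the filter only after fork() has returned. From then on the baseline policy makes further fork() calls fail with EPERM, so a compromised renderer cannot spawn new processes.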
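The ordering can be sketched as follows. This is a simplified illustration, not Chromium code. It skips the namespace layer, covers only x86_64 and aarch64, and filters just clone() (allowing it only with CLONE_THREAD), on the assumption that fork() reaches the kernel as clone(), as it does with glibc. A real policy also denies fork and vfork where those syscalls exist. zygote_spawn and install_no_fork_filter are hypothetical names.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/sched.h>     /* CLONE_THREAD */
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "sketch covers x86_64 and aarch64 only"
#endif

static char preloaded[32];  /* stands in for ICU data, V8 snapshots, ... */

/* Child-side filter: clone() is allowed only with CLONE_THREAD. */
static int install_no_fork_filter(void) {
    struct sock_filter f[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone, 0, 3),
        /* flags are clone() arg 0; low word, little-endian host */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args)),
        BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, CLONE_THREAD, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(f) / sizeof(f[0])),
        .filter = f,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns the child's exit status: 0 = process-creating clone() was
 * blocked, 1 = it succeeded, 2 = seccomp unavailable, 3 = preloaded
 * state missing; -1 if fork/wait failed. */
static int zygote_spawn(void) {
    strcpy(preloaded, "warm state");          /* "load libraries" once */
    pid_t pid = fork();                       /* zygote forks, unfiltered */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (strcmp(preloaded, "warm state") != 0)
            _exit(3);                         /* state was not inherited */
        if (install_no_fork_filter() != 0)
            _exit(2);
        /* What fork() would do: clone() without CLONE_THREAD. */
        long r = syscall(SYS_clone, SIGCHLD, 0, 0, 0, 0);
        if (r == 0)
            _exit(0);                         /* stray grandchild: leave */
        _exit(r == -1 && errno == EPERM ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The parent is never filtered, so it can keep forking new children; each child locks itself down before it touches any untrusted input.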


4. Landlock LSM: Path-Based Access Control Without Privileges

Landlock (merged in Linux 5.13) fills a gap that seccomp cannot address: file-path-based access control without needing root or an LSM profile written by an administrator. It is stackable -- it adds restrictions on top of existing DAC, MAC, and seccomp policies.

Landlock vs. AppArmor vs. seccomp

| Feature | seccomp | AppArmor | Landlock |
| --- | --- | --- | --- |
| Restricts | Syscall numbers + args | File paths, capabilities, network | File paths, TCP ports |
| Privilege required | PR_SET_NO_NEW_PRIVS only | Root (profile loading) | PR_SET_NO_NEW_PRIVS only |
| Stackable | Yes (multiple filters) | Single profile | Yes (multiple rulesets) |
| Granularity | Per-syscall | Per-path + capability | Per-path hierarchy + port |
| Self-sandboxing | Yes | No (admin deploys profiles) | Yes |
| Kernel version | 3.5+ (filter mode via prctl()), 3.17+ (seccomp(2) syscall) | Mainline (since 2.6.36) | 5.13+ (ABI v1), 6.7+ (ABI v4 with net) |

Writing a Landlock policy in C

This example restricts a process to read and execute access under /usr and read-write access under /tmp, with outbound TCP connections allowed only to port 443:

/* landlock_sandbox.c
 * Compile: gcc -o landlock_sandbox landlock_sandbox.c
 * Requires: Linux 6.7+ for network rules, 5.13+ for filesystem only
 */
#define _GNU_SOURCE   /* O_PATH; must precede every #include */
#include <fcntl.h>
#include <linux/landlock.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wrapper functions -- landlock syscalls have no glibc wrappers yet */
static int landlock_create_ruleset(
    const struct landlock_ruleset_attr *attr, size_t size, __u32 flags) {
    return syscall(__NR_landlock_create_ruleset, attr, size, flags);
}

static int landlock_add_rule(
    int ruleset_fd, enum landlock_rule_type type,
    const void *attr, __u32 flags) {
    return syscall(__NR_landlock_add_rule, ruleset_fd, type, attr, flags);
}

static int landlock_restrict_self(int ruleset_fd, __u32 flags) {
    return syscall(__NR_landlock_restrict_self, ruleset_fd, flags);
}

int main(void) {
    /* Step 1: Check ABI version */
    int abi = landlock_create_ruleset(NULL, 0,
                                      LANDLOCK_CREATE_RULESET_VERSION);
    if (abi < 0) {
        perror("Landlock not supported on this kernel");
        return 1;
    }
    printf("Landlock ABI version: %d\n", abi);

    /* Step 2: Declare what access types we handle */
    struct landlock_ruleset_attr ruleset_attr = {
        .handled_access_fs =
            LANDLOCK_ACCESS_FS_EXECUTE |
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR |
            LANDLOCK_ACCESS_FS_WRITE_FILE |
            LANDLOCK_ACCESS_FS_REMOVE_FILE |
            LANDLOCK_ACCESS_FS_REMOVE_DIR |
            LANDLOCK_ACCESS_FS_MAKE_REG |
            LANDLOCK_ACCESS_FS_MAKE_DIR,
        .handled_access_net =
            LANDLOCK_ACCESS_NET_CONNECT_TCP,
    };

    /* Downgrade gracefully for older ABI versions */
    if (abi < 4)
        ruleset_attr.handled_access_net = 0;

    int ruleset_fd = landlock_create_ruleset(
        &ruleset_attr, sizeof(ruleset_attr), 0);
    if (ruleset_fd < 0) {
        perror("landlock_create_ruleset");
        return 1;
    }

    /* Step 3: Add filesystem rules */

    /* /usr: read + execute only */
    struct landlock_path_beneath_attr usr_rule = {
        .allowed_access =
            LANDLOCK_ACCESS_FS_EXECUTE |
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR,
        .parent_fd = open("/usr", O_PATH | O_CLOEXEC),
    };
    if (usr_rule.parent_fd < 0 ||
        landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
                          &usr_rule, 0)) {
        perror("Failed to add /usr rule");
        return 1;
    }
    close(usr_rule.parent_fd);

    /* /tmp: read + write */
    struct landlock_path_beneath_attr tmp_rule = {
        .allowed_access =
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR |
            LANDLOCK_ACCESS_FS_WRITE_FILE |
            LANDLOCK_ACCESS_FS_MAKE_REG |
            LANDLOCK_ACCESS_FS_REMOVE_FILE,
        .parent_fd = open("/tmp", O_PATH | O_CLOEXEC),
    };
    if (tmp_rule.parent_fd < 0 ||
        landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
                          &tmp_rule, 0)) {
        perror("Failed to add /tmp rule");
        return 1;
    }
    close(tmp_rule.parent_fd);

    /* Step 4: Add network rule -- only port 443 */
    if (abi >= 4) {
        struct landlock_net_port_attr net_rule = {
            .allowed_access = LANDLOCK_ACCESS_NET_CONNECT_TCP,
            .port = 443,
        };
        if (landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
                              &net_rule, 0)) {
            perror("Failed to add network rule");
            return 1;
        }
    }

    /* Step 5: Enforce -- no way to remove restrictions after this */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl(NO_NEW_PRIVS)");
        return 1;
    }
    if (landlock_restrict_self(ruleset_fd, 0)) {
        perror("landlock_restrict_self");
        return 1;
    }
    close(ruleset_fd);

    printf("Landlock sandbox active.\n");

    /* Test: writing to /home should fail */
    FILE *f = fopen("/home/test.txt", "w");
    if (f == NULL)
        printf("/home write blocked: %m  (expected)\n");
    else {
        fclose(f);
        printf("/home write succeeded (unexpected!)\n");
    }

    /* Test: reading /usr/bin/ls should succeed */
    f = fopen("/usr/bin/ls", "r");
    if (f) {
        printf("/usr/bin/ls readable (expected)\n");
        fclose(f);
    }

    return 0;
}

Landlock is permanent and one-way

Once landlock_restrict_self() is called, the restrictions cannot be removed for the lifetime of the process, and they are inherited by its children. Additional landlock_restrict_self() calls can only add more restrictions. This is the same one-way property that seccomp filters and the no_new_privs bit have.
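Because a ruleset cannot be loosened later, it pays to request exactly the rights the running kernel understands. The helper below sketches the best-effort downgrade for filesystem rights, in the same spirit as the abi < 4 check in the example above. fs_rights_for_abi is an illustrative name. The #ifndef fallbacks carry the bit positions from the kernel UAPI header and are only there so the sketch builds against older headers.

```c
#include <linux/landlock.h>
#include <stdint.h>

/* Fallbacks for older UAPI headers. */
#ifndef LANDLOCK_ACCESS_FS_REFER
#define LANDLOCK_ACCESS_FS_REFER (1ULL << 13)      /* ABI v2, Linux 5.19 */
#endif
#ifndef LANDLOCK_ACCESS_FS_TRUNCATE
#define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)   /* ABI v3, Linux 6.2 */
#endif
#ifndef LANDLOCK_ACCESS_FS_IOCTL_DEV
#define LANDLOCK_ACCESS_FS_IOCTL_DEV (1ULL << 15)  /* ABI v5, Linux 6.10 */
#endif

/* Strip filesystem access rights that the reported ABI version does
 * not know about, so landlock_create_ruleset() will not reject the
 * ruleset with EINVAL on an older kernel. */
static uint64_t fs_rights_for_abi(uint64_t wanted, int abi) {
    if (abi < 1)
        return 0;                                   /* no Landlock at all */
    if (abi < 2)
        wanted &= ~(uint64_t)LANDLOCK_ACCESS_FS_REFER;
    if (abi < 3)
        wanted &= ~(uint64_t)LANDLOCK_ACCESS_FS_TRUNCATE;
    if (abi < 5)
        wanted &= ~(uint64_t)LANDLOCK_ACCESS_FS_IOCTL_DEV;
    return wanted;
}
```

The result would be assigned to handled_access_fs, and the same mask applied to each rule's allowed_access, since a rule may not grant a right the ruleset does not handle.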


5. seccomp + Containers: Docker Profiles

How Docker applies seccomp

When you run docker run, the container runtime (runc) installs a seccomp-BPF filter before exec'ing the container entrypoint. The flow:

  1. Docker daemon reads the seccomp profile (default or custom).
  2. The OCI runtime spec includes the profile in linux.seccomp.
  3. runc translates the JSON profile into libseccomp calls.
  4. libseccomp compiles to BPF bytecode and installs via seccomp(2).

The default Docker seccomp profile

Docker's default profile uses an allowlist strategy:

  • defaultAction: SCMP_ACT_ERRNO (deny by default)
  • Allows ~270 of ~400+ syscalls that most applications need
  • Blocks ~44 dangerous syscalls, including:

| Category | Blocked syscalls | Why |
| --- | --- | --- |
| Kernel modules | init_module, finit_module, delete_module | Load kernel code |
| Namespace escape | unshare, setns (conditional) | Break container isolation |
| Reboot/power | reboot, kexec_load | Crash the host |
| Device access | mknod | Create device nodes |
| Clock | clock_settime, settimeofday | Affect host timekeeping |
| Raw I/O | ioperm, iopl | Direct port access |
| Tracing | ptrace | Debug/inspect other processes |
| Mount | mount, umount2 | Modify filesystem topology |
| Swap | swapon, swapoff | Affect host memory |

Writing a custom profile for a Chromium container

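A Chromium container (the kind Warmwind likely runs) needs a profile that allows fewer syscalls than Docker's default but still permits the ones Chromium's own sandbox requires. Key additions: clone with namespace flags (Chromium creates its own namespaces) and seccomp (Chromium installs its own nested seccomp filter). Treat the list below as a starting point rather than a complete inventory: run the container with the profile, watch the audit log for denied syscalls, and extend it as needed.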

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": [
        "read", "write", "close", "fstat", "lseek", "mmap",
        "mprotect", "munmap", "brk", "pread64",
        "pwrite64", "readv", "writev", "access", "pipe",
        "select", "sched_yield", "mremap", "mincore",
        "madvise", "dup", "dup2", "nanosleep",
        "getpid", "getuid", "getgid", "geteuid", "getegid",
        "getppid", "getpgrp", "setsid", "gettid",
        "sendmsg", "recvmsg", "sendto", "recvfrom",
        "socket", "connect", "bind", "listen", "accept4",
        "getsockname", "getpeername", "getsockopt", "setsockopt",
        "socketpair", "shutdown",
        "fcntl", "flock", "openat", "getdents64",
        "fstatfs", "fadvise64", "clock_gettime",
        "clock_getres", "clock_nanosleep",
        "exit_group", "epoll_wait", "epoll_ctl",
        "epoll_create1", "eventfd2", "timerfd_create",
        "timerfd_settime", "timerfd_gettime",
        "signalfd4", "poll", "ppoll",
        "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "futex", "set_robust_list", "get_robust_list",
        "prctl", "arch_prctl", "set_tid_address",
        "restart_syscall", "getrandom", "memfd_create",
        "copy_file_range", "statx", "rseq",
        "prlimit64", "pipe2", "membarrier",
        "execve", "exit", "newfstatat", "wait4"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "Chromium needs clone with namespace flags for its own sandbox",
      "names": ["clone", "clone3"],
      "action": "SCMP_ACT_ALLOW",
      "args": []
    },
    {
      "comment": "Allow Chromium to install its nested seccomp filter",
      "names": ["seccomp"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "Chromium creates user/PID/net namespaces for renderers",
      "names": ["unshare"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "GPU process needs DRI ioctls",
      "names": ["ioctl"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "comment": "Needed for shared memory (Wayland buffers, IPC)",
      "names": ["shmget", "shmat", "shmctl", "shmdt"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Apply it:

docker run --rm \
    --security-opt seccomp=/path/to/chromium-seccomp.json \
    --security-opt no-new-privileges \
    --cap-drop=ALL \
    --cap-add=SYS_ADMIN \
    warmwind/chromium:latest

The SYS_ADMIN trap

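--cap-add=SYS_ADMIN is there so that Chromium can call clone() with CLONE_NEWUSER and CLONE_NEWPID inside the container, but CAP_SYS_ADMIN is one of the broadest capabilities in Linux and undoes much of what --cap-drop=ALL achieved. A common alternative in containerized browser deployments is to drop SYS_ADMIN, run Chromium with --no-sandbox, and let Docker's own namespace isolation substitute for Chromium's Layer 1, with the seccomp profile above standing in for Layer 2. The trade-off is that --no-sandbox turns off Chromium's seccomp-BPF policy as well, so every Chromium process then shares the one container-wide profile instead of getting a filter tailored to its process type.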


6. SECCOMP_RET_USER_NOTIF: The Supervisor Pattern

Added in Linux 5.0, SECCOMP_RET_USER_NOTIF enables a fundamentally different security model: instead of simply allowing or denying syscalls, a privileged supervisor process intercepts them and decides what to do.

How it works

sequenceDiagram
    participant T as Target (sandboxed)
    participant K as Kernel
    participant S as Supervisor (privileged)

    T->>K: mount("/dev/sda1", "/mnt", "ext4", ...)
    K->>K: seccomp filter returns RET_USER_NOTIF
    K-->>T: Thread blocked
    K->>S: Notification on seccomp fd (readable)
    S->>K: SECCOMP_IOCTL_NOTIF_RECV
    K-->>S: struct seccomp_notif {id, pid, syscall_data}
    S->>S: Validate request, read /proc/pid/mem for args
    S->>S: Perform mount() on behalf of target
    S->>K: SECCOMP_IOCTL_NOTIF_SEND {id, val=0, error=0}
    K-->>T: mount() returns 0 (success)

The notification structures

/* Received by supervisor when sandboxed process hits RET_USER_NOTIF */
struct seccomp_notif {
    __u64 id;                   /* unique request ID */
    __u32 pid;                  /* PID of the sandboxed process */
    __u32 flags;
    struct seccomp_data data;   /* syscall nr, arch, args[6] */
};

/* Sent back by supervisor */
struct seccomp_notif_resp {
    __u64 id;                   /* must match the request */
    __s64 val;                  /* syscall return value */
    __s32 error;                /* errno (negative) or 0 */
    __u32 flags;                /* SECCOMP_USER_NOTIF_FLAG_CONTINUE */
};

Container runtime use: rootless containers

Rootless containers (Podman, rootless Docker, LXC unprivileged) cannot perform privileged operations like mount() or mknod(). The supervisor pattern solves this:

| Syscall | Supervisor action |
| --- | --- |
| mount("proc", ...) | Supervisor mounts procfs in the container's mount namespace |
| mknod("/dev/fuse", ...) | Supervisor creates the device node, passes fd back |
| mount("overlay", ...) | Supervisor mounts overlay or delegates to FUSE |
| connect(AF_VSOCK, ...) | Supervisor proxies the connection |

LXD's implementation runs a dedicated goroutine per container as the syscall supervisor. It listens on the seccomp notify fd using epoll() and processes requests through a validation pipeline.

The TOCTOU danger

SECCOMP_USER_NOTIF_FLAG_CONTINUE is dangerous

When the supervisor sets SECCOMP_USER_NOTIF_FLAG_CONTINUE, the kernel executes the original syscall. But between the time the supervisor inspected the arguments and the time the kernel acts, the sandboxed process (or another thread) can rewrite the syscall arguments in memory. This is a classic TOCTOU (time-of-check-time-of-use) race.

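Safe supervisor designs follow two rules:

  • Emulate the syscall entirely in the supervisor and never use FLAG_CONTINUE for a security decision. The kernel re-reads the arguments when it continues the call, so what the supervisor checked may no longer be what runs.
  • When reading pointer arguments from /proc/<pid>/mem, call SECCOMP_IOCTL_NOTIF_ID_VALID after the read to confirm the notification is still live and the PID has not been recycled, then act only on the copy the supervisor holds.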

7. Attack Surface Analysis: What Does the Sandbox Prevent?

Threat model

Without a sandbox, a renderer exploit (e.g., a V8 type-confusion bug) gives the attacker full process privileges: arbitrary file read/write, network access, and potential kernel exploitation. With the sandbox:

| Attack vector | Without sandbox | With sandbox |
| --- | --- | --- |
| Read /etc/shadow | Direct open() | Blocked: no open/openat in seccomp policy |
| Exfiltrate data over network | connect() to C2 server | Blocked: empty network namespace, no socket() |
| Install rootkit | init_module() | Blocked: kernel module syscalls denied |
| Pivot to other processes | ptrace() or /proc access | Blocked: PID namespace + seccomp denies ptrace |
| Exploit kernel via syscall | Any of 400+ syscalls | Only ~30-50 permitted: dramatically reduced attack surface |

Real CVEs where the sandbox mattered

CVE-2025-2783 (Chrome on Windows, March 2025): An incorrect handle provided through Mojo IPC allowed a compromised renderer to obtain privileged browser process handles. This was a sandbox escape, exploited in the wild. Because it was Windows-specific, the boundary it crossed was Chrome's Windows sandbox rather than the seccomp and namespace layers described here, but the lesson carries over: the fact that it earned a standalone CVE shows that the sandbox normally contains renderer exploits. An attacker needs a separate sandbox escape bug on top of the renderer bug.

CVE-2025-4609 (Chromium, patched May 2025; bounty announced August 2025): A flaw in Chromium's ipcz (inter-process communication) mechanism allowed a compromised renderer to gain browser process handles. Awarded a $250,000 bounty, one of the largest in Chrome's history, precisely because sandbox escapes are rare and high-impact.

CVE-2020-6572 (Chromium): An exploit in MediaCodecAudioDecoder allowed sandbox escape. Google's root cause analysis documented exactly how the attacker had to chain a renderer exploit with a sandbox escape -- two independent vulnerabilities required for full compromise.

CVE-2023-36719 (Windows, affecting Chrome): A 20-year-old stack corruption bug in a Windows OS library was reachable from within the Chromium sandbox. This demonstrates that the sandbox boundary forces attackers to hunt for bugs in the small set of OS components still reachable from inside the sandbox, rather than anywhere in the much larger unsandboxed attack surface.

The numbers

| Year | Total Chrome CVEs | Sandbox escapes | Required exploit chain |
| --- | --- | --- | --- |
| 2023 | ~180 | 3-5 | Always 2+ bugs (renderer + escape) |
| 2024 | ~175 | 4-6 | Same pattern |
| 2025 | ~205 | 4-7 | Same pattern |

The key insight: the vast majority of renderer vulnerabilities are contained by the sandbox. An attacker who finds a V8 type-confusion or a use-after-free in Blink gains code execution inside the renderer process but cannot:

  • Access the filesystem (no open syscalls)
  • Open network connections (empty network namespace, no socket())
  • Escalate to root (no privilege-related syscalls)
  • Spawn new processes (no fork/exec)

They need a second, independent vulnerability to escape. This defense-in-depth strategy converts single-bug RCE into a multi-bug chain, dramatically raising the cost of exploitation.

Sandbox escape economics

Google's VRP (Vulnerability Reward Program) pays $20,000-$30,000 for renderer bugs but $100,000-$250,000+ for sandbox escapes. The price difference reflects the rarity and difficulty. On the exploit market, a full Chrome chain (renderer + sandbox escape + kernel LPE) sells for $500,000-$2,000,000+, while a renderer-only exploit without sandbox escape is worth a fraction of that.


Putting It All Together: Warmwind's Likely Stack

A Chromium-in-Docker deployment like Warmwind probably combines all of these layers:

+--------------------------------------------------+
| Docker seccomp profile (custom JSON)             |  Outermost: container-level
|  +----------------------------------------------+|
|  | Linux namespaces (user, PID, net, mnt)       ||  Docker's isolation
|  |  +------------------------------------------+||
|  |  | Chromium Layer 1 (nested user+PID ns)    |||  Chromium's own sandbox
|  |  |  +--------------------------------------+|||
|  |  |  | Chromium Layer 2 (seccomp-BPF)       ||||  baseline_policy.cc
|  |  |  |  +----------------------------------+||||
|  |  |  |  | Renderer process                 |||||  Runs untrusted JS
|  |  |  |  +----------------------------------+||||
|  |  |  +--------------------------------------+|||
|  |  +------------------------------------------+||
|  +----------------------------------------------+|
+--------------------------------------------------+

Each layer catches different failure modes. If an attacker escapes Chromium's seccomp filter, they hit Docker's seccomp filter. If they escape that, they are still in a user namespace with no capabilities on the host. If Landlock is active on the host, even a namespace escape hits path-based restrictions.


Glossary

seccomp_data
Kernel structure passed to BPF filters for each syscall. Contains syscall number, architecture, instruction pointer, and six arguments.
sock_fprog
User-space structure wrapping a Classic BPF program (instruction count + pointer to instruction array) for seccomp filter installation.
Classic BPF (cBPF)
The original Berkeley Packet Filter bytecode used by seccomp. Not to be confused with eBPF, which is used for tracing and networking but is NOT used for seccomp.
libseccomp
High-level C library that compiles seccomp rules into cBPF bytecode. Provides architecture portability and argument-level filtering.
Zygote
Chromium's process that pre-loads shared libraries and forks to create new renderer processes. Enables fast startup and COW memory sharing.
Broker process
Privileged process that performs filesystem operations on behalf of sandboxed renderers. Validates requests against an allowlist before opening files.
Landlock
Linux Security Module (merged 5.13) for unprivileged, stackable, path-based filesystem and network access control. Complement to seccomp.
SECCOMP_RET_USER_NOTIF
Seccomp action (Linux 5.0+) that forwards a syscall to a supervisor process via a notification file descriptor instead of allowing or denying it directly.
TOCTOU
Time-of-check-time-of-use race condition. In seccomp notify context: the gap between when a supervisor reads syscall arguments and when the kernel acts on them.
SCM_RIGHTS
Unix socket ancillary data type for passing file descriptors between processes. Used by Chromium's broker to deliver opened fds to sandboxed renderers.
OCI runtime spec
Open Container Initiative specification for container execution. Includes linux.seccomp field where Docker injects seccomp profiles for runc to install.
SCMP_ACT_ERRNO
libseccomp / Docker seccomp action that causes the denied syscall to return a specified errno value instead of killing the process.