Linux Kernel Internals for Platform Engineers¶
This is the knowledge that separates a senior Linux platform engineer from
someone who can run apt install. Not a kernel development tutorial, but the
internals that matter when you build on top of the kernel: virtual memory
(page tables, TLB, huge pages, mmap semantics), the CPU scheduler (CFS, EEVDF,
real-time classes, cpuset cgroups), file descriptor internals and SCM_RIGHTS
(why Wayland's fd-passing is efficient), futexes (the foundation of every
mutex), epoll internals (red-black trees, wait queues, thundering herd),
io_uring (submission/completion rings), namespace implementation, and eBPF
from the kernel's perspective.
1. Virtual Memory¶
Every process sees a flat 48-bit (or 57-bit with 5-level paging) virtual address space. The kernel manages the mapping from virtual to physical via page tables and the MMU (Memory Management Unit).
Page table walk¶
Virtual address (48-bit, 4-level paging):
┌──────┬──────┬──────┬──────┬──────────┐
│ PGD │ PUD │ PMD │ PTE │ Offset │
│ 9bit │ 9bit │ 9bit │ 9bit │ 12bit │
└──┬───┴──┬───┴──┬───┴──┬───┴────┬─────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
PGD → PUD → PMD → PTE → Physical page + offset
table table table table
Each level is a 4KB page containing 512 entries (2^9). Four levels give 2^48 = 256 TB of virtual address space. The 12-bit offset addresses bytes within a 4KB page.
Cost: a TLB miss requires four memory reads to walk the page tables. This is why TLB coverage matters enormously.
TLB (Translation Lookaside Buffer)¶
The TLB caches recent virtual-to-physical translations:
| TLB level | Entries (typical) | Latency |
|---|---|---|
| L1 DTLB | 64-128 | 1 cycle |
| L1 ITLB | 64-128 | 1 cycle |
| L2 STLB | 1024-2048 | 7-10 cycles |
| TLB miss (page walk) | - | 20-100 cycles |
With 4KB pages, 2048 STLB entries cover 8MB. For a compositor with a 200MB working set, most accesses miss the TLB. This is where huge pages matter.
Huge pages¶
| Page size | TLB entries for 1GB | Available as |
|---|---|---|
| 4KB | 262,144 | Default |
| 2MB (PMD-level) | 512 | THP or hugetlbfs |
| 1GB (PUD-level) | 1 | hugetlbfs only |
Transparent Huge Pages (THP): the kernel automatically promotes 4KB pages to 2MB pages when it finds 512 contiguous pages with compatible attributes. No application changes needed.
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# For a compositor: "madvise" is best -- let the application opt in
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Application-level opt-in
madvise(addr, length, MADV_HUGEPAGE); // suggest THP
When THP hurts: THP compaction can cause latency spikes (the kernel pauses to defragment physical memory). For latency-sensitive compositors:
# Disable THP compaction (prevent latency spikes)
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
# "defer+madvise": background compaction only, direct reclaim only for
# explicit madvise(MADV_HUGEPAGE) regions
mmap semantics¶
mmap() is the foundation of everything: file I/O, shared memory, GPU
buffers, anonymous allocations.
// Key flags and their semantics:
// MAP_PRIVATE: copy-on-write. Reads from file, writes go to anonymous pages.
// Used for: loading shared libraries (.so), private file mappings
void *lib = mmap(NULL, len, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, 0);
// MAP_SHARED: writes visible to other processes and written back to file.
// Used for: shared memory (shmem), dma-buf, Wayland SHM buffers
void *shm = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, shm_fd, 0);
// MAP_ANONYMOUS: no file backing, zero-filled pages.
// Used for: heap (malloc), thread stacks
void *heap = mmap(NULL, len, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// MAP_POPULATE: fault all pages immediately (no lazy allocation).
// Used for: real-time paths that cannot tolerate page faults
void *rt = mmap(NULL, len, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
MADV_DONTNEED vs MADV_FREE¶
Both tell the kernel that pages are no longer needed, but with critical differences:
| | MADV_DONTNEED | MADV_FREE |
|---|---|---|
| Behavior | Immediately unmaps pages. Next access → page fault, zero-filled page | Marks pages as reclaimable. Next access returns existing data if not yet reclaimed |
| Performance | Expensive (page table updates, TLB flush) | Cheap (just marks pages in the page table) |
| Determinism | Deterministic (always zeroed on next access) | Non-deterministic (may or may not be reclaimed) |
| Used by | free() in glibc, Go runtime | jemalloc, tcmalloc |
| Kiosk relevance | Use for security (wipe sensitive data) | Use for performance (recycle allocator pages) |
// Security: wipe a buffer containing credentials
madvise(secret_buf, len, MADV_DONTNEED); // pages dropped; reads return zeros
// Performance: return allocator pages to kernel without zeroing
madvise(free_pages, len, MADV_FREE); // pages recyclable but not zeroed
2. The CPU Scheduler¶
CFS (Completely Fair Scheduler)¶
CFS (kernel 2.6.23 through 6.5) uses a red-black tree of tasks ordered by virtual runtime (vruntime). The task with the smallest vruntime runs next.
RB-tree (ordered by vruntime)
┌──────────────┐
│ task C (5ms) │
└──┬───────┬───┘
┌─────┘ └─────┐
┌──────┴──────┐ ┌──────┴──────┐
│ task A (3ms) │ │ task D (8ms) │
└──────────────┘ └──────────────┘
← smallest vruntime = runs next
vruntime increases as a task runs. Tasks with higher nice values
accumulate vruntime faster (get less CPU). The key insight: CFS does not
use fixed time slices. It dynamically computes a slice based on the number
of runnable tasks and their weights.
EEVDF (Earliest Eligible Virtual Deadline First)¶
EEVDF replaced CFS in kernel 6.6. The motivation: CFS was fair in the long run but could starve short-running tasks in the short term (a task waking from sleep had to wait for previously running tasks to finish their slices).
EEVDF adds a virtual deadline to each task. A task is "eligible" when its vruntime is not ahead of the fair share. Among eligible tasks, the one with the earliest deadline runs first.
| Task | Eligible? | Virtual deadline | Status |
|---|---|---|---|
| A | yes | 15ms | ← runs next (earliest deadline) |
| B | yes | 18ms | waiting |
| C | no | 12ms | already ahead of fair share |
| D | yes | 20ms | waiting |
Practical impact: EEVDF improves latency for interactive tasks (the compositor and input handling threads) without explicit tuning. Wake-up latency dropped by ~30% in real-world tests.
# Check which scheduler is active
cat /sys/kernel/debug/sched/debug | head -5
# On 6.6+: EEVDF fields such as "deadline" and "slice" appear in the output
# Tune EEVDF: base time slice (replaces CFS's min_granularity_ns)
cat /sys/kernel/debug/sched/base_slice_ns
# Default: 750000 (750us). Lower = more responsive, higher = more throughput
Real-time scheduling for compositor threads¶
The compositor's render thread should never be preempted by background
tasks. Use SCHED_FIFO (fixed-priority, run-to-completion):
#include <sched.h>
struct sched_param param = { .sched_priority = 50 }; // 1-99
sched_setscheduler(0, SCHED_FIFO, &param);
// This thread now preempts ALL CFS/EEVDF tasks
// Only higher-priority SCHED_FIFO threads can preempt it
# Set real-time priority from outside the process
chrt -f 50 -p $(pidof sway)
# Or in the systemd unit:
# [Service]
# CPUSchedulingPolicy=fifo
# CPUSchedulingPriority=50
Warning: a SCHED_FIFO thread that loops indefinitely will lock up the
CPU. Use SCHED_DEADLINE (kernel 3.14+) for bounded execution:
struct sched_attr attr = {
.size = sizeof(attr),
.sched_policy = SCHED_DEADLINE,
.sched_runtime = 5000000, // 5ms per period
.sched_deadline = 16666666, // 16.67ms (60Hz)
.sched_period = 16666666,
};
syscall(SYS_sched_setattr, 0, &attr, 0);
// Kernel guarantees 5ms of CPU every 16.67ms
cpuset cgroups: CPU pinning¶
For a kiosk with dedicated hardware, pin the compositor to specific cores:
# Enable the cpuset controller for child cgroups (cgroup v2), then create one
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/compositor
echo "0-1" > /sys/fs/cgroup/compositor/cpuset.cpus  # cores 0-1
echo "0" > /sys/fs/cgroup/compositor/cpuset.mems    # NUMA node 0
echo $SWAY_PID > /sys/fs/cgroup/compositor/cgroup.procs
# Put everything else on remaining cores
echo "2-7" > /sys/fs/cgroup/system.slice/cpuset.cpus
This eliminates cache contention between the compositor and background tasks: cores 0-1 are exclusively for Sway.
3. File Descriptor Internals¶
struct file¶
Every open file descriptor points to a struct file in the kernel:
struct file {
struct path f_path; // dentry + vfsmount
const struct file_operations *f_op; // read, write, mmap, ioctl, poll
atomic_long_t f_count; // reference count
unsigned int f_flags; // O_RDONLY, O_NONBLOCK, etc.
fmode_t f_mode; // FMODE_READ, FMODE_WRITE
loff_t f_pos; // current file position
void *private_data; // driver-specific data
// ...
};
The process's fd table maps integer fds to struct file pointers:
Process A Kernel
fd table: struct file objects:
0 → ──────────────────────→ [struct file: /dev/tty, count=2]
1 → ──────────────────────→ ↑ (same struct file)
2 → ──────────────────────→ ↑ (dup'd)
3 → ──────────────────────→ [struct file: /dev/dri/card0, count=1]
4 → ──────────────────────→ [struct file: anon_inode:[eventfd], count=1]
dup() creates a new fd pointing to the same struct file (increments
f_count). fork() copies the fd table, incrementing f_count for each
entry. close() decrements f_count; when it reaches 0, the file is
released.
SCM_RIGHTS: how Wayland passes file descriptors¶
The Wayland protocol passes dma-buf file descriptors between client and
compositor via Unix domain sockets using SCM_RIGHTS:
// Sending an fd over a Unix socket
struct msghdr msg = {0};
struct cmsghdr *cmsg;
char buf[CMSG_SPACE(sizeof(int))];
char dummy = 0; // SCM_RIGHTS must ride along with at least 1 byte of data
struct iovec io = { .iov_base = &dummy, .iov_len = 1 };
int dma_buf_fd = gbm_bo_get_fd(bo);
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &dma_buf_fd, sizeof(int));
sendmsg(socket_fd, &msg, 0);
What happens in the kernel:
- sendmsg() enters the kernel.
- The kernel looks up the sender's struct file for dma_buf_fd.
- When the receiver calls recvmsg(), the kernel creates a new fd in the receiver's fd table pointing to the same struct file.
- f_count is incremented (now both processes hold a reference).
- The receiver gets a new integer fd (possibly with a different number).
This is why fd-passing is efficient: no data is copied. Both processes share the same kernel object (and for dma-bufs, the same GPU memory). Closing the fd in the sender does not affect the receiver (reference counted).
Why this matters for Wayland: every buffer (wl_buffer) is backed by a
dma-buf fd. The client creates it, sends the fd to the compositor, and the
compositor imports it as a GPU texture. Zero copies, zero serialization.
4. Futexes: The Foundation of Userspace Synchronization¶
Every pthread_mutex_lock(), pthread_cond_wait(), and sem_wait() in
glibc is built on top of futexes (Fast Userspace muTEXes).
The fast path¶
// Simplified futex-based mutex (what glibc actually does):
// Lock (fast path: no syscall)
int expected = 0;
if (atomic_compare_exchange(&mutex->state, &expected, 1)) {
// Got the lock. No kernel involvement.
return;
}
// Lock (slow path: contention → kernel)
futex(&mutex->state, FUTEX_WAIT, 1, ...);
// Kernel adds this thread to a wait queue keyed by &mutex->state
// Thread sleeps until another thread calls FUTEX_WAKE
// Unlock
mutex->state = 0;
futex(&mutex->state, FUTEX_WAKE, 1, ...);
// Kernel wakes one waiter
The genius of futexes: the common case (no contention) is a single atomic instruction in userspace -- no syscall at all. Only when there is contention does the thread enter the kernel to sleep.
How glibc implements pthread_mutex_lock¶
The actual glibc implementation uses a three-state mutex:
| State | Meaning |
|---|---|
| 0 | Unlocked |
| 1 | Locked, no waiters |
| 2 | Locked, has waiters |
// pthread_mutex_lock (simplified from glibc nptl/pthread_mutex_lock.c):
int __pthread_mutex_lock(pthread_mutex_t *mutex) {
// Fast path: try to go 0 → 1 (unlocked → locked, no waiters)
if (atomic_compare_exchange_weak(&mutex->__data.__lock, 0, 1) == 0)
return 0; // Got the lock, no syscall
// Slow path: spin briefly, then sleep
int old = atomic_exchange(&mutex->__data.__lock, 2); // set "has waiters"
while (old != 0) {
futex(&mutex->__data.__lock, FUTEX_WAIT_PRIVATE, 2, NULL);
old = atomic_exchange(&mutex->__data.__lock, 2);
}
return 0;
}
The transition to state 2 ensures that pthread_mutex_unlock always calls
FUTEX_WAKE when there are waiters, preventing missed wakeups.
Priority inversion and PI futexes¶
For real-time compositor threads: a high-priority thread can be blocked by a low-priority thread holding a mutex, while a medium-priority thread runs instead. This is priority inversion (the Mars Pathfinder bug).
Linux provides FUTEX_LOCK_PI (priority inheritance futexes):
// The lock holder's priority is temporarily raised to the
// highest priority among waiters
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
5. epoll Internals¶
Data structures¶
Process calls epoll_create1():
→ Kernel allocates struct eventpoll:
.rbr: red-black tree of monitored fds (struct epitem)
.rdllist: ready list (doubly-linked list of fired epitems)
.wq: wait queue (threads blocked in epoll_wait)
.poll_wait: used for nested epoll
Process calls epoll_ctl(EPOLL_CTL_ADD, fd, event):
→ Kernel creates struct epitem
→ Inserts into the red-black tree (O(log n))
→ Registers a callback on the fd's wait queue:
when the fd becomes ready, the callback moves the
epitem to the rdllist and wakes threads in .wq
Process calls epoll_wait():
→ If rdllist is non-empty: return ready events immediately
→ If rdllist is empty: sleep on .wq until a callback fires
Why epoll is O(1) for event delivery¶
Traditional poll() / select() scan the entire fd set every call:
O(n) per call. epoll registers callbacks once (via epoll_ctl), and
event delivery is O(1) per ready fd:
- A packet arrives on a socket.
- The kernel wakes the socket's wait queue.
- The epoll callback fires, moving the epitem to the ready list.
- The thread sleeping in epoll_wait() wakes up.
- Only ready fds are returned -- no scanning.
Thundering herd and EPOLLEXCLUSIVE¶
When multiple threads call epoll_wait() on the same epoll instance, a
single event wakes all of them. Only one can handle the event; the
rest immediately sleep again. This wastes CPU.
// Solution: EPOLLEXCLUSIVE (kernel 4.5+)
struct epoll_event ev = {
.events = EPOLLIN | EPOLLEXCLUSIVE,
.data.fd = listen_fd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
// Now only ONE waiter is woken per event
Level-triggered vs edge-triggered¶
| Mode | Behavior | Re-arm needed? |
|---|---|---|
| Level-triggered (default) | epoll_wait returns if fd is ready | No -- returns again if still ready |
| Edge-triggered (EPOLLET) | epoll_wait returns when fd becomes ready | Yes -- must drain all data or you miss events |
Edge-triggered is faster (fewer epoll_wait returns) but dangerous: if you do not read all available data, you will never be notified again until new data arrives.
// Edge-triggered pattern: the fd must be O_NONBLOCK; drain until EAGAIN
for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN) break; // fully drained
    if (n <= 0) break;                     // EOF (0) or real error (-1)
    process(buf, n);
}
wlroots uses level-triggered epoll for simplicity and correctness (the event loop is not latency-critical to the point where edge-triggered matters).
6. io_uring: The New Async I/O Interface¶
io_uring (kernel 5.1+) provides true asynchronous I/O via shared-memory ring buffers between userspace and kernel. Syscalls are batched, and in SQPOLL mode eliminated entirely on the fast path.
Architecture¶
Userspace Kernel
┌──────────────────┐ ┌──────────────────┐
│ Submission Queue │ ──────→│ SQ Thread │
│ (SQE ring) │ │ (processes SQEs) │
│ │ │ │
│ Completion Queue │ ←──────│ Completion path │
│ (CQE ring) │ │ (fills CQEs) │
└──────────────────┘ └──────────────────┘
↕ mmap'd shared memory
// io_uring setup
struct io_uring ring;
io_uring_queue_init(256, &ring, 0); // 256 SQE entries
// Submit a read (no syscall in SQPOLL mode)
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, 0);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring); // syscall (or no-op in SQPOLL mode)
// Reap completions
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res; // bytes read, or -errno
void *data = io_uring_cqe_get_data(cqe);
io_uring_cqe_seen(&ring, cqe);
io_uring vs epoll for compositors¶
| | epoll | io_uring |
|---|---|---|
| Model | Readiness notification | Completion notification |
| Syscalls per I/O | 1 (epoll_wait) + 1 (read/write) | 0-1 (batched submit) |
| Best for | Many fds, few events | High I/O throughput, batching |
| Compositor use | Wayland socket, input, timers | Could batch DRM ioctls, file I/O |
| Maturity | Battle-tested | Newer, ongoing security hardening |
io_uring could replace epoll for compositors, but the benefit is marginal (compositors are not I/O-bound). The real win is for storage-heavy workloads (database, logging).
SQPOLL mode: zero-syscall I/O¶
struct io_uring_params params = {
    .flags = IORING_SETUP_SQPOLL,  // kernel thread polls the SQ
    .sq_thread_idle = 2000,        // SQ thread sleeps after 2s idle
};
io_uring_queue_init_params(256, &ring, &params);
// Now io_uring_submit() is a no-op -- the kernel thread picks up SQEs
The kernel thread polls the submission queue, so io_uring_submit() does
not need a syscall. This is the fastest path for high-frequency I/O.
7. Namespace Implementation¶
How namespaces work in the kernel¶
Every process has a struct nsproxy containing pointers to its namespaces:
struct nsproxy {
struct uts_namespace *uts_ns; // hostname
struct ipc_namespace *ipc_ns; // SysV IPC
struct mnt_namespace *mnt_ns; // mount table
struct pid_namespace *pid_ns; // PID numbering
struct net *net_ns; // network stack
struct cgroup_namespace *cgroup_ns; // cgroup view
struct time_namespace *time_ns; // clock offsets
};
clone() and unshare()¶
clone() creates a new process, optionally with new namespaces:
// Create a child process with new PID and network namespaces
int flags = CLONE_NEWPID | CLONE_NEWNET | SIGCHLD;
pid_t child = clone(child_fn, stack + STACK_SIZE, flags, arg);
unshare() creates new namespaces for the calling process:
// Move the current process into new mount and PID namespaces
unshare(CLONE_NEWNS | CLONE_NEWPID);
// Next fork() will be PID 1 in the new PID namespace
Namespace creation cost¶
| Namespace | Creation cost | Runtime overhead | Notes |
|---|---|---|---|
| PID | ~5 us | Negligible | Just a new pid_namespace struct |
| Mount | ~50 us | Negligible after setup | Copies mount table (COW) |
| Network | ~100 us | 1-5% for network I/O | Creates new network stack |
| User | ~10 us | Negligible | Enables unprivileged namespaces |
| UTS | ~2 us | Negligible | Just a hostname string |
| IPC | ~5 us | Negligible | New SysV IPC namespace |
| cgroup | ~5 us | Negligible | New cgroup root view |
| Time | ~2 us | Negligible | Clock offset (kernel 5.6+) |
Network namespaces are the most expensive because they duplicate the entire network stack (routing table, iptables rules, socket hash tables). This is why container networking adds measurable overhead.
User namespaces: unprivileged containers¶
User namespaces (kernel 3.8+) allow unprivileged users to create all other namespace types. Inside the user namespace, the process is uid 0 (root), but the kernel maps this to an unprivileged uid outside:
# Create a user namespace (no root required)
unshare --user --map-root-user bash
id
# uid=0(root) gid=0(root) ← inside the namespace
# Actually uid=1000 outside
# Now you can create other namespaces:
unshare --pid --mount --fork bash
# PID 1 in a new PID namespace, as "root" in the user namespace
This is how rootless Podman works: user namespace provides fake root, enabling mount/PID/network namespaces without actual privileges.
8. eBPF from the Kernel Perspective¶
The verifier algorithm¶
Before any eBPF program runs, the kernel's verifier analyzes it statically:
eBPF bytecode
→ Directed Acyclic Graph (DAG) check
(no backward jumps except bounded loops since 5.3)
→ Abstract interpretation
(track register types and value ranges through every path)
→ Memory safety check
(all pointer dereferences go through BPF helpers with bounds checks)
→ Stack depth check
(max 512 bytes of stack per program)
→ Instruction count check
(max 1 million verified instructions since 5.2)
The verifier tracks register types through every execution path:
| Type | Meaning |
|---|---|
| SCALAR_VALUE | Integer (known range) |
| PTR_TO_CTX | Pointer to the program context (e.g., struct __sk_buff*) |
| PTR_TO_MAP_VALUE | Pointer into a BPF map |
| PTR_TO_STACK | Pointer to the BPF stack |
| PTR_TO_BTF_ID | Pointer to a kernel struct (with BTF type info) |
The verifier rejects any program where a pointer could be:
- Dereferenced out of bounds
- Used after its containing map element is freed
- Confused with a scalar (type confusion)
JIT compilation¶
After verification, the eBPF bytecode is JIT-compiled to native machine code:
# Check if JIT is enabled
cat /proc/sys/net/core/bpf_jit_enable
# 1 = JIT enabled
# 2 = JIT enabled + debug output to the kernel log
# JIT backends: x86_64, arm64, s390x, riscv64, mips, powerpc, loongarch
JIT-compiled BPF programs run at near-native speed. The overhead compared to a native kernel function is ~2-5% (indirect call + stack frame setup).
BPF Type Format (BTF)¶
BTF is a compact type encoding that enables:
- CO-RE (Compile Once, Run Everywhere): BPF programs reference kernel struct fields by name, not offset. The loader (libbpf) relocates offsets at load time using BTF information from the running kernel.
- Pretty-printing: bpftool map dump can show map contents with field names instead of raw bytes.
- Verifier type checking: the verifier uses BTF to ensure struct field accesses are valid.
# Check if the kernel has BTF (required for CO-RE)
ls -la /sys/kernel/btf/vmlinux
# -r--r--r-- 1 root root 5791432 /sys/kernel/btf/vmlinux
# Inspect BTF information
bpftool btf dump file /sys/kernel/btf/vmlinux format c | head -50
# Shows C struct definitions for all kernel types
BPF program types relevant to compositors¶
| Program type | Attach point | Use case |
|---|---|---|
| BPF_PROG_TYPE_KPROBE | Any kernel function | Trace DRM ioctls, scheduler events |
| BPF_PROG_TYPE_TRACEPOINT | Static tracepoints | sched:sched_switch, drm:drm_vblank_event |
| BPF_PROG_TYPE_PERF_EVENT | PMC overflow | Sample CPU cache misses in compositor |
| BPF_PROG_TYPE_CGROUP_DEVICE | cgroup device access | Control GPU device access per container |
| BPF_PROG_TYPE_SYSCALL | Direct invocation | Complex map operations from userspace |
9. Putting It All Together: A Compositor's Kernel Interaction¶
Every frame rendered by Sway involves multiple kernel subsystems:
sequenceDiagram
participant App as Chromium
participant Comp as Sway
participant Kernel as Kernel
Note over App,Kernel: Client renders a frame
App->>Kernel: DRM ioctl (submit GPU commands)
Kernel->>App: GPU fence signaled
App->>Comp: wl_surface.commit (SCM_RIGHTS: dma-buf fd)
Note over Comp: fd received via Unix socket
Comp->>Kernel: epoll_wait returns (socket readable)
Comp->>Kernel: recvmsg + SCM_RIGHTS (receive dma-buf fd)
Note over Comp: Import buffer, composite
Comp->>Kernel: EGL import (dma-buf → texture)
Comp->>Kernel: GL draw calls (composite)
Comp->>Kernel: drmModeAtomicCommit (submit frame)
Note over Kernel: Wait for vblank
Kernel->>Comp: Page-flip event (epoll notification)
Comp->>App: wl_callback.done (frame callback via sendmsg)
Kernel subsystems touched in a single frame:
- DRM/KMS: atomic commit, page flip events, vblank handling
- dma-buf: buffer sharing between GPU, compositor, and VNC
- Unix sockets: Wayland protocol transport, SCM_RIGHTS for fd passing
- epoll: event notification for sockets, timers, DRM events
- Futex: mutex contention in multi-threaded compositor
- Scheduler: CFS/EEVDF scheduling of compositor vs client threads
- Virtual memory: mmap for GPU buffers, SHM, shared libraries
What's new (2025--2026)
- EEVDF scheduler (kernel 6.6) replaced CFS. Improves wake-up latency for interactive tasks by ~30%.
- io_uring continues to expand: io_uring-based networking (IORING_OP_SEND/IORING_OP_RECV) is now production-ready.
- BPF arena (kernel 6.9): allows BPF programs to allocate memory from a shared arena, enabling more complex data structures.
- User-space interrupts (uintr, x86): the CPU can deliver interrupts directly to userspace, bypassing the kernel. Potential future path for zero-latency IPC.
- Rust in the kernel: rvkms (virtual KMS driver) merged. Rust abstractions for DRM, network, and filesystem subsystems are expanding.
Glossary
- Page table
- Multi-level tree structure mapping virtual addresses to physical pages. Four levels on x86_64 (PGD → PUD → PMD → PTE).
- TLB (Translation Lookaside Buffer)
- CPU cache for page table entries. Avoids the 4-level page walk on cache hit.
- THP (Transparent Huge Pages)
- Kernel feature automatically promoting 4KB pages to 2MB pages. Reduces TLB pressure.
- CFS (Completely Fair Scheduler)
- Linux's default scheduler (2.6.23--6.5). Red-black tree ordered by virtual runtime.
- EEVDF (Earliest Eligible Virtual Deadline First)
- CFS replacement (kernel 6.6+). Adds virtual deadline for better latency under load.
- SCHED_FIFO
- Real-time scheduling class. Fixed priority, run-to-completion. Preempts all CFS/EEVDF tasks.
- Futex (Fast Userspace Mutex)
- Kernel primitive for userspace synchronization. Fast path is pure userspace (atomic CAS); slow path sleeps in kernel.
- epoll
- Scalable I/O event notification. O(1) event delivery via callbacks and a ready list.
- EPOLLEXCLUSIVE
- epoll flag preventing thundering herd: only one waiter is woken per event.
- io_uring
- Async I/O interface (kernel 5.1+). Shared-memory submission/completion rings. Zero-syscall fast path.
- SCM_RIGHTS
- Socket control message for passing file descriptors between processes via Unix sockets.
- nsproxy
- Kernel struct containing a process's namespace references. Manipulated by clone() and unshare().
- BTF (BPF Type Format)
- Compact type encoding for eBPF programs. Enables CO-RE (Compile Once, Run Everywhere) and pretty-printing.
- CO-RE (Compile Once, Run Everywhere)
- eBPF portability mechanism. Programs reference struct fields by name; libbpf relocates offsets using BTF.
- MADV_DONTNEED
- madvise flag: immediately release pages (zeroed on next access). Used for security-sensitive data.
- MADV_FREE
- madvise flag: mark pages as reclaimable (may not be zeroed). Used for allocator performance.