Linux Kernel Internals for Platform Engineers

This is the knowledge that separates a senior Linux platform engineer from someone who can run apt install. Not a kernel development tutorial, but the internals that matter when you build on top of the kernel: virtual memory (page tables, TLB, huge pages, mmap semantics), the CPU scheduler (CFS, EEVDF, real-time classes, cpuset cgroups), file descriptor internals and SCM_RIGHTS (why Wayland's fd-passing is efficient), futexes (the foundation of every mutex), epoll internals (red-black trees, wait queues, thundering herd), io_uring (submission/completion rings), namespace implementation, and eBPF from the kernel's perspective.


1. Virtual Memory

Every process sees a flat 48-bit (or 57-bit with 5-level paging) virtual address space. The kernel manages the mapping from virtual to physical via page tables and the MMU (Memory Management Unit).

Page table walk

Virtual address (48-bit, 4-level paging):
┌──────┬──────┬──────┬──────┬──────────┐
│ PGD  │ PUD  │ PMD  │ PTE  │ Offset   │
│ 9bit │ 9bit │ 9bit │ 9bit │ 12bit    │
└──┬───┴──┬───┴──┬───┴──┬───┴────┬─────┘
   │      │      │      │        │
   ▼      ▼      ▼      ▼        ▼
  PGD → PUD → PMD → PTE → Physical page + offset
  table  table  table  table

Each level is a 4KB page containing 512 entries (2^9). Four levels give 2^48 = 256 TB of virtual address space. The 12-bit offset addresses bytes within a 4KB page.

Cost: a TLB miss requires four memory reads to walk the page tables. This is why TLB coverage matters enormously.

TLB (Translation Lookaside Buffer)

The TLB caches recent virtual-to-physical translations:

TLB level              Entries (typical)   Latency
L1 DTLB                64-128              1 cycle
L1 ITLB                64-128              1 cycle
L2 STLB                1024-2048           7-10 cycles
TLB miss (page walk)   --                  20-100 cycles

With 4KB pages, 2048 STLB entries cover 8MB. For a compositor with a 200MB working set, most accesses miss the TLB. This is where huge pages matter.

Huge pages

Page size        TLB entries for 1GB   Available as
4KB              262,144               Default
2MB (PMD-level)  512                   THP or hugetlbfs
1GB (PUD-level)  1                     hugetlbfs only

Transparent Huge Pages (THP): the kernel automatically promotes 4KB pages to 2MB pages when it finds 512 contiguous pages with compatible attributes. No application changes needed.

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# For a compositor: "madvise" is best -- let the application opt in
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# Application-level opt-in
madvise(addr, length, MADV_HUGEPAGE);  // suggest THP

When THP hurts: THP compaction can cause latency spikes (the kernel pauses to defragment physical memory). For latency-sensitive compositors:

# Disable THP compaction (prevent latency spikes)
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
# "defer+madvise": background compaction only, direct reclaim only for
# explicit madvise(MADV_HUGEPAGE) regions

mmap semantics

mmap() is the foundation of everything: file I/O, shared memory, GPU buffers, anonymous allocations.

// Key flags and their semantics:

// MAP_PRIVATE: copy-on-write. Reads from file, writes go to anonymous pages.
// Used for: loading shared libraries (.so), private file mappings
void *lib = mmap(NULL, len, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, 0);

// MAP_SHARED: writes visible to other processes and written back to file.
// Used for: shared memory (shmem), dma-buf, Wayland SHM buffers
void *shm = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, shm_fd, 0);

// MAP_ANONYMOUS: no file backing, zero-filled pages.
// Used for: heap (malloc), thread stacks
void *heap = mmap(NULL, len, PROT_READ|PROT_WRITE,
                  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

// MAP_POPULATE: fault all pages immediately (no lazy allocation).
// Used for: real-time paths that cannot tolerate page faults
void *rt = mmap(NULL, len, PROT_READ|PROT_WRITE,
                MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);

MADV_DONTNEED vs MADV_FREE

Both tell the kernel that pages are no longer needed, but with critical differences:

             MADV_DONTNEED                        MADV_FREE
Behavior     Immediately unmaps pages. Next       Marks pages as reclaimable. Next
             access → page fault, zero-filled     access returns existing data if
             page                                 not yet reclaimed
Performance  Expensive (page table updates,       Cheap (just marks pages in the
             TLB flush)                           page table)
Determinism  Deterministic (always zeroed on      Non-deterministic (may or may
             next access)                         not be reclaimed)
Used by      free() in glibc, Go runtime          jemalloc, tcmalloc
Kiosk use    Use for security (wipe sensitive     Use for performance (recycle
             data)                                allocator pages)

// Security: wipe a buffer containing credentials
madvise(secret_buf, len, MADV_DONTNEED);  // pages zeroed

// Performance: return allocator pages to kernel without zeroing
madvise(free_pages, len, MADV_FREE);  // pages recyclable but not zeroed

2. The CPU Scheduler

CFS (Completely Fair Scheduler)

CFS (kernel 2.6.23 through 6.5) uses a red-black tree of tasks ordered by virtual runtime (vruntime). The task with the smallest vruntime runs next.

            RB-tree (ordered by vruntime)
                 ┌──────────────┐
                 │ task C (5ms) │
                 └───┬──────┬───┘
               ┌─────┘      └─────┐
       ┌───────┴──────┐   ┌───────┴──────┐
       │ task A (3ms) │   │ task D (8ms) │
       └──────────────┘   └──────────────┘

     ← smallest vruntime = runs next

vruntime increases as a task runs. Tasks with higher nice values accumulate vruntime faster (get less CPU). The key insight: CFS does not use fixed time slices. It dynamically computes a slice based on the number of runnable tasks and their weights.

EEVDF (Earliest Eligible Virtual Deadline First)

EEVDF replaced CFS in kernel 6.6. The motivation: CFS was fair in the long run but could starve short-running tasks in the short term (a task waking from sleep had to wait for previously running tasks to finish their slices).

EEVDF adds a virtual deadline to each task. A task is "eligible" when its vruntime is not ahead of the fair share. Among eligible tasks, the one with the earliest deadline runs first.

Task    Eligible?    Virtual Deadline    Status
A       yes          15ms                ← runs next (earliest deadline)
B       yes          18ms                waiting
C       no           12ms                already ahead of fair share
D       yes          20ms                waiting

Practical impact: EEVDF improves latency for interactive tasks (the compositor and input handling threads) without explicit tuning. Wake-up latency dropped by ~30% in real-world tests.

# Check which scheduler is active
cat /sys/kernel/debug/sched/debug | head -5
# On 6.6+: "EEVDF" appears in the output

# Tune EEVDF: base slice (the default per-task request size)
cat /sys/kernel/debug/sched/base_slice_ns
# Default: 750000 (750us). Replaced min_granularity_ns in 6.6.
# Lower = more responsive, higher = more throughput

Real-time scheduling for compositor threads

The compositor's render thread should never be preempted by background tasks. Use SCHED_FIFO (fixed-priority, run-to-completion):

#include <sched.h>

struct sched_param param = { .sched_priority = 50 };  // 1-99
sched_setscheduler(0, SCHED_FIFO, &param);
// This thread now preempts ALL CFS/EEVDF tasks
// Only higher-priority SCHED_FIFO threads can preempt it

# Set real-time priority from outside the process
chrt -f 50 -p $(pidof sway)

# Or in the systemd unit:
# [Service]
# CPUSchedulingPolicy=fifo
# CPUSchedulingPriority=50

Warning: a SCHED_FIFO thread that loops indefinitely will lock up the CPU. Use SCHED_DEADLINE (kernel 3.14+) for bounded execution:

struct sched_attr attr = {
    .size = sizeof(attr),
    .sched_policy = SCHED_DEADLINE,
    .sched_runtime  = 5000000,   // 5ms per period
    .sched_deadline = 16666666,  // 16.67ms (60Hz)
    .sched_period   = 16666666,
};
syscall(SYS_sched_setattr, 0, &attr, 0);
// Kernel guarantees 5ms of CPU every 16.67ms

cpuset cgroups: CPU pinning

For a kiosk with dedicated hardware, pin the compositor to specific cores:

# Create a cpuset for the compositor
mkdir /sys/fs/cgroup/compositor
echo "0-1" > /sys/fs/cgroup/compositor/cpuset.cpus    # cores 0-1
echo "0" > /sys/fs/cgroup/compositor/cpuset.mems       # NUMA node 0
echo $SWAY_PID > /sys/fs/cgroup/compositor/cgroup.procs

# Put everything else on remaining cores
echo "2-7" > /sys/fs/cgroup/system.slice/cpuset.cpus

This eliminates cache contention between the compositor and background tasks: cores 0-1 are exclusively for Sway.


3. File Descriptor Internals

struct file

Every open file descriptor points to a struct file in the kernel:

struct file {
    struct path             f_path;     // dentry + vfsmount
    const struct file_operations *f_op; // read, write, mmap, ioctl, poll
    atomic_long_t           f_count;    // reference count
    unsigned int            f_flags;    // O_RDONLY, O_NONBLOCK, etc.
    fmode_t                 f_mode;     // FMODE_READ, FMODE_WRITE
    loff_t                  f_pos;      // current file position
    void                    *private_data; // driver-specific data
    // ...
};

The process's fd table maps integer fds to struct file pointers:

Process A                    Kernel
fd table:                    struct file objects:
  0 → ──────────────────────→ [struct file: /dev/tty, count=3]
  1 → ──────────────────────→        ↑ (same struct file)
  2 → ──────────────────────→        ↑ (dup'd)
  3 → ──────────────────────→ [struct file: /dev/dri/card0, count=1]
  4 → ──────────────────────→ [struct file: anon_inode:[eventfd], count=1]

dup() creates a new fd pointing to the same struct file (increments f_count). fork() copies the fd table, incrementing f_count for each entry. close() decrements f_count; when it reaches 0, the file is released.

SCM_RIGHTS: how Wayland passes file descriptors

The Wayland protocol passes dma-buf file descriptors between client and compositor via Unix domain sockets using SCM_RIGHTS:

// Sending an fd over a Unix socket
struct msghdr msg = {0};
struct cmsghdr *cmsg;
char buf[CMSG_SPACE(sizeof(int))];
int dma_buf_fd = gbm_bo_get_fd(bo);

msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);

cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &dma_buf_fd, sizeof(int));

sendmsg(socket_fd, &msg, 0);

What happens in the kernel:

  1. sendmsg() enters the kernel.
  2. The kernel looks up the sender's struct file for dma_buf_fd.
  3. It creates a new fd in the receiver's fd table pointing to the same struct file.
  4. f_count is incremented (now both processes hold a reference).
  5. The receiver gets a new integer fd (possibly with a different number).

This is why fd-passing is efficient: no data is copied. Both processes share the same kernel object (and for dma-bufs, the same GPU memory). Closing the fd in the sender does not affect the receiver (reference counted).

Why this matters for Wayland: every buffer (wl_buffer) is backed by a dma-buf fd. The client creates it, sends the fd to the compositor, and the compositor imports it as a GPU texture. Zero copies, zero serialization.


4. Futexes: The Foundation of Userspace Synchronization

Every pthread_mutex_lock(), pthread_cond_wait(), and sem_wait() in glibc is built on top of futexes (Fast Userspace muTEXes).

The fast path

// Simplified futex-based mutex (what glibc actually does):

// Lock (fast path: no syscall)
int expected = 0;
if (atomic_compare_exchange(&mutex->state, &expected, 1)) {
    // Got the lock. No kernel involvement.
    return;
}

// Lock (slow path: contention → kernel)
futex(&mutex->state, FUTEX_WAIT, 1, ...);
// Kernel adds this thread to a wait queue keyed by &mutex->state
// Thread sleeps until another thread calls FUTEX_WAKE

// Unlock
mutex->state = 0;
futex(&mutex->state, FUTEX_WAKE, 1, ...);
// Kernel wakes one waiter

The genius of futexes: the common case (no contention) is a single atomic instruction in userspace -- no syscall at all. Only when there is contention does the thread enter the kernel to sleep.

How glibc implements pthread_mutex_lock

The actual glibc implementation uses a three-state mutex:

State   Meaning
0       Unlocked
1       Locked, no waiters
2       Locked, has waiters

// pthread_mutex_lock (simplified from glibc nptl/pthread_mutex_lock.c):

int __pthread_mutex_lock(pthread_mutex_t *mutex) {
    // Fast path: try to go 0 → 1 (unlocked → locked, no waiters).
    // The CAS returns the previous value; 0 means we took the lock.
    if (atomic_compare_exchange_weak(&mutex->__data.__lock, 0, 1) == 0)
        return 0;  // Got the lock, no syscall

    // Slow path: spin briefly, then sleep
    int old = atomic_exchange(&mutex->__data.__lock, 2);  // set "has waiters"
    while (old != 0) {
        futex(&mutex->__data.__lock, FUTEX_WAIT_PRIVATE, 2, NULL);
        old = atomic_exchange(&mutex->__data.__lock, 2);
    }
    return 0;
}

The transition to state 2 ensures that pthread_mutex_unlock always calls FUTEX_WAKE when there are waiters, preventing missed wakeups.

Priority inversion and PI futexes

For real-time compositor threads: a high-priority thread can be blocked by a low-priority thread holding a mutex, while a medium-priority thread runs instead. This is priority inversion (the Mars Pathfinder bug).

Linux provides FUTEX_LOCK_PI (priority inheritance futexes):

// The lock holder's priority is temporarily raised to the
// highest priority among waiters
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);

5. epoll Internals

Data structures

Process calls epoll_create1():
  → Kernel allocates struct eventpoll:
      .rbr:   red-black tree of monitored fds (struct epitem)
      .rdllist: ready list (doubly-linked list of fired epitems)
      .wq:    wait queue (threads blocked in epoll_wait)
      .poll_wait: used for nested epoll

Process calls epoll_ctl(EPOLL_CTL_ADD, fd, event):
  → Kernel creates struct epitem
  → Inserts into the red-black tree (O(log n))
  → Registers a callback on the fd's wait queue:
      when the fd becomes ready, the callback moves the
      epitem to the rdllist and wakes threads in .wq

Process calls epoll_wait():
  → If rdllist is non-empty: return ready events immediately
  → If rdllist is empty: sleep on .wq until a callback fires

Why epoll is O(1) for event delivery

Traditional poll() / select() scan the entire fd set every call: O(n) per call. epoll registers callbacks once (via epoll_ctl), and event delivery is O(1) per ready fd:

  1. A packet arrives on a socket.
  2. The kernel wakes the socket's wait queue.
  3. The epoll callback fires, moving the epitem to the ready list.
  4. The thread sleeping in epoll_wait() wakes up.
  5. Only ready fds are returned -- no scanning.

Thundering herd and EPOLLEXCLUSIVE

When multiple threads call epoll_wait() on the same epoll instance, a single event wakes all of them. Only one can handle the event; the rest immediately sleep again. This wastes CPU.

// Solution: EPOLLEXCLUSIVE (kernel 4.5+)
struct epoll_event ev = {
    .events = EPOLLIN | EPOLLEXCLUSIVE,
    .data.fd = listen_fd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
// Now only ONE waiter is woken per event

Level-triggered vs edge-triggered

Mode                       Behavior                                  Re-arm needed?
Level-triggered (default)  epoll_wait returns if fd is ready         No -- returns again if still ready
Edge-triggered (EPOLLET)   epoll_wait returns when fd becomes ready  Yes -- must drain all data or you miss events

Edge-triggered is faster (fewer epoll_wait returns) but dangerous: if you do not read all available data, you will never be notified again until new data arrives.

// Edge-triggered pattern: the fd must be O_NONBLOCK, and you must
// drain it fully before the next epoll_wait
while (true) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN) break;  // fully drained
    if (n <= 0) break;                      // error or EOF -- handle separately
    process(buf, n);
}

wlroots uses level-triggered epoll for simplicity and correctness (the event loop is not latency-critical to the point where edge-triggered matters).


6. io_uring: The New Async I/O Interface

io_uring (kernel 5.1+) provides true asynchronous I/O via shared-memory ring buffers between userspace and kernel. No syscalls on the fast path.

Architecture

Userspace                    Kernel
┌──────────────────┐        ┌──────────────────┐
│ Submission Queue │ ──────→│ SQ Thread         │
│ (SQE ring)       │        │ (processes SQEs)  │
│                  │        │                   │
│ Completion Queue │ ←──────│ Completion path   │
│ (CQE ring)       │        │ (fills CQEs)      │
└──────────────────┘        └──────────────────┘
     ↕ mmap'd shared memory

// io_uring setup
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);  // 256 SQE entries

// Submit a read (no syscall in SQPOLL mode)
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, 0);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring);  // syscall (or no-op in SQPOLL mode)

// Reap completions
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res;  // bytes read, or -errno
void *data = io_uring_cqe_get_data(cqe);
io_uring_cqe_seen(&ring, cqe);

io_uring vs epoll for compositors

                  epoll                            io_uring
Model             Readiness notification           Completion notification
Syscalls per I/O  1 (epoll_wait) + 1 (read/write)  0-1 (batched submit)
Best for          Many fds, few events             High I/O throughput, batching
Compositor use    Wayland socket, input, timers    Could batch DRM ioctls, file I/O
Maturity          Battle-tested                    Newer, ongoing security hardening

io_uring could replace epoll for compositors, but the benefit is marginal (compositors are not I/O-bound). The real win is for storage-heavy workloads (database, logging).

SQPOLL mode: zero-syscall I/O

struct io_uring_params params = {
    .flags = IORING_SETUP_SQPOLL,  // kernel thread polls the SQ
    .sq_thread_idle = 2000,         // SQ thread sleeps after 2s (ms) idle
};
io_uring_queue_init_params(256, &ring, &params);
// Now io_uring_submit() is a no-op -- the kernel thread picks up SQEs

The kernel thread polls the submission queue, so io_uring_submit() does not need a syscall. This is the fastest path for high-frequency I/O.


7. Namespace Implementation

How namespaces work in the kernel

Every process has a struct nsproxy containing pointers to its namespaces:

struct nsproxy {
    struct uts_namespace   *uts_ns;    // hostname
    struct ipc_namespace   *ipc_ns;    // SysV IPC
    struct mnt_namespace   *mnt_ns;    // mount table
    struct pid_namespace   *pid_ns;    // PID numbering
    struct net             *net_ns;    // network stack
    struct cgroup_namespace *cgroup_ns; // cgroup view
    struct time_namespace  *time_ns;   // clock offsets
};

clone() and unshare()

clone() creates a new process, optionally with new namespaces:

// Create a child process with new PID and network namespaces
int flags = CLONE_NEWPID | CLONE_NEWNET | SIGCHLD;
pid_t child = clone(child_fn, stack + STACK_SIZE, flags, arg);

unshare() creates new namespaces for the calling process:

// New mount namespace takes effect immediately; for CLONE_NEWPID the
// caller itself stays in its original PID namespace
unshare(CLONE_NEWNS | CLONE_NEWPID);
// Next fork() will be PID 1 in the new PID namespace

Namespace creation cost

Namespace   Creation cost   Runtime overhead        Notes
PID         ~5 us           Negligible              Just a new pid_namespace struct
Mount       ~50 us          Negligible after setup  Copies mount table (CoW)
Network     ~100 us         1-5% for network I/O    Creates new network stack
User        ~10 us          Negligible              Enables unprivileged namespaces
UTS         ~2 us           Negligible              Just a hostname string
IPC         ~5 us           Negligible              New SysV IPC namespace
cgroup      ~5 us           Negligible              New cgroup root view
Time        ~2 us           Negligible              Clock offsets (kernel 5.6+)

Network namespaces are the most expensive because they duplicate the entire network stack (routing table, iptables rules, socket hash tables). This is why container networking adds measurable overhead.

User namespaces: unprivileged containers

User namespaces (kernel 3.8+) allow unprivileged users to create all other namespace types. Inside the user namespace, the process is uid 0 (root), but the kernel maps this to an unprivileged uid outside:

# Create a user namespace (no root required)
unshare --user --map-root-user bash
id
# uid=0(root) gid=0(root)    ← inside the namespace
# Actually uid=1000 outside

# Now you can create other namespaces:
unshare --pid --mount --fork bash
# PID 1 in a new PID namespace, as "root" in the user namespace

This is how rootless Podman works: user namespace provides fake root, enabling mount/PID/network namespaces without actual privileges.


8. eBPF from the Kernel Perspective

The verifier algorithm

Before any eBPF program runs, the kernel's verifier analyzes it statically:

eBPF bytecode
  → Directed Acyclic Graph (DAG) check
    (no backward jumps except bounded loops since 5.3)
  → Abstract interpretation
    (track register types and value ranges through every path)
  → Memory safety check
    (all pointer dereferences go through BPF helpers with bounds checks)
  → Stack depth check
    (max 512 bytes of stack per program)
  → Instruction count check
    (max 1 million verified instructions since 5.2)

The verifier tracks register types through every execution path:

Type              Meaning
SCALAR_VALUE      Integer (known range)
PTR_TO_CTX        Pointer to the program context (e.g., struct __sk_buff*)
PTR_TO_MAP_VALUE  Pointer into a BPF map
PTR_TO_STACK      Pointer to the BPF stack
PTR_TO_BTF_ID     Pointer to a kernel struct (with BTF type info)

The verifier rejects any program where a pointer could be:

  • Dereferenced out of bounds
  • Used after its containing map element is freed
  • Confused with a scalar (type confusion)

JIT compilation

After verification, the eBPF bytecode is JIT-compiled to native machine code:

# Check if JIT is enabled
cat /proc/sys/net/core/bpf_jit_enable
# 1 = JIT enabled
# 2 = JIT enabled + emit to /tmp for debugging

# JIT backends: x86_64, arm64, s390x, riscv64, mips, powerpc, loongarch

JIT-compiled BPF programs run at near-native speed. The overhead compared to a native kernel function is ~2-5% (indirect call + stack frame setup).

BPF Type Format (BTF)

BTF is a compact type encoding that enables:

  1. CO-RE (Compile Once, Run Everywhere): BPF programs reference kernel struct fields by name, not offset. The loader (libbpf) relocates offsets at load time using BTF information from the running kernel.

  2. Pretty-printing: bpftool map dump can show map contents with field names instead of raw bytes.

  3. Verifier type checking: the verifier uses BTF to ensure struct field accesses are valid.

# Check if the kernel has BTF (required for CO-RE)
ls -la /sys/kernel/btf/vmlinux
# -r--r--r-- 1 root root 5791432 /sys/kernel/btf/vmlinux

# Inspect BTF information
bpftool btf dump file /sys/kernel/btf/vmlinux format c | head -50
# Shows C struct definitions for all kernel types

BPF program types relevant to compositors

Program type                 Attach point          Use case
BPF_PROG_TYPE_KPROBE         Any kernel function   Trace DRM ioctls, scheduler events
BPF_PROG_TYPE_TRACEPOINT     Static tracepoints    sched:sched_switch, drm:drm_vblank_event
BPF_PROG_TYPE_PERF_EVENT     PMC overflow          Sample CPU cache misses in compositor
BPF_PROG_TYPE_CGROUP_DEVICE  cgroup device access  Control GPU device access per container
BPF_PROG_TYPE_SYSCALL        Direct invocation     Complex map operations from userspace

9. Putting It All Together: A Compositor's Kernel Interaction

Every frame rendered by Sway involves multiple kernel subsystems:

sequenceDiagram
    participant App as Chromium
    participant Comp as Sway
    participant Kernel as Kernel

    Note over App,Kernel: Client renders a frame
    App->>Kernel: DRM ioctl (submit GPU commands)
    Kernel->>App: GPU fence signaled
    App->>Comp: wl_surface.commit (SCM_RIGHTS: dma-buf fd)
    Note over Comp: fd received via Unix socket
    Comp->>Kernel: epoll_wait returns (socket readable)
    Comp->>Kernel: recvmsg + SCM_RIGHTS (receive dma-buf fd)
    Note over Comp: Import buffer, composite
    Comp->>Kernel: EGL import (dma-buf → texture)
    Comp->>Kernel: GL draw calls (composite)
    Comp->>Kernel: drmModeAtomicCommit (submit frame)
    Note over Kernel: Wait for vblank
    Kernel->>Comp: Page-flip event (epoll notification)
    Comp->>App: wl_callback.done (frame callback via sendmsg)

Kernel subsystems touched in a single frame:

  • DRM/KMS: atomic commit, page flip events, vblank handling
  • dma-buf: buffer sharing between GPU, compositor, and VNC
  • Unix sockets: Wayland protocol transport, SCM_RIGHTS for fd passing
  • epoll: event notification for sockets, timers, DRM events
  • Futex: mutex contention in multi-threaded compositor
  • Scheduler: CFS/EEVDF scheduling of compositor vs client threads
  • Virtual memory: mmap for GPU buffers, SHM, shared libraries

What's new (2025--2026)
  • EEVDF scheduler (kernel 6.6) replaced CFS. Improves wake-up latency for interactive tasks by ~30%.
  • io_uring continues to expand: io_uring-based networking (IORING_OP_SEND/RECV) is now production-ready.
  • BPF arena (kernel 6.9): allows BPF programs to allocate memory from a shared arena, enabling more complex data structures.
  • User-space interrupt (uintr, x86): CPU can deliver interrupts directly to userspace, bypassing the kernel. Potential future path for zero-latency IPC.
  • Rust in the kernel: rvkms (virtual KMS driver) merged. Rust abstractions for DRM, network, and filesystem subsystems expanding.

Glossary

Page table
Multi-level tree structure mapping virtual addresses to physical pages. Four levels on x86_64 (PGD → PUD → PMD → PTE).
TLB (Translation Lookaside Buffer)
CPU cache for page table entries. Avoids the 4-level page walk on cache hit.
THP (Transparent Huge Pages)
Kernel feature automatically promoting 4KB pages to 2MB pages. Reduces TLB pressure.
CFS (Completely Fair Scheduler)
Linux's default scheduler (2.6.23--6.5). Red-black tree ordered by virtual runtime.
EEVDF (Earliest Eligible Virtual Deadline First)
CFS replacement (kernel 6.6+). Adds virtual deadline for better latency under load.
SCHED_FIFO
Real-time scheduling class. Fixed priority, run-to-completion. Preempts all CFS/EEVDF tasks.
Futex (Fast Userspace Mutex)
Kernel primitive for userspace synchronization. Fast path is pure userspace (atomic CAS); slow path sleeps in kernel.
epoll
Scalable I/O event notification. O(1) event delivery via callbacks and a ready list.
EPOLLEXCLUSIVE
epoll flag preventing thundering herd: only one waiter is woken per event.
io_uring
Async I/O interface (kernel 5.1+). Shared-memory submission/completion rings. Zero-syscall fast path.
SCM_RIGHTS
Socket control message for passing file descriptors between processes via Unix sockets.
nsproxy
Kernel struct containing a process's namespace references. Manipulated by clone() and unshare().
BTF (BPF Type Format)
Compact type encoding for eBPF programs. Enables CO-RE (Compile Once, Run Everywhere) and pretty-printing.
CO-RE (Compile Once, Run Everywhere)
eBPF portability mechanism. Programs reference struct fields by name; libbpf relocates offsets using BTF.
MADV_DONTNEED
madvise flag: immediately release pages (zeroed on next access). Used for security-sensitive data.
MADV_FREE
madvise flag: mark pages as reclaimable (may not be zeroed). Used for allocator performance.