Linux Kernel Internals for Platform Engineers¶
This is the knowledge that separates a senior Linux platform engineer from
someone who can run apt install. Not a kernel development tutorial, but the
internals that matter when you build on top of the kernel: virtual memory
(page tables, TLB, huge pages, mmap semantics), the CPU scheduler (CFS, EEVDF,
real-time classes, cpuset cgroups), file descriptor internals and SCM_RIGHTS
(why Wayland's fd-passing is efficient), futexes (the foundation of every
mutex), epoll internals (red-black trees, wait queues, thundering herd),
io_uring (submission/completion rings), namespace implementation, and eBPF
from the kernel's perspective.
1. Virtual Memory¶
Every process sees a flat 48-bit (or 57-bit with 5-level paging) virtual address space. The kernel manages the mapping from virtual to physical via page tables and the MMU (Memory Management Unit).
Page table walk¶
Virtual address (48-bit, 4-level paging):
┌──────┬──────┬──────┬──────┬──────────┐
│ PGD │ PUD │ PMD │ PTE │ Offset │
│ 9bit │ 9bit │ 9bit │ 9bit │ 12bit │
└──┬───┴──┬───┴──┬───┴──┬───┴────┬─────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
PGD → PUD → PMD → PTE → Physical page + offset
table table table table
Each level is a 4KB page containing 512 entries (2^9). Four levels give 2^48 = 256 TB of virtual address space. The 12-bit offset addresses bytes within a 4KB page.
Cost: a TLB miss requires four memory reads to walk the page tables. This is why TLB coverage matters enormously.
TLB (Translation Lookaside Buffer)¶
The TLB caches recent virtual-to-physical translations:
| TLB level | Entries (typical) | Latency |
|---|---|---|
| L1 DTLB | 64-128 | 1 cycle |
| L1 ITLB | 64-128 | 1 cycle |
| L2 STLB | 1024-2048 | 7-10 cycles |
| TLB miss (page walk) | - | 20-100 cycles |
With 4KB pages, 2048 STLB entries cover 8MB. For a compositor with a 200MB working set, most accesses miss the TLB. This is where huge pages matter.
Huge pages¶
| Page size | TLB entries for 1GB | Available as |
|---|---|---|
| 4KB | 262,144 | Default |
| 2MB (PMD-level) | 512 | THP or hugetlbfs |
| 1GB (PUD-level) | 1 | hugetlbfs only |
Transparent Huge Pages (THP): the kernel automatically promotes 4KB pages to 2MB pages when it finds 512 contiguous pages with compatible attributes. No application changes needed.
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# For a compositor: "madvise" is best -- let the application opt in
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Application-level opt-in
madvise(addr, length, MADV_HUGEPAGE); // suggest THP
When THP hurts: THP compaction can cause latency spikes (the kernel pauses to defragment physical memory). For latency-sensitive compositors:
# Disable THP compaction (prevent latency spikes)
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
# "defer+madvise": background compaction only, direct reclaim only for
# explicit madvise(MADV_HUGEPAGE) regions
mmap semantics¶
mmap() is the foundation of everything: file I/O, shared memory, GPU
buffers, anonymous allocations.
// Key flags and their semantics:
// MAP_PRIVATE: copy-on-write. Reads from file, writes go to anonymous pages.
// Used for: loading shared libraries (.so), private file mappings
void *lib = mmap(NULL, len, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, 0);
// MAP_SHARED: writes visible to other processes and written back to file.
// Used for: shared memory (shmem), dma-buf, Wayland SHM buffers
void *shm = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, shm_fd, 0);
// MAP_ANONYMOUS: no file backing, zero-filled pages.
// Used for: heap (malloc), thread stacks
void *heap = mmap(NULL, len, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// MAP_POPULATE: fault all pages immediately (no lazy allocation).
// Used for: real-time paths that cannot tolerate page faults
void *rt = mmap(NULL, len, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
MADV_DONTNEED vs MADV_FREE¶
Both tell the kernel that pages are no longer needed, but with critical differences:
| | MADV_DONTNEED | MADV_FREE |
|---|---|---|
| Behavior | Immediately unmaps pages. Next access → page fault, zero-filled page | Marks pages as reclaimable. Next access returns existing data if not yet reclaimed |
| Performance | Expensive (page table updates, TLB flush) | Cheap (just marks pages in the page table) |
| Determinism | Deterministic (always zeroed on next access) | Non-deterministic (may or may not be reclaimed) |
| Used by | free() in glibc, Go runtime | jemalloc, tcmalloc |
| Kiosk relevance | Use for security (wipe sensitive data) | Use for performance (recycle allocator pages) |
// Security: wipe a buffer containing credentials
madvise(secret_buf, len, MADV_DONTNEED); // pages dropped; reads return zeros
// Performance: return allocator pages to kernel without zeroing
madvise(free_pages, len, MADV_FREE); // pages recyclable but not zeroed
2. The CPU Scheduler¶
CFS (Completely Fair Scheduler)¶
CFS (kernel 2.6.23 through 6.5) uses a red-black tree of tasks ordered by virtual runtime (vruntime). The task with the smallest vruntime runs next.
RB-tree (ordered by vruntime)
┌──────────────┐
│ task C (5ms) │
└──┬───────┬───┘
┌─────┘ └─────┐
┌──────┴──────┐ ┌──────┴──────┐
│ task A (3ms) │ │ task D (8ms) │
└──────────────┘ └──────────────┘
← smallest vruntime = runs next
vruntime increases as a task runs. Tasks with higher nice values
accumulate vruntime faster (get less CPU). The key insight: CFS does not
use fixed time slices. It dynamically computes a slice based on the number
of runnable tasks and their weights.
EEVDF (Earliest Eligible Virtual Deadline First)¶
EEVDF replaced CFS in kernel 6.6. The motivation: CFS was fair in the long run but could starve short-running tasks in the short term (a task waking from sleep had to wait for previously running tasks to finish their slices).
EEVDF adds a virtual deadline to each task. A task is "eligible" when its vruntime is not ahead of the fair share. Among eligible tasks, the one with the earliest deadline runs first.
| Task | Eligible? | Virtual deadline | Status |
|---|---|---|---|
| A | yes | 15ms | ← runs next (earliest deadline) |
| B | yes | 18ms | waiting |
| C | no | 12ms | already ahead of fair share |
| D | yes | 20ms | waiting |
Practical impact: EEVDF improves latency for interactive tasks (the compositor and input handling threads) without explicit tuning. Wake-up latency dropped by ~30% in real-world tests.
# Check which scheduler is active
cat /sys/kernel/debug/sched/debug | head -5
# On 6.6+: EEVDF fields such as "deadline" and "slice" appear in the output
# Tune EEVDF: base time slice (replaces CFS's min_granularity_ns)
cat /sys/kernel/debug/sched/base_slice_ns
# Default: 750000 (750us). Lower = more responsive, higher = more throughput
Real-time scheduling for compositor threads¶
The compositor's render thread should never be preempted by background
tasks. Use SCHED_FIFO (fixed-priority, run-to-completion):
#include <sched.h>
struct sched_param param = { .sched_priority = 50 }; // 1-99
sched_setscheduler(0, SCHED_FIFO, &param);
// This thread now preempts ALL CFS/EEVDF tasks
// Only higher-priority SCHED_FIFO threads can preempt it
# Set real-time priority from outside the process
chrt -f 50 -p $(pidof sway)
# Or in the systemd unit:
# [Service]
# CPUSchedulingPolicy=fifo
# CPUSchedulingPriority=50
Warning: a SCHED_FIFO thread that loops indefinitely will lock up the
CPU. Use SCHED_DEADLINE (kernel 3.14+) for bounded execution:
struct sched_attr attr = {
.size = sizeof(attr),
.sched_policy = SCHED_DEADLINE,
.sched_runtime = 5000000, // 5ms per period
.sched_deadline = 16666666, // 16.67ms (60Hz)
.sched_period = 16666666,
};
syscall(SYS_sched_setattr, 0, &attr, 0);
// Kernel guarantees 5ms of CPU every 16.67ms
cpuset cgroups: CPU pinning¶
For a kiosk with dedicated hardware, pin the compositor to specific cores:
# Enable the cpuset controller for child cgroups (cgroup v2), then create one
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/compositor
echo "0-1" > /sys/fs/cgroup/compositor/cpuset.cpus  # cores 0-1
echo "0" > /sys/fs/cgroup/compositor/cpuset.mems    # NUMA node 0
echo $SWAY_PID > /sys/fs/cgroup/compositor/cgroup.procs
# Put everything else on remaining cores
echo "2-7" > /sys/fs/cgroup/system.slice/cpuset.cpus
This eliminates cache contention between the compositor and background tasks: cores 0-1 are exclusively for Sway.
3. File Descriptor Internals¶
struct file¶
Every open file descriptor points to a struct file in the kernel:
struct file {
struct path f_path; // dentry + vfsmount
const struct file_operations *f_op; // read, write, mmap, ioctl, poll
atomic_long_t f_count; // reference count
unsigned int f_flags; // O_RDONLY, O_NONBLOCK, etc.
fmode_t f_mode; // FMODE_READ, FMODE_WRITE
loff_t f_pos; // current file position
void *private_data; // driver-specific data
// ...
};
The process's fd table maps integer fds to struct file pointers:
Process A Kernel
fd table: struct file objects:
0 → ──────────────────────→ [struct file: /dev/tty, count=2]
1 → ──────────────────────→ ↑ (same struct file)
2 → ──────────────────────→ ↑ (dup'd)
3 → ──────────────────────→ [struct file: /dev/dri/card0, count=1]
4 → ──────────────────────→ [struct file: anon_inode:[eventfd], count=1]
dup() creates a new fd pointing to the same struct file (increments
f_count). fork() copies the fd table, incrementing f_count for each
entry. close() decrements f_count; when it reaches 0, the file is
released.
SCM_RIGHTS: how Wayland passes file descriptors¶
The Wayland protocol passes dma-buf file descriptors between client and
compositor via Unix domain sockets using SCM_RIGHTS:
// Sending an fd over a Unix socket
struct msghdr msg = {0};
struct cmsghdr *cmsg;
char buf[CMSG_SPACE(sizeof(int))];
char dummy = 0; // SCM_RIGHTS must ride along with at least 1 byte of data
struct iovec io = { .iov_base = &dummy, .iov_len = 1 };
int dma_buf_fd = gbm_bo_get_fd(bo);
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &dma_buf_fd, sizeof(int));
sendmsg(socket_fd, &msg, 0);
What happens in the kernel:
- sendmsg() enters the kernel.
- The kernel looks up the sender's struct file for dma_buf_fd.
- When the receiver calls recvmsg(), the kernel creates a new fd in the receiver's fd table pointing to the same struct file.
- f_count is incremented (now both processes hold a reference).
- The receiver gets a new integer fd (possibly with a different number).
This is why fd-passing is efficient: no data is copied. Both processes share the same kernel object (and for dma-bufs, the same GPU memory). Closing the fd in the sender does not affect the receiver (reference counted).
Why this matters for Wayland: every buffer (wl_buffer) is backed by a
dma-buf fd. The client creates it, sends the fd to the compositor, and the
compositor imports it as a GPU texture. Zero copies, zero serialization.
4. Futexes: The Foundation of Userspace Synchronization¶
Every pthread_mutex_lock(), pthread_cond_wait(), and sem_wait() in
glibc is built on top of futexes (Fast Userspace muTEXes).
The fast path¶
// Simplified futex-based mutex (what glibc actually does):
// Lock (fast path: no syscall)
int expected = 0;
if (atomic_compare_exchange(&mutex->state, &expected, 1)) {
// Got the lock. No kernel involvement.
return;
}
// Lock (slow path: contention → kernel)
futex(&mutex->state, FUTEX_WAIT, 1, ...);
// Kernel adds this thread to a wait queue keyed by &mutex->state
// Thread sleeps until another thread calls FUTEX_WAKE
// Unlock
mutex->state = 0;
futex(&mutex->state, FUTEX_WAKE, 1, ...);
// Kernel wakes one waiter
The genius of futexes: the common case (no contention) is a single atomic instruction in userspace -- no syscall at all. Only when there is contention does the thread enter the kernel to sleep.
How glibc implements pthread_mutex_lock¶
The actual glibc implementation uses a three-state mutex:
| State | Meaning |
|---|---|
| 0 | Unlocked |
| 1 | Locked, no waiters |
| 2 | Locked, has waiters |
// pthread_mutex_lock (simplified from glibc nptl/pthread_mutex_lock.c):
int __pthread_mutex_lock(pthread_mutex_t *mutex) {
// Fast path: try to go 0 → 1 (unlocked → locked, no waiters)
if (atomic_compare_exchange_weak(&mutex->__data.__lock, 0, 1) == 0)
return 0; // Got the lock, no syscall
// Slow path: spin briefly, then sleep
int old = atomic_exchange(&mutex->__data.__lock, 2); // set "has waiters"
while (old != 0) {
futex(&mutex->__data.__lock, FUTEX_WAIT_PRIVATE, 2, NULL);
old = atomic_exchange(&mutex->__data.__lock, 2);
}
return 0;
}
The transition to state 2 ensures that pthread_mutex_unlock always calls
FUTEX_WAKE when there are waiters, preventing missed wakeups.
Priority inversion and PI futexes¶
For real-time compositor threads: a high-priority thread can be blocked by a low-priority thread holding a mutex, while a medium-priority thread runs instead. This is priority inversion (the Mars Pathfinder bug).
Linux provides FUTEX_LOCK_PI (priority inheritance futexes):
// The lock holder's priority is temporarily raised to the
// highest priority among waiters
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
5. epoll Internals¶
Data structures¶
Process calls epoll_create1():
→ Kernel allocates struct eventpoll:
.rbr: red-black tree of monitored fds (struct epitem)
.rdllist: ready list (doubly-linked list of fired epitems)
.wq: wait queue (threads blocked in epoll_wait)
.poll_wait: used for nested epoll
Process calls epoll_ctl(EPOLL_CTL_ADD, fd, event):
→ Kernel creates struct epitem
→ Inserts into the red-black tree (O(log n))
→ Registers a callback on the fd's wait queue:
when the fd becomes ready, the callback moves the
epitem to the rdllist and wakes threads in .wq
Process calls epoll_wait():
→ If rdllist is non-empty: return ready events immediately
→ If rdllist is empty: sleep on .wq until a callback fires
Why epoll is O(1) for event delivery¶
Traditional poll() / select() scan the entire fd set every call:
O(n) per call. epoll registers callbacks once (via epoll_ctl), and
event delivery is O(1) per ready fd:
- A packet arrives on a socket.
- The kernel wakes the socket's wait queue.
- The epoll callback fires, moving the epitem to the ready list.
- The thread sleeping in epoll_wait() wakes up.
- Only ready fds are returned -- no scanning.
Thundering herd and EPOLLEXCLUSIVE¶
When multiple threads call epoll_wait() on the same epoll instance, a
single event wakes all of them. Only one can handle the event; the
rest immediately sleep again. This wastes CPU.
// Solution: EPOLLEXCLUSIVE (kernel 4.5+)
struct epoll_event ev = {
.events = EPOLLIN | EPOLLEXCLUSIVE,
.data.fd = listen_fd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
// Now only ONE waiter is woken per event
Level-triggered vs edge-triggered¶
| Mode | Behavior | Re-arm needed? |
|---|---|---|
| Level-triggered (default) | epoll_wait returns if fd is ready | No -- returns again if still ready |
| Edge-triggered (EPOLLET) | epoll_wait returns when fd becomes ready | Yes -- must drain all data or you miss events |
Edge-triggered is faster (fewer epoll_wait returns) but dangerous: if you do not read all available data, you will never be notified again until new data arrives.
// Edge-triggered pattern: the fd must be O_NONBLOCK; drain until EAGAIN
for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN) break; // fully drained
    if (n <= 0) break;                     // EOF (0) or real error (-1)
    process(buf, n);
}
wlroots uses level-triggered epoll for simplicity and correctness (the event loop is not latency-critical to the point where edge-triggered matters).
6. io_uring: The New Async I/O Interface¶
io_uring (kernel 5.1+) provides true asynchronous I/O via shared-memory ring buffers between userspace and kernel. Syscalls are batched, and in SQPOLL mode eliminated entirely on the fast path.
Architecture¶
Userspace Kernel
┌──────────────────┐ ┌──────────────────┐
│ Submission Queue │ ──────→│ SQ Thread │
│ (SQE ring) │ │ (processes SQEs) │
│ │ │ │
│ Completion Queue │ ←──────│ Completion path │
│ (CQE ring) │ │ (fills CQEs) │
└──────────────────┘ └──────────────────┘
↕ mmap'd shared memory
// io_uring setup
struct io_uring ring;
io_uring_queue_init(256, &ring, 0); // 256 SQE entries
// Submit a read (no syscall in SQPOLL mode)
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, 0);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring); // syscall (or no-op in SQPOLL mode)
// Reap completions
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res; // bytes read, or -errno
void *data = io_uring_cqe_get_data(cqe);
io_uring_cqe_seen(&ring, cqe);
io_uring vs epoll for compositors¶
| | epoll | io_uring |
|---|---|---|
| Model | Readiness notification | Completion notification |
| Syscalls per I/O | 1 (epoll_wait) + 1 (read/write) | 0-1 (batched submit) |
| Best for | Many fds, few events | High I/O throughput, batching |
| Compositor use | Wayland socket, input, timers | Could batch DRM ioctls, file I/O |
| Maturity | Battle-tested | Newer, ongoing security hardening |
io_uring could replace epoll for compositors, but the benefit is marginal (compositors are not I/O-bound). The real win is for storage-heavy workloads (database, logging).
SQPOLL mode: zero-syscall I/O¶
struct io_uring_params params = {
    .flags = IORING_SETUP_SQPOLL,  // kernel thread polls the SQ
    .sq_thread_idle = 2000,        // SQ thread sleeps after 2s idle
};
io_uring_queue_init_params(256, &ring, &params);
// Now io_uring_submit() is a no-op -- the kernel thread picks up SQEs
The kernel thread polls the submission queue, so io_uring_submit() does
not need a syscall. This is the fastest path for high-frequency I/O.
7. Namespace Implementation¶
How namespaces work in the kernel¶
Every process has a struct nsproxy containing pointers to its namespaces:
struct nsproxy {
struct uts_namespace *uts_ns; // hostname
struct ipc_namespace *ipc_ns; // SysV IPC
struct mnt_namespace *mnt_ns; // mount table
struct pid_namespace *pid_ns; // PID numbering
struct net *net_ns; // network stack
struct cgroup_namespace *cgroup_ns; // cgroup view
struct time_namespace *time_ns; // clock offsets
};
clone() and unshare()¶
clone() creates a new process, optionally with new namespaces:
// Create a child process with new PID and network namespaces
int flags = CLONE_NEWPID | CLONE_NEWNET | SIGCHLD;
pid_t child = clone(child_fn, stack + STACK_SIZE, flags, arg);
unshare() creates new namespaces for the calling process:
// Move the current process into new mount and PID namespaces
unshare(CLONE_NEWNS | CLONE_NEWPID);
// Next fork() will be PID 1 in the new PID namespace
Namespace creation cost¶
| Namespace | Creation cost | Runtime overhead | Notes |
|---|---|---|---|
| PID | ~5 us | Negligible | Just a new pid_namespace struct |
| Mount | ~50 us | Negligible after setup | Copies mount table (COW) |
| Network | ~100 us | 1-5% for network I/O | Creates new network stack |
| User | ~10 us | Negligible | Enables unprivileged namespaces |
| UTS | ~2 us | Negligible | Just a hostname string |
| IPC | ~5 us | Negligible | New SysV IPC namespace |
| cgroup | ~5 us | Negligible | New cgroup root view |
| Time | ~2 us | Negligible | Clock offset (kernel 5.6+) |
Network namespaces are the most expensive because they duplicate the entire network stack (routing table, iptables rules, socket hash tables). This is why container networking adds measurable overhead.
User namespaces: unprivileged containers¶
User namespaces (kernel 3.8+) allow unprivileged users to create all other namespace types. Inside the user namespace, the process is uid 0 (root), but the kernel maps this to an unprivileged uid outside:
# Create a user namespace (no root required)
unshare --user --map-root-user bash
id
# uid=0(root) gid=0(root) ← inside the namespace
# Actually uid=1000 outside
# Now you can create other namespaces:
unshare --pid --mount --fork bash
# PID 1 in a new PID namespace, as "root" in the user namespace
This is how rootless Podman works: user namespace provides fake root, enabling mount/PID/network namespaces without actual privileges.
8. eBPF from the Kernel Perspective¶
The verifier algorithm¶
Before any eBPF program runs, the kernel's verifier analyzes it statically:
eBPF bytecode
→ Directed Acyclic Graph (DAG) check
(no backward jumps except bounded loops since 5.3)
→ Abstract interpretation
(track register types and value ranges through every path)
→ Memory safety check
(all pointer dereferences go through BPF helpers with bounds checks)
→ Stack depth check
(max 512 bytes of stack per program)
→ Instruction count check
(max 1 million verified instructions since 5.2)
The verifier tracks register types through every execution path:
| Type | Meaning |
|---|---|
| SCALAR_VALUE | Integer (known range) |
| PTR_TO_CTX | Pointer to the program context (e.g., struct __sk_buff*) |
| PTR_TO_MAP_VALUE | Pointer into a BPF map |
| PTR_TO_STACK | Pointer to the BPF stack |
| PTR_TO_BTF_ID | Pointer to a kernel struct (with BTF type info) |
The verifier rejects any program where a pointer could be:
- Dereferenced out of bounds
- Used after its containing map element is freed
- Confused with a scalar (type confusion)
JIT compilation¶
After verification, the eBPF bytecode is JIT-compiled to native machine code:
# Check if JIT is enabled
cat /proc/sys/net/core/bpf_jit_enable
# 1 = JIT enabled
# 2 = JIT enabled + debug output to the kernel log
# JIT backends: x86_64, arm64, s390x, riscv64, mips, powerpc, loongarch
JIT-compiled BPF programs run at near-native speed. The overhead compared to a native kernel function is ~2-5% (indirect call + stack frame setup).
BPF Type Format (BTF)¶
BTF is a compact type encoding that enables:
- CO-RE (Compile Once, Run Everywhere): BPF programs reference kernel struct fields by name, not offset. The loader (libbpf) relocates offsets at load time using BTF information from the running kernel.
- Pretty-printing: bpftool map dump can show map contents with field names instead of raw bytes.
- Verifier type checking: the verifier uses BTF to ensure struct field accesses are valid.
# Check if the kernel has BTF (required for CO-RE)
ls -la /sys/kernel/btf/vmlinux
# -r--r--r-- 1 root root 5791432 /sys/kernel/btf/vmlinux
# Inspect BTF information
bpftool btf dump file /sys/kernel/btf/vmlinux format c | head -50
# Shows C struct definitions for all kernel types
BPF program types relevant to compositors¶
| Program type | Attach point | Use case |
|---|---|---|
| BPF_PROG_TYPE_KPROBE | Any kernel function | Trace DRM ioctls, scheduler events |
| BPF_PROG_TYPE_TRACEPOINT | Static tracepoints | sched:sched_switch, drm:drm_vblank_event |
| BPF_PROG_TYPE_PERF_EVENT | PMC overflow | Sample CPU cache misses in compositor |
| BPF_PROG_TYPE_CGROUP_DEVICE | cgroup device access | Control GPU device access per container |
| BPF_PROG_TYPE_SYSCALL | Direct invocation | Complex map operations from userspace |
9. Putting It All Together: A Compositor's Kernel Interaction¶
Every frame rendered by Sway involves multiple kernel subsystems:
sequenceDiagram
participant App as Chromium
participant Comp as Sway
participant Kernel as Kernel
Note over App,Kernel: Client renders a frame
App->>Kernel: DRM ioctl (submit GPU commands)
Kernel->>App: GPU fence signaled
App->>Comp: wl_surface.commit (SCM_RIGHTS: dma-buf fd)
Note over Comp: fd received via Unix socket
Comp->>Kernel: epoll_wait returns (socket readable)
Comp->>Kernel: recvmsg + SCM_RIGHTS (receive dma-buf fd)
Note over Comp: Import buffer, composite
Comp->>Kernel: EGL import (dma-buf → texture)
Comp->>Kernel: GL draw calls (composite)
Comp->>Kernel: drmModeAtomicCommit (submit frame)
Note over Kernel: Wait for vblank
Kernel->>Comp: Page-flip event (epoll notification)
Comp->>App: wl_callback.done (frame callback via sendmsg)
Kernel subsystems touched in a single frame:
- DRM/KMS: atomic commit, page flip events, vblank handling
- dma-buf: buffer sharing between GPU, compositor, and VNC
- Unix sockets: Wayland protocol transport, SCM_RIGHTS for fd passing
- epoll: event notification for sockets, timers, DRM events
- Futex: mutex contention in multi-threaded compositor
- Scheduler: CFS/EEVDF scheduling of compositor vs client threads
- Virtual memory: mmap for GPU buffers, SHM, shared libraries
What's new (2025--2026)
- EEVDF scheduler (kernel 6.6) replaced CFS. Improves wake-up latency for interactive tasks by ~30%.
- io_uring continues to expand: io_uring-based networking (IORING_OP_SEND/IORING_OP_RECV) is now production-ready.
- BPF arena (kernel 6.9): allows BPF programs to allocate memory from a shared arena, enabling more complex data structures.
- User-space interrupts (uintr, x86): the CPU can deliver interrupts directly to userspace, bypassing the kernel. Potential future path for zero-latency IPC.
- Rust in the kernel: rvkms (virtual KMS driver) merged. Rust abstractions for DRM, network, and filesystem subsystems are expanding.
Glossary
- Page table
- Multi-level tree structure mapping virtual addresses to physical pages. Four levels on x86_64 (PGD → PUD → PMD → PTE).
- TLB (Translation Lookaside Buffer)
- CPU cache for page table entries. Avoids the 4-level page walk on cache hit.
- THP (Transparent Huge Pages)
- Kernel feature automatically promoting 4KB pages to 2MB pages. Reduces TLB pressure.
- CFS (Completely Fair Scheduler)
- Linux's default scheduler (2.6.23--6.5). Red-black tree ordered by virtual runtime.
- EEVDF (Earliest Eligible Virtual Deadline First)
- CFS replacement (kernel 6.6+). Adds virtual deadline for better latency under load.
- SCHED_FIFO
- Real-time scheduling class. Fixed priority, run-to-completion. Preempts all CFS/EEVDF tasks.
- Futex (Fast Userspace Mutex)
- Kernel primitive for userspace synchronization. Fast path is pure userspace (atomic CAS); slow path sleeps in kernel.
- epoll
- Scalable I/O event notification. O(1) event delivery via callbacks and a ready list.
- EPOLLEXCLUSIVE
- epoll flag preventing thundering herd: only one waiter is woken per event.
- io_uring
- Async I/O interface (kernel 5.1+). Shared-memory submission/completion rings. Zero-syscall fast path.
- SCM_RIGHTS
- Socket control message for passing file descriptors between processes via Unix sockets.
- nsproxy
- Kernel struct containing a process's namespace references. Manipulated by clone() and unshare().
- BTF (BPF Type Format)
- Compact type encoding for eBPF programs. Enables CO-RE (Compile Once, Run Everywhere) and pretty-printing.
- CO-RE (Compile Once, Run Everywhere)
- eBPF portability mechanism. Programs reference struct fields by name; libbpf relocates offsets using BTF.
- MADV_DONTNEED
- madvise flag: immediately release pages (zeroed on next access). Used for security-sensitive data.
- MADV_FREE
- madvise flag: mark pages as reclaimable (may not be zeroed). Used for allocator performance.