Practical eBPF/bpftrace for Linux Systems Debugging¶
The job posting requires "performance optimization using tools like perf,
bpftrace, valgrind, and flamegraphs." This article goes deep on eBPF and
bpftrace -- the kernel virtual machine, the tracing language, and real
compositor/graphics recipes you would use daily. If you already know strace,
this is the next level: safe, programmable, near-zero-overhead kernel
instrumentation that can answer questions strace cannot.
1. What eBPF Actually Is¶
The kernel virtual machine¶
eBPF (extended Berkeley Packet Filter) is an in-kernel virtual machine that executes sandboxed programs attached to kernel and userspace events. Unlike kernel modules, eBPF programs:
- Cannot crash the kernel. A static verifier rejects unsafe programs before they run.
- Cannot loop forever. The verifier proves bounded execution for every code path.
- Cannot access arbitrary memory. All pointer dereferences go through helper functions that perform bounds checking.
userspace kernel
┌────────────┐ ┌──────────────────────────────┐
│ bpftrace │ bpf() syscall │ Verifier │
│ / libbpf │ ───────────────>│ ↓ (rejects unsafe code) │
│ │ │ JIT compiler │
│ │ │ ↓ (bytecode → native x86) │
│ │ maps (shared) │ Attached program │
│ │ <═══════════════>│ (runs at probe point) │
└────────────┘ └──────────────────────────────┘
Verification in detail¶
When you load an eBPF program via the bpf() syscall, the kernel verifier
performs static analysis on the bytecode before it ever runs:
- Control-flow graph analysis -- builds a DAG of all basic blocks, rejecting back-edges unless the loop is provably bounded.
- Symbolic execution -- walks every possible path, tracking register types and value ranges.
- Type safety -- each register is tagged as
SCALAR,PTR_TO_MAP_VALUE,PTR_TO_CTX,PTR_TO_STACK, etc. The verifier rejects type-confused operations (e.g., using a scalar as a pointer). - Bounds checking -- pointer arithmetic must stay within proven bounds.
Accessing
map_value + offsetis only allowed ifoffset < value_size. - Resource safety -- acquired spinlocks must be released on every path, reference-counted objects must be freed, and no sleeping in atomic contexts.
If verification passes, the JIT compiler translates BPF bytecode into native machine code (x86-64, ARM64, etc.), optionally applying constant blinding to harden against JIT-spray attacks. The result runs at near-native speed.
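You can watch both stages from the shell with bpftool (the program ID below is illustrative; take a real one from `prog show`):

```
# List loaded eBPF programs: ID, type, attach point, translated/JITed sizes
sudo bpftool prog show

# Dump the verified BPF bytecode for program ID 42 (illustrative ID)
sudo bpftool prog dump xlated id 42

# Dump the JIT-compiled native instructions for the same program
sudo bpftool prog dump jited id 42
```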
How it differs from kernel modules¶
| Aspect | Kernel module | eBPF program |
|---|---|---|
| Safety | Can panic the kernel | Verified before execution |
| Loading | insmod, requires root + signing | bpf() syscall, CAP_BPF |
| Crash risk | Full kernel crash on bug | Program rejected or safely terminated |
| Overhead | Arbitrary | Bounded instruction count, no sleeping |
| Deployment | Recompile for each kernel | CO-RE (Compile Once, Run Everywhere) |
| Unloading | Reference counting issues | Automatic cleanup on fd close |
Maps: the shared data structures¶
BPF maps are kernel-resident key-value stores shared between eBPF programs and userspace. They are the primary mechanism for aggregating data, passing state between probe firings, and exporting results.
| Map type | Use case |
|---|---|
| BPF_MAP_TYPE_HASH | General key-value store (e.g., tracking per-PID timestamps) |
| BPF_MAP_TYPE_ARRAY | Fixed-size indexed array (fast lookup by integer key) |
| BPF_MAP_TYPE_PERF_EVENT_ARRAY | Stream events to userspace ring buffer |
| BPF_MAP_TYPE_RINGBUF | Modern replacement for the perf event array (single ring, no per-CPU waste) |
| BPF_MAP_TYPE_LRU_HASH | Auto-evicting hash for bounded memory use |
| BPF_MAP_TYPE_STACK_TRACE | Store kernel/user stack traces by ID |
Map lookups are RCU-protected, per-CPU variants avoid cross-CPU contention entirely, and every map is strongly typed at creation time. Maps survive across probe invocations, enabling stateful analysis like latency histograms.
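Maps held by a running tool can be inspected from the shell with bpftool (the map ID below is illustrative; take a real one from `map list`):

```
# List all BPF maps currently held by running programs or pinned to bpffs
sudo bpftool map list

# Dump the key/value pairs of map ID 17 (illustrative ID)
sudo bpftool map dump id 17
```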
2. The bpftrace Language¶
bpftrace is a high-level tracing language that compiles to eBPF bytecode. Think of it as awk for kernel tracing: probes are patterns, actions are blocks.
Program structure¶
A program can have multiple probe blocks. Two special probes run at program start and end:
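A minimal sketch of that structure (BEGIN and END are the special probes; any @ maps not cleared by END are printed automatically on exit):

```
#!/usr/bin/env bpftrace

BEGIN {
    printf("Tracing vfs_read... hit Ctrl-C to end.\n");
}

kprobe:vfs_read {
    @reads[comm] = count();   // per-process read counts, kept in a map
}

END {
    // @reads is printed automatically here because we do not clear it
    printf("Done.\n");
}
```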
Probe types¶
| Probe | Syntax | Attaches to |
|---|---|---|
| kprobe | kprobe:function_name | Entry of any kernel function |
| kretprobe | kretprobe:function_name | Return of any kernel function |
| uprobe | uprobe:/path/to/lib:function | Entry of a userspace function |
| uretprobe | uretprobe:/path/to/lib:function | Return of a userspace function |
| tracepoint | tracepoint:category:name | Kernel static tracepoint |
| usdt | usdt:/path/to/bin:provider:name | Userspace static probe (DTrace-style) |
| profile | profile:hz:99 | Timed sampling on every CPU |
| interval | interval:s:1 | Periodic timer (one CPU only) |
| software | software:page-faults:1 | Software performance event |
| hardware | hardware:cache-misses:1000 | Hardware PMC event |
Listing available probes:
# List all tracepoints matching "drm"
bpftrace -l 'tracepoint:drm:*'
# List all kernel functions matching "vblank"
bpftrace -l 'kprobe:*vblank*'
# List uprobes in libwayland
bpftrace -l 'uprobe:/usr/lib/x86_64-linux-gnu/libwayland-server.so:*'
Built-in variables¶
| Variable | Meaning |
|---|---|
| pid | Process ID |
| tid | Thread ID |
| comm | Process name (truncated to 16 chars) |
| nsecs | Nanosecond timestamp (monotonic) |
| elapsed | Nanoseconds since program start |
| cpu | Current CPU number |
| uid | User ID |
| arg0..argN | Function arguments (kprobe/uprobe) |
| retval | Return value (kretprobe/uretprobe) |
| args->field | Tracepoint argument struct field |
| curtask | Pointer to current task_struct |
| kstack | Kernel stack trace |
| ustack | Userspace stack trace |
Filters (predicates)¶
Filters appear between / delimiters and act as guards:
// Only trace the compositor process
kprobe:drm_atomic_commit /comm == "sway"/ {
printf("atomic commit by sway\n");
}
// Only trace PIDs above 1000 (skip kernel threads)
tracepoint:syscalls:sys_enter_write /pid > 1000/ {
@writes[comm] = count();
}
Maps and aggregations¶
Maps are bpftrace's most powerful feature. They are global associative
arrays prefixed with @ that persist across probe firings:
// Count by key
@syscalls[comm, probe] = count();
// Sum values
@bytes_read[comm] = sum(args->count);
// Track min/max
@latency_min = min($delta);
@latency_max = max($delta);
// Store individual values
@start[tid] = nsecs;
// Delete an entry
delete(@start[tid]);
// Clear entire map
clear(@start);
Histograms¶
Power-of-2 histograms (hist()) and linear histograms (lhist()) are
built-in:
// Log2 histogram of read sizes
tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@size = hist(args->ret);
}
// Output:
// @size:
// [1] 5 |@@ |
// [2, 4) 12 |@@@@@@ |
// [4, 8) 45 |@@@@@@@@@@@@@@@@@@@@@ |
// [8, 16) 98 |@@@@@@@@@@@@@@@@@@@@@@@@|
// ...
// Linear histogram: 0-100ms in 5ms buckets
@frame_ms = lhist($delta / 1000000, 0, 100, 5);
Multi-line scripts¶
bpftrace programs go well beyond one-liners. Save as .bt files:
#!/usr/bin/env bpftrace
/*
* syscall_latency.bt -- Histogram of syscall latency by syscall name.
* Usage: sudo bpftrace syscall_latency.bt
*/
tracepoint:raw_syscalls:sys_enter {
@start[tid] = nsecs;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
$delta_us = (nsecs - @start[tid]) / 1000;
@us[ksym(*(kaddr("sys_call_table") + args->id * 8))] = hist($delta_us);
delete(@start[tid]);
}
END {
clear(@start);
}
String handling and printf¶
// Formatted output
printf("%-16s %-6d %d\n", comm, pid, arg0);
// String comparison in a filter
uprobe:/usr/lib/libwayland-server.so:wl_resource_post_event
/comm == "sway"/ {
printf("%s called wl_resource_post_event\n", comm);
}
// Convert a char * argument to a string
$path = str(arg0);
3. Compositor & Graphics Recipes¶
These recipes target the specific problems you encounter when debugging a Wayland compositor (Sway, KWin, Mutter) or its associated graphics stack.
Recipe 1: Trace DRM page flip latency¶
Measure how long between the compositor requesting a page flip and the kernel completing it (vblank event delivery):
#!/usr/bin/env bpftrace
/*
* pageflip_latency.bt -- Measure page-flip-to-vblank latency.
*
* The compositor calls drmModeAtomicCommit() which eventually triggers
* drm_crtc_send_vblank_event() when the flip completes at vblank.
* We measure the delta.
*/
kprobe:drm_mode_atomic_ioctl {
@flip_start[tid] = nsecs;
}
// NOTE: assumes the vblank event is sent on the thread that issued the
// commit; many drivers send it from the vblank interrupt, where tid will
// not match. The CRTC-keyed tracepoint variant below is more robust.
kprobe:drm_crtc_send_vblank_event /@flip_start[tid]/ {
$delta_us = (nsecs - @flip_start[tid]) / 1000;
@flip_latency_us = hist($delta_us);
printf("flip done: %d us (tid=%d comm=%s)\n", $delta_us, tid, comm);
delete(@flip_start[tid]);
}
interval:s:5 {
print(@flip_latency_us);
}
END {
clear(@flip_start);
}
You can also use the DRM tracepoints directly if your kernel exposes them:
# Check available DRM tracepoints
bpftrace -l 'tracepoint:drm:*'
# Typical output on a recent kernel:
# tracepoint:drm:drm_vblank_event
# tracepoint:drm:drm_vblank_event_queued
# tracepoint:drm:drm_vblank_event_delivered
// Using stable DRM tracepoints for vblank delivery latency
tracepoint:drm:drm_vblank_event_queued {
@queued[args->crtc] = nsecs;
}
tracepoint:drm:drm_vblank_event_delivered /@queued[args->crtc]/ {
$delta_us = (nsecs - @queued[args->crtc]) / 1000;
@vblank_delivery_us[args->crtc] = hist($delta_us);
delete(@queued[args->crtc]);
}
Recipe 2: Trace Wayland protocol messages¶
Instrument wl_connection_demarshal in libwayland-server.so to see every
Wayland request the compositor receives:
#!/usr/bin/env bpftrace
/*
* wayland_messages.bt -- Count Wayland protocol messages by compositor.
*
* wl_connection_demarshal is called for every incoming Wayland message.
* The second argument is the message size.
*/
uprobe:/usr/lib/x86_64-linux-gnu/libwayland-server.so:wl_connection_demarshal {
@msg_count[comm] = count();
@msg_bytes[comm] = sum(arg1);
}
interval:s:5 {
printf("\n--- Wayland messages (5s window) ---\n");
print(@msg_count);
print(@msg_bytes);
clear(@msg_count);
clear(@msg_bytes);
}
Finding the library path
The exact path to libwayland-server.so varies by distro. Find it with:
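For example (the sway process name is an assumption; substitute your compositor):

```
# Ask the dynamic linker cache where libwayland-server lives
ldconfig -p | grep libwayland-server

# Or list the exact copy a running compositor has mapped
grep libwayland-server /proc/$(pidof sway)/maps | awk '{print $6}' | sort -u
```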
For more targeted tracing, combine with WAYLAND_DEBUG=1 which prints
human-readable protocol traces to stderr, but note that WAYLAND_DEBUG has
measurable overhead and is not suitable for production. The bpftrace approach
adds near-zero overhead and can run continuously.
Recipe 3: Input-to-display latency¶
This is the holy grail of compositor performance measurement: how long from a physical input event (keypress, mouse move) to the resulting frame appearing on the display.
The chain: libinput event -> compositor processes input -> compositor renders frame -> atomic commit -> page flip at vblank.
#!/usr/bin/env bpftrace
/*
* input_to_flip.bt -- Measure end-to-end input-to-display latency.
*
* Traces the full path from libinput event dispatch through to
* DRM page flip completion.
*
* NOTE: This measures the compositor's main thread. The actual display
* latency includes one additional frame of scanout time.
*/
// libinput dispatches an event to the compositor
uprobe:/usr/lib/x86_64-linux-gnu/libinput.so:libinput_dispatch {
@input_time[tid] = nsecs;
}
// Compositor submits atomic commit (end of render + composite)
kprobe:drm_mode_atomic_ioctl /@input_time[tid]/ {
$render_us = (nsecs - @input_time[tid]) / 1000;
@input_to_commit_us = hist($render_us);
}
// Page flip completes at vblank. NOTE: if your driver delivers the vblank
// event from interrupt context, tid will not match and this probe will
// undercount; cross-check with the DRM tracepoints from Recipe 1.
kprobe:drm_crtc_send_vblank_event /@input_time[tid]/ {
$total_us = (nsecs - @input_time[tid]) / 1000;
@input_to_flip_us = hist($total_us);
delete(@input_time[tid]);
}
END {
printf("\n=== Input to atomic commit (compositor processing time) ===\n");
print(@input_to_commit_us);
printf("\n=== Input to page flip (includes waiting for vblank) ===\n");
print(@input_to_flip_us);
clear(@input_time);
}
Thread ID caveat
This recipe assumes the compositor processes input and submits the atomic
commit on the same thread. Sway and most wlroots compositors do this
(single-threaded event loop). KWin uses multiple threads -- you would need
to pass timing through a shared map keyed on something other than tid.
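One hedged sketch of that idea: drop the tid key entirely and use a single global timestamp. This is only valid when input events are sparse relative to the frame rate, and the library path is illustrative:

```
// Single global key instead of per-tid -- assumes at most one in-flight
// input event per frame, which breaks down under heavy input.
uprobe:/usr/lib/x86_64-linux-gnu/libinput.so:libinput_dispatch {
    @last_input = nsecs;
}

kprobe:drm_mode_atomic_ioctl /@last_input/ {
    @input_to_commit_us = hist((nsecs - @last_input) / 1000);
    delete(@last_input);
}
```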
Recipe 4: Track GPU memory allocation (GEM objects)¶
GEM (Graphics Execution Manager) objects are the kernel's GPU buffer abstraction. Track creation and destruction to find leaks:
#!/usr/bin/env bpftrace
/*
* gem_tracker.bt -- Track GEM buffer object allocation and freeing.
*
* Works for i915 (Intel). Adapt function names for amdgpu or nouveau.
* Use: bpftrace -l 'kprobe:*gem*create*' to find your driver's functions.
*/
// Intel i915 GEM create. The size argument's position varies by kernel
// version (often arg1, after the device pointer) -- verify in your source.
kprobe:i915_gem_object_create {
    @gem_created++;
    @gem_size_hist = hist(arg1);
    printf("GEM create: size=%d bytes comm=%s\n", arg1, comm);
}
// i915 GEM free
kprobe:i915_gem_object_release {
    @gem_freed++;
}
interval:s:10 {
    // plain scalar counters (not count()) so arithmetic on them is legal
    $live = @gem_created - @gem_freed;
    printf("Live GEM objects: ~%d\n", $live);
}
For driver-agnostic tracking, use the DRM scheduler tracepoints:
# List GPU driver tracepoints on your system (one pattern per invocation)
bpftrace -l 'tracepoint:*drm*'
bpftrace -l 'tracepoint:*i915*'
bpftrace -l 'tracepoint:*amdgpu*'
| Driver | Create tracepoint | Notes |
|---|---|---|
| i915 (Intel) | i915_gem_object_create | Also: i915_vma_bind, i915_vma_unbind |
| amdgpu (AMD) | amdgpu_bo_create | Also: amdgpu_bo_move, amdgpu_vm_bo_map |
| nouveau (NVIDIA) | nouveau_bo_new | Limited tracepoint coverage |
| Generic DRM | drm_run_job / drm_sched_process_job | Works across all drivers using the DRM scheduler |
Recipe 5: Monitor compositor frame timing¶
Measure the time between successive vblank completions (the kernel-side events that drive the compositor's wl_callback.done frame callbacks) to detect dropped frames:
#!/usr/bin/env bpftrace
/*
* frame_timing.bt -- Detect dropped frames in a Wayland compositor.
*
* At 60Hz, frames should arrive every ~16.6ms.
* At 144Hz, every ~6.9ms.
* Flag any gap > 1.5x the expected interval.
*/
kprobe:drm_crtc_send_vblank_event {
$now = nsecs;
if (@last_vblank > 0) {
$delta_ms = ($now - @last_vblank) / 1000000;
@frame_interval_ms = hist($delta_ms);
// Flag dropped frames (>25ms gap at 60Hz)
if ($delta_ms > 25) {
printf("*** DROPPED FRAME: %d ms gap (tid=%d comm=%s)\n",
$delta_ms, tid, comm);
@drops = count();
}
}
@last_vblank = $now;
}
interval:s:30 {
printf("\n=== Frame interval histogram (ms) ===\n");
print(@frame_interval_ms);
printf("Dropped frames: ");
print(@drops);
}
Recipe 6: Find syscall bottlenecks in Chromium renderer¶
Chromium spawns dozens of processes. Target the renderer specifically:
#!/usr/bin/env bpftrace
/*
* chromium_syscalls.bt -- Profile syscalls in Chromium renderer processes.
*
* Run with: sudo bpftrace chromium_syscalls.bt
* First find the PID: pgrep -f 'type=renderer'
*/
tracepoint:raw_syscalls:sys_enter
/comm == "chromium" || comm == "chrome"/ {
@start[tid] = nsecs;
@syscall_nr[tid] = args->id;
}
tracepoint:raw_syscalls:sys_exit
/@start[tid]/ {
$delta_us = (nsecs - @start[tid]) / 1000;
@latency_by_syscall[ksym(*(kaddr("sys_call_table") +
@syscall_nr[tid] * 8))] = hist($delta_us);
@slow_syscalls[ksym(*(kaddr("sys_call_table") +
@syscall_nr[tid] * 8))] = count();
delete(@start[tid]);
delete(@syscall_nr[tid]);
}
END {
printf("\n=== Syscall latency by type (us) ===\n");
print(@latency_by_syscall);
clear(@start);
clear(@syscall_nr);
}
A simpler alternative that filters by PID:
# Get Chromium renderer PID
RENDERER_PID=$(pgrep -f 'type=renderer' | head -1)
# Count syscalls by name for that process over 10 seconds
bpftrace -e "
tracepoint:syscalls:sys_enter_* /pid == $RENDERER_PID/ {
    @syscalls[probe] = count();
}" -c 'sleep 10'
Mesa debug environment variables¶
When debugging graphics issues, combine bpftrace with Mesa's built-in debug knobs:
| Variable | Effect |
|---|---|
| MESA_DEBUG=1 | Print GL errors to stderr |
| LIBGL_DEBUG=verbose | Verbose libGL loader messages |
| MESA_LOG_FILE=/tmp/mesa.log | Redirect Mesa logs to a file |
| GALLIUM_HUD="fps+cpu,draw-calls" | On-screen performance overlay |
| AMD_DEBUG=info | AMD driver debug output |
| INTEL_DEBUG=perf | Intel driver performance warnings |
| RADV_DEBUG=info | RADV Vulkan debug output |
| MESA_SHADER_CAPTURE_PATH=/tmp/shaders | Dump compiled shaders to disk |
| DRI_PRIME=1 | Select the discrete GPU for rendering |
| MESA_LOADER_DRIVER_OVERRIDE=zink | Force a specific Mesa driver (e.g., Zink for Vulkan-backed GL) |
4. perf + Flame Graphs Workflow¶
On-CPU flame graphs¶
The full pipeline from recording to visualization:
# 1. Record call stacks at 99 Hz for 30 seconds (system-wide)
# Use -g for frame pointer unwinding, --call-graph dwarf for DWARF
sudo perf record -F 99 -a -g -- sleep 30
# 2. Export to text
sudo perf script --header > out.perf
# 3. Collapse stacks into single-line format
# (one line per unique stack, with a count)
./stackcollapse-perf.pl out.perf > out.folded
# 4. Generate SVG flame graph
./flamegraph.pl out.folded > flamegraph.svg
# Or as a pipeline:
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
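The folded format produced in step 3 is just semicolon-joined frames followed by a sample count, which makes it easy to post-process with standard text tools. A tiny illustration with two made-up stacks:

```shell
# Two folded stacks; the last field is the sample count.
# Summing the counts gives the total number of samples (49 here).
printf 'main;render;draw 42\nmain;input 7\n' | awk '{s += $NF} END {print s}'
```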
Targeted recording for compositor work:
# Profile only the compositor
sudo perf record -F 99 -g -p $(pidof sway) -- sleep 30
# Profile only Chromium's GPU process
sudo perf record -F 99 -g -p $(pgrep -f 'type=gpu-process') -- sleep 10
# Profile with DWARF unwinding (better C++ stacks, higher overhead)
sudo perf record -F 49 --call-graph dwarf -p $(pidof sway) -- sleep 30
Reading a flame graph¶
┌─ Narrow tower: deep call chain,
│ but little total CPU time.
│ Probably not the bottleneck.
v
┌───┐ ┌─┐
│ │ ┌────┤ │
│ │ │ │ │ ┌──────────────────┐ <-- Wide plateau at top:
│ ├──┤ │ │ │ hot_function() │ THIS is your bottleneck.
│ │ │ │ ├─────┤ │ Width = % of total samples.
├───┴──┴────┴─┴─────┴──────────────────┤
│ main() │ <-- Bottom: entry point
└───────────────────────────────────────┘
X-axis: alphabetical (NOT time). Width = sample proportion.
Y-axis: stack depth. Bottom = entry, top = leaf.
Key patterns to look for:
- Wide plateaus at the top = functions consuming the most CPU. Optimize these first.
- "Comb" pattern = many thin spikes from a single parent. The parent dispatches varied work -- profile its children individually.
- Missing frames = gaps in the stack. Use --call-graph dwarf or compile with -fno-omit-frame-pointer.
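To keep frame pointers in your own builds (flags shown for GCC/Clang; the meson example assumes a meson-based project such as sway or wlroots):

```
# One-off compile with frame pointers preserved
gcc -O2 -g -fno-omit-frame-pointer -o app app.c

# For a meson project, inject the flag at setup time
CFLAGS="-fno-omit-frame-pointer" meson setup build
```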
Off-CPU flame graphs¶
On-CPU flame graphs show where the CPU is busy. Off-CPU flame graphs show where threads are blocked -- waiting on I/O, locks, or scheduling. This is critical for compositors that are often I/O-bound (waiting for GPU, waiting for vblank, waiting for client buffers).
# Method 1: BCC's offcputime tool
sudo /usr/share/bcc/tools/offcputime -df -p $(pidof sway) 30 > out.stacks
./flamegraph.pl --color=io --title="Off-CPU: sway" \
--countname=us < out.stacks > offcpu.svg
# Method 2: bpftrace off-CPU script (records the sleeping thread's stack
# at switch-out, then sums its blocked time when it is switched back in)
sudo bpftrace -e '
tracepoint:sched:sched_switch {
    @start[args->prev_pid] = nsecs;
    @stack[args->prev_pid] = kstack;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    $us = (nsecs - @start[args->next_pid]) / 1000;
    @offcpu[args->next_comm, @stack[args->next_pid]] = sum($us);
    delete(@start[args->next_pid]);
    delete(@stack[args->next_pid]);
}
END { clear(@start); clear(@stack); }
' > offcpu.bt.out
# Method 3: Off-wake flame graphs (shows WHY a thread woke up)
sudo /usr/share/bcc/tools/offwaketime -f -p $(pidof sway) 30 > out.stacks
./flamegraph.pl --color=chain --title="Off-Wake: sway" \
--countname=us < out.stacks > offwake.svg
Interpreting off-CPU flame graphs:
- The widest towers represent the longest total blocked time.
- Look for stacks ending in schedule() -> io_schedule() -> your compositor's read/poll paths -- this tells you what I/O the compositor waits on.
- Off-wake flame graphs are even more powerful: the bottom half shows the sleeping stack, the top half shows the waker's stack, and a divider frame joins them in the middle.
Differential flame graphs¶
Compare two profiles (before/after a change, or with/without VNC active):
# Record baseline
sudo perf record -F 99 -a -g -o baseline.data -- sleep 30
# Record with VNC active
sudo perf record -F 99 -a -g -o vnc.data -- sleep 30
# Generate differential flame graph
sudo perf script -i baseline.data | ./stackcollapse-perf.pl > baseline.folded
sudo perf script -i vnc.data | ./stackcollapse-perf.pl > vnc.folded
./difffolded.pl baseline.folded vnc.folded | \
./flamegraph.pl --title="Diff: VNC vs baseline" > diff.svg
In a differential flame graph, red = regression (more samples in the second profile), blue = improvement (fewer samples).
5. Real-World Debugging Story¶
"The compositor drops frames when VNC is active"¶
You are running a Sway compositor serving a Chromium kiosk to remote users via WayVNC. Users report stuttering and dropped frames when VNC clients are connected, but the display is smooth when no VNC session is active. Here is how you diagnose it with the tools from this article.
Step 1: Confirm the symptom with frame timing¶
#!/usr/bin/env bpftrace
// dropped_frames_check.bt
kprobe:drm_crtc_send_vblank_event {
    $now = nsecs;
    if (@last > 0) {
        $gap_ms = ($now - @last) / 1000000;
        @frame_gap = hist($gap_ms);   // record the gap BEFORE updating @last
        if ($gap_ms > 20) {
            printf("DROPPED: %d ms gap at %ld\n", $gap_ms, nsecs);
            @dropped = count();
        }
    }
    @last = $now;
}
Run this while toggling VNC on/off. You see the histogram shift: without VNC, tight 16-17ms intervals. With VNC, periodic 33-50ms gaps (skipping 1-2 frames).
Step 2: Is it CPU-bound? On-CPU flame graph¶
# Profile sway while VNC is active
sudo perf record -F 99 -g -p $(pidof sway) -- sleep 30
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > sway_vnc.svg
Open the SVG. You see a wide plateau in wlr_output_commit -> the rendering
path. But sway's composite time is only ~3ms per frame. The CPU is not
saturated. Not CPU-bound.
Step 3: Is it I/O-bound? Off-CPU flame graph¶
sudo /usr/share/bcc/tools/offcputime -df -p $(pidof sway) 30 > sway_offcpu.stacks
./flamegraph.pl --color=io --title="Off-CPU: sway + VNC" \
--countname=us < sway_offcpu.stacks > sway_offcpu.svg
Now you see it: a massive tower in schedule() <- ep_poll() <-
wl_event_loop_dispatch(). Sway's event loop is blocking in epoll. Something
is making the event loop stall.
Step 4: What is sway waiting on? Syscall latency¶
#!/usr/bin/env bpftrace
// sway_syscall_latency.bt
tracepoint:raw_syscalls:sys_enter /comm == "sway"/ {
@start[tid] = nsecs;
}
tracepoint:raw_syscalls:sys_exit /comm == "sway" && @start[tid]/ {
$us = (nsecs - @start[tid]) / 1000;
@latency[args->id] = hist($us);
if ($us > 5000) {
printf("SLOW syscall %d: %d us\n", args->id, $us);
}
delete(@start[tid]);
}
The output reveals: writev() calls occasionally take 8-15ms. These correspond
to Wayland protocol messages being sent to the WayVNC client.
Step 5: Is it the VNC encoding? Trace WayVNC¶
# How much CPU does wayvnc use?
sudo perf record -F 99 -g -p $(pidof wayvnc) -- sleep 30
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > wayvnc.svg
The flame graph shows WayVNC spending most CPU time in tight_encode() (Tight
encoding) and zlib_compress(). When VNC clients are connected, WayVNC
captures frames via wlr_screencopy, encodes them, and writes to the socket.
Step 6: Find the root cause with scheduler analysis¶
#!/usr/bin/env bpftrace
// runqueue_latency.bt -- Is sway getting preempted?
tracepoint:sched:sched_wakeup /args->comm == "sway"/ {
@wake[args->pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_comm == "sway" && @wake[args->next_pid]/ {
$us = (nsecs - @wake[args->next_pid]) / 1000;
@runq_latency = hist($us);
if ($us > 1000) {
printf("sway waited %d us in run queue\n", $us);
}
delete(@wake[args->next_pid]);
}
Finding: when WayVNC's encoding threads are active, sway experiences run-queue latencies of 2-5ms on the same CPU cores. The VNC encoding is CPU-intensive and competing with the compositor for CPU time.
Step 7: The fix¶
The diagnosis points to two problems:
- Blocking writes: Sway's event loop blocks when writing large Wayland messages to a slow consumer (WayVNC). Fix: increase the Wayland socket buffer size or make WayVNC consume faster.
- CPU contention: WayVNC encoding runs on the same cores as sway. Fix: pin WayVNC to different CPU cores with taskset or cgroups:
# Pin wayvnc to CPUs 2-3, keep sway on CPUs 0-1
taskset -p 0x0C $(pidof wayvnc)
taskset -p 0x03 $(pidof sway)
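The same isolation can be expressed with cgroups v2 via systemd, which also covers any threads the process spawns later (assumes a systemd-based system):

```
# Launch wayvnc confined to CPUs 2-3 in a transient scope
sudo systemd-run --scope -p AllowedCPUs=2-3 wayvnc
```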
This investigation used: bpftrace (frame timing, syscall latency, run-queue latency), perf + flame graphs (on-CPU and off-CPU analysis), and system knowledge (Wayland event loop, VNC encoding pipeline). In an interview, walking through this methodology is more impressive than memorizing tool syntax.
6. Brendan Gregg's Methodologies¶
USE Method: Utilization, Saturation, Errors¶
The USE Method provides a systematic checklist for identifying bottlenecks. For every resource in the system, check three metrics:
| Metric | Definition | Example |
|---|---|---|
| Utilization | Average time the resource was busy servicing work (%) | CPU at 85%, GPU at 60% |
| Saturation | Degree to which extra work is queued, waiting | Run queue depth > 0, swap usage |
| Errors | Count of error events | NIC drops, ECC errors, failed allocations |
USE applied to compositor performance¶
| Resource | Utilization tool | Saturation tool | Errors tool |
|---|---|---|---|
| CPU | mpstat -P ALL 1, perf stat | vmstat 1 (r column), bpftrace run-queue | perf stat (branch-miss, cache-miss) |
| GPU | intel_gpu_top, cat gpu_busy_percent | Check for fence waits in drm_sched_job_wait_dep | dmesg (GPU hang, reset) |
| Memory | free -m, /proc/meminfo | vmstat (si/so columns), OOM messages | edac-util, dmesg (MCE) |
| Network | sar -n DEV 1 | ss -tnp (send-Q), tc -s qdisc | ip -s link (errors, drops) |
| DRM/KMS | Frame timing histogram (Recipe 5) | Pending flip count, fence wait time | dmesg (flip_done timed out) |
| Wayland socket | Message rate (Recipe 2) | ss -xp (send-Q on Wayland socket) | Client disconnect logs |
The power of USE is completeness: by checking every resource, you avoid anchoring on the first anomaly you find. In the VNC debugging story above, you might have stopped at "CPU is fine" without checking saturation (run-queue latency) and I/O contention.
TSA Method: Thread State Analysis¶
TSA complements USE by focusing on threads instead of resources. For each thread of interest, measure total time in each state:
| State | Meaning | Tool |
|---|---|---|
| Executing | Running on CPU | perf record -> flame graph |
| Runnable | In run queue, waiting for CPU | bpftrace sched tracepoints |
| Sleeping (I/O) | Blocked on disk, network, GPU | Off-CPU flame graph |
| Sleeping (lock) | Blocked on mutex/futex | bpftrace -e 'uprobe:libc:pthread_mutex_lock { @[ustack] = count(); }' |
| Sleeping (idle) | Waiting for work (epoll) | Expected for event-loop compositors |
| Paging | Waiting for memory to be paged in | vmstat si/so columns |
TSA applied to compositor thread¶
For a healthy 60Hz sway compositor, the main thread's time budget per frame is ~16.6ms:
Expected breakdown of sway main thread (60 Hz):
├── Executing (on-CPU) ~3-5 ms (input handling + render + composite)
├── Sleeping (idle/epoll) ~10-12 ms (waiting for vblank or client events)
├── Sleeping (I/O) ~0-1 ms (DRM ioctl completion)
├── Runnable (run queue) ~0-0.1 ms (should be negligible)
└── Sleeping (lock) ~0 ms (single-threaded event loop)
If you see the Executing slice growing above 12ms, the compositor cannot finish rendering before the next vblank -- frames will be dropped. If the Runnable slice grows, the compositor is being starved by other processes (the VNC scenario). If the Sleeping (I/O) slice grows, a DRM ioctl or client read is blocking the loop.
Measure this directly:
#!/usr/bin/env bpftrace
/*
* compositor_tsa.bt -- Thread State Analysis for sway.
* Records time in executing vs off-CPU states.
*/
tracepoint:sched:sched_switch /args->prev_comm == "sway"/ {
    // sway left the CPU: accumulate its on-CPU time
    if (@oncpu_start[args->prev_pid] > 0) {
        @on_cpu_us = @on_cpu_us + (nsecs - @oncpu_start[args->prev_pid]) / 1000;
    }
    @offcpu_start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_comm == "sway"/ {
    // sway got the CPU: accumulate its off-CPU time
    if (@offcpu_start[args->next_pid] > 0) {
        @off_cpu_us = @off_cpu_us + (nsecs - @offcpu_start[args->next_pid]) / 1000;
    }
    @oncpu_start[args->next_pid] = nsecs;
}
interval:s:5 {
    // plain scalar maps (not sum()) so they can be printed and reset
    printf("on-CPU: %d us | off-CPU: %d us (5s window)\n",
           @on_cpu_us, @off_cpu_us);
    @on_cpu_us = 0;
    @off_cpu_us = 0;
}
Combining USE + TSA¶
In practice, use both:
- USE first -- quickly scan all resources for saturation or errors. This takes 60 seconds with the right checklist.
- TSA second -- for the threads that matter (compositor main loop, Chromium renderer, VNC encoder), break down where time goes.
- Drill down -- use bpftrace recipes and flame graphs to investigate the specific bottleneck USE/TSA identified.
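That 60-second USE scan can be as simple as a canned sequence of commands (assumes sysstat and the usual utilities are installed):

```
# CPU: utilization per core, then saturation (run queue 'r' column)
mpstat -P ALL 1 5
vmstat 1 5

# Memory: usage and swap activity (si/so columns in vmstat above)
free -m

# Network: throughput, then errors and drops
sar -n DEV 1 5
ip -s link

# Kernel-reported errors of any kind
dmesg --level=err,warn | tail -20
```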
Quick Reference: bpftrace Cheat Sheet¶
# List probes matching a pattern
bpftrace -l 'kprobe:drm_*'
bpftrace -l 'tracepoint:syscalls:sys_enter_*'
bpftrace -l 'uprobe:/usr/lib/libwayland-server.so:*'
# Trace a function with arguments (for vfs_read, arg2 is the byte count)
bpftrace -e 'kprobe:vfs_read { printf("%s reads up to %d bytes\n", comm, arg2); }'
# Return value of a function
bpftrace -e 'kretprobe:vfs_read { @bytes = hist(retval); }'
# Aggregate by stack trace
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
# Time a function
bpftrace -e 'kprobe:mutex_lock { @start[tid] = nsecs; }
kretprobe:mutex_lock /@start[tid]/ {
@lock_ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
# Per-second output
bpftrace -e 'tracepoint:block:block_rq_issue { @iops = count(); }
interval:s:1 { print(@iops); clear(@iops); }'
Installation & Prerequisites¶
# Ubuntu/Debian
sudo apt install bpftrace bpfcc-tools linux-headers-$(uname -r)
# Fedora
sudo dnf install bpftrace bcc-tools
# Arch
sudo pacman -S bpftrace bcc
# Verify eBPF support
sudo bpftrace -e 'BEGIN { printf("eBPF works!\n"); exit(); }'
# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph
Requirements:
- Linux kernel >= 4.9 (basic eBPF); >= 5.8 for ring buffer, BTF, and CO-RE
- CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_BPF_JIT=y in the kernel config
- Root, or the CAP_BPF + CAP_PERFMON capabilities
- Kernel headers installed, or a BTF-enabled kernel (/sys/kernel/btf/vmlinux) for type information
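To confirm the kernel exposes BTF (which lets bpftrace resolve kernel struct types without headers):

```
# Present on CONFIG_DEBUG_INFO_BTF=y kernels
ls -lh /sys/kernel/btf/vmlinux

# Dump a few raw type entries from it with bpftool
sudo bpftool btf dump file /sys/kernel/btf/vmlinux | head -20
```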
Glossary
- eBPF (extended Berkeley Packet Filter)
- In-kernel virtual machine executing verified, JIT-compiled programs attached to kernel and userspace events. Cannot crash the kernel.
- BPF verifier
- Static analysis pass that rejects unsafe eBPF programs before execution. Checks memory safety, bounded loops, type correctness, and resource cleanup.
- JIT (Just-In-Time) compiler
- Translates verified BPF bytecode to native machine instructions (x86-64, ARM64) for near-native execution speed.
- BPF map
- Kernel-resident key-value data structure shared between eBPF programs and userspace. Types include hash, array, ring buffer, LRU, and stack trace.
- kprobe / kretprobe
- Dynamic tracing of kernel function entry/return. Can attach to almost any kernel function without recompilation.
- uprobe / uretprobe
- Dynamic tracing of userspace function entry/return. Requires knowing the library path and symbol name.
- tracepoint
- Stable, statically defined instrumentation point in the kernel. Preferred over kprobes when available because the interface is versioned.
- USDT (User Statically-Defined Tracing)
- Static probes compiled into userspace applications (like DTrace probes). Used by PostgreSQL, Python, Node.js, JVM.
- GEM (Graphics Execution Manager)
- Kernel subsystem managing GPU buffer objects. Handles allocation, mapping, and lifetime of GPU memory.
- DRM (Direct Rendering Manager)
- Linux kernel subsystem providing GPU access, modesetting (KMS), and buffer management (GEM). Contains tracepoints for vblank, scheduling, and memory.
- Flame graph
- Visualization of sampled stack traces where x-axis width represents the proportion of samples (not the passage of time). Invented by Brendan Gregg. On-CPU shows busy time; off-CPU shows blocked time.
- Off-CPU analysis
- Measuring time threads spend blocked (I/O, locks, scheduling) rather than executing. Uses scheduler tracepoints to capture blocking stacks.
- USE Method
- Performance methodology: for every resource, check Utilization, Saturation, and Errors. Ensures complete coverage of potential bottlenecks.
- TSA (Thread State Analysis)
- Performance methodology: for each thread, measure time in executing, runnable, sleeping, lock, and idle states. Complementary to USE.
- CO-RE (Compile Once, Run Everywhere)
- eBPF portability mechanism using BTF (BPF Type Format) to relocate struct field accesses at load time, avoiding per-kernel recompilation.
- BTF (BPF Type Format)
- Compact type metadata embedded in the kernel, enabling CO-RE relocations and rich introspection of kernel data structures from eBPF programs.
- vblank
- Vertical blanking interval: the period between the last scanline of one frame and the first scanline of the next. Page flips happen here to avoid tearing.
- Atomic modesetting
- DRM API that commits all display changes (planes, CRTCs, connectors) in a single atomic transaction, preventing partial updates and tearing.