Practical eBPF/bpftrace for Linux Systems Debugging

The job posting requires "performance optimization using tools like perf, bpftrace, valgrind, and flamegraphs." This article goes deep on eBPF and bpftrace -- the kernel virtual machine, the tracing language, and real compositor/graphics recipes you would use daily. If you already know strace, this is the next level: safe, programmable, near-zero-overhead kernel instrumentation that can answer questions strace cannot.


1. What eBPF Actually Is

The kernel virtual machine

eBPF (extended Berkeley Packet Filter) is an in-kernel virtual machine that executes sandboxed programs attached to kernel and userspace events. Unlike kernel modules, eBPF programs:

  • Cannot crash the kernel. A static verifier rejects unsafe programs before they run.
  • Cannot loop forever. The verifier proves bounded execution for every code path.
  • Cannot access arbitrary memory. All pointer dereferences go through helper functions that perform bounds checking.
   userspace                       kernel
  ┌────────────┐                 ┌──────────────────────────────┐
  │ bpftrace   │  bpf() syscall  │  Verifier                    │
  │ / libbpf   │ ───────────────>│   ↓ (rejects unsafe code)    │
  │            │                 │  JIT compiler                │
  │            │                 │   ↓ (bytecode → native x86)  │
  │            │  maps (shared)  │  Attached program            │
  │            │ <══════════════>│   (runs at probe point)      │
  └────────────┘                 └──────────────────────────────┘

Verification in detail

When you load an eBPF program via the bpf() syscall, the kernel verifier performs static analysis on the bytecode before it ever runs:

  1. Control-flow graph analysis -- builds a DAG of all basic blocks, rejecting back-edges unless the loop is provably bounded.
  2. Symbolic execution -- walks every possible path, tracking register types and value ranges.
  3. Type safety -- each register is tagged as SCALAR, PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK, etc. The verifier rejects type-confused operations (e.g., using a scalar as a pointer).
  4. Bounds checking -- pointer arithmetic must stay within proven bounds. Accessing map_value + offset is only allowed if offset < value_size.
  5. Resource safety -- acquired spinlocks must be released on every path, reference-counted objects must be freed, and no sleeping in atomic contexts.
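To make the bounds-checking rule (step 4) concrete, here is a toy range tracker in Python. This is an illustration only, not the kernel's actual algorithm; the real verifier tracks far richer per-register state.

```python
# Toy model of verifier bounds checking -- illustration of step 4 only,
# NOT the kernel's real algorithm. Each register carries a proven
# [lo, hi] value range; a map access is accepted only if it is
# in-bounds for EVERY value in that range.

class Reg:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # proven value range for this register

def check_map_access(offset_reg, access_size, value_size):
    """Accept map_value + offset only if every possible offset keeps
    the access inside [0, value_size)."""
    if offset_reg.lo < 0:
        return False                   # worst case underflows the value
    if offset_reg.hi + access_size > value_size:
        return False                   # worst case overflows the value
    return True

# offset proven in [0, 4], 4-byte read from an 8-byte map value: accepted
print(check_map_access(Reg(0, 4), 4, 8))   # True
# offset could be 5, and 5 + 4 > 8: rejected on the worst-case path
print(check_map_access(Reg(0, 5), 4, 8))   # False
```

This worst-case reasoning is why the verifier sometimes rejects programs that look safe at runtime: it must prove safety for every path and every provable value range, not just the values you expect.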

If verification passes, the JIT compiler translates BPF bytecode instruction-by-instruction into native machine code (x86-64, ARM64, etc.), with constant blinding as a hardening measure against JIT spraying. The result runs at near-native speed.

How it differs from kernel modules

Aspect      Kernel module                    eBPF program
Safety      Can panic the kernel             Verified before execution
Loading     insmod, requires root + signing  bpf() syscall, CAP_BPF
Crash risk  Full kernel crash on bug         Program rejected or safely terminated
Overhead    Arbitrary                        Bounded instruction count, no sleeping
Deployment  Recompile for each kernel        CO-RE (Compile Once, Run Everywhere)
Unloading   Reference counting issues        Automatic cleanup on fd close

Maps: the shared data structures

BPF maps are kernel-resident key-value stores shared between eBPF programs and userspace. They are the primary mechanism for aggregating data, passing state between probe firings, and exporting results.

Map type Use case
BPF_MAP_TYPE_HASH General key-value store (e.g., tracking per-PID timestamps)
BPF_MAP_TYPE_ARRAY Fixed-size indexed array (fast lookup by integer key)
BPF_MAP_TYPE_PERF_EVENT_ARRAY Stream events to userspace ring buffer
BPF_MAP_TYPE_RINGBUF Modern replacement for perf event array (single ring, no per-CPU waste)
BPF_MAP_TYPE_LRU_HASH Auto-evicting hash for bounded memory use
BPF_MAP_TYPE_STACK_TRACE Store kernel/user stack traces by ID

Maps are lock-free (using per-CPU variants or RCU) and strongly typed at creation time. They survive across probe invocations, enabling stateful analysis like latency histograms.
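Most recipes later in this article lean on one stateful map idiom: stash a timestamp keyed by thread ID at one probe, compute the delta at a second probe, then delete the entry. Sketched in Python, with a plain dict standing in for a BPF_MAP_TYPE_HASH and function calls standing in for probe firings:

```python
# The @start[tid] timestamp-delta pattern, with a dict standing in for
# a BPF hash map and function calls standing in for probe firings.

start = {}            # @start[tid] = nsecs
latencies_us = []     # values a histogram map would aggregate

def on_enter(tid, now_ns):
    start[tid] = now_ns                 # first probe: stamp the thread

def on_exit(tid, now_ns):
    if tid not in start:                # the /@start[tid]/ filter
        return
    latencies_us.append((now_ns - start[tid]) // 1000)  # delta in us
    del start[tid]                      # delete(@start[tid]) bounds memory

on_enter(42, 1_000_000)
on_exit(42, 4_500_000)
print(latencies_us)   # [3500]
```

The `delete` step matters in the real thing too: without it, the hash map fills with stale entries from threads that entered but never matched an exit probe.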


2. The bpftrace Language

bpftrace is a high-level tracing language that compiles to eBPF bytecode. Think of it as awk for kernel tracing: probes are patterns, actions are blocks.

Program structure

probe_type:target /filter/ {
    action_statements;
}

A program can have multiple probe blocks. Two special probes run at program start and end:

BEGIN {
    printf("tracing started\n");
}

END {
    printf("tracing stopped\n");
}

Probe types

Probe       Syntax                           Attaches to
kprobe      kprobe:function_name             Entry of any kernel function
kretprobe   kretprobe:function_name          Return of any kernel function
uprobe      uprobe:/path/to/lib:function     Entry of a userspace function
uretprobe   uretprobe:/path/to/lib:function  Return of a userspace function
tracepoint  tracepoint:category:name         Kernel static tracepoint
usdt        usdt:/path/to/bin:provider:name  Userspace static probe (DTrace-style)
profile     profile:hz:99                    Timed sampling on every CPU
interval    interval:s:1                     Periodic timer (one CPU only)
software    software:page-faults:1           Software performance event
hardware    hardware:cache-misses:1000       Hardware PMC event

Listing available probes:

# List all tracepoints matching "drm"
bpftrace -l 'tracepoint:drm:*'

# List all kernel functions matching "vblank"
bpftrace -l 'kprobe:*vblank*'

# List uprobes in libwayland
bpftrace -l 'uprobe:/usr/lib/x86_64-linux-gnu/libwayland-server.so:*'

Built-in variables

Variable Meaning
pid Process ID
tid Thread ID
comm Process name (16 chars)
nsecs Nanosecond timestamp (monotonic)
elapsed Nanoseconds since program start
cpu Current CPU number
uid User ID
arg0..argN Function arguments (kprobe/uprobe)
retval Return value (kretprobe/uretprobe)
args->field Tracepoint argument struct field
curtask Pointer to current task_struct
kstack Kernel stack trace
ustack Userspace stack trace

Filters (predicates)

Filters appear between / delimiters and act as guards:

// Only trace the compositor process
kprobe:drm_atomic_commit /comm == "sway"/ {
    printf("atomic commit by sway\n");
}

// Only trace PIDs above 1000 (skip kernel threads)
tracepoint:syscalls:sys_enter_write /pid > 1000/ {
    @writes[comm] = count();
}

Maps and aggregations

Maps are bpftrace's most powerful feature. They are global associative arrays prefixed with @ that persist across probe firings:

// Count by key
@syscalls[comm, probe] = count();

// Sum values
@bytes_read[comm] = sum(args->count);

// Track min/max
@latency_min = min($delta);
@latency_max = max($delta);

// Store individual values
@start[tid] = nsecs;

// Delete an entry
delete(@start[tid]);

// Clear entire map
clear(@start);

Histograms

Power-of-2 histograms (hist()) and linear histograms (lhist()) are built-in:

// Log2 histogram of read sizes
tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @size = hist(args->ret);
}
// Output:
// @size:
// [1]                    5 |@@                      |
// [2, 4)                12 |@@@@@@                  |
// [4, 8)                45 |@@@@@@@@@@@@@@@@@@@@@   |
// [8, 16)               98 |@@@@@@@@@@@@@@@@@@@@@@@@|
// ...

// Linear histogram: 0-100ms in 5ms buckets
@frame_ms = lhist($delta / 1000000, 0, 100, 5);
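The bucket boundaries hist() prints are just powers of two. A small helper reproduces the mapping for positive values (bpftrace handles zero, negatives, and the single-value [1] bucket slightly differently, so treat this as an approximation):

```python
def log2_bucket(value):
    """Return the [lo, hi) power-of-two interval a positive value falls
    into, mirroring bpftrace's hist() buckets like [8, 16)."""
    lo = 1
    while lo * 2 <= value:
        lo *= 2
    return (lo, lo * 2)

print(log2_bucket(1))     # (1, 2)
print(log2_bucket(12))    # (8, 16)
print(log2_bucket(100))   # (64, 128)
```

Log2 buckets are the right default for latency: they give fine resolution near zero and still capture rare multi-millisecond outliers without pre-declaring a range, which is exactly what lhist() requires you to do.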

Multi-line scripts

bpftrace programs go well beyond one-liners. Save as .bt files:

#!/usr/bin/env bpftrace
/*
 * syscall_latency.bt -- Histogram of syscall latency by syscall name.
 * Usage: sudo bpftrace syscall_latency.bt
 */

tracepoint:raw_syscalls:sys_enter {
    @start[tid] = nsecs;
}

tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $delta_us = (nsecs - @start[tid]) / 1000;
    // Resolve the syscall name via the kernel's sys_call_table
    // (8-byte entries on x86-64)
    @us[ksym(*(kaddr("sys_call_table") +
              args->id * 8))] = hist($delta_us);
    delete(@start[tid]);
}

END {
    clear(@start);
}

String handling and printf

// Formatted output
printf("%-16s %-6d %d\n", comm, pid, arg0);

// String comparison (filter on the process name)
uprobe:/usr/lib/libwayland-server.so:wl_resource_post_event
/comm == "sway"/ {
    printf("%s called wl_resource_post_event\n", comm);
}

// Converting a char* argument to a string
$path = str(arg0);

3. Compositor & Graphics Recipes

These recipes target the specific problems you encounter when debugging a Wayland compositor (Sway, KWin, Mutter) or its associated graphics stack.

Recipe 1: Trace DRM page flip latency

Measure how long between the compositor requesting a page flip and the kernel completing it (vblank event delivery):

#!/usr/bin/env bpftrace
/*
 * pageflip_latency.bt -- Measure page-flip-to-vblank latency.
 *
 * The compositor calls drmModeAtomicCommit() which eventually triggers
 * drm_crtc_send_vblank_event() when the flip completes at vblank.
 * We measure the delta.
 */

kprobe:drm_mode_atomic_ioctl {
    // Global key: the vblank completion fires in IRQ context, so a
    // tid-keyed map would never match. Assumes one in-flight flip.
    @flip_start = nsecs;
}

kprobe:drm_crtc_send_vblank_event /@flip_start/ {
    $delta_us = (nsecs - @flip_start) / 1000;
    @flip_latency_us = hist($delta_us);
    printf("flip done: %d us\n", $delta_us);
    @flip_start = 0;
}

interval:s:5 {
    print(@flip_latency_us);
}

END {
    clear(@flip_start);
}

You can also use the DRM tracepoints directly if your kernel exposes them:

# Check available DRM tracepoints
bpftrace -l 'tracepoint:drm:*'

# Typical output on a recent kernel:
#   tracepoint:drm:drm_vblank_event
#   tracepoint:drm:drm_vblank_event_queued
#   tracepoint:drm:drm_vblank_event_delivered
// Using stable DRM tracepoints for vblank delivery latency
tracepoint:drm:drm_vblank_event_queued {
    @queued[args->crtc] = nsecs;
}

tracepoint:drm:drm_vblank_event_delivered /@queued[args->crtc]/ {
    $delta_us = (nsecs - @queued[args->crtc]) / 1000;
    @vblank_delivery_us[args->crtc] = hist($delta_us);
    delete(@queued[args->crtc]);
}

Recipe 2: Trace Wayland protocol messages

Instrument wl_connection_demarshal in libwayland-server.so to see every Wayland request the compositor receives:

#!/usr/bin/env bpftrace
/*
 * wayland_messages.bt -- Count Wayland protocol messages by compositor.
 *
 * wl_connection_demarshal is called for every incoming Wayland message.
 * The second argument is the message size.
 */

uprobe:/usr/lib/x86_64-linux-gnu/libwayland-server.so:wl_connection_demarshal {
    @msg_count[comm] = count();
    @msg_bytes[comm] = sum(arg1);
}

interval:s:5 {
    printf("\n--- Wayland messages (5s window) ---\n");
    print(@msg_count);
    print(@msg_bytes);
    clear(@msg_count);
    clear(@msg_bytes);
}

Finding the library path

The exact path to libwayland-server.so varies by distro. Find it with:

ldconfig -p | grep libwayland-server
# Or for a running compositor:
cat /proc/$(pidof sway)/maps | grep libwayland-server

For more targeted tracing, combine with WAYLAND_DEBUG=1, which prints human-readable protocol traces to stderr; note that WAYLAND_DEBUG has measurable overhead and is not suitable for production. The bpftrace approach adds near-zero overhead and can run continuously.

Recipe 3: Input-to-display latency

This is the holy grail of compositor performance measurement: how long from a physical input event (keypress, mouse move) to the resulting frame appearing on the display.

The chain: libinput event -> compositor processes input -> compositor renders frame -> atomic commit -> page flip at vblank.

#!/usr/bin/env bpftrace
/*
 * input_to_flip.bt -- Measure end-to-end input-to-display latency.
 *
 * Traces the full path from libinput event dispatch through to
 * DRM page flip completion.
 *
 * NOTE: This measures the compositor's main thread. The actual display
 * latency includes one additional frame of scanout time.
 */

// libinput dispatches an event to the compositor
uprobe:/usr/lib/x86_64-linux-gnu/libinput.so:libinput_dispatch {
    @input_time[tid] = nsecs;
}

// Compositor submits atomic commit (end of render + composite)
kprobe:drm_mode_atomic_ioctl /@input_time[tid]/ {
    $render_us = (nsecs - @input_time[tid]) / 1000;
    @input_to_commit_us = hist($render_us);
    // Hand the timestamp to the vblank probe through a global:
    // the flip completion fires in IRQ context, not on the compositor
    // tid. Assumes a single output with one in-flight flip.
    @pending_input = @input_time[tid];
    delete(@input_time[tid]);
}

// Page flip completes at vblank (IRQ context)
kprobe:drm_crtc_send_vblank_event /@pending_input/ {
    $total_us = (nsecs - @pending_input) / 1000;
    @input_to_flip_us = hist($total_us);
    @pending_input = 0;
}

END {
    printf("\n=== Input to atomic commit (compositor processing time) ===\n");
    print(@input_to_commit_us);
    printf("\n=== Input to page flip (includes waiting for vblank) ===\n");
    print(@input_to_flip_us);
    clear(@input_time);
    clear(@pending_input);
}

Thread ID caveat

This recipe assumes the compositor processes input and submits the atomic commit on the same thread. Sway and most wlroots compositors do this (single-threaded event loop). KWin uses multiple threads -- you would need to pass timing through a shared map keyed on something other than tid.

Recipe 4: Track GPU memory allocation (GEM objects)

GEM (Graphics Execution Manager) objects are the kernel's GPU buffer abstraction. Track creation and destruction to find leaks:

#!/usr/bin/env bpftrace
/*
 * gem_tracker.bt -- Track GEM buffer object allocation and freeing.
 *
 * Works for i915 (Intel). Adapt function names for amdgpu or nouveau.
 * Use: bpftrace -l 'kprobe:*gem*create*' to find your driver's functions.
 */

// Intel i915 GEM create (size is the second argument on recent
// kernels -- verify against your kernel's source)
kprobe:i915_gem_object_create {
    @gem_live++;
    @gem_size_hist = hist(arg1);
    printf("GEM create: size=%d bytes comm=%s\n", arg1, comm);
}

// i915 GEM free
kprobe:i915_gem_object_release {
    @gem_live--;
}

interval:s:10 {
    // count() maps cannot be subtracted, so keep one live counter
    printf("Live GEM objects: ~%d\n", @gem_live);
}

For driver-agnostic tracking, use the DRM scheduler tracepoints:

# List GPU driver tracepoints on your system (one pattern per invocation)
bpftrace -l 'tracepoint:drm:*'
bpftrace -l 'tracepoint:i915:*'
bpftrace -l 'tracepoint:amdgpu:*'
bpftrace -l 'tracepoint:gpu_scheduler:*'

Driver            Create tracepoint                    Notes
i915 (Intel)      i915_gem_object_create               Also: i915_vma_bind, i915_vma_unbind
amdgpu (AMD)      amdgpu_bo_create                     Also: amdgpu_bo_move, amdgpu_vm_bo_map
nouveau (NVIDIA)  nouveau_bo_new                       Limited tracepoint coverage
Generic DRM       drm_run_job / drm_sched_process_job  Works across all drivers using the DRM scheduler

Recipe 5: Monitor compositor frame timing

Measure the interval between successive vblank completion events to detect dropped frames:

#!/usr/bin/env bpftrace
/*
 * frame_timing.bt -- Detect dropped frames in a Wayland compositor.
 *
 * At 60Hz, frames should arrive every ~16.6ms.
 * At 144Hz, every ~6.9ms.
 * Flag any gap > 1.5x the expected interval.
 */

kprobe:drm_crtc_send_vblank_event {
    $now = nsecs;
    if (@last_vblank > 0) {
        $delta_ms = ($now - @last_vblank) / 1000000;
        @frame_interval_ms = hist($delta_ms);

        // Flag dropped frames (>25ms gap at 60Hz). This probe fires in
        // IRQ context, so comm/tid would not name the compositor.
        if ($delta_ms > 25) {
            printf("*** DROPPED FRAME: %d ms gap\n", $delta_ms);
            @drops = count();
        }
    }
    @last_vblank = $now;
}

interval:s:30 {
    printf("\n=== Frame interval histogram (ms) ===\n");
    print(@frame_interval_ms);
    printf("Dropped frames: ");
    print(@drops);
}
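The 25 ms cutoff above follows from the rule of thumb in the header comment (flag gaps over 1.5x the expected interval). A small helper, purely illustrative, makes the relationship explicit for other refresh rates:

```python
def frame_interval_ms(hz):
    """Expected gap between vblanks at a given refresh rate."""
    return 1000.0 / hz

def drop_threshold_ms(hz, factor=1.5):
    """Flag a dropped frame when a gap exceeds factor x the interval."""
    return frame_interval_ms(hz) * factor

print(round(frame_interval_ms(60), 1))    # 16.7
print(round(drop_threshold_ms(60), 1))    # 25.0
print(round(frame_interval_ms(144), 1))   # 6.9
print(round(drop_threshold_ms(144), 1))   # 10.4
```

On a 144 Hz panel you would tighten the script's threshold from 25 ms to roughly 10 ms, or the detector would silently swallow one-frame drops.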

Recipe 6: Find syscall bottlenecks in Chromium renderer

Chromium spawns dozens of processes. Target the renderer specifically:

#!/usr/bin/env bpftrace
/*
 * chromium_syscalls.bt -- Profile syscalls in Chromium renderer processes.
 *
 * Run with: sudo bpftrace chromium_syscalls.bt
 * First find the PID: pgrep -f 'type=renderer'
 */

tracepoint:raw_syscalls:sys_enter
/comm == "chromium" || comm == "chrome"/ {
    @start[tid] = nsecs;
    @syscall_nr[tid] = args->id;
}

tracepoint:raw_syscalls:sys_exit
/@start[tid]/ {
    $delta_us = (nsecs - @start[tid]) / 1000;
    @latency_by_syscall[ksym(*(kaddr("sys_call_table") +
                              @syscall_nr[tid] * 8))] = hist($delta_us);
    @count_by_syscall[ksym(*(kaddr("sys_call_table") +
                            @syscall_nr[tid] * 8))] = count();
    delete(@start[tid]);
    delete(@syscall_nr[tid]);
}

END {
    printf("\n=== Syscall latency by type (us) ===\n");
    print(@latency_by_syscall);
    clear(@start);
    clear(@syscall_nr);
}

A simpler alternative that filters by PID:

# Get Chromium renderer PID
RENDERER_PID=$(pgrep -f 'type=renderer' | head -1)

# Count syscalls by type for that process
bpftrace -e "
tracepoint:syscalls:sys_enter_* /pid == $RENDERER_PID/ {
    @syscalls[probe] = count();
}" -c 'sleep 10'

Mesa debug environment variables

When debugging graphics issues, combine bpftrace with Mesa's built-in debug knobs:

Variable Effect
MESA_DEBUG=1 Print GL errors to stderr
LIBGL_DEBUG=verbose Verbose LibGL loader messages
MESA_LOG_FILE=/tmp/mesa.log Redirect Mesa logs to file
GALLIUM_HUD="fps+cpu,draw-calls" On-screen performance overlay
AMD_DEBUG=info AMD driver debug output
INTEL_DEBUG=perf Intel driver performance warnings
RADV_DEBUG=info RADV Vulkan debug output
MESA_SHADER_CAPTURE_PATH=/tmp/shaders Dump compiled shaders to disk
DRI_PRIME=1 Select discrete GPU for rendering
MESA_LOADER_DRIVER_OVERRIDE=zink Force a specific Mesa driver (e.g., Zink for Vulkan-backed GL)

4. perf + Flame Graphs Workflow

On-CPU flame graphs

The full pipeline from recording to visualization:

# 1. Record call stacks at 99 Hz for 30 seconds (system-wide)
#    Use -g for frame pointer unwinding, --call-graph dwarf for DWARF
sudo perf record -F 99 -a -g -- sleep 30

# 2. Export to text
sudo perf script --header > out.perf

# 3. Collapse stacks into single-line format
#    (one line per unique stack, with a count)
./stackcollapse-perf.pl out.perf > out.folded

# 4. Generate SVG flame graph
./flamegraph.pl out.folded > flamegraph.svg

# Or as a pipeline:
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
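The folded format stackcollapse-perf.pl emits is one line per unique stack: frames joined by semicolons, root first, followed by a sample count. A toy version of the fold, assuming stacks have already been parsed into frame lists:

```python
from collections import Counter

def collapse(samples):
    """Fold sampled stacks (root-first frame lists) into the
    'frame;frame;frame count' lines that flamegraph.pl consumes."""
    folded = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(folded.items())]

samples = [
    ["main", "render", "memcpy"],
    ["main", "render", "memcpy"],
    ["main", "poll"],
]
for line in collapse(samples):
    print(line)
# main;poll 1
# main;render;memcpy 2
```

The alphabetical sort here is also why a flame graph's x-axis is alphabetical rather than chronological: identical stacks must merge into one rectangle regardless of when the samples occurred.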

Targeted recording for compositor work:

# Profile only the compositor
sudo perf record -F 99 -g -p $(pidof sway) -- sleep 30

# Profile only Chromium's GPU process
sudo perf record -F 99 -g -p $(pgrep -f 'type=gpu-process') -- sleep 10

# Profile with DWARF unwinding (better C++ stacks, higher overhead)
sudo perf record -F 49 --call-graph dwarf -p $(pidof sway) -- sleep 30

Reading a flame graph

                 ┌─ Narrow tower: deep call chain,
                 │  but little total CPU time.
                 │  Probably not the bottleneck.
                 v
    ┌───┐       ┌─┐
    │   │  ┌────┤ │
    │   │  │    │ │     ┌──────────────────┐ <-- Wide plateau at top:
    │   ├──┤    │ │     │  hot_function()   │     THIS is your bottleneck.
    │   │  │    │ ├─────┤                  │     Width = % of total samples.
    ├───┴──┴────┴─┴─────┴──────────────────┤
    │            main()                     │ <-- Bottom: entry point
    └───────────────────────────────────────┘
    X-axis: alphabetical (NOT time). Width = sample proportion.
    Y-axis: stack depth. Bottom = entry, top = leaf.

Key patterns to look for:

  • Wide plateaus at the top = functions consuming the most CPU. Optimize these first.
  • "Comb" pattern = many thin spikes from a single parent. The parent dispatches varied work -- profile its children individually.
  • Missing frames = gaps in the stack. Use --call-graph dwarf or compile with -fno-omit-frame-pointer.

Off-CPU flame graphs

On-CPU flame graphs show where the CPU is busy. Off-CPU flame graphs show where threads are blocked -- waiting on I/O, locks, or scheduling. This is critical for compositors that are often I/O-bound (waiting for GPU, waiting for vblank, waiting for client buffers).

# Method 1: BCC's offcputime tool
sudo /usr/share/bcc/tools/offcputime -df -p $(pidof sway) 30 > out.stacks
./flamegraph.pl --color=io --title="Off-CPU: sway" \
    --countname=us < out.stacks > offcpu.svg

# Method 2: bpftrace off-CPU script (finish_task_switch pattern: when a
# thread resumes here, its own kstack is the stack it blocked on)
sudo bpftrace -e '
    kprobe:finish_task_switch {
        // stamp the thread being switched out
        $prev = (struct task_struct *)arg0;
        @start[$prev->pid] = nsecs;
        // the current thread just resumed: account its blocked time
        if (@start[tid]) {
            $us = (nsecs - @start[tid]) / 1000;
            @offcpu[comm, kstack] = sum($us);
            delete(@start[tid]);
        }
    }
    END { clear(@start); }
' > offcpu.bt.out

# Method 3: Off-wake flame graphs (shows WHY a thread woke up)
sudo /usr/share/bcc/tools/offwaketime -f -p $(pidof sway) 30 > out.stacks
./flamegraph.pl --color=chain --title="Off-Wake: sway" \
    --countname=us < out.stacks > offwake.svg

Interpreting off-CPU flame graphs:

  • The widest towers represent the longest total blocked time.
  • Look for stacks ending in schedule() -> io_schedule() -> your compositor's read/poll paths -- this tells you what I/O the compositor waits on.
  • Off-wake flame graphs are even more powerful: the bottom half shows the sleeping stack, and the top half shows the waker's stack, connected by -- in the middle.

Differential flame graphs

Compare two profiles (before/after a change, or with/without VNC active):

# Record baseline
sudo perf record -F 99 -a -g -o baseline.data -- sleep 30

# Record with VNC active
sudo perf record -F 99 -a -g -o vnc.data -- sleep 30

# Generate differential flame graph
sudo perf script -i baseline.data | ./stackcollapse-perf.pl > baseline.folded
sudo perf script -i vnc.data | ./stackcollapse-perf.pl > vnc.folded
./difffolded.pl baseline.folded vnc.folded | \
    ./flamegraph.pl --title="Diff: VNC vs baseline" > diff.svg

In a differential flame graph, red = regression (more samples in the second profile), blue = improvement (fewer samples).
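difffolded.pl essentially joins the two folded files on the stack string and emits both counts, so flamegraph.pl can color each frame by the delta. A sketch of that merge (not the real Perl script):

```python
def diff_folded(baseline, after):
    """Merge two {stack: count} profiles into (stack, count1, count2)
    rows, defaulting missing stacks to 0 -- the shape flamegraph.pl
    uses for red/blue differential coloring."""
    stacks = sorted(set(baseline) | set(after))
    return [(s, baseline.get(s, 0), after.get(s, 0)) for s in stacks]

base = {"main;render": 90, "main;poll": 300}
vnc  = {"main;render": 95, "main;poll": 180, "main;writev": 120}
for stack, c1, c2 in diff_folded(base, vnc):
    print(stack, c1, c2)
# main;poll 300 180
# main;render 90 95
# main;writev 0 120
```

Stacks present only in the second profile (here a hypothetical main;writev path) show up as pure regressions, which is often exactly the new code path you are hunting for.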


5. Real-World Debugging Story

"The compositor drops frames when VNC is active"

You are running a Sway compositor serving a Chromium kiosk to remote users via WayVNC. Users report stuttering and dropped frames when VNC clients are connected, but the display is smooth when no VNC session is active. Here is how you diagnose it with the tools from this article.

Step 1: Confirm the symptom with frame timing

#!/usr/bin/env bpftrace
// dropped_frames_check.bt
kprobe:drm_crtc_send_vblank_event {
    $now = nsecs;
    if (@last > 0) {
        $gap_ms = ($now - @last) / 1000000;
        @frame_gap = hist($gap_ms);
        if ($gap_ms > 20) {
            printf("DROPPED: %d ms gap at %ld\n", $gap_ms, nsecs);
            @dropped = count();
        }
    }
    @last = $now;
}

Run this while toggling VNC on/off. You see the histogram shift: without VNC, tight 16-17ms intervals. With VNC, periodic 33-50ms gaps (skipping 1-2 frames).

Step 2: Is it CPU-bound? On-CPU flame graph

# Profile sway while VNC is active
sudo perf record -F 99 -g -p $(pidof sway) -- sleep 30
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > sway_vnc.svg

Open the SVG. You see a wide plateau in wlr_output_commit -> the rendering path. But sway's composite time is only ~3ms per frame. The CPU is not saturated. Not CPU-bound.

Step 3: Is it I/O-bound? Off-CPU flame graph

sudo /usr/share/bcc/tools/offcputime -df -p $(pidof sway) 30 > sway_offcpu.stacks
./flamegraph.pl --color=io --title="Off-CPU: sway + VNC" \
    --countname=us < sway_offcpu.stacks > sway_offcpu.svg

Now you see it: a massive tower in schedule() <- ep_poll() <- wl_event_loop_dispatch(). Sway's event loop is blocking in epoll. Something is making the event loop stall.

Step 4: What is sway waiting on? Syscall latency

#!/usr/bin/env bpftrace
// sway_syscall_latency.bt
tracepoint:raw_syscalls:sys_enter /comm == "sway"/ {
    @start[tid] = nsecs;
}
tracepoint:raw_syscalls:sys_exit /comm == "sway" && @start[tid]/ {
    $us = (nsecs - @start[tid]) / 1000;
    @latency[args->id] = hist($us);
    if ($us > 5000) {
        printf("SLOW syscall %d: %d us\n", args->id, $us);
    }
    delete(@start[tid]);
}

The output reveals: writev() calls occasionally take 8-15ms. These correspond to Wayland protocol messages being sent to the WayVNC client.

Step 5: Is it the VNC encoding? Trace WayVNC

# How much CPU does wayvnc use?
sudo perf record -F 99 -g -p $(pidof wayvnc) -- sleep 30
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > wayvnc.svg

The flame graph shows WayVNC spending most CPU time in tight_encode() (Tight encoding) and zlib_compress(). When VNC clients are connected, WayVNC captures frames via wlr_screencopy, encodes them, and writes to the socket.

Step 6: Find the root cause with scheduler analysis

#!/usr/bin/env bpftrace
// runqueue_latency.bt -- Is sway getting preempted?
tracepoint:sched:sched_wakeup /args->comm == "sway"/ {
    @wake[args->pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_comm == "sway" && @wake[args->next_pid]/ {
    $us = (nsecs - @wake[args->next_pid]) / 1000;
    @runq_latency = hist($us);
    if ($us > 1000) {
        printf("sway waited %d us in run queue\n", $us);
    }
    delete(@wake[args->next_pid]);
}

Finding: when WayVNC's encoding threads are active, sway experiences run-queue latencies of 2-5ms on the same CPU cores. The VNC encoding is CPU-intensive and competing with the compositor for CPU time.

Step 7: The fix

The diagnosis points to two problems:

  1. Blocking writes: Sway's event loop blocks when writing large Wayland messages to a slow consumer (WayVNC). Fix: increase the Wayland socket buffer size or make WayVNC consume faster.

  2. CPU contention: WayVNC encoding runs on the same cores as sway. Fix: pin WayVNC to different CPU cores with taskset or cgroups:

# Pin wayvnc to CPUs 2-3, keep sway on CPUs 0-1
taskset -p 0x0C $(pidof wayvnc)
taskset -p 0x03 $(pidof sway)
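The hex arguments to taskset are plain bitmasks over CPU numbers (bit N set means CPU N is allowed); a tiny helper confirms the values used above:

```python
def cpu_mask(cpus):
    """Build a taskset-style affinity bitmask: bit N set => CPU N allowed."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

print(hex(cpu_mask([2, 3])))   # 0xc -> taskset -p 0x0C (wayvnc on CPUs 2-3)
print(hex(cpu_mask([0, 1])))   # 0x3 -> taskset -p 0x03 (sway on CPUs 0-1)
```

Keeping the two masks disjoint is the point: the compositor never shares a core with the encoder, so the run-queue latency measured in Step 6 disappears.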

This investigation used: bpftrace (frame timing, syscall latency, run-queue latency), perf + flame graphs (on-CPU and off-CPU analysis), and system knowledge (Wayland event loop, VNC encoding pipeline). In an interview, walking through this methodology is more impressive than memorizing tool syntax.


6. Brendan Gregg's Methodologies

USE Method: Utilization, Saturation, Errors

The USE Method provides a systematic checklist for identifying bottlenecks. For every resource in the system, check three metrics:

Metric       Definition                                             Example
Utilization  Average time the resource was busy servicing work (%)  CPU at 85%, GPU at 60%
Saturation   Degree to which extra work is queued, waiting          Run queue depth > 0, swap usage
Errors       Count of error events                                  NIC drops, ECC errors, failed allocations

USE applied to compositor performance

Resource        Utilization tool                     Saturation tool                                  Errors tool
CPU             mpstat -P ALL 1, perf stat           vmstat 1 (r column), bpftrace run-queue latency  perf stat (branch-miss, cache-miss)
GPU             intel_gpu_top, cat gpu_busy_percent  Fence waits in drm_sched_job_wait_dep            dmesg (GPU hang, reset)
Memory          free -m, /proc/meminfo               vmstat (si/so columns), OOM messages             edac-util, dmesg (MCE)
Network         sar -n DEV 1                         ss -tnp (send-Q), tc -s qdisc                    ip -s link (errors, drops)
DRM/KMS         Frame timing histogram (Recipe 5)    Pending flip count, fence wait time              dmesg (flip_done timed out)
Wayland socket  Message rate (Recipe 2)              ss -xp (send-Q on Wayland socket)                Client disconnect logs

The power of USE is completeness: by checking every resource, you avoid anchoring on the first anomaly you find. In the VNC debugging story above, you might have stopped at "CPU is fine" without checking saturation (run-queue latency) and I/O contention.

TSA Method: Thread State Analysis

TSA complements USE by focusing on threads instead of resources. For each thread of interest, measure total time in each state:

State            Meaning                            Tool
Executing        Running on CPU                     perf record -> flame graph
Runnable         In run queue, waiting for CPU      bpftrace sched tracepoints
Sleeping (I/O)   Blocked on disk, network, GPU      Off-CPU flame graph
Sleeping (lock)  Blocked on mutex/futex             bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:pthread_mutex_lock { @[ustack] = count(); }'
Sleeping (idle)  Waiting for work (epoll)           Expected for event-loop compositors
Paging           Waiting for memory to be paged in  vmstat si/so columns
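The bookkeeping TSA asks for reduces to summing the time between state transitions. A sketch on a toy transition log (timestamps in microseconds, state names simplified; illustrative only):

```python
def time_per_state(transitions, end_ts):
    """Sum time spent in each state, given (timestamp, state) transitions
    sorted by time and a final timestamp closing the last interval."""
    totals = {}
    bounded = transitions + [(end_ts, None)]
    for (ts, state), (next_ts, _) in zip(bounded, bounded[1:]):
        totals[state] = totals.get(state, 0) + (next_ts - ts)
    return totals

# Roughly one 16.6 ms frame of a healthy compositor main thread (us):
log = [(0, "executing"), (4000, "sleeping-idle"),
       (15000, "runnable"), (15100, "executing")]
print(time_per_state(log, 16600))
# {'executing': 5500, 'sleeping-idle': 11000, 'runnable': 100}
```

The same reduction works on real data: sched_switch and sched_wakeup events give you the transition log, and the per-state totals tell you which slice of the frame budget is growing.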

TSA applied to compositor thread

For a healthy 60Hz sway compositor, the main thread's time budget per frame is ~16.6ms:

Expected breakdown of sway main thread (60 Hz):
├── Executing (on-CPU)        ~3-5 ms   (input handling + render + composite)
├── Sleeping (idle/epoll)     ~10-12 ms (waiting for vblank or client events)
├── Sleeping (I/O)            ~0-1 ms   (DRM ioctl completion)
├── Runnable (run queue)      ~0-0.1 ms (should be negligible)
└── Sleeping (lock)           ~0 ms     (single-threaded event loop)

If you see the Executing slice growing above 12ms, the compositor cannot finish rendering before the next vblank -- frames will be dropped. If the Runnable slice grows, the compositor is being starved by other processes (the VNC scenario). If the Sleeping (I/O) slice grows, a DRM ioctl or client read is blocking the loop.

Measure this directly:

#!/usr/bin/env bpftrace
/*
 * compositor_tsa.bt -- Thread State Analysis for sway.
 * Records time in executing vs off-CPU states.
 */

tracepoint:sched:sched_switch /args->prev_comm == "sway"/ {
    // sway left the CPU
    if (@oncpu_start[args->prev_pid] > 0) {
        @on_cpu_us = @on_cpu_us +
            (nsecs - @oncpu_start[args->prev_pid]) / 1000;
    }
    @offcpu_start[args->prev_pid] = nsecs;
}

tracepoint:sched:sched_switch /args->next_comm == "sway"/ {
    // sway got the CPU
    if (@offcpu_start[args->next_pid] > 0) {
        @off_cpu_us = @off_cpu_us +
            (nsecs - @offcpu_start[args->next_pid]) / 1000;
    }
    @oncpu_start[args->next_pid] = nsecs;
}

interval:s:5 {
    printf("on-CPU: %d us | off-CPU: %d us (5s window)\n",
           @on_cpu_us, @off_cpu_us);
    @on_cpu_us = 0;
    @off_cpu_us = 0;
}

Combining USE + TSA

In practice, use both:

  1. USE first -- quickly scan all resources for saturation or errors. This takes 60 seconds with the right checklist.
  2. TSA second -- for the threads that matter (compositor main loop, Chromium renderer, VNC encoder), break down where time goes.
  3. Drill down -- use bpftrace recipes and flame graphs to investigate the specific bottleneck USE/TSA identified.

Quick Reference: bpftrace Cheat Sheet

# List probes matching a pattern
bpftrace -l 'kprobe:drm_*'
bpftrace -l 'tracepoint:syscalls:sys_enter_*'
bpftrace -l 'uprobe:/usr/lib/libwayland-server.so:*'

# Trace a function with arguments
bpftrace -e 'kprobe:vfs_read { printf("%s reads fd %d\n", comm, arg0); }'

# Return value of a function
bpftrace -e 'kretprobe:vfs_read { @bytes = hist(retval); }'

# Aggregate by stack trace
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

# Time a function
bpftrace -e 'kprobe:mutex_lock { @start[tid] = nsecs; }
    kretprobe:mutex_lock /@start[tid]/ {
    @lock_ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# Per-second output
bpftrace -e 'tracepoint:block:block_rq_issue { @iops = count(); }
    interval:s:1 { print(@iops); clear(@iops); }'

Installation & Prerequisites

# Ubuntu/Debian
sudo apt install bpftrace bpfcc-tools linux-headers-$(uname -r)

# Fedora
sudo dnf install bpftrace bcc-tools

# Arch
sudo pacman -S bpftrace bcc

# Verify eBPF support
sudo bpftrace -e 'BEGIN { printf("eBPF works!\n"); exit(); }'

# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph

Requirements:

  • Linux kernel >= 4.9 (for basic eBPF), >= 5.8 (for ring buffer, BTF, CO-RE)
  • CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_BPF_JIT=y in kernel config
  • Root or CAP_BPF + CAP_PERFMON capabilities
  • BTF available (/sys/kernel/btf/vmlinux) or kernel headers installed, for type information

Glossary

eBPF (extended Berkeley Packet Filter)
In-kernel virtual machine executing verified, JIT-compiled programs attached to kernel and userspace events. Cannot crash the kernel.
BPF verifier
Static analysis pass that rejects unsafe eBPF programs before execution. Checks memory safety, bounded loops, type correctness, and resource cleanup.
JIT (Just-In-Time) compiler
Translates verified BPF bytecode to native machine instructions (x86-64, ARM64) for near-native execution speed.
BPF map
Kernel-resident key-value data structure shared between eBPF programs and userspace. Types include hash, array, ring buffer, LRU, and stack trace.
kprobe / kretprobe
Dynamic tracing of kernel function entry/return. Can attach to almost any kernel function without recompilation.
uprobe / uretprobe
Dynamic tracing of userspace function entry/return. Requires knowing the library path and symbol name.
tracepoint
Stable, statically defined instrumentation point in the kernel. Preferred over kprobes when available because the interface is versioned.
USDT (User Statically-Defined Tracing)
Static probes compiled into userspace applications (like DTrace probes). Used by PostgreSQL, Python, Node.js, JVM.
GEM (Graphics Execution Manager)
Kernel subsystem managing GPU buffer objects. Handles allocation, mapping, and lifetime of GPU memory.
DRM (Direct Rendering Manager)
Linux kernel subsystem providing GPU access, modesetting (KMS), and buffer management (GEM). Contains tracepoints for vblank, scheduling, and memory.
Flame graph
Visualization of sampled stack traces where x-axis width represents time proportion. Invented by Brendan Gregg. On-CPU shows busy time; off-CPU shows blocked time.
Off-CPU analysis
Measuring time threads spend blocked (I/O, locks, scheduling) rather than executing. Uses scheduler tracepoints to capture blocking stacks.
USE Method
Performance methodology: for every resource, check Utilization, Saturation, and Errors. Ensures complete coverage of potential bottlenecks.
TSA (Thread State Analysis)
Performance methodology: for each thread, measure time in executing, runnable, sleeping, lock, and idle states. Complementary to USE.
CO-RE (Compile Once, Run Everywhere)
eBPF portability mechanism using BTF (BPF Type Format) to relocate struct field accesses at load time, avoiding per-kernel recompilation.
BTF (BPF Type Format)
Compact type metadata embedded in the kernel, enabling CO-RE relocations and rich introspection of kernel data structures from eBPF programs.
vblank
Vertical blanking interval: the period between the last scanline of one frame and the first scanline of the next. Page flips happen here to avoid tearing.
Atomic modesetting
DRM API that commits all display changes (planes, CRTCs, connectors) in a single atomic transaction, preventing partial updates and tearing.