Linux Performance Profiling

The posting explicitly lists "perf, bpftrace, valgrind, flamegraphs" as required skills. This article covers each tool, plus GPU profiling and compositor latency measurement -- the full toolkit for diagnosing performance issues in a Sway + Chromium + WayVNC stack.

perf: Hardware Counters & Sampling

perf reads CPU performance monitoring counters (PMCs) and samples call stacks at configurable rates.

# Count cache misses and instructions for a command
perf stat -e cache-misses,instructions,cycles ./my_compositor

# Sample call stacks at 99 Hz for 30 seconds (system-wide)
perf record -F 99 -ag -- sleep 30

# Report hottest functions
perf report --stdio --sort=dso,symbol

# Record with DWARF unwinding (usable stacks for binaries built without frame pointers)
perf record -F 99 -a --call-graph dwarf -- sleep 10

| Subcommand | Use |
|---|---|
| perf stat | Count events (instructions, cache misses, branch misses) |
| perf record | Sample call stacks to perf.data |
| perf report | Interactive TUI for perf.data |
| perf top | Live sampling, like top for functions |
| perf trace | Lightweight strace alternative using perf |
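
Raw counts from perf stat are easier to compare across runs once they are turned into ratios. The sketch below parses the CSV form that perf stat -x, prints (value, unit, event name, then timing fields) and derives instructions per cycle and cache misses per 1000 instructions. The sample counts are made up for illustration, and the parser assumes plain event names; modifiers such as cycles:u or per-PMU names on hybrid CPUs would need extra handling.

```python
import csv
import io

# Hypothetical output of: perf stat -x, -e cache-misses,instructions,cycles <cmd>
# Field order: value, unit, event, counter run time, percent of time counted, ...
SAMPLE = """\
1200000,,cache-misses,1000000000,100.00,,
8000000000,,instructions,1000000000,100.00,,
4000000000,,cycles,1000000000,100.00,,
"""


def parse_perf_stat_csv(text):
    """Return {event_name: count} from `perf stat -x,` output."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        # perf prints "<not counted>" or "<not supported>" instead of a number.
        if not value.isdigit():
            continue
        counts[event] = int(value)
    return counts


def derived_metrics(counts):
    """Compute IPC and cache misses per 1000 instructions."""
    ipc = counts["instructions"] / counts["cycles"]
    mpki = counts["cache-misses"] / counts["instructions"] * 1000
    return {"ipc": ipc, "cache_mpki": mpki}


counts = parse_perf_stat_csv(SAMPLE)
metrics = derived_metrics(counts)
print(f"IPC: {metrics['ipc']:.2f}")
print(f"cache misses per 1000 instructions: {metrics['cache_mpki']:.3f}")
```

perf stat writes its counters to stderr, so capture them with 2> or the -o option when feeding real output into a parser like this.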

Flame Graphs

Brendan Gregg's methodology: collapse perf stacks into an SVG where width = time spent.

# Record, then generate flame graph
perf record -F 99 -ag -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Reading a flame graph:

  • X-axis: alphabetical sort (not time). Width = proportion of samples.
  • Y-axis: call stack depth. Bottom = entry point, top = leaf function.
  • Plateaus: wide bars at the top = hot functions to optimize.
  • Towers: deep narrow stacks = probably not a bottleneck.
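
To make the width rule concrete, here is a minimal sketch that computes what a flame graph draws, starting from folded stacks in the "frame;frame;frame count" form that stackcollapse-perf.pl emits. Each distinct stack prefix corresponds to one box, and its width is the share of samples whose stack starts with that prefix. The function names and sample counts are hypothetical.

```python
from collections import Counter

# Hypothetical folded stacks: root first, leaf last, sample count at the end.
FOLDED = """\
sway;main;wl_event_loop_dispatch;output_render 60
sway;main;wl_event_loop_dispatch;handle_input 10
chromium;main;RunLoop;Paint 30
"""


def flame_shares(folded_text):
    """Return (box_widths, leaf_shares) as fractions of all samples.

    box_widths is keyed by stack prefix (one flame graph box each).
    leaf_shares is keyed by leaf function name (the top edge of the graph).
    """
    boxes, leaves = Counter(), Counter()
    total = 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, _, count = line.rpartition(" ")
        frames = tuple(stack.split(";"))
        n = int(count)
        total += n
        # Every prefix of the stack is a box that these samples widen.
        for depth in range(1, len(frames) + 1):
            boxes[frames[:depth]] += n
        leaves[frames[-1]] += n
    if total == 0:
        return {}, {}
    return (
        {k: v / total for k, v in boxes.items()},
        {k: v / total for k, v in leaves.items()},
    )


widths, plateaus = flame_shares(FOLDED)
for name, share in sorted(plateaus.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%} of samples on-CPU as leaf")
```

The leaf shares are the plateaus: functions that were actually on-CPU when the sample fired, which is where optimization effort usually pays off first.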

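For Chromium: profile the GPU process separately (perf record -g -p $(pgrep -d, -f "type=gpu-process")). The -d, option makes pgrep emit the comma-separated PID list that perf expects if more than one process matches.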

bpftrace One-Liners

bpftrace uses eBPF for safe, low-overhead kernel and user tracing:

# Syscall latency histogram (microseconds)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
    tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    @usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]); }'

# Bytes requested via read() per process (use sys_exit_read and args->ret for bytes actually read)
bpftrace -e 'tracepoint:syscalls:sys_enter_read {
    @bytes[comm] = sum(args->count); }'

# Scheduler run queue latency
bpftrace -e 'tracepoint:sched:sched_wakeup {
    @qtime[args->pid] = nsecs; }
    tracepoint:sched:sched_switch /@qtime[args->next_pid]/ {
    @usecs = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]); }'

# Count Wayland events the compositor posts to clients (library path varies by distro)
bpftrace -e 'uprobe:/usr/lib/libwayland-server.so:wl_resource_post_event {
    @[comm] = count(); }'
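
The latency one-liners share one pattern: store a timestamp in a map keyed by thread on entry, then on exit subtract, bucket, and delete the key. The following Python sketch simulates that pattern on a hypothetical event list so the logic can be followed without a kernel. The power-of-two bucketing approximates what hist() does for non-negative values; it is an illustration, not bpftrace's implementation.

```python
from collections import Counter


def log2_bucket(value):
    """Lower bound of the power-of-two bucket holding value (0 stays 0)."""
    return 0 if value <= 0 else 1 << (value.bit_length() - 1)


def latency_histogram(events):
    """events: iterable of (kind, tid, timestamp_ns); kind is 'enter' or 'exit'."""
    start = {}        # plays the role of @start[tid]
    hist = Counter()  # plays the role of @usecs = hist(...)
    for kind, tid, ts in events:
        if kind == "enter":
            start[tid] = ts
        elif kind == "exit" and tid in start:  # the /@start[tid]/ filter
            usecs = (ts - start.pop(tid)) // 1000  # pop() mirrors delete()
            hist[log2_bucket(usecs)] += 1
    return dict(hist)


# Hypothetical trace: two threads, plus an exit whose enter was never seen.
EVENTS = [
    ("enter", 1, 1_000),
    ("enter", 2, 2_000),
    ("exit", 1, 6_000),       # 5 us, lands in bucket [4, 8)
    ("exit", 2, 1_002_000),   # 1000 us, lands in bucket [512, 1024)
    ("exit", 3, 5_000),       # ignored: tracing started mid-syscall
    ("enter", 1, 10_000),
    ("exit", 1, 10_500),      # under 1 us, lands in bucket 0
]

histogram = latency_histogram(EVENTS)
for lower in sorted(histogram):
    print(f"bucket starting at {lower} us: {histogram[lower]}")
```

The filter on exit matters in practice: tracing usually starts while some threads are already inside a syscall, and without it those exits would be measured against a zero timestamp.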

valgrind

| Tool | Purpose |
|---|---|
| Memcheck | Detect use-after-free, uninitialized reads, leaks |
| Cachegrind | Simulate CPU cache, find cache-hostile code |
| Callgrind | Call graph profiling (pair with KCachegrind GUI) |
| Helgrind | Detect data races in threaded code |

# Memory error detection (10-50x slowdown)
valgrind --tool=memcheck --leak-check=full ./sway

# Cache profiling
valgrind --tool=cachegrind ./my_renderer
cg_annotate cachegrind.out.*

Memcheck is invaluable when patching Sway/wlroots C code -- catch use-after-free before it becomes a compositor crash.

strace / ltrace

# Trace syscalls with timing
strace -fe trace=read,write,ioctl -T -p $(pidof sway)

# Trace library calls (e.g., EGL/GL calls)
ltrace -e 'egl*' -p $(pidof sway)

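# Count syscalls by type (stop with Ctrl-C; the summary is written to the -o file)
# Quoting lets strace accept the multiple PIDs that pidof returns for Chromium
strace -c -f -p "$(pidof chromium)" -o strace-summary.txt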
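
The -T flag appends the time spent in each call as a trailing <seconds> field, which makes the output easy to aggregate after the fact. Below is a small sketch that totals that field per syscall name. The sample lines are hypothetical but follow the shape strace -f -T prints; lines split into "unfinished" and "resumed" halves do not match the pattern and are skipped.

```python
import re
from collections import defaultdict

# Hypothetical lines in the shape printed by: strace -f -T -p <pid>
SAMPLE = """\
[pid  1234] read(12, "abc", 16) = 3 <0.000021>
[pid  1234] ioctl(9, DRM_IOCTL_MODE_ATOMIC, 0x7ffd3c1e2a10) = 0 <0.000412>
[pid  1240] write(27, "x", 1) = 1 <0.000009>
[pid  1234] ioctl(9, DRM_IOCTL_MODE_ATOMIC, 0x7ffd3c1e2a10) = 0 <0.000388>
[pid  1240] read(5,  <unfinished ...>
"""

# Optional "[pid N]" prefix, syscall name, then the trailing <seconds> from -T.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$")


def time_per_syscall(text):
    """Return {syscall_name: total_seconds} for complete strace -T lines."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


totals = time_per_syscall(SAMPLE)
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds * 1e6:.0f} us")
```

Keep in mind that strace itself slows the traced process considerably, so the absolute numbers are inflated; the ranking between syscalls is the more trustworthy signal.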

GPU Profiling

# Mesa's built-in HUD (Gallium-based OpenGL drivers only)
GALLIUM_HUD="fps+cpu,draw-calls" ./my_app

# Intel GPU utilization
sudo intel_gpu_top

# AMD GPU utilization
cat /sys/class/drm/card0/device/gpu_busy_percent

# Vulkan frame timing (via VK_LAYER_MESA_overlay)
VK_INSTANCE_LAYERS=VK_LAYER_MESA_overlay ./my_app

For compositor work: measure GPU render time per frame. If compositing takes more than 4 ms at 60 Hz (16.7 ms frame budget), investigate shader complexity or texture upload bottlenecks.
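
The budget arithmetic is simple enough to keep in a helper when comparing displays with different refresh rates. This is a minimal sketch; the function names are illustrative and the 4 ms figure is just the compositing time used as an example here.

```python
def frame_budget_ms(refresh_hz):
    """Time available per frame, in milliseconds."""
    return 1000.0 / refresh_hz


def budget_share(stage_ms, refresh_hz):
    """Fraction of the frame budget consumed by one stage."""
    return stage_ms / frame_budget_ms(refresh_hz)


for hz in (60, 144):
    budget = frame_budget_ms(hz)
    share = budget_share(4.0, hz)
    print(f"{hz} Hz: budget {budget:.2f} ms, 4 ms compositing uses {share:.0%}")
```

The same 4 ms that takes about a quarter of the budget at 60 Hz takes more than half of it at 144 Hz, which is why compositing cost deserves more scrutiny on high-refresh outputs.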

Compositor Latency Measurement

The Wayland presentation-time protocol provides precise feedback:

sequenceDiagram
    participant C as Client
    participant Comp as Compositor
    participant K as Kernel (KMS)

    C->>Comp: wp_presentation.feedback(surface)
    C->>Comp: wl_surface.commit()
    Comp->>K: Atomic commit
    K->>Comp: Page-flip event (vblank timestamp)
    Comp->>C: wp_presentation_feedback.presented(tv_sec, tv_nsec, flags)

# Measure frame presentation times with weston-presentation-shm (feedback mode is the default)
weston-presentation-shm

# Or manually: log timestamps in a custom Wayland client
# presented_time - commit_time = total latency (read both from the clock announced by wp_presentation.clock_id)
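
In the protocol itself, the presented event carries the seconds value split into two 32-bit halves (tv_sec_hi and tv_sec_lo) alongside tv_nsec, so a client has to reassemble the timestamp before subtracting. The sketch below shows that arithmetic in isolation. The function names and sample values are hypothetical, and the commit timestamp is assumed to have been read from the same clock the compositor announces via wp_presentation.clock_id.

```python
NSEC_PER_SEC = 1_000_000_000


def presented_time_ns(tv_sec_hi, tv_sec_lo, tv_nsec):
    """Reassemble the 64-bit seconds value and convert to nanoseconds."""
    seconds = (tv_sec_hi << 32) | tv_sec_lo
    return seconds * NSEC_PER_SEC + tv_nsec


def commit_to_present_ms(commit_ns, tv_sec_hi, tv_sec_lo, tv_nsec):
    """Latency from wl_surface.commit() to the frame reaching the screen."""
    return (presented_time_ns(tv_sec_hi, tv_sec_lo, tv_nsec) - commit_ns) / 1e6


# Hypothetical values: commit at 100.488 s, presented at 100.500 s.
latency = commit_to_present_ms(100_488_000_000, 0, 100, 500_000_000)
print(f"commit-to-present latency: {latency:.1f} ms")
```

A real client would record commit_ns with clock_gettime() on the announced clock just before committing; mixing clocks (for example CLOCK_REALTIME against CLOCK_MONOTONIC) produces meaningless differences.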

Key latency sources in the Warmwind stack:

| Stage | Typical latency | Tool to measure |
|---|---|---|
| Chromium render | 2-8ms | --enable-gpu-benchmarking |
| Compositor composite | 0.5-4ms | presentation-time protocol |
| VNC encode | 1-10ms | wayvnc --log-level=debug |
| Network transfer | 1-50ms | ping, Wireshark |
| VNC decode + display | 1-5ms | Client-side timestamps |
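
Adding up the ranges gives a rough end-to-end envelope and shows which stage dominates the worst case. The sketch below uses the typical figures from the table and assumes the stages run strictly one after another with no overlap or pipelining, which is a simplification.

```python
# (best_ms, worst_ms) per stage, taken from the typical ranges listed above.
STAGES_MS = {
    "Chromium render": (2.0, 8.0),
    "Compositor composite": (0.5, 4.0),
    "VNC encode": (1.0, 10.0),
    "Network transfer": (1.0, 50.0),
    "VNC decode + display": (1.0, 5.0),
}


def end_to_end_ms(stages):
    """Sum best and worst cases, assuming strictly sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def dominant_stage(stages):
    """Stage with the largest worst-case contribution."""
    return max(stages, key=lambda name: stages[name][1])


best, worst = end_to_end_ms(STAGES_MS)
print(f"end-to-end: {best} ms best case, {worst} ms worst case")
print(f"largest worst-case contributor: {dominant_stage(STAGES_MS)}")
```

With these numbers the network accounts for most of the worst case, so it is worth measuring the link before spending time on compositor or encoder tuning.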

Glossary

PMC (Performance Monitoring Counter)
CPU hardware register counting events (cycles, cache misses, branch mispredictions).
eBPF (extended Berkeley Packet Filter)
In-kernel VM for safe, JIT-compiled tracing and networking programs.
Flame graph
Visualization of sampled stack traces. Width = time proportion. Invented by Brendan Gregg.
Memcheck
Valgrind tool detecting memory errors via binary instrumentation. 10-50x slowdown.
GALLIUM_HUD
Mesa environment variable enabling a real-time on-screen GPU metrics overlay.
presentation-time
Wayland protocol providing precise vblank timestamps for committed surfaces.
Run queue latency
Time a thread waits in the scheduler queue before getting CPU time.