Linux Performance Profiling¶

The posting explicitly lists "perf, bpftrace, valgrind, flamegraphs" as required skills. This article covers each tool, plus GPU profiling and compositor latency measurement -- the full toolkit for diagnosing performance issues in a Sway + Chromium + WayVNC stack.

perf: Hardware Counters & Sampling¶

perf reads CPU performance monitoring counters (PMCs) and samples call stacks at configurable rates.

# Count cache misses and instructions for a command
perf stat -e cache-misses,instructions,cycles ./my_compositor

# Sample call stacks at 99 Hz for 30 seconds (system-wide)
perf record -F 99 -ag -- sleep 30

# Report hottest functions
perf report --stdio --sort=dso,symbol

# Record with DWARF unwinding (better stacks for C++)
perf record -F 99 -ag --call-graph dwarf -- sleep 10

Subcommand	Use
`perf stat`	Count events (instructions, cache misses, branch misses)
`perf record`	Sample call stacks to `perf.data`
`perf report`	Interactive TUI for `perf.data`
`perf top`	Live sampling, like `top` for functions
`perf trace`	Lightweight `strace` alternative using perf

Flame Graphs¶

Brendan Gregg's methodology: collapse perf stacks into an SVG where width = time spent.

# Record, then generate flame graph
perf record -F 99 -ag -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Reading a flame graph:

X-axis: alphabetical sort (not time). Width = proportion of samples.
Y-axis: call stack depth. Bottom = entry point, top = leaf function.
Plateaus: wide bars at the top = hot functions to optimize.
Towers: deep narrow stacks = probably not a bottleneck.

For Chromium: profile the GPU process separately (perf record -p $(pgrep -f "type=gpu-process")).

bpftrace One-Liners¶

bpftrace uses eBPF for safe, low-overhead kernel and user tracing:

# Syscall latency histogram (nanoseconds)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
    tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    @usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]); }'

# File I/O by process (reads)
bpftrace -e 'tracepoint:syscalls:sys_enter_read {
    @bytes[comm] = sum(args->count); }'

# Scheduler run queue latency
bpftrace -e 'tracepoint:sched:sched_wakeup {
    @qtime[args->pid] = nsecs; }
    tracepoint:sched:sched_switch /@qtime[args->next_pid]/ {
    @usecs = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]); }'

# Count Wayland protocol calls in a compositor
bpftrace -e 'uprobe:/usr/lib/libwayland-server.so:wl_resource_post_event {
    @[comm] = count(); }'

valgrind¶

Tool	Purpose
Memcheck	Detect use-after-free, uninitialized reads, leaks
Cachegrind	Simulate CPU cache, find cache-hostile code
Callgrind	Call graph profiling (pair with KCachegrind GUI)
Helgrind	Detect data races in threaded code

# Memory error detection (10-50x slowdown)
valgrind --tool=memcheck --leak-check=full ./sway

# Cache profiling
valgrind --tool=cachegrind ./my_renderer
cg_annotate cachegrind.out.*

Memcheck is invaluable when patching Sway/wlroots C code -- catch use-after-free before it becomes a compositor crash.

strace / ltrace¶

# Trace syscalls with timing
strace -fe trace=read,write,ioctl -T -p $(pidof sway)

# Trace library calls (e.g., EGL/GL calls)
ltrace -e 'egl*' -p $(pidof sway)

# Count syscalls by type
strace -c -p $(pidof chromium) -f 2>&1 | head -30

GPU Profiling¶

# Mesa's built-in HUD (works with any Mesa driver)
GALLIUM_HUD="fps+cpu,draw-calls" ./my_app

# Intel GPU utilization
sudo intel_gpu_top

# AMD GPU utilization
cat /sys/class/drm/card0/device/gpu_busy_percent

# Vulkan frame timing (via VK_LAYER_MESA_overlay)
VK_INSTANCE_LAYERS=VK_LAYER_MESA_overlay ./my_app

For compositor work: measure GPU render time per frame. If compositing takes

4ms at 60Hz (16.6ms budget), investigate shader complexity or texture upload bottlenecks.

Compositor Latency Measurement¶

The Wayland presentation-time protocol provides precise feedback:

sequenceDiagram
    participant C as Client
    participant Comp as Compositor
    participant K as Kernel (KMS)

    C->>Comp: wp_presentation.feedback(surface)
    C->>Comp: wl_surface.commit()
    Comp->>K: Atomic commit
    K->>Comp: Page-flip event (vblank timestamp)
    Comp->>C: wp_presentation_feedback.presented(tv_sec, tv_nsec, flags)

# Measure frame presentation times with weston-presentation-shm
weston-presentation-shm --output=HDMI-A-1

# Or manually: log timestamps in a custom Wayland client
# presented_time - commit_time = total latency

Key latency sources in the Warmwind stack:

Stage	Typical	Tool to measure
Chromium render	2-8ms	`--enable-gpu-benchmarking`
Compositor composite	0.5-4ms	`presentation-time` protocol
VNC encode	1-10ms	WayVNC `--log-level=debug`
Network transfer	1-50ms	`ping`, Wireshark
VNC decode + display	1-5ms	Client-side timestamps

Glossary

PMC (Performance Monitoring Counter): CPU hardware register counting events (cycles, cache misses, branch mispredictions).
eBPF (extended Berkeley Packet Filter): In-kernel VM for safe, JIT-compiled tracing and networking programs.
Flame graph: Visualization of sampled stack traces. Width = time proportion. Invented by Brendan Gregg.
Memcheck: Valgrind tool detecting memory errors via binary instrumentation. 10-50x slowdown.
GALLIUM_HUD: Mesa environment variable enabling a real-time on-screen GPU metrics overlay.
presentation-time: Wayland protocol providing precise vblank timestamps for committed surfaces.
Run queue latency: Time a thread waits in the scheduler queue before getting CPU time.