Linux Performance Profiling¶
The posting explicitly lists "perf, bpftrace, valgrind, flamegraphs" as required skills. This article covers each tool, plus GPU profiling and compositor latency measurement -- the full toolkit for diagnosing performance issues in a Sway + Chromium + WayVNC stack.
perf: Hardware Counters & Sampling¶
perf reads CPU performance monitoring counters (PMCs) and samples call
stacks at configurable rates.
# Count cache misses and instructions for a command
perf stat -e cache-misses,instructions,cycles ./my_compositor
# Sample call stacks at 99 Hz for 30 seconds (system-wide)
perf record -F 99 -ag -- sleep 30
# Report hottest functions
perf report --stdio --sort=dso,symbol
# Record with DWARF unwinding (better stacks for C++)
perf record -F 99 -ag --call-graph dwarf -- sleep 10
| Subcommand | Use |
|---|---|
perf stat |
Count events (instructions, cache misses, branch misses) |
perf record |
Sample call stacks to perf.data |
perf report |
Interactive TUI for perf.data |
perf top |
Live sampling, like top for functions |
perf trace |
Lightweight strace alternative using perf |
Flame Graphs¶
Brendan Gregg's methodology: collapse perf stacks into an SVG where width = time spent.
# Record, then generate flame graph
perf record -F 99 -ag -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Reading a flame graph:
- X-axis: alphabetical sort (not time). Width = proportion of samples.
- Y-axis: call stack depth. Bottom = entry point, top = leaf function.
- Plateaus: wide bars at the top = hot functions to optimize.
- Towers: deep narrow stacks = probably not a bottleneck.
For Chromium: profile the GPU process separately
(perf record -p $(pgrep -f "type=gpu-process")).
bpftrace One-Liners¶
bpftrace uses eBPF for safe, low-overhead kernel and user tracing:
# Syscall latency histogram (nanoseconds)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@usecs = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]); }'
# File I/O by process (reads)
bpftrace -e 'tracepoint:syscalls:sys_enter_read {
@bytes[comm] = sum(args->count); }'
# Scheduler run queue latency
bpftrace -e 'tracepoint:sched:sched_wakeup {
@qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch /@qtime[args->next_pid]/ {
@usecs = hist((nsecs - @qtime[args->next_pid]) / 1000);
delete(@qtime[args->next_pid]); }'
# Count Wayland protocol calls in a compositor
bpftrace -e 'uprobe:/usr/lib/libwayland-server.so:wl_resource_post_event {
@[comm] = count(); }'
valgrind¶
| Tool | Purpose |
|---|---|
| Memcheck | Detect use-after-free, uninitialized reads, leaks |
| Cachegrind | Simulate CPU cache, find cache-hostile code |
| Callgrind | Call graph profiling (pair with KCachegrind GUI) |
| Helgrind | Detect data races in threaded code |
# Memory error detection (10-50x slowdown)
valgrind --tool=memcheck --leak-check=full ./sway
# Cache profiling
valgrind --tool=cachegrind ./my_renderer
cg_annotate cachegrind.out.*
Memcheck is invaluable when patching Sway/wlroots C code -- catch use-after-free before it becomes a compositor crash.
strace / ltrace¶
# Trace syscalls with timing
strace -fe trace=read,write,ioctl -T -p $(pidof sway)
# Trace library calls (e.g., EGL/GL calls)
ltrace -e 'egl*' -p $(pidof sway)
# Count syscalls by type
strace -c -p $(pidof chromium) -f 2>&1 | head -30
GPU Profiling¶
# Mesa's built-in HUD (works with any Mesa driver)
GALLIUM_HUD="fps+cpu,draw-calls" ./my_app
# Intel GPU utilization
sudo intel_gpu_top
# AMD GPU utilization
cat /sys/class/drm/card0/device/gpu_busy_percent
# Vulkan frame timing (via VK_LAYER_MESA_overlay)
VK_INSTANCE_LAYERS=VK_LAYER_MESA_overlay ./my_app
For compositor work: measure GPU render time per frame. If compositing takes
4ms at 60Hz (16.6ms budget), investigate shader complexity or texture upload bottlenecks.
Compositor Latency Measurement¶
The Wayland presentation-time protocol provides precise feedback:
sequenceDiagram
participant C as Client
participant Comp as Compositor
participant K as Kernel (KMS)
C->>Comp: wp_presentation.feedback(surface)
C->>Comp: wl_surface.commit()
Comp->>K: Atomic commit
K->>Comp: Page-flip event (vblank timestamp)
Comp->>C: wp_presentation_feedback.presented(tv_sec, tv_nsec, flags)
# Measure frame presentation times with weston-presentation-shm
weston-presentation-shm --output=HDMI-A-1
# Or manually: log timestamps in a custom Wayland client
# presented_time - commit_time = total latency
Key latency sources in the Warmwind stack:
| Stage | Typical | Tool to measure |
|---|---|---|
| Chromium render | 2-8ms | --enable-gpu-benchmarking |
| Compositor composite | 0.5-4ms | presentation-time protocol |
| VNC encode | 1-10ms | WayVNC --log-level=debug |
| Network transfer | 1-50ms | ping, Wireshark |
| VNC decode + display | 1-5ms | Client-side timestamps |
Glossary
- PMC (Performance Monitoring Counter)
- CPU hardware register counting events (cycles, cache misses, branch mispredictions).
- eBPF (extended Berkeley Packet Filter)
- In-kernel VM for safe, JIT-compiled tracing and networking programs.
- Flame graph
- Visualization of sampled stack traces. Width = time proportion. Invented by Brendan Gregg.
- Memcheck
- Valgrind tool detecting memory errors via binary instrumentation. 10-50x slowdown.
- GALLIUM_HUD
- Mesa environment variable enabling a real-time on-screen GPU metrics overlay.
- presentation-time
- Wayland protocol providing precise vblank timestamps for committed surfaces.
- Run queue latency
- Time a thread waits in the scheduler queue before getting CPU time.