How Sampling Profilers Work
The Core Idea
Sampling profilers answer the question: "Where is my program spending time?"
Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.
Program execution: ████████████████████████████████████████
↑ ↑ ↑ ↑ ↑ ↑ ↑
sample sample sample ...
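To make the idea concrete, here is a minimal sketch of a signal-based sampler: a timer interrupts the process at a fixed rate and the handler records the interrupted instruction pointer. This is only an illustration of the principle (names like start_sampling and MAX_SAMPLES are invented for the example, and it assumes Linux/x86-64 with glibc); perf itself samples via the hardware PMU described below, not via setitimer.

/* Minimal sketch of periodic sampling via a POSIX interval timer.
   Assumes Linux/x86-64 + glibc; perf uses the PMU instead. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000
static volatile sig_atomic_t nsamples;
static uintptr_t samples[MAX_SAMPLES];

/* Each SIGPROF delivery records the interrupted instruction pointer. */
static void on_sigprof(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
}

static void start_sampling(int hz)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* ITIMER_PROF ticks on CPU time consumed by the process. */
    struct itimerval it;
    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 1000000 / hz;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);
}

int main(void)
{
    start_sampling(99);                          /* ~99 samples per second */
    volatile double x = 0;
    for (long i = 0; i < 200000000L; i++) x += i * 0.5;   /* stand-in workload */
    printf("collected %d samples\n", (int)nsamples);
    return 0;
}

Counting how often each sampled address (bucketed by function) appears in samples[] yields the same kind of hot-spot histogram that perf report shows.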
Sampling vs Instrumentation
| Approach | How it works | Overhead | Accuracy |
|---|---|---|---|
| Instrumentation | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| Sampling | Interrupt periodically, record stack | Low (1-5%) | Statistical |
Instrumentation (like gprof with -pg) modifies your program. Sampling just observes it.
How perf Does It
1. Hardware Performance Counters (PMU)
Modern CPUs have Performance Monitoring Units (PMUs) with special registers:
┌─────────────────────────────────────────┐
│ CPU │
│ ┌─────────────────────────────────┐ │
│ │ Performance Monitoring Unit │ │
│ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Counter │ │ Counter │ ... │ │
│ │ │ cycles │ │ instrs │ │ │
│ │ └─────────┘ └─────────┘ │ │
│ │ ↓ overflow interrupt │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
When you run perf record, roughly this happens (sketched in code after the list):
- perf programs a PMU counter to count CPU cycles
- Counter overflows every N cycles (default: enough for ~4000 samples/sec)
- Overflow triggers a Non-Maskable Interrupt (NMI)
- Kernel handler records: instruction pointer, process ID, timestamp
- Optionally: walks the stack to get call chain
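The same setup can be made by hand with the perf_event_open(2) syscall. The sketch below (Linux-only, error handling trimmed, the busy loop is just a stand-in workload) programs a cycles counter in frequency mode with roughly the sample payload perf record asks for; it approximates what perf does, it is not perf's actual implementation.

/* Sketch: program a PMU cycles counter for sampling, roughly as perf record does.
   Linux-only; may require lowering /proc/sys/kernel/perf_event_paranoid. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* There is no glibc wrapper, so invoke the syscall directly. */
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;          /* sample on CPU cycles */
    attr.freq = 1;                                   /* sample_freq is in Hz */
    attr.sample_freq = 4000;                         /* ~4000 samples/sec */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
                       PERF_SAMPLE_CPU | PERF_SAMPLE_CALLCHAIN;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0, cpu = -1: profile this process on whichever CPU it runs. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);             /* start the counter */

    /* perf record would mmap() a ring buffer on fd and drain
       PERF_RECORD_SAMPLE entries from it while the workload runs. */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) x += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}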
2. The Sampling Frequency
perf record -F 99 ./program # 99 samples per second
perf record -F 9999 ./program # 9999 samples per second
Higher frequency = more samples = better accuracy, but more overhead.
Why 99 and not 100? An off-round frequency keeps sampling from running in lockstep with periodic activity in your program or the kernel (like a 100 Hz timer); sampling exactly in phase with such a timer would systematically over- or under-count it.
3. What Gets Recorded
Each sample contains:
- IP (Instruction Pointer): Which instruction was executing
- PID/TID: Which process/thread
- Timestamp: When it happened
- CPU: Which core
- Call chain (with -g): Stack of return addresses
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
callchain: main → process_data → compute_inner
...
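When the corresponding PERF_SAMPLE_* bits are enabled, each record in perf's ring buffer has roughly the shape below, following the field order documented in perf_event_open(2). This is a simplified sketch for exactly these fields, not the full kernel ABI.

/* Simplified layout of one PERF_RECORD_SAMPLE for
   sample_type = IP | TID | TIME | CPU | CALLCHAIN. */
#include <linux/perf_event.h>
#include <stdint.h>

struct sample_record {
    struct perf_event_header header;   /* header.type == PERF_RECORD_SAMPLE */
    uint64_t ip;                       /* PERF_SAMPLE_IP:   instruction pointer */
    uint32_t pid, tid;                 /* PERF_SAMPLE_TID:  process / thread id */
    uint64_t time;                     /* PERF_SAMPLE_TIME: timestamp */
    uint32_t cpu, res;                 /* PERF_SAMPLE_CPU:  core (+ reserved) */
    uint64_t nr;                       /* PERF_SAMPLE_CALLCHAIN: frame count */
    uint64_t ips[];                    /* ...followed by nr return addresses */
};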
Symbol Resolution
Raw samples contain memory addresses. To show function names, perf needs:
- Symbol tables: Map address ranges to function names
- Debug info (-g): Map addresses to source lines
Without symbols: 45.23% 0x00000000004011a0
With symbols: 45.23% compute_inner
With debug info: 45.23% compute_inner (program.c:28)
This is why perf report needs access to the same binary you profiled.
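As a rough illustration of the lookup, the sketch below maps an address back to the nearest symbol at runtime using dladdr(). This only sees the dynamic symbol table (so build with -rdynamic, and add -ldl on older glibc); perf does the real work offline by parsing the binary's ELF symbol tables and DWARF. The file name in the comment is just an example.

/* Sketch: resolve an address to the nearest symbol name, the way a profiler
   resolves raw sample addresses. Compile with: gcc -rdynamic resolve.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void compute_inner(void) { }   /* stand-in for the hot function */

int main(void)
{
    void *sampled_ip = (void *)compute_inner;   /* pretend this came from a sample */
    Dl_info info;
    if (dladdr(sampled_ip, &info) && info.dli_sname)
        printf("%p -> %s (in %s)\n", sampled_ip, info.dli_sname, info.dli_fname);
    else
        printf("%p -> ??? (no symbol)\n", sampled_ip);
    return 0;
}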
Call Graph Collection
With perf record -g, perf records the call stack for each sample.
Frame Pointer Walking (Traditional)
Stack Memory:
┌──────────────┐
│ return addr │ ← where to return after current function
│ saved RBP │ ← pointer to previous frame
├──────────────┤
│ local vars │
├──────────────┤
│ return addr │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ... │
└──────────────┘
Walk the chain of frame pointers to reconstruct the call stack.
Problem: Modern compilers omit frame pointers by default at -O1 and above (-fomit-frame-pointer), which breaks this walk. Solution: Compile with -fno-omit-frame-pointer or use DWARF unwinding.
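A rough sketch of the walk itself on x86-64 (assuming every frame on the stack kept its frame pointer) looks like this; the kernel's frame-pointer unwinder does essentially the same chase with extra validity checks.

/* Sketch: walk the saved-RBP chain on x86-64. Works only if all code on
   the stack was compiled with -fno-omit-frame-pointer. */
#include <stdint.h>
#include <stdio.h>

void print_callchain(void)
{
    /* Frame layout: frame[0] = caller's saved RBP, frame[1] = return address. */
    uintptr_t *frame = __builtin_frame_address(0);
    for (int depth = 0; frame != NULL && depth < 64; depth++) {
        uintptr_t return_addr = frame[1];
        if (return_addr == 0)
            break;
        printf("#%d  %#lx\n", depth, (unsigned long)return_addr);
        frame = (uintptr_t *)frame[0];   /* hop to the caller's frame */
    }
}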
DWARF Unwinding
Uses DWARF call-frame information (the .eh_frame / .debug_frame sections) to unwind without frame pointers: perf copies a chunk of the sampled thread's stack with each sample and unwinds it later in user space. More reliable for frame-pointer-less binaries, but slower, and it produces much larger perf.data files.
perf record --call-graph dwarf ./program
Statistical Nature
Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.
Law of Large Numbers: More samples = closer to true distribution.
100 samples: A: 8-12% (high variance)
1000 samples: A: 9-11% (better)
10000 samples: A: 9.8-10.2% (quite accurate)
This is why short-running programs need higher sampling frequency.
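The uncertainty can be estimated with the usual binomial standard error, se = sqrt(p(1-p)/n). The small program below (p and n are just illustrative values) shows how quickly the error shrinks as the sample count grows, matching the trend above.

/* Standard error of a sampled time share: se = sqrt(p * (1 - p) / n).
   Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 0.10;                       /* true share of time in function A */
    int n[] = { 100, 1000, 10000 };
    for (int i = 0; i < 3; i++) {
        double se = sqrt(p * (1.0 - p) / n[i]);
        printf("%6d samples: %.1f%% +/- %.2f%% (one std. error)\n",
               n[i], 100.0 * p, 100.0 * se);
    }
    return 0;
}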
Limitations
1. Short Functions Miss Samples
If a function runs for less time than the sampling interval, it might not get sampled at all.
Sampling interval: ──────────────────────────────────
Function A: ██ (might miss!)
Function B: ████████████████████████████████ (definitely hit)
2. Inlined Functions Disappear
When the compiler inlines a function, it no longer exists as a separate entity:
// Source code
static inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }
// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
With debug info, perf can sometimes recover inline information.
3. Sampling Bias
Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks): cycle-based sampling only fires while the code is on-CPU, so blocked time is invisible
- Interrupt handlers
4. Observer Effect
Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to buffer takes cycles
Usually <5%, but can affect extremely performance-sensitive code.
perf Events
perf can sample on different events, not just CPU cycles:
perf record -e cycles ./program # CPU cycles (default)
perf record -e instructions ./program # Instructions retired
perf record -e cache-misses ./program # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
This lets you answer "where do cache misses happen?" not just "where is time spent?"
Summary
- Sampling interrupts periodically to see what's executing
- PMU counters trigger interrupts at configurable frequency
- Statistical accuracy improves with more samples
- Symbol resolution maps addresses to function names
- Call graphs show the path to each sample
- Low overhead (~1-5%) makes it usable in production
Further Reading
- Brendan Gregg's perf examples
- perf wiki
- Intel PMU documentation (Volume 3, Chapter 18)