perf-workshop/docs/HOW-SAMPLING-PROFILERS-WORK.md
2026-01-11 12:04:36 +05:30

212 lines
7.3 KiB
Markdown

# How Sampling Profilers Work
## The Core Idea
Sampling profilers answer the question: **"Where is my program spending time?"**
Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.
```
Program execution: ████████████████████████████████████████
↑ ↑ ↑ ↑ ↑ ↑ ↑
sample sample sample ...
```
## Sampling vs Instrumentation
| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |
Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.
## How perf Does It
### 1. Hardware Performance Counters (PMU)
Modern CPUs have Performance Monitoring Units (PMUs) with special registers:
```
┌─────────────────────────────────────────┐
│ CPU │
│ ┌─────────────────────────────────┐ │
│ │ Performance Monitoring Unit │ │
│ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Counter │ │ Counter │ ... │ │
│ │ │ cycles │ │ instrs │ │ │
│ │ └─────────┘ └─────────┘ │ │
│ │ ↓ overflow interrupt │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
When you run `perf record`:
1. perf programs a PMU counter to count CPU cycles
2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. Overflow triggers a **Non-Maskable Interrupt (NMI)**
4. Kernel handler records: instruction pointer, process ID, timestamp
5. Optionally: walks the stack to get call chain
### 2. The Sampling Frequency
```bash
perf record -F 99 ./program # 99 samples per second
perf record -F 9999 ./program # 9999 samples per second
```
Higher frequency = more samples = better accuracy, but more overhead.
**Why 99 and not 100?** Using a prime number avoids aliasing with periodic behavior in your program (like a 100Hz timer).
### 3. What Gets Recorded
Each sample contains:
- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses
```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
callchain: main → process_data → compute_inner
...
```
## Symbol Resolution
Raw samples contain memory addresses. To show function names, perf needs:
1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (`-g`): Map addresses to source lines
```
Without symbols: 45.23% 0x00000000004011a0
With symbols: 45.23% compute_inner
With debug info: 45.23% compute_inner (program.c:28)
```
This is why `perf report` needs access to the same binary you profiled.
## Call Graph Collection
With `perf record -g`, perf records the call stack for each sample.
### Frame Pointer Walking (Traditional)
```
Stack Memory:
┌──────────────┐
│ return addr │ ← where to return after current function
│ saved RBP │ ← pointer to previous frame
├──────────────┤
│ local vars │
├──────────────┤
│ return addr │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ... │
└──────────────┘
```
Walk the chain of frame pointers to reconstruct the call stack.
**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding.
### DWARF Unwinding
Uses debug info (`.eh_frame` section) to unwind without frame pointers. More reliable but slower.
```bash
perf record --call-graph dwarf ./program
```
## Statistical Nature
Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.
**Law of Large Numbers**: More samples = closer to true distribution.
```
100 samples: A: 8-12% (high variance)
1000 samples: A: 9-11% (better)
10000 samples: A: 9.8-10.2% (quite accurate)
```
This is why short-running programs need higher sampling frequency.
## Limitations
### 1. Short Functions Miss Samples
If a function runs for less time than the sampling interval, it might not get sampled at all.
```
Sampling interval: ──────────────────────────────────
Function A: ██ (might miss!)
Function B: ████████████████████████████████ (definitely hit)
```
### 2. Inlined Functions Disappear
When the compiler inlines a function, it no longer exists as a separate entity:
```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }
// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
```
With debug info, perf can sometimes recover inline information.
### 3. Sampling Bias
Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks)
- Interrupt handlers
### 4. Observer Effect
Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to buffer takes cycles
Usually <5%, but can affect extremely performance-sensitive code.
## perf Events
perf can sample on different events, not just CPU cycles:
```bash
perf record -e cycles ./program # CPU cycles (default)
perf record -e instructions ./program # Instructions retired
perf record -e cache-misses ./program # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
```
This lets you answer "where do cache misses happen?" not just "where is time spent?"
## Summary
1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production
## Further Reading
- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)