# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: **"Where is my program spending time?"**

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution: ████████████████████████████████████████
                   ↑     ↑     ↑     ↑     ↑     ↑     ↑
                 sample sample sample  ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                   CPU                   │
│   ┌─────────────────────────────────┐   │
│   │  Performance Monitoring Unit    │   │
│   │  ┌─────────┐  ┌─────────┐       │   │
│   │  │ Counter │  │ Counter │  ...  │   │
│   │  │ cycles  │  │ instrs  │       │   │
│   │  └─────────┘  └─────────┘       │   │
│   │        ↓ overflow interrupt     │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

When you run `perf record`:

1. perf programs a PMU counter to count CPU cycles
2. The counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. The overflow triggers a **Non-Maskable Interrupt (NMI)**
4. The kernel handler records the instruction pointer, process ID, and timestamp
5. Optionally, it walks the stack to capture the call chain

### 2. The Sampling Frequency

```bash
perf record -F 99 ./program    # 99 samples per second
perf record -F 9999 ./program  # 9999 samples per second
```

Higher frequency means more samples and better accuracy, but also more overhead.

**Why 99 and not 100?** A slightly "off" frequency avoids lockstep sampling. If your program has periodic behavior (like a 100 Hz timer) and you sample at exactly 100 Hz, every sample lands at the same phase of the period and the profile is biased. At 99 Hz the sample points drift across all phases.

### 3. What Gets Recorded

Each sample contains:

- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (compile with `-g`): Map addresses to source lines

```
Without symbols:  45.23%  0x00000000004011a0
With symbols:     45.23%  compute_inner
With debug info:  45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after the current function
│ saved RBP    │ ← pointer to the previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│     ...      │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack.

**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`, enabled by default at higher optimization levels) to free up a register. Solution: compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

perf's DWARF mode copies a chunk of the user stack with each sample and unwinds it offline using the DWARF unwind tables (the `.eh_frame` / `.debug_frame` sections), so it works without frame pointers. More reliable, but slower, and it produces much larger data files.

```bash
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

**Law of Large Numbers**: More samples = closer to the true distribution.

```
100 samples:    A: 8-12%     (high variance)
1000 samples:   A: 9-11%     (better)
10000 samples:  A: 9.8-10.2% (quite accurate)
```

This is why short-running programs need a higher sampling frequency: they don't live long enough to accumulate many samples at the default rate.

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                                (might miss!)
Function B:        ████████████████████████████████  (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining: square() disappears from the profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:

- Very short functions
- Functions that mostly wait (I/O, locks): a cycles-based profile shows where the CPU burns time, not where the program blocks
- Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:

- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to the buffer takes cycles

Usually under 5%, but this can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```bash
perf record -e cycles ./program        # CPU cycles (default)
perf record -e instructions ./program  # Instructions retired
perf record -e cache-misses ./program  # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
```

This lets you answer "where do cache misses happen?", not just "where is time spent?"

## Summary

1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at a configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production

## Further Reading

- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)