# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: **"Where is my program spending time?"** Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution:
████████████████████████████████████████
↑     ↑     ↑     ↑     ↑     ↑     ↑
sample      sample      sample      ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                  CPU                    │
│   ┌─────────────────────────────────┐   │
│   │  Performance Monitoring Unit    │   │
│   │  ┌─────────┐   ┌─────────┐      │   │
│   │  │ Counter │   │ Counter │ ...  │   │
│   │  │ cycles  │   │ instrs  │      │   │
│   │  └─────────┘   └─────────┘      │   │
│   │       ↓ overflow interrupt      │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

When you run `perf record`:

1. perf programs a PMU counter to count CPU cycles
2. The counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. The overflow triggers a **Non-Maskable Interrupt (NMI)**
4. The kernel handler records the instruction pointer, process ID, and timestamp
5. Optionally, it walks the stack to get the call chain

### 2. The Sampling Frequency

```bash
perf record -F 99 ./program     # 99 samples per second
perf record -F 9999 ./program   # 9999 samples per second
```

Higher frequency means more samples and better accuracy, but also more overhead.
**Why 99 and not 100?** Using a prime number avoids aliasing with periodic behavior in your program (like a 100 Hz timer): if samples always fired at the same phase of a periodic task, they would give a skewed picture.

### 3. What Gets Recorded

Each sample contains:

- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (from compiling with `-g`): Map addresses to source lines

```
Without symbols:   45.23%  0x00000000004011a0
With symbols:      45.23%  compute_inner
With debug info:   45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after current function
│ saved RBP    │ ← pointer to previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│     ...      │
└──────────────┘
```

The unwinder walks the chain of saved frame pointers to reconstruct the call stack.

**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`, on by default at `-O2` with gcc) for performance, which breaks this walk. Solution: compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

Uses unwind tables (the `.eh_frame` section, which is present even without debug info) to unwind without frame pointers. More reliable, but slower: perf copies a chunk of the user stack with each sample and unwinds it later in user space.

```bash
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.
**Law of Large Numbers**: More samples means a closer match to the true distribution.

```
100 samples:    A: 8-12%     (high variance)
1000 samples:   A: 9-11%     (better)
10000 samples:  A: 9.8-10.2% (quite accurate)
```

This is why short-running programs need a higher sampling frequency.

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                               (might miss!)
Function B:        ████████████████████████████████ (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining - square() disappears from the profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:

- Very short functions
- Functions that mostly wait (I/O, locks), since cycle-based sampling only sees on-CPU time
- Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:

- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to the buffer takes cycles

Usually under 5%, but this can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```bash
perf record -e cycles ./program          # CPU cycles (default)
perf record -e instructions ./program    # Instructions retired
perf record -e cache-misses ./program    # Cache misses
perf record -e branch-misses ./program   # Branch mispredictions
```

This lets you answer "where do cache misses happen?", not just "where is time spent?"

## Summary

1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at a configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production

## Further Reading

- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)