# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: **"Where is my program spending time?"**

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution: ████████████████████████████████████████
                   ↑     ↑     ↑     ↑     ↑     ↑     ↑
                 sample sample sample  ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                   CPU                   │
│   ┌─────────────────────────────────┐   │
│   │  Performance Monitoring Unit    │   │
│   │  ┌─────────┐  ┌─────────┐       │   │
│   │  │ Counter │  │ Counter │  ...  │   │
│   │  │ cycles  │  │ instrs  │       │   │
│   │  └─────────┘  └─────────┘       │   │
│   │        ↓ overflow interrupt     │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

When you run `perf record`:

1. perf programs a PMU counter to count CPU cycles
2. The counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. The overflow triggers a **Non-Maskable Interrupt (NMI)**
4. The kernel handler records the instruction pointer, process ID, and timestamp
5. Optionally, it walks the stack to capture the call chain

### 2. The Sampling Frequency

```bash
perf record -F 99 ./program    # 99 samples per second
perf record -F 9999 ./program  # 9999 samples per second
```

Higher frequency means more samples and better accuracy, but also more overhead.

**Why 99 and not 100?** A slightly "off" frequency avoids lockstep sampling. If your program has periodic behavior (like a 100 Hz timer) and you sample at exactly 100 Hz, every sample lands at the same phase of the period and the profile is biased. At 99 Hz the sample points drift across all phases.

### 3. What Gets Recorded

Each sample contains:

- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (compile with `-g`): Map addresses to source lines

```
Without symbols:  45.23%  0x00000000004011a0
With symbols:     45.23%  compute_inner
With debug info:  45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after the current function
│ saved RBP    │ ← pointer to the previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│     ...      │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack.

**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`, enabled by default at higher optimization levels) to free up a register. Solution: compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

perf's DWARF mode copies a chunk of the user stack with each sample and unwinds it offline using the DWARF unwind tables (the `.eh_frame` / `.debug_frame` sections), so it works without frame pointers. More reliable, but slower, and it produces much larger data files.

```bash
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

**Law of Large Numbers**: More samples = closer to the true distribution.

```
100 samples:    A: 8-12%     (high variance)
1000 samples:   A: 9-11%     (better)
10000 samples:  A: 9.8-10.2% (quite accurate)
```

This is why short-running programs need a higher sampling frequency: they don't live long enough to accumulate many samples at the default rate.

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                                (might miss!)
Function B:        ████████████████████████████████  (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining: square() disappears from the profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:

- Very short functions
- Functions that mostly wait (I/O, locks): a cycles-based profile shows where the CPU burns time, not where the program blocks
- Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:

- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to the buffer takes cycles

Usually under 5%, but this can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```bash
perf record -e cycles ./program        # CPU cycles (default)
perf record -e instructions ./program  # Instructions retired
perf record -e cache-misses ./program  # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
```

This lets you answer "where do cache misses happen?", not just "where is time spent?"

## Summary

1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at a configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production

## Further Reading

- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)