
# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: "Where is my program spending time?"

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution:  ████████████████████████████████████████
                         ↑    ↑    ↑    ↑    ↑    ↑    ↑
                       sample sample sample ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| Instrumentation | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| Sampling | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like gprof with `-pg`) modifies your program. Sampling just observes it.

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                  CPU                    │
│  ┌───────────────────────────────────┐  │
│  │    Performance Monitoring Unit    │  │
│  │  ┌─────────┐  ┌─────────┐         │  │
│  │  │ Counter │  │ Counter │   ...   │  │
│  │  │ cycles  │  │ instrs  │         │  │
│  │  └─────────┘  └─────────┘         │  │
│  │        ↓ overflow interrupt       │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

When you run `perf record`:

  1. perf programs a PMU counter to count CPU cycles
  2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
  3. Overflow triggers a Non-Maskable Interrupt (NMI)
  4. Kernel handler records: instruction pointer, process ID, timestamp
  5. Optionally: walks the stack to get call chain

### 2. The Sampling Frequency

```
perf record -F 99 ./program    # 99 samples per second
perf record -F 9999 ./program  # 9999 samples per second
```

Higher frequency = more samples = better accuracy, but more overhead.

Why 99 and not 100? Using a prime number avoids aliasing with periodic behavior in your program (like a 100Hz timer).

### 3. What Gets Recorded

Each sample contains:

  • IP (Instruction Pointer): Which instruction was executing
  • PID/TID: Which process/thread
  • Timestamp: When it happened
  • CPU: Which core
  • Call chain (with -g): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

  1. Symbol tables: Map address ranges to function names
  2. Debug info (-g): Map addresses to source lines

```
Without symbols:     45.23%  0x00000000004011a0
With symbols:        45.23%  compute_inner
With debug info:     45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after current function
│ saved RBP    │ ← pointer to previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ...          │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack.

Problem: Modern compilers omit frame pointers (`-fomit-frame-pointer`, on by default at most optimization levels) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

Uses DWARF unwind information (the `.eh_frame` section) instead of frame pointers: perf copies a snapshot of the user stack with each sample and unwinds it afterwards in userspace. More reliable, but slower and with much larger perf.data files.

```
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

Law of Large Numbers: More samples = closer to true distribution.

```
100 samples:    A: 8-12%      (high variance)
1000 samples:   A: 9-11%      (better)
10000 samples:  A: 9.8-10.2%  (quite accurate)
```

This is why short-running programs need higher sampling frequency.

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                                 (might miss!)
Function B:        ████████████████████████████████   (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:

  • Very short functions
  • Functions that mostly wait (I/O, locks): a blocked thread consumes no CPU cycles, so cycle-based sampling barely sees it
  • Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:

  • NMI handling takes cycles
  • Stack unwinding takes cycles
  • Writing samples to buffer takes cycles

Usually <5%, but can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```
perf record -e cycles ./program           # CPU cycles (default)
perf record -e instructions ./program     # Instructions retired
perf record -e cache-misses ./program     # Cache misses
perf record -e branch-misses ./program    # Branch mispredictions
```

This lets you answer "where do cache misses happen?" not just "where is time spent?"

## Summary

  1. Sampling interrupts periodically to see what's executing
  2. PMU counters trigger interrupts at configurable frequency
  3. Statistical accuracy improves with more samples
  4. Symbol resolution maps addresses to function names
  5. Call graphs show the path to each sample
  6. Low overhead (~1-5%) makes it usable in production

## Further Reading