
# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: "Where is my program spending time?"

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution:  ████████████████████████████████████████
                         ↑    ↑    ↑    ↑    ↑    ↑    ↑
                       sample sample sample ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| Instrumentation | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| Sampling | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like gprof with `-pg`) modifies your program. Sampling just observes it.

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                  CPU                    │
│  ┌───────────────────────────────────┐  │
│  │    Performance Monitoring Unit    │  │
│  │  ┌─────────┐  ┌─────────┐         │  │
│  │  │ Counter │  │ Counter │   ...   │  │
│  │  │ cycles  │  │ instrs  │         │  │
│  │  └─────────┘  └─────────┘         │  │
│  │        ↓ overflow interrupt       │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

When you run `perf record`:

  1. perf programs a PMU counter to count CPU cycles
  2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
  3. Overflow triggers a Non-Maskable Interrupt (NMI)
  4. Kernel handler records: instruction pointer, process ID, timestamp
  5. Optionally: walks the stack to get call chain

### 2. The Sampling Frequency

```
perf record -F 99 ./program    # 99 samples per second
perf record -F 9999 ./program  # 9999 samples per second
```

Higher frequency = more samples = better accuracy, but more overhead.

Why 99 and not 100? Using a prime number avoids aliasing with periodic behavior in your program (like a 100Hz timer).

### 3. What Gets Recorded

Each sample contains:

  • IP (Instruction Pointer): Which instruction was executing
  • PID/TID: Which process/thread
  • Timestamp: When it happened
  • CPU: Which core
  • Call chain (with -g): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

  1. Symbol tables: Map address ranges to function names
  2. Debug info (-g): Map addresses to source lines

```
Without symbols:     45.23%  0x00000000004011a0
With symbols:        45.23%  compute_inner
With debug info:     45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after current function
│ saved RBP    │ ← pointer to previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ...          │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack.

Problem: Modern compilers omit frame pointers (`-fomit-frame-pointer`, on by default at most optimization levels) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

Uses DWARF unwind information (the `.eh_frame` section) instead of frame pointers: perf copies a snapshot of the user stack with each sample and unwinds it afterwards in userspace. More reliable, but slower and with much larger perf.data files.

```
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

Law of Large Numbers: More samples = closer to true distribution.

```
100 samples:    A: 8-12%      (high variance)
1000 samples:   A: 9-11%      (better)
10000 samples:  A: 9.8-10.2%  (quite accurate)
```

This is why short-running programs need higher sampling frequency.

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                                 (might miss!)
Function B:        ████████████████████████████████   (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:

  • Very short functions
  • Functions that mostly wait (I/O, locks): a blocked thread consumes no CPU cycles, so cycle-based sampling barely sees it
  • Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:

  • NMI handling takes cycles
  • Stack unwinding takes cycles
  • Writing samples to buffer takes cycles

Usually <5%, but can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```
perf record -e cycles ./program           # CPU cycles (default)
perf record -e instructions ./program     # Instructions retired
perf record -e cache-misses ./program     # Cache misses
perf record -e branch-misses ./program    # Branch mispredictions
```

This lets you answer "where do cache misses happen?" not just "where is time spent?"

## Summary

  1. Sampling interrupts periodically to see what's executing
  2. PMU counters trigger interrupts at configurable frequency
  3. Statistical accuracy improves with more samples
  4. Symbol resolution maps addresses to function names
  5. Call graphs show the path to each sample
  6. Low overhead (~1-5%) makes it usable in production

## Further Reading