add docs

parent c8f56cf3f1
commit 1432bdaff9

docs/CPU-CACHES-AND-MEMORY.md (new file, 327 lines)
@@ -0,0 +1,327 @@
# CPU Caches and Memory: Why Access Patterns Matter

## The Memory Wall

CPUs are fast. Memory is slow. This gap is called the "memory wall."

```
Relative Speed
══════════════
CPU registers ████████████████████████████████ (~1 cycle)
L1 cache      ██████████████████████           (~4 cycles)
L2 cache      ████████████                     (~12 cycles)
L3 cache      ██████                           (~40 cycles)
Main RAM      █                                (~200 cycles)
SSD                                            (~10,000 cycles)
HDD                                            (~10,000,000 cycles)
```

A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.
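To make that concrete (assuming a ~4 GHz clock, a figure not stated above): 200 cycles is roughly 50 ns, so a loop that misses to RAM on every dependent load tops out around 20 million iterations per second, no matter how little work each iteration does.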
## The Cache Hierarchy

```
┌─────────────────────────────────────────────────────────────┐
│                          CPU Core                           │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Registers (bytes, <1ns)                │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │   L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)       │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │           L2 Cache: 256-512 KB, ~3-4ns              │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┴─────────────────┐
        │   L3 Cache: 8-64 MB, ~10-12ns     │ (shared between cores)
        └─────────────────┬─────────────────┘
                          │
        ┌─────────────────┴─────────────────┐
        │     Main RAM: GBs, ~50-100ns      │
        └───────────────────────────────────┘
```

### Typical Numbers (Desktop CPU, 2024)

| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |

## Cache Lines: The Unit of Transfer

Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).

```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
        └──────────────── 64 bytes = 1 cache line ────────────────────┘

If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```

This is why **sequential access** is so much faster than random access.
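A minimal sketch of how to observe this yourself (the array size and 64-byte line are assumptions about typical hardware, not from the text above): the strided loop below does 1/16th of the reads but touches the same number of cache lines, so it often takes nearly as long as the full pass.

```c
// Sketch: sequential pass vs. one read per 64-byte cache line.
// Illustrative only; real benchmarks need warm-up and repeated runs.
#include <stdio.h>
#include <time.h>

#define N (64 * 1024 * 1024)         /* 256 MB of ints, far beyond L3 */
static int a[N];

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    long sum = 0;
    double t0 = now_sec();
    for (long i = 0; i < N; i++)      /* sequential: ~1 miss per 16 ints */
        sum += a[i];
    double t1 = now_sec();
    for (long i = 0; i < N; i += 16)  /* stride 64 B: a new line per read */
        sum += a[i];
    double t2 = now_sec();
    printf("sequential %.3fs, strided %.3fs (sum=%ld)\n",
           t1 - t0, t2 - t1, sum);
    return 0;
}
```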
## Spatial and Temporal Locality

Caches exploit two patterns in real programs:

### Spatial Locality

"If you accessed address X, you'll probably access X+1 soon."

```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i]; // Next element is in same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()]; // Each access misses cache
}
```

### Temporal Locality

"If you accessed address X, you'll probably access X again soon."

```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x; // x stays in registers
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i]; // Array evicted from cache, refetch
}
```
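A straightforward fix for the second pattern (a sketch, not from the original) is to fuse the two passes so each element is consumed while it is still cached:

```c
// Fused loop: each element is loaded from memory once, used twice
for (int i = 0; i < N; i++) {
    sum     += array[i];
    product *= array[i];
}
```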
## Why Random Access Kills Performance

### Example: Array vs Linked List

```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │   All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked List (scattered memory):
┌───┐         ┌───┐         ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘         └───┘         └───┘
  ↑             ↑             ↑
0x1000        0x5420        0x2108
Each node in different cache line!
```

Traversing a scattered linked list causes a **cache miss per node**.

### Real Numbers

```
Array traversal:  ~0.004 seconds (10M elements)
Sequential list:  ~0.018 seconds (4.5x slower)
Scattered list:   ~1.400 seconds (350x slower!)
```

The scattered list is O(n) just like the array, but the constant factor is 350x worse.
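Part of what makes this so bad is that each load depends on the previous one, so the CPU cannot overlap the misses. A hedged sketch of the traversal being timed (field names are illustrative):

```c
struct node { struct node *next; long value; };

long traverse(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;  /* the next load can't start until n->next arrives */
    return sum;
}
```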
## Pipeline Stalls: Why the CPU Can't Hide Latency

Modern CPUs execute many instructions simultaneously:

```
Pipeline (simplified):
Cycle:    1    2    3    4    5    6    7    8
Fetch:   [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:       [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:           [A]  [B]  [C]  [D]  [E]  [F]
Memory:                 [A]  [B]  [C]  [D]  [E]
Write:                       [A]  [B]  [C]  [D]
```

But what happens when instruction C needs data from memory?

```
Cycle:    1    2    3    4    5   ...  200  201  202
Fetch:   [A]  [B]  [C]  [C]  [C]  ...  [C]  [D]  [E]
Decode:       [A]  [B]  [C]  [C]  ...  [C]  [C]  [D]
Execute:           [A]  [B]  waiting for memory...
Memory:                 [A]  [B]  ...  ...  [C]
Write:                       [A]  [B]  ...  [C]
                                   ↑
                         STALL! Pipeline bubbles
```

The CPU stalls for **~200 cycles** waiting for RAM. Those 200 cycles could have executed 200+ instructions.

### Out-of-Order Execution Helps (But Not Enough)

CPUs can execute later instructions while waiting:

```c
a = array[i];  // Cache miss, stall...
b = x + y;     // Can execute while waiting!
c = b * 2;     // Can execute while waiting!
d = a + 1;     // Must wait for 'a'
```

But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
## The Prefetcher: CPU Tries to Help

Modern CPUs detect sequential access patterns and fetch data **before you ask**:

```
Your code accesses:  [0] [1] [2] [3] [4] ...
Prefetcher fetches:              [5] [6] [7] [8] ... (ahead of you!)
```

But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect

```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i]; // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]]; // Random indices, can't predict
}
```

## Row-Major vs Column-Major

C stores 2D arrays in row-major order:

```
int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘
```

### Row-Major Access (Cache-Friendly)

```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j]; // Sequential in memory!
    }
}
```

Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.

### Column-Major Access (Cache-Hostile)

```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j]; // Jumps by COLS each time!
    }
}
```

Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.

If COLS=8192, each access jumps 32 KB - far beyond any cache line!

**Result**: Column-major can be **10-50x slower** for large matrices.
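When the computation genuinely needs per-column results, a common remedy (a sketch with assumed names like `col_sum`, not from the original) is loop tiling: sweep the columns in blocks narrow enough that the accumulators stay cached while you stream through the rows.

```c
#define TILE 64  /* illustrative tile width */

/* Column sums, computed in row-major order one tile of columns at a time */
for (int jj = 0; jj < COLS; jj += TILE) {
    for (int i = 0; i < ROWS; i++) {
        for (int j = jj; j < jj + TILE && j < COLS; j++) {
            col_sum[j] += matrix[i][j]; /* row segment read sequentially */
        }
    }
}
```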
## False Sharing: The Multithreaded Trap

Cache coherency means cores must agree on cache line contents.

```
Thread 1 (Core 0): counter1++  ┐
Thread 2 (Core 1): counter2++  ├── Both in same cache line!
                               ┘
┌────────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding.................]   │
└────────────────────────────────────────────────────────────────┘
```

When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.

**Fix**: Pad data to separate cache lines:

```c
struct {
    long counter;
    char padding[64 - sizeof(long)]; // Pad to 64 bytes
} counters[NUM_THREADS];
```
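C11 can express the same intent more directly with `alignas` (a sketch; `NUM_THREADS` is assumed to be defined as above):

```c
#include <stdalign.h>

struct padded_counter {
    alignas(64) long counter; /* struct is 64-byte aligned, so sizeof rounds up to 64 */
};

struct padded_counter counters[NUM_THREADS]; /* one cache line per element */
```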
## NUMA: When Memory Has Geography

On multi-socket systems, memory is "closer" to some CPUs:

```
┌─────────────────────────────────────────────────────────────┐
│  Socket 0                          Socket 1                 │
│  ┌─────────────┐                  ┌─────────────┐           │
│  │  Core 0-7   │                  │  Core 8-15  │           │
│  └──────┬──────┘                  └──────┬──────┘           │
│         │                                │                  │
│  ┌──────┴──────┐   interconnect   ┌──────┴──────┐           │
│  │   RAM 0     │ ←────────────→   │   RAM 1     │           │
│  │  (local)    │      (slow)      │  (remote)   │           │
│  └─────────────┘                  └─────────────┘           │
└─────────────────────────────────────────────────────────────┘
```

Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
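On Linux, libnuma lets a thread keep its working set on the local node (a hedged sketch, not from the original; link with `-lnuma`, and the buffer size is illustrative):

```c
#define _GNU_SOURCE
#include <numa.h>   /* libnuma: numa_available, numa_alloc_onnode, numa_free */
#include <sched.h>  /* sched_getcpu */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    int node = numa_node_of_cpu(sched_getcpu()); /* node we're running on */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, node);   /* allocate on local node */
    /* ... have threads pinned to this node work on buf ... */
    numa_free(buf, size);
    return 0;
}
```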
## Measuring Cache Behavior

```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```

## Summary

| Pattern | Cache Behavior | Performance |
|---------|---------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |

**Key Takeaways**:

1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing

## Further Reading

- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)
docs/HOW-SAMPLING-PROFILERS-WORK.md (new file, 211 lines)
@@ -0,0 +1,211 @@
# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: **"Where is my program spending time?"**

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution: ████████████████████████████████████████
                     ↑     ↑     ↑     ↑     ↑     ↑     ↑
                   sample sample sample ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.
## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                  CPU                    │
│   ┌─────────────────────────────────┐   │
│   │  Performance Monitoring Unit    │   │
│   │  ┌─────────┐  ┌─────────┐       │   │
│   │  │ Counter │  │ Counter │  ...  │   │
│   │  │ cycles  │  │ instrs  │       │   │
│   │  └─────────┘  └─────────┘       │   │
│   │       ↓ overflow interrupt      │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

When you run `perf record` (a user-space sketch of this setup follows the list):

1. perf programs a PMU counter to count CPU cycles
2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. Overflow triggers a **Non-Maskable Interrupt (NMI)**
4. Kernel handler records: instruction pointer, process ID, timestamp
5. Optionally: walks the stack to get call chain
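The same machinery is exposed to user space through the `perf_event_open(2)` syscall. A hedged sketch of opening a sampling cycles counter (error handling and the ring-buffer reader omitted; the 999 Hz rate is illustrative):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type        = PERF_TYPE_HARDWARE;
    attr.size        = sizeof(attr);
    attr.config      = PERF_COUNT_HW_CPU_CYCLES;  /* sample on cycles */
    attr.freq        = 1;
    attr.sample_freq = 999;                       /* ~999 samples/sec */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr.disabled    = 1;

    /* pid=0: this process, cpu=-1: any CPU, group_fd=-1, flags=0 */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run workload; samples land in an mmap'd ring buffer ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}
```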
### 2. The Sampling Frequency

```bash
perf record -F 99 ./program    # 99 samples per second
perf record -F 9999 ./program  # 9999 samples per second
```

Higher frequency = more samples = better accuracy, but more overhead.

**Why 99 and not 100?** Using a prime number avoids aliasing with periodic behavior in your program (like a 100Hz timer).

### 3. What Gets Recorded

Each sample contains:
- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (`-g`): Map addresses to source lines

```
Without symbols:  45.23%  0x00000000004011a0
With symbols:     45.23%  compute_inner
With debug info:  45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after current function
│ saved RBP    │ ← pointer to previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ...          │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack.
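In code, the walk is just pointer chasing. A hedged x86-64 sketch (it assumes the binary was built with `-fno-omit-frame-pointer`, and the 16-frame cap is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

/* Print return addresses by following saved frame pointers.
   Assumed frame layout: saved RBP at fp[0], return address at fp[1]. */
void backtrace_fp(void) {
    uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
    for (int depth = 0; fp != NULL && depth < 16; depth++) {
        uintptr_t ret = fp[1];       /* return addr sits above saved RBP */
        if (ret == 0) break;
        printf("#%d  %#lx\n", depth, (unsigned long)ret);
        fp = (uintptr_t *)fp[0];     /* hop to the caller's frame */
    }
}
```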
**Problem**: Modern compilers often omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

Uses DWARF unwind tables (the `.eh_frame` section) to unwind without frame pointers. More reliable but slower.

```bash
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

**Law of Large Numbers**: More samples = closer to true distribution.

```
100 samples:    A: 8-12%     (high variance)
1000 samples:   A: 9-11%     (better)
10000 samples:  A: 9.8-10.2% (quite accurate)
```

This is why short-running programs need higher sampling frequency.
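A rough rule of thumb (assuming independent samples; not from the original): the standard error of a fraction `p` estimated from `n` samples is `sqrt(p * (1 - p) / n)`. For p = 10% and n = 1,000, that is sqrt(0.1 × 0.9 / 1000) ≈ 0.95%, which matches the ranges above.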
## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██ (might miss!)
Function B:        ████████████████████████████████ (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks)
- Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to buffer takes cycles

Usually <5%, but can affect extremely performance-sensitive code.
## perf Events

perf can sample on different events, not just CPU cycles:

```bash
perf record -e cycles ./program        # CPU cycles (default)
perf record -e instructions ./program  # Instructions retired
perf record -e cache-misses ./program  # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
```

This lets you answer "where do cache misses happen?" not just "where is time spent?"

## Summary

1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production

## Further Reading

- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)
docs/HOW-TRACING-WORKS.md (new file, 265 lines)
@@ -0,0 +1,265 @@
# How Tracing Works: strace, bpftrace, and eBPF

## What is Tracing?

While sampling answers "where is time spent?", tracing answers "what happened?"

Tracing captures **every occurrence** of specific events:
- Every syscall
- Every function call
- Every network packet
- Every disk I/O

## strace: The Simple Way

### How ptrace Works

strace uses the `ptrace()` syscall - the same mechanism debuggers use.

```
┌─────────────────┐          ┌─────────────────┐
│  Your Program   │          │     strace      │
│                 │          │                 │
│  1. syscall     │─────────→│  2. STOP!       │
│     ║           │ SIGTRAP  │     inspect     │
│     ║ (paused)  │          │     log         │
│     ║           │←─────────│  3. continue    │
│  4. resume      │  PTRACE  │                 │
│                 │   CONT   │                 │
└─────────────────┘          └─────────────────┘
```

Step by step (a minimal tracer sketch follows the list):

1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME`
2. **Trap on syscall**: strace resumes the tracee with `PTRACE_SYSCALL` - the kernel stops it at each syscall
3. **Inspect**: strace reads registers to see syscall number and arguments
4. **Continue**: strace calls `ptrace(PTRACE_SYSCALL)` again to resume
5. **Repeat**: Kernel stops again when syscall returns, strace reads return value
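Here is a hedged sketch of that loop in C (x86-64 Linux; error handling omitted, and real strace decodes arguments on top of this):

```c
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);    /* let the parent trace us */
        execvp(argv[1], &argv[1]);
        return 127;
    }

    int status;
    waitpid(pid, &status, 0);                     /* child stopped at exec */
    int entering = 1;
    for (;;) {
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);  /* run to next syscall stop */
        waitpid(pid, &status, 0);
        if (WIFEXITED(status)) break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        if (entering)
            printf("syscall(%lld)", (long long)regs.orig_rax);
        else
            printf(" = %lld\n", (long long)regs.rax);
        entering = !entering;
    }
    return 0;
}
```

Run it as `./mini-strace ls`: each line shows a syscall number and its return value. Note the two stops per syscall (entry and exit), which is exactly where the overhead described next comes from.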
### Why strace is Slow

Each syscall causes **two context switches** to strace:
- One on syscall entry
- One on syscall exit

```
Normal syscall:
User → Kernel → User

With strace:
User → Kernel → strace → Kernel → strace → Kernel → User
         ↑                           ↑
     entry stop                  exit stop
```

Overhead can be **10-100x** for syscall-heavy programs!

```bash
# Normal
time ./read_fast testfile            # 0.01s

# With strace
time strace -c ./read_fast testfile  # 0.5s (50x slower!)
```

### When to Use strace

Despite overhead, strace is invaluable for:
- Debugging "why won't this program start?"
- Finding which files a program opens
- Understanding program behavior
- One-off investigation (not production)

```bash
strace -e openat ./program   # What files does it open?
strace -e connect ./program  # What network connections?
strace -c ./program          # Syscall summary
```
## eBPF: The Fast Way

### What is eBPF?

eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**.

```
┌─────────────────────────────────────────────────────┐
│                       Kernel                        │
│   ┌──────────────────────────────────────────────┐  │
│   │            Your eBPF Program                 │  │
│   │  - Runs at kernel speed                      │  │
│   │  - No context switches                       │  │
│   │  - Verified for safety                       │  │
│   └──────────────────────────────────────────────┘  │
│                    ↓ attach to                      │
│   ┌──────────────────────────────────────────────┐  │
│   │   Tracepoints, Kprobes, Uprobes, USDT        │  │
│   └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                     ↓ results via
┌─────────────────────────────────────────────────────┐
│                    User Space                       │
│         Maps, ring buffers, perf events             │
└─────────────────────────────────────────────────────┘
```

### The eBPF Verifier

Before your eBPF program runs, the kernel **verifies** it:
- No infinite loops
- No out-of-bounds memory access
- No unsafe operations
- Bounded execution time

This makes eBPF safe to run on production systems (loading programs still requires root or equivalent capabilities).
### Attachment Points

eBPF can attach to various kernel hooks:

| Type | What it traces | Example |
|------|----------------|---------|
| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` |
| **Kprobes** | Any kernel function | `kprobe:do_sys_open` |
| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` |
| **USDT** | User-defined static probes | `usdt:./server:myapp:request` |

### Why eBPF is Fast

```
strace (ptrace):
Process stops → context switch → strace reads → context switch → resume

eBPF:
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
```

eBPF overhead is typically **<1%** even for frequent events.

## bpftrace: eBPF Made Easy

bpftrace is a high-level language for eBPF, like awk for tracing.

### Basic Syntax

```
probe /filter/ { action }
```

### Examples

```bash
# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s opened %s\n", comm, str(args->filename));
}'

# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @bytes = hist(args->ret);
}'

# Latency of disk I/O
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
             kprobe:blk_account_io_done /@start[arg0]/ {
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}'
```
### bpftrace Built-in Variables

| Variable | Meaning |
|----------|---------|
| `pid` | Process ID |
| `tid` | Thread ID |
| `comm` | Process name |
| `nsecs` | Nanosecond timestamp |
| `arg0-argN` | Function arguments |
| `retval` | Return value |
| `probe` | Current probe name |

### bpftrace Aggregations

```bash
@x = count()                  # Count events
@x = sum(value)               # Sum values
@x = avg(value)               # Average
@x = min(value)               # Minimum
@x = max(value)               # Maximum
@x = hist(value)              # Power-of-2 histogram
@x = lhist(v, min, max, step) # Linear histogram
```

## Comparison: When to Use What

| Tool | Overhead | Setup | Use Case |
|------|----------|-------|----------|
| **strace** | High (10-100x) | Zero | Quick debugging, non-production |
| **ltrace** | High | Zero | Library call tracing |
| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis |
| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events |

### Decision Tree

```
Need to trace events?
├── Quick one-off debugging?
│   └── strace (easy, but slow)
├── Production system?
│   └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│   └── USDT + bpftrace
└── CPU profiling?
    └── perf record
```
## USDT: User Statically Defined Tracing

USDT probes are markers you add to your code:

```c
#include <sys/sdt.h>

void handle_request(int id) {
    DTRACE_PROBE1(myserver, request_start, id);
    // ... handle request ...
    DTRACE_PROBE1(myserver, request_end, id);
}
```

Then trace with bpftrace:

```bash
bpftrace -e 'usdt:./server:myserver:request_start {
    @start[arg0] = nsecs;
}
usdt:./server:myserver:request_end /@start[arg0]/ {
    @latency = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}'
```

**Advantages of USDT**:
- Zero overhead when not tracing
- Stable interface (unlike kprobes)
- Access to application-level data

## Summary

| Mechanism | How it works | Speed | Safety |
|-----------|--------------|-------|--------|
| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe |
| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified |
| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |

## Further Reading

- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md)
- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md)
- [eBPF documentation](https://ebpf.io/what-is-ebpf/)
- [strace source code](https://github.com/strace/strace) - surprisingly readable!