add docs

parent c8f56cf3f1
commit 1432bdaff9

docs/CPU-CACHES-AND-MEMORY.md (new file, 327 lines)
@@ -0,0 +1,327 @@

# CPU Caches and Memory: Why Access Patterns Matter

## The Memory Wall

CPUs are fast. Memory is slow. This gap is called the "memory wall."

```
Relative Speed
══════════════
CPU registers  ████████████████████████████████  (~1 cycle)
L1 cache       ██████████████████████            (~4 cycles)
L2 cache       ████████████                      (~12 cycles)
L3 cache       ██████                            (~40 cycles)
Main RAM       █                                 (~200 cycles)
SSD                                              (~10,000 cycles)
HDD                                              (~10,000,000 cycles)
```

A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.

## The Cache Hierarchy

```
┌───────────────────────────────────────────────────────┐
│                       CPU Core                        │
│   ┌───────────────────────────────────────────────┐   │
│   │ Registers (bytes, <1ns)                       │   │
│   └───────────────────────────────────────────────┘   │
│   ┌───────────────────────────────────────────────┐   │
│   │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)   │   │
│   └───────────────────────────────────────────────┘   │
│   ┌───────────────────────────────────────────────┐   │
│   │ L2 Cache: 256-512 KB, ~3-4ns                  │   │
│   └───────────────────────────────────────────────┘   │
└───────────────────────────┬───────────────────────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │ L3 Cache: 8-64 MB, ~10-12ns       │  (shared between cores)
          └─────────────────┬─────────────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │ Main RAM: GBs, ~50-100ns          │
          └───────────────────────────────────┘
```

### Typical Numbers (Desktop CPU, 2024)

| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |

## Cache Lines: The Unit of Transfer

Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).

```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
         └──────────────── 64 bytes = 1 cache line ─────────────────────┘

If you access address 0x1020:
- CPU fetches the entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```

This is why **sequential access** is so much faster than random access.
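
A quick way to see the effect for yourself (a sketch in the spirit of the "Gallery of Processor Cache Effects" linked below; timings depend entirely on your machine): a loop that reads one `int` per 64-byte line does 1/16th of the additions of a full traversal, yet runs nowhere near 16x faster, because both loops pull in the same cache lines.

```c
/* Sketch: stride-1 vs stride-16 traversal of an array much larger than L3.
 * Both loops touch every cache line, so their runtimes are similar even
 * though the second does 1/16th of the work. (Hypothetical harness.) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)            /* 256 MB of ints */

static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1;      /* touch every page first */

    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += 1)  sum += a[i];    /* every element      */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 1:  %.3fs\n", elapsed(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += 16) sum += a[i];    /* one per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 16: %.3fs  (sum=%ld)\n", elapsed(t0, t1), sum);

    free(a);
    return 0;
}
```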

## Spatial and Temporal Locality

Caches exploit two patterns in real programs:

### Spatial Locality

"If you accessed address X, you'll probably access X+1 soon."

```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];              // Next element is in same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()]; // Each access misses cache
}
```

### Temporal Locality

"If you accessed address X, you'll probably access X again soon."

```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x;          // x stays in registers
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];          // Array evicted from cache, refetch
}
```

## Why Random Access Kills Performance

### Example: Array vs Linked List

```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │   All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked List (scattered memory):
┌───┐         ┌───┐         ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘         └───┘         └───┘
  ↑             ↑             ↑
0x1000        0x5420        0x2108
Each node in a different cache line!
```

Traversing a scattered linked list causes a **cache miss per node**.

### Real Numbers

```
Array traversal:   ~0.004 seconds  (10M elements)
Sequential list:   ~0.018 seconds  (4.5x slower)
Scattered list:    ~1.400 seconds  (350x slower!)
```

The scattered list is O(n) just like the array, but the constant factor is 350x worse.
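
A sketch of how numbers like these can be measured (hypothetical harness; the figures above are the author's and will vary by machine). The shuffled index chain stands in for a heap-scattered linked list: every hop lands on an unpredictable cache line, so the prefetcher cannot help.

```c
/* Build a randomly-ordered index chain over 10M elements and traverse it by
 * chasing indices. Time this against the sequential sum from the stride
 * sketch above to reproduce the array-vs-scattered-list gap. */
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void) {
    int    *value = malloc((size_t)N * sizeof *value);
    size_t *next  = malloc((size_t)N * sizeof *next);
    size_t *perm  = malloc((size_t)N * sizeof *perm);

    for (size_t i = 0; i < N; i++) { value[i] = 1; perm[i] = i; }

    /* Fisher-Yates shuffle: traversal order becomes effectively random
     * (uses rand() for brevity; fine as a sketch on glibc). */
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i + 1 < N; i++) next[perm[i]] = perm[i + 1];
    next[perm[N - 1]] = perm[0];

    /* Pointer chase: each iteration depends on the previous load. */
    long sum = 0;
    size_t cur = perm[0];
    for (size_t i = 0; i < N; i++) { sum += value[cur]; cur = next[cur]; }
    printf("sum = %ld\n", sum);

    free(value); free(next); free(perm);
    return 0;
}
```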

## Pipeline Stalls: Why the CPU Can't Hide Latency

Modern CPUs execute many instructions simultaneously:

```
Pipeline (simplified):
Cycle:     1    2    3    4    5    6    7    8
Fetch:    [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:        [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:            [A]  [B]  [C]  [D]  [E]  [F]
Memory:                  [A]  [B]  [C]  [D]  [E]
Write:                        [A]  [B]  [C]  [D]
```

But what happens when instruction C needs data from memory?

```
Cycle:     1    2    3    4    5   ...  200  201  202
Fetch:    [A]  [B]  [C]  [C]  [C]  ...  [C]  [D]  [E]
Decode:        [A]  [B]  [C]  [C]  ...  [C]  [C]  [D]
Execute:            [A]  [B]  waiting for memory...
Memory:                  [A]  [B]  ...       ...  [C]
Write:                        [A]  [B]  ...       [C]
                               ↑
                        STALL! Pipeline bubbles
```

The CPU stalls for **~200 cycles** waiting for RAM. In those 200 cycles a superscalar core could have executed hundreds of instructions.

### Out-of-Order Execution Helps (But Not Enough)

CPUs can execute later instructions while waiting:

```c
a = array[i];   // Cache miss, stall...
b = x + y;      // Can execute while waiting!
c = b * 2;      // Can execute while waiting!
d = a + 1;      // Must wait for 'a'
```

But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
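
One quick way to see this is instructions per cycle (IPC): `perf stat` reports it when counting cycles and instructions together, and a core that could retire several instructions per cycle but shows IPC well below 1 is mostly waiting on memory. The output lines below are illustrative, not from a real run:

```bash
perf stat -e cycles,instructions ./program
#   3,412,889,120   cycles
#   1,203,456,789   instructions   #  0.35  insn per cycle   <- memory-bound
```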

## The Prefetcher: CPU Tries to Help

Modern CPUs detect sequential access patterns and fetch data **before you ask**:

```
Your code accesses:  [0] [1] [2] [3] [4] ...
Prefetcher fetches:                  [5] [6] [7] [8] ...   (ahead of you!)
```

But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect

```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];          // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]]; // Random indices, can't predict
}
```

## Row-Major vs Column-Major

C stores 2D arrays in row-major order:

```
int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└────── row 0 ─────┘ └────── row 1 ─────┘ └────── row 2 ─────┘
```

### Row-Major Access (Cache-Friendly)

```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];   // Sequential in memory!
    }
}
```

Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.

### Column-Major Access (Cache-Hostile)

```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];   // Jumps by COLS each time!
    }
}
```

Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.

If COLS = 8192, each access jumps 32 KB - far beyond any cache line!

**Result**: Column-major access can be **10-50x slower** for large matrices.

## False Sharing: The Multithreaded Trap

Cache coherency means cores must agree on cache line contents.

```
Thread 1 (Core 0): counter1++  ┐
Thread 2 (Core 1): counter2++  ├── Both in the same cache line!
                               ┘
┌──────────────────────────────────────────────────────────────┐
│ Cache line:  [counter1] [counter2] [padding.................] │
└──────────────────────────────────────────────────────────────┘
```

When Thread 1 writes counter1, Core 1's copy of the cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.

**Fix**: Pad data to separate cache lines:

```c
struct {
    long counter;
    char padding[64 - sizeof(long)];   // Pad to 64 bytes
} counters[NUM_THREADS];
```
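
An equivalent fix, if your compiler supports C11, is to let the alignment requirement create the padding (this sketch assumes a 64-byte cache line, as above; `NUM_THREADS` is the same constant used in the struct version):

```c
#include <stdalign.h>

/* alignas(64) forces 64-byte alignment, so the struct is padded to 64 bytes
 * and each counter lives on its own cache line. */
struct padded_counter {
    alignas(64) long counter;
};

struct padded_counter counters[NUM_THREADS];
```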

## NUMA: When Memory Has Geography

On multi-socket systems, memory is "closer" to some CPUs:

```
┌─────────────────────────────────────────────────────────────┐
│    Socket 0                             Socket 1            │
│  ┌─────────────┐                     ┌─────────────┐        │
│  │  Core 0-7   │                     │  Core 8-15  │        │
│  └──────┬──────┘                     └──────┬──────┘        │
│         │                                   │               │
│  ┌──────┴──────┐     interconnect    ┌──────┴──────┐        │
│  │   RAM 0     │ ←─────────────────→ │   RAM 1     │        │
│  │   (local)   │       (slow)        │  (remote)   │        │
│  └─────────────┘                     └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
```

Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
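
On Linux you can inspect the topology and keep a process on one node with `numactl` (assuming the tool is installed), so its allocations come from local RAM:

```bash
numactl --hardware                              # show nodes, sizes, distances
numactl --cpunodebind=0 --membind=0 ./program   # run pinned to node 0
```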

## Measuring Cache Behavior

```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```

## Summary

| Pattern | Cache Behavior | Performance |
|---------|----------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |

**Key Takeaways**:

1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing

## Further Reading

- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)

docs/HOW-SAMPLING-PROFILERS-WORK.md (new file, 211 lines)
@@ -0,0 +1,211 @@

# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: **"Where is my program spending time?"**

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution:  ████████████████████████████████████████
                      ↑      ↑      ↑      ↑      ↑      ↑
                    sample sample sample sample sample  ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.
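
For contrast, the instrumentation workflow looks roughly like this (`program.c` is a placeholder name):

```bash
gcc -pg -O2 -o program program.c   # compiler inserts counting code
./program                          # run once; writes gmon.out
gprof ./program gmon.out | head    # exact call counts and flat profile
```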

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                   CPU                   │
│   ┌─────────────────────────────────┐   │
│   │  Performance Monitoring Unit    │   │
│   │   ┌─────────┐   ┌─────────┐     │   │
│   │   │ Counter │   │ Counter │ ... │   │
│   │   │ cycles  │   │ instrs  │     │   │
│   │   └─────────┘   └─────────┘     │   │
│   │         ↓ overflow interrupt    │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

When you run `perf record` (a minimal sketch of the same setup follows the list):

1. perf programs a PMU counter to count CPU cycles
2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. Overflow triggers a **Non-Maskable Interrupt (NMI)**
4. Kernel handler records: instruction pointer, process ID, timestamp
5. Optionally: walks the stack to get the call chain
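
A minimal sketch of the same setup using the raw `perf_event_open(2)` syscall - roughly what perf does internally, minus the ring-buffer handling (Linux only; error paths trimmed):

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* glibc has no wrapper for this syscall */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size           = sizeof attr;
    attr.type           = PERF_TYPE_HARDWARE;
    attr.config         = PERF_COUNT_HW_CPU_CYCLES;     /* sample on cycles   */
    attr.freq           = 1;
    attr.sample_freq    = 99;                            /* ~99 samples/second */
    attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload; samples land in a ring buffer that a real
     *     profiler would mmap() from this fd and drain ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}
```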

### 2. The Sampling Frequency

```bash
perf record -F 99 ./program     # 99 samples per second
perf record -F 9999 ./program   # 9999 samples per second
```

Higher frequency = more samples = better accuracy, but more overhead.

**Why 99 and not 100?** A frequency just off the round number avoids sampling in lockstep with periodic activity in your program (like a 100 Hz timer), which would bias the profile toward whatever runs on those ticks.

### 3. What Gets Recorded

Each sample contains:
- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (`-g`): Map addresses to source lines

```
Without symbols:   45.23%  0x00000000004011a0
With symbols:      45.23%  compute_inner
With debug info:   45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.
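
You can do the same mapping by hand with binutils; the address here is the illustrative one from above:

```bash
nm ./program | grep ' T '              # symbol table: address -> function name
addr2line -f -e ./program 0x4011a0     # with debug info: function + file:line
```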

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after current function
│ saved RBP    │ ← pointer to previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ...          │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack (a minimal sketch below).
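
A sketch of that walk on x86-64, assuming the binary keeps frame pointers (see the caveat that follows); `print_backtrace` is a hypothetical helper, not a perf API:

```c
#include <stdio.h>

/* x86-64 frame layout with frame pointers: [rbp] = caller's saved RBP,
 * [rbp+8] = return address into the caller. */
struct frame {
    struct frame *prev;
    void         *ret;
};

void print_backtrace(void) {
    struct frame *fp = __builtin_frame_address(0);   /* current RBP (GCC/Clang) */
    while (fp && fp->ret) {                          /* not robust against missing frames */
        printf("  called from %p\n", fp->ret);
        fp = fp->prev;                               /* hop to the previous frame */
    }
}
```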

**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

Uses DWARF call-frame information (the `.eh_frame` section, present even without `-g`) to unwind without frame pointers. More reliable, but slower, and it copies a chunk of the stack with every sample.

```bash
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

**Law of Large Numbers**: More samples = closer to the true distribution.

```
  100 samples:  A: 8-12%      (high variance)
 1000 samples:  A: 9-11%      (better)
10000 samples:  A: 9.8-10.2%  (quite accurate)
```

This is why short-running programs need higher sampling frequency.
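
Those ranges follow from the standard error of a proportion - a rough back-of-envelope, assuming samples are independent:

```
standard error ≈ sqrt(p(1-p)/n)    with p = 0.10 (true share of time in A)

n = 100      →  ≈ 3.0%
n = 1,000    →  ≈ 0.95%
n = 10,000   →  ≈ 0.3%
```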

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                                   (might miss!)
Function B:        ████████████████████████████████     (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining - square() disappears from the profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks)
- Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to the buffer takes cycles

Usually <5%, but it can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```bash
perf record -e cycles ./program         # CPU cycles (default)
perf record -e instructions ./program   # Instructions retired
perf record -e cache-misses ./program   # Cache misses
perf record -e branch-misses ./program  # Branch mispredictions
```

This lets you answer "where do cache misses happen?", not just "where is time spent?"

## Summary

1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at a configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production

## Further Reading

- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)

docs/HOW-TRACING-WORKS.md (new file, 265 lines)
@@ -0,0 +1,265 @@

# How Tracing Works: strace, bpftrace, and eBPF

## What is Tracing?

While sampling answers "where is time spent?", tracing answers "what happened?"

Tracing captures **every occurrence** of specific events:
- Every syscall
- Every function call
- Every network packet
- Every disk I/O

## strace: The Simple Way

### How ptrace Works

strace uses the `ptrace()` syscall - the same mechanism debuggers use.

```
┌─────────────────┐           ┌─────────────────┐
│  Your Program   │           │     strace      │
│                 │           │                 │
│  1. syscall     │──────────→│  2. STOP!       │
│     ║           │  SIGTRAP  │     inspect     │
│     ║ (paused)  │           │     log         │
│     ║           │←──────────│  3. continue    │
│  4. resume      │  PTRACE   │                 │
│                 │   CONT    │                 │
└─────────────────┘           └─────────────────┘
```

Step by step (a minimal sketch follows the list):

1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME`
2. **Trap on syscall**: strace requests `PTRACE_SYSCALL` - the kernel stops the tracee at each syscall
3. **Inspect**: strace reads registers to see the syscall number and arguments
4. **Continue**: strace resumes the tracee with `ptrace(PTRACE_SYSCALL)` so it stops again at the next syscall boundary
5. **Repeat**: The kernel stops the tracee again when the syscall returns, and strace reads the return value
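
A minimal strace-like loop, assuming x86-64 Linux and a command passed on the command line (error handling omitted; real strace decodes arguments and handles many more stop reasons):

```c
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    pid_t pid = fork();
    if (pid == 0) {                                   /* child = tracee */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);                    /* stops with SIGTRAP on exec */
        return 127;
    }

    int status;
    waitpid(pid, &status, 0);                         /* initial stop after exec */
    while (1) {
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);      /* run to next syscall stop */
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            break;
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);     /* inspect registers */
        printf("syscall %llu\n", regs.orig_rax);      /* fires on entry and exit */
    }
    return 0;
}
```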

### Why strace is Slow

Each syscall triggers **two stops** (entry and exit), and each stop means switching over to strace and back:

```
Normal syscall:
User → Kernel → User

With strace:
User → Kernel → strace → Kernel → strace → Kernel → User
                  ↑                  ↑
             entry stop          exit stop
```

Overhead can be **10-100x** for syscall-heavy programs!

```bash
# Normal
time ./read_fast testfile             # 0.01s

# With strace
time strace -c ./read_fast testfile   # 0.5s (50x slower!)
```

### When to Use strace

Despite the overhead, strace is invaluable for:
- Debugging "why won't this program start?"
- Finding which files a program opens
- Understanding program behavior
- One-off investigation (not production)

```bash
strace -e openat ./program    # What files does it open?
strace -e connect ./program   # What network connections?
strace -c ./program           # Syscall summary
```

## eBPF: The Fast Way

### What is eBPF?

eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**.

```
┌─────────────────────────────────────────────────────┐
│                       Kernel                        │
│   ┌──────────────────────────────────────────────┐  │
│   │  Your eBPF Program                           │  │
│   │  - Runs at kernel speed                      │  │
│   │  - No context switches                       │  │
│   │  - Verified for safety                       │  │
│   └──────────────────────────────────────────────┘  │
│                     ↓ attach to                     │
│   ┌──────────────────────────────────────────────┐  │
│   │  Tracepoints, Kprobes, Uprobes, USDT         │  │
│   └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                  ↓ results via
┌─────────────────────────────────────────────────────┐
│                     User Space                      │
│          Maps, ring buffers, perf events            │
└─────────────────────────────────────────────────────┘
```

### The eBPF Verifier

Before your eBPF program runs, the kernel **verifies** it:
- No infinite loops
- No out-of-bounds memory access
- No unsafe operations
- Bounded execution time

This makes eBPF safe to run on production systems: a buggy tracing program is rejected at load time instead of crashing the kernel (loading one generally still requires root or `CAP_BPF`).

### Attachment Points

eBPF can attach to various kernel hooks:

| Type | What it traces | Example |
|------|----------------|---------|
| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` |
| **Kprobes** | Any kernel function | `kprobe:do_sys_open` |
| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` |
| **USDT** | User-defined static probes | `usdt:./server:myapp:request` |

### Why eBPF is Fast

```
strace (ptrace):
Process stops → context switch → strace reads → context switch → resume

eBPF:
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
```

eBPF overhead is typically **<1%** even for frequent events.

## bpftrace: eBPF Made Easy

bpftrace is a high-level language for eBPF, like awk for tracing.

### Basic Syntax

```
probe /filter/ { action }
```

### Examples

```bash
# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s opened %s\n", comm, str(args->filename));
}'

# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @bytes = hist(args->ret);
}'

# Latency of disk I/O
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
             kprobe:blk_account_io_done /@start[arg0]/ {
                 @usecs = hist((nsecs - @start[arg0]) / 1000);
                 delete(@start[arg0]);
             }'
```

### bpftrace Built-in Variables

| Variable | Meaning |
|----------|---------|
| `pid` | Process ID |
| `tid` | Thread ID |
| `comm` | Process name |
| `nsecs` | Nanosecond timestamp |
| `arg0-argN` | Function arguments |
| `retval` | Return value |
| `probe` | Current probe name |

### bpftrace Aggregations

```bash
@x = count()                    # Count events
@x = sum(value)                 # Sum values
@x = avg(value)                 # Average
@x = min(value)                 # Minimum
@x = max(value)                 # Maximum
@x = hist(value)                # Power-of-2 histogram
@x = lhist(v, min, max, step)   # Linear histogram
```

## Comparison: When to Use What

| Tool | Overhead | Setup | Use Case |
|------|----------|-------|----------|
| **strace** | High (10-100x) | Zero | Quick debugging, non-production |
| **ltrace** | High | Zero | Library call tracing |
| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis |
| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events |

### Decision Tree

```
Need to trace events?
├── Quick one-off debugging?
│   └── strace (easy, but slow)
├── Production system?
│   └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│   └── USDT + bpftrace
└── CPU profiling?
    └── perf record
```

## USDT: User Statically Defined Tracing

USDT probes are markers you add to your code:

```c
#include <sys/sdt.h>

void handle_request(int id) {
    DTRACE_PROBE1(myserver, request_start, id);
    // ... handle request ...
    DTRACE_PROBE1(myserver, request_end, id);
}
```

Then trace with bpftrace:

```bash
bpftrace -e 'usdt:./server:myserver:request_start {
                 @start[arg0] = nsecs;
             }
             usdt:./server:myserver:request_end /@start[arg0]/ {
                 @latency = hist((nsecs - @start[arg0]) / 1000);
                 delete(@start[arg0]);
             }'
```

**Advantages of USDT**:
- Zero overhead when not tracing
- Stable interface (unlike kprobes)
- Access to application-level data

## Summary

| Mechanism | How it works | Speed | Safety |
|-----------|--------------|-------|--------|
| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe |
| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified |
| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |

## Further Reading

- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md)
- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md)
- [eBPF documentation](https://ebpf.io/what-is-ebpf/)
- [strace source code](https://github.com/strace/strace) - surprisingly readable!