From 1432bdaff9acb93c493d24c934e7e3bcd475279b Mon Sep 17 00:00:00 2001 From: illustris Date: Sun, 11 Jan 2026 12:04:36 +0530 Subject: [PATCH] add docs --- docs/CPU-CACHES-AND-MEMORY.md | 327 ++++++++++++++++++++++++++++ docs/HOW-SAMPLING-PROFILERS-WORK.md | 211 ++++++++++++++++++ docs/HOW-TRACING-WORKS.md | 265 ++++++++++++++++++++++ 3 files changed, 803 insertions(+) create mode 100644 docs/CPU-CACHES-AND-MEMORY.md create mode 100644 docs/HOW-SAMPLING-PROFILERS-WORK.md create mode 100644 docs/HOW-TRACING-WORKS.md diff --git a/docs/CPU-CACHES-AND-MEMORY.md b/docs/CPU-CACHES-AND-MEMORY.md new file mode 100644 index 0000000..76f5889 --- /dev/null +++ b/docs/CPU-CACHES-AND-MEMORY.md @@ -0,0 +1,327 @@ +# CPU Caches and Memory: Why Access Patterns Matter + +## The Memory Wall + +CPUs are fast. Memory is slow. This gap is called the "memory wall." + +``` + Relative Speed + ══════════════ +CPU registers ████████████████████████████████ (~1 cycle) +L1 cache ██████████████████████ (~4 cycles) +L2 cache ████████████ (~12 cycles) +L3 cache ██████ (~40 cycles) +Main RAM █ (~200 cycles) +SSD (~10,000 cycles) +HDD (~10,000,000 cycles) +``` + +A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs. + +## The Cache Hierarchy + +``` +┌─────────────────────────────────────────────────────────────┐ +│ CPU Core │ +│ ┌─────────────────────────────────────────────────────┐ │ +│ │ Registers (bytes, <1ns) │ │ +│ └─────────────────────────────────────────────────────┘ │ +│ ┌─────────────────────────────────────────────────────┐ │ +│ │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d) │ │ +│ └─────────────────────────────────────────────────────┘ │ +│ ┌─────────────────────────────────────────────────────┐ │ +│ │ L2 Cache: 256-512 KB, ~3-4ns │ │ +│ └─────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ + │ + ┌─────────────────┴─────────────────┐ + │ L3 Cache: 8-64 MB, ~10-12ns │ (shared between cores) + └─────────────────┬─────────────────┘ + │ + ┌─────────────────┴─────────────────┐ + │ Main RAM: GBs, ~50-100ns │ + └───────────────────────────────────┘ +``` + +### Typical Numbers (Desktop CPU, 2024) + +| Level | Size | Latency | Bandwidth | +|-------|------|---------|-----------| +| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s | +| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s | +| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s | +| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s | + +## Cache Lines: The Unit of Transfer + +Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes). + +``` +Memory addresses: +0x1000: [████████████████████████████████████████████████████████████████] + └──────────────── 64 bytes = 1 cache line ────────────────────┘ + +If you access address 0x1020: +- CPU fetches entire cache line (0x1000-0x103F) +- Next access to 0x1021? Already in cache! (free) +- Access to 0x1040? Different cache line, another fetch +``` + +This is why **sequential access** is so much faster than random access. + +## Spatial and Temporal Locality + +Caches exploit two patterns in real programs: + +### Spatial Locality +"If you accessed address X, you'll probably access X+1 soon." 
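+
+To see the cache-line effect directly, here is a minimal sketch (assuming a POSIX system with `clock_gettime`; the file name, buffer size, and strides are illustrative choices, echoing the first experiment in the "Gallery of Processor Cache Effects" linked under Further Reading). Walking a large buffer with stride 16 does 1/16th of the work of stride 1, yet touches the same set of 64-byte lines, so both runs typically take about the same time:
+
+```c
+// stride.c -- illustration only; build with: gcc -O2 stride.c -o stride
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+
+#define N (64 * 1024 * 1024)   // 64M ints = 256 MB, far larger than any cache
+
+static double walk(int *a, int stride) {
+    struct timespec t0, t1;
+    clock_gettime(CLOCK_MONOTONIC, &t0);
+    for (long i = 0; i < N; i += stride)
+        a[i] *= 3;              // one read + one write per touched element
+    clock_gettime(CLOCK_MONOTONIC, &t1);
+    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
+}
+
+int main(void) {
+    int *a = calloc(N, sizeof *a);
+    if (!a) return 1;
+    // Both loops are dominated by fetching the same 64-byte cache lines,
+    // so stride 16 is usually nowhere near 16x faster than stride 1.
+    printf("stride  1: %.3f s\n", walk(a, 1));
+    printf("stride 16: %.3f s\n", walk(a, 16));
+    free(a);
+    return 0;
+}
+```
+
+At the source level, the contrast between cache-friendly and cache-hostile access looks like this: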
+ +```c +// GOOD: Sequential access (spatial locality) +for (int i = 0; i < N; i++) { + sum += array[i]; // Next element is in same cache line +} + +// BAD: Random access (no spatial locality) +for (int i = 0; i < N; i++) { + sum += array[random_index()]; // Each access misses cache +} +``` + +### Temporal Locality +"If you accessed address X, you'll probably access X again soon." + +```c +// GOOD: Reuse data while it's hot +for (int i = 0; i < N; i++) { + x = array[i]; + result += x * x + x; // x stays in registers +} + +// BAD: Touch data once, move on +for (int i = 0; i < N; i++) { + sum += array[i]; +} +for (int i = 0; i < N; i++) { + product *= array[i]; // Array evicted from cache, refetch +} +``` + +## Why Random Access Kills Performance + +### Example: Array vs Linked List + +``` +Array (sequential memory): +┌───┬───┬───┬───┬───┬───┬───┬───┐ +│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ All in 1-2 cache lines +└───┴───┴───┴───┴───┴───┴───┴───┘ + +Linked List (scattered memory): +┌───┐ ┌───┐ ┌───┐ +│ 0 │────────→│ 1 │────────→│ 2 │... +└───┘ └───┘ └───┘ + ↑ ↑ ↑ +0x1000 0x5420 0x2108 +Each node in different cache line! +``` + +Traversing a scattered linked list causes a **cache miss per node**. + +### Real Numbers + +``` +Array traversal: ~0.004 seconds (10M elements) +Sequential list: ~0.018 seconds (4.5x slower) +Scattered list: ~1.400 seconds (350x slower!) +``` + +The scattered list is O(n) just like the array, but the constant factor is 350x worse. + +## Pipeline Stalls: Why the CPU Can't Hide Latency + +Modern CPUs execute many instructions simultaneously: + +``` +Pipeline (simplified): + Cycle: 1 2 3 4 5 6 7 8 + Fetch: [A] [B] [C] [D] [E] [F] [G] [H] + Decode: [A] [B] [C] [D] [E] [F] [G] + Execute: [A] [B] [C] [D] [E] [F] + Memory: [A] [B] [C] [D] [E] + Write: [A] [B] [C] [D] +``` + +But what happens when instruction C needs data from memory? + +``` + Cycle: 1 2 3 4 5 ... 200 201 202 + Fetch: [A] [B] [C] [C] [C] ... [C] [D] [E] + Decode: [A] [B] [C] [C] ... [C] [C] [D] + Execute: [A] [B] waiting for memory... + Memory: [A] [B] ... ... [C] + Write: [A] [B] ... [C] + ↑ + STALL! Pipeline bubbles +``` + +The CPU stalls for **~200 cycles** waiting for RAM. Those 200 cycles could have executed 200+ instructions. + +### Out-of-Order Execution Helps (But Not Enough) + +CPUs can execute later instructions while waiting: + +```c +a = array[i]; // Cache miss, stall... +b = x + y; // Can execute while waiting! +c = b * 2; // Can execute while waiting! +d = a + 1; // Must wait for 'a' +``` + +But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly. + +## The Prefetcher: CPU Tries to Help + +Modern CPUs detect sequential access patterns and fetch data **before you ask**: + +``` +Your code accesses: [0] [1] [2] [3] [4] ... +Prefetcher fetches: [5] [6] [7] [8] ... (ahead of you!) 
+``` + +But prefetchers can only predict **regular patterns**: +- Sequential: ✅ Perfect prediction +- Strided (every Nth element): ✅ Usually works +- Random: ❌ No pattern to detect + +```c +// Prefetcher wins +for (int i = 0; i < N; i++) { + sum += array[i]; // Prefetcher fetches ahead +} + +// Prefetcher loses +for (int i = 0; i < N; i++) { + sum += array[indices[i]]; // Random indices, can't predict +} +``` + +## Row-Major vs Column-Major + +C stores 2D arrays in row-major order: + +``` +int matrix[3][4]; + +Memory layout: +[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3] +└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘ +``` + +### Row-Major Access (Cache-Friendly) + +```c +for (int i = 0; i < ROWS; i++) { + for (int j = 0; j < COLS; j++) { + sum += matrix[i][j]; // Sequential in memory! + } +} +``` + +Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used. + +### Column-Major Access (Cache-Hostile) + +```c +for (int j = 0; j < COLS; j++) { + for (int i = 0; i < ROWS; i++) { + sum += matrix[i][j]; // Jumps by COLS each time! + } +} +``` + +Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes. + +If COLS=8192, each access jumps 32KB - far beyond any cache line! + +**Result**: Column-major can be **10-50x slower** for large matrices. + +## False Sharing: The Multithreaded Trap + +Cache coherency means cores must agree on cache line contents. + +``` +Thread 1 (Core 0): counter1++ ┐ +Thread 2 (Core 1): counter2++ ├── Both in same cache line! + ┘ +┌────────────────────────────────────────────────────────────────┐ +│ Cache line: [counter1] [counter2] [padding.................] │ +└────────────────────────────────────────────────────────────────┘ +``` + +When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line. + +**Fix**: Pad data to separate cache lines: + +```c +struct { + long counter; + char padding[64 - sizeof(long)]; // Pad to 64 bytes +} counters[NUM_THREADS]; +``` + +## NUMA: When Memory Has Geography + +On multi-socket systems, memory is "closer" to some CPUs: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Socket 0 Socket 1 │ +│ ┌─────────────┐ ┌─────────────┐ │ +│ │ Core 0-7 │ │ Core 8-15 │ │ +│ └──────┬──────┘ └──────┬──────┘ │ +│ │ │ │ +│ ┌──────┴──────┐ interconnect ┌──────┴──────┐ │ +│ │ RAM 0 │ ←────────────→ │ RAM 1 │ │ +│ │ (local) │ (slow) │ (remote) │ │ +│ └─────────────┘ └─────────────┘ │ +└─────────────────────────────────────────────────────────────┘ +``` + +Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency. + +## Measuring Cache Behavior + +```bash +# Overall cache stats +perf stat -e cache-misses,cache-references ./program + +# Detailed breakdown +perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program + +# Where do cache misses happen? +perf record -e cache-misses ./program +perf report +``` + +## Summary + +| Pattern | Cache Behavior | Performance | +|---------|---------------|-------------| +| Sequential access | Prefetcher wins, cache lines fully used | Fast | +| Strided access | Partial cache line use | Medium | +| Random access | Every access misses, pipeline stalls | Slow | + +**Key Takeaways**: + +1. **Memory access pattern matters as much as algorithm complexity** +2. **Sequential access is almost always faster** - prefetcher + cache lines +3. 
**Random access causes pipeline stalls** - CPU waits ~200 cycles per miss +4. **Structure data for access pattern** - not just for logical organization +5. **Measure with `perf stat`** before optimizing + +## Further Reading + +- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper +- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/) +- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html) diff --git a/docs/HOW-SAMPLING-PROFILERS-WORK.md b/docs/HOW-SAMPLING-PROFILERS-WORK.md new file mode 100644 index 0000000..621df12 --- /dev/null +++ b/docs/HOW-SAMPLING-PROFILERS-WORK.md @@ -0,0 +1,211 @@ +# How Sampling Profilers Work + +## The Core Idea + +Sampling profilers answer the question: **"Where is my program spending time?"** + +Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically. + +``` +Program execution: ████████████████████████████████████████ + ↑ ↑ ↑ ↑ ↑ ↑ ↑ + sample sample sample ... +``` + +## Sampling vs Instrumentation + +| Approach | How it works | Overhead | Accuracy | +|----------|--------------|----------|----------| +| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts | +| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical | + +Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it. + +## How perf Does It + +### 1. Hardware Performance Counters (PMU) + +Modern CPUs have Performance Monitoring Units (PMUs) with special registers: + +``` +┌─────────────────────────────────────────┐ +│ CPU │ +│ ┌─────────────────────────────────┐ │ +│ │ Performance Monitoring Unit │ │ +│ │ ┌─────────┐ ┌─────────┐ │ │ +│ │ │ Counter │ │ Counter │ ... │ │ +│ │ │ cycles │ │ instrs │ │ │ +│ │ └─────────┘ └─────────┘ │ │ +│ │ ↓ overflow interrupt │ │ +│ └─────────────────────────────────┘ │ +└─────────────────────────────────────────┘ +``` + +When you run `perf record`: + +1. perf programs a PMU counter to count CPU cycles +2. Counter overflows every N cycles (default: enough for ~4000 samples/sec) +3. Overflow triggers a **Non-Maskable Interrupt (NMI)** +4. Kernel handler records: instruction pointer, process ID, timestamp +5. Optionally: walks the stack to get call chain + +### 2. The Sampling Frequency + +```bash +perf record -F 99 ./program # 99 samples per second +perf record -F 9999 ./program # 9999 samples per second +``` + +Higher frequency = more samples = better accuracy, but more overhead. + +**Why 99 and not 100?** Using a prime number avoids aliasing with periodic behavior in your program (like a 100Hz timer). + +### 3. What Gets Recorded + +Each sample contains: +- **IP (Instruction Pointer)**: Which instruction was executing +- **PID/TID**: Which process/thread +- **Timestamp**: When it happened +- **CPU**: Which core +- **Call chain** (with `-g`): Stack of return addresses + +``` +Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890 + callchain: main → process_data → compute_inner +Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891 + callchain: main → process_data → compute_inner +... +``` + +## Symbol Resolution + +Raw samples contain memory addresses. To show function names, perf needs: + +1. **Symbol tables**: Map address ranges to function names +2. 
**Debug info** (`-g`): Map addresses to source lines + +``` +Without symbols: 45.23% 0x00000000004011a0 +With symbols: 45.23% compute_inner +With debug info: 45.23% compute_inner (program.c:28) +``` + +This is why `perf report` needs access to the same binary you profiled. + +## Call Graph Collection + +With `perf record -g`, perf records the call stack for each sample. + +### Frame Pointer Walking (Traditional) + +``` +Stack Memory: +┌──────────────┐ +│ return addr │ ← where to return after current function +│ saved RBP │ ← pointer to previous frame +├──────────────┤ +│ local vars │ +├──────────────┤ +│ return addr │ +│ saved RBP ───┼──→ previous frame +├──────────────┤ +│ ... │ +└──────────────┘ +``` + +Walk the chain of frame pointers to reconstruct the call stack. + +**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding. + +### DWARF Unwinding + +Uses debug info (`.eh_frame` section) to unwind without frame pointers. More reliable but slower. + +```bash +perf record --call-graph dwarf ./program +``` + +## Statistical Nature + +Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A. + +**Law of Large Numbers**: More samples = closer to true distribution. + +``` +100 samples: A: 8-12% (high variance) +1000 samples: A: 9-11% (better) +10000 samples: A: 9.8-10.2% (quite accurate) +``` + +This is why short-running programs need higher sampling frequency. + +## Limitations + +### 1. Short Functions Miss Samples + +If a function runs for less time than the sampling interval, it might not get sampled at all. + +``` +Sampling interval: ────────────────────────────────── +Function A: ██ (might miss!) +Function B: ████████████████████████████████ (definitely hit) +``` + +### 2. Inlined Functions Disappear + +When the compiler inlines a function, it no longer exists as a separate entity: + +```c +// Source code +inline int square(int x) { return x * x; } +int compute(int x) { return square(x) + 1; } + +// After inlining - square() disappears from profile +int compute(int x) { return x * x + 1; } +``` + +With debug info, perf can sometimes recover inline information. + +### 3. Sampling Bias + +Some events are harder to catch: +- Very short functions +- Functions that mostly wait (I/O, locks) +- Interrupt handlers + +### 4. Observer Effect + +Profiling itself has overhead: +- NMI handling takes cycles +- Stack unwinding takes cycles +- Writing samples to buffer takes cycles + +Usually <5%, but can affect extremely performance-sensitive code. + +## perf Events + +perf can sample on different events, not just CPU cycles: + +```bash +perf record -e cycles ./program # CPU cycles (default) +perf record -e instructions ./program # Instructions retired +perf record -e cache-misses ./program # Cache misses +perf record -e branch-misses ./program # Branch mispredictions +``` + +This lets you answer "where do cache misses happen?" not just "where is time spent?" + +## Summary + +1. **Sampling** interrupts periodically to see what's executing +2. **PMU counters** trigger interrupts at configurable frequency +3. **Statistical accuracy** improves with more samples +4. **Symbol resolution** maps addresses to function names +5. **Call graphs** show the path to each sample +6. 
**Low overhead** (~1-5%) makes it usable in production + +## Further Reading + +- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html) +- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page) +- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18) diff --git a/docs/HOW-TRACING-WORKS.md b/docs/HOW-TRACING-WORKS.md new file mode 100644 index 0000000..78ea3ac --- /dev/null +++ b/docs/HOW-TRACING-WORKS.md @@ -0,0 +1,265 @@ +# How Tracing Works: strace, bpftrace, and eBPF + +## What is Tracing? + +While sampling answers "where is time spent?", tracing answers "what happened?" + +Tracing captures **every occurrence** of specific events: +- Every syscall +- Every function call +- Every network packet +- Every disk I/O + +## strace: The Simple Way + +### How ptrace Works + +strace uses the `ptrace()` syscall - the same mechanism debuggers use. + +``` +┌─────────────────┐ ┌─────────────────┐ +│ Your Program │ │ strace │ +│ │ │ │ +│ 1. syscall │─────────→│ 2. STOP! │ +│ ║ │ SIGTRAP │ inspect │ +│ ║ (paused) │ │ log │ +│ ║ │←─────────│ 3. continue │ +│ 4. resume │ PTRACE │ │ +│ │ CONT │ │ +└─────────────────┘ └─────────────────┘ +``` + +Step by step: + +1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME` +2. **Trap on syscall**: strace sets `PTRACE_SYSCALL` - kernel stops tracee at each syscall +3. **Inspect**: strace reads registers to see syscall number and arguments +4. **Continue**: strace calls `ptrace(PTRACE_CONT)` to resume +5. **Repeat**: Kernel stops again when syscall returns, strace reads return value + +### Why strace is Slow + +Each syscall causes **two context switches** to strace: +- One on syscall entry +- One on syscall exit + +``` +Normal syscall: + User → Kernel → User + +With strace: + User → Kernel → strace → Kernel → strace → Kernel → User + ↑ ↑ + entry stop exit stop +``` + +Overhead can be **10-100x** for syscall-heavy programs! + +```bash +# Normal +time ./read_fast testfile # 0.01s + +# With strace +time strace -c ./read_fast testfile # 0.5s (50x slower!) +``` + +### When to Use strace + +Despite overhead, strace is invaluable for: +- Debugging "why won't this program start?" +- Finding which files a program opens +- Understanding program behavior +- One-off investigation (not production) + +```bash +strace -e openat ./program # What files does it open? +strace -e connect ./program # What network connections? +strace -c ./program # Syscall summary +``` + +## eBPF: The Fast Way + +### What is eBPF? + +eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**. 
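+
+As a concrete taste before the machinery: with `bpftrace` (introduced below; this sketch assumes a BPF-capable kernel and root privileges), a single line counts syscalls per process, with all of the counting done inside the kernel:
+
+```bash
+# Count syscalls by process name; results print from the in-kernel map on Ctrl-C
+sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
+```
+
+The diagram below shows where such a program sits: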
+ +``` +┌─────────────────────────────────────────────────────┐ +│ Kernel │ +│ ┌──────────────────────────────────────────────┐ │ +│ │ Your eBPF Program │ │ +│ │ - Runs at kernel speed │ │ +│ │ - No context switches │ │ +│ │ - Verified for safety │ │ +│ └──────────────────────────────────────────────┘ │ +│ ↓ attach to │ +│ ┌──────────────────────────────────────────────┐ │ +│ │ Tracepoints, Kprobes, Uprobes, USDT │ │ +│ └──────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────┘ + ↓ results via +┌─────────────────────────────────────────────────────┐ +│ User Space │ +│ Maps, ring buffers, perf events │ +└─────────────────────────────────────────────────────┘ +``` + +### The eBPF Verifier + +Before your eBPF program runs, the kernel **verifies** it: +- No infinite loops +- No out-of-bounds memory access +- No unsafe operations +- Bounded execution time + +This makes eBPF safe to run in production, even from untrusted users. + +### Attachment Points + +eBPF can attach to various kernel hooks: + +| Type | What it traces | Example | +|------|----------------|---------| +| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` | +| **Kprobes** | Any kernel function | `kprobe:do_sys_open` | +| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` | +| **USDT** | User-defined static probes | `usdt:./server:myapp:request` | + +### Why eBPF is Fast + +``` +strace (ptrace): + Process stops → context switch → strace reads → context switch → resume + +eBPF: + Event fires → eBPF runs IN KERNEL → continue (no context switch!) +``` + +eBPF overhead is typically **<1%** even for frequent events. + +## bpftrace: eBPF Made Easy + +bpftrace is a high-level language for eBPF, like awk for tracing. + +### Basic Syntax + +``` +probe /filter/ { action } +``` + +### Examples + +```bash +# Count syscalls by name +bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' + +# Trace open() calls with filename +bpftrace -e 'tracepoint:syscalls:sys_enter_openat { + printf("%s opened %s\n", comm, str(args->filename)); +}' + +# Histogram of read() sizes +bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { + @bytes = hist(args->ret); +}' + +# Latency of disk I/O +bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; } + kprobe:blk_account_io_done /@start[arg0]/ { + @usecs = hist((nsecs - @start[arg0]) / 1000); + delete(@start[arg0]); + }' +``` + +### bpftrace Built-in Variables + +| Variable | Meaning | +|----------|---------| +| `pid` | Process ID | +| `tid` | Thread ID | +| `comm` | Process name | +| `nsecs` | Nanosecond timestamp | +| `arg0-argN` | Function arguments | +| `retval` | Return value | +| `probe` | Current probe name | + +### bpftrace Aggregations + +```bash +@x = count() # Count events +@x = sum(value) # Sum values +@x = avg(value) # Average +@x = min(value) # Minimum +@x = max(value) # Maximum +@x = hist(value) # Power-of-2 histogram +@x = lhist(v, min, max, step) # Linear histogram +``` + +## Comparison: When to Use What + +| Tool | Overhead | Setup | Use Case | +|------|----------|-------|----------| +| **strace** | High (10-100x) | Zero | Quick debugging, non-production | +| **ltrace** | High | Zero | Library call tracing | +| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis | +| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events | + +### Decision Tree + +``` +Need to trace events? +├── Quick one-off debugging? 
+│ └── strace (easy, but slow) +├── Production system? +│ └── bpftrace/eBPF (fast, safe) +├── Custom application probes? +│ └── USDT + bpftrace +└── CPU profiling? + └── perf record +``` + +## USDT: User Statically Defined Tracing + +USDT probes are markers you add to your code: + +```c +#include + +void handle_request(int id) { + DTRACE_PROBE1(myserver, request_start, id); + // ... handle request ... + DTRACE_PROBE1(myserver, request_end, id); +} +``` + +Then trace with bpftrace: + +```bash +bpftrace -e 'usdt:./server:myserver:request_start { + @start[arg0] = nsecs; +} +usdt:./server:myserver:request_end /@start[arg0]/ { + @latency = hist((nsecs - @start[arg0]) / 1000); + delete(@start[arg0]); +}' +``` + +**Advantages of USDT**: +- Zero overhead when not tracing +- Stable interface (unlike kprobes) +- Access to application-level data + +## Summary + +| Mechanism | How it works | Speed | Safety | +|-----------|--------------|-------|--------| +| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe | +| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified | +| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe | + +## Further Reading + +- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md) +- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md) +- [eBPF documentation](https://ebpf.io/what-is-ebpf/) +- [strace source code](https://github.com/strace/strace) - surprisingly readable!