add docs

parent c8f56cf3f1
commit 1432bdaff9

docs/CPU-CACHES-AND-MEMORY.md (new file, 327 lines)
@@ -0,0 +1,327 @@

# CPU Caches and Memory: Why Access Patterns Matter

## The Memory Wall

CPUs are fast. Memory is slow. This gap is called the "memory wall."

```
Relative Speed
══════════════
CPU registers  ████████████████████████████████  (~1 cycle)
L1 cache       ██████████████████████            (~4 cycles)
L2 cache       ████████████                      (~12 cycles)
L3 cache       ██████                            (~40 cycles)
Main RAM       █                                 (~200 cycles)
SSD                                              (~10,000 cycles)
HDD                                              (~10,000,000 cycles)
```

A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.

## The Cache Hierarchy

```
┌───────────────────────────────────────────────────────┐
│                       CPU Core                        │
│   ┌───────────────────────────────────────────────┐   │
│   │ Registers (bytes, <1ns)                       │   │
│   └───────────────────────────────────────────────┘   │
│   ┌───────────────────────────────────────────────┐   │
│   │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)   │   │
│   └───────────────────────────────────────────────┘   │
│   ┌───────────────────────────────────────────────┐   │
│   │ L2 Cache: 256-512 KB, ~3-4ns                  │   │
│   └───────────────────────────────────────────────┘   │
└───────────────────────────┬───────────────────────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │ L3 Cache: 8-64 MB, ~10-12ns       │  (shared between cores)
          └─────────────────┬─────────────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │ Main RAM: GBs, ~50-100ns          │
          └───────────────────────────────────┘
```

### Typical Numbers (Desktop CPU, 2024)

| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |

## Cache Lines: The Unit of Transfer

Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).

```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
         └──────────────── 64 bytes = 1 cache line ─────────────────────┘

If you access address 0x1020:
- CPU fetches the entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```

This is why **sequential access** is so much faster than random access.
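
A quick way to see the effect for yourself (a sketch in the spirit of the "Gallery of Processor Cache Effects" linked below; timings depend entirely on your machine): a loop that reads one `int` per 64-byte line does 1/16th of the additions of a full traversal, yet runs nowhere near 16x faster, because both loops pull in the same cache lines.

```c
/* Sketch: stride-1 vs stride-16 traversal of an array much larger than L3.
 * Both loops touch every cache line, so their runtimes are similar even
 * though the second does 1/16th of the work. (Hypothetical harness.) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)            /* 256 MB of ints */

static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1;      /* touch every page first */

    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += 1)  sum += a[i];    /* every element      */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 1:  %.3fs\n", elapsed(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += 16) sum += a[i];    /* one per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 16: %.3fs  (sum=%ld)\n", elapsed(t0, t1), sum);

    free(a);
    return 0;
}
```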

## Spatial and Temporal Locality

Caches exploit two patterns in real programs:

### Spatial Locality

"If you accessed address X, you'll probably access X+1 soon."

```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];              // Next element is in same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()]; // Each access misses cache
}
```

### Temporal Locality

"If you accessed address X, you'll probably access X again soon."

```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x;          // x stays in registers
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];          // Array evicted from cache, refetch
}
```

## Why Random Access Kills Performance

### Example: Array vs Linked List

```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │   All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked List (scattered memory):
┌───┐         ┌───┐         ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘         └───┘         └───┘
  ↑             ↑             ↑
0x1000        0x5420        0x2108
Each node in a different cache line!
```

Traversing a scattered linked list causes a **cache miss per node**.

### Real Numbers

```
Array traversal:   ~0.004 seconds  (10M elements)
Sequential list:   ~0.018 seconds  (4.5x slower)
Scattered list:    ~1.400 seconds  (350x slower!)
```

The scattered list is O(n) just like the array, but the constant factor is 350x worse.
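
A sketch of how numbers like these can be measured (hypothetical harness; the figures above are the author's and will vary by machine). The shuffled index chain stands in for a heap-scattered linked list: every hop lands on an unpredictable cache line, so the prefetcher cannot help.

```c
/* Build a randomly-ordered index chain over 10M elements and traverse it by
 * chasing indices. Time this against the sequential sum from the stride
 * sketch above to reproduce the array-vs-scattered-list gap. */
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void) {
    int    *value = malloc((size_t)N * sizeof *value);
    size_t *next  = malloc((size_t)N * sizeof *next);
    size_t *perm  = malloc((size_t)N * sizeof *perm);

    for (size_t i = 0; i < N; i++) { value[i] = 1; perm[i] = i; }

    /* Fisher-Yates shuffle: traversal order becomes effectively random
     * (uses rand() for brevity; fine as a sketch on glibc). */
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i + 1 < N; i++) next[perm[i]] = perm[i + 1];
    next[perm[N - 1]] = perm[0];

    /* Pointer chase: each iteration depends on the previous load. */
    long sum = 0;
    size_t cur = perm[0];
    for (size_t i = 0; i < N; i++) { sum += value[cur]; cur = next[cur]; }
    printf("sum = %ld\n", sum);

    free(value); free(next); free(perm);
    return 0;
}
```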

## Pipeline Stalls: Why the CPU Can't Hide Latency

Modern CPUs execute many instructions simultaneously:

```
Pipeline (simplified):
Cycle:     1    2    3    4    5    6    7    8
Fetch:    [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:        [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:            [A]  [B]  [C]  [D]  [E]  [F]
Memory:                  [A]  [B]  [C]  [D]  [E]
Write:                        [A]  [B]  [C]  [D]
```

But what happens when instruction C needs data from memory?

```
Cycle:     1    2    3    4    5   ...  200  201  202
Fetch:    [A]  [B]  [C]  [C]  [C]  ...  [C]  [D]  [E]
Decode:        [A]  [B]  [C]  [C]  ...  [C]  [C]  [D]
Execute:            [A]  [B]  waiting for memory...
Memory:                  [A]  [B]  ...       ...  [C]
Write:                        [A]  [B]  ...       [C]
                               ↑
                        STALL! Pipeline bubbles
```

The CPU stalls for **~200 cycles** waiting for RAM. In those 200 cycles a superscalar core could have executed hundreds of instructions.

### Out-of-Order Execution Helps (But Not Enough)

CPUs can execute later instructions while waiting:

```c
a = array[i];   // Cache miss, stall...
b = x + y;      // Can execute while waiting!
c = b * 2;      // Can execute while waiting!
d = a + 1;      // Must wait for 'a'
```

But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
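
One quick way to see this is instructions per cycle (IPC): `perf stat` reports it when counting cycles and instructions together, and a core that could retire several instructions per cycle but shows IPC well below 1 is mostly waiting on memory. The output lines below are illustrative, not from a real run:

```bash
perf stat -e cycles,instructions ./program
#   3,412,889,120   cycles
#   1,203,456,789   instructions   #  0.35  insn per cycle   <- memory-bound
```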

## The Prefetcher: CPU Tries to Help

Modern CPUs detect sequential access patterns and fetch data **before you ask**:

```
Your code accesses:  [0] [1] [2] [3] [4] ...
Prefetcher fetches:                  [5] [6] [7] [8] ...   (ahead of you!)
```

But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect

```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];          // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]]; // Random indices, can't predict
}
```

## Row-Major vs Column-Major

C stores 2D arrays in row-major order:

```
int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└────── row 0 ─────┘ └────── row 1 ─────┘ └────── row 2 ─────┘
```

### Row-Major Access (Cache-Friendly)

```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];   // Sequential in memory!
    }
}
```

Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.

### Column-Major Access (Cache-Hostile)

```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];   // Jumps by COLS each time!
    }
}
```

Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.

If COLS = 8192, each access jumps 32 KB - far beyond any cache line!

**Result**: Column-major access can be **10-50x slower** for large matrices.

## False Sharing: The Multithreaded Trap

Cache coherency means cores must agree on cache line contents.

```
Thread 1 (Core 0): counter1++  ┐
Thread 2 (Core 1): counter2++  ├── Both in the same cache line!
                               ┘
┌──────────────────────────────────────────────────────────────┐
│ Cache line:  [counter1] [counter2] [padding.................] │
└──────────────────────────────────────────────────────────────┘
```

When Thread 1 writes counter1, Core 1's copy of the cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.

**Fix**: Pad data to separate cache lines:

```c
struct {
    long counter;
    char padding[64 - sizeof(long)];   // Pad to 64 bytes
} counters[NUM_THREADS];
```
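
An equivalent fix, if your compiler supports C11, is to let the alignment requirement create the padding (this sketch assumes a 64-byte cache line, as above; `NUM_THREADS` is the same constant used in the struct version):

```c
#include <stdalign.h>

/* alignas(64) forces 64-byte alignment, so the struct is padded to 64 bytes
 * and each counter lives on its own cache line. */
struct padded_counter {
    alignas(64) long counter;
};

struct padded_counter counters[NUM_THREADS];
```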

## NUMA: When Memory Has Geography

On multi-socket systems, memory is "closer" to some CPUs:

```
┌─────────────────────────────────────────────────────────────┐
│    Socket 0                             Socket 1            │
│  ┌─────────────┐                     ┌─────────────┐        │
│  │  Core 0-7   │                     │  Core 8-15  │        │
│  └──────┬──────┘                     └──────┬──────┘        │
│         │                                   │               │
│  ┌──────┴──────┐     interconnect    ┌──────┴──────┐        │
│  │   RAM 0     │ ←─────────────────→ │   RAM 1     │        │
│  │   (local)   │       (slow)        │  (remote)   │        │
│  └─────────────┘                     └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
```

Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
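
On Linux you can inspect the topology and keep a process on one node with `numactl` (assuming the tool is installed), so its allocations come from local RAM:

```bash
numactl --hardware                              # show nodes, sizes, distances
numactl --cpunodebind=0 --membind=0 ./program   # run pinned to node 0
```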

## Measuring Cache Behavior

```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```

## Summary

| Pattern | Cache Behavior | Performance |
|---------|----------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |

**Key Takeaways**:

1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing

## Further Reading

- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)

docs/HOW-SAMPLING-PROFILERS-WORK.md (new file, 211 lines)
@@ -0,0 +1,211 @@

# How Sampling Profilers Work

## The Core Idea

Sampling profilers answer the question: **"Where is my program spending time?"**

Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.

```
Program execution:  ████████████████████████████████████████
                      ↑      ↑      ↑      ↑      ↑      ↑
                    sample sample sample sample sample  ...
```

## Sampling vs Instrumentation

| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |

Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.
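
For contrast, the instrumentation workflow looks roughly like this (`program.c` is a placeholder name):

```bash
gcc -pg -O2 -o program program.c   # compiler inserts counting code
./program                          # run once; writes gmon.out
gprof ./program gmon.out | head    # exact call counts and flat profile
```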

## How perf Does It

### 1. Hardware Performance Counters (PMU)

Modern CPUs have Performance Monitoring Units (PMUs) with special registers:

```
┌─────────────────────────────────────────┐
│                   CPU                   │
│   ┌─────────────────────────────────┐   │
│   │  Performance Monitoring Unit    │   │
│   │   ┌─────────┐   ┌─────────┐     │   │
│   │   │ Counter │   │ Counter │ ... │   │
│   │   │ cycles  │   │ instrs  │     │   │
│   │   └─────────┘   └─────────┘     │   │
│   │         ↓ overflow interrupt    │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

When you run `perf record` (a minimal sketch of the same setup follows the list):

1. perf programs a PMU counter to count CPU cycles
2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. Overflow triggers a **Non-Maskable Interrupt (NMI)**
4. Kernel handler records: instruction pointer, process ID, timestamp
5. Optionally: walks the stack to get the call chain
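
A minimal sketch of the same setup using the raw `perf_event_open(2)` syscall - roughly what perf does internally, minus the ring-buffer handling (Linux only; error paths trimmed):

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* glibc has no wrapper for this syscall */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size           = sizeof attr;
    attr.type           = PERF_TYPE_HARDWARE;
    attr.config         = PERF_COUNT_HW_CPU_CYCLES;     /* sample on cycles   */
    attr.freq           = 1;
    attr.sample_freq    = 99;                            /* ~99 samples/second */
    attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload; samples land in a ring buffer that a real
     *     profiler would mmap() from this fd and drain ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}
```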

### 2. The Sampling Frequency

```bash
perf record -F 99 ./program     # 99 samples per second
perf record -F 9999 ./program   # 9999 samples per second
```

Higher frequency = more samples = better accuracy, but more overhead.

**Why 99 and not 100?** A frequency just off the round number avoids sampling in lockstep with periodic activity in your program (like a 100 Hz timer), which would bias the profile toward whatever runs on those ticks.

### 3. What Gets Recorded

Each sample contains:
- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses

```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
           callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
           callchain: main → process_data → compute_inner
...
```

## Symbol Resolution

Raw samples contain memory addresses. To show function names, perf needs:

1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (`-g`): Map addresses to source lines

```
Without symbols:   45.23%  0x00000000004011a0
With symbols:      45.23%  compute_inner
With debug info:   45.23%  compute_inner (program.c:28)
```

This is why `perf report` needs access to the same binary you profiled.
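
You can do the same mapping by hand with binutils; the address here is the illustrative one from above:

```bash
nm ./program | grep ' T '              # symbol table: address -> function name
addr2line -f -e ./program 0x4011a0     # with debug info: function + file:line
```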

## Call Graph Collection

With `perf record -g`, perf records the call stack for each sample.

### Frame Pointer Walking (Traditional)

```
Stack Memory:
┌──────────────┐
│ return addr  │ ← where to return after current function
│ saved RBP    │ ← pointer to previous frame
├──────────────┤
│ local vars   │
├──────────────┤
│ return addr  │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ...          │
└──────────────┘
```

Walk the chain of frame pointers to reconstruct the call stack (a minimal sketch below).
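
A sketch of that walk on x86-64, assuming the binary keeps frame pointers (see the caveat that follows); `print_backtrace` is a hypothetical helper, not a perf API:

```c
#include <stdio.h>

/* x86-64 frame layout with frame pointers: [rbp] = caller's saved RBP,
 * [rbp+8] = return address into the caller. */
struct frame {
    struct frame *prev;
    void         *ret;
};

void print_backtrace(void) {
    struct frame *fp = __builtin_frame_address(0);   /* current RBP (GCC/Clang) */
    while (fp && fp->ret) {                          /* not robust against missing frames */
        printf("  called from %p\n", fp->ret);
        fp = fp->prev;                               /* hop to the previous frame */
    }
}
```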

**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: compile with `-fno-omit-frame-pointer` or use DWARF unwinding.

### DWARF Unwinding

Uses DWARF call-frame information (the `.eh_frame` section, present even without `-g`) to unwind without frame pointers. More reliable, but slower, and it copies a chunk of the stack with every sample.

```bash
perf record --call-graph dwarf ./program
```

## Statistical Nature

Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.

**Law of Large Numbers**: More samples = closer to the true distribution.

```
  100 samples:  A: 8-12%      (high variance)
 1000 samples:  A: 9-11%      (better)
10000 samples:  A: 9.8-10.2%  (quite accurate)
```

This is why short-running programs need higher sampling frequency.
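
Those ranges follow from the standard error of a proportion - a rough back-of-envelope, assuming samples are independent:

```
standard error ≈ sqrt(p(1-p)/n)    with p = 0.10 (true share of time in A)

n = 100      →  ≈ 3.0%
n = 1,000    →  ≈ 0.95%
n = 10,000   →  ≈ 0.3%
```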

## Limitations

### 1. Short Functions Miss Samples

If a function runs for less time than the sampling interval, it might not get sampled at all.

```
Sampling interval: ──────────────────────────────────
Function A:        ██                                   (might miss!)
Function B:        ████████████████████████████████     (definitely hit)
```

### 2. Inlined Functions Disappear

When the compiler inlines a function, it no longer exists as a separate entity:

```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }

// After inlining - square() disappears from the profile
int compute(int x) { return x * x + 1; }
```

With debug info, perf can sometimes recover inline information.

### 3. Sampling Bias

Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks)
- Interrupt handlers

### 4. Observer Effect

Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to the buffer takes cycles

Usually <5%, but it can affect extremely performance-sensitive code.

## perf Events

perf can sample on different events, not just CPU cycles:

```bash
perf record -e cycles ./program         # CPU cycles (default)
perf record -e instructions ./program   # Instructions retired
perf record -e cache-misses ./program   # Cache misses
perf record -e branch-misses ./program  # Branch mispredictions
```

This lets you answer "where do cache misses happen?", not just "where is time spent?"

## Summary

1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at a configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production

## Further Reading

- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)

docs/HOW-TRACING-WORKS.md (new file, 265 lines)
@@ -0,0 +1,265 @@

# How Tracing Works: strace, bpftrace, and eBPF

## What is Tracing?

While sampling answers "where is time spent?", tracing answers "what happened?"

Tracing captures **every occurrence** of specific events:
- Every syscall
- Every function call
- Every network packet
- Every disk I/O

## strace: The Simple Way

### How ptrace Works

strace uses the `ptrace()` syscall - the same mechanism debuggers use.

```
┌─────────────────┐           ┌─────────────────┐
│  Your Program   │           │     strace      │
│                 │           │                 │
│  1. syscall     │──────────→│  2. STOP!       │
│     ║           │  SIGTRAP  │     inspect     │
│     ║ (paused)  │           │     log         │
│     ║           │←──────────│  3. continue    │
│  4. resume      │  PTRACE   │                 │
│                 │   CONT    │                 │
└─────────────────┘           └─────────────────┘
```

Step by step (a minimal sketch follows the list):

1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME`
2. **Trap on syscall**: strace requests `PTRACE_SYSCALL` - the kernel stops the tracee at each syscall
3. **Inspect**: strace reads registers to see the syscall number and arguments
4. **Continue**: strace resumes the tracee with `ptrace(PTRACE_SYSCALL)` so it stops again at the next syscall boundary
5. **Repeat**: The kernel stops the tracee again when the syscall returns, and strace reads the return value
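
A minimal strace-like loop, assuming x86-64 Linux and a command passed on the command line (error handling omitted; real strace decodes arguments and handles many more stop reasons):

```c
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    pid_t pid = fork();
    if (pid == 0) {                                   /* child = tracee */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);                    /* stops with SIGTRAP on exec */
        return 127;
    }

    int status;
    waitpid(pid, &status, 0);                         /* initial stop after exec */
    while (1) {
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);      /* run to next syscall stop */
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            break;
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);     /* inspect registers */
        printf("syscall %llu\n", regs.orig_rax);      /* fires on entry and exit */
    }
    return 0;
}
```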

### Why strace is Slow

Each syscall triggers **two stops** (entry and exit), and each stop means switching over to strace and back:

```
Normal syscall:
User → Kernel → User

With strace:
User → Kernel → strace → Kernel → strace → Kernel → User
                  ↑                  ↑
             entry stop          exit stop
```

Overhead can be **10-100x** for syscall-heavy programs!

```bash
# Normal
time ./read_fast testfile             # 0.01s

# With strace
time strace -c ./read_fast testfile   # 0.5s (50x slower!)
```

### When to Use strace

Despite the overhead, strace is invaluable for:
- Debugging "why won't this program start?"
- Finding which files a program opens
- Understanding program behavior
- One-off investigation (not production)

```bash
strace -e openat ./program    # What files does it open?
strace -e connect ./program   # What network connections?
strace -c ./program           # Syscall summary
```

## eBPF: The Fast Way

### What is eBPF?

eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**.

```
┌─────────────────────────────────────────────────────┐
│                       Kernel                        │
│   ┌──────────────────────────────────────────────┐  │
│   │  Your eBPF Program                           │  │
│   │  - Runs at kernel speed                      │  │
│   │  - No context switches                       │  │
│   │  - Verified for safety                       │  │
│   └──────────────────────────────────────────────┘  │
│                     ↓ attach to                     │
│   ┌──────────────────────────────────────────────┐  │
│   │  Tracepoints, Kprobes, Uprobes, USDT         │  │
│   └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                  ↓ results via
┌─────────────────────────────────────────────────────┐
│                     User Space                      │
│          Maps, ring buffers, perf events            │
└─────────────────────────────────────────────────────┘
```

### The eBPF Verifier

Before your eBPF program runs, the kernel **verifies** it:
- No infinite loops
- No out-of-bounds memory access
- No unsafe operations
- Bounded execution time

This makes eBPF safe to run on production systems: a buggy tracing program is rejected at load time instead of crashing the kernel (loading one generally still requires root or `CAP_BPF`).

### Attachment Points

eBPF can attach to various kernel hooks:

| Type | What it traces | Example |
|------|----------------|---------|
| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` |
| **Kprobes** | Any kernel function | `kprobe:do_sys_open` |
| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` |
| **USDT** | User-defined static probes | `usdt:./server:myapp:request` |

### Why eBPF is Fast

```
strace (ptrace):
Process stops → context switch → strace reads → context switch → resume

eBPF:
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
```

eBPF overhead is typically **<1%** even for frequent events.

## bpftrace: eBPF Made Easy

bpftrace is a high-level language for eBPF, like awk for tracing.

### Basic Syntax

```
probe /filter/ { action }
```

### Examples

```bash
# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s opened %s\n", comm, str(args->filename));
}'

# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @bytes = hist(args->ret);
}'

# Latency of disk I/O
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
             kprobe:blk_account_io_done /@start[arg0]/ {
                 @usecs = hist((nsecs - @start[arg0]) / 1000);
                 delete(@start[arg0]);
             }'
```

### bpftrace Built-in Variables

| Variable | Meaning |
|----------|---------|
| `pid` | Process ID |
| `tid` | Thread ID |
| `comm` | Process name |
| `nsecs` | Nanosecond timestamp |
| `arg0-argN` | Function arguments |
| `retval` | Return value |
| `probe` | Current probe name |

### bpftrace Aggregations

```bash
@x = count()                    # Count events
@x = sum(value)                 # Sum values
@x = avg(value)                 # Average
@x = min(value)                 # Minimum
@x = max(value)                 # Maximum
@x = hist(value)                # Power-of-2 histogram
@x = lhist(v, min, max, step)   # Linear histogram
```

## Comparison: When to Use What

| Tool | Overhead | Setup | Use Case |
|------|----------|-------|----------|
| **strace** | High (10-100x) | Zero | Quick debugging, non-production |
| **ltrace** | High | Zero | Library call tracing |
| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis |
| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events |

### Decision Tree

```
Need to trace events?
├── Quick one-off debugging?
│   └── strace (easy, but slow)
├── Production system?
│   └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│   └── USDT + bpftrace
└── CPU profiling?
    └── perf record
```

## USDT: User Statically Defined Tracing

USDT probes are markers you add to your code:

```c
#include <sys/sdt.h>

void handle_request(int id) {
    DTRACE_PROBE1(myserver, request_start, id);
    // ... handle request ...
    DTRACE_PROBE1(myserver, request_end, id);
}
```

Then trace with bpftrace:

```bash
bpftrace -e 'usdt:./server:myserver:request_start {
                 @start[arg0] = nsecs;
             }
             usdt:./server:myserver:request_end /@start[arg0]/ {
                 @latency = hist((nsecs - @start[arg0]) / 1000);
                 delete(@start[arg0]);
             }'
```

**Advantages of USDT**:
- Zero overhead when not tracing
- Stable interface (unlike kprobes)
- Access to application-level data

## Summary

| Mechanism | How it works | Speed | Safety |
|-----------|--------------|-------|--------|
| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe |
| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified |
| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |

## Further Reading

- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md)
- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md)
- [eBPF documentation](https://ebpf.io/what-is-ebpf/)
- [strace source code](https://github.com/strace/strace) - surprisingly readable!