illustris 2026-01-11 12:04:36 +05:30
parent c8f56cf3f1
commit 1432bdaff9
Signed by: illustris
GPG Key ID: 56C8FC0B899FEFA3
3 changed files with 803 additions and 0 deletions


@ -0,0 +1,327 @@
# CPU Caches and Memory: Why Access Patterns Matter
## The Memory Wall
CPUs are fast. Memory is slow. This gap is called the "memory wall."
```
Relative Speed
══════════════
CPU registers ████████████████████████████████ (~1 cycle)
L1 cache ██████████████████████ (~4 cycles)
L2 cache ████████████ (~12 cycles)
L3 cache ██████ (~40 cycles)
Main RAM █ (~200 cycles)
SSD (~10,000 cycles)
HDD (~10,000,000 cycles)
```
A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.
## The Cache Hierarchy
```
┌─────────────────────────────────────────────────────────────┐
│ CPU Core │
│ ┌─────────────────────────────────────────────────────┐ │
│  │ Registers: bytes of storage, <1ns                   │   │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d) │ │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L2 Cache: 256-512 KB, ~3-4ns │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┴─────────────────┐
│ L3 Cache: 8-64 MB, ~10-12ns │ (shared between cores)
└─────────────────┬─────────────────┘
┌─────────────────┴─────────────────┐
│ Main RAM: GBs, ~50-100ns │
└───────────────────────────────────┘
```
### Typical Numbers (Desktop CPU, 2024)
| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |
## Cache Lines: The Unit of Transfer
Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).
```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
└──────────────── 64 bytes = 1 cache line ────────────────────┘
If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```
This is why **sequential access** is so much faster than random access.
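The 64-byte line size is typical for current x86-64 and many ARM cores, but it is not universal. On Linux with glibc you can ask at runtime; a minimal sketch, assuming the `_SC_LEVEL1_DCACHE_LINESIZE` sysconf extension is available:
```c
/* Query the L1 data cache line size at runtime (Linux/glibc extension).
 * Falls back to 64 bytes if the kernel/libc doesn't report it. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;  /* common value on x86-64 and many ARM cores */
    printf("L1d cache line: %ld bytes\n", line);
    return 0;
}
```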
## Spatial and Temporal Locality
Caches exploit two patterns in real programs:
### Spatial Locality
"If you accessed address X, you'll probably access X+1 soon."
```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
sum += array[i]; // Next element is in same cache line
}
// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
sum += array[random_index()]; // Each access misses cache
}
```
### Temporal Locality
"If you accessed address X, you'll probably access X again soon."
```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
x = array[i];
result += x * x + x; // x stays in registers
}
// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
sum += array[i];
}
for (int i = 0; i < N; i++) {
product *= array[i]; // Array evicted from cache, refetch
}
```
## Why Random Access Kills Performance
### Example: Array vs Linked List
```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘
Linked List (scattered memory):
┌───┐ ┌───┐ ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘ └───┘ └───┘
↑ ↑ ↑
0x1000 0x5420 0x2108
Each node in different cache line!
```
Traversing a scattered linked list causes a **cache miss per node**.
### Real Numbers
```
Array traversal: ~0.004 seconds (10M elements)
Sequential list: ~0.018 seconds (4.5x slower)
Scattered list: ~1.400 seconds (350x slower!)
```
The scattered list is O(n) just like the array, but the constant factor is 350x worse.
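For reference, a sketch of the kind of harness behind numbers like these - the element count, node layout, and timing helper are illustrative, and whether the list ends up sequential or scattered depends on the allocator (allocating nodes in a shuffled order approximates the scattered case):
```c
/* Sketch: sum 10M longs via a contiguous array vs. a linked list.
 * Not the original benchmark; times will vary by machine and allocator. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

struct node { long value; struct node *next; };

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* Array: one contiguous block, traversed sequentially. */
    long *array = malloc(N * sizeof *array);
    for (long i = 0; i < N; i++) array[i] = i;
    double t0 = seconds();
    long sum = 0;
    for (long i = 0; i < N; i++) sum += array[i];
    printf("array: %.3fs (sum=%ld)\n", seconds() - t0, sum);

    /* List: same values, one malloc per node, traversal chases pointers.
     * A fresh heap often hands nodes out nearly sequentially; freeing and
     * reallocating other objects in between scatters them. */
    struct node *head = NULL;
    for (long i = N - 1; i >= 0; i--) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    t0 = seconds();
    sum = 0;
    for (struct node *n = head; n; n = n->next) sum += n->value;
    printf("list:  %.3fs (sum=%ld)\n", seconds() - t0, sum);
    return 0;
}
```
Compile with optimization (e.g. `gcc -O2`) so the array loop isn't penalized by debug codegen.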
## Pipeline Stalls: Why the CPU Can't Hide Latency
Modern CPUs execute many instructions simultaneously:
```
Pipeline (simplified):
Cycle: 1 2 3 4 5 6 7 8
Fetch: [A] [B] [C] [D] [E] [F] [G] [H]
Decode: [A] [B] [C] [D] [E] [F] [G]
Execute: [A] [B] [C] [D] [E] [F]
Memory: [A] [B] [C] [D] [E]
Write: [A] [B] [C] [D]
```
But what happens when instruction C needs data from memory?
```
Cycle: 1 2 3 4 5 ... 200 201 202
Fetch: [A] [B] [C] [C] [C] ... [C] [D] [E]
Decode: [A] [B] [C] [C] ... [C] [C] [D]
Execute: [A] [B] waiting for memory...
Memory: [A] [B] ... ... [C]
Write: [A] [B] ... [C]
STALL! Pipeline bubbles
```
The CPU stalls for **~200 cycles** waiting for RAM. Those 200 cycles could have executed 200+ instructions.
### Out-of-Order Execution Helps (But Not Enough)
CPUs can execute later instructions while waiting:
```c
a = array[i]; // Cache miss, stall...
b = x + y; // Can execute while waiting!
c = b * 2; // Can execute while waiting!
d = a + 1; // Must wait for 'a'
```
But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
## The Prefetcher: CPU Tries to Help
Modern CPUs detect sequential access patterns and fetch data **before you ask**:
```
Your code accesses: [0] [1] [2] [3] [4] ...
Prefetcher fetches: [5] [6] [7] [8] ... (ahead of you!)
```
But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect
```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
sum += array[i]; // Prefetcher fetches ahead
}
// Prefetcher loses
for (int i = 0; i < N; i++) {
sum += array[indices[i]]; // Random indices, can't predict
}
```
## Row-Major vs Column-Major
C stores 2D arrays in row-major order:
```
int matrix[3][4];
Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘
```
### Row-Major Access (Cache-Friendly)
```c
for (int i = 0; i < ROWS; i++) {
for (int j = 0; j < COLS; j++) {
sum += matrix[i][j]; // Sequential in memory!
}
}
```
Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.
### Column-Major Access (Cache-Hostile)
```c
for (int j = 0; j < COLS; j++) {
for (int i = 0; i < ROWS; i++) {
sum += matrix[i][j]; // Jumps by COLS each time!
}
}
```
Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.
If COLS=8192, each access jumps 32KB - far beyond any cache line!
**Result**: Column-major can be **10-50x slower** for large matrices.
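A self-contained version of the two loops above, timed back to back; the 8192×8192 size (256 MB of ints) and the timing helper are illustrative choices, not taken from a specific benchmark:
```c
/* Sketch: row-major vs column-major traversal of a large int matrix. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 8192
#define COLS 8192

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int (*matrix)[COLS] = calloc(ROWS, sizeof *matrix);  /* zero-filled */
    long sum;
    double t0;

    t0 = seconds();
    sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += matrix[i][j];   /* row-major: walks memory sequentially */
    printf("row-major:    %.3fs (sum=%ld)\n", seconds() - t0, sum);

    t0 = seconds();
    sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += matrix[i][j];   /* column-major: strides 32 KB per access */
    printf("column-major: %.3fs (sum=%ld)\n", seconds() - t0, sum);

    free(matrix);
    return 0;
}
```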
## False Sharing: The Multithreaded Trap
Cache coherency means cores must agree on cache line contents.
```
Thread 1 (Core 0): counter1++ ┐
Thread 2 (Core 1): counter2++ ├── Both in same cache line!
┌────────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding.................] │
└────────────────────────────────────────────────────────────────┘
```
When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.
**Fix**: Pad data to separate cache lines:
```c
struct {
long counter;
char padding[64 - sizeof(long)]; // Pad to 64 bytes
} counters[NUM_THREADS];
```
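A runnable sketch of the same idea using C11 `_Alignas(64)` instead of a manual padding array - the alignment both sizes each slot to a full line and guarantees it starts on a line boundary. The 64-byte line size, thread count, and iteration count are assumptions for illustration:
```c
/* Sketch: per-thread counters, each on its own (assumed 64-byte) cache line. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERS 100000000L

struct padded_counter {
    _Alignas(64) volatile long value;  /* aligned + padded to 64 bytes;
                                          volatile keeps the compiler from
                                          collapsing the loop into one store */
};

static struct padded_counter counters[NUM_THREADS];

static void *worker(void *arg) {
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->value++;                    /* no other thread touches this line */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, &counters[t]);

    long total = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += counters[t].value;
    }
    printf("total = %ld\n", total);
    return 0;
}
```
Build with `gcc -O2 -pthread`; replacing the struct with a plain `long counters[NUM_THREADS]` packs all counters into one or two cache lines and makes the false-sharing slowdown easy to observe.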
## NUMA: When Memory Has Geography
On multi-socket systems, memory is "closer" to some CPUs:
```
┌─────────────────────────────────────────────────────────────┐
│ Socket 0 Socket 1 │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Core 0-7 │ │ Core 8-15 │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────┴──────┐ interconnect ┌──────┴──────┐ │
│ │ RAM 0 │ ←────────────→ │ RAM 1 │ │
│ │ (local) │ (slow) │ (remote) │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
## Measuring Cache Behavior
```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program
# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```
## Summary
| Pattern | Cache Behavior | Performance |
|---------|---------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |
**Key Takeaways**:
1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing
## Further Reading
- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)


@ -0,0 +1,211 @@
# How Sampling Profilers Work
## The Core Idea
Sampling profilers answer the question: **"Where is my program spending time?"**
Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.
```
Program execution: ████████████████████████████████████████
↑ ↑ ↑ ↑ ↑ ↑ ↑
sample sample sample ...
```
## Sampling vs Instrumentation
| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |
Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.
## How perf Does It
### 1. Hardware Performance Counters (PMU)
Modern CPUs have Performance Monitoring Units (PMUs) with special registers:
```
┌─────────────────────────────────────────┐
│ CPU │
│ ┌─────────────────────────────────┐ │
│ │ Performance Monitoring Unit │ │
│ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Counter │ │ Counter │ ... │ │
│ │ │ cycles │ │ instrs │ │ │
│ │ └─────────┘ └─────────┘ │ │
│ │ ↓ overflow interrupt │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
When you run `perf record` (a user-space sketch of this setup follows the list):
1. perf programs a PMU counter to count CPU cycles
2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. Overflow triggers a **Non-Maskable Interrupt (NMI)**
4. Kernel handler records: instruction pointer, process ID, timestamp
5. Optionally: walks the stack to get the call chain
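The same machinery is reachable from user space through the `perf_event_open` syscall. A minimal, hedged sketch that requests 99 Hz cycle sampling on the current process - real perf goes on to mmap a ring buffer and decode the samples, which is omitted here:
```c
/* Sketch: configure a cycles counter for 99 Hz sampling via perf_event_open. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* overflow on cycle count */
    attr.freq = 1;                            /* interpret sample_freq in Hz */
    attr.sample_freq = 99;                    /* ~99 samples per second */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                       PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN;
    attr.exclude_kernel = 1;                  /* user-space samples only */
    attr.disabled = 1;                        /* enable later via ioctl */

    /* pid=0, cpu=-1: this process on any CPU; no group leader, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");  /* often blocked by perf_event_paranoid */
        return 1;
    }
    printf("sampling fd = %d; samples would arrive in an mmap'd ring buffer\n", fd);
    close(fd);
    return 0;
}
```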
### 2. The Sampling Frequency
```bash
perf record -F 99 ./program # 99 samples per second
perf record -F 9999 ./program # 9999 samples per second
```
Higher frequency = more samples = better accuracy, but more overhead.
**Why 99 and not 100?** Sampling at a slightly off-round frequency keeps the profiler from running in lockstep with periodic activity in your program (like a 100Hz timer), which would bias every sample toward the same phase of that activity.
### 3. What Gets Recorded
Each sample contains:
- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses
```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
callchain: main → process_data → compute_inner
...
```
## Symbol Resolution
Raw samples contain memory addresses. To show function names, perf needs:
1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (`-g`): Map addresses to source lines
```
Without symbols: 45.23% 0x00000000004011a0
With symbols: 45.23% compute_inner
With debug info: 45.23% compute_inner (program.c:28)
```
This is why `perf report` needs access to the same binary you profiled.
## Call Graph Collection
With `perf record -g`, perf records the call stack for each sample.
### Frame Pointer Walking (Traditional)
```
Stack Memory:
┌──────────────┐
│ return addr │ ← where to return after current function
│ saved RBP │ ← pointer to previous frame
├──────────────┤
│ local vars │
├──────────────┤
│ return addr │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ... │
└──────────────┘
```
Walk the chain of frame pointers to reconstruct the call stack.
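A minimal sketch of that walk for the current thread on x86-64, assuming the code was built with frame pointers (e.g. `gcc -O0` or `-fno-omit-frame-pointer`); the fixed depth limit stands in for the stack-bounds checks a real unwinder performs before dereferencing each frame pointer:
```c
/* Sketch: walk the saved-RBP chain of the calling thread (x86-64, GCC/Clang).
 * Frame layout with frame pointers: fp[0] = caller's saved RBP,
 *                                   fp[1] = return address into the caller. */
#include <stdint.h>
#include <stdio.h>

static void print_backtrace(void) {
    uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
    /* Walk only the frames this program created (print_backtrace, inner,
     * outer, main); a real unwinder validates fp against the stack bounds. */
    for (int depth = 0; fp != NULL && depth < 4; depth++) {
        printf("  #%d return address %p\n", depth, (void *)fp[1]);
        fp = (uintptr_t *)fp[0];   /* hop to the caller's frame */
    }
}

static void inner(void) { print_backtrace(); }
static void outer(void) { inner(); }

int main(void) {
    outer();
    return 0;
}
```
For a non-PIE build (`gcc -O0 -no-pie`), feeding those addresses to `addr2line -f -e ./a.out` turns them back into function names and source lines - essentially what `perf report` does during symbol resolution.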
**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding.
### DWARF Unwinding
Uses unwind tables (the `.eh_frame` section, which is present even in binaries built without `-g`) to reconstruct the stack without frame pointers. More reliable, but slower: perf copies a chunk of the user stack with every sample and unwinds it afterwards.
```bash
perf record --call-graph dwarf ./program
```
## Statistical Nature
Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.
**Law of Large Numbers**: more samples bring the measured share closer to the true one. For a function taking fraction p of the time, the spread over n samples is roughly sqrt(p(1-p)/n), so 10,000 samples pin a 10% function down to within a few tenths of a percent.
```
100 samples: A: 8-12% (high variance)
1000 samples: A: 9-11% (better)
10000 samples: A: 9.8-10.2% (quite accurate)
```
This is why short-running programs need higher sampling frequency.
## Limitations
### 1. Short Functions Miss Samples
If a function runs for less time than the sampling interval, it might not get sampled at all.
```
Sampling interval: ──────────────────────────────────
Function A: ██ (might miss!)
Function B: ████████████████████████████████ (definitely hit)
```
### 2. Inlined Functions Disappear
When the compiler inlines a function, it no longer exists as a separate entity:
```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }
// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
```
With debug info, perf can sometimes recover inline information.
### 3. Sampling Bias
Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks) - a blocked thread burns no CPU cycles, so it collects no cycle samples
- Interrupt handlers
### 4. Observer Effect
Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to buffer takes cycles
Usually <5%, but can affect extremely performance-sensitive code.
## perf Events
perf can sample on different events, not just CPU cycles:
```bash
perf record -e cycles ./program # CPU cycles (default)
perf record -e instructions ./program # Instructions retired
perf record -e cache-misses ./program # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
```
This lets you answer "where do cache misses happen?" not just "where is time spent?"
## Summary
1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production
## Further Reading
- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)

docs/HOW-TRACING-WORKS.md

@ -0,0 +1,265 @@
# How Tracing Works: strace, bpftrace, and eBPF
## What is Tracing?
While sampling answers "where is time spent?", tracing answers "what happened?"
Tracing captures **every occurrence** of specific events:
- Every syscall
- Every function call
- Every network packet
- Every disk I/O
## strace: The Simple Way
### How ptrace Works
strace uses the `ptrace()` syscall - the same mechanism debuggers use.
```
┌─────────────────┐ ┌─────────────────┐
│ Your Program │ │ strace │
│ │ │ │
│ 1. syscall │─────────→│ 2. STOP! │
│ ║ │ SIGTRAP │ inspect │
│ ║ (paused) │ │ log │
│ ║ │←─────────│ 3. continue │
│ 4. resume │ PTRACE │ │
│ │ CONT │ │
└─────────────────┘ └─────────────────┘
```
Step by step (a minimal sketch of this loop follows the list):
1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME`
2. **Trap on syscall**: strace sets `PTRACE_SYSCALL` - kernel stops tracee at each syscall
3. **Inspect**: strace reads registers to see syscall number and arguments
4. **Continue**: strace resumes the tracee with another `ptrace(PTRACE_SYSCALL, ...)` (not `PTRACE_CONT`, which would skip the syscall-exit stop)
5. **Repeat**: Kernel stops again when syscall returns, strace reads return value
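A stripped-down sketch of that loop for x86-64 Linux - the syscall number sits in `orig_rax`; real strace adds argument decoding, signal handling, and multi-process support:
```c
/* Sketch: print the raw syscall number of every syscall the child makes. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        exit(1);
    }

    int status;
    waitpid(child, &status, 0);                  /* child stops at exec */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to syscall entry */
        waitpid(child, &status, 0);
        if (WIFEXITED(status)) break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        fprintf(stderr, "syscall %llu\n", (unsigned long long)regs.orig_rax);

        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to syscall exit */
        waitpid(child, &status, 0);
    }
    return 0;
}
```
Running it against something like `ls` prints one number per syscall; compare the counts with `strace -c ls`.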
### Why strace is Slow
Each syscall stops the tracee twice, and every stop is a round trip of context switches between your program and strace:
- One on syscall entry
- One on syscall exit
```
Normal syscall:
User → Kernel → User
With strace:
User → Kernel → strace → Kernel → strace → Kernel → User
↑ ↑
entry stop exit stop
```
Overhead can be **10-100x** for syscall-heavy programs!
```bash
# Normal
time ./read_fast testfile # 0.01s
# With strace
time strace -c ./read_fast testfile # 0.5s (50x slower!)
```
### When to Use strace
Despite overhead, strace is invaluable for:
- Debugging "why won't this program start?"
- Finding which files a program opens
- Understanding program behavior
- One-off investigation (not production)
```bash
strace -e openat ./program # What files does it open?
strace -e connect ./program # What network connections?
strace -c ./program # Syscall summary
```
## eBPF: The Fast Way
### What is eBPF?
eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**.
```
┌─────────────────────────────────────────────────────┐
│ Kernel │
│ ┌──────────────────────────────────────────────┐ │
│ │ Your eBPF Program │ │
│ │ - Runs at kernel speed │ │
│ │ - No context switches │ │
│ │ - Verified for safety │ │
│ └──────────────────────────────────────────────┘ │
│ ↓ attach to │
│ ┌──────────────────────────────────────────────┐ │
│ │ Tracepoints, Kprobes, Uprobes, USDT │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
↓ results via
┌─────────────────────────────────────────────────────┐
│ User Space │
│ Maps, ring buffers, perf events │
└─────────────────────────────────────────────────────┘
```
### The eBPF Verifier
Before your eBPF program runs, the kernel **verifies** it:
- No infinite loops
- No out-of-bounds memory access
- No unsafe operations
- Bounded execution time
This makes eBPF safe to run on production systems - though loading programs still normally requires root or `CAP_BPF`.
### Attachment Points
eBPF can attach to various kernel hooks:
| Type | What it traces | Example |
|------|----------------|---------|
| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` |
| **Kprobes** | Any kernel function | `kprobe:do_sys_open` |
| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` |
| **USDT** | User-defined static probes | `usdt:./server:myapp:request` |
### Why eBPF is Fast
```
strace (ptrace):
Process stops → context switch → strace reads → context switch → resume
eBPF:
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
```
eBPF overhead is typically **<1%** even for frequent events.
## bpftrace: eBPF Made Easy
bpftrace is a high-level language for eBPF, like awk for tracing.
### Basic Syntax
```
probe /filter/ { action }
```
### Examples
```bash
# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
printf("%s opened %s\n", comm, str(args->filename));
}'
# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@bytes = hist(args->ret);
}'
# Latency of disk I/O
# (these kprobe targets are kernel-version dependent; on newer kernels that
#  dropped blk_start_request, use the block:block_rq_issue and
#  block:block_rq_complete tracepoints instead)
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
kprobe:blk_account_io_done /@start[arg0]/ {
@usecs = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
```
### bpftrace Built-in Variables
| Variable | Meaning |
|----------|---------|
| `pid` | Process ID |
| `tid` | Thread ID |
| `comm` | Process name |
| `nsecs` | Nanosecond timestamp |
| `arg0-argN` | Function arguments |
| `retval` | Return value |
| `probe` | Current probe name |
### bpftrace Aggregations
```bash
@x = count() # Count events
@x = sum(value) # Sum values
@x = avg(value) # Average
@x = min(value) # Minimum
@x = max(value) # Maximum
@x = hist(value) # Power-of-2 histogram
@x = lhist(v, min, max, step) # Linear histogram
```
## Comparison: When to Use What
| Tool | Overhead | Setup | Use Case |
|------|----------|-------|----------|
| **strace** | High (10-100x) | Zero | Quick debugging, non-production |
| **ltrace** | High | Zero | Library call tracing |
| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis |
| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events |
### Decision Tree
```
Need to trace events?
├── Quick one-off debugging?
│ └── strace (easy, but slow)
├── Production system?
│ └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│ └── USDT + bpftrace
└── CPU profiling?
└── perf record
```
## USDT: User Statically Defined Tracing
USDT probes are markers you add to your code:
```c
#include <sys/sdt.h>
void handle_request(int id) {
DTRACE_PROBE1(myserver, request_start, id);
// ... handle request ...
DTRACE_PROBE1(myserver, request_end, id);
}
```
Then trace with bpftrace:
```bash
bpftrace -e 'usdt:./server:myserver:request_start {
@start[arg0] = nsecs;
}
usdt:./server:myserver:request_end /@start[arg0]/ {
@latency = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
```
**Advantages of USDT**:
- Near-zero overhead when not tracing (each probe compiles to a single `nop` until it is enabled)
- Stable interface (unlike kprobes)
- Access to application-level data
## Summary
| Mechanism | How it works | Speed | Safety |
|-----------|--------------|-------|--------|
| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe |
| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified |
| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |
## Further Reading
- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md)
- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md)
- [eBPF documentation](https://ebpf.io/what-is-ebpf/)
- [strace source code](https://github.com/strace/strace) - surprisingly readable!