# How Tracing Works: strace, bpftrace, and eBPF ## What is Tracing? While sampling answers "where is time spent?", tracing answers "what happened?" Tracing captures **every occurrence** of specific events: - Every syscall - Every function call - Every network packet - Every disk I/O ## strace: The Simple Way ### How ptrace Works strace uses the `ptrace()` syscall - the same mechanism debuggers use. ``` ┌─────────────────┐ ┌─────────────────┐ │ Your Program │ │ strace │ │ │ │ │ │ 1. syscall │─────────→│ 2. STOP! │ │ ║ │ SIGTRAP │ inspect │ │ ║ (paused) │ │ log │ │ ║ │←─────────│ 3. continue │ │ 4. resume │ PTRACE │ │ │ │ CONT │ │ └─────────────────┘ └─────────────────┘ ``` Step by step: 1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME` 2. **Trap on syscall**: strace sets `PTRACE_SYSCALL` - kernel stops tracee at each syscall 3. **Inspect**: strace reads registers to see syscall number and arguments 4. **Continue**: strace calls `ptrace(PTRACE_CONT)` to resume 5. **Repeat**: Kernel stops again when syscall returns, strace reads return value ### Why strace is Slow Each syscall causes **two context switches** to strace: - One on syscall entry - One on syscall exit ``` Normal syscall: User → Kernel → User With strace: User → Kernel → strace → Kernel → strace → Kernel → User ↑ ↑ entry stop exit stop ``` Overhead can be **10-100x** for syscall-heavy programs! ```bash # Normal time ./read_fast testfile # 0.01s # With strace time strace -c ./read_fast testfile # 0.5s (50x slower!) ``` ### When to Use strace Despite overhead, strace is invaluable for: - Debugging "why won't this program start?" - Finding which files a program opens - Understanding program behavior - One-off investigation (not production) ```bash strace -e openat ./program # What files does it open? strace -e connect ./program # What network connections? strace -c ./program # Syscall summary ``` ## eBPF: The Fast Way ### What is eBPF? eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**. ``` ┌─────────────────────────────────────────────────────┐ │ Kernel │ │ ┌──────────────────────────────────────────────┐ │ │ │ Your eBPF Program │ │ │ │ - Runs at kernel speed │ │ │ │ - No context switches │ │ │ │ - Verified for safety │ │ │ └──────────────────────────────────────────────┘ │ │ ↓ attach to │ │ ┌──────────────────────────────────────────────┐ │ │ │ Tracepoints, Kprobes, Uprobes, USDT │ │ │ └──────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────┘ ↓ results via ┌─────────────────────────────────────────────────────┐ │ User Space │ │ Maps, ring buffers, perf events │ └─────────────────────────────────────────────────────┘ ``` ### The eBPF Verifier Before your eBPF program runs, the kernel **verifies** it: - No infinite loops - No out-of-bounds memory access - No unsafe operations - Bounded execution time This makes eBPF safe to run in production, even from untrusted users. ### Attachment Points eBPF can attach to various kernel hooks: | Type | What it traces | Example | |------|----------------|---------| | **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` | | **Kprobes** | Any kernel function | `kprobe:do_sys_open` | | **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` | | **USDT** | User-defined static probes | `usdt:./server:myapp:request` | ### Why eBPF is Fast ``` strace (ptrace): Process stops → context switch → strace reads → context switch → resume eBPF: Event fires → eBPF runs IN KERNEL → continue (no context switch!) ``` eBPF overhead is typically **<1%** even for frequent events. ## bpftrace: eBPF Made Easy bpftrace is a high-level language for eBPF, like awk for tracing. ### Basic Syntax ``` probe /filter/ { action } ``` ### Examples ```bash # Count syscalls by name bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' # Trace open() calls with filename bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s opened %s\n", comm, str(args->filename)); }' # Histogram of read() sizes bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes = hist(args->ret); }' # Latency of disk I/O bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; } kprobe:blk_account_io_done /@start[arg0]/ { @usecs = hist((nsecs - @start[arg0]) / 1000); delete(@start[arg0]); }' ``` ### bpftrace Built-in Variables | Variable | Meaning | |----------|---------| | `pid` | Process ID | | `tid` | Thread ID | | `comm` | Process name | | `nsecs` | Nanosecond timestamp | | `arg0-argN` | Function arguments | | `retval` | Return value | | `probe` | Current probe name | ### bpftrace Aggregations ```bash @x = count() # Count events @x = sum(value) # Sum values @x = avg(value) # Average @x = min(value) # Minimum @x = max(value) # Maximum @x = hist(value) # Power-of-2 histogram @x = lhist(v, min, max, step) # Linear histogram ``` ## Comparison: When to Use What | Tool | Overhead | Setup | Use Case | |------|----------|-------|----------| | **strace** | High (10-100x) | Zero | Quick debugging, non-production | | **ltrace** | High | Zero | Library call tracing | | **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis | | **perf** | Low (<5%) | Minimal | CPU profiling, hardware events | ### Decision Tree ``` Need to trace events? ├── Quick one-off debugging? │ └── strace (easy, but slow) ├── Production system? │ └── bpftrace/eBPF (fast, safe) ├── Custom application probes? │ └── USDT + bpftrace └── CPU profiling? └── perf record ``` ## USDT: User Statically Defined Tracing USDT probes are markers you add to your code: ```c #include void handle_request(int id) { DTRACE_PROBE1(myserver, request_start, id); // ... handle request ... DTRACE_PROBE1(myserver, request_end, id); } ``` Then trace with bpftrace: ```bash bpftrace -e 'usdt:./server:myserver:request_start { @start[arg0] = nsecs; } usdt:./server:myserver:request_end /@start[arg0]/ { @latency = hist((nsecs - @start[arg0]) / 1000); delete(@start[arg0]); }' ``` **Advantages of USDT**: - Zero overhead when not tracing - Stable interface (unlike kprobes) - Access to application-level data ## Summary | Mechanism | How it works | Speed | Safety | |-----------|--------------|-------|--------| | **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe | | **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified | | **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe | ## Further Reading - [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md) - [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md) - [eBPF documentation](https://ebpf.io/what-is-ebpf/) - [strace source code](https://github.com/strace/strace) - surprisingly readable!