266 lines
8.8 KiB
Markdown
266 lines
8.8 KiB
Markdown
# How Tracing Works: strace, bpftrace, and eBPF
|
|
|
|
## What is Tracing?
|
|
|
|
While sampling answers "where is time spent?", tracing answers "what happened?"
|
|
|
|
Tracing captures **every occurrence** of specific events:
|
|
- Every syscall
|
|
- Every function call
|
|
- Every network packet
|
|
- Every disk I/O
|
|
|
|
## strace: The Simple Way
|
|
|
|
### How ptrace Works
|
|
|
|
strace uses the `ptrace()` syscall - the same mechanism debuggers use.
|
|
|
|
```
|
|
┌─────────────────┐ ┌─────────────────┐
|
|
│ Your Program │ │ strace │
|
|
│ │ │ │
|
|
│ 1. syscall │─────────→│ 2. STOP! │
|
|
│ ║ │ SIGTRAP │ inspect │
|
|
│ ║ (paused) │ │ log │
|
|
│ ║ │←─────────│ 3. continue │
|
|
│ 4. resume │ PTRACE │ │
|
|
│ │ CONT │ │
|
|
└─────────────────┘ └─────────────────┘
|
|
```
|
|
|
|
Step by step:
|
|
|
|
1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME`
|
|
2. **Trap on syscall**: strace sets `PTRACE_SYSCALL` - kernel stops tracee at each syscall
|
|
3. **Inspect**: strace reads registers to see syscall number and arguments
|
|
4. **Continue**: strace calls `ptrace(PTRACE_CONT)` to resume
|
|
5. **Repeat**: Kernel stops again when syscall returns, strace reads return value
|
|
|
|
### Why strace is Slow
|
|
|
|
Each syscall causes **two context switches** to strace:
|
|
- One on syscall entry
|
|
- One on syscall exit
|
|
|
|
```
|
|
Normal syscall:
|
|
User → Kernel → User
|
|
|
|
With strace:
|
|
User → Kernel → strace → Kernel → strace → Kernel → User
|
|
↑ ↑
|
|
entry stop exit stop
|
|
```
|
|
|
|
Overhead can be **10-100x** for syscall-heavy programs!
|
|
|
|
```bash
|
|
# Normal
|
|
time ./read_fast testfile # 0.01s
|
|
|
|
# With strace
|
|
time strace -c ./read_fast testfile # 0.5s (50x slower!)
|
|
```
|
|
|
|
### When to Use strace
|
|
|
|
Despite overhead, strace is invaluable for:
|
|
- Debugging "why won't this program start?"
|
|
- Finding which files a program opens
|
|
- Understanding program behavior
|
|
- One-off investigation (not production)
|
|
|
|
```bash
|
|
strace -e openat ./program # What files does it open?
|
|
strace -e connect ./program # What network connections?
|
|
strace -c ./program # Syscall summary
|
|
```
|
|
|
|
## eBPF: The Fast Way
|
|
|
|
### What is eBPF?
|
|
|
|
eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**.
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────┐
|
|
│ Kernel │
|
|
│ ┌──────────────────────────────────────────────┐ │
|
|
│ │ Your eBPF Program │ │
|
|
│ │ - Runs at kernel speed │ │
|
|
│ │ - No context switches │ │
|
|
│ │ - Verified for safety │ │
|
|
│ └──────────────────────────────────────────────┘ │
|
|
│ ↓ attach to │
|
|
│ ┌──────────────────────────────────────────────┐ │
|
|
│ │ Tracepoints, Kprobes, Uprobes, USDT │ │
|
|
│ └──────────────────────────────────────────────┘ │
|
|
└─────────────────────────────────────────────────────┘
|
|
↓ results via
|
|
┌─────────────────────────────────────────────────────┐
|
|
│ User Space │
|
|
│ Maps, ring buffers, perf events │
|
|
└─────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
### The eBPF Verifier
|
|
|
|
Before your eBPF program runs, the kernel **verifies** it:
|
|
- No infinite loops
|
|
- No out-of-bounds memory access
|
|
- No unsafe operations
|
|
- Bounded execution time
|
|
|
|
This makes eBPF safe to run in production, even from untrusted users.
|
|
|
|
### Attachment Points
|
|
|
|
eBPF can attach to various kernel hooks:
|
|
|
|
| Type | What it traces | Example |
|
|
|------|----------------|---------|
|
|
| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` |
|
|
| **Kprobes** | Any kernel function | `kprobe:do_sys_open` |
|
|
| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` |
|
|
| **USDT** | User-defined static probes | `usdt:./server:myapp:request` |
|
|
|
|
### Why eBPF is Fast
|
|
|
|
```
|
|
strace (ptrace):
|
|
Process stops → context switch → strace reads → context switch → resume
|
|
|
|
eBPF:
|
|
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
|
|
```
|
|
|
|
eBPF overhead is typically **<1%** even for frequent events.
|
|
|
|
## bpftrace: eBPF Made Easy
|
|
|
|
bpftrace is a high-level language for eBPF, like awk for tracing.
|
|
|
|
### Basic Syntax
|
|
|
|
```
|
|
probe /filter/ { action }
|
|
```
|
|
|
|
### Examples
|
|
|
|
```bash
|
|
# Count syscalls by name
|
|
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
|
|
|
|
# Trace open() calls with filename
|
|
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
|
|
printf("%s opened %s\n", comm, str(args->filename));
|
|
}'
|
|
|
|
# Histogram of read() sizes
|
|
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
|
|
@bytes = hist(args->ret);
|
|
}'
|
|
|
|
# Latency of disk I/O
|
|
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
|
|
kprobe:blk_account_io_done /@start[arg0]/ {
|
|
@usecs = hist((nsecs - @start[arg0]) / 1000);
|
|
delete(@start[arg0]);
|
|
}'
|
|
```
|
|
|
|
### bpftrace Built-in Variables
|
|
|
|
| Variable | Meaning |
|
|
|----------|---------|
|
|
| `pid` | Process ID |
|
|
| `tid` | Thread ID |
|
|
| `comm` | Process name |
|
|
| `nsecs` | Nanosecond timestamp |
|
|
| `arg0-argN` | Function arguments |
|
|
| `retval` | Return value |
|
|
| `probe` | Current probe name |
|
|
|
|
### bpftrace Aggregations
|
|
|
|
```bash
|
|
@x = count() # Count events
|
|
@x = sum(value) # Sum values
|
|
@x = avg(value) # Average
|
|
@x = min(value) # Minimum
|
|
@x = max(value) # Maximum
|
|
@x = hist(value) # Power-of-2 histogram
|
|
@x = lhist(v, min, max, step) # Linear histogram
|
|
```
|
|
|
|
## Comparison: When to Use What
|
|
|
|
| Tool | Overhead | Setup | Use Case |
|
|
|------|----------|-------|----------|
|
|
| **strace** | High (10-100x) | Zero | Quick debugging, non-production |
|
|
| **ltrace** | High | Zero | Library call tracing |
|
|
| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis |
|
|
| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events |
|
|
|
|
### Decision Tree
|
|
|
|
```
|
|
Need to trace events?
|
|
├── Quick one-off debugging?
|
|
│ └── strace (easy, but slow)
|
|
├── Production system?
|
|
│ └── bpftrace/eBPF (fast, safe)
|
|
├── Custom application probes?
|
|
│ └── USDT + bpftrace
|
|
└── CPU profiling?
|
|
└── perf record
|
|
```
|
|
|
|
## USDT: User Statically Defined Tracing
|
|
|
|
USDT probes are markers you add to your code:
|
|
|
|
```c
|
|
#include <sys/sdt.h>
|
|
|
|
void handle_request(int id) {
|
|
DTRACE_PROBE1(myserver, request_start, id);
|
|
// ... handle request ...
|
|
DTRACE_PROBE1(myserver, request_end, id);
|
|
}
|
|
```
|
|
|
|
Then trace with bpftrace:
|
|
|
|
```bash
|
|
bpftrace -e 'usdt:./server:myserver:request_start {
|
|
@start[arg0] = nsecs;
|
|
}
|
|
usdt:./server:myserver:request_end /@start[arg0]/ {
|
|
@latency = hist((nsecs - @start[arg0]) / 1000);
|
|
delete(@start[arg0]);
|
|
}'
|
|
```
|
|
|
|
**Advantages of USDT**:
|
|
- Zero overhead when not tracing
|
|
- Stable interface (unlike kprobes)
|
|
- Access to application-level data
|
|
|
|
## Summary
|
|
|
|
| Mechanism | How it works | Speed | Safety |
|
|
|-----------|--------------|-------|--------|
|
|
| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe |
|
|
| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified |
|
|
| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |
|
|
|
|
## Further Reading
|
|
|
|
- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md)
|
|
- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md)
|
|
- [eBPF documentation](https://ebpf.io/what-is-ebpf/)
|
|
- [strace source code](https://github.com/strace/strace) - surprisingly readable!
|