8.8 KiB
How Tracing Works: strace, bpftrace, and eBPF
What is Tracing?
While sampling answers "where is time spent?", tracing answers "what happened?"
Tracing captures every occurrence of specific events:
- Every syscall
- Every function call
- Every network packet
- Every disk I/O
strace: The Simple Way
How ptrace Works
strace uses the ptrace() syscall - the same mechanism debuggers use.
┌─────────────────┐ ┌─────────────────┐
│ Your Program │ │ strace │
│ │ │ │
│ 1. syscall │─────────→│ 2. STOP! │
│ ║ │ SIGTRAP │ inspect │
│ ║ (paused) │ │ log │
│ ║ │←─────────│ 3. continue │
│ 4. resume │ PTRACE │ │
│ │ CONT │ │
└─────────────────┘ └─────────────────┘
Step by step:
- Attach: strace calls
ptrace(PTRACE_ATTACH, pid)or starts the process withPTRACE_TRACEME - Trap on syscall: strace sets
PTRACE_SYSCALL- kernel stops tracee at each syscall - Inspect: strace reads registers to see syscall number and arguments
- Continue: strace calls
ptrace(PTRACE_CONT)to resume - Repeat: Kernel stops again when syscall returns, strace reads return value
Why strace is Slow
Each syscall causes two context switches to strace:
- One on syscall entry
- One on syscall exit
Normal syscall:
User → Kernel → User
With strace:
User → Kernel → strace → Kernel → strace → Kernel → User
↑ ↑
entry stop exit stop
Overhead can be 10-100x for syscall-heavy programs!
# Normal
time ./read_fast testfile # 0.01s
# With strace
time strace -c ./read_fast testfile # 0.5s (50x slower!)
When to Use strace
Despite overhead, strace is invaluable for:
- Debugging "why won't this program start?"
- Finding which files a program opens
- Understanding program behavior
- One-off investigation (not production)
strace -e openat ./program # What files does it open?
strace -e connect ./program # What network connections?
strace -c ./program # Syscall summary
eBPF: The Fast Way
What is eBPF?
eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the kernel.
┌─────────────────────────────────────────────────────┐
│ Kernel │
│ ┌──────────────────────────────────────────────┐ │
│ │ Your eBPF Program │ │
│ │ - Runs at kernel speed │ │
│ │ - No context switches │ │
│ │ - Verified for safety │ │
│ └──────────────────────────────────────────────┘ │
│ ↓ attach to │
│ ┌──────────────────────────────────────────────┐ │
│ │ Tracepoints, Kprobes, Uprobes, USDT │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
↓ results via
┌─────────────────────────────────────────────────────┐
│ User Space │
│ Maps, ring buffers, perf events │
└─────────────────────────────────────────────────────┘
The eBPF Verifier
Before your eBPF program runs, the kernel verifies it:
- No infinite loops
- No out-of-bounds memory access
- No unsafe operations
- Bounded execution time
This makes eBPF safe to run in production, even from untrusted users.
Attachment Points
eBPF can attach to various kernel hooks:
| Type | What it traces | Example |
|---|---|---|
| Tracepoints | Stable kernel events | tracepoint:syscalls:sys_enter_read |
| Kprobes | Any kernel function | kprobe:do_sys_open |
| Uprobes | Any userspace function | uprobe:/bin/bash:readline |
| USDT | User-defined static probes | usdt:./server:myapp:request |
Why eBPF is Fast
strace (ptrace):
Process stops → context switch → strace reads → context switch → resume
eBPF:
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
eBPF overhead is typically <1% even for frequent events.
bpftrace: eBPF Made Easy
bpftrace is a high-level language for eBPF, like awk for tracing.
Basic Syntax
probe /filter/ { action }
Examples
# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
printf("%s opened %s\n", comm, str(args->filename));
}'
# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@bytes = hist(args->ret);
}'
# Latency of disk I/O
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
kprobe:blk_account_io_done /@start[arg0]/ {
@usecs = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
bpftrace Built-in Variables
| Variable | Meaning |
|---|---|
pid |
Process ID |
tid |
Thread ID |
comm |
Process name |
nsecs |
Nanosecond timestamp |
arg0-argN |
Function arguments |
retval |
Return value |
probe |
Current probe name |
bpftrace Aggregations
@x = count() # Count events
@x = sum(value) # Sum values
@x = avg(value) # Average
@x = min(value) # Minimum
@x = max(value) # Maximum
@x = hist(value) # Power-of-2 histogram
@x = lhist(v, min, max, step) # Linear histogram
Comparison: When to Use What
| Tool | Overhead | Setup | Use Case |
|---|---|---|---|
| strace | High (10-100x) | Zero | Quick debugging, non-production |
| ltrace | High | Zero | Library call tracing |
| bpftrace | Low (<1%) | Needs root | Production tracing, performance analysis |
| perf | Low (<5%) | Minimal | CPU profiling, hardware events |
Decision Tree
Need to trace events?
├── Quick one-off debugging?
│ └── strace (easy, but slow)
├── Production system?
│ └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│ └── USDT + bpftrace
└── CPU profiling?
└── perf record
USDT: User Statically Defined Tracing
USDT probes are markers you add to your code:
#include <sys/sdt.h>
void handle_request(int id) {
DTRACE_PROBE1(myserver, request_start, id);
// ... handle request ...
DTRACE_PROBE1(myserver, request_end, id);
}
Then trace with bpftrace:
bpftrace -e 'usdt:./server:myserver:request_start {
@start[arg0] = nsecs;
}
usdt:./server:myserver:request_end /@start[arg0]/ {
@latency = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
Advantages of USDT:
- Zero overhead when not tracing
- Stable interface (unlike kprobes)
- Access to application-level data
Summary
| Mechanism | How it works | Speed | Safety |
|---|---|---|---|
| ptrace (strace) | Process stops, tracer inspects | Slow | Safe |
| eBPF (bpftrace) | Code runs in kernel | Fast | Verified |
| USDT | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |
Further Reading
- Brendan Gregg's bpftrace tutorial
- bpftrace reference guide
- eBPF documentation
- strace source code - surprisingly readable!