illustris/perf-workshop

Fork 0

illustris 1432bdaff9

add docs

2026-01-11 12:04:36 +05:30

8.8 KiB

Raw Permalink Blame History

How Tracing Works: strace, bpftrace, and eBPF

What is Tracing?

While sampling answers "where is time spent?", tracing answers "what happened?"

Tracing captures every occurrence of specific events:

Every syscall
Every function call
Every network packet
Every disk I/O

strace: The Simple Way

How ptrace Works

strace uses the ptrace() syscall - the same mechanism debuggers use.

┌─────────────────┐          ┌─────────────────┐
│   Your Program  │          │     strace      │
│                 │          │                 │
│  1. syscall     │─────────→│  2. STOP!       │
│     ║           │  SIGTRAP │     inspect     │
│     ║ (paused)  │          │     log         │
│     ║           │←─────────│  3. continue    │
│  4. resume      │  PTRACE  │                 │
│                 │  CONT    │                 │
└─────────────────┘          └─────────────────┘

Step by step:

Attach: strace calls ptrace(PTRACE_ATTACH, pid) or starts the process with PTRACE_TRACEME
Trap on syscall: strace sets PTRACE_SYSCALL - kernel stops tracee at each syscall
Inspect: strace reads registers to see syscall number and arguments
Continue: strace calls ptrace(PTRACE_CONT) to resume
Repeat: Kernel stops again when syscall returns, strace reads return value

Why strace is Slow

Each syscall causes two context switches to strace:

One on syscall entry
One on syscall exit

Normal syscall:
  User → Kernel → User

With strace:
  User → Kernel → strace → Kernel → strace → Kernel → User
                    ↑                  ↑
              entry stop          exit stop

Overhead can be 10-100x for syscall-heavy programs!

# Normal
time ./read_fast testfile     # 0.01s

# With strace
time strace -c ./read_fast testfile   # 0.5s (50x slower!)

When to Use strace

Despite overhead, strace is invaluable for:

Debugging "why won't this program start?"
Finding which files a program opens
Understanding program behavior
One-off investigation (not production)

strace -e openat ./program     # What files does it open?
strace -e connect ./program    # What network connections?
strace -c ./program            # Syscall summary

eBPF: The Fast Way

What is eBPF?

eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the kernel.

┌─────────────────────────────────────────────────────┐
│                      Kernel                          │
│  ┌──────────────────────────────────────────────┐   │
│  │              Your eBPF Program                │   │
│  │  - Runs at kernel speed                       │   │
│  │  - No context switches                        │   │
│  │  - Verified for safety                        │   │
│  └──────────────────────────────────────────────┘   │
│         ↓ attach to                                  │
│  ┌──────────────────────────────────────────────┐   │
│  │    Tracepoints, Kprobes, Uprobes, USDT        │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
         ↓ results via
┌─────────────────────────────────────────────────────┐
│                    User Space                        │
│    Maps, ring buffers, perf events                  │
└─────────────────────────────────────────────────────┘

The eBPF Verifier

Before your eBPF program runs, the kernel verifies it:

No infinite loops
No out-of-bounds memory access
No unsafe operations
Bounded execution time

This makes eBPF safe to run in production, even from untrusted users.

Attachment Points

eBPF can attach to various kernel hooks:

Type	What it traces	Example
Tracepoints	Stable kernel events	`tracepoint:syscalls:sys_enter_read`
Kprobes	Any kernel function	`kprobe:do_sys_open`
Uprobes	Any userspace function	`uprobe:/bin/bash:readline`
USDT	User-defined static probes	`usdt:./server:myapp:request`

Why eBPF is Fast

strace (ptrace):
  Process stops → context switch → strace reads → context switch → resume

eBPF:
  Event fires → eBPF runs IN KERNEL → continue (no context switch!)

eBPF overhead is typically <1% even for frequent events.

bpftrace: eBPF Made Easy

bpftrace is a high-level language for eBPF, like awk for tracing.

Basic Syntax

probe /filter/ { action }

Examples

# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s opened %s\n", comm, str(args->filename));
}'

# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @bytes = hist(args->ret);
}'

# Latency of disk I/O
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
             kprobe:blk_account_io_done /@start[arg0]/ {
                 @usecs = hist((nsecs - @start[arg0]) / 1000);
                 delete(@start[arg0]);
             }'

bpftrace Built-in Variables

Variable	Meaning
`pid`	Process ID
`tid`	Thread ID
`comm`	Process name
`nsecs`	Nanosecond timestamp
`arg0-argN`	Function arguments
`retval`	Return value
`probe`	Current probe name

bpftrace Aggregations

@x = count()           # Count events
@x = sum(value)        # Sum values
@x = avg(value)        # Average
@x = min(value)        # Minimum
@x = max(value)        # Maximum
@x = hist(value)       # Power-of-2 histogram
@x = lhist(v, min, max, step)  # Linear histogram

Comparison: When to Use What

Tool	Overhead	Setup	Use Case
strace	High (10-100x)	Zero	Quick debugging, non-production
ltrace	High	Zero	Library call tracing
bpftrace	Low (<1%)	Needs root	Production tracing, performance analysis
perf	Low (<5%)	Minimal	CPU profiling, hardware events

Decision Tree

Need to trace events?
├── Quick one-off debugging?
│   └── strace (easy, but slow)
├── Production system?
│   └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│   └── USDT + bpftrace
└── CPU profiling?
    └── perf record

USDT: User Statically Defined Tracing

USDT probes are markers you add to your code:

#include <sys/sdt.h>

void handle_request(int id) {
    DTRACE_PROBE1(myserver, request_start, id);
    // ... handle request ...
    DTRACE_PROBE1(myserver, request_end, id);
}

Then trace with bpftrace:

bpftrace -e 'usdt:./server:myserver:request_start {
    @start[arg0] = nsecs;
}
usdt:./server:myserver:request_end /@start[arg0]/ {
    @latency = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}'

Advantages of USDT:

Zero overhead when not tracing
Stable interface (unlike kprobes)
Access to application-level data

Summary

Mechanism	How it works	Speed	Safety
ptrace (strace)	Process stops, tracer inspects	Slow	Safe
eBPF (bpftrace)	Code runs in kernel	Fast	Verified
USDT	Compiled-in no-ops, enabled by eBPF	Near-zero	Safe

8.8 KiB Raw Permalink Blame History