illustris 2026-01-11 12:04:36 +05:30
parent c8f56cf3f1
commit 1432bdaff9
Signed by: illustris
GPG Key ID: 56C8FC0B899FEFA3
3 changed files with 803 additions and 0 deletions


@ -0,0 +1,327 @@
# CPU Caches and Memory: Why Access Patterns Matter
## The Memory Wall
CPUs are fast. Memory is slow. This gap is called the "memory wall."
```
Relative Speed
══════════════
CPU registers ████████████████████████████████ (~1 cycle)
L1 cache ██████████████████████ (~4 cycles)
L2 cache ████████████ (~12 cycles)
L3 cache ██████ (~40 cycles)
Main RAM █ (~200 cycles)
SSD (~10,000 cycles)
HDD (~10,000,000 cycles)
```
A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.
## The Cache Hierarchy
```
┌─────────────────────────────────────────────────────────────┐
│ CPU Core │
│ ┌─────────────────────────────────────────────────────┐ │
│  │ Registers: bytes of storage, <1ns                   │   │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d) │ │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L2 Cache: 256-512 KB, ~3-4ns │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┴─────────────────┐
│ L3 Cache: 8-64 MB, ~10-12ns │ (shared between cores)
└─────────────────┬─────────────────┘
┌─────────────────┴─────────────────┐
│ Main RAM: GBs, ~50-100ns │
└───────────────────────────────────┘
```
### Typical Numbers (Desktop CPU, 2024)
| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |
## Cache Lines: The Unit of Transfer
Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).
```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
└──────────────── 64 bytes = 1 cache line ────────────────────┘
If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```
This is why **sequential access** is so much faster than random access.
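The 64-byte line size is typical for current x86-64 and many ARM cores, but it is not universal. On Linux with glibc you can ask at runtime; a minimal sketch, assuming the `_SC_LEVEL1_DCACHE_LINESIZE` sysconf extension is available:
```c
/* Query the L1 data cache line size at runtime (Linux/glibc extension).
 * Falls back to 64 bytes if the kernel/libc doesn't report it. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;  /* common value on x86-64 and many ARM cores */
    printf("L1d cache line: %ld bytes\n", line);
    return 0;
}
```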
## Spatial and Temporal Locality
Caches exploit two patterns in real programs:
### Spatial Locality
"If you accessed address X, you'll probably access X+1 soon."
```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
sum += array[i]; // Next element is in same cache line
}
// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
sum += array[random_index()]; // Each access misses cache
}
```
### Temporal Locality
"If you accessed address X, you'll probably access X again soon."
```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
x = array[i];
result += x * x + x; // x stays in registers
}
// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
sum += array[i];
}
for (int i = 0; i < N; i++) {
product *= array[i]; // Array evicted from cache, refetch
}
```
## Why Random Access Kills Performance
### Example: Array vs Linked List
```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘
Linked List (scattered memory):
┌───┐ ┌───┐ ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘ └───┘ └───┘
↑ ↑ ↑
0x1000 0x5420 0x2108
Each node in different cache line!
```
Traversing a scattered linked list causes a **cache miss per node**.
### Real Numbers
```
Array traversal: ~0.004 seconds (10M elements)
Sequential list: ~0.018 seconds (4.5x slower)
Scattered list: ~1.400 seconds (350x slower!)
```
The scattered list is O(n) just like the array, but the constant factor is 350x worse.
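For reference, a sketch of the kind of harness behind numbers like these - the element count, node layout, and timing helper are illustrative, and whether the list ends up sequential or scattered depends on the allocator (allocating nodes in a shuffled order approximates the scattered case):
```c
/* Sketch: sum 10M longs via a contiguous array vs. a linked list.
 * Not the original benchmark; times will vary by machine and allocator. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

struct node { long value; struct node *next; };

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* Array: one contiguous block, traversed sequentially. */
    long *array = malloc(N * sizeof *array);
    for (long i = 0; i < N; i++) array[i] = i;
    double t0 = seconds();
    long sum = 0;
    for (long i = 0; i < N; i++) sum += array[i];
    printf("array: %.3fs (sum=%ld)\n", seconds() - t0, sum);

    /* List: same values, one malloc per node, traversal chases pointers.
     * A fresh heap often hands nodes out nearly sequentially; freeing and
     * reallocating other objects in between scatters them. */
    struct node *head = NULL;
    for (long i = N - 1; i >= 0; i--) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    t0 = seconds();
    sum = 0;
    for (struct node *n = head; n; n = n->next) sum += n->value;
    printf("list:  %.3fs (sum=%ld)\n", seconds() - t0, sum);
    return 0;
}
```
Compile with optimization (e.g. `gcc -O2`) so the array loop isn't penalized by debug codegen.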
## Pipeline Stalls: Why the CPU Can't Hide Latency
Modern CPUs execute many instructions simultaneously:
```
Pipeline (simplified):
Cycle: 1 2 3 4 5 6 7 8
Fetch: [A] [B] [C] [D] [E] [F] [G] [H]
Decode: [A] [B] [C] [D] [E] [F] [G]
Execute: [A] [B] [C] [D] [E] [F]
Memory: [A] [B] [C] [D] [E]
Write: [A] [B] [C] [D]
```
But what happens when instruction C needs data from memory?
```
Cycle: 1 2 3 4 5 ... 200 201 202
Fetch: [A] [B] [C] [C] [C] ... [C] [D] [E]
Decode: [A] [B] [C] [C] ... [C] [C] [D]
Execute: [A] [B] waiting for memory...
Memory: [A] [B] ... ... [C]
Write: [A] [B] ... [C]
STALL! Pipeline bubbles
```
The CPU stalls for **~200 cycles** waiting for RAM. Those 200 cycles could have executed 200+ instructions.
### Out-of-Order Execution Helps (But Not Enough)
CPUs can execute later instructions while waiting:
```c
a = array[i]; // Cache miss, stall...
b = x + y; // Can execute while waiting!
c = b * 2; // Can execute while waiting!
d = a + 1; // Must wait for 'a'
```
But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
## The Prefetcher: CPU Tries to Help
Modern CPUs detect sequential access patterns and fetch data **before you ask**:
```
Your code accesses: [0] [1] [2] [3] [4] ...
Prefetcher fetches: [5] [6] [7] [8] ... (ahead of you!)
```
But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect
```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
sum += array[i]; // Prefetcher fetches ahead
}
// Prefetcher loses
for (int i = 0; i < N; i++) {
sum += array[indices[i]]; // Random indices, can't predict
}
```
## Row-Major vs Column-Major
C stores 2D arrays in row-major order:
```
int matrix[3][4];
Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘
```
### Row-Major Access (Cache-Friendly)
```c
for (int i = 0; i < ROWS; i++) {
for (int j = 0; j < COLS; j++) {
sum += matrix[i][j]; // Sequential in memory!
}
}
```
Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.
### Column-Major Access (Cache-Hostile)
```c
for (int j = 0; j < COLS; j++) {
for (int i = 0; i < ROWS; i++) {
sum += matrix[i][j]; // Jumps by COLS each time!
}
}
```
Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.
If COLS=8192, each access jumps 32KB - far beyond any cache line!
**Result**: Column-major can be **10-50x slower** for large matrices.
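A self-contained version of the two loops above, timed back to back; the 8192×8192 size (256 MB of ints) and the timing helper are illustrative choices, not taken from a specific benchmark:
```c
/* Sketch: row-major vs column-major traversal of a large int matrix. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 8192
#define COLS 8192

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int (*matrix)[COLS] = calloc(ROWS, sizeof *matrix);  /* zero-filled */
    long sum;
    double t0;

    t0 = seconds();
    sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += matrix[i][j];   /* row-major: walks memory sequentially */
    printf("row-major:    %.3fs (sum=%ld)\n", seconds() - t0, sum);

    t0 = seconds();
    sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += matrix[i][j];   /* column-major: strides 32 KB per access */
    printf("column-major: %.3fs (sum=%ld)\n", seconds() - t0, sum);

    free(matrix);
    return 0;
}
```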
## False Sharing: The Multithreaded Trap
Cache coherency means cores must agree on cache line contents.
```
Thread 1 (Core 0): counter1++ ┐
Thread 2 (Core 1): counter2++ ├── Both in same cache line!
┌────────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding.................] │
└────────────────────────────────────────────────────────────────┘
```
When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.
**Fix**: Pad data to separate cache lines:
```c
struct {
long counter;
char padding[64 - sizeof(long)]; // Pad to 64 bytes
} counters[NUM_THREADS];
```
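A runnable sketch of the same idea using C11 `_Alignas(64)` instead of a manual padding array - the alignment both sizes each slot to a full line and guarantees it starts on a line boundary. The 64-byte line size, thread count, and iteration count are assumptions for illustration:
```c
/* Sketch: per-thread counters, each on its own (assumed 64-byte) cache line. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERS 100000000L

struct padded_counter {
    _Alignas(64) volatile long value;  /* aligned + padded to 64 bytes;
                                          volatile keeps the compiler from
                                          collapsing the loop into one store */
};

static struct padded_counter counters[NUM_THREADS];

static void *worker(void *arg) {
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->value++;                    /* no other thread touches this line */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, &counters[t]);

    long total = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += counters[t].value;
    }
    printf("total = %ld\n", total);
    return 0;
}
```
Build with `gcc -O2 -pthread`; replacing the struct with a plain `long counters[NUM_THREADS]` packs all counters into one or two cache lines and makes the false-sharing slowdown easy to observe.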
## NUMA: When Memory Has Geography
On multi-socket systems, memory is "closer" to some CPUs:
```
┌─────────────────────────────────────────────────────────────┐
│ Socket 0 Socket 1 │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Core 0-7 │ │ Core 8-15 │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────┴──────┐ interconnect ┌──────┴──────┐ │
│ │ RAM 0 │ ←────────────→ │ RAM 1 │ │
│ │ (local) │ (slow) │ (remote) │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
## Measuring Cache Behavior
```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program
# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```
## Summary
| Pattern | Cache Behavior | Performance |
|---------|---------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |
**Key Takeaways**:
1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing
## Further Reading
- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)


@ -0,0 +1,211 @@
# How Sampling Profilers Work
## The Core Idea
Sampling profilers answer the question: **"Where is my program spending time?"**
Instead of measuring every function call (instrumentation), they periodically interrupt the program and record what it's doing. Over thousands of samples, hot spots emerge statistically.
```
Program execution: ████████████████████████████████████████
↑ ↑ ↑ ↑ ↑ ↑ ↑
sample sample sample ...
```
## Sampling vs Instrumentation
| Approach | How it works | Overhead | Accuracy |
|----------|--------------|----------|----------|
| **Instrumentation** | Insert code at every function entry/exit | High (10-100x slowdown) | Exact counts |
| **Sampling** | Interrupt periodically, record stack | Low (1-5%) | Statistical |
Instrumentation (like `gprof` with `-pg`) modifies your program. Sampling just observes it.
## How perf Does It
### 1. Hardware Performance Counters (PMU)
Modern CPUs have Performance Monitoring Units (PMUs) with special registers:
```
┌─────────────────────────────────────────┐
│ CPU │
│ ┌─────────────────────────────────┐ │
│ │ Performance Monitoring Unit │ │
│ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Counter │ │ Counter │ ... │ │
│ │ │ cycles │ │ instrs │ │ │
│ │ └─────────┘ └─────────┘ │ │
│ │ ↓ overflow interrupt │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
When you run `perf record` (a user-space sketch of this setup follows the list):
1. perf programs a PMU counter to count CPU cycles
2. Counter overflows every N cycles (default: enough for ~4000 samples/sec)
3. Overflow triggers a **Non-Maskable Interrupt (NMI)**
4. Kernel handler records: instruction pointer, process ID, timestamp
5. Optionally: walks the stack to get the call chain
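The same machinery is reachable from user space through the `perf_event_open` syscall. A minimal, hedged sketch that requests 99 Hz cycle sampling on the current process - real perf goes on to mmap a ring buffer and decode the samples, which is omitted here:
```c
/* Sketch: configure a cycles counter for 99 Hz sampling via perf_event_open. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* overflow on cycle count */
    attr.freq = 1;                            /* interpret sample_freq in Hz */
    attr.sample_freq = 99;                    /* ~99 samples per second */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                       PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN;
    attr.exclude_kernel = 1;                  /* user-space samples only */
    attr.disabled = 1;                        /* enable later via ioctl */

    /* pid=0, cpu=-1: this process on any CPU; no group leader, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");  /* often blocked by perf_event_paranoid */
        return 1;
    }
    printf("sampling fd = %d; samples would arrive in an mmap'd ring buffer\n", fd);
    close(fd);
    return 0;
}
```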
### 2. The Sampling Frequency
```bash
perf record -F 99 ./program # 99 samples per second
perf record -F 9999 ./program # 9999 samples per second
```
Higher frequency = more samples = better accuracy, but more overhead.
**Why 99 and not 100?** Sampling at a slightly off-round frequency keeps the profiler from running in lockstep with periodic activity in your program (like a 100Hz timer), which would bias every sample toward the same phase of that activity.
### 3. What Gets Recorded
Each sample contains:
- **IP (Instruction Pointer)**: Which instruction was executing
- **PID/TID**: Which process/thread
- **Timestamp**: When it happened
- **CPU**: Which core
- **Call chain** (with `-g`): Stack of return addresses
```
Sample #1: IP=0x4011a0, PID=1234, CPU=2, time=1234567890
callchain: main → process_data → compute_inner
Sample #2: IP=0x4011b8, PID=1234, CPU=2, time=1234567891
callchain: main → process_data → compute_inner
...
```
## Symbol Resolution
Raw samples contain memory addresses. To show function names, perf needs:
1. **Symbol tables**: Map address ranges to function names
2. **Debug info** (`-g`): Map addresses to source lines
```
Without symbols: 45.23% 0x00000000004011a0
With symbols: 45.23% compute_inner
With debug info: 45.23% compute_inner (program.c:28)
```
This is why `perf report` needs access to the same binary you profiled.
## Call Graph Collection
With `perf record -g`, perf records the call stack for each sample.
### Frame Pointer Walking (Traditional)
```
Stack Memory:
┌──────────────┐
│ return addr │ ← where to return after current function
│ saved RBP │ ← pointer to previous frame
├──────────────┤
│ local vars │
├──────────────┤
│ return addr │
│ saved RBP ───┼──→ previous frame
├──────────────┤
│ ... │
└──────────────┘
```
Walk the chain of frame pointers to reconstruct the call stack.
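A minimal sketch of that walk for the current thread on x86-64, assuming the code was built with frame pointers (e.g. `gcc -O0` or `-fno-omit-frame-pointer`); the fixed depth limit stands in for the stack-bounds checks a real unwinder performs before dereferencing each frame pointer:
```c
/* Sketch: walk the saved-RBP chain of the calling thread (x86-64, GCC/Clang).
 * Frame layout with frame pointers: fp[0] = caller's saved RBP,
 *                                   fp[1] = return address into the caller. */
#include <stdint.h>
#include <stdio.h>

static void print_backtrace(void) {
    uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
    /* Walk only the frames this program created (print_backtrace, inner,
     * outer, main); a real unwinder validates fp against the stack bounds. */
    for (int depth = 0; fp != NULL && depth < 4; depth++) {
        printf("  #%d return address %p\n", depth, (void *)fp[1]);
        fp = (uintptr_t *)fp[0];   /* hop to the caller's frame */
    }
}

static void inner(void) { print_backtrace(); }
static void outer(void) { inner(); }

int main(void) {
    outer();
    return 0;
}
```
For a non-PIE build (`gcc -O0 -no-pie`), feeding those addresses to `addr2line -f -e ./a.out` turns them back into function names and source lines - essentially what `perf report` does during symbol resolution.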
**Problem**: Modern compilers omit frame pointers (`-fomit-frame-pointer`) for performance. Solution: Compile with `-fno-omit-frame-pointer` or use DWARF unwinding.
### DWARF Unwinding
Uses unwind tables (the `.eh_frame` section, which is present even in binaries built without `-g`) to reconstruct the stack without frame pointers. More reliable, but slower: perf copies a chunk of the user stack with every sample and unwinds it afterwards.
```bash
perf record --call-graph dwarf ./program
```
## Statistical Nature
Sampling is inherently statistical. If function A takes 10% of execution time, about 10% of samples should land in A.
**Law of Large Numbers**: more samples bring the measured share closer to the true one. For a function taking fraction p of the time, the spread over n samples is roughly sqrt(p(1-p)/n), so 10,000 samples pin a 10% function down to within a few tenths of a percent.
```
100 samples: A: 8-12% (high variance)
1000 samples: A: 9-11% (better)
10000 samples: A: 9.8-10.2% (quite accurate)
```
This is why short-running programs need higher sampling frequency.
## Limitations
### 1. Short Functions Miss Samples
If a function runs for less time than the sampling interval, it might not get sampled at all.
```
Sampling interval: ──────────────────────────────────
Function A: ██ (might miss!)
Function B: ████████████████████████████████ (definitely hit)
```
### 2. Inlined Functions Disappear
When the compiler inlines a function, it no longer exists as a separate entity:
```c
// Source code
inline int square(int x) { return x * x; }
int compute(int x) { return square(x) + 1; }
// After inlining - square() disappears from profile
int compute(int x) { return x * x + 1; }
```
With debug info, perf can sometimes recover inline information.
### 3. Sampling Bias
Some events are harder to catch:
- Very short functions
- Functions that mostly wait (I/O, locks) - a blocked thread burns no CPU cycles, so it collects no cycle samples
- Interrupt handlers
### 4. Observer Effect
Profiling itself has overhead:
- NMI handling takes cycles
- Stack unwinding takes cycles
- Writing samples to buffer takes cycles
Usually <5%, but can affect extremely performance-sensitive code.
## perf Events
perf can sample on different events, not just CPU cycles:
```bash
perf record -e cycles ./program # CPU cycles (default)
perf record -e instructions ./program # Instructions retired
perf record -e cache-misses ./program # Cache misses
perf record -e branch-misses ./program # Branch mispredictions
```
This lets you answer "where do cache misses happen?" not just "where is time spent?"
## Summary
1. **Sampling** interrupts periodically to see what's executing
2. **PMU counters** trigger interrupts at configurable frequency
3. **Statistical accuracy** improves with more samples
4. **Symbol resolution** maps addresses to function names
5. **Call graphs** show the path to each sample
6. **Low overhead** (~1-5%) makes it usable in production
## Further Reading
- [Brendan Gregg's perf examples](https://www.brendangregg.com/perf.html)
- [perf wiki](https://perf.wiki.kernel.org/index.php/Main_Page)
- [Intel PMU documentation](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html) (Volume 3, Chapter 18)

docs/HOW-TRACING-WORKS.md

@ -0,0 +1,265 @@
# How Tracing Works: strace, bpftrace, and eBPF
## What is Tracing?
While sampling answers "where is time spent?", tracing answers "what happened?"
Tracing captures **every occurrence** of specific events:
- Every syscall
- Every function call
- Every network packet
- Every disk I/O
## strace: The Simple Way
### How ptrace Works
strace uses the `ptrace()` syscall - the same mechanism debuggers use.
```
┌─────────────────┐ ┌─────────────────┐
│ Your Program │ │ strace │
│ │ │ │
│ 1. syscall │─────────→│ 2. STOP! │
│ ║ │ SIGTRAP │ inspect │
│ ║ (paused) │ │ log │
│ ║ │←─────────│ 3. continue │
│ 4. resume │ PTRACE │ │
│ │ CONT │ │
└─────────────────┘ └─────────────────┘
```
Step by step (a minimal sketch of this loop follows the list):
1. **Attach**: strace calls `ptrace(PTRACE_ATTACH, pid)` or starts the process with `PTRACE_TRACEME`
2. **Trap on syscall**: strace sets `PTRACE_SYSCALL` - kernel stops tracee at each syscall
3. **Inspect**: strace reads registers to see syscall number and arguments
4. **Continue**: strace resumes the tracee with another `ptrace(PTRACE_SYSCALL, ...)` (not `PTRACE_CONT`, which would skip the syscall-exit stop)
5. **Repeat**: Kernel stops again when syscall returns, strace reads return value
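A stripped-down sketch of that loop for x86-64 Linux - the syscall number sits in `orig_rax`; real strace adds argument decoding, signal handling, and multi-process support:
```c
/* Sketch: print the raw syscall number of every syscall the child makes. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        exit(1);
    }

    int status;
    waitpid(child, &status, 0);                  /* child stops at exec */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to syscall entry */
        waitpid(child, &status, 0);
        if (WIFEXITED(status)) break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        fprintf(stderr, "syscall %llu\n", (unsigned long long)regs.orig_rax);

        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to syscall exit */
        waitpid(child, &status, 0);
    }
    return 0;
}
```
Running it against something like `ls` prints one number per syscall; compare the counts with `strace -c ls`.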
### Why strace is Slow
Each syscall stops the tracee twice, and every stop is a round trip of context switches between your program and strace:
- One on syscall entry
- One on syscall exit
```
Normal syscall:
User → Kernel → User
With strace:
User → Kernel → strace → Kernel → strace → Kernel → User
↑ ↑
entry stop exit stop
```
Overhead can be **10-100x** for syscall-heavy programs!
```bash
# Normal
time ./read_fast testfile # 0.01s
# With strace
time strace -c ./read_fast testfile # 0.5s (50x slower!)
```
### When to Use strace
Despite overhead, strace is invaluable for:
- Debugging "why won't this program start?"
- Finding which files a program opens
- Understanding program behavior
- One-off investigation (not production)
```bash
strace -e openat ./program # What files does it open?
strace -e connect ./program # What network connections?
strace -c ./program # Syscall summary
```
## eBPF: The Fast Way
### What is eBPF?
eBPF (extended Berkeley Packet Filter) lets you run **sandboxed programs inside the kernel**.
```
┌─────────────────────────────────────────────────────┐
│ Kernel │
│ ┌──────────────────────────────────────────────┐ │
│ │ Your eBPF Program │ │
│ │ - Runs at kernel speed │ │
│ │ - No context switches │ │
│ │ - Verified for safety │ │
│ └──────────────────────────────────────────────┘ │
│ ↓ attach to │
│ ┌──────────────────────────────────────────────┐ │
│ │ Tracepoints, Kprobes, Uprobes, USDT │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
↓ results via
┌─────────────────────────────────────────────────────┐
│ User Space │
│ Maps, ring buffers, perf events │
└─────────────────────────────────────────────────────┘
```
### The eBPF Verifier
Before your eBPF program runs, the kernel **verifies** it:
- No infinite loops
- No out-of-bounds memory access
- No unsafe operations
- Bounded execution time
This makes eBPF safe to run on production systems - though loading programs still normally requires root or `CAP_BPF`.
### Attachment Points
eBPF can attach to various kernel hooks:
| Type | What it traces | Example |
|------|----------------|---------|
| **Tracepoints** | Stable kernel events | `tracepoint:syscalls:sys_enter_read` |
| **Kprobes** | Any kernel function | `kprobe:do_sys_open` |
| **Uprobes** | Any userspace function | `uprobe:/bin/bash:readline` |
| **USDT** | User-defined static probes | `usdt:./server:myapp:request` |
### Why eBPF is Fast
```
strace (ptrace):
Process stops → context switch → strace reads → context switch → resume
eBPF:
Event fires → eBPF runs IN KERNEL → continue (no context switch!)
```
eBPF overhead is typically **<1%** even for frequent events.
## bpftrace: eBPF Made Easy
bpftrace is a high-level language for eBPF, like awk for tracing.
### Basic Syntax
```
probe /filter/ { action }
```
### Examples
```bash
# Count syscalls by name
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
# Trace open() calls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
printf("%s opened %s\n", comm, str(args->filename));
}'
# Histogram of read() sizes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@bytes = hist(args->ret);
}'
# Latency of disk I/O
# (these kprobe targets are kernel-version dependent; on newer kernels that
#  dropped blk_start_request, use the block:block_rq_issue and
#  block:block_rq_complete tracepoints instead)
bpftrace -e 'kprobe:blk_start_request { @start[arg0] = nsecs; }
kprobe:blk_account_io_done /@start[arg0]/ {
@usecs = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
```
### bpftrace Built-in Variables
| Variable | Meaning |
|----------|---------|
| `pid` | Process ID |
| `tid` | Thread ID |
| `comm` | Process name |
| `nsecs` | Nanosecond timestamp |
| `arg0-argN` | Function arguments |
| `retval` | Return value |
| `probe` | Current probe name |
### bpftrace Aggregations
```bash
@x = count() # Count events
@x = sum(value) # Sum values
@x = avg(value) # Average
@x = min(value) # Minimum
@x = max(value) # Maximum
@x = hist(value) # Power-of-2 histogram
@x = lhist(v, min, max, step) # Linear histogram
```
## Comparison: When to Use What
| Tool | Overhead | Setup | Use Case |
|------|----------|-------|----------|
| **strace** | High (10-100x) | Zero | Quick debugging, non-production |
| **ltrace** | High | Zero | Library call tracing |
| **bpftrace** | Low (<1%) | Needs root | Production tracing, performance analysis |
| **perf** | Low (<5%) | Minimal | CPU profiling, hardware events |
### Decision Tree
```
Need to trace events?
├── Quick one-off debugging?
│ └── strace (easy, but slow)
├── Production system?
│ └── bpftrace/eBPF (fast, safe)
├── Custom application probes?
│ └── USDT + bpftrace
└── CPU profiling?
└── perf record
```
## USDT: User Statically Defined Tracing
USDT probes are markers you add to your code:
```c
#include <sys/sdt.h>
void handle_request(int id) {
DTRACE_PROBE1(myserver, request_start, id);
// ... handle request ...
DTRACE_PROBE1(myserver, request_end, id);
}
```
Then trace with bpftrace:
```bash
bpftrace -e 'usdt:./server:myserver:request_start {
@start[arg0] = nsecs;
}
usdt:./server:myserver:request_end /@start[arg0]/ {
@latency = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
```
**Advantages of USDT**:
- Near-zero overhead when not tracing (each probe compiles to a single `nop` until it is enabled)
- Stable interface (unlike kprobes)
- Access to application-level data
## Summary
| Mechanism | How it works | Speed | Safety |
|-----------|--------------|-------|--------|
| **ptrace (strace)** | Process stops, tracer inspects | Slow | Safe |
| **eBPF (bpftrace)** | Code runs in kernel | Fast | Verified |
| **USDT** | Compiled-in no-ops, enabled by eBPF | Near-zero | Safe |
## Further Reading
- [Brendan Gregg's bpftrace tutorial](https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md)
- [bpftrace reference guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md)
- [eBPF documentation](https://ebpf.io/what-is-ebpf/)
- [strace source code](https://github.com/strace/strace) - surprisingly readable!