perf-workshop/docs/CPU-CACHES-AND-MEMORY.md
# CPU Caches and Memory: Why Access Patterns Matter
## The Memory Wall
CPUs are fast. Memory is slow. This gap is called the "memory wall."
```
Relative Speed
══════════════
CPU registers  ████████████████████████████████  (~1 cycle)
L1 cache       ██████████████████████            (~4 cycles)
L2 cache       ████████████                      (~12 cycles)
L3 cache       ██████                            (~40 cycles)
Main RAM       █                                 (~200 cycles)
SSD                                              (~10,000 cycles)
HDD                                              (~10,000,000 cycles)
```
A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.
## The Cache Hierarchy
```
┌────────────────────────────────────────────────────┐
│                      CPU Core                      │
│  ┌──────────────────────────────────────────────┐  │
│  │ Registers (bytes, <1ns)                      │  │
│  └──────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────┐  │
│  │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)  │  │
│  └──────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────┐  │
│  │ L2 Cache: 256-512 KB, ~3-4ns                 │  │
│  └──────────────────────────────────────────────┘  │
└─────────────────────────┬──────────────────────────┘
        ┌─────────────────┴─────────────────┐
        │ L3 Cache: 8-64 MB, ~10-12ns       │  (shared between cores)
        └─────────────────┬─────────────────┘
        ┌─────────────────┴─────────────────┐
        │ Main RAM: GBs, ~50-100ns          │
        └───────────────────────────────────┘
```
### Typical Numbers (Desktop CPU, 2024)
| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |
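These figures vary by microarchitecture. On Linux with glibc you can query the geometry of the machine you are actually on; a minimal sketch (the `_SC_LEVEL*` constants are a glibc extension and may report 0 on some systems):
```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // glibc reports cache geometry through sysconf(); 0 or -1 means "unknown"
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```
The same values are available from the shell via `getconf -a | grep -i CACHE`.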
## Cache Lines: The Unit of Transfer
Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).
```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
         └────────────────── 64 bytes = 1 cache line ───────────────────┘
If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```
This is why **sequential access** is so much faster than random access.
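You can see the 64-byte granularity directly: a loop that touches one `int` per cache line costs nearly as much as one that touches every `int`, because each line is fetched whole either way. A minimal sketch (array size and strides are illustrative assumptions; compile with optimizations and expect the stride-1 and stride-16 runs to take roughly similar time):
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   // 64M ints (~256 MB), far larger than any cache

static double touch(int *a, int stride) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i += stride)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sum == 42) puts("");   // keep the compiler from deleting the loop
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    for (int i = 0; i < N; i++) a[i] = i;
    printf("stride 1  (every int):          %.3fs\n", touch(a, 1));
    printf("stride 16 (one per 64B line):   %.3fs\n", touch(a, 16));
    printf("stride 32 (one per two lines):  %.3fs\n", touch(a, 32));
    free(a);
    return 0;
}
```
Stride 16 does 1/16th of the arithmetic but still pulls in every cache line, so the cost of fetching lines dominates both runs.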
## Spatial and Temporal Locality
Caches exploit two patterns in real programs:
### Spatial Locality
"If you accessed address X, you'll probably access X+1 soon."
```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];               // Next element is in the same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()];  // Each access misses cache
}
```
### Temporal Locality
"If you accessed address X, you'll probably access X again soon."
```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x;    // x stays in registers
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];    // Array evicted from cache, refetch everything
}
```
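The fix for the second pattern is to fuse the two loops so each element is used for both results while it is still hot; a minimal sketch of that remedy:
```c
// Better: one pass, each element used for both results while still cached
long sum = 0, product = 1;
for (int i = 0; i < N; i++) {
    sum     += array[i];
    product *= array[i];
}
```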
## Why Random Access Kills Performance
### Example: Array vs Linked List
```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘
Linked List (scattered memory):
┌───┐          ┌───┐          ┌───┐
│ 0 │────────→ │ 1 │────────→ │ 2 │ ...
└───┘          └───┘          └───┘
  ↑              ↑              ↑
0x1000         0x5420         0x2108

Each node in a different cache line!
```
Traversing a scattered linked list causes a **cache miss per node**.
### Real Numbers
```
Array traversal:  ~0.004 seconds  (10M elements)
Sequential list:  ~0.018 seconds  (4.5x slower)
Scattered list:   ~1.400 seconds  (350x slower!)
```
The scattered list is O(n) just like the array, but the constant factor is 350x worse.
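A minimal sketch of how such a comparison can be written (exact numbers depend on the CPU, allocator, and element count):
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000   // 10M elements, as in the numbers above

struct node { long value; struct node *next; };

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1;
    long sum;

    // Array traversal: contiguous memory, prefetcher-friendly
    long *arr = malloc(N * sizeof *arr);
    for (long i = 0; i < N; i++) arr[i] = i;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sum = 0;
    for (long i = 0; i < N; i++) sum += arr[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("array:       %.3fs (sum=%ld)\n", seconds(t0, t1), sum);

    // Linked list traversal: one pointer chase (and potentially one cache miss) per node
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sum = 0;
    for (struct node *p = head; p; p = p->next) sum += p->value;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("linked list: %.3fs (sum=%ld)\n", seconds(t0, t1), sum);
    return 0;
}
```
Note that freshly allocated nodes usually land nearly sequentially in memory (the ~4.5x case); reproducing the 350x scattered case requires shuffling the nodes or interleaving the allocations with other work.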
## Pipeline Stalls: Why the CPU Can't Hide Latency
Modern CPUs execute many instructions simultaneously:
```
Pipeline (simplified):
Cycle:      1    2    3    4    5    6    7    8
Fetch:     [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:         [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:             [A]  [B]  [C]  [D]  [E]  [F]
Memory:                   [A]  [B]  [C]  [D]  [E]
Write:                         [A]  [B]  [C]  [D]
```
But what happens when instruction C needs data from memory?
```
Cycle:      1    2    3    4    5   ...   200  201  202
Fetch:     [A]  [B]  [C]  [D]  [E]  ...  (stalled)
Decode:         [A]  [B]  [C]  [D]  ...  (stalled)
Execute:             [A]  [B]  [C]  ...  waiting for memory...
Memory:                   [A]  [B]  [C ── cache miss, ~200 cycles ──→ C]
Write:                         [A]  [B]                              [C]

                        STALL! Pipeline bubbles
```
The CPU stalls for **~200 cycles** waiting for RAM. In those cycles, a modern superscalar core could have executed several hundred instructions.
### Out-of-Order Execution Helps (But Not Enough)
CPUs can execute later instructions while waiting:
```c
a = array[i]; // Cache miss, stall...
b = x + y; // Can execute while waiting!
c = b * 2; // Can execute while waiting!
d = a + 1; // Must wait for 'a'
```
But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
## The Prefetcher: CPU Tries to Help
Modern CPUs detect sequential access patterns and fetch data **before you ask**:
```
Your code accesses: [0] [1] [2] [3] [4] ...
Prefetcher fetches: [5] [6] [7] [8] ... (ahead of you!)
```
But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect
```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];           // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]];  // Random indices, can't predict
}
```
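When the pattern is data-dependent, as in the second loop, GCC and Clang provide `__builtin_prefetch` so the code can request a line a few iterations before it is needed. A sketch assuming the same `array`, `indices`, `sum`, and `N` as above; the lookahead distance of 16 is a guess that has to be tuned per machine, and it only pays off when the miss latency can be overlapped with real work:
```c
// Sketch: start fetching array[indices[i + 16]] while working on array[indices[i]].
// Arguments are (address, rw, locality): rw 0 = read, locality 1 = low reuse expected.
for (int i = 0; i < N; i++) {
    if (i + 16 < N)
        __builtin_prefetch(&array[indices[i + 16]], 0, 1);
    sum += array[indices[i]];
}
```
Measure before and after: a badly tuned prefetch distance can make things slower by wasting bandwidth.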
## Row-Major vs Column-Major
C stores 2D arrays in row-major order:
```
int matrix[3][4];
Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘
```
### Row-Major Access (Cache-Friendly)
```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];   // Sequential in memory!
    }
}
```
Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.
### Column-Major Access (Cache-Hostile)
```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];   // Jumps by COLS elements each time!
    }
}
```
Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.
If COLS=8192, each access jumps 32KB - far beyond any cache line!
**Result**: Column-major can be **10-50x slower** for large matrices.
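A small harness to reproduce the effect yourself; the 8192x8192 size is an assumption chosen to match the 32 KB column stride described above, and the ratio will vary by machine:
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 8192
#define COLS 8192

static int (*matrix)[COLS];   // ROWS x COLS of int, row-major as C defines it

static double run(int row_major) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (row_major) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += matrix[i][j];   // walks memory sequentially
    } else {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += matrix[i][j];   // jumps COLS * sizeof(int) = 32 KB per access
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sum == 42) puts("");           // keep the loops from being optimized away
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    matrix = malloc(sizeof(int[ROWS][COLS]));   // 256 MB
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            matrix[i][j] = 1;
    printf("row-major:    %.3fs\n", run(1));
    printf("column-major: %.3fs\n", run(0));
    free(matrix);
    return 0;
}
```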
## False Sharing: The Multithreaded Trap
Cache coherency means cores must agree on cache line contents.
```
Thread 1 (Core 0): counter1++ ──┐
Thread 2 (Core 1): counter2++ ──┴── Both in the same cache line!

┌───────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding..................]  │
└───────────────────────────────────────────────────────────────┘
```
When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.
**Fix**: Pad data to separate cache lines:
```c
struct {
    long counter;
    char padding[64 - sizeof(long)];  // Pad to 64 bytes (one cache line)
} counters[NUM_THREADS];
```
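In C11 the same intent can be written with `alignas`, which documents the requirement and lets the compiler insert the padding; a sketch mirroring the struct above:
```c
#include <stdalign.h>

// Each element is at least 64 bytes and starts on a 64-byte boundary,
// so no two counters can share a cache line.
struct padded_counter {
    alignas(64) long counter;
};
struct padded_counter counters[NUM_THREADS];
```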
## NUMA: When Memory Has Geography
On multi-socket systems, memory is "closer" to some CPUs:
```
┌────────────────────────────────────────────────────┐
│     Socket 0                         Socket 1      │
│  ┌─────────────┐                  ┌─────────────┐  │
│  │  Cores 0-7  │                  │  Cores 8-15 │  │
│  └──────┬──────┘                  └──────┬──────┘  │
│         │                                │         │
│  ┌──────┴──────┐   interconnect   ┌──────┴──────┐  │
│  │    RAM 0    │ ←──────────────→ │    RAM 1    │  │
│  │   (local)   │      (slow)      │  (remote)   │  │
│  └─────────────┘                  └─────────────┘  │
└────────────────────────────────────────────────────┘
```
Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
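On Linux, the `numactl` tool (if installed) shows the topology and lets you pin a process so its memory stays local; for example:
```bash
# Show nodes, per-node memory, and the node distance (access cost) matrix
numactl --hardware

# Run with CPUs and memory both bound to node 0, so all accesses stay local
numactl --cpunodebind=0 --membind=0 ./program

# Per-node allocation statistics (numa_hit / numa_miss counters)
numastat
```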
## Measuring Cache Behavior
```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program
# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```
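For the false-sharing case specifically, newer perf versions also ship `perf c2c` ("cache to cache"), which samples loads and stores and reports contended cache lines; it needs hardware memory-sampling support and suitable permissions:
```bash
# Sample loads/stores, then report cache lines bounced between cores
perf c2c record ./program
perf c2c report
```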
## Summary
| Pattern | Cache Behavior | Performance |
|---------|---------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |
**Key Takeaways**:
1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing
## Further Reading
- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)