perf-workshop/docs/CPU-CACHES-AND-MEMORY.md
# CPU Caches and Memory: Why Access Patterns Matter
## The Memory Wall
CPUs are fast. Memory is slow. This gap is called the "memory wall."
```
Relative Speed
══════════════
CPU registers  ████████████████████████████████  (~1 cycle)
L1 cache       ██████████████████████            (~4 cycles)
L2 cache       ████████████                      (~12 cycles)
L3 cache       ██████                            (~40 cycles)
Main RAM       █                                 (~200 cycles)
SSD                                              (~10,000 cycles)
HDD                                              (~10,000,000 cycles)
```
A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.
## The Cache Hierarchy
```
┌────────────────────────────────────────────────────┐
│                      CPU Core                      │
│  ┌──────────────────────────────────────────────┐  │
│  │ Registers (bytes, <1ns)                      │  │
│  └──────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────┐  │
│  │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)  │  │
│  └──────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────┐  │
│  │ L2 Cache: 256-512 KB, ~3-4ns                 │  │
│  └──────────────────────────────────────────────┘  │
└─────────────────────────┬──────────────────────────┘
        ┌─────────────────┴─────────────────┐
        │ L3 Cache: 8-64 MB, ~10-12ns       │  (shared between cores)
        └─────────────────┬─────────────────┘
        ┌─────────────────┴─────────────────┐
        │ Main RAM: GBs, ~50-100ns          │
        └───────────────────────────────────┘
```
### Typical Numbers (Desktop CPU, 2024)
| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |
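These figures vary by microarchitecture. On Linux with glibc you can query the geometry of the machine you are actually on; a minimal sketch (the `_SC_LEVEL*` constants are a glibc extension and may report 0 on some systems):
```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // glibc reports cache geometry through sysconf(); 0 or -1 means "unknown"
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```
The same values are available from the shell via `getconf -a | grep -i CACHE`.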
## Cache Lines: The Unit of Transfer
Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).
```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
         └────────────────── 64 bytes = 1 cache line ───────────────────┘
If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```
This is why **sequential access** is so much faster than random access.
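You can see the 64-byte granularity directly: a loop that touches one `int` per cache line costs nearly as much as one that touches every `int`, because each line is fetched whole either way. A minimal sketch (array size and strides are illustrative assumptions; compile with optimizations and expect the stride-1 and stride-16 runs to take roughly similar time):
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   // 64M ints (~256 MB), far larger than any cache

static double touch(int *a, int stride) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i += stride)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sum == 42) puts("");   // keep the compiler from deleting the loop
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    for (int i = 0; i < N; i++) a[i] = i;
    printf("stride 1  (every int):          %.3fs\n", touch(a, 1));
    printf("stride 16 (one per 64B line):   %.3fs\n", touch(a, 16));
    printf("stride 32 (one per two lines):  %.3fs\n", touch(a, 32));
    free(a);
    return 0;
}
```
Stride 16 does 1/16th of the arithmetic but still pulls in every cache line, so the cost of fetching lines dominates both runs.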
## Spatial and Temporal Locality
Caches exploit two patterns in real programs:
### Spatial Locality
"If you accessed address X, you'll probably access X+1 soon."
```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];               // Next element is in the same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()];  // Each access misses cache
}
```
### Temporal Locality
"If you accessed address X, you'll probably access X again soon."
```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x;    // x stays in registers
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];    // Array evicted from cache, refetch everything
}
```
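The fix for the second pattern is to fuse the two loops so each element is used for both results while it is still hot; a minimal sketch of that remedy:
```c
// Better: one pass, each element used for both results while still cached
long sum = 0, product = 1;
for (int i = 0; i < N; i++) {
    sum     += array[i];
    product *= array[i];
}
```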
## Why Random Access Kills Performance
### Example: Array vs Linked List
```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘
Linked List (scattered memory):
┌───┐          ┌───┐          ┌───┐
│ 0 │────────→ │ 1 │────────→ │ 2 │ ...
└───┘          └───┘          └───┘
  ↑              ↑              ↑
0x1000         0x5420         0x2108

Each node in a different cache line!
```
Traversing a scattered linked list causes a **cache miss per node**.
### Real Numbers
```
Array traversal:  ~0.004 seconds  (10M elements)
Sequential list:  ~0.018 seconds  (4.5x slower)
Scattered list:   ~1.400 seconds  (350x slower!)
```
The scattered list is O(n) just like the array, but the constant factor is 350x worse.
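A minimal sketch of how such a comparison can be written (exact numbers depend on the CPU, allocator, and element count):
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000   // 10M elements, as in the numbers above

struct node { long value; struct node *next; };

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1;
    long sum;

    // Array traversal: contiguous memory, prefetcher-friendly
    long *arr = malloc(N * sizeof *arr);
    for (long i = 0; i < N; i++) arr[i] = i;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sum = 0;
    for (long i = 0; i < N; i++) sum += arr[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("array:       %.3fs (sum=%ld)\n", seconds(t0, t1), sum);

    // Linked list traversal: one pointer chase (and potentially one cache miss) per node
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sum = 0;
    for (struct node *p = head; p; p = p->next) sum += p->value;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("linked list: %.3fs (sum=%ld)\n", seconds(t0, t1), sum);
    return 0;
}
```
Note that freshly allocated nodes usually land nearly sequentially in memory (the ~4.5x case); reproducing the 350x scattered case requires shuffling the nodes or interleaving the allocations with other work.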
## Pipeline Stalls: Why the CPU Can't Hide Latency
Modern CPUs execute many instructions simultaneously:
```
Pipeline (simplified):
Cycle:      1    2    3    4    5    6    7    8
Fetch:     [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:         [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:             [A]  [B]  [C]  [D]  [E]  [F]
Memory:                   [A]  [B]  [C]  [D]  [E]
Write:                         [A]  [B]  [C]  [D]
```
But what happens when instruction C needs data from memory?
```
Cycle:      1    2    3    4    5   ...   200  201  202
Fetch:     [A]  [B]  [C]  [D]  [E]  ...  (stalled)
Decode:         [A]  [B]  [C]  [D]  ...  (stalled)
Execute:             [A]  [B]  [C]  ...  waiting for memory...
Memory:                   [A]  [B]  [C ── cache miss, ~200 cycles ──→ C]
Write:                         [A]  [B]                              [C]

                        STALL! Pipeline bubbles
```
The CPU stalls for **~200 cycles** waiting for RAM. In those cycles, a modern superscalar core could have executed several hundred instructions.
### Out-of-Order Execution Helps (But Not Enough)
CPUs can execute later instructions while waiting:
```c
a = array[i]; // Cache miss, stall...
b = x + y; // Can execute while waiting!
c = b * 2; // Can execute while waiting!
d = a + 1; // Must wait for 'a'
```
But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
## The Prefetcher: CPU Tries to Help
Modern CPUs detect sequential access patterns and fetch data **before you ask**:
```
Your code accesses: [0] [1] [2] [3] [4] ...
Prefetcher fetches: [5] [6] [7] [8] ... (ahead of you!)
```
But prefetchers can only predict **regular patterns**:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect
```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];           // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]];  // Random indices, can't predict
}
```
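When the pattern is data-dependent, as in the second loop, GCC and Clang provide `__builtin_prefetch` so the code can request a line a few iterations before it is needed. A sketch assuming the same `array`, `indices`, `sum`, and `N` as above; the lookahead distance of 16 is a guess that has to be tuned per machine, and it only pays off when the miss latency can be overlapped with real work:
```c
// Sketch: start fetching array[indices[i + 16]] while working on array[indices[i]].
// Arguments are (address, rw, locality): rw 0 = read, locality 1 = low reuse expected.
for (int i = 0; i < N; i++) {
    if (i + 16 < N)
        __builtin_prefetch(&array[indices[i + 16]], 0, 1);
    sum += array[indices[i]];
}
```
Measure before and after: a badly tuned prefetch distance can make things slower by wasting bandwidth.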
## Row-Major vs Column-Major
C stores 2D arrays in row-major order:
```
int matrix[3][4];
Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘
```
### Row-Major Access (Cache-Friendly)
```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];   // Sequential in memory!
    }
}
```
Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.
### Column-Major Access (Cache-Hostile)
```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];   // Jumps by COLS elements each time!
    }
}
```
Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.
If COLS=8192, each access jumps 32KB - far beyond any cache line!
**Result**: Column-major can be **10-50x slower** for large matrices.
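A small harness to reproduce the effect yourself; the 8192x8192 size is an assumption chosen to match the 32 KB column stride described above, and the ratio will vary by machine:
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 8192
#define COLS 8192

static int (*matrix)[COLS];   // ROWS x COLS of int, row-major as C defines it

static double run(int row_major) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (row_major) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += matrix[i][j];   // walks memory sequentially
    } else {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += matrix[i][j];   // jumps COLS * sizeof(int) = 32 KB per access
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sum == 42) puts("");           // keep the loops from being optimized away
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    matrix = malloc(sizeof(int[ROWS][COLS]));   // 256 MB
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            matrix[i][j] = 1;
    printf("row-major:    %.3fs\n", run(1));
    printf("column-major: %.3fs\n", run(0));
    free(matrix);
    return 0;
}
```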
## False Sharing: The Multithreaded Trap
Cache coherency means cores must agree on cache line contents.
```
Thread 1 (Core 0): counter1++ ──┐
Thread 2 (Core 1): counter2++ ──┴── Both in the same cache line!

┌───────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding..................]  │
└───────────────────────────────────────────────────────────────┘
```
When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.
**Fix**: Pad data to separate cache lines:
```c
struct {
    long counter;
    char padding[64 - sizeof(long)];  // Pad to 64 bytes (one cache line)
} counters[NUM_THREADS];
```
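In C11 the same intent can be written with `alignas`, which documents the requirement and lets the compiler insert the padding; a sketch mirroring the struct above:
```c
#include <stdalign.h>

// Each element is at least 64 bytes and starts on a 64-byte boundary,
// so no two counters can share a cache line.
struct padded_counter {
    alignas(64) long counter;
};
struct padded_counter counters[NUM_THREADS];
```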
## NUMA: When Memory Has Geography
On multi-socket systems, memory is "closer" to some CPUs:
```
┌────────────────────────────────────────────────────┐
│     Socket 0                         Socket 1      │
│  ┌─────────────┐                  ┌─────────────┐  │
│  │  Cores 0-7  │                  │  Cores 8-15 │  │
│  └──────┬──────┘                  └──────┬──────┘  │
│         │                                │         │
│  ┌──────┴──────┐   interconnect   ┌──────┴──────┐  │
│  │    RAM 0    │ ←──────────────→ │    RAM 1    │  │
│  │   (local)   │      (slow)      │  (remote)   │  │
│  └─────────────┘                  └─────────────┘  │
└────────────────────────────────────────────────────┘
```
Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
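On Linux, the `numactl` tool (if installed) shows the topology and lets you pin a process so its memory stays local; for example:
```bash
# Show nodes, per-node memory, and the node distance (access cost) matrix
numactl --hardware

# Run with CPUs and memory both bound to node 0, so all accesses stay local
numactl --cpunodebind=0 --membind=0 ./program

# Per-node allocation statistics (numa_hit / numa_miss counters)
numastat
```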
## Measuring Cache Behavior
```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program
# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```
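For the false-sharing case specifically, newer perf versions also ship `perf c2c` ("cache to cache"), which samples loads and stores and reports contended cache lines; it needs hardware memory-sampling support and suitable permissions:
```bash
# Sample loads/stores, then report cache lines bounced between cores
perf c2c record ./program
perf c2c report
```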
## Summary
| Pattern | Cache Behavior | Performance |
|---------|---------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |
**Key Takeaways**:
1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing
## Further Reading
- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)