# CPU Caches and Memory: Why Access Patterns Matter

## The Memory Wall

CPUs are fast. Memory is slow. This gap is called the "memory wall."

```
Relative speed (longer bar = faster)
════════════════════════════════════
CPU registers  ████████████████████████████████   (~1 cycle)
L1 cache       ██████████████████████             (~4 cycles)
L2 cache       ████████████                       (~12 cycles)
L3 cache       ██████                             (~40 cycles)
Main RAM       █                                  (~200 cycles)
SSD                                                (~10,000 cycles)
HDD                                                (~10,000,000 cycles)
```

A cache miss that goes all the way to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.

## The Cache Hierarchy

```
┌────────────────────────────────────────────────────┐
│                      CPU Core                      │
│  ┌──────────────────────────────────────────────┐  │
│  │ Registers (bytes, <1 ns)                     │  │
│  └──────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────┐  │
│  │ L1 cache: 32-64 KB, ~1 ns (split: L1i + L1d) │  │
│  └──────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────┐  │
│  │ L2 cache: 256-512 KB, ~3-4 ns                │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┴─────────────────┐
        │ L3 cache: 8-64 MB, ~10-12 ns      │  (shared between cores)
        └─────────────────┬─────────────────┘
                          │
        ┌─────────────────┴─────────────────┐
        │ Main RAM: GBs, ~50-100 ns         │
        └───────────────────────────────────┘
```

### Typical Numbers (Desktop CPU, 2024)

| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |

## Cache Lines: The Unit of Transfer

Memory isn't fetched byte by byte. It's fetched in **cache lines** (typically 64 bytes).

```
Memory addresses:

0x1000: [████████████████████████████████████████████████████████████████]
        └──────────────────── 64 bytes = 1 cache line ───────────────────┘

If you access address 0x1020:
- The CPU fetches the entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```

This is why **sequential access** is so much faster than random access.

## Spatial and Temporal Locality

Caches exploit two patterns found in real programs:

### Spatial Locality

"If you accessed address X, you'll probably access X+1 soon."

```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];              // Next element is in the same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()]; // For a large array, almost every access misses
}
```

### Temporal Locality

"If you accessed address X, you'll probably access X again soon."

```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x;          // x stays in a register
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];          // If the array is bigger than the cache,
                                  // it has been evicted and must be refetched
}
```

## Why Random Access Kills Performance

### Example: Array vs Linked List

```
Array (sequential memory):

┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │   All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked list (scattered memory):

┌───┐          ┌───┐          ┌───┐
│ 0 │────────→ │ 1 │────────→ │ 2 │ ...
└───┘          └───┘          └───┘
  ↑              ↑              ↑
0x1000         0x5420         0x2108    Each node in a different cache line!
```
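To make the comparison concrete, here is a minimal sketch of the two traversals. The node layout, the element count (10 million), and the use of `clock_gettime()` are illustrative assumptions, not the exact benchmark behind the numbers in the next subsection. Note that nodes allocated back to back on a fresh heap often still end up roughly sequential in memory (the "sequential list" case); a fragmented heap or a shuffled insertion order is what produces a truly scattered list.

```c
// Sketch only: sizes, layout, and timing approach are assumptions;
// error handling and cleanup are omitted for brevity.
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

struct node { long value; struct node *next; };

static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    // Array: one contiguous block, traversed sequentially.
    long *array = malloc(N * sizeof *array);
    for (long i = 0; i < N; i++) array[i] = i;

    // Linked list: one malloc per node. A fresh heap tends to hand these
    // out roughly in order; a long-lived, fragmented heap scatters them.
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }

    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) sum += array[i];                 // sequential, prefetch-friendly
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("array: %.3f s (sum=%ld)\n", elapsed(t0, t1), sum);

    sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (struct node *p = head; p; p = p->next) sum += p->value;  // pointer chasing
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("list:  %.3f s (sum=%ld)\n", elapsed(t0, t1), sum);

    return 0;
}
```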
Traversing a scattered linked list causes a **cache miss per node**.

### Real Numbers

```
Array traversal:   ~0.004 seconds   (10M elements)
Sequential list:   ~0.018 seconds   (4.5x slower)
Scattered list:    ~1.400 seconds   (350x slower!)
```

The scattered list is O(n) just like the array, but its constant factor is 350x worse.

## Pipeline Stalls: Why the CPU Can't Hide Latency

Modern CPUs execute many instructions simultaneously:

```
Pipeline (simplified):

Cycle:     1    2    3    4    5    6    7    8
Fetch:    [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:        [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:            [A]  [B]  [C]  [D]  [E]  [F]
Memory:                  [A]  [B]  [C]  [D]  [E]
Write:                        [A]  [B]  [C]  [D]
```

But what happens when instruction C needs data from memory?

```
Cycle:     1    2    3    4    5   ...  200  201  202
Fetch:    [A]  [B]  [C]  [C]  [C]  ...  [C]  [D]  [E]
Decode:        [A]  [B]  [C]  [C]  ...  [C]  [C]  [D]
Execute:            [A]  [B]  waiting for memory...
Memory:                  [A]  [B]  ...       ...  [C]
Write:                        [A]  [B]  ...       [C]
                     ↑
                   STALL! Pipeline bubbles
```

The CPU stalls for **~200 cycles** waiting for RAM. Those 200 cycles could have executed 200+ instructions.

### Out-of-Order Execution Helps (But Not Enough)

CPUs can execute later, independent instructions while waiting:

```c
a = array[i];     // Cache miss, stall...
b = x + y;        // Can execute while waiting!
c = b * 2;        // Can execute while waiting!
d = a + 1;        // Must wait for 'a'
```

But there is a limit to how much independent work the CPU can find. With random memory access, it runs out quickly.

## The Prefetcher: CPU Tries to Help

Modern CPUs detect sequential access patterns and fetch data **before you ask**:

```
Your code accesses:   [0] [1] [2] [3] [4] ...
Prefetcher fetches:                       [5] [6] [7] [8] ...   (ahead of you!)
```

But prefetchers can only predict **regular patterns**:

- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect

```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];           // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]];  // Random indices, can't predict
}
```

## Row-Major vs Column-Major

C stores 2D arrays in row-major order:

```
int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└────── row 0 ─────┘ └────── row 1 ─────┘ └────── row 2 ─────┘
```

### Row-Major Access (Cache-Friendly)

```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];   // Sequential in memory!
    }
}
```

Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, every cache line fully used.

### Column-Major Access (Cache-Hostile)

```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];   // Jumps a full row (COLS elements) each time!
    }
}
```

Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - each access jumps `COLS * sizeof(int)` bytes. If COLS is 8192, that is a 32 KB stride - far beyond any cache line.

**Result**: Column-major traversal can be **10-50x slower** for large matrices.

## False Sharing: The Multithreaded Trap

Cache coherency means cores must agree on the contents of each cache line.

```
Thread 1 (Core 0): counter1++     ┐
Thread 2 (Core 1): counter2++     ├── Both in the same cache line!
                                  ┘

┌──────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding.................] │
└──────────────────────────────────────────────────────────────┘
```

When Thread 1 writes counter1, Core 1's copy of the cache line is invalidated, even though counter2 didn't change. The two cores fight over the line.
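Here is a minimal sketch of that problematic layout, using POSIX threads. The iteration count is arbitrary, and `volatile` is used only to keep the compiler from collapsing the loops; compile with `-pthread`. On typical hardware the padded layout shown next runs several times faster, though the exact ratio varies by machine.

```c
// Illustrative false-sharing demo: two threads, two adjacent counters.
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

// Adjacent longs: almost certainly in the same 64-byte cache line.
// 'volatile' prevents the compiler from hoisting the counter into a register.
static struct {
    volatile long counter1;
    volatile long counter2;
} shared;

static void *bump1(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        shared.counter1++;   // each write invalidates the other core's copy of the line
    return NULL;
}

static void *bump2(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        shared.counter2++;   // ...and vice versa: the line ping-pongs between cores
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump1, NULL);
    pthread_create(&t2, NULL, bump2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared.counter1, shared.counter2);
    return 0;
}
```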
**Fix**: Pad the data so each counter sits in its own cache line:

```c
struct {
    long counter;
    char padding[64 - sizeof(long)];  // pad each element out to 64 bytes
                                      // (for a hard guarantee, also align the
                                      // array itself to a 64-byte boundary)
} counters[NUM_THREADS];
```

## NUMA: When Memory Has Geography

On multi-socket systems, memory is "closer" to some CPUs than to others:

```
┌──────────────────────────────────────────────────────────┐
│   Socket 0                        Socket 1               │
│  ┌─────────────┐                 ┌─────────────┐         │
│  │  Core 0-7   │                 │  Core 8-15  │         │
│  └──────┬──────┘                 └──────┬──────┘         │
│         │                               │                │
│  ┌──────┴──────┐  interconnect   ┌──────┴──────┐         │
│  │   RAM 0     │ ←─────────────→ │   RAM 1     │         │
│  │  (local)    │     (slow)      │  (remote)   │         │
│  └─────────────┘                 └─────────────┘         │
└──────────────────────────────────────────────────────────┘
```

Accessing "remote" memory (RAM 1 from Core 0) typically adds around 50% more latency. (The short libnuma sketch in the appendix at the end of this post shows how to control which node memory is allocated on.)

## Measuring Cache Behavior

```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do the cache misses happen?
perf record -e cache-misses ./program
perf report
```

## Summary

| Pattern | Cache Behavior | Performance |
|---------|----------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |

**Key Takeaways**:

1. **Memory access pattern matters as much as algorithmic complexity.**
2. **Sequential access is almost always faster** - the prefetcher and full cache lines work for you.
3. **Random access causes pipeline stalls** - the CPU waits ~200 cycles per miss.
4. **Structure data for its access pattern**, not just for logical organization.
5. **Measure with `perf stat`** before optimizing.

## Further Reading

- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)
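## Appendix: Placing Memory on a NUMA Node

As a follow-up to the NUMA section above, here is a sketch of placing buffers on specific nodes with libnuma. The calls (`numa_available`, `numa_run_on_node`, `numa_alloc_onnode`, `numa_free`, `numa_max_node`) are standard libnuma, but the buffer size and node choices are arbitrary examples; link with `-lnuma` and time the two `memset` calls (or run them under `perf`) to see the local/remote difference on a multi-socket machine.

```c
// Sketch: one buffer on the local node, one on the farthest node, touch both.
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 64UL * 1024 * 1024;      // 64 MB per buffer (arbitrary)

    numa_run_on_node(0);                   // run this thread on node 0
    char *local  = numa_alloc_onnode(size, 0);               // node 0's RAM (local)
    char *remote = numa_alloc_onnode(size, numa_max_node()); // farthest node's RAM

    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(local,  1, size);               // local accesses
    memset(remote, 1, size);               // remote accesses: expect higher latency

    numa_free(local, size);
    numa_free(remote, size);
    return 0;
}
```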