CPU Caches and Memory: Why Access Patterns Matter
The Memory Wall
CPUs are fast. Memory is slow. This gap is called the "memory wall."
Relative Speed
══════════════
CPU registers ████████████████████████████████ (~1 cycle)
L1 cache ██████████████████████ (~4 cycles)
L2 cache ████████████ (~12 cycles)
L3 cache ██████ (~40 cycles)
Main RAM █ (~200 cycles)
SSD (~100,000 cycles)
HDD (~10,000,000 cycles)
A cache miss to RAM costs 50-100x more than an L1 hit. This is why cache behavior dominates performance for many programs.
The Cache Hierarchy
┌─────────────────────────────────────────────────────────────┐
│ CPU Core │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Registers (bytes, <1ns) │ │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d) │ │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L2 Cache: 256-512 KB, ~3-4ns │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
│ L3 Cache: 8-64 MB, ~10-12ns │ (shared between cores)
└─────────────────┬─────────────────┘
│
┌─────────────────┴─────────────────┐
│ Main RAM: GBs, ~50-100ns │
└───────────────────────────────────┘
Typical Numbers (Desktop CPU, 2024)
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |
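These figures vary by CPU generation and model. On Linux you can check your own machine's cache sizes and line size (output differs per system, and getconf may report 0 on some platforms):
# Cache sizes for each level
lscpu | grep -i cache
# Cache line size in bytes (typically 64)
getconf LEVEL1_DCACHE_LINESIZE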
Cache Lines: The Unit of Transfer
Memory isn't fetched byte-by-byte. It's fetched in cache lines (typically 64 bytes).
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
└──────────────── 64 bytes = 1 cache line ────────────────────┘
If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
This is why sequential access is so much faster than random access.
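A simple way to see the cache line at work is to compare a pass that reads every element with one that reads only one int per 64-byte line. This is a minimal sketch (compile with -O2); the array size, the 4-byte int, and the stride of 16 ints are assumptions based on a 64-byte line. The strided loop does 1/16 of the arithmetic, but because both loops pull in the same set of cache lines, it usually takes nowhere near 1/16 of the time once the array exceeds cache:
#include <stdio.h>
#include <stddef.h>
#include <time.h>

#define N (16 * 1024 * 1024)   // 16M ints = 64 MB, larger than any cache
static int array[N];

int main(void) {
    long sum = 0;

    // Warm-up pass: fault the pages in so the timings below measure
    // cache behavior, not first-touch page faults.
    for (size_t i = 0; i < N; i++) array[i] = (int)i;

    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++)        // touch all 16 ints per line
        sum += array[i];
    clock_t t1 = clock();
    for (size_t i = 0; i < N; i += 16)    // touch 1 int per 64-byte line
        sum += array[i];
    clock_t t2 = clock();

    printf("every element: %.3fs   one per line: %.3fs   (sum %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}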
Spatial and Temporal Locality
Caches exploit two patterns in real programs:
Spatial Locality
"If you accessed address X, you'll probably access X+1 soon."
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
sum += array[i]; // Next element is in same cache line
}
// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
sum += array[random_index()]; // Each access misses cache
}
Temporal Locality
"If you accessed address X, you'll probably access X again soon."
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
x = array[i];
result += x * x + x; // x stays in registers
}
// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
sum += array[i];
}
for (int i = 0; i < N; i++) {
product *= array[i]; // If the array exceeds cache size, it was evicted: refetch from RAM
}
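If both results are needed, fusing the two loops into one pass restores temporal locality: each element is used for both computations while it is still hot (same variables as above):
// BETTER: One pass, each element used twice while it's in cache
for (int i = 0; i < N; i++) {
    sum     += array[i];
    product *= array[i];
}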
Why Random Access Kills Performance
Example: Array vs Linked List
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘
Linked List (scattered memory):
┌───┐ ┌───┐ ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘ └───┘ └───┘
↑ ↑ ↑
0x1000 0x5420 0x2108
Each node in different cache line!
Traversing a scattered linked list causes a cache miss per node.
Real Numbers
Array traversal:                            ~0.004 seconds (10M elements)
Sequential list (nodes allocated in order): ~0.018 seconds (4.5x slower)
Scattered list (nodes shuffled in memory):  ~1.400 seconds (350x slower!)
The scattered list is O(n) just like the array, but the constant factor is 350x worse.
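A minimal sketch of the comparison (all names and sizes are illustrative; compile with -O2). Freshly allocated nodes tend to land near each other, which corresponds to the "sequential list" row above; shuffling the nodes, or allocating them interleaved with other data, produces the "scattered list" numbers:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000   // 10M elements

struct node { long value; struct node *next; };

int main(void) {
    // Array: one contiguous block, traversed in address order.
    long *array = malloc(N * sizeof *array);
    for (long i = 0; i < N; i++) array[i] = i;

    // List: one allocation per node, built by prepending so that
    // traversal visits values 0..N-1 in order.
    struct node *head = NULL;
    for (long i = N - 1; i >= 0; i--) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }

    clock_t t0 = clock();
    long sum1 = 0;
    for (long i = 0; i < N; i++) sum1 += array[i];
    clock_t t1 = clock();

    long sum2 = 0;
    for (struct node *p = head; p; p = p->next) sum2 += p->value;
    clock_t t2 = clock();

    printf("array: %.3fs   list: %.3fs   (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum1, sum2);
    return 0;
}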
Pipeline Stalls: Why the CPU Can't Hide Latency
Modern CPUs execute many instructions simultaneously:
Pipeline (simplified):
Cycle: 1 2 3 4 5 6 7 8
Fetch: [A] [B] [C] [D] [E] [F] [G] [H]
Decode: [A] [B] [C] [D] [E] [F] [G]
Execute: [A] [B] [C] [D] [E] [F]
Memory: [A] [B] [C] [D] [E]
Write: [A] [B] [C] [D]
But what happens when instruction C needs data from memory?
Cycle: 1 2 3 4 5 ... 200 201 202
Fetch: [A] [B] [C] [C] [C] ... [C] [D] [E]
Decode: [A] [B] [C] [C] ... [C] [C] [D]
Execute: [A] [B] waiting for memory...
Memory: [A] [B] ... ... [C]
Write: [A] [B] ... [C]
↑
STALL! Pipeline bubbles
The CPU stalls for ~200 cycles waiting for RAM. On a superscalar core that executes several instructions per cycle, those 200 cycles could have done work worth hundreds of instructions.
Out-of-Order Execution Helps (But Not Enough)
CPUs can execute later instructions while waiting:
a = array[i]; // Cache miss, stall...
b = x + y; // Can execute while waiting!
c = b * 2; // Can execute while waiting!
d = a + 1; // Must wait for 'a'
But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
The Prefetcher: CPU Tries to Help
Modern CPUs detect sequential access patterns and fetch data before you ask:
Your code accesses: [0] [1] [2] [3] [4] ...
Prefetcher fetches: [5] [6] [7] [8] ... (ahead of you!)
But prefetchers can only predict regular patterns:
- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect
// Prefetcher wins
for (int i = 0; i < N; i++) {
sum += array[i]; // Prefetcher fetches ahead
}
// Prefetcher loses
for (int i = 0; i < N; i++) {
sum += array[indices[i]]; // Random indices, can't predict
}
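If the indices are known ahead of time, you can issue the fetch yourself. GCC and Clang expose this as __builtin_prefetch; the look-ahead distance of 8 iterations here is only a starting guess to tune, and the gain (if any) depends heavily on the workload:
// Same gather loop, with a manual prefetch ~8 iterations ahead
for (int i = 0; i < N; i++) {
    if (i + 8 < N)
        __builtin_prefetch(&array[indices[i + 8]]);  // GCC/Clang builtin
    sum += array[indices[i]];
}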
Row-Major vs Column-Major
C stores 2D arrays in row-major order:
int matrix[3][4];
Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘
Row-Major Access (Cache-Friendly)
for (int i = 0; i < ROWS; i++) {
for (int j = 0; j < COLS; j++) {
sum += matrix[i][j]; // Sequential in memory!
}
}
Access pattern: [0,0] [0,1] [0,2] [0,3] [1,0]... - sequential, cache line fully used.
Column-Major Access (Cache-Hostile)
for (int j = 0; j < COLS; j++) {
for (int i = 0; i < ROWS; i++) {
sum += matrix[i][j]; // Jumps by COLS each time!
}
}
Access pattern: [0,0] [1,0] [2,0] [0,1]... - jumps by COLS * sizeof(int) bytes.
If COLS=8192, each access jumps 32KB - far beyond any cache line!
Result: Column-major can be 10-50x slower for large matrices.
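A quick way to measure the gap on your own machine; this sketch uses an 8192x8192 int matrix (~256 MB of RAM, so it is far larger than L3), and the sizes are illustrative. Compile with -O2:
#include <stdio.h>
#include <time.h>

#define ROWS 8192
#define COLS 8192

static int matrix[ROWS][COLS];   // ~256 MB, far larger than any cache

int main(void) {
    // Touch every element once so the timings below aren't dominated
    // by first-touch page faults.
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            matrix[i][j] = i + j;

    long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < ROWS; i++)        // row-major: sequential walk
        for (int j = 0; j < COLS; j++)
            sum += matrix[i][j];
    clock_t t1 = clock();
    for (int j = 0; j < COLS; j++)        // column-major: 32 KB jumps
        for (int i = 0; i < ROWS; i++)
            sum += matrix[i][j];
    clock_t t2 = clock();

    printf("row-major: %.3fs   column-major: %.3fs   (sum %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}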
False Sharing: The Multithreaded Trap
Cache coherency means cores must agree on cache line contents.
Thread 1 (Core 0): counter1++ ┐
Thread 2 (Core 1): counter2++ ├── Both in same cache line!
┘
┌────────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding.................] │
└────────────────────────────────────────────────────────────────┘
When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.
Fix: Pad data to separate cache lines:
struct {
long counter;
char padding[64 - sizeof(long)]; // Pad to 64 bytes
} counters[NUM_THREADS];
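C11 lets you state the intent directly instead of counting padding bytes: alignas(64) (from <stdalign.h>) forces each array element onto its own line, under the same assumption of a 64-byte cache line:
#include <stdalign.h>

struct padded_counter {
    alignas(64) long counter;   // sizeof(struct padded_counter) == 64
};
struct padded_counter counters[NUM_THREADS];  // one cache line per thread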
NUMA: When Memory Has Geography
On multi-socket systems, memory is "closer" to some CPUs:
┌─────────────────────────────────────────────────────────────┐
│ Socket 0 Socket 1 │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Core 0-7 │ │ Core 8-15 │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────┴──────┐ interconnect ┌──────┴──────┐ │
│ │ RAM 0 │ ←────────────→ │ RAM 1 │ │
│ │ (local) │ (slow) │ (remote) │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
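On Linux, the numactl tool (if installed) shows the topology and can pin a process and its allocations to one node so that all accesses stay local:
# Show NUMA nodes, their CPUs, and their memory
numactl --hardware
# Run with CPU and memory both restricted to node 0 (all memory local)
numactl --cpunodebind=0 --membind=0 ./program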
Measuring Cache Behavior
# Overall cache stats
perf stat -e cache-misses,cache-references ./program
# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
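perf also ships a dedicated mode, perf c2c, for finding cache lines that bounce between cores, which is the false-sharing pattern described earlier:
# Detect contended cache lines (false sharing)
perf c2c record ./program
perf c2c report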
Summary
| Pattern | Cache Behavior | Performance |
|---|---|---|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |
Key Takeaways:
- Memory access pattern matters as much as algorithm complexity
- Sequential access is almost always faster - prefetcher + cache lines
- Random access causes pipeline stalls - CPU waits ~200 cycles per miss
- Structure data for access pattern - not just for logical organization
- Measure with perf stat before optimizing