# CPU Caches and Memory: Why Access Patterns Matter
## The Memory Wall
CPUs are fast. Memory is slow. This gap is called the "memory wall."
```
Relative Speed
══════════════
CPU registers  ████████████████████████████████  (~1 cycle)
L1 cache       ██████████████████████            (~4 cycles)
L2 cache       ████████████                      (~12 cycles)
L3 cache       ██████                            (~40 cycles)
Main RAM       █                                 (~200 cycles)
SSD                                              (~100,000 cycles)
HDD                                              (~10,000,000 cycles)
```
A cache miss to RAM costs **50-100x more** than an L1 hit. This is why cache behavior dominates performance for many programs.
## The Cache Hierarchy
```
┌──────────────────────────────────────────────────────────┐
│ CPU Core                                                 │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Registers (bytes, <1ns)                            │  │
│  └────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────┐  │
│  │ L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)        │  │
│  └────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────┐  │
│  │ L2 Cache: 256-512 KB, ~3-4ns                       │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┴─────────────────┐
        │ L3 Cache: 8-64 MB, ~10-12ns       │  (shared between cores)
        └─────────────────┬─────────────────┘
                          │
        ┌─────────────────┴─────────────────┐
        │ Main RAM: GBs, ~50-100ns          │
        └───────────────────────────────────┘
```
### Typical Numbers (Desktop CPU, 2024)
| Level | Size | Latency | Bandwidth |
|-------|------|---------|-----------|
| L1d | 32-48 KB | ~4 cycles (~1 ns) | ~1 TB/s |
| L2 | 256 KB - 1 MB | ~12 cycles (~3 ns) | ~500 GB/s |
| L3 | 8-64 MB | ~40 cycles (~10 ns) | ~200 GB/s |
| RAM | 16-128 GB | ~200 cycles (~50 ns) | ~50 GB/s |
## Cache Lines: The Unit of Transfer
Memory isn't fetched byte-by-byte. It's fetched in **cache lines** (typically 64 bytes).
```
Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
         └─────────────────── 64 bytes = 1 cache line ───────────────────┘

If you access address 0x1020:
- CPU fetches the entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch
```
This is why **sequential access** is so much faster than random access.
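You can see the cache-line effect directly: a loop that touches every element and a loop that touches only one element per 64-byte line fetch the same set of lines, so they run in roughly the same time despite a 16x difference in arithmetic. A minimal sketch, assuming 64-byte lines and POSIX `clock_gettime` (the sizes and stride are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64 Mi ints = 256 MiB, far larger than L3 */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *a = calloc(N, sizeof *a);
    if (!a) return 1;

    double t0 = now();
    for (long i = 0; i < N; i++) a[i] *= 3;      /* touch every int */
    double t1 = now();
    for (long i = 0; i < N; i += 16) a[i] *= 3;  /* one int per 64-byte line */
    double t2 = now();

    /* Both loops fetch every cache line of the array, so the times
       are close even though the second does 1/16th of the work. */
    printf("every element: %.3fs   one per line: %.3fs\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}
```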
## Spatial and Temporal Locality
Caches exploit two patterns in real programs:
### Spatial Locality
"If you accessed address X, you'll probably access X+1 soon."
```c
// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];               // Next element is in same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()];  // Each access misses cache
}
```
### Temporal Locality
"If you accessed address X, you'll probably access X again soon."
```c
// GOOD: Reuse data while it's hot
for (int i = 0; i < N; i++) {
    x = array[i];
    result += x * x + x;  // x stays in registers
}

// BAD: Touch data once, move on
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];  // Array evicted from cache, refetch
}
```
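The fix for the BAD version is to fuse the two passes, using each element for both results while its cache line is still resident (a sketch reusing the variables above):

```c
// BETTER: one pass, each element used twice while it's hot
for (int i = 0; i < N; i++) {
    sum += array[i];
    product *= array[i];  // same cache line, no refetch
}
```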
## Why Random Access Kills Performance
### Example: Array vs Linked List
```
Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │   All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked List (scattered memory):
┌───┐          ┌───┐          ┌───┐
│ 0 │────────→ │ 1 │────────→ │ 2 │ ...
└───┘          └───┘          └───┘
  ↑              ↑              ↑
0x1000         0x5420         0x2108
Each node in different cache line!
```
Traversing a scattered linked list causes a **cache miss per node**.
### Real Numbers
```
Array traversal:  ~0.004 seconds   (10M elements)
Sequential list:  ~0.018 seconds   (4.5x slower)
Scattered list:   ~1.400 seconds   (350x slower!)
```
The scattered list is O(n) just like the array, but the constant factor is 350x worse.
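Numbers like these come from traversing lists whose nodes are deliberately scattered. A sketch of how such a list can be built (the node layout and shuffle are illustrative; time the traversal with `perf stat` or the timing helper shown earlier):

```c
#include <stdlib.h>

typedef struct Node { struct Node *next; long value; } Node;

/* Allocate n nodes, then link them in shuffled order so every
   ->next pointer jumps to an unrelated heap address. */
static Node *build_scattered(long n) {
    Node **nodes = malloc(n * sizeof *nodes);
    for (long i = 0; i < n; i++) {
        nodes[i] = malloc(sizeof(Node));
        nodes[i]->value = i;
    }
    for (long i = n - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
        long j = rand() % (i + 1);
        Node *t = nodes[i]; nodes[i] = nodes[j]; nodes[j] = t;
    }
    for (long i = 0; i + 1 < n; i++) nodes[i]->next = nodes[i + 1];
    nodes[n - 1]->next = NULL;
    Node *head = nodes[0];
    free(nodes);
    return head;
}

int main(void) {
    long sum = 0;
    for (Node *p = build_scattered(10 * 1000 * 1000); p; p = p->next)
        sum += p->value;  /* ~one cache miss per node */
    return (int)sum;      /* use sum so the loop isn't optimized away */
}
```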
## Pipeline Stalls: Why the CPU Can't Hide Latency
Modern CPUs execute many instructions simultaneously:
```
Pipeline (simplified):
Cycle:    1    2    3    4    5    6    7    8
Fetch:   [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
Decode:       [A]  [B]  [C]  [D]  [E]  [F]  [G]
Execute:           [A]  [B]  [C]  [D]  [E]  [F]
Memory:                 [A]  [B]  [C]  [D]  [E]
Write:                       [A]  [B]  [C]  [D]
```
But what happens when instruction C needs data from memory?
```
Cycle:    1    2    3    4    5   ...  200  201  202
Fetch:   [A]  [B]  [C]  [C]  [C]  ...  [C]  [D]  [E]
Decode:       [A]  [B]  [C]  [C]  ...  [C]  [C]  [D]
Execute:           [A]  [B]  waiting for memory...
Memory:                 [A]  [B]       ...  [C]
Write:                       [A]  [B]  ...       [C]
                         ↑
               STALL! Pipeline bubbles
```
The CPU stalls for **~200 cycles** waiting for RAM. Those 200 cycles could have executed 200+ instructions.
### Out-of-Order Execution Helps (But Not Enough)
CPUs can execute later instructions while waiting:
```c
a = array[i];  // Cache miss, stall...
b = x + y;     // Can execute while waiting!
c = b * 2;     // Can execute while waiting!
d = a + 1;     // Must wait for 'a'
```
But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.
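The pathological case is a dependency chain through memory: each load's address comes from the previous load, so there is nothing independent left to run ahead on. A sketch (`Node` and `head` as in the linked-list example above):

```c
// Pointer chasing: the misses serialize, costing one full
// memory latency per node.
long sum = 0;
for (Node *p = head; p != NULL; p = p->next) {
    sum += p->value;  // p->next is unknown until this node's line arrives
}
```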
## The Prefetcher: CPU Tries to Help
Modern CPUs detect sequential access patterns and fetch data **before you ask**:
```
Your code accesses:   [0] [1] [2] [3] [4] ...
Prefetcher fetches:                       [5] [6] [7] [8] ...  (ahead of you!)
```
But prefetchers can only predict **regular patterns**:

- Sequential: ✅ Perfect prediction
- Strided (every Nth element): ✅ Usually works
- Random: ❌ No pattern to detect
```c
// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];           // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]];  // Random indices, can't predict
}
```
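When the random indices are known ahead of time, an explicit software prefetch can recover part of the loss. GCC and Clang provide `__builtin_prefetch`; the lookahead distance of 8 below is an assumption to tune per machine, not a universal constant:

```c
// Software prefetch for a gather pattern (GCC/Clang builtin)
for (int i = 0; i < N; i++) {
    if (i + 8 < N)  // start fetching the line we'll need 8 iterations ahead
        __builtin_prefetch(&array[indices[i + 8]], /* rw = */ 0, /* locality = */ 1);
    sum += array[indices[i]];
}
```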
## Row-Major vs Column-Major
C stores 2D arrays in row-major order:
```
int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└────── row 0 ─────┘ └────── row 1 ─────┘ └────── row 2 ─────┘
```
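The addressing rule behind this layout: element `[i][j]` lives at byte offset `(i * COLS + j) * sizeof(int)` from the base. A small check using C's own pointer arithmetic:

```c
#include <assert.h>

int matrix[3][4];

int main(void) {
    /* Row-major: &matrix[i][j] == &matrix[0][0] + (i * COLS + j) */
    assert(&matrix[2][3] == &matrix[0][0] + (2 * 4 + 3));
    return 0;
}
```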
### Row-Major Access (Cache-Friendly)
```c
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];  // Sequential in memory!
    }
}
```
Access pattern: `[0,0] [0,1] [0,2] [0,3] [1,0]...` - sequential, cache line fully used.
### Column-Major Access (Cache-Hostile)
```c
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];  // Jumps by COLS elements each time!
    }
}
```
Access pattern: `[0,0] [1,0] [2,0] [0,1]...` - jumps by `COLS * sizeof(int)` bytes.
If COLS=8192, each access jumps 32KB - far beyond any cache line!
**Result**: Column-major can be **10-50x slower** for large matrices.
## False Sharing: The Multithreaded Trap
Cache coherency means cores must agree on cache line contents.
```
Thread 1 (Core 0): counter1++  ┐
Thread 2 (Core 1): counter2++  ├── Both in same cache line!
                               ┘

┌────────────────────────────────────────────────────────────┐
│ Cache line: [counter1] [counter2] [padding...............] │
└────────────────────────────────────────────────────────────┘
```
When Thread 1 writes counter1, Core 1's copy of the cache line is invalidated, even though counter2 didn't change. Both cores fight over ownership of the line.
**Fix**: Pad data to separate cache lines:
```c
struct {
    long counter;
    char padding[64 - sizeof(long)];  // Pad to 64 bytes
} counters[NUM_THREADS];
```
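C11 can state the same intent with an alignment specifier instead of hand-counted padding; a sketch (64 bytes is the assumed line size, as above):

```c
#include <stdalign.h>

struct padded_counter {
    alignas(64) long counter;  /* size and alignment round up to 64 bytes */
};

struct padded_counter counters[NUM_THREADS];  /* one cache line per slot */
```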
## NUMA: When Memory Has Geography
On multi-socket systems, memory is "closer" to some CPUs:
```
┌─────────────────────────────────────────────────────────────┐
│    Socket 0                           Socket 1              │
│  ┌─────────────┐                   ┌─────────────┐          │
│  │  Core 0-7   │                   │  Core 8-15  │          │
│  └──────┬──────┘                   └──────┬──────┘          │
│         │                                 │                 │
│  ┌──────┴──────┐    interconnect   ┌──────┴──────┐          │
│  │    RAM 0    │  ←─────────────→  │    RAM 1    │          │
│  │   (local)   │      (slow)       │   (remote)  │          │
│  └─────────────┘                   └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
```
Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.
## Measuring Cache Behavior
```bash
# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do cache misses happen?
perf record -e cache-misses ./program
perf report
```
## Summary
| Pattern | Cache Behavior | Performance |
|---------|---------------|-------------|
| Sequential access | Prefetcher wins, cache lines fully used | Fast |
| Strided access | Partial cache line use | Medium |
| Random access | Every access misses, pipeline stalls | Slow |
**Key Takeaways**:
1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster** - prefetcher + cache lines
3. **Random access causes pipeline stalls** - CPU waits ~200 cycles per miss
4. **Structure data for access pattern** - not just for logical organization
5. **Measure with `perf stat`** before optimizing
## Further Reading
- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf) - Ulrich Drepper
- [Gallery of Processor Cache Effects](http://igoro.com/archive/gallery-of-processor-cache-effects/)
- [Brendan Gregg's Memory Flamegraphs](https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html)