CPU Caches and Memory: Why Access Patterns Matter

The Memory Wall

CPUs are fast. Memory is slow. This gap is called the "memory wall."

                    Relative Speed
                    ══════════════
CPU registers       ████████████████████████████████  (~1 cycle)
L1 cache            ██████████████████████            (~4 cycles)
L2 cache            ████████████                      (~12 cycles)
L3 cache            ██████                            (~40 cycles)
Main RAM            █                                 (~200 cycles)
SSD                                                   (~10,000 cycles)
HDD                                                   (~10,000,000 cycles)

A cache miss to RAM costs 50-100x more than an L1 hit. This is why cache behavior dominates performance for many programs.

The Cache Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                         CPU Core                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Registers (bytes, <1ns)                             │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)         │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  L2 Cache: 256-512 KB, ~3-4ns                        │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                              │
            ┌─────────────────┴─────────────────┐
            │  L3 Cache: 8-64 MB, ~10-12ns      │  (shared between cores)
            └─────────────────┬─────────────────┘
                              │
            ┌─────────────────┴─────────────────┐
            │  Main RAM: GBs, ~50-100ns         │
            └───────────────────────────────────┘

Typical Numbers (Desktop CPU, 2024)

Level   Size             Latency                 Bandwidth
L1d     32-48 KB         ~4 cycles   (~1 ns)     ~1 TB/s
L2      256 KB - 1 MB    ~12 cycles  (~3 ns)     ~500 GB/s
L3      8-64 MB          ~40 cycles  (~10 ns)    ~200 GB/s
RAM     16-128 GB        ~200 cycles (~50 ns)    ~50 GB/s

Cache Lines: The Unit of Transfer

Memory isn't fetched byte-by-byte. It's fetched in cache lines (typically 64 bytes).

Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
        └──────────────── 64 bytes = 1 cache line ────────────────────┘

If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch

This is why sequential access is so much faster than random access.
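
You can see that the cost is per cache line, not per byte, by timing a stride-1 walk against a stride-16 walk (one int per 64-byte line) over the same large array. A minimal sketch, assuming a 64-byte line and an array far larger than L3; sizes and names are illustrative, compile with something like gcc -O2:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)          /* 64M ints = 256 MB, far larger than any cache */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    double t0 = now_sec();
    long sum1 = 0;
    for (long i = 0; i < N; i++) sum1 += a[i];        /* touch every int */
    double t1 = now_sec();

    long sum2 = 0;
    for (long i = 0; i < N; i += 16) sum2 += a[i];    /* one int per 64-byte line */
    double t2 = now_sec();

    /* 16x fewer additions, but the same number of cache lines fetched -
       so the two loops typically take comparable time */
    printf("stride 1:  %.3fs (sum=%ld)\n", t1 - t0, sum1);
    printf("stride 16: %.3fs (sum=%ld)\n", t2 - t1, sum2);
    free(a);
    return 0;
}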

Spatial and Temporal Locality

Caches exploit two patterns in real programs:

Spatial Locality

"If you accessed address X, you'll probably access X+1 soon."

// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];  // Next element is in same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()];  // Each access misses cache
}

Temporal Locality

"If you accessed address X, you'll probably access X again soon."

// GOOD: Reuse data while it's hot (one pass computes both results)
for (int i = 0; i < N; i++) {
    sum += array[i];
    product *= array[i];  // Same element: still in cache, likely still in a register
}

// BAD: Touch data once, come back for it later
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];  // For large N, the array was evicted - refetch from RAM
}

Why Random Access Kills Performance

Example: Array vs Linked List

Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │  All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked List (scattered memory):
┌───┐         ┌───┐         ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘         └───┘         └───┘
 ↑              ↑              ↑
0x1000       0x5420        0x2108
Each node in different cache line!

Traversing a scattered linked list causes a cache miss per node.
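
The traversal loop shows why: each step is a dependent load, so the CPU can't even start fetching node i+1 until node i's next pointer has arrived. A minimal sketch (struct and field names are illustrative):

struct node {
    long value;
    struct node *next;
};

long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        sum += n->value;   /* n->next must arrive before the next iteration can
                              begin: one potential cache miss per node, serialized */
    }
    return sum;
}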

Real Numbers

Array traversal:       ~0.004 seconds (10M elements)
Sequential list:       ~0.018 seconds (4.5x slower)
Scattered list:        ~1.400 seconds (350x slower!)

The scattered list is O(n) just like the array, but the constant factor is 350x worse.

Pipeline Stalls: Why the CPU Can't Hide Latency

Modern CPUs execute many instructions simultaneously:

Pipeline (simplified):
  Cycle:    1    2    3    4    5    6    7    8
  Fetch:   [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
  Decode:       [A]  [B]  [C]  [D]  [E]  [F]  [G]
  Execute:           [A]  [B]  [C]  [D]  [E]  [F]
  Memory:                 [A]  [B]  [C]  [D]  [E]
  Write:                       [A]  [B]  [C]  [D]

But what happens when instruction C needs data from memory?

  Cycle:    1    2    3    4    5   ...  200  201  202
  Fetch:   [A]  [B]  [C]  [C]  [C]  ... [C]  [D]  [E]
  Decode:       [A]  [B]  [C]  [C]  ... [C]  [C]  [D]
  Execute:           [A]  [B]  waiting for memory...
  Memory:                 [A]  [B]  ... ... [C]
  Write:                       [A]  [B]      ... [C]
                               ↑
                        STALL! Pipeline bubbles

The CPU stalls for ~200 cycles waiting for RAM. Those 200 cycles could have executed 200+ instructions.

Out-of-Order Execution Helps (But Not Enough)

CPUs can execute later instructions while waiting:

a = array[i];      // Cache miss, stall...
b = x + y;         // Can execute while waiting!
c = b * 2;         // Can execute while waiting!
d = a + 1;         // Must wait for 'a'

But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.

The Prefetcher: CPU Tries to Help

Modern CPUs detect sequential access patterns and fetch data before you ask:

Your code accesses:  [0]  [1]  [2]  [3]  [4]  ...
Prefetcher fetches:            [5]  [6]  [7]  [8]  ...  (ahead of you!)

But prefetchers can only predict regular patterns:

  • Sequential: Perfect prediction
  • Strided (every Nth element): Usually works
  • Random: No pattern to detect

// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];  // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]];  // Random indices, can't predict
}

Row-Major vs Column-Major

C stores 2D arrays in row-major order:

int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘

Row-Major Access (Cache-Friendly)

for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];  // Sequential in memory!
    }
}

Access pattern: [0,0] [0,1] [0,2] [0,3] [1,0]... - sequential, cache line fully used.

Column-Major Access (Cache-Hostile)

for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];  // Jumps by COLS each time!
    }
}

Access pattern: [0,0] [1,0] [2,0] [0,1]... - jumps by COLS * sizeof(int) bytes.

If COLS=8192, each access jumps 32KB - far beyond any cache line!

Result: Column-major can be 10-50x slower for large matrices.
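
If you genuinely need per-column results, you usually don't need a column-major traversal: keep one accumulator per column and still walk the matrix in memory order. A minimal sketch under that assumption (ROWS, COLS, and the function name are illustrative):

#define ROWS 4096
#define COLS 8192

void column_sums(const int matrix[ROWS][COLS], long col_sum[COLS]) {
    for (int j = 0; j < COLS; j++)
        col_sum[j] = 0;

    for (int i = 0; i < ROWS; i++) {           /* walk rows: memory order */
        for (int j = 0; j < COLS; j++) {
            col_sum[j] += matrix[i][j];        /* sequential reads, cache lines fully used */
        }
    }
}

The col_sum array (64 KB here) stays in the nearby cache levels across rows, so both the matrix reads and the accumulator updates keep good locality.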

False Sharing: The Multithreaded Trap

Cache coherency means cores must agree on cache line contents.

Thread 1 (Core 0):  counter1++    ┐
Thread 2 (Core 1):  counter2++    ├── Both in same cache line!
                                  ┘
┌────────────────────────────────────────────────────────────────┐
│  Cache line: [counter1] [counter2] [padding.................]  │
└────────────────────────────────────────────────────────────────┘

When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.

Fix: Pad data to separate cache lines:

struct {
    long counter;
    char padding[64 - sizeof(long)];  // Pad to 64 bytes
} counters[NUM_THREADS];
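
Here is a minimal sketch of how those padded counters might be exercised from multiple threads, assuming pthreads; thread and iteration counts are illustrative. The _Alignas(64) additionally puts the array itself on a cache-line boundary, so each counter really does get a line to itself:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERATIONS  100000000L

struct padded_counter {
    long counter;
    char padding[64 - sizeof(long)];          /* pad each counter to 64 bytes */
};

/* Align the array so the first element starts on a cache-line boundary */
static _Alignas(64) struct padded_counter counters[NUM_THREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERATIONS; i++)
        counters[id].counter++;               /* no other thread writes this line */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    long total = 0;
    for (int t = 0; t < NUM_THREADS; t++)
        total += counters[t].counter;
    printf("total = %ld\n", total);
    return 0;
}

Removing the padding, so all counters share one or two lines, typically makes this run several times slower - easy to confirm with perf stat -e cache-misses.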

NUMA: When Memory Has Geography

On multi-socket systems, memory is "closer" to some CPUs:

┌─────────────────────────────────────────────────────────────┐
│  Socket 0                        Socket 1                   │
│  ┌─────────────┐                ┌─────────────┐             │
│  │  Core 0-7   │                │  Core 8-15  │             │
│  └──────┬──────┘                └──────┬──────┘             │
│         │                              │                    │
│  ┌──────┴──────┐  interconnect  ┌──────┴──────┐             │
│  │   RAM 0     │ ←────────────→ │   RAM 1     │             │
│  │ (local)     │     (slow)     │ (remote)    │             │
│  └─────────────┘                └─────────────┘             │
└─────────────────────────────────────────────────────────────┘

Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.

Measuring Cache Behavior

# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do cache misses happen?
perf record -e cache-misses ./program
perf report

Summary

Pattern             Cache Behavior                             Performance
Sequential access   Prefetcher wins, cache lines fully used    Fast
Strided access      Partial cache line use                     Medium
Random access       Every access misses, pipeline stalls       Slow

Key Takeaways:

  1. Memory access pattern matters as much as algorithm complexity
  2. Sequential access is almost always faster - prefetcher + cache lines
  3. Random access causes pipeline stalls - CPU waits ~200 cycles per miss
  4. Structure data for access pattern - not just for logical organization
  5. Measure with perf stat before optimizing

Further Reading