CPU Caches and Memory: Why Access Patterns Matter

The Memory Wall

CPUs are fast. Memory is slow. This gap is called the "memory wall."

                    Relative Speed
                    ══════════════
CPU registers       ████████████████████████████████  (~1 cycle)
L1 cache            ██████████████████████            (~4 cycles)
L2 cache            ████████████                      (~12 cycles)
L3 cache            ██████                            (~40 cycles)
Main RAM            █                                 (~200 cycles)
SSD                                                   (~10,000 cycles)
HDD                                                   (~10,000,000 cycles)

A cache miss to RAM costs 50-100x more than an L1 hit. This is why cache behavior dominates performance for many programs.

The Cache Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                         CPU Core                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Registers (bytes, <1ns)                             │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  L1 Cache: 32-64 KB, ~1ns (split: L1i + L1d)         │    │
│  └─────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  L2 Cache: 256-512 KB, ~3-4ns                        │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                              │
            ┌─────────────────┴─────────────────┐
            │  L3 Cache: 8-64 MB, ~10-12ns      │  (shared between cores)
            └─────────────────┬─────────────────┘
                              │
            ┌─────────────────┴─────────────────┐
            │  Main RAM: GBs, ~50-100ns         │
            └───────────────────────────────────┘

Typical Numbers (Desktop CPU, 2024)

Level   Size             Latency                 Bandwidth
L1d     32-48 KB         ~4 cycles   (~1 ns)     ~1 TB/s
L2      256 KB - 1 MB    ~12 cycles  (~3 ns)     ~500 GB/s
L3      8-64 MB          ~40 cycles  (~10 ns)    ~200 GB/s
RAM     16-128 GB        ~200 cycles (~50 ns)    ~50 GB/s

Cache Lines: The Unit of Transfer

Memory isn't fetched byte-by-byte. It's fetched in cache lines (typically 64 bytes).

Memory addresses:
0x1000: [████████████████████████████████████████████████████████████████]
        └──────────────── 64 bytes = 1 cache line ────────────────────┘

If you access address 0x1020:
- CPU fetches entire cache line (0x1000-0x103F)
- Next access to 0x1021? Already in cache! (free)
- Access to 0x1040? Different cache line, another fetch

This is why sequential access is so much faster than random access.
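
You can see that the cost is per cache line, not per byte, by timing a stride-1 walk against a stride-16 walk (one int per 64-byte line) over the same large array. A minimal sketch, assuming a 64-byte line and an array far larger than L3; sizes and names are illustrative, compile with something like gcc -O2:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)          /* 64M ints = 256 MB, far larger than any cache */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    double t0 = now_sec();
    long sum1 = 0;
    for (long i = 0; i < N; i++) sum1 += a[i];        /* touch every int */
    double t1 = now_sec();

    long sum2 = 0;
    for (long i = 0; i < N; i += 16) sum2 += a[i];    /* one int per 64-byte line */
    double t2 = now_sec();

    /* 16x fewer additions, but the same number of cache lines fetched -
       so the two loops typically take comparable time */
    printf("stride 1:  %.3fs (sum=%ld)\n", t1 - t0, sum1);
    printf("stride 16: %.3fs (sum=%ld)\n", t2 - t1, sum2);
    free(a);
    return 0;
}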

Spatial and Temporal Locality

Caches exploit two patterns in real programs:

Spatial Locality

"If you accessed address X, you'll probably access X+1 soon."

// GOOD: Sequential access (spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[i];  // Next element is in same cache line
}

// BAD: Random access (no spatial locality)
for (int i = 0; i < N; i++) {
    sum += array[random_index()];  // Each access misses cache
}

Temporal Locality

"If you accessed address X, you'll probably access X again soon."

// GOOD: Reuse data while it's hot (one pass computes both results)
for (int i = 0; i < N; i++) {
    sum += array[i];
    product *= array[i];  // Same element: still in cache, likely still in a register
}

// BAD: Touch data once, come back for it later
for (int i = 0; i < N; i++) {
    sum += array[i];
}
for (int i = 0; i < N; i++) {
    product *= array[i];  // For large N, the array was evicted - refetch from RAM
}

Why Random Access Kills Performance

Example: Array vs Linked List

Array (sequential memory):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │  All in 1-2 cache lines
└───┴───┴───┴───┴───┴───┴───┴───┘

Linked List (scattered memory):
┌───┐         ┌───┐         ┌───┐
│ 0 │────────→│ 1 │────────→│ 2 │...
└───┘         └───┘         └───┘
 ↑              ↑              ↑
0x1000       0x5420        0x2108
Each node in different cache line!

Traversing a scattered linked list causes a cache miss per node.
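
The traversal loop shows why: each step is a dependent load, so the CPU can't even start fetching node i+1 until node i's next pointer has arrived. A minimal sketch (struct and field names are illustrative):

struct node {
    long value;
    struct node *next;
};

long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        sum += n->value;   /* n->next must arrive before the next iteration can
                              begin: one potential cache miss per node, serialized */
    }
    return sum;
}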

Real Numbers

Array traversal:       ~0.004 seconds (10M elements)
Sequential list:       ~0.018 seconds (4.5x slower)
Scattered list:        ~1.400 seconds (350x slower!)

The scattered list is O(n) just like the array, but the constant factor is 350x worse.

Pipeline Stalls: Why the CPU Can't Hide Latency

Modern CPUs execute many instructions simultaneously:

Pipeline (simplified):
  Cycle:    1    2    3    4    5    6    7    8
  Fetch:   [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
  Decode:       [A]  [B]  [C]  [D]  [E]  [F]  [G]
  Execute:           [A]  [B]  [C]  [D]  [E]  [F]
  Memory:                 [A]  [B]  [C]  [D]  [E]
  Write:                       [A]  [B]  [C]  [D]

But what happens when instruction C needs data from memory?

  Cycle:    1    2    3    4    5   ...  200  201  202
  Fetch:   [A]  [B]  [C]  [C]  [C]  ... [C]  [D]  [E]
  Decode:       [A]  [B]  [C]  [C]  ... [C]  [C]  [D]
  Execute:           [A]  [B]  waiting for memory...
  Memory:                 [A]  [B]  ... ... [C]
  Write:                       [A]  [B]      ... [C]
                               ↑
                        STALL! Pipeline bubbles

The CPU stalls for ~200 cycles waiting for RAM. Those 200 cycles could have executed 200+ instructions.

Out-of-Order Execution Helps (But Not Enough)

CPUs can execute later instructions while waiting:

a = array[i];      // Cache miss, stall...
b = x + y;         // Can execute while waiting!
c = b * 2;         // Can execute while waiting!
d = a + 1;         // Must wait for 'a'

But there's a limit to how much work the CPU can find. With random memory access, it runs out of independent work quickly.

The Prefetcher: CPU Tries to Help

Modern CPUs detect sequential access patterns and fetch data before you ask:

Your code accesses:  [0]  [1]  [2]  [3]  [4]  ...
Prefetcher fetches:            [5]  [6]  [7]  [8]  ...  (ahead of you!)

But prefetchers can only predict regular patterns:

  • Sequential: Perfect prediction
  • Strided (every Nth element): Usually works
  • Random: No pattern to detect

// Prefetcher wins
for (int i = 0; i < N; i++) {
    sum += array[i];  // Prefetcher fetches ahead
}

// Prefetcher loses
for (int i = 0; i < N; i++) {
    sum += array[indices[i]];  // Random indices, can't predict
}

Row-Major vs Column-Major

C stores 2D arrays in row-major order:

int matrix[3][4];

Memory layout:
[0,0][0,1][0,2][0,3] [1,0][1,1][1,2][1,3] [2,0][2,1][2,2][2,3]
└──── row 0 ───────┘ └──── row 1 ───────┘ └──── row 2 ───────┘

Row-Major Access (Cache-Friendly)

for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];  // Sequential in memory!
    }
}

Access pattern: [0,0] [0,1] [0,2] [0,3] [1,0]... - sequential, cache line fully used.

Column-Major Access (Cache-Hostile)

for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];  // Jumps by COLS each time!
    }
}

Access pattern: [0,0] [1,0] [2,0] [0,1]... - jumps by COLS * sizeof(int) bytes.

If COLS=8192, each access jumps 32KB - far beyond any cache line!

Result: Column-major can be 10-50x slower for large matrices.
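
If you genuinely need per-column results, you usually don't need a column-major traversal: keep one accumulator per column and still walk the matrix in memory order. A minimal sketch under that assumption (ROWS, COLS, and the function name are illustrative):

#define ROWS 4096
#define COLS 8192

void column_sums(const int matrix[ROWS][COLS], long col_sum[COLS]) {
    for (int j = 0; j < COLS; j++)
        col_sum[j] = 0;

    for (int i = 0; i < ROWS; i++) {           /* walk rows: memory order */
        for (int j = 0; j < COLS; j++) {
            col_sum[j] += matrix[i][j];        /* sequential reads, cache lines fully used */
        }
    }
}

The col_sum array (64 KB here) stays in the nearby cache levels across rows, so both the matrix reads and the accumulator updates keep good locality.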

False Sharing: The Multithreaded Trap

Cache coherency means cores must agree on cache line contents.

Thread 1 (Core 0):  counter1++    ┐
Thread 2 (Core 1):  counter2++    ├── Both in same cache line!
                                  ┘
┌────────────────────────────────────────────────────────────────┐
│  Cache line: [counter1] [counter2] [padding.................]  │
└────────────────────────────────────────────────────────────────┘

When Thread 1 writes counter1, Core 1's cache line is invalidated, even though counter2 didn't change. Both cores fight over the cache line.

Fix: Pad data to separate cache lines:

struct {
    long counter;
    char padding[64 - sizeof(long)];  // Pad to 64 bytes
} counters[NUM_THREADS];
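
Here is a minimal sketch of how those padded counters might be exercised from multiple threads, assuming pthreads; thread and iteration counts are illustrative. The _Alignas(64) additionally puts the array itself on a cache-line boundary, so each counter really does get a line to itself:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERATIONS  100000000L

struct padded_counter {
    long counter;
    char padding[64 - sizeof(long)];          /* pad each counter to 64 bytes */
};

/* Align the array so the first element starts on a cache-line boundary */
static _Alignas(64) struct padded_counter counters[NUM_THREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERATIONS; i++)
        counters[id].counter++;               /* no other thread writes this line */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    long total = 0;
    for (int t = 0; t < NUM_THREADS; t++)
        total += counters[t].counter;
    printf("total = %ld\n", total);
    return 0;
}

Removing the padding, so all counters share one or two lines, typically makes this run several times slower - easy to confirm with perf stat -e cache-misses.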

NUMA: When Memory Has Geography

On multi-socket systems, memory is "closer" to some CPUs:

┌─────────────────────────────────────────────────────────────┐
│  Socket 0                        Socket 1                   │
│  ┌─────────────┐                ┌─────────────┐             │
│  │  Core 0-7   │                │  Core 8-15  │             │
│  └──────┬──────┘                └──────┬──────┘             │
│         │                              │                    │
│  ┌──────┴──────┐  interconnect  ┌──────┴──────┐             │
│  │   RAM 0     │ ←────────────→ │   RAM 1     │             │
│  │ (local)     │     (slow)     │ (remote)    │             │
│  └─────────────┘                └─────────────┘             │
└─────────────────────────────────────────────────────────────┘

Accessing "remote" memory (RAM 1 from Core 0) adds ~50% latency.

Measuring Cache Behavior

# Overall cache stats
perf stat -e cache-misses,cache-references ./program

# Detailed breakdown
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program

# Where do cache misses happen?
perf record -e cache-misses ./program
perf report

Summary

Pattern             Cache Behavior                             Performance
Sequential access   Prefetcher wins, cache lines fully used    Fast
Strided access      Partial cache line use                     Medium
Random access       Every access misses, pipeline stalls       Slow

Key Takeaways:

  1. Memory access pattern matters as much as algorithm complexity
  2. Sequential access is almost always faster - prefetcher + cache lines
  3. Random access causes pipeline stalls - CPU waits ~200 cycles per miss
  4. Structure data for access pattern - not just for logical organization
  5. Measure with perf stat before optimizing

Further Reading