perf-workshop/scenario4-cache-misses

Scenario 4: Cache Misses and Memory Access Patterns

Learning Objectives

  • Understand CPU cache basics (L1, L2, L3)
  • Use perf stat to measure cache behavior
  • Recognize cache-friendly vs cache-hostile access patterns
  • Understand why Big-O notation doesn't tell the whole story

Background: How CPU Caches Work

CPU Core
    ↓
L1 Cache (~32KB, ~4 cycles)
    ↓
L2 Cache (~256KB, ~12 cycles)
    ↓
L3 Cache (~8MB, ~40 cycles)
    ↓
Main RAM (~64GB, ~200 cycles)

Key concepts:

  • Cache line: Data is loaded in chunks (typically 64 bytes)
  • Spatial locality: If you access byte N, bytes N+1, N+2, ... are likely already cached
  • Temporal locality: Recently accessed data is likely to be accessed again
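
The sizes and latencies above vary by CPU. To check your own machine, you can query the cache geometry at runtime; a minimal sketch, assuming Linux/glibc where these sysconf constants are defined (the name cache_info.c is illustrative):

/* cache_info.c - print this machine's cache geometry.
 * Minimal sketch; assumes Linux/glibc. sysconf may return 0 or -1
 * if a value is unavailable on your system. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}

From the shell, getconf LEVEL1_DCACHE_LINESIZE reports the same line size.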

Files

  • matrix_col_major.c - BAD: Column-major traversal (cache-hostile)
  • matrix_row_major.c - GOOD: Row-major traversal (cache-friendly)
  • list_scattered.c - BAD: Scattered linked list (worst cache behavior)
  • list_sequential.c - MEDIUM: Sequential linked list (better, but still has overhead)
  • array_sum.c - GOOD: Contiguous array (best cache behavior)

Setup

make all
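
If you don't have the workshop's Makefile at hand, each program can be built directly; an assumed equivalent (the real Makefile may use different flags):

gcc -O2 -Wall -o matrix_col_major matrix_col_major.c

-O2 matters here: at -O0 the loops are slow enough to partially mask cache effects, while at -O3 the compiler may vectorize or even interchange the loops itself.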

Exercise 1: Row-Major vs Column-Major Matrix Traversal

Step 1: Run the BAD version (column-major)

./matrix_col_major

Note the execution time.

Step 2: Profile to identify the issue

perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_col_major

Observe the high cache-miss count and miss rate (cache-misses as a fraction of cache-references). If an event shows as <not supported>, your CPU's PMU doesn't expose it; perf list shows what's available on your machine.

Step 3: Run the GOOD version (row-major)

./matrix_row_major

This should be significantly faster (often 3-10x).

Step 4: Profile to confirm the improvement

perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_row_major

Compare the cache miss counts and ratios with the column-major version.
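
A quick sanity check is the ratio cache-misses / cache-references. With made-up but plausible numbers: 900M misses out of 1,000M references (90%) for column-major versus 60M out of 1,000M (6%) for row-major would, on its own, account for a several-fold speedup.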

Why does this happen?

C stores 2D arrays in row-major order:

Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
         ←————— row 0 ——————→    ←—— row 1 ——→

Row-major access: Sequential in memory → cache lines are fully utilized

Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one cache line serves 16 ints

Column-major access: Jumping by COLS * sizeof(int) bytes each time

Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load entire line, use 1 int, evict
        [█_______________] ← repeat for each access
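
In code, the two versions differ only in loop order. A self-contained sketch (the matrix size and names are illustrative, not the workshop's exact source):

/* Illustrative sketch of the two traversal orders; the size and
 * names are assumptions, not the workshop's exact code. */
#include <stdio.h>

#define ROWS 4096
#define COLS 4096
static int m[ROWS][COLS];

long sum_row_major(void) {              /* cache-friendly */
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];             /* stride: sizeof(int), stays inside cache lines */
    return sum;
}

long sum_col_major(void) {              /* cache-hostile */
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];             /* stride: COLS * sizeof(int), new line every access */
    return sum;
}

int main(void) {
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}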

Exercise 2: Data Structure Memory Layout

Step 1: Run the WORST case (scattered linked list)

./list_scattered

Note the execution time - this is the worst case.

Step 2: Profile the cache behavior

perf stat -e cache-misses,cache-references ./list_scattered

Observe the terrible cache miss rate due to random memory access.
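
One common way to produce this layout (an assumed technique - list_scattered.c may do it differently) is to allocate all the nodes and then link them in shuffled order, so each next pointer jumps to an unrelated address:

/* Sketch: build a deliberately scattered list by linking nodes in
 * shuffled order. An assumed technique; list_scattered.c may differ.
 * Nodes are intentionally leaked to keep the sketch short. */
#include <stdio.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

struct node *build_scattered(size_t n) {
    struct node **nodes = malloc(n * sizeof *nodes);
    for (size_t i = 0; i < n; i++) {
        nodes[i] = malloc(sizeof *nodes[i]);
        nodes[i]->value = (int)i;
    }
    for (size_t i = n - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        struct node *t = nodes[i]; nodes[i] = nodes[j]; nodes[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++)     /* link in shuffled order */
        nodes[i]->next = nodes[i + 1];
    nodes[n - 1]->next = NULL;
    struct node *head = nodes[0];
    free(nodes);
    return head;
}

int main(void) {
    long sum = 0;
    for (struct node *p = build_scattered(1000000); p; p = p->next)
        sum += p->value;                   /* traversal jumps all over the heap */
    printf("sum = %ld\n", sum);
    return 0;
}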

Step 3: First improvement - sequential allocation

./list_sequential

This should be faster than scattered, as nodes are contiguous in memory.

Step 4: Profile the improvement

perf stat -e cache-misses,cache-references ./list_sequential

Cache behavior improves, but still not optimal due to pointer chasing.

Step 5: Best solution - contiguous array

./array_sum

This should be the fastest by a significant margin.

Step 6: Profile the optimal case

perf stat -e cache-misses,cache-references ./array_sum

Compare all three cache miss counts:

Case               Memory Layout        Cache Behavior
Array              Contiguous           Excellent - prefetcher wins
List (sequential)  Contiguous (lucky!)  Good - nodes happen to be adjacent
List (scattered)   Random               Terrible - every access misses

Why linked lists are slow

Even with sequential allocation, linked lists are slower than arrays, for three reasons (see the sketch after this list):

  1. Pointer chasing: CPU can't prefetch next element (doesn't know address until current node is loaded)
  2. Larger elements: struct node is bigger than int (includes pointer)
  3. Indirect access: Extra memory load for the next pointer
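
The contrast is visible in the loops themselves; a sketch (names illustrative):

/* Sketch: why the prefetcher helps one loop and not the other.
 * Names and sizes are illustrative. */
#include <stdio.h>
#include <stddef.h>

struct node { int value; struct node *next; };

/* Pointer chasing: node N+1's address is unknown until node N has
 * been loaded, so the prefetcher cannot run ahead and each miss
 * costs the full memory latency. */
long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Contiguous array: the address of a[i+1] is a simple function of i,
 * so the prefetcher streams data in before the loop needs it. */
long sum_array(const int *a, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

int main(void) {
    struct node c = {3, NULL}, b = {2, &c}, a = {1, &b};
    int arr[] = {1, 2, 3};
    printf("list=%ld array=%ld\n", sum_list(&a), sum_array(arr, 3));
    return 0;
}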

Exercise 3: Deeper perf Analysis

See more cache events

perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_col_major
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_row_major

Events explained:

  • L1-dcache-*: Level 1 data cache (fastest, smallest)
  • LLC-*: Last Level Cache (L3, slowest cache before RAM)
  • cycles: Total CPU cycles
  • instructions: Total instructions executed
  • IPC (instructions per cycle): Higher is better
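
IPC makes the stalls concrete. With made-up numbers: 4 billion instructions retired over 8 billion cycles gives IPC = 4e9 / 8e9 = 0.5, i.e. a core spending most of its time waiting (usually on memory), while the same loop at an IPC of 2-4 is keeping the core busy. Expect the column-major run's IPC to sit well below the row-major run's.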

Profile with perf record

perf record -e cache-misses ./matrix_col_major
perf report

This shows which functions cause the most cache misses.


Discussion Questions

  1. Why doesn't the compiler fix this?

    • Compilers can sometimes interchange loops automatically, but:
    • Side effects in the loop body may prevent the transformation
    • Possible pointer aliasing makes it unsafe to prove correct (see the sketch after this list)
    • The programmer often knows the access pattern better
  2. How big does the array need to be to see this effect?

    • If the array fits in L1 cache: little to no difference
    • If the array fits in L3 cache: moderate difference
    • If the array exceeds L3 cache: dramatic difference (a 4096x4096 int matrix is 64 MB, far larger than a typical 8MB L3)
  3. What about multithreaded code?

    • False sharing: different threads writing to the same cache line (see the sketch after this list)
    • Cache-coherency traffic between cores as the contested line ping-pongs
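
Two of these points are easier to see in code. A combined sketch (all names are illustrative, and 64 bytes is the common cache-line size, not a guarantee):

/* Sketches for questions 1 and 3; all names are illustrative. */
#include <stdio.h>

/* Question 1: without restrict, dst and src might overlap, so the
 * compiler must preserve the column-major order exactly as written.
 * restrict promises no overlap, which can re-enable interchange. */
void copy_cols(float *restrict dst, const float *restrict src,
               int rows, int cols) {
    for (int c = 0; c < cols; c++)
        for (int r = 0; r < rows; r++)
            dst[r * cols + c] = src[r * cols + c];
}

/* Question 3: false sharing. Both counters share one 64-byte cache
 * line, so two cores writing them independently still fight over it. */
struct counters_bad {
    long a;                       /* written by thread 1 */
    long b;                       /* written by thread 2 - same line as a */
};

/* Padding pushes b into its own cache line, ending the tug-of-war. */
struct counters_good {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};

int main(void) {
    float src[4] = {1, 2, 3, 4}, dst[4];
    copy_cols(dst, src, 2, 2);
    printf("copied: %g %g %g %g\n", dst[0], dst[1], dst[2], dst[3]);
    printf("bad: %zu bytes, good: %zu bytes\n",
           sizeof(struct counters_bad), sizeof(struct counters_good));
    return 0;
}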

Real-World Implications

  • Image processing: Process row-by-row, not column-by-column
  • Matrix operations: Libraries like BLAS use cache-blocking
  • Data structures: Arrays often beat linked lists in practice
  • Database design: Row stores vs column stores

Key Takeaways

  1. Memory access pattern matters as much as algorithm complexity
  2. Sequential access is almost always faster than random access
  3. Measure with perf stat before optimizing
  4. Big-O notation hides constant factors that can be 10-100x