
Scenario 4: Cache Misses and Memory Access Patterns

Learning Objectives

  • Understand CPU cache basics (L1, L2, L3)
  • Use perf stat to measure cache behavior
  • Recognize cache-friendly vs cache-hostile access patterns
  • Understand why Big-O notation doesn't tell the whole story

Background: How CPU Caches Work

CPU Core
    ↓
L1 Cache (~32KB, ~4 cycles)
    ↓
L2 Cache (~256KB, ~12 cycles)
    ↓
L3 Cache (~8MB, ~40 cycles)
    ↓
Main RAM (~64GB, ~200 cycles)

Key concepts:

  • Cache line: Data is loaded in chunks (typically 64 bytes)
  • Spatial locality: If you access byte N, bytes N+1, N+2, ... are likely already cached
  • Temporal locality: Recently accessed data is likely to be accessed again

Files

  • cache_demo.c - Row-major vs column-major 2D array traversal
  • list_vs_array.c - Array vs linked list traversal

Exercise 1: Row vs Column Major

Step 1: Build and run

make cache_demo
./cache_demo

You should see column-major is significantly slower (often 3-10x).

Step 2: Measure cache misses

perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./cache_demo

Compare the cache miss counts and ratios.

Why does this happen?

C stores 2D arrays in row-major order:

Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
         ←————— row 0 ——————→    ←—— row 1 ——→

Row-major access: Sequential in memory → cache lines are fully utilized

Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one cache line serves 16 ints

Column-major access: Jumping by COLS * sizeof(int) bytes each time

Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load entire line, use 1 int, evict
        [█_______________] ← repeat for each access

Exercise 2: Array vs Linked List

Step 1: Build and run

make list_vs_array
./list_vs_array

Step 2: Measure cache behavior

perf stat -e cache-misses,cache-references ./list_vs_array

Three cases compared:

Case               Memory Layout        Cache Behavior
-----------------  -------------------  ----------------------------------
Array              Contiguous           Excellent - prefetcher wins
List (sequential)  Contiguous (lucky!)  Good - nodes happen to be adjacent
List (scattered)   Random               Terrible - every access misses

Why "sequential list" is still slower than array:

  1. Pointer chasing: the CPU can't prefetch the next node because its address isn't known until the current node has been loaded
  2. Larger elements: struct node is bigger than int (includes pointer)
  3. Indirect access: Extra memory load for the next pointer

Exercise 3: Deeper perf Analysis

See more cache events

perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./cache_demo

Events explained:

  • L1-dcache-*: Level 1 data cache (fastest, smallest)
  • LLC-*: Last Level Cache (L3, slowest cache before RAM)
  • cycles: Total CPU cycles
  • instructions: Total instructions executed
  • IPC (instructions per cycle): Higher is better

Profile with perf record

perf record -e cache-misses ./cache_demo
perf report

This shows which functions cause the most cache misses.

Discussion Questions

  1. Why doesn't the compiler fix this?

    • Compilers can sometimes interchange loops, but:
        • Side effects in the loop body may prevent it
        • Pointer aliasing makes the reordering unsafe to prove
        • The programmer often knows better
  2. How big does the array need to be to see this effect?

    • If array fits in L1 cache: No difference
    • If array fits in L3 cache: Moderate difference
    • If array exceeds L3 cache: Dramatic difference
  3. What about multithreaded code?

    • False sharing: different threads writing to distinct variables that happen to share a cache line
    • Cache coherency traffic between cores

Real-World Implications

  • Image processing: Process row-by-row, not column-by-column
  • Matrix operations: Libraries like BLAS use cache-blocking
  • Data structures: Arrays often beat linked lists in practice
  • Database design: Row stores vs column stores

Key Takeaways

  1. Memory access pattern matters as much as algorithm complexity
  2. Sequential access is almost always faster than random access
  3. Measure with perf stat before optimizing
  4. Big-O notation hides constant factors that can be 10-100x