# Scenario 4: Cache Misses and Memory Access Patterns

## Learning Objectives

- Understand CPU cache basics (L1, L2, L3)
- Use `perf stat` to measure cache behavior
- Recognize cache-friendly vs cache-hostile access patterns
- Understand why Big-O notation doesn't tell the whole story

## Background: How CPU Caches Work

```
CPU Core
    ↓
L1 Cache  (~32KB,  ~4 cycles)
    ↓
L2 Cache  (~256KB, ~12 cycles)
    ↓
L3 Cache  (~8MB,   ~40 cycles)
    ↓
Main RAM  (~64GB,  ~200 cycles)
```

Key concepts:

- **Cache line**: Data is loaded in chunks (typically 64 bytes)
- **Spatial locality**: If you access byte N, bytes N+1, N+2, ... are likely already cached
- **Temporal locality**: Recently accessed data is likely to be accessed again

## Files

- `cache_demo.c` - Row-major vs column-major 2D array traversal
- `list_vs_array.c` - Array vs linked list traversal

## Exercise 1: Row vs Column Major

### Step 1: Build and run

```bash
make cache_demo
./cache_demo
```

You should see column-major is significantly slower (often 3-10x).

### Step 2: Measure cache misses

```bash
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./cache_demo
```

Compare the cache miss counts and ratios.

### Why does this happen?

C stores 2D arrays in **row-major** order:

```
Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
        ←—————————————— row 0 ——————————————→ ←—— row 1 ——→
```

**Row-major access**: Sequential in memory → cache lines are fully utilized

```
Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████]  ← one cache line serves 16 ints
```

**Column-major access**: Jumping by `COLS * sizeof(int)` bytes each time

```
Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________]  ← load entire line, use 1 int, evict
        [█_______________]  ← repeat for each access
```
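To make the two patterns concrete, here is a minimal standalone sketch of the experiment. It is an illustrative guess, not the repo's `cache_demo.c`: the array size, the timing helper, and the file name are all assumptions.

```c
/* row_col_sketch.c - hedged sketch of Exercise 1 (not the repo's cache_demo.c).
 * 4096 x 4096 ints = 64MB, deliberately larger than a typical L3 cache. */
#include <stdio.h>
#include <time.h>

#define ROWS 4096
#define COLS 4096

static int grid[ROWS][COLS];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    long long sum = 0;
    double t0;

    /* Row-major: walks memory sequentially; one 64-byte line serves 16 ints. */
    t0 = seconds();
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];
    printf("row-major:    %.3f s\n", seconds() - t0);

    /* Column-major: jumps COLS * sizeof(int) bytes per access, so nearly
     * every load touches a different cache line. */
    t0 = seconds();
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];
    printf("column-major: %.3f s\n", seconds() - t0);

    return (int)(sum & 1);  /* keep sum live so the loops aren't optimized away */
}
```

Build it with something like `gcc -O2 -o row_col_sketch row_col_sketch.c` and run it under the same `perf stat` command as above; note that at higher optimization levels some compilers may interchange the loops and mask the effect.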
## Exercise 2: Array vs Linked List

### Step 1: Build and run

```bash
make list_vs_array
./list_vs_array
```

### Step 2: Measure cache behavior

```bash
perf stat -e cache-misses,cache-references ./list_vs_array
```

### Three cases compared:

| Case | Memory Layout | Cache Behavior |
|------|---------------|----------------|
| Array | Contiguous | Excellent - prefetcher wins |
| List (sequential) | Contiguous (lucky!) | Good - nodes happen to be adjacent |
| List (scattered) | Random | Terrible - every access misses |

### Why the "sequential list" is still slower than the array

1. **Pointer chasing**: The CPU can't prefetch the next element (it can't know the address until the current node is loaded)
2. **Larger elements**: `struct node` is bigger than `int` (it includes a pointer)
3. **Indirect access**: Extra memory load for the `next` pointer

A runnable sketch of this comparison appears at the end of this document.

## Exercise 3: Deeper perf Analysis

### See more cache events

```bash
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./cache_demo
```

Events explained:

- `L1-dcache-*`: Level 1 data cache (fastest, smallest)
- `LLC-*`: Last Level Cache (L3, the slowest cache before RAM)
- `cycles`: Total CPU cycles
- `instructions`: Total instructions executed
- IPC (instructions per cycle): higher is better

### Profile with perf record

```bash
perf record -e cache-misses ./cache_demo
perf report
```

This shows which functions cause the most cache misses.

## Discussion Questions

1. **Why doesn't the compiler fix this?**
   - Compilers can sometimes interchange loops, but:
     - Side effects may prevent the transformation
     - Possible pointer aliasing makes it hard to prove safe
     - The programmer often knows better
2. **How big does the array need to be to see this effect?**
   - If the array fits in L1 cache: no difference
   - If the array fits in L3 cache: moderate difference
   - If the array exceeds L3 cache: dramatic difference
3. **What about multithreaded code?**
   - False sharing: different threads writing to independent data that happens to share a cache line
   - Cache coherency traffic between cores

## Real-World Implications

- **Image processing**: Process row-by-row, not column-by-column
- **Matrix operations**: Libraries like BLAS use cache-blocking (a toy tiling sketch appears at the end of this document)
- **Data structures**: Arrays often beat linked lists in practice
- **Database design**: Row stores vs column stores

## Key Takeaways

1. **Memory access pattern matters as much as algorithmic complexity**
2. **Sequential access is almost always faster than random access**
3. **Measure with `perf stat` before optimizing**
4. **Big-O notation hides constant factors that can be 10-100x**
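As promised in Exercise 2, here is a hedged sketch of the array-versus-scattered-list comparison. It is not the repo's `list_vs_array.c`: the node layout, element count, and shuffle are assumptions chosen so that every list hop lands on a cold cache line.

```c
/* list_sketch.c - hedged sketch of Exercise 2 (not the repo's list_vs_array.c). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000  /* 10M elements: far larger than L3 */

struct node {
    int value;
    struct node *next;  /* the pointer that forces pointer chasing */
};

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *arr = malloc(N * sizeof *arr);
    struct node *pool = malloc(N * sizeof *pool);  /* contiguous storage... */
    size_t *order = malloc(N * sizeof *order);
    if (!arr || !pool || !order) return 1;

    /* ...but shuffled links: Fisher-Yates on the visit order. */
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < N; i++) {
        arr[i] = (int)i;
        pool[order[i]].value = (int)i;
        pool[order[i]].next = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
    }

    long long sum = 0;

    /* Case 1: contiguous array - the hardware prefetcher streams it. */
    double t0 = seconds();
    for (size_t i = 0; i < N; i++) sum += arr[i];
    printf("array:          %.3f s\n", seconds() - t0);

    /* Case 3: scattered list - each hop is a likely cache miss. */
    t0 = seconds();
    for (struct node *p = &pool[order[0]]; p; p = p->next) sum += p->value;
    printf("scattered list: %.3f s\n", seconds() - t0);

    free(order); free(pool); free(arr);
    return (int)(sum & 1);  /* keep sum live */
}
```

The design choice worth noticing: the nodes still live in one contiguous allocation; only the `next` links follow a shuffled order. The storage is sequential, yet the traversal is random, which is exactly what defeats the prefetcher.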
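Finally, the cache-blocking (tiling) idea mentioned under Real-World Implications. This toy sketch transposes a matrix first naively and then in tiles; `BLOCK = 64` and the matrix size are illustrative assumptions, and real BLAS kernels are far more elaborate.

```c
/* blocking_sketch.c - toy illustration of cache blocking (tiling).
 * Two 2048 x 2048 float matrices = 32MB total, larger than a typical L3. */
#include <stdio.h>
#include <time.h>

#define N     2048
#define BLOCK 64   /* 64x64 floats = 16KB per tile: two tiles fit easily in L2 */

static float src[N][N], dst[N][N];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Naive transpose: reads src row-by-row but writes dst down a column,
 * so nearly every write touches a different cache line. */
static void transpose_naive(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Blocked transpose: process BLOCK x BLOCK tiles so the cache lines of
 * both the src tile and the dst tile stay resident while they are reused. */
static void transpose_blocked(void) {
    for (int bi = 0; bi < N; bi += BLOCK)
        for (int bj = 0; bj < N; bj += BLOCK)
            for (int i = bi; i < bi + BLOCK; i++)
                for (int j = bj; j < bj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}

int main(void) {
    double t0 = seconds();
    transpose_naive();
    printf("naive:   %.3f s\n", seconds() - t0);

    t0 = seconds();
    transpose_blocked();
    printf("blocked: %.3f s\n", seconds() - t0);

    return dst[N - 1][0] == 0.0f ? 0 : 1;  /* touch dst so the work isn't elided */
}
```

Running each version under the `perf stat -e cache-misses,cache-references` command from Exercise 1 should show noticeably fewer misses for the blocked variant.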