# Scenario 4: Cache Misses and Memory Access Patterns

## Learning Objectives

- Understand CPU cache basics (L1, L2, L3)
- Use `perf stat` to measure cache behavior
- Recognize cache-friendly vs cache-hostile access patterns
- Understand why Big-O notation doesn't tell the whole story

## Background: How CPU Caches Work

```
CPU Core
   ↓
L1 Cache  (~32KB,  ~4 cycles)
   ↓
L2 Cache  (~256KB, ~12 cycles)
   ↓
L3 Cache  (~8MB,   ~40 cycles)
   ↓
Main RAM  (~64GB,  ~200 cycles)
```

Key concepts:

- **Cache line**: Data is loaded in chunks (typically 64 bytes)
- **Spatial locality**: If you access byte N, bytes N+1, N+2, ... are likely already cached
- **Temporal locality**: Recently accessed data is likely to be accessed again

## Files

- `matrix_col_major.c` - BAD: Column-major traversal (cache-hostile)
- `matrix_row_major.c` - GOOD: Row-major traversal (cache-friendly)
- `list_scattered.c` - BAD: Scattered linked list (worst cache behavior)
- `list_sequential.c` - MEDIUM: Sequential linked list (better, but still has overhead)
- `array_sum.c` - GOOD: Contiguous array (best cache behavior)

## Setup

```bash
make all
```

---

## Exercise 1: Row-Major vs Column-Major Matrix Traversal

### Step 1: Run the BAD version (column-major)

```bash
./matrix_col_major
```

Note the execution time.

### Step 2: Profile to identify the issue

```bash
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_col_major
```

Observe the high cache-miss rate and count.

### Step 3: Run the GOOD version (row-major)

```bash
./matrix_row_major
```

This should be significantly faster (often 3-10x).

### Step 4: Profile to confirm the improvement

```bash
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_row_major
```

Compare the cache-miss counts and ratios with the column-major version.

### Why does this happen?

C stores 2D arrays in **row-major** order:

```
Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
        ←—————————————— row 0 —————————————→ ←—— row 1 ——→
```

**Row-major access**: Sequential in memory → cache lines are fully utilized

```
Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████]  ← one cache line serves 16 ints
```

**Column-major access**: Jumping by COLS * sizeof(int) bytes each time

```
Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________]  ← load entire line, use 1 int, evict
        [█_______________]  ← repeat for each access
```

---

## Exercise 2: Data Structure Memory Layout

### Step 1: Run the WORST case (scattered linked list)

```bash
./list_scattered
```

Note the execution time - this is the worst case.

### Step 2: Profile the cache behavior

```bash
perf stat -e cache-misses,cache-references ./list_scattered
```

Observe the terrible cache-miss rate caused by the random memory-access pattern.

### Step 3: First improvement - sequential allocation

```bash
./list_sequential
```

This should be faster than the scattered version, since the nodes are contiguous in memory.

### Step 4: Profile the improvement

```bash
perf stat -e cache-misses,cache-references ./list_sequential
```

Cache behavior improves, but it is still not optimal due to pointer chasing.

### Step 5: Best solution - contiguous array

```bash
./array_sum
```

This should be the fastest by a significant margin.

### Step 6: Profile the optimal case

```bash
perf stat -e cache-misses,cache-references ./array_sum
```

Compare all three cache-miss counts:

| Case | Memory Layout | Cache Behavior |
|------|---------------|----------------|
| Array | Contiguous | Excellent - prefetcher wins |
| List (sequential) | Contiguous (lucky!) | Good - nodes happen to be adjacent |
| List (scattered) | Random | Terrible - every access misses |
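The exercise sources aren't reproduced in this README, so here is a minimal stand-in sketch of the comparison in the table above, assuming an `int` payload per node (the `struct node`, `sum_list`, and `sum_array` names are illustrative, not the actual file contents). It shows why even a sequentially allocated list loses to a plain array:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

/* Illustrative node layout: payload plus a next pointer.
 * On a 64-bit system this is 16 bytes vs 4 bytes for a bare int,
 * so the list touches ~4x as many cache lines as the array. */
struct node {
    int value;
    struct node *next;
};

/* Pointer chasing: the address of node i+1 is unknown until
 * node i has been loaded, so the hardware prefetcher can't help. */
long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Contiguous array: addresses are a linear function of i,
 * so the prefetcher streams cache lines ahead of the loop. */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

int main(void) {
    int *a = malloc(N * sizeof *a);
    struct node *nodes = malloc(N * sizeof *nodes); /* sequential layout */
    if (!a || !nodes)
        return 1;

    for (size_t i = 0; i < N; i++) {
        a[i] = 1;
        nodes[i].value = 1;
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
    }

    printf("list:  %ld\n", sum_list(&nodes[0]));
    printf("array: %ld\n", sum_array(a, N));

    free(nodes);
    free(a);
    return 0;
}
```

The scattered variant presumably differs only in how the `next` pointers are wired: visiting nodes in a shuffled order defeats both spatial locality and the hardware prefetcher, which is why its miss rate is so much worse.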
### Why linked lists are slow

Even with sequential allocation, linked lists are slower than arrays:

1. **Pointer chasing**: The CPU can't prefetch the next element, because its address is unknown until the current node has been loaded
2. **Larger elements**: `struct node` is bigger than `int` (it includes a pointer), so fewer elements fit per cache line
3. **Indirect access**: Every step requires an extra memory load for the `next` pointer

---

## Exercise 3: Deeper perf Analysis

### See more cache events

```bash
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_col_major
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_row_major
```

Events explained:

- `L1-dcache-*`: Level 1 data cache (fastest, smallest)
- `LLC-*`: Last Level Cache (L3, the slowest cache before RAM)
- `cycles`: Total CPU cycles
- `instructions`: Total instructions executed
- IPC (instructions per cycle): higher is better

### Profile with perf record

```bash
perf record -e cache-misses ./matrix_col_major
perf report
```

This shows which functions cause the most cache misses.

---

## Discussion Questions

1. **Why doesn't the compiler fix this?**
   - Compilers can sometimes interchange loops, but:
     - Side effects may prevent the transformation
     - Potential pointer aliasing makes it unsafe in general
     - The programmer often knows better
2. **How big does the array need to be to see this effect?**
   - If the array fits in L1 cache: no difference
   - If the array fits in L3 cache: moderate difference
   - If the array exceeds L3 cache: dramatic difference
3. **What about multithreaded code?**
   - False sharing: different threads writing to data that shares a cache line
   - Cache-coherency traffic between cores

## Real-World Implications

- **Image processing**: Process row-by-row, not column-by-column
- **Matrix operations**: Libraries like BLAS use cache blocking
- **Data structures**: Arrays often beat linked lists in practice
- **Database design**: Row stores vs column stores

## Key Takeaways

1. **Memory access pattern matters as much as algorithmic complexity**
2. **Sequential access is almost always faster than random access**
3. **Measure with `perf stat` before optimizing**
4. **Big-O notation hides constant factors that can be 10-100x**
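To tie the takeaways back to Exercise 1: the entire row-major vs column-major difference comes down to the nesting order of two loops. Below is a minimal, self-contained sketch you can build and compare under `perf stat`; the dimensions, file name, and argument handling are illustrative assumptions, not the contents of the provided `matrix_*.c` files.

```c
#include <stdio.h>
#include <stdlib.h>

/* 8192 x 8192 ints = 256 MB, far larger than the ~8 MB L3 assumed
 * above, so traversal order dominates the runtime. */
#define ROWS 8192
#define COLS 8192

int main(int argc, char **argv) {
    int (*m)[COLS] = malloc(sizeof(int[ROWS][COLS]));
    if (!m)
        return 1;

    /* Write every element first so the kernel actually backs the
     * allocation with distinct physical pages. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 1;

    long sum = 0;
    if (argc > 1 && argv[1][0] == 'c') {
        /* Column-major: consecutive accesses are COLS*sizeof(int)
         * bytes apart, so each one touches a different cache line. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += m[i][j];
    } else {
        /* Row-major: consecutive accesses are adjacent in memory,
         * so one 64-byte line serves 16 ints. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += m[i][j];
    }

    printf("sum = %ld\n", sum);
    free(m);
    return 0;
}
```

Build with something like `gcc -O2 -o traverse traverse.c`, then run `./traverse` and `./traverse c` under `perf stat -e cache-misses,cache-references`. At higher optimization levels the compiler may interchange or vectorize these loops, which is exactly the caveat raised in discussion question 1.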