# Scenario 4: Cache Misses and Memory Access Patterns
## Learning Objectives
- Understand CPU cache basics (L1, L2, L3)
- Use `perf stat` to measure cache behavior
- Recognize cache-friendly vs cache-hostile access patterns
- Understand why Big-O notation doesn't tell the whole story
## Background: How CPU Caches Work
```
CPU Core
    ↓
L1 Cache  (~32KB,  ~4 cycles)
    ↓
L2 Cache  (~256KB, ~12 cycles)
    ↓
L3 Cache  (~8MB,   ~40 cycles)
    ↓
Main RAM  (~64GB,  ~200 cycles)
```
Key concepts:
- Cache line: Data is loaded in chunks (typically 64 bytes)
- Spatial locality: If you access byte N, bytes N+1, N+2, ... are likely already cached (demonstrated in the sketch below)
- Temporal locality: Recently accessed data is likely to be accessed again
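
The cache-line effect is easy to observe directly. The stand-alone sketch below (hypothetical, not one of this scenario's files) sweeps the same array with stride 1 and with stride 16. Since 16 ints fill one 64-byte cache line, both loops load exactly the same set of cache lines, so the strided loop often takes comparable time despite doing 1/16 of the additions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 256 MB of ints: far larger than any L3 */

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    if (!a) return 1;
    for (long i = 0; i < N; i++)
        a[i] = 1;                /* fill so the reads can't be elided */

    long sum = 0;
    clock_t t0;

    /* Stride 1: all 16 ints in each 64-byte cache line are used. */
    t0 = clock();
    for (long i = 0; i < N; i++)
        sum += a[i];
    printf("stride 1:  %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Stride 16 (16 ints * 4 bytes = one 64-byte line): only 1 of every
       16 ints is used, but every line must still be loaded from memory,
       so this often takes comparable time with 1/16 of the additions. */
    t0 = clock();
    for (long i = 0; i < N; i += 16)
        sum += a[i];
    printf("stride 16: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("sum = %ld\n", sum);  /* keep the loops from being optimized out */
    free(a);
    return 0;
}
```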
## Files
- `matrix_col_major.c` - BAD: Column-major traversal (cache-hostile)
- `matrix_row_major.c` - GOOD: Row-major traversal (cache-friendly)
- `list_scattered.c` - BAD: Scattered linked list (worst cache behavior)
- `list_sequential.c` - MEDIUM: Sequential linked list (better, but still has overhead)
- `array_sum.c` - GOOD: Contiguous array (best cache behavior)
## Setup
```
make all
```
## Exercise 1: Row-Major vs Column-Major Matrix Traversal
### Step 1: Run the BAD version (column-major)

```
./matrix_col_major
```
Note the execution time.
### Step 2: Profile to identify the issue

```
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_col_major
```
Observe the high cache miss rate and count.
### Step 3: Run the GOOD version (row-major)

```
./matrix_row_major
```
This should be significantly faster (often 3-10x).
### Step 4: Profile to confirm the improvement

```
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_row_major
```
Compare the cache miss counts and ratios with the column-major version.
### Why does this happen?
C stores 2D arrays in row-major order:

```
Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
        ←—————————————— row 0 ——————————————→ ←—— row 1 ——→
```

Row-major access is sequential in memory, so cache lines are fully utilized:

```
Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one cache line serves 16 ints
```

Column-major access jumps by `COLS * sizeof(int)` bytes each time:

```
Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load entire line, use 1 int, evict
        [█_______________] ← repeat for each access
```
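
The two programs most likely differ only in loop order. A minimal sketch (names and sizes here are illustrative, not necessarily those used in `matrix_row_major.c` and `matrix_col_major.c`):

```c
#include <stdio.h>

#define ROWS 4096
#define COLS 4096
static int m[ROWS][COLS];        /* 64 MB: far larger than a typical L3 */

/* Row-major: the inner loop walks the contiguous index, so consecutive
   accesses land in the same 64-byte cache line. */
static long sum_row_major(void) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Column-major: the inner loop jumps COLS * sizeof(int) = 16 KB between
   accesses, touching a new cache line on every iteration. */
static long sum_col_major(void) {
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}

int main(void) {
    printf("row-major sum: %ld\n", sum_row_major());
    printf("col-major sum: %ld\n", sum_col_major());
    return 0;
}
```

Note that both functions execute exactly the same number of additions; only the order of memory accesses differs.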
## Exercise 2: Data Structure Memory Layout
### Step 1: Run the WORST case (scattered linked list)

```
./list_scattered
```
Note the execution time - this is the worst case.
### Step 2: Profile the cache behavior

```
perf stat -e cache-misses,cache-references ./list_scattered
```
Observe the terrible cache miss rate due to random memory access.
### Step 3: First improvement - sequential allocation

```
./list_sequential
```
This should be faster than scattered, as nodes are contiguous in memory.
### Step 4: Profile the improvement

```
perf stat -e cache-misses,cache-references ./list_sequential
```
Cache behavior improves, but still not optimal due to pointer chasing.
### Step 5: Best solution - contiguous array

```
./array_sum
```
This should be the fastest by a significant margin.
### Step 6: Profile the optimal case

```
perf stat -e cache-misses,cache-references ./array_sum
```
Compare all three cache miss counts:
| Case | Memory Layout | Cache Behavior |
|---|---|---|
| Array | Contiguous | Excellent - prefetcher wins |
| List (sequential) | Contiguous (lucky!) | Good - nodes happen to be adjacent |
| List (scattered) | Random | Terrible - every access misses |
### Why linked lists are slow
Even with sequential allocation, linked lists are slower than arrays:
- Pointer chasing: the CPU can't prefetch the next element, because it doesn't know its address until the current node is loaded (illustrated in the sketch below)
- Larger elements: `struct node` is bigger than `int` (it includes a pointer)
- Indirect access: an extra memory load for the `next` pointer
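
A sketch of the access patterns involved (hypothetical; the actual `list_sequential.c` and `array_sum.c` may differ in details):

```c
#include <stdio.h>
#include <stdlib.h>

struct node {                    /* 16 bytes on a 64-bit machine vs 4 for an int */
    int value;
    struct node *next;
};

/* Pointer chasing: the load of n->next must finish before the CPU knows
   the next address, so the hardware prefetcher can't run ahead. */
static long sum_list(const struct node *n) {
    long sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Contiguous array: addresses form a predictable stream that the
   prefetcher can fetch ahead of the loop. */
static long sum_array(const int *a, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

int main(void) {
    enum { LEN = 1000000 };
    int *a = malloc(LEN * sizeof(int));
    struct node *nodes = malloc(LEN * sizeof(struct node));
    if (!a || !nodes) return 1;
    for (int i = 0; i < LEN; i++) {
        a[i] = i;
        nodes[i].value = i;
        /* Sequential layout, as in list_sequential: node i+1 follows node i. */
        nodes[i].next = (i + 1 < LEN) ? &nodes[i + 1] : NULL;
    }
    printf("list:  %ld\n", sum_list(&nodes[0]));
    printf("array: %ld\n", sum_array(a, LEN));
    free(a); free(nodes);
    return 0;
}
```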
## Exercise 3: Deeper perf Analysis
### See more cache events

```
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_col_major
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_row_major
```
Events explained:
- `L1-dcache-*`: Level 1 data cache (fastest, smallest)
- `LLC-*`: Last Level Cache (L3, the slowest cache before RAM)
- `cycles`: Total CPU cycles
- `instructions`: Total instructions executed
- IPC (instructions per cycle): computed as `instructions / cycles`; higher is better. A memory-bound run (like the column-major traversal) typically shows a much lower IPC than a cache-friendly one.
### Profile with perf record

```
perf record -e cache-misses ./matrix_col_major
perf report
```
This shows which functions cause the most cache misses.
## Discussion Questions
1. **Why doesn't the compiler fix this?**
   - Compilers can sometimes interchange loops automatically, but:
     - Side effects in the loop body may prevent it
     - Pointer aliasing can make the interchange unsafe to prove
     - The programmer often knows better
2. **How big does the array need to be to see this effect?**
   - If the array fits in L1 cache: no difference
   - If the array fits in L3 cache: moderate difference
   - If the array exceeds L3 cache: dramatic difference (a 4096×4096 `int` matrix is 64 MB, far beyond a typical ~8MB L3)
3. **What about multithreaded code?**
   - False sharing: different threads writing to the same cache line (see the sketch below)
   - Cache coherency traffic between cores
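
False sharing is easy to reproduce. Below is a minimal sketch (hypothetical, not part of this scenario's files), assuming POSIX threads, C11 `_Alignas`, and 64-byte cache lines; two threads each increment their own counter, and the only difference between the two runs is whether the counters share a cache line:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Both counters live in the same 64-byte cache line: every write by one
   thread invalidates the line in the other core's cache. */
_Alignas(64) struct { volatile long a, b; } tight;

/* 56 bytes of padding push b onto its own cache line. */
_Alignas(64) struct { volatile long a; char pad[56]; volatile long b; } padded;

static void *bump(void *p) {
    volatile long *counter = p;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;            /* volatile keeps the loop from collapsing */
    return NULL;
}

static void run(const char *label, volatile long *x, volatile long *y) {
    struct timespec t0, t1;
    pthread_t th1, th2;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&th1, NULL, bump, (void *)x);
    pthread_create(&th2, NULL, bump, (void *)y);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%-26s %.3fs\n", label,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void) {
    run("tight (false sharing):", &tight.a, &tight.b);
    run("padded (separate lines):", &padded.a, &padded.b);
    return 0;
}
```

Build with `cc -O2 false_sharing.c -pthread` (the filename is illustrative). On a multicore machine the tight phase typically runs several times slower, even though both phases execute identical instructions.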
## Real-World Implications
- Image processing: Process row-by-row, not column-by-column
- Matrix operations: Libraries like BLAS use cache-blocking (see the sketch after this list)
- Data structures: Arrays often beat linked lists in practice
- Database design: Row stores vs column stores
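
To make the cache-blocking point concrete, here is a sketch of a blocked matrix transpose (illustrative only; real BLAS kernels are far more sophisticated). The naive version writes `dst` column-by-column, missing on nearly every store; the blocked version works in tiles small enough that the active parts of both matrices stay cache-resident:

```c
#include <stdio.h>

#define N 2048
#define B 64                     /* tile edge; N is divisible by B */

static int src[N][N], dst[N][N];

/* Naive transpose: reads of src are sequential, but writes to dst jump
   N * sizeof(int) bytes each iteration, touching a new line every time. */
static void transpose_naive(void) {
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            dst[c][r] = src[r][c];
}

/* Blocked transpose: process B x B tiles so the rows of src and dst
   being touched fit in cache simultaneously and get fully reused. */
static void transpose_blocked(void) {
    for (int ri = 0; ri < N; ri += B)
        for (int ci = 0; ci < N; ci += B)
            for (int r = ri; r < ri + B; r++)
                for (int c = ci; c < ci + B; c++)
                    dst[c][r] = src[r][c];
}

int main(void) {
    src[3][5] = 42;
    transpose_naive();
    transpose_blocked();
    printf("%d\n", dst[5][3]);   /* prints 42 */
    return 0;
}
```

With B = 64, each tile is 64 × 64 × 4 bytes = 16 KB, so a pair of tiles fits comfortably in L2.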
## Key Takeaways
- Memory access pattern matters as much as algorithm complexity
- Sequential access is almost always faster than random access
- Measure with `perf stat` before optimizing
- Big-O notation hides constant factors that can be 10-100x