# Scenario 4: Cache Misses and Memory Access Patterns

## Learning Objectives

- Understand CPU cache basics (L1, L2, L3)
- Use `perf stat` to measure cache behavior
- Recognize cache-friendly vs cache-hostile access patterns
- Understand why Big-O notation doesn't tell the whole story

## Background: How CPU Caches Work

```
CPU Core
   ↓
L1 Cache (~32KB, ~4 cycles)
   ↓
L2 Cache (~256KB, ~12 cycles)
   ↓
L3 Cache (~8MB, ~40 cycles)
   ↓
Main RAM (~64GB, ~200 cycles)
```

Key concepts:

- **Cache line**: Data is loaded in chunks (typically 64 bytes)
- **Spatial locality**: If you access byte N, bytes N+1, N+2, ... are likely already cached
- **Temporal locality**: Recently accessed data is likely to be accessed again

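To make the cache-line and spatial-locality ideas concrete before the exercises, here is a small standalone sketch (not one of the scenario's files; the buffer size and stride are illustrative assumptions). Both loops read the same 64 MiB, but the second uses only one byte per 64-byte line per pass, so every line is re-fetched from RAM on each of its 64 passes:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64L * 1024 * 1024)  /* 64 MiB: larger than the caches above */

int main(void) {
    unsigned char *buf = malloc(N);
    if (!buf) return 1;
    for (long i = 0; i < N; i++) buf[i] = 1;  /* touch pages up front */

    /* Sequential: every byte of each fetched 64-byte line gets used. */
    clock_t t0 = clock();
    long sum = 0;
    for (long i = 0; i < N; i++)
        sum += buf[i];
    clock_t t1 = clock();

    /* Strided: 64 passes, one byte per cache line per pass; the buffer
     * doesn't fit in cache, so each pass re-fetches every line. */
    for (long j = 0; j < 64; j++)
        for (long i = j; i < N; i += 64)
            sum += buf[i];
    clock_t t2 = clock();

    printf("sequential %.2fs, strided %.2fs (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(buf);
    return 0;
}
```

Both loops execute the same number of additions; only the order of memory accesses differs.
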
## Files

- `matrix_col_major.c` - BAD: Column-major traversal (cache-hostile)
- `matrix_row_major.c` - GOOD: Row-major traversal (cache-friendly)
- `list_scattered.c` - BAD: Scattered linked list (worst cache behavior)
- `list_sequential.c` - MEDIUM: Sequential linked list (better, but still has overhead)
- `array_sum.c` - GOOD: Contiguous array (best cache behavior)

## Setup

```bash
make all
```

---

## Exercise 1: Row-Major vs Column-Major Matrix Traversal

### Step 1: Run the BAD version (column-major)

```bash
./matrix_col_major
```

Note the execution time.

### Step 2: Profile to identify the issue

```bash
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_col_major
```

Observe the high cache-miss count and the high miss ratio (cache-misses as a fraction of cache-references).

### Step 3: Run the GOOD version (row-major)

```bash
./matrix_row_major
```

This should be significantly faster (often 3-10x).

### Step 4: Profile to confirm the improvement

```bash
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_row_major
```

Compare the cache miss counts and ratios with the column-major version.

### Why does this happen?

C stores 2D arrays in **row-major** order:

```
Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
        ←—————————— row 0 ——————————→ ←—— row 1 ——→
```

**Row-major access**: sequential in memory, so each cache line is fully utilized:

```
Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one 64-byte cache line serves 16 ints
```

**Column-major access**: each access jumps ahead by `COLS * sizeof(int)` bytes, landing on a new cache line every time:

```
Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load an entire line, use 1 int, evict it
        [█_______________] ← repeat for every single access
```

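The exercise sources aren't reproduced in this README. As a hedged sketch of what the two traversals boil down to (the 4096x4096 size, variable names, and single-file layout are illustrative assumptions; the real programs live in `matrix_row_major.c` and `matrix_col_major.c`), note that only the loop order differs:

```c
#include <stdio.h>

/* Illustrative size: 4096 * 4096 ints = 64 MiB, far larger than L3. */
#define ROWS 4096
#define COLS 4096

static int matrix[ROWS][COLS];  /* static: too big for the stack */

int main(void) {
    long sum = 0;

    /* Row-major (cache-friendly): the inner loop walks consecutive
     * addresses, so each 64-byte line fetched serves 16 accesses. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += matrix[i][j];

    /* Column-major (cache-hostile): the inner loop jumps
     * COLS * sizeof(int) = 16 KiB per iteration, so nearly every
     * access fetches a fresh line and uses only one int from it. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += matrix[i][j];

    printf("%ld\n", sum);
    return 0;
}
```

Both loop nests perform exactly the same additions; the difference `perf stat` shows comes entirely from the order of memory accesses.
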
---
## Exercise 2: Data Structure Memory Layout

### Step 1: Run the WORST case (scattered linked list)

```bash
./list_scattered
```

Note the execution time - this is the worst case.

### Step 2: Profile the cache behavior

```bash
perf stat -e cache-misses,cache-references ./list_scattered
```

Observe the terrible cache miss rate due to random memory access.

### Step 3: First improvement - sequential allocation

```bash
./list_sequential
```

This should be faster than the scattered version, since the nodes are contiguous in memory.

### Step 4: Profile the improvement

```bash
perf stat -e cache-misses,cache-references ./list_sequential
```

Cache behavior improves, but it is still not optimal due to pointer chasing.

### Step 5: Best solution - contiguous array

```bash
./array_sum
```

This should be the fastest by a significant margin.

### Step 6: Profile the optimal case

```bash
perf stat -e cache-misses,cache-references ./array_sum
```

Compare the cache miss counts across all three runs:

| Case | Memory Layout | Cache Behavior |
|------|---------------|----------------|
| Array | Contiguous | Excellent - prefetcher wins |
| List (sequential) | Contiguous (lucky!) | Good - nodes happen to be adjacent |
| List (scattered) | Random | Terrible - every access misses |

### Why linked lists are slow

Even with sequential allocation, linked lists are slower than arrays, for the reasons below (see the sketch after this list):

1. **Pointer chasing**: The CPU can't prefetch the next element (its address isn't known until the current node has been loaded)
2. **Larger elements**: A `struct node` is bigger than an `int` (it carries a pointer)
3. **Indirect access**: Every hop costs an extra memory load for the `next` pointer

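The list exercises aren't reproduced here either. A hedged sketch of the two extremes being compared (sizes and names are assumptions; the scattered variant presumably links the same nodes in a permuted order, which is exactly what destroys locality):

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 22)  /* 4M elements: the working set exceeds L3 */

struct node {
    long value;
    struct node *next;  /* this pointer is what forces serialized loads */
};

int main(void) {
    /* "Sequential" list: all nodes come from one allocation, so node i
     * sits right next to node i+1 (the lucky middle case in the table). */
    struct node *nodes = malloc(N * sizeof *nodes);
    long *array = malloc(N * sizeof *array);
    if (!nodes || !array) return 1;
    for (long i = 0; i < N; i++) {
        nodes[i].value = 1;
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
        array[i] = 1;
    }

    /* List traversal: each load depends on the previous one, so the
     * prefetcher can't run ahead. */
    long list_sum = 0;
    for (struct node *p = nodes; p != NULL; p = p->next)
        list_sum += p->value;

    /* Array traversal: independent sequential loads that the hardware
     * prefetcher streams in ahead of the loop. */
    long array_sum = 0;
    for (long i = 0; i < N; i++)
        array_sum += array[i];

    printf("%ld %ld\n", list_sum, array_sum);
    free(nodes);
    free(array);
    return 0;
}
```
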
---
## Exercise 3: Deeper perf Analysis

### See more cache events

```bash
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_col_major
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_row_major
```

Events explained:

- `L1-dcache-*`: Level 1 data cache (fastest, smallest)
- `LLC-*`: Last Level Cache (L3, the slowest cache before RAM)
- `cycles`: Total CPU cycles
- `instructions`: Total instructions executed
- IPC (instructions per cycle): Higher is better

### Profile with perf record

```bash
perf record -e cache-misses ./matrix_col_major
perf report
```

This shows which functions cause the most cache misses.

---
## Discussion Questions

1. **Why doesn't the compiler fix this?**
   - Compilers can sometimes interchange loops, but:
     - Side effects may prevent it
     - Pointer aliasing makes the transformation unsafe to assume
     - The programmer often knows better

2. **How big does the array need to be to see this effect?**
   - If the array fits in L1 cache: no difference
   - If the array fits in L3 cache: a moderate difference
   - If the array exceeds L3 cache: a dramatic difference

3. **What about multithreaded code?**
   - False sharing: different threads writing to the same cache line (see the sketch below)
   - Cache coherency traffic between cores

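A minimal false-sharing sketch, assuming POSIX threads (this file is not part of the scenario; names and iteration counts are illustrative). Two threads increment two *different* counters, yet performance suffers because both counters share one 64-byte cache line, so the line ping-pongs between cores:

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Both counters live in one 64-byte cache line: writes from two
 * cores bounce the line back and forth (false sharing). */
struct {
    volatile long a;
    /* char pad[56];  <- uncomment to give each counter its own line */
    volatile long b;
} counters;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) counters.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", counters.a, counters.b);
    return 0;
}
```

Compile with `gcc -O2 -pthread`; uncommenting the padding gives each counter its own cache line and typically removes most of the slowdown.
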
## Real-World Implications

- **Image processing**: Process row-by-row, not column-by-column
- **Matrix operations**: Libraries like BLAS use cache blocking (see the sketch after this list)
- **Data structures**: Arrays often beat linked lists in practice
- **Database design**: Row stores vs column stores

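None of the exercise binaries demonstrate cache blocking, so here is a hedged sketch of the idea using a blocked matrix transpose (the sizes and the BLOCK value are assumptions to tune per machine). A naive transpose must stride through one of the two matrices; processing the data in BLOCK x BLOCK tiles keeps both the source and destination tiles cache-resident while they are worked on:

```c
#include <stdio.h>

#define N 2048   /* illustrative: the two matrices total 32 MiB */
#define BLOCK 64 /* tile size; tune so two tiles fit comfortably in cache */

static int src[N][N], dst[N][N];

/* Blocked (tiled) transpose: the naive dst[j][i] = src[i][j] loop is
 * sequential over one matrix but strided over the other. Working in
 * BLOCK x BLOCK tiles means each tile of src and dst is loaded once
 * and fully used before moving on. N is a multiple of BLOCK, so no
 * edge handling is needed. */
static void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}

int main(void) {
    src[0][N - 1] = 42;
    transpose_blocked();
    printf("%d\n", dst[N - 1][0]);  /* prints 42: keeps the work observable */
    return 0;
}
```
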
## Key Takeaways

1. **Memory access pattern matters as much as algorithm complexity**
2. **Sequential access is almost always faster than random access**
3. **Measure with `perf stat` before optimizing**
4. **Big-O notation hides constant factors that can be 10-100x**