perf-workshop/scenario4-cache-misses/README.md

# Scenario 4: Cache Misses and Memory Access Patterns

## Learning Objectives
- Understand CPU cache basics (L1, L2, L3)
- Use `perf stat` to measure cache behavior
- Recognize cache-friendly vs cache-hostile access patterns
- Understand why Big-O notation doesn't tell the whole story

## Background: How CPU Caches Work

```
CPU Core
    ↓
L1 Cache (~32KB, ~4 cycles)
    ↓
L2 Cache (~256KB, ~12 cycles)
    ↓
L3 Cache (~8MB, ~40 cycles)
    ↓
Main RAM (~64GB, ~200 cycles)
```
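
As a rough illustration, the approximate sizes in the diagram can be turned into a quick check of which cache level a working set could fit in. The function name and the size constants below are assumptions made for this sketch; real CPUs vary, and `lscpu` reports the actual cache sizes on your machine.

```c
#include <stddef.h>

/* Assumed sizes, copied from the diagram above; real hardware differs. */
#define L1_BYTES (32u * 1024)
#define L2_BYTES (256u * 1024)
#define L3_BYTES (8u * 1024 * 1024)

/* Hypothetical helper: returns the smallest cache level (1, 2 or 3)
 * that could hold `bytes`, or 0 if the data only fits in main RAM. */
int smallest_fitting_level(size_t bytes)
{
    if (bytes <= L1_BYTES) return 1;
    if (bytes <= L2_BYTES) return 2;
    if (bytes <= L3_BYTES) return 3;
    return 0;
}
```

For example, a 1024 x 1024 array of 4-byte ints is 4 MiB, so with these assumed sizes it would fit in L3 but not in L1 or L2.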

Key concepts:
- **Cache line**: Data is loaded in chunks (typically 64 bytes)
- **Spatial locality**: Programs tend to access nearby addresses; when you access byte N, the rest of its cache line (bytes N+1, N+2, ...) is loaded with it and is likely already cached
- **Temporal locality**: Recently accessed data is likely to be accessed again
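
To make the cache-line idea concrete, here is a minimal sketch that counts how many distinct cache lines a strided access pattern touches. It assumes 64-byte lines and a line-aligned starting address; `lines_touched` is a hypothetical helper written for this README, not part of the workshop code.

```c
#include <stddef.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Counts the distinct cache lines touched when reading `count` elements
 * placed `stride_bytes` apart, starting at a line-aligned address.
 * Assumes each element lies entirely within one cache line. */
size_t lines_touched(size_t count, size_t stride_bytes)
{
    size_t lines = 0;
    size_t last_line = (size_t)-1; /* sentinel: no line seen yet */

    for (size_t k = 0; k < count; k++) {
        size_t line = (k * stride_bytes) / CACHE_LINE;
        if (line != last_line) { /* offsets only grow, so a change means a new line */
            lines++;
            last_line = line;
        }
    }
    return lines;
}
```

With 4-byte elements, reading 1024 of them back to back (`stride_bytes = 4`) touches 64 lines, while reading 1024 of them 4096 bytes apart touches 1024 lines, one per element.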

## Files
- `cache_demo.c` - Row-major vs column-major 2D array traversal
- `list_vs_array.c` - Array vs linked list traversal

## Exercise 1: Row vs Column Major

### Step 1: Build and run
```bash
make cache_demo
./cache_demo
```

You should see that column-major traversal is significantly slower (often 3-10x).

### Step 2: Measure cache misses
```bash
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./cache_demo
```

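Compare the cache miss counts and the miss ratio (`cache-misses` divided by `cache-references`). When both events are counted, `perf stat` prints this ratio as a percentage next to `cache-misses`.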

### Why does this happen?

C stores 2D arrays in **row-major** order:
```
Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
         ←————— row 0 ——————→    ←—— row 1 ——→
```

**Row-major access**: Sequential in memory → cache lines are fully utilized
```
Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one cache line serves 16 ints
```

**Column-major access**: Each step jumps `COLS * sizeof(int)` bytes → almost every access lands in a different cache line
```
Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load entire line, use 1 int, evict
        [█_______________] ← repeat for each access
```
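
The two traversal orders can be sketched as follows. This is a minimal illustration, not the contents of `cache_demo.c`; the function names and the `ROWS`/`COLS` values are assumptions chosen for the example.

```c
#include <stddef.h>

#define ROWS 1024 /* assumed dimensions for this sketch */
#define COLS 1024

/* Row-major traversal: the inner loop walks consecutive addresses,
 * so every int in a loaded cache line gets used. */
long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal: the inner loop jumps COLS * sizeof(int)
 * bytes per step, so consecutive accesses hit different cache lines. */
long sum_col_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions do the same number of additions and return the same result; only the order of memory accesses differs.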

## Exercise 2: Array vs Linked List

### Step 1: Build and run
```bash
make list_vs_array
./list_vs_array
```

### Step 2: Measure cache behavior
```bash
perf stat -e cache-misses,cache-references ./list_vs_array
```

### Three cases compared:

| Case | Memory Layout | Cache Behavior |
|------|---------------|----------------|
| Array | Contiguous | Excellent - prefetcher wins |
| List (sequential) | Contiguous (lucky!) | Good - nodes happen to be adjacent |
| List (scattered) | Random | Terrible - nearly every access misses |

### Why "sequential list" is still slower than array:

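1. **Pointer chasing**: The address of the next node is only known once the current node has been loaded, so loads are serialized and the CPU can't fetch far ahead
2. **Larger elements**: `struct node` is bigger than `int` (it includes a pointer), so fewer elements fit in each cache line
3. **Indirect access**: Every step needs an extra load for the `next` pointer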
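
A minimal sketch of the two traversals shows where these costs come from. The `struct node` layout and function names here are assumptions for illustration and may differ from `list_vs_array.c`.

```c
#include <stddef.h>

/* Hypothetical node layout: on a typical 64-bit Linux system this is
 * 16 bytes (4-byte int, 4 bytes of padding, 8-byte pointer),
 * versus 4 bytes for a plain int. */
struct node {
    int value;
    struct node *next;
};

/* Array traversal: the address of element i is computed from the index,
 * so it does not depend on any earlier load. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* List traversal: the address of the next node comes out of the
 * current node, so each step must wait for the previous load. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```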

## Exercise 3: Deeper perf Analysis

### See more cache events
```bash
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./cache_demo
```

Events explained:
- `L1-dcache-*`: Level 1 data cache (fastest, smallest)
- `LLC-*`: Last Level Cache (typically L3, the slowest cache before RAM)
- `cycles`: Total CPU cycles
- `instructions`: Total instructions executed
- IPC (instructions per cycle): `instructions` divided by `cycles`, shown by `perf stat` as "insn per cycle"; higher is better

### Profile with perf record
```bash
perf record -e cache-misses ./cache_demo
perf report
```

`perf report` then shows which functions the sampled cache misses are attributed to.

## Discussion Questions

1. **Why doesn't the compiler fix this?**
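   - Compilers can sometimes interchange loops automatically, but only when they can prove the result is unchanged:
     - Side effects in the loop body may prevent reordering
     - Possible pointer aliasing makes it unsafe to assume iterations are independent
   - The programmer often knows more about the data layout and access pattern than the compiler can infer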

2. **How big does the array need to be to see this effect?**
   - If the array fits in L1 cache: Little or no difference
   - If the array fits in L3 cache: Moderate difference
   - If the array exceeds L3 cache: Dramatic difference

3. **What about multithreaded code?**
   - False sharing: Different threads writing to different variables that happen to share a cache line
   - Cache coherency traffic between cores
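
One common way to avoid false sharing is to align per-thread data to cache-line boundaries. The sketch below only shows the data layout (no threads) and assumes 64-byte cache lines and a C11 compiler; the struct names are hypothetical.

```c
#include <assert.h>   /* static_assert */
#include <stdalign.h> /* alignas */
#include <stddef.h>   /* offsetof */

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* a and b are adjacent, so they usually land in the same cache line.
 * If thread 1 updates a and thread 2 updates b, the line bounces
 * between cores even though no data is logically shared. */
struct counters_shared {
    long a;
    long b;
};

/* Each counter starts on its own cache-line boundary, so updates to a
 * and b touch different lines. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Compile-time check that the two counters are at least a line apart. */
static_assert(offsetof(struct counters_padded, b) -
              offsetof(struct counters_padded, a) >= CACHE_LINE,
              "counters must not share a cache line");
```

The padded layout trades memory for isolation: with these assumptions the struct grows to 128 bytes.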

## Real-World Implications

- **Image processing**: For row-major image buffers, process row-by-row, not column-by-column
- **Matrix operations**: Libraries like BLAS use cache-blocking
- **Data structures**: Arrays often beat linked lists in practice
- **Database design**: Row stores vs column stores

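As a small illustration of cache blocking, here is a tiled matrix transpose. It is a sketch of the general idea only, not how BLAS or any particular library implements it; the function name and the `BLOCK` size are assumptions, and a good tile size depends on the machine.

```c
#include <stddef.h>

#define BLOCK 32 /* assumed tile edge length; tune per machine */

/* Writes the transpose of the n x n row-major matrix `src` into `dst`.
 * Working on BLOCK x BLOCK tiles keeps the rows of src and dst that are
 * currently in use small enough to stay cached while a tile is processed. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp the tile at the matrix edge when n is not a multiple of BLOCK. */
            size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;

            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

A plain two-loop transpose reads one matrix sequentially but strides through the other; tiling limits how far apart those strided accesses get before the same cache lines are reused.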
## Key Takeaways

1. **Memory access patterns can matter as much as algorithmic complexity**
2. **Sequential access is almost always faster than random access**
3. **Measure with `perf stat` before optimizing**
4. **Big-O notation hides constant factors that can be 10-100x**