illustris/perf-workshop


Scenario 4: Cache Misses and Memory Access Patterns

Learning Objectives

Understand CPU cache basics (L1, L2, L3)
Use perf stat to measure cache behavior
Recognize cache-friendly vs cache-hostile access patterns
Understand why Big-O notation doesn't tell the whole story

Background: How CPU Caches Work

CPU Core
    ↓
L1 Cache (~32KB, ~4 cycles)
    ↓
L2 Cache (~256KB, ~12 cycles)
    ↓
L3 Cache (~8MB, ~40 cycles)
    ↓
Main RAM (~64GB, ~200 cycles)

Key concepts:

Cache line: Data is loaded in chunks (typically 64 bytes)
Spatial locality: If you access byte N, its neighbors N+1, N+2, ... in the same cache line are loaded along with it, so nearby accesses are likely to hit the cache
Temporal locality: Recently accessed data is likely to be accessed again

Files

cache_demo.c - Row-major vs column-major 2D array traversal
list_vs_array.c - Array vs linked list traversal

Exercise 1: Row vs Column Major

Step 1: Build and run

make cache_demo
./cache_demo

You should see that column-major traversal is significantly slower than row-major traversal (often 3-10x).

Step 2: Measure cache misses

perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./cache_demo

Compare the cache miss counts and ratios.

Why does this happen?

C stores 2D arrays in row-major order:

Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
         ←————— row 0 ——————→    ←—— row 1 ——→

Row-major access: Sequential in memory → cache lines are fully utilized

Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one cache line serves 16 ints

Column-major access: Jumping by COLS * sizeof(int) bytes each time

Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load entire line, use 1 int, evict
        [█_______________] ← repeat for each access

Exercise 2: Array vs Linked List

Step 1: Build and run

make list_vs_array
./list_vs_array

Step 2: Measure cache behavior

perf stat -e cache-misses,cache-references ./list_vs_array

Three cases compared:

Case	Memory Layout	Cache Behavior
Array	Contiguous	Excellent - prefetcher wins
List (sequential)	Contiguous (lucky!)	Good - nodes happen to be adjacent
List (scattered)	Random	Terrible - nearly every node access misses

Why "sequential list" is still slower than array:

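Pointer chasing: The address of the next node is only known after the current node has been loaded, so the loads form a dependency chain that the CPU cannot overlap
Larger elements: struct node is bigger than int (includes pointer), so fewer elements fit in each cache line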
Indirect access: Extra memory load for the next pointer

Exercise 3: Deeper perf Analysis

See more cache events

perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./cache_demo

Events explained:

L1-dcache-*: Level 1 data cache (fastest, smallest)
LLC-*: Last Level Cache (typically L3; the largest and slowest cache before RAM)
cycles: Total CPU cycles
instructions: Total instructions executed
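IPC (instructions per cycle): Not a separate event; perf stat derives it from instructions and cycles and prints it as "insn per cycle". Higher is better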

Profile with perf record

perf record -e cache-misses ./cache_demo
perf report

perf report then lists functions ranked by their share of the sampled cache-miss events.

Discussion Questions

Why doesn't the compiler fix this?
- Compilers can sometimes interchange loops, but:
  - Side effects in the loop body may prevent it
  - Possible pointer aliasing makes it unsafe to assume the reordered loops compute the same result
  - The programmer often knows the data layout and access pattern better than the compiler can infer
How big does the array need to be to see this effect?
- If the array fits in L1 cache: little or no difference
- If the array fits in L3 cache: moderate difference
- If the array exceeds L3 cache: dramatic difference
What about multithreaded code?
- False sharing: different threads writing to separate variables that happen to share a cache line
- Cache coherency traffic between cores

Real-World Implications

Image processing: Process row-by-row, not column-by-column
Matrix operations: Libraries like BLAS use cache-blocking
Data structures: Arrays often beat linked lists in practice
Database design: Row stores vs column stores

Key Takeaways

Memory access pattern can matter as much as algorithmic complexity
Sequential access is almost always faster than random access
Measure with perf stat before optimizing
Big-O notation hides constant factors that can be 10-100x