# Scenario 4: Cache Misses and Memory Access Patterns
## Learning Objectives
- Understand CPU cache basics (L1, L2, L3)
- Use `perf stat` to measure cache behavior
- Recognize cache-friendly vs cache-hostile access patterns
- Understand why Big-O notation doesn't tell the whole story
## Background: How CPU Caches Work
```
CPU Core
    ↓
L1 Cache  (~32KB,  ~4 cycles)
    ↓
L2 Cache  (~256KB, ~12 cycles)
    ↓
L3 Cache  (~8MB,   ~40 cycles)
    ↓
Main RAM  (~64GB,  ~200 cycles)
```
Key concepts:
- Cache line: Data is loaded in chunks (typically 64 bytes)
- Spatial locality: If you access byte N, bytes N+1, N+2, ... are likely already cached (demonstrated in the sketch below)
- Temporal locality: Recently accessed data is likely to be accessed again
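
The cache-line effect is easy to observe directly. The stand-alone sketch below (hypothetical, not one of this scenario's files) sweeps the same array with stride 1 and with stride 16. Since 16 ints fill one 64-byte cache line, both loops load exactly the same set of cache lines, so the strided loop often takes comparable time despite doing 1/16 of the additions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 256 MB of ints: far larger than any L3 */

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    if (!a) return 1;
    for (long i = 0; i < N; i++)
        a[i] = 1;                /* fill so the reads can't be elided */

    long sum = 0;
    clock_t t0;

    /* Stride 1: all 16 ints in each 64-byte cache line are used. */
    t0 = clock();
    for (long i = 0; i < N; i++)
        sum += a[i];
    printf("stride 1:  %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Stride 16 (16 ints * 4 bytes = one 64-byte line): only 1 of every
       16 ints is used, but every line must still be loaded from memory,
       so this often takes comparable time with 1/16 of the additions. */
    t0 = clock();
    for (long i = 0; i < N; i += 16)
        sum += a[i];
    printf("stride 16: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("sum = %ld\n", sum);  /* keep the loops from being optimized out */
    free(a);
    return 0;
}
```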
## Files
- `matrix_col_major.c` - BAD: Column-major traversal (cache-hostile)
- `matrix_row_major.c` - GOOD: Row-major traversal (cache-friendly)
- `list_scattered.c` - BAD: Scattered linked list (worst cache behavior)
- `list_sequential.c` - MEDIUM: Sequential linked list (better, but still has overhead)
- `array_sum.c` - GOOD: Contiguous array (best cache behavior)
## Setup
```
make all
```
## Exercise 1: Row-Major vs Column-Major Matrix Traversal
### Step 1: Run the BAD version (column-major)

```
./matrix_col_major
```
Note the execution time.
### Step 2: Profile to identify the issue

```
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_col_major
```
Observe the high cache miss rate and count.
### Step 3: Run the GOOD version (row-major)

```
./matrix_row_major
```
This should be significantly faster (often 3-10x).
### Step 4: Profile to confirm the improvement

```
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./matrix_row_major
```
Compare the cache miss counts and ratios with the column-major version.
### Why does this happen?
C stores 2D arrays in row-major order:

```
Memory: [0][0] [0][1] [0][2] ... [0][COLS-1] [1][0] [1][1] ...
        ←—————————————— row 0 ——————————————→ ←—— row 1 ——→
```

Row-major access is sequential in memory, so cache lines are fully utilized:

```
Access: [0][0] [0][1] [0][2] [0][3] ...
Cache:  [████████████████] ← one cache line serves 16 ints
```

Column-major access jumps by `COLS * sizeof(int)` bytes each time:

```
Access: [0][0] [1][0] [2][0] [3][0] ...
Cache:  [█_______________] ← load entire line, use 1 int, evict
        [█_______________] ← repeat for each access
```
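
The two programs most likely differ only in loop order. A minimal sketch (names and sizes here are illustrative, not necessarily those used in `matrix_row_major.c` and `matrix_col_major.c`):

```c
#include <stdio.h>

#define ROWS 4096
#define COLS 4096
static int m[ROWS][COLS];        /* 64 MB: far larger than a typical L3 */

/* Row-major: the inner loop walks the contiguous index, so consecutive
   accesses land in the same 64-byte cache line. */
static long sum_row_major(void) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Column-major: the inner loop jumps COLS * sizeof(int) = 16 KB between
   accesses, touching a new cache line on every iteration. */
static long sum_col_major(void) {
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}

int main(void) {
    printf("row-major sum: %ld\n", sum_row_major());
    printf("col-major sum: %ld\n", sum_col_major());
    return 0;
}
```

Note that both functions execute exactly the same number of additions; only the order of memory accesses differs.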
## Exercise 2: Data Structure Memory Layout
### Step 1: Run the WORST case (scattered linked list)

```
./list_scattered
```
Note the execution time - this is the worst case.
### Step 2: Profile the cache behavior

```
perf stat -e cache-misses,cache-references ./list_scattered
```
Observe the terrible cache miss rate due to random memory access.
### Step 3: First improvement - sequential allocation

```
./list_sequential
```
This should be faster than scattered, as nodes are contiguous in memory.
### Step 4: Profile the improvement

```
perf stat -e cache-misses,cache-references ./list_sequential
```
Cache behavior improves, but still not optimal due to pointer chasing.
### Step 5: Best solution - contiguous array

```
./array_sum
```
This should be the fastest by a significant margin.
### Step 6: Profile the optimal case

```
perf stat -e cache-misses,cache-references ./array_sum
```
Compare all three cache miss counts:
| Case | Memory Layout | Cache Behavior |
|---|---|---|
| Array | Contiguous | Excellent - prefetcher wins |
| List (sequential) | Contiguous (lucky!) | Good - nodes happen to be adjacent |
| List (scattered) | Random | Terrible - every access misses |
### Why linked lists are slow
Even with sequential allocation, linked lists are slower than arrays:
- Pointer chasing: the CPU can't prefetch the next element, because it doesn't know its address until the current node is loaded (illustrated in the sketch below)
- Larger elements: `struct node` is bigger than `int` (it includes a pointer)
- Indirect access: an extra memory load for the `next` pointer
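
A sketch of the access patterns involved (hypothetical; the actual `list_sequential.c` and `array_sum.c` may differ in details):

```c
#include <stdio.h>
#include <stdlib.h>

struct node {                    /* 16 bytes on a 64-bit machine vs 4 for an int */
    int value;
    struct node *next;
};

/* Pointer chasing: the load of n->next must finish before the CPU knows
   the next address, so the hardware prefetcher can't run ahead. */
static long sum_list(const struct node *n) {
    long sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Contiguous array: addresses form a predictable stream that the
   prefetcher can fetch ahead of the loop. */
static long sum_array(const int *a, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

int main(void) {
    enum { LEN = 1000000 };
    int *a = malloc(LEN * sizeof(int));
    struct node *nodes = malloc(LEN * sizeof(struct node));
    if (!a || !nodes) return 1;
    for (int i = 0; i < LEN; i++) {
        a[i] = i;
        nodes[i].value = i;
        /* Sequential layout, as in list_sequential: node i+1 follows node i. */
        nodes[i].next = (i + 1 < LEN) ? &nodes[i + 1] : NULL;
    }
    printf("list:  %ld\n", sum_list(&nodes[0]));
    printf("array: %ld\n", sum_array(a, LEN));
    free(a); free(nodes);
    return 0;
}
```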
## Exercise 3: Deeper perf Analysis
### See more cache events

```
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_col_major
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_row_major
```
Events explained:
- `L1-dcache-*`: Level 1 data cache (fastest, smallest)
- `LLC-*`: Last Level Cache (L3, the slowest cache before RAM)
- `cycles`: Total CPU cycles
- `instructions`: Total instructions executed
- IPC (instructions per cycle): computed as `instructions / cycles`; higher is better. A memory-bound run (like the column-major traversal) typically shows a much lower IPC than a cache-friendly one.
### Profile with perf record

```
perf record -e cache-misses ./matrix_col_major
perf report
```
This shows which functions cause the most cache misses.
## Discussion Questions
1. **Why doesn't the compiler fix this?**
   - Compilers can sometimes interchange loops automatically, but:
     - Side effects in the loop body may prevent it
     - Pointer aliasing can make the interchange unsafe to prove
     - The programmer often knows better
2. **How big does the array need to be to see this effect?**
   - If the array fits in L1 cache: no difference
   - If the array fits in L3 cache: moderate difference
   - If the array exceeds L3 cache: dramatic difference (a 4096×4096 `int` matrix is 64 MB, far beyond a typical ~8MB L3)
3. **What about multithreaded code?**
   - False sharing: different threads writing to the same cache line (see the sketch below)
   - Cache coherency traffic between cores
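
False sharing is easy to reproduce. Below is a minimal sketch (hypothetical, not part of this scenario's files), assuming POSIX threads, C11 `_Alignas`, and 64-byte cache lines; two threads each increment their own counter, and the only difference between the two runs is whether the counters share a cache line:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Both counters live in the same 64-byte cache line: every write by one
   thread invalidates the line in the other core's cache. */
_Alignas(64) struct { volatile long a, b; } tight;

/* 56 bytes of padding push b onto its own cache line. */
_Alignas(64) struct { volatile long a; char pad[56]; volatile long b; } padded;

static void *bump(void *p) {
    volatile long *counter = p;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;            /* volatile keeps the loop from collapsing */
    return NULL;
}

static void run(const char *label, volatile long *x, volatile long *y) {
    struct timespec t0, t1;
    pthread_t th1, th2;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&th1, NULL, bump, (void *)x);
    pthread_create(&th2, NULL, bump, (void *)y);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%-26s %.3fs\n", label,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void) {
    run("tight (false sharing):", &tight.a, &tight.b);
    run("padded (separate lines):", &padded.a, &padded.b);
    return 0;
}
```

Build with `cc -O2 false_sharing.c -pthread` (the filename is illustrative). On a multicore machine the tight phase typically runs several times slower, even though both phases execute identical instructions.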
## Real-World Implications
- Image processing: Process row-by-row, not column-by-column
- Matrix operations: Libraries like BLAS use cache-blocking (see the sketch after this list)
- Data structures: Arrays often beat linked lists in practice
- Database design: Row stores vs column stores
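
To make the cache-blocking point concrete, here is a sketch of a blocked matrix transpose (illustrative only; real BLAS kernels are far more sophisticated). The naive version writes `dst` column-by-column, missing on nearly every store; the blocked version works in tiles small enough that the active parts of both matrices stay cache-resident:

```c
#include <stdio.h>

#define N 2048
#define B 64                     /* tile edge; N is divisible by B */

static int src[N][N], dst[N][N];

/* Naive transpose: reads of src are sequential, but writes to dst jump
   N * sizeof(int) bytes each iteration, touching a new line every time. */
static void transpose_naive(void) {
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            dst[c][r] = src[r][c];
}

/* Blocked transpose: process B x B tiles so the rows of src and dst
   being touched fit in cache simultaneously and get fully reused. */
static void transpose_blocked(void) {
    for (int ri = 0; ri < N; ri += B)
        for (int ci = 0; ci < N; ci += B)
            for (int r = ri; r < ri + B; r++)
                for (int c = ci; c < ci + B; c++)
                    dst[c][r] = src[r][c];
}

int main(void) {
    src[3][5] = 42;
    transpose_naive();
    transpose_blocked();
    printf("%d\n", dst[5][3]);   /* prints 42 */
    return 0;
}
```

With B = 64, each tile is 64 × 64 × 4 bytes = 16 KB, so a pair of tiles fits comfortably in L2.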
## Key Takeaways
- Memory access pattern matters as much as algorithm complexity
- Sequential access is almost always faster than random access
- Measure with `perf stat` before optimizing
- Big-O notation hides constant factors that can be 10-100x