Scenario 2: Memoization and Precomputation

Learning Objectives

  • Use cProfile to identify performance bottlenecks
  • Recognize when @lru_cache becomes a bottleneck itself
  • Understand when precomputation beats memoization
  • Learn to read profiler output to guide optimization decisions

Files

Fibonacci Example

  • fib_slow.py - Naive recursive Fibonacci (exponential time)
  • fib_cached.py - Memoized Fibonacci (linear time)

Config Validator Example

  • generate_events.py - Generate test data (run first)
  • config_validator_naive.py - Baseline: no caching
  • config_validator_memoized.py - Uses @lru_cache
  • config_validator_precomputed.py - Uses 2D array lookup
  • config_validator.py - Comparison runner
  • common.py - Shared code

Exercise 1: Fibonacci (Identifying Redundant Calls)

Step 1: Experience the slowness

time python3 fib_slow.py 35

This takes several seconds. Don't try n=50 - the exponential blow-up would keep it running for hours.

Step 2: Profile to understand why

python3 -m cProfile -s ncalls fib_slow.py 35

Look at ncalls for the fib function - it's called millions of times because the same values are recomputed repeatedly.
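
For reference, the exponential blow-up comes from a definition shaped like the sketch below (fib_slow.py may differ in details):

  import sys

  def fib(n):
      # Each call spawns two more; for n=35 that is roughly 30 million calls.
      if n < 2:
          return n
      return fib(n - 1) + fib(n - 2)

  if __name__ == "__main__":
      print(fib(int(sys.argv[1])))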

Step 3: Apply memoization and verify

time python3 fib_cached.py 35
python3 -m cProfile -s ncalls fib_cached.py 35

The ncalls drops from millions to ~35.
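
The entire fix is one decorator (a sketch of the idea behind fib_cached.py):

  import sys
  from functools import lru_cache

  @lru_cache(maxsize=None)
  def fib(n):
      # Each distinct n is computed once; repeat calls are served from the cache.
      if n < 2:
          return n
      return fib(n - 1) + fib(n - 2)

  if __name__ == "__main__":
      print(fib(int(sys.argv[1])))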


Exercise 2: Config Validator (When Caching Becomes the Bottleneck)

This exercise demonstrates a common pattern: you add caching, get a big speedup, but then discover the cache itself is now the bottleneck.

Step 1: Generate test data

python3 generate_events.py 100000

Step 2: Profile the naive version

python3 -m cProfile -s tottime config_validator_naive.py

What to look for: validate_rule_slow dominates the profile. It's called 100,000 times even though there are only 400 unique input combinations.
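
The naive hot path likely has this shape (a sketch with placeholder logic, assuming 20 rule ids x 20 event types for the 400 combinations; the real file may differ):

  def validate_rule_slow(rule_id, event_type):
      # Stand-in for the expensive check: ~50 iterations of work per call.
      result = 0
      for i in range(50):
          result = (result + rule_id * event_type + i) % 97
      return result == 0

  def process_events(events):
      # 100,000 events but only 400 unique (rule_id, event_type) pairs,
      # so each answer is recomputed ~250 times on average.
      return sum(validate_rule_slow(r, e) for r, e in events)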

Step 3: Add memoization - big improvement!

python3 -m cProfile -s tottime config_validator_memoized.py

Observation: Dramatic speedup! But look carefully at the profile...

Step 4: Identify the new bottleneck

Compare process_events time between memoized and precomputed:

python3 -m cProfile -s tottime config_validator_memoized.py
python3 -m cProfile -s tottime config_validator_precomputed.py

Key insight: the two process_events tottime values tell the story:

  • Memoized: ~0.014s
  • Precomputed: ~0.004s (3.5x faster!)

The cache lookup overhead now dominates because:

  • The validation function is cheap (only 50 iterations)
  • But we do 100,000 cache lookups
  • Each lookup involves: tuple creation for the key, hashing, dict lookup
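
You can isolate that overhead with a micro-benchmark (placeholder function and hypothetical table; absolute timings vary by machine):

  import timeit
  from functools import lru_cache

  @lru_cache(maxsize=None)
  def validate_cached(rule_id, event_type):
      return (rule_id * event_type) % 7 == 0   # cheap placeholder check

  # Equivalent answers stored in a plain 2D list.
  table = [[(r * e) % 7 == 0 for e in range(20)] for r in range(20)]

  validate_cached(3, 4)  # warm the cache so we time only cache hits
  print(timeit.timeit(lambda: validate_cached(3, 4), number=1_000_000))
  print(timeit.timeit(lambda: table[3][4], number=1_000_000))

Even on a hit, the cached call pays for wrapper dispatch, key-tuple construction, and hashing; the list lookup pays for none of these.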

Step 5: Hypothesis - can we beat the cache?

When the input space is small and bounded (400 combinations), we can:

  1. Precompute all results into a 2D array
  2. Use array indexing instead of hash-based lookup

Array indexing is faster because:

  • No hash computation
  • Direct memory offset calculation
  • Better CPU cache locality
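
A minimal sketch of the precomputed approach, assuming 20 rule ids and 20 event types (the 400 combinations above) and a placeholder check:

  NUM_RULES, NUM_EVENT_TYPES = 20, 20  # assumed input bounds

  def validate_rule(rule_id, event_type):
      return (rule_id * event_type) % 97 == 0  # placeholder for the real check

  # One-time cost: evaluate every combination up front.
  table = [[validate_rule(r, e) for e in range(NUM_EVENT_TYPES)]
           for r in range(NUM_RULES)]

  def process_events(events):
      # Hot path is pure list indexing: no key tuple, no hash, no wrapper.
      return sum(table[r][e] for r, e in events)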

Step 6: Profile the precomputed version

python3 -m cProfile -s tottime config_validator_precomputed.py

Observation: No wrapper overhead. Clean array indexing in process_events.

Step 7: Compare all three

python3 config_validator.py

Expected output shows the precomputed version about 2x faster overall than the memoized one (the larger 3.5x gap above applies only to the process_events hot loop).


Key Profiling Techniques

Finding where time is spent

python3 -m cProfile -s tottime script.py    # Sort by time in function itself
python3 -m cProfile -s cumtime script.py    # Sort by cumulative time (includes callees)

Understanding the columns

  • ncalls: Number of calls
  • tottime: Time spent in function (excluding callees)
  • cumtime: Time spent in function (including callees)
  • percall: Time per call (shown in two columns: tottime/ncalls and cumtime/primitive calls)
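
For long runs, the same data can be saved and explored with pstats:

  import cProfile
  import pstats

  # Profile a statement and dump the raw stats to a file.
  cProfile.run("sum(i * i for i in range(10**6))", "stats.out")

  stats = pstats.Stats("stats.out")
  stats.sort_stats("tottime").print_stats(10)  # top 10 entries by tottime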

When to Use Each Approach

Approach                   When to Use
No caching                 Function is cheap OR each input is seen only once
Memoization (@lru_cache)   Unknown/large input space, expensive function
Precomputation             Known/small input space, many lookups, bounded integers

Discussion Questions

  1. Why does @lru_cache have overhead?

    • Hint: What happens on each call even for cache hits?
  2. When would memoization beat precomputation?

    • Hint: What if there were 10,000 x 10,000 possible inputs but you only see 100?
  3. Could we make precomputation even faster?

    • Hint: What about a flat array with table[rule_id * 20 + event_type]?
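
For question 3, the flat layout from the hint would look like the sketch below (whether it actually beats the nested list is worth measuring):

  NUM_RULES, NUM_EVENT_TYPES = 20, 20  # bounds taken from the hint

  def validate_rule(rule_id, event_type):
      return (rule_id * event_type) % 97 == 0  # placeholder check

  # One contiguous list instead of a list of lists.
  flat = [validate_rule(r, e)
          for r in range(NUM_RULES)
          for e in range(NUM_EVENT_TYPES)]

  def lookup(rule_id, event_type):
      # One multiply-add and a single list access; better locality as well.
      return flat[rule_id * NUM_EVENT_TYPES + event_type]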

Further Reading

  • functools.lru_cache documentation
  • functools.cache (Python 3.9+) - unbounded cache, slightly less overhead
  • NumPy arrays for compact, cache-friendly lookup tables (plain list indexing is already O(1))