# Scenario 2: Memoization and Precomputation
## Learning Objectives

- Use cProfile to identify performance bottlenecks
- Recognize when `@lru_cache` becomes a bottleneck itself
- Understand when precomputation beats memoization
- Learn to read profiler output to guide optimization decisions
## Files

### Fibonacci Example

- `fib_slow.py` - Naive recursive Fibonacci (exponential time)
- `fib_cached.py` - Memoized Fibonacci (linear time)

### Config Validator Example

- `generate_events.py` - Generate test data (run first)
- `config_validator_naive.py` - Baseline: no caching
- `config_validator_memoized.py` - Uses `@lru_cache`
- `config_validator_precomputed.py` - Uses 2D array lookup
- `config_validator.py` - Comparison runner
- `common.py` - Shared code
## Exercise 1: Fibonacci (Identifying Redundant Calls)
### Step 1: Experience the slowness

```bash
time python3 fib_slow.py 35
```

This takes several seconds. Don't try n=50!
### Step 2: Profile to understand why

```bash
python3 -m cProfile -s ncalls fib_slow.py 35
```

Look at `ncalls` for the `fib` function - it's called millions of times because
the same values are recomputed repeatedly.
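
For reference, the naive version is essentially the classic doubly recursive definition. A minimal sketch of what `fib_slow.py` likely contains (the repo file may differ in details):

```python
import sys

def fib(n):
    # Each call branches into two more calls, so the same fib(k) values
    # are recomputed over and over; the call count grows exponentially in n.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 35
    print(fib(n))
```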
### Step 3: Apply memoization and verify

```bash
time python3 fib_cached.py 35
python3 -m cProfile -s ncalls fib_cached.py 35
```

The `ncalls` count drops from millions to ~35.
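
A minimal sketch of the memoized variant, assuming `fib_cached.py` simply wraps the same recursion with `functools.lru_cache`:

```python
import sys
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each distinct n is computed once; repeat calls are served from the
    # cache, so the profiler sees roughly n+1 calls instead of millions.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 35
    print(fib(n))
```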
## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
This exercise demonstrates a common pattern: you add caching, get a big speedup, but then discover the cache itself is now the bottleneck. Along the way, you'll learn the limits of different profiling tools.
### Step 1: Generate test data

```bash
python3 generate_events.py 1000000
```
### Step 2: Run the naive version

```bash
python3 config_validator_naive.py
```

It's slow (~3s). Let's profile to see why.
### Step 3: Profile with py-spy

```bash
py-spy record -o naive.svg -- python3 config_validator_naive.py
```

Open naive.svg in a browser. You'll see `validate_rule_slow` dominating -
it's called 1,000,000 times even though there are only 400 unique input combinations.
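
The shape of the naive hot path is roughly the following sketch; the rule check itself is a made-up placeholder, and the exact logic in the repo will differ:

```python
# Hypothetical sketch of the naive version's hot loop.
def validate_rule_slow(rule_id: int, event_type: int) -> bool:
    # Placeholder for an expensive check; the real file does real work here.
    total = 0
    for i in range(1000):
        total += (rule_id * i + event_type) % 7
    return total % 2 == 0

def process(events):
    # Called once per event: 1,000,000 expensive calls even though
    # (rule_id, event_type) takes only ~400 distinct values.
    return sum(validate_rule_slow(rule_id, event_type)
               for rule_id, event_type in events)
```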
### Step 4: Apply memoization

```bash
python3 config_validator_memoized.py
```

Dramatic speedup! But where is the remaining time going?
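
The memoized version presumably just wraps the same check; a hedged sketch using the placeholder from above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def validate_rule(rule_id: int, event_type: int) -> bool:
    # Only ~400 calls (one per distinct argument pair) run this body;
    # the remaining ~999,600 calls are cache hits, but each hit still pays
    # for packing the arguments into a key, hashing it, and a dict lookup.
    total = 0
    for i in range(1000):
        total += (rule_id * i + event_type) % 7
    return total % 2 == 0
```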
### Step 5: Profile memoized with py-spy

```bash
py-spy record -o memoized.svg -- python3 config_validator_memoized.py
```

Open memoized.svg. The flamegraph looks thin - most time is unaccounted for.

Problem: py-spy only traces Python functions. The `lru_cache` wrapper overhead
is in native C code (dict operations, hashing), so py-spy can't see it.
### Step 6: Profile with perf (native code)

```bash
perf record -g -F 9999 python3 config_validator_memoized.py
perf report
```

Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's
hard to tell which Python code triggered these operations.
### Step 7: Profile with perf + Python frames

```bash
perf record -g -F 9999 python3 -X perf config_validator_memoized.py
perf report
```

The `-X perf` flag enables Python's perf map support. Now the call stack clearly
shows time spent under `_lru_cache_wrapper` - that's the cache overhead!

You can also generate a flamegraph:

```bash
perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
```
### Step 8: The precomputed solution

When the input space is small and bounded (400 combinations), we can:

- Precompute all results into a 2D array
- Use array indexing instead of hash-based lookup

Array indexing is faster because:

- No hash computation
- Direct memory offset calculation
- Better CPU cache locality

A sketch of this idea follows the run command below.

```bash
python3 config_validator_precomputed.py
```
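
A sketch of the precomputation idea, assuming a 20 x 20 input space (inferred from the 400-combination figure) and the placeholder check from the earlier sketches; the real `config_validator_precomputed.py` may organize this differently:

```python
NUM_RULES = 20          # assumed: 20 rules x 20 event types = 400 combinations
NUM_EVENT_TYPES = 20

def validate_rule(rule_id: int, event_type: int) -> bool:
    # Placeholder for the expensive check.
    total = 0
    for i in range(1000):
        total += (rule_id * i + event_type) % 7
    return total % 2 == 0

# Pay the expensive cost once per combination, up front.
TABLE = [[validate_rule(r, e) for e in range(NUM_EVENT_TYPES)]
         for r in range(NUM_RULES)]

def validate_fast(rule_id: int, event_type: int) -> bool:
    # Plain list indexing: no argument hashing, no dict probe, and the
    # small table stays hot in the CPU cache.
    return TABLE[rule_id][event_type]
```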
### Step 9: Compare all three

```bash
python3 config_validator.py
```

The expected output shows the precomputed version running roughly 2x faster than the memoized one.
## Key Profiling Techniques

### Tool comparison

| Tool | Shows | Limitations |
|---|---|---|
| cProfile | Python function times | No native code, high overhead |
| py-spy | Python flamegraph, low overhead | No native code |
| perf | Native code | No Python frames by default |
| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |
### cProfile usage

```bash
python3 -m cProfile -s tottime script.py  # Sort by time in function itself
python3 -m cProfile -s cumtime script.py  # Sort by cumulative time
```
### Understanding cProfile columns

- `ncalls`: Number of calls
- `tottime`: Time spent in the function (excluding callees)
- `cumtime`: Time spent in the function (including callees)
- `percall`: Time per call
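
If you want to sort and inspect these columns from code rather than the command line, a small `cProfile`/`pstats` sketch (the workload function is just an example):

```python
import cProfile
import pstats

def work():
    # Stand-in workload; profile your own entry point instead.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by tottime (time inside each function, excluding callees) and print
# the 10 most expensive entries; use "cumulative" to sort by cumtime instead.
pstats.Stats(profiler).sort_stats("tottime").print_stats(10)
```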
## When to Use Each Approach

| Approach | When to Use |
|---|---|
| No caching | Function is cheap OR each input seen only once |
| Memoization (`@lru_cache`) | Unknown/large input space, expensive function |
| Precomputation | Known/small input space, many lookups, bounded integers |
## Discussion Questions

- Why does `@lru_cache` have overhead?
  - Hint: What happens on each call, even for cache hits?
- When would memoization beat precomputation?
  - Hint: What if there were 10,000 x 10,000 possible inputs but you only see 100?
- Could we make precomputation even faster?
  - Hint: What about a flat array with `table[rule_id * 20 + event_type]`?
## Further Reading

- `functools.lru_cache` documentation
- `functools.cache` (Python 3.9+) - unbounded cache, slightly less overhead
- NumPy arrays for truly O(1) array access