# Scenario 2: Memoization and Precomputation

## Learning Objectives

- Use cProfile to identify performance bottlenecks
- Recognize when `@lru_cache` becomes a bottleneck itself
- Understand when precomputation beats memoization
- Learn to read profiler output to guide optimization decisions

## Files

### Fibonacci Example

- `fib_slow.py` - Naive recursive Fibonacci (exponential time)
- `fib_cached.py` - Memoized Fibonacci (linear time)

### Config Validator Example

- `generate_events.py` - Generate test data (run first)
- `config_validator_naive.py` - Baseline: no caching
- `config_validator_memoized.py` - Uses `@lru_cache`
- `config_validator_precomputed.py` - Uses 2D array lookup
- `config_validator.py` - Comparison runner
- `common.py` - Shared code

---

## Exercise 1: Fibonacci (Identifying Redundant Calls)

### Step 1: Experience the slowness

```bash
time python3 fib_slow.py 35
```

This takes several seconds. Don't try n=50!

### Step 2: Profile to understand why

```bash
python3 -m cProfile -s ncalls fib_slow.py 35
```

Look at `ncalls` for the `fib` function - it's called millions of times because the same values are recomputed repeatedly.

### Step 3: Apply memoization and verify

```bash
time python3 fib_cached.py 35
python3 -m cProfile -s ncalls fib_cached.py 35
```

The `ncalls` drops from millions to ~35.

---

## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)

This exercise demonstrates a common pattern: you add caching, get a big speedup, but then discover the cache itself is now the bottleneck. Along the way, you'll learn the limits of different profiling tools.

### Step 1: Generate test data

```bash
python3 generate_events.py 1000000
```

### Step 2: Run the naive version

```bash
python3 config_validator_naive.py
```

It's slow (~3s). Let's profile to see why.

### Step 3: Profile with py-spy

```bash
py-spy record -o naive.svg -- python3 config_validator_naive.py
```

Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating - it's called 1,000,000 times even though there are only 400 unique input combinations.

### Step 4: Apply memoization

```bash
python3 config_validator_memoized.py
```

Dramatic speedup! But where is the remaining time going?

### Step 5: Profile memoized with py-spy

```bash
py-spy record -o memoized.svg -- python3 config_validator_memoized.py
```

Open `memoized.svg`. The flamegraph looks thin - most time is unaccounted for.

**Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead is in native C code (dict operations, hashing), so py-spy can't see it.

### Step 6: Profile with perf (native code)

```bash
perf record -g -F 9999 python3 config_validator_memoized.py
perf report
```

Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's hard to tell which Python code triggered these operations.

### Step 7: Profile with perf + Python frames

```bash
perf record -g -F 9999 python3 -X perf config_validator_memoized.py
perf report
```

The `-X perf` flag enables Python's perf map support. Now the call stack clearly shows time spent under `_lru_cache_wrapper` - that's the cache overhead!

You can also generate a flamegraph:

```bash
perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
```

### Step 8: The precomputed solution

When the input space is **small and bounded** (400 combinations), we can:

1. Precompute all results into a 2D array
2. Use array indexing instead of hash-based lookup

Array indexing is faster because:

- No hash computation
- Direct memory offset calculation
- Better CPU cache locality
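Here is a minimal sketch of the idea. The names (`RULE_COUNT`, `EVENT_TYPE_COUNT`) and the validation body are illustrative stand-ins, not necessarily what the repo's files actually use:

```python
# Sketch only: assumes rule IDs and event types are small integers in
# range(20), giving the 400 combinations mentioned above. The validation
# body is a hypothetical placeholder for the real (expensive) check.
RULE_COUNT = 20
EVENT_TYPE_COUNT = 20

def validate_rule_slow(rule_id: int, event_type: int) -> bool:
    return (rule_id * 31 + event_type) % 7 != 0  # placeholder logic

# Pay the cost once, up front, for every possible input.
TABLE = [
    [validate_rule_slow(r, e) for e in range(EVENT_TYPE_COUNT)]
    for r in range(RULE_COUNT)
]

def validate_rule_fast(rule_id: int, event_type: int) -> bool:
    # Two list indexes: no argument hashing, no cache bookkeeping.
    return TABLE[rule_id][event_type]
```

The repo's version lives in `config_validator_precomputed.py`: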
```bash
python3 config_validator_precomputed.py
```

### Step 9: Compare all three

```bash
python3 config_validator.py
```

Expected output shows precomputed ~2x faster than memoized.

---

## Key Profiling Techniques

### Tool comparison

| Tool | Shows | Limitations |
|------|-------|-------------|
| cProfile | Python function times | No native code, high overhead |
| py-spy | Python flamegraph, low overhead | No native code |
| perf | Native code | No Python frames by default |
| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |

### cProfile usage

```bash
python3 -m cProfile -s tottime script.py  # Sort by time in function itself
python3 -m cProfile -s cumtime script.py  # Sort by cumulative time
```

### Understanding cProfile columns

- `ncalls`: Number of calls
- `tottime`: Time spent in function (excluding callees)
- `cumtime`: Time spent in function (including callees)
- `percall`: Time per call

---

## When to Use Each Approach

| Approach | When to Use |
|----------|-------------|
| No caching | Function is cheap OR each input seen only once |
| Memoization (`@lru_cache`) | Unknown/large input space, expensive function |
| Precomputation | Known/small input space, many lookups, bounded integers |

---

## Discussion Questions

1. Why does `@lru_cache` have overhead?
   - Hint: What happens on each call even for cache hits?
2. When would memoization beat precomputation?
   - Hint: What if there were 10,000 x 10,000 possible inputs but you only see 100?
3. Could we make precomputation even faster?
   - Hint: What about a flat array with `table[rule_id * 20 + event_type]`?

---

## Further Reading

- `functools.lru_cache` documentation
- `functools.cache` (Python 3.9+) - unbounded cache, slightly less overhead
- NumPy arrays for truly O(1) array access
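As a pointer for that last item, here is a minimal sketch of the same lookup table as a flat NumPy array, indexed the way the hint in Discussion Question 3 suggests. All names and the validation body are hypothetical, not part of the exercise files:

```python
# Sketch only: flat NumPy lookup table for the 20 x 20 input space.
# `validate_rule_slow` is a hypothetical stand-in for the real check.
import numpy as np

RULE_COUNT = 20
EVENT_TYPE_COUNT = 20

def validate_rule_slow(rule_id: int, event_type: int) -> bool:
    return (rule_id * 31 + event_type) % 7 != 0  # placeholder logic

table = np.fromiter(
    (validate_rule_slow(r, e)
     for r in range(RULE_COUNT)
     for e in range(EVENT_TYPE_COUNT)),
    dtype=bool,
    count=RULE_COUNT * EVENT_TYPE_COUNT,
)

def validate_rule_flat(rule_id: int, event_type: int) -> bool:
    # One multiply-add computes the offset into a single contiguous
    # buffer, which fits comfortably in the CPU cache.
    return bool(table[rule_id * EVENT_TYPE_COUNT + event_type])
```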