# Scenario 2: Memoization and Precomputation

## Learning Objectives

- Use cProfile to identify performance bottlenecks
- Recognize when `@lru_cache` becomes a bottleneck itself
- Understand when precomputation beats memoization
- Learn to read profiler output to guide optimization decisions

## Files

### Fibonacci Example

- `fib_slow.py` - Naive recursive Fibonacci (exponential time)
- `fib_cached.py` - Memoized Fibonacci (linear time)

### Config Validator Example

- `generate_events.py` - Generate test data (run first)
- `config_validator_naive.py` - Baseline: no caching
- `config_validator_memoized.py` - Uses `@lru_cache`
- `config_validator_precomputed.py` - Uses a 2D array lookup
- `config_validator.py` - Comparison runner
- `common.py` - Shared code

---

## Exercise 1: Fibonacci (Identifying Redundant Calls)

### Step 1: Experience the slowness

```bash
time python3 fib_slow.py 35
```

This takes several seconds. Don't try n=50!

### Step 2: Profile to understand why

```bash
python3 -m cProfile -s ncalls fib_slow.py 35
```

Look at `ncalls` for the `fib` function - it's called millions of times because the same values are recomputed repeatedly.

### Step 3: Apply memoization and verify

```bash
time python3 fib_cached.py 35
python3 -m cProfile -s ncalls fib_cached.py 35
```

The `ncalls` count drops from millions to ~35.

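The two scripts boil down to the following contrast (a minimal sketch; the actual files may differ in how they parse `n` from the command line):

```python
from functools import lru_cache

def fib_slow(n):
    # Exponential time: fib_slow(n - 2) is recomputed inside fib_slow(n - 1),
    # so the call tree roughly doubles at every level.
    if n < 2:
        return n
    return fib_slow(n - 1) + fib_slow(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    # Linear time: each value of n is computed once, then served from the cache.
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)

print(fib_cached(35))  # → 9227465
```
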
---

## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)

This exercise demonstrates a common pattern: you add caching, get a big speedup, but then discover the cache itself is now the bottleneck. Along the way, you'll learn the limits of different profiling tools.

### Step 1: Generate test data

```bash
python3 generate_events.py 1000000
```

### Step 2: Run the naive version

```bash
python3 config_validator_naive.py
```

It's slow (~3 s). Let's profile to see why.

### Step 3: Profile with py-spy

```bash
py-spy record -o naive.svg -- python3 config_validator_naive.py
```

Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating - it's called 1,000,000 times even though there are only 400 unique input combinations.

### Step 4: Apply memoization

```bash
python3 config_validator_memoized.py
```

Dramatic speedup! But where is the remaining time going?

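The memoized version boils down to something like this (a sketch: the real validation logic lives in `common.py`, and the placeholder function below stands in for it):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def validate_rule(rule_id: int, event_type: int) -> bool:
    # Placeholder for the expensive validation logic in common.py;
    # any deterministic function of the two arguments illustrates the point.
    return (rule_id * 31 + event_type) % 7 != 0

# A million calls, but only 20 x 20 = 400 distinct argument pairs:
# after 400 misses, every remaining call is a cache hit.
for i in range(1_000_000):
    validate_rule(i % 20, (i // 20) % 20)

print(validate_rule.cache_info().hits)      # → 999600
print(validate_rule.cache_info().currsize)  # → 400
```
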
### Step 5: Profile memoized with py-spy

```bash
py-spy record -o memoized.svg -- python3 config_validator_memoized.py
```

Open `memoized.svg`. The flamegraph looks thin - most of the time is unaccounted for.

**Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead is in native C code (dict operations, hashing), so py-spy can't see it.

### Step 6: Profile with perf (native code)

```bash
perf record -g -F 9999 python3 config_validator_memoized.py
perf report
```

Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's hard to tell which Python code triggered these operations.

### Step 7: Profile with perf + Python frames

```bash
perf record -g -F 9999 python3 -X perf config_validator_memoized.py
perf report
```

The `-X perf` flag enables Python's perf map support (Python 3.12+). Now the call stack clearly shows time spent under `_lru_cache_wrapper` - that's the cache overhead!

You can also generate a flamegraph:

```bash
perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
```

### Step 8: The precomputed solution

When the input space is **small and bounded** (400 combinations), we can:

1. Precompute all results into a 2D array
2. Use array indexing instead of hash-based lookup

Array indexing is faster because:

- No hash computation
- Direct memory offset calculation
- Better CPU cache locality

```bash
python3 config_validator_precomputed.py
```

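The precomputed approach can be sketched as follows (a simplified version: the table bounds and the placeholder validation function are assumptions, not the repo's actual code):

```python
N_RULES, N_EVENT_TYPES = 20, 20  # assumed bounds: 20 x 20 = 400 combinations

def validate_rule(rule_id: int, event_type: int) -> bool:
    # Stand-in for the expensive validation logic.
    return (rule_id * 31 + event_type) % 7 != 0

# Pay the full cost once, up front: one entry per possible input pair.
TABLE = [[validate_rule(r, e) for e in range(N_EVENT_TYPES)]
         for r in range(N_RULES)]

def validate_fast(rule_id: int, event_type: int) -> bool:
    # No hash computation, no cache bookkeeping: two list index operations.
    return TABLE[rule_id][event_type]
```

Unlike `@lru_cache`, this only works because every input is a small bounded integer, so it can serve directly as an index.
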
### Step 9: Compare all three

```bash
python3 config_validator.py
```

Expected output shows the precomputed version ~2x faster than the memoized one.

---

## Key Profiling Techniques

### Tool comparison

| Tool | Shows | Limitations |
|------|-------|-------------|
| cProfile | Python function times | No native code, high overhead |
| py-spy | Python flamegraph, low overhead | No native code |
| perf | Native code | No Python frames by default |
| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |

### cProfile usage

```bash
python3 -m cProfile -s tottime script.py   # Sort by time in the function itself
python3 -m cProfile -s cumtime script.py   # Sort by cumulative time
```

### Understanding cProfile columns

- `ncalls`: Number of calls
- `tottime`: Time spent in the function itself (excluding callees)
- `cumtime`: Time spent in the function including callees
- `percall`: Time per call (shown twice, once for `tottime` and once for `cumtime`)

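cProfile can also be driven from Python, which is handy when you only want to profile one section of a script rather than the whole run:

```python
import cProfile
import io
import pstats

def busy():
    # Some deliberately CPU-bound work to profile.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Print the ten most expensive entries, sorted by tottime.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(10)
print(out.getvalue())
```
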
---

## When to Use Each Approach

| Approach | When to Use |
|----------|-------------|
| No caching | Function is cheap OR each input seen only once |
| Memoization (`@lru_cache`) | Unknown/large input space, expensive function |
| Precomputation | Known/small input space, many lookups, bounded integers |

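As a quick sanity check of the table above, you can compare a warm `lru_cache` hit against a plain list index. Absolute numbers vary by machine and Python version; treat this as illustrative, not a rigorous benchmark:

```python
import timeit
from functools import lru_cache

@lru_cache(maxsize=None)
def cached(x):
    return x * 2

table = [x * 2 for x in range(400)]
cached(7)  # warm the cache so the loop below measures pure hits

t_cache = timeit.timeit("cached(7)", globals=globals(), number=1_000_000)
t_table = timeit.timeit("table[7]", globals=globals(), number=1_000_000)
print(f"lru_cache hit: {t_cache:.3f}s  list index: {t_table:.3f}s")
```
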
---

## Discussion Questions

1. Why does `@lru_cache` have overhead?
   - Hint: What happens on each call, even for a cache hit?

2. When would memoization beat precomputation?
   - Hint: What if there were 10,000 x 10,000 possible inputs but you only ever see 100?

3. Could we make precomputation even faster?
   - Hint: What about a flat array with `table[rule_id * 20 + event_type]`?

---

## Further Reading

- `functools.lru_cache` documentation
- `functools.cache` (Python 3.9+) - unbounded cache with slightly less overhead
- NumPy arrays for compact, cache-friendly lookup tables