# Scenario 2: Memoization and Precomputation
## Learning Objectives
- Use cProfile to identify performance bottlenecks
- Recognize when `@lru_cache` becomes a bottleneck itself
- Understand when precomputation beats memoization
- Learn to read profiler output to guide optimization decisions
## Files
### Fibonacci Example
- `fib_slow.py` - Naive recursive Fibonacci (exponential time)
- `fib_cached.py` - Memoized Fibonacci (linear time)
### Config Validator Example
- `generate_events.py` - Generate test data (run first)
- `config_validator_naive.py` - Baseline: no caching
- `config_validator_memoized.py` - Uses `@lru_cache`
- `config_validator_precomputed.py` - Uses 2D array lookup
- `config_validator.py` - Comparison runner
- `common.py` - Shared code
---
## Exercise 1: Fibonacci (Identifying Redundant Calls)
### Step 1: Experience the slowness
```bash
time python3 fib_slow.py 35
```
This takes several seconds. Don't try n=50!
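For reference, `fib_slow.py` presumably contains something close to the textbook recursion (a sketch; the actual file may differ):
```python
# Naive recursive Fibonacci: each call spawns two more calls, so the
# same small values are recomputed an exponential number of times.
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)
```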
### Step 2: Profile to understand why
```bash
python3 -m cProfile -s ncalls fib_slow.py 35
```
Look at `ncalls` for the `fib` function - it's called millions of times because
the same values are recomputed repeatedly.
### Step 3: Apply memoization and verify
```bash
time python3 fib_cached.py 35
python3 -m cProfile -s ncalls fib_cached.py 35
```
The `ncalls` count for `fib` drops from millions to ~35.
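The cached version is presumably the same function with a decorator added (sketch):
```python
from functools import lru_cache

# Memoized Fibonacci: each distinct n is computed once and then served
# from the cache, so the total number of calls grows linearly with n.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)
```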
---
## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
This exercise demonstrates a common pattern: you add caching, get a big speedup,
but then discover the cache itself is now the bottleneck.
### Step 1: Generate test data
```bash
python3 generate_events.py 100000
```
### Step 2: Profile the naive version
```bash
python3 -m cProfile -s tottime config_validator_naive.py
```
**What to look for:** `validate_rule_slow` dominates the profile. It's called
100,000 times even though there are only 400 unique input combinations.
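As a mental model, the naive hot loop looks roughly like this (the names `validate_rule_slow` and `process_events` come from the profile; the signature and body below are assumptions):
```python
def validate_rule_slow(rule_id: int, event_type: int) -> bool:
    # Stand-in for the expensive check: roughly 50 iterations of work.
    score = 0
    for i in range(50):
        score += (rule_id * i + event_type) % 7
    return score % 2 == 0

def process_events(events):
    # Re-runs the full validation for all 100,000 events, even though
    # only ~400 distinct (rule_id, event_type) pairs ever occur.
    return [validate_rule_slow(rule_id, event_type)
            for rule_id, event_type in events]
```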
### Step 3: Add memoization - big improvement!
```bash
python3 -m cProfile -s tottime config_validator_memoized.py
```
**Observation:** Dramatic speedup! But look carefully at the profile...
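The memoized variant is presumably the same function with `@lru_cache` applied (sketch, reusing the assumed names above):
```python
from functools import lru_cache

# Only the ~400 distinct (rule_id, event_type) pairs ever execute the
# loop body; every repeat is answered from the cache.
@lru_cache(maxsize=None)
def validate_rule_slow(rule_id: int, event_type: int) -> bool:
    score = 0
    for i in range(50):
        score += (rule_id * i + event_type) % 7
    return score % 2 == 0
```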
### Step 4: Identify the new bottleneck
Compare `process_events` time between memoized and precomputed:
```bash
python3 -m cProfile -s tottime config_validator_memoized.py
python3 -m cProfile -s tottime config_validator_precomputed.py
```
**Key insight:** the `process_events` tottime differs sharply:
- Memoized: ~0.014s
- Precomputed: ~0.004s (3.5x faster!)
The cache lookup overhead now dominates because:
- The validation function is cheap (only 50 iterations)
- But we do 100,000 cache lookups
- Each lookup involves tuple creation for the key, hashing, and a dict lookup (see the sketch below)
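To make that overhead concrete, here is a pure-Python equivalent of what a memoizing wrapper does on every call (simplified; the real `lru_cache` is implemented in C and does more bookkeeping):
```python
import functools

def memoize(func):
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        key = args              # a tuple key is built on every call
        if key in cache:        # the tuple is hashed and the dict probed
            return cache[key]   # hashed and probed again for the hit
        result = func(*args)
        cache[key] = result
        return result

    return wrapper
```
Even on a cache hit, that per-call work happens for every one of the 100,000 events.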
### Step 5: Hypothesis - can we beat the cache?
When the input space is **small and bounded** (400 combinations), we can:
1. Precompute all results into a 2D array (see the sketch after this list)
2. Use array indexing instead of hash-based lookup
Array indexing is faster because:
- No hash computation
- Direct memory offset calculation
- Better CPU cache locality
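A minimal sketch of the precomputed approach, assuming 20 rule IDs and 20 event types (the 400 combinations mentioned above) and reusing the hypothetical `validate_rule_slow` from earlier:
```python
NUM_RULES = 20        # assumed bounds; the real script defines its own
NUM_EVENT_TYPES = 20

# Pay the validation cost exactly once per combination, up front.
TABLE = [[validate_rule_slow(rule_id, event_type)
          for event_type in range(NUM_EVENT_TYPES)]
         for rule_id in range(NUM_RULES)]

def process_events(events):
    # Hot loop: two list-indexing operations per event, no hashing.
    return [TABLE[rule_id][event_type] for rule_id, event_type in events]
```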
### Step 6: Profile the precomputed version
```bash
python3 -m cProfile -s tottime config_validator_precomputed.py
```
**Observation:** No wrapper overhead. Clean array indexing in `process_events`.
### Step 7: Compare all three
```bash
python3 config_validator.py
```
The expected output shows the precomputed version running roughly 2x faster than the memoized one.
---
## Key Profiling Techniques
### Finding where time is spent
```bash
python3 -m cProfile -s tottime script.py # Sort by time in function itself
python3 -m cProfile -s cumtime script.py # Sort by cumulative time (includes callees)
```
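The same data is also available programmatically, which is handy when you only want to profile the hot loop rather than the whole script (a sketch, not part of the exercise files):
```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()              # placeholder for the code you want to measure
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("tottime").print_stats(10)   # top 10 functions by tottime
```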
### Understanding the columns
- `ncalls`: Number of calls
- `tottime`: Time spent in function (excluding callees)
- `cumtime`: Time spent in function (including callees)
- `percall`: Time per call (shown twice: `tottime/ncalls` and `cumtime/ncalls`)
---
## When to Use Each Approach
| Approach | When to Use |
|----------|-------------|
| No caching | Function is cheap OR each input seen only once |
| Memoization (`@lru_cache`) | Unknown/large input space, expensive function |
| Precomputation | Known/small input space, many lookups, bounded integers |
---
## Discussion Questions
1. Why does `@lru_cache` have overhead?
- Hint: What happens on each call even for cache hits?
2. When would memoization beat precomputation?
- Hint: What if there were 10,000 x 10,000 possible inputs but you only see 100?
3. Could we make precomputation even faster?
- Hint: What about a flat array with `table[rule_id * 20 + event_type]`?
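For question 3, a sketch of the layout the hint describes, using the same assumed bounds as above:
```python
NUM_RULES = 20
NUM_EVENT_TYPES = 20  # assumed bound, matching the hint's stride of 20

# One contiguous list instead of a list of lists: a single computed
# index and a single lookup per event.
FLAT_TABLE = [validate_rule_slow(rule_id, event_type)
              for rule_id in range(NUM_RULES)
              for event_type in range(NUM_EVENT_TYPES)]

def lookup(rule_id: int, event_type: int) -> bool:
    return FLAT_TABLE[rule_id * NUM_EVENT_TYPES + event_type]
```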
---
## Further Reading
- `functools.lru_cache` documentation
- `functools.cache` (Python 3.9+) - unbounded cache, slightly less overhead
- NumPy arrays for compact, contiguous lookup tables