scenario2: add perf

2026-01-10 20:17:43 +05:30 · 2026-01-10 20:17:43 +05:30 · 658f4e98bc
commit 658f4e98bc
parent 596ae02dd4
3 changed files with 65 additions and 30 deletions
--- a/.gitignore
+++ b/.gitignore
@ -4,3 +4,7 @@ result
 events.pkl
 __pycache__
 *.svg
 *.so
 *#*#
 perf.data*
 out.perf-folded
--- a/scenario1-python-to-c/README.md
+++ b/scenario1-python-to-c/README.md
@ -38,6 +38,8 @@ py-spy record -o prime_slow.svg -- python3 prime_slow.py
 Open `prime_slow.svg` in a browser. The widest bar at the top shows where time is spent.
 **Note:** py-spy only traces Python functions. Time spent in native C code (like the C library calls in later steps) won't appear in py-spy flamegraphs.
 ### Step 4: Compile and run the partially optimized version
 ```bash
 # Compile the C library
--- a/scenario2-memoization/README.md
+++ b/scenario2-memoization/README.md
@ -51,46 +51,70 @@ The `ncalls` drops from millions to ~35.
 ## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
 This exercise demonstrates a common pattern: you add caching, get a big speedup,
-but then discover the cache itself is now the bottleneck.
+but then discover the cache itself is now the bottleneck. Along the way, you'll
 learn the limits of different profiling tools.
 ### Step 1: Generate test data
 ```bash
-python3 generate_events.py 100000
+python3 generate_events.py 1000000
 ```
-### Step 2: Profile the naive version
+### Step 2: Run the naive version
 ```bash
-python3 -m cProfile -s tottime config_validator_naive.py
+python3 config_validator_naive.py
 ```
-**What to look for:** `validate_rule_slow` dominates the profile. It's called
+It's slow (~3s). Let's profile to see why.
 100,000 times even though there are only 400 unique input combinations.
-### Step 3: Add memoization - big improvement!
+### Step 3: Profile with py-spy
 ```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
+py-spy record -o naive.svg -- python3 config_validator_naive.py
 ```
-**Observation:** Dramatic speedup! But look carefully at the profile...
+Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating -
 it's called 1,000,000 times even though there are only 400 unique input combinations.
-### Step 4: Identify the new bottleneck
+### Step 4: Apply memoization
 Compare `process_events` time between memoized and precomputed:
 ```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
+python3 config_validator_memoized.py
 python3 -m cProfile -s tottime config_validator_precomputed.py
 ```
-**Key insight:** Compare the `process_events` tottime:
+Dramatic speedup! But where is the remaining time going?
 - Memoized: ~0.014s
 - Precomputed: ~0.004s (3.5x faster!)
-The cache lookup overhead now dominates because:
+### Step 5: Profile memoized with py-spy
- The validation function is cheap (only 50 iterations)
+```bash
- But we do 100,000 cache lookups
+py-spy record -o memoized.svg -- python3 config_validator_memoized.py
- Each lookup involves: tuple creation for the key, hashing, dict lookup
+```
-### Step 5: Hypothesis - can we beat the cache?
+Open `memoized.svg`. The flamegraph looks thin - most time is unaccounted for.
 **Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead
 is in native C code (dict operations, hashing), so py-spy can't see it.
 ### Step 6: Profile with perf (native code)
 ```bash
 perf record -g -F 9999 python3 config_validator_memoized.py
 perf report
 ```
 Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's
 hard to tell which Python code triggered these operations.
 ### Step 7: Profile with perf + Python frames
 ```bash
 perf record -g -F 9999 python3 -X perf config_validator_memoized.py
 perf report
 ```
 The `-X perf` flag enables Python's perf map support. Now the call stack clearly
 shows time spent under `_lru_cache_wrapper` - that's the cache overhead!
 You can also generate a flamegraph:
 ```bash
 perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
 ```
 ### Step 8: The precomputed solution
 When the input space is **small and bounded** (400 combinations), we can:
 1. Precompute all results into a 2D array
@ -101,14 +125,11 @@ Array indexing is faster because:
 - Direct memory offset calculation
 - Better CPU cache locality
 ### Step 6: Profile the precomputed version
 ```bash
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_precomputed.py
 ```
-**Observation:** No wrapper overhead. Clean array indexing in `process_events`.
+### Step 9: Compare all three
 ### Step 7: Compare all three
 ```bash
 python3 config_validator.py
 ```
@ -119,13 +140,21 @@ Expected output shows precomputed ~2x faster than memoized.
 ## Key Profiling Techniques
-### Finding where time is spent
+### Tool comparison
 | Tool | Shows | Limitations |
 |------|-------|-------------|
 | cProfile | Python function times | No native code, high overhead |
 | py-spy | Python flamegraph, low overhead | No native code |
 | perf | Native code | No Python frames by default |
 | perf + `-X perf` | Both native and Python | Requires Python 3.12+ |
 ### cProfile usage
 ```bash
 python3 -m cProfile -s tottime script.py    # Sort by time in function itself
-python3 -m cProfile -s cumtime script.py    # Sort by cumulative time (includes callees)
+python3 -m cProfile -s cumtime script.py    # Sort by cumulative time
 ```
-### Understanding the columns
+### Understanding cProfile columns
 - `ncalls`: Number of calls
 - `tottime`: Time spent in function (excluding callees)
 - `cumtime`: Time spent in function (including callees)