scenario2: add perf

2026-01-10 20:17:43 +05:30
parent 596ae02dd4
commit 658f4e98bc
3 changed files with 65 additions and 30 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,7 @@ result
 events.pkl
 __pycache__
 *.svg
+*.so
+*#*#
+perf.data*
+out.perf-folded
--- a/scenario1-python-to-c/README.md
+++ b/scenario1-python-to-c/README.md
@@ -38,6 +38,8 @@ py-spy record -o prime_slow.svg -- python3 prime_slow.py

 Open `prime_slow.svg` in a browser. The widest bar at the top shows where time is spent.

+**Note:** py-spy only traces Python functions. Time spent in native C code (like the C library calls in later steps) won't appear in py-spy flamegraphs.
+
 ### Step 4: Compile and run the partially optimized version
 ```bash
 # Compile the C library
--- a/scenario2-memoization/README.md
+++ b/scenario2-memoization/README.md
@@ -51,46 +51,70 @@ The `ncalls` drops from millions to ~35.
 ## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)

 This exercise demonstrates a common pattern: you add caching, get a big speedup,
-but then discover the cache itself is now the bottleneck.
+but then discover the cache itself is now the bottleneck. Along the way, you'll
+learn the limits of different profiling tools.

 ### Step 1: Generate test data
 ```bash
-python3 generate_events.py 100000
+python3 generate_events.py 1000000
 ```

-### Step 2: Profile the naive version
+### Step 2: Run the naive version
 ```bash
-python3 -m cProfile -s tottime config_validator_naive.py
+python3 config_validator_naive.py
 ```

-**What to look for:** `validate_rule_slow` dominates the profile. It's called
-100,000 times even though there are only 400 unique input combinations.
+It's slow (~3s). Let's profile to see why.

-### Step 3: Add memoization - big improvement!
+### Step 3: Profile with py-spy
 ```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
+py-spy record -o naive.svg -- python3 config_validator_naive.py
 ```

-**Observation:** Dramatic speedup! But look carefully at the profile...
+Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating -
+it's called 1,000,000 times even though there are only 400 unique input combinations.

-### Step 4: Identify the new bottleneck
-
-Compare `process_events` time between memoized and precomputed:
+### Step 4: Apply memoization
 ```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_memoized.py
 ```

-**Key insight:** Compare the `process_events` tottime:
- Memoized: ~0.014s
- Precomputed: ~0.004s (3.5x faster!)
+Dramatic speedup! But where is the remaining time going?

-The cache lookup overhead now dominates because:
- The validation function is cheap (only 50 iterations)
- But we do 100,000 cache lookups
- Each lookup involves: tuple creation for the key, hashing, dict lookup
+### Step 5: Profile memoized with py-spy
+```bash
+py-spy record -o memoized.svg -- python3 config_validator_memoized.py
+```

-### Step 5: Hypothesis - can we beat the cache?
+Open `memoized.svg`. The flamegraph looks thin - most time is unaccounted for.
+
+**Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead
+is in native C code (dict operations, hashing), so py-spy can't see it.
+
+### Step 6: Profile with perf (native code)
+```bash
+perf record -g -F 9999 python3 config_validator_memoized.py
+perf report
+```
+
+Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's
+hard to tell which Python code triggered these operations.
+
+### Step 7: Profile with perf + Python frames
+```bash
+perf record -g -F 9999 python3 -X perf config_validator_memoized.py
+perf report
+```
+
+The `-X perf` flag enables Python's perf map support. Now the call stack clearly
+shows time spent under `_lru_cache_wrapper` - that's the cache overhead!
+
+You can also generate a flamegraph:
+```bash
+perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
+```
+
+### Step 8: The precomputed solution

 When the input space is **small and bounded** (400 combinations), we can:
 1. Precompute all results into a 2D array
@@ -101,14 +125,11 @@ Array indexing is faster because:
 - Direct memory offset calculation
 - Better CPU cache locality

-### Step 6: Profile the precomputed version
 ```bash
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_precomputed.py
 ```

-**Observation:** No wrapper overhead. Clean array indexing in `process_events`.
-
-### Step 7: Compare all three
+### Step 9: Compare all three
 ```bash
 python3 config_validator.py
 ```
@@ -119,13 +140,21 @@ Expected output shows precomputed ~2x faster than memoized.

 ## Key Profiling Techniques

-### Finding where time is spent
+### Tool comparison
+| Tool | Shows | Limitations |
+|------|-------|-------------|
+| cProfile | Python function times | No native code, high overhead |
+| py-spy | Python flamegraph, low overhead | No native code |
+| perf | Native code | No Python frames by default |
+| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |
+
+### cProfile usage
 ```bash
 python3 -m cProfile -s tottime script.py    # Sort by time in function itself
-python3 -m cProfile -s cumtime script.py    # Sort by cumulative time (includes callees)
+python3 -m cProfile -s cumtime script.py    # Sort by cumulative time
 ```

-### Understanding the columns
+### Understanding cProfile columns
 - `ncalls`: Number of calls
 - `tottime`: Time spent in function (excluding callees)
 - `cumtime`: Time spent in function (including callees)