From 658f4e98bce36e0d53a9f56e633b530e6780d7d9 Mon Sep 17 00:00:00 2001
From: illustris
Date: Sat, 10 Jan 2026 20:17:43 +0530
Subject: [PATCH] scenario2: add perf

---
 .gitignore                      |  4 ++
 scenario1-python-to-c/README.md |  2 +
 scenario2-memoization/README.md | 89 ++++++++++++++++++++++-----------
 3 files changed, 65 insertions(+), 30 deletions(-)

diff --git a/.gitignore b/.gitignore
index 3bd3edd..93d4e55 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,7 @@ result
 events.pkl
 __pycache__
 *.svg
+*.so
+*#*#
+perf.data*
+out.perf-folded
diff --git a/scenario1-python-to-c/README.md b/scenario1-python-to-c/README.md
index 67731b0..e5dbe50 100644
--- a/scenario1-python-to-c/README.md
+++ b/scenario1-python-to-c/README.md
@@ -38,6 +38,8 @@ py-spy record -o prime_slow.svg -- python3 prime_slow.py
 
 Open `prime_slow.svg` in a browser. The widest bar at the top shows where time is spent.
 
+**Note:** py-spy only traces Python functions. Time spent in native C code (like the C library calls in later steps) won't appear in py-spy flamegraphs.
+
 ### Step 4: Compile and run the partially optimized version
 ```bash
 # Compile the C library
diff --git a/scenario2-memoization/README.md b/scenario2-memoization/README.md
index 53ba802..f1533b2 100644
--- a/scenario2-memoization/README.md
+++ b/scenario2-memoization/README.md
@@ -51,46 +51,70 @@ The `ncalls` drops from millions to ~35.
 ## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
 
 This exercise demonstrates a common pattern: you add caching, get a big speedup,
-but then discover the cache itself is now the bottleneck.
+but then discover the cache itself is now the bottleneck. Along the way, you'll
+learn the limits of different profiling tools.
 
 ### Step 1: Generate test data
 ```bash
-python3 generate_events.py 100000
+python3 generate_events.py 1000000
 ```
 
-### Step 2: Profile the naive version
+### Step 2: Run the naive version
 ```bash
-python3 -m cProfile -s tottime config_validator_naive.py
+python3 config_validator_naive.py
 ```
 
-**What to look for:** `validate_rule_slow` dominates the profile. It's called
-100,000 times even though there are only 400 unique input combinations.
+It's slow (~3s). Let's profile to see why.
 
-### Step 3: Add memoization - big improvement!
+### Step 3: Profile with py-spy
 ```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
+py-spy record -o naive.svg -- python3 config_validator_naive.py
 ```
 
-**Observation:** Dramatic speedup! But look carefully at the profile...
+Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating -
+it's called 1,000,000 times even though there are only 400 unique input combinations.
 
-### Step 4: Identify the new bottleneck
-
-Compare `process_events` time between memoized and precomputed:
+### Step 4: Apply memoization
 ```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_memoized.py
 ```
 
-**Key insight:** Compare the `process_events` tottime:
-- Memoized: ~0.014s
-- Precomputed: ~0.004s (3.5x faster!)
+Dramatic speedup! But where is the remaining time going?
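+
+Before reaching for another profiler, it's worth confirming that the cache is
+actually being hit. `functools.lru_cache` exposes hit/miss counters via
+`cache_info()`. A minimal, self-contained illustration (a stand-in sketch, not
+this repo's validator - the real script may differ):
+```python
+# Illustrative only: a cheap stand-in for an expensive, memoized validation function.
+from functools import lru_cache
+
+@lru_cache(maxsize=None)
+def validate(threshold, level):
+    return sum(i * threshold for i in range(50)) % 3 == level
+
+# 100,000 calls, but only 60 unique (threshold, level) pairs.
+for i in range(100_000):
+    validate(i % 20, i % 3)
+
+print(validate.cache_info())
+# CacheInfo(hits=99940, misses=60, maxsize=None, currsize=60)
+```
+
+If hits vastly outnumber misses, the cache is doing its job, so the remaining
+time must be going somewhere else. Let's find out where.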
 
-The cache lookup overhead now dominates because:
-- The validation function is cheap (only 50 iterations)
-- But we do 100,000 cache lookups
-- Each lookup involves: tuple creation for the key, hashing, dict lookup
+### Step 5: Profile memoized with py-spy
+```bash
+py-spy record -o memoized.svg -- python3 config_validator_memoized.py
+```
-
-### Step 5: Hypothesis - can we beat the cache?
+
+Open `memoized.svg`. The flamegraph looks thin - most time is unaccounted for.
+
+**Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead
+is in native C code (dict operations, hashing), so py-spy can't see it.
+
+### Step 6: Profile with perf (native code)
+```bash
+perf record -g -F 9999 python3 config_validator_memoized.py
+perf report
+```
+
+Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's
+hard to tell which Python code triggered these operations.
+
+### Step 7: Profile with perf + Python frames
+```bash
+perf record -g -F 9999 python3 -X perf config_validator_memoized.py
+perf report
+```
+
+The `-X perf` flag (Python 3.12+) enables Python's perf map support. Now the call stack
+clearly shows time spent under `_lru_cache_wrapper` - that's the cache overhead!
+
+You can also generate a flamegraph (requires Brendan Gregg's FlameGraph scripts on your `PATH`):
+```bash
+perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
+```
+
+### Step 8: The precomputed solution
 
 When the input space is **small and bounded** (400 combinations), we can:
 1. Precompute all results into a 2D array
@@ -101,14 +125,11 @@ Array indexing is faster because:
 - Direct memory offset calculation
 - Better CPU cache locality
 
-### Step 6: Profile the precomputed version
 ```bash
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_precomputed.py
 ```
 
-**Observation:** No wrapper overhead. Clean array indexing in `process_events`.
-
-### Step 7: Compare all three
+### Step 9: Compare all three
 ```bash
 python3 config_validator.py
 ```
@@ -119,13 +140,21 @@ Expected output shows precomputed ~2x faster than memoized.
 
 ## Key Profiling Techniques
 
-### Finding where time is spent
+### Tool comparison
+| Tool | Shows | Limitations |
+|------|-------|-------------|
+| cProfile | Python function times | No native code, high overhead |
+| py-spy | Python flamegraph, low overhead | No native code |
+| perf | Native code | No Python frames by default |
+| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |
+
+### cProfile usage
 ```bash
 python3 -m cProfile -s tottime script.py  # Sort by time in function itself
-python3 -m cProfile -s cumtime script.py # Sort by cumulative time (includes callees)
+python3 -m cProfile -s cumtime script.py  # Sort by cumulative time
 ```
 
-### Understanding the columns
+### Understanding cProfile columns
 - `ncalls`: Number of calls
 - `tottime`: Time spent in function (excluding callees)
 - `cumtime`: Time spent in function (including callees)
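+
+### Exploring a saved profile with pstats
+
+Instead of re-running cProfile with a different sort flag each time, you can save
+the profile once (`python3 -m cProfile -o out.prof script.py`) and explore it from
+Python with the standard-library `pstats` module. Illustrative snippet - `out.prof`
+and `script.py` are placeholder names, not files in this repo:
+```python
+import pstats
+
+stats = pstats.Stats("out.prof")              # load the saved profile
+stats.strip_dirs()                            # shorten file paths in the report
+stats.sort_stats("tottime").print_stats(10)   # top 10 by time in the function itself
+stats.sort_stats("cumtime").print_stats(10)   # top 10 by cumulative time (incl. callees)
+```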