scenario2: add perf

illustris 2026-01-10 20:17:43 +05:30
parent 596ae02dd4
commit 658f4e98bc
Signed by: illustris
GPG Key ID: 56C8FC0B899FEFA3
3 changed files with 65 additions and 30 deletions

.gitignore
@@ -4,3 +4,7 @@ result
events.pkl
__pycache__
*.svg
+*.so
+*#*#
+perf.data*
+out.perf-folded


@@ -38,6 +38,8 @@ py-spy record -o prime_slow.svg -- python3 prime_slow.py
Open `prime_slow.svg` in a browser. The widest bar at the top shows where time is spent.
+**Note:** py-spy only traces Python functions. Time spent in native C code (like the C library calls in later steps) won't appear in py-spy flamegraphs.
### Step 4: Compile and run the partially optimized version
```bash
# Compile the C library


@@ -51,46 +51,70 @@ The `ncalls` drops from millions to ~35.
## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
This exercise demonstrates a common pattern: you add caching, get a big speedup,
-but then discover the cache itself is now the bottleneck.
+but then discover the cache itself is now the bottleneck. Along the way, you'll
+learn the limits of different profiling tools.
### Step 1: Generate test data
```bash
-python3 generate_events.py 100000
+python3 generate_events.py 1000000
```
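The generator script itself isn't part of this diff; as a rough sketch of what it plausibly does (the event shape and the 20x20 config space are assumptions, inferred from the "400 unique input combinations" mentioned in Step 3):
```python
# Hypothetical sketch of generate_events.py -- the real script isn't shown in
# this diff. Assumes events are (config_id, rule_id) pairs drawn from a small
# space (20 x 20 = 400 unique combinations) and pickled to events.pkl.
import pickle
import random
import sys

def main() -> None:
    n = int(sys.argv[1])  # e.g. 1000000
    events = [(random.randrange(20), random.randrange(20)) for _ in range(n)]
    with open("events.pkl", "wb") as f:
        pickle.dump(events, f)

if __name__ == "__main__":
    main()
```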
-### Step 2: Profile the naive version
+### Step 2: Run the naive version
```bash
-python3 -m cProfile -s tottime config_validator_naive.py
+python3 config_validator_naive.py
```
-**What to look for:** `validate_rule_slow` dominates the profile. It's called
-100,000 times even though there are only 400 unique input combinations.
+It's slow (~3s). Let's profile to see why.
-### Step 3: Add memoization - big improvement!
+### Step 3: Profile with py-spy
```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
+py-spy record -o naive.svg -- python3 config_validator_naive.py
```
-**Observation:** Dramatic speedup! But look carefully at the profile...
+Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating -
+it's called 1,000,000 times even though there are only 400 unique input combinations.
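For orientation, a hypothetical sketch of the naive hot path - the real `config_validator_naive.py` isn't in this diff, and the arithmetic body is invented; only the shape (a ~50-iteration validator called once per event) comes from the text:
```python
# Invented stand-in for the naive version. The key property: the validator
# re-does ~50 iterations of work for every event, even though only 400
# distinct (config_id, rule_id) inputs ever occur.
import pickle

def validate_rule_slow(config_id: int, rule_id: int) -> bool:
    acc = 0
    for i in range(50):  # cheap, but repeated 1,000,000 times
        acc += (config_id * 31 + rule_id * 17 + i) % 97
    return acc % 2 == 0

def process_events(events):
    return sum(validate_rule_slow(c, r) for c, r in events)

with open("events.pkl", "rb") as f:
    events = pickle.load(f)
print(process_events(events))
```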
-### Step 4: Identify the new bottleneck
-Compare `process_events` time between memoized and precomputed:
+### Step 4: Apply memoization
```bash
-python3 -m cProfile -s tottime config_validator_memoized.py
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_memoized.py
```
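The memoized file isn't shown here either, but the `_lru_cache_wrapper` frames that appear under perf in Step 7 suggest it uses `functools.lru_cache`. A minimal sketch (function names assumed):
```python
# Sketch of Step 4, assuming functools.lru_cache is the memoization mechanism.
from functools import lru_cache

@lru_cache(maxsize=None)  # only 400 distinct keys, so the cache stays tiny
def validate_rule(config_id: int, rule_id: int) -> bool:
    acc = 0
    for i in range(50):
        acc += (config_id * 31 + rule_id * 17 + i) % 97
    return acc % 2 == 0

# Each repeated input now costs one cache lookup instead of 50 iterations.
print(validate_rule(3, 7), validate_rule(3, 7))  # second call is a cache hit
```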
-**Key insight:** Compare the `process_events` tottime:
-- Memoized: ~0.014s
-- Precomputed: ~0.004s (3.5x faster!)
+Dramatic speedup! But where is the remaining time going?
The cache lookup overhead now dominates because:
- The validation function is cheap (only 50 iterations)
- But we do 1,000,000 cache lookups
- Each lookup involves: tuple creation for the key, hashing, dict lookup (see the sketch below)
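A quick way to see this overhead directly is a microbenchmark comparing a warm `lru_cache` hit against plain 2D list indexing (illustrative names; numbers are machine-dependent):
```python
# Per-call overhead of an lru_cache hit vs. a plain 2D list lookup.
import timeit
from functools import lru_cache

@lru_cache(maxsize=None)
def cached(c: int, r: int) -> int:
    return (c * 31 + r * 17) % 97

table = [[(c * 31 + r * 17) % 97 for r in range(20)] for c in range(20)]
cached(3, 7)  # warm the cache so only hits are timed

print("lru_cache hit:", timeit.timeit("cached(3, 7)", globals=globals()))
print("list indexing:", timeit.timeit("table[3][7]", globals=globals()))
```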
+### Step 5: Profile memoized with py-spy
+```bash
+py-spy record -o memoized.svg -- python3 config_validator_memoized.py
+```
-### Step 5: Hypothesis - can we beat the cache?
+Open `memoized.svg`. The flamegraph looks thin - most time is unaccounted for.
+**Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead
+is in native C code (dict operations, hashing), so py-spy can't see it.
+### Step 6: Profile with perf (native code)
+```bash
+perf record -g -F 9999 python3 config_validator_memoized.py
+perf report
+```
+Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's
+hard to tell which Python code triggered these operations.
+### Step 7: Profile with perf + Python frames
+```bash
+perf record -g -F 9999 python3 -X perf config_validator_memoized.py
+perf report
+```
+The `-X perf` flag enables Python's perf map support. Now the call stack clearly
+shows time spent under `_lru_cache_wrapper` - that's the cache overhead!
+You can also generate a flamegraph:
+```bash
+perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
+```
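On Python 3.12+ the same trampoline can also be enabled with the `PYTHONPERFSUPPORT=1` environment variable, or from code when you only want Python frames around one hot region:
```python
# Scope the perf trampoline to one region (CPython 3.12+ API).
import sys

sys.activate_stack_trampoline("perf")  # equivalent to running with -X perf
# ... run the code you want visible with Python frames in perf call stacks ...
sys.deactivate_stack_trampoline()
```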
+### Step 8: The precomputed solution
When the input space is **small and bounded** (400 combinations), we can:
1. Precompute all results into a 2D array
@@ -101,14 +125,11 @@ Array indexing is faster because:
- Direct memory offset calculation
- Better CPU cache locality
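A sketch of what the precomputed version likely boils down to (names assumed; the validator body is the same invented one as above):
```python
# Build the full 20x20 answer table once; process_events then does only
# list indexing.
def validate_rule(config_id: int, rule_id: int) -> bool:
    acc = 0
    for i in range(50):
        acc += (config_id * 31 + rule_id * 17 + i) % 97
    return acc % 2 == 0

# One-time cost: 400 calls, no matter how many events follow.
TABLE = [[validate_rule(c, r) for r in range(20)] for c in range(20)]

def process_events(events):
    # No wrapper call, no tuple key, no hashing - just two index operations.
    return sum(TABLE[c][r] for c, r in events)
```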
-### Step 6: Profile the precomputed version
```bash
-python3 -m cProfile -s tottime config_validator_precomputed.py
+python3 config_validator_precomputed.py
```
**Observation:** No wrapper overhead. Clean array indexing in `process_events`.
-### Step 7: Compare all three
+### Step 9: Compare all three
```bash
python3 config_validator.py
```
@@ -119,13 +140,21 @@ Expected output shows precomputed ~2x faster than memoized.
## Key Profiling Techniques
-### Finding where time is spent
+### Tool comparison
+| Tool | Shows | Limitations |
+|------|-------|-------------|
+| cProfile | Python function times | No native code, high overhead |
+| py-spy | Python flamegraph, low overhead | No native code |
+| perf | Native code | No Python frames by default |
+| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |
### cProfile usage
```bash
python3 -m cProfile -s tottime script.py # Sort by time in function itself
-python3 -m cProfile -s cumtime script.py # Sort by cumulative time (includes callees)
+python3 -m cProfile -s cumtime script.py # Sort by cumulative time
```
-### Understanding the columns
+### Understanding cProfile columns
- `ncalls`: Number of calls
- `tottime`: Time spent in function (excluding callees)
- `cumtime`: Time spent in function (including callees)
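The same report can be produced from code via `pstats`, which is convenient when you want to profile a single function and sort programmatically:
```python
# Drive cProfile from code and sort the stats like the -s flag does.
import cProfile
import pstats

def work():
    return sum(i * i for i in range(100000))

cProfile.run("work()", "out.prof")            # profile one statement to a file
stats = pstats.Stats("out.prof")
stats.sort_stats("tottime").print_stats(5)    # ncalls / tottime / cumtime columns
```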