scenario2: add perf
This commit is contained in:
parent
596ae02dd4
commit
658f4e98bc
4
.gitignore
vendored
4
.gitignore
vendored
@ -4,3 +4,7 @@ result
|
|||||||
events.pkl
|
events.pkl
|
||||||
__pycache__
|
__pycache__
|
||||||
*.svg
|
*.svg
|
||||||
|
*.so
|
||||||
|
*#*#
|
||||||
|
perf.data*
|
||||||
|
out.perf-folded
|
||||||
|
|||||||
@ -38,6 +38,8 @@ py-spy record -o prime_slow.svg -- python3 prime_slow.py
|
|||||||
|
|
||||||
Open `prime_slow.svg` in a browser. The widest bar at the top shows where time is spent.
|
Open `prime_slow.svg` in a browser. The widest bar at the top shows where time is spent.
|
||||||
|
|
||||||
|
**Note:** py-spy only traces Python functions. Time spent in native C code (like the C library calls in later steps) won't appear in py-spy flamegraphs.
|
||||||
|
|
||||||
### Step 4: Compile and run the partially optimized version
|
### Step 4: Compile and run the partially optimized version
|
||||||
```bash
|
```bash
|
||||||
# Compile the C library
|
# Compile the C library
|
||||||
|
|||||||
@ -51,46 +51,70 @@ The `ncalls` drops from millions to ~35.
|
|||||||
## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
|
## Exercise 2: Config Validator (When Caching Becomes the Bottleneck)
|
||||||
|
|
||||||
This exercise demonstrates a common pattern: you add caching, get a big speedup,
|
This exercise demonstrates a common pattern: you add caching, get a big speedup,
|
||||||
but then discover the cache itself is now the bottleneck.
|
but then discover the cache itself is now the bottleneck. Along the way, you'll
|
||||||
|
learn the limits of different profiling tools.
|
||||||
|
|
||||||
### Step 1: Generate test data
|
### Step 1: Generate test data
|
||||||
```bash
|
```bash
|
||||||
python3 generate_events.py 100000
|
python3 generate_events.py 1000000
|
||||||
```
|
```
|
||||||
|
|
||||||
### Step 2: Profile the naive version
|
### Step 2: Run the naive version
|
||||||
```bash
|
```bash
|
||||||
python3 -m cProfile -s tottime config_validator_naive.py
|
python3 config_validator_naive.py
|
||||||
```
|
```
|
||||||
|
|
||||||
**What to look for:** `validate_rule_slow` dominates the profile. It's called
|
It's slow (~3s). Let's profile to see why.
|
||||||
100,000 times even though there are only 400 unique input combinations.
|
|
||||||
|
|
||||||
### Step 3: Add memoization - big improvement!
|
### Step 3: Profile with py-spy
|
||||||
```bash
|
```bash
|
||||||
python3 -m cProfile -s tottime config_validator_memoized.py
|
py-spy record -o naive.svg -- python3 config_validator_naive.py
|
||||||
```
|
```
|
||||||
|
|
||||||
**Observation:** Dramatic speedup! But look carefully at the profile...
|
Open `naive.svg` in a browser. You'll see `validate_rule_slow` dominating -
|
||||||
|
it's called 1,000,000 times even though there are only 400 unique input combinations.
|
||||||
|
|
||||||
### Step 4: Identify the new bottleneck
|
### Step 4: Apply memoization
|
||||||
|
|
||||||
Compare `process_events` time between memoized and precomputed:
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m cProfile -s tottime config_validator_memoized.py
|
python3 config_validator_memoized.py
|
||||||
python3 -m cProfile -s tottime config_validator_precomputed.py
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Key insight:** Compare the `process_events` tottime:
|
Dramatic speedup! But where is the remaining time going?
|
||||||
- Memoized: ~0.014s
|
|
||||||
- Precomputed: ~0.004s (3.5x faster!)
|
|
||||||
|
|
||||||
The cache lookup overhead now dominates because:
|
### Step 5: Profile memoized with py-spy
|
||||||
- The validation function is cheap (only 50 iterations)
|
```bash
|
||||||
- But we do 100,000 cache lookups
|
py-spy record -o memoized.svg -- python3 config_validator_memoized.py
|
||||||
- Each lookup involves: tuple creation for the key, hashing, dict lookup
|
```
|
||||||
|
|
||||||
### Step 5: Hypothesis - can we beat the cache?
|
Open `memoized.svg`. The flamegraph looks thin - most time is unaccounted for.
|
||||||
|
|
||||||
|
**Problem:** py-spy only traces Python functions. The `lru_cache` wrapper overhead
|
||||||
|
is in native C code (dict operations, hashing), so py-spy can't see it.
|
||||||
|
|
||||||
|
### Step 6: Profile with perf (native code)
|
||||||
|
```bash
|
||||||
|
perf record -g -F 9999 python3 config_validator_memoized.py
|
||||||
|
perf report
|
||||||
|
```
|
||||||
|
|
||||||
|
Now you see native C code: `lookdict`, `_PyObject_Call`, hash functions. But it's
|
||||||
|
hard to tell which Python code triggered these operations.
|
||||||
|
|
||||||
|
### Step 7: Profile with perf + Python frames
|
||||||
|
```bash
|
||||||
|
perf record -g -F 9999 python3 -X perf config_validator_memoized.py
|
||||||
|
perf report
|
||||||
|
```
|
||||||
|
|
||||||
|
The `-X perf` flag enables Python's perf map support. Now the call stack clearly
|
||||||
|
shows time spent under `_lru_cache_wrapper` - that's the cache overhead!
|
||||||
|
|
||||||
|
You can also generate a flamegraph:
|
||||||
|
```bash
|
||||||
|
perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 8: The precomputed solution
|
||||||
|
|
||||||
When the input space is **small and bounded** (400 combinations), we can:
|
When the input space is **small and bounded** (400 combinations), we can:
|
||||||
1. Precompute all results into a 2D array
|
1. Precompute all results into a 2D array
|
||||||
@ -101,14 +125,11 @@ Array indexing is faster because:
|
|||||||
- Direct memory offset calculation
|
- Direct memory offset calculation
|
||||||
- Better CPU cache locality
|
- Better CPU cache locality
|
||||||
|
|
||||||
### Step 6: Profile the precomputed version
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m cProfile -s tottime config_validator_precomputed.py
|
python3 config_validator_precomputed.py
|
||||||
```
|
```
|
||||||
|
|
||||||
**Observation:** No wrapper overhead. Clean array indexing in `process_events`.
|
### Step 9: Compare all three
|
||||||
|
|
||||||
### Step 7: Compare all three
|
|
||||||
```bash
|
```bash
|
||||||
python3 config_validator.py
|
python3 config_validator.py
|
||||||
```
|
```
|
||||||
@ -119,13 +140,21 @@ Expected output shows precomputed ~2x faster than memoized.
|
|||||||
|
|
||||||
## Key Profiling Techniques
|
## Key Profiling Techniques
|
||||||
|
|
||||||
### Finding where time is spent
|
### Tool comparison
|
||||||
|
| Tool | Shows | Limitations |
|
||||||
|
|------|-------|-------------|
|
||||||
|
| cProfile | Python function times | No native code, high overhead |
|
||||||
|
| py-spy | Python flamegraph, low overhead | No native code |
|
||||||
|
| perf | Native code | No Python frames by default |
|
||||||
|
| perf + `-X perf` | Both native and Python | Requires Python 3.12+ |
|
||||||
|
|
||||||
|
### cProfile usage
|
||||||
```bash
|
```bash
|
||||||
python3 -m cProfile -s tottime script.py # Sort by time in function itself
|
python3 -m cProfile -s tottime script.py # Sort by time in function itself
|
||||||
python3 -m cProfile -s cumtime script.py # Sort by cumulative time (includes callees)
|
python3 -m cProfile -s cumtime script.py # Sort by cumulative time
|
||||||
```
|
```
|
||||||
|
|
||||||
### Understanding the columns
|
### Understanding cProfile columns
|
||||||
- `ncalls`: Number of calls
|
- `ncalls`: Number of calls
|
||||||
- `tottime`: Time spent in function (excluding callees)
|
- `tottime`: Time spent in function (excluding callees)
|
||||||
- `cumtime`: Time spent in function (including callees)
|
- `cumtime`: Time spent in function (including callees)
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user