Scenario 2: Memoization and Precomputation

Learning Objectives

  • Use cProfile to identify performance bottlenecks
  • Recognize when @lru_cache becomes a bottleneck itself
  • Understand when precomputation beats memoization
  • Learn to read profiler output to guide optimization decisions

Files

Fibonacci Example

  • fib_slow.py - Naive recursive Fibonacci (exponential time)
  • fib_cached.py - Memoized Fibonacci (linear time)

Config Validator Example

  • generate_events.py - Generate test data (run first)
  • config_validator_naive.py - Baseline: no caching
  • config_validator_memoized.py - Uses @lru_cache
  • config_validator_precomputed.py - Uses 2D array lookup
  • config_validator.py - Comparison runner
  • common.py - Shared code

Exercise 1: Fibonacci (Identifying Redundant Calls)

Step 1: Experience the slowness

time python3 fib_slow.py 35

This takes several seconds. Don't try n=50!

Step 2: Profile to understand why

python3 -m cProfile -s ncalls fib_slow.py 35

Look at ncalls for the fib function - it's called millions of times because the same values are recomputed repeatedly.
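
For reference, the heart of fib_slow.py is just the textbook recursion; a minimal sketch (argument handling in the real script may differ):

import sys

def fib(n):
    # Each call spawns two more recursive calls, so the same
    # subproblems are recomputed over and over: roughly 2^n calls.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    print(fib(int(sys.argv[1])))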

Step 3: Apply memoization and verify

time python3 fib_cached.py 35
python3 -m cProfile -s ncalls fib_cached.py 35

The ncalls count drops from millions to roughly 35.
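
fib_cached.py only needs to add a caching decorator on top of the same recursion; a minimal sketch, assuming @lru_cache is used:

import sys
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each distinct n is computed once; repeated calls become
    # dictionary lookups inside the C-level cache wrapper.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    print(fib(int(sys.argv[1])))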


Exercise 2: Config Validator (When Caching Becomes the Bottleneck)

This exercise demonstrates a common pattern: you add caching, get a big speedup, but then discover the cache itself is now the bottleneck. Along the way, you'll learn the limits of different profiling tools.

Step 1: Generate test data

python3 generate_events.py 1000000

Step 2: Run the naive version

python3 config_validator_naive.py

It's slow (~3s). Let's profile to see why.

Step 3: Profile with py-spy

py-spy record -o naive.svg -- python3 config_validator_naive.py

Open naive.svg in a browser. You'll see validate_rule_slow dominating - it's called 1,000,000 times even though there are only 400 unique input combinations.
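
The hot loop in the naive version looks roughly like this (a sketch; the event fields and the body of validate_rule_slow are assumptions, not the repository code):

def validate_rule_slow(rule_id, event_type):
    # Stand-in for an expensive validation; the real rule logic differs.
    acc = 0
    for _ in range(1000):
        acc += (rule_id * 31 + event_type) % 7
    return acc % 2 == 0

def validate_events(events):
    # 1,000,000 events but only ~400 distinct (rule_id, event_type)
    # pairs, so the same expensive work is repeated thousands of times.
    return [validate_rule_slow(r, e) for r, e in events]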

Step 4: Apply memoization

python3 config_validator_memoized.py

Dramatic speedup! But where is the remaining time going?
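
Conceptually, the memoized version just wraps the same validator (a sketch, assuming the arguments are hashable values such as rule_id and event_type):

from functools import lru_cache

@lru_cache(maxsize=None)
def validate_rule(rule_id, event_type):
    # Stand-in for the expensive check; the body runs once per unique pair.
    return (rule_id * 31 + event_type) % 7 != 3

# Cache hits are cheap, but every call still hashes the argument tuple and
# probes a dict inside the C-level wrapper; Steps 5-7 chase down that cost.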

Step 5: Profile memoized with py-spy

py-spy record -o memoized.svg -- python3 config_validator_memoized.py

Open memoized.svg. The flamegraph looks thin - most time is unaccounted for.

Problem: by default, py-spy samples only Python frames. The lru_cache wrapper overhead is in native C code (argument hashing, dict operations), so py-spy can't see it.

Step 6: Profile with perf (native code)

perf record -g -F 9999 python3 config_validator_memoized.py
perf report

Now you see native C code: lookdict, _PyObject_Call, hash functions. But it's hard to tell which Python code triggered these operations.

Step 7: Profile with perf + Python frames

perf record -g -F 9999 python3 -X perf config_validator_memoized.py
perf report

The -X perf flag (Python 3.12+) enables Python's perf map support, letting perf map samples back to Python frames. Now the call stack clearly shows time spent under _lru_cache_wrapper - that's the cache overhead!

You can also generate a flamegraph:

perf script | stackcollapse-perf.pl | flamegraph.pl > memoized_perf.svg

Step 8: The precomputed solution

When the input space is small and bounded (400 combinations), we can:

  1. Precompute all results into a 2D array
  2. Use array indexing instead of hash-based lookup

Array indexing is faster because:

  • No hash computation
  • Direct memory offset calculation
  • Better CPU cache locality

python3 config_validator_precomputed.py
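
A sketch of the precomputed approach (the 20x20 bounds and the names below are assumptions inferred from the 400 combinations; the repository code may differ):

NUM_RULES = 20          # assumed bounds: 20 x 20 = 400 combinations
NUM_EVENT_TYPES = 20

def validate_rule(rule_id, event_type):
    # Stand-in for the expensive check; paid once per cell, up front.
    return (rule_id * 31 + event_type) % 7 != 3

# Precompute every answer into a 2D list indexed by [rule_id][event_type].
TABLE = [[validate_rule(r, e) for e in range(NUM_EVENT_TYPES)]
         for r in range(NUM_RULES)]

def validate_events(events):
    # Per-event cost is now two list index operations: no hashing,
    # no cache-wrapper call.
    return [TABLE[r][e] for r, e in events]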

Step 9: Compare all three

python3 config_validator.py

The expected output shows the precomputed version running roughly 2x faster than the memoized one.


Key Profiling Techniques

Tool comparison

Tool              Shows                               Limitations
cProfile          Python function times               No native code, high overhead
py-spy            Python flamegraph, low overhead     No native code
perf              Native code                         No Python frames by default
perf + -X perf    Both native and Python              Requires Python 3.12+

cProfile usage

python3 -m cProfile -s tottime script.py    # Sort by time in function itself
python3 -m cProfile -s cumtime script.py    # Sort by cumulative time

Understanding cProfile columns

  • ncalls: Number of calls
  • tottime: Time spent in function (excluding callees)
  • cumtime: Time spent in function (including callees)
  • percall: Time per call

When to Use Each Approach

Approach                    When to Use
No caching                  Function is cheap OR each input seen only once
Memoization (@lru_cache)    Unknown/large input space, expensive function
Precomputation              Known/small input space, many lookups, bounded integers

Discussion Questions

  1. Why does @lru_cache have overhead?

    • Hint: What happens on each call even for cache hits?
  2. When would memoization beat precomputation?

    • Hint: What if there were 10,000 x 10,000 possible inputs but you only see 100?
  3. Could we make precomputation even faster?

    • Hint: What about a flat array with table[rule_id * 20 + event_type]?

Further Reading

  • functools.lru_cache documentation
  • functools.cache (Python 3.9+) - unbounded cache, slightly less overhead
  • NumPy arrays for truly O(1) array access