# Scenario 7: Continuous Profiling with Pyroscope

## Learning Objectives

- Understand the difference between one-shot and continuous profiling
- Set up and use Pyroscope for Python applications
- Navigate the Pyroscope UI to find performance issues
- Compare flamegraphs over time

## Background

**One-shot profiling** (what we've done so far):
- Run profiler → Execute program → Stop → Analyze
- Good for: reproducible tests, specific scenarios
- Bad for: intermittent issues, production systems

**Continuous profiling**:
- Always running in the background
- Low overhead (~2-5% CPU)
- Aggregates data over time
- Good for: production monitoring, finding intermittent issues

## Files

- `app.py` - Flask web application with Pyroscope instrumentation
- `loadgen.sh` - Script to generate traffic
- `requirements.txt` - Python dependencies

## Setup

### 1. Start Pyroscope Server

Option A: Docker (recommended)

```bash
docker run -d --name pyroscope -p 4040:4040 grafana/pyroscope
```

Option B: Binary download

```bash
# Download from https://github.com/grafana/pyroscope/releases
./pyroscope server
```

### 2. Install Python Dependencies

```bash
pip install -r requirements.txt
# Or: pip install flask pyroscope-io
```

### 3. Start the Application

```bash
python3 app.py
```

### 4. Generate Load

```bash
chmod +x loadgen.sh
./loadgen.sh http://localhost:5000 120  # 2 minutes of load
```

### 5. View Profiles

Open http://localhost:4040 in your browser.

## Exercise 1: Explore the Pyroscope UI

1. Go to http://localhost:4040
2. Select `workshop.flask.app` from the application dropdown
3. Observe the flamegraph

### UI Navigation

- **Timeline**: Shows CPU usage over time; click and drag to select a time range
- **Flamegraph**: Visual representation of where time is spent
- **Table view**: Sortable list of functions by self/total time
- **Diff view**: Compare two time ranges

## Exercise 2: Find the Hot Function

While `loadgen.sh` is running:

1. Look at the flamegraph
2. Find `compute_primes_slow` - it should be prominent
3. Click on it to zoom in
4. See the call stack leading to it

## Exercise 3: Compare Cached vs Uncached

1. Note the current time
2. Stop `loadgen.sh`
3. Modify `loadgen.sh` to only hit cached endpoints (or run manually):

   ```bash
   for i in $(seq 100); do
       curl -s "localhost:5000/api/hash_cached/test_$((i % 5))"
   done
   ```

4. In Pyroscope, compare the two time periods using the diff view

## Exercise 4: Spot I/O-Bound Code

1. Generate load to the slow_io endpoint:

   ```bash
   for i in $(seq 50); do curl -s localhost:5000/api/slow_io; done
   ```

2. Look at the flamegraph
3. Notice that `time.sleep` doesn't show up much - why?
   - CPU profiling only captures CPU time
   - I/O wait (sleeping, network, disk) doesn't consume CPU
   - This is why I/O-bound code looks "fast" in CPU profiles!

## Exercise 5: Timeline Analysis

1. Let `loadgen.sh` run for several minutes
2. In Pyroscope, zoom out the timeline
3. Look for patterns:
   - Spikes in CPU usage
   - Changes in the flamegraph shape over time
4. Select different time ranges to compare

## Key Pyroscope Concepts

### Flamegraph Reading

- **Width** = proportion of total samples (time)
- **Height** = call stack depth
- **Color** = usually arbitrary (for differentiation)
- **Plateaus** = wide, flat tops: functions that are "hot" (high self time)

### Comparing Profiles

Pyroscope can show:
- **Diff view**: Red = more time, Green = less time
- Useful for before/after comparisons

### Tags

The app uses tags for filtering:

```python
import pyroscope
pyroscope.configure(
    application_name="workshop.flask.app",  # name shown in the UI dropdown
    server_address="http://localhost:4040",
    tags={"env": "workshop", "version": "1.0.0"},
)
```

You can filter by tags in the UI.

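
The flamegraph-reading and diff rules in this section can be made concrete with a small standard-library sketch. It does not use Pyroscope or its data format. The sample stacks, the function names other than `compute_primes_slow`, and the helpers `total_share` and `diff_share` are all hypothetical, chosen only to illustrate how "width = proportion of samples" and a red/green diff could be derived from raw stack samples.

```python
from collections import Counter

# Hypothetical stack samples (root frame first); each tuple is one sample tick.
BEFORE = [
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "render"),
]
AFTER = [
    ("handle_request", "hash_cached"),
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "render"),
    ("handle_request", "render"),
]


def total_share(samples):
    """Fraction of samples in which each function appears anywhere on the stack.

    This plays the role of a frame's width in a flamegraph.
    """
    counts = Counter()
    for stack in samples:
        for func in set(stack):  # count a function at most once per sample
            counts[func] += 1
    return {func: n / len(samples) for func, n in counts.items()}


def diff_share(before, after):
    """Change in share per function: positive = more time, negative = less time."""
    b, a = total_share(before), total_share(after)
    return {func: a.get(func, 0.0) - b.get(func, 0.0) for func in sorted(set(a) | set(b))}


if __name__ == "__main__":
    for func, delta in diff_share(BEFORE, AFTER).items():
        print(f"{func:22s} {delta:+.2f}")
```

In this toy data, `compute_primes_slow` shrinks from three of four samples to one of four, which a diff view would color green, while the frames that gained share would be colored red.
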
## Production Considerations

### Overhead

- Pyroscope Python agent: ~2-5% CPU overhead
- Sampling rate can be tuned (default: 100Hz)

### Data Volume

- Profiles are aggregated, not stored raw
- Storage is efficient (10-100MB per day per app)

### Security

- Profile data can reveal code structure
- Consider who has access to Pyroscope

### Alternatives

- **Datadog Continuous Profiler**
- **AWS CodeGuru Profiler**
- **Google Cloud Profiler**
- **Parca** (open source, eBPF-based)

## Troubleshooting

### "No data in Pyroscope"

- Check if Pyroscope server is running: http://localhost:4040
- Check app logs for connection errors
- Verify `pyroscope-io` is installed

### "Profile looks empty"

- Generate more load
- The endpoint might be I/O bound (not CPU)
- Check the time range in the UI

### High overhead

- Reduce sampling rate in `pyroscope.configure()`
- Check for profiling-related exceptions

## Discussion Questions

1. **When would you use continuous profiling vs one-shot?**
   - Continuous: production, long-running apps, intermittent issues
   - One-shot: development, benchmarking, specific scenarios

2. **What can't CPU profiling show you?**
   - I/O wait time
   - Lock contention (mostly)
   - Memory allocation patterns

3. **How would you profile a batch job vs a web server?**
   - Batch: one-shot profiling of the entire run
   - Server: continuous, focus on request handling paths

## Key Takeaways

1. **Continuous profiling catches issues that one-shot misses**
2. **Low overhead makes it safe for production**
3. **Timeline view reveals patterns over time**
4. **CPU profiling doesn't show I/O time**
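
The last takeaway can be checked without any profiler. The sketch below uses only the standard library to compare wall-clock time with process CPU time for a sleeping function and a busy loop. The helper names `measure`, `io_like`, and `cpu_bound` are hypothetical and not part of `app.py`, and `time.sleep` stands in for real I/O wait.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def io_like():
    # Stand-in for I/O wait: the process sleeps and uses almost no CPU.
    time.sleep(0.2)


def cpu_bound():
    # Busy loop: wall time and CPU time grow together.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    for name, func in [("io_like", io_like), ("cpu_bound", cpu_bound)]:
        wall, cpu = measure(func)
        print(f"{name:10s} wall={wall:.3f}s cpu={cpu:.3f}s")
```

The sleeping function should report a wall time near 0.2 s with CPU time close to zero, which is why a profiler that samples only on-CPU time gives it very little width in the flamegraph.
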