# Scenario 7: Continuous Profiling with Pyroscope

## Learning Objectives
- Understand the difference between one-shot and continuous profiling
- Set up and use Pyroscope for Python applications
- Navigate the Pyroscope UI to find performance issues
- Compare flamegraphs over time

## Background

**One-shot profiling** (what we've done so far):
- Run profiler → Execute program → Stop → Analyze
- Good for: reproducible tests, specific scenarios
- Bad for: intermittent issues, production systems

**Continuous profiling**:
- Always running in the background
- Low overhead (~2-5% CPU)
- Aggregates data over time
- Good for: production monitoring, finding intermittent issues

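To make the contrast concrete, here is a minimal sketch of the idea behind continuous profiling, using only the standard library: a background thread samples another thread's stack at a fixed rate and aggregates the counts. It is an illustration of the technique, not how the Pyroscope agent is implemented; `BackgroundSampler` and `busy_work` are hypothetical names.

```python
import collections
import sys
import threading
import time


class BackgroundSampler:
    """Toy sampling profiler: counts which function a thread is executing."""

    def __init__(self, thread_id, hz=100):
        self.thread_id = thread_id
        self.interval = 1.0 / hz
        self.counts = collections.Counter()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            # Innermost Python frame of the target thread, if it is alive.
            frame = sys._current_frames().get(self.thread_id)
            if frame is not None:
                self.counts[frame.f_code.co_name] += 1
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def busy_work(seconds):
    """Burn CPU for roughly `seconds` of wall-clock time."""
    deadline = time.perf_counter() + seconds
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


sampler = BackgroundSampler(threading.get_ident(), hz=100)
sampler.start()
busy_work(0.3)          # the "application" keeps running while samples accumulate
sampler.stop()
print(sampler.counts.most_common(3))
```

The program under test never has to be stopped or restarted for the counts to accumulate, which is the property that makes this style of profiling usable in production.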
## Files
- `app.py` - Flask web application with Pyroscope instrumentation
- `loadgen.sh` - Script to generate traffic
- `requirements.txt` - Python dependencies

## Setup

### 1. Start Pyroscope Server

Option A: Docker (recommended)
```bash
docker run -d --name pyroscope -p 4040:4040 grafana/pyroscope
```

Option B: Binary download
```bash
# Download from https://github.com/grafana/pyroscope/releases
./pyroscope server
```

### 2. Install Python Dependencies
```bash
pip install -r requirements.txt
# Or: pip install flask pyroscope-io
```

### 3. Start the Application
```bash
python3 app.py
```

### 4. Generate Load
```bash
chmod +x loadgen.sh
./loadgen.sh http://localhost:5000 120  # 2 minutes of load
```

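For readers who prefer Python, a comparable load pattern can be sketched as below. The contents of `loadgen.sh` are not shown in this document, so this is not a port of it: the two endpoint paths are taken from the exercises, the primes endpoint is omitted because its path is not given, and `build_urls` and `run_load` are hypothetical helper names.

```python
import time
import urllib.request


def build_urls(base_url, distinct_keys=5):
    """Return the request URLs to cycle through (paths taken from the exercises)."""
    base = base_url.rstrip("/")
    urls = [f"{base}/api/hash_cached/test_{i}" for i in range(distinct_keys)]
    urls.append(f"{base}/api/slow_io")
    return urls


def run_load(base_url, duration_s, timeout_s=5.0):
    """Send GET requests in a loop for about `duration_s` seconds.

    Returns a tuple (ok, failed) with the number of requests in each group.
    """
    urls = build_urls(base_url)
    ok = failed = 0
    deadline = time.monotonic() + duration_s
    i = 0
    while time.monotonic() < deadline:
        url = urls[i % len(urls)]
        i += 1
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                resp.read()
            ok += 1
        except OSError:  # URLError and HTTPError are OSError subclasses
            failed += 1
            time.sleep(0.1)  # avoid a hot loop while the app is down
    return ok, failed


# Example (requires the app to be running):
#   run_load("http://localhost:5000", duration_s=120)
```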
### 5. View Profiles
Open http://localhost:4040 in your browser.

## Exercise 1: Explore the Pyroscope UI

1. Go to http://localhost:4040
2. Select `workshop.flask.app` from the application dropdown
3. Observe the flamegraph

### UI Navigation
- **Timeline**: Shows CPU usage over time; click to select a time range
- **Flamegraph**: Visual representation of where time is spent
- **Table view**: Sortable list of functions by self/total time
- **Diff view**: Compare two time ranges

## Exercise 2: Find the Hot Function

While `loadgen.sh` is running:

1. Look at the flamegraph
2. Find `compute_primes_slow` - it should be prominent
3. Click on it to zoom in
4. See the call stack leading to it

## Exercise 3: Compare Cached vs Uncached

1. Note the current time (it marks the end of the uncached period)
2. Stop `loadgen.sh`
3. Modify `loadgen.sh` to only hit cached endpoints (or run manually):
   ```bash
   for i in $(seq 100); do
       curl -s "localhost:5000/api/hash_cached/test_$((i % 5))"
   done
   ```
4. In Pyroscope, compare the two time periods using the diff view

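Why would the two periods differ? A plausible reason, sketched below, is that a cached endpoint does the expensive work only once per distinct key. The implementation in `app.py` is not shown in this document, so `expensive_hash`, `cached_hash`, and the use of `functools.lru_cache` are assumptions made purely for illustration.

```python
import functools
import hashlib

call_count = 0  # how many times the expensive path actually ran


def expensive_hash(key, rounds=20_000):
    """Hash `key` repeatedly to simulate CPU-heavy work."""
    global call_count
    call_count += 1
    digest = key.encode()
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


@functools.lru_cache(maxsize=128)
def cached_hash(key):
    """Same result as expensive_hash, computed once per distinct key."""
    return expensive_hash(key)


# Mirror the curl loop: 100 requests spread over 5 distinct keys.
for i in range(1, 101):
    cached_hash(f"test_{i % 5}")

print(call_count, cached_hash.cache_info())
```

Only 5 of the 100 calls reach the expensive path; the other 95 are cache hits, which is why the hashing frames should shrink in the cached period.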
## Exercise 4: Spot I/O-Bound Code

1. Generate load to the `slow_io` endpoint:
   ```bash
   for i in $(seq 50); do curl -s localhost:5000/api/slow_io; done
   ```
2. Look at the flamegraph
3. Notice that `time.sleep` doesn't show up much - why?
   - CPU profiling only captures CPU time
   - I/O wait (sleeping, network, disk) doesn't consume CPU
   - This is why I/O-bound code looks "fast" in CPU profiles!

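The answer to step 3 can be checked with the standard library alone. The sketch below compares wall-clock time (`time.perf_counter`) with process CPU time (`time.process_time`) for a sleeping function and a spinning one; `measure`, `io_like`, and `cpu_like` are hypothetical names, and the 0.2 s durations are arbitrary.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent in fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def io_like():
    time.sleep(0.2)  # stands in for a network or disk wait


def cpu_like():
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass  # spin to keep the CPU busy


io_wall, io_cpu = measure(io_like)
busy_wall, busy_cpu = measure(cpu_like)
print(f"sleep: wall={io_wall:.2f}s cpu={io_cpu:.2f}s")
print(f"spin:  wall={busy_wall:.2f}s cpu={busy_cpu:.2f}s")
```

Both functions take about the same wall-clock time, but only the spinning one accumulates CPU time, so only it would be visible to a CPU profiler.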
## Exercise 5: Timeline Analysis

1. Let `loadgen.sh` run for several minutes
2. In Pyroscope, zoom out the timeline
3. Look for patterns:
   - Spikes in CPU usage
   - Changes in the flamegraph shape over time
4. Select different time ranges to compare

## Key Pyroscope Concepts

### Flamegraph Reading
- **Width** = proportion of total samples (time)
- **Height** = call stack depth
- **Color** = usually arbitrary (for differentiation)
- **Plateaus** = wide frames with no child frames; these functions are "hot" (high self time)

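One way to see why width means "share of samples" is to compute the numbers behind a flamegraph by hand. The sketch below aggregates stack samples written in the semicolon-separated "folded" format used by Brendan Gregg's FlameGraph scripts. The sample data is made up: `compute_primes_slow` comes from this scenario, while the handler names are hypothetical.

```python
from collections import Counter

# Each entry: "root;caller;...;leaf" and how many samples hit that stack.
# Made-up data for illustration.
folded = {
    "handle_request;primes_endpoint;compute_primes_slow": 70,
    "handle_request;primes_endpoint": 5,
    "handle_request;hash_endpoint;sha256": 20,
    "handle_request;slow_io_endpoint": 5,
}


def aggregate(folded_samples):
    """Return (total, self_) sample counts per function.

    total: samples where the function is anywhere on the stack (frame width).
    self_: samples where the function is the leaf (time in its own code).
    """
    total, self_ = Counter(), Counter()
    for stack, count in folded_samples.items():
        frames = stack.split(";")
        for name in set(frames):  # set() avoids double-counting recursion
            total[name] += count
        self_[frames[-1]] += count
    return total, self_


total, self_ = aggregate(folded)
all_samples = sum(folded.values())
for name, count in total.most_common():
    width = count / all_samples
    own = self_[name] / all_samples
    print(f"{name:22s} width={width:6.1%} self={own:6.1%}")
```

In this data `handle_request` is the widest frame but has no self time, whereas `compute_primes_slow` is both wide and a leaf, which is what a plateau looks like.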
### Comparing Profiles
Pyroscope's **diff view** compares a baseline time range with a comparison time range:
- Red = more time than in the baseline, Green = less time
- Useful for before/after comparisons

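The coloring can be thought of as a per-function comparison of sample shares between the two ranges. The following is a minimal sketch with made-up numbers, not Pyroscope's actual algorithm; `diff_shares` and the function names in the data are hypothetical.

```python
def diff_shares(baseline, comparison):
    """Return {function: change in share of samples}, in percentage points.

    Positive values mean the function takes a larger share in the
    comparison range (red); negative values mean a smaller share (green).
    """
    base_total = sum(baseline.values()) or 1
    comp_total = sum(comparison.values()) or 1
    names = set(baseline) | set(comparison)
    return {
        name: 100.0 * (comparison.get(name, 0) / comp_total
                       - baseline.get(name, 0) / base_total)
        for name in names
    }


# Made-up self-sample counts for two time ranges.
uncached = {"compute_hash": 800, "json_encode": 100, "routing": 100}
cached = {"compute_hash": 50, "json_encode": 250, "cache_lookup": 100, "routing": 100}

deltas = diff_shares(uncached, cached)
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    color = "red" if delta > 0 else "green" if delta < 0 else "unchanged"
    print(f"{name:14s} {delta:+6.1f} pp  {color}")
```

Comparing shares rather than raw counts matters because the two ranges rarely contain the same number of samples. Note that `routing` turns red here even though its raw count is unchanged: its share grew because the total shrank.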
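### Tags
The app uses tags for filtering:
```python
import pyroscope
pyroscope.configure(
    application_name="workshop.flask.app",   # name shown in the UI dropdown
    server_address="http://localhost:4040",  # Pyroscope server from Setup
    tags={"env": "workshop", "version": "1.0.0"},
)
```

You can filter by tags in the UI.
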
## Production Considerations

### Overhead
- Pyroscope Python agent: ~2-5% CPU overhead
- Sampling rate can be tuned (default: 100 Hz)

### Data Volume
- Profiles are aggregated, not stored raw
- Storage is efficient (roughly 10-100 MB per day per app)

### Security
- Profile data can reveal code structure
- Consider who has access to Pyroscope

### Alternatives
- **Datadog Continuous Profiler**
- **AWS CodeGuru Profiler**
- **Google Cloud Profiler**
- **Parca** (open source, eBPF-based)

## Troubleshooting

### "No data in Pyroscope"
- Check if Pyroscope server is running: http://localhost:4040
- Check app logs for connection errors
- Verify `pyroscope-io` is installed

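The first and last checks can be scripted. This is a small sketch using only the standard library; `server_reachable` and `agent_installed` are hypothetical helper names, and the default URL is the one used in Setup. The `pyroscope-io` package is imported under the module name `pyroscope`.

```python
import importlib.util
import urllib.error
import urllib.request


def server_reachable(url="http://localhost:4040", timeout_s=2.0):
    """Return True if an HTTP request to `url` gets any response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...


def agent_installed():
    """Return True if the `pyroscope` module can be found."""
    return importlib.util.find_spec("pyroscope") is not None


print("server reachable:", server_reachable())
print("pyroscope-io installed:", agent_installed())
```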
### "Profile looks empty"
- Generate more load
- The endpoint might be I/O bound (not CPU)
- Check the time range in the UI

### High overhead
- Reduce the sampling rate (`sample_rate`) in `pyroscope.configure()`
- Check for profiling-related exceptions

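As a sketch of the first point: `pyroscope.configure()` in the `pyroscope-io` package accepts a `sample_rate` argument in Hz. The helpers below are hypothetical, and the application name, server address, and tags are the values used elsewhere in this scenario. The import is deferred and guarded so the snippet also runs where the agent is not installed.

```python
def profiler_settings(sample_rate=100):
    """Build keyword arguments for pyroscope.configure()."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return {
        "application_name": "workshop.flask.app",
        "server_address": "http://localhost:4040",
        "sample_rate": sample_rate,  # stack samples per second
        "tags": {"env": "workshop", "version": "1.0.0"},
    }


def start_profiling(sample_rate=100):
    """Start the agent if pyroscope-io is installed; return True if started."""
    try:
        import pyroscope
    except ImportError:
        return False
    pyroscope.configure(**profiler_settings(sample_rate))
    return True


# Halving the rate halves the number of samples taken per second.
settings = profiler_settings(sample_rate=50)
print(settings["sample_rate"])
```

A lower rate reduces overhead at the cost of coarser data, so short-lived functions become less likely to appear in the profile.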
## Discussion Questions

1. **When would you use continuous profiling vs one-shot?**
   - Continuous: production, long-running apps, intermittent issues
   - One-shot: development, benchmarking, specific scenarios

2. **What can't CPU profiling show you?**
   - I/O wait time
   - Lock contention (mostly)
   - Memory allocation patterns

3. **How would you profile a batch job vs a web server?**
   - Batch: one-shot profiling of the entire run
   - Server: continuous, focus on request handling paths

## Key Takeaways

1. **Continuous profiling catches issues that one-shot misses**
2. **Low overhead makes it safe for production**
3. **Timeline view reveals patterns over time**
4. **CPU profiling doesn't show I/O time**