
# Scenario 7: Continuous Profiling with Pyroscope

## Learning Objectives
- Understand the difference between one-shot and continuous profiling
- Set up and use Pyroscope for Python applications
- Navigate the Pyroscope UI to find performance issues
- Compare flamegraphs over time

## Background

**One-shot profiling** (what we've done so far):
- Run profiler → Execute program → Stop → Analyze
- Good for: reproducible tests, specific scenarios
- Bad for: intermittent issues, production systems

**Continuous profiling**:
- Always running in the background
- Low overhead (~2-5% CPU)
- Aggregates data over time
- Good for: production monitoring, finding intermittent issues
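
The idea behind continuous profiling can be sketched with a toy sampler: a background thread periodically records which function another thread is executing and aggregates the counts. This is only an illustration built on the standard library, not how the Pyroscope agent is implemented; `busy_work`, `sample_thread`, and `profile_continuously` are hypothetical names.

```python
import collections
import sys
import threading
import time


def busy_work(deadline):
    """Burn CPU until the deadline so the sampler has something to observe."""
    total = 0
    while time.perf_counter() < deadline:
        for i in range(1000):
            total += i * i
    return total


def sample_thread(thread_id, counts, stop_event, interval=0.01):
    """Record the function a thread is executing, roughly every `interval` seconds."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def profile_continuously(duration=0.3):
    """Run busy_work while a background thread samples the calling thread."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), counts, stop_event),
        daemon=True,
    )
    sampler.start()
    busy_work(time.perf_counter() + duration)
    stop_event.set()
    sampler.join()
    return counts


counts = profile_continuously()
print(counts.most_common(3))
```

Most samples should land in `busy_work`. A real agent records whole call stacks rather than a single function name and ships the aggregated data to a server.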

## Files
- `app.py` - Flask web application with Pyroscope instrumentation
- `loadgen.sh` - Script to generate traffic
- `requirements.txt` - Python dependencies

## Setup

### 1. Start Pyroscope Server

Option A: Docker (recommended)
```bash
docker run -d --name pyroscope -p 4040:4040 grafana/pyroscope
```

Option B: Binary download
```bash
# Download from https://github.com/grafana/pyroscope/releases
./pyroscope  # Grafana Pyroscope 1.x; legacy 0.x releases used `./pyroscope server`
```

### 2. Install Python Dependencies
```bash
pip install -r requirements.txt
# Or: pip install flask pyroscope-io
```

### 3. Start the Application
```bash
python3 app.py
```

### 4. Generate Load
```bash
chmod +x loadgen.sh
./loadgen.sh http://localhost:5000 120  # 2 minutes of load
```
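
If you cannot run the shell script, a rough Python stand-in might look like the sketch below. It is not the contents of `loadgen.sh`; `generate_load` is a hypothetical helper, and the paths in the usage comment are the two endpoints mentioned in the exercises.

```python
import time
import urllib.request


def generate_load(base_url, paths, duration_s, pause_s=0.05):
    """Request each path in turn until duration_s elapses; return (ok, failed) counts."""
    ok = failed = 0
    deadline = time.monotonic() + duration_s
    i = 0
    while time.monotonic() < deadline:
        url = base_url.rstrip("/") + paths[i % len(paths)]
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                response.read()
            ok += 1
        except OSError:  # URLError and HTTPError are OSError subclasses
            failed += 1
        i += 1
        time.sleep(pause_s)
    return ok, failed


# Example (assumes the app is listening on port 5000):
# generate_load("http://localhost:5000", ["/api/slow_io", "/api/hash_cached/test_1"], 120)
```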

### 5. View Profiles
Open http://localhost:4040 in your browser.

## Exercise 1: Explore the Pyroscope UI

1. Go to http://localhost:4040
2. Select `workshop.flask.app` from the application dropdown
3. Observe the flamegraph

### UI Navigation
- **Timeline**: Shows CPU usage over time; click and drag to select a time range
- **Flamegraph**: Visual representation of where time is spent
- **Table view**: Sortable list of functions by self/total time
- **Diff view**: Compare two time ranges

## Exercise 2: Find the Hot Function

While `loadgen.sh` is running:

1. Look at the flamegraph
2. Find `compute_primes_slow` - it should be one of the widest frames
3. Click on it to zoom in
4. See the call stack leading to it
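
The body of `compute_primes_slow` is not shown in this README. As an assumption, a deliberately slow prime search of the kind that dominates a flamegraph might look like this:

```python
def compute_primes_slow(limit):
    """Return all primes below `limit` using naive trial division."""
    primes = []
    for candidate in range(2, limit):
        is_prime = True
        # Deliberately wasteful: no square-root cutoff and no early exit.
        for divisor in range(2, candidate):
            if candidate % divisor == 0:
                is_prime = False
        if is_prime:
            primes.append(candidate)
    return primes


print(compute_primes_slow(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The nested loops do work proportional to the square of `limit`, which is why a function like this tends to fill most of the flamegraph width under load.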

## Exercise 3: Compare Cached vs Uncached

1. Note the current time (it marks the boundary between the two periods you will compare)
2. Stop `loadgen.sh`
3. Modify `loadgen.sh` to only hit cached endpoints (or run manually):
   ```bash
   for i in $(seq 100); do
       curl -s "localhost:5000/api/hash_cached/test_$((i % 5))"
   done
   ```
4. In Pyroscope, compare the two time periods using the diff view
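
Why the cached period looks so different can be sketched with `functools.lru_cache`. The real handlers live in `app.py` and are not shown here, so `hash_uncached` and `hash_cached` below are hypothetical stand-ins. The loop mirrors the curl loop above: 100 requests spread over 5 distinct keys.

```python
import functools
import hashlib


def hash_uncached(value, rounds=10_000):
    """Repeatedly hash `value`; every call pays the full CPU cost."""
    digest = value.encode()
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


@functools.lru_cache(maxsize=128)
def hash_cached(value, rounds=10_000):
    """Same work, but repeated inputs are served from memory."""
    return hash_uncached(value, rounds)


for i in range(1, 101):
    hash_cached(f"test_{i % 5}")

print(hash_cached.cache_info())
```

Only the 5 cache misses do any hashing; the other 95 calls return immediately, so the hashing frames should shrink sharply in the second time range.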

## Exercise 4: Spot I/O-Bound Code

1. Generate load to the slow_io endpoint:
   ```bash
   for i in $(seq 50); do curl -s localhost:5000/api/slow_io; done
   ```
2. Look at the flamegraph
3. Notice that `time.sleep` barely shows up. Why?
   - CPU profiling only samples code that is executing on the CPU
   - I/O wait (sleeping, network, disk) doesn't consume CPU
   - This is why I/O-bound code looks "fast" in CPU profiles even when requests are slow
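
The gap between wall-clock time and CPU time can be measured directly with the standard library. In this minimal sketch, `io_bound` uses `time.sleep` as a stand-in for a slow network or disk call; all function names are hypothetical.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def io_bound():
    time.sleep(0.2)  # stands in for a slow network or disk call


def cpu_bound():
    sum(i * i for i in range(300_000))


for name, func in [("io_bound", io_bound), ("cpu_bound", cpu_bound)]:
    wall, cpu = measure(func)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

For `io_bound` the CPU time should be close to zero even though the call takes about 0.2 s, which is what a CPU profile reflects.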

## Exercise 5: Timeline Analysis

1. Let `loadgen.sh` run for several minutes
2. In Pyroscope, zoom out the timeline
3. Look for patterns:
   - Spikes in CPU usage
   - Changes in the flamegraph shape over time
4. Select different time ranges to compare

## Key Pyroscope Concepts

### Flamegraph Reading
- **Width** = proportion of total samples (time)
- **Height** = call stack depth
- **Color** = usually arbitrary (for differentiation)
- **Plateaus** = wide frames with few or no child frames; these functions spend the time themselves (high self time) and are the "hot" ones
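
How width and self time fall out of raw stack samples can be sketched as follows. The samples are made up, and `handle_request` and `render_response` are hypothetical function names; only the counting logic matters.

```python
from collections import Counter

# Hypothetical stack samples, root first, as a sampling profiler might record them.
samples = [
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "compute_primes_slow"),
    ("handle_request", "render_response"),
    ("handle_request",),
]


def summarize(stacks):
    """Count total samples (frame anywhere on the stack) and self samples (frame on top)."""
    total, self_time = Counter(), Counter()
    for stack in stacks:
        for name in set(stack):  # count a recursive function once per sample
            total[name] += 1
        self_time[stack[-1]] += 1
    return total, self_time


total, self_time = summarize(samples)
for name in sorted(total, key=total.get, reverse=True):
    width = 100 * total[name] / len(samples)
    print(f"{name:22s} width={width:5.1f}%  self={self_time[name]}")
```

Here `handle_request` spans the full width because it is on every stack, while `compute_primes_slow` has the most self samples, making it the plateau.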

### Comparing Profiles
Pyroscope can show:
- **Diff view**: Red = more time than in the baseline range, Green = less time
- Useful for before/after comparisons
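
The arithmetic behind a diff can be sketched as a comparison of each function's share of samples in two time ranges. This is a simplified model rather than Pyroscope's exact algorithm, and the function names and sample counts are invented.

```python
def profile_diff(baseline, comparison):
    """Return each function's change in share of samples, in percentage points."""
    base_total = sum(baseline.values())
    comp_total = sum(comparison.values())
    diff = {}
    for name in sorted(set(baseline) | set(comparison)):
        base_share = 100 * baseline.get(name, 0) / base_total
        comp_share = 100 * comparison.get(name, 0) / comp_total
        diff[name] = comp_share - base_share
    return diff


# Hypothetical self-sample counts for two time ranges.
uncached = {"compute_hash": 80, "json_encode": 20}
cached = {"compute_hash": 10, "cache_lookup": 15, "json_encode": 25}

for name, delta in profile_diff(uncached, cached).items():
    colour = "red" if delta > 0 else "green" if delta < 0 else "grey"
    print(f"{name:14s} {delta:+6.1f} pp  ({colour})")
```

Because the sketch compares shares, `json_encode` turns red even though its sample count barely changed: it now accounts for a larger fraction of a much smaller total.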

### Tags
The app uses tags for filtering:
```python
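pyroscope.configure(
    # Excerpt: the application name and server address arguments are omitted here
    tags={"env": "workshop", "version": "1.0.0"},  # labels attached to every profile
)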
```

You can filter by tags in the UI.

## Production Considerations

### Overhead
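- Pyroscope Python agent: typically ~2-5% CPU overhead; measure it for your own workload
- Sampling rate can be tuned with the `sample_rate` argument to `pyroscope.configure()` (default: 100 Hz)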
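
To get a feel for what a 100 Hz rate means, the arithmetic below works out the time between samples and the number of sampling ticks during the two-minute load run from the Setup section. The variable names are illustrative only.

```python
sample_rate_hz = 100   # default rate stated above
window_seconds = 120   # the two-minute load run from Setup step 4

ms_per_sample = 1000 / sample_rate_hz
max_ticks = sample_rate_hz * window_seconds

print(f"one sample roughly every {ms_per_sample:.0f} ms")
print(f"at most {max_ticks} sampling ticks over {window_seconds} s")
```

A single call that finishes in well under 10 ms can fall between samples, but if it is called often enough it still shows up in the aggregate. Halving the rate halves the number of samples and therefore the sampling work.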

### Data Volume
- Profiles are aggregated, not stored raw
- Storage is compact (on the order of 10-100 MB per day per app, depending on load and the number of tags)

### Security
- Profile data can reveal code structure
- Consider who has access to Pyroscope

### Alternatives
- **Datadog Continuous Profiler**
- **AWS CodeGuru Profiler**
- **Google Cloud Profiler**
- **Parca** (open source, eBPF-based)

## Troubleshooting

### "No data in Pyroscope"
- Check if Pyroscope server is running: http://localhost:4040
- Check app logs for connection errors
- Verify `pyroscope-io` is installed (`pip show pyroscope-io`)
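
The first check can also be scripted. This minimal sketch only tests whether something accepts TCP connections on the port; it does not confirm that the listener is Pyroscope. `server_reachable` is a hypothetical helper.

```python
import socket


def server_reachable(host="localhost", port=4040, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Pyroscope reachable:", server_reachable())
```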

### "Profile looks empty"
- Generate more load
- The endpoint might be I/O bound (not CPU)
- Check the time range in the UI

### High overhead
- Reduce the sampling rate via `sample_rate` in `pyroscope.configure()`
- Check the app logs for profiling-related exceptions

## Discussion Questions

1. **When would you use continuous profiling vs one-shot?**
   - Continuous: production, long-running apps, intermittent issues
   - One-shot: development, benchmarking, specific scenarios

2. **What can't CPU profiling show you?**
   - I/O wait time
   - Lock contention (mostly; threads blocked on a lock are off-CPU)
   - Memory allocation patterns

3. **How would you profile a batch job vs a web server?**
   - Batch: one-shot profiling of the entire run
   - Server: continuous, focus on request handling paths

## Key Takeaways

1. **Continuous profiling catches issues that one-shot misses**
2. **Low overhead makes it safe for production**
3. **Timeline view reveals patterns over time**
4. **CPU profiling doesn't show I/O time**