160 lines
3.7 KiB
Markdown
160 lines
3.7 KiB
Markdown
# Scenario 3: The Syscall Storm
|
|
|
|
## Learning Objectives
|
|
- Understand the cost of syscalls (user-space to kernel-space transitions)
|
|
- Use `strace -c` to count and profile syscalls
|
|
- Learn why buffering matters for I/O performance
|
|
- Understand `time` output: `real`, `user`, `sys`
|
|
|
|
## Files
|
|
- `read_slow.c` - Reads file byte-by-byte with raw `read()` syscalls
|
|
- `read_fast.c` - Reads file in 64KB chunks
|
|
- `read_stdio.c` - Uses stdio's `fgetc()` (internally buffered)
|
|
- `read_python.py` - Python equivalents
|
|
|
|
## Setup
|
|
|
|
```bash
|
|
# Compile all C programs and create test file
|
|
make all
|
|
```
|
|
|
|
## Exercise 1: Observe the Problem
|
|
|
|
### Step 1: Time the slow version
|
|
```bash
|
|
time ./read_slow testfile
|
|
```
|
|
|
|
Notice that `sys` time dominates - the CPU is mostly waiting for syscalls!
|
|
|
|
### Step 2: Count the syscalls
|
|
```bash
|
|
strace -c ./read_slow testfile
|
|
```
|
|
|
|
Look at:
|
|
- `calls` column for `read`: Should be ~1,000,000 (one per byte!)
|
|
- `% time` column: Most time in `read`
|
|
|
|
### Step 3: Compare with fast version
|
|
```bash
|
|
time ./read_fast testfile
|
|
strace -c ./read_fast testfile
|
|
```
|
|
|
|
The `read` call count drops from ~1,000,000 to ~16 (1MB / 64KB).
|
|
|
|
## Exercise 2: Understanding the Time Output
|
|
|
|
```bash
|
|
time ./read_slow testfile
|
|
```
|
|
|
|
Output explanation:
|
|
- `real` - Wall clock time (what you'd measure with a stopwatch)
|
|
- `user` - CPU time in user space (your code running)
|
|
- `sys` - CPU time in kernel space (syscalls, I/O)
|
|
|
|
For `read_slow`: `sys` >> `user` because we spend most time in the kernel.
|
|
For `read_fast`: `user` > `sys` because we spend more time processing data.
|
|
|
|
## Exercise 3: strace Deep Dive
|
|
|
|
### See individual syscalls (first 50)
|
|
```bash
|
|
strace -T ./read_slow testfile 2>&1 | head -50
|
|
```
|
|
|
|
The `-T` flag shows time spent in each syscall.
|
|
|
|
### Filter to just read() calls
|
|
```bash
|
|
strace -e read -c ./read_slow testfile
|
|
```
|
|
|
|
## Exercise 4: stdio Comparison
|
|
|
|
```bash
|
|
time ./read_stdio testfile
|
|
strace -c ./read_stdio testfile
|
|
```
|
|
|
|
Questions:
|
|
- How many `read` syscalls does stdio make?
|
|
- Why is it still slower than `read_fast`?
|
|
- Hint: Check the default stdio buffer size (usually 4KB or 8KB)
|
|
|
|
## Exercise 5: Python I/O
|
|
|
|
```bash
|
|
# Create smaller file for unbuffered test
|
|
make smallfile
|
|
|
|
# Run Python comparison
|
|
python3 read_python.py smallfile
|
|
|
|
# Profile with strace
|
|
strace -c python3 read_python.py smallfile
|
|
```
|
|
|
|
## Key Insights
|
|
|
|
### Why are syscalls expensive?
|
|
1. **Context switch**: CPU saves user state, loads kernel state
|
|
2. **Security checks**: Kernel validates permissions
|
|
3. **Memory barriers**: Caches may need flushing
|
|
4. **Scheduling**: Kernel may switch to another process
|
|
|
|
### The buffering solution
|
|
Instead of:
|
|
```
|
|
read(1 byte) → kernel → read(1 byte) → kernel → ... (million times)
|
|
```
|
|
|
|
We do:
|
|
```
|
|
read(64KB) → kernel → process 64KB in user space → read(64KB) → ...
|
|
```
|
|
|
|
### Rule of thumb
|
|
- Syscall overhead: ~100-1000 nanoseconds
|
|
- Reading 64KB: ~10-100 microseconds (from cache/RAM)
|
|
- Break-even: buffer should be at least a few KB
|
|
|
|
## perf stat Comparison
|
|
|
|
```bash
|
|
perf stat ./read_slow testfile
|
|
perf stat ./read_fast testfile
|
|
```
|
|
|
|
Look at:
|
|
- `context-switches`
|
|
- `cpu-migrations`
|
|
- `instructions per cycle`
|
|
|
|
## Exercise 6: Benchmarking with hyperfine
|
|
|
|
Use `hyperfine` for proper statistical benchmarking with warmup runs:
|
|
|
|
```bash
|
|
hyperfine --warmup 1 './read_slow testfile' './read_fast testfile'
|
|
```
|
|
|
|
This gives you:
|
|
- Mean execution time with standard deviation
|
|
- Min/max times
|
|
- Direct comparison between implementations
|
|
|
|
For all three implementations:
|
|
```bash
|
|
hyperfine --warmup 1 './read_slow testfile' './read_stdio testfile' './read_fast testfile'
|
|
```
|
|
|
|
## Further Exploration
|
|
|
|
1. What happens with `read(fd, buf, 4096)` vs `read(fd, buf, 65536)`?
|
|
2. How does `mmap()` compare? (Memory-mapped I/O)
|
|
3. What about `O_DIRECT` flag? (Bypass page cache)
|