# Scenario 3: The Syscall Storm
## Learning Objectives

- Understand the cost of syscalls (user-space to kernel-space transitions)
- Use `strace -c` to count and profile syscalls
- Learn why buffering matters for I/O performance
- Understand `time` output: `real`, `user`, `sys`
## Files

- `read_slow.c` - Reads the file byte-by-byte with raw `read()` syscalls
- `read_fast.c` - Reads the file in 64KB chunks
- `read_stdio.c` - Uses stdio's `fgetc()` (internally buffered)
- `read_python.py` - Python equivalents
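The shipped sources are in the scenario directory; as orientation, here is a minimal sketch of the byte-by-byte pattern `read_slow.c` presumably uses (the helper name and structure are assumptions, not the actual source):

```c
/* Byte-by-byte reading: one read(2) syscall per byte.
   A 1MB file triggers roughly a million user/kernel transitions. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns total bytes read, or -1 if the file can't be opened.
   A program like read_slow.c would call this from main() on argv[1]. */
long read_byte_by_byte(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    long total = 0;
    char byte;
    /* Every loop iteration is a full trip into the kernel and back. */
    while (read(fd, &byte, 1) == 1)
        total++;

    close(fd);
    return total;
}
```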
## Setup

```bash
# Compile all C programs and create test file
make all
```
## Exercise 1: Observe the Problem

### Step 1: Time the slow version

```bash
time ./read_slow testfile
```

Notice that `sys` time dominates - the CPU spends most of its time in the kernel servicing syscalls!
### Step 2: Count the syscalls

```bash
strace -c ./read_slow testfile
```

Look at:
- The `calls` column for `read`: should be ~1,000,000 (one per byte!)
- The `% time` column: most time is spent in `read`
### Step 3: Compare with the fast version

```bash
time ./read_fast testfile
strace -c ./read_fast testfile
```

The `read` call count drops from ~1,000,000 to ~16 (1MB / 64KB).
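The chunked pattern behind `read_fast.c` can be sketched as follows (a minimal sketch; the helper name and exact structure are assumptions, not the shipped source):

```c
/* Chunked reading: one read(2) syscall per 64KB chunk.
   A 1MB file needs ~16 data-carrying reads, plus one
   zero-byte read that signals EOF. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (64 * 1024)

/* Returns total bytes read, or -1 on error. */
long read_in_chunks(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (!buf) {
        close(fd);
        return -1;
    }

    long total = 0;
    ssize_t n;
    /* The kernel is entered once per chunk, not once per byte. */
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
        total += n;

    free(buf);
    close(fd);
    return total;
}
```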
## Exercise 2: Understanding the Time Output

```bash
time ./read_slow testfile
```

Output explanation:
- `real` - Wall-clock time (what you'd measure with a stopwatch)
- `user` - CPU time in user space (your code running)
- `sys` - CPU time in kernel space (syscalls, I/O)

For `read_slow`: `sys` >> `user` because we spend most of the time in the kernel.
For `read_fast`: `user` > `sys` because we spend more time processing data.
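The same user/sys split that `time` prints is also available from inside a process via `getrusage(2)`. This small helper (not one of the scenario's files, just an illustration) returns the accumulated kernel-mode CPU time:

```c
/* Query this process's accumulated kernel-mode (sys) CPU time.
   getrusage(2) reports the same counters that `time` summarizes. */
#include <sys/resource.h>

/* Returns sys CPU time in microseconds, or -1 on error. */
long sys_cpu_usec(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_stime.tv_sec * 1000000L + ru.ru_stime.tv_usec;
}
```

Calling this before and after a read loop lets a program measure its own syscall cost without an external profiler.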
## Exercise 3: strace Deep Dive

### See individual syscalls (first 50)

```bash
strace -T ./read_slow testfile 2>&1 | head -50
```

The `-T` flag shows the time spent in each syscall.

### Filter to just read() calls

```bash
strace -e read -c ./read_slow testfile
```
## Exercise 4: stdio Comparison

```bash
time ./read_stdio testfile
strace -c ./read_stdio testfile
```

Questions:
- How many `read` syscalls does stdio make?
- Why is it still slower than `read_fast`?
- Hint: check the default stdio buffer size (usually 4KB or 8KB)
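For reference, a sketch of the pattern `read_stdio.c` presumably follows (the helper name is an assumption): `fgetc()` is called once per byte, but stdio refills an internal buffer with one `read()` per buffer-full, so the syscall count stays low while the per-byte library-call overhead remains.

```c
/* stdio-based reading: fgetc() per byte.
   Most fgetc() calls just copy a byte out of stdio's user-space
   buffer; read(2) only happens when that buffer runs dry. */
#include <stdio.h>

/* Returns total bytes read, or -1 if the file can't be opened. */
long read_with_stdio(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    long total = 0;
    while (fgetc(f) != EOF)
        total++;

    fclose(f);
    return total;
}
```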
## Exercise 5: Python I/O

```bash
# Create smaller file for unbuffered test
make smallfile

# Run Python comparison
python3 read_python.py smallfile

# Profile with strace
strace -c python3 read_python.py smallfile
```
## Key Insights

### Why are syscalls expensive?

- Context switch: the CPU saves user state and loads kernel state
- Security checks: the kernel validates permissions
- Memory barriers: caches may need flushing
- Scheduling: the kernel may switch to another process
### The buffering solution

Instead of:

```
read(1 byte) → kernel → read(1 byte) → kernel → ... (a million times)
```

we do:

```
read(64KB) → kernel → process 64KB in user space → read(64KB) → ...
```
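This buffering pattern is exactly what stdio implements internally. A minimal user-space buffer (illustrative, not from the scenario's files) makes the mechanism explicit:

```c
/* A minimal user-space read buffer, sketching what stdio does
   internally: one read(2) refills the buffer, then single-byte
   "reads" are served from memory with no syscall at all. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BR_BUF_SIZE (64 * 1024)

struct buffered_reader {
    int fd;
    char buf[BR_BUF_SIZE];
    ssize_t len;  /* bytes currently in buf */
    ssize_t pos;  /* index of the next unread byte */
};

/* Returns the next byte (0-255), or -1 on EOF/error. */
int br_getc(struct buffered_reader *br) {
    if (br->pos >= br->len) {
        br->len = read(br->fd, br->buf, BR_BUF_SIZE);  /* the only syscall */
        br->pos = 0;
        if (br->len <= 0)
            return -1;
    }
    return (unsigned char)br->buf[br->pos++];
}
```

With a 64KB buffer, only one call in every 65,536 crosses into the kernel; the rest are plain memory reads.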
### Rule of thumb

- Syscall overhead: ~100-1000 nanoseconds
- Reading 64KB: ~10-100 microseconds (from cache/RAM)
- Break-even: the buffer should be at least a few KB
## perf stat Comparison

```bash
perf stat ./read_slow testfile
perf stat ./read_fast testfile
```

Look at:
- `context-switches`
- `cpu-migrations`
- `instructions` per cycle
## Further Exploration

- What happens with `read(fd, buf, 4096)` vs `read(fd, buf, 65536)`?
- How does `mmap()` compare? (memory-mapped I/O)
- What about the `O_DIRECT` flag? (bypasses the page cache)
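As a starting point for the `mmap()` question, here is a hedged sketch (the function name is illustrative): the file is mapped once, after which "reading" is plain memory access, with the kernel paging data in on demand rather than per `read()` call.

```c
/* Memory-mapped reading: mmap(2) replaces the read loop entirely.
   After one mmap() call, file bytes are accessed like an array;
   the kernel fills pages on demand via page faults. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sums all bytes of the file through a mapping; returns -1 on error. */
long sum_bytes_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {  /* mmap rejects zero-length mappings */
        close(fd);
        return 0;
    }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping outlives the descriptor */
    if (data == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char)data[i];

    munmap(data, st.st_size);
    return sum;
}
```

Comparing this under `strace -c` against `read_fast` shows the trade: fewer syscalls, but page-fault overhead instead.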