# Scenario 3: The Syscall Storm
## Learning Objectives

- Understand the cost of syscalls (user-space to kernel-space transitions)
- Use `strace -c` to count and profile syscalls
- Learn why buffering matters for I/O performance
- Understand `time` output: `real`, `user`, `sys`
## Files

- `read_slow.c` - Reads the file byte-by-byte with raw `read()` syscalls
- `read_fast.c` - Reads the file in 64KB chunks
- `read_stdio.c` - Uses stdio's `fgetc()` (internally buffered)
- `read_python.py` - Python equivalents
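The shipped sources are in the scenario directory; as orientation, here is a minimal sketch of the byte-by-byte pattern `read_slow.c` presumably uses (the helper name and structure are assumptions, not the actual source):

```c
/* Byte-by-byte reading: one read(2) syscall per byte.
   A 1MB file triggers roughly a million user/kernel transitions. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns total bytes read, or -1 if the file can't be opened.
   A program like read_slow.c would call this from main() on argv[1]. */
long read_byte_by_byte(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    long total = 0;
    char byte;
    /* Every loop iteration is a full trip into the kernel and back. */
    while (read(fd, &byte, 1) == 1)
        total++;

    close(fd);
    return total;
}
```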
## Setup

```bash
# Compile all C programs and create test file
make all
```
## Exercise 1: Observe the Problem

### Step 1: Time the slow version

```bash
time ./read_slow testfile
```

Notice that `sys` time dominates - the CPU spends most of its time in the kernel servicing syscalls!
### Step 2: Count the syscalls

```bash
strace -c ./read_slow testfile
```

Look at:
- The `calls` column for `read`: should be ~1,000,000 (one per byte!)
- The `% time` column: most time is spent in `read`
### Step 3: Compare with the fast version

```bash
time ./read_fast testfile
strace -c ./read_fast testfile
```

The `read` call count drops from ~1,000,000 to ~16 (1MB / 64KB).
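The chunked pattern behind `read_fast.c` can be sketched as follows (a minimal sketch; the helper name and exact structure are assumptions, not the shipped source):

```c
/* Chunked reading: one read(2) syscall per 64KB chunk.
   A 1MB file needs ~16 data-carrying reads, plus one
   zero-byte read that signals EOF. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (64 * 1024)

/* Returns total bytes read, or -1 on error. */
long read_in_chunks(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (!buf) {
        close(fd);
        return -1;
    }

    long total = 0;
    ssize_t n;
    /* The kernel is entered once per chunk, not once per byte. */
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
        total += n;

    free(buf);
    close(fd);
    return total;
}
```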
## Exercise 2: Understanding the Time Output

```bash
time ./read_slow testfile
```

Output explanation:
- `real` - Wall-clock time (what you'd measure with a stopwatch)
- `user` - CPU time in user space (your code running)
- `sys` - CPU time in kernel space (syscalls, I/O)

For `read_slow`: `sys` >> `user` because we spend most of the time in the kernel.
For `read_fast`: `user` > `sys` because we spend more time processing data.
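The same user/sys split that `time` prints is also available from inside a process via `getrusage(2)`. This small helper (not one of the scenario's files, just an illustration) returns the accumulated kernel-mode CPU time:

```c
/* Query this process's accumulated kernel-mode (sys) CPU time.
   getrusage(2) reports the same counters that `time` summarizes. */
#include <sys/resource.h>

/* Returns sys CPU time in microseconds, or -1 on error. */
long sys_cpu_usec(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_stime.tv_sec * 1000000L + ru.ru_stime.tv_usec;
}
```

Calling this before and after a read loop lets a program measure its own syscall cost without an external profiler.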
## Exercise 3: strace Deep Dive

### See individual syscalls (first 50)

```bash
strace -T ./read_slow testfile 2>&1 | head -50
```

The `-T` flag shows the time spent in each syscall.

### Filter to just read() calls

```bash
strace -e read -c ./read_slow testfile
```
## Exercise 4: stdio Comparison

```bash
time ./read_stdio testfile
strace -c ./read_stdio testfile
```

Questions:
- How many `read` syscalls does stdio make?
- Why is it still slower than `read_fast`?
- Hint: check the default stdio buffer size (usually 4KB or 8KB)
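For reference, a sketch of the pattern `read_stdio.c` presumably follows (the helper name is an assumption): `fgetc()` is called once per byte, but stdio refills an internal buffer with one `read()` per buffer-full, so the syscall count stays low while the per-byte library-call overhead remains.

```c
/* stdio-based reading: fgetc() per byte.
   Most fgetc() calls just copy a byte out of stdio's user-space
   buffer; read(2) only happens when that buffer runs dry. */
#include <stdio.h>

/* Returns total bytes read, or -1 if the file can't be opened. */
long read_with_stdio(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    long total = 0;
    while (fgetc(f) != EOF)
        total++;

    fclose(f);
    return total;
}
```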
## Exercise 5: Python I/O

```bash
# Create smaller file for unbuffered test
make smallfile

# Run Python comparison
python3 read_python.py smallfile

# Profile with strace
strace -c python3 read_python.py smallfile
```
## Key Insights

### Why are syscalls expensive?

- Context switch: the CPU saves user state and loads kernel state
- Security checks: the kernel validates permissions
- Memory barriers: caches may need flushing
- Scheduling: the kernel may switch to another process
### The buffering solution

Instead of:

```
read(1 byte) → kernel → read(1 byte) → kernel → ... (a million times)
```

we do:

```
read(64KB) → kernel → process 64KB in user space → read(64KB) → ...
```
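This buffering pattern is exactly what stdio implements internally. A minimal user-space buffer (illustrative, not from the scenario's files) makes the mechanism explicit:

```c
/* A minimal user-space read buffer, sketching what stdio does
   internally: one read(2) refills the buffer, then single-byte
   "reads" are served from memory with no syscall at all. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BR_BUF_SIZE (64 * 1024)

struct buffered_reader {
    int fd;
    char buf[BR_BUF_SIZE];
    ssize_t len;  /* bytes currently in buf */
    ssize_t pos;  /* index of the next unread byte */
};

/* Returns the next byte (0-255), or -1 on EOF/error. */
int br_getc(struct buffered_reader *br) {
    if (br->pos >= br->len) {
        br->len = read(br->fd, br->buf, BR_BUF_SIZE);  /* the only syscall */
        br->pos = 0;
        if (br->len <= 0)
            return -1;
    }
    return (unsigned char)br->buf[br->pos++];
}
```

With a 64KB buffer, only one call in every 65,536 crosses into the kernel; the rest are plain memory reads.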
### Rule of thumb

- Syscall overhead: ~100-1000 nanoseconds
- Reading 64KB: ~10-100 microseconds (from cache/RAM)
- Break-even: the buffer should be at least a few KB
## perf stat Comparison

```bash
perf stat ./read_slow testfile
perf stat ./read_fast testfile
```

Look at:
- `context-switches`
- `cpu-migrations`
- `instructions` per cycle
## Further Exploration

- What happens with `read(fd, buf, 4096)` vs `read(fd, buf, 65536)`?
- How does `mmap()` compare? (memory-mapped I/O)
- What about the `O_DIRECT` flag? (bypasses the page cache)
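As a starting point for the `mmap()` question, here is a hedged sketch (the function name is illustrative): the file is mapped once, after which "reading" is plain memory access, with the kernel paging data in on demand rather than per `read()` call.

```c
/* Memory-mapped reading: mmap(2) replaces the read loop entirely.
   After one mmap() call, file bytes are accessed like an array;
   the kernel fills pages on demand via page faults. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sums all bytes of the file through a mapping; returns -1 on error. */
long sum_bytes_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {  /* mmap rejects zero-length mappings */
        close(fd);
        return 0;
    }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping outlives the descriptor */
    if (data == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char)data[i];

    munmap(data, st.st_size);
    return sum;
}
```

Comparing this under `strace -c` against `read_fast` shows the trade: fewer syscalls, but page-fault overhead instead.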