j & s - performance notes
Some of my jotted-down notes from Jeff and Sanjay's blog
“In established engineering disciplines a 12% improvement… is never considered marginal.”
“When preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.”
Why “optimize later” can be wrong
Many engineers say:
write simple code first → profile later → optimize hotspots
But this approach often fails in real systems.
Problems with ignoring performance early
- Flat performance profile problem
If inefficiencies are everywhere, profiling won’t show obvious hotspots. Performance is lost in many small places. note: death by thousand jabs. nothing obvious to fix.
- Library users suffer instead of authors
If you publish a slow library:
- downstream users must debug your internals
- they cannot easily change your code
- they must negotiate performance fixes across teams
note: lol usual, everyone wants it, nobody owns or maintains it.
- Hard to change systems in heavy production
Once systems are widely deployed:
- architecture changes become risky
- migration costs explode
- performance fixes become political
What is Over-Replication?
Running multiple unnecessary copies of a service or dataset to compensate for slowness.
Example:
- inefficient DB queries → add read replicas
- slow service → deploy 10 instances instead of fixing code note: throwing servers at problem instead of fixing algorithm. seen often
What is Overprovisioning?
Allocating far more CPU / memory / machines than actually needed.
Example:
- service needs 2 cores but uses 16 because the code is inefficient
- DB needs 64GB RAM just to hide bad queries
Example: choosing better data structures
If vectors are usually small:
Use:
absl::InlinedVector
instead of:
std::vector
Why?
- avoids heap allocations for small sizes
- improves cache locality
- faster access note: probably helps more in hot paths for high-throughput apps, e.g. algo trading systems
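To make the small-buffer idea concrete: `SmallVec` below is a hypothetical, hand-rolled sketch of what `absl::InlinedVector` does, not its real implementation — the first N elements live inline (no heap allocation), and it only spills to the heap once the size exceeds N.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the small-buffer idea behind absl::InlinedVector:
// the first N elements are stored inline (no heap allocation); only when
// size exceeds N do we spill into a heap-backed std::vector.
template <typename T, std::size_t N>
class SmallVec {
 public:
  void push_back(const T& v) {
    if (!spilled_ && size_ == N) {   // out of inline space: spill once
      heap_.assign(inline_, inline_ + N);
      spilled_ = true;
    }
    if (spilled_) heap_.push_back(v);
    else inline_[size_] = v;
    ++size_;
  }
  const T& operator[](std::size_t i) const {
    return spilled_ ? heap_[i] : inline_[i];
  }
  std::size_t size() const { return size_; }
  bool on_heap() const { return spilled_; }  // did we ever allocate?

 private:
  T inline_[N]{};        // inline storage, no allocation needed
  std::vector<T> heap_;  // used only after spilling
  std::size_t size_ = 0;
  bool spilled_ = false;
};
```

As long as sizes stay ≤ N, every element access hits the inline buffer — no allocation, better cache locality.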
Back of the Envelope Performance Calculations
Goal:
Estimate performance before implementing anything. note: important section for me to re-read and DRILL it in. i always prefer deploying, profiling, measuring latency - but this is important.
Method
Estimate number of expensive operations:
- disk seeks
- network round trips
- bytes transferred
- memory accesses
Multiply by approximate cost
Sum results
If concurrency exists → costs may overlap.
note: similar to fermi estimation but for cs , v interesting
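The method is just arithmetic, so it can be sketched directly. The cost constants are the rough latencies from the table in these notes; the one-seek / two-roundtrip workload is a made-up example, not from the source.

```cpp
// Back-of-envelope: count expensive operations, multiply each count by
// an assumed per-op cost, sum. Costs in nanoseconds, from the rough
// table of operation latencies.
constexpr double kDiskSeekNs    = 5'000'000.0;
constexpr double kDcRoundTripNs = 50'000.0;
constexpr double kMemRefNs      = 50.0;

// Hypothetical request: some disk seeks, some datacenter round trips,
// some main-memory references.
constexpr double EstimateRequestNs(double seeks, double round_trips,
                                   double mem_refs) {
  return seeks * kDiskSeekNs + round_trips * kDcRoundTripNs +
         mem_refs * kMemRefNs;
}
```

For example, `EstimateRequestNs(1, 2, 1000)` gives 5,150,000 ns ≈ 5.15 ms — the single disk seek dominates everything else, which is exactly the kind of insight this method is for.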
Approximate operation costs
| Operation | Cost |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 3 ns |
| Branch mispredict | 5 ns |
| Mutex lock/unlock uncontended | 15 ns |
| Main memory reference | 50 ns |
| Compress 1KB Snappy | 1,000 ns |
| Read 4KB SSD | 20,000 ns |
| Datacenter roundtrip | 50,000 ns |
| Read 1MB memory | 64,000 ns |
| Read 1MB over 100Gb network | 100,000 ns |
| Read 1MB SSD | 1,000,000 ns |
| Disk seek | 5,000,000 ns |
| Read 1MB disk | 10,000,000 ns |
| CA ↔ Netherlands packet | 150,000,000 ns |
note: disk seek basically geological time compared to cpu.
What is a “point read” in SQL?
A point read means:
A query that retrieves one specific row using an indexed lookup.
Example:
SELECT * FROM users WHERE id = 123;
Not:
SELECT * FROM users WHERE age > 18;
Why?
- point read uses index → O(log N)
- range scan may read many rows
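A rough C++ analogy (not how a database engine is actually implemented): an index behaves like a sorted array you can binary-search, so a point read jumps to one row in O(log N), while a predicate like `age > 18` has to walk many entries.

```cpp
#include <algorithm>
#include <vector>

// Analogy only: a sorted vector of ids stands in for a B-tree index.
// PointRead does an O(log N) indexed lookup; a range predicate would
// have to scan many entries instead.
const int* PointRead(const std::vector<int>& sorted_ids, int key) {
  auto it = std::lower_bound(sorted_ids.begin(), sorted_ids.end(), key);
  return (it != sorted_ids.end() && *it == key) ? &*it : nullptr;
}
```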
Example #1 — Quicksort a Billion 4-Byte Numbers
Given
- 1 billion numbers
- each 4 bytes
- total = 4GB array note: already bigger than most cpu caches obv
Why log(N) passes?
Quicksort average depth:
O(log₂ N)
Since:
N = 2³⁰ ≈ 1 billion
log₂(N) = 30
So about 30 passes
Your question:
log(billion) is six?
That’s:
log₁₀(1,000,000,000) = 9
But algorithms use:
log₂(N)
Why base 2?
Because:
- data structures split in half
- binary comparisons
- binary trees
- binary recursion depth
note: whenever you see log in algorithms assume base 2 unless stated otherwise
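Quick sanity check of the base-10 vs base-2 confusion above:

```cpp
#include <cmath>

// log10(1e9) = 9, but recursion depth uses base 2:
// log2(1e9) ~= 29.9, which is where "about 30 passes" comes from.
double Log10Billion() { return std::log10(1e9); }  // 9
double Log2Billion()  { return std::log2(1e9); }   // ~29.9
```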
Cost #1 — Memory bandwidth
Array = 4GB
Bandwidth = 16GB/s
Time per full scan:
4 / 16 = 0.25 s
30 passes →
30 × 0.25 = 7.5 seconds
Cost #2 — Branch mispredictions
Each pass compares every element.
Total comparisons:
N × log₂N
= 1e9 × 30
= 30 billion comparisons
Assume:
- 50% mispredicted
- mispredict cost = 5 ns
Total:
15 billion × 5ns = 75 seconds
Your question:
why count only mispredictions?
Because:
Correct branch prediction is nearly free.
Why?
Modern CPUs:
- speculative execution already running predicted path
- pipeline continues smoothly
Cost mostly comes from:
- flushing pipeline
- refetching instructions
- restarting execution
That pipeline flush = ~5ns penalty.
So:
correct prediction ≈ negligible
misprediction = expensive
note: branch predictor being wrong is what kills you.
Total estimated runtime
Memory cost = 7.5s
Branch cost = 75s
Total ≈ 82.5s
Branching dominates.
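The two costs above, as explicit arithmetic (all constants are the assumed values from the estimate: 30 passes, 4 GB array, 16 GB/s bandwidth, 50% mispredict rate, 5 ns per mispredict):

```cpp
// Memory cost: 30 passes, each streaming a 4 GB array at 16 GB/s.
constexpr double MemorySeconds() { return 30 * 4.0 / 16.0; }  // 7.5 s

// Branch cost: N * log2(N) = 30e9 comparisons, half mispredicted,
// 5 ns each; divide by 1e9 to convert ns to seconds.
constexpr double BranchSeconds() { return 30e9 * 0.5 * 5.0 / 1e9; }  // 75 s

constexpr double TotalSeconds() { return MemorySeconds() + BranchSeconds(); }
```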
Cache-aware refinement
Assume:
- L3 cache = 32MB
It holds:
2^23 numbers
So:
- only first ~10 passes require full memory streaming
- remaining passes operate from cache
Memory transfers: note: once data fits in cache everything suddenly fast. memory is enemy not cpu.
10 × 4GB / 16GB/s = 2.5s
Instead of 7.5s.
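The same refinement in numbers, assuming (as above) that only the first 10 of the 30 passes stream the full 4 GB from memory:

```cpp
// With a 32 MB L3, partitions fit in cache after ~10 halvings, so only
// the first 10 passes pay full memory bandwidth (4 GB at 16 GB/s each).
constexpr double CachedMemorySeconds() { return 10 * 4.0 / 16.0; }  // 2.5 s

// New total: 2.5 s memory + 75 s branches = 77.5 s.
// Branching still dominates.
constexpr double RefinedTotalSeconds() { return CachedMemorySeconds() + 75.0; }
```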
mental model
- CPU arithmetic → v fast
- memory → slower
- network → very slow
- disk → prehistoric slow