j & s - performance notes
Some of my jotted-down notes from Jeff and Sanjay's blog
“In established engineering disciplines a 12% improvement… is never considered marginal.”
“When preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.”
Why “optimize later” can be wrong
Many engineers say:
write simple code first → profile later → optimize hotspots
But this approach often fails in real systems.
Problems with ignoring performance early
- Flat performance profile problem
If inefficiencies are everywhere, profiling won’t show obvious hotspots. Performance is lost in many small places. note: death by thousand jabs. nothing obvious to fix.
- Library users suffer instead of authors
If you publish a slow library:
- downstream users must debug your internals
- they cannot easily change your code
- they must negotiate performance fixes across teams
note: lol usual, everyone wants it, nobody owns or maintains it.
- Hard to change systems in heavy production
Once systems are widely deployed:
- architecture changes become risky
- migration costs explode
- performance fixes become political
What is Over-Replication?
Running multiple unnecessary copies of a service or dataset to compensate for slowness.
Example:
- inefficient DB queries → add read replicas
- slow service → deploy 10 instances instead of fixing code note: throwing servers at problem instead of fixing algorithm. seen often
What is Overprovisioning?
Allocating far more CPU / memory / machines than actually needed.
Example:
- service needs 2 cores but uses 16 because the code is inefficient
- DB needs 64GB RAM just to hide bad queries
Example: choosing better data structures
If vectors are usually small:
Use:
absl::InlinedVector
instead of:
std::vector
Why?
- avoids heap allocations for small sizes
- improves cache locality
- faster access note: probably helps more in hot paths for high-throughput apps, e.g. algo trading systems
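To make the small-buffer idea concrete: `SmallVec` below is a hypothetical, hand-rolled sketch of what `absl::InlinedVector` does, not its real implementation — the first N elements live inline (no heap allocation), and it only spills to the heap once the size exceeds N.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the small-buffer idea behind absl::InlinedVector:
// the first N elements are stored inline (no heap allocation); only when
// size exceeds N do we spill into a heap-backed std::vector.
template <typename T, std::size_t N>
class SmallVec {
 public:
  void push_back(const T& v) {
    if (!spilled_ && size_ == N) {   // out of inline space: spill once
      heap_.assign(inline_, inline_ + N);
      spilled_ = true;
    }
    if (spilled_) heap_.push_back(v);
    else inline_[size_] = v;
    ++size_;
  }
  const T& operator[](std::size_t i) const {
    return spilled_ ? heap_[i] : inline_[i];
  }
  std::size_t size() const { return size_; }
  bool on_heap() const { return spilled_; }  // did we ever allocate?

 private:
  T inline_[N]{};        // inline storage, no allocation needed
  std::vector<T> heap_;  // used only after spilling
  std::size_t size_ = 0;
  bool spilled_ = false;
};
```

As long as sizes stay ≤ N, every element access hits the inline buffer — no allocation, better cache locality.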
Back of the Envelope Performance Calculations
Goal:
Estimate performance before implementing anything. note: important section for me to re-read and DRILL it in. i always prefer deploying, profiling, measuring latency - but this is important.
Method
Estimate number of expensive operations:
- disk seeks
- network round trips
- bytes transferred
- memory accesses
Multiply by approximate cost
Sum results
If concurrency exists → costs may overlap.
note: similar to fermi estimation but for cs , v interesting
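The method is just arithmetic, so it can be sketched directly. The cost constants are the rough latencies from the table in these notes; the one-seek / two-roundtrip workload is a made-up example, not from the source.

```cpp
// Back-of-envelope: count expensive operations, multiply each count by
// an assumed per-op cost, sum. Costs in nanoseconds, from the rough
// table of operation latencies.
constexpr double kDiskSeekNs    = 5'000'000.0;
constexpr double kDcRoundTripNs = 50'000.0;
constexpr double kMemRefNs      = 50.0;

// Hypothetical request: some disk seeks, some datacenter round trips,
// some main-memory references.
constexpr double EstimateRequestNs(double seeks, double round_trips,
                                   double mem_refs) {
  return seeks * kDiskSeekNs + round_trips * kDcRoundTripNs +
         mem_refs * kMemRefNs;
}
```

For example, `EstimateRequestNs(1, 2, 1000)` gives 5,150,000 ns ≈ 5.15 ms — the single disk seek dominates everything else, which is exactly the kind of insight this method is for.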
Approximate operation costs
| Operation | Cost |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 3 ns |
| Branch mispredict | 5 ns |
| Mutex lock/unlock uncontended | 15 ns |
| Main memory reference | 50 ns |
| Compress 1KB Snappy | 1,000 ns |
| Read 4KB SSD | 20,000 ns |
| Datacenter roundtrip | 50,000 ns |
| Read 1MB memory | 64,000 ns |
| Read 1MB over 100Gb network | 100,000 ns |
| Read 1MB SSD | 1,000,000 ns |
| Disk seek | 5,000,000 ns |
| Read 1MB disk | 10,000,000 ns |
| CA ↔ Netherlands packet | 150,000,000 ns |
note: disk seek basically geological time compared to cpu.
What is a “point read” in SQL?
A point read means:
A query that retrieves one specific row using an indexed lookup.
Example:
SELECT * FROM users WHERE id = 123;
Not:
SELECT * FROM users WHERE age > 18;
Why?
- point read uses index → O(log N)
- range scan may read many rows
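A rough C++ analogy (not how a database engine is actually implemented): an index behaves like a sorted array you can binary-search, so a point read jumps to one row in O(log N), while a predicate like `age > 18` has to walk many entries.

```cpp
#include <algorithm>
#include <vector>

// Analogy only: a sorted vector of ids stands in for a B-tree index.
// PointRead does an O(log N) indexed lookup; a range predicate would
// have to scan many entries instead.
const int* PointRead(const std::vector<int>& sorted_ids, int key) {
  auto it = std::lower_bound(sorted_ids.begin(), sorted_ids.end(), key);
  return (it != sorted_ids.end() && *it == key) ? &*it : nullptr;
}
```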
Example #1 — Quicksort a Billion 4-Byte Numbers
Given
- 1 billion numbers
- each 4 bytes
- total = 4GB array note: already bigger than most cpu caches obv
Why log(N) passes?
Quicksort average depth:
O(log₂ N)
Since:
N = 2³⁰ ≈ 1 billion
log₂(N) = 30
So about 30 passes
Your question:
log(billion) is six?
That’s:
log₁₀(1,000,000,000) = 9
But algorithms use:
log₂(N)
Why base 2?
Because:
- data structures split in half
- binary comparisons
- binary trees
- binary recursion depth
note: whenever you see log in algorithms assume base 2 unless stated otherwise
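Quick sanity check of the base-10 vs base-2 confusion above:

```cpp
#include <cmath>

// log10(1e9) = 9, but recursion depth uses base 2:
// log2(1e9) ~= 29.9, which is where "about 30 passes" comes from.
double Log10Billion() { return std::log10(1e9); }  // 9
double Log2Billion()  { return std::log2(1e9); }   // ~29.9
```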
Cost #1 — Memory bandwidth
Array = 4GB
Bandwidth = 16GB/s
Time per full scan:
4 / 16 = 0.25 s
30 passes →
30 × 0.25 = 7.5 seconds
Cost #2 — Branch mispredictions
Each pass compares every element.
Total comparisons:
N × log₂N
= 1e9 × 30
= 30 billion comparisons
Assume:
- 50% mispredicted
- mispredict cost = 5 ns
Total:
15 billion × 5ns = 75 seconds
Your question:
why count only mispredictions?
Because:
Correct branch prediction is nearly free.
Why?
Modern CPUs:
- speculative execution already running predicted path
- pipeline continues smoothly
Cost mostly comes from:
- flushing pipeline
- refetching instructions
- restarting execution
That pipeline flush = ~5ns penalty.
So:
correct prediction ≈ negligible
misprediction = expensive
note: branch predictor being wrong is what kills you.
Total estimated runtime
Memory cost = 7.5s
Branch cost = 75s
Total ≈ 82.5s
Branching dominates.
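The two costs above, as explicit arithmetic (all constants are the assumed values from the estimate: 30 passes, 4 GB array, 16 GB/s bandwidth, 50% mispredict rate, 5 ns per mispredict):

```cpp
// Memory cost: 30 passes, each streaming a 4 GB array at 16 GB/s.
constexpr double MemorySeconds() { return 30 * 4.0 / 16.0; }  // 7.5 s

// Branch cost: N * log2(N) = 30e9 comparisons, half mispredicted,
// 5 ns each; divide by 1e9 to convert ns to seconds.
constexpr double BranchSeconds() { return 30e9 * 0.5 * 5.0 / 1e9; }  // 75 s

constexpr double TotalSeconds() { return MemorySeconds() + BranchSeconds(); }
```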
Cache-aware refinement
Assume:
- L3 cache = 32MB
It holds:
2^23 numbers
So:
- only first ~10 passes require full memory streaming
- remaining passes operate from cache
Memory transfers: note: once data fits in cache everything suddenly fast. memory is enemy not cpu.
10 × 4GB / 16GB/s = 2.5s
Instead of 7.5s.
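The same refinement in numbers, assuming (as above) that only the first 10 of the 30 passes stream the full 4 GB from memory:

```cpp
// With a 32 MB L3, partitions fit in cache after ~10 halvings, so only
// the first 10 passes pay full memory bandwidth (4 GB at 16 GB/s each).
constexpr double CachedMemorySeconds() { return 10 * 4.0 / 16.0; }  // 2.5 s

// New total: 2.5 s memory + 75 s branches = 77.5 s.
// Branching still dominates.
constexpr double RefinedTotalSeconds() { return CachedMemorySeconds() + 75.0; }
```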
mental model
- CPU arithmetic → v fast
- memory → slower
- network → very slow
- disk → prehistoric slow