
j & s - performance notes

Some notes I jotted down from Jeff and Sanjay's blog

“In established engineering disciplines a 12% improvement… is never considered marginal.”

“When preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.”


Why “optimize later” can be wrong

Many engineers say:

write simple code first → profile later → optimize hotspots

But this approach often fails in real systems.

Problems with ignoring performance early

  • Flat performance profile problem

If inefficiencies are everywhere, profiling won’t show obvious hotspots. Performance is lost in many small places. note: death by thousand jabs. nothing obvious to fix.


  • Library users suffer instead of authors

If you publish a slow library:

  • downstream users must debug your internals
  • they cannot easily change your code
  • they must negotiate performance fixes across teams

note: lol usual, everyone wants it, nobody owns or maintains it.

  • Hard to change systems in heavy production

Once systems are widely deployed:

  • architecture changes become risky
  • migration costs explode
  • performance fixes become political

What is Over-Replication?

Running multiple unnecessary copies of a service or dataset to compensate for slowness.

Example:

  • inefficient DB queries → add read replicas
  • slow service → deploy 10 instances instead of fixing code note: throwing servers at problem instead of fixing algorithm. seen often

What is Overprovisioning?

Allocating far more CPU / memory / machines than actually needed.

Example:

  • service needs 2 cores but uses 16 because the code is inefficient
  • DB needs 64GB RAM just to hide bad queries

Example: choosing better data structures

If vectors are usually small:

Use:

absl::InlinedVector

instead of:

std::vector

Why?

  • avoids heap allocations for small sizes
  • improves cache locality
  • faster access note: probably helps more in hot path for high throughput apps, e.g. algo trading systems

Back of the Envelope Performance Calculations

Goal:

Estimate performance before implementing anything. note: important section for me to re-read and DRILL it in. i always prefer deploying, profiling, measuring latency - but this is important.

Method

  1. Estimate number of expensive operations:

    • disk seeks
    • network round trips
    • bytes transferred
    • memory accesses
  2. Multiply by approximate cost

  3. Sum results

If concurrency exists → costs may overlap.

note: similar to Fermi estimation but for CS, v interesting


Approximate operation costs

Operation                          Cost
L1 cache reference                 0.5 ns
L2 cache reference                 3 ns
Branch mispredict                  5 ns
Mutex lock/unlock (uncontended)    15 ns
Main memory reference              50 ns
Compress 1KB with Snappy           1,000 ns
Read 4KB from SSD                  20,000 ns
Datacenter round trip              50,000 ns
Read 1MB from memory               64,000 ns
Read 1MB over 100Gb network        100,000 ns
Read 1MB from SSD                  1,000,000 ns
Disk seek                          5,000,000 ns
Read 1MB from disk                 10,000,000 ns
CA ↔ Netherlands packet            150,000,000 ns

note: disk seek basically geological time compared to cpu.
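The estimation method above can be mechanized with the table's costs. A rough sketch; the per-operation numbers are the approximations from the table, not measurements, and the example request is made up:

```python
# Back-of-envelope latency estimate: count expensive ops, multiply by
# approximate unit costs, and sum (ignores overlap from concurrency).

COST_NS = {
    "main_memory_ref": 50,
    "datacenter_roundtrip": 50_000,
    "read_1mb_ssd": 1_000_000,
}

def estimate_ns(op_counts):
    """Sum of count * approximate cost for each operation."""
    return sum(n * COST_NS[op] for op, n in op_counts.items())

# Hypothetical request: one round trip within the datacenter,
# then the remote machine reads 1MB from SSD.
total = estimate_ns({"datacenter_roundtrip": 1, "read_1mb_ssd": 1})
print(total / 1e6, "ms")  # → 1.05 ms
```

The SSD read dominates the round trip by 20×, which is exactly the kind of thing this exercise is meant to surface before you build anything.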


What is a “point read” in SQL?

A point read means:

Query retrieving one specific row using an indexed lookup.

Example:

SELECT * FROM users WHERE id = 123;

Not:

SELECT * FROM users WHERE age > 18;

Why?

  • point read uses index → O(log N)
  • range scan may read many rows
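You can see the point-read vs. range-scan difference in SQLite's query planner (table and column names are made up; the plan text varies slightly across SQLite versions):

```python
# Compare query plans for a point read vs. a range scan
# using SQLite's EXPLAIN QUERY PLAN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")

def plan(sql):
    # The human-readable plan description is the last column of each row.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

print(plan("SELECT * FROM users WHERE id = 123"))  # SEARCH ... (indexed lookup)
print(plan("SELECT * FROM users WHERE age > 18"))  # SCAN ... (full table scan)
```

SEARCH means the planner can jump to the row via the index; SCAN means it walks every row.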

Example #1 — Quicksort a Billion 4-Byte Numbers


Given

  • 1 billion numbers
  • each 4 bytes
  • total = 4GB array note: already bigger than most cpu caches obv

Why log(N) passes?

Quicksort average depth:

O(log₂ N)

Since:

N = 2³⁰ ≈ 1 billion
log₂(N) = 30

So about 30 passes


Your question:

log(billion) is six?

That’s:

log₁₀(1,000,000,000) = 9

But algorithms use:

log₂(N)

Why base 2?

Because:

  • data structures split in half
  • binary comparisons
  • binary trees
  • binary recursion depth

note: whenever you see log in algorithms assume base 2 unless stated otherwise
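Checking both bases for a billion:

```python
import math

# Base-10 log counts decimal digits; algorithmic depth is base-2,
# because each split halves the problem.
print(math.log10(1_000_000_000))  # 9: a billion has nine zeros
print(math.log2(1_000_000_000))   # ≈ 29.9, i.e. the ~30 passes above
```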


Cost #1 — Memory bandwidth

Array = 4GB

Bandwidth = 16GB/s

Time per full scan:

4 / 16 = 0.25 s

30 passes →

30 × 0.25 = 7.5 seconds
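The same arithmetic in code (array size, bandwidth, and pass count are the assumed figures above):

```python
# Memory-streaming cost of quicksorting 1 billion 4-byte numbers:
# each of ~30 passes streams the whole 4GB array through memory.
array_gb = 4          # 1e9 numbers * 4 bytes
bandwidth_gb_s = 16   # assumed memory bandwidth
passes = 30           # ~log2(1e9)

scan_s = array_gb / bandwidth_gb_s  # 0.25 s per full pass
total_s = passes * scan_s           # 7.5 s
print(total_s)
```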

Cost #2 — Branch mispredictions

Each pass compares every element.

Total comparisons:

N × log₂N
= 1e9 × 30
= 30 billion comparisons

Assume:

  • 50% mispredicted
  • mispredict cost = 5 ns

Total:

15 billion × 5ns = 75 seconds

Your question:

why count only mispredictions?

Because:

Correct branch prediction is nearly free.

Why?

Modern CPUs:

  • speculative execution already running predicted path
  • pipeline continues smoothly

Cost mostly comes from:

  • flushing pipeline
  • refetching instructions
  • restarting execution

That pipeline flush = ~5ns penalty.

So:

correct prediction ≈ negligible
misprediction = expensive

note: branch predictor being wrong is what kills you.


Total estimated runtime

Memory cost = 7.5s
Branch cost = 75s
Total ≈ 82.5s

Branching dominates.
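Putting both costs together (same assumptions as above: 50% mispredict rate, 5 ns per mispredict, 16GB/s bandwidth):

```python
# Branch-misprediction cost dominates the memory-streaming cost.
n = 10**9
passes = 30
comparisons = n * passes        # 30 billion comparisons
mispredicts = comparisons // 2  # assume 50% mispredicted
branch_s = mispredicts * 5e-9   # 5 ns per mispredict -> 75 s

memory_s = passes * (4 / 16)    # bandwidth estimate from above: 7.5 s
total_s = branch_s + memory_s   # ≈ 82.5 s
print(total_s)
```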


Cache-aware refinement

Assume:

  • L3 cache = 32MB

It holds:

2^23 numbers

So:

  • only the first 30 − 23 = 7 passes stream the full 4GB array from memory
  • the remaining passes operate from cache

Memory transfers: note: once data fits in cache everything suddenly fast. memory is enemy not cpu.

7 × 4GB / 16GB/s ≈ 1.75s

Instead of 7.5s.
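Redoing the cache arithmetic in code (assuming the 32MB L3 above; partitions stop touching main memory once they fit in cache):

```python
import math

# Cache holds 32MB / 4 bytes = 2**23 numbers. Quicksort partitions
# halve each pass, so subarrays fit in cache after 30 - 23 passes;
# only those early passes stream the full 4GB array.
cache_numbers = (32 * 2**20) // 4                      # 2**23
streaming_passes = 30 - int(math.log2(cache_numbers))  # 7
memory_s = streaming_passes * (4 / 16)                 # 1.75 s, down from 7.5 s
print(streaming_passes, memory_s)
```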


mental model

  • CPU arithmetic → v fast
  • memory → slower
  • network → very slow
  • disk → prehistoric slow