The memory hierarchy, visualized — Circuit Copilot Glossary

Every computation needs two things: an instruction and the data it operates on. Both have to come from memory. The catch: memory speed has lagged compute speed for forty years. The memory hierarchy is the engineering response to that gap.

The hierarchy, from fastest to slowest

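Each level is slower and larger than the one above it. The latency step is a factor of about 3 to 5 between on-chip levels and several orders of magnitude once you leave DRAM for storage. On a modern CPU: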

Registers — ~1 cycle · ~1 KB total · in the core's register file, feeding the execution units
L1 cache — ~4 cycles · 32–64 KB · per core
L2 cache — ~12 cycles · 256 KB–1 MB · per core
L3 cache — ~40 cycles · 8–64 MB · shared across cores
Main RAM — ~200 cycles · 8–256 GB · DDR5 in 2026
NVMe SSD — ~100,000 cycles · TBs · PCIe storage
Network storage — ~10,000,000 cycles · effectively unlimited

"Cycle" here means clock cycle of the CPU. At 3 GHz, one cycle is 0.33 ns. A round trip to main RAM costs you ~70 ns — a lifetime, in computer terms. If the CPU could otherwise execute 4 instructions per cycle, that's 800 wasted instruction slots every time you miss the cache.

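The arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in the text (3 GHz, ~200 cycles, 4 instructions per cycle); the function names are made up for illustration, and the "lost slots" figure is a best-case bound that ignores out-of-order execution hiding part of the stall.

```python
# Back-of-envelope numbers from the text: a 3 GHz clock, a ~200-cycle
# trip to main RAM, and an idealized 4 instructions issued per cycle.
CLOCK_HZ = 3e9
RAM_LATENCY_CYCLES = 200
ISSUE_WIDTH = 4  # instructions per cycle, best case


def cycle_time_ns(clock_hz):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def stall_cost(latency_cycles, clock_hz, issue_width):
    """Return (latency in ns, instruction slots lost) for one stalled access."""
    latency_ns = latency_cycles * cycle_time_ns(clock_hz)
    lost_slots = latency_cycles * issue_width
    return latency_ns, lost_slots


latency_ns, lost_slots = stall_cost(RAM_LATENCY_CYCLES, CLOCK_HZ, ISSUE_WIDTH)
print(f"one cycle: {cycle_time_ns(CLOCK_HZ):.2f} ns")
print(f"RAM round trip: {latency_ns:.0f} ns, {lost_slots} instruction slots lost")
```

The exact result is about 67 ns, which the text rounds to ~70 ns.
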
Why caches exist

Caches exploit two patterns in real programs:

Temporal locality — if you used a value recently, you'll probably use it again soon. Keep it nearby.
Spatial locality — if you used address X, you'll probably use X+1 soon. So fetch a whole "cache line" (typically 64 bytes) at once instead of one byte.

When a program has good locality, ~95%+ of memory accesses hit cache and the program runs fast. When it has bad locality (random access into a huge array), most accesses miss and the program crawls. This is why algorithmic complexity isn't the whole story: for small inputs, a cache-friendly O(n) linear scan over a contiguous array can beat a cache-hostile O(log n) lookup that chases pointers through a tree.
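
To make the gap between good and bad locality concrete, here is a toy simulation. It is a sketch under loud assumptions: a fully associative LRU cache of 512 lines of 64 bytes (32 KB), 8-byte elements, and a 16 MB array. Real caches are set-associative and have prefetchers, so the numbers are illustrative only. The sequential pattern models spatial locality alone, so it tops out at 7 hits per 8 accesses; temporal reuse is what pushes real programs higher.

```python
import random
from collections import OrderedDict

LINE_BYTES = 64          # cache line size from the text
CACHE_LINES = 512        # 512 * 64 B = 32 KB, an L1-sized toy cache


def hit_rate(addresses, cache_lines=CACHE_LINES, line_bytes=LINE_BYTES):
    """Fraction of byte addresses that hit a fully associative LRU cache."""
    cache = OrderedDict()            # keys are line numbers, ordered by recency
    hits = 0
    for addr in addresses:
        line = addr // line_bytes    # which cache line this byte lives in
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


ARRAY_BYTES = 16 * 1024 * 1024       # 16 MB array, far larger than the cache
N_ACCESSES = 200_000
ELEM = 8                             # 8-byte elements

# Walk the array front to back versus jumping to random elements.
sequential = [i * ELEM for i in range(N_ACCESSES)]
rng = random.Random(0)
scattered = [rng.randrange(ARRAY_BYTES // ELEM) * ELEM for _ in range(N_ACCESSES)]

seq_rate = hit_rate(sequential)
rand_rate = hit_rate(scattered)
print(f"sequential: {seq_rate:.1%} hits")
print(f"random:     {rand_rate:.1%} hits")
```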

Why this matters for machine learning

Matrix multiplication has excellent locality if you do it right. The classic optimization is blocked (a.k.a. tiled) matrix multiply: instead of streaming through entire matrices, you break them into small blocks that fit in cache, multiply those, and accumulate the partial results. BLAS libraries like OpenBLAS and Intel MKL put a large share of their tuning effort into cache and register blocking.
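
The loop structure of a blocked multiply can be sketched as follows. This is plain Python on nested lists, so it shows the access pattern rather than any speedup; the block size of 2 is an arbitrary assumption chosen to keep the example small, and tuned BLAS kernels are far more elaborate than this.

```python
def matmul_naive(A, B):
    """Reference triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_blocked(A, B, block=2):
    """Same product, computed one block x block tile at a time."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                # Multiply tile A[i0.., k0..] by tile B[k0.., j0..] and
                # accumulate into tile C[i0.., j0..]. The min() calls
                # handle edge tiles when a dimension is not a multiple
                # of the block size.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += a * B[k][j]
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_blocked(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

The three inner loops only ever touch one tile of each matrix, which is the working set that has to stay cache-resident.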

On GPUs, the equivalent is shared memory: a small, programmer-managed scratchpad local to each SM (streaming multiprocessor). CUDA kernels for matmul explicitly load tiles of A and B into shared memory, perform the local multiply-accumulate, then write the result tile back to global memory. FlashAttention is essentially "do the attention math tile-by-tile in shared memory so you never have to write the full attention matrix to HBM."
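
The tile-by-tile idea behind attention can be illustrated on the CPU. The sketch below is not the FlashAttention kernel: it handles a single query vector, omits the usual 1/sqrt(d) score scaling, and does not model shared memory at all. It only shows the running-max rescaling (often called online softmax) that lets the scores be consumed one tile at a time without ever storing the full score vector. All function names and the tile size are assumptions for illustration.

```python
import math


def attention_reference(q, keys, values):
    """Single-query attention that materializes every score at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Same result, visiting keys/values one tile at a time.

    Only a running max, a running softmax denominator, and a running
    weighted sum are kept between tiles.
    """
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values))
```

The two printed vectors agree up to floating-point rounding, even though the tiled version never holds more than one tile of scores.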

Why HBM is a big deal

Standard CPU RAM sits on separate DIMMs connected by a relatively narrow bus (~50 GB/s per DDR5 channel). High Bandwidth Memory is stacked vertically on the same package as the GPU and connected by a very wide bus (~3 TB/s in aggregate across the HBM3 stacks on an H100). That's the difference between "memory is far away" and "memory is right there."

AI workloads, especially inference, are often memory-bound, so HBM's bandwidth is what keeps the compute fed. The trade-off: HBM is expensive (it comes from only a few manufacturers, and capacity is hard to scale), which is why an H100 has "only" 80 GB of memory despite costing $30,000+.
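
A rough way to see what "memory-bound" means: if generating each token requires reading every model weight once, bandwidth alone caps the token rate. The sketch below uses the two bandwidth figures quoted above and a hypothetical 7-billion-parameter model stored at 2 bytes per parameter. It ignores batching, caches, the KV cache, and compute time, so treat the results as upper bounds under those assumptions rather than as benchmarks.

```python
GB = 1e9


def stream_time_ms(bytes_to_read, bandwidth_bytes_per_s):
    """Time to read bytes_to_read once at the given sustained bandwidth."""
    return bytes_to_read / bandwidth_bytes_per_s * 1e3


def max_tokens_per_s(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode rate if every token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical model: 7 billion parameters stored as 2-byte values.
WEIGHT_BYTES = 7e9 * 2            # 14 GB

DDR5_BW = 50 * GB                 # ~50 GB/s, figure from the text
HBM3_BW = 3000 * GB               # ~3 TB/s, figure from the text

for name, bw in [("DDR5", DDR5_BW), ("HBM3", HBM3_BW)]:
    print(f"{name}: {stream_time_ms(WEIGHT_BYTES, bw):.1f} ms per pass, "
          f"at most {max_tokens_per_s(WEIGHT_BYTES, bw):.1f} tokens/s")
```

Under these assumptions the bound is roughly 3.6 tokens/s at 50 GB/s and roughly 214 tokens/s at 3 TB/s, a 60× gap that comes purely from memory bandwidth.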