◆ CIRCUITS & CHIPS, DECODED FOR THE AI ERA 10 FREE CALCULATORS NO LOGIN · NO ADS · NO TRACKING BUILT FOR EE STUDENTS WORLDWIDE v0.1 LIVE


Every computation needs two things: an instruction and the data it operates on. Both have to come from memory. The catch: memory speed has lagged compute speed for forty years. The memory hierarchy is the engineering response to that gap.

The hierarchy, from fastest to slowest

Each level is roughly 10× slower and 10× bigger than the level above. On a modern CPU:

"Cycle" here means a CPU clock cycle. At 3 GHz, one cycle is 0.33 ns. A round trip to main RAM costs you ~70 ns, or roughly 210 cycles: a lifetime, in computer terms. If the CPU could otherwise execute 4 instructions per cycle, that's more than 800 wasted instruction slots every time an access misses the cache and goes all the way to RAM.

Why caches exist

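Caches exploit two patterns in real programs: temporal locality (data used recently is likely to be used again soon) and spatial locality (data near a recently used address is likely to be used soon).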

When a program has good locality, 95% or more of its memory accesses hit cache and the program runs fast. When it has bad locality (random access into a huge array), most accesses miss and the program crawls. This is why algorithmic complexity isn't the whole story: at small to moderate sizes, a cache-friendly O(n) scan often beats a cache-hostile O(log n) lookup in practice.
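To make the hit-rate gap concrete, here is a toy simulation of a direct-mapped cache. It is a sketch under simplifying assumptions, not a model of any real CPU: the cache size (256 lines of 64 bytes), the element size, and the function name `hit_rate` are all hypothetical, and there is no prefetcher or data reuse, so the sequential figure is lower than what real hardware typically achieves.

```python
import random

def hit_rate(addresses, num_lines=256, line_bytes=64):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    tags = [None] * num_lines          # memory block currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes     # which 64-byte memory block this is
        index = block % num_lines      # the single line that block may occupy
        if tags[index] == block:
            hits += 1                  # block is already cached
        else:
            tags[index] = block        # miss: fetch the block, evict the old one
    return hits / len(addresses)

N = 100_000
ELEM = 8  # bytes per element, e.g. a 64-bit float

# Sequential walk over an array: 8 elements share each 64-byte line.
sequential = [i * ELEM for i in range(N)]

# Random accesses into a 64 MiB array, far larger than the 16 KiB toy cache.
rng = random.Random(0)
ARRAY_ELEMS = 8 * 1024 * 1024
scattered = [rng.randrange(ARRAY_ELEMS) * ELEM for _ in range(N)]

print(f"sequential hit rate: {hit_rate(sequential):.3f}")
print(f"random hit rate:     {hit_rate(scattered):.3f}")
```

The sequential walk misses once per line and then hits seven times (a hit rate of 0.875), while the random walk almost never finds its data already cached.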

Why this matters for machine learning

Matrix multiplication has excellent locality if you do it right. The classic optimization is blocked (a.k.a. tiled) matrix multiply: instead of multiplying entire matrices, you break them into small blocks that fit in L1 cache, multiply those, and accumulate. BLAS libraries like OpenBLAS and Intel MKL spend most of their tuning effort on cache blocking.

On GPUs, the equivalent is called shared memory: a programmer-managed cache local to each streaming multiprocessor (SM). CUDA kernels for matmul explicitly load tiles of A and B into shared memory, perform the local multiply-accumulate, then write the result back to global memory. Flash Attention is essentially "do the attention math tile-by-tile in shared memory so you never have to write the full attention matrix to HBM."
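The tile-by-tile idea can be sketched on the CPU for a single query vector. The trick that makes it work is a running ("online") softmax: keep the largest score seen so far and rescale the partial sums whenever a new tile raises it. This is only an illustration of that arithmetic, with hypothetical function names and tile size; it is not the actual Flash Attention kernel, which processes many queries per tile in GPU shared memory, and it omits the usual 1/sqrt(d) score scaling for brevity.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: compute all scores, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, but visiting keys and values one tile at a time."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(ref, tiled))
```

The tiled version never holds more than `tile` scores at once, which is the property that lets the GPU version avoid writing the full attention matrix to HBM.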

Why HBM is a big deal

Standard CPU RAM sits on separate modules, connected by a relatively narrow bus (~50 GB/s per DDR5 channel). High Bandwidth Memory (HBM) is stacked vertically on the same package as the GPU and connected by a very wide bus (~3 TB/s in aggregate on HBM3). That's the difference between "memory is far away" and "memory is right there."

AI workloads, especially inference, are often memory-bound, and HBM bandwidth is what keeps the compute fed. The trade-off: HBM is expensive (only a few manufacturers supply it, and capacity is hard to scale), which is why an H100 has "only" 80 GB of memory despite costing $30,000+.
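A rough way to see why inference tends to be memory-bound is to compare the time a chip needs to do the arithmetic with the time it needs to move the bytes. The sketch below uses the ~3 TB/s figure from the text; the peak compute rate, the 7-billion-parameter model, the 16-bit weights, and the batch size of one are all hypothetical inputs chosen for illustration, not measurements.

```python
# Rough check of whether a workload is memory-bound on a given chip.
# All hardware numbers here are illustrative assumptions, not measured values.
HBM_BANDWIDTH = 3e12   # bytes/s, the ~3 TB/s HBM3 figure quoted above
PEAK_FLOPS = 1e15      # FLOP/s, hypothetical peak compute throughput

def time_bounds(flops, bytes_moved):
    """Return (compute_seconds, memory_seconds) lower bounds for a workload."""
    return flops / PEAK_FLOPS, bytes_moved / HBM_BANDWIDTH

# Hypothetical case: generating one token, batch size 1, with a model of
# 7 billion parameters in 16-bit weights. Each weight is read once (2 bytes)
# and used in one multiply-add (2 FLOPs).
params = 7e9
compute_s, memory_s = time_bounds(flops=2 * params, bytes_moved=2 * params)

print(f"compute lower bound: {compute_s * 1e3:.3f} ms")
print(f"memory lower bound:  {memory_s * 1e3:.3f} ms")
print("memory-bound" if memory_s > compute_s else "compute-bound")
```

Under these assumptions, reading the weights takes a few milliseconds while the arithmetic takes a small fraction of one, so the compute units spend most of the token waiting on memory.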

Why this matters
When someone says a model is 'memory-bound,' they mean the GPU's compute is mostly idle, waiting for HBM. That's not a software problem — it's the deepest constraint in modern computing. Flash Attention, KV-cache optimization, and quantization all exist to fight this.