What's actually inside a GPU? — Circuit Copilot Glossary

A GPU is not a 'faster CPU.' It's thousands of small, dumb workers all doing the same operation at once on different data. That's why it accelerates matrix math — and that's the whole reason it accelerates AI.

Open up an NVIDIA H100 and you'll find 132 enabled Streaming Multiprocessors (SMs) on the SXM part (the full die has 144; some are disabled for yield). Each SM contains 128 FP32 CUDA cores plus specialized Tensor Cores for matrix-multiply-and-accumulate. That's 132 × 128 = 16,896 simple arithmetic units, all running simultaneously. On a B200, the count is higher still.

The key insight: a CPU optimizes for latency (finish one task fast). A GPU optimizes for throughput (finish 16,000 tasks at once, even if each one is slower). Training a neural network is mostly multiplying large matrices — exactly the workload GPUs were designed for, even before AI was the point.

Three things that actually live on the package

The SMs (compute). Each SM is a small SIMT engine: same instruction, multiple threads, on different data. When you launch a CUDA kernel with 1,000,000 threads, the threads are grouped into blocks, the blocks are assigned to SMs, and each SM runs its threads in groups of 32 (a "warp"). Threads in a warp execute in lockstep, which is why branchy code wrecks GPU performance: when threads in the same warp take different branches, the warp executes both paths one after the other and masks off the inactive lanes each time.
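
To make the divergence cost concrete, here is a toy Python model of one 32-lane warp hitting an if/else. It is not GPU code and it ignores real scheduling details; the function names `warps_needed` and `branch_passes` are made up for illustration.

```python
from typing import Sequence

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_needed(num_threads: int) -> int:
    """Number of warps required to cover num_threads (ceiling division)."""
    return (num_threads + WARP_SIZE - 1) // WARP_SIZE


def branch_passes(predicates: Sequence[bool]) -> int:
    """Toy model: how many passes one warp makes over an if/else.

    If every lane agrees, the warp runs a single path. If lanes disagree,
    it runs the 'if' path with some lanes masked off, then the 'else'
    path with the other lanes masked off: two passes.
    """
    assert len(predicates) == WARP_SIZE
    takes_if = any(predicates)
    takes_else = not all(predicates)
    return int(takes_if) + int(takes_else)


uniform = [True] * WARP_SIZE                               # every lane takes the 'if'
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]   # lanes alternate

print(warps_needed(1_000_000))   # 31250
print(branch_passes(uniform))    # 1
print(branch_passes(divergent))  # 2
```

In this model a divergent warp does twice the work of a uniform one for the same branch, and nested divergent branches compound the effect.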

The Tensor Cores (acceleration). Since Volta (2017), NVIDIA's data-center GPUs have had dedicated matrix-multiply units (a few consumer parts, such as the GTX 16 series, shipped without them). A Volta Tensor Core does a 4×4 matrix-multiply-and-accumulate, D = A×B + C, in a single clock. On H100 they process larger tiles per instruction and add FP8 support. This is the unit doing the actual work when you fine-tune a model, not the general CUDA cores.

The HBM (memory). A modern AI GPU has roughly 80–192 GB of high-bandwidth memory: DRAM dies stacked vertically and placed on the same package, right next to the compute die. On the H100 that means 80 GB across five HBM3 stacks on a 5,120-bit-wide bus, delivering a little over 3 TB/s in aggregate. Starve the compute of data and the cores sit idle, which is why "memory-bound" is a real concern for inference workloads.
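
A back-of-envelope sketch of what that bandwidth implies for inference. It assumes a hypothetical 13-billion-parameter model with 2-byte weights and rounds the bandwidth to 3 TB/s; the model size and the function name `min_seconds_per_token` are illustrative, not measurements.

```python
def min_seconds_per_token(param_count: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token time at batch size 1.

    Each generated token has to read every weight from HBM at least
    once, so time >= model bytes / memory bandwidth.
    """
    model_bytes = param_count * bytes_per_param
    return model_bytes / bandwidth_bytes_per_s


# Assumed, illustrative numbers:
PARAMS = 13e9            # hypothetical 13B-parameter model
BYTES_PER_PARAM = 2      # FP16/BF16 weights
HBM_BANDWIDTH = 3.0e12   # ~3 TB/s, rounded

t = min_seconds_per_token(PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH)
print(f"{t * 1e3:.2f} ms per token, at most {1 / t:.0f} tokens/s")
# 8.67 ms per token, at most 115 tokens/s
```

No amount of extra compute lowers this bound; only more bandwidth, fewer bytes per weight, or larger batches that reuse each weight read can.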

Why this matters for AI workloads

Training has high arithmetic intensity: you do many FLOPs per byte loaded, so it is compute-bound. Inference, especially at small batch sizes, has low arithmetic intensity: each weight read from memory is used for only about one multiply-add, so it is memory-bound. That's why an H100 doing inference at batch size 1 is enormously underutilized. Most of its compute waits on HBM reads.
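
The compute-bound versus memory-bound split can be sketched with a simple roofline model. The peak figures below are approximate H100 SXM numbers (around 989 TFLOP/s dense BF16 and 3.35 TB/s of HBM bandwidth) and should be treated as assumptions; substitute your own hardware's values. The function names and the 8192×8192 layer size are illustrative.

```python
PEAK_FLOPS = 989e12        # assumed dense BF16 peak, FLOP/s
PEAK_BANDWIDTH = 3.35e12   # assumed HBM bandwidth, bytes/s


def attainable_flops(intensity: float) -> float:
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


def matmul_intensity(batch: int, d_in: int, d_out: int,
                     bytes_per_el: int = 2) -> float:
    """FLOPs per byte moved for (batch x d_in) @ (d_in x d_out)."""
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_el * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved


ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the two limits meet

for batch in (1, 64, 4096):
    ai = matmul_intensity(batch, 8192, 8192)
    util = attainable_flops(ai) / PEAK_FLOPS
    bound = "compute" if ai >= ridge else "memory"
    print(f"batch={batch:5d}  intensity={ai:8.1f} FLOP/B  "
          f"{bound}-bound  max utilization={util:.1%}")
```

Under these assumptions the ridge point sits near 295 FLOPs per byte, while a batch-1 matrix-vector product comes out at about 1 FLOP per byte, so well under 1% of peak compute is reachable. Only the large-batch case crosses the ridge.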

This is also why KV-cache size dominates LLM inference cost. The KV cache has to live in HBM, and serving long contexts means reading a lot of it for every generated token. Shrink it (MQA, GQA, sliding windows) or waste less of it (PagedAttention, which cuts fragmentation rather than the cache itself) and you save real money.
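
The size of the KV cache follows from a short formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below applies it to a hypothetical 80-layer model with 64 query heads of dimension 128 and 2-byte elements. The configuration and the name `kv_cache_bytes` are illustrative assumptions, not a specific model's published numbers.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_el: int = 2) -> int:
    """Bytes of KV cache: keys and values (factor 2) per layer, head, token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el


# Hypothetical model: 80 layers, 64 query heads, head dimension 128.
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 80, 64, 128, 32_768

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)        # 8 shared KV heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)        # a single shared KV head

GIB = 2**30
for name, size in (("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)):
    print(f"{name:6s} {size / GIB:6.2f} GiB per 32k-token sequence")
# MHA     80.00 GiB per 32k-token sequence
# GQA-8   10.00 GiB per 32k-token sequence
# MQA      1.25 GiB per 32k-token sequence
```

With full multi-head attention, a single 32k-token sequence in this hypothetical model would fill an 80 GB GPU on its own, before any weights are loaded.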

The other things people forget about

Modern GPUs are networked. An H100 has NVLink running at ~900 GB/s (total, counting both directions) to other GPUs in the same node, and InfiniBand or Ethernet to other nodes. For training large models, GPUs spend meaningful time waiting on inter-GPU communication, which is why model parallelism strategies (tensor parallel, pipeline parallel, expert parallel) matter so much for trillion-parameter training runs.
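
For a rough sense of the communication cost, here is a bandwidth-only estimate for one common collective, a ring all-reduce of gradients. The text above does not name this pattern; it is picked here as an illustration. The payload size, the usable per-direction bandwidth, and the function name are all assumptions.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N of the payload;
    latency and overlap with compute are ignored.
    """
    if num_gpus < 2:
        return 0.0
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


# Assumed, illustrative numbers:
GRAD_BYTES = 10e9 * 2   # hypothetical 10B parameters, 2-byte gradients
NVLINK = 450e9          # assume half of ~900 GB/s is usable per direction

t = ring_allreduce_seconds(GRAD_BYTES, 8, NVLINK)
print(f"{t * 1e3:.1f} ms per all-reduce across 8 GPUs")
```

Tens of milliseconds per step is significant next to the compute time of a single training step, and links between nodes are typically much slower than NVLink.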

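And they need power. An H100 draws ~700 W flat-out. A B200 draws ~1,000 W. A 100,000-GPU H100 cluster draws about 70 MW for the GPUs alone, and roughly 100 MW or more once you add CPUs, networking, and cooling. That is the load of a small city. This is why hyperscalers are signing deals for nuclear power.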
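
The cluster figure is easier to trust with the arithmetic spelled out. In this sketch, 100,000 GPUs at 700 W each account for 70 MW; the rest is non-GPU load, modeled by an overhead factor of 1.5 that is purely an assumption. The function name is illustrative.

```python
def cluster_power_mw(num_gpus: int, watts_per_gpu: float,
                     overhead_factor: float = 1.0) -> float:
    """Total draw in megawatts; overhead_factor scales for non-GPU load."""
    return num_gpus * watts_per_gpu * overhead_factor / 1e6


gpus_only = cluster_power_mw(100_000, 700)
with_overhead = cluster_power_mw(100_000, 700, overhead_factor=1.5)  # assumed factor
print(f"GPUs alone: {gpus_only:.0f} MW, with overhead: {with_overhead:.0f} MW")
# GPUs alone: 70 MW, with overhead: 105 MW
```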