All three accelerate computation. They differ in how much flexibility they give up for efficiency. Moving from GPU to FPGA to ASIC trades programmability for performance per watt.
GPU. A fixed silicon design optimized for parallel arithmetic, programmable through software. You write CUDA or PyTorch, the GPU runs it. Same chip can train a transformer today and render a video game tomorrow. Mass-produced (millions of units), general-purpose, widely supported.
FPGA (Field-Programmable Gate Array). A chip whose logic gates and connections can be reconfigured after manufacturing. You write a hardware description in Verilog or VHDL, and the FPGA toolchain "synthesizes" your design onto the programmable fabric. Much more efficient than a CPU for fixed dataflow work, but harder to program and sold into a much smaller market.
ASIC (Application-Specific Integrated Circuit). A custom chip designed and manufactured for one job. Maximum performance per watt because nothing on the die is wasted. But: high upfront cost (an N3 tape-out is ~$30M+), long lead time (12–24 months), and if your requirements change after fabrication, you're stuck.
GPUs win for training. Training has huge variety — every new model architecture has slightly different ops — and you want to iterate fast. The flexibility of a programmable GPU + mature software (CUDA, cuDNN, PyTorch) crushes everything else for that workload.
ASICs win for at-scale inference. Once a model is frozen and you're serving it billions of times, custom silicon pays off. Google's TPU is an ASIC optimized for matrix-multiply at low precision. Amazon's Inferentia is the same idea. Groq's LPU takes this further with a deterministic, compiler-scheduled dataflow that achieves remarkable tokens-per-second throughput on transformer inference.
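To make "matrix-multiply at low precision" concrete, here is a minimal pure-Python sketch of a quantized matrix multiply: floats are mapped to 8-bit integers, the multiply-accumulate runs on integers, and the result is scaled back to floats. The function names and the symmetric per-matrix scaling scheme are assumptions chosen for illustration; this is not how the TPU or Inferentia actually implement it.

```python
def quantize(matrix, num_bits=8):
    """Symmetric linear quantization: map floats to signed integers plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for row in matrix for v in row) or 1.0
    scale = max_abs / qmax
    quantized = [[round(v / scale) for v in row] for row in matrix]
    return quantized, scale


def quantized_matmul(a, b):
    """Multiply two float matrices using integer arithmetic internally."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    cols_b = list(zip(*qb))
    # Integer multiply-accumulate, then a single rescale per output element.
    return [
        [sum(x * y for x, y in zip(row, col)) * scale_a * scale_b for col in cols_b]
        for row in qa
    ]


def float_matmul(a, b):
    """Reference full-precision matrix multiply for comparison."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


A = [[0.5, -1.0], [0.25, 0.75]]
B = [[1.0, 0.5], [-0.5, 0.25]]
approx = quantized_matmul(A, B)
exact = float_matmul(A, B)
```

The integer result lands close to the full-precision one. That small accuracy loss is the price of the narrower arithmetic units that low-precision hardware relies on.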
FPGAs win for edge and prototyping. Typical uses are low-volume military, aerospace, medical imaging, network processing, and pre-silicon ASIC prototyping. Microsoft uses FPGAs (Project Brainwave) in Azure for some inference. The economics rarely beat GPUs for large-scale AI work, but for embedded, real-time, and unusual workloads, FPGAs are hard to beat.
A GPU has a memory hierarchy with caches, schedulers, and dynamic routing. That flexibility costs power and creates jitter. Groq's LPU throws all of that out: every operation has a deterministic, compile-time-scheduled location and timing. There's no caching, no scheduling at runtime. The compiler knows exactly when every byte arrives.
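The idea of compile-time scheduling can be sketched in a few lines of Python. This toy "compiler" takes a small operation graph with fixed latencies and assigns every operation a start cycle before anything runs. The op names, latencies, and the assumption of unlimited parallel execution units are all hypothetical; this is not Groq's actual compiler or instruction set.

```python
from graphlib import TopologicalSorter

# Hypothetical op graph: name -> (latency in cycles, names of ops it depends on)
OPS = {
    "load_weights": (4, []),
    "load_activations": (2, []),
    "matmul": (8, ["load_weights", "load_activations"]),
    "add_bias": (1, ["matmul"]),
    "activation": (2, ["add_bias"]),
}


def static_schedule(ops):
    """Assign each op a fixed start cycle, as early as its inputs allow."""
    graph = {name: deps for name, (_latency, deps) in ops.items()}
    start = {}
    # static_order() yields each op only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        deps = ops[name][1]
        # An op starts when its slowest input has finished (cycle 0 if no inputs).
        start[name] = max((start[d] + ops[d][0] for d in deps), default=0)
    return start


schedule = static_schedule(OPS)
total_cycles = max(schedule[name] + ops_latency for name, (ops_latency, _) in OPS.items())
```

Because latencies are fixed and nothing is decided at runtime, the schedule is identical on every run. That is the property that removes jitter.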
Result: 500+ tokens/sec on Llama 70B is achievable on Groq, where a single H100 might do 30. The catch: you can't run arbitrary workloads, because the compiler has to fit your model into the fixed dataflow. The design trades flexibility for raw throughput.
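For a sense of scale, the two throughput figures above can be turned into response times. The sketch below uses only the numbers quoted in the text (500 and 30 tokens/sec) and treats them as illustrative. The 1,000-token response length is an assumption, and real throughput depends on batch size, model, and deployment.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Time to stream num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


GROQ_TPS = 500  # figure quoted in the text
H100_TPS = 30   # figure quoted in the text ("might do 30")
RESPONSE_TOKENS = 1000  # assumed response length, for illustration only

groq_seconds = seconds_to_generate(RESPONSE_TOKENS, GROQ_TPS)
h100_seconds = seconds_to_generate(RESPONSE_TOKENS, H100_TPS)
speedup = GROQ_TPS / H100_TPS

print(f"Groq: {groq_seconds:.1f} s, H100: {h100_seconds:.1f} s, ratio: {speedup:.1f}x")
```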
The upfront design cost is why ASICs for AI are mostly a hyperscaler story (Google, Amazon, Meta) rather than a startup story. Hyperscalers have the workload volume to amortize it.
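The amortization argument can be checked with a rough break-even calculation. The sketch below spreads a one-time design cost over the number of chips deployed. The $30M figure is the tape-out cost quoted earlier; the per-chip ASIC cost and the GPU price are hypothetical placeholders, and the model assumes one ASIC replaces one GPU, which is a large simplification.

```python
import math


def amortized_cost_per_chip(nre_cost, unit_cost, volume):
    """One-time design (NRE) cost spread over volume, plus per-chip cost."""
    return nre_cost / volume + unit_cost


def break_even_volume(nre_cost, asic_unit_cost, gpu_unit_cost):
    """Smallest volume at which a custom chip is no more expensive than buying GPUs."""
    if gpu_unit_cost <= asic_unit_cost:
        return None  # the ASIC never pays off under this simple model
    return math.ceil(nre_cost / (gpu_unit_cost - asic_unit_cost))


NRE_COST = 30_000_000    # ~$30M tape-out figure from the text
ASIC_UNIT_COST = 5_000   # hypothetical per-chip manufacturing cost
GPU_UNIT_COST = 25_000   # hypothetical purchase price of a comparable GPU

volume_needed = break_even_volume(NRE_COST, ASIC_UNIT_COST, GPU_UNIT_COST)
small_run = amortized_cost_per_chip(NRE_COST, ASIC_UNIT_COST, 1_000)
large_run = amortized_cost_per_chip(NRE_COST, ASIC_UNIT_COST, 100_000)
```

Under these placeholder numbers, a 1,000-chip run costs more per chip than simply buying GPUs, while a 100,000-chip run costs a fraction of the GPU price. That gap is the volume effect the paragraph describes.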