FPGA vs ASIC vs GPU — Circuit Copilot Glossary

All three accelerate computation. They differ in how they balance flexibility against efficiency: moving from GPU to FPGA to ASIC, you give up programmability in exchange for performance per watt.

The three options, briefly

GPU. A fixed silicon design optimized for parallel arithmetic and programmable through software. You write CUDA or PyTorch code and the GPU runs it. The same chip can train a transformer today and render a video game tomorrow. GPUs are mass-produced (millions of units), general-purpose, and widely supported.

FPGA (Field-Programmable Gate Array). A chip whose logic blocks and interconnect can be reconfigured after manufacturing. You describe the hardware in an HDL (Verilog or VHDL), and the FPGA tools synthesize the design and map it onto the programmable fabric. For fixed dataflow work an FPGA can be much more efficient than a CPU, but it is harder to program and serves a much smaller market.
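
To make "reconfigured after manufacturing" concrete, here is a minimal Python sketch of the basic FPGA building block, a lookup table (LUT). The class name `Lut2` and its bit ordering are assumptions made for illustration only. Real devices typically use larger LUTs plus programmable routing, which this toy model ignores.

```python
class Lut2:
    """Toy 2-input lookup table: the truth table is the 'configuration'."""

    def __init__(self, truth_table):
        self.reconfigure(truth_table)

    def reconfigure(self, truth_table):
        # truth_table[i] is the output for inputs (a, b), where i = (a << 1) | b.
        # Loading new bits changes the logic function; the "hardware" is unchanged.
        if len(truth_table) != 4 or any(bit not in (0, 1) for bit in truth_table):
            raise ValueError("truth_table must be four bits (0 or 1)")
        self.truth_table = list(truth_table)

    def __call__(self, a, b):
        return self.truth_table[(a << 1) | b]


lut = Lut2([0, 0, 0, 1])                      # configured as AND
and_outputs = [lut(a, b) for a in (0, 1) for b in (0, 1)]

lut.reconfigure([0, 1, 1, 0])                 # same object, now XOR
xor_outputs = [lut(a, b) for a in (0, 1) for b in (0, 1)]
```

The same `lut` object computes AND and then XOR, depending only on the bits loaded into it. That is the sense in which an FPGA is programmed with a configuration rather than with instructions.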

ASIC (Application-Specific Integrated Circuit). A custom chip designed and manufactured for one job. It offers the highest performance per watt because little on the die is spent on anything but that job. The costs: a high upfront investment (an N3 tape-out is on the order of $30M or more), a long lead time (12–24 months), and no way to change the silicon if your requirements shift after fabrication.

Where each one wins for AI

GPUs win for training. Training workloads vary widely (every new model architecture has slightly different ops), and you want to iterate fast. A programmable GPU combined with mature software (CUDA, cuDNN, PyTorch) beats everything else for that workload.

ASICs win for at-scale inference. Once a model is frozen and you're serving it billions of times, custom silicon pays off. Google's TPU is an ASIC optimized for low-precision matrix multiplication. Amazon's Inferentia follows the same idea. Groq's LPU takes this further: its dataflow is scheduled deterministically at compile time, which yields very high tokens-per-second throughput on transformer inference.

FPGAs win for edge and prototyping. Typical uses are low-volume military, aerospace, and medical-imaging systems, network processing, and pre-silicon ASIC prototyping. Microsoft uses FPGAs (Project Brainwave) in Azure for some inference. The economics rarely beat GPUs for large-scale AI work, but for embedded, real-time, and unusual workloads FPGAs are hard to beat.

Why Groq's LPU is not just a fast GPU

A GPU has a memory hierarchy with caches, schedulers, and dynamic routing. That flexibility costs power and creates jitter. Groq's LPU throws all of that out: every operation has a deterministic, compile-time-scheduled location and timing. There's no caching, no scheduling at runtime. The compiler knows exactly when every byte arrives.
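
As a rough illustration of compile-time scheduling, the sketch below assigns every operation a fixed start and end cycle before anything runs. It is a toy model, not Groq's compiler: the function name `static_schedule`, the op names, and the latencies are all hypothetical, and it assumes every op has a known fixed latency and that there are enough functional units to run independent ops in parallel.

```python
from graphlib import TopologicalSorter


def static_schedule(latency, deps):
    """Return {op: (start_cycle, end_cycle)}, computed entirely ahead of time.

    latency: {op: cycles}, assumed fixed and known at compile time.
    deps:    {op: set of ops that must finish first}.
    """
    schedule = {}
    # static_order() yields each op only after all of its predecessors.
    for op in TopologicalSorter(deps).static_order():
        start = max((schedule[d][1] for d in deps.get(op, ())), default=0)
        schedule[op] = (start, start + latency[op])
    return schedule


# Hypothetical four-op dataflow with made-up latencies.
latency = {"load_w": 4, "load_x": 2, "matmul": 8, "add_bias": 1}
deps = {"matmul": {"load_w", "load_x"}, "add_bias": {"matmul"}}
schedule = static_schedule(latency, deps)
```

Because the latencies are fixed, the schedule is the same on every run. There is nothing left to decide at runtime, which is where the absence of jitter comes from.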

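Result: throughput of 500+ tokens/sec on Llama 70B is achievable on Groq, where a single H100 might do around 30. The comparison is not chip-for-chip, since a Groq deployment spreads the model across many LPUs. The catch: you can't run arbitrary workloads, because the compiler has to fit your model into the fixed dataflow. The design trades flexibility for raw throughput.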
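
One rough way to see where a figure in the tens of tokens per second can come from: single-stream decoding is often limited by memory bandwidth, because generating each token means reading roughly all of the model weights once. The sketch below applies that rule of thumb. The function name and the 3,000 GB/s bandwidth are illustrative assumptions rather than measured figures, and the estimate ignores KV-cache traffic, batching, and multi-chip sharding.

```python
def decode_tokens_per_sec(params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Upper-bound estimate: one full pass over the weights per generated token."""
    model_gb = params_billion * bytes_per_param   # weight footprint in GB
    return mem_bandwidth_gb_s / model_gb


# Assumed round numbers: 70B parameters, 3,000 GB/s of memory bandwidth.
est_8bit = decode_tokens_per_sec(70, 1, 3000)    # 8-bit weights
est_16bit = decode_tokens_per_sec(70, 2, 3000)   # 16-bit weights
```

Under these assumptions the estimate lands at a few tens of tokens per second per stream. Keeping weights in much faster on-chip memory is one way to move that ceiling.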

The economics, simplified

GPU: $5,000–$40,000 per chip · 1 year design cycle · sells to everyone
FPGA: $1,000–$15,000 per chip · weeks to deploy a new design · niche markets
ASIC: $30M+ NRE (non-recurring engineering cost) · 1–2 year design cycle · only viable if you ship millions of units or have a large captive workload of your own

That last constraint is why ASICs for AI are largely a hyperscaler story (Google, Amazon, Meta) rather than a startup story. Hyperscalers have the workload volume to amortize the design cost.
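
The amortization argument can be sketched as a break-even calculation. All unit costs below are hypothetical and chosen only to show the two cases. The function name is an assumption, and the model counts NRE and per-unit cost only, leaving out software, staffing, and schedule risk.

```python
import math


def asic_break_even_units(nre_usd, asic_unit_cost_usd, alternative_unit_cost_usd):
    """Units needed before the ASIC's total cost drops below the alternative's."""
    saving_per_unit = alternative_unit_cost_usd - asic_unit_cost_usd
    if saving_per_unit <= 0:
        return None  # the ASIC never pays off on unit cost alone
    return math.ceil(nre_usd / saving_per_unit)


# Case 1 (hypothetical): a cheap part that saves $10 per unit over an off-the-shelf chip.
small_saving = asic_break_even_units(30_000_000, 20, 30)

# Case 2 (hypothetical): a $500 in-house chip replacing a $5,000 accelerator.
large_saving = asic_break_even_units(30_000_000, 500, 5_000)
```

With a small per-unit saving, break-even sits in the millions of units. With a large saving per chip it drops to thousands, but only an organization that can actually deploy thousands of chips against its own workload gets to collect it.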