For inference, GPUs are still dominant as well; it's just that memory bandwidth tends to matter more than compute (though depending on the bandwidth, compute can still be the limiter - e.g. it's been suggested that Apple devices have more than enough bandwidth that on inference tasks their GPUs are compute bound, and thus improving compute would dramatically improve inference performance, especially for multi-batch inference).

I thought, at least for LLMs, that the GPU is the most performant unit for training, while the CPU is the most performant unit for inference, which would mean it's not always the case that the CPU has limited performance for AI. And based on what you've written, the NPU would be used as a CPU alternative for inference when you want to optimize for efficiency. Do I have that right?
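To make the bandwidth-vs-compute trade-off concrete, here is a rough roofline-style sketch of why single-stream LLM decoding is usually bandwidth bound and why batching pushes it toward the compute ceiling. The numbers (7B parameters, fp16, ~400 GB/s, ~30 TFLOPS) are hypothetical placeholders, and the model is deliberately simplified: it ignores KV-cache traffic, attention cost, and kernel overheads.

```python
# Simplified roofline estimate for LLM decoding: each generated token streams
# every model weight from memory once, so single-stream decode speed is capped
# by bandwidth; a batch reuses the streamed weights across requests, so the
# per-step ceiling shifts toward compute as the batch grows.
# All hardware/model numbers below are illustrative assumptions.

def decode_ceilings(params_b: float, bytes_per_param: float,
                    mem_bw_gbs: float, compute_tflops: float,
                    batch: int = 1) -> dict:
    """Return tokens/s ceilings implied by memory bandwidth and by compute."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    # Bandwidth bound: one full pass over the weights per decode step,
    # amortized across the whole batch.
    bw_tokens_s = batch * mem_bw_gbs * 1e9 / weight_bytes
    # Compute bound: roughly 2 FLOPs per parameter per token (multiply + add).
    flops_per_token = 2 * params_b * 1e9
    compute_tokens_s = compute_tflops * 1e12 / flops_per_token
    return {
        "bandwidth_bound_tok_s": round(bw_tokens_s, 1),
        "compute_bound_tok_s": round(compute_tokens_s, 1),
        "limiter": "bandwidth" if bw_tokens_s < compute_tokens_s else "compute",
    }

# Hypothetical: a 7B model in fp16 on hardware with ~400 GB/s and ~30 TFLOPS.
for batch in (1, 8, 256):
    print(f"batch={batch}:", decode_ceilings(7, 2, 400, 30, batch=batch))
```

At batch 1 the bandwidth ceiling (~29 tok/s here) sits far below the compute ceiling (~2,100 tok/s), which is the usual memory-bound decode regime; at batch 256 the ordering flips and compute becomes the limiter, matching the point above about multi-batch inference.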