But you can't force it to run on the NPU, right?
Edit: I think it's more useful to compare the M4 to the base M3, like here
You can't force a layer to run on HW that doesn't support it, eg an FP32 layer cannot run on the ANE (which supports FP16 and INT8). Also remember that a MODEL does not run on some hardware; each LAYER of the net runs on some hardware.
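To make that concrete, here's a minimal coremltools sketch (my illustration, nothing to do with GB6's actual pipeline): the precision you convert at is what decides whether a layer is even eligible for the ANE.

```python
# Hedged sketch: FP16 conversion yields ANE-eligible layers; FP32 layers
# can only fall back to GPU/CPU. The toy net is a placeholder, not a GB6 model.
import coremltools as ct
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
x = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(net, x)

ane_eligible = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=x.shape)],
    compute_precision=ct.precision.FLOAT16,  # ANE handles FP16 (and INT8)
    compute_units=ct.ComputeUnit.ALL,        # compiler may place layers on CPU/GPU/ANE
)

gpu_cpu_only = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=x.shape)],
    compute_precision=ct.precision.FLOAT32,  # FP32 layers cannot be placed on the ANE
    compute_units=ct.ComputeUnit.ALL,
)
```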
Poorly ported neural nets (ie most of them) run on a mix of ANE, GPU, and CPU because the various layers have not been tweaked (often in very simple ways) so that they can run entirely on the ANE. If you read the various articles on the Apple Machine Learning website (eg the one about porting Transformers to the ANE) you will see this discussed. Apple's final result ran all layers on the ANE except some minor lookup-table stuff. There's a similar article about image segmentation, again showing that the desired endpoint is 100% of layers executing on the ANE.
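For a flavor of what those "very simple tweaks" look like, here's a hedged PyTorch sketch in the spirit of Apple's Transformers-on-ANE write-up (my paraphrase of the idea, not Apple's code): express a linear projection as a 1x1 convolution over a channels-first (B, C, 1, S) tensor, so the layer maps onto the ANE's convolution hardware.

```python
# Illustrative only: same math as nn.Linear, but an op/layout choice
# the ANE can execute. Shapes and names are mine, not Apple's.
import torch
import torch.nn as nn

class LinearAsConv2d(nn.Module):
    """y = Wx + b, expressed as a 1x1 convolution over (B, C, 1, S)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.proj = nn.Conv2d(in_features, out_features, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, 1, seq_len), channels-first, instead of the
        # usual (batch, seq_len, features) that a plain nn.Linear expects.
        return self.proj(x)

x = torch.randn(1, 512, 1, 128)        # (B, C, 1, S)
y = LinearAsConv2d(512, 1024)(x)       # -> (1, 1024, 1, 128)
```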
Finally, the three GB6-ML cases are nested supersets: either every layer runs on the CPU; or each layer runs on the CPU or GPU; or each layer runs on whichever of CPU, GPU, or ANE the compiler considers optimal.
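In Core ML terms (my mapping of the buckets, not anything Geekbench documents), those three cases are just the nested compute-unit settings you pass when loading a model:

```python
# Hedged sketch of the three nested device choices via coremltools.
# "SomeNet.mlpackage" is a placeholder path, not a real GB6 asset.
import coremltools as ct

COMPUTE_UNITS = {
    "CPU": ct.ComputeUnit.CPU_ONLY,     # every layer on CPU
    "GPU": ct.ComputeUnit.CPU_AND_GPU,  # each layer on CPU or GPU
    "ANE": ct.ComputeUnit.ALL,          # each layer on CPU, GPU, or ANE
}

models = {
    name: ct.models.MLModel("SomeNet.mlpackage", compute_units=cu)
    for name, cu in COMPUTE_UNITS.items()
}
```

The point being that the "ANE" setting never guarantees ANE execution; it only permits it, layer by layer.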
So overall it's very messy. But if you look at the patterns of numbers in GB6-ML, you see that FP16 and INT8 work asked to run on the ANE mostly does run on the ANE, whereas FP32 mostly runs on the GPU.
And note how the FP16 and INT8 numbers are so similar in GB6-ML NPU results. Like I said, each FMAC unit can execute either FP16 or INT8 so throughput is the same for either.
The benchmark that leaps out at you is Text Classification, but we don't have enough data here for a good analysis.
Let's compare with the A17
(And, editorial, GB6 ML browser is ABSOLUTELY ***** MADDENING. Like it's deliberately designed to ensure you make mistakes in comparing GPU with CPU with ANE...)
Most of the FP32 stuff is 2x. No surprise, that's all on the GPU.
Most of the FP16 and INT8 stuff is about 1.3x. Probably the ANE itself is running faster, helped by faster DRAM (and maybe a larger SLC).
But again Text Classification on M4 runs like a bat out of hell.
How to interpret this?
Well (and good luck getting the details right starting from scratch in GB6-ML browser!) look at
https://browser.geekbench.com/ml/v0/inference/373246 (A17 CPU)
https://browser.geekbench.com/ml/v0/inference/373248 (A17 GPU)
https://browser.geekbench.com/ml/v0/inference/373283 (A17 ANE)
and look particularly at Text Classification.
Clearly the ML compiler is making a serious error here. (It's the only error I saw, but I may have missed something.)
Text Classification runs a LOT faster on the CPU [presumably AMX] than on the ANE or GPU, but the ML compiler seems to want to run most or all of its layers on something other than the CPU. (The similarity of speeds suggests it's actually running on the GPU in both the GPU and ANE cases.)
So that gives us a feeling for the speed of Text Classification on CPU[AMX] vs GPU (and "ANE").
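To spell out the smell test I'm applying, here's a tiny self-contained sketch; the thresholds are arbitrary and the sample numbers are purely illustrative placeholders, NOT real GB6 scores:

```python
# Hypothetical helper codifying the routing "smell test" above.
# Scores are higher-is-better, GB6-ML style; the sample values below
# are made up for illustration and are NOT real A17/M4 results.
def likely_routing(cpu: float, gpu: float, ane: float, tol: float = 0.15) -> str:
    """Guess where a workload's layers actually ran, given per-backend scores."""
    if abs(gpu - ane) / max(gpu, ane) < tol and cpu > 1.5 * max(gpu, ane):
        # "ANE" run no faster than the GPU run, and the CPU run beats both:
        # the compiler is probably placing the layers on the GPU in both cases
        # and routing them away from the faster CPU/AMX path.
        return "GPU in both GPU and ANE modes; faster CPU path ignored (misrouted?)"
    if ane > 1.2 * gpu:
        return "genuinely running on the ANE"
    return "unclear"

print(likely_routing(cpu=3000.0, gpu=1000.0, ane=1050.0))  # placeholder numbers
```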
Now go back to looking at the M4 numbers for Text Classification. They're a lot better than the A17 (or M3) GPU (and "ANE") numbers, but a lot worse than the A17 CPU[AMX] numbers...
This suggests three things:
- the ML Compiler bug that's incorrectly routing Text Classification away from the CPU still exists
- Text Classification is being routed to the ANE (the M4's performance is more than 2x the A17's, and since their GPUs are essentially similar, or so we're told, the GPU alone can't explain that gap).
- something big changed in the M4 ANE relative to not just the M3 ANE but also relative to the A17 ANE.
What could this be?
I've stated before that the known ANE consists of two parts: 16 Neural Cores that perform convolutions, and a Planar Engine that performs pooling, with the two sharing a "Smart L2" pool of SRAM.
But there are at least two patents (eg
https://patents.google.com/patent/US11614937B1 ) that describe a vector DSP and are from people associated with the ANE. The patents are somewhat vague as to where this "accelerator" might be placed, but the obvious placement is as a third unit of the existing ANE.
So my current hypothesis is that
- the vector DSP was part of the A17 (but perhaps hidden behind chicken bits)
- by the M4 it was considered bug-free and so available for use by the ML Compiler, and that's what we're seeing here
There are a number of questions here. For example, if the vDSP was behind chicken bits on the A17, maybe it does in fact work correctly there? Maybe it just requires the latest version of the coreML runtime to route to it, and that latest version hasn't yet been released, so we see it in this iOS 18 code but not in any iOS 17 benchmarks?
And maybe the vDSP can improve many more layers? Quite likely all we're seeing is a very first quick mod to coreML to test maybe one type of layer, but we might see many more layers (especially language-relevant layers) moved to run on the vDSP, meaning a speed boost for both the M4 iPad and the A17 iPhone in September?