Just curious, where did you see INT8 being used for M4?
People are making up these claims about the ANE. Simple as that. Almost no one knows how the ANE actually works, and people just assume (a dumb assumption...) that it works like an NVIDIA chip. Apple already HAS a GPU, so why would they create new hardware that behaves like a second GPU?
The ANE we are familiar with uses a clever FMAC design that supports either FP16 or INT8, so performance is the same for FP16 and INT8. If you look at the GB6 ML benchmarks run on the NPU, you will see that this is the case. The FP32 benchmarks (which run on the GPU even if you ask them to run on the NPU, because the NPU does not support FP32) run at one speed; the FP16 and INT8 benchmarks run at a different speed that is mostly the same for both (modulo small differences from additional memory traffic and suchlike).
Apple PROBABLY switched to "OPS" rather than FLOPS for a simple reason: the US is a crazy litigious society that can always find some idiot willing to file a lawsuit claiming they did not get the FP performance they expected from the ANE, because they think FLOPS means FP32 or FP64. There's also the fact that Apple wants to count operations like ReLU lookups, which are a big part of a neural network but which no one would call a FLOP. "OPS" is vague enough that no one can build a lawsuit around it.
The original ANE had 8 cores [not counting the very first ("Lattice") version, which was only good for FaceID and nothing else], optimized for convolution. Subsequent generations added the Planar Engine (for pooling and element-wise operations) along with a lot of surrounding infrastructure (ways to use less memory, ways to synchronize the cores and the Planar Engine). There are hints that the next step would be to add a vector DSP. It's possible this was done with the A17 and M4 ANEs, and that's how the OPS count jumps from 18T to 38T.
But in terms of "primary computation" the existing design (i.e., up through M3) performs essentially the same number of INT8 or FP16 ops per cycle: 256 FMACs per core, 16 cores. The design is very clever (nothing like a GPU), but with interesting limitations on how data can be routed around.
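The arithmetic behind those OPS figures is easy to check. Here's a back-of-envelope sketch (the clock speeds are my guesses, not published specs; only the 16-core x 256-FMAC figure comes from the discussion above):

```python
# 16 cores x 256 FMACs per core; each FMAC counts as 2 ops (multiply + add).
cores = 16
fmacs_per_core = 256
ops_per_fmac = 2

ops_per_cycle = cores * fmacs_per_core * ops_per_fmac  # 8192 ops/cycle

def tops(clock_ghz):
    """Peak trillions of ops per second at a given clock (hypothetical)."""
    return ops_per_cycle * clock_ghz * 1e9 / 1e12

# A ~2.2 GHz clock lands right around the 18 TOPS quoted for the M3-era ANE:
print(round(tops(2.2), 1))   # ~18.0

# Hitting 38 TOPS on the same datapath would imply an implausible clock,
# which is why the jump suggests added hardware (e.g. a vector DSP)
# rather than just a frequency bump:
implied_ghz = 38e12 / (ops_per_cycle * 1e9)
print(round(implied_ghz, 2))  # ~4.64 GHz
```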
There is an "exploratory" patent showing how the current multiplier could be split to allow two INT4 multiplies, BUT for that to be useful, some modifications to the data routing would be required, and we see no evidence of that. Honestly, I think it's a much lower priority than other things.
Apple can get most of the value of INT4 from quantized weights, with much more flexibility (weights can be quantized to 4 bits, but also to 5 or 6 bits). If you follow both Apple's announcements and the ports of LLMs and art generators to the Mac, you will see that aggressive use is made of this weight quantization.
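To make the weight-quantization point concrete, here's a minimal sketch of symmetric uniform quantization to an arbitrary bit width. This is illustrative only; Apple's actual schemes (e.g. in Core ML) may use per-channel scales, palettization, and so on:

```python
def quantize(weights, bits):
    """Map floats to signed n-bit ints plus a single scale factor."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 15 for 5-bit, 31 for 6-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]  # integers in [-qmax, qmax]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.31, -0.72, 0.05, 0.99, -0.44]
for bits in (4, 5, 6):                       # the flexibility mentioned above
    q, s = quantize(w, bits)
    err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
    print(bits, q, round(err, 3))            # error shrinks as bits grow
```

The weights stay tiny integers in memory (cutting bandwidth, which matters on the ANE), while the arithmetic can still run at FP16/INT8 precision after dequantization.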