
Exclave

Suspended
Original poster
Jun 5, 2024
77
102
Geekbench AI was announced today. There are a number of scores that one can browse here:


Apart from the usual interest in score comparisons, what I found interesting is evidence supporting what @leman and @name99 have said previously: while people assumed Apple were “marketing” the Neural Engine by quoting Int8 figures rather than the FP16 figures they used to quote, there doesn’t seem to be as much difference between the two on Apple Silicon as there is on other platforms.

If we look at the figures from this tweet, we can see that FP16 performance is usually half that of Int8. On Apple Silicon devices that doesn’t seem to be the case: Int8 is higher, but the two are quite close.




Interestingly, a couple of results for the M4 iPad Pro on iOS 18 show significant improvements in scores, so much so that I am a little sceptical until they are verified.
https://browser.geekbench.com/ai/v1/4960
This also, but without a direct link, so I’m even more sceptical.



In any case, @leman and @name99, it looks like you were correct.
 
  • Like
Reactions: T'hain Esh Kelch

leman

macrumors Core
Oct 14, 2008
19,516
19,662
Some interesting results out there! It does seem like the M4 brings some new capabilities and just needs a new software foundation to take advantage of them. I am also really impressed by how good smartphones are at inference tasks here; they compete very well against dedicated GPUs.

A few observations: the new AMD NPUs are very impressive on paper, but not so much according to these results. Whether that is a lack of software optimizations or a discrepancy between theoretical and practical performance remains to be seen. Qualcomm Oryon does not do too well either, underperforming the S24 Ultra. There are some questions about the S24 Ultra as well, as there are very large discrepancies between some of its scores.

Pardon my ignorance, but how do you run iOS on an iPad?

iPadOS is the same as iOS, with some additional UI elements.
 
  • Like
Reactions: Tagbert

komuh

macrumors regular
May 13, 2023
126
113
ONNX with DirectML on an RTX 4090 is such a bad comparison to CoreML. It should be at least TensorRT or CUTLASS vs CoreML, or just plain torch if they don't care about peak performance. CUDA kernels are typically 50-200% faster than DX12, never mind fully optimised mixed-precision kernels using Tensor Cores; even a simple torch-compiled model would be a lot faster than buggy DirectML. This is like comparing hand-optimised assembly to generic JS/Python code.

Just to show how bad this "benchmark" is: theoretical Int8 (Tensor Core) throughput of an RTX 4090 is around ~660 TOPS without sparsity, ~1300 TOPS with sparsity (and ~2600 TOPS for Int4), compared to a theoretical peak of ~38 TOPS for the M4 NPU with ~10x slower memory. Yet in this "benchmark" the Int8 results for the M4 are almost the same as for the RTX 4090, never mind that the NVIDIA FP16 result is 2x faster than its Int8 result when it should be 2x slower (~330 TFLOPS vs ~660 TOPS).

Using such bad software (not that CoreML is perfect, but it is a lot better than DirectML) should be criminal. With comparisons like this we could make Intel CPUs look 100x as performant as AMD's, just by compiling with ICPX on Intel and TinyCC on AMD.
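To make the backend point concrete, here is a minimal sketch (assuming ONNX Runtime builds with the relevant execution providers; "model.onnx" and the input name and shape are placeholders) showing that the very same ONNX model gets dispatched to completely different kernels depending only on the provider list:

```python
# Minimal sketch: one ONNX model, three very different backends.
# "model.onnx", the input name and the shape are placeholders; absolute
# numbers depend entirely on the model and the installed ORT packages.
import time
import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for providers in (
    ["DmlExecutionProvider"],                                # DirectML (DX12 path discussed above)
    ["CUDAExecutionProvider"],                               # cuDNN/CUDA kernels
    ["TensorrtExecutionProvider", "CUDAExecutionProvider"],  # TensorRT, with CUDA fallback
):
    sess = ort.InferenceSession("model.onnx", providers=providers)
    sess.run(None, {"input": x})                             # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        sess.run(None, {"input": x})
    print(providers[0], (time.perf_counter() - t0) / 100 * 1000, "ms/inference")
```

Which of those counts as the "fair" backend to benchmark is exactly the argument here.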
 
Last edited:
  • Like
Reactions: appledevicefan420

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
ONNX with DirectML on an RTX 4090 is such a bad comparison to CoreML. It should be at least TensorRT or CUTLASS vs CoreML, or just plain torch if they don't care about peak performance. CUDA kernels are typically 50-200% faster than DX12, never mind fully optimised mixed-precision kernels using Tensor Cores; even a simple torch-compiled model would be a lot faster than buggy DirectML. This is like comparing hand-optimised assembly to generic JS/Python code.

Just to show how bad this "benchmark" is: theoretical Int8 (Tensor Core) throughput of an RTX 4090 is around ~660 TOPS without sparsity, ~1300 TOPS with sparsity (and ~2600 TOPS for Int4), compared to a theoretical peak of ~38 TOPS for the M4 NPU with ~10x slower memory. Yet in this "benchmark" the Int8 results for the M4 are almost the same as for the RTX 4090, never mind that the NVIDIA FP16 result is 2x faster than its Int8 result when it should be 2x slower (~330 TFLOPS vs ~660 TOPS).

Using such bad software (not that CoreML is perfect, but it is a lot better than DirectML) should be criminal. With comparisons like this we could make Intel CPUs look 100x as performant as AMD's, just by compiling with ICPX on Intel and TinyCC on AMD.
I don’t disagree that CUDA would give better results, and personally I always want to see the best possible results for a device, so I would welcome it.

Given that Geekbench 5 included CUDA scores, I was a little surprised when CUDA support was removed at the start of Geekbench 6. John Poole was asked about this, and I think his answer reveals how he sees Geekbench’s target audience and purpose: essentially, as a measure of consumer software and device performance. With that in mind, his tweet about CUDA is revealing:

I was surprised to hear that CUDA usage was so low in Geekbench 5. I am a little less surprised to hear its usage is lower in consumer software, given developers’ tendency to use as few API variations as possible.

All that being said, I’m not convinced the difference is as massive as you say, although I am open to evidence to the contrary. I found this Geekbench 5 comparison of the 4090’s OpenCL results vs CUDA; the biggest difference was 27%.
[Chart: NVIDIA GeForce RTX 4090 FE, Geekbench 5]


I don’t know how OpenCL compares to DirectML in terms of optimisation. I had heard OpenCL was not great on Nvidia cards.
 

komuh

macrumors regular
May 13, 2023
126
113
I don’t disagree that CUDA would give better results, and personally I always want to see the best possible results for a device, so I would welcome it.

Given that Geekbench 5 included CUDA scores, I was a little surprised when CUDA support was removed at the start of Geekbench 6. John Poole was asked about this, and I think his answer reveals how he sees Geekbench’s target audience and purpose: essentially, as a measure of consumer software and device performance. With that in mind, his tweet about CUDA is revealing:

I was surprised to hear that CUDA usage was so low in Geekbench 5. I am a little less surprised to hear its usage is lower in consumer software, given developers’ tendency to use as few API variations as possible.

All that being said, I’m not convinced the difference is as massive as you say, although I am open to evidence to the contrary. I found this Geekbench 5 comparison of the 4090’s OpenCL results vs CUDA; the biggest difference was 27%.
[Chart: NVIDIA GeForce RTX 4090 FE, Geekbench 5]


I don’t know how OpenCL compares to DirectML in terms of optimisation. I had heard OpenCL was not great on Nvidia cards.
OpenCL and CUDA won't differ much for classic shader code, but for an ML workload they will, because OpenCL can't use the Tensor Cores. (OpenCL vs CoreML would be the same: you can't reach the NPU from OpenCL, so you can't use the "accelerators" at all, and on an M4 iPad or an iPhone you'd get scores that are most likely 10-15x lower because you'd be limited to the GPU.)

An ML benchmark is totally meaningless if it uses the accelerators one system provides but not the other's. In pure OpenCL, the limit for this kind of AI workload would be around 50-60 TFLOPS (roughly 70-80% of the classic shader peak of the CUDA cores), but with CUDA it would be 330 TFLOPS for FP32 (Tensor Cores) plus 86 TFLOPS from the CUDA cores.

That's without even mentioning the peak of 2600 TOPS for Int4. And of course DirectML isn't accessing any of those accelerators, at least not in this version; it just calls DX12 shaders that reach only a fraction of the CUDA cores' potential (usually under 50%, and slower than just using PyTorch, which is itself pretty slow without compilation), so under 50% of that 86 TFLOPS.

Of course we are still talking about theoretical limits; most models are limited by memory speed rather than compute, especially with as much horsepower as newer GPUs and NPUs have, so neither the ANE nor the RTX 4090 will be able to use all of its TOPS/TFLOPS in 99% of cases.
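To put that last paragraph in numbers, here is a rough roofline sketch. The compute peaks (~660 and ~38 INT8 TOPS) are the ones quoted above; the bandwidth figures (~1008 GB/s for an RTX 4090, ~120 GB/s for a base M4) are my assumed marketing numbers, not measurements:

```python
# Rough roofline sketch: a workload that streams its weights from memory
# once per inference performs roughly `ops_per_byte` int8 ops per byte read,
# so its ceiling is the lower of the compute roof and the bandwidth roof.
# Peak figures are assumed spec-sheet numbers, not measurements.

def effective_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    return min(peak_tops, bandwidth_gb_s * ops_per_byte / 1000.0)

for name, peak_tops, bw_gb_s in [("RTX 4090", 660, 1008), ("M4 NPU", 38, 120)]:
    # batch-1 decoding: one multiply-accumulate (~2 ops) per int8 weight byte
    print(f"{name}: ~{effective_tops(peak_tops, bw_gb_s, 2):.2f} TOPS usable at batch 1")
```

At batch 1 both chips sit far below their paper peaks, which is part of why the benchmark gap is so much smaller than the spec-sheet gap; bigger batches or heavy weight reuse push things back towards the compute limit.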


 
Last edited:
  • Like
Reactions: Kronsteen

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Geekbench AI was announced today. There are a number of scores that one can browse here:


Apart from the usual interest in score comparisons, what I found interesting is evidence supporting what @leman and @name99 have said previously: while people assumed Apple were “marketing” the Neural Engine by quoting Int8 figures rather than the FP16 figures they used to quote, there doesn’t seem to be as much difference between the two on Apple Silicon as there is on other platforms.

If we look at the figures from this tweet, we can see that FP16 performance is usually half that of Int8. On Apple Silicon devices that doesn’t seem to be the case: Int8 is higher, but the two are quite close.




Interestingly, a couple of results for the M4 iPad Pro on iOS 18 show significant improvements in scores, so much so that I am a little sceptical until they are verified.
https://browser.geekbench.com/ai/v1/4960
This also, but without a direct link, so I’m even more sceptical.



In any case, @leman and @name99, it looks like you were correct.

You need to keep up!
Here's my latest analysis of what's going on (from about a week ago):

GB6 AI probably does NOT reflect the issues I talk about in that post, because it was likely finalized before coremltools 8 was released. Even now they can't just switch to coremltools 8; they also need to run model characterization profiling to use W8A8.
Presumably in a few months this will all be clearer. My current assumption is that the A17/M4 DO in fact have doubled W8A8 hardware (but this hardware was not exposed to developers until coremltools 8 was released); however, some of the evidence for that claim, i.e. numbers for models claiming to use coremltools 8, comes from performance tweets that are fairly noisy, so I'd say it's best to keep an open mind either way.
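For anyone who wants to poke at this themselves, here is a rough sketch of the W8 (weight) half with coremltools; the A8 (activation) half is the extra calibration pass that only arrived with coremltools 8 and needs representative sample inputs, i.e. the "model characterization profiling" above. The model paths are placeholders, and this is not how Geekbench builds its models, just the general mechanism:

```python
# Sketch: 8-bit *weight* quantization of a Core ML model with coremltools.
# The activation (A8) side additionally requires running calibration data
# through coremltools 8's activation-quantization pass, which is omitted here.
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder ML Program

op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric")  # int8 weights
config = cto.OptimizationConfig(global_config=op_config)
quantized = cto.linear_quantize_weights(mlmodel, config=config)

quantized.save("model_w8.mlpackage")
```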
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
ONNX with DirectML on an RTX 4090 is such a bad comparison to CoreML. It should be at least TensorRT or CUTLASS vs CoreML, or just plain torch if they don't care about peak performance. CUDA kernels are typically 50-200% faster than DX12, never mind fully optimised mixed-precision kernels using Tensor Cores; even a simple torch-compiled model would be a lot faster than buggy DirectML. This is like comparing hand-optimised assembly to generic JS/Python code.

Just to show how bad this "benchmark" is: theoretical Int8 (Tensor Core) throughput of an RTX 4090 is around ~660 TOPS without sparsity, ~1300 TOPS with sparsity (and ~2600 TOPS for Int4), compared to a theoretical peak of ~38 TOPS for the M4 NPU with ~10x slower memory. Yet in this "benchmark" the Int8 results for the M4 are almost the same as for the RTX 4090, never mind that the NVIDIA FP16 result is 2x faster than its Int8 result when it should be 2x slower (~330 TFLOPS vs ~660 TOPS).

Using such bad software (not that CoreML is perfect, but it is a lot better than DirectML) should be criminal. With comparisons like this we could make Intel CPUs look 100x as performant as AMD's, just by compiling with ICPX on Intel and TinyCC on AMD.
You think this is about Apple HW vs nV hardware.

But for plenty of people the important distinction is Windows vs Apple OS, and in that context ONNX numbers matter, at least for as long as that's the way that MS delivers their AI products to their non-tech-savvy customers.
 

komuh

macrumors regular
May 13, 2023
126
113
You need to keep up!
Here's my latest analysis of what's going on (from about a week ago):

GB6 AI probably does NOT reflect the issues I talk about in that post, because it was likely finalized before coremltools 8 was released. Even now they can't just switch to coremltools 8; they also need to run model characterization profiling to use W8A8.
Presumably in a few months this will all be clearer. My current assumption is that the A17/M4 DO in fact have doubled W8A8 hardware (but this hardware was not exposed to developers until coremltools 8 was released); however, some of the evidence for that claim, i.e. numbers for models claiming to use coremltools 8, comes from performance tweets that are fairly noisy, so I'd say it's best to keep an open mind either way.
By W8A8 do you mean 2x Int8 matmul coprocessors, or just that two W8A8 ops fit in a single coprocessor (which would mean FP16 is half the rate of Int8)? If so, that is a move in the right direction.
 
Last edited:

komuh

macrumors regular
May 13, 2023
126
113
You think this is about Apple HW vs nV hardware.

But for plenty of people the important distinction is Windows vs Apple OS, and in that context ONNX numbers matter, at least for as long as that's the way that MS delivers their AI products to their non-tech-savvy customers.
But Microsoft isn't using DirectML for any "OS" AI product, at least not yet (are the Qualcomm AI features the only ones right now? If so, they use the NPU with custom compilers, not DX12).
 

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
OpenCL and CUDA won't differ much for classic shader code, but for an ML workload they will, because OpenCL can't use the Tensor Cores. (OpenCL vs CoreML would be the same: you can't reach the NPU from OpenCL, so you can't use the "accelerators" at all, and on an M4 iPad or an iPhone you'd get scores that are most likely 10-15x lower because you'd be limited to the GPU.)

An ML benchmark is totally meaningless if it uses the accelerators one system provides but not the other's. In pure OpenCL, the limit for this kind of AI workload would be around 50-60 TFLOPS (roughly 70-80% of the classic shader peak of the CUDA cores), but with CUDA it would be 330 TFLOPS for FP32 (Tensor Cores) plus 86 TFLOPS from the CUDA cores.

That's without even mentioning the peak of 2600 TOPS for Int4. And of course DirectML isn't accessing any of those accelerators, at least not in this version; it just calls DX12 shaders that reach only a fraction of the CUDA cores' potential (usually under 50%, and slower than just using PyTorch, which is itself pretty slow without compilation), so under 50% of that 86 TFLOPS.

Of course we are still talking about theoretical limits; most models are limited by memory speed rather than compute, especially with as much horsepower as newer GPUs and NPUs have, so neither the ANE nor the RTX 4090 will be able to use all of its TOPS/TFLOPS in 99% of cases.


It seems that not much consumer software uses the frameworks necessary to utilize these accelerators. As this is a consumer-focused benchmark, I doubt much will change.
 

komuh

macrumors regular
May 13, 2023
126
113
It seems that not much consumer software uses the frameworks necessary to utilize these accelerators. As this is a consumer-focused benchmark, I doubt much will change.
I mean, there is no company that I know of that uses DirectML. They use optimised, compiled models, either locally (for example Qualcomm's custom NPU compiler, or DaVinci Resolve, whose denoiser uses TensorRT) or via CUDA/TensorRT/CUTLASS etc. in the cloud (like 95% of the software you most likely use: RunwayML, Adobe, ChatGPT, Meta, Instagram, chatbots and many more). There is also Google AI with TPUs in the cloud, and the AI accelerators in phones. I have no idea whose idea it was to use DirectML for Nvidia (for AMD GPUs it is maybe a good idea, as their software is really bad), but I can't find any program I have installed that uses DirectML (not DaVinci, Stable Diffusion, nor local llama).

I have a lot more models that use Metal (on macOS), and even the newer Apple AI features aren't using the NPU on my system (for example the Xcode assistant and summaries, plus non-system tools like llama.cpp and AI plugins for Final Cut Pro). The only one that uses the NPU at all is the Stable Diffusion port produced by Apple and downloaded from GitHub.

But I'm open to seeing statistics showing that DirectML is indeed used for popular AI products on NVIDIA hardware in Windows systems.

*I of course agree that NPUs are pretty cool, especially on mobile devices, and that we should benchmark and compare different solutions. I just argue that we should do it fairly, especially if some sort of accelerator can also be used on the other platform (for example the Qualcomm NPU in PCs).
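On the llama.cpp point specifically: its offload targets whichever GPU backend it was built with (Metal on Apple Silicon, CUDA on NVIDIA) rather than the ANE/NPU, and that choice is just a constructor flag. A small sketch with the llama-cpp-python bindings (the GGUF filename is a placeholder):

```python
# Sketch: llama.cpp via the llama-cpp-python bindings. Layers are offloaded
# to the GPU backend the build was compiled for (Metal on Apple Silicon,
# CUDA on NVIDIA); the ANE/NPU is not used. The filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="model-q8_0.gguf",  # placeholder quantized model
    n_gpu_layers=-1,               # -1 = offload all layers; 0 = CPU only
    n_ctx=2048,
)
out = llm("Why do phones ship NPUs?", max_tokens=64)
print(out["choices"][0]["text"])
```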
 
Last edited:
  • Like
Reactions: appledevicefan420

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
It’s absolutely worth looking at the accuracy scores within the individual results. There is great variance. From the ones I’ve seen, Apple’s accuracy is very high. Some of the others…not so much.
 

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
Because there is a difference between training models and inference, which is what INT8 is good for. Nvidia excels at training models in FP32.
Geekbench AI is an inference test. I assume Nvidia cards are used by people who have a need for inference, ergo the test is useful.
 
  • Like
Reactions: throAU and komuh

komuh

macrumors regular
May 13, 2023
126
113
Because there is a difference between training models and inference, which is what INT8 is good for.
Yes, 1260 TOPS of INT8 inside an RTX 4090 is not enough to compare with modern 30 TOPS NPUs; that's some hard copium, as theoretically the RTX 4090 could be 40x faster than the M4 NPU on an INT8 benchmark.
 

dmccloud

macrumors 68040
Sep 7, 2009
3,138
1,899
Anchorage, AK
I ran the GB6 AI Benchmark with all three backends, and the scores are vastly different for each:

(MBP 14" M2 Max)

CPU:
4152 Single Precision Score
6702 Half Precision Score
5707 Quantized Score

GPU:
12632 Single Precision Score
13815 Half Precision Score
10733 Quantized Score

Neural Engine:
4132 Single Precision Score
20807 Half Precision Score
23059 Quantized Score
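Those Neural Engine numbers also speak to the Int8-vs-FP16 question from the start of the thread. A quick ratio check on the scores quoted above:

```python
# Quantized-vs-half-precision ratios from the M2 Max scores above.
scores = {
    "CPU":           (6702, 5707),    # (half, quantized)
    "GPU":           (13815, 10733),
    "Neural Engine": (20807, 23059),
}
for unit, (half, quant) in scores.items():
    print(f"{unit}: quantized/half = {quant / half:.2f}")
# The Neural Engine comes out around 1.11x, nowhere near the ~2x jump you'd
# expect on hardware whose Int8 rate is double its FP16 rate.
```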
 
  • Like
Reactions: appledevicefan420

zarathu

macrumors 6502a
May 14, 2003
652
362
According to Apple in their presentation yesterday (10-31-24), the Neural Engine in the M4 is 3 times faster than the Neural Engine in the M1. This means that Apple did not change the Neural Engine from the M3 generation, since according to Geekbench AI the listed M3s were already 3.3 times faster than the listed M1s.

Another point to note is that the memory bandwidth of the M3 Max is already 400 GB/s, versus 273 GB/s for the M4 Pro. No software currently uses anywhere near even 200 GB/s of it, but one might assume software eventually will, and when it does the M3 Max will beat out the M4 Pro.

So if one is like me, upgrading only to use Topaz 3 for photography, then getting an M4 Pro is not going to give ME much of a speed bump. If I wait until after Christmas 2024, the price of an M3 Max with 48 or 64 GB of RAM will drop below the current price of an M4 Pro. And so I wait.
 

SG-

macrumors regular
Jun 8, 2015
150
88
The Neural Engine isn't the only thing doing AI acceleration; the GPU does a lot of it depending on the specific workload. That's where the newer generation sees a difference, especially moving up to the Pro and Max versions.

You'd have to see what the apps actually use; you can check yourself whether an app is using the GPU or the NE with 'mactop' from Homebrew.
 
  • Like
Reactions: Macintosh IIcx