
Exclave

Suspended
Original poster
Jun 5, 2024
77
102
Geekbench AI was announced today. There are a number of scores that one can browse here:


Apart from the usual interest in score comparisons, what I found interesting is evidence supporting what @leman and @name99 have said previously: while people assumed Apple were “marketing” the Neural Engine by quoting Int8 figures rather than the FP16 figures they used to quote, there doesn’t seem to be as much difference between the two on Apple Silicon as there is on other platforms.

If we look at the figures from this tweet, we can see that FP16 performance is usually half that of Int8. On Apple Silicon devices that doesn’t seem to be the case: Int8 is higher, but the two are quite close.




Interestingly, a couple of results for the M4 iPad Pro on iOS 18 show significant improvements in scores, so much so that I am a little sceptical until they are verified.
https://browser.geekbench.com/ai/v1/4960
This also, but without a direct link, so I’m even more sceptical.



In any case, @leman and @name99, it looks like you were correct.
 
  • Like
Reactions: T'hain Esh Kelch

leman

macrumors Core
Oct 14, 2008
19,516
19,662
Some interesting results out there! It does seem like the M4 brings some new capabilities and just needs a new software foundation to take advantage of them. I am also really impressed by how good smartphones are at inference tasks here; they compete very well against dedicated GPUs.

A few observations: the new AMD NPUs are very impressive on paper, but not so much according to these results. Whether that is a lack of software optimizations or a discrepancy between theoretical and practical performance remains to be seen. Qualcomm Oryon does not do too well either, underperforming the S24 Ultra. There are some questions about the S24 Ultra as well, as there are very large discrepancies between some of its scores.

Pardon my ignorance, but how do you run iOS on an iPad?

iPadOS is the same as iOS, with some additional UI elements.
 
  • Like
Reactions: Tagbert

komuh

macrumors regular
May 13, 2023
126
113
ONNX with DirectML on an RTX 4090 is such a bad comparison to CoreML. It should be at least TensorRT or CUTLASS vs CoreML, or just plain torch if they don't care about peak performance. CUDA kernels are typically 50-200% faster than DX12, never mind fully optimised mixed-precision kernels using Tensor Cores; even a simple torch-compiled model would be a lot faster than buggy DirectML. This is like comparing hand-optimised assembly to generic JS/Python code.

Just to show how bad this "benchmark" is: theoretical Int8 (Tensor Core) throughput of an RTX 4090 is around ~660 TOPS without sparsity, ~1300 TOPS with sparsity (and ~2600 TOPS for Int4), compared to a theoretical peak of ~38 TOPS for the M4 NPU with ~10x slower memory. Yet in this "benchmark" the Int8 results for the M4 are almost the same as for the RTX 4090, never mind that the NVIDIA FP16 result is 2x faster than its Int8 result when it should be 2x slower (~330 TFLOPS vs ~660 TOPS).

Using such bad software (not that CoreML is perfect, but it is a lot better than DirectML) should be criminal. With comparisons like this we could make Intel CPUs look 100x as performant as AMD's, just by compiling with ICPX on Intel and TinyCC on AMD.
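To make the backend point concrete, here is a minimal sketch (assuming ONNX Runtime builds with the relevant execution providers; "model.onnx" and the input name and shape are placeholders) showing that the very same ONNX model gets dispatched to completely different kernels depending only on the provider list:

```python
# Minimal sketch: one ONNX model, three very different backends.
# "model.onnx", the input name and the shape are placeholders; absolute
# numbers depend entirely on the model and the installed ORT packages.
import time
import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for providers in (
    ["DmlExecutionProvider"],                                # DirectML (DX12 path discussed above)
    ["CUDAExecutionProvider"],                               # cuDNN/CUDA kernels
    ["TensorrtExecutionProvider", "CUDAExecutionProvider"],  # TensorRT, with CUDA fallback
):
    sess = ort.InferenceSession("model.onnx", providers=providers)
    sess.run(None, {"input": x})                             # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        sess.run(None, {"input": x})
    print(providers[0], (time.perf_counter() - t0) / 100 * 1000, "ms/inference")
```

Which of those counts as the "fair" backend to benchmark is exactly the argument here.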
 
Last edited:
  • Like
Reactions: appledevicefan420

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
ONNX with DirectML on an RTX 4090 is such a bad comparison to CoreML. It should be at least TensorRT or CUTLASS vs CoreML, or just plain torch if they don't care about peak performance. CUDA kernels are typically 50-200% faster than DX12, never mind fully optimised mixed-precision kernels using Tensor Cores; even a simple torch-compiled model would be a lot faster than buggy DirectML. This is like comparing hand-optimised assembly to generic JS/Python code.

Just to show how bad this "benchmark" is: theoretical Int8 (Tensor Core) throughput of an RTX 4090 is around ~660 TOPS without sparsity, ~1300 TOPS with sparsity (and ~2600 TOPS for Int4), compared to a theoretical peak of ~38 TOPS for the M4 NPU with ~10x slower memory. Yet in this "benchmark" the Int8 results for the M4 are almost the same as for the RTX 4090, never mind that the NVIDIA FP16 result is 2x faster than its Int8 result when it should be 2x slower (~330 TFLOPS vs ~660 TOPS).

Using such bad software (not that CoreML is perfect, but it is a lot better than DirectML) should be criminal. With comparisons like this we could make Intel CPUs look 100x as performant as AMD's, just by compiling with ICPX on Intel and TinyCC on AMD.
I don’t disagree that CUDA would give better results, and personally I always want to see the best possible results for a device, so I would welcome it.

Given that Geekbench 5 included CUDA scores, I was a little surprised when CUDA support was removed at the start of Geekbench 6. John Poole was asked about this, and I think his answer reveals how he sees Geekbench’s target audience and purpose: essentially, as a measure of consumer software and device performance. With that in mind, his tweet about CUDA is revealing:

I was surprised to hear that CUDA usage was so low in Geekbench 5. I am a little less surprised to hear its usage is lower in consumer software, given developers’ tendency to use as few API variations as possible.

All that being said, I’m not convinced the difference is as massive as you say, although I am open to evidence to the contrary. I found this Geekbench 5 comparison of the 4090’s OpenCL results vs CUDA; the biggest difference was 27%.
[Chart: NVIDIA GeForce RTX 4090 FE, Geekbench 5]


I don’t know how OpenCL compares to DirectML in terms of optimisation. I had heard OpenCL was not great on Nvidia cards.
 

komuh

macrumors regular
May 13, 2023
126
113
I don’t disagree that CUDA would give better results, and personally I always want to see the best possible results for a device, so I would welcome it.

Given that Geekbench 5 included CUDA scores, I was a little surprised when CUDA support was removed at the start of Geekbench 6. John Poole was asked about this, and I think his answer reveals how he sees Geekbench’s target audience and purpose: essentially, as a measure of consumer software and device performance. With that in mind, his tweet about CUDA is revealing:

I was surprised to hear that CUDA usage was so low in Geekbench 5. I am a little less surprised to hear its usage is lower in consumer software, given developers’ tendency to use as few API variations as possible.

All that being said, I’m not convinced the difference is as massive as you say, although I am open to evidence to the contrary. I found this Geekbench 5 comparison of the 4090’s OpenCL results vs CUDA; the biggest difference was 27%.
[Chart: NVIDIA GeForce RTX 4090 FE, Geekbench 5]


I don’t know how OpenCL compares to DirectML in terms of optimisation. I had heard OpenCL was not great on Nvidia cards.
OpenCL and CUDA won't differ much for classic shader code, but for an ML workload they will, because OpenCL can't use the Tensor Cores. (OpenCL vs CoreML would be the same: you can't reach the NPU from OpenCL, so you can't use the "accelerators" at all, and on an M4 iPad or an iPhone you'd get scores that are most likely 10-15x lower because you'd be limited to the GPU.)

An ML benchmark is totally meaningless if it uses the accelerators one system provides but not the other's. In pure OpenCL, the limit for this kind of AI workload would be around 50-60 TFLOPS (roughly 70-80% of the classic shader peak of the CUDA cores), but with CUDA it would be 330 TFLOPS for FP32 (Tensor Cores) plus 86 TFLOPS from the CUDA cores.

That's without even mentioning the peak of 2600 TOPS for Int4. And of course DirectML isn't accessing any of those accelerators, at least not in this version; it just calls DX12 shaders that reach only a fraction of the CUDA cores' potential (usually under 50%, and slower than just using PyTorch, which is itself pretty slow without compilation), so under 50% of that 86 TFLOPS.

Of course we are still talking about theoretical limits; most models are limited by memory speed rather than compute, especially with as much horsepower as newer GPUs and NPUs have, so neither the ANE nor the RTX 4090 will be able to use all of its TOPS/TFLOPS in 99% of cases.
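To put that last paragraph in numbers, here is a rough roofline sketch. The compute peaks (~660 and ~38 INT8 TOPS) are the ones quoted above; the bandwidth figures (~1008 GB/s for an RTX 4090, ~120 GB/s for a base M4) are my assumed marketing numbers, not measurements:

```python
# Rough roofline sketch: a workload that streams its weights from memory
# once per inference performs roughly `ops_per_byte` int8 ops per byte read,
# so its ceiling is the lower of the compute roof and the bandwidth roof.
# Peak figures are assumed spec-sheet numbers, not measurements.

def effective_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    return min(peak_tops, bandwidth_gb_s * ops_per_byte / 1000.0)

for name, peak_tops, bw_gb_s in [("RTX 4090", 660, 1008), ("M4 NPU", 38, 120)]:
    # batch-1 decoding: one multiply-accumulate (~2 ops) per int8 weight byte
    print(f"{name}: ~{effective_tops(peak_tops, bw_gb_s, 2):.2f} TOPS usable at batch 1")
```

At batch 1 both chips sit far below their paper peaks, which is part of why the benchmark gap is so much smaller than the spec-sheet gap; bigger batches or heavy weight reuse push things back towards the compute limit.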


 
Last edited:
  • Like
Reactions: Kronsteen

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Geekbench AI was announced today. There are a number of scores that one can browse here:


Apart from the usual interest in score comparisons, what I found interesting is evidence supporting what @leman and @name99 have said previously: while people assumed Apple were “marketing” the Neural Engine by quoting Int8 figures rather than the FP16 figures they used to quote, there doesn’t seem to be as much difference between the two on Apple Silicon as there is on other platforms.

If we look at the figures from this tweet, we can see that FP16 performance is usually half that of Int8. On Apple Silicon devices that doesn’t seem to be the case: Int8 is higher, but the two are quite close.




Interestingly, a couple of results for the M4 iPad Pro on iOS 18 show significant improvements in scores, so much so that I am a little sceptical until they are verified.
https://browser.geekbench.com/ai/v1/4960
This also, but without a direct link, so I’m even more sceptical.



In any case, @leman and @name99, it looks like you were correct.

You need to keep up!
Here's my latest analysis of what's going on (from about a week ago):

GB6 AI probably does NOT reflect the issues I talk about in that post, because it was likely finalized before coremltools 8 was released. Even now they can't just switch to coremltools 8; they also need to run model characterization profiling to use W8A8.
Presumably in a few months this will all be clearer. My current assumption is that the A17/M4 DO in fact have doubled W8A8 hardware (but this hardware was not exposed to developers until coremltools 8 was released); however, some of the evidence for that claim, i.e. numbers for models claiming to use coremltools 8, comes from performance tweets that are fairly noisy, so I'd say it's best to keep an open mind either way.
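For anyone who wants to poke at this themselves, here is a rough sketch of the W8 (weight) half with coremltools; the A8 (activation) half is the extra calibration pass that only arrived with coremltools 8 and needs representative sample inputs, i.e. the "model characterization profiling" above. The model paths are placeholders, and this is not how Geekbench builds its models, just the general mechanism:

```python
# Sketch: 8-bit *weight* quantization of a Core ML model with coremltools.
# The activation (A8) side additionally requires running calibration data
# through coremltools 8's activation-quantization pass, which is omitted here.
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder ML Program

op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric")  # int8 weights
config = cto.OptimizationConfig(global_config=op_config)
quantized = cto.linear_quantize_weights(mlmodel, config=config)

quantized.save("model_w8.mlpackage")
```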
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
ONNX with DirectML on an RTX 4090 is such a bad comparison to CoreML. It should be at least TensorRT or CUTLASS vs CoreML, or just plain torch if they don't care about peak performance. CUDA kernels are typically 50-200% faster than DX12, never mind fully optimised mixed-precision kernels using Tensor Cores; even a simple torch-compiled model would be a lot faster than buggy DirectML. This is like comparing hand-optimised assembly to generic JS/Python code.

Just to show how bad this "benchmark" is: theoretical Int8 (Tensor Core) throughput of an RTX 4090 is around ~660 TOPS without sparsity, ~1300 TOPS with sparsity (and ~2600 TOPS for Int4), compared to a theoretical peak of ~38 TOPS for the M4 NPU with ~10x slower memory. Yet in this "benchmark" the Int8 results for the M4 are almost the same as for the RTX 4090, never mind that the NVIDIA FP16 result is 2x faster than its Int8 result when it should be 2x slower (~330 TFLOPS vs ~660 TOPS).

Using such bad software (not that CoreML is perfect, but it is a lot better than DirectML) should be criminal. With comparisons like this we could make Intel CPUs look 100x as performant as AMD's, just by compiling with ICPX on Intel and TinyCC on AMD.
You think this is about Apple HW vs nV hardware.

But for plenty of people the important distinction is Windows vs Apple OS, and in that context ONNX numbers matter, at least for as long as that's the way that MS delivers their AI products to their non-tech-savvy customers.
 

komuh

macrumors regular
May 13, 2023
126
113
You need to keep up!
Here's my latest analysis of what's going on (from about a week ago):

GB6 AI probably does NOT reflect the issues I talk about in that post, because it was likely finalized before coremltools 8 was released. Even now they can't just switch to coremltools 8; they also need to run model characterization profiling to use W8A8.
Presumably in a few months this will all be clearer. My current assumption is that the A17/M4 DO in fact have doubled W8A8 hardware (but this hardware was not exposed to developers until coremltools 8 was released); however, some of the evidence for that claim, i.e. numbers for models claiming to use coremltools 8, comes from performance tweets that are fairly noisy, so I'd say it's best to keep an open mind either way.
By W8A8 do you mean 2x Int8 matmul coprocessors, or just that two W8A8 ops fit in a single coprocessor (which would mean FP16 is half the rate of Int8)? If so, that is a move in the right direction.
 
Last edited:

komuh

macrumors regular
May 13, 2023
126
113
You think this is about Apple HW vs nV hardware.

But for plenty of people the important distinction is Windows vs Apple OS, and in that context ONNX numbers matter, at least for as long as that's the way that MS delivers their AI products to their non-tech-savvy customers.
But Microsoft isn't using DirectML for any "OS" AI product, at least not yet (are the Qualcomm AI features the only ones right now? If so, they use the NPU with custom compilers, not DX12).
 

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
OpenCL and CUDA won't differ much for classic shader code, but for an ML workload they will, because OpenCL can't use the Tensor Cores. (OpenCL vs CoreML would be the same: you can't reach the NPU from OpenCL, so you can't use the "accelerators" at all, and on an M4 iPad or an iPhone you'd get scores that are most likely 10-15x lower because you'd be limited to the GPU.)

An ML benchmark is totally meaningless if it uses the accelerators one system provides but not the other's. In pure OpenCL, the limit for this kind of AI workload would be around 50-60 TFLOPS (roughly 70-80% of the classic shader peak of the CUDA cores), but with CUDA it would be 330 TFLOPS for FP32 (Tensor Cores) plus 86 TFLOPS from the CUDA cores.

That's without even mentioning the peak of 2600 TOPS for Int4. And of course DirectML isn't accessing any of those accelerators, at least not in this version; it just calls DX12 shaders that reach only a fraction of the CUDA cores' potential (usually under 50%, and slower than just using PyTorch, which is itself pretty slow without compilation), so under 50% of that 86 TFLOPS.

Of course we are still talking about theoretical limits; most models are limited by memory speed rather than compute, especially with as much horsepower as newer GPUs and NPUs have, so neither the ANE nor the RTX 4090 will be able to use all of its TOPS/TFLOPS in 99% of cases.


It seems that not much consumer software uses the frameworks necessary to utilize these accelerators. As this is a consumer-focused benchmark, I doubt much will change.
 

komuh

macrumors regular
May 13, 2023
126
113
It seems that not much consumer software uses the frameworks necessary to utilize these accelerators. As this is a consumer-focused benchmark, I doubt much will change.
I mean, there is no company that I know of that uses DirectML. They use optimised, compiled models, either locally (for example Qualcomm's custom NPU compiler, or DaVinci Resolve, whose denoiser uses TensorRT) or via CUDA/TensorRT/CUTLASS etc. in the cloud (like 95% of the software you most likely use: RunwayML, Adobe, ChatGPT, Meta, Instagram, chatbots and many more). There is also Google AI with TPUs in the cloud, and the AI accelerators in phones. I have no idea whose idea it was to use DirectML for Nvidia (for AMD GPUs it is maybe a good idea, as their software is really bad), but I can't find any program I have installed that uses DirectML (not DaVinci, Stable Diffusion, nor local llama).

I have a lot more models that use Metal (on macOS), and even the newer Apple AI features aren't using the NPU on my system (for example the Xcode assistant and summaries, plus non-system tools like llama.cpp and AI plugins for Final Cut Pro). The only one that uses the NPU at all is the Stable Diffusion port produced by Apple and downloaded from GitHub.

But I'm open to seeing statistics showing that DirectML is indeed used for popular AI products on NVIDIA hardware in Windows systems.

*I of course agree that NPUs are pretty cool, especially on mobile devices, and that we should benchmark and compare different solutions. I just argue that we should do it fairly, especially if some sort of accelerator can also be used on the other platform (for example the Qualcomm NPU in PCs).
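On the llama.cpp point specifically: its offload targets whichever GPU backend it was built with (Metal on Apple Silicon, CUDA on NVIDIA) rather than the ANE/NPU, and that choice is just a constructor flag. A small sketch with the llama-cpp-python bindings (the GGUF filename is a placeholder):

```python
# Sketch: llama.cpp via the llama-cpp-python bindings. Layers are offloaded
# to the GPU backend the build was compiled for (Metal on Apple Silicon,
# CUDA on NVIDIA); the ANE/NPU is not used. The filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="model-q8_0.gguf",  # placeholder quantized model
    n_gpu_layers=-1,               # -1 = offload all layers; 0 = CPU only
    n_ctx=2048,
)
out = llm("Why do phones ship NPUs?", max_tokens=64)
print(out["choices"][0]["text"])
```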
 
Last edited:
  • Like
Reactions: appledevicefan420

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
It’s absolutely worth looking at the accuracy scores within the individual results. There is great variance. From the ones I’ve seen, Apple’s accuracy is very high. Some of the others…not so much.
 

Exclave

Suspended
Original poster
Jun 5, 2024
77
102
Because there is a difference between training models and inference, which is what INT8 is good for. Nvidia excels at training models in FP32.
Geekbench AI is an inference test. I assume Nvidia cards are used by people who have a need for inference, ergo the test is useful.
 
  • Like
Reactions: throAU and komuh

komuh

macrumors regular
May 13, 2023
126
113
Because there is a difference between training models and inference, which is what INT8 is good for.
Yes, 1260 TOPS of INT8 inside an RTX 4090 is not enough to compare with modern 30 TOPS NPUs; that's some hard copium, as theoretically the RTX 4090 could be 40x faster than the M4 NPU on an INT8 benchmark.
 

dmccloud

macrumors 68040
Sep 7, 2009
3,138
1,899
Anchorage, AK
I ran the GB6 AI Benchmark with all three backends, and the scores are vastly different for each:

(MBP 14" M2 Max)

CPU:
4152 Single Precision Score
6702 Half Precision Score
5707 Quantized Score

GPU:
12632 Single Precision Score
13815 Half Precision Score
10733 Quantized Score

Neural Engine:
4132 Single Precision Score
20807 Half Precision Score
23059 Quantized Score
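Those Neural Engine numbers also speak to the Int8-vs-FP16 question from the start of the thread. A quick ratio check on the scores quoted above:

```python
# Quantized-vs-half-precision ratios from the M2 Max scores above.
scores = {
    "CPU":           (6702, 5707),    # (half, quantized)
    "GPU":           (13815, 10733),
    "Neural Engine": (20807, 23059),
}
for unit, (half, quant) in scores.items():
    print(f"{unit}: quantized/half = {quant / half:.2f}")
# The Neural Engine comes out around 1.11x, nowhere near the ~2x jump you'd
# expect on hardware whose Int8 rate is double its FP16 rate.
```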
 
  • Like
Reactions: appledevicefan420

zarathu

macrumors 6502a
May 14, 2003
652
362
According to Apple in their presentation yesterday (10-31-24), the Neural Engine in the M4 is 3 times faster than the Neural Engine in the M1. This means that Apple did not change the Neural Engine from the M3 generation, since according to Geekbench AI the listed M3s were already 3.3 times faster than the listed M1s.

Another point to note is that the memory bandwidth of the M3 Max is already 400 GB/s, versus 273 GB/s for the M4 Pro. No software currently uses anywhere near even 200 GB/s of it, but one might assume software eventually will, and when it does the M3 Max will beat out the M4 Pro.

So if one is like me, upgrading only to use Topaz 3 for photography, then getting an M4 Pro is not going to give ME much of a speed bump. If I wait until after Christmas 2024, the price of an M3 Max with 48 or 64 GB of RAM will drop below the current price of an M4 Pro. And so I wait.
 

SG-

macrumors regular
Jun 8, 2015
150
88
The Neural Engine isn't the only thing doing AI acceleration; the GPU does a lot of it depending on the specific workload. That's where the newer generation sees a difference, especially moving up to the Pro and Max versions.

You'd have to see what the apps actually use; you can check yourself whether an app is using the GPU or the NE with 'mactop' from Homebrew.
 
  • Like
Reactions: Macintosh IIcx