
leman

macrumors Core
Oct 14, 2008
19,516
19,664
Ultimately I think you are hostage to the monic fallacy (everything must have a SINGLE solution) that has crippled so much of computing.

I don't think that I am. I just believe that the offered solutions should be viable and that there should be a clear criterion for which approach to choose. Because, as I wrote earlier, if I want to do a bunch of matrix multiplications (but don't care much about latency), on the base M1 I should choose the path via the AMX units. But on the M1 Max, using the GPU for this task will be the better choice. Right now, the scalability of AMX is unpredictable. But I agree that it's probably a minor nitpick. I am just concerned with programmability and predictability as we move forward to bigger and bigger chips...
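To make the AMX-via-Accelerate path concrete, here is a minimal sketch (my own illustration, not from the thread): it assumes a NumPy build linked against Accelerate, so the matmul below is dispatched to Apple's AMX-backed BLAS. The matrix size is arbitrary.

Code:
# Rough matmul throughput probe. Assumes NumPy is built against Accelerate,
# so the @ operator goes through Apple's BLAS (and hence the AMX units).
import time
import numpy as np

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# An n x n matmul costs roughly 2*n^3 floating-point operations.
print(f"~{2 * n**3 / elapsed / 1e12:.2f} TFLOPS")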

Who cares if there is an AMX block on a chip or not? Transistors are cheap! Providing an AMX block (and hiding it behind the Accelerate APIs) means that it can be used where appropriate (high-precision math, low-latency requirements) and not used where it isn't (low-precision math, throughput requirements).

Well, actually I would like to have more AMX blocks on the chip :) I think 3 TFLOPS of matmul performance is not enough.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
When you start seeing results uploaded. Right now, team blue has just dethroned the long-time dominant team red. All the current ARM representatives are in the low and bottom tiers.

https://openbenchmarking.org/test/pts/numpy
Can someone use the NumPy benchmark to test M1 SoCs?

It seems to be compatible with macOS, but there are no results for M1 SoCs.

It would also be very informative to see the results of the benchmark run in a Linux VM.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I don't think that I am. I just believe that the offered solutions should be viable and that there should be a clear criterion for which approach to choose. Because, as I wrote earlier, if I want to do a bunch of matrix multiplications (but don't care much about latency), on the base M1 I should choose the path via the AMX units. But on the M1 Max, using the GPU for this task will be the better choice. Right now, the scalability of AMX is unpredictable. But I agree that it's probably a minor nitpick. I am just concerned with programmability and predictability as we move forward to bigger and bigger chips...

If you want to "do a bunch of matrix multiplications" you should just use Accelerate. You still have not shown that it's actually an unsatisfactory solution...
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
@gl3lan Good news! PyTorch will get a Metal backend next year!
It’s nice to see Edward Z Yang involved in this. I followed his language theory research around a decade ago. Very smart dude.
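For anyone curious what using that backend looks like from Python, here is a hedged sketch of the usual device-selection pattern; "mps" is the device name the backend ended up using, and the tensor sizes are arbitrary.

Code:
# Sketch: run a matmul on the Metal (MPS) backend if it is available,
# otherwise fall back to the CPU.
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x
print(y.device)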
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
It’s nice to see Edward Z Yang involved in this. I followed his language theory research around a decade ago. Very smart dude.

Is there any documentation that explains (for someone technically savvy about computing, but not about ML details) just why something like PyTorch requires this large effort? What causes the mismatch between the primitives that PyTorch wants to use and the primitives offered by Apple in their various APIs?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Is there any documentation that explains (for someone technically savvy about computing, but not about ML details) just why something like PyTorch requires this large effort? What causes the mismatch between the primitives that PyTorch wants to use and the primitives offered by Apple in their various APIs?
I think you need to adapt the code to make it compatible with the Metal Shading Language. The Apple dev in charge of the Metal backend for Blender has listed what he needs to do for that. https://developer.blender.org/T92212

By the way, the Apple dev will also need 3-4 months to create the Metal backend.
 

jerryk

macrumors 604
Nov 3, 2011
7,421
4,208
SF Bay Area
Hi. I tried an image segmentation training task using TensorFlow/Keras on two GPUs, the Apple M1 and the NVIDIA Quadro RTX 6000. I also made an estimate for the M1 Max (I don't have an M1 Max machine).
The RTX 6000 is 20 times faster than the M1 (not Max or Pro) SoC when Automatic Mixed Precision is enabled on the RTX...

I posted the various results on Medium.
Nice article!

I am unfamiliar with the RTX 6000, but I am familiar with consumer NVIDIA graphics cards like the RTX 3070 and 3080. How do those cards compare in performance to an RTX 6000 for deep learning training and inference? I usually use TF/Keras/JAX.
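For reference, since the quoted comparison relies on Automatic Mixed Precision: in TF/Keras that is a one-line global policy. This is a sketch, not the article's code; on NVIDIA GPUs it maps to Tensor Core math, on other hardware it may not help.

Code:
# Enable mixed precision globally: layers compute in float16 while
# keeping float32 variables for numerical stability.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")
print(tf.keras.mixed_precision.global_policy())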
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Suggestion for posters on this thread: anytime you're citing the performance of an NVIDIA GPU, you should specify whether it's the desktop or laptop model, since, as a general rule of thumb, the desktop variant offers twice the performance of the laptop variant.

The problem is that NVIDIA uses the same model numbers for both, so just specifying the model number isn't enough. For instance, the RTX 3000 series GPUs, like the RTX 3090, come in both desktop and laptop variants. And likewise for the RTX A series GPUs (e.g., the RTX A5000).
 
  • Like
Reactions: jerryk and JMacHack

jerryk

macrumors 604
Nov 3, 2011
7,421
4,208
SF Bay Area
Suggestion for posters on this thread: anytime you're citing the performance of an NVIDIA GPU, you should specify whether it's the desktop or laptop model, since, as a general rule of thumb, the desktop variant offers twice the performance of the laptop variant.

The problem is that NVIDIA uses the same model numbers for both, so just specifying the model number isn't enough. For instance, the RTX 3000 series GPUs, like the RTX 3090, come in both desktop and laptop variants. And likewise for the RTX A series GPUs (e.g., the RTX A5000).
We should probably consider doing that as Mxxx chips start appearing in higher-powered Mac desktops. They are likely to have variants of the architectures used in portable systems.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I think you need to adapt the code to make it compatible with the Metal Shading Language. The Apple dev in charge of the Metal backend for Blender has listed what he needs to do for that. https://developer.blender.org/T92212

By the way, the Apple dev will also need 3-4 months to create the Metal backend.
That tells me what needs to be done to port existing code to Metal. It does not tell me WHY the code needs to be ported to Metal as opposed to just calling Apple APIs, e.g. CoreML or Accelerate's BNNS.

I'm not saying the port is dumb! I'm trying to understand the gap between the APIs Apple provides vs. what tools like PyTorch do.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
That tells me what needs to be done to port existing code to Metal. It does not tell me WHY the code needs to be ported to Metal as opposed to just calling Apple APIs, e.g. CoreML or Accelerate's BNNS.
The BNNS library uses the CPU (https://developer.apple.com/documentation/accelerate/bnns), while the MPS/MPS Graph frameworks use the GPU (https://developer.apple.com/documentation/metalperformanceshaders).

P.S.: It seems that the MPS Graph syntax is more similar to TensorFlow 1 syntax than to PyTorch syntax, so it may be easier to provide a Metal backend for TensorFlow than for PyTorch.
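To illustrate that point with a toy example (my own sketch, not code from either framework's Metal backend): a TF1-style program builds an explicit graph first and runs it later, which is structurally close to building an MPSGraph, whereas PyTorch executes each op immediately.

Code:
# Graph style (TF1 via the compat API): describe the computation, then run it.
import tensorflow as tf

tf.compat.v1.disable_eager_execution()
x = tf.compat.v1.placeholder(tf.float32, shape=(2, 2))
y = tf.matmul(x, x)                      # just adds a node to the graph
with tf.compat.v1.Session() as sess:
    print(sess.run(y, feed_dict={x: [[1., 2.], [3., 4.]]}))

# Eager style (PyTorch): each op runs as soon as it is called, so a backend
# has to dispatch individual kernels rather than compile a whole graph.
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
print(a @ a)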
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@gl3lan How good is the Metal backend for TensorFlow?
- Does it support all popular architectures?
- How big should the model and batch size be for the GPU to outperform the CPU? (See the sketch below.)
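One rough way to probe that crossover yourself is to time the same op on both devices; here is a hedged sketch, assuming tensorflow-macos and the tensorflow-metal plugin are installed (sizes and iteration counts are arbitrary).

Code:
# Time an n x n matmul on the CPU and on the Metal GPU to see where the GPU
# starts to win. Small sizes usually favour the CPU due to dispatch overhead.
import time
import tensorflow as tf

def time_matmul(device, n, iters=10):
    with tf.device(device):
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        start = time.perf_counter()
        for _ in range(iters):
            c = tf.matmul(a, b)
        _ = c.numpy()                    # force execution before stopping the clock
    return (time.perf_counter() - start) / iters

for n in (256, 1024, 4096):
    print(n, time_matmul("/CPU:0", n), time_matmul("/GPU:0", n))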
 

project_2501

macrumors 6502a
Jul 1, 2017
676
792
So does all this mean that, at some point in the near future, an official download of Python from python.org for macOS will automatically take advantage of Apple Silicon for numpy/scipy installed using the standard tool "pip"?

I realise PyTorch / TF are separate from official Python.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
So does all this mean that, at some point in the near future, an official download of Python from python.org for macOS will automatically take advantage of Apple Silicon for numpy/scipy installed using the standard tool "pip"?

It already does today. However, if you want your NumPy to take advantage of the Apple AMX units, you have to build it with Accelerate support (which, as I understand it, is not the default configuration for the available packages).
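A quick way to check which BLAS a given NumPy build is linked against (a sketch; the exact output format varies between NumPy versions, but it shows whether Accelerate or OpenBLAS is in use):

Code:
# Print NumPy's build configuration; look for "accelerate" vs "openblas"
# in the BLAS/LAPACK sections.
import numpy as np

np.show_config()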
 

project_2501

macrumors 6502a
Jul 1, 2017
676
792
It already does today. However, if you want your NumPy to take advantage of the Apple AMX units, you have to build it with Accelerate support (which, as I understand it, is not the default configuration for the available packages).
hi Leman - just to be clear, do you mean Python runs natively on Apple Silicon, or do you mean it takes advantage of the Accelerate API to speed up some functions using hardware?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
hi Leman - just to be clear, do you mean Python runs natively on Apple Silicon, or do you mean it takes advantage of the Accelerate API to speed up some functions using hardware?
1. Python and most Python packages that we can install through pip or conda have been compiled to run natively on Apple Silicon.

2. Only a few Python packages take advantage of Apple Silicon's dedicated hardware. For instance, NumPy (when built against Accelerate) uses the Apple AMX coprocessor, and Apple's TensorFlow fork uses the Apple GPU.
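Two quick checks along those lines (a sketch; the GPU part assumes the tensorflow-macos and tensorflow-metal packages are installed):

Code:
# 1) Is the interpreter itself running natively on Apple Silicon?
import platform
print(platform.machine())   # 'arm64' = native, 'x86_64' = under Rosetta 2

# 2) Does TensorFlow see the Apple GPU via the Metal plugin?
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))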
 
  • Like
Reactions: project_2501

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,219
Either marketing or binning.

Obviously not marketing, since they only advertised one and the second is clearly disabled.

Probably not binning either, since that should result in one of the two being randomly disabled. But it appears that the *same* one is always disabled: ane1; ane0 is always the active one. Now, it could be that somehow they've tied the label to the active/inactive ones, but that seems unlikely given the data so far. Still possible, I suppose.

Hector's other theory is a silicon erratum that stops the second one from working right in some esoteric way and wasn't caught in time, so they had to disable the second Neural Engine.
 

jjcs

Cancelled
Oct 18, 2021
317
153
Obviously not marketing, since they only advertised one and the second is clearly disabled.

Probably not binning either, since that should result in one of the two being randomly disabled. But it appears that the *same* one is always disabled: ane1; ane0 is always the active one. Now, it could be that somehow they've tied the label to the active/inactive ones, but that seems unlikely given the data so far. Still possible, I suppose.

Hector's other theory is a silicon erratum that stops the second one from working right in some esoteric way and wasn't caught in time, so they had to disable the second Neural Engine.
That could be. Unless they want to enable the second one for some sort of "M2" without further R&D. The erratum explanation might be more likely, though.
 
  • Like
Reactions: Macintosh IIcx

leman

macrumors Core
Oct 14, 2008
19,516
19,664
That is odd indeed. I guess the theory that the die shots from Apple were doctored was wrong. So, why two neural engines? Something to do with the upcoming die interconnects?

My entirely unfounded and completely arbitrary supposition would be that the Neural Engine software was never designed to distribute work between multiple hardware units, so to keep things consistent they only use one. Not that it matters; the NPU is used as a signal/image processor most of the time, and it's more than fast enough for that purpose.
 