
leman

macrumors Core
Oct 14, 2008
19,516
19,664
Ultimately I think you are hostage to the monic fallacy (everything must have a SINGLE solution) that has crippled so much of computing.

I don't think that I am. I just believe that the offered solutions should be viable and that there should be a clear criterion for which approach to choose. Because, as I wrote earlier, if I want to do a bunch of matrix multiplications (but don't care much about latency), on the base M1 I should choose the path via the AMX units. But on the M1 Max, using the GPU for this task will be the better choice. Right now, the scalability of AMX is unpredictable. But I agree that it's probably a minor nitpick. I am just concerned with programmability and predictability as we move forward to bigger and bigger chips...
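To make the AMX-via-Accelerate path concrete, here is a minimal sketch (my own illustration, not from the thread): it assumes a NumPy build linked against Accelerate, so the matmul below is dispatched to Apple's AMX-backed BLAS. The matrix size is arbitrary.

Code:
# Rough matmul throughput probe. Assumes NumPy is built against Accelerate,
# so the @ operator goes through Apple's BLAS (and hence the AMX units).
import time
import numpy as np

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# An n x n matmul costs roughly 2*n^3 floating-point operations.
print(f"~{2 * n**3 / elapsed / 1e12:.2f} TFLOPS")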

Who cares if there is an AMX block on a chip or not? Transistors are cheap! Providing an AMX block (and hiding it behind the Accelerate APIs) means that it can be used where appropriate (high-precision math, low-latency requirements) and not used where it isn't (low-precision math, throughput requirements).

Well, actually I would like to have more AMX blocks on the chip :) I think 3 TFLOPS of matmul performance is not enough.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
When you start seeing results uploaded. Right now, team blue has just dethroned the long-time dominant team red. All the current ARM representatives are in the low and bottom tiers.

https://openbenchmarking.org/test/pts/numpy
Can someone use the NumPy benchmark to test M1 SoCs?

It seems to be compatible with macOS, but there are no results for M1 SoCs.

It would also be very informative to see the results of the benchmark run in a Linux VM.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I don't think that I am. I just believe that the offered solutions should be viable and that there should be a clear criterion for which approach to choose. Because, as I wrote earlier, if I want to do a bunch of matrix multiplications (but don't care much about latency), on the base M1 I should choose the path via the AMX units. But on the M1 Max, using the GPU for this task will be the better choice. Right now, the scalability of AMX is unpredictable. But I agree that it's probably a minor nitpick. I am just concerned with programmability and predictability as we move forward to bigger and bigger chips...

If you want to "do a bunch of matrix multiplications" you should just use Accelerate. You still have not shown that it's actually an unsatisfactory solution...
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
@gl3lan Good news! PyTorch will get a Metal backend next year!
It’s nice to see Edward Z Yang involved in this. I followed his language theory research around a decade ago. Very smart dude.
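For anyone curious what using that backend looks like from Python, here is a hedged sketch of the usual device-selection pattern; "mps" is the device name the backend ended up using, and the tensor sizes are arbitrary.

Code:
# Sketch: run a matmul on the Metal (MPS) backend if it is available,
# otherwise fall back to the CPU.
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x
print(y.device)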
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
It’s nice to see Edward Z Yang involved in this. I followed his language theory research around a decade ago. Very smart dude.

Is there any documentation that explains (for someone technically savvy about computing, but not about ML details) just why something like PyTorch requires this large effort? What causes the mismatch between the primitives that PyTorch wants to use and the primitives offered by Apple in their various APIs?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Is there any documentation that explains (for someone technically savvy about computing, but not about ML details) just why something like PyTorch requires this large effort? What causes the mismatch between the primitives that PyTorch wants to use and the primitives offered by Apple in their various APIs?
I think you need to adapt the code to make it compatible with the Metal Shading Language. The Apple dev in charge of the Metal backend for Blender has listed what he needs to do for that. https://developer.blender.org/T92212

By the way, the Apple dev will also need 3-4 months to create the Metal backend.
 

jerryk

macrumors 604
Nov 3, 2011
7,421
4,208
SF Bay Area
Hi. I tried an image segmentation training task using TensorFlow/Keras on two GPUs, the Apple M1 and the NVIDIA Quadro RTX 6000. I also made an estimate for the M1 Max (I don't have an M1 Max machine).
The RTX 6000 is 20 times faster than the M1 (not Max or Pro) SoC when Automatic Mixed Precision is enabled on the RTX...

I posted the various results on Medium.
Nice article!

I am unfamiliar with the RTX 6000, but I am familiar with consumer NVIDIA graphics cards like the RTX 3070 and 3080. How do those cards compare in performance to an RTX 6000 for deep learning training and inference? I usually use TF/Keras/JAX.
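For reference, since the quoted comparison relies on Automatic Mixed Precision: in TF/Keras that is a one-line global policy. This is a sketch, not the article's code; on NVIDIA GPUs it maps to Tensor Core math, on other hardware it may not help.

Code:
# Enable mixed precision globally: layers compute in float16 while
# keeping float32 variables for numerical stability.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")
print(tf.keras.mixed_precision.global_policy())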
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Suggestion for posters on this thread: anytime you're citing the performance of an NVIDIA GPU, you should specify whether it's the desktop or laptop model, since, as a general rule of thumb, the desktop variant offers twice the performance of the laptop variant.

The problem is that NVIDIA uses the same model numbers for both, so just specifying the model number isn't enough. For instance, the RTX 3000 series GPUs, like the RTX 3090, come in both desktop and laptop variants. And likewise for the RTX A series GPUs (e.g., the RTX A5000).
 
  • Like
Reactions: jerryk and JMacHack

jerryk

macrumors 604
Nov 3, 2011
7,421
4,208
SF Bay Area
Suggestion for posters on this thread: anytime you're citing the performance of an NVIDIA GPU, you should specify whether it's the desktop or laptop model, since, as a general rule of thumb, the desktop variant offers twice the performance of the laptop variant.

The problem is that NVIDIA uses the same model numbers for both, so just specifying the model number isn't enough. For instance, the RTX 3000 series GPUs, like the RTX 3090, come in both desktop and laptop variants. And likewise for the RTX A series GPUs (e.g., the RTX A5000).
We should probably consider doing that as Mxxx chips start appearing in higher-powered Mac desktops. They are likely to have variants of the architectures used in portable systems.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I think you need to adapt the code to make it compatible with the Metal Shading Language. The Apple dev in charge of the Metal backend for Blender has listed what he needs to do for that. https://developer.blender.org/T92212

By the way, the Apple dev will also need 3-4 months to create the Metal backend.
That tells me what needs to be done to port existing code to Metal. It does not tell me WHY the code needs to be ported to Metal as opposed to just calling Apple APIs, e.g. CoreML or Accelerate's BNNS.

I'm not saying the port is dumb! I'm trying to understand the gap between the APIs Apple provides vs. what tools like PyTorch do.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
That tells me what needs to be done to port existing code to Metal. It does not tell me WHY the code needs to be ported to Metal as opposed to just calling Apple APIs, e.g. CoreML or Accelerate's BNNS.
The BNNS library uses the CPU (https://developer.apple.com/documentation/accelerate/bnns), while the MPS/MPS Graph frameworks use the GPU (https://developer.apple.com/documentation/metalperformanceshaders).

P.S.: It seems that the MPS Graph syntax is more similar to TensorFlow 1 syntax than to PyTorch syntax, so it may be easier to provide a Metal backend for TensorFlow than for PyTorch.
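To illustrate that point with a toy example (my own sketch, not code from either framework's Metal backend): a TF1-style program builds an explicit graph first and runs it later, which is structurally close to building an MPSGraph, whereas PyTorch executes each op immediately.

Code:
# Graph style (TF1 via the compat API): describe the computation, then run it.
import tensorflow as tf

tf.compat.v1.disable_eager_execution()
x = tf.compat.v1.placeholder(tf.float32, shape=(2, 2))
y = tf.matmul(x, x)                      # just adds a node to the graph
with tf.compat.v1.Session() as sess:
    print(sess.run(y, feed_dict={x: [[1., 2.], [3., 4.]]}))

# Eager style (PyTorch): each op runs as soon as it is called, so a backend
# has to dispatch individual kernels rather than compile a whole graph.
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
print(a @ a)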
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@gl3lan How good is the Metal backend for TensorFlow?
- Does it support all popular architectures?
- How big should the model and batch size be for the GPU to outperform the CPU? (See the sketch below.)
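One rough way to probe that crossover yourself is to time the same op on both devices; here is a hedged sketch, assuming tensorflow-macos and the tensorflow-metal plugin are installed (sizes and iteration counts are arbitrary).

Code:
# Time an n x n matmul on the CPU and on the Metal GPU to see where the GPU
# starts to win. Small sizes usually favour the CPU due to dispatch overhead.
import time
import tensorflow as tf

def time_matmul(device, n, iters=10):
    with tf.device(device):
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        start = time.perf_counter()
        for _ in range(iters):
            c = tf.matmul(a, b)
        _ = c.numpy()                    # force execution before stopping the clock
    return (time.perf_counter() - start) / iters

for n in (256, 1024, 4096):
    print(n, time_matmul("/CPU:0", n), time_matmul("/GPU:0", n))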
 

project_2501

macrumors 6502a
Jul 1, 2017
676
792
So does all this mean that, at some point in the near future, an official download of Python from python.org for macOS will automatically take advantage of Apple Silicon for numpy/scipy installed using the standard tool "pip"?

I realise PyTorch / TF are separate from official Python.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
So does all this mean that, at some point in the near future, an official download of Python from python.org for macOS will automatically take advantage of Apple Silicon for numpy/scipy installed using the standard tool "pip"?

It already does today. However, if you want your NumPy to take advantage of the Apple AMX units, you have to build it with Accelerate support (which, as I understand it, is not the default configuration for the available packages).
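A quick way to check which BLAS a given NumPy build is linked against (a sketch; the exact output format varies between NumPy versions, but it shows whether Accelerate or OpenBLAS is in use):

Code:
# Print NumPy's build configuration; look for "accelerate" vs "openblas"
# in the BLAS/LAPACK sections.
import numpy as np

np.show_config()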
 

project_2501

macrumors 6502a
Jul 1, 2017
676
792
It already does today. However, if you want your NumPy to take advantage of the Apple AMX units, you have to build it with Accelerate support (which, as I understand it, is not the default configuration for the available packages).
hi Leman - just to be clear, do you mean Python runs natively on Apple Silicon, or do you mean it takes advantage of the Accelerate API to speed up some functions using hardware?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
hi Leman - just to be clear, do you mean Python runs natively on Apple Silicon, or do you mean it takes advantage of the Accelerate API to speed up some functions using hardware?
1. Python and most Python packages that we can install through pip or conda have been compiled to run natively on Apple Silicon.

2. Only a few Python packages take advantage of Apple Silicon's dedicated hardware. For instance, NumPy (when built against Accelerate) uses the Apple AMX coprocessor, and Apple's TensorFlow fork uses the Apple GPU.
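Two quick checks along those lines (a sketch; the GPU part assumes the tensorflow-macos and tensorflow-metal packages are installed):

Code:
# 1) Is the interpreter itself running natively on Apple Silicon?
import platform
print(platform.machine())   # 'arm64' = native, 'x86_64' = under Rosetta 2

# 2) Does TensorFlow see the Apple GPU via the Metal plugin?
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))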
 
  • Like
Reactions: project_2501

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,219
Either marketing or binning.

Obviously not marketing, since they only advertised one and the second is clearly disabled.

Probably not binning either, since that should result in one of the two being randomly disabled. But it appears that the *same* one is always disabled: ane1; ane0 is always the active one. Now, it could be that somehow they've tied the label to the active/inactive ones, but that seems unlikely given the data so far. Still possible, I suppose.

Hector's other theory is a silicon erratum that stops the second one from working right in some esoteric way and wasn't caught in time, so they had to disable the second Neural Engine.
 

jjcs

Cancelled
Oct 18, 2021
317
153
Obviously not marketing, since they only advertised one and the second is clearly disabled.

Probably not binning either, since that should result in one of the two being randomly disabled. But it appears that the *same* one is always disabled: ane1; ane0 is always the active one. Now, it could be that somehow they've tied the label to the active/inactive ones, but that seems unlikely given the data so far. Still possible, I suppose.

Hector's other theory is a silicon erratum that stops the second one from working right in some esoteric way and wasn't caught in time, so they had to disable the second Neural Engine.
That could be. Unless they want to enable the second one for some sort of "M2" without further R&D. The erratum explanation might be more likely, though.
 
  • Like
Reactions: Macintosh IIcx

leman

macrumors Core
Oct 14, 2008
19,516
19,664
That is odd indeed. I guess the theory that the die shots from Apple were doctored was wrong. So, why two neural engines? Something to do with the upcoming die interconnects?

My entirely unfounded and completely arbitrary supposition would be that the Neural Engine software was never designed to distribute work between multiple hardware units, so to keep things consistent they only use one. Not that it matters; the NPU is used as a signal/image processor most of the time, and it's more than fast enough for that purpose.
 