
leman

macrumors Core
Oct 14, 2008
19,516
19,664
Something doesn't make sense... it seems like the AMX is a souped-up math coprocessor, but how is it more efficient than using the GPU? Something else must be in play (e.g., they built LAPACK/BLAS directly into the hardware instruction set?!?)

Well, the AMX hardware is highly specialized: it performs only a subset of operations (mostly matmul-type stuff), requires a specific memory layout, etc. The GPU is more flexible.
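For context, third-party code doesn't target AMX directly; it goes through Apple's Accelerate framework (BLAS/LAPACK, vDSP, etc.), as discussed further down this thread. A rough sketch of what that path looks like from Python, assuming a NumPy build that links Accelerate (not all builds do, so treat this as an illustration rather than proof of AMX use):

```python
# Time a large single-precision matmul through NumPy's BLAS backend. On macOS,
# if NumPy is linked against Accelerate, this goes through cblas_sgemm and (on
# Apple Silicon) can be serviced by the AMX units -- depending on how NumPy was built.
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b  # warm-up call

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    a @ b
dt = (time.perf_counter() - t0) / reps

gflops = 2 * n ** 3 / dt / 1e9  # a matmul is ~2*n^3 floating-point ops
print(f"{dt * 1e3:.1f} ms per {n}x{n} sgemm, ~{gflops:.0f} GFLOPS")
```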
 
  • Like
Reactions: throAU

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Well, the AMX hardware is highly specialized: it performs only a subset of operations (mostly matmul-type stuff), requires a specific memory layout, etc. The GPU is more flexible.
This is a comparison between Apple's M1 matmul performance and ARMv8.6's standard NEON instructions:

 
  • Like
Reactions: throAU and altaic

ingambe

macrumors 6502
Mar 22, 2020
320
355
No, AMX and AME are two different things. The confusing part is that Apple offers three ways of doing ML on their hardware: AME, AMX, and the GPU. The AME seems to be limited to tasks like the audio and image processing that Apple uses for its own apps, AMX is a general-purpose matrix multiplication unit (good for model training), and the GPU is the most flexible but also the least efficient of the three.
Interesting
But in that case, how is that different from Intel's CPUs, which have instructions optimized for matrix multiplication (Intel even offers an MKL-based TensorFlow build for best performance)?
In the end, matrix multiplication is at the core of DL. It would be surprising not to have this optimized in the AME.

Edit: the article above helps explain the interaction between the CPU/AME and AMX.
It's still not clear to me how this differs from x86 chips that also have matrix-multiplication acceleration hardware.
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Interesting
But in that case, how is that different from Intel's CPUs, which have instructions optimized for matrix multiplication (Intel even offers an MKL-based TensorFlow build for best performance)?

It’s not. Apple AMX is actually fairly similar in design to Intel AMX as far as I can tell. The main difference is that Apple has been shipping matrix coprocessors on all of its products since 2019; Intel's equivalent will only ship on server-class CPUs next year.


In the end, matrix multiplication is at the core of DL; it would be surprising not to have this optimized in the AME

These are simply optimized for different usage scenarios.

The NPU (AME) seems to be strictly for inference, supports only a subset of models, and is extremely power efficient. Of course it has to consist of matrix multiplication units, but they must be some sort of fixed-function hardware with only limited functionality.

AMX units are more programmable but consume more energy. Also, they are obviously designed to be used in tandem with CPU workflows, as they are driven directly by the CPU instruction stream.

Finally, the GPU is there if you need the most flexibility.
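To make the split concrete, here is a minimal sketch of how the three paths look from Python, under some assumptions: the GPU is reached via TensorFlow with the tensorflow-metal plugin, AMX is only reached indirectly through Accelerate-backed BLAS (e.g. NumPy, backend-dependent), and the NPU is only reachable through Core ML and only for inference, with the runtime deciding whether it is actually used.

```python
import numpy as np
import tensorflow as tf  # assumes the tensorflow-macos + tensorflow-metal packages are installed

n = 2048
x = np.random.rand(n, n).astype(np.float32)

# GPU path (Metal plugin) -- explicit device placement:
with tf.device("/GPU:0"):
    y_gpu = tf.matmul(x, x)

# CPU path -- a plain BLAS matmul; whether AMX gets used depends on the BLAS backend:
y_cpu = x @ x

# The two results should agree up to fp32 rounding:
print("max abs difference:", float(np.max(np.abs(y_gpu.numpy() - y_cpu))))
```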
 

ingambe

macrumors 6502
Mar 22, 2020
320
355
It’s not. Apple AMX is actually fairly similar in design to Intel AMX as far as I can tell. The main difference is that Apple has been shipping matrix coprocessors on all of its products since 2019; Intel's equivalent will only ship on server-class CPUs next year.




These are simply optimized for different usage scenarios.

The NPU (AME) seems to be strictly for inference, supports only a subset of models, and is extremely power efficient. Of course it has to consist of matrix multiplication units, but they must be some sort of fixed-function hardware with only limited functionality.

AMX units are more programmable but consume more energy. Also, they are obviously designed to be used in tandem with CPU workflows, as they are driven directly by the CPU instruction stream.

Finally, the GPU is there if you need the most flexibility.
Thank you, that’s very interesting
 

Stakashi

macrumors newbie
Nov 1, 2021
1
2
I would be happy to run any other benchmark if suggested (or help someone run the benchmark on an M1 Max chip), even if I am more of a PyTorch guy. ;-)

[edit] Makes me wonder whether I should have gone for the M1 Max chip... probably not.
Hi. I tried a training task of image segmentation using TensorFlow/Keras on two GPUs: an Apple M1 and an NVIDIA Quadro RTX 6000. I also made an estimate for the M1 Max (I don't have an M1 Max machine).
The RTX 6000 is 20 times faster than the M1 (not Max or Pro) SoC when Automatic Mixed Precision is enabled on the RTX...

I posted the various results on Medium.
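For reference, the Automatic Mixed Precision setting on the NVIDIA side corresponds roughly to the Keras mixed-precision policy below (a minimal sketch; how much, if anything, it buys on the M1 GPU is a separate question that this thread hasn't settled):

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

mixed_precision.set_global_policy("mixed_float16")  # compute in fp16, keep fp32 master weights

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    layers.GlobalAveragePooling2D(),
    # Keep the final layer in float32 so the softmax stays numerically stable.
    layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
print(mixed_precision.global_policy())  # mixed_float16
```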
 
  • Like
Reactions: iDron and jerryk

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
If you are interested in more extensive benchmarks, you can have a look here. Basically, the M1 Max is around 8 times slower than an RTX 3090 (the 3090 benchmark being run in fp16 precision for maximum speed), but consumes 8 times less power.

I ran the ResNet50 benchmark on my M1 Pro (16GB RAM) and achieved circa 65 img/sec (half the M1 Max throughput, as expected); the RAM pressure was sometimes orange during the benchmark.

I also ran the same benchmark on an RTX 2080 Ti (256 img/sec in fp32 precision, 620 img/sec in fp16 precision) and on a 2015-era GeForce GTX TITAN X (128 img/sec in fp32 precision, 170 img/sec in fp16 precision).
It seems TensorFlow's performance on the M1 GPU improves when the batch size is slightly larger than usual.

@gl3lan Can you check if performance improves when the batch size is 512 or 1024?
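In case anyone else wants to try, here is a rough synthetic-data sketch (not the exact benchmark referenced above) where the batch size is a single constant to change. Note that batch sizes like 512 or 1024 at 224x224 may well exceed 16 GB of unified memory.

```python
# Minimal synthetic-data ResNet50 throughput sketch; vary BATCH and compare img/sec.
import time
import tensorflow as tf

BATCH = 256  # try 64 / 128 / 256 / 512 ... memory permitting

model = tf.keras.applications.ResNet50(weights=None, input_shape=(224, 224, 3), classes=1000)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

images = tf.random.uniform((BATCH, 224, 224, 3))
labels = tf.random.uniform((BATCH,), maxval=1000, dtype=tf.int32)
data = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(BATCH)

model.fit(data, steps_per_epoch=5, epochs=1)  # warm-up

steps = 20
t0 = time.perf_counter()
model.fit(data, steps_per_epoch=steps, epochs=1, verbose=0)
dt = time.perf_counter() - t0
print(f"{BATCH * steps / dt:.1f} img/sec at batch size {BATCH}")
```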
 

Bil21

macrumors newbie
Nov 2, 2021
2
0
In case anyone is interested, I ran a fairly simple MNIST benchmark (proposed here: https://github.com/apple/tensorflow_macos/issues/25) on my recently acquired M1 Pro MBP (16-core GPU, 16GB RAM). I installed TensorFlow using the following guide (https://developer.apple.com/metal/tensorflow-plugin/).

For reference, this benchmark seems to run at around 24ms/step on the M1 GPU.

On the M1 Pro, the benchmark runs at between 11 and 12ms/step (twice the TFLOPS, twice as fast as the M1 chip).

The same benchmark gives 6ms/step on an RTX 2080 (13.5 fp32 TFLOPS) and 8ms/step on a GeForce GTX TITAN X (6.7 fp32 TFLOPS). A similar level of performance should also be expected from the M1 Max GPU (which should run twice as fast as the M1 Pro).

Of course, this benchmark runs a fairly simple CNN model, but it already gives an idea. Keep in mind also that RTX-generation cards are able to run faster at fp16 precision; I am not sure the same would apply to Apple Silicon.

I would be happy to run any other benchmark if suggested (or help someone run the benchmark on an M1 Max chip), even if I am more of a PyTorch guy. ;-)

[edit] Makes me wonder whether I should have gone for the M1 Max chip... probably not.
Hello gl3lan,

Thank you so much for creating this thread; it's actually the information I've been looking for for a while.
My question is this: I got the MacBook Pro 16-inch with 16GB of RAM, a 512GB SSD, a 10-core CPU, and a 16-core GPU.
Do you think that's more than enough for a machine learning developer, or should I upgrade the RAM, processor, and SSD?
Your answer is very important to me, since I can still return the one I got and make a hardware upgrade.

Thank you so much again!
 

gl3lan

macrumors newbie
Original poster
May 11, 2020
19
22
It seems TensorFlow's performance on the M1 GPU improves when the batch size is slightly larger than usual.

@gl3lan Can you check if performance improves when the batch size is 512 or 1024?
I am afraid I do not have enough RAM to run such batch sizes; I only have 16 GB.
 

gl3lan

macrumors newbie
Original poster
May 11, 2020
19
22
Hello gl3lan,

Thank you so much for creating this thread; it's actually the information I've been looking for for a while.
My question is this: I got the MacBook Pro 16-inch with 16GB of RAM, a 512GB SSD, a 10-core CPU, and a 16-core GPU.
Do you think that's more than enough for a machine learning developer, or should I upgrade the RAM, processor, and SSD?
Your answer is very important to me, since I can still return the one I got and make a hardware upgrade.

Thank you so much again!
It honestly depends on your use case: the type of ML (deep or not?), dataset sizes... For example, I happen to work with datasets of hundreds of GB, but the training is usually done on a server/cloud instance/workstation where the data is located. I think the machine is sufficient for designing and testing ML models. When developing, you only need to run a DL model on tiny batch sizes/random inputs (which requires little RAM); once it is coded, you perform training on more capable hardware. Should you require more compute, you are better off with a cloud instance. IMHO, upgrading the RAM or GPU is not worth it considering the cost/performance gain (you won't train a model for days on a laptop anyway if you can train 10 times faster on a cloud instance). Upgrading the SSD might make more sense if you have to perform data analysis locally on huge datasets.

I actually went for the 10/16/1TB version. This is a personal machine, but I will sometimes use it for work-related personal projects. My work consists of deep-learning-based applied research.

I might invest more in a future MBP/Mac Pro revision, but for me the chip is still a work in progress regarding machine learning (especially considering it is not yet supported by PyTorch).
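For what it's worth, the "tiny batch sizes/random inputs" development step mentioned above can be as simple as a smoke test like this (a minimal sketch; the model, shapes, and helper name are placeholders):

```python
# Run a single training step on random data so model code can be validated locally
# with very little RAM before launching a real run on more capable hardware.
import tensorflow as tf

def smoke_test(model, input_shape, num_classes, batch_size=4):
    """One forward/backward pass on random inputs to catch shape/dtype bugs early."""
    x = tf.random.uniform((batch_size, *input_shape))
    y = tf.random.uniform((batch_size,), maxval=num_classes, dtype=tf.int32)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    history = model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    return history.history["loss"][0]

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
print("loss after one step:", smoke_test(model, (64, 64, 3), 10))
```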
 

Bil21

macrumors newbie
Nov 2, 2021
2
0
It honestly depends on your use case: the type of ML (deep or not?), dataset sizes... For example, I happen to work with datasets of hundreds of GB, but the training is usually done on a server/cloud instance/workstation where the data is located. I think the machine is sufficient for designing and testing ML models. When developing, you only need to run a DL model on tiny batch sizes/random inputs (which requires little RAM); once it is coded, you perform training on more capable hardware. Should you require more compute, you are better off with a cloud instance. IMHO, upgrading the RAM or GPU is not worth it considering the cost/performance gain (you won't train a model for days on a laptop anyway if you can train 10 times faster on a cloud instance). Upgrading the SSD might make more sense if you have to perform data analysis locally on huge datasets.

I actually went for the 10/16/1TB version. This is a personal machine, but I will sometimes use it for work-related personal projects. My work consists of deep-learning-based applied research.

I might invest more in a future MBP/Mac Pro revision, but for me the chip is still a work in progress regarding machine learning (especially considering it is not yet supported by PyTorch).
Thank you so much,
Your answer is more than appreciated. I was really confused and torn between the base model and upgrading, but after your answer I'll be keeping the base model. I'm pretty sure now that the MacBook Pro 16-inch with the M1 Pro and 16GB of RAM will be more than enough.
You saved me money and stress.
Thank you again!
 

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,219
Yes... that article was informative in explaining things a bit... I'm not sure I'm so excited about this new architecture if Apple's going to keep things under lock and key.

There's no choice under the circumstances, unless ARM adopts a similar or better instruction set. ARMv9 does indeed introduce such a thing, so we'll see whether Apple adopts it, or maybe it will end up similar to what Apple was already doing. But right now, access through an API is the only way.
 
Last edited:

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Interesting
But in that case, how is that different from Intel's CPU who has instruction optimized for matrix multiplication (Intel even proposes some MKL TensorFlow version for best performance)?
In the end, matrix multiplication is at the core of DL. It would be surprising not to have this optimized in the AME

Edit: the above article helps to understand the interaction between the CPU/AME and AMX
It's still not clear to me how it's different from x86 chips that also have matrix multiplication accelerated hardware
Differences are at a technical level.

- AMX instructions look like normal ARM instructions and are embedded in the standard ARM instruction stream, executed in a core's load/store unit.
- The instructions are sent to the L2 of the cluster the core belongs to (P or E) and are executed by the AMX unit for that cluster, which can be considered as associated with the L2. Essentially it uses the L2 as its "cache" and has low-latency access to it.
- AMX instructions provide at least double, float, and fp16 operations.

Intel AMX seems to provide only int8, int16 (I think), and bf16 instructions. It's unclear to me how tightly Intel AMX is coupled to the L2 (or L3) (i.e., low-latency load/store) or to the instruction stream (i.e., low-latency dispatch of instructions rather than a whole drama of "creating a code package to mail to an accelerator, then telling it to execute that").

Apple provides a MATRIX accelerator, good for a variety of FP work.
Intel provides an AI accelerator that does the job for INFERENCE and may be good enough for most training. (Training almost always requires FP, but bf16 may be good enough.) But Intel's solution won't help if, for example, you are more interested in solving PDEs or physics/engineering problems rather than just AI.

I'm honestly not quite sure why Apple gave us SUCH A POWERFUL generic HPC tool. But, as someone who uses a lot of Mathematica, I'm grateful! (Or will be, once a fully optimized version of Mathematica is released, not the current bare-bones "should be called beta" release.)
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I get the impression that some scientific open-source projects (NumPy, PyTorch...) can't develop for the Apple Silicon GPU as fast as they would like because GitHub doesn't offer that platform for CI.

Is that true? Would Apple change their macOS license for servers? Would a Mac mini Pro change the situation?
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Differences are at a technical level.

- AMX instructions look like normal ARM instructions and are embedded in the standard ARM instruction stream, executed in a core's load/store unit.
- The instructions are sent to the L2 of the cluster the core belongs to (P or E) and are executed by the AMX unit for that cluster, which can be considered as associated with the L2. Essentially it uses the L2 as its "cache" and has low-latency access to it.
- AMX instructions provide at least double, float, and fp16 operations.

Intel AMX seems to provide only int8, int16 (I think), and bf16 instructions. It's unclear to me how tightly Intel AMX is coupled to the L2 (or L3) (i.e., low-latency load/store) or to the instruction stream (i.e., low-latency dispatch of instructions rather than a whole drama of "creating a code package to mail to an accelerator, then telling it to execute that").

Apple provides a MATRIX accelerator, good for a variety of FP work.
Intel provides an AI accelerator that does the job for INFERENCE and may be good enough for most training. (Training almost always requires FP, but bf16 may be good enough.) But Intel's solution won't help if, for example, you are more interested in solving PDEs or physics/engineering problems rather than just AI.

I'm honestly not quite sure why Apple gave us SUCH A POWERFUL generic HPC tool. But, as someone who uses a lot of Mathematica, I'm grateful! (Or will be, once a fully optimized version of Mathematica is released, not the current bare-bones "should be called beta" release.)

Hi Maynard, great to have someone with your knowledge level comment on this! Regarding AMX, I am not quite sure where Apple is going... We have one AMX unit per cluster of 4 P-cores, so if I read Dougall's notes correctly, that would translate to roughly 3 TFLOPS of FP32 throughput with ideal utilization across multiple cores. This is of course great for general vector algebra tasks, but not nearly enough to challenge the deep learning performance of larger GPUs.

I have the feeling that Apple still has to figure these things out properly. Their hybrid solution seems to lack a bit of balance here. Right now, for large matrix-related tasks, it makes sense to use AMX on the base M1, but you might be better off using the GPU on the M1 Max. This obviously creates a weird situation for developers... Besides, it appears that Apple's new TensorFlow plugin does not use AMX at all, which again makes me question the future of these units.
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
Hi Maynard, great to have someone with your knowledge level comment on this! Regarding AMX, I am not quite sure where Apple is going... We have one AMX unit per cluster of 4 P-cores, so if I read Dougall's notes correctly, that would translate to roughly 3 TFLOPS of FP32 throughput with ideal utilization across multiple cores. This is of course great for general vector algebra tasks, but not nearly enough to challenge the deep learning performance of larger GPUs.

I have the feeling that Apple still has to figure these things out properly. Their hybrid solution seems to lack a bit of balance here. Right now, for large matrix-related tasks, it makes sense to use AMX on the base M1, but you might be better off using the GPU on the M1 Max. This obviously creates a weird situation for developers... Besides, it appears that Apple's new TensorFlow plugin does not use AMX at all, which again makes me question the future of these units.
Not sure if the following Twitter thread has been mentioned before. Someone posted a 2.7 TFLOPS result using gemm-benchmark, which seems to be in the ballpark.

There's also a die-shot annotation that looks pretty thorough (although it's missing all of the PCIe and serial stuff at the top, and there are two blocks above and below the left-hand SLC that I'm curious about). Anyway, the AMX blocks marked out in the die shots are tiny compared to the GPU cores, yet their sgemm throughput is ~1.4 TFLOPS per block vs ~0.3 per GPU core. If the annotation is accurate, I wonder why there wouldn't be more focus on AMX.

Edit: Oops, yeah, the Twitter thread was posted earlier in this thread. Having trouble keeping threads straight.
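As a quick sanity check on those figures (taking both the per-block number and the gemm-benchmark result above as reported, not as measurements of mine, and assuming one AMX block per P-cluster with two P-clusters on the M1 Max):

```python
amx_blocks = 2           # assumption: one AMX block per P-core cluster, two P-clusters on M1 Max
tflops_per_block = 1.4   # the sgemm TFLOPS/block figure quoted above
print(amx_blocks * tflops_per_block)  # ~2.8 TFLOPS, consistent with the ~2.7 TFLOPS gemm-benchmark run
```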

 
Last edited:

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Hi Maynard, great to have someone with your knowledge level comment on this! Regarding AMX, I am not quite sure where Apple is going... We have one AMX unit per cluster of 4 P-cores, so if I read Dougall's notes correctly, that would translate to roughly 3 TFLOPS of FP32 throughput with ideal utilization across multiple cores. This is of course great for general vector algebra tasks, but not nearly enough to challenge the deep learning performance of larger GPUs.

I have the feeling that Apple still has to figure these things out properly. Their hybrid solution seems to lack a bit of balance here. Right now, for large matrix-related tasks, it makes sense to use AMX on the base M1, but you might be better off using the GPU on the M1 Max. This obviously creates a weird situation for developers... Besides, it appears that Apple's new TensorFlow plugin does not use AMX at all, which again makes me question the future of these units.

Where do you think Apple wants to sell these machines?
The fact that AI/ML is the hot new thing (and so the only thing we ever hear talked about by journalists and pundits) doesn't change the fact that vastly more engineers and scientists work in "traditional" hard sciences than in AI/ML. And AMX is a very nice performance boost for those folk...

Read what I said. AMX is a MATRIX accelerator. Do you want a computer that can perform matrix math fast (i.e., are you the sort of person who, twenty years ago, would have wanted to own a "workstation")? Then Apple's Macs are a much nicer machine for you than the competition (x86 today, maybe ARM tomorrow). AMX can handle ENGINEERING/PHYSICS matrix math -- doubles and floats.

If you have something that works well on the GPUs, great. But latency computing is different from throughput computing. With AMX, Apple is covering all bases, rather than forcing matrix computing into the throughput mold even when your problem is a latency problem.


Ultimately, I think you are hostage to the monic fallacy (everything must have a SINGLE solution) that has crippled so much of computing. This viewpoint is what drove MS over a cliff ("let's use the same UI for everything from phones to servers") and Intel ("let's force x86 even into circumstances, like phones, where it makes no sense"). You see it in the people who insist that the iPad be turned into a Mac, or that the Apple Watch be able to do everything without an iPhone.
Apple is not immune to this disease, just mostly better at fighting it off than most companies.

Who cares if there is an AMX block on a chip or not? Transistors are cheap! Providing an AMX block (and hiding it behind the Accelerate APIs) means that it can be used where appropriate (high-precision math, low-latency requirements) and not used where inappropriate (low-precision math, throughput requirements).
I agree that the fp16 support in AMX is somewhat weird, and I suspect at some point it may be tossed; it doesn't seem relevant to the primary (hard sciences, low latency) use cases I see for AMX. But let's see... Maybe it will turn out that there are, e.g., ways to solve elliptic PDEs (think something like Laplace or Poisson) that run the first few rounds of relaxation in fp16 before switching to fp32 and ultimately fp64, and by cascading accuracy this way you can solve the PDE 30% faster than starting directly at fp64?
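For what it's worth, the cascading-precision idea can be sketched in a few lines. This is only a toy illustration of the refinement pattern (Jacobi relaxation on a Laplace problem in NumPy), not a claim about actual speedups on AMX or any other hardware:

```python
import numpy as np

def jacobi(u, sweeps):
    """Plain Jacobi sweeps for Laplace's equation; boundary rows/columns stay fixed."""
    for _ in range(sweeps):
        # NumPy evaluates the right-hand side fully before assigning, so this is a true Jacobi update.
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return u

n = 128
u = np.zeros((n, n), dtype=np.float16)
u[0, :] = 1.0  # one "hot" boundary edge, the rest held at 0

# Cascade: cheap low-precision sweeps first, then refine at higher precision.
for dtype, sweeps in [(np.float16, 200), (np.float32, 200), (np.float64, 200)]:
    u = jacobi(u.astype(dtype), sweeps)

residual = np.abs(u[1:-1, 1:-1] - 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])).max()
print("max residual after the fp16 -> fp32 -> fp64 cascade:", residual)
```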
 
  • Like
Reactions: ZZ9pluralZalpha

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
AMX can handle ENGINEERING/PHYSICS matrix math -- doubles and floats.
How can we know whether scientific programs/libraries like NumPy, Julia, or MATLAB use Apple's AMX to speed up matrix multiplications?
 

JimmyjamesEU

Suspended
Jun 28, 2018
397
426
Not sure if it guarantees AMX use, but if you run otool -l ”path to application” it’ll tell you what frameworks are linked.
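Along the same lines, here is a rough sketch of both checks from Python. The binary path is a placeholder, and linking Accelerate only tells you the calls *could* be routed to AMX, not that they are (otool -L, capital L, lists just the linked shared libraries):

```python
import subprocess
import numpy as np

# Check which BLAS NumPy was built against (look for "accelerate"/"vecLib" vs "openblas"):
np.show_config()

# For an arbitrary compiled extension or binary, list its linked libraries.
# The path below is a placeholder -- point it at the library you actually care about.
binary = "/path/to/some/extension.so"
out = subprocess.run(["otool", "-L", binary], capture_output=True, text=True).stdout
print("links Accelerate:", "Accelerate.framework" in out)
```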
 