
leman

macrumors Core
Oct 14, 2008
19,516
19,664
Is Apple's Tensorflow fork reliable?

Some people report bad results when using the GPU.

The problem with Apple's TensorFlow plugin is that it's closed-source, undocumented software with no obvious support channel. This is so typically Apple. At least with the older, crappy TF fork one could look at the code and try to identify (maybe even patch) the issue; now one can't do anything at all, and Apple engineers don't reply to these topics in the forums.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What can other companies learn from the Apple Tensorflow plugin? What does Apple need to hide?

If Apple publishes the code of its Tensorflow plugin, other projects can learn from it and develop a Metal backend faster.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
If Apple publishes the code of its Tensorflow plugin, other projects can learn from it and develop a Metal backend faster.

Exactly! The decision to hide the plugin implementation does not make any sense and harms both Apple and its loyal user base. It's clear that Apple does not have a unified strategy; it seems to me like multiple teams are shipping different things without coordinating with each other. On one hand they release C++ Metal bindings, a wonderful initiative that will massively simplify Metal development for many devs outside Apple's ecosystem; on the other hand, their ML efforts are awkward at best...
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
How does NVIDIA do things in this area, and could Apple learn from them? My (pure) guess, from how popular NVIDIA is among those who do ML, is that NVIDIA's approach is refined, unified, and well-developed.


Relatedly, a lot of the researchers doing ML are scientists rather than programmers, and the fact that CUDA is much easier to work in than OpenCL is a big plus. How does Metal compare to CUDA for ease of use?
 
Last edited:

hajime

macrumors 604
Jul 23, 2007
7,920
1,310
So the conclusion is that although Apple Silicon has the potential for ML research, Apple's decision to keep the API closed source and the lack of hardware AVX and CUDA support still make an Intel PC the better platform for researchers?
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
So the conclusion is that although Apple Silicon has the potential for ML research, Apple's decision to keep the API closed source and the lack of hardware AVX and CUDA support still make an Intel PC the better platform for researchers?

Kind of depends on what you do. For many people who are locked into the CUDA ecosystem, there is no way around Nvidia at the moment. Apple Silicon can become good, but it will take some time until the software (and hardware) matures.

Then again, Apple Silicon is great for software prototyping in many ways, and it has very good CPU performance. Something like a 14" MBP might be one of the best portable laptops on the market for a data scientist. Training of large models can be done on remote GPU-accelerated machines (cloud services or supercomputers).
 
  • Like
Reactions: hajime and ingambe

ingambe

macrumors 6502
Mar 22, 2020
320
355
Kind of depends on what you do. For many people who are locked into the CUDA ecosystem, there is no way around Nvidia at the moment. Apple Silicon can become good, but it will take some time until the software (and hardware) matures.

Then again, Apple Silicon is great for software prototyping in many ways, and it has very good CPU performance. Something like a 14" MBP might be one of the best portable laptops on the market for a data scientist. Training of large models can be done on remote GPU-accelerated machines (cloud services or supercomputers).
In any case, even if performance were on par with a 3080 Ti, the question would be: do you really want to let your laptop run for days at maximum utilization?
You would also have to checkpoint frequently so you can stop training whenever you need to carry your laptop around.
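Something like the Keras ModelCheckpoint callback makes that part painless. A minimal sketch with a toy model and random data (purely illustrative, not anyone's actual setup):

Code:
import numpy as np
import tensorflow as tf

# Toy model and random data, just to show the checkpointing pattern.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save the weights at the end of every epoch so training can be stopped and resumed.
ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/ckpt_{epoch:03d}",
    save_weights_only=True,
    save_freq="epoch",
)

x, y = np.random.rand(256, 8), np.random.rand(256, 1)
model.fit(x, y, epochs=5, batch_size=32, callbacks=[ckpt])

# Resume later by rebuilding the model and calling model.load_weights("checkpoints/ckpt_005").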

Being able to do at least one iteration with a reasonable batch size is already very nice
 

hajime

macrumors 604
Jul 23, 2007
7,920
1,310
Thanks. Am I correct that, although it has been a year since Apple released the M1 chip, the maturity of software and hardware support for both ML and DS is still quite poor?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
the maturity of software and hardware support for both ML and DS is still quite poor?
Most of the software runs natively on Apple Silicon, but few packages support the GPU.

The main problem is that developing for Mac is more expensive and cumbersome than for Windows or Linux.

I wonder how the PyTorch devs will build and test their code, since GitHub still doesn't offer any Apple Silicon hardware for CI.
 

Romain_H

macrumors 6502a
Sep 20, 2021
520
438
The main problem is that developing for Mac is more expensive and cumbersome than for Windows or Linux.
I agree. I do want to point out, however, that this is true for ML development only. Xcode/Swift and Cocoa are otherwise second to none
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
The main problem is that developing for Mac is more expensive and cumbersome than for Windows or Linux.

I disagree with this sentiment wholeheartedly. Developing for macOS is a breeze compared to Windows, and it can also be more straightforward than Linux, where you have to consider the differences between distros. For UI stuff, it’s not even a competition.

But then again, your mileage might vary. Maybe you are referring to the lack of CI for macOS? That might impact the experience for large collaborative FOSS efforts somewhat, but there are always solutions if the community is passionate (I have a private GitLab runner on the NAS in my closet :) )
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I wasn't specific enough. I was talking about headless software like Julia, PyTorch, or scikit-learn.

It seems developing for macOS is easy when the app targets only macOS, but not for cross-platform apps, because the macOS versions tend to lack some features.
Xcode/Swift and Cocoa are otherwise second to none
Developing for macOS is a breeze compared to Windows, and it can also be more straightforward than Linux, where you have to consider the differences between distros. For UI stuff, it’s not even a competition.
The lack of cheap and convenient CI for macOS makes developing for macOS very difficult, even for big FOSS projects like GIMP.
Maybe you are referring to the lack of CI for macOS? That might impact the experience for large collaborative FOSS efforts somewhat
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
The lack of cheap and convenient CI for macOS makes developing for macOS very difficult, even for big FOSS projects like GIMP.

That is true, but big FOSS projects of this type are the exception, not the norm. And the GIMP community specifically has very little interest in producing a high-quality Mac version, since they are focused on Linux. In terms of photo editing, macOS has plenty of high-quality options; efforts like GIMP are almost amateurish in comparison.
 

project_2501

macrumors 6502a
Jul 1, 2017
676
792
Aside from the number of GPU cores, are the M1 Pro and M1 Max equivalent for numerical computing workloads?

... do they have the same number of AMX matrix multipliers, for example?
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Aside from the number of GPU cores, are the M1 Pro and M1 Max equivalent for numerical computing workloads?

... do they have the same number of AMX matrix multipliers, for example?
Yes, but also maybe no!

In terms of announced non-GPU numerical-computing capability, the Pro and Max should be completely equal.

But when people went over the M1 Pro and Max die photos Apple published, one thing that stuck out is that the Max has a second copy of the 16-core Neural Engine. Some speculated that it was a weird Photoshop job by Apple to cover up what's really in that spot, for some conspiracy reason. However, Hector Martin of Asahi Linux reported that this extra Neural Engine cluster appears in the M1 Max device tree, but its entry is deleted by iBoot (one of the bootloader stages) for reasons which are unclear. He has also mentioned that there are references to up to four clusters in macOS M1 drivers, which makes sense, as other info he's uncovered points to the M1 Max supporting a dual-die configuration that would provide four Neural Engines.

It's a mystery why M1 Max currently has it disabled. Perhaps Apple doesn't want to enable it outside of the dual-die config, which seems likely to be desktop-only and thus would have access to a lot more power.
 

iDron

macrumors regular
Apr 6, 2010
219
252
I just got my 10/16 (CPU/GPU) MBP with 16GB RAM and tested my recent (very simple) ML models.

I am a materials scientist, so by no means deep into the computational side. However, I recently built some simple CNN and DNN models to use on some of our measurement data. I had a 2015 13" MBP with Intel graphics and quickly realized that training would take forever; even playing around with the model design was inconvenient. I quickly switched over to Google Colab, where one can use an Nvidia Tesla K80 GPU, for example. I see figures for the Tesla's performance in the range of 5.6 to 8.7 TFLOPS, depending on GPU boost. However, you'll run into frequent disconnects and such on Google Colab, which is why I do not think it is a serious tool for full-time work.

I moved my workflow back from Colab to my Mac and did some comparisons with Google Colab. The M1 Pro is supposed to have 5.2 TFLOPS of GPU compute. A 150-epoch training run on my dataset took 564 seconds on Colab and 687 seconds on the M1 Pro. So the Mac takes about 20% longer than the K80. I'm not sure what the actual rating of Google's K80s is; if they run at 5.6 TFLOPS, the Mac actually over-performs in terms of GPU power. If it's at 8.7 TFLOPS, Tensorflow-Metal would seem to be slightly inefficient, but not terribly so.

Getting Tensorflow-Metal set up took me several tries (a restart of the Mac solved the issues in the end), following this article: https://makeoptim.com/en/deep-learning/tensorflow-metal .
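For reference, the setup essentially boils down to installing the two packages and checking that the Metal device shows up. A quick sanity check (package names as of the current TF 2.6/2.7 releases, not the article's exact steps):

Code:
# In a fresh Python environment (assumed): python -m pip install tensorflow-macos tensorflow-metal
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # should list one Metal PluggableDevice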

Also, other libraries such as scikit-learn are still a bit of a hassle to compile on arm, while alternatives to Tensorflow, such as PyTorch, are still at an earlier stage.

However, as a laptop to develop models on, the new MBPs are a good choice, I think. Of course, there are other laptops with over 30 TFLOPS of GPU power in them, but they are way too gigantic to be portable.

[Attached screenshot: Screenshot 2021-12-11 at 6.51.24 PM.png]
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I recently built some simple CNN and DNN models to use on some of our measurement data
Did you have any problems with the Tensorflow-Metal plug-in? It seems Tensorflow-Metal doesn't support some ops, so code runs on the CPU instead of the GPU.

However, you'll run into frequent disconnects and such on Google Colab, which is why I do not think it is a serious tool for full-time work.
Why don't you use Google Colab Pro?

I [...] did some comparisons with Google Colab. [T]he Mac takes about 20% longer than the K80.
Your results seem to contradict the benchmark shared in the Tensorflow forum.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
It seems that Apple is actively working on the Tensorflow-Metal plugin (a new version was released last week), but the complete silence and the lack of communication or even minimal release notes is infuriating. Their "official" Tensorflow plugin website is still hidden away behind obscure links, still refers to the "macOS 12 beta," and contains zero useful information about what is supported or what they are working on. Developer relations really dropped the ball on this one.
 
Last edited:

iDron

macrumors regular
Apr 6, 2010
219
252
Did you have any problems with the Tensorflow-Metal plug-in? It seems Tensorflow-Metal doesn't support some ops, so code runs on the CPU instead of the GPU.
I checked the GPU monitor, and it was constantly above 90%. Speeds were roughly comparable to using the 5 TFLOPS GPU on Colab, so I think it is about what I can expect. Do you have a link to what actions are seemingly not supported by TF-Metal? I would be interested to look deeper into it.

Why don't you use Google Colab Pro?

1. Even with Pro, there is no guarantee that you get faster GPUs; you only get prioritized access.
2. Even with Pro, it disconnects when your computer goes to sleep.
3. Colab has some issues every now and then with saving files and also with displaying the cells.
4. Working in a proper IDE is a lot better than working in the browser.
5. I do not live in a country where Google offers Colab Pro. I guess you could probably use it anyway, but their conditions officially say that they reserve the right to terminate your account if you violate the terms. That's not workable for a work project.

Your results seem to contradict the benchmark shared in the Tensorflow forum.

Well, in those results the M1 Max is 20 percent faster than the K80; in my project, it's 20% slower. It is still the same order of magnitude. My project consists of two simple DNNs and one CNN, which feed into a concatenation layer followed by a single fully connected layer. GPU acceleration of the DNN streams did not improve the speed much.
This structure might just be more efficient on Colab, for whatever reason. Besides the GPU at almost 100%, my CPU cores were also running rather high, so Tensorflow was definitely still trying to use significant CPU power.

Depending on the model structure, you might get better or worse performance with M1 Pro/Max vs. Intel and Nvidia elsewhere.
 
  • Like
Reactions: project_2501

iDron

macrumors regular
Apr 6, 2010
219
252
I did a bit more testing: with
Code:
from tensorflow.python.client import device_lib  # this import is needed for the call below
device_lib.list_local_devices()

it shows me:

Code:
2021-12-11 23:42:24.651015: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-11 23:42:24.651043: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)

[name: "/device:CPU:0"  device_type: "CPU"  
memory_limit: 268435456  locality {  }  incarnation: 11260881963924141305  xla_global_id: -1,  
name: "/device:GPU:0" device_type: "GPU" 
locality { bus_id: 1 } incarnation: 11509770477685582231 
physical_device_desc: "device: 0, name: METAL, pci bus id: <undefined>" xla_global_id: -1]

However, I can still run code with:

Code:
with tf.device('gpu:0'):
    ...  # model building and training code placed on the GPU

Forcing Tensorflow to use the GPU with this command gives exactly the same execution time as not giving the option at all. Interestingly, when I forced it to use only the CPU with the same command, the GPU was still running in the background, albeit at somewhat lower utilization. Forcing the CPU led to about 30% longer times, so not that big of a difference.
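A stricter way to rule the GPU out entirely would be to hide it before any ops run, rather than relying on a tf.device() hint. A small sketch (untested here, just the idea; the matmul is only a stand-in workload):

Code:
import tensorflow as tf

# Hide the Metal GPU before anything touches it; tf.device() is only a placement
# preference, so the plugin may otherwise still schedule work on the GPU.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())  # should now list only the CPU

with tf.device("/CPU:0"):
    x = tf.random.normal([2048, 2048])
    y = tf.linalg.matmul(x, x)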
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Do you have a link to what actions are seemingly not supported by TF-Metal?
So far, Apple hasn't published what ops the Tensorflow-Metal plugin supports. Honestly, I doubt that even Apple engineers know what ops are supported.
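Since there is no official op list, one workaround is to turn on device-placement logging and watch which ops silently land on the CPU. A minimal sketch (the matmul is just an example op, not a specific model):

Code:
import tensorflow as tf

# Print the device chosen for every op; anything the Metal plugin can't handle
# should show up on /device:CPU:0 instead of /device:GPU:0.
tf.debugging.set_log_device_placement(True)

a = tf.random.normal([1024, 1024])
b = tf.linalg.matmul(a, a)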

It seems that Apple is actively working on the Tensorflow-Metal plugin
It seems that Apple is getting faster at adapting Tensorflow-Metal to each new Tensorflow version: it took a month and a half for version 2.6 and one month for 2.7.

@iDron Could you run the same code using only Google's Tensorflow in another environment, to force the code to run only on the CPU?
 

iDron

macrumors regular
Apr 6, 2010
219
252
batch size = 1       | with GPU | with CPU | no-metal
memory bandwidth     | 32 GB/s  | 35 GB/s  | 44 GB/s
GPU utilization      | 99%      | 88%      | 0%
CPU utilization      | 110%     | 250%     | 330%
GPU power            | 1.4 W    | 0.6 W    | 0.1 W
CPU power            | 6 W      | 10 W     | 14 W
execution time       | 234 s    | 317 s    | 130 s

batch size = 4       | with GPU | with CPU | no-metal
memory bandwidth     | 32 GB/s  | 30 GB/s  | 35 GB/s
GPU utilization      | 99%      | 80%      | 0%
CPU utilization      | 110%     | 400%     | 520%
GPU power            | 2 W      | 0.5 W    | 0.1 W
CPU power            | 5.5 W    | 15 W     | 21 W
execution time       | 62 s     | 96 s     | 51 s

batch size = 16      | with GPU | with CPU | no-metal
memory bandwidth     | 32 GB/s  | 26 GB/s  | n/a
GPU utilization      | 99%      | 60%      | 0%
CPU utilization      | 60%      | 440%     | 610%
GPU power            | 2.5 W    | 0.4 W    | 0.1 W
CPU power            | 5.3 W    | 17 W     | 28 W
execution time       | 67 s     | 56 s     | 32 s

batch size = 128     | with GPU | with CPU | no-metal
memory bandwidth     | 28 GB/s  | 26 GB/s  | 29 GB/s
GPU utilization      | 99%      | 40%      | 0%
CPU utilization      | 55%      | 650%     | 700%
GPU power            | 7 W      | 0.2 W    | 0.1 W
CPU power            | 5.5 W    | 26 W     | 28 W
execution time       | 18 s     | 28 s     | 25 s

batch size = 512     | with GPU | with CPU | no-metal
memory bandwidth     | 40 GB/s  | 26 GB/s  | 30 GB/s
GPU utilization      | 99%      | 23%      | 0%
CPU utilization      | 50%      | 700%     | 700%
GPU power            | 12 W     | 0.1 W    | 0.1 W
CPU power            | 5.5 W    | 28 W     | 29 W
execution time       | 11 s     | 25 s     | 24 s

I just did this comparison benchmark of my model, training it either in a tensorflow-metal environment (forcing GPU or CPU with 'with tf.device():') or in a standard tensorflow environment.

Tensorflow-metal always tries to use both the GPU and the CPU, even if you tell it to use only one of them. At small batch sizes, this does not work well, as the no-metal environment loads the CPU a lot more heavily and is almost twice as fast! At a batch size of 4, tensorflow-metal and standard tensorflow are almost level for some reason. At a batch size of 16, the standard tensorflow environment is once again twice as fast.

As batch sizes increase, the GPU lets the model train a lot faster: at a batch size of 512, GPU-accelerated training is about 2.5x the speed of running on the CPU, whether in the tf-metal environment or in standard tensorflow. If you have models suited to even larger batch sizes, you might get an even bigger speedup!

Also noteworthy: as batch size increases, tf-metal uses the GPU much less when told to use only the CPU. Thus, the CPU speeds of standard tensorflow and tf-metal converge.

However, if you plan to use small batch sizes, the metal-plugin might actually slow things down!

Also, I benchmarked my old MBP (2015, 2.7 GHz i5) running standard (non-metal) tensorflow: it is roughly 6x slower than the M1 Pro across all batch sizes. So if one is upgrading from a similarly old machine, one can expect at least a 15x speedup over a 5-6 year old MBP when actually utilising the GPU.
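For anyone who wants to reproduce this kind of sweep, the comparison boils down to timing model.fit() at each batch size. A rough sketch with a toy model and random data (purely illustrative, not the model benchmarked above):

Code:
import time
import numpy as np
import tensorflow as tf

# Toy model and random data, just to illustrate the batch-size sweep.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(4096, 32), np.random.rand(4096, 1)

for batch_size in (1, 4, 16, 128, 512):
    start = time.perf_counter()
    model.fit(x, y, epochs=1, batch_size=batch_size, verbose=0)
    print(f"batch {batch_size}: {time.perf_counter() - start:.1f}s")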
 
Last edited: