
Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@iDron Your benchmark is brilliant: very thorough and comprehensive.

If you can't share your code and data, could you create a public repo, at least, with the results and how to replicate the benchmark?

Your benchmark could be extremely valuable for Apple engineers and other data scientists.
 
  • Like
Reactions: iDron

iDron

macrumors regular
Apr 6, 2010
219
252
@Xiao_Xi : thanks! I may not be able to do this immediately, but I can try to provide something in the medium term.

Another thing I noticed: with batch=1, tf-metal takes about 2x as long as standard TensorFlow to train. From batch=10 upwards, tf-metal may even be a bit slower; at a batch size of about 80, tf-metal starts to surpass the non-Metal performance, and it reaches 2.5x acceleration by batch size 512.

However, for batch=2 to batch=9, tf-metal might be on par with non-Metal TF (some runs slightly faster, some slightly slower). But I noticed a slightly higher accuracy.
 

iDron

macrumors regular
Apr 6, 2010
219
252
@iDron - why does the GPU get utilised when your code tells TF not to use it?
I am not sure. I googled and found some people reporting similar behaviour on other devices. I wonder if tf.device is not a hard switch, but rather tells TF which device to prioritize. Another explanation could be that some optimization is happening at the OS level?
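For reference, this is roughly what I mean (a minimal sketch, not my actual training script). Disabling soft device placement should make TF raise an error instead of quietly picking another device:

```python
import tensorflow as tf

# Make placement strict so TF errors out instead of silently
# moving ops to another device (the default is soft placement).
tf.config.set_soft_device_placement(False)

with tf.device("/CPU:0"):
    a = tf.random.normal([1024, 1024])
    b = tf.random.normal([1024, 1024])
    c = tf.matmul(a, b)

print(c.device)  # should report a CPU placement
```

If it errors out or still reports a GPU placement, that would at least narrow down where the decision is being made.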

Okay, another little benchmark I did. Looking at small batch sizes:
batch | GPU | CPU
4     | 70s | 62s
6     | 50s | 52s
7     | 55s | 49s
8     | 51s | 46s
9     | 36s | 44s
12    | 89s | 41s
18    | 66s | 37s
36    | 39s | 32s
44    | 34s | 31s
66    | 27s | 29s
132   | 16s | 28s
196   | 14s | 28s
396   | 12s | 29s

At a batch size of 9, running tf-metal is actually faster than the standard. However, in the range just above (like 12 or 18) it is twice as slow. Looking at GPU power gives some insight: at a batch size of 9, GPU power is at 4.5W. Once I set a batch size of 10-20, it hovers at about 2W.

This is just speculation, but an attempt to understand it: 9 is one below the CPU core count of 10, so I wonder if the OS allocates one efficiency core to system tasks and gives TF the other 9, which it can then use quite efficiently for the overhead tasks and the input pipeline at batch sizes up to 9. If the cores are given multiple batches to process and feed into the GPU, they might at first become very inefficient, unless batch sizes are sufficiently big (giving each CPU core tens or hundreds of batches to feed into the GPU).

GPU power only comes back to similar levels of about 5W at a batch size of about 60. The hard drop at a batch size of 10 is interesting. As batch sizes get bigger, as reported above, GPU power in my case goes up to 12W. The CPU seems to max out at about 30W in my experiments, which is what has been reported as the full power of the CPU elsewhere: https://www.notebookcheck.net/Apple-M1-Pro-Processor-Benchmarks-and-Specs.579915.0.html. This means TF is maxing out the CPU easily. On the one hand it is good to see the hardware actually being put to use; on the other hand it shows the necessity of a GPU.

Finding the maximum GPU power of the M1 Pro is a bit difficult. Notebookcheck reports 15W (https://www.notebookcheck.net/Apple-M1-Pro-14-Core-GPU-GPU-Benchmarks-and-Specs.576651.0.html), while Anandtech, as well as this benchmark on GitHub (https://github.com/tlkh/tf-metal-experiments/blob/main/README.md#experiments-and-benchmarks), reports 43W for the M1 Max, which I would guess translates to about 20W for the M1 Pro (guessing that the faster memory eats up a bit more power). However, I also ran a TFLOPS calculation benchmark from that GitHub repo, and the GPU power on my machine actually went up to 27W. The benchmark (a matrix multiplication in TensorFlow) gave 4.1 TFLOPS.

This means I'm using only about 45% of the GPU's power at large batches in my experiment.
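For anyone who wants to reproduce the TFLOPS number, something along these lines should be enough (a rough sketch in the spirit of the linked tf-metal-experiments benchmark, not the same code):

```python
import time
import tensorflow as tf

N, ITERS = 4096, 50
a = tf.random.normal([N, N])
b = tf.random.normal([N, N])

_ = tf.matmul(a, b).numpy()  # warm-up, not timed

start = time.time()
for _ in range(ITERS):
    c = tf.matmul(a, b)
c.numpy()  # pull the last result to the host so all work has finished
elapsed = time.time() - start

# roughly 2*N^3 floating-point operations per matmul
print(f"{2 * N**3 * ITERS / elapsed / 1e12:.2f} TFLOPS")
```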

What do you guys think? Is this (the drop in GPU usage at a batch size of 10) normal and expected, or actually something to report to Apple, and how?
 
Last edited:
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@iDron If I were you, I would report:
  • the odd behaviour at a batch size of 9
  • the differences between Google's Tensorflow and Tensorflow-Metal plugin
At:
 

iDron

macrumors regular
Apr 6, 2010
219
252
Thanks, I will do that soon! I adapted the model a bit to stress the GPU more. I basically increased the filter sizes of the convolutional layers a few times, which led the (otherwise unmodified) model to use a lot more GPU resources, even at small batch sizes.
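To illustrate the kind of change (the layer sizes here are made up, this is not my actual model), it is essentially just scaling the Conv2D filter counts:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(filter_scale=1):
    # Toy CNN; multiplying the filter counts gives the GPU far more work
    # per batch while leaving the rest of the model unchanged.
    return tf.keras.Sequential([
        layers.Conv2D(32 * filter_scale, 3, activation="relu",
                      input_shape=(64, 64, 3)),
        layers.Conv2D(64 * filter_scale, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

small = build_model(filter_scale=1)  # original-sized model
wide = build_model(filter_scale=4)   # GPU-heavy variant
```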

In this scenario, I got an acceleration of 4-6 times, depending on batch size, which is a lot better than what I wrote a few posts above. The 6x acceleration is achieved at a GPU power of 24W, which is about 90%. As the model uses super basic TF layers and was running at 100% CPU power in standard TensorFlow, I would think this 6x acceleration is a good estimate of the maximum real-world performance gain from using tf-metal. As I have learned myself, this hugely depends on the model structure and on the batch size; for small models and datasets it might not bring a lot of benefit.

This weird behavior of GPU power dropping to about half when changing the batch size from 9 to 10 was observed in these more GPU-intensive models as well, though! It feels like a general problem.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101

I have just seen your code at the Apple forum, and it is not clear if you "cache" and "prefetch" the data.

Do you use the latest versions of the tensorflow-metal and tensorflow-macos packages? I imagine you do, but you don't mention it in the post.
 

iDron

macrumors regular
Apr 6, 2010
219
252
I have tensorflow 2.7.0 and tensorflow-metal 0.3.0 installed; I think those should be the latest.

Thanks, this is very helpful! So far I'm just feeding numpy arrays from a pandas Dataframe. I'm sure this can be optimized, and it was one bottleneck I was thinking about, since CPU power also does not max out (far from it) at small batch sizes.

I'll look into it, and hope it will speed things up!
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I'm just feeding numpy arrays from a pandas Dataframe.
Have you tried XGBoost instead of Tensorflow? If you are not using images or audio files, you could use XGBoost or a similar library instead of Tensorflow.

However, I don't know if XGBoost has a Metal backend and is compatible with Dask on Apple Silicon.
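For tabular data, the XGBoost route would look roughly like this (a sketch with made-up file and column names; it runs on the CPU, no Metal backend assumed):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical tabular dataset
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = xgb.XGBRegressor(n_estimators=500, max_depth=6, n_jobs=-1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out split
```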
 

iDron

macrumors regular
Apr 6, 2010
219
252
@Xiao_Xi this was a helpful tip. I changed my input pipeline from NumPy arrays to tf.data.Dataset. GPU power, even at small batch sizes, increases to above 20W, averaging about 21.5W on my model. From the maximum power I've measured, I believe this is about 85% of GPU power. Prefetching and caching did not bring any additional benefit. The memory bandwidth also increased by 50%, from an average of 40GB/s to 60GB/s. Batch size changes neither GPU power nor training time, so this 85% is probably the maximum that TensorFlow can use. NumPy inputs seem like a huge bottleneck.
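The change itself is small. Roughly this (a sketch, assuming a DataFrame df with a "label" column and an already-built model; the names are placeholders):

```python
import tensorflow as tf

features = df.drop(columns=["label"]).to_numpy(dtype="float32")
labels = df["label"].to_numpy()

ds = (tf.data.Dataset.from_tensor_slices((features, labels))
      .cache()                 # data is already in memory, so little gain here
      .shuffle(len(labels))
      .batch(128)
      .prefetch(tf.data.AUTOTUNE))

model.fit(ds, epochs=10)  # instead of model.fit(features, labels, ...)
```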

I also did training with the CPU. Now, with the tf.dataset it is also able to use the entire CPU power at small batch sizes (previously, it barely used more than 2 cores). However, even at small batch sizes, I now get 5x acceleration with the GPU.

Conclusion: In order to fully use the Apple Silicon chips for deep learning, the tf.dataset API is absolutely necessary, but then 5x acceleration can be achieved even on relatively small (convolutional) models.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
5x acceleration can be achieved even on relatively small (convolutional) models
Honestly, I didn't expect tf.dataset to be that much better; we are learning together. I got used to tf.datasets and always use them because of their advantages.

I think the speedup between using a tf.dataset and a NumPy array could be linked to the mode Tensorflow runs in. It seems Tensorflow runs in graph mode when the input is a tf.dataset, and in eager mode when the input is a NumPy array. Tensorflow in graph mode runs faster, but the code is much harder to debug.
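To illustrate the difference (a generic sketch, not iDron's code): a custom train step wrapped in tf.function gets traced into a graph, while the same function without the decorator runs eagerly. In Keras, model.compile(..., run_eagerly=True) forces eager mode for debugging.

```python
import tensorflow as tf

@tf.function  # traced into a graph; remove the decorator to run eagerly
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```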

Can someone explain why Tensorflow runs much faster with tf.datasets? Should we ask in Google's Tensorflow forum?
 
  • Like
Reactions: iDron

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I wonder if someone has done similar tests to @iDron comparing CPU and GPU performance and utilisation, but for PyTorch not Tensorflow?

is PyTorch already accelerated on metal?
Not yet. PyTorch devs explain their progress here https://github.com/pytorch/pytorch/issues/47702

It seems PyTorch could get a Metal backend in March/April, but progress seems very slow. Indeed, I am not sure whether a Metal backend is a priority for PyTorch devs.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Conclusion: In order to fully use the Apple Silicon chips for deep learning, the tf.dataset API is absolutely necessary
Could you share this finding in the Apple forum? This tip could resolve the complaints about the GPU being slower than the CPU. The faster Apple engineers can resolve complaints, the quicker they can improve Tensorflow-Metal.
 

iDron

macrumors regular
Apr 6, 2010
219
252
Could you share this finding in the Apple forum? This tip could resolve the complaints about the GPU being slower than the CPU. The faster Apple engineers can resolve complaints, the quicker they can improve Tensorflow-Metal.
I will.
Right now I'm still trying to see if I can improve performance a bit. Memory bandwidth is not fully utilized. It should be able to reach 200GB/s, and I've seen benchmarks that report this for deep learning models. If the GPU's computational power is the limiting factor, then of course the bandwidth cannot be fully utilized. But I feel like ~20% improvement could still be possible. The memory bandwidth hovers at 65GB/s; caching and prefetching brought a couple of percent improvement, but not much. GPU average power is at ~21W, seemingly about 80-85% of the maximum.
 

iDron

macrumors regular
Apr 6, 2010
219
252
Why do caching and prefetching improve performance in the M1 SOC only slightly? Does the M1 SOC memory management make caching and prefetching less significant?

Has someone tried NumPy on Tensorflow? Is it faster than regular Numpy?
It could be. I did the TFLOPS linear algebra benchmark I linked previously, which gives 4.1 TFLOPS at a sustained power consumption of 26.5W. My real-world example is at about 85% of this maximum. I think 85% real-world utilization is actually already pretty good, and maybe much more is not really feasible. Do you guys have experience with this on other GPUs? How close does your real-world GPU utilisation get to an artificial benchmark?
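If anyone wants to try the "NumPy on TensorFlow" idea, I believe the entry point is the tf.experimental.numpy module; a quick sketch (I have not benchmarked this myself):

```python
import tensorflow.experimental.numpy as tnp

tnp.experimental_enable_numpy_behavior()  # lets TF tensors behave like ndarrays

a = tnp.random.randn(2048, 2048)
b = tnp.random.randn(2048, 2048)
c = tnp.matmul(a, b)  # executed by TensorFlow, so it can be dispatched to the GPU
print(c.shape)
```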
 
  • Like
Reactions: Xiao_Xi

project_2501

macrumors 6502a
Jul 1, 2017
676
792
This video runs a standard MNIST test and seems to suggest the M1 Pro is faster than an M1 Max.

A suggestion is that the M1 Max has a wider memory bandwidth but higher latency than the M1 Pro - and for small GPU tasks, e.g. many small batches for ML training, this has an overall negative impact?

 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
A suggestion is that the M1 Max has a wider memory bandwidth but higher latency than the M1 Pro - and for small GPU tasks, e.g. many small batches for ML training, this has an overall negative impact?
@iDron 's experiments show that batch size and input type can impact performance significantly.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Could someone explain:
- what MLIR is
- why Tensorflow will use MLIR
- whether Tensorflow will be faster with MLIR
- whether MLIR will improve/worsen the performance of the Tensorflow-Metal plugin

This presentation talks about MLIR and Tensorflow:
 
Last edited:

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What can happen if I install a package based on Google's Tensorflow on a Mac with tensorflow-macos and tensorflow-metal installed? Will the package download Tensorflow and use only the CPU, or will it use the installed Tensorflow and use the GPU?

If the installation fails, would I need to change the requirements file and compile the package?
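For what it's worth, one way to check afterwards which TensorFlow actually got imported and whether the GPU is still visible (just a sketch):

```python
import tensorflow as tf

print(tf.__version__)                          # which build got imported
print(tf.config.list_physical_devices("GPU"))  # empty list -> CPU-only build
```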
 

white7561

macrumors 6502a
Jun 28, 2016
934
386
World
Hi guys. So is the M1 Pro 16-core currently slower than Colab Pro? Is it by a lot? Is it because of unoptimized code on the Mac's side which will improve? Thank you.
 

iDron

macrumors regular
Apr 6, 2010
219
252
Hi guys. So is the M1 Pro 16-core currently slower than Colab Pro? Is it by a lot? Is it because of unoptimized code on the Mac's side which will improve? Thank you.
It should be quite comparable. Even with Colab Pro, it's not guaranteed that you get a faster GPU. The slower GPUs in Colab are in practice quite comparable to the M1 Pro GPU. It hugely depends on your specific models. People have benchmarked somewhere that standard MNIST (I believe) models might run slightly (~20%) faster. I've found some of my personal stuff running slightly slower.

On Colab Pro, the fastest GPUs are at about 9TFLOPS, compared to 5 for the M1 Pro. So if you get an instance with those, the M1 Pro will be about half the speed.

However, code optimizations for the M1 architecture in the future might make things a bit faster.

Also, some models might not benefit from a GPU at all, and of course M1 has a pretty fast CPU.
 