
Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@iDron Your benchmark is brilliant: very thorough and comprehensive.

If you can't share your code and data, could you create a public repo, at least, with the results and how to replicate the benchmark?

Your benchmark could be extremely valuable for Apple engineers and other data scientists.
 
  • Like
Reactions: iDron

iDron

macrumors regular
Apr 6, 2010
219
252
@Xiao_Xi : thanks! I may not be able to do this immediately, but I can try to provide something in the medium term.

Another thing I noticed: with batch=1, tf-metal takes about 2x as long as standard TensorFlow to train. From batch=10 upwards, tf-metal may even be a bit slower; at a batch size of about 80, tf-metal starts to surpass the non-Metal performance, and it reaches 2.5x acceleration by batch size 512.

However, for batch=2 to batch=9, tf-metal might be on par with non-Metal TF (some runs slightly faster, some slightly slower). But I noticed a slightly higher accuracy.
 

iDron

macrumors regular
Apr 6, 2010
219
252
@iDron - why does the GPU get utilised when your code tells TF not to use it?
I am not sure. I googled and found some people reporting similar behaviour on other devices. I wonder if tf.device is not a hard switch, but rather tells TF which device to prioritize. Another explanation could be that some optimization is happening at the OS level?
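For reference, this is roughly what I mean (a minimal sketch, not my actual training script). Disabling soft device placement should make TF raise an error instead of quietly picking another device:

```python
import tensorflow as tf

# Make placement strict so TF errors out instead of silently
# moving ops to another device (the default is soft placement).
tf.config.set_soft_device_placement(False)

with tf.device("/CPU:0"):
    a = tf.random.normal([1024, 1024])
    b = tf.random.normal([1024, 1024])
    c = tf.matmul(a, b)

print(c.device)  # should report a CPU placement
```

If it errors out or still reports a GPU placement, that would at least narrow down where the decision is being made.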

Okay, another little benchmark I did. Looking at small batch sizes:
batch | GPU | CPU
4     | 70s | 62s
6     | 50s | 52s
7     | 55s | 49s
8     | 51s | 46s
9     | 36s | 44s
12    | 89s | 41s
18    | 66s | 37s
36    | 39s | 32s
44    | 34s | 31s
66    | 27s | 29s
132   | 16s | 28s
196   | 14s | 28s
396   | 12s | 29s

At a batch size of 9, running tf-metal is actually faster than the standard. However, in the range just above (like 12 or 18) it is twice as slow. Looking at GPU power gives some insight: at a batch size of 9, GPU power is at 4.5W. Once I set a batch size of 10-20, it hovers at about 2W.

This is just speculation, but an attempt to understand it: 9 is one below the CPU core count of 10, so I wonder if the OS allocates one efficiency core to system tasks and gives TF the other 9, which it can then use quite efficiently for the overhead tasks and the input pipeline at batch sizes up to 9. If the cores are given multiple batches to process and feed into the GPU, they might at first become very inefficient, unless batch sizes are sufficiently big (giving each CPU core tens or hundreds of batches to feed into the GPU).

GPU power only comes back to similar levels of about 5W at a batch size of about 60. The hard drop at a batch size of 10 is interesting. As batch sizes get bigger, as reported above, GPU power in my case goes up to 12W. The CPU seems to max out at about 30W in my experiments, which is what has been reported as the full power of the CPU elsewhere: https://www.notebookcheck.net/Apple-M1-Pro-Processor-Benchmarks-and-Specs.579915.0.html. This means TF is maxing out the CPU easily. On the one hand it is good to see the hardware actually being put to use; on the other hand it shows the necessity of a GPU.

Finding the maximum GPU power of the M1 Pro is a bit difficult. Notebookcheck reports 15W (https://www.notebookcheck.net/Apple-M1-Pro-14-Core-GPU-GPU-Benchmarks-and-Specs.576651.0.html), while Anandtech, as well as this benchmark on GitHub (https://github.com/tlkh/tf-metal-experiments/blob/main/README.md#experiments-and-benchmarks), reports 43W for the M1 Max, which I would guess translates to about 20W for the M1 Pro (guessing that the faster memory eats up a bit more power). However, I also ran a TFLOPS calculation benchmark from that GitHub repo, and the GPU power on my machine actually went up to 27W. The benchmark (a matrix multiplication in TensorFlow) gave 4.1 TFLOPS.

This means I'm using only about 45% of the GPU's power at large batches in my experiment.
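For anyone who wants to reproduce the TFLOPS number, something along these lines should be enough (a rough sketch in the spirit of the linked tf-metal-experiments benchmark, not the same code):

```python
import time
import tensorflow as tf

N, ITERS = 4096, 50
a = tf.random.normal([N, N])
b = tf.random.normal([N, N])

_ = tf.matmul(a, b).numpy()  # warm-up, not timed

start = time.time()
for _ in range(ITERS):
    c = tf.matmul(a, b)
c.numpy()  # pull the last result to the host so all work has finished
elapsed = time.time() - start

# roughly 2*N^3 floating-point operations per matmul
print(f"{2 * N**3 * ITERS / elapsed / 1e12:.2f} TFLOPS")
```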

What do you guys think? Is this (the drop in GPU usage at a batch size of 10) normal and expected, or actually something to report to Apple, and how?
 
Last edited:
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@iDron If I were you, I would report:
  • the odd behaviour at a batch size of 9
  • the differences between Google's Tensorflow and Tensorflow-Metal plugin
At:
 

iDron

macrumors regular
Apr 6, 2010
219
252
Thanks, I will do that soon! I adapted the model a bit to stress the GPU more. I basically increased the filter sizes of the convolutional layers a few times, which led the (otherwise unmodified) model to use a lot more GPU resources, even at small batch sizes.
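To illustrate the kind of change (the layer sizes here are made up, this is not my actual model), it is essentially just scaling the Conv2D filter counts:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(filter_scale=1):
    # Toy CNN; multiplying the filter counts gives the GPU far more work
    # per batch while leaving the rest of the model unchanged.
    return tf.keras.Sequential([
        layers.Conv2D(32 * filter_scale, 3, activation="relu",
                      input_shape=(64, 64, 3)),
        layers.Conv2D(64 * filter_scale, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

small = build_model(filter_scale=1)  # original-sized model
wide = build_model(filter_scale=4)   # GPU-heavy variant
```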

In this scenario, I got an acceleration of 4-6 times, depending on batch size, which is a lot better than what I wrote a few posts above. The 6x acceleration is achieved at a GPU power of 24W, which is about 90%. As the model uses super basic TF layers and was running at 100% CPU power in standard TensorFlow, I would think this 6x acceleration is a good estimate of the maximum real-world performance gain from using tf-metal. As I have learned myself, this hugely depends on the model structure and on the batch size; for small models and datasets it might not bring a lot of benefit.

This weird behavior of GPU power dropping to about half when changing the batch size from 9 to 10 was observed in these more GPU-intensive models as well, though! It feels like a general problem.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101

I have just seen your code at the Apple forum, and it is not clear if you "cache" and "prefetch" the data.

Do you use the latest versions of the tensorflow-metal and tensorflow-macos packages? I imagine you do, but you don't mention it in the post.
 

iDron

macrumors regular
Apr 6, 2010
219
252
I have tensorflow 2.7.0 and tensorflow-metal 0.3.0 installed; I think those should be the latest.

Thanks, this is very helpful! So far I'm just feeding numpy arrays from a pandas Dataframe. I'm sure this can be optimized, and it was one bottleneck I was thinking about, since CPU power also does not max out (far from it) at small batch sizes.

I'll look into it, and hope it will speed things up!
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I'm just feeding numpy arrays from a pandas Dataframe.
Have you tried XGBoost instead of Tensorflow? If you are not using images or audio files, you could use XGBoost or a similar library instead of Tensorflow.

However, I don't know if XGBoost has a Metal backend and is compatible with Dask on Apple Silicon.
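For tabular data, the XGBoost route would look roughly like this (a sketch with made-up file and column names; it runs on the CPU, no Metal backend assumed):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical tabular dataset
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = xgb.XGBRegressor(n_estimators=500, max_depth=6, n_jobs=-1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out split
```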
 

iDron

macrumors regular
Apr 6, 2010
219
252
@Xiao_Xi this was a helpful tip. I changed my input pipeline from NumPy arrays to tf.data.Dataset. GPU power, even at small batch sizes, increases to above 20W, averaging about 21.5W on my model. From the maximum power I've measured, I believe this is about 85% of GPU power. Prefetching and caching did not bring any additional benefit. The memory bandwidth also increased by 50%, from an average of 40GB/s to 60GB/s. Batch size changes neither GPU power nor training time, so this 85% is probably the maximum that TensorFlow can use. NumPy inputs seem like a huge bottleneck.
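The change itself is small. Roughly this (a sketch, assuming a DataFrame df with a "label" column and an already-built model; the names are placeholders):

```python
import tensorflow as tf

features = df.drop(columns=["label"]).to_numpy(dtype="float32")
labels = df["label"].to_numpy()

ds = (tf.data.Dataset.from_tensor_slices((features, labels))
      .cache()                 # data is already in memory, so little gain here
      .shuffle(len(labels))
      .batch(128)
      .prefetch(tf.data.AUTOTUNE))

model.fit(ds, epochs=10)  # instead of model.fit(features, labels, ...)
```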

I also did training with the CPU. Now, with the tf.dataset it is also able to use the entire CPU power at small batch sizes (previously, it barely used more than 2 cores). However, even at small batch sizes, I now get 5x acceleration with the GPU.

Conclusion: In order to fully use the Apple Silicon chips for deep learning, the tf.dataset API is absolutely necessary, but then 5x acceleration can be achieved even on relatively small (convolutional) models.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
5x acceleration can be achieved even on relatively small (convolutional) models
Honestly, I didn't expect tf.dataset to be that much better; we are learning together. I got used to tf.datasets and always use them because of their advantages.

I think the speedup between using a tf.dataset and a NumPy array could be linked to the mode Tensorflow runs in. It seems Tensorflow runs in graph mode when the input is a tf.dataset, and in eager mode when the input is a NumPy array. Tensorflow in graph mode runs faster, but the code is much harder to debug.
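To illustrate the difference (a generic sketch, not iDron's code): a custom train step wrapped in tf.function gets traced into a graph, while the same function without the decorator runs eagerly. In Keras, model.compile(..., run_eagerly=True) forces eager mode for debugging.

```python
import tensorflow as tf

@tf.function  # traced into a graph; remove the decorator to run eagerly
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```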

Can someone explain why Tensorflow runs much faster with tf.datasets? Should we ask in Google's Tensorflow forum?
 
  • Like
Reactions: iDron

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I wonder if someone has done similar tests to @iDron comparing CPU and GPU performance and utilisation, but for PyTorch not Tensorflow?

is PyTorch already accelerated on metal?
Not yet. PyTorch devs explain their progress here https://github.com/pytorch/pytorch/issues/47702

It seems PyTorch could get a Metal backend in March/April, but progress seems very slow. Indeed, I am not sure whether a Metal backend is a priority for PyTorch devs.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Conclusion: In order to fully use the Apple Silicon chips for deep learning, the tf.dataset API is absolutely necessary
Could you share this finding in the Apple forum? This tip could resolve the complaints about the GPU being slower than the CPU. The faster Apple engineers can resolve complaints, the quicker they can improve Tensorflow-Metal.
 

iDron

macrumors regular
Apr 6, 2010
219
252
Could you share this finding in the Apple forum? This tip could resolve the complaints about the GPU being slower than the CPU. The faster Apple engineers can resolve complaints, the quicker they can improve Tensorflow-Metal.
I will.
Right now I'm still trying to see if I can improve performance a bit. Memory bandwidth is not fully utilized. It should be able to reach 200GB/s, and I've seen benchmarks that report this for deep learning models. If the GPU's computational power is the limiting factor, then of course the bandwidth cannot be fully utilized. But I feel like ~20% improvement could still be possible. The memory bandwidth hovers at 65GB/s; caching and prefetching brought a couple of percent improvement, but not much. GPU average power is at ~21W, seemingly about 80-85% of the maximum.
 

iDron

macrumors regular
Apr 6, 2010
219
252
Why do caching and prefetching improve performance in the M1 SOC only slightly? Does the M1 SOC memory management make caching and prefetching less significant?

Has someone tried NumPy on Tensorflow? Is it faster than regular Numpy?
It could be. I did the TFLOPS linear algebra benchmark I linked previously, which gives 4.1 TFLOPS at a sustained power consumption of 26.5W. My real-world example is at about 85% of this maximum. I think 85% real-world utilization is actually already pretty good, and maybe much more is not really feasible. Do you guys have experience with this on other GPUs? How close does your real-world GPU utilisation get to an artificial benchmark?
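If anyone wants to try the "NumPy on TensorFlow" idea, I believe the entry point is the tf.experimental.numpy module; a quick sketch (I have not benchmarked this myself):

```python
import tensorflow.experimental.numpy as tnp

tnp.experimental_enable_numpy_behavior()  # lets TF tensors behave like ndarrays

a = tnp.random.randn(2048, 2048)
b = tnp.random.randn(2048, 2048)
c = tnp.matmul(a, b)  # executed by TensorFlow, so it can be dispatched to the GPU
print(c.shape)
```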
 
  • Like
Reactions: Xiao_Xi

project_2501

macrumors 6502a
Jul 1, 2017
676
792
This video runs a standard MNIST test and seems to suggest the M1 Pro is faster than an M1 Max.

A suggestion is that the M1 Max has a wider memory bandwidth but higher latency than the M1 Pro - and for small GPU tasks, e.g. many small batches for ML training, this has an overall negative impact?

 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
A suggestion is that the M1 Max has a wider memory bandwidth but higher latency than the M1 Pro - and for small GPU tasks, e.g. many small batches for ML training, this has an overall negative impact?
@iDron 's experiments show that batch size and input type can impact performance significantly.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Could someone explain:
- what MLIR is
- why Tensorflow will use MLIR
- whether Tensorflow will be faster with MLIR
- whether MLIR will improve/worsen the performance of the Tensorflow-Metal plugin

This presentation talks about MLIR and Tensorflow:
 
Last edited:

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What can happen if I install a package based on Google's Tensorflow on a Mac with tensorflow-macos and tensorflow-metal installed? Will the package download Tensorflow and use only the CPU, or will it use the installed Tensorflow and use the GPU?

If the installation fails, would I need to change the requirements file and compile the package?
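For what it's worth, one way to check afterwards which TensorFlow actually got imported and whether the GPU is still visible (just a sketch):

```python
import tensorflow as tf

print(tf.__version__)                          # which build got imported
print(tf.config.list_physical_devices("GPU"))  # empty list -> CPU-only build
```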
 

white7561

macrumors 6502a
Jun 28, 2016
934
386
World
Hi guys. So is the M1 Pro 16-core currently slower than Colab Pro? Is it by a lot? Is it because of unoptimized code on the Mac's side which will improve? Thank you.
 

iDron

macrumors regular
Apr 6, 2010
219
252
Hi guys. So is the M1 Pro 16-core currently slower than Colab Pro? Is it by a lot? Is it because of unoptimized code on the Mac's side which will improve? Thank you.
It should be quite comparable. Even with Colab Pro, it's not guaranteed that you get a faster GPU. The slower GPUs in Colab are in practice quite comparable to the M1 Pro GPU. It hugely depends on your specific models. People have benchmarked somewhere that standard MNIST (I believe) models might run slightly (~20%) faster. I've found some of my personal stuff running slightly slower.

On Colab Pro, the fastest GPUs are at about 9TFLOPS, compared to 5 for the M1 Pro. So if you get an instance with those, the M1 Pro will be about half the speed.

However, code optimizations for the M1 architecture in the future might make things a bit faster.

Also, some models might not benefit from a GPU at all, and of course M1 has a pretty fast CPU.
 