
leman

macrumors Core
Oct 14, 2008
19,516
19,664
Is Apple's Tensorflow fork reliable?

Some people report bad results when using the GPU.

The problem with Apple's TensorFlow plugin is that it's closed-source, undocumented software with no obvious support channel. This is so typically Apple. At least with the older, crappy TF fork one could look at the code and try to identify (maybe even patch) the issue; now one can't do anything at all, and Apple engineers don't reply to these topics in the forums.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What can other companies learn from the Apple Tensorflow plugin? What does Apple need to hide?

If Apple publishes the code of its Tensorflow plugin, other projects can learn from it and develop a Metal backend faster.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
If Apple publishes the code of its Tensorflow plugin, other projects can learn from it and develop a Metal backend faster.

Exactly! The decision to hide the plugin implementation does not make any sense and harms both Apple and its loyal user base. It's clear that Apple does not have a unified strategy; it seems to me like multiple teams are shipping different things without coordinating with each other. On one hand they release C++ Metal bindings, a wonderful initiative that will massively simplify Metal development for many devs outside Apple's ecosystem; on the other hand, their ML efforts are awkward at best...
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
How does NVIDIA do things in this area, and could Apple learn from them? My (pure) guess, from how popular NVIDIA is among those who do ML, is that NVIDIA's approach is refined, unified, and well-developed.


Relatedly, a lot of the researchers doing ML are scientists rather than programmers, and the fact that CUDA is much easier to work in than OpenCL is a big plus. How does Metal compare to CUDA for ease of use?
 
Last edited:

hajime

macrumors 604
Jul 23, 2007
7,920
1,310
So the conclusion is that although Apple Silicon has the potential for ML research, Apple's decision to keep the API closed source and the lack of hardware AVX and CUDA support still make an Intel PC the better platform for researchers?
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
So the conclusion is that although Apple Silicon has the potential for ML research, Apple's decision to keep the API closed source and the lack of hardware AVX and CUDA support still make an Intel PC the better platform for researchers?

Kind of depends on what you do. For many people who are locked into the CUDA ecosystem, there is no way around Nvidia at the moment. Apple Silicon can become good, but it will take some time until the software (and hardware) matures.

Then again, Apple Silicon is great for software prototyping in many ways, and it has very good CPU performance. Something like a 14" MBP might be one of the best portable laptops on the market for a data scientist. Training of large models can be done on remote GPU-accelerated machines (cloud services or supercomputers).
 
  • Like
Reactions: hajime and ingambe

ingambe

macrumors 6502
Mar 22, 2020
320
355
Kind of depends on what you do. For many people who are locked into the CUDA ecosystem, there is no way around Nvidia at the moment. Apple Silicon can become good, but it will take some time until the software (and hardware) matures.

Then again, Apple Silicon is great for software prototyping in many ways, and it has very good CPU performance. Something like a 14" MBP might be one of the best portable laptops on the market for a data scientist. Training of large models can be done on remote GPU-accelerated machines (cloud services or supercomputers).
In any case, even if performance were on par with a 3080 Ti, the question would be: do you really want to let your laptop run for days at maximum utilization?
You would also have to checkpoint frequently so you can stop training whenever you need to carry your laptop around.
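Something like the Keras ModelCheckpoint callback makes that part painless. A minimal sketch with a toy model and random data (purely illustrative, not anyone's actual setup):

Code:
import numpy as np
import tensorflow as tf

# Toy model and random data, just to show the checkpointing pattern.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save the weights at the end of every epoch so training can be stopped and resumed.
ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/ckpt_{epoch:03d}",
    save_weights_only=True,
    save_freq="epoch",
)

x, y = np.random.rand(256, 8), np.random.rand(256, 1)
model.fit(x, y, epochs=5, batch_size=32, callbacks=[ckpt])

# Resume later by rebuilding the model and calling model.load_weights("checkpoints/ckpt_005").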

Being able to do at least one iteration with a reasonable batch size is already very nice
 

hajime

macrumors 604
Jul 23, 2007
7,920
1,310
Thanks. Am I correct that, although it has been a year since Apple released the M1 chip, the maturity of software and hardware support for both ML and DS is still quite poor?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
the maturity of software and hardware support for both ML and DS is still quite poor?
Most of the software runs natively on Apple Silicon, but few packages support the GPU.

The main problem is that developing for Mac is more expensive and cumbersome than for Windows or Linux.

I wonder how the PyTorch devs will build and test their code, since GitHub still doesn't offer any Apple Silicon hardware for CI.
 

Romain_H

macrumors 6502a
Sep 20, 2021
520
438
The main problem is that developing for Mac is more expensive and cumbersome than for Windows or Linux.
I agree. I do want to point out, however, that this is true for ML development only. Xcode/Swift and Cocoa are otherwise second to none
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
The main problem is that developing for Mac is more expensive and cumbersome than for Windows or Linux.

I disagree with this sentiment wholeheartedly. Developing for macOS is a breeze compared to Windows, and it can also be more straightforward than Linux, where you have to consider the differences between distros. For UI stuff, it’s not even a competition.

But then again, your mileage might vary. Maybe you are referring to the lack of CI for macOS? That might impact the experience for large collaborative FOSS efforts somewhat, but there are always solutions if the community is passionate (I have a private GitLab runner on the NAS in my closet :) )
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I wasn't specific enough. I was talking about headless software like Julia, PyTorch, or scikit-learn.

It seems developing for macOS is easy when the app targets only macOS, but not for cross-platform apps, because the macOS versions tend to lack some features.
Xcode/Swift and Cocoa are otherwise second to none
Developing for macOS is a breeze compared to Windows, and it can also be more straightforward than Linux, where you have to consider the differences between distros. For UI stuff, it’s not even a competition.
The lack of cheap and convenient CI for macOS makes developing for macOS very difficult, even for big FOSS projects like GIMP.
Maybe you are referring to the lack of CI for macOS? That might impact the experience for large collaborative FOSS efforts somewhat
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
The lack of cheap and convenient CI for macOS makes developing for macOS very difficult, even for big FOSS projects like GIMP.

That is true, but big FOSS projects of this type are the exception, not the norm. And the GIMP community specifically has very little interest in producing a high-quality Mac version, since they are focused on Linux. In terms of photo editing, macOS has plenty of high-quality options; efforts like GIMP are almost amateurish in comparison.
 

project_2501

macrumors 6502a
Jul 1, 2017
676
792
Aside from the number of GPU cores, are the M1 Pro and M1 Max equivalent for numerical computing workloads?

... do they have the same number of AMX matrix multipliers, for example?
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Aside from the number of GPU cores, are the M1 Pro and M1 Max equivalent for numerical computing workloads?

... do they have the same number of AMX matrix multipliers, for example?
Yes, but also maybe no!

In terms of announced non-GPU numerical-computing capability, the Pro and Max should be completely equal.

But when people went over the M1 Pro and Max die photos Apple published, one thing that stuck out is that the Max has a second copy of the 16-core Neural Engine. Some speculated that it was a weird Photoshop job by Apple to cover up what's really in that spot, for some conspiracy reason. However, Hector Martin of Asahi Linux reported that this extra Neural Engine cluster appears in the M1 Max device tree, but its entry is deleted by iBoot (one of the bootloader stages) for reasons which are unclear. He has also mentioned that there are references to up to four clusters in macOS M1 drivers, which makes sense, as other info he's uncovered points to the M1 Max supporting a dual-die configuration that would provide four Neural Engines.

It's a mystery why M1 Max currently has it disabled. Perhaps Apple doesn't want to enable it outside of the dual-die config, which seems likely to be desktop-only and thus would have access to a lot more power.
 

iDron

macrumors regular
Apr 6, 2010
219
252
I just got my 10/16 (CPU/GPU) MBP with 16GB RAM and tested my recent (very simple) ML models.

I am a materials scientist, so by no means deep into the computational side. However, I recently built some simple CNN and DNN models to use on some of our measurement data. I had a 2015 13" MBP with Intel graphics and quickly realized that training would take forever; even playing around with the model design was inconvenient. I quickly switched over to Google Colab, where one can use an Nvidia Tesla K80 GPU, for example. I see figures for the Tesla's performance in the range of 5.6 to 8.7 TFLOPS, depending on GPU boost. However, you'll run into frequent disconnects and such on Google Colab, which is why I do not think it is a serious tool for full-time work.

I moved my workflow back from Colab to my Mac and did some comparisons with Google Colab. The M1 Pro is supposed to have 5.2 TFLOPS of GPU compute. A 150-epoch training run on my dataset took 564 seconds on Colab and 687 seconds on the M1 Pro. So the Mac takes about 20% longer than the K80. I'm not sure what the actual rating of Google's K80s is; if they run at 5.6 TFLOPS, the Mac actually over-performs in terms of GPU power. If it's at 8.7 TFLOPS, Tensorflow-Metal would seem to be slightly inefficient, but not terribly so.

Getting Tensorflow-Metal set up took me several tries (a restart of the Mac solved the issues in the end), following this article: https://makeoptim.com/en/deep-learning/tensorflow-metal .
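For reference, the setup essentially boils down to installing the two packages and checking that the Metal device shows up. A quick sanity check (package names as of the current TF 2.6/2.7 releases, not the article's exact steps):

Code:
# In a fresh Python environment (assumed): python -m pip install tensorflow-macos tensorflow-metal
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # should list one Metal PluggableDevice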

Also, other libraries such as scikit-learn are still a bit of a hassle to compile on arm, while alternatives to Tensorflow, such as PyTorch, are still at an earlier stage.

However, as a laptop to develop models on, the new MBPs are a good choice, I think. Of course, there are other laptops with over 30 TFLOPS of GPU power in them, but they are way too gigantic to be portable.

[Attached screenshot: Screenshot 2021-12-11 at 6.51.24 PM.png]
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I recently built some simple CNN and DNN models to use on some of our measurement data
Did you have any problems with the Tensorflow-Metal plug-in? It seems Tensorflow-Metal doesn't support some ops, so code runs on the CPU instead of the GPU.

However, you'll run into frequent disconnects and such on Google Colab, which is why I do not think it is a serious tool for full-time work.
Why don't you use Google Colab Pro?

I [...] did some comparisons with Google Colab. [T]he Mac takes about 20% longer than the K80.
Your results seem to contradict the benchmark shared in the Tensorflow forum.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
It seems that Apple is actively working on the Tensorflow-Metal plugin (a new version was released last week), but the complete silence and the lack of communication or even minimal release notes is infuriating. Their "official" Tensorflow plugin website is still hidden away behind obscure links, still refers to the "macOS 12 beta," and contains zero useful information about what is supported or what they are working on. Developer relations really dropped the ball on this one.
 
Last edited:

iDron

macrumors regular
Apr 6, 2010
219
252
Did you have any problems with the Tensorflow-Metal plug-in? It seems Tensorflow-Metal doesn't support some ops, so code runs on the CPU instead of the GPU.
I checked the GPU monitor, and it was constantly above 90%. Speeds were roughly comparable to using the 5 TFLOPS GPU on Colab, so I think it is about what I can expect. Do you have a link to what actions are seemingly not supported by TF-Metal? I would be interested to look deeper into it.

Why don't you use Google Colab Pro?

1. Even with Pro, there is no guarantee that you get faster GPUs; you only get prioritized access.
2. Even with Pro, it disconnects when your computer goes to sleep.
3. Colab has some issues every now and then with saving files and also with displaying the cells.
4. Working in a proper IDE is a lot better than working in the browser.
5. I do not live in a country where Google offers Colab Pro. I guess you could probably use it anyway, but their conditions officially say that they reserve the right to terminate your account if you violate the terms. That's not workable for a work project.

Your results seem to contradict the benchmark shared in the Tensorflow forum.

Well, in those results the M1 Max is 20 percent faster than the K80; in my project, it's 20% slower. It is still the same order of magnitude. My project consists of two simple DNNs and one CNN, which feed into a concatenation layer followed by a single fully connected layer. GPU acceleration of the DNN streams did not improve the speed much.
This structure might just be more efficient on Colab, for whatever reason. Besides the GPU at almost 100%, my CPU cores were also running rather high, so Tensorflow was definitely still trying to use significant CPU power.

Depending on the model structure, you might get better or worse performance with M1 Pro/Max vs. Intel and Nvidia elsewhere.
 
  • Like
Reactions: project_2501

iDron

macrumors regular
Apr 6, 2010
219
252
I did a bit more testing: with
Code:
from tensorflow.python.client import device_lib  # this import is needed for the call below
device_lib.list_local_devices()

it shows me:

Code:
2021-12-11 23:42:24.651015: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-11 23:42:24.651043: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)

[name: "/device:CPU:0"  device_type: "CPU"  
memory_limit: 268435456  locality {  }  incarnation: 11260881963924141305  xla_global_id: -1,  
name: "/device:GPU:0" device_type: "GPU" 
locality { bus_id: 1 } incarnation: 11509770477685582231 
physical_device_desc: "device: 0, name: METAL, pci bus id: <undefined>" xla_global_id: -1]

However, I can still run code with:

Code:
with tf.device('gpu:0'):
    ...  # model building and training code placed on the GPU

Forcing Tensorflow to use the GPU with this command gives exactly the same execution time as not giving the option at all. Interestingly, when I forced it to use only the CPU with the same command, the GPU was still running in the background, albeit at somewhat lower utilization. Forcing the CPU led to about 30% longer times, so not that big of a difference.
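A stricter way to rule the GPU out entirely would be to hide it before any ops run, rather than relying on a tf.device() hint. A small sketch (untested here, just the idea; the matmul is only a stand-in workload):

Code:
import tensorflow as tf

# Hide the Metal GPU before anything touches it; tf.device() is only a placement
# preference, so the plugin may otherwise still schedule work on the GPU.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())  # should now list only the CPU

with tf.device("/CPU:0"):
    x = tf.random.normal([2048, 2048])
    y = tf.linalg.matmul(x, x)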
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Do you have a link to what actions are seemingly not supported by TF-Metal?
So far, Apple hasn't published what ops the Tensorflow-Metal plugin supports. Honestly, I doubt that even Apple engineers know what ops are supported.
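Since there is no official op list, one workaround is to turn on device-placement logging and watch which ops silently land on the CPU. A minimal sketch (the matmul is just an example op, not a specific model):

Code:
import tensorflow as tf

# Print the device chosen for every op; anything the Metal plugin can't handle
# should show up on /device:CPU:0 instead of /device:GPU:0.
tf.debugging.set_log_device_placement(True)

a = tf.random.normal([1024, 1024])
b = tf.linalg.matmul(a, a)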

It seems that Apple is actively working on the Tensorflow-Metal plugin
It seems that Apple is getting faster at adapting Tensorflow-Metal to each new Tensorflow version: it took a month and a half for version 2.6 and one month for 2.7.

@iDron Could you run the same code using only Google's Tensorflow in another environment, to force the code to run only on the CPU?
 

iDron

macrumors regular
Apr 6, 2010
219
252
batch size = 1       | with GPU | with CPU | no-metal
memory bandwidth     | 32 GB/s  | 35 GB/s  | 44 GB/s
GPU utilization      | 99%      | 88%      | 0%
CPU utilization      | 110%     | 250%     | 330%
GPU power            | 1.4 W    | 0.6 W    | 0.1 W
CPU power            | 6 W      | 10 W     | 14 W
execution time       | 234 s    | 317 s    | 130 s

batch size = 4       | with GPU | with CPU | no-metal
memory bandwidth     | 32 GB/s  | 30 GB/s  | 35 GB/s
GPU utilization      | 99%      | 80%      | 0%
CPU utilization      | 110%     | 400%     | 520%
GPU power            | 2 W      | 0.5 W    | 0.1 W
CPU power            | 5.5 W    | 15 W     | 21 W
execution time       | 62 s     | 96 s     | 51 s

batch size = 16      | with GPU | with CPU | no-metal
memory bandwidth     | 32 GB/s  | 26 GB/s  | n/a
GPU utilization      | 99%      | 60%      | 0%
CPU utilization      | 60%      | 440%     | 610%
GPU power            | 2.5 W    | 0.4 W    | 0.1 W
CPU power            | 5.3 W    | 17 W     | 28 W
execution time       | 67 s     | 56 s     | 32 s

batch size = 128     | with GPU | with CPU | no-metal
memory bandwidth     | 28 GB/s  | 26 GB/s  | 29 GB/s
GPU utilization      | 99%      | 40%      | 0%
CPU utilization      | 55%      | 650%     | 700%
GPU power            | 7 W      | 0.2 W    | 0.1 W
CPU power            | 5.5 W    | 26 W     | 28 W
execution time       | 18 s     | 28 s     | 25 s

batch size = 512     | with GPU | with CPU | no-metal
memory bandwidth     | 40 GB/s  | 26 GB/s  | 30 GB/s
GPU utilization      | 99%      | 23%      | 0%
CPU utilization      | 50%      | 700%     | 700%
GPU power            | 12 W     | 0.1 W    | 0.1 W
CPU power            | 5.5 W    | 28 W     | 29 W
execution time       | 11 s     | 25 s     | 24 s

I just did this comparison benchmark of my model, training it either in a tensorflow-metal environment (forcing GPU or CPU with 'with tf.device():') or in a standard tensorflow environment.

Tensorflow-metal always tries to use both the GPU and the CPU, even if you tell it to use only one of them. At small batch sizes, this does not work well, as the no-metal environment loads the CPU a lot more heavily and is almost twice as fast! At a batch size of 4, tensorflow-metal and standard tensorflow are almost level for some reason. At a batch size of 16, the standard tensorflow environment is once again twice as fast.

As batch sizes increase, the GPU lets the model train a lot faster: at a batch size of 512, GPU-accelerated training is about 2.5x the speed of running on the CPU, whether in the tf-metal environment or in standard tensorflow. If you have models suited to even larger batch sizes, you might get an even bigger speedup!

Also noteworthy: as batch size increases, tf-metal uses the GPU much less when told to use only the CPU. Thus, the CPU speeds of standard tensorflow and tf-metal converge.

However, if you plan to use small batch sizes, the metal-plugin might actually slow things down!

Also, I benchmarked my old MBP (2015, 2.7 GHz i5) running standard (non-metal) tensorflow: it is roughly 6x slower than the M1 Pro across all batch sizes. So if one is upgrading from a similarly old machine, one can expect at least a 15x speedup over a 5-6 year old MBP when actually utilising the GPU.
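For anyone who wants to reproduce this kind of sweep, the comparison boils down to timing model.fit() at each batch size. A rough sketch with a toy model and random data (purely illustrative, not the model benchmarked above):

Code:
import time
import numpy as np
import tensorflow as tf

# Toy model and random data, just to illustrate the batch-size sweep.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(4096, 32), np.random.rand(4096, 1)

for batch_size in (1, 4, 16, 128, 512):
    start = time.perf_counter()
    model.fit(x, y, epochs=1, batch_size=batch_size, verbose=0)
    print(f"batch {batch_size}: {time.perf_counter() - start:.1f}s")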
 
Last edited: