
patent10021

macrumors 68040
Original poster
Apr 23, 2004
3,530
809
Clarification on Metal/ANE and PT/TF for ML on Apple Silicon.

Reading various Apple pages and other blogs, here's what I understand.

PyTorch
To accelerate the training of ML models, PT takes advantage of the ANE's hardware acceleration, but any model you use needs to be translated/compiled into a Core ML version of the model.

PT also supports GPU-accelerated training on Mac via Metal.
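As a minimal sketch of what this looks like in practice (assuming PyTorch 1.12+ with the MPS backend; the model and shapes here are purely illustrative):

import torch

# Use the Metal-backed MPS device when available, otherwise fall back to the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)   # toy model, purely illustrative
x = torch.randn(32, 128, device=device)       # dummy batch
loss = model(x).sum()
loss.backward()                               # forward and backward both run via Metal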

If we don't intend to deploy transformers/models on Apple devices, we don't need to take into consideration how many ANE cores a machine has when choosing one for training, right? In that case, can we just look at the number of GPU cores when choosing a Mac?

Without deploying transformers/models on Apple devices can we still use ANE for training purposes only? Can we use both ANE and Metal for maximum training performance?

TensorFlow
TF uses the tensorflow-metal PluggableDevice to accelerate the training of machine learning models on Apple Silicon with Metal.
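For example, once the plugin is installed, the Apple GPU simply shows up as a TensorFlow device (a sketch; package names per Apple's tensorflow-metal instructions):

# pip install tensorflow-macos tensorflow-metal
import tensorflow as tf

# With the Metal PluggableDevice registered, the Apple GPU shows up as a "GPU"
# device and eligible ops are placed on it automatically.
print(tf.config.list_physical_devices("GPU"))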

TF models compiled as Core ML models also use the ANE.
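(A sketch of that conversion path with coremltools; the Keras model here is a stand-in for a real trained network:)

import tensorflow as tf
import coremltools as ct

# Toy Keras model standing in for a real trained network.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(128,))])

# coremltools converts TF/Keras models directly; Core ML can then schedule
# the converted model across CPU, GPU and ANE at inference time.
mlmodel = ct.convert(model, convert_to="mlprogram")
mlmodel.save("model.mlpackage")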


Is this a correct summary?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Yes.
When training ML models, developers benefit from accelerated training on GPUs with PyTorch and TensorFlow by leveraging the Metal Performance Shaders (MPS) back end. For deployment of trained models on Apple devices, they use coremltools, Apple’s open-source unified conversion tool, to convert their favorite PyTorch and TensorFlow models to the Core ML model package format. Core ML then seamlessly blends CPU, GPU, and ANE (if available) to create the most effective hybrid execution plan exploiting all available engines on a given device.
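For PyTorch, that pipeline looks roughly like this (a sketch, assuming coremltools 6+; the traced model is illustrative):

import torch
import coremltools as ct

model = torch.nn.Linear(128, 10).eval()            # stand-in for a trained model
traced = torch.jit.trace(model, torch.randn(1, 128))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 128))],
    compute_units=ct.ComputeUnit.ALL,  # let Core ML use CPU, GPU and ANE
    convert_to="mlprogram",
)
mlmodel.save("model.mlpackage")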
 

patent10021

macrumors 68040
Original poster
Apr 23, 2004
3,530
809
Yes.

Thanks. Are you training on Silicon? Cloud? Linux? I'm looking for a new machine. Mainly computer vision and analytics.

I'm still unclear on this question though.

Without deploying transformers/models on Apple devices can we still use ANE for training purposes only?

Can we use both ANE and Metal for maximum training performance?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Without deploying transformers/models on Apple devices can we still use ANE for training purposes only?
So far, TensorFlow and PyTorch only use Apple's GPU.

I'm looking for a new machine. Mainly computer vision and analytics.
I would try Google Colab or Amazon SageMaker Studio Lab and see if either of them suits your needs. I wouldn't train a model on Apple Silicon because the Metal backends for TensorFlow and PyTorch are not mature enough.

You may find this presentation about how Stable Diffusion was trained on AWS interesting.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
ANE is generally only used for inference. For training the GPU (and maybe AMX) is used.
 

patent10021

macrumors 68040
Original poster
Apr 23, 2004
3,530
809
ANE is generally only used for inference. For training the GPU (and maybe AMX) is used.
Right. Reading further,

When training ML models, developers benefit from accelerated training on GPUs with PyTorch and TensorFlow by leveraging Metal. Core ML seamlessly blends CPU, GPU, and ANE (if available) to create the most effective hybrid execution plan exploiting all available engines on a given device.
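One way to see that blending in action is to load the same converted model under different compute-unit restrictions and compare prediction times (a sketch, assuming coremltools 6+ and a hypothetical model.mlpackage):

import coremltools as ct

# Same model, three engine policies; compute_units restricts which engines
# Core ML may use when building its execution plan.
cpu_only = ct.models.MLModel("model.mlpackage", compute_units=ct.ComputeUnit.CPU_ONLY)
cpu_gpu = ct.models.MLModel("model.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_GPU)
everything = ct.models.MLModel("model.mlpackage", compute_units=ct.ComputeUnit.ALL)
# Timing .predict() on each variant shows what the GPU and ANE add on a given machine.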

Therefore, since the 48-core GPU and 64-core GPU M1 Mac Studios both have a 32-core Neural Engine, there's no point in looking at the number of NE cores when shopping for a Mac, right?

Training on Apple Silicon appears to be on par with many discrete Nvidia GPUs. Moreover, WWDC 2022 videos showed that TensorFlow will soon support Metal 3, which according to Apple should provide a 16x increase in performance. Do you interpret this as meaning that we can now confidently do high-performance training on Macs and not have to think about training on systems with discrete Nvidia GPUs? Examining tests on GitHub, it appears that with Apple Silicon we can get performance on par with Colab's Tesla GPU accelerators.

Lastly, clearly a 64-core GPU is recommended if one can afford it. But what about memory? Do you really need to max out memory at 64GB or 128GB? For deep-learning training, are we still better off with a Windows system plus a Titan/RTX GPU?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Training on Apple Silicon appears to be on par with Nvidia/RTX PC systems.
Examining tests on GitHub, it appears that with Apple Silicon we can get performance on par with Tesla GPU accelerators.
Do you have a link that confirms all this?

Moreover, WWDC 2022 videos showed that TensorFlow will soon support Metal 3, which according to Apple should provide a 16x increase in performance.
Can you share the link to the presentation?

I would be very impressed if Apple had managed to improve TensorFlow or PyTorch so much in such a short time. For reference, Apple claimed that PyTorch on the GPU was 8 times faster than PyTorch on the CPU.
[Image: pytorch.png — Apple's chart comparing PyTorch training performance on the GPU vs. the CPU]


Therefore, since the 48-core GPU and 64-core GPU M1 Mac Studios both have a 32-core Neural Engine, there's no point in looking at the number of NE cores when shopping for a Mac, right?
Clearly a 64-core GPU is recommended if one can afford it. But what about memory? Do you really need to max out memory at 64GB or 128GB?
It would be easier to help you if you explained what you want to do. Are you going to fine-tune a pre-trained model or train a model from scratch? What kind of models do you want to train? How large is your dataset? How much time do you have to train a model?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Training on Apple Silicon appears to be on par with many discrete Nvidia GPUs.

This is really not the case. Inference can be fairly fast when you use Core ML and utilise all the available accelerators, especially given the power consumption (e.g. Apple's Core ML implementation of Stable Diffusion is only two to three times slower than high-end Nvidia GPUs). But I haven't seen a single example where Apple Silicon outperforms an Nvidia GPU of comparable nominal performance in training.

Moreover, WWDC 2022 videos showed that TensorFlow will soon support Metal 3, which according to Apple should provide a 16x increase in performance. Do you interpret this as meaning that we can now confidently do high-performance training on Macs and not have to think about training on systems with discrete Nvidia GPUs?

I interpret this as you probably misunderstanding something. Maybe this 16x was about a specific corner case or something like that.
 

innominato5090

macrumors 6502
Sep 4, 2009
452
71
As much as I like my Apple Silicon, not sure why you'd want to train a model on it. Performance is not as good, support seems limited, and you're missing out on a lot of tools that are available in the CUDA ecosystem.
 

patent10021

macrumors 68040
Original poster
Apr 23, 2004
3,530
809
As much as I like my Apple Silicon, not sure why you'd want to train a model on it. Performance is not as good, support seems limited, and you're missing out on a lot of tools that are available in the CUDA ecosystem.

As an aside, truly big datasets (i.e. >25GB) require more memory than even the most expensive consumer NVIDIA/AMD GPUs offer. The advantage of Apple Silicon is that you can work with big datasets because they all load into 64GB/128GB of unified memory.

This is why, to me, the best of both worlds is using an M1 Ultra (64GB, 32- or 48-core GPU) for data science projects and Colab+CUDA when you need it for deep learning projects. I've been using university accelerators, so I didn't know until recently that CUDA could be used directly on Colab:
%load_ext nvcc_plugin
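(For reference, a sketch of the fuller setup I mean; the install URL and extension name come from the community nvcc4jupyter project and may differ between versions:)

# In a Colab notebook with a GPU runtime:
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc_plugin
# After loading, %%cu cells compile and run CUDA C directly in the notebook.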

To me at least, it seems like the cloud is the way forward for private workstations, since you will have access to whatever you need.

It's worth noting that probably no one in industry actually uses consumer (Apple Silicon/PC) systems for real-world production and training. I imagine most use Colab, SageMaker, multiple A6000s, etc., or wherever there is access to compute accelerators.

I'd love to hear about any other info/news you guys could tell me about since I am new to setting up my own home workstation and shopping around.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
It's worth noting that probably no one in industry actually uses single-GPU consumer (Apple Silicon/PC) systems for real-world production and training.
Keep in mind that cloud service providers offer not only the latest data center GPUs, but also best-in-class software for MLOps. MLOps can be difficult to do well on-premises.

[Image: ML-pipeline.png — diagram of an end-to-end MLOps pipeline]

 