Apple Silicon in AI (2023)

TechnoMonk · Mar 20, 2023

Zest28 said:
Interesting. I was going to build a new PC with a RTX 4090 for AI, but based on what I am reading it sounds like a bad idea?

Is it not a driver issue that can simply be fixed with a software / firmware update?

It’s pretty good GPU for gaming and something like stable diffusion. I built a maxed out workstation with AMD thread ripper and 3090, now upgraded to 4090.
The reality is somewhere in the middle, this place likes to pretend Apple has all the problems and others don’t, but it is expected with the technology.

TechnoMonk · Mar 20, 2023

GrumpyCoder said:
And then what? 512GB is great (we already have 640GB with Nvidia in a single box though and super fast connections for multiple boxes). But then we still have (likely) slow performance. Keep in mind the 1080Ti is running circles around the M1 Ultra... so that potential Mac Pro would have to be much, much, MUCH faster than the Ultra. They need something to compete with (multiple) A100 level cards to be interesting for the cloud.

There have been a bunch of CUDA to <insert-favorite-hype-alternative-here-that-dies-after-2-months> projects, none of which worked as expected and all were abandoned after a while. Like it or not, 99% of the market is Nvidia and it won't change anytime soon. A super fast desktop Mac for fiddling around locally would be nice though.

I don’t see apple competing in segment of training or creating AI models. Nvidia has that market in the bag. Inference and tools using these AI models is a different conversation. Apple with unified memory can take a 4090 to cleaners in some workflows.

TechnoMonk · Mar 20, 2023

Xiao_Xi said:
Apple could create a tool to convert CUDA code to Metal code as Intel and AMD have done, but I don't think they can legally convert compiled CUDA code at runtime. I have read that neither AMD nor Intel have tried due to the Oracle/Google lawsuit.

Converting cuda at run time will be inefficient and slow. I rather optimize my models for Apple Silicon and then run inference. Not to mention data compatibility issues between models at run time.

Xiao_Xi · Mar 21, 2023

leman said:
What's stopping one from implementing a CUDA-compatible toolkit? Are there provisions that prohibit copying an API? That would be extremely odd.

So it seems. Google got off the hook for copying the Java API because it claimed fair use. I'm not sure Apple will be so lucky if it tries to copy the CUDA API.

Google v Oracle: Supreme Court declares Google's code copying fair

The US Supreme Court has handed Google a major win in a decade-long battle over copied code.

www.bbc.com

senttoschool said:
I think/hope Apple will put the "Extreme" version of their chips in the cloud. Imagine an Extreme chip packaged together with 512GB of VRAM available for training or inference in the cloud.

Not even nVidia offers that. In fact, nVidia just unveiled its most powerful GPU, a dual-GPU card that offers 188 GB of HBM3 memory (94 GB per card). Would HBM3 be suitable for the Mx Extreme or would Apple need more advanced technology?

NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models

www.anandtech.com

GrumpyCoder · Mar 21, 2023

TechnoMonk said:
Inference and tools using these AI models is a different conversation. Apple with unified memory can take a 4090 to cleaners in some workflows.

But what workflow? Video work with Final Cut? Sure, that's what the Mx is optimized for, among a few other things.

I'm not so sure about tools, I mean, what tools would I need once I've arrived at the inference and left development of the model behind? It's just running at that point. Integration? Sure. And inference, I somehow can't see Apple providing cloud solutions for pure inference tasks or who would even use that. On some computer on-site, maybe..., but if I need a large A100 cluster for inference, something else is wrong.

But even if that's the case, the M1 Ultra with 64Gb is still beaten by a V100 (not A100!) with 16GB when it comes to inference (https://github.com/lucadiliello/pytorch-apple-silicon-benchmarks). The old A100 is already beaten by the 3080, except when memory is an issue or using multiple GPUs (https://bizon-tech.com/gpu-benchmarks/NVIDIA-Tesla-V100-vs-NVIDIA-RTX-3080/535vs578). And even the 1080Ti is faster than the Ultra for inference (https://sebastianraschka.com/blog/2022/pytorch-m1-gpu.html). For every single task? Maybe not, but for general models I wouldn't trade a 4090 for anything Apple has right now. Time will tell if that might change one day.

If we look at production applications (robotics, digital twins, etc.), wouldn't a Nvidia Jetson with it's dedicated hardware/pipelines that are not even available in GPGPUs be a better option than some computer?

In the end, I don't think that's what Apple want anyway. I still hope we'll see a more powerful Mac Pro in the future.

Xiao_Xi said:
Not even nVidia offers that. In fact, nVidia just unveiled its most powerful GPU, a dual-GPU card that offers 188 GB of HBM3 memory (94 GB per card). Would HBM3 be suitable for the Mx Extreme or would Apple need more advanced technology?

I think that's just their response to the ChatGPT hype. We already have cards with 80GB per card and are able to put up to 8 of these into a workstation and server with SXM being fast enough to utilize the 640GB very well. It's "good enough". And bluntly speaking, when someone scans the whole internet, throws in attention to create a model that understands input and creates the best fitting average answer, which is often not fully correct or simply wrong in other cases (because the internet as a source only contains the truth and facts...), then yes, more memory is always better. Or in other words, they upped the memory a little from 80 to 94GB and doubled the specs to create a tighter "package".

One thing is for certain, the upcoming Mac Pro will give us a good idea of where Apple heading, even if it falls short of what Nvidia is offering.

senttoschool · Mar 21, 2023

Xiao_Xi said:
Not even nVidia offers that. In fact, nVidia just unveiled its most powerful GPU, a dual-GPU card that offers 188 GB of HBM3 memory (94 GB per card). Would HBM3 be suitable for the Mx Extreme or would Apple need more advanced technology?

NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models

www.anandtech.com

I mean, the big advantage of Apple Silicon is unified memory. So I hope Apple sees what they have and put out something that others simply can't compete with in the professional space.

Anyways, Apple denying the Apple Silicon leaders from creating server versions of their chips might be one of the bigger mistakes they made. These leaders eventually left to form Nuvia, which Qualcomm bought and will bring to market for PCs, smartphones, and servers.

senttoschool · Mar 23, 2023

GitHub - apple/ml-ane-transformers: Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE)

Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE) - apple/ml-ane-transformers

github.com

Deploying Transformers on the Apple Neural Engine

An increasing number of the machine learning (ML) models we build at Apple each year are either partly or fully adopting the Transformer…

machinelearning.apple.com

Pressure · Mar 24, 2023

Looks like it will be much easier running something like ChatGPT locally on a Mac soon.

Imagine Siri but it actually works 😂

TechnoMonk · Mar 24, 2023

GrumpyCoder said:
But what workflow? Video work with Final Cut? Sure, that's what the Mx is optimized for, among a few other things.

I'm not so sure about tools, I mean, what tools would I need once I've arrived at the inference and left development of the model behind? It's just running at that point. Integration? Sure. And inference, I somehow can't see Apple providing cloud solutions for pure inference tasks or who would even use that. On some computer on-site, maybe..., but if I need a large A100 cluster for inference, something else is wrong.

But even if that's the case, the M1 Ultra with 64Gb is still beaten by a V100 (not A100!) with 16GB when it comes to inference (https://github.com/lucadiliello/pytorch-apple-silicon-benchmarks). The old A100 is already beaten by the 3080, except when memory is an issue or using multiple GPUs (https://bizon-tech.com/gpu-benchmarks/NVIDIA-Tesla-V100-vs-NVIDIA-RTX-3080/535vs578). And even the 1080Ti is faster than the Ultra for inference (https://sebastianraschka.com/blog/2022/pytorch-m1-gpu.html). For every single task? Maybe not, but for general models I wouldn't trade a 4090 for anything Apple has right now. Time will tell if that might change one day.

If we look at production applications (robotics, digital twins, etc.), wouldn't a Nvidia Jetson with it's dedicated hardware/pipelines that are not even available in GPGPUs be a better option than some computer?

In the end, I don't think that's what Apple want anyway. I still hope we'll see a more powerful Mac Pro in the future.

I think that's just their response to the ChatGPT hype. We already have cards with 80GB per card and are able to put up to 8 of these into a workstation and server with SXM being fast enough to utilize the 640GB very well. It's "good enough". And bluntly speaking, when someone scans the whole internet, throws in attention to create a model that understands input and creates the best fitting average answer, which is often not fully correct or simply wrong in other cases (because the internet as a source only contains the truth and facts...), then yes, more memory is always better. Or in other words, they upped the memory a little from 80 to 94GB and doubled the specs to create a tighter "package".

One thing is for certain, the upcoming Mac Pro will give us a good idea of where Apple heading, even if it falls short of what Nvidia is offering.

FCP? What year is this? 2010? Bunch of meaningless benchmarks don’t mean much to my pipelines. I was talking about my real life experience using Ai pipelines and workflows. More and more tools have started using AI models. I mostly use Topaz Video AI for upscaling to 4k or 8k. Adobe Premier pro has content aware erase, which uses AI models. Not to mention video generative models, 4090 chokes at generating anything more than 720P. This is small subset of my workflow.
I am not paying ridiculous rates for running inferences in cloud, my compute power for running inferences is predictable. I use cloud for training, using those Nvidia GPU clusters.

TechnoMonk · Mar 24, 2023

senttoschool said:
GitHub - apple/ml-ane-transformers: Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE)

Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE) - apple/ml-ane-transformers

github.com

Deploying Transformers on the Apple Neural Engine

An increasing number of the machine learning (ML) models we build at Apple each year are either partly or fully adopting the Transformer…

machinelearning.apple.com

This was turning point on why I started looking at Apple for inferences. Apple still needs lot of work but they are headed in the right direction.

GrumpyCoder · Mar 25, 2023

TechnoMonk said:
FCP? What year is this? 2010? Bunch of meaningless benchmarks don’t mean much to my pipelines. I was talking about my real life experience using Ai pipelines and workflows. More and more tools have started using AI models. I mostly use Topaz Video AI for upscaling to 4k or 8k. Adobe Premier pro has content aware erase, which uses AI models. Not to mention video generative models, 4090 chokes at generating anything more than 720P. This is small subset of my workflow.
I am not paying ridiculous rates for running inferences in cloud, my compute power for running inferences is predictable. I use cloud for training, using those Nvidia GPU clusters.

Your bottleneck is not the inference part, it's the actual video/image part which, similar to FCP, Topaz and Premier is using. That's what AS is optimzed for. Run those models in inference stand alone and compare them to a Nvidia GPU.

Oh and btw, there's no general defect in 4090 cards. There is however an issue with specific chipsets or rather manufacturers implementation of that chipset in their mainboards. So if anything, if the 4090 isn't defective (happens), change the mainboard rather than the GPU. I've seen that myself when we tried a 4090 with a Ryzen 7 series CPU and an Asus mainboard. Bad performance, crashes after a while, bad temp control, in Linux we couldn't read sensor information and so on. Other boards were fine, but required a little tweaking. I'd always stay away from AMD CPUs for these new builds. AMD systems only seems to be really stable after a year or so, particularly in Linux. Intel is the much better choice.

Other than that, many people seem to have problems with the power supply not supplying enough power for spikes and the 12vhpwr pin in general. The sense pins/cable seem to break easily which can result in all sort of problems.

But if you're so heavily doing video work, a RTX8000 or A6000 would be a better choice over a 4090 anyway.

sunny5 · Mar 25, 2023

TechnoMonk said:
I don’t see apple competing in segment of training or creating AI models. Nvidia has that market in the bag. Inference and tools using these AI models is a different conversation. Apple with unified memory can take a 4090 to cleaners in some workflows.

Only in theory, but in reality? It's not. I'm using WebUI and it use ALL unified memory and it's way slower than mobile RTX 3080 or maybe mobile RTX 3060. In 3D and AI, having unified memory seems meaningless as nobody barely use and optimize unified memory and doesn't really perform great.

senttoschool · Mar 25, 2023

sunny5 said:
Only in theory, but in reality? It's not. I'm using WebUI and it use ALL unified memory and it's way slower than mobile RTX 3080 or maybe mobile RTX 3060. In 3D and AI, having unified memory seems meaningless as nobody barely use and optimize unified memory and doesn't really perform great.

RAM is RAM. There is no way to create more physical RAM from software. There clearly exists some bottleneck elsewhere for Apple Silicon in both training and inference. If and when these bottlenecks get fixed, then unified memory can shine.

leman · Mar 26, 2023

senttoschool said:
RAM is RAM. There is no way to create more physical RAM from software. There clearly exists some bottleneck elsewhere for Apple Silicon in both training and inference. If and when these bottlenecks get fixed, then unified memory can shine.

Spot on. Nvidia has more matmul flops, and that’s the beginning and the end of the story. That said, the CoreML version of stable diffusion published by Apple runs fairly decently and indicates what Apple Silicon should be capable of in this domain if Apple commits to shipping larger accelerators.

senttoschool · Mar 26, 2023

leman said:
Spot on. Nvidia has more matmul flops, and that’s the beginning and the end of the story. That said, the CoreML version of stable diffusion published by Apple runs fairly decently and indicates what Apple Silicon should be capable of in this domain if Apple commits to shipping larger accelerators.

If they haven’t, they will definitely focus on AI acceleration in Apple Silicon now.

I’m going to guess that by 2030, we are going to be buying huge NPUs with a CPU attached to it. Currently, it’s the other way around. AI inference power will be much more important than CPU speed.

TechnoMonk · Mar 26, 2023

sunny5 said:
Only in theory, but in reality? It's not. I'm using WebUI and it use ALL unified memory and it's way slower than mobile RTX 3080 or maybe mobile RTX 3060. In 3D and AI, having unified memory seems meaningless as nobody barely use and optimize unified memory and doesn't really perform great.

WebUI uses CPU, have you converted the models to CoreML?

TechnoMonk · Mar 26, 2023

senttoschool said:
RAM is RAM. There is no way to create more physical RAM from software. There clearly exists some bottleneck elsewhere for Apple Silicon in both training and inference. If and when these bottlenecks get fixed, then unified memory can shine.

He is using WebUI, which doesn’t use GPU. It’s very poor comparison.

TechnoMonk · Mar 26, 2023

GrumpyCoder said:
Your bottleneck is not the inference part, it's the actual video/image part which, similar to FCP, Topaz and Premier is using. That's what AS is optimzed for. Run those models in inference stand alone and compare them to a Nvidia GPU.

Oh and btw, there's no general defect in 4090 cards. There is however an issue with specific chipsets or rather manufacturers implementation of that chipset in their mainboards. So if anything, if the 4090 isn't defective (happens), change the mainboard rather than the GPU. I've seen that myself when we tried a 4090 with a Ryzen 7 series CPU and an Asus mainboard. Bad performance, crashes after a while, bad temp control, in Linux we couldn't read sensor information and so on. Other boards were fine, but required a little tweaking. I'd always stay away from AMD CPUs for these new builds. AMD systems only seems to be really stable after a year or so, particularly in Linux. Intel is the much better choice.

Other than that, many people seem to have problems with the power supply not supplying enough power for spikes and the 12vhpwr pin in general. The sense pins/cable seem to break easily which can result in all sort of problems.

But if you're so heavily doing video work, a RTX8000 or A6000 would be a better choice over a 4090 anyway.

You have no idea what you talking about in workflows. I know exactly where my bottleneck are in the workflow. I don’t use any fcp or video tools till Generative AI models create frames for videos. The only thing used there is Ai inference. It’s not just my workflow but I have seen people easily create the issue with other open-source generative videos AI like Deforum. Install deform on webui and try using it at higher resolutions.
How much does A6000 or A8000 cost? Pretty much the cost of a maxed-out RAM and processer in MacBook pro.

TechnoMonk · Mar 26, 2023

leman said:
Spot on. Nvidia has more matmul flops, and that’s the beginning and the end of the story. That said, the CoreML version of stable diffusion published by Apple runs fairly decently and indicates what Apple Silicon should be capable of in this domain if Apple commits to shipping larger accelerators.

Exactly this, Apple needs to commit and provide accelerators and library support for most commonly used their party libraries. The stable diffusion coreML isn’t even optimized by best performance. Writing your inference and dynamic batching speeds up even more on Apple silicon.

sunny5 · Mar 26, 2023

TechnoMonk said:
WebUI uses CPU, have you converted the models to CoreML?

Do you even have Automaitc1111 WebUI? It heavily uses both GPU cores and VRAM. CPU? Are you kidding me?

TechnoMonk · Mar 26, 2023

sunny5 said:
Do you even have Automaitc1111 WebUI? It heavily uses both GPU cores and VRAM. CPU? Are you kidding me?

I tried it couple months ago, it was using cpu. I would love to see some screen shots and the link pointing me the version of automatic1111 we Uk using Coreml.
This is the last one from feb 27, the screen shots in the link shows CPU argument and still using default ckpt models.

Install AUTOMATIC1111's Stable Diffusion WebUI on Mac M1/M2 (Apple Silicon) - Aituts

How to run Stable Diffusion on Mac devices with Apple Silicon chip.

aituts.com

sunny5 · Mar 26, 2023

TechnoMonk said:
I tried it couple months ago, it was using cpu. I would love to see some screen shots and the link pointing me the version of automatic1111 using Coreml.

HAhaha, automatic1111's WebUI does not even use CoreML. You clearly know nothing and you just proven yourself. Go and check Stable Diffusion subreddit to see how it works first.

I see your link and that also leads to what I'm using so you clearly don't know what you are saying. I've been using WebUI for almost 5 months.

TechnoMonk · Mar 26, 2023

sunny5 said:
HAhaha, automatic1111's WebUI does not even use CoreML. You clearly know nothing and you just proven yourself. Go and check Stable Diffusion subreddit to see how it works first.

Lol. I know how it works, I run it on my 4090. Show me automatic1111 working on Apple GPU with coreml models. What you have shown is ignorance of how Apple silicon runs these models.

show me the coreml version of Automatic1111 like stable diffusion

GitHub - apple/ml-stable-diffusion: Stable Diffusion with Core ML on Apple Silicon

Stable Diffusion with Core ML on Apple Silicon. Contribute to apple/ml-stable-diffusion development by creating an account on GitHub.

github.com

I would love to see screen shots of automatic1111 running in GPU mode on your apple silicon.

GrumpyCoder · Mar 26, 2023

TechnoMonk said:
You have no idea what you talking about in workflows. I know exactly where my bottleneck are in the workflow.

Ah ok. Hey, maybe you should get a PhD in the field and a professorship at a leading university teaching this stuff. But of course it's always the others that have no idea.

TechnoMonk said:
I don’t use any fcp or video tools till Generative AI models create frames for videos.

I didn't say you're using FCP, I said the M series is optimised for workflow similar to FCP. You're using video workflow and maybe, just maybe you should check what exactly happens in these models and the model output. How many of these models have you created yourself? How many have you published at peer-reviewed conferences? None. You can't even get your 4090 going. 'Nuff said.

TechnoMonk said:
How much does A6000 or A8000 cost? Pretty much the cost of a maxed-out RAM and processer in MacBook pro.

There is no A8000, only a A6000. The 8000 is a RTX8000, no A there. Good thing you know your stuff and don't have to rely on people who don't know what they're talking about. Oh wait...

sunny5 · Mar 26, 2023

TechnoMonk said:
Lol. I know how it works, I run it on my 4090. Show me automatic1111 working on Apple GPU with coreml models. What you have shown is ignorance of how Apple silicon runs these models.

show me the coreml version of Automatic1111 like stable diffusion

GitHub - apple/ml-stable-diffusion: Stable Diffusion with Core ML on Apple Silicon

Stable Diffusion with Core ML on Apple Silicon. Contribute to apple/ml-stable-diffusion development by creating an account on GitHub.

github.com

I would love to see screen shots of automatic1111 running in GPU mode on your apple silicon.

lol, who even use ColeML to run WebUI? Using GPU is the most fastest way to generate AI images and I have no idea what you are talking about? I guess you dont eve use custom models at all. WebUI use all GPU core while generating images.

Apple Silicon in AI (2023)

macrumors 68040

macrumors 68040

macrumors 68040

macrumors 68000

macrumors 68020

macrumors 68030

macrumors 68030

macrumors 603

macrumors 68040

macrumors 68040

macrumors 68020

Suspended

macrumors 68030

macrumors Core

macrumors 68030

macrumors 68040

macrumors 68040

macrumors 68040

macrumors 68040

Suspended

macrumors 68040

Suspended

macrumors 68040

macrumors 68020

Suspended

Our Staff