
TechnoMonk

macrumors 68030
Oct 15, 2022
2,604
4,112
Inference on Nvidia hardware is very good... they have tons of optimizations for it, and even more depending on which framework you use, like PyTorch or TensorFlow. I have had zero issues with inference. It's very solid.

Conversion to CoreML is still a nightmare. I dread it the most. It is the most painful, god-forsaken part of dealing with AI in Apple's ecosystem. The Python package coremltools, developed by Apple, is riddled with bugs, missing features, and terrible documentation. I have submitted countless bugs and they go unanswered. Take, for instance, the INCREDIBLY basic TensorFlow function unravel_index. This bug was submitted almost 2 years ago and was never addressed!!!! https://github.com/apple/coremltools/issues/1195
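For anyone curious what that failure looks like in practice, below is a minimal sketch of a toy TensorFlow model built around tf.unravel_index and the coremltools conversion call that, per the linked issue, has no mapping for it. The class name, shapes, and workaround note are made up for illustration; only tf.unravel_index and ct.convert come from the issue.

```python
# Minimal sketch (illustrative, not from the issue): a toy model whose only
# unusual op is tf.unravel_index, which coremltools historically cannot map.
import tensorflow as tf
import coremltools as ct


class ArgmaxCoords(tf.Module):
    """Return the (row, col) of the largest value in an 8x8 tensor."""

    @tf.function(input_signature=[tf.TensorSpec(shape=(8, 8), dtype=tf.float32)])
    def __call__(self, x):
        flat_idx = tf.argmax(tf.reshape(x, [-1]))
        # The op the converter trips over:
        return tf.unravel_index(flat_idx, dims=tf.shape(x, out_type=tf.int64))


model = ArgmaxCoords()
try:
    mlmodel = ct.convert(
        [model.__call__.get_concrete_function()],
        source="tensorflow",
    )
except Exception as err:  # typically an "op not implemented" style error
    print("conversion failed:", err)
    # Workaround: rewrite the model with supported ops, e.g. for a fixed
    # width of 8, row = flat_idx // 8 and col = flat_idx % 8.
```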

The irony is Apple internally uses Nvidia and PyTorch for their own AI workflows. They'd never use their own ecosystem.
The 4090 is the most unstable GPU I have worked with in the past few years. It's good if you are running some basic diffusion models like Stable Diffusion, or some text-based models. Anything with sustained load and it crashes, reboots the workstation, or just runs out of memory. I almost exclusively use AI models for 4K/8K upscaling, and it's like saying a prayer on the 4090. You may run something meant to take 30 minutes, and it crashes after 25. The M1 Max may not do the same job in 30 minutes, it may take 35-37 minutes, but it is stable and reliable. Nvidia needs to make their GPUs efficient; the 4090 easily draws 600 W, then starts throttling or just crashes under sustained load.
Good luck using a 4090 for removing content from, or adding content to, a video with AI models at resolutions above 720p.
I take a different approach when using CoreML or the Apple GPU for running inference: I am not trying to map CoreML one-to-one with CUDA, or use PyTorch or TF as-is. Of course I would have to use some custom libraries or custom code to make it work on Apple Silicon. If you are expecting everything to work the same way as CUDA, that is a different battle.
 
  • Like
Reactions: levanid and Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I take a different approach when using CoreML or the Apple GPU for running inference: I am not trying to map CoreML one-to-one with CUDA, or use PyTorch or TF as-is.
Why is the conversion to CoreML so error-prone? Doesn't ONNX solve the problem of inter-library conversion?
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
The 4090 is the most unstable GPU I have worked with in the past few years. It's good if you are running some basic diffusion models like Stable Diffusion, or some text-based models. Anything with sustained load and it crashes, reboots the workstation, or just runs out of memory. I almost exclusively use AI models for 4K/8K upscaling, and it's like saying a prayer on the 4090. You may run something meant to take 30 minutes, and it crashes after 25. The M1 Max may not do the same job in 30 minutes, it may take 35-37 minutes, but it is stable and reliable. Nvidia needs to make their GPUs efficient; the 4090 easily draws 600 W, then starts throttling or just crashes under sustained load.
Good luck using a 4090 for removing content from, or adding content to, a video with AI models at resolutions above 720p.
I take a different approach when using CoreML or the Apple GPU for running inference: I am not trying to map CoreML one-to-one with CUDA, or use PyTorch or TF as-is. Of course I would have to use some custom libraries or custom code to make it work on Apple Silicon. If you are expecting everything to work the same way as CUDA, that is a different battle.
Are you running an FE card or an AIB model?
 

teagls

macrumors regular
May 16, 2013
202
101
The 4090 is the most unstable GPU I have worked with in the past few years. It's good if you are running some basic diffusion models like Stable Diffusion, or some text-based models. Anything with sustained load and it crashes, reboots the workstation, or just runs out of memory. I almost exclusively use AI models for 4K/8K upscaling, and it's like saying a prayer on the 4090. You may run something meant to take 30 minutes, and it crashes after 25. The M1 Max may not do the same job in 30 minutes, it may take 35-37 minutes, but it is stable and reliable. Nvidia needs to make their GPUs efficient; the 4090 easily draws 600 W, then starts throttling or just crashes under sustained load.
Good luck using a 4090 for removing content from, or adding content to, a video with AI models at resolutions above 720p.
I take a different approach when using CoreML or the Apple GPU for running inference: I am not trying to map CoreML one-to-one with CUDA, or use PyTorch or TF as-is. Of course I would have to use some custom libraries or custom code to make it work on Apple Silicon. If you are expecting everything to work the same way as CUDA, that is a different battle.
That honestly sounds like you are letting the 4090 overheat. Are you cooling it adequately?
 

TechnoMonk

macrumors 68030
Oct 15, 2022
2,604
4,112
That honestly sounds like you are letting the 4090 overheat. Are you cooling it adequately?
I monitor the temperature, and it's not a heating issue; I have plenty of cooling. It's more of a power issue. I cap the power to 70%, just like I did on my retired 3090. This is not uncommon for folks who run sustained workloads. I can rent a 4090 or an A5000 in the cloud and replicate the issue pretty easily.
It's a love-hate relationship with my 4090: it's amazing at certain tasks but unreliable at others.
I just hope Apple invests in this segment of the GPU market, both in hardware and software.
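For reference, capping the card at 70% as described above can be scripted. A minimal sketch using the pynvml bindings from the nvidia-ml-py package (GPU index 0 and the 70% factor are assumptions; setting the limit needs root/admin rights, and nvidia-smi -pl <watts> does the same from a shell):

```python
# Minimal sketch: cap GPU 0 to 70% of its default power limit via NVML.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)  # milliwatts
target_mw = int(default_mw * 0.70)

# Clamp to the range the board actually allows.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
target_mw = max(min_mw, min(target_mw, max_mw))

pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)  # requires root
print(f"power limit set to {target_mw / 1000:.0f} W "
      f"(default {default_mw / 1000:.0f} W)")

pynvml.nvmlShutdown()
```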
 

Kimmo

macrumors 6502
Jul 30, 2011
266
318

GrumpyCoder

macrumors 68020
Nov 15, 2016
2,126
2,706
I have some small single-4090 and dual-4090 workstations; no problem with the GPUs in these with the usual TF/PyTorch/Omniverse applications. Rock stable. Of course there's the issue of memory on these, as it's very limited; that's what the RTX 6000/8000 cards are for. The next step is V100/H100, but I personally think using these in the cloud is pointless. They're too expensive for continuous workloads, and when used all the time, buying is much cheaper than a cloud service.
 

teagls

macrumors regular
May 16, 2013
202
101
I monitor the temperature, and it's not a heating issue; I have plenty of cooling. It's more of a power issue. I cap the power to 70%, just like I did on my retired 3090. This is not uncommon for folks who run sustained workloads. I can rent a 4090 or an A5000 in the cloud and replicate the issue pretty easily.
It's a love-hate relationship with my 4090: it's amazing at certain tasks but unreliable at others.
I just hope Apple invests in this segment of the GPU market, both in hardware and software.
Something doesn't seem right. Are you also saying you have power issues with the A5000? We regularly run many A6000s in a single workstation for weeks under full load and never have issues. You might want to check your PSU or connections. Maybe even check your electrical outlets.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Some evidence that a new neural engine design might be incoming:


A quick uninformed summary from someone who has no clue about these things: it seems to describe an architecture that consists of two types of hardware units, optimised for different types of operations (convolutions vs. vector processing) and a dependency management engine that asynchronously invokes these units. From what I understand, the current NPU is primarily a convolution engine.
 
  • Like
Reactions: Xiao_Xi

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
I monitor the temperature, and it's not a heating issue; I have plenty of cooling. It's more of a power issue. I cap the power to 70%, just like I did on my retired 3090. This is not uncommon for folks who run sustained workloads. I can rent a 4090 or an A5000 in the cloud and replicate the issue pretty easily.
It's a love-hate relationship with my 4090: it's amazing at certain tasks but unreliable at others.
I just hope Apple invests in this segment of the GPU market, both in hardware and software.
That is interesting, because 100% power is 450 W for most 4090s, with only 5 or 6 cards able to actually draw more than that (as in, a lot of cards are power limited to 450 W). I am going to assume the place you are renting from is using FE cards, which can pull up to 600 W but have to be overclocked (run at more than 100% power limit) to get there.


It will be interesting to see how Apple competes in this space as renting compute appears to be the overwhelming choice for folks these days.
 

TechnoMonk

macrumors 68030
Oct 15, 2022
2,604
4,112
Why is the conversion to CoreML so error-prone? Doesn't ONNX solve the problem of inter-library conversion?
That was the bigger problem. You don't need to convert to ONNX any more, at least not for what I do now. I would say a lot of it is not plug and play; I use custom Python code leveraging Apple's coremltools to do the conversion.
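For concreteness, here is a minimal sketch of that ONNX-free path, assuming the standard route of tracing a PyTorch model with torch.jit.trace and handing it to coremltools directly. The torchvision ResNet-18 and the input shape are placeholders, not the poster's actual models:

```python
# Minimal sketch of a direct PyTorch -> Core ML conversion (no ONNX step).
import torch
import torchvision
import coremltools as ct

# Placeholder model; any traceable torch.nn.Module works the same way.
model = torchvision.models.resnet18(weights=None).eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    convert_to="mlprogram",            # modern ML Program format
    compute_units=ct.ComputeUnit.ALL,  # let Core ML pick ANE/GPU/CPU
)
mlmodel.save("resnet18.mlpackage")
```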
That is interesting, because 100% power is 450 W for most 4090s, with only 5 or 6 cards able to actually draw more than that (as in, a lot of cards are power limited to 450 W). I am going to assume the place you are renting from is using FE cards, which can pull up to 600 W but have to be overclocked (run at more than 100% power limit) to get there.


It will be interesting to see how Apple competes in this space as renting compute appears to be the overwhelming choice for folks these days.
I am not renting a 4090; it's in my workstation. There is no power limiting in most cards: Nvidia claims a 450 W cap, but the whole-board draw is much higher depending on the manufacturer. I underclock and power limit mine to 70%. I am not a gamer, so my use case could be very different from that of most folks who use it for gaming.
 

TechnoMonk

macrumors 68030
Oct 15, 2022
2,604
4,112
Something doesn't seem right. Are you also saying you have power issues with the A5000? We regularly run many A6000s in a single workstation for weeks under full load and never have issues. You might want to check your PSU or connections. Maybe even check your electrical outlets.
It's the same in the cloud (A5000) and on the 4090 (local). I don't run into the same issues on the A100. I am not wasting money on an A100 for inference, though.
 

GrumpyCoder

macrumors 68020
Nov 15, 2016
2,126
2,706
Something is wrong there. The 4090's average power draw is around 400-450 W; overclocked cards draw a little more, somewhere between 450-500 W. Only for short bursts should it reach 600 W. The board limit is around 650 W. So if a 4090 capped at 70% is reaching that, something isn't working properly in the system. Not one of our 4090 cards behaves like that. All of our systems are configured at the supplier and tested for 48 hours under full load before they're shipped. Our larger Dell systems with the 6000/8000 are tested as well, and so are the Dell servers with A100/H100 for our GPU clusters.

Another issue: using PyTorch with VGG16, even an old 1080 Ti runs circles around an M1 Ultra (let alone a Max) for both training and inference. So if a 4090 needs 30 minutes and crashes before it gets there, while an M1 Max can do the same job in under 40 minutes, something is terribly wrong here. As it happens, I'm at a conference and then at Nvidia for the rest of next week, so I'll ask them about a 600 W power draw with a power cap at 70%.
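As a rough illustration of how such a VGG16 comparison might be run (the batch size, iteration count, and inference-only focus are arbitrary choices, not the benchmark referred to above; torch.mps.synchronize needs a recent PyTorch):

```python
# Rough sketch: time VGG16 inference on whichever accelerator is present.
# Run once on the Nvidia box (cuda) and once on Apple Silicon (mps).
import time
import torch
import torchvision


def time_vgg16(device: str, batch: int = 32, iters: int = 50) -> float:
    model = torchvision.models.vgg16(weights=None).eval().to(device)
    x = torch.rand(batch, 3, 224, 224, device=device)

    def sync() -> None:
        if device == "cuda":
            torch.cuda.synchronize()
        elif device == "mps":
            torch.mps.synchronize()

    with torch.no_grad():
        for _ in range(5):  # warm-up iterations
            model(x)
        sync()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        sync()
    return (time.perf_counter() - start) / iters


device = "cuda" if torch.cuda.is_available() else "mps"
print(f"{device}: {time_vgg16(device) * 1000:.1f} ms per batch")
```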
 

TechnoMonk

macrumors 68030
Oct 15, 2022
2,604
4,112
Something is wrong there. The 4090's average power draw is around 400-450 W; overclocked cards draw a little more, somewhere between 450-500 W. Only for short bursts should it reach 600 W. The board limit is around 650 W. So if a 4090 capped at 70% is reaching that, something isn't working properly in the system. Not one of our 4090 cards behaves like that. All of our systems are configured at the supplier and tested for 48 hours under full load before they're shipped. Our larger Dell systems with the 6000/8000 are tested as well, and so are the Dell servers with A100/H100 for our GPU clusters.

Another issue: using PyTorch with VGG16, even an old 1080 Ti runs circles around an M1 Ultra (let alone a Max) for both training and inference. So if a 4090 needs 30 minutes and crashes before it gets there, while an M1 Max can do the same job in under 40 minutes, something is terribly wrong here. As it happens, I'm at a conference and then at Nvidia for the rest of next week, so I'll ask them about a 600 W power draw with a power cap at 70%.
Great! I would love to hear what the Nvidia guys say about it. Just for clarification, the 600 W figure is without limiting power to 70%. The GPU board draw is around 400-425 W with power limits and with the GPU underclocked by 20%. I have tried multiple vendors.
Ask them if there are stability issues running at sustained loads. I see the same on an A5000.
 

Zest28

macrumors 68030
Jul 11, 2022
2,581
3,933
Interesting. I was going to build a new PC with an RTX 4090 for AI, but based on what I am reading, it sounds like a bad idea?

Is it not a driver issue that can simply be fixed with a software / firmware update?
 
  • Like
Reactions: ocimpean

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
It will be interesting to see how Apple competes in this space as renting compute appears to be the overwhelming choice for folks these days.
I think deep learning and scientific library developers would care more about Apple hardware if it were available in the cloud at a competitive price.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
That was the bigger problem. You don't need to convert to ONNX any more, at least not for what I do now. I would say a lot of it is not plug and play; I use custom Python code leveraging Apple's coremltools to do the conversion.

I am not renting a 4090; it's in my workstation. There is no power limiting in most cards: Nvidia claims a 450 W cap, but the whole-board draw is much higher depending on the manufacturer. I underclock and power limit mine to 70%. I am not a gamer, so my use case could be very different from that of most folks who use it for gaming.
Making another assumption: unplug one of the 8-pin-to-16-pin adapter leads; that will force the card to cap at 450 W. 70% of that should bring the power down to ~315 W.

The screenshot below is from a YT video where the person compared 35 4090s to find "the best" one(s). It was just to reiterate that most cards have a stock power limit of 450 W (and there are quite a few that can't go over that limit for whatever reason). You were right that most can go over the limit, but that usually means pushing the slider past 100%, which you said you were not doing. Note that all this ignores transient power spikes, which can be much higher but last only a fraction of a second.

[Attached screenshot: power-limit comparison of 35 RTX 4090 cards]
 

senttoschool

macrumors 68030
Original poster
Nov 2, 2017
2,626
5,482
I think deep learning and scientific library developers would care more about Apple hardware if it were available in the cloud at a competitive price.
Did somebody say Apple Silicon Cloud?


I think/hope Apple will put the "Extreme" version of their chips in the cloud. Imagine an Extreme chip packaged together with 512GB of VRAM available for training or inference in the cloud.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
I think deep learning and scientific library developers would care more about Apple hardware if it were available in the cloud at a competitive price.

Did somebody say Apple Silicon Cloud?


I think/hope Apple will put the "Extreme" version of their chips in the cloud. Imagine an Extreme chip packaged together with 512GB of VRAM available for training or inference in the cloud.
I think if Apple had a way to convert CUDA to Metal in the backend it would make the cloud service more approachable. Maybe along with a notice that you could get way better performance writing Metal code directly versus going through the conversion/translation layer.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I think if Apple had a way to convert CUDA to Metal in the backend it would make the cloud service more approachable
Can anyone confirm if that would be legal? I've read that other GPU manufacturers can't use CUDA code because nVidia restricts the use of CUDA to their GPUs. To alleviate the problem, AMD released Orochi.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Can anyone confirm if that would be legal? I've read that other GPU manufacturers can't use CUDA code because nVidia restricts the use of CUDA to their GPUs. To alleviate the problem, AMD released Orochi.

What's stopping one from implementing a CUDA-compatible toolkit? Are there provisions that prohibit copying an API? That would be extremely odd.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
Can anyone confirm if that would be legal? I've read that other GPU manufacturers can't use CUDA code because nVidia restricts the use of CUDA to their GPUs. To alleviate the problem, AMD released Orochi.
IIRC ROCm has tools to port CUDA code to HIP, so I see no reason why Apple couldn't do the same (and even do it at runtime, à la Rosetta 2).
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Apple could create a tool to convert CUDA code to Metal code as Intel and AMD have done, but I don't think they can legally convert compiled CUDA code at runtime. I have read that neither AMD nor Intel have tried due to the Oracle/Google lawsuit.
 

GrumpyCoder

macrumors 68020
Nov 15, 2016
2,126
2,706
I think/hope Apple will put the "Extreme" version of their chips in the cloud. Imagine an Extreme chip packaged together with 512GB of VRAM available for training or inference in the cloud.
And then what? 512 GB is great (we already have 640 GB with Nvidia in a single box, plus super-fast connections between multiple boxes). But then we still have (likely) slow performance. Keep in mind the 1080 Ti is running circles around the M1 Ultra... so that potential Mac Pro would have to be much, much, MUCH faster than the Ultra. They need something that competes with (multiple) A100-level cards to be interesting for the cloud.
Can anyone confirm if that would be legal? I've read that other GPU manufacturers can't use CUDA code because nVidia restricts the use of CUDA to their GPUs. To alleviate the problem, AMD released Orochi.
There have been a bunch of CUDA to <insert-favorite-hype-alternative-here-that-dies-after-2-months> projects, none of which worked as expected and all were abandoned after a while. Like it or not, 99% of the market is Nvidia and it won't change anytime soon. A super fast desktop Mac for fiddling around locally would be nice though.
 
  • Like
Reactions: leman

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
Apple could create a tool to convert CUDA code to Metal code as Intel and AMD have done, but I don't think they can legally convert compiled CUDA code at runtime. I have read that neither AMD nor Intel have tried due to the Oracle/Google lawsuit.
Apple translates compiled x86-64 code to AArch64 with Rosetta 2 without any legal problems despite Intel threatening both Microsoft and Apple previously. If Intel isn't suing Apple then I doubt that there is any case for Nvidia to sue over binary translation either.
 