I thought, at least for LLMs, that the GPU is the most performant unit for training, while the CPU is the most performant unit for inference, which would mean it's not always the case that the CPU has limited performance for AI. And based on what you've written, the NPU would be used as a CPU alternative for inference when you want to optimize for efficiency. Do I have that right?
For inference, GPUs are still dominant as well; it's just that memory bandwidth tends to matter more than compute (though, depending on the bandwidth, compute can still be the limiter. For example, it's been suggested that Apple devices have more than enough bandwidth that on inference tasks the GPUs are compute bound, so improving compute would dramatically improve inference performance, especially for multi-batch inference).
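To make the bandwidth-vs-compute point concrete, here's a rough roofline-style sketch (the model size, quantization, bandwidth, and TFLOPS figures are purely illustrative placeholders, not measurements of any particular chip):

```python
# Rough roofline-style estimate for single-batch LLM decoding.
# All figures (model size, quantization, bandwidth, TFLOPS) are illustrative
# placeholders, not measurements of any particular chip.

def decode_ceilings(params_billion, bytes_per_weight, mem_bw_gb_s, compute_tflops):
    """Return (bandwidth-bound, compute-bound) tokens/s ceilings for decoding."""
    weight_bytes = params_billion * 1e9 * bytes_per_weight
    flops_per_token = 2 * params_billion * 1e9      # ~2 FLOPs per weight per token
    bw_bound = mem_bw_gb_s * 1e9 / weight_bytes
    compute_bound = compute_tflops * 1e12 / flops_per_token
    return bw_bound, compute_bound

# Hypothetical 70B model, 4-bit weights (~0.5 bytes/weight), ~800 GB/s, ~30 FP16 TFLOPS.
bw, comp = decode_ceilings(70, 0.5, 800, 30)
print(f"bandwidth ceiling ~{bw:.0f} tok/s, compute ceiling ~{comp:.0f} tok/s")
print(f"single-batch decode is limited to ~{min(bw, comp):.0f} tok/s")
```

With batch size 1 the bandwidth ceiling is usually the lower of the two, which is why bandwidth dominates; batching reuses each weight read across several tokens, which pushes the workload toward the compute ceiling.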
 
Maybe it's indicative of several things, but running an MLX LLM on an M4 versus an M4 Pro/Max returns wildly different performance, even though both have the same Neural Engine. They have very different memory bandwidth, CPU, and GPU prowess. MLX should offload a portion of the work to the Neural Engine.
 
Why would those results be surprising to anyone? The mobile 5090 alone has a TGP of 175W. A 16" MacBook Pro with the 40-core M4 Max has a total max load of about 140W for the whole laptop, and the GPU alone draws about 50W at max load. Also, OpenCL was deprecated in 2018 with Mojave; macOS supports only up to version 1.2, while the latest Windows version is 3.0.17. So the comparison is totally misleading, but you can't expect more from sloppy YouTubers who, as always, have assorted errors in their tests.
Do Mac users even have the *option* of GPUs that powerful? Efficiency is great, but if we can't even reach that level of power, that's kind of an issue. External GPUs used to seem like a way around that, but I don't think they're supported anymore.
 
I thought, at least for LLMs, that the GPU is the most performant unit for training, while the CPU is the most performant unit for inference, which would mean it's not always the case that the CPU has limited performance for AI. And based on what you've written, the NPU would be used as a CPU alternative for inference when you want to optimize for efficiency. Do I have that right?

Not sure you can generalize it this way. For example, Nvidia GPUs are much faster at inference than pretty much any CPU you can find. Even large workstation CPUs have neither the memory bandwidth nor the compute throughput to do fast matrix multiplication — they are simply optimized for different tasks.

I would look at it from the perspective of the die area and energy efficiency tradeoffs. A specialized unit (NPU) will likely always have an edge if you want to maximize perf/area and/or perf/watt, as long as your models are simple enough. The NPU does not need large register files or even caches, it can be implemented as a streaming engine, and there is a lot of potential for optimization. There is a good reason why Google has been building TPUs and why so many startups focus on ML-specific accelerators. However, the question is, how large of an NPU do you actually want to build? I mean, you could increase the NPU footprint by a factor of 20x and have a 720 INT8 TOPS engine rivaling large Nvidia GPUs, but there goes half of your die. Is this a good tradeoff for a general-purpose SoC?

Apple's strategy so far has been about functional specialization. The NPU is large enough to support practical ML models that have to run on-device on a constrained energy budget, like signal processing and Apple Intelligence features, but not any larger. And then you have other processors, like SME and the GPU, that can be used for other applications. As I wrote before, using the GPU for large models simply makes sense. It is already there, it is massively parallel, and adding some matrix accelerators won't actually require that many transistors.
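To put rough numbers on that area tradeoff, here's a trivial back-of-the-envelope sketch (the ~36 TOPS baseline and the assumption that throughput scales linearly with MAC-array area are mine, for illustration only):

```python
# Back-of-the-envelope for the "scale the NPU by 20x" thought experiment.
# Baseline TOPS and linear scaling with MAC-array area are illustrative
# assumptions, not real die-area data.

def scaled_npu_tops(baseline_tops: float, area_factor: float) -> float:
    """Assume throughput grows roughly linearly with area spent on MAC arrays."""
    return baseline_tops * area_factor

baseline_int8_tops = 36     # assumed current laptop-class NPU throughput
for factor in (1, 5, 10, 20):
    print(f"{factor:>2}x area -> ~{scaled_npu_tops(baseline_int8_tops, factor):.0f} INT8 TOPS")
# 20x area -> ~720 TOPS, i.e. GPU-class throughput at the cost of a huge slice of the die
```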
 
The most annoying thing right now is that training on Macs is just not an option. I work in computer vision, and trying to train on an M3 Max Mac during development does work to some degree, but it is very, very slow. An M3 Ultra would be way too slow as well. I did some basic ResNet tests using PyTorch, and my M3 Max had at best 25% of a 4090's performance at FP32, and 10% at FP16 (well, actually AMP). My H100 server doubles the 4090's performance, even without tweaking.
I also tried converting to MLX to see if it unlocks something, but it was actually even worse than PyTorch for CV, probably due to not having smartypants optimizations of the compute graph. (Curiously, my tests using MLX "function by function" on simple cases showed a 2x perf boost over PyTorch.)
As it is now, Macs are still awesome basic laptop-workstations, but that's that. Let's hope for a paradigm shift announced at WWDC with the fabled "Hidra" chip in a real workstation/server. Mostly because it would be "fun", since I do not think we will get something with a good price/perf ratio. After all, an H100 at ~30k is at least 10 times faster than an M3 Ultra that costs about 10k.
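For reference, a minimal sketch of the kind of timing loop behind those ResNet numbers, using PyTorch's MPS backend (the batch size, ResNet-50, and the use of autocast are my assumptions; treat the output only as a rough relative indicator):

```python
# Minimal ResNet forward/backward timing sketch on Apple Silicon via PyTorch MPS.
# Batch size and model choice are assumptions; numbers are only relative indicators.
import time
import torch
import torchvision

assert torch.backends.mps.is_available(), "MPS backend not available"
device = torch.device("mps")

model = torchvision.models.resnet50(weights=None).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 3, 224, 224, device=device)
y = torch.randint(0, 1000, (64,), device=device)

def step(use_amp: bool):
    optimizer.zero_grad(set_to_none=True)
    # autocast on MPS needs a reasonably recent PyTorch; no GradScaler here,
    # which is fine for a timing sketch
    with torch.autocast(device_type="mps", dtype=torch.float16, enabled=use_amp):
        loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

for use_amp in (False, True):
    for _ in range(3):          # warmup
        step(use_amp)
    torch.mps.synchronize()
    t0 = time.time()
    for _ in range(20):
        step(use_amp)
    torch.mps.synchronize()
    print(f"AMP={use_amp}: {(time.time() - t0) / 20 * 1000:.1f} ms/step")
```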
 
The most annoying thing right now is that training on Macs is just not an option. I work in computer vision, and trying to train on an M3 Max Mac during development does work to some degree, but it is very, very slow. An M3 Ultra would be way too slow as well. I did some basic ResNet tests using PyTorch, and my M3 Max had at best 25% of a 4090's performance at FP32, and 10% at FP16 (well, actually AMP). My H100 server doubles the 4090's performance, even without tweaking.
I also tried converting to MLX to see if it unlocks something, but it was actually even worse than PyTorch for CV, probably due to not having smartypants optimizations of the compute graph. (Curiously, my tests using MLX "function by function" on simple cases showed a 2x perf boost over PyTorch.)
As it is now, Macs are still awesome basic laptop-workstations, but that's that. Let's hope for a paradigm shift announced at WWDC with the fabled "Hidra" chip in a real workstation/server. Mostly because it would be "fun", since I do not think we will get something with a good price/perf ratio. After all, an H100 at ~30k is at least 10 times faster than an M3 Ultra that costs about 10k.
I agree with everything except the last part. The $9.5k M3 Ultra has 512GB of VRAM; that's its only differentiating factor from the $5.5k M3 Ultra, which has 96GB of VRAM, the same as an H100 with 80-96GB. So depending on what Apple actually releases in this space, its performance, and the user's VRAM requirements, I could actually see Apple releasing a very price-efficient product relative to Nvidia. Of course, uhh ... other than world events being a thing. But then that'll affect all the competitors too.
 
Nobody is talking about moving components far away. We are talking about multi-chip modules, which are connected with high bandwidth interfaces. And you don’t need extra controllers for vertical wires between dies. Intel builds CPUs like that and the performance didn’t go down. Same for Mx Ultra series.

The only drawback I can see (and I might be wrong) is increased power consumption.
On power consumption, TSMC has been promoting it as an advantage of this next generation, beyond “conventional” 3D IC. See the tiny battery powering the rocket ship below…

@Allen_Wentz — TSMC has produced an unusual (for them) Whats, Whys, and Hows marketing graphic and introductory blurb for this. Note that it gets a formal, registered trademark name: “TSMC-SoIC®” … providing “front-end, 3D inter-chip (3D IC) stacking technologies for re-integration of chiplets partitioned from System on Chip (SoC).” The “3Dx3D” carryover aspect is also important (for Apple, that means it’s compatible with UltraFusion).

Other key words in the graphic are “vertical” and “faster and shorter connections” …

[Image: TSMC-SoIC marketing graphic]
 
On power consumption, TSMC has been promoting it as an advantage of this next generation, beyond “conventional” 3D IC. See the tiny battery powering the rocket ship below…

@Allen_Wentz — TSMC has produced an unusual (for them) Whats, Whys, and Hows marketing graphic and introductory blurb for this. Note that it gets a formal, registered trademark name: “TSMC-SoIC®” … providing “front-end, 3D inter-chip (3D IC) stacking technologies for re-integration of chiplets partitioned from System on Chip (SoC).” The “3Dx3D” carryover aspect is also important (for Apple, that means it’s compatible with UltraFusion).

Other key words in the graphic are “vertical” and “faster and shorter connections” …

[Image: TSMC-SoIC marketing graphic]
AMD uses it for their 3D V-Cache chips, this time around putting the cache under the die instead of on top. I assume the same could be done with logic. Maybe Apple could put the whole GPU under everything else, allowing them to up the core count.
 
On power consumption, TSMC has been promoting it as an advantage of this next generation, beyond “conventional” 3D IC. See the tiny battery powering the rocket ship below…

@Allen_Wentz — TSMC has produced an unusual (for them) Whats, Whys, and Hows marketing graphic and introductory blurb for this. Note that it gets a formal, registered trademark name: “TSMC-SoIC®” … providing “front-end, 3D inter-chip (3D IC) stacking technologies for re-integration of chiplets partitioned from System on Chip (SoC).” The “3Dx3D” carryover aspect is also important (for Apple, that means it’s compatible with UltraFusion).

Other key words in the graphic are “vertical” and “faster and shorter connections” …

[Image: TSMC-SoIC marketing graphic]

At the end of the day, this is about connecting things. Pretty much until now the ideal way was to put everything (connections included) on the same die. However, it seems these new technologies allow one to build connections between dies that are almost as good or maybe even just as good.

To make it clear, we are not talking about connecting chips using an I/O solution. We are talking about directly wiring the on-chip networks of two or more chips. Once the wires can be made small enough, you probably don’t need any boosters or signal converters.
 
For gaming, Nvidia will crush the Mac. However, for LLMs or video production, you can put up to 512GB of unified memory in the Mac Studio Ultra and destroy the Nvidia 5090. I watched a few YouTube comparison videos. Even when running LLMs locally that are small enough to work on the 5090, it got crushed by the M3 Ultra. Anything that requires a lot of VRAM gets crushed except video games. In addition, the power draw was almost comically different: you'd spend at least 4x as much on power to run the 5090 under load, and even more relative to the M3 Ultra at idle. Almost everything was faster for those two purposes, as well as advanced maths, and anything science or work related would be done on a Mac Studio. Games, however, would be played on the 5090. Where do you make money? If it's playing video games, buy the 5090 PC. If you make money like the rest of us, quit worrying about it and buy the most Mac Studio you need.
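For anyone wanting to reproduce that kind of local-LLM comparison on a Mac, a minimal sketch with the mlx-lm package looks something like this (the model repo and prompt are just placeholders, and the exact call signature depends on your mlx-lm version; pick a quantized model that actually fits your unified memory):

```python
# Minimal local-LLM generation sketch on Apple Silicon using mlx-lm.
# The model repo and prompt are placeholders; substitute your own.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")  # assumed repo name
prompt = "Explain unified memory in two sentences."
# verbose=True prints generation stats, including tokens/second
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```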
I use Mac and PC and can say the Mac M3 Ultra certainly does not destroy the desktop 5090 in the video testing I've done.
My 5090/9950X3D can render a sample DaVinci Resolve project (open to anyone to test on their own system) more than a minute faster than the full 80-GPU-core M3 Ultra, which has more RAM than the PC (96GB vs 64GB).

A minute maybe doesn't sound like much, but when the PC does it in 2 minutes and the M3 Ultra takes 3 (give or take a couple of seconds either way), as a percentage that's pretty big for flagship hardware that launched at almost the same time.

As for power usage, those spending $5k upwards aren't really that bothered about a slight increase in their annual electric bill if work can be done faster.
I also use my office heating much less in winter when the machines are running at full load, so that offsets things :)
 
I use Mac and PC and can say the Mac M3 Ultra certainly does not destroy the desktop 5090 in the video testing I've done.
My 5090/9950X3D can render a sample DaVinci Resolve project (open to anyone to test on their own system) more than a minute faster than the full 80-GPU-core M3 Ultra, which has more RAM than the PC (96GB vs 64GB).

A minute maybe doesn't sound like much, but when the PC does it in 2 minutes and the M3 Ultra takes 3 (give or take a couple of seconds either way), as a percentage that's pretty big for flagship hardware that launched at almost the same time.

As for power usage, those spending $5k upwards aren't really that bothered about a slight increase in their annual electric bill if work can be done faster.
I also use my office heating much less in winter when the machines are running at full load, so that offsets things :)
Yep. As someone with three Ultra Mac Studios, I think Apple is really dropping the ball here. But people like to point out how few people buy these systems, so apparently it's okay.
 
I use an Nvidia workstation and an M4 Max. For anything that fits within the Nvidia card's paltry RAM, it smokes the M4 Max. But more than half of my use cases need at least 100 GB of VRAM/unified memory. If I am training or running a full analysis of my datasets, I use the cloud. But my M1 Max is my go-to for all dev/testing, where I work out all the kinks before I use expensive cloud infrastructure.

It's quite simple: for anything under 24-32 GB of RAM, Nvidia is still the best choice. But for anything over 32GB, an Nvidia 50XX is useless.
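A quick sketch of the arithmetic behind that 24-32 GB cutoff (the 1.2x overhead factor for KV cache and activations is a rough assumption, not a measurement):

```python
# Approximate memory footprint: weights plus a rough runtime overhead factor.
# The 1.2x overhead for KV cache/activations is an assumption for illustration.

def model_memory_gb(params_billion, bytes_per_param, overhead=1.2):
    """Approximate GB needed to host a model's weights plus runtime overhead."""
    return params_billion * bytes_per_param * overhead  # 1e9 params * bytes ~= GB

for params, bits in [(8, 4), (8, 16), (70, 4), (70, 16)]:
    need = model_memory_gb(params, bits / 8)
    verdict = "fits a 32 GB GPU" if need <= 32 else "needs unified memory / multi-GPU"
    print(f"{params}B @ {bits}-bit: ~{need:.0f} GB -> {verdict}")
```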
 
Too bad Apple's $600 M4 Mac mini doesn't beat a $600 PlayStation 5 or $600 Xbox Series for gaming.
That's hard to understand. They could earn a few more dollars by putting better GPUs in the cheaper Mac minis for gaming.
Even if Apple did that, nobody would buy it, because Apple has no interest in gamers.

Gaming means long-lasting compatibility, simple hardware upgrades and individual optimization of performance.

The exact opposite of Apple.

@con2apple, you neglected to include the part where @NewOldStock is talking about game consoles, NOT gaming PCs, so the question of "simple hardware upgrades" is a moot point...

In fact, game consoles and Apple Silicon Macs now share the same upgrade path, that of buying a newer unit... ;^p
 
I agree with everything except the last part. The $9.5k M3 Ultra has 512GB of VRAM; that's its only differentiating factor from the $5.5k M3 Ultra, which has 96GB of VRAM, the same as an H100 with 80-96GB. So depending on what Apple actually releases in this space, its performance, and the user's VRAM requirements, I could actually see Apple releasing a very price-efficient product relative to Nvidia. Of course, uhh ... other than world events being a thing. But then that'll affect all the competitors too.
Correct. And also complicated. I live in a country in Europe, and prices here are insane for Apple's products. The H100 we bought recently cost us about 30K, and an 80-core M3 Ultra costs about 8K. Then, of course, the Nvidia card comes in an overpriced dual-Xeon server with 256GB of RAM and lots of disks for 10K extra... So if you want to promote the Apple way, I guess you could go for the baseline M3 Ultra at 5K and compare it to a 40-50K H100 as it is usually sold :) Oh well, I know I really would like an M3 Ultra as my daily driver and hope intensely that MLX will mature quickly.
 
When the M1 Max launched, Apple said that it rivaled the flagship NVIDIA GPU at the time, the RTX 3080.

However, in 2025, when comparing the M4 Max MacBook Pro to PC laptops equipped with the flagship NVIDIA GPU, the RTX 5090, which cost the same as an M4 Max MacBook Pro, the MacBook Pro gets destroyed; it is not even close.

And this is true even on battery power.

A large part of this is that NV pushed up their power consumption a lot. They gained performance between generations in part by throwing more wattage at it. The mobile 3080 drew 80-115W, and was often closer to the 80W mark on most laptops. The mobile 5090 alone, not counting the rest of the laptop, draws between 135W and 160W. That's more than the entire M4 Max MBP draws in total, even at the lower end of the 5090's power range.

Worth pointing out that if you need massive amounts of memory for your GPU workloads, the M4 Max wins hands down.

Different needs for different folks

If you think of Apple's graphics as closer to the workstation cards NV or AMD ship, rather than their consumer gaming cards, but with unified memory as a sweetener, it also makes a lot more sense cost/benefit-wise. Apple isn't targeting gamers.
 
Correct. And also complicated. I live in a country in Europe, and prices here are insane for Apple's products. The H100 we bought recently cost us about 30K, and an 80-core M3 Ultra costs about 8K. Then, of course, the Nvidia card comes in an overpriced dual-Xeon server with 256GB of RAM and lots of disks for 10K extra... So if you want to promote the Apple way, I guess you could go for the baseline M3 Ultra at 5K and compare it to a 40-50K H100 as it is usually sold :) Oh well, I know I really would like an M3 Ultra as my daily driver and hope intensely that MLX will mature quickly.
If you already have an H100, I don't see how an Ultra will match it in the near future. MLX is a lot more than what it was a year and a half ago, but it is not going to magically fix the hardware limitations. Apple needs to provide the ability for MLX to leverage the Neural Engine/Core ML.
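For what it's worth, the only way to route a model to the Neural Engine today is through Core ML, e.g. via coremltools; a minimal sketch of that conversion path (the toy model and shapes are placeholders, and this is not anything MLX itself offers):

```python
# Sketch: convert a traced PyTorch model to Core ML and prefer the Neural Engine.
# The toy model and input shape are placeholders for illustration only.
import torch
import coremltools as ct

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(32, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

example = torch.randn(1, 32)
traced = torch.jit.trace(TinyNet().eval(), example)

# Convert to an ML Program and ask Core ML to schedule on the Neural Engine where it can.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 32))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
mlmodel.save("tinynet.mlpackage")
```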
 
I use Mac and PC and can say the Mac M3 Ultra certainly does not destroy the desktop 5090 in the video testing I've done.
My 5090/9950X3D can render a sample DaVinci Resolve project (open to anyone to test on their own system) more than a minute faster than the full 80-GPU-core M3 Ultra, which has more RAM than the PC (96GB vs 64GB).

A minute maybe doesn't sound like much, but when the PC does it in 2 minutes and the M3 Ultra takes 3 (give or take a couple of seconds either way), as a percentage that's pretty big for flagship hardware that launched at almost the same time.

As for power usage, those spending $5k upwards aren't really that bothered about a slight increase in their annual electric bill if work can be done faster.
I also use my office heating much less in winter when the machines are running at full load, so that offsets things :)
Depending on the codec and complexity, I'm sure. For H.264/HEVC, given Nvidia added extra encoders, the 5090 will be faster; the 4000 series certainly wasn't. For ProRes, I doubt the 5090 is quicker. Worth remembering also that 4:2:2 10-bit performance was previously poor on Nvidia cards.

You don't need to take my word for these facts. EposVox on YouTube knows as much about encoding and rendering video as anyone, and he personally switched from a high-end 4090 workstation to an M2 Ultra for large projects, which continually crashed on his 4090.

Given that, it feels a little strange to frame the discussion around video as one where Apple is behind.
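If anyone wants to sanity-check those encode paths themselves, something like the following comparison works (the source clip, bitrate, and profile choices are placeholders; hevc_videotoolbox, hevc_nvenc, and prores_videotoolbox are real ffmpeg encoders, but which ones actually run depends on your ffmpeg build and hardware):

```python
# Rough comparison of hardware encode paths (VideoToolbox vs NVENC) via ffmpeg.
# Input file, bitrate, and profile are placeholders; swap in your own clip.
import subprocess
import time

SRC = "clip_422_10bit.mov"  # hypothetical 4:2:2 10-bit source clip

jobs = {
    "hevc_videotoolbox":   ["-c:v", "hevc_videotoolbox", "-b:v", "50M"],        # Apple media engine
    "hevc_nvenc":          ["-c:v", "hevc_nvenc", "-b:v", "50M"],               # Nvidia NVENC
    "prores_videotoolbox": ["-c:v", "prores_videotoolbox", "-profile:v", "3"],  # hardware ProRes HQ
}

for name, args in jobs.items():
    cmd = ["ffmpeg", "-hide_banner", "-y", "-i", SRC, *args, f"out_{name}.mov"]
    t0 = time.time()
    result = subprocess.run(cmd)            # non-zero if this build lacks the encoder
    if result.returncode == 0:
        print(f"{name}: {time.time() - t0:.1f}s")
    else:
        print(f"{name}: not available on this machine")
```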
 
Depending on the codec and complexity, I'm sure. For H.264/HEVC, given Nvidia added extra encoders, the 5090 will be faster; the 4000 series certainly wasn't. For ProRes, I doubt the 5090 is quicker. Worth remembering also that 4:2:2 10-bit performance was previously poor on Nvidia cards.

You don't need to take my word for these facts. EposVox on YouTube knows as much about encoding and rendering video as anyone, and he personally switched from a high-end 4090 workstation to an M2 Ultra for large projects, which continually crashed on his 4090.

Given that, it feels a little strange to frame the discussion around video as one where Apple is behind.
I'm just going by what I use my own machines for, as they pay my bills.
For the video work I do, yes, the 5090's new encoders/decoders are a great addition and probably don't get mentioned enough since, for most people, the 5090 is a card generally known for gaming.
I waited so long for the new Studio Ultra but just can't buy the M3 Ultra, despite having bought the best-in-class Apple flagship machines in the past.
I actually ordered the 40-core M4 Max Studio instead, as that seems to be faster for Photoshop, which I also use a lot, so I'm interested to test it out when it arrives.
 
I'm just going by what I use my own machines for, as they pay my bills.
For the video work I do, yes, the 5090's new encoders/decoders are a great addition and probably don't get mentioned enough since, for most people, the 5090 is a card generally known for gaming.
I waited so long for the new Studio Ultra but just can't buy the M3 Ultra, despite having bought the best-in-class Apple flagship machines in the past.
I actually ordered the 40-core M4 Max Studio instead, as that seems to be faster for Photoshop, which I also use a lot, so I'm interested to test it out when it arrives.
Perhaps I missed it, but if you didn't buy the M3 Ultra, how do you know the 5090 is better for video work?
 
Perhaps I missed it, but if you didn't buy the M3 Ultra, how do you know the 5090 is better for video work?
The new 50-series encoders/decoders suit me well.
There's also a YouTube channel who do a lot of DaVinci Resolve stuff; they have a leaderboard with a Resolve project you can download, test, and then submit the results. There are lots of Mac results, including a couple of new M3 Ultras.
My 5090 results are currently leading, but Resolve has just released a beta which now supports the 50-series Nvidia cards and has sped things up even more… although they haven't updated the leaderboard yet with my faster time.
I'd still like to see more tests/results from anyone with an M3 Ultra, though.
 
And this is where Apple needs to step up. Offer incentives to Valve. Help them keep up development of Steam for the Mac. Apple already assists devs with their games; they need to show support for the biggest gaming platform.
It's to Steam's benefit to attract more customers, same as with Apple. Apple is currently selling 20+ million Apple Silicon Macs a year without providing incentives to Steam, and they'll likely sell 20+ million Macs THIS year without providing incentives to Steam. If Steam is looking at 20+ million users and going "Nah, we'd rather not try to win those customers," then that's on them. Games will never drive Mac sales in a huge way, as anyone not already in the market who is looking to play a given game has a wide range of options, almost all less expensive than a Mac, to play that game.
 