It is unlikely that more bandwidth would result in better performance. As it is, the bandwidth/performance ratio for ML is already better than a 4090's. For mixed-precision inference, these SKUs should work well enough.
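For intuition: batch-1 LLM decoding is roughly memory-bound, since every generated token streams all the weights through the memory bus once, so bandwidth divided by model size gives a rough upper bound on tokens per second. A back-of-the-envelope sketch (illustrative numbers, not benchmarks):

```python
# Rough upper bound for memory-bound, batch-1 LLM decoding:
# each new token reads every weight once, so tok/s <= bandwidth / model size.
def decode_tps_upper_bound(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 40  # e.g. a 70B model at ~4.5 bits/weight
for name, bw in [("Strix Halo", 256), ("M4 Max", 546), ("RTX 4090", 1008)]:
    # Note: a 40GB model doesn't even fit in a 4090's 24GB of VRAM,
    # which is where the bandwidth vs. capacity tradeoff comes from.
    print(f"{name} ({bw} GB/s): <= {decode_tps_upper_bound(bw, model_gb):.1f} tok/s")
```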



This I agree with. There are better products in almost every category for less money. And if you need CPU performance, a Mac is a better deal.
I am pretty sure the pricing reflects the form factor. Hopefully someone makes a normal laptop with this chip for a more sane price, instead of this Surface competitor.
 
I am pretty sure the pricing reflects the form factor. Hopefully someone makes a normal laptop with this chip for a more sane price, instead of this Surface competitor.

That was my thought. I was impressed with the graphics performance for a 2-in-1. I'm waiting for this to be dropped into a 13" MBA competitor.
 
Going by this logic, maybe the 128GB will be $100 less than the 64GB. Too bad you can't preorder to lock in the pricing.

 
Promising: only a $500 premium to upgrade from 32GB to 128GB of RAM, vs. Apple's $1K to go from 48GB to 128GB.

https://rog.asus.com/us/laptops/rog-flow/rog-flow-z13-2025/spec/
I think RAM pricing is key to its competitiveness. In a mini-PC (no need for a laptop screen etc.) at 128GB, it might be half the price of Apple's 128GB M4 Max. So one could get two and split the LLM layers between them. The cost of an Apple 256GB M4 Ultra (assuming that configuration is available when it's released) might be too high.

DeepSeek R1 (671B parameters, Q4), the interesting one, takes 404GB. I'm not sure a 512GB M4 Ultra will be possible, but it'd fit comfortably in one. The cost will certainly be extremely high even if it is. Four mini-PCs, each with 128GB, might be a lot cheaper.

Low-quant models don't perform as well in my testing with Llama 70B (at various quant levels): more hallucinations. But someone was running DeepSeek R1 671B with mlx-distributed on two M2 Ultras (192GB each?), though only at Q3, which is 293GB in size. That setup is well over $10K in hardware.
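For anyone checking these sizes, a quantized model's footprint is roughly parameters × bits-per-weight ÷ 8, plus a little metadata. A quick sketch (the bits-per-weight values are typical for llama.cpp-style K-quants; exact figures vary by quant mix):

```python
# Approximate in-memory size of a quantized model.
# 1e9 params × (bits/8) bytes per param → size in GB.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(f"R1 671B @ ~4.8 bpw (Q4-ish): {model_size_gb(671, 4.8):.0f} GB")  # ~403 GB
print(f"R1 671B @ ~3.5 bpw (Q3-ish): {model_size_gb(671, 3.5):.0f} GB")  # ~294 GB
print(f"Llama 70B @ ~4.8 bpw:        {model_size_gb(70, 4.8):.0f} GB")   # ~42 GB
```

Those line up with the 404GB and 293GB figures quoted above.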
 
Right now, this SoC does not have product market fit. It gets beat by much cheaper RTX laptops in gaming. It gets beat by Apple in performance and efficiency. I can't think of a market for this. If there is one, it must be very small.
This I agree with. There are better products in almost every category for less money. And if you need CPU performance, a Mac is a better deal.
I'm a little more optimistic about the market for this chip, but I too am a little concerned about its commercial viability. Sure, the Halo may not be as good as an M4x, and there are cheaper Windows gaming alternatives, but it provides a Mac-like high-VRAM system for less than a Mac, with native Windows and probably easier Linux support. So it's like a cheaper Mac on the workstation side, with better Windows gaming. That makes the sub-14" tablet hybrid Flow a bit of an odd duck for a first device, but as the reviewer @Xiao_Xi linked to said, it's a bit of an engineering show-off: "look what we can do" rather than a high-volume market mover. Personally, if I were still working, I'd be more interested in the CUDA-capable Nvidia-MediaTek equivalent SOC rumored to be coming later this year (probably not DIGITS). So I am sort of in this market, although for Nvidia because of CUDA, and I don't know how big the market is beyond me; and I'm not even working right now.
 
As for Apple, you are limited by static graphs, so LLM workloads are pretty limited if you want to utilize the ANE (you need autoregressive prediction in the decoder part; there is a new API in macOS 15+, but I didn't have any luck making it faster than good Metal kernels on Pro+ devices).
The static-graph restriction is probably a SW/compiler limitation, not a HW issue. There are patents, like
(2021) https://patents.google.com/patent/US20220237439A1 Branching operation for neural processor circuit
that suggest dynamic graphs are feasible in the hardware.

The ANE tools are constantly lagging a year or more behind the hardware (e.g. the A17 ANE, which Apple's advertising suggested had 2x INT8 throughput, but which we only saw activated about a year later, when the new OS and tools shipped).

I wouldn't be too hard on the tools team. They are building a whole new world from scratch (unlike the CPU and even the GPU team) and they have to prioritize many different directions (make PyTorch more compatible? make it easier to port some particular CUDA idiom? add something to MLX?). Dynamic graphs probably just have not yet made it onto the "must have SOON" list.
All the excitement now about MoEs (and equivalents, like a base omni-language model that feeds into language specialization layers) may mean that dynamic graphs HAVE now moved onto that "must have SOON" list.
 
The static-graph restriction is probably a SW/compiler limitation, not a HW issue. There are patents, like
(2021) https://patents.google.com/patent/US20220237439A1 Branching operation for neural processor circuit
that suggest dynamic graphs are feasible in the hardware.

The ANE tools are constantly lagging a year or more behind the hardware (e.g. the A17 ANE, which Apple's advertising suggested had 2x INT8 throughput, but which we only saw activated about a year later, when the new OS and tools shipped).

I wouldn't be too hard on the tools team. They are building a whole new world from scratch (unlike the CPU and even the GPU team) and they have to prioritize many different directions (make PyTorch more compatible? make it easier to port some particular CUDA idiom? add something to MLX?). Dynamic graphs probably just have not yet made it onto the "must have SOON" list.
All the excitement now about MoEs (and equivalents, like a base omni-language model that feeds into language specialization layers) may mean that dynamic graphs HAVE now moved onto that "must have SOON" list.
I'm not being hard on the ANE team; I'm just saying that you are still limited to the GPU in a lot of the most popular scenarios (like LLMs). "Dynamic graphs" are probably a limitation of some older ANE design, or simply a consequence of Apple's ML team partially using TensorFlow 1 as the base for their design (the code documentation even points at the TensorFlow 1 wiki for most operations); it is very hard to move everything into a dynamic, eager-like mode after writing everything in a static way (both MPSGraph and CoreML are still almost fully static in the way you do things).

And looking at the whole TensorFlow 2.0 '**** show', which killed Google's dominance in the open-source ML space, I'm not sure they want to attempt "dynamic" graphs; maybe adding a few more hacks for LLM workloads would be good enough (and they seem to be moving in this direction; maybe opening up the ANE framework would let us write our own dynamic APIs, but as it is Apple, I have no hope they'll ever do that).
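For what it's worth, the usual workaround for autoregressive decoding on a static-graph backend is to fix every tensor shape up front: pre-allocate a max-length KV cache and mask out the slots that haven't been written yet, so every decode step runs the exact same graph. A toy numpy sketch of the idea (a stand-in, not real CoreML/ANE code):

```python
# Static-shape autoregressive decoding: the cache has a fixed MAX_LEN, and
# each step feeds one token plus its position, so shapes never change.
import numpy as np

MAX_LEN, D = 128, 64
k_cache = np.zeros((MAX_LEN, D), dtype=np.float32)
v_cache = np.zeros((MAX_LEN, D), dtype=np.float32)

def step(x: np.ndarray, pos: int) -> np.ndarray:
    """One decode step; pretend the K/V projections are identity."""
    k_cache[pos] = x
    v_cache[pos] = x
    scores = k_cache @ x                         # always shape (MAX_LEN,)
    scores[np.arange(MAX_LEN) > pos] = -np.inf   # hide unwritten cache slots
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                           # fixed output shape (D,)

x = np.random.rand(D).astype(np.float32)
for pos in range(4):                             # the autoregressive loop
    x = step(x, pos)                             # lives outside the graph
```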
 
What is the use case for something like this?
I would say the same use cases as for the high-VRAM Mac (modeling and sim, AI, development, 3D rendering)*, but for people who prefer Windows (especially if they want easier access to the Windows gaming ecosystem in the same machine), all in a package that gets better battery life and/or a thinner/lighter enclosure than normal PC laptops or mini-PCs at the same performance level. It's an open question whether PC users are willing to pay for such a package and whether that market is big enough to sustain development of such chips long-term.

*there may be other fields that could benefit from high VRAM that I am not aware of
 
I would say the same use cases as for the high-VRAM Mac (modeling and sim, AI, development, 3D rendering)*, but for people who prefer Windows (especially if they want easier access to the Windows gaming ecosystem in the same machine), all in a package that gets better battery life and/or a thinner/lighter enclosure than normal PC laptops or mini-PCs at the same performance level. It's an open question whether PC users are willing to pay for such a package and whether that market is big enough to sustain development of such chips long-term.

*there may be other fields that could benefit from high VRAM that I am not aware of
I highly doubt that anyone would buy an expensive, slow AMD GPU over an inexpensive Nvidia GPU for modeling, simulation, development, or 3D rendering. If they have tens-of-gigabytes models, surely they aren't running them on an underpowered, expensive AMD laptop. Nvidia GPUs run circles around AMD in Blender rendering. https://opendata.blender.org/benchm...PI&group_by=device_name&blender_version=4.3.0

As for AI, it's already been mentioned that 256GB/s is terrible for big models. In general, AMD has less local LLM software support than either Apple or Nvidia. Even the M4 Max is quite torturous for large models: prompt processing and the 546GB/s memory-bandwidth bottleneck are hard to overcome for large LLMs that require high memory capacity.

Again, I fail to see any product market fit. I'm unconvinced by your argument.
 
I highly doubt that anyone would buy an expensive, slow AMD GPU over an inexpensive Nvidia GPU for modeling, simulation, development, or 3D rendering.

VRAM: if those inexpensive Nvidia GPUs can't hold the problem, they get slow very quickly (if it works at all; while Nvidia has done a lot of great work to make swapping easier on the programmer, it isn't free ... yet). And Nvidia GPUs that have lots of VRAM are far more expensive. Like the Mac, this is a personal/"prosumer" workstation chip, but with the added bonus that you can do native Windows-x86 gaming on it as well.

If they have tens-of-gigabytes models, surely they aren't running them on an underpowered, expensive AMD laptop. Nvidia GPUs run circles around AMD in Blender rendering. https://opendata.blender.org/benchm...PI&group_by=device_name&blender_version=4.3.0
Oh, I'm very aware. As I mentioned in a different forum, it is indeed a pity that this initial Halo product is RDNA 3.5 instead of 4; based on reports, the latter would've improved the GPU's utility substantially for 3D rendering. I'm looking forward to when Blender benchmarks, or even just Solar Bay, drop for the Halo (have you seen any?). I'm not expecting a lot from this particular chip though. Remember, AMD and the rest are playing catch-up to Apple in this medium-SOC segment, so in that sense this is an M1/M2 Pro equivalent several years later. If AMD continues with the Halo line, an RDNA4/UDNA-generation GPU SOC should be far better for both AI and ray tracing. And speaking of Apple: despite the obvious benefits of a large pool of unified memory for 3D rendering and AI, Apple didn't ship the M1/M2 GPU with great ray tracing, nor does Apple's modern GPU have fantastic matmul performance (neither does AMD's latest RDNA 4; UDNA in the future might be different). They both ship with NPUs, but as is being discussed in this thread, that's a little different. My point is that this is a first-generation SOC chip (at least at this size, for Windows PCs). Like Apple did, AMD is starting with the tech they have.

And yes, we've already established that I personally would be waiting for an Nvidia-MediaTek device for my purposes (FP32 compute + CUDA rather than AI), and there are undoubtedly a lot of people who would prefer such an Nvidia-based product as well, but that doesn't mean AMD (or Apple, for that matter) just gives up and stops competing (especially when Nvidia hasn't released said product yet).

As for AI, it's already been mentioned that 256GB/s is terrible for big models. In general, AMD has less local LLM software support than either Apple or Nvidia. Even the M4 Max is quite torturous for large models: prompt processing and the 546GB/s memory-bandwidth bottleneck are hard to overcome for large LLMs that require high memory capacity.

People do it. That's why they use Mac Mini Pros for exactly that purpose (EDIT: struck out: and probably why Apple allows the Mini M4 Pro to go to 128GB while capping the MacBook Pro M4 Pro at 64GB) and why those same people are excited for the 128GB Nvidia DIGITS with its 273GB/s for $3000. Also, are we discussing the market for the Halo or the Flow? Because those are two different discussions.

For the Halo, it is an x86 equivalent to a 14-core M4 Pro chip, with worse power efficiency and ST performance and a stronger (non-TBDR/ray tracing) GPU, for cheaper (maybe; price is still a question mark, I'm seeing different prices for seemingly the same model, and we don't know the price of the yet-to-be-released models). So yes, if you want a dedicated gaming rig or a dedicated workstation, you can find better/cheaper. What this chip enables is a multipurpose workstation, portable or smaller/lower-energy than the typical PC, that can also be used for Windows gaming, all in a single (if more expensive) device. That's the market. What does that sound like? An Apple chip. As I opened the paragraph with, they've essentially made an M4 Pro-level chip for Windows-x86, with all the benefits and caveats inherent in Windows-x86 vs macOS-Apple Silicon.

For the Flow in particular ... it's technically very impressive (although idle power is surprisingly bad according to NBC: over 11W), but I'm very unsure about the market reach for a sub-14" tablet hybrid with a Halo chip. A mini-PC and a 14" laptop from HP are also supposedly in the works.

Again, I fail to see any product market fit. I'm unconvinced by your argument.
I'm not really trying to convince you that this is the best chip ever with fantastic market potential. I'm just saying they're going after the same market segment Apple goes after, but for people who want to stay in the Windows ecosystem. How big is that? I don't know. Will this chip succeed in capturing said market? I don't know. AMD has clearly wanted to make this kind of chip for PCs ever since they acquired ATI back in the day (and they sort of did, for consoles). Apple made it first (for PCs) and basically proved the medium and large SOC PC market exists (for Apple). AMD wants to see if they can make a medium-SOC market for themselves, and I think it's not unreasonable to try. That doesn't mean they'll succeed (especially not with the first Halo generation).
 
I think whether or not this succeeds in PCs, it (or a future iteration) will make a killer console chip.
That's pretty much already what the current consoles are, except the memory bandwidth isn't there yet for APUs.

The PS5 has 448GB/s, compared to 256GB/s for Strix Halo.

The M4 Max, however, is already faster than the PS5 Pro, thanks to much faster CPU cores (Zen 2 isn't great), comparable memory bandwidth (PS5 Pro: 576GB/s, M4 Max: 546GB/s), and comparable GPU TFLOPS (PS5 Pro: 16.7, M4 Max: 16.4).
 
I'm not really trying to convince you that this is the best chip ever with fantastic market potential. I'm just saying they're going after the same market segment Apple goes after, but for people who want to stay in the Windows ecosystem. How big is that? I don't know. Will this chip succeed in capturing said market? I don't know. AMD has clearly wanted to make this kind of chip for PCs ever since they acquired ATI back in the day (and they sort of did, for consoles). Apple made it first (for PCs) and basically proved the medium and large SOC PC market exists (for Apple). AMD wants to see if they can make a medium-SOC market for themselves, and I think it's not unreasonable to try. That doesn't mean they'll succeed (especially not with the first Halo generation).
I could see AMD making something like Halo the default for Ryzen 3/5/7 CPUs, since discrete GPUs are so expensive and hard to get these days. It would let them drop entry-level discrete cards and focus on the midrange and enthusiast segments.
 
People do it. That's why they use Mac Mini Pros for exactly that purpose (and probably why Apple allows the Mini M4 Pro to go to 128GB while capping the MacBook Pro M4 Pro at 64GB)

Where do you see a Mac Mini Pro with a 128GB RAM option? Only 64GB is listed, which puts it behind the 128GB Ryzen 395, while the 128GB Mac Studio is cost-prohibitive at $4,300. Even maxed out with 192GB RAM it is still way short of running full Grok 3 or DeepSeek R1 671B; even Q4 requires at least an x64 server or workstation with more extensive 2TB+ RAM support.

 
Where do you see a Mac Mini Pro with a 128GB RAM option? Only 64GB is listed, which puts it behind the 128GB Ryzen 395, while the 128GB Mac Studio is cost-prohibitive at $4,300. Even maxed out with 192GB RAM it is still way short of running full Grok 3 or DeepSeek R1 671B; even Q4 requires at least an x64 server or workstation with more extensive 2TB+ RAM support.

Oh, I must've gotten confused switching through browser tabs too quickly a week ago*. I thought it was odd that the mini M4 Pro went to 128GB while the MacBook Pro M4 Pro only went to 64GB, but I was lazy/tired and didn't double-check. Having said that, people have still been building AI server farms out of the minis; they generally combine several together. And this is the same market for the Halo and DIGITS, and why AMD and Nvidia are marketing them as inference chips/boxes with up to 128GB of RAM and 250-273GB/s bandwidth.

*EDIT: Either that or, late at night, I was misremembering my own hypotheticals from a month ago as being real, since apparently a month ago I was writing about the mini Pro and the fact that it didn't go to 128GB.
 
Only 64GB is listed, which puts it behind the 128GB Ryzen 395, while the 128GB Mac Studio is cost-prohibitive at $4,300. Even maxed out with 192GB RAM it is still way short of running full Grok 3 or DeepSeek R1 671B; even Q4 requires at least an x64 server or workstation with more extensive 2TB+ RAM support.
A whole stack of Mac Minis for DeepSeek R1 Q4. Double the memory and half the stack would be nice. I wonder if he's using Thunderbolt 5 networking, and what kind of network topology that is for 8 machines?
P.S. He has an extra Mac Mini in the background, so I wonder why he didn't just go all in with Mac Minis for 8 x 64GB = 512GB.
[screenshot: a stack of eight Mac Minis running DeepSeek R1]
 
The issue with distributed setups is that the bottleneck moves from fast memory bandwidth to slower I/O bandwidth. In the case of the Mac Mini Pro, Thunderbolt 5 is around 65Gb/s, a fraction of the memory bandwidth. As for network topology, what's displayed on the laptop screen hints at a Thunderbolt ring topology, with unknown impact on I/O bandwidth when the hosts are under LLM load, plus having to hop through multiple hosts.
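Some quick arithmetic on that gap, using the rough figures quoted in this thread:

```python
# Interconnect vs. local memory bandwidth (rough figures from this thread).
tb5_gbit = 65                 # Thunderbolt 5 usable payload, Gb/s (as quoted)
tb5_gbyte = tb5_gbit / 8      # ~8.1 GB/s
mem_gbyte = 273               # M4 Pro unified-memory bandwidth, GB/s
print(f"TB5 ~{tb5_gbyte:.1f} GB/s = {tb5_gbyte / mem_gbyte:.0%} of local memory bandwidth")

# Layer-split inference mostly dodges the slow link: per token, only the
# hidden-state activations cross a hop, not the weights themselves.
hidden, bytes_per = 7168, 2   # e.g. DeepSeek R1's hidden size, at fp16
print(f"per-token hop: ~{hidden * bytes_per / 1024:.0f} KiB")
```

So the slow links hurt per-hop latency far more than they starve the math, which is presumably why these Thunderbolt clusters work at all for inference.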
 
As for network topology, what's displayed on the laptop screen hints at a Thunderbolt ring topology, with unknown impact on I/O bandwidth when the hosts are under LLM load, plus having to hop through multiple hosts.
Seems to me that if you split the LLM layers between the Minis, each one serially passes its activations on to the next Mini. You begin at one Mini (type in the query), and the last Mini should generate the text. If so, I'm not sure why one needs a ring.
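That's essentially pipeline (layer-split) parallelism. A toy sketch of the shape of it, with numpy arrays standing in for machines (real setups, like the mlx-distributed one mentioned above, follow the same chain, just over the network):

```python
# Layer-split (pipeline) inference: each host owns a contiguous slice of
# layers, and only the running hidden state is handed to the next host,
# so inference only needs a one-directional chain, not a ring.
import numpy as np

HIDDEN = 512              # toy size; big models are more like 7-8K wide
N_HOSTS, LAYERS_PER_HOST = 8, 4
rng = np.random.default_rng(0)

class Host:
    def __init__(self):
        # Stand-in weights; a real host would hold ~1/N_HOSTS of the model.
        self.layers = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.05
                       for _ in range(LAYERS_PER_HOST)]

    def forward(self, h: np.ndarray) -> np.ndarray:
        for w in self.layers:
            h = np.tanh(h @ w)    # toy transformer-block stand-in
        return h                  # only these HIDDEN values leave the machine

hosts = [Host() for _ in range(N_HOSTS)]
h = rng.standard_normal(HIDDEN)   # embedding of the current token
for host in hosts:                # strictly sequential, host i -> host i+1
    h = host.forward(h)           # in a real cluster this is a network hop
print(h[:4])                      # the last host would project to logits
```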
 
Seems to me that if you split the LLM layers between the Minis, each one serially passes its activations on to the next Mini. You begin at one Mini (type in the query), and the last Mini should generate the text. If so, I'm not sure why one needs a ring.
You don't; if you can split layers, you only need ring-like connections for training, and I'm pretty sure no one wants to train models on M4 minis; it would be painfully slow (just as prediction is slow, but at least your data is stored locally, so the tradeoff can be worth it).
 
You don't; if you can split layers, you only need ring-like connections for training, and I'm pretty sure no one wants to train models on M4 minis; it would be painfully slow (just as prediction is slow, but at least your data is stored locally, so the tradeoff can be worth it).
This is for inference only of course, spread over a bunch of macs.
Training DeepSeek R1 or doing a distill is way beyond the capabilities of any home or university setup.
 
This is for inference only of course, spread over a bunch of macs.
Training DeepSeek R1 or doing a distill is way beyond the capabilities of any home or university setup.
Yes, I know; that's why I wrote that the ring-like setup is used for training, so there is most likely no many-to-many mac-to-mac connection for gradient updates, just sequential predictions with the layers split over multiple devices (as in the photo).
 