It is crazy to base a full analysis on Cinebench. Can we as a community please stop obsessing over synthetic benchmarks? Modern CPUs do 100x more than pure "compute", which isn't even that important for 99% of people.

Blender is a much better benchmark because it is an example of a "real world" application, and compute/render is the main (and for now only) job for GPUs.
 
> It is crazy to base a full analysis on Cinebench.
Yes, Cinebench is only a single floating-point-heavy benchmark; however, it is one that lends itself particularly well to power analysis given its length. It would be nice to have an integer-heavy one, like code compilation, but no such data set exists as far as I know.
> Can we as a community please stop obsessing over synthetic benchmarks? Modern CPUs do 100x more than pure "compute", which isn't even that important for 99% of people.

> Blender is a much better benchmark because it is an example of a "real world" application, and compute/render is the main (and for now only) job for GPUs.
There is little difference between Blender and Cinebench as benchmarks beyond the underlying engine (CB R24 uses the Redshift engine while Blender uses Cycles, although Blender can use Redshift too) and the scene(s) being rendered; both benchmarks run on both the CPU and GPU. Neither is more "real world" or synthetic than the other, and they are fairly tightly correlated. Even the notion of synthetic benchmarks is a little outdated; few modern benchmarks are truly synthetic (as in, not based on a real-world task). The only largely synthetic CPU benchmark I can think of that is still commonly used is PassMark.
 
My Strix Halo vs M4 CPU analysis, as always using NBC data and subtracting out idle power:

[Attached chart: Strix Halo vs M4 CPU performance/power analysis]

Things are getting a bit scrunched up; I haven't had time to prune the data points (unfortunately Numbers isn't great here: change the order or remove a data point and the entire graph's style has to be redone).

With respect to the Strix Halo in the Asus ROG Flow (AI Max+ 395), I expected higher power/lower efficiency for the ST, but I thought there would be better performance than the Strix Point. Instead it is just the same performance with much worse efficiency - not totally unexpected given the larger SOC design, but this is substantially worse. It almost uses as much power in ST as the base M4 does in multicore. Also, this is with the idle power removed, and the idle power here is substantial - 11.2W, all due to the SOC: there is no dGPU to blame, and it is a <14" laptop/tablet hybrid. With only one data point I don't know if that is a feature of the Halo's design or of Asus' implementation.

Similar story with multithreaded efficiency. The 14-core M4 Pro is over 40% more efficient for similar performance (side note: in this analysis I use the 14-core M4 Pro from the 14" Pro, which had slightly lower performance and slightly better efficiency than the 16" device reported in the Halo article). The Halo gets 41% more performance than the Strix Point for roughly the same relative power increase, with the HX 370 netting a score of 1166. The Halo's MT efficiency improves dramatically as power draw lowers. Unfortunately NBC did not measure wall power at each TDP. However, while TDP does not correspond well to wall power in an absolute sense, the Strix Point analysis, where they did measure both, seems to confirm that relative TDP values are proportional to wall power. So, assuming the same holds for the Halo, I can still use the TDP values to estimate wall power, assuming a 70W TDP for the default state in which wall power was measured. The Halo is well off the 14-core's efficiency at similar power and doesn't quite reach the 12-core M4 Pro's efficiency, but is much closer, only about 11% lower.

At first blush, this is not a bad result for AMD: Strix Point is only about 10% less efficient than the base M4, while the Halo is about 11% behind the 12-core M4 Pro. However, it should be pointed out that these chips are massive compared to the Apple ones I'm comparing them to, and CB R24 is a nearly embarrassingly parallel MT test where SMT2 yields around a 25% increase in performance. Strix Point therefore essentially has 12 P-cores (4 Zen 5 AVX 256-bit cores + 8 Zen 5c cores), or about 15 performance-thread equivalents, while the base M4 has only 4 P-cores + 6 E-cores for maybe 6 performance-thread equivalents. Meanwhile, the Halo has 16 full Zen 5 AVX-512 cores for 20 performance-thread equivalents, compared to the 12-core M4 Pro's 8 P-cores and 4 E-cores for roughly 9.3. Things get even more stark when considering die size.
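The thread-equivalent arithmetic above can be sketched in a few lines. This is a rough model, assuming SMT2 adds ~25% (per the post) and - my assumption, chosen to reproduce the post's numbers - that an Apple E-core counts as roughly 1/3 of a P-core:

```python
# Rough "performance-thread equivalent" model from the post.
SMT2_UPLIFT = 1.25    # post: SMT2 yields ~25% more MT performance in CB R24
E_CORE_WEIGHT = 1 / 3  # assumption: one Apple E-core ~ 1/3 of a P-core

def thread_equiv(p_cores, e_cores=0, smt=False):
    """Crude throughput-equivalent thread count for a CPU config."""
    base = p_cores * (SMT2_UPLIFT if smt else 1.0)
    return base + e_cores * E_CORE_WEIGHT

print(thread_equiv(12, smt=True))  # Strix Point (12 Zen 5/5c): 15.0
print(thread_equiv(4, 6))          # base M4 (4P + 6E): 6.0
print(thread_equiv(16, smt=True))  # Strix Halo (16 Zen 5): 20.0
print(thread_equiv(8, 4))          # 12-core M4 Pro (8P + 4E): ~9.33
```

The E-core weighting is the shakiest part; Apple's E-cores are considerably better than 1/3 of a P-core in efficiency terms, but for raw CB R24 throughput this lines up with the post's estimates.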

Based on annotated die shots, I've estimated that the Strix Point CPU is roughly the same size in mm^2 as the M4 Max's! (The M4 Max CPU size is estimated by multiplying up the appropriate component sizes as measured on the M4 die shot.) Meanwhile, the Halo is nearly double that size.

Here are my estimates:

Chip                   mm^2
Apple M4               27.0
Apple M4 12-core Pro   46.1
Apple M4 14-core Pro   52.2
Apple M4 Max           58.2
Strix Point            58.1
Strix Halo            109.5

A small note about what is included. SLC is NOT included for any SOC above, and neither are Apple's E/P-core AMX units (which add about 10%). AMD's L1-L3 and Apple's L1 and L2 are included. Of course there is a major caveat here: Apple's chips are manufactured on TSMC's N3E while AMD's are manufactured on N4P (maybe N4X for the Halo, more on that below). Quantifying how much that matters - how many transistors each chip uses - is nearly impossible. We can try to use references (1 and 2) to estimate the density difference, but that is so dependent on the proportion of SRAM to logic that even knowing the transistor count of the whole die doesn't tell us much. SRAM isn't expected to change in density, but logic did substantially (although even blocks annotated as cache can contain logic; Apple's 16MB cache is 30% smaller than AMD's 16MB cache on the Halo). By my estimations, Apple's chips are anywhere from 15-40% more dense than AMD's depending on which values you take, with TSMC themselves, I believe, giving a rough area decrease of 30% for N3E compared to N5 on their unnamed reference chip (and N4P/X being <6% smaller than N5, since SRAM didn't change at all and logic decreased by 6% - although Granite Ridge is supposedly quite a bit denser than its N5 predecessor, so 🤷‍♂️).
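As a sanity check on that 15-40% range, the process-node numbers cited above already imply roughly a 1.3x density edge for N3E over N4P. A back-of-envelope sketch, using only the 30% and <6% figures from the TSMC references (so this is an upper-bound-ish estimate, mix-dependent as noted):

```python
# Area of TSMC's reference chip relative to N5, per the figures in the post.
n3e_vs_n5 = 1 - 0.30   # N3E: ~30% area decrease vs N5
n4p_vs_n5 = 1 - 0.06   # N4P/N4X: <6% smaller than N5 (treating 6% as the bound)

# Same design on N3E vs on N4P:
n3e_vs_n4p = n3e_vs_n5 / n4p_vs_n5
print(f"N3E area vs N4P: {n3e_vs_n4p:.2f}x")            # ~0.74x the area
print(f"implied density edge: {1 / n3e_vs_n4p:.2f}x")   # ~1.34x denser
```

That ~34% falls inside the 15-40% range above; the true number for any given chip depends heavily on its SRAM/logic split.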

So how does Strix Halo compare to AMD's Granite Ridge (desktop Ryzen)? My guess is they are actually manufactured on different processes. N4X and N4P are nearly identical, and if you look at die shots the two CPUs are exactly the same, but the two dies outside of the CPU look ever so slightly different. Further, what N4X adds to N4P is a couple of extra transistor types, for both extra-low power and higher power with more leakage. Consider the desktop Ryzen 9950X: it can achieve a score of 2254 in CB R24. According to NBC, it gets 7.18 pts/W, which, subtracting out 100W of idle power since it was connected to a 4090 at the time, means it used about 213W above idle to achieve that score. Meanwhile, Strix Halo's performance is already tapering at much lower wattages - going from 101W wall power to 117W wall power (16%) only yields 5% extra performance (the 117W point is off the chart). We saw Strix Point behave the same way, eventually maxing out at just over 1200 pts. Given how far apart 2254 and 1731 are, it is difficult to see how the Halo could reach that even with double the power, with gains that small (and they will continue to shrink as power increases). Now, this was measured in the tablet-hybrid form factor, and there is dispute online over whether Granite Ridge itself is N4X or N4P, so this isn't a sure thing. But with the slight differences in die and the seemingly different performance characteristics, it seems reasonable to me that something differs between the desktop Ryzen and Strix Halo chips, where the latter is probably geared towards performance at lower power while the former is geared towards absolute performance.
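The 9950X power figure can be reconstructed from NBC's numbers in a couple of lines (the 100W idle figure is the post's estimate for the 4090-equipped test system):

```python
# Back out the 9950X's CB R24 power draw from NBC's efficiency figure.
score = 2254          # CB R24 multicore score
pts_per_watt = 7.18   # NBC efficiency, measured at the wall
idle = 100            # estimated idle draw of the test rig (4090 attached)

wall = score / pts_per_watt   # total wall power during the run, ~314W
above_idle = wall - idle      # ~214W attributable to the benchmark load
print(f"{wall:.0f}W wall, {above_idle:.0f}W above idle")
```

(The post rounds the above-idle figure down to 213W; the exact quotient is ~213.9W.)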

Bottom line: Apple has a huge advantage in terms of performance/Watt AND performance/area - perhaps not hugely surprising, since the amount of power used should be proportional to the amount of silicon die area turned on, but still, AMD needs far bigger CPUs to compete with Apple. This probably translates to increased profit margins for Apple (with the caveat that Apple probably spends a lot more per die by being first on the most advanced node), especially since they package their chips themselves without an OEM middle man. This ability to build smaller but still performant CPUs is also probably one factor that allows them to more cost-effectively build something like the M4 Max (and Ultra, or whatever Hidra ends up being), whereas the Strix Halo, for all its size, is closer to an M4 Pro competitor. Beyond consoles, it isn't clear whether there is an appetite in the PC space for larger SOCs, and indeed the Strix Halo itself is as yet unproven in the marketplace.
I wonder how the “half pipeline” AVX512 shows up on Halo vs the normal desktop part.
 
> I wonder how the "half pipeline" AVX512 shows up on Halo vs the normal desktop part.
As far as I understand, the Halo has the same full AVX-512 implementation as the desktop part. They certainly have the same core size, while the Point's "big" Zen 5 cores are smaller. However, I do not know if Cinebench has implemented AVX-512, so there may be no difference as far as this particular test is concerned. Such a difference might show up on the relevant Geekbench 6 subtests (just photo library and object detection, I believe, though possibly the ray tracer). Cinebench as of R23 (and the GB ray tracer) uses the Intel Embree library for CPU ray tracing, and Intel Embree does support AVX-512, but I don't know if that is an optional thing to support or what is required to enable it. Ian claims that as of 2020, R23 disabled AVX-512 (as did Intel Embree). I don't know its current status in either program.
 
> As far as I understand, the Halo has the same full AVX-512 implementation as the desktop part. They certainly have the same core size, while the Point's "big" Zen 5 cores are smaller. However, I do not know if Cinebench has implemented AVX-512, so there may be no difference as far as this particular test is concerned. Such a difference might show up on the relevant Geekbench 6 subtests (just photo library and object detection, I believe, though possibly the ray tracer). Cinebench as of R23 (and the GB ray tracer) uses the Intel Embree library for CPU ray tracing, and Intel Embree does support AVX-512, but I don't know if that is an optional thing to support or what is required to enable it. Ian claims that as of 2020, R23 disabled AVX-512 (as did Intel Embree). I don't know its current status in either program.
Wikipedia seems to think the data path is halved for power consumption reasons.

> The vector engine in Zen 5 features 4 floating point pipes compared to 3 pipes in Zen 4. Zen 4 introduced AVX-512 instructions. AVX-512 capabilities have been expanded with Zen 5 with a doubling of the floating point pipe width to a native 512-bit floating point datapath. The AVX-512 datapath is configurable depending on the product. Ryzen 9000 series desktop processors and EPYC 9005 server processors feature the full 512-bit datapath but Ryzen AI 300 mobile processors feature a 256-bit datapath in order to reduce power consumption.
 
> Wikipedia seems to think the data path is halved for power consumption reasons.
Wikipedia simply hasn't been updated yet. That was true for Strix Point last year; Halo is different. The Halo CCD (which Strix Point does not use - it's monolithic) is almost identical to Granite Ridge, and the cores/cache are, as far as I can tell, exactly the same in that regard - though their performance/power curves may differ. So they may be on slightly different lithographies or have subtle tweaks that wouldn't show up on a die shot.
 
Does Qualcomm's Oryon also offer these advantages over AMD CPUs?
To a much lesser extent. The Oryon's ST perf and perf/Watt in CB R24 is much better than the Strix Point's, but still only around an M2's level of performance and perf/Watt (a little less, in fact). With 12 P-cores and no E-cores, the Snapdragon Elite's MT performance and perf/Watt is similar to a Strix Point's (a little worse, at least in the version tested by NBC at multiple TDPs, which was not the best-binned Snapdragon variant). In terms of size, the Snapdragon has 12 performance threads versus 15 equivalents for the Strix Point, the CPU is slightly smaller in die area (48.7mm^2 vs 58.1mm^2, both TSMC N4), and each P-core is individually smaller (2.55mm^2 Oryon vs 3.15mm^2 Strix Point vs 3.7mm^2 Strix Halo/Granite Ridge vs 2.05mm^2 Zen 5c Strix Point).

However, Qualcomm's 2nd-generation Oryon P-cores ("L"-cores), found in their latest mobile SOC, were better than their first; those mobile SOCs also had smaller "M"-cores similar to Apple's E-cores, and they will reportedly be releasing two different PC SOCs with third-generation Oryon-L cores plus the smaller M-cores later this year. So the answer may be more unequivocally yes by the end of the year. We'll see.
 
RDNA4 has dynamic GPU registers like Apple
[Attached slide: RDNA 4 architecture (architecture-12.jpg)]

 
It has been pointed out on Beyond3D that the cache is still not unified like Apple's, though AMD seems to have a patent for such an implementation.
 
There is a rumor about AMD Medusa Halo I read on tweaktown.com. Seems like a logical progression for next year. But by then we'll have the M4 Ultra Studio and M5 Max moving the bar along a bit as well. This is all great for the non-enterprise user. Such rapid progress after years of Intel malaise.

"AMD's new Medusa Point APU will feature a single Zen 6 CCD with 12 cores and 24 threads, with between 16 CUs and 32 CUs of RDNA 3.X integrated GPU on a 256-bit memory bus, very similar to Strix Point APUs. However, the flagship Medusa Halo APU is rumored with up to 48 CUs of RDNA 3.X GPU on a possibly far higher 384-bit memory bus, that should see 30-50% performance gains over the already very impressive Strix Halo APU."
 
> There is a rumor about AMD Medusa Halo I read on tweaktown.com. Seems like a logical progression for next year. But by then we'll have the M4 Ultra Studio and M5 Max moving the bar along a bit as well. This is all great for the non-enterprise user. Such rapid progress after years of Intel malaise.
>
> "AMD's new Medusa Point APU will feature a single Zen 6 CCD with 12 cores and 24 threads, with between 16 CUs and 32 CUs of RDNA 3.X integrated GPU on a 256-bit memory bus, very similar to Strix Point APUs. However, the flagship Medusa Halo APU is rumored with up to 48 CUs of RDNA 3.X GPU on a possibly far higher 384-bit memory bus, that should see 30-50% performance gains over the already very impressive Strix Halo APU."
All that sounds good except sticking with RDNA 3.X, when RDNA 4 brings such nice advancements to exactly the use cases of high-VRAM workstations (AI and ray tracing). And they have to redesign the GPU die anyway for both the extra cores and the higher bandwidth. If they were keeping that die the same - reusing it while updating the CPU dies - then that might make sense from an economics perspective. But they aren't, so I don't get it.
 
> All that sounds good except sticking with RDNA 3.X, when RDNA 4 brings such nice advancements to exactly the use cases of high-VRAM workstations (AI and ray tracing)
RDNA 4, such as it is, may not be very suitable for laptops and needs to be modified. Just as RDNA 3 had to be modified to get RDNA 3.5.
 
> All that sounds good except sticking with RDNA 3.X, when RDNA 4 brings such nice advancements to exactly the use cases of high-VRAM workstations (AI and ray tracing). And they have to redesign the GPU die anyway for both the extra cores and the higher bandwidth. If they were keeping that die the same - reusing it while updating the CPU dies - then that might make sense from an economics perspective. But they aren't, so I don't get it.
I don't know when they design these APUs, but the AI stacked-VRAM use case is fairly recent. It relies on both networked architectures (e.g. EXO) being practical and high-performance, openly available LLM models to run (principally DeepSeek and Llama). I don't remember this being the case in 2023, only in late 2024. If AMD is smart, they should now be furiously designing in more than two memory channels and producing a line of APUs specifically for this use case (rather than for gaming laptops).
 
> RDNA 4, such as it is, may not be very suitable for laptops and needs to be modified. Just as RDNA 3 had to be modified to get RDNA 3.5.
>
> I don't know when they design these APUs, but the AI stacked-VRAM use case is fairly recent. It relies on both networked architectures (e.g. EXO) being practical and high-performance, openly available LLM models to run (principally DeepSeek and Llama). I don't remember this being the case in 2023, only in late 2024. If AMD is smart, they should now be furiously designing in more than two memory channels and producing a line of APUs specifically for this use case (rather than for gaming laptops).
Aye, I've heard RDNA 4 may not even be coming to mobile; it may be RDNA 5 or whatever comes next. It's just ... extremely frustrating. Unfortunate though it may be, the RDNA 4 tech missing this release (Strix) was very understandable. Not having a mobile RDNA 4 or whatever ready even for next year or more? Less so. I mean, I get that AMD has to devote resources to where it will make the most money, and that the consumer tech market is nowhere near the top of the list for almost anyone but Apple these days when the data center is so much more profitable, but still.

Of course, maybe AMD is just catching strays from me being confused by Gurman's latest: that the Studios released this week (good!) will be an M4 Max (good!) and ... an M3 Ultra? (WTF?)
 
The full M4 Max Studio is an interesting addition to the price/value debate vis-a-vis the Halo products announced so far, especially if RAM bandwidth + capacity is the major concern. It's $2,700 for the 64GB model and $3,500 for the 128GB model, but with 512GB/s of bandwidth - so the 128GB Halo has to beat $1,750 and the 64GB Halo has to beat $1,350. Tough. Of course there is high SSD upgrade pricing on the Mac (the base model only comes with 512GB; the next step up is $200), but also a better CPU/GPU - though naturally there is Windows vs Mac program compatibility to consider.
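Those price targets are bandwidth-per-dollar break-evens: the Mac has 512GB/s versus roughly 256GB/s for the Halo (the 256GB/s figure is my assumption, from its 256-bit LPDDR5X bus), so at equal GB/s per dollar the Halo must come in at half the Mac's price. A quick sketch:

```python
# Break-even Halo prices for equal memory bandwidth per dollar.
mac_bw, halo_bw = 512, 256          # GB/s; Halo figure is an assumption
mac_prices = {64: 2700, 128: 3500}  # Studio price by RAM capacity (GB)

for ram, price in mac_prices.items():
    halo_target = price * halo_bw / mac_bw
    print(f"{ram}GB Halo must beat ${halo_target:.0f}")
# 64GB -> $1350, 128GB -> $1750, matching the figures above
```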

The 96GB binned and full M3 Ultra Studios are also a wrinkle at 600GB/s and 800GB/s for $4,000 and $5,500 (and a 1TB drive). My sense is that, for the purposes of AI, the M4 Max above is the better deal - almost certainly better than the binned M3 Ultra - but the full M3 Ultra might be okay? Two M4 Maxes are cheaper than the 256GB model ($7,100) with higher combined bandwidth, but the M3 Ultra achieving its capacity/performance in a single chip, without needing to be linked, also provides benefits. Similarly, the full M3 Ultra at 96GB provides almost as much usable VRAM as the Halo but at more than 3x the bandwidth, plus a lot more CPU/GPU compute. Versus the 128GB M4 Max, for 1.5x the price and 32GB less RAM, the 96GB full Ultra provides 1.5x the bandwidth and about 1.5x the CPU/GPU compute (and double the storage). That could justify its price.
 
> My sense is that, for the purposes of AI, the M4 Max above is the better deal, almost certainly better than the binned M3 Ultra, but the full M3 Ultra might be okay? Two M4 Maxes are cheaper than the 256GB model ($7,100) with higher combined bandwidth, but the M3 Ultra achieving its capacity/performance in a single chip, without needing to be linked, also provides benefits.
The key, imho, is what it costs to achieve 512GB RAM for larger LLM models.

The M3 Ultra 32-core/80-GPU that unlocks the big memory option is a $1,500 upgrade over the "binned" M3 Ultra 28-core/60-GPU unit. Then 512GB RAM is another $4,000 on top - a $5,500 hit, for a total price of $9,500. But it can run a 404GB LLM model. If I got the pricing right, this is actually the cheapest way to go.

Alternatively, 512GB RAM can be attained using two 256GB binned M3 Ultra 28-core/60-GPU units at $5,600 each, which makes $11,200.

A 128GB M4 Max 16-core/40-GPU costs $3,500, but one would need four of them to get to 512GB RAM, and that's $14,000.
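The three routes to 512GB priced out side by side, using only the figures above:

```python
# Total cost of each route to 512GB of unified RAM, per the post's pricing.
options = {
    "1x full M3 Ultra (512GB)": 1 * 9500,
    "2x binned M3 Ultra (256GB each)": 2 * 5600,
    "4x M4 Max (128GB each)": 4 * 3500,
}

for name, cost in options.items():
    print(f"{name}: ${cost:,}")

cheapest = min(options, key=options.get)
print("cheapest:", cheapest)  # the single full M3 Ultra at $9,500
```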
 
> 18 tokens/s for the 671B model, doesn't look bad!
It isn't a "671B" model in inference; it's MoE, so you only use a subset of the parameters (37B) per prediction. So that's ~20 tokens/s as if from a 37B model (which of course still needs the full weights loaded, which is probably why it's a bit slower than a plain 37B model would be).


Sadly, pure compute power is way behind NVIDIA or even AMD, so if you don't need 600B-class models to run locally, it is still better and cheaper to buy 2-4 3090s with NVLink; performance would be 20-40x faster. (Tbh, R1's speed is kind of unusable for real work, at least on the M3 Ultra, but it can be enough for some sort of creative writing. If we get an M5 Ultra, it could possibly be a game changer in that space, assuming +35-50% GPU performance over the M3.)
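A rough sanity check on the ~18-20 tokens/s figure, assuming inference is purely memory-bandwidth bound, that only the ~37B active (MoE) parameters are read per token at Q4 (~0.5 bytes/param), and the M3 Ultra's 800GB/s. KV cache, compute, and expert-routing overhead are all ignored here, so real throughput should land well below this ceiling:

```python
# Bandwidth-bound tokens/s ceiling for an MoE model on an M3 Ultra.
active_params = 37e9      # DeepSeek R1: ~37B active params per token
bytes_per_param = 0.5     # Q4 quantisation ~ 4 bits/param
bandwidth = 800e9         # M3 Ultra memory bandwidth, bytes/s

bytes_per_token = active_params * bytes_per_param  # ~18.5GB read per token
ceiling = bandwidth / bytes_per_token
print(f"~{ceiling:.0f} tokens/s theoretical ceiling")
# Observed ~18 tokens/s is well under half the ceiling, which is plausible
# once attention, KV cache traffic, and routing overhead are accounted for.
```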
 
> It isn't a "671B" model in inference; it's MoE, so you only use a subset of the parameters (37B) per prediction. So that's ~20 tokens/s as if from a 37B model (which of course still needs the full weights loaded, which is probably why it's a bit slower than a plain 37B model would be).
I don’t think that’s true.

They did use 4 bit quantisation so it can fit in 512GB but the test is using all the parameters.
 
Some people seem to have their 512GB Ultra Studios already, but I wonder if they're really running the 671B-parameter Q4 model correctly. Look at his memory used; it should take much more than that.
[Attached screenshot: memory usage readout, 2025-03-14]
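A quick estimate of what the full model should occupy in memory, assuming ~0.5 bytes per parameter at Q4 and ignoring KV cache and runtime overhead:

```python
# Minimum resident size of the full 671B-parameter model at Q4.
params = 671e9
bytes_per_param = 0.5  # 4-bit quantisation

gb = params * bytes_per_param / 1e9
print(f"~{gb:.1f}GB just for the weights")  # roughly 335-336GB
# If the memory-used readout shows far less than this, the full model
# likely isn't resident (e.g. weights memory-mapped and partially paged).
```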
 