A distributed ARM cluster is more expensive, yet slower, than a single x64 server.

5 x $2,600 = $13,000 worth of Mac Studios running the ~200GB Llama 3.1 405B Q4 at 0.6 tokens/s. NetworkChuck mentioned DeepSeek R1 but probably didn't run it, since it's more demanding and would be even slower.

1 x $6,000 x64 server running the ~700GB DeepSeek R1 671B Q8 at ~6-8 tokens/s.
https://xcancel.com/carrigmat/status/1884244369907278106

1 x $2,000 x64 server running the ~400GB DeepSeek R1 671B Q4 at ~4 tokens/s.
https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/
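A back-of-envelope way to sanity-check these tokens/s figures: local LLM decode is usually memory-bandwidth bound, so throughput tops out near bandwidth divided by bytes read per token. A minimal sketch, with assumed figures (R1 activating ~37B of its 671B MoE params per token; ~920 GB/s aggregate for a dual-socket Epyc with 24 channels of DDR5-4800), not measured specs:

```python
# Back-of-envelope: bandwidth-bound decode throughput ceiling.
# All figures below are illustrative assumptions, not measured specs.

def tokens_per_sec(bandwidth_gb_s, active_params_billions, bytes_per_param):
    """Theoretical ceiling; real rigs land well below this."""
    gb_per_token = active_params_billions * bytes_per_param  # GB read per token
    return bandwidth_gb_s / gb_per_token

# DeepSeek R1 is MoE: ~37B of its 671B params are active per token (assumed).
# Dual-socket Epyc, 24 channels of DDR5-4800 ~= 920 GB/s aggregate (assumed).
print(tokens_per_sec(920, 37, 1.0))  # Q8 ~ 1 byte/param -> ~25 tok/s ceiling
```

The ~6-8 tokens/s reported above sits well under that ~25 tokens/s ceiling, which is expected once NUMA effects, expert routing, and compute overhead eat into it.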
 
Doesn't look like Thunderbolt 5 networking; he's using a wired Ethernet router (10Gbps?) in the picture. That should be slower, right? Is there a Thunderbolt 5 "router" that works for a LAN?

He tested both. Thunderbolt ran in a quasi hub-and-spoke topology, since the Mac Studio has four Thunderbolt 4 ports.
 
1 x $6,000 x64 server running the ~700GB DeepSeek R1 671B Q8 at ~6-8 tokens/s.
https://xcancel.com/carrigmat/status/1884244369907278106
This dual-socket one is interesting. He saved a ton of money by using two low-end fifth-gen Epyc CPUs, 16 cores each, for about $1.7K total. If one wanted more general-purpose high-core-count Epyc CPUs, that would cost $13K just for two 64-core parts. Then it's all about the 24 DDR5 RDIMMs, which cost plenty: at 32GB each that's about $4K, and 24 x 32GB = 768GB is barely enough. Suppose you buy some headroom with 64GB RDIMMs for future-proofing; assuming the motherboard accepts them, that's double the capacity for about $8.5K. Then add maybe $2K for the power supply (one or two?), two CPU coolers, a big-enough case, fans, and SSDs, plus something for sales tax. It's gonna add up.
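The tally above can be sketched as a quick sum. The prices are the rough figures from the post plus an assumed ~8% sales tax, not real quotes:

```python
# Hypothetical parts tally for the dual-Epyc build sketched above.
# Prices are the rough figures from the post, not real quotes.
build_32gb_dimms = {
    "2x 16-core 5th-gen Epyc CPUs": 1700,
    "24x 32GB DDR5 RDIMMs (768GB)": 4000,
    "PSU(s), coolers, case, fans, SSDs": 2000,
}
subtotal = sum(build_32gb_dimms.values())
total = round(subtotal * 1.08)  # assume ~8% sales tax
print(subtotal, total)  # 7700 8316
```

Swapping in the 64GB RDIMMs at ~$8.5K pushes the pre-tax subtotal past $12K, which is why the "it's gonna add up" point stands even for the budget build.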
 
Framework announced a mini PC. It was compared to the Apple Mac Studio; see below.
Truthfully the Mini M4 Pro is the more comparable device, but the Halo should still be a better deal: more power-hungry than the M4 Pro and with less RAM bandwidth, but cheaper RAM, a higher RAM ceiling, and potentially more powerful (although the M4 Pro still wins some benchmarks depending on power settings, especially on CPU and bandwidth).
 
Truthfully the Mini M4 Pro is the more comparable device, but the Halo should still be a better deal: more power-hungry than the M4 Pro and with less RAM bandwidth, but cheaper RAM, a higher RAM ceiling, and potentially more powerful (although the M4 Pro still wins some benchmarks depending on power settings, especially on CPU and bandwidth).
I didn't see benchmarks for the ITX-board Framework. I guess it's clocked faster than the laptop one.
"Putting it in a desktop chassis gets it to 120W sustained power and 140W boost while staying quiet and cool."
 
I didn't see benchmarks for the ITX-board Framework. I guess it's clocked faster than the laptop one.
"Putting it in a desktop chassis gets it to 120W sustained power and 140W boost while staying quiet and cool."
Oh I'm sure it will be, and the Halo will likely be equal to or faster than the M4 Pro CPU at those power levels depending on the test (and to be fair, a full Pro CPU will perform nearly identically to a binned Max CPU), but it's very unlikely to reach a full Max's CPU performance, and also unlikely to reach a binned Max's GPU performance.

Concentrating on the CPU, I haven't finished my own analysis yet, but we can see the behavior in CB R24 over different power levels:


The CPU's performance starts to taper off around 80W TDP (a 23% increase in power from 65W only yields an 8% increase in performance). Unfortunately that's where the test ends, and the taper could be a result of the form factor, but we saw Strix Point do the same thing past 50W TDP. Interestingly, the top desktop Zen 5 CPU, which is structurally the same as Strix Halo, can hit or exceed the Max's performance in CB R24, but it's fabbed on N4X and can draw over 180W. Strix Halo may, like Strix Point, be manufactured on N4P, as its CCDs are ever so slightly different from desktop Zen 5's, but I've not seen that confirmed. That's why my guess is that even beyond 80W TDP, we aren't seeing an M4 Max competitor. To be clear, this isn't a knock on the chip at all; AMD themselves are pushing the Halo as an M4 Pro competitor (their marketing compared it to 12-core and 14-core Pros).
 
Apparently, Linux can allocate 110GB of the 128GB to the GPU (more than 96GB, anyway), and a stack of four reportedly ran DeepSeek R1 671B Q4 (404GB). It looks like they're daisy-chaining over USB4 for fast inter-node communication during inference, plus wired Ethernet for internet access.
Now, 4 x $2K = $8K. Framework is not value-priced; at $1.5K each from another vendor, that's $6K, cheaper than Mac Studio Ultras.

 
And a lot slower and more power hungry.
Depends on how much it costs to get DeepSeek R1 671B Q4 running at a reasonable tokens/s rate. If it's half the price or less, being a bit slower GPU-wise is okay. Or you buy twice the compute and perhaps run DeepSeek R1 671B FP8, which would be cost-prohibitive across several top-end M4 Macs once they're released.
 
Oh I'm sure it will be, and the Halo will likely be equal to or faster than the M4 Pro CPU at those power levels depending on the test (and to be fair, a full Pro CPU will perform nearly identically to a binned Max CPU), but it's very unlikely to reach a full Max's CPU performance, and also unlikely to reach a binned Max's GPU performance.

Concentrating on the CPU, I haven't finished my own analysis yet, but we can see the behavior in CB R24 over different power levels:


The CPU's performance starts to taper off around 80W TDP (a 23% increase in power from 65W only yields an 8% increase in performance). Unfortunately that's where the test ends, and the taper could be a result of the form factor, but we saw Strix Point do the same thing past 50W TDP. Interestingly, the top desktop Zen 5 CPU, which is structurally the same as Strix Halo, can hit or exceed the Max's performance in CB R24, but it's fabbed on N4X and can draw over 180W. Strix Halo may, like Strix Point, be manufactured on N4P, as its CCDs are ever so slightly different from desktop Zen 5's, but I've not seen that confirmed. That's why my guess is that even beyond 80W TDP, we aren't seeing an M4 Max competitor. To be clear, this isn't a knock on the chip at all; AMD themselves are pushing the Halo as an M4 Pro competitor (their marketing compared it to 12-core and 14-core Pros).
I mean really the only interesting thing about Strix Halo is the GPU config. Strix Point was a bit more interesting on the CPU side because they used Zen 5c cores to raise the core count without increasing area usage so much.

For the Strix Halo follow-on, it will be interesting to see if they go RDNA4 or skip it and go with UDNA instead.
 
Apparently, Linux can allocate 110GB of the 128GB to the GPU (more than 96GB, anyway), and a stack of four reportedly ran DeepSeek R1 671B Q4 (404GB). It looks like they're daisy-chaining over USB4 for fast inter-node communication during inference, plus wired Ethernet for internet access.
Now, 4 x $2K = $8K. Framework is not value-priced; at $1.5K each from another vendor, that's $6K, cheaper than Mac Studio Ultras.

Just the 128GB mainboard from Framework is selling for $1700.

While another brand might be able to offer lower prices, I'm unsure the total package can get down to $1.5K. Maybe, but that seems like a big step down. Of course, we don't know yet how Apple will price the new Studios, especially if the Ultra-class chip isn't two M4 Maxes but instead the rumored Hidra.

I mean really the only interesting thing about Strix Halo is the GPU config. Strix Point was a bit more interesting on the CPU side because they used Zen 5c cores to raise the core count without increasing area usage so much.
The CPU might be slightly interesting if it's the desktop chip on a different node.

For the Strix Halo follow-on, it will be interesting to see if they go RDNA4 or skip it and go with UDNA instead.

Agreed. Future versions should provide much better utility for its intended audience.
 
Just the 128GB mainboard from Framework is selling for $1700.

While another brand might be able to offer lower prices, I'm unsure the total package can get down to $1.5K. Maybe, but that seems like a big step down. Of course, we don't know yet how Apple will price the new Studios, especially if the Ultra-class chip isn't two M4 Maxes but instead the rumored Hidra.
We know Apple won't be price-competitive, given its attitude to RAM pricing. Of course, an Apple M4 Studio will be more efficient and faster, but it'd be priced out of reach.

I read that the Framework one already has many pre-orders, but it isn't shipping until Q3. That's plenty of time for Chinese manufacturers like these to undercut them completely. GMKtec announced this Strix Halo machine in January, shipping Q1 or Q2. No idea on price.
 
We know Apple won't be price-competitive, given its attitude to RAM pricing. Of course, an Apple M4 Studio will be more efficient and faster, but it'd be priced out of reach.

I read that the Framework one already has many pre-orders, but it isn't shipping until Q3. That's plenty of time for Chinese manufacturers like these to undercut them completely. GMKtec announced this Strix Halo machine in January, shipping Q1 or Q2. No idea on price.
Since I'm not in that field myself, I can't say for sure how that would shake out, but I could see different groups coming to different conclusions about value vs. the Apple Studio depending on their needs (bandwidth, efficiency, and long-term costs vs. RAM capacity, etc.). Should be interesting.
 
I fail to understand why people are so hyped for this, given the meager 256GB/s of memory bandwidth it provides, especially in conjunction with LLMs.
It's very simple. People want to run/host a non-trivial LLM like DeepSeek R1 671B (the full model, or a not-ridiculously-small quant) as inexpensively as possible at a decent tokens/s rate. If this Strix Halo thing works for that at a new price point, it'll be popular. Not everyone can or wants to drop $15-20K on Apple hardware. If it can be done for $5-6K, that's a game-changer.
 
It's very simple. People want to run/host a non-trivial LLM like DeepSeek R1 671B (the full model, or a not-ridiculously-small quant) as inexpensively as possible at a decent tokens/s rate. If this Strix Halo thing works for that at a new price point, it'll be popular. Not everyone can or wants to drop $15-20K on Apple hardware. If it can be done for $5-6K, that's a game-changer.
R1 671B at Q8 requires 713GB of RAM; FP16 requires 1.3TB. Think about that for a second.

Let's just say you chain enough of these Framework computers together to get 713GB of RAM. With Linux's 110GB-per-box allocation limit, that's 7 of them. The cost is already over $14K, and it draws well over 1,000 watts under load.

After all the overhead, interconnect latency (USB4, lol), etc., you're looking at 1-2 tokens/s. That's torturing yourself. And if you enable thinking mode, which can generate 100x more tokens per query, you might as well take a vacation each time you ask it to do something: with a 200-token output and 100x that in thinking tokens, it'll take about 5.5 hours to get a measly 200-token reply.
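The node-count and wait-time estimates above, worked out explicitly (assuming 110GB usable per box, ~$2K per box, and a pessimistic 1 token/s):

```python
import math

# Node count for 713GB of Q8 weights at 110GB usable GPU allocation per box,
# then wall-clock time for a reply with ~100x hidden "thinking" tokens.
# All inputs are the post's rough figures, not measurements.
nodes = math.ceil(713 / 110)           # GB needed / usable GB per node
cluster_cost = nodes * 2000            # ~$2K per Framework box
output_tokens = 200
thinking_tokens = 100 * output_tokens  # thinking mode overhead (assumed 100x)
hours = (output_tokens + thinking_tokens) / 1.0 / 3600  # at 1 token/s
print(nodes, cluster_cost, round(hours, 1))  # 7 14000 5.6
```

The ~5.6 hours here matches the post's ~5.5-hour figure, the small difference being whether the 200 output tokens are counted on top of the thinking tokens.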

No consumer hardware can run R1 671B at a "usable" speed.

If Apple comes out with something like an M5 Extreme with LPDDR6 and 512GB of RAM, you could then run the Q4 version and get around 130 tokens/s. With LPDDR6X at 4TB/s of bandwidth, the Extreme could even do 218.4 tokens/s.

Let's say the M5 Extreme starts at $8,000, i.e. 2x the Ultra (and the Ultra is 2x the price of the Max). Add 512GB of RAM for $4,000 based on existing RAM price scaling: $12K. Add an extra $3K for inflation and more advanced chip packaging: $15K. Now that would be a very interesting computer, and worthy of its price.
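The 218.4 tokens/s figure above is consistent with a simple bandwidth-bound model. The inputs below are my assumptions for illustration (R1 activating ~37B params per token, Q4 at ~0.5 bytes/param, purely bandwidth-bound decode), not specs:

```python
# Sanity check on the ~218 tokens/s claim for a hypothetical 4 TB/s machine.
# Assumptions (mine, not the poster's): DeepSeek R1 is MoE with ~37B active
# params per token, and Q4 reads ~0.5 bytes per active param.
active_params = 37e9
bytes_per_param = 0.5
bandwidth = 4e12  # 4 TB/s hypothetical LPDDR6X package
ceiling = bandwidth / (active_params * bytes_per_param)
print(round(ceiling, 1))  # ~216 tok/s, in line with the ~218 figure above
```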
 
I fail to understand why people are so hyped for this, given the meager 256GB/s of memory bandwidth it provides, especially in conjunction with LLMs.
The problem with this computer/SoC is that 128GB is great, but 256GB/s is nearly useless for large models. It's torture to use.

The sweet spot is probably 64GB, where you can load a 32B model and still have around 20GB of RAM left over for the system. The 32B model might generate 10-12 tokens/s, which is "ok" but not great.

However, if you're going for the 64GB version, the 64GB M4 Pro Mini is a better deal, and you're getting a better computer.
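That 10-12 tokens/s estimate lines up with a bandwidth-bound ceiling. The ~19GB model size below is an assumption for a 32B model at roughly Q4 with some overhead, not a measured figure:

```python
# Rough check of the 10-12 tokens/s figure for a 32B model on Strix Halo.
# Assumed: a 32B model at ~Q4 occupies ~19GB; decode must stream the whole
# model per token, so 256 GB/s bandwidth caps the rate at:
model_gb = 19        # assumed Q4-ish footprint of a 32B model
bandwidth_gb_s = 256 # Strix Halo memory bandwidth
ceiling = bandwidth_gb_s / model_gb
print(round(ceiling, 1))  # ~13.5 tok/s ceiling, so 10-12 observed is plausible
```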
 
Let's say the M5 Extreme starts at $8,000, i.e. 2x the Ultra (and the Ultra is 2x the price of the Max). Add 512GB of RAM for $4,000 based on existing RAM price scaling: $12K. Add an extra $3K for inflation and more advanced chip packaging: $15K. Now that would be a very interesting computer, and worthy of its price.
Well, $15K is way too much for most people.

Low-quant models are clearly worse than higher-quant ones.
And 64GB doesn't get anyone an interesting model in my testing.
(I use an M1 Max with 64GB of RAM. If that gets replaced by a Studio or M5 Max this year, I still don't see it getting 400GB of unified RAM.)

I don't know what speed the Framework demo was running at; I think it was maybe a Q4 model. Looking on Ollama, DeepSeek R1 671B Q4_K_M is 404GB, so they're likely running that model.

Is DeepSeek R1 671B Q4_K_M good enough? I haven't tested it, but the cost of running it seems doable. Individual queries are probably slow, but you can batch. (My M1 Max with 64GB cost over $6,000.)

There are also improvements coming for parallelism. For example, I noticed Huggingface says "Built-in Tensor Parallelism (TP) is now available with certain models using PyTorch" and "You can benefit from considerable speedups for inference, especially for inputs with large batch size or long sequences." They show some graphs of speedups for small models.

In my lab I have an AMD Threadripper PRO workstation that cost more than $20K, with two Nvidia 4090s. But I don't think I can run DeepSeek R1 671B Q4_K_M on it due to lack of RAM.
 
But $12K-14K for all the Framework machines isn't?
A 404GB model, so 4 machines. Framework is at the expensive end of laptops (and now mini-PCs), so one would only have to pay a fraction of $12K-14K, assuming (as I'm sure) the usual other mini-PC makers adopt the same CPU/128GB memory configuration. I already pointed out that GMKtec announced the same thing before Framework. Unlike the Mac ecosystem, the PC market is very price-competitive, and the cost of these machines will come down rapidly once there is enough supply.
 
A couple of notes on pricing for the Framework. There is also the 64GB model, which starts at the same price ($1,599) as the 24GB (full-chip) Mac Mini Pro, BUT doesn't come with an OS, fan, hard drive (though the drive upgrades are unsurprisingly far better priced; still, we'll compare base to base), front panels, etc. Depending on your options you can quickly get to the $1,900-$2,000 range: still a better deal than the M4 Mini Pro at $2,100 (and again much better depending on your drive choice), but not quite as good as it might first appear from a base price that doesn't actually include any of the above. And while more "value" brands may appear soon, this anchors price expectations a little differently. The same add-ons are required for the 128GB and 32GB models as well, so that needs to be taken into consideration when discussing price.
 
My Strix Halo vs. M4 CPU analysis, as always using NBC data with idle power subtracted out:

[Chart: CB R24 performance vs. power, Strix Halo and Apple M4-family CPUs]

Things are getting a bit scrunched up; I haven't had time to prune the data points (unfortunately Numbers isn't great here: change the order or remove a data point and the entire graph's style has to be redone).

With respect to the Strix Halo in the Asus ROG Flow (AI Max+ 395), I expected higher power/lower efficiency in ST, but I also thought there would be better performance than Strix Point. Instead it's the same performance with much worse efficiency. That's not totally unexpected given the larger SOC design, but this is substantially worse: it uses almost as much power in ST as the base M4 does in multicore. Also, this is with idle power removed, and the idle power here is substantial at 11.2W, all due to the SOC; there's no dGPU to blame, and it's a <14" laptop/tablet hybrid. With only one data point, I don't know whether that's a feature of the Halo's design or of Asus' implementation.

Similar story with multithreaded efficiency. The 14-core M4 Pro is over 40% more efficient for similar performance (side note: in this analysis I use the 14-core M4 Pro from the 14" MacBook Pro, which had slightly lower performance and slightly better efficiency than the 16" device reported in the Halo article). The Halo gets 41% more performance than the Strix Point for roughly the same power increase, where the HX 370 nets a score of 1166, and the Halo's MT efficiency improves dramatically as power draw drops. Unfortunately NBC did not measure wall power at each TDP. However, while TDP doesn't correspond well to wall power in an absolute sense, the Strix Point analysis, where they did measure both, seems to confirm that relative TDP values are proportional to wall power. So, assuming the same holds for the Halo, I can still use the TDP values to estimate wall power, taking 70W TDP as the default state in which wall power was measured. The Halo is well off the 14-core's efficiency at similar power and doesn't quite reach the 12-core M4 Pro's efficiency either, but it is much closer, only about 11% lower.

At first blush, this is not a bad result for AMD: Strix Point is only about 10% less efficient than the base M4, while the Halo is about 11% less efficient than the 12-core M4 Pro. However, it should be pointed out that these chips are massive compared to the Apple ones I'm comparing them to, and CB R24 is a nearly embarrassingly parallel MT test where SMT2 yields around a 25% increase in performance. Strix Point therefore effectively has 12 P-cores (4 Zen 5 AVX-256 cores + 8 Zen 5c cores) and roughly 15 performance threads equivalent, while the base M4 has only 4 P-cores + 6 E-cores for maybe 6 performance threads equivalent. Meanwhile, the Halo has 16 full Zen 5 AVX-512 cores for 20 performance threads equivalent, compared to the 12-core M4 Pro's 8 P-cores and 4 E-cores, roughly 9.3 performance threads equivalent. Things get even more stark when considering die size.

Based on annotated die shots, I've estimated that the Strix Point CPU is roughly the same size in mm^2 as the M4 Max's! (The M4 Max CPU size is estimated by scaling up the appropriate component sizes measured from the M4 die shot.) Meanwhile, the Halo is nearly double that size.

Here are my estimates:

SOC                     CPU area (mm^2)
Apple M4                27.0
Apple M4 12-core Pro    46.1
Apple M4 14-core Pro    52.2
Apple M4 Max            58.2
Strix Point             58.1
Strix Halo              109.5

A small note about what is included: SLC is NOT included for any SOC above, and neither are Apple's E/P-core AMX units (which add about 10%). It includes AMD's L1-L3 and Apple's L1 and L2. Of course, there is a major caveat here: Apple's chips are manufactured on TSMC's N3E while AMD's are on N4P (maybe N4X for the Halo, more on that below). Quantifying how much that matters, i.e. how many transistors each chip uses, is nearly impossible. We can try to use references (1 and 2) to estimate the density difference, but that is so dependent on the proportion of SRAM to logic that even knowing the transistor count of the whole die doesn't tell us much. SRAM isn't expected to change in density, but logic shrank substantially (although even blocks annotated as cache can contain logic; Apple's 16MB cache is 30% smaller than AMD's 16MB cache on the Halo). By my estimates, Apple's chips are anywhere from 15-40% denser than AMD's depending on which values you take, with TSMC themselves, I believe, quoting a rough 30% area decrease for N3E compared to N5 for their unknown reference chip (and N4P/X being <6% smaller than N5, since SRAM didn't change at all and logic decreased by 6%; although Granite Ridge is supposedly quite a bit denser than its N5 predecessor, so 🤷‍♂️).

So how does Strix Halo compare to AMD's Granite Ridge (desktop Ryzen)? My guess is they're actually manufactured on different processes. N4X and N4P are nearly identical, and if you look at die shots the two CPUs are exactly the same, but the two dies outside the CPU look ever so slightly different. Further, what N4X adds to N4P is a couple of extra transistor types for both extra-low power and higher power at the cost of more leakage.

Look at the desktop Ryzen 9950X: it can achieve a score of 2254 in CB R24. According to NBC it gets 7.18 pts/W, which, subtracting out 100W of idle power since it was connected to a 4090 at the time, means it used about 213W above idle to achieve that score. Meanwhile, Strix Halo's performance is already tapering at much lower wattages: going from 101W wall power to 117W (16% more) only yields 5% extra performance (the 117W point is off the chart). We saw Strix Point behave the same way, eventually maxing out at just over 1200 pts. Given how far apart 2254 and 1731 are, it's hard to see how the Halo could reach that even with double the power, given such small gains (which will keep shrinking as power increases). Now, this was measured in the tablet-hybrid form factor, and there is dispute online over whether Granite Ridge itself is N4X or N4P, so this isn't a sure thing. But with the slight die differences and the seemingly different performance characteristics, it seems reasonable to me that something differs between desktop Ryzen and Strix Halo, with the latter probably geared toward performance at lower power and the former toward absolute performance.
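The 9950X wattage estimate above falls straight out of the NBC figures; the 100W idle draw is the post's assumption for a system with an attached 4090:

```python
# Reproducing the 9950X wall-power estimate from the NBC-derived numbers.
score = 2254          # CB R24 multicore score
pts_per_watt = 7.18   # NBC whole-system efficiency figure
idle_w = 100          # assumed idle draw with the attached RTX 4090
wall_w = score / pts_per_watt      # total wall power during the run
above_idle = wall_w - idle_w
print(round(above_idle))  # ~214W, matching the ~213W quoted above
```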

Bottom line: Apple has a huge advantage in performance/Watt AND performance/area. Perhaps that's not hugely surprising, since the power used should be proportional to the amount of silicon die area turned on, but still, AMD needs far bigger CPUs to compete with Apple. This probably translates into increased profit margins for Apple (with the caveat that Apple probably pays a lot more per die by being first on the most advanced node), especially since they package their chips themselves without an OEM middleman. This ability to build smaller but still performant CPUs is also probably one factor that lets them more cost-effectively build something like the M4 Max (and the Ultra, or whatever Hidra ends up being), whereas the Strix Halo, for all its size, is closer to an M4 Pro competitor. It isn't clear, beyond consoles, whether there is an appetite in the PC space for larger SOCs, and indeed the Strix Halo itself is as yet unproven in the marketplace.
 
Blender results posted:

[Chart: Blender benchmark results]


Not great, but not unexpected given RDNA 3 and 3.5's previous Blender results. RDNA 4 should improve things, though it isn't clear when RDNA 4 will come to mobile.
 