Yeah, more benchmarking BS. They're comparing a 12-core processor (AMD Ryzen AI 9 HX 370) to an 8-core (the M3 in the Air), using MC CPU workloads. E.g., for GB6 (SC/MC), the AMD is 2,879/14,888, while the 8-core M3 is 3,082/12,087 (all scores taken from tomshardware.com for consistency), so of course they're going to report the MC scores, and conveniently omit the difference in SC performance.
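To make the cherry-picking concrete, here's the arithmetic on the GB6 scores quoted above (just ratios of the quoted numbers, nothing new):

```javascript
// Geekbench 6 scores quoted above (tomshardware.com): SC and MC
const amd = { sc: 2879, mc: 14888 }; // Ryzen AI 9 HX 370, 12 cores
const m3  = { sc: 3082, mc: 12087 }; // Apple M3, 8 cores (4P+4E)

console.log((m3.sc / amd.sc).toFixed(2)); // 1.07 -> M3 is ~7% faster single-core
console.log((amd.mc / m3.mc).toFixed(2)); // 1.23 -> AMD is ~23% faster multi-core
```

So the chart only shows the ~23% column and never the ~7% one.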

And what kind of apps are people most commonly using on thin&lights? That's right: SC.
Even worse when you account for the fact that four of the M3's eight cores are efficiency cores (i.e., comparing a 4P+4E CPU to a 12C/24T CPU).
 
You missed my point. The chart isn't BS because they compared an 8-core and 12-core processor. It's BS because they showed the kinds of tasks on which the AMD is faster (MC), but conveniently omitted those on which the M3 is faster (SC)—and it's particularly BS b/c the overwhelming majority of apps used on thin&lights are SC rather than MC.

And it's not just that they showed MC rather than SC. They additionally concealed the fact that the scores were MC by labelling them as, e.g., "Geekbench - CPU Score" rather than "Geekbench - MC CPU Score".

Now you might argue that mfrs. do that all the time: they show the scores on which their devices do best. If it were just a matter of cherry-picking certain apps, that would be expected. But to omit an entire class of tasks, when that's the most common class used on that category of device, and to conceal that, is an entirely different level of marketing BS.
Another way to say this is to note the lack of appropriate web benchmarks...
Where web benchmarks (and related nonsense like Electron apps) are, as you say, what many people will be running on this sort of device.
 
Another way to say this is to note the lack of appropriate web benchmarks...
Where web benchmarks (and related nonsense like Electron apps) are, as you say, what many people will be running on this sort of device.
I was thinking of desktop apps, rather than web-based apps. But I suppose if you had a web-based app where a lot of the computation was done on the client side, then that might be interesting to benchmark as well.
 
I was thinking of desktop apps, rather than web-based apps. But I suppose if you had a web-based app where a lot of the computation was done on the client side, then that might be interesting to benchmark as well.
Well theoretically a well designed web-based app should be using worker threads. But a well designed web-based app is kind of an oxymoron.
 
What does "worker threads" mean?
Probably Web Workers.
 
Doesn't the tech press use the same video games to make historical comparisons?

AMD is using games with built-in benchmark tools for convenience, but that doesn't make them accurate on the Mac. They should use Lies of P, Death Stranding, or Resident Evil, but accuracy is often not a priority in marketing material.
 
Probably Web Workers.
Exactly.
 
Reviews of AMD's latest SoC have been released. While AMD's new SoC seems to have taken a big leap in the right direction, Apple's M4 still outperforms it.

[benchmark chart images from the review]

 
Enticing as a 45 to 55W MiniPC with 128GB unified memory, if they don't overprice it. It will have access to a larger software ecosystem and full Linux support, and won't be weighed down by compatibility-layer overhead.

 
Enticing as a 45 to 55W MiniPC with 128GB unified memory, if they don't overprice it. It will have access to a larger software ecosystem and full Linux support, and won't be weighed down by compatibility-layer overhead.
Would like to see a head-to-head token/sec match vs the M4 Max 128GB on the largest LLM model that will fit in 96GB (max to CPU?)
 
Would like to see a head-to-head token/sec match vs the M4 Max 128GB on the largest LLM model that will fit in 96GB (max to CPU?)
96GB is the max to the GPU for the Halo. The M4 Max would probably win that rather easily, as the Halo's bandwidth is less than the M4 Pro's, and the Pro generally beats the Halo in the AI workloads measured. You can see that in the video review @Xiao_Xi posted above. So an M4 Max, with double the bandwidth of the Pro, would be substantially faster.
 
96GB is the max to the GPU for the Halo. The M4 Max would probably win that rather easily, as the Halo's bandwidth is less than the M4 Pro's, and the Pro generally beats the Halo in the AI workloads measured. You can see that in the video review @Xiao_Xi posted above. So an M4 Max, with double the bandwidth of the Pro, would be substantially faster.
I meant to type 96GB max to GPU (not CPU). On macOS, sysctl iogpu.wired_limit_mb can be adjusted a bit to give more than the default 75% for large RAM sizes. That's 96GB for 128GB M4 Max, but I don't know the maximum permissible, e.g. could we have 120GB for GPU and reserve 8GB for CPU on macOS?
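The numbers work out like this (the `iogpu.wired_limit_mb` sysctl key is from the post above; whether macOS actually accepts a value as high as 120GB is exactly the open question, so treat the override line as hypothetical):

```javascript
const totalGB = 128;                                      // M4 Max config discussed
const defaultLimitMB = Math.floor(totalGB * 1024 * 0.75); // macOS default: ~75% of RAM to GPU
console.log(defaultLimitMB);                              // 98304 MB = the 96GB default

// Hypothetical override leaving 8GB for the CPU; untested whether
// macOS honors a limit this high.
const desiredMB = 120 * 1024;
console.log(`sudo sysctl iogpu.wired_limit_mb=${desiredMB}`);
```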
 
I meant to type 96GB max to GPU (not CPU). On macOS, sysctl iogpu.wired_limit_mb can be adjusted a bit to give more than the default 75% for large RAM sizes. That's 96GB for 128GB M4 Max, but I don't know the maximum permissible, e.g. could we have 120GB for GPU and reserve 8GB for CPU on macOS?
You can go over the 75% recommended limit for macOS. However, @leman did the actual tests and I'm just going to quote him directly rather than filter it through me:

There are some caveats. I just did a quick test to make sure.

- There is a practical limit to a single buffer size, around 20GB on my system. You only get an error when you try to use it, not at allocation time
- The working size of resident GPU memory cannot exceed the total RAM amount. E.g. if you attempt to access 32GB worth of buffers in a single compute pass, you will get an "out of memory" runtime failure
- You *can* however use more than total RAM worth of buffers in separate compute passes. I had no problem writing to 80GB worth of buffers on my 36GB machine as long as I did not bind more than 27GB worth of data per pass. I suppose the system will swap between passes as needed.

Overall, it appears that the actively accessed GPU data needs to be resident in RAM at all times. The system will swap data in and out as needed. The GPU command submission serves as a boundary for residency management. But it does not seem as if the GPU can currently interrupt execution to swap data in and out. And even if it can, it won't (which is understandable; the performance would be very bad).

There are also sparse resources, which might add an additional dimension to all this, but I don't have experience working with them.

P.S. I also managed to hard freeze my laptop by trying to sequentially process multiple 20+GB buffers. I suppose it hit a slow path in the GPU firmware and the machine reset after a timeout. It was completely unresponsive for a minute or two. I previously had a similar experience experimenting with the texture count limits. It seems like Apple engineers don't expect one to do dumb stuff like that (and who would blame them).
 
Enticing as a 45 to 55W MiniPC with 128GB unified memory, if they don't overprice it. It will have access to a larger software ecosystem and full Linux support, and won't be weighed down by compatibility-layer overhead.
You can get an RTX 4060 laptop with an AMD CPU for $1,000 today. Why would anyone pay double that for similar performance?

Because it has 256GB/s of bandwidth?

Why would local LLM people want 128GB of unified memory when it has only 256GB/s of bandwidth? It'd be painfully slow for models that require that much RAM - practically useless. If you load a 70B model on it, it'd manage a maximum of about 3 tokens/s, without accounting for other bottlenecks and overhead. That's torture to use.
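That ~3 tokens/s figure checks out as a rough memory-bound ceiling if you assume a 70B model quantized to 8-bit (~70GB of weights) and that each generated token must stream the full weight set once (a simplification that ignores the KV cache and any overlap):

```javascript
const bandwidthGBs = 256; // Strix Halo memory bandwidth
const weightsGB = 70;     // 70B params at 8-bit ≈ 70 GB of weights

// Memory-bound ceiling: one full pass over the weights per generated token.
const maxTokensPerSec = bandwidthGBs / weightsGB;
console.log(maxTokensPerSec.toFixed(1)); // ≈ 3.7, before any other overhead
```

Real-world throughput lands below this ceiling, which is why ~3 tokens/s is a fair estimate.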

I think this laptop would be great if it were $1,500, not $2,100 and certainly not $2,800.

Right now, this SoC does not have product market fit. It gets beat by much cheaper RTX laptops in gaming. It gets beat by Apple in performance and efficiency. I can't think of a market for this. If there is one, it must be very small.
 
You can get an RTX 4060 laptop with an AMD CPU for $1,000 today. Why would anyone pay double that for similar performance?

Because it has 256GB/s of bandwidth?

Why would local LLM people want 128GB of unified memory when it has only 256GB/s of bandwidth? It'd be painfully slow for models that require that much RAM - practically useless. If you load a 70B model on it, it'd manage a maximum of about 3 tokens/s, without accounting for other bottlenecks and overhead. That's torture to use.

It is unlikely that more bandwidth would result in better performance: the chip's bandwidth-to-compute ratio for ML is already better than a 4090's. For mixed-precision inference, these SKUs should work well enough.

I think this laptop would be great if it were $1,500, not $2,100 and certainly not $2,800.

Right now, this SoC does not have product market fit. It gets beat by much cheaper RTX laptops in gaming. It gets beat by Apple in performance and efficiency. I can't think of a market for this. If there is one, it must be very small.

This I agree with. There are better products in almost every category for less money. And if you need CPU performance, a Mac is a better deal.
 
Promising: only a $500 premium to upgrade from 32GB to 128GB of RAM, vs. Apple's $1K to go from 48GB to 128GB.

https://rog.asus.com/us/laptops/rog-flow/rog-flow-z13-2025/spec/
But the memory is slower, so for 96 usable GB on the AMD you're limited by the ~256GB/s bus? In newer models, a lot of optimization goes into minimizing memory transfers, since TOPS aren't limiting performance anymore (in most cases).

The M4 Ultra would for sure be a lot better, if we get it soon, and the M4 Max is better right now. The NPU on the AMD (same as Intel's) seems to be faster than Apple's, but how easy it is to utilize, or whether the stated performance is truthful, I have no idea.

As for Apple, you are limited to static graphs, so LLM workloads are pretty limited if you want to utilize the ANE (you need autoregressive prediction in the decoder part; there is a new API in macOS 15+, but I didn't have any luck making it faster than good Metal kernels on Pro+ devices).
 