[CPU only] Apple M1/2(Max/Ultra) (TSMC 5nm) vs AMD Zen 4 (TSMC 5nm) - Technical Analysis

quarkysg · Oct 16, 2022

name99 said:
The very claim being made reveals ignorance.

What is it that is being claimed to fit in L2/L3 caches? Instructions or data?
It is trivial to scale-up a benchmark to ensure that the data footprint is larger than L3, and SPEC definitely does this. GB probably tries to, but GB5 may be behind the curve and we may need to wait for GB6.

It's much harder to scale up a benchmark to ensure that the instruction footprint exceeds cache size. Neither SPEC nor GB do this. The professional code bases that do this are larger server databases, and the academic papers that investigate this tend to look and feel rather different from papers that investigate data issues like data prefetch because they are different communities. About the closest you can get to testing large I-footprint performance without seriously hard work is using browser benchmarks that test a wide variety of functionality.

But ultimately the question is: do you want to make tribal claims, or do you want to understand? Because if you want understanding, then simplistic measurements of the size of caches are not helpful. Substantially more important than these raw sizes, in any modern machine, are how well prefetching works (on both I and D-side) and how well the branch predictor machinery works in the face of "not recently executed" code (on the I side). It does you no good to be able to get to new I code rapidly if you're then constantly flushing in the face of each newly encountered branch... There are multiple ways to handle this (better static predictors, L2 branch predictor, analysis of code for branches before it is even executed) and all of those probably matter more than the trivial question of I-footprint size compared to cache sizes.

Modern compilers would have taken care of the nature of each architecture's nuances that it should be fairly optimised generating efficient codes. Most benchmarks should be run in fairly tight loops such that the I and D L1 caches should not be thrasing a lot when the benchmarks loops are running. And the AS SLC should be large enough to fit most dataset used in most benchmarks. Going back to RAM for dataset to process during a benchmark run should be fairly infrequent, except for initial load.

For example, the M1 and M1 Pro/Max/Ultra GB5 benchmarks appears identical for ST, despite the M1 Pro/Max/Ultra using LPDDR5 which is 33% faster than the LPDDR4X used in the M1.

I particularly like the fluid dynamic benchmark (posted in an earlier post in this thread) that shows almost linear scaling for the M1 Ultra. So it appears that there is a lot of performance left on the table for even the M1 SoC when it comes to currently available software.

name99 · Oct 17, 2022

quarkysg said:
Modern compilers would have taken care of the nature of each architecture's nuances that it should be fairly optimised generating efficient codes. Most benchmarks should be run in fairly tight loops such that the I and D L1 caches should not be thrasing a lot when the benchmarks loops are running. And the AS SLC should be large enough to fit most dataset used in most benchmarks. Going back to RAM for dataset to process during a benchmark run should be fairly infrequent, except for initial load.

For example, the M1 and M1 Pro/Max/Ultra GB5 benchmarks appears identical for ST, despite the M1 Pro/Max/Ultra using LPDDR5 which is 33% faster than the LPDDR4X used in the M1.

I particularly like the fluid dynamic benchmark (posted in an earlier post in this thread) that shows almost linear scaling for the M1 Ultra. So it appears that there is a lot of performance left on the table for even the M1 SoC when it comes to currently available software.

Oh, my dude. Thanks for the school lesson.

Perhaps, if you want to further your studies you might want to read the 1000+ pages on this subject I wrote at:

GitHub - name99-org/AArch64-Explore

Contribute to name99-org/AArch64-Explore development by creating an account on GitHub.

github.com

mi7chy · Oct 17, 2022

quarkysg said:
I particularly like the fluid dynamic benchmark

Better to use OpenFOAM since it's the leading open source CFD software rather than something like USM3D that's locked down behind government use restriction. That way others can replicate to see if nothing silly is happening like running within L3 cache.

https://openbenchmarking.org/test/pts/openfoam

https://openfoamwiki.net/index.php/Benchmarks

quarkysg · Oct 17, 2022

name99 said:
Oh, my dude. Thanks for the school lesson.

Perhaps, if you want to further your studies you might want to read the 1000+ pages on this subject I wrote at:

GitHub - name99-org/AArch64-Explore

Contribute to name99-org/AArch64-Explore development by creating an account on GitHub.

github.com

Not my intention dude. Not qualified to teach on this subject. Just trying to participate in a friendly discussion in an enthusiasts forum. Must have touched a nerve or something with you.

But it does seem to me that you're the one trying to school me tho.

quarkysg · Oct 17, 2022

mi7chy said:
Better to use OpenFOAM since it's the leading open source CFD software rather than something like USM3D that's locked down behind government use restriction. That way others can replicate to see if nothing silly is happening like running within L3 cache.

https://openbenchmarking.org/test/pts/openfoam

https://openfoamwiki.net/index.php/Benchmarks

Well, I would think the Xeon CPU has a large L3 that's comparable to the M1 Ultra? I guess I better stop spewing nonsense about CPU caches or I will get schooled again.

Use whatever benchmark that pleases you man.

roach1245 · Oct 20, 2022

Ran a simple benchmark on a new Ryzen 7950x desktop build (64GB RAM) here in the lab (the build will be returned to the supplier) vs my M1 Max laptop (64GB RAM).

Task: Take about 1000 parquet files (10.6GB total) and append them into 1 file (> 400 million observations).

Hypothesis: The Ryzen 7950x should be way faster - at first thought - because it has 16 cores (versus 8 M1 Max performance cores) that are also clocked way higher.

Result: They are equally as fast because the Ryzen CPU is bottlenecked by memory bandwidth (very fast cores but just 2 memory channels on the CPU).

The files:

The task is most efficiently done in parallel using all cores available, used both polars (Rust ) and pandas-modin (C++) to do this as fast as possible.

When using all 8 performance cores on my M1 Max, memory bandwidth to CPU is at about 120 GB/s (theoretical max is 200Gb/s).

Yet the Ryzen 7950x can do 81 GB/s memory bandwidth at most as the memory runs at 5200MT/s (* 8 bytes * 2 memory channels)/1024 = 81.25 GB/s (you can stretch this to about 100 GB/s if you heavily overclock). Thus despite the 7950x's 16 faster cores it's as fast as my M1 Max with 10 cores in this task because about 6 Ryzen cores are enough to reach that 81GB/s of bandwidth. The other 10 cores are starved from input and just idling.

This is not new; others have ran similar tasks with similar results. E.g. https://tlkh.dev/benchmarking-the-apple-m1-max who finds that

"... adding more cores on the 5600X does not help (2 cores are enough to maximize memory bandwidth), while 10 cores on the M1 Max is the optimal configuration".

Or what was referred to earlier: http://hrtapps.com/blogs/20220427/.

The M1 Ultra has 20 cores and 400GB/s of memory bandwidth and thus runs way faster than the Ryzen 7950X as none of its 20 cores are starved. This is even more so when the Ryzen 7950X is decked out with 128GB of DDR5 RAM instead (4 DIMM slots) and therefore runs at a slower 3600 MT/s instead which is a meager 56.25 GB/s memory bandwidth. 5 Ryzen cores can fully consume that; the other 11 cores will just idle.

The new Intel Raptor Lake CPUs also only have 2 memory channels and top out at about 100GB/s max memory bandwidth as well so there won't be a difference here.

So just a heads up: the new Ryzen/Intel CPUs are good for gaming and workflows which are not so much memory dependent, but if you're doing data analysis or other scientific HPC work of some sort that is CPU-and-memory bound (thus not GPU machine learning) you'll very quickly run into memory bandwidth limits. Then you better stick to Apple's M1 / M2 chips or the AMD / Intel CPUs with more than 2 memory channels and thus more memory bandwidth (which are also way more expensive, e.g. the AMD ThreadRipper Pro 5965WX with 26 cores and 8 memory channels at 200GB/s memory bandwidth max for which you have to pay $2400 just for the chip itself and $1000 for a compatible motherboard).

mi7chy · Oct 20, 2022

Parquet is used with Hadoop which is for distributed systems. Unlikely anyone will run it on a laptop with only WIFI and limited to 64GB that costs more than a Threadripper Pro with 8 memory channels and 10Gb ethernet.

$3500 $14" Macbook Pro M1 Max 10CPU 24GPU 64GB 1TB

<$3000 AMD 5955WX 16-core, Supermicro MBD-M12SWA-TF-O, Hynix 64GB DDR4-3200 ECC Registered (8GB x 8 DIMMs), Samsung 980 Pro 1TB SSD

Even ~$1K Intel 12600K budget system at half the cost is faster than MBP M1 Pro on OpenFOAM.

12th Gen Intel(R) Core(TM) i5-12600 DDR5 @ 6000

# cores Wall time (s):
------------------------
1 399.94
2 213.75
4 131.87
6 107.09

https://github.com/gerlero/openfoam-app

14in MacBook Pro base model (M1 Pro, 6P+2E, 16GB RAM), macOS 12.5.1, Apple clang 13.1.6, OpenFOAM-v2206

# cores Wall time (s):
------------------------
1 463.91
2 249.77
4 141.38
6 108.45
8 143.51

roach1245 · Oct 20, 2022

mi7chy said:
Parquet is used with Hadoop which is for distributed systems. Unlikely anyone will run it on a laptop with only WIFI and limited to 64GB that costs more than a Threadripper Pro with 8 memory channels and 10Gb ethernet.

$3500 $14" Macbook Pro M1 Max 10CPU 24GPU 64GB 1TB

<$3000 AMD 5955WX 16-core, Supermicro MBD-M12SWA-TF-O, Hynix 64GB DDR4-3200 ECC Registered (8GB x 8 DIMMs), Samsung 980 Pro 1TB SSD

Even ~$1K Intel 12600K budget system at half the cost is faster than MBP M1 Pro on OpenFOAM.

https://github.com/gerlero/openfoam-app

Sure, you can spec out a $3000 server with just a CPU + memory + SSD that needs to run in a rack in a freezing datacenter at >350W that will perform equally to a $3500 M1 Max laptop which also has a top GPU, a monitor and so on (which says a lot about the laptop's quality).

Correction: Actually you cannot. That 5955WX server also needs a motherboard which you didn't include and the WRX80Pro motherboard compatible with the Threadripper Pros (the only one that is) costs $1000. So that'd be a $4000 server.

(Note: An equally priced Mac Studio with the M1 Ultra with double the cores / memory bandwidth will run circles around that server).

.parquet files are just an efficient way to store tabular data (compared to e.g. .csv / .xls) - Hadoop / WiFi / Ethernet are totally irrelevant to loading such files into memory and analyzing those. Besides, the results are the same when using the equivalent .csv files.

The M1 machines are tremendously useful for data analysis where connecting to servers is unnecessary (as long as your data can fit into 64GB of memory - which I bet satisfies 99% of use cases, or 128GB in case of the M1 Ultra in the Mac Studio).

(P.S.: regarding these OpenFoam benchmarks - would be glad to see the original source, they seem very inaccurate - the Github repo you link to doesn't provide any such numbers.)

mi7chy · Oct 20, 2022

It's actually ~$2500 minus power supply and case which many have extras lying around.

$1296 AMD 5955WX 16-core
$720 Supermicro MBD-M12SWA-TF-O
$360 Hynix 64GB DDR4-3200 ECC Registered (8 x HMA81GR7CJR8n-XN T4)
$101 Samsung 970 EVO Plus 1TB SSD
$90 Noctua NH-U14S

Don't even need latest and greatest hardware, though, since previous gen have better bang for the buck. New Epyc 7532 are going for just north of $1K. Here's a benchmark for 2x 7532 but even 1x is plenty with near perfect scaling vs results so far from Apple silicon. Install OpenFOAM for MacOS linked above and post some numbers otherwise it means nothing. Better yet, wait for results on cfd-online forum since they're credible and knowledgeable.

2x AMD EPYC 7532 (Zen2-Rome) 32-Core CPU, 200W, 2.4GHz, 256MB L3 Cache, DDR4-3200
RAM: 256GB (16x 16GB) DDR4-3200 DIMM, REG, ECC, 2R

cores time (s) speedup
1 677,34 1,00
2 363,04 1,87
4 161,42 4,20
6 101,82 6,65
8 77,16 8,78
12 52,28 12,96
16 39,4 17,19
20 32,01 21,16
24 27,31 24,80
28 24,15 28,05
32 21,53 31,46
36 21,32 31,77
40 20,46 33,11
44 18,99 35,67
48 18,12 37,38
52 17,45 38,82
56 17,06 39,70
60 16,5 41,05
64 15,91 42,57

roach1245 · Oct 20, 2022

mi7chy said:
It's actually ~$2500 minus power supply and case which many have extras lying around.

$1296 AMD 5955WX 16-core
$720 Supermicro MBD-M12SWA-TF-O
$360 Hynix 64GB DDR4-3200 ECC Registered (8 x HMA81GR7CJR8n-XN T4)
$101 Samsung 970 EVO Plus 1TB SSD
$90 Noctua NH-U14S

Don't even need latest and greatest hardware, though, since previous gen have better bang for the buck. New Epyc 7532 are going for just north of $1K. Here's a benchmark for 2x 7532 but even 1x is plenty with near perfect scaling vs results so far from Apple silicon. Install OpenFOAM for MacOS linked above and post some numbers otherwise it means nothing. Better yet, wait for results on cfd-online forum since they're credible and knowledgeable.

You'd have to add another $500 for a case + power supply (a beefy one since that ThreadRipper needs a lot of power) but then point taken - you will indeed have a $3000 server (without a GPU) somewhere in a datacenter that performs similarly to the M1 Max laptop (or put it in your air-conditioned basement - since these servers run hot and make a lot of noise - and be prepared for a high electricity bill).

I'd prefer the laptop though.

mi7chy · Oct 20, 2022

CFD is usually run on HPC cluster. One single-CPU node on older Epyc 2nd gen is already 5x+ faster than MBP M1 Pro. Epyc 3rd gen is even faster and even more so with 3rd gen-X with fat cache and 4th gen-X is just about to be released. No one is going to waste their time running CFD on a laptop unless they want to get fired.

exoticSpice · Oct 20, 2022

mi7chy said:
It's actually ~$2500 minus power supply and case which many have extras lying around.

$1296 AMD 5955WX 16-core
$720 Supermicro MBD-M12SWA-TF-O
$360 Hynix 64GB DDR4-3200 ECC Registered (8 x HMA81GR7CJR8n-XN T4)
$101 Samsung 970 EVO Plus 1TB SSD
$90 Noctua NH-U14S

Don't even need latest and greatest hardware, though, since previous gen have better bang for the buck. New Epyc 7532 are going for just north of $1K. Here's a benchmark for 2x 7532 but even 1x is plenty with near perfect scaling vs results so far from Apple silicon. Install OpenFOAM for MacOS linked above and post some numbers otherwise it means nothing. Better yet, wait for results on cfd-online forum since they're credible and knowledgeable.

Oh Mi7chy, I don't server hardware that I cannot take me on my New Zealand holiday.

I can my 400GB/s M1 Max on a plane and use it there too. Tell me when Intel and AMD make 400GB/s laptops SoCs.

exoticSpice · Oct 20, 2022

mi7chy said:
CFD is usually run on HPC cluster. One single-CPU node on older Epyc 2nd gen is already 5x+ faster than MBP M1 Pro. Epyc 3rd gen is even faster and even more so with 3rd gen-X with fat cache and 4th gen-X is just about to be released. No one is going to waste their time running CFD on a laptop unless they want to get fired.

I am sorry, but the why the **** are you comparing server hardware to laptops???

Do you know much an Eypc costs it better to FAST!!!

exoticSpice · Oct 20, 2022

The point is I want AMD/Intel to make 400 - 800GB/s laptops or make 8 channel memory on mainstream MB's and not ONLY for server hardware.

roach1245 · Oct 20, 2022

mi7chy said:
CFD is usually run on HPC cluster. One single-CPU node on older Epyc 2nd gen is already 5x+ faster than MBP M1 Pro. Epyc 3rd gen is even faster and even more so with 3rd gen-X with fat cache and 4th gen-X is just about to be released. No one is going to waste their time running CFD on a laptop unless they want to get fired.

In the old days, yes, but the beauty of Apple Silicon is that you can run CFD on them as well - incredibly well because of the tremendously high memory bandwidth. Craig Hunter did a "real-world" empirical analysis on this, concluding that

"We know from Apple’s specs and marketing materials that the M1 Ultra has an extremely high 800 GB/sec memory bandwidth and an even faster 2.5 TB/sec interface between the two M1 Max chips that make up the M1 Ultra, and it shows in the CFD benchmark. This leads to a level of CPU performance scaling that I don’t even see on supercomputers, and is the result of a SoC (system on a chip) design that is really intended to accommodate the more demanding needs of the M1 Ultra's GPUs (a topic for a future review). The result is loads of breathing room for unfettered CPU performance."

Full post here: http://hrtapps.com/blogs/20220427/

quarkysg · Oct 20, 2022

So it seems now it is feasible to run CFD workloads on a cluster of Mac Studio Ultras, which probably cost a lot less to maintain than a traditional HPC cluster?

JouniS · Oct 20, 2022

roach1245 said:
So just a heads up: the new Ryzen/Intel CPUs are good for gaming and workflows which are not so much memory dependent, but if you're doing data analysis or other scientific HPC work of some sort that is CPU-and-memory bound (thus not GPU machine learning) you'll very quickly run into memory bandwidth limits.

There are different kinds of memory dependent workloads. I do some moderately memory-intensive work in bioinformatics, with typical jobs requiring tens of gigabytes and some jobs requiring hundreds of gigabytes of memory. Memory latency is usually more important than bandwidth. Which is kind of obvious – if you need that much memory, you are probably accessing it in more or less random fashion. Speed is largely determined by the number of CPU cores and some architectural features (such as the cleverness of out-of-order execution and the number of independent memory requests that can be served in parallel).

In the benchmarks I did for a recent paper, the M1 is 10-15% faster single-threaded than the i9-10910 used in the iMac 2020. Recent Xeons are about as fast, while older Xeons and Graviton2 are much slower. (I haven't tried Graviton3 yet.) Multi-threaded performance is harder to compare, as some systems have insufficient memory, others have insufficient cooling, and yet others have insufficient memory bandwidth. The one thing I'm fairly confident of is that Graviton2 scales better than older Xeons and the i9 in this kind of workloads.

falainber · Oct 20, 2022

exoticSpice said:
The point is I want AMD/Intel to make 400 - 800GB/s laptops or make 8 channel memory on mainstream MB's and not ONLY for server hardware.

You might be the only customer who wants such laptop. Not going to happen. Besides, what's the point of doing this type of work on a mobile device anyways?

pshufd · Oct 20, 2022

JouniS said:
There are different kinds of memory dependent workloads. I do some moderately memory-intensive work in bioinformatics, with typical jobs requiring tens of gigabytes and some jobs requiring hundreds of gigabytes of memory. Memory latency is usually more important than bandwidth. Which is kind of obvious – if you need that much memory, you are probably accessing it in more or less random fashion. Speed is largely determined by the number of CPU cores and some architectural features (such as the cleverness of out-of-order execution and the number of independent memory requests that can be served in parallel).

I have seen pipelines written using random access code where sorting and merging would be far more efficient but sort/merge would probably benefit from bandwidth over latency. The thing about bioinformatics is that you get big machines so you don't necessarily need the best algorithms to get the job done.

mi7chy · Oct 20, 2022

If you look at the OpenFOAM CFD ranking you'll notice a trend with CPU cache size being king. 768MB > 256MB > 128MB > 64MB. M1 Ultra has yet to make it on the list but it has 48MB. Maybe someone will take a break from Geekbench and run OpenFOAM for MacOS.

https://openbenchmarking.org/test/pts/openfoam&eval=959768c2bfe5424e10ba636ac9e0a8759326f7a1#metrics

Lower time is better
1x 7773X 64-core (768MB) 120 secs
1x 7763 64-core (256MB) 206 secs

1x 7532 32-core (256MB) 238 secs
1x 7542 32-core (128MB) 321 secs
1x 7601 32-core (64MB) 331 secs

Upcoming Epyc Genoa-X will have 1GB+ cache and 96-cores. That would make for a nice single-socket desktop supercomputer.

88240_01_amd-epyc-9184x-genoa-x-cpu-96c-192t-zen-4-on-5nm-1-1gb-of-cache_full.png

JouniS · Oct 21, 2022

pshufd said:
I have seen pipelines written using random access code where sorting and merging would be far more efficient but sort/merge would probably benefit from bandwidth over latency. The thing about bioinformatics is that you get big machines so you don't necessarily need the best algorithms to get the job done.

In my experience, bioinformatics tends to do more computation per unit of data than most other fields of computational science. Algorithms based on scanning/sorting/merging can often work directly from disk, because the bottleneck is computation, not bandwidth. And of course, many core problems are about assembling something from random fragments, which requires frequent lookups in large indexes.

sam_dean · Oct 21, 2022

Apple will be 1 or 2 TSMC die shrinks ahead of everyone else.

Largely thanks to iPhone chips that is the basis of Mac chips.

The chips of the iPhone 15 Pro Max with USB-C (hopefully 40Gbps) will probably be on 3nm node process.

Will the M2 Pro/Max/Ultra/Extreme be also on a 3nm node process too?

diamond.g · Oct 21, 2022

exoticSpice said:
The point is I want AMD/Intel to make 400 - 800GB/s laptops or make 8 channel memory on mainstream MB's and not ONLY for server hardware.

Sadly, Apple is the only vendor that actually does stuff like that. AMD/Intel sell parts to ODM/OEMs and none of them are asking for a system like that.

quarkysg · Oct 21, 2022

diamond.g said:
Sadly, Apple is the only vendor that actually does stuff like that. AMD/Intel sell parts to ODM/OEMs and none of them are asking for a system like that.

More likely it is AMD/Intel that is segmenting between server and consumer CPU solutions.

Xiao_Xi · Oct 28, 2022

Clock ramp comparison between some PC chips and M1.

Addendum: Clock Ramp on ADL, Zen 4, M1, and More

We recently released an article on how quickly CPUs increased their clocks from idle, and received criticism in that it didn’t include CPUs newer than Kaby Lake. That was because I just didn&…

chipsandcheese.com

[CPU only] Apple M1/2(Max/Ultra) (TSMC 5nm) vs AMD Zen 4 (TSMC 5nm) - Technical Analysis

macrumors 65816

macrumors 68030

Suspended

macrumors 65816

macrumors 65816

macrumors member

Suspended

macrumors member

Suspended

macrumors member

Suspended

Suspended

Suspended

Suspended

macrumors member

macrumors 65816

macrumors 6502a

macrumors 68040

macrumors G4

Suspended

macrumors 6502a

Suspended

macrumors G5

macrumors 65816

macrumors 68000

Our Staff