
leman

macrumors Core
Oct 14, 2008
19,520
19,670
Apple actually has its own set of instructions which run on top of the base ARM ISA (some of which have actually been incorporated into the ARM ISA itself). That's one of the key advantages of the license Apple (and Samsung) have compared to what Qualcomm and others have.

Key advantage? Hardly. Useful for some niche stuff? Sure.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
Key advantage? Hardly. Useful for some niche stuff? Sure.
AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once. I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, to make them almost as fast as a raw subroutine call, and probably some sort of notification dispatch acceleration – two tiny features that would be heavily used, hence improving performance quite a lot. If those are not in there, I would be surprised.
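Roughly, the hot path such a method-call instruction would target looks like this: a minimal C sketch of selector-cache dispatch, with all names made up for illustration (this is not Apple's actual runtime):

```c
/* Sketch of dynamic method dispatch: hash the selector into a small
 * per-class cache; only on a miss does the slow method-table walk run.
 * A hypothetical ISA assist would aim to make the hit path nearly as
 * cheap as a raw indirect call. */
#include <stdint.h>

typedef void (*IMP)(void *self);              /* method implementation */

typedef struct Class {
    struct {                                  /* direct-mapped cache   */
        uintptr_t selector;
        IMP       imp;
    } cache[64];
    IMP (*slow_lookup)(struct Class *cls, uintptr_t sel);
} Class;

IMP lookup_method(Class *cls, uintptr_t sel)
{
    unsigned slot = (unsigned)(sel & 63);     /* 1. hash the selector  */
    if (cls->cache[slot].selector == sel)     /* 2. hit: fast path     */
        return cls->cache[slot].imp;
    return cls->slow_lookup(cls, sel);        /* 3. miss: table walk   */
}
```

Even the hit path costs a hash, a load, and a compare before the indirect call, and that is the overhead dedicated hardware would try to hide.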
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once.

Sure, but that's OS-level functionality. I wouldn't count this as an advantage of custom ISA extensions because it's not something that software can take advantage of.

The custom Apple instructions I am aware of are some dedicated limited-precision integer instructions (which help with fast execution of JavaScript), AMX, and of course the custom x86 support features.
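For the JavaScript case, the operation being accelerated is ECMAScript's ToInt32: double to int32 with wrap-around rather than saturating semantics. A portable C rendering (hypothetical helper name; ARMv8.3's FJCVTZS does this in one instruction, while the baseline FCVTZS conversion saturates on overflow):

```c
/* ECMAScript ToInt32: truncate toward zero, reduce modulo 2^32,
 * reinterpret as signed. A plain (int32_t) cast compiles to a
 * saturating conversion on AArch64, so without a dedicated
 * instruction a JS engine runs fix-up code like this. */
#include <stdint.h>
#include <math.h>

int32_t js_to_int32(double d)
{
    if (!isfinite(d))
        return 0;                             /* NaN, +/-Inf -> 0      */
    double m = fmod(trunc(d), 4294967296.0);  /* now in (-2^32, 2^32)  */
    if (m < 0.0)
        m += 4294967296.0;                    /* shift into [0, 2^32)  */
    uint32_t u = (uint32_t)m;
    return (int32_t)u;                        /* two's-complement wrap */
}
```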

I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, to make them almost as fast as a raw subroutine call, and probably some sort of notification dispatch acceleration – two tiny features that would be heavily used, hence improving performance quite a lot. If those are not in there, I would be surprised.

They have dedicated branch prediction for method calls, but I don't remember there being a specialised instruction.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Does LLVM expose those Apple instructions?
AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once. I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, to make them almost as fast as a raw subroutine call, and probably some sort of notification dispatch acceleration – two tiny features that would be heavily used, hence improving performance quite a lot. If those are not in there, I would be surprised.
What's relevant to this discussion is not the compress/decompress instructions, nor the other Apple-specific instructions such as a NEON 53-bit multiply (essentially equivalent to an AVX-512 instruction, probably used by crypto, though not apparently usable by random code).


What IS relevant is this:
https://patents.google.com/patent/US20210240477A1
Indirect Branch Predictor Based on Register Operands

https://patents.google.com/patent/US11294684B2
Indirect branch predictor for dynamic indirect branches.

This is a way to speed up a common sort of interpreter that's basically a huge switch statement. The standard way of accelerating indirect branches does not work well in that situation, so the predictor for such a branch is trained differently.
This scheme is not exactly a new instruction, but it does require the interpreter to set a special register indicating the address of its large switch statement. This is apparently done by at least the Objective-C runtime and by the JS interpreter in Safari.
This work is apparently present in the A15 and later.
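To make the pattern concrete, this is the shape of interpreter the patents describe: a toy bytecode loop whose switch compiles down to a single indirect branch, with the target determined by the opcode just loaded into a register (the bytecode here is invented for illustration):

```c
/* Toy bytecode interpreter: one hot indirect branch (the switch)
 * whose target depends on the current opcode value, so last-target
 * prediction fails and a predictor keyed on the operand register
 * does much better. */
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

void run(const uint8_t *code)
{
    int64_t stack[256];
    int sp = 0;

    for (;;) {
        switch (*code++) {                    /* the indirect branch */
        case OP_PUSH:  stack[sp++] = *code++;           break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp];  break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp];  break;
        case OP_PRINT: printf("%lld\n", (long long)stack[sp-1]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void)
{
    /* computes and prints (2 + 3) * 4 */
    const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                             OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    run(prog);
    return 0;
}
```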

Other environments that use a large switch for an interpreter (Lua? Python? TeX? Mathematica?) might likewise be able to make use of this. So far Apple has not said anything, but maybe they will at some point provide some sort of API, or a compiler annotation, to let any app mark such a "designated switch statement". Or maybe the long-term plan is for the hardware to detect this situation by itself, so that it just works for any interpreter; in that case the A15 solution was always meant as a proof of concept that the modified training of the indirect branch worked as it was supposed to (i.e. part one of a two-part solution, the second part being the identification of "interpreter-like switch statements").
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Well, they did mention higher MT scores across the board, but if you mean more GB5-like behaviour, I doubt it.
It seems that John Poole, the developer of Geekbench, wrote this in a now-defunct Geekbench thread:
At the start of Geekbench 6 development in 2020, we collected benchmark results for client and workstation applications across various processors. We found that only some applications scale well past four cores. We also found that some applications exhibit negative scaling (where performance decreased as the number of threads increased). We concluded that, at some point, client applications experience diminishing returns with increased core counts due to the inability to use all available cores effectively. The investigation led us to believe that Geekbench 5 overstated multi-core performance for client applications.

One design goal for Geekbench 6 was to accurately reflect multi-core performance for client applications while not arbitrarily limiting workload scaling. We wanted to ensure the multithreading approaches used were reasonable and representative of how applications use multiple cores. We also wanted to ensure that no workloads exhibited excessive negative scaling.

To achieve this goal, we switched from the "separate task" approach to the "shared task" approach for multithreading in Geekbench 6.

The "separate task" approach parallelizes workloads by treating each thread as separate. Each thread processes a separate independent task. This approach scales well as there is very little thread-to-thread communication, and the available work scales with the number of threads. For example, a four-core system will have four copies, while a 64-core system will have 64 copies.

The "shared task" approach parallelizes workloads by having each thread process a single shared task. Given the increased inter-thread communication required to coordinate the work between threads, this approach may not scale as well as the "separate task" approach.

What the shared task is varies from workload to workload. For example, the Clang workload task compiles 96 source files, while the Horizon Detection workload task adjusts one 24MP image.

For client systems, the "shared task" approach is most representative of how most client (and workstation) applications exploit multiple cores, whereas the "separate task" model is more representative of how most server applications use multiple cores.

Some Geekbench 6 workloads will scale poorly, and others will scale well on high-end workstation systems. These results follow what we observed from real-world applications.
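To make the two models concrete, here is a minimal pthreads sketch (workload and names are placeholders, not Geekbench's code): a "separate task" thread runs its own full copy of the work, while "shared task" threads pull chunks of one job from a shared atomic counter.

```c
/* "Separate task" vs "shared task" multithreading, sketched with
 * pthreads. The shared model pays a coordination cost (here a single
 * atomic counter; real workloads share far more), which is why it
 * scales less than linearly. */
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 8
#define NCHUNKS  96                 /* e.g. 96 source files to compile */

void process_chunk(int i) { (void)i; /* ... real work ... */ }

/* Separate task: each thread does a full, independent copy of the
 * work, so the total amount of work grows with the thread count. */
void *separate_task(void *arg)
{
    (void)arg;
    for (int i = 0; i < NCHUNKS; i++)
        process_chunk(i);
    return NULL;
}

/* Shared task: all threads cooperate on one fixed-size job. */
atomic_int next_chunk;

void *shared_task(void *arg)
{
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_chunk, 1);
        if (i >= NCHUNKS)
            return NULL;
        process_chunk(i);
    }
}

int main(void)
{
    pthread_t t[NTHREADS];
    void *(*worker)(void *) = shared_task;    /* or separate_task */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```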
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Yesterday, AMD unveiled Zen 4C.
 

Longplays

Suspended
May 30, 2023
1,308
1,158
[image attachment: FyxbOK6aIAAjrDW.jpg]
 

dgdosen

macrumors 68030
Dec 13, 2003
2,817
1,463
Seattle
ELI5 question:

The RAM chips on the Ultra (and Max?) look more "square" than they do on a de-lidded M2 and M2 Pro (more rectangular).
Does Apple use different kinds of RAM chips across the line? Are there significant differences?
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
ELI5 question:

The RAM chips on the Ultra (and Max?) look more "square" than they do on a de-lidded M2 and M2 Pro (more rectangular).
Does Apple use different kinds of RAM chips across the line? Are there significant differences?
You're not looking at the bare RAM silicon; you're seeing its packaging - some kind of black polymer overmold. Inside each package there are likely multiple stacked RAM die. So the shape of the package doesn't necessarily tell you much about the squareness of the silicon inside.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Why does the inscription on the cover say "Intel Xeon"? Is this some sort of joke I don't get?
It's to point out that the Ultra is more or less the same size as current Xeons, which is kinda interesting.

(Both are to some extent convergent, using separate chiplets rather than ever larger chips, and with memory on the package, though the details differ a *lot*.)
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
For reference, that is 56 P-cores (112 threads), 105 MB "smart cache", no GPU, 8 DDR5 memory channels, and 350-420 W at 1.9-4.2 GHz.
Some also have HBM, but I don't know which model is which.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
It's to point out that the Ultra is more or less the same size as current Xeons, which is kinda interesting.
Not really, though. The M2 Ultra has 16P+8E cores, 96 MB L3, 76 GPU cores and 8 DRAM modules in the same size package; the Xeon has no GPU cores, no neural engine (though AVX2 does have targeted ops), no H2xx-type or other accelerators, and no DRAM. They are pretty dissimilar. It looks like the entire Ultra die is around a quarter the size of the Xeon die.
 