
leman

macrumors Core
Oct 14, 2008
19,520
19,670
Apple actually has its own set of instructions which run on top of the base ARM ISA (some of which have actually been incorporated into the ARM ISA itself). That's one of the key advantages of the license Apple (and Samsung) have compared to what Qualcomm and others have.

Key advantage? Hardly. Useful for some niche stuff? Sure.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
Key advantage? Hardly. Useful for some niche stuff? Sure.
AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once. I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, to make them almost as fast as a raw subroutine call, and probably some sort of notification dispatch acceleration – two tiny features that would be heavily used, hence improving performance quite a lot. If those are not in there, I would be surprised.
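Roughly, the hot path such a method-call instruction would target looks like this: a minimal C sketch of selector-cache dispatch, with all names made up for illustration (this is not Apple's actual runtime):

```c
/* Sketch of dynamic method dispatch: hash the selector into a small
 * per-class cache; only on a miss does the slow method-table walk run.
 * A hypothetical ISA assist would aim to make the hit path nearly as
 * cheap as a raw indirect call. */
#include <stdint.h>

typedef void (*IMP)(void *self);              /* method implementation */

typedef struct Class {
    struct {                                  /* direct-mapped cache   */
        uintptr_t selector;
        IMP       imp;
    } cache[64];
    IMP (*slow_lookup)(struct Class *cls, uintptr_t sel);
} Class;

IMP lookup_method(Class *cls, uintptr_t sel)
{
    unsigned slot = (unsigned)(sel & 63);     /* 1. hash the selector  */
    if (cls->cache[slot].selector == sel)     /* 2. hit: fast path     */
        return cls->cache[slot].imp;
    return cls->slow_lookup(cls, sel);        /* 3. miss: table walk   */
}
```

Even the hit path costs a hash, a load, and a compare before the indirect call, and that is the overhead dedicated hardware would try to hide.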
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once.

Sure, but that's OS-level functionality. I wouldn't count this as an advantage of custom ISA extensions because it's not something that software can take advantage of.

The custom Apple instructions I am aware of are some dedicated limited-precision integer instructions (which help with fast execution of JavaScript), AMX, and of course the custom x86 support features.
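For the JavaScript case, the operation being accelerated is ECMAScript's ToInt32: double to int32 with wrap-around rather than saturating semantics. A portable C rendering (hypothetical helper name; ARMv8.3's FJCVTZS does this in one instruction, while the baseline FCVTZS conversion saturates on overflow):

```c
/* ECMAScript ToInt32: truncate toward zero, reduce modulo 2^32,
 * reinterpret as signed. A plain (int32_t) cast compiles to a
 * saturating conversion on AArch64, so without a dedicated
 * instruction a JS engine runs fix-up code like this. */
#include <stdint.h>
#include <math.h>

int32_t js_to_int32(double d)
{
    if (!isfinite(d))
        return 0;                             /* NaN, +/-Inf -> 0      */
    double m = fmod(trunc(d), 4294967296.0);  /* now in (-2^32, 2^32)  */
    if (m < 0.0)
        m += 4294967296.0;                    /* shift into [0, 2^32)  */
    uint32_t u = (uint32_t)m;
    return (int32_t)u;                        /* two's-complement wrap */
}
```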

I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, to make them almost as fast as a raw subroutine call, and probably some sort of notification dispatch acceleration – two tiny features that would be heavily used, hence improving performance quite a lot. If those are not in there, I would be surprised.

They have dedicated branch prediction for method calls, but I don't remember there being a specialised instruction.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Does LLVM expose those Apple instructions?
AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once. I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, to make them almost as fast as a raw subroutine call, and probably some sort of notification dispatch acceleration – two tiny features that would be heavily used, hence improving performance quite a lot. If those are not in there, I would be surprised.
What's relevant to this discussion is not the compress/decompress instructions, nor the other Apple-specific instructions such as a NEON 53-bit multiply (essentially equivalent to an AVX-512 instruction, probably used by crypto, though not apparently usable by random code).


What IS relevant is this:
https://patents.google.com/patent/US20210240477A1
Indirect Branch Predictor Based on Register Operands

https://patents.google.com/patent/US11294684B2
Indirect branch predictor for dynamic indirect branches.

This is a way to speed up a common sort of interpreter that's basically a huge switch statement. The standard way of accelerating indirect branches does not work well in that situation, so the predictor for such a branch is trained differently.
This scheme is not exactly a new instruction, but it does require the interpreter to set a special register indicating the address of its large switch statement. This is apparently done by at least the Objective-C runtime and by the JS interpreter in Safari.
This work is apparently present in the A15 and later.
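To make the pattern concrete, this is the shape of interpreter the patents describe: a toy bytecode loop whose switch compiles down to a single indirect branch, with the target determined by the opcode just loaded into a register (the bytecode here is invented for illustration):

```c
/* Toy bytecode interpreter: one hot indirect branch (the switch)
 * whose target depends on the current opcode value, so last-target
 * prediction fails and a predictor keyed on the operand register
 * does much better. */
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

void run(const uint8_t *code)
{
    int64_t stack[256];
    int sp = 0;

    for (;;) {
        switch (*code++) {                    /* the indirect branch */
        case OP_PUSH:  stack[sp++] = *code++;           break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp];  break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp];  break;
        case OP_PRINT: printf("%lld\n", (long long)stack[sp-1]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void)
{
    /* computes and prints (2 + 3) * 4 */
    const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                             OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    run(prog);
    return 0;
}
```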

Other environments that use a large switch for an interpreter (Lua? Python? TeX? Mathematica?) might likewise be able to make use of this. So far Apple has not said anything, but maybe they will at some point provide some sort of API, or a compiler annotation, to let any app mark such a "designated switch statement". Or maybe the long-term plan is for the hardware to detect this situation by itself, so that it just works for any interpreter; in that case the A15 solution was always meant as a proof of concept that the modified training of the indirect branch worked as it was supposed to (i.e. part one of a two-part solution, the second part being the identification of "interpreter-like switch statements").
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Well, they did mention higher MT scores across the board, but if you mean more GB5-like behaviour, I doubt it.
It seems that John Poole, the developer of Geekbench, wrote this in a now-defunct Geekbench thread:
At the start of Geekbench 6 development in 2020, we collected benchmark results for client and workstation applications across various processors. We found that only some applications scale well past four cores. We also found that some applications exhibit negative scaling (where performance decreased as the number of threads increased). We concluded that, at some point, client applications experience diminishing returns with increased core counts due to the inability to use all available cores effectively. The investigation led us to believe that Geekbench 5 overstated multi-core performance for client applications.

One design goal for Geekbench 6 was to accurately reflect multi-core performance for client applications while not arbitrarily limiting workload scaling. We wanted to ensure the multithreading approaches used were reasonable and representative of how applications use multiple cores. We also wanted to ensure that no workloads exhibited excessive negative scaling.

To achieve this goal, we switched from the "separate task" approach to the "shared task" approach for multithreading in Geekbench 6.

The "separate task" approach parallelizes workloads by treating each thread as separate. Each thread processes a separate independent task. This approach scales well as there is very little thread-to-thread communication, and the available work scales with the number of threads. For example, a four-core system will have four copies, while a 64-core system will have 64 copies.

The "shared task" approach parallelizes workloads by having each thread process a single shared task. Given the increased inter-thread communication required to coordinate the work between threads, this approach may not scale as well as the "separate task" approach.

What the shared task is varies from workload to workload. For example, the Clang workload task compiles 96 source files, while the Horizon Detection workload task adjusts one 24MP image.

For client systems, the "shared task" approach is most representative of how most client (and workstation) applications exploit multiple cores, whereas the "separate task" model is more representative of how most server applications use multiple cores.

Some Geekbench 6 workloads will scale poorly, and others will scale well on high-end workstation systems. These results follow what we observed from real-world applications.
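To make the two models concrete, here is a minimal pthreads sketch (workload and names are placeholders, not Geekbench's code): a "separate task" thread runs its own full copy of the work, while "shared task" threads pull chunks of one job from a shared atomic counter.

```c
/* "Separate task" vs "shared task" multithreading, sketched with
 * pthreads. The shared model pays a coordination cost (here a single
 * atomic counter; real workloads share far more), which is why it
 * scales less than linearly. */
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 8
#define NCHUNKS  96                 /* e.g. 96 source files to compile */

void process_chunk(int i) { (void)i; /* ... real work ... */ }

/* Separate task: each thread does a full, independent copy of the
 * work, so the total amount of work grows with the thread count. */
void *separate_task(void *arg)
{
    (void)arg;
    for (int i = 0; i < NCHUNKS; i++)
        process_chunk(i);
    return NULL;
}

/* Shared task: all threads cooperate on one fixed-size job. */
atomic_int next_chunk;

void *shared_task(void *arg)
{
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_chunk, 1);
        if (i >= NCHUNKS)
            return NULL;
        process_chunk(i);
    }
}

int main(void)
{
    pthread_t t[NTHREADS];
    void *(*worker)(void *) = shared_task;    /* or separate_task */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```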
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Yesterday, AMD unveiled Zen 4C.
 

Longplays

Suspended
May 30, 2023
1,308
1,158
[image attachment: FyxbOK6aIAAjrDW.jpg]
 

dgdosen

macrumors 68030
Dec 13, 2003
2,817
1,463
Seattle
ELI5 question:

The RAM chips on the Ultra (and Max?) look more "square" than they do on a de-lidded M2 and M2 Pro (more rectangular).
Does Apple use different kinds of RAM chips across the line? Are there significant differences?
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
ELI5 question:

The RAM chips on the Ultra (and Max?) look more "square" than they do on a de-lidded M2 and M2 Pro (more rectangular).
Does Apple use different kinds of RAM chips across the line? Are there significant differences?
You're not looking at the bare RAM silicon; you're seeing its packaging - some kind of black polymer overmold. Inside each package there are likely multiple stacked RAM die. So the shape of the package doesn't necessarily tell you much about the squareness of the silicon inside.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Why does the inscription on the cover say "Intel Xeon"? Is this some sort of joke I don't get?
It's to point out that the Ultra is more or less the same size as current Xeons, which is kinda interesting.

(Both are to some extent convergent, using separate chiplets rather than ever larger chips, and with memory on the package, though the details differ a *lot*.)
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
For reference, that is 56 P-cores (112 threads), 105 MB "smart cache", no GPU, 8 DDR5 memory channels, and 350-420 W at 1.9-4.2 GHz.
Some also have HBM, but I don't know which model is which.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
It's to point out that the Ultra is more or less the same size as current Xeons, which is kinda interesting.
Not really, though. The M2 Ultra has 16P+8E cores, 96 MB L3, 76 GPU cores and 8 DRAM modules in the same size package; the Xeon has no GPU cores, no neural engine (though AVX2 does have targeted ops), no H2xx-type or other accelerators, and no DRAM. They are pretty dissimilar. It looks like the entire Ultra die is around a quarter the size of the Xeon die.
 