Apple actually has its own set of instructions which run on top of the base ARM ISA (some of which have actually been incorporated into the ARM ISA itself). That's one of the key advantages of the license Apple (and Samsung) have compared to what Qualcomm and others have.
> That's one of the key advantages of the license Apple (and Samsung) have compared to what Qualcomm and others have.

Key advantage? Hardly. Useful for some niche stuff? Sure.
> Key advantage? Hardly. Useful for some niche stuff? Sure.

AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once.

I am confident that there is at least one instruction, possibly more, specifically designed to streamline object-method calls, making them almost as fast as a raw subroutine call, and probably some sort of notification-dispatch acceleration – two tiny features that would be used heavily, hence improving performance quite a lot. If those are not in there, I would be surprised.
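As an aside on why method dispatch is tempting to accelerate in hardware: a message send is not a direct branch, it is a cache lookup that precedes one. A rough C++ caricature of the difference (all names here are hypothetical stand-ins, nothing like Apple's actual runtime):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Hypothetical stand-ins: IMP is a method implementation, Class holds
// a selector->IMP cache, Object points at its Class - roughly the
// shape of an Objective-C style message send.
using IMP = long (*)(void* self);
struct Class  { std::unordered_map<uint64_t, IMP> cache; };
struct Object { Class* isa; };

// Raw subroutine call: a single, easily predicted branch.
long raw_call(Object* obj, IMP fn) { return fn(obj); }

// Message send: load isa, hash the selector, probe the cache, and
// only then branch - several dependent loads on the critical path.
// That lookup is what a dedicated instruction would be shaving.
long msg_send(Object* obj, uint64_t selector) {
    return obj->isa->cache.at(selector)(obj);
}

int main() {
    Class cls;
    cls.cache[42] = [](void*) -> long { return 7; };
    Object obj{&cls};
    std::printf("%ld %ld\n", raw_call(&obj, cls.cache[42]),
                msg_send(&obj, 42));
}
```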
> Apple actually has its own set of instructions which run on top of the base ARM ISA (some of which have actually been incorporated into the ARM ISA itself).

Does LLVM expose those Apple instructions?
> AIUI, one pair of Apple-specific instructions performs data compression/decompression for a full page of memory at once.

What's relevant to this discussion is not the compress/decompress instructions; other Apple-specific instructions include a NEON 53b multiply (essentially equivalent to an AVX512 instruction, and probably used by crypto, though not apparently usable by random code).
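On the LLVM question above: instructions that made it into the official ISA are exposed (through codegen, ACLE intrinsics, or inline asm), while the undocumented Apple-only ones are, AFAIK, not exposed at all. A minimal inline-asm sketch using FJCVTZS, the ARMv8.3 "JavaScript convert" often cited as an instruction that migrated into the official ISA, assuming a toolchain targeting arm64 with v8.3 enabled:

```cpp
#include <cstdint>
#include <cstdio>

// FJCVTZS exists in the official ISA, so the assembler knows it.
// Sketch assumes an ARMv8.3+ target (e.g. clang --target=arm64-apple-macos);
// truly private Apple instructions have no such mnemonics or intrinsics.
static inline int32_t jcvt(double x) {
    int32_t r;
    __asm__("fjcvtzs %w0, %d1" : "=r"(r) : "w"(x));
    return r;
}

int main() {
    // Converts with JavaScript semantics (truncate, wrap modulo 2^32).
    std::printf("%d\n", jcvt(3.7));  // prints 3
}
```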
> I just cannot help it, that url always expands in my head to "Tom Shardware".
Literally same
Does Geekbench 6.1 show the performance of CPUs with a high number of cores better than Geekbench 6.0 in the multi-core benchmark?
Geekbench 6.1 - Geekbench Blog (www.geekbench.com)
> Does Geekbench 6.1 show the performance of CPUs with a high number of cores better than Geekbench 6.0 in the multi-core benchmark?

Well, they did mention higher MT scores across the board, but if you mean more GB5-like behaviour, I doubt it.

> Well, they did mention higher MT scores across the board…

It seems that John Poole, the developer of Geekbench, wrote this in a now-defunct Geekbench thread:
> At the start of Geekbench 6 development in 2020, we collected benchmark results for client and workstation applications across various processors. We found that only some applications scale well past four cores. We also found that some applications exhibit negative scaling (where performance decreased as the number of threads increased). We concluded that, at some point, client applications experience diminishing returns with increased core counts due to the inability to use all available cores effectively. The investigation led us to believe that Geekbench 5 overstated multi-core performance for client applications.

> One design goal for Geekbench 6 was to accurately reflect multi-core performance for client applications while not arbitrarily limiting workload scaling. We wanted to ensure the multithreading approaches used were reasonable and representative of how applications use multiple cores. We also wanted to ensure that no workloads exhibited excessive negative scaling.

> To achieve this goal, we switched from the "separate task" approach to the "shared task" approach for multithreading in Geekbench 6.

> The "separate task" approach parallelizes workloads by treating each thread as separate. Each thread processes a separate independent task. This approach scales well as there is very little thread-to-thread communication, and the available work scales with the number of threads. For example, a four-core system will have four copies, while a 64-core system will have 64 copies.

> The "shared task" approach parallelizes workloads by having each thread process a single shared task. Given the increased inter-thread communication required to coordinate the work between threads, this approach may not scale as well as the "separate task" approach.

> What the shared task is varies from workload to workload. For example, the Clang workload task compiles 96 source files, while the Horizon Detection workload task adjusts one 24MP image.

> For client systems, the "shared task" approach is most representative of how most client (and workstation) applications exploit multiple cores, whereas the "separate task" model is more representative of how most server applications use multiple cores.

> Some Geekbench 6 workloads will scale poorly, and others will scale well on high-end workstation systems. These results follow what we observed from real-world applications.
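To make the two models concrete, here is a toy C++ sketch of the difference (my own illustration, not Geekbench's actual code; the 96-item task size is borrowed from the Clang workload mentioned above):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Keeps the optimizer from discarding the simulated work.
static std::atomic<long> sink{0};

// Stand-in for one unit of benchmark work, e.g. compiling one file.
static long process_item(long item) {
    long acc = 0;
    for (long i = 0; i < 1000000; ++i) acc += (item * i) % 7;
    return acc;
}

// "Separate task" (GB5-style): every thread runs its own full copy of
// the task, so total work grows with the thread count and threads
// never coordinate - scaling looks nearly perfect.
void separate_task(unsigned threads) {
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([] {
            for (long item = 0; item < 96; ++item)
                sink.fetch_add(process_item(item), std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
}

// "Shared task" (GB6-style): one fixed task (96 items) is divided
// among all threads via a shared counter, so threads must coordinate
// and the work available per thread shrinks as cores are added.
void shared_task(unsigned threads) {
    std::atomic<long> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&next] {
            for (long item; (item = next.fetch_add(1)) < 96; )
                sink.fetch_add(process_item(item), std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
}

int main() { separate_task(4); shared_task(4); }
```

With the separate task, a 64-core machine simply gets 64 copies of the work; with the shared task, the same 96 items are spread thinner and thinner, which is why high-core-count parts stop showing GB5-style multi-core scaling.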
M2 Ultra has been delidded.

Yes, M2 Ultra inside the MacBook will be great.

Why does the inscription on the cover say "Intel Xeon"? Is this some sort of joke I don't get?
ELI5 question:
The RAM chips on the Ultra (and Max?) look more "square" than they do on a delidded M2 and M2 Pro (more rectangular).
Does Apple use different kinds of RAM chips over the line? Are there significant differences?

You're not looking at the bare RAM silicon; instead you're seeing its packaging - some kind of black polymer overmold. Inside each package there are likely multiple stacked RAM dies. So the shape of the package doesn't necessarily tell you much about the squareness of the silicon inside.
> Why does the inscription on the cover say "Intel Xeon"? Is this some sort of joke I don't get?

There's no joke, they just put a Sapphire Rapids Xeon on top for a size comparison. The delidded M2 Ultra is behind it, still attached to its PCB.
Thx for the heads-up!
> they just put a Sapphire Rapids Xeon on top for a size comparison

For reference, that is 56 P-cores (112 threads), 105 MB "smart cache", no GPU, 8 DDR5 memory channels, and 350~420 W at 1.9~4.2 GHz.
> Why does the inscription on the cover say "Intel Xeon"? Is this some sort of joke I don't get?

It's to point out that the Ultra is more or less the same size as current Xeons, which is kinda interesting.
> For reference, that is 56 P-cores (112 threads), 105 MB "smart cache", no GPU, 8 DDR5 memory channels…

Some also have HBM, but I don't know which model is which.
> It's to point out that the Ultra is more or less the same size as current Xeons, which is kinda interesting.

Not really, though. The M2 Ultra has 16P+8E cores, 96MB L3, 76 GPU cores, and 8 DRAM modules in the same-size package; the Xeon has no GPU cores, no neural engine (though AVX2 does have targeted ops), no H.26x-type or other accelerators, and no DRAM. They are pretty dissimilar. It looks like the entire Ultra die is around a quarter the size of the Xeon die.