
sunny5

macrumors 68000
Original poster
Jun 11, 2021
1,837
1,706
I think I saw a post or article saying that Cinebench is more optimized for x86 than for ARM, so testing Apple Silicon with it is meaningless and pointless. But why exactly? Any credible articles or links?
 
  • Haha
Reactions: mi7chy

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
The next Cinebench release should be fairer, because Embree already uses NEON instructions.

I wonder if Apple could step in and improve Embree's support for Apple Silicon.
 
  • Like
Reactions: theorist9

leman

macrumors Core
Oct 14, 2008
19,521
19,674
It's not that it's meaningless or pointless, it's just that using Cinebench will make x86 appear much faster than it actually is in the majority of real-world applications. The reasons have been discussed multiple times:

- Cinebench uses Embree, a ray-tracing library written and maintained by Intel, which is hand-optimised for x86 SIMD. ARM SIMD support is provided through a compatibility header that implements each x86 SIMD instruction as one or more ARM instructions (a sketch of this mapping follows the list). This approach does not always result in the best possible code on the ARM platform. Apple recently submitted a series of patches that are supposed to improve performance on Apple Silicon by around 8%, but the result is likely still not optimal.

- Cinebench appears to rely on long dependent computation chains, that is, it has low ILP (instruction-level parallelism). This is clear from inspecting the CPU counters for retired instructions per cycle, which are significantly lower for Cinebench than for other common code (I don't remember the exact numbers off the top of my head). This is not surprising since, again, it uses an x86-optimised library that aims to saturate the SIMD units present on x86 CPUs, so it is coded around assumptions about the latency and throughput of those units. Apple Silicon, on the other hand, relies on a very wide execution backend that executes many instructions from the same thread simultaneously, and it can't really do that running Cinebench (the toy loops after this list illustrate the difference).

- The previous point is also the reason why SMT (Hyper-Threading) is so effective in Cinebench. Since SMT tries to increase the utilisation of the execution units by feeding them operations from different threads, it excels in situations where ILP is low. Apple Silicon does not have SMT, so its cores go underutilised in Cinebench.

- Cinebench is dominated by SIMD instruction throughput, and Apple Silicon has little chance of winning that kind of battle, even under optimal conditions. The M1 family has four 128-bit SIMD units per core, effectively allowing it to process 512 bits' worth of data per cycle. That is similar to modern x86 cores, but Apple Silicon runs at a much lower frequency (rough arithmetic after the list). Also, x86 CPUs generally have higher cache bandwidth; this is especially true of Intel CPUs, which are designed to sustain fetching large vector operands from cache, while Apple is more conservative here, probably due to power-consumption concerns. So on hand-optimised SIMD code that is either throughput-limited or memory-bandwidth-limited, the higher-clocked x86 core will always win. On the other hand, on code where you need flexibility, Apple is likely to win, because its cores can schedule more independent operations (this is especially true of non-vectorised scientific compute, where the M1 is extremely fast).
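To make the compatibility-header point concrete, here is a sketch in the style of such a header (sse2neon is the best-known example). The function names are made up for illustration; the NEON intrinsics are real. A 1:1 mapping costs nothing, but an instruction like _mm_movemask_ps has no single NEON equivalent and must be emulated with a short sequence:

```cpp
// Sketch of an SSE-to-NEON compatibility shim
// (illustrative names, real AArch64 NEON intrinsics).
#include <arm_neon.h>
#include <cstdint>

// Easy case: one x86 intrinsic maps to exactly one NEON instruction.
static inline float32x4_t shim_add_ps(float32x4_t a, float32x4_t b) {
    return vaddq_f32(a, b);
}

// Hard case: _mm_movemask_ps (gather the four sign bits into an int)
// has no NEON counterpart, so the shim needs several instructions.
static inline int shim_movemask_ps(float32x4_t a) {
    uint32x4_t sign = vshrq_n_u32(vreinterpretq_u32_f32(a), 31); // sign bit -> bit 0
    const int32_t lane_shift[4] = {0, 1, 2, 3};
    uint32x4_t placed = vshlq_u32(sign, vld1q_s32(lane_shift));  // bit i for lane i
    return (int)vaddvq_u32(placed);                              // combine lanes
}
```

The low-ILP point can be shown with a toy loop that has nothing to do with Cinebench itself: the first sum is one long dependency chain, while the second exposes four independent chains that a wide backend like Apple Silicon's can overlap.

```cpp
// Toy ILP demonstration. Note: splitting the accumulator reassociates
// floating-point additions, so results may differ in the last bits.
double serial_sum(const double* x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i];                    // every add waits on the previous add
    return s;
}

double unrolled_sum(const double* x, int n) {  // assumes n % 4 == 0 for brevity
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];                   // four chains with no mutual dependencies:
        s1 += x[i + 1];               // a wide core overlaps them,
        s2 += x[i + 2];               // a narrow high-clocked core gains less
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

And the back-of-the-envelope throughput arithmetic for the last point. The per-core unit counts are from the bullet above; the clock speeds and the 2x256-bit x86 configuration are assumed round figures, not measurements:

```cpp
#include <cstdio>

int main() {
    // peak FP32 GFLOP/s = units * lanes * 2 FLOPs per FMA * GHz
    double m1  = 4 * 4 * 2 * 3.2;  // 4x128-bit NEON units at ~3.2 GHz -> ~102
    double x86 = 2 * 8 * 2 * 5.0;  // 2x256-bit FMA units at ~5.0 GHz -> ~160
    printf("M1-class core ~%.0f GFLOP/s, x86 core ~%.0f GFLOP/s\n", m1, x86);
    return 0;
}
```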


That should more or less cover it, unless I've forgotten something. TL;DR: Cinebench is optimised for x86, not Apple Silicon; its basic software design favours x86, not Apple Silicon; and it represents a niche software problem where Apple is weaker overall. If you are looking for something quotable, feel free to quote this post. I am prepared to defend every point :)
 

Kazgarth

macrumors 6502
Oct 18, 2020
318
834
I think I saw a post or article saying that Cinebench is more optimized for x86 than for ARM, so testing Apple Silicon with it is meaningless and pointless. But why exactly? Any credible articles or links?
Even if it is true that it's optimised for x86 instructions (which is an ongoing debate), it still represents real-world performance, since 90% of apps and games worldwide are developed and programmed for the x86 architecture at their core (with some then ported to ARM).
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
I am prepared to defend every point :)
Well, since you offered.... 😄

OK, being more serious:

1) I think it would be interesting to put some numbers to this. In particular, if we take GB5 as a CPU benchmark that is relatively neutral with respect to platform, then, based on GB5 scores for the M1 and M2, and GB5 + CB R23 scores for the latest Intel and AMD CPUs, we could compute what the equivalent CB R23 scores should be for the M1 and M2 and, from their actual scores, see whether there is a consistent correction factor that could be applied (or, alternately, simply see what the actual % penalty is).

2) Any thoughts on @Xiao_Xi 's post about Embree?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Even if it is true that it's optimised for x86 instructions (which is an ongoing debate), it still represents real-world performance, since 90% of apps and games worldwide are developed and programmed for the x86 architecture at their core (with some then ported to ARM).

Not really. Just because a lot of code is developed and tested on x86 doesn't mean that it is particularly optimized for x86. Most code does not use any architecture-specific features and is written in higher-level languages that make only basic assumptions about the target CPU. For most general-purpose software there is not much difference between x86-64 and ARM64: both have the same basic data sizes and alignments and follow the same basic execution model of instructions, threads, memory organisation, etc.

The following is very simplified, of course, but there are three broad cases in which developers start looking at low-level CPU features. The first is when you need to do something very specific and your CPU of interest offers specialized features for doing it better: examples are arbitrary-precision integers, cryptography operations, stuff like that. If your CPU has special instructions for those, you can make use of them, but of course they won't work on a different CPU; that other CPU might offer a different kind of instruction for the same problem, so if you want to support it you'd need to write a new code path specifically for that CPU (a sketch of this follows below). The second case is when you are doing a lot of specialized calculations, and I mean A LOT of them. Then you might want to use the high-performance-computing features your CPU offers, e.g. vector instructions and the like; this applies to various mathematical libraries, ray-tracing frameworks like Embree (which Cinebench uses), etc. And finally, the third case is various low-level libraries that require an intimate understanding of the CPU architecture just to work correctly (e.g. multi-threading primitives).
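A minimal sketch of that first case, assuming the job is a CRC-32C checksum: both architectures have a dedicated instruction for it, and everything else falls back to plain C. The function name is made up for illustration; the intrinsics are the real ones.

```cpp
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
uint32_t crc32c_step(uint32_t crc, uint8_t byte) {
    return __crc32cb(crc, byte);          // ARMv8 CRC32C instruction
}
#elif defined(__SSE4_2__)
#include <nmmintrin.h>
uint32_t crc32c_step(uint32_t crc, uint8_t byte) {
    return _mm_crc32_u8(crc, byte);       // x86 SSE4.2 CRC32C instruction
}
#else
uint32_t crc32c_step(uint32_t crc, uint8_t byte) {
    crc ^= byte;                          // portable bit-by-bit fallback
    for (int i = 0; i < 8; ++i)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u))); // CRC-32C polynomial
    return crc;
}
#endif
```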

Programs that fall into those cases either are quite rare or use third-party libraries, many of which are fully optimized for Apple Silicon by now. Performance will depend on the domain. As I've said, Apple is generally at a disadvantage when it comes to throughput-oriented SIMD code, so if that's the kind of application you care about, a newer x86 CPU might be a better choice for you.

1) I think it would be interesting to put some numbers to this. In particular, if we take GB5 as a CPU benchmark that is relatively neutral with respect to platform, then, based on GB5 scores for the M1 and M2, and GB5 + CB R23 scores for the latest Intel and AMD CPUs, we could compute what the equivalent CB R23 scores should be for the M1 and M2 and, from their actual scores, see whether there is a consistent correction factor that could be applied (or, alternately, simply see what the actual % penalty is).

Oh, absolutely. Shouldn't be too difficult. I think I've even pulled some numbers in another thread showing that the CB23 advantage x86 enjoys does not show up in GB5 or SPEC2017. But it's late here and I'm not in the mood to go digging now. I'll have a look later if I have time, or maybe one of the nice contributors here will feel inspired :)

2) Any thoughts on @Xiao_Xi 's post about Embree?

If Apple implemented a dedicated NEON path for Embree they might be able to get another 10-15%, who knows. But what's the incentive? It is hard, tedious work and not very useful. Apple's effort in ray tracing is the GPU, and that's where they are investing the bulk of their resources. As I've said, I doubt that Apple will outperform a modern x86 CPU in many SIMD-based workloads; they'd have to sacrifice energy efficiency for that and would get very little in return. It makes much more sense to use dedicated coprocessors, such as AMX or the GPU, for high-performance compute.
 

Bodhitree

macrumors 68020
Apr 5, 2021
2,085
2,216
Netherlands
In my experience, a lot of games are written in higher-level languages like C++ without the benefit of the Embree library, even while targeting x86. Some game houses are quite clued up about this stuff but don't want to disadvantage AMD processors by using a library built by Intel.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
In my experience, a lot of games are written in higher-level languages like C++ without the benefit of the Embree library, even while targeting x86. Some game houses are quite clued up about this stuff but don't want to disadvantage AMD processors by using a library built by Intel.

Why would a game use something like Embree? That's for building production renderers, not real-time games.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
I just did what I proposed here:
I think it would be interesting to put some numbers to this. In particular, if we take GB5 as a CPU benchmark that is relatively neutral with respect to platform, then, based on GB5 scores for the M1 and M2, and GB5 + CB R23 scores for the latest Intel and AMD CPUs, we could compute what the equivalent CB R23 scores should be for the M1 and M2 and, from their actual scores, see whether there is a consistent correction factor that could be applied (or, alternately, simply see what the actual % penalty is).

I used ST scores only, to avoid thermal effects. I got the R23 scores from NotebookCheck and the GB5 scores from the GB website. I'll be speaking in ballpark terms, since that's all these scores tell us.

Based on this, let's say that, for x86, the ST R23 scores are expected to be about the same as the GB5 scores.

[Note: One could consider the AMD 7000 series' GB5 performance to be a bit of an outlier; it does particularly well in GB because of specialized instructions that accelerate cryptographic tasks. The 11th-gen Intel processors had these as well, but they were removed from the 12th gen, I guess because Intel wants customers who need this functionality to spring for the more expensive Xeon processors, where it's still found. Apple Silicon doesn't have them either.]

By contrast, Apple Silicon's R23 performance comes in about 15% below what its GB5 scores predict. Thus one can conclude that the optimization issue for AS in CB R23 reduces its performance by about 15%.
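The arithmetic behind that ~15% figure, sketched with placeholder scores rather than the actual numbers from the attached table:

```cpp
#include <cstdio>

int main() {
    // x86 calibration: ST CB R23 ~= ST GB5, so the expected ratio is ~1.0
    double expected_r23_per_gb5 = 2000.0 / 2000.0;    // placeholder scores

    // Apple Silicon: the actual R23 lands below the GB5-predicted value
    double as_gb5 = 1900.0, as_r23 = 1600.0;          // placeholder scores
    double predicted = as_gb5 * expected_r23_per_gb5; // ~1900
    double penalty = 1.0 - as_r23 / predicted;        // ~0.16, i.e. about 15%

    printf("predicted R23: %.0f, actual: %.0f, penalty: %.0f%%\n",
           predicted, as_r23, penalty * 100.0);
    return 0;
}
```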

[Attachment: 1668156640602.png — table of GB5 vs. CB R23 single-thread scores]
 
  • Like
Reactions: Xiao_Xi

Bodhitree

macrumors 68020
Apr 5, 2021
2,085
2,216
Netherlands
Why would a game use something like Embree? That's for building production renderers, not real-time games.

There is a lot of crossover… many standard game-engine calculations use mathematical vector and matrix types even outside the rendering or physics engine, and if you code those using plain float types, it all goes through the scalar floating-point unit.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
There is a lot of crossover… many standard game-engine calculations use mathematical vector and matrix types even outside the rendering or physics engine, and if you code those using plain float types, it all goes through the scalar floating-point unit.

But again, what does that have to do with Embree? If you need to do linear algebra you can use an appropriate library, for example Eigen, which is optimized for ARM.
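A minimal Eigen sketch for contrast; the same source compiles to NEON on Apple Silicon and SSE/AVX on x86, so nothing in it is tied to one architecture:

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::Matrix4f transform = Eigen::Matrix4f::Identity();
    transform(0, 3) = 2.0f;                      // translate +2 along x
    Eigen::Vector4f point(1.0f, 0.0f, 0.0f, 1.0f);
    Eigen::Vector4f moved = transform * point;   // vectorised mat-vec product
    std::cout << moved.transpose() << std::endl; // prints: 3 0 0 1
    return 0;
}
```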
 

Bodhitree

macrumors 68020
Apr 5, 2021
2,085
2,216
Netherlands
But again, what does that have to do with Embree? If you need to do linear algebra you can use an appropriate library, for example Eigen, which is optimized for ARM.

A software rendering engine (and I have written several) depends mostly on mathematical vector and matrix types and operations. The Embree library, by all reports, contains optimised code for these, so if you repurposed it for a more general case you could certainly make your game engine process frames more quickly. But perhaps there are more suitable maths libraries out there.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
A software rendering engine (and I have written several) depends mostly on mathematical vector and matrix types and operations. The Embree library, by all reports, contains optimised code for these, so if you repurposed it for a more general case you could certainly make your game engine process frames more quickly. But perhaps there are more suitable maths libraries out there.

Relying on Embree just for the linear-algebra routines is like pulling in FreeType just because you want to draw some curves. Given the number of high-quality software libraries purpose-built for these tasks, it would be a very odd choice indeed.
 