Intel Alder Lake vs. Apple M1

crazy dave · Jan 31, 2022

Rigby said:
Source? I'm sure they optimized particularly the backend as much as possible.

Even if that is true it doesn't mean that it's possible get an additional 45% out of Apple's optimized Xcode compiler on the M1 too. I think it's only fair to use the best compiler available for each platform.

No it’s standard clang 9 something. Andrei measured results to a clang he downloaded and ran and the results were almost identical. Apple does do weird things like take the compiler from the version and library from this other version which pisses some people off because it makes the build idiosyncratic, but for the purposes of benchmarking it’s basically identical. As far as I can tell, Apple upstreams everything as clang is basically their baby. I think that Apple’s Xcode has a built in flag that standard clang doesn’t that helps a lot in one specific subtest.

In terms of could Apple and AMD get similar uplifts? Yes and no. Intel has been doing this for years and as a result people have a pretty good idea of how they achieve this. What Intel is doing here is basically using a highly specialized auto vectorization tool that normally would require you to manually rearrange or pragma your code for the compiler to recognize the opportunity to vectorize. Incredibly impressive compiler engineering (occasionally breaks stuff though and not all programs benefit). They also sometimes have used specialized libraries that Spec calls on that Intel has rewritten to be faster on Intel chips. So no flags alone won’t get you there. But that’s not to say that this is reflective of anything but compiler differences and the others could very well get the same uplift.

And therein lies the rub. We actually don’t exactly know what compilation techniques they used because it’s a closed source variant of an open source backend so we can’t say for certain what is and is not available elsewhere. Why for instance would you assume that Intel would get a free 48% increase but not anybody else if the same techniques were applied? Basically again you’re measuring the difference in compiler not hardware. If you want the most optimized version for each then at most I’d say use the same compiler but turn PGO on if you want the most optimized machine code for each processor - at least the compiler is still the same.

TLDR: Intel is using a different compiler that produces optimizations for Intel chips without flags that AMD and Apple don’t get to claim hardware wins all on a compiler that really isn’t used for consumer software and likely isn’t the most used for HPC either (gcc likely is). So it’s disingenuous unless Intel is selling a complete HPC solution to customers which they aren’t here.

Rigby · Jan 31, 2022

leman said:
Depends on what you want to compare really. It kind of looks as if Intel team fine-tuned the compiler parameters to get the best scores for the individual tests while leaving default build settings for M1.

Is that based on fact or just a gut feeling?

leman said:
And besides, one can argue that leveraging an advanced auto-vectorising compiler is nice and all, but at this point you are not measuring the performance of the hardware, but the performance of the compiler.

You're measuring what's available to any developer. ICC is a free download, just like Xcode.

leman · Jan 31, 2022

crazy dave said:
No it’s standard clang 9 something. Andrei measured results to a clang he downloaded and ran and the results were almost identical. Apple does do weird things like take the compiler from the version and library from this other version which pisses some people off because it makes the build idiosyncratic, but for the purposes of benchmarking it’s basically identical. There’s one Spec subtest I think that Apple’s Xcode has a built in flag that standard clang doesn’t that helps a lot.

In terms of could Apple and AMD get similar uplifts? Yes and no. Intel has been doing this for years and as a result people have a pretty good idea of how they achieve this. What Intel is doing here is basically using a highly specialized auto vectorization tool that normally would require you to manually rearrange or pragma your code for the compiler to recognize the opportunity to vectorize. Incredibly impressive compiler engineering (occasionally breaks stuff though and not all programs benefit). They also sometimes have used specialized libraries that Spec calls on that Intel has rewritten to be faster on Intel chips. So no flags alone won’t get you there. But that’s not to say that this is reflective of anything but compiler differences.

And therein lies the rub. We actually don’t exactly know what compilation techniques they used because it’s closed source so we can’t say for certain what is and is not available elsewhere. Why for instance would you assume that Intel would get a free 48% increase but not anybody else if the same techniques were applied? Basically again you’re measuring the difference in compiler not hardware. At most I’d say use the same compiler but turn PGO on if you want the most optimized machine code for each processor - at least the compiler is still mostly the same.

TLDR: Intel is using a different compiler that produces optimizations for Intel chips without flags that AMD and Apple don’t get to claim hardware wins all on a compiler that really isn’t used for consumer software and likely isn’t the most used for HPC either (gcc likely is). So it’s disingenuous unless Intel is selling a complete HPC solution to customers which they aren’t here.

Quick comment: did Intel publish the compiler settings they are using on the Mac? Because if it is something like -Os then talking about the superiority of the Intel compiler might be a bit premature.

crazy dave · Jan 31, 2022

leman said:
Quick comment: did Intel publish the compiler settings they are using on the Mac? Because if it is something like -Os then talking about the superiority of the Intel compiler might be a bit premature.

I think they did but I can’t remember what it was.

januarydrive7 · Jan 31, 2022

leman said:
Depends on what you want to compare really. It kind of looks as if Intel team fine-tuned the compiler parameters to get the best scores for the individual tests while leaving default build settings for M1. This approach is fine for a HPC workload, where you have one of a kind software that runs on a specific target machine, so of course you'll fine-tune the settings... but doesn't really work for the common case where software is precompiled and distributed in a binary format. And besides, one can argue that leveraging an advanced auto-vectorising compiler is nice and all, but at this point you are not measuring the performance of the hardware, but the performance of the compiler.

I'm imagining the feedback this would receive if sent off to a peer-reviewed community, with their methods clearly detailed... tuning compiler parameters for individual benches surely showcases that the researchers are competent at optimizing particular workflows, but it in no way reveals any generalizations about the underlying architecture.

It's not a huge loss, though. We have Stockfish for that.

Rigby · Jan 31, 2022

crazy dave said:
No it’s standard clang 9 something. Andrei measured results to a clang he downloaded and ran and the results were almost identical.

That sounds anecdotal.

crazy dave said:
Apple does do weird things like take the compiler from the version and library from this other version which pisses some people off because it makes the build idiosyncratic, but for the purposes of benchmarking it’s basically identical. As far as I can tell, Apple upstreams everything as clang is basically their baby. I think that Apple’s Xcode has a built in flag that standard clang doesn’t that helps a lot in one specific subtest.

In terms of could Apple and AMD get similar uplifts? Yes and no. Intel has been doing this for years and as a result people have a pretty good idea of how they achieve this. What Intel is doing here is basically using a highly specialized auto vectorization tool that normally would require you to manually rearrange or pragma your code for the compiler to recognize the opportunity to vectorize. Incredibly impressive compiler engineering (occasionally breaks stuff though and not all programs benefit).

Sounds like a perfectly legitimate approach. Apple and AMD are free to do the same in Xcode and AOCC.

leman · Jan 31, 2022

Rigby said:
Is that based on fact or just a gut feeling?

Its based on some Intel slides I've seen where they discuss fine-tuning compiler parameters for individual benchmarks. Unfortunately, I can't seem to find it anymore. I might have linked them in this very thread though.

Rigby said:
You're measuring what's available to any developer. ICC is a free download, just like Xcode.

Then why is not everybody using it if it so superior?

crazy dave · Jan 31, 2022

Rigby said:
Is that based on fact or just a gut feeling?

You're measuring what's available to any developer. ICC is a free download, just like Xcode.

leman said:
Its based on some Intel slides I've seen where they discuss fine-tuning compiler parameters for individual benchmarks. Unfortunately, I can't seem to find it anymore. I might have linked them in this very thread though.

Then why is not everybody using it if it so superior?

Intel® Core™ Ultra Processors (Series 2) - 2 | Performance Index

edc.intel.com

I was wrong. They don’t actually mention flags.

Rigby · Jan 31, 2022

leman said:
Then why is not everybody using it if it so superior?

Perhaps it doesn't make as much of a difference as people here claim?

Andropov · Jan 31, 2022

Rigby said:
Even if that is true it doesn't mean that it's possible get an additional 45% out of Apple's optimized Xcode compiler on the M1 too. I think it's only fair to use the best compiler available for each platform.

The only information Intel provides on how the SPEC int 2017 test was carried out on the M1 Max is that it was "compiled using Xcode 13.1". Xcode defaults to compiling with the -Os flag, while the optimizations ICC applies are at least comparable with clang's most aggressive level of optimization (-O3), if not more. It's not a fair comparison.

Rigby · Jan 31, 2022

Andropov said:
The only information Intel provides on how the SPEC int 2017 test was carried out on the M1 Max is that it was "compiled using Xcode 13.1".

That's still way more than Apple provided when they showed similar graphs (they didn't even say what benchmarks they used).

In any case, you could just as well say that Geekbench is unfair to Intel since the current version was compiled long before Alder Lake was available, so the compiler didn't include any optimizations for the new CPUs that Intel has since upstreamed. ...

crazy dave · Jan 31, 2022

Rigby said:
That sounds anecdotal.

Huh? He ran his benchmarks on clang and Xcode clang - they were identical. I’m a scientist that’s not what anecdotal means. That’s a controlled experiment. You have the test subject: Xcode. The control: standard clang. We even know what Apple’s clang version is! There’s a entire conversion list of what Apple’s clang corresponds to which standard clang.

Sounds like a perfectly legitimate approach. Apple and AMD are free to do the same in Xcode and AOCC.

Except it isn’t if you’re selling consumer hardware to people since their software wont be compiled with ICC. Basically in this instance it’s a benchmark only advantage used for advertising. Which makes it misleading.

Sydde · Jan 31, 2022

These tests are kind of like opinion polls, which have an error margin. Actual usage cases reflect the error bars, which are probably in the 5~10% range. In the real world, the benchmark performance differences are almost entirely swallowed up by the error margins. Arguing these minute differences is fappism.

crazy dave · Jan 31, 2022

Rigby said:
That's still way more than Apple provided when they showed similar graphs (they didn't even say what benchmarks they used).

In any case, you could just as well say that Geekbench is unfair to Intel since the current version was compiled long before Alder Lake was available, so the compiler didn't include any optimizations for the new CPUs that Intel has since upstreamed. ...

I’m not going to defend the vagueness of Apple’s graphs here but actually they did say for the CPU tests and it’s pretty obvious which ones they used for the GPU tests. Further it was backed up by third party analysis when those benchmarks were run. And hey for the GPU you can call it misleading since they chose one of the few benchmarks actually coded for AS GPUs and most software isn’t (yet). Perfectly legit to say that’s cherry picked if you’re trying to decide based on how your software will run right now.

leman · Jan 31, 2022

crazy dave said:
I think they did but I can’t remember what it was.

I checked again, they don't mention anything of relevance.

Rigby said:
Perhaps it doesn't make as much of a difference as people here claim?

If that's the case then Intel's claims are entirely bogus. Because according to their slides 11980HK compiled with ICC is as fast as M1 Max in SPECint2017. But in independent tests M1 is almost 40% faster. Also, on the Intel slide, M1 Max is only barely faster than the 5900HX, while according to the independent benchmarks it is over 30% faster then the 5980HX. I really start suspecting that Intel used an optimise for code size build here. And of course they would not publish the numbers because then it will become clear that their M1 numbers are 30-40% lower than what Anandtech is reporting.

BTW, found the presentation where they discuss compiler parameter fine-tuning: https://crc.pitt.edu/sites/default/files/Intel Compilers Overview.pdf (page 29)

Rigby said:
That's still way more than Apple provided when they showed similar graphs (they didn't even say what benchmarks they used).

Who cares? We have in-depth independent benchmarks of Apple stuff. We don't have much for ADL mobile. Just a single gaming laptop running in a 100W+ chassis...

Rigby · Jan 31, 2022

crazy dave said:
Except it isn’t if you’re selling consumer hardware to people since their software wont be compiled with ICC. Basically in this instance it’s a benchmark only advantage used for advertising. Which makes it misleading.

Consumers don't run Spec Int or Geekbench all day long either, so you could just as well say that benchmarks in general are misleading. If you want to measure the capabilities of the CPU (rather than a compiler), it seems logical to me to use the best compiler available.

crazy dave · Jan 31, 2022

Rigby said:
Perhaps it doesn't make as much of a difference as people here claim?

It does in this instance for this benchmark. We have Intel’s claims about their compiler. We have Intel’s graphs comparing their chips to other chips. We have 3rd party analysis that did it properly to compare. They all line up. Intel gave themselves a 45% speed boost.

leman · Jan 31, 2022

Rigby said:
If you want to measure the capabilities of the CPU (rather than a compiler), it seems logical to me to use the best compiler available.

This only makes sense if the best compilers available do not vastly differ in the quality of the optimisations they apply. But even then, did Intel use the best compiler or the best build setting available for the Apple platform? As I pointed out above, the graphs Intel published are in stark discrepancy with third-part tests. Ok, let's just assume that ICC is indeed magical and gives Intel a 40% boost in SPEC. Fine. Would it also give AMD 40% boost in SPEC? Hardly... We know that ICC and AMD doesn't play well, no?

crazy dave · Jan 31, 2022

Rigby said:
Consumers don't run Spec Int or Geekbench all day long either, so you could just as well say that benchmarks in general are misleading. If you want to measure the capabilities of the CPU (rather than a compiler), it seems logical to me to use the best compiler available.

That’s inverted. That’s literally measuring the difference in compiler in your test not just the hardware.

The point of Spec and GB is to provide a suite of subtests to give a general overview over how a CPU functions - we know exactly what those subtests are and they all correspond to different algorithms that people run. You can choose which subtests reflect your own workloads if you don’t want to use the overall number. Here I agree with @EntropyQ3 that the top line number is meaningless outside of comparisons to other CPUs and I would say each test has its weaknesses. But the nihilistic viewpoint that everything is meaningless so let’s just let marketing manipulate benchmarks is not healthy to accept. Why not let chip makers simply increase clocks when they detect a benchmark running? (As some have done) where’s the line?

Rigby · Jan 31, 2022

leman said:
BTW, found the presentation where they discuss compiler parameter fine-tuning: https://crc.pitt.edu/sites/default/files/Intel Compilers Overview.pdf (page 29)

And where in this presentation does it say that Intel "fine-tuned the compiler parameters to get the best scores for the individual tests while leaving default build settings for M1"? Looks to me like a technical presentation meant to teach developers how to get the best performance out of the CPUs. Probably a lot of what they are talking about will sooner or later be upstreamed into other compilers.

leman said:
Who cares? We have in-depth independent benchmarks of Apple stuff. We don't have much for ADL mobile.

Indeed.

Rigby · Jan 31, 2022

leman said:
As I pointed out above, the graphs Intel published are in stark discrepancy with third-part tests.

In your previous posting you said that we don't have much independent data for Alder Lake.

crazy dave · Jan 31, 2022

Rigby said:
In your previous posting you said that we don't have much independent data for Alder Lake.

We have it for 11th gen which they also included in the graph. We know that compared to 3rd party independent analysis that it is vastly outperforming where it should be compared to AMD and Apple. We also do have some 3rd party for 12th gen mobile now, just not this particular test. Which again paints a different picture than Intel’s slides.

Rigby · Jan 31, 2022

crazy dave said:
That’s inverted. That’s literally measuring the difference in compiler in your test not just the hardware.

You can't adequately measure the capabilities of a CPU if you don't use it to the best of its capabilities. For example, if one CPU has powerful SIMD instructions and another doesn't, then that is a real advantage and it's only fair to use vectorization, because that option is available to any developer.

crazy dave · Jan 31, 2022

Rigby said:
You can't adequately measure the capabilities of a CPU if you don't use it to the best of its capabilities. For example, if one CPU has powerful SIMD instructions and another doesn't, then that is a real advantage and it's only fair to use vectorization.

EDIT: The scenario you describe has come up and I actually agree with you and disagree with Andrei from Anandtech. I can't remember what test it was, but it could make use of AVX-512 hardware which AMD obviously doesn't have (yet). I may be butchering this from memory but Andrei's reasoning for not running the AVX-512 version of the test was that this software runs best on a GPU so if I'm going to go as far as using AVX-512 I might as well just use a GPU. But there are all sorts of reasons I can imagine where that may not be an option and including the test results with AVX-512 could be relevant to a person's say headless server setup. Again, this is from memory so I may have butchered his reasoning and there may have been more to it, but sufficed to say, I think that the relevant hardware should be tested.

However, in this case all the processors have powerful SIMD units and a compiler is being chosen to make more use of them in one CPU but not the others for the same code. This is the inversion of your scenario and this confounds the differences in compiler with whatever differences in hardware capabilities are present. Overall, the definition of a controlled experiment is to reduce the number of confounding variables not increase them. If you really want to try to optimize the machine code for each CPU without totally breaking control over confounding variables, you can turn things like PGO (profile-guided optimization) on during compilation. Anandtech don't and I understand their reasoning, but I also understand why others would disagree (I believe @leman would) and at least it's still the same compiler applying the same optimization techniques for each processor.

Further the point of benchmarks is to demonstrate how well people’s software will run. The ICC isn’t used for compiling consumer software. So its automatic replacements for hand tuned optimizations aren’t particularly relevant here. Again, if Intel was doing this to market their server chips to HPC folks saying, "here are increases you get with our compiler and libraries on our CPUs - no time consuming hand optimizations required, we've taken care of those automagically", then I'd say that's totally fair use of the ICC in benchmarks. This actually is the point of the Intel link I included earlier describing the new and improved ICC - that's totally fine. That's what it is for.

But here in this context it’s basically as bad as compiling a piece of code with Ofast turned on for one processor and O0 for the other with the cop out that one compiler has Ofast on by default while the other defaults to O0. Which might even be the case here! Heck that's why you don't use closed source compilers for benchmarking, you don't know if there is a special "If compiling Spec, do this" code block. If you think that's paranoid, years ago when Intel had even more dominance in both CPUs and even compilers Intel was being as blatant as including CPUID vendor ID checks in the ICC (https://www.agner.org/optimize/blog/read.php?i=49) i.e. checking if it was GenuineIntel and if not disabling SSE despite it being available. We also know that sometimes beyond the ICC, Intel has used completely rewritten libraries in their first party benchmarks so that the SPEC code may be the same, but the libraries it is linked against are entirely different (https://www.linleygroup.com/mpr/article.php?id=11879). I hope Intel's ICC first party benchmarks are not quite as flagrantly obvious as that today, but the effect is still similar and overall you just don't use closed source, 1st party compilers for independent benchmarks if you can avoid it.

Again if all this is okay why not simply allow Intel processors to speed boost when a benchmark is detected? Where’s the difference?

leman · Jan 31, 2022

Rigby said:
And where in this presentation does it say that Intel "fine-tuned the compiler parameters to get the best scores for the individual tests while leaving default build settings for M1"? Looks to me like a technical presentation meant to teach developers how to get the best performance out of the CPUs. Probably a lot of what they are talking about will sooner or later be upstreamed into other compilers.

It doesn't. I am just trying to find rational explanations for Intel's surprisingly high claimed scores.

Rigby said:
In your previous posting you said that we don't have much independent data for Alder Lake.

We have for other CPUs on their graph. Come on. At least pretend to make some effort here.

Rigby said:
You can't adequately measure the capabilities of a CPU if you don't use it to the best of its capabilities. For example, if one CPU has powerful SIMD instructions and another doesn't, then that is a real advantage and it's only fair to use vectorization, because that option is available to any developer.

Completely agree. So why would you use SIMD on the Intel CPU but not on M1, which has just as powerful (and often more flexible) SIMD units?

crazy dave said:
If you really want to try to optimize the machine code for each CPU without totally breaking control over confounding variables, you can turn things like PGO (profile-guided optimization) on during compilation. Anandtech don't and I understand their reasoning, but I also understand why others would disagree (I believe @leman would) and at least it's still the same compiler applying the same optimization techniques for each processor.

I think it makes a lot of sense to build benchmarks with PGO, provided we are sure that we get a comparable quality codegen for all platforms. In limited cases I think it even makes sense to compile with machine-specific optimizations, for example for HPC workloads where you know that you are going to build the software from source anyway and need to squeeze out every drop of performance you can.

Intel Alder Lake vs. Apple M1

macrumors 68000

macrumors 603

macrumors Core

macrumors 68000

macrumors 6502a

macrumors 603

macrumors Core

macrumors 68000

macrumors 603

macrumors 6502a

macrumors 603

macrumors 68000

macrumors 68030

macrumors 68000

macrumors Core

macrumors 603

macrumors 68000

macrumors Core

macrumors 68000

macrumors 603

macrumors 603

macrumors 68000

macrumors 603

macrumors 68000

macrumors Core

Our Staff