Intel Alder Lake vs. Apple M1

crazy dave · Jan 31, 2022

leman said:
I think it makes a lot of sense to build benchmarks with PGO, provided we are sure that we get a comparable quality codegen for all platforms. In limited cases I think it even makes sense to compile with machine-specific optimizations, for example for HPC workloads where you know that you are going to build the software from source anyway and need to squeeze out every drop of performance you can.

I can see the argument for both. I understand why Anandtech doesn't but I can make the case for it as you do here. If I remember right I think their concern is your caveat about comparable quality codegen. I'm fine with either as long as it is upfront and transparent about what is happening and why.

leman · Jan 31, 2022

crazy dave said:
I can see the argument for both. I understand why Anandtech doesn't but I can make the case for it as you do here. If I remember right I think their concern is your caveat about comparable quality codegen. I'm fine with either as long as it is upfront and transparent about what is happening and why.

Exactly! Transparency is the most important bit. But marketing departments (Intel, Apple, doesn’t matter) are not interested in transparency as it would limit what they can claim. We are lucky to have competent third party reviewers (who unfortunately seem to be dwindling by a minute) who can provide useful information.

crazy dave · Jan 31, 2022

leman said:
Exactly! Transparency is the most important bit. But marketing departments (Intel, Apple, doesn’t matter) are not interested in transparency as it would limit what they can claim. We are lucky to have competent third party reviewers (who unfortunately seem to be dwindling by a minute) who can provide useful information.

Yeah Ian is now doing double duty on his and Andrei's work. Ryan is starting to do reviews again so hopefully that will lessen the burden, but honestly Anandtech can't pay what a tech company can or offer that stability. Ian and Ryan are also damn good that even though they aren't a contractor like Andrei was so are less prone to leaving, one day they'll probably be offered a golden ticket that even they can't refuse. Unfortunately while the medium for tech reviews leaves something to be desired, the Techtubers tend to have more money. LTT and Gamers Nexus seem intent on hiring and creating full labs to do testing so hopefully they'll be able to offer more substantial reviews than just benchmark porn. Linus even stated that his goal in building fully fledged labs and hiring more experts to do in-depth analysis of tech was because of the dwindling number of in-depth tech reviewers many of whom have been hired away by the tech companies. He's hoping he has the money to maintain that model unlike more traditional outlets. We'll see.

Rigby · Jan 31, 2022

crazy dave said:
However, in this case all the processors have powerful SIMD units and a compiler is being chosen to make more use of them in one CPU but not the others for the same code.

Where is the evidence that Apple's compiler doesn't make as much use of them as is possible?

It's been mentioned multiple times. The M1 SIMD execution units have only half the width (128 bits) of the AVX2 execution units in Intel's P cores. That alone could make a difference.

crazy dave said:
Further the point of benchmarks is to demonstrate how well people’s software will run. The ICC isn’t used for compiling consumer software.

Says who? I'd be surprised if nobody uses it to compile performance critical libraries etc. It can be easily integrated in IDEs like Visual Studio.

crazy dave said:
So its automatic replacements for hand tuned optimizations aren’t particularly relevant here.

If optimizations are "automatic" they are per definitionem not "hand tuned".

crazy dave said:
Heck that's why you don't use closed source compilers for benchmarking, you don't know if there is a special "If compiling Spec, do this" code block.

Is the Xcode compiler open source?

Andropov · Jan 31, 2022

Rigby said:
Is the Xcode compiler open source?

Yeah. https://github.com/apple/llvm-project

Andropov · Jan 31, 2022

crazy dave said:
EDIT: The scenario you describe has come up and I actually agree with you and disagree with Andrei from Anandtech. I can't remember what test it was, but it could make use of AVX-512 hardware which AMD obviously doesn't have (yet). I may be butchering this from memory but Andrei's reasoning for not running the AVX-512 version of the test was that this software runs best on a GPU so if I'm going to go as far as using AVX-512 I might as well just use a GPU. But there are all sorts of reasons I can imagine where that may not be an option and including the test results with AVX-512 could be relevant to a person's say headless server setup. Again, this is from memory so I may have butchered his reasoning and there may have been more to it, but sufficed to say, I think that the relevant hardware should be tested.

However, in this case all the processors have powerful SIMD units and a compiler is being chosen to make more use of them in one CPU but not the others for the same code. This is the inversion of your scenario and this confounds the differences in compiler with whatever differences in hardware capabilities are present. Overall, the definition of a controlled experiment is to reduce the number of confounding variables not increase them. If you really want to try to optimize the machine code for each CPU without totally breaking control over confounding variables, you can turn things like PGO (profile-guided optimization) on during compilation. Anandtech don't and I understand their reasoning, but I also understand why others would disagree (I believe @leman would) and at least it's still the same compiler applying the same optimization techniques for each processor.

The problem is, you have to draw the line somewhere. You could keep optimizing these things ad infinitum. Using the most popular compiler to build the sources with one of the default optimization levels seems like a sensible choice. After all, it *is* how most people are going to run its software. Which is, after all, the whole point of benchmarks.

Some people will use ICC and for them it might be a more informative benchmark. But some people is not most people.

crazy dave · Jan 31, 2022

Rigby said:
Where is the evidence that Apple's compiler doesn't make as much use of them as is possible?

It's been mentioned multiple times. The M1 SIMD execution units have only half the width (128 bits) of the AVX2 execution units in Intel's P cores. That alone could make a difference.

Says who? I'd be surprised if nobody uses it to compile performance critical libraries etc. It can be easily integrated in IDEs like Visual Studio.

I’m pretty sure we’ve talked about this before but 3rd party tests have shown that Apple’s SIMD is structured differently but just as powerful. It has double the throughput of AVX/128bit and is on par or muuuuch better than AVX2 in most low level tests. Remember it has 4 per core. A good example is that when using the same compiler Apple’s SIMD solution produces better results. Further AMD doesn’t get the same uplift on ICC as Intel does so obviously this isn’t an issue if AVX vs Neon or 4x128bit vs 2x256 bit, but rather Intel prioritizing codegen for its own chips. Nothing wrong with that if what you are selling is full stack HPC solutions.

And Intel themselves say the main draw of their ICC compiler is superior codegen compared to base llvm-clang/gcc. That’s the point of that link. Again you’re measuring the difference in compiler codegen not hardware.

The reason ICC is not used more broadly that the benefits are squishy and not portable. Intel themselves say they don’t upstream stuff because it’s experimental and if you’re not compiling for an Intel machine you don’t get the benefit. Further, not every task benefits the same way from ICC. Also huge numbers of libraries are compiled with gcc or clang these days. If you need one of those and can’t recompile them ICC simply isn’t an option.

If optimizations are "automatic" they are per definitionem not "hand tuned".

Yeah? I said they were the automatic replacements for hand tuned optimizations. As in you don’t have to optimize the code by hand to get the benefits, the compiler does it automatically. Did you misread? Again using ICC is basically like turning a huge number of optimization flags on by default and manually adding pragmas to your code. A simple equivalent is compiling one with Ofast and one with O0 and then declaring the hardware running the Ofast code better. It’s nonsense.

I’m struggling to understand why you don’t get this and seem to be grasping at straws for why Intel’s results must be right wrt to Apple and AMD or indicative of hardware superiority.

1)Intel themselves claim they have superior codegen in ICC especially when it comes to compiling SpecInt.

2) The numbers from 3rd party reviewers and 3rd party software which by and large don’t use ICC give results that are very different from Intel’s first party results and have done so for years.

3) The difference between the 3rd party reviewers results on SpecInt and Intel’s is almost exactly the same as Intel’s stated claims from just changing the compiler.

mr_roboto · Jan 31, 2022

crazy dave said:
In terms of could Apple and AMD get similar uplifts? Yes and no. Intel has been doing this for years and as a result people have a pretty good idea of how they achieve this. What Intel is doing here is basically using a highly specialized auto vectorization tool that normally would require you to manually rearrange or pragma your code for the compiler to recognize the opportunity to vectorize. Incredibly impressive compiler engineering (occasionally breaks stuff though and not all programs benefit). They also sometimes have used specialized libraries that Spec calls on that Intel has rewritten to be faster on Intel chips. So no flags alone won’t get you there.

On specialized libraries - one of the major tricks is to use a replacement malloc(). System-provided malloc() is almost always a generalist design; it should perform reasonably well on virtually all types of program, and poorly on none. But because the SPEC run rules permit using a replacement, someone trying to game SPEC can produce substantial gains in a number of the SPECint benchmarks by linking in an alternate memory allocator carefully tuned for SPEC.

Also, the best compiler optimizations aren't really CPU-specific, even though they are often deployed to promote a specific CPU. Have a look at "The libquantum Dispute" in this web page on SPEC:

The SPEC Benchmarks at MROB

The SPEC Benchmarks -- Explore a wide variety of topics from large numbers to sociology at mrob.com

mrob.com

Rigby said:
Where is the evidence that Apple's compiler doesn't make as much use of them as is possible?

It's been mentioned multiple times. The M1 SIMD execution units have only half the width (128 bits) of the AVX2 execution units in Intel's P cores. That alone could make a difference.

Says who? I'd be surprised if nobody uses it to compile performance critical libraries etc. It can be easily integrated in IDEs like Visual Studio.

If optimizations are "automatic" they are per definitionem not "hand tuned".

Is the Xcode compiler open source?

You're being quite rude, you know. You're constantly assuming that your opponents are mendacious just because they tell you things which are widely known to be true.

Apple's compiler doesn't go overboard on SPEC gamesmanship for a very obvious reason: marketing on SPEC numbers simply isn't a thing Apple cares about today. The one time I remember them seriously promoting their SPEC scores was back when they were trying to position the PowerMac G5 as a UNIX workstation, not just a Mac. There are kids old enough to drive today who were born after that marketing campaign.

Today, when Apple talks about performance, they use things like video editing, not SPEC. Since compiler gamesmanship for SPEC often doesn't generalize well to other programs, and can cause regressions, why would they do it?

Contrast to Intel. Intel had a long period where SPEC scores were extremely important to them as a company: they set out to invade and conquer the UNIX workstation and server markets by force, and did so. Since everyone in that world regarded SPEC as the best available point of comparison, Intel put a lot of work into winning SPEC benchmarks. Since winning SPEC wasn't just about building a good CPU, but also a good compiler, that's why they got into the compiler business. ICC is as much a marketing tool as it is a compiler.

crazy dave · Jan 31, 2022

mr_roboto said:
On specialized libraries - one of the major tricks is to use a replacement malloc(). System-provided malloc() is almost always a generalist design; it should perform reasonably well on virtually all types of program, and poorly on none. But because the SPEC run rules permit using a replacement, someone trying to game SPEC can produce substantial gains in a number of the SPECint benchmarks by linking in an alternate memory allocator carefully tuned for SPEC.

Oh yes even in my own GPU work I’ve noticed this. I have a very naive memory allocation scheme (that could absolutely be improved rather easily) and when I profile my (otherwise very numerically intensive) code I spend a huge amount of time, like 50%, in memory allocation. It should be said memory allocation tends to be even slower on GPUs than CPUs.

WRT to CPUs though I’ve definitely watched a number of CPP talks on custom allocators, when to use them, and how for performance code. It seems it’s a perennially hot topic. Not used one myself.

mr_roboto said:
Also, the best compiler optimizations aren't really CPU-specific, even though they are often deployed to promote a specific CPU. Have a look at "The libquantum Dispute" in this web page on SPEC:

The SPEC Benchmarks at MROB

The SPEC Benchmarks -- Explore a wide variety of topics from large numbers to sociology at mrob.com

mrob.com

Aye, I’ve been trying to delineate between the two (unsuccessfully so far), compiler-specific optimizations and CPU-specific optimizations.

mr_roboto said:
Since winning SPEC wasn't just about building a good CPU, but also a good compiler, that's why they got into the compiler business. ICC is as much a marketing tool as it is a compiler.

Not to get too off topic, but I’m also wondering if the failure of Itanium had something to do with Intel’s later emphasis on compiling as the compilers of the day had difficulty producing performant code for it outside of easy to vectorize HPC code. My Dad actually quite liked his work’s itanium machines as they were quite fast on his team’s numerical code, but I’ve seen that inability to produce a wide range of code capable of actually making use of the chip cited as one (of many) of the failures of the platform.

mr_roboto · Jan 31, 2022

crazy dave said:
Not to get too off topic, but I’m also wondering if the failure of Itanium had something to do with Intel’s later emphasis on compiling as the compilers of the day had difficulty producing performant code for it outside of easy to vectorize HPC code. My Dad actually quite liked his work’s itanium machines as they were quite fast on his team’s numerical code, but I’ve seen that inability to produce a wide range of code capable of actually making use of the chip cited as one (of many) of the failures of the platform.

It's a good question, I don't know whether ICC has any history going back to their Itanium compilers.

HPC is the one thing I remember Itanium doing well at. It had a lot of FP execution resources and registers relative to its peers, and the downsides of Itanium's horrible architecture didn't dense non-branchy FP code much.

crazy dave · Jan 31, 2022

mr_roboto said:
It's a good question, I don't know whether ICC has any history going back to their Itanium compilers.

HPC is the one thing I remember Itanium doing well at. It had a lot of FP execution resources and registers relative to its peers, and the downsides of Itanium's horrible architecture didn't dense non-branchy FP code much.

Well I don’t know about the non-branching part: my Dad described it was old-style Fortran code with a rat’s nest of goto statements.

Having said that they had a guru in their department who was well over 70+ years old when my Dad started, thinking of retiring, but my Dad convinced him to stay on … he kept working for 20 more years. Didn’t fully retire until his 90s. Apparently a fantastic programmer and unbelievably good at hand tuning software for performance. So that probably helped.

MayaUser · Jan 31, 2022

crazy dave said:
And, in this case, the Mac is also actually cheaper too.

is cheaper? i think both are around the same

crazy dave · Jan 31, 2022

MayaUser said:
is cheaper? i think both are around the same

In the review it was a base pro model (a bit of a cheat)

For another, the laptop tested costs $3,999 and represents the absolute top-of-the-line Intel has to offer in a laptop. Apple’s 14-inch M1 Pro MacBook Pro costs half as much and performs nearly as well.

MayaUser · Jan 31, 2022

crazy dave said:
In the review it was a base pro model (a bit of a cheat)

oh ok, i thought its the m1 max with 2T ssd

crazy dave · Feb 1, 2022

MayaUser said:
oh ok, i thought its the m1 max with 2T ssd

They had both the max and the pro but for the purposes of just comparing the CPU those are the same. So, again, a bit of a cheat. Getting a closer match on features other than CPU and yeah they’d be closer in price.

leman · Feb 1, 2022

Rigby said:
Where is the evidence that Apple's compiler doesn't make as much use of them as is possible?

Where is the evidence that Intel used build setting that would allow Apple compilers to do their work properly? Again, the charts they posted have M1 30-40% slower compared to AMD CPUs than what has been reported by independent reviewers.

Rigby said:
It's been mentioned multiple times. The M1 SIMD execution units have only half the width (128 bits) of the AVX2 execution units in Intel's P cores.

Sure, but M1 has twice as many (generally speaking, not quite accurate). Modern x86 can do two 256-bit FMA per cycle, M1 can do four 128-bit FMAs per cycle.

Rigby said:
Says who? I'd be surprised if nobody uses it to compile performance critical libraries etc. It can be easily integrated in IDEs like Visual Studio.

It might be used in some HPC environments (but then again, any Intel-based supercomputer I have ever worked with shipped with GCC only). "Easily integrated"... hah! Wait until you learn about ABIs and all that kind of stuff...

Who knows, maybe the new ICC compiler suite will become more popular in the future. I have not yet seen any comprehensive tests of their new suite. Past results (e.g. https://iitd-plos.github.io/col729/labs/lab0/lab0_submissions/siy187504.pdf) failed to show any noteworthy lead of ICC over other compilers, which is one of the main reasons why I have hard time believing that Intel suddenly pulled 40% better performance out of there sleeve.

crazy dave · Feb 1, 2022

mr_roboto said:
It's a good question, I don't know whether ICC has any history going back to their Itanium compilers.

A quick Google search: Itanium 2 support in ICC was in version 7 so it would appear that ICC predates Itanium. So I’d say that my supposition is at first glance unlikely.

Andropov · Feb 1, 2022

mr_roboto said:
Also, the best compiler optimizations aren't really CPU-specific, even though they are often deployed to promote a specific CPU. Have a look at "The libquantum Dispute" in this web page on SPEC:

The SPEC Benchmarks at MROB

The SPEC Benchmarks -- Explore a wide variety of topics from large numbers to sociology at mrob.com

mrob.com

Huh. I had no idea that compilers could make code multithreaded by themselves. Wonder how that works. Is the number of threads decided at compile time? How many of them do they use? If so, what happens if the code was already being multithreaded in a higher level abstraction (multiple processes, for example)? Does it end up with N^2 instead of N threads then? I've read that what it does is equivalent to what some pragmas do (just implicitly).

Found this article on AnandTech where they mention why they use gcc as the compiler (and their opinion on parallelizing libquantum on ST benchmarks) btw:

So we wanted to keep the settings as "real world" as possible. We welcome constructive criticism to reach that goal. So we used:

64 bit gcc: most used compiler on Linux, good all round compiler that does not try to "break" benchmarks (libquantum...)

-Ofast: compiler optimization that many developers may use

-fno-strict-aliasing: necessary to compile some of the subtests
base run: every subtest is compiled in the same way.

pshufd · Feb 1, 2022

Andropov said:
Huh. I had no idea that compilers could make code multithreaded by themselves. Wonder how that works. Is the number of threads decided at compile time? How many of them do they use? If so, what happens if the code was already being multithreaded in a higher level abstraction (multiple processes, for example)? Does it end up with N^2 instead of N threads then? I've read that what it does is equivalent to what some pragmas do (just implicitly).

Found this article on AnandTech where they mention why they use gcc as the compiler (and their opinion on parallelizing libquantum on ST benchmarks) btw:

Autothreading has been around since around 2005 or 2007 I think.

The discussion in ICC is interesting as my former employer has been using it for quite some time and we were probably paying an Intel a small fortune for it (10K engineers is a lot of seats).

leman · Feb 1, 2022

Andropov said:
Huh. I had no idea that compilers could make code multithreaded by themselves.

This is an example of an optimization you won’t find in general purpose code. There is a reason why it is restricted to scientific computing, as you want to write code that is as simple as possible, runs as fast as possible, and usually has exclusive ownership of the system resources. For anything else such a transformation can have disasterous consequences.

theluggage · Feb 1, 2022

BigPotatoLobbyist said:
Yes, lol. Couldn't they even just literally build an ARM custom core and implement something similar to the TSO memory compatibility the M1 had(?) in order to better emulate X64 binaries?

...but Windows and Linux already have ARM-native versions with both x86-64 and x86-32 emulation(/translation?) (although I don't think the latter is such a big deal under Linux, since so much is already ARM native). Those are going to give better all-round performance since the core OS is native. From what people have said about running WoA on the M1, the emulation isn't bad. (...and I guess that, since both Apple and MS have done it, this also shows that Intel don't have any IP claim against people writing x86 translators).

BigPotatoLobbyist said:
It's not as if Apple's ability to emulate binaries in X64/X86 is any more privileged and yet they did a phenomenal job.

Apple have an easier job, since Rosetta II only has to be impressive for modern Mac apps that are already Mac OS 11+ compatible, 64-bit only and written to use the modern MacOS frameworks - which, even under Rosetta, get a further boost from GPU and other hardware in the M1 custom-designed to accelerate those frameworks.

Last I looked, although Win16 was dead (but still twitching) it's going to be a long while before Win32 compatibility can be killed. (That would be roughly the equivalent of Apple having only just succeeding to phase out Classic, but needing to keep Carbon going for at least another 5 years). In theory, since MS have been pushing the VM-based .Net framework for the last decade, binary compatibility should be a non-issue by now. In practice, PC users do expect that 20 year-old code to run...

...and that need for an insane level of backward-compatibility is both x86's curse and its only reason for continued existence. If Intel decided to start making ARM, RISC-V or some other new ISA that wasn't 100% compatible and super-fast with x86-64 and x86-32 then they'd be in direct competition with not just AMD, but Qualcomm, Samsung, Amazon and every other ARM licensee running Windows-on-ARM, Android or Linux (something that the x86 hasn't had to bother with since IBM anointed the 8086).

I don't think Intel's battle is against Apple and the M1 specifically - Apple ain't gonna be selling Apple Silicon SoCs to Dell, HP and Lenovo any time soon, and their "surprise! surprise!" attitude to roadmaps and breaking compatibility - they're kinda at the opposite extrme to Wintel - will always work against them in the enterprise sector. Intel are up against the very idea that you can do serious personal or enterprise computing without an x86 chip - and the M1 is adding a lot of credibility to that. The real threat would be Lenovo, Dell , Microsoft et. al. deciding that they could have their own chips, too. Frankly, Intel have already lost that battle - the last 10 years or so have seen them not only completely lose the mobile market, which proceeded to eat a large chunk of the lower-end PC market, but the rise of web services that use a Linux/Unix server, running scripting languages, talking to a web browser, none of which are particularly fussed about what ISA they're running on, and will ultimately switch to whatever architecture shaves 10% off their electricity bill.

We're past peak x86, past peak Windows - but those two were so huge that they will take years, if not decades to fade away. Intel still have a lot of easy money to make from x86 over that period - they need to think carefully and pragmatically about doing anything that would accelerate its demise. That's probably also why MS are being a bit lukewarm about really pushing Windows on ARM - anything without perfect legacy compatibility might make customers start taking the alternatives more seriously.

NB: At one point, Intel presumably had an ARM Architecture license - just like Apple - because they inherited the StrongARM chip from DEC. If that's not still current, I'm sure ARM would take their money (esp. now it isn't going to be NVIDIA) - so it's not as if they can't join the ARM party as soon as they've milked the fading x86 for all it is worth.

huge_apple_fangirl · Feb 1, 2022

theluggage said:
Apple have an easier job, since Rosetta II only has to be impressive for modern Mac apps that are already Mac OS 11+ compatible, 64-bit only and written to use the modern MacOS frameworks - which, even under Rosetta, get a further boost from GPU and other hardware in the M1 custom-designed to accelerate those frameworks.

Last I looked, although Win16 was dead (but still twitching) it's going to be a long while before Win32 compatibility can be killed. (That would be roughly the equivalent of Apple having only just succeeding to phase out Classic, but needing to keep Carbon going for at least another 5 years). In theory, since MS have been pushing the VM-based .Net framework for the last decade, binary compatibility should be a non-issue by now. In practice, PC users do expect that 20 year-old code to run...

Yes, this is why the idea of the PC works switching to ARM is premature. Apple has an easier time because they have so much less to be compatible with, as well as more loyal users and developers. Plus Microsoft has no financial incentive to switch- they already have dominance so they won’t gain share, and what’s one third party chipmaker or another to them?

theluggage said:
...and that need for an insane level of backward-compatibility is both x86's curse and its only reason for continued existence. If Intel decided to start making ARM, RISC-V or some other new ISA that wasn't 100% compatible and super-fast with x86-64 and x86-32 then they'd be in direct competition with not just AMD, but Qualcomm, Samsung, Amazon and every other ARM licensee running Windows-on-ARM, Android or Linux (something that the x86 hasn't had to bother with since IBM anointed the 8086).

NB: At one point, Intel presumably had an ARM Architecture license - just like Apple - because they inherited the StrongARM chip from DEC. If that's not still current, I'm sure ARM would take their money (esp. now it isn't going to be NVIDIA) - so it's not as if they can't join the ARM party as soon as they've milked the fading x86 for all it is worth.

Intel can’t “join the ARM” party. They need their x86 monopoly to be able to afford their own fabs. In a world where Intel is just another ARM-compatible chipmaker, that can’t have their own fabs.

leman · Feb 1, 2022

theluggage said:
Apple have an easier job, since Rosetta II only has to be impressive for modern Mac apps that are already Mac OS 11+ compatible, 64-bit only and written to use the modern MacOS frameworks - which, even under Rosetta, get a further boost from GPU and other hardware in the M1 custom-designed to accelerate those frameworks.

FWTW, Rosetta2 has no problems at all translating 32-bit code or code compiled for Windows. This doesn’t invalidate your points in any way of course, just wanted to clarify that Rosetta is a fully-features solution.

And of course, Rosetta’s job is made much much easier by the fact that M1 can emulate x86 memory ordering in hardware. Doing this in a purely software layer like MS has to is way trickier.

deconstruct60 · Feb 1, 2022

theluggage said:
NB: At one point, Intel presumably had an ARM Architecture license - just like Apple - because they inherited the StrongARM chip from DEC. If that's not still current, I'm sure ARM would take their money (esp. now it isn't going to be NVIDIA) - so it's not as if they can't join the ARM party as soon as they've milked the fading x86 for all it is worth.

It is highly doubtful that an Arch license for "Arm version n" gives you an Arch license for "Arm version n+2" or anything other than the version you bought. Arm is completely and utterly doomed as a viable business if one license purchases gets you all of their future intellectual property (IP) forever into the future. As a contract R&D company , that is basically giving away the product you produce forever. There is no way that works long term.

Similar thing with the notion that Apple gets a perpetual free there. They get the best discount that Arm offers anyone else ( sort of how US Govt gets lowest price offered.) perhaps. But forever free?

Arm can't be an "all you can eat buffet" when folk eat for multiple decades on one fixed lump sum. A model where the relatively smaller players who buy the design IP are doing all the heavy lifter while the rich "fat cats" with most of the profit margin from the builds all pay nothing is just an indirect Ponzi scheme. It will fail also over time. ( even faster when chip design costs are increasing as fabrication sizes get smaller. )

If a licensee wants to squat in the past forever then yeah that could be almost free . ( some really small unit fees until some patents run out perhaps. But far lower than fees for licensing actual implementation designs. ). But as the instruction set evolves there should be substantially "large enough" payments to keep the evolutionary progress going.

P.S. That said.. Intel has an active FPGA business. They have IPU / DPU ( infrastructure / data processor units ) products.

Intel-Architecture-Day-2021-IPU-Mount-Evans-ASIC-200G-IPU.jpg

Intel Mount Evans DPU IPU Arm Accelerator at Hot Chips 33

For Hot Chips 33 Intel went into more details around its new IPU member, the Intel Mount Evans DPU that has a lot of innovation in a new ASIC

www.servethehome.com

[ Intel probably took "off the shelf" Neoverse N1 cores here as a starting point but likely some custom integration here. Don't necessarily need an arch license if it s usage works well enough with what Arm came up with. ]

Pragmatically can't be a big player there without some Arm licenses. Intel is making Arm cores; just not the primary part of the desktop/laptop product.

Andropov · Feb 1, 2022

leman said:
This is an example of an optimization you won’t find in general purpose code. There is a reason why it is restricted to scientific computing, as you want to write code that is as simple as possible, runs as fast as possible, and usually has exclusive ownership of the system resources. For anything else such a transformation can have disasterous consequences.

Isn't a lot of scientific computing using clusters now, where a job unexpectedly going multithreaded would be a problem? I've seen people queuing one simulation/process per core, for example (with different parameters). Having each of those processes spawn 16 threads at some point would be hilarious (I don't think it would crash, just that there would be a lot of frivolous context switches).

Maybe I'm misunderstanding, but (reading Intel's documentation) it looks like what they're doing is basically inserting OpenMP directives in trivially* parallelizable code. By default OpenMP launches as many threads as there are cores, regardless of how many other tasks may be running.

Anyway it looks like it needs to be explicitly enabled on Linux and macOS to create extra threads, defaults to no autothreading.

*Intel's examples on that page even have a fixed number of iterations for the loops it parallelizes, which makes sense, since if the number of iterations is unknown at compile time you might end up trying to parallelize tiny loops. Don't know if this is a requirement, or if autothreading also works on dynamically sized data.

Intel Alder Lake vs. Apple M1

macrumors 68000

macrumors Core

macrumors 68000

macrumors 603

macrumors 6502a

macrumors 6502a

macrumors 68000

macrumors 6502a

macrumors 68000

macrumors 6502a

macrumors 68000

macrumors 68040

macrumors 68000

macrumors 68040

macrumors 68000

macrumors Core

macrumors 68000

macrumors 6502a

macrumors G4

macrumors Core

macrumors G3

macrumors 6502a

macrumors Core

macrumors G5

macrumors 6502a

Our Staff