Agreed. We have 8 CPU cores, 8 GPU cores (in the higher-end version), and 16 ML cores. Intel/AMD currently do not have dedicated machine learning cores, which can definitely be put to use by a lot of software. The chip also contains the Secure Enclave, low-power video playback, an image signal processor and more - things that Apple and third-party software can take advantage of to improve performance. Some of these are things a generic processor cannot include.
I don’t think that anyone would argue against the fact that the specialized accelerators Apple includes in their chip give them a unique advantage in embracing new technologies and new applications. I just have an issue with the claim that Apple Silicon is only fast because it contains these specialized units and runs custom software. Apple CPUs are simply extremely fast general purpose CPUs, and that’s pretty much where their “secret” begins and ends.
By the way, Intel CPUs have included a series of technologies aimed at ML acceleration since Ice Lake. Look up Deep Learning Boost.
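For instance, the VNNI part of DL Boost adds fused int8 dot-product instructions that are exposed through intrinsics in immintrin.h. Rough sketch in C++ below - the intrinsic names are real, but the `dot_i8` function and the loop around it are just mine for illustration (assumes `n` is a multiple of 64 and that the CPU actually supports AVX512-VNNI):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Accumulate int8 dot products using AVX512-VNNI.
// _mm512_dpbusd_epi32 multiplies 64 pairs of (unsigned, signed) 8-bit values
// and accumulates groups of four products into 16 lanes of 32-bit sums,
// all in a single instruction - the kind of "ML acceleration" DL Boost adds.
void dot_i8(const uint8_t* a, const int8_t* b, int32_t* out, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        acc = _mm512_dpbusd_epi32(acc, va, vb);  // requires AVX512-VNNI
    }
    _mm512_storeu_si512(out, acc);  // 16 partial 32-bit sums, reduce as needed
}
```

Whether that is as efficient as a dedicated NPU is a separate question, but the acceleration exists on the CPU side too.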
I think it's a wash, to be honest. I have 20 years of experience in the semiconductor industry watching how hard it is to marry software with hardware. You need twice as many software engineers as hardware engineers to even get close to making a chip successful. I can't convince you; you would have had to live my life for the past 20 years!
I certainly don’t have any intention to challenge your experience; I merely wish to understand your point. As you say, your work area was Bluetooth - which by definition is not general purpose, and would obviously require a lot of hardware-specific, low-level programming expertise. I just don’t see how this line of thought applies to Apple CPUs. They have to run generic, CPU-agnostic code - most of it quite terrible, to be frank - and they do so exceptionally well.
I think that is what is being mentioned, and it does impact the complexity of the decoders. Specifically, AMD has commented that they noticed there's an upper limit on the useful number of decoders on x86 due to the need to handle these kinds of instructions, despite them being rarely used. The A14/M1 architecture is already beyond that particular limit, and it does benefit from it.
Yes, decoding complexity is a very interesting topic. There was a quite involved discussion with some smart people recently on RWT; I had difficulty following it, and I don’t think there was much consensus in the end. But it should be fairly clear that a fixed instruction width is simpler to build a fast decoder around.
Still, very long instruction word is something completely different. x86 is not VLIW.
This is probably the closest thing to "custom" in Apple's CPU design (EDIT: specifically the CPU design, not the SoC). But that's more to support modern languages like Swift (and Objective-C using ARC). There are a lot of other multi-threaded scenarios that would benefit from this as well, so even that's not terribly custom; it's just examining existing usage of the CPU and adapting to it.
More like optimization of a frequently used feature. Because if we start acknowledging these things as “custom”, then we will also have to say that features such as stack manipulation, branch prefetching and speculative execution are “custom” too. It’s just designing hardware so that common software patterns can be executed quicker.
By the way, fast atomics are probably only possible because ARM has a more relaxed memory ordering requirement than x86. This is an area where the ISA matters.
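To illustrate what I mean - just a sketch in C++, since ARC itself lives in the Objective-C/Swift runtime and the `RefCounted` / `retain` / `release` names here are mine - a retain is essentially an atomic increment that can legally use relaxed ordering, and a release only needs stronger ordering on the path where the count hits zero. The weaker the memory model, the more freedom the hardware has to make that increment cheap:

```cpp
#include <atomic>

// Minimal sketch of the reference-counting pattern ARC/Swift leans on.
// The increment (retain) can use relaxed ordering; the decrement (release)
// needs release semantics plus an acquire fence before destruction, so the
// object's final writes are visible to whichever thread frees it.
struct RefCounted {
    std::atomic<long> refs{1};

    void retain() {
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    void release() {
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;  // last reference gone
        }
    }
};
```

This is the same ordering recipe a typical shared_ptr control block uses, which is why making these atomics fast pays off far beyond Swift.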
P.S. What I would certainly describe as a “custom” feature of the M1 is the switchable (in hardware) memory ordering mode, which allows it to emulate the x86 execution environment efficiently and correctly. Without it, something like Rosetta 2 would have been much more difficult to pull off.
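A concrete way to see why that matters is the classic message-passing litmus test (sketch below; this is not Rosetta’s actual code, just an illustration of the hardware models). Translated x86 code uses plain stores and loads, which on x86-TSO are never reordered store-against-store or load-against-load; on ARM’s weaker model the same instruction sequence can be reordered unless barriers are inserted everywhere - or unless the core is simply switched into a TSO mode, which is what the M1 provides:

```cpp
#include <atomic>
#include <thread>

// Message-passing litmus test. Relaxed atomics ask the compiler to emit plain
// loads/stores with no barriers, so (compiler reordering aside) what the
// reader can observe is down to the hardware memory model.
std::atomic<int> data{0};
std::atomic<int> flag{0};

void writer() {
    data.store(1, std::memory_order_relaxed);  // write payload
    flag.store(1, std::memory_order_relaxed);  // then publish it
}

void reader() {
    while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
    // Under x86-TSO, seeing flag == 1 guarantees data == 1 here.
    // Under ARM's weaker ordering, data can still read 0 - the case
    // Rosetta 2 avoids by running translated code with hardware TSO on.
    int observed = data.load(std::memory_order_relaxed);
    (void)observed;
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```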