Hi, doctoral student in computer architecture here. This article did a surprisingly good job of explaining the M1’s success, which can be summarized in the following equation:
Fast Decode + Big Ass ROB + Shared Memory Hierarchy = Success
You show that equation to any student of an advanced architecture course, and they’ll be able to tell you that following it will give you a fast computer. In that sense, Apple’s SoC isn’t novel. That isn’t a knock against Apple. Actually manufacturing this chip is a huge achievement and a tough engineering challenge. Instead, what I’m trying to dispel is the notion that Apple has all the talent, and Intel and AMD have unimaginative engineers. Because what the article doesn’t go into, and I haven’t heard mentioned often, is the piece that is left out of a purely technical explanation of the M1: cost.
60% of industrial architecture decision making is cost. And the fact that Apple sells to themselves, and not to a broad, varied market, means that they can do outrageous things that Intel, AMD, or Samsung cannot. It’s not that the engineers at these companies don’t realize that they can do these things; it’s that those things don’t make sense given their constraints. Below is a snippet (with a few edits) where I tried to explain this to someone when the M1 first came out. Sorry if it is a little out of context.
Yeah, Intel/AMD do different designs for different markets, but the way they do that is a hierarchical approach. You start with a shared ISA across the full spectrum, then you have microarchitectural implementations that are separated broadly by market, and finally the actual bins that correspond to the different tiers of final products. Each point in the hierarchy has a different cost associated with features:

- The ISA (e.g. x86) goes across everything (i.e. servers, laptops, mobile). Here you might add instructions like AVX, since their support is easily included/excluded per design. You probably won’t do something like give up a bit of the opcode for conditional execution if you know it will only benefit server chips. If you were to do something like that, the profit expected on server sales due to increased performance would need to justify it.

- The microarchitecture (e.g. Skylake) is where most of the cost-benefit decisions are made. Like the article pointed out, this is where your ISA might limit your ability to scale up. If your instructions are complicated to decode (as x86’s are), then the circuitry added per additional unit of issue width is greater than for a simpler ISA (like Arm). Another example is the total number of cores that the microarchitecture supports. Does increasing the max core count for an i9 justify the additional cost it imposes on the i3? Because even if you don’t use the full core count in the i3, the circuitry to support it still exists in each core IP. All these decisions are closely tied to the third bullet: binning.

- Binning greatly impacts the decisions in the previous two bullets, because this is where physically manufacturing the chip comes into play. Let’s say you added a wider decoder: how does that affect your lowest bin, where the TDP is distinctly lower? Or, more likely, how does it affect the max clock frequency you can reliably run the silicon at? If only the highest bin supports your target frequency, and you cannot reliably produce that bin on the wafer, then you will end up with a lot of lower-tier volume whose max frequency falls short of the highest tier’s (which you don’t have many of to sell anyway!). Another thing that Apple did was bump up the cache size. When a cache isn’t reliably fabricated, that is a common reason a die goes to a lower bin. If you can’t produce the full cache size on most of your yield, very few parts are going to receive the benefit of your beefy cache. And let’s say in the end you are willing to accept lower yield to get this amazing high-end part that you will charge a lot for: will the lower-end market be saturated by the increased volume, or will it absorb that volume so you can recoup costs? (The toy model after this list tries to make that arithmetic concrete.)
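To make the binning arithmetic concrete, here is a toy back-of-envelope model. Every number in it is made up (dies per wafer, bin cutoffs, prices, frequency spreads, defect rates), and it is only a sketch of the kind of spreadsheet that informs these decisions, not anything a real vendor uses. The point is just that a microarchitectural change which lowers the achievable clock or grows the die can reduce revenue per wafer even though the part is “faster” on paper.

```python
# Toy model of how binning interacts with a microarchitectural choice.
# All numbers are invented for illustration; none are real Intel/AMD/Apple data.
import random

random.seed(0)

DIES_PER_WAFER = 400          # hypothetical dies per wafer
BINS = [                      # (minimum stable frequency in GHz, sale price in $)
    (4.8, 550),               # top bin
    (4.4, 320),
    (4.0, 180),
    (0.0, 90),                # salvage bin
]

def wafer_revenue(mean_fmax_ghz, sigma_ghz, defect_rate):
    """Revenue from one wafer given a frequency distribution and a defect rate."""
    revenue = 0.0
    for _ in range(DIES_PER_WAFER):
        if random.random() < defect_rate:       # die is unusable, sells for nothing
            continue
        fmax = random.gauss(mean_fmax_ghz, sigma_ghz)
        for min_freq, price in BINS:            # highest bin the die qualifies for
            if fmax >= min_freq:
                revenue += price
                break
    return revenue

# Design A: narrower decode, smaller die -> higher fmax on average, better yield.
narrow = wafer_revenue(mean_fmax_ghz=4.7, sigma_ghz=0.25, defect_rate=0.05)

# Design B: wider decode / bigger cache -> more IPC, but a lower fmax distribution
# and more area exposed to defects, so fewer dies land in the top bins.
wide = wafer_revenue(mean_fmax_ghz=4.5, sigma_ghz=0.30, defect_rate=0.10)

print(f"narrow design: ${narrow:,.0f} per wafer")
print(f"wide design:   ${wide:,.0f} per wafer")
```

In this toy world the “better” core brings in less per wafer unless the top bins are repriced or the low-end market can absorb the extra volume, which is exactly the tension described above. Apple’s situation collapses most of that table: they sell a handful of configurations to themselves, so a choice like a wider decoder or a huge cache only has to pay off in one place.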
There is no easy way to address all these issues, which is why companies like Intel/AMD are forced to be more conservative than Apple, which serves a much smaller market with fewer bins. Apple’s total vertical integration also plays a role. Since they control the software stack and most apps running on a Mac use their APIs, they can be sure that programs are compiled to take advantage of hardware features. And since they make an SoC with lots of accelerators, not just a CPU, they can predict the characteristics of the workloads that actually land on the CPU and be sure they are getting the most utilization out of their cores.