Why does Apple act like Intel is always holding them back?

leman · Jan 12, 2021

jdb8167 said:
I doubt neither Intel nor AMD is ever going to be explicit on their decision making process but I think the assumption is that Intel and AMD have very smart engineers making these decisions. If a wider decode was worth the transistor budget I expect that they would do it. The fact that they haven't is a pretty good indication that 4-wide is about as good as it is going to get.

What about ARM itself? Their Cortex X-1 has 5 instruction decoders. Apple A7 had 6 decoders (and a 192-long ROB) in 2013! What was ARM doing these past seven years?

jdb8167 said:
The problem is probably a basic computer science issue of cascading complexity. An x86_64 instruction can be anywhere from 1 byte to 15 bytes wide. When you start the decode stream, you have no idea how long the instruction will be. You have to finish the decode to find out. So each decoder has to start somewhere in the middle of another decoder's stream. You start to see the problem as you add decoders, the problem gets more and more complex. There is no easy way to know which decoder will reach the end of an instruction.

You are making this problem more complex than it has to be. Just because x86 has variable-length instructions, it doesn't mean that the instruction stream has to be scanned sequentially. It is possible to examine a block of instructions (say, 16-bytes) at once, detecting significant bit patterns. Similar techniques are used to process sequential data using SIMD-hardware, for instance for UTF8 validation or JSON parsing. Basically you test each byte for every possible significant bit pattern (which are limited) and then use masking and logical operations to mark element boundaries.

I see no reason why Intel or AMD wouldn't use a similar technique (or maybe something even smarter), as it easily maps to a hardware implementation, can be pipelined and will deliver results simultaneously. It would be obviously more complex than fixed-length decoding, and I can imagine that the silicon ends up more power-hungry, but it would be as fast — with minimal added latency.

Again, I have no proof and I will freely admit that my opinion is just an uneducated speculation. But I have also seen no clear proof that x86 is limited by the variable length of its ISA. My gut feeling is that the reason why Intel and AMD are stuck at 4-wide decode (with some optimizations here and there) is essentially the same as why ARM Cortex designs are stuck at 4-5 wide decode(*). Because neither of them can figure out how to execute the instruction stream that far ahead. It's not the ISA — it's the architectural approach taken by these companies that is unable to scale beyond a certain width. Apple however managed to circumvent this.

(*) Since x86 packs load/store+op into a single instruction and ARM needs a separate load/store instruction, 4 x86 decoders probably just about match 5-6 ARM decoders in practice.

pshufd · Jan 12, 2021

leman said:
I see no reason why Intel or AMD wouldn't use a similar technique (or maybe something even smarter), as it easily maps to a hardware implementation, can be pipelined and will deliver results simultaneously. It would be obviously more complex than fixed-length decoding, and I can imagine that the silicon ends up more power-hungry, but it would be as fast — with minimal added latency.

Again, I have no proof and I will freely admit that my opinion is just an uneducated speculation. But I have also seen no clear proof that x86 is limited by the variable length of its ISA. My gut feeling is that the reason why Intel and AMD are stuck at 4-wide decode (with some optimizations here and there) is essentially the same as why ARM Cortex designs are stuck at 4-5 wide decode(*). Because neither of them can figure out how to execute the instruction stream that far ahead. It's not the ISA — it's the architectural approach taken by these companies that is unable to scale beyond a certain width. Apple however managed to circumvent this.

(*) Since x86 packs load/store+op into a single instruction and ARM needs a separate load/store instruction, 4 x86 decoders probably just about match 5-6 ARM decoders in practice.

AMD has said that there is no benefit to having more than four decoders. Intel also has four so I assume that Intel came to the same conclusion some time ago before AMD. I have no doubt that AMD and Intel tested chips with different numbers of decoders (or simulated them). That they both have four indicates that this is what works best for x86 at this time.

Krevnik · Jan 12, 2021

leman said:
An interesting opinion I keep hearing here and there (unfortunately, with no factual proof) is that ARM64 has been designed with quite a bit of Apple's input. Allegedly folks at Apple had an idea how to make a fast and energy efficient CPU and they have approached ARM with the task of cleaning up the ISA to make this CPU possible. Basically, according to this story Apple started designing their architecture before the ARM64 ISA was finalized. I don't think this is very far fetched either, because they managed to deliver first mobile 64-bit device almost two years before anyone else. A7 has caught everybody by quite a surprise.

That’s partly what I was alluding to, yeah. Assuming it’s true, Apple created opportunities for itself through careful planning. Stuff that would pay out years down the road.

But even if that’s not true, Apple being able to drop ARM32 and Thumb still frees up footprint and provides flexibility to the silicon engineers. And that requires planning that pays off years down the road too.

In either case, Apple’s got engineers working on long time scales here, compared to other businesses. Yearly or quarterly cycles are common, and if your project requires a longer horizon, it tends to be harder to get traction in my experience. I had one opportunity to work on a team where management worked on longer cycles, and it was honestly exciting. The product is now dead, due to various other mis-steps, but long term thinking, and making sure you invest in projects that act as leverage to let you do interesting things 2-3 years down the road is something that still sticks with me as an important lesson.

And Apple has been doing this on the software side as well. iOS as it was growing added functionality in stages, to enable the next year’s products and features. Auto layout in iOS leads to Apple providing multiple sizes of iPhone, split screen on iPad, even the iPhone X, etc.

transitobserver · Jan 12, 2021

leman said:
Apple's basic recipe (very wide cores) was already apparent in 2014, as was their advantage over the usual ARM designs. But somehow, 6 years later, ARM's newest high-performance design, Cortex-X1 is narrower than now ancient Apple A7. So they either deliberately decided agains using a wider design, or simply couldn't figure out how to do it. I tend to believe the later.

Anyway, just to make it clear — I am not trying to argue for the sake of arguing. I am genuinely curious about how Apple pulled it off. I don't understand anything about practical hardware design, but I think I have a fairly good grasp of algorithmic complexity of the task, and it does seem like Apple is the only company so far that has managed to break through some sort of OOO scalability plateau. They must be doing something differently from the others and I don't believe that it's a simple trick.

Samsung had a 6-wide CPU design back in 2018 and was going to move to an 8-wide design this year if they hadn't disbanded the design team. It performed worse than the equivalent ARM Cortex core and consumed way more power though which means it's tougher than "just make it wider".

neinjohn · Jan 13, 2021

Well Qualcomm may well have just bought the 8-wide design and secret sauce acquiring Nuvia.

pshufd · Jan 13, 2021

neinjohn said:
Well Qualcomm may well have just bought the 8-wide design and secret sauce acquiring Nuvia.

Wonder what the implications for Intel are. Intel has fumbling against AMD and now ARM for quite some time. Their 2021 roadmap doesn't look encouraging.

neinjohn · Jan 13, 2021

pshufd said:
Wonder what the implications for Intel are. Intel has fumbling against AMD and now ARM for quite some time. Their 2021 roadmap doesn't look encouraging.

On laptop world AMD good work on perf/watt and decision to offer 6 and 8 full power cores on <1000$ products may well be what keeps x86 relevant for some time as Apple is not offering a direct competition. It should expand AMD market-share from differentiation and price.

Meanwhile Intel may well be shooting their own feet moving to BIG.little on Alder Lake as it invites direct comparison. The worst part is that Intel is unable to do what AMD can.

pshufd · Jan 13, 2021

neinjohn said:
On laptop world AMD good work on perf/watt and decision to offer 6 and 8 full power cores on <1000$ products may well be what keeps x86 relevant for some time as Apple is not offering a direct competition. It should expand AMD market-share from differentiation and price.

Meanwhile Intel may well be shooting their own feet moving to BIG.little on Alder Lake as it invites direct comparison. The worst part is that Intel is unable to do what AMD can.

The only way for Intel to make better chips right now would be to call TSM or Samsung to place an order.

dmccloud · Jan 13, 2021

leman said:
This doesn't make much sense though. Other ARM CPUs implement the same ISA, yet they are considerably slower. Similarly, modern Intel Atom cores can match the power efficiency of some ARM CPUs, even though they have to decode the same variable-width ISA. And finally, attributing performance increase in general-purpose software to "special-purpose" transistors doesn't make much sense.

Clearly, there must be something deeper at the architectural level that Apple is doing.

While the ISA is the same, the underlying architecture of the CPU cores is unique to Apple. The M1 is what Apple has been playing a slow game to creating since they first brought CPU design in-house. Think of it like this: the iPhone/iPad processors led to the iPad Pro chips, and lessons learned there have led to the M1 and beyond. Most of the competitors in that space (Qualcomm and Samsung's Exynos teams) are simply licensing existing core designs from ARM and plugging them into their SoCs as if they're playing with LEGOs, which means they are somewhat limited by those design constraints. OTOH, Apple simply licensed the underlying ISA, and essentially has free reign to iterate chip and core designs. This is why the functions of the T2 coprocessor have been integrated into the M1, and you have CPU, GPU, and Neural Engine cores all in the system die for the M-series of SoCs. This is also why Windows on ARM runs so much faster on a Mac via either QEMU or Parallels than it does on the Surface Pro X (either the SQ1 or SQ2 variants) - the Apple chip designs have basically lapped the field in terms of overall performance and efficiency.

leman · Jan 13, 2021

dmccloud said:
While the ISA is the same, the underlying architecture of the CPU cores is unique to Apple. The M1 is what Apple has been playing a slow game to creating since they first brought CPU design in-house.

Sure, I know this, I was preaching this very point for the last couple of years

dmccloud said:
Think of it like this: the iPhone/iPad processors led to the iPad Pro chips, and lessons learned there have led to the M1 and beyond.

It's more than that actually. Already A7 was a very wide architecture — even by today's standards. As I mentioned somewhere above, ARM's brand new high-performance Cortex X-1 is not as wide as A7 in many regards - a seven year old design.

Argon_ · Jan 13, 2021

dmccloud said:
While the ISA is the same, the underlying architecture of the CPU cores is unique to Apple. The M1 is what Apple has been playing a slow game to creating since they first brought CPU design in-house. Think of it like this: the iPhone/iPad processors led to the iPad Pro chips, and lessons learned there have led to the M1 and beyond. Most of the competitors in that space (Qualcomm and Samsung's Exynos teams) are simply licensing existing core designs from ARM and plugging them into their SoCs as if they're playing with LEGOs, which means they are somewhat limited by those design constraints. OTOH, Apple simply licensed the underlying ISA, and essentially has free reign to iterate chip and core designs. This is why the functions of the T2 coprocessor have been integrated into the M1, and you have CPU, GPU, and Neural Engine cores all in the system die for the M-series of SoCs. This is also why Windows on ARM runs so much faster on a Mac via either QEMU or Parallels than it does on the Surface Pro X (either the SQ1 or SQ2 variants) - the Apple chip designs have basically lapped the field in terms of overall performance and efficiency.

Apple trouncing Microsoft's Windows on Arm performance under emulation was an impressive moment.

dmccloud · Jan 14, 2021

leman said:
Sure, I know this, I was preaching this very point for the last couple of years

It's more than that actually. Already A7 was a very wide architecture — even by today's standards. As I mentioned somewhere above, ARM's brand new high-performance Cortex X-1 is not as wide as A7 in many regards - a seven year old design.

Which raises an interesting question: are the Cortex designs intentionally conservative on the part of ARM, or has Apple stumbled upon some design trick nobody else even thought of?

jdb8167 · Jan 14, 2021

dmccloud said:
Which raises an interesting question: are the Cortex designs intentionally conservative on the part of ARM, or has Apple stumbled upon some design trick nobody else even thought of?

It could be as simple as available transistor budget. Arm has to design for the common case where Apple is under no constraints outside of their price point and desired margin. Perhaps it is because Apple can afford a 20% larger die that Arm’s sales & marketing determined was too expensive for the expected return.

I just assume that the companies other than Apple have good engineering too so it must be some engineering constraint. For Arm, maybe they’ve decided that their customers get more value out of a more conservative design. For Intel & AMD, perhaps they’ve decided that their transistor budget is better applied elsewhere based on their x86 architecture. This is all conjecture but everyone should start with the idea that good semi design is available to all companies until shown otherwise.

pshufd · Jan 14, 2021

jdb8167 said:
It could be as simple as available transistor budget. Arm has to design for the common case where Apple is under no constraints outside of their price point and desired margin. Perhaps it is because Apple can afford a 20% larger die that Arm’s sales & marketing determined was too expensive for the expected return.

I just assume that the companies other than Apple have good engineering too so it must be some engineering constraint. For Arm, maybe they’ve decided that their customers get more value out of a more conservative design. For Intel & AMD, perhaps they’ve decided that their transistor budget is better applied elsewhere based on their x86 architecture. This is all conjecture but everyone should start with the idea that good semi design is available to all companies until shown otherwise.

There are a lot of ARM-based devices that do just fine too. I have a Garmin watch, gets great battery life, has a ton of features and just gets the job done. I'd guess that they aren't on 5 or 7 or 8 nm as there's no need for it. Would it be nice? Sure. Is it needed? Not at this time. I have a 2010 Garmin Nuvi NAV that works just fine as well. And these chips go in lots of other things.

PCs seem to be a different market where performance matters.

I'd like to know if there is some real secret sauce outside of what we have seen in articles. If there is, then I'm quite happy if it remains a secret for several more years. At least until they come up with the next secret sauce.

dmccloud · Jan 14, 2021

jdb8167 said:
It could be as simple as available transistor budget. Arm has to design for the common case where Apple is under no constraints outside of their price point and desired margin. Perhaps it is because Apple can afford a 20% larger die that Arm’s sales & marketing determined was too expensive for the expected return.

I just assume that the companies other than Apple have good engineering too so it must be some engineering constraint. For Arm, maybe they’ve decided that their customers get more value out of a more conservative design. For Intel & AMD, perhaps they’ve decided that their transistor budget is better applied elsewhere based on their x86 architecture. This is all conjecture but everyone should start with the idea that good semi design is available to all companies until shown otherwise.

Intel and AMD are still stuck on larger processes than Apple (7nm for AMD, 10 (mobile) and 14 (desktop) for Intel), so that's a big factor. Think of it like this: if you were to build a 30nm wide processor die, Intel could fit either 2 (14nm) or 3 (10nm) transistors across it, AMD could fit 4, and Apple could fit 6. Even if you widened the die to 35nm to account for gaps between traces, those numbers would remain unchanged. Not all ARM designs are created equal, Samsung has proven that with the differences in their flagship phones running Snapdragon chips vs. identical models running their own Exynos chips.

jdb8167 · Jan 14, 2021

dmccloud said:
Intel and AMD are still stuck on larger processes than Apple (7nm for AMD, 10 (mobile) and 14 (desktop) for Intel), so that's a big factor. Think of it like this: if you were to build a 30nm wide processor die, Intel could fit either 2 (14nm) or 3 (10nm) transistors across it, AMD could fit 4, and Apple could fit 6. Even if you widened the die to 35nm to account for gaps between traces, those numbers would remain unchanged. Not all ARM designs are created equal, Samsung has proven that with the differences in their flagship phones running Snapdragon chips vs. identical models running their own Exynos chips.

Except that the A-series has been wider even at TSMC's 10 nm. I think the A11 was a 7 wide decode stage. So it isn't just 5nm.

Krevnik · Jan 14, 2021

dmccloud said:
Which raises an interesting question: are the Cortex designs intentionally conservative on the part of ARM, or has Apple stumbled upon some design trick nobody else even thought of?

Honestly, sometimes it can just be a case of not needing to spend footprint on legacy. Cortex supports A32, A64 and T32 (Thumb) modes. A14/M1 only supports A64. That makes the decoders simpler, OOO schedulers less constrained, and you can ditch any logic related to managing those other modes that can’t be shared with the A64 mode. A64 also simplifies the exception model, and limits the use of conditional instructions in the ISA. Less transistor budget spent on ISA requirements, as well as making the job of the OOO scheduler easier.

There’s probably some design tricks in there as well, but generally it’s a good thing when you can free up parts of your transistor budget for more interesting things.

Search

Search

Why does Apple act like Intel is always holding them back?

leman

macrumors Core

pshufd

macrumors G4

Krevnik

macrumors 601

transitobserver

macrumors newbie

neinjohn

macrumors regular

pshufd

macrumors G4

neinjohn

macrumors regular

pshufd

macrumors G4

dmccloud

macrumors 68040

leman

macrumors Core

Argon_

macrumors 6502

dmccloud

macrumors 68040

jdb8167

macrumors 601

pshufd

macrumors G4

dmccloud

macrumors 68040

jdb8167

macrumors 601

Krevnik

macrumors 601

Our Staff