
jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
Note that the Intel Alder Lake performance core has a 6-wide decoder. So x86-64 is no longer limited to a 4-wide decoder.
 

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
Note that the Intel Alder Lake performance core has a 6-wide decoder. So x86-64 is no longer limited to a 4-wide decoder.
Interesting, AMD was apparently wrong. Unless the decoders can't actually be fully utilized.
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
Interesting, AMD was apparently wrong. Unless the decoders can't actually be fully utilized.
According to the AnandTech article, when they asked Intel whether some of the decoders were simplified, Intel declined to answer.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Interesting, AMD was apparently wrong. Unless the decoders can't actually be fully utilized.

It’s not that AMD was wrong, it’s just that x86 CPUs have been limiting the decode to windows of 16 bytes. Intel is now doubling that, which makes having more decoders feasible.

I wonder how much die area the Alder Lake decode logic takes, and what the implications are for power consumption. The leaked PL limits for Alder Lake look intimidating. It's possible that Intel is trading power for performance.
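
To make the window point concrete, here's a toy C sketch (my own illustration, not how any real decoder is built): with a variable-length ISA you can't know where instruction N+1 starts until you've at least length-decoded instruction N, so the fetch window caps how many instructions you can even locate per cycle.

[CODE]
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for x86 length decoding. Real hardware has to parse
 * prefixes, opcode maps, ModRM/SIB and displacement/immediate fields;
 * this just fakes a 1-15 byte length from the first byte. */
static size_t insn_length(const uint8_t *p)
{
    return (size_t)(p[0] % 15) + 1;
}

/* Count how many instructions fit in one fetch window. Note the
 * serial dependency: each iteration needs the previous instruction's
 * length before it knows where the next one begins. */
static size_t insns_per_window(const uint8_t *window, size_t window_bytes)
{
    size_t off = 0, count = 0;
    while (off < window_bytes) {
        size_t len = insn_length(window + off);
        if (off + len > window_bytes)
            break;  /* instruction straddles the window boundary */
        off += len;
        count++;
    }
    return count;   /* ~4 instructions for 16 B, ~6-8 for 32 B, at an
                       average x86 instruction length of ~4 bytes */
}
[/CODE]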
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
It’s not that AMD was wrong, it’s just that x86 CPUs have been limiting the decode to windows of 16 bytes. Intel is now doubling that, which makes having more decoders feasible.

I wonder how much die area the Alder Lake decode logic takes, and what the implications are for power consumption. The leaked PL limits for Alder Lake look intimidating. It's possible that Intel is trading power for performance.
For reference, I copied the leaked (rumored?) Alder Lake power levels here:
 

Joelist

macrumors 6502
Jan 28, 2014
463
373
Illinois
Here is the relevant part of the AnandTech article:

Starting off with the most directly obvious change: Intel is moving from being a 4-wide decode machine to being a 6-wide microarchitecture, a first amongst x86 designs, and a major design focus point. Over the last few years there has been a discussion point about decoder widths and the nature of x86's variable-length instruction set, which makes it difficult to design decoders that go wider, compared to, say, a fixed-length ISA like Arm's, where adding decoders is relatively easier to do. Notably, last year AMD's Mike Clark had noted that while it's not a fundamental limitation, going for decoders wider than 4 instructions can create practical drawbacks, such as added complexity and, most importantly, added pipeline stages. For Golden Cove, Intel has decided to push forward with these changes, and a compromise that had to be made is that the design now adds an additional stage to the mispredict penalty of the microarchitecture, so the best case goes up from 16 cycles to 17 cycles. We asked if there was still a kind of special-case decoder layout as in previous generations (such as the 1 complex + 3 simple decoder setup), however the company wouldn't go deeper into the details at this point in time. To feed the decoder, the fetch bandwidth going into it has been doubled from 16 bytes per cycle to 32 bytes per cycle.

Intel states that the decoder is clock-gated 80% of the time, instead relying on the µOP cache. This has also seen extremely large changes this generation: first of all, the structure has now almost doubled from 2.25K entries to 4K entries, mimicking a similar large increase we had seen with the move from AMD’s Zen to Zen2, increasing the hit-rate and further avoiding going the route of the more costly decoders.

Sounds like a halfway house between typical x86 and what Apple has done.
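
Back-of-the-envelope on that fetch-bandwidth doubling (my numbers, assuming an average x86-64 instruction length of roughly 4 bytes):

[CODE]
#include <stdio.h>

int main(void)
{
    const double avg_insn_bytes = 4.0;  /* rough x86-64 average (assumption) */
    const double old_fetch = 16.0;      /* bytes/cycle before Golden Cove */
    const double new_fetch = 32.0;      /* bytes/cycle in Golden Cove */

    /* Instructions the fetch stage can deliver per cycle, on average. */
    printf("old: ~%.0f insns/cycle\n", old_fetch / avg_insn_bytes);  /* ~4 */
    printf("new: ~%.0f insns/cycle\n", new_fetch / avg_insn_bytes);  /* ~8 */

    /* So the 32-byte window can actually keep a 6-wide decoder busy,
     * which a 16-byte window couldn't. */
    return 0;
}
[/CODE]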
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
Here is the relevant part of the AnandTech article:



Sounds like a halfway house between typical x86 and what Apple has done.
Also, read the comments (I can't believe I just said that). There are some very knowledgeable posters commenting on the article. Maynard Handley (who goes by @name99) knows the M1 architecture particularly well and compares what Intel is doing against Apple Silicon.

Edit: Apparently he has an account here as well.
 

Digitalguy

macrumors 601
Apr 15, 2019
4,653
4,482
Also, read the comments (I can't believe I just said that). There are some very knowledgeable posters commenting on the article. Maynard Handley (who goes by @name99) knows the M1 architecture particularly well and compares what Intel is doing against Apple Silicon.

Edit: Apparently he has an account here as well.
Very interesting stuff, but also very technical, and probably too technical for the vast majority of people.
I thought I knew this stuff better than average, but the level of those comments is extremely high, and several things are over my head too.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
It’s not that AMD was wrong, it’s just that x86 CPUs have been limiting the decode to windows of 16 bytes. Intel is now doubling that, which makes having more decoders feasible.

I wonder how much die area the Alder Lake decode logic takes, and what the implications are for power consumption. The leaked PL limits for Alder Lake look intimidating. It's possible that Intel is trading power for performance.

Did they have to add a decode stage to the pipeline? Or do they just do pre-decode and hide some of the decoding in the scheduling pipeline stages?

Back in the day when I was doing it, the decoder took about 15-20% of the area of a core (ignoring cache, if you consider L1 cache part of the core). To get to 32 bytes, you'd probably need to come close to doubling that area (unless they are cheating and it's not fully decoding).
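
For anyone following along: by "pre-decode" I mean computing instruction-boundary marks once, when a line is filled into the I-cache, so the parallel decoders can later start at known offsets instead of length-decoding serially on every fetch. A minimal sketch of the idea (hypothetical, not any shipping design):

[CODE]
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64

/* Toy length decoder again; real hardware parses prefixes, opcode
 * maps, ModRM/SIB, displacements and immediates. */
static size_t insn_length(const uint8_t *p)
{
    return (size_t)(p[0] % 15) + 1;
}

/* One boundary bit per byte of the cache line: the bit is set where
 * an instruction starts. Computed once at cache fill, so the serial
 * length decode is amortized across every later fetch of the line. */
typedef struct {
    uint8_t bytes[LINE_BYTES];
    bool    starts[LINE_BYTES];
} predecoded_line;

static void predecode_fill(predecoded_line *line)
{
    for (size_t i = 0; i < LINE_BYTES; i++)
        line->starts[i] = false;

    size_t off = 0;
    while (off < LINE_BYTES) {
        line->starts[off] = true;
        off += insn_length(line->bytes + off);
    }
}
[/CODE]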
 

Joelist

macrumors 6502
Jan 28, 2014
463
373
Illinois
Did they have to add a decode stage to the pipeline? Or do they just do pre-decode and hide some of the decoding in the scheduling pipeline stages?

Back in the day when I was doing it, the decoder took about 15-20% of the area of a core (ignoring cache, if you consider L1 cache part of the core). To get to 32 bytes, you'd probably need to come close to doubling that area (unless they are cheating and it's not fully decoding).
The article indicates that they did have to add a stage to the pipeline.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
The article indicates that they did have to add a stage to the pipeline.

It seemed a little vague to me because it was implied in the article that the extra pipeline stage only happens when there is an exception (which doesn’t make a lot of sense to me). By the way, 17 or 18 pipeline stages is a big problem. x86 cruft is crufty.
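
To put a rough number on why the extra stage matters, here's my own back-of-the-envelope (the workload mix is illustrative, not measured):

[CODE]
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: one branch every ~5 instructions,
     * ~5% of them mispredicted, baseline IPC of 4. */
    const double branches_per_insn = 1.0 / 5.0;
    const double mispredict_rate   = 0.05;
    const double base_ipc          = 4.0;

    for (int penalty = 16; penalty <= 17; penalty++) {
        /* CPI = base CPI plus the amortized flush cost of
         * mispredicted branches. */
        double cpi = 1.0 / base_ipc +
                     branches_per_insn * mispredict_rate * (double)penalty;
        printf("penalty %2d cycles: %.3f IPC\n", penalty, 1.0 / cpi);
    }
    /* 16 -> 17 cycles costs roughly 2% IPC on these assumptions:
     * small per stage, but it compounds as pipelines get longer. */
    return 0;
}
[/CODE]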
 

Joelist

macrumors 6502
Jan 28, 2014
463
373
Illinois
It seemed a little vague to me because it was implied in the article that the extra pipeline stage only happens when there is an exception (which doesn’t make a lot of sense to me). By the way, 17 or 18 pipeline stages is a big problem. x86 cruft is crufty.
Me as well. The funny part is that Intel seems to be going back to longer pipelines, which is what got them into severe trouble, performance- and thermal-wise, back in the old NetBurst days.
 

mj_

macrumors 68000
May 18, 2017
1,618
1,281
Austin, TX
That's one of the most clickbaity titles I have ever seen, and Max Tech is no stranger to clickbait. Just for that I refuse to watch it, because you know it'll be nothing but baseless speculation, assumptions, and rumors.
 