Note that the Intel Alder Lake performance core has a 6-wide decoder. So x86-64 is no longer limited to a 4-wide decoder.
Interesting, AMD was wrong apparently. Unless the decoders aren’t able to be utilized.
According to the AnandTech article, when they asked Intel whether some of the decoders were simplified, Intel declined to answer.
Here is the relevant part of the AnandTech article:

Starting off with the most obvious change: Intel is moving from being a 4-wide decode machine to being a 6-wide microarchitecture, a first amongst x86 designs and a major design focus point. Over the last few years there has been an ongoing discussion about decoder widths and the nature of x86’s variable-length instruction set, which makes it difficult to design wider decoders compared to, say, a fixed-length ISA like Arm’s, where adding decoders is relatively easy. Notably, last year AMD’s Mike Clark noted that while it’s not a fundamental limitation, going wider than 4 decoders creates practical drawbacks, such as added complexity and, most importantly, added pipeline stages. For Golden Cove, Intel has decided to push forward with these changes, and the compromise that had to be made is that the design now adds an additional stage to the mispredict penalty of the microarchitecture, so the best case goes up from 16 cycles to 17 cycles. We asked if there is still a special-case decoder layout as in previous generations (such as the 1 complex + 3 simple decoder setup), but the company wouldn’t delve deeper into the details at this point in time. To feed the decoder, the fetch bandwidth going into it has been doubled from 16 bytes per cycle to 32 bytes per cycle.
Intel states that the decoder is clock-gated 80% of the time, instead relying on the µOP cache. The µOP cache has also seen extremely large changes this generation: the structure has almost doubled, from 2.25K entries to 4K entries, mimicking the similarly large increase we saw in the move from AMD’s Zen to Zen 2, increasing the hit rate and further avoiding the more costly legacy decode path.
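To put very rough numbers on that 80% clock-gating claim, here is a back-of-envelope sketch. The µOP-cache delivery width and the reading of "clock-gated 80% of the time" as "80% of cycles are served from the µOP cache" are my assumptions for illustration, not figures from Intel or the article:

```python
# Rough front-end supply model. Only the 80% clock-gating figure and the
# 6-wide decoder come from the article; the uop-cache width and the mapping
# of "clock-gated 80% of the time" to "80% of cycles hit the uop cache"
# are assumptions made here for illustration.

UOP_CACHE_WIDTH = 8        # assumed uops/cycle delivered on a uop-cache hit
DECODER_WIDTH = 6          # legacy decode width (per the article)
UOP_CACHE_FRACTION = 0.80  # fraction of cycles the decoders sit clock-gated

avg_supply = (UOP_CACHE_FRACTION * UOP_CACHE_WIDTH
              + (1 - UOP_CACHE_FRACTION) * DECODER_WIDTH)
print(f"average front-end supply: ~{avg_supply:.1f} uops/cycle")
# -> average front-end supply: ~7.6 uops/cycle, i.e. most of the throughput
#    comes from the uop cache; the wide decoders mainly cover the miss cases.
```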
Also, read the comments (I can't believe I just said that). There are some very knowledgeable posters on the article. Maynard Handley (goes by @name99) is particularly knowledgeable about the M1 architecture and compares what Intel is doing vs. Apple Silicon.
Sounds like a halfway house between typical x86 and what Apple has done.
Very interesting stuff, but also very technical, and probably too technical for the vast majority of people.
Edit: Apparently Maynard Handley has an account here as well.
It’s not that AMD was wrong, it’s just that x86 CPUs have been limiting the decode to windows of 16 bytes. Intel is now doubling that, which makes having more decoders feasible.
I wonder how much die area the Alder Lake decode logic takes, and what the implications are for power consumption. The leaked PL limits for Alder Lake look intimidating. It’s possible that Intel is trading power efficiency for performance.
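To make the 16-byte versus 32-byte window point above concrete, here is a trivial sketch; the roughly 4-byte average instruction length is an assumed figure for typical x86-64 code, not something from the article:

```python
# Why the fetch window bounds usable decode width on a variable-length ISA.
# x86-64 instructions are 1-15 bytes long; the average below is an assumption.

AVG_INSN_BYTES = 4                 # assumed average x86-64 instruction length

for window_bytes in (16, 32):      # old vs. new fetch window per cycle
    insns = window_bytes / AVG_INSN_BYTES
    print(f"{window_bytes}-byte window: ~{insns:.0f} instructions available per cycle")
# 16-byte window: ~4 instructions -> a 6-wide decoder would often starve
# 32-byte window: ~8 instructions -> enough slack to keep 6 decoders fed
```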
Did they have to add a decode stage to the pipeline? Do they just do pre-decode and hide some decoding in the scheduling pipeline stages?
Back in the day when I was doing it, the decoder took about 15-20% of the area of a core (ignoring cache, if you consider the L1 cache part of the core). To get to 32 bytes, they’d probably need to come close to doubling that area (unless they’re cheating and not fully decoding).
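Taking those ballpark figures at face value, the implied hit to total core area is easy to work out; both inputs below are the rough numbers from the post above (15-20% of core area, decode area roughly doubling), not measurements:

```python
# Implied core-area growth if the decode block really has to double.
# Inputs are the ballpark figures from the post above, not measurements.

decoder_share = 0.175    # decode logic as a fraction of core area (15-20% midpoint)
decoder_scale = 2.0      # assume decode area roughly doubles for 32B / 6-wide

core_growth = decoder_share * (decoder_scale - 1)
print(f"core area grows by roughly {core_growth * 100:.1f}%")
# -> roughly 17.5%: not fatal, but a real cost for logic that sits idle
#    80% of the time whenever the uop cache is hitting.
```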
The article indicates that they did have to add stages to the pipeline.
It seemed a little vague to me, because the article implied that the extra pipeline stage only comes into play when there is an exception (which doesn’t make a lot of sense to me). By the way, 17 or 18 pipeline stages is a big problem. x86 cruft is crufty.

Me as well. The funny part is that Intel seems to be going back to longer pipes, which is what got them into severe trouble, performance- and thermal-wise, back in the old NetBurst days.
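For a sense of what that one extra stage costs in practice, here is a quick calculation; the branch mispredict rate is an assumed figure for illustration, since real rates vary a lot by workload:

```python
# Cost of growing the best-case mispredict penalty from 16 to 17 cycles.
# The mispredict rate below is an assumption; real workloads vary widely.

MISPREDICTS_PER_1K_INSNS = 5   # assumed branch mispredicts per 1000 instructions

for penalty_cycles in (16, 17):
    lost = penalty_cycles * MISPREDICTS_PER_1K_INSNS / 1000
    print(f"{penalty_cycles}-cycle penalty: ~{lost:.3f} cycles lost per instruction")
# 16 -> ~0.080, 17 -> ~0.085 cycles/instruction: about half a percent of extra
# CPI on a CPI-1.0 workload, which is the trade the article says Intel made
# for the wider decode.
```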