Wow, multiple massive inaccuracies in only 2 points, congrats.
1) Intel invested a massive amount of resources into Atom development starting in 2004, years before smartphones ever appeared. They developed 1-3W cores at a time when standard low-power laptop chips were in the 35W+ range, and the parts were initially very competitive on the market. However, internal struggles resulted in an effective abandonment of this segment. Firstly, the Atom was competing in a lower-margin, low-unit-cost market with more competition than the x86 market (by Atom's release in 2008, Intel had already regained leadership ahead of AMD). At that point, Intel sold everything they could make, so it made sense to use fab capacity on the expensive parts where they were free to set the price. Secondly, since Intel's approach was so profitable (and, to be honest, still is), many people inside the company were against "rocking the boat" and risking their high-margin, high-performance cash cows, especially when netbooks exploded in popularity.
2) This is just so wrong, like, you really need to know absolutely nothing about this field to be able to come up with this. In real life, Intel wanted to move on from x86 and invested absolutely monstrous resources, including the largest design team in history, to create a completely new architecture called IA-64 (also known as Intel Itanium). IA-64 was not compatible with x86/32-bit instructions, breaking compatibility with all older software, and, if successful, would have gotten rid of pesky x86 competition like AMD. Instead, AMD went the opposite way and extended the x86 instruction set with 64-bit support, enabling backwards compatibility. In the end, Intel was forced to effectively write off the investment in IA-64 and license x86-64 from AMD.
The desktop graphics segment has been on a dubious trajectory for quite some time now. Just under three years back, I estimated that over the preceding ten years performance (using FLOPS as a metric) had improved by a bit more than a factor of ten, whereas performance/W and performance/$ had each improved by only a factor of four over the same period. Since then, development has stagnated completely (or, at present, actually regressed) in terms of performance/$, and performance/W isn't quite keeping up with that previous decade.

As @Andropov correctly points out, it's a questionable path with unclear sustainability. By designing for power instead of efficiency you are limiting yourself in markets that are becoming increasingly important. Nvidia Ampere is great on desktop (mostly because of the massively increased power budget), but it's much less impressive in laptops. Who knows, maybe Nvidia has some sort of secret architecture they are working on that will greatly improve power efficiency, and Ampere is just a stopgap to get some sales, but they can't continue doing this going forward. What next, 500W on the high end? The power consumption of enthusiast-level hardware is already barely sustainable today...
I assumed this was mainly because of the variable length instructions, something also mentioned by Anandtech.

I was a designer on some of AMD's CPUs, and at one point I owned the integer execution units and dispatch. There's not much reason x86 can't go wider, other than the fact that code probably wouldn't benefit much from it, due to too much instruction interdependency, I suppose. Microcode is a disadvantage: when you send a complex instruction to the instruction decoder and it replaces it with a sequence of N microops, those microops will tend to have interdependencies which require them to be at least partially sequenced. If, instead, you have Arm, you can let the compiler do some of the work of ordering the instruction stream to take advantage of multiple pipelines, and the instruction stream that reaches the instruction decoder will tend to have fewer clumps of interdependent instructions.
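To make that concrete, here's a rough sketch (hand-written A64; the registers and the particular operations are purely illustrative, not real compiler output) of the kind of stream a compiler can hand the front end when it interleaves two independent chains:

ldr x0, [x2]     // chain A: load
ldr x1, [x3]     // chain B: independent load, free to go down another pipeline
add x0, x0, #1   // chain A: needs only x0
add x1, x1, #1   // chain B: needs only x1
str x0, [x2]     // chain A: store the result
str x1, [x3]     // chain B: store the result

A clump of microops cracked out of a single complex instruction, by contrast, mostly has to run in the order the decoder emitted it, because each op feeds the next.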
What really defines Apple's Firestorm CPU core from other designs in the industry is just the sheer width of the microarchitecture. Featuring an 8-wide decode block, Apple's Firestorm is by far the widest commercialized design currently in the industry. IBM's upcoming P10 Core in the POWER10 is the only other official design that's expected to come to market with such a wide decoder, following Samsung's cancellation of their own M6 core, which was also described as having a similarly wide design.
As for other contemporary designs, such as AMD's Zen (1 through 3) and Intel's µarchs, x86 CPUs today still only feature 4-wide decoders (Intel's is 1+4), seemingly limited from going wider at this point in time by the ISA's inherent variable instruction length, which makes designing decoders that can deal with that aspect of the architecture more difficult compared to the ARM ISA's fixed-length instructions. On the ARM side of things, Samsung's designs had been 6-wide from the M3 onwards, whilst Arm's own Cortex cores have been steadily going wider with each generation: currently 4-wide in available silicon, with an increase to a 5-wide design expected in the upcoming Cortex-X1 cores.
I assumed this was mainly because of the variable length instructions, something also mentioned by Anandtech.
On the other hand, they do have quite an ability to dig old stuff out of their massive portfolio and somehow make it competitive. Architecture at a dead end following the P4 failure? Bring back the obsolete Pentium Pro/P3 arch and use it as the base for the brilliant Core arch they still use today. Primacy in HPC starting to be in danger because of Nvidia? Pull an innovative but abandoned many-core x86 arch (Larrabee) out of the bin and use it to build a line of Phi accelerator cards. Completely stuck because of lithography issues? Dust off Atom core designs and use them to create low-power cores to better compete in core count and efficiency.

Intel Itanium was an interesting proposal at the time, but ultimately it ended up being a big failure. They never managed to solve some of the fundamental problems, resulting in lackluster performance and subpar scaling.
Intel's CPU history is quite interesting anyway. Most of their innovations ended up as failures. Pentium IV was a new thing back in the day: failure. Itanium: failure. Current Intel CPUs still derive from the Pentium Pro, released more than twenty years ago! AMD has a completely new CPU design every couple of years, and I've heard from people familiar with the industry that Apple redesigns everything every two generations or so.
In the end, Intel suffered from lack of innovation, slow management and internal political strife. They are very good at doing business, but unfortunately they’ve been suffocating when it comes to their base.
On the other hand, they do have quite an ability to dig old stuff out of their massive portfolio and somehow make it competitive. Architecture at a dead end following the P4 failure? Bring back the obsolete Pentium Pro/P3 arch and use it as the base for the brilliant Core arch they still use today. Primacy in HPC starting to be in danger because of Nvidia? Pull an innovative but abandoned many-core x86 arch (Larrabee) out of the bin and use it to build a line of Phi accelerator cards. Completely stuck because of lithography issues? Dust off Atom core designs and use them to create low-power cores to better compete in core count and efficiency.
You make an interesting point about AMD introducing new designs way more often than Intel. Maybe @cmaier can weigh in, but I'd assume at least part of the reason is that optimizing an architecture gets harder and harder over time, which is difficult when you have fewer resources at your disposal than the competition. When AMD released the Phenom line in 2007 and didn't beat Intel out of the gate, they just abandoned the arch and only released a single, simple die-shrink (Phenom II, which wasn't that bad in hindsight), getting the completely new Bulldozer arch out just 4 years later. Funnily enough, I remember Nehalem (1st gen Core i, released end of 2008) being architecturally extremely similar to the Phenoms, with Intel later further iterating on the design and getting massive gains for a few generations, completely dominating AMD's Bulldozer designs.
That's super interesting, were the rotations and greater responsibility part of the philosophy from the start or did they mainly appear as a way to plug holes when people left and similar? Do you know how many of these "full-stack-designers" still work for AMD, or are modern designs getting too complex for non-specialized people?

The reason we always started working on the next architecture every 3 years or so was because that’s about how often we had a new fab process. When you get a process shrink, it means you have more transistors to work with. When you have more transistors, there are new things you can do with them. And we always felt we had to, anyway, to keep up with Intel. So we’d design a new chip, then do a couple revs on it (fixing bugs, increasing clock frequency as much as we could, adding minimal new features if doing so was easy), but while we were doing those things, we’d already be starting to think about the next clean-sheet design with a subset of the team. We would also learn from our prior design, and improve our methodology for the next go-round so that the same small team could work with 10x the number of transistors the next time around.
Our team was also organized very differently than Intel. Much smaller, and each person had a ton of responsibility. We’d also move around. At Intel, you might own the design of one adder for 10 years across different designs. At AMD, you owned the entire integer execution units and the scheduling unit, and on the next design you might instead own the instruction decoder, and on one chip you might be doing logic design and on the next you might be doing circuit design. At various points I worked on physical architecture (standard cell aspect ratios, power grids, clock grids, etc.), floor planning, integer logic design, floating point logic design, circuit design, etc. I even ran the electronic design automation team and wrote much of our CAD software for several years. So we were organized in a way that it was comparatively easier to start over, because we were used to “new stuff.”
That's super interesting, were the rotations and greater responsibility part of the philosophy from the start or did they mainly appear as a way to plug holes when people left and similar? Do you know how many of these "full-stack-designers" still work for AMD, or are modern designs getting too complex for non-specialized people?
“Filled with?” Three guys. One is an architect. Only two are chip designers. Not to mention lawsuits. I wouldn’t count on much from Nuvia.
Nope, that passage makes no sense. First, it is confused about what the decoder does. The decoder has nothing to do with parallelism. The decoder‘s job (in an x86 machine) is to, among other things, convert the incoming variable-length instructions into fixed-length micro-ops. It does this by finding the beginning and end of each instruction, and using a state machine and microcode ROM to determine a sequence of simple, fixed-length instructions. These are then fed to the scheduling unit (which is not part of the decoder), which is responsible for looking at each instruction, figuring out if its inputs are still being calculated by any of the pipelines, and, if not, dispatching it onto the next available pipeline.
In other words, to the scheduling unit, reservation stations, and pipelines, it makes no difference that the original instruction had an arbitrary length.
Again, I submit that the real issue is that there would be vastly diminished returns by going wider, because when you take a single x86 instruction and turn it into 8 microops, those microops are all going to be interdependent. So microop #4 cannot issue in parallel to #3, because the input to #4 is the output of #3.
The way around that is to look further down the instruction stream to see if something else can be issued into the pipelines, but the cost for such a large “scanning window” gets pretty large. This is why a good RISC compiler is an advantage - it can tell that instruction X and instruction Y don’t depend on each other, and it can rearrange the instruction stream accordingly to make them fit together within the scanning window so the processor sees there is no dependency and issues the instructions in parallel.
Of course there are penalties to going wider - the register file needs more ports, the CAM in the reservation stations needs more ports, etc., so any time you add a pipeline, you get less benefit than from the previously added pipeline. It’s easy to believe that for x86, where you tend to have large clumps of inter-related operations (at the input to the scheduling unit), the benefits just aren’t worth it beyond 4 wide.
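As a rough illustration of the clump problem (hand-written A64 again; the registers and operations are arbitrary, just to show the dependency pattern), consider a fully serial chain:

ldr x0, [x1]     // everything below waits on this load
add x0, x0, x2   // needs the loaded value
lsl x0, x0, #3   // needs the add's result
str x0, [x3]     // needs the shifted result

This issues at roughly one op per cycle no matter how wide the machine is, so extra decode and issue slots only pay off if the scanning window can find unrelated work to run alongside it.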
I just don’t get the fascination with Nuvia. Before they were bought out, they hadn’t made a chip. It’s 3 guys. Qualcomm already has a massive CPU team. These guys are just joining it and bringing some ideas along, assuming the lawsuit doesn’t find that Apple owns them. Nvidia is much more formidable than Qualcomm, in the long run.

Now that they’ve been bought out, I’d expect the team to grow. Isn’t the lawsuit more personal than intellectual? As in GW3 is personally at risk, but the company itself and whatever chips it makes (so far) are not? Then again, I can’t imagine Apple is happy that Qualcomm bought its ex-designers and would relish the opportunity for further disruption.
That said, I would still submit that, of the two companies, Qualcomm is the more likely to produce powerful ARM chips for the enthusiast/hobbyist market, but I wouldn’t care to predict what that probability is, as there is indeed too much uncertainty. However, I’d say the probability that Apple does so is hovering near zero.
Wait, I’m a bit confused… how is it any different from ARM? A single instruction can also include multiple micro-ops with complex dependencies that need to be scheduled, renamed and retired independently? Take for example something like
LDP X0, X1, [SP], #16
This will need to allocate a new register for X0, load it from stack, allocate a new X1, load it from stack (with offset) and increment SP by 16. These are three independent micro-operations that open new dependency chains and have to be scheduled independently…
Besides, isn’t that problem pretty much solved with modern dependency-tracking compilers? Most x86 instructions are not that complex and their internal dependencies are fairly clear.
I just don’t get the fascination with Nuvia. Before they were bought out, they hadn’t made a chip. It’s 3 guys. Qualcomm already has a massive CPU team. These guys are just joining it and bringing some ideas along, assuming the lawsuit doesn’t find that Apple owns them. Nvidia is much more formidable than Qualcomm, in the long run.
It’s a matter of degree. The average number of microops per Arm ISA instruction is around 1.1 or something. It’s much higher in x86. And the LDP instruction can be done with a single microop, by the way. Just because it COULD be broken down into two loads doesn’t mean it has to be. And “allocate a new register” is not a thing, in the sense that it is not an op by itself. It’s done as part of any instruction that needs a register. So really you have only two things going on here - reading the offset stack and incrementing the sp by 16. This can be done in one microop by the load/store unit. And, in fact, this does not use any of the ALUs, so it’s not even really relevant to the question of how many parallel pipelines you can have (Load/store unit is not part of the ALU pipelines).
So, for example, I’d likely design this so that there is a 128 bit databus between the load/store unit and the register file, and I’d just send a control signal to the load/store telling it that it needs to fetch two, and I’d send the immediate so that the load/store can update the stack pointer after the fetch (could be before, but then you have a problem with unwinding on cache misses).
Register renaming is always a pipeline stage, so it’s not a separate microop. Typically done by a register-renaming unit that is part of the scheduling unit. And while everything has to be tracked, that doesn’t involve separate microops - it’s part of every microop.

Yeah, I get that, but you still need to track data dependencies for all other instructions in flight. With "allocate" I mean taking care of register renaming and data dependencies. For example, if in my example a previous instruction is modifying SP, you will need to wait until the new value of SP becomes available. Similarly, if the next instruction depends on any of the modified registers, its execution will have to wait. And this is not done on instruction granularity, but tracked separately for each involved register!
In short, I don't see a principal difference between an ARM-style load+op instruction pair and an x86 instruction where the same is achieved via a single instruction (which will generate multiple dependencies and multiple micro-ops).
Furthermore, we do know that Apple CPUs are not that straightforward. They will fuse instruction pairs into a single operation, eliminate moves via register renaming and even eliminate matching stack load/store operations! Again, I don't really know what I am talking about, but it at least seems to me that Apple's instruction decoders operate on domains that span more than one instruction (and similarly, that a single instruction is broken down into multiple operations that can be scheduled/optimized separately).
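For instance (a made-up two-instruction example on my part, not anything measured on real silicon), a pair like

mov x1, x0       // on cores with move elimination this can apparently be resolved at rename
add x2, x1, #4   // ...so this just reads the physical register that x0 was already mapped to

can supposedly be handled without the mov ever occupying an execution slot, which only makes sense if the machine reasons about the stream at a finer granularity than whole architectural instructions.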
In x86 you may have to do, as part of an “add” instruction, two loads and a store, and typically none of that can be done in parallel. So you have (and forgive my mnemonics here - I’m going for legibility):
x86:
add [M1], [M2] -> [M3]
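Sticking with those made-up mnemonics, the decoder has to crack that into something like the following (a sketch only; real microop encodings and temporaries differ):

load  t1, [M1]    // microop 1: fetch the first operand
load  t2, [M2]    // microop 2: fetch the second operand
add   t3, t1, t2  // microop 3: waits on both loads
store t3, [M3]    // microop 4: waits on the add

So one architectural instruction reaches the scheduler as a chain of dependent microops rather than as independent work the pipelines can spread out.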
The definition of Apple enthusiast just needs to include a few more types of enthusiasts, is all. Things grow or die, not much in between. Whether you think so or not, I like macOS and I want to run Macs, but I also have other needs...

Really confused by this forum, but I may move on because it's proving more and more not to be an "Apple Enthusiast" site.