Wow, multiple massive inaccuracies in only 2 points, congrats.
1) Intel invested a massive amount of resources into Atom development starting in 2004, years before smartphones ever appeared. They developed 1-3W cores at a time when standard low-power laptop chips were in the 35W+ range, and the parts were initially very competitive on the market. However, internal struggles resulted in an effective abandonment of this segment. Firstly, the Atom was competing in a lower-margin, low-unit-cost market with more competition than the x86 market (by Atom's release in 2008, Intel had already regained leadership ahead of AMD). At that point, Intel sold everything they could make, so it made sense to use fab capacity on the expensive parts where they were free to set the price. Secondly, since Intel's approach was so profitable (and, to be honest, still is), many people inside the company were against "rocking the boat" and risking their high-margin, high-performance cash cows, especially when netbooks exploded in popularity.
2) This is just so wrong, like, you really need to know absolutely nothing about this field to be able to come up with this. In real life, Intel wanted to move on from x86 and invested absolutely monstrous resources, including the largest design team in history, to create a completely new architecture called IA-64 (also known as Intel Itanium). IA-64 was not compatible with x86/32-bit instructions, breaking compatibility with all older software, and, if successful, would have gotten rid of pesky x86 competition like AMD. Instead, AMD went the opposite way and extended the x86 instruction set with 64-bit support, enabling backwards compatibility. In the end, Intel was forced to effectively write off the investment in IA-64 and license x86-64 from AMD.
The desktop graphics segment has been on a dubious trajectory for quite some time now. Just under three years back, I estimated that over the preceding ten years performance (using FLOPS as a metric) had improved by a bit more than a factor of ten, whereas performance/W and performance/$ had each improved by only a factor of four over the same period. Since then, development has stagnated completely (or, at present, actually regressed) in terms of performance/$, and performance/W isn't quite keeping up with that previous decade.

As @Andropov correctly points out, it's a questionable path with unclear sustainability. By designing for power instead of efficiency you are limiting yourself in markets that are becoming increasingly important. Nvidia Ampere is great on desktop (mostly because of the massively increased power budget), but it's much less impressive in laptops. Who knows, maybe Nvidia has some sort of secret architecture they are working on that will greatly improve power efficiency, and Ampere is just a stopgap to get some sales, but they can't continue doing this going forward. What next, 500W on the high end? The power consumption of enthusiast-level hardware is already barely sustainable today...
I assumed this was mainly because of the variable length instructions, something also mentioned by Anandtech.

I was a designer on some of AMD's CPUs, and at one point I owned the integer execution units and dispatch. There's not much reason x86 can't go wider, other than the fact that code probably wouldn't benefit much from it, due to too much instruction interdependency, I suppose. Microcode is a disadvantage: when you send a complex instruction to the instruction decoder and it replaces it with a sequence of N microops, those microops will tend to have interdependencies which require them to be at least partially sequenced. If, instead, you have Arm, you can let the compiler do some of the work of ordering the instruction stream to take advantage of multiple pipelines, and the instruction stream that reaches the instruction decoder will tend to have fewer clumps of interdependent instructions.
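To make that concrete, here's a rough sketch (hand-written A64; the registers and the particular operations are purely illustrative, not real compiler output) of the kind of stream a compiler can hand the front end when it interleaves two independent chains:

ldr x0, [x2]     // chain A: load
ldr x1, [x3]     // chain B: independent load, free to go down another pipeline
add x0, x0, #1   // chain A: needs only x0
add x1, x1, #1   // chain B: needs only x1
str x0, [x2]     // chain A: store the result
str x1, [x3]     // chain B: store the result

A clump of microops cracked out of a single complex instruction, by contrast, mostly has to run in the order the decoder emitted it, because each op feeds the next.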
What really defines Apple's Firestorm CPU core from other designs in the industry is just the sheer width of the microarchitecture. Featuring an 8-wide decode block, Apple's Firestorm is by far the widest commercialized design currently in the industry. IBM's upcoming P10 Core in the POWER10 is the only other official design that's expected to come to market with such a wide decoder, following Samsung's cancellation of their own M6 core, which was also described as having a similarly wide design.
As for other contemporary designs, such as AMD's Zen (1 through 3) and Intel's µarchs, x86 CPUs today still only feature 4-wide decoders (Intel's is 1+4), seemingly limited from going wider at this point in time by the ISA's inherent variable instruction length, which makes designing decoders that can deal with that aspect of the architecture more difficult compared to the ARM ISA's fixed-length instructions. On the ARM side of things, Samsung's designs had been 6-wide from the M3 onwards, whilst Arm's own Cortex cores have been steadily going wider with each generation: currently 4-wide in available silicon, with an increase to a 5-wide design expected in the upcoming Cortex-X1 cores.
I assumed this was mainly because of the variable length instructions, something also mentioned by Anandtech.
On the other hand, they do have quite an ability to dig old stuff out of their massive portfolio and somehow make it competitive. Architecture at a dead end following the P4 failure? Bring back the obsolete Pentium Pro/P3 arch and use it as the base for the brilliant Core arch they still use today. Primacy in HPC starting to be in danger because of Nvidia? Pull an innovative but abandoned many-core x86 arch (Larrabee) out of the bin and use it to build a line of Phi accelerator cards. Completely stuck because of lithography issues? Dust off Atom core designs and use them to create low-power cores to better compete in core count and efficiency.

Intel Itanium was an interesting proposal at the time, but ultimately it ended up being a big failure. They never managed to solve some of the fundamental problems, resulting in lackluster performance and subpar scaling.
Intel's CPU history is quite interesting anyway. Most of their innovations ended up as failures. Pentium IV was a new thing back in the day: failure. Itanium: failure. Current Intel CPUs still derive from the Pentium Pro, released more than twenty years ago! AMD has a completely new CPU design every couple of years, and I've heard from people familiar with the industry that Apple redesigns everything every two generations or so.
In the end, Intel suffered from lack of innovation, slow management and internal political strife. They are very good at doing business, but unfortunately they’ve been suffocating when it comes to their base.
On the other hand, they do have quite an ability to dig old stuff out of their massive portfolio and somehow make it competitive. Architecture at a dead end following the P4 failure? Bring back the obsolete Pentium Pro/P3 arch and use it as the base for the brilliant Core arch they still use today. Primacy in HPC starting to be in danger because of Nvidia? Pull an innovative but abandoned many-core x86 arch (Larrabee) out of the bin and use it to build a line of Phi accelerator cards. Completely stuck because of lithography issues? Dust off Atom core designs and use them to create low-power cores to better compete in core count and efficiency.
You make an interesting point about AMD introducing new designs way more often than Intel. Maybe @cmaier can weigh in, but I'd assume at least part of the reason is that optimizing an architecture gets harder and harder over time, which is difficult when you have fewer resources at your disposal than the competition. When AMD released the Phenom line in 2007 and didn't beat Intel out of the gate, they just abandoned the arch and only released a single, simple die-shrink (Phenom II, which wasn't that bad in hindsight), getting the completely new Bulldozer arch out just 4 years later. Funnily enough, I remember Nehalem (1st gen Core i, released end of 2008) being architecturally extremely similar to the Phenoms, with Intel later further iterating on the design and getting massive gains for a few generations, completely dominating AMD's Bulldozer designs.
That's super interesting, were the rotations and greater responsibility part of the philosophy from the start or did they mainly appear as a way to plug holes when people left and similar? Do you know how many of these "full-stack-designers" still work for AMD, or are modern designs getting too complex for non-specialized people?

The reason we always started working on the next architecture every 3 years or so was because that’s about how often we had a new fab process. When you get a process shrink, it means you have more transistors to work with. When you have more transistors, there are new things you can do with them. And we always felt we had to, anyway, to keep up with Intel. So we’d design a new chip, then do a couple revs on it (fixing bugs, increasing clock frequency as much as we could, adding minimal new features if doing so was easy), but while we were doing those things, we’d already be starting to think about the next clean-sheet design with a subset of the team. We would also learn from our prior design, and improve our methodology for the next go-round so that the same small team could work with 10x the number of transistors the next time around.
Our team was also organized very differently than Intel. Much smaller, and each person had a ton of responsibility. We’d also move around. At Intel, you might own the design of one adder for 10 years across different designs. At AMD, you owned the entire integer execution units and the scheduling unit, and on the next design you might instead own the instruction decoder, and on one chip you might be doing logic design and on the next you might be doing circuit design. At various points I worked on physical architecture (standard cell aspect ratios, power grids, clock grids, etc.), floor planning, integer logic design, floating point logic design, circuit design, etc. I even ran the electronic design automation team and wrote much of our CAD software for several years. So we were organized in a way that it was comparatively easier to start over, because we were used to “new stuff.”
That's super interesting, were the rotations and greater responsibility part of the philosophy from the start or did they mainly appear as a way to plug holes when people left and similar? Do you know how many of these "full-stack-designers" still work for AMD, or are modern designs getting too complex for non-specialized people?
“Filled with?” Three guys. One is an architect. Only two are chip designers. Not to mention lawsuits. I wouldn’t count on much from Nuvia.
Nope, that passage makes no sense. First, it is confused about what the decoder does. The decoder has nothing to do with parallelism. The decoder‘s job (in an x86 machine) is to, among other things, convert the incoming variable-length instructions into fixed-length micro-ops. It does this by finding the beginning and end of each instruction, and using a state machine and microcode ROM to determine a sequence of simple, fixed-length instructions. These are then fed to the scheduling unit (which is not part of the decoder), which is responsible for looking at each instruction, figuring out if its inputs are still being calculated by any of the pipelines, and, if not, dispatching it onto the next available pipeline.
In other words, to the scheduling unit, reservation stations, and pipelines, it makes no difference that the original instruction had an arbitrary length.
Again, I submit that the real issue is that there would be vastly diminished returns by going wider, because when you take a single x86 instruction and turn it into 8 microops, those microops are all going to be interdependent. So microop #4 cannot issue in parallel to #3, because the input to #4 is the output of #3.
The way around that is to look further down the instruction stream to see if something else can be issued into the pipelines, but the cost for such a large “scanning window” gets pretty large. This is why a good RISC compiler is an advantage - it can tell that instruction X and instruction Y don’t depend on each other, and it can rearrange the instruction stream accordingly to make them fit together within the scanning window so the processor sees there is no dependency and issues the instructions in parallel.
Of course there are penalties to going wider - the register file needs more ports, the CAM in the reservation stations needs more ports, etc., so any time you add a pipeline, you get less benefit than from the previously added pipeline. It’s easy to believe that for x86, where you tend to have large clumps of inter-related operations (at the input to the scheduling unit), the benefits just aren’t worth it beyond 4 wide.
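As a rough illustration of the clump problem (hand-written A64 again; the registers and operations are arbitrary, just to show the dependency pattern), consider a fully serial chain:

ldr x0, [x1]     // everything below waits on this load
add x0, x0, x2   // needs the loaded value
lsl x0, x0, #3   // needs the add's result
str x0, [x3]     // needs the shifted result

This issues at roughly one op per cycle no matter how wide the machine is, so extra decode and issue slots only pay off if the scanning window can find unrelated work to run alongside it.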
I just don’t get the fascination with Nuvia. Before they were bought out, they hadn’t made a chip. It’s 3 guys. Qualcomm already has a massive CPU team. These guys are just joining it and bringing some ideas along, assuming the lawsuit doesn’t find that Apple owns them. Nvidia is much more formidable than Qualcomm, in the long run.

Now that they’ve been bought out, I’d expect the team to grow. Isn’t the lawsuit more personal than intellectual? As in GW3 is personally at risk, but the company itself and whatever chips it makes (so far) are not? Then again, I can’t imagine Apple is happy that Qualcomm bought its ex-designers and would relish the opportunity for further disruption.
That said, I would still submit that, of the two companies, Qualcomm is the more likely to produce powerful ARM chips for the enthusiast/hobbyist market, but I wouldn’t care to predict what that probability is, as there is indeed too much uncertainty. However, I’d say the probability that Apple does so is hovering near zero.
Wait, I’m a bit confused… how is it any different from ARM? A single instruction can also include multiple micro-ops with complex dependencies that need to be scheduled, renamed and retired independently? Take for example something like
LDP X0, X1, [SP], #16
This will need to allocate a new register for X0, load it from stack, allocate a new X1, load it from stack (with offset) and increment SP by 16. These are three independent micro-operations that open new dependency chains and have to be scheduled independently…
Besides, isn’t that problem pretty much solved with modern dependency-tracking compilers? Most x86 instructions are not that complex and their internal dependencies are fairly clear.
I just don’t get the fascination with Nuvia. Before they were bought out, they hadn’t made a chip. It’s 3 guys. Qualcomm already has a massive CPU team. These guys are just joining it and bringing some ideas along, assuming the lawsuit doesn’t find that Apple owns them. Nvidia is much more formidable than Qualcomm, in the long run.
It’s a matter of degree. The average number of microops per Arm ISA instruction is around 1.1 or something. It’s much higher in x86. And the LDP instruction can be done with a single microop, by the way. Just because it COULD be broken down into two loads doesn’t mean it has to be. And “allocate a new register” is not a thing, in the sense that it is not an op by itself. It’s done as part of any instruction that needs a register. So really you have only two things going on here - reading the offset stack and incrementing the sp by 16. This can be done in one microop by the load/store unit. And, in fact, this does not use any of the ALUs, so it’s not even really relevant to the question of how many parallel pipelines you can have (Load/store unit is not part of the ALU pipelines).
So, for example, I’d likely design this so that there is a 128 bit databus between the load/store unit and the register file, and I’d just send a control signal to the load/store telling it that it needs to fetch two, and I’d send the immediate so that the load/store can update the stack pointer after the fetch (could be before, but then you have a problem with unwinding on cache misses).
Register renaming is always a pipeline stage, so it’s not a separate microop. Typically done by a register-renaming unit that is part of the scheduling unit. And while everything has to be tracked, that doesn’t involve separate microops - it’s part of every microop.

Yeah, I get that, but you still need to track data dependencies for all other instructions in flight. With "allocate" I mean taking care of register renaming and data dependencies. For example, if in my example a previous instruction is modifying SP, you will need to wait until the new value of SP becomes available. Similarly, if the next instruction depends on any of the modified registers, its execution will have to wait. And this is not done on instruction granularity, but tracked separately for each involved register!
In short, I don't see a principal difference between an ARM-style load+op instruction pair and an x86 instruction where the same is achieved via a single instruction (which will generate multiple dependencies and multiple micro-ops).
Furthermore, we do know that Apple CPUs are not that straightforward. They will fuse instruction pairs into a single operation, eliminate moves via register renaming and even eliminate matching stack load/store operations! Again, I don't really know what I am talking about, but it at least seems to me that Apple's instruction decoders operate on domains that span more than one instruction (and similarly, that a single instruction is broken down into multiple operations that can be scheduled/optimized separately).
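For instance (a made-up two-instruction example on my part, not anything measured on real silicon), a pair like

mov x1, x0       // on cores with move elimination this can apparently be resolved at rename
add x2, x1, #4   // ...so this just reads the physical register that x0 was already mapped to

can supposedly be handled without the mov ever occupying an execution slot, which only makes sense if the machine reasons about the stream at a finer granularity than whole architectural instructions.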
In x86 you may have to do, as part of an “add” instruction, two loads and a store, and typically none of that can be done in parallel. So you have (and forgive my mnemonics here - I’m going for legibility):
x86:
add [M1], [M2] -> [M3]
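Sticking with those made-up mnemonics, the decoder has to crack that into something like the following (a sketch only; real microop encodings and temporaries differ):

load  t1, [M1]    // microop 1: fetch the first operand
load  t2, [M2]    // microop 2: fetch the second operand
add   t3, t1, t2  // microop 3: waits on both loads
store t3, [M3]    // microop 4: waits on the add

So one architectural instruction reaches the scheduler as a chain of dependent microops rather than as independent work the pipelines can spread out.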
The definition of Apple enthusiast just needs to include a few more types of enthusiasts, is all. Things grow or die, not much in between. Whether you think so or not, I like macOS and I want to run Macs, but I also have other needs...

Really confused by this forum, but I may move on because it's proving more and more not to be an "Apple Enthusiast" site.