i hope intel comes back to beat the m1

Gerdi · Apr 12, 2021

cmaier said:
Ah, I’d have to think about that for a bit. I feel like it would be O(n), because you find the first instruction’s length looking at 1 prefix, then use that to locate the next prefix (etc.). But I may be forgetting some wrinkle.

Precisely, thats what i was talking about

cmaier · Apr 12, 2021

Gerdi said:
Precisely, thats what i was talking about

Got it. The trick for small N is, of course, clever speculation, but, of course, the hardware required for that grows at something like O(n^2) (Again, I’d have to sit down with a sheet of paper and plot it out while trying to remember the tricks).

Gerdi · Apr 12, 2021

cmaier said:
Got it. The trick for small N is, of course, clever speculation, but, of course, the hardware required for that grows at something like O(n^2) (Again, I’d have to sit down with a sheet of paper and plot it out while trying to remember the tricks).

I am sure, you get away with some clever tricks for small N. I was also briefly thinking about speculation - but this could let the hardware explode from size perspective - and would not get you away from the inherent O(N) nature of the problem for larger N.
As i said in my previous posts, i would not agree with Anandtech that 4 is some magic limit for x64, but we have to acknowledge that the circuit depth would raise linear with problem size (and the circuit size possibly even faster) - while for AArch64 it does not.

cmaier · Apr 12, 2021

Gerdi said:
I am sure, you get away with some clever tricks for small N. I was also briefly thinking about speculation - but this could let the hardware explode from size perspective - and would not get you away from the inherent O(N) nature of the problem for larger N.
As i said in my previous posts, i would not agree with Anandtech that 4 is some magic limit for x64, but we have to acknowledge that the circuit depth would raise linear with problem size (and the circuit size possibly even faster) - while for AArch64 it does not.

Yeah, the point of speculation is to use O(n^2) or O(n!) hardware to reduce O(n) time to O(1) time, which may or may not be necessary depending on how big N is and how many gates are in the decode path. And I agree that 4 is not a magic limit, and that it’s certainly easier to go wider when your instruction width is fixed (or at least where the allowable instruction widths share a common denominator).

Gerdi · Apr 12, 2021

cmaier said:
Yeah, the point of speculation is to use O(n^2) or O(n!) hardware to reduce O(n) time to O(1) time, which may or may not be necessary depending on how big N is and how many gates are in the decode path. And I agree that 4 is not a magic limit, and that it’s certainly easier to go wider when your instruction width is fixed (or at least where the allowable instruction widths share a common denominator).

Perfect! We concluded

ikramerica · Apr 12, 2021

DHagan4755 said:
Yes, competition is a very healthy thing. AMD's Ryzen might be the M1's stiffest competition at the moment.

In multicore yes, but though benchmarks can saturate threads and cores, in real world use cores 1-4 will have the most impact for most tasks.

The question will be what Apple can do with 25w in a power laptop and pro desktop and 45w in a pro tower.

Berc · Apr 12, 2021

jdb8167 said:
Which they got from DEC during a buyout and never really supported and the technology was subsequently dumped when they sold off their ARM XScale products to Marvell. Not really an Intel product.

It was sold because it wasn't x86 architecture. Intel added many improvements/features from the 486 libraries, most notably clock doubling. XScale were the fastest ARM based processors of their time. Compaq and RIM used them to create new categories of handheld products. Intel still holds an architectural ARM license.

crazy dave · Apr 12, 2021

@Gerdi @cmaier @leman

Thanks for the interesting discussion!

Speaking about decoders, I’m curious about the Intel’s decoder in the Tremont Atom cores. I’ve read that it’s two cluster of decoders that “decode both sides of a branch” meaning I *think* before and after a branch rather than right/wrong branch. However, I’m not sure about this or why the branch is useful to have a parallel decoder. I was wondering if any of you had read about it/had some insight.

cmaier · Apr 12, 2021

crazy dave said:
@Gerdi @cmaier @leman

Thanks for the interesting discussion!

Speaking about decoders, I’m curious about the Intel’s decoder in the Tremont Atom cores. I’ve read that it’s two cluster of decoders that “decode both sides of a branch” meaning I *think* before and after a branch rather than right/wrong branch. However, I’m not sure about this or why the branch is useful to have a parallel decoder. I was wondering if any of you had read about it/had some insight.

Usually when one talks about decoding both sides of a branch they are talking about speculatively decoding both paths of a conditional branch, so that whether you guess wrong or right about whether the branch is taken, you are ready to go. But I can’t personally speak for what any particular Intel design did. Doesn’t seem right that it could mean “before and after a branch,” though. If the branch is not conditional, all designs would do that. And if it is conditional, then the whole point of the branch prediction unit is to instruct the decoder to decode whatever the predicted next instruction would be, so that it can be speculatively executed; for that reason, any design with a branch predictor would do that, and it wouldn’t be something anyone would think is interesting.

crazy dave · Apr 12, 2021

cmaier said:
Usually when one talks about decoding both sides of a branch they are talking about speculatively decoding both paths of a conditional branch, so that whether you guess wrong or right about whether the branch is taken, you are ready to go. But I can’t personally speak for what any particular Intel design did. Doesn’t seem right that it could mean “before and after a branch,” though. If the branch is not conditional, all designs would do that. And if it is conditional, then the whole point of the branch prediction unit is to instruct the decoder to decode whatever the predicted next instruction would be, so that it can be speculatively executed; for that reason, any design with a branch predictor would do that, and it wouldn’t be something anyone would think is interesting.

Very likely I misunderstood what I was reading as this was a Twitter conversation between people familiar with cpu design and were thus writing with their audience having that built in knowledge in mind - a lot of shorthand followed by “oh of courses” and “that’s interesting”

. It’s possible I inverted things.

How might a parallel decoder cluster aid in such cases? How would you know where to start the decoding of the other side of the branch? Is this a design that would be useful for an ultra wide core like firestorm? Or unnecessary?

P.S. thanks for letting me pick your brain, I’m interested and but largely ignorant of the dirty details of how these things work. So I appreciate the time.

leman · Apr 13, 2021

Gerdi said:
Let me guess, you have never worked on CPU architectures?

Certainly didn't and I have no idea what the constraints are

But I am talking about algorithmic complexity — if you can do it in constant time in "software" you can probably do it in constant time in hardware, no?

Gerdi said:
In any case, the decoding properties depend on code properties in particular if it has certain prefix properties. In addition, an algorithm being O(n) depth, does not mean, that there are no local parallelization opportunities - we are talking about asymptotic complexities after all.*

Same can be said for UTF-8 validation or JSON parsing or sorting and yet there are constant time algorithms for those tasks which rely on local parallelization. I don't see why x86 parsing is a fundamentally different problem: there are signaling bit patterns and one can design detectors that work over multiple bytes in parallel, detect those patterns, and then reduce them to find instruction boundaries.

Gerdi said:
Yes typically i refer to the 64bit instruction set, when referring to x64. I fully believe that a single instruction can be decoded with a upper bound constant O(1) or O(constant). The question i am raising is about the complexity of decoding n instructions (while n going towards infinity).

Ah, well, maybe that's where the misunderstanding lies... I am not talking about parsing n instructions in constant time, I am talking about parsing instructions within a given fixed-size window in constant time. If I remember correctly, all modern x86 CPUs parse instructions in windows of 16 bytes.

Gerdi said:
*SIMD is a good example. Not all SIMD operation are really O(1) with respect to vector length. In particular reduction operations (which include DOT product) are inherently O(logn). So the observation that SIMD operations can be used does not imply that the underlying algorithm algorithm has O(1) complexity.

O(log n) with a constant n is constant time. The rest is just an implementation detail. Reduction operations on modern ARM CPUs (and GPUs) are O(1) — check out M1 instruction timings for example. They are obviously slower than the parallel versions (reducing ADDV needs 3 cycles for example, while parallel ADDP needs 2 cycles), but it's still constant time. And any combination of those will be constant time as well. And of course, as you increase the vector size, the latency of these operations will go up as well. But constant factors don't change the algorithmic complexity.

And sure, reduction operations in x86 perform terribly, but that's just x86... they have a history of adding seemingly useful instructions that are completely useless in practice... both DPPS and HADDPS are good examples of those.

Gerdi said:
Yup, the reason is, that 5 decoders are sufficient to feed the backends. Adding more decoders would not have helped with performance, because you would be backend limited most of the time.

I don's see that Apple has cracked something here. They just made a very wide design - there is no principal problem Apple has solved. As i pointed out earlier, it is for instance trivial for AArch64 architectures to increase the number of decoders - unlike in x64 land.

I don't think it's that straightforward. Cortex X1 for example is almost as wide, and yet it can't even approach Apple's IPC.

EdT · Apr 13, 2021

robco74 said:
Not really sure about Intel's future at the moment. They're looking at outsourcing their manufacturing. I suppose if they can't improve their process, may as well fab chips from others that may not require it. But thankfully for Intel, Apple isn't selling chips to others, nor are they licensing any of their IP for others to incorporate into their products. That means everyone else is stuck with them for the moment.

It will be interesting to see what Nvidia does with ARM, provided they're allowed to make the purchase. ARM on server will be interesting to watch as well. Still, this predicament is primarily Intel's fault. If they do go under, they should be allowed to do so, others can buy up their assets and hopefully fare better. But they've got enough momentum to carry them along for at least another decade or so.

I’m not sure Apple would sell to NVidea even if Apple was willing to supply chips to other computer/electronics companies. There’s a lot of bad blood there evidently. But NVidea has the same problem that Apple has had with Intel- unmet production and performance on promised new chipsets. AMD has their house in order (right now anyway) but Intel can’t produce enough improvements at a cost effective price.

I think a lot of companies will look to design their own chips so that they can control their own future. So chip fabricators, not designers, may become the new bottleneck.

HiVolt · Apr 13, 2021

Chipzilla will bounce back, they were down before (Pentium 4), and they bounced back. This time the competition is a little stiffer with AMD and Apple, but Apple is still only Apple market, not PC market.

So they really only have to worry about AMD for the next few years, IMO.

cmaier · Apr 13, 2021

crazy dave said:
Very likely I misunderstood what I was reading as this was a Twitter conversation between people familiar with cpu design and were thus writing with their audience having that built in knowledge in mind - a lot of shorthand followed by “oh of courses” and “that’s interesting” . It’s possible I inverted things.

How might a parallel decoder cluster aid in such cases? How would you know where to start the decoding of the other side of the branch? Is this a design that would be useful for an ultra wide core like firestorm? Or unnecessary?

P.S. thanks for letting me pick your brain, I’m interested and but largely ignorant of the dirty details of how these things work. So I appreciate the time.

once you know you have a conditional branch, you know where the next instruction in the sequence starts (the byte following the end of the conditional branch instruction). If you can decode the target of the conditional branch, you know the location of that instruction, too.

of course, the decoder may not have access to the instruction stream where the branch target is, if it is far away from the branch instruction address. It theoretically might not even be in the cache. So you can get into all sorts of schemes where you are pre-decoding things, etc. One trick is that most conditional branches repeat. You have a loop, and you take 1 branch a thousand times and only take the other branch when the loop ends. So you may decode once, and keep track of it.

There are just a million little ways that x86 makes life hard, and a million complicated tricks to try and make it bearable.

EdT · Apr 13, 2021

Smalltalk said:
No one is stuck with Intel. ARM silicon is available to everyone who wants to use it.

What everyone is stuck with is compatibility issues between ARM and X86 architecture. Apple is willing to dump systems that they feel don’t have a future, but most companies aren’t. Way too big of an established user base who will scream if their software will now require a new architecture computer. Even if a Rosetta type program is possible most companies don’t want to develop and support 3 programs- a new ARM version of their program, the Rosetta translation program, and the legacy X86 version of every program release for a couple of years minimum.

cmaier · Apr 13, 2021

EdT said:
I’m not sure Apple would sell to NVidea even if Apple was willing to supply chips to other computer/electronics companies. There’s a lot of bad blood there evidently. But NVidea has the same problem that Apple has had with Intel- unmet production and performance on promised new chipsets. AMD has their house in order (right now anyway) but Intel can’t produce enough improvements at a cost effective price.

I think a lot of companies will look to design their own chips so that they can control their own future. So chip fabricators, not designers, may become the new bottleneck.

The number of chip designers who can design chips as well as Apple does is a surprisingly limited resource.

Gerdi · Apr 13, 2021

leman said:
But I am talking about algorithmic complexity — if you can do it in constant time in "software" you can probably do it in constant time in hardware, no?

Indeed there are relation between the complexity classes. For example the classes NC and TC can both be defined with help of either, describing a relation to the depth of a uniform circuit (with bounded or unbounded fan-in) or to the runtime of a non-deterministic turing machine or PRAM. So there you have your connection between HW and SW

Ah, well, maybe that's where the misunderstanding lies... I am not talking about parsing n instructions in constant time, I am talking about parsing instructions within a given fixed-size window in constant time. If I remember correctly, all modern x86 CPUs parse instructions in windows of 16 bytes.

Glad, we got this out of the way. I might remind you, that the whole discussion started with the question, if it is harder for a variable length instruction set architecture to increase the number of decoders, which working in parallel. The question never was if i can decode a single instruction in constant time.

O(log n) with a constant n is constant time. The rest is just an implementation detail. Reduction operations on modern ARM CPUs (and GPUs) are O(1).

Look, you seem to have a fundamental misunderstanding what asymptotic complexity of an algorithm (or function) is. Of course a concrete implementation is constant time, because the vector length is fixed - no doubt about it.
The whole point of my example was to show, that there are constant time implementation of algorithms which are not in O(1) - and therefore the conclusion that because it has a SIMD implementation the algorithm has O(1) asymptotic complexity is a fallacy.
Of course in order to understand my argument, you would have to acknowledge that (vector) reduction is in O(logn) with respect to vector size.

I don't think it's that straightforward. Cortex X1 for example is almost as wide, and yet it can't even approach Apple's IPC.

Thats because it is smaller in all features, not just in the decoder. That having said, the decode-width difference pretty much matching the IPC discrepancy - not sure what else you want to conclude.

leman · Apr 13, 2021

Gerdi said:
Look, you seem to have a fundamental misunderstanding what asymptotic complexity of an algorithm (or function) is. Of course a concrete implementation is constant time, because the vector length is fixed - no doubt about it.

I sure hope not, otherwise they would probably not let me teach this stuff...

Gerdi said:
The whole point of my example was to show, that there are constant time implementation of algorithms which are not in O(1) - and therefore the conclusion that because it has a SIMD implementation the algorithm has O(1) asymptotic complexity is a fallacy.
Of course in order to understand my argument, you would have to acknowledge that (vector) reduction is in O(logn) with respect to vector size.

What I don't understand is why asymptotic complexity is even remotely relevant for this discussion. Decoding a stream of instructions or in fact scanning any kind of sequence is always going to be O(n) at best — simply because you have to examine every element of the sequence regardless of what you do.

What we are discussing instead is how to reduce the constant factor. Processing multiple bytes simultaneously is an obvious approach. It won't change the asymptotic complexity, but it will improve the performance by a factor of X.

Gerdi said:
Glad, we got this out of the way. I might remind you, that the whole discussion started with the question, if it is harder for a variable length instruction set architecture to increase the number of decoders, which working in parallel. The question never was if i can decode a single instruction in constant time.

Of course it is harder to speed up parsing a variable-length code by parallelization, but it's not impossible. There is this popular view that since x86 ISA is variable-length, the decoder has to process each instruction on bey one (mind, I am not implying that you subscribe to this view). But this is wrong. A block of k bytes can be preprocessed in constant time to detect the offsets of each individual instruction, and each of these instructions can be decoded in parallel by a separate decoder. This will obviously add some latency and considerable hardware complexity to the implementation, and it won't reduce the asymptotic complexity, but it will improve the decoding performance.

Gerdi said:
Thats because it is smaller in all features, not just in the decoder. That having said, the decode-width difference pretty much matching the IPC discrepancy - not sure what else you want to conclude.

Backends seem rather similar to me (although Apple does have more integer ports and it is also possible that their EUs are in general more capable)

Cortex X1: https://fuse.wikichip.org/news/3543/arm-cortex-x1-the-first-from-the-cortex-x-custom-program/
A14/M1: https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2

The argument would be that building wide cores is difficult because ILP in real world code is inherently limited by dependencies, control flow and things like that. When you look at modern high performance computing, be it Zen or Intel cores, or X1, they seem to converge at roughly comparable backend sizes. It's quite possible that going wider would be subject to diminishing returns. But Apple can go wider and also maintain very high retire rate. Maybe their branch predictor is just that good?

But anyway, I am just grasping at straws here. I don't really insist that this argument has much merit. I'm just curious about these things.

Ratsaremyfreinds · Apr 13, 2021

nvidia has new tricks up its sleeve. im no expert but i say intel better start thinking arm

Cr1- · Apr 13, 2021

cmaier said:
I was a designer on some of AMDs CPUs, and at one point i owned the integer execution units and dispatch. Not much reason x86 can’t go wider, other than the fact that code wouldn’t probably benefit too much from it, due to too much instruction interdependency, I suppose. Microcode is a disadvantage - when you send a complex instruction to the instruction decoder and it replaces it with a sequence of N microops, those microops will tend to have interdependencies which require them to be at least partially sequenced. If, instead, you have Arm, you can let the compiler do some of the work of ordering the instruction stream to take advantage of multiple pipelines, and the instruction stream that reaches the instruction decoder will tend to have fewer clumps of interdependent instructions.

What about the situation when Apple Silicon runs its Rosetta 2 the x86 translation layer? I think it’s a combination of both hardware and software design that Apple Silicon is designed in a way not to lose too much x86 translation performance.

if Rosetta 2 has a way to feed its wider architecture then x86 processors could use the same technique.

crevalic · Apr 13, 2021

leman said:
I don't think it's that straightforward. Cortex X1 for example is almost as wide, and yet it can't even approach Apple's IPC.

This isn't really fair, since there are only 2 SoCs containing X1 cores in existence and they both use a significantly cut down implementation, a single X1 core per SoC and are only used in a couple of smartphones. Even with that they got to within 10-15% of the A13 single core perf which was state of the art at the time of X1 announcement. 6 months newer firestorm cores in A14 had a pretty amazing single core performance uplift, though, but I wouldn't be surprised if Apple pushed extra hard because they knew they would be using the same cores in M1. On another hand Qualcomm couldn't care less (probably because there's simply no market), just look at their pathetic showing with the "highest-performance" (lol) 8cx, which they're now "revising" for the third year in a row by slightly increasing frequency of the 4 years old cores.

If memory serves me right, the motivation behind the X1 cores wasn't even really smartphones or other consumer devices but capturing the emerging ARM server market. I'm fairly sure X1 is just the first blueprint for the ARM's new Cortex-X Custom Program, where customers can request deep architectural and performance changes and ARM does the engineering legwork. But you need customers to actually request high performance and this seemingly didn't happen (yet). X2 cores should be announced any time now, let's see how they fare in comparison to A14/M1.

I'm not trying to put down the amazing work Apple engineers did on the M1, but I'm also curious how much of it is actual architectural magic and how much of it are simply higher internal performance targets vs. competition.

Similarly, I wonder how much of the incredible efficiency of M1 in comparison to the latest x86 CPUs comes down to different overall goals of the architecture. Increasing the frequency comes at an exponential increase in power consumption and I'm really curious to see how a higher clocked version of M1 would look like. I've recently read about a few experiments, where people artificially limited the power draw of their last gen Ryzen CPUs to the one comparable to M1 and the single core performance only differs 20-30% in a scenario AMD has surely not optimized for.

colinwil · Apr 13, 2021

Smalltalk said:
It seems like most people here are not aware that ARM SoCs are available from a number of suppliers. You don't need Apple to provide ARM SoCs.

The thing is, Apple Silicon isn’t an ARM design. It just uses the ARM instruction set, with extra AR bits, video bits, etc. It will be interesting to see how well other ARM SoC manufacturers can compete, both on performance and power/heat.

cmaier · Apr 13, 2021

Cr1- said:
What about the situation when Apple Silicon runs its Rosetta 2 the x86 translation layer? I think it’s a combination of both hardware and software design that Apple Silicon is designed in a way not to lose too much x86 translation performance.

if Rosetta 2 has a way to feed its wider architecture then x86 processors could use the same technique.

Rosetta 2 is mostly a static translation, which means that it’s analogous to a compiler (but a compiler with less information about intent, so there’s a penalty). It’s not particularly relevant to the issue of why Apple has such a high IPC.

crazy dave · Apr 13, 2021

Cr1- said:
What about the situation when Apple Silicon runs its Rosetta 2 the x86 translation layer? I think it’s a combination of both hardware and software design that Apple Silicon is designed in a way not to lose too much x86 translation performance.

if Rosetta 2 has a way to feed its wider architecture then x86 processors could use the same technique.

As far as I’m aware the only additional hardware M1 chips contain to aid Rosetta is an optional x86 memory model mode which were it not there could be a pretty big bottleneck for emulation since the memory models for ARM and x86 are quite different. Otherwise as @cmaier says it’s just a combination of binary translation (basically recompiled) and emulation (translating on the fly) when that doesn’t work.

Icelus · Apr 13, 2021

Cr1- said:
What about the situation when Apple Silicon runs its Rosetta 2 the x86 translation layer? I think it’s a combination of both hardware and software design that Apple Silicon is designed in a way not to lose too much x86 translation performance.

if Rosetta 2 has a way to feed its wider architecture then x86 processors could use the same technique.

Apple Silicon has a runtime toggle to enable TSO, which is enabled for Rosetta processes.
This may increase performance but has no relation to IPC.

i hope intel comes back to beat the m1

macrumors 6502

Suspended

macrumors 6502

Suspended

macrumors 6502

macrumors 68000

macrumors newbie

macrumors 65816

Suspended

macrumors 65816

macrumors Core

macrumors 68020

macrumors 68000

Suspended

macrumors 68020

Suspended

macrumors 6502

macrumors Core

macrumors regular

macrumors newbie

Suspended

macrumors 6502

Suspended

macrumors 65816

macrumors 6502

Our Staff