My understanding is that the cache and CPU clocks have to match too, so that means the CPU clock will also increase.
What do you mean by "the cache and CPU clock have to match"? Do you mean that the cache clock is a multiple of the CPU clock?

The problem is that when the RAM bus clock is different from the CPU cache clock, delays have to be inserted once data starts bursting from RAM. Once data is in the cache, it's basically low-latency fast memory, but it needs to be fetched from RAM first.
Isn't that the reason why out-of-order cores are designed?
 
What do you mean by "the cache and CPU clock have to match"? Do you mean that the cache clock is a multiple of the CPU clock?
Usually it’s 1-1 if I understand it correctly.

Isn't that the reason why out-of-order cores are designed?
If my understanding is correct, OoO designs are usually done to solve the problem of instructions completing after different numbers of clock cycles, or in layman's terms, completing at different times. Instructions completing earlier can be retired and placed in the OoO cache so that the ALU can be used for other instructions.
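
Roughly, as a plain C sketch (hypothetical values, nothing vendor-specific): the divide below has a long latency, the two integer adds don't depend on it, so an out-of-order core can issue and complete the adds while the divide is still in flight, whereas a strictly in-order core would hold them up behind it.

double ooo_example(double a, double b, double c, int i, int j) {
    double slow = a / b;      /* long-latency FP divide                      */
    int x = i + 1;            /* independent of the divide: can finish first */
    int y = j + 2;            /* independent of the divide: can finish first */
    return slow * c + x + y;  /* dependent: has to wait for the divide       */
}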
 
Usually it’s 1-1 if I understand it correctly.
I didn't quite understand what you meant by "cache". I thought you meant the L1 cache, but it sounds like you meant the CPU registers.

OoO designs are usually done to solve the problem of instructions completing after different numbers of clock cycles, or in layman's terms, completing at different times
You are right. I forgot that some instructions take longer than others; floating-point operations, for example, take longer than integer operations.

Don't out-of-order cores also alleviate the latency problem when data is retrieved from cache/memory instead of registers?
 
I didn't quite understand what you meant by "cache". I thought you meant the L1 cache, but it sounds like you meant the CPU registers.
All caches run at the same clock as the CPU core. You can probably consider registers as another type of cache.

Don't out-of-order cores also alleviate the latency problem when data is retrieved from cache/memory instead of registers?
OoO improves CPU efficiency. AFAIK it does not do anything to hide memory/cache access latency.

Memory access latency is a function of, among other things, signal propagation delays and how fast the memory type can make your memory access request ready for you to read off the data lines. Caches are typically SRAM and therefore make data available faster, as there are no refresh cycles. System memory is DRAM, which needs periodic refreshes or data gets corrupted, and that interferes with how quickly it can make memory available when requested.
 
All caches run at the same clock as the CPU core.

Are you sure about this? This might apply to L1/L2 caches but I don’t see this working for the SLC cache. After all, it’s shared by multiple different processors running at different frequencies.

OoO improves CPU efficiency. AFAIK it does not do anything to hide memory/cache access latency.

Why not? OoO is just about reordering execution stream so that you don’t waste time waiting. Whether you have a stall due to a particularly long instruction latency or due to an outstanding memory read, does it really make a difference? From my - admittedly very limited - understanding of OoO and reorder buffers they just try to execute whatever is ready to be executed.
 
Are you sure about this? This might apply to L1/L2 caches but I don’t see this working for the SLC cache. After all, it’s shared by multiple different processors running at different frequencies.
I would think the AS SLC would be running at the same clock as the CPU cores. The GPU, NPU, Media Encoder, etc. would have to load from the SLC into their respective caches via a clock-sync mechanism.

Why not? OoO is just about reordering execution stream so that you don’t waste time waiting. Whether you have a stall due to a particularly long instruction latency or due to an outstanding memory read, does it really make a difference? From my - admittedly very limited - understanding of OoO and reorder buffers they just try to execute whatever is ready to be executed.
My understanding is that the re-order buffer is only used to store completed results? Any ALU getting blocked by a memory request will likely stay blocked until the data is retrieved from memory -> SLC -> L2 -> L1.

Happy to be corrected tho.
 
I would think the AS SLC would be running at the same clock as the CPU cores. The GPU, NPU, Media Encoder, etc. would have to load from the SLC into their respective caches via a clock-sync mechanism.

Wouldn't it be more advantageous to run it at a lower clock to save power? The GPU etc. are bigger users of the SLC bandwidth anyway, and it's already significantly slower than the CPU L2, so is there any advantage to running it at the CPU clock? Just thinking out loud.

FWIW, Apple patents depict an on-chip network that connects various processors and the memory controllers. Under this schema the SLC is part of the memory controller. According to those diagrams the CPU and GPU are not connected directly — all communication goes through the SLC/memory controller. Of course, it's likely an oversimplification.

My understanding is that the re-order buffer is only used to store completed results? Any ALU getting blocked by a memory request will likely stay blocked until the data is retrieved from memory -> SLC -> L2 -> L1.

But if you decouple memory operations and ALU operations, wouldn't this schema allow you to hide the RAM access latency? Instead of getting the ALU blocked you issue a load operation and only proceed with the ALU operation once that particular result is available. And while the load is still being processed the ALU is free to retire other operations. In other words, you decouple the instruction issue (recording it in the queue/buffer/whatever those things are called) from instruction execution (it only gets cleared for execution if all data is ready and there is a free ALU).
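
As a hedged software analogy (hypothetical C, with the next load hoisted by hand; an OoO core does this reordering in hardware from plain sequential code), the load for the next element is issued early so the ALU work on the current element never has to sit behind it:

#include <stddef.h>

long sum_transformed(const long *data, size_t n) {
    if (n == 0) return 0;
    long sum = 0;
    long cur = data[0];           /* first load issued before any ALU work   */
    for (size_t i = 1; i < n; i++) {
        long next = data[i];      /* next load starts while we do the math   */
        sum += cur * 3 + 1;       /* ALU work on the value already loaded    */
        cur = next;
    }
    return sum + cur * 3 + 1;     /* handle the last element                 */
}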
 
Wouldn't it be more advantageous to run it at a lower clock to save power? The GPU etc. are bigger users of the SLC bandwidth anyway, and it's already significantly slower than the CPU L2, so is there any advantage to running it at the CPU clock? Just thinking out loud.

FWIW, Apple patents depict an on-chip network that connects various processors and the memory controllers. Under this schema the SLC is part of the memory controller. According to those diagrams the CPU and GPU are not connected directly — all communication goes through the SLC/memory controller. Of course, it's likely an oversimplification.
Well, SRAM doesn't consume a lot of power when it's not being used, unlike DRAM, which needs to be refreshed periodically, so clocking the SRAM a little higher probably doesn't cost too much power, but it provides higher burst bandwidth.

My understanding is that the SLC is the magic that makes UMA possible in AS, so it is crucial that it is able to feed data to all components within the SoC as fast as it can, and the fastest clock is the CPU clock. Maybe someone who is familiar with AS cache design can chime in. I'm probably way above my pay grade here.

So the GPU L2/L1 caches will likely load cache lines from the SLC, where the SLC will likely have a wide cache line. The GPU cache then moves the data into its slower L2/L1/tile memory and repeats for the next cache-line load. The same will happen with the other processing cores, e.g. the NPU.

But if you decouple memory operations and ALU operations, wouldn't this schema allow you to hide the RAM access latency? Instead of getting the ALU blocked you issue a load operation and only proceed with the ALU operation once that particular result is available. And while the load is still being processed the ALU is free to retire other operations. In other words, you decouple the instruction issue (recording it in the queue/buffer/whatever those things are called) from instruction execution (it only gets cleared for execution if all data is ready and there is a free ALU).
Well, put this way, yeah, it's a way to hide memory loads and stores.
 
Isn't that the reason why out-of-order cores are designed?

When you have an ISA (x86) with variable-length instructions, out-of-order execution allows the CPU to finish instructions as they complete rather than having to wait for longer instructions ahead of them to complete first. That allows for two benefits: not having to expand the on-chip cache to queue additional instructions, and faster overall performance because pipeline delays are minimized. The other advantage of OOE is that if one instruction stalls for some reason, the processor can move another instruction that is ready for execution to the front of the line while the stalled instruction waits for whatever information it needs to complete.

With ARM and Apple Silicon, instructions are fixed length, so there is less of an issue to begin with on that front.
 
When you have an ISA (x86) with variable-length instructions, out-of-order execution allows the CPU to finish instructions as they complete rather than having to wait for longer instructions ahead of them to complete first. That allows for two benefits: not having to expand the on-chip cache to queue additional instructions, and faster overall performance because pipeline delays are minimized. The other advantage of OOE is that if one instruction stalls for some reason, the processor can move another instruction that is ready for execution to the front of the line while the stalled instruction waits for whatever information it needs to complete.

With ARM and Apple Silicon, instructions are fixed length, so there is less of an issue to begin with on that front.

OoO is what makes processors fast. Your instruction length doesn't matter: unless you have some other way to execute multiple operations simultaneously (like VLIW), you absolutely need OoO if you care about performance.

The funny thing is that the complicated instruction encoding of x86 makes it more costly to decode multiple instructions at once and thus more difficult to keep the execution resources utilized. With ARM, it's basically a business decision how many decoders they want to throw at the problem. Which is one of the reasons why modern ARM designs have more OoO resources than x86 CPUs. I mean, the new ARM X4 dispatches 10 operations per cycle. Zen4 does six.
 
Last edited:
If my understanding is correct, OoO designs are usually done to solve the problem of instructions completing after different numbers of clock cycles, or in layman's terms, completing at different times. Instructions completing earlier can be retired and placed in the OoO cache so that the ALU can be used for other instructions.

That is mostly correct, but the thing about AS is they have a bunch of ALU-type execution units, each with its own dispatch queue, allowing the dispatcher to fling a bunch of ops out at once. That is the out-of-order part. The EUs are job-type-specific (e.g., IUs, FPUs, LSUs), so you have all these queues flowing side-by-side and when an op is done, it gets sent to its slot in the completion queue to await retirement: the completion queue runs in program order.
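
A toy sketch of that completion/retirement ordering in C (hypothetical structure and size, far simpler than the real thing): entries are allocated in program order, can be marked done in any order, but only leave from the head, so retirement stays in program order.

#include <stdbool.h>

#define ROB_SIZE 16               /* toy size; real cores use hundreds of entries */

typedef struct {
    bool valid[ROB_SIZE];         /* slot holds an op in flight                   */
    bool done[ROB_SIZE];          /* op has finished executing                    */
    int  head, tail;              /* retire from head, allocate at tail           */
} rob_t;

/* Allocate the next slot in program order; returns -1 if the buffer is full. */
int rob_alloc(rob_t *r) {
    int slot = r->tail;
    if (r->valid[slot]) return -1;        /* full: dispatch stalls                */
    r->valid[slot] = true;
    r->done[slot]  = false;
    r->tail = (r->tail + 1) % ROB_SIZE;
    return slot;
}

/* An execution unit marks its op complete; this can happen in any order. */
void rob_complete(rob_t *r, int slot) {
    r->done[slot] = true;
}

/* Retire as many ops as possible, strictly from the head (program order). */
int rob_retire(rob_t *r) {
    int retired = 0;
    while (r->valid[r->head] && r->done[r->head]) {
        r->valid[r->head] = false;
        r->head = (r->head + 1) % ROB_SIZE;
        retired++;
    }
    return retired;
}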

Ops have to be sequenced according to dependencies, as there is no fixed register file, just a large rename resource. The dispatcher has to identify the appropriate renames an op will need and make sure the op does not go into the EU before those renames are consistent with preceding code.
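
And a very rough sketch of the rename bookkeeping itself (hypothetical sizes and field names, not Apple's actual structures): every write to an architectural register grabs a free physical register and updates the mapping, so younger ops read a consistent value even while older ops are still in flight. Freeing the previous physical copy at retirement is omitted here.

#include <stdbool.h>

#define ARCH_REGS 32
#define PHYS_REGS 128   /* the "large rename resource"; size is a placeholder */

typedef struct {
    int  map[ARCH_REGS];     /* architectural register -> physical register */
    bool busy[PHYS_REGS];    /* physical register currently allocated       */
} rename_table_t;

/* Writing architectural register rd: allocate a fresh physical register.
 * Returns its index, or -1 if none is free (dispatch would stall). */
int rename_dest(rename_table_t *rt, int rd) {
    for (int p = 0; p < PHYS_REGS; p++) {
        if (!rt->busy[p]) {
            rt->busy[p] = true;
            rt->map[rd] = p;     /* younger readers of rd now see this copy */
            return p;
        }
    }
    return -1;
}

/* Reading architectural register rs: just follow the current mapping. */
int rename_src(const rename_table_t *rt, int rs) {
    return rt->map[rs];
}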

All caches run at the same clock as the CPU core.

The L2s serve core clusters, but the L3 serves all the clusters. The E-core cluster runs significantly slower than the P-core cluster (by at least a GHz).
 
Ops have to be sequenced according to dependencies, as there is no fixed register file, just a large rename resource. The dispatcher has to identify the appropriate renames an op will need and make sure the op does not go into the EU before those renames are consistent with preceding code.
Is this how Tomasulo's algorithm works?

new ARM X4 dispatches 10 operations per cycle. Zen4 does four.
Zen 4 decodes 4 instructions, but dispatches 6 mops.

 
The L2s serve core clusters, but the L3 serves all the clusters. The E-core cluster runs significantly slower than the P-core cluster (by at least a GHz).
Thinking back, I guess I got caught up in DRAM thinking. Since caches are SRAM, which does not need refreshes, they can basically be clocked at any rate, as long as enough time is allowed to write data into the cells and for the signals to stabilise. So it could be that the different processing cores are all reading from the SRAM at their own clock rates, with time allowed for signal stabilisation.
 

Note also that the queue fill from the uOp-cache side is not limited to 4 instructions/cycle either: 9 per cycle while dispatching 6, so it can backfill the queue once execution tracks into a zone that is covered by the uOp cache.


Additionally, on the micro-op cache hit rate in the table ... for Golden Cove (Gen 12 Alder Lake):

"... Intel states that the decoder is clock-gated 80% of the time, instead relying on the µOP cache. This has also seen extremely large changes this generation: first of all, the structure has now almost doubled from 2.25K entries to 4K entries, mimicking a similar large increase we had seen with the move from AMD’s Zen to Zen2, increasing the hit-rate and further avoiding going the route of the more costly decoders. ..."
https://www.anandtech.com/show/16881/a-deep-dive-into-intels-alder-lake-microarchitectures/3


The larger uOp cache is likely sucking up some power, but some of this doom-and-gloom about big instruction-decode power overhead is being mitigated with each generation. (The uOp cache might be in the uncore accounting in the referenced table; not sure.) There is basically another level of cache under the L1 layer that holds just instructions (not data).


So on code like

for (int i = 0; i <= 10000; i++) {
    /* ... 10 math ops ... */
}

for loop iterations 2 to 10000, those instructions could be in the uOp cache and all the "x86 is doomed" stuff is moot.

while on code like this

if (foo > bar) {
    /* ... 5 ops ... */
} else {
    /* ... 10 ops ... */
}

if (fred < barney) {
    /* ... 10 ops ... */
} else {
    /* ... 5 ops ... */
}

if (wilma == betty) {
    /* ... 8 ops ... */
} else {
    /* ... 3 ops ... */
}


that appears inside no loop (code you run through once and then don't come back to for a long while) does labor under the decode overhead.


So it is a blend. Even Arm high performance implementations are going to use a uOp cache because even for fixed and uniform sized instructions it gets to be more than kind of goofy to re-decode an instruction that you just decoded 10-15 instructions before while wandering completely inside of a relatively tight loop.
 
Even Arm high performance implementations are going to use a uOp cache because even for fixed and uniform sized instructions it gets to be more than kind of goofy to re-decode an instruction that you just decoded 10-15 instructions before while wandering completely inside of a relatively tight loop.
Interestingly, Arm’s mop cache approach has been diminishing for a few years. The cache shrunk from 3,000 to 1,500 entries in the X3. Arm removed the mop cache entirely from the A715 when introducing smaller 64-bit only decoders, moving the instruction fusion mechanism into the instruction cache to enhance throughput. It appears Arm has taken the same approach here with the wider X4 core.

One of the more noteworthy changes compared to the A710 is that the new core is 64-bit only. The absence of AArch32 instructions has allowed Arm to shrink the size of its instruction decoders by a factor of 4x compared to its predecessor, and all of these decoders now handle NEON, SVE2, and other instructions. Overall, they’re more efficient in terms of area, power, and execution.
While Arm was revamping the decoders, it switched to a 5 instruction per cycle i-cache, up from 4-lane, and has integrated instruction fusion from the mop-cache into the i-cache, both of which optimize for code with a large instruction footprint. The mop-cache is now completely gone. Arm notes that it wasn’t hitting that often in real workloads so wasn’t particularly energy-efficient, especially while moving to a 5-wide decode. Removing the mop-cache lowers overall power consumption, contributing to the core’s 20% power efficiency improvement.
 
So it is a blend. Even Arm high performance implementations are going to use a uOp cache because even for fixed and uniform sized instructions it gets to be more than kind of goofy to re-decode an instruction that you just decoded 10-15 instructions before while wandering completely inside of a relatively tight loop.

Apple doesn't use an uOP cache, and ARM appears to have dropped it in Cortex X4. It seems that decoding Aarch64 instructions is cheap enough that the uOP cache is not needed. Some discussion: https://www.realworldtech.com/forum/?threadid=212032&curpostid=212069

For x86, the uOP cache is of course indispensable because otherwise they probably won't be able to feed the SIMD backend. Not to mention that x86 decodes need multiple cycles to operate.
 
Last edited:
Nearly all AArch64 instructions translate directly into a single μop – even the pointer writeback ops are a single μop (though those do seem to require an extra retire cycle). The only instructions that might be split into multiple μops are the eleven combined atomics (CASs, SWP and eight math-memory-writeback ops), and while these are not exactly rare in a multi-core processor, they probably make up less than a couple percent of code.
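
As a hedged illustration of those combined atomics (plain C with the GCC/Clang __atomic builtin; the actual output depends on target flags): built for ARMv8.1-A or later, an atomic add like this is typically emitted as a single LSE instruction such as LDADD rather than a load-exclusive/store-exclusive loop.

#include <stdint.h>

uint64_t bump(uint64_t *counter) {
    /* One math-memory-writeback op: fetch, add 1, write back, return the old value. */
    return __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
}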

AArch64 decode is itself trivial. Bits [28-25] of the opcode select the op type, and the remaining fields of the instruction are fairly consistent. Register selectors are always in the same places. Moreover, decode itself is not a discrete function: preliminary decode occurs before/during dispatch and final decode occurs in the target EU – this will also be somewhat the case with x86.
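
A small sketch of that top-level split (the grouping below is abbreviated from the public A64 encoding table, and the group names are informal labels): one shift-and-mask on bits [28:25] is enough to route an instruction word to its decode group.

#include <stdint.h>

static inline unsigned a64_op0(uint32_t insn) {
    return (insn >> 25) & 0xF;   /* bits [28:25] of the 32-bit instruction word */
}

const char *a64_group(uint32_t insn) {
    switch (a64_op0(insn)) {
        case 0x8: case 0x9:                 return "data processing, immediate";
        case 0xA: case 0xB:                 return "branches and system";
        case 0x4: case 0x6:
        case 0xC: case 0xE:                 return "loads and stores";
        case 0x5: case 0xD:                 return "data processing, register";
        case 0x7: case 0xF:                 return "scalar FP and SIMD";
        default:                            return "reserved / SVE / other";
    }
}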

The ARM core has 64 registers, which do not actually reside in a fixed register file but float around in the rename array (because internal content is highly dynamic, architecting a fixed file is kind of pointless and wasteful – the situation is somewhat different for x86). Hence, perhaps the heaviest workload for the core is managing internal content. Acquiring resources between instruction read and op dispatch is the largest non-EU logic block in the core, along with maintaining value consistency across instructions. Optimally compiled code may reduce this overhead a bit.

There are few downsides to fixed-length instructions, and almost none of those downsides are addressed by the x86 model. There are really no downsides to the large general-use register file, which highlights the weakness of x86's dedicated-function registers. The design architecture of x86 is at least as problematic as its messy ISA: the overall architecture is just deeply inefficient.
 
Apple doesn't use an uOP cache, and ARM appears to have dropped it in Cortex X4. It seems that decoding Aarch64 instructions is cheap enough that the uOP cache is not needed. Some discussion: https://www.realworldtech.com/forum/?threadid=212032&curpostid=212069

The Neoverse N2 has one.

So does the V2


But the core counts there are way, way, way higher. If you're doing HPC and/or lock-contention-heavy workloads with relatively low core counts, though, it may not matter as much.

But yes, if you toss out everything so that the instructions very closely match the uOp internals, then the decode overhead is low. x86 can't throw it all out, but pushing stuff folks aren't using much down to the uOp level doesn't make sense either. And compilers could use x86-64 in a more modern way than the opcodes they were turning out 10 years ago.
 
Last edited:
The other aspect to Apple's P-core design is the instruction reorder buffer, which can have over 600 ops in flight at any given time. This would eliminate the need for a μop cache, as, theoretically, μops could simply be rerouted right back out of the buffer to the dispatcher without requiring re-fetch. Most of the tight loops that would benefit from the μop cache would easily fit in that enormous ROB.
 
It makes sense that the N2 and V2 have a uOP cache, because the N2 appears to be based on the A710 and the V2 on the X2. Although the X2 cannot run 32-bit applications, unlike the A710, ARM did not remove the uOP cache until the X4.

Yeah, it will probably shrink and/or be more narrowly targeted in the _3 and _4 iterations. The workloads are different enough that they may want a different uOp approach on that server track, though, or pull the cache out completely (if that second track stays viable).

Ampere One's overview so far isn't clear about whether it has one or not. They seem to have put a lot more time and effort into prediction and more accurate prefetching. The predictors can compete for space with the uOp cache.
 
Ampere One's overview so far isn't clear about whether it has one or not. They seem to have put a lot more time and effort into prediction and more accurate prefetching. The predictors can compete for space with the uOp cache.
Would a small L1 instruction cache show that Ampere One has no uOps cache?
The real interesting change here is the reduction from a 64KB 4-way L1 instruction cache to a 16KB 4-way L1 instruction cache.

 
Would a small L1 instruction cache show that Ampere One has no uOps cache?

.....

It does not say anything either way. But the 'why' it got smaller is likely relevant.

From the SemiAnalysis article ...

"...
But as for the former point, we don’t know why Ampere chose to cut the L1i but there are at least 2 reasons we can think of. The reduction in the L1i is for area saving because L1i uses very un-dense SRAM with high accessibility so cutting it to 16KB could save a lot of area. The other reason is that the workloads they profiled get significantly less performance as you go from 16KB to 32KB to 64KB such as SPEC INT 2017 which isn’t super heavy on the instruction side.

Now, we suspect that it’s likely a combination of these factors along with other pressures that had Ampere go this direction. ..."

I highly doubt the workloads they used to drive the L1 instruction cache sizes were primarily SPEC, though. At least, not the single-threaded performance runs. SPECrate with at least 64 cores going is far, far more likely. The core count for Ampere One starts at 136! (Ampere is keeping the Altra around for those folks who need to 'slum' at 64-core counts. :) ) It is not a single-threaded drag-racing SoC.

At 136 cores with 48KB of 'extra' L1i per core, that amounts to 6,528KB of extra space. At 192 cores that is 9,216KB. Do you want 9MB more L2, or do you want to top-fuel drag race on single-threaded code with a bigger L1 instruction cache? For highly concurrent, data-stream-intensive workloads you probably want the data (since you're likely swamping the memory controllers).

You also have to factor in the L2 'problem' they have: 2MB per core ... 192 cores ... 384MB. Just L2. (9MB isn't going to completely 'pay' for that, but it helps a little.)

For this article the reporters talked to Ampere's Chief Product Officer.

"...
In any event, Wittich did give us some insight into the key performance improving features of the A1 core.

“We have really sophisticated branch misprediction recovery,” Wittich tells The Next Platform. “If an application mispredicts a lot, you have got to recover quickly and roll back and waste as few cycles as possible. Having a really accurate L1 data prefetcher is really useful, and it fits with our design philosophy of keeping lots of stuff in L1 and L2 cache and stop going out to SLC and main memory for stuff. We’ve been able to come up with a very, very accurate prefetching algorithm. ..."
https://www.nextplatform.com/2023/0...n-front-of-x86-with-192-core-siryn-ampereone/

I would expect that if their prefetcher is extremely accurate on 'data', it also works (with some tweaks) on 'instruction data'. (**) The L1 instruction cache got much smaller, but the L2 got bigger (plus all this prefetching and branch prediction works on 'data' stored somewhere). And if you haven't predicted what to put in your L1 instruction cache, is another 4-way 48KB really going to help fast recovery?

The Ampere One is mostly designed not to 'sag' on performance under high load (e.g., keeping all the cores occupied most of the time). The speculative chasing to completely 'max out' the biggest ROB possible and maximize the number of loads/stores in flight on a single core is being balanced against several tens of cores asking for stuff at the same time. (Even if the L1/L2 hit rates are high: with 150 cores at a 90% hit rate, that is still 15 cores all screaming 'give me data NOW!'. Somebody is going to be missing as the pool of possible requestors gets bigger.)

Very similar to the reason Apple caps the P-core and E-core clusters at a fraction of what the memory subsystem can do (lots of chattering GPU/NPU/non-CPU load/store requests going on, with low-latency response issues).


Redundantly storing both the opcode and the micro-op version of the same instruction gets painful if you're running short on space budget. There are definitely bigger 'other than the L1 instruction cache' things that they are adding here.



(**) Prefetching highly accurately has to mean looking at the instruction stream in some decently deep manner. The 'flow' of the instructions has to be woven in somewhat, otherwise you wouldn't 'know' what data you're going to need in the future. You need to know 'what you are doing' in order to narrow down 'what you need'.
 
Last edited:
What features differentiate a real-time processor, such as Apple's R1, from the processors used in the iPhone and Mac?
 