Would a small L1 instruction cache show that Ampere One has no uOps cache?
Chiplets, Advanced Packaging, and Disingenuous Performance Claims (www.semianalysis.com)
The article does not say either way, but the 'why' it got smaller is likely relevant.
From the SemiAnalysis article:
"...
But as for the former point, we don’t know why Ampere chose to cut the L1i but there are at least 2 reasons we can think of. The reduction in the L1i is for area saving because L1i uses very un-dense SRAM with high accessibility so cutting it to 16KB could save a lot of area. The other reason is that the workloads they profiled gain insignificantly more performance as you go from 16KB to 32KB to 64KB, such as SPEC INT 2017, which isn’t super heavy on the instruction side.
Now, we suspect that it’s likely a combination of these factors along with other pressures that had Ampere go this direction. ..."
I highly doubt the workloads they used to size the L1 instruction cache were primarily SPEC, though. At least, not the single-threaded runs. SPEC-rate with at minimum 64+ cores going is far, far more likely. The core count for AmpereOne starts at 136! (Ampere is keeping the Altra around for those folks that need to 'slum' at 64-core counts.) It is not a single-threaded drag-racing SoC.
At 136 cores, the 48KB of 'extra' L1i per core amounts to 6,528KB of extra space. At 192 cores, that is 9,216KB. Do you want ~9MB more L2, or do you want to top-fuel drag race single-threaded code with a bigger L1 instruction cache? For highly concurrent, data-stream-intensive workloads you probably want the data (since you are likely swamping the memory controllers).
You also have to enumerate the L2 'problem' they have: 2MB per core across 192 cores is 384MB of L2 alone. (9MB isn't going to completely 'pay' for that, but it helps a little; quick arithmetic check below.)
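Quick sanity check on those numbers (the 48KB per-core L1i delta and 2MB-per-core L2 figures come from the reporting above; the rest is just multiplication):

```python
# Arithmetic check on the cache-area claims above.
# Assumptions: 48KB per-core L1i difference (64KB -> 16KB) and
# 2MB of private L2 per core, as stated in the post.

L1I_DELTA_KB = 48          # 64KB - 16KB per core
L2_PER_CORE_MB = 2

for cores in (136, 192):
    l1i_savings_kb = cores * L1I_DELTA_KB
    total_l2_mb = cores * L2_PER_CORE_MB
    print(f"{cores} cores: L1i savings = {l1i_savings_kb}KB "
          f"(~{l1i_savings_kb / 1024:.1f}MB), total L2 = {total_l2_mb}MB")

# 136 cores: L1i savings = 6528KB (~6.4MB), total L2 = 272MB
# 192 cores: L1i savings = 9216KB (~9.0MB), total L2 = 384MB
```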
For this article, the reporters talked to Ampere's Chief Product Officer.
"...
In any event, Wittich did give us some insight into the key performance improving features of the A1 core.
“We have really sophisticated branch misprediction recovery,” Wittich tells The Next Platform. “If an application mispredicts a lot, you have got to recover quickly and roll back and waste as few cycles as possible. Having a really accurate L1 data prefetcher is really useful, and it fits with our design philosophy of keeping lots of stuff in L1 and L2 cache and stop going out to SLC and main memory for stuff. We’ve been able to come up with a very, very accurate prefetching algorithm. ..."
https://www.nextplatform.com/2023/0...n-front-of-x86-with-192-core-siryn-ampereone/
I would expect that if their prefetcher is extremely accurate on 'data', it also works (with some tweaks) on 'instruction data'. (**) The L1 instruction cache got much smaller, but the L2 got bigger (plus all this prefetching and branch-prediction state that works on 'data' is stored somewhere). And if you haven't predicted what to put in your L1 instruction cache, is another 4-way 48KB really going to help fast recovery?
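Ampere hasn't published their prefetch algorithm, so purely as an illustration of the general idea, here is a toy stride-detector sketch. Everything in it (the class name, the confidence threshold) is made up for illustration; the point is just that the same mechanism works on any address stream, whether it comes from loads/stores or from instruction fetch:

```python
# Minimal stride-prefetcher sketch (illustrative only -- NOT Ampere's
# algorithm, which is unpublished). It watches an address stream and,
# once the same stride repeats enough times, predicts the next address.

class StridePrefetcher:
    def __init__(self, confidence_needed=2):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0
        self.confidence_needed = confidence_needed  # assumed threshold

    def access(self, addr):
        """Observe one address; return an address to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                self.confidence += 1
            else:
                self.confidence = 0
            self.last_stride = stride
            # Only prefetch once the stride has repeated enough times.
            if self.confidence >= self.confidence_needed:
                prefetch = addr + stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
for a in [0x1000, 0x1040, 0x1080, 0x10C0, 0x1100]:
    hint = pf.access(a)
    print(hex(a), "->", hex(hint) if hint else "no prefetch")
```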
The AmpereOne is mostly designed not to 'sag' on performance under high load (e.g., keeping all the cores occupied most of the time). The speculative chasing to completely 'max out' the biggest possible ROB and maximize the number of loads/stores in flight on a single core is being balanced against several tens of cores asking for stuff at the same time. (Even if L1/L2 hit rates are high, with 150 cores at a 90% hit rate that is still 15 cores all screaming 'give me data NOW!' Somebody is going to be missing as the pool of possible requestors gets bigger.)
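Back-of-envelope version of that 'screaming cores' point (150 cores and 90% are the numbers from above; treating the hit rate as an instantaneous per-core probability is my simplification):

```python
# With N cores each hitting cache with probability p at any instant,
# the expected number of cores currently missing is N * (1 - p).

for cores, hit_rate in [(150, 0.90), (150, 0.95), (192, 0.90)]:
    missing = cores * (1 - hit_rate)
    print(f"{cores} cores @ {hit_rate:.0%} hit rate -> "
          f"~{missing:.0f} cores waiting on memory at once")

# 150 cores @ 90% hit rate -> ~15 cores waiting on memory at once
```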
It is a very similar reason to why Apple caps the P-core and E-core clusters at a fraction of what the memory subsystem can do. (Lots of chattering GPU/NPU/non-CPU load/store requests going on with low-latency response requirements.)
Redundantly storing both the opcode and the micro-op version of the same instruction gets painful if you are running out of space budget. There are definitely bigger 'other than L1 instruction cache' things that they are adding here.
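Rough, illustrative-only arithmetic on that redundancy cost (all sizes below are assumptions I picked for illustration; Ampere has published no uOp-cache figures):

```python
# Illustrative-only sketch of why a uOp cache is "redundant" storage:
# the same instructions live both as raw opcodes in the L1i and as
# decoded micro-ops in the uOp cache. Sizes are assumed, not AmpereOne
# figures.

UOP_CACHE_ENTRIES = 1536   # assumed entry count
BYTES_PER_UOP = 8          # assumed decoded micro-op size
INSN_BYTES = 4             # fixed 4-byte Arm instructions

uop_cache_kb = UOP_CACHE_ENTRIES * BYTES_PER_UOP / 1024
duplicated_kb = UOP_CACHE_ENTRIES * INSN_BYTES / 1024

print(f"uOp cache SRAM: ~{uop_cache_kb:.0f}KB")
print(f"same instructions also held as opcodes in L1i: ~{duplicated_kb:.0f}KB")
```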
(**) Prefetching highly accurately has to mean looking at the instruction stream in some decently deep manner. The 'flow' of the instructions has to be woven in somehow; otherwise you wouldn't 'know' what data you are going to need in the future. You need to know 'what you are doing' in order to narrow down 'what you need'.