I think there is an important difference in that the SLC cache is associated with the memory controller and not the CPU. It’s more of a coherency protocol/energy saving feature. And it has the same bandwidth as the DRAM.
Not sure what you mean by it not being associated with the 'CPU'. Do you mean not the 'cores'? Or that the memory controller has its own "communication agent" on the internal bus? Memory controllers are all inside modern CPUs ( in the package if chiplets , on the die if not ).
The bandwidth being the same for the SLC 'fronting' the memory controller is because they use the same internal bus system agent ( which can only handle 4 address requests at a time). But bandwidth isn't the major factor when talking about caches. Latency is the bigger issue. If the SLC latency is the same as the DRAM's ... what is it doing?
When stuff is delivered matters as much , if not more , than how much is delivered. How many register loads can the core actually do per clock? Any bandwidth beyond that is going where? Not to that core.
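Rough back-of-the-envelope on that point (every number below is an illustrative assumption, not a measured Apple figure):

```python
# Back-of-the-envelope: how much bandwidth can a single core actually sink?
# Every number below is an illustrative assumption, not a measured Apple figure.

loads_per_cycle = 3        # assumed number of load ports on the core
bytes_per_load = 16        # assumed widest load (a 128-bit register)
core_clock_ghz = 3.2       # assumed core clock

core_demand_gb_s = loads_per_cycle * bytes_per_load * core_clock_ghz
print(f"one core can consume at most ~{core_demand_gb_s:.0f} GB/s")

# Anything the memory system can stream beyond that is going to other agents
# (other cores, GPU, NPU), not to this core.
assumed_dram_gb_s = 200    # illustrative DRAM bandwidth, not a real spec
print(f"surplus beyond one core: ~{assumed_dram_gb_s - core_demand_gb_s:.0f} GB/s")
```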
If you're trying to say that one bank of the L2 has a minor latency difference from the other banks: if the cache is physically big enough and the cores are spread out into different corners around the cache... that wouldn't be surprising. But it really wouldn't add another 'cache level' if the gap is relatively small ( relative to the next level up).
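To put toy numbers on that (cycle counts made up purely for illustration, not measurements):

```python
# Toy numbers for the 'is it another cache level?' question; all cycle counts
# are made up for illustration, not measurements.

near_bank = 16    # assumed latency to the L2 bank nearest the core
far_bank = 20     # assumed latency to a bank in the far corner of the L2
next_level = 90   # assumed latency to the next level up (SLC)

bank_gap = far_bank - near_bank        # 4 cycles
level_gap = next_level - far_bank      # 70 cycles

# A few cycles of bank-to-bank skew next to a ~70-cycle jump to the next level
# reads as one slightly non-uniform L2, not as an extra cache level.
print(f"bank-to-bank gap: {bank_gap} cycles vs gap to next level: {level_gap} cycles")
```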
Of course, at the end of the day this is probably a translation problem. If you go strictly by the nominal nomenclature (by cache level) then SLC is obviously L4. But cache levels also have certain functional connotations to them, and from that perspective I’d describe Apple's L2 as a hybrid of a traditional L2/L3.
I still don't see where you are getting L3 from. If there is some command that moves a "remote" L2 cache line to a "fast"/"faster" L2 cache, then there is a path to calling that remote location a different cache level. If the request goes out and the L2 cache line never moves from where it was initially placed ( even if it takes very slightly longer), then those are the same cache level.
The documents outline that the L2 tracks multiple cores sharing the same L2 data. So the nominal mode appears to be that the data doesn't really 'move' or get duplicated. If it isn't duplicating or moving the data, then it is the same level.
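A minimal sketch of that reading (my own illustration of the idea, not Apple's actual structures):

```python
# Sketch of that reading: a line in the shared L2 stays in whichever bank it
# was filled into, and the tag simply tracks which cores are sharing it.
# Structure and names are my own illustration, not Apple's actual design.

class L2Line:
    def __init__(self, addr, home_bank):
        self.addr = addr
        self.home_bank = home_bank   # bank the line was originally filled into
        self.sharers = set()         # cores the L2 is tracking for this line

    def access(self, core_id):
        # No migration, no duplication: just record the extra sharer and
        # serve the data from wherever it already lives.
        self.sharers.add(core_id)
        return self.home_bank

line = L2Line(addr=0x4000, home_bank=1)
line.access(core_id=0)
line.access(core_id=3)
print(line.sharers)   # {0, 3} -- the same cache level for both cores
```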
There is some asymmetry. Each core is assigned a “fast” L2 portion and it costs extra to access the portions associated with other cores. Maynard’s document has more info, I don't remember the details. It’s not exactly like IBM's implementation but the L2 is not uniformly shared either.
Errr this one??
drive.google.com
There are four units, but that is mainly driven both by the size of the cache ( the bigger it is, the more chunks it is pragmatically going to need to be composed of) and by Apple chasing lower power ( shut down parts of the L2 cache (and cores) if they happen to not be needed at the moment).
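Something along these lines, purely as an illustration of the power-gating angle (the 1:1 core-to-unit pairing is an assumption made just for the example):

```python
# Purely illustrative: with the L2 split into four units, idle units (and the
# cores paired with them) can be shut down rather than keeping the whole
# cache powered. Assumes a 1:1 core-to-unit pairing just for the example.

L2_UNITS = 4

def gated_units(active_cores):
    """Return which L2 units could be powered down given the active cores."""
    needed = {core % L2_UNITS for core in active_cores}
    return set(range(L2_UNITS)) - needed

print(gated_units({0}))           # {1, 2, 3}: three of four units can be off
print(gated_units({0, 1, 2, 3}))  # set(): everything busy, nothing gated
```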
page 270
"...
All this is historically interesting, but raises a meta-issue. When cores share an L2, either they all run at
the same frequency, or the cost to access the L2 is substantially higher because of buffering across
clock domains. So what’s the right tradeoff? Later below we’ll see a very complicated (but clever!)
Apple patent for optimal OS scheduling given the constraint of multiple cores having to run at the same
frequency. But more generally, as we move from the simple world of one performance and one
efficiency cluster, is four the optimal number of CPUs sharing an L2 (and thus sharing frequency)?
There are clearly advantages to a large L2, and a shared L2; but there are also costs. Is a better tradeoff
perhaps three cores sharing an L2, and the future low-end being something like two performance
clusters (ie six performance cores)? Or two cores sharing a substantially smaller L2, and all the two-core
clusters (including the efficiency two-core clusters) sharing a large CPU L3 (distinct from the SLC),
operating on its ...
"
page 270-271 ( L2 cache snoops for all 4 processors )
"...
What Apple do, however, is neither exclusive nor inclusive. To implement this, while retaining the
advantages of snoop filtering by a higher level cache, the tags of lower level caches are duplicated in
the higher level caches. As far as I can tell this happens in both the L2 and SLC, and the idea seems to
be something like:
- core to core snooping will mostly occur within a core complex, and can be handled well by having the
L2 for that complex snoop for all 4 processors of the complex
- snooping between different types of IP (like NPU snooping changes in a GPU cache) will be filtered at
the SLC level
- left uncertain is whether snoops between the Performance Core Complex and the Efficiency Core
Complex are treated as a special case or handled by the SLC.
..."
page 227 ( L2 split into 4 big, addressable chunks )
" ...
Third within a Core Complex, the L2 is apparently split the same way, into four independently
addressable units, each of which is capable of providing close to 32B/clock cycle, so we do see total L2
aggregate bandwidth increase linearly with the number of cores. My guess is the drop in bandwidth by ~25%
has to do with writing the line into the L1D, something like
- for four cycles I can read 32B/cycle from L1D (16 B from each half of the L1D)
- during those same four cycles each of four quarters of a 128B line is transferred from L2
- during the fifth cycle I can’t access the L1D because the fully-accumulated line is transferred from a
line buffer into the cache. ..."
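Working through that fill pattern with the numbers in the quote:

```python
# Working through the fill pattern described in that passage, using its numbers.
peak_b_per_cycle = 32     # L1D supplies 32B/cycle (16B from each half)
fill_cycles = 4           # four cycles while the 128B line arrives from L2 in quarters
blocked_cycles = 1        # fifth cycle: line buffer -> L1D write blocks reads

effective = fill_cycles * peak_b_per_cycle / (fill_cycles + blocked_cycles)
drop = 1 - effective / peak_b_per_cycle
print(f"~{effective:.1f} B/cycle effective, a ~{drop:.0%} drop from peak")
# -> ~25.6 B/cycle, about a 20% drop -- in the ballpark of the ~25% the doc measures
```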
This is substantially different from what IBM is doing. What IBM is doing is more like taking cache eviction victims and trying to stuff them elsewhere in the L2 cache than where they would normally belong. I highly doubt that results in lower energy overhead at all ( the complexity is probably at least twice as high as what Apple is doing ). In some sense, IBM is trying as hard as possible to
never turn any of the L2 cache off, ever ( which is where the power savings would come from ). They are going way , way , way out of their way to find 'extra' stuff to push into the L2 cache so it is always as full as possible. Apple is NOT doing that.
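For contrast, a crude sketch of the IBM 'virtual L3' behavior as I understand it (not IBM's actual placement algorithm, just the shape of the idea):

```python
# Crude sketch of the Telum-style 'virtual L3' idea as I understand it: when a
# line falls out of one core's L2, try to park it in spare capacity of some
# other core's L2 instead of letting it drop to memory. Not IBM's actual
# placement algorithm, just the shape of the behaviour described above.

def place_victim(victim_addr, owner, l2_slices, slice_capacity):
    """l2_slices: one set of resident line addresses per core's L2 slice."""
    for core, resident in enumerate(l2_slices):
        if core != owner and len(resident) < slice_capacity:
            resident.add(victim_addr)   # stuffed into someone else's L2
            return core                 # that slice is now acting as 'virtual L3'
    return None                         # nowhere to park it -> off to memory

slices = [set(), {0x100, 0x140}, set(), set()]
print(place_victim(victim_addr=0x2000, owner=0, l2_slices=slices, slice_capacity=4))
# Note the power implication: every slice is kept as full (and as awake) as
# possible -- the opposite of gating idle slices off.
```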
[ P.S. IBM is saving power in a very large, macro sense: if you can push the workload of 4 large refrigerator-sized servers into just one refrigerator-sized server, then the net datacenter power consumed goes down. But at the single-system level... they aren't really doing that with this virtual L3/L4 thing. They could get some savings out of other components used in the server, though. But z series... it is server-consolidation focused. ]