Modern compilers should take care of each architecture's nuances, so the generated code should be fairly well optimised. Most benchmarks run in fairly tight loops, so the L1 I- and D-caches should not be thrashing much while the benchmark loops run. And the AS SLC should be large enough to fit the datasets used in most benchmarks. Going back to RAM for data during a benchmark run should be fairly infrequent, except for the initial load.

The very claim being made reveals ignorance.
What is it that is being claimed to fit in L2/L3 caches? Instructions or data?
It is trivial to scale up a benchmark to ensure that the data footprint is larger than L3, and SPEC definitely does this. GB probably tries to, but GB5 may be behind the curve and we may need to wait for GB6.
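To make that concrete, here is a minimal C sketch of the data-side scaling; the 256 MiB working set is an illustrative assumption chosen to exceed any current L3/SLC, not a measurement of any particular chip:

```c
#include <stdio.h>
#include <stdlib.h>

#define WORKING_SET ((size_t)256 << 20)   /* 256 MiB: assumed > L3/SLC */

int main(void) {
    size_t n = WORKING_SET / sizeof(long);
    long *buf = malloc(n * sizeof(long));
    if (!buf) return 1;
    for (size_t i = 0; i < n; i++) buf[i] = (long)i;

    long sum = 0;
    /* Every pass streams the whole buffer; once the footprint exceeds the
       last-level cache, steady-state reads must come from DRAM (subject to
       how well the prefetchers hide that, which is the real question). */
    for (int pass = 0; pass < 16; pass++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];

    printf("%ld\n", sum);   /* keep the loop from being optimised away */
    free(buf);
    return 0;
}
```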
It's much harder to scale up a benchmark to ensure that the instruction footprint exceeds cache size. Neither SPEC nor GB does this. The professional code bases that do exceed it are large server databases, and the academic papers that investigate this tend to look and feel rather different from papers that investigate data-side issues like prefetching, because they come from different communities. About the closest you can get to testing large I-footprint performance without serious effort is browser benchmarks, which exercise a wide variety of functionality.
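For contrast, here is a hedged sketch of what inflating the I-footprint takes: stamping out many distinct functions and calling them through a table, so the hot instruction working set grows with the table. A real I-cache study would script-generate thousands of these; the eight `fn_N` functions here are made-up placeholders:

```c
#include <stdio.h>

/* Stamp out distinct function bodies; a real study would generate
   thousands of these rather than eight. */
#define DEFINE_FN(n) \
    static long fn_##n(long x) { return x * (n + 3) + (n ^ 5); }

DEFINE_FN(0) DEFINE_FN(1) DEFINE_FN(2) DEFINE_FN(3)
DEFINE_FN(4) DEFINE_FN(5) DEFINE_FN(6) DEFINE_FN(7)

static long (*const table[])(long) = {
    fn_0, fn_1, fn_2, fn_3, fn_4, fn_5, fn_6, fn_7
};

int main(void) {
    long acc = 1;
    /* Indirect calls walk every function body in turn, so the hot
       instruction working set scales with the number of functions
       (assuming the compiler doesn't inline them all back together). */
    for (long i = 0; i < 100000000; i++)
        acc = table[i & 7](acc);
    printf("%ld\n", acc);
    return 0;
}
```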
But ultimately the question is: do you want to make tribal claims, or do you want to understand? Because if you want understanding, then simplistic measurements of the sizes of caches are not helpful. Substantially more important than the raw sizes, in any modern machine, are how well prefetching works (on both the I- and D-side) and how well the branch prediction machinery copes with "not recently executed" code (on the I-side). It does you no good to be able to get to new I-code rapidly if you're then constantly flushing the pipeline at each newly encountered branch... There are multiple ways to handle this (better static predictors, an L2 branch predictor, analysis of code for branches before it is even executed), and all of them probably matter more than the trivial question of I-footprint size compared to cache sizes.
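As a rough illustration of how much the prediction machinery matters, here is the classic sorted-vs-shuffled microbenchmark in C. To be clear about the hedge: this demonstrates the raw cost of mispredicted data-dependent branches, not the harder-to-demo cold-code case described above, but the pipeline flushes it exposes are the same mechanism:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)   /* 16M elements */

static long run(const int *v) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        if (v[i] >= 128)       /* ~50/50 branch; predictable only if sorted */
            sum += v[i];
    return sum;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    srand(1);
    for (size_t i = 0; i < N; i++) v[i] = rand() & 255;

    clock_t t0 = clock();
    long s1 = run(v);          /* random order: the branch mispredicts often */
    clock_t t1 = clock();

    /* Counting sort: same values, now in predictable order. */
    size_t hist[256] = {0};
    for (size_t i = 0; i < N; i++) hist[v[i]]++;
    for (size_t i = 0, k = 0; k < 256; k++)
        while (hist[k]--) v[i++] = (int)k;

    clock_t t2 = clock();
    long s2 = run(v);          /* sorted: same work, far fewer mispredicts */
    clock_t t3 = clock();

    printf("random %ld ticks, sorted %ld ticks (sums %ld == %ld)\n",
           (long)(t1 - t0), (long)(t3 - t2), s1, s2);
    return 0;
}
```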
For example, the M1 and M1 Pro/Max/Ultra GB5 ST results appear essentially identical, despite the M1 Pro/Max/Ultra using LPDDR5, which is 33% faster than the LPDDR4X used in the M1.
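One plausible reading is that GB5's ST workloads simply aren't memory-bandwidth bound. A workload that is, something shaped like the STREAM triad sketched below (sizes and constants are illustrative), would be expected to show the LPDDR5 advantage:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)64 << 20)   /* 64M doubles per array: ~1.5 GiB total */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* STREAM triad: 2 reads + 1 write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s (a[0]=%f)\n",
           3.0 * (double)N * sizeof(double) / secs / 1e9, a[0]);
    return 0;
}
```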
I particularly like the fluid dynamics benchmark (posted earlier in this thread) that shows almost linear scaling on the M1 Ultra. So it appears that currently available software leaves a lot of performance on the table, even on the plain M1 SoC.