
jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
Many benchmarks get around this problem by measuring throughput (like the anaconda graph above) instead of time to finish a task. For example, Geekbench in the physical N-body simulation measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.
You mean that Geekbench counts the number of interactions computed over short intervals (like every 0.01 s) and averages them? That's indeed a better solution.
It's not entirely clear to me whether the throughput shown on the anaconda graph is measured this way. It could be the total amount of work divided by the time the whole CPU took to finish the task; that also gives you a throughput.
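As a rough sketch of the two measurements being contrasted (the work() function and the numbers here are made up for illustration; this is not Geekbench's or Anaconda's actual harness):

import time

def work():
    # hypothetical stand-in for one unit of benchmark work
    sum(i * i for i in range(1000))

def throughput_sampled(interval=0.01, duration=1.0):
    # count work units finished in each short interval, then average the rates
    rates = []
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        t0, done = time.perf_counter(), 0
        while time.perf_counter() - t0 < interval:
            work()
            done += 1
        rates.append(done / interval)
    return sum(rates) / len(rates)

def throughput_total(n=10_000):
    # total amount of work divided by total wall-clock time for the whole task
    t0 = time.perf_counter()
    for _ in range(n):
        work()
    return n / (time.perf_counter() - t0)

The two numbers agree for a steady workload but diverge when startup, scheduling, or tail effects eat into the total time.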
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Many benchmarks get around this problem by measuring throughput (like the anaconda graph above) instead of time to finish a task. For example, Geekbench in the physical N-body simulation measures interactions computed per second, not the time to compute, say, 1M interactions.
What do the numbers in the Geekbench report mean? Mean throughput? Maximum instantaneous throughput?
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
What do the numbers in the Geekbench report mean? Mean throughput? Maximum instantaneous throughput?

It is the same as SPEC CPU rate-N: you run the same workload/code on each core, and after some time you measure how much progress has been made overall.
This way you avoid having to distribute a fixed workload fairly across all the cores (as in Cinebench).
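A minimal sketch of that rate-N idea (a hypothetical harness; in CPython the work_fn would need to release the GIL, e.g. NumPy or native code, for the threads to actually run in parallel):

import threading, time

def rate_n_style(work_fn, n_threads, duration=10.0):
    # every thread runs the same workload in a loop; after a fixed time,
    # sum the progress made across all threads
    counts = [0] * n_threads
    deadline = time.perf_counter() + duration

    def worker(i):
        while time.perf_counter() < deadline:
            work_fn()
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / duration  # aggregate copies-of-the-workload per second

No fixed chunk of work ever has to be divided fairly among the cores; slow cores simply contribute less to the sum.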
 
  • Like
Reactions: Xiao_Xi

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
*sometimes the number of threads is set to twice the number of cores in SMT-capable (multithreading) machines, since they were designed to be able to run two threads per core. Didn't make much of a difference in any workload I ever tried, since the point of SMT (someone correct me if wrong) is to keep the core ALU busy and that's not a problem for numerical simulations since they're ALU-heavy by definition.

SMT isn't just aimed at ALU units. Any core functional unit could get utilized by another thread if it is sitting there unused for a substantive amount of time. SMT isn't going to get traction on microbenchmarks where the whole problem can be loaded into cache and run as a very tight loop over a relatively limited memory area.

SMT gets traction where there are dynamic execution bubbles that impede keeping the units active: for example, waiting on data to load before an arithmetic op can proceed. If one of the operands is coming from the persistent storage disk, then nothing is going to happen for a long while. If the core can slide in another thread that makes some "slow" progress, that is better for aggregate throughput than nothing at all (or than waiting until the time-slice quantum is over to swap the delayed thread out and put in some other work).

There is a limited number of virtual addresses that can be translated with low overhead (the TLB), limited "local" data (e.g., if using MPI to pull data off another supercomputer node), and caps on how many different branches the predictor can see through (a long sequence of if-thens whose tests shift on intermediate results, versus a loop-end condition that almost always branches the same way: with 'while i < 100000', 99,999 times the answer is 'yes').

Everything in the same virtual address space, all the data loaded into "local" RAM, one big loop with no internal branches, a 99% L3 cache hit rate: in that context, SMT won't get much traction if the design has already thrown lots of resources at hyper-aggressively pulling instructions "out of order".
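A rough way to poke at the quoted footnote's observation on an SMT-capable x86 machine (a sketch, assuming 2-way SMT; it uses processes to sidestep the CPython GIL, and the function names are made up):

import math, os, time
from concurrent.futures import ProcessPoolExecutor

def alu_heavy(n):
    # tight arithmetic loop: execution units stay busy, few bubbles for SMT to fill
    s = 0.0
    for i in range(n):
        s += math.cos(i)
    return s

def wall_time(workers, tasks=32):
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as ex:
        list(ex.map(alu_heavy, [500_000] * tasks))
    return time.perf_counter() - t0

if __name__ == "__main__":
    logical = os.cpu_count()            # includes the SMT sibling threads
    print(wall_time(logical // 2))      # one worker per physical core
    print(wall_time(logical))           # two workers per physical core

If the inner loop really is ALU-bound, the second timing should improve only marginally over the first.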


I assume context switches here are less of a problem since SMT-capable cores have two sets of registers, so context switching between (just) two threads should be cheaper, and performance neither improves nor decreases.

SMT uses two dynamically sized subsets of the rename registers, which is different from "threaded" execution schemes where whole register sets are deliberately duplicated and shared between the units. With register renaming for out-of-order execution, the main thread is already using more physical registers than the application binary's instruction set exposes. SMT overlays on top of the register renaming that was there anyway (the registers are being renamed regardless, so there is not much overhead in renaming into two sets).

If the speculative execution path is stalled on "thread 1", then that much larger pool of "extra" rename registers isn't being used anyway, so just consume it with something else. The bookkeeping for which rename registers belong to which thread is nowhere near the area and datapath cost of the registers themselves. An incrementally larger pool could be useful, but 100% growth is not likely required to get some incremental increase in performance.


Many benchmarks get around this problem by measuring throughput (like the anaconda graph above) instead of time to finish a task. For example, Geekbench in the physical N-body simulation measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.

It takes it out until it doesn't. If one creates 10,000 threads versus 100 threads, then the OS thread overhead is going to go up, and that can have a cascade impact on scheduling: the data structures are just bigger. As threads get "starved" for time, a fair scheduler will push some others out. The cache takes a hit from that; more cache misses mean more bubbles, and more bubbles mean more delays in throughput. There is a longer line before a thread gets back to the front of the queue to be allocated again (and whatever it had in cache has likely been pushed out, unless all the threads are working on almost exactly the same duplicative data).

Throughput isn't the whole story, because getting results later than needed won't help much.

On Windows PCs still being sold with cut-down main RAM and an HDD as the sole system drive, SMT would have more traction to "bite" into. Same for a server with 10,000 remote users feeding it data over < 1 Gbps links.
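The 10,000-versus-100-threads point above is easy to sanity-check (a sketch; the absolute numbers depend heavily on the OS and the Python runtime):

import threading, time

def run_split(n_threads, total_iters=10_000_000):
    # same fixed amount of work, split across more and more threads
    per_thread = total_iters // n_threads

    def worker():
        x = 0
        for _ in range(per_thread):
            x += 1

    t0 = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# run_split(100) vs run_split(10_000): identical total work, but the creation,
# bookkeeping, and scheduling overhead of 10,000 threads shows up in wall time.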
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Would it go crazy like this? Anaconda ran a microbenchmark (calculating a cosine a million times) to test the differences between the P and E cores of an M1 Mac and got this.
[Attachment 1945018: Anaconda's throughput graph for the cosine microbenchmark on the M1's P and E cores]
Source: https://www.anaconda.com/blog/apple-silicon-transition


Doubtful that is actually "steady state" on the right-hand side; it is more just a non-steadily-increasing state. There is a very good chance it continues to decline over an extended range as one hammers the cache, the OS thread scheduler, etc., with the larger and larger overhead of having too many "higher priority" threads clamoring for a time-slice quantum allocation. At some point the ones being "starved" will be pushed to the front of the queue, and once bounced out, the line to get back in is even longer.

That said, calculating a cosine a million times is a bit of a joke benchmark. It is as much a measurement of cache size and cache set associativity, since the whole thing is likely encapsulated by a modern-era cache: the shared instruction stream can land in L1, it is not really pulling any data, and the data dependences between loop iterations are pragmatically nonexistent. This is just lobbing a super-slow softball pitch over the plate at an Olympic gold-medalist softball player.
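For reference, the benchmark is presumably something along these lines; only the waste_time name is confirmed by the blog post, while the body and the harness below are guesses:

import math, time
from concurrent.futures import ProcessPoolExecutor

def waste_time(_):
    # one scalar cosine, a million times: the working set is tiny, so the
    # whole loop lives in L1 cache and barely touches memory
    for _ in range(1_000_000):
        math.cos(0.5)

def throughput(workers, tasks=64):
    # tasks completed per second at a given pool size (call this under
    # `if __name__ == "__main__":` on macOS/Windows because of process spawn)
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as ex:
        list(ex.map(waste_time, range(tasks)))
    return tasks / (time.perf_counter() - t0)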
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
calculating a cosine a million times is a bit of a joke benchmark. It is as much a measurement of cache size and cache set associativity, since the whole thing is likely encapsulated by a modern-era cache: the shared instruction stream can land in L1, it is not really pulling any data
If the benchmark only uses one number for the cosine, how could it test the performance of the cache? Can you explain what data the cache stores?
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
If the benchmark only uses one number for the cosine, how could it test the performance of the cache? Can you explain what data the cache stores?


It is not testing the cache if the cache can hold the whole problem. That's the point. The cache is a principal component of a CPU "package" or "SoC"; if you are not testing one of the major components, then it is a test that doesn't give a complete evaluation of the overall unit.

Even if it were just using a 0.5K block of numbers, it is mainly going to come down to the "all-core" clock speed once the work is split up into chunks. The architecture differences probably are not going to matter much.

P.S. I missed the "what data does the cache store" part. The data cache holds the data used by the thread/application. If you have global/local variables, the values in those are data, and each thread has its own set of local variables (e.g., keeping track of how many times it has been around the loop: each thread is on a different iteration, so each needs its own count-tracking variable). The result of the cosine is probably saved in some variable, x = cosine(y); it doesn't go anywhere, but there is still a step to store it (unless the optimizer throws that step away as pragmatically useless in a "nothing" computation).

The instructions to execute are a special kind of "data": they are instructions, not the operands to the instructions. At the L1 level they are usually kept in a separate cache, because their usage characteristics are grossly different from those of data.
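A working set only starts to "test the cache" once it outgrows it, which is easy to see by timing the same reduction over bigger and bigger arrays (a sketch; it needs NumPy, and the size-to-level mapping is only approximate):

import time
import numpy as np

def streaming_rate(n_floats, reps=100):
    # time a reduction over a working set of 8 * n_floats bytes
    a = np.random.rand(n_floats)
    t0 = time.perf_counter()
    for _ in range(reps):
        a.sum()
    dt = time.perf_counter() - t0
    return reps * a.nbytes / dt        # bytes touched per second

for kib in (32, 256, 4096, 65536):     # roughly L1 -> L2 -> L3/SLC -> DRAM
    print(kib, "KiB:", streaming_rate(kib * 1024 // 8) / 1e9, "GB/s")

The cosine microbenchmark never leaves the first of those regimes.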
 
Last edited:

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Doubtful that is actually "steady state" on the right-hand side; it is more just a non-steadily-increasing state. There is a very good chance it continues to decline over an extended range as one hammers the cache, the OS thread scheduler, etc., with the larger and larger overhead of having too many "higher priority" threads clamoring for a time-slice quantum allocation. At some point the ones being "starved" will be pushed to the front of the queue, and once bounced out, the line to get back in is even longer.

That said, calculating a cosine a million times is a bit of a joke benchmark. It is as much a measurement of cache size and cache set associativity, since the whole thing is likely encapsulated by a modern-era cache: the shared instruction stream can land in L1, it is not really pulling any data, and the data dependences between loop iterations are pragmatically nonexistent. This is just lobbing a super-slow softball pitch over the plate at an Olympic gold-medalist softball player.
Benchmarks can be used for different things but outside marketing, their value lies in the conclusions you can draw from them.
Calculating a cosine a million times is very different from any actual workload on the system I can envision. Apart from demonstrating that the scheduling seems functional, I can’t see what conclusions I’m supposed to draw from it. It doesn’t seem to represent anything beyond itself.
But then it’s author defined it as
def waste_time(_):
so he seems quite aware. ?


{But it does remind me of a time when my code analysis showed that calculating trigonometric function values was a bottleneck. I ended up sidestepping the calculations entirely by using small look-up tables and symmetry. Nice. So the benchmark at least served as a nostalgic memory trigger.}
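For the curious, that trick looks roughly like this (a sketch; the table size is arbitrary and a real implementation would likely interpolate between entries):

import math

N = 1024                                   # table resolution over [0, pi/2]
QUARTER = [math.cos(i * (math.pi / 2) / N) for i in range(N + 1)]

def fast_cos(x):
    x = math.fmod(abs(x), 2 * math.pi)     # cos is even and 2*pi periodic
    if x > math.pi:
        x = 2 * math.pi - x                # reflect into [0, pi]
    sign = 1.0
    if x > math.pi / 2:
        x = math.pi - x                    # reflect into [0, pi/2]
        sign = -1.0
    return sign * QUARTER[round(x / (math.pi / 2) * N)]

A quarter-wave table plus the two reflections covers the whole circle, which is the "symmetry" part.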
 
  • Like
Reactions: JMacHack and pshufd

pshufd

macrumors G4
Oct 24, 2013
10,151
14,574
New Hampshire
{But it does remind me of a time where my code analysis showed that calculating trigonometric function values was a bottleneck. Ended up sidestepping the calculations entirely by using small look-up tables and symmetry. Nice. So the benchmark at least served as a nostalgic memory trigger.}

I wanted to demonstrate compiler optimization to my kids a long time ago.

So I wrote a program like:

#include <stdio.h>
int main(void) {
    int j = 0;
    for (int i = 1; i <= 1000; i++)
        j = j + 1;    /* optimizer can see j always ends up as 1000 */
    printf("%d\n", j);
}

Compiler optimization level 3 just precomputed the sum.

The Intel Software Optimization Handbook is a good place to look for optimization ideas and their compilers are top notch.

Intel might even have an advantage there.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,232
Anandtech published a review of a top end Alder Lake mobile device. Unfortunately no direct perf/W estimates, but they listed the peak as 115W and sustained as 85W and they noted that battery life has *regressed* compared to Tiger Lake. They get a (small) pass because the laptop is not actually meant to be used on battery - you know because it’s almost 9 pounds!

 
  • Like
Reactions: JMacHack

bobcomer

macrumors 601
May 18, 2015
4,949
3,699
Anandtech published a review of a top end Alder Lake mobile device. Unfortunately no direct perf/W estimates, but they listed the peak as 115W and sustained as 85W and they noted that battery life has *regressed* compared to Tiger Lake. They get a (small) pass because the laptop is not actually meant to be used on battery - you know because it’s almost 9 pounds!

Thank heavens they make more than the i9 HK!
 

pshufd

macrumors G4
Oct 24, 2013
10,151
14,574
New Hampshire
Anandtech published a review of a top end Alder Lake mobile device. Unfortunately no direct perf/W estimates, but they listed the peak as 115W and sustained as 85W and they noted that battery life has *regressed* compared to Tiger Lake. They get a (small) pass because the laptop is not actually meant to be used on battery - you know because it’s almost 9 pounds!


They sell a pair of noise canceling headphones to go with it?
 
  • Haha
Reactions: neinjohn

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,232
Ars Technica says 6.4 pounds. Anandtech says 8.8 pounds. Pretty large discrepancy.

Huh … different variants? I mean still heavy but yeah that’s sizable.

There aren’t too many laptops at 8.8 pounds or more … a couple but not many
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
Huh … different variants? I mean still heavy but yeah that’s sizable.

There aren’t too many laptops at 8.8 pounds or more … a couple but not many
The Verge says 6.39 pounds. Seems more reasonable than 8.8 unless Anandtech really is including the charging brick in the weight.
 
  • Like
Reactions: crazy dave