*sometimes the number of threads is set to twice the number of cores in SMT-capable (multithreading) machines, since they were designed to be able to run two threads per core. Didn't make much of a difference in any workload I ever tried, since the point of SMT (someone correct me if wrong) is to keep the core ALU busy and that's not a problem for numerical simulations since they're ALU-heavy by definition.
SMT isn't just aimed at the ALUs. Any functional unit in the core that sits idle for a substantial amount of time could get utilized by another thread. SMT isn't going to get traction on microbenchmarks where the whole problem can be loaded into cache and the core just runs a very tight loop over a relatively limited memory area.
SMT gets traction where there are dynamic execution bubbles that impede keeping the units active. For example, waiting on data to load before an arithmetic op can proceed. If one of the operands is coming from the persistent storage disk, then nothing is going to happen for a long while. If the core can slide in another thread that can make some "slow" progress, that is better for aggregate throughput than nothing at all. (Or than waiting until the multithreading time-slice quantum is over to swap the delayed thread out and put in some other work.)
There are a limited number of virtual addresses that can be translated with low overhead. Limited "local" data (e.g., if using MPI to get data off another supercomputer node). Caps on how many different branches the predictor can track your way through. (A long sequence of if-thens with tests that shift on immediate results, versus a loop-end condition test which almost always branches the same way: for 'while i < 100000', 99,999 times the answer is 'yes'.)
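That branch-predictability contrast can be sketched in C (the function names and the >= 128 threshold are illustrative, not from any real benchmark): the first loop's only branch is the loop-exit test, which goes the same way every iteration; the second takes a data-dependent branch that random data would make unpredictable.

```c
/* Predictable vs. data-dependent branches. Illustrative sketch. */
#include <stddef.h>
#include <stdint.h>

/* Predictable: the only branch is the loop-exit condition, which is
 * 'yes' n times in a row and 'no' once, so the predictor nails it. */
uint64_t sum_all(const uint8_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unpredictable (for random data): the taken/not-taken direction of
 * 'a[i] >= 128' shifts with each element, defeating the predictor. */
uint64_t sum_big(const uint8_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= 128)   /* branch direction depends on the data */
            s += a[i];
    return s;
}
```

On random input the second loop mispredicts roughly half the time, and each mispredict is a pipeline flush, i.e. more bubbles for SMT to fill.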
Everything in the same virtual address space, all the data loaded into "local" RAM. One big loop with no internal branches. A 99% L3 cache hit rate. In that context, SMT won't get much traction if the core has already thrown lots of resources at being hyper-aggressive about pulling instructions through "out of order".
I assume context switches here are less of a problem since SMT-capable cores have two sets of registers, so context switching between (just) two threads should be cheaper, so performance neither improves nor decreases.
SMT has two dynamically sized subsets of one register pool. That is different from a kind of "threaded" execution where whole separate register sets are deliberately dedicated to each thread. With register renaming for out-of-order execution, the main thread is already using more physical registers than are explicitly named in the application's binary instruction set. SMT overlays on top of the register renaming that was there anyway. (They are being renamed regardless, so there is not much overhead to rename into two sets.)
If the speculative execution path is stalled on "thread 1", then that much larger pool of "extra" rename registers isn't being used anyway... so just consume it with something else. The bookkeeping for which rename registers belong to which set is nowhere near the area and datapath consumption of the registers themselves. An incrementally larger pool could be useful, but 100% growth is not likely required to get some incremental increase in performance.
Many benchmarks get around this problem by measuring throughput (like the anaconda graph above) instead of time to finish a task. For example, Geekbench's N-body physics simulation measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.
It takes it out until it doesn't. If one creates 10,000 threads versus 100 threads, then the OS thread overhead is going to go up. That can have a cascade impact on scheduling. The data structures are just bigger. As threads get "starved" for time, a fair scheduler will push some others out. There are cache misses for that. More cache misses mean more bubbles. More bubbles mean more delays in throughput. There is a longer line until a thread gets back to the front of the queue to be allocated time. (And again, whatever it had in cache has likely been pushed out, unless all the threads are working on almost exactly the same duplicative data.)
Throughput is what matters, because getting to results later than needed won't help much.
Windows PCs are still being sold with cut-down main RAM and an HDD as the sole system drive; there SMT would have more traction to "bite" into. A server with 10,000 remote users feeding data over < 1 Gbps links: same thing.