M1 single core not so fast at it seems in benchmarks vs X86

falainber · Nov 29, 2021

Bug-Creator said:
O.k. fair enough.

Now just find my a single core dual thread processor.....

..... and tell me how it is relevant today.

I already addressed this in my original post. As long as multithreading does not increase the size of the core, number of cores does not matter. Same silicon size implies the same number of cores and thread advantage for Intel approach.

JMacHack · Nov 29, 2021

To summarize: the M-Series isn’t impressive if you cherry pick instances where it loses to x86.

falainber · Nov 29, 2021

throAU said:
Sure, and the M1 beats intel (for the most part, pre alder lake; definitely in performance per watt) in both.

What matters is how fast the software runs; trying to do stuff like disable hyper threading to make it "fair" (because intel have an inherent core utilisation problem) or whatever is irrelevant. What mattes is how much I need to spend, and how big/hot the machine is to get the level of performance. Couldn't care if it is 1 core or 100 cores if it runs my workload faster in less power.

edit:
I just find it hilarious how intel (and their fanboys) have been pushing single thread performance as a massively important metric vs. AMD with their huge numbers of extra cores, yet now its "not fair" because hyper threading which the M1 derivatives do not do.

Is single threaded code performance important or not?

Also... if they're going to whine about not fully saturating an Intel CPU and somehow claim the M1 is fully saturated, then they need to remember it can do a bunch of ML or transcoding entirely off the CPU; whereas the intel processor is going to need to use CPU or GPU cores for that. i.e., the M1 is not "fully saturated" either. Not by a long shot.

Furthermore... it's not like a Hyperthreaded core from intel has a full 2 thread capacity per core. Resources are shared. Some things hyper thread well. Some DO NOT. To claim that the other "half" of an x86 core is "unused" is total and utter bollocks. Because the only thing hyperthreading does is try and help use the parts of a single core that are not currently able to be used because the CPU has had a pipeline stall. There are not two cores in one. This is why you do NOT get 100% linear 2x scaling with hyper threading. Often, nowhere near it.

The whole point of ST performance metrics is to give an idea how fast a machine will be using a single thread. This situation will never use hyper threading, and thus will never load two virtual HT cores on an intel machine. SO trying to fudge numbers/benchmarks to better load 2 HT virtual cores on X86 as some sort of contrived "single core" benchmark performance victory is complete bollocks.

There's already a multi-thread benchmark option: turn on multiple threads and let the scheduler make use of everything in the machine.

Smells like extreme desperation to try and spin something that is meant to measure one thing, to measure something else entirely and then claim victory. Wonder how much intel paid.

This is entirely irrelevant to the end user. What matters is what I can buy off the shelf, how much it costs and how well it runs my workload. How many cores it has, how many it uses, what processing node it is manufactured with, who makes it or whether or not it does hyper threading are all entirely irrelevant.

It's a rant not an argument. Nobody denied importance of ST benchmarks and if all apps were single threaded and CPU had just one core that's all we would ever need but it's not the case. Limiting comparisons to benchmarks that do not take advantage of hyperthreading is just plain stupid. Claiming that M1 has additional engines is just as stupid in the context of ST/MT discussion. Firstly, Intel processors are SOCs too and have features beyond conventional CPU, and secondly, one can't use neural engine for most conventional computational tasks.

throAU · Nov 29, 2021

futbalguy said:
Your M1 MacBook Air should crush the 2011 MacBook Pro

Trust me, I have a 2011 MBP and a 2020 intel MBA and as the guy said, the 2020 MBA is dog slow. Crippled by power limit and thermals.

crazy dave · Nov 29, 2021

falainber said:
That's oversimplification. Single-threaded benchmarks are important because a lot of apps are single-threaded. But per core performance is meaningful too. A simple case to demonstrate this is the case of one core processors. The total CPU performance obviously benefits ftom MT. Obviously, single core processors are rare these days. But for multi-core chips, per core performance may be viewed as a proxy for performance per [silicon] area metric.

Per core measurements are a thing but how the article tried to do it is not how or why they’re done. They’re also not a great proxy for PPA because the core sizes are so different. In reality per core measurements are done by saturating the CPU with threads, measuring performance and then literally dividing by the number of cores. It’s really just a pure MT test with # of cores thrown in as a denominator. It’s generally important for server/cloud HPC where customers will be renting time on a per-(physical/logical) core basis.

falainber said:
I already addressed this in my original post. As long as multithreading does not increase the size of the core, number of cores does not matter. Same silicon size implies the same number of cores and thread advantage for Intel approach.

I don’t know how much extra die space modern HT/SMT2 logic takes up. It’s not zero but it’s probably not huge. That said, x86 cores are in general larger than Arm cores. And even between x86 cores core sizes can vary dramatically between Alder Lake P and E cores and between both of them and Zen 3 cores. Performance per area is its own measure and is done.

falainber said:
It's a rant not an argument. Nobody denied importance of ST benchmarks and if all apps were single threaded and CPU had just one core that's all we would ever need but it's not the case. Limiting comparisons to benchmarks that do not take advantage of hyperthreading is just plain stupid. Claiming that M1 has additional engines is just as stupid in the context of ST/MT discussion. Firstly, Intel processors are SOCs too and have features beyond conventional CPU, and secondly, one can't use neural engine for most conventional computational tasks.

You’ve got this inverted @throAU didn’t claim multithreaded tests weren’t important. It’s just that the article linked to in the OP is based on fundamentally misunderstanding basic terminology that single core == single logical core == single thread. The reason I say inverted is that the article’s author did actually write an entire screed on how single threaded performance was not important and used this defend his upside down world article on per core performance.

So yes a per core measurement of performance has its uses but it isn’t done this way and really is just a multithreaded test. It’s also true that HT/SMT2 allows Intel/AMD to claw back ~20% performance per watt on average when all cores are saturated relative to the M1.

throAU · Nov 29, 2021

crazy dave said:
You’ve got this inverted @throAU didn’t claim multithreaded tests weren’t important. It’s just that the article linked to in the OP is based on fundamentally misunderstanding basic terminology that single core == single logical core == single thread.

Exactly. Single thread is important. But you aren't testing single thread by loading up a hyper threaded core with 2 threads. By doing that you're into multi-thread territory and the comparison to M1 (or anything else) should include multiple cores. Or rather, if you're in multi-thread territory just load all the resources the machine has.

Testing with TWO threads only and only doing that on a single core with hyper threading (that the M1 doesn't have) rather than just using all the cores, to try and claim intel has better IPC is just totally disingenuous. It's simply not representative of any real world scenario (which is the entire purpose of benchmarking things for performance) and therefore the only reason for it is to try and make intel look better. Either due to blind fanboy-ism, misunderstanding of reality, or paid promo.

Gnattu · Nov 29, 2021

falainber said:
Same silicon size implies the same number of cores and thread advantage for Intel approach.

No, as I mentioned before, the cores are not at the same size plus you will always get "thread advantage" by using smaller cores. Let's look at Intel themselves, an AlderLake P core is about the same size as 4 E cores, but it is definitely slower than 4 E cores combined.

By the way, SMT is no free launch and you still need extra silicon to implement that, but that is much cheaper than adding a new core(of same size) and that's all.

ADGrant · Nov 29, 2021

tonyz123456 said:
I randomly came across this thread while browsing but am curious - what mainstream apps or games today are still single threaded where this matters? I don't think I've had a single core CPU computer in a very very long-term - maybe 10+ years so isn't single threaded benchmarks pointless since most software worth paying for have supported multi-core for years?

Whether it's optimized for the M1 chip or not is a separate topic.

Plenty of software is either single threaded or has one particularly busy thread. For example the Javascript running on this webpage in your browser. The UIKit or AppKit application on your iPhone or Mac needs to do all UI updates on a single thread which also handles all UI events (button press, tap gesture etc).

name99 · Nov 29, 2021

leman said:
The question was this:

As to the rest, I completely agree with you. The main reason why I even bother replying to this kind of nonsense is to try to stop a flow of misinformation. Even if it's a futile effort.

Anyway, I can't wait to get my 16" M1 and finally do some proper GPU programming

So the issue of interest is: "Do we have a clear picture of how good utilisation of Firestorm cores tends to be with 1 thread and how much could potentially be left on the table for something like SMT"?

Bottom line (and I have said this elsewhere in many places) SMT is a TERRIBLE idea if you care about real performance. It's tolerable if you are engaged in weird legal SW licensing games (which IBM is apparently doing with POWER) but otherwise, no!
The thing is -- what makes CPUs fast is vast amounts of locality. Cache data is reused, registers are reused, branch data is reused, etc etc. All this works well because of locality. As soon as you have an independent thread running, all that locality disappears. The simplest way to think about this is: if both threads are sharing your L1, each thread essentially sees only half the L1. And so you thought you were providing extra execution while thread 1 misses L1 and goes out to RAM, but in fact you aren't ONLY doing that, you're also SUBSTANTIALLY increasing the numbers of times thread 1 has to go out to RAM.
You can claw back some of the performance by running most of your structures in a much less power efficient away, but it's just never really worth it. Look at Intel. They have been at this for 20 years. But in all that time, SMT performance hasn't moved much -- two SMT threads gives you about 1.3x the performance of a single thread -- at the cost of MASSIVELY HIGHER energy. What's the point? It's just a stupid design point that keeps going because of inertia.

Just do what Apple does! Make the P core crazy fast (which you can do, at very low energy -- IF you aggressively exploit all the locality in an instruction stream -- and of course that's what my 300+ page document is all about); and if you also want the ability to run lots of slower threads in a smaller amount of area, add some E-cores.
(Apple does not really use E-cores in this way, as "throughput/area" maximizers, but they could, if they ever wanted to, for example, create internal server cores running 128 or 256 threads per SoC.)

SMT lives alongside "RISC vs CISC" as a zombie from 20+ years ago, ideas that are only argued about because they are easy to grasp with a minimal knowledge of current design concerns. You can spend your time in silly arguments about what seemed like a good idea 20 yrs ago (in a world where the average CPU used perhaps 1000th as many transistors as it could use today), or you can spend that time reading, learning, thinking about how modern CPUs are designed...

mr_roboto · Nov 29, 2021

casperes1996 said:
Indeed. I've always found IBM's 8-way SMT especially interesting. That is, IBM's POWER 8 chips came with SMT2, 4 and 8 - Always thought it was a wicked amount of assumed under-utilisation of the core for single threaded tasks if they thought it potentially beneficial to have 8 threads on one core. But it is also a very throughput oriented system

You know how in a typical CPU, each execution unit handles a subset of the ISA? In a superscalar RISC, you'll have memory units which can only execute loads and stores, integer units which only handle integer math, and so forth. This can get very fine-grained, e.g. I remember PowerPCs which had special EUs that only handled branches.

POWER8 is... different. Each execution "unit", or "slice", is capable of executing all POWER8 instruction types with a maximum throughput of 1 per clock. The SMT4 variant of the POWER8 connects four slices to one frontend (L1 cache + decode), while the SMT8 variant connects eight slices to one frontend. The total number of slices in the SMT4 and SMT8 variants stays constant, so a SMT8 chip has half the nominal "core" count as SMT4. The biggest SMT8 POWER8 is apparently 12c96t, the biggest SMT4 is 24c96t, and they're probably the same physical design with different fuse options.

In POWER8, each thread gets guaranteed access to enough execution resources to sustain a minimum of 1 IPC if memory never stalls. When other threads stall (or cores aren't fully loaded with 4/8 threads) that means individual threads may be able to run at better than 1 IPC.

casperes1996 said:
Do we have a clear picture of how good utilisation of Firestorm cores tends to be with 1 thread and how much could potentially be left on the table for something like SMT? I don't imagine it would be much but do we have any concrete evidence for it?

Unsurprisingly, there are performance counters. In Monterey, powermetrics now reports instructions retired and instructions per clock, but only on a whole-cluster basis, so you can't monitor individual cores, just the average IPC across all cores in a cluster. There may be finer grained statistics in the hardware, but unfortunately they're not presented by powermetrics. There may be other performance monitoring tools I'm not aware of.

The current level of detail visible in powermetrics isn't great for answering your question. For that you'd need to get an idea of contention for specific execution resources under specific types of load. M1 may have the potential to dispatch eight instructions per clock, but unlike SMT8 POWER8 it's not a symmetric machine with eight equally capable EUs.

eicca said:
I failed to specify my 2020 MBA is the I5 model. Which still benchmarks double my 2011, but man that Air is just molasses. I'm trying to talk our IT guy into upgrading me to an M1.

I work for a company which lets us use Macs and the corporate antivirus and device management software slows them to a crawl. I suspect that's making your 2020 i5 MBA feel much slower than it actually is. I know it slams my own... 2015 I think? work-provided 13" MBP, and I wouldn't like to experience it on a much more thermally limited platform like the MBA. Antivirus software has a nasty habit of going nuts and chewing up a core forever, which is not good news on Intel Macs. M1 might help, since hopefully the AV software uses the right thread QoS level to get itself locked to efficiency cores... but who am I kidding, never expect antivirus software to be written right.

leman · Nov 30, 2021

name99 said:
So the issue of interest is: "Do we have a clear picture of how good utilisation of Firestorm cores tends to be with 1 thread and how much could potentially be left on the table for something like SMT"?

Bottom line (and I have said this elsewhere in many places) SMT is a TERRIBLE idea if you care about real performance. It's tolerable if you are engaged in weird legal SW licensing games (which IBM is apparently doing with POWER) but otherwise, no!
The thing is -- what makes CPUs fast is vast amounts of locality. Cache data is reused, registers are reused, branch data is reused, etc etc. All this works well because of locality. As soon as you have an independent thread running, all that locality disappears. The simplest way to think about this is: if both threads are sharing your L1, each thread essentially sees only half the L1. And so you thought you were providing extra execution while thread 1 misses L1 and goes out to RAM, but in fact you aren't ONLY doing that, you're also SUBSTANTIALLY increasing the numbers of times thread 1 has to go out to RAM.
You can claw back some of the performance by running most of your structures in a much less power efficient away, but it's just never really worth it. Look at Intel. They have been at this for 20 years. But in all that time, SMT performance hasn't moved much -- two SMT threads gives you about 1.3x the performance of a single thread -- at the cost of MASSIVELY HIGHER energy. What's the point? It's just a stupid design point that keeps going because of inertia.

Just do what Apple does! Make the P core crazy fast (which you can do, at very low energy -- IF you aggressively exploit all the locality in an instruction stream -- and of course that's what my 300+ page document is all about); and if you also want the ability to run lots of slower threads in a smaller amount of area, add some E-cores.
(Apple does not really use E-cores in this way, as "throughput/area" maximizers, but they could, if they ever wanted to, for example, create internal server cores running 128 or 256 threads per SoC.)

SMT lives alongside "RISC vs CISC" as a zombie from 20+ years ago, ideas that are only argued about because they are easy to grasp with a minimal knowledge of current design concerns. You can spend your time in silly arguments about what seemed like a good idea 20 yrs ago (in a world where the average CPU used perhaps 1000th as many transistors as it could use today), or you can spend that time reading, learning, thinking about how modern CPUs are designed...

Thanks, very informative.

I always though the logic of SMT on x86 was more along these lines: if you are in a heavy multithreading environment and your CPU has to do context switches all the time, you'll likely already suffer from cache trashing. So why not just make it official and reuse some of execution units while a thread is stalling. But I agree that it's not an optimal design point.

quarkysg · Nov 30, 2021

leman said:
if you are in a heavy multithreading environment and your CPU has to do context switches all the time, you'll likely already suffer from cache trashing.

The main difference here is that context switches by the kernel happens when a thread's time slice is used, which should be better for cache locality, while a SMT triggered context switch may force a thread that just started to yield execution context and may force a cache flush/load.

Search

Search

M1 single core not so fast at it seems in benchmarks vs X86

falainber

macrumors 68040

JMacHack

Suspended

falainber

macrumors 68040

throAU

macrumors G4

crazy dave

macrumors 68000

throAU

macrumors G4

Gnattu

macrumors 65816

ADGrant

macrumors 68000

name99

macrumors 68030

mr_roboto

macrumors 6502a

leman

macrumors Core

quarkysg

macrumors 65816

Our Staff