The secret sauce of Apple Silicon

leman · Sep 16, 2021

Mr. Bear said:
So the performance cores can only do N things at once, but the efficiency cores can do N+X things at once because they are able to multithread? My Mac currently has 513 threads running. So you're saying that hypothetically 500 of them are low-priority and 13 are high priority (or 450/63, whatever). So the efficiency cores might be able to handle 100 low priority tasks each, while the performance cores can only handle 10? But they can run those 10 at 10x the speed that the efficiency cores can run them? And I'm talking about the M1 and a hypothetical M1x.

P-cores and E-cores are designed with different targets in mind: P-cores to run a unit of work as fast as possible and E-cores to run a unit of work with the lowest possible energy expenditure. As a result, the cores themselves are very different. E-cores have fewer execution units, less cache, etc.; everything is geared to use less power.

As @cmaier explains, E-cores cannot run more threads at once or anything like that; the basic execution model is the same for P-cores and E-cores. The main difference is that while P-cores can run, say, 100 threads of work in X time using Y power, the E-cores can run that same workload in X*4 time but at Y/10 power usage (numbers only illustrative). If the work is not high priority and you are not losing anything by taking longer to run it, you can save quite a lot of energy this way.
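
To put rough numbers on that trade-off: energy is power multiplied by time, so a core that takes 4x as long at 1/10 the power ends up using about 40% of the energy. The snippet below is only a sketch of that arithmetic. The 10 W and 1 s baseline values are made up for illustration, and the 4x and 1/10 ratios are the illustrative ones from the post, not measurements of any real chip.

```python
# Illustrative only: the 4x time and 1/10 power ratios come from the post above,
# and the baseline values are hypothetical, not measurements of any real chip.

def energy(power_watts: float, time_seconds: float) -> float:
    """Energy in joules = average power (W) * time (s)."""
    return power_watts * time_seconds

# Hypothetical P-core baseline: Y = 10 W, X = 1 s.
p_power, p_time = 10.0, 1.0
# E-core per the post: Y/10 power, X*4 time.
e_power, e_time = p_power / 10, p_time * 4

p_energy = energy(p_power, p_time)
e_energy = energy(e_power, e_time)
ratio = e_energy / p_energy

print(f"P-core: {p_energy:.1f} J, E-core: {e_energy:.1f} J, ratio: {ratio:.2f}")
# prints "P-core: 10.0 J, E-core: 4.0 J, ratio: 0.40"
```

So with these made-up ratios the E-core finishes later but spends 60% less energy on the same work, which is the whole point for low-priority tasks.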

altaic · Sep 16, 2021

leman said:
P-cores and E-cores are designed with different targets in mind: P-cores to run a unit of work as fast as possible and E-cores to run a unit of work with the lowest possible energy expenditure. As a result, the cores themselves are very different. E-cores have fewer execution units, less cache, etc.; everything is geared to use less power.

As @cmaier explains, E-cores cannot run more threads at once or anything like that; the basic execution model is the same for P-cores and E-cores. The main difference is that while P-cores can run, say, 100 threads of work in X time using Y power, the E-cores can run that same workload in X*4 time but at Y/10 power usage (numbers only illustrative). If the work is not high priority and you are not losing anything by taking longer to run it, you can save quite a lot of energy this way.

So, are threads just context switches? Surely it’s a grey zone, Shirley?

leman · Sep 16, 2021

altaic said:
So, are threads just context switches? Surely it’s a grey zone, Shirley.

Yep, threads are context switches. Each Apple CPU core runs only one instruction stream at a time. There is no SMT functionality.

P.S. So yeah, my "100 threads of work" is clearly BS from a technical standpoint; I was just using it to illustrate a point. Substitute it with "100 tasks" or "100 work packages".
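
For anyone who wants to see the "one instruction stream per core" model spelled out, here is a toy round-robin simulation. It is a sketch under simplifying assumptions, not how any real OS scheduler works: `run_round_robin` is a made-up helper, every task is just a count of work units, and each core runs exactly one task per tick.

```python
from collections import deque

def run_round_robin(task_lengths, n_cores):
    """Toy model: each core runs exactly one task per tick (no SMT).

    task_lengths: work units needed by each task.
    Returns (ticks elapsed, number of preemptions).
    """
    ready = deque(enumerate(task_lengths))  # (task id, remaining work units)
    ticks = 0
    preemptions = 0
    while ready:
        # Each core picks at most one task for this tick.
        running = [ready.popleft() for _ in range(min(n_cores, len(ready)))]
        ticks += 1
        for task_id, remaining in running:
            remaining -= 1
            if remaining > 0:
                # Not finished: the task is switched out and goes back in the queue.
                ready.append((task_id, remaining))
                preemptions += 1
    return ticks, preemptions

# 8 tasks of 3 work units each on 4 single-stream cores.
ticks, preemptions = run_round_robin([3] * 8, n_cores=4)
print(ticks, preemptions)  # prints "6 16"
```

However many threads exist, at most `n_cores` of them make progress in any given tick; the rest wait their turn, and every hand-over is a context switch.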

altaic · Sep 16, 2021

leman said:
Yep, threads are context switches. Each Apple CPU core runs only one instruction stream at a time. There is no SMT functionality.

P.S. So yeah, my "100 threads of work" is clearly BS from a technical standpoint; I was just using it to illustrate a point. Substitute it with "100 tasks" or "100 work packages".

It seems to me that the burden of context switching can be mitigated by other proactive measures, potentially leading to a much larger gain than would be expected. To name one low-hanging fruit: instruction reordering space.

I’m curious: are you or @cmaier familiar with the belt architecture? There are few architectures that deviate from von Neumann.

quarkysg · Sep 16, 2021

altaic said:
It seems to me that the burden of context switching can be mitigated by other proactive measures, potentially leading to a much larger gain than would be expected.

Context switches at the OS level are a lot more expensive than at the processor level (for multi-threaded CPUs). But any context switching will add latency to the threads being processed, so it should be avoided if possible. I would think this area has been studied in depth and we are probably not going to get any more breakthroughs here.

altaic · Sep 16, 2021

quarkysg said:
Context switches at the OS level are a lot more expensive than at the processor level (for multi-threaded CPUs). But any context switching will add latency to the threads being processed, so it should be avoided if possible. I would think this area has been studied in depth and we are probably not going to get any more breakthroughs here.

I’m constantly shocked by things that have not been studied. The American dream is still alive.

leman · Sep 16, 2021

altaic said:
It seems to me that the burden of context switching can be mitigated by other proactive measures, potentially leading to a much larger gain than would be expected. To name one low-hanging fruit: instruction reordering space.

You are absolutely right, but unfortunately, a stream of instructions (thread) is still a core architectural concept in AArch64. So you can't just eliminate threads on a hardware level. Maybe future architectures will offer alternative approaches to computation.

In the meantime, Apple is working on reducing the need for context switching in software. Their new concurrency framework allows continuations to share the same thread or move between threads, which massively improves latency and performance of small async code blocks.
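
As a rough analogy for the "many continuations on one thread" idea, here is a small Python `asyncio` sketch. This is not Apple's framework and says nothing about how their runtime is implemented: `asyncio` uses a single-threaded event loop, so it only shows the sharing half, not continuations moving between threads. The names `small_job` and `main` are made up for the example.

```python
import asyncio
import threading

async def small_job(i, seen_threads):
    # The await is a suspension point: the coroutine yields to the event loop
    # and is resumed later without the OS switching threads.
    await asyncio.sleep(0)
    seen_threads.add(threading.get_ident())
    return i * i

async def main():
    seen_threads = set()
    results = await asyncio.gather(
        *(small_job(i, seen_threads) for i in range(100))
    )
    return results, seen_threads

results, seen_threads = asyncio.run(main())
print(len(results), "jobs ran on", len(seen_threads), "OS thread(s)")
# prints "100 jobs ran on 1 OS thread(s)"
```

A hundred small async jobs interleave here without a single OS-level thread switch between them, which is the kind of overhead the post is talking about avoiding.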

altaic said:
I’m curious: are you or @cmaier familiar with the belt architecture? There are few architectures that deviate from von Neumann.

Yep, I've been following them for a while. It's really cool stuff, although it's not clear whether it has practical benefit. As I mentioned in the RISC-V threads, as far as register-based architectures go, AArch64 is pretty close to optimal in my opinion (and x86 is utterly ugly). Any objectively "better" ISA will probably have to drop the registers.

altaic · Sep 16, 2021

leman said:
You are absolutely right, but unfortunately, a stream of instructions (thread) is still a core architectural concept in AArch64. So you can't just eliminate threads on a hardware level. Maybe future architectures will offer alternative approaches to computation.

In the meantime, Apple is working on reducing the need for context switching in software. Their new concurrency framework allows continuations to share the same thread or move between threads, which massively improves latency and performance of small async code blocks.

Yep, I've been following them for a while. It's really cool stuff, although it's not clear whether it has practical benefit. As I mentioned in the RISC-V threads, as far as register-based architectures go, AArch64 is pretty close to optimal in my opinion (and x86 is utterly ugly). Any objectively "better" ISA will probably have to drop the registers.

Concurred on all points. Good times ahead; gotta love innovation.

Tagbert · Sep 18, 2021

pshufd said:
Something I'd like: Huffman Decoding in hardware.

I'm waiting for the Heisenberg Compensators in hardware. The software versions are too slow. 😉
