Rumored pro chip

deconstruct60 · Jul 30, 2021

JMacHack said:
Consider it this way: Maybe they’d revamp the scheduler to use more than 64 threads now in case they make a cpu with more than 64 cores in the future.

Why would Apple put that case on high priority when they are on a differentiation path with the rest of the SoC ?

First, Apple has put an extremely high priority on this "Unified and everything on the SoC package" differentiation for the M-series ( and aways had it or A-Series , S-series (watch) , etc. ). This means CPU cores are going to have to compete with other function units for transistor budget. Apple is perusing bleeding edge fab process to give themselves a bigger allotment of transistor , but that is so they can share more. 'Unified' is on their "want to do " list.

There are also lots of focus units; not just GPU cores. There are AMX (matrix) units , Neural Units(NPU) , video en/decode units . The pressing issue is that for their specific areas of expertise they beat the slop out of general CPU cores in terms of Perf/watt . And Pref/Watt is also an explicitly , open statement Apple has said they are pursuing. That is on their "want to do" list.

Apple's third openly stated objective is to have the fastest iGPU among all competitors. That's is gong to apply pressure to assign it a disproportionate share of the budget increase. If they throw displacing upper mid range discrete GPU also ( e.g., getting rid of dGPU in iMac 27" class ) then even more so.

Apple made a Afterburner card to accelerate ProRES RAW 8k decode. Bigger budget could see a one stream 8k decode unit being pushed into the SoC. That isn't an explicit want, but putting a "value add" to a custom, Apple format across the whole Mac line up..... why would they not "want" to do that? That is just deeper ecosystem glue (or tarpit depending upon viewpoint. )

Second, there are explicit "want to do" that Apple has explicitly stated directions. Moving FP16 and FP32 computational kernel to the GPU cores through Metal compute. That is a "want to do". That goes back into the desire for better Perf/Watt. The complaint from some that putting computation on GPU looses access to main RAM? Buzz, dead because of the unified memory priority.

Apple uses foundation libraries like Accelerate and AV to push computations to custom units when they are available. That is seemless access to the NPU, AMX , video en/decode units. Apple spends lots of time and effort trying to get better utilization out of their significantly better Perf/Watt specialized units. Pragmatically that amounts to generic CPU core offload.

The 'icing on the cake' is that offloading from the CPU also reduces the pressure to change the kernel schedulers... those specialied are not scheduled directly by the OS. Apple keeps a single unified microkernel from Homepod to Mac Pro. Apple "wants to" spend a higher amount of money on kernel development costs overhead? Probably not. APFS is great at single SSD drive set ups. High mulitples of storage drives (via RAID and/or multidrive, volume management. Apple deprecated that path. ). Laptops/mobile leads the charge for the file system. That is highly likely to show up in other major subsystems of the OS also.

Apple also explicitly stated that they want to natively run iOS apps on the Mac. Pragmatically that is an underlined emphasis on the unified ( i.e., GPUs are going to get a substantive transistor budget). this is somewhat similar to why Intel/AMD are hesistant to drop old x86. The 32-bit software inertia keeps that 32-bit hardware functionality in their CPU packages. The huge inertia from OS apps that Apple derives large revenues from is going to keep the GPUs in the M-series SoCs. (unlike 32-bit apps, OS apps are generally moving forward. So the chances of this inertia dying out over time is extremely low for the foreseeable future. )

Third, Apple doesn't place a high priority on being in the server business. There is no deep, driving "want" there. macOS Server is a bundle of some software and activating some cron jobs and opening some ports. There is no substantive changes at the kernel level at all. Hardware wise the approach is just to press the mini and Mac Pro into a solution. There are no 1U or 2U boxes. Apple walked away from that business over a decade ago.

Apple's foray into XCode cloud is likely going to be the same path other macOS cloud vendors too. Lots of Mini's and a relatively smaller percentage of Mac Pros. As long as the other cloud providers are constrained by the same end user hosting density then Apple will continue to just sell more Macs. ( Apple probably wants to sell more Macs rather than fewer. )

Apple has a want to be in the single user workstation business. But they don't necessarily have to take a server CPU to do that. When they make their own , they can choose their own priorities. For the last decade Apple has taken the single CPU package track. Choosing Xeon E5 versus E7. Choosing Xeon W-3200 over Xeon SP. If Apple is waiting on W-3300 for a deferred launch instead of going forward on Xeon SP Gen 3 . All basically the same want. Trading off slow time to market for higher single threaded performance. There isn't a higher priority being applied to ultimate possible core count.

Same transistor budget allocation impact. If single thread performance is a equal or higher objective than multiple threads performance, then bigger caches ( branch target, L1/L2/L3 ) are going to take transistor budget away from more cores (with smaller caches ). Similarly, the caching that AMX , NPU , and/or GPU cores need to be more effective. Probably ittle want to sacrifice that just to win some CPU core count 'war'. Similar issues as to why probably won't get SMT later either (plus Apple cares very little about SATA storage performance) .

Apple is probably wants a more balanced SoC at the top of their offerings. The combination of multiple dimensions that it does very well is the 'win' that they likely want. If they had a 56-64 ( 56P + 8E cores ) SoC that had the same (or higher) top end single threaded performance as the 4-8 core offerings then that would be something the "hand me down" server x86 packages don't really do. ( at least without crazy high TDP.... which Apple has zero desire of going anywhere near ( Perf/Watt objective). ). Apple will position that as "we are doing a better job so... replaced them" win.

Apple having a myopic 'want' to win the maximum general CPU core count war is deeply suspect.

Finally, one can point to increasingly narrow corner cases where the general CPU cores have to be used, but still is an embarrassingly parallel workload. The problem there is that Apple explicitly doesn't want to be everything for everybody. Tackling a targeted, "good enough" coverage workloads with more specialized cores is more Pref/Watt effective. The workloads inside the targeted are getting larger performance wins over design evolution.

deconstruct60 · Jul 30, 2021

AgentMcGeek said:
It is very likely that the Pro line will get more than 64 threads within a few years. After all, semi-professional CPUs like Threadrippers are already at 128 threads. If AMD can do it on x86, I expect AS to follow sooner than later.

Threadripper and Xeon W-3xxx series are basically "hand me down" server dies. Binned for higher clock rates and the multiple package functionaliy turned off ( or replaced ).

Apple probably is not going to create a Server CPU produnct line to do "hand me down" SoCs. The existance of the server Operating System market drives the enablement of those AMD/Intel processors. Plain-jane Windows H10ome doesn't work on those .

The bulk of the workstation market is in the zone that the Ryzen 9 and AlderLake top end will cover ( what used to be coverd by 16-20 Threadripper/Xeon W-2200/Xeon E5 offerings. ). As the mainstream CPUs creeps up it is eating into what the "old zone" was. Capping at 4x16 ( 64) is enough to put a differentiation gap between

Additionally, if look at Intel/AMD offerings where there is an iGPU the max CPU core counts are even lower. So 56-64 cores would be an even larger multiple. Apple is likely to price their SoC offering closer to the Threadripper and higher end Xeon W-3300 but they will be offering "More" into that bundled price. At least a iGPU and pretty good chance RAM costs are bundled in also.

For folks that have some super high CPU core count raytracer, it is very likely Apple is going to point them at the GPU and AMX to get the work done in the future. Vidoe decoder... same thing in most cases. Apple is not shy at all about telling folks they have to re-write and adjust their code.

Something like Audio workloads that preload all the data into RAM. SMT threads really do not have lots of traction. SMT generally is about filling "holes" ( or "stalls" ) that slow parts of the memory heirarchy put into processor pipelines. Not doing anything for 1,000 cycles because waiting, then switch to something else. If load as much as possible into memory those SMT gaps don't make a difference. Apple's processors don't have SMT. So there is no liklely future SMT gap being driven there.

Apple doesn't "have to" do a GPU less SoC. A "monkey see , monkey do" clone of the EPYC/Xeon SP is probably not a priority in Apple design labs. The Mac Pro ( and iMac Pro) used Intel offerings because that is what Apple was buying. Those were not necessarily what Apple wanted to buy.

cbum · Jul 30, 2021

Anyone remember the heated discussions (not very long ago) if Apple should/would transition the Mac to ARM?

;-)

macsplusmacs · Jul 30, 2021

I kept those arguments at ARM's length from me.

(mods please don't blast me into oblivion for this) LOL~

richinaus · Jul 30, 2021

macsplusmacs said:
I kept those arguments at ARM's length from me.

(mods please don't blast me into oblivion for this) LOL~

Was expecting to read some full on comments in response to deconstruct60 big posts but we get this disarming comment.l instead. Love it.

darngooddesign · Jul 30, 2021

macsplusmacs said:
I kept those arguments at ARM's length from me.

(mods please don't blast me into oblivion for this) LOL~

I don't think the mods would limbit you like that.

JMacHack · Jul 30, 2021

deconstruct60 said:
-snip-

Apple may release a cpu in the future that has more than 64 threads. The core count wars have exploded recently and show signs of continuing.

That’s reason enough to expand upon the kernel.

theorist9 · Jul 30, 2021

macsplusmacs said:
I kept those arguments at ARM's length from me.

(mods please don't blast me into oblivion for this) LOL~

Yes, the mods may ban you for insufficient bickering.

leman · Jul 30, 2021

JMacHack said:
Apple may release a cpu in the future that has more than 64 threads. The core count wars have exploded recently and show signs of continuing.

That’s reason enough to expand upon the kernel.

From a cursory glance it doesn’t seem like Apple uses similar bookkeeping on the ARM version of the kernel. At least I didn’t see anything like the a CPU bitmask…

deconstruct60 · Jul 31, 2021

quarkysg said:
Straight from the horses mouth, so to speak ... heh heh:

darwin-xnu/osfmk/i386/mp.h at 0a798f6738bc1db01281fc08ae024145e84df927 · apple/darwin-xnu

Legacy mirror of Darwin Kernel. Replaced by https://github.com/apple-oss-distributions/xnu - apple/darwin-xnu

github.com

Hmmmm, I think that is more a cap on maximum of CPU packages. Back in the day of 386 there was only one core in a CPU package. So to build "big iron" machine one was add lots of these.

However, it is quite indicative of the mindset that went into lots of aspects of the code for Mach kernel here. Where it was a "multimillon dollar" machine to get to 64 cores so why use data structures larger than that.

I think something closer is things like this one

" ...

* This bitfield currently limits MAX_CPUS to 32 on LP64.
* In the future, we can use double-wide atomics and int128 if we need 64 CPUS.
...

typedef unsigned long checkin_mask_t;

....
static uint64_t cpu_checkin_last_commit;
....

#if __LP64__
	#define CPU_CHECKIN_MASK_MAX_CPUS 32



	/* Avoid double-wide CAS on 32-bit platforms by using a 32-bit state and mask */

"

darwin-xnu/osfmk/kern/cpu_quiesce.c at a1babec6b135d1f35b2590a1990af3c5c5393479 · apple/darwin-xnu

Legacy mirror of Darwin Kernel. Replaced by https://github.com/apple-oss-distributions/xnu - apple/darwin-xnu

github.com

Unsigned long is sometimes 64 bits on 64-bit implementation but think it is still 32-bit on ARM64.

But this also highlights this isn't as easy a changing a constant. If going to spread the masks over multiple "words" and or registers going to need different locking/atomic updates to make it work. Or mechanisms to implement atomic, transactional changes.

deconstruct60 · Jul 31, 2021

Icelus said:
Seems a flexible setting.

to go past 64 you need an atomic settings for wider than int128. That isn't a flip the constant and just recompile change.

deconstruct60 · Jul 31, 2021

leman said:
From a cursory glance it doesn’t seem like Apple uses similar bookkeeping on the ARM version of the kernel. At least I didn’t see anything like the a CPU bitmask…

Everything in the osfmk/arm64 subdirectory isn't the limits of what specific about the kernel.

the CPU_quiesce.c code that has been weaved into this thread with the cpu mask highlighted is all about non x86 machines. There is typdef at the top that throws an error is try to use it during the compile of x86 version.

There are multiple masks.

"

static_assert(sizeof(cpumap_t) * CHAR_BIT >= MAX_CPUS, "cpumap_t bitvector is too small for current MAX_CPUS value");

darwin-xnu/osfmk/arm/cpu_data_internal.h at a1babec6b135d1f35b2590a1990af3c5c5393479 · apple/darwin-xnu

Legacy mirror of Darwin Kernel. Replaced by https://github.com/apple-oss-distributions/xnu - apple/darwin-xnu

github.com

there are also a few variables tagged as unsigned int 64 that seem tied implicit to the upper board on cpus.

leman · Jul 31, 2021

deconstruct60 said:
Everything in the osfmk/arm64 subdirectory isn't the limits of what specific about the kernel.

the CPU_quiesce.c code that has been weaved into this thread with the cpu mask highlighted is all about non x86 machines. There is typdef at the top that throws an error is try to use it during the compile of x86 version.

There are multiple masks.

"

static_assert(sizeof(cpumap_t) * CHAR_BIT >= MAX_CPUS, "cpumap_t bitvector is too small for current MAX_CPUS value");

darwin-xnu/osfmk/arm/cpu_data_internal.h at a1babec6b135d1f35b2590a1990af3c5c5393479 · apple/darwin-xnu

Legacy mirror of Darwin Kernel. Replaced by https://github.com/apple-oss-distributions/xnu - apple/darwin-xnu

github.com

there are also a few variables tagged as unsigned int 64 that seem tied implicit to the upper board on cpus.

Yep, you are right. These bit masks ultimately trace back to bitmap_t which is an uint64_t. I have overlooked this before.

I think this one is a core file where various configurations of Apple SoCs are described:

darwin-xnu/pexpert/pexpert/arm64/board_config.h at 2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32 · apple/darwin-xnu

Legacy mirror of Darwin Kernel. Replaced by https://github.com/apple-oss-distributions/xnu - apple/darwin-xnu

github.com

leman · Jul 31, 2021

deconstruct60 said:
Hmmmm, I think that is more a cap on maximum of CPU packages. Back in the day of 386 there was only one core in a CPU package. So to build "big iron" machine one was add lots of these.

I was doing a bit more reading and this my current interpretation: these constants represent the maximal amount of hardware threads that the OS can handle. The limit exists for efficiency reasons — it allows the thread mask to be packed into a single 64-bit unsigned integer, which allows efficient memory operations (especially if atomicity is required).

Interestingly enough, Windows has a similar limitation. In order to deal with larger systems, they have a concept they call "processor groups", which split the available CPU into groups of up to 64 hardware threads. Each thread is assigned to one of these processor groups and if an application wants to use more than 64 hardware threads it has to use special APIs that manage the processor groups. This is further complemented with a NUMA API that allows an application to query the memory hierarchy of hardware threads and schedule work in a way that utilizes NUMA nodes efficiently. As far as I understand, macOS does not have any NUMA support — even when Apple shipped multi-socket systems, those were configured as a single NUMA node. And it doesn't seem that macOS 12 changes anything in this regard.

Some folks (including myself) have been speculating that the ARM Mac Pro could introduce a notion of modular compute boards — where each board is an MPX module that contains an SoC (CPU+GPU+whatever+fixed limited amount of fast RAM). Such system could host multiple such compute boards as well as shared (slower) expandable system RAM. To manage such architecture Apple could introduce NUMA APIs not unlike the Windows Processor Groups — these APIs would allow the application to discover NUMA clusters in the system and schedule work on them. But each NUMA cluster would still be limited to at most 64 hardware threads (which sounds like a reasonable limitation even going forward). "Regular" application that are now aware of the NUMA architecture will ever only run on a single NUMA node.

hugodrax · Jul 31, 2021

MalcolmH said:
Why don’t Apple use hyper threading ?

hyperthreqdy exists in intel CPUs to paper over a poor design.

urbanman2004 · Jul 31, 2021

I call cap... If Apple indeed has some sort of "magic" 64-Core Pro CPU in the pipeline for their Pro-consumer base, I highly doubt it'll get released within the next 3-5 years.

bobcomer · Jul 31, 2021

hugodrax said:
hyperthreqdy exists in intel CPUs to paper over a poor design.

Yeah, but guess what, it helps in multicore performance and by a good margin. Apple will have to do something similar as they scale up.

dgdosen · Jul 31, 2021

leman said:
...it allows the thread mask to be packed into a single 64-bit unsigned integer...

Not trying to be obtuse - but can't you mask 64 unique things with 6 bits?

leman · Jul 31, 2021

dgdosen said:
Not trying to be obtuse - but can't you mask 64 unique things with 6 bits?

Not if you want to store the set of 64 things, which is what they are doing (don't ask me why though, I have no idea what the purpose of that bitset is)

leman · Jul 31, 2021

bobcomer said:
Apple will have to do something similar as they scale up.

Do they though? Their CPUs are so power-efficient that they scale much better to sustained multi-core workloads compared to x86 CPUs that often need to drop their clocks by 40% or more. And let's not forget that Apple has efficiency cores which offer an overall boost comparable to SMT (and in many cases, better).

bobcomer · Jul 31, 2021

leman said:
Do they though? Their CPUs are so power-efficient that they scale much better to sustained multi-core workloads compared to x86 CPUs that often need to drop their clocks by 40% or more. And let's not forget that Apple has efficiency cores which offer an overall boost comparable to SMT (and in many cases, better).

I suspect yes, there will be a wall that they will hit that they will have to do something similar or maybe even worse. It's still just a CPU. We will really need something truly innovative to get away from the problems of trying to do more and more on less and less.

I have no way of knowing for sure, I can't see the future, but it certainly follows from the past. And to tell you the truth, I really don't care one way or the other as long as what machine I'm using does what I need it to do.

JMacHack · Jul 31, 2021

bobcomer said:
I suspect yes, there will be a wall that they will hit that they will have to do something similar or maybe even worse. It's still just a CPU. We will really need something truly innovative to get away from the problems of trying to do more and more on less and less.

I have no way of knowing for sure, I can't see the future, but it certainly follows from the past. And to tell you the truth, I really don't care one way or the other as long as what machine I'm using does what I need it to do.

A solution would be to leverage the gpu that’s integrated for more parallel tasks, or asics such as the afterburner card for the Mac Pro for problems that are embarrassingly parallel.

bobcomer · Jul 31, 2021

JMacHack said:
A solution would be to leverage the gpu that’s integrated for more parallel tasks, or asics such as the afterburner card for the Mac Pro for problems that are embarrassingly parallel.

I'm just a general computing guy, but I was under the impression that GPU cores are only good for certain things.

09872738 · Jul 31, 2021

bobcomer said:
I'm just a general computing guy, but I was under the impression that GPU cores are only good for certain things.

Well, kinda. They are much slower on a per core basis (a a very rough rule of thumb 1/10 of the computing power of a CPU). But - there is a whole lot of cores, and bulk memory access is usually much faster (again; rough rule of thumb: 5x).

So, GPUs are very good if a task is parallelizable and the amount of chunks of work is huge.
Another heuristic:
CPUs - low latency, small throughput. Metaphor: Motorcycle. Propells you very fast from point A to point B. But it can can carry 2 people only, one if which is occupied by the driver, thus limiting payload
GPUs - high latency, huge throughput. Metaphor: Bus. Its pretty slow on its way from point A to point B but you can bring all your buddies. Plus your football team. Including fans

You‘d ride the motorbike if you are going alone. But if you want to bring all your friends, your football team and all your fans, then by all means the bike wouldn‘t bring you anywhere.

bobcomer · Jul 31, 2021

09872738 said:
Well, kinda. They are much slower on a per core basis (a a very rough rule of thumb 1/10 of the computing power of a CPU). But - there is a whole lot of cores, and bulk memory acees is usually much faster (again; rough rule of thumb: 5x).

So, GPUs are very good if a task is parallelizable and the amount of chunks of work is huge.
Another heuristic:
CPUs - low latency, small throughput. Metaphor: Motorcycle. Propells you very fast from point A to point B. But it can can carry 2 people only, one if which is occupied by the driver, thus limiting payload
GPUs - high lacency, huge throughput. Metaphor: Bus. Its pretty slow on its way on its way from point A to point B

You‘d ride the motorbike if you are going alone. But if you want to bring all your friends, your football team and all your fans, then by all means the bike wouldn‘t bring you anywhere.

I understand that, but like I said I'm a general purpose computing type, massively parallel jobs aren't real common. (I have never did anything like that) We like lots of cores too, but all doing different things.

In your analogy, multiple motorcycles are actually better in my world, as long as you count you have multiple passengers needing to go somewhere different.

Rumored pro chip

macrumors G5

macrumors G5

macrumors member

macrumors 68030

macrumors 68030

macrumors Core

Suspended

macrumors 601

macrumors Core

macrumors G5

macrumors G5

macrumors G5

​

macrumors Core

​

macrumors Core

macrumors 65816

macrumors regular

macrumors 601

macrumors 68030

macrumors Core

macrumors Core

macrumors 601

Suspended

macrumors 601

Cancelled

macrumors 601

Our Staff