M1 Max and Supercomputing tech

Bodhitree · Nov 15, 2021

If I read this article on Anandtech about what Intel is putting into the Aurora supercomputer, the architecture seems quite similar to an M1 Max. It’s a small step up in scale, but most of the components and thought processes seem awfully familiar.

Consider:

— unified memory architecture
— huge bandwidth
— multiple cpu compute blocks (“processors”)
— a gpu with many processors, as opposed to a number of discrete gpu’s

In Aurora Intel is putting together thousands of these nodes into a larger whole with 2 Exaflops of total compute. So I think for Apple to put together 4x M1 Max for a single Mac Pro seems not unreasonable, architecturally there seem to be solutions for a lot of those problems.

The power consumption of Aurora? 60 Megawatts.

Intel: Sapphire Rapids With 64 GB of HBM2e, Ponte Vecchio with 408 MB L2 Cache

www.anandtech.com

leman · Nov 15, 2021

Well, Apple Silicon basically scales down the architecture used in some supercomputers, if you want to see it this way…

Intels product seems rather different. I don’t think they are tiling about real unified memory here, but more like cache coherency, where different chips can talk to each other but transferring data will still take longer than in Apples architecture.

zarathu · Nov 15, 2021

My lowly obsolete iPhone 7 is faster than a 1997 super computer.

Bodhitree · Nov 16, 2021

leman said:
Well, Apple Silicon basically scales down the architecture used in some supercomputers, if you want to see it this way…

Intels product seems rather different. I don’t think they are tiling about real unified memory here, but more like cache coherency, where different chips can talk to each other but transferring data will still take longer than in Apples architecture.

Yes, it is similar to a scaled-down supercomputer node, but it’s not even that much scaled down. For example the Aurora system runs two Sapphire Rapids Xeons with 64 GB of memory and 1.4 TB/second bandwidth with 6 Ponte Vecchio graphics accelerators per node, while M1 Max runs two clusters of 4 P-cores with 64 GB of memory and 400 GB/second bandwidth with a 32 core GPU. It blows me away that some of the numbers are in the same ballpark even.

It‘s very possible that Intel’s unified memory architecture is not the same as Apple’s. But it strikes me that some of the software methodology for addressing all that compute power might also be similar, and could be brought to a desktop computer.

leman · Nov 16, 2021

Bodhitree said:
But it strikes me that some of the software methodology for addressing all that compute power might also be similar, and could be brought to a desktop computer.

Oh, there is no doubt about that. The base technology to ship hybrid solutions with unified memory have existed for a very long time. The main reason why we don’t see them in the PC space is cost and configurability. Apple can afford to use this paradigm because they only sell premium products and because they have tight control over the type of devices they deliver. But it’s possible that Intel is aiming at something similar, no wonder they’ve been pushing their GPU division hard recently. AMD also has most pieces of the puzzle sorted out. Nvidia us the only one without CPU IP.

Bodhitree · Nov 16, 2021

AMD might well be in a better position than Intel because of their experience designing hardware for consoles. I saw that the latest Steam Deck also featured their technology in a similar setup.

But it’s interesting that yesterday’s supercomputers become todays consumer products. In a way you could look at something like Aurora and say, that kind of system will be in my new Mac Pro, and all that Apple has done is jump the gun. Instead of waiting for Intel to transform the landscape of computing, they’ve done it themselves.

I think Apple’s focus on performance per watt and starting from very efficient, low-power architectures has heralded a bit of a shift in how microarchitecture design is approached. It puts them in a much better starting place for a sustained push into more advanced computing spaces.

The software is really key though. It’s all very well to put massive amounts of power into a desktop package but if you have no use cases for that power, it won’t make a significant difference to the people buying those machines.

leman · Nov 16, 2021

Bodhitree said:
I think Apple’s focus on performance per watt and starting from very efficient, low-power architectures has heralded a bit of a shift in how microarchitecture design is approached. It puts them in a much better starting place for a sustained push into more advanced computing spaces.

Let’s hope so. Apple’s approach (fewer cores, consistently high performance) makes much more sense from software perspective than the mainstream x86 paradigm. But it’s a complex matter. Until now, Intel and AMD have failed to produce an architecture that offers peak performance at reasonable power draw, and there is reason to suggest that they will be able to do this yet. Abs of course, there is marketing. An average PC user send perfectly fine with the TDP creep if it means benchmark bragging rights, and manufacturers take full advantage of it. Apple has much more control in building the product they want to build.

Bodhitree said:
The software is really key though. It’s all very well to put massive amounts of power into a desktop package but if you have no use cases for that power, it won’t make a significant difference to the people buying those machines.

Absolutely! And this is where Apple has to quadruple their effort as well as develop a proper strategy. Right now they are all over the place.

deconstruct60 · Nov 16, 2021

leman said:
Intels product seems rather different. I don’t think they are tiling about real unified memory here, but more like cache coherency, where different chips can talk to each other but transferring data will still take longer than in Apples architecture.

How do you do cache coherency if the different nodes are all have a different notion of what/where the memory is? The memory address is completely non unified and yet coherent? How?

Non uniform access times isn't "Unified". Nor does caching (making local copies ) negate uniform/unified.

Intel appears to be doing CXL 1.1 subsets of CXL.io and CXL.cache. Depending upon the major bus bandwidth and latency being used CXL.mem might be more "handy" and more "Unified complete" to put a cherry-on-top, but not necessarily the primary driver of Unified address translation space.

leman · Nov 16, 2021

deconstruct60 said:
How do you do cache coherency if the different nodes are all have a different notion of what/where the memory is? The memory address is completely non unified and yet coherent? How?

Non uniform access times isn't "Unified". Nor does caching (making local copies ) negate uniform/unified.

Intel appears to be doing CXL 1.1 subsets of CXL.io and CXL.cache. Depending upon the major bus bandwidth and latency being used CXL.mem might be more "handy" and more "Unified complete" to put a cherry-on-top, but not necessarily the primary driver of Unified address translation space.

My understanding of these things is admittedly very limited, but doesn't CLX.cache do exactly that: coherent caching across devices with non-uniform memory architecture? When I say "uniform memory" I mean it in the narrow sense — as implemented by Apple Silicon for example — that all devices access the memory though the shared memory controller and cache subsystem. This does not seem be a requirement for CLX.

There are many implementations of "uniform memory" where memory is actually non-uniform (e.g. CUDA offers one). The memory "coherency" is achieved by tracking (either in hardware or software) of cross-device memory region mappings and copying data if necessary.

Bodhitree · Nov 16, 2021

zarathu said:
My lowly obsolete iPhone 7 is faster than a 1997 super computer.

There seems to be about a 20 year gap between a supercomputer and a smartphone. Its something to consider, what requires a decent sized warehouse today may fit inside your trouser pocket within your lifetime.

Bodhitree · Nov 16, 2021

leman said:
The base technology to ship hybrid solutions with unified memory have existed for a very long time. The main reason why we don’t see them in the PC space is cost and configurability.

That puts me in mind of a number of articles that I’ve seen recently about the future of computers being a form of ‘copy-less computing’. That implies almost by definition a unified memory architecture and some form of data re-interpretability at the source code level. Certainly Intel, AMD and NVidia have some ways to go to reach that goal, whereas Apple is closer to putting all the pieces together.

Search

Search

M1 Max and Supercomputing tech

Bodhitree

macrumors 68020

Intel: Sapphire Rapids With 64 GB of HBM2e, Ponte Vecchio with 408 MB L2 Cache

leman

macrumors Core

zarathu

macrumors 6502a

Bodhitree

macrumors 68020

leman

macrumors Core

Bodhitree

macrumors 68020

leman

macrumors Core

deconstruct60

macrumors G5

leman

macrumors Core

Bodhitree

macrumors 68020

Bodhitree

macrumors 68020

Our Staff