Apple is also in a unique position to pull this off, because their TBDR architecture allows them to focus on async compute, where forward rendering GPUs have other concerns like locality and pixel ordering. Of course, Apple is also currently held back by the mobile heritage of their GPUs (relatively small register file, tiny L2, limited coordination between individual cores). There is still a lot of work to be done.
I'm not sure these SPECIFIC complaints hold water.
As I understand the numbers, on a per-core basis (the only counting that makes sense; if you give a BS marketing "make the number as big as possible" response, I'll immediately tune you out...):
- Apple's visible register space is up to 128 4B registers (or 256 2B registers) per thread. I think nV and AMD are the same. So up to 512B of register space per lane, or 16KB per 32-wide SIMDgroup.
- a core is the equivalent of a current (not past) nV SM, and is split into four essentially independent quadrants.
- register storage is 384KB per core, i.e. 96KB per quadrant.
That's enough for 6 "full" threadblocks per quadrant (each using the entire register set) or up to 24 threadblocks (if each uses only a quarter of the maximum register space it's entitled to).
I believe these numbers are essentially 1.5x the equivalents on the nV side.
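To make the arithmetic above concrete, here's the same occupancy math as a quick sketch. The 384KB/core and four-quadrant figures are as stated above; the 32-lane SIMD width is my assumption:

```python
# Back-of-the-envelope register-file occupancy for one Apple GPU core,
# using the figures quoted above. A 32-lane SIMD width is assumed.
REG_BYTES_PER_LANE = 128 * 4                 # 128 4B registers -> 512 B/lane
SIMD_WIDTH = 32
BYTES_PER_SIMDGROUP = REG_BYTES_PER_LANE * SIMD_WIDTH   # 16 KB

CORE_REGISTER_FILE = 384 * 1024              # 384 KB per core
QUADRANT_FILE = CORE_REGISTER_FILE // 4      # 96 KB per quadrant

# "Full" blocks use all 128 registers; the alternative uses a quarter of that.
full_blocks = QUADRANT_FILE // BYTES_PER_SIMDGROUP
quarter_blocks = QUADRANT_FILE // (BYTES_PER_SIMDGROUP // 4)

print(full_blocks, quarter_blocks)           # 6 full, or 24 at quarter usage
```

Which reproduces the 6-or-24 figure per quadrant.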
Apple also have small L1 and local (threadgroup) address space compared to nV. It's unclear how much of an issue this is: only about 3% of nV apps are constrained by local address space (as opposed to by register count).
Apple have very clever (and basically unknown) functionality to identify data streams relevant to each GPU client and optimize their use in cache, from L1 to L2 to SLC. That probably gives them better effective cache performance than nV and AMD. (A dumb naive cache is not especially great for GPUs, because all the data of any real interest consists of large streams...)
I'm not sure that the "limited coordination between individual cores" matters much. What I have SEEN in this space is that most of the nV work in this area is hacks around their performance issues, followed by devs who don't understand Apple Silicon assuming those same hacks are necessary on Apple. I've seen ZERO comparisons of the issue in terms of real (important) algorithms, written by people who know what they are doing and have genuinely optimized for each platform, comparing and contrasting.
Where Apple is lacking compared to nV and AMD (i.e. genuine, not BS, complaints):
- no support for 64b FP (even just at the MSL level, even if poorly performing) [STEP 1]
- no hardware support for 64b integers and FP (the integer support could be added at fairly low cost, even the multiplies if we allow them to take a few cycles; the FP support could be added as slow multicycle in the FP32 unit, or as something like a single FP64 unit shared by sixteen lanes). Along with this, decent "complete" support for 64b atomics. [STEP 2]
- coalescing of both divergent control and divergent memory access. nV has support for the control side (kinda, in a lame fashion, and for ray tracing only?), nothing as far as I know for divergent memory access. I'd hope an Apple solution would be better engineered, both at the HW level (supporting both) and at the MSL level (letting a kernel indicate that coalescing by the hardware is allowed; so not just stochastic processes like ray tracing, but also things like fancy multipole PDE solvers, where there's no natural spatial structure to the problem anyway, so letting the hardware reorder work is no big deal).
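On [STEP 1], software FP64 on FP32-only hardware is a well-known trick: keep a value as an unevaluated sum of two floats, built from Knuth's error-free "two-sum". A minimal Python sketch (the helper names are mine; FP32 rounding is simulated via struct, since Python floats are doubles):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def two_sum(a, b):
    """Knuth's error-free addition: returns (s, err) with s + err == a + b
    exactly, using only FP32 operations. This is the building block of
    float-float ("double-single") arithmetic for emulating FP64 on
    FP32-only hardware."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

# 2**-30 is below the ulp of 1.0 in FP32, so a plain FP32 add drops it...
assert f32(1.0 + 2**-30) == 1.0
# ...but the (hi, lo) pair preserves it exactly.
hi, lo = two_sum(1.0, 2**-30)
assert hi + lo == 1.0 + 2**-30
```

This is the sort of thing an "even if poorly performing" MSL-level FP64 could be built on, before any hardware support arrives.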
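And the payoff of control-flow coalescing is easy to model: if every divergent path taken within a SIMDgroup serializes, then grouping (sorting) work items by branch target before dispatch cuts the cost. A toy Python model; the cost model and names are mine, not any vendor's hardware:

```python
import random

SIMD_WIDTH = 32

def simd_cost(branch_ids, width=SIMD_WIDTH):
    """Toy divergence model: each 32-wide group pays one pass per distinct
    branch target taken by its lanes (divergent paths serialize)."""
    return sum(len(set(branch_ids[i:i + width]))
               for i in range(0, len(branch_ids), width))

random.seed(0)
work = [random.randrange(4) for _ in range(1024)]  # 4 divergent shader paths

naive = simd_cost(work)              # dispatch in arrival order
coalesced = simd_cost(sorted(work))  # group items taking the same path

assert coalesced <= naive
# Sorted, at most 3 of the 32 groups straddle a path boundary: cost is at
# most 35 passes, vs. up to 128 for fully mixed dispatch.
assert coalesced <= 35
```

The hardware version of this is harder (you can't just sort), but the model shows why stochastic workloads like ray tracing, and anything else without a fixed spatial ordering, benefit so much from it.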
These three would at least start Apple down the path of being a reasonable throughput engine for a variety of compute tasks, not just the most obvious GPU/ray tracing/AI work that gets all the attention today. My guess is they will come; everything just takes time. (And there's the constraint that they can't evolve Metal too fast until they're willing to drop support for AMD and Intel iGPUs... Until then, they mostly have to keep Metal functionality, and thus visible Apple Silicon functionality, kinda in sync across all three sets of devices.)