
LangdonS

macrumors newbie
Jan 22, 2024
20
2
Canada
I find it amusing. How would Apple start supporting Nvidia again, when they even ditched their long tradition of supporting AMD?
From my read on the situation, Apple is comfortable focusing research and development on their own chips, period. And it does not matter to Apple that part of the customer base relied on the graphics capabilities of other vendors.

Even though I prefer the older Intel-era Macs, I can very well imagine that several years from now I too will be happy to purchase a Mac based on a future version of the Apple M-series.
You can always just buy a 2019 Mac Pro.
They will be supported for a LONG time.
 

mreg376

macrumors 65816
Mar 23, 2008
1,233
418
Brooklyn, NY
Seeing the latest earnings report from NVIDIA, it just shows how dominant NVIDIA currently is; nobody even comes close. Why not embrace the best technologies the industry has to offer for the Mac Pro?

Even Google, who build their own custom TPUs, have embraced NVIDIA in the end. Amazon also builds custom ARM chips, but they heavily rely on NVIDIA as well; there is no way around it.
Well, I can think of one reason Apple won't be using Nvidia -- getting back into third-party, power hungry, heat-producing components and paying license fees is exactly the opposite of what Apple has been doing for years now.
 
  • Like
Reactions: gusmula and MRMSFC

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
Well, I can think of one reason Apple won't be using Nvidia -- getting back into third-party, power hungry, heat-producing components and paying license fees is exactly the opposite of what Apple has been doing for years now.
Yeah, they are already paying Nvidia (probably indirectly), so they probably don't want to pay more.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
What about the power and ease of use of NVIDIA's CUDA vs. Apple's GP-GPU API? Remember that a lot of people doing ML are scientists rather than programmers.

Of course, given the size and history of the CUDA community, CUDA is currently going to be far in the lead when it comes to the collection of NVIDIA-provided pre-built application code, aftermarket pre-built application code (e.g., CUDA code downloadable from government and university websites that a scientist can adapt for their own needs), as well as integration with other software systems, like CUDA Python.

But if Apple offered the hardware and a great API, an Apple ML community (and the attendant code base) would accumulate for Apple as well.

EDIT: I just noticed your response to @chuckee. It sounds like you're saying that CUDA's powerful and (relatively) user-friendly API, and all the pre-built CUDA code, don't give NVIDIA an advantage when it comes to ML.

I don’t think anyone uses CUDA for ML directly, folks use APIs like Tensorflow, PyTorch or Jax.
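To illustrate: a minimal PyTorch sketch of what "use the framework, not CUDA" looks like in practice. The same few lines target an Nvidia GPU, an Apple Silicon GPU (via the MPS backend), or the CPU, depending on what is available; the tensor shapes are arbitrary.

Code:
import torch

# Pick whichever backend is available; the user never touches CUDA directly.
if torch.cuda.is_available():
    device = torch.device("cuda")       # Nvidia GPU (CUDA under the hood)
elif torch.backends.mps.is_available():
    device = torch.device("mps")        # Apple Silicon GPU via Metal
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)
y = x @ w                               # dispatched to the chosen backend
print(device, y.shape)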
 
  • Like
Reactions: theorist9

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
I don’t think anyone uses CUDA for ML directly, folks use APIs like Tensorflow, PyTorch or Jax.
Why do you think it’s so hard for AMD to push patches for PyTorch or Tensorflow so they’re as optimized for Radeon as they are for RTX?

After all, both projects are open source.

Does AMD not want to put dedicated developer resources toward this, even though it's a multi-trillion-dollar opportunity?
 

sirio76

macrumors 6502a
Mar 28, 2013
578
416
Seeing the latest earnings report from NVIDIA, it just shows how dominant NVIDIA currently is; nobody even comes close. Why not embrace the best technologies the industry has to offer for the Mac Pro?

Even Google, who build their own custom TPUs, have embraced NVIDIA in the end. Amazon also builds custom ARM chips, but they heavily rely on NVIDIA as well; there is no way around it.
Do you realize that Nvidia's stellar earnings and valuation come from selling hardware for large server farms, not for individual customers and workstations? Nvidia succeeded in a totally different market that has very little in common with Apple's core business. BTW, most mega-caps (and even OpenAI) are trying to move away from Nvidia by building their own custom AI chips; it will take a while, but they will get there.
 
  • Like
Reactions: MRMSFC

leman

macrumors Core
Oct 14, 2008
19,520
19,670
Why do you think it’s so hard for AMD to push patches for PyTorch or Tensorflow so they’re as optimized for Radeon as they are for RTX?

I think there is probably simply less interest in AMD GPUs for this kind of work, especially since gaming Radeons do not have dedicated matmul hardware. Datacenter Radeons do, but I would guess that AMD expects one to target these directly with custom-written low-level code. Besides, AMD's strategy appears to be direct CUDA support, which would make a separate TF/PyTorch backend unnecessary (at least to a certain degree).

Disclaimer: all of this is speculation as I didn't spend sufficient time looking into these things.
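One aside on the "direct CUDA support" angle: PyTorch's ROCm builds already reuse the torch.cuda interface, so device-selection code written for Nvidia runs unchanged on supported Radeons. A rough sketch of how you can tell which backend you actually got (this is about PyTorch's HIP port specifically, not AMD's broader CUDA-compatibility efforts):

Code:
import torch

# On ROCm builds, torch.cuda is backed by HIP, so "cuda" also means "Radeon".
print(torch.cuda.is_available())   # True on CUDA *or* ROCm builds with a supported GPU
print(torch.version.cuda)          # CUDA version string on Nvidia builds, else None
print(torch.version.hip)           # HIP/ROCm version string on ROCm builds, else None

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
print(x.device)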
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
I think there is probably simply less interest in AMD GPUs for this kind of work, especially since gaming Radeons do not have dedicated matmul hardware. Datacenter Radeons do, but I would guess that AMD expects one to target these directly with custom-written low-level code. Besides, AMD's strategy appears to be direct CUDA support, which would make a separate TF/PyTorch backend unnecessary (at least to a certain degree).

Disclaimer: all of this is speculation as I didn't spend sufficient time looking into these things.
If I'm AMD, and I see trillions of dollars in front of me, I'd throw everything at it. Make a CUDA drop-in replacement for people who want it. Make a native implementation for PyTorch/TensorFlow for others who want it. Hire dedicated low-level engineers for both projects.

Literally trillions in potential value.

 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
If I'm AMD, and I see trillions of dollars in front of me, I'd throw everything at it.

I think this is the idea behind https://geohot.github.io//blog/jekyll/update/2023/05/24/the-tiny-corp-raised-5M.html

It also seems I have underestimated AMD's matmul performance (they have an instruction that does SIMD-wide packed multiply, with up to 512 FLOPs per CU per clock). Still less than Nvidia's tensor cores at a comparable price, but more than sufficient to saturate the memory bandwidth.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
If I'm AMD, and I see trillions of dollars in front of me, I'd throw everything at it.
AMD knows it too.
[attached image]


Make a CUDA drop-in replacement for people who want it.
It is unclear whether this is legal. It appears that APIs may be copyrightable in the United States.

Make a native implementation for PyTorch/TensorFlow for others who want it. Hire dedicated low-level engineers for both projects.
AMD's main problem is its poor support for desktop GPUs.
[attached image]

 
  • Like
Reactions: Chuckeee

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,663
OBX
AMD knows it too.
[attached image]


It is unclear whether this is legal. It appears that APIs may be copyrightable in the United States.


AMD's main problem is its poor support for desktop GPUs.
[attached image]
Yeah, AMD got off the compute train when they switched from GCN to RDNA for consumer GPUs; GCN lives on as CDNA. It is interesting that they seem to have somewhat come to their senses and changed the architecture enough to be able to add some support for some gfx11 parts (not really sure why they stopped at Navi31).
 

tenthousandthings

Contributor
May 14, 2012
276
322
New Haven, CT
I don’t think anyone uses CUDA for ML directly, folks use APIs like Tensorflow, PyTorch or Jax.
Apologies if I missed this earlier, or elsewhere, but what about the Apple Neural Engine (ANE)? Isn't that mostly about ML?

As I understand it, the CPU, the NPU (Neural Processing Unit, which Apple calls the Neural Engine), and the GPU are all factors. ANE has been a thing since the A11 Bionic, and it came to the M-series from that innovation. But unless something has changed (which may well be the case), its API frameworks are private, Apple-only. So they aren't available to third parties, but if we're talking about Apple Silicon's potential in the realm of ML, and Apple's prowess in that world overall, it seems like it should be mentioned. [My understanding is based on this: The Neural Engine -- what do we know about it?]

So that brings me to something else that is different about A17/M3, as opposed to A14/M1 and A15/M2. The NPU cores in A17 Pro have almost (exactly?) double the performance of those of M3/Pro/Max. Both have 16 cores, but A17 Pro can do 35 trillion ops/sec (more than double the A16 Bionic, an astounding leap), while M3 is limited to 18 trillion ops/sec. Why? This is a first. M1 and M2 had the same NPU cores/performance as A14 and A15.

The answer may be that the M3/Pro/Max CPU and/or GPU cores are more capable, so they don't need as much NPU. Or maybe it's just that there's no facial recognition in macOS (yet), so an enhanced NPU isn't needed. Basically, macOS doesn't need it and/or can't use it, so that silicon is better spent on other things, like an enhanced GPU?
 
Last edited:
  • Wow
Reactions: gusmula

leman

macrumors Core
Oct 14, 2008
19,520
19,670
Apologies if I missed this earlier, or elsewhere, but what about the Apple Neural Engine (ANE)? Isn't that mostly about ML?

Great question! It is true that ANE is often misunderstood. Yes, ANE is about machine learning, but its primary purpose is running previously trained models with a low energy footprint. Also, ANE only supports relatively small and simple models. In other words, ANE is used to do things like image classification or text input prediction on your device, but you can't use it to train new models or run overly complex stuff (not without some tweaking, at least). Folks often like to compare ANE to Nvidia Tensor Cores, but this comparison is flawed — Tensor Cores are a dedicated matrix multiplication unit embedded into a general-purpose programmable GPU, and ANE is essentially a black box we don't know much about. The closest hardware piece Apple has to Tensor Cores is the AMX (except that on Apple Silicon it is a CPU coprocessor, not a GPU auxiliary unit).

As I understand it, the CPU, the NPU (Neural Processing Unit, which Apple calls the Neural Engine), and the GPU are all factors.

Yeah, the Apple world of ML is a bit confusing. There are currently four hardware paths for doing ML-related stuff:

- CPU SIMD (Neon) — 128-bit vector instructions that, among other operations, support matrix multiplication
- CPU AMX — wide vector/matrix coprocessor that can multiply large matrices using a wide range of precisions
- ANE — power-efficient 16-bit convolution hardware that can be used to evaluate models with low energy usage
- GPU — programmable parallel processor that also supports work-optimal matrix multiplication for ML training and inference

Choosing the right hardware for your task is not trivial! What's worse, only Neon and the GPU are actually programmable; AMX and ANE are hidden behind system APIs. AMX performs best if you need to work with double-precision matrices (since the GPU does not support doubles at all), and it also usually works best if the number of matrices you have to process is relatively small. If you need to multiply a lot of large matrices using FP32 or FP16 precision, the GPU is the best tool for the job (although Apple GPUs do not offer better FP16 performance, unlike competitor hardware). ANE is intended to accelerate ML inference within end-user applications (although Apple also uses it for other purposes, like image upscaling for MetalFX).
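To make the "hidden behind system APIs" point concrete, here is roughly what requesting the ANE looks like from Python via coremltools. You never program the ANE itself; you only state a compute-unit preference and the system decides what actually runs where. The model file name, input name, and shape below are placeholders:

Code:
import numpy as np
import coremltools as ct

# Load an already-converted Core ML model (placeholder name) and ask the
# system to prefer the CPU + Neural Engine. There is no direct ANE access.
model = ct.models.MLModel(
    "SomeModel.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

# Input name and shape depend on the model; these are placeholders.
out = model.predict({"image": np.zeros((1, 3, 224, 224), dtype=np.float32)})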

And if all this is not complicated enough, quite often the best performance can be achieved by utilizing several of these units simultaneously. E.g., Apple's tweaked Stable Diffusion implementation uses the ANE, AMX, and GPU at the same time to maximize performance. It's not something that is easy to do on a regular dev's budget. Apple's new ML framework (MLX) is designed to do these things from the start (currently only on AMX and the GPU).
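For completeness, the MLX side is about as minimal as it gets: arrays live in unified memory, evaluation is lazy, and work goes to the GPU by default. A rough sketch with arbitrary sizes:

Code:
import mlx.core as mx

a = mx.random.normal((2048, 2048))
b = mx.random.normal((2048, 2048))

c = a @ b      # runs on the default device (the GPU on Apple Silicon)
mx.eval(c)     # computation is lazy until explicitly forced

# The CPU path (which is where AMX comes in, per the note above) can be
# requested explicitly:
with mx.stream(mx.cpu):
    d = a @ b
    mx.eval(d)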



Both have 16 cores, but A17 Pro can do 35 trillion ops/sec, while M3 is limited to 18 trillion ops/sec. Why? This is a first. M1 and M2 had the same NPU cores/performance as A14 and A15.

Yeah, a lot of people are wondering about this. I haven't seen a satisfactory answer yet.
 

tenthousandthings

Contributor
May 14, 2012
276
322
New Haven, CT
It occurs to me after writing the above (comment #139, prior to seeing leman's reply) that there could be another layer to the possible explanation of why the M3 NPU cores are half those of the A17 Pro. I thought of this when I edited that comment to add (in parentheses) the "astounding" fact that the A16 Bionic NPU cores have slightly less than half the performance of the A17 Pro.

It doesn't change any of what I said as possible explanations for the larger question of "Why?", but it does seem likely that the design/architecture of the M3 NPU cores are based on the A16, not the A17. I think we could see this in the future as well, as the Neural Engine moves forward annually in the A-series, while the M-series, on a roughly 18-month cycle, sometimes lags behind. It could even mean that the development of the M3 started before the A17, so in some ways the A17 is based on the M3 (i.e., the GPU cores), instead of the other way around...

I mean, the CPU, NPU, and GPU are discrete areas on the silicon. They all share the same unified memory, but Apple can do what they want within each of those areas.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
but it does seem likely that the design/architecture of the M3 NPU cores are based on the A16, not the A17.

This has been speculated, but inspection of die images does not really find much resemblance between the NPU in A16 and M3 (nor between A17 Pro and M3 for that matter).
 

MRMSFC

macrumors 6502
Jul 6, 2023
371
381
What makes you say this? M2 Ultra is 1.8x faster than M2 Max in Blender; that's just 10% less than perfect scaling. The 4090 is around 2.05x faster than the 4070 in Blender; that's 30% less than perfect scaling. Seems to me Apple and Nvidia scale about the same here.
I’m just digging for an explanation. The M3 series runs hotter than the M2 did, and has a clock speed increase.
True, but there are always ways around it. Now, I agree with you that it’s not clear that Apple wants to take the performance crown, but I don’t really see a technological reason they could not.
I feel like if they could trounce the competition currently, they would. I really don’t see why they would pull any punches.
The last generation just delivered a 50% improvement in performance on complex workloads (that's without RT). This doesn't sound to me like an architecture that's anywhere close to diminishing returns. And there are still very obvious things they are not yet doing to get extra performance.
What are the obvious things?
Could you give an example? I mean, M3 GPU improvements are all hardware.
Well, contributing to Blender for one. On the gaming side they’re partnering with developers and made GPTK.
 

Pet3rK

macrumors member
May 7, 2023
57
34
Just use the logic: if Mac sales were 34% higher a year ago, the only way their market share was not higher a year ago is if PC sales were also 34+% higher a year ago, and PC sales never have such fluctuations.
Looking at counters of desktop OS share (and NOT shipment market share), the Mac's share is steadily growing.
 

MRMSFC

macrumors 6502
Jul 6, 2023
371
381
Also, someone brought up a good point earlier in the thread: the space where Apple competes with Nvidia isn't where Nvidia gets most of its valuation.

Nvidia's valuation is based on its AI training hardware: parts that are sold to enterprises with servers upon servers of these things. Apple hasn't competed in the enterprise space for as long as I can remember.

The only space I can think of where competition is directly between the two is in video editing and 3D rendering. And Apple is competing very well there already, though probably has room to improve.

Maybe things will change and Apple will start making enterprise machines for AI, but I’m not convinced.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
I’m just digging for an explanation. The M3 series runs hotter than the M2 did, and has a clock speed increase.

The M3 Max CPU runs hotter simply because it has more cores. But at peak each M3 CPU consumes less power than M2 Max for example.

I feel like if they could trounce the competition currently, they would. I really don’t see why they would pull any punches.

It's not about pulling punches, it's about sustainable execution. You can't do everything at once. These are complex things. It is very clear that Apple GPU designers prefer to do things step by step while planning ahead.

What are the obvious things?

I wrote about this before. Just look at the progression of Apple GPU design over the years. M1/A14 (and maybe even earlier designs) have multiple independent specialized SIMD ALU pipes (FP32, FP16, INT/special) — but at most one can be activated per clock. The M3 GPU can activate up to two different pipes per clock (concurrent execution of FP32+FP16 or FP+INT operations). The next logical step is making the FP pipes symmetrical, for concurrent FP32+FP32 or FP16+FP16, which would give Apple a massive boost in performance, comparable to what Nvidia did with Ampere.

For ML-specific stuff, Apple currently supports work-optimal matrix multiplication on the FP SIMD pipes. Their hardware supports the data swizzling required to run matmul on a 32-wide SIMD without stalls. But it still uses at most one pipe (so 32-wide ALUs). A next logical step would be modifying the FP SIMD units so that they can also work as 16-bit matrix units. Combined with the symmetric-pipe modification I mentioned earlier, that would give you 128 FP16 MAC units per partition or 512 FP16 MAC units per GPU core (or 1024 "tensor" ops per core per clock — comparable to the latest Nvidia tensor cores).

All this can be done with only a minimal additional transistor budget, as most of the infrastructure is already in place. Fun fact — matmul operations in Metal are defined for 8x8 matrices, which is exactly the number of values needed to saturate 64 MAC units, or a single 32-wide 32-bit SIMD configured to do 16-bit MACs.
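A quick back-of-the-envelope showing where those numbers come from; every figure below is an assumption taken from the hypothetical configuration sketched above, not an Apple specification:

Code:
# Hypothetical configuration from the post above, not Apple specs.
simd_width      = 32   # lanes per SIMD (one partition)
symmetric_pipes = 2    # assumed second, symmetrical FP pipe per partition
fp16_per_lane   = 2    # two packed FP16 MACs per 32-bit lane
partitions      = 4    # partitions per GPU core

mac_per_partition = simd_width * symmetric_pipes * fp16_per_lane  # 128 FP16 MACs
mac_per_core      = mac_per_partition * partitions                # 512 FP16 MACs
ops_per_core      = mac_per_core * 2                              # MAC = mul + add -> 1024 ops/clock
print(mac_per_partition, mac_per_core, ops_per_core)              # 128 512 1024

# The 8x8 Metal matmul tile holds 8 * 8 = 64 values, exactly enough to feed
# 64 MAC units (one 32-wide 32-bit SIMD doing packed 16-bit MACs).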


Well, contributing to Blender for one. On the gaming side they’re partnering with developers and made GPTK.

And it's important that they continue doing this. But they did not achieve 50% performance improvement in Blender on M3 by submitting M3-specific patches to Blender (in fact, the M3-specific code removes a bunch of complex shader tweaking logic since the M3 hardware can sort these things out by itself).
 
  • Like
Reactions: Chuckeee and MRMSFC

leman

macrumors Core
Oct 14, 2008
19,520
19,670
I know that "number of cores" is a useful concept for CPUs but not for GPUs. Is it useful for NPUs?

A "core" is whatever the marketing department calls it. It is moderately useful for CPUs because our software model is still based on the concept of threads, and a single CPU core runs one thread... unless you have SMT, and then it suddenly becomes more muddy. GPU execution models are marketed in most awful of ways, it took me months to actually understand how stuff works there (using Nvidia's logic for example M1 CPU would have 96 cores). Specialized accelerators... let's not even talk about it...

Maybe things will change and Apple will start making enterprise machines for AI, but I’m not convinced.

I would be shocked if they do. It's just not their kind of business. Apple sells appliances, not barebones hardware.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
What models has AMD trained on their own hardware? Do you have any links on this?

100% unilaterally? Probably not, but they are likely not outsourcing everything to AWS/Azure/etc. either.

“…
Paul Alcorn: We're starting to see a lot of AI used in chip design, a lot of use of AI and machine learning for certain things like macro placement, ….

Mark Papermaster: We absolutely are. [...] How we think about it internally at AMD is, we need to practice what we preach. So we are applying AI today in chip design. We're using it in 'place and route,' both in how we position and optimize our sub-blocks of each of our chip designs to get more performance and to lower the energy [consumption]. AI does an amazing job of having an infinite appetite to iterate, iterate, and iterate until you have a truly optimal solution. But it's not just iterating; we could do that before. It's iterating and learning. It's looking at what patterns created the most optimal design, and so it's actually speeding the rate of having an optimized layout of your chip design elements, and therefore giving you higher performance and lower energy, just like we're doing with how we optimize our chiplet partitioning and chiplet placement …”

AMD probably isn't using 100% home-grown, "from scratch" chip design software, but an augmented base design model that accounts for AMD-specific objectives/tradeoffs would likely be cost-effective to do.
Nvidia more obviously builds their own DGX hardware, which lets them cobble together internal-only "supercomputers".
AMD is still at an early enough evolutionary stage that they can just work with SuperMicro/Tyan/etc. (large data center, "white-box" vendors). HPE also has somewhat "off the shelf" frameworks to drop MIxxx modules into their supercomputer chassis. If they use AMD CPUs and MIxxx parts at base cost, rolling their own modules gets more affordable than renting. (AMD networking isn't up to Mellanox levels, though.)


I haven't looked at the model footprint sizes for place-and-route closely, but I'm skeptical that it needs to be as large as the "let's build Skynet" attempts. It only needs to do a relatively narrow, fixed task and respond to hyper-domain-specific prompts. AMD is shooting for large-memory models and focusing less on spill-over support for hyper-large models. For example:



That partially offsets not having hyper-high-end switches to scale over a relatively large number of cards. AMD can scale a decent amount with decently large nodes to trim the need for scale-out in a number of contexts.



AMD's Zen 4 and 4c (and also 5 and 5c), where the same baseline design is done twice with two different density objectives, probably get more economical if they get an assist on doing twice as much placement work.
 
  • Like
Reactions: Xiao_Xi

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053

I mean, the CPU, NPU, and GPU are discrete areas on the silicon. They all share the same unified memory, but Apple can do what they want within each of those areas.

Actually no, there are a substantial number of constraints present. Everyone sharing is a double-edged sword. If one of those blocks dramatically increases its bandwidth usage, then the others could get more 'starved' in certain contexts. That's pretty likely one reason why the P and E core clusters are capped on the amount of memory-backend bandwidth they can request. The NPU being put on a QoS (quality of service) limiter would be a very similar limitation. The A-series has a smaller number of GPU cores/units; if those are even bigger bandwidth hogs, then an even higher ratio of those cores versus the others can lead to congestion issues.

Similarly, all the core types are tightly coupled thermally. If one of them uses up a bigger share of the unified thermal budget, another gets more capped. They all have to be relatively balanced.

P.S. Part of the M1 -> M2 GPU uplift was just a better memory backhaul.
 
Last edited:
  • Like
Reactions: tenthousandthings