
Takuro

macrumors 6502a
Original poster
Jun 15, 2009
584
274
So YouTube channel Max Tech has been releasing a pretty comprehensive series of benchmark tests ever since the M1 came out. In their latest test, they put an M1 Mac Mini head-to-head with an Intel Mac Mini with and without a 5700XT eGPU. The results are surprising.


Although there is some performance loss due to the inefficiencies of an eGPU setup, the 5700XT is 12x faster in Geekbench 5 Metal scores than the iGPU of the Intel Core i7 that was tested; by comparison, the 5700XT was only 2-3x faster than the M1 in many synthetic benchmarks. And this is ignoring the tests where the M1 actually came out ahead, which was mostly a matter of it having native codec support.

Bear in mind that the 5700XT is currently the most powerful graphics card offered on any Intel Mac, and things really start to fall into perspective. The M1 Mini is not supposed to compete with it, so it's pretty wild that things are even as close as they are.

Given the vastly higher performance-per-watt that the M1's GPU cores offer today and the highly scalable nature of the architecture, it's starting to look like a distinct possibility that Apple will catch up to or surpass AMD's and Nvidia's mid-tier / upper-mid-tier offerings in the next couple of years, at least for Metal-optimized apps.
 

MrEcted

macrumors regular
Apr 21, 2011
222
473
While the M1's integrated GPU is stupidly good for an iGPU, I don't think it will scale linearly (though I would love to be wrong). I think integrated GPUs will always lag behind dedicated graphics cards, but eventually we'll hit a point of diminishing returns: sure, maybe in 15 years the best dedicated graphics card will play games at 16K 120 FPS and the M20 integrated GPU will "only" play games at 8K 120 FPS, but at that point the resolution is so high that it really doesn't matter, even on big-ass screens.
 

russell_314

macrumors 604
Feb 10, 2019
6,664
10,264
USA
So YouTube channel Max Tech has been releasing a pretty comprehensive series of benchmark tests ever since the M1 came out. In their latest test, they put an M1 Mac Mini head-to-head with an Intel Mac Mini with and without a 5700XT eGPU. The results are surprising.


Although there is some performance loss due to the inefficiencies of an eGPU setup, the 5700XT is 12x faster in Geekbench 5 Metal scores than the iGPU of the Intel Core i7 that was tested; by comparison, the 5700XT was only 2-3x faster than the M1 in many synthetic benchmarks. And this is ignoring the tests where the M1 actually came out ahead, which was mostly a matter of it having native codec support.

Bear in mind that the 5700XT is currently the most powerful graphics card offered on any Intel Mac, and things really start to fall into perspective. The M1 Mini is not supposed to compete with it, so it's pretty wild that things are even as close as they are.

Given the vastly higher performance-per-watt that the M1's GPU cores offer today and the highly scalable nature of the architecture, it's starting to look like a distinct possibility that Apple will catch up to or surpass AMD's and Nvidia's mid-tier / upper-mid-tier offerings in the next couple of years, at least for Metal-optimized apps.
The crazy thing people keep forgetting is that these are base-model Macs. The M1 MacBook Pro replaced the 8th-gen Intel two-port model.
 
  • Like
Reactions: BigMcGuire

Serban55

Suspended
Oct 18, 2020
2,153
4,344
Nothing new here... Apple's chip team is already working on a custom GPU that alone draws around double what the M1's GPU does, around 20-25W, and it will probably go into the iMac / larger MacBook Pro.
Based on that scaling, that GPU will be on par with or better than this 5700XT, again at ~25W.
I can't even imagine what Apple will do in 2-3 years, or for the bigger iMac, where they could easily fit a 60W custom GPU while keeping that iMac cool and quiet.
 

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
I don't think it will scale linearly (though I would love to be wrong)
I doubt it'll scale linearly either, but I could reasonably see Apple matching the 5700XT with their rumored dGPU for their pro machines. By then it'll be a year and a half old?
 

iPadified

macrumors 68020
Apr 25, 2017
2,014
2,257
I doubt it'll scale linearly either, but I could reasonably see Apple matching the 5700XT with their rumored dGPU for their pro machines. By then it'll be a year and a half old?
Why would it not scale linearly? The slope of the line does not need to be exactly 1 (i.e., doubling the number of cores doubles the performance). A core draws a certain number of watts, and then you just multiply by the number of cores. Some overhead cost is expected, however.

What definitely does NOT scale linearly is clock speed versus power draw. That looks more like a logarithmic gain in performance relative to input wattage.

Making a 32-core Apple GPU seems very feasible, and if that draws 40-50W, it is essentially 5700 XT performance in a mobile GPU power bracket. Seems like a clear winner to me.
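For a rough sense of why more cores scale well while more clock does not, here is a back-of-the-envelope sketch (the ~2.6 TFLOPS / ~10 W M1 baseline, the 15% overhead, and the cubic power-vs-frequency model are illustrative assumptions, not measurements):

```
#include <cstdio>
#include <cmath>

// Back-of-the-envelope comparison of "more cores" vs "more clock".
// Assumed baseline (estimates from this thread, not measurements):
// the M1's 8-core GPU at ~2.6 TFLOPS and ~10 W.
int main() {
    const double base_tflops = 2.6, base_watts = 10.0;

    // Option A: 4x the cores at the same clock. Throughput and power scale
    // roughly linearly; add a guessed 15% overhead for fabric/memory.
    double tflops_cores = base_tflops * 4.0;
    double watts_cores  = base_watts * 4.0 * 1.15;

    // Option B: the same cores at 4x the clock (hypothetical). Once voltage has
    // to rise with frequency, power grows roughly with f^3 in this crude model.
    double tflops_clock = base_tflops * 4.0;
    double watts_clock  = base_watts * std::pow(4.0, 3.0);

    std::printf("4x cores: ~%.1f TFLOPS at ~%.0f W\n", tflops_cores, watts_cores);
    std::printf("4x clock: ~%.1f TFLOPS at ~%.0f W\n", tflops_clock, watts_clock);
}
```

Under those assumptions the core-scaling route lands right in the 40-50W bracket, while chasing the same throughput through clock speed alone blows the power budget completely.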
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
While the M1's integrated GPU is stupidly good for an iGPU, I don't think it will scale linearly (though I would love to be wrong).

GPU performance scales pretty much linearly with the number of cores, provided the rest of the system (memory etc.) can keep up. This is a trivial consequence of the fact that typical GPU workloads are embarrassingly parallel. There are diminishing returns, but they only start being visible at very high performance levels. Apple has a lot of headroom here.

I think integrated GPUs will always lag behind dedicated graphics cards,

The way Apple builds these things makes the distinction between dedicated and integrated GPUs entirely meaningless.
 
  • Like
Reactions: wyrdness

JouniS

macrumors 6502a
Nov 22, 2020
638
399
The Radeon Pro 5700XT is an upper midrange previous-generation GPU that uses 130 W. By raw TFLOPS, it should be about 3x more powerful than the M1. The current-generation GPUs from AMD and Nvidia are roughly 2x more efficient, so a current-generation GPU comparable to the M1 should theoretically be in the 20-25 W range.

Has anyone seen figures on how much power the M1 uses at 100% GPU load and minimal CPU load?
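For reference, the raw TFLOPS-per-watt arithmetic behind that comparison looks roughly like this (spec-sheet-style figures as quoted in this thread, approximate rather than measured):

```
#include <cstdio>

// Raw perf-per-watt comparison from approximate spec-sheet / thread figures:
// ~7.7 TFLOPS FP32 at 130 W for the iMac's Radeon Pro 5700 XT, ~2.6 TFLOPS at
// ~10 W for the M1's GPU (the 10 W figure is discussed just below).
int main() {
    struct Gpu { const char* name; double tflops; double watts; };
    const Gpu gpus[] = {
        {"Radeon Pro 5700 XT (iMac)", 7.7, 130.0},
        {"Apple M1 (8-core GPU)",     2.6,  10.0},
    };
    for (const Gpu& g : gpus)
        std::printf("%-26s %4.1f TFLOPS / %5.1f W = %.2f TFLOPS per watt\n",
                    g.name, g.tflops, g.watts, g.tflops / g.watts);
}
```

On those numbers the 5700XT is about 3x the M1 in raw throughput while drawing roughly 13x the power.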
 

Serban55

Suspended
Oct 18, 2020
2,153
4,344
The Radeon Pro 5700XT is an upper midrange previous-generation GPU that uses 130 W. By raw TFLOPS, it should be about 3x more powerful than the M1. The current-generation GPUs from AMD and Nvidia are roughly 2x more efficient, so a current-generation GPU comparable to the M1 should theoretically be in the 20-25 W range.

Has anyone seen figures on how much power the M1 uses at 100% GPU load and minimal CPU load?
Like I said, the M1's GPU at full load draws 10-11W.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
The Radeon Pro 5700XT is an upper midrange previous-generation GPU that uses 130 W. By raw TFLOPS, it should be about 3x more powerful than the M1. The current-generation GPUs from AMD and Nvidia are roughly 2x more efficient, so a current-generation GPU comparable to the M1 should theoretically be in the 20-25 W range.

I am not sure that the 5700XT classifies as previous generation, since AMD has not superseded the Navi 1 chips yet. Navi 2 is very similar internally but covers the ultra high-end instead (from what I understand, it's basically twice the size of Navi 1 plus Infinity Cache and maybe some minor improvements). Nvidia's Ampere advantage over Turing is more difficult to quantify, as they have rebalanced the shader execution, but most of the performance advantage seems to come from faster RAM and the slight increase in the number of compute cores, which has significantly increased power consumption. We have not yet seen how these GPUs would perform in the lower TDP brackets.

Either way, the most energy-efficient low-power GPU on the market is the 35W Nvidia 1650 Ti Max-Q, and it seems that the M1 is very close to it in real-world graphical performance (the Nvidia is still faster in compute microbenchmarks since it has more RAM bandwidth, but I am skeptical that this translates into a real-world advantage, where data is constantly being moved around).

Has anyone seen figures on how much power the M1 uses at 100% GPU load and minimal CPU load?

10 watts.
 

aeronatis

macrumors regular
Sep 9, 2015
198
152
I am not sure that the 5700XT classifies as previous generation, since AMD has not superseded the Navi 1 chips yet. Navi 2 is very similar internally but covers the ultra high-end instead (from what I understand, it's basically twice the size of Navi 1 plus Infinity Cache and maybe some minor improvements). Nvidia's Ampere advantage over Turing is more difficult to quantify, as they have rebalanced the shader execution, but most of the performance advantage seems to come from faster RAM and the slight increase in the number of compute cores, which has significantly increased power consumption. We have not yet seen how these GPUs would perform in the lower TDP brackets.

Either way, the most energy-efficient low-power GPU on the market is the 35W Nvidia 1650 Ti Max-Q, and it seems that the M1 is very close to it in real-world graphical performance (the Nvidia is still faster in compute microbenchmarks since it has more RAM bandwidth, but I am skeptical that this translates into a real-world advantage, where data is constantly being moved around).

I totally agree. The released RDNA2 cards are not direct 5700XT successors; the upcoming 6700 and 6700XT probably will be. Therefore, when a chip for the 16" MacBook Pro or 27" iMac is released, comments like "but it cannot compare to the RTX 3080 / 6800XT" will be irrelevant, and we can safely assume that when benchmarks are equal, real-world performance will only be better on the Apple Silicon chip.

If the rumoured M1X (or whatever it will be called) really has 8 + 4 CPU cores and, let's say, 16 GPU cores along with 32 GB of LPDDR5 unified memory (and maybe also 4 GB of HBM2E tile memory for the GPU), it should have no trouble competing with the likes of the Radeon Pro 5700XT, which basically means 27" iMac levels of performance in a 16" MacBook Pro. It is not impossible for Apple to release such a chip with a TDP of 50 watts or so (still much lower than the combined TDP of a 45-watt Intel CPU + 50-watt AMD graphics), which also means cooler and quieter notebooks. This way, the 16" MacBook Pro will have no direct Windows rival, just as the M1 MacBook Air has no rival in terms of performance per watt.
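A rough power-budget comparison of that scenario, where every "M1X" number is speculation extrapolated from the M1 rather than an announced spec:

```
#include <cstdio>

// Speculative power-budget comparison for a 16" MacBook Pro class machine.
// The Intel/AMD figures are the nominal TDPs mentioned above; the "M1X"
// figure is a guess extrapolated from the M1, not an announced spec.
int main() {
    const double intel_cpu_w = 45.0;   // 45 W Intel CPU
    const double amd_gpu_w   = 50.0;   // 50 W AMD dGPU
    const double m1x_soc_w   = 50.0;   // hypothetical whole-SoC budget

    const double current = intel_cpu_w + amd_gpu_w;
    std::printf("Intel CPU + AMD dGPU: ~%.0f W\n", current);
    std::printf("Hypothetical M1X SoC: ~%.0f W (~%.0f%% of the current budget)\n",
                m1x_soc_w, 100.0 * m1x_soc_w / current);
}
```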
 

Chozes

macrumors member
Oct 27, 2016
75
97
I think we shall see with the next announcements. I don't expect Apple to compete directly with AMD/Nvidia here, but it will likely be the best Mac, GPU-wise. Also with less noise and heat, I assume/hope?

A more exotic RAM setup may be required, or just a dGPU.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Maybe for the iMac Pro and Mac Pro SoCs, Apple could go with two different types of memory: e.g. 8/16/32 GB of HBM mainly used by the GPU and 8/16/24/32/X GB of LPDDR4/5 mainly used by the CPU/NE etc., with both types of memory accessible to the entire SoC thanks to Apple's UMA design.

Not sure if this is even feasible, but if it is, it would probably solve the GPU bandwidth issue while keeping thermals in check and costs lower, and achieve incredible GPU performance.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Maybe for the iMac Pro and Mac Pro SoCs, Apple could go with two different types of memory: e.g. 8/16/32 GB of HBM mainly used by the GPU and 8/16/24/32/X GB of LPDDR4/5 mainly used by the CPU/NE etc., with both types of memory accessible to the entire SoC thanks to Apple's UMA design.

Not sure if this is even feasible, but if it is, it would probably solve the GPU bandwidth issue while keeping thermals in check and costs lower, and achieve incredible GPU performance.

That would, however, negate one of the main advantages of the Apple Silicon architecture: low-latency communication between the CPU, GPU, ML accelerators and other processing units. I think Apple will take the more straightforward (and more expensive) route and just use faster RAM (some sort of multi-channel DDR or maybe even HBM).
 
  • Like
Reactions: BarbaricCo

UBS28

macrumors 68030
Oct 2, 2012
2,893
2,340
The Radeon Pro 5700XT is an upper midrange previous-generation GPU that uses 130 W. By raw TFLOPS, it should be about 3x more powerful than the M1. The current-generation GPUs from AMD and Nvidia are roughly 2x more efficient, so a current-generation GPU comparable to the M1 should theoretically be in the 20-25 W range.

Has anyone seen figures on how much power the M1 uses at 100% GPU load and minimal CPU load?

AMD makes the chips in the $499 machines from Sony and Microsoft that can do 4K gaming @ 120Hz, so I highly doubt an M1 is competitive with AMD GPUs.

It's probably bad OS X drivers.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
AMD makes the chips in the $499 machines from Sony and Microsoft that can do 4K gaming @ 120Hz, so I highly doubt an M1 is competitive with AMD GPUs.

Consoles are sold at a loss. And of course the M1 won't be competitive with a high-end Navi 2; it's a completely different playing field. But when you look at the individual building blocks, Apple GPUs are very competitive. They are much faster than Nvidia or AMD GPUs per unit of power consumed.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
This is not true. The most powerful graphics card offered in any Intel Mac is the Pro Vega II Duo in the Mac Pro.
Yes, I think he meant the most powerful graphics card on a consumer-grade Mac, i.e., equal to the top GPU on the 27" iMac. Of course, as you pointed out, the Mac Pro offers the option of much more powerful cards.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
GPU performance scales pretty much linearly with the number of cores, provided the rest of the system (memory etc.) can keep up. This is a trivial consequence of the fact that typical GPU workloads are embarrassingly parallel. There are diminishing returns, but they only start being visible at very high performance levels. Apple has a lot of headroom here.
Given the type of memory/system performance reasonably available to Apple (given cost and size considerations) for its consumer/prosumer products (i.e., everything except a hypothetical AS Mac Pro or AS iMac Pro), how many GPU cores do you think Apple could have before there's a significant departure from linear scaling (i.e., before diminishing returns are reached)?

I would take that to be the upper limit of the number of cores they could put into Lifuka, which is the GPU rumored to be going into the iMac (and if they offer it there, I assume it will also be available in the upgraded Mini since, for the Intel versions, Apple offers the same top GPU chip in both).

Does the rumor mill have any consensus on how many cores Lifuka will have (e.g., the way people think the most likely CPU core count for the upgrade to the M1 will be 8+4)?

And (given the limited number of AS-native applications currently available), do we know enough about the performance of the M1's GPU to estimate how many cores it would need (again, assuming near-linear scaling) to equal the performance of the fastest "reasonably-priced" ($500 retail, should be much less OEM) NVIDIA consumer-grade desktop GPU (the GeForce RTX 3070)? And what about their top consumer card, the RTX 3090? [Of course, the 3090 has 24 GB of dedicated memory, so Apple's unified memory would need to offer something comparable; though I don't know what "comparable" means, since I don't know if Apple's GPU design requires an equivalent amount of memory to NVIDIA's.]

The broader question I'm really asking with the last para. is this: From what we've seen of the M1, Apple's current design should be scalable to offer CPU performance equal to or exceeding the fastest consumer-grade (say, <= 10 cores) desktop chips, using far less power. But is this design also scalable to compete with the fastest consumer-grade GPUs?
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Given the type of memory/system performance reasonably available to Apple (given cost and size considerations) for its consumer/prosumer products (i.e., everything except a hypothetical AS Mac Pro or AS iMac Pro), how many GPU cores do you think Apple could have before there's a significant departure from linear scaling (i.e., before diminishing returns are reached)?

That's a good question. I have absolutely no clue about hardware design, so please bear with me :) As far as I understand it, it kind of boils down to three big factors:

  • die size (large dies are progressively more expensive due to diminishing yields)
  • memory subsystem (Apple GPUs are bandwidth-efficient, but they still need bandwidth to feed all those GPU cores)
  • internal fabric that connects the processing units on the chip itself
To make things simpler, let's just assume that the fabric can be scaled up without issues (after all, these are pretty much "solved" engineering challenges). The M1 die size is reported to be around 120mm2, which is really tiny when you compare it to other chips. Navi GPUs in the 16" MBP are much bigger at 158mm2 and 251mm2 (5600M Pro), and modern high-end GPUs like Navi 2 (536 mm2) and the absolutely beastly Ampere (826 mm2) are humongous in comparison. Looking at these numbers, I'd say that Apple could relatively comfortably triple or even quadruple the die area while still making it commercially viable for higher-end products. Memory is also a pretty much "solved" problem — there is HBM2 that offers high bandwidth, or Apple could go for something more esoteric like "DIY HBM" with multiple memory controllers and LPDDR. The M1 currently has two memory controllers for a combined bandwidth of around 68 GB/s. Apple could probably quadruple that quite easily to around 270 GB/s (using 8 controllers and stacked RAM modules), which would be on par with GDDR6, but they probably won't even need that much bandwidth to get good performance since there is still quite a lot of spatial overlap in the GPU work.

Based on this, I believe that we can conservatively expect up to 32 cores using the same technology Apple uses with M1. They could definitely go higher, but then the chip size becomes unwieldy and they would probably need to find some more creative solutions (e.g. chiplets like what AMD does).

What would 32 M1 GPU cores give us? 4096 ALUs and around 10 TFLOPS of FP32 compute power at a peak TDP of around 40-50 watts (for the GPU alone). That's... quite good, to say the least (comparable to an RTX 2080 or Vega 56 on paper).

And (given the limited number of AS-native applications currently available), do we know enough about the performance of the M1's GPU to estimate how many cores it would need (again, assuming near-linear scaling) to equal the performance of the fastest "reasonably-priced" ($500 retail, should be much less OEM) NVIDIA consumer-grade desktop GPU (the GeForce RTX 3070)? And what about their top consumer card, the RTX 3090? [Of course, the 3090 has 24 GB of dedicated memory, so Apple's unified memory would need to offer something comparable; though I don't know what "comparable" means, since I don't know if Apple's GPU design requires an equivalent amount of memory to NVIDIA's.]

Performance comparisons are a bit tricky, since Apple does things differently. They don't do very well in popular compute benchmarks (for example, in GB5 compute the 1650 Max-Q is twice as fast despite an identical FLOPS spec), most likely because they lack the memory bandwidth of dedicated GDDR6. At the same time, they are excellent in graphics, as TBDR allows their GPUs to do more work with fewer resources — across various graphical and gaming benchmarks, the M1 is basically 50-70% of the 5500M Pro. Also, in real-world compute workflows that involve small to mid-sized work packages, Apple GPUs will perform much better — unified memory drastically reduces the latency of scheduling GPU work, so you get more effective GPU time (with traditional dGPUs, a lion's share of the time is spent transferring data over the PCIe bus).

So your question can be answered in different ways depending on which metric one chooses. If we go by peak FP32 compute throughput (which is a dumb way IMO), an RTX 3090 is listed at 35.58 TFLOPS, so you would need a whopping 110 Apple GPU cores to match that on paper. Of course, Ampere peak FLOPS is mostly a marketing stunt (the real-world throughput is much lower), and I assume that 80 M1 GPU cores should be able to do the job. But it's still a huge number. If we look at graphical performance instead, extrapolated benchmarks put the 3090 somewhere around 8-10 times faster than the M1, so again, an 80-core Apple GPU should do the job, provided a fast enough memory interface.

Basically, in more practical terms, I see no reason why a 64-core Apple GPU should not be able to compete with high-end dGPUs (the likes of the RTX 3070 or Radeon RX 6800) at a much lower power consumption. The question is whether Apple is at all interested in building GPUs of that size. It's not at all cheap, and Apple's needs for content-creation performance will probably be covered by smaller GPUs.
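To make the extrapolation explicit, here is the per-core arithmetic as a short sketch; the 128 ALUs per core, ~1.28 GHz clock and 2 FLOPs per ALU per cycle are public M1 estimates, not Apple-confirmed figures:

```
#include <cstdio>
#include <cmath>

// Per-core extrapolation used above. Assumed M1 baseline: 128 FP32 ALUs per
// GPU core, ~1.28 GHz, 1 FMA (= 2 FLOPs) per ALU per cycle.
int main() {
    const double alus_per_core = 128.0, clock_ghz = 1.28, flops_per_cycle = 2.0;
    const double tflops_per_core = alus_per_core * clock_ghz * flops_per_cycle / 1000.0;

    const int cores = 32;
    std::printf("%d cores: %.0f ALUs, ~%.1f TFLOPS FP32 peak\n",
                cores, cores * alus_per_core, cores * tflops_per_core);

    // How many such cores to match an RTX 3090's paper figure of 35.58 TFLOPS?
    const double rtx3090_tflops = 35.58;
    std::printf("Cores to match %.2f TFLOPS on paper: ~%.0f\n",
                rtx3090_tflops, std::ceil(rtx3090_tflops / tflops_per_core));
}
```

That gives roughly 4096 ALUs and ~10.5 TFLOPS at 32 cores, and around 110 cores to match the 3090's paper number, consistent with the figures above.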
 
  • Love
Reactions: theorist9

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
That's a good question. I have absolutely no clue about hardware design, so please bear with me :) As far as I understand it, it kind of boils down to three big factors:

  • die size (large dies are progressively more expensive due to diminishing yields)
  • memory subsystem (Apple GPUs are bandwidth-efficient, but they still need bandwidth to feed all those GPU cores)
  • internal fabric that connects the processing units on the chip itself
To make things simpler, let's just assume that the fabric can be scaled up without issues (after all, these are pretty much "solved" engineering challenges). The M1 die size is reported to be around 120mm2, which is really tiny when you compare it to other chips. Navi GPUs in the 16" MBP are much bigger at 158mm2 and 251mm2 (5600M Pro), and modern high-end GPUs like Navi 2 (536 mm2) and the absolutely beastly Ampere (826 mm2) are humongous in comparison. Looking at these numbers, I'd say that Apple could relatively comfortably triple or even quadruple the die area while still making it commercially viable for higher-end products. Memory is also a pretty much "solved" problem — there is HBM2 that offers high bandwidth, or Apple could go for something more esoteric like "DIY HBM" with multiple memory controllers and LPDDR. The M1 currently has two memory controllers for a combined bandwidth of around 68 GB/s. Apple could probably quadruple that quite easily to around 270 GB/s (using 8 controllers and stacked RAM modules), which would be on par with GDDR6, but they probably won't even need that much bandwidth to get good performance since there is still quite a lot of spatial overlap in the GPU work.

Based on this, I believe that we can conservatively expect up to 32 cores using the same technology Apple uses with M1. They could definitely go higher, but then the chip size becomes unwieldy and they would probably need to find some more creative solutions (e.g. chiplets like what AMD does).

What would 32 M1 GPU cores give us? 4096 ALUs and around 10 TFLOPS of FP32 compute power at a peak TDP of around 40-50 watts (for the GPU alone). That's... quite good, to say the least (comparable to an RTX 2080 or Vega 56 on paper).



Performance comparisons are a bit tricky, since Apple does things differently. They don't do very well in popular compute benchmarks (for example, in GB5 compute the 1650 Max-Q is twice as fast despite an identical FLOPS spec), most likely because they lack the memory bandwidth of dedicated GDDR6. At the same time, they are excellent in graphics, as TBDR allows their GPUs to do more work with fewer resources — across various graphical and gaming benchmarks, the M1 is basically 50-70% of the 5500M Pro. Also, in real-world compute workflows that involve small to mid-sized work packages, Apple GPUs will perform much better — unified memory drastically reduces the latency of scheduling GPU work, so you get more effective GPU time (with traditional dGPUs, a lion's share of the time is spent transferring data over the PCIe bus).

So your question can be answered in different ways depending on which metric one chooses. If we go by peak FP32 compute throughput (which is a dumb way IMO), an RTX 3090 is listed at 35.58 TFLOPS, so you would need a whopping 110 Apple GPU cores to match that on paper. Of course, Ampere peak FLOPS is mostly a marketing stunt (the real-world throughput is much lower), and I assume that 80 M1 GPU cores should be able to do the job. But it's still a huge number. If we look at graphical performance instead, extrapolated benchmarks put the 3090 somewhere around 8-10 times faster than the M1, so again, an 80-core Apple GPU should do the job, provided a fast enough memory interface.

Basically, in more practical terms, I see no reason why a 64-core Apple GPU should not be able to compete with high-end dGPUs (the likes of the RTX 3070 or Radeon RX 6800) at a much lower power consumption. The question is whether Apple is at all interested in building GPUs of that size. It's not at all cheap, and Apple's needs for content-creation performance will probably be covered by smaller GPUs.
Serendipitously, this came out today from Bloomberg (I assume you've seen it):

"For later in 2021 or potentially 2022, Apple is working on pricier graphics upgrades with 64 and 128 dedicated cores aimed at its highest-end machines, the people said."

I'm guessing that if they produced a 128-core GPU, it would be pro-video-focused (video processing, FP64-capable), since that's the current market for the Mac Pro, and thus designed to compete with the Pro Vega II Duo or Quadro GV100 rather than the consumer-focused RTX 3090 (video-game enhancements, FP32-focused) or the A100 (pure compute, FP64-focused, no video output).

How do you think a 128-core Apple GPU would stack up against the Pro Vega II Duo or Quadro GV100 for video-processing workloads?

And where does the current M1 GPU fall on the spectrum between consumer-focused and pro-video-focused? I.e., would they need not only more cores but also a significant change in design to produce a pro-focused GPU?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
I'm guessing that if they produced a 128-core GPU, it would be pro-video-focused (video processing, FP64-capable), since that's the current market for the Mac Pro, and thus designed to compete with the Pro Vega II Duo or Quadro GV100 rather than the consumer-focused RTX 3090 (video-game enhancements, FP32-focused) or the A100 (pure compute, FP64-focused, no video output).

How do you think a 128-core Apple GPU would stack up against the Pro Vega II Duo or Quadro GV100 for video-processing workloads?

A 128-core Apple GPU based on M1 cores would have 16,384 FP32 ALUs and around 30-40 TFLOPS of peak FP32 throughput. That would be quite a beastly GPU.
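The derivation behind those figures, assuming M1-style cores (128 FP32 ALUs each, one FMA per ALU per cycle) and a guessed clock range, since a chip that large may well run below the M1's reported ~1.28 GHz:

```
#include <cstdio>

// 128 cores x 128 FP32 ALUs, 2 FLOPs per ALU per cycle (FMA), at a few
// plausible clock speeds. The clock values are assumptions for illustration.
int main() {
    const int cores = 128, alus_per_core = 128;
    const int alus = cores * alus_per_core;              // 16384
    const double clocks_ghz[] = {0.9, 1.1, 1.28};
    for (double ghz : clocks_ghz) {
        double tflops = alus * 2.0 * ghz / 1000.0;       // peak FP32
        std::printf("%d ALUs @ %.2f GHz -> ~%.1f TFLOPS FP32 peak\n",
                    alus, ghz, tflops);
    }
}
```

That spans roughly 29-42 TFLOPS depending on the clock, which is where the 30-40 TFLOPS estimate comes from.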

And where does the current M1 GPU fall on the spectrum between consumer-focused and pro-video-focused? I.e., would they need not only more cores but also a significant change in design to produce a pro-focused GPU?

I suppose it all depends on what you require of a "pro GPU". I am a bit hazy on the concept myself. Traditional workstation GPUs used to have more stable hardware (slower clock and memory) and unlocked some software and hardware features. Contemporary pro GPUs actually see some hardware changes: they balance execution resources differently, they come with HBM2 on the high end (which is better suited for complex compute work), and they have better FP64 support. They also include GPUs that focus on virtualization support for virtual machines.

Frankly, I would argue that Apple GPUs are "pro" by design. Unified memory allows zero-latency work synchronization between the CPU and the GPU, which is a huge thing. Apple uses TBDR, which means that they don't need as many dedicated texture and ROP units (frankly, I am not even sure that Apple has hardware ROPs at all — they could just use a compute shader targeting tile memory for shading) and can allocate more of their silicon budget to compute units. I mean, the M1 already gives you 1024 FP32 ALUs (each capable of an FP32 multiply+add per cycle) at 10 watts of GPU power, which is kind of scandalous — even if the relatively slow memory bandwidth of the M1 doesn't allow it to perform to its peak capabilities in compute benchmarks.

At the same time, the M1's GPU compute units seem to be much simpler than the current ones in Nvidia and AMD GPUs. Their FP16 and FP32 throughput rates appear to be identical (https://www.realworldtech.com/forum/?threadid=197759&curpostid=197855) and they don't have FP64 support (the latter is conspicuously absent from Metal altogether). To be honest, I don't see that much value for native FP64 processing on the GPU. It is only needed for some scientific computation (you mention video processing, where I don't see any use for FP64 at all), and it does make GPU hardware more complicated. Current GPU native FP64 support is abysmal anyway (1/32 of FP32 for Ampere and 1/16 of FP32 for Navi 2), so frankly.... why bother at all? If your GPU is 16 times slower running FP64 calculations anyway, you are better off implementing extended precision using the float-float technique or something comparable. It will probably be as fast or faster in practice. Metal is based on C++, so you have templates and everything else you need to do these things in practice. Basically, I wouldn't be surprised if Apple never implements FP64 — it makes more sense to focus on the universally useful FP32 case. I'd take a 10% improvement in FP32 throughput over 1/16 native FP64 support any day.
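For anyone curious what the float-float idea looks like in code, here is a minimal C++ sketch of a simplified ("sloppy") double-single addition based on the classic Knuth TwoSum; the same code translates almost directly to Metal since MSL is C++-based. It is an illustration, not production code, and it must be compiled without fast-math so the compiler keeps the rounding-error terms:

```
#include <cstdio>

// Float-float ("double-single") arithmetic sketch. A value is represented as
// an unevaluated sum hi + lo of two floats, giving roughly 48 bits of
// significand. Compile without -ffast-math, or the error terms get optimized away.
struct ff { float hi, lo; };

// Knuth's TwoSum: s = fl(a + b) and e = the exact rounding error, so a + b == s + e.
static ff two_sum(float a, float b) {
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);
    return {s, e};
}

// Simplified ("sloppy") float-float addition with a final renormalization step.
static ff ff_add(ff x, ff y) {
    ff s = two_sum(x.hi, y.hi);
    float lo = s.lo + x.lo + y.lo;   // accumulate the low-order parts
    return two_sum(s.hi, lo);        // renormalize so |lo| stays tiny relative to hi
}

int main() {
    ff a{1.0f, 1e-9f};   // ~1.000000001, more precision than a single float holds
    ff b{3.0f, -2e-9f};  // ~2.999999998
    ff c = ff_add(a, b);
    std::printf("hi = %.9g, lo = %.9g\n", c.hi, c.lo);  // ~4 and ~-1e-9
}
```

A float-float value carries roughly 48 bits of significand, which is often enough in the situations where FP64 would otherwise be reached for.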
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
To be honest, I don't see that much value for native FP64 processing on the GPU. It is only needed for some scientific computation (you mention video processing, where I don't see any use for FP64 at all), and it does make GPU hardware more complicated.
Thanks for the correction—I thought certain types of video processing could benefit from FP64, but you pointed out that's not the case.

Given that, what you wrote about FP64 not being useful for Apple's target market does make sense.

Though I would challenge one side-point:
Current GPU native FP64 support is abysmal anyway (1/32 of FP32 for Ampere and 1/16 of FP32 for Navi 2), so frankly.... why bother at all? If your GPU is 16 times slower running FP64 calculations anyway, you are better off implementing extended precision using the float-float technique or something comparable. It will probably be as fast or faster in practice.

You'll recall some consumer NVIDIA cards had FP64 capability, but to keep their business customers from purchasing the consumer cards to save money (if the job is embarrassingly parallel, they could get the same performance for less money by buying a greater number of consumer cards), NVIDIA restricted ("locked") their FP64 performance. This strategy wouldn't have worked if those customers didn't value FP64. So clearly a significant number of their industrial/scientific customers doing DP calculations find that implementing it in hardware is better (and these are very sophisticated customers who presumably know what they're doing). Yes, a 16-fold slow-down is huge. But, then again, implementing calculations in software rather than in hardware is also much slower. Even if it's a wash, I suspect running in hardware is better, since it simplifies the coding.

[With Mathematica, for instance, you can implement calculations in software to arbitrarily high precision; but this is much slower than doing machine-precision calculations. Interestingly, Mathematica also allows infinite-precision calculations, in which all numbers are kept as exact, symbolic representations.]

Also, can you explain where you're getting your 32:1 FP32:FP64 ratio for Ampere? From what I'm seeing in their spec sheet, it's 19.5/9.7 = 2:1.

Yes, you can get a maximum of 312/19.5 = 16:1, from their "Tensor Core" specification (and that's only when you are able to get the maximum 2-fold speed-up from sparsity). But, according to https://www.theregister.com/2017/05/10/nvidia_gtc_roundup/ :

We wondered what a Tensor TFLOPS is. A spokesperson explained a Tensor floating-point operation is a "mixed-precision FP16/FP32 multiply-accumulate" calculation. "Tensor Cores add new specialized mixed-precision FMA operations that are separate from the existing FP32 and FP16 math operations," we're told.
And, more specifically, according to the embedded video here: https://www.engadget.com/nvidia-dgx-station-a100-80gb-tensor-core-gpu-announcement-140027589.html, FP32 Tensor Core gives the range of 32-bit but only the precision of 16-bit. So you can't use the Tensor Core numbers to compare FP32 and FP64. The actual difference really is just 2:1.

[Attached image: NVIDIA A100 spec-sheet screenshot]
 

Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,675
You'll recall some consumer NVIDIA cards had FP64 capability, but to keep their business customers from purchasing the consumer cards to save money, NVIDIA restricted their FP64 performance. This strategy wouldn't have worked if those customers didn't value FP64. So clearly a significant number of their industrial/scientific customers doing DP calculations find that implementing it in hardware is better (and these are very sophisticated customers who presumably know what they're doing). Yes, a 16-fold slow-down is huge. But, then again, implementing calculations in software rather than in hardware is also much slower. Even if it's a wash, I suspect running in hardware is better, since it simplifies the coding.

Also, it appears the FP64:FP32 ratio for Ampere is a maximum of 19.5/312 = 1/16 rather than 1/32:

Thanks for pointing it out! It's a bit confusing since Nvidia seems to have multiple Ampere architectures, which seem to be different? The new A6000 is based on the consumer GPUs with lower FP64. I forgot about the earlier pro-level Ampere :)

The A100 GPUs are kind of interesting because they are designed for datacenter operation... basically supercomputing and virtualization. That's a completely different beast.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
Thanks for pointing it out! It's a bit confusing since Nvidia seems to have multiple Ampere architectures, which seem to be different? The new A6000 is based on the consumer GPUs with lower FP64. I forgot about the earlier pro-level Ampere :)

The A100 GPUs are kind of interesting because they are designed for datacenter operation... basically supercomputing and virtualization. That's a completely different beast.
I just edited my post -- it seems like the A100's FP32:FP64 TFLOPS ratio is actually 2:1.
 