
Zest28

macrumors 68030
Jul 11, 2022
2,581
3,933
We are talking about the enterprise market and you bring up how Apple builds laptops? By this logic, Intel is unsuitable for the enterprise because some Intel-based laptops have soldered-on storage. Don't be ridiculous.



I hope you're not suggesting running cloud-based gaming services on Apple hardware, because otherwise I don't see the relevance of this post.

The Mac Studio is a desktop. So far, Apple's business strategy has been to sell non-upgradable and non-repairable systems.

Intel and AMD do have repairable and upgradable offerings, so how does your logic actually work?

This example has nothing to do with gaming; it's about the limitations and how expensive Apple is. I can give another example: if you need more RAM than the machine you originally bought has, you can add more with AMD or Intel, no problem. With Apple, you have to buy a whole new, super expensive machine again.

And the non-repairability of Apple hardware is also very expensive, as you again have to buy a whole new system if one component breaks.
 
Last edited:

Oculus Mentis

macrumors regular
Original poster
Sep 26, 2018
144
163
UK
Apple has always been a company for consumers. That's why they focus so intensively on power efficiency, something that actually matters.

That is true, but at this point in time around 39% of a data center’s energy consumption goes into powering cooling and ventilation systems. Chip energy efficiency is of paramount importance, and a dedicated server chip built around ARM's acknowledged energy efficiency would be cheaper to run when deployed at scale.

The M1 Max (which I use) is no better than a GTX 1080 Ti from 5 years ago.

I'm not sure how you tested your use case, but for certain computations, according to a scientific study at the University of Calgary on solving viscoelastic wave-propagation equations, the M1 was faster than an A6000 (initially only with OpenCL, but then also under Metal).
I'm not sure how niche this particular use case is, but details seem available to those more expert than me.
 
  • Like
Reactions: Xiao_Xi

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Much greater than the M1 Max, but not even an AMD 7700 XT (200 W, 20 TFLOPS) would work. It is on the same N5 node as the M1. Their 20 TFLOPS GPU consumes 2-3x as much energy as Apple's (70 W sustained, 96 W fluctuations), even though RDNA 3 is the most power-efficient PC architecture.
Apple had a roughly 2x power efficiency advantage against the previous generation AMD and NVIDIA GPUs when running at the same clock rate. In this generation, the advantage should be smaller.

That 2-3x difference in power consumption you see is not because AMD GPUs are that inefficient but because AMD cares more about another efficiency metric. AMD is likely constrained by the number of wafers it can buy. In order to sell more devices, it makes smaller dies, packs them full, and clocks them higher. For example, the total die area of the 7900 XTX is 520 mm2 – about 20% bigger than the M1 Max – but AMD supposedly gets 60 TFLOPS out of that.
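To put numbers on the two metrics, here's a rough sketch using the figures quoted in this thread (the M1 Max GPU's ~10 TFLOPS and ~430 mm^2 die are my own round numbers, not from this post):

```python
# Two different efficiency metrics, using the figures quoted in this thread.
# The M1 Max's ~10 TFLOPS and ~430 mm^2 die are my own round numbers.

# name: (TFLOPS, watts, die area in mm^2)
chips = {
    "RX 7700 XT": (20.0, 200.0, None),    # die area not quoted in the thread
    "RX 7900 XTX": (60.0, None, 520.0),   # board power not quoted in the thread
    "M1 Max": (10.0, 70.0, 430.0),
}

for name, (tflops, watts, mm2) in chips.items():
    per_watt = f"{tflops / watts:.2f} TFLOPS/W" if watts else "n/a"
    per_area = f"{tflops / mm2 * 100:.1f} TFLOPS per 100 mm^2" if mm2 else "n/a"
    print(f"{name:12s} perf/W: {per_watt:14s} perf/area: {per_area}")

# Apple wins clearly on perf/W; AMD wins clearly on perf/area, which is the
# metric that matters when you are wafer-constrained.
```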
 
Last edited:

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
For example, the total die area of the 7900 XTX is 520 mm2 – about 20% bigger than the M1 Max – but AMD supposedly gets 60 TFLOPS out of that.
The comparison is apples to oranges though. One die is purely GPU cores, while the other is an SoC.
 
  • Like
Reactions: maflynn

Philip Turner

macrumors regular
Dec 7, 2021
170
111
My bigger problem is that the largest AMD chips do have watts/TFLOPS comparable to the M1 Max, but for smaller chips they decrease power at a slower rate than they decrease FLOPS. For Apple, by comparison, power scales linearly with size, arriving at an iPhone with 1/8 the power and 1/8 the performance. In the graph below, they control power allocation at sub-core granularity; the lowest data point is a single vector ALU using 800 mW.

[Graph: M1 Max power vs. performance]

If you are happy with your MBP for your "HPC" scientific computing, we do not share the same definition of HPC
High performance computing (HPC) is "the ability to process data and perform complex calculations at high speeds". A modern GPU is capable of as many calculations as 100 CPU cores, making it a mini-supercomputer. Especially in Folding@Home, which can utilize thousands of volunteer GPUs for simulations. A personal computer is also more accessible to an average scientist than a supercomputer.
 
  • Like
Reactions: Oculus Mentis

JouniS

macrumors 6502a
Nov 22, 2020
638
399
The comparison is apples to oranges though. One die is purely GPU cores, while the other is an SoC.
The M1 Max is effectively a GPU with a SoC on the side, so the comparison is quite appropriate. Definitely more appropriate than comparing the power efficiency of Apple GPUs to AMD/NVIDIA desktop GPUs. As I mentioned, power efficiency comparisons should be between GPUs running at the same clock rate, which means AMD/NVIDIA laptop GPUs.
 

maflynn

macrumors Haswell
May 3, 2009
73,682
43,740
The M1 Max is effectively a GPU with a SoC on the side
That really doesn't make sense???

SoC = System on a Chip, so of course saying the GPU is part of the SoC is accurate, not that it's a GPU with an SoC on the side.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
That really doesn't make sense???

SoC = System on a Chip, so of course saying the GPU is part of the SoC is accurate, not that it's a GPU with an SoC on the side.
Most of the die area is used by components that you would find in a dedicated GPU. CPU cores, SSD controllers, and things like that are only a small fraction of the total.
 

maflynn

macrumors Haswell
May 3, 2009
73,682
43,740
Most of the die area is used by components that you would find in a dedicated GPU. CPU cores, SSD controllers, and things like that are only a small fraction of the total.
Yet, it's still an SoC, not a GPU.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Yet, it's still an SoC, not a GPU.

You are arguing about labels. M1 Max is indeed more similar to a GPU than to other laptop SoCs (where the majority of the area is dedicated to the CPU cluster).

If you absolutely insist on using a proper label though, let's call it a console chip :) Compare the die shots of the M1 Max above with this one of the Xbox Series X

[Die shot: Microsoft Xbox Series X]
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
As I mentioned, power efficiency comparisons should be between GPUs running at the same clock rate, which means AMD/NVIDIA laptop GPUs.
At that point, all recent GPUs are pretty much the same. Per core, you have 256-384 KB of register file, 128 FP32 ALUs, and some variation in SFU capabilities. The biggest difference is that Apple underclocks by a factor of 2, optimizes the scheduler (to a fault) to maximize power efficiency, and uses less shared memory.
If you absolutely insist on using a proper label though, let's call it a console chip :) Compare the die shots of the M1 Max above with this one of the Xbox Series X
That's actually a great comparison to the M1 Max. The Xbox also needs a tight power budget for real-world performance. To be most accurate w.r.t. power, you'd use the Series S. However it's on an older process node, which cancels out the power difference. Series X would be best.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
@Xiao_Xi eyeballing your picture, the GPU itself consumes 1/4 of the total area. On the Xbox Series X, it consumes slightly over 1/2.

For example, the total die area of the 7900 XTX is 520 mm2 – about 20% bigger than the M1 Max – but AMD supposedly gets 60 TFLOPS out of that.
Extrapolating this: the Apple M1 Max is 400 mm^2, and 1/4 of that is 100 mm^2 for 10 TFLOPS. With AMD, the GPU part is about 300 mm^2, which is 100 mm^2 per 20 TFLOPS. If you divide by the difference in clock speed, both GPUs have 100 mm^2 per 10 TFLOPS per 1250 MHz.
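The same normalization as a quick sketch (the die-area fractions and the ~1250/~2500 MHz clocks are rough assumptions on my part):

```python
# Normalizing GPU compute density by clock, per the estimate above.
# GPU-area fractions and the ~1250 / ~2500 MHz clocks are rough assumptions.

def density(tflops, gpu_area_mm2, clock_mhz, ref_clock_mhz=1250.0):
    """TFLOPS per 100 mm^2 of GPU area, rescaled to a common reference clock."""
    return tflops / gpu_area_mm2 * 100.0 * (ref_clock_mhz / clock_mhz)

m1_max   = density(tflops=10.0, gpu_area_mm2=0.25 * 400.0, clock_mhz=1250.0)
xtx_7900 = density(tflops=60.0, gpu_area_mm2=300.0,        clock_mhz=2500.0)

print(f"M1 Max:   {m1_max:.1f} TFLOPS per 100 mm^2 at 1250 MHz")    # 10.0
print(f"7900 XTX: {xtx_7900:.1f} TFLOPS per 100 mm^2 at 1250 MHz")  # 10.0
```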

Also, kind of strange, but the M1 Max GPU could fit on your fingernail.
 
Last edited:
  • Like
Reactions: quarkysg

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
If you divide by the difference in clock speed, both GPUs have 100 mm^2 per 10 TFLOPS per 1250 MHz.
It would appear, then, that the ALUs are already as efficient as the designers can make them. I wonder if the tricks done for CPUs to increase IPC would be applicable to GPUs?
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I wonder if the tricks done for CPUs to increase IPC would be applicable to GPUs?
CPU cores have mostly converged on 16-32 FP32 IPC. Apple P-cores use 4x 4-wide vector ALUs, most Intel chips use 2x 8-wide (and Intel dropped AVX-512, which was 2x 16-wide). AMD seemingly jumped to 4x 8-wide, and that's probably what Intel's P-cores use now. With SVE, I'd expect Apple to someday settle on one of the following configs. I can't cite a source, but 8 independent instructions is the theoretical upper limit of OoOE.
  • 6x 4-wide, 3x 8-wide, 2x 12-wide
  • 8x 4-wide, 4x 8-wide, 2x 16-wide
For GPU cores, Nvidia was the first to upgrade from 64 to 128 FP32 IPC. Then it was Apple, and finally AMD. However, regarding schedulers, it's actually 4x (and/or dual 2x) 32-wide vector ALUs. Just like with CPUs: 4x some kind of vector instruction per clock.
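For reference, the per-clock FP32 lane counts implied by those configurations (counting FMA as two FLOPs is my convention; the "future Apple" rows are just the speculative configs above, not anything announced):

```python
# Per-clock FP32 lane count for the configurations mentioned above:
# (independent vector instructions per clock) x (FP32 lanes per vector ALU).

configs = [
    ("Apple P-core (today)",   4,  4),
    ("Typical Intel core",     2,  8),
    ("AVX-512 (dropped)",      2, 16),
    ("AMD (seemingly)",        4,  8),
    ("Future Apple, option A", 6,  4),   # = 3x8 = 2x12
    ("Future Apple, option B", 8,  4),   # = 4x8 = 2x16
    ("Apple GPU core (G13)",   4, 32),
]

for name, issue, width in configs:
    lanes = issue * width
    # Counting FMA as 2 FLOPs, peak FLOPs/clock is twice the lane count.
    print(f"{name:24s} {issue}x {width}-wide = {lanes:3d} FP32 lanes/clock "
          f"({2 * lanes} FLOPs/clock with FMA)")
```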
 
Last edited:

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
I can't cite a source, but 8 independent instructions is the theoretical upper limit of OoOE.
If 8 is the magic number, it could be because the most complicated ALU operation completes in at most 8 clock cycles, so going any wider would have no benefit.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
If 8 is the magic number, it could be because the most complicated ALU operation completes in at most 8 clock cycles, so going any wider would have no benefit.
Clock cycles don't have much to do with it. FP32 operations usually take 4 clock cycles. That means the M1 has 64 concurrent scalar pipelines, 16 concurrent scalar ALUs, and 4 concurrent vector ALUs. It's just combinatorial complexity, like the circuitry difference between 2^4 and 2^8. That's why 2^6 might be a better option for Apple. And something that can harness ARM SVE's (128, 256, 384)-wide vector sizes.

Regarding GPU, the M1 has 512 concurrent scalar FP32 pipelines @ 4 cycles, and 128 concurrent scalar IMUL32 pipelines @ 4 cycles. That translates to 128 FP32 IPC and 32 IMUL32/C. Further, 4x FP32 vector IPC and 1x vector IMUL32/C.

So we've already far surpassed the limit regarding scalar IPC (16x). The concern is use cases where the code isn't pre-optimized for vector execution: the vector pipelines are forced to run just one scalar each, making it 4x. I doubt that OoOE can coalesce parallel scalar instructions into vector instructions, so it's the circuitry cost of 4x OoOE.
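The bookkeeping, made explicit (assuming the 4-cycle ALU latency from this post):

```python
# The relation used above: with pipelined ALUs, the number of operations
# in flight is (operations issued per clock) x (pipeline latency in clocks).

LATENCY = 4  # clock cycles for an FP32 op, per this post

def in_flight(ipc, latency=LATENCY):
    return ipc * latency

# M1 P-core (CPU): 4 vector ALUs x 4 lanes = 16 scalar FP32 IPC
cpu_scalar_ipc = 4 * 4
print("P-core:  ", in_flight(cpu_scalar_ipc), "concurrent scalar FP32 ops")   # 64

# M1 GPU core: 4 vector ALUs x 32 lanes = 128 scalar FP32 IPC
gpu_fp32_ipc = 4 * 32
gpu_imul_ipc = 1 * 32   # roughly 1/4 the FP32 rate
print("GPU core:", in_flight(gpu_fp32_ipc), "concurrent FP32 ops,",
      in_flight(gpu_imul_ipc), "concurrent IMUL32 ops")                       # 512, 128
```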
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
You should add the media and neural engines, as these elements are included in PC GPUs.
That's still only another 50 mm^2, regarding the silicon actually usable by the BIOS. So it jumps to 3/8 of the chip, and my order-of-magnitude comparison still holds. We can go more specific, but a careful analysis shows:
  • Register file and shared memory are SRAM, the more power-efficient version consuming 6 transistors per bit. These components are not 3D stacked, something extremely difficult with electronic (but not molecular mechanical) computers.
  • The bulk of transistors go to general-purpose FP32 ALUs, which may also perform bitwise operations or IADD32. Nvidia is different with smaller dedicated I32 pipelines but larger FP16 tensor cores.
  • Scheduling logic and instruction cache take up other circuitry. You could argue that this matters more than FP32 ALUs, because it's cheaper to dispatch instructions more efficiently than to create more underutilized hardware.
Perhaps we can quantify the size of register file, then give a ratio of ALU/scheduling logic to it. Apple has a quite massive register file (384 KB), a size that AMD recently upgraded to. Then normalize the ALU/scheduling based on clock speeds. Everyone uses the same basic building blocks on TSMC N5, but rearranges based on whether they prioritize ALU utilization (power efficiency) or absolute performance.
  • Divide the absolute number of transistors by 140 million/mm^2.
  • SRAM on M1 Max core: 384 KB (registers) + 64 KB (shared memory) + 12 KB (L1I) + 12 KB (L1D) = 472 KB. Multiply by 6 transistors (27 million), 32 cores, divide by 140 M.
  • 0.6 mm^2 for the entire GPU - what????
According to Wikipedia, a 16-bit multiplier is 9000 transistors and a 32-bit multiplier is 21000 T. Square-interpolate that to a 24-bit multiplier being 14000 T. The Apple GPU has 4 concurrent FMUL and 1 IMUL per ALU, which would be 77000 T.
  • Minimum ALU transistors per GPU core: 128 ALUs * 77,000 T ≈ 9.9 million.
  • Repeat the process above to sum the area.
  • Our new estimate is 3 mm^2 but the actual size is 100 mm^2. WTF.
I give up.
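(For anyone who wants to check the order of magnitude anyway, here's the multiplier/ALU part of the estimate as a quick script. Reading "square-interpolate" as a geometric mean is my assumption; it lands on roughly the same ~14,000 figure.)

```python
# The multiplier/ALU part of the estimate above, as a script.
# Multiplier sizes are the Wikipedia figures quoted in the post.

TRANSISTORS_PER_MM2 = 140e6        # assumed TSMC N5 density, from the post

mul16 = 9_000                      # transistors in a 16-bit multiplier
mul32 = 21_000                     # transistors in a 32-bit multiplier
mul24 = (mul16 * mul32) ** 0.5     # ~13,700 for a 24-bit (FP32 mantissa) multiplier

alu_t  = 4 * mul24 + mul32         # 4 concurrent FMUL + 1 IMUL per ALU, ~76,000 T
core_t = 128 * alu_t               # 128 ALUs per GPU core, ~9.7 million T
gpu_t  = 32 * core_t               # 32 GPU cores on the M1 Max

print(f"per ALU:   {alu_t:,.0f} transistors")
print(f"per core:  {core_t / 1e6:.1f} M transistors")
print(f"GPU ALUs:  {gpu_t / TRANSISTORS_PER_MM2:.1f} mm^2")   # ~2.2 mm^2
# ...versus roughly 100 mm^2 of actual GPU area, hence the reaction above.
```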
 
  • Like
Reactions: Xiao_Xi

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Jk.

Now I'll try to predict the die area of ANE and AMX. Treat the FP16 matrix module (ANE) as a 16-bit multiplier (it's actually smaller) and the FP32 matrix module (AMX) as a 32-bit multiplier (same issue as before). Each AMX block takes 4 cycles to perform a 32x32 FP32 outer product-accumulate, totalling 2048 FLOPs. Perhaps each ANE core has 4 concurrent 32x32 FP16 pipelines @ 335 MHz, totalling 11 TFLOPS. This is quite reasonable because GPUs and CPUs use 4x 4-cycle schemes for ALUs. The AMX also uses arrays of 4 cores, but it's more complicated. Alternatively, the ANE has one multiplier/core running @ 1340 MHz (just like the GPU). If I were Apple, I'd do the same. This further corroborates how each ANE core looks like 9000/21000 of an AMX core on the die shot. I'll use the latter hypothesis.
  • 1024 FP16 multipliers at 9,000 transistors/multiplier, 16 cores/ANE becomes 1 mm^2.
  • 8 (P cores) and 2 (E cores) AMX multipliers, like the ANE but with 21,000 transistors. 1.5 mm^2.
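In numbers, a quick sketch of that arithmetic (reading "8 (P cores) and 2 (E cores)" as 10 AMX multiplier blocks is my guess, and the 140 M transistors/mm^2 density carries over from the earlier post):

```python
# Rough replay of the ANE/AMX area estimates above.

TRANSISTORS_PER_MM2 = 140e6

def area_mm2(multipliers_per_block, transistors_per_mul, blocks):
    return multipliers_per_block * transistors_per_mul * blocks / TRANSISTORS_PER_MM2

# ANE: 16 cores, 1024 FP16 multipliers each, ~9,000 T per 16-bit multiplier
ane = area_mm2(1024, 9_000, 16)
# AMX: 10 blocks (8 P + 2 E), 1024 FP32 multipliers each, ~21,000 T per 32-bit multiplier
amx = area_mm2(1024, 21_000, 10)

print(f"ANE multiplier array: ~{ane:.1f} mm^2")   # ~1 mm^2
print(f"AMX multiplier array: ~{amx:.1f} mm^2")   # ~1.5 mm^2
```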
If you look closely and sum the rectangular modules in the lower ANE, they would fit in a 1/10 by 1/10 fragment of the GPU. Likewise with all the AMX cores combined. Matrix multiplication is the simplest compute-dense task you can create. However, GPUs are more dynamic and general-purpose. Further along the tangent, we have a P-CPU core with 1/8 the GPU's performance/clock while consuming the same area. The more general-purpose something is, the less compute-dense and more overhead-y it is.

It's very plausible that the biggest transistor cost isn't raw compute power or SRAM, for any GPU architecture.

Side note: Look at those 8 dark spots inside each pair of GPU cores. I imagine they're 8-bank register files, just like how SLC (SRAM) is dark. I could visually pack all of them into ~2% of the GPU area, so my first prediction (0.6 mm^2) wasn't far off.
 
Last edited:
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
If Apple were to create SoCs for data centers, their competition would be AMD MI300, not AWS Graviton/AMD Epyc, as those chips are for web server workloads and not HPC/AI workloads.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
If Apple were to create SoCs for data centers, their competition would be AMD MI300, not AWS Graviton/AMD Epyc, as those chips are for web server workloads and not HPC/AI workloads.
The AMX cores have 2.0 TFLOPS FP32 and 0.5 TFLOPS FP64 matrix power, appropriate for HPC. Scale that from 1.5 mm^2 to 400 mm^2 and you have 500 TFLOPS FP32, 130 TFLOPS FP64. Compare that to the Hopper GPU's 1000 TFLOPS FP8 or the MI250X's 50 TFLOPS FP64.
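The extrapolation spelled out (a naive linear scaling of ALU area, ignoring interconnect, memory, power delivery and everything else a real chip needs):

```python
# Scaling the AMX figures above from an estimated ~1.5 mm^2 footprint
# to a hypothetical 400 mm^2 of matrix units. Purely linear extrapolation.

amx_area_mm2    = 1.5
amx_fp32_tflops = 2.0
amx_fp64_tflops = 0.5
target_area_mm2 = 400.0

scale = target_area_mm2 / amx_area_mm2
print(f"FP32: ~{amx_fp32_tflops * scale:.0f} TFLOPS")   # ~533
print(f"FP64: ~{amx_fp64_tflops * scale:.0f} TFLOPS")   # ~133
```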
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
I wonder if the tricks done for CPUs to increase IPC would be applicable to GPUs?

Not really. These are very different devices with different requirements. CPUs can afford to be big and complex per core because they need to run a single thread as quickly as possible, so you have multiple execution units and some fairly advanced logic that tracks instruction dependencies and executes multiple instructions in parallel and out of order. A CPU can, for example, start executing some conditional code before the condition has been resolved.

GPUs instead need to be as area- and power-efficient as possible: you want to cram as many execution units as possible into the smallest area. But they also need to run many thousands of threads at the same time. So GPUs are optimised for throughput (run as many operations as possible) instead of thread latency. This leads to a simpler in-order design (no instruction reordering, limited dependency tracking) to minimise the necessary logic, but with the ability to quickly switch between execution streams.


At that point, all recent GPUs are pretty much the same. Per core, you have 256-384 KB of register file, 128 FP32 ALUs, and some variation in SFU capabilities.

I wouldn't say that. There are fairly big differences. Apple GPUs are among the simplest: they dedicate one ALU per thread and schedule one instruction per clock per thread. Recent AMD GPUs have two ALUs per thread and can encode up to two ALU operations per instruction (a form of VLIW). Nvidia GPUs similarly have two ALUs per thread but use a more sophisticated instruction scheduler that can issue instructions to either ALU per clock. Combined with some execution trickery (their ALUs are 16-wide and not 32-wide, but execute 32-wide data operations in two steps), to the user it essentially looks like Nvidia schedules two instructions per clock.

BTW, a recent Apple patent suggests that Apple's next GPU (the one planned for the A16, really) will be able to schedule instructions from two separate programs in an SMT-like fashion.

Regarding GPU, the M1 has 512 concurrent scalar FP32 pipelines @ 4 cycles, and 128 concurrent scalar IMUL32 pipelines @ 4 cycles. That translates to 128 FP32 IPC and 32 IMUL32/C. Further, 4x FP32 vector IPC and 1x vector IMUL32/C.

Not quite sure how your numbers fit together?

From what I know, each Apple G13 GPU core contains 4x 32-wide ALUs; that's 1024 ALU lanes per M1 GPU in total. It can issue an FP32 instruction per clock to each pipelined ALU, so that's 1024 FP32 operations per clock for the M1 GPU (the latency will obviously be longer, but I have no idea how to measure that on a GPU). Integer multiplication will obviously be slower; I assume you've measured that, and a 1/4 throughput rate sounds plausible. But the integer addition and subtraction issue rate is the same as FP32.
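Putting those figures together (the ~1.3 GHz GPU clock is borrowed from earlier in the thread, and counting FMA as two FLOPs is my convention):

```python
# Peak FP32 throughput implied by the figures above: 4 x 32-wide ALUs per
# core, 8 cores in the M1 GPU, one FP32 instruction issued per ALU per clock.

cores          = 8
lanes_per_core = 4 * 32          # 128 FP32 lanes per core
clock_ghz      = 1.3             # rough figure from earlier in the thread
flops_per_lane = 2               # counting FMA as two FLOPs

tflops = cores * lanes_per_core * flops_per_lane * clock_ghz / 1e3
print(f"M1 GPU peak: ~{tflops:.1f} TFLOPS FP32")   # ~2.7 TFLOPS
```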
 
  • Like
Reactions: Xiao_Xi

leman

macrumors Core
Oct 14, 2008
19,521
19,674
The AMX cores have 2.0 TFLOPS FP32 and 0.5 TFLOPS FP64 matrix power, appropriate for HPC. Scale that from 1.5 mm^2 to 400 mm^2 and you have 500 TFLOPS FP32, 130 TFLOPS FP64. Compare that to the Hopper GPU's 1000 TFLOPS FP8 or the MI250X's 50 TFLOPS FP64.

Not to mention that adding reduced-precision formats to AMX could yield a huge boost in perf/area for Apple. They are still computing these operations using FP32/FP16, whereas Nvidia uses lower precision (their TF32 format, for example, is comparable to FP16 in ALU requirements, if I remember correctly).
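A rough sketch of why that is, assuming multiplier area grows roughly with the square of the mantissa width (a rule of thumb, not a die measurement):

```python
# TF32 keeps FP32's exponent range but only an FP16-sized mantissa, and the
# multiplier is the part whose cost grows roughly quadratically with width.

formats = {
    #        (exponent bits, mantissa bits excluding the implicit leading 1)
    "FP32": (8, 23),
    "TF32": (8, 10),
    "FP16": (5, 10),
}

fp32_mantissa = formats["FP32"][1] + 1  # include the implicit bit
for name, (_, mant) in formats.items():
    rel = ((mant + 1) / fp32_mantissa) ** 2
    print(f"{name}: multiplier ~{rel:.2f}x the size of an FP32 multiplier")
# TF32 and FP16 both land around 0.2x, which is the point being made above.
```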
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
If Apple were to create SoCs for data centers, their competition would be AMD MI300, not AWS Graviton/AMD Epyc, as those chips are for web server workloads and not HPC/AI workloads.
I didn't realize how big the Ultra is. The MI300 (146B transistors) uses 30% more transistors than the top-level Ultra (114B transistors). A 2x Ultra SoC would be huge.
 