Wonder if the tricks used on CPUs to increase IPC would be applicable to GPUs?
Not really. These are very different devices with different requirements. CPUs can afford to be big and complex per core because they need to run a single thread as quickly as possible: you have multiple execution units and some fairly advanced logic that tracks instruction dependencies and executes multiple instructions in parallel and out of order. For example, part of a CPU can start speculatively executing conditional code before the condition has even been resolved.
GPUs instead need to be as area- and power-efficient as possible: you want to cram as many execution units as possible into the smallest area. But they also need to run many thousands of threads at the same time. So GPUs are optimised for throughput (run as many operations as possible per unit of time) rather than for the latency of any single thread. This leads to a simpler in-order design (no instruction reordering, limited dependency tracking) to minimise the necessary logic, but with the ability to switch between execution streams very quickly to hide latency (a rough sketch of that idea follows below).
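To make the latency-hiding point concrete, here's a toy, purely illustrative C++ model (not how any real GPU is built): an in-order scheduler keeps many threads in flight and simply issues from whichever thread is ready while the others are waiting on a long-latency operation. With enough threads the issue slot stays busy every clock, even though each individual thread still sees the full operation latency. The thread counts, latency, and instruction counts are all made up for illustration.

```cpp
// Toy model of GPU-style latency hiding: many threads in flight,
// an in-order scheduler that issues from whichever thread is ready.
// Purely illustrative; the numbers and structure are made up.
#include <cstdio>
#include <vector>

struct Thread {
    int remaining_ops = 16;  // instructions left to execute
    int ready_at = 0;        // cycle at which the next instruction may issue
};

int main() {
    const int kLatency = 4;  // cycles before a dependent instruction may issue
    for (int num_threads : {1, 8}) {
        std::vector<Thread> threads(num_threads);
        int cycle = 0, issued = 0;
        const int total = num_threads * 16;
        while (issued < total) {
            // One issue slot per cycle: pick the first thread that is ready.
            for (auto& t : threads) {
                if (t.remaining_ops > 0 && t.ready_at <= cycle) {
                    --t.remaining_ops;
                    ++issued;
                    t.ready_at = cycle + kLatency;
                    break;
                }
            }
            ++cycle;
        }
        std::printf("%d thread(s): %d ops in %d cycles -> IPC %.2f\n",
                    num_threads, total, cycle, double(total) / cycle);
    }
    return 0;
}
```

With a single thread the IPC is roughly 1/latency; with enough threads to cover the latency it approaches 1.0, which is exactly the trade the GPU makes: per-thread latency is sacrificed for aggregate throughput.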
At this point, all recent GPUs are pretty much the same. Per core, you have 256-384 KB of register file, 128 FP32 ALUs, and some variation in SFU capabilities.
I wouldn't say that. There are fairly big differences. Apple GPUs are among the simplest: they dedicate one ALU per thread and schedule one instruction per clock per thread. Recent AMD GPUs have two ALUs per thread and can encode up to two ALU operations in a single instruction (a form of VLIW). Nvidia GPUs similarly have two ALUs per thread but use a more sophisticated instruction scheduler that can issue an instruction to either ALU each clock. Combined with some execution trickery (their ALUs are 16-wide rather than 32-wide, but execute 32-wide operations in two steps), to the user it essentially looks like Nvidia schedules two instructions per clock. A toy illustration of the dual-issue idea follows.
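As a loose illustration of why dual issue doesn't automatically double throughput (this is a toy model with an invented instruction format, not AMD's or Nvidia's actual encoding or scheduling rules, and it ignores execution latency): two adjacent ALU operations can only go out in the same clock if the second one doesn't depend on the first, so the achievable rate depends on how much instruction-level parallelism the program exposes.

```cpp
// Toy dual-issue model: count cycles to run a short instruction stream
// when (a) one op issues per clock and (b) two adjacent independent ops
// may pair. Invented instruction format; not any real GPU ISA.
#include <cstdio>
#include <vector>

struct Op { int dst, src0, src1; };  // dst = f(src0, src1), registers by index

bool independent(const Op& a, const Op& b) {
    // b may pair with a only if it doesn't read a's result
    // and doesn't write the same destination register.
    return b.src0 != a.dst && b.src1 != a.dst && b.dst != a.dst;
}

int cycles_single_issue(const std::vector<Op>& ops) {
    return static_cast<int>(ops.size());  // one op per clock, in order
}

int cycles_dual_issue(const std::vector<Op>& ops) {
    int cycles = 0;
    size_t i = 0;
    while (i < ops.size()) {
        // Two adjacent ops issue in the same clock only if independent.
        if (i + 1 < ops.size() && independent(ops[i], ops[i + 1]))
            i += 2;
        else
            i += 1;
        ++cycles;
    }
    return cycles;
}

int main() {
    // r2 = r0+r1; r3 = r0*r1; r4 = r2+r3; r5 = r0-r1;  (made-up stream)
    std::vector<Op> ops = {{2, 0, 1}, {3, 0, 1}, {4, 2, 3}, {5, 0, 1}};
    std::printf("single issue: %d cycles, dual issue: %d cycles\n",
                cycles_single_issue(ops), cycles_dual_issue(ops));
    return 0;
}
```

In the AMD-style case the pairing has to be found by the compiler and baked into the instruction encoding; in the Nvidia-style case the hardware scheduler makes the choice at issue time.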
BTW, a recent Apple patent suggests that Apple's next GPU (the one planned for the A16, really) will be able to schedule instructions from two separate programs in an SMT-like fashion.
Regarding the GPU, the M1 has 512 concurrent scalar FP32 pipelines @ 4 cycles, and 128 concurrent scalar IMUL32 pipelines @ 4 cycles. That translates to 128 FP32 IPC and 32 IMUL32/C. Further, 4x FP32 vector IPC and 1x vector IMUL32/C.
Not quite sure how your numbers fit together?
From what I know, each Apple G13 GPU core contains 4x 32-wide ALUs, so that's 1024 ALUs for the M1 GPU in total (8 cores). It can issue one FP32 instruction per clock to each pipelined ALU, so that's 1024 FP32 operations per clock for the M1 GPU (the latency of each operation will obviously be longer, but I have no idea how to measure that on a GPU). Integer multiplication will obviously be slower; I assume you've measured that, and a 1/4 throughput rate sounds plausible. But the integer addition and subtraction issue rate is the same as FP32.
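For what it's worth, the two sets of numbers seem compatible if the "512 concurrent FP32 pipelines @ 4 cycles" figure is read as operations in flight per core (that reading is my assumption): 512 in flight divided by a 4-cycle pipeline gives 128 FP32 issues per clock per core, which matches 4 ALUs x 32 lanes, and 1024 per clock across 8 cores. A quick back-of-the-envelope check in C++, using the commonly reported ~1.28 GHz M1 GPU clock as an assumption:

```cpp
// Back-of-the-envelope reconciliation of the numbers above.
// The "in-flight ops / pipeline depth" reading of the 512-pipeline figure
// is an assumption; the clock speed is the commonly reported ~1.28 GHz.
#include <cstdio>

int main() {
    const int cores          = 8;   // M1 GPU cores
    const int alus_per_core  = 4;   // 32-wide ALUs per core (G13)
    const int lanes_per_alu  = 32;
    const int pipeline_depth = 4;   // cycles, as measured above

    int fp32_per_clock_core = alus_per_core * lanes_per_alu;         // 128
    int fp32_per_clock_gpu  = fp32_per_clock_core * cores;           // 1024
    int in_flight_per_core  = fp32_per_clock_core * pipeline_depth;  // 512

    double clock_ghz = 1.278;  // approximate M1 GPU clock (assumption)
    // Counting an FMA as two FLOPs gives the familiar ~2.6 TFLOPS figure.
    double tflops = fp32_per_clock_gpu * 2 * clock_ghz / 1000.0;

    std::printf("FP32 issues/clock: %d per core, %d per GPU\n",
                fp32_per_clock_core, fp32_per_clock_gpu);
    std::printf("Ops in flight per core at 4-cycle latency: %d\n",
                in_flight_per_core);
    std::printf("Peak FMA throughput: ~%.1f TFLOPS\n", tflops);
    return 0;
}
```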