
Philip Turner

macrumors regular
Dec 7, 2021
170
111
I wouldn't say that. There are fairly big differences. Apple GPUs are among the simplest: they dedicate one ALU per thread and schedule one instruction per clock per thread. Recent AMD GPUs have two ALUs per thread and can encode up to two ALU operations per instruction (a form of VLIW). Nvidia GPUs similarly have two ALUs per thread but use a more sophisticated instruction scheduler that can issue instructions to either ALU each clock. Combined with some execution trickery (their ALUs are 16-wide rather than 32-wide, but execute 32-wide operations in two steps), to the user it essentially looks like Nvidia schedules two instructions per clock.
Could you look at the graphs in the ALU bottlenecks and ALU layout sections of metal-benchmarks? I tried explaining this away with register cache thrashing, but it didn't hold up. The only coherent explanation is that their schedulers have three modes (rough issue-rate arithmetic after the list):
  • 3x single-dispatch (16-bit only) with ~2/3 peak IPC instead of 3/4
  • 2x dual-dispatch (16-bit, 32-bit) but not actually dual-dispatching 32-bit
  • 1x quadruple-dispatch (32-bit only)
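
Rough arithmetic for what each mode would issue per cycle, assuming 4 FP pipelines per ALU. This is just a toy model of the hypothesis above, not measured hardware behavior:

Code:
// Toy model of the three hypothesized scheduler modes (my guesses, not measured hardware).
struct SchedulerMode {
    let name: String
    let schedulers: Int     // schedulers active in this mode
    let dispatchWidth: Int  // instructions each one issues per cycle
}

let pipelinesPerALU = 4
let modes = [
    SchedulerMode(name: "3x single-dispatch (16-bit)",    schedulers: 3, dispatchWidth: 1),
    SchedulerMode(name: "2x dual-dispatch (16/32-bit)",   schedulers: 2, dispatchWidth: 2),
    SchedulerMode(name: "1x quadruple-dispatch (32-bit)", schedulers: 1, dispatchWidth: 4),
]

for mode in modes {
    let issued = mode.schedulers * mode.dispatchWidth            // instructions per cycle
    let ipcPerPipeline = Double(issued) / Double(pipelinesPerALU)
    print("\(mode.name): \(issued)/cycle, \(ipcPerPipeline) IPC per pipeline")
}
// The first mode tops out at 3/4 IPC per pipeline in theory; I measure ~2/3.
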
BTW, a recent Apple patent suggests that Apple's next GPU (really the one planned for the A16) will be able to schedule instructions from two separate programs in an SMT-like fashion.
Source? People apply for patents years before they're granted. That means this technology is probably already integrated into Apple GPUs. With the M1 Max, each GPU core can run code from 3 separate programs simultaneously. With the A15, that number is 4. Not sure whether it's an A/M or M1/M2 difference. Does anyone have an M2 they could test?
Not quite sure how your numbers fit together?
Everything I mentioned was per core. In the entire GPU, the base M1 has 4096 actual scalar FP32 pipelines @ 4 cycles, 1024 effective scalar FP32 pipelines @ 1 cycle. In the entire P-CPU block, the base M1 has 256 actual scalar FP32 pipelines @ 4 cycles, 64 effective scalar FP32 pipelines @ 1 cycle.
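
To show how those figures fall out (a quick sketch using the usual public figures for the base M1 and the 4-cycle FMA latency mentioned above):

Code:
// Back-of-the-envelope for the numbers above (base M1, 4-cycle FMA latency).
let gpuCores = 8
let lanesPerGPUCore = 128            // 4 × 32-wide ALUs per GPU core
let pCores = 4
let fp32LanesPerPCore = 4 * 4        // 4 × 128-bit NEON pipes × 4 FP32 lanes each

let fmaLatency = 4                   // pipeline depth in cycles

let gpuEffective = gpuCores * lanesPerGPUCore        // 1024 FP32 results per clock
let gpuActual    = gpuEffective * fmaLatency         // 4096 in-flight pipeline slots
let cpuEffective = pCores * fp32LanesPerPCore        // 64 FP32 results per clock
let cpuActual    = cpuEffective * fmaLatency         // 256 in-flight pipeline slots

print(gpuActual, gpuEffective, cpuActual, cpuEffective)  // 4096 1024 256 64
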
It can issue one FP32 instruction per clock to each pipelined ALU, so that's 1024 FP32 ops per clock for the M1 GPU (the latency will obviously be longer, but I have no idea how to measure that on a GPU). Integer multiplication will obviously be slower; I assume you've measured that, and a 1/4 throughput rate sounds plausible. But the integer addition and subtraction issue rate is the same as FP32.
I measured latency while assuming 4x 32-bit single-dispatch, a hypothesis proven wrong. It's physically impossible to do an FMA in 2 cycles. This assumption should mess with measurements for an "FP32, integer and conditional pipeline" but not for an "integer and complex pipeline".
 
Last edited:
  • Like
Reactions: Xiao_Xi

Fomalhaut

macrumors 68000
Oct 6, 2020
1,993
1,724
Then, of course, one can already do just that (use cloud computing resources from the coffee shop) using a laptop from any other vendor but Apple. The Mac's strong feature is integration with the iPhone, not cloud or enterprise applications.
Why does it need to be a machine from "any other vendor but Apple"? AFAIK, the remote access protocols for cloud instances (of whatever type) are all supported by macOS. I connect to Azure and AWS instances in various ways all the time from my Mac.
 

Fomalhaut

macrumors 68000
Oct 6, 2020
1,993
1,724
So why does the Mac Pro even exist?
Connecting to powerful remote machines doesn't handle many of the use cases that local workstations are designed for.

Cloud computing is game-changing in many fields, but falls short where you need high bandwidth for storage, memory, graphics or network traffic. Remember that you are connected to this "super machine" by an Internet connection that is probably asymmetric (much slower upload) and orders of magnitude slower than the buses on a local machine.

Consider the following scenario:

You want to move a terabyte of video files to your editing machine.

On your local machine, plug in a USB 3.2 or Thunderbolt drive, copy the files at 1000-3000 MB/s, then connect to your editing machine and either edit from the external drive or copy to the local SSD at the same speed. At 1000 MB/s it's a lengthy copy of 1000 seconds (nearly 17 minutes for each copy). It would take a little longer over a 10 Gbps local network. A fast Thunderbolt drive might do it in closer to 5 minutes.

Now do the same over your 5G data network in the coffee shop (or at home with a typical asymmetric connection). You might get 50 Mbps (megabits per second) upload speed. This is 160x slower than your 1000 MB/s disk, so your upload now takes 160,000 seconds, over 44 hours. Your coffee gets cold...

Most home ISPs will typically max out at less than 500Mb/s (approx 60MB/s) upload speeds and 1000Mb/s downloads, and the FCC still considers anything above 3Mbps upload to be "broadband".
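
Putting those numbers side by side (a quick sketch; the link speeds are just the example figures from this post):

Code:
// Rough transfer times for moving 1 TB of video files over the links above.
let payloadBytes = 1e12                                  // 1 TB (decimal)

let links: [(name: String, bytesPerSecond: Double)] = [
    ("Local SSD / USB 3.2 @ 1000 MB/s", 1000e6),
    ("Fast Thunderbolt drive @ 3000 MB/s", 3000e6),
    ("Home ISP upload @ 500 Mb/s", 500e6 / 8),
    ("5G upload in the coffee shop @ 50 Mb/s", 50e6 / 8),
]

for link in links {
    let seconds = payloadBytes / link.bytesPerSecond
    print("\(link.name): \(Int(seconds)) s, \(seconds / 3600) h")
}
// 1000 MB/s -> 1,000 s (~17 min); 50 Mb/s -> 160,000 s (~44 h)
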


However fast the remote machine may be, if you have to push a lot of data to it, do some processing, and then get this data back to your local machine, then you will be limited by the network.

This is why companies like AWS have services like Snowball, which couriers a plug-in storage array to the remote data centre. It's simply faster to do this than to use any available network. I remember an old IT saying: "never underestimate the bandwidth of a station wagon full of magnetic tapes hurtling down the highway".
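
For fun, that station wagon's effective bandwidth (the capacity and transit time here are illustrative guesses, not Snowball specs):

Code:
// Effective bandwidth of physically shipping storage ("sneakernet").
// Capacity and transit time are illustrative guesses, not Snowball specs.
let capacityTB = 80.0        // assumed device capacity
let transitHours = 48.0      // assumed door-to-door courier time

let bits = capacityTB * 1e12 * 8
let seconds = transitHours * 3600
print("≈ \(bits / seconds / 1e9) Gb/s sustained")   // ≈ 3.7 Gb/s
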


The same applies if you have to work with a lot of graphical data that needs to be sent from CPU->GPU->screen, e.g. for 3-D modelling. Even if you let the remote machine do the rendering and use something like Remote Desktop or VNC to transmit only the screen updates, it's not a great user experience compared to a local machine with a fast GPU. I know there are streaming gaming services, but I haven't heard of anyone claiming that these are better than having a nice gaming rig in your house.

Cloud computing is great where you have lots of complex services living in the backend on one or more servers in data centre(s) and relatively thin client applications that don't need to send a lot of data back and forth. Most users are enterprises that access these systems via web apps, API calls and sometimes virtual desktops, and only need to upload or download relatively small amounts of data: web pages, some modestly sized files, and highly compressed virtual desktop graphics.
 
Last edited:
  • Like
Reactions: singhs.apps

Yebubbleman

macrumors 603
May 20, 2010
6,024
2,616
Los Angeles, CA
Now that Apple has proven its capability in designing and producing silicon for its own products, and with Intel struggling on the innovation front, wouldn't it make sense for Apple to enter the enterprise data center chip market for AI and HPC?

Just like Nvidia, Apple could design CPUs and GPUs to be produced by TSMC, then deploy them in cloud HPC data centres of their own or sell them to AMZN, Meta, GOOGL, etc.

Warren Buffett announced a few weeks ago that he's investing in TSMC; since he's already a large Apple shareholder, I do wonder if he knows something is moving in that direction…

Apple certainly has the resources and the technology to make a big splash in the AI and HPC markets.
They could license things in their silicon, but I sincerely doubt they'd sell their silicon to any other vendor. Plus, the thing to consider is that their silicon is designed around macOS, and macOS (at least the Apple Silicon versions) is designed around their silicon. Their silicon is not designed around Windows for ARM64, nor is it designed around Linux. Could you tool Windows for ARM64 to run on Apple Silicon? I'd assume so, but I don't have enough information to say 100% for sure; parts of it would probably need to be adapted. Could you tool Linux to run on Apple Silicon? The Asahi Linux folks are already hard at work on just that! But it's still a massive endeavor.

I think Apple wouldn't stand to lose anything by licensing out some of their tech, but you'll never see something like an M1 Max or M1 Ultra powering a blade in a datacenter somewhere...at least, not unless said blade was made by Apple (in some sort of second-coming-of-the-Xserve kind of way).
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Everything I mentioned was per core. In the entire GPU, the base M1 has 4096 actual scalar FP32 pipelines @ 4 cycles, 1024 effective scalar FP32 pipelines @ 1 cycle. In the entire P-CPU block, the base M1 has 256 actual scalar FP32 pipelines @ 4 cycles, 64 effective scalar FP32 pipelines @ 1 cycle.

Ah, so you count pipeline stages rather than ALUs themselves.

Could you look at the graphs in the ALU bottlenecks and ALU layout sections of metal-benchmarks? I tried explaining this away with register cache thrashing, but it didn't hold up. The only coherent explanation is that their schedulers have three modes:
  • 3x single-dispatch (16-bit only) with ~2/3 peak IPC instead of 3/4
  • 2x dual-dispatch (16-bit, 32-bit) but not actually dual-dispatching 32-bit
  • 1x quadruple-dispatch (32-bit only)

I measured latency while assuming 4x 32-bit single-dispatch, a hypothesis proven wrong. It's physically impossible to do an FMA in 2 cycles. This assumption should mess with measurements for an "FP32, integer and conditional pipeline" but not for an "integer and complex pipeline".

I'll need to spend some more time looking at your code and trying to understand the methodology and implications. When I did my own testing a while back, it was much more naive and much less comprehensive than your approach.

Quick comment: I am not sure I agree with the conclusion that there are separate FP16 and FP32 pipelines. My impression is that the ALU always runs computations at FP32. The discrepancy in FP16:FP32 rate between A14 and M1 could be explained by A14 throttling FP32 execution to decrease power usage of memory transfers.

Source? People apply for patents years before they're granted. That means this technology is probably already integrated into Apple GPUs. With the M1 Max, each GPU core can run code from 3 separate programs simultaneously. With the A15, that number is 4. Not sure whether it's an A/M or M1/M2 difference. Does anyone have an M2 they could test?

The particular patent I am referring to is this: https://patents.google.com/patent/US11422822B2/en

It seems that Apple somehow times patent publications to hardware releases. For example, this patent describes the new shift and fill instructions present in A15/M2. It was submitted a couple of years ago, but the publication came just a few months prior to the A15 announcement.
 
Last edited:

k27

macrumors 6502
Jan 23, 2018
330
419
Europe
Apple does not fit into the enterprise, and neither do Macs. Apple is unfortunately more of a toy company.

Everything at Apple is opaque. There are no roadmaps. Companies don't like that, because almost nothing can be planned.
At Apple there are no clear statements about how long a piece of software will be supported, and Apple changes APIs from one day to the next, etc.
Apple only maintains the current version of macOS. But when a new macOS comes out, the new version often still has problems, so companies want to stay on the old version. Yet Apple does not maintain the old macOS version properly, and serious bugs and security holes can remain in it.
This also annoys me as a private person.

Apple: "not all known security issues are addressed in previous versions (for example, macOS 12)."

"Mentioned in @eryeh's writeup (https://blog.google/threat-analysis-group/analyzing-watering-hole-campaign-using-macos-exploits/), this wasn’t patched for Catalina until Sept 23. NOT mentioned: This was 🚨234 days‼️ after #Apple patched the same vuln for Big Sur. 🤯 @Apple, randomly choosing which vulns you patch for 2 prior #macOS endangers customers."

Microsoft and reputable Linux distributors do much better in this regard.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I'll need to spend some more time looking at your code and trying to understand the methodology and implications.
My technique was to force certain composite numbers of threadgroups, by carefully correlating register usage and threadgroup memory usage with performance. Then you deactivate certain simds in the threadgroup and extrapolate throughput. The numbers were different when operating on 16-bit and 32-bit registers. Once, for 32-bit FMA/IMAD/CMPSEL, a flaw made 4 simds actually behave like 2-3 simds. I noticed this and corrected the cycles from 4/FMA to 2/FMA, although not for CMPSEL.
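
Roughly like this on the host side (a simplified sketch of the occupancy-forcing idea, not the actual metal-benchmarks code; the kernel name, the 32 KB per-core limit, and the dispatch sizes are placeholders):

Code:
import Metal

// Sketch: force a chosen occupancy by over-allocating threadgroup memory,
// so only `targetGroupsPerCore` threadgroups fit on a core at once.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "alu_bottleneck_kernel")!)

let threadgroupMemoryPerCore = 32 * 1024            // assumed per-core limit, bytes
let targetGroupsPerCore = 3                         // the occupancy being forced
let memoryPerGroup = (threadgroupMemoryPerCore / targetGroupsPerCore) / 16 * 16

let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
// The kernel is assumed to declare a threadgroup buffer at [[threadgroup(0)]].
encoder.setThreadgroupMemoryLength(memoryPerGroup, index: 0)
encoder.dispatchThreadgroups(
    MTLSize(width: 10_000, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Time the dispatch, then repeat with some simds in each threadgroup
// early-exiting and extrapolate throughput from the difference.
let gpuTime = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
print("GPU time: \(gpuTime) s")
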

One thing I never tried was making more resident registers. I was cycling between 24 scalars in the register file, so perhaps 1/2/3/4 were special prime multiples and the register cache was involved. A better test might re-tune everything for 8, 15, 16 registers and see whether the patterns still hold.

I am not sure I agree with the conclusion that there are separate FP16 and FP32 pipelines
I saw an old PowerVR diagram where one ALU had 4 F16, 2 F32, and one complex pipeline. There could have been separate pipelines in the A14, but definitely not the M1. However the scheduler seems to issue 16-bit instructions more efficiently than 32-bit.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
I saw an old PowerVR diagram where one ALU had 4 F16, 2 F32, and one complex pipeline.

You mean something like what's described in this document? From what I understand, Apple has retained much of the fixed-function stuff from PowerVR, but their shader cores are completely custom. PowerVR from that era seems to use some sort of instruction bundles where different slots in a bundle control different stages in the shading unit, which is very different from Apple's straightforward in-order wide SIMT processor.


There could have been separate pipelines in the A14, but definitely not the M1.

True, although I think this is unlikely, since A14 and M1 GPU cores look indistinguishable on die shots and use the same ISA. Hence the hypothesis that A14 and M1 are the same, but A14 is rate-throttled.

However the scheduler seems to issue 16-bit instructions more efficiently than 32-bit.

Correct me if I'm wrong, but that's when there is high register pressure, right? Maybe what you are observing is the effect of the register file spilling into memory or at least outside the register cache... In very simple scenarios there is no difference in issue rate between FP32 and FP16 operations at least.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Correct me if I'm wrong, but that's when there is high register pressure, right? Maybe what you are observing is the effect of the register file spilling into memory or at least outside the register cache... In very simple scenarios there is no difference in issue rate between FP32 and FP16 operations at least.
No, this is absolutely minimal register pressure. Nothing is spilling to memory. Spilling only occurs when the maximum threadgroup size is 384, and here it's 1024 (technically ~1300-1500). It could be the register cache, but again, is there a coherent explanation for how the register cache causes the observed behavior?
True, although I think this is unlikely, since A14 and M1 GPU cores look indistinguishable on die shots and use the same ISA. Hence the hypothesis that A14 and M1 are the same, but A14 is rate-throttled.
Potentially not. Apple's M1 released months after the A14. This gave more breathing room to redesign the later SoC, in response to Nvidia's decision to make their FP32 throughput on par with FP16. If all 4 pipelines fully supported 32-bit, they'd just be wasting silicon. More likely, all 4 pipelines support IADD32, BITWISE32, and FFMA16, but only two support FFMA32 in hardware.

The A14 and M1 do have differences, such as one supporting BC compression and the other not. Also, Apple's A15 video said the A15 has "twice the FP32 math pipelines" compared to the previous generation.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Also, I'm pretty certain the GPU supports a dual-dispatch mode. At 4 pipelines/ALU and 4 32-wide ALUs/core, you'd need 16 concurrent simds to saturate them. If it were 4 schedulers w/ single-dispatch, each 4 cycles the core would issue (4*1 + 4*1 + 4*1 + 4*1) = 16 instructions (>16 simds, IPC=1). That's not what happens. FADD16 is mostly saturated at 8 simds/core, as long as ILP=2.

If it were 2 schedulers w/ dual-dispatch, the core would issue (2*1 + 2*1 + 2*1 + 2*1) = 8 instructions (>8 simds, ILP=1), or (2*2 + 2*2 + 2*2 + 2*2) = 16 instructions (>8 simds, ILP=2). With my theory, it still single-dispatches FP32 instructions even when ILP=2. This would be a remnant of the A11-A14 architecture, where dual-dispatching would not modify max FP32 throughput.
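
The same arithmetic spelled out per 4-cycle window (a toy model of the two hypotheses above, with my assumed scheduler counts and dispatch widths, not a hardware spec):

Code:
// 16 issues per 4-cycle window are needed to keep 4 ALUs × 4 pipelines busy.
let cyclesPerWindow = 4

// Hypothesis 1: 4 schedulers, single-dispatch. ILP doesn't help; each scheduler
// issues 1 instruction per cycle, each from a different simd.
let singleDispatchIssues = cyclesPerWindow * 4 * 1          // 16, needs ~16 simds

// Hypothesis 2: 2 schedulers, dual-dispatch. With ILP=2 a scheduler co-issues
// 2 instructions from the same simd, so 8 simds are enough.
let dualDispatchIssuesILP1 = cyclesPerWindow * 2 * 1        // 8
let dualDispatchIssuesILP2 = cyclesPerWindow * 2 * 2        // 16

print(singleDispatchIssues, dualDispatchIssuesILP1, dualDispatchIssuesILP2)
// Only hypothesis 2 matches FADD16 saturating at 8 simds/core when ILP=2.
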
 
Last edited:

AirpodsNow

macrumors regular
Aug 15, 2017
224
145
So why does the Mac Pro even exist?
So far, what I can see in this thread is a discussion about what technologies Apple has to offer for the enterprise (or not). To me, it seems that Apple has been laser-focused on bringing products to market that need "high" profit margins. Not sure how they plan to do that if they enter the "server" market.

The odd one out among their products is indeed the Mac Pro. It's hard to make a case that it actually makes them (significant) money at all (low volume). It would be surprising if they break even on it with all the development costs, etc. Brand image for the Mac line is perhaps its main function.

But I think the Mac Pro is needed so that a certain small segment of power users can develop on a Mac. If they don't offer it, no one will. It might also serve as a super-deluxe version of their most powerful (and most expensive) Mac to boost the Mac's reputation. The trash-can Mac was a failed attempt, but they did try to create something different.

With the transition to Apple Silicon, the direction of the Mac Pro has shifted again: rather than getting Xeon and Radeon upgrades, the question is whether they continue this "modular" approach. It's hard to imagine that they didn't think this through when introducing the last Mac Pro design. Whatever the case, they will try to create a wow moment and make sure the Mac Pro serves as the pinnacle of performance for the Mac line.
 