How can the Apple Silicon Mac Pro compete with PC Workstations?

theorist9 · Sep 21, 2022

leman said:
Again, my perspective on this might be naive, but I don’s see a principal difference between same die and same package if the dies can be connected in this way. The topology should be the same in either case. I can imagine that a on-die solution might be more energy efficient and of course have slightly lower latency but those things will matter less on a desktop.

One problem with Apples current design is that you can’t just take an individual component, there is ton of stuff that comes attached to it. People who could use a 128-core GPU likely don’t need a 32-core CPU etc. Apple will need to devise a more flexible scalable solution if they want to offer comprehensive options in that market in a way that’s commercially viable.

That certainly would be cool if they could make that work—and especially if it could be modular, where it comes with GPU slots enabling it to be upgraded as needed with additional Apple GPU chiplets.

And, of course, that would provide important product differentiation vs. the Studio.

mcnallym · Sep 21, 2022

In much the same way as added separate silicon for Neural Engine and Media Encoder can they not use separate silicon for extra features leaving the GPU less Important then on the PC.

leman · Sep 21, 2022

theorist9 said:
That certainly would be cool if they could make that work—and especially if it could be modular, where it comes with GPU slots enabling it to be upgraded as needed with additional Apple GPU chiplets.

And, of course, that would provide important product differentiation vs. the Studio.

I think modularity would be much more tricky to achieve, but there are other ways to approach it (e.g. with NUMA compute boards).

mcnallym said:
In much the same way as added separate silicon for Neural Engine and Media Encoder can they not use separate silicon for extra features leaving the GPU less Important then on the PC.

I do not understand what this means. Can you elaborate?

quarkysg · Sep 21, 2022

leman said:
But others are doing it somehow, right? There are plenty chips on the market that combine hundreds of CPU or GPU cores.

My naive imagination pictures some sort of network on a chip solution where chips can be connected together like legos to make a single entity, with additive cache capacity etc. No idea how feasible something like that is. But Intel is supposed to work on this kind of technology if I understand it correctly.

I have to be honest and admit that I have no clue what AMD or Intel is doing to link up their building blocks.

The fundamental issue as I see it is that for the M1 Ultra with two M1 Max dies (D1 & D2), when D1 fetches or writes data from it's RAM and caches (L1, L2, SLC), D1 would have to broadcast this to D2, so that if any of the D2 IPs touches the same data, it would have a consistent view. Doing this would make D1 and D2 appears as one single die to the OS. Expanding this to more dies means that this broadcast becomes exponentially more complicated.

Tying the dies together as network nodes would mean that the OS would have to use NUMA to ensure data consistency, pretty like multiple sockets servers I would think. This would negate the UMA model that Apple is advocating.

leman · Sep 21, 2022

quarkysg said:
The fundamental issue as I see it is that for the M1 Ultra with two M1 Max dies (D1 & D2), when D1 fetches or writes data from it's RAM and caches (L1, L2, SLC), D1 would have to broadcast this to D2, so that if any of the D2 IPs touches the same data, it would have a consistent view. Doing this would make D1 and D2 appears as one single die to the OS. Expanding this to more dies means that this broadcast becomes exponentially more complicated.

How is this fundamentally different from two CPU or GPU cores accessing the RAM? It is clear that the internal chip data interconnect must have capability to serialize such accesses and make them coherent to the rest of the system. How scalable is this internal interconnect? There are solutions on the market that do this for dozens or even hundreds of cores. For example, here is a description of how this works in the current crop of Intel Xeons:

Mesh Interconnect Architecture - Intel - WikiChip

Intel's mesh interconnect architecture is a multi-core system interconnect architecture that implements a synchronous, high-bandwidth, and scalable 2-dimensional array of half rings. Their mesh architecture has replaced the ring interconnect architecture in the server and HPC markets.

en.wikichip.org

This makes me think that Apple should be able to design similar solutions. Now imagine this kind of mesh extended across multiple chips by connecting their edges. I’m sure the practical details are infinitely more complex but conceptually it loos feasible?

quarkysg said:
Tying the dies together as network nodes would mean that the OS would have to use NUMA to ensure data consistency, pretty like multiple sockets servers I would think. This would negate the UMA model that Apple is advocating.

There is always NUMA, but it doesn’t mean that the NUMA must be software driven. Current systems have NUMA at the hardware level (cache hierarchies, inter-chip communication, coherency etc.) and it’s the hardware that takes care of it. Just look at the in-depth tests of modern x86 CPUs, it takes different amount of time to communicate between threads running at different CPU cores. I am sure that the same is true for Apple.

Making the system larger will make the data fabric larger, which will in turn increase the latency of data propagation and make the system more NUMA. But this would equally true for both monolithic dies and systems where multiple dies are connected to form a single topology. The OS needs to be aware of these things of course, as they have implications for thread scheduling and physical RAM allocation, but it won’t have any practical impact on the UMA system programming model. At most the average latency will go up by a few cycles, but nothing like thousands and thousands cycles as with multi-socket systems.

quarkysg · Sep 21, 2022

leman said:
How is this fundamentally different from two CPU or GPU cores accessing the RAM? It is clear that the internal chip data interconnect must have capability to serialize such accesses and make them coherent to the rest of the system. How scalable is this internal interconnect? There are solutions on the market that do this for dozens or even hundreds of cores. For example, here is a description of how this works in the current crop of Intel Xeons:

No difference, just at a different level. On die coherency are achieved by synchronising the different block's caches. It is such syncrhonisation that causes latencies I would imagine. For the M1 Ultra, my take is that the Ultra Fusion just extended the on-die synchronisation protocol to both M1 Max dies. I have no clue if this protocol can extend to more dies tho. and whether it would be still performant when expanded to more dies.

leman said:
There is always NUMA, but it doesn’t mean that the NUMA must be software driven. Current systems have NUMA at the hardware level (cache hierarchies, inter-chip communication, coherency etc.) and it’s the hardware that takes care of it. Just look at the in-depth tests of modern x86 CPUs, it takes different amount of time to communicate between threads running at different CPU cores. I am sure that the same is true for Apple.

I think this is called cache coherence NUMA? To Apple, I think as long as macOS' view does not change wrt to the SoC, it doesn't matter how the SoC is built. As long as all IP cores in the SoC sees a consistent view of the entire memory space, it would be fine. Otherwise macOS would have to be enhanced to segment the memory space according to the IP core clusters and processes. Quite unlikely for Apple to re-architect macOS to do that. The dual socket PowerMac G5 still uses HyperTransport (akin to UltraFusion) to link the two CPU to present it as one to the OS, so I don't think macOS ever support NUMA.

mr_roboto · Sep 22, 2022

quarkysg said:
I think this is called cache coherence NUMA? To Apple, I think as long as macOS' view does not change wrt to the SoC, it doesn't matter how the SoC is built. As long as all IP cores in the SoC sees a consistent view of the entire memory space, it would be fine. Otherwise macOS would have to be enhanced to segment the memory space according to the IP core clusters and processes. Quite unlikely for Apple to re-architect macOS to do that. The dual socket PowerMac G5 still uses HyperTransport (akin to UltraFusion) to link the two CPU to present it as one to the OS, so I don't think macOS ever support NUMA.

There aren't any commercially successful non-coherent NUMA systems I'm aware of. As a result, in practice a lot of people just use NUMA as a synonym for cache-coherent NUMA (aka ccNUMA).

Even ccNUMA machines tend to need some level of OS support to get the most out of the hardware. Things like trying to pin threads on the nodes where their pages reside, detecting when that's broken down and migrating threads and/or pages as necessary, and so forth.

Apple has definitely shipped NUMA hardware - the dual CPU socket configs of the 2010 Mac Pro were NUMA. However, I've never heard of MacOS X having significant NUMA optimization. Those machines were such a tiny fraction of all Macs sold, and NUMA optimization is difficult and fragile. It's also often an application level thing, not just the kernel - apps may need to explicitly tell the kernel about thread/memory affinities, need to be tuned for the size of nodes, and so on. So it's not too surprising Apple didn't do much.

Also, there's different degrees of non-uniformity. The 2010 Mac Pro was relatively mild - only two nodes, and Intel's NUMA interconnect of the time (QPI) was reasonably fast relative to memory. The penalty for failing to optimize wasn't large. There are much bigger NUMA systems out there, with hundreds to thousands of nodes distributed across many boards. In these, the performance ratio between local and remote memory accesses is so large that you must put a lot of effort into NUMA optimization to get the best out of the hardware.

Technically speaking, Apple's M1 Ultra is NUMA. However, Apple's claimed their bridge (a NUMA interconnect) is so high performance that classic NUMA software optimizations aren't necessary. It seems plausible - it's much faster relative to memory than the 2010 Mac Pro's QPI link was, and nobody had any big complaints about the performance of those machines.

deconstruct60 · Sep 22, 2022

Xenobius said:
Mac: M3 - *Hardware accelerated RT (Part 1)

All Apple needs to do is make the GPU in the Mac Pro 12x faster than the GPU in the M1 Ultra to compete with RTX 4090. Piece of cake. … and more seriously - I think we can expect the Radeon RX7000 in the Mac Pro.

blenderartists.org

The RX 7000 in the old Mac Pro ? at best a coin toss. The base chassis is already about 3 years old. If it shows up eventually decent chance it is a just a thinly rebadged AMD "Pro" card. Timing really doesn't work well here. Track record over last several years is that macOS on Intel doesn't get a AMD card for at least a quarter or two. By 2023 the MP 2019 will be winding down on macOS support. Not sure how Apple gets return on investment for a brand new card when sales will be trending down at the same time. the longer it takes the card to come out the more on the downward trend they'd be on.

A RX 7000 in a Apple Silicon Mac Pro? Pretty doubtful given there is zero driver support now. Even for existing AMD GPUs that has coverage on macOS on Intel. The 4090 is not a big enough 'boogey man' to force a mac ecosystem damage decoupling of the Mac Pro from the rest of the line up. Pretty good chance the whole Mac line up remains 100% focused on Apple GPU being sole GUI driver for macOS on Apple Silicon.

Apple letting in something from the CDNA series ( e.g., MI210 ) might have better chance of spanning both sides (macOS intel/AS ) and have a longer product lifecycle that would make effort pay off. Not sure if AMD is interested or if Apple is willing to fund it. But it is a better match the the 7000 series which likely would imply a coupling to a GUI screen.

deconstruct60 · Sep 22, 2022

leman said:
Again, my perspective on this might be naive, but I don’s see a principal difference between same die and same package if the dies can be connected in this way. The topology should be the same in either case. I can imagine that a on-die solution might be more energy efficient and of course have slightly lower latency but those things will matter less on a desktop.

The latency is actually more critical on a GPU system than in CPU system due to hard real time constraints of the display output controllers. That is why last 3 decades has seen several multiple high NUMA impact CPU package system (with OS and app ) workarounds and really no high end resolution/frame rate multiple chip solutions on the GPU systems.

AMD's MI200 series presents as two different GPUs. Doesn't present as one coherent system. The upcoming multi-chip AMD 7900/7800 leave the main complex monolithic and only prune off the memory controllers and associate "fronting" cache.

Quirky workarounds like SLI/Crossfire have largely collapsed as resolutions , color depth , and frame rate have dramatic steps forward over the last several years. It isn't missing TFLOPs that is core issue. It is missing hard real time deadlines that is the primarily source of the quirkiness.

Batch job renderings won't be sensitive to that. Interactive jobs at high framerates and bulky resolutions are bigger issue. Apple getting one tile/chiplet to talk one remote display controller is one thing. Getting three remote tiles all real time constrained for a display controller is going to be more hefty tap dancing.

leman said:
One problem with Apples current design is that you can’t just take an individual component, there is ton of stuff that comes attached to it. People who could use a 128-core GPU likely don’t need a 32-core CPU etc. Apple will need to devise a more flexible scalable solution if they want to offer comprehensive options in that market in a way that’s commercially viable.

The catch 22 is that GPUs haven't scaled well to massive chiplet fan out. If it was 'fall off a log" easy Nvidia probably wouldn't have made a 600+ mm2 solution for the 4090. For a chip that size the yields are better if it 1/4 or 1/5 the size. It isn't. And AMD isn't chopping up the cores chiplet on their solution either. There has been a need fora certain latency and bandwidth threshold to be crossed to even attempt it. Chiplets are comping for GPUs but likely won't be the same breakdown the CPU cores have been "chopped up". Substantive differences in the latency constraints.

mcnallym · Sep 22, 2022

leman said:
I do not understand what this means. Can you elaborate?

Not sure how be clearer. Apple uses seperate parts of the SoC. ie it has the Media Engine which is not the GPU which accelerates the media conversion process. On a PC workstation would be either the AmD Or Nvidia GPU performing those tasks. They have the Machine Learning through the Neural Engine, whereas PC workstation has the CPU and GPU and using those for the same functions that Apple using different engines within the SoC.
so don’t need the same raw grunt depending on the task being performed if can have a nother part of the SoC perform the task.

neinjohn · Sep 22, 2022

senttoschool said:
Creating multiple huge custom SoCs for the smallest Mac market make zero sense for Apple. Part of stacking Max dies together is to reduce R&D cost for Mac Pro-level SoCs.

Unless... of course... Apple decides to start Apple Silicon Cloud service: https://forums.macrumors.com/thread...t-a-40-core-soc-for-mac-pro-now-what.2306486/

Creating a cloud service would expand the market for big Apple Silicon SoCs beyond the Mac Pro.

A Mac Pro could probably be a integrated, greatly simplified headless number crunching / enterprise deployer / soft cloud platform through "Continuity on Steroids" and Relay.

For example an office, you could have a team of 10 people for example that I had iPhones, Macs and iPads that could access seamlessly the hardware on a Mac Pro sitted on a corner of the room.

You are drawing something on 3D on your iPad and on any app you choose "Access Mac Pro" or choose a few apps that will always run on the Mac so it can crunch away. Work on a 8K video on your MacBook Air that is only a viewpoint. As can multiple users on the run with their Apple device.

If Mac Pro was provided with a router you could connect to it when away as if was a VPN though Relay.

Since Apple has kind renewed their interest on the enterprise solutions it would be interesting if they positioned the Mac Pro not as standalone workstation that is individually connected to a pair of monitors, keyboard and mouse (leave it to the Studio) but facilitate it as a center of teamwork.

Also, given Apple silicon efficiency, they could pack a lot of power on the actual highly stylised Mac Pro tower.

mr_roboto · Sep 22, 2022

deconstruct60 said:
Quirky workarounds like SLI/Crossfire have largely collapsed as resolutions , color depth , and frame rate have dramatic steps forward over the last several years. It isn't missing TFLOPs that is core issue. It is missing hard real time deadlines that is the primarily source of the quirkiness.

Batch job renderings won't be sensitive to that. Interactive jobs at high framerates and bulky resolutions are bigger issue. Apple getting one tile/chiplet to talk one remote display controller is one thing. Getting three remote tiles all real time constrained for a display controller is going to be more hefty tap dancing.

Your recent posts are just a giant confused mess of concepts which don't really apply. The problems with NVidia SLI and AMD Crossfire had nothing to do with missing hard realtime deadlines.

GPU compute hardware isn't tightly coupled to display controllers. All it does is rasterize a scene into a memory buffer. What happens to that buffer in the future, it doesn't really know or care.

Similarly, display controllers just read pixel data from a memory buffer and send it out through an interface like DisplayPort. They don't know or care when that pixel data arrived in the buffer. In fact, they're happy to read out a buffer while it's still being modified. Doesn't matter.

You can introduce artificial realtime deadlines into this asynchronous system by trying to synchronize rasterization with display refresh, mostly to avoid the visual artifacts resulting from refresh readout as a buffer is being modified. Typically this is accomplished through double or triple buffering. In double buffering, one screen buffer (the back buffer) is being rasterized into and the other (the front buffer) is being refreshed to the display. Buffer roles flip whenever a refresh cycle completes at a time when the GPU has marked its back buffer as done. This has a downside: if GPU performance falls slightly below the display refresh rate, it falls all the way to half the refresh rate (or worse). Triple buffering adds another back buffer to the queue so that the GPU can complete one frame and move on to the next regardless of exactly when the display controller vertical blanking period starts. Note that with triple buffering, we've gotten back to GPU rendering running async from display refresh.

The biggest issue with NV SLI and Crossfire was that best performance required a technique called Alternate Frame Rendering (AFR). Thanks to the rasterization algorithms implemented by both NVidia and AMD GPUs, it's hard for two GPUs to collaborate on the same frame. With AFR, you'd avoid that by launching rasterization of frame #1 on GPU A, then launching rasterization of frame #2 on GPU B before rasterization of frame #1 was complete.

Ideally you'd kick off rasterization of frame #2 at exactly the halfway mark of rasterizing frame #1, and get perfect overlap such that whole frames are completed on a timeline at the same pace a 2x speed single GPU would. But one of the big issues with SLI/Crossfire AFR was that in practice, it was very difficult to achieve that ideal pacing. The system was prone to falling into patterns where frames from one GPU stayed visible far longer than frames from the other, which is visually unpleasant. Human vision likes relatively consistent frame times even more than it likes low frame times.

Apple's GPU technology uses a different rasterization algorithm, TBDR, which should be better suited to rendering one frame using multiple GPUs. It splits the work of rendering a frame into a two stage pipeline, geometry transform/binning and rasterization. Geometry binning can be done centrally by one GPU, and produces a bunch of little packets of work for tile rasterization engines to do. Since there's little communication between tile engines, it's possible to farm tiles out across chip boundaries and reassemble everything into a single scene at the end.

I'm glossing over a ton of complications, but the point is that TBDR GPU tech has the potential to make multi-GPU work well, mostly because the unit of collaboration can be much finer grained than whole frames. It's not without precedent. The original SLI was 3DFX Scan Line Interleaving (Nvidia's SLI means Scalable Link Interconnect, basically an awkward name invented to make it sound like 3DFX SLI). 3DFX SLI actually worked well, and it was because early 3DFX GPUs were architected around separate chips for geometry and rasterization. With 3DFX SLI, one card's geometry chip did all the geometry work, and both did rasterization, with the work unit being one scanline (hence SLI).

name99 · Sep 22, 2022

Xenobius said:
Mac: M3 - *Hardware accelerated RT (Part 1)

All Apple needs to do is make the GPU in the Mac Pro 12x faster than the GPU in the M1 Ultra to compete with RTX 4090. Piece of cake. … and more seriously - I think we can expect the Radeon RX7000 in the Mac Pro.

blenderartists.org

This assumes a specific use case for the M1 Pro, for 3D type work. It's also probably the case that Blender will keep improving in performance on AS for the next few years. That doesn't help anyone who wants high performance Blender today, but it does somewhat answer the question of where Apple goes in future – as more and more SW picks up Apple Silicon optimizations, performance differentials will flip for many apps.

The benchmark you showed (blender 3.0 maybe) has 1000 for an Ultra.
3.1 benchmarks give 1132
3.3 benchmarks give 1370
The number keeps rising as more and more calls within Blender are routed to Metal. It's far from clear if this will tail off around 1500, or if things can just keep going once some of the internal details of Blender are restructured as a better match to how Apple Silicon wants to do things. You can see some discussion about this in the Blender Developers forums; right now mainly what we see is simple non-disruptive changes to Blender, not thorough rethinking of code paths.
The next HW (whatever it's called, M2 Ultra, M3 Ultra, who knows) will not just have a baseline of about 50% faster GPU (assuming the A15 and A16 improvements) but will probably have much better scaling. M1 Ultra's GPU scaling is clearly something of a disappointment; but it was also version 1 for this sort of design, and Apple will surely have redesigned the successor to fix the most immediate problem cases to give somewhat better scaling.

Other interesting workstation use cases include
- STEM work (where one cares mainly about CPU speed). Apple can improve their presence here by working with Mathematica, Matlab, Octave and so on to ensure that
+ they route calculations to AMX as much as possible
+ they make these users (and LLVM) aware of their MUL53 extension (instruction that's not present in base ARM) that allows for substantially faster bignums.

- database use cases (eg a lot of bio looking through large datasets). This can be improved by either having more memory at an intermediate tier between DRAM and SSD (ie something like Optane) or by PiM (Processor in Memory, which is a simple processor extremely tightly coupled with DRAM).
Apple has definitely looked into the first (Optane-like) solution. It could still happen in various forms, if hey can find an appropriate HW substrate. Why would it work for Apple but not Intel? Because Apple controls the whole stack, and would make sure everything from the OS to the APIs were all co-operating to use this dual memory system appropriately.
What about the second? As I have said before, cross sections of the A15 die show what looks like PiM...
It's hard to be certain, but nothing else obvious matches what the photos shows. Unfortunately no M2 cross sections exist, and not yet an A16 cross section, so this remains tentative. But these are two directions Apple could go.

Basically Apple is in the position where they can take HW ideas that have been thrown around for years (like PiM or using two types of DRAM) and implement them correctly and fully. This certainly has the potential to create solutions very different from existing PC workstations.

leman · Sep 22, 2022

mcnallym said:
Not sure how be clearer. Apple uses seperate parts of the SoC. ie it has the Media Engine which is not the GPU which accelerates the media conversion process. On a PC workstation would be either the AmD Or Nvidia GPU performing those tasks. They have the Machine Learning through the Neural Engine, whereas PC workstation has the CPU and GPU and using those for the same functions that Apple using different engines within the SoC.
so don’t need the same raw grunt depending on the task being performed if can have a nother part of the SoC perform the task.

The key bit is “depending on task being performed“. Domain-specific accelerators are great but they don’t remove the need for capable general-purpose programmable processors.

Take ML for example. Apple currently offers three main processors in this domain. The NPU is for energy-efficient inference (but it lacks flexibility and user programmability), the AMX is for training and inference on the CPU (but it’s restricted to whatever public interface Apple provides since it’s not user-visible) and the GPU. Only the GPU is directly programmable. The three processors fulfill different roles and all three of them have to be fast in order to do their job.

JouniS · Sep 22, 2022

name99 said:
- database use cases (eg a lot of bio looking through large datasets). This can be improved by either having more memory at an intermediate tier between DRAM and SSD (ie something like Optane) or by PiM (Processor in Memory, which is a simple processor extremely tightly coupled with DRAM).

Biology is not a good use case for that, as bioinformatics typically uses custom software written by people with no significant experience in software development and hardware. You definitely don't want to use weird hardware in such situations. In the ideal case, you want simple hardware where cross-platform solutions scale well without having to deal with BS like NUMA nodes, hyperthreading, or efficiency cores. You have RAM, local SSD, and remote storage, and the OS does not try to do anything too clever between them.

Wokis · Sep 22, 2022

mcnallym said:
Not sure how be clearer. Apple uses seperate parts of the SoC. ie it has the Media Engine which is not the GPU which accelerates the media conversion process. On a PC workstation would be either the AmD Or Nvidia GPU performing those tasks. They have the Machine Learning through the Neural Engine, whereas PC workstation has the CPU and GPU and using those for the same functions that Apple using different engines within the SoC.
so don’t need the same raw grunt depending on the task being performed if can have a nother part of the SoC perform the task.

While not within a SoC, these functions, namely media encoding/decoding and machine learning, are dedicated blocks inside the silicon of PC CPUs and GPUs as well. Nvidia for example calls them NVENC and Tensor cores. The advantages and disadvantages seem to be the same on both platforms. Namely that fixed function silicon performs great per watt used but is really inflexible to changes, where a new codec or resolution for example warrants a re-design of the block and can't be offered for existing products.

Apple does take it one step further though with support for ProRes which of course as an intermediary and deliverable codec makes for great value within the video pro market. But it's already been done at this point. I don't see which fixed-function solutions they can come up with that offers true value and doesn't just lock in the customer to a narrow workflow.

mi7chy · Sep 22, 2022

Xenobius said:

How are they getting the >12000 figure since there's no 4090 entry in Blender Open Data?

Nvidia is showing a 54.5% increase from 3090ti to 4090 so it's more like ~9400.

https://blogs.nvidia.com/blog/2022/09/20/nvidia-studio-geforce-rtx-40-series/

Boil · Sep 22, 2022

A few different ways to go...?

The building block...

M2 Max SoC

12-core CPU (8P/4E)
40-core GPU
16-core Neural Engine
96GB LPDDR5 SDRAM
400GB/s UMA bandwidth

The way most have been thinking would be...

Two M2 Max SoCs = one M2 Ultra "SoC"
Two M2 Ultra "SoCs" = one M2 Extreme "SoC"

Giving us...

M2 Ultra "SoC"

24-core CPU (16P/8E)
80-core GPU
32-core Neural Engine
192GB LPDDR5 SDRAM
800GB/s UMA bandwidth

M2 Extreme "SoC"

48-core CPU (32P/16E)
160-core GPU
64-core Neural Engine
384GB LPDDR5 SDRAM
1.6TB/s UMA bandwidth

But what if we added a second building block...?

(Possible) M2 GPU-centric "SoC"

60-core GPU
96GB LPDDR5 SDRAM

And assembled the building blocks in pairs (one M2 Max SoC & one M2 GPU-centric "SoC")...?

Resulting in...

M2 Ultra "mixed SoC"

12-core CPU (8P/4E)
100-core GPU
16-core Neural Engine
192GB LPDDR5 SDRAM
800GB/s UMA bandwidth

M2 Extreme "mixed SoC"

24-core CPU (16P/8E)
200-core GPU
32-core Neural Engine
384GB LPDDR5 SDRAM
1.6TB/s UMA bandwidth

My hope is we also see Apple GPUs/GPGPUs...

Apple add-in card

Four M2 GPU-centric "SoCs"
240-cores
384GB LPDDR5 SDRAM
1.6TB/s UMA bandwidth
PCIe Gen4 x16

senttoschool · Sep 22, 2022

neinjohn said:
A Mac Pro could probably be a integrated, greatly simplified headless number crunching / enterprise deployer / soft cloud platform through "Continuity on Steroids" and Relay.

You're talking about an on-premise server using a Mac Pro.

I'm talking about a cloud-based server using Mac pro-SoCs.

The latter market is much bigger than the former.

Xiao_Xi · Sep 22, 2022

senttoschool said:
I'm talking about a cloud-based server using Mac pro-SoCs.

Wouldn't Mac Pros in the cloud hurt sales of other Apple hardware?

name99 said:
+ they make these users (and LLVM) aware of their MUL53 extension (instruction that's not present in base ARM) that allows for substantially faster bignums.

I'm sure Apple has included that in their LLVM fork.

senttoschool · Sep 22, 2022

Xiao_Xi said:
Wouldn't Mac Pros in the cloud hurt sales of other Apple hardware?

I don't think so. It would increase sales of Apple's core Macs such as the Macbook Pro/Air. The reason is that anyone with a Mac can now accelerate their applications by renting a Mac Pro in the cloud.

iPadified · Sep 23, 2022

I think that "want" is much more important than "how" at the moment.

Which compute tasks are the Mac Pro expected to solve and which market is it addressing?

Music? Add some PCI slots to the Ultra for suitable special hardware in regard to music production-should be enough
Video? The Ultra is pretty good but I expect that some might need PCI as well for special hardware
3D? Are we talking about rendering, shading, modelling, simulations or animations? Each of these tasks does not need the same kind of computer. For some of these tasks the Ultra is sufficient. Do Apple want to address the whole stack?
Science: sky is the limit here and price/performance will be key (guess why gamings cards are so popular to use). Price/performance ratio is not Apples strong point.

I am quite confident that Apple solve "how" to beat 2X 4090 but can they also do it at a competitive price?

maflynn · Sep 23, 2022

The M1/M2 claim to fame, is that they're very power efficient and offer the same performance on battery as on AC. That's great for laptops, but there's really no advantage to desktops

While the performance of the M1 is excellent and I'm taking anything away, i think PC workstations have a number of advantages over the M1 architecture

Better GPUs
Replaceable components
Better processor performance
Standardized parts

I'm not saying future M series processors cannot compete with PC workstations, but imo, you get more bang for your buck with PC workstations at the moment, not to mention many people have workflows that rely on Nvidia's cuda cores.

iPadified said:
I am quite confident that Apple solve "how" to beat 2X 4090

Apple's solution to compete against PCs and GPUs is to make bigger dies. At some point you can't just keep making the physical chip larger and larger to compete, there needs to be other architectural improvements - again I'm not saying apple cannot do that, just point out making things larger isn't a long term answer.

deconstruct60 · Sep 23, 2022

maflynn said:
The M1/M2 claim to fame, is that they're very power efficient and offer the same performance on battery as on AC. That's great for laptops, but there's really no advantage to desktops

Constrained to narrow threaded tech sensational benchmarks, maybe so.

For heavyweight grunt multithreaded workloads? yes it can be. The top end single thread performance between a M1 , Pro, Max , and Ultra is basically the same. Not loosing either base clock speeds or single threaded (ST) performance all the way up to 20 cores. If strapped four dies into an monster "Extreme" package , there is very good chance could do the exact same thing all the way up to 40 CPU cores.

Go from 8 cores to 40+ cores in the x86_64 lines ups and see if base clocks and ST performance doesn't drop. Or the power consumption veer off into substantially higher costs zone.

So largely depends upon world view. Maximum speed with a liquid nitro heat sink and massive memory overlocking on narrow ST as the baseline metric or throughput on broad heavyweight loads.

It is working for Amazon (Graviton 2 , 3 ). The fastest CPU package deployment grown in AWS is Graviton. That isn't because primarily Amazon makes it, that is because customers select it. Lower operating costs, no service level agreement misses , etc. It is working for Ampere Computing on other major cloud services deployments.

Does that mean Apple is going to drift into the > 64 core zone and enter the "Core count " wars? No. But in the 16 < c < 64 SoC space probably going to compete pretty competitively. For who has stuff that runs on 1-4 cores 80+% of the time and mostly have heavily ST bound apps where Turbo speed is 'everything'. Yes, Apple probably will loose some traction there.

But is that where modern architecture and actively maintained applications are going over the next 1-4 years ? That old school pool of apps probably isn't going to grow bigger. [*** see PS below ***]

maflynn said:
Apple's solution to compete against PCs and GPUs is to make bigger dies.

Actually that isn't really true. The dies in the Ultra are just as big as those in the Max.

Apple isn't making relatively small chiplets. Apple will likely use TSMC N3 to pull back the Max die size into more the central midrange chip size. Doubtful Apple is going to pivot to relatively small chiplets/tiles. UltraFusion is extremely wide so it takes up lots of die edge space. Make the die area too small and will run the risk of running into problems fitting the necessary I/O all around.

The can also more easily use large chiplets because their Perf/Watt is very good. If power consumption was very high then there would be a greater need for either spacing to manage the thermals or chopping clocks as core count went higher and limited to more bursty Turbo runs.

There is a ton of money being put into more advanced 2.5/3D packaging coming where future SoCs are not going to be limited to a single die for large volume product production. ( An evolutionary step past the relatively simple print a mini PCB board solution the AMD Ryzen chiplets have leveraged for a while).

maflynn said:
At some point you can't just keep making the physical chip larger and larger to compete, there needs to be other architectural improvements - again I'm not saying apple cannot do that, just point out making things larger isn't a long term answer.

There is a class of insatiable horsepower consumption users. But there are also a substantial number of workloads were today's system are fast enough to get profitable work done for several years. Mainframes gave way to Minicomputers (actually not so mini physically). desktop PC ate away at bigger brethren for many years. Laptops have eaten away at desktops. etc.

Should Apple open a window for specialized compute accelerators drivers? Yeah. But is more a problem of deprecating OpenCL but not delivering a robust replacement for it. It is software at least as much as it is hardware.

But the architectural revolution trend line you are getting at there with "Moore Law" running out of steam is likely going to follow a path were "general" workstations will get more specialized over the long term. Part of that is because has SoC foundational building blocks like Arm , GPU , RISC-V licenses where can put together specific enough SoCs without being totally dependent upon a single build everything for everybody generically general purpose chip maker.

This isn't necessarily going to end up where is cheaper commodity stuff on the retail shelves at the upper high end.

P.S. after finishing this saw an article about faster AV1 .

"..
Google could do this by adding Frame Parallel Encoding to heavily multi-threaded configurations. ...

In other words, CPU utilization in programs such as OBS has been reduced, primarily for systems packing 16 CPU threads. As a result, they are allowing users to use those CPU resources for other tasks or increase video quality even higher without any additional performance cost .. "

AV1 Update Reduces CPU Encoding Times By Up To 34 Percent

AV1 content creation will become very important, now with RTX 40-series GPUs supporting the new standard.

www.tomshardware.com

Or can just add hardware AV! encoding and complete blow past that uplift. But chaning the viewpoint from 16 cores is extremely rare to being a 'reasonable' requirement actually can change performance without adding yet even more cores to run the other algorithm limited code faster or break out the extra large Turbo, flogging whip to make the old algorithm limited code run faster.

Joe The Dragon · Sep 23, 2022

apple needs to rethink storage for an pro workstation
No need an 2th system to reprogram the build in disk
Nice to have raid 1 / 5 / etc setups with hot swapping.
Nice to have m.2 slots and not apple storage with it's high markup.
Nice to have more then one disk / volume in the case.
Nice to have more network card choice / SFP+ (not locked to apple only modules)

How can the Apple Silicon Mac Pro compete with PC Workstations?

macrumors 68040

macrumors 65816

macrumors Core

macrumors 65816

macrumors Core

macrumors 65816

macrumors 6502a

macrumors G5

macrumors G5

macrumors 65816

macrumors regular

macrumors 6502a

macrumors 68020

macrumors Core

macrumors 6502a

macrumors 6502a

macrumors G4

macrumors 68040

macrumors 68030

macrumors 68000

macrumors 68030

macrumors 68020

macrumors Haswell

macrumors G5

macrumors 65816

Our Staff