Here's the full text I pulled out of the Twitter feed. It's a bit of a read, but interesting nonetheless.
Problem: Apple says that the M1 Ultra GPU can use up to 105W of power. However, the highest we could ever get it to reach was around 86W. No, the Mac Studio's cooling wasn't the problem, because the GPU stayed cool at around 55-58°C, whereas Apple has let past chips run all the way up to 100°C. This makes it pretty clear that the Mac Studio cooling system is OVERKILL for most apps, which suggests a disconnect between Apple's Mac Studio cooling system engineers and the M1 Ultra chip designers/engineers. Something has gone terribly wrong in terms of chip performance.
Culprit: Each cluster of GPU cores within an M1/M1 Pro/M1 Max/M1 Ultra chip comes with a 32MB TLB, or Translation Lookaside Buffer: a cache that stores recent translations of virtual memory addresses to physical memory addresses, used to reduce memory access time.
Hishnash: "If an application has not been optimized for the M1 GPU architecture's tile memory, (not just Metal optimized) then every read/write needs to go all the way out to system memory. If the GPU compute task is issuing MANY little reads, then this will saturate the TLB. The issue is if GPU data hits the TLB and the page table being read/written to is not loaded, then that entire thread group on the GPU needs to pause while the page table is loaded into the TLB. If your application is using MANY reads/writes per second, this results in a lot of STALLED GPU thread groups. Unlike a CPU, when a GPU is waiting for data, it can't just switch to work on something else. So the GPU sits there and waits for the TLB buffer to clear in order to get more work to process."
This is why we only saw 86W peak GPU usage even in an app that was considered to be decently optimized. However, apps that CLAIM Apple Silicon support but have NOT been rewritten to take advantage of Apple's TBDR tile memory system will be severely limited by the 32MB TLB if they issue many reads/writes.
The problem is that ALMOST ALL apps out there haven't been optimized for Apple's TBDR tile memory system. Many software developers simply get their app working under the traditional immediate-mode rendering (IMR) model and call it good to go, unaware of the 32MB TLB limitation that bottlenecks performance.
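As one concrete example of the kind of TBDR-specific optimization a straight IMR port tends to miss, here is a sketch using Apple's metal-cpp bindings: an intermediate render target that only lives for the duration of a single render pass can be declared memoryless, so it stays in on-chip tile memory and never generates system-memory traffic at all. The function name and pixel format here are hypothetical.

```cpp
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>
// Note: metal-cpp also requires the *_PRIVATE_IMPLEMENTATION macros to be
// defined in exactly one translation unit of the project.

// Sketch: create a transient render target backed by tile memory only.
MTL::Texture* makeTransientTarget(MTL::Device* device, NS::UInteger width, NS::UInteger height)
{
    MTL::TextureDescriptor* desc =
        MTL::TextureDescriptor::texture2DDescriptor(MTL::PixelFormatRGBA16Float,
                                                    width, height, /*mipmapped=*/false);
    desc->setStorageMode(MTL::StorageModeMemoryless);  // no system-memory backing store
    desc->setUsage(MTL::TextureUsageRenderTarget);     // memoryless textures can only be attachments
    return device->newTexture(desc);
}
```

Memoryless textures can only be used as render-pass attachments, so they only fit data that is produced and consumed within the same pass (a depth buffer or G-buffer target, for example), which is precisely the kind of intermediate data an IMR-style engine would otherwise write out to a regular texture.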
Hishnash: "What apps should be doing is loading as much data as possible into the tile mem and flushing it out in large chunks when needed. I bet a lot of the writes (over 95%) are for temporary values that could've been stored in tile mem and never needed to be written at all."
Hishnash: "I expect that the people building the M1 family of chips didn't expect applications to be running on it that are not TBDR optimized. So they thought 32MB would be enough."
WRONG. Most apps aren't optimized for tile memory, even if they claim to support Apple Silicon.
Keep in mind that in the time since Apple started engineering the M1 family 5-7 years ago, reliance on GPU performance has skyrocketed, so the chip designers probably didn't think there would be so many reads/writes hammering the 32MB TLB.
What does this mean? The M1 family of chips, including the M1 Ultra, has a major limitation that can't be fixed unless apps are properly optimized. Here's the problem.
Hishnash: "The effort needed to optimize for tile memory is MASSIVE. It requires going all the way back to the drawing board, re-considering everything, like the concept that there is a local on-die memory pool you can read/write from with very very low perf impact is unthinkable in the current desktop GPU space. It’s a matter of a complete rewrite at a concept/algorithmic level."
Why is this such a big problem for M1 Ultra?
With the M1 and M1 Pro chips, there wasn't enough GPU performance to hit that 32MB TLB limit. However, the M1 Max is where you see GPU scaling fall off a cliff due to the TLB, especially the 32-core GPU model.
This problem scales linearly, so if, for example, 26 cores is the sweet spot for the 32-core M1 Max, with the remaining 6 cores bottlenecked by the TLB, then the M1 Ultra, which is built from two 32-core M1 Max dies, effectively has 12 of its 64 GPU cores sitting bottlenecked. No wonder it scales poorly.
The solution from Hishnash: "Increasing the TLB will help a lot for applications that are not optimized. This is important because many apps will NEVER be optimized, and even fewer games will be."
This is why gaming performance is so poor on M1 Ultra, apart from the Rosetta bottleneck.
Hishnash: "For game engines that are not TBDR aware/optimized, they might be currently bottlenecked on reads.. and depending on the post-processing effects, might have some large bottlenecks on writes if they're not using tile memory and tile compute shaders where possible."
The reason World of Warcraft runs so well and compares favorably to the RTX 3080/3090 is that APPLE helped Blizzard optimize the game PROPERLY to take advantage of the new TBDR tile-based architecture. (The WoW Metal update released just one week after the M1 event proves they got help from Apple.)
The solution from our source: Future M-chip families (hopefully and probably the M2) will see a big increase in TLB size to solve this problem, since developers are likely to be slow in optimizing their apps. Apple will likely release white papers at WWDC on how to optimize apps properly.
This means that the M2 Ultra will see a HUGE boost in GPU performance over the M1 Ultra if the 32MB TLB bottleneck is removed. And that performance boost will come on top of higher clock speeds and potentially higher GPU core counts.
The only hope for the M1 Ultra is that developers finally decide to completely rethink and rewrite their apps to support the TBDR tile-based memory architecture. (Good luck)
Oh, and by the way, expect hardware ray-tracing support on future M-chip families. (Hopefully M2 Pro+)