
aytan

macrumors regular
Dec 20, 2022
161
110
The nicest solution would be TB with a first-party interface to manage load balancing for you.

The least nice solution would be to not even bother connecting them and just load balance your scene manually - for example, if you have 2 Studios and want to render a 1000x500 image, just render the left 500x500 pixels on Studio 1 and the right 500x500 pixels on Studio 2.
Endless possibilities ahead :) I wish Apple enabled TB daisy chaining again like 10 years ago, or offered an external GPU/extended GPU/compute module solution. Maybe in another universe they already did. Who knows.
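For illustration, a minimal sketch of that manual tile split (hypothetical Python; render_region and the host names are illustrative stand-ins, not a real renderer API):

```python
# Hypothetical sketch: split a 1000x500 render across two Mac Studios
# by giving each machine half of the image, then stitch the halves.
TILES = {
    "studio1.local": (0, 0, 500, 500),    # left half:  x, y, width, height
    "studio2.local": (500, 0, 500, 500),  # right half
}

def render_region(host, region):
    """Stand-in: ask `host` to render `region` of the shared scene."""
    x, y, w, h = region
    print(f"{host}: rendering {w}x{h} tile at ({x}, {y})")

for host, region in TILES.items():
    render_region(host, region)
# Afterwards, paste the two tiles side by side into the final image.
```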
 
  • Like
Reactions: iPadified

innerproduct

macrumors regular
Jun 21, 2021
222
353
To me the issue is mostly price. The Mac Studio is the iMac replacement and is competitive with what we would have had if Apple had stayed with Intel/AMD. A fully loaded Ultra is very similar in perf to a 13900 with a W7900. But now the price for the whole machine including a 5K screen is twice the price of the loaded iMac (since you can't get around the insane prices for RAM anymore, etc.).
An M2 Ultra 60-core should really be priced at sub-$3000 and the screen at $1200 or so.
But of course, the bigger problem is the lack of a real pro tier machine.
Anyway, this is what we got for now. At least the Studio is finally working (it scales) as it should have already with the M1 Ultra.
 
  • Like
Reactions: iPadified and aytan

mi7chy

macrumors G4
Oct 24, 2014
10,622
11,294
Let's say you want to render a 10 frame animation on 10 Mac Studios. You have your scene on a shared drive, and each Mac Studio reads the scene into memory (sure, over 10Gb Ethernet). Then Mac Studio 1 renders frame 1, Mac Studio 2 renders frame 2, and so on. Congratulations, you've now rendered 10 frames in the time it takes 1 Mac Studio to render 1 frame. Perfect 10x scaling. 100 Studios would give you 100x scaling, etc.

(Obviously this is a very simple load balancing solution and you'd probably want a better one in practice.)

Compare that to the PC where, even if you've managed to load your scene into the VRAM of every single card, your single CPU is going to be sending commands back and forth over PCIe to each GPU constantly. Meanwhile, with the Mac you've got 10 CPUs - 1 CPU per GPU - so you're never going to get CPU bottlenecked.

An animation is at least 24 frames per second * 60 seconds * the number of minutes, so the average 90-minute animated film at 24 fps is 24 * 60 * 90 = 129,600 frames - far from 10 frames. You need fast I/O to distribute the workload, move the scene assets to the render workers, and move the finished frames back to a central node to combine into an animation. Isn't Thunderbolt 4 a ring topology, and thus an ever-increasing bottleneck beyond a few nodes if data has to be recopied from node to node along the ring to the destination node? Furthermore, supposedly only 22Gbit/s of Thunderbolt 4's 40Gbit/s is usable for data transfer. For comparison, PCIe 4.0 x16 is 31.5GByte/s (not Gbit/s), and if you need more than 24GB of VRAM you can upgrade without throwing out the whole system to a 300W 48GB RTX A6000 with 115.2GByte/s NVLink between a pair; the 64-core Threadripper Pro 5995WX used in the YouTube video has 128 PCIe 4.0 lanes, so plenty of I/O bandwidth. Nevertheless, it'll be interesting to see an M2 Ultra render farm on a TB4 ring, for science.
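As a back-of-the-envelope illustration of those link speeds (the 20 GB scene size is a made-up example, and the ~22Gbit/s usable TB4 figure is as cited above, not a measured value):

```python
# Rough transfer times for a hypothetical 20 GB scene over the links
# quoted above. The "usable TB4" number is the poster's cited figure.
scene_gb = 20  # hypothetical scene size

links_gbyte_per_s = {
    "10Gb Ethernet":      10 / 8,   # 1.25 GB/s
    "TB4 usable (cited)": 22 / 8,   # ~2.75 GB/s
    "PCIe 4.0 x16":       31.5,     # GB/s
}

for name, rate in links_gbyte_per_s.items():
    print(f"{name:>20}: {scene_gb / rate:5.1f} s to move {scene_gb} GB")
```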
 

jmho

macrumors 6502a
Jun 11, 2021
502
996
An animation is at least 24 frames per second * 60 seconds * the number of minutes, so the average 90-minute animated film at 24 fps is 24 * 60 * 90 = 129,600 frames - far from 10 frames. You need fast I/O to distribute the workload, move the scene assets to the render workers, and move the finished frames back to a central node to combine into an animation. Isn't Thunderbolt 4 a ring topology, and thus an ever-increasing bottleneck beyond a few nodes if data has to be recopied from node to node along the ring to the destination node? Furthermore, supposedly only 22Gbit/s of Thunderbolt 4's 40Gbit/s is usable for data transfer. For comparison, PCIe 4.0 x16 is 31.5GByte/s (not Gbit/s), and if you need more than 24GB of VRAM you can upgrade without throwing out the whole system to a 300W 48GB RTX A6000 with 115.2GByte/s NVLink between a pair; the 64-core Threadripper Pro 5995WX used in the YouTube video has 128 PCIe 4.0 lanes, so plenty of I/O bandwidth. Nevertheless, it'll be interesting to see an M2 Ultra render farm on a TB4 ring, for science.
It doesn't matter if there are 130k frames, because there is no temporal dependence between frames. Rendering frame 2 doesn't need any information from frame 1, so those machines don't need to talk to each other.

There is no real reason to put them in a ring either. You'd be better off connecting them all to a network.

If you have 1 CPU talking to 8 GPUs then yes, the CPU needs to be able to talk to 8 GPUs at once. If you have 8 computers, each computer doesn't need to know that the other 7 even exist.
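A minimal sketch of that embarrassingly-parallel setup (generic Python, with render_frame as a stand-in for the real renderer; no worker ever talks to another):

```python
# Each worker pulls an independent frame number; nothing is shared
# between frames, so 10 workers give ~10x scaling on 10 frames.
from multiprocessing import Pool

FRAMES = range(1, 11)  # the 10-frame example above

def render_frame(frame):
    # Stand-in for the real renderer: reads the shared scene once,
    # renders one frame, writes it out under its frame number.
    return f"frame_{frame:04d}.png"

if __name__ == "__main__":
    with Pool(processes=10) as pool:   # 10 workers ~ 10 Mac Studios
        print(pool.map(render_frame, FRAMES))
```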
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
It doesn't matter if there are 130k frames, because there is no temporal dependence between frames. Rendering frame 2 doesn't need any information from frame 1, so those machines don't need to talk to each other.

There is no real reason to put them in a ring either. You'd be better off connecting them all to a network.

If you have 1 CPU talking to 8 GPUs then yes, the CPU needs to be able to talk to 8 GPUs at once. If you have 8 computers, each computer doesn't need to know that the other 7 even exist.
I do have a question about that: if there are frames that are easier to process/render, how does the cluster know not to complete them out of order (I guess we are assuming this is a video animation we are rendering)? Or does the order not matter either?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
I do have a question about that: if there are frames that are easier to process/render, how does the cluster know not to complete them out of order (I guess we are assuming this is a video animation we are rendering)? Or does the order not matter either?

Why would the order matter? The rendered frames can be assembled into a video afterwards.
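A small sketch of why completion order is irrelevant: each frame's index lives in its filename, so assembly just sorts by name afterwards (the ffmpeg step is one common way to do this, shown as an assumption, not the thread's tooling):

```python
# Frames may finish in any order; the index in each filename fixes
# the sequence when the video is assembled afterwards.
import glob, subprocess

frames = sorted(glob.glob("frame_*.png"))  # name order == frame order
print(f"assembling {len(frames)} frames")

# One common assembly step (assumes ffmpeg is installed):
subprocess.run([
    "ffmpeg", "-framerate", "24",
    "-i", "frame_%04d.png",      # read frames by index, not finish time
    "-c:v", "libx264", "out.mp4",
])
```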
 
  • Like
Reactions: sirio76

leman

macrumors Core
Oct 14, 2008
19,521
19,675
What would you need to be sending over 10Gb Ethernet after the initial load?

Let's say you want to render a 10 frame animation on 10 Mac Studios. You have your scene on a shared drive, and each Mac Studio reads the scene into memory (sure, over 10Gb Ethernet). Then Mac Studio 1 renders frame 1, Mac Studio 2 renders frame 2, and so on. Congratulations, you've now rendered 10 frames in the time it takes 1 Mac Studio to render 1 frame. Perfect 10x scaling. 100 Studios would give you 100x scaling, etc.

(Obviously this is a very simple load balancing solution and you'd probably want a better one in practice.)

Compare that to the PC where, even if you've managed to load your scene into the VRAM of every single card, your single CPU is going to be sending commands back and forth over PCIe to each GPU constantly. Meanwhile, with the Mac you've got 10 CPUs - 1 CPU per GPU - so you're never going to get CPU bottlenecked.

What stops you from implementing the same distributed scheme on the PC?
 
  • Like
Reactions: singhs.apps

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Absolutely nothing. It's just that in my mind Mac Studios are begging to be stacked neatly on top of each other :D

And now imagine a Mac Pro with multiple slotted SoC boards (each with a healthy amount of private RAM), connected by a common 128-lane PCIe backplane that also hosts access to a large pool of shared memory. It would be ideal for the kind of application you describe.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Apple demonstrated ML training with four M1 Ultras.
[Attached image: Apple's chart showing ML training scaling across M1 Ultra nodes]
Scales pretty well. Seems to be using Horovod: https://en.wikipedia.org/wiki/Horovod_(machine_learning)
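For reference, the basic Horovod pattern looks roughly like this (a generic sketch of the library's data-parallel setup, not Apple's actual demo code; the node names in the launch command are placeholders):

```python
# Generic Horovod data-parallel sketch (not Apple's demo code).
# Launch across 4 nodes with something like:
#   horovodrun -np 4 -H node1:1,node2:1,node3:1,node4:1 python train.py
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()  # one process per node; hvd.size() == number of workers

model = tf.keras.applications.ResNet50(weights=None)
# Scale the learning rate by worker count and wrap the optimizer so
# gradients are averaged across all nodes every step.
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)
# Each worker then calls model.fit(...) on its own shard of the data.
```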
 
  • Like
Reactions: aytan

jmho

macrumors 6502a
Jun 11, 2021
502
996
And now imagine a Mac Pro with multiple slotted SoC boards (each with a healthy amount of private RAM), connected by a common 128-lane PCIe backplane that also hosts access to a large pool of shared memory. It would be ideal for the kind of application you describe.
Potentially, but at the same time there is something to be said for a render farm of smaller independent nodes.

If you have 10 nodes and one of them breaks, you still have 9 frames completed, and you can just swap the broken Mac Studio for a new one while the others continue working.

If you have a Mac Pro with 10 SoCs and something breaks, your entire render farm is out of action until you fix things (even if that hopefully is just pulling out 1 dead SoC).
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
And now imagine a Mac Pro with multiple slotted SoC boards (each with a healthy amount of private RAM), connected by a common 128-lane PCIe backplane that also hosts access to a large pool of shared memory. It would be ideal for the kind of application you describe.
I'm a cloud guy. You know that. If Apple is going to go through the trouble of doing that, don't you think it makes more sense for Apple to create a cloud version of this sort of setup where people can rent it?

The market for local workstations like that gets smaller every day. The market for cloud workstations is growing every day. Does Apple really want to be on the side of a declining trend?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
I'm a cloud guy. You know that. If Apple is going to go through the trouble of doing that, don't you think it makes more sense for Apple to create a cloud version of this sort of setup where people can rent it?

No, I don't, because that means competing in a very different market with very different margins. It's not Apple's business and I doubt they could make it profitable with their technology. They are simply not positioned for this kind of push.

The market for local workstations like that gets smaller every day. The market for cloud workstations is growing every day. Does Apple really want to be on the side of a declining trend?

This is a fair point. Very possible that you are right. But it's also possible that by offering a compelling product for a reasonable price a certain niche can be carved out. Apple's technology would work well in this scenario. Whether they are interested is a whole different question.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
it's also possible that by offering a compelling product for a reasonable price a certain niche can be carved out. Apple's technology would work well in this scenario.
Apple doesn't compete on price, it offers features that no one else has. What could Apple offer that no one else does? What would make an Apple render farm a success?
 

jmho

macrumors 6502a
Jun 11, 2021
502
996
Yeah, I wasn't sure about that - so the rendering can just spit out a bunch of PNGs and you can make a video of them afterwards?
Yeah. That's the preferred way to do things, again so that if one render gets messed up you can just fix one frame instead of needing to render everything again.
 
  • Like
Reactions: diamond.g

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Apple doesn't compete on price, it offers features that no one else has. What could Apple offer that no one else does? What would make an Apple render farm a success?

Exactly. I don't think there is anything.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Apple doesn't compete on price, it offers features that no one else has. What could Apple offer that no one else does? What would make an Apple render farm a success?
Same reason Macs could sell even when the value was poor - macOS. In this case, both macOS and possible integration with local macOS machines.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Same reason Macs could sell even when the value was poor - macOS. In this case, both macOS and possible integration with local macOS machines.

macOS is great on the local desktop. When running things in the cloud you often don't even know (or care) what OS you are running on. How would that be a selling point?

If you are talking about service integration for personal Mac computers, that's even worse; it's a very small market. If I am an academic AI researcher with a tight grant budget, I care about minimizing costs, not maximizing convenience. There is of course space for services like Xcode Cloud, which cannot be easily replicated, but Apple doesn't even need in-house hardware for that.
 
  • Like
Reactions: Xiao_Xi

sirio76

macrumors 6502a
Mar 28, 2013
578
416
If there is a little hope for daisy chaining, I'm sure I will go for it with a couple of Maxes and Ultras. Wish this could happen in my lifetime, not in a galaxy far, far away...
Getting more computers to render the same scene at the same time has been possible for ages; for example, you can run V-Ray DR over a number of slaves (Mac or PC, it doesn't matter) and you will get all the speed-up you need. Scaling is very good too, especially using bucket rendering.
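For readers unfamiliar with bucket rendering: the frame is diced into many small tiles ("buckets") that idle machines pull dynamically, so fast and slow buckets balance out. A generic sketch of the idea (plain Python, not V-Ray's actual API):

```python
# Generic bucket-rendering sketch (not V-Ray's API): dice one frame
# into small buckets and let each render slave pull the next one.
from itertools import product
from queue import Queue

WIDTH, HEIGHT, BUCKET = 1000, 500, 100

buckets = Queue()
for x, y in product(range(0, WIDTH, BUCKET), range(0, HEIGHT, BUCKET)):
    buckets.put((x, y, BUCKET, BUCKET))

def worker(name):
    while not buckets.empty():
        x, y, w, h = buckets.get()
        print(f"{name}: rendering bucket at ({x}, {y})")  # render stub

worker("slave-1")  # in practice every render slave runs this loop
```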
 
  • Like
Reactions: aytan and jmho

aytan

macrumors regular
Dec 20, 2022
161
110
Getting more computers to render the same scene at the same time has been possible for ages; for example, you can run V-Ray DR over a number of slaves (Mac or PC, it doesn't matter) and you will get all the speed-up you need. Scaling is very good too, especially using bucket rendering.
Sure, I have used C4D Team Render with a couple of Mac Pros and iMacs several times, but never V-Ray for this. Also, there is another problem: the shiny, magnificent, superior, magical ''Subscription Model'' :). For Maxon/Redshift you have to buy another license for each machine no matter what. I have no idea how it works with V-Ray DR; if the price is reasonable it could work on a sensible budget.
 