3D Rendering on Apple Silicon, CPU&GPU

exoticSpice · Oct 10, 2022

deconstruct60 said:
Debut in the Mac Pro? Errr, a tetherless, battery-only VR/AR headset likely has a much more pressing need for hyper-low-power RT hardware.

Several of these "work way smarter, not harder" patents smell far more like they are driven by the limits of a small battery than by a focus on building some large Nvidia x090/x080 or AMD x900/x800 'killer' GPU.

It would trickle out to the Macs (and other SoCs), but it won't be surprising if it doesn't start there.

Yes, it will debut in a headset, but all rumours point to the M2 being in the headset, and the M2 has no hardware-accelerated RT.

mi7chy · Oct 13, 2022

The 4090 is projected to be about 7.36x faster than the 64GPU M1 Ultra, based on Nvidia's claim that the 4090 has 1.7x the performance of the 3080 Ti. The 3080 Ti scores 5937.93 according to Blender Open Data, so 1.7x is 10094.48 vs 1371.46 for the 64GPU M1 Ultra. Blender supports multiple GPUs, so you can always create a $40K render farm of eight 64GPU Mac Studios to match one 4090.

Turns out 4090 is even faster than projected at ~9x over M1 Ultra 64GPU.

https://opendata.blender.org/benchmarks/query/
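For anyone who wants to check the projection, here is a minimal sketch of the arithmetic. The scores are the ones quoted in this thread (the measured 4090 figure of 12287.86 is the one posted further down); the variable names are just illustrative.

```python
# Blender Open Data scores as quoted in the thread (points).
SCORE_3080TI = 5937.93
SCORE_M1_ULTRA_64 = 1371.46
SCORE_4090_MEASURED = 12287.86  # figure quoted later in the thread
NVIDIA_CLAIMED_UPLIFT = 1.7     # Nvidia's claimed 4090 vs 3080 Ti uplift

# Projection: scale the 3080 Ti score by the claimed uplift.
projected_4090 = SCORE_3080TI * NVIDIA_CLAIMED_UPLIFT
projected_ratio = projected_4090 / SCORE_M1_ULTRA_64

# Measured: use the score actually reported for the 4090.
measured_ratio = SCORE_4090_MEASURED / SCORE_M1_ULTRA_64

print(f"projected 4090 score: {projected_4090:.2f}")
print(f"projected speedup over M1 Ultra 64: {projected_ratio:.2f}x")
print(f"measured speedup over M1 Ultra 64:  {measured_ratio:.2f}x")
```

The projected ratio rounds to 7.36x and the measured one to just under 9x, which matches the "~9x" above.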

exoticSpice · Oct 13, 2022

mi7chy said:
Turns out 4090 is even faster than projected at ~9x over M1 Ultra 64GPU.

https://opendata.blender.org/benchmarks/query/

No ****. It's got more cores and a massive 400 watt power draw.

jmho · Oct 13, 2022

The 4090 is an incredible card, and I hope Apple are sweating and rolling up their sleeves for the work ahead of them.

mi7chy · Oct 13, 2022

exoticSpice said:
No ****. It's got more cores and a massive 400 watt power draw.

Not only faster but more efficient at 2.2x greater performance per watt. 11/2/2022 will be interesting to see if AMD RDNA3 can answer. Guessing things will remain the same with Nvidia #1, AMD #2 and Apple last behind $329 Intel Arc A770.

4090
12287.86 points / 450W = 27.3 points/W

64GPU M1 Ultra
1371.46 points / ~110W = 12.5 points/W
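A quick sketch of the points-per-watt math, using the scores and wattages quoted above. Note the wattages are assumptions taken from the post (450W board power for the 4090, roughly 110W for the M1 Ultra), not measured draw during the benchmark.

```python
def points_per_watt(points: float, watts: float) -> float:
    """Benchmark points divided by assumed power draw in watts."""
    return points / watts

# Figures as quoted in the post; the wattages are nominal, not measured.
rtx_4090 = points_per_watt(12287.86, 450)
m1_ultra_64 = points_per_watt(1371.46, 110)

# Relative efficiency of the 4090 over the M1 Ultra under these assumptions.
efficiency_ratio = rtx_4090 / m1_ultra_64

print(f"4090:        {rtx_4090:.1f} points/W")
print(f"M1 Ultra 64: {m1_ultra_64:.1f} points/W")
print(f"ratio:       {efficiency_ratio:.1f}x")
```

The result is only as good as the wattage inputs: a lower real-world draw on either side moves the ratio accordingly.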

diamond.g · Oct 13, 2022

mi7chy said:
Not only faster but more efficient at 2.2x greater performance per watt. 11/2/2022 will be interesting to see if AMD RDNA3 can answer. Guessing things will remain the same with Nvidia #1, AMD #2 and Apple last.

4090
12287.86 points / 450W = 27.3 points/W

64GPU M1 Ultra
1371.46 points / ~110W = 12.5 points/W

I didn't think there were any loads to draw the full wattage from the M1 Ultra.

deconstruct60 · Oct 13, 2022

mi7chy said:
Not only faster but more efficient at 2.2x greater performance per watt. 11/2/2022 will be interesting to see if AMD RDNA3 can answer. Guessing things will remain the same with Nvidia #1, AMD #2 and Apple last behind $329 Intel Arc A770.

4090
12287.86 points / 450W = 27.3 points/W

64GPU M1 Ultra
1371.46 points / ~110W = 12.5 points/W


There is some context missing there, though. Nvidia has 3rd-gen specialized RT hardware (and a 2nd- to 3rd-gen software library). AMD and Intel have partial RT hardware. Apple has little fixed-function hardware for it at all.

Card	Blender (points)	Watts	pts/W
Nvidia 4090	12265	450	27
Nvidia 3080	5055	320	16
Intel 770	1630	225	7
Intel 750	1607	225	7
Intel 750M	1379	150	9
M1 Ultra 64	1371	110	12
M1 Ultra 48	1242	83	15
M1 Max 32	818	55	15
M1 Max 24	712	41	17
AMD 6800	1531	250	6
AMD 6700	1308	175	7
AMD 6650	1021	175	6
AMD 6600	1008	132	8
AMD 6700M	1082	135	8

On pts/W, Apple is clearly out in front of the Intel/AMD mobile offerings. With no specialized RT hardware, they approximately match the metrics of the 3080 with its 2nd-gen RT hardware. So Apple is spending no extra die space on this and doing better than Intel and AMD, who did.
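The pts/W column can be reproduced from the score and wattage columns. A minimal sketch, assuming the wattages in the table are nominal TDP-style figures rather than measured draw:

```python
# (card, Blender points, watts, pts/W as listed in the table)
TABLE = [
    ("Nvidia 4090", 12265, 450, 27),
    ("Nvidia 3080", 5055, 320, 16),
    ("Intel 770", 1630, 225, 7),
    ("Intel 750", 1607, 225, 7),
    ("Intel 750M", 1379, 150, 9),
    ("M1 Ultra 64", 1371, 110, 12),
    ("M1 Ultra 48", 1242, 83, 15),
    ("M1 Max 32", 818, 55, 15),
    ("M1 Max 24", 712, 41, 17),
    ("AMD 6800", 1531, 250, 6),
    ("AMD 6700", 1308, 175, 7),
    ("AMD 6650", 1021, 175, 6),
    ("AMD 6600", 1008, 132, 8),
    ("AMD 6700M", 1082, 135, 8),
]

# Recompute pts/W for every card and rank from most to least efficient.
ranked = sorted(
    ((card, points / watts) for card, points, watts, _ in TABLE),
    key=lambda row: row[1],
    reverse=True,
)

for card, ppw in ranked:
    print(f"{card:<12} {ppw:5.1f} pts/W")
```

Sorted this way, the 4090 is first, the M1 parts and the 3080 sit in the 12-17 range, and the Intel and AMD cards are all below 10.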

I highly doubt Apple is using the desktop 3090 or 4090 as a 'goal metric' for this next iteration. If Apple could cover the desktop AMD 6800 with a Max-sized die carrying a 32-40 core GPU and no TDP increase, then I suspect they would declare that a 'win'. [NOTE: the 3080 Ti laptop scores 3802.] If there is any 4000-series target, it would be the 4080M/4070M. Apple is stalking the largest mobile dGPUs that run inside larger laptops. Whatever they scale up from there is what they get to cover the upper desktop range. Where the Intel 750M is deployed as a mobile GPU (at roughly equal power to the M1 Ultra), the result isn't all that different. I don't think they are going to 100% catch the mobile 4070M, but covering a W6800X Duo with a "two-die SoC" would likely get trumpeted as a win.

If Apple adds some specific RT hardware in the next iteration or two, then the pts/W isn't going to be very far off at all.
That '27' is largely a fixed-function hardware difference, where Nvidia has taken the bulk of the work off the general computation cores. Nvidia actually has less room to move "more" to fixed function at this point. (Nvidia tossed NVLink from the die to toss in more specialized hardware.) For the dies destined for the Studio/Mac Pro, it is unlikely Apple is going to toss UltraFusion for more specialized cores. So they are likely mostly waiting on higher transistor allocation budgets (e.g., TSMC N3 and after) for major moves.

leman · Oct 14, 2022

exoticSpice said:
No ****. It's got more cores and a massive 400 watt power draw.

Only twice as many cores though. The performance advantage is mostly because of hardware RT acceleration and of course the much more mature backend. Would be interesting to know the actual power usage

sirio76 · Oct 14, 2022

There are already tests of power consumption (for the FE card): when heavily utilized it will consume from 400 up to 450W, likely more for cards from other 4090 manufacturers. People who like to compare that to an M1 should not forget that this is just for the GPU; realistically the system consumption will easily be 700W or more. Nvidia recommends at least an 850W PSU, and a bare minimum of 1000W for a TR system. Not saying that the performance is not good, but it’s just in a different league when it comes to power consumption, and noise, and heat.

quarkysg · Oct 14, 2022

One advantage (probably not a big one, I suppose) that Apple's UMA has over the traditional design is that all IP blocks (CPU, GPU, NPU, AMX, ISP) can work on the same dataset to solve a problem simultaneously.

Xiao_Xi · Oct 14, 2022

sirio76 said:
it’s just in a different league when it comes to power consumption, and noise, and heat.

And efficiency at high performance because the more capable an Apple GPU is, the less efficient it is.

leman · Oct 14, 2022

quarkysg said:
One advantage (probably not a big one, I suppose) that Apple's UMA has over the traditional design is that all IP blocks (CPU, GPU, NPU, AMX, ISP) can work on the same dataset to solve a problem simultaneously.

It's a massive advantage, it's just that most software is designed with dGPU model in mind so it doesn't take advantage of it. Another big advantage is the ability to work with very large datasets.

Xiao_Xi said:
And efficiency at high performance because the more capable an Apple GPU is, the less efficient it is.

Can you elaborate on this?

Xiao_Xi · Oct 14, 2022

leman said:
Can you elaborate on this?

The higher the number of cores, the lower the points/W of Apple's GPU.

GPU	Points	Consumption(W)	Points / W
M1 Ultra 64	1371	110	12
M1 Ultra 48	1242	83	15
M1 Max 32	818	55	15
M1 Max 24	712	41	17
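To make the trend explicit, here is a small sketch that derives points per GPU core alongside points per watt from the table above. The wattage figures are the ones listed in the table and are treated as assumptions here.

```python
# (name, GPU cores, Blender points, assumed watts) from the table above.
APPLE_GPUS = [
    ("M1 Max 24", 24, 712, 41),
    ("M1 Max 32", 32, 818, 55),
    ("M1 Ultra 48", 48, 1242, 83),
    ("M1 Ultra 64", 64, 1371, 110),
]

points_per_core = {name: pts / cores for name, cores, pts, _ in APPLE_GPUS}
points_per_watt = {name: pts / watts for name, _, pts, watts in APPLE_GPUS}

# Per-core throughput of each part relative to the 24-core M1 Max.
baseline = points_per_core["M1 Max 24"]
relative_per_core = {name: ppc / baseline for name, ppc in points_per_core.items()}

for name, cores, _, _ in APPLE_GPUS:
    print(
        f"{name:<12} {points_per_core[name]:5.1f} pts/core "
        f"({relative_per_core[name]:.0%} of 24-core) "
        f"{points_per_watt[name]:5.1f} pts/W"
    )
```

On these numbers each core in the 64-core Ultra delivers roughly 72% of what a core delivers in the 24-core Max. The table alone does not say whether that is a hardware or a software effect.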

leman · Oct 14, 2022

Xiao_Xi said:
The higher the number of cores, the lower the points/W of Apple's GPU.

GPU	Points	Consumption(W)	Points / W
M1 Ultra 64	1371	110	12
M1 Ultra 48	1242	83	15
M1 Max 32	818	55	15
M1 Max 24	712	41	17

This only illustrates that there is a software problem with how Blender currently utilizes the hardware.

diamond.g · Oct 14, 2022

Xiao_Xi said:
The higher the number of cores, the lower the points/W of Apple's GPU.

GPU	Points	Consumption(W)	Points / W
M1 Ultra 64	1371	110	12
M1 Ultra 48	1242	83	15
M1 Max 32	818	55	15
M1 Max 24	712	41	17

What is even more interesting is that the 3090 (base) gets the same 17 points per watt as the M1 Max 24. For the 4090 to double the 3090's score yet use only 29% more power is impressive for sure.
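As a rough sanity check, doubling the score at 29% more power implies a specific efficiency gain. The sketch below uses only the ratios and the 17 points per watt stated in this post, and does not assume any particular 3090 score or wattage.

```python
# Ratios as stated in the post.
SCORE_RATIO_4090_VS_3090 = 2.0   # "double the 3090 score"
POWER_RATIO_4090_VS_3090 = 1.29  # "29% more power"
PPW_3090 = 17                    # points per watt quoted for the base 3090

# Efficiency gain = how much more work per watt the 4090 does.
efficiency_gain = SCORE_RATIO_4090_VS_3090 / POWER_RATIO_4090_VS_3090

# Points per watt this implies for the 4090.
implied_ppw_4090 = PPW_3090 * efficiency_gain

print(f"efficiency gain:       {efficiency_gain:.2f}x")
print(f"implied 4090 points/W: {implied_ppw_4090:.1f}")
```

The implied figure lands close to the ~27 points/W quoted for the 4090 earlier in the thread, so the numbers are at least self-consistent.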

Xiao_Xi · Oct 14, 2022

diamond.g said:
For the 4090 to double the 3090's score yet use only 29% more power is impressive for sure.

That's the magic of TSMC.

Nvidia’s RTX 4090 Launch: A Strong Ray-Tracing Focus

Recently, Nvidia announced their RTX 4000 series at GTC 2022. At a high level, Nvidia took advantage of a process node shrink to evolve their architecture while also scaling it up and clocking it m…

chipsandcheese.com

singhs.apps · Oct 14, 2022

&

diamond.g · Oct 14, 2022

leman said:
Only twice as many cores though. The performance advantage is mostly because of hardware RT acceleration and of course the much more mature backend. Would be interesting to know the actual power usage

I noticed there are no 3090ti or 4090 CUDA scores for the Blender Bench. With what is there, the 3080 is twice as fast as the M1 Ultra 64. AFAICT CUDA doesn't use the RT backend so it is more of a "fair fight" with Metal.

Xiao_Xi · Oct 14, 2022

diamond.g said:
I noticed there are no 3090ti or 4090 CUDA scores for the Blender Bench. With what is there, the 3080 is twice as fast as the M1 Ultra 64. AFAICT CUDA doesn't use the RT backend so it is more of a "fair fight" with Metal.

CUDA cores are slower than RT cores, so people don't use them.

Blender - Open Data

Blender Open Data is a platform to collect, display and query the results of hardware and software performance tests - provided by the public.

opendata.blender.org

leman · Oct 14, 2022

diamond.g said:
I noticed there are no 3090ti or 4090 CUDA scores for the Blender Bench. With what is there, the 3080 is twice as fast as the M1 Ultra 64. AFAICT CUDA doesn't use the RT backend so it is more of a "fair fight" with Metal.

The 3080 has 30 TFLOPS, the M1 Ultra has 20 TFLOPS, so that makes a difference of 50%. Whether the other 50% is because of software or hardware maturity or some other factor, we don't know. What's quite interesting, though, is that if the 24-core M1 Max scaled linearly we would expect 712/24*64 = ~1900 points, and multiplying that by 1.5 (the 50% TFLOPS difference to the 3080) yields 2850, which is oddly similar to the CUDA score (2900) that the 3080 has. Sure, it's just some random napkin arithmetic, but the sheer coincidence of all this almost suggests that the Metal Blender backend has some massive thread scaling issues.

P.S. Your wattage estimates for the M1 blender seem quite off. I am not even breaking 30W running blender benchmark on my 32-core Max...
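The napkin arithmetic above can be written out as a short sketch. All inputs are the figures quoted in this post (including the rounded TFLOPS values and the ~2900 CUDA score for the 3080), so treat the output as a back-of-the-envelope estimate rather than a measurement.

```python
# Figures quoted in the post.
M1_MAX_24_SCORE = 712
M1_ULTRA_64_SCORE = 1371      # measured Metal score for the 64-core Ultra
RTX_3080_CUDA_SCORE = 2900    # approximate CUDA score quoted for the 3080
TFLOPS_3080 = 30
TFLOPS_M1_ULTRA = 20

# Step 1: what a 64-core GPU would score if the 24-core result scaled linearly.
linear_64_core = M1_MAX_24_SCORE / 24 * 64

# Step 2: scale by the raw TFLOPS ratio to get a hypothetical 3080-class figure.
tflops_ratio = TFLOPS_3080 / TFLOPS_M1_ULTRA
tflops_scaled = linear_64_core * tflops_ratio

# How far the actual 64-core score falls short of linear scaling.
scaling_efficiency = M1_ULTRA_64_SCORE / linear_64_core

print(f"linear 64-core estimate: {linear_64_core:.0f}")
print(f"scaled by TFLOPS ratio:  {tflops_scaled:.0f} (3080 CUDA: ~{RTX_3080_CUDA_SCORE})")
print(f"actual vs linear:        {scaling_efficiency:.0%}")
```

The TFLOPS-scaled estimate lands within about 2% of the quoted CUDA score, which is the coincidence being pointed out. It does not prove a scaling problem in the Metal backend, it only shows the numbers are consistent with one.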

diamond.g · Oct 14, 2022

leman said:
The 3080 has 30 TFLOPS, the M1 Ultra has 20 TFLOPS, so that makes a difference of 50%. Whether the other 50% is because of software or hardware maturity or some other factor, we don't know. What's quite interesting, though, is that if the 24-core M1 Max scaled linearly we would expect 712/24*64 = ~1900 points, and multiplying that by 1.5 (the 50% TFLOPS difference to the 3080) yields 2850, which is oddly similar to the CUDA score (2900) that the 3080 has. Sure, it's just some random napkin arithmetic, but the sheer coincidence of all this almost suggests that the Metal Blender backend has some massive thread scaling issues.

P.S. Your wattage estimates for the M1 blender seem quite off. I am not even breaking 30W running blender benchmark on my 32-core Max...

I dunno where the wattage estimates come from, I barely see 20W on my 24 core model.

Going by TFLOP is a poor metric as the Ultra is "slower" than a 3060 Laptop GPU which is a 7 TFLOP card (10 if you use the boost number). But I see your point otherwise.

leman · Oct 14, 2022

diamond.g said:
Going by TFLOP is a poor metric as the Ultra is "slower" than a 3060 Laptop GPU which is a 7 TFLOP card (10 if you use the boost number). But I see your point otherwise.

Those results are weird anyway since the mobile 3060 is somehow faster than the desktop one?

diamond.g · Oct 14, 2022

leman said:
Those results are weird anyway since the mobile 3060 is somehow faster than the desktop one?

Weird isn't it?

sirio76 · Oct 14, 2022

Xiao_Xi said:
CUDA cores are slower than RT cores, so people don't use them.

It doesn’t work like that. The GPU will use all the available cores when rendering, both traditional and RTX, and RTX hardware is not “faster”; it’s just designed to speed up some aspects of the rendering. Depending on the scene, RT cores may accelerate the rendering a lot, in other cases not so much.

diamond.g · Oct 14, 2022

sirio76 said:
It doesn’t work like that. The GPU will use all the available cores when rendering, both traditional and RTX, and RTX hardware is not “faster”; it’s just designed to speed up some aspects of the rendering. Depending on the scene, RT cores may accelerate the rendering a lot, in other cases not so much.

What is the point of the OptiX renderer if the CUDA renderer also uses the RT hardware?
