What I think is even more impressive is how well it still does in low power mode. After 5 minutes I don't think it will change much running it for 10. It draws about 1/3 of the power at half the speed.
[Attached screenshot: Screenshot 2024-11-15 at 18.48.31.png]
 
Excellent, scores broken down by scene! So correct me if I'm wrong, but of the three, Junkshop is the biggest, most difficult render with the biggest memory footprint, and Monster is the least, yes? In the Redshift part of the video I think he misspoke and said his 4070 has 8GB of VRAM, but it should have 12. Probably just a misspeak. But I do wonder if he has a factory overclocked desktop 4070 (or he has a 4070 SUPER). His Blender scores are more in line with the 4070 SUPER than the standard 4070; the SUPER scored about 16% higher across all Blender scenes than the M4 Max/standard 4070.

 
Yeah, I typed the above too quickly and reused the Nvidia numbers, since like Nvidia, Apple also has 4 subunits and 128 total FP32 pipes in a core. But the whole point was that Nvidia's numbers include the optional FP32 pipes while Apple's do not, resulting in double the number of pipes per subunit even beyond the extra FP16 pipes. That was the entire point of writing it out like that, and I completely goofed it. Sorry.

Edited to this:
1 Core/SM is 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 32 FP32, 32 INT32, 32 FP16, 32 Complex

Nvidia has the same theoretical FP32 output per clock per core only if it has no simultaneous integer or FP16 work, and Apple has double the possible integer output per clock per core.
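To put rough numbers on that, here's a minimal sketch of the per-clock, per-core arithmetic implied by the breakdown above (the lane counts come straight from the lists; everything else is just multiplication):

```python
# Per-core/SM lane counts from the breakdown above: 4 subunits per core/SM.
SUBUNITS = 4

# Nvidia subunit: 16 dedicated FP32 lanes + 16 shared FP32/INT32 lanes
nv_fp32_peak     = SUBUNITS * (16 + 16)  # 128 FP32 ops/clock if the shared lanes do FP32
nv_fp32_with_int = SUBUNITS * 16         # 64 FP32 ops/clock when the shared lanes do INT32
nv_int32_peak    = SUBUNITS * 16         # 64 INT32 ops/clock

# Apple subunit (per the list above): 32 FP32 + 32 INT32 + 32 FP16 + 32 complex
apple_fp32_peak  = SUBUNITS * 32         # 128 FP32 ops/clock, regardless of INT32 work
apple_int32_peak = SUBUNITS * 32         # 128 INT32 ops/clock

print(nv_fp32_peak, nv_fp32_with_int, nv_int32_peak)  # 128 64 64
print(apple_fp32_peak, apple_int32_peak)              # 128 128
```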



If Robert didn't know, then at least at the time of his posts that wasn't something Nvidia had officially clarified. If they have issued a clarification since then, I have not been able to find it so far. My issue is this (and I'm sure I'm wrong here somewhere): if the FMA itself only takes a single cycle to complete, then attempting to issue subsequent FMAs should result in tandem, not parallel, execution, no? Perhaps my mental model here is wrong. Here's what I think is going on:

A warp consisting of 32 threads should execute over two subunits, right? And then, barring divergences of course, both subunits are scheduled simultaneously and execute together. If there are two independent FP32 calculations per thread and the Nvidia GPU is truly superscalar, then the warp scheduler can issue commands to the INT32/FP32 pipe simultaneously with the FP32 pipe, completing 64 FP32 operations per cycle across two subunits. However, if the Nvidia GPU is not superscalar, then by the time it schedules the work for the combined pipe, the FP32 pipe should already be done, and you now have tandem execution with no increase in throughput per clock cycle.

No. The way nV did it in the past (and I assume still does it) is
- clock 0: send out the first half of an FP32 instruction (16 lanes to a 16-wide execution unit).
- clock 1: the second half of that FP32 instruction executes, but the scheduler is free, so we can also send out the first half of say an FP16 instruction
- clock 2: the second half of that FP16 instruction executes, and the scheduler can now send out say another FP32 instruction, or maybe an INT32 instruction.
etc etc
This gives you effectively two ops per cycle execution (but still only 32-wide executes per cycle, maybe 16 FP32 and 16 INT32 in this particular cycle).

Given that a "sub-core" is 16-wide not 32-wide, it does achieve the basic goal of getting "side execution" (eg INT32 or FP16) for free by using HW that is there anyway, but never taking a cycle away from the primary FP32 line of execution. The extent to which it's valuable depends on how much side work in INT32 or FP16 you have available. On nV I think the number they said was something like ~30% on average. (Mainly int multiplies to generate addresses into 2D or 3D arrays.)

Again, I think my mental model is wrong, or at least, given my mental model, I'm not quite following why we would need to do this at all, since the way I'm thinking about it even a simple throughput test should tell the difference. Since that's presumably not the case, I must be mistaken somewhere. If both of you and Robert are saying this scheme is reasonable, it most likely is. :)


I know Nvidia has 64K 32-bit registers per SM in its register file; do we know what Apple is using now? Chips and Cheese didn't go into that in their unfortunately brief look at the M2 (which is obviously not the M3, which probably changed things dramatically).
Remember Apple has a very different scheme now; the number of registers is not fixed. There is simply a pool of per-core SRAM that is dynamically split into registers, Threadblock Scratchpad, and SRAM. It's even smarter than this in that this pool of SRAM spills to L2 like a normal cache, so registers that are not used for a while will migrate out to L2 to have their local SRAM used by something that values it more.
The details of this are explained in my PDF.
It's essentially like a second layer of virtual memory addressing, call it "private memory", sitting on the virtual address space.
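For intuition only, here's a tiny toy model of that idea; the sizes and names are made up, and the real mechanism is obviously far more sophisticated, but it shows the shape of it: one per-core pool handing out space for registers, threadgroup scratchpad and cache-like data, with cold blocks migrating out to L2 when the pool fills.

```python
# Conceptual toy of a dynamically partitioned per-core SRAM pool.
# All sizes and names are hypothetical; the only point is that registers,
# scratchpad and cache share one pool, and cold data spills out to L2.
from collections import OrderedDict

POOL_BYTES = 64 * 1024  # hypothetical per-core pool size

class CorePool:
    def __init__(self, capacity=POOL_BYTES):
        self.capacity = capacity
        self.used = 0
        self.blocks = OrderedDict()   # name -> (kind, size), kept in recency order
        self.spilled_to_l2 = []

    def touch(self, name):
        self.blocks.move_to_end(name)  # mark a block as recently used

    def allocate(self, name, kind, size):
        # Evict the coldest blocks until the new allocation fits.
        # (In the real design it's idle registers that migrate out to L2.)
        while self.used + size > self.capacity:
            victim, (vkind, vsize) = next(iter(self.blocks.items()))
            del self.blocks[victim]
            self.used -= vsize
            self.spilled_to_l2.append(victim)
            print(f"spill {victim} ({vkind}, {vsize} B) to L2")
        self.blocks[name] = (kind, size)
        self.used += size

pool = CorePool()
pool.allocate("regs_warp0", "registers", 32 * 1024)
pool.allocate("tg_scratch", "scratchpad", 24 * 1024)
pool.allocate("regs_warp1", "registers", 16 * 1024)  # forces regs_warp0 out to L2
```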

I should also state I made another error in my above statements about caches.



I said Nvidia doesn't have the equivalent of the SLC, but that's not really true. Nvidia has 128KB of L1 per SM (256KB for compute cards like the H100) and tens of MB of L2, which in fact functions as a last-level cache similar to Apple's SLC. Apparently on some of Nvidia's compute cards there is even a "near L2" and "far L2". What Nvidia doesn't have is the equivalent of Apple's L2 cache - though again it isn't clear how the M3/M4 have changed things. Again, sorry for any confusion.

Again the details make a big difference; there's a lot more going on than just capacities if you want to map one design onto another.
As far as I can tell, for Apple
- within the core proper, addressing is in private address space (ie local SRAM is accessed by a private address)
- within the GPU, addressing is in virtual address space
- when an address leaves the GPU it is translated to physical address space
- the SLC operates in physical address space

This may seem esoteric, but it has implications for the costs of memory lookups. If you simply run your addressing on GPU the way a CPU does, you have a huge problem (with both power and area) in providing enough TLB lookups per cycle. I think nV have to pay this price. Meanwhile Apple's equivalent lookup (private address space->virtual address space TLB) doesn't have to happen until you hit L2, and can be optimized for the GPU (no concerns about weird aliasing with the rest of the system). The costs of a virtual->physical lookup (where you do have to worry about the rest of the system) are only paid when you transit from off the GPU to SLC.
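A very rough way to picture that layering (purely conceptual; the tables and addresses below are made up, and the only point is where each translation happens):

```python
# Conceptual sketch of the three-layer addressing described above.
# Everything here is invented for illustration; the point is only that the
# private->virtual translation is deferred until L2, and virtual->physical
# until traffic leaves the GPU for the SLC/DRAM.

# per-core table: private address -> virtual address (GPU-local, GPU-optimized)
private_to_virtual = {0x100: 0x7f00_1100}
# system page table: virtual address -> physical address (shared with the SoC)
virtual_to_physical = {0x7f00_1000: 0x2_3000}

def gpu_load(private_addr, hits_core_sram=True, hits_gpu_l2=True):
    if hits_core_sram:
        return f"served from core-local SRAM, no translation at all ({hex(private_addr)})"
    # only now do we need the private -> virtual lookup (at L2)
    vaddr = private_to_virtual[private_addr]
    if hits_gpu_l2:
        return f"served from GPU L2 at virtual {hex(vaddr)}, no page-table walk"
    # only when leaving the GPU do we pay the virtual -> physical cost (SLC/DRAM)
    paddr = virtual_to_physical[vaddr & ~0xfff] + (vaddr & 0xfff)
    return f"goes to SLC/DRAM at physical {hex(paddr)}"

print(gpu_load(0x100))
print(gpu_load(0x100, hits_core_sram=False))
print(gpu_load(0x100, hits_core_sram=False, hits_gpu_l2=False))
```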

Meanwhile the SLC has its own set of advantages. Being memory side (in other words the memory controllers are aware of everything in the SLC, unlike a standard L3) has lots of advantages for DMA and interacting with IO.
But for the GPU, what's more important is that there is a massive level of control available for the SLC. You don't just have data moved in and out based on LRU. You can define multiple different datastreams, each tagged with a DSID (data stream ID), and have these behave differently. Some data you can lock in the cache. Some you can stream through the cache (so replace MRU rather than LRU). Some you can "decimate" in the cache (eg store every fourth line, so that you can work on that line [pulled in rapidly from the SLC] while the three non-present lines are being prefetched from DRAM). Apple's SLC is way beyond what anyone else has -- as you can see when you look at the die. The control logic for the SLC (sitting in the middle, surrounded by SRAM) is HUGE, like the size of 4 E-cores...
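Again purely for intuition, here is a toy cache with per-stream policies matching the behaviours just listed (lock, stream-through, decimate, plain LRU); the DSIDs, policy names and sizes are all invented:

```python
# Toy cache illustrating per-data-stream policies like those described above.
# DSIDs, policy names and sizes are all hypothetical.
from collections import OrderedDict

LINE = 128          # hypothetical line size in bytes
CAPACITY_LINES = 8  # absurdly small, just to show evictions

class ToySLC:
    def __init__(self):
        self.lines = OrderedDict()   # line address -> dsid, in recency order
        self.policy = {}             # dsid -> "LRU" | "LOCK" | "STREAM" | "DECIMATE"

    def set_policy(self, dsid, policy):
        self.policy[dsid] = policy

    def access(self, addr, dsid):
        addr -= addr % LINE
        pol = self.policy.get(dsid, "LRU")
        if pol == "DECIMATE" and (addr // LINE) % 4 != 0:
            return                        # keep only every fourth line of this stream
        if addr in self.lines:
            self.lines.move_to_end(addr)  # refresh recency on a hit
            return
        if len(self.lines) >= CAPACITY_LINES:
            self._evict()
        self.lines[addr] = dsid
        if pol == "STREAM":
            # approximate "stream through": make this line the first eviction candidate
            self.lines.move_to_end(addr, last=False)

    def _evict(self):
        # evict the oldest line whose stream is not locked in the cache
        # (if literally everything is locked, this toy simply over-fills)
        for addr, dsid in self.lines.items():
            if self.policy.get(dsid, "LRU") != "LOCK":
                del self.lines[addr]
                return

slc = ToySLC()
slc.set_policy(1, "LOCK")      # e.g. a lookup table we never want evicted
slc.set_policy(2, "STREAM")    # bulk data touched once, streamed through
slc.set_policy(3, "DECIMATE")  # keep every fourth line while the rest prefetches
for i in range(12):
    slc.access(i * LINE, dsid=3)   # only lines 0, 4 and 8 end up resident
```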

John McCalpin (the guy behind the HPC STREAM benchmark) says the biggest flaw in HPC is that you can't communicate to the hardware how you plan to use streams of data. Apple don't have that issue...

(BTW another feature that's probably now present in the M4, but we haven't yet seen it do anything, and may not until the next OS, has to do with co-ordination between IP blocks. Suppose you want to split neural network processing across the ANE and the GPU. This requires each side to know when the other side has completed a task.
The obvious way to do this is by flipping bits at some address which the other side is monitoring, like locks or mutexes when doing CPU parallel programming. But this sucks because it has to route through the coherency mechanisms up to the SLC then across to the other IP block.
You can do much better if you give up on copying CPU ideas from 1980 and provide a dedicated mechanism for each IP block to indicate to another IP block that an event has happened. This is probably the case with the M4 -- the timing of the patent suggests probably M4.
I don't know if the CPU/GPU interaction will also get this. It would be nice -- right now the cost of CPU/GPU synch is so high that many good ideas for moving work between the two are impractical. My guess is that what we will have in the M4 is exploratory, between IP blocks that Apple has more control over, like the GPU, ANE, media encode/decode, and ISP. Then, based on what is learned, the idea will be added to the CPU, which requires more flexibility and a careful API, because of 3rd party developers.)
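By analogy with CPU-side code (and only as an analogy -- none of this is a real Apple or GPU API), the difference is essentially busy-polling a shared flag versus having a dedicated notification primitive the other side can wait on:

```python
# CPU-threading analogy for the two signalling styles described above.
import threading, time

shared_flag = {"done": False}

def poll_for_completion():
    # Style 1: the other side flips a bit at a shared address and we poll it.
    # On real hardware every poll/update round-trips through the coherency
    # fabric (up to the SLC and across), which is the cost described above.
    while not shared_flag["done"]:
        time.sleep(0.001)
    print("polling side: saw the flag flip")

done_event = threading.Event()

def wait_for_event():
    # Style 2: a dedicated event/doorbell the producer signals directly.
    done_event.wait()
    print("event side: woken by the doorbell")

threading.Thread(target=poll_for_completion).start()
threading.Thread(target=wait_for_event).start()
time.sleep(0.01)        # pretend the "GPU" side finishes its task here
shared_flag["done"] = True
done_event.set()
```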

I do wonder about improvements to the L1 cache bandwidth and/or INT scheduling - like did the M4 improve those things over the M3 or is it literally just an upclocked M3 with better memory bandwidth? I keep going back and forth but given some of these results, maybe Apple made a few tweaks to improve them?

I suspect GPU and CPU are now both on essentially 2 year development tracks. Minor tweaks are added in alternate years, but the big jumps happen on a two year cadence. M4 is a CPU model, M5 will be a GPU model.

M3 GPU gave us
- ray tracing
- dynamic cache
Dynamic cache almost certainly got scheduling tweaks of the form I described (allow allocation and deallocation to overlap with "genuine" GPU work). Ray tracing probably also got tweaks based on what was learned from the M3.

But M4 is mainly about boosted CPU smartness. Next year I expect we'll get a "disappointing" CPU boost along with an unexpectedly impressive GPU boost.
(Two places where nV is ahead of Apple are
- nV can co-ordinate work between multiple GPU cores, so at a granularity larger than a threadblock.
- nV has tweaked the precise definition of SIMT to allow thread lanes to "decouple" in their execution, which in turn allows a variety of algorithms that would deadlock on a pure SIMT design.

As I've said before, Olivier Giroux, who oversaw this sort of stuff at nV, has been at Apple for a few years now. My guess is they are well aware of, and want to improve on, both these functionalities, and the infrastructure for dynamic cache, ie the private address spaces, is step one in hitting that goal. We'll hopefully see step two with the M5.

The truly crazy thing would be Apple shipping the equivalent of Game Porting Toolkit for CUDA. I suspect they've held off doing this for now, not least because they're simply missing the fanciest CUDA functionality like what I described above. A CUDA porting toolkit just looks lame if the end result is still that all the fanciest Graphics Gems code can't port because you don't support large co-ordination groups, or the algorithms deadlock on your hardware! But when these limitations no longer exist...)
 
Perhaps my mental model here is wrong.

Maynard already provided a great explanation. I'll just reiterate using slightly different words. It all boils down to Nvidia executing a 32-wide workload over two cycles on a 16-wide unit: say one cycle for the lower 16 values and the next cycle for the upper 16. Each scheduler has two subunits, and each subunit takes two cycles to execute. This means that once an instruction is issued, the subunit is busy for another cycle, so the scheduler ping-pongs between the two units. This has essentially the same effect as dual-issue, just without the complicated machinery of actually tracking and issuing two instructions per cycle. It's quite smart, really.
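A tiny toy model of that ping-pong, purely illustrative (it only assumes one issue per scheduler per clock, two 16-wide pipes, and two clocks of occupancy per 32-wide instruction):

```python
# Toy model of the issue cadence: each 32-wide instruction occupies a 16-wide
# pipe for two clocks, so on the second clock the scheduler is free to start
# another instruction on the other pipe.
instructions = ["FP32 a", "FP16 b", "FP32 c", "INT32 d"]  # hypothetical stream

busy_until = {"pipe0": 0, "pipe1": 0}  # clock at which each 16-wide pipe frees up
clock = 0
for inst in instructions:
    pipe = min(busy_until, key=busy_until.get)   # ping-pong falls out naturally
    clock = max(clock, busy_until[pipe])         # wait if both pipes are busy
    print(f"clock {clock}: issue {inst!r} to {pipe} (occupies it for 2 clocks)")
    busy_until[pipe] = clock + 2
    clock += 1                                   # at most one issue per clock
```

Run it and a new 32-wide instruction starts every clock, even though only 32 lanes' worth of work actually executes per clock.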
 
No. The way nV did it in the past (and I assume still does it) is
- clock 0: send out the first half of an FP32 instruction (16 lanes to a 16-wide execution unit).
- clock 1: the second half of that FP32 instruction executes, but the scheduler is free, so we can also send out the first half of say an FP16 instruction
- clock 2: the second half of that FP16 instruction executes, and the scheduler can now send out say another FP32 instruction, or maybe an INT32 instruction.
etc etc
This gives you effectively two ops per cycle execution (but still only 32-wide executes per cycle, maybe 16 FP32 and 16 INT32 in this particular cycle).

Given that a "sub-core" is 16-wide not 32-wide, it does achieve the basic goal of getting "side execution" (eg INT32 or FP16) for free by using HW that is there anyway, but never taking a cycle away from the primary FP32 line of execution. The extent to which it's valuable depends on how much side work in INT32 or FP16 you have available. On nV I think the number they said was something like ~30% on average. (Mainly int multiplies to generate addresses into 2D or 3D arrays.)
Maynard already provided a great explanation. I'll just reiterate using slightly different words. It all boils down to Nvidia executing a 32-wide workload over two cycles on a 16-wide unit: say one cycle for the lower 16 values and the next cycle for the upper 16. Each scheduler has two subunits, and each subunit takes two cycles to execute. This means that once an instruction is issued, the subunit is busy for another cycle, so the scheduler ping-pongs between the two units. This has essentially the same effect as dual-issue, just without the complicated machinery of actually tracking and issuing two instructions per cycle. It's quite smart, really.

Ah okay I think I get it now. I'll probably have to reread a couple of more times to make sure it sinks in. Thanks guys, appreciate it!

Remember Apple has a very different scheme now; the number of registers is not fixed. There is simply a pool of per-core SRAM that is dynamically split into registers, Threadblock Scratchpad, and SRAM. It's even smarter than this in that this pool of SRAM spills to L2 like a normal cache, so registers that are not used for a while will migrate out to L2 to have their local SRAM used by something that values it more.
The details of this are explained in my PDF.
It's essentially like a second layer of virtual memory addressing, call it "private memory", sitting on the virtual address space.
Right, I meant how big the pool is that Apple has access to now with the dynamic cache - how much total data can it hold per core? I know it's super flexible now, I'm just wondering about the total capacity. Also, prior to the M3, I assume Apple had a more traditional register file in the M1 and M2, and I'm not sure how big that was either. But maybe I'm wrong on that count as well.
Again the details make a big difference; there's a lot more going on than just capacities if you want to map one design onto another.
As far as I can tell, for Apple
- within the core proper, addressing is in private address space (ie local SRAM is accessed by a private address)
- within the GPU, addressing is in virtual address space
- when an address leaves the GPU it is translated to physical address space
- the SLC operates in physical address space

This may seem esoteric, but it has implications for the costs of memory lookups. If you simply run your addressing on GPU the way a CPU does, you have a huge problem (with both power and area) in providing enough TLB lookups per cycle. I think nV have to pay this price. Meanwhile Apple's equivalent lookup (private address space->virtual address space TLB) doesn't have to happen until you hit L2, and can be optimized for the GPU (no concerns about weird aliasing with the rest of the system). The costs of a virtual->physical lookup (where you do have to worry about the rest of the system) are only paid when you transit from off the GPU to SLC.

Meanwhile the SLC has its own set of advantages. Being memory side (in other words the memory controllers are aware of everything in the SLC, unlike a standard L3) has lots of advantages for DMA and interacting with IO.
But for the GPU, what's more important is that there is a massive level of control available for the SLC. You don't just have data moved in and out based on LRU. You can define multiple different datastreams, each tagged with a DSID (data stream ID), and have these behave differently. Some data you can lock in the cache. Some you can stream through the cache (so replace MRU rather than LRU). Some you can "decimate" in the cache (eg store every fourth line, so that you can work on that line [pulled in rapidly from the SLC] while the three non-present lines are being prefetched from DRAM). Apple's SLC is way beyond what anyone else has -- as you can see when you look at the die. The control logic for the SLC (sitting in the middle, surrounded by SRAM) is HUGE, like the size of 4 E-cores...

John McCalpin (the guy behind the HPC STREAM benchmark) says the biggest flaw in HPC is that you can't communicate to the hardware how you plan to use streams of data. Apple don't have that issue...

(BTW another feature that's probably now present in the M4, but we haven't yet seen it do anything, and may not until the next OS, has to do with co-ordination between IP blocks. Suppose you want to split neural network processing across the ANE and the GPU. This requires each side to know when the other side has completed a task.
The obvious way to do this is by flipping bits at some address which the other side is monitoring, like locks or mutexes when doing CPU parallel programming. But this sucks because it has to route through the coherency mechanisms up to the SLC then across to the other IP block.
You can do much better if you give up on copying CPU ideas from 1980 and provide a dedicated mechanism for each IP block to indicate to another IP block that an event has happened. This is probably the case with the M4 -- the timing of the patent suggests probably M4.
I don't know if the CPU/GPU interaction will also get this. It would be nice -- right now the cost of CPU/GPU synch is so high that many good ideas for moving work between the two are impractical. My guess is that what we will have in the M4 is exploratory, between IP blocks that Apple has more control over, like the GPU, ANE, media encode/decode, and ISP. Then, based on what is learned, the idea will be added to the CPU, which requires more flexibility and a careful API, because of 3rd party developers.)
Very interesting about the SLC, yeah that would be cool.
I suspect GPU and CPU are now both on essentially 2 year development tracks. Minor tweaks are added in alternate years, but the big jumps happen on a two year cadence. M4 is a CPU model, M5 will be a GPU model.
Aye it's definitely close to that. Just looking at the Mac, the pattern seems to be more like this:

M1 -> M2 tweaks to both CPU and GPU architectures
M2 -> M3, moderate CPU change, gargantuan GPU change
M3 -> M4, big CPU change, GPU tweak

Of course the iPhones had an extra model, the A16, in there between the M2 and M3 and I can't remember what its CPU/GPU changes were. Did it have a lot of the M3 CPU core changes already? Then your pattern of alternating years would fit even better.

M3 GPU gave us
- ray tracing
- dynamic cache
Dynamic cache almost certainly got scheduling tweaks of the form I described (allow allocation and deallocation to overlap with "genuine" GPU work). Ray tracing probably also got tweaks based on what was learned from the M3.
Yeah, I'm definitely curious about the tweaks they made in the M4 GPU - allowing allocation and deallocation to overlap with compute would be big, because it does look a little more performant over the M3 than the clock speed/LPDDR bandwidth improvements alone would suggest. But obviously Apple themselves said it's the same architecture as the M3, and the performance isn't so much better that clock/bandwidth improvements couldn't explain most of it. I'll try to read your PDF when I get more time.
But M4 is mainly about boosted CPU smartness. Next year I expect we'll get a "disappointing" CPU boost along with an unexpectedly impressive GPU boost.
(Two places where nV is ahead of Apple are
- nV can co-ordinate work between multiple GPU cores, so at a granularity larger than a threadblock.
- nV has tweaked the precise definition of SIMT to allow thread lanes to "decouple" in their execution, which in turn allows a variety of algorithms that would deadlock on a pure SIMT design.

As I've said before, Olivier Giroux, who oversaw this sort of stuff at nV, has been at Apple for a few years now. My guess is they are well aware of, and want to improve on, both these functionalities, and the infrastructure for dynamic cache, ie the private address spaces, is step one in hitting that goal. We'll hopefully see step two with the M5.

Oh yes. I've talked about this at length in another forum. I've watched Olivier and Bryce's talks on forward progress guarantees like 3 or 4 times (plus another Nvidia guy's) - partly out of sheer interest, partly because it took that long to wrap my head around the topic! Really cool stuff and I am looking forward to Apple implementing it as well.

The truly crazy thing would be Apple shipping the equivalent of Game Porting Toolkit for CUDA. I suspect they've held off doing this for now, not least because they're simply missing the fanciest CUDA functionality like what I described above. A CUDA porting toolkit just looks lame if the end result is still that all the fanciest Graphics Gems code can't port because you don't support large co-ordination groups, or the algorithms deadlock on your hardware! But when these limitations no longer exist...)
As someone who does CUDA work and obviously has a Mac (but doesn't know Metal), that would be awesome. I was disappointed that one guy's project (ZLUDA, was it? or is that another one?) has had so many failures/false starts. But if Apple were to do a GPTK equivalent for CUDA that would be great. Nvidia does seem to be a little litigious about the license on CUDA reverse engineering projects, so I'm not sure how that would affect things.
 
Excellent, scores broken down by scene! So correct me if I'm wrong, but of the three, Junkshop is the biggest, most difficult render with the biggest memory footprint, and Monster is the least, yes? In the Redshift part of the video I think he misspoke and said his 4070 has 8GB of VRAM, but it should have 12. Probably just a misspeak. But I do wonder if he has a factory overclocked desktop 4070 (or he has a 4070 SUPER). His Blender scores are more in line with the 4070 SUPER than the standard 4070; the SUPER scored about 16% higher across all Blender scenes than the M4 Max/standard 4070.


You're right about 12GB in the 4070. I think Classroom is the heaviest since it gets the lowest scores with the M4 Max, 7900 XTX and 3080.
 
There's a quick comparison of the M4 series against 4080 and 3080 desktop GPUs rendering the Monster and Lone Monk demo scenes with Cycles in the first half of this video.

 
There's a quick comparison of the M4 series against 4080 and 3080 desktop GPUs rendering the Monster and Lone Monk demo scenes with Cycles in the first half of this video.

The lengths of the bars in the charts don't match the reported numbers.
 
Sorry, I don't mean to talk over people's heads, but I recognize that I went very fast, too fast, through the material. If you'd like, feel free to ask a question and I'll do my best to explain starting from basics ... and without the mistakes I made above. :)
Oh sorry, I didn't mean like that, you guys are all explaining your thoughts well!

I have enough of a background of rampant nerdery (I started with MS-DOS 3.2 on an Amstrad PC1512), an adolescence filled with every computer magazine that contained any sort of technical info, and a bunch of basic IT subjects towards an IT major for a (never completed) Bachelor of Education, to follow along. I know just enough to know that there is a lot I am not aware of.

It's still a lot of fun to read along. :)
 
The lengths of the bars in the charts don't match the reported numbers.
Yeah, his intern has an interesting take on chart design, ha ha.

More germane to meaningful discussion - the GPU performance data is right there in easy to read numbers.

Plus, the reviewer is a working creator. And there are those Apple and Google patents on which he is one of the named individuals.


I tend to pay attention to someone with his accomplishments.

It is amusing how some people try to direct focus away from numbers they don’t like, tho.
 
Yeah, his intern has an interesting take on chart design, ha ha.

More germane to meaningful discussion - the GPU performance data is right there in easy to read numbers.

Plus, the reviewer is a working creator. And there are those Apple and Google patents on which he is one of the named individuals.


I tend to pay attention to someone with his accomplishments.

It is amusing how some people try to direct focus away from numbers they don’t like, tho.
I don't doubt any of that, but it makes it very disappointing that he used CUDA on the Nvidia GPUs instead of OptiX. I cannot fathom how someone this experienced would make such an error.
 
I don't doubt any of that, but it makes it very disappointing that he used CUDA on the Nvidia GPUs instead of OptiX. I cannot fathom how someone this experienced would make such an error.
Oooooooh, thanks for that. I had to wade through the swamp of 500 YouTube comments to find confirmation.

We have to disregard his nVidia numbers then.

I don’t know anyone who renders using CUDA.
 
[Attached image: Octane X benchmark chart (IMG_0611.jpeg)]


An M4 Max result has been added to the Octane X benchmark chart.

12 seconds to render the test scene vs 15 for M3 Max.

(Just ignore the one mislabeled entry - I don’t think anyone’s expecting perfection from Otoy, ha ha)
 
Redshift 2025.1.1 (October 16 release) M4 Max benchmark results (RTX on):


[Attached screenshot: IMG_3758.png]


Redshift 2025.1.1 (October 16 release) RTX (nominal) 4090 Laptop benchmark results (RTX on):

[Attached screenshot: IMG_3760.png]


So this shows the Apple GPU improvements we were pretty much expecting with Redshift 2025.1.1.

The difference in M4 Max rendering time vs NVidia’s AD103-based 4090 laptop / 4080 desktop GPU now matches the difference in Blender.

And Redshift is also seeing a 30% speedup with M4 Max vs M3 Max (running 3.5.24) with RTX on.

We’ll see if the next Redshift updates benefit the M4 series even further.
 
Redshift 2025.1.1 (October 16 release) M4 Max benchmark results (RTX on):


[Attachment 2455934]

Redshift 2025.1.1 (October 16 release) RTX 4090 Laptop benchmark results (RTX on):

[Attachment 2455933]

So this shows M4 GPUs narrowing the gap in October’s version.

We’ll see if the next updates include any optimization for M4.
The only issue here is that the Mac could have used a much higher block size, which has been shown to improve performance on Apple Silicon.
 
Hmmm, looks like Apple hasn’t completely disentangled itself from Otoy just yet…

Octane Render 2025.1 Beta Feature Highlights…

Metal-RT optimized HW acceleration on Apple devices

This version improves support for hardware acceleration on Apple M3 and M4 hardware by adding support for non-triangle primitives such as hair, particles, texture displacement and analytic lights.

This means that scenes that use non-triangle geometry will also be able to leverage the HW instead of falling back to the software implementation as before. Scenes with large amounts of triangles as well as other primitives will initially benefit most from this approach.

Below is a performance comparison using a synthetic test scene on an Apple M4 Pro chip. Please note the performance may not be indicative of the final version since there's still some work to do.


[Image: Neu Rungholt medieval village by "kescha"]
 
Chaos is about to release V-Ray 7 for Blender. For now the beta is Windows-only; if you are interested in a Mac version, please fill out the survey (it will take no more than 10 seconds) so that they can prioritize between the Mac and Linux versions:

 