
l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
What I think is even more impressive is how well it still does in low power mode. This is after 5 minutes; I don't think it would change much running it for 10. It draws about 1/3 of the power at half the speed.
[Attachment: Screenshot 2024-11-15 at 18.48.31.png]
 
  • Like
Reactions: crazy dave

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,224
Excellent, scores broken down by scene! So correct me if I'm wrong, but of the three, Junkshop is the biggest, most difficult render with the largest memory footprint, and Monster is the least, yes? In the Redshift part of the video I think he misspoke and said his 4070 has 8GB of VRAM, but it should be 12. Probably just a slip. But I do wonder if he has a factory-overclocked desktop 4070 (or a 4070 SUPER). His Blender scores look more like the 4070 SUPER than the standard 4070: across all Blender scenes the 4070 SUPER scored about 16% higher than the M4 Max/standard 4070.

 

name99

macrumors 68020
Jun 21, 2004
2,410
2,314
Yeah, I typed the above too quickly and reused the Nvidia numbers since, like Nvidia, Apple also has 4 subunits and 128 total FP32 pipes in a core. But the whole point was that Nvidia's numbers include the optional FP32 pipes while Apple's do not, resulting in double the number of pipes per subunit even beyond the extra FP16 pipes. That was the entire point of writing it out like that, and I completely goofed it. Sorry.

Edited to this:
1 Core/SM is 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 32 FP32, 32 INT32, 32 FP16, 32 Complex

Nvidia has the same theoretical FP32 output per clock per core only if there is no simultaneous integer or FP16 work, and Apple has double the possible integer output per clock per core.
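To make the arithmetic concrete, here's a toy calculation that just restates the per-subunit numbers listed above (assuming 4 subunits per core/SM for both vendors; none of this comes from vendor docs):

Code:
// Toy lane-count arithmetic from the subunit breakdown above.
// Assumption: 4 subunits per core/SM, lane counts exactly as listed in this post.
#include <cstdio>

int main() {
    const int subunits = 4;

    // Nvidia per subunit: 16 dedicated FP32 lanes + 16 shared FP32/INT32 lanes.
    const int nv_fp32_only   = 16 * subunits;                  // 64
    const int nv_fp32_or_int = 16 * subunits;                  // 64
    const int nv_fp32_peak   = nv_fp32_only + nv_fp32_or_int;  // 128, only with no INT32/FP16 issued
    const int nv_int32_peak  = nv_fp32_or_int;                 // 64, and it borrows the shared FP32 lanes

    // Apple per subunit: 32 FP32, 32 INT32, 32 FP16, 32 complex (separate pipes).
    const int ap_fp32_peak  = 32 * subunits;                   // 128
    const int ap_int32_peak = 32 * subunits;                   // 128, concurrent with FP32

    printf("Nvidia: %d FP32/clk peak (FP32-only code), %d INT32/clk (shared lanes)\n",
           nv_fp32_peak, nv_int32_peak);
    printf("Apple:  %d FP32/clk plus %d INT32/clk concurrently\n",
           ap_fp32_peak, ap_int32_peak);
    return 0;
}

The point being: the FP32 peaks only match when Nvidia's shared lanes aren't doing anything else, while Apple's INT32 lanes are separate.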



If Robert didn't know, then at least at the time of his posts that wasn't something Nvidia had officially clarified, and if they have issued a clarification since then, I have not been able to find it. My issue is this (and I'm sure I'm wrong somewhere): if the FMA itself only takes a single cycle to complete, then attempting to issue subsequent FMAs should result in tandem, not parallel, execution, no? Perhaps my mental model here is wrong. Here's what I think is going on:

A warp consisting of 32 threads should execute over two subunits, right? And then, barring divergences of course, both subunits are scheduled simultaneously and execute together. If there are two independent FP32 calculations per thread and the Nvidia GPU is truly superscalar, then its warp scheduler can issue commands to the INT32/FP32 pipe simultaneously with the FP32 pipe, completing 64 FP32 operations per cycle across the two subunits. However, if the Nvidia GPU is not superscalar, then by the time it schedules the work for the combined pipe, the FP32 pipe should already be done, and you get tandem execution with no increase in throughput per clock cycle, which should mean no increase in overall throughput.

No. The way nV did it in the past (and I assume still does it) is
- clock 0: send out the first half of an FP32 instruction (16 lanes to a 16-wide execution unit).
- clock 1: the second half of that FP32 instruction executes, but the scheduler is free, so we can also send out the first half of say an FP16 instruction
- clock 2: the second half of that FP16 instruction executes, and the scheduler can now send out say another FP32 instruction, or maybe an INT32 instruction.
etc etc
This gives you effectively two ops per cycle of execution (but still only 32 lanes executing per cycle, maybe 16 FP32 and 16 INT32 in this particular cycle).

Given that a "sub-core" is 16-wide not 32-wide, it does achieve the basic goal of getting "side execution" (eg INT32 or FP16) for free by using HW that is there anyway, but never taking a cycle away from the primary FP32 line of execution. The extent to which it's valuable depends on how much side work in INT32 or FP16 you have available. On nV I think the number they said was something like ~30% on average. (Mainly int multiplies to generate addresses into 2 or 3D arrays.)

Again, I think my mental model is wrong, or at least, given my mental model, I'm not quite following why we would need to do this at all, since by my way of thinking even a simple throughput test should tell the difference. Since that's presumably not the case, I must be mistaken somewhere. If you and Robert are both saying this schema is reasonable, it most likely is. :)
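For what it's worth, the kind of "simple throughput test" being referred to would look roughly like this on the Nvidia side (a hypothetical CUDA sketch; the kernel names and sizes are made up): compare a pure-FP32 FMA loop against one that interleaves INT32 work. If the INT32 side execution really is free, the mixed kernel shouldn't be anywhere near 2x slower.

Code:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fp32_only(float* out, int iters) {
    float a = threadIdx.x * 0.5f + 1.0f, b = 1.000001f, c = 0.5f;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, b, c);       // FP32 FMA chain
        c = fmaf(c, b, 0.25f);   // second chain for a bit of ILP
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + c;
}

__global__ void fp32_plus_int32(float* out, int iters) {
    float a = threadIdx.x * 0.5f + 1.0f, b = 1.000001f, c = 0.5f;
    unsigned k = threadIdx.x;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, b, c);   // FP32 work
        k = k * 3u + 1u;     // INT32 work that could go down the shared pipe
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + (float)k;
}

// Time two back-to-back launches of a kernel with CUDA events (first is warm-up).
static float time_ms(void (*kernel)(float*, int), float* d_out,
                     int blocks, int threads, int iters) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    kernel<<<blocks, threads>>>(d_out, iters);   // warm-up
    cudaEventRecord(t0);
    kernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    float* d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));
    printf("FP32 only   : %.3f ms\n", time_ms(fp32_only, d_out, blocks, threads, iters));
    printf("FP32 + INT32: %.3f ms\n", time_ms(fp32_plus_int32, d_out, blocks, threads, iters));
    cudaFree(d_out);
    return 0;
}

Of course a test like this measures the combined effect of issue and execution, which is exactly why it can't distinguish "true" superscalar issue from the ping-pong scheme described above.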


I know Nvidia has 64K 32-bit registers per SM in its register file; do we know what Apple is using now? Chips and Cheese didn't go into that in their unfortunately brief look at the M2 (which is also not the M3, obviously, which probably changed things dramatically).
Remember Apple has a very different scheme now; the number of registers is not fixed. There is simply a pool of per-core SRAM that is dynamically split into registers, threadblock scratchpad, and cache. It's even smarter than this in that this pool of SRAM spills to L2 like a normal cache, so registers that are not used for a while will migrate out to L2 to have their local SRAM used by something that values it more.
The details of this are explained in my PDF.
It's essentially like a second layer of virtual memory addressing, call it "private memory", sitting on the virtual address space.
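For contrast with that dynamic pool, on the Nvidia side the fixed budgets (the 64K 32-bit registers per SM mentioned above, plus a fixed shared-memory allocation) are exactly what the CUDA occupancy machinery trades off. A quick way to see it (a generic sketch; my_kernel and the block size are just placeholders):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, never launched here; real register usage depends on the code.
__global__ void my_kernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread: %d\n", attr.numRegs);

    int blocksPerSM = 0;
    const int blockSize = 256;
    // How many blocks of this kernel fit per SM given its register and
    // shared-memory usage against the SM's fixed budgets (64K 32-bit regs).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  blockSize, /*dynamicSmem=*/0);
    printf("resident blocks per SM at blockSize %d: %d\n", blockSize, blocksPerSM);
    return 0;
}

The Apple scheme described above doesn't have that hard partition to budget against, which is part of the point.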

I should also state I made another error in my above statements about caches.



I said Nvidia doesn't have the equivalent of the SLC, but that's not really true. Nvidia has 128KB of L1 per SM (256KB for compute cards like the H100) and tens of MB of L2, which in fact functions as a last-level cache similar to Apple's SLC. Apparently on some of Nvidia's compute cards there is even a "near L2" and a "far L2". What Nvidia doesn't have is the equivalent of Apple's L2 cache - though again it isn't clear how the M3/M4 have changed things. Again, sorry for any confusion.

Again the details make a big difference; there's a lot more going on than just capacities if you want to map one design onto another.
As far as I can tell, for Apple
- within the core proper, addressing is in private address space (ie local SRAM is accessed by a private address)
- within the GPU, addressing is in virtual address space
- when an address leaves the GPU it is translated to physical address space
- the SLC operates in physical address space

This may seem esoteric, but it has implications for the costs of memory lookups. If you simply run your addressing on the GPU the way a CPU does, you have a huge problem (with both power and area) in providing enough TLB lookups per cycle. I think nV have to pay this price. Meanwhile Apple's equivalent lookup (private address space -> virtual address space TLB) doesn't have to happen until you hit L2, and can be optimized for the GPU (no concerns about weird aliasing with the rest of the system). The costs of a virtual -> physical lookup (where you do have to worry about the rest of the system) are only paid when you go off the GPU to the SLC.

Meanwhile the SLC has its own set of advantages. Being memory-side (in other words the memory controllers are aware of everything in the SLC, unlike a standard L3) has lots of advantages for DMA and interacting with IO.
But for the GPU, what's more important is that there is a massive level of control available for the SLC. You don't just have data moved in and out based on LRU. You can define multiple different data streams, each tagged with a DSID (data stream ID), and have these behave differently. Some data you can lock in the cache. Some you can stream through the cache (so replace MRU rather than LRU). Some you can "decimate" in the cache (eg store every fourth line, so that you can work on that line [pulled in rapidly from the SLC] while the three non-present lines are being prefetched from DRAM). Apple's SLC is way beyond what anyone else has -- as you can see when you look at the die. The control logic for the SLC (sitting in the middle, surrounded by SRAM) is HUGE, like the size of 4 E-cores...

John McCalpin (the guy behind the HPC STREAM benchmark) says the biggest flaw in HPC is that you can't communicate to the hardware how you plan to use streams of data. Apple don't have that issue...

(BTW another feature that's probably now present in the M4, but we haven't yet seen it do anything, and may not until the next OS, has to do with co-ordination between IP blocks. Suppose you want to split neural network processing across the ANE and the GPU. This requires each side to know when the other side has completed a task.
The obvious way to do this is by flipping bits at some address which the other side is monitoring, like locks or mutexes when doing CPU parallel programming. But this sucks because it has to route through the coherency mechanisms up to the SLC then across to the other IP block.
You can do much better if you give up on copying CPU ideas from 1980 and provide a dedicated mechanism for each IP block to indicate to another IP block that an event has happened. This is probably present in the M4 -- the timing of the patent suggests as much.
I don't know if the CPU/GPU interaction will also get this. It would be nice -- right now the cost of CPU/GPU synch is so high that many good ideas for moving work between the two are impractical. My guess is that what we will have in the M4 is exploratory, between IP blocks that Apple has more control over, like the GPU, ANE, media encode/decode, and ISP. Then, based on what is learned, the idea will be extended to the CPU, which requires more flexibility and a careful API because of 3rd-party developers.)
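For a concrete flavour of the "flip a bit at an address and poll it" approach being described as expensive, here's roughly what it looks like today between a CPU and an Nvidia GPU (a CUDA sketch under the usual mapped-pinned-memory assumptions; the produce kernel and sizes are made up, nothing Apple-specific):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// GPU side: do some work, then raise a flag the CPU is watching.
__global__ void produce(volatile int* flag, float* out) {
    out[threadIdx.x] = threadIdx.x * 2.0f;   // stand-in for real work
    __threadfence_system();                   // make the results visible to the host
    if (threadIdx.x == 0) *flag = 1;          // "flip the bit"
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);    // allow mapped pinned memory

    int* h_flag;
    cudaHostAlloc((void**)&h_flag, sizeof(int), cudaHostAllocMapped);
    *h_flag = 0;
    int* d_flag;
    cudaHostGetDevicePointer((void**)&d_flag, h_flag, 0);

    float* d_out;
    cudaMalloc((void**)&d_out, 32 * sizeof(float));

    produce<<<1, 32>>>(d_flag, d_out);        // async launch

    // CPU side: spin on the flag. Every poll has to bounce through the
    // coherency machinery, which is exactly the cost complained about above.
    while (*(volatile int*)h_flag == 0) { /* busy-wait */ }
    printf("GPU signalled completion\n");

    cudaFree(d_out);
    cudaFreeHost(h_flag);
    return 0;
}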

I do wonder about improvements to the L1 cache bandwidth and/or INT scheduling - like did the M4 improve those things over the M3 or is it literally just an upclocked M3 with better memory bandwidth? I keep going back and forth but given some of these results, maybe Apple made a few tweaks to improve them?

I suspect GPU and CPU are now both on essentially 2 year development tracks. Minor tweaks are added in alternate years, but the big jumps happen on a two year cadence. M4 is a CPU model, M5 will be a GPU model.

M3 GPU gave us
- ray tracing
- dynamic cache
Dynamic cache almost certainly got scheduling tweaks of the form I described (allow allocation and deallocation to overlap with "genuine" GPU work). Ray tracing probably also got tweaks based on what was learned from the M3.

But M4 is mainly about boosted CPU smartness. Next year I expect we'll get a "disappointing" CPU boost along with an unexpectedly impressive GPU boost.
(Two places where nV is ahead of Apple are
- nV can co-ordinate work between multiple GPU cores, so at a granularity larger than a threadblock.
- nV has tweaked the precise definition of SIMT to allow thread lanes to "decouple" in their execution, which in turn allows a variety of algorithms that would deadlock on a pure SIMT design.

As I've said before, Olivier Giroux, who oversaw this sort of stuff at nV, has been at Apple for a few years now. My guess
is they are well aware of, and want to improve on, both these functionalities, and the infrastructure for dynamic cache, ie the private address spaces, is step one in hitting that goal. We'll hopefully see step two with the M5.

The truly crazy thing would be Apple shipping the equivalent of Game Porting Toolkit for CUDA. I suspect they've held off doing this for now, not least because they're simply missing the fanciest CUDA functionality like what I described above. A CUDA porting toolkit just looks lame if the end result is still that all the fanciest Graphics Gems code can't port
because you don't support large co-ordination groups, or the algorithms deadlock on your hardware! But when these limitations no longer exist...)
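For anyone who hasn't bumped into the first of those two points, this is the sort of thing CUDA exposes through cooperative groups: a barrier across all threadblocks in a grid, i.e. coordination at a granularity larger than a threadblock. A minimal sketch (two_phase and the sizes are made up; it needs a cooperative launch and a GPU/driver that supports it, compiled with something like nvcc -arch=sm_70 -rdc=true):

Code:
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two phases separated by a grid-wide barrier: every threadblock must reach
// the sync before any block proceeds, so phase 2 can safely read phase-1
// results written by other blocks.
__global__ void two_phase(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] = data[i] * 2.0f;                      // phase 1
    grid.sync();                                               // wait for ALL blocks
    float neighbour = (i < n) ? data[(i + 1) % n] : 0.0f;      // read phase-1 result
    grid.sync();                                               // reads done before writes
    if (i < n) data[i] += neighbour;                           // phase 2
}

int main() {
    const int n = 1 << 13, threads = 256, blocks = n / threads;
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    void* args[] = { &d, (void*)&n };
    // Cooperative launch is required for grid.sync(); all blocks must be
    // resident on the device simultaneously, so keep 'blocks' modest.
    cudaLaunchCooperativeKernel((void*)two_phase, dim3(blocks), dim3(threads),
                                args, 0, nullptr);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(d);
    return 0;
}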
 

leman

macrumors Core
Oct 14, 2008
19,521
19,673
Perhaps my mental model here is wrong.

Maynard already provided a great explanation. I'll just reiterate using slightly different words. It all boils down to Nvidia executing a 32-wide workload over two cycles on a 16-wide unit: say one cycle for the lower 16 values and the next cycle for the upper 16. Each scheduler has two subunits, and each subunit takes two cycles to execute, which means that once an instruction is issued, the subunit is busy for another cycle. So the scheduler ping-pongs between the two subunits. This has essentially the same effect as dual-issue, just without the complicated machinery of actually tracking and issuing two instructions per cycle. It's quite smart, really.
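A toy way to see why this behaves like dual-issue without being dual-issue: model one scheduler feeding two subunits, where each 32-wide instruction occupies its subunit for two cycles. Only one instruction is issued per cycle, yet in steady state one completes per cycle (purely illustrative host code, not a real hardware model):

Code:
// Toy model of the ping-pong issue scheme: one scheduler, two subunits,
// each issued instruction keeps its subunit busy for 2 cycles.
#include <cstdio>

int main() {
    int busy[2] = {0, 0};    // remaining cycles each subunit is occupied
    int issued = 0, retired = 0;

    for (int cycle = 0; cycle < 12; ++cycle) {
        // retire work that has finished executing
        for (int u = 0; u < 2; ++u)
            if (busy[u] > 0 && --busy[u] == 0) ++retired;

        // issue at most one new instruction per cycle, to a free subunit
        for (int u = 0; u < 2; ++u) {
            if (busy[u] == 0) { busy[u] = 2; ++issued; break; }
        }
        printf("cycle %2d: issued %d, retired %d\n", cycle, issued, retired);
    }
    return 0;
}

After the first couple of cycles the issue and retire counters both advance by one per cycle, i.e. two instructions are in flight at once even though only one is ever issued per cycle.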
 
  • Like
Reactions: crazy dave

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,224
No. The way nV did it in the past (and I assume still does it) is
- clock 0: send out the first half of an FP32 instruction (16 lanes to a 16-wide execution unit).
- clock 1: the second half of that FP32 instruction executes, but the scheduler is free, so we can also send out the first half of say an FP16 instruction
- clock 2: the second half of that FP16 instruction executes, and the scheduler can now send out say another FP32 instruction, or maybe an INT32 instruction.
etc etc
This gives you effectively two ops per cycle of execution (but still only 32 lanes executing per cycle, maybe 16 FP32 and 16 INT32 in this particular cycle).

Given that a "sub-core" is 16-wide not 32-wide, it does achieve the basic goal of getting "side execution" (eg INT32 or FP16) for free by using HW that is there anyway, without ever taking a cycle away from the primary FP32 line of execution. The extent to which it's valuable depends on how much side work in INT32 or FP16 you have available. On nV I think the number they said was something like ~30% on average. (Mainly int multiplies to generate addresses into 2D or 3D arrays.)
Maynard already provided a great explanation. I'll just reiterate using slightly different words. It all boils down to Nvidia executing a 32-wide workload over two cycles on a 16-wide unit: say one cycle for the lower 16 values and the next cycle for the upper 16. Each scheduler has two subunits, and each subunit takes two cycles to execute, which means that once an instruction is issued, the subunit is busy for another cycle. So the scheduler ping-pongs between the two subunits. This has essentially the same effect as dual-issue, just without the complicated machinery of actually tracking and issuing two instructions per cycle. It's quite smart, really.

Ah okay, I think I get it now. I'll probably have to reread it a couple more times to make sure it sinks in. Thanks guys, appreciate it!

Remember Apple has a very different scheme now; the number of registers is not fixed. There is simply a pool of per-core SRAM that is dynamically split into registers, threadblock scratchpad, and cache. It's even smarter than this in that this pool of SRAM spills to L2 like a normal cache, so registers that are not used for a while will migrate out to L2 to have their local SRAM used by something that values it more.
The details of this are explained in my PDF.
It's essentially like a second layer of virtual memory addressing, call it "private memory", sitting on the virtual address space.
Right, I meant how big is the pool that Apple has access to now with that dynamic cache - how much total data can it hold per core? I know it's super flexible now, I'm just wondering about the total size. Also, prior to the M3, I assume Apple had a more traditional register file in the M1 and M2, and I'm not sure how big that was either. But maybe I'm wrong on that count as well.
Again the details make a big difference; there's a lot more going on than just capacities if you want to map one design onto another.
As far as I can tell, for Apple
- within the core proper, addressing is in private address space (ie local SRAM is accessed by a private address)
- within the GPU, addressing is in virtual address space
- when an address leaves the GPU it is translated to physical address space
- the SLC operates in physical address space

This may seem esoteric, but it has implications for the costs of memory lookups. If you simply run your addressing on the GPU the way a CPU does, you have a huge problem (with both power and area) in providing enough TLB lookups per cycle. I think nV have to pay this price. Meanwhile Apple's equivalent lookup (private address space -> virtual address space TLB) doesn't have to happen until you hit L2, and can be optimized for the GPU (no concerns about weird aliasing with the rest of the system). The costs of a virtual -> physical lookup (where you do have to worry about the rest of the system) are only paid when you go off the GPU to the SLC.

Meanwhile the SLC has its own set of advantages. Being memory-side (in other words the memory controllers are aware of everything in the SLC, unlike a standard L3) has lots of advantages for DMA and interacting with IO.
But for the GPU, what's more important is that there is a massive level of control available for the SLC. You don't just have data moved in and out based on LRU. You can define multiple different data streams, each tagged with a DSID (data stream ID), and have these behave differently. Some data you can lock in the cache. Some you can stream through the cache (so replace MRU rather than LRU). Some you can "decimate" in the cache (eg store every fourth line, so that you can work on that line [pulled in rapidly from the SLC] while the three non-present lines are being prefetched from DRAM). Apple's SLC is way beyond what anyone else has -- as you can see when you look at the die. The control logic for the SLC (sitting in the middle, surrounded by SRAM) is HUGE, like the size of 4 E-cores...

John McCalpin (the guy behind the HPC STREAM benchmark) says the biggest flaw in HPC is that you can't communicate to the hardware how you plan to use streams of data. Apple don't have that issue...

(BTW another feature that's probably now present in the M4, but we haven't yet seen it do anything, and may not until the next OS, has to do with co-ordination between IP blocks. Suppose you want to split neural network processing across the ANE and the GPU. This requires each side to know when the other side has completed a task.
The obvious way to do this is by flipping bits at some address which the other side is monitoring, like locks or mutexes when doing CPU parallel programming. But this sucks because it has to route through the coherency mechanisms up to the SLC then across to the other IP block.
You can do much better if you give up on copying CPU ideas from 1980 and provide a dedicated mechanism for each IP block to indicate to another IP block that an event has happened. This is probably present in the M4 -- the timing of the patent suggests as much.
I don't know if the CPU/GPU interaction will also get this. It would be nice -- right now the cost of CPU/GPU synch is so high that many good ideas for moving work between the two are impractical. My guess is that what we will have in the M4 is exploratory, between IP blocks that Apple has more control over, like the GPU, ANE, media encode/decode, and ISP. Then, based on what is learned, the idea will be extended to the CPU, which requires more flexibility and a careful API because of 3rd-party developers.)
Very interesting about the SLC, yeah that would be cool.
I suspect GPU and CPU are now both on essentially 2 year development tracks. Minor tweaks are added in alternate years, but the big jumps happen on a two year cadence. M4 is a CPU model, M5 will be a GPU model.
Aye it's definitely close to that. Just looking at the Mac, the pattern seems to be more like this:

M1 -> M2 tweaks to both CPU and GPU architectures
M2 -> M3, moderate CPU change, gargantuan GPU change
M3 -> M4, big CPU change, GPU tweak

Of course the iPhones had an extra model, the A16, in there between the M2 and M3 and I can't remember what its CPU/GPU changes were. Did it have a lot of the M3 CPU core changes already? Then your pattern of alternating years would fit even better.

M3 GPU gave us
- ray tracing
- dynamic cache
Dynamic cache almost certainly got scheduling tweaks of the form I described (allow allocation and deallocation to overlap with "genuine" GPU work). Ray tracing probably also got tweaks based on what was learned from the M3.
Yeah, I'm definitely curious about the tweaks they made in the M4 GPU - allowing allocation and compute to overlap would be big, because it does look a little more performant over the M3 than the clock speed/LPDDR bandwidth improvements alone would suggest. But obviously Apple themselves said it's the same architecture as the M3, and the performance is not so much better that the clock/bandwidth improvements couldn't explain most of it. I'll try to read your PDF when I get more time.
But M4 is mainly about boosted CPU smartness. Next year I expect we'll get a "disappointing" CPU boost along with an unexpectedly impressive GPU boost.
(Two places where nV is ahead of Apple are
- nV can co-ordinate work between multiple GPU cores, so at a granularity larger than a threadblock.
- nV has tweaked the precise definition of SIMT to allow thread lanes to "decouple" in their execution, which in turn allows a variety of algorithms that would deadlock on a pure SIMT design.

As I've said before, Olivier Giroux, who oversaw this sort of stuff at nV, has been at Apple for a few years now. My guess
is they are well aware of, and want to improve on, both these functionalities, and the infrastructure for dynamic cache, ie the private address spaces, is step one in hitting that goal. We'll hopefully see step two with the M5.

Oh yes. I've talked about this at length in another forum. I've watched Olivier and Bryce's talks on forward progress guarantees like 3 or 4 times (plus another Nvidia guy's) - partly out of sheer interest, partly because it took that long to wrap my head around the topic! Really cool stuff and I am looking forward to Apple implementing it as well.

The truly crazy thing would be Apple shipping the equivalent of Game Porting Toolkit for CUDA. I suspect they've held off doing this for now, not least because they're simply missing the fanciest CUDA functionality like what I described above. A CUDA porting toolkit just looks lame if the end result is still that all the fanciest Graphics Gems code can't port
because you don't support large co-ordination groups, or the algorithms deadlock on your hardware! But when these limitations no longer exist...)
As someone who does CUDA work and obviously has a Mac (but doesn't know Metal), that would be awesome. I was disappointed that one guy's project (ZLUDA, was it? or is that another one?) has had so many failures/false starts. But if Apple were to do a GPTK equivalent for CUDA, that would be great. Nvidia does seem to be a little litigious about the CUDA license and reverse-engineering projects, so I'm not sure how that would affect things.
 