
crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
Binning is usually done by deliberately crippling perfectly functional chips. You can't produce lower-end chips in sufficient quantities if you only wait for dies that have flaws in the right parts. Especially not with chips such as the M1 Max, where GPU cores only take a small fraction of total die area.

GPU cores take up a massive proportion of the die area on the M1 Max, something like 40%, and there can be multiple technical reasons for binning; it's not always that the core has to be totally non-functional. However, yes, functional chips are sometimes binned down - it usually depends on the process.
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
48-core Geekbench OpenCL result: 77,000. Pretty disappointing, but not that surprising given that AnandTech mentioned the test doesn't take long enough to wake up all the GPU cores.

Look at Apple's own tests: performance scaling of the M1 Ultra GPUs is quite poor. It is only about 20-30% faster than the M1 Max.
[Attachment: Apple's M1 Ultra vs. M1 Max performance-scaling chart]


That's disappointing.
The "ultra fusion" interconnect clearly doesn't replace a monolithic GPU.
Apple claims that the M1 Ultra is the fastest GPU out there, but as with the M1 Max, I suppose their claim is valid only in very specific scenarios. I don't even think it will be valid in favourable benchmarks like GFXBench, due to the limitations of the interconnect fabric.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
Look at Apple's own tests: performance scaling of the M1 Ultra GPUs is quite poor. It is only about 20-30% faster than the M1 Max.

That's disappointing.
The "ultra fusion" interconnect clearly doesn't replace a monolithic GPU.
Apple claims that the M1 Ultra is the fastest GPU out there, but as with the M1 Max, I suppose their claim is valid only in very specific scenarios. I don't even think it will be valid in favourable benchmarks like GFXBench, due to the limitations of the interconnect fabric.

To be fair, Redshift only had 60% scaling from Pro to Max, so I wouldn't necessarily be so quick to blame the interconnect for all of this, especially since something like the DaVinci Resolve render tests shows better scaling from Max to Ultra (75%). As always, wait for third-party reviews. I will be interested in seeing GFXBench, as we already know it scales fairly linearly from M1 to Pro to Max. That should tell us the interconnect effect.
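(For anyone not used to the shorthand: by "X% scaling" I mean the bigger config delivers X% more performance than the smaller one when the core count doubles. A quick sketch with made-up scores, not Apple's numbers:)

#include <stdio.h>

/* "X% scaling" = extra performance actually delivered when core count doubles.
 * The scores below are hypothetical, purely for illustration. */
static double scaling_pct(double base, double doubled)
{
    return (doubled / base - 1.0) * 100.0;
}

int main(void)
{
    printf("Pro -> Max   (hypothetical 100 -> 160): %.0f%% scaling\n",
           scaling_pct(100.0, 160.0));   /* 60%, like the Redshift case */
    printf("Max -> Ultra (hypothetical 100 -> 175): %.0f%% scaling\n",
           scaling_pct(100.0, 175.0));   /* 75%, like the Resolve case */
    return 0;
}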
 

MayaUser

macrumors 68040
Nov 22, 2021
3,178
7,199
still, not disappointing at all
 

Attachments

  • Screen Shot 2022-03-09 at 08.49.37.png
  • Screen Shot 2022-03-09 at 08.49.31.png

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Just curious, did you want more GPU with less CPU? My guess is that kind of modularity won't come until the M2, or more likely the M3 based on the rumors, when a greater proportion of the SOCs are actually made from chiplets that can be put together like Legos. Then there might be more variations of the SOC product stack.
My priorities are RAM > CPU > GPU. An M1 Pro with 256 GB RAM would be better for my purposes than an M1 Ultra with 128 GB.

Apple is increasingly targeting its high-end devices to people who need performance for production runs. I don't need such devices, because I have the cloud for that. I use my desktop for development and prototyping. I need a system that can pretend it's a big powerful server when necessary, but I'm not willing to pay for the performance of an actual server.
 

MayaUser

macrumors 68040
Nov 22, 2021
3,178
7,199
To be fair, Redshift only had 60% scaling from Pro to Max, so I wouldn't necessarily be so quick to blame the interconnect for all of this, especially since something like the DaVinci Resolve render tests shows better scaling from Max to Ultra (75%). As always, wait for third-party reviews. I will be interested in seeing GFXBench, as we already know it scales fairly linearly from M1 to Pro to Max. That should tell us the interconnect effect.
Exactly, I just posted some more relevant scenarios... but I can't wait for my real scenario.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
My priorities are RAM > CPU > GPU. An M1 Pro with 256 GB RAM would be better for my purposes than an M1 Ultra with 128 GB.

Apple is increasingly targeting its high-end devices to people who need performance for production runs. I don't need such devices, because I have the cloud for that. I use my desktop for development and prototyping. I need a system that can pretend it's a big powerful server when necessary, but I'm not willing to pay for the performance of an actual server.

Ah, I see. I think, given the current constraints of the SOC design, yes, you are SOL here for the time being. We'll see what the end of the year brings with the Mac Pro, whatever happens to it, and of course future iterations of M processors, but it's not clear when Apple might produce such a device as you want. If Apple does go for a modular chiplet design in the M3, they might get closer to your desired formulation.
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
still, not disappointing at all
The DaVinci test is the only one that shows good scaling. For all other tests, the M1 Ultra is far from being 2X faster than the Max.
But none of these tests includes a 3D realtime rendering app (i.e., a game). It would have been nice of Apple to include one in these tests.

EDIT: Compressor also shows good scaling. But as with DaVinci, I suspect Compressor implements GPGPU in tasks that don't need enormous interconnection speed between GPU cores. GPU cores can work independently on different images or parts of an image. A game is a very different scenario.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
The DaVinci test is the only one that shows good scaling. For all other tests, the M1 Ultra is far from being 2X faster than the Max.
But none of these tests includes a 3D realtime rendering app (i.e., a game). It would have been nice of Apple to include one in these tests.

Yeah, that's why I want to see GFXBench - it's the one we know scales for the other chips. Apple in their presentation compared the Max to a 3060 Ti and the Ultra to a 3090, which is roughly 70% to 100% scaling. Last time they made these comparisons it was pretty clearly GFXBench.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
The "ultra fusion" interconnect clearly doesn't replace a monolithic GPU.
Apple claims that the M1 Ultra is the fastest GPU out there, but as with the M1 Max, I suppose their claim is valid only in very specific scenarios. I don't even think it will be valid in favourable benchmarks like GFXBench, due to the limitations of the interconnect fabric.

The claimed interconnect bandwidth should be close enough to the internal chip fabric. It's definitely fast enough. And it's not like you have to do a lot of communication between individual GPU cores - those work on independent tasks anyway. The lack of scaling in Redshift is probably a problem with Redshift itself.
 
  • Like
Reactions: crazy dave

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
Are you sure? I see an article as recent as 18th Feb by him on AnandTech - https://www.anandtech.com/show/17243/anandtech-interview-with-dr-ann-kelleher

A bit of a shame for Anandtech if he's really left though.

In any case the above video by him is a pretty good watch.

Huh - I didn't realize that. Bummer. :(
I should say that the head editor Ryan Smith is still there and he is *very* good (mostly focused on GPUs) - he hasn't been writing as much recently, but he'll probably have to now. Ian and Andrei will be hard to replace though.
 
  • Like
Reactions: Andropov

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
LOL. Do you want me to count them? I don't see any "complex mappings". Most of the lines in the source are just type casts which don't produce any code.

You are the one claiming that SSE and NEON are practically equivalent. The burden of proof should be on you.

Such as? I have done quite a bit of optimization using various versions of SSE and AVX in my time, but don't have any experience in ARM coding on that level.

I am far from being a NEON expert, but there is a bunch of differing functionality (FMA, fast horizontal operations, lane operations etc., different shuffle/interleave semantics). No idea what would be relevant to Embree though. Probably FMA is a bigger one.
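To make the FMA point concrete, here is a rough sketch (my own toy example, not code from Embree): NEON gets a fused multiply-add in one instruction, while plain SSE without the separate FMA extension has to split it into a multiply and an add.

#if defined(__aarch64__)
#include <arm_neon.h>

/* NEON: one fused multiply-add instruction, single rounding step. */
float32x4_t madd4(float32x4_t acc, float32x4_t a, float32x4_t b)
{
    return vfmaq_f32(acc, a, b);             /* acc + a * b, fused */
}

#elif defined(__SSE__)
#include <xmmintrin.h>

/* Plain SSE: multiply then add, two instructions and two rounding steps. */
__m128 madd4(__m128 acc, __m128 a, __m128 b)
{
    return _mm_add_ps(acc, _mm_mul_ps(a, b));
}
#endif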

Anyway, Alder Lake is more than twice as fast in Cinebench multi-core as the M1 Max. It would require some pretty sensational intrinsics in NEON to achieve that much of a boost. If that were possible, I suspect Apple would have merged that into Embree long ago, given that it is used by some very popular projects such as Blender. But on the Embree GitHub page you can see them contributing optimizations that gain something like 8% ...

ADL has a decisive core advantage, and Cinebench also scales extremely well with SMT. I don't see an 8+2-core M1 ever competing with a desktop 8+8-core Intel on this type of workload. Optimizations might bring it within 80-90% of the mobile ADL, but it's a waste of resources. The practical relevance of Embree is fairly low. Rewriting it just to get better CB scores? Please, that's childish. Apple is pushing GPU rendering, where they have much better scaling potential.

I doubt that very much. There was a significant performance improvement from AVX to AVX2, the only big difference being the longer vectors (well, and FMA for FP calculations). There are a lot of factors playing into SIMD performance besides just the execution units. For example, Alder Lake can load two 256-bit vectors per cycle from L1D (strictly speaking even two 512-bit vectors, but the consumer Alder Lakes can't make use of that since AVX512 is disabled). If you have shorter vectors, the throughput will obviously be that much lower, everything else being equal. Up to some point there is no replacement for displacement. :p On the upcoming workstation Alder Lakes the difference may be even starker, since they have AVX512 enabled.

Firestorm can load three 128-bit values per cycle from L1D, with low latency. It is obviously lower throughput, but I doubt that Embree uses a 1:1 ratio of loads to ops. Ray/AABB intersections are not load-dominated.
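Something like this scalar slab test is roughly what I have in mind (a sketch, not Embree's actual code): a handful of loads for the box and ray, and everything after that is arithmetic, min/max and compares on values sitting in registers.

#include <stdbool.h>

typedef struct { float o[3]; float inv_d[3]; } Ray;   /* origin, 1/direction */
typedef struct { float lo[3]; float hi[3]; } AABB;

/* Slab test: a few loads of box and ray data, then roughly a dozen
 * multiplies/subtracts plus min/max and compares that never touch memory. */
static bool ray_hits_aabb(const Ray *r, const AABB *b, float tmax)
{
    float tmin = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float t0 = (b->lo[i] - r->o[i]) * r->inv_d[i];
        float t1 = (b->hi[i] - r->o[i]) * r->inv_d[i];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;
    }
    return true;
}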
 

MayaUser

macrumors 68040
Nov 22, 2021
3,178
7,199
With UltraFusion, Apple is able to offer an incredible 2.5TB/second of bandwidth between the two M1 Max dies. Even if we assume that this is an aggregate figure – adding up both directions at once – that would still mean that they have 1.25TB/second of bandwidth in each direction. All of which is approaching how much internal bandwidth some chips use, and exceeds Apple's aggregate DRAM bandwidth of 800GB/second.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
EDIT: Compressor also shows good scaling. But as with DaVinci, I suspect Compressor implements GPGPU in tasks that don't need enormous interconnection speed between GPU cores. GPU cores can work independently on different images or parts of an image. A game is a very different scenario.
Same thing happens with games, every core works on a different part of the scene (first on different vertices, and then on different pixels). AFAIK heavy interconnection between GPU cores is not needed. It's needed for CPUs, since they have to keep some things in sync, but GPUs don't need to enforce synchronization as much, they're basically huge SIMD machines. But I'm puzzled as to why the GPU scaling is so bad in some apps.
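Hand-wavy illustration (plain C standing in for what a shader does per thread): each output pixel is a pure function of its own input, so any core can take any slice of the image without ever talking to the other cores.

#include <stddef.h>
#include <stdint.h>

/* Each output pixel depends only on its own input pixel, so cores can split
 * the image however they like with no cross-core synchronization.
 * (Plain C standing in for a per-thread fragment/compute shader.) */
void brighten(const uint8_t *in, uint8_t *out, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; ++i) {
        unsigned v = in[i] + 32u;                 /* simple per-pixel operation */
        out[i] = (uint8_t)(v > 255u ? 255u : v);  /* clamp to 8 bits */
    }
}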
 
  • Like
Reactions: crazy dave

Pressure

macrumors 603
May 30, 2006
5,182
1,544
Denmark
Binning is usually done by deliberately crippling perfectly functional chips. You can't produce lower-end chips in sufficient quantities if you only wait for dies that have flaws in the right parts. Especially not with chips such as the M1 Max, where GPU cores only take a small fraction of total die area.
No, binning is done because you don't have a yield rate of 100%.

This way you don't have to throw out silicon that doesn't make the cut. I can see you understand the basics, but on a given wafer you will never have a 100% yield rate or anything close to it. Especially at 57 billion transistors.

Apple have been stockpiling these since they started making M1 Max chips.
 
  • Like
Reactions: Andropov

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
Same thing happens with games, every core works on a different part of the scene (first on different vertices, and then on different pixels). AFAIK heavy interconnection between GPU cores is not needed. It's needed for CPUs, since they have to keep some things in sync, but GPUs don't need to enforce synchronization as much, they're basically huge SIMD machines. But I'm puzzled as to why the GPU scaling is so bad in some apps.
Maybe the scheduler cannot keep all the cores busy with work?
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
No, binning is done because you don't have a yield rate of 100%.

This way you don't have to throw out silicon that doesn't make the cut. I can see you understand the basics, but on a given wafer you will never have a 100% yield rate or anything close to it. Especially at 57 billion transistors.
That's how it may work with Intel chips, where most of the die area is covered by the CPU cores and the GPU, which can all be disabled independently.

On an M1 Max, the GPU cores only cover something like 25% of the die area. Most chips with flaws cannot be salvaged, because Apple only disables GPU cores and not, for example, CPU cores, Thunderbolt controllers, caches, memory controllers, or SSD controllers. In addition to that, the ratio of chips with 24 functional GPU cores to chips with all 32 GPU cores depends on the yield rate, which is beyond Apple's control. If the rate is too high or improves over time, Apple can't get enough 24-core chips to meet the demand, and they have to disable some 32-core chips to sell them.
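A toy yield model shows the dependence (assuming a Poisson defect distribution; the defect density and die size below are made-up, illustrative numbers):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* All three numbers are assumptions for illustration, not foundry data. */
    const double die_area     = 4.3;   /* cm^2, roughly M1 Max sized          */
    const double gpu_fraction = 0.25;  /* share of area that is GPU cores     */
    const double d0           = 0.10;  /* defects per cm^2                    */

    /* Poisson model: P(region is defect-free) = exp(-d0 * region_area) */
    double p_perfect  = exp(-d0 * die_area);
    double p_gpu_only = exp(-d0 * die_area * (1.0 - gpu_fraction)) - p_perfect;

    printf("fully working dies:            %.1f%%\n", 100.0 * p_perfect);
    printf("salvageable (GPU-only defect): %.1f%%\n", 100.0 * p_gpu_only);
    return 0;
}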
 

Pressure

macrumors 603
May 30, 2006
5,182
1,544
Denmark
That's how it may work with Intel chips, where most of the die area is covered by the CPU cores and the GPU, which can all be disabled independently.

On an M1 Max, the GPU cores only cover something like 25% of the die area. Most chips with flaws cannot be salvaged, because Apple only disables GPU cores and not, for example, CPU cores, Thunderbolt controllers, caches, memory controllers, or SSD controllers. In addition to that, the ratio of chips with 24 functional GPU cores to chips with all 32 GPU cores depends on the yield rate, which is beyond Apple's control. If the rate is too high or improves over time, Apple can't get enough 24-core chips to meet the demand, and they have to disable some 32-core chips to sell them.
We don't know the redundancy built into the chip. I don't think we have gotten a true die shot yet.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
Maybe the scheduler cannot keep all the cores busy with work?
It doesn't have that many cores. A 5120×2880 image, for example, has 14,745,600 pixels. You typically have one thread per pixel.

My understanding (which could be wrong) is that each 'GPU core', as Apple calls them, has its own instruction decoder, which then dispatches work to its threadExecutionWidth number of SIMD lanes (32 on M1 Pro/Max/Ultra). So, in theory, if you write a compute kernel or a shader, your code will run on 32 × (number of cores) threads in parallel.

On M1 Max, that's 32 (number of SIMD lanes) * 32 (number of cores) = 1,024 threads at once. On M1 Ultra, that'd be 2,048 threads at once. There are subtleties in how this is done (threads have to be organized into threadgroups), but in any case it's several orders of magnitude fewer threads being executed simultaneously than the typical total number of threads to run, so the scheduler should be able to keep the cores busy just fine.
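In rough numbers (the per-core layout here is my reading of the situation, so treat it as an assumption):

#include <stdio.h>

int main(void)
{
    const int  simd_width  = 32;              /* threadExecutionWidth on M1-family GPUs */
    const int  cores_max   = 32;              /* M1 Max GPU cores   */
    const int  cores_ultra = 64;              /* M1 Ultra GPU cores */
    const long pixels_5k   = 5120L * 2880L;   /* one thread per pixel */

    printf("M1 Max:   %d threads issued at once\n", simd_width * cores_max);
    printf("M1 Ultra: %d threads issued at once\n", simd_width * cores_ultra);
    printf("5K frame: %ld threads -> ~%ld waves on the Ultra\n",
           pixels_5k, pixels_5k / (simd_width * cores_ultra));
    return 0;
}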
 

dugbug

macrumors 68000
Aug 23, 2008
1,929
2,147
Somewhere in Florida
I think that "dubious" is the wrong word. Just browsing quickly, you see almost no instructions with a 1:1 mapping. 1:2 is very common, and there are a few very expensive operations.
Here is just one of the many "highlights", a simple blend operation:

#define _mm_blend_epi16(a, b, imm)                                       \
    __extension__({                                                      \
        const uint16_t _mask[8] = {((imm) & (1 << 0)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 1)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 2)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 3)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 4)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 5)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 6)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 7)) ? 0xFFFF : 0x0000};\
        uint16x8_t _mask_vec = vld1q_u16(_mask);                         \
        uint16x8_t _a = vreinterpretq_u16_m128i(a);                      \
        uint16x8_t _b = vreinterpretq_u16_m128i(b);                      \
        vreinterpretq_m128i_u16(vbslq_u16(_mask_vec, _b, _a));           \
    })

And that's just because the immediate is in a different format than you would usually employ when using NEON. In one of my previous posts I already pointed out big issues caused by the different constant formats between SSE and NEON - and this is a perfect example.

You can reformat that into one line 😂

I don't know what these non-SW types are thinking.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
It doesn't have that many cores. A 5120×2880 image, for example, has 14,745,600 pixels. You typically have one thread per pixel.

My understanding (which could be wrong) is that each 'GPU core', as Apple calls them, has its own instruction decoder, which then dispatches work to its threadExecutionWidth number of SIMD lanes (32 on M1 Pro/Max/Ultra). So, in theory, if you write a compute kernel or a shader, your code will run on 32 × (number of cores) threads in parallel.

On M1 Max, that's 32 (number of SIMD lanes) * 32 (number of cores) = 1,024 threads at once. On M1 Ultra, that'd be 2,048 threads at once. There are subtleties in how this is done (threads have to be organized into threadgroups), but in any case it's several orders of magnitude fewer threads being executed simultaneously than the typical total number of threads to run, so the scheduler should be able to keep the cores busy just fine.

An Apple GPU core contains 4x 32-wide ALUs, for 128 ALUs or “shader cores/units” in total. How exactly these work, and whether each ALU can execute a different instruction stream, is not clear as far as I know. The Ultra will therefore contain 8,192 “shader cores/units” and should support close to 200,000 threads in flight.

I can definitely imagine performance tanking if not enough threads are supplied. GPUs need to hide latency after all.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
An Apple GPU core contains 4x 32-wide ALUs, for 128 ALUs or “shader cores/units” in total. How exactly these work, and whether each ALU can execute a different instruction stream, is not clear as far as I know. The Ultra will therefore contain 8,192 “shader cores/units” and should support close to 200,000 threads in flight.

I can definitely imagine performance tanking if not enough threads are supplied. GPUs need to hide latency after all.
The original M1 page on apple.com said the (regular) M1 supported 24,576 threads in flight. That'd mean (since the max number of threads per threadgroup is fixed at 1,024 and there are only 8 GPU cores) that a single GPU core can execute several threadgroups at once (up to 3).

For the M1 Ultra the number would be 196,608 threads in flight (just like you say). High, but still lower than even a 1920x1080 image, and that's a 'best case' scenario where you're running light shaders/kernels that don't push the register pressure too far (otherwise you'd only be able to fit fewer threads per threadgroup in memory, and the max threads in flight would decrease).

I'm not saying that scheduling doesn't play a role in the apparently poor scaling of the M1 Ultra on some tasks, but I'm not convinced that's the entirety of the problem.

As for how many shader cores/units the Ultra has: it does have 8,192 ALUs, but is every single one of them executing instructions from a different thread? Not sure how the scheduling works at that level.
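Putting the in-flight numbers from a couple of paragraphs up in one place (the per-core split is my own arithmetic, not something Apple states):

#include <stdio.h>

int main(void)
{
    const int  m1_inflight = 24576;        /* figure from the original M1 product page */
    const int  m1_cores    = 8;
    const int  ultra_cores = 64;
    const long fhd_pixels  = 1920L * 1080L;

    int per_core = m1_inflight / m1_cores;                        /* 3,072 */
    printf("in flight per core: %d (= %d threadgroups of 1,024)\n",
           per_core, per_core / 1024);
    printf("M1 Ultra in flight: %d vs %ld pixels in a 1080p frame\n",
           per_core * ultra_cores, fhd_pixels);
    return 0;
}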
 