
JMacHack

Suspended
Mar 16, 2017
1,965
2,424
Why do you like Cinebench so much? The rendering library Cinebench uses is Embree, a library developed by Intel that only has a native AVX implementation. To use it on arm64, you have to go through an AVX-to-NEON translation layer, which hurts performance and results in a lower score. The "arm64 native" build is not native to the same degree as on x86, which gets hand-crafted SIMD optimization.
Because it’s the place where the M1 series typically scores the lowest and he’d get gaped like ****** if he used any other benchmarks.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
I expect something like $6k for a 10-core Mac Pro, $10k for a 20-core model, and $15k for a 40-core model, assuming reasonable RAM/SSD options and including a decent monitor and other peripherals. The entry-level model will probably have an M1 Max or equivalent, just like the current entry-level model is comparable to a much cheaper iMac.

I kept trying to tell you that they would offer a Mac with a high-core-count CPU for significantly less than you were assuming. :) Now, a full Apple Silicon Mac Pro, whenever it gets here, will probably be more expensive than the Studio - naturally - but Apple did introduce a desktop with lots of CPU cores for significantly less than your prediction.

Of course, as for my own miscalculations: I felt sure that the new mini-Pro, the Studio, would still have one slot of internal expansion, even if it was reduced to a single x8 PCIe slot. I also thought they might offer a 16-core binned CPU between the Max and the Ultra for less than $4000, and I have to admit the base 20-core Ultra is slightly cheaper than even I thought - probably because it lacks some of the internals I expected.
 
Last edited:

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
None of the news sources I follow include Geekbench garbage, so it's just habit to look at the lesser garbage that is Cinebench R23. I'd actually prefer the industry standardize on Blender. What dinosaur still renders with the CPU instead of the GPU?
I'd like to know why Geekbench is garbage.
Unlike Cinebench, it's designed to be cross-platform (Cinebench is more optimised for x86), and it comprises a large set of algorithms, whereas Cinebench does just one thing: ray tracing. Cinebench is not more "real life" than Geekbench, and in fact is much less representative of overall performance.

The only argument against Geekbench appears to be that... its runtime is too short. Which probably benefits x86 CPUs, as they overheat more. (For longer tests, there's Geekbench Pro.)
I've heard claims that geekbench favours OSX over Windows. It doesn't. I've checked myself.
 
Last edited:

Rigby

macrumors 603
Aug 5, 2008
6,257
10,215
San Jose, CA
Dubious statement at best. Can you quantify how many intrinsics have a one-to-one correspondence? Because just from browsing through it, I definitely see a lot of complex mappings.
LOL. Do you want me to count them? I don't see any "complex mappings". Most of the lines in the source are just type casts which don't produce any code.
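To illustrate (just a rough sketch in the sse2neon style, not the actual header - the type alias is mine): a typical 1:1 mapping is a thin wrapper where the reinterpret casts exist only for the type system and compile to nothing, leaving a single NEON add.

#include <arm_neon.h>

/* Stand-in for sse2neon's __m128i: a 128-bit NEON integer register. */
typedef int64x2_t m128i_t;

/* _mm_add_epi32-style wrapper: the reinterpret casts emit no instructions,
   so the whole thing boils down to one vaddq_s32 (a single ADD). */
static inline m128i_t emu_mm_add_epi32(m128i_t a, m128i_t b)
{
    return vreinterpretq_s64_s32(
        vaddq_s32(vreinterpretq_s32_s64(a), vreinterpretq_s32_s64(b)));
}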

Also, what about things that SSE does not have (and therefore doesn't use) but NEON does?
Such as? I have done quite a bit of optimization using various versions of SSE and AVX in my time, but don't have any experience in ARM coding on that level.

Anyway, Alder Lake is more than twice as fast in Cinebench multi-core as the M1 Max. It would require some pretty sensational intrinsics in NEON to achieve that much of a boost. If that were possible, I suspect Apple would have merged it into Embree long ago, given that Embree is used by some very popular projects such as Blender. But on the Embree GitHub page you can see them contributing optimizations that gain something like 8% ...

I would expect that on a straightforward, well-optimized SIMD workload (AVX2 vs. NEON), the difference in performance will more or less mirror the difference in clock rate, as throughput per clock is very similar between Firestorm and modern x86 architectures.
I doubt that very much. There was a significant performance improvement from AVX to AVX2, the only big difference being the longer vectors (well, and FMA for FP calculations). There are a lot of factors playing into SIMD performance besides just the execution units. For example, Alder Lake can load two 256-bit vectors per cycle from L1D (strictly speaking even two 512-bit vectors, but the consumer Alder Lakes can't make use of that since AVX512 is disabled). If you have shorter vectors, the throughput will obviously be that much lower, everything else being equal. Up to some point there is no replacement for displacement. :p On the upcoming workstation Alder Lakes the difference may be even starker, since they have AVX512 enabled.
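To make the width point concrete, here's the same y += a*x kernel sketched both ways (purely illustrative, nothing to do with Embree's actual code): the AVX2 version handles 8 floats per load/FMA/store, the NEON version 4, so all else being equal the 128-bit path needs twice as many instructions for the same data.

/* Illustrative sketch: identical kernel, different vector widths.
   Remainder elements are ignored for brevity. */
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
void axpy(float a, const float *x, float *y, int n)
{
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i + 8 <= n; i += 8) {           /* 8 floats per iteration */
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i), vy);
        _mm256_storeu_ps(y + i, vy);
    }
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
void axpy(float a, const float *x, float *y, int n)
{
    float32x4_t va = vdupq_n_f32(a);
    for (int i = 0; i + 4 <= n; i += 4) {           /* 4 floats per iteration */
        float32x4_t vy = vld1q_f32(y + i);
        vy = vfmaq_f32(vy, va, vld1q_f32(x + i));   /* vy += a * x[i..i+3] */
        vst1q_f32(y + i, vy);
    }
}
#endif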
 

Gnattu

macrumors 65816
Sep 18, 2020
1,107
1,671
Alder Lake is more than twice as fast in Cinebench multi-core as the M1 Max. It would require some pretty sensational intrinsics in NEON to achieve that much of a boost.
I recently worked on a project of mine where the crypto backend more than doubled in speed with a native NEON implementation of ChaCha20, and that only happened on the M1 (the A72 got about 50% faster). Embree is in a more complex situation and it's hard to guarantee how far we can get. By the way, the 8+8 Alder Lake has more E-cores, so I think comparing single-core performance could be a better metric.
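For a rough idea of why the native NEON path helps so much (a generic sketch of the technique, not my actual project code): ChaCha20's quarter-round is nothing but 32-bit adds, XORs and rotates, which map straight onto NEON, so four state words are processed per instruction.

#include <arm_neon.h>

/* Rotate every 32-bit lane left by a compile-time constant n. */
#define ROTL32(v, n) vorrq_u32(vshlq_n_u32((v), (n)), vshrq_n_u32((v), 32 - (n)))

/* One ChaCha20 quarter-round across four lanes at once: a/b/c/d each hold
   four state words, so this performs four quarter-rounds in parallel. */
static inline void chacha_quarter_round(uint32x4_t *a, uint32x4_t *b,
                                        uint32x4_t *c, uint32x4_t *d)
{
    *a = vaddq_u32(*a, *b); *d = veorq_u32(*d, *a); *d = ROTL32(*d, 16);
    *c = vaddq_u32(*c, *d); *b = veorq_u32(*b, *c); *b = ROTL32(*b, 12);
    *a = vaddq_u32(*a, *b); *d = veorq_u32(*d, *a); *d = ROTL32(*d, 8);
    *c = vaddq_u32(*c, *d); *b = veorq_u32(*b, *c); *b = ROTL32(*b, 7);
}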
 

januarydrive7

macrumors 6502a
Oct 23, 2020
537
578
I recently worked on a project of mine where the crypto backend more than doubled in speed with a native NEON implementation of ChaCha20, and that only happened on the M1 (the A72 got about 50% faster). Embree is in a more complex situation and it's hard to guarantee how far we can get. By the way, the 8+8 Alder Lake has more E-cores, so I think comparing single-core performance could be a better metric.
The crypto section of the sse2neon header is particularly nasty, from a brief glance, with comments even mentioning that the implementations are known to be slow.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
LOL. Do you want me to count them? I don't see any "complex mappings". Most of the lines in the source are just type casts which don't produce any code.


Such as? I have done quite a bit of optimization using various versions of SSE and AVX in my time, but don't have any experience in ARM coding on that level.

Anyway, Alder Lake is more than twice as fast in Cinebench multi-core as the M1 Max. It would require some pretty sensational intrinsics in NEON to achieve that much of a boost. If that were possible, I suspect Apple would have merged it into Embree long ago, given that Embree is used by some very popular projects such as Blender. But on the Embree GitHub page you can see them contributing optimizations that gain something like 8% ...


I doubt that very much. There was a significant performance improvement from AVX to AVX2, the only big difference being the longer vectors (well, and FMA for FP calculations). There are a lot of factors playing into SIMD performance besides just the execution units. For example, Alder Lake can load two 256-bit vectors per cycle from L1D (strictly speaking even two 512-bit vectors, but the consumer Alder Lakes can't make use of that since AVX512 is disabled). If you have shorter vectors, the throughput will obviously be that much lower, everything else being equal. Up to some point there is no replacement for displacement. :p On the upcoming workstation Alder Lakes the difference may be even starker, since they have AVX512 enabled.

Personally, I don't think the problems with CB23 stop at Intel Embree. That's part of it, but not the whole story. Other projects use Intel Embree and don't see the poor utilization of Apple cores that CB23 does. Other renderers with CPU ray tracing don't either; the problem is specific to CB23. Again, for what it's worth, Andrei has said he talked to the developers and they discussed why CB23 so underutilizes the M1 hardware (and it actually seems to be an issue on non-Intel hardware in general). What exactly was said remains between them - it's a closed-source program and those were private discussions, so we may never know.

But the problem is unique to CB23, which makes it almost guaranteed to be a coding or compiling issue on their end, given that we can see the power utilization for CB23 on M1 cores relative to Blender or POV-Ray. We've had this discussion before; I'm not sure why you're still disputing it when the numbers are up in black and white. We may not know the full picture of what's going on under the hood in CB23, but something is not right, and the problem lies with it. We know Apple's CB23 raw scores are an outlier, and we know that CB23 weirdly doesn't draw as many watts as other programs, or as much as it does on Intel machines - which actually pushes Apple's perf/W scores *higher* relative to Intel.
 
Last edited:
  • Like
Reactions: Stratus Fear

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Dubious statement at best. Can you quantify how many intrinsics have a one-to-one correspondence? Because just from browsing through it, I definitely see a lot of complex mappings.

I think "dubious" is the wrong word. Just browsing quickly, you see almost no instructions with a 1:1 mapping. 1:2 is very common, and there are a few very expensive operations.
Here is just one of the many "highlights", a simple blend operation:

#define _mm_blend_epi16(a, b, imm) \
    __extension__({ \
        const uint16_t _mask[8] = {((imm) & (1 << 0)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 1)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 2)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 3)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 4)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 5)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 6)) ? 0xFFFF : 0x0000, \
                                   ((imm) & (1 << 7)) ? 0xFFFF : 0x0000}; \
        uint16x8_t _mask_vec = vld1q_u16(_mask); \
        uint16x8_t _a = vreinterpretq_u16_m128i(a); \
        uint16x8_t _b = vreinterpretq_u16_m128i(b); \
        vreinterpretq_m128i_u16(vbslq_u16(_mask_vec, _b, _a)); \
    })

And that's just because the immediate is in a different format than you would normally use with NEON. In one of my previous posts I already pointed out big issues caused by the different constant formats between SSE and NEON - and this is a perfect example.
 
Last edited:

oz_rkie

macrumors regular
Apr 16, 2021
177
165
Yeah, it’s a shame that all the folks known for in-depth reviews have quit. We are left with barely competent folks like Linux Tech Tips and barely incompetent folks like Max Tech 😑 Oh well…
There's AnandTech, which still does pretty good in-depth reviews, and they cover Macs as well. There's an initial deep-dive commentary on today's launch by Dr. Ian Cutress.

 
  • Like
Reactions: JMacHack

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Personally, I don't think the problems with CB23 stop at Intel Embree. That's part of it, but not the whole story. Other projects use Intel Embree and don't see the poor utilization of Apple cores that CB23 does.

I cannot speak for the M1, but I do see the same bad relative performance on my Surface Pro X wherever Embree is used. This includes, for instance, Blender. Besides, POV-Ray does not use Embree. The issue with POV-Ray is that it lacks a reasonably good NEON implementation altogether (while it does have an AVX/SSE code path).
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
There's AnandTech, which still does pretty good in-depth reviews, and they cover Macs as well. There's an initial deep-dive commentary on today's launch by Dr. Ian Cutress.

Sadly he’s not at AnandTech anymore :(

Andrei left as well
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
I cannot speak for the M1, but I do see the same bad relative performance on my Surface Pro X wherever Embree is used. This includes, for instance, Blender. Besides, POV-Ray does not use Embree. The issue with POV-Ray is that it lacks a reasonably good NEON implementation altogether (while it does have an AVX/SSE code path).

Yeah, I think Embree is part of it, but I saw another (toy) project using Embree that didn't suffer as badly as CB23 on the M1 vs Intel - though the post's author emphasized that they had a much simpler ray tracing framework than most professional renderers.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Yeah, I think Embree is part of it, but I saw another (toy) project using Embree that didn't suffer as badly as CB23 on the M1 vs Intel - though the post's author emphasized that they had a much simpler ray tracing framework than most professional renderers.

Surely the instruction distribution plays a role. I updated my previous post with a particular example of a simple blend operation you might want to take a look at.
And there are many such examples, which can simply be explained by the different immediate formats between SSE and NEON.
The thing is, the caller (from C++ code) provides an SSE-compliant immediate and the emulation converts it back into a NEON-compliant form. If the caller called NEON directly, it would pass a NEON-compliant immediate in the first place.
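As a sketch of what the native path would look like (my illustration, not Embree code): if the caller bakes the selection mask into a vector constant up front, the blend itself collapses to a single bit-select, with none of the per-call mask construction from the macro above.

#include <arm_neon.h>

/* Hypothetical native-NEON caller: keep even lanes from a, odd lanes from b.
   The mask is a plain vector constant, so the blend is one VBSL instruction. */
static inline uint16x8_t blend_even_from_a_odd_from_b(uint16x8_t a, uint16x8_t b)
{
    static const uint16_t mask_bits[8] = {0x0000, 0xFFFF, 0x0000, 0xFFFF,
                                          0x0000, 0xFFFF, 0x0000, 0xFFFF};
    uint16x8_t mask = vld1q_u16(mask_bits);   /* all-ones lanes select from b */
    return vbslq_u16(mask, b, a);
}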
 
Last edited:

JimmyjamesEU

Suspended
Jun 28, 2018
397
426
48-core Geekbench OpenCL result: 77000. Pretty disappointing, but not that surprising given that AnandTech mentioned the test doesn't run long enough to wake up all the GPU cores.

 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
Surely the instruction distribution plays a role. I updated my previous post with a particular example of a simple blend operation you might want to take a look at.
And there are many such examples, which can simply be explained by the different immediate formats between SSE and NEON.
Absolutely. Again, I think Embree is definitely part of it - I just suspect CB23's problems extend to more than just the ray tracing portions of their code. Blender's BMW test uses Intel Embree too, and the M1/Pro/Max CPU does okay on that test (though Apple's engineers note its performance can still be improved - as of 4 days ago they submitted a pull request to Embree that they say resolves the outstanding issues and gets an 8% performance increase: https://github.com/embree/embree/pull/330). It's a suspicion; I have no proof.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Blender's BMW test uses Intel Embree too, and the M1/Pro/Max CPU does okay on that test (though Apple's engineers note its performance can still be improved - as of 4 days ago they submitted a pull request to Embree that they say resolves the outstanding issues and gets an 8% performance increase: https://github.com/embree/embree/pull/330). It's a suspicion; I have no proof.

If I read this correctly, the original pull request is from mid-2021 and it is still not resolved due to bugs :(
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
If I read this correctly, the original pull request is from mid-2021 and it is still not resolved due to bugs :(
In the final comment, from 4 days ago, Apple submitted a new patch that they say fixes the bugs, so hopefully, if that's true, it'll be accepted soon.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
I kept trying to tell you that they would offer a Mac with a high core count CPU for significantly less than you were assuming. :) Now a full Apple Silicon Mac Pro, whenever it gets here, will probably be more expensive than the Studio is - naturally, but Apple did introduce a desktop with lots of CPU cores for significantly less than your prediction.
It's certainly cheaper than I expected, partly because we got a consumer desktop instead of a workstation. And because it's also the replacement for the normal-sized iMac.

A base model (M1 Max / 32 GB / 2 TB), one Studio Display, and basic peripherals is about $4500. The same system with M1 Ultra and 64 GB RAM is ~$6500. That's roughly $1100 and $800 per M1 equivalent, which is at the lower end of what Apple is losing by not being able to sell more M1 devices. I used 2 TB as the minimum reasonable SSD size, because external connectivity is limited and you would need 2 Thunderbolt ports to achieve similar performance with an external drive.

The prices for GPU upgrades are interesting, because you pay them only to prevent Apple from crippling your GPU deliberately. The upgrade from 24 cores to 32 cores is reasonable at $200, while the comparable upgrade from 48 cores to 64 cores costs you $1000.

I'm kind of disappointed with the Mac Studio, but I've also been expecting this from the moment I first saw the specs of the M1. It's a nice machine on paper, but it's also more expensive than I'm willing to pay and less capable than I need in the aspects I care about. The configuration I would choose costs $5800 without a monitor or peripherals, and you don't even get 256 GB RAM for that price.
 

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
It's certainly cheaper than I expected, partly because we got a consumer desktop instead of a workstation. And because it's also the replacement for the normal-sized iMac.

A base model (M1 Max / 32 GB / 2 TB), one Studio Display, and basic peripherals is about $4500. The same system with M1 Ultra and 64 GB RAM is ~$6500. That's roughly $1100 and $800 per M1 equivalent, which is at the lower end of what Apple is losing by not being able to sell more M1 devices. I used 2 TB as the minimum reasonable SSD size, because external connectivity is limited and you would need 2 Thunderbolt ports to achieve similar performance with an external drive.

The prices for GPU upgrades are interesting, because you pay them only to prevent Apple from crippling your GPU deliberately. The upgrade from 24 cores to 32 cores is reasonable at $200, while the comparable upgrade from 48 cores to 64 cores costs you $1000.

I'm kind of disappointed with the Mac Studio, but I've also been expecting this from the moment I first saw the specs of the M1. It's a nice machine on paper, but it's also more expensive than I'm willing to pay and less capable than I need in the aspects I care about. The configuration I would choose costs $5800 without a monitor or peripherals, and you don't even get 256 GB RAM for that price.

The very highest end from Apple almost always costs a premium. They'll give you something mid-high tier at a very reasonable price though. The 48 GPU configuration is already a smoke show in terms of performance.

The binning is understandable, but I have a feeling 56 or 60 GPU cores would have been very achievable. The marketing department probably took over on this one and said "no, 48 is good" - and now getting the full 64 costs a premium.

Overall, the binned versions are still one heck of a bargain though. You can't find this kind of performance at this price point in the X86 world.
 

huge_apple_fangirl

macrumors 6502a
Aug 1, 2019
769
1,301
The prices for GPU upgrades are interesting, because you pay them only to prevent Apple from crippling your GPU deliberately. The upgrade from 24 cores to 32 cores is reasonable at $200, while the comparable upgrade from 48 cores to 64 cores costs you $1000.
How are they crippling the GPU deliberately? It's called binning.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
How are they crippling the GPU deliberately? It's called binning.
Binning is usually done by deliberately crippling perfectly functional chips. You can't produce lower-end chips in sufficient quantities if you only wait for dies that have flaws in the right parts. Especially not with chips such as the M1 Max, where GPU cores only take a small fraction of total die area.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
It's certainly cheaper than I expected, partly because we got a consumer desktop instead of a workstation. And because it's also the replacement for the normal-sized iMac.

A base model (M1 Max / 32 GB / 2 TB), one Studio Display, and basic peripherals is about $4500. The same system with M1 Ultra and 64 GB RAM is ~$6500. That's roughly $1100 and $800 per M1 equivalent, which is at the lower end of what Apple is losing by not being able to sell more M1 devices. I used 2 TB as the minimum reasonable SSD size, because external connectivity is limited and you would need 2 Thunderbolt ports to achieve similar performance with an external drive.

The prices for GPU upgrades are interesting, because you pay them only to prevent Apple from crippling your GPU deliberately. The upgrade from 24 cores to 32 cores is reasonable at $200, while the comparable upgrade from 48 cores to 64 cores costs you $1000.

I'm kind of disappointed with the Mac Studio, but I've also been expecting this from the moment I first saw the specs of the M1. It's a nice machine on paper, but it's also more expensive than I'm willing to pay and less capable than I need in the aspects I care about. The configuration I would choose costs $5800 without a monitor or peripherals, and you don't even get 256 GB RAM for that price.

True, though you don't have to get a Studio Display if that's overkill - there are plenty of other displays with decent specs, depending on what you want. I agree that the price increase from 48 to 64 cores is high, but I also have to admit that the base 20-core price is lower than even I thought it would be. So the absolute price of the 64-core is still roughly the same as I expected, and this is pretty much the machine I thought they were going to make, though as mentioned I did think there would be a small PCIe expansion slot or an M.2 slot.

Just curious: did you want more GPU with less CPU? My guess is that kind of modularity won't come until the M2, or more likely the M3 based on the rumors, when a greater proportion of the SoCs are actually made from chiplets that can be put together like Lego. Then there might be more variations in the SoC product stack.
 