
Gerdi

macrumors 6502
Apr 25, 2020
449
301
It's not about raytracing, it's about SIMD-heavy code. Both Apple and modern x86 CPUs can do around 512 bits worth of SIMD operations per clock (give or take). But x86 CPUs run at higher clocks and have more L1D cache bandwidth to support these operations. It's all about tradeoffs, really. Apple is more flexible (their 4x smaller SIMD units are better suited for more complex algorithms and scalar computation) and more efficient (not paying for higher bandwidth and higher clocks), but this also means it cannot win when it comes to a raw SIMD slugfest. We see it across all kinds of SIMD-oriented workflows btw, not just CB.

Of course, if you take clock frequency into consideration, then the higher-clocked design does have a throughput advantage. But that is not an architectural weakness. However, for mobile x86 designs running at clocks similar to AS, the performance discrepancy is only related to how Embree handles NEON.
That is, unless you compare against a desktop CPU, where Apple does not have a match.

Is it possible to write a CPU raytracer that would perform better on Apple Silicon than Embree? I am sure it is. Embree isn't really written with a CPU in mind that has four (albeit smaller) independent SIMD units, and I wouldn't be surprised if certain operations (like ray-box intersection) can be implemented more efficiently on ARM SIMD. But no matter how efficient your code is, this doesn't change the fact that Apple Silicon is at a disadvantage in SIMD throughput compared to x86 CPUs on the hardware level.
Did you look at the sources? Float[4] is the most common data structure - which is perfect for NEON. Are you familiar with the Embree implementation?
You again fail to understand the issue: Embree is written for AVX/SSE and only statically wrapped to NEON.
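To make the float[4] point concrete, here is a rough sketch (my own illustration, not Embree's actual code) of how a single ray-box slab test maps onto one 128-bit NEON register when points and directions are stored as xyz plus a padding lane:

#include <arm_neon.h>

// Illustrative only: one ray vs. one AABB, float[4] layout (x, y, z, pad).
// Assumes inv_dir = 1/dir has been precomputed and contains no NaNs.
bool ray_box_hit(float32x4_t org, float32x4_t inv_dir,
                 float32x4_t box_min, float32x4_t box_max,
                 float t_near, float t_far)
{
    float32x4_t t0   = vmulq_f32(vsubq_f32(box_min, org), inv_dir);
    float32x4_t t1   = vmulq_f32(vsubq_f32(box_max, org), inv_dir);
    float32x4_t tmin = vminq_f32(t0, t1);            // per-axis entry distance
    float32x4_t tmax = vmaxq_f32(t0, t1);            // per-axis exit distance
    tmin = vsetq_lane_f32(t_near, tmin, 3);          // fold the ray interval into
    tmax = vsetq_lane_f32(t_far,  tmax, 3);          // the otherwise unused pad lane
    return vmaxvq_f32(tmin) <= vminvq_f32(tmax);     // AArch64 horizontal max/min
}

The whole test is a handful of NEON instructions plus two horizontal reductions, so the 128-bit register width is not the limiting factor here.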
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Of course, if you take clock frequency into consideration, then the higher-clocked design does have a throughput advantage. But that is not an architectural weakness. However, for mobile x86 designs running at clocks similar to AS,

Not if you look at the aggregated clock across all cores as well as SMT (which works very well for these kinds of workloads). This is a common theme when you look at SIMD throughput-oriented workloads. When operating on only a few cores, x86 wins because of higher clocks. When operating on many cores, x86 wins because of more cores and SMT.

the performance discrepancy is only related to how Embree handles NEON.

I already wrote that I believe that Embree performance on AS could be improved. Some routines would likely benefit from careful optimisation using ARM-specific instructions, and the algorithms could be rewritten with more ILP in mind to take better advantage of Apple's wide cores.

Did you look at the sources? Float[4] is the most common data structure - which is perfect for NEON. Are you familiar with the Embree implementation?

I did have a look a while ago.

You again fail to understand the issue: Embree is written for AVX/SSE and only statically wrapped to NEON.

I am fairly certain that I was one of the first to point out that Embree uses sse2neon.h. If you look around you'll find a bunch of my posts discussing this very issue and explaining (with code examples) why this kind of translation layer cannot achieve optimal results in some cases.
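To recap the gist of those posts: some SSE intrinsics have no single NEON equivalent, so a header-level translation has to expand them into a small instruction sequence. A rough sketch of the idea for _mm_movemask_ps (paraphrased from memory, not the actual sse2neon.h code), which is a single instruction on x86:

#include <arm_neon.h>

// _mm_movemask_ps packs the four sign bits of a float vector into an int.
// x86 does this in one instruction; a NEON emulation needs a sequence.
static inline int movemask_ps_emulated(float32x4_t a)
{
    uint32x4_t bits  = vreinterpretq_u32_f32(a);
    uint32x4_t signs = vshrq_n_u32(bits, 31);                       // isolate each lane's sign bit
    const int32_t lane_shift[4] = {0, 1, 2, 3};
    uint32x4_t weighted = vshlq_u32(signs, vld1q_s32(lane_shift));  // lane i -> bit i
    return (int)vaddvq_u32(weighted);                               // horizontal add = packed mask
}

Code written against SSE tends to lean on movemask for its branching decisions, so these expansions add up; a NEON-native path could often avoid the packed mask entirely and use per-lane selects instead.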

But here I am talking about something else. This is about the inherent disadvantage Apple Silicon has in SIMD throughput against x86 designs. Apple designed their CPUs for flexibility (latency on complex workloads) as well as energy efficiency, while Intel's SIMD designs are all about throughput (you can clearly see it in their SIMD ISA philosophy).

Frankly, I think Apple's design makes much more sense. It doesn't make much sense to sacrifice so much just to excel on a few niche codebases, and the flexibility of Apple's FP units actually allows them to perform very well in many real-world scientific workloads (just look at their SPECfp results). For throughput Apple has the AMX units, which are superior to x86 SIMD for many common tasks, and once Apple implements SVE2 with streaming mode I am sure they will be able to challenge even desktop x86. And finally, there is always the GPU, which is a dedicated parallel processor. Apple's strategy for raytracing is GPU acceleration with CPU-like programmability. They don't have any reason whatsoever to bloat their CPUs in order to make them faster in this domain.
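For context on the AMX point: as far as I know the units are not directly programmable; you reach them through the Accelerate framework, e.g. its BLAS. A trivial sketch (compile with -framework Accelerate on macOS):

#include <Accelerate/Accelerate.h>
#include <vector>

// Multiply two 512x512 matrices. On Apple Silicon, Accelerate's BLAS is the
// sanctioned way to tap the matrix hardware instead of hand-written SIMD.
int main() {
    const int n = 512;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A.data(), n,
                      B.data(), n,
                0.0f, C.data(), n);
    return C[0] == static_cast<float>(2 * n) ? 0 : 1;  // each entry sums 512 products of 1*2
}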
 
  • Like
Reactions: jdb8167

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Not if you look at the aggregated clock across all cores as well as SMT (which works very well for these kinds of workloads). This is a common theme when you look at SIMD throughput-oriented workloads. When operating on only a few cores, x86 wins because of higher clocks. When operating on many cores, x86 wins because of more cores and SMT.
The thing is, you fundamentally mix up architecture with actual design. The fact that some x86 designs have more cores or SMT is not an architectural issue; you are just comparing a design with fewer resources against a design with more resources.
However, with Embree, AS is even losing when compared against an x86 design with comparable resources - and the reason lies in Embree's usage of NEON.

I did have a look a while ago.
Then you would have concluded that Embree is not inherently designed for larger vectors. Most of the raytracing kernels work with homogeneous coordinates (so you have 4-component float vectors). So NEON does not have an inherent disadvantage compared to SIMD implementations with larger vectors.
But here I am talking about something else. This is about the inherent disadvantage Apple Silicon has in SIMD throughput against x86 designs.
There is no such thing - unless you are comparing the wrong designs (see above) or vastly different SW implementations like Embree.
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,674
The thing is, you fundamentally mix up architecture with actual design. The fact that some x86 designs have more cores or SMT is not an architectural issue; you are just comparing a design with fewer resources against a design with more resources.

I am talking about concrete implementations: Apple Firestorm and its iterations vs. Alder Lake/Zen 3/Zen 4 as shipped in currently available products.

However, with Embree, AS is even losing when compared against an x86 design with comparable resources - and the reason lies in Embree's usage of NEON.

Nobody disputes that Embree is not optimally tuned for Apple Silicon. That's a fact. What I am saying is that there are two dimensions to this. One is software quality. The other is the lower SIMD throughput of Apple processors.


Then you would have concluded that Embree is not inherently designed for larger vectors.

Embree works with AVX2 (256-bit SIMD), not sure about AVX-512. In fact, Apple has recently submitted a patch that improves performance by approx. 10% on Apple Silicon by adding AVX2 intrinsics to the translation header.
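The idea behind such a patch, as I understand it (sketched from the description above; the names here are made up, this is not the actual header), is to model one 256-bit AVX2 vector as a pair of 128-bit NEON registers so that Embree's wider code paths can be used at all:

#include <arm_neon.h>

// Hypothetical emulation of a 256-bit vector on a 128-bit NEON machine.
struct m256_emu { float32x4_t lo, hi; };

static inline m256_emu emu_mm256_add_ps(m256_emu a, m256_emu b) {
    return { vaddq_f32(a.lo, b.lo), vaddq_f32(a.hi, b.hi) };   // 1 emulated op = 2 native ops
}

static inline m256_emu emu_mm256_fmadd_ps(m256_emu a, m256_emu b, m256_emu c) {
    // x86's _mm256_fmadd_ps(a, b, c) computes a*b + c; vfmaq_f32(acc, x, y) is acc + x*y.
    return { vfmaq_f32(c.lo, a.lo, b.lo), vfmaq_f32(c.hi, a.hi, b.hi) };
}

Each emulated 8-wide operation becomes two independent 4-wide operations, which a core with four NEON pipes can in principle execute in parallel.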



So NEON does not have an inherent disadvantage compared to SIMD implementations with larger vectors.
Of course it doesn’t. Again, it’s about the implementation. Apple hardware simply doesn’t have the SIMD throughput of similar-class x86 products.


There is no such thing - unless you are comparing the wrong designs (see above) or vastly different SW implementations like Embree.

I do not follow. What does it mean to compare the right designs to you? Write some basic (not RAM-bandwidth-limited) SIMD throughput code, making sure that you can confidently saturate both Apple's 4x 128-bit units and the 2x AVX2 units. Then run it on an M2 vs. a comparable-class Intel or AMD CPU. The M2 will lose in single-core (lower clock, lower cache bandwidth) and multi-core operation (either fewer cores vs. Intel or, again, lower clock vs. AMD).
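Something along these lines is what I have in mind - a register-resident FMA chain with enough independent accumulators to hide latency, so the only limit is how many SIMD FMAs the core can retire per clock. A rough sketch (constants are arbitrary; compile with -O2, plus -mavx2 -mfma on x86):

#include <cstdio>
#if defined(__aarch64__)
  #include <arm_neon.h>
#elif defined(__AVX2__) && defined(__FMA__)
  #include <immintrin.h>
#endif

int main() {
    const long iters = 100000000;
#if defined(__aarch64__)
    // 8 independent 128-bit accumulators: enough ILP to keep all four
    // NEON FMA pipes busy despite the few-cycle FMA latency.
    float32x4_t acc[8];
    for (auto &a : acc) a = vdupq_n_f32(0.0f);
    const float32x4_t x = vdupq_n_f32(1.0f), y = vdupq_n_f32(1e-8f);
    for (long i = 0; i < iters; ++i)
        for (auto &a : acc) a = vfmaq_f32(a, x, y);          // a += x*y
    float sink = vaddvq_f32(acc[0]);
#elif defined(__AVX2__) && defined(__FMA__)
    // 8 independent 256-bit accumulators for the two AVX2/FMA ports.
    __m256 acc[8];
    for (auto &a : acc) a = _mm256_set1_ps(0.0f);
    const __m256 x = _mm256_set1_ps(1.0f), y = _mm256_set1_ps(1e-8f);
    for (long i = 0; i < iters; ++i)
        for (auto &a : acc) a = _mm256_fmadd_ps(x, y, a);    // a += x*y
    float sink = _mm256_cvtss_f32(acc[0]);
#else
    float sink = 0.0f;
#endif
    std::printf("%f\n", sink);  // keep the accumulators live
    return 0;
}

Time the run externally (e.g. with /usr/bin/time); the work done is iters x 8 accumulators x vector width x 2 FLOPs.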

We seem to have some sort of interfacing issue here. I really have difficulty understanding exactly which aspect of my argument you are arguing against.
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
[two image attachments]
 

senttoschool

macrumors 68030
Original poster
Nov 2, 2017
2,626
5,482
Of course it doesn’t. Again, it’s about the implementation. Apple hardware simply doesn’t have the SIMD throughput of similar-class x86 products.
I'm curious: could Apple eventually leverage UMA and develop some sort of technique where highly parallel code is automatically run on the GPU instead of the CPU, without the developer knowing it? This would allow Apple to avoid spending precious transistors on expanding SIMD.

Apple Silicon is guaranteed to have shared memory and guaranteed to have a GPU, which is something Windows/Linux can't guarantee.
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,674
I'm curious: could Apple eventually leverage UMA and develop some sort of technique where highly parallel code is automatically run on the GPU instead of the CPU, without the developer knowing it? This would allow Apple to avoid spending precious transistors on expanding SIMD.

Apple Silicon is guaranteed to have shared memory and guaranteed to have a GPU, which is something Windows/Linux can't guarantee.

That's a very good question. I was wondering the same.

If one looks at common applications of SIMD, one sees that there are two common themes. One is about maximizing computation throughput (a lot of data, and you want to process all of it as quickly as possible), and another is about speeding up complex algorithms (little data, but you want to process multiple data elements at once; latency-sensitive, a lot of data swizzling, often crossing the SIMD and control-flow domains). These are fairly conflicting requirements. For throughput you are much better off with very large, configurable vector registers, but that sucks if you only work with a few data elements and care about latency.
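A toy illustration of the two themes (NEON used purely as an example; nothing here is from a real codebase):

#include <arm_neon.h>
#include <cstddef>

// Theme 1: latency-bound. Little data, each step feeds the next, and the
// reduction crosses lanes - instruction latency and flexible shuffles
// matter far more than vector width.
float dot4(float32x4_t a, float32x4_t b) {
    return vaddvq_f32(vmulq_f32(a, b));     // lane-wise multiply, then horizontal add
}

// Theme 2: throughput-bound. Lots of independent elements - here wider
// vectors (or more execution units) convert almost directly into speed.
void axpy(float* dst, const float* x, const float* y, float a, size_t n) {
    const float32x4_t va = vdupq_n_f32(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vfmaq_f32(vld1q_f32(y + i), vld1q_f32(x + i), va));
    for (; i < n; ++i)                      // scalar tail
        dst[i] = y[i] + a * x[i];
}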

So really what you want is two separate SIMD systems, each optimized for different goals. Apple has been doing this for a while with AMX. SVE2 does this with regular vs. "streaming" (throughput-oriented) SIMD modes. This might seem like a more complex design, but I believe it's superior. You don't have to bloat your CPU cores with large vector units (a problem that Intel is struggling with), and you can easily scale the large-vector machinery based on the product class.

Can a GPU work as a large vector coprocessor? Frankly, I don't know. Apple's AMX units sit on the same L2 as the CPU cluster, which allows them to share data very quickly. UMA or not, there is still an astronomical distance between the CPU and the GPU. Also, it is possible that extending the GPU to serve as a vector coprocessor might make it an objectively worse GPU. It's a complex topic and I am not qualified to reason about it.
 

scottrichardson

macrumors 6502a
Jul 10, 2007
716
293
Ulladulla, NSW Australia
And my iMac:

2020 iMac 27" / Core i9 10910 - 10c 5Ghz Turbo / 3.6Ghz base / 64GB RAM / Radeon Pro 5700 XT 16GB GPU / 1TB SSD
[two screenshot attachments]


Blown away that my top-spec iMac is being totally obliterated by this MacBook Pro. Mind blowing really. Can't wait to finish setting this up with the studio display etc etc...
 

T'hain Esh Kelch

macrumors 603
Aug 5, 2001
6,474
7,408
Denmark
And my iMac:

2020 iMac 27" / Core i9 10910 - 10c 5Ghz Turbo / 3.6Ghz base / 64GB RAM / Radeon Pro 5700 XT 16GB GPU / 1TB SSD

Blown away that my top-spec iMac is being totally obliterated by this MacBook Pro. Mind blowing really. Can't wait to finish setting this up with the studio display etc etc...
I am more amazed that the 20-hour-battery-life MBA M2 is also doing it... at least in single and multi-core. Metal is another story, of course. The M SoCs are crazy! Or Intel is lazy and/or milking the market. :)
 

avkills

macrumors 65816
Jun 14, 2002
1,226
1,074
Has anyone noticed that the latest top-end Intel and AMD chips are beating Apple Silicon in single- and multi-core performance, now that the database has started to get populated? Intel and AMD are in the 3000+ range for single-core performance, although at a much higher wattage for sure.

Interesting times ahead.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Has anyone noticed that the latest top-end Intel and AMD chips are beating Apple Silicon in single- and multi-core performance, now that the database has started to get populated? Intel and AMD are in the 3000+ range for single-core performance, although at a much higher wattage for sure.

Interesting times ahead.

Well of course. These are desktop chips that use between 5x and 10x more power to do the same work.
 

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,899
Anchorage, AK
Has anyone noticed that the latest top-end Intel and AMD chips are beating Apple Silicon in single- and multi-core performance, now that the database has started to get populated? Intel and AMD are in the 3000+ range for single-core performance, although at a much higher wattage for sure.

Interesting times ahead.

I'm not as concerned with who wins on a synthetic benchmark as I am with how much work I can get done without being tethered to a 110V outlet all the time.
 

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
I'm not as concerned with who wins on a synthetic benchmark as I am with how much work I can get done without being tethered to a 110V outlet all the time.
Why? Do you unplug your computer while sitting at the desk or do you roam around while working? Neither are typical behaviors.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Why? Do you unplug your computer while sitting at the desk or do you roam around while working? Neither are typical behaviors.

Typical enough for me and people I work with. I mean, I do most of my work from an armchair ;)
 
  • Like
Reactions: dmccloud

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
Typical enough for me and people I work with. I mean, I do most of my work from an armchair ;)
That's just lame. How do you hold your 34+" monitor? You are not seriously telling us that you just use a laptop screen for work. Personally, for work I have not used any monitor smaller than 21" for more than two decades (smaller monitors were used when the first LCD monitors showed up, but those were used in pairs). I just can't see how anything smaller than that can be used productively for anything but chatting on IM. 14" screens were fine in 1981 when IBM launched their PC, but we have progressed so much since then.
 

Gudi

Suspended
May 3, 2013
4,590
3,267
Berlin, Berlin
Well of course. These are desktop chips that use between 5x and 10x more power to do the same work.
Yeah, but it's the other way around. These chips draw 5× and 10× more power and are therefore only feasible for use in plugged-in computers. Whereas ARM-based SoCs are scalable up and down the power envelope. The inherent power inefficiency of x86 technology comes first and then people are looking for a use case in which this doesn't matter that much. So there is no such thing as a dedicated "desktop chip".
 

senttoschool

macrumors 68030
Original poster
Nov 2, 2017
2,626
5,482
Well of course. These are desktop chips that use between 5x and 10x more power to do the same work.
Which is what made the M1 so impressive when it first came out. It led ST for all CPUs despite running at such a low wattage. I think Apple Silicon would still be leading in ST if the M2 had come out in Fall 2021 and the M3 in Fall 2022.

Anyway, the 3,000+ scores are mostly overclocked results. I believe the top-end AMD/Intel chips fall under 3,000 at stock clock speeds.

Regardless, I look forward to the 3nm A17 and M3 scores. They'll most likely win back ST.
 
  • Like
Reactions: dgdosen

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
Yeah, but it's the other way around. These chips draw 5× and 10× more power and are therefore only feasible for use in plugged-in computers. Whereas ARM-based SoCs are scalable up and down the power envelope. The inherent power inefficiency of x86 technology comes first and then people are looking for a use case in which this doesn't matter that much. So there is no such thing as a dedicated "desktop chip".
There is no such thing as "inherent power inefficiency of x86 technology". There are low-power x86 chips and there are high-power ARM chips. For example, the TDP of the ARM-based Ampere Altra Max Q80-30 is 210W. Apple, Intel and AMD each design their architecture to cover a range of chips. For Apple this range stretches from smartphones to desktops (with the bulk of their profits coming from phones). Intel and AMD design chips for computers ranging from laptops to supercomputers, with heavy emphasis on server chips. This explains the design trade-offs. Apple's architecture is more power efficient, but it does not scale well (hence no AS-based Mac Pro in sight).
 
  • Like
Reactions: Scarrus

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
That's just lame. How do you hold your 34+" monitor? You are not seriously telling us that you just use a laptop screen for work. Personally, for work I have not used any monitor smaller than 21" for more than two decades (smaller monitors were used when the first LCD monitors showed up, but those were used in pairs). I just can't see how anything smaller than that can be used productively for anything but chatting on IM. 14" screens were fine in 1981 when IBM launched their PC, but we have progressed so much since then.
You got a notebook computer for work just so that it sits there all the time connected to big monitors? Wouldn't a desktop with more horsepower be better in this case?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
That's just lame. How do you hold your 34+" monitor? You are not seriously telling us that you just use a laptop screen for work. Personally, for work I have not used any monitor smaller than 21" for more than two decades (smaller monitors were used when the first LCD monitors showed up, but those were used in pairs). I just can't see how anything smaller than that can be used productively for anything but chatting on IM. 14" screens were fine in 1981 when IBM launched their PC, but we have progressed so much since then.

I have a monitor in the office but I rarely use it. My 16" is more than sufficient for my coding and data analysis use; I only need a code editor anyway. And it's not about size but about the content on the screen. I run my laptop at an effective 1200p.


Yeah, but it's the other way around. These chips draw 5× and 10× more power and are therefore only feasible for use in plugged-in computers. Whereas ARM-based SoCs are scalable up and down the power envelope. The inherent power inefficiency of x86 technology comes first and then people are looking for a use case in which this doesn't matter that much. So there is no such thing as a dedicated "desktop chip".

I don't really see a principal difference between ARM and x86 in this regard. It doesn't seem to me that either server or consumer ARM chips consume significantly less power than their x86 counterparts for similar performance. I mean, Zen4 reaches 5 GHz at 10 watts of power draw per core.

Apple designs are an outlier in the design space.
 

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
You got a notebook computer for work just so that it sits there all the time connected to big monitors? Wouldn't a desktop with more horsepower be better in this case?
It would, and that would be my personal preference, but most corporations nowadays give their employees laptops, not desktops. There are two reasons (at least this is the case at my company):
* Personal computers are used mostly for MS Office. Laptops can handle these tasks. Also, people can bring their laptops to meetings (Covid effectively killed this reason, though).
* "Real" work is done on Linux servers (lots of cores and up to several TB of RAM).

Edit: another reason for using laptops - employees can (and do) use them both in the office and from home. The latter is important for multinational teams, because meetings with colleagues in India, China, etc. are often scheduled well outside the 9-to-5 window.
 
Last edited:

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
I have a monitor in the office but I rarely use it. My 16" is more than sufficient for my coding and data analysis use; I only need a code editor anyway. And it's not about size but about the content on the screen. I run my laptop at an effective 1200p.




I don't really see a principal difference between ARM and x86 in this regard. It doesn't seem to me that either server or consumer ARM chips consume significantly less power than their x86 counterparts for similar performance. I mean, Zen4 reaches 5 GHz at 10 watts of power draw per core.

Apple designs are an outlier in the design space.
I am a software engineer myself, and typically I have several instances of VSCODE (plus Nedit, Notepad+, etc.) running simultaneously, plus several Chrome windows (with dozens of tabs), plus MS Outlook and Teams, plus several VNC sessions, file manager(s), remote desktop(s), etc., depending on the task at hand. I understand that not everyone needs that many apps running at the same time, but at least several windows with API references and other documentation are a must. While I could manage those windows on a small screen with virtual desktops and other tricks, it's just not as productive. I use a 38" monitor and the laptop screen.
 