
leman

macrumors Core
Oct 14, 2008
19,521
19,678
Agreed. We have 8 CPU cores, 8 GPU cores (in the higher-end version), and 16 ML cores. Intel/AMD currently do not have dedicated machine learning cores, which can definitely be used in a lot of software. It also contains the Secure Enclave, low-power video playback, an image signal processor, and more. These are things that Apple and software can make use of to improve performance, and some of them are things a generic processor cannot include.

I don’t think that anyone would argue against the fact that the specialized accelerators Apple includes in their chip give them a unique advantage in embracing new technologies and new applications. I just have an issue with a claim that Apple Silicon is only fast because it contains these specialized units and runs custom software. Apple CPUs are simply extremely fast general purpose CPUs, and that’s pretty much where their “secret” begins and ends.


By the way, Intel CPUs have included a series of technologies aimed at ML acceleration since Ice Lake. Look up Deep Learning Boost.

I think it's a wash, to be honest. I have 20 years of experience in the semiconductor industry watching how hard it is to marry software with hardware. You need twice as many software engineers as hardware engineers to even get close to making a chip successful. I can't convince you; you would have had to live my life for the past 20 years!

I certainly don’t have any intention to challenge your experience. I merely wish to understand your point. As you say, your work area was Bluetooth, which by definition is not general purpose and would obviously require a lot of hardware-specific, low-level programming expertise. I just don’t see how this line of thought applies to Apple CPUs. They have to run generic, CPU-agnostic code (most of it quite terrible, to be frank), and they do so exceptionally well.

I think that is what is being mentioned, and it does impact the complexity of the decoders. Specifically, AMD has commented that they noticed there's an upper limit on the useful number of decoders on x86 due to the need to handle these kinds of instructions, despite them being rarely used. The A14/M1 architecture is already beyond that particular limit, and it does benefit from it.

Yes, decoding complexity is a very interesting topic. There was a quite involved discussion with some smart people on RWT recently; I had difficulty following :) I don’t think there was much consensus in the end. But it should be fairly clear that a fixed instruction width is simpler to build a fast decoder around.

Still, very long instruction word (VLIW) is something completely different. x86 is not VLIW.

This is probably the closest thing to "custom" in Apple's CPU design (EDIT: specifically the CPU design, not the SoC). But that's more to support modern languages like Swift (and Objective-C using ARC). There are a lot of other multi-threaded scenarios that would benefit from this as well, so even that's not terribly custom; it's just examining existing usage of the CPU and adapting to it.

More like optimization of a frequently used feature. Because if we start acknowledging these things as “custom”, then we will also have to say that features such as stack manipulation, branch prefetching and speculative execution are “custom” too. It’s just designing hardware so that common software patterns can be executed quicker.

By the way, fast atomics are probably only possible because ARM has a more relaxed memory ordering requirement than x86. This is an area where ISA matters.
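
To make that concrete, here is a minimal sketch (my own illustration, plain standard C++, nothing Apple-specific) of the kind of reference-count updates an ARC-style smart pointer performs. The weaker orderings are exactly where ARM's memory model gives the hardware more freedom than x86's total store order:

```cpp
#include <atomic>
#include <cstdio>

// Toy reference counter in the spirit of a shared_ptr control block.
struct RefCount {
    std::atomic<long> count{1};

    void retain() {
        // Bumping the count needs no ordering guarantees at all.
        count.fetch_add(1, std::memory_order_relaxed);
    }

    bool release() {
        // The final decrement must synchronize with earlier writers
        // before the object can be destroyed, hence acquire-release.
        return count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

int main() {
    RefCount rc;
    rc.retain();
    std::printf("last release? %d\n", rc.release());  // 0: one reference left
    std::printf("last release? %d\n", rc.release());  // 1: count hit zero
    return 0;
}
```

On x86 every atomic read-modify-write already implies a full ordering, so the relaxed increment buys nothing; on ARM64 it can compile down to a plain atomic add with no barrier.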

P.S. What I would certainly describe as a “custom” feature of the M1 is the switchable (in hardware) memory ordering mode, which allows it to emulate the x86 execution environment efficiently and correctly. Without it, something like Rosetta 2 would have been much more difficult to pull off.
 

steve1960

macrumors 6502
Sep 23, 2014
293
300
Singapore
Sorry, I don't agree. I worked for a company as employee number 25 as it grew into a billion-dollar company of 5,000 employees, and whilst we were obviously in the business of making money, we had amazing work and company ethics alongside that.
 

steve1960

macrumors 6502
Sep 23, 2014
293
300
Singapore
20 years ago I witnessed the first-ever Bluetooth chip in single-chip CMOS silicon. What eventually killed the company after 15 years? It could not keep up with the necessary software investment to accompany brilliant silicon. It's all in the software, not the silicon.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
Look up Deep Learning Boost
Intel's page says this is unique to their Xeon processors. We are discussing consumer processors like the i3/i5/i7/i9, not Xeon-level parts. (The M1 is in the same class as Intel's i-series rather than the Xeons.)
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
Intel's page says this is unique to their Xeon processors. We are discussing consumer processors like the i3/i5/i7/i9, not Xeon-level parts. (The M1 is in the same class as Intel's i-series rather than the Xeons.)

DL Boost is present on client Ice Lake and Tiger Lake. Example:

 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
DL Boost is present on client Ice Lake and Tiger Lake. Example:

Why don't they list it in the actual product pages?



 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
More like optimization of a frequently used feature. Because if we start acknowledging these things as “custom”, then we will also have to say that features such as stack manipulation, branch prefetching and speculative execution are “custom” too. It’s just designing hardware so that common software patterns can be executed quicker.

By the way, fast atomics are probably only possible because ARM has a more relaxed memory ordering requirement than x86. This is an area where ISA matters.

That's why I put it in quotes. ;)

But it still helps. The desire for things like thread-safe ARC isn't terribly common outside of Apple's ecosystem, so the perf hit is less.

P.S. What I would certainly describe as a “custom” feature of the M1 is the switchable (in hardware) memory ordering mode, which allows it to emulate the x86 execution environment efficiently and correctly. Without it, something like Rosetta 2 would have been much more difficult to pull off.

True. I wasn't aware of that one. Interesting though.

Watch it go away on the M6 or something when Rosetta 2 is killed.

Still, very long instruction word (VLIW) is something completely different. x86 is not VLIW.

So it is, now that I've read the Wiki page on it. Perhaps the other poster confused x86's variable instruction sizes (which can get pretty gnarly to decode) with VLIW?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
But it still helps. The desire for things like thread-safe ARC isn't terribly common outside of Apple's ecosystem, so the perf hit is less.

It might be more prevalent than you think... it’s a common technique for managing objects that are shared by multiple threads. C++’s shared_ptr uses atomic reference counting, and it’s one of the more popular smart pointers.
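
A quick illustration (plain standard C++, my own example) of why that count has to be atomic: copies of the same shared_ptr are routinely created and destroyed from several threads at once, and every copy and every destruction touches the shared reference count:

```cpp
#include <memory>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    auto widget = std::make_shared<int>(42);

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        // Each thread gets its own copy; the copy and its destruction
        // both hit the shared, atomically updated reference count.
        workers.emplace_back([w = widget] {
            std::printf("value: %d\n", *w);
        });
    }
    for (auto& t : workers) t.join();

    // Back to a single owner once the threads are done.
    std::printf("use_count: %ld\n", static_cast<long>(widget.use_count()));
    return 0;
}
```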
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
It might be more prevalent than you think... it’s a common technique for managing objects that are shared by multiple threads. C++’s shared_ptr uses atomic reference counting, and it’s one of the more popular smart pointers.

To be fair, the C++ projects I've worked on have been legacy projects, where modern smart pointers are only just starting to trickle into usage, replacing home-rolled pointer wrappers that used an ownership model. So my experience is different from modern C++ usage.

I'm honestly happy to be proven wrong here. The ownership model is not all that fun to work with. :)
 

pshufd

macrumors G4
Oct 24, 2013
10,149
14,574
New Hampshire
Modern CPUs don’t execute Instruction Set Architecture (ISA) instructions directly. They decode ISA instructions into micro-operations and then send them to execution units that actually execute those micro-operations. The key to improving Instructions Per Clock (IPC) is to execute as many micro-operations as possible, in parallel and as early as possible.

The decoders convert ISA instructions into micro-operations and then send them to a reorder buffer. Some micro-operations have dependencies on other operations, so you sort them by dependency and then work out which can run first and which can run in parallel.

Intel and AMD use the x86 Instruction Set Architecture, which has variable-length instructions: an instruction can be up to 15 bytes long. You want to decode as many instructions as you can in parallel to feed the reorder buffer. So you decode an instruction and, ideally, decode as many additional instructions as possible in parallel. Intel and AMD use four decoders in their CPUs.

The problem with variable-length instructions is: where does the second instruction start? Where does the third instruction start? Where does the fourth? AMD and Intel use a brute-force (try every combination) approach: they try to decode all possible starting points after the first instruction and then toss out all but the correct ones. I can only imagine the number of transistors used for this approach.

RISC processors have fixed-length instructions. If you know where the first instruction is, you know where the second is, the third, the fourth, the fifth, the sixth, and so on. Apple Silicon has eight decoders, which are likely far simpler than the x86 decoders, and it can run them all in parallel. Apple Silicon's reorder buffer is about three times the size of Intel's and AMD's, which hints at far better IPC, perhaps two to three times as much. It can fill the reorder buffer faster and execute far more instructions in parallel and far earlier, which is why it can give you great speed at low clocks.
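
Here's a toy model of that difference (just an illustration in C++; real decoders are hardware, of course). With a fixed width, every decoder can compute its start offset independently; with variable lengths, each start depends on everything decoded before it:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Fixed 4-byte instructions (ARM64-style): every decoder can compute its
// start offset independently, so eight decoders can all work in parallel.
std::size_t fixed_start(std::size_t i) {
    return i * 4;
}

// Variable-length instructions (x86-style, 1 to 15 bytes): the start of
// instruction i depends on the lengths of everything before it, so a naive
// decoder walks the stream sequentially (or speculatively tries many
// possible start points, as the brute-force approach described above).
std::size_t variable_start(const std::vector<std::uint8_t>& lengths, std::size_t i) {
    std::size_t offset = 0;
    for (std::size_t k = 0; k < i; ++k)
        offset += lengths[k];   // each length is only known after decoding
    return offset;
}

int main() {
    std::vector<std::uint8_t> lengths{1, 3, 7, 2, 15};   // decoded lengths so far
    std::printf("fixed:    instruction 4 starts at byte %zu\n", fixed_start(4));
    std::printf("variable: instruction 4 starts at byte %zu\n", variable_start(lengths, 4));
    return 0;
}
```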

Corrections appreciated. I got this from a paper I looked at in December.

This is probably the article that I read and I've condensed about seven pages into five paragraphs. If you want far more details, read the section starting at "How Out-of-Order Execution Works". The much longer description isn't necessarily easier to read than my summary. If you have a CS degree or work in chip design, then it's really easy stuff.

 

SlCKB0Y

macrumors 68040
Feb 25, 2012
3,431
557
Sydney, Australia
Unfortunately, though, Apple laptops have a long history of GPU failures, usually because of solder joints cracking under their BGA packages due to the PCB flexing in response to large swings in temperature.
It's not just thermal expansion and contraction; their use of lead-free solder also contributes significantly to this problem.
 

pshufd

macrumors G4
Oct 24, 2013
10,149
14,574
New Hampshire
So it is, now that I've read the Wiki page on it. Perhaps the other poster confused x86's variable instruction sizes (which can get pretty gnarly to decode) with VLIW?

I think that's the case. The original paper put it in there in the OOO section but it wasn't really related to the topic.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
To be fair, the C++ projects I've worked on have been legacy projects, where modern smart pointers are only just starting to trickle into usage, replacing home-rolled pointer wrappers that used an ownership model. So my experience is different from modern C++ usage.
I don't think smart pointers are that common in modern C++. They feel more like something that makes C++ easier for people who are used to doing things in the Java way.

Idiomatic C++ is more about ownership, references, and move semantics. It's a bit like Rust but far less explicit. You rarely have to create new objects explicitly with the "new" operator, so there is little need for smart pointers to manage such objects.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
I don't think smart pointers are that common in modern C++. They feel more like something that makes C++ easier for people who are used to doing things in the Java way.

Idiomatic C++ is more about ownership, references, and move semantics. It's a bit like Rust but far less explicit. You rarely have to create new objects explicitly with the "new" operator, so there is little need for smart pointers to manage such objects.
Actually, C++11 has built-in unique_ptr, shared_ptr, weak_ptr and more. It is a topic discussed in a lot of the "Modern C++" courses because, why manually delete everything in existence when you can use smart pointers? Once it leaves scope, the memory is freed automatically.
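
For example (a minimal sketch, just standard C++):

```cpp
#include <memory>
#include <cstdio>

struct Texture {
    Texture()  { std::puts("texture loaded"); }
    ~Texture() { std::puts("texture freed"); }
};

int main() {
    {
        // Sole owner: freed automatically at the end of this scope,
        // no manual delete anywhere.
        auto tex = std::make_unique<Texture>();

        // Shared ownership: freed when the last shared_ptr goes away.
        auto shared = std::make_shared<Texture>();
        auto another = shared;                 // refcount goes to 2
    }                                          // both Textures freed here
    std::puts("scope exited");
    return 0;
}
```

The unique_ptr case is essentially free; shared_ptr adds the atomic reference count discussed earlier in the thread.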
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Actually, C++11 has built-in unique_ptr, shared_ptr, weak_ptr and more. It is a topic discussed in a lot of the "Modern C++" courses because, why manually delete everything in existence when you can use smart pointers? Once it leaves scope, the memory is freed automatically.
I mean using pointers of any kind is a bad habit in C++. Sometimes you can't avoid it, but it's usually better to create objects in the stack or directly in a container. Functions take objects by reference and return references to existing objects or newly created objects by value. Move semantics make it all efficient, and compilers can easily skip unnecessary steps.

shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate reference-counting code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggest that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.
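
The style being described looks roughly like this (my sketch): objects live in containers by value and are moved rather than shared, so no reference counts are touched at all:

```cpp
#include <string>
#include <utility>
#include <vector>

struct Document {
    std::string title;
    std::string body;
};

// Takes the argument by value and moves it into the container:
// no new/delete in user code, no reference counting anywhere.
void archive(std::vector<Document>& store, Document doc) {
    store.push_back(std::move(doc));
}

int main() {
    std::vector<Document> store;
    Document d{"notes", "draft text"};
    archive(store, std::move(d));   // ownership transferred by a cheap move
    return 0;
}
```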
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate reference-counting code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggest that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.

It would be interesting to run these benchmarks on an Apple CPU with its fast atomic operations ;)
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
I mean using pointers of any kind is a bad habit in C++. Sometimes you can't avoid it, but it's usually better to create objects in the stack or directly in a container. Functions take objects by reference and return references to existing objects or newly created objects by value. Move semantics make it all efficient, and compilers can easily skip unnecessary steps.

shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate reference-counting code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggest that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.
Is there a way to create objects without having a pointer of any kind?

I know you mentioned Java, but I really think the speed gap between C++ and C#/.NET is much less now than it used to be. Powerful games are created using Unity/MonoGame and C#. Unless you are creating the next Adobe Photoshop or the next Crysis, I don't think C++ has much of an advantage anymore. I only use it for some DirectX fun.
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
Is there a way to create objects without having a pointer of any kind?

I know you mentioned Java, but I really think the speed gap between C++ and C#/.NET is much less now than it used to be. Powerful games are created using Unity/MonoGame and C#. Unless you are creating the next Adobe Photoshop or the next Crysis, I don't think C++ has much of an advantage anymore. I only use it for some DirectX fun.
It's more about stack vs heap allocations. Stack allocations are very efficient and, in C++, don't use pointer semantics. The machine code may use memory addresses, but the important thing is that the developer is not the one managing ownership of them.

I wouldn't say the speed gap between C++ and .NET/Java is smaller, but more that the cost of ARC or GC (.NET/Java) isn't a deal breaker in many use cases these days. My smartphone has more RAM and a CPU an order of magnitude faster than my desktop had when .NET came out. There are still use cases where tight performance is key, just not as many. See how many Electron apps are floating around that are plenty snappy running JavaScript. Games could benefit from better optimization, but when you need a game to be stable, letting Unity handle the low-level engine management while you write your game logic in a high-level language, with fewer places for your developers to throw a wrench in the works and cause crashes and/or data loss, is not necessarily a bad trade-off.

shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate unnecessary code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggests that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.

The GC can make Java look pretty good in that scenario, yes. But requiring exclusive access to the heap makes the GC a bit of a bear to work with in its own right. Either you face latency spikes during larger GC pauses, or you run GC more frequently, spreading the cost over many smaller latency spikes.

At least for the sort of stuff I build these days, I’d take ARC any day of the week over GC because of the more deterministic/predictable performance profile.

That said, yes, local storage does help avoid relying on ARC or GC, and it will be faster. And I'm not really advocating use of these pointers to the extent you seem to be suggesting. Passing by reference is still important, and it doesn't go away when using smart pointers unless you start treating C++ like Objective-C, which I don't see the point of.
 

pshufd

macrumors G4
Oct 24, 2013
10,149
14,574
New Hampshire

Given the M1 is out now, this really should be titled “Why do some people act like Intel wasn’t holding Apple back?”

I wish that Apple broke out Mac sales by model. I'd love to know how many MacBook Air, Pro 13 and Mini M1 models they sold in Q4 compared to Q4 of 2019.
 