
leman

macrumors Core
Oct 14, 2008
19,521
19,678
Agreed. We have 8 CPU cores, 8 GPU cores (in the higher-end version), and 16 ML cores. Intel/AMD currently do not have dedicated machine learning cores, which can definitely be used in a lot of software. It also contains the Secure Enclave, low-power video playback, an image signal processor, and more. These are things that Apple and software can make use of to improve performance, and some of them are things a generic processor cannot include.

I don’t think that anyone would argue against the fact that the specialized accelerators Apple includes in their chip give them a unique advantage in embracing new technologies and new applications. I just have an issue with a claim that Apple Silicon is only fast because it contains these specialized units and runs custom software. Apple CPUs are simply extremely fast general purpose CPUs, and that’s pretty much where their “secret” begins and ends.


By the way, Intel CPUs have included a series of technologies aimed at ML acceleration since Ice Lake. Look up Deep Learning Boost.

I think it's a wash, to be honest. I have 20 years of experience in the semiconductor industry watching how hard it is to marry software with hardware. You need twice as many software engineers as hardware engineers to even get close to making a chip successful. I can't convince you; you would have had to live my life for the past 20 years!

I certainly don’t have any intention to challenge your experience. I merely wish to understand your point. As you say, your work area was Bluetooth, which by definition is not general purpose and would obviously require a lot of hardware-specific, low-level programming expertise. I just don’t see how this line of thought applies to Apple CPUs. They have to run generic, CPU-agnostic code (most of it quite terrible, to be frank), and they do so exceptionally well.

I think that is what is being mentioned, and it does impact the complexity of the decoders. Specifically, AMD has commented that they noticed there's an upper limit on the useful number of decoders on x86 due to the need to handle these kinds of instructions, despite them being rarely used. The A14/M1 architecture is already beyond that particular limit, and it does benefit from it.

Yes, decoding complexity is a very interesting topic. There was a quite involved discussion with some smart people on RWT recently; I had difficulty following :) I don’t think there was much consensus in the end. But it should be fairly clear that a fixed instruction width is simpler to build a fast decoder around.

Still, very long instruction word (VLIW) is something completely different. x86 is not VLIW.

This is probably the closest thing to "custom" in Apple's CPU design (EDIT: specifically the CPU design, not the SoC). But that's more to support modern languages like Swift (and Objective-C using ARC). There are a lot of other multi-threaded scenarios that would benefit from this as well, so even that's not terribly custom; it's just examining existing usage of the CPU and adapting to it.

More like optimization of a frequently used feature. Because if we start acknowledging these things as “custom”, then we will also have to say that features such as stack manipulation, branch prefetching and speculative execution are “custom” too. It’s just designing hardware so that common software patterns can be executed quicker.

By the way, fast atomics are probably only possible because ARM has a more relaxed memory ordering requirement than x86. This is an area where ISA matters.
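
To make that concrete, here is a minimal sketch (my own illustration, plain standard C++, nothing Apple-specific) of the kind of reference-count updates an ARC-style smart pointer performs. The weaker orderings are exactly where ARM's memory model gives the hardware more freedom than x86's total store order:

```cpp
#include <atomic>
#include <cstdio>

// Toy reference counter in the spirit of a shared_ptr control block.
struct RefCount {
    std::atomic<long> count{1};

    void retain() {
        // Bumping the count needs no ordering guarantees at all.
        count.fetch_add(1, std::memory_order_relaxed);
    }

    bool release() {
        // The final decrement must synchronize with earlier writers
        // before the object can be destroyed, hence acquire-release.
        return count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

int main() {
    RefCount rc;
    rc.retain();
    std::printf("last release? %d\n", rc.release());  // 0: one reference left
    std::printf("last release? %d\n", rc.release());  // 1: count hit zero
    return 0;
}
```

On x86 every atomic read-modify-write already implies a full ordering, so the relaxed increment buys nothing; on ARM64 it can compile down to a plain atomic add with no barrier.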

P.S. What I would certainly describe as a “custom” feature of the M1 is the switchable (in hardware) memory ordering mode, which allows it to emulate the x86 execution environment efficiently and correctly. Without it, something like Rosetta 2 would have been much more difficult to pull off.
 

steve1960

macrumors 6502
Sep 23, 2014
293
300
Singapore
Sorry, I don't agree. I worked for a company as employee number 25 as it grew into a billion-dollar company of 5,000 employees, and whilst we were obviously in the business of making money, we had amazing work and company ethics alongside that.
 

steve1960

macrumors 6502
Sep 23, 2014
293
300
Singapore
20 years ago I witnessed the first-ever Bluetooth chip in single-chip CMOS silicon. What eventually killed the company after 15 years? It could not keep up with the necessary software investment to accompany brilliant silicon. It's all in the software, not the silicon.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
Look up Deep Learning Boost
Intel's page says this is unique to their Xeon processors. We are discussing consumer processors like the i3/i5/i7/i9, not Xeon-level parts. (The M1 is in the same class as Intel's i-series rather than the Xeons.)
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
Intel's page says this is unique to their Xeon processors. We are discussing consumer processors like the i3/i5/i7/i9, not Xeon-level parts. (The M1 is in the same class as Intel's i-series rather than the Xeons.)

DL Boost is present on client Ice Lake and Tiger Lake. Example:

 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
DL Boost is present on client Ice Lake and Tiger Lake. Example:

Why don't they list it in the actual product pages?



 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
More like optimization of a frequently used feature. Because if we start acknowledging these things as “custom”, then we will also have to say that features such as stack manipulation, branch prefetching and speculative execution are “custom” too. It’s just designing hardware so that common software patterns can be executed quicker.

By the way, fast atomics are probably only possible because ARM has a more relaxed memory ordering requirement than x86. This is an area where ISA matters.

That's why I put it in quotes. ;)

But it still helps. The desire for things like thread-safe ARC isn't terribly common outside of Apple's ecosystem, so the perf hit is less.

P.S. What I would certainly describe as a “custom” feature of the M1 is the switchable (in hardware) memory ordering mode, which allows it to emulate the x86 execution environment efficiently and correctly. Without it, something like Rosetta 2 would have been much more difficult to pull off.

True. I wasn't aware of that one. Interesting though.

Watch it go away on the M6 or something when Rosetta 2 is killed.

Still, very long instruction word (VLIW) is something completely different. x86 is not VLIW.

So it is, now that I've read the Wiki page on it. Perhaps the other poster confused x86's variable instruction sizes (which can get pretty gnarly to decode) with VLIW?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
But it still helps. The desire for things like thread-safe ARC isn't terribly common outside of Apple's ecosystem, so the perf hit is less.

It might be more prevalent than you think... it’s a common technique for managing objects that are shared by multiple threads. C++’s shared_ptr uses atomic reference counting, and it’s one of the more popular smart pointers.
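
A quick illustration (plain standard C++, my own example) of why that count has to be atomic: copies of the same shared_ptr are routinely created and destroyed from several threads at once, and every copy and every destruction touches the shared reference count:

```cpp
#include <memory>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    auto widget = std::make_shared<int>(42);

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        // Each thread gets its own copy; the copy and its destruction
        // both hit the shared, atomically updated reference count.
        workers.emplace_back([w = widget] {
            std::printf("value: %d\n", *w);
        });
    }
    for (auto& t : workers) t.join();

    // Back to a single owner once the threads are done.
    std::printf("use_count: %ld\n", static_cast<long>(widget.use_count()));
    return 0;
}
```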
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
It might be more prevalent than you think... it’s a common technique for managing objects that are shared by multiple threads. C++’s shared_ptr uses atomic reference counting, and it’s one of the more popular smart pointers.

To be fair, the C++ projects I've worked on have been legacy projects, where modern smart pointers are only just starting to trickle into usage, replacing home-rolled pointer wrappers that used an ownership model. So my experience is different from modern C++ usage.

I'm honestly happy to be proven wrong here. The ownership model is not all that fun to work with. :)
 

pshufd

macrumors G4
Oct 24, 2013
10,149
14,574
New Hampshire
Modern CPUs don’t execute Instruction Set Architecture (ISA) instructions directly. They decode ISA instructions into micro-operations and then send them to execution units that actually execute those micro-operations. The key to improving Instructions Per Clock (IPC) is to execute as many micro-operations as possible, in parallel and as early as possible.

The decoders convert ISA instructions into micro-operations and then send them to a reorder buffer. Some micro-operations have dependencies on other operations, so you sort them by dependency and then work out which can run first and which can run in parallel.

Intel and AMD use the x86 Instruction Set Architecture, which has variable-length instructions: an instruction can be up to 15 bytes long. You want to decode as many instructions as you can in parallel to feed the reorder buffer. So you decode an instruction and, ideally, decode as many additional instructions as possible in parallel. Intel and AMD use four decoders in their CPUs.

The problem with variable-length instructions is: where does the second instruction start? Where does the third instruction start? Where does the fourth? AMD and Intel use a brute-force (try every combination) approach: they try to decode all possible starting points after the first instruction and then toss out all but the correct ones. I can only imagine the number of transistors used for this approach.

RISC processors have fixed-length instructions. If you know where the first instruction is, you know where the second is, the third, the fourth, the fifth, the sixth, and so on. Apple Silicon has eight decoders, which are likely far simpler than the x86 decoders, and it can run them all in parallel. Apple Silicon's reorder buffer is about three times the size of Intel's and AMD's, which hints at far better IPC, perhaps two to three times as much. It can fill the reorder buffer faster and execute far more instructions in parallel and far earlier, which is why it can give you great speed at low clocks.
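
Here's a toy model of that difference (just an illustration in C++; real decoders are hardware, of course). With a fixed width, every decoder can compute its start offset independently; with variable lengths, each start depends on everything decoded before it:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Fixed 4-byte instructions (ARM64-style): every decoder can compute its
// start offset independently, so eight decoders can all work in parallel.
std::size_t fixed_start(std::size_t i) {
    return i * 4;
}

// Variable-length instructions (x86-style, 1 to 15 bytes): the start of
// instruction i depends on the lengths of everything before it, so a naive
// decoder walks the stream sequentially (or speculatively tries many
// possible start points, as the brute-force approach described above).
std::size_t variable_start(const std::vector<std::uint8_t>& lengths, std::size_t i) {
    std::size_t offset = 0;
    for (std::size_t k = 0; k < i; ++k)
        offset += lengths[k];   // each length is only known after decoding
    return offset;
}

int main() {
    std::vector<std::uint8_t> lengths{1, 3, 7, 2, 15};   // decoded lengths so far
    std::printf("fixed:    instruction 4 starts at byte %zu\n", fixed_start(4));
    std::printf("variable: instruction 4 starts at byte %zu\n", variable_start(lengths, 4));
    return 0;
}
```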

Corrections appreciated. I got this from a paper I looked at in December.

This is probably the article that I read and I've condensed about seven pages into five paragraphs. If you want far more details, read the section starting at "How Out-of-Order Execution Works". The much longer description isn't necessarily easier to read than my summary. If you have a CS degree or work in chip design, then it's really easy stuff.

 

SlCKB0Y

macrumors 68040
Feb 25, 2012
3,431
557
Sydney, Australia
Unfortunately, though, Apple laptops have a long history of GPU failures, usually because of solder joints cracking under their BGA packages due to the PCB flexing in response to large swings in temperature.
It's not just thermal expansion and contraction; their use of lead-free solder also contributes significantly to this problem.
 

pshufd

macrumors G4
Oct 24, 2013
10,149
14,574
New Hampshire
So it is, now that I've read the Wiki page on it. Perhaps the other poster confused x86's variable instruction sizes (which can get pretty gnarly to decode) with VLIW?

I think that's the case. The original paper put it in there in the OOO section but it wasn't really related to the topic.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
To be fair, the C++ projects I've worked on have been legacy projects, where modern smart pointers are only just starting to trickle into usage, replacing home-rolled pointer wrappers that used an ownership model. So my experience is different from modern C++ usage.
I don't think smart pointers are that common in modern C++. They feel more like something that makes C++ easier for people who are used to doing things in the Java way.

Idiomatic C++ is more about ownership, references, and move semantics. It's a bit like Rust but far less explicit. You rarely have to create new objects explicitly with the "new" operator, so there is little need for smart pointers to manage such objects.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
I don't think smart pointers are that common in modern C++. They feel more like something that makes C++ easier for people who are used to doing things in the Java way.

Idiomatic C++ is more about ownership, references, and move semantics. It's a bit like Rust but far less explicit. You rarely have to create new objects explicitly with the "new" operator, so there is little need for smart pointers to manage such objects.
Actually, C++11 has built-in unique_ptr, shared_ptr, weak_ptr and more. It is a topic discussed in a lot of the "Modern C++" courses because, why manually delete everything in existence when you can use smart pointers? Once it leaves scope, the memory is freed automatically.
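
For example (a minimal sketch, just standard C++):

```cpp
#include <memory>
#include <cstdio>

struct Texture {
    Texture()  { std::puts("texture loaded"); }
    ~Texture() { std::puts("texture freed"); }
};

int main() {
    {
        // Sole owner: freed automatically at the end of this scope,
        // no manual delete anywhere.
        auto tex = std::make_unique<Texture>();

        // Shared ownership: freed when the last shared_ptr goes away.
        auto shared = std::make_shared<Texture>();
        auto another = shared;                 // refcount goes to 2
    }                                          // both Textures freed here
    std::puts("scope exited");
    return 0;
}
```

The unique_ptr case is essentially free; shared_ptr adds the atomic reference count discussed earlier in the thread.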
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Actually, C++11 has built-in unique_ptr, shared_ptr, weak_ptr and more. It is a topic discussed in a lot of the "Modern C++" courses because, why manually delete everything in existence when you can use smart pointers? Once it leaves scope, the memory is freed automatically.
I mean using pointers of any kind is a bad habit in C++. Sometimes you can't avoid it, but it's usually better to create objects in the stack or directly in a container. Functions take objects by reference and return references to existing objects or newly created objects by value. Move semantics make it all efficient, and compilers can easily skip unnecessary steps.

shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate reference-counting code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggest that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.
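
The style being described looks roughly like this (my sketch): objects live in containers by value and are moved rather than shared, so no reference counts are touched at all:

```cpp
#include <string>
#include <utility>
#include <vector>

struct Document {
    std::string title;
    std::string body;
};

// Takes the argument by value and moves it into the container:
// no new/delete in user code, no reference counting anywhere.
void archive(std::vector<Document>& store, Document doc) {
    store.push_back(std::move(doc));
}

int main() {
    std::vector<Document> store;
    Document d{"notes", "draft text"};
    archive(store, std::move(d));   // ownership transferred by a cheap move
    return 0;
}
```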
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate reference-counting code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggest that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.

It would be interesting to run these benchmarks on an Apple CPU with its fast atomic operations ;)
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
I mean using pointers of any kind is a bad habit in C++. Sometimes you can't avoid it, but it's usually better to create objects in the stack or directly in a container. Functions take objects by reference and return references to existing objects or newly created objects by value. Move semantics make it all efficient, and compilers can easily skip unnecessary steps.

shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate reference-counting code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggest that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.
Is there a way to create objects without having a pointer of any kind?

I know you mentioned Java, but I really think the speed gap between C++ and C#/.NET is much less now than it used to be. Powerful games are created using Unity/MonoGame and C#. Unless you are creating the next Adobe Photoshop or the next Crysis, I don't think C++ has much of an advantage anymore. I only use it for some DirectX fun.
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
Is there a way to create objects without having a pointer of any kind?

I know you mentioned Java, but I really think the speed gap between C++ and C#/.NET is much less now than it used to be. Powerful games are created using Unity/MonoGame and C#. Unless you are creating the next Adobe Photoshop or the next Crysis, I don't think C++ has much of an advantage anymore. I only use it for some DirectX fun.
It's more about stack vs heap allocations. Stack allocations are very efficient and, in C++, don't use pointer semantics. The machine code may use memory addresses, but the important thing is that the developer is not the one managing ownership of them.

I wouldn't say the speed gap between C++ and .NET/Java is smaller, but more that the cost of ARC or GC (.NET/Java) isn't a deal breaker in many use cases these days. My smartphone has more RAM and a CPU an order of magnitude faster than my desktop had when .NET came out. There are still use cases where tight performance is key, just not as many. See how many Electron apps are floating around that are plenty snappy running JavaScript. Games could benefit from better optimization, but when you need a game to be stable, letting Unity handle the low-level engine management while you write your game logic in a high-level language, with fewer places for your developers to throw a wrench in the works and cause crashes and/or data loss, is not necessarily a bad trade-off.

shared_ptr is particularly bad for performance. Because the compiler can't know when the final pointer to the object goes out of scope, it has to generate unnecessary code (almost) every time a copy of the pointer goes out of scope. There are benchmarks that suggests that Java is often faster than C++ code that relies heavily on shared_ptr for memory management.

The GC can make Java look pretty good in that scenario, yes. But requiring exclusive access to the heap makes the GC a bit of a bear to work with in its own right. Either you face latency spikes during larger GC pauses, or you run GC more frequently, spreading the cost over many smaller latency spikes.

At least for the sort of stuff I build these days, I’d take ARC any day of the week over GC because of the more deterministic/predictable performance profile.

That said, yes, local storage does help avoid relying on ARC or GC, and it will be faster. And I'm not really advocating use of these pointers to the extent you seem to be suggesting. Passing by reference is still important, and it doesn't go away when using smart pointers unless you start treating C++ like Objective-C, which I don't see the point of.
 

pshufd

macrumors G4
Oct 24, 2013
10,149
14,574
New Hampshire

Given the M1 is out now, this really should be titled “Why do some people act like Intel wasn’t holding Apple back?”

I wish that Apple broke out Mac sales by model. I'd love to know how many MacBook Air, Pro 13 and Mini M1 models they sold in Q4 compared to Q4 of 2019.
 