Geekbench 6 results and discussions

mr_roboto · Mar 1, 2023

leman said:
That’s not how it works on Apple Silicon though. Instructions with more complex semantics (op + shift, load pair, increment addressing modes) use multiple execution modes and have longer latency. The instruction latencies are different between instructions as well. I don’t know whether Apple uses micro-ops in their usual definition but there is certainly some sort of operation decomposition going on.

Instruction timing and throughtput: https://dougallj.github.io/applecpu/firestorm-int.html

It isn't too likely that Apple is decoding to multiple micro-ops for most of these. For example, the "OP + SHIFT" patterns generally run at half throughput compared to the plain "OP" version of the same instruction. The complex and expensive way of implementing these that's compatible with the observed half throughput is cracking into two micro-ops. The cheap and easy way is just letting the instruction reside in the normally single-cycle integer ALU for two cycles, non-pipelined. First cycle does the OP computation, second cycle does the shift. The lack of pipelining halves throughput because that ALU has to stall one cycle.

falainber · Mar 1, 2023

Gudi said:
@leman Since you don’t want to answer, why there are no x86 chips in smartwatches, here’s another question: Why didn’t Apple design its own x86 SoC and build it on TSMCs 5nm node? If technically possible this would’ve saved them so much software development costs! Just admit that x86 is inherently energy inefficient. To fix this problem one must cut down the bloated instruction set and voila an ARM processor. 💁

It seems that you are not very familiar with smart phone history. Intel actually did release x86 based processors that were used in Android smartphones. It can be done but there are many challenges that have little to do with the instruction set. Besides, if it was just about instruction set and efficiency, we would all be using Risc-V based devices now.

Sydde · Mar 1, 2023

Looks like Tag Heuer has made an Atom-powered watch. They still make Android watches, but it is not easy to determine if they are still Intel.

falainber said:
if it was just about instruction set and efficiency, we would all be using Risc-V based devices now

What about RISC-V would make it preferable to ARM?

leman · Mar 1, 2023

mr_roboto said:
It isn't too likely that Apple is decoding to multiple micro-ops for most of these. For example, the "OP + SHIFT" patterns generally run at half throughput compared to the plain "OP" version of the same instruction. The complex and expensive way of implementing these that's compatible with the observed half throughput is cracking into two micro-ops. The cheap and easy way is just letting the instruction reside in the normally single-cycle integer ALU for two cycles, non-pipelined. First cycle does the OP computation, second cycle does the shift. The lack of pipelining halves throughput because that ALU has to stall one cycle.

This makes sense, and would also be a more efficient way to utilize the OOE resources. You’d still probably need multiple ops for things that need different units (like load with increment), not sure about stuff like LDP?

falainber said:
Besides, if it was just about instruction set and efficiency, we would all be using Risc-V based devices now.

Why? Nothing about RICV-V is particularly efficient (at least compared to modern ARM). They are just really obsessed with “one instruction = one operation” mentality, which creates a bunch of problems for them.

Sydde · Mar 2, 2023

leman said:
You’d still probably need multiple ops for things that need different units (like load with increment), not sure about stuff like LDP?

Absolutely not. Address calculation is built into the LSU, so you do not need to use a different unit for that. All processors work this way. The only difference is whether the base register gets modified, and whether the LSU uses the original address or the offset address, which is a simple logic switch.

As far as LDP, you have to understand that the Apple design does not use a traditional register file. The core has a large array of "renames" and a register reference table that identifies an architected register in the rename array, relative to the code stream. When code puts a value into a register, it is writing to a rename, which subsequent code will acquire using the reference table. LDP grabs two renames, marks the reference table, and eventually puts the load into the renames that it has. The memory location is contiguous, so, if it is not misaligned or is cached, a single memory transaction may be all that is needed. But it is still just one μop.

leman · Mar 2, 2023

Sydde said:
Absolutely not. Address calculation is built into the LSU, so you do not need to use a different unit for that. All processors work this way. The only difference is whether the base register gets modified, and whether the LSU uses the original address or the offset address, which is a simple logic switch.

You appear to be right. LDR with auto-increment retires as a single op, although interestingly enough, it counts both as mem and int op according to Apples internal instruction counters.

Sydde said:
As far as LDP, you have to understand that the Apple design does not use a traditional register file. The core has a large array of "renames" and a register reference table that identifies an architected register in the rename array, relative to the code stream. When code puts a value into a register, it is writing to a rename, which subsequent code will acquire using the reference table.

Well, sure, like any other out of order design for I don’t know how many years.

Sydde said:
LDP grabs two renames, marks the reference table, and eventually puts the load into the renames that it has. The memory location is contiguous, so, if it is not misaligned or is cached, a single memory transaction may be all that is needed. But it is still just one μop.

I think it’s a little bit complicated than this. Apple Silicon will coalesce load requests, so these instructions don’t do loads directly. According to the CPU counters LDP is issued as one operation but retires retires as two.

Gudi · Mar 2, 2023

falainber said:
Intel actually did release x86 based processors that were used in Android smartphones. It can be done but there are many challenges that have little to do with the instruction set.

Everything can be done. Scientists can make gold out of lesser materials, it's just way more expensive than digging. Of course you can put any kind of processor into a smartphone, but you can't achieve competitive performance and battery life. Intel wouldn't stay out of the largest processor market in the world, if there was a way to make x86 work in these environments. Now lets get real. Where is the x86 chip in a smartwatch? Still no fundamental differences between RISC and CISC processors?

Gudi · Mar 2, 2023

dmccloud said:
Additionally, since Apple was already building out its own version of the ARM instruction set for the A-series SoCs, it was relatively trivial to extend that architecture to the Mac platform.

Yeah, changing the instruction set underneath an entire OS and its 3rd party software ecosystem is probably the least challenging software development project since autonomous driving. /s

thunng8 · Mar 2, 2023

Sydde said:
Looks like Tag Heuer has made an Atom-powered watch. They still make Android watches, but it is not easy to determine if they are still Intel.

No - they switched to Qualcomm.

Sydde · Mar 2, 2023

Gudi said:
Yeah, changing the instruction set underneath an entire OS and its 3rd party software ecosystem is probably the least challenging software development project since autonomous driving. /s

Apple has had darwin and Foundation running on ARM since at least '07. The only real difference between iOS and macOS is AppKit vs UIKit. Porting the operating system had already happened, and the ARM compilers were well-tested in the meanwhile. The tools for converting to Fat or ASi were already there, and the system was all but ready out of the gate.

Xiao_Xi · Mar 2, 2023

Gudi said:
changing the instruction set underneath an entire OS and its 3rd party software ecosystem is probably the least challenging software development project since autonomous driving. /s

Once the compiler can target a new ISA, the transition is relatively easy. For example, according to Wikipedia, you can run the Linux kernel on 20 different ISAs.

Gudi · Mar 2, 2023

Xiao_Xi said:
Once the compiler can target a new ISA, the transition is relatively easy. For example, according to Wikipedia, you can run the Linux kernel on 20 different ISAs.

That's why Microsoft failed to get Windows on ARM going. It was too easy!

mr_roboto · Mar 2, 2023

leman said:
This makes sense, and would also be a more efficient way to utilize the OOE resources. You’d still probably need multiple ops for things that need different units (like load with increment), not sure about stuff like LDP?

What Sydde said. The circuitry needed to do basic address computations (adders and a few shifts to handle multiply by 2/4/8) is small (both in absolute terms, and relative to a full-function integer ALU). It's not a big deal to just put dedicated stuff into the load/store unit(s).

(In the early days of microprocessors, this was untrue. But in today's chips with billions of gates, you'll often end up needing more gates and energy just to route data over to someone else's adder and get the result back, regardless of whether that's done by cracking into two uops or some other method.)

leman said:
I think it’s a little bit complicated than this. Apple Silicon will coalesce load requests, so these instructions don’t do loads directly. According to the CPU counters LDP is issued as one operation but retires retires as two.

They probably special cased LDP to use two retire slots because it generates two writes to the register file. Most instructions generate zero or one register writes per instruction. Two is a rare exception, so it wouldn't be surprising if they designed everything around the normal 0/1 cases and handled 2 as if it's two single writes.

Sydde · Mar 2, 2023

mr_roboto said:
They probably special cased LDP to use two retire slots because it generates two writes to the register file.

Technically, since it also serves as a context-switch transaction tool, LDP/STP include an update flag, so a load can actually modify 3 registers. And there is a Neon version, that can load or store 2 vector registers. The update (pre-index/post-idex) transactions do not incur an extra cycle (or it is absorbed into the breadth of the transaction time).

senttoschool · Mar 2, 2023

Gudi said:
That's why Microsoft failed to get Windows on ARM going. It was too easy!

Microsoft has no ARM chip that is significantly better than anything AMD or Intel makes.

leman · Mar 2, 2023

Xiao_Xi said:
Once the compiler can target a new ISA, the transition is relatively easy. For example, according to Wikipedia, you can run the Linux kernel on 20 different ISAs.

Once you have the infrastructure, desire, and competence in place the transition is relatively easy. Don’t underestimate the effort of migrating and testing of low-level functionality to a new architecture. Memory model differences between x86 and ARM alone can mess you up badly.

Gudi · Mar 3, 2023

senttoschool said:
Microsoft has no ARM chip that is significantly better than anything AMD or Intel makes.

They also don’t have an OS to compete with iPhone and iPad. The Surface is merely a PC disguised as a tablet. Microsoft lost mobile computing despite Bill Gates touting tablets as the future since the 90s.

Scarrus · Mar 3, 2023

I really love these comparison threads online, people arguing over which platform is better...

I have 2 Laptops, a MBP and a Lenovo Legion.

The MBP is the most used for work, casual browsing and stuff like that and the Legion is used mostly for gaming or when I need to run a Win program or something. I like both laptops honestly and I carry both with me so I always get the best of both worlds... I don't get this constant bashing all the time. They're both great for their own purpose and at the same time they both suck sometimes, so yeah...

dmccloud · Mar 3, 2023

Gudi said:
Yeah, changing the instruction set underneath an entire OS and its 3rd party software ecosystem is probably the least challenging software development project since autonomous driving. /s

You forget one critical thing here. Since iOS was built upon the same Darwin core which is the basis for OS X/MacOS, Apple has been writing code for ARM for years at this point. So Apple has actually been doing this since the first Apple-created SoCs were produced for the iPhones...

leman · Mar 3, 2023

dmccloud said:
You forget one critical thing here. Since iOS was built upon the same Darwin core which is the basis for OS X/MacOS, Apple has been writing code for ARM for years at this point. So Apple has actually been doing this since the first Apple-created SoCs were produced for the iPhones...

Even more, according to Steve Jobs, the first iPhone was running OS X. iOS was the name given to a specific distribution of OS X (drivers and a set of libraries). The kernel and a vast amount of framework code was built and tested both for x86 and ARM simultaneously since then.

Sydde · Mar 3, 2023

One difficult concern about GBx is that it measures peak performance in a relatively brief test. This is fine, we do like to know how fast it can go. But it means that an Intel or AMD processor is running at "Turbo" for the test (somewhere between 4~6GHz).

In the real world, "Turbo" is simply not sustainable for very long. Even if you pour liquid nitrogen on the CPU to keep it from overheating, you still face the limitations of the memory system, which will eventually lead to processor starvation.

The M2 runs at around 3.5~3.7GHz (P-cores) and may drop to about 3.0, but it is getting scores in the same general neighborhood as x86 devices at some 60% of clock, and most importantly, it can maintain that pace pretty much indefinitely.

x86 machines have the high-clock capability, and good for them. ARM processors do not have a burst mode (at least, not on the same order of magnitude) but rely on getting the work done at a steady, practical pace – and, interestingly, they seem to be able to do just about as much work at a much lower clock.

MayaUser · Mar 3, 2023

Scarrus said:
I really love these comparison threads online, people arguing over which platform is better...

I have 2 Laptops, a MBP and a Lenovo Legion.

The MBP is the most used for work, casual browsing and stuff like that and the Legion is used mostly for gaming or when I need to run a Win program or something. I like both laptops honestly and I carry both with me so I always get the best of both worlds... I don't get this constant bashing all the time. They're both great for their own purpose and at the same time they both suck sometimes, so yeah...

so, Mbp is doing the important stuff in your life, thats key thing right there, the overall product not just the OS
I couldnt live with an windows laptop after experience the M series Mbp

Scarrus · Mar 3, 2023

MayaUser said:
so, Mbp is doing the important stuff in your life, thats key thing right there, the overall product not just the OS
I couldnt live with an windows laptop after experience the M series Mbp

It’s moatly about the OS and having two separate instances meaning not running data sensitive things on the same OS I fool around with.

Mac OS I use as is, it’s boring after you get used to it, it’s not as flexible as Windows, more like a protected closed environment.

And Windows when I need to mess around with stuff, i.e. install a lot of crap software and mess around with it, searching for usefull/purposefull things you can do with it.

In that respect there’re really two separate things and not one being better than the other.

As I said, I love and hate things about them both.

Sydde · Mar 3, 2023

mr_roboto said:
First cycle does the OP computation, second cycle does the shift. The lack of pipelining halves throughput because that ALU has to stall one cycle.

That is backwards to the operation, though. The semantics of it is

A = B <op> (C:<shift_ct>)

so the shift happens first. In the original AArch32, there was no discrete shift operation: it simply performed a move with a shifted/rotated source. AArch64 is structured a little differently and does have discrete shift ops. But the 64-bit version ought to be able to just feed the C operand through the barrel shifter on its way to the ALU – why it takes an extra cycle is unclear (unless it is for energy efficiency reasons, to reduce excess logic at the cost of 1 cycle for a composition that is only moderately used).

name99 · Mar 4, 2023

Sydde said:
Absolutely not. Address calculation is built into the LSU, so you do not need to use a different unit for that. All processors work this way. The only difference is whether the base register gets modified, and whether the LSU uses the original address or the offset address, which is a simple logic switch.

As far as LDP, you have to understand that the Apple design does not use a traditional register file. The core has a large array of "renames" and a register reference table that identifies an architected register in the rename array, relative to the code stream. When code puts a value into a register, it is writing to a rename, which subsequent code will acquire using the reference table. LDP grabs two renames, marks the reference table, and eventually puts the load into the renames that it has. The memory location is contiguous, so, if it is not misaligned or is cached, a single memory transaction may be all that is needed. But it is still just one μop.

Actually there's something cleverer going on.
Consider an instruction that needs multiple destination registers (like LDP). This is complicated to handle in Rename. So the instruction is cracked into two instructions for the purpose of Rename (allocating a destination register and a history file entry) but then "un-cracked" after Rename for the purpose of execution.

This is described in

US9223577B2 - Processing multi-destination instruction in pipeline by splitting for single destination operations stage and merging for opcode execution operations stage - Google Patents

Various techniques for processing instructions that specify multiple destinations. A first portion of a processor pipeline is configured to split a multi-destination instruction into a plurality of single-destination operations. A second portion of the pipeline is configured to process the...

patents.google.com

These sorts of tricks (and Apple Silicon is full of them) is why so much discussion of Apple Silicon, based on the language and ideas of 20 years ago, is so vacuous. Does Apple Silicon crack instructions? Well, yes, sorta, but ...

Dougall's work on instruction throughputs is great, but the real value is in the counters that he reports for each instruction. I did some preliminary work on what each counter means and part of it is, of course, exactly what you would expect, that different counters are counting different things (eg ops coming out of Decode, ops going into Rename, ops going into Execute, ...) and these will differ depending on this cracking and uncracking.
But I never had time to finish this work, and looks like Dougall also got distracted by other things. Hopefully at some point someone else will take up the challenge.

Geekbench 6 results and discussions

macrumors 6502a

macrumors 68040

macrumors 68030

macrumors Core

macrumors 68030

macrumors Core

Suspended

Suspended

macrumors 65816

macrumors 68030

macrumors 68000

Suspended

macrumors 6502a

macrumors 68030

macrumors 68030

macrumors Core

Suspended

macrumors 6502

macrumors 68040

macrumors Core

macrumors 68030

macrumors 68040

macrumors 6502

macrumors 68030

macrumors 68030

Our Staff