
joema2

macrumors 68000
Sep 3, 2013
1,646
866
Your premise is broken. RISC has frequently outperformed x86 in the past. Those RISC machines never took over the world simply because they didn’t run consumer/x86 versions of Windows. The DEC Alpha blew away contemporaneous Intel products. So did the Exponential PowerPC...

In the late 90s I personally ran lots of side-by-side performance tests between various Alpha, MIPS and Pentium Pro workstations. On commercial software dominated by integer code, I never saw a marked RISC performance advantage. It is true the RISC machines often had better floating point performance, but that was a small subset of all use cases.

On the server side (where desktop/consumer software has little impact), both current and historical performance numbers are available for transaction processing. Essentially all current TPC-E results are dominated by x86 machines: http://www.tpc.org/tpce/results/tpce_results5.asp

The older TPC-C benchmark dating back to around 1999 has some competitive RISC servers, but even back then x86 held many of the top slots and typically at a vastly cheaper capital cost per transaction:

JMacHack may be correct -- The Apple SoC has a lot of heterogeneous accelerators, which if leveraged by proper software might explain the advantage, at least for certain workloads. However it doesn't explain a straight-up benchmark like GeekBench or Linpack looking roughly competitive on an A12Z vs an i7-7700K Kaby Lake CPU.

A useful number would be transistors per core between late-generation x86 and Apple Silicon cores. I don't think that has ever been revealed but it's obtainable by taking die shots of a decapped chip. It is generally assumed that Ax CPUs consume less transistor budget per core, but we don't really know. If it is significantly less, that would enable more cores per die at approx. similar fabrication costs. It's interesting the first Apple Silicon machine is rumored to be a 13" MacBook Pro with 12 cores.

Of course you also need good single-thread performance, and traditionally that has been costly from a transistor budget and power standpoint. It's conceivable something about the Ax architecture facilitates improving superscalar performance at less cost per increment than x86, but if so that hasn't been borne out by POWER9 or the 48-core ARM/Fujitsu A64FX. They are competitive with high-end Xeon & AMD but both burn huge amounts of power: https://en.wikipedia.org/wiki/Fujitsu_A64FX
 
  • Like
Reactions: RandomDSdevel

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
In the late 90s I personally ran lots of side-by-side performance tests between various Alpha, MIPS and Pentium Pro workstations. On commercial software dominated by integer code, I never saw a marked RISC performance advantage. It is true the RISC machines often had better floating point performance, but that was a small subset of all use cases.

As someone who ran identical software on multiple architectures in that time period while designing RISC and x86 CPUs, I saw the opposite. Alpha blew away x86, as did PowerPC x704, etc. I recall being completely amazed at how fast Windows NT ran on the x704 compared to the fastest Intel had to offer at the time. And Alpha was faster than that.
 
  • Like
Reactions: 2Stepfan

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Could you please point me to that article which states Apple Ax performance advantage over Intel is due to being a wide machine? I spent about 1 hr looking for it and could not find it.

Internally Sunny Cove does five-wide instruction issues and has a 10-wide back end and 10 execution units. I haven't seen any authoritative article stating the performance advantage of the Apple Silicon CPU over x86 is due to it being a wider superscalar design (hence better IPC) than the latest x86 microarchitecture: https://en.wikichip.org/wiki/intel/microarchitectures/sunny_cove

Please keep in mind that I am not an engineer and my information comes from articles and technical documents, and Apple does not make these things public. Based on what Anandtech writes:

1. Regarding Sunny Cove: there are four execution ports related to general-purpose instructions (the rest are load/store ports, which I will ignore for the moment). These ports can schedule various integer and FP/vector operations, and various documentation suggests that there are 4 integer ALUs and 3 FP/vector ALUs to carry out these instructions, but some details are not very clear to me. For instance, Anandtech mentions that only a single AVX-512 operation can be scheduled per clock, but then they also mention that there are two AVX-512 units... so no idea. Agner Fog (the most detailed info, alas no Ice Lake yet) as well as WikiChip mention that Skylake has more integer vector units than FP vector units, so that might be what's happening here. The diagram does not make it clear which vector units are integer-only. Another important point is that the ports are not symmetrical and various types of operations contend for a port. For example, there might be 4 integer and 3 vector ALUs, but you can't dispatch 4 integer and 3 vector micro-ops simultaneously, since they all share 4 ports, etc.

2. Regarding Apple A13. We don't have too much to go by here. Anandtech mentions:
In terms of the core design, at least in regards to the usual execution units, we don’t see too much divergence from last year’s core. The microarchitecture at its heart is still a 7-wide decode front-end, paired with a very wide execution back-end that features 6 ALUs and three FP/vector pipelines.

In their previous A12 deep dive they say:

Monsoon (A11) and Vortex (A12) are extremely wide machines – with 6 integer execution pipelines among which two are complex units, two load/store units, two branch ports, and three FP/vector pipelines this gives an estimated 13 execution ports

There is also a neat instruction throughput and latency table.

If I understand all that info correctly, current Apple CPUs have independent ports for scheduling integer and vector instructions, and they can do 6 integer as well as 3 128-bit FMA FP operations per cycle, with very low latency. Compared to Ice Lake, Apple A13 has 2 extra integer ALUs, and if the Intel code runs SSE SIMD instructions, also potentially 1 extra FP vector unit (Intel CPUs have higher SIMD throughput if one uses AVX-512 or 256-bit AVX2, at the cost of flexibility, and there are also the well-known clock-speed penalties when the wide vector instructions are used...)

Hope that the links give you enough information to go on digging! If you discover something more useful from this data, I would be very interested in hearing it.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
However it doesn't explain a straight-up benchmark like GeekBench or Linpack looking roughly competitive on an A12Z vs an i7-7700K Kaby Lake CPU.

The only reasonable explanation is that they have found a way to massively increase instruction parallelism. Having a wide execution back end is nice and all, but I'd expect one to run into big issues with data dependencies. But it really seems as if they are able to feed all these units effectively...
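Something like this toy C program shows the effect I mean (entirely made-up workload, nothing Apple-specific): with a single accumulator every add depends on the previous one, so extra ALUs sit idle, while splitting the same work across four independent accumulators gives the scheduler parallel chains to issue. If you actually try it, disable auto-vectorization (e.g. GCC's -fno-tree-vectorize) so the compiler doesn't blur the comparison.

Code:
/* Dependent vs. independent integer work over the same L1-resident data. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 4096            /* 32 KB of uint64_t: fits in L1 data cache */
#define REPS 100000

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    static uint64_t v[N];
    for (int i = 0; i < N; i++) v[i] = (uint64_t)i * 2654435761u;

    /* One accumulator: every add depends on the previous one. */
    double t0 = seconds();
    uint64_t s = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++) s += v[i];
    double t1 = seconds();

    /* Four accumulators: four independent dependency chains, same total work. */
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i += 4) {
            a += v[i]; b += v[i + 1]; c += v[i + 2]; d += v[i + 3];
        }
    double t2 = seconds();

    printf("1 chain : %.3f s (sum %llu)\n", t1 - t0, (unsigned long long)s);
    printf("4 chains: %.3f s (sum %llu)\n", t2 - t1,
           (unsigned long long)(a + b + c + d));
    return 0;
}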
 
  • Like
Reactions: RandomDSdevel

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I’d encourage you to give the hypervisor.framework a try.

Parallels has been vindictive with the way it’s constantly stealing resources even when VMs are in the background.

Parallels or Apple?

https://www.macrumors.com/2020/07/27/vmware-confirms-macos-virtualization-bug-causes-crashes/



Parallels Desktop Lite on the Mac App Store to me is far superior to the way Parallels Desktop works. My machine is SOOOO much stabler with it.

And is the GPU throughput better? "More stable" from vendor X versus vendor Y isn't really the point of why a substantial number of folks go to Boot Camp and boot onto the "raw metal" GPU. Similarly for the old-timers keeping old systems going after Apple stops signing anything for their system anymore.

Works better for nested hypervisors? Probably not.

If Apple improves the hypervisor framework approach, it could work well enough that most users would be happy. Heavy users, probably not (if the primary deployment target is another hypervisor, they will run into some issues).
If Apple could support IOMMU mapping to a virtual GPU and take much of the overhead off the VM vendors' work of emulating GPUs, that would solve some of the bigger pain points that currently exist.

Likewise, if it could be provisioned on top of the recovery system image (or something more specialized than the normal OS image), that would help too. It is a relatively "big footprint" hypervisor when it comes to production deployment in a scale-out context; if you are trying to take ESXi and Hyper-V off the table, you have to cover some of that ground.
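For context, the framework being discussed is a fairly thin userspace C API. A minimal sketch of the Intel-era flow (guest memory size arbitrary, error handling and the hv_vcpu_run loop omitted) looks roughly like this:

Code:
/* Ask macOS for a VM, map some guest memory, create a vCPU.
 * Build on an Intel Mac: cc hv_sketch.c -framework Hypervisor -o hv_sketch */
#include <Hypervisor/hv.h>
#include <sys/mman.h>
#include <stdio.h>

#define GUEST_MEM_SIZE (2 * 1024 * 1024)   /* arbitrary 2 MB of guest RAM */

int main(void) {
    if (hv_vm_create(HV_VM_DEFAULT) != HV_SUCCESS) {
        fprintf(stderr, "hv_vm_create failed (entitlement, SIP or hardware?)\n");
        return 1;
    }

    /* Page-aligned host memory that will back guest-physical address 0. */
    void *mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    hv_vm_map(mem, 0, GUEST_MEM_SIZE,
              HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC);

    hv_vcpuid_t vcpu;
    hv_vcpu_create(&vcpu, HV_VCPU_DEFAULT);

    printf("VM, guest memory and one vCPU created; a real monitor would now "
           "load code and drive hv_vcpu_run()\n");

    hv_vcpu_destroy(vcpu);
    hv_vm_unmap(0, GUEST_MEM_SIZE);
    hv_vm_destroy();
    return 0;
}

Everything above that layer (device models, firmware, graphics) is still on the VM vendor, which is where the IOMMU and "big footprint" points come in.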
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
The only reasonable explanation is that they have found a way to massively increase instruction parallelism. Having a wide execution back end is nice and all, but I'd expect one to run into big issues with data dependencies. But it really seems as if they are able to feed all these units effectively...

Higher cache hit rates would help too. The "wide" stuff only works if the data is "close"; otherwise you just have more function units waiting on data to do something (which is why other systems go to simultaneous multithreading, SMT, or Intel-branded Hyper-Threading). If you throw a complex RDBMS join with substantial pointer chasing at it, the A12Z is likely going to fade much quicker than the i7 (or Xeon W or SP) will.

A small working set (the A-series' low double-digit RAM capacity) helps even the odds on Geekbench.
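A rough sketch of that pointer-chasing case (sizes made up; the Sattolo shuffle just guarantees one big cycle, so the traversal wanders over the whole 32 MB working set):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)   /* 4M links * 8 bytes = 32 MB, far bigger than typical L2/L3 */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;

    /* Sattolo's algorithm: a random permutation with exactly one big cycle. */
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t hops = 0; hops < N; hops++)
        p = next[p];                 /* every load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%u hops in %.3f s (~%.1f ns per hop), end = %zu\n",
           (unsigned)N, s, s * 1e9 / N, p);
    free(next);
    return 0;
}

No amount of back-end width helps here; the core just sits waiting for the next line to arrive.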
 
  • Like
Reactions: RandomDSdevel

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Higher cache hit rates would help too. The "wide" stuff only works if the data is "close"; otherwise you just have more function units waiting on data to do something (which is why other systems go to simultaneous multithreading, SMT, or Intel-branded Hyper-Threading). If you throw a complex RDBMS join with substantial pointer chasing at it, the A12Z is likely going to fade much quicker than the i7 (or Xeon W or SP) will

The A-series have much more cache than an i7. Not sure what you mean by the RDBMS join exactly, but if you are talking about workloads with a poor cache hit rate, I'd expect all these CPUs to do equally poorly (since they are going to stall waiting for data to arrive). The A-series will probably do better than Intel consumer chips because, again, more cache.
 
  • Like
Reactions: RandomDSdevel

Tigger11

macrumors 6502a
Jul 2, 2009
543
396
Rocket City, USA
As someone who ran identical software on multiple architectures in that time period while designing RISC and x86 CPUs, I saw the opposite. Alpha blew away x86, as did PowerPC x704, etc. I recall being completely amazed at how fast Windows NT ran on the x704 compared to the fastest Intel had to offer at the time. And Alpha was faster than that.

I am going to have to completely agree with you here. I got paid to port a bunch of plugins to Alpha and PPC when NT was running well on them, and it was years before my top-end x86 got anywhere near the same numbers I was getting from the Alpha; as you said, the PPC was running a strong second at the time. Now, that crazy 48-core motherboard Intergraph did in the 90s was pretty damn quick, but I don't think they ever got Microsoft to license NT for it at a reasonable rate, so it never hit production.
-Tig
 

joema2

macrumors 68000
Sep 3, 2013
1,646
866
I am going to have to completely agree with you here. I got paid to port a bunch of plugins to Alpha and PPC when NT was running well on them, and it was years before my top-end x86 got anywhere near the same numbers I was getting from the Alpha; as you said, the PPC was running a strong second at the time.

Our various experiences may come down to the difference between synthetic benchmarks or small software components vs. large real-world production commercial software. On the latter, in my testing I rarely saw Alpha or MIPS workstations perform faster than a Pentium Pro workstation of roughly similar price and config.

This was shown in the attached table from the white paper "RISC vs CISC: A Tale of Two Chips" (Bhandarkar, 1997). The 300 MHz Alpha 21164 was clearly faster on most synthetic benchmarks than the contemporary 150 MHz Pentium Pro. However on a large real-world server application running identical software, a quad Pentium Pro system was faster than a quad Alpha system, and at much less cost. This can also be seen in the previously-posted TPC-C benchmarks. If RISC servers were really faster in that regime, nobody was preventing those vendors from dominating the lucrative transactional server market. But as described by CPU guru David Kanter, the Pentium Pro and subsequent decoupled x86 designs signaled the death of the RISC server market: https://www.realworldtech.com/sandy-bridge/

Why the Pentium Pro held an advantage over Alpha on real-world commercial apps is a good question. x86 code density, and hence bus bandwidth and instruction/data cache hit ratios, was one advantage. This was also the period when profile-guided, post-link-time basic block optimization was developed, and I vaguely recollect this was available first (or maybe only) for x86. If this was available on Pentium Pro software but not Alpha, that could have been a significant advantage -- even if the application source code was identical: https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=vs-2019
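To make the PGO idea concrete, the workflow is roughly the following; the flags shown are the GCC/Clang spelling (the MSVC equivalents are in the linked docs), and the skewed branch is a made-up stand-in for the hot paths of a large commercial app:

Code:
/* Rough PGO workflow (exact profile-file handling differs slightly
 * between GCC and Clang):
 *
 *   1) cc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
 *   2) ./pgo_demo                      (run on representative input)
 *   3) cc -O2 -fprofile-use pgo_demo.c -o pgo_demo
 *
 * The profile tells the compiler which side of each branch is hot, so it
 * can lay the hot basic blocks out contiguously -- better I-cache and
 * branch behavior with no source changes. */
#include <stdio.h>

/* Made-up heavily skewed branch. */
static long long process(long long x) {
    if (x % 100 == 0)        /* rare, error-ish path */
        return -x;
    return x * 3 + 1;        /* hot path: PGO keeps this fall-through */
}

int main(void) {
    long long sum = 0;
    for (long long i = 1; i <= 10000000; i++)
        sum += process(i);
    printf("%lld\n", sum);
    return 0;
}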

What counts is who's faster in the real world, not who's faster on a benchmark. The relevance of Apple Silicon CPUs is they seem faster in the real world, at least at the power/performance scale they currently occupy. They are apparently on a developmental trajectory to surpass higher-end x86 CPUs in the relatively near future. How this is being achieved is yet to be authoritatively explained.

We know the Apple CPUs use a big/little core design, so that helps on low power but doesn't explain how the few high-power cores are so competitive. Maybe they are using some as-yet-unrevealed microarchitectural trick such as data value speculation. Everyone is using speculative execution but to my knowledge data value speculation has yet to be employed outside a research environment. This is where the CPU guesses at the result of arithmetic operations, speculatively proceeds, then rolls back if later in the pipeline the guessed value is wrong.

Another possibility is since all ARM instructions are predicated, this may allow better exploiting of instruction-level parallelism with less consumed power: https://en.wikipedia.org/wiki/Predication_(computer_architecture)
 

Attachments

  • PentiumProVsAlphaTPC_C_Bhandarkar_1997.jpg (117.5 KB)

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Another possibility is since all ARM instructions are predicated, this may allow better exploiting of instruction-level parallelism with less consumed power: https://en.wikipedia.org/wiki/Predication_(computer_architecture)

The 64-bit ARM uses a redesigned instruction set that removed predication. From what I understand, predication everywhere is actually harmful for instruction-level parallelism, as it makes superscalar execution more complicated. Modern ISA designs rely on branch prediction instead. The same principle is used by RISC-V, btw.

(To clarify: A64 does have a limited selection of predicated instructions, just not to the degree of its 32-bit predecessor. In 32-bit ARM, predication actually involves conditional execution. In A64, they instead introduced a subset of arithmetic instructions, such as conditional select, that use the condition flags for data selection.)
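A small C example of the kind of pattern that maps onto those A64 conditional instructions; the instruction sequences in the comments are typical compiler output, not guaranteed:

Code:
#include <stdint.h>
#include <stdio.h>

/* x > hi ? hi : x typically becomes: cmp / csel (no branch to mispredict). */
static int64_t clamp_max(int64_t x, int64_t hi) {
    return x > hi ? hi : x;
}

/* x < 0 ? -x : x often becomes: cmp / cneg. */
static int64_t abs64(int64_t x) {
    return x < 0 ? -x : x;
}

int main(void) {
    printf("%lld %lld\n", (long long)clamp_max(42, 10), (long long)abs64(-7));
    return 0;
}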
 
Last edited:
  • Like
Reactions: RandomDSdevel

joema2

macrumors 68000
Sep 3, 2013
1,646
866
The 64-bit ARM uses a redesigned instruction set that removed predication...

Thanks for that correction. Here is another blog article pondering the source of the ARM advantage, which concludes it's something about the microarchitectural implementation, not the instruction set itself.


He references this excellent paper (Blem, 2013) that delves into x86 vs ARM, but basically concludes ISA has no major impact, it all comes down to micro-architecture: https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa-power-struggles.pdf

That has been the traditional view since x86 came to dominate both the server-side and client-side markets. By that reasoning, an x86 CPU could be designed to be as power efficient for a mobile device as ARM, and an ARM CPU could scale to a high-end Xeon level but would burn about the same power -- assuming similar fabrication and no novel microarchitectural methods were used. Yet something profound has changed, as evidenced by the latest Apple CPUs. The question is what is different now that enables it.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
He references this excellent paper (Blem, 2013) that delves into x86 vs ARM, but basically concludes ISA has no major impact, it all comes down to micro-architecture: https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa-power-struggles.pdf

I do believe that the ISA has a certain impact, as it will affect the state the CPU has to track, but there are no major differences between x86-64 and ARM A64 in this regard. A large side effect of the ISA is ease of decoding — since ARM instructions are fixed-length and designed to have "neat" bit layouts, it's probably much easier to design a high-performance decoder for the modern ARM ISA. x86-64 needs to do a sequential scan to detect instruction lengths, and it has to deal with unaligned instructions, multiple different instruction formats and other weirdness.

Yet something profound has changed as evidenced by the latest Apple CPUs. The question is what is different now which is enabling that?

Well, one thing that has to be said about Apple CPUs is that they seem to operate at the limit of their useful frequency. The slope of the A12 power curve, for example, seems to go almost vertical at its max turbo frequency. If these measurements are correct and the A13/A14 behave similarly, this would mean that Apple simply "moved" the per-core performance problem instead of "solving" it. It is true though that they can achieve the same performance at much lower power usage, which by itself is an achievement.
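A quick back-of-the-envelope using the usual dynamic-power approximation (power scales roughly with C·V²·f, and higher frequency needs higher voltage) shows why the top of the curve looks like a wall. The voltage/frequency pairs below are invented for illustration, not measured A-series numbers:

Code:
#include <stdio.h>

int main(void) {
    /* { GHz, volts } pairs -- hypothetical, for illustration only. */
    const double points[][2] = {
        { 1.0, 0.60 }, { 2.0, 0.75 }, { 2.6, 0.95 }, { 3.0, 1.15 },
    };
    /* Dynamic power ~ C * V^2 * f; the constant C cancels in the ratio. */
    const double base = points[0][0] * points[0][1] * points[0][1];

    for (int i = 0; i < 4; i++) {
        double f = points[i][0], v = points[i][1];
        printf("%.1f GHz @ %.2f V -> %4.1fx the power of the 1.0 GHz point "
               "for %.1fx the frequency\n",
               f, v, (f * v * v) / base, f / points[0][0]);
    }
    return 0;
}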

As to the "what is different", I would bet that it's the fact that Apple has poached some of the best chip designers around (including many members of Intel designed team behind the Core architecture) and that they are sitting on billions of cash.
 
  • Like
Reactions: RandomDSdevel

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
If a CPU has ASICs for specific tasks, wouldn't that reduce or eliminate Cache misses? I've read that Cache misses are the big thing that's slowing down modern processors.

Say the CPU knows what workload it has to do, Video Decode for example, then it just refers it to the dedicated Video Codec circuitry using a few instructions. As opposed to trying to decode it with FP or Integer calculations which could have further cache misses.

On top of this, the A-series CPUs we've seen only have up to L2 cache, and boatloads of it. With a reduced instruction set, wouldn't that also result in fewer cache misses? I know that prediction has been a major speed booster to CPUs (and drawback to security given Spectre and Meltdown exploits) so theoretically wouldn't reducing the amount of prediction allow a CPU to perform better for fewer cycles?

I'm sure cmaier could answer this, or someone who is more versed in CPU architecture than me.
 
  • Like
Reactions: RandomDSdevel

joema2

macrumors 68000
Sep 3, 2013
1,646
866
I do believe that ISA has a certain impact, as it will affect the state the CPU has to track, but there are no major differences between x86-64 and ARM A64 in this regard. A large side-effect of ISA is ease of decoding — since ARM instructions are fixed-length and are designed to have "neat" bit layouts, it's probably much easier to design a high-performance decoder for modern ARM ISA. The x86-64 will need to do a sequential scan to detect opcode sizes and it all has to do with non-aligned memory access, multiple different instruction formats and other weirdness...

In many cases x86 instructions are pre-decoded by the front end, converted to micro-ops and held in a micro-op cache. As new x86 instructions are fetched, the front end determines if they are already converted and cached, then the back end simply fetches the corresponding micro-ops from cache and schedules them on execution units.

Apparently, parts of the legacy decoder are powered down during micro-op hits, which diminishes decoder power consumption. Newer x86 CPUs have larger micro-op caches which improve hit ratio and decrease the % of time a legacy decode is needed.

A 2016 paper (Hirki, et al) examined this closely and by actual measurement found that x86 decode power consumption is fairly low. They concluded that due to these techniques, the x86 instruction set was not a major factor in efficiency. Use of micro-op cache only started with Sandy Bridge, so maybe before that the x86 decode issue was more relevant.

In the paper the authors state they planned on using the same evaluation techniques to measure decode power efficiency of ARM CPUs but a quick search doesn't show that ever happened:

https://www.usenix.org/system/files/conference/cooldc16/cooldc16-paper-hirki.pdf
 
  • Like
Reactions: RandomDSdevel

joema2

macrumors 68000
Sep 3, 2013
1,646
866
If a CPU has ASICs for specific tasks, wouldn't that reduce or eliminate Cache misses? I've read that Cache misses are the big thing that's slowing down modern processors.

Typical hit ratios for CPU instruction cache, data cache, and branch prediction cache are high. That is one problem -- to further improve those requires a large increase in cache size for a small incremental benefit.

The biggest impediments to further speedup are fundamental and not easy to address. There is only so much instruction-level parallelism in a single code stream which a superscalar CPU can harvest. This is probably in the range of about 5 instructions.

They can use more cores, but the benefit is limited by Amdahl's Law, where even a small % of serial (non-parallelizable) code limits the available speedup. For many algorithms there must be some coordination between threads: https://en.wikipedia.org/wiki/Amdahl's_law
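For concreteness, Amdahl's law is just speedup = 1 / ((1 - p) + p/n) for parallel fraction p and n cores; plugging in a few made-up fractions shows how quickly the serial portion caps the gain:

Code:
#include <stdio.h>

/* Amdahl's law: with parallel fraction p and n cores,
 * speedup = 1 / ((1 - p) + p / n). */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double fractions[] = { 0.90, 0.95, 0.99 };   /* parallel fraction p */
    const int cores[] = { 4, 8, 16, 64 };

    for (int i = 0; i < 3; i++) {
        printf("p = %.2f:", fractions[i]);
        for (int j = 0; j < 4; j++)
            printf("  %2d cores -> %4.1fx", cores[j], amdahl(fractions[i], cores[j]));
        printf("\n");
    }
    return 0;
}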

Video Decode for example, then it just refers it to the dedicated Video Codec circuitry using a few instructions. As opposed to trying to decode it with FP or Integer calculations which could have further cache misses.

This is the third way to speed things up: dedicated hardware. This is viable and Apple will be heavily leveraging it on Ax CPUs, especially future ones. However the benefit is not from reduced cache misses, but simply faster performance. E.g., hardware video encode/decode is faster than software.

...The A-series CPUs we've seen only have up to L2 cache, and boatloads of it. With a reduced instruction set, wouldn't that also result in fewer cache misses?

Historically it was easier for x86 to maintain a better I-cache hit ratio because the instructions are more dense. You could liken it to fetching ZIP'd text files then unzipping (aka decoding) vs fetching the uncompressed file. However current CPUs (both RISC and CISC) have large caches so this is less a factor now.

...I know that prediction has been a major speed booster to CPUs (and drawback to security given Spectre and Meltdown exploits) so theoretically wouldn't reducing the amount of prediction allow a CPU to perform better for fewer cycles?

The problem is all major CPUs require very aggressive speculative out-of-order execution and branch prediction to provide current levels of performance. They can't just turn that off or greatly diminish it. CPU design effort to mitigate these side-channel attacks focuses on reducing the ability of apps to probe another app's address space via statistical timing analysis of cache behavior. Virtually all CPUs at a contemporary performance level, both RISC and CISC, are subject to this, not just x86. This includes the IBM POWER9 RISC CPU and the z13 CPU used in IBM mainframes.
 
  • Like
Reactions: RandomDSdevel

leman

macrumors Core
Oct 14, 2008
19,521
19,674
If a CPU has ASICs for specific tasks, wouldn't that reduce or eliminate Cache misses? I've read that Cache misses are the big thing that's slowing down modern processors.

Say the CPU knows what workload it has to do, Video Decode for example, then it just refers it to the dedicated Video Codec circuitry using a few instructions. As opposed to trying to decode it with FP or Integer calculations which could have further cache misses.

Cache misses are just about data that has to be fetched from the relatively slow system memory. It doesn't matter whether you have dedicated hardware to fulfil a function or whether it runs in "software", they will need to fetch the data all the same. One thing about tasks such as video decoding is that memory accesses can be reasonably predicted — the system can start fetching the data for the next frames while the hardware is busy decoding the current one.
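A toy sketch of that overlap in plain C; real decoders use dedicated DMA/prefetch engines, and __builtin_prefetch is just the GCC/Clang software hint, but the shape of the idea is the same:

Code:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FRAME_WORDS 4096   /* made-up "frame" size */

/* Stand-in for per-frame work: checksum the frame's data. */
static uint64_t decode_frame(const uint64_t *frame) {
    uint64_t acc = 0;
    for (size_t i = 0; i < FRAME_WORDS; i++) acc += frame[i];
    return acc;
}

static uint64_t decode_stream(const uint64_t *frames, size_t nframes) {
    uint64_t acc = 0;
    for (size_t f = 0; f < nframes; f++) {
        /* Hint the next frame into the caches while we "decode" this one
         * (one prefetch per 64-byte cache line). */
        if (f + 1 < nframes)
            for (size_t i = 0; i < FRAME_WORDS; i += 8)
                __builtin_prefetch(&frames[(f + 1) * FRAME_WORDS + i]);
        acc ^= decode_frame(&frames[f * FRAME_WORDS]);
    }
    return acc;
}

int main(void) {
    size_t nframes = 256;
    uint64_t *frames = calloc(nframes * FRAME_WORDS, sizeof *frames);
    printf("checksum: %" PRIu64 "\n", decode_stream(frames, nframes));
    free(frames);
    return 0;
}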


In many cases x86 instructions are pre-decoded by the front end, converted to micro-ops and held in a micro-op cache. As new x86 instructions are fetched, the front end determines if they are already converted and cached, then the back end simply fetches the corresponding micro-ops from cache and schedules them on execution units.

What I mean is that a wide back end needs a steady stream of instructions to be fed effectively. An advantage of something like the A64 ISA is that the front end can simply prefetch multiple 32-bit words in advance and decode them in parallel with relatively simple decoder blocks. It's trickier with x86, where you have to process the instruction stream sequentially (at least to some degree). And it should be easier to scale this up on ARM. I am not necessarily talking about the power consumption of the decoder, but about its throughput.
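A little toy decoder makes the difference concrete; the variable-length rule below is invented (real x86 length decoding is far messier), the point is only that each boundary depends on the previous one:

Code:
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fixed-width ISA: instruction i starts at byte i * 4, so many
 * instructions can be handed to simple decoders at once. */
static const uint8_t *fixed_insn(const uint8_t *code, size_t i) {
    return code + i * 4;
}

/* Invented toy length rule standing in for a variable-length encoding:
 * the low two bits of the first byte give the length (1..4 bytes). */
static size_t toy_length(const uint8_t *p) {
    return 1 + (p[0] & 3);
}

/* Finding instruction n requires walking every earlier instruction,
 * because each boundary depends on the previous one. */
static size_t nth_variable_insn(const uint8_t *code, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++)
        off += toy_length(code + off);
    return off;
}

int main(void) {
    uint8_t code[64];
    for (size_t i = 0; i < sizeof code; i++) code[i] = (uint8_t)(i * 37);

    printf("fixed-width:    insn 5 starts at byte %td\n",
           fixed_insn(code, 5) - code);
    printf("variable-width: insn 5 starts at byte %zu\n",
           nth_variable_insn(code, 5));
    return 0;
}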
 
  • Like
Reactions: RandomDSdevel

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
Say the CPU knows what workload it has to do, Video Decode for example, then it just refers it to the dedicated Video Codec circuitry using a few instructions. As opposed to trying to decode it with FP or Integer calculations which could have further cache misses.

This feels like focusing on the trees rather than the forest, IMO. ASICs have been better than CPUs at specialized tasks for a long time. It’s generally more about the cost/benefit of including specialized hardware for tech that might not be around in 5 years.

None of this stuff is new, really. It just seems to be in fashion again thanks to mobile phones helping place an emphasis on extracting every last milliamp of savings when it comes to power. So stuff that in the 00s would have just been left to the CPU because it’s cheaper, is being made an ASIC in the 10s/20s.

I think Apple’s emphasis on specialized hardware is more to do with their ability to control when they pick up a particular ASIC block, and how it’s tuned. Intel was slow on adding HEVC, for example. Apple’s ML acceleration is ahead of Intel and AMD. Apple’s got a working “high/low” core design that honestly leads the industry IMO.
 
  • Like
Reactions: RandomDSdevel

Tigger11

macrumors 6502a
Jul 2, 2009
543
396
Rocket City, USA
Our various experiences may involve the differences between synthetic benchmarks or small software components vs large real-world production commercial software. On the latter, in my testing I rarely saw Alpha or MIPS workstations perform faster than a Pentium Pro workstation of roughly similar price and config.

Only if you are implying that you are the one using synthetic benchmarks, not I, and I don't think you are. Render farms weren't being configured with MIPS and Alpha processors because they ran at the same speed as Pentium Pros and did the same amount of work. Workstations weren't being sold with Alpha and MIPS processors because they were the same speed as Pentium Pros at the work they were doing. No one paid me thousands of dollars to do ports in the 90s because the code would run at the same speed as on the Pentium Pro where it was currently running. Marshall didn't move a whole bunch of code to Alphas because it would run at the same speed as on a Pentium Pro. I'm sorry it was rare for you to see, but to imply, as you are, that Alpha and MIPS workstations were just the same speed as Pentium Pros is frankly ludicrous.
-Tig
 

vigilant

macrumors 6502a
Aug 7, 2007
715
288
Nashville, TN
In the late 90s I personally ran lots of side-by-side performance tests between various Alpha, MIPS and Pentium Pro workstations. On commercial software dominated by integer code, I never saw a marked RISC performance advantage. It is true the RISC machines often had better floating point performance, but that was a small subset of all use cases.

I hear you, but RISC not having a performance advantage in integer tasks is more a result of design priorities than something inherent to the instruction set.

Sun put out a chip, I believe it was the Sun Sparc T1, and if memory serves me correctly it was an integer beast. That was definitely a RISC chip. https://en.wikipedia.org/wiki/UltraSPARC_T1

I remember talking to an engineer friend who was working at Sun at the time. This was a very purpose built design to really hammer on Oracle databases, and web traffic.

I can definitely appreciate the comparison, but did think it was important to draw the line between core design vs ISA.

What's so interesting to me about what Apple is doing is that they are basically ceding the fight over ISA. Intel hasn't. The problem with Intel processors in general is that they are shackled by their dependency on x86 legacy. That may sound trivial, but if you had talked to commercial customers when Windows 8 was coming out, you'd have heard the backlash. There is a huge reluctance to move 20-year-old systems, with 20-year-old problems, to anything more modern. The result? The long road that leads to various levels of platform stagnation.

I actually applaud Microsoft for their efforts to try to move their platform forward. They have repeatedly tried to get Windows to move to 64-bit. They have tried repeatedly to get people off of Win32. Every time they do, someone complains that their VB6 order application won't run, despite the fact that the code base hasn't been touched in close to 20 years. It may sound like an exaggeration, but it's really not.
 
  • Like
Reactions: RandomDSdevel

vigilant

macrumors 6502a
Aug 7, 2007
715
288
Nashville, TN
Parallels or Apple?

https://www.macrumors.com/2020/07/27/vmware-confirms-macos-virtualization-bug-causes-crashes/





And is the GPU throughput better? "More stable" from vendor X versus vendor Y isn't really the point of why a substantial number of folks go to Boot Camp and boot onto the "raw metal" GPU. Similarly for the old-timers keeping old systems going after Apple stops signing anything for their system anymore.

Works better for nested hypervisors? Probably not.

If Apple improves the hypervisor framework approach, it could work well enough that most users would be happy. Heavy users, probably not (if the primary deployment target is another hypervisor, they will run into some issues).
If Apple could support IOMMU mapping to a virtual GPU and take much of the overhead off the VM vendors' work of emulating GPUs, that would solve some of the bigger pain points that currently exist.

Likewise, if it could be provisioned on top of the recovery system image (or something more specialized than the normal OS image), that would help too. It is a relatively "big footprint" hypervisor when it comes to production deployment in a scale-out context; if you are trying to take ESXi and Hyper-V off the table, you have to cover some of that ground.

The bug that happened is new, and VMware is using their own type-2 hypervisor. If they are using Apple's hypervisor, that's news to me. VMware has a purpose-built type-2 hypervisor that is shared across Windows, Linux and Mac.

The GPU performance on Parallels Lite for Mac is intentionally gimped to push you to buy their full solution. You CAN install Parallels Desktop, and enable the Apple native hypervisor, and still use everything that Parallels does for graphics.

I don't think we know if Parallels won't be able to do what they do today for graphics on Apple Silicon.

When it comes to nested hypervisors, that is a much harder question to answer. It comes down to semantics. I don't know the full details of this, as I don't think many details were given. It's my understanding that part of the boot chain actually invokes the Apple hypervisor to bring up the macOS system; this was done to help ensure that nothing could compromise startup. I don't know if Apple is treating their hypervisor as type-1 or type-2 when VMs get started. The difference being: if the hypervisor is already running all the time, does the VM exist side by side with macOS, with macOS divvying up resources as a resource broker? Or are guest VMs using the Apple hypervisor nested inside macOS? If it's the second, I could see a "triple nested" experience being painful.

I know I'll be asked how I know the hypervisor gets invoked before macOS starts running. I'll look for the Apple security talk; I believe it was from DEF CON last year.
 
  • Like
Reactions: RandomDSdevel

vigilant

macrumors 6502a
Aug 7, 2007
715
288
Nashville, TN
This actually happened with the Pentium Pro in 1995 and was later adopted for Core, plus AMD uses the technique. This means both Intel and AMD CPUs are internally RISC-like. The more dense CISC instructions actually reduce bus bandwidth and make more efficient use of instruction and data caches.

It may be for these reasons that AMD's server/workstation EPYC CPUs tend to be faster than the RISC POWER9: https://www.phoronix.com/scan.php?page=article&item=rome-power9-arm&num=1

Since both Intel and AMD CPUs are really covert RISC machines, and even more pure RISC machines like POWER9 are highly complex CPUs using out-of-order superscalar speculative execution, the original RISC advantages might seem to no longer apply. Contemporary RISC and CISC CPUs are both incredibly complex, and there is nothing "reduced" about either of them.

We must also keep in mind as ARM-instruction-set CPUs scale upward, it is not those vs Intel, but those vs x86 (which includes AMD from whom Intel licenses the x64 instruction set which was developed solely by AMD). IOW if there is an ARM or Apple Silicon advantage it will be manifest against x86 in general which includes AMD.

Ever since the Pentium Pro and succeeding x86 CPUs seemed to nullify the RISC performance advantage, for decades the traditional view was process and fabrication technology was the differentiator, not the instruction set. The corollary was if ARM could ever scale up to Xeon or EPYC levels it would burn just as much power as x86.

However in recent years Apple's Ax CPU development has been on an improved power/performance curve which currently seems to be holding. Already the A12Z CPU in an iPad Pro is roughly as fast as the 4.2 GHz quad-core Intel Kaby Lake CPU in a 2017 iMac 27. Nobody has yet authoritatively explained how this was achieved, when all previous RISC designs failed to compete. It is unlikely to be simply due to the ARM instruction set itself or traditional RISC "advantages".

I’ve responded to a lot of people in here, so I don’t recall if I responded to this.

The success of x86 isn’t on the merits of x86 being superior. It’s success comes from two different things that don’t have anything to do with instruction sets, design, or architecture, though there have been hills and valleys that I’m happy to speak to but I’m sure others could probably make the argument better.

The success of x86 comes from being cheaper than most RISC solutions, and the ecosystem of software it supports.

The big deals I’ve seen where x86 displaced any big iron solution came down to “fast enough” and “cheap” and “cheaper to maintain”. I don’t want to gloss over that as if it’s a bad thing or not important. Any classic RISC solution has traditionally been obnoxiously expensive to run and maintain. You can definitely build an x86 system that is as heavily engineered as classic RISC solutions. My personal favorite was the beastly vBlock. Customers look at the cost of the vBlock (it’s expensive) and compare it to what they are looking at to upgrade an IBM Power solution, and order 2 vBlocks.

Intel has this tendency to do very well, and have great chips, then they get drunk on their own kool-aid, and the next thing you know you have the Pentium 4, which was genuinely marketed as a way to enhance your internet experience at a time when Dial Up was still largely the norm. It was ridiculous advertising that was only outdone by how bad the actual architecture was.

Right now, on the power and cooling front, we are back in the Pentium 4 days from a customer standpoint. I'm sure Intel probably thinks that having more Bunny Man commercials will make the pain better.

The A12Z CPU from the iPad Pro, and now in the Mac mini DTK, is an excellent architecture. I wish Apple would disclose more about the chip. The Mac mini DTK, running native code, is getting Geekbench compute scores within 10% of my MacBook Pro 16.

Largely, you’re right, CISC and RISC have their pros and cons, and some how we end up with these things that have a little bit of both to the point where it seems like non-sense words.

We can argue about the trade offs all day. I’m not an electrical engineer, but I’ve been fairly deep in a lot of this stuff since I was 16. I welcome a spirited debate.

The reality is that Apple can't build the products they want, or that most of us want, with Intel. I will gladly recommend Intel and AMD to anyone buying data center hardware. Intel still has an advantage over AMD, and it's so simple it seems stupid, yet it truly seems like the only reason people buying servers aren't buying AMD: both have hardware-accelerated virtualization, but because they go about it differently, you can't live-vMotion from an Intel box to an AMD box.

I’m a HUGE fan of what AMD is doing with Threadripper and EPYC. For years I wanted Apple to just buy AMD, and have their own custom x86 SOCs like Microsoft and Sony do. It’s worth noting that while Microsoft and Sony have outstanding SOCs coming out, they do appear to need fairly aggressive cooling.
 
  • Like
Reactions: RandomDSdevel

joema2

macrumors 68000
Sep 3, 2013
1,646
866
...Render farms weren't being configured with MIPS and Alpha processors because they ran at the same speed as Pentium Pros and did the same amount of work. Workstations weren't being sold with Alpha and MIPS processors because they were the same speed as Pentium Pros at the work they were doing....Marshall didn't move a whole bunch of code to Alphas because it would run at the same speed as on a Pentium Pro....to imply, as you are, that Alpha and MIPS workstations were just the same speed as Pentium Pros is frankly ludicrous...

I clearly stated I was speaking about large commercial software dominated by integer code -- not floating point code and not graphics code. Yes, RISC workstations were fast at floating point code and graphics. Those workloads (while important) characterized a small % of the total IT commerce. That can be seen by Sun Microsystems 1996 revenue of $7 billion vs IBM's 1996 revenue of $75 billion. That was six times NASA's 1996 budget.

As shown in the above attachment, identical database server software running on a quad Pentium Pro was faster than a Quad Alpha 21164, at a much lower price. The above links show historical TPC-C performance *and* price/performance of systems from the early 2000s. In general x86 was faster and at vastly better price/performance. That is an indelible part of history that can be examined. That is one reason why RISC is rarely used in datacenters today. While fast in narrow scientific applications, it could not compete in the much larger marketplace of integer-oriented commercial code, especially given x86 economies of scale and lower profit margins of x86 system vendors.

Ironically in recent years things may be changing. Now Intel and their system vendors are the high-margin supplier, and in a new IT world oriented toward power efficiency, ARM or other alternatives can gain traction.
 
  • Like
Reactions: RandomDSdevel

Waragainstsleep

macrumors 6502a
Oct 15, 2003
612
221
UK
I’m a HUGE fan of what AMD is doing with Threadripper and EPYC. For years I wanted Apple to just buy AMD, and have their own custom x86 SOCs like Microsoft and Sony do. It’s worth noting that while Microsoft and Sony have outstanding SOCs coming out, they do appear to need fairly aggressive cooling.

Do Intel licence x86 to Microsoft and Sony? I thought only AMD had a licence.
 

vigilant

macrumors 6502a
Aug 7, 2007
715
288
Nashville, TN
Do Intel licence x86 to Microsoft and Sony? I thought only AMD had a licence.

AMD has the licenses, and they provide “semi custom” solutions to the big game system makers.

Both the PlayStation 4, and Xbox One are custom SOCs designed with AMD. The two SOCs are as similar as they are different as a result.

The PlayStation 5 and Xbox Series X are pretty much the same. They are 8 of the latest Zen cores, with a ton of the Radeon cores on an SOC.

It’s a beautiful configuration, that requires a lot of power, and a TON of cooling. Microsoft has stated that the full power of the Xbox Series X, 12TFP will be available continuously, and fixed clock rates.
 