
theorist9

macrumors 68040
May 28, 2015
3,882
3,061
If Apple wants their new pro devices to fully appeal to the professional STEM crowd, they need to create an AS replacement for the Intel MKL (Math Kernel Library) (https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.7m48kb).

It makes a significant difference. Here's an illustration of how much of a difference it makes on x86 chips: The Intel MKL is designed to work only with Intel processors, so if you are running, say, Mathematica or MATLAB on an AMD machine, by default you wouldn't get access to the MKL. However, up until about 2020, users running AMD devices could use a Terminal workaround (the MKL_DEBUG_CPU_TYPE environment variable) to "trick" the MKL into thinking the machine had an Intel chip (see https://www.pugetsystems.com/labs/h...for-Python-Numpy-And-Other-Applications-1637/).
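To make the stakes concrete, here's a minimal Python/NumPy timing sketch of the kind of dense linear algebra the MKL accelerates (my illustration, not Wolfram's benchmark; it assumes a NumPy build linked against MKL, and the environment variable in the comment is the pre-2020 workaround from the Puget Systems article):

Code:
# Rough matmul timing; the BLAS backend (MKL, OpenBLAS, Accelerate) does the heavy lifting.
# The old AMD workaround was set in the shell before launching Python:
#   export MKL_DEBUG_CPU_TYPE=5    # forced MKL's fast AVX2 path; removed in MKL 2020
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4096, 4096))
b = rng.random((4096, 4096))

t0 = time.perf_counter()
c = a @ b                          # dense matrix multiply, dispatched to the BLAS backend
dt = time.perf_counter() - t0
print(f"4096x4096 matmul: {dt:.2f} s = {2 * 4096**3 / dt / 1e9:.0f} GFLOP/s")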

Mathematica users with AMD devices have reported 15%–30% speedups when they used this workaround to gain access to the MKL.

MKL versions beginning in 2020 shut the door on this. And of course even if that weren't the case, this wouldn't work on AS. So, IMO, Apple needs to create an equivalent to MKL optimized for AS, and publicize it heavily, so that those building AS-native versions of computational software can incorporate calls to fast math libraries into their code.

It appears Apple hasn't done this yet: the latest version of Mathematica runs natively on AS, yet its built-in benchmark is slower on the M1 than on a Core i9 iMac (3.1–3.2 on the M1 vs. 4.5 on a 2019 27-inch iMac with a 3.6 GHz 8-core i9). And this isn't because of the difference in core count; the built-in benchmark uses only four cores.

Consistent with Mathematica's native benchmark being slower than expected on the M1, my mid-2014 MBP benchmarks at 3.0, nearly as fast as the M1 machines. And that's definitely not a core-count effect, because my MBP has only 4 cores.

Some have argued that Mathematica needs a better built-in benchmark, and that's probably true, but still...
 
Last edited:
  • Like
Reactions: richmlow

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
If you're doing heavy workloads like this, you probably shouldn't go for the entry-level chip that is the M1. Besides, using only four cores and comparing a portable machine with a desktop introduces a lot of bias. Test again with the M1X in real conditions (i.e., not just the benchmark) against an Intel laptop, and then we'll see.

EDIT: I assumed you were talking about a MacBook when you mentioned M1, but I realise this could also be a new Mini or iMac. Which one are you using? This is important because they wouldn’t have the same TDP.
 
Last edited:

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
Also note that the press release says JEDEC has updated the non-X standard:

LPDDR5 and LPDDR5X are pin-to-pin compatible, greatly simplifying development of system-on-chips (SoCs) and platforms supporting the new type of memory. To enable higher data transfer rates and increase reliability of the upcoming low-power memory subsystems, LPDDR5X introduces pre-emphasis function to improve the signal-to-noise ratio (to enable higher clocks and performance improvements) and reduce the bit error rate as well as adaptive refresh management, and per-pin decision feedback equalizer (DFE) to enhance robustness of the memory channel (a page taken from the DDR5 standard).

Source: https://www.tomshardware.com/news/jedec-publishes-lpddr5x-specification
 

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
So folks are believing that Apple is going to fork the kernel scheduler just to benefit an Intel CPU SoC
Consider it this way: Maybe they'd revamp the scheduler to use more than 64 threads now, in case they make a CPU with more than 64 cores in the future.
 

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
It is very likely that the Pro line will get more than 64 threads within a few years. After all, semi-professional CPUs like Threadrippers are already at 128 threads. If AMD can do it on x86, I expect AS to follow sooner rather than later.
 

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
That's literally twice the per-pin data rate of the LPDDR4X in the M1... but I doubt we will see this in production any time soon.

Here's what Anandtech says:

For eventual mobile SoCs using 8533Mbps memory, the peak theoretical available bandwidth would grow from 51.2GB/s to 68.26GB/s, allowing future designs to further increase CPU and GPU performances. It’s to be noted, that we haven’t heard much about power efficiency improvements of the new LP5X standard, so I assume that we’ll be relying on DRAM vendors to improve power efficiency via more advanced manufacturing nodes to keep total power usage inside of devices in check.

I estimate we’ll be seeing LPDDR5X in late 2022 or 2023 SoCs.
Source: https://www.anandtech.com/show/16851/jedec-announces-lpddr5x-at-up-to-8533mbps
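The arithmetic behind those figures is straightforward (the 64-bit bus width is my assumption for the quoted numbers; the last line uses the M1's actual 128-bit LPDDR4X-4266 configuration):

Code:
# Peak theoretical bandwidth = data rate (MT/s) x bus width (bits) / 8 bits per byte
def peak_bw_gb_s(mt_per_s, bus_bits):
    return mt_per_s * bus_bits / 8 / 1000  # GB/s

print(peak_bw_gb_s(6400, 64))    # LPDDR5:         51.2 GB/s
print(peak_bw_gb_s(8533, 64))    # LPDDR5X:        68.26 GB/s
print(peak_bw_gb_s(4266, 128))   # M1's LPDDR4X:  ~68.3 GB/s total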
 

Appletoni

Suspended
Mar 26, 2021
443
177
It is very likely that the Pro line will get more than 64 threads within a few years. After all, semi-professional CPUs like Threadrippers are already at 128 threads. If AMD can do it on x86, I expect AS to follow sooner rather than later.
I want it now, or at least inside the MacBook Pro 16-inch M1X, not within a few years.
 

AttoA

macrumors member
Feb 1, 2021
34
145
If Apple wants their new pro devices to fully appeal to the professional STEM crowd, they need to create an AS replacement for the Intel MKL (Math Kernel Library) (https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.7m48kb).

It makes a significant difference. Here's an illustration of how much of a difference it makes on x86 chips: The Intel MKL is designed to work only with Intel processors, so if you are running, say, Mathematica or MATLAB on an AMD machine, by default you wouldn't get access to the MKL. However, up until about 2020, users running AMD devices could use a Terminal workaround (the MKL_DEBUG_CPU_TYPE environment variable) to "trick" the MKL into thinking the machine had an Intel chip (see https://www.pugetsystems.com/labs/h...for-Python-Numpy-And-Other-Applications-1637/).

Mathematica users with AMD devices have reported 15%–30% speedups when they used this workaround to gain access to the MKL.

MKL versions beginning in 2020 shut the door on this. And of course even if that weren't the case, this wouldn't work on AS. So, IMO, Apple needs to create an equivalent to MKL optimized for AS, and publicize it heavily, so that those building AS-native versions of computational software can incorporate calls to fast math libraries into their code.

It appears Apple hasn't done this yet: the latest version of Mathematica runs natively on AS, yet its built-in benchmark is slower on the M1 than on a Core i9 iMac (3.1–3.2 on the M1 vs. 4.5 on a 2019 27-inch iMac with a 3.6 GHz 8-core i9). And this isn't because of the difference in core count; the built-in benchmark uses only four cores.

Consistent with Mathematica's native benchmark being slower than expected on the M1, my mid-2014 MBP benchmarks at 3.0, nearly as fast as the M1 machines. And that's definitely not a core-count effect, because my MBP has only 4 cores.

Some have argued that Mathematica needs a better built-in benchmark, and that's probably true, but still...
Not sure if this is the same thing, but isn't the Accelerate Framework what you want? (since macOS 10.3)
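A quick way to check which backend a NumPy build actually dispatches to (on Apple Silicon some builds link Accelerate; Intel-oriented builds typically link MKL or OpenBLAS):

Code:
# Prints the BLAS/LAPACK libraries this NumPy build was compiled against
import numpy as np
np.show_config()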
 

Appletoni

Suspended
Mar 26, 2021
443
177
M1 has only 8 cores, and half of them are high efficiency/low power. You can't realistically expect M1X to get 32 cores.
Believe me, I can.
It's the same as going from a 7- or 8-core GPU to a 32-core GPU; the CPU can scale in the same way.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Not sure if this is the same thing, but isn't the Accelerate Framework what you want? (since macOS 10.3)

This situation is somewhat like Apple's stance that Metal is a complete replacement for OpenCL and CUDA. The scope of the coverage isn't the same.

"...
  • Enhanced math routines enable developers and data scientists to create performant science, engineering, or financial applications
  • Core functions include BLAS, LAPACK, sparse solvers, fast Fourier transforms (FFT), random number generator functions (RNG), summary statistics, data fitting, and vector math ..."

Both have general BLAS and LAPACK, but other things are different. Solvers? Accelerate covers some. Finance/statistics/data-fitting computations? Nope. If you follow the "DSP" link on that Apple page, Accelerate has "virtual" DSP commands; if you keep scrolling down, there are some FFT routines.
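For that overlapping core, caller code looks the same whichever library is underneath. A minimal SciPy sketch (my illustration; it dispatches to whatever BLAS/LAPACK the SciPy build was linked against):

Code:
import numpy as np
from scipy.linalg import blas, lapack

a = np.asfortranarray(np.random.rand(500, 500))
b = np.asfortranarray(np.random.rand(500, 500))

c = blas.dgemm(alpha=1.0, a=a, b=b)   # BLAS level 3: C = alpha * A * B
lu, piv, info = lapack.dgetrf(a)      # LAPACK: LU factorization with pivoting
print(c.shape, info)                  # info == 0 means the factorization succeeded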

In the other direction, Accelerate is growing its scope of packaged, "black box" AI/ML API support; Apple's focus is drifting off in a different direction there. [Intel has yet another API (or APIs?) targeting AI/ML; oneAPI is part of that, and Intel's marketing has more or less wrapped the MKL into the oneAPI umbrella.] Apple also has more image-processing helper frameworks.

So Apple is doing a substantial percentage of the work. The issue, though, is that when you kick the folks who were doing all of the work to the curb, the onus of delivering the rest falls on Apple. Intel did a lot of "dot the i's and cross the t's" work that Apple got to coast on while on the Intel platform.


But in terms of appealing to the folks doing hard-core numerical simulation and modeling, not having ECC is one of those "show stopper" checklist items, and it matters more than library breadth: getting the wrong answer faster doesn't help.

There is a skew in Apple's Accelerate toward keeping audio, video, and gaming-engine folks happy. I doubt more mainstream, hard-core STEM work is a primary driver here (beyond basic, legacy checklist libraries like BLAS and LAPACK).
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Is this something Apple can tune the Neural Engine to do?

Generally not. The Neural Engine is mostly about trading precision for speed: instead of accurate 64-bit values, it drops down to 16-, 8-, or 4-bit values to crank up the "operations per second" by operating on smaller and smaller chunks of data.

Many neural-network heuristics are just trying to get to a "more right than wrong" answer, i.e., an "answer" in the right zone.

Numerical simulation/modeling, in contrast, requires more operations per second while still maintaining top-end accuracy. Models of the real world have usually taken shortcuts on accuracy already; throwing away more precision just gets you more inaccurate answers.

[Video and audio can often cut corners because of the quirks and limitations of eyes and ears with respect to small data inaccuracies. Drop a single bit or two every once in a long while ... nobody is going to notice.]
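Here's a quick NumPy illustration of how accuracy drains away as the data type shrinks (rounding on the CPU, just to show the effect; this is not Neural Engine code):

Code:
import numpy as np

x = np.random.default_rng(1).random(10_000)   # 10,000 values in [0, 1)
exact = x.sum(dtype=np.float64)
for dt in (np.float64, np.float32, np.float16):
    s = x.astype(dt).sum(dtype=dt)            # accumulate at reduced precision
    print(dt.__name__, float(s), f"rel. error {abs(float(s) - exact) / exact:.1e}")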
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
If you're doing heavy workloads like this, you probably shouldn't go for the entry-level chip that is the M1. Besides, using only four cores and comparing a portable machine with a desktop introduces a lot of bias. Test again with the M1X in real conditions (i.e., not just the benchmark) against an Intel laptop, and then we'll see.

EDIT: I assumed you were talking about a MacBook when you mentioned M1, but I realise this could also be a new Mini or iMac. Which one are you using? This is important because they wouldn’t have the same TDP.
I'm afraid you've completely misunderstood my post.

1) As I explained, the MMA (Mathematica) benchmark uses only 4 cores*. So no "bias" is introduced by using the MMA benchmark to compare the M1, which has 4 performance cores, to machines that have more than 4, since the additional cores on the latter go unused! This seems pretty obvious, so I feel the problem is that you didn't bother to give my post much thought before responding.

[*It's actually 4 cores max, since the benchmark employs a combination of single-threaded and multi-threaded workloads.]

EDIT: The benchmark is actually single-core, so the number of cores is not an issue.

More broadly, as a practical matter, most computations in MMA can't be parallelized, so getting computations done faster is typically all about single-core speed.

2) No "bias" is introduced by comparing the M1 to a desktop in this context. You've completely missed the point here. The point is that, for most workloads, the per-core performance of the M1 chip is as fast as or faster than that of the Intel i9 desktop, yet here it's much slower. That's a red flag.

Indeed, the M1's benchmark performance on a native build of MMA is so slow that it's about as fast as my mid-2014 MBP. So clearly something is not right, and I infer it has something to do with fast math libraries being available for the Intel chips but not for the M1.

3) The MMA benchmark runs fairly quickly, and the TDP of the M1 chip is very low, so the resulting performance difference between the M1 in the Air vs. the MBP vs. the Mini isn't significant. Both M1 MBP and M1 Air users have posted benchmark values of 3.1–3.2.
 
Last edited:
  • Like
Reactions: richmlow

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
Not sure if this is the same thing, but isn't the Accelerate Framework what you want? (since macOS 10.3)
I wasn't aware of that, and it does seem to be similar (https://poppy-optics.readthedocs.io/en/stable/performance.html), but then I'm still left wondering at the sub-par performance of the M1 with Mathematica (or at least with Mathematica's internal benchmark). Maybe there's some technical reason Wolfram couldn't link Mathematica to Accelerate when they did the native build, or maybe Accelerate simply isn't as comprehensive as MKL.
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
At least there's finally an open-source Fortran compiler available for AS (it seems it's still experimental, but stable); that wasn't the case six months ago. Because of its speed, Fortran is still used extensively in scientific computing.



Having an open-source Fortran compiler is pretty basic (think of how many R and SciPy users there are). If I were running Apple, which I'm not, a few years ago I would have assigned a team to identify all the essential, widely used open-source projects, and tasked Apple developers with helping get them ported to AS. That way they would all have been up and running soon after the first AS Macs hit the market.

This would have been good marketing. STEM users, including data scientists, are influencers in their own way.
 
Last edited:
  • Like
Reactions: cbum

CWallace

macrumors G5
Aug 17, 2007
12,528
11,543
Seattle, WA
Maybe, while "native", this first release of Mathematica for Apple Silicon is not yet fully optimized? Perhaps later builds will better leverage things like the Accelerate Framework or other features to improve performance.
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
Maybe, while "native", this first release of Mathematica for Apple Silicon is not yet fully-optimized? Perhaps later builds will better leverage things like the Accelerate Framework or other features to improve performance.
That's possible. But Wolfram (the maker of Mathematica) has always had a close connection with Apple. The second release of Mathematica (1.2, in August 1989) introduced a Mac front end, and Mathematica has been featured in keynotes at previous WWDCs.

So given this, and given the complexity of the software (the app is 11 GB installed, as compared with 2 GB for Word), I would be surprised if Apple didn't identify Wolfram as one of the developers it contacted early about creating a native build (i.e., prior to the release of the DTK), as it did for MS (Office) and Adobe (CS). And still, Wolfram took until now to release its first native build.

So shouldn't this already be well-optimized for what AS offers?

[Of course, given Mathematica's complexity, I suppose you could argue that Wolfram needs still more time.]
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,522
19,679
1) As I explained, the MMA (Mathematica) benchmark uses only 4 cores*. So no "bias" is introduced by using the MMA benchmark to compare the M1, which has 4 performance cores, to machines that have more than 4, since the additional cores on the latter go unused! This seems pretty obvious, so I feel the problem is that you didn't bother to give my post much thought before responding.

[*It's actually 4 cores max, since the benchmark employs a combination of single-threaded and multi-threaded workloads.]

Hm, are you sure about this? Wolfram’s documentation doesn’t seem to mention anything… I’ll have to investigate later

At least there's finally an open-source Fortran compiler available for AS (it seems it's still experimental, but stable); that wasn't the case six months ago. Because of its speed, Fortran is still used extensively in scientific computing.

An unofficially patched GCC has been available since November… the problem is that the GCC devs couldn't care less about Apple platforms, so they are pretty much ignoring this. And there is a Fortran project within LLVM, but it will be at least a year before it's ready to be used. I am eagerly looking forward to the time when I can purge my machine of GCC entirely.
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
Hm, are you sure about this? Wolfram’s documentation doesn’t seem to mention anything… I’ll have to investigate later
When I run the benchmark, it looks like this (see screenshot below).

But I just looked at the code, and there is *no* parallelization in the benchmark (no function containing "Parallel" is used). [There must be some averaging in Activity Monitor such that, while it looks like multiple cores are being used simultaneously, only one is actually in use at a time.]

So while the workload gets distributed across multiple cores, only one core is in use at any given time, which means the benchmark is assessing single-core speed only. I'll add an edit to my original post.

This of course supports my point that there is something severely limiting Mathematica's native performance on the M1. Its single-core speed shouldn't be (4.5–3.15)/((4.5+3.15)/2)*100 = 35% slower (percent difference relative to the mean) than a 2019 3.6 GHz i9 desktop, nor should it be only (3.15–3)/((3+3.15)/2)*100 = 5% faster than the CPU on a mid-2014 MBP.
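(For anyone checking my arithmetic, those percentages are differences taken relative to the mean of the two benchmark scores:)

Code:
def pct_diff(a, b):
    # percent difference relative to the mean of the two values
    return abs(a - b) / ((a + b) / 2) * 100

print(f"{pct_diff(4.5, 3.15):.0f}%")   # M1 vs. 2019 i9 iMac: ~35%
print(f"{pct_diff(3.15, 3.0):.0f}%")   # M1 vs. mid-2014 MBP: ~5%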

[Attached screenshot: Activity Monitor CPU load while the benchmark runs]
 
Last edited:
  • Like
Reactions: richmlow