
theorist9

macrumors 68040
May 28, 2015
3,882
3,061
If Apple wants their new pro devices to fully appeal to the professional STEM crowd, they need to create an AS replacement for the Intel MKL (Math Kernel Library) (https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.7m48kb).

It makes a significant difference. Here's an illustration of how much of a difference it makes on x86 chips: The Intel MKL is designed to work only with Intel processors, so if you are running, say, Mathematica or MATLAB on an AMD machine, by default you wouldn't get access to the MKL. However, up until about 2020, users running AMD devices could use a Terminal workaround (the MKL_DEBUG_CPU_TYPE environment variable) to "trick" the MKL into thinking the machine had an Intel chip (see https://www.pugetsystems.com/labs/h...for-Python-Numpy-And-Other-Applications-1637/).
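To make the stakes concrete, here's a minimal Python/NumPy timing sketch of the kind of dense linear algebra the MKL accelerates (my illustration, not Wolfram's benchmark; it assumes a NumPy build linked against MKL, and the environment variable in the comment is the pre-2020 workaround from the Puget Systems article):

Code:
# Rough matmul timing; the BLAS backend (MKL, OpenBLAS, Accelerate) does the heavy lifting.
# The old AMD workaround was set in the shell before launching Python:
#   export MKL_DEBUG_CPU_TYPE=5    # forced MKL's fast AVX2 path; removed in MKL 2020
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4096, 4096))
b = rng.random((4096, 4096))

t0 = time.perf_counter()
c = a @ b                          # dense matrix multiply, dispatched to the BLAS backend
dt = time.perf_counter() - t0
print(f"4096x4096 matmul: {dt:.2f} s = {2 * 4096**3 / dt / 1e9:.0f} GFLOP/s")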

Mathematica users with AMD devices have reported 15%–30% speedups when they used this workaround to gain access to the MKL.

MKL versions beginning in 2020 shut the door on this. And of course even if that weren't the case, this wouldn't work on AS. So, IMO, Apple needs to create an equivalent to MKL optimized for AS, and publicize it heavily, so that those building AS-native versions of computational software can incorporate calls to fast math libraries into their code.

It appears Apple hasn't done this yet: the latest version of Mathematica runs natively on AS, yet its built-in benchmark is slower on the M1 than on a Core i9 iMac (3.1–3.2 on the M1 vs. 4.5 on a 2019 27-inch iMac with a 3.6 GHz 8-core i9). And this isn't because of the difference in core count; the built-in benchmark uses only four cores.

Consistent with Mathematica's native benchmark being slower than expected on the M1, my mid-2014 MBP benchmarks at 3.0, nearly as fast as the M1 machines. And that's definitely not a core-count effect, because my MBP has only 4 cores.

Some have argued that Mathematica needs a better built-in benchmark, and that's probably true, but still...
 
Last edited:
  • Like
Reactions: richmlow

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
If you're doing heavy workloads like this, you probably shouldn't go for the entry-level chip that is the M1. Besides, using only four cores and comparing a portable machine with a desktop introduces a lot of bias. Test again with the M1X in real conditions (i.e., not just the benchmark) against an Intel laptop, and then we'll see.

EDIT: I assumed you were talking about a MacBook when you mentioned M1, but I realise this could also be a new Mini or iMac. Which one are you using? This is important because they wouldn’t have the same TDP.
 
Last edited:

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
Also note that the press release says JEDEC has updated the non-X standard:

LPDDR5 and LPDDR5X are pin-to-pin compatible, greatly simplifying development of system-on-chips (SoCs) and platforms supporting the new type of memory. To enable higher data transfer rates and increase reliability of the upcoming low-power memory subsystems, LPDDR5X introduces pre-emphasis function to improve the signal-to-noise ratio (to enable higher clocks and performance improvements) and reduce the bit error rate as well as adaptive refresh management, and per-pin decision feedback equalizer (DFE) to enhance robustness of the memory channel (a page taken from the DDR5 standard).

Source: https://www.tomshardware.com/news/jedec-publishes-lpddr5x-specification
 

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
So folks are believing that Apple is going to fork the kernel scheduler just to benefit an Intel CPU SoC
Consider it this way: Maybe they'd revamp the scheduler to use more than 64 threads now, in case they make a CPU with more than 64 cores in the future.
 

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
It is very likely that the Pro line will get more than 64 threads within a few years. After all, semi-professional CPUs like Threadrippers are already at 128 threads. If AMD can do it on x86, I expect AS to follow sooner rather than later.
 

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
That's literally twice the per-pin data rate of the LPDDR4X in the M1... but I doubt we will see this in production any time soon.

Here's what Anandtech says:

For eventual mobile SoCs using 8533Mbps memory, the peak theoretical available bandwidth would grow from 51.2GB/s to 68.26GB/s, allowing future designs to further increase CPU and GPU performances. It’s to be noted, that we haven’t heard much about power efficiency improvements of the new LP5X standard, so I assume that we’ll be relying on DRAM vendors to improve power efficiency via more advanced manufacturing nodes to keep total power usage inside of devices in check.

I estimate we’ll be seeing LPDDR5X in late 2022 or 2023 SoCs.
Source: https://www.anandtech.com/show/16851/jedec-announces-lpddr5x-at-up-to-8533mbps
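The arithmetic behind those figures is straightforward (the 64-bit bus width is my assumption for the quoted numbers; the last line uses the M1's actual 128-bit LPDDR4X-4266 configuration):

Code:
# Peak theoretical bandwidth = data rate (MT/s) x bus width (bits) / 8 bits per byte
def peak_bw_gb_s(mt_per_s, bus_bits):
    return mt_per_s * bus_bits / 8 / 1000  # GB/s

print(peak_bw_gb_s(6400, 64))    # LPDDR5:         51.2 GB/s
print(peak_bw_gb_s(8533, 64))    # LPDDR5X:        68.26 GB/s
print(peak_bw_gb_s(4266, 128))   # M1's LPDDR4X:  ~68.3 GB/s total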
 

Appletoni

Suspended
Mar 26, 2021
443
177
It is very likely that the Pro line will get more than 64 threads within a few years. After all, semi-professional CPUs like Threadrippers are already at 128 threads. If AMD can do it on x86, I expect AS to follow sooner rather than later.
I want it now, or at least inside the MacBook Pro 16-inch M1X, not within a few years.
 

AttoA

macrumors member
Feb 1, 2021
34
145
If Apple wants their new pro devices to fully appeal to the professional STEM crowd, they need to create an AS replacement for the Intel MKL (Math Kernel Library) (https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.7m48kb).

It makes a significant difference. Here's an illustration of how much of a difference it makes on x86 chips: The Intel MKL is designed to work only with Intel processors, so if you are running, say, Mathematica or MATLAB on an AMD machine, by default you wouldn't get access to the MKL. However, up until about 2020, users running AMD devices could use a Terminal workaround (the MKL_DEBUG_CPU_TYPE environment variable) to "trick" the MKL into thinking the machine had an Intel chip (see https://www.pugetsystems.com/labs/h...for-Python-Numpy-And-Other-Applications-1637/).

Mathematica users with AMD devices have reported 15%–30% speedups when they used this workaround to gain access to the MKL.

MKL versions beginning in 2020 shut the door on this. And of course even if that weren't the case, this wouldn't work on AS. So, IMO, Apple needs to create an equivalent to MKL optimized for AS, and publicize it heavily, so that those building AS-native versions of computational software can incorporate calls to fast math libraries into their code.

It appears Apple hasn't done this yet: the latest version of Mathematica runs natively on AS, yet its built-in benchmark is slower on the M1 than on a Core i9 iMac (3.1–3.2 on the M1 vs. 4.5 on a 2019 27-inch iMac with a 3.6 GHz 8-core i9). And this isn't because of the difference in core count; the built-in benchmark uses only four cores.

Consistent with Mathematica's native benchmark being slower than expected on the M1, my mid-2014 MBP benchmarks at 3.0, nearly as fast as the M1 machines. And that's definitely not a core-count effect, because my MBP has only 4 cores.

Some have argued that Mathematica needs a better built-in benchmark, and that's probably true, but still...
Not sure if this is the same thing, but isn't the Accelerate Framework what you want? (since macOS 10.3)
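A quick way to check which backend a NumPy build actually dispatches to (on Apple Silicon some builds link Accelerate; Intel-oriented builds typically link MKL or OpenBLAS):

Code:
# Prints the BLAS/LAPACK libraries this NumPy build was compiled against
import numpy as np
np.show_config()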
 

Appletoni

Suspended
Mar 26, 2021
443
177
M1 has only 8 cores, and half of them are high efficiency/low power. You can't realistically expect M1X to get 32 cores.
Believe me, I can.
It's the same as going from a 7- or 8-core GPU to a 32-core GPU; the CPU can scale in the same way.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Not sure if this is the same thing, but isn't the Accelerate Framework what you want? (since macOS 10.3)

This situation is somewhat like Apple's stance that Metal is a complete replacement for OpenCL and CUDA. The scope of the coverage isn't the same.

"...
  • Enhanced math routines enable developers and data scientists to create performant science, engineering, or financial applications
  • Core functions include BLAS, LAPACK, sparse solvers, fast Fourier transforms (FFT), random number generator functions (RNG), summary statistics, data fitting, and vector math ..."

Both have general BLAS and LAPACK, but other things are different. Solvers? Accelerate covers some. Finance/statistics/data-fitting computations? Nope. If you follow the "DSP" link on that Apple page, Accelerate has "virtual" DSP commands; if you keep scrolling down, there are some FFT routines.
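For that overlapping core, caller code looks the same whichever library is underneath. A minimal SciPy sketch (my illustration; it dispatches to whatever BLAS/LAPACK the SciPy build was linked against):

Code:
import numpy as np
from scipy.linalg import blas, lapack

a = np.asfortranarray(np.random.rand(500, 500))
b = np.asfortranarray(np.random.rand(500, 500))

c = blas.dgemm(alpha=1.0, a=a, b=b)   # BLAS level 3: C = alpha * A * B
lu, piv, info = lapack.dgetrf(a)      # LAPACK: LU factorization with pivoting
print(c.shape, info)                  # info == 0 means the factorization succeeded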

In the other direction, Accelerate is growing its scope of packaged, "black box" AI/ML API support; Apple's focus is drifting off in a different direction there. [Intel has yet another API (or APIs?) targeting AI/ML; oneAPI is part of that, and Intel's marketing has more or less wrapped the MKL into the oneAPI umbrella.] Apple also has more image-processing helper frameworks.

So Apple is doing a substantial percentage of the work. The issue, though, is that when you kick the folks who were doing all of the work to the curb, the onus of delivering the rest falls on Apple. Intel did a lot of "dot the i's and cross the t's" work that Apple got to coast on while on the Intel platform.


But in terms of appealing to the folks doing hard-core numerical simulation and modeling, not having ECC is one of those "show stopper" checklist items, and it matters more than library breadth: getting the wrong answer faster doesn't help.

There is a skew in Apple's Accelerate toward keeping audio, video, and gaming-engine folks happy. I doubt more mainstream, hard-core STEM work is a primary driver here (beyond basic, legacy checklist libraries like BLAS and LAPACK).
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Is this something Apple can tune the Neural Engine to do?

Generally not. The Neural Engine is mostly about trading precision for speed: instead of accurate 64-bit values, it drops down to 16-, 8-, or 4-bit values to crank up the "operations per second" by operating on smaller and smaller chunks of data.

Many neural-network heuristics are just trying to get to a "more right than wrong" answer, i.e., an "answer" in the right zone.

Numerical simulation/modeling, in contrast, requires more operations per second while still maintaining top-end accuracy. Models of the real world have usually taken shortcuts on accuracy already; throwing away more precision just gets you more inaccurate answers.

[Video and audio can often cut corners because of the quirks and limitations of eyes and ears with respect to small data inaccuracies. Drop a single bit or two every once in a long while ... nobody is going to notice.]
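Here's a quick NumPy illustration of how accuracy drains away as the data type shrinks (rounding on the CPU, just to show the effect; this is not Neural Engine code):

Code:
import numpy as np

x = np.random.default_rng(1).random(10_000)   # 10,000 values in [0, 1)
exact = x.sum(dtype=np.float64)
for dt in (np.float64, np.float32, np.float16):
    s = x.astype(dt).sum(dtype=dt)            # accumulate at reduced precision
    print(dt.__name__, float(s), f"rel. error {abs(float(s) - exact) / exact:.1e}")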
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
If you're doing heavy workloads like this, you probably shouldn't go for the entry-level chip that is the M1. Besides, using only four cores and comparing a portable machine with a desktop introduces a lot of bias. Test again with the M1X in real conditions (i.e., not just the benchmark) against an Intel laptop, and then we'll see.

EDIT: I assumed you were talking about a MacBook when you mentioned M1, but I realise this could also be a new Mini or iMac. Which one are you using? This is important because they wouldn’t have the same TDP.
I'm afraid you've completely misunderstood my post.

1) As I explained, the MMA (Mathematica) benchmark uses only 4 cores*. So no "bias" is introduced by using the MMA benchmark to compare the M1, which has 4 performance cores, to machines that have more than 4, since the additional cores on the latter go unused! This seems pretty obvious, so I feel the problem is that you didn't bother to give my post much thought before responding.

[*It's actually 4 cores max, since the benchmark employs a combination of single-threaded and multi-threaded workloads.]

EDIT: The benchmark is actually single-core, so the number of cores is not an issue.

More broadly, as a practical matter, most computations in MMA can't be parallelized, so getting computations done faster is typically all about single-core speed.

2) No "bias" is introduced by comparing the M1 to a desktop in this context. You've completely missed the point here. The point is that, for most workloads, the per-core performance of the M1 chip is as fast as or faster than that of the Intel i9 desktop, yet here it's much slower. That's a red flag.

Indeed, the M1's benchmark performance on a native build of MMA is so slow that it's about as fast as my mid-2014 MBP. So clearly something is not right, and I infer it has something to do with fast math libraries being available for the Intel chips but not for the M1.

3) The MMA benchmark runs fairly quickly, and the TDP of the M1 chip is very low, so the resulting performance difference between the M1 in the Air vs. the MBP vs. the Mini isn't significant. Both M1 MBP and M1 Air users have posted benchmark values of 3.1–3.2.
 
Last edited:
  • Like
Reactions: richmlow

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
Not sure if this is the same thing, but isn't the Accelerate Framework what you want? (since macOS 10.3)
I wasn't aware of that, and it does seem to be similar (https://poppy-optics.readthedocs.io/en/stable/performance.html), but then I'm still left wondering at the sub-par performance of the M1 with Mathematica (or at least with Mathematica's internal benchmark). Maybe there's some technical reason Wolfram couldn't link Mathematica to Accelerate when they did the native build, or maybe Accelerate simply isn't as comprehensive as MKL.
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
At least there's finally an open-source Fortran compiler available for AS (it seems it's still experimental, but stable); that wasn't the case six months ago. Because of its speed, Fortran is still used extensively in scientific computing.



Having an open-source Fortran compiler is pretty basic (think of how many R and SciPy users there are). If I were running Apple, which I'm not, a few years ago I would have assigned a team to identify all the essential, widely used open-source projects, and tasked Apple developers with helping get them ported to AS. That way they would all have been up and running soon after the first AS Macs hit the market.

This would have been good marketing. STEM users, including data scientists, are influencers in their own way.
 
Last edited:
  • Like
Reactions: cbum

CWallace

macrumors G5
Aug 17, 2007
12,528
11,543
Seattle, WA
Maybe, while "native", this first release of Mathematica for Apple Silicon is not yet fully optimized? Perhaps later builds will better leverage things like the Accelerate Framework or other features to improve performance.
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
Maybe, while "native", this first release of Mathematica for Apple Silicon is not yet fully-optimized? Perhaps later builds will better leverage things like the Accelerate Framework or other features to improve performance.
That's possible. But Wolfram (the maker of Mathematica) has always had a close connection with Apple. The second release of Mathematica (1.2, in August 1989) introduced a Mac front end, and Mathematica has been featured in keynotes at previous WWDCs.

So given this, and given the complexity of the software (the app is 11 GB installed, as compared with 2 GB for Word), I would be surprised if Apple didn't identify Wolfram as one of the developers it contacted early about creating a native build (i.e., prior to the release of the DTK), as it did for MS (Office) and Adobe (CS). And still, Wolfram took until now to release its first native build.

So shouldn't this already be well-optimized for what AS offers?

[Of course, given Mathematica's complexity, I suppose you could argue that Wolfram needs still more time.]
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,522
19,679
1) As I explained, the MMA (Mathematica) benchmark uses only 4 cores*. So no "bias" is introduced by using the MMA benchmark to compare the M1, which has 4 performance cores, to machines that have more than 4, since the additional cores on the latter go unused! This seems pretty obvious, so I feel the problem is that you didn't bother to give my post much thought before responding.

[*It's actually 4 cores max, since the benchmark employs a combination of single-threaded and multi-threaded workloads.]

Hm, are you sure about this? Wolfram’s documentation doesn’t seem to mention anything… I’ll have to investigate later

At least there's finally an open-source Fortran compiler available for AS (it seems it's still experimental, but stable); that wasn't the case six months ago. Because of its speed, Fortran is still used extensively in scientific computing.

An unofficially patched GCC has been available since November… the problem is that the GCC devs couldn't care less about Apple platforms, so they are pretty much ignoring this. And there is a Fortran project within LLVM, but it will be at least a year before it's ready to be used. I am eagerly looking forward to the time when I can purge my machine of GCC entirely.
 

theorist9

macrumors 68040
May 28, 2015
3,882
3,061
Hm, are you sure about this? Wolfram’s documentation doesn’t seem to mention anything… I’ll have to investigate later
When I run the benchmark, it looks like this (see screenshot below).

But I just looked at the code, and there is *no* parallelization in the benchmark (no function containing "Parallel" is used). [There must be some averaging in Activity Monitor such that, while it looks like multiple cores are being used simultaneously, only one is actually in use at a time.]

So while the workload gets distributed across multiple cores, only one core is in use at any given time, which means the benchmark is assessing single-core speed only. I'll add an edit to my original post.

This of course supports my point that there is something severely limiting Mathematica's native performance on the M1. Its single-core speed shouldn't be (4.5–3.15)/((4.5+3.15)/2)*100 = 35% slower (percent difference relative to the mean) than a 2019 3.6 GHz i9 desktop, nor should it be only (3.15–3)/((3+3.15)/2)*100 = 5% faster than the CPU on a mid-2014 MBP.
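(For anyone checking my arithmetic, those percentages are differences taken relative to the mean of the two benchmark scores:)

Code:
def pct_diff(a, b):
    # percent difference relative to the mean of the two values
    return abs(a - b) / ((a + b) / 2) * 100

print(f"{pct_diff(4.5, 3.15):.0f}%")   # M1 vs. 2019 i9 iMac: ~35%
print(f"{pct_diff(3.15, 3.0):.0f}%")   # M1 vs. mid-2014 MBP: ~5%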

[Attached screenshot: Activity Monitor CPU load while the benchmark runs]
 
Last edited:
  • Like
Reactions: richmlow