
leman

macrumors Core
Oct 14, 2008
19,516
19,664
Hi richmlow. I'd be happy to rerun them on my i9 iMac, but I'll need to contact those who did the comparison runs for me on their M1 Max's, and I'm not sure if they still have Mathematica running on their machines (at least one, and possibly both, downloaded the 7-day trial version to run the test). I'd offer to send my benchmark tests to you to run, but it doesn't appear you have an M1.

I would be happy to run the test on my M1 Max as I have access to Mathematica
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
I would be happy to run the test on my M1 Max as I have access to Mathematica
OK, thanks. When I have a chance I'll rerun the benchmarks on my iMac (needed to create a new baseline), and then PM you to arrange to send the files to you.
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Hello theorist9,

Thank you for running your suite of Mathematica benchmarks on Mathematica 13.0.1.

Would it be possible to rerun your benchmarks on the most recent version of Mathematica (v.13.2) and report back your findings? According to the Wolfram Research release notes (for v. 13.2), "extensive" optimizations were made to increase the efficiency of many Mathematica capabilities.


All the best,
richmlow
I finally had a chance to rerun my Mathematica (MMA) benchmarks on v. 13.2.

To start, I did a comparison of MMA 13.0.1 (on macOS 12.4) to MMA 13.2 (on macOS 12.6.2) on my 2019 i9 iMac (i9-9900; more details in signature line). The only hardware difference is that I had 32 GB RAM installed in the former case and 128 GB in the latter, but since this benchmark uses only a few GB of RAM, that should have no effect. The results show that MMA did indeed improve in performance somewhat, but the changes were very much task-dependent. E.g., on one integration there was an 87% improvement, while on others there was no improvement at all:

[Attached: MMA 13.0.1 vs. 13.2 timing tables]


But the question to which you wanted an answer is whether the performance gap between a 9th-generation Intel i9 and M1-generation Apple Silicon remains disappointingly small. [Based on their relative Geekbench 5 single-core scores, we'd expect to see an average ~35% improvement if Mathematica were as well-optimized for AS as it is for Intel. Instead, the improvement averaged about half that on my symbolic and graphing tasks, and the M1 was 50% slower on my image processing task.]

Here are two results on MMA 13.2: one from @casperes1996, who kindly agreed to run this a second time, and one from @leman, who kindly agreed to run it on his machine for comparison.

With MMA 13.2:
casperes1996's results

Machine details: 16" M1 Max MacBook Pro, 24-core GPU, 32 GB RAM, running macOS 13.2, on battery, undisturbed

[Attached: casperes1996's MMA 13.2 result tables]

leman's results
Machine details: 16" M1 Max MacBook Pro, 32 GB RAM, running macOS 13.2, plugged in on high-power mode.

[Attached: leman's MMA 13.2 result tables]

For comparison, here are the results with MMA 13.0.1:

[Attached: MMA 13.0.1 result tables]

From the above, you can see that, while MMA 13.2 performs better than MMA 13.0.1 on some tasks, the performance gap between a 9th-gen Intel Core i9 and an Apple M1 Max remains essentially unchanged.

casperes1996's on-battery results are essentially the same as leman's plugged-in, high-power-mode results, showing that power source and power mode have no effect on performance here.
 
Last edited:

richmlow

macrumors 6502
Jul 17, 2002
390
285
I finally had a chance to rerun my Mathematica (MMA) benchmarks on v. 13.2.

To start, I did a comparison of MMA 13.0.1 (on macOS 12.4) to MMA 13.2 (on macOS 12.6.2) on my 2019 i9 iMac (i9-9900; more details in signature line). The only hardware difference is that I had 32 GB RAM installed in the former case and 128 GB in the latter, but since this benchmark uses only a few GB of RAM, that should have no effect. The results show that MMA did indeed improve in performance somewhat, but the changes were very much task-dependent. E.g., on one integration there was an 87% improvement, while on others there was no improvement at all:

[Attached: MMA 13.0.1 vs. 13.2 timing tables]

But the question to which you wanted an answer is whether the performance gap between a 9th-generation Intel i9 and M1-generation Apple Silicon remains disappointingly small. [Based on their relative Geekbench 5 single-core scores, we'd expect to see an average ~35% improvement if Mathematica were as well-optimized for AS as it is for Intel. Instead, the improvement averaged about half that on my symbolic and graphing tasks, and the M1 was 50% slower on my image processing task.]

I thus contacted @casperes1996, who was one of the folks who ran the benchmark last time on his M1 Mac, to ask if he would run it again. He kindly agreed. [I've also messaged @leman to see if he would like to run it as well, since I thought having two independent results would be helpful.]

casperes1996's machine details: 16" M1 Max MacBook Pro, 24-core GPU, 32 GB RAM, running macOS 13.2
With MMA 13.2:
[Attached: casperes1996's MMA 13.2 result tables]

For comparison, here are the results with MMA 13.0.1:

[Attached: MMA 13.0.1 result tables]

From the above, you can see that, while MMA 13.2 performs better than MMA 13.0.1 on some tasks, the performance gap between a 9th-gen Intel Core i9 and an Apple M1 Max remains essentially unchanged.
Hello theorist9,


Thank you (and casperes1996 and leman) very much for running your benchmarks with Mathematica 13.2 (Intel Core i9 vs. Apple M1 Max)! This data will be very helpful as I contemplate my next computer upgrade.

Currently, my main workhorse is a 2013 Mac Pro (see my signature below). However, I will soon need to replace it for my computational needs. Like you, I use the Mac Pro for scientific endeavors; in particular, mathematics research (Ramsey theory), not video/audio work. So my main concerns are the CPU, multiple cores, RAM, and SSD storage.

Some quick general questions: Are you happy with your current Mac Pro setup for your computational needs? Is there anything that you would do differently? Any general recommendations/advice?


All the best,
richmlow
 

richmlow

macrumors 6502
Jul 17, 2002
390
285
I agree the paper is of poor quality and its results are questionable, but that is beside the point. The scientific approach would be to show data that refutes their results, and preferably to explain those results (e.g., as a bug).

Assuming something is wrong because it goes against the knowledge or politics of the time is a very dangerous path to take. A recent example is global warming, and a few hundred years ago you could literally lose your head for claiming the world is round. There are plenty of such examples in the literature.

Everybody knows that the Earth is flat! J/K 😂


richmlow
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Hello theorist9,


Thank you (and casperes1996 and leman) very much for running your benchmarks with Mathematica 13.2 (Intel Core i9 vs. Apple M1 Max)! This data will be very helpful as I contemplate my next computer upgrade.

Currently, my main workhorse is a 2013 Mac Pro (see my signature below). However, I will soon need to replace it for my computational needs. Like you, I use the Mac Pro for scientific endeavors; in particular, mathematics research (Ramsey theory), not video/audio work. So my main concerns are the CPU, multiple cores, RAM, and SSD storage.

Some quick general questions: Are you happy with your current Mac Pro setup for your computational needs? Is there anything that you would do differently? Any general recommendations/advice?


All the best,
richmlow
Hi richmlow. I don't know the extent to which these general comments will be applicable to your needs but, FWIW:

I don't use a Mac Pro, I use a 2019 i9 iMac. It works reasonably well for my needs. My only complaints are:

(1) I do experience delays and spinning beachballs, so I'd like higher SC speeds. Perhaps I'll consider upgrading when the M3 Studio comes out. It's a question of how much I'd have to pay relative to the performance improvement I'd get.

(2) While I like the quality of the iMac's 27" Retina monitor, I often create spreadsheets with my data that are too large to view comfortably on a 27" screen, so I'd love to have a larger Retina-class display, say 32"/6k, 36"/7k, or 43"/8k. Ideally, I'd like to have three of those! But those will all have to wait until I upgrade to a Studio, since my iMac is limited to 2 x 4k externals.

If you do get an Apple external display, I'll caution you that the nano-texture version's anti-reflective properties are very strong (great if you've got bad reflections), but at the cost of a noticeable (for me) loss in sharpness compared to the glossy, as well as a distractingly sparkly white background. Thus make sure you compare the two at an Apple retailer before making a purchase. I can control the reflections in my office, so I would opt for the glossy myself. The upcoming 32" Dell 6k is matte only, but I suspect its matte coating is less "strong" than Apple's nano-texture finish, providing both less of the benefit and less of the drawback. Also, if you do go for a three-display setup, you can sacrifice some pixel density on the side displays; i.e., the main thing is that your central monitor is "Retina".

Finally, starting 6 months ago I stopped using spinning disks for backups. I got a 4 TB Kingston XS2000 SSD for Time Machine, and a 2 TB Kingston XS2000 for Carbon Copy Cloner, and I'm very happy with them. [You just need to upgrade their cables; the ones included are cheap and cause disconnections. I recommend the Cable Matters certified cables.]
 
Last edited:

richmlow

macrumors 6502
Jul 17, 2002
390
285
Hi richmlow. I don't know the extent to which these general comments will be applicable to your needs but, FWIW:

I don't use a Mac Pro, I use a 2019 i9 iMac. It works reasonably well for my needs. My only complaints are:

(1) I do experience delays and spinning beachballs, so I'd like higher SC speeds. Perhaps I'll consider upgrading when the M3 Studio comes out. It's a question of how much I'd have to pay relative to the performance improvement I'd get.

(2) While I like the quality of the iMac's 27" Retina monitor, I often create spreadsheets with my data that are too large to view comfortably on a 27" screen, so I'd love to have a larger Retina-class display, say 32"/6k, 36"/7k, or 43"/8k. Ideally, I'd like to have three of those! But those will all have to wait until I upgrade to a Studio, since my iMac is limited to 2 x 4k externals.

If you do get an Apple external display, I'll caution you that the nano-texture version's anti-reflective properties are very strong (great if you've got bad reflections), but at the cost of a noticeable (for me) loss in sharpness compared to the glossy, as well as a distractingly sparkly white background. Thus make sure you compare the two at an Apple retailer before making a purchase. I can control the reflections in my office, so I would opt for the glossy myself. The upcoming 32" Dell 6k is matte only, but I suspect its matte coating is less "strong" than Apple's nano-texture finish, providing both less of the benefit and less of the drawback. Also, if you do go for a three-display setup, you can sacrifice some pixel density on the side displays; i.e., the main thing is that your central monitor is "Retina".

Finally, starting 6 months ago I stopped using spinning disks for backups. I got a 4 TB Kingston XS2000 SSD for Time Machine, and a 2 TB Kingston XS2000 for Carbon Copy Cloner, and I'm very happy with them.
Hi theorist9,


Thanks for the additional comments and pointers!

For now, I'll wait for Apple's supposed announcement of the Apple Silicon Mac Pro and then take it from there.

When I get my upgrade, I'll update this thread/post.


All the best,
richmlow
 
  • Like
Reactions: theorist9

richmlow

macrumors 6502
Jul 17, 2002
390
285
Hi all,


Now that Apple has announced updated ASi Mac minis and MacBook Pros, I still eagerly await Apple's announcement of the ASi Mac Pro.

While the recent M2 Pro/Max updated machines are nice, the RAM restriction is my concern.

My next Mac purchase will be mainly used for mathematics research. So, I need a reasonable number of cores and considerable RAM.

What does everybody else think about Apple's recent upgrades?


All the best,
richmlow
 
Last edited:

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I made this thread because I thought information regarding Apple Silicon Macs in science was lacking, and I thought this thread would be a helpful place for any future or current scientists to get info regarding Apple Silicon Macs.

Please post only things regarding the usage of Apple Silicon Macs in science. Post articles, github repositories, sites, etc. regarding this topic.

I'm gonna start with this:
Apple Silicon Performance in Scientific Computing

PDF version:
Thanks for this.
If you're looking for a followup project (hah!) something that remains very unclear is the performance of Apple Silicon for
- Virtual Machines
- "non-trivial" parallel code

For Virtual Machines the issues of interest are the same sorts of issues as on the x86 side
- how much slowdown? under what conditions?
- how much do TLB operations cost?
- are there weirdnesses in IO performance?
- how good is the VM performance isolation (ie can one VM slow down another?)

For parallel code interesting issues include
- cost of locks (of various types, including remote atomics)
- cost of synchronization (eg if a thread in P-cluster 1 has to synchronize or engage in a cache-coherency operation with either P-cluster 2 or E-cluster)
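As a toy illustration of the first parallel-code question above, a micro-benchmark of an uncontended lock might look like the sketch below. This measures Python-level (interpreter plus pthread) overhead only, not bare hardware atomics or cross-cluster cache-coherency costs, so treat the absolute number as illustrative.

```python
import threading
import time

# Sketch: cost of an uncontended lock acquire/release pair.
# This is interpreter + pthread overhead, not raw hardware atomics.
def lock_cost(n=200_000):
    lock = threading.Lock()
    t0 = time.perf_counter()
    for _ in range(n):
        lock.acquire()
        lock.release()
    return (time.perf_counter() - t0) / n * 1e9  # ns per pair

print(f"~{lock_cost():.0f} ns per uncontended acquire/release pair")
```

Measuring the contended and cross-cluster cases would require pinning threads to specific core clusters, which needs lower-level tooling than this.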

Also of interest: whether these have been materially improved in M2. (My gut feeling is that these are the sorts of things that worked on M1 but were sub-optimal because they were never stressed on iPhone/iPad, and that M2 was a chance for these technical Mac-specific issues to be improved, even though not much seems to have happened.)
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I created two Mathematica benchmarks, and sent them to two posters on another forum who have M1 Max's. These calculate the %difference in wall clock runtime between whatever they're run on and my 2019 i9 iMac (see config details below). They were run using Mathematica 13.0.1 (current version is 13.1).

The tables below show the results from one M1 Max user; the results for the other user didn't differ significantly.

It's been opined that, to the extent Mathematica doesn't do well on AS vs. Intel, it's because AS's math libraries aren't as well optimized as Intel's MKL. Thus my upper table consists entirely of symbolic tasks and, indeed, all of these are faster on AS than my i9 iMac. However, they are not that much faster. You can see the M1 Max averages only 2% to 18% faster for these suites of symbolic computations.

The lower table features a graphing suite, where AS was 21% faster. I also gave it an image-processing task, and it was 46% slower, possibly because it uses numeric computations.

Details:

Symbolic benchmark:
Consists of six suites of tests: three integration suites, a simplify suite, a solve suite, and a miscellaneous suite. There are a total of 58 calculations. On my iMac, this takes 37 min, so an average of ~40 s/calculation. It produces a summary table at the end, which shows the percentage difference in run time between my iMac and whatever device it's run on. Most of these calculations appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor). However, the last one (polynomial expansion) appears to be multi-core (CPU ~500%).

Graphing and image processing benchmark: Consists of five graphs (2D and 3D) and one set of image processing tasks (processing an image taken by JunoCam, which is the public-outreach wide-field visible-light camera on NASA's Juno Jupiter orbiter). It takes 2 min. on my 2019 i9 iMac. As with the above, it produces a summary table at the end. The graphing tasks appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor). However, the image processing task appears to be multi-core (CPU ~250-400%).

Here's how the percent differences in the summary tables are calculated (ASD = Apple Silicon Device, or whatever computer it's run on):

% difference = (ASD time/(average of ASD time and iMac time) – 1)*100.

Thus if the iMac takes 100 s and the ASD takes 50 s, the ASD would get a value of -33, meaning the ASD is 33% faster; if the ASD takes 200 s, it would get a value of 33, meaning it is 33% slower. By dividing by the average of the iMac and ASD times, we get the same absolute percentage difference regardless of which direction a two-fold difference goes. For instance, if we instead divided by the iMac time, we'd get 50% faster and 100% slower, respectively, for the above two examples.

I also provide a mean and standard deviation for the percentages from each suite of tests. I decided to average the percentages rather than the times so that all processes within a test suite are weighted equally, i.e., so that processes with long run times don't dominate.
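The summary-table arithmetic above can be written out explicitly. This is a sketch only; `pct_diff` is a hypothetical helper name (not anything from the actual benchmark notebook), and the suite times at the end are made-up values.

```python
from statistics import mean, stdev

# Symmetric percent difference: dividing by the mean of the two times
# gives the same magnitude whichever direction a two-fold gap goes.
def pct_diff(asd_time, imac_time):
    avg = (asd_time + imac_time) / 2
    return (asd_time / avg - 1) * 100

assert round(pct_diff(50, 100)) == -33   # ASD twice as fast -> 33% faster
assert round(pct_diff(200, 100)) == 33   # ASD twice as slow -> 33% slower

# Per-suite summary: average the percentages (not the raw times), so
# long-running tasks don't dominate. Times below are illustrative.
suite = [pct_diff(a, i) for a, i in [(50, 100), (80, 100), (120, 100)]]
suite_mean, suite_sd = mean(suite), stdev(suite)
```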

iMac details:
2019 27" iMac (19,1), i9-9900K (8 cores, Coffee Lake, 3.6 GHz/5.0 GHz), 32 GB DDR4-2666 RAM, Radeon Pro 580X (8 GB GDDR5)
Mathematica 13.0.1
MacOS Monterey 12.4

[Attached: benchmark summary table]
Thanks for that!
I think it's clear that MMA still has a lot of optimization ahead of it on AS. On the other hand, pretty much everyone at Wolfram uses a MacBook as their preferred device, so I expect this will happen as fast as humanly possible!

Obvious issues I saw when I looked at this closely (which was some time ago!) were
- bignum performance was terrible. If Wolfram are using GMP (which the legalities say they are) and they are using the LATEST bignum this should improve. The most recent GNU bignum on ARM is not perfect (there is at least one unnecessary instruction in the core assembly) but it's a whole lot better than it used to be. BUT it makes no use of an Apple private instruction that provides for 53b->106b integer multiplication (similar to an AVX512 instruction) which could boost it if GNU were to write code using that.
Apple provides some bignum support via Accelerate vBigNum which may well use that instruction -- but the API may not fit well into the rest of MMA :-(

- stuff that would be routed to vector-parallel functions on Intel (think of eg Tan[ large FP array ]) are usually not routed to such functions on AS. Once again this is basically an issue of the appropriate functions being written between Apple, Wolfram and probably LLVM and GNU for various pieces.

- even then, at least at first, those functions will probably route to NEON. Better would be if they routed to AMX (AMX is growing to be essentially an AVX512 equivalent) but the AMX side is very much a work in progress for Apple...

- similarly things like BLAS/LAPACK (optimized up the wazoo in MKL) are a lot more generic on the ARM side, and Wolfram seems not to be using Apple's AMX-optimized versions. I expect that in time Apple and Wolfram (and a few obvious other clients like Matlab) will figure out a way to get access to AMX that works for all parties while honoring Apple's concerns (primarily that AMX changes a lot in each new CPU); but neither side probably has discussing this as a high priority until more basic issues are sorted out by each side.

- similarly I saw that (not sure why!) MMA was a lot more reluctant to use multiple cores in cases where, on x86, it would use them behind the scenes (eg for large matrix manipulation). Maybe this reflects crazy-optimized MKL LAPACK vs barely-optimized ARMv8 open-source LAPACK?

- and of course there's all the more specialized stuff. Much of what MMA does could probably be routed to good effect to the GPU and even the NPU. (Including some non-obvious stuff, like requests to generate large arrays of random numbers). But that requires time...
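As a rough analogy to the routing issues in the list above (using NumPy as a stand-in, not MMA internals): an elementwise Tan over a large array is dramatically faster when it reaches a vectorized kernel than when each element goes through a scalar call, and `numpy.show_config()` reports which BLAS/LAPACK backend (Accelerate, OpenBLAS, MKL, ...) matrix operations are routed to.

```python
import math
import time

import numpy as np

x = np.random.rand(200_000)

t0 = time.perf_counter()
y_vec = np.tan(x)                              # dispatched to a vectorized kernel
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
y_scalar = np.array([math.tan(v) for v in x])  # one scalar call per element
t_scalar = time.perf_counter() - t0

assert np.allclose(y_vec, y_scalar)
print(f"vectorized: {t_vec:.4f}s, scalar loop: {t_scalar:.4f}s")

# Which BLAS/LAPACK does this NumPy build link against?
np.show_config()
```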

So that's background. Most of it seems not relevant to your symbolic tests, though did some of them perhaps hit bignums in a serious way? What may be likely is that no-one at Wolfram has yet even tried serious AS optimization of the core symbolic engine. For example, there are many ways to build a hash table, and Wolfram is likely using whatever parameters were optimized for Intel without even considering whether changing the hash parameters could help (eg to better match the cache sizes and page sizes of AS, or to exploit the fact that many more integer operations can be performed simultaneously, so more complex hashes become affordable). There may well be 15 or 20% improvement possible in things like that. Hell, the two specialized mechanisms added to the A15/M2 to speed up interpreters (most obviously JS, but also Objective-C dispatch) could be used by any interpreter, and my guess is that somewhere in the guts of MMA is an interpreter...

Generally I'd say that if Apple's first, lowest-end core can match Intel's highest end core, in spite of all the above, we're doing OK! Everything takes time (on Apple's HW side, on their SW side, on Wolfram's side, on the open source libraries side)!
 
  • Like
Reactions: richmlow

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Both 64-bit floats and 64-bit atomics are unsupported. The M2 may (unconfirmed) have one obscure non-returning UInt64 max atomic, just enough to run Nanite; not the full set required for e.g. cl_khr_int64_base_atomics. I've been investing a lot of time into predicting the performance of emulated 64-bit floats. It stands at 1:70 for FP64 and 1:40 for FP59. The reduced-precision version solves the same problem as FP64: FP32 is not good enough for highly precise scientific numbers. Many apps like OpenMM do most work in FP32 and sum important values in FP64. Such mixed-precision use cases are what the lack of hardware FP64 kills off.

Regarding 64-bit atomics emulation, I have a real-world P.O.C. where it's 5x slower than 32-bit. Not bad and aligns perfectly with theory. My intent was to accelerate INQ, a scientific software library for GPUs, on M1. It was entirely FP64 and had some atomic addition. Turns out, the CPU is much more powerful at FP64 than I realized! <400 GFLOPS vector (multi-core) and <700 GFLOPS matrix (single-core). Same as RTX 3090's FP64.
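A crude way to sanity-check that kind of CPU FP64 figure yourself is to time a dense matmul, which performs ~2n^3 floating-point operations. This sketch measures whatever BLAS your NumPy build links, so the number is a rough lower bound on the machine's FP64 throughput, not a definitive benchmark.

```python
import time

import numpy as np

# Dense n x n FP64 matmul performs ~2*n^3 floating-point operations,
# so timing one gives a rough throughput estimate for the linked BLAS.
n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b                      # FP64 dgemm under the hood
dt = time.perf_counter() - t0

gflops = 2 * n**3 / dt / 1e9
print(f"~{gflops:.0f} GFLOPS FP64 (dense matmul)")
```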


To port existing software libraries, we need more than just preprocessor macros. Many libraries, such as INQ, rely on single-source APIs. CUDA, HIP, and SYCL are examples, and improve developer productivity for compute. Dual-source APIs like OpenCL, Metal, and DirectX never caught on. I'm working on a Metal backend for hipSYCL, so the Apple GPU can actually be viable for scientific computing. Many scientific libraries use SYCL (single-source) in addition to HIP/CUDA, or are easily portable with Intel SYCLomatic. There is no such thing as Metal-omatic, which may be impossible because Metal is dual-source.

Another issue is something called "Unified Shared Memory" (USM) in CUDA/OpenCL 2.0/SYCL 2020. The idea is that the CPU and GPU can read from the same memory address; you'd think Apple would allow that. After all, it's perfect for the Unified Memory Architecture. No. The Apple GPU is designed for graphics and TensorFlow, not GPGPU.

USM is supported on Nvidia, AMD, and Intel's drivers for their discrete GPUs. It's accomplished by hardware-accelerated "address translation service" which maps CPU pointers to GPU pointers. Another hardware feature automatically migrates data between the CPU and GPU over PCIe. USM boosts developer productivity and makes massive C++ code bases easily portable from CPU to GPU. Luckily, that can also be emulated with good performance on M1.
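The offset-based emulation mentioned above can be sketched in a few lines. This is toy code for illustration only: the base addresses and region size are made-up values, not anything from an actual driver, and a real implementation would operate on device pointers rather than Python integers.

```python
# Toy sketch of USM emulation: the whole shared region is one big
# allocation, and CPU pointers map to GPU addresses by one integer add.
CPU_BASE = 0x1_0000_0000   # assumed CPU-side base of the shared region
GPU_BASE = 0x0002_0000     # assumed GPU-side base of the same region
REGION_SIZE = 1 << 30      # assumed 1 GiB shared allocation

def cpu_to_gpu(cpu_ptr):
    offset = cpu_ptr - CPU_BASE
    if not (0 <= offset < REGION_SIZE):
        raise ValueError("pointer outside the shared allocation")
    return GPU_BASE + offset

# A CPU pointer 4 KiB into the region lands 4 KiB into the GPU view:
assert cpu_to_gpu(CPU_BASE + 0x1000) == GPU_BASE + 0x1000
```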

So it seems that, to summarize:
- the CPU matches your requirements in some cases (in which case complaining about the GPU seems silly)

- did you try routing your code to AMX to get substantially *better* CPU performance?

- your complaints seem more about the state of GPU libraries. Which is, on the one hand, a legit complaint if you just want code to work. But is not a legit complaint as to technical direction, especially if the CPU is just as good.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I did consider that approach, but it's inexact and not IEEE-compliant. I tried to GPU-accelerate a physics simulation once, but the double-single approach had trouble splitting a single. Perfect emulation requires manipulating the mantissa as integers. FP59 is faster because the integer mantissa can be 48-bit aligned.
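For readers unfamiliar with the "double-single" technique mentioned above, here is a sketch of the two classic error-free transformations it builds on (Knuth's two-sum and Dekker's split), shown with binary64 floats for clarity; a GPU double-single implementation would do the analogous operations on pairs of binary32 values.

```python
def two_sum(a, b):
    # Knuth's error-free addition: a + b == s + e exactly.
    s = a + b
    bv = s - a
    av = s - bv
    e = (a - av) + (b - bv)
    return s, e

def split(a):
    # Dekker's split of a binary64 float into high/low halves using the
    # constant 2**27 + 1. This is the step that is troublesome to
    # replicate in single precision on the GPU.
    c = 134217729.0 * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

# 1e16 + 1.0 rounds to 1e16 in binary64; two_sum recovers the lost 1.0:
s, e = two_sum(1e16, 1.0)
assert (s, e) == (1e16, 1.0)
```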


I have a working P.O.C. that uses this approach. It allocates a giant chunk of VM and maps it to GPU. You translate addresses simply through integer addition. With compiler tricks, you can even pre-convert some pointers before sending them to GPU.


I've been exploring a lot of design approaches for SYCL, involving LLVM IR, SPIR-V dialects, AIR and MSL. The eventual approach was to structurize the LLVM IR, then -> Vulkan SPIR-V -> MSL using SPIRV-Cross. I'll have to modify SPIRV-Cross to add more compute instructions.


None of the graphics APIs have Unified Shared Memory.
You seem willing to jump through crazy hoops on the CUDA/SPIR-V side, but not to be willing to make any effort on the Metal or Apple AMX side.
Obviously it's your life and your time, but it seems strange to complain "the Apple code I put one hour of effort into doesn't work as well as the CUDA code I put two years of effort into! What gives???"
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Hi all,


Now that Apple has announced updated ASi Mac minis and MacBook Pros, I still eagerly await Apple's announcement of the ASi Mac Pro.

While the recent M2 Pro/Max updated machines are nice, the RAM restriction is my concern.

My next Mac purchase will be mainly used for mathematics research. So, I need a reasonable number of cores and considerable RAM.

What does everybody else think about Apple's recent upgrades?


All the best,
richmlow

I'd say if you can wait do so.
My analysis (FWIW!) is that A14/M1 were designed as planned and on track.
A15 was designed as planned, as an energy optimized version of A14, giving iPhone what it needed more than performance. Perhaps M2 (and just M2) was planned as well.
Meanwhile the plan was for A16 and "desktop" M2 to be on N3, A16 in Sept 2022, M2 Pro and Max coming out now. These (A16', M2' designs) would make full use of N3's increased transistor budget to implement IPC improvements (and probably various ISA tweaks like the newest ARM instructions [except SVE; that's a different discussion...])

BUT
covid comes along and throws the plans into disarray. Both Apple and TSMC have to run slower than planned (eg TSMC can't just fly people between Taiwan, the US, Japan, and Europe every week to solve problems! You do what you can with zoom, but it's slower).
So at some point Apple realizes the A16' and M2' designs (ie the N3 based designs) won't ship in time. Mad scramble! OK, we "recompile" the A15 to N4 and take what we can in terms of process improvements. And we accept that the M2 Pro/Max will be a disappointing A15 N5 design, not the hoped for kick-ass N3 designs. Oh well.
To add insult to injury, the M2 Pro/Max were probably targeted for about three months ago, but then we got unrest in China, so the products had to be delayed until that was sorted out!

I don't think this is special pleading. Look at the schedules: Compare M1 track vs M2 track.
Sept 2020 A14
Nov 2020 M1
Oct 2021 M1 Pro Max
March 2022 M1 Ultra

Sept 2021 A15
June 2022 M2
Jan 2023 M2 Pro Max

Sept 2022 A16 (on N4)

To me this looks like a schedule that's constantly trying to delay things to see which of two options (plan A or a fallback plan B) are feasible, not a long-term strategy playing out as designed.

Which is all a long way of saying that since A14/M1 Apple has basically been forced into two generations of what looks like stagnation (A15 designed for low energy, which is less immediately interesting on Mac, though ultimately valuable; A16 as a response to covid-delayed N3.) But once we get back on track with N3 and the CPU+SoC designed for it, I think we will all be pleasantly surprised...

But it remains a huge question when that will be -- how long can you wait?
One option is that Apple announces "M3" (ie the N3 design) for the Mac Pro and related devices (Studio, iMac) at WWDC. This has the advantage that it looks good, and in a sense those are the devices that "want" the best silicon right now, not iPhone.
But iPhone is a nice place to test-drive silicon features that are only relevant to the very high end because, if they are found to be problematic, you can set a chicken bit to not use them and try again with the next design (M#), hopefully getting it right by Pro and Max. And iPhone is still king of the cash heap, so it gets max attention.
On the THIRD hand, the N3 delay has meant that Apple could start testing the A17 (the one that should have been the A16) almost a year earlier than usual (in tiny volumes on N3 risk production)... So this round of pretesting on iPhone chip before the M and M Pro/Max could already have happened...

I personally am waiting for WWDC before buying. Sure, the M2 minis look nice, but I suspect that for what I care about (mostly MMA) the N3 designs will be substantially nicer – and it's at least plausible we will see them at WWDC.
 
  • Like
Reactions: richmlow

richmlow

macrumors 6502
Jul 17, 2002
390
285
I'd say if you can wait do so.
My analysis (FWIW!) is that A14/M1 were designed as planned and on track.
A15 was designed as planned, as an energy optimized version of A14, giving iPhone what it needed more than performance. Perhaps M2 (and just M2) was planned as well.
Meanwhile the plan was for A16 and "desktop" M2 to be on N3, A16 in Sept 2022, M2 Pro and Max coming out now. These (A16', M2' designs) would make full use of N3's increased transistor budget to implement IPC improvements (and probably various ISA tweaks like the newest ARM instructions [except SVE; that's a different discussion...])

BUT
covid comes along and throws the plans into disarray. Both Apple and TSMC have to run slower than planned (eg TSMC can't just fly people between Taiwan, the US, Japan, and Europe every week to solve problems! You do what you can with zoom, but it's slower).
So at some point Apple realizes the A16' and M2' designs (ie the N3 based designs) won't ship in time. Mad scramble! OK, we "recompile" the A15 to N4 and take what we can in terms of process improvements. And we accept that the M2 Pro/Max will be a disappointing A15 N5 design, not the hoped for kick-ass N3 designs. Oh well.
To add insult to injury, the M2 Pro/Max were probably targeted for about three months ago, but then we got unrest in China, so the products had to be delayed until that was sorted out!

I don't think this is special pleading. Look at the schedules: Compare M1 track vs M2 track.
Sept 2020 A14
Nov 2020 M1
Oct 2021 M1 Pro Max
March 2022 M1 Ultra

Sept 2021 A15
June 2022 M2
Jan 2023 M2 Pro Max

Sept 2022 A16 (on N4)

To me this looks like a schedule that's constantly trying to delay things to see which of two options (plan A or a fallback plan B) are feasible, not a long-term strategy playing out as designed.

Which is all a long way of saying that since A14/M1 Apple has basically been forced into two generations of what looks like stagnation (A15 designed for low energy, which is less immediately interesting on Mac, though ultimately valuable; A16 as a response to covid-delayed N3.) But once we get back on track with N3 and the CPU+SoC designed for it, I think we will all be pleasantly surprised...

But it remains a huge question when that will be -- how long can you wait?
One option is that Apple announces "M3" (ie the N3 design) for the Mac Pro and related devices (Studio, iMac) at WWDC. This has the advantage that it looks good, and in a sense those are the devices that "want" the best silicon right now, not iPhone.
But iPhone is a nice place to test-drive silicon features that are only relevant to the very high end because, if they are found to be problematic, you can set a chicken bit to not use them and try again with the next design (M#), hopefully getting it right by Pro and Max. And iPhone is still king of the cash heap, so it gets max attention.
On the THIRD hand, the N3 delay has meant that Apple could start testing the A17 (the one that should have been the A16) almost a year earlier than usual (in tiny volumes on N3 risk production)... So this round of pretesting on iPhone chip before the M and M Pro/Max could already have happened...

I personally am waiting for WWDC before buying. Sure the M2 mini's look nice, but I suspect for what I care about (mostly MMA) the N3 designs will be substantially nicer – and it's at least plausible we will see them at WWDC.
Hello name99,


As I look over the specifications of the current Mac Minis and Mac Studios, I too am leaning towards waiting for Apple's announcement of the ASi Mac Pro. As tempting as the current computers are, the RAM limitation is really holding me back.

The main purpose of this next purchase will be for math research. I envision mainly using Mathematica, Python, Java and various SAT-solvers on it. For example, I was running a custom-designed Mathematica program/calculation on my 2013 Mac Pro (3.7GHz Quad-core Intel Xeon E5, 64GB RAM, macOS Mojave 10.14.6). It took 4 hours and consumed 64GB (towards the end of the run).

I would like my next desktop Mac system to have a minimum of 192GB RAM and 16 cores.


All the best,
richmlow
 

richmlow

macrumors 6502
Jul 17, 2002
390
285
Hello all,


Slightly off-topic.....

I like this thread. It's nice to see that there are other workflows (e.g. scientific research) which exist and use Apple computers! Typically in these forums (and even in Apple marketing), everybody always seems to be blathering about video, as if there is nothing more to the matter!

It's unfortunate that Apple doesn't highlight the use of Apple hardware in the STEM disciplines. Years ago, that was not the case. At the time, Apple had xGrid, parallel computing, etc. They even had a sub-section "Apple and Science" on their website!


All the best,
richmlow
 
  • Like
Reactions: Andropov

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
It's unfortunate that Apple doesn't highlight the use of Apple hardware in the STEM disciplines. Years ago, that was not the case. At the time, Apple had xGrid, parallel computing, etc. They even had a sub-section "Apple and Science" on their website!
Most likely, IMHO, because revenues from STEM are not consistent and/or attractive enough for Apple.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
You seem willing to jump through crazy hoops on the CUDA/SPIR-V side, but not willing to make any effort on the Metal or Apple AMX side.
Obviously it's your life and your time, but it seems strange to complain "the Apple code I put one hour of effort into doesn't work as well as the CUDA code I put two years of effort into! What gives???"

I think you misunderstand what Philip is doing. He tested both AMX and the GPU. His work on the GPU side is both to evaluate its feasibility for double-precision calculation as well as explore possibilities for porting CUDA code.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
similarly things like BLAS/LAPACK (optimized up the wazoo in MKL) are a lot more generic on the ARM side, and Wolfram seems not to be using Apple's AMX-optimized versions. I expect that in time Apple and Wolfram (and a few obvious other clients like Matlab) will figure out a way to get access to AMX that works for all parties while honoring Apple's concerns (primarily that AMX changes a lot in each new CPU); but discussing this probably isn't a high priority for either side until more basic issues are sorted out by each.
Can third parties use Apple's AMX? Is it still undocumented?
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Can third parties use Apple's AMX? Is it still undocumented?
I recently benchmarked Accelerate vs. OpenBLAS. For most operations, Accelerate is faster because it uses AMX. But for complex eigendecomposition (needed for quantum chemistry), OpenBLAS is 2x faster. The FLOP count is also so low that you wouldn't need AMX. At least for the non-complex eigendecomposition, Apple gets >50 FP64 GFLOPS while single-core. That suggests they're using AMX but not parallelizing. OpenBLAS gets slightly better performance with multithreading.

For complex eigendecomposition, you have to un-interleave the data before feeding it to the AMX. This might not prevent reaching maximum FLOPS for ZGEMM, but it requires more threads. Accelerate uses 4 threads for DGEMM (I haven't checked ZGEMM), but doesn't multithread its AMX use in ZHEEVD. Again, this may not be problematic. Apple gets 34 GFLOPS using one core; OpenBLAS gets 68 GFLOPS using 2-4 cores. DFT simulations can perform multiple eigendecompositions in parallel, so you might call into Accelerate ZGEMM from 8 threads. OpenBLAS might do 2-4 eigendecompositions in parallel and therefore be slower overall. I haven't benchmarked that yet.
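The batch-parallel pattern described above (many independent eigendecompositions dispatched at once) can be sketched in plain Python. The closed-form 2x2 symmetric eigensolver below is only a hypothetical stand-in so the snippet is self-contained; a real run would call DSYEVD/ZHEEVD through Accelerate or OpenBLAS instead:

```python
from concurrent.futures import ThreadPoolExecutor
import math

def sym2x2_eigvals(m):
    """Closed-form eigenvalues of a 2x2 symmetric matrix ((a, b), (b, c)).

    Stand-in for a real LAPACK DSYEVD/ZHEEVD call; with a GIL-releasing
    BLAS/LAPACK backend each worker thread would genuinely run in parallel.
    """
    (a, b), (_, c) = m
    mean = (a + c) / 2.0
    r = math.hypot((a - c) / 2.0, b)
    return (mean - r, mean + r)

def batch_eig(matrices, workers=4):
    # One independent eigendecomposition per task, mirroring a DFT code
    # that solves several small eigenproblems concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sym2x2_eigvals, matrices))

mats = [((2.0, 1.0), (1.0, 2.0)), ((3.0, 0.0), (0.0, 5.0))]
print(batch_eig(mats))  # [(1.0, 3.0), (3.0, 5.0)]
```

Whether this beats one heavily multithreaded solve is exactly the Accelerate-vs-OpenBLAS trade-off discussed above, and it has to be measured per library.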

[Data Attached]

For SGEMM and DGEMM, calling into Accelerate from multiple threads wrecked performance. If it's the same with eigendecomposition, I could get better performance using the GPU. If it isn't (meaning the CPU achieves ~300 GFLOPS), even emulated double-single won't provide much speedup (<550 GFLOPS). Mind you, quantum chemistry does 100% of its calculation in FP64; MD is a different story.

AFAIK, the only way to use it is to run on macOS via an Xcode-compiled binary calling macOS APIs.
I've seen mentions on MATLAB forums that you can direct it to a different BLAS library. In that case, you could swap OpenBLAS for Accelerate.
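For what it's worth, MathWorks documents BLAS_VERSION and LAPACK_VERSION environment variables for pointing MATLAB at an alternative BLAS/LAPACK at startup. A rough sketch of driving that from Python; the Accelerate framework path is the usual macOS location but should be verified locally, and the matlab launch only runs if the binary is on PATH:

```python
import os
import shutil
import subprocess

# Usual on-disk location of the Accelerate binary on macOS; verify on
# your machine before relying on it.
ACCELERATE = ("/System/Library/Frameworks/Accelerate.framework"
              "/Versions/A/Accelerate")

# MATLAB reads these at startup to load a replacement BLAS/LAPACK.
env = dict(os.environ, BLAS_VERSION=ACCELERATE, LAPACK_VERSION=ACCELERATE)

if shutil.which("matlab"):
    # `version -blas` prints which BLAS MATLAB actually loaded.
    subprocess.run(["matlab", "-nodisplay", "-batch", "version -blas"],
                   env=env, check=True)
```

Inside MATLAB, `version -blas` confirms whether the swap took effect.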
 

Attachments

  • BLAS Performance.pdf
    82 KB · Views: 136
  • Matrix Multiplication Performance.pdf
    32.7 KB · Views: 151
  • Like
Reactions: name99 and Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I've seen mentions on MATLAB forums that you can direct it to a different BLAS library. In that case, you can change OpenBLAS to Accelerate.
I'm not sure if it's the same thing, but it looks like you can change the BLAS library without recompiling anything.

You may be interested in this post.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Apple gets 34 GFLOPS using one core, OpenBLAS gets 68 GFLOPS using 2-4 cores.
My FLOPS formula was off by a factor of 2. Apple got up to 156 GFLOPS on a single core for DSYEVD, within the ~200 GFLOPS limit if one core utilizes 2 AMX blocks. OpenBLAS reached 169 GFLOPS, consistent with using 4 cores' worth of SIMD units. Using all 8 cores for a single problem harms performance because of cache-coherency overhead.
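To make the factor-of-2 issue concrete, here's a minimal sketch of the GFLOPS arithmetic. The ~9*n^3 flop count for a full symmetric eigendecomposition (values plus vectors) is a classical back-of-envelope constant, not the exact figure for any particular DSYEVD implementation, so treat it as an assumption:

```python
def eig_gflops(n, seconds, flops_per_n3=9.0):
    """Estimate GFLOPS for a dense symmetric eigendecomposition.

    flops_per_n3 ~ 9 is a classical rough count for values + vectors
    (tridiagonal reduction plus eigenvector work); the true constant
    depends on the algorithm, so it is an assumption, not a spec.
    """
    return flops_per_n3 * n**3 / seconds / 1e9

# A factor-of-2 slip in the assumed flop count scales the reported
# rate by exactly 2x, which is the kind of correction made above:
right = eig_gflops(4096, 3.0)
halved = eig_gflops(4096, 3.0, flops_per_n3=4.5)
print(right / halved)  # 2.0
```

Since the reported rate is just assumed-flops divided by wall time, any error in the flop constant passes straight through to the GFLOPS number.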

I think ZHEEVD was optimized for energy efficiency. One core could only utilize one AMX block because swizzling interleaved complex data takes several instructions. Accelerate achieved 88 GFLOPS/core, while OpenBLAS got 41 GFLOPS/core. Accelerate may also be more scalable when running multiple ZHEEVDs in parallel, compared with OpenBLAS optimizing single-ZHEEVD performance at all costs. I'll make a better table to visualize this:

| Processor | FP64 GFLOPS |
| P-Core    | 51.6        |
| AMX Block | 103.2       |

|                            | Accelerate | OpenBLAS  |
| GFLOPS (1 DSYEVD)          | 156        | 169       |
| P-Cores Used               | 1          | 4         |
| Clock Frequency            | 3.228 GHz  | 3.036 GHz |
| AMX Blocks Used            | 2          | 0         |
| ALU Utilization            | 76%        | 87%       |
| Max Concurrent DSYEVDs     | 4          | 2         |
| Clock w/ Concurrent DSYEVD | 3.132 GHz  | 3.036 GHz |
| Max Theoretical GFLOPS     | 605        | 338       |

| Hypothesis 1               | Accelerate | OpenBLAS  |
| GFLOPS (1 ZHEEVD)          | 88         | 162       |
| P-Cores Used               | 1          | 4         |
| Clock Frequency            | 3.228 GHz  | 3.036 GHz |
| AMX Blocks Used            | 1          | 0         |
| ALU Utilization            | 85%        | 83%       |
| Max Concurrent ZHEEVDs     | 8          | 2         |
| Clock w/ Concurrent ZHEEVD | 3.036 GHz  | 3.036 GHz |
| Max Theoretical GFLOPS     | 662        | 324       |

Perhaps with an over-balanced P-core:AMX ratio (1:1), not every instruction has to be spent dispatching work to AMX. There might be room to put additional calculations onto the NEON units, like OpenBLAS does. I really need to make a benchmark and test this hypothesis.

| Hypothesis 2               | Accelerate | OpenBLAS  |
| GFLOPS (1 ZHEEVD)          | 88         | 162       |
| P-Cores Used               | 1          | 4         |
| Clock Frequency            | 3.228 GHz  | 3.036 GHz |
| AMX Blocks Used            | 2          | 0         |
| ALU Utilization            | 43%        | 83%       |
| Max Concurrent ZHEEVDs     | 4          | 1         |
| Clock w/ Concurrent ZHEEVD | 3.132 GHz  | 3.036 GHz |
| Max Theoretical GFLOPS     | 342        | 324       |

Another note: Accelerate's GFLOPS on ZHEEVD outperformed DSYEVD for small matrices (~100) by 3x. The crossover happened at n=256, where ZHEEVD stopped improving and DSYEVD skyrocketed. Also, for such small matrices, Accelerate and OpenBLAS perform the same, likely because they're single-core and Accelerate can't utilize AMX yet. Between n=128 and n=192, GFLOPS exceeds 50; that means Accelerate starts using AMX, and OpenBLAS starts using multiple cores.
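A size sweep like the following is one way to locate such crossovers: time the same operation across matrix sizes, convert to GFLOPS, and look for a jump in the rate. The pure-Python matmul is only a hypothetical stand-in workload so the snippet is self-contained; a real run would time Accelerate or OpenBLAS calls instead:

```python
import timeit

def gflops_sweep(workload, flops_of, sizes, repeats=3):
    """Time workload(n) for each size n and convert to GFLOPS.

    A jump in the rate between adjacent sizes suggests the library
    switched strategy (e.g. NEON-only -> AMX, or one core -> many).
    """
    rates = {}
    for n in sizes:
        best = min(timeit.repeat(lambda: workload(n), number=1,
                                 repeat=repeats))
        rates[n] = flops_of(n) / best / 1e9
    return rates

# Stand-in workload: naive n x n matmul, roughly 2*n^3 flops.
def naive_matmul(n):
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

rates = gflops_sweep(naive_matmul, lambda n: 2 * n**3, [16, 32, 64])
```

Taking the minimum over several repeats filters out scheduling noise, which matters when the effect you're hunting for is a step change rather than a trend.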
 
  • Like
Reactions: Andropov and name99