Apple Silicon in Sciences

leman · May 16, 2023

dgdosen said:
And if you can ELI5 - what is the difference between an ANE core and GPU core? It looks like they're both optimized for parallel processing, is one better than the other at vector/matrix math? Is there overlap?

That's a good question that is very difficult to answer properly (and it doesn't help that we don't really know many details much how the ANE actually works). One major difference is that the GPU is expected to wear many hats (from UI compositing to 3d graphics to flexible compute) while the ANE is only expected to do one operation, and do it well. Just being good at "parallel processing" is not enough, since parallel processing can mean different things.

Historically, GPUs have been designed to run the same program on multiple different data items at the same, while maintaining the illusion that the items are still independent (e.g. you write the program that computes a + b but the hardware actually does a[0] + b[0], a[1] + b[1] ... a[n] + b[n] simultaneously). This is called SIMT (single instruction, multiple threads), which allows for a neat programming model and works very well for typical graphics and compute tasks. But when you want to do matrix multiplication (or more specifically, convolution), the items are not really independent as you have to shuffle and accumulate results of data-parallel multiplication. Doing this efficiently requires you to break the SIMT model somewhat (that's why Nvidia calls this "cooperative matrix multiplication" — because the GPU "threads" are cooperating to perform a single matrix operation as opposed to the usual "every thread for itself"). And while GPUs have been recently gaining cooperative capabilities (moving data between threads in the same SIMD etc.), there are always tradeoffs. A dedicated matrix multiplication/convolition hardware will probably always be more efficient, simply because it doesn't care about the flexibility and execution capabilities that a general GPU must provide. And it appears that ANE is exactly this kind of hardware (from what I understand, Nvidia uses a sort of hybrid solution with their Tensor cores, combining the general-purpose GPU processing units with some matmul-optimized hardware, but I have no idea how they do it exactly).

Philip Turner · May 16, 2023

There is a way to get around the command encoding bottleneck. Unlike CUDA, a Metal command queue is not actually a queue. It does not bunch up commands and push a large list to the device when ready. Rather, each command buffer is supposed to be a large list. You have that responsibility for better or for worse. PyTorch treats Metal's dispatching like CUDA. However, you can manually bunch the commands, or create an automatic bunching mechanism (e.g. metal-experiment-1). That should improve the sequential throughput from 10-20 us to 2 us.

The fundamental limitation is CPU-GPU synchronization. Not all operations run faster on GPU than CPU; for example eigendecomposition was slower in CUDA than MKL. With Nvidia, people just say, do the expensive parts on the GPU, switch to CPU for the trivial parts that aren't faster, then move back to the GPU. Apple has 10-20x higher synchronization latency, which makes that scheme impossible for a larger class of problems than on NV. The solution is to make everything run on GPU, even if some linear algebra parts run slightly slower.

This is also one motivation behind metal-float64. GPU eFP64 isn't much faster than CPU, maybe even slower, but it removes a fatal pipeline dependency on the CPU's FP64 ALUs. I have implemented and proven the double-single approach in practice, and theoretically it could accelerate the mixed-precision part of dynamic precision for DFT. The only issue: I have to write a reasonably performant eFP64 Cholesky, eFP64 eigendecomposition, and FP32 eigendecomposition on the GPU. There's also a software barrier to integrating Metal into and maintaining a fork of INQ, because Metal is not single-source. It's not fundamentally impossible, it's just very hard.

Philip Turner · May 23, 2023

I just achieved 80.1% ALU utilization in a custom FP16 GEMM kernel. Haven't ruled out whether that's measurement imprecision, but I think SIMD MATMUL<8x8xF16> might take less than 20 cycles. MPS(Graph) doesn't expose this, in fact FP16 is slower than FP32 and not supported by the TensorFlow plugin.

dgdosen · May 23, 2023

Has anybody used a local LLM on a beefier Mac? like Meta's LLaMA on a Mac Studio? If so, can you give a qualitative view on how well it works? (and what your config is)

Thanks!

Philip Turner · May 23, 2023

dgdosen said:
Has anybody used a local LLM on a beefier Mac? like Meta's LLaMA on a Mac Studio? If so, can you give a qualitative view on how well it works? (and what your config is)

Thanks!

Please report what speed you get. I’m trying to aggregate benchmarks about LLaMa and SD, so I can outperform all of them by a substantial margin.

dgdosen · May 25, 2023

MS and Nvidia are crushing it. AI is interesting, indeed, Tim Cook.

I wonder how long before Nvidia CPUs kick Intel to the curb... or when Nvidia starts to muscle out Apple for the latest TSMC nodes.

Philip Turner · May 29, 2023

Guess I'm a leaker now too.

Leak the SIMD futures API · philipturner/metal-benchmarks@6ff81e6

Apple GPU microarchitecture. Contribute to philipturner/metal-benchmarks development by creating an account on GitHub.

github.com

Now I understand why my code wasn't quite up to par with MPS.

Philip Turner · May 29, 2023

Looking forward to the AR headset at WWDC. Apple designed their GPU with AR as the primary use case, because that's extremely power constrained (as you'll see). Nvidia GPUs wouldn't last 15 minutes. However, I think when we achieve atomic precision, we can make mechanical computers 10000 times more power efficient. Imagine an RTX 4090 inside your contact lenses.

name99 · Jan 11, 2024

Has anyone benchmarked Mathematica 14 on an M-series? In particular are the obvious concerns from 13 improved? These included
- problematic bignum support. (Works, of course, but frequently much slower than it should be)
- many transcendental functions not yet SIMD'ized (so eg creating a long array of Sin[] is slower than Intel)
- no use of Accelerate (and specifically AMX) for the appropriate tasks
- not as much use of multiple cores for BLAS-type operations as should be the case (again, would be fixed if Acceleraet BLAS were used)

Certainly various performance issues were improved from 12.x (the first Apple Silicon version) to 13; and I'd hope this has continued with 14, but would be nice to see some numbers.

theorist9 · Jan 11, 2024

name99 said:
Has anyone benchmarked Mathematica 14 on an M-series? In particular are the obvious concerns from 13 improved? These included
- problematic bignum support. (Works, of course, but frequently much slower than it should be)
- many transcendental functions not yet SIMD'ized (so eg creating a long array of Sin[] is slower than Intel)
- no use of Accelerate (and specifically AMX) for the appropriate tasks
- not as much use of multiple cores for BLAS-type operations as should be the case (again, would be fixed if Acceleraet BLAS were used)

Certainly various performance issues were improved from 12.x (the first Apple Silicon version) to 13; and I'd hope this has continued with 14, but would be nice to see some numbers.

I have both 14.0.0 and 13.3.1 (the final iteration of v. 13) installed on both a 2019 i9 iMac and an M1 Pro MBP, and thus could run all four combinations for comparison.

I'm not sure what specific issues you're referring to, so you'll need to post the code you want me to run.

For instance, you said "creating a long array of Sin[] is slower than Intel". I wasn't sure if you meant calculating many Sin[] values, but I tried timing the calculation of 10^8 sine values using:

t1 = AbsoluteTime[];
Array[Sin[#] &, 10^8, {1.1, 10000.1}];
AbsoluteTime[] - t1

Running each test on a fresh kernel, this took 12.5 s on my i9 iMac using both MMA 13 and 14, and 5.0 s on my M1 Pro using both MMA 13 and 14. So I wasn't seeing any difference due to version, and the AS device was 2.5 x faster, indicating no performance deficit for AS. In fact, based on GB6 SC results, the M1 Pro should be only 1.44 x faster, so the AS device is doing better than expected.

I then repeated the test with a purely symbolic calculation*, and got qualitatively similar results: 5.7 s on my i9 iMac in both versions of MMA, and 3.4 s on my M1 Pro in both version of MMA. Thus here the AS device was 1.7 x faster:

t1 = AbsoluteTime[];
Array[Sin[3^#] &, 10^6, {1, 10}];
AbsoluteTime[] - t1

I have enountered functions where the AS device does suffer significantly (e.g., in MMA 13, Blur[] is 14x (!) faster on my i9 iMac than on my M1 MBP; I haven't tested v 14 yet), but (at least based on the above), Sin[] does not appear to be one of them.

I'm curious--what's your reference for the issues with Sin[]?

[*In case anyone's curious, the reason I reduced the number of calculations by 100-fold (from 10^8 to 10^6) when I switched to the symbolic test is that symbolic calcs are much more time-consuming than numerical calcs. A useful oversimplified explanation for this is that numerical calcs. can be done directly "on hardware" using math coprocessors, while symbolic calcs need to be done in software.]

name99 · Jan 12, 2024

theorist9 said:
I have both 14.0.0 and 13.3.1 (the final iteration of v. 13) installed on both a 2019 i9 iMac and an M1 Pro MBP, and thus could run all four combinations for comparison.

I'm not sure what specific issues you're referring to, so you'll need to post the code you want me to run.

For instance, you said "creating a long array of Sin[] is slower than Intel". I wasn't sure if you meant calculating many Sin[] values, but I tried timing the calculation of 10^8 sine values using:

t1 = AbsoluteTime[];
Array[Sin[#] &, 10^8, {1.1, 10000.1}];
AbsoluteTime[] - t1

Running each test on a fresh kernel, this took 12.5 s on my i9 iMac using both MMA 13 and 14, and 5.0 s on my M1 Pro using both MMA 13 and 14. So I wasn't seeing any difference due to version, and the AS device was 2.5 x faster, indicating no performance deficit for AS. In fact, based on GB6 SC results, the M1 Pro should be only 1.44 x faster, so the AS device is doing better than expected.

I then repeated the test with a purely symbolic calculation*, and got qualitatively similar results: 5.7 s on my i9 iMac in both versions of MMA, and 3.4 s on my M1 Pro in both version of MMA. Thus here the AS device was 1.7 x faster:

t1 = AbsoluteTime[];
Array[Sin[3^#] &, 10^6, {1, 10}];
AbsoluteTime[] - t1

I have enountered functions where the AS device does suffer significantly (e.g., in MMA 13, Blur[] is 14x (!) faster on my i9 iMac than on my M1 MBP; I haven't tested v 14 yet), but (at least based on the above), Sin[] does not appear to be one of them.

I'm curious--what's your reference for the issues with Sin[]?

[*In case anyone's curious, the reason I reduced the number of calculations by 100-fold (from 10^8 to 10^6) when I switched to the symbolic test is that symbolic calcs are much more time-consuming than numerical calcs. A useful oversimplified explanation for this is that numerical calcs. can be done directly "on hardware" using math coprocessors, while symbolic calcs need to be done in software.]

Thanks for the answer!

To see where I am coming from, look at this StackExchange thread (which replicates many other pages, and somewhat replicates my experiences back when I had the opportunity to experiment on this):

Benchmarking Mathematica 13 across machines

Many people consider the performance of MMA when choosing their computers. I feel we need some up-to-date benchmark results across hardware and systems for the major version 13, which will help a l...

mathematica.stackexchange.com

In particular compare say the first result (i9-13900K)
"Results" -> {
{"Data Fitting", 0.104},
{"Digits of Pi", 0.131},
{"Discrete Fourier Transform", 0.171},
{"Eigenvalues of a Matrix", 0.154},
{"Elementary Functions", 0.051},
{"Gamma Function", 0.184},
{"Large Integer Multiplication", 0.175},
{"Matrix Arithmetic", 0.015},
{"Matrix Multiplication", 0.061},
{"Matrix Transpose", 0.1},
{"Numerical Integration", 0.199},
{"Polynomial Expansion", 0.024},
{"Random Number Sort", 0.023},
{"Singular Value Decomposition", 0.131},
{"Solving a Linear System", 0.118}}}

with the M1 Max result
{"MachineName" -> "Mac Studio M1 Max 64GB RAM",
"System" -> "Mac OS X ARM (64-bit)",
"BenchmarkName" -> "WolframMark",
"FullVersionNumber" -> "13.3.0",
"Date" -> "July 31, 2023",
"BenchmarkResult" -> 4.583, "TotalTime" -> 3.02,
"Results" -> {{"Data Fitting", 0.183},
{"Digits of Pi", 0.162},
{"Discrete Fourier Transform", 0.226},
{"Eigenvalues of a Matrix", 0.252},
{"Elementary Functions", 0.298},
{"Gamma Function", 0.209},
{"Large Integer Multiplication", 0.181},
{"Matrix Arithmetic", 0.055},
{"Matrix Multiplication", 0.119},
{"Matrix Transpose", 0.096},
{"Numerical Integration", 0.359},
{"Polynomial Expansion", 0.05},
{"Random Number Sort", 0.378},
{"Singular Value Decomposition", 0.284},
{"Solving a Linear System", 0.168}}}

I'm not interested in litigating trivial differences here, but in picking up the big picture.
Unfortunately the Wolfram Benchmark does seem amenable to giving rather different results depending on exactly how you set things up, but you have to assume that things are more or less as you expect them to be!

Obviously the Intel machine maxes at about 5.8GHz (for single core), on the other hand Apple is wider; "generically" (eg GB6) we'd expect Intel to have about 30% advantage (somewhat from GHz, though not as much as pure GHz).
Obviously the Intel machine has many more cores (8P+16E) which will help with CERTAIN types of long operations (definitely BLAS stuff, probably eg transcendentals of long arrays, especially if MKL just gives you that for free).

So let's look at the big outliers.
- DFT, Eigenvalues, Matrix Multiplication, SVD, Solving a Linear System SHOULD route to AMX and show commensurate performance. I suspect they are not using AMX.
- look at Elementary Functions. I suspect that Intel is vectorizing this (ie running multiple lanes on AVX) while the Apple Silicon code is not.
- Random Number Sort is hugely faster on Intel, yet should kinda play to Apple strengths. My guess is something like it is being auto-parallelized on Intel but running one core on Apple

Looking at this list it does seem like bignums are no longer an issue. Stuff like Digits of Pi or Gamma Function were much worse, but seem reasonable now given the frequency difference. So progress in one domain!

If you feel in a mood to do all the cross tests (or at least some of them) it would be interesting to do the equivalent benchmark on (iMac vs Apple Silicon, and 14.0) but the single most interesting numbers would be the Apple Silicon 14.0 benchmarks to see if there are any clear advances (eg if BLAS stuff now routes through Accelerate and so AMX).

(BTW you can use Timing[ expression; ] which is somewhat less clunky that what you are doing.)

theorist9 · Jan 12, 2024

name99 said:
Thanks for the answer!

To see where I am coming from, look at this StackExchange thread (which replicates many other pages, and somewhat replicates my experiences back when I had the opportunity to experiment on this):

Benchmarking Mathematica 13 across machines

Many people consider the performance of MMA when choosing their computers. I feel we need some up-to-date benchmark results across hardware and systems for the major version 13, which will help a l...

mathematica.stackexchange.com

In particular compare say the first result (i9-13900K)
"Results" -> {
{"Data Fitting", 0.104},
{"Digits of Pi", 0.131},
{"Discrete Fourier Transform", 0.171},
{"Eigenvalues of a Matrix", 0.154},
{"Elementary Functions", 0.051},
{"Gamma Function", 0.184},
{"Large Integer Multiplication", 0.175},
{"Matrix Arithmetic", 0.015},
{"Matrix Multiplication", 0.061},
{"Matrix Transpose", 0.1},
{"Numerical Integration", 0.199},
{"Polynomial Expansion", 0.024},
{"Random Number Sort", 0.023},
{"Singular Value Decomposition", 0.131},
{"Solving a Linear System", 0.118}}}

with the M1 Max result
{"MachineName" -> "Mac Studio M1 Max 64GB RAM",
"System" -> "Mac OS X ARM (64-bit)",
"BenchmarkName" -> "WolframMark",
"FullVersionNumber" -> "13.3.0",
"Date" -> "July 31, 2023",
"BenchmarkResult" -> 4.583, "TotalTime" -> 3.02,
"Results" -> {{"Data Fitting", 0.183},
{"Digits of Pi", 0.162},
{"Discrete Fourier Transform", 0.226},
{"Eigenvalues of a Matrix", 0.252},
{"Elementary Functions", 0.298},
{"Gamma Function", 0.209},
{"Large Integer Multiplication", 0.181},
{"Matrix Arithmetic", 0.055},
{"Matrix Multiplication", 0.119},
{"Matrix Transpose", 0.096},
{"Numerical Integration", 0.359},
{"Polynomial Expansion", 0.05},
{"Random Number Sort", 0.378},
{"Singular Value Decomposition", 0.284},
{"Solving a Linear System", 0.168}}}

I'm not interested in litigating trivial differences here, but in picking up the big picture.
Unfortunately the Wolfram Benchmark does seem amenable to giving rather different results depending on exactly how you set things up, but you have to assume that things are more or less as you expect them to be!

Obviously the Intel machine maxes at about 5.8GHz (for single core), on the other hand Apple is wider; "generically" (eg GB6) we'd expect Intel to have about 30% advantage (somewhat from GHz, though not as much as pure GHz).
Obviously the Intel machine has many more cores (8P+16E) which will help with CERTAIN types of long operations (definitely BLAS stuff, probably eg transcendentals of long arrays, especially if MKL just gives you that for free).

So let's look at the big outliers.
- DFT, Eigenvalues, Matrix Multiplication, SVD, Solving a Linear System SHOULD route to AMX and show commensurate performance. I suspect they are not using AMX.
- look at Elementary Functions. I suspect that Intel is vectorizing this (ie running multiple lanes on AVX) while the Apple Silicon code is not.
- Random Number Sort is hugely faster on Intel, yet should kinda play to Apple strengths. My guess is something like it is being auto-parallelized on Intel but running one core on Apple

Looking at this list it does seem like bignums are no longer an issue. Stuff like Digits of Pi or Gamma Function were much worse, but seem reasonable now given the frequency difference. So progress in one domain!

If you feel in a mood to do all the cross tests (or at least some of them) it would be interesting to do the equivalent benchmark on (iMac vs Apple Silicon, and 14.0) but the single most interesting numbers would be the Apple Silicon 14.0 benchmarks to see if there are any clear advances (eg if BLAS stuff now routes through Accelerate and so AMX).

(BTW you can use Timing[ expression; ] which is somewhat less clunky that what you are doing.)

I don't like using the WolframMark benchmark because each operation is too short (some take <0.1 s), resulting in significant inter-run variation, even if one is careful to use a fresh kernel and ensure no significant background tasks are running. For instance, here are two runs of the benchmark on MMA 14.0 on my M1 Pro MBP. The second one took 3.695/2.965 =1.25 x as long as the first.

{"System" -> "Mac OS X ARM (64-bit)",
"BenchmarkName" -> "WolframMark", "FullVersionNumber" -> "14.0.0",
"Date" -> "January 12, 2024", "BenchmarkResult" -> 4.668, "TotalTime" -> 2.965,
"Results" -> {{"Data Fitting", 0.166}, {"Digits of Pi", 0.146},
{"Discrete Fourier Transform", 0.274}, {"Eigenvalues of a Matrix", 0.243},
{"Elementary Functions", 0.34}, {"Gamma Function", 0.196},
{"Large Integer Multiplication", 0.18}, {"Matrix Arithmetic", 0.055},
{"Matrix Multiplication", 0.097}, {"Matrix Transpose", 0.106},
{"Numerical Integration", 0.29}, {"Polynomial Expansion", 0.049},
{"Random Number Sort", 0.378}, {"Singular Value Decomposition", 0.278},
{"Solving a Linear System", 0.167}}}

{"System" -> "Mac OS X ARM (64-bit)",
"BenchmarkName" -> "WolframMark", "FullVersionNumber" -> "14.0.0",
"Date" -> "January 12, 2024", "BenchmarkResult" -> 3.746, "TotalTime" -> 3.695,
"Results" -> {{"Data Fitting", 0.15}, {"Digits of Pi", 0.169},
{"Discrete Fourier Transform", 0.275}, {"Eigenvalues of a Matrix", 0.243},
{"Elementary Functions", 0.612}, {"Gamma Function", 0.205},
{"Large Integer Multiplication", 0.189}, {"Matrix Arithmetic", 0.183},
{"Matrix Multiplication", 0.31}, {"Matrix Transpose", 0.11},
{"Numerical Integration", 0.31}, {"Polynomial Expansion", 0.048},
{"Random Number Sort", 0.378}, {"Singular Value Decomposition", 0.343},
{"Solving a Linear System", 0.17}}}

Having said that, after running these several times, and comparing the range of overall times on 14.0 with the range on 13.3.1, I'd say they're comparable on my M1 MBP. There may be individual results that changed significantly, but I didn't bother looking too closely, for the reason I mentioned above. For instance, I wouldn't want to conclude from a difference between 0.050 s on one version of MMA, and 0.075 s on another, that the machine is 50% slower on the latter.

The better approach would be to edit the code for the individual tests so they run for at least 5 s each, and then run them several times to check for consistency.

Other notes:
1) Sort itself can't run in parallel on MMA, irrespective of machine. However, the code for the random sort test is as follows, and thus it would be possible to parallelize it, since it's saying to repeat the sort test 15 times, making that an embarrassingly parallel task:
Module[{a}, AbsoluteTiming[SeedRandom[1]; a = RandomInteger[{1, 50000}, {520000}]; Do[Sort[a], {15}]]]

I don't have a more modern Intel processor to test, but if the brief WolframMark results are to be believed, the i9-13900K is ~20x faster on that operation than both an M1 Pro and an i9-9900K. That tells me that there is some specific optimization Intel has introduced with its later generation of processors that is offering a significant speed-up. Can Intel's new processors automatically recognize and parallelize looped tasks? I have no idea.

I ran my own test in 14.0 using this more reliable (because it runs longer) variant...

SeedRandom[1];
a = RandomInteger[{1, 50000}, {200000000}];
t1 = AbsoluteTime[];
Sort[a];
AbsoluteTime[] - t1

This at least confirms my WolframMark results, which found that Sort[]'s performance on my i9-9900K and M1 Pro are about the same:

i9-9000K iMac: 9.1 s
M1 Pro MBP: 8.6 s

2) I'm familiar with Timing[], but chose AbsoluteTime[] for two reasons:

a) I wanted real-world (wall-clock) results. That's what AbsoluteTime[] provides. Timing[] just gives raw CPU time for the evaluation, and doesn't include the time to interact with the front end, etc.

b) Timing[] can do funny things under the hood, that could compromise comparisons across different platforms. From the documentation:

"On certain computer systems with multiple CPUs, the Wolfram Language kernel may sometimes spawn additional threads on different CPUs. On some operating systems, Timing may ignore these additional threads. On other operating systems, it may give the total time spent in all threads, which may exceed the result from AbsoluteTiming."

I take it Wolfram agrees with my views, since it uses AbsoluteTiming[], rather than Timing[], in its benchmark. [AbsoluteTiming[] is a variant of AbsoluteTime[] that does multiple replicates.]

name99 · Jan 13, 2024

theorist9 said:
I don't like using the WolframMark benchmark because each operation is too short (some take <0.1 s), resulting in significant inter-run variation, even if one is careful to use a fresh kernel and ensure no significant background tasks are running. For instance, here are two runs of the benchmark on MMA 14.0 on my M1 Pro MBP.

Absolutely. But it's the only way we have to try to track these things. That's why I don't obsess over a few percent details. But factors of 6 (or missing factors of 2 or 3 where you think AMX should really help) are worth noticing.

theorist9 said:
The second one took 3.695/2.965 =1.25 x as long as the first.

{"System" -> "Mac OS X ARM (64-bit)",
"BenchmarkName" -> "WolframMark", "FullVersionNumber" -> "14.0.0",
"Date" -> "January 12, 2024", "BenchmarkResult" -> 4.668, "TotalTime" -> 2.965,
"Results" -> {{"Data Fitting", 0.166}, {"Digits of Pi", 0.146},
{"Discrete Fourier Transform", 0.274}, {"Eigenvalues of a Matrix", 0.243},
{"Elementary Functions", 0.34}, {"Gamma Function", 0.196},
{"Large Integer Multiplication", 0.18}, {"Matrix Arithmetic", 0.055},
{"Matrix Multiplication", 0.097}, {"Matrix Transpose", 0.106},
{"Numerical Integration", 0.29}, {"Polynomial Expansion", 0.049},
{"Random Number Sort", 0.378}, {"Singular Value Decomposition", 0.278},
{"Solving a Linear System", 0.167}}}

{"System" -> "Mac OS X ARM (64-bit)",
"BenchmarkName" -> "WolframMark", "FullVersionNumber" -> "14.0.0",
"Date" -> "January 12, 2024", "BenchmarkResult" -> 3.746, "TotalTime" -> 3.695,
"Results" -> {{"Data Fitting", 0.15}, {"Digits of Pi", 0.169},
{"Discrete Fourier Transform", 0.275}, {"Eigenvalues of a Matrix", 0.243},
{"Elementary Functions", 0.612}, {"Gamma Function", 0.205},
{"Large Integer Multiplication", 0.189}, {"Matrix Arithmetic", 0.183},
{"Matrix Multiplication", 0.31}, {"Matrix Transpose", 0.11},
{"Numerical Integration", 0.31}, {"Polynomial Expansion", 0.048},
{"Random Number Sort", 0.378}, {"Singular Value Decomposition", 0.343},
{"Solving a Linear System", 0.17}}}

Having said that, after running these several times, and comparing the range of overall times on 14.0 with the range on 13.3.1, I'd say they're comparable on my M1 MBP. There may be individual results that changed significantly, but I didn't bother looking too closely, for the reason I mentioned above. For instance, I wouldn't want to conclude from a difference between 0.050 s on one version of MMA, and 0.075 s on another, that the machine is 50% slower on the latter.

The better approach would be to edit the code for the individual tests so they run for at least 5 s each, and then run them several times to check for consistency.

Other notes:
1) Sort itself can't run in parallel on MMA, irrespective of machine. However, the code for the random sort test is as follows, and thus it would be possible to parallelize it, since it's saying to repeat the sort test 15 times, making that an embarrassingly parallel task:
Module[{a}, AbsoluteTiming[SeedRandom[1]; a = RandomInteger[{1, 50000}, {520000}]; Do[Sort[a], {15}]]]

It's quite possible that MOST of the time is actually being spent in generating the Random Integers...
And that may be yet another piece of functionality that has been coded to run as a vector on AVX but coded scalar (so far) on NEON?
I think that's essentially saying the same thing you are saying below? That the Sort[] times are comparable (as expected) but the RandomInteger[] times may be very different?

theorist9 said:
I don't have a more modern Intel processor to test, but if the brief WolframMark results are to be believed, the i9-13900K is ~20x faster on that operation than both an M1 Pro and an i9-9900K. That tells me that there is some specific optimization Intel has introduced with its later generation of processors that is offering a significant speed-up. Can Intel's new processors automatically recognize and parallelize looped tasks? I have no idea.

I ran my own test in 14.0 using this more reliable (because it runs longer) variant...

SeedRandom[1];
a = RandomInteger[{1, 50000}, {200000000}];
t1 = AbsoluteTime[];
Sort[a];
AbsoluteTime[] - t1

This at least confirms my WolframMark results, which found that Sort[]'s performance on my i9-9900K and M1 Pro are about the same:

i9-9000K iMac: 9.1 s
M1 Pro MBP: 8.6 s

All of which is fine! We all have our things of interest!
My particular issues of interest are the ones I raised
- vectorization of arrays of special functions
- use of AMX (most obviously for BLAS; secondarily for FFT; tertiary for arrays of special functions)

It's a shame to see that these are not yet hooked up :-(
Oh well, in time.
If anyone has the time to view all the recent LiveCEO'ing streams, maybe there are hints from within Wolfram? A discussion of "should we use Accelerate or why not"? Or perhaps that's considered too sensitive for a LiveCEO'ing session?

theorist9 · Jan 13, 2024

name99 said:
It's quite possible that MOST of the time is actually being spent in generating the Random Integers...

That's not what's happening. If I break up the WolframMark Random Number Sort test into two parts, as follows, I find that the sorting takes >100x longer than setting the seed and generating the random numbers, on both my i9-9900K iMac and my M1 MBP. That's because sorting a list of random numbers is inherently a longer process than generating it, and because Wolfram further minimized the relative effect of generating the random numbers by generating the list once, but repeating the sort 15 times.

t1 = AbsoluteTime[];
SeedRandom[1];
a = RandomInteger[{1, 50000}, {520000}];
AbsoluteTime[] - t1
t1 = AbsoluteTime[];
Do[Sort[a], {15}];
AbsoluteTime[] - t1

I confirmed this by retesting with my longer Sort[] variant.

name99 said:
I think that's essentially saying the same thing you are saying below? That the Sort[] times are comparable (as expected) but the RandomInteger[] times may be very different?

No, I was saying that, if you want to be certain the differences you are seeing with individual WolframMark tests are "real", and not due to random testing variation, you need to pull out each WolframMark test and modify its code so it runs for a much longer period of time (closer to 10 s rather than a fraction of a second), and run a few replicates to confirm consistency. That's what I did when I modified the Sort Random test to generate & sort a list that contained 200,000,000 numbers instead of 520,000. [In that case, I didn't bother with the Do loop.]

I added that, since I was seeing comparable results on my two machines with both Wolfram's short sort test and my long sort test, that tells me there's no fundamental Intel vs AS architectural difference that would explain the stunningly faster performance the i9-13900K shows (over both the i9-9900k and the M1 Pro)—and that there must be some specific optimization present in Intel's 13th-gen processors (OR that the stunningly faster time is a fluke that arose because the test is so short--that's why it's important to repeat that comparison on a new Intel machine, using a much longer sort test).

Plus if you had the new Intel machine you could specifically test whether the new Intel devices are automatically parallelizing loops, by comparing the time for Do[Sort[a], {15}] with that for Sort[a].

name99 said:
All of which is fine! We all have our things of interest!
My particular issues of interest are the ones I raised
- vectorization of arrays of special functions
- use of AMX (most obviously for BLAS; secondarily for FFT; tertiary for arrays of special functions)

Given your interests, why not download a 1-week trial version of MMA and do your own testing? I can help you modify the individual tests to run longer if you want. Just post the code for the tests that interest you, and I can suggest how to modify them.

name99 · Jan 13, 2024

theorist9 said:
That's not what's happening. If I break up the WolframMark Random Number Sort test into two parts, as follows, I find that the sorting takes >100x longer than setting the seed and generating the random numbers, on both my i9-9900K iMac and my M1 MBP. That's because sorting a list of random numbers is inherently a longer process than generating it, and because Wolfram further minimized the relative effect of generating the random numbers by generating the list once, but repeating the sort 15 times.

t1 = AbsoluteTime[];
SeedRandom[1];
a = RandomInteger[{1, 50000}, {520000}];
AbsoluteTime[] - t1
t1 = AbsoluteTime[];
Do[Sort[a], {15}];
AbsoluteTime[] - t1

I confirmed this by retesting with my longer Sort[] variant.

No, I was saying that, if you want to be certain the differences you are seeing with individual WolframMark tests are "real", and not due to random testing variation, you need to pull out each WolframMark test and modify its code so it runs for a much longer period of time (closer to 10 s rather than a fraction of a second), and run a few replicates to confirm consistency. That's what I did when I modified the Sort Random test to generate & sort a list that contained 200,000,000 numbers instead of 520,000. [In that case, I didn't bother with the Do loop.]

I added that, since I was seeing comparable results on my two machines with both Wolfram's short sort test and my long sort test, that tells me there's no fundamental Intel vs AS architectural difference that would explain the stunningly faster performance the i9-13900K shows (over both the i9-9900k and the M1 Pro)—and that there must be some specific optimization present in Intel's 13th-gen processors (OR that the stunningly faster time is a fluke that arose because the test is so short--that's why it's important to repeat that comparison on a new Intel machine, using a much longer sort test).

Plus if you had the new Intel machine you could specifically test whether the new Intel devices are automatically parallelizing loops, by comparing the time for Do[Sort[a], {15}] with that for Sort[a].

Given your interests, why not download a 1-week trial version of MMA and do your own testing? I can help you modify the individual tests to run longer if you want. Just post the code for the tests that interest you, and I can suggest how to modify them.

Sorry, I misunderstood your point. (Or I guess I understood it in that "long" sorts take as much time as expected; while "short" sorts are behaving weirdly.)

I actually have an MMA, what I don't have is an M3 machine. I've been waiting for the right time to get one that matches my needs. I was waiting for an update iMac Pro; perhaps that's never going to arrive and I should settle for an M3 Max Studio. Let's see what the lineup looks like when the new Studios are announced.

apostolosdt · Jan 14, 2024

leman said:
I am alarmed about how such low quality research is being published and disseminated. Not only are the authors using ancient benchmarks and get basic hardware specs horribly wrong, but their results are entirely nonsensical as well. There is no way in hell that an M1 with its 2.6TFLOPs will outperform an A100 on GEMM or FFT by a factor of over 10.

Please, delete this thread and don’t even discuss this stuff. This is just embarrassing.

The OP has proposed the start of a thread on the scientific use of the new Apple CPU/GPU lines. I can't see anything bizarre about that. His approach, by the inclusion of a preprint from arXiv.org, is sound and quite common for such sorts of OP threads. And I don't mind if members here ask for the deletion of the OP. What is a bit annoying is the attack on arXiv integrity.

leman · Jan 14, 2024

apostolosdt said:
The OP has proposed the start of a thread on the scientific use of the new Apple CPU/GPU lines. I can't see anything bizarre about that. His approach, by the inclusion of a preprint from arXiv.org, is sound and quite common for such sorts of OP threads. And I don't mind if members here ask for the deletion of the OP. What is a bit annoying is the attack on arXiv integrity.

The PDF in question doesn't even satisfy the most basic criteria to be considered a scientific publication. I fully understand that arXiv is a distribution service and that anyone can literally upload anything, so I don't really have any issue with arXiv itself (in fact, I am puzzled by your accusation that I am somehow questioning their integrity).

My issue is with the dangerous deficit of critical thinking when reading these "papers". While the PDF looks like a scientific work, but the claims made by the authors are ludicrously, preposterous, and certainly not serious. The mere fact that a PDF has been uploaded to arXiv does not give it merit (like, I can upload a nice looking PDF that claims archeologic evidence that dinosaurs were a space-faring civilization before they were destroyed by Cthulhu — certainly you wouldn't take that seriously, right)? One needs to be extra careful with the methodology and the claims, especially when dealing with unpublished, unreviewed results. My point is that one doesn't start a serious discussion by quoting unserious work.

apostolosdt · Jan 14, 2024

leman said:
The PDF in question doesn't even satisfy the most basic criteria to be considered a scientific publication. I fully understand that arXiv is a distribution service and that anyone can literally upload anything [...]

[...] The mere fact that a PDF has been uploaded to arXiv does not give it merit (like, I can upload a nice looking PDF that claims [...]

I'm afraid you cannot. First submitting authors have to have some sort of recommendation by established members known to the arXiv staff. (But you can submit that PDF of yours to viXra.org if you like.)

And arXiv is not a distribution service; it is an officially run site for preprints of scientific papers. Like people do with their first draft of a paper: They give a talk to their department, so their initial work can be judged by other colleagues before being submitted for publication. Standard procedure that nowadays may be done on arXiv.

But I do not criticize the rest of your comments. Computer stuff is not my scientific area.

theorist9 · Jan 14, 2024

apostolosdt said:
First submitting authors have to have some sort of recommendation by established members known to the arXiv staff.

According to https://en.wikipedia.org/wiki/ArXiv , "New authors from recognized academic institutions generally receive automatic endorsement, which in practice means that they do not need to deal with the endorsement system at all." As both authors are members of the "Physics Department and the
Center for Scientific Computing and Data Science Research at the University of Massachusetts Dartmouth", and one is addtitionaly a "senior scientist with the Max-Planck-Institut fu ̈r Gravitationsphysik", both would be endorsed automatially.

apostolosdt said:
What is a bit annoying is the attack on arXiv integrity.

Perhaps Leman's statement that "I am alarmed about how such low quality research is being published and disseminated" came off as sounding like an attack on ArXiv. I didn't take it that way. I took it instead as an attack on the authors, since they're the ones that (mis)used ArXiv to publish and disseminate their work.

In addition, I'd say Leman was raising a legitimate cautionary point about what can get onto ArXiv.

Of course, papers like this can even get into the lower-quality peer reviewed journals. Though I think it would be harder since, with peer review, there's at least there's a chance such howlers could be caught. With ArXiv there's no chance, since its system isn't designed to catch them all all. Again, from https://en.wikipedia.org/wiki/ArXiv :

"While arXiv does contain some dubious e-prints, such as those claiming to refute famous theorems or proving famous conjectures such as Fermat's Last Theorem using only high-school mathematics, a 2002 article which appeared in Notices of the American Mathematical Society described those as "surprisingly rare".[35] arXiv generally re-classifies these works, e.g. in "General mathematics", rather than deleting them;[36] however, some authors have voiced concern over the lack of transparency in the arXiv screening process.[27]"

Do I think ArXiv needs peer review? No--that would defeat its entire purpose. I just think people just need to apply an added layer of caution when citing papers from ArXiv.

iPadified · Jan 14, 2024

theorist9 said:
I took it instead as an attack on the authors, since they're the ones that (mis)used ArXiv to publish and disseminate their work.

ArXiv and many other sites are not considered "published" articles as they are not peer reviewed and therefore not quality controlled. It is a place for rapid communication of preliminary results and any academic knows this. There is a lot of trash out there, also published peer reviewed. I got a grant based on a published article in Nature (top of the top journal) and it turned out that the authors had done miscalculations. No wonder my research did not work as expected. It is the publish of perish culture at universities and faulty funding system that drives poor quality research but some times, we are just at the cutting edge (where we should be) so we get it wrong.

leman · Jan 15, 2024

iPadified said:
It is the publish of perish culture at universities and faulty funding system that drives poor quality research but some times, we are just at the cutting edge (where we should be) so we get it wrong.

Yes, the system unfortunately rewards quality over quantity. Mistakes happen. Then again, there are mistakes and there is obvious brain dead negligence (and forgive me if I am being dramatic). The thing is, merely the fact that the authors have uploaded this kind of work would be reason enough to me to immediately reject any kind of application or request from them. What kind of scientific integrity can I expect from a person who can publicly claim, without blinking an eye, that a basic entry-level GPU outperforms a supercomputer cluster? I published plenty of stuff that I myself consider cringy after all these years, but if you make a mistake like that (that can still happen due to lack of attention), you do your best to correct it.

apostolosdt said:
I'm afraid you cannot. First submitting authors have to have some sort of recommendation by established members known to the arXiv staff.

Sure, but that doesn't take much.

theorist9 · Jan 15, 2024

iPadified said:
ArXiv and many other sites are not considered "published" articles as they are not peer reviewed and therefore not quality controlled. It is a place for rapid communication of preliminary results and any academic knows this. There is a lot of trash out there, also published peer reviewed. I got a grant based on a published article in Nature (top of the top journal) and it turned out that the authors had done miscalculations. No wonder my research did not work as expected. It is the publish of perish culture at universities and faulty funding system that drives poor quality research but some times, we are just at the cutting edge (where we should be) so we get it wrong.

I respectfully disagree with your attempt to correct my semantics for using "published" instead of, say, "posted" when I wrote "they're the ones that (mis)used ArXiv to publish and disseminate their work"

First, your assertion that articles are not considered published unless they are peer reviewed is incorrect. Articles appearing in practitioners' journals, like Issues in Science and Technology, or The Economist, are certainly considered "published", yet are not peer-reviewed (they are subject to editorial review only).

Second, and more importantly, if you think the distinction between peer-reviewed and non-peer-reviewed journals is important here, you've entirely missed my point, which is that authors have a responsibility to exercise due diligence before they disseminate material, regardless of where it appears. The fact that they published/posted it on ArXiv is irrelevant--it doesn't let them off the hook for this.

Yes, cutting-edge science that pushes into new, poorly understood areas is inevitably going to get things wrong. That's understandable because it's the nature of the beast. But making obvious mistakes that violate common sense, which is what we were discussing here, is entirely different.

It's the difference between losing control of your jet when you're attempting to be the first to break the sound barrier, vs. losing control of your car when you're driving to the grocery on a dry, sunny day.

iPadified · Jan 15, 2024

theorist9 said:
I respectfully disagree with your attempt to correct my semantics for using "published" instead of, say, "posted" when I wrote "they're the ones that (mis)used ArXiv to publish and disseminate their work"

First, your assertion that articles are not considered published unless they are peer reviewed is incorrect. Articles appearing in practitioners' journals, like Issues in Science and Technology, or The Economist, are certainly considered "published", yet are not peer-reviewed (they are subject to editorial review only).

Second, and more importantly, if you think the distinction between peer-reviewed and non-peer-reviewed journals is important here, you've entirely missed my point, which is that authors have a responsibility to exercise due diligence before they disseminate material, regardless of where it appears. The fact that they published/posted it on ArXiv is irrelevant--it doesn't let them off the hook for this.

Yes, cutting-edge science that pushes into new, poorly understood areas is inevitably going to get things wrong. That's understandable because it's the nature of the beast. But making obvious mistakes that violate common sense, which is what we were discussing here, is entirely different.

It's the difference between losing control of your jet when you're attempting to be the first to break the sound barrier, vs. losing control of your car when you're driving to the grocery on a dry, sunny day.

The distinction is between a scientific journal, magazines and popular science and the ordinary newspapers.
In my field, engineering/bioengineering/biology, it is only a publication after a peer review and then only in recognised scientific journals. The Economist is not a scientific journal as I understand and would therefore not count as a scientific publication. Posting a manuscript on a preprint server has zero weight in my publication list.

There are some differences between fields but generally if a paper is not peer reviewed - be aware and any scientist should know that and be prepared for some crap.

I am peer reviewing about 10 manuscript per year (and say no to the rest of the invitations) and it is always something wrong with the approach or the interpretations so rejections and major revisions are the common outcome and these authors have all done as best as they could. Sometimes it is economic limitation and sometimes I do wonder if the PI saw the manuscript before submitting because of the poor quality. Finding errors and crappy approaches happens all the time.

iPadified · Jan 15, 2024

leman said:
What kind of scientific integrity can I expect from a person who can publicly claim, without blinking an eye, that a basic entry-level GPU outperforms a supercomputer cluster? I published plenty of stuff that I myself consider cringy after all these years, but if you make a mistake like that (that can still happen due to lack of attention), you do your best to correct it.

Haha, expect none after a blunder like that. Too bad for those authors who made public fools of themself. Peer review system has its faults but it protect scientist from themself sometimes. I also submit papers that could be better and the reason can be as simple as the funding ran out or the PhD student need a publication for the defense or, as you say, a brief lack of attention, probably because to chronic too long hours.

Perhaps I am old to not get too worked up by these things anymore. Correct the authors and hope they learn and move on.

apostolosdt · Jan 15, 2024

iPadified said:
[...] Correct the authors and hope they learn and move on.

I would be very happy to see such a correction. I'm interested---honestly.

Apple Silicon in Sciences

macrumors Core

macrumors regular

macrumors regular

macrumors 68030

macrumors regular

macrumors 68030

macrumors regular

macrumors regular

macrumors 68030

macrumors 601

macrumors 68030

macrumors 601

macrumors 68030

macrumors 601

macrumors 68030

macrumors 6502

macrumors Core

macrumors 6502

macrumors 601

macrumors 68020

macrumors Core

macrumors 601

macrumors 68020

macrumors 68020

macrumors 6502

Our Staff