
dairymilkbatman
macrumors newbie, Original poster
Apr 4, 2025
I'm a Linux guy who recently got an M4 MBP for light dev work. Loving it, btw.
I'm trying to understand benchmarking results that didn't make sense to me.
I compared it against my Arch Linux desktop, and (if I'm interpreting the results correctly) the MBP whooped my Linux desktop. However, my Linux desktop compiles faster, and obviously runs games much better (no surprise there).
I did not expect the MBP to even be close to my desktop lol

How accurate is Geekbench for these chips? Is there another benchmark you'd recommend, preferably open source?
Also, is there any 'go-to' source code you guys compile that's good for comparing two computers?

Many thanks!

geekbench1.png

geekbench2.png
 
Compiler speed differences will probably come from the Ryzen having extensions (and thus optimizations) the M4 simply lacks: like AVX and SSE. X3Ds also obviously have much larger caches (3x bigger than the M4 Pro I'd guess.) Also depends on if you're using the same compiler/linker.
 
Compiler speed differences will probably come from the Ryzen having extensions (and thus optimizations) the M4 simply lacks: like AVX and SSE. X3Ds also obviously have much larger caches (3x bigger than the M4 Pro I'd guess.) Also depends on if you're using the same compiler/linker.
Thanks for the info, I did not know that.

I used gcc compiler.

I can see Clang in the Geekbench results, and the MBP is 75% faster. However, I compiled Tor from source and the Ryzen was faster.
I'm wondering whether I should try something with much bigger compile times. Any suggestions?
 
In most compilation benchmarks I've seen, Apple Silicon tends to outcompete its x86 counterparts. Geekbench actually has a compilation subtest; look for "Clang". Here's an example for the two computers:


M4 Pro Clang (Multicore): 33942
7800X3D Clang (Multicore): 23497

Obviously here the M4 Pro is expected to compile things faster than the 7800X3D by quite a bit. You'd have to give a bit more detail as to your compiler/linker to know why the behavior you observed was different in your actual workflow. (Edit: I see you've replied since I started typing)

In general, Geekbench 6 gives a decent overview of a variety of workloads. There are two things to keep in mind, however. First, the multicore tests are not all equal. Some, like Clang, are embarrassingly parallel, but others are deliberately designed to mimic "work-sharing" algorithms where the benefit of additional cores plateaus after a while. John Poole, the head of Geekbench's developer Primate Labs, has stated this was done to mimic the real-life desktop processes many consumers commonly run, and to keep chipmakers and manufacturers from using Geekbench to push multicore systems most users don't need, or even push them toward worse systems (since lightly threaded and single-threaded applications can suffer). If you are doing dev work, that concern may apply less to you; instead of looking at the overall score, which is a combination of arithmetic and geometric means of the various subscores, focus on the subscores that matter most to you. Here is a manual for Geekbench and what each of the tests does:


The second thing is that Geekbench subtests are very, very short. This generally favors x86 processors, which often have very high instantaneous turbo boost clocks that ARM and Apple designs lack, so if anything it understates the Apple processors' sustained performance, but it's still something to keep in mind when looking at GB6 scores.

Compiler speed differences will probably come from the Ryzen having extensions (and thus optimizations) the M4 simply lacks: like AVX and SSE. X3Ds also obviously have much larger caches (3x bigger than the M4 Pro I'd guess.) Also depends on if you're using the same compiler/linker.
I think only the last one, the compiler/linker, really applies. The M4 has NEON extensions, but regardless, those vector units mostly don't come into play in integer-heavy tasks like compilation as far as I know; they're designed to speed up floating-point operations. Further, X3D is mostly beneficial for gaming rather than productivity, though I'll admit that's a generalization and I don't know about compilation in particular. In fact, prior to the current Ryzens, X3D parts were typically seen as slower than the non-X3D models for productivity because clock speeds had to be lowered to compensate for the extra cache dies.

Thanks for the info, I did not know that.

I used gcc compiler.

I can see Clang in the Geekbench results, and the MBP is 75% faster. However, I compiled Tor from source and the Ryzen was faster.
I'm wondering whether I should try something with much bigger compile times. Any suggestions?
Even if you are using gcc on the Mac, did you actually download gcc and point your build at it? For instance, I haven't, but I can still compile something by typing "gcc ...." and my Mac will silently substitute the default clang compiler.

I found this list of compiler benchmarks, some longer some shorter, most (but not all) multiplatform:


Here's a video of someone doing code compilation tests; not your particular models, but it should still give you a sense of what to expect:

 
Thanks for the extensive reply! Great info! Those results are comical.

Wow, after watching that video I am pretty happy with my macbook purchase :D

Ah, so I was correct that the short test times favor the x86 arch!

I did download and install it; I noted that gcc was listed in the config. I must go check my terminal history though, as I am not sure what it used now :/

Oooo, I am going to download phoronix and that suite when I get home... Have myself a tea and compile time :D
 
Phoronix benchmarks are quite useless; they often use really old software, or obscure software that contains hand-written optimizations only for x86_64, or really old versions of x265 and svt-av1 from before they had any NEON optimizations.

The good (or bad?) thing about arm64 is that there is still so much that can be optimized; for example, x265 got almost 100% faster in the last year.
 
Thanks for the extensive reply! great info! those results are comical.
My pleasure! That's why a priori I felt like the 12-core M4 Pro should be doing better than the 7800X3D in your personal compilation test as well, but then who knows? Obviously though the most important benchmarks for you are going to be how it performs on your particular workflow, so hopefully we can suss out why the compilation times seem longer on your Mac at home for your project.

Wow, after watching that video I am pretty happy with my macbook purchase :D



Ah so I was correct in the test times favoring the x86 arch!
I think it depends. I don't want to say that shorter tests always favor x86 in all circumstances*, but my sense is that they often do. *While not applicable to your situation, an M4 in a MacBook Air is obviously going to suffer from sustained-load endurance issues compared to an M4 in a mini or MBP.

I did download and install, I noted that gcc was listed in the config. I must go check terminal history though, as I am not sure what it used now :/
When I type gcc -v to get the "gcc" version, this is what I get:

Screenshot 2025-04-17 at 12.18.46 AM.png


So yes, it uses clang if you haven't actually got gcc installed and properly linked up, though you certainly might have; just thought I'd bring it up to double-check. Regardless, given @galad's warning about the state of gcc on AArch64 (I'm disappointed to hear it hasn't improved as much as one would like), maybe also try comparing clang compilation between the Mac and the PC to see how it performs?

Oooo, I am going to download phoronix and that suite when I get home... Have myself a tea and compile time :D

I hope it helps! I do wish compile tests were done more frequently as part of test suites; I especially wish people ran power-efficiency tests with them. SPEC 2017 has a compilation subtest (it takes about 10 minutes to run, I think?), but I'm pretty sure you have to pay for SPEC even as an individual.

Phoronix benchmarks are quite useless; they often use really old software, or obscure software that contains hand-written optimizations only for x86_64, or really old versions of x265 and svt-av1 from before they had any NEON optimizations.
Oh shoot. As I said above, good multi-platform compilation benchmarks are hard to find, and I hoped at least one of these might work. Some of them appear to have been updated in the last 3 years, and a lot of the extensions/frameworks for compilation (rather than floating point) that I glanced at should be okay for ARM? But I dunno; I'm not familiar with most of them, I have to admit. I agree that Phoronix can be problematic in general, but I was hoping one of these might be useful.
 
Compiler speed differences will probably come from the Ryzen having extensions (and thus optimizations) the M4 simply lacks: like AVX and SSE. X3Ds also obviously have much larger caches (3x bigger than the M4 Pro I'd guess.) Also depends on if you're using the same compiler/linker.
1. Compilers typically do not use AVX or SSE. They're usually pure integer workloads with little or no scope for SIMD.

2. The X3D's giant L3 cache is only one part of the hierarchy, and whether it matters depends a lot on the workload.

M4 Pro L1: 128KiB data, 192KiB instruction
M4 Pro L2: 16MiB shared over each performance CPU cluster (4 or 6 cores per cluster)
M4 Pro LLC: 24MiB shared over whole SoC (memory-side cache)

7800X3D L1: 32KiB data, 32KiB instruction
7800X3D L2: 1MiB per core
7800X3D L3: 96MiB shared over all eight cores

For most applications, L1 size matters more than L2 size, which in turn matters more than L3 size, so the M4's cache hierarchy is preferable in most cases. 4x the L1 data cache capacity is huge.

The X3D's giant L3 can be a big win in a few very specific workloads where the working set size is exceptionally large, but I wouldn't expect compilers to be one of those.
 
1. Compilers typically do not use AVX or SSE. They're usually pure integer workloads with little or no scope for SIMD.
GCC uses glibc for its runtime, which has optimizations for both AVX and SSE for stuff like memcpy/memcmp and so on.
 
GCC uses glibc for its runtime, which has optimizations for both AVX and SSE for stuff like memcpy/memcmp and so on.
How useful that is ... is extremely workload-dependent, and in fact trying to use AVX and SSE can be slower than a normal memcpy:


Again, for compilers, I've never seen a huge uplift from using AVX/SSE during the compilation process, but I'm not a compiler expert, so maybe I'm wrong. Bottom line, though: most compilation benchmarks show Apple doing really quite well against x86 machines. So while I'm not certain what explains the OP's results, most results show the opposite. This isn't a knock on the OP; merely that x86 extensions shouldn't be a problem in general when it comes to compilation.
 
I mean it matters in the sense that gcc and glibc (not LLVM) are almost certainly not as optimized on aarch64 compared to x86_64 for a litany of reasons. I’m not shocked at all that it’s slower on an M4.
 
I mean it matters in the sense that gcc and glibc (not LLVM) are almost certainly not as optimized on aarch64 compared to x86_64 for a litany of reasons. I’m not shocked at all that it’s slower on an M4.
I'm pretty sure SPEC 2017 uses gcc for its compilation subtest, and if I remember correctly Apple Silicon does extremely well there too. EDIT: Yup. Unfortunately it is hard to find up-to-date scores because so few people run SPEC (Geekerwan does, but doesn't report subtest results that I've seen), but here is the M1 Max against its contemporaries:


The M1 Max just blows the x86 chips out of the water when compiling gcc, and that's with an older gcc version too. That's why I'm surprised by the OP's result that even gcc on the Mac would be slower than the PC. Basically every compiler test I've seen, including gcc tests, suggests it should be the opposite. Code compilation should be a major strength of Apple Silicon.
 
I'm pretty sure SPEC 2017 uses gcc for its compilation subtest, and if I remember correctly Apple Silicon does extremely well there too. EDIT: Yup. Unfortunately it is hard to find up-to-date scores because so few people run SPEC (Geekerwan does, but doesn't report subtest results that I've seen), but here is the M1 Max against its contemporaries:
If you look at desktop CPUs on the same tests from that era, the performance is similar to the M1 Max:

So yeah, I agree the AMD and Intel laptop chips are crap. But the OP is using a desktop processor, and we can't conclude from these tests that Apple Silicon has some kind of insurmountable lead. After all, that is what is being compared: a desktop vs a laptop.
 
If you look at desktop CPUs on the same tests from that era, the performance is similar to the M1 Max:

So yeah, I agree the AMD and Intel laptop chips are crap. But the OP is using a desktop processor, and we can't conclude from these tests that Apple Silicon has some kind of insurmountable lead. After all, that is what is being compared: a desktop vs a laptop.
The OP has a 7800X3D with only 8 cores and 16 threads. The 7950X has 16 cores and 32 threads with higher clocks, and scores 89 on SPEC 2017's gcc subtest. The 7800X3D will often, especially on a benchmark like gcc, have roughly half the multicore performance of the 7950X (as it does in Geekbench's compilation test).

Meanwhile, the 8+4 M4 Pro is significantly faster in multicore than the 8+2 M1 Max, and the M1 Max scores 74.2 in the gcc subtest. To give an example: in Geekbench 6 Clang (yes, Clang, but this is for intra-Apple comparison only), the M1 Max scores 22050 while the 8+4 M4 Pro scores 33476, about 50% more. It should also be noted that the AMD chips do much, much better against the M1 Max in Geekbench's Clang test than in SPEC 2017's gcc: in Geekbench's Clang the 7950X matches an M4 Max and doubles the M1 Max's score, while it's only about 20% better in SPEC 2017 gcc (this could be an overclocked 7950X, but even so). So if anything, gcc would appear to favor the Mac. I'm sure that's probably not actually the case, and there are almost certainly other factors like test length coming into play, but at minimum the Macs do better on SPEC's long gcc compilation test than on Geekbench's short Clang compilation test.

Apple's CPUs are the same between laptop and desktop (except the Ultra, obviously) and are generally competitive with AMD/Intel desktop CPUs; the latest 9950X will beat an M4 Max even in GB6's Clang test, but the OP has a generation-older, smaller AMD desktop CPU. Basically, the M4 Pro shouldn't be outclassed by the OP's 7800X3D in almost any CPU-intensive workload, including, perhaps especially, gcc code compilation. It should be the other way around, and by a fair margin. That's why the OP's results are so puzzling to me.
 
This doesn't seem to be true:

You're saying it's impossible, but clearly there are circumstances where it can outperform the M4 Pro (in a CPU workload).
The M4 Pro won 76% of the time, so at worst a slight exaggeration on my part. But per the caveat from your initial complaint, PC-specific workloads optimized solely for x86 almost certainly come into play (whereas they largely shouldn't for code compilation, though maybe there are corner cases and the OP found one). As @galad said, Phoronix benchmarks are notorious for that (old benchmarks that are x86-only or have x86-specific optimizations)*. So the fact that the M4 Pro still won 76% of the time even then reinforces my point: the M4 Pro should generally whoop (as the OP put it) the 7800X3D in almost any CPU-intensive workload where they're on anywhere near equal footing software-wise, and so the M4 Pro should definitely be beating the 7800X3D in llvm and gcc compilation.

*Edit: Yup. In fact, according to the chart in your link, the only tests the M4 Pro lost are old CPU renderers which haven't been updated in 4-5 years, some since before or around the Apple Silicon launch (the latest was updated in 2021, another way back in 2019, and another's last update was 3 days after the M1 Air launched, so I doubt that made a difference), and none have ARM vector extensions. And those are tests (CPU renderers doing heavy FP vector work) where the vector extensions really do make a huge difference. The llvm code compilation benchmark, however? The M4 Pro won by about 35%. So yes, old benchmarks optimized solely for x86 can still let a 7800X3D beat an M4 Pro, but other than that it shouldn't be close, and that includes gcc and llvm code compilation.
 
I'm not the one saying it's close, it's not my claim. Either the OP is lying, or it is true.
 
I'm not the one saying it's close, it's not my claim. Either the OP is lying, or it is true.
I'm assuming it's true, but the reason why must be interesting, since I've never seen a code compilation benchmark that would make the OP's result seem reasonable. My hypotheses are the following:

1) OP is using gcc on the PC and clang on the Mac.

2) For some reason the Mac isn't using all of its cores/threads during compilation while the PC is (i.e. the build tool isn't setting something up right; most cross-platform build tools should just work, so I don't know why that would be the case, but it's possible)

3) Some weird corner case where the compiler has to do extra work for the Mac version of the library for some reason. Again, I can't think of why that would be the case, but I don't know what the OP is compiling.

Or heck maybe something else I didn't consider, so hopefully the OP figures it out and lets us know!
 
GCC uses glibc for its runtime, which has optimizations for both AVX and SSE for stuff like memcpy/memcmp and so on.
That is not at all the same thing as GCC's own code having any kind of direct performance dependence on AVX.

It's also not even an indirect dependence. Every platform has its own unique best way to do memcpy() and friends. The only reason people use AVX for memcpy() on x86 is that the extra width of the 256-bit or 512-bit vector registers increases how many bytes are moved to/from memory by each load or store instruction.

Apple Silicon, on the other hand, doesn't benefit at all from using SIMD load/store. It only has 128-bit SIMD registers, but AArch64 provides load-pair and store-pair instructions, which do a 128-bit load/store using only the integer register file. So you don't need to bother with SIMD at all here; you can get all of the potential speedup merely by transforming pairs of 64-bit integer loads or stores into 128-bit load-pair/store-pair instructions. I have not checked, but I would be quite surprised if gcc and glibc lack this, as it's not a complex optimization. (Far less complex than using AVX.)
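To illustrate the idea, here's a sketch (not glibc's or Apple's actual implementation) of a copy loop that moves 16 bytes per iteration using only integer-register loads and stores; on AArch64, each pair of 64-bit accesses below can be fused into a single ldp or stp:

```c
#include <stdint.h>
#include <string.h>

// Naive 16-bytes-per-iteration copy using only integer registers.
// On AArch64, each pair of 64-bit loads/stores can be emitted as one
// ldp/stp instruction, moving 128 bits per memory operation with no
// SIMD involved. A sketch only, not a real libc memcpy.
void copy16(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        uint64_t a, b;
        memcpy(&a, s, 8);       // compiles to a plain 64-bit load
        memcpy(&b, s + 8, 8);
        memcpy(d, &a, 8);       // and a plain 64-bit store
        memcpy(d + 8, &b, 8);
        s += 16; d += 16; n -= 16;
    }
    while (n--) *d++ = *s++;    // copy any remaining tail bytes
}
```

The fixed-size memcpy calls inside the loop are themselves inlined by the compiler into single loads/stores, which is what makes the pairing possible.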
 
That is not at all the same thing as GCC's own code having any kind of direct performance dependence on AVX.

It's also not even an indirect dependence.
Yes, it is. glibc is not on macOS, so gcc has to build against the libSystem libc. The performance would be different because the implementations are not identical. This is like saying there's no difference between musl and glibc, when programs that use musl frequently benchmark faster.
 
Yes, it is. glibc is not on macOS, so gcc has to build against the libSystem libc. The performance would be different because the implementations are not identical. This is like saying there's no difference between musl and glibc, when programs that use musl frequently benchmark faster.
Are you serious with this? Or are you just seeking nitpicks that let you believe you've achieved a technical debate-style 'win'?

I say this because, while it's absolutely true that gcc built for macOS should link against the system libc, this is obviously not confirmation of the idea you're trying to support. Unless Apple is incompetent, the memcpy() in macOS libc should be as optimized for Apple Silicon as it's possible to be, so it just doesn't matter that there's no AVX.

Also, there's something you seem to be unaware of, which is the reason I was even talking about whether gcc has AArch64-optimized memcpy(). gcc, like all other modern optimizing C compilers, replaces many calls to libc functions that don't require a syscall with inlined builtins. memcpy() is one of the main targets of this kind of optimization, and inlining it is usually triggered by the compiler noticing that the copy length is a small constant. If you compile the gcc binary you benchmark with gcc itself, you'll be evaluating the performance of a program with lots of gcc-provided memcpy().
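You can see the builtin replacement for yourself: with a small constant length, a call like the one below typically compiles to a couple of inline loads and stores with no call into libc at all (compile with cc -O2 -S and check the assembly; on AArch64 you'll usually see ldp/stp, with no branch to memcpy):

```c
#include <string.h>

// With a small constant length, modern gcc/clang replace the memcpy()
// call with inline loads/stores: no libc function is called at all.
void copy_header(char *dst, const char *src) {
    memcpy(dst, src, 16);   // constant length: prime candidate for inlining
}
```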

But I don't consider that a problem for my argument, which is that memcpy() and friends are fairly easy to optimize for AArch64, and they probably have been in gcc. That supposition is supported by the evidence others have been pointing you at. There's plenty of benchmark data out there showing that gcc performs great on Apple Silicon.
 
I think really only the last one, the compiler/linker applies. M4 has NEON extensions but regardless most of those vector processors don't come into play in integer heavy tasks like compilation as far as I know - they're designed to speed up floating point operations. Further, X3D is mostly beneficial for gaming not productivity, but I'll admit that's a generalization and I don't know about compilation in particular. Prior to the current Ryzens in fact they were typically seen as slower than the non-X3D models for productivity because clock speeds had to be lowered to compensate for the extra cache dies.

The clock speed concerns were largely eliminated with the 9000-series X3D CPUs, as the 3D V-Cache was moved from on top of the CCD die to underneath it. The former design meant heat could not be extracted as easily from the X3D parts, which is why their clock speeds were slightly lower than those of their non-X3D counterparts. AMD also made significant improvements to how their multi-CCD SKUs hand off tasks between the CCDs (only applicable to 12- and 16-core models), which has made the 9950X3D an excellent CPU for both gaming and productivity.
 
The clock speed concerns were largely eliminated with the 9000-series X3D CPUs, as the 3D V-Cache was moved from on top of the CCD die to underneath it. The former design meant heat could not be extracted as easily from the X3D parts, which is why their clock speeds were slightly lower than those of their non-X3D counterparts. AMD also made significant improvements to how their multi-CCD SKUs hand off tasks between the CCDs (only applicable to 12- and 16-core models), which has made the 9950X3D an excellent CPU for both gaming and productivity.
Indeed. Hence why I wrote "Prior to the current Ryzens".
 