
analog guy

macrumors 6502
Original poster
Mar 6, 2009
399
51
hey, folks:
just took delivery of a 6c nMP and wanted to post a few observations.

i ran geekbench3 with a number of memory configs and figured i would post the results.

the 16GB chips are crucial RDIMMs.

single-core/multi-core:
3183/4578 (12GB - 3x4 bank #s 1, 2 & 3)
1984/2055 (16GB - 1x16 bank #1)
3096/3916 (32GB - 2 banks filled #s 1 & 3)
3389/5377 (48GB - 3 banks filled #s 1, 2 & 3)
3319/5840 (64GB - all banks filled)

i found it surprising that 3x4 performed far better (single core) than 1x16. i was also surprised that 4x16 took a hit (albeit slight) in single performance over 3x16.

would be curious if someone with a 4x4 configuration tests 4/8/12 & 16.
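
For anyone who wants the gaps as percentages rather than raw scores, here is a rough Python sketch that expresses each configuration relative to the 4x16 run. The numbers are copied from the list above; the function name `relative_to` and the choice of 4x16 as baseline are just assumptions for illustration.

```python
# Scores copied from the post above: (single-core, multi-core).
scores = {
    "3x4 (12GB)":  (3183, 4578),
    "1x16 (16GB)": (1984, 2055),
    "2x16 (32GB)": (3096, 3916),
    "3x16 (48GB)": (3389, 5377),
    "4x16 (64GB)": (3319, 5840),
}

def relative_to(baseline, table):
    """Return each config's scores as a fraction of the baseline config's scores."""
    base_single, base_multi = table[baseline]
    return {
        name: (single / base_single, multi / base_multi)
        for name, (single, multi) in table.items()
    }

ratios = relative_to("4x16 (64GB)", scores)
for name, (single, multi) in ratios.items():
    print(f"{name:12s} single {single:6.1%}  multi {multi:6.1%}")
```

Read this way, the single 16GB stick lands at roughly a third of the 4x16 multi-core score, while 3x16 is about 2% ahead of 4x16 on single-core.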
 
Congrats on receiving your nMP!

It's good to see that this is measurable. I might have guessed that Geekbench would largely fit in cache and thus not adequately stress the memory subsystem, but this clearly says otherwise.

Keep in mind, the quad-channel memory controller interleaves data across channels like a RAID0 array stripes data across drives... so 3x4 should outperform 1x16 handily since the three sticks offer triple the bandwidth of a single stick.

Another factor may be that RDIMMs add an extra clock cycle of latency vs. UDIMMs.

I'm guessing the anomaly between 3x16 and 4x16 is simply the margin of error in this test.
 
I find the multi-core scores you have listed surprising. My brother's Mac mini (the i5 model) will beat that. Are you sure you're not missing a 1 in front of them?
 
Could you please post the individual test results as well?

In the other discussion on this topic most of the individual tests were pretty close in performance, but a few greatly benefitted from more channels. It would be interesting to see that data for the full complement of channel populations.
 
Thanks for your test ;) I would also like to see how things perform in real applications, not only in synthetic benchmarks.
I'm still waiting for my 8-core. In the meantime I've run a couple of tests on one of the nodes I use for distributed rendering in Vray (it's an i7-4930K machine; the CPU is nearly identical to the Xeon in the 6-core nMP). A 10,000,000-polygon scene needs about 10GB to render, and render times were exactly the same with both 16GB and 32GB of DDR3-1866 RAM (8x2 and 8x4 DIMMs). Of course, I expect that if you fill all of your RAM with more complex scenes the result will be quite different, but as long as your project fits in memory you should not see a significant decrease in performance, at least in Vray. That's just my experience; there are probably many workloads where running in quad-channel mode gives some performance gain.
As soon as I get my 8-core I'll test rendering performance with different memory configurations (16GB x1/x2/x3/x4, Crucial DIMMs).
 
here is the full info from the memory system sub-tests of GB3 (64-bit) for 3x4, 1x16, 2x16, 3x16 and 4x16.

i labeled the files (you should see that if you save them), but the order appears to be what i listed above.
 

Attachments

  • 12GB = 3x4.png
  • 16GB = 1x16 bank 1.png
  • 32GB = 2x16 banks 1 & 2.png
  • 48GB = 3x16.png
  • 64GB = 4x16.png

thanks for that. i was shocked by the magnitude of the difference between 3x4 and 1x16. presumably, 4x4 would provide an even greater difference.

now it makes more sense why the 16GB stock config is not 1x16 or 2x8 (which would make later upgrades easier/more economical for users) -- it's not just a small hit on that subsystem.
 
I know you probably don't have the chips to run this test, but I wonder how efficient a 2x8 / 2x4 configuration would be (total = 24 gig). That's how mine will start life, and eventually go to 4x8 for 32 gig....I just don't see a reason to leave 2 slots open and have 3 chips in the drawer while I wait to purchase the remaining 2x8 from crucial....
 
i don't have 2x8 chips but would love to see your results when you receive your machine.

perhaps i could test 2x16+2x4 = 40GB (or 3x4+1x16=28GB) to see what difference mixing sizes might make.
 
Well it's just math. 1 DIMM can at most achieve bandwidth of 14.9GB/s, two can achieve twice that and so on. Bandwidth isn't always a key factor in real world performance though and it doesn't look like geekbench is really testing the capacity either so there is that to consider.
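
To spell that arithmetic out, here is a small Python sketch. It assumes DDR3-1866 (about 1866.67 million transfers per second) on a 64-bit channel, one DIMM per channel, and perfect interleaving; the function name `peak_bandwidth_gbs` is made up for the example, and real-world throughput will be lower than these theoretical peaks.

```python
# Assumption: DDR3-1866, i.e. about 1866.67 million transfers per second,
# on a 64-bit (8-byte) channel. This matches the 14.9GB/s figure quoted above.
TRANSFERS_PER_SECOND = 1866.67e6
BYTES_PER_TRANSFER = 8
MAX_CHANNELS = 4  # quad-channel controller

def peak_bandwidth_gbs(dimms):
    """Theoretical peak in GB/s, assuming one DIMM per channel and perfect interleaving."""
    channels = min(dimms, MAX_CHANNELS)
    return channels * TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9

for n in range(1, 5):
    print(f"{n} DIMM(s): {peak_bandwidth_gbs(n):.1f} GB/s")
```

Under those assumptions the peak scales from about 14.9 GB/s with one DIMM to roughly 59.7 GB/s with all four channels populated.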

----------

You can't mix UDIMMs and RDIMMS I'm afraid.

Someone testing 8GB DIMMs could also test 2x8GB+2x4GB and the effects of adding an 8GB DIMM to the other three 4GB ones.
 
Well it's just math. 1 DIMM can at most achieve bandwidth of 14.9GB/s, two can achieve twice that and so on. Bandwidth isn't always a key factor in real world performance though and it doesn't look like geekbench is really testing the capacity either so there is that to consider.
yes--simple math but i hadn't done that calculation before so was surprised by the result.

umbongo said:
You can't mix UDIMMs and RDIMMS I'm afraid.

Someone testing 8GB DIMMs could also test 2x8GB+2x4GB and the effects of adding an 8GB DIMM to the other three 4GB ones.
yes, you're right about the UDIMMs and RDIMMs.

would be interesting to compare 4x8 vs 2x16 as well as the mixed 2x8+2x4 vs 1x8+1x4, 2x8, and 2x4 pairing that CH12671 proposed (hope he/she will test and report back).
 
3096/3916 (32GB - 2 banks filled #s 1 & 3)

Ok - I have 32GB as well but in banks 1 & 2.

Should I be using banks 1 & 3??
 
i tried both; performance was virtually identical in geekbench for all memory tests.
 
Thanks - but can you post the integer and float numbers as well?

It should be expected that tests designed to bypass the cache and stress the memory system would show significant performance benefits from the added channels.

How does it affect other tasks like JPEG and ZIP compression?
 

Thank you. Thank you very much.

This looks like the earlier report - the number of DIMMs is almost irrelevant for many of the tests. SHA1 multicore and SHA2 multicore were faster with 1 DIMM than with 4 (but probably within the sampling error - hey 'analog guy', wanna do 20 runs and give us the mean and standard deviation for every component score? ;) ).
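
If anyone does collect repeated runs, something like the following Python sketch would produce the mean and standard deviation per test. The helper names and the "k times the larger standard deviation" rule are assumptions for illustration, not a formal significance test.

```python
import statistics

def summarize(run_scores):
    """Return (mean, sample standard deviation) for repeated runs of one test."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(run_scores), statistics.stdev(run_scores)

def differs_beyond_noise(runs_a, runs_b, k=2.0):
    """Crude check (an assumption, not a formal test): True if the two means
    differ by more than k times the larger of the two standard deviations."""
    mean_a, sd_a = summarize(runs_a)
    mean_b, sd_b = summarize(runs_b)
    return abs(mean_a - mean_b) > k * max(sd_a, sd_b)
```

Feeding it the 1-DIMM and 4-DIMM runs for a single sub-test would show whether a gap like the SHA1/SHA2 one is larger than the run-to-run scatter.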

Looking at the group scores:
Code:
                        1 DIMM  2 DIMM  3 DIMM  4 DIMM
                       ------- ------- ------- -------
Floating Point Single    3825    3826    3828    3836
Floating Point Multi    25531   25555   25529   25522
Integer Single           3625    3641    3655    3646
Integer Multi           20959   22686   23768   24282

So,
- virtual 4-way tie on Floating Single
- virtual 4-way tie on Floating Multi-core
- virtual 4-way tie on Integer Single
- 1 DIMM is 86% of 4 DIMMs on Integer multi - but if you removed AES and Dijkstra you'd have a virtual 4-way tie, the rest of the integer multi tests were virtual ties

Those L3 caches do seem to be effective on "non-bandwidth virus" programs.
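
As a quick check on the "virtual tie" reading, here is a Python sketch that computes the 1-DIMM score as a fraction of the 4-DIMM score for each group. The scores are copied from the table above; the 1% tie threshold is an arbitrary assumption, not something Geekbench defines.

```python
# Group scores copied from the table above, ordered 1, 2, 3, 4 DIMMs.
group_scores = {
    "Floating Point Single": [3825, 3826, 3828, 3836],
    "Floating Point Multi":  [25531, 25555, 25529, 25522],
    "Integer Single":        [3625, 3641, 3655, 3646],
    "Integer Multi":         [20959, 22686, 23768, 24282],
}

TIE_THRESHOLD = 0.01  # assumption: treat anything within 1% as a tie

def one_vs_four(scores):
    """Return the 1-DIMM score as a fraction of the 4-DIMM score."""
    return scores[0] / scores[-1]

for name, scores in group_scores.items():
    ratio = one_vs_four(scores)
    verdict = "tie" if abs(1 - ratio) <= TIE_THRESHOLD else "not a tie"
    print(f"{name:22s} {ratio:.1%}  {verdict}")
```

With that threshold, only Integer Multi falls outside the tie band, at about 86%.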
 
thanks for your analysis.

what do you think of the relevance of the stream copy/scale/add #s where 3x4 outperforms 1x16?

STREAM is a "bandwidth virus" benchmark designed to defeat all caches and measure the raw memory bandwidth of the system. It does nothing useful.

IMO, it is interesting for people trying to get into the Top500 Supercomputer list, but mostly irrelevant for anyone considering an Apple, Windows or Linux desktop system.

Most apps benefit from cache, and Intel is currently looking at 2 MiB to 2.5 MiB cache per physical core as the sweet spot. The GeekBench numbers show that is a good decision for almost all of the tests in GeekBench. There are probably some useful desktop apps that need extreme bandwidth, but not many.

One thing that I was happy to learn from this discussion is that AES encryption is one of the bandwidth intensive apps. I'm buying systems for an application gateway prototype which will use 20-core systems to do SSL (AES) encryption. I've learned that populating each system with 8 DIMMs is the way to go. (Some systems only need 32 GiB, so they'll get 8x4GiB.)
 

Thanks for taking the time to do this.


Moral of this story? Cache is king! :cool:
 