peletrane

macrumors member
Original poster
Jan 19, 2007
93
10
Chicago, IL
What RAM, at the best price, and in what configuration will get me from the stock 32 GB to 96 GB? Do I need to replace all of the original 32 GB to reach 96 or 192 GB, or can I leverage what is already in the existing slots? How should I configure the slots to get to 96 GB or 192 GB, what will it cost, and what is the best third-party RAM? Thanks! It's a nuts-and-bolts kind of question: which RAM goes in which slots, and can I keep using the existing RAM?
 
Apple's recommended ways of filling the RAM slots are shown inside the machine, under the RAM slot cover. You can probably fill the slots in many ways, but I would follow Apple's diagram if I had just bought a machine like yours, and I would use identical, or at least closely matched, RAM modules in every slot. The optimum RAM configurations are the ones with 6 or 12 modules, such as 96, 192, 384, or 768 GB, or 1.5 TB.
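As a rough sketch of the arithmetic behind those totals, the snippet below lists which capacities can be reached with 6 or 12 identical modules. The list of module sizes is an assumption for illustration only, and `balanced_capacities` is a made-up helper name; check what your vendor actually sells.

```python
# Module sizes in GB. This list is an assumption for illustration,
# not a statement of what Apple or any vendor offers.
MODULE_SIZES_GB = [8, 16, 32, 64, 128]


def balanced_capacities(module_sizes=MODULE_SIZES_GB, counts=(6, 12)):
    """Map total capacity in GB to the (count, module_size_gb) pairs that reach it."""
    options = {}
    for size in module_sizes:
        for count in counts:
            options.setdefault(size * count, []).append((count, size))
    # Sort by total capacity so the output reads from small to large.
    return dict(sorted(options.items()))


if __name__ == "__main__":
    for total_gb, configs in balanced_capacities().items():
        print(total_gb, configs)
```

With these assumed sizes, 96 GB comes out as either 12 x 8 GB or 6 x 16 GB, and 192 GB as either 12 x 16 GB or 6 x 32 GB.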
1576907160355.png
 
I have both an MP 7,1 and a Cascade Lake Xeon PC workstation that I built (logic board: Supermicro X11SPATF).

Both machines can accept the same processor, and the memory in these systems is six-channel based.

So, in a system with a single processor and 12 memory slots, you have two optimization choices:

Six matching memory modules will get you 97% of maximum possible memory bandwidth, or

Twelve matching memory modules will get you 100% of maximum possible memory bandwidth.

This is based on tests, performed by others on PCs, that verified the performance of the Xeon's built-in memory controllers with various RAM configurations.

They should act identically in the MP 7,1. The relative scores of CPU benchmarks might be a fast way of verifying memory bandwidth on Macs, so play around with the various installed configurations.
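If you want a quick number to compare between DIMM configurations, here is a very crude sketch of a copy-bandwidth timer. It is not one of the benchmarks mentioned in this thread, and a single-threaded Python copy will not come close to saturating six channels, so treat the result only as a relative figure between runs on the same machine. The function name and the default buffer size are my own choices.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N estimate of copy bandwidth in GB/s (10**9 bytes per second)."""
    # Fill with non-zero bytes so the source buffer is really resident in RAM.
    src = bytearray(b"\x01") * size_bytes
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # reads all of src and writes a same-sized copy
        best = min(best, time.perf_counter() - start)
        del dst
    # Count both the bytes read and the bytes written.
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"{copy_bandwidth_gbs():.1f} GB/s")
```

The buffer is kept larger than the CPU's last level cache on purpose, so the copy has to go to RAM rather than being served from cache.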

A likely list of memory modules compatible with our MP 7,1 can be found on Supermicro's website for the board I used:


Choices range from 16 GB to 64 GB modules.
 
Can someone say what the performance penalty of using only 4 out of the 6 memory channels is?

Thanks.
 
It depends on how the application reads and writes RAM.

For example, take a multi-threaded application that spans all the cores, with each core accessing its own chunk of the application's memory. Here, more memory channels give the best performance. The worker on each core has to compete for memory access, so the more paths (channels) to memory, the better: that reduces memory access collisions. Collisions force a core to wait, which lengthens the application's time to complete its task (time to final solution).

Many users no doubt never have to deal with that case. It is typical of scientific and engineering problems such as Computational Fluid Dynamics for solving airflows across bodies.

For most users, such as video editors, film makers, photographers, and music folk, the performance difference between having 1, 2, 3, 4, 5 or 6 memory channels will largely go unnoticed, IMO.

Needless to say, the more memory channels configured, the better.

I also suspect that having same-sized RAM modules in the slots is better than mixing sizes. Mixing sizes will not cause a problem, but it could mean that, as memory fills up, fewer and fewer channels are available for reaching the remaining memory.

For the very best memory performance, I would use same-sized DIMMs in both the 4-channel and the 6-channel case.

Using 4 DIMMs in slots 3, 5, 8 and 10 uses 2 memory channels.
Using 8 DIMMs in slots 3, 4, 5, 6, 7, 8, 9 and 10 uses 4 memory channels.
Using 12 DIMMs in slots 1 through 12 uses 6 memory channels.
 
According to this technical paper: https://lenovopress.com/lp1089-balanced-memory-configurations-xeon-scalable-gen-2

Running with 4 DIMMs in 4 channels gives 66% of maximum bandwidth.
Running with 8 DIMMs in 4 channels gives 69% of maximum bandwidth.
That's theoretical bandwidth, not application performance. Because of the caches, there is much less effect on typical applications.
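For a feel of where a figure like 66% comes from, here is a back-of-the-envelope sketch of theoretical peak bandwidth per populated channel. It assumes DDR4 at 2933 MT/s and a 64-bit (8-byte) data path per channel; your CPU may run the memory slower, and the function name is mine, not from the paper.

```python
BYTES_PER_TRANSFER = 8  # one DDR4 channel has a 64-bit data path


def peak_bandwidth_gbs(channels, mega_transfers_per_s=2933):
    """Theoretical peak in GB/s (10**9 bytes/s) for the populated channels.

    The 2933 MT/s default is an assumption; pass your actual memory speed.
    """
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


if __name__ == "__main__":
    full = peak_bandwidth_gbs(6)
    for ch in (2, 4, 6):
        bw = peak_bandwidth_gbs(ch)
        print(f"{ch} channels: {bw:.1f} GB/s ({bw / full:.0%} of the 6-channel peak)")
```

Four of six channels is two thirds of the peak on paper, which lines up with the roughly 66% quoted for 4 DIMMs in 4 channels.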

Here's a thread where an MP6,1 was tested with 1, 2, 3, and 4 DIMMs (quad channel system). Very little effect, even with 1 DIMM.


Memory bandwidth benchmarks are written to effectively bypass cache, whereas most programs are written to optimize cache use.
 

As @bxs says, it's all dependent on your exact workload. Finite element simulation is one of the most bandwidth limited problems you can get.

But the thing to be aware of is that if you're using large amounts of memory, like 96 GB+ as the OP is saying, the effectiveness of cache plummets, since the relative size of the cache drops. Benchmarks where the working set is a few GB are not representative of those where the working set goes to tens-hundreds of GB.

Put another way, you're buying 128 GB of RAM exactly because you have 128 GB of data you need to access very rapidly.

Another problem is virtualization where you're running multiple workloads with conflicting memory access patterns that result in cache thrashing, which is why the server people care about raw bandwidth.
 
Based upon my experience with 2nd Gen Scalable Xeons (which the MP7,1 uses) installed in PCs, here are the theoretical maximum memory bandwidth figures I learned about while building a 12-memory-slot, single-processor system (just like the MP7,1):

2 DIMMS ~ 35 %
4 DIMMS ~ 67 %
6 DIMMS ~ 97 %
8 DIMMS ~ 68 %
10 DIMMS ~ 67 %
12 DIMMS ~ 100 %

We really won't know the actual results until someone benches all these configurations with an MP7,1, but it should be a slam dunk to install 6 or 12 matching memory modules. The memory controllers in the Xeon obviously like modules in groups of six.
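To make the pattern in that list easier to query, here is a small sketch that stores the quoted percentages in a lookup table. The numbers are the approximate figures from this post, not measurements on an MP7,1, and the helper names are made up.

```python
# Approximate share of peak bandwidth by DIMM count, as quoted in the post above.
RELATIVE_BANDWIDTH = {2: 0.35, 4: 0.67, 6: 0.97, 8: 0.68, 10: 0.67, 12: 1.00}


def smallest_count_reaching(target, table=RELATIVE_BANDWIDTH):
    """Smallest DIMM count whose quoted bandwidth share is >= target, else None."""
    candidates = [n for n, share in table.items() if share >= target]
    return min(candidates) if candidates else None


def counts_slower_than_fewer_dimms(table=RELATIVE_BANDWIDTH):
    """DIMM counts that score lower than some configuration with fewer DIMMs."""
    return sorted(
        n for n, share in table.items()
        if any(m < n and table[m] > share for m in table)
    )


if __name__ == "__main__":
    print(smallest_count_reaching(0.9))
    print(counts_slower_than_fewer_dimms())
```

By these figures, 6 DIMMs is the smallest count that gets above 90%, and 8 or 10 DIMMs both score lower than 6.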

Run a CPU benchmark; it should take memory bandwidth into account.
 
But the thing to be aware of is that if you're using large amounts of memory, like 96 GB+ as the OP is saying, the effectiveness of cache plummets, since the relative size of the cache drops.
Memory access patterns are much more relevant than the size of the working set of the application.

The 16-core W-3245 has 22 MiB of last level cache. Even modest working sets are far larger than the last level cache (especially for multi-threaded apps where the cache is more or less divided among the active threads).

If the application doesn't randomly jump around memory, the less than 1 MiB cache per thread can be very effective. If it bounces all around memory for code and data, then raw memory bandwidth becomes more important. (Note that cache is organized in 64 byte chunks - if you need one byte that's not in cache, it discards 64 bytes of cached data, and reads 64 bytes that contain the desired byte into the cache.) (…and yes, fluid dynamics and some other HPC codes bounce around a lot, and need raw bandwidth.)
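A toy illustration of the 64-byte cache line point: the sketch below counts how many distinct cache lines a run of reads touches. The access patterns, the 8-byte value size, and the 4 KiB stride are invented for the example, and real caches are more complicated than this (associativity, prefetchers, and so on).

```python
CACHE_LINE_BYTES = 64


def cache_lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines covered by a sequence of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


n = 1024  # number of 8-byte values read in each pattern

# Neighbouring values: eight of them share each 64-byte line.
sequential = [i * 8 for i in range(n)]

# Hypothetical scattered pattern: one value every 4 KiB, so every read is a new line.
strided = [i * 4096 for i in range(n)]

if __name__ == "__main__":
    print(cache_lines_touched(sequential))  # 128 lines, 8 KiB of memory traffic
    print(cache_lines_touched(strided))     # 1024 lines, 64 KiB of memory traffic
```

Both patterns use the same 8 KiB of data, but the scattered one pulls eight times as many bytes from memory, which is why access pattern matters so much.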

I have some ML applications that have incredibly good memory locality. They even page well. One job was crashing with out-of-memory errors on a 512 GiB system. I added a 2 TB Intel DC PCIe NVMe SSD to the system, and put a 2 TB pagefile on it. It grew into the pagefile, and ran great. I later tried the same job on a similar system with 2 TiB of actual RAM, and it was only 20% faster than the system that was paging. (I ordered more Intel DC SSDs!)
We really won't know the actual results until someone benches all these configurations with a MP7,1
It would be great if someone could repeat a test like was done in https://forums.macrumors.com/thread...bservations-with-various-mem-configs.1704700/, using real applications and not "bandwidth virus" benchmarks.
 
Aiden, let me ask you a question. You have a lot of experience deploying some pretty awesome servers.

Is there some sort of circuit design that a company like Apple could employ onboard that would actually improve memory performance, relative to the built-in memory controllers of the Xeon? I would assume the Xeon already offers an optimized situation here, as a separate controller chip is no longer required. I'm thinking along the lines of, say, how a switch chip improves PCIe lane optimization from the user's standpoint.
 
Nope; two channels. 3/10 & 5/8.

Hmmm, that doesn't make sense. Why would you leave four channels empty, and double up DIMMs in a channel requiring selecting between them with chip select signals? Well, OK...

(found a diagram supporting your view)

It shows that six DIMMs in Apple's recommended arrangement would have them in three channels. That's three channels unused, a total memory path 192 bits wide, and half of the DIMMs inaccessible because they are hooked to the "opposite" chip select.

(Time to check in the horse's mouth: Found a video showing Apple's new "About this Mac" memory pane)

Slot Channel DIMM
1 F 1
2 F 2
3 E 1
4 E 2
5 D 1
6 D 2
-Xeon-
7 A 2
8 A 1
9 B 2
10 B 1
11 C 2
12 C 1

Thus, 4 DIMMs in slots 3, 5, 8 and 10: E1, D1, A1, and B1. That's four channels: A, B, D, and E.

Also, 8 DIMMs in slots 3, 4, 5, 6, 7, 8, 9 and 10: E1, E2, D1, D2, A2, A1, B2, and B1. That's the same four channels: A, B, D, and E.
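The slot table above can be turned into a small lookup so any population can be checked the same way. This is only a sketch built from the table as transcribed here; the function name is made up.

```python
# Slot number -> memory channel, copied from the table above.
SLOT_TO_CHANNEL = {
    1: "F", 2: "F", 3: "E", 4: "E", 5: "D", 6: "D",
    7: "A", 8: "A", 9: "B", 10: "B", 11: "C", 12: "C",
}


def channels_used(slots):
    """Return the sorted list of channels populated by the given slot numbers."""
    return sorted({SLOT_TO_CHANNEL[slot] for slot in slots})


if __name__ == "__main__":
    print(channels_used([3, 5, 8, 10]))              # 4 DIMMs
    print(channels_used([3, 4, 5, 6, 7, 8, 9, 10]))  # 8 DIMMs
    print(channels_used(range(1, 13)))               # 12 DIMMs
```

Both the 4-DIMM and the 8-DIMM populations come out as channels A, B, D and E, and only the full 12 slots reach all six.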
 
Is there some sort of circuit design a company like Apple could employ onboard that would actually improve memory performance
No expert, but I would expect no. The reason we no longer have Northbridges is that it's a huge win for performance to have the memory controller on the CPU die.

I have several quad socket servers with E7-8890 v3 processors - 72 cores, 144 threads per server. Each socket has four memory channels. The servers have 96 DIMM slots, or 24 per 4 channel CPU. There's some kind of magic going on - but it's to increase the number of DIMM sockets, not to improve the performance.

cat.jpg

Note that empty DIMM sockets are hidden - only eight of the twelve sockets per memory board are occupied.
 
2 TB of main system RAM with your beautiful servers. I recall when you deployed them. 3 years ago?

Strange... the Gold 6212U that I installed in the MP7,1 has been reinstalled into a Supermicro X11SPATF logic board that can handle 3 TB of 3DS LRDIMM DDR4 ECC memory at 2933 MHz. 12 memory slots. She can handle only a single 2nd Gen Scalable Xeon, but can very nicely handle four 300W-class compute GPUs and has 3 x 8-pin EPS connectors to power everything at the board. I never did that before. The board cost me around $500 brand new and I like it almost as much as my new Mac.
 
In a perfect world, Apple would sell us top quality memory at 10% over street prices so we could BTO whatever amount made sense for our individual use cases.
Since the reality is a far larger price delta, many of us will opt for 3rd party sticks. With the Westmere 5,1s, mixing RAM from different vendors, even if matched in capacity, was sub-optimal...
Are the newer 7,1s sensitive to mixing sticks? Would it be best practice to just pull the Apple-supplied DIMMs and put in 6 or 12 matched ones?
I'm considering getting 96GB in my BTO, which (according to Apple's site) would be 6 x 16GB with 6 open slots. Then, after the initial demand bubble dissipates, I would shop for 6 more 16GB sticks to put in the other 6 slots, getting me to 192GB. If they play well together, it might be my best route, but if they don't...
The other option would be to get the 32GB minimum from Apple, order 192GB of 3rd party sticks, pull the Apple RAM, and store it in case I need a warranty repair. The downside would be paying peak prices for 3rd party RAM during the immediate post-7,1-release time frame.
Thoughts, anyone...
 
2TB of main system ram with your beautiful servers . I recall when you deployed them . 3 years ago ?
They were installed early winter 2015 (Dec 2015). They had quad GTX 980Ti GPUs and 992 GiB of RAM (the Maxwell cards could not address 1 TiB of RAM, so we pulled one 32 GiB DIMM).

Upgraded to quad Pascal GTX 1080Ti, and bumped the RAM to 2 TiB. Now at quad Quadro RTX 6000 cards, still 2 TiB.

PCIe slots are wonderful.

Looking at retiring them now - the VNNI instructions look very interesting! Probably will go with dual socket quad GPU - more cores per socket now, and our work is CUDA-based.
 