...I'm not an expert on this so someone else might correct me, but, if the memory controller even supports an arrangement of 4, 4, 16, -
If Apple continues its trend of using unbuffered modules for the smaller DIMM capacities, then that isn't going to work. Mixing unbuffered and registered (buffered) DIMM modules isn't supported.
If the 8GB modules are registered (buffered), then you may want to mix and match with 16GB modules later. But if you know you are headed to 16/32GB DIMMs long term, buying the 4GB config is very likely more cost effective in the end. You just have to resign yourself to either repurposing or selling those 4GB modules.
, then best case you're only going to get multi-channel performance across 12GB (3x4GB), although it could default to single channel mode for the whole lot (no striping).
The notion of channels here is horribly muddled. There are four pathways (controllers) to DIMMs in the Xeon E5 packages. If you plug in 4 DIMMs (one on each of the four pathways) you will be using all four. If you only plug in two DIMMs you will be leaving two of those pathways (controllers) idle [essentially flushing potential bandwidth down the drain].
If any of those four pathways has multiple DIMM sockets assigned to it, the timing on all four controllers/pathways is set to a different timing model so each one can juggle talking to more than one DIMM. There are still four pathways. The notion that things somehow go single file across all the DIMMs is flawed. Down a single pathway only one DIMM is accessible at a time; that one path is single file, but there are multiple paths to memory.
On the new Mac Pro that is moot. There are four DIMM slots and four pathways, so one controller never has to juggle more than one DIMM. That is aligned with the future, since in DDR4 one DIMM per pathway is the normal mode.
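To put rough numbers on why empty pathways matter, here is a minimal back-of-the-envelope sketch. The DDR3-1866 transfer rate and 64-bit channel width are illustrative assumptions, not the spec of any particular box, but the scaling is the point: theoretical peak bandwidth grows with the number of populated pathways.

```c
/* Back-of-the-envelope sketch: peak bandwidth vs. populated pathways.
 * The 1866 MT/s rate and 64-bit pathway width are assumed, illustrative
 * figures, not measurements of a specific machine. */
#include <stdio.h>

int main(void) {
    const double mt_per_sec     = 1866e6;  /* assumed DDR3-1866 transfer rate */
    const double bytes_per_xfer = 8.0;     /* 64-bit pathway = 8 bytes per transfer */
    const double per_path_gbs   = mt_per_sec * bytes_per_xfer / 1e9;

    for (int populated = 1; populated <= 4; populated++) {
        printf("%d of 4 pathways populated: ~%.1f GB/s theoretical peak\n",
               populated, populated * per_path_gbs);
    }
    return 0;
}
```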
However, the performance impact may only be a few percent if your data sets normally fit into Intel's enormous L3 cache. Back in the FSB days, Intel started using ridiculous L3 cache sizes on their CPUs to mask underlying issues with memory performance...
And the 10-30MB L3 caches of the current Xeon E5 v2 are puny compared to "back in the day". Not.
Pushing all memory accesses through a single "front side bus" is bad because it sets up a choke point. 1-2 cores can share a single access resource. Maybe 2-4. But 6, 8, 10, 12 cores all end up hammering on the same "door" to and from memory. A large room with many people inside will have multiple exits because it is hard to get lots of folks through one door, especially if there is in/out traffic. One way of decreasing traffic jams is to have multiple doors/lanes/paths.
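To make the one-door picture concrete, here is a rough STREAM-triad-style sketch (my own illustrative code, not a calibrated benchmark; it assumes GCC with OpenMP). Past a certain thread count the measured GB/s flattens out, because every extra core is just queuing up at the same memory paths.

```c
/* Rough STREAM-triad-style sketch: each added thread puts more pressure
 * on the same memory pathways, so measured GB/s levels off once the
 * paths saturate, no matter how many cores you throw at it.
 * Build (assumption: GCC with OpenMP): gcc -O2 -fopenmp triad.c -o triad */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (32L * 1024 * 1024)   /* doubles per array: ~256MB each, far past any L3 */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();

        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];    /* triad: two reads + one write per element */

        double secs  = omp_get_wtime() - t0;
        double bytes = 3.0 * N * sizeof(double);
        printf("%2d thread(s): ~%.1f GB/s\n", threads, bytes / secs / 1e9);
    }
    free(a); free(b); free(c);
    return 0;
}
```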
And even now that they've dramatically improved memory performance with an on-die controller, they still use oversized L3 caches, so any issues with the way you configure your memory will have little impact on overall performance.
The gross assumption here is that prefetch prediction will fill the L3 cache ahead of time, or that the problem is only L3-sized. 20MB of data isn't a lot, and even less if it is being split 6 ways.
The on-die controller is there because the memory pressure that 4+ cores inflict on the memory pathway(s) is quite heavy. You don't want the memory access bandwidth shared. Just as the number of cores is increasing to bring new levels of parallelism, you also have to increase the parallelism in the I/O subsystem to go along with it. Otherwise you are merely creating chokepoints. The L3 cache isn't going to make the chokepoint go away, because it is filled from the exact same choked pathway to memory.
That is just a small band-aid on the root cause issue.
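For a rough sense of scale (the 25MB L3 and 512MB working set below are illustrative assumptions, not measurements), split the shared cache across active cores and compare each core's slice with a data set that has to come over that same choked pathway:

```c
/* Tiny arithmetic sketch of why a big L3 is still a band-aid: divide an
 * assumed 25MB shared L3 across active cores and compare each slice with
 * a hypothetical 512MB working set. Both numbers are illustrative. */
#include <stdio.h>

int main(void) {
    const double l3_mb          = 25.0;    /* assumed shared L3 size */
    const double working_set_mb = 512.0;   /* hypothetical per-job working set */
    const int core_counts[]     = {1, 2, 4, 6, 8, 12};

    for (int i = 0; i < 6; i++) {
        int cores    = core_counts[i];
        double share = l3_mb / cores;
        printf("%2d active cores: ~%.1f MB of L3 each, vs a %.0f MB working set "
               "(%.0fx larger than the cache slice)\n",
               cores, share, working_set_mb, working_set_mb / share);
    }
    return 0;
}
```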
Another point worth keeping in mind is that, in most cases, more RAM is better than faster RAM.
There is a lower limiting boundary you need to cross with faster RAM. But right... which is why it is better to have parallel paths to DIMMs (relatively cheap RAM, so you can buy more) than to just crank up the L3 (relatively very expensive RAM). Applied as a mindless rule, though, this would also put cheaper storage capacity ahead of RAM capacity. It almost always pays to have some faster RAM.
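As a ballpark illustration of why that mindless rule breaks down (the latency figures here are assumed orders of magnitude, not measurements of any specific hardware), a working set that spills out of RAM onto storage pays roughly a thousand-fold penalty per miss:

```c
/* Ballpark sketch: once an app spills out of RAM and starts paging, each
 * miss costs storage latency instead of DRAM latency. The figures are
 * rough assumed orders of magnitude, not benchmarks. */
#include <stdio.h>

int main(void) {
    const double dram_ns = 100.0;       /* assumed DRAM access, ~100 ns */
    const double ssd_ns  = 100000.0;    /* assumed SSD page-in, ~100 us */

    printf("DRAM hit : ~%.0f ns\n", dram_ns);
    printf("SSD page : ~%.0f ns (about %.0fx slower)\n", ssd_ns, ssd_ns / dram_ns);
    return 0;
}
```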
So if you need to use mismatched DIMMs to achieve your application's sweet spot, that's better than starving your apps for the sake of performance.
The real core issue folks have to let go of is the idea that they "have to" keep some subset of the old DIMMs in place. Some upgrades mean chucking all of the old stuff and starting over. Not all mismatches work well; extremely skewed mixes like 2GB and 32GB DIMMs are likely to run into pragmatic problems. There is a range between everything identically matched and matching not mattering at all.