Yes, the processors the 4,1 and 5,1 use are triple-channel, basically meaning they can access three modules at the same time. The bandwidth for the fourth slot (and eighth in dual-proc machines) is shared with the third slot.
So performance will take a hit if you populate the fourth/eighth slots. However, if you're actually running out of RAM and frequently paging out to disk, then you'll see better performance even with the fourth/eighth slots populated, because of the extra RAM.
Larger modules are better only in that they let you reach a higher total amount of RAM using fewer slots. Ideally you want a matched set of three (or six for DP) modules, all of the same brand and specs.
Good post, but the sentence in bold cannot be over-stressed. The actual performance hit from using the extra memory slots is quite small, and it takes carefully controlled benchmarks to see the effect. These benchmarks are usually written to effectively disable the CPU caches; typical applications that don't disable the caches see very little degradation.
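To put the slot sharing in rough numbers, here is a back-of-the-envelope sketch of theoretical peak bandwidth. It assumes DDR3-1066 modules (about 1066.67 MT/s on a 64-bit channel); the function names and the slot-to-channel layout (slot 4 sharing the third channel) are my own illustration of the description above, not measured figures.

```python
# Theoretical peak only; real applications rarely get close to this.
MT_PER_S = 1066.67        # assumption: DDR3-1066 modules
BYTES_PER_TRANSFER = 8    # 64-bit memory channel
CHANNELS = 3              # triple-channel controller per processor


def channel_bandwidth_gb_s(mt_per_s=MT_PER_S):
    """Peak bandwidth of one channel in GB/s (decimal gigabytes)."""
    return mt_per_s * BYTES_PER_TRANSFER / 1000


def per_module_peak(n_modules, channels=CHANNELS):
    """Worst-case peak per module when every module is hit at once.

    Assumed layout: slots 1..3 each get their own channel, and any
    further slot shares the last channel (slot 4 shares with slot 3).
    """
    assignment = [min(slot, channels - 1) for slot in range(n_modules)]
    bw = channel_bandwidth_gb_s()
    return [bw / assignment.count(ch) for ch in assignment]


if __name__ == "__main__":
    for n in (3, 4):
        shares = per_module_peak(n)
        print(n, "modules:", [round(s, 2) for s in shares])
```

Note that the total stays the same with three or four modules; only the two modules on the shared channel split their channel's bandwidth, which is part of why the hit is hard to see outside synthetic benchmarks.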
Also, take a good look at how your system reports memory usage. For example, my home workstation shows:
Right now it's not doing much, so most tools report about 44 GiB available. A closer look, however, shows that only about 13 GiB is unused, and about 30 GiB is used for filesystem caches. Sometimes I'll see just a few hundred MiB as "free".
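If you want to see the same split on your own machine, here is a minimal sketch, assuming a Linux box that exposes /proc/meminfo (macOS reports memory differently, so this won't run as-is on a Mac Pro under OS X). The sample text is made up to mirror the rough figures above; it is not output from my workstation, and the helper names are just for illustration.

```python
import os

# Made-up sample in /proc/meminfo format, used when the real file is absent.
SAMPLE = """\
MemTotal:       67108864 kB
MemFree:        13631488 kB
MemAvailable:   46137344 kB
Buffers:         1048576 kB
Cached:         30408704 kB
"""

KIB_PER_GIB = 1024 * 1024


def parse_meminfo(text):
    """Return {field: value in kB} for each 'Name:  12345 kB' line."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def summarize(info):
    """Split memory into truly unused, filesystem cache, and available."""
    return {
        "free_gib": info["MemFree"] / KIB_PER_GIB,
        "cache_gib": (info["Cached"] + info["Buffers"]) / KIB_PER_GIB,
        "available_gib": info["MemAvailable"] / KIB_PER_GIB,
    }


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
    else:
        text = SAMPLE
    for name, gib in summarize(parse_meminfo(text)).items():
        print(f"{name:>14}: {gib:6.1f}")
```

The point is that "available" counts cache the kernel can give back on demand, so a small "free" number by itself doesn't mean you're short of RAM.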
Depending on what you're doing, having dozens of free GiB for the caches can make Safari much snappier.