It sounds like you're saying that gen 3 to gen 3 switching doesn't require 128b/130b conversion? Or, more likely, the receiver needs to decode the 128b/130b to do the switching, but it doesn't need to re-encode it to pass it on to the 3.0 device? Encoding 8b/10b is much less taxing (a simple lookup for the 10b symbols)? In the case of a 3.0 host and a 2.0 device, doesn't the PLX need to encode 128b/130b when transmitting from the 2.0 device to the 3.0 host? How is that less of a problem than going from a 2.0 host to a 3.0 device?
It always needs to convert, which overall isn't really an issue beyond the added latency (more than just switching lanes within the same gen - the 2.0 chips also add latency with a 1.1 source/destination, but that is much lower, even with the older tech and the die-shrink headroom that came later). On the cheaper SKUs (or when not enough heatsink/cooling is provided - the Squid 2.0 can suffer from this in some cases, but not likely in a cMP) and at the extreme end (96+ lanes, which SHOULD have a fan or explicitly require chassis airflow, the way passive server GPUs do), you see TDP issues that can cause real performance bottlenecks. That can obviously also happen within the same gen, but it is less likely since the TDP stays low overall. This is well documented in the PLX docs, which Avago/Broadcom moved around like 5 times and then removed; Mouser and some other stores still have copies for some parts - google for the chip number (add or subtract 5 in the case of certain 'custom' SKUs like the ones on cheap eBay China boards) plus "PDF".
Customers do get access behind the login wall, but I no longer get the deeper spec sheets and the material deemed sensitive either, unlike with nearly all of the 2.0 SKUs, sadly. I might need to actually buy a minimum order quantity of the gen I want data for, but I'll wait for the first 4.0 chips, which do the same and will be more useful to me later on.
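To put rough per-lane numbers behind the encoding question quoted above - the line rates and encodings are straight from the PCIe specs, the snippet itself is just back-of-the-envelope arithmetic, nothing PLX-specific:

```python
# Per-lane payload bandwidth for each PCIe gen, showing what the
# 8b/10b vs 128b/130b encodings actually cost on the wire.
GENS = {
    # gen: (line rate in GT/s, payload bits, bits on the wire)
    "1.1": (2.5, 8, 10),     # 8b/10b    -> 20% overhead
    "2.0": (5.0, 8, 10),     # 8b/10b    -> 20% overhead
    "3.0": (8.0, 128, 130),  # 128b/130b -> ~1.5% overhead
}

for gen, (gt_s, payload, wire) in GENS.items():
    gbytes = gt_s * payload / wire / 8   # GT/s -> GB/s of payload per lane
    overhead = (1 - payload / wire) * 100
    print(f"PCIe {gen}: ~{gbytes:.3f} GB/s per lane ({overhead:.1f}% encoding overhead)")
```

Either direction means decoding one symbol scheme and re-encoding the other, so 2.0 host to 3.0 device is not fundamentally different from the reverse - what you actually pay either way is the latency and the thermal budget mentioned above.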
Enterprise servers massively overcommit PCIe and SAS/SATA lanes because the TBs are usually more important than the GB/sec.
The standard HPE ProLiant NVMe card is a 32 lane PLX switch wired as a PCIe 3.0 x8 slot to six PCIe 3.0 x4 drives. So, yes, you're limited to *only* 8 GB/sec - but that's really a boatload of bandwidth.
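Just to put that overcommitment in numbers, assuming ~0.985 GB/s of payload per PCIe 3.0 lane (after 128b/130b encoding) and the x8-uplink / six-x4-drive layout described above:

```python
# Oversubscription on an x8-uplink / six-x4-downstream NVMe switch card,
# assuming ~0.985 GB/s of payload per PCIe 3.0 lane (128b/130b encoding).
LANE_GBS = 8.0 * 128 / 130 / 8   # ~0.985 GB/s per 3.0 lane

uplink = 8 * LANE_GBS            # x8 host-side slot
downstream = 6 * 4 * LANE_GBS    # six x4 drive links

print(f"uplink:         {uplink:.1f} GB/s")
print(f"downstream:     {downstream:.1f} GB/s")
print(f"overcommitment: {downstream / uplink:.1f}:1")
```

Roughly 7.9 GB/s of uplink against ~23.6 GB/s of drives, i.e. about 3:1 - which only matters if all six drives stream sequentially at full tilt at the same time.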
In the real world, few (if any) workloads really require that every single component in the system must run at its theoretical max bandwidth. Some systems are OK with modest overcommitment, some with huge overcommitment.
The people here who diss "lane sharing" or "PCIe switches" are really victims of "can't see the forest for the trees".
PCIe - partially; only the newer NVMe-based servers do this, often simply out of the need to switch NVMe drives and PCIe slots around. EPYC can go a long way with 64-128 (-4) lanes available; Intel is just behind but has fewer "weird" issues, since Infinity Fabric is somewhat... custom... compared to what we know from Opteron and Intel's QPI/UPI. The HP card is not really industry standard here - usually you'd see an x16 uplink (like on some Supermicro cards, obviously excluding anything based on bifurcation).
SAS - depends. Boards with an LSI RAID chip, which are still fairly common, have it as 'dedicated' 12Gbit SAS on paper but more like 6Gbit in reality, which PCIe-wise is barely x8 of 3.0 for 16-24 ports.
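Rough numbers for that case, assuming 6Gbit SAS-2 links with 8b/10b encoding (so ~600 MB/s of payload per port) against an x8 3.0 uplink - these are nominal link rates, real drives sit well below them:

```python
# Nominal drive-side bandwidth of 16-24 SAS ports vs an x8 PCIe 3.0 uplink.
SAS_PORT_GBS = 6.0 * 8 / 10 / 8        # 6Gb/s SAS-2, 8b/10b -> ~0.6 GB/s per port
PCIE3_LANE_GBS = 8.0 * 128 / 130 / 8   # ~0.985 GB/s per 3.0 lane

uplink_x8 = 8 * PCIE3_LANE_GBS
for ports in (16, 24):
    drive_side = ports * SAS_PORT_GBS
    print(f"{ports} ports: ~{drive_side:.1f} GB/s nominal "
          f"behind a ~{uplink_x8:.1f} GB/s x8 uplink")
```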
SATA - never seen it. All the boards I have wire their chipset SATA 1:1, exactly as the chipset offers it; splitters are uncommon and more of a chassis-level thing, used where the chipset SATA doesn't cover it - even on enterprise that's mostly down to lacking SAS entirely.
I love PCIe switches for the flexibility they offer to split or aggregate lanes; I do a lot of work with PLX RDKs, and even Intel has touched this base more now, with e.g. Z270 having 24 lanes off the PCH (others had 'splitting' before, but not at 1:6) behind only the equivalent of an x4 3.0 uplink (with Intel optimizations on the customised PCIe bus, but nowhere near x8 or more).
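Same back-of-the-envelope math for that PCH case, treating the uplink as roughly an x4 3.0 equivalent as described above (lane counts as stated, per-lane figure is the usual 128b/130b payload rate):

```python
# PCH-style overcommitment: ~24 downstream 3.0 lanes behind an uplink
# that is roughly equivalent to x4 3.0.
LANE_GBS = 8.0 * 128 / 130 / 8   # ~0.985 GB/s per 3.0 lane

downstream = 24 * LANE_GBS
uplink = 4 * LANE_GBS
print(f"~{downstream:.1f} GB/s of PCH lanes behind ~{uplink:.1f} GB/s of uplink "
      f"(~{downstream / uplink:.0f}:1)")
```

Roughly 6:1 on paper - which, as noted above, is usually fine because the NICs, SATA and USB hanging off a PCH rarely peak at once.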
They also have very real use cases, as we see on the nMP's TB ports (naturally at 2.0) uplinked with 3.0 (well cooled, good firmware design), or at the high end (E7 SKUs, RAS, 8-socket CPU layouts) for interesting HA setups you rarely see in public - like 2x Xeon Phi PCIe x16 cards in 4-way PLX-controllable HA over 8 sockets (2 sockets per cluster, with the PLX uplinked to the 4 clusters at x16 each - an off-the-shelf 96-lane PLX SKU). Most of what you see these chips used for is REALLY worlds below what they can do, which also justified the pricing under Avago (helped along by certain PLX patents limiting availability from other manufacturers until licensing deals went live later).
Apple generally likes to use them as gen converters and for 1:1 switching. Yes, the cMP has slots 3/4 shared, but that - especially in the single-CPU config - is also a limitation of the lanes Westmere provided; overall, the lane-sharing use is rare.