If this is indeed correct, it would be quite discouraging. How reliable are these results? I understand that they use software modeling, so these are not actual measurements. Some things I find rather surprising, like the very low latency of GDDR5; I thought it would be higher?
The results of the paper also seem hard to square with other published information. For example, Rambus marketing material claims that HBM2 offers a power advantage of 3-4x over GDDR6, which already has lower power consumption than GDDR5... Also, isn't very low I/O power supposed to be the selling point of HBM? In the simulation, the I/O power usage is through the roof...
OK, strap in, this is gonna get a bit complicated. One useful habit to pick up, if you're not already familiar with it: whenever someone quotes a "Gbps" speed for their memory, they mean the speed of a single pin. So multiply the number of pins by the Gbps figure and divide by eight (converting Gb to GB) to get the full bandwidth of the memory in GB/s.
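A minimal sketch of that rule in Python; the example devices and speeds below are ones I'm picking for illustration, not figures from the paper:

```python
def bandwidth_gb_s(pin_speed_gbps: float, num_pins: int) -> float:
    """Total bandwidth in GB/s from per-pin speed (Gb/s) and data pin count."""
    return pin_speed_gbps * num_pins / 8  # /8 converts Gb to GB

# One GDDR6 device: 32 data pins at 16 Gbps per pin
print(bandwidth_gb_s(16, 32))      # 64.0 GB/s
# One base-spec HBM2 stack: 1024 data pins at 2.0 Gbps per pin
print(bandwidth_gb_s(2.0, 1024))   # 256.0 GB/s
```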
As for the power consumption of GDDR: it's because they are only using a single module. Normally GDDR would not be configured this way; a design might use four modules (four "channels"), effectively quadrupling the number of pins, to hit high bandwidths. Look at figure five of the study for a good diagram. So a GPU might use 4 GDDR6 modules to get to roughly the same place, capacity- and bandwidth-wise, as 1 stack of HBM2E, but that's up to the designer.
HBM2, by contrast, has multiple channels within every package: two per die, with four dies stacked on top of each other in its smallest configuration, which is what's tested here. So it's no surprise that it uses more power than a single GDDR module.
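To make that concrete, here's a rough sketch of the two configurations as I read them; the per-pin speeds are base-spec numbers I'm assuming, not the settings used in the study:

```python
# One GDDR6 module (the study's setup): 32 data pins at an assumed 16 Gbps
print(32 * 16 / 8)          # 64.0 GB/s
# Four GDDR6 modules: 4x the pins, so 4x the bandwidth (and, roughly, the power)
print(4 * 32 * 16 / 8)      # 256.0 GB/s
# One 4-Hi HBM2 stack at base-spec 2.0 Gbps: 8 channels x 128 bits = 1024 pins
print(1024 * 2.0 / 8)       # 256.0 GB/s - a single stack gets there on its own
```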
Note that as you stack more dies, your bandwidth and your power consumption for HBM2E can also increase. This is why AMD's Radeon Pro 5600M advertises a 2048-bit memory bus: it's using HBM2 (not E), one GB per die, stacked eight high. Each die has 2 channels of 128 bits; 2 x 8 x 128 = 2048. You can check the math on the pin speed, total pins, and total bandwidth using the numbers on AMD's page. One thing I want to point out is how slowly they've clocked these pins - 1.54Gbps - presumably to keep the memory from melting.
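Running AMD's published numbers through the same formula (the 2-channels-per-die split is my reading from above, not how AMD describes it):

```python
channels_per_die = 2        # my reading: two 128-bit channels per die
dies = 8                    # eight dies' worth of DRAM
bits_per_channel = 128
pin_speed_gbps = 1.54       # from AMD's spec page

bus_width = channels_per_die * dies * bits_per_channel
print(bus_width)                       # 2048, matching the advertised bus width
print(bus_width * pin_speed_gbps / 8)  # 394.24, matching AMD's ~394 GB/s figure
```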
However, I don't think this is what Samsung and SK Hynix are doing. If you look at the press releases and do the math on the numbers Samsung presents - 3.2Gbps x (unknown bits) / 8 = 410GB/s - that works out to 1024 bits on what is confirmed to be an 8-Hi stack. It looks a lot like Samsung is disabling one of the two channels on each die and clocking the remaining ones up. I'm just guessing here, but I believe Samsung figured out that they could get better performance per watt this way. Since HBM2E is theoretically on the low end of the power curve (wide and slow is the philosophy), this makes sense. And this is probably why Samsung and SK Hynix are regularly exceeding the speeds set by JEDEC: they've cut the lanes in half.
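Here's the back-of-the-envelope version of that guess; the 3.2 Gbps and 410 GB/s figures are from Samsung's announcement, while the 2048-bit "expectation" comes from my channels-per-die reading, not anything Samsung has confirmed:

```python
pin_speed_gbps = 3.2        # Samsung's quoted per-pin speed
stack_bw_gb_s = 410         # Samsung's quoted per-stack bandwidth

implied_pins = stack_bw_gb_s * 8 / pin_speed_gbps
print(implied_pins)         # 1025.0 -> effectively a 1024-bit interface

# If an 8-Hi stack really exposes 2 channels x 128 bits per die,
# you'd expect 2 * 8 * 128 bits - i.e. half the channels look disabled.
print(2 * 8 * 128)          # 2048
```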
If that's true, then Samsung's and SK Hynix's HBM2E might be able to stay within 15W at some speed or another, and super-fast HBM2E could be viable as an L4 cache.
One last note. HBM2E has already delivered the speeds and capacities promised for HBM3. If what I'm saying is true, it has also delivered on the last design promise: low-power configurations with half the bandwidth. I am more certain than ever now that HBM2E is, and has always been, HBM3. But because it fell short of the lofty visions and low costs JEDEC set out for it, they chose not to promote it as such.