Then why do the Volta architecture whitepaper and the CUDA programming guides show otherwise?
Let me quote something from the Whitepaper:
Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision product that is then accumulated using FP32 addition with the other intermediate products for a 4x4x4 matrix multiply (see Figure 9). In practice, Tensor Cores are used to perform much larger 2D or higher dimensional matrix operations, built up from these smaller elements.
So it is FP16 inputs with FP32 accumulation, on the Tensor Cores. There is no 2xFP16 in Volta. The whitepaper is on the previous page of the thread, at the very top of it.
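For reference, that FP16-in / FP32-accumulate path is what the CUDA wmma API exposes on sm_70. A minimal sketch, not taken from the whitepaper; the kernel name is illustrative, and the 16x16x16 fragments are simply the API's tile size built on top of the 4x4x4 hardware operation:

```cpp
// Sketch of the Tensor Core path the whitepaper describes: FP16 inputs,
// FP32 accumulation. Requires sm_70+ and a full warp executing the kernel.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_fp16_fp32(const half *A, const half *B, float *C) {
    // A and B fragments hold FP16 data; the accumulator fragment is FP32.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start accumulation at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C, accumulated in FP32
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```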
This is also a clue as to what Nvidia will use for its consumer GPU architecture: the GP100 approach reused, not GV100's, because of the emerging focus on FP16x2 in the gaming market.
That's talking about the Tensor Core use of FP16, not regular FP16, though. All the other documentation clearly shows that standard non-Tensor-Core FP16 runs at 2x the FP32 rate (packed FP16x2), just like on GP100.
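For comparison, that non-Tensor-Core path is the packed half2 route from cuda_fp16.h, where one instruction operates on two FP16 values at once. A rough sketch, with the kernel name and sizes purely illustrative:

```cpp
// Sketch of the FP16x2 path: two FP16 values packed into a half2 so a single
// instruction performs two FP16 FMAs (available on GP100 and GV100).
#include <cuda_fp16.h>

__global__ void fma_fp16x2(const half2 *a, const half2 *b,
                           const half2 *c, half2 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __hfma2 does two FP16 fused multiply-adds per instruction.
        out[i] = __hfma2(a[i], b[i], c[i]);
    }
}
```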