Then why do the Volta architecture whitepaper and the CUDA programming guides show otherwise?

Let me quote something from the Whitepaper:

Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision product that is then accumulated using FP32 addition with the other intermediate products for a 4x4x4 matrix multiply (see Figure 9). In practice, Tensor Cores are used to perform much larger 2D or higher dimensional matrix operations, built up from these smaller elements.

So it is FP16 processed with FP32 accumulation, on Tensor Cores. There is no 2xFP16 in Volta. The whitepaper is linked on the previous page of the thread, at the very top of it.
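The operation the whitepaper describes is easy to sketch. Here is a minimal Python illustration (not Nvidia code; the `to_fp16` helper just rounds through IEEE half precision using `struct`) of a 4x4 mixed-precision multiply-accumulate: FP16 inputs, full-precision accumulator:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mma_4x4(a, b, c):
    """D = A*B + C for 4x4 matrices: FP16 inputs, full-precision accumulation.

    Mirrors the whitepaper's description: each FP16 multiply yields a
    full-precision product, which is then accumulated with FP32-style addition.
    """
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = c[i][j]  # accumulator stays at full precision
            for k in range(4):
                acc += to_fp16(a[i][k]) * to_fp16(b[k][j])
            d[i][j] = acc
    return d

# Identity x M + 0 should give back M, rounded to FP16 on input.
ident = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
m = [[0.1 * (4 * i + j + 1) for j in range(4)] for i in range(4)]
zero = [[0.0] * 4 for _ in range(4)]
result = mma_4x4(ident, m, zero)
```

The hardware performs this whole 4x4x4 step in one clock per Tensor Core; the point of the sketch is only the dataflow: inputs rounded to FP16, products and sums kept wide.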

This is also a clue to what Nvidia will use for its consumer GPU architecture: GP100 reused, not GV100, because of the emerging focus on FP16x2 in the gaming market.

That's talking about the Tensor Core use of FP16, not regular FP16 though. All the other documentation clearly shows that standard non-Tensor Core use of FP16 is 2x FP32, just like on GP100.
 
That's talking about the Tensor Core use of FP16, not regular FP16 though. All the other documentation clearly shows that standard non-Tensor Core use of FP16 is 2x FP32, just like on GP100.
Which documentation? Where do you see it in the whitepaper and CUDA Programming Guide for GV100?
 
http://www.anandtech.com/show/11559...ces-pcie-tesla-v100-available-later-this-year

Single precision 15 TFLOPs, half precision 30 TFLOPs. Tensor Cores are completely separate from that.
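Those two numbers are consistent with simple arithmetic. A back-of-envelope check in Python, assuming the published 5120 FP32 cores and a ~1455 MHz boost clock (the clock value is my assumption here):

```python
# Sanity check of the quoted V100 throughput figures.
cores = 5120            # FP32 CUDA cores on GV100 (public spec)
clock_hz = 1.455e9      # assumed ~1455 MHz boost clock
flops_per_fma = 2       # one multiply + one add per FMA

fp32_tflops = cores * flops_per_fma * clock_hz / 1e12
fp16_tflops = 2 * fp32_tflops  # double-rate FP16: two halves per 32-bit lane

print(f"FP32 ~ {fp32_tflops:.1f} TFLOPs, FP16 ~ {fp16_tflops:.1f} TFLOPs")
```

This lands on roughly 15 and 30 TFLOPs, and the announced FP16 number only works out if FP16 runs at twice the FP32 rate.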
Whitepaper. Not Anandtech claims. They are journalists, and they can be clueless, or misled by Nvidia marketing.

https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf Here is Volta whitepaper. I have read it from top to bottom. There is NO WORD ABOUT FP16x2 here. Only FP32 TFLOPs, FP64, and Tensor Core TFLOPs.

Volta does NOT HAVE FP16x2, regardless of what Anandtech tells you. The CUDA Programming Guide and the Volta whitepaper contradict it.

Next time, use the architecture whitepaper and official documentation, not web articles, as your sources.
 
Not sure why you are so fixated on the possibility of Volta not having 2x FP16 when it already has dedicated Tensor Cores for deep learning.
FP16 for gaming is most useful in mobile gaming, because high-quality visuals are not in demand there. FP16 is not suddenly going to mean much in console/PC gaming in general.
 
Not sure why you are so fixated on the possibility of Volta not having 2x FP16 when it already has dedicated Tensor Cores for deep learning.
FP16 for gaming is most useful in mobile gaming, because high-quality visuals are not in demand there. FP16 is not suddenly going to mean much in console/PC gaming in general.
I'm not fixated :). I just love technical stuff at the hardware level of GPU architectures. The thing for me is that Volta is a much bigger change at the architecture level than, for example, the shift from Kepler to Maxwell was. If I were a consumer I would definitely want Volta instead of a reused/repurposed GP100 architecture layout.

It is actually funny how quickly the usual suspects jumped out thinking that I am attacking Volta when I posted the quote without any comment, because it contained everything you need to know, and in fact it is a good idea to process FP16 using Tensor Cores with full FP32 precision.

The more I read the Volta whitepaper, the more torn I am, because unfortunately I know that Nvidia will not use this architecture in the consumer space, and will repurpose/reuse the GP100 architecture :(. Imagine that a 1024-CUDA-core Volta chip at 1.5 GHz, with 3 TFLOPs of FP32 and a 192-bit memory bus, could bring 80-90% of the gaming performance of... a GTX 980 Ti. In a 75W TDP. This is the scale of increase we are talking about with the Volta architecture. And this GPU would most likely be sold as a GV107(!).

Independent thread scheduling; partitioned cores for higher throughput and better scheduling (better utilization); overall higher throughput of the cores, on par with GCN currently; a better cache layout; and separate INT and FP pipelines for higher throughput. I do not care about Tensor Cores and the machine learning stuff.

First Nvidia architecture that I am truly content with, from hardware level perspective.
 
Whitepaper. Not Anandtech claims. They are journalists, and they can be clueless, or misled by Nvidia marketing.

All the information you're quoting comes from a chapter titled "Tensor Cores", perhaps that should give you a clue about what they are talking about. The programming guides are specifically talking about the Tensor Cores and how to best use them, since they are new with Volta. Feel free to disregard all the information that suggests GV100 has double-rate FP16 just like GP100 did. Clearly you know more than all the tech journalists who have gotten their information directly from NVIDIA.
 
All the information you're quoting comes from a chapter titled "Tensor Cores", perhaps that should give you a clue about what they are talking about. The programming guides are specifically talking about the Tensor Cores and how to best use them, since they are new with Volta. Feel free to disregard all the information that suggests GV100 has double-rate FP16 just like GP100 did. Clearly you know more than all the tech journalists who have gotten their information directly from NVIDIA.
Yes, and I do also get this information directly from Nvidia.

Show me where it is in the documentation of the GPU architecture. As simple as that.

Let me quote that guy who brought this to attention:

#249
I have contacted a local GPU supplier in Beijing. The listed price of a Tesla V100 is a bit cheaper than I thought, and it will become available sooner as well.

The Tesla V100 PCIe will cost about the same as the Tesla P100 PCIe did at launch, about 10%-20% more than it costs now, and it will become available in China next month.

As for the FP16 rate, according to their tech manager, it seems that besides the Tensor Cores' mixed-precision computation, V100 doesn't have the 2x FP16 rate like P100 does. This is also confirmed in the CUDA 9.0 RC programming guide: the FP16 rate is the same as the FP32 rate. So in V100 Nvidia moved all its low-precision DL stuff into Tensor Cores (with better precision), which I think is a good idea.
 
Show me where it is for GP100 then?
https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf
Whitepaper

Official Nvidia announcement of GV100 and the Volta architecture:
https://devblogs.nvidia.com/parallelforall/inside-volta/

Official Nvidia announcement of CUDA 9
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

Wake me up when you find any word about FP16x2 in Volta. There is only information about FP32, FP64, and Tensor Core TFLOPs.

There is no FP16x2 in Volta.

It is as described: FP16 is processed with full FP32 precision, but thanks to Tensor Cores you get many more operations, and therefore higher TFLOPs performance for machine learning.

As for GP100:
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

6th page of the whitepaper:
5.3 TFLOPS of double precision floating point (FP64) performance
10.6 TFLOPS of single precision (FP32) performance
21.2 TFLOPS of half-precision (FP16) performance
Badum tss.
 
There is no FP16x2 in Volta.

So why does literally every post on the internet about the Volta announcement list 15 TFLOPs of FP32, 30 TFLOPs of FP16 and 120 TFLOPs via the Tensor Cores? I really don't know why I'm bothering to discuss this with you, please feel free to ignore all the information on the web and choose to focus on all the documentation about Tensor Cores instead.
 
So why does literally every post on the internet about the Volta announcement list 15 TFLOPs of FP32, 30 TFLOPs of FP16 and 120 TFLOPs via the Tensor Cores? I really don't know why I'm bothering to discuss this with you, please feel free to ignore all the information on the web and choose to focus on all the documentation about Tensor Cores instead.
[Image: Tesla V100 slide from GTC 2017]

This is from GTC 2017. Do you see any mention of FP16?

Have you ever thought that journalists can be clueless about what they are writing about? Nvidia's own documentation, the whitepaper and the CUDA 9.0 Programming Guide, does not mention anything about FP16x2.

Short version: Nvidia does not claim that Volta has FP16x2. Only journalists do. All official Nvidia documentation says it has only FP32, FP64, and FP16 processed on Tensor Cores, for higher performance and higher precision at the same time.
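For what it's worth, the 120 TFLOPs Tensor Core figure on that slide checks out from the published core counts alone. A quick Python sanity check, assuming 80 SMs with 8 Tensor Cores each and a ~1455 MHz clock (the clock is my assumption):

```python
# Sanity check of the 120 TFLOPs Tensor Core figure.
tensor_cores = 80 * 8          # 80 SMs x 8 Tensor Cores per SM (public spec)
fmas_per_clock = 4 * 4 * 4     # one 4x4x4 matrix step = 64 FMAs per core
flops_per_fma = 2              # multiply + accumulate
clock_hz = 1.455e9             # assumed boost clock

tensor_tflops = tensor_cores * fmas_per_clock * flops_per_fma * clock_hz / 1e12
print(f"Tensor ~ {tensor_tflops:.0f} TFLOPs")
```

So the slide's ~120 TFLOPs is fully accounted for by the Tensor Cores themselves, with no FP16x2 path needed for that number.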
 
First rumors about Volta. The first GPU to be released will be GV104. If nothing goes wrong, it will be released around February next year. Supposedly it scores between 16,000 and 17,000 points in the 3DMark Fire Strike Extreme graphics test. Non-overclocked reference models of the GTX 1080 Ti score... around 14K in the same test.

The die size is supposedly around 398 mm², but the TDP is around 180W, the same as GP104.

And lastly, there is a chance that Nvidia will adopt GDDR6 on all products, including low-end. Consider this a rumor, as always.
 
This is from GTC 2017. Do you see any mention of FP16?
Wow.

Please go re-read that infamous whitepaper you've linked to many times.

Go to page 50, where it says:

Research has shown that inferencing using half-precision FP16 data delivers the same classification accuracy as FP32. With FP16 data types, inference throughput can be increased up to 2x in the Pascal GPU and Tegra X1 SoC architectures compared to using FP32 data types. Inferencing can also be done using 8-bit integer (INT8) precision with minimal loss in accuracy while tremendously accelerating inference throughput.

Recognizing these benefits, NVIDIA’s previous Pascal GP100 architecture included native support for the FP16 data format, and other Pascal-based GPUs such as the NVIDIA Tesla P40 and NVIDIA Tesla P4 also included support for INT8 to further accelerate inference performance.

The Pascal GP100-based Tesla P100 card delivers 21.2 TFLOPS of FP16 performance. GPUs such as the NVIDIA Tesla P40 that support INT8 computation deliver nearly 48 INT8 TOPS of performance, further boosting the inference performance of datacenter servers. And as described earlier in this whitepaper, Volta’s Tensor Cores take performance to a whole new level, with up to 120 TFLOPS of compute performance for both inferencing and training.

https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf - page 50

Does this say the Volta has 2X FP32? No.

Does it say that Volta does not have 2X FP32? No.

Does it say that Volta has general purpose FP16? No.

Does it say that Volta does not have general purpose FP16? No.

Tensor cores can only multiply FP16 - no add/subtract/divide, and Tensor cores only do matrix operations.

Tensor cores cannot substitute for general purpose FP16. Would Nvidia go backwards from Pascal with Volta?

That white paper is vague and ambiguous, and probably a bad cut-n-paste from a Pascal whitepaper.
 
Wow.

Please go re-read that infamous whitepaper you've linked to many times.

Go to page 50, where it says:



Does this say the Volta has 2X FP32? No.

Does it say that Volta does not have 2X FP32? No.

Does it say that Volta has general purpose FP16? No.

Does it say that Volta does not have general purpose FP16? No.

Tensor cores can only multiply FP16 - no add/subtract/divide, and Tensor cores only do matrix operations.

Tensor cores cannot substitute for general purpose FP16. Would Nvidia go backwards from Pascal with Volta?

That white paper is vague and ambiguous, and probably a bad cut-n-paste from a Pascal whitepaper.
Nowhere does it say that it supports FP16x2, either.

In the bolded part you have described exactly what is happening with the performance of the cores. What you forgot to check is the CUDA 9 Programming Guide, its features, and its focus on tensor operations. Yes, they can only multiply. And that is what they are doing, and the reason why you get 120 TFLOPs on top of single-rate FP32, FP16, INT8 and FP64 operations. The cores do not have to support 2xFP16 from now on; it's done by the Tensor Cores. And it's called mixed precision for a very good reason. This is also one of the main reasons why I am finally content with an Nvidia architecture at the hardware level, as I mentioned earlier in my posts: because it is finally not yesterday's tech, but future-proof.

It's baffling that you suggest Nvidia went backwards from Pascal with Volta, when it's absolutely the opposite.

And no, the Volta whitepaper is not a bad copy-and-paste of the Pascal one. You should've known that.

Ask yourself: why does Nvidia mention the FP16 performance of GP100 in that whitepaper, but not the FP16 performance of GV100?
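And "with better precision" is not marketing fluff; accumulating FP16 products in a wide accumulator really does matter. A small Python demonstration (fp16 rounding simulated via `struct`; the stall point is a property of IEEE half precision, not of any particular GPU):

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Sum 4096 copies of fp16(0.1) two ways: rounding every partial sum to
# FP16 (a pure-FP16 pipeline), and accumulating at full precision
# (what the whitepaper describes for Tensor Cores).
x = to_fp16(0.1)
acc16 = 0.0
acc32 = 0.0
for _ in range(4096):
    acc16 = to_fp16(acc16 + x)  # every partial sum rounded to FP16
    acc32 = acc32 + x           # wide accumulator

# Once acc16 reaches 256, the FP16 spacing there (0.25) exceeds the
# increment, so round-to-nearest drops every further addition and the
# sum stalls, while the wide accumulator reaches the true total (~409.5).
print(acc16, acc32)
```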
 
Nowhere does it say that it supports FP16x2, either.
It doesn't say that it supports general FP16 at all.

Why aren't you arguing that there's no general FP16 support - since none of the documents that you link to claim FP16 support outside of very limited Tensor operations?
This is from GTC 2017. Do you see any mention of FP16?
The slide is full. It hits on the new features in Volta - you don't make a wall-of-text slide that lists everything.

It doesn't say that it supports INT8 and INT32, but we know it does. The lack of a bullet point on a keynote slide does not mean that it doesn't exist - it could simply mean that the slide was full and reiterating features from the last generation isn't necessary.
 
It doesn't say that it supports general FP16 at all.

Why aren't you arguing that there's no general FP16 support - since none of the documents that you link to claim FP16 support outside of very limited Tensor operations?
Actually that is what I am arguing ;).

Why do you think Nvidia claims that the SMs in Volta are 50% more efficient? Because they have taken out a feature that was limiting efficiency. It is one of the reasons why the Vega design is inefficient when pushed out of its frequency curve: FP16x2 processed on FP32 hardware adds inefficiency at the hardware design level. Tensor Cores are doing the job now, because they are simply a more efficient way to process machine learning operations. Tensor Cores multiply the operations, but it's software - CUDA 9.0 - that handles the whole process of transforming operations from one precision to another, while maintaining full precision.

What happens to power consumption on GP100 when you process FP32, and when you process FP16x2? ;)

It goes up. Because even if the data has less precision, the GPU has to do more work each cycle.
 
Does GV100 have fast double precision math (i.e. FP64)? Yes? Well guess what, that's the same hardware that allows them to have 2xFP16 as well.

Quoting this again for you. NVIDIA has been redesigning their SM core every generation since Kepler, with a focus on power efficiency. I guess we'll wait and see once the PCIe cards become available; since you seem completely convinced you are correct, there's not much point in discussing it with you.
 
GTX 2050 Ti will be in some games VERY close to GTX 980 Ti performance levels ;).
 
Source? Or at least some more details? What's the point of posting vague rumors like this exactly?
I cannot give you the source ;).

Well, from what I have been told, current rumors say that in some games the GTX 2050 Ti will be within 5% of GTX 980 Ti performance.

Doom, Warhammer 40000 Dawn of War, Warhammer Total War 2, and Battlefield 1 are the specific games that have been mentioned to me. The GTX 2050 Ti is supposed to be around 15% faster than the GTX 1060 on average. Overwatch is one of those cases where the GTX 2050 Ti will be 15% faster than the GTX 1060 and 15% slower than the GTX 980 Ti.

Supposedly early Q2 next year. Nvidia is targeting 80-85% of the market with its GPUs first; then we should see a follow-up at the highest end. That is their plan, at least. First GV104, then GV106 and GV107. In late Q2 or early Q3 we should see GV102.
 
Possible memory configurations for GV consumer chips:

GV104 - 256-bit GDDR6, 16 GB
GV106 - 256-bit GDDR5X, 8 GB
GV107 - 192-bit GDDR5X, 6 GB
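If those configurations are real, the implied peak bandwidth is straightforward: bus width times per-pin data rate, divided by 8 bits per byte. A sketch, assuming hypothetical per-pin rates of 14 Gbps for GDDR6 and 10 Gbps for GDDR5X (the rates are my guesses, not part of the rumor):

```python
def bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: pin count x per-pin rate / 8."""
    return bus_bits * gbps_per_pin / 8

gv104 = bandwidth_gbs(256, 14.0)  # 256-bit GDDR6, assumed 14 Gbps
gv106 = bandwidth_gbs(256, 10.0)  # 256-bit GDDR5X, assumed 10 Gbps
gv107 = bandwidth_gbs(192, 10.0)  # 192-bit GDDR5X, assumed 10 Gbps

print(gv104, gv106, gv107)  # GB/s for GV104, GV106, GV107
```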
 