Counterpoint:

http://www.anandtech.com/show/11367...v100-gpu-and-tesla-v100-accelerator-announced



Emphasis mine. But sure, go ahead and pretend this is just a bigger Pascal and that Vega will win.
So in other words, I have written that Vega will win? ;) Good to know :D.

Does scheduling, thread execution, or core layout bring throughput to the cores themselves? Simple answer: no. Throughput of the cores depends on the resources available to them at a given time. Scheduling and layout can improve utilization, but not overall throughput.

Again: in deep learning, the GV100 chip is something new. In FP32 it is not. It has higher compute performance overall, but that is still achieved through a massively increased die size.
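For context, here is the back-of-the-envelope arithmetic behind that claim, as a minimal sketch (core counts and boost clocks are the published Tesla figures and should be treated as approximate):

```python
# Peak FP32 throughput = cores * 2 FLOPs (one FMA) per clock * boost clock.
# Clocks below are the published boost clocks for the Tesla parts; treat
# them as approximate.

def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Theoretical FP32 peak, assuming one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * boost_clock_ghz / 1000.0

gp100 = peak_fp32_tflops(3584, 1.48)   # Tesla P100 (SXM2): ~10.6 TFLOPS
gv100 = peak_fp32_tflops(5120, 1.455)  # Tesla V100: ~14.9 TFLOPS

print(f"P100: {gp100:.1f} TFLOPS, V100: {gv100:.1f} TFLOPS")
# Per core per clock both chips deliver the same 2 FLOPs, so the headline
# gain comes from more cores at a similar clock, not from wider cores.
```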
Worth a thousand words.
It appears that yes, there is merit to my question...
 
In FP32 it is not. It has higher compute performance overall, but that is still achieved through a massively increased die size.
Why do you insist on this spin, and not simply agree that 50% faster FP32 and 50% more performance per watt is a good thing?
 
Why do you insist on this spin, and not simply agree that 50% faster FP32 and 50% more performance per watt is a good thing?
It is. I was just expecting an architectural improvement in FP32 that would increase throughput core for core, clock for clock. For example, gaming performance clock for clock, core for core in Volta-branded GPUs will increase only by 30-40%. If there were such an improvement, the gap would be much wider.

P.S. The GP100 (Tesla P100) chip actually has a 235W TDP and 225W power consumption. Let's wait for the PCIe version of the GPU and its TDP to judge its efficiency, and whether this was a real improvement or just marketing from Nvidia ;).
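For what it's worth, here is the paper perf-per-watt arithmetic, as a rough sketch (assuming the 300W TDP Nvidia publishes for both SXM2 modules, which is a different figure from the 235W quoted above, and the peak FP32 numbers):

```python
# Peak-FP32-per-watt using the published SXM2 TDPs (300 W for both modules).
# This is a paper figure only; measured efficiency depends on the workload.
p100 = 10.6 / 300 * 1000   # ~35 GFLOPS/W
v100 = 15.0 / 300 * 1000   # ~50 GFLOPS/W

print(f"P100: {p100:.0f} GFLOPS/W, V100: {v100:.0f} GFLOPS/W "
      f"(+{v100 / p100 - 1:.0%})")
# Roughly +40% on paper; whether the "50%" marketing figure refers to
# measured application performance or something else is not clear from
# the announcement.
```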

Why do you spin my words in this particular way?
 
It is. I was just expecting an architectural improvement in FP32 that would increase throughput core for core, clock for clock. For example, gaming performance clock for clock, core for core in Volta-branded GPUs will increase only by 30-40%. If there were such an improvement, the gap would be much wider.

Why do you spin my words in this particular way?

What's the improvement in core for core, clock for clock with Vega? I also don't know why this is a valid or interesting metric; GPUs have been increasing core counts and clock speeds for many generations as a fundamental way of improving performance. That's one of the main benefits of going to a new process node, right? You can build more transistors and thus fit more cores?

I'll repost this since you seem to have missed it:

Rather it's a significantly different architecture in terms of thread execution, thread scheduling, core layout, memory controllers, ISA, and more.
 
It is. I was just expecting an architectural improvement in FP32 that would increase throughput core for core, clock for clock. For example, gaming performance clock for clock, core for core in Volta-branded GPUs will increase only by 30-40%. If there were such an improvement, the gap would be much wider.

P.S. The GP100 (Tesla P100) chip actually has a 235W TDP and 225W power consumption. Let's wait for the PCIe version of the GPU and its TDP to judge its efficiency, and whether this was a real improvement or just marketing from Nvidia ;).

Why do you spin my words in this particular way?
Because you should have stopped with the first two words of this post. Everything else is just continuing the negative spin on Volta.
 
What's the improvement in core for core, clock for clock with Vega? I also don't know why this is a valid or interesting metric; GPUs have been increasing core counts and clock speeds for many generations as a fundamental way of improving performance. That's one of the main benefits of going to a new process node, right? You can build more transistors and thus fit more cores?

I'll repost this since you seem to have missed it:

Rather it's a significantly different architecture in terms of thread execution, thread scheduling, core layout, memory controllers, ISA, and more.
Why do you want to spin it to AMD vs Nvidia?

I'll repost again. None of what you are describing increases the throughput of the cores. It helps with utilization, but nothing here increases throughput. Core throughput is determined by the resources available to the cores, and that has not changed from the GP100 chip.
Because you should have stopped with the first two words of this post. Everything else is just continuing the negative spin on Volta.
Am I trash-talking Volta, or just stating observable facts about it that you do not like?
 
Why do you want to spin it to AMD vs Nvidia?

I'll repost again. None of what you are describing increases the throughput of the cores. It helps with utilization, but nothing here increases throughput. Core throughput is determined by the resources available to the cores, and that has not changed from the GP100 chip.

Okay, let's assume you are correct. What's your point? You're making it sound like this is a bad thing, and that Volta is not an interesting or important update (i.e. glossing over all the other massive improvements in the architecture). Please give examples (from AMD or NVIDIA) of where FP32 throughput has been improved by something other than improving core counts or clock speeds.

Edit: I disagree that things like "improved thread execution and scheduling" won't have an impact on FP32 throughput, but let's just ignore that for now, shall we?
 
Okay, let's assume you are correct. What's your point? You're making it sound like this is a bad thing, and that Volta is not an interesting or important update (i.e. glossing over all the other massive improvements in the architecture). Please give examples (from AMD or NVIDIA) of where FP32 throughput has been improved by something other than improving core counts or clock speeds.
I am not making anything sound like a bad thing.

It is you two who believe I am trash-talking Volta. For God's sake, am I the fanboy here?

Example?

Here you go: the transition from Kepler to Maxwell.
Kepler had 192 cores in each SM, with a 256 KB register file available to them.
Maxwell had 128 cores in each SM, with the same 256 KB register file available to them.
Effect? Those 128 Maxwell cores delivered 90% of the performance of 192 Kepler cores.
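The register-file-per-core arithmetic, as a small sketch using the figures above:

```python
# Register file bytes available per FP32 core in one SM,
# using the 256 KB per-SM register file quoted above.

KB = 1024

kepler_per_core  = 256 * KB / 192   # Kepler SMX: ~1365 bytes per core
maxwell_per_core = 256 * KB / 128   # Maxwell SMM: 2048 bytes per core

print(f"Kepler:  {kepler_per_core:.0f} B/core")
print(f"Maxwell: {maxwell_per_core:.0f} B/core "
      f"(+{maxwell_per_core / kepler_per_core - 1:.0%})")
# Each Maxwell core has 50% more register file behind it, which is the
# "more resources per core" argument being made here.
```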

People claim that Nvidia got such an increase in efficiency and throughput thanks to tile-based rasterization. Incorrect. TBR does not increase throughput; it increases efficiency. It saves power because it reduces the amount of data moved (culling!) and the power required to move that data, saves memory bandwidth, etc.
Secondly, there is no "real" tiling in Maxwell cards (and later, consumer Pascal), which Nvidia themselves confirmed.
http://www.hardware.fr/news/15027/gdc-nvidia-parle-tile-caching-maxwell-pascal.html
http://www.realworldtech.com/forum/?threadid=159876&curpostid=168154

And this is the most important proof that this was the key change:
http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20
Compare the scores of the GTX 980 (2048 CUDA cores) with the GTX 780 Ti (2880 CUDA cores), which had higher theoretical performance, yet the GTX 980 has higher overall throughput from its cores and better utilization of the architecture.
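To put numbers on the "higher theoretical performance" point, here is a quick sketch (reference boost clocks assumed, so treat the exact figures as approximate):

```python
# Theoretical FP32 peaks at reference boost clocks (approximate).
def tflops(cores: int, clock_ghz: float) -> float:
    return cores * 2 * clock_ghz / 1000.0

gtx_780ti = tflops(2880, 0.928)  # ~5.3 TFLOPS on paper
gtx_980   = tflops(2048, 1.216)  # ~5.0 TFLOPS on paper

print(f"GTX 780 Ti: {gtx_780ti:.1f} TFLOPS, GTX 980: {gtx_980:.1f} TFLOPS")
# The 780 Ti is ahead on paper, yet the 980 wins most of the linked
# compute benchmarks - i.e. Maxwell extracts more real throughput per
# theoretical FLOP than Kepler did.
```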

Core for core, clock for clock, GV100 in FP32 will be the same as GP100. It has higher overall compute performance, but it's not a revolution in FP32. DL is the diamond here, which Nvidia rightfully emphasized. But for 90% of the market Volta is not that big of a deal. At least, not what I hoped for.

Why does it matter overall? Because you have to be incredibly delusional or stupid to believe that the GV100 chip is the chip Apple would use in their consumer/professional hardware. The architecture that lands in consumer and workstation parts (Quadro) will be heavily stripped of HPC features. The chips will also be smaller, to fit particular thermal envelopes. Look at the broader picture.
 
I am not making anything sound like a bad thing.

It is you two who believe I am trash-talking Volta. For God's sake, am I the fanboy here?

Example?

Here you go: the transition from Kepler to Maxwell.
Kepler had 192 cores in each SM, with a 256 KB register file available to them.
Maxwell had 128 cores in each SM, with the same 256 KB register file available to them.
Effect? Those 128 Maxwell cores delivered 90% of the performance of 192 Kepler cores.

People claim that Nvidia got such an increase in efficiency and throughput thanks to tile-based rasterization. Incorrect. TBR does not increase throughput; it increases efficiency. It saves power because it reduces the amount of data moved (culling!) and the power required to move that data, saves memory bandwidth, etc.
Secondly, there is no "real" tiling in Maxwell cards (and later, consumer Pascal), which Nvidia themselves confirmed.
http://www.hardware.fr/news/15027/gdc-nvidia-parle-tile-caching-maxwell-pascal.html
http://www.realworldtech.com/forum/?threadid=159876&curpostid=168154

And this is the most important proof that this was the key change:
http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20
Compare the scores of the GTX 980 (2048 CUDA cores) with the GTX 780 Ti (2880 CUDA cores), which had higher theoretical performance, yet the GTX 980 has higher overall throughput from its cores and better utilization of the architecture.

Core for core, clock for clock, GV100 in FP32 will be the same as GP100. It has higher overall compute performance, but it's not a revolution in FP32. DL is the diamond here, which Nvidia rightfully emphasized. But for 90% of the market Volta is not that big of a deal. At least, not what I hoped for.

Why does it matter overall? Because you have to be incredibly delusional or stupid to believe that the GV100 chip is the chip Apple would use in their consumer/professional hardware. The architecture that lands in consumer and workstation parts (Quadro) will be heavily stripped of HPC features. The chips will also be smaller, to fit particular thermal envelopes. Look at the broader picture.

I don't know how you can look at a 50% improvement in overall raw performance and not call that revolutionary. 15 TFLOPs (your favorite metric) from a single GPU is huge. Are you saying raw TFLOPs doesn't matter, and that the only thing that's important is improving clock-for-clock and core-for-core performance? Has AMD been doing this without me noticing?

I don't think anyone said that Apple would use this (or any other NVIDIA) GPU, we're just all really confused about the point you're trying to make with your insistence on focusing on clock-for-clock and core-for-core performance.
 
I don't know how you can look at a 50% improvement in overall raw performance and not call that revolutionary.
This - especially since the 50% improvement was within the same power envelope as the current systems. "Performance per watt" was trending when it was about Polaris, but Volta performance per watt doesn't excite the same people.

Plus the huge addition of Tensor cores. Ignoring this incredible advancement in machine learning and AI because it doesn't speed up FCPX cat video rendering is short-sighted.
 
I don't know how you can look at a 50% improvement in overall raw performance and not call that revolutionary. 15 TFLOPs (your favorite metric) from a single GPU is huge. Are you saying raw TFLOPs doesn't matter, and that the only thing that's important is improving clock-for-clock and core-for-core performance? Has AMD been doing this without me noticing?

I don't think anyone said that Apple would use this (or any other NVIDIA) GPU, we're just all really confused about the point you're trying to make with your insistence on focusing on clock-for-clock and core-for-core performance.
This - especially since the 50% improvement was within the same power envelope as the current systems. "Performance per watt" was trending when it was about Polaris, but Volta performance per watt doesn't excite the same people.

Plus the huge addition of Tensor cores. Ignoring this incredible advancement in machine learning and AI because it doesn't speed up FCPX cat video rendering is short-sighted.
This is because performance/mm2 did not increase significantly compared to GP100; it increased by a single-digit percentage. What does that mean in the context of what I am writing?
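The perf/mm2 arithmetic, as a quick sketch using the published die sizes and peak FP32 figures (approximate):

```python
# FP32 GFLOPS per mm^2 of die, using published die sizes and peak figures.
gp100 = 10_600 / 610   # Tesla P100: ~17.4 GFLOPS/mm^2
gv100 = 15_000 / 815   # Tesla V100: ~18.4 GFLOPS/mm^2

print(f"GP100: {gp100:.1f}, GV100: {gv100:.1f} GFLOPS/mm^2 "
      f"(+{gv100 / gp100 - 1:.0%})")
# Roughly a 6% gain - single digits, as claimed - which is why scaling
# FP32 further means scaling die area (and cost) further.
```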

Quadro and consumer GPUs will not have such huge increases unless Nvidia decides to increase die sizes. The cores are not smaller, and they also pack Tensor Cores. Even consumer GPUs will have to have Tensor Cores in some form, for compatibility reasons (writing applications for GV100 chips).

In other words: with current die sizes, do not expect massive increases in FP32 through massively increased core counts. The chips will have higher overall throughput, and from that will come the advancement.

On the other hand, Nvidia still has the option of designing bigger dies and increasing core counts in their GPUs. However, bear in mind that HBM2 memory controllers are smaller than GDDR5/5X/6 ones, and to get full bandwidth only 4 memory controllers are required, not the 12 (384-bit) already rumored for the GV102 chip. So the GV102 chip can still be 600 mm2 with over 250W TDP.

And lastly: all we can discuss here is the Apple context. Machine learning is not prominent on the Apple platform; nothing on the software side is ready for it. 90-95% of the market is still FP32. Deep learning is growing rapidly, but it is still an extremely small market.

Would Apple use GV chips in the upcoming Mac Pro? They absolutely could. Will they? Unlikely. Will the Mac Pro be what 90% of people hope for (a cheese grater)? No. How come? This is Apple; they always want to do something different.

As a bonus: Rumor about Volta vs Vega:
http://semiaccurate.com/forums/showpost.php?p=289540&postcount=1656
Fottemberg said:
RUMOR: It seems Volta will be faster than Vega MHz vs. MHz, but Vega will be much cheaper than Volta.
It means that Volta will clock higher than Vega.
Anyways, interesting times ahead.
 
This is because performance/mm2 did not increase significantly compared to GP100; it increased by a single-digit percentage. What does that mean in the context of what I am writing?

Quadro and consumer GPUs will not have such huge increases unless Nvidia decides to increase die sizes. The cores are not smaller, and they also pack Tensor Cores. Even consumer GPUs will have to have Tensor Cores in some form, for compatibility reasons (writing applications for GV100 chips).

In other words: with current die sizes, do not expect massive increases in FP32 through massively increased core counts. The chips will have higher overall throughput, and from that will come the advancement.

On the other hand, Nvidia still has the option of designing bigger dies and increasing core counts in their GPUs. However, bear in mind that HBM2 memory controllers are smaller than GDDR5/5X/6 ones, and to get full bandwidth only 4 memory controllers are required, not the 12 (384-bit) already rumored for the GV102 chip. So the GV102 chip can still be 600 mm2 with over 250W TDP.

And lastly: all we can discuss here is the Apple context. Machine learning is not prominent on the Apple platform; nothing on the software side is ready for it. 90-95% of the market is still FP32. Deep learning is growing rapidly, but it is still an extremely small market.

Would Apple use GV chips in the upcoming Mac Pro? They absolutely could. Will they? Unlikely. Will the Mac Pro be what 90% of people hope for (a cheese grater)? No. How come? This is Apple; they always want to do something different.

As a bonus: Rumor about Volta vs Vega:
http://semiaccurate.com/forums/showpost.php?p=289540&postcount=1656

It means that Volta will clock higher than Vega.
Anyways, interesting times ahead.

So your point is that Apple will choose Vega over Volta? Everyone already expects that. Does that mean Vega will beat Volta on raw performance or perf/watt, which are two actually interesting metrics? I absolutely do not think so. You can keep quoting perf per mm2 or clock-for-clock/core-for-core, but how much is Vega improving over previous generations in those metrics exactly?
 
To be honest, I expected Volta to be faster. I'm not disappointed, since Nvidia will hardly lose the compute GPU crown to AMD, but I expected much more. At least with AMD's Vega gaining ground, Nvidia should cut Volta's price.

I'm not in the AI/ML business yet, but I'd like to see a pure tensor compute accelerator from Nvidia or AMD... It will come in about two years at most.
 
Wrapping up an amazing week at the GPU Technology Conference, here's a list of some interesting Nvidia posts about the week's news.


Groundbreaking AI, At Your Desk
Designed for your office, NVIDIA® DGX Station™ is the world’s first personal supercomputer for leading-edge AI development.

NVIDIA Tesla V100
The Most Advanced Data Center GPU Ever Built.

NVIDIA Volta
The New GPU Architecture,
Designed to Bring AI to Every Industry.

NVIDIA GPU Cloud
DEEP LEARNING EVERYWHERE, FOR EVERYONE.

NVIDIA DGX Systems
FASTER AI INNOVATION AND INSIGHT

Want Drones to Fly Right?
Make Them Hallucinate, MIT Prof Says


I'm starting to wade through the technical press stories on the GPU Technology Conference, and will post soon.
 
Here is a great video discussing Volta and looking forward to what 2018 might bring for AMD vs Nvidia. Basically, AMD is going to have a really tough time keeping up with Nvidia unless Vega really exceeds expectations. Since Nvidia has its own custom node with TSMC, it has room to grow the size of its chips while also relying on cheaper memory (GDDR6). For instance, even if Vega can match a 1080 Ti, Nvidia could probably match it with a mid-range enthusiast chip (i.e. the GV104/GTX 2080) in 6-9 months.
 
Here is a great video discussing Volta and looking forward to what 2018 might bring for AMD vs Nvidia. Basically, AMD is going to have a really tough time keeping up with Nvidia unless Vega really exceeds expectations. Since Nvidia has its own custom node with TSMC, it has room to grow the size of its chips while also relying on cheaper memory (GDDR6). For instance, even if Vega can match a 1080 Ti, Nvidia could probably match it with a mid-range enthusiast chip (i.e. the GV104/GTX 2080) in 6-9 months.
What is 'custom' about the 12nm process?

Remember also that nowadays the process names are used for marketing and there's no simple relation to feature sizes.
 
What is 'custom' about the 12nm process?

Remember also that nowadays the process names are used for marketing and there's no simple relation to feature sizes.

I'm sure it's just tuned from the 16nm process to what Nvidia wants for its GPUs, say optimizing for efficiency or large die sizes rather than transistor density. Maybe they had to modify the process in order to produce the massive GV100. Probably not a dramatic difference, but if TSMC has a specific process for Nvidia, then I'm sure Nvidia paid big bucks for it, so they must be getting something out of it.
 
I'm sure it's just tuned from the 16nm process to what Nvidia wants for its GPUs, say optimizing for efficiency or large die sizes rather than transistor density. Maybe they had to modify the process in order to produce the massive GV100. Probably not a dramatic difference, but if TSMC has a specific process for Nvidia, then I'm sure Nvidia paid big bucks for it, so they must be getting something out of it.
My point is that I doubt the process is reserved for NVIDIA.
 
I'm sure it's just tuned from the 16nm process to what Nvidia wants for its GPUs, say optimizing for efficiency or large die sizes rather than transistor density. Maybe they had to modify the process in order to produce the massive GV100. Probably not a dramatic difference, but if TSMC has a specific process for Nvidia, then I'm sure Nvidia paid big bucks for it, so they must be getting something out of it.

Historically, each new generation process is 0.7x the previous one, since that shrinks die size to 50% of the previous. So 28nm > 20nm > 14nm > 10nm. Every process is of course always fine-tuned; there are no hard rules in the industry. That's why you will see 16nm and 12nm processes. You can measure it with a microscope if you want. TSMC skipped 20nm, going directly from 28nm to 16nm for the A9 and A10 chips and beating Samsung's 14nm for Apple's business; now it's 10nm for the A11. Imagine how much smaller the die can shrink going from 16nm to 10nm: that's about 40% of the former die size at the same transistor count.
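The area math behind that, as a small sketch (taking the node names at face value, which, as noted below, they often aren't):

```python
# Die-area scaling from a linear feature-size shrink: area scales with
# the square of the linear dimension (assuming node names reflect real
# feature sizes, which modern marketing names often do not).

def area_scale(old_nm: float, new_nm: float) -> float:
    return (new_nm / old_nm) ** 2

print(f"0.7x linear shrink -> {0.7 ** 2:.0%} of the area")             # ~49%
print(f"16 nm -> 10 nm     -> {area_scale(16, 10):.0%} of the area")   # ~39%
# Which is where the "about 40% of the former die size at the same
# transistor count" figure comes from.
```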

You won't see a consumer Volta GPU with an 815mm2 die in the $500 range. At that die size the yield rate is too low, i.e. the cost is too high for anything but the HPC market. There is little chance we will see a GeForce Volta in 2017, until a properly shrunk die comes out.
My point is that I doubt the process is reserved for NVIDIA.
Could be, since the production resources are very tight.
 


Historically, each new generation process is 0.7x the previous one, since that shrinks die size to 50% of the previous. So 28nm > 20nm > 14nm > 10nm. Every process is of course always fine-tuned; there are no hard rules in the industry. That's why you will see 16nm and 12nm processes. You can measure it with a microscope if you want. TSMC skipped 20nm, going directly from 28nm to 16nm for the A9 and A10 chips and beating Samsung's 14nm for Apple's business; now it's 10nm for the A11. Imagine how much smaller the die can shrink going from 16nm to 10nm: that's about 40% of the former die size at the same transistor count.

You won't see a consumer Volta GPU with an 815mm2 die in the $500 range. At that die size the yield rate is too low, i.e. the cost is too high for anything but the HPC market. There is little chance we will see a GeForce Volta in 2017, until a properly shrunk die comes out.
Could be, since the production resources are very tight.
This is just a marketing name for the process. It is 16nm, but slightly optimized for Nvidia's GPU to get the best yields out of an 815mm2 chip.

There are two GV chips coming this year, both on a "12nm" process. For one of them the name is purely marketing; the other is a proper 12nm process, but that one is not suitable for anything powerful (GPUs, or CPUs with a power target over 30W).
 
Historically, each new generation process is 0.7x the previous one, since that shrinks die size to 50% of the previous. So 28nm > 20nm > 14nm > 10nm. Every process is of course always fine-tuned; there are no hard rules in the industry.

Historically this is true, but this all changed with the advent of FinFET. It no longer makes sense to simply measure the width of the transistor; instead, multiple measurements in 3D are required. That's why Intel, TSMC, Samsung, and GlobalFoundries all have different densities in their 14-16nm processes. We are getting to a point where these nodes are less well defined and somewhat arbitrary.

TSMC skipped 20nm, going directly from 28nm to 16nm for the A9 and A10 chips and beating Samsung's 14nm for Apple's business; now it's 10nm for the A11.

Not true; the A8 was manufactured on TSMC's 20nm process. 20nm never matured enough to produce large graphics chips. GPUs, especially high-performance ones like those released by AMD and Nvidia, always take longer to shrink due to their large die sizes.

You won't see a consumer Volta GPU with an 815mm2 die in the $500 range. At that die size the yield rate is too low, i.e. the cost is too high for anything but the HPC market. There is little chance we will see a GeForce Volta in 2017, until a properly shrunk die comes out.

Of course, but the Pascal chips all have room to get larger. For example, GM204 was a ~400 mm2 chip while GP104 is 314 mm2. So even if Nvidia's architectural enhancements between Pascal and Volta are minimal, they could simply scale up the Pascal chips and still get big performance improvements, especially if they have tuned their process for increased efficiency.
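A rough sketch of that headroom, assuming shader count scales roughly with die area (only approximately true, since memory controllers and fixed-function blocks don't scale the same way):

```python
# How much a Volta x04-class die could grow if it returned to GM204's size,
# assuming (roughly) that shader count scales with die area.
gm204_mm2 = 398   # GTX 980 die size
gp104_mm2 = 314   # GTX 1080 die size

headroom = gm204_mm2 / gp104_mm2 - 1
print(f"~{headroom:.0%} more area available before matching GM204's footprint")
# Roughly a quarter more silicon to spend on extra SMs before any
# architectural changes at all.
```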
 