That covers your first paragraph - I'd like to see links for the speculations in your second paragraph.

The third paragraph is clearly opinion, so no explicit need for links - although it would be nice to explain your reasoning.
For the second paragraph: Nvidia will want to cut manufacturing costs, and we know that GDDR6 is coming to GPUs next year. AMD is not going to use it, they are going to use HBM2, so Nvidia is the obvious candidate for GDDR6. And those controllers cost die area and power consumption.

As for the third paragraph: if the process brings no changes for the design, and GV102 ends up a bigger GPU on the same process with similar but slightly different memory technology, it will result in a higher TDP and similar theoretical performance. You have to bear in mind, however, that we are most likely looking at a 64 core / 256 KB register file SM layout, so the throughput of the cores will be higher.

Compare performance between GP100 and GP102 to see how much more performance you can expect from the new layout. There are benchmarks in the wild comparing both architectures.
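To make the layout difference concrete, here is a rough sketch (Python; the 128 vs 64 cores per SM and the 256 KB register file are the commonly reported figures for GP102- and GP100-style SMs, so treat them as assumptions rather than official die data):

```python
# Rough sketch: register file capacity available per FP32 core for two SM layouts.
# Figures are the commonly reported ones, not official measurements.
KB = 1024

layouts = {
    "GP102-style SM": {"fp32_cores": 128, "register_file_bytes": 256 * KB},
    "GP100-style SM": {"fp32_cores": 64,  "register_file_bytes": 256 * KB},
}

for name, sm in layouts.items():
    per_core = sm["register_file_bytes"] / sm["fp32_cores"]
    print(f"{name}: {per_core / KB:.0f} KB of registers per core")

# GP102-style SM: 2 KB of registers per core
# GP100-style SM: 4 KB of registers per core
```

Twice the register file (and scheduling resources) per core is why the 64-core layout can sustain a bigger share of its theoretical FLOPs in practice.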
Fine, let's try another way to estimate performance.

GP106 - 200 mm2 - 5120 GFLOPs - 25.6 GFLOPs/mm2
GP104 - 314 mm2 - 9216 GFLOPs - 29.4 GFLOPs/mm2
GP102 - 471 mm2 - 12150 GFLOPs - 25.8 GFLOPs/mm2

So without thinking about GP100 or GV100, let's see what we get with a potential GV102.

First, let's be conservative and say we keep the ~25 GFLOPs/mm2 of GP102:

GV102 - 600 mm2 - 25 GFLOPs/mm2 - 15000 GFLOPs

Or let's say that Nvidia does a little better and manages 27.5 GFLOPs/mm2:

GV102 - 600 mm2 - 27.5 GFLOPs/mm2 - 16500 GFLOPs
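A quick way to check these numbers (a minimal sketch; the die sizes and peak GFLOPs are exactly the ones listed above, the ~600 mm2 GV102 is of course a guess):

```python
# Sketch: compute GFLOPs per mm^2 for known Pascal dies, then project a
# hypothetical GV102 at two density assumptions. Numbers are from the list above.
pascal = {
    "GP106": (200, 5120),    # (die area mm^2, peak GFLOPs)
    "GP104": (314, 9216),
    "GP102": (471, 12150),
}

for name, (area, gflops) in pascal.items():
    print(f"{name}: {gflops / area:.1f} GFLOPs/mm^2")

# Projection for a hypothetical ~600 mm^2 GV102:
for density in (25.0, 27.5):             # conservative vs. slightly improved
    print(f"GV102 @ {density} GFLOPs/mm^2 -> {600 * density:.0f} GFLOPs")
# -> 15000 and 16500 GFLOPs, i.e. 15 and 16.5 TFLOPs
```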

Nvidia is relatively predictable in what they do with their GPUs, although it was very surprising to see them announce a gigantic 815 mm2 GPU. Don't forget that Nvidia could also go bigger with GV102, since GV100 is obviously huge.

Math, not fanboyism, is the best way to project their next generation GPUs.
I have no idea where you see fanboyism in my post, but I guess fanboyism is in the eye of the beholder...
So how do you think Nvidia will be able to achieve 16 TFLOPs on the same node, with a smaller die that has more memory controllers cutting into the die area even further?

An 815 mm2 GPU, with just 4 HBM2 memory controllers that are smaller than GDDR6 controllers, has ONLY 5376 CUDA cores.

How many CUDA cores will you be able to fit in a 600 mm2 GPU on 12 nm (16 nm), with twelve 32-bit memory controllers giving a 384-bit memory bus, even accounting for higher density?

Let me give you an example, if you want maths. To get to 16500 GFLOPs (16.5 TFLOPs), you need 5120 CUDA cores clocked at 1.6 GHz. Doable, given those requirements?

That is why I don't expect more than 13-14 TFLOPs out of a 600 mm2 GPU on 12 nm at a 250W TDP, and not more than 13 TFLOPs out of a GPU on 16 nm FF+ at a 300W TDP.
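For reference, the arithmetic behind those figures (a sketch; FP32 peak assumes one fused multiply-add, i.e. 2 FLOPs, per core per clock):

```python
# Peak FP32 throughput = cores * 2 FLOPs (one FMA) * clock.
def peak_gflops(cuda_cores: int, clock_ghz: float) -> float:
    return cuda_cores * 2 * clock_ghz

print(peak_gflops(5120, 1.6))   # ~16384 GFLOPs, i.e. ~16.4 TFLOPs

# For scale: GV100 packs 5376 cores into ~815 mm^2, i.e. ~6.6 cores per mm^2,
# and that is with compact HBM2 controllers rather than twelve GDDR6 controllers.
print(5376 / 815)               # ~6.6
```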
 
How many CUDA cores will you be able to fit in a 600 mm2 GPU on 12 nm (16 nm), with twelve 32-bit memory controllers giving a 384-bit memory bus, even accounting for higher density?

I say fanboyism because you throw out a handful of facts and then state a random number as your performance estimate.

I am not sure why you are bringing up anything about HBM. Nothing in my post included any density calculations from GPUs that use HBM. GP102 achieves its performance with a 384-bit memory bus, and there is no reason GV102 can't do the same thing with a 384-bit GDDR6 memory bus. Not to mention I am sure they have further enhanced their memory compression for Volta.
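As a rough bandwidth sanity check (a sketch; 11 Gbps is a GDDR5X-class speed, and the 14 Gbps GDDR6 figure is only an assumption about launch speeds):

```python
# Memory bandwidth = bus width (bits) / 8 * per-pin data rate (Gbps).
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits / 8 * data_rate_gbps

print(bandwidth_gb_s(384, 11.0))  # 528 GB/s - a full 384-bit bus at 11 Gbps GDDR5X-class speeds
print(bandwidth_gb_s(384, 14.0))  # 672 GB/s - hypothetical GV102 at an assumed 14 Gbps GDDR6
```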

With 5120 CUDA cores, all it needs is a clock rate of 1470 MHz to hit 15 TFLOPs. Given GV100 will be at 1450 MHz, it should be no problem.

Let me give you an example, if you want maths. To get to 16500 GFLOPs (16.5 TFLOPs), you need 5120 CUDA cores clocked at 1.6 GHz. Doable, given those requirements?

This doesn't seem out of the question, since the GTX 1080 Ti clocks over 1600 MHz.
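The clock-speed claim is easy to check with the same FMA arithmetic (a sketch):

```python
# Required clock to reach a target FP32 throughput with a given core count,
# assuming 2 FLOPs (one FMA) per core per clock.
def required_clock_mhz(target_tflops: float, cores: int) -> float:
    return target_tflops * 1e12 / (2 * cores) / 1e6

print(required_clock_mhz(15.0, 5120))   # ~1465 MHz
print(required_clock_mhz(16.5, 5120))   # ~1611 MHz - roughly 1080 Ti boost territory
```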
 
I say fanboyism because you throw out a handful of facts and then state a random number as your performance estimate.

I am not sure why you are bringing up anything about HBM. Nothing in my post included any density calculations from GPUs that use HBM. GP102 achieves its performance with a 384-bit memory bus, and there is no reason GV102 can't do the same thing with a 384-bit GDDR6 memory bus. Not to mention I am sure they have further enhanced their memory compression for Volta.

With 5120 CUDA cores, all it needs is a clock rate of 1470 MHz to hit 15 TFLOPs. Given GV100 will be at 1450 MHz, it should be no problem.



This doesn't seem out of the question, since the GTX 1080 Ti clocks over 1600 MHz.
But GV100 has 4 HBM2 memory controllers, and it has an 815 mm2 die... And HBM2 memory controllers are smaller than GDDR5 and GDDR6 memory controllers. And GV102 will have 12 memory controllers, which will eat both die area and power.

Actually, 15 TFLOPs may be realistic with 5120 CUDA cores at 1470 MHz. But that may not be a realistic number of cores for a 600 mm2 die.
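A very rough check of that caveat (a sketch that naively scales GV100's core density down to 600 mm2; it ignores that GV100 also spends area on FP64 units, tensor cores and HBM2 interfaces, so treat it as an illustration of the argument, not a prediction):

```python
# Naive scaling of GV100's CUDA-core density down to a 600 mm^2 die.
gv100_cores, gv100_area = 5376, 815      # full chip, mm^2
density = gv100_cores / gv100_area       # ~6.6 cores per mm^2
print(density * 600)                     # ~3958 cores at the same density

# A GV102 without FP64/tensor hardware could pack cores denser than this,
# but twelve GDDR6 controllers would claw some of that area back - which is
# exactly the point of disagreement in this thread.
```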

P.S. The funniest part: I am basing my logic on hard data about process nodes and the technical details of GPU design, and I am the fanboy. You have thrown out numbers based on your own logic, without any grounding in data about processes or GPU design, and yet it is me who is the fanboy.
 
But GV100 has 4 HBM2 memory controllers, and it has an 815 mm2 die... And HBM2 memory controllers are smaller than GDDR5 and GDDR6 memory controllers. And GV102 will have 12 memory controllers, which will eat both die area and power.

Actually, 15 TFLOPs may be realistic with 5120 CUDA cores at 1470 MHz. But that may not be a realistic number of cores for a 600 mm2 die.
Links? I didn't know that GV102 specs have been published.
 
Links? I didn't know that GV102 specs have been published.
FFS. Each memory controller for GDDR6 is 32-bit. To get a 384-bit memory bus you need 12 memory controllers and 12 memory chips. These are the absolute basics of GPU architecture. How can you not know this?
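Spelled out as a trivial sketch (using the 32-bit-per-controller framing from this thread):

```python
# In this discussion each GDDR5/GDDR5X/GDDR6 memory controller is 32 bits wide,
# paired with one memory chip, so:
def bus_width_bits(num_controllers: int) -> int:
    return num_controllers * 32

print(bus_width_bits(12))   # 384-bit bus, 12 memory chips
print(bus_width_bits(11))   # 352-bit bus, e.g. a cut-down part
```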
 
But GV100 has 4 HBM2 memory controllers, and it has an 815 mm2 die... And HBM2 memory controllers are smaller than GDDR5 and GDDR6 memory controllers. And GV102 will have 12 memory controllers, which will eat both die area and power.

Actually, 15 TFLOPs may be realistic with 5120 CUDA cores at 1470 MHz. But that may not be a realistic number of cores for a 600 mm2 die.

So what? The smaller, GDDR5X based GP102 has more CUDA cores than the larger, HBM based GP100. Volta will likely be similar.

P.S. The funniest part: I am basing my logic on hard data about process nodes and the technical details of GPU design, and I am the fanboy. You have thrown out numbers based on your own logic, without any grounding in data about processes or GPU design, and yet it is me who is the fanboy.

Show me your data then and how you computed your performance estimates. I demonstrated mine by only extrapolating from Nvidia's previous generation without guessing at any improvements that may occur due to changes in architecture or process.
 
FFS. Each memory controller for GDDR6 is 32-bit. To get a 384-bit memory bus you need 12 memory controllers and 12 memory chips. These are the absolute basics of GPU architecture. How can you not know this?
Links? Where has it been published that GV102 will have a 384-bit memory bus?

And should we be expecting GV104 before GV102?
 
So what? The smaller, GDDR5X based GP102 has more CUDA cores than the larger, HBM based GP100. Volta will likely be similar.

Show me your data then and how you computed your performance estimates. I demonstrated mine by only extrapolating from Nvidia's previous generation without guessing at any improvements that may occur due to changes in architecture or process.
BOTH have 3840 CUDA cores. And the bigger one also has FP64 cores, which is why it is a 600 mm2 GPU.

With GP100 there were 3840 CUDA cores, but Nvidia disabled 256 of them to increase yields. The same thing happened with GV100 - there are 5376 CUDA cores in the GPU, but Nvidia is shipping 5120-core versions for increased yields.
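The disabled-core counts fall straight out of the SM granularity (a sketch; GP100- and GV100-style SMs have 64 FP32 cores each):

```python
# Salvage parts disable whole SMs; each GP100/GV100 SM has 64 FP32 cores.
CORES_PER_SM = 64

def cores_after_disabling(full_cores: int, disabled_sms: int) -> int:
    return full_cores - disabled_sms * CORES_PER_SM

print(cores_after_disabling(3840, 4))   # 3584 - the shipping GP100 (Tesla P100)
print(cores_after_disabling(5376, 4))   # 5120 - the shipping GV100 (Tesla V100)
```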

The die sizes of memory controllers are not publicly available. If you want more details, search the internet for die shots of the GTX 1080, the GP100 chip, and the Vega GPU, and look at the left and right sides of the GPU, or the top and bottom. GDDR5/X eats quite a lot of space compared to HBM2.
Links? Where has it been published that GV102 will have a 384-bit memory bus?

And should we be expecting GV104 before GV102?
Why don't you ask Stacc for those links in the first place? He started this by posting his predictions.

How hard is it for you to understand that this discussion is just theoretical?

I always love how a technical discussion can be turned into a defense of Nvidia by you.

And lastly: this is the AMD Vega thread.
 
How hard is it for you to understand that this discussion is just theoretical?
You are posting a lot of absolute specs, considering that you say it's theoretical. The GTX 1080 Ti (GP102) has a 352-bit bus, so why not consider (in theory) that GV102 has fewer than 384 bits?

And lastly: this is the AMD Vega thread.
Then why are you talking about Volta here?

Obviously, because Vega does not exist in a vacuum, and ATI (and people here) have been comparing various Vega spinoffs to the older GP104 chip. The newer GP102 chip is more relevant, and the expected GV10x chips are even more relevant.

You put too much faith in simple theories, and miss the bigger picture.

To use one of the ever-popular automotive analogies:
  • The 5.0 litre engine will be more powerful than the 2.0 litre engine - due to the 5.0L having 2.5 times the displacement.
but
  • The 2.0 litre engine produces 240 horsepower, while the 5.0 litre has 220 horsepower - due to the 2.0L having a high pressure turbocharger with intercooler, dual overhead cams, four valves per cylinder, fuel injection, and a smart engine computer (240HP with regular gas - use higher octane and get more power). (And the 2.0L has *much* better mileage.)
(True example comparing an older Ford 302 cubic inch (5.0L) V8 with the 2.0L "EcoBoost" 4 cylinder engine in many current models. (Full disclosure - I have an Escape with the 2.0L engine. It's fun to embarrass BMWs in traffic....))
No spec in isolation determines the final performance. And a seemingly overwhelming advantage in one dimension (5.0L vs 2.0L) can be eliminated by other factors.

And GV102 (and GV104) haven't been announced - so you don't even have the specs to speculate on.
 
No spec in isolation determines the final performance. And a seemingly overwhelming advantage in one dimension (5.0L vs 2.0L) can be eliminated by other factors.

It's okay though; according to koyoot, TFLOPs are the only thing that matters and we should ignore real-world game/app benchmarks. Unless they've been hand-picked to show AMD performing well, that is.
 
"'
It's okay though; according to koyoot, TFLOPs are the only thing that matters and we should ignore real-world game/app benchmarks. Unless they've been hand-picked to show AMD performing well, that is.
This is the point I was trying to make with the earlier "Unless of course the 'theoretical' 11 TFLOPs turns out to be an actual 6 TFLOPs" post.

If the "theory" says '11 TFLOPs', but the benchmarks say '6 TFLOPs' - ignore the theory and believe the benchmarks.

But I hear that I'm clueless and that I don't understand anything about anything.
 
You are posting a lot of absolute specs, considering that you say it's theoretical. The GTX 1080 Ti (GP102) has a 352-bit bus, so why not consider (in theory) that GV102 has fewer than 384 bits?
Aaaaaaaand, is the GTX Titan Xp a different chip, not GP102?

GP102 has a 384-bit memory bus. The GTX 1080 Ti has a 352-bit memory bus because memory controllers are connected to SMs in the Paxwell (Pascal/Maxwell) architecture. Cutting some cores out of the GPU layout also cuts the memory connection. The GTX 970 3.5 GB fiasco is the perfect example here.
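A sketch of how cutting one 32-bit partition produces the 1080 Ti's odd numbers (the 1 GB per memory chip is the configuration used on GP102 boards):

```python
# GP102 has 12 x 32-bit memory controllers, each paired with one memory chip.
def cut_down(total_controllers: int, disabled: int, gb_per_chip: int = 1):
    enabled = total_controllers - disabled
    return enabled * 32, enabled * gb_per_chip   # (bus width in bits, VRAM in GB)

print(cut_down(12, 0))   # (384, 12) - full GP102, e.g. Titan Xp
print(cut_down(12, 1))   # (352, 11) - GTX 1080 Ti: 352-bit bus, 11 GB
```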

Why do you talk about hardware when you are clueless about it? I'm not a quantum physicist, so I am not writing about quantum mechanics. You are talking about things you have no idea about.
Then why are you talking about Volta here?

Obviously, because Vega does not exist in a vacuum, and ATI (and people here) have been comparing various Vega spinoffs to the older GP104 chip. The newer GP102 chip is more relevant, and the expected GV10x chips are even more relevant.

You put too much faith in simple theories, and miss the bigger picture.

To use one of the ever-popular automotive analogies:
  • The 5.0 litre engine will be more powerful than the 2.0 litre engine - due to the 5.0L having 2.5 times the displacement.
but
  • The 2.0 litre engine produces 240 horsepower, while the 5.0 litre has 220 horsepower - due to the 2.0L having a high pressure turbocharger with intercooler, dual overhead cams, four valves per cylinder, fuel injection, and a smart engine computer (240HP with regular gas - use higher octane and get more power). (And the 2.0L has *much* better mileage.)
(True example comparing an older Ford 302 cubic inch (5.0L) V8 with the 2.0L "EcoBoost" 4 cylinder engine in many current models. (Full disclosure - I have an Escape with the 2.0L engine. It's fun to embarrass BMWs in traffic....))
No spec in isolation determines the final performance. And a seemingly overwhelming advantage in one dimension (5.0L vs 2.0L) can be eliminated by other factors.

And GV102 (and GV104) haven't been announced - so you don't even have the specs to speculate on.
What you are describing is exactly the difference between Vega's throughput and Pascal's. The only difference is that Vega requires rocket fuel (software optimization), while Nvidia architectures can run on Jim Beam. You are constantly not paying attention to my posts about architectures.
"'
This is the point I was trying to make with the earlier "Unless of course the 'theoretical' 11 TFLOPs turns out to be an actual 6 TFLOPs" post.

If the "theory" says '11 TFLOPs', but the benchmarks say '6 TFLOPs' - ignore the theory and believe the benchmarks.

But I hear that I'm clueless and that I don't understand anything about anything.
Yes, you are clueless, because you post rubbish that is out of touch with reality. TFLOPs numbers refer to compute performance. Wake me up when Nvidia has an architecture that is not bottlenecked by the throughput of its cores and is able to achieve its true potential without using CUDA. Then we can talk.
 
Die shot of Vega:
[Image: amd_vega_dieshot.jpg]


Project 47.
https://instinct.radeon.com/en-us/p47-announcement/

Quite a nice rack.
 
This is the point I was trying to make with the earlier "Unless of course the 'theoretical' 11 TFLOPs turns out to be an actual 6 TFLOPs" post.

If the "theory" says '11 TFLOPs', but the benchmarks say '6 TFLOPs' - ignore the theory and believe the benchmarks.

But I hear that I'm clueless and that I don't understand anything about anything.
Actually, to prove how wrong you are about AMD hardware, and to prove that it's Nvidia whose TFLOPs numbers are the theoretical ones, I will provide two proofs just for you:
First, a comparison between the GTX 1060 and the RX 480. The GTX 1060 is using CUDA, and the RX 480 is using OpenCL.
https://wiki.blender.org/index.php/Dev:Source/Render/Cycles/OpenCL

[Image: Timings.png]


This is the difference between a wide GPU architecture that can do more work each clock with a shorter pipeline, and a GPU architecture that can clock higher but does less work each cycle on a longer pipeline.

Theoretical TFLOPs numbers for both: 4.4 TFLOPs on the GTX 1060 vs 5.7 TFLOPs on the RX 480.
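Those theoretical numbers come straight from shader count and boost clock (a sketch; reference-card boost clocks are assumed, actual boards vary a bit):

```python
# Theoretical FP32 TFLOPs = shaders * 2 FLOPs (one FMA) * boost clock.
def tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * 2 * clock_ghz / 1000

print(tflops(1280, 1.708))   # ~4.4 TFLOPs - GTX 1060 at its reference boost
print(tflops(2304, 1.266))   # ~5.8 TFLOPs - RX 480 at its reference boost
                             # (the ~5.7 figure above assumes a slightly lower clock)
```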

Let's look at a bigger GPU vs a smaller GPU:


A 6.5 TFLOPs GTX 1070 smoked by a 5.7 TFLOPs RX 480 in Adobe applications.

This is exactly what I described in that wall-of-text post on the 2nd page of this thread. A GPU that can do more work each cycle is faster than a GPU that does less work per cycle but can clock higher. If applications are optimized for the throughput of the cores, this is what you will see.

I'm sorry Aiden, but you are clueless about hardware, and you cannot mask it. Just because you hate AMD does not give you the right to crap all over their hardware. AMD has higher IPC, software is catching up to it, and applications are starting to overtake Nvidia hardware using CUDA.
 
It would have been nice to have that info in the wiki; it seems odd to have to check a Twitter feed.

Even the spreadsheet is lacking - it doesn't include the CUDA version used for the Nvidia tests. (And it doesn't actually say CUDA.)
 
For those who do not understand what is being talked about: Blender added a split kernel to their OpenCL compute pipeline, which resulted in smaller kernels that can be processed faster. This is the essence of CUDA software: smaller kernels that can be processed very fast. Remember that a warp in the CUDA architecture is 32 threads, and a GCN wavefront is 64 work-items. Hence the differences between the two architectures with properly optimized software.
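A tiny, purely illustrative sketch of what the warp/wavefront width means for keeping the hardware fed:

```python
import math

# CUDA schedules threads in warps of 32; GCN schedules work-items in wavefronts of 64.
WARP, WAVEFRONT = 32, 64

def idle_lanes(work_items: int, width: int) -> int:
    groups = math.ceil(work_items / width)    # warps/wavefronts launched
    return groups * width - work_items        # lanes doing no useful work

print(idle_lanes(96, WARP), idle_lanes(96, WAVEFRONT))   # 0 vs 32

# The wider wavefront wastes more lanes when the work doesn't fill it, which is
# why GCN leans harder on software that is tuned to keep its units busy.
```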

OpenCL allows this, but it has to be implemented in the software to be used. This is also most likely the reason why Apple software is so fast. Why is AMD hardware faster than Nvidia in Apple software? Turn back to page 2 of this thread and read my post that described the Vega architecture and compared the architectures from both companies. It is because AMD GPUs do more work each clock cycle. If I had to put it in CPU single-threaded terms: AMD GCN is Intel Skylake, and Nvidia is Intel Core 2 Duo.

About gaming performance, everything stays as I have written. In DX11, Vega will perform around GTX 1080 level. In DX12 and Vulkan, with Vega's features implemented, it will perform much faster than the GTX 1080 Ti.
In Windows, of course. On macOS neither of these has well-optimised drivers, decent GPU encode/decode, or up-to-date APIs to use Creative Suite with.
Apple may have split kernels in their software, but this remains to be confirmed.
 
It appears that the best performance/watt/price GPU for FP16 in the upcoming months will be the RX Vega 56, with 21 TFLOPs of FP16 compute power, a 210W TDP, and a $400 price tag.
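The 21 TFLOPs figure follows from Vega's packed (double-rate) FP16 math (a sketch; the 3584 stream processors and ~1470 MHz boost clock are the commonly quoted Vega 56 numbers, treat them as assumptions):

```python
# Vega's rapid packed math does two FP16 ops per FP32 lane per slot,
# so FP16 peak = 2 x FP32 peak = 2 * (shaders * 2 FLOPs * clock).
def fp32_tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * 2 * clock_ghz / 1000

shaders, clock = 3584, 1.47
print(fp32_tflops(shaders, clock))       # ~10.5 TFLOPs FP32
print(2 * fp32_tflops(shaders, clock))   # ~21.1 TFLOPs FP16
```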
 