From the Vega whitepaper:

HBCC technology can be leveraged for consumer applications, as well. The key limitation in that space is that most systems won’t have the benefit of large amounts of system memory (i.e., greater than 32 GB) or solid-state storage on the graphics card. In this case, HBCC effectively extends the local video memory to include a portion of system memory. Applications will see this storage capacity as one large memory space. If they try to access data not currently stored in the local high-bandwidth memory, the HBCC can cache the pages on demand, while less recently used pages are swapped back into system memory. This unified memory pool is known as the HBCC Memory Segment (HMS).

The Intelligent Workload Distributor is "aware" of data that is not stored in the on-GPU cache, and can access it immediately.

Everything depends on the implementation and on the instructions issued by software. HBCC indexes not only the data stored in GPU memory; it also indexes system RAM, non-volatile storage, and even the CPU page tables.
[Attached slides from the Vega presentation: slides-10.jpg, slides-11.jpg]


That is why HBCC can help increase gaming performance: the GPU can avoid stalls and keep its cores filled with work.

The funniest part: as always on this forum, you are arguing with me about the same thing, just looking at it from two different angles.

As for the Intelligent Workload Distributor, this slide is very helpful for understanding why the features have to be perfectly synced to extract everything out of Vega:
[Attached slide: slides-44.jpg]
 

How does it do that? IT MOVES THE PAGES INTO LOCAL HIGH-BANDWIDTH MEMORY, or in other words, the local video memory on the GPU. That's all HBCC does: it automatically moves pages from external storage (system memory, SSD, whatever) into video memory (which is now a giant L3 cache).

So, please explain how HBCC improves performance when THE DATA IS ALREADY IN LOCAL HIGH-BANDWIDTH MEMORY. It doesn't matter how the data got there; once it's there, HBCC is done. How does HBCC improve performance for a game where the data for a given scene/level is already in the local high-bandwidth memory (i.e. local video memory)? You still haven't answered that question, probably because there is no answer and HBCC does nothing in that situation. HBCC is designed for massive data sets that cannot fit into local memory, and in those cases it manages residency in local video memory on a page-by-page basis, automatically. That is a great feature and I hope it becomes standard, but it does ABSOLUTELY NOTHING for applications whose data sets already fit into local video memory. Why is this so hard for you to understand?

IWD is not a special client of HBCC; it and all the other units on the GPU just go through the memory and cache hierarchy as normal. HBCC is different because it turns local video memory into a giant L3 cache instead of something that must be manually managed by the driver or application. Nothing more, nothing less.
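
To make the "giant L3 cache" framing concrete, here is a minimal toy sketch of demand-paged residency with LRU eviction, the policy family a cache controller implements. This is a software model for illustration only; the real HBCC does this in hardware on memory pages, and every name and size below is invented:

```cpp
#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>

// Toy model of demand paging with LRU eviction.
class PageCache {
public:
    explicit PageCache(std::size_t capacityPages) : capacity_(capacityPages) {}

    // Access a page. On a hit, just refresh its LRU position. On a miss,
    // "transfer" it from backing store, evicting the least recently used
    // page if local memory is full.
    void access(std::uint64_t page) {
        auto it = index_.find(page);
        if (it != index_.end()) {                  // hit: data already local
            lru_.splice(lru_.begin(), lru_, it->second);
            ++hits_;
            return;
        }
        ++misses_;                                 // miss: stall + bus copy
        if (lru_.size() == capacity_) {
            index_.erase(lru_.back());             // evict LRU page
            lru_.pop_back();
        }
        lru_.push_front(page);
        index_[page] = lru_.begin();
    }

    void stats() const {
        std::cout << "hits=" << hits_ << " misses=" << misses_ << '\n';
    }

private:
    std::size_t capacity_;
    std::list<std::uint64_t> lru_;                 // front = most recently used
    std::unordered_map<std::uint64_t,
                       std::list<std::uint64_t>::iterator> index_;
    std::uint64_t hits_ = 0, misses_ = 0;
};

int main() {
    PageCache vram(3);                             // pretend VRAM holds 3 pages
    for (std::uint64_t p : {1, 2, 3, 1, 2, 4, 1})  // working set of 4 pages
        vram.access(p);
    vram.stats();                                  // prints hits=3 misses=4
}
```

Note the corollary: once the working set fits within `capacityPages`, every access after warm-up is a hit and the controller has nothing left to do, which is exactly the disagreement in this thread.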
 
I will explain it differently, by asking you a question.

Knowing everything you have written about the data: is the Intelligent Workload Distributor able to work without HBCC at its maximum efficiency and capability?

Also, what effect does that have on the capability to create worlds, per the video you linked?

[Attached slide: slides-09.jpg]

Paging in this scenario means you can portion the data into much smaller pieces.

The Intelligent Workload Distributor can tell the HBCC to index certain data in advance, and this way lower access times and increase performance. That is far more efficient than removing and moving the data after the work is done. Now it can be done BEFORE the work starts.
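
Whether Vega's front end can actually direct the HBCC this way is exactly what is disputed in the replies below, and AMD's public material does not document such an interface. Purely to illustrate the idea being claimed, a driver-side prefetch hint layered on the toy PageCache sketched earlier in this thread might look like the following; the prefetchScene helper is hypothetical, not a real HBCC or driver API:

```cpp
#include <cstdint>
#include <vector>

// HYPOTHETICAL: touch the pages a future scene will need while the GPU is
// still busy with the current one, paying the miss cost early instead of
// mid-frame. No such public HBCC/driver interface is documented.
void prefetchScene(PageCache& vram,
                   const std::vector<std::uint64_t>& nextScenePages) {
    for (std::uint64_t page : nextScenePages)
        vram.access(page);   // forces residency now; later accesses hit
}
```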

Why is it so hard for you to grasp this concept, which I have been pointing at ALL this time?
 

Let's say I have an 8GB Vega 64. Let's say my application has a working set of 2GB, and all that 2GB of data has been transferred to the local video memory by the driver (e.g. if it's using DX11). IWD now has full access to all the data that is sitting in video memory. Please explain how HBCC improves performance in this case? Hint: it doesn't.

Look, I get that HBCC is cool, and I hope applications that take advantage of it come along. However, HBCC is specifically designed to virtualize the local video memory and make it appear as if the GPU has effectively infinite amounts of fast memory.

Counter example: Let's say I have an 8GB Vega 64. Let's say my application has a working set of 512GB, and HBCC is enabled. HBCC will now automagically manage which pages are resident, based on the memory accesses done by all the other units on the GPU (including the IWD, but also everything else, like the ROPs, texture units, etc.). It treats access to pages that aren't resident just like any other cache controller treats cache misses:

https://www.techopedia.com/definition/6308/cache-miss

Why is HBCC cool? Because it means the application (or driver) doesn't need to know what the access patterns are and manually micro-manage the residency of the data. It can just say "here's my 512GB of data, go!" which is super cool.

But yeah, I'm done. Keep thinking that HBCC is a magic bullet that is going to make Vega not terrible.
 
It doesn't improve performance.

Why not? Because you assume it will store ONLY that 2 GB of data in the cache. The DSBR and all the other parts of the pipeline are designed to pull in as much data as possible to feed the cores, combining caching with explicit culling techniques: rasterization culling and geometry culling.

What the DSBR, Primitive Shaders, and the Intelligent Workload Distributor do in combination is fine-grained scheduling, culling, and feeding of the cores, to increase graphics throughput.

What does HBCC have to do with all of this? Well, it is an inherent part of the pipeline, because it is part of the memory controller.

You are describing exactly what I have been saying about HBCC in the context of games. It can store not only the data relevant to the current scene; as the pipeline empties, it can automatically page in all the data needed for the NEXT scene, and the next one, and the next one, and the next one... Or page in data relevant to a potential scene that can occur under specific conditions, and that scene might come next instead of the "normal" scripted scene.

This way you limit or eliminate the stalls caused by refreshing pages in the memory pool, and therefore you increase performance.

Yes, it is a cool feature, because it saves developers TONS of time on memory management.

What is funny is that you believe I wrote that HBCC is a feature that will "save Vega". I never wrote anything like that; it is only your assumption, or a limited understanding of what I was talking about.

In this post, is there a single word saying HBCC will save Vega? https://forums.macrumors.com/thread...rumors-and-info.2056361/page-28#post-24957226
Nothing.
Maybe there is a word about it in this post? https://forums.macrumors.com/thread...rumors-and-info.2056361/page-28#post-24958233
No...
So maybe in this one?
https://forums.macrumors.com/thread...rumors-and-info.2056361/page-28#post-24958469
No.

All I was writing about was what the combination of ALL the new Vega features is going to mean for this architecture, including HBCC.
 
Vulkan-capable systems are more common on Steam than DX10, DX11, or DX12 systems, at almost double the "DX10" share. It must be the Linux contribution.
 
I am not fully confident in this test. He maxed out the anti-aliasing; that could be the bottleneck.

 
Without Primitive Shaders, the Draw Stream Binning Rasterizer, and the Intelligent Workload Distributor, you will see the same results as this guy had.
 

You can keep saying this, it won't be any more correct though. HBCC does nothing when the working set fits in video memory, as the guy in the video explained (i.e. when games are tuned to work on 8GB cards, then they don't need more than 8GB to work well, and thus HBCC does nothing). It might be a different story if he was testing with a 4GB or even a 2GB Vega card, since the games would use more than that and HBCC would have a chance to actually do something.
 
I could not really tell from the video if the games were actually using more than 8GiB of graphics memory. Of course HBCC will do nothing if everything fits in VRAM.
 
What will happen when you have indexed and stored data not only for the current scene, but also for the next one, and the next one, and the next one?

And why, when your data exceeds the 8 GB cache size, do you still get lower performance with HBCC on?
 

So HBCC can read the future and prepare for what the application is going to do before it even does it? How does that work? Can you point me at the whitepaper for that? Hint: that's not how cache controllers work.

Why did that one game get slower? There are 2 very simple reasons:

1) The rest of the GPU isn't powerful enough to run at those settings and resolution. The average FPS was marginally higher (5.9 vs 5.8) with HBCC on, but the real bottleneck was elsewhere in the GPU and thus HBCC can't do anything to save it.

2) The minimum FPS got worse because the game stalled while HBCC transferred the data from system memory. Any time the GPU reads a page that isn't in video memory, HBCC has to transfer that data over the bus, just like any standard cache controller taking a cache miss:

https://www.techopedia.com/definition/6308/cache-miss

So, it's entirely expected that cache misses will slow things down, that's like rule number 1 of modern programming. When faced with a situation where not all the resources fit, most graphics drivers (or the DX runtime) do a pretty good job of putting more important surfaces into video memory and less important surfaces (like textures on models that aren't referenced a lot) in system memory. That gives you a steady state without any thrashing, and decent performance overall.
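
Back-of-the-envelope numbers show why this shows up in minimum FPS rather than average FPS. A minimal sketch, assuming an illustrative 10 ms frame and a 0.5 ms stall per page fault (both figures invented, not measured from Vega):

```cpp
#include <iostream>

int main() {
    // Invented illustrative numbers, not measurements.
    const double baseFrameMs = 10.0;  // frame time with all pages resident
    const double missStallMs = 0.5;   // stall per page fault over the bus
    for (int faults : {0, 4, 40}) {
        double frameMs = baseFrameMs + faults * missStallMs;
        std::cout << faults << " faults -> " << frameMs << " ms ("
                  << 1000.0 / frameMs << " FPS)\n";
    }
}
```

A fault-free frame runs at 100 FPS, while a frame taking a burst of 40 faults drops to about 33 FPS, so the occasional thrashing frame drags the minimum down even though the average barely moves.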
 
It's not that HBCC itself can prepare anything. Primitive Shaders and the Intelligent Workload Distributor are designed to communicate with it, and to allow it to prepare work for future scenes. The reason it bottlenecks performance in current games is that none of the new Vega hardware features are enabled in the drivers in the first place.

This is what I have been trying to tell you guys over the past few pages: you are looking at the same thing as I am, just from a different angle.
 

Nobody is disagreeing with you that Primitive Shaders, IWD etc will all help. The point you're refusing to acknowledge is that HBCC does absolutely nothing for you for applications that can already fit all their resources into video memory. Cache controllers manage residency of data in a cache, and HBCC manages video memory as a giant L3 cache. If everything is already in this giant L3 cache, then HBCC has nothing to do, and will not improve performance. Nothing more, nothing less.
 
In the cache, you store the data required to draw a particular scene. If Primitive Shaders are working properly, or are enabled in the drivers, you can cull a massive amount of data that is not required for that particular scene, and save memory. What this means for the HBCC and the Intelligent Workload Distributor is that they can prepare data in the cache for another scene that may appear after this one.

HBCC alone does nothing. It will never improve maximum framerates, because those are always tied to graphics throughput and CPU draw calls. What it can improve is minimum framerates. ALWAYS. Regardless of the data set you have.

This is what I was writing from the start.

Only you and Kpjoslee have read something into my posts that I never wrote in the first place.
 

The claim that HBCC does nothing once everything fits in video memory holds only until you have resources that are being dynamically generated by the CPU...

It's very rare that you have an application or game where everything is precomputed and can be uploaded in one giant chunk.

I don't know much about HBCC, but from what is being described here, APIs like Metal are already designed around HBCC-like concepts, and users of Metal could see performance gains if HBCC were tied into Metal. Metal has a lot of use cases around streaming CPU-generated data to the GPU, but you do run the risk of pulling in data you don't yet need and causing performance bottlenecks.
 

All games stream some amount of data down each frame. All modern APIs have good ways of dealing with this, without requiring HBCC. For example:

D3D9: https://msdn.microsoft.com/en-us/library/windows/desktop/bb205917(v=vs.85).aspx with DISCARD and NOOVERWRITE.

D3D11: https://msdn.microsoft.com/en-us/library/windows/desktop/ff476457(v=vs.85).aspx with WRITE_DISCARD and WRITE_NO_OVERWRITE.
https://msdn.microsoft.com/en-us/li...76181(v=vs.85).aspx#DISCARD_NO_OVERWRITE_USES.

Metal: https://developer.apple.com/documentation/metal/mtlbuffer/1516121-didmodifyrange on a Managed buffer.

All drivers are good at streaming this dynamic data down to the GPU as needed, since the technique has been around for well over a decade (I guess DX9 came in 2003?).
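
For reference, the D3D11 flavor of that pattern is the classic dynamic ring buffer: append with WRITE_NO_OVERWRITE, reset with WRITE_DISCARD on wrap. A minimal sketch, assuming the buffer was created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE; error handling omitted:

```cpp
#include <d3d11.h>
#include <cstring>

// Append 'size' bytes of per-frame data to a dynamic ring buffer.
// WRITE_NO_OVERWRITE promises the driver we won't touch bytes the GPU may
// still be reading, so it never has to stall; WRITE_DISCARD hands us a
// fresh buffer allocation when the ring wraps.
UINT AppendDynamic(ID3D11DeviceContext* ctx, ID3D11Buffer* ring,
                   UINT ringSize, UINT cursor, const void* data, UINT size) {
    D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
    if (cursor + size > ringSize) {       // wrapped: discard and restart at 0
        mapType = D3D11_MAP_WRITE_DISCARD;
        cursor = 0;
    }
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    ctx->Map(ring, 0, mapType, 0, &mapped);
    std::memcpy(static_cast<unsigned char*>(mapped.pData) + cursor, data, size);
    ctx->Unmap(ring, 0);
    return cursor + size;                 // new write cursor for the caller
}
```

The same idea underlies the D3D9 lock flags and Metal's didModifyRange linked above: tell the driver precisely which bytes changed, and it only has to move those.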
 

Except drivers are not good at uploading portions of buffers, only the entire buffer. If I have a 1000x1000-pixel image I've created on the CPU, and the GPU only needs a 100x100 chunk of it, the drivers don't magically fix that.

I can fix that in my app by cropping down to 100x100 and sending only that chunk. But I might not. Yes, if your app is perfectly engineered and you've handled every optimization issue, HBCC is useless. But HBCC could be very useful in situations like that.

Metal is also not smart enough to skip uploading a buffer you don't actually use. With something like HBCC, Metal could defer uploading a buffer until the GPU actually needs it. Again, an app developer who is really paying attention might optimize every single case, but there will be times where a buffer is uploaded that doesn't need to be, because we can't know in advance whether the GPU will need that buffer, so everything the GPU could possibly need gets uploaded. That's even more important in Metal 2, where a shader can dynamically decide which other shaders to invoke: you might be uploading resources ahead of time for a shader that is never invoked.

Of course HBCC would have its own performance penalty for lazy loading, but that can outweigh the penalty of moving too much data over the PCIe bus when you only need a subset of it. Remember, even if you have enough VRAM, moving data into VRAM is not free, and there is a performance penalty for moving too much data over.

Again, if a developer optimized their app perfectly, HBCC is useless. But HBCC could help make badly optimized apps faster.
 
To add to all of this, take note that HBCC has direct access to the CPU page tables.

Have a nice day.
 

Did you even read the links I posted? didModifyRange allows the application to explicitly say "I only modified X bytes at offset Y out of the total Z bytes". If the driver can't turn that into a transfer of only the modified bits, then I don't know what to tell you.

Similarly, the D3D9 lock method takes an offset and a size. Most apps set the DISCARD flag when they start at offset 0, and then append modified regions with the NOOVERWRITE flag to indicate that it's new data that isn't overwriting valid data. This gives the driver the exact range of modified data, just like the Metal interface. If the driver can't turn this into a transfer of only the modified bits, I don't know what to tell you.

What's your source for the claim that Metal can't skip an upload for resources that are never used? In my experience it behaves very similarly to OpenGL, in that resources are marked dirty and uploaded when they are actually used by a command encoder. A well-written Metal application would manage its own data transfers, using a MTLBlitCommandEncoder to stream data from a Shared (i.e. system memory) buffer to a Private (i.e. video memory) texture. In other words, the application gets to control where and when the transfer happens.
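
To illustrate that last point, the Shared-to-Private streaming described there looks roughly like this using Apple's metal-cpp bindings (a sketch under the assumption that the metal-cpp method names mirror the Objective-C API one-to-one, which is how those bindings are generated; not a verified implementation):

```cpp
#include <Metal/Metal.hpp>  // metal-cpp bindings

// Stage per-frame data in a CPU-visible Shared buffer, then explicitly blit
// it into a Private (video memory) buffer. The application, not the driver,
// decides exactly when the transfer happens and how many bytes move, much
// like didModifyRange advertises a dirty subrange of a Managed buffer.
void StreamToPrivate(MTL::CommandQueue* queue,
                     MTL::Buffer* shared, MTL::Buffer* priv,
                     NS::UInteger offset, NS::UInteger length) {
    MTL::CommandBuffer* cmd = queue->commandBuffer();
    MTL::BlitCommandEncoder* blit = cmd->blitCommandEncoder();
    blit->copyFromBuffer(shared, offset, priv, offset, length);  // dirty bytes only
    blit->endEncoding();
    cmd->commit();   // GPU copies from system memory into video memory
}
```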

Again, I'm not suggesting HBCC isn't cool or anything like that. It's just that it has a very specific usage case (huge working set that can't fit into video memory) and applications aren't written that way right now. I'd imagine high-performance computing applications (including Deep Learning) will be the first ones to take advantage of this, and I'd be surprised if games really adopted this until HBCC appears in one or more of the consoles.
 