
Cubemmal (original poster):
Let's discuss the technical issues with Thunderbolt GPU support.

The most promising external GPU work is at SilverStone, which recently demonstrated a work in progress:

[image: SilverStone Thunderbolt external GPU prototype]


MSI also has something they're working on, but inexplicably they didn't include power.

Does this case make sense?

The Anandtech Mac Pro review states that PCIe 3.0 x16 offers 15.7 GB/s of bandwidth. Further on in the review he shows TB 2.0 as having 2.5 GB/s of bandwidth - quite a difference. However, consider that with TB 1.0 half of the bandwidth can be used for video output, namely 1.25 GB/s. That's the theoretical max, and we can assume that displaying pixels on a 27" 2560x1440 screen at 30 Hz or 60 Hz takes no more than 1.25 GB/s.
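As a quick sanity check on that last claim, here's my own back-of-the-envelope math, assuming uncompressed scan-out at 4 bytes per pixel and ignoring blanking overhead:

```python
# Rough scan-out bandwidth for a 2560x1440 display.
# Assumption (mine): 4 bytes per pixel, uncompressed, no blanking overhead.
width, height = 2560, 1440
bytes_per_pixel = 4

for refresh_hz in (30, 60):
    gb_per_s = width * height * bytes_per_pixel * refresh_hz / 1e9
    print(f"{refresh_hz} Hz: {gb_per_s:.2f} GB/s")

# 30 Hz: 0.44 GB/s
# 60 Hz: 0.88 GB/s  -- comfortably under the 1.25 GB/s figure above
```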

So the question is, does that really require 15.7 GB/s on the input? What are all those PCIe x16 lanes for anyhow? Loading the VRAM (RAM->VRAM DMA), serving interrupts (CPU->GPU), and inter-card communication (SLI/Crossfire) AFAIK. Now we can eliminate the final use case as this would be a single card only solution, though theoretically somebody could actually make a two slot external TB2.0->PCIe GPU box with two x16 lanes. Are 16 lanes really needed for loading VRAM and serving interrupts?

So here is where my knowledge and searching peter out. My question is whether this analysis is accurate, and whether this kind of device can work.

The remaining pieces for SilverStone are to get Intel certification and OS X Thunderbolt-aware drivers.
 
1st off: most GPUs aren't maxing out the full x16 lanes. In fact, in many consumer-grade computers the GPU is most likely on an x8 link anyway. Now, x8 PCIe 3.0 vs x4 PCIe 2.0 is a difference of about 4x, so there is still a data penalty, obviously. I'm going to point you to an article Barefeats did a while back comparing a Radeon 5770 and 5870 running on various Mac Pros. The Mac Pro 1,1 uses PCIe 1.0 versus 2.0 on all the others; while there was a penalty in overall performance, what is interesting is how often the penalty was minimal. It also isn't a perfect comparison, because the Mac Pro 1,1 has significantly less CPU power, so some of the penalty seen may not even have been due to the lack of PCIe bandwidth, but rather to a lack of CPU grunt to pass data to the GPU.

http://www.barefeats.com/wst10g7.html
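To put rough numbers on that 4x claim, here's a quick sketch using approximate per-lane figures (my numbers; exact usable throughput varies with encoding and overhead):

```python
# Approximate usable bandwidth per PCIe lane, per generation, in GB/s.
# PCIe 1.0/2.0 use 8b/10b encoding; PCIe 3.0 uses 128b/130b.
per_lane = {"PCIe 1.0": 0.25, "PCIe 2.0": 0.5, "PCIe 3.0": 0.985}

for gen, lane_bw in per_lane.items():
    row = ", ".join(f"x{lanes}: {lane_bw * lanes:.1f} GB/s" for lanes in (4, 8, 16))
    print(f"{gen} -> {row}")

# PCIe 3.0 x8 ~ 7.9 GB/s vs PCIe 2.0 x4 ~ 2.0 GB/s: about the 4x gap above,
# and PCIe 3.0 x16 ~ 15.8 GB/s lines up with Anandtech's 15.7 GB/s figure.
```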

Anyway, my point is: yes, there will be a penalty using an external GPU. Most who have done it already (using external GPUs) are doing it because they have MacBook Airs or Mac minis with their iGPUs, and just about any discrete GPU (even a bandwidth-starved one) is worth it.

Whether it would be worth doing for the Mac Pro is another matter, especially when you can buy dual D700s for not a whole lot more than what the external GPU boxes are going for ($800 for just the box alone for a Sonnet Tech one), and you do not have to worry about starving your GPU for data if you go with the D700s.
 
Let's discuss the technical issues with Thunderbolt GPU support.

Like that hasn't been exhaustively done before....


The Anandtech Mac Pro review states that PCIe 3.0 x16 offers 15.7 GB/s of bandwidth. Further on in the review he shows TB 2.0 as having 2.5 GB/s of bandwidth - quite a difference.

There is a difference between Thunderbolt bandwidth and the bandwidth of the PCIe data transported over it. What travels over TB cables is TB data. What eventually gets to the cards is PCIe data.

...Since there's 20Gbps of bandwidth per channel, you can now do 4K video over Thunderbolt. You can also expect to see higher max transfer rates for storage. Whereas most Thunderbolt storage devices top out at 800 - 900MB/s, Thunderbolt 2 should raise that to around 1500MB/s (overhead and PCIe limits will stop you from getting anywhere near the max spec). ...

Pragmatically the limit is about x3 PCIe 2.0 worth of bandwidth (roughly 1.5 GB/s). It is good that the Mac Pro opens slightly higher to the individual TB controllers. That will help with latency and traffic aggregation issues ( http://en.wikipedia.org/wiki/Fat_tree ), but that isn't what the external GPU card is going to see.
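Back-of-the-envelope, and assuming the usual description of a TB2 controller hanging off a PCIe 2.0 x4 host link (my assumption, not a spec citation):

```python
# Thunderbolt 2 wire rate vs. the PCIe tunnel inside it (rough, assumed numbers).
tb2_raw_gb_s = 20 / 8                 # bonded 20 Gbps channel -> 2.5 GB/s on the wire
pcie2_lane_gb_s = 0.5                 # PCIe 2.0, per lane
host_link_gb_s = 4 * pcie2_lane_gb_s  # controller on an x4 link -> 2.0 GB/s
practical_gb_s = 3 * pcie2_lane_gb_s  # ~3 lanes' worth survives overhead -> 1.5 GB/s

print(tb2_raw_gb_s, host_link_gb_s, practical_gb_s)  # 2.5 2.0 1.5
```

That ~1.5 GB/s figure is consistent with the ~1500 MB/s storage ceiling quoted above.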





However, consider that with TB 1.0 half of the bandwidth can be used for video output, namely 1.25 GB/s.

This is a goofy tangent to the topic. The bandwidth of what comes out of the GPU subsystem is highly decoupled from what goes into it. In the context of an external GPU, the output is external to the host and the TB network. It could be TB/s; it doesn't make a difference at all to the stuff upstream.






So the question is, does that really require 15.7 GB/s on the input?

You keep pointing at hardware where the usage is determined by software. As long as you're groping around in the hardware, you aren't going to find an answer.




What are all those PCIe x16 lanes for anyhow? Loading the VRAM (RAM->VRAM DMA), serving interrupts (CPU->GPU), and inter-card communication (SLI/Crossfire) AFAIK.

Over most of recent history, inter-card communication has been outside of PCIe traffic, so it has little impact. In fact, folks have used that back-channel network to skirt around PCIe-lane-starved systems (ship smaller chunks of data down two different, smaller paths and then merge the post-computation results over a proprietary back-channel network).

Whether there is loads of data to load into VRAM largely depends upon what the software is doing. There are also several work-arounds that drivers and applications have developed to load compressed data into VRAM and then decompress it on the other side to deal with limited PCI-e bandwidth.
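A hypothetical illustration of that trick; the link speed and compression ratios here are made-up assumptions, not measurements of any particular driver:

```python
# Uploading compressed data and decompressing on the GPU side stretches a thin link.
link_gb_s = 1.5      # assumed TB2-class effective link
payload_gb = 3.0     # assumed amount of uncompressed asset data to get into VRAM

for ratio in (1, 2, 4):
    seconds = (payload_gb / ratio) / link_gb_s
    print(f"{ratio}:1 compression -> {seconds:.2f} s transfer, "
          f"effective {link_gb_s * ratio:.1f} GB/s")
```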

Now we can eliminate the final use case as this would be a single card only solution, though theoretically somebody could actually make a two slot external TB2.0->PCIe GPU box with two x16 lanes.

Actually, probably not. What you would more likely see is two x4 electrical slots with x16 physical connectors (of which 3/4 of the pins are dead).

Typically, PCIe switches dilute the same-size or greater bandwidth into smaller amounts. A switch with some local x16 crossbar and a relatively minuscule upstream connection of just x4 would be strange. In TB PCIe enclosures you are going to find a switch that splits x4 into more x4s (2 or 3) or into smaller bundles (4-6 x1s). If both x16 connections started pumping data in parallel through an x4 choke point, it is doubtful you could put in an economical buffer large enough to keep up, even if you could get around the quite substantial delays that would incur. That is exactly BACKWARDS of how to design a network that deals well with congestion and latency.
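A quick sketch of why that topology is backwards; the rates and buffer size are illustrative PCIe 2.0-era assumptions, not a real product:

```python
# Two x16 slots feeding one x4 upstream link: how oversubscribed is it,
# and how fast would any buffer fill if both cards pushed data flat out?
upstream_gb_s = 2.0            # x4 PCIe 2.0 uplink
slot_gb_s = 8.0                # each x16 PCIe 2.0 slot, fully driven
offered = 2 * slot_gb_s        # 16 GB/s of offered load

print(f"oversubscription: {offered / upstream_gb_s:.0f}x")             # 8x

buffer_gb = 1.0                # even an (uneconomical) 1 GB buffer...
fill_s = buffer_gb / (offered - upstream_gb_s)
print(f"a {buffer_gb:.0f} GB buffer fills in {fill_s * 1000:.0f} ms")   # ~71 ms
```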

The remaining pieces for SilverStone are to get Intel certification and OS X Thunderbolt-aware drivers.

That enclosure isn't only going to do GPUs, so getting the TB certification has nothing to do with external GPUs working or not.

Likewise, the cards are what need the drivers, not the enclosure. Again, you are coupling issues that aren't coupled.

----------

http://www.techpowerup.com/mobile/reviews/Intel/Ivy_Bridge_PCI-Express_Scaling/23.html

Using older software isn't going to expose the loss in bandwidth. Of note in these tables is that x16 PCIe 2.0 turns in the same performance as x16 PCIe 3.0. What that says is not that those are the same - they are very substantively different. What it says is that the testing mechanism they are using is probably completely oblivious to what can be done with PCIe 3.0. Those apps handicap themselves.

It is like running Pentium 3 optimized code on a bleeding-edge Intel CPU. For moderate to high performance tasks it isn't likely to come even close to the optimal performance you could get if you actually tweaked the code to take advantage of the modern capabilities and functional units inside the CPU.

Most of these PCIe-limited benchmarks are geared toward the question of whether you can run legacy apps fast if you dilute the PCIe bandwidth.

Whether current bleeding-edge and future apps run faster is a substantially different question.
 
Like that hasn't been exhaustively done before....

Having a bad day?

There is a difference between Thunderbolt bandwidth and the bandwidth of the PCIe data transported over it. What travels over TB cables is TB data. What eventually gets to the cards is PCIe data.

Obviously

This is a goofy tangent to the topic. The bandwidth of what comes out of the GPU subsystem is highly decoupled from what goes into it. In the context of an external GPU, the output is external to the host and the TB network. It could be TB/s; it doesn't make a difference at all to the stuff upstream.

My point was to look at the bandwidths and compare them.

You keep pointing at hardware where the usage is determined by software. As long as you're groping around in the hardware, you aren't going to find an answer.
...

OK, I'm having a hard time seeing what you're trying to say here. Thanks for contributing though.
 
Recently I read this interesting article: Can OpenGL And OpenCL Overhaul Your Photo Editing Experience?
It's a very long article, so if you just want the main point, please read from the end of page 8 (Q&A Under Hood of Adobe) to the middle of page 9 (Q&A Under Hood of Adobe).

Russell Williams from Adobe stated clearly that for GPU computation, the PCIe bus connecting the GPU and CPU is the bottleneck for copying data back and forth. That means even a PCIe 3.0 x16 bus becomes a bottleneck.

I think an external GPU is not suited to GPU computation.
 
Recently I read this interesting article: Can OpenGL And OpenCL Overhaul Your Photo Editing Experience?
It's a very long article, so if you just want the main point, please read from the end of page 8 (Q&A Under Hood of Adobe) to the middle of page 9 (Q&A Under Hood of Adobe).

Russell Williams from Adobe stated clearly that for GPU computation, the PCIe bus connecting the GPU and CPU is the bottleneck for copying data back and forth. That means even a PCIe 3.0 x16 bus becomes a bottleneck.

I think an external GPU is not suited to GPU computation.

Like a lot of things, it depends.

Games use very little PCIe bandwidth. They basically stream over as much as they can when the level loads, cache it in VRAM, and then during the game very little bandwidth is used.

In Adobe's case, the bandwidth usage is very high because they have to send over 30-ish frames a second, not even counting whatever else is part of the composition. Because a lot of the content in every frame is different, the possibilities for caching are much smaller. So for apps like Adobe's, bandwidth is a lot more important.
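Rough numbers for that kind of workload (my assumptions: uncompressed 4-bytes-per-pixel frames and a ~1.5 GB/s effective TB2 link, as estimated earlier in the thread):

```python
# Bandwidth to push uncompressed frames to the GPU every frame,
# versus an assumed ~1.5 GB/s effective TB2 link (one direction).
frames_per_s = 30
bytes_per_pixel = 4
link_gb_s = 1.5

for name, w, h in (("1080p", 1920, 1080), ("4K UHD", 3840, 2160)):
    gb_per_s = w * h * bytes_per_pixel * frames_per_s / 1e9
    share = gb_per_s / link_gb_s * 100
    print(f"{name} @ {frames_per_s} fps: {gb_per_s:.2f} GB/s ({share:.0f}% of the link)")

# 1080p: ~0.25 GB/s; 4K: ~1.0 GB/s -- and that's before any round trips
# or the rest of the composition.
```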

So if you want Thunderbolt GPUs for games, it seems pretty workable. If you want it for video editing, it really depends on how much data you're throwing around.

Thunderbolt also supposedly has channel bonding, so there is a chance you might be able to use multiple Thunderbolt ports to add bandwidth.
 
Let's discuss the technical issues with Thunderbolt GPU support.


[...]

So the question is, does that really require 15.7 GB/s on the input? What are all those PCIe x16 lanes for anyhow? Loading the VRAM (RAM->VRAM DMA), serving interrupts (CPU->GPU), and inter-card communication (SLI/Crossfire) AFAIK. Now we can eliminate the final use case as this would be a single card only solution, though theoretically somebody could actually make a two slot external TB2.0->PCIe GPU box with two x16 lanes. Are 16 lanes really needed for loading VRAM and serving interrupts?

So here is where my knowledge and searching peter out. My question is whether this analysis is accurate, and whether this kind of device can work.

The remaining pieces for SilverStone are to get Intel certification and OS X Thunderbolt-aware drivers.

Yes, it requires all 15.7 GB/s in the PCIe lanes. Why? Well, images are not crunched on the CPU but on the GPU, so what is being sent from the CPU is a slew of raw geometric, trigonometric, and other graphical data. This data is very dense and has to carry the information for the picture to be displayed (and since it's per second, assume a minimum of 60 frames to be processed and displayed). All of that can clog even a nice data pipeline like PCIe. That said, be aware that back in the PCIe 2.0 days several GPUs did not yet saturate the lanes. But as NVIDIA starts to move SLI and other data-intensive tasks over PCIe, I can see 15.7 GB/s being very helpful to have.

All 16 lanes carry data. Period. Each PCIe x1 lane has its own transmit and receive pair.

A Thunderbolt GPU will be nice to have but won't get you the best quality.
 