On the CUDA benchmarks: I know Rominator was asked the question, but I also had these to hand. The machine is a Mac Pro 3,1 (2008) with one GTX 285 Mac Edition and one injected PC GTX 285 with 2 GB of RAM. The OS is 10.6.3 with the CUDA 3.0 release version. The PCIe lane width is x16 in both cases, but the link speed is 5 GT/s for the Mac card and 2.5 GT/s for the injected one. Most of the time you can see very little difference between these two cards. For example:
bash-3.2$ ./MonteCarloMultiGPU
[uninformative output deleted]
GPU #0
Options : 128
Simulation paths: 262144
Time (ms.) : 1.538000
Options per sec.: 83224.968161
GPU #1
Options : 128
Simulation paths: 262144
Time (ms.) : 1.643000
Options per sec.: 77906.268704
If only the Mac card is running, it turns in more than 100k options/s (there seems to be an overhead in going to multi-GPU). The bandwidthTest code likewise shows little difference between the cards. But here and there you can see a more substantial difference. The most dramatic is with the BlackScholes code. Here it is run twice, once targeted at each GPU; I have deleted some irrelevant bits of the output.
Here is the Mac card:
bash-3.2$ ./BlackScholes --device=0
Using CUDA device [0]: GeForce GTX 285
Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.698568 msec
Effective memory bandwidth: 114.519933 GB/s
Gigaoptions per second : 11.451993
BlackScholes, Throughput = 11.4520 GOptions/s, Time = 0.00070 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128
Here is the injected card:
bash-3.2$ ./BlackScholes --device=1
Using CUDA device [1]: GeForce GTX 285
Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 1.021953 msec
Effective memory bandwidth: 78.281478 GB/s
Gigaoptions per second : 7.828148
BlackScholes, Throughput = 7.8281 GOptions/s, Time = 0.00102 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128
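As a quick sanity check, the throughput and bandwidth figures in the two runs above can be reproduced from the option count and the kernel times alone. This is only a sketch: every input is copied from the output, and the 10 bytes per option is an assumption inferred from the ratio of the reported bandwidth to the reported option rate, not taken from the SDK source.

```python
# Cross-check of the BlackScholes figures quoted above.
# All inputs are copied from the two runs; nothing here is measured anew.

OPTIONS = 8_000_000  # from the "Options count" line

# Kernel times in milliseconds, from the "BlackScholesGPU() time" lines.
KERNEL_MS = {"mac": 0.698568, "injected": 1.021953}

# Assumption: 10 bytes moved per option, inferred from the ratio
# 114.519933 GB/s / 11.451993 GOptions/s in the reported output.
BYTES_PER_OPTION = 10


def goptions_per_sec(options: int, time_ms: float) -> float:
    """Options processed per second, in units of 1e9."""
    return options / (time_ms * 1e-3) / 1e9


def bandwidth_gb_per_sec(options: int, time_ms: float) -> float:
    """Effective memory bandwidth in GB/s under the bytes-per-option assumption."""
    return options * BYTES_PER_OPTION / (time_ms * 1e-3) / 1e9


rates = {name: goptions_per_sec(OPTIONS, ms) for name, ms in KERNEL_MS.items()}
bandwidths = {name: bandwidth_gb_per_sec(OPTIONS, ms) for name, ms in KERNEL_MS.items()}

# Ratio of the two option rates: how much faster the Mac card ran this kernel.
speedup = rates["mac"] / rates["injected"]

for name in KERNEL_MS:
    print(f"{name:9s} {rates[name]:8.4f} GOptions/s  {bandwidths[name]:9.4f} GB/s")
print(f"Mac card / injected card: {speedup:.2f}x")
```

The recomputed values agree with the reported ones to within the rounding of the printed kernel times, and the ratio puts the Mac card roughly 46% ahead on this one benchmark.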
Now, I have not turned down the PCI multiplier (I do not know how) and no longer have an 8800 in the machine, but maybe Rominator could run these same codes? Almost all of the time, though, the injected card performs very similarly. For example, both give about 480 GFLOPS on the nbody simulation.
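For reference, the two link speeds quoted at the top of the post translate into theoretical per-direction transfer ceilings as sketched below. The assumptions are that both links run 16 lanes (as reported) and use 8b/10b encoding, which is what PCIe 1.x and 2.0 specify. These are upper bounds, not measurements; real host-to-device copies come in lower.

```python
# Theoretical per-direction PCIe bandwidth for the two link rates mentioned.
# Assumption: 8b/10b encoding (PCIe 1.x and 2.0), so only 8 of every
# 10 bits on the wire carry payload.

LANES = 16                  # x16 for both cards, as reported in the post
ENCODING_EFFICIENCY = 8 / 10


def link_bandwidth_gb_per_sec(gt_per_sec: float, lanes: int = LANES) -> float:
    """Payload bandwidth in GB/s for one direction of the link."""
    payload_bits_per_sec = gt_per_sec * 1e9 * ENCODING_EFFICIENCY * lanes
    return payload_bits_per_sec / 8 / 1e9  # bits -> bytes -> GB


mac_link = link_bandwidth_gb_per_sec(5.0)       # Mac card, 5 GT/s
injected_link = link_bandwidth_gb_per_sec(2.5)  # injected card, 2.5 GT/s

print(f"Mac card link:      {mac_link:.1f} GB/s")
print(f"Injected card link: {injected_link:.1f} GB/s")
```

That works out to about 8 GB/s for the Mac card and about 4 GB/s for the injected one, so only workloads that move a lot of data across the bus would be expected to feel the slower link directly.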