
mobilehaathi

So I am considering a Mac Pro to help me along with some giant number-crunching problems I'm working on, and I am curious about Grand Central Dispatch with respect to multiple video cards. From my limited and simplified understanding at the moment, GCD allows you to send blocks of code out for parallel execution to open cores, including open cores on video cards. I assume this holds when having, say, four GT 120s. I'm thinking I could queue a bunch of blocks for parallel execution and have all four video cards (and maybe some i7 cores) cranking away. Does anyone know if this is an accurate (albeit very simplified) characterization?
 

Grand Central Dispatch does no such thing. You are thinking of OpenCL here. And OpenCL will do what you are saying. I am not sure whether OpenCL will send the same work to _different_ kinds of processors; you would have to read a bit more of the documentation.
 
I guess I'm mixing up names. GCD is the scheduler, and OpenCL is the API? At any rate, I'm definitely going to be digging through the documentation soon. Thanks for the input. :)
 
GCD/OpenCL noob here, but I assume that dispatching blocks would be a good way to start OpenCL jobs in parallel on different OpenCL devices. They are different technologies, but they are not mutually exclusive. Please correct me if I'm mistaken.

One interesting tidbit I picked up in Apple's OpenCL programming guide -- Apple's OpenCL CPU implementation takes advantage of GCD.
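
Something like this is what I'm picturing (purely a sketch, and untested: run_job_on_device() is a made-up stand-in for the real context/program/kernel setup, and I'm assuming Apple's implementation accepts NULL for the platform argument):

Code:
#include <OpenCL/opencl.h>
#include <dispatch/dispatch.h>

// Hypothetical stand-in for the real work: create a context and queue on
// this device, build the program, enqueue the kernel, read back results.
static void run_job_on_device(cl_device_id dev)
{
    (void)dev;  // real OpenCL setup and kernel launch would go here
}

static void fan_out(void)
{
    cl_device_id devices[8];
    cl_uint ndev = 0;

    // Ask for every device OpenCL knows about: GPUs and the CPU alike.
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

    dispatch_group_t group = dispatch_group_create();
    dispatch_queue_t q =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    // One GCD block per device, all cranking away concurrently.
    for (cl_uint i = 0; i < ndev; i++) {
        cl_device_id dev = devices[i];
        dispatch_group_async(group, q, ^{ run_job_on_device(dev); });
    }

    // Wait until every device has reported back.
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
}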
 

They are quite independent technologies, really. GCD makes it quite easy to run any normal CPU code on different threads: the same things you could have done on any POSIX system for many years, except that it is much easier to do, has much less overhead, and is balanced over the whole system, not just your application. To a large extent GCD is useful for making the user interface more responsive by doing things in the background instead of waiting for them.
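
A minimal sketch of the pattern as a command-line program (in a GUI app you would just return to the run loop instead of waiting on a semaphore):

Code:
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void)
{
    // Semaphore so this command-line example can wait for the background
    // block to finish before the process exits.
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    // Hand a block to a system-managed queue; GCD balances its worker
    // threads across the whole machine, not just this process.
    dispatch_async(
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        printf("crunching in the background...\n");
        dispatch_semaphore_signal(done);
    });

    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    return 0;
}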

With OpenCL, you need tasks that are massively parallel (as in graphics applications, where you have 1920 x 1200 pixels that need to be calculated independently of each other), and OpenCL then compiles these tasks to optimal code for GPUs and possibly CPUs and distributes them among those resources. To the caller, it doesn't seem as if there are any threads involved. OpenCL only works well with problems that are suitable to be handled that way; any dependencies between tasks make life very hard for OpenCL.
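
For illustration, a toy kernel of exactly that shape; scale_pixels and its arguments are invented for the example:

Code:
// OpenCL C. Each work-item owns exactly one element and never talks to
// its neighbours -- the "1920 x 1200 independent pixels" situation.
__kernel void scale_pixels(__global float *pixels,
                           const float gain,
                           const uint n)
{
    size_t i = get_global_id(0);  // which element this work-item owns
    if (i < n)                    // guard against padding in the global size
        pixels[i] *= gain;
}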
 
There are a couple sample apps for OpenCL on Apple's developer site that show what you can do.

How you spread the love around seems to be up to the developer, so you can use the devices available as you see fit. ManiG is right that GCD is probably ideal for kicking off and then chaining OpenCL jobs when you want to spread calculation across multiple devices.

One thing the NBody example tells me, beyond the simple fact that they should also have used GCD, is that you will need some way to optimize the block size given to each device. Each device will have vastly different performance, especially depending on the user's setup. If you just split everything up equally, your total performance will fall to that of the slowest device.

If your CPU cores can pull 12 Gflops, your 8600M pulls 30 Gflops, and the 8400M pulls 12 Gflops as well, then you need to include a little time tracking. Ideally, all devices should report in at the same time, so you will need to split up the data blocks accordingly: feed about 56% to the 8600M and about 22% each to the CPUs and the 8400M. You'd want to track this every run and adjust the percentages accordingly. It isn't too hard, using gettimeofday() or similar, to measure how long the OpenCL kernel took to execute and report that back to the code that manages which device gets what block of data.
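
A sketch of that bookkeeping, with made-up helper names; OpenCL's own event profiling (clGetEventProfilingInfo) would work just as well as gettimeofday():

Code:
#include <sys/time.h>
#include <stddef.h>

// Wall-clock seconds; unlike localtime(), which only formats calendar
// time, this can measure an elapsed interval.
static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// After each run, give every device a share of the next run proportional
// to its measured throughput. Assumes at most 16 devices.
static void rebalance(size_t total_elems, int ndev,
                      const size_t *elems_done, const double *secs_taken,
                      size_t *next_share)
{
    double rate[16], sum = 0.0;
    for (int i = 0; i < ndev; i++) {
        rate[i] = elems_done[i] / secs_taken[i];  // elements per second
        sum += rate[i];
    }
    for (int i = 0; i < ndev; i++)
        next_share[i] = (size_t)(total_elems * (rate[i] / sum));
}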
 
Well, this is all sounding quite encouraging. Thankfully, my problems are embarrassingly parallel, and therefore seem well suited for this implementation. It seems a close reading of the documentation plus messing around on the lappy is indicated.

:apple:
 