That was what I was thinking too: not just applications utilizing the GPU, but the operating system as well.
Isn't Mavericks already doing this? A lot of people have reported increased idle usage of their GPU under Mavericks. Could be a bug, but with an OpenCL powerhouse around the corner and Haswell processors adding significant OpenCL clout to the integrated GPUs in the MacBook Pros (and presumably the Mac minis and iMacs of the future), it would make sense for this to already be under way.

What Mavericks may be doing with the GPU is anybody's guess, but the whole point of OpenCL is that you don't have to worry about whether the CPU or GPU will run the code (okay, a good developer will still think about it a little bit), so it makes a lot of sense to convert more and more stuff into OpenCL snippets. The problem is that an OS itself just doesn't have a huge amount of computation to do; most of an OS's responsibilities are in facilitating applications. That said, it would make sense to make more of the various frameworks and APIs OpenCL-capable, as these are what the vast majority of applications rely on.
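
For anyone wondering what an "OpenCL snippet" actually involves on the host side, here's a rough sketch in C against the standard OpenCL 1.x API. It's illustrative only: the kernel, the buffer size, and the lack of error checking are all my own made-up example, not anything Apple ships (on OS X it builds with -framework OpenCL).

Code:
#include <OpenCL/opencl.h>
#include <stdio.h>

/* A trivial kernel: scale every element of a float array. */
static const char *src =
    "__kernel void scale(__global float *buf, float k) {"
    "    size_t i = get_global_id(0);"
    "    buf[i] *= k;"
    "}";

int main(void) {
    float data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (float)i;

    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    /* Copy the data to the device, run the kernel, copy it back. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(data), NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    float factor = 2.0f;
    clSetKernelArg(k, 0, sizeof(buf), &buf);
    clSetKernelArg(k, 1, sizeof(factor), &factor);

    size_t global = 1024;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("data[10] = %f\n", data[10]);  /* expect 20.0 */
    return 0;
}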

ActionableMango said:
Anything not using GPUs won't benefit at all, like over on the musician thread, so it's unused cost and power.
This should really be "not currently using GPUs", as there's no particular reason that audio applications can't use OpenCL for computation, except that for most tasks a CPU is just fine. For really high-end usage, though, OpenCL computation can make a lot of sense, and there is also the TrueAudio feature on newer FirePro cards that could be used for high-definition audio processing, though it's not clear if the new Mac Pro (or OS X) will support that feature.

ActionableMango said:
Any GPU tasks not suited to parallel GPU computing also won't benefit.
No, but two such tasks running at once can. I've said it in a few threads now, but it's important to remember that while you may use a Mac Pro for a single job, that doesn't necessarily mean it can't be running several distinct tasks/applications at the same time. If you have two applications with GPU acceleration, then two GPUs will be better than one in most cases.

ActionableMango said:
Any GPU tasks that can be parallelized for greater speed would have simply benefited from the greater speed of a single faster GPU anyway.
I'm not sure that's true for OpenCL, at least when it's implemented correctly: two GPUs running an OpenCL workload between them can potentially be twice as fast, which is a bigger gain than you'd get from a single GPU that's merely faster than either of the paired GPUs.
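
As a rough illustration of how a correctly implemented app can use both cards, assuming the two GPUs show up as separate OpenCL devices sharing a context (as dual-GPU Macs present them), you give each device its own command queue and hand each one half of the index space. The helper and its arguments below are hypothetical; the kernel itself doesn't have to change.

Code:
#include <OpenCL/opencl.h>

/* Hypothetical helper: split one 1-D OpenCL workload across two GPUs.
   'ctx', 'gpu' and 'kern' are assumed to be created elsewhere; in a real
   app each device would normally get its own buffer or sub-buffer. */
void run_on_two_gpus(cl_context ctx, cl_device_id gpu[2],
                     cl_kernel kern, size_t total)
{
    cl_command_queue q0 = clCreateCommandQueue(ctx, gpu[0], 0, NULL);
    cl_command_queue q1 = clCreateCommandQueue(ctx, gpu[1], 0, NULL);

    /* Each device gets half of the index space; both halves run at once. */
    size_t half = total / 2;
    size_t off0 = 0, off1 = half;
    clEnqueueNDRangeKernel(q0, kern, 1, &off0, &half, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kern, 1, &off1, &half, NULL, 0, NULL, NULL);

    clFinish(q0);
    clFinish(q1);
    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
}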

As I said as well, it may not be so much a matter of forcing developers to produce OpenCL code: if Apple can use OpenCL in more of its frameworks, then we could potentially have a bunch of OpenCL-capable apps overnight. Okay, they may not be making the best use of the technology, but it would open up the compute power of the new Mac Pro considerably, giving us all a taste of what is then effectively a hugely powerful triple-processor machine.


I'll grant of course that this is all based on potential; while OpenCL is an existing technology right now, it's still gaining traction (and has been slowly doing so for a while). I wouldn't consider for a moment buying a new Mac Pro on the basis of what OpenCL will do for it in future; it will be nice for the people that do go ahead and buy one, as they may get more out of their machine, but it's by no means a certainty.
 
haravikk, that was an excellent post and I thank you. This is what I wanted to discuss, and you have moved us along nicely. I admit to reading between the lines, but from a developer's perspective, the nMP represents a game changer with respect to software design. I don't think this was an accident, and I have said all along that I think innovation was the goal.

I don't think most have wrapped their heads around the fact that OpenCL has nothing to do with graphics. The references in this thread and many others to 4K monitors, CrossFire, etc. miss the boat in my opinion. Your notes regarding audio are entirely appropriate, and it's telling to me that Apple's very own OpenCL "hello world" example is an FFT implementation.

Regarding the frameworks, I could not agree more. I would imagine that as OS X and the underlying frameworks further utilize GPGPU, these machines will further separate themselves from the pack and, perhaps, set a new standard.

Time will tell, and all this won't happen in the first days, months, or perhaps even years. However, as time goes on your software will most likely benefit from these underlying changes even if your software writers never "embrace" OpenCL directly. If they do, even better for you.
 
but from a developer's perspective, the nMP represents a game changer with respect to software design. I don't think this was an accident, and I have said all along that I think innovation was the goal.

I'm a developer who will buy the nMP, but I don't see anything game-changing here. I just like nice computers; the nMP certainly doesn't offer me capabilities I didn't have before. For most work, one or a few threads is sufficient and all you can do anyhow. Most problems aren't parallel in that way.

However, I will explore OpenCL to see what I can do with it. Otherwise you should get one simply because you need or want it.
 
It's worse than that: there are actually very few things that can be made GPGPU-compatible. I've been programming for 25 years and I've not once parallelized an application this way, though I've wanted to. The most I've done is some threaded design. The reason for this is simple: applications reflect how people work, because that's what they are written for. People spend a lot of time at the computer doing nothing, so most software is a callback system which simply reacts to user input. Not a lot of parallelism in that. If you don't have an application that is specifically written for this, such as graphics rendering, you won't be using it.

It's really hard these days to write an application without CPU threading. Anything that touches graphics should have a lot of multithreading, at the very least. I write tons of multithreading all the time.

OpenCL is quite a bit less common. Its main strengths are in graphics processing, which, if you're a Mac Pro user, you're probably doing a lot of.

There are some applications in DSP, even audio DSP, but the problem is the latency of the cards. I actually think that's the biggest problem. Parallelization is not a problem; I have a few problem sets that I could divvy up into a few hundred or maybe a thousand parts. The problem is getting that data over to the card, processing it, and hopefully not having to send it back to the CPU. If you're trying to do realtime audio and every tiny fraction of a second counts, you might not have the time to wait on the data to transfer to the GPU and back.
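
To make that round trip concrete, here's roughly what one audio block would go through if you pushed a single effect onto a discrete card with plain blocking OpenCL calls. The function, its arguments, and the buffer sizes are made up for illustration; the point is that all three steps have to finish inside one audio buffer period.

Code:
#include <OpenCL/opencl.h>

/* Process one small audio block on the GPU: host -> VRAM -> kernel -> host.
   'q', 'fx' (the effect kernel) and 'dev_buf' are assumed set up already. */
void process_block_on_gpu(cl_command_queue q, cl_kernel fx,
                          cl_mem dev_buf, float *samples, size_t frames)
{
    size_t bytes = frames * sizeof(float);

    /* 1) Send the block to the card. */
    clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, bytes, samples, 0, NULL, NULL);

    /* 2) Run the effect kernel on the card. */
    clSetKernelArg(fx, 0, sizeof(dev_buf), &dev_buf);
    clEnqueueNDRangeKernel(q, fx, 1, NULL, &frames, NULL, 0, NULL, NULL);

    /* 3) Pull the result back across the bus; CL_TRUE blocks until done.
       With e.g. 256 frames at 48 kHz, all of this must complete in ~5 ms. */
    clEnqueueReadBuffer(q, dev_buf, CL_TRUE, 0, bytes, samples, 0, NULL, NULL);
}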

That's the big draw of the Iris Pro, and AMD's GPUs integrated onto the CPU die. They don't have the delay of having to send the data over the PCI bus, because the GPU is parked right next to the CPU and probably shares memory with the CPU. It's also why Nvidia looks like they're in a losing position when most GPGPU geeks look at the options.

It's possible, but I think unlikely that they plumbed these and enabled CrossFire support.

OpenCL in Mac OS X already supports dual GPUs, so that at least can take strong advantage of these cards.
 
If you're trying to do realtime audio and every tiny fraction of a second counts, you might not have the time to wait on the data to transfer to the GPU and back.

So the audio apps that deal with FireWire (an x1 PCIe v2 link) and audio cards (maybe an x4 PCIe v2 link) have perfectly fine PCIe latencies, but an x16 PCIe v3 link has problematic latencies? Yeah, sure.

The fact is that audio systems deal all the time with substantially longer latencies than those between the CPU and a decent OpenCL solution sitting on an x16 PCIe v3 link. This whole princess-and-the-pea notion should also scare folks off of NUMA Mac Pros, because they might have to hit the QPI link to get the data... oh no, scary latency.

Even if you have to copy the data, if you pipeline the computation into three parts (buffer fill, compute, buffer empty), all three can actually run at the same time if you carefully work through the synchronization. The choke point for the cards is far more likely to be how many different streams of data can be moved, rather than sequential streams of blocks. That would be a much more serious chokepoint.
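
Here is a sketch of that three-part pipeline using two command queues and events, so the upload of the next block can overlap the compute on the current one. The function and its arguments are hypothetical, the buffers and kernel are assumed to exist, and whether copy and compute genuinely overlap depends on the driver and the card's DMA engines.

Code:
#include <OpenCL/opencl.h>

/* Hypothetical pipeline: two in-order queues on one device, one for copies
   and one for compute, double-buffered so the upload of block n+1 can
   overlap the kernel working on block n. Events are leaked for brevity. */
void pipelined_process(cl_context ctx, cl_device_id dev, cl_kernel kern,
                       cl_mem buf[2], float *blocks,
                       size_t nblocks, size_t block_floats)
{
    cl_command_queue xfer = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue comp = clCreateCommandQueue(ctx, dev, 0, NULL);
    size_t bytes = block_floats * sizeof(float);
    cl_event uploaded[2], computed[2];

    /* Prime the pipeline: start filling the first buffer. */
    clEnqueueWriteBuffer(xfer, buf[0], CL_FALSE, 0, bytes, blocks,
                         0, NULL, &uploaded[0]);

    for (size_t n = 0; n < nblocks; n++) {
        int cur = (int)(n & 1), nxt = 1 - cur;

        /* Compute on block n as soon as its upload has finished. */
        clSetKernelArg(kern, 0, sizeof(buf[cur]), &buf[cur]);
        clEnqueueNDRangeKernel(comp, kern, 1, NULL, &block_floats, NULL,
                               1, &uploaded[cur], &computed[cur]);

        /* Meanwhile, start filling the other buffer with block n+1.
           The in-order transfer queue keeps it behind the earlier read
           that drained this buffer, so nothing gets clobbered. */
        if (n + 1 < nblocks)
            clEnqueueWriteBuffer(xfer, buf[nxt], CL_FALSE, 0, bytes,
                                 blocks + (n + 1) * block_floats,
                                 0, NULL, &uploaded[nxt]);

        /* Empty the buffer: read block n's result back after its kernel. */
        clEnqueueReadBuffer(xfer, buf[cur], CL_FALSE, 0, bytes,
                            blocks + n * block_floats,
                            1, &computed[cur], NULL);
        clFlush(comp);
        clFlush(xfer);
    }
    clFinish(comp);
    clFinish(xfer);
    clReleaseCommandQueue(xfer);
    clReleaseCommandQueue(comp);
}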

----------

Isn't Mavericks already doing this? A lot of people have reported increased idle usage of their GPU under Mavericks.

Mavericks also brings a jump to OpenGL 4. That too should increase GPU load, as shockingly (*cough*) OpenGL 4 allows more graphics work to be done on the graphics processing unit (GPU). Imagine that: moving computations to the unit that is supposed to be doing that work.

There are likely some incremental OpenCL workload gains too, but at this point there is a huge deployed base of OpenGL v4-capable Macs out there that are finally being freed from the restraints on what they can actually do.
 
I made the mistake of betting on 64-bit adoption when I bought my Power Mac G5, and was sorely disappointed, as 64-bit didn't really take off until after I replaced it with a Mac Pro. I mean, I wasn't wrong to expect more 64-bit adoption, and the advantages of it; I was just wrong to be an early adopter, as I never saw the benefit.

People who know they'll need GPUs for software they have now, or that is coming out soon, can't really go far wrong with the new Mac Pros. It might be a bit misguided to buy the machine expecting that it'll only get faster with time, though, as you might end up disappointed like I did :(

You were wrong in assuming that 64-bit means faster than 32-bit. All things being equal, 64-bit is slower than 32-bit, since it's twice as much data. The only reason it's faster in some cases is because 32-bit was done wrong and the 64-bit version fixed those weaknesses. On PowerPC, 32-bit was done right, and that's why it spanked Intel processors for a while. 32-bit was done wrong on x86.

(By "wrong" here, I mean that there weren't enough registers on x86, and there were with PowerPC. The same is true of the processors used for the iPhones.)
 
You were wrong in assuming that 64-bit means faster than 32-bit. All things being equal, 64-bit is slower than 32-bit, since it's twice as much data. The only reason it's faster in some cases is because 32-bit was done wrong and the 64-bit version fixed those weaknesses. On PowerPC, 32-bit was done right, and that's why it spanked Intel processors for a while. 32-bit was done wrong on x86.

(By "wrong" here, I mean that there weren't enough registers on x86, and there were with PowerPC. The same is true of the processors used for the iPhones.)

64-bit is faster mostly because they added more registers than 32-bit had, just as 32-bit added more than 16-bit. So if you're running 64-bit compiled and at least somewhat optimized code, you should see it run faster.
 
You were wrong in assuming that 64-bit means faster than 32-bit. All things being equal, 64-bit is slower than 32-bit
Well, yes and no; my G5 was definitely faster, it just wasn't really because of 64-bit, so I didn't exactly waste my money. It's also not like I didn't run apps that could take advantage of 64-bit; I use Photoshop a lot, for example, and do some movie transcoding and 3D rendering, all of which can definitely benefit from 64-bit instructions for more efficient processing of larger-than-32-bit numbers, and from giving apps access to more RAM (I upgraded to 8GB using third-party RAM a little while after release).

The problem really was that the apps I used just didn't upgrade to 64-bit fast enough. I mean, it took ages for Adobe to release a 64-bit plugin for Photoshop, and all that did was help to speed up various filters; not that that wasn't nice, but it meant I still wasn't getting the most out of my RAM, and many operations still weren't any faster.

Same issues with other apps; even OS X at the time made poor use of 64-bit, as it took us well into the Intel transition before we got the 64-bit kernel and everything else. The end result was that 64-bit adoption was only just beginning in earnest as my G5 sucked in its last few gallons of air and finally died.

Sorry, a bit off-topic, sort of. My point is that even though OpenCL is a more established technology, it still isn't exactly spreading like wildfire. Many apps with GPU computation are using CUDA instead, which means we'll have to wait for them to switch; others are slow to add GPU computation at all; and still others are adding it only piecemeal, i.e. they'll have OpenCL-based features but are far from fully utilising the hardware. So while I firmly believe that the OpenCL capability of the new Mac Pro is a very smart move by Apple, for prosumers it only makes sense if you already use apps that are OpenCL (or at least GPU) accelerated right now, or will be very soon; otherwise you're buying a machine for a feature that may not be fully used during its lifetime, just like I did.
 
So the audio apps that deal with FireWire (an x1 PCIe v2 link) and audio cards (maybe an x4 PCIe v2 link) have perfectly fine PCIe latencies, but an x16 PCIe v3 link has problematic latencies? Yeah, sure.

Firewire has DMA. A discrete card's compute engine does not use DMA.

The fact is that audio systems deal all the time with substantially longer latencies than those between the CPU and a decent OpenCL solution sitting on an x16 PCIe v3 link. This whole princess-and-the-pea notion should also scare folks off of NUMA Mac Pros, because they might have to hit the QPI link to get the data... oh no, scary latency.

Sure, but those audio systems are only going one way. With a GPU you have to:
1) Send the data to the card.
2) Wait for the card's processor to compute the data.

And that would be it, if you were outputting to a device hooked to the card, like a display. But you're not. Now with audio you also have to:
3) Send the data back to the machine across the PCI bus (which means doing a synchronization with the CPU to send the data back). To human beings, the PCI bus is slow. To developers, the PCI bus is insanely slow, even with QPI.
4) Have the CPU now reschedule delivery of the data into whatever path the audio was supposed to be delivered in.

GPUs are very good at long running tasks because the speed you save in processing on the GPU makes up for the long data transfer.

It's like taking a plane trip to the next city over. Sure, the plane is going to be faster than driving, in theory. But the cost of getting through the airport might outweigh the benefits of getting on a plane.

A plane is super efficient if you're going long distance because the time you'd save over driving gets large enough to outweigh getting through the airport.

Yes, QPI is fast. But from a developer perspective, it's still not that great for short running operations. I'm not sure I'd trust QPI with real time audio processing because the window is just so small.

Even if you have to copy the data, if you pipeline the computation into three parts (buffer fill, compute, buffer empty), all three can actually run at the same time if you carefully work through the synchronization. The choke point for the cards is far more likely to be how many different streams of data can be moved, rather than sequential streams of blocks. That would be a much more serious chokepoint.

I think the card would have enough stream processors to deal with pushing all the data through in a few cycles. But again, my concern is that audio processing is so time sensitive that moving the data to the card would be a problem, unless the GPU was on die with the CPU.

The other problem here is that, honestly, audio processing doesn't really need the GPU as much right now. DSP can be deftly handled by SSE, which doesn't require the hit of moving to the card, but provides similar functionality.
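
For context, "handled by SSE" here means vectorised inner loops along these lines: a made-up gain function processing four samples per iteration. It's purely illustrative; a real filter obviously does far more than multiply.

Code:
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Apply a gain to an audio buffer, 4 float samples per iteration.
   Assumes 'frames' is a multiple of 4; real code would handle the tail. */
void apply_gain_sse(float *samples, size_t frames, float gain)
{
    __m128 g = _mm_set1_ps(gain);
    for (size_t i = 0; i < frames; i += 4) {
        __m128 v = _mm_loadu_ps(samples + i);   /* load 4 samples */
        v = _mm_mul_ps(v, g);                   /* scale them */
        _mm_storeu_ps(samples + i, v);          /* store back */
    }
}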

Mavericks also brings a jump to OpenGL 4. That too should increase GPU load, as shockingly (*cough*) OpenGL 4 allows more graphics work to be done on the graphics processing unit (GPU). Imagine that: moving computations to the unit that is supposed to be doing that work.

OpenGL 4 does some nice cleanup on some of the coding functions, but it doesn't do anything to allow more work to be done on the GPU. There are a few minor improvements which could lead to efficiency, but there is nothing in OpenGL 4 that makes me think there is some massive change in what work OpenGL can do and what it can't.
 
Firewire has DMA. A discrete card's compute engine does not use DMA.

The modern ones do. Whether the OpenCL 1.1/1.2 API (versus 2.0) gets in the way can be an issue, but fundamentally modern GPUs don't have this limitation. Legacy cards, sure. But moving forward this is far more dogma than reality.

Besides, even DMA in FireWire means making copies. DMA isn't so much the issue as whether the data has to go through intermediate buffers to get transported.

Sure, but those audio systems are only going one way. With a GPU you have to:
1) Send the data to the card.
2) Wait for the card's processor to compute the data.
....
3) Send the data back to the machine across the PCI bus (which means doing a synchronization with the CPU to send the data back). To human beings, the PCI bus is slow.

And yet in the AMD TrueAudio demos that the critical tech press attended, nobody came back with complaints about grossly noticeable sound latency defects.



To developers, the PCI bus is insanely slow, even with QPI.

And yet there are numerous folks moaning and groaning about how Apple is letting musicians down because there is no dual-CPU (i.e., QPI-bus-using) machine available. Never mind that in previous generations the QPI link was used between the Northbridge and the CPU, so everything, including memory accesses, went through that bus. Music didn't crumble on those earlier machines.


'Real time' isn't about having zero or the smallest possible latencies. It is far more about being able to deterministically get things done within the latencies that are present. Nothing has near-zero latency.


4) Have the CPU now reschedule delivery of the data into whatever path the audio was supposed to be delivered in.

If you have multiple CPU cores you also have synchronization issues. You have data in L3 caches, etc.



GPUs are very good at long running tasks because the speed you save in processing on the GPU makes up for the long data transfer.

And recording a 5-10 min audio session is not a relatively long (for CPU/GPU speeds) running task?

The bigger problem lots of 'real time' folks are going to have with the GPU is that the audio has to timeslice with graphics work. It isn't so much that the audio work is distributed around the computer; it is more whether it is going to get its required share of the necessary resources. So there is a disconnect between "this 2nd GPU is sitting there doing nothing" and "well, if we put the audio on the GPU, how do we know it is going to get done on time?" If the GPU is doing nothing, there isn't much getting in the way. On highly and diversely loaded GPUs, sure, resource scheduling may surface as a problem if the GPUs aren't managed correctly.

Throwing up PCIe and QPI as getting in the way is a bit rich, because the more classic CPU-centric processing is using exactly those same resources.



But again, my concern is that audio processing is so time sensitive that moving the data to the card would be a problem, unless the GPU was on die with the CPU.

Latency has to do with how quickly the data transfer starts. The amount of data that needs to be moved here is small relative to the bandwidth available. The throughput time isn't going to be an issue, especially if you decouple the GPGPU work from other, much higher-bandwidth workloads like graphics.

As I said, if you pipeline the copies and the computation, they run in parallel. If the copies are a bit slower than the computation, then at worst you are looking at pipeline stalls on the compute side.


The other problem here is that, honestly, audio processing doesn't really need the GPU as much right now. DSP can be deftly handled by SSE,

That is actually part of the problem. It is AVX that audio should be moving to, but probably won't, because of the legacy inertia of folks still using SSE-only machines. It is a newer mechanism, and the audio industry generally moves at a snail's pace. That is in part because many solutions are optimized down to the point where they become inertia constraints; the code only really runs well on exactly what was targeted.

The issue is that you have a limited number of AVX pipelines. If you have more data streams to process than pipelines, then there is hardware that could be used that isn't. (And frankly, extending the number of AVX/SSE pipelines means bringing in more connectivity like QPI/PCIe.)


OpenGL 4 does some nice cleanup on some of the coding functions, but it doesn't do anything to allow more work to be done on the GPU.

Two factors.

1. Moving to OpenGL v4 means Apple cleans up the parts of OpenGL v3 it cherry-picked around and didn't do. ("Oh, we'll fold those in when we eventually do the major clean-up for v4." Well, the v4 clean-up is done now, so they need to complete the implementation of v3.)

2. Even among the v4 API updates, the majority are computationally oriented.

http://en.wikipedia.org/wiki/OpenGL#OpenGL_4.0

Shaders, draw (without sync), extensions, etc.


but there is nothing in OpenGL 4 that makes me think there is some massive change in what work OpenGL can do and what it can't.

The change can be in what they are actually doing with the API. Features that have been latent up to now start getting used, because the OS starts aiming higher than what OpenGL 3.1-era GPUs can do as the baseline functionality it leverages. Frankly, it can boil down to using what was already there and what Apple was slacking on. Moving to v4 just triggers fully exploiting v3, just as fully exploiting v4 probably won't happen until Apple gets around to implementing v5 (or some much higher "dot increment" of v4).

Essentially it follows a "make it work, then make it faster" progression: implement OpenGL X.Y, then later speed up and more effectively utilize X.Y.
 
Firewire has DMA. A discrete card's compute engine does not use DMA.

I have been reviewing AMD's Accelerated Parallel Processing OCL Programming Guide for some code I have planned. Section 1.6.3 states the following...

Direct Memory Access (DMA) memory transfers can be executed separately from the command queue using the DMA engine on the GPU compute device. DMA calls are executed immediately; and the order of DMA calls and command queue flushes is guaranteed.

DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there are no dependencies. However, data is not guaranteed to be ready until the DMA engine signals that the event or transfer is completed. The application can query the hardware for DMA event completion. If used carefully, DMA transfers are another source of parallelization.

Southern Island devices have two SDMA engines that can perform bidirectional transfers over the PCIe bus with multiple queues created in consecutive order, since each SDMA engine is assigned to an odd or an even queue correspondingly.

I especially like the part about this being another source for parallelism.
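
As a host-side illustration of what the guide describes, a non-blocking clEnqueueWriteBuffer hands the copy off so it can proceed while the CPU (or another kernel) keeps working, and gives you back an event to poll. A minimal sketch, assuming the queue, device buffer, and host data already exist; the helper names are my own.

Code:
#include <OpenCL/opencl.h>

/* Kick off an asynchronous host-to-device copy and poll for completion.
   'q', 'dev_buf' and 'host_data' are assumed to be set up elsewhere. */
cl_event start_async_upload(cl_command_queue q, cl_mem dev_buf,
                            const float *host_data, size_t bytes)
{
    cl_event done;
    /* CL_FALSE = non-blocking: the call returns once the copy is queued,
       and the transfer proceeds while the CPU does other work. */
    clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, bytes, host_data,
                         0, NULL, &done);
    clFlush(q);  /* make sure the command is actually submitted */
    return done;
}

int upload_finished(cl_event done)
{
    cl_int status;
    clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);
    return status == CL_COMPLETE;
}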
 
Was removing the PS2 ports dumb?
Was removing the floppy drive dumb?
Was removing the optical drive dumb?

Just because you are not a visionary and do not understand the reasons behind certain things does not mean that those things are dumb.

I don't mean to be rude, but all of the above existed at the same time as newer technologies. Even the 3.5" floppy was around for quite a while before finally being phased out.

It appears that Apple did its usual marketing scheme when it dropped the optical drive. It was not about the drive no longer being universally relevant; it was more about making Apple's online offerings more 'relevant.' Lots of people have purchased aftermarket optical drives and will continue to do so for DVD, CD and Blu-ray playback and burning.

Just because newer technologies come in doesn't mean immediately abandoning present technologies. I honestly can't think of a fast abandonment of any connectivity or storage item other than Apple's attempt to kill the optical drive.

Early hard drive connectivity remained on motherboards as newer interfaces came in.
ISA stayed on boards when VLB and PCI came in.
DVD players/burners were also able to handle CDs.
Blu-ray players could do DVD and CD work.
USB 3 is backward compatible.
FireWire 800 is backward compatible.
TB2 is somewhat backward compatible.
And the list goes on.

Sorry but Apple's dropping the optical was NOT about new technologies but about Apple forcing Mac owners to use alternatives and nothing more.
 
Sorry but Apple's dropping the optical was NOT about new technologies but about Apple forcing Mac owners to use alternatives and nothing more.

Not to dredge up an old topic, but they were about two years too late as far as I'm concerned. And the umpty-gazillion page threads here on MacRumors at the time demonstrate that opinions were well-divided on the subject, with plenty of people who were ready to ditch the legacy optical media before Apple chose to do it. You're re-writing history a bit here, the situation was hardly as you describe it. I get that you felt it was too soon, but that's just, like, your opinion man.
 
I'm going to reply to deconstruct60 in another writing session; the back and forth is turning into an essay. :)

I have been reviewing AMD's Accelerated Parallel Processing OCL Programming Guide for some code I have planned. Section 1.6.3 states the following...

That's a DMA transfer (basically saying it can move the data to the card's buffer to be processed, no duh.) It's basically what I already said. You've got to move the data over to the card's VRAM, and then turn it back around.

DMA on more traditional devices like Firewire typically doesn't have a second major buffer sitting in between it and the destination. It's just pulling directly from memory.

An integrated chip (like Iris Pro or the AMD integrated stuff) does its work via DMA, because it uses integrated memory. There is no second bank of memory like VRAM to store data in, and no transfer to do. Everything is kept in RAM and there is no shuffling of data around. That's typically faster (and more ideal for audio), but you lose speed with an integrated GPU, and RAM is typically slower than VRAM.
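
That shows up in the OpenCL API too: on a shared-memory GPU you can allocate a buffer with CL_MEM_ALLOC_HOST_PTR and map it, so "transferring" an audio block is ideally little more than handing over a pointer. A sketch under that assumption (whether it is genuinely zero-copy depends on the driver; the helper below is hypothetical):

Code:
#include <OpenCL/opencl.h>
#include <string.h>

/* Fill a device buffer by mapping it into host memory instead of copying.
   On an integrated GPU this can avoid a separate VRAM copy entirely. */
cl_mem make_shared_block(cl_context ctx, cl_command_queue q,
                         const float *samples, size_t bytes)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, NULL);

    /* Map, write the audio block, unmap; no explicit clEnqueueWriteBuffer. */
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, NULL);
    memcpy(p, samples, bytes);
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    return buf;
}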

For a real awakening, take a look at the Iris Pro OpenCL benchmarks. One reason that GPU is such a speed demon is it's not moving data over a PCI bus.

[Iris Pro OpenCL benchmark chart]

Even the HD 4000 series iGPU is faster than the dedicated stuff due to the lack of the PCI bus hit.

I don't mean to be rude, but all of the above existed at the same time as newer technologies. Even the 3.5" floppy was around for quite a while before finally being phased out.

Apple, once it phases something out, doesn't typically offer it as an option again. Once the 3.5" floppy was gone, there was no option to add it back. The same seems to be true here.

The transition period to get off of optical media has been the last 3 years as Macs slowly started dropping it.
 
The modern ones do. Whether the OpenCL 1.1/1.2 API (versus 2.0) gets in the way can be an issue, but fundamentally modern GPUs don't have this limitation. Legacy cards, sure. But moving forward this is far more dogma than reality.

Besides, even DMA in FireWire means making copies. DMA isn't so much the issue as whether the data has to go through intermediate buffers to get transported.

DMA in FireWire is making copies with small buffers, but it doesn't usually have to turn the data right back around to the CPU (usually a FireWire device has either an input or an output, making the direction one way). There's no task waiting back on the CPU that's going to get slammed with the next audio frame.

And yet in the AMD TrueAudio demos that the critical tech press attended, nobody came back with complaints about grossly noticeable sound latency defects.

Looking it over, it looks like interesting tech, but I'd have to know more about it. Is it actually working in real time or just buffering things? The tolerance for games might be lower than the tolerance for someone working in Logic.

And yet there are numerous folks moaning and groaning about how Apple is letting musicians down because there is no dual-CPU (i.e., QPI-bus-using) machine available. Never mind that in previous generations the QPI link was used between the Northbridge and the CPU, so everything, including memory accesses, went through that bus. Music didn't crumble on those earlier machines.

I've seen the complaining, and I've also seen the counterpoint that performance was likely being hurt but not enough for people to realize. That said, multithreading a single audio pipeline is usually frowned upon for this very reason.

'Real time' isn't about having zero or the smallest possible latencies. It is far more about being able to deterministically get things done within the latencies that are present. Nothing has near-zero latency.

That's true, but GPGPU is usually considered out of the range of acceptable latencies. You're not going to have zero latency, but you need to be under a certain amount of latency (somewhere below the length of the audio packet.)

If you have multiple CPU cores you also have synchronization issues. You have data in L3 caches, etc.

True, which is also why threading is questionable for a real time audio pipeline.

You can certainly have multiple audio processing pipelines going at once, and thread that. But threading an individual filter is risky.

And recording a 5-10 min audio session is not a relatively long (for CPU/GPU speeds) running task?

It's the amount of data being processed by the kernel per run. With real-time audio you might be dealing with chunks of data that are 500 bytes to 2k or 3k large. Unless you're doing an extreme amount of processing, you're sending very small chunks of data, probably not doing a huge amount of processing compared to a more intensive task, and still incurring the hit of moving the data over to the GPU.

When you move out of realtime, and you can move megabytes or gigabytes of data over at a time, then it gets more interesting.
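
To put rough numbers on it (purely illustrative): a 256-frame stereo buffer of 32-bit floats at 48 kHz is 256 × 2 × 4 = 2,048 bytes per callback, and the whole send-compute-return round trip has to fit inside 256 / 48,000 ≈ 5.3 ms, minus whatever the rest of the audio chain needs.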

The bigger problem lots of 'real time' folks are going to have with the GPU is that the audio has to timeslice with graphics work. It isn't so much that the audio work is distributed around the computer; it is more whether it is going to get its required share of the necessary resources. So there is a disconnect between "this 2nd GPU is sitting there doing nothing" and "well, if we put the audio on the GPU, how do we know it is going to get done on time?" If the GPU is doing nothing, there isn't much getting in the way. On highly and diversely loaded GPUs, sure, resource scheduling may surface as a problem if the GPUs aren't managed correctly.

Definitely a problem too. But in theory the Mac Pro solves this by having a "dedicated" OpenCL card. Unless another app is using the GPU... :)

Throwing up PCIe and QPI as getting in the way is a bit rich, because the more classic CPU-centric processing is using exactly those same resources.

Classic CPU processing is going to use RAM, but even if RAM and PCI-E were considered in the same performance tier, the physical distance itself would decrease performance.

Latency has to do with how quickly the data transfer starts. The amount of data that needs to be moved here is small relative to the bandwidth available. The throughput time isn't going to be an issue, especially if you decouple the GPGPU work from other, much higher-bandwidth workloads like graphics.

As I said, if you pipeline the copies and the computation, they run in parallel. If the copies are a bit slower than the computation, then at worst you are looking at pipeline stalls on the compute side.

Pipelining can help, but pipelining doesn't usually work for audio. Real time audio typically requires you to turn around a packet before the next one comes in.

I do think GPGPU would be interesting for non real time audio where a fraction of a second latency is acceptable (rendering an entire file would be great for GPGPU.) But I'm still not convinced for real time audio. Even thread pools are iffy for real time audio, and those typically have lower latency than GPGPU.

That is actually part of the problem. It is AVX that audio should be moving to, but probably won't, because of the legacy inertia of folks still using SSE-only machines. It is a newer mechanism, and the audio industry generally moves at a snail's pace. That is in part because many solutions are optimized down to the point where they become inertia constraints; the code only really runs well on exactly what was targeted.

The issue is that you have a limited number of AVX pipelines. If you have more data streams to process than pipelines, then there is hardware that could be used that isn't. (And frankly, extending the number of AVX/SSE pipelines means bringing in more connectivity like QPI/PCIe.)

AVX will take time, like everything else, but yeah, the number of SSE machines out there will make adoption slow. I wasn't even aware that Intel had moved on beyond SSE until WWDC this year when they announced... AVX2? But I've been busy on the ARM side of things recently.

Two factors.

1. Moving to OpenGL v4 means Apple cleans up the parts of OpenGL v3 it cherry-picked around and didn't do. ("Oh, we'll fold those in when we eventually do the major clean-up for v4." Well, the v4 clean-up is done now, so they need to complete the implementation of v3.)

2. Even among the v4 API updates, the majority are computationally oriented.

http://en.wikipedia.org/wiki/OpenGL#OpenGL_4.0

Shaders, draw (without sync), extensions, etc.

None of those 4.0 changes are major changes that would greatly impact general performance, or open up new avenues for the OS to use the GPU. All those things are performance improvements to existing OpenGL methodologies. According to your own link:
"As in OpenGL 3.0, this version of OpenGL contains a high number of fairly inconsequential extensions, designed to thoroughly expose the capabilities of Direct3D 11-class hardware"

And tessellation was the biggest feature of DX11, which, while it's a decent speed improvement, doesn't add something that OS X could make use of in general situations.

The change can be in what they are actually doing with the API. Features that have been latent up to now start getting used, because the OS starts aiming higher than what OpenGL 3.1-era GPUs can do as the baseline functionality it leverages. Frankly, it can boil down to using what was already there and what Apple was slacking on. Moving to v4 just triggers fully exploiting v3, just as fully exploiting v4 probably won't happen until Apple gets around to implementing v5 (or some much higher "dot increment" of v4).

Essentially it follows a "make it work, then make it faster" progression: implement OpenGL X.Y, then later speed up and more effectively utilize X.Y.

Even OpenGL 3.0 or 3.1 didn't expose new major functionality (besides some new higher performance paths.) The last major change to OpenGL that exposed new functionality was OpenGL 2.0, and that added shaders which made GPGPU possible.

Apple could use these functions to optimize a lot of their existing graphics related code, but I can't think of any new use cases or brand new optimizations they could do for things that aren't OpenGL optimized yet. Even stuff like QuartzGL didn't need anything higher than OpenGL 2.0.
 
Not to dredge up an old topic, but they were about two years too late as far as I'm concerned. And the umpty-gazillion page threads here on MacRumors at the time demonstrate that opinions were well-divided on the subject, with plenty of people who were ready to ditch the legacy optical media before Apple chose to do it. You're re-writing history a bit here, the situation was hardly as you describe it. I get that you felt it was too soon, but that's just, like, your opinion man.

We shall disagree entirely on this matter. While some people don't care for optical (their choice of course, and nothing wrong with that), Apple dismissed optical without much warning and forced the issue. That is the reality and the "history," as you call it.

Let's also recall that Apple's share of the computer market at that time was not very large, so it was a niche product doing whatever it could to expedite more dollars coming in, and thus its online services. We can certainly see that this topic is a "bag of hurt" for many. Apple was paving a road for online services and more, and told us what we wanted. For some it works; for others it was disappointing.
 
It appears that Apple did its usual marketing scheme when it dropped the optical drive. It was not about the drive no longer being universally relevant; it was more about making Apple's online offerings more 'relevant.'

Exactly. It was about pushing users to buy higher-priced but lower-quality media from Apple more often. Apple was not, and is not, all that concerned with making better computers. Fortunately for us, Intel and NVIDIA are.

When I want the best quality audio/video and the most user flexibility I buy Blu-ray. When I want the best audio-only quality and cannot get the media in a high-def download I buy a CD.
 
We shall disagree entirely on this matter. While some people don't care for optical (their choice of course, and nothing wrong with that), Apple dismissed optical without much warning and forced the issue. That is the reality and the "history," as you call it.

Apple continues to sell an optical drive today which works with every single one of its computers. That's a far cry from a "dismissal without much warning." You may prefer that the optical drive were still mandatory, but it does indeed exist, it works, and it's supported. You make it sound like they abandoned the format entirely.

Apple was paving a road for online services and more, and told us what we wanted. For some it works; for others it was disappointing.

That's certainly one theory. Another theory, which appears to be at least as well supported by the evidence, is that a small enough percentage of Apple customers needed a permanent optical drive that it made sense to no longer make it a mandatory purchase. You only have to look at the then-robust marketplace of third-party options for removing the optical drive from laptops to replace it with a second hard drive. That was clear writing on the wall for the format.

They disappeared from the smaller portables first, then the larger portables, then the iMac, and then the Mac Pro over the course of several years. Exactly what you'd expect to see if the reasoning were along those lines.
 
That's a DMA transfer (basically saying it can move the data to the card's buffer to be processed, no duh.) It's basically what I already said.

I don't intend to argue with you, but I will respectfully suggest that you said the opposite.

Moving beyond that, if you can't make it work in your problem domain, that's totally understandable and fine by me. The much more exciting aspect to all this (and I think we agree on this) is the raw power that bubbles up to within EASY reach of us developers. Also, just for clarity, this thread is about the nMP architecture and includes all compute devices, not just the dual GPUs. The CPU will be there ready willing and able to chomp on some OpenCL too.

In the interest of moving back to center, I am going to post a link to an old article that many here may not have seen. I think it illustrates how OpenCL may change the way you work. 10 minutes to 14 seconds is impressive and enough to change some minds here and elsewhere.

Swimming in OpenCL
 
I don't intend to argue with you, but I will respectfully suggest that you said the opposite.

Moving beyond that, if you can't make it work in your problem domain, that's totally understandable and fine by me. The much more exciting aspect to all this (and I think we agree on this) is the raw power that bubbles up to within EASY reach of us developers. Also, just for clarity, this thread is about the nMP architecture and includes all compute devices, not just the dual GPUs. The CPU will be there ready willing and able to chomp on some OpenCL too.

In the interest of moving back to center, I am going to post a link to an old article that many here may not have seen. I think it illustrates how OpenCL may change the way you work. 10 minutes to 14 seconds is impressive and enough to change some minds here and elsewhere.

Swimming in OpenCL

Don't get me wrong, I like OpenCL, but there are a lot of things it's not good for. That doesn't make the nMP a bad machine; it's just a machine that Apple clearly designed for certain workflows (coughFCPXcough), not a machine that's going to make general OS stuff faster. OpenCL has limits.

That said, a lot of those problems are fixed when the GPU is on die with the CPU (see pretty graph above.) That will likely wipe Nvidia out of the market, though.

For OpenCL, the new MacBook Pro is actually the most interesting machine to me. The 750M for OpenGL tasks, but a wicked fast Iris Pro for OpenCL. Lots of possibilities there.
 
Since the new Mac Pro is the only current system Apple offers with dual graphics cards, wouldn't they have to develop a custom build of OS X to leverage that in the way the OP is describing? A dual-GPU system can't be compared to single-GPU systems, which is how I read the post (but I may have it wrong). Maybe the OP is addressing using OpenCL to a larger extent in any system, and the single/dual GPU thing only muddies the waters a bit?

Dale
 
For OpenCL, the new MacBook Pro is actually the most interesting machine to me. The 750M for OpenGL tasks, but a wicked fast Iris Pro for OpenCL. Lots of possibilities there.

I'm not sure it's as much that as the other cards underperforming on the benchmark. 2x the performance of the HD 4000 isn't that high. I also have to ask: how do you plan to control which it uses? It's unlikely that it can run both at once. The 2011 MacBook Pros used to drain the charger while plugged in when run at very high loads. I would find it difficult to believe that the new one can saturate one CPU + 2 GPUs without problems. Do you have something that will control switching for optimal OpenCL performance? I'm genuinely curious, because it would be cool if such a thing would work.
 
Since the new Mac Pro is the only current system Apple offers with dual graphics cards, wouldn't they have to develop a custom build of OS X to leverage that in the way the OP is describing? A dual-GPU system can't be compared to single-GPU systems, which is how I read the post (but I may have it wrong). Maybe the OP is addressing using OpenCL to a larger extent in any system, and the single/dual GPU thing only muddies the waters a bit?

Dale

I'm not sure it's as much that as the other cards underperforming on the benchmark. 2x the performance of the HD 4000 isn't that high. I also have to ask: how do you plan to control which it uses? It's unlikely that it can run both at once. The 2011 MacBook Pros used to drain the charger while plugged in when run at very high loads. I would find it difficult to believe that the new one can saturate one CPU + 2 GPUs without problems. Do you have something that will control switching for optimal OpenCL performance? I'm genuinely curious, because it would be cool if such a thing would work.

From what I understand, this is all determined programmatically... OCL has queries for determining the OCL devices in the system, and even ways to measure the performance of each device... and the programmer decides how to allocate tasks. There are also commercially available code libraries that make this easy for OCL programmers. They can effectively divide tasks between the CPU and the various GPUs (and run them in parallel). It's very sophisticated, as you might expect.
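
For what it's worth, those queries are just a handful of C calls. Here's a minimal sketch that lists every OpenCL device on the first platform along with the kinds of numbers a scheduler might weigh when dividing up work; the weighting heuristics themselves are left entirely to the programmer.

Code:
#include <OpenCL/opencl.h>
#include <stdio.h>

/* List every OpenCL device on the first platform with a few of the
   properties a task scheduler might weigh when dividing up work. */
int main(void) {
    cl_platform_id plat;
    cl_device_id devs[8];
    cl_uint ndev = 0;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 8, devs, &ndev);

    for (cl_uint i = 0; i < ndev; i++) {
        char name[128];
        cl_uint units, mhz;
        cl_ulong mem;

        clGetDeviceInfo(devs[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devs[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);
        clGetDeviceInfo(devs[i], CL_DEVICE_MAX_CLOCK_FREQUENCY,
                        sizeof(mhz), &mhz, NULL);
        clGetDeviceInfo(devs[i], CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(mem), &mem, NULL);

        printf("%s: %u compute units @ %u MHz, %llu MB global memory\n",
               name, units, mhz, (unsigned long long)(mem >> 20));
    }
    return 0;
}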
 
Yawn...

...wake me up when something actually ships, or maybe even when we know all the configurations.
 