The modern ones do. Whether the OpenCL 1.1/1.2 API (versus 2.0) gets in the way can be an issue, but foundationally the modern GPUs don't have this limitation. Legacy cards, sure. But moving forward this is far more dogma than reality.
Besides, even DMA in FireWire means making copies. DMA isn't so much the issue as whether the data goes through incremental buffers on its way to the transport.
DMA in FireWire is making copies with small buffers, but it doesn't usually have to turn the data right back around to the CPU (usually a FireWire device has either an input or an output, making the direction one way). There's no task waiting back on the CPU that's going to get slammed with the next audio frame.
And yet in the AMD TrueAudio demos that the critical tech press attended, nobody came back with complaints about grossly noticeable sound latency defects.
Looking it over, it seems like interesting tech, but I'd have to know more about it. Is it actually working in real time or just buffering things? Games might also be more tolerant of latency than someone working in Logic.
And yet there are numerous folks moaning and groaning about how Apple is letting musicians down because there is no dual-CPU machine (i.e., one using the QPI bus) available. Never mind that in previous generations the QPI link was used between the Northbridge and the CPU, so everything, including memory accesses, went through that bus. Music didn't crumble on those earlier machines.
I've seen the complaining, and I've also seen the counterpoint that performance was likely being hurt but not enough for people to realize. That said, multithreading a single audio pipeline is usually frowned upon for this very reason.
'Real time' isn't about having zero or the smallest possible latency. It is far more about being able to deterministically get things done within the latencies that are present. Nothing has near-zero latency.
That's true, but GPGPU is usually considered out of the range of acceptable latencies. You're not going to have zero latency, but you need to be under a certain amount of latency (somewhere below the length of the audio packet.)
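To put a rough number on that budget, it's just the duration of one I/O buffer. A minimal sketch, assuming a 48 kHz sample rate and a few common buffer sizes (my example numbers, not anything from a specific driver or DAW):

```c
/* Rough illustration: the real-time budget is the duration of one audio
 * buffer. The sample rate and buffer sizes below are typical assumptions. */
#include <stdio.h>

int main(void) {
    const double sample_rate = 48000.0;           /* Hz, assumed          */
    const int buffer_frames[] = { 64, 128, 256 }; /* common I/O buffer sizes */

    for (int i = 0; i < 3; i++) {
        double budget_ms = buffer_frames[i] / sample_rate * 1000.0;
        /* Every per-buffer cost (DSP plus any transfer to/from a GPU) has to
         * fit in this window, or the next buffer arrives before you're done. */
        printf("%4d frames @ 48 kHz -> %.2f ms budget\n",
               buffer_frames[i], budget_ms);
    }
    return 0;
}
```

That works out to roughly 1.3 ms at 64 frames and about 5.3 ms at 256 frames.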
If you have multiple CPU cores you also have synchronization issues. You have data in L3 caches, etc.
True, which is also why threading is questionable for a real time audio pipeline.
You can certainly have multiple audio processing pipelines going at once, and thread that. But threading an individual filter is risky.
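Roughly what I mean by threading the pipelines rather than the filter; a minimal sketch with made-up names (Track, process_track), and a real engine would reuse a persistent thread pool rather than spawning threads per buffer:

```c
/* Sketch of "thread the pipelines, not the filter": each track gets its own
 * thread running its own serial filter chain. Illustrative only, not any
 * particular DAW's design. */
#include <pthread.h>
#include <stddef.h>

#define FRAMES 256

typedef struct {
    float samples[FRAMES];
    float gain;                 /* stand-in for a whole per-track filter chain */
} Track;

static void *process_track(void *arg) {
    Track *t = (Track *)arg;
    for (size_t i = 0; i < FRAMES; i++)
        t->samples[i] *= t->gain;   /* the filter itself stays single-threaded */
    return NULL;
}

void process_block(Track *tracks, size_t ntracks) {
    pthread_t tid[64];              /* assumes ntracks <= 64 for the sketch */

    /* Parallelism across independent tracks: no locks inside any filter,
     * and every thread shares the same single-buffer deadline. */
    for (size_t i = 0; i < ntracks; i++)
        pthread_create(&tid[i], NULL, process_track, &tracks[i]);
    for (size_t i = 0; i < ntracks; i++)
        pthread_join(tid[i], NULL);
}
```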
And recording a 5-10 minute audio session is not a relatively long-running task (by CPU/GPU speed standards)?
It's the amount of data being processed by the kernel per run. With real time audio you might be dealing with chunks of data that are 500 bytes to 2k or 3k in size. Unless you're doing an extreme amount of processing, you're sending very small chunks of data, probably not doing a huge amount of processing compared to what a more intensive task would, and still incurring the hit of moving the data over to the GPU.
When you move out of real time, and you can move megabytes or gigabytes of data over at a time, then it gets more interesting.
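For a sense of where the overhead sits, here's roughly what a naive per-block OpenCL round trip looks like on the host side. A sketch only: the ctx/queue/kernel objects are assumed to already exist, error handling is omitted, and none of it is taken from a real audio engine.

```c
/* Sketch of why tiny per-buffer dispatches hurt: for a ~2 KB block, the fixed
 * costs (buffer setup, transfer, kernel launch, readback) dwarf the work. */
#include <CL/cl.h>

void process_audio_block(cl_context ctx, cl_command_queue queue,
                         cl_kernel kernel, float *block, size_t frames) {
    size_t bytes = frames * sizeof(float);   /* e.g. 512 frames ~= 2 KB */

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    /* Each of these calls carries a fixed latency that is paid per block,
     * no matter how small the block is. */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, block, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &frames, NULL, 0, NULL, NULL);

    /* Blocking read: the CPU waits here until the whole round trip completes,
     * and that round trip has to fit inside the buffer's deadline. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, block, 0, NULL, NULL);

    clReleaseMemObject(buf);
}
```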
The bigger problem lots of 'real time' folks are going to have with the GPU is that the audio has to timeslice with graphics work. It isn't so much that the audio work is distributed around the computer; it is more whether it is going to get its required share of slices of the necessary resources. So there is a disconnect between "this 2nd GPU is sitting there doing nothing" and "well, if we put the audio on the GPU, how do we know it is going to get done on time?". If the GPU is doing nothing, there isn't much getting in the way. With highly, diversely loaded GPUs, then sure, resource scheduling may surface as a problem if the GPUs aren't managed correctly.
Definitely a problem too. But in theory the Mac Pro solves this by having a "dedicated" OpenCL card. Unless another app is using the GPU...
Throwing up PCIe and QPI as getting in the way is a bit rich, because the other, more classic CPU-centric processing is using exactly those same resources.
Classic CPU processing is going to use RAM, but even if RAM and PCI-E were considered in the same performance tier, the physical distance itself would decrease performance.
Latency has to do with how quickly the data transfer starts. The amount of data that needs to be moved here is small relative to the bandwidth available. The throughput time isn't going to be an issue, especially if you decouple the GPGPU work from other, much higher-bandwidth workloads like graphics.
As I said, if you pipeline the copies and the computation, they run in parallel. If the copies are a bit slower than the computation, then at worst you are looking at pipeline stalls on the computation side.
Pipelining can help, but pipelining doesn't usually work for audio. Real time audio typically requires you to turn around a packet before the next one comes in.
I do think GPGPU would be interesting for non real time audio where a fraction of a second latency is acceptable (rendering an entire file would be great for GPGPU.) But I'm still not convinced for real time audio. Even thread pools are iffy for real time audio, and those typically have lower latency than GPGPU.
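To make the pipelining point concrete: double-buffering keeps the copies and the kernel overlapped, but it lengthens the turnaround for any given block, which is exactly what real time can't absorb. A minimal sketch; the helper functions are hypothetical stand-ins for the audio driver and a GPU queue, not any real API:

```c
/* Sketch of the pipelining trade-off: while block N is on the GPU, block N+1
 * is being captured, so N's result comes back at least one block late. */
#define FRAMES 256

void capture_block(float *block);          /* blocks until the next buffer */
void gpu_submit(const float *block);       /* async copy + kernel launch   */
void gpu_wait_and_fetch(float *block);     /* wait, then copy results back */
void emit_block(const float *block);       /* hand processed audio onward  */

void run_pipeline(void) {
    float blocks[2][FRAMES];
    int cur = 0;

    capture_block(blocks[cur]);            /* block 0 arrives              */
    gpu_submit(blocks[cur]);               /* send block 0, don't wait     */

    for (;;) {
        int next = cur ^ 1;
        capture_block(blocks[next]);       /* block N+1 arrives while ...  */
        gpu_wait_and_fetch(blocks[cur]);   /* ... block N finishes on GPU  */
        emit_block(blocks[cur]);           /* N goes out one block late    */
        gpu_submit(blocks[next]);
        cur = next;
    }
}
```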
That is actually part of the problem. It is AVX that audio should be moving to, but probably won't, because of the legacy inertia of the folks still using SSE-only capable machines. It is a new mechanism, and the audio industry generally moves at a snail's pace. That is in part because many solutions optimize down to the point where they become inertia constraints; the code only really runs well on exactly the hardware it was targeted at.
The issue is that you have a limited number of AVX pipelines. If you have more processing data streams than pipelines, then there is hardware that could be used that isn't. (And frankly, extending the number of AVX/SSE pipelines means bringing in more connectivity like QPI/PCIe.)
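For concreteness, this is the kind of inner loop that moving from SSE to AVX changes: eight floats per instruction instead of four. A minimal sketch; the function name and the trivial gain operation are illustrative, not from any shipping plug-in:

```c
/* Minimal AVX sketch: one of the core's limited SIMD pipelines handles
 * eight samples per multiply. Requires a compiler flag like -mavx. */
#include <immintrin.h>
#include <stddef.h>

void apply_gain_avx(float *samples, size_t n, float gain) {
    __m256 g = _mm256_set1_ps(gain);
    size_t i = 0;

    /* 8-wide AVX lane */
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(samples + i);
        _mm256_storeu_ps(samples + i, _mm256_mul_ps(x, g));
    }

    /* scalar tail for buffer sizes that aren't a multiple of 8 */
    for (; i < n; i++)
        samples[i] *= gain;
}
```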
AVX will take time, like everything else, but yeah, the number of SSE machines out there will make adoption slow. I wasn't even aware that Intel had moved on beyond SSE until WWDC this year when they announced... AVX2? But I've been busy on the ARM side of things recently.
Two factors.
1. Moving to OpenGL v4 means Apple cleans up the parts of OpenGL v3 they cherry-picked around and didn't do. ("Oh, we'll fold those in when we eventually do the major clean-up for v4." Well, the v4 clean-up is done now, so they need to complete the implementation of v3.)
2. Even among the v4 API updates, the majority of the changes are computationally oriented.
http://en.wikipedia.org/wiki/OpenGL#OpenGL_4.0
shaders, draw (without sync), extensions, etc.
None of those 4.0 changes are major changes that would greatly impact general performance, or open up new avenues for the OS to use the GPU. All those things are performance improvements to existing OpenGL methodologies. According to your own link:
"As in OpenGL 3.0, this version of OpenGL contains a high number of fairly inconsequential extensions, designed to thoroughly expose the capabilities of Direct3D 11-class hardware"
And tessellation was the biggest feature of DX11, which, while it's a decent speed improvement, doesn't add something that OS X could make use of in general situations.
The change can be in what they are actually doing with the API. Features that have sat latent up to now start getting used, because the baseline functionality the OS leverages is now aiming higher than what OpenGL 3.1-era GPUs can do. Frankly, it may boil down more to using what was already there and that Apple was slacking on. Moving to v4 just triggers fully exploiting v3, just as fully exploiting v4 probably won't happen until Apple gets around to implementing v5 (or some much higher "dot increment" of v4).
Essentially following a "make it work, then make it faster" progression. Implement OpenGL X.Y then later speed up and more effectively utilize X.Y.
Even OpenGL 3.0 or 3.1 didn't expose new major functionality (besides some new higher performance paths.) The last major change to OpenGL that exposed new functionality was OpenGL 2.0, and that added shaders which made GPGPU possible.
Apple could use these functions to optimize a lot of their existing graphics related code, but I can't think of any new use cases or brand new optimizations they could do for things that aren't OpenGL optimized yet. Even stuff like QuartzGL didn't need anything higher than OpenGL 2.0.