Results:
El Capitan OpenCL : 1:47
El Capitan CUDA : 1:48
Yosemite OpenCL : 2:00
Yosemite CUDA : 2:00
Windows 10 CUDA : 0:32

I had to run the Windows test three times, because when I saw the progress bar moving so fast I thought there was something wrong with the settings, but they were exactly the same. You can see from the output file sizes that all the renders came out to around 44.8MB.

Conclusion: if you work with video using Creative Suite, you would save a lot of time using Windows. This isn't news, but it is still shocking to see the results above.

So I thought I'd run a little non-scientific test myself, to see what I could come up with on similarly equipped machines. Suffice it to say, I was a bit shocked by the winner. This is a Premiere Pro export of a 4K video that's 1:30 in duration. All OpenCL tests, except for one pissed-off CUDA test at the end.

3 minutes 3 seconds : HP Z820 / 32GB RAM / dual quad-core 3.33GHz E5 Xeon (8 cores) / AMD R9 280X 3GB / Windows 10

4 minutes 1 second : 4,1 (firmware 5,1) cMP / 32GB RAM / six-core 3.46GHz Westmere / AMD R9 280X 3GB Mac-flashed / 10.11.1

3 minutes 39 seconds : 5,1 cMP / 32GB RAM / 12-core 3.33GHz Westmere / AMD R9 280X / 10.11.1

Now the funny part: I threw 64GB of RAM in the 12-core and the same export slowed down to 3:48. What gives?

The same test with the 12-core and my GTX 770 came in a few seconds faster, at 3:34.

Either way, Premiere definitely comes across as optimized a bit better for Windows users.
 
Due to electrical loading on the memory channels, the memory often has to clock down as more DIMMs are put on each channel. Did you add DIMMs to get to 64GiB, or use larger DIMMs?

Yes, I loaded up with 8x8GB. I thought it still read 1333MHz though...
 
Media Encoder 2015 (the same back end used by Premiere and After Effects) has a new renderer. I realised this when I opened the Task Manager and saw that barely 4 out of 12 CPU cores were rendering in software mode. You will also see that the multi-CPU options have disappeared from AE's preferences because of the new engine. Adobe claims the new renderer is smarter than before, but some users are not content, claiming there is too much emphasis on GPU rendering at the expense of CPU rendering, which no longer makes as much use of all available cores as before.
 
This is an old trick, but hardly anyone mentions it, or they forget it. If you haven't got a fast scratch disk but have plenty of memory installed, you can make a scratch disk for free from spare memory using a RAM disk.

Open Terminal and type:

diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nomount ram://XXX`

Replace XXX with a number that represents the total capacity of your RAM disk in 512-byte blocks. Calculate this by multiplying your desired disk size in megabytes by 2048 (1MB = 2048 blocks of 512 bytes). For example, if you want a 4GB RAM disk, then 4096 x 2048 = 8388608.

The RAM disk will mount and you can drag it into your login items to automatically mount it when you boot up.
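Here is the whole procedure as a minimal script, assuming a 4GB disk and the volume name 'RAM Disk' (both arbitrary):

#!/bin/sh
# ram:// takes a size in 512-byte blocks: 4096MB x 2048 blocks/MB = 8388608.
SIZE_MB=4096
BLOCKS=$((SIZE_MB * 2048))
# hdiutil attach creates the raw device; diskutil formats and mounts it.
DEV=$(hdiutil attach -nomount ram://$BLOCKS)
diskutil erasevolume HFS+ 'RAM Disk' $DEV
# When you're done, eject the volume to hand the memory back to the system:
# diskutil eject 'RAM Disk'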
 
This is an old trick, but hardly anyone mentions it, or they forget it. If you haven't got a fast scratch disk but have plenty of memory installed, you can make a scratch disk for free from spare memory using a RAM disk.
I was doing this recently, until I did some measurements and found out that the RAMdisk was slower than my SSHD.

To put it simply, the RAMdisk turned IO into a CPU load, instead of letting the DMA disk hardware run asynchronously. Using CPU cycles instead of DMA controllers was a net loss for my workload.

(Note that this was (obviously) a memory-rich system, so most disk reads were satisfied from the OS filesystem cache, and disk writes were immediate due to the write-back cache on the disk.)

A RAMdisk may be a big win - but do some testing to see if it actually helps your workload. In my case, the CPU was the bottleneck - not the disk. Adding CPU load with the RAMdisk slowed the whole process.
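If you want a crude first check before timing your real workload, dd will show raw sequential throughput (the paths are placeholders; note this is exactly the kind of isolated synthetic test that won't capture the CPU-contention effect described above, so time your actual export as well):

dd if=/dev/zero of='/Volumes/RAM Disk/testfile' bs=1m count=1024
dd if=/dev/zero of=~/testfile bs=1m count=1024
rm '/Volumes/RAM Disk/testfile' ~/testfile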
 
I was doing this recently, until I did some measurements and found out that the RAMdisk was slower than my SSHD.

To put it simply, the RAMdisk turned IO into a CPU load, instead of letting the DMA disk hardware run asynchronously. Using CPU cycles instead of DMA controllers was a net loss for my workload.

(Note that this was (obviously) a memory-rich system, so most disk reads were satisfied from the OS filesystem cache, and disk writes were immediate due to the write-back cache on the disk.)

A RAMdisk may be a big win - but do some testing to see if it actually helps your workload. In my case, the CPU was the bottleneck - not the disk. Adding CPU load with the RAMdisk slowed the whole process.
Yes, I have benched the cMP's triple-channel RAM disk at 2GB/s max for large file transfers. In real-world use, both that and an M.2 SSD will transfer smaller cache or swap files at a much slower rate.
 
Yes, I have benched the cMP's triple-channel RAM disk at 2GB/s max for large file transfers. In real-world use, both that and an M.2 SSD will transfer smaller cache or swap files at a much slower rate.
In my case it wasn't (and couldn't be) shown by isolated synthetic benchmarking of the disks.

It was apparent when running my workflow and seeing that it was slower. The problem was the RAMdisk stealing CPU cycles from the work, whereas the disk hardware did asynchronous transfers using DMA that did not steal CPU cycles from the work.

Isolated synthetic disk benchmarks would have shown the RAMdisk to be faster - but it wasn't in real life.

ps: 2GB/sec is pretty sucky for disks - I get more than that with spinners and good hardware.
 
...real world speed test in rendering 4K videos with Creative Suite using OpenCL and CUDA...H.264 Codec...

Despite all the focus on GPUs in this thread, it is unlikely that *encoding* to an H.264 file uses the GPU very much, if at all. Video encoding to long-GOP formats is almost impossible to accelerate by GPU, so it is not a very good test.

The terminology can be confusing since we often loosely refer to exporting the video file as "rendering". In fact you render *effects* in the timeline but encode to a stream or file. Effects can be GPU accelerated, but encoding to H264 or similar formats generally cannot be. Any difference in performance is probably due to other factors than the GPU.

The previously-posted Final Cut vs Premiere YouTube videos repeatedly use the term "render" when they really meant "encode" or "export". I have the same camera they used -- a Sony A7RII. With both Premiere and FCP X they obtained export times in the 30-minute (!) range for a 1-minute 4k clip exported at 4k.

On my 2015 iMac 27 it takes FCP X about 40 *seconds* to export a 1 minute A7RII 4k clip to H264 4k.

If you export a timeline containing unrendered effects, those must be rendered before they are encoded. The rendering can be GPU accelerated. Those two phases -- rendering and encoding -- may appear to be a single export process, but they are distinct. The rendering happens automatically, if needed, prior to encoding.
 
Despite all the focus on GPUs in this thread, it is unlikely that *encoding* to an H.264 file uses the GPU very much, if at all. Video encoding to long-GOP formats is almost impossible to accelerate by GPU, so it is not a very good test.

The terminology can be confusing since we often loosely refer to exporting the video file as "rendering". In fact you render *effects* in the timeline but encode to a stream or file. Effects can be GPU accelerated, but encoding to H264 or similar formats generally cannot be. Any difference in performance is probably due to other factors than the GPU.

The previously-posted Final Cut vs Premiere YouTube videos repeatedly use the term "render" when they really meant "encode" or "export". I have the same camera they used -- a Sony A7RII. With both Premiere and FCP X they obtained export times in the 30-minute (!) range for a 1-minute 4k clip exported at 4k.

On my 2015 iMac 27 it takes FCP X about 40 *seconds* to export a 1 minute A7RII 4k clip to H264 4k.

If you export a timeline containing unrendered effects, those must be rendered before they are encoded. The rendering can be GPU accelerated. Those two phases -- rendering and encoding -- may appear to be a single export process, but they are distinct. The rendering happens automatically, if needed, prior to encoding.

Thanks for the post, but it's highly unlikely that the GPU isn't the factor. I have checked CPU usage across both operating systems on the same machine (and on two other Macs), and there's very little usage going on, yet Windows was almost 4x faster than OS X. It's extremely unlikely that the performance difference is down to the CPUs and not the GPUs, as Media Encoder makes so little use of the CPUs in the latest CC 2015 update. Just fire up the CPU monitors and look how little activity there is.

In fact, if you look at the benchmark between the cMP and nMP I tested, they scale perfectly in line with what we expect from the GPU configurations, not the CPUs.
 
Despite all the focus on GPUs in this thread, it is unlikely that *encoding* to an H.264 file uses the GPU very much, if at all. Video encoding to long-GOP formats is almost impossible to accelerate by GPU, so it is not a very good test.

The terminology can be confusing since we often loosely refer to exporting the video file as "rendering". In fact you render *effects* in the timeline but encode to a stream or file. Effects can be GPU accelerated, but encoding to H264 or similar formats generally cannot be. Any difference in performance is probably due to other factors than the GPU.

The previously-posted Final Cut vs Premiere YouTube videos repeatedly use the term "render" when they really meant "encode" or "export". I have the same camera they used -- a Sony A7RII. With both Premiere and FCP X they obtained export times in the 30-minute (!) range for a 1-minute 4k clip exported at 4k.

On my 2015 iMac 27 it takes FCP X about 40 *seconds* to export a 1 minute A7RII 4k clip to H264 4k.

If you export a timeline containing unrendered effects, those must be rendered before they are encoded. The rendering can be GPU accelerated. Those two phases -- rendering and encoding -- may appear to be a single export process, but they are distinct. The rendering happens automatically, if needed, prior to encoding.
This is contradicted by my experience using Premiere Pro with CUDA acceleration provided by a GTX 570. Export times for H.264 encoding are greatly reduced.
 
This is contradicted by my experience using Premiere Pro with CUDA acceleration provided by a GTX 570. Export times for H.264 encoding are greatly reduced.
It is unlikely the encoding is actually GPU-accelerated. The faster export times might be due to rendering effects on the GPU -- if any effects have been applied. The important thing to keep in mind is that H264 encoding cannot greatly be GPU accelerated. The greatest minds on earth have worked on this for years, and there are dozens of academic white papers detailing their limited success.

The x264 lead developer (Jason Garrett-Glaser) reviewed the efforts to GPU-accelerate H264 encoding, saying: "...countless people have failed over the years. nVidia engineers, Intel engineers...have all failed." His technical seminar reviews the problems and challenges:

Recently Andrew Page (nVidia Product Manager for Professional Video Technologies) explained the issue: "there are a lot of tasks that can't be broken down to be parallel: encoding is one, decoding some of the camera compressed formats is another one...what happens in frame 2 depends on frame 1, we do something to frame 1 then feed the results into frame 2....that's pretty much an encoding problem...everything is interrelated, so we can't break it up into lots of different simultaneous things." (That Studio Show podcast, 5/20/14): https://itunes.apple.com/us/podcast/that-studio-show/id293692362?mt=2

If you are using H264 export tests as an indicator of GPU performance, this will likely lead to incorrect conclusions. If you want to test GPU performance, use a direct GPU test rather than indirectly inferring GPU performance from video encoding, which is unlikely to significantly benefit from the GPU.
 
It is unlikely the encoding is actually GPU-accelerated. The faster export times might be due to rendering effects on the GPU -- if any effects have been applied. The important thing to keep in mind is that H264 encoding cannot greatly be GPU accelerated. The greatest minds on earth have worked on this for years, and there are dozens of academic white papers detailing their limited success.

The x264 lead developer (Jason Garrett-Glaser) reviewed the efforts to GPU-accelerate H264 encoding, saying: "...countless people have failed over the years. nVidia engineers, Intel engineers...have all failed." His technical seminar reviews the problems and challenges:

Recently Andrew Page (nVidia Product Manager for Professional Video Technologies) explained the issue: "there are a lot of tasks that can't be broken down to be parallel: encoding is one, decoding some of the camera compressed formats is another one...what happens in frame 2 depends on frame 1, we do something to frame 1 then feed the results into frame 2....that's pretty much an encoding problem...everything is interrelated, so we can't break it up into lots of different simultaneous things." (That Studio Show podcast, 5/20/14): https://itunes.apple.com/us/podcast/that-studio-show/id293692362?mt=2

If you are using H264 export tests as an indicator of GPU performance, this will likely lead to incorrect conclusions. If you want to test GPU performance, use a direct GPU test rather than indirectly inferring GPU performance from video encoding, which is unlikely to significantly benefit from the GPU.

No effects have been applied to the test I posted. It's a straight-up codec conversion from a comp that maintains all original video settings. Enable GPU rendering instead of software rendering and the difference is supposed to be night and day. So if the GPU isn't accelerating encoding, this would need an explanation, but codecs are not my field of speciality. I just use them; I don't develop them.
 
"...what happens in frame 2 depends one frame 1, we do something to frame 1 then feed the results into frame 2....that's pretty much an encoding problem...everything is interrelated, so we can't break it up into lots of different simultaneous things."
Wouldn't keyframes give you logical breaks for parallelization? Run each keyframe group (the keyframe and its dependents) as a separate thread.
 
Wouldn't keyframes give you logical breaks for parallelization? Run each keyframe group (the keyframe and its dependents) as a separate thread.
That is a good question. That approach is called GOP-level parallelism, and it's an obvious one. If it worked, a GPU-based H264 encoder would be 50x faster than Quick Sync. The fact that no such encoder exists indicates there is some issue limiting the speedup. I see references in the academic literature to problems with memory utilization when trying this approach. Whatever the limitation, it is real: GPU-based H264 encoding that's dramatically faster than Quick Sync does not exist.
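For what it's worth, GOP-level parallelism does work at the coarse level of a handful of CPU processes -- it's what segment-based encoding scripts do. A rough sketch with ffmpeg (file names and segment length are placeholders); the point above is that this approach doesn't scale down to the thousands of tiny threads a GPU needs:

# Split at keyframes into ~10-second chunks without re-encoding.
ffmpeg -i input.mp4 -c copy -f segment -segment_time 10 chunk_%03d.mp4
# Encode each chunk as a separate background job, then wait for all of them.
for f in chunk_*.mp4; do ffmpeg -nostdin -i "$f" -c:v libx264 "enc_$f" & done
wait
# Stitch the encoded chunks back together with the concat demuxer.
for f in enc_chunk_*.mp4; do echo "file '$f'"; done > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4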
 
"there are a lot of tasks that can't be broken down to be parallel: encoding is one, decoding some of the camera compressed formats is another one...what happens in frame 2 depends one frame 1, we do something to frame 1 then feed the results into frame 2....that's pretty much an encoding problem...everything is interrelated, so we can't break it up into lots of different simultaneous things."

Handbrake can use all 24 threads of a CPU to encode H264; doesn't that mean it's already encoding in parallel?
 
Handbrake can use all 24 threads of a CPU to encode H264; doesn't that mean it's already encoding in parallel?
Yes, thread-level parallelism is possible. Also, a well-written encoder can tap vector instructions like AVX to simultaneously process arrays of data. I think Handbrake does that for x264, which is why it's so fast.
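For example, the x264 command-line encoder exposes the thread count directly (a sketch; the file names are placeholders, and --threads defaults to auto):

x264 --threads 24 --preset medium -o output.264 input.y4m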

Some versions of Handbrake on Windows can use Quick Sync, essentially an on-chip ASIC that accelerates the core encode/decode function. I don't think that Handbrake feature works on Mac.

But none of those use the GPU. The attraction of leveraging the GPU is obvious -- encode/decode is a time-consuming task, it is image data, and the GPU is very powerful. Unfortunately the inherent characteristics of H264 make it almost impossible to meaningfully exploit the GPU for encode/decode.

If you want to do more tests, I'd suggest using a commonly-available standard clip so everyone can use the same material. Here is a UHD 4k clip we've been using on another thread. It is 145MB, 53 sec, from a DJI Inspire and shot by Philip Bloom. It is MPEG4, 23.976 fps, 8-bit 4:2:0, 22 Mbps.

Main page:

Direct download: Ultra HD 4K

My iMac using FCP X exports it to 1080p in 17.2 sec, and to 4k in 28.7 sec.
My PC only has CS5, but it's 4GHz and has a GTX 660. It exports it to 1080p in 1:13.8, but unfortunately will not export at 4k since the codec can't handle that.

Edit/add:

When running benchmarks that involve significant file I/O, I suggest turning off Time Machine and Spotlight indexing, which can cause performance fluctuations.

To turn off Time Machine, click on that in the menu bar (clock icon with backward arrow), then Open Time Machine Preferences and turn it off.

To disable Spotlight indexing, go to System Preferences>Spotlight, click the Privacy Tab, then the + button at bottom left, and select each of your hard drives. That will disable it. Just remember to reverse these steps when you're done.
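Both can also be toggled from Terminal if that's quicker (a sketch; both commands need sudo, and mdutil is set per volume):

sudo tmutil disable      # stop automatic Time Machine backups
sudo mdutil -i off /     # disable Spotlight indexing on the boot volume
# Reverse both when you're done:
sudo tmutil enable
sudo mdutil -i on /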
 
Sorry, I am a bit confused. If a CPU can do the job in parallel, why is it impossible on a GPU?

I know the software / platform / coding required is different, but those are all solvable. I just couldn't understand why "H264 encoding cannot greatly be GPU accelerated".
 
Sorry, I am a bit confused. If a CPU can do the job in parallel, why is it impossible on a GPU?

I know the software / platform / coding required is different, but those are all solvable. I just couldn't understand why "H264 encoding cannot greatly be GPU accelerated".
Very coarse-grained acceleration via a few cores is possible. Fine-grained acceleration via the GPU is generally not possible. This is due to data and other dependencies in the encoding algorithm. The x264 team spent a lot of effort and got some modest GPU acceleration in one component of their encoder, but this comprised only 10-25% of the runtime. If you watch the previously-posted seminar in this thread, it will become obvious how complex this is and why major GPU acceleration has so far eluded all researchers: #37
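If I recall correctly, that component is the lookahead stage, exposed in x264's command-line encoder as a single flag (a sketch; the file names are placeholders):

x264 --opencl --preset medium -o output.264 input.y4m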
 
Thanks for the conversation. I'll do your bench test now (with the same five configurations in my first post), but it would be ideal if everyone used the same application.

Edit.

That Vimeo link doesn't work. Do I have to sign up? Also, which codec do I export to? And which 4K: 4096px wide or 3840px wide?
 
...I'll do your bench test now (with the same five configurations in my first post), but it would be ideal if everyone used the same application.

Edit.

That Vimeo link doesn't work. Do I have to sign up? Also, which codec do I export to? And which 4K: 4096px wide or 3840px wide?
Sorry about that. You should be able to click the initial Vimeo page (not the direct download link), then select the Download button, then select the UHD 4k download. Let me know if that doesn't work.

For output size, I suggest using the same as the file, which is 3840 x 2160.
 
Sorry about that. You should be able to click the initial Vimeo page (not the direct download link), then select the Download button, then select the UHD 4k download. Let me know if that doesn't work.

For output size, I suggest using the same as the file, which is 3840 x 2160.

And H.264 codec?
 
And H.264 codec?

I would say yes, H264. There are obviously other codecs, and we could test them if you want, but H264 is very popular, especially for web upload. Each additional test adds more time.

Another decision is whether to test various quality settings on output. I don't have a good answer for that. In FCP X there is a fast vs higher quality setting, but the fast setting produces virtually the same quality, so most people just use fast. On Premiere the tradeoff might be different.
 

I would say yes, H264. There are obviously other codecs, and we could test them if you want, but H264 is very popular, especially for web upload. Each additional test adds more time.

Another decision is whether to test various quality settings on output. I don't have a good answer for that. In FCP X there is a fast vs higher quality setting, but the fast setting produces virtually the same quality, so most people just use fast. On Premiere the tradeoff might be different.

In Media Encoder there is a 'Maximum' quality setting that I check. I don't even have to add the video to a timeline as I can drag straight into the Encoder.
 
In Media Encoder there is a 'Maximum' quality setting that I check. I don't even have to add the video to a timeline as I can drag straight into the Encoder.
At a minimum, I suggest doing a test to see if there is any difference in encoding time or perceived quality with maximum on or off. I have seen references saying you only need "maximum render quality" when resizing, and that otherwise it can incur a performance penalty without giving any benefit.

http://www.digitalrebellion.com/blog/posts/understanding_render_options_in_adobe_premiere_pro
 