Sorry, I am a bit confused. If a CPU can do the job in parallel, why is it impossible on a GPU?...

Some clarifying info (and this also addresses AidenShaw's statement): first we must review relative CPU and GPU performance characteristics:

The i7-6700K CPU has been benchmarked at about 240 Linpack GFLOPS.
By comparison, the M395X is about 3.7 TFLOPS, or roughly 15x faster. Newer chips like the 6700K are pretty fast thanks to their Cray-style vector (SIMD) instructions, but the GPU is much faster still -- on the types of problems where its hardware can actually be employed.

To realize the GPU's 15x performance advantage its thousands of cores must be fully utilized. Typically this requires multiple GPU threads per core -- that's how the hardware is designed. If this isn't done, each GPU core will be under-utilized and the performance won't be there. It would be like running one thread on an IBM Power8 CPU core which is 8-way hyperthreaded. On a GPU you *must* have many thousands of threads going, else those GPU cores will not be utilized and it's faster to just use the CPU.
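Purely as an illustration of the scale involved, here is a back-of-envelope sketch in Python; the shader count and threads-per-core figures are assumptions chosen only to show orders of magnitude, not vendor specifications:

```python
# Back-of-envelope figures from the post above (published benchmark numbers, not my own measurements).
cpu_gflops = 240.0    # i7-6700K, Linpack
gpu_gflops = 3700.0   # M395X, peak single precision
print(f"GPU/CPU throughput ratio: {gpu_gflops / cpu_gflops:.1f}x")  # ~15x

# Occupancy illustration -- shader count and threads-per-core are assumptions
# chosen only to show the scale, not vendor specifications.
shader_cores = 2048
threads_per_core = 4
print(f"Threads needed to keep the GPU busy: {shader_cores * threads_per_core}")  # ~8,000
```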

An H264 video file is organized into Groups Of Pictures (GOPs). AidenShaw was correct -- each GOP is totally independent and can be encoded/decoded separately.

How big are these GOPs? In a 4k video file, each *frame* is 8 megapixels -- about 1 megabyte. A typical GOP size for web distribution is 3x the frame rate, or 90 frames for 30 fps. So each GOP is 90 megabytes in that case.

How many GOPs are in a 500MB file? About 6. We already see the problem. There are not *remotely* enough GOPs in a typical size video file to spread across the many thousands of GPU threads. Only one GPU thread can be in a GOP, so only 6 GPU cores can be leveraged on this file. Also, each GPU core is not fully utilized with only one thread.

If the file was 5GB (not 500MB) *and* the GOP size was 15 frames (not 90), that is still only about 333 GPU threads, since the data dependencies within each GOP allow only one thread per GOP -- which also means each GPU core that gets a GOP runs just a single, under-loaded thread. Thus most of the GPU's resources would be sitting idle.
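A quick sketch of this GOP-count arithmetic, using the same assumed ~1 MB per encoded 4k frame as above:

```python
# GOP-count arithmetic using the post's assumptions (~1 MB per encoded 4k frame).
def gop_count(file_mb, frames_per_gop, mb_per_frame=1.0):
    """How many independent GOPs -- and hence usable GPU threads -- a file offers."""
    return round(file_mb / (frames_per_gop * mb_per_frame))

print(gop_count(500, 90))    # ~6 GOPs in a 500 MB file with 90-frame GOPs
print(gop_count(5000, 15))   # ~333 GOPs in a 5 GB file with 15-frame GOPs
```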

The only way to leverage a GPU for H264 encode/decode is to go within the GOP and have multiple GPU threads per GOP. Yet that is virtually impossible due to data dependencies within the GOP, as discussed by X264 lead developer Jason Garrett-Glaser in the previously-listed video.
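To make that boundary concrete, here is a toy Python sketch (nothing like a real codec): within a GOP each frame is predicted from the previous one, so a GOP is a strictly sequential chain, and the only unit that can be handed to parallel workers is a whole GOP.

```python
# Toy sketch of the dependency structure -- not a real encoder.
from concurrent.futures import ProcessPoolExecutor

def encode_gop(frames):
    """Frames within a GOP form a sequential chain: each one is predicted from
    the previous, so this loop cannot be split across threads."""
    prev, out = None, []
    for frame in frames:
        out.append((frame, prev))  # stand-in for motion-compensated prediction
        prev = frame
    return out

def encode_file(gops):
    # Parallelism is capped at len(gops): one worker per GOP, nothing finer.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(encode_gop, gops))

if __name__ == "__main__":
    gops = [[f"frame_{g}_{i}" for i in range(15)] for g in range(6)]  # 6 GOPs x 15 frames
    encode_file(gops)
```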

Let's take a more extreme case: if the file was 1080p (about 256 kilobytes per frame), the GOP size was 15, and the file size was 10GB, that would be about 3.8MB per GOP, or roughly 2,631 GOPs in the file. That could still only use 2,631 threads, which is not enough to fully harness the GPU cores, but it's getting a little closer. However, all the data those GPU threads are manipulating must be in VRAM, which is typically about 2GB. Thus at any one moment only about 1/5th of those threads could be used, since there is only 1/5th enough VRAM. So in that case the VRAM limit prevents significant progress.
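The same arithmetic, with the example's assumptions made explicit:

```python
# The "extreme case" arithmetic above, with its assumptions made explicit.
mb_per_frame   = 0.256   # ~256 KB per 1080p frame in the file
frames_per_gop = 15
file_mb        = 10_000  # 10 GB file
vram_mb        = 2_000   # ~2 GB of VRAM

gop_mb        = frames_per_gop * mb_per_frame   # ~3.8 MB per GOP
total_gops    = round(file_mb / gop_mb)         # ~2,600 GOPs, i.e. usable threads
resident_gops = round(vram_mb / gop_mb)         # GOPs that fit in VRAM at once
print(gop_mb, total_gops, resident_gops, resident_gops / total_gops)  # ratio ~1/5
```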
 
You are right, the GPU is doing nothing for h.264. I was misled by a false positive: one would have expected a disparity between the GPU results in El Capitan and Windows, with the nMP D500 falling somewhere in between. In fact the results came down purely to the CPU, plus software optimisation on Windows.

The result of your test on a dual X5690 CPU is below. This was a straight-up encode to h.264 of the 145MB file you linked me to. I dragged it into Media Encoder, chose 'Match Source - High bitrate' as the preset, and then ran the test.

El Capitan
OpenCL : 119
CUDA : 120
SOFTWARE : 119

Windows 10
CUDA : 49
SOFTWARE : 49

Obviously this was just a straight up encode with no filters or comping for the GPU to help with. This leaves two questions:

1. Which filters and effects in Premiere or After Effects are optimal for testing the GPU? I expect only a handful are, as it is in Photoshop.
2. Is Windows faster because Media Encoder is more optimised, or because Windows has support for certain CPU features that are not well implemented in OS X?
 
1. Which filters and effects in Premiere or After Effects are optimal for testing the GPU? I expect only a handful are, as it is in Photoshop.

We had the same question during the huge amount of testing we did on FCP X in this thread (starting around post #307). The only solution we found was individually testing each effect while observing the GPU utilization in iStat Menus.

2. Is Windows faster because Media Encoder is more optimised, or because Windows has support for certain CPU features that are not well implemented in OS X?
That is the "million dollar question". I don't know the answer but there's a big difference.

In my export testing on FCP X 10.2.2 and OS X 10.11.1, exporting that test file to UHD 4k H264 using the "fast encode" option took about 28.7 seconds, and using the "high quality" (aka multi-pass) setting it took 55.5 seconds. That's on a 2015 top-spec iMac 27 with M395X. On my 5k retina iMac screen I couldn't see any apparent quality difference in the files, which is why we usually use the "fast encode" option on FCP X.
 
Some clarifying info (and this also addresses AidenShaw's statement)... So in that case the VRAM limit prevents significant progress.

A million thanks for the detailed explanation. That makes perfect sense to me. And now I understand why HandBrake can only fully utilise my CPU when I encode long/large video files: it's a matter of GOPs.
 
There have been very effective hardware H.264 encoders around for years. I used to use the Elgato H.264 Turbo USB dongle. There are standalone boxes & PCIe cards. Why do you think similar technology wouldn't be incorporated into a graphics card?
 
The i7-6700K CPU has been benchmarked at about 240 Linpack GFLOPS.

By comparison, the M395X is about 3.7 TFLOPS, or roughly 15x faster.
Another way to look at this is performance per core:
  • i7-6700K - 60 GFLOPS per core (30 GFLOPS per thread)
  • M395X - 1.8 GFLOPS per core (shader)
In other words, you need about 133 cores running on the M395X to match the CPU speed.
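For reference, the arithmetic behind those per-core figures (the 2048-shader count for the M395X is an assumption consistent with the numbers above):

```python
# Per-core arithmetic behind the bullet points above (shader count is an assumption).
cpu_gflops, cpu_cores, cpu_threads = 240, 4, 8   # i7-6700K
gpu_gflops, gpu_shaders            = 3700, 2048  # M395X

print(cpu_gflops / cpu_cores)            # ~60 GFLOPS per CPU core
print(cpu_gflops / cpu_threads)          # ~30 GFLOPS per CPU thread
per_shader = gpu_gflops / gpu_shaders
print(round(per_shader, 1))              # ~1.8 GFLOPS per shader
print(round(cpu_gflops / per_shader))    # ~133 shaders to match the CPU
```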
 
There have been very effective hardware H.264 encoders around for years. I used to use the Elgato H.264 Turbo USB dongle. There are standalone boxes & PCIe cards. Why do you think similar technology wouldn't be incorporated into a graphics card?

Those "accelerator" cards are not using a GPU approach, which the above references show is essentially impossible. Of course you can put an encoder (or elements thereof) on an ASIC or FPGA. That is essentially what Intel's Quick Sync is. None of the above discussion has questioned the ability to do hardware-assisted H264 encoding, only GPU-assisted H264 encoding.

Any hardware-assisted encoding must use the existing H264 algorithm, which is sequential and has many data dependencies within the GOP. It cannot be broken up, parallelized and distributed across thousands of threads in a GPU. If anyone figures out how to do that, they might win a Nobel Prize -- it's that big a deal. There is no lack of incentive for people to devise such a thing, only lack of results.
 
...In other words, you need about 133 cores running on the M395X to match the CPU speed.

And to fully harness 133 GPU cores would take many GPU threads per core. A single thread will not exploit the core's performance potential. Yet with long-GOP encoding, each thread must stay within a unique GOP to avoid data dependencies. So you cannot even fully leverage 133 GPU cores because they would be under-driven by only having one thread each.

Some algorithms are just very difficult to massively parallelize, and H264 is one of them.
 
Those "accelerator" cards are not using a GPU approach, which the above references show is essentially impossible. Of course you can put an encoder (or elements thereof) on an ASIC or FPGA. That is essentially what Intel's Quick Sync is. None of the above discussion has questioned the ability to do hardware-assisted H264 encoding, only GPU-assisted H264 encoding.

Any hardware-assisted encoding must use the existing H264 algorithm, which is sequential and has many data dependencies within the GOP. It cannot be broken up, parallelized and distributed across thousands of threads in a GPU. If anyone figures out how to do that, they might win a Nobel Prize -- it's that big a deal. There is no lack of incentive for people to devise such a thing, only lack of results.
Newer Nvidia graphics cards have H.264 video encoding hardware built in & Nvidia provide an API (NVENC) to access it.
NVIDIA GPUs - beginning with the Kepler generation - contain a hardware-based encoder (also referred to as NVENC henceforth in the document) which provides fully accelerated hardware-based video encoding.

Before Kepler GPUs, the only GPU-based solution for video encoding was to use CUDA for encoding. One of the disadvantages of the CUDA-based encoder is that it uses a combination of the CPU and the GPU's graphics engine for encoding, taking away processing power from other tasks that could be performed on the CPU and the GPU's graphics engine; the CUDA-based approach also increases overall system power consumption. The NVENC engine's performance is also independent of the graphics performance.
http://developer.download.nvidia.com/compute/nvenc/v5.0_beta/NVENC_DA-06209-001_v06.pdf
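For what it's worth, one common way to reach NVENC without coding against the SDK directly is through ffmpeg's h264_nvenc encoder. A minimal sketch, assuming an ffmpeg build compiled with NVENC support and a Kepler-or-newer card; the file names are placeholders:

```python
# Hand the encode to the fixed-function NVENC block via ffmpeg's h264_nvenc encoder.
# Assumes an ffmpeg build with NVENC support; input/output names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "input.mov",
    "-c:v", "h264_nvenc",   # NVENC hardware encoder, not the CUDA shader cores
    "-b:v", "20M",          # target bitrate
    "-c:a", "copy",         # pass audio through untouched
    "output.mp4",
], check=True)
```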
 
Newer Nvidia graphics cards have H.264 video encoding hardware built in & Nvidia provide an API (NVENC) to access it....
Yes, that corroborates what I was saying. The GPU cannot effectively do H264 encoding for the reasons already stated. Because of this, nVidia (and AMD) have been forced to add a separate, dedicated fixed-function hardware encoder to the GPU. Architecturally it's a bag hung on the side. They would never have had to do this if the GPU itself were efficient at video encoding.

AMD's fixed-function hardware is called VCE (Video Coding Engine). Like nVidia's NVENC, it has little to do with the GPU itself; it's essentially an ASIC run by a proprietary microcontroller.

Of course who cares whether it's really part of the GPU if it works and software supports it.

It would be interesting to know whether any generally-available transcoding software that supports NVENC or VCE currently exists. I'd like to see how the performance compares to just using an Intel CPU with Quick Sync.

In the previously-listed 52-second 4k test clip, FCP X on my iMac encodes that to 1080p H264 in 16.7 seconds, or 75.7 frames per second. It encodes it to 4k H264 in 27.6 seconds, or 47.6 frames per second. Does anyone have any NVENC or VCE-enabled transcoding software to test this with?
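As a rough cross-check of those throughput figures (the clip's frame rate is an assumption here, ~25 fps, so this only approximates the quoted numbers):

```python
# Rough cross-check of the throughput figures above. The clip runs 52 s; its
# frame rate is an assumption (~25 fps), so these only approximate the
# quoted 75.7 and 47.6 frames per second.
clip_seconds = 52
assumed_fps  = 25
frames = clip_seconds * assumed_fps

for label, encode_seconds in [("1080p H264", 16.7), ("4k H264", 27.6)]:
    print(label, round(frames / encode_seconds, 1), "frames/s")
```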
 
...Does anyone have any NVENC or VCE-enabled transcoding software to test this with?

A quick search reveals that some third-party plugins exist, but I would need more reliable information before I try one, otherwise it could be some kind of adware or malware.

https://www.google.co.uk/search?q=n...l=en-gb&client=safari#hl=en-gb&q=nvenc+plugin

And this plugin (Windows only) looks a bit unstable and complicated

https://forums.adobe.com/thread/1243687?start=160&tstart=0
 
...And this plugin (Windows only) looks a bit unstable and complicated

https://forums.adobe.com/thread/1243687?start=160&tstart=0
It looks like some people sort of got it working, but they didn't state any performance numbers or quality assessment.

The problem for other hardware-assisted transcoding approaches is that Quick Sync is available on most Intel CPUs since Sandy Bridge (except Xeon). FCP X supports it, and I have no idea why Premiere does not. Instead of writing to a proprietary API specific to a certain range and brand of GPU, developers can target a more broadly available feature. Maybe some of the NVENC and VCE approaches are faster, but I don't see much software support for them.
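(For anyone who wants to experiment, Quick Sync can be reached in much the same way as the NVENC example earlier, e.g. via ffmpeg's h264_qsv encoder on builds that include it; a minimal sketch with placeholder file names:)

```python
# Sketch: Quick Sync encode via ffmpeg's h264_qsv encoder.
# Assumes an ffmpeg build with Quick Sync (libmfx) support and a non-Xeon Intel CPU;
# file names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "input.mov",
    "-c:v", "h264_qsv",   # Intel Quick Sync hardware encoder
    "-b:v", "20M",
    "-c:a", "copy",
    "output.mp4",
], check=True)
```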

By the time the next Mac Pro is released, if Xeon still doesn't have Quick Sync and if the GPU is still AMD, Apple may be impelled to write VCE transcoding routines for FCP X. This is especially so with H265 coming up, which is much more CPU-intensive than H264. Without hardware assistance a Mac Pro might be slower at exporting to H265 than a MacBook Air. I don't know what Adobe will do.
 
...Quick Sync is available on most Intel CPUs since Sandy Bridge (except Xeon)...

Regarding Quick Sync: I noticed that the lack of it means the MS Silverlight browser plugin sometimes doesn't work, or works badly, on Xeon systems. I wanted to subscribe to NowTV/Sky, but the experience was just terrible in a browser, and in full-screen mode it suffered from flashes that reminded me of Macrovision protection from the VHS days.
 