Sorry, I am a bit confused. If a CPU can do the job in parallel, why is it impossible on a GPU?...
Some clarifying info (and this also addresses AidenShaw's statement): first we must review the relative performance characteristics of the CPU and GPU:
The i7-6700K CPU has been benchmarked at about 240 LINPACK GFLOPS.
By comparison, the M395X is about 3.7 TFLOPS, or roughly 15x faster. Newer chips like the 6700K are pretty fast thanks to Cray-style vector instructions (AVX2), but the GPU is much faster still -- on the types of problems where its hardware can actually be employed.
To realize the GPU's 15x performance advantage, its thousands of cores must be fully utilized. Typically this requires multiple GPU threads per core -- that's how the hardware is designed. If this isn't done, each GPU core will be under-utilized and the performance won't be there. It would be like running one thread on an IBM POWER8 CPU core, which is 8-way hyperthreaded. On a GPU you *must* have many thousands of threads going, else those GPU cores will not be utilized and it's faster to just use the CPU.
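To put rough numbers on that, here is a minimal Python sketch. The core count and threads-per-core figures below are my own illustrative assumptions (the M395X is usually quoted at 2048 stream processors), not measurements:

```python
# Back-of-envelope: how many threads does it take to keep a GPU busy?
CPU_GFLOPS = 240        # i7-6700K LINPACK figure from above
GPU_GFLOPS = 3700       # M395X, ~3.7 TFLOPS

print(f"GPU/CPU ratio: {GPU_GFLOPS / CPU_GFLOPS:.0f}x")       # ~15x

GPU_CORES = 2048        # assumed stream-processor count for the M395X
THREADS_PER_CORE = 4    # assumed oversubscription needed to hide latency
print(f"Threads to saturate the GPU: {GPU_CORES * THREADS_PER_CORE}")  # 8192
```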
An H.264 video file is divided into Groups of Pictures (GOPs). AidenShaw was correct -- each GOP is totally independent and can be encoded/decoded separately.
How big are these GOPs? In a 4K video file, each *frame* is 8 megapixels -- about 1 megabyte once encoded. A typical GOP length for web distribution is 3x the frame rate, or 90 frames at 30 fps. So each GOP is about 90 megabytes in that case.
How many GOPs are in a 500MB file? About 6. We already see the problem. There are not *remotely* enough GOPs in a typical-size video file to spread across the many thousands of GPU threads. Only one GPU thread can work within a GOP, so only 6 GPU cores can be leveraged on this file. Also, each GPU core is not fully utilized with only one thread.
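Checking those numbers with a quick Python sketch (all figures are the estimates from this post):

```python
# 4K example: GOP size and GOP count for a 500 MB file.
FRAME_MB = 1.0          # ~1 MB per encoded 4K frame (estimate above)
FPS = 30
GOP_FRAMES = 3 * FPS    # typical web-distribution GOP: 3x frame rate = 90

gop_mb = GOP_FRAMES * FRAME_MB
print(f"GOP size: {gop_mb:.0f} MB")               # 90 MB

FILE_MB = 500
print(f"GOPs in file: {FILE_MB / gop_mb:.1f}")    # ~5.6, i.e. about 6
```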
If the file was 5GB (not 500MB) *and* the GOP size was 15 frames (not 90), that is still only 333 GPU threads -- and due to data dependencies each GOP can only have one thread working on it, so each of those 333 cores runs a single, under-utilized thread. Thus most of the GPU resources would be sitting idle.
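The same sketch with those numbers:

```python
# More favorable scenario: 5 GB file, 15-frame GOPs.
FRAME_MB = 1.0          # still ~1 MB per encoded 4K frame
GOP_FRAMES = 15
FILE_MB = 5000          # 5 GB

gop_mb = GOP_FRAMES * FRAME_MB                     # 15 MB per GOP
print(f"GOPs (= max usable threads): {FILE_MB / gop_mb:.0f}")   # ~333
```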
The only way to leverage a GPU for H.264 encode/decode is to go within the GOP and have multiple GPU threads per GOP. Yet that is virtually impossible due to data dependencies within the GOP, as discussed by x264 lead developer Jason Garrett-Glaser in the previously-listed video.
Let's take a more extreme case: if the file was 1080p (about 256 kilobytes per frame), the GOP size was 15, and the file size was 10GB, that would be about 3.8MB per GOP, or roughly 2,631 GOPs in the file. That could still only use 2,631 threads, which is not enough to fully harness the GPU cores, but it's getting a little closer. However, all the data those GPU threads are manipulating must be in VRAM, which is typically about 2GB. Thus at any one moment only about 1/5th of those threads could run, since there is only 1/5th enough VRAM. So in that case the VRAM limit prevents significant progress.
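And the last scenario, including the VRAM ceiling (same illustrative figures as above):

```python
# Extreme case: 1080p, short GOPs, large file -- plus the VRAM limit.
FRAME_MB = 0.256        # ~256 KB per encoded 1080p frame
GOP_FRAMES = 15
FILE_MB = 10_000        # 10 GB

gop_mb = GOP_FRAMES * FRAME_MB                    # ~3.8 MB per GOP
gops = FILE_MB / gop_mb                           # ~2,600 GOPs = max threads
print(f"GOP size: {gop_mb:.1f} MB, GOPs: {gops:.0f}")

VRAM_MB = 2000          # typical ~2 GB of VRAM
resident = VRAM_MB / gop_mb                       # GOPs that fit at once
print(f"GOPs resident in VRAM: {resident:.0f} ({resident / gops:.0%})")  # ~20%
```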