...I am puzzled why the apps don't use the double encoder or the full GPU power. Apps should need no change to use more GPU capacity, shouldn't they?...
That is a good question. While it may intuitively seem that apps doing video decode/encode should require no tweaking to use additional hardware accelerators, there are several things to consider.
Prior to Apple Silicon, CPUs typically had a single decode/encode accelerator shared among all cores. E.g., Intel CPUs had only one Quick Sync block, not one per core. I believe it's the same with NVIDIA's NVENC/NVDEC and AMD's UVD/VCE: one per card. With the M1 Max, and even more so with the M1 Ultra, for the first time there are multiple accelerators available.
For Long GOP codecs such as H264 and HEVC, in theory each GOP (Group of Pictures) is independent and can be processed in parallel by a separate thread. It is impossible to use multiple threads within a GOP because the algorithm is inherently sequential -- you cannot process frame 10 until you are finished with frame 9. A typical GOP size might be 20 frames.
Thus a 10-minute video (18,000 frames at 30 fps) might contain 900 GOPs. If only you had a 900-core CPU...
The hundreds of lightweight threads a GPU can muster cannot be used for this because they are individually too slow, and only one thread per GOP is allowed. So with software decoding, an 8-core CPU can use all eight cores, one per GOP, but it's still slow.
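To make the thread-per-GOP idea concrete, here's a minimal sketch in Swift. The GOP/DecodedGOP types and the decode step are placeholders I made up for illustration, not a real codec:

```swift
import Foundation

// Placeholder types -- a real decoder would come from FFmpeg, VideoToolbox, etc.
struct GOP { let frames: [Data] }          // ~20 compressed frames
struct DecodedGOP { let frames: [Data] }   // decoded output

// Frames inside a GOP must be decoded in order: frame N needs frame N-1.
func decodeSequentially(_ gop: GOP) -> DecodedGOP {
    var out: [Data] = []
    for frame in gop.frames {
        out.append(frame)   // stand-in for the real, inherently serial decode step
    }
    return DecodedGOP(frames: out)
}

// Separate GOPs are independent, so dispatch one per thread:
// an 8-core CPU works on 8 GOPs at a time.
func decodeAll(_ gops: [GOP]) -> [DecodedGOP] {
    var results = [DecodedGOP?](repeating: nil, count: gops.count)
    let lock = NSLock()   // the results array is shared, so writes need a lock
    DispatchQueue.concurrentPerform(iterations: gops.count) { i in
        let decoded = decodeSequentially(gops[i])
        lock.lock(); results[i] = decoded; lock.unlock()
    }
    return results.compactMap { $0 }
}
```

With 900 GOPs queued, all eight cores stay busy, but each individual GOP is still limited by single-thread software decode speed.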
A hardware decode/encode accelerator is much faster since it executes the core algorithm in fixed-function hardware. However, it must either be used by a single thread, or a pool of threads must share it, saving state or coordinating via some synchronization method. Either way the limit then becomes accelerator performance. If only somebody would make a CPU with multiple accelerators...
Wait -- Apple just did!
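To illustrate the sharing problem just described, here's a hypothetical sketch of worker threads funneling through a fixed number of hardware engines via a counting semaphore (EnginePool and withEngine are made-up names, not a real API):

```swift
import Foundation

// Hypothetical pool: engineCount is 1 on older designs, 2 for the
// M1 Max H264/HEVC encoders, 4 on the M1 Ultra.
final class EnginePool {
    private let available: DispatchSemaphore
    init(engineCount: Int) {
        available = DispatchSemaphore(value: engineCount)
    }
    // Every GOP-worker thread funnels through here.
    func withEngine<T>(_ work: () -> T) -> T {
        available.wait()              // block until an engine is free
        defer { available.signal() }  // hand the engine back, even on early exit
        return work()
    }
}
```

Set engineCount to 1 and the accelerator becomes a strictly serial bottleneck no matter how many worker threads you spin up; set it to 2 or 4 and that many GOPs can be in flight at once.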
Anytime a single hardware entity has existed for a long time -- whether a single-core machine or a single video accelerator -- software tends to develop assumptions and dependencies on it. E.g., when the first multi-core machines became available, they uncovered all kinds of timing windows and thread-safety issues, even in code that was already multithreaded.
With M1 Max we have one H264/HEVC decoder, two encoders, and two separate ProRes decode/encode units. With M1 Ultra we have two H264/HEVC decoders, four encoders, and four separate ProRes engines. It's possible some software at the app layer or system layer has dependencies built on the assumption of a single media engine, and that's not yet fixed. It may not be a simple matter of spinning up another thread to work on another GOP in parallel: all threads must use proper synchronization methods (mutex, semaphore, lock, etc.) or the app will crash. Adding that logic and properly testing it on many workloads and platforms is difficult. Especially difficult is the "exception case," where a thread unpredictably hits an exception and must gracefully restore state -- e.g., 1 time out of 10 a user cancels an export, and that creates a condition where the system may crash 15 minutes later.

There can also be latent problems in multithreaded code that only appear when additional functional units (cores, accelerators, etc.) are added, because having a single functional unit subtly synchronized the threads, hiding a concurrency bug.
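Here's a contrived, self-contained sketch of that last point -- shared state with a missing lock that stays hidden as long as a single "engine" serializes the workers:

```swift
import Foundation

// `engines` plays the role of the hardware pool. With value 1, only one
// worker is ever inside the critical region, so the missing lock around
// `framesEncoded` never matters -- the single engine subtly synchronizes
// the threads. Change the value to 2 or 4 and workers can interleave the
// unsynchronized read-modify-write, losing updates (or worse).
let engines = DispatchSemaphore(value: 1)   // try 2: the latent bug surfaces
var framesEncoded = 0                       // shared state, no lock (the bug)

DispatchQueue.concurrentPerform(iterations: 900) { _ in
    engines.wait()
    // ... submit one 20-frame GOP to a hardware encoder ...
    framesEncoded += 20                     // data race once engines > 1
    engines.signal()
}
print(framesEncoded)                        // 18000 only if the race never fires
```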
However, it seems the ProRes engines already scale with the number of units. That might be because, unlike Quick Sync (which has existed in unitary form since 2011), the ProRes accelerators are new, hence the supporting code is newer and was designed from the outset for the possibility of multiple accelerators.
It seems very likely that newer versions of macOS and/or FCP and other NLEs will better leverage the multiple H264/HEVC decode/encode accelerators present on the M1 Max and Ultra. This is different from a GPU, which from the outset had many functional units, so GPU software and development frameworks were always written with that in mind. When that software improvement is made to harness multiple video accelerators, it seems possible that an M1 Ultra could suddenly decode H264/HEVC video 2x faster and encode it 4x faster.
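For context, this is roughly what requesting hardware encode looks like from the app side through VideoToolbox today (a sketch, not FCP's actual code). Notice there's no parameter naming an engine or an engine count: the app just asks for "hardware," so any scaling across multiple engines has to come from the OS scheduling work, or from apps running multiple sessions concurrently:

```swift
import VideoToolbox

// Ask for a hardware-accelerated HEVC encoder. Which engine(s)
// actually do the work is entirely up to the OS.
let spec = [
    kVTVideoEncoderSpecification_EnableHardwareAcceleratedVideoEncoder: true
] as CFDictionary

var session: VTCompressionSession?
let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 3840, height: 2160,        // 4K frames
    codecType: kCMVideoCodecType_HEVC,
    encoderSpecification: spec,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: nil, refcon: nil, // frames are submitted with an output handler
    compressionSessionOut: &session)
assert(status == noErr, "hardware HEVC encoder unavailable")
```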
OTOH, there's a separate possible issue with encoding formats that use "dependent GOPs", aka "open GOPs". In those cases one GOP refers to another GOP and concurrent processing of separate GOPs is more difficult. I don't know how widespread that is or how it affects performance scalability of multiple accelerators, but the academic literature says it's a more difficult case.
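As a rough sketch of why open GOPs hurt scalability: runs of GOPs that reference their predecessors collapse into serial chains, and only whole chains can run in parallel across engines (dependsOnPrevious is a hypothetical per-GOP flag):

```swift
// dependsOnPrevious[i] is true when GOP i references GOP i-1 (an open GOP).
// Consecutive dependent GOPs are grouped into one chain that must run serially.
func independentChains(gopCount: Int, dependsOnPrevious: [Bool]) -> [[Int]] {
    var chains: [[Int]] = []
    for i in 0..<gopCount {
        if i == 0 || !dependsOnPrevious[i] {
            chains.append([i])                   // a new independent chain starts
        } else {
            chains[chains.count - 1].append(i)   // must wait for its predecessor
        }
    }
    return chains
}
// All closed: 900 chains of length 1 -> plenty of parallel work for 2 or 4 engines.
// All open: one 900-GOP chain -> effectively serial, extra engines sit idle.
```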