OK, I was curious, and this is interesting: now on OS X 10.11.3 with the newest Nvidia drivers and FCP 10.2.3, with the same Mac Pro 4,1 (W3690 6-core 3.46 GHz) and the same GTX 770 4 GB, the benchmark is now 29 seconds (!).

I think I should switch to El Capitan when using FCP. :cool:

Same Mac Pro, new graphics card (GTX Titan X), OS X 10.11.6: the benchmark is now 20 seconds.
 
Now that Nvidia has released Pascal-compatible drivers, I'm curious how the 10-series GPUs perform on this benchmark!
 
It might take a little time, as CUDA hasn't been released for those cards yet, and the first drivers seem to need more optimisation work.
 
It might take a little time, as CUDA hasn't been released for those cards yet, and the first drivers seem to need more optimisation work.

Based on what, exactly? People posting screenshots of Unigine Heaven at 1600x900 resolution?
 
Placed an order today for a GTX 1060 6GB... will update once I have some benchmark numbers.

--

In the meantime, I'm re-running the BruceX test with my existing 5770 since it's been a while.

10.12.4 / FCP 10.3.2 / Mac Pro 5,1 6-core / 48 GB RAM
Export setting: Apple ProRes 4444 XQ
Export drive: Crucial MX100 256GB, lower drive bay

71 secs : Radeon 5770 (OEM/EFI model)

This is a significant regression from the previous 45 seconds I recorded, but this is likely due to changes in my workflow that have increased the 'ambient' demands I've placed on the VRAM. Specifically, I now run 3 monitors instead of 2, and went from 1 set of virtual desktop Spaces to 11. So the GPU is now handling 33 virtual desktops (3 desktops per Space * 11 Spaces) instead of only 2 when I last ran the benchmark.
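For scale, here's a rough back-of-envelope calculation (entirely my own illustration; the post doesn't state display resolutions) of what those desktops could cost in VRAM, assuming 1920x1080 displays and a 4-byte-per-pixel framebuffer per virtual desktop:

```python
# Hypothetical estimate of framebuffer VRAM for 33 virtual desktops,
# assuming 1920x1080 displays at 4 bytes per pixel (not the poster's figures).
width, height, bytes_per_px = 1920, 1080, 4
desktops = 3 * 11  # 3 monitors x 11 Spaces

per_desktop = width * height * bytes_per_px       # 8,294,400 bytes (~7.9 MiB)
total_mib = desktops * per_desktop / 2**20
print(f"{desktops} desktops ≈ {total_mib:.0f} MiB of framebuffers")  # ≈ 261 MiB
```

On a 1 GB card like the OEM 5770, that would already be roughly a quarter of the VRAM before FCPX asks for anything, which is consistent with the constant swapping described below.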

I was looking at iStat while the benchmark was running... the GPU was pegged most of the time while VRAM utilization kept bouncing back and forth, having to swap things in and out constantly. Even before I added the virtual desktops, this poor Radeon would idle around 74C and reach the upper 80s under load. I'm definitely looking forward to replacing it with the much cooler-running 1060.
 
A new Final Cut Pro is out; 10.3.3 was just released last night. With the Nvidia Titan X Pascal under Sierra 10.12.4, I see an improvement from 47 seconds to 30 seconds. Quite a bit. But still, El Capitan runs faster: I got 20 seconds with the Titan X Maxwell. I can't see how the Titan X Pascal performs there, since Nvidia's Pascal driver does not support El Capitan.
 
10.12.4 / FCP 10.3.3 / Mac Pro 5,1 6-core / 48 GB RAM
Export setting: Apple ProRes 4444 XQ
Export drive: Crucial MX100 256GB, lower drive bay

73 secs : Radeon 5770 (OEM/EFI model)

I got more variation in samples this time around (79s/73s/69s/69s, vs. 73s/71s/70s for 10.3.2). I'm down to 10 Spaces now. There doesn't seem to be any difference whether or not I have any additional windows open besides FCPX and QuickTime.
 
A new Final Cut Pro is out; 10.3.3 was just released last night. With the Nvidia Titan X Pascal under Sierra 10.12.4, I see an improvement from 47 seconds to 30 seconds. Quite a bit. But still, El Capitan runs faster: I got 20 seconds with the Titan X Maxwell. I can't see how the Titan X Pascal performs there, since Nvidia's Pascal driver does not support El Capitan.

What export codec (for both tests)? :)
 
ProRes 422HQ, as the author of BruceX has suggested from the very beginning. Have you done yours?
 
ProRes 422HQ, as the author of BruceX has suggested from the very beginning. Have you done yours?


That's why you get such a low number.

People are exporting in all kinds of codecs, so the results from this particular test are always confusing.

For example, Barefeats exports in 4444 XQ in his tests, and dolphin842 in this thread did the same :)

Anyway, I just tested a system we built (exporting in 4444 XQ), and I am quite amazed by AMD + FCPX:

Mac Pro 5,1
12-core 2.66 GHz
48 GB RAM
SATA II SSD for boot
PCIe SSD for library + export
Sierra 10.12.4

1 x GTX 980 Ti = ~1:02 min

1 x GTX 980 Ti + 1 x Titan X (Maxwell) = ~1:32 min (???)

2 x GTX 980 Ti = FCPX playback bug (known error)
2 x Titan X (Maxwell) = FCPX playback bug (known error)

2 x 5770 = ~35 s (LOL)

Might be the CPU, the CPU overhead, or something I'm not aware of, but I totally laughed my ass off.


Really looking forward to testing some Pascal cards...
 
I had tried 2x 7970 running in El Capitan, exporting ProRes 422HQ (again, this has long been the standard for testing BruceX since it came out). It only takes 16 seconds.

The only Nvidia card that comes close is the Titan X Maxwell running under El Capitan, which took 20 seconds. Because the Pascal driver doesn't support El Capitan, I can't see the improvement from the Maxwell Titan to the Pascal Titan under El Capitan. The Titan X Pascal in Sierra takes 30-32 seconds in my test with FCP 10.3.3; a few days earlier with FCP 10.3.2, it took 47-50 seconds.
 
Indeed, I used 4444 XQ because Barefeats used it, and the original BruceX page didn't specify which codec to use. Although for both the 5770 and the GTX 1060, I get identical times regardless of the codec used (4444 XQ or 422HQ).

Speaking of, I got my GTX 1060 up and running today. Specifically, it's the "Zotac 1060 AMP! ZT-P10600B-10M, 6GB". Results are... interesting:

10.12.4 / FCP 10.3.3 / Mac Pro 5,1 6-core / 48 GB RAM
Export setting: Apple ProRes 4444 XQ
Export drive: Crucial MX100 256GB, lower drive bay

73 secs : Radeon 5770 (OEM/EFI model)
102 secs : GTX 1060 6GB

Looking at iStat, it's clear the GPU isn't being taxed at all: only very brief spikes to 50% utilization, with <10% utilization most of the time. VRAM fluctuates between ~70%-100% during the export, at a rate similar to the 5770. CPU usage for FCP fluctuates between ~95%-160%. I don't recall the CPU being used that much during 5770 exports, but I'm not sure. The first few tries, the GPU's fans would get noticeably louder, but had no additional spin-up for the last two tries [Update: It wasn't the GPU's fans, but the Mac Pro's expansion card fan. Sometimes it'll kick up and stay up for a while even with no activity, then calm back down once a temporary load is placed on the GPU... might have an issue getting cues from the 1060].

My guess is that something's up with the current driver release, and FCP doesn't see the GTX as an export option, so it's using the CPU to do the export. We'll see I suppose. Thankfully I got this card mostly for gaming on Windows (and to have the Mac Pro generally run cooler and quieter). The Zotac's fans sound 'rougher' than the OEM 5770, but when it's not being taxed, it's certainly quieter overall than the 5770.

More updates as events warrant... :D
 
I had tried 2x 7970 running in El Capitan, exporting ProRes 422HQ (again, this has long been the standard for testing BruceX since it came out). It only takes 16 seconds.

The only Nvidia card that comes close is the Titan X Maxwell running under El Capitan, which took 20 seconds. Because the Pascal driver doesn't support El Capitan, I can't see the improvement from the Maxwell Titan to the Pascal Titan under El Capitan. The Titan X Pascal in Sierra takes 30-32 seconds in my test with FCP 10.3.3; a few days earlier with FCP 10.3.2, it took 47-50 seconds.

Could you try exporting to 4444 XQ?
I get that you exported at 422HQ, but many export at 4444 XQ, and for the results to be comparable, everybody obviously has to export in the same codec :D

I will try 422HQ too :)

I am curious to see your results.

Today I will put the GTX 980 Ti in a 12-core @ 3.46 GHz and test.
 
I just tried with 4444 XQ. Same as 422HQ: it took the card 30 seconds. I read somewhere that the codec doesn't make a difference in the BruceX export test.

I wonder if FCP in Sierra is CPU-dependent. Proof is that I just tested my Dell T3500 Hackintosh with the Titan X Pascal: it takes 50 seconds. I am puzzled again. That Hackintosh's CPU is an X5687, only 4 cores / 8 threads but clocked at 3.6 GHz. My Mac Pro is dual 3.46 GHz, 12 cores / 24 threads. It looks like in Sierra the GPU is less utilized than in El Capitan, so the CPU is more involved?

I remember a while ago I had two 7970s doing the same test, and the conclusion was that it's not CPU-dependent, i.e. 6-core vs. 12-core doesn't matter. That was on Yosemite and El Capitan, but apparently not anymore in Sierra. Can anyone confirm this?
 
Here's a screenshot from OpenGL Driver Monitor showing a BruceX run. The test begins at the arrow:

[Screenshot: fcpx.png]

VRAM fluctuates as expected, but notice the GPU Core Utilization stat never reaches 60% (the 86% maximum stat was from earlier in the day).
 
dolphin842,

Omegafilm talked about his finding in response to my other post: a clean install brings BruceX performance back to El Capitan levels. Will you try to confirm it as well? Sounds interesting.



Hi Prince134!
Did you solve that problem?
In my case, my R9 290X's BruceX score was 20 seconds in El Capitan.
At that time it seemed like a miracle to me.
And after moving to Sierra: 50 seconds!
After learning it wasn't a Metal support problem, I deleted all FCP traces with AppCleaner (and manually), and installed FCPX again.
And NOW my score IS 19 seconds! :)
 
I don't have AppCleaner but I do have Hazel, which offers a similar service. It cleared out the following files and folders:

/Library/Application Support/Final Cut Pro/
~/Library/Application Support/Final Cut Pro/
~/Library/Caches/com.apple.FinalCut
~/Library/Containers/com.apple.InternalFiltersXPC
~/Library/Preferences/com.apple.FinalCut.plist
~/Library/Preferences/com.apple.FinalCut.UserDestinations.plist
~/Library/Saved Application State/com.apple.FinalCut.savedState
/private/var/folders/m_/[random string]/C/com.apple.FinalCut
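For reference, the cleanup above can be scripted. This is a hypothetical dry-run sketch of my own (not Hazel's or AppCleaner's method): it only prints what it would remove unless `dry_run` is set to False, and it omits the /private/var/folders path because the random string in it has to be located by hand.

```python
from pathlib import Path
import shutil

# Final Cut Pro support/cache/preference paths listed above.
# The /private/var/folders/.../com.apple.FinalCut path is intentionally
# omitted: its per-user random component must be found manually.
FCP_PATHS = [
    "/Library/Application Support/Final Cut Pro",
    "~/Library/Application Support/Final Cut Pro",
    "~/Library/Caches/com.apple.FinalCut",
    "~/Library/Containers/com.apple.InternalFiltersXPC",
    "~/Library/Preferences/com.apple.FinalCut.plist",
    "~/Library/Preferences/com.apple.FinalCut.UserDestinations.plist",
    "~/Library/Saved Application State/com.apple.FinalCut.savedState",
]

def clean(paths, dry_run=True):
    """Delete (or, in dry-run mode, just report) each path."""
    for p in (Path(s).expanduser() for s in paths):
        if not p.exists():
            print("skip (missing):", p)
        elif dry_run:
            print("would delete: ", p)
        elif p.is_dir():
            shutil.rmtree(p)
        else:
            p.unlink()

clean(FCP_PATHS)  # dry run: prints one line per path
```

Quit FCPX first, and reinstall it afterwards, as described in the posts above.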

After deleting those, deleting FCP itself, and reinstalling, I re-ran BruceX and the results got... worse?? :confused:

10.12.4 / FCP 10.3.3 / Mac Pro 5,1 6-core / 48 GB RAM
Export setting: Apple ProRes 422 HQ
Export drive: Crucial MX100 256GB, lower drive bay

121s (125/119/118) : GTX 1060 6GB

Here's another screenshot of OpenGL Driver Monitor showing the three runs:
[Screenshot: fcp-postpref-3test.png]
The first run was the worst, but I don't see much difference in GPU utilization compared to the previous screenshot. The only real difference I can see is that the VRAM cycling isn't as 'deep' in the longer runs.

Maybe I missed a pref or cache file somewhere? Maybe Compressor and Motion have to be clean-installed as well? If we can get a list of things Omegafilm removed, maybe we can reproduce the improvement.
 
My test with a clean installation was the same: NO change. I think Omegafilm may have overlooked that FCP was updated from 10.3.2 to 10.3.3 just 3 days ago; as I noted, that update took my Titan X test from 47 seconds to 30 seconds. So his finding is probably due to the FCP update. Between the two OSes, Sierra is still less efficient at using the GPU than El Capitan.

I believe Barefeats' test was based on 10.3.2, so it's not comparable now.
 
I just tried with 4444 XQ. Same as 422HQ: it took the card 30 seconds. I read somewhere that the codec doesn't make a difference in the BruceX export test.

I wonder if FCP in Sierra is CPU-dependent. Proof is that I just tested my Dell T3500 Hackintosh with the Titan X Pascal: it takes 50 seconds. I am puzzled again. That Hackintosh's CPU is an X5687, only 4 cores / 8 threads but clocked at 3.6 GHz. My Mac Pro is dual 3.46 GHz, 12 cores / 24 threads. It looks like in Sierra the GPU is less utilized than in El Capitan, so the CPU is more involved?

I remember a while ago I had two 7970s doing the same test, and the conclusion was that it's not CPU-dependent, i.e. 6-core vs. 12-core doesn't matter. That was on Yosemite and El Capitan, but apparently not anymore in Sierra. Can anyone confirm this?
It's natural for Nvidia GPUs. Neither Maxwell nor Pascal GPUs have hardware scheduling, so they rely on the drivers (and the CPU) to extract every last bit from the GPU.

AMD GPUs have hardware scheduling, which is why they are not CPU-dependent. You are describing a textbook example of the difference between those two architectures and how they work.
 
It's natural for Nvidia GPUs. Neither Maxwell nor Pascal GPUs have hardware scheduling, so they rely on the drivers (and the CPU) to extract every last bit from the GPU.

AMD GPUs have hardware scheduling, which is why they are not CPU-dependent. You are describing a textbook example of the difference between those two architectures and how they work.

Please stop making things up. There's plenty of evidence to suggest NVIDIA has had "hardware scheduling" for a very long time. A quick Google search finds things like:

https://twitter.com/oculuscat/status/529377336383537152

Pascal features a better implementation of asynchronous compute for NVIDIA, though it's still slightly different to AMD's implementation.

FWIW it's very likely that these FCPX issues have nothing to do with the GPU and the bottleneck is elsewhere (it really sounds like Apple changed something fairly fundamentally and the drivers haven't had a chance to catch up). GPU utilization appears to be very low in these tests.
 
Please stop making things up. There's plenty of evidence to suggest NVIDIA has had "hardware scheduling" for a very long time. A quick Google search finds things like:

https://twitter.com/oculuscat/status/529377336383537152

Pascal features a better implementation of asynchronous compute for NVIDIA, though it's still slightly different to AMD's implementation.

FWIW it's very likely that these FCPX issues have nothing to do with the GPU and the bottleneck is elsewhere (it really sounds like Apple changed something fairly fundamentally and the drivers haven't had a chance to catch up). GPU utilization appears to be very low in these tests.
Nope. Consumer Pascal and Maxwell GPUs are the same architecture at both the low and high level, just on different nodes. I am not making anything up. Kepler does not have a hardware scheduler, and consumer Pascal does not have it either. I do not know about GP100, but I think it does have a hardware scheduler, because Nvidia will reuse that low-level architecture in consumer Volta GPUs (GV104, GV102...), and they will have massively improved DX12/Vulkan capabilities. Again, we will see the same situation as we've seen this year with Pascal: the consumer line will be different from HPC. But back to the topic.

So that tweet is partially correct.
 
Nope. Consumer Pascal and Maxwell GPUs are the same architecture at both the low and high level, just on different nodes. I am not making anything up. Kepler does not have a hardware scheduler, and consumer Pascal does not have it either. I do not know about GP100, but I think it does have a hardware scheduler, because Nvidia will reuse that low-level architecture in consumer Volta GPUs (GV104, GV102...), and they will have massively improved DX12/Vulkan capabilities. Again, we will see the same situation as we've seen this year with Pascal: the consumer line will be different from HPC. But back to the topic.

So that tweet is partially correct.

What's your definition of "hardware scheduling" here exactly?
 
• Static scheduling (optimized by the compiler): when there is a stall (hazard), no further instructions are issued. Of course, the stall has to be enforced by the hardware.
• Dynamic scheduling (enforced by hardware): instructions following the one that stalls can issue if they do not produce structural hazards or dependencies.

This is the difference. It shows in how Nvidia drivers handle DX11 code vs. how AMD drivers handle DX11 code. Nvidia has very optimized commands, scheduling, and paths for this API, but it is heavily reliant on the CPU.

As for compute, it's the same story. CUDA is software, and it relies on CPU performance for scheduling, but the execution of the tasks is done 100% by the GPU.

In other words: static scheduling does not allow out-of-order execution of code; hardware scheduling does.
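The in-order vs. out-of-order distinction described in this post can be shown with a toy simulation (entirely my own illustration, not from the thread): each instruction has dependencies and a latency, one instruction issues per cycle, and an in-order issuer stalls the whole stream behind a waiting instruction while an out-of-order issuer picks any ready one.

```python
def run(instrs, in_order):
    """Issue at most one instruction per cycle; return total cycles to finish."""
    done = {}               # name -> cycle at which its result is available
    pending = list(instrs)  # (name, deps, latency) tuples, in program order
    cycle = 0
    while pending:
        issued = None
        for i, (name, deps, lat) in enumerate(pending):
            if all(d in done and done[d] <= cycle for d in deps):
                issued = i
                break
            if in_order:
                break  # static issue: cannot look past a stalled instruction
        if issued is not None:
            name, deps, lat = pending.pop(issued)
            done[name] = cycle + lat
        cycle += 1
    return max(done.values())

prog = [
    ("load", [],       3),  # long-latency load
    ("add",  ["load"], 1),  # depends on the load's result
    ("mul",  [],       1),  # independent work
    ("sub",  [],       1),  # independent work
]
print("in-order:    ", run(prog, True))   # 6 cycles: mul/sub wait behind the load
print("out-of-order:", run(prog, False))  # 4 cycles: stall slots filled with mul/sub
```

The model says nothing about whether any particular GPU schedules this way; it only illustrates why dynamic (hardware) scheduling can hide stalls that static, in-order issue cannot.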
 
• Static scheduling (optimized by the compiler): when there is a stall (hazard), no further instructions are issued. Of course, the stall has to be enforced by the hardware.
• Dynamic scheduling (enforced by hardware): instructions following the one that stalls can issue if they do not produce structural hazards or dependencies.

This is the difference. It shows in how Nvidia drivers handle DX11 code vs. how AMD drivers handle DX11 code. Nvidia has very optimized commands, scheduling, and paths for this API, but it is heavily reliant on the CPU.

As for compute, it's the same story. CUDA is software, and it relies on CPU performance for scheduling, but the execution of the tasks is done 100% by the GPU.

In other words: static scheduling does not allow out-of-order execution of code; hardware scheduling does.

So you're talking about work done one time during shader compilation, which does not affect runtime CPU performance when those shaders are being used. I fail to see why NVIDIA is therefore more reliant on CPU performance during execution of an FCPX benchmark? The conclusion you're drawing here seems quite disingenuous, because shader compilation on the CPU generally has nothing to do with runtime performance (i.e. you can't claim that FCPX is slower on NVIDIA because their shader compiler is doing more/different work than AMD's compiler does).
 