
frustum

macrumors member
Original poster
Jun 16, 2021
GravityMark is a free cross-platform, cross-API GPU benchmark with native support for Metal and Apple silicon.
Check the stability of your systems and compare the performance of different platforms by drawing an enormous number of asteroids.

GravityMark_600.jpg


https://gravitymark.com/
 
We released a new version with leaderboard support.


Metal API performance is less than half that of Direct3D12/Vulkan running on exactly the same hardware:

Metal 34 FPS:
D3D12 80 FPS:

The same issue may be why the Apple M1 scores lower than an AMD APU:

Apple M1 13 FPS:
AMD APU 16 FPS:

We are trying to resolve this situation with Apple. But we cannot do it alone.
 
Yes, the 200,000-asteroid count is a balance that still gives a decent result on integrated GPUs.
 
So ... maybe I'm not fully understanding, but what you're suggesting is that there's a bug on macOS that produces scores that are not truly representative of the real relative power of the GPU? If so, then the Mac version of your program is effectively worthless, correct?
 
The problem is that there is no efficient way to draw thousands of objects with the Metal API.

An AMD GPU can draw thousands of objects with a single command. This command is not available in the Metal API. That is why the macOS version is 3 times slower than the other APIs running on the same hardware.

We emulate this single command as a giant loop of single draw commands. The Direct3D11 (Nvidia and Intel) and OpenGLES APIs do not support that single command either. But even there the emulation does not cause a 3x slowdown as it does on Metal.
 
I presume you are not using Mesh Shaders or Primitives in D3D or Vulkan. @leman is there not a faster way in Metal 2 to draw thousands of the same object?
 
Can you elaborate more on this? Metal has full support for instanced rendering (on par with any other API) as far as I know as well as advanced indirect rendering.
 
Mesh shaders require the latest GPUs (RTX / RX 6XXX). The Metal API and all mobile devices don't support mesh shaders at the moment.
 
Yes, instanced rendering can draw a huge number of objects. The problem is that the CPU cannot prepare data for the GPU at a reasonable FPS, especially when there are more than 100k objects and the scene is fully dynamic. The GPU can handle scene processing and draw-command generation with unbeatable performance.

The Metal API allows only a single draw command per indirect rendering call. That means there is no way to draw different geometry in a single command, and each geometry variation needs a separate draw command:


Other APIs, except D3D11 and OpenGLES, have multi-draw-indirect-count functionality, which makes it possible to combine multiple draw-indirect commands in a single API call. No CPU-GPU synchronization is required for that:


We emulate multi-draw-indirect-count functionality as a loop of draw-indirect commands for D3D11, OpenGLES, and Metal. On the previous Nvidia GPU generation this emulation even runs faster than the native path because of a driver issue:

D3D11 71 FPS (GTX 1060):

Vulkan 58 FPS (GTX 1060):

Unfortunately, that is not the case with the Metal API, where the same hardware is three times slower.
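As a rough CPU-side model of the difference described above, here is a minimal sketch. It is not GravityMark's code: `Encoder`, `emulate_multi_draw`, and `make_indirect_buffer` are hypothetical names, and the command struct just mirrors the five 32-bit fields of Vulkan's `VkDrawIndexedIndirectCommand`. It also assumes that unused slots in the indirect buffer are zeroed, so the emulation loop can run over the maximum draw count without a CPU-GPU read-back of the real count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same field order as VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    uint32_t index_count;
    uint32_t instance_count;
    uint32_t first_index;
    int32_t vertex_offset;
    uint32_t first_instance;
};

// Hypothetical recorder standing in for a real command encoder.
struct Encoder {
    uint32_t api_calls = 0;                          // calls issued by the CPU
    std::vector<DrawIndexedIndirectCommand> drawn;   // commands that actually draw something

    // Decode one command; zeroed commands draw nothing.
    void execute(const uint8_t *data) {
        DrawIndexedIndirectCommand command;
        std::memcpy(&command, data, sizeof(command));
        if(command.index_count && command.instance_count) drawn.push_back(command);
    }

    // One API call per command (the only indirect path in the emulated APIs).
    void draw_indexed_indirect(const std::vector<uint8_t> &buffer, size_t offset) {
        api_calls++;
        execute(buffer.data() + offset);
    }

    // One API call for all commands; the count comes from a GPU-written value.
    void multi_draw_indexed_indirect_count(const std::vector<uint8_t> &buffer, size_t offset,
                                           uint32_t gpu_count, uint32_t max_draws, size_t stride) {
        assert(stride >= sizeof(DrawIndexedIndirectCommand));
        api_calls++;
        uint32_t count = std::min(gpu_count, max_draws);
        for(uint32_t i = 0; i < count; i++) execute(buffer.data() + offset + stride * i);
    }
};

// Emulation: the CPU does not know the GPU-side count, so it loops over all slots.
void emulate_multi_draw(Encoder &encoder, const std::vector<uint8_t> &buffer,
                        size_t offset, uint32_t num_draws, size_t stride) {
    for(uint32_t i = 0; i < num_draws; i++) {
        encoder.draw_indexed_indirect(buffer, offset);
        offset += stride;
    }
}

// Build a buffer with live_draws real commands followed by zeroed (no-op) slots.
std::vector<uint8_t> make_indirect_buffer(uint32_t max_draws, uint32_t live_draws) {
    std::vector<uint8_t> buffer(sizeof(DrawIndexedIndirectCommand) * max_draws, 0);
    for(uint32_t i = 0; i < live_draws && i < max_draws; i++) {
        DrawIndexedIndirectCommand command = { 36, 100 + i, 0, 0, 0 };
        std::memcpy(buffer.data() + sizeof(command) * i, &command, sizeof(command));
    }
    return buffer;
}
```

Both paths end up drawing the same commands. What differs is the number of calls the CPU and driver have to process per frame: one for the native path versus one per slot for the emulation.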

The Metal API supports indirect command buffers. Our test shows that they are slower than a loop of simple indirect draw commands, which makes them practically useless. Or maybe only the next Apple hardware will benefit from them.

multi-indirect.png


indirect-command-buffer.png
 
Thanks for the elaboration! I think I’ve seen your posts in the Metal support forums, it indeed sounds like a complex issue. I have used indirect rendering pipelines with great success so far, so your experience leaves me a bit puzzled. There is no doubt that Apple can improve the APIs as well as their support procedures.
 
Yep, that issue is on the Metal support forum without any progress.

Single indirect calls work great. The trouble appears when the whole scene is rendered indirectly.

We also hope that API development will move toward GPU-driven rendering and better API flexibility. Unfortunately, all modern graphics APIs have focused on parallel CPU job submission instead of GPU-driven architectures. And now nothing is interesting except ray tracing :) which is great, except for the totally unconfigurable API.
 
Ok, the technical details are over my head, but what I am hearing is that your test is indeed worthless for macOS.

There are many other graphics tests that don't expose the 'bug' (or whatever the problem is between your program and Metal) so the take-away here is that we might as well use any number of the graphics tests for Macs that actually do produce correct and comparable scores. :)
 
The purpose of this benchmark is to showcase the problem macOS has with heavy applications. I don't think it's normal to have access to only 30% of the hardware's capability. But it's definitely good for demonstrating M1 performance. :)

Having its own proprietary graphics API is great for Apple. But it should at least be no worse than its counterparts.
 
To be frank, I don’t think you have demonstrated that this is a problem with the API. All we can conclude is that you were not able to get it working. The fault could lie with Apple, but there could also be a problem with your implementation. If I recall correctly, the Apple engineer in charge of your ticket stated that you did not provide enough information for them to replicate and analyze the issue. I hope you can consider publishing the relevant portions of the code, so that others can have a look and try to replicate the problems you have encountered.
 
We started doing that 7 months ago. Nothing is happening. Right now there is no way to increase DIP (draw call) performance on Metal.

The problem is obvious. The application spends 99% of its time in the rendering loop:

C++:
// macOS Metal rendering loop code:
for(uint32_t i = 0; i < num_draws; i++) {
    [encoder drawIndexedPrimitives:primitive_type indexType:index_type indexBuffer:index_buffer indexBufferOffset:index_offset indirectBuffer:indirect_buffer indirectBufferOffset:offset];
    offset += stride;
}

// Vulkan rendering code:
vkCmdDrawIndexedIndirectCount(command, indirect_buffer, indirect_offset, buffer, offset, num_draws, stride);

// Direct3D11 AMD rendering code:
agsDriverExtensionsDX11_MultiDrawIndexedInstancedIndirect(context->content, command, num_draws, indirect_buffer, indirect_offset, stride);

// Direct3D11 Nvidia and Intel rendering loop code:
for(uint32_t i = 0; i < num_draws; i++) {
    command->DrawIndexedInstancedIndirect(indirect_buffer, offset);
    offset += stride;
}

// Direct3D12 rendering code:
command->ExecuteIndirect(signature, num_draws, indirect_buffer, indirect_offset, buffer, offset);

// OpenGL rendering code:
glMultiDrawElementsIndirectCount(draw_mode, index_type, indirect_offset, offset, num_draws, stride);

// OpenGLES rendering loop code:
for(uint32_t i = 0; i < num_draws; i++) {
    glDrawElementsIndirect(draw_mode, index_type, offset);
    offset += stride;
}

A Metal ICB is not faster than the loop of draw-indirect commands.
The Direct3D11 rendering loop on Nvidia runs faster than the Vulkan and Direct3D12 paths on non-RTX GPUs.
Nobody loses 60% of performance here except Metal.
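For reference, the loss figure can be checked against the frame rates posted earlier in the thread (Metal 34 FPS vs D3D12 80 FPS on the same hardware, and M1 13 FPS vs AMD APU 16 FPS). The helper names below are made up for illustration. This is only arithmetic on the posted numbers, not a new measurement.

```cpp
#include <cassert>
#include <cstdio>

// How many times slower `fps` is than `reference_fps`.
double slowdown(double fps, double reference_fps) {
    assert(fps > 0.0 && reference_fps > 0.0);
    return reference_fps / fps;
}

// Fraction of the reference performance that is lost (0.0 = none, 1.0 = all).
double performance_loss(double fps, double reference_fps) {
    assert(fps > 0.0 && reference_fps > 0.0);
    return 1.0 - fps / reference_fps;
}

// Print both figures for one pair of results.
void print_comparison(const char *name, double fps, const char *reference_name, double reference_fps) {
    std::printf("%s vs %s: %.2fx slower, %.1f%% lost\n", name, reference_name,
                slowdown(fps, reference_fps), performance_loss(fps, reference_fps) * 100.0);
}
```

Calling `print_comparison("Metal", 34.0, "D3D12", 80.0)` gives roughly a 2.35x slowdown, or about 57.5% lost, which is in line with the 60% figure. The "3 times slower" and "30%" numbers quoted elsewhere in the thread would need a somewhat larger gap than this pair shows, and the thread does not say which measurements they refer to.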
 
It looks like you have updated your site with some RT testing. Are there plans to incorporate this into GravityMark? It would make for the only cross-platform RT test available.
 
Yes, the RT version of GravityMark will be released this year.
 
Meanwhile, we have released the GravityMark v1.31 update for macOS Monterey.
There is a 1.5x performance boost on M1 and almost 2x better speed on AMD.
The benchmark may crash on earlier macOS versions.

 
Does it mean your instancing issues were fixed? Was that a bug in Metal drivers or did you change the code?
 
Yes, there is good progress with the driver. The same code started working with the new OS version.
 
The iOS version is available in the App Store:

 
Is there a way to re-run the benchmark without having to quit the app?
 
Demo mode is an infinite loop that does not calculate a score. Devices get hot after a single benchmark loop, and results drop dramatically due to thermal throttling. We can add multiple benchmark iterations, but that would make no sense in terms of benchmark results.
 