One of the big advantages of Metal is the ability to use multiple CPU cores for graphics tasks (submitting commands from multiple threads), something that is harder to achieve in OpenGL. I suppose GFXBench doesn't especially take advantage of that.
In a session about Vulkan at SIGGRAPH, they showed huge improvements over OpenGL, but they used demos with thousands of objects, lots of particles, and optimisation for multiple cores. Quite different from the current GFXBench.
Parallelising the dispatch of calls to the 3D API can definitely be a win, especially if you spend a lot of CPU time issuing calls, and under GL that ranged from impractical to impossible. However, as you note, your engine has to be built for it, and Metal's optimal way of parallelising isn't quite the same as DX12/Vulkan's, which makes things a bit trickier again.
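To make the "built for it" point concrete, here's a rough Python sketch of the general pattern: each worker records commands for its own slice of the scene into a private list, and the lists are then submitted in a fixed order. Everything here (`record_chunk`, `record_frame`, the tuple-based "commands") is made up for illustration and is not the Metal, DX12 or Vulkan API. Python threads also won't give a real speedup for this kind of work, so treat it as a picture of the structure only.

```python
from concurrent.futures import ThreadPoolExecutor


def record_chunk(objects):
    """Record draw commands for one slice of the scene into a private list."""
    commands = []
    for obj in objects:
        # Hypothetical command tuples; a real API would encode GPU commands here.
        commands.append(("set_pipeline", obj["pipeline"]))
        commands.append(("draw", obj["name"]))
    return commands


def record_frame(scene, workers=4):
    """Split the scene into contiguous chunks and record them on worker threads."""
    chunk_size = max(1, (len(scene) + workers - 1) // workers)
    chunks = [scene[i:i + chunk_size] for i in range(0, len(scene), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so the final order is deterministic
        # no matter which thread finishes first.
        recorded = list(pool.map(record_chunk, chunks))
    # Single-threaded "submit": stitch the per-thread lists together in order.
    return [cmd for chunk in recorded for cmd in chunk]


scene = [
    {"name": f"object{i}", "pipeline": "opaque" if i % 2 == 0 else "particles"}
    for i in range(8)
]
frame = record_frame(scene)
```

The key property is that no two threads ever touch the same command list, so recording needs no locking; only the final ordering step is serial.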
They also said that Metal cannot reach the performance that DX12 and Vulkan can because it retains the OpenGL binding model (I don't know what that means specifically), but that this also makes Metal simpler to use.
So in Metal, you bind resources to numbered slots as you need them (i.e. texture slots 0-n, buffer slots 0-n), but in DX12 & Vulkan the set of resources that may be referenced by a particular pipeline state is itself a state object that can be reused. This eliminates even more validation but makes the API and its usage more complex.
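If it helps, here's a toy Python model of the two binding styles. It is only a sketch: the class names are invented, and the assumption that a slot-based driver re-validates every bound resource on each draw is a deliberate simplification, not a description of what any real driver does.

```python
class SlotBinder:
    """Slot model: resources are bound one at a time to numbered slots."""

    def __init__(self):
        self.textures = {}
        self.buffers = {}
        self.validations = 0

    def set_texture(self, slot, texture):
        self.textures[slot] = texture

    def set_buffer(self, slot, buffer):
        self.buffers[slot] = buffer

    def draw(self):
        # Assumption: whatever is currently bound gets re-checked on every draw.
        self.validations += len(self.textures) + len(self.buffers)


class ResourceSet:
    """Set model: the whole group of resources is one reusable object."""

    def __init__(self, textures, buffers):
        self.textures = dict(textures)
        self.buffers = dict(buffers)
        # Validated once, when the set is created.
        self.validations = len(self.textures) + len(self.buffers)


class SetBinder:
    def __init__(self):
        self.current = None

    def bind_set(self, resource_set):
        # Binding swaps a single reference; nothing is re-validated.
        self.current = resource_set

    def draw(self):
        pass  # the set was already checked at creation time


slots = SlotBinder()
slots.set_texture(0, "albedo")
slots.set_texture(1, "normal")
slots.set_buffer(0, "uniforms")
for _ in range(100):
    slots.draw()

material = ResourceSet({0: "albedo", 1: "normal"}, {0: "uniforms"})
sets = SetBinder()
sets.bind_set(material)
for _ in range(100):
    sets.draw()
```

In this toy model the slot path does 300 checks for 100 draws while the set path does 3, paid once up front. The cost is the extra object you have to create and manage, which is the added complexity mentioned above.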