Mark, do you view the Metal tessellation pipeline as tailored to AX GPUs? I'm wondering, since I'm not aware of different tessellation pipelines in Vulkan that would depend on desktop vs mobile GPUs.
Apple merely justified their approach with a "clean sheet" argument, and implied they did it for the sake of simplicity (not simplicity of porting from DX, obviously).
Exposing the DX hull/geometry/domain shader stages requires a lot of shader compilation work to happen either at draw call time (DX11/GL) or at pipeline construction time (Vulkan/DX12) - it's one or the other, and which one depends on the API. All the vendors have to move code between the different stages to suit their hardware, to make it *appear* that these stages are independent when really they aren't.
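Metal ends up baking the equivalent configuration in at pipeline construction time too - the post-tessellation ("domain shader") setup is fixed when you build the pipeline state. A rough Swift sketch of that, just for illustration (the shader function names are placeholders, not anything from a real project):

```swift
import Metal

// Hypothetical sketch: building a tessellation-enabled render pipeline in Metal.
// "postTessVertex" / "patchFragment" are placeholder function names.
func makeTessellationPipeline(device: MTLDevice, library: MTLLibrary) throws -> MTLRenderPipelineState {
    let desc = MTLRenderPipelineDescriptor()
    desc.vertexFunction   = library.makeFunction(name: "postTessVertex")   // the "domain shader" equivalent
    desc.fragmentFunction = library.makeFunction(name: "patchFragment")
    desc.colorAttachments[0].pixelFormat = .bgra8Unorm

    // All of the tessellator configuration is fixed at pipeline construction time.
    desc.tessellationPartitionMode         = .fractionalEven
    desc.maxTessellationFactor             = 16
    desc.tessellationFactorFormat          = .half
    desc.tessellationFactorStepFunction    = .perPatch
    desc.tessellationControlPointIndexType = .none
    desc.tessellationOutputWindingOrder    = .clockwise

    return try device.makeRenderPipelineState(descriptor: desc)
}
```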
AMD, Intel and all mobile GPUs I am aware of internally require the first stage of tessellation to be handled separately - the vertex/hull stage that Apple require you to implement as compute - with the data exported to VRAM for later consumption by the subsequent stages. Nvidia's hardware is a little bit cleverer, but I don't really understand the whys and wherefores.
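To make the Metal side of that concrete: the vertex/hull work becomes an ordinary compute dispatch that writes the tessellation factors (and any per-patch data) into buffers the app allocates itself. A minimal host-side sketch in Swift - the kernel name and threadgroup sizing are made up for illustration:

```swift
import Metal

// Hypothetical sketch: the compute pre-pass that stands in for the vertex/hull stages.
// "computePatchFactors" is a placeholder kernel name; it is assumed to write one
// MTLTriangleTessellationFactorsHalf per patch into the factor buffer.
func encodeTessellationPrePass(device: MTLDevice,
                               library: MTLLibrary,
                               commandBuffer: MTLCommandBuffer,
                               patchCount: Int) throws -> MTLBuffer {
    // The intermediate data lives in a buffer the app has to size and own itself.
    let factorBuffer = device.makeBuffer(
        length: patchCount * MemoryLayout<MTLTriangleTessellationFactorsHalf>.stride,
        options: .storageModePrivate)!

    let kernel = library.makeFunction(name: "computePatchFactors")!
    let pipeline = try device.makeComputePipelineState(function: kernel)

    let compute = commandBuffer.makeComputeCommandEncoder()!
    compute.setComputePipelineState(pipeline)
    compute.setBuffer(factorBuffer, offset: 0, index: 0)
    compute.dispatchThreadgroups(MTLSize(width: (patchCount + 63) / 64, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
    compute.endEncoding()
    return factorBuffer
}
```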
In other APIs, AMD and Intel can break a single draw call into small chunks, using a small amount of VRAM/cache to do this vertex/hull work and then switching to consuming it in the "domain" shader stage. This means the developer doesn't have to care about providing scratch buffers for the intermediate data that are large enough for the whole draw call. It *appears* to behave just like the Nvidia card, only a bit slower.
Mobile GPUs usually *can’t* do that, as they process ALL geometry before processing any pixels. So they need to store the intermediate data for every draw call for the duration of the render pass. That consumes a lot of memory, but Khronos and Microsoft want consistency and continuity of the API for developers - so the hardware vendors either fix the hardware, swallow the memory cost or don’t implement tessellation.
Apple didn’t want to do this, as they design Metal with a philosophy of exposing all the costs and problems to developers; sometimes this puts them at odds with industry standards and norms. It suits Apple’s hardware and it functions adequately on AMD. The problem is that it was a decision that didn’t reflect the potential future, and I expressed that at the time. It closes off an avenue of rendering development: fully GPU-driven rendering, where the GPU alone decides what to render and encodes all the necessary draw commands. Apple’s approach needs the CPU to know every detail of what’s going on in order to allocate temp buffers, and to issue a separate dispatch and draw call for what is a single draw call on every other API.
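To illustrate that last point: in Metal the draw half is a separate render encoder that consumes the buffer the compute pass filled, so the CPU has to know the patch count and hand over the factor buffer itself rather than issuing one self-contained draw. A rough sketch continuing the placeholders above:

```swift
import Metal

// Hypothetical sketch: the render half of the same "single" tessellated draw.
// renderPassDesc / pipelineState / controlPoints / factorBuffer are assumed to
// come from the setup and compute pre-pass sketched earlier.
func encodeTessellatedDraw(commandBuffer: MTLCommandBuffer,
                           renderPassDesc: MTLRenderPassDescriptor,
                           pipelineState: MTLRenderPipelineState,
                           controlPoints: MTLBuffer,
                           factorBuffer: MTLBuffer,
                           patchCount: Int) {
    let render = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDesc)!
    render.setRenderPipelineState(pipelineState)
    render.setVertexBuffer(controlPoints, offset: 0, index: 0)

    // The CPU hands over the factors the compute pass wrote earlier...
    render.setTessellationFactorBuffer(factorBuffer, offset: 0, instanceStride: 0)

    // ...and issues the patch draw as a second, separate command.
    render.drawPatches(numberOfPatchControlPoints: 3,
                       patchStart: 0,
                       patchCount: patchCount,
                       patchIndexBuffer: nil,
                       patchIndexBufferOffset: 0,
                       instanceCount: 1,
                       baseInstance: 0)
    render.endEncoding()
}
```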