GPU benchmarks should use the native API for each GPU. Would you consider the results of an OpenCL-based benchmark for the Apple GPU or an Nvidia GPU?
Fair warning: we're going deep into the weeds here.
An API choice and a compiler choice are not synonymous; this is apples and oranges. A program is written in the API it is written in, and that's that. If a program is written in OpenCL, then that's what it's written in. If it's written in CUDA, that's what it's written in. If it's written in Metal, that's what it's written in. Some programs do support multiple APIs, of course, but even that doesn't alleviate all the issues (see the list below). In contrast, I can take SPEC and compile it with any C++/Fortran compiler that supports my architecture.
To answer your question: yes, I would definitely accept an OpenCL test on an Nvidia or AMD GPU. If I'm trying to make general claims about hardware performance, then I wouldn't necessarily accept one on an Apple GPU, because the API is deprecated there (I'm told there technically isn't even a driver; it's a compatibility layer, like MoltenVK). In contrast, Nvidia and AMD both still actively support it. However! If your program is written in OpenCL and you want to gauge how it will run on an Apple GPU, then yeah ... the OpenCL GPU benchmark is probably more relevant than a Metal one.
And to turn your question around: what exactly would be the "native compute API" for an AMD GPU? It isn't OpenCL or the Vulkan/DirectX shading languages; those are cross-hardware. It isn't Metal, though an AMD GPU will run Metal on macOS. It obviously isn't CUDA. AMD has tried at various points to make a native language, but their current compute solution is in practice supported by almost none of their own GPUs, including the newest ones (which is extremely frustrating). Just for kicks, Intel will be adding oneAPI, based on SYCL I believe, and it is also open and cross-hardware.
So yes, this absolutely means that for a GPU you are never actually testing "pure hardware capability". You are testing:
1) the properties of the graphics/compute engine
a) if the application supports multiple APIs, the optimization and work that was put into each of those APIs for this application
b) any application optimizations that were put in for that particular GPU
2) the optimization of the API/drivers for the GPU (if it will run at all)
3) and finally, the capability of the GPU hardware itself
But getting back to what we're actually discussing: CPU benchmarks. You still have some of this, especially with low-level assembly code or architecture-specific code like vectorization, but it is overall less, because you don't have different APIs and you aren't interacting with drivers. Changing compilers for each processor, however, reintroduces additional variance: what you end up testing is the different compilers, not the hardware. For instance, ICC tends to be very aggressive with vectorization relative to standard clang/LLVM and GCC. So that's what you would actually be testing ... not the hardware. And compiling with Xcode just means compiling with clang, not some optimized version of LLVM written specifically for Apple hardware. That's why this:
"I hope that Anandtech uses the best compiler for each processor next time."
would in fact be a bad thing. The point of scientific testing is to control variables, not add more back in. For GPUs you don't have much choice: a program is written in the API(s) it's written in. For CPU benchmarks, you do have the choice not to reintroduce variables, so you should not do this. You want to compare the same code to the same code as much as possible. Across different architectures this is impossible in the strict sense, of course, since the machine code generated for each will necessarily differ, but at least you aren't adding different compilation strategies into the mix on top of everything else.
Make no mistake: what Intel did (well, sort of: they used ICC for AMD) and what you are suggesting is great for marketing slides, but it would make for poor benchmarking at Anandtech, which is meant to give readers an actual point of comparison for how fast their programs are likely to run. That is, after all, the point of review sites in the first place.
This link, while old, also explains it well:
www.linleygroup.com