So essentially you're saying: implement the extended precision in software rather than hardware. That's what Mathematica does for CPU calculations. You have a choice of machine-precision calculations, which are limited by the precision of the hardware, or arbitrary-precision calculations, which are done in software and can be extended to whatever precision you like. [There's also a third option—exact (infinite-precision) calculations—which isn't germane here.] However, the arbitrary-precision ("software") calculations are always much slower.
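For concreteness, here's a minimal Wolfram Language sketch of the distinction. The timing comparison at the end uses an arbitrary million-element workload of my own choosing, so treat the numbers as illustrative only; the point is just that the 30-digit software arithmetic runs far slower than the hardware doubles.

(* Machine precision: hardware doubles, ~16 significant digits *)
N[Pi]                (* 3.14159... *)
MachinePrecision     (* 15.9546 *)

(* Arbitrary precision: software arithmetic, here 50 significant digits *)
N[Pi, 50]

(* Exact (infinite-precision) arithmetic, mentioned above but not relevant here *)
1/3 + 1/7            (* 10/21, no rounding at all *)

(* Rough timing comparison on an arbitrary example workload *)
data = Range[10^6];
AbsoluteTiming[Total[Sqrt[N[data]]];]        (* machine precision, in hardware *)
AbsoluteTiming[Total[Sqrt[N[data, 30]]];]    (* 30-digit precision, in software; much slower *)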
But I see your point—doing extended precision efficiently in hardware requires a costly transistor budget, which wouldn't make sense for Apple's GPUs given their most likely use cases.
Here's a nice (though somewhat outdated) table showing the throughput penalty for FP64 (relative to FP32) in various NVIDIA GPUs; they're nearly all either 1/2X or 1/32X (though the 1/32X may in part be deliberate nerfing by NVIDIA to push data centers toward their more expensive commercial products):
(Source: the GPU installment of the "Hardware for Deep Learning" series at blog.inten.to)