Are there any technical, commercial or marketing hurdles preventing Apple from achieving the performance of Nvidia GPUs?
Good question! At a fundamental level, an Nvidia SM is fairly similar to an Apple GPU core. Both feature four execution partitions, each executing a 32-wide group of threads on 32 ALUs per partition, even though the execution model itself is subtly different.
There are two main reasons why Nvidia GPUs are faster. First, Nvidia targets higher power consumption levels and higher clocks: the M2 family GPU runs at a very conservative 1.4 GHz, while Nvidia runs its GPUs at over 2 GHz. Second, Nvidia GPUs have more SMs than Apple has GPU cores. A 4090 has a whopping 128 SMs, which means 512 32-wide execution units (partitions). An M2 Ultra (if it ever arrives) would instead have 76 cores, or 304 32-wide execution units. And that's just the difference in execution units, without taking into account the massive difference in clock.
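To put rough numbers on that, here's a back-of-the-envelope peak FP32 sketch, assuming one FMA (two ops) per ALU per cycle and the unit counts above; the ~2.52 GHz figure is the 4090's advertised boost clock:

```python
# Rough peak FP32 throughput: partitions * 32 lanes * 2 ops (FMA) * clock.
def peak_tflops(partitions: int, clock_ghz: float, lanes: int = 32) -> float:
    return partitions * lanes * 2 * clock_ghz / 1000  # GFLOPS -> TFLOPS

rtx_4090 = peak_tflops(512, 2.52)  # 128 SMs * 4 partitions, ~2.52 GHz boost
m2_ultra = peak_tflops(304, 1.4)   # 76 cores * 4 partitions, 1.4 GHz

print(f"RTX 4090: ~{rtx_4090:.0f} TFLOPS")  # ~83 TFLOPS
print(f"M2 Ultra: ~{m2_ultra:.0f} TFLOPS")  # ~27 TFLOPS
```

So even with the same basic building block, the combination of unit count and clock puts the 4090 roughly 3x ahead on paper.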
If Apple builds a larger GPU and clocks it higher, there is no reason why it shouldn't be able to match Nvidia. This is of course where the question of money comes into play. Apple's tech is substantially more expensive, as they build SoCs and rely on wide, energy-efficient RAM interfaces. Nvidia also has an advantage in market positioning, as they can afford to make moves that Apple can't. For example, the 4090 with its massive number of shader units only has a RAM bandwidth of 1 TB/s, which is not that far off the M1 Ultra's 800 GB/s. The bandwidth per SM is pretty much terrible. Now, that's OK for games, since they often have more predictable cache behaviour (and to be frank, that GPU is overkill anyway, so hardware utilisation will be poor, reducing both the power consumption and the bandwidth needs). But many professional workloads need more bandwidth. Funnily enough, the professional Nvidia GPU (Hopper) ships with lower clocks and nominally lower compute performance than the 4090 (at least according to the specs I've seen), but they give it much more RAM bandwidth via HBM. Of course, that GPU costs between $30k and $40k. Hardly the niche Apple should go after.
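To make the bandwidth-per-unit point concrete, here's my own back-of-the-envelope division (not an official metric; I'm assuming the full 64-core M1 Ultra and the 4090's 1008 GB/s spec):

```python
# Bandwidth available per execution unit, using the figures quoted above.
rtx_4090_bw = 1008 / 128  # ~1 TB/s spread over 128 SMs
m1_ultra_bw = 800 / 64    # 800 GB/s spread over 64 GPU cores

print(f"RTX 4090: ~{rtx_4090_bw:.1f} GB/s per SM")    # ~7.9 GB/s per SM
print(f"M1 Ultra: ~{m1_ultra_bw:.1f} GB/s per core")  # 12.5 GB/s per core
```

The 4090 has roughly 3x the compute per unit of bandwidth, which is exactly why it leans so heavily on its caches.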
What could be Apple's next step? Hardware-based ray tracing? Hardware-based path tracing?
I'm sure we will see hardware RT with hardware ray compacting in the next generation of Apple GPUs; the relevant patents have been published since late 2022. As to the rest, it depends on what level of performance Apple is pursuing. On laptops, they are already in a fairly good spot. On desktop, they need bigger, or at least faster, GPUs. If I were them, I would pursue a) bigger GPUs across the board, with b) a higher dynamic clock range so that they can be clocked faster on the desktop. But then again, I am a hobbyist posting nonsense on forums, while Apple employs actual industry veterans.
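P.S. In case "ray compacting" sounds abstract: the idea is that after rays diverge (some hit, some miss, some terminate), you regroup the live ones into dense SIMD groups so no lanes sit idle during the next stage. A toy software illustration of the concept, not a claim about how Apple's hardware actually implements it:

```python
# Toy illustration of ray compaction. After intersection testing, a 32-wide
# group may contain a mix of live and dead rays. Shading the group as-is
# wastes the dead lanes; compaction repacks live rays into full groups.
def compact(rays: list, alive: list) -> list:
    """Return only the rays that still need work, packed densely."""
    return [r for r, a in zip(rays, alive) if a]

rays  = list(range(8))  # pretend these are rays in an 8-wide SIMD group
alive = [True, False, True, True, False, False, True, False]

packed = compact(rays, alive)
print(packed)  # [0, 2, 3, 6] -> half the lanes, one dense group instead
```

Doing this in hardware means the scheduler keeps utilisation high on divergent ray workloads without the driver or app having to sort rays itself.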