Thank you for this explanation! The claim that unified memory decreases rather than increases total memory needs finally makes sense to me.
Welcome!
I may be wrong, but I thought programs still need to copy data from CPU memory to GPU memory even though they sit side by side, because the GPU can't read CPU memory. Am I wrong?
Yeah, as @leman said there are some caveats, but you can share data between them without copies. However, some PC iGPUs did (do?) in fact work exactly as you state here: they technically share a memory system, but that memory was hard-partitioned between the CPU and GPU such that for, say, a 16GB system each side only has access to 8GB and data has to be copied to be shared. That said, copies between the partitions would be relatively fast; you'd still have to do one, but at least it wasn't going over PCIe. A further problem with these systems was that the memory often sat behind a standard DDR/LPDDR memory controller, meaning it was bandwidth-limited, though the iGPU was often so small and ineffectual that this wasn't always the bottleneck.

Consoles (PlayStation/Xbox with AMD-based SoCs), however, I believe operate like Apple's SoCs, as do other mobile phone systems. Intel's plan for Lunar Lake (their upcoming mobile SoC later this year) is to have on-package RAM like Apple, but I don't think their iGPU plans are overall as ambitious as Apple's yet - i.e. not scaling it out to Max-size.
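As a concrete illustration of the "no copy needed" point, here's a minimal sketch assuming Metal on Apple Silicon (the buffer size and contents are just placeholders): the CPU fills a shared buffer that the GPU can read directly, because both sides see the same physical memory.

```swift
import Metal

// Minimal sketch (Apple Silicon / Metal assumed): with unified memory the CPU and
// GPU share the same physical RAM, so a .storageModeShared buffer is visible to
// both without an explicit CPU -> GPU copy.
guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

let count = 1024  // placeholder size
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// The CPU writes directly into the buffer...
let ptr = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { ptr[i] = Float(i) }

// ...and a compute pass could now bind `buffer` as a kernel argument and read the
// same bytes, with no blit/copy step in between.
```

On the hard-partitioned iGPU systems described above, that last step would instead require copying the data into the GPU's half of the RAM first.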
Also: TFLOPs are only comparable within a system architecture. For example 10 TFLOPs on an nVidia RTX 40x0 are not directly comparable to 10 TFLOPs on an AMD RX 7x00. It has to do with the efficiency of a given architecture.
To explain a bit: a Trillion Floating point OPerations per second depends on what those operations actually are… CAVEAT: I am making the next part up as I do not know the actual efficiency per architecture. The numbers I use aren’t even a “best guess”.
Calculating a game scene with an AMD GPU may require a million operations. That same scene might require only eight hundred thousand operations on an nVidia GPU. Thus if you see 10 TFLOPs of AMD vs 9 TFLOPs of nVidia, on the surface the AMD card seems more powerful. In this case, however, the nVidia card wins: 10 million "scenes" per second vs 11.25 million (it's a stupid example but I wanted to show the maths with potential TFLOPs values). Those who have more than my superficial knowledge of TFLOPs, feel free to correct my understanding…
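To make the arithmetic explicit, here's a tiny sketch using the same made-up numbers (they're illustrative only, not real per-architecture efficiency figures):

```swift
// Effective throughput = raw FLOP/s divided by operations needed per scene.
// All numbers are the made-up illustrative values from the post above.
let amdFLOPs = 10.0e12          // 10 TFLOPs
let nvidiaFLOPs = 9.0e12        // 9 TFLOPs
let amdOpsPerScene = 1_000_000.0     // hypothetical
let nvidiaOpsPerScene = 800_000.0    // hypothetical

let amdScenesPerSec = amdFLOPs / amdOpsPerScene           // 10,000,000
let nvidiaScenesPerSec = nvidiaFLOPs / nvidiaOpsPerScene  // 11,250,000

print("AMD: \(amdScenesPerSec) scenes/s vs nVidia: \(nvidiaScenesPerSec) scenes/s")
```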
It's even worse than that. As I was trying to get at with my FP16 discussion, and as @casperes1996 and @leman expounded on further, things get complicated. They start off really simple though: to get the theoretical TFLOPs of a GPU, you can technically just sum up the number of FP32 units (execution units per core x number of cores), multiply by 2 because they can do a fused multiply-add operation, multiply by the clock speed, and voila. That's it. Pretty simple metric.

How that actually translates to performance, though, outside of simple FP calculations that don't interact much with the memory system or fixed-function hardware or other Int/FP16/tensor core/ray tracing core/whatever crazy **** they've added, gets complicated. For instance, some architectures perform FP16 operations using a dedicated FP16 unit (Apple), some use their FP32 units (Nvidia), and some use a single FP32 unit but can perform two FP16 calculations simultaneously (AMD/ARM) if there are two available within a thread (instruction-level parallelism - ILP - within a single GPU thread, just like how Apple's M1 and M2 CPU cores can potentially do 8 instructions simultaneously within a thread). Some can likewise perform two FP32 operations per unit (AMD and Nvidia ... I think), but again, if there isn't such ILP to be found in the code, their practical TFLOPs gets halved. Different architectures make different tradeoffs depending on their energy targets and simply what they think the code run on them is going to look like.

And all of that's before you get into the hardware to turn these calculations into pixels (Apple doesn't even have ROPs) or more esoteric things like tensor cores and ray tracing cores. Nor have I dug into how that theoretical TFLOPs number interacts with the cache system and memory hierarchy, which can be absolutely critical for performance. EDIT: And I didn't even mention differing APIs or drivers, where a lot of the "secret sauce" for performance hides out (or the fact that games, especially AAA ones with custom engines, deliberately break graphics APIs for performance, and day-0 driver updates are meant to account for this). For some types of pure compute workloads, despite everything I just said, theoretical TFLOPs can actually be a shockingly okay metric if two architectures are sufficiently similar, but yeah ...
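Putting that "simple metric" into code form, here's a minimal sketch; the core count, units per core, and clock speed below are placeholder values I picked for illustration, not the specs of any particular GPU:

```swift
// Theoretical TFLOPs = total FP32 units x 2 (a fused multiply-add counts as 2 ops) x clock.
// All inputs are placeholder values for illustration only.
func theoreticalTFLOPs(cores: Int, fp32UnitsPerCore: Int, clockGHz: Double) -> Double {
    let totalFP32Units = Double(cores * fp32UnitsPerCore)
    let flopsPerSecond = totalFP32Units * 2.0 * clockGHz * 1e9
    return flopsPerSecond / 1e12
}

// Hypothetical GPU: 32 cores, 128 FP32 units per core, running at 1.3 GHz
print(theoreticalTFLOPs(cores: 32, fp32UnitsPerCore: 128, clockGHz: 1.3))  // ~10.6 TFLOPs
```

Everything after that in the post - FP16 handling, ILP, fixed-function hardware, caches, drivers - is about why that single number under- or over-states real performance.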
Interesting! Which ray tracing tech does Apple end up using then? And how good/efficient/etc is it?
When Apple and their long-time GPU supplier Imagination Technologies (ImgTech) split, it was a little acrimonious, with ImgTech suing Apple because Apple wanted to stop paying ImgTech royalties and ImgTech alleging that Apple was still fundamentally using ImgTech's patented TBDR system and related GPU tech. Previously, under financial strain, ImgTech had wanted Apple to buy them, but Apple had been hiring away a lot of their people (and engineers from Nvidia/AMD/Intel) to form their own GPU division and decided against it (also because ImgTech has/had a few other customers despite Apple being far and away their biggest, regulators might've blocked it, you never know).

Then, suddenly, a few years ago they not only buried the hatchet but did a brand new licensing deal, and rumors were that ImgTech had developed a new ray tracing system that Apple wanted. Sure enough, patents confirmed ImgTech had such a thing. A couple of years later, Apple debuted the M3 and A17 Pro with ray tracing cores. While I don't know if anyone has absolutely confirmed they are the same system (I think the terms of the deal were secret), that probably wouldn't be hard to do, as you could match Apple's developer video against ImgTech's patents; just given the timing of everything, it seems really likely.
ImgTech makes claims about how wonderful their ray tracing cores are, but I'm not an expert here, and in very broad strokes the approach seems similar to Nvidia's system. The devil is often in the details though. Performance-wise, based on things like Blender, their ray tracing ability seems good, very comparable. Efficiency is where I've seen ImgTech make its claims, but I don't know of any direct perf/W measurements comparing Apple and Nvidia ray tracing (ARM and Qualcomm have their own systems too - again, I believe they're all broadly similar). No doubt, like Nvidia, there will be further iterations and improvements over time.