In terms of bandwidth per GPU ALU, the difference between the A6000 and M1 is practically negligible. And the M1 has a larger cache to offset the rest.
Which ALUs are we counting? The CUDA "cores", or all of them (including the Tensor and RT cores)? In the latter case, Apple can't compete with either of those very well with just a cache "offset". In the former, counting only a subset of the cores sitting on the bandwidth pool is a bit of misdirection: the P and E cores, NPU, and AMX are all on that same memory bandwidth pipe. In a typical balanced app workload the Apple GPU doesn't get the whole bandwidth, and nowhere near the whole L3 cache.
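The bandwidth-per-ALU claim can be sanity-checked with a quick back-of-envelope calculation. This is a rough sketch using the commonly published spec figures (768 GB/s and 10,752 FP32 CUDA cores for the RTX A6000; 68.25 GB/s and 8 GPU cores x 128 ALUs for the base M1), not measured numbers:

```python
# Back-of-envelope: nominal memory bandwidth per FP32 ALU.
# Spec-sheet figures, not measured throughput.
a6000_bw_gbs = 768.0      # RTX A6000: GDDR6, ~768 GB/s
a6000_alus = 10752        # 10,752 CUDA cores (FP32 lanes)

m1_bw_gbs = 68.25         # base M1: LPDDR4X, ~68.25 GB/s
m1_alus = 8 * 128         # 8 GPU cores x 128 ALUs = 1,024

a6000_per_alu = a6000_bw_gbs / a6000_alus   # ~0.071 GB/s per ALU
m1_per_alu = m1_bw_gbs / m1_alus            # ~0.067 GB/s per ALU

print(f"A6000: {a6000_per_alu * 1000:.1f} MB/s per ALU")
print(f"M1:    {m1_per_alu * 1000:.1f} MB/s per ALU")
```

On these numbers the two land within roughly 7% of each other, which is the sense in which "practically negligible" holds for the CUDA-core subset; it says nothing about the Tensor/RT cores, and the M1 figure assumes the GPU gets the whole pipe to itself.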
Of course they will be aiming for those levels - and beyond. Do you really think they will want to release a Mac Pro slower than Intel machines? You are perfectly correct with regards to bandwidth, but you are neglecting the effect the cache has on the total result.
Are they? On the Intel systems Apple isn't covering the Nvidia performance zone; Apple skipped those performance options. Where are the Threadripper options versus Intel in late 2022? It is cheaper for Apple to do small updates to their 2018-2019 motherboard, and that doesn't expose the thread limitations of macOS. It is not the best x86_64 baseline to build off of if money is no object and only the topmost performance option is selected.
If selecting what is convenient is the current modus operandi... why is that going to change on Apple silicon?
Pretty good chance Apple is going to dump ECC along the way in this transition also.
FP64 on the GPU? Apple hasn't even "heard" of that (largely because they don't need it on iOS devices).
As for peak bandwidth...
The M1 Mini drives fewer monitors than the Intel ones do.
The top-end memory capacity is lower.
Apple has backslid on a couple of the non-laptop systems they have delivered.
More than likely Apple is going to "move the goal posts" on the half-sized Mac Pro they do, to narrow things down to just what they want. Worst case, throw it out there against an iMac Pro so they can slap around some 4-5 year old hardware.
And by the way, Nvidia's next supercomputer platform utilizes an iGPU. They have publicly acknowledged that they have reached the wall with dGPU designs. Modern compute and ML workloads require a high level of horizontal interaction, so just throwing more and more high-latency RAM bandwidth is not working anymore.
That isn't really true. Nvidia's Grace CPU doesn't have a GPU.
"... The fourth-generation NVIDIA® NVLink® delivers 900 gigabytes per second (GB/s) of bidirectional bandwidth between the NVIDIA Grace CPU and NVIDIA GPUs. ..."
Source: "A CPU Purpose-Built for Servers Running Terabytes of Data" (www.nvidia.com)
Some Nvidia GPUs with NVLink can already talk to IBM Power CPUs. Those weren't iGPUs when that rolled out 3-4 years ago, and it isn't really going to be the case when these Grace CPUs roll out in the future either. Primarily, Nvidia is trying to get back what they had and lost when they dumped IBM Power to go with AMD EPYC on their last generation.
There is little indication there that the GPUs don't have their own memory. It can get addressed in some contexts as part of a shared memory space, but it is a stretch to label that an iGPU; "iGPU" implies working out of the same working store. It is quite doubtful that Nvidia is moving their next-gen HPC GPUs to LPDDR5X.
"... Grace CPU is the first server CPU to harness LPDDR5x memory with server-class reliability through mechanisms like error-correcting code (ECC) to meet the demands of the data center while delivering 2X the memory bandwidth and up to 10X better energy efficiency compared to today’s server memory. ..."
There is a difference between sharing some limited memory and sharing all memory and cache. AMD (via Infinity Fabric), Intel (via PCIe 5/6 and CXL), and Nvidia (via NVLink) aren't trying to share it all.
The "copy data" gap that Apple is leveraging will get narrowed, but the advantage for Apple isn't one-way either, especially for problems that require scalability across multiple nodes. Apple's approach doesn't scale across nodes, because it is focused on unification inside a single set of directly attached memory modules; "scale" is limited to intra-package communication. The fabrics presented by IF/CXL/NVLink let problems grow across nodes while still keeping good contact with nearest neighbors, something Apple won't be able to touch with a ten-foot pole.