
name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I may have been exaggerating a bit, because it impacted my use case so negatively. Let me explain where it occurs and the problem sizes it impacts. Imagine you have a molecular dynamics simulation, which is an n-body physics simulation (O(n^2)). You have 1000 particles, computing each pair of particles takes 90 instructions, the GPU runs 5.3 trillion instructions per second, and each time step is 4 femtoseconds. How many nanoseconds could be computed in one day?
But nothing in this explanation has anything to do with "Too much driver latency"...

You may well have a real problem, but you have neither described it nor explained why the supposed solutions for this (e.g. use multiple independent CommandEncoders, or enqueue successor kernels on the GPU side) are unsatisfactory.

Your ultimate issue seems to be something to do with shuttling computation between CPU and GPU, and again that's legitimate, no-one's saying you're lying or being dumb; but again you are simply assuming we know everything about the problem whereas in fact we know absolutely nothing.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
The answer to my math question, when compared to actual nanoseconds per day, shows how driver latency impacts performance. This is a real-world problem where driver latency caused an order of magnitude slowdown.
 

komuh

macrumors regular
May 13, 2023
126
113
No clue about ANE latency.

However one thing Apple could do with MPS is the same sort of thing as they did with Core Image. Even though that presents as functions you arrange into a graph, it's actually implemented (kinda) at the compiler level so that redundant work can be removed, instructions can be re-ordered, function call overhead eliminated, etc.

They have already gone down that path with ray tracing, and in principle they could do this for all of MPS, use the ideas from CI... It's an obvious next step once they clear their plates of all the higher priority work, but it might do a lot to help NN's, and since AI is the current darling, there is hope!
I feel like they tried that with MPSGraph, but MPSGraph is much slower, extremely buggy, and a lot less flexible. When we wrote our own models to compile on MPSGraph and compared them to MPS shaders + custom Metal kernels, MPSGraph was about 2-2.5x slower. It feels like that's because they are still building it and working on compiling it to an MLIR-like format, which is still not stable. (Also, the time to first compile for bigger MPSGraph models is extremely high, a few seconds per different input size.)
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
The answer to my math question, when compared to actual nanoseconds per day, shows how driver latency impacts performance. This is a real-world problem where driver latency caused an order of magnitude slowdown.
That’s not an explanation for why THIS particular granularity of CPU to GPU shuttling is required, it’s just a list of numbers.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
That’s not an explanation for why THIS particular granularity of CPU to GPU shuttling is required, it’s just a list of numbers.
The answer is 20,000 ns/day. Yet, OpenMM only achieved 1,000 ns/day. Why? Because the CPU had to wait for the previous timestep to complete - a 300 us latency bottleneck. Switching to a more computationally intensive approach* (which required more computations, but removed the dependency) improved speed to 4,000 ns/day. It's still latency bound (now by encoding latency), but much closer to the theoretical maximum.

> *The approach is: remove the nearest neighbor list, something that reduces the number of neighboring atoms you need to check. The list transforms an O(n^2) problem into an O(n) problem.

On any other architecture, this would not be an issue. For example, NV and AMD achieved 2,000-6,000 ns/day even with the CPU-GPU synchronization. It doesn't make the task impossible on M1 (I did achieve 20,000 in practice), but it makes it very difficult. The nearest neighbor list can't always be done away with, because doing so transforms molecular dynamics from O(n) to O(n^2). Every MD code base I've seen needs to use the CPU to rebuild the nearest neighbor list; I have not seen an entirely GPU-driven list.
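For anyone who wants to check that number, here is a quick back-of-envelope sketch of the arithmetic behind the ~20,000 ns/day ceiling, using the figures from the earlier post (treating "each pair" as the full n × n = 10^6 interactions; the variable names are just illustrative):

```swift
// Theoretical throughput from the numbers quoted above.
let pairs = 1_000.0 * 1_000.0                        // n^2 pair interactions per timestep
let instructionsPerStep = pairs * 90.0               // 9e7 instructions per timestep
let gpuRate = 5.3e12                                 // GPU instructions per second
let stepsPerSecond = gpuRate / instructionsPerStep   // ~58,900 timesteps per second
let stepsPerDay = stepsPerSecond * 86_400            // ~5.1e9 timesteps per day
let simulatedNsPerDay = stepsPerDay * 4e-6           // 4 fs = 4e-6 ns per timestep
print(simulatedNsPerDay)                             // ≈ 20,000 ns/day
```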
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
The answer is 20,000 ns/day. Yet, OpenMM only achieved 1,000 ns/day. Why? Because the CPU had to wait for the previous timestep to complete - a 300 us latency bottleneck. Switching to a more computationally intensive approach* (which required more computations, but removed the dependency) improved speed to 4,000 ns/day. It's still latency bound (now by encoding latency), but much closer to the theoretical maximum.

> *The approach is: remove the nearest neighbor list, something that reduces the number of neighboring atoms you need to check. The list transforms an O(n^2) problem into an O(n) problem.

On any other architecture, this would not be an issue. For example, NV and AMD achieved 2,000-6,000 ns/day even with the CPU-GPU synchronization. It doesn't make the task impossible on M1 (I did achieve 20,000 in practice), but it makes it very difficult. The nearest neighbor list can't always be done away with, because doing so transforms molecular dynamics from O(n) to O(n^2). Every MD code base I've seen needs to use the CPU to rebuild the nearest neighbor list; I have not seen an entirely GPU-driven list.
I'm going to try one more time.
Assume I know NOTHING about your problem. Explain to me EXACTLY
- what needs to be done on the GPU
- what needs to be done on the CPU
- how they are linked in terms of latency

For example (just one example, I don't know since, as I have said repeatedly, I know NOTHING about this problem) suppose your current scheme involves something like
- do stuff on CPU
- send a kernel to the GPU

If the flow control details of the kernel do not depend on what the CPU does, then this might be modifiable to something like
- send kernel* to the GPU
- do stuff on the CPU
- CPU sets a flag in global address space via atomic store with release
meanwhile kernel* looks like
- spin on an atomic load from that same address until it sees the flag
- do the rest of the work

In THEORY will this work? Perhaps. The Metal atomics are not promised to have acquire/release semantics. On the other hand we have no clue what actual semantics they have between CPU and GPU. So...
This sort of scheme will NOT work (for forward progress reasons) if the GPU spinloop is relying on earlier workgroups of the same kernel to complete. But that's not what we have here...

Point is, there are many many ways to skin a cat, the devil is in the details. There are many other alternatives; for example you could (in principle) spawn each successor kernel from the previous kernel, or perhaps sync with a Metal shared event. Without knowing ANYTHING about what either the CPU or GPU side are doing, and WHY either side ever needs to wait for the other side (which determines the various ways in which one can "pre-load" or otherwise overlap, and how primed either side can be for immediate execution once the other side says go) it's impossible to say anything.
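To make the shape of that flag-based idea concrete, here is a minimal sketch (host side in Swift, kernel embedded as MSL source). It assumes a shared-storage flag buffer and the relaxed device atomics that Metal actually exposes; as said above, CPU↔GPU ordering and forward progress are not guaranteed, so treat it as an illustration of the structure, not a recipe:

```swift
import Metal

// Kernel that spins on a flag the CPU will set later, then does its work.
// Metal only exposes relaxed atomics, so (as discussed above) there is no
// promised acquire/release behaviour across the CPU/GPU boundary.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void waitThenWork(device atomic_uint *flag [[buffer(0)]],
                         device float *data       [[buffer(1)]],
                         uint tid [[thread_position_in_grid]])
{
    // Spin until the CPU publishes the go signal.
    while (atomic_load_explicit(flag, memory_order_relaxed) == 0) {}
    // ...do the rest of the work...
    data[tid] *= 2.0f;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "waitThenWork")!)

let flagBuffer = device.makeBuffer(length: 4, options: .storageModeShared)!
let dataBuffer = device.makeBuffer(length: 1024 * 4, options: .storageModeShared)!
let queue = device.makeCommandQueue()!

// 1. Send the kernel to the GPU first.
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(flagBuffer, offset: 0, index: 0)
encoder.setBuffer(dataBuffer, offset: 0, index: 1)
encoder.dispatchThreads(MTLSize(width: 1024, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()

// 2. Do the CPU-side work here, then publish the flag. A real attempt would
//    want an atomic release store (e.g. via the swift-atomics package); a plain
//    store is shown only to keep the sketch short.
flagBuffer.contents().assumingMemoryBound(to: UInt32.self).pointee = 1
commandBuffer.waitUntilCompleted()
```

Whether the GPU actually observes that store promptly (or at all) is exactly where the finickiness described below comes in.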
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I don't have the bandwidth to explain all of this, but I did try CPU-GPU atomics to bypass the driver latency. I achieved a 4 us record low latency, but it was very finicky, frequently jumping to 2000 us, and sometimes hanging indefinitely. It's not possible to reliably communicate between the two processors this way.

https://github.com/philipturner/metal-usm/blob/main/Experimentation/HostMessaging.metal

The gist of the problem is that rewriting something to be entirely GPU-driven takes time and money. The few people who do port CUDA code to OpenCL or Metal rarely have the resources to further invest in entirely new algorithms just to make it 10x faster on M1. The latency bottleneck also only shows for medium-sized problems, but that is still a decent class of use cases where GPU acceleration is slower than the CPU. Meanwhile, on Nvidia, GPU acceleration would be faster than the CPU even without tensor cores.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Also maybe put ANE units into the GPU die so it'd be usable for general computing, not only CoreML stuff.

ANE has a different purpose, which is why it uses a different architecture. Some recently published patents suggest that Apple intends to specialize it even further. I doubt this will make it a good candidate for inclusion into the GPU, but who knows how these things will play out in the long term.

For GPU matmul however, Apple could do something similar to AMD and integrate some of their AMX tech into the GPU SIMDs. If each SIMD could be reconfigured as a 64-wide BFloat16 matrix engine, that would give an Apple GPU core 512 16-bit matrix FLOPs per clock. Could be a fairly noticeable performance boost for a fairly minor cost.
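For what it's worth, the arithmetic behind that 512 figure, assuming the usual four 32-wide SIMDs per Apple GPU core and counting a fused multiply-add as two FLOPs:

```swift
// Hypothetical per-core throughput if each SIMD became a 64-wide BF16 matrix engine.
let simdsPerCore = 4                 // assumed SIMD count per Apple GPU core
let matrixLanesPerSIMD = 64          // the hypothetical 64-wide BF16 matrix engine
let flopsPerFMA = 2                  // one fused multiply-add = 2 FLOPs
let bf16MatrixFlopsPerClock = simdsPerCore * matrixLanesPerSIMD * flopsPerFMA   // 512
```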
 

komuh

macrumors regular
May 13, 2023
126
113
ANE has a different purpose, which is why it uses a different architecture. Some recently published patents suggest that Apple intends to specialize it even further. I doubt this will make it a good candidate for inclusion into the GPU, but who knows how these things will play out in the long term.

For GPU matmul however, Apple could do something similar to AMD and integrate some of their AMX tech into the GPU SIMDs. If each SIMD could be reconfigured as a 64-wide BFloat16 matrix engine, that would give an Apple GPU core 512 16-bit matrix FLOPs per clock. Could be a fairly noticeable performance boost for a fairly minor cost.
But the ANE is just a DMA engine with probably a lot bigger MAC units (something like 32x32 or maybe even 64x64), so it's not like it is so specialised it can't be put on a GPU (Nvidia is doing it and calls them Tensor Cores).

I feel like it was more about the GPU and the whole die being tiny on A-series iPhones, where putting the ANE in the GPU would waste some space. But it doesn't make sense for the M-series desktop/laptop dies, as it limits your potential GPU performance and also limits the usability of the ANE. [It's not like it has the size of a Google TPU die with 275 teraflops + HBM memory.]
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,516
19,664
But the ANE is just a DMA engine with probably a lot bigger MAC units (something like 32x32 or maybe even 64x64), so it's not like it is so specialised it can't be put on a GPU (Nvidia is doing it and calls them Tensor Cores).

If memory serves me right, didn’t the reverse-engineering community show that ANE is primarily a convolution accelerator? I can imagine that the design criteria for this kind of hardware (especially if you want to maximize performance/watt) have to be sufficiently different from the one-thread-per-lane SIMD you need for the GPU… I’m sure @name99 will know much more about this.

Regarding Nvidia Tensor cores, we still don’t really know how they work. I suspect that the GPU SIMDs can double as matmul units under certain circumstances (there might be circumstantial evidence for this based on the shape of the API). We do know that Apple and AMD use a very similar approach, the main difference being that AMD SIMD units can be „split“ to perform lower-precision calculations at a higher rate (e.g. instead of doing 32 FP32 FMAs a SIMD can do 64 FP16 FMAs).
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
If memory serves me right, didn’t the reverse-engineering community show that ANE is primarily a convolution accelerator?
That would be the only real-world use case, except for the recent rise of attention mechanisms in transformers.

We do know that Apple and AMD use a very similar approach
RDNA 3's WMMA instruction is identical to Apple's simdgroup_matrix.
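For anyone who hasn’t used it, this is roughly what simdgroup_matrix looks like in MSL (shown here as a Swift source string; the single-tile buffer layout is a simplification for illustration, not a full GEMM):

```swift
import Metal

// One SIMD-group cooperatively multiplies two 8x8 half-precision tiles.
let simdgroupMatrixSource = """
#include <metal_stdlib>
using namespace metal;

kernel void tile_matmul_8x8(device const half *A [[buffer(0)]],
                            device const half *B [[buffer(1)]],
                            device half       *C [[buffer(2)]])
{
    simdgroup_half8x8 a, b;
    simdgroup_half8x8 acc = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);

    simdgroup_load(a, A, 8);                        // load an 8x8 tile of A
    simdgroup_load(b, B, 8);                        // load an 8x8 tile of B
    simdgroup_multiply_accumulate(acc, a, b, acc);  // acc = a * b + acc
    simdgroup_store(acc, C, 8);                     // write the 8x8 result tile
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: simdgroupMatrixSource, options: nil)
```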
 

komuh

macrumors regular
May 13, 2023
126
113
If memory serves me right, didn’t the reverse-engineering community show that ANE is primarily a convolution accelerator? I can imagine that the design criteria for this kind of hardware (especially if you want to maximize performance/watt) have to be sufficiently different from the one-thread-per-lane SIMD you need for the GPU… I’m sure @name99 will know much more about this.

Regarding Nvidia Tensor cores, we still don’t really know how they work. I suspect that the GPU SIMDs can double as matmul units under certain circumstances (there might be circumstantial evidence for this based on the shape of the API). We do know that Apple and AMD use a very similar approach, the main difference being that AMD SIMD units can be „split“ to perform lower-precision calculations at a higher rate (e.g. instead of doing 32 FP32 FMAs a SIMD can do 64 FP16 FMAs).
https://github.com/geohot/tinygrad/tree/master/accel/ane
Convolution/cross-correlation is still mostly matmul, as are Google TPUs and NV Tensor Cores. You can always add some extra "juice" (dunno, for example FFT accelerators or NV optical flow accelerators, etc.), but you'll always focus on memory and matmul performance.

That's why I'm hoping the ANE will get pushed inside the GPU, as that is becoming the standard for every major GPU player right now (NV and Intel* are both more robust than AMD), or that we'll get some dedicated lane to communicate with the GPU without global memory latency, and maybe some proper Metal calls available in the M3 series.

Or they'll just go with a lot more SIMD accelerators, make the L1 cache bigger, maybe allow hard-lowering clocks to match ANE power usage, and fully remove the ANE from the M-series die (maybe it is possible, as it should be a whole new architecture on 3nm).
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,516
19,664

That's why I'm hoping the ANE will get pushed inside the GPU, as that is becoming the standard for every major GPU player right now (NV and Intel* are both more robust than AMD), or that we'll get some dedicated lane to communicate with the GPU without global memory latency, and maybe some proper Metal calls available in the M3 series.

Other vendors pack the ML accelerators into GPUs because it’s the most logical place to put them. GPUs are already massively parallel and have access to high memory bandwidth. Apple doesn’t need to follow this design simply because they have UMA. Apparently they have decided it makes sense for them to keep these units separate (for energy efficiency) while partially replicating functionality. I do agree with you that a more predictable programming environment would be nice. Right now there are too many devices with different capabilities and performance characteristics.

And anyway, with how fast this field develops, who knows what’s going to be the next step? There is talk of in-RAM computation to overcome the bandwidth limit, plus future approaches to computation will likely be much smarter by actively exploiting the sparse nature of these problems.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
plus future approaches to computation will likely be much smarter by actively exploiting the sparse nature of these problems.
The future of computation might be that you build your own computer from raw materials, using a nano-factory. Completely unlike the current system, where the entire development stack is proprietary. Right now we're like software in the 1980's, when it was barely accessible. Imagine what could happen if matter becomes programmable just like information, and anyone can write hardware the way they write software. You could maximize efficiency by recycling your computer on the spot into a different architecture, tailored to your compute needs (like an FPGA).
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
The future of computation might be that you build your own computer from raw materials, using a nano-factory. Completely unlike the current system, where the entire development stack is proprietary. Right now we're like software in the 1980's, when it was barely accessible. Imagine what could happen if matter becomes programmable just like information, and anyone can write hardware the way they write software. You could maximize efficiency by recycling your computer on the spot into a different architecture, tailored to your compute needs (like an FPGA).

I meant a more immediate future :) like in 5 years from now at most
 
  • Haha
Reactions: Philip Turner

komuh

macrumors regular
May 13, 2023
126
113
Other vendors pack the ML accelerators into GPUs because it’s the most logical place to put them. GPUs are already massively parallel and have access to high memory bandwidth. Apple doesn’t need to follow this design simply because they have UMA. Apparently they have decided it makes sense for them to keep these units separate (for energy efficiency) while partially replicating functionality. I do agree with you that a more predictable programming environment would be nice. Right now there are too many devices with different capabilities and performance characteristics.

And anyway, with how fast this field develops, who knows what’s going to be the next step? There is talk of in-RAM computation to overcome the bandwidth limit, plus future approaches to computation will likely be much smarter by actively exploiting the sparse nature of these problems.
UMA does not fix the problem of accessing the ANE from the GPU, or the context-switch costs.
 

dgdosen

macrumors 68030
Dec 13, 2003
2,817
1,463
Seattle
Asking questions as a non-AI expert:

While we can configure tools like TensorFlow or PyTorch to take advantage of Apple Silicon CPU/GPU - once configured, is there more to it (with respect to leveraging for AI training)?

While the GPU cores can be accessed with Swift/ObjC/C++ via Metal - does Metal also provide access to the ANE? Is there a need to do that?

And if you can ELI5 - what is the difference between an ANE core and GPU core? It looks like they're both optimized for parallel processing, is one better than the other at vector/matrix math? Is there overlap?

Thanks for any insight.
 

komuh

macrumors regular
May 13, 2023
126
113
Asking questions as a non-AI expert:

While we can configure tools like TensorFlow or PyTorch to take advantage of Apple Silicon CPU/GPU - once configured, is there more to it (with respect to leveraging for AI training)?

While the GPU cores can be accessed with Swift/ObjC/C++ via Metal - does Metal also provide access to the ANE? Is there a need to do that?

And if you can ELI5 - what is the difference between an ANE core and GPU core? It looks like they're both optimized for parallel processing, is one better than the other at vector/matrix math? Is there overlap?

Thanks for any insight.
The PyTorch Metal backend uses a few custom Metal kernels and mostly Metal Performance Shaders, but it has the limitation of eager mode, so nothing is fused in any way and it is pretty slow. Dunno about TensorFlow.

ANE isn't accessible via any public api so you can't use it with Metal like interface.

Neural engines are optimised mostly for matmul operations on some sort of 8x8/16x16/32x32 tile size, with lower precision (BF16 for example) and higher memory bandwidth for local data transfers, to fight the limitations of earlier GPUs (2010+).

Newer neural accelerators mostly sit on the GPU die, or come as separate, very big dies with a ton of memory to scale to 100+ teraflop calculations (Google TPUs, Cerebras, etc.).
 
Last edited:
  • Like
Reactions: dgdosen

komuh

macrumors regular
May 13, 2023
126
113
Does this mean the ANE is useless for all apps except Apple apps?
You can still expect the ANE to work with some CoreML exported models, but it's impossible to force the system to use the ANE; it'll automatically use the "fastest" path (so if one or two ops in your model are not supported by the ANE, then most likely it'll just dispatch to the GPU, as the data transfer and context switching cost is pretty big).

Also, bigger data is always dispatched to the GPU in my experience (for example 2.5K/4K images); it's probably the 4MB cache limitation in the ANE, but that's understandable, as the ANE was developed for iPhones.
 
  • Like
Reactions: Romain_H

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Does this mean the ANE is useless for all apps except Apple apps?
You can use it via the CoreML framework. But it’s not directly programmable. You build a model using a high-level Apple ML framework, and the system is then free to pick how this model is evaluated. Some models can run on the ANE, some can’t. Apple doesn’t really provide any guidance here.
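Concretely, the only knob CoreML exposes is a constraint on which compute units it may consider; you still can’t force the ANE. A minimal Swift sketch (the model path is a placeholder):

```swift
import CoreML

// Constrain which compute units CoreML may use; the framework still decides
// per layer, so there is no way to force evaluation onto the ANE.
let config = MLModelConfiguration()
config.computeUnits = .all            // alternatives: .cpuOnly, .cpuAndGPU,
                                      // .cpuAndNeuralEngine (macOS 13 / iOS 16+)

// "MyModel.mlmodelc" is a placeholder for a compiled CoreML model.
let modelURL = URL(fileURLWithPath: "MyModel.mlmodelc")
let model = try! MLModel(contentsOf: modelURL, configuration: config)
```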
 
  • Like
Reactions: Romain_H

Philip Turner

macrumors regular
Dec 7, 2021
170
111
ANE is also used by MetalFX upscaling, although it seems to be only for a tiny fraction of the compute work (80 mW).
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
I don't have the bandwidth to explain all of this, but I did try CPU-GPU atomics to bypass the driver latency. I achieved a 4 us record low latency, but it was very finicky, frequently jumping to 2000 us, and sometimes hanging indefinitely. It's not possible to reliably communicate between the two processors this way.

https://github.com/philipturner/metal-usm/blob/main/Experimentation/HostMessaging.metal

The gist of the problem is that rewriting something to be entirely GPU-driven takes time and money. The few people who do port CUDA code to OpenCL or Metal rarely have the resources to further invest in entirely new algorithms just to make it 10x faster on M1. The latency bottleneck also only shows for medium-sized problems, but that is still a decent class of use cases where GPU acceleration is slower than the CPU. Meanwhile, on Nvidia, GPU acceleration would be faster than the CPU even without tensor cores.
This MIGHT be relevant to your problem: https://patents.google.com/patent/US10908962B1/
Look in particular at the section on Blacklisting…

One can imagine a few ways to solve the issue, but the most obvious “better” ways may have already been patented by nV or AMD. The best solution is to allow full GPU pre-emption, and there is a recent patent for that, but it may not yet be implemented, or only implemented in M2, or the blacklist is still there because of the cost of pre-emption.
(An interesting idea would be to have an independent E-cluster GPU capable of sustaining basic UI, somewhat like an iGPU+dGPU system. Not only could this save energy, it would allow ignoring this GUI pre-emption constraint in many cases…) This may be why the issue is less visible on Intel Mac systems running Metal, because they have an iGPU that allows the dGPU to avoid caring about pre-emption?

BTW this gives another justification for the ANE — using it rather than GPU allows for long-running kernels without having to worry about pre-emption to sustain the GUI.
 