The answer is 20,000 ns/day. Yet OpenMM only achieved 1,000 ns/day. Why? Because the CPU had to wait for the previous timestep to complete - a 300 µs latency bottleneck. Switching to a more computationally intensive approach* (more arithmetic, but no dependency on the CPU) improved speed to 4,000 ns/day. It's still latency-bound (now by encoding latency), but much closer to the theoretical maximum.
> *The approach: remove the nearest neighbor list, a data structure that limits how many neighboring atoms each atom has to check. The list is what turns an O(n^2) problem into an O(n) problem, so removing it costs extra arithmetic but also eliminates the CPU-side rebuild that created the dependency.
On any other architecture, this would not be an issue. For example, NVIDIA and AMD GPUs achieved 2,000-6,000 ns/day even with the CPU-GPU synchronization. It doesn't make the task impossible on M1 (I did reach 20,000 ns/day in practice), but it makes it very difficult. The nearest neighbor list can't always be done away with, because removing it turns molecular dynamics from an O(n) problem back into an O(n^2) one. Every MD code base I've seen uses the CPU to rebuild the nearest neighbor list; I have not seen an entirely GPU-driven list.
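To make the trade-off concrete, here is a hedged sketch in plain Swift (not OpenMM code; `Vec3`, the function names, and the cutoff-plus-skin convention are purely illustrative) of the two interaction loops. The all-pairs loop needs no list, and therefore no CPU, but costs O(n^2); the list-consuming loop is O(n) per step but only works if something - in practice the CPU - periodically rebuilds `neighbors`.

```swift
struct Vec3 { var x, y, z: Float }

func distSq(_ a: Vec3, _ b: Vec3) -> Float {
    let dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z
    return dx*dx + dy*dy + dz*dz
}

// O(n^2): check every pair on every timestep -- the "more computationally
// intensive approach" that needs no CPU-built list.
func countInteractionsAllPairs(_ atoms: [Vec3], cutoff: Float) -> Int {
    var count = 0
    for i in 0..<atoms.count {
        for j in (i + 1)..<atoms.count
        where distSq(atoms[i], atoms[j]) < cutoff * cutoff {
            count += 1
        }
    }
    return count
}

// O(n): consume a neighbor list built earlier (typically on the CPU from a
// cell/bin pass). neighbors[i] holds the indices within cutoff-plus-skin of atom i.
func countInteractionsWithList(_ atoms: [Vec3], neighbors: [[Int]], cutoff: Float) -> Int {
    var count = 0
    for i in 0..<atoms.count {
        for j in neighbors[i]
        where j > i && distSq(atoms[i], atoms[j]) < cutoff * cutoff {
            count += 1
        }
    }
    return count
}
```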
I'm going to try one more time.
Assume I know NOTHING about your problem. Explain to me EXACTLY
- what needs to be done on the GPU
- what needs to be done on the CPU
- how they are linked in terms of latency
For example (just one example; I don't know since, as I have said repeatedly, I know NOTHING about this problem), suppose your current scheme involves something like the following (a rough code sketch follows the list)
- do stuff on CPU
- send a kernel to the GPU
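Taking that hypothetical at face value, a minimal sketch of such a loop in Metal might look like this (illustrative only, not OpenMM's actual host code; `numTimesteps` and the elided encoding are placeholders). The `waitUntilCompleted()` at the bottom of each iteration is exactly the kind of per-step CPU-GPU round trip that would show up as the ~300 µs bottleneck described above.

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let numTimesteps = 1_000   // placeholder

for _ in 0..<numTimesteps {
    // "Do stuff on CPU": e.g. decide whether the neighbor list is stale
    // and rebuild it if so.

    // "Send a kernel to the GPU": encode this step's force/integration work.
    let cmd = queue.makeCommandBuffer()!
    // ... makeComputeCommandEncoder(), set pipeline state and buffers, dispatch ...
    cmd.commit()

    // The CPU blocks until the GPU finishes before it can encode the next
    // step, so every timestep pays the full CPU<->GPU round-trip latency.
    cmd.waitUntilCompleted()
}
```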
If the flow control details of the kernel do not depend on what the CPU does, then this might be modifiable to something like the following (both halves are sketched in code after these lists)
- send kernel* to the GPU
- do stuff on the CPU
- CPU sets a flag in global address space via atomic store with release
meanwhile kernel* looks like
- spin on an atomic load from that same address until it sees the flag
- do the rest of the work
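Put into code, a hedged sketch of that flag scheme might look like the following (every name, size, and the trivial kernel body are made up for illustration; it assumes a `.storageModeShared` buffer visible to both sides, and it leans on cross-device atomic behavior that, as noted next, Metal does not actually promise).

```swift
import Metal

// Hypothetical kernel: spin on a flag the CPU will set, then do the work.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void waitThenWork(device atomic_uint *flag [[buffer(0)]],
                         device float *data       [[buffer(1)]],
                         uint tid [[thread_position_in_grid]])
{
    // MSL only exposes memory_order_relaxed, so the ordering of 'data'
    // relative to the flag is not guaranteed by the spec.
    while (atomic_load_explicit(flag, memory_order_relaxed) == 0) { }
    data[tid] *= 2.0f;   // stand-in for "the rest of the work"
}
"""

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "waitThenWork")!)

let flagBuf = device.makeBuffer(length: 4, options: .storageModeShared)!
let dataBuf = device.makeBuffer(length: 1024 * 4, options: .storageModeShared)!
flagBuf.contents().storeBytes(of: UInt32(0), as: UInt32.self)

// 1. Send kernel* to the GPU; it starts spinning on the flag.
let cmd = queue.makeCommandBuffer()!
let enc = cmd.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
enc.setBuffer(flagBuf, offset: 0, index: 0)
enc.setBuffer(dataBuf, offset: 0, index: 1)
enc.dispatchThreads(MTLSize(width: 1024, height: 1, depth: 1),
                    threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
enc.endEncoding()
cmd.commit()

// 2. Do stuff on the CPU (e.g. rebuild the neighbor list) ...

// 3. Set the flag. A real attempt would want a release store here (e.g. via
//    the swift-atomics package); a plain store is shown only for brevity.
flagBuf.contents().storeBytes(of: UInt32(1), as: UInt32.self)

cmd.waitUntilCompleted()
```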
In THEORY, will this work? Perhaps. Metal atomics are not promised to have acquire/release semantics (the shading language only exposes relaxed ordering). On the other hand, we have no clue what actual semantics they have between CPU and GPU. So...
This sort of scheme will NOT work (for forward-progress reasons) if the GPU spinloop is relying on earlier workgroups of the same kernel to complete. But that's not what we have here...
Point is, there are many, many ways to skin a cat; the devil is in the details. There are many other alternatives: for example, you could (in principle) spawn each successor kernel from the previous kernel, or perhaps sync with a Metal shared event (see the sketch below). Without knowing ANYTHING about what either the CPU or GPU side is doing, and WHY either side ever needs to wait for the other (which determines the various ways in which one can "pre-load" or otherwise overlap, and how primed either side can be for immediate execution once the other side says go), it's impossible to say anything.
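For what it's worth, the shared-event route at least uses machinery Metal explicitly provides for CPU-GPU synchronization. A hedged sketch (the values and the dispatch-queue label are placeholders): the command buffer waits on an MTLSharedEvent value that the CPU signals once its part of the timestep is ready, and the CPU is notified by callback when the GPU is done, so neither side spins.

```swift
import Dispatch
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let event = device.makeSharedEvent()!

let cmd = queue.makeCommandBuffer()!
// GPU side: stall this command buffer's work until the event reaches 1,
// i.e. until the CPU says "go".
cmd.encodeWaitForEvent(event, value: 1)
// ... encode this timestep's compute passes here ...
// GPU side: bump the event when the timestep is done so the CPU can be
// notified without polling.
cmd.encodeSignalEvent(event, value: 2)
cmd.commit()

// CPU side: do the per-timestep CPU work (e.g. the neighbor-list rebuild),
// then release the GPU by signaling the value it is waiting on.
event.signaledValue = 1

// CPU side: get a callback when the GPU reaches 2, instead of blocking in
// waitUntilCompleted(); the next timestep's CPU work could start here.
let listener = MTLSharedEventListener(dispatchQueue: DispatchQueue(label: "md.sync"))
event.notify(listener, atValue: 2) { _, _ in
    // kick off the next step ...
}
```

Whether that actually beats per-step waiting (or the spin-flag trick) comes back to the questions above: how far ahead work can be encoded, and what the wake-up latency is once the other side says go.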