Apple Silicon in Sciences

Philip Turner · May 3, 2023

The point wasn’t that all of the information comes beforehand, but it’s a necessary piece of the puzzle. We already have a bunch of instincts, neural circuits, encoded in the DNA and everything else learned by evolution. The other information exists in the world around us, which gets transferred into the brain over time. That’s what needs to happen with an AI model - enable it to build on the foundation that is Wikipedia, ArXiv, GitHub, StackOverflow, and use that to start processing vastly larger data sources.

The reason we evolved all these complex neural circuits, is they were necessary to survive. We have emotions and desires to quickly make life-saving decisions. I’m oversimplifying a bit (please correct me), but in my mind, we’re just an astronomically complex iteration of a survival algorithm, that learned that building a society increased life expectancy. AI could become something else - an astronomically complex iteration of a different objective, like curing cancer.

leman · May 4, 2023

I really like your point about our systems following a certain evolutionary path and that AI can (and should) be a different kind of device. I am just not sure that you will achieve what you want by exposing the ML models to more information of this kind. One would probably need more modalities of information as well as some sort of self-reflection auto-regulating component.

Philip Turner · May 4, 2023

leman said:
One would probably need more modalities of information as well as some sort of self-reflection auto-regulating component.

I think vastly more compute will allow radically different training methods. For example, evolving multiple models in a hyper-detailed simulation. We didn't get where we are today because one person became super smart. We did because billions of people evolved and natural selection chose the fastest learners (in a sense). We actually have ~1 trillion years of human experience under our belt, hidden in how our brains are configured at birth. Once we have that, kiloyears of YouTube data will be a nice way to give it some context.

I don't agree with the idea of infinitely self-improving AI, but something similar could help. LLMs are already preprocessing a lot of the data for training new models. We could also use them as more efficient mutation operators for GP. Another improvement is using quantum computers to perform some of the multidimensional optimization tasks that currently require genetic algorithms.

Romain_H · May 4, 2023

Holy smokes I love the (various) topics in this thread. Your knowledge is stunning (me jealeous)

name99 · May 4, 2023

leman said:
As a researcher who works in a field tangential to cognitive psychology I take a big issue with this statement I don’t think this is accurate at all. The amount of information we receive in our lifetime is often underestimated (see the misguided Poverty of Stimulus argument), and we literally have years to build mental model and adapt our brain structure to it.

But what’s probably more important is that the training information we receive is incredibly rich. Multimodality is the key. Our learning process is modulated by our direct experience of the world, social structures, internal feedback loops, moral compasses, desires, emotions, and inspirations. An ML model has none of that. No wonder GPT-4 is somewhat of an idiot savant - perfectly capable of generating well rhymed poetry at a press of a button but completely lacking any pragmatic context.

It's unclear the extent to which grounding words in the real world (ie sense input or memories) as opposed to having them simply be self-referential via embedding, actually matters.

The same is as true of the question as to whether we are born blank slates, or fully primed for language as suggested by the earlier poster.

These are just two more of those things that we will probably learn more about over the next ten years than over the past three thousand years of pontificating. But I do look forward to a world where these issues are discussed in terms of experiments run on machines, with actual graphs of learning rates and suchlike, rather than the current model of "well obviously what I say is true, therefore..."

name99 · May 4, 2023

Romain_H said:
Holy smokes I love the (various) topics in this thread. Your knowledge is stunning (me jealeous)

We are all building on foundations laid by supermen. If you want to, there is so much you can read today to become part of the club of people thinking about these things.

An easy starting point is Noah Yuval Harari. Then VS Ramachandran. Then Daniel Dennett (not the early stuff, too technical, but maybe start with Darwin's Dangerous Idea). Then Douglas Hofstadter (personally I think Godel Escher Bach is too obsessed with things I don't care about, but I am a Strange Loop is not bad in terms of the things I do care about).
Then see the authors and books that tend to come up on Amazon or other recommendation systems associated with these books and authors and just follow what you can!

It's overwhelming, yes! But also will be a source of fascination till the day you die!

Philip Turner · May 4, 2023

@name99 do the patents give any hint as to why GEMM only reaches 80% ALU utilization on the Apple GPU? I always assumed the simdgroup_matrix_multiply instruction took 20 cycles, and there was no way to work around it. However, it seems to be a special SIMD reduction - the only one where FP16 throughput = FP32 throughput (~~or I'm wrong and MPS simply upcasts FP16 -> FP32 beforehand~~ (https://github.com/dougallj/applegp...dcd498157575c5b267015/applegpu.py#L4187-L4204)). I also want to know whether the instruction dispatching unit is stalled during the 16-20 cycles of the matmul, or whether it is free to perform other work.

leman · May 5, 2023

name99 said:
It's unclear the extent to which grounding words in the real world (ie sense input or memories) as opposed to having them simply be self-referential via embedding, actually matters.

The same is as true of the question as to whether we are born blank slates, or fully primed for language as suggested by the earlier poster.

I think it's not as much about grounding words in real world as about communication being intent-driven. After all, we use words to communicate our mental images to others. Current LLMs do have a mental image of sorts (the latent space), but these mental images are grounded in the communication medium itself and hence constrained by it. I believe that another abstraction layer is needed to modulate the communication model, and this layer will probably have to do with self-referentiality and be to a certain degree external to the language model. Sorry if this sounds a bit vague, just thinking out loud. I'd need more time and effort to try and gather these thoughts better.

name99 said:
But I do look forward to a world where these issues are discussed in terms of experiments run on machines, with actual graphs of learning rates and suchlike, rather than the current model of "well obviously what I say is true, therefore..."

Oh yes, absolutely, the argument from authority in this branch of academia still reigns supreme. Bringing actual science back to these matters would be great. At the same time I have observed a somewhat worrying tendency in recent papers where overly complicated mathematical models and vague statistical results are used to make big claims. Then again, academia is incredibly toxic, which is why I stepped back from active research into a more of a support role.

name99 · May 5, 2023

Philip Turner said:
@name99 do the patents give any hint as to why GEMM only reaches 80% ALU utilization on the Apple GPU? I always assumed the simdgroup_matrix_multiply instruction took 20 cycles, and there was no way to work around it. However, it seems to be a special SIMD reduction - the only one where FP16 throughput = FP32 throughput (~~or I'm wrong and MPS simply upcasts FP16 -> FP32 beforehand~~ (https://github.com/dougallj/applegp...dcd498157575c5b267015/applegpu.py#L4187-L4204)). I also want to know whether the instruction dispatching unit is stalled during the 16-20 cycles of the matmul, or whether it is free to perform other work.

I did not not look at that particular patent in much detail at all once I saw the full horror of the tables it uses!

I could well believe that the matrix multiplication was added on top of permute as a good quick improvement, but with an expectation/understanding (after simulation) that there is some bottleneck; and there's also expected to be an alleviation of that bottleneck in the next major rev.
For example if the next major rev allows for some degree of dual-instruction issue (like Pentium UV, doesn't have to be very fancy) you might land up being able to run pretty much all the data movement in parallel with the FMACs, and almost double throughput? Or you might tweak the data movement primitives so fewer of them are required?

name99 · May 6, 2023

leman said:
No, I mean this one: https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2023033955&_cid=P22-LH6QGU-07079-3

So the 2019 patent (the one I mentioned) gives the basic idea of being able to map some lines in a cache (realistically this will be the SLC) to fixed addresses. This is useful for at least two purposes:
- allows writes by network hardware to be captured by the SLC rather than being written to DRAM, then they're low latency accessible by the CPU.
[You could ask why this is necessary. Well, what are the alternatives?
+ You could capture ALL writes to DRAM in the SLC, but this is a bad idea because most writes to DRAM are not reused for a long time, so you are wasting a lot of your SLC space on data that won't be used.
+ Maybe you could tag the "important" writes from network hardware so they are treated specially by the SLC? But the network hardware, at least for now, is controlled by Qualcomm and Broadcom, so that's not practical.

Whereas locking some addresses in SLC is under Apple's control, and useful for other purposes like]

- you can lock some or all interrupt routines in SLC so that they always run at low latency. If you don't lock the addresses of these routines then, frequently, the routine may be flushed out of SLC the next time it is required.

Anyway, that's the 2019 patent.
What the 2023 patent (the one you referenced) adds is the ability to make this change *dynamically*. With the 2019 scheme you set things up once at boot and that was that. With the 2023 scheme, you can request that an address range be locked or unlocked from SLC. This requires you to think about a few things now.
Most importantly: what happens once you remove an address range from the SLC? The patent's answer is
- you clear the cache lines
- any reference to this address range is now an error, like addressing NULL.
Other answers could be possible, like maybe that address range could instead be mapped to DRAM, but this is what Apple chose.

The idea seems to be to provide a generalization of the network idea for arbitrary realtime data transfer (think eg video capture). The video driver can map a pre-allocated address range to the SLC, then the camera will automatically store successive frames to the SLC, from which they can be read by the GPU or media encoder or whatever. Once the video capture is over, that address range can be removed from SLC and the cache space given back to other clients.
Even the network case may benefit. The amount of SLC space allocated to hold network packets can now grow or shrink as network activity changes, rather than whatever fixed size Apple was previously using -- too large for Bluetooth, too small for 10G ethernet.

Philip Turner · May 6, 2023

I've just noticed how long some of you guys have been on these forums. @name99 you've been on literally as long as I've been alive. I assume you've been reverse-engineering or investigating Apple devices for that long?

name99 · May 6, 2023

Philip Turner said:
I've just noticed how long some of you guys have been on these forums. @name99 you've been on literally as long as I've been alive. I assume you've been reverse-engineering or investigating Apple devices for that long?

I worked at Apple from basically 1990 to 2000. So life was easier then -- they gave me access to the sooper-sekrit books rather than me having to reverse engineer this stuff myself!
This reverse-engineering shouldn't be taking up so much time! There are other things I also want to do.
But every time I walk away, then I learn about some new area (currently GPUs) and get sucked back in!

Xiao_Xi · May 10, 2023

Reading the Julia 1.9 release notes, I learned that Mx has native half-precision floating-point arithmetic. Although you have to use it carefully, it can bring incredible speed.

By the way, Julia 1.9 now has Tier 1 support for macOS ARM.

Julia 1.9 Highlights

Highlights of the Julia 1.9 release.

julialang.org

leman · May 10, 2023

Xiao_Xi said:
Reading the Julia 1.9 release notes, I learned that Mx has native half-precision floating-point arithmetic.

Yep, it’s part of ARMv8. And M2 has hardware CPU support for BF16 (and I’ve heard they also support it in AMX)

Xiao_Xi · May 10, 2023

leman said:
And M2 has hardware CPU support for BF16 (and I’ve heard they also support it in AMX)

Thanks. I didn't know it.

AArch64: add support for newer Apple CPUs · llvm/llvm-project@677da09

They're roughly ARMv8.6. This works in the .td file, but in AArch64TargetParser.def, marking them v8.6 brings in support for the SM4 cryptographic hash and we don't actually have that. So T...

github.com

M2 Compatibility · Issue #5 · corsix/amx

Hi, On my M2 (2022 MacBook Air) I'm getting the following: AMX_LDX: fail AMX_LDY: fail AMX_LDZ: pass AMX_LDZI: pass AMX_STX: pass AMX_STY: pass AMX_STZ: pass AMX_STZI: pass AMX_EXTRX: fail AMX_EXTR...

github.com

Philip Turner · May 12, 2023

I think Apple designed the AMX because their GPU is unusable for too many real-world tasks. Too much driver latency, compared to NV. I spotted the bottleneck in PyTorch, nobody fixed it, and it just reappeared in molecular dynamics. I think it's also going to appear in quantum chemistry too.

The best solution is an completely open-source alternative to Metal Performance Shaders, which lets you encode numerous kernels into the same MTLComputeCommandEncoder. Even if it's slightly slower for some cases, removing the MPS dependency can substantially improve actual performance. Especially for inferencing large language models.

ANE is probably even worse, the order of 1-10 milliseconds. @name99 do you have data on latency to access the ANE?

Xiao_Xi · May 12, 2023

Philip Turner said:
The best solution is an completely open-source alternative to Metal Performance Shaders

It is very unlikely to happen because there is little incentive to do so. It seems more likely to me that you can convince Apple to address the problem. Apple needs more time.

In the meantime, you might be interested in a talk about SYCL runtime performance comparison for GROMACS.

komuh · May 13, 2023

Philip Turner said:
I think Apple designed the AMX because their GPU is unusable for too many real-world tasks. Too much driver latency, compared to NV. I spotted the bottleneck in PyTorch, nobody fixed it, and it just reappeared in molecular dynamics. I think it's also going to appear in quantum chemistry too.

The best solution is an completely open-source alternative to Metal Performance Shaders, which lets you encode numerous kernels into the same MTLComputeCommandEncoder. Even if it's slightly slower for some cases, removing the MPS dependency can substantially improve actual performance. Especially for inferencing large language models.

ANE is probably even worse, the order of 1-10 milliseconds. @name99 do you have data on latency to access the ANE?

In which use case MTLComputeCommandEncoder is bottleneck for LLM in my DL models its always memory performance (so I have to fuse kernels) or mtlcommandbuffer/que never saw performance problem with compute encoder?

Philip Turner · May 13, 2023

I may have been exaggerating a bit, because it impacted my use case so negatively. Let me explain where it occurs and the problem sizes it impacts. Imagine you have a molecular dynamics simulation, which is an n-body physics simulation (O(n^2)). You have 1000 particles, computing each pair of particles takes 90 instructions, the GPU runs 5.3 trillion instructions per second, and each time step is 4 femtoseconds. How many nanoseconds could be computed in one day?

leman · May 13, 2023

komuh said:
In which use case MTLComputeCommandEncoder is bottleneck for LLM in my DL models its always memory performance (so I have to fuse kernels) or mtlcommandbuffer/que never saw performance problem with compute encoder?

It's not a bottleneck per se, it's just that there is a fairly large latency between encoding the command and executing the command. If you are mixing CPU and GPU computation, it can be very difficult to reach good compute utilisation.

A lot of this latency is probably due to software design — after all the driver stack is targeting throughput and not latency. Maybe Apple will improve things here in due time, who knows.

komuh · May 13, 2023

Yup true enough for molecular simulation it is a problem. But I would love more simd_matrix units or bigger cache/ faster memory for my use cases

Also maybe put ANE units into GPU die so it'll be usable for general computing not only coreml stuff.

Philip Turner · May 13, 2023

When parallelizing something, there is often a cost. For example, I learned that molecular simulations can only utilize multiple cores if they have ~300 atoms. Otherwise the latency of Grand Central Dispatch makes it slower. For ML, the types of models that are possible to train on personal computers, are also the smallest, and most vulnerable to latency.

Take two latencies: CPU-GPU synchronization (200 us) and the sequential throughput of PyTorch (25 us). Assume both training with batch-16 and inferencing a feedforward layer take the same time, and the layer is in FP32. Bandwidth is 400 GB/s and a layer's weights take n^2 memory, where n is the layer size. What is the smallest layer size you can use without being bottlenecked by latency (both for 200 us and 25 us)?

komuh · May 13, 2023

Is 200us latency of single MTLComputeCommandEncoder dispatch or its for sync of data? Cause PyTorch seems like bad options for that (at least before StableHLO/MLIR will target Metal and you can easily compile model) as PyTorch latency is quiet high especially without compilation.

If it was me I would just write small optimised kernels for that maybe something like tiny-cuda-nn from nvidia.

Philip Turner · May 13, 2023

Sequential throughput of command buffers is 10-15 us. In other words, you cannot execute more than 100,000 command buffers in one second. However, if you batch numerous commands into one command buffer (by using compute encoders), you can achieve 2 us. PyTorch doesn't do this - in fact, they add an extra 15 us/command from MPSGraph's internal latency. So it takes 25 us even for a trivial scalar addition, while NV GPUs take a single-digit percent of that time.

Also, a lot of PyTorch operations aren't implemented in Metal, which means it must fall back to the CPU. The latency to hear back from a command buffer is typically 200-300 us. The record sequential throughput I've ever seen is 150 us. Compare that to Nvidia, which seems to be ~5 us (the latency of PCIe). Apple's "unified memory" is not actually shareable between the CPU and GPU in a real-world workflow, except as a means to create a GPU with large VRAM, or to improve power efficiency with asymptotically large problems.

name99 · May 13, 2023

Philip Turner said:
I think Apple designed the AMX because their GPU is unusable for too many real-world tasks. Too much driver latency, compared to NV. I spotted the bottleneck in PyTorch, nobody fixed it, and it just reappeared in molecular dynamics. I think it's also going to appear in quantum chemistry too.

The best solution is an completely open-source alternative to Metal Performance Shaders, which lets you encode numerous kernels into the same MTLComputeCommandEncoder. Even if it's slightly slower for some cases, removing the MPS dependency can substantially improve actual performance. Especially for inferencing large language models.

ANE is probably even worse, the order of 1-10 milliseconds. @name99 do you have data on latency to access the ANE?

No clue about ANE latency.

However one thing Apple could do with MPS is the same sort of thing as they did with Core Image. Even though that presents as functions you arrange into a graph, it's actually implemented (kinda) at the compiler level so that redundant work can be removed, instructions can be re-ordered, function call overhead eliminated, etc.

They have already gone down that path with ray tracing, and in principle they could do this for all of MPS, use the ideas from CI... It's an obvious next step once they clear their plates of all the higher priority work, but it might do a lot to help NN's, and since AI is the current darling, there is hope!

Apple Silicon in Sciences

macrumors regular

macrumors Core

macrumors regular

macrumors 6502a

macrumors 68030

macrumors 68030

macrumors regular

macrumors Core

macrumors 68030

macrumors 68030

macrumors regular

macrumors 68030

macrumors 68000

macrumors Core

macrumors 68000

macrumors regular

macrumors 68000

Suspended

macrumors regular

macrumors Core

Suspended

macrumors regular

Suspended

macrumors regular

macrumors 68030

Our Staff