You're throwing out a bunch of ideas, but I think the single most important point is that the SLC is designed to function primarily as a COMMUNICATIONS medium. This has a bunch of nice side effects like often acting as an L3 (lower latency) and saving a lot of energy, but COMMUNICATION is the main goal.

@name99 could you review the charts here about M1 Max latency, specifically the dip at 24 MB? Half of the SLC (in both capacity and bandwidth) is unusable by the CPU, just like half the memory bandwidth. I then need to find a way to measure GPU SLC latency (I can't find the benchmark software that Chips-and-Cheese uses). If my hypothesis is correct, the Pro/Max SLC has lower latency (150 - 250 ns) than the base M1 (266 ns) and could be optimized for incoherent accesses from ray tracing.
Given that the CPU can't use it, and that simdgroup_matrix seems to stream operands straight from L2 (based on bandwidth requirements), there seems to be no other major use for the SLC except reducing the power of off-chip accesses. That means one thing: the Pro/Max SLC is intended for GPU ray tracing.
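For what it's worth, a pointer-chase kernel along the lines below is one way to measure the GPU-side SLC latency mentioned above. This is only a sketch (host setup, GPU timestamp timing, and the randomized chain initialization are omitted), not the benchmark Chips-and-Cheese uses; sweeping the chain's buffer size past the L2 and SLC capacities should expose the latency steps.

    #include <metal_stdlib>
    using namespace metal;

    // Sketch of a latency microbenchmark: `chain` holds a randomly shuffled
    // permutation of [0, count), so every load depends on the previous one and
    // nothing can be overlapped. Time the dispatch, divide by `hops`, and sweep
    // the buffer size to see where L2 ends and the SLC/DRAM begin.
    kernel void pointer_chase(device const uint *chain  [[buffer(0)]],
                              device uint       *result [[buffer(1)]],
                              constant uint     &hops   [[buffer(2)]],
                              uint tid [[thread_position_in_grid]])
    {
        if (tid != 0) return;        // one thread only: latency, not bandwidth
        uint idx = 0;
        for (uint i = 0; i < hops; ++i) {
            idx = chain[idx];        // serially dependent loads
        }
        result[0] = idx;             // keep the loop from being optimized away
    }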
It has double the bandwidth, at least on M1. Exactly double the theoretical peak of 68.2 GB/s. This bandwidth peak is only witnessed on the M1 iGPU graphs and not found on the M1 Max CPU graphs. I have also tested my M1 Max GPU and measured bandwidth peaks at 512 KB and 48 MB.

better latency, but more or less the same bandwidth.
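The bandwidth peaks at 512 KB and 48 MB mentioned above come from a sweep that can be approximated with a streaming-read kernel like the sketch below (illustrative only; the host code that sweeps the buffer size from a few hundred KB up past the SLC capacity and divides bytes moved by elapsed time is omitted):

    #include <metal_stdlib>
    using namespace metal;

    // Sketch of a read-bandwidth kernel: every thread strides through a shared
    // buffer so the traffic stays coalesced. Plotting GB/s against buffer size
    // is what produces the peaks near 512 KB (L2) and 48 MB (SLC) on M1 Max.
    kernel void stream_read(device const float4 *src   [[buffer(0)]],
                            device float        *sink  [[buffer(1)]],
                            constant uint       &count [[buffer(2)]],  // float4 elements
                            uint tid  [[thread_position_in_grid]],
                            uint grid [[threads_per_grid]])
    {
        float4 acc = float4(0.0f);
        for (uint i = tid; i < count; i += grid) {
            acc += src[i];                              // read-only streaming
        }
        sink[tid] = acc.x + acc.y + acc.z + acc.w;      // defeat dead-code elimination
    }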
Would you consider simdgroup_matrix to be hardware-accelerated matrix multiplication?
Does the chip have to dedicate space to separate circuitry that will only be utilized during a small fraction of everyday workloads?
Do we know for a fact that Apple has removed FP16 from the GPU? When? With A15/M2?
I'm thinking of upgrading my iPad Pro when the M3 chip comes out, because of hardware ray tracing. Do you have any way to estimate the performance delta from non-RT accelerated chips? I predict it would be faster than M1 Max, but by how much? For my higher-end chip, MP sounds like a good option if the chip is truly upgradeable, and Moore's Law stays alive. I don't want to get a separate display for it, so I'd need to use it as an eGPU. Run simulations on it, stream un-upscaled frames back to the 120 Hz MBP for upscaling via MetalFX.

Anyway, Apple has published a substantial body of RT-related patents last year.
You mean for GPU? Yes, that I could believe.

It has double the bandwidth, at least on M1. Exactly double the theoretical peak of 68.2 GB/s. This bandwidth peak is only witnessed on the M1 iGPU graphs and not found on the M1 Max CPU graphs. I have also tested my M1 Max GPU and measured bandwidth peaks at 512 KB and 48 MB.
Here's the patent. It's like Phillip and I describe:

I think it depends what instruction sequence it compiles to. I would be surprised if there is anything matmul-specific there, it's probably just some combinations of SIMD FMAs like what you'd use with NEON or AMX. These matmul intrinsics are probably only there to take better advantage of the SIMD architecture that is somewhat hidden by the SIMT programming model.
As @Philip Turner wrote, there is no evidence for a dedicated FP16 pipe on M1 and later. The throughput is exactly the same as for FP32, so it's probably safe to assume that the ALU is 32-bit. And as Philip says, FP16 is used as a storage format, with free conversion during inputs and outputs. The performance advantage of using FP16 stems from the fact that registers are 16-bit addressable, so you can pack more state using 16-bit values (32-bit values are instead stored using pairs of 16-bit registers).
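To illustrate the storage-format point (a made-up example, not anything from Apple's docs): declaring values as half keeps them in single 16-bit registers, while the arithmetic itself can run through the 32-bit ALU with the conversions folded into the input and output stages.

    #include <metal_stdlib>
    using namespace metal;

    // Illustration: half as a storage format. Each `half` occupies one 16-bit
    // register (a float takes a pair), so FP16 mainly buys lower register
    // pressure and higher occupancy rather than a faster ALU on M1-class GPUs.
    kernel void scale_bias(device const half *in  [[buffer(0)]],
                           device half       *out [[buffer(1)]],
                           constant float2   &sb  [[buffer(2)]],  // {scale, bias}
                           uint tid [[thread_position_in_grid]])
    {
        half x = in[tid];                     // 16-bit load, widened on input
        float y = float(x) * sb.x + sb.y;     // arithmetic at FP32 width
        out[tid] = half(y);                   // narrowed back on output
    }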
Just as a side note, there is a very interesting recent patent where Apple describes allocating parts of the SLC as RAM. No idea whether it's already implemented (or ever will be), but I thought that's a neat idea.
More on the topic of SLC as a communication medium: something I found quite interesting is that in newer patents where on-chip networks are discussed, the schematics never show the CPU and GPU being directly connected. Of course, these are just high-level diagrams that do not show everything, but it does reinforce the idea that CPU-GPU communication primarily occurs through shared memory (which in turn means the SLC as a memory-side cache).
It depends on the scene.

FWIW, Blender gets almost 2x performance improvements with the OptiX backend over CUDA.
Do you have any way to estimate the performance delta from non-RT accelerated chips?

You may be more interested in other applications. For example:
We use RT cores to accelerate Density-Based Clustering of Applications with Noise (DBSCAN) by translating fixed-radius nearest neighbor queries to ray tracing queries. We show that leveraging the RT hardware results in speedups between 1.3x to 4x over current state-of-the-art, GPU-based DBSCAN implementations.
Of course things can change but the 2014 patent https://patents.google.com/patent/US20150205324A1 (look at Fig 6) describes (among other things, some of which are very weird and seem unlikely to be implemented) the idea of having parallel 16 and 32b FPUs, with the 32b unit shared between lanes, and this is done explicitly for power reasons.
(2018) https://patents.google.com/patent/US10699366B1 describes further details of how to share an FP32 unit between lanes.
There are other patents, like (2016) https://patents.google.com/patent/US10282169B2, which describe how to perform FP16 operations using the FP32 unit, in such a way that the correct FP16 rounding is calculated immediately. [If you just used the FP32 unit to generate an FP32 result, then rounding to FP16 adds an extra step and can, in some situations, give the incorrect rounding.]
All in all I'd say it's unclear.
- throughput tells us nothing. There's no way to execute both an FP16 and an FP32 op in the same cycle, so both FPUs could be there, each used as appropriate.
- Apple clearly have thought about what's required to implement FP16 within the FP32 unit. (The conversion of registers from FP16 to FP32 can happen as data flows into the unit, with correct rounding on the way out.)
- in the past the additional energy cost of running an FP16 calculation through the full FP32 unit (rather than through dedicated FP16 hardware) was a significant issue. But maybe they have figured out a way to bring that overhead close to zero by special handling of the relevant zero bit patterns in the FP32 unit?
By cache as RAM you mean
https://patents.google.com/patent/US10922232B1 ?
Of course I'm interested in how the chip ACTUALLY works, not just how it presents in terms of performance.

Thanks for all the info! I didn’t think about rounding and other numerical issues. It’s entirely possible that some FP16-specific circuitry is there to make sure all of this is handled correctly. Anyway, that’s an implementation detail which probably has very little (if any) practical relevance for either the developer or the user. What matters the most to me is that there are GPUs out there that have higher throughput on FP16 operations. This was also the case for several A-series chips. But M1 and later have the same FP16 and FP32 rate.
Hmm, I need some time to study that. Superficially it looks like essentially the same thing as the patent I referenced. Maybe it is the same idea, just a better implementation?
Here's the patent. It's like Phillip and I describe:
They are trying really hard to make the patent general, but once you get past the throat-clearing, it's basically about the permute patterns to use so as to run an 8x8 matrix multiply using all 32 lanes.
https://patents.google.com/patent/US11256518B2
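For concreteness, this is roughly what the exposed API looks like in Metal Shading Language (a minimal sketch; it assumes 8x8 row-major inputs and a dispatch of at least one full 32-thread simdgroup):

    #include <metal_stdlib>
    using namespace metal;

    // One simdgroup multiplies one 8x8 tile: the 64 elements of each operand
    // are spread across the 32 lanes (two per lane), which lines up with the
    // permute/layout patterns the patent describes.
    kernel void matmul_8x8(device const float *A [[buffer(0)]],   // 8x8, row-major
                           device const float *B [[buffer(1)]],   // 8x8, row-major
                           device float       *C [[buffer(2)]])   // 8x8, row-major
    {
        simdgroup_float8x8 a, b;
        simdgroup_float8x8 c = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);
        simdgroup_load(a, A, 8);                    // 8 elements per row
        simdgroup_load(b, B, 8);
        simdgroup_multiply_accumulate(c, a, b, c);  // c = a*b + c
        simdgroup_store(c, C, 8);
    }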
Wouldn't it be more space-efficient to put that in the neural engine? Especially if MetalFX upscaling can use the ANE in GPU applications.

I wonder whether Apple plans to introduce support for lower-precision ML-optimised data types at some point. For example, if an Apple 32-wide ALU could be reconfigured as an 8x8 BF16 outer product engine (maybe leveraging the technology from AMX), that would give them effective 512 FLOPs per clock per GPU core for ML purposes. Still a far cry from Nvidia's 1024 FLOPs per SM, but would be a nice boost nevertheless.
Yes, but it's closed source, so we can't check the source code to ensure it's using the GPU optimally.

Apple's MPS (which uses the GPU only, right?)
For inference, the ANE is really only useful for computationally intensive convolutions. Tensor cores and AMX only help with training feedforward layers, training/inferencing convolutional layers, or training/inferencing attention (if you can use FlashAttention). Winograd makes inferencing 3x3 convolutions much faster than physically possible, but only works with general-purpose GPU ALUs. (Correction: https://dl.acm.org/doi/abs/10.1145/3472456.3472473)
We have to look at the dominant use cases, whether they're memory bound, and whether tensor cores would even help.
Imagine the stereotypical feedforward layer of a neural network: n neurons connect to another n neurons using n^2 axons (hereafter called weights). It's theoretically better to make those connections sparse - not every output neuron needs to depend on every input. Performing sparse evaluation is incredibly hard IRL; the optimal connectivity ratio is ~10%, while sparse matrix libraries are designed for <1% connectivity. Nvidia has hardware to simulate 50% neuron connectivity, Tachyum 37.5% connectivity, Graphcore 20% connectivity. Block-sparse algorithms are another alternative that doesn't require special hardware, but you often need very large blocks to evaluate faster than dense. Sparsity compounds with other optimizations like quantization, but both optimizations are not always used together. The 2.5-bit GPTQ algorithm seems to benefit from the sparse data pattern, despite being dense.

I doubt real-world training use cases are anywhere close to being memory-bound
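As a concrete picture of the "50% connectivity" option above (Nvidia's 2:4 structured sparsity): of every four weights only two are stored, plus 2-bit indices saying which columns they came from. The sketch below does this in plain ALU code purely for illustration; the point of the hardware support is to get the same effect at full dense tensor-core throughput. Names and layout here are made up, not any real library's.

    #include <metal_stdlib>
    using namespace metal;

    // Illustration of a 2:4 structured-sparse dot product: `values` keeps 2 of
    // every 4 weights, `indices` packs their two 2-bit positions into one byte,
    // and `act` is the dense activation vector.
    inline float sparse24_dot(device const half  *values,
                              device const uchar *indices,
                              device const half  *act,
                              uint groups)              // number of 4-wide groups
    {
        float acc = 0.0f;
        for (uint g = 0; g < groups; ++g) {
            uchar packed = indices[g];
            uint base = 4 * g;
            acc += float(values[2*g + 0]) * float(act[base + ( packed       & 3u)]);
            acc += float(values[2*g + 1]) * float(act[base + ((packed >> 2) & 3u)]);
        }
        return acc;
    }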
Have you had a chance to read https://arxiv.org/abs/2210.03052 ?
When training a feedforward neural network, you batch multiple evaluations at once (assume 32). You can fetch the weights once from RAM -> registers, then recycle them for 32 different vectors of input neurons. On a server, you can also batch numerous requests into one GPU and recycle the weights this way. But that will not decrease latency. The current generation of transformers has latency dictated mostly by the number of weights, regardless of batch size. Batch=1 runs just as fast as batch=32, but batch=1 is the dominant use case. DeepMind did some work reducing batch=1 latency by predicting the next 2-8 tokens and inferencing them in parallel. Most often, multiple predicted tokens turn out correct (according to the LLM) and you decrease the latency by a small constant factor.
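To put rough numbers on "latency dictated mostly by the number of weights" (illustrative assumptions, not measurements): take a hypothetical 7B-parameter model stored in FP16 and an M1 Max with roughly 400 GB/s of memory bandwidth. At batch=1 every generated token has to stream each weight through the chip once, so

    t_{\text{token}} \;\gtrsim\; \frac{7 \times 10^{9}\,\text{weights} \times 2\,\text{bytes}}{400\,\text{GB/s}} \;\approx\; 35\,\text{ms per token}

Batching to 32 reuses that same weight traffic across 32 sequences, which multiplies throughput but leaves the per-token latency floor where it is.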
A lot of human knowledge comes from before they are born. The human genome stores ~1 GB and their initial brain structure already has a lot more information coded in. Throwing a TB of internet text at a transformer is basically just encoding our genome. We need to harness the petabytes of YouTube video data.

(a) what if we trained networks the way we train humans? Don't just give them the entire internet at once; start by training on the constrained vocabulary of 1st grade books. Then 2nd grade, etc etc. Can we zip through these early stages very rapidly to have something that's close to optimal (high school graduate?) to throw the whole internet corpus at it?