You're throwing out a bunch of ideas, but I think the single most important point is that the SLC is designed to function primarily as a COMMUNICATIONS medium. This has a bunch of nice side effects like often acting as an L3 (lower latency) and saving a lot of energy, but COMMUNICATION is the main goal.

@name99 could you review the charts here about M1 Max latency, specifically the dip at 24 MB? Half of the SLC (in both capacity and bandwidth) is unusable by the CPU, just like half the memory bandwidth. I then need to find a way to measure GPU SLC latency (I can't find the benchmark software that Chips-and-Cheese uses). If my hypothesis is correct, the Pro/Max SLC has lower latency (150 - 250 ns) than the base M1 (266 ns) and could be optimized for incoherent accesses from ray tracing.
Given that the CPU can't use it, and that simdgroup_matrix seems to stream operands straight from L2 (based on bandwidth requirements), there seems to be no other major use for the SLC except reducing the power of off-chip accesses. That means one thing: the Pro/Max SLC is intended for GPU ray tracing.
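For what it's worth, a pointer-chase kernel along the lines below is one way to measure the GPU-side SLC latency mentioned above. This is only a sketch (host setup, GPU timestamp timing, and the randomized chain initialization are omitted), not the benchmark Chips-and-Cheese uses; sweeping the chain's buffer size past the L2 and SLC capacities should expose the latency steps.

    #include <metal_stdlib>
    using namespace metal;

    // Sketch of a latency microbenchmark: `chain` holds a randomly shuffled
    // permutation of [0, count), so every load depends on the previous one and
    // nothing can be overlapped. Time the dispatch, divide by `hops`, and sweep
    // the buffer size to see where L2 ends and the SLC/DRAM begin.
    kernel void pointer_chase(device const uint *chain  [[buffer(0)]],
                              device uint       *result [[buffer(1)]],
                              constant uint     &hops   [[buffer(2)]],
                              uint tid [[thread_position_in_grid]])
    {
        if (tid != 0) return;        // one thread only: latency, not bandwidth
        uint idx = 0;
        for (uint i = 0; i < hops; ++i) {
            idx = chain[idx];        // serially dependent loads
        }
        result[0] = idx;             // keep the loop from being optimized away
    }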
It has double the bandwidth, at least on M1. Exactly double the theoretical peak of 68.2 GB/s. This bandwidth peak is only witnessed on the M1 iGPU graphs and not found on the M1 Max CPU graphs. I have also tested my M1 Max GPU and measured bandwidth peaks at 512 KB and 48 MB.

better latency, but more or less the same bandwidth.
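The bandwidth peaks at 512 KB and 48 MB mentioned above come from a sweep that can be approximated with a streaming-read kernel like the sketch below (illustrative only; the host code that sweeps the buffer size from a few hundred KB up past the SLC capacity and divides bytes moved by elapsed time is omitted):

    #include <metal_stdlib>
    using namespace metal;

    // Sketch of a read-bandwidth kernel: every thread strides through a shared
    // buffer so the traffic stays coalesced. Plotting GB/s against buffer size
    // is what produces the peaks near 512 KB (L2) and 48 MB (SLC) on M1 Max.
    kernel void stream_read(device const float4 *src   [[buffer(0)]],
                            device float        *sink  [[buffer(1)]],
                            constant uint       &count [[buffer(2)]],  // float4 elements
                            uint tid  [[thread_position_in_grid]],
                            uint grid [[threads_per_grid]])
    {
        float4 acc = float4(0.0f);
        for (uint i = tid; i < count; i += grid) {
            acc += src[i];                              // read-only streaming
        }
        sink[tid] = acc.x + acc.y + acc.z + acc.w;      // defeat dead-code elimination
    }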
Would you consider simdgroup_matrix to be hardware-accelerated matrix multiplication?
Does the chip have to dedicate space to separate circuitry that will only be utilized during a small fraction of everyday workloads?
Do we know for a fact that Apple has removed FP16 from the GPU? When? With A15/M2?
I'm thinking of upgrading my iPad Pro when the M3 chip comes out, because of hardware ray tracing. Do you have any way to estimate the performance delta from non-RT accelerated chips? I predict it would be faster than M1 Max, but by how much? For my higher-end chip, MP sounds like a good option if the chip is truly upgradeable, and Moore's Law stays alive. I don't want to get a separate display for it, so I'd need to use it as an eGPU. Run simulations on it, stream un-upscaled frames back to the 120 Hz MBP for upscaling via MetalFX.

Anyway, Apple has published a substantial body of RT-related patents last year.
You mean for GPU? Yes, that I could believe.

It has double the bandwidth, at least on M1. Exactly double the theoretical peak of 68.2 GB/s. This bandwidth peak is only witnessed on the M1 iGPU graphs and not found on the M1 Max CPU graphs. I have also tested my M1 Max GPU and measured bandwidth peaks at 512 KB and 48 MB.
Here's the patent. It's like Phillip and I describe:

I think it depends what instruction sequence it compiles to. I would be surprised if there is anything matmul-specific there, it's probably just some combinations of SIMD FMAs like what you'd use with NEON or AMX. These matmul intrinsics are probably only there to take better advantage of the SIMD architecture that is somewhat hidden by the SIMT programming model.
As @Philip Turner wrote, there is no evidence for a dedicated FP16 pipe on M1 and later. The throughput is exactly the same as for FP32, so it's probably safe to assume that the ALU is 32-bit. And as Philip says, FP16 is used as a storage format, with free conversion during inputs and outputs. The performance advantage of using FP16 stems from the fact that registers are 16-bit addressable, so you can pack more state using 16-bit values (32-bit values are instead stored using pairs of 16-bit registers).
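To illustrate the storage-format point (a made-up example, not anything from Apple's docs): declaring values as half keeps them in single 16-bit registers, while the arithmetic itself can run through the 32-bit ALU with the conversions folded into the input and output stages.

    #include <metal_stdlib>
    using namespace metal;

    // Illustration: half as a storage format. Each `half` occupies one 16-bit
    // register (a float takes a pair), so FP16 mainly buys lower register
    // pressure and higher occupancy rather than a faster ALU on M1-class GPUs.
    kernel void scale_bias(device const half *in  [[buffer(0)]],
                           device half       *out [[buffer(1)]],
                           constant float2   &sb  [[buffer(2)]],  // {scale, bias}
                           uint tid [[thread_position_in_grid]])
    {
        half x = in[tid];                     // 16-bit load, widened on input
        float y = float(x) * sb.x + sb.y;     // arithmetic at FP32 width
        out[tid] = half(y);                   // narrowed back on output
    }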
Just as a side note, there is a very interesting recent patent where Apple describes allocating parts of the SLC as RAM. No idea whether it's already implemented (or ever will be), but I thought that's a neat idea.
More on the topic of SLC as a communication medium: something I found quite interesting is that in newer patents where on-chip networks are discussed, the schematics never show the CPU and GPU being directly connected. Of course, these are just high-level diagrams that do not show everything, but it does reinforce the idea that CPU-GPU communication primarily occurs through shared memory (which in turn means the SLC as a memory-side cache).
It depends on the scene.

FWIW, Blender gets almost 2x performance improvements with the OptiX backend over CUDA.
Do you have any way to estimate the performance delta from non-RT accelerated chips?

You may be more interested in other applications. For example:
We use RT cores to accelerate Density-Based Clustering of Applications with Noise (DBSCAN) by translating fixed-radius nearest neighbor queries to ray tracing queries. We show that leveraging the RT hardware results in speedups between 1.3x to 4x over current state-of-the-art, GPU-based DBSCAN implementations.
Of course things can change but the 2014 patent https://patents.google.com/patent/US20150205324A1 (look at Fig 6) describes (among other things, some of which are very weird and seem unlikely to be implemented) the idea of having parallel 16 and 32b FPUs, with the 32b unit shared between lanes, and this is done explicitly for power reasons.
(2018) https://patents.google.com/patent/US10699366B1 describes further details of how to share an FP32 unit between lanes.
There are other patents, like (2016) https://patents.google.com/patent/US10282169B2, which describe how to perform FP16 operations using the FP32 unit, in such a way that the correct FP16 rounding is calculated immediately. [If you just used the FP32 unit to generate an FP32 result, then rounding to FP16 adds an extra step and can, in some situations, give the incorrect rounding.]
All in all I'd say it's unclear.
- throughput tells us nothing. There's no way to execute both an FP16 and an FP32 op in the same cycle, so both FPUs could be there, each used as appropriate.
- Apple clearly have thought about what's required to implement FP16 within the FP32 unit. (The conversion of registers from FP16 to FP32 can happen as data flows into the unit, with correct rounding on the way out.)
- in the past the additional energy cost of running an FP16 calculation through the full FP32 unit (rather than through dedicated FP16 hardware) was a significant issue. But maybe they have figured out a way to bring that overhead close to zero by special handling of the relevant zero bit patterns in the FP32 unit?
By cache as RAM you mean
https://patents.google.com/patent/US10922232B1 ?
Of course I'm interested in how the chip ACTUALLY works, not just how it presents in terms of performance.

Thanks for all the info! I didn’t think about rounding and other numerical issues. It’s entirely possible that some FP16-specific circuitry is there to make sure all of this is handled correctly. Anyway, that’s an implementation detail which probably has very little (if any) practical relevance for either the developer or the user. What matters the most to me is that there are GPUs out there that have higher throughput on FP16 operations. This was also the case for several A-series chips. But M1 and later have the same FP16 and FP32 rate.
Hmm, I need some time to study that. Superficially it looks like essentially the same thing as the patent I referenced. Maybe it is the same idea, just a better implementation?
Here's the patent. It's like Phillip and I describe:
They are trying really hard to make the patent general, but once you get past the throat-clearing, it's basically about the permute patterns to use so as to run an 8x8 matrix multiply using all 32 lanes.
https://patents.google.com/patent/US11256518B2
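For concreteness, this is roughly what the exposed API looks like in Metal Shading Language (a minimal sketch; it assumes 8x8 row-major inputs and a dispatch of at least one full 32-thread simdgroup):

    #include <metal_stdlib>
    using namespace metal;

    // One simdgroup multiplies one 8x8 tile: the 64 elements of each operand
    // are spread across the 32 lanes (two per lane), which lines up with the
    // permute/layout patterns the patent describes.
    kernel void matmul_8x8(device const float *A [[buffer(0)]],   // 8x8, row-major
                           device const float *B [[buffer(1)]],   // 8x8, row-major
                           device float       *C [[buffer(2)]])   // 8x8, row-major
    {
        simdgroup_float8x8 a, b;
        simdgroup_float8x8 c = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);
        simdgroup_load(a, A, 8);                    // 8 elements per row
        simdgroup_load(b, B, 8);
        simdgroup_multiply_accumulate(c, a, b, c);  // c = a*b + c
        simdgroup_store(c, C, 8);
    }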
Wouldn't it be more space-efficient to put that in the neural engine? Especially if MetalFX upscaling can use the ANE in GPU applications.

I wonder whether Apple plans to introduce support for lower-precision ML-optimised data types at some point. For example, if an Apple 32-wide ALU could be reconfigured as an 8x8 BF16 outer product engine (maybe leveraging the technology from AMX), that would give them effective 512 FLOPs per clock per GPU core for ML purposes. Still a far cry from Nvidia's 1024 FLOPs per SM, but would be a nice boost nevertheless.
Yes, but it's closed source, so we can't check the source code to ensure it's using the GPU optimally.

Apple's MPS (which uses the GPU only, right?)
For inference, the ANE is really only useful for computationally intensive convolutions. Tensor cores and AMX only help with training feedforward layers, training/inferencing convolutional layers, or training/inferencing attention (if you can use FlashAttention). Winograd makes inferencing 3x3 convolutions much faster than physically possible, but only works with general-purpose GPU ALUs. (Correction: https://dl.acm.org/doi/abs/10.1145/3472456.3472473)
We have to look at the dominant use cases, whether they're memory bound, and whether tensor cores would even help.
Imagine the stereotypical feedforward layer of a neural network: n neurons connect to another n neurons using n^2 axons (hereafter called weights). It's theoretically better to make those connections sparse - not every output neuron needs to depend on every input. Performing sparse evaluation is incredibly hard IRL; the optimal connectivity ratio is ~10%, while sparse matrix libraries are designed for <1% connectivity. Nvidia has hardware to simulate 50% neuron connectivity, Tachyum 37.5% connectivity, Graphcore 20% connectivity. Block-sparse algorithms are another alternative that doesn't require special hardware, but you often need very large blocks to evaluate faster than dense. Sparsity compounds with other optimizations like quantization, but both optimizations are not always used together. The 2.5-bit GPTQ algorithm seems to benefit from the sparse data pattern, despite being dense.

I doubt real-world training use cases are anywhere close to being memory-bound
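As a concrete picture of the "50% connectivity" option above (Nvidia's 2:4 structured sparsity): of every four weights only two are stored, plus 2-bit indices saying which columns they came from. The sketch below does this in plain ALU code purely for illustration; the point of the hardware support is to get the same effect at full dense tensor-core throughput. Names and layout here are made up, not any real library's.

    #include <metal_stdlib>
    using namespace metal;

    // Illustration of a 2:4 structured-sparse dot product: `values` keeps 2 of
    // every 4 weights, `indices` packs their two 2-bit positions into one byte,
    // and `act` is the dense activation vector.
    inline float sparse24_dot(device const half  *values,
                              device const uchar *indices,
                              device const half  *act,
                              uint groups)              // number of 4-wide groups
    {
        float acc = 0.0f;
        for (uint g = 0; g < groups; ++g) {
            uchar packed = indices[g];
            uint base = 4 * g;
            acc += float(values[2*g + 0]) * float(act[base + ( packed       & 3u)]);
            acc += float(values[2*g + 1]) * float(act[base + ((packed >> 2) & 3u)]);
        }
        return acc;
    }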
Have you had a chance to read https://arxiv.org/abs/2210.03052 ?
When training a feedforward neural network, you batch multiple evaluations at once (assume 32). You can fetch the weights once from RAM -> registers, then recycle them for 32 different vectors of input neurons. On a server, you can also batch numerous requests into one GPU and recycle the weights this way. But that will not decrease latency. The current generation of transformers has latency dictated mostly by the number of weights, regardless of batch size. Batch=1 runs just as fast as batch=32, but batch=1 is the dominant use case. DeepMind did some work reducing batch=1 latency by predicting the next 2-8 tokens and inferencing them in parallel. Most often, multiple predicted tokens turn out correct (according to the LLM) and you decrease the latency by a small constant factor.
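To put rough numbers on "latency dictated mostly by the number of weights" (illustrative assumptions, not measurements): take a hypothetical 7B-parameter model stored in FP16 and an M1 Max with roughly 400 GB/s of memory bandwidth. At batch=1 every generated token has to stream each weight through the chip once, so

    t_{\text{token}} \;\gtrsim\; \frac{7 \times 10^{9}\,\text{weights} \times 2\,\text{bytes}}{400\,\text{GB/s}} \;\approx\; 35\,\text{ms per token}

Batching to 32 reuses that same weight traffic across 32 sequences, which multiplies throughput but leaves the per-token latency floor where it is.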
A lot of human knowledge comes from before they are born. The human genome stores ~1 GB and their initial brain structure already has a lot more information coded in. Throwing a TB of internet text at a transformer is basically just encoding our genome. We need to harness the petabytes of YouTube video data.

(a) what if we trained networks the way we train humans? Don't just give them the entire internet at once; start by training on the constrained vocabulary of 1st grade books. Then 2nd grade, etc etc. Can we zip through these early stages very rapidly to have something that's close to optimal (high school graduate?) to throw the whole internet corpus at it?