> I wouldn't say that. There are fairly big differences. Apple GPUs are among the simplest: they dedicate one ALU per thread and schedule one instruction per clock per thread. Recent AMD GPUs have two ALUs per thread and can encode up to two ALU operations per instruction (a form of VLIW). Nvidia GPUs similarly have two ALUs per thread but use a more sophisticated instruction scheduler that can issue an instruction to either ALU each clock. Combined with some execution trickery (their ALUs are 16-wide rather than 32-wide, but execute 32-wide operations in two steps), to the user it essentially looks like Nvidia schedules two instructions per clock.

Could you look at the graphs in the ALU bottlenecks and ALU layout sections of metal-benchmarks? I tried explaining this away with register cache thrashing, but it was of no use. The only coherent explanation is that their schedulers have 3 modes:
- 3x single-dispatch (16-bit only) with ~2/3 peak IPC instead of 3/4
- 2x dual-dispatch (16-bit, 32-bit) but not actually dual-dispatching 32-bit
- 1x quadruple-dispatch (32-bit only)
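The Nvidia arrangement described above, 16-wide ALUs running 32-wide operations in two steps, can be sketched with some throughput arithmetic (illustrative figures only, not measured data):

```python
# Illustrative model of the Nvidia trick described above: two 16-lane
# ALUs per scheduler, each 32-lane instruction occupying one ALU for
# two clocks.  Numbers are assumptions from this thread, not vendor data.
ALU_LANES = 16
WARP_LANES = 32
ALUS_PER_SCHEDULER = 2

clocks_per_instruction = WARP_LANES // ALU_LANES  # 2 clocks inside one ALU

# Issuing one instruction per clock, alternating between the two ALUs,
# keeps both busy, so the sustained rates are:
sustained_instructions_per_clock = ALUS_PER_SCHEDULER / clocks_per_instruction
sustained_lanes_per_clock = ALUS_PER_SCHEDULER * ALU_LANES

print(sustained_instructions_per_clock)  # 1.0 -> one full 32-wide op/clock
print(sustained_lanes_per_clock)         # 32 -> same as dual-issuing 32-wide
```

In steady state the scheduler only has to issue one instruction per clock, yet the lane throughput matches a machine that dual-issues to two 32-wide ALUs at half rate, which is why it looks like dual-dispatch from the outside.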
> BTW, a recent Apple patent suggests that Apple's next GPU (the one planned for A16, really) will be able to schedule instructions from two separate programs in an SMT-like fashion.

Source? People apply for patents years before they're granted, so this technology is probably already integrated into Apple GPUs. With M1 Max, each GPU core can run code from 3 separate programs simultaneously. With A15, that number is 4. Not sure whether it's an A/M or M1/M2 difference. Does anyone have an M2 they could test?
> Not quite sure how your numbers fit together?

Everything I mentioned was per core. In the entire GPU, the base M1 has 4096 actual scalar FP32 pipelines @ 4 cycles, or 1024 effective scalar FP32 pipelines @ 1 cycle. In the entire P-CPU block, the base M1 has 256 actual scalar FP32 pipelines @ 4 cycles, or 64 effective scalar FP32 pipelines @ 1 cycle.
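The numbers above line up if you multiply per-core figures by core count and pipeline depth. A quick check, assuming the base M1 has 8 GPU cores with 128 FP32 ALUs each and 4 P-cores with 4 128-bit SIMD pipes each (those unit counts are my assumption, not stated in this thread):

```python
FP32_LATENCY = 4  # cycles for a dependent FP32 op, per the figures above

# GPU: 8 cores x 128 FP32 ALUs (assumed layout for the base M1)
gpu_effective = 8 * 128                    # one result per ALU per clock
gpu_actual = gpu_effective * FP32_LATENCY  # ops in flight across 4 stages
print(gpu_actual, gpu_effective)           # 4096 1024

# P-CPU block: 4 cores x 4 SIMD pipes x 4 FP32 lanes (128-bit vectors)
cpu_effective = 4 * 4 * 4
cpu_actual = cpu_effective * FP32_LATENCY
print(cpu_actual, cpu_effective)           # 256 64
```

"Effective @ 1 cycle" is just "actual @ 4 cycles" divided by the pipeline depth: both describe the same hardware sustaining one result per ALU per clock.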
> It can issue one FP32 instruction per clock to each pipelined ALU, so that's 1024 FP32 ops per clock for the M1 GPU (the latency will obviously be longer, but I have no idea how to measure that on a GPU). Integer multiplication will obviously be slower; I assume you've measured that, and a 1/4 throughput rate sounds plausible. But the integer addition and subtraction issue rate is the same as FP32.

I measured latency while assuming 4x 32-bit single-dispatch, a hypothesis since proven wrong. It's physically impossible to do an FMA in 2 cycles. This assumption should mess with measurements for an "FP32, integer and conditional pipeline" but not for an "integer and complex pipeline".
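The connection between the throughput and latency claims above is Little's law: operations in flight = throughput x latency. A minimal sketch with illustrative per-core numbers (not measurements from metal-benchmarks):

```python
# Little's law applied to a pipelined ALU: the number of operations in
# flight equals sustained throughput times pipeline latency.
def pipeline_latency(ops_in_flight, ops_per_clock):
    """Cycles each op must stay in the pipeline to sustain this throughput."""
    return ops_in_flight / ops_per_clock

# A core sustaining 128 FP32 FMAs per clock with 512 ops in flight
# implies a 4-cycle FMA pipeline; a 2-cycle pipeline could only keep
# 256 in flight at that issue rate.  (Illustrative numbers.)
assert pipeline_latency(512, 128) == 4.0
assert pipeline_latency(256, 128) == 2.0
```

This is why a wrong dispatch-width assumption corrupts a latency measurement: the benchmark observes throughput and concurrency and infers latency from them, so misjudging how many operations are actually in flight per clock shifts the inferred cycle count.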