Wonder if the tricks used on CPUs to increase IPC would be applicable to GPUs?
Not really. These are very different devices with different requirements. CPUs can afford to be big and complex per core because they need to run a single thread as quickly as possible: you have multiple execution units and some fairly advanced logic that tracks instruction dependencies and executes multiple instructions in parallel and out of order. For example, part of a CPU can start speculatively executing conditional code before the condition has even been resolved.
GPUs instead need to be as area- and power-efficient as possible: you want to cram as many execution units as possible into the smallest area. But they also need to run many thousands of threads at the same time. So GPUs are optimised for throughput (run as many operations as possible per unit of time) rather than for the latency of any single thread. This leads to a simpler in-order design (no instruction reordering, limited dependency tracking) to minimise the necessary logic, but with the ability to switch between execution streams very quickly to hide latency (a rough sketch of that idea follows below).
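To make the latency-hiding point concrete, here's a toy, purely illustrative C++ model (not how any real GPU is built): an in-order scheduler keeps many threads in flight and simply issues from whichever thread is ready while the others are waiting on a long-latency operation. With enough threads the issue slot stays busy every clock, even though each individual thread still sees the full operation latency. The thread counts, latency, and instruction counts are all made up for illustration.

```cpp
// Toy model of GPU-style latency hiding: many threads in flight,
// an in-order scheduler that issues from whichever thread is ready.
// Purely illustrative; the numbers and structure are made up.
#include <cstdio>
#include <vector>

struct Thread {
    int remaining_ops = 16;  // instructions left to execute
    int ready_at = 0;        // cycle at which the next instruction may issue
};

int main() {
    const int kLatency = 4;  // cycles before a dependent instruction may issue
    for (int num_threads : {1, 8}) {
        std::vector<Thread> threads(num_threads);
        int cycle = 0, issued = 0;
        const int total = num_threads * 16;
        while (issued < total) {
            // One issue slot per cycle: pick the first thread that is ready.
            for (auto& t : threads) {
                if (t.remaining_ops > 0 && t.ready_at <= cycle) {
                    --t.remaining_ops;
                    ++issued;
                    t.ready_at = cycle + kLatency;
                    break;
                }
            }
            ++cycle;
        }
        std::printf("%d thread(s): %d ops in %d cycles -> IPC %.2f\n",
                    num_threads, total, cycle, double(total) / cycle);
    }
    return 0;
}
```

With a single thread the IPC is roughly 1/latency; with enough threads to cover the latency it approaches 1.0, which is exactly the trade the GPU makes: per-thread latency is sacrificed for aggregate throughput.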
At this point, all recent GPUs are pretty much the same. Per core, you have 256-384 KB of register file, 128 FP32 ALUs, and some variation in SFU capabilities.
I wouldn't say that. There are fairly big differences. Apple GPUs are among the simplest: they dedicate one ALU per thread and schedule one instruction per clock per thread. Recent AMD GPUs have two ALUs per thread and can encode up to two ALU operations in a single instruction (a form of VLIW). Nvidia GPUs similarly have two ALUs per thread but use a more sophisticated instruction scheduler that can issue an instruction to either ALU each clock. Combined with some execution trickery (their ALUs are 16-wide rather than 32-wide, but execute 32-wide operations in two steps), to the user it essentially looks like Nvidia schedules two instructions per clock. A toy illustration of the dual-issue idea follows.
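As a loose illustration of why dual issue doesn't automatically double throughput (this is a toy model with an invented instruction format, not AMD's or Nvidia's actual encoding or scheduling rules, and it ignores execution latency): two adjacent ALU operations can only go out in the same clock if the second one doesn't depend on the first, so the achievable rate depends on how much instruction-level parallelism the program exposes.

```cpp
// Toy dual-issue model: count cycles to run a short instruction stream
// when (a) one op issues per clock and (b) two adjacent independent ops
// may pair. Invented instruction format; not any real GPU ISA.
#include <cstdio>
#include <vector>

struct Op { int dst, src0, src1; };  // dst = f(src0, src1), registers by index

bool independent(const Op& a, const Op& b) {
    // b may pair with a only if it doesn't read a's result
    // and doesn't write the same destination register.
    return b.src0 != a.dst && b.src1 != a.dst && b.dst != a.dst;
}

int cycles_single_issue(const std::vector<Op>& ops) {
    return static_cast<int>(ops.size());  // one op per clock, in order
}

int cycles_dual_issue(const std::vector<Op>& ops) {
    int cycles = 0;
    size_t i = 0;
    while (i < ops.size()) {
        // Two adjacent ops issue in the same clock only if independent.
        if (i + 1 < ops.size() && independent(ops[i], ops[i + 1]))
            i += 2;
        else
            i += 1;
        ++cycles;
    }
    return cycles;
}

int main() {
    // r2 = r0+r1; r3 = r0*r1; r4 = r2+r3; r5 = r0-r1;  (made-up stream)
    std::vector<Op> ops = {{2, 0, 1}, {3, 0, 1}, {4, 2, 3}, {5, 0, 1}};
    std::printf("single issue: %d cycles, dual issue: %d cycles\n",
                cycles_single_issue(ops), cycles_dual_issue(ops));
    return 0;
}
```

In the AMD-style case the pairing has to be found by the compiler and baked into the instruction encoding; in the Nvidia-style case the hardware scheduler makes the choice at issue time.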
BTW, a recent Apple patent suggests that Apple's next GPU (the one planned for the A16, really) will be able to schedule instructions from two separate programs in an SMT-like fashion.
Regarding the GPU, the M1 has 512 concurrent scalar FP32 pipelines @ 4 cycles, and 128 concurrent scalar IMUL32 pipelines @ 4 cycles. That translates to 128 FP32 IPC and 32 IMUL32/C. Further, 4x FP32 vector IPC and 1x vector IMUL32/C.
Not quite sure how your numbers fit together?
From what I know, each Apple G13 GPU core contains 4x 32-wide ALUs, so that's 1024 ALUs for the M1 GPU in total (8 cores). It can issue one FP32 instruction per clock to each pipelined ALU, so that's 1024 FP32 operations per clock for the M1 GPU (the latency of each operation will obviously be longer, but I have no idea how to measure that on a GPU). Integer multiplication will obviously be slower; I assume you've measured that, and a 1/4 throughput rate sounds plausible. But the integer addition and subtraction issue rate is the same as FP32.
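For what it's worth, the two sets of numbers seem compatible if the "512 concurrent FP32 pipelines @ 4 cycles" figure is read as operations in flight per core (that reading is my assumption): 512 in flight divided by a 4-cycle pipeline gives 128 FP32 issues per clock per core, which matches 4 ALUs x 32 lanes, and 1024 per clock across 8 cores. A quick back-of-the-envelope check in C++, using the commonly reported ~1.28 GHz M1 GPU clock as an assumption:

```cpp
// Back-of-the-envelope reconciliation of the numbers above.
// The "in-flight ops / pipeline depth" reading of the 512-pipeline figure
// is an assumption; the clock speed is the commonly reported ~1.28 GHz.
#include <cstdio>

int main() {
    const int cores          = 8;   // M1 GPU cores
    const int alus_per_core  = 4;   // 32-wide ALUs per core (G13)
    const int lanes_per_alu  = 32;
    const int pipeline_depth = 4;   // cycles, as measured above

    int fp32_per_clock_core = alus_per_core * lanes_per_alu;         // 128
    int fp32_per_clock_gpu  = fp32_per_clock_core * cores;           // 1024
    int in_flight_per_core  = fp32_per_clock_core * pipeline_depth;  // 512

    double clock_ghz = 1.278;  // approximate M1 GPU clock (assumption)
    // Counting an FMA as two FLOPs gives the familiar ~2.6 TFLOPS figure.
    double tflops = fp32_per_clock_gpu * 2 * clock_ghz / 1000.0;

    std::printf("FP32 issues/clock: %d per core, %d per GPU\n",
                fp32_per_clock_core, fp32_per_clock_gpu);
    std::printf("Ops in flight per core at 4-cycle latency: %d\n",
                in_flight_per_core);
    std::printf("Peak FMA throughput: ~%.1f TFLOPS\n", tflops);
    return 0;
}
```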