The original M1 page on apple.com said the (regular) M1 supported 24,576 threads in flight. Since the max number of threads per threadgroup is fixed at 1,024 and there are only 8 GPU cores, that'd mean a single GPU core can run several threadgroups at once (up to 3, since 24,576 / 8 = 3,072 threads per core).
For the M1 Ultra the number would be 196,608 threads in flight (just like you say). High, but still far fewer than the pixels in even a 1920x1080 image (about 2.07 million), and that's a best-case scenario where you're running light shaders/kernels that don't push register pressure too far. With heavier ones you'd be limited to fewer threads per threadgroup to fit in memory, and the max threads in flight would decrease.
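To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in Python. The 24,576 and 1,024 figures are the ones quoted above; the 64-core count for the M1 Ultra is my assumption (the top configuration), and the variable and function names are made up for illustration. It also assumes the per-core limit is the same on both chips, which is exactly what the 196,608 figure implies.

```python
# Figures quoted in the comment above.
M1_THREADS_IN_FLIGHT = 24_576
M1_GPU_CORES = 8
MAX_THREADS_PER_THREADGROUP = 1_024

# Assumption: top-end M1 Ultra configuration with 64 GPU cores.
M1_ULTRA_GPU_CORES = 64


def threads_per_core(threads_in_flight: int, gpu_cores: int) -> int:
    """Threads in flight on a single GPU core, assuming an even split."""
    return threads_in_flight // gpu_cores


# Per-core capacity on the regular M1.
m1_threads_per_core = threads_per_core(M1_THREADS_IN_FLIGHT, M1_GPU_CORES)

# How many full-size (1,024-thread) threadgroups fit on one core at once.
threadgroups_per_core = m1_threads_per_core // MAX_THREADS_PER_THREADGROUP

# Scale the same per-core capacity up to the Ultra.
ultra_threads_in_flight = m1_threads_per_core * M1_ULTRA_GPU_CORES

# One thread per pixel for a 1080p image.
pixels_1080p = 1920 * 1080

print(m1_threads_per_core)       # 3072
print(threadgroups_per_core)     # 3
print(ultra_threads_in_flight)   # 196608
print(pixels_1080p)              # 2073600
print(pixels_1080p / ultra_threads_in_flight)  # a bit over 10
```

So even in the best case, a one-thread-per-pixel dispatch over a 1080p image needs over ten times as many threads as the Ultra can keep in flight at once.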
I'm not saying that scheduling doesn't play a role in the apparently poor scaling of the M1 Ultra on some tasks, but I'm not convinced that's the entirety of the problem.
As for how many shader cores/units the Ultra has: it does have 8,192 ALUs, but is every single one of them executing instructions from a different thread? Not sure how the scheduling works at that level.