Yep. Since you mention it: square root seems to be a single uop on M1, and FMA has 4 cycles of latency. With up to four FMA instructions executed per cycle, each on a 128-bit vector of four FP32 lanes, that's up to 16 FP32 FMAs per cycle.
Source: https://dougallj.github.io/applecpu/firestorm-simd.html
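
For anyone who wants the arithmetic spelled out, here's a rough sketch. It assumes 128-bit NEON vectors (four FP32 lanes) and takes the pipe count and latency from the numbers above; the function and parameter names are just mine for illustration. It also works out how many independent FMA instructions you'd need in flight to keep all pipes busy, which is simply latency times issue width.

```python
def fma_throughput(fma_pipes=4, vector_bits=128, element_bits=32, latency_cycles=4):
    """Back-of-the-envelope peak FMA numbers for one core.

    Defaults reflect the figures quoted above for M1 Firestorm;
    the 128-bit vector width is an assumption (standard NEON).
    """
    lanes = vector_bits // element_bits          # FP32 elements per vector
    fma_per_cycle = fma_pipes * lanes            # scalar FMAs completed per cycle
    flops_per_cycle = 2 * fma_per_cycle          # one FMA = multiply + add
    # To hide the latency, each pipe needs `latency_cycles` independent
    # instructions in flight, so the total is pipes * latency.
    independent_fma_instructions = fma_pipes * latency_cycles
    return {
        "lanes": lanes,
        "fma_per_cycle": fma_per_cycle,
        "flops_per_cycle": flops_per_cycle,
        "independent_fma_instructions": independent_fma_instructions,
    }


result = fma_throughput()
print(result)
```

So a loop with a single dependent accumulator would get nowhere near the peak; under these assumptions you'd want around 16 independent vector accumulators to saturate the FMA units.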