The problem is present in the entire field of bioinformatics, not just in my code. From my perspective, it's a hardware issue. Modern processors are not suited for running code written in popular programming languages. If you try to write clear idiomatic code, you leave most of the potential performance on the table.
Regarding out-of-order (OOO) execution, my specialty is space-efficient data structures for large in-memory datasets. When I benchmark the data structures, I run both independent and chained queries. With independent queries, I just execute the operations with precomputed arguments in a tight loop. With chained queries, I adjust each precomputed argument by the result of the previous query, so the next query can't start until the previous one has finished.
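To make the distinction concrete, here is a minimal C++ sketch of the two loops. The names `run_independent` and `run_chained`, the `universe` parameter, and the "add the previous result and wrap around" adjustment are all assumptions made up for illustration, not the actual benchmark code. A real benchmark would probably use a cheaper adjustment than a modulo and would wrap the loops in timing code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Independent queries: every argument is known before the loop starts, so
// the CPU is free to overlap the memory accesses of several iterations.
template <class Query>
std::uint64_t run_independent(Query query, const std::vector<std::uint64_t>& args) {
    std::uint64_t checksum = 0;
    for (std::uint64_t arg : args) {
        checksum += query(arg);
    }
    return checksum;  // returned so the compiler cannot discard the queries
}

// Chained queries: each argument is adjusted by the previous result, which
// creates a data dependency from one query to the next.
// `universe` (hypothetical) is the number of valid query arguments.
template <class Query>
std::uint64_t run_chained(Query query, const std::vector<std::uint64_t>& args,
                          std::uint64_t universe) {
    assert(universe > 0);
    std::uint64_t checksum = 0;
    std::uint64_t prev = 0;
    for (std::uint64_t arg : args) {
        prev = query((arg + prev) % universe);  // depends on the previous query
        checksum += prev;
    }
    return checksum;
}
```

Dividing the chained running time by the independent running time for the same query gives a rough estimate of how many queries the processor manages to overlap.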
OOO mostly helps with primitive low-level operations. For example, with rank queries (count the number of set bits in prefix 0..i) on an uncompressed bitvector with two levels of indexes, M2 Max seems to be able to run 8-10 queries in parallel. This is a clear improvement over Intel CPUs, which can only run 4-5. If I try something more complicated, such as select queries (find the i-th set bit) on a sparse Elias-Fano encoded bitvector, the number of parallel queries drops to around 2 on both platforms. And these are still primitive operations. Once I start benchmarking something relevant to the user of the data structure, the speed difference between independent and chained queries is only something like 10-30%.
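For readers who haven't seen one, a rank query on an uncompressed bitvector with two levels of indexes can be sketched roughly as follows. This is an illustrative layout, not the one benchmarked above: the class name `RankBitvector`, the 512-bit superblocks, and the per-word second-level counts are all assumptions, and `rank(i)` is taken to count set bits in positions [0, i). `std::bitset` is used as a portable popcount; tuned code would use a hardware popcount instruction directly.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Portable popcount for one 64-bit word.
inline std::uint64_t popcount64(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Hypothetical plain bitvector with a two-level rank index.
class RankBitvector {
public:
    explicit RankBitvector(std::vector<std::uint64_t> words)
        : words_(std::move(words)) {
        std::uint64_t total = 0;   // set bits before the current word
        std::uint16_t within = 0;  // set bits since the superblock started
        for (std::size_t w = 0; w < words_.size(); ++w) {
            if (w % kWordsPerSuper == 0) {
                super_.push_back(total);  // level 1: absolute count
                within = 0;
            }
            block_.push_back(within);     // level 2: count relative to level 1
            std::uint64_t ones = popcount64(words_[w]);
            total += ones;
            within = static_cast<std::uint16_t>(within + ones);
        }
    }

    // Number of set bits in positions [0, i); requires i < 64 * number of words.
    std::uint64_t rank(std::uint64_t i) const {
        assert(i < 64 * words_.size());
        std::size_t w = static_cast<std::size_t>(i / 64);
        std::uint64_t mask = (std::uint64_t{1} << (i % 64)) - 1;
        return super_[w / kWordsPerSuper] + block_[w] + popcount64(words_[w] & mask);
    }

private:
    static constexpr std::size_t kWordsPerSuper = 8;  // 8 * 64 = 512 bits

    std::vector<std::uint64_t> words_;  // the raw bits, 64 per word
    std::vector<std::uint64_t> super_;  // one entry per superblock
    std::vector<std::uint16_t> block_;  // one entry per word
};
```

A query touches three arrays and ends with a single popcount, with no data-dependent branches. That is the kind of short, predictable operation where several queries can be in flight at once.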
The OOO capabilities of modern processors are still limited enough that abstractions usually break them. Instead of composing high-level operations out of well-defined low-level operations, you have to engineer the high-level operation as a single unit. You may also need different versions of the operation for different platforms, if their OOO capabilities are different enough. This all takes time and effort, which are expensive.