Will x86 be dead in a decade?

JouniS · Aug 5, 2023

leman said:
Are you running each task on a separate thread or are you using a loop over the tasks? The former is suboptimal on a deep OOO machine; the latter can schedule dozens or hundreds of tasks on a single core simultaneously, hiding latency.

Separate threads, obviously. Software development is already slow enough in a research environment, because you rarely know in advance what the code is ultimately supposed to do. To enable faster iteration, you write simple single-threaded code that does one thing at a time. That code then parallelizes trivially over tens of threads, because you've learned to avoid situations where individual tasks compete for shared resources.
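As a rough illustration of the pattern described above (simple single-threaded tasks, one per thread, with no shared state), here is a minimal Python sketch. The task `count_gc` and the helper `run_all` are hypothetical names invented for this example, not code from the thread. Note also that CPython's global interpreter lock keeps pure-Python CPU-bound work from actually running in parallel, so this shows only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(sequence: str) -> int:
    """A simple single-threaded task: count G and C bases in one sequence."""
    return sum(1 for base in sequence if base in "GC")


def run_all(sequences, threads=4):
    """Run one independent task per input across a pool of threads."""
    # Each task reads only its own input and writes nothing shared,
    # so no locks or other coordination are needed.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(count_gc, sequences))
```

The point of the design is that `count_gc` knows nothing about threads; the parallelism is added entirely from the outside.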

Any hardware that requires platform-specific optimization is suboptimal for real-world code. And if platform-specific optimization is unavoidable, you don't do it for a minority platform.

leman · Aug 5, 2023

JouniS said:
Any hardware that requires platform-specific optimization is suboptimal for real-world code. And if platform-specific optimization is unavoidable, you don't do it for a minority platform.

Deep out-of-order processors are hardly minority platforms. If your implementation inhibits instruction-level parallelism, you are leaving performance on the table on pretty much any modern fast CPU. It’s no accident that loop unrolling is considered one of the most impactful optimizations on x86.

But of course, this all depends on how your code is structured. It’s impossible to make a clear judgement without inspecting and profiling your code. But my first hunch after discussing this with you is that you might get better results on both x86 and Apple by restructuring the code to take better advantage of out-of-order execution.
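To make the instruction-level parallelism point concrete, here is a minimal sketch of the usual unrolling idea: splitting one long dependency chain into several independent ones. The function names are invented for this example. It is written in Python only to show the structure; interpreter overhead hides the effect there, and any speedup would only be expected from the same shape in a compiled language.

```python
def sum_single_chain(values):
    """One accumulator: every addition depends on the previous one."""
    total = 0
    for v in values:
        total += v
    return total


def sum_four_chains(values):
    """Four accumulators: four independent dependency chains."""
    a = b = c = d = 0
    n = len(values) - len(values) % 4  # largest multiple of 4 that fits
    for i in range(0, n, 4):
        # These four additions do not depend on each other, so an
        # out-of-order core could work on them at the same time.
        a += values[i]
        b += values[i + 1]
        c += values[i + 2]
        d += values[i + 3]
    for v in values[n:]:  # leftover elements
        a += v
    return a + b + c + d
```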

VivienM · Aug 5, 2023

gpat said:
Double-click your 20 year old exe, it will still launch.

If you want to push this a little further, get your hands on the 'rare' 32-bit edition of MS Office 4.2 for Windows. (Like most vintage software, it was rare back in the day, but can be found with a modest amount of Googling on Windows/PC equivalents of Macintosh Garden.) You can run it on Windows 11 x64... so I think that's probably a 29-year-old exe or so. Likely among the oldest software that can still run on x64 Windows.

JouniS · Aug 5, 2023

leman said:
Deep out-of-order processors are hardly minority platforms. If your implementation inhibits instruction-level parallelism, you are leaving performance on the table on pretty much any modern fast CPU. It’s no accident that loop unrolling is considered one of the most impactful optimizations on x86.

But of course, this all depends on how your code is structured. It’s impossible to make a clear judgement without inspecting and profiling your code. But my first hunch after discussing this with you is that you might get better results on both x86 and Apple by restructuring the code to take better advantage of out-of-order execution.

The problem is present in the entire field of bioinformatics, not just in my code. From my perspective, it's a hardware issue. Modern processors are not suited for running code written in popular programming languages. If you try to write clear idiomatic code, you leave most of the potential performance on the table.

Regarding OOO execution, my specialty is space-efficient data structures for large in-memory datasets. When I benchmark the data structures, I run both independent and chained queries. With independent queries, I just execute the operations with precomputed arguments in a tight loop. With chained queries, I adjust the precomputed arguments by the result of the previous query.
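The two benchmark modes described above could be sketched roughly as follows. This is an assumed structure, not the actual benchmark code: the function names, the checksum, and the way the argument is adjusted (adding the previous result modulo `size`) are all choices made for this illustration.

```python
def run_independent(query, args):
    """Independent queries: each call uses only its precomputed argument."""
    checksum = 0
    for a in args:
        # No call needs the result of an earlier call, so the hardware
        # is free to overlap several queries.
        checksum += query(a)
    return checksum


def run_chained(query, args, size):
    """Chained queries: each argument is adjusted by the previous result."""
    checksum = 0
    prev = 0
    for a in args:
        # The next argument is unknown until the previous query finishes,
        # which forces the queries to run one after another.
        prev = query((a + prev) % size)
        checksum += prev
    return checksum
```

Comparing the time per query in the two modes gives an estimate of how many independent queries the processor manages to keep in flight at once.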

OOO mostly helps with primitive low-level operations. For example, with rank queries (count the number of set bits in prefix 0..i) on an uncompressed bitvector with two levels of indexes, M2 Max seems to be able to run 8-10 queries in parallel. This is a clear improvement over Intel CPUs, which can only run 4-5. If I try something more complicated, such as select queries (find the i-th set bit) on a sparse Elias-Fano encoded bitvector, the number of parallel queries drops to around 2 on both platforms. And these are still primitive operations. Once I start benchmarking something relevant to the user of the data structure, the speed difference between independent and chained queries is only something like 10-30%.
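For readers unfamiliar with the structure, here is a minimal sketch of rank on an uncompressed bitvector with two levels of indexes. The layout is an assumption made for this example (64-bit words, superblocks of 8 words, rank counting set bits in positions [0, i)); real implementations differ in block sizes and memory layout, and the class name is hypothetical.

```python
class RankBitVector:
    WORD_BITS = 64
    WORDS_PER_SUPER = 8  # 512-bit superblocks; an arbitrary choice for this sketch

    def __init__(self, bits):
        self.n = len(bits)
        # One extra word keeps rank(n) in range when n is a multiple of 64.
        self.words = [0] * (self.n // self.WORD_BITS + 1)
        for pos, bit in enumerate(bits):
            if bit:
                self.words[pos // self.WORD_BITS] |= 1 << (pos % self.WORD_BITS)
        self.super_counts = []  # level 1: set bits before each superblock
        self.word_counts = []   # level 2: set bits before each word, within its superblock
        total = 0
        in_super = 0
        for w, word in enumerate(self.words):
            if w % self.WORDS_PER_SUPER == 0:
                self.super_counts.append(total)
                in_super = 0
            self.word_counts.append(in_super)
            ones = bin(word).count("1")
            total += ones
            in_super += ones

    def rank(self, i):
        """Number of set bits in positions [0, i), for 0 <= i <= n."""
        if not 0 <= i <= self.n:
            raise IndexError(i)
        w, offset = divmod(i, self.WORD_BITS)
        mask = (1 << offset) - 1  # keeps only the bits below position i in word w
        return (self.super_counts[w // self.WORDS_PER_SUPER]
                + self.word_counts[w]
                + bin(self.words[w] & mask).count("1"))
```

Each query is two index lookups plus one masked popcount, with no loops or data-dependent branches, which is the kind of short operation where several independent queries can plausibly overlap.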

The OOO capabilities of modern processors are still limited enough that abstractions usually break them. Instead of composing high-level operations out of well-defined low-level operations, you have to engineer the high-level operation as a single unit. You may also need different versions of the operation for different platforms, if their OOO capabilities are different enough. This all takes time and effort, which are expensive.

ArkSingularity · Aug 5, 2023

JouniS said:
The problem is present in the entire field of bioinformatics, not just in my code. From my perspective, it's a hardware issue. Modern processors are not suited for running code written in popular programming languages. If you try to write clear idiomatic code, you leave most of the potential performance on the table.

Regarding OOO execution, my specialty is space-efficient data structures for large in-memory datasets. When I benchmark the data structures, I run both independent and chained queries. With independent queries, I just execute the operations with precomputed arguments in a tight loop. With chained queries, I adjust the precomputed arguments by the result of the previous query.

OOO mostly helps with primitive low-level operations. For example, with rank queries (count the number of set bits in prefix 0..i) on an uncompressed bitvector with two levels of indexes, M2 Max seems to be able to run 8-10 queries in parallel. This is a clear improvement over Intel CPUs, which can only run 4-5. If I try something more complicated, such as select queries (find the i-th set bit) on a sparse Elias-Fano encoded bitvector, the number of parallel queries drops to around 2 on both platforms. And these are still primitive operations. Once I start benchmarking something relevant to the user of the data structure, the speed difference between independent and chained queries is only something like 10-30%.

The OOO capabilities of modern processors are still limited enough that abstractions usually break them. Instead of composing high-level operations out of well-defined low-level operations, you have to engineer the high-level operation as a single unit. You may also need different versions of the operation for different platforms, if their OOO capabilities are different enough. This all takes time and effort, which are expensive.

People often resort to hand-crafting these kinds of optimizations at the assembly level to get maximum performance out of these kinds of algorithms, particularly when SIMD is required.

The disadvantage is that this has to be done for every architecture. It's not really a good tradeoff except where it is absolutely necessary.

leman · Aug 6, 2023

JouniS said:
The problem is present in the entire field of bioinformatics, not just in my code.

I didn’t mean to criticize your code; I understand this is a general problem that requires time, effort, and very specific expertise to solve. My main point is that I disagree that what you observe is a memory latency effect.

JouniS said:
From my perspective, it's a hardware issue. Modern processors are not suited for running code written in popular programming languages. If you try to write clear idiomatic code, you leave most of the potential performance on the table.

Yep. OOO is a leaky abstraction. We need languages that do these things better. That’s why I like, for example, Apple’s new async runtime, which does thread compaction and attempts to maximize heterogeneous work done per thread.

leman · Aug 6, 2023

ArkSingularity said:
People often resort to hand-crafting these kinds of optimizations at the assembly level to get maximum performance out of these kinds of algorithms, particularly when SIMD is required.

The disadvantage is that this has to be done for every architecture. It's not really a good tradeoff except where it is absolutely necessary.

That’s a different topic though. We are not talking about optimizations that aim to use the ISA optimally; we are talking about structuring code so that it runs well on any modern out-of-order implementation.

The way I like to think about it is that there are different types of optimizations. Some require you to know the exact details of the implementation, while others exploit more abstract properties. For example, you don’t need to know the size of the CPU cache or the details of the branch prediction mechanism to exploit the fact that caches and branch prediction exist. And it is possible to achieve very significant performance improvements this way, without optimizing for a particular CPU.

JouniS · Aug 6, 2023

leman said:
Yep. OOO is a leaky abstraction. We need languages that do these things better. That’s why I like, for example, Apple’s new async runtime, which does thread compaction and attempts to maximize heterogeneous work done per thread.

Platform-specific runtimes are part of the problem. As long as every OS and every programming language keeps doing things in its own way, code that needs to be widely applicable will target the lowest common denominator. In this case it's threads, which are available almost anywhere. And they are also easy to use, as long as you are not trying to be clever.

leman · Aug 7, 2023

JouniS said:
Platform-specific runtimes are part of the problem. As long as every OS and every programming language keeps doing things in its own way, code that needs to be widely applicable will target the lowest common denominator. In this case it's threads, which are available almost anywhere. And they are also easy to use, as long as you are not trying to be clever.

Threads are a good abstraction for islands of sequential execution, but a bad abstraction for performance or concurrency. If you want the latter two, using threads becomes much more complicated. A much better abstraction would be an API like libdispatch (even though it has its own problems), but everything is ultimately blocked by the committees’ reluctance to add closures to the C ABI.

JouniS · Aug 7, 2023

leman said:
Threads are a good abstraction for islands of sequential execution, but a bad abstraction for performance or concurrency. If you want the latter two, using threads becomes much more complicated. A much better abstraction would be an API like libdispatch (even though it has its own problems), but everything is ultimately blocked by the committees’ reluctance to add closures to the C ABI.

Threads are a good abstraction for performance and concurrency, because they work in the real world. You can often even use existing code that doesn't know anything about threads, with at most minimal changes.

You say that everything is blocked by committees, but dealing with the committees is the easy part. The hard part is rewriting all the code in the world to work with the new standard. That includes all the countless niches where there are not enough developers in the world to support and maintain the necessary publicly available libraries.
