*sometimes the number of threads is set to twice the number of cores in SMT-capable (multithreading) machines, since they were designed to be able to run two threads per core. Didn't make much of a difference in any workload I ever tried, since the point of SMT (someone correct me if wrong) is to keep the core ALU busy and that's not a problem for numerical simulations since they're ALU-heavy by definition.
SMT isn't just aimed at the ALUs. Any functional unit in the core that sits idle for a substantial amount of time could get utilized by another thread. SMT isn't going to get traction on microbenchmarks where the whole problem can be loaded into cache and the core just runs a very tight loop over a relatively limited memory area.
SMT gets traction where there are dynamic execution bubbles that impede keeping the units active. For example, waiting on data to load before an arithmetic op can proceed. If one of the operands is coming from the persistent storage disk, then nothing is going to happen for a long while. If the core can slide in another thread that can make some "slow" progress, that is better for aggregate throughput than nothing at all. (Or than waiting until the multithreading time-slice quantum is over to swap the delayed thread out and put in some other work.)
There are a limited number of virtual addresses that can be translated with low overhead. Limited "local" data (e.g., if using MPI to get data off another supercomputer node). Caps on how many different branches the predictor can track your way through. (A long sequence of if-thens with tests that shift on immediate results, versus a loop-end condition test which almost always branches the same way: for 'while i < 100000', 99,999 times the answer is 'yes'.)
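That branch-predictability contrast can be sketched in C (the function names and the >= 128 threshold are illustrative, not from any real benchmark): the first loop's only branch is the loop-exit test, which goes the same way every iteration; the second takes a data-dependent branch that random data would make unpredictable.

```c
/* Predictable vs. data-dependent branches. Illustrative sketch. */
#include <stddef.h>
#include <stdint.h>

/* Predictable: the only branch is the loop-exit condition, which is
 * 'yes' n times in a row and 'no' once, so the predictor nails it. */
uint64_t sum_all(const uint8_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unpredictable (for random data): the taken/not-taken direction of
 * 'a[i] >= 128' shifts with each element, defeating the predictor. */
uint64_t sum_big(const uint8_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= 128)   /* branch direction depends on the data */
            s += a[i];
    return s;
}
```

On random input the second loop mispredicts roughly half the time, and each mispredict is a pipeline flush, i.e. more bubbles for SMT to fill.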
Everything in the same virtual address space, all the data loaded into "local" RAM. One big loop with no internal branches. A 99% L3 cache hit rate. In that context, SMT won't get much traction if the core has already thrown lots of resources at being hyper-aggressive about pulling instructions through "out of order".
I assume context switches here are less of a problem since SMT-capable cores have two sets of registers, so context switching between (just) two threads should be cheaper, so performance neither improves nor decreases.
SMT has two dynamically sized subsets of one register pool. That is different from a kind of "threaded" execution where whole separate register sets are deliberately dedicated to each thread. With register renaming for out-of-order execution, the main thread is already using more physical registers than are explicitly named in the application's binary instruction set. SMT overlays on top of the register renaming that was there anyway. (They are being renamed regardless, so there is not much overhead to rename into two sets.)
If the speculative execution path is stalled on "thread 1", then that much larger pool of "extra" rename registers isn't being used anyway... so just consume it with something else. The bookkeeping for which rename registers belong to which set is nowhere near the area and datapath consumption of the registers themselves. An incrementally larger pool could be useful, but 100% growth is not likely required to get some incremental increase in performance.
Many benchmarks get around this problem by measuring throughput (like the anaconda graph above) instead of time to finish a task. For example, Geekbench's N-body physics simulation measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.
It takes it out until it doesn't. If one creates 10,000 threads versus 100 threads, then the OS thread overhead is going to go up. That can have a cascade impact on scheduling. The data structures are just bigger. As threads get "starved" for time, a fair scheduler will push some others out. There are cache misses for that. More cache misses mean more bubbles. More bubbles mean more delays in throughput. There is a longer line until a thread gets back to the front of the queue to be allocated time. (And again, whatever it had in cache has likely been pushed out, unless all the threads are working on almost exactly the same duplicative data.)
Throughput is what matters, because getting to results later than needed won't help much.
Windows PCs are still being sold with cut-down main RAM and an HDD as the sole system drive; there SMT would have more traction to "bite" into. A server with 10,000 remote users feeding data over < 1 Gbps links: same thing.