Is it really a 2x speed increase, though? Real-world use is quite different from a synthetic benchmark that is specifically made to make the CPU do dumb work on all cylinders.
Well, it does depend, but it's close to 2x, which is why I put the "~" there. For example, if I bought a workstation with the 1x1660 vs 2x2630s, even synthetic benchmarks aren't going to show 2x the speed, because 1x2630 < 1x1660 in performance. So yeah, it can be more like 1.5x to a bit over 2x; it just depends on your workflow and how much extra you pay to get the higher-end 26xx processors.
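That back-of-the-envelope comparison can be sketched in a few lines. The throughput figures below are hypothetical placeholders (not real benchmark scores), and the 0.95 second-socket scaling factor is an assumption:

```python
# Rough dual-socket vs single-socket comparison.
# NOTE: the scores below are HYPOTHETICAL placeholders, not real
# benchmark results -- substitute your own measurements.

def dual_vs_single(single_cpu_score, dual_cpu_score, scaling_efficiency=0.95):
    """Ratio of a 2-socket box to a 1-socket box, assuming the second
    socket contributes `scaling_efficiency` of another full CPU."""
    dual_total = dual_cpu_score * (1 + scaling_efficiency)
    return dual_total / single_cpu_score

# If one 2630 were, say, 80% as fast as one 1660 (placeholder figure):
ratio = dual_vs_single(single_cpu_score=1.0, dual_cpu_score=0.8)
print(f"2x2630 vs 1x1660: ~{ratio:.2f}x")  # lands in the 1.5x-2x range
```

The point is just that the per-socket performance gap eats into the naive "2x" before workload scaling even enters the picture.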
You will get much more than a 12x improvement; no one would build a 1000-node cluster if that weren't the case. If the problem you are trying to solve cannot scale past some low number of cores, then a cluster is probably not the right solution.
I think you're mistakenly assuming that the users are building the cluster for their own use. Most of us sign up and pay for a cluster that already has thousands of users, or work at a large corporation that has a cluster which is also heavily used by other employees. Clusters are rarely built so that some small set of users gets a huge (>12x) performance boost; rather, they are built to provide moderate performance boosts to large numbers of users. Occasionally, yes, I have to run jobs that are embarrassingly parallel and can submit thousands of jobs (if the Torque manager even allows it!), but that's rare compared to how often I just need 16, 32, maybe 64 cores.
Also, queue times, and possibly up/download times, are the main problem. You can't just walk up and get those 1000 cores right away like you can with your 12-core DP workstation. Usually you run into long queues or Torque manager limitations when submitting huge jobs that need >100 nodes.
And even well-designed code hits diminishing returns at very large core counts. Usually the issue is the latency of moving work to and from so many nodes, and the organizer node(s)' I/O gets saturated. Where that happens depends on your jobs: I've seen some hit it around 20 cores, some at 50, some at 500. But to see a 12x speedup over my 12-core computer, I'd need 144 distributed cores working as efficiently as my SMP workstation, which is very unlikely; realistically I'd need 200, even 300 cores. And in my experience, queue times for jobs using that many cores, even on relatively lightly used clusters, are often in the several-days-to-weeks range. Even if the job would take 2 months on my computer and only 3 days once started on a cluster, if it takes a week just to start, you're only getting about a 6-fold increase in speed.
Maybe your field is different, but in mine very few things scale well past around 100 cores. Many jobs could use thousands, but they need shared memory, not distributed memory, and no coding tricks are going to change that fact.