Because these are fairly small, low-featured cores. They aren't the same cores you'd find in a real Xeon; they're much more cut down.
Not really when it comes to the vector instructions (SIMD, SSE4). Gobbling double floats and doing math on them is about the same. There may be "old" Pentium branch prediction and simpler integer pipelines, but the vector stuff is pretty modern.
The die for these is probably going to come out around E7 size, which is quite large.
Besides, OpenCL is a C style language.
The syntactic sugar isn't the issue; the syntax "style" isn't particularly relevant. More relevant is that OpenCL code is meant to be composed as a chunk that does something, not a whole application or program. It is also composed in a context that assumes the data being worked on is in a separate, distinct memory block, and that these "chunks" (or jobs) are created, put onto a job queue, and dispatched to where they do the work.
In other words, it forces application developers to factor the massively parallel parts out from the rest of the application. To get the biggest bang for the buck, apps will need to keep the highly serial fragments on the host Xeon E3/E5 and dispatch smaller parallel fragments to the Xeon Phi.
There is never going to be a perfect split. Some parallel code will end up in the more highly serial portion and some serial code will end up in the more highly parallel section. For example, if there is a relatively brief loop over the matrix to examine the current state, that could stay on the host CPU (no need to dispatch to the Phi just for that, since copying the data there and back takes time). Likewise, if there is a longer if/then/else-if or switch with substantive conjunctions (ands and ors) buried inside of a loop, then the Phi will generally handle it better than the GPUs.
It doesn't matter. If the Phi's individual cores are faster, it can do the work in the same time.
They probably aren't going to be 2x faster clock-wise. It is more "faster through the serial sections of code you can't get rid of" than faster outright. Don't think single floats are going to be the sweet spot for Phi either. (If they were, this would have had a chance at being a discrete GPU... it isn't.)
Again, none of this matters if the individual Xeon cores are faster, which is what Intel is betting on.
I think Intel is betting much bigger on easier ports from current supercomputer code.
Knights Corner is not supposed to be an x86 machine to developers.
An x86 machine to assembly-code hackers? No. An x86 machine that runs the Linux, OpenMP, and OpenMPI that many are used to and have established code bases for? Yes. Intel certainly is not saying "throw all of those away and start over from scratch".
The only reason Intel went x86 was because they have existing x86 investments. When Knights Corner was a GPU, you couldn't even access its instruction set.
But it wasn't running Linux either.
Besides, like I said, if you want to use combinations of multiple Knights Corners, your best bet is OpenCL anyway.
Not really. Perhaps on Mac OS X specifically. But so far there is no announced (or even hinted) support for Mac OS X.
http://www.hpcwire.com/hpcwire/2012..._gather_developer_mindshare.html?featured=top
and
".... GCC also does not include support for any offload directives at this point. While the Intel compiler has LEO (Language Extensions for Offload), we are hopeful of a standard that brings together the GPU-only OpenACC with the more flexible Intel LEO offload model for coprocessors. We are engaged in the OpenMP committee hoping to find such a combination to be generally applicable, integrated and fully compatible with OpenMP (including the ability to offload OpenMP code) ... "
http://software.intel.com/en-us/blogs/2012/06/05/knights-corner-open-source-software-stack/
I don't think OpenCL is likely to get anywhere near peak performance out of Phi set-ups. For example, MPI processes on individual cards could do point-to-point memory transfers. OpenCL can't do that; that's not its architecture. It is an "easy" offload to get some results, but it is doubtful the efficiency ratio is going to be extremely high.
A very large fraction of Phi deployments are going to be into supercomputers, and low efficiency ratios aren't going to cut it there. In the minor subcontext of Mac OS X apps, yeah, a higher fraction are going to be OpenCL, but it is an open question whether Apple puts in the work to even bring the card to the Mac Pro. [They should, to broaden the user base, but they should have had something ready for this summer too. They didn't.]
Intel may put the Phi on an eventual track where it gets off the PCI-e bus and onto something more like the QPI bus. It depends upon what they do with the Cray interconnect technology they bought and the Aries chipset.
http://www.theregister.co.uk/2012/04/25/intel_cray_interconnect_followup/
While Aries is PCI-e v3.0 based, that doesn't necessarily mean a discrete card (just like Thunderbolt controllers having PCI-e inputs doesn't necessarily mean a discrete card).