In terms of what types of directives/languages will be supported on Phi, this document gives a really good idea http://www.intel.com/content/dam/ww...gh-performance-xeon-phi-coprocessor-brief.pdf

It appears it will be OpenMP, OpenCL, MPI, TBB, and Co-Array Fortran capable, using C, C++, CL, and F90+.

It's actually a godsend in my field (computational fluids/combustion). Rewriting one of our research-grade codes to use OpenCL or CUDA is not an option; most of these started in F77 and are still in use to this day. What this card allows us to do is slap in an extra MPI directive and use the cards as if they were an additional compute node. Sure, there will probably be a few additional steps (because life is never easy), but believe me, it's much better than trying to rewrite my PhD thesis to run on a completely different architecture.
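For the curious, the shape of such a code is nothing exotic. Here's a minimal sketch in C (my own toy example, not from any production code) of the kind of flat MPI program that doesn't care whether a rank lands on a host node or on the card:

```c
/* Minimal MPI domain decomposition: each rank works on its own slice.
 * If the Xeon Phi shows up as an ordinary node to the MPI launcher,
 * this same code runs there unchanged. Hypothetical sketch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank integrates its own stride of [0,1); host ranks and
     * coprocessor ranks are indistinguishable at this level. */
    const long n = 100000000;
    double local = 0.0, total = 0.0;
    for (long i = rank; i < n; i += size) {
        double x = (i + 0.5) / n;
        local += (1.0 / n) * (4.0 / (1.0 + x * x));   /* midpoint rule for pi */
    }

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("pi ~= %.12f\n", total);
    MPI_Finalize();
    return 0;
}
```

The decomposition is by rank, so pulling the card's cores in is a matter of how you launch mpirun, not of touching the source.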
 
Well, a few reasons...

As mentioned above, OpenCL is compiled on the fly. That way it doesn't matter whether you are running on ARM, on a GPU, on a Xeon Phi, or on an i7. The code is going to run. In fact, you could mix and match. You could have a GPU and a Xeon Phi both in the same machine, and run the same code over both at once. OpenCL can already do this on Mac OS X with multiple graphics cards, no SLI or Crossfire required.
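Roughly, the host side looks like this; a hedged sketch against the standard OpenCL C API (error checking omitted, and device ordering varies per machine). The point is that the same kernel source gets built, on the fly, for every device found:

```c
/* Enumerate every OpenCL device on the first platform and build the
 * same kernel source for each -- GPU, CPU, or accelerator alike.
 * Sketch only: real code would check every error code. */
#include <CL/cl.h>
#include <stdio.h>

const char *src =
    "__kernel void scale(__global float *x) {"
    "  size_t i = get_global_id(0); x[i] *= 2.0f; }";

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev[8];
    cl_uint ndev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 8, dev, &ndev);
    if (ndev > 8) ndev = 8;

    for (cl_uint i = 0; i < ndev; i++) {
        cl_context ctx = clCreateContext(NULL, 1, &dev[i], NULL, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        /* Compiled on the fly, per device -- same source everywhere. */
        clBuildProgram(prog, 1, &dev[i], NULL, NULL, NULL);

        char name[128];
        clGetDeviceInfo(dev[i], CL_DEVICE_NAME, sizeof name, name, NULL);
        printf("built kernel for: %s\n", name);

        clReleaseProgram(prog);
        clReleaseContext(ctx);
    }
    return 0;
}
```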

I was under the impression that the people who cared about Phi were the kind of people who needed libraries like GMP and MPFR to do anything with any reasonable amount of accuracy, and OpenCL supports neither.

The other good reason is that C and Objective C are really not optimized at all for these sorts of processors, while OpenCL is. Imagine if you put 4 of these in a machine. Traditional languages like C and Objective C aren't really meant to run over 4 entirely different "computers" at once, while OpenCL is.

This thing is really a lot like a GPU that uses x86, not really a CPU. If you look at the project history, that's actually how it started. Intel was trying to build an x86 GPU.

If C, C++, and Objective C aren't really optimized for an x86 processor, then what is?

Scale matters, and that's what I'm confused about.

The Phi has 50 fairly beefy cores built into it. A GPU has hundreds or thousands of tiny and very simple cores. If your OpenCL code is designed to run the smallest workload possible executing a thousand times concurrently, then what is going to happen when you run the same code on the Phi?

Now you've only got 50 cores, but your OpenCL code still assumes that those cores are less capable than they physically are. Unless Intel has a way of making a single Phi core act like more than one compute unit (or a single CU with more than one processing element), then I can't imagine that OpenCL stuff is going to run on the Phi very well without being rewritten anyway to take advantage of being able to run a larger workload with fewer concurrent threads.
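To be fair, well-written host code queries the device instead of hard-coding a geometry, something like the sketch below (standard API calls, illustrative only). But that only helps if the kernel itself was written with flexible work sizes in mind, which is exactly the question:

```c
/* Ask the device how it presents itself to OpenCL; a Phi-style part
 * would report far fewer compute units than a GPU, so work-group
 * sizing has to adapt. Hypothetical sketch. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    cl_uint cus;
    size_t wg;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cus, &cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof wg, &wg, NULL);
    printf("compute units: %u, max work-group size: %zu\n", cus, wg);
    return 0;
}
```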

-SC
 
...Sure, there will probably be a few additional steps (because life is never easy), but believe me, it's much better than trying to rewrite my PhD thesis to run on a completely different architecture.

What? A Ph.D. student trying to do it the easy way? You better go talk to your advisor.
 
Scale matters, and that's what I'm confused about.

The Phi has 50 fairly beefy cores built into it. A GPU has hundreds or thousands of tiny and very simple cores. If your OpenCL code is designed to run the smallest workload possible executing a thousand times concurrently, then what is going to happen when you run the same code on the Phi?

Now you've only got 50 cores, but your OpenCL code still assumes that those cores are less capable than they physically are. Unless Intel has a way of making a single Phi core act like more than one compute unit (or a single CU with more than one processing element), then I can't imagine that OpenCL stuff is going to run on the Phi very well without being rewritten anyway to take advantage of being able to run a larger workload with fewer concurrent threads.

-SC

I'm very much ignorant of this kind of computing. But I'm curious how the RAM works. I often have jobs that are ridiculously parallel, but they still require 4GB, or even much, much more RAM per process. 50 cores would then need at least 200GB of RAM. That's not terribly difficult for a workstation, but the GPU uses its own RAM right? So then if I'm not wrong, which I very certainly could be, code needs to be rewritten to understand this limitation, does it not? Just compiling it differently isn't going to solve this right?
 
If C, C++, and Objective C aren't really optimized for an x86 processor, then what is?

Because these are fairly small, low-featured cores. These aren't the same cores you'd find in a real Xeon; they're much more reduced.

Besides, OpenCL is a C style language.

The Phi has 50 fairly beefy cores built into it. A GPU has hundreds or thousands of tiny and very simple cores. If your OpenCL code is designed to run the smallest workload possible executing a thousand times concurrently, then what is going to happen when you run the same code on the Phi?

It doesn't matter. If the Phi's individual cores are faster, it can do the work in the same time.

Think about it this way:
40 cores running one job a second = 40 jobs a second
20 cores running two jobs a second = 40 jobs a second

It all evens out.

Now you've only got 50 cores, but your OpenCL code still assumes that those cores are less capable than they physically are. Unless Intel has a way of making a single Phi core act like more than one compute unit (or a single CU with more than one processing element), then I can't imagine that OpenCL stuff is going to run on the Phi very well without being rewritten anyway to take advantage of being able to run a larger workload with fewer concurrent threads.

Again, none of this matters if the individual Xeon cores are faster, which is what Intel is betting on.

Knights Corner is not supposed to be an x86 machine to developers. The only reason Intel went x86 was because they have existing x86 investments. When Knights Corner was a GPU, you couldn't even access its instruction set. And they aren't even modern x86 processors. My understanding was that they are basically Pentium-level CPUs.

So while internally it's an x86 unit, that has nothing to do with what languages you'd use or how you program for it.

Besides, like I said, if you want to use combinations of multiple Knights Corners, your best bet is OpenCL anyway.

(For context: Working with CUDA, multithreading, and high performance computing was one of my focuses back in school, and we spent time talking about Knights Corner while it was in development.)
 
I'm very much ignorant of this kind of computing.

This may help.

http://blogs.nvidia.com/2012/04/no-free-lunch-for-intel-mic-or-gpus/

It is a blog that takes a stab at it from Nvidia's point of view: that it isn't a simple recompile. That's right. But if you dive down into the comments on the article, you can see that for folks who already have supercomputer code that uses the libraries common there (OpenMPI, OpenMP), and who have already factored their apps to deal with nodes of smaller memory interlinked by a high-speed bus (e.g., InfiniBand), there is often more "same" than "different".

Porting an OpenCL driver won't be a "free lunch" either. It will allow folks to offload some computation onto the card, but it isn't going to get anywhere near peak performance throughput. It may be a bit better than the host Xeon/Core i throughput, but that's a lot of money to spend for just a bit better.

But I'm curious how the RAM works. I often have jobs that are ridiculously parallel, but they still require 4GB, or even much, much more RAM per process.

It isn't going to be good for independent batch jobs that can kick off in parallel, like "invoke 10 video transcodings on 10 different files". It will work better on jobs that are threads (multiple instruction streams running on shared memory). You chop the problem up into 10, 20, 30 pieces and let the MIC attack the much smaller problems in parallel.

50 cores would then need at least 200GB of RAM.

Not necessarily. It may be better to kick the jobs off serially. So if there is 8GB of memory on the MIC, then perhaps 40 processors tackle a 3GB job while the other 10 copy the next job into place and/or remove the results of the last 3GB job. You set up a pipeline where jobs pass sequentially like this through the MIC's "local" RAM. The host CPU coordinates switching between the two 3GB buffers: once both are done, the CPUs switch buffers and start again.
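In rough C, the idea looks like this. The stage_in/process_on_card/stage_out helpers are hypothetical stand-ins for the real PCIe transfer and compute calls, and in real code the staging copy would be asynchronous so it genuinely overlaps the compute:

```c
/* Double-buffered pipeline: while the card crunches one buffer, the
 * host stages the next job into the other, then they swap. The
 * helpers below are hypothetical stand-ins, not a real API. */
#include <stdio.h>

static void stage_in(double *dst, int job)  { dst[0] = job; }   /* host -> card copy */
static void process_on_card(double *buf)    { buf[0] *= 2.0; }  /* compute on the MIC */
static void stage_out(double *src, int job) { printf("job %d -> %g\n", job, src[0]); }

void run_pipeline(double *bufA, double *bufB, int njobs)
{
    double *compute = bufA, *staging = bufB;
    stage_in(compute, 0);                    /* prime the pipe */

    for (int job = 0; job < njobs; job++) {
        if (job + 1 < njobs)
            stage_in(staging, job + 1);      /* copy the next job in... */
        process_on_card(compute);            /* ...while this one computes */
        stage_out(compute, job);

        double *tmp = compute;               /* swap the two 3GB buffers */
        compute = staging;
        staging = tmp;
    }
}

int main(void)
{
    double a[4], b[4];                       /* tiny stand-ins for 3GB buffers */
    run_pipeline(a, b, 5);
    return 0;
}
```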

That's not terribly difficult for a workstation, but the GPU uses its own RAM right? So then if I'm not wrong, which I very certainly could be, code needs to be rewritten to understand this limitation, does it not?

Sometimes this isn't so much rewriting as having to put custom compiler directives into the code that give enough hints to the compiler on how to chop your loops and data up into pieces. But in general, yes... just like with the GPUs, whenever the underlying architecture changes significantly, you'll need to tweak the code.
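Intel's "LEO" offload directives (quoted from Intel's blog further down the thread) are an example of exactly that. A rough sketch based on Intel's early documentation, so take the exact clause syntax with a grain of salt:

```c
/* Offload a loop to the coprocessor with Intel's LEO directives.
 * Rough sketch from Intel's early docs; the in/out clauses tell the
 * compiler what data to ship across PCIe and back. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1000000;
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    for (int i = 0; i < n; i++) a[i] = (float)i;

    #pragma offload target(mic) in(a:length(n)) out(b:length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = a[i] * a[i];          /* runs on the Phi if one is present */

    printf("b[10] = %f\n", b[10]);
    free(a); free(b);
    return 0;
}
```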


Just compiling it differently isn't going to solve this right?

Depends on whether you had well-structured code in the first place. If it is code with no clue as to where the massive parallelism starts and stops, then yeah... rewrite time.

If it is code targeted at the original Intel-powered Teraflop machine, then not so much.
 
Because these are fairly small, low-featured cores. These aren't the same cores you'd find in a real Xeon; they're much more reduced.

Not really, when it comes to the vector instructions (SIMD, SSE4). Gobbling double floats and doing math on them is about the same. There may be "old" Pentium branch prediction and simpler integer pipelines, but the vector stuff is pretty modern.

The die for these is probably going to come out E7 size and quite large.


Besides, OpenCL is a C style language.

The syntactic sugar isn't the issue; the syntax "style" isn't particularly relevant. More relevant is that OpenCL code is meant to be composed as a chunk that does something, not a whole application or program. It is also composed in a context that assumes the data being worked on is in a separate, distinct memory block, and that these "chunks" (or jobs) are created, put onto a job queue, and dispatched to where they do the work.
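To make that concrete, here is the standard OpenCL host flow in miniature (a generic sketch, error checking omitted): the "chunk" is the kernel, the data lives in its own cl_mem block, and the job goes through a command queue:

```c
/* The OpenCL model in miniature: a kernel "chunk", a distinct device
 * memory block, and a command queue that dispatches the work. */
#include <CL/cl.h>

const char *src =
    "__kernel void square(__global const float *in, __global float *out) {"
    "  size_t i = get_global_id(0); out[i] = in[i] * in[i]; }";

void run_chunk(cl_context ctx, cl_device_id dev, float *data, size_t n)
{
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);

    cl_program p = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(p, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(p, "square", NULL);
    clSetKernelArg(k, 0, sizeof in,  &in);
    clSetKernelArg(k, 1, sizeof out, &out);

    /* Data goes into the device's own memory; the job goes on the queue. */
    clEnqueueWriteBuffer(q, in, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);

    clReleaseKernel(k); clReleaseProgram(p);
    clReleaseMemObject(in); clReleaseMemObject(out);
    clReleaseCommandQueue(q);
}
```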

In other words, it forces the application developers to factor the massively parallel parts from the rest of the application. To get the biggest bang for the buck, apps will need to keep the highly serial fragments on the host Xeon E3/E5 and dispatch smaller parallel fragments to the Xeon Phi.

There is never going to be a perfect split; some parallel code will end up in the more highly serial portion, and some serial code will end up in the more highly parallel section. For example, if there is a relatively brief loop over the matrix to examine the current state, that could stay on the host CPU (no need to dispatch to the Phi just for that, since copying the data there and back takes time). Likewise, if there is a longer if/then/else-if or switch with substantive conjunctions (and's and or's) buried inside of a loop, then the Phi will generally handle it better than the GPUs.
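A toy example of the kind of loop I mean (illustrative only): on a SIMT GPU, threads in a warp that take different branches get serialized, while a core with real branch prediction just runs it.

```c
/* A branchy per-element loop: SIMT GPUs serialize the divergent
 * branches within a warp, while a conventional core with branch
 * prediction runs each iteration at full speed. Illustrative only. */
void classify(const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        if (x[i] > 1.0 && x[i] < 10.0)
            y[i] = x[i] * x[i];
        else if (x[i] <= 1.0 || x[i] > 100.0)
            y[i] = 1.0 / (1.0 + x[i] * x[i]);
        else
            y[i] = x[i] - 10.0;
    }
}
```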

It doesn't matter. If the Phi's individual cores are faster, it can do the work in the same time.

They probably aren't going to be 2x faster clock-wise. It is more "faster through the serial sections of code you can't get rid of" than faster overall. Don't think single floats are going to be the sweet spot for Phi either (if they were, this would have had a chance at being a discrete GPU... it isn't).

Again, none of this matters if the individual Xeon cores are faster, which is what Intel is betting on.

I think Intel is betting much bigger on easier ports from current supercomputer code.

Knights Corner is not supposed to be an x86 machine to developers.

An x86 machine to assembly code hackers? No. An x86 machine that runs the Linux, OpenMP, and OpenMPI stacks that many are used to and have established code bases for? Yes. Intel certainly is not saying "throw those all away and start over from scratch".
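The bread-and-butter idiom of those code bases is the plain OpenMP loop, something like this generic sketch (not Intel's example). The pitch is that this recompiles and runs on the card as-is:

```c
/* The kind of OpenMP loop existing HPC code bases are full of; the
 * pitch is that it compiles and runs on the card unchanged.
 * Generic sketch, not taken from Intel's materials. */
#include <omp.h>
#include <stdio.h>

#define N 10000000

int main(void)
{
    static double a[N], b[N];
    double dot = 0.0;

    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        b[i] = i * 0.25;
        dot += a[i] * b[i];
    }

    printf("dot = %g on up to %d threads\n", dot, omp_get_max_threads());
    return 0;
}
```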

The only reason Intel went x86 was because they have existing x86 investments. When Knights Corner was a GPU, you couldn't even access its instruction set.

But it wasn't running Linux either.

Besides, like I said, if you want to use combinations of multiple Knights Corners, your best bet is OpenCL anyway.

Not really. Perhaps on Mac OS X specifically. But so far there is no announced (or even hinted) support for Mac OS X.

http://www.hpcwire.com/hpcwire/2012..._gather_developer_mindshare.html?featured=top

and

".... GCC also does not include support for any offload directives at this point. While the Intel compiler has “LEO” (Language Extensions for Offload), we are hopeful of a standard that brings together the GPU-only OpenACC with the more flexible Intel LEO offload model for coprocessors. We are engaged in the OpenMP committee hoping to find such a combination to be generally applicable, integrated and fully compatible with OpenMP (including the ability to offload OpenMP code) ... "
http://software.intel.com/en-us/blogs/2012/06/05/knights-corner-open-source-software-stack/

I don't think OpenCL is likely to get anywhere near peak performance out of Phi set-ups. For example, MPI on individual cards could do point-to-point memory transfers; OpenCL can't do that. That's not its architecture. It is an "easy" offload to get some results, but it is doubtful the efficiency ratio is going to be extremely high.

A very large fraction of Phi deployments are going to be into supercomputers, and low efficiency ratios aren't going to cut it. In the minor subcontext of Mac OS X apps, yeah, a higher fraction are going to be OpenCL, but it is an open question whether Apple puts in the work to even bring the card to the Mac Pro. [They should, to broaden the user base, but they should have had something ready for this summer too. They didn't.]


Intel may eventually put the Phi on a track where it gets off the PCI-e bus and onto something more like the QPI bus. It depends upon what they do with the Cray interconnect technology they bought and the Aries chipset:

http://www.theregister.co.uk/2012/04/25/intel_cray_interconnect_followup/

While Aries is PCI-e v3.0 based, that doesn't necessarily mean a discrete card (just as Thunderbolt controllers having PCI-e inputs doesn't necessarily mean a discrete card).
 
The syntactic sugar isn't the issue; the syntax "style" isn't particularly relevant. More relevant is that OpenCL code is meant to be composed as a chunk that does something, not a whole application or program. It is also composed in a context that assumes the data being worked on is in a separate, distinct memory block, and that these "chunks" (or jobs) are created, put onto a job queue, and dispatched to where they do the work.

I'm not sure what your point is. OpenCL purposely uses "chunks" because that's the only real way to go multicore.

Basically what you're saying is that you can run traditional single threaded programs on a very multicore chip. Ok, maybe. Why the hell would you want to do that?

If traditional languages worked well for high core counts, OpenCL wouldn't exist in the first place.

In other words, it forces the application developers to factor the massively parallel parts from the rest of the application.

Yes. It makes them design in a multithreaded parallel manner for a multicore chip. And?

Even in traditional languages you still have to create threads that are factored out from the rest of the application.
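For example, the classic pthreads pattern (a generic sketch) already forces exactly that factoring; the parallel work has to live in its own function with its own entry point:

```c
/* Even plain C forces you to factor the parallel work into its own
 * function and hand it to a thread. Generic sketch. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)          /* the factored-out parallel part */
{
    long id = (long)arg;
    printf("thread %ld doing its slice of the work\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```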

There is never going to be a perfect split; some parallel code will end up in the more highly serial portion, and some serial code will end up in the more highly parallel section. For example, if there is a relatively brief loop over the matrix to examine the current state, that could stay on the host CPU (no need to dispatch to the Phi just for that, since copying the data there and back takes time). Likewise, if there is a longer if/then/else-if or switch with substantive conjunctions (and's and or's) buried inside of a loop, then the Phi will generally handle it better than the GPUs.

Yep. But traditional languages don't help with that at all. If anything, they make it worse.

They probably aren't going to be 2x faster clock-wise. It is more "faster through the serial sections of code you can't get rid of" than faster overall. Don't think single floats are going to be the sweet spot for Phi either (if they were, this would have had a chance at being a discrete GPU... it isn't).

If the total computing power of Knights Corner isn't more than a traditional GPU, they're wasting our time. Knights Corner may have SIMD units, but traditional GPUs absolutely tear through SIMD tasks as well.

I think Intel is betting much bigger on easier ports from current supercomputer code.

I'm not sure that's really buying them much. It would still take a lot to get traditional supercomputer code running on these. Probably the same amount of work it would take just to stick the same algorithms into CUDA or OpenCL.

An x86 machine to assembly code hackers? No. An x86 machine that runs the Linux, OpenMP, and OpenMPI stacks that many are used to and have established code bases for? Yes. Intel certainly is not saying "throw those all away and start over from scratch".

Again, yeah, you could get away with not throwing away your existing code base, but it's more complicated than just a recompile. Unless I missed something and these cards have Ethernet jacks, you can't simply drop these cards in place of supercomputer nodes. You've got to write a bunch of code on the host machine to submit jobs to the cards and do load balancing, which takes away a lot of the advantages.

".... GCC also does not include support for any offload directives at this point. While the Intel compiler has “LEO” (Language Extensions for Offload), we are hopeful of a standard that brings together the GPU-only OpenACC with the more flexible Intel LEO offload model for coprocessors. We are engaged in the OpenMP committee hoping to find such a combination to be generally applicable, integrated and fully compatible with OpenMP (including the ability to offload OpenMP code) ... "
http://software.intel.com/en-us/blogs/2012/06/05/knights-corner-open-source-software-stack/

Translation: OpenMP is mired down in politics and doesn't even have an offload standard yet, while OpenCL does.

A very large fraction of Phi deployments are going to be into supercomputers, and low efficiency ratios aren't going to cut it. In the minor subcontext of Mac OS X apps, yeah, a higher fraction are going to be OpenCL, but it is an open question whether Apple puts in the work to even bring the card to the Mac Pro. [They should, to broaden the user base, but they should have had something ready for this summer too. They didn't.]

I don't think Apple is going to bring the card to the Mac Pro themselves. So far they haven't paid much attention to the Tesla either.

However, that doesn't stop Intel from writing their own drivers if they think it's worth it.

Intel may eventually put the Phi on a track where it gets off the PCI-e bus and onto something more like the QPI bus. It depends upon what they do with the Cray interconnect technology they bought and the Aries chipset.

It would be interesting. Moving off the PCI-e bus certainly has performance benefits for high-bandwidth applications. The card has GDDR5 memory, but if the machine's main memory could keep up, DMA would also be interesting and would result in a speedup in some situations.
 
I'm very much ignorant of this kind of computing. But I'm curious how the RAM works. I often have jobs that are ridiculously parallel, but they still require 4GB, or even much, much more RAM per process. 50 cores would then need at least 200GB of RAM. That's not terribly difficult for a workstation, but the GPU uses its own RAM right? So then if I'm not wrong, which I very certainly could be, code needs to be rewritten to understand this limitation, does it not? Just compiling it differently isn't going to solve this right?

That's right. There are divisions within even very parallel jobs as to what's good for GPU computing, and what isn't. There's a thread on the Computational Science site in Stack Exchange talking about exactly that:

http://scicomp.stackexchange.com/qu...roblems-lend-themselves-well-to-gpu-computing

It doesn't matter. If the Phi's individual cores are faster, it can do the work in the same time.

Think about it this way:
40 cores running one job a second = 40 jobs a second
20 cores running two jobs a second = 40 jobs a second

I think it may matter for some folks because the sheer number of cores will have some impact on how you're expecting memory to get used, no?

Beyond that, I've found people pushing CUDA more than OpenCL because of nVidia's strengths with the Tesla cards. So yeah, while I think it'll be less of a boost for people who already have stuff implemented in OpenCL, I don't know that there's all that many of them.

It's certainly not "OMG THIS CHANGES EVERYTHING" in the way the original GPU computing was (or Beowulf, etc.), but it's a neat bit of tech and a promising start.

----------

I don't think Apple is going to bring the card to the Mac Pro themselves. So far they haven't paid much attention to the Tesla either.

However, that doesn't stop Intel from writing their own drivers if they think it's worth it.

With the state of the Science section on the Apple site, I'd be surprised if this resulted in anything other than "Huh...neat" from Apple.

Macs are great client machines for HPC, but they've abandoned the other end essentially entirely.
 
If the total computing power of Knights Corner isn't more than a traditional GPU, they're wasting our time. Knights Corner may have SIMD units, but traditional GPUs absolutely tear through SIMD tasks as well.

I'm not sure that's really buying them much. It would still take a lot to get traditional supercomputer code running on these. Probably the same amount of work it would take just to stick the same algorithms into CUDA or OpenCL.

For the general population this will probably be a waste of time, but as was mentioned before, this is a godsend for those running true supercomputer codes (not small lab codes; I'm talking codes developed over the last 10+ years that run on hundreds of thousands of cores). Imagine porting a project like NEK5000 (http://nek5000.mcs.anl.gov/index.php/Main_Page), which was started in the 80s in F77, to OpenCL or CUDA... It's simply not an option.

The benefit of these cards is that if they have an embedded OS and they can run an MPI stack, then relative to the other models this will be a walk in the park for us. Sure, there will be some tweaking necessary to address them over PCI instead of InfiniBand, but that's minimal compared to a full language rewrite.
 