Hi Jesus
I have been working in that field, facing similar decisions. My problem basically consisted of solving partial differential equations on a grid.
Basically what I work on is physics calculations. These don't require a big amount of storage and memory is only rarely a problem; the main constraint is CPU time.
Don't get bitten by that assumption. When you have that 4GB RAM workstation next to your desk, all runs that make sense to start are CPU-bound. But if you move to a massively parallel supercomputer with several thousand processors, where each computing node has access to only half a gig, and you scale your computational domain accordingly, you may be surprised that all of a sudden, memory can become an issue.
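To make that concrete, here is a rough back-of-the-envelope sketch in C. The function name bytes_per_node and the numbers in the example are my own hypothetical choices, not from any real code; it assumes a cubic grid split evenly over the nodes and ignores ghost cells, communication buffers and the operating system.

```c
#include <stddef.h>

/* Rough estimate of the memory each node needs, in bytes.
 * Assumptions (hypothetical, for illustration only):
 *   - a cubic grid with points_per_side points in each direction,
 *   - `fields` double-precision values stored per grid point,
 *   - the grid is split evenly over `nodes` nodes,
 *   - no ghost cells, buffers or other overhead are counted. */
double bytes_per_node(double points_per_side, int fields, int nodes)
{
    double total_points = points_per_side * points_per_side * points_per_side;
    return total_points * fields * sizeof(double) / nodes;
}
```

With 8-byte doubles, bytes_per_node(1024.0, 5, 64) gives 671088640 bytes, i.e. 0.625 GiB per node. That is already more than half a gig, before any overhead is counted.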
I think I used the term multithreading incorrectly. These calculations are straightforward and not user interactive. Sometimes these simulations are easy to parallelize in one or more constraints, which is the part I'm interested in right now; eventually a more complicated parallelization scheme will be required, but not for now. I need to use a lightweight programming language because these are long runs. And given that I have the choice, I like C syntax better than C++, but that is just a personal opinion.
This decision depends on a lot of different factors, like what machines you have access to (especially the number of processors: a couple dozen or a couple thousand). OpenMP's underlying assumption is that the code runs on a shared memory machine, which means that all processors can access one large chunk of memory equally fast. While this is generally true for all "normal" computers and fairly small supercomputers, this approach doesn't scale well to several thousand processors. On a distributed memory machine (like the IBM BlueGene and others), each processor has its own small piece of memory, where access is superfast, and data exchange between processors goes through an interconnect, which is orders of magnitude slower. Here, too, your mileage varies: in a "boxed" supercomputer, the connection is as fast as technology allows; in the often-found "cheap cluster with 400 nodes", the connection is typically so slow that communication between processors is the major bottleneck.
I'm looking into OpenMP and MPI, but I'm not sure which fits the bill, and what are good books on the topic.
If you are on a shared memory machine and the code has only one (fairly simple) loop that can be parallelized easily, try OpenMP, since it is very easy to use and gives you speedups very quickly (little coding time). On a distributed memory machine, use MPI. Don't go for hybrid solutions; they are generally not worth the hassle.
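To give you an idea of how little coding the OpenMP route takes, here is a minimal sketch in C. The functions smooth_step and sum_of_squares are made-up examples (a simple averaging sweep on a 1D grid and a norm-like sum), not taken from my code or yours. The only OpenMP-specific parts are the two pragma lines; a compiler without OpenMP enabled simply ignores them and runs the loops serially.

```c
#include <stddef.h>

/* One averaging sweep on a 1D grid (hypothetical example).
 * Each iteration reads only from u and writes only u_new[i],
 * so the iterations are independent and safe to run in parallel. */
void smooth_step(const double *u, double *u_new, size_t n)
{
    if (n == 0)
        return;

    /* Boundary values are copied unchanged. */
    u_new[0] = u[0];
    u_new[n - 1] = u[n - 1];

    #pragma omp parallel for
    for (long i = 1; i < (long)n - 1; ++i) {
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }
}

/* Sum of squares of all grid values (hypothetical example).
 * The reduction clause gives each thread a private partial sum
 * and adds the partial sums together at the end of the loop. */
double sum_of_squares(const double *u, size_t n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        sum += u[i] * u[i];
    }
    return sum;
}
```

With gcc you would compile this with -fopenmp to actually get threads, and set the thread count through the OMP_NUM_THREADS environment variable. An MPI version of the same sweep would additionally need explicit exchange of the boundary values between neighbouring processes, which is where the extra coding time goes.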
As far as language is concerned: Use what you feel comfortable with (as long as it is C). Most people underestimate the time for development and testing and overestimate the time the runs take. For OpenMP, almost any language will do, it may even be Java. For MPI on a supercomputer, the installed libraries tend to support only Fortran, C and C++. The whole MPI standard is very low level, and while C++ bindings are included in the standard, some implementations do not...um, perform well (C file output was several thousand times faster than C++ streams on our platform). We had a C version of our code and a C/C++ hybrid version, and the hybrid was faster in the end and easier to maintain. But it took some performance tuning, because the original hybrid was a factor of three slower! And the different compilers sometimes produced obscure warnings for legitimate C++ code, which tended to be a nuisance. And the core routines had to be written C-style anyway. And my initial language of choice for any project is always C++ over C...
As far as books go, I don't have any recommendations.
Sorry for the long post, I hope I could help.
--clemensmg