
farmerdoug

macrumors 6502a
Original poster
Sep 16, 2008
(They have to go somewhere:D)

Code:
float sum = 0, *arr;
arr = (float *) calloc(alot, sizeof(float));

for (row = 0; row < rows; row++)
    for (col = 0; col < rowsize; col++)
        sum += arr[row*rowsize + col];

or

for (i = 0; i < alot; i++)
    sum += arr[i];

or ??

for (i = 0; i < alot; i++)
    sum += *(arr + i);

or ??
 
The best answer to this is to think: in each case, how much work has to be done? More work takes longer. As a starter down this path, the first solution requires a multiply to calculate the array offset for every element, so it is clearly going to take longer than the versions that avoid that multiply.

But more generally you have to ask yourself if it matters which takes longer. Is this code really performance critical? How have you come to that decision? Have you run this in Shark or something similar to tell you that this is the slow section in the code?

Edit to add: general hint.

Code:
int i,max;
for (i=0;i<max;i++)
{
//Stuff
}

is slower than

Code:
int i,max;
for (i=max;i--;)
{
//Stuff
}

Note that whilst you may think that in the second case the loop body will execute with i = max the first time through, this is not the case: the termination condition (which decrements i) runs before each execution of the loop body, including the first, so i takes the values max-1 down to 0.
 
robbie.

I have a version of this code in another language that takes days to run.
A colleague is running similar matlab code on a mac; I think it runs 40 mins.
Forty minutes is still too long when the data is acquired in under 3 minutes and you'd like to make a decision as to what to do.

Not being a programmer, and having other pressing issues -sometimes it's easier just to ask.

doug
 
It takes 40 minutes to sum an array of floats? What are you running on? An HP calculator? :p

Either there is way more work being done in the loop than your code indicates, or the main slowness is outside the loop. This is why I question how you decided you need to optimise this loop.

If you are really only summing an array of floats then in general your compiler is going to compile the second and third cases to the same code. The only optimisation you can really apply is to loop down rather than up.

Do you have multiple loops to sum multiple arrays? If so can you do a single loop and sum them all together? If you have multiple CPU cores can you sum parts of the array in one thread and other parts in another then combine the results?

Basically what I'm saying is that normally you can get far more out of looking at the overall algorithm you are using and optimising that than micro-optimising loops...
 
I have 19 23*250*250 arrays. Each array has to be sliced, diced, shifted, magnified, added, subtracted and multiplied until a best fit is found between each sliced and diced section. Most of the code is simple addition and subtraction but it has to be done over and over and over again.
 
Then I would suggest you need to thread your code (assuming you have multiple CPUs available and that you can define units of work that do not directly affect each other). It seems that you would probably be best off with an experienced professional programmer...
 
Yes, I would but budgets forbid. That's why I'm here. I've worked with threads before; we also have a supercomputer here but the code needs to run on a four or eight core Mac. Unfortunately, the cfitsio library that I use is not compatible with threads; I might be able to work around that, but before I worry about threads, I need to rewrite my code into C. I need something faster than days to make sure that I even have the right algorithm. Which is why the original question: what's the fastest way to work with arrays?
 
what's the fastest way to work with arrays?

I've already told you: either the second or third approach will be the same to the compiler. Loop down not up.

If you can't work with threads due to a library issue can you work with separate processes? For example if your 19 arrays are all basically independent in processing can you run 19 processes? That way you can use all 8 cores instead of one which would basically make your code 8x faster...
 
Of course there is another option. "Cheat". Modern CPUs have vectorized operations. Use the Accelerate library to take advantage of this. For example the vDSP_sve function will sum a floating point array into the float variable of your choice. It will do this using the hardware vector optimisations on the CPU and be way faster than your lovingly hand-optimised array operations :D
 
All the methods you listed will perform about the same. Threading (or multiprocessing) is going to be your best bet for utilizing the resources you have. With a single thread/process you will be utilizing 1/4 or 1/8 of the CPU cores you have available.

If possible, I would start with multiprocessing. You have a non-threadsafe library you need to use, so you'd need to do a lot of locking to ensure that library functions aren't called simultaneously by multiple threads. To do this with multiprocessing I would have a worker program that is passed the start and end indices it needs to process. Summing values is an easy thing to decompose because when you get your n results back you just have to sum them up.

How you shuttle values back and forth (over pipes, using shared memory, intermediate file-store) will be the most complicated piece. Also, if you have the data in memory instead of in a file or database, that presents a similar problem. I'd opt for shared memory, but it's what I'm used to.

Others will chime in with how inefficient multiprocessing is versus multithreading. I'll concede this, but mp is much easier to reason about, and it's better than single process, single thread.

-Lee
 
Thanks for the advice. Taking any of it would require learning new stuff. I'm not opposed to that but I need to get something running; I think continuing on the path I'm on is my best bet even if the end result isn't the fastest. If the algorithm works, I'll worry about getting it faster. Of course if anybody wants to volunteer ...;)

The data is adjusted one slice at a time but three slices are involved. The numbers for optimum alignment from one slice may be used to speed up alignment of the next making threading more difficult.
 
An easy way to do multi-process would be to look into MPI. This has the added benefit that the multiple processes do NOT have to be on the same computer. Therefore, you can take advantage of every core on every Mac you have available. Between this and other optimizations (like using the vector processing units in your processors), you should be able to do better than Matlab's 40 minutes.

It's been a while since I've done anything with MPI, so I do not know which implementation to recommend. I know Boost has an MPI library, and I also know that OS X Server has its own implementation.


Furthermore, don't forget to turn on all of the compiler optimizations you can find.


If the non-thread-safe library you're using is outside the loop(s), you could simply use OpenMP to thread localized portions of your code VERY easily.
 
1. For multi-threading, go to developer.apple.com and enter "blocks" into the search box. That will show you how to distribute your code optimally onto any number of CPUs with approximately zero effort. Requires Snow Leopard.

2. Divide the work into chunks that fit into about 32KB and do all the work that is needed on that chunk of data before you proceed to the next. That makes sure all operations happen in L1 cache memory.

3. Perform operations on consecutive array elements in ascending order.

4. Use Apple's vector library (veclib). Again, type "veclib" into the search box in developer.apple.com

5. In the compiler settings, set optimisation level = fastest, loop unrolling.

Example for using multiple processors with Grand Central Dispatch; method 2 will use all cores on any Mac with about zero programming effort:

Code:
#include <dispatch/dispatch.h>

int main (void)
{
	size_t rows = 300;
	size_t columns = 520;
	double* array1 = (double *) calloc (rows*columns, sizeof (double));
	double* array2 = (double *) calloc (rows*columns, sizeof (double));
	
	/* Method 1, single threaded */
	for (size_t i = 0; i < rows; ++i) {
		double* p = array1 + i * columns;
		double* q = array2 + i * columns; 
		for (size_t j = 0; j < columns; ++j)
			p [j] = 2.7 * p [j] + 3.14 * q [j];
	}
	
	/* Method 2, GCD */
	dispatch_queue_t theQueue = dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
	dispatch_apply (rows, theQueue, ^ (size_t i) {
		double* p = array1 + i * columns; 
		double* q = array2 + i * columns;
		for (size_t j = 0; j < columns; ++j)
			p [j] = 2.7 * p [j] + 3.14 * q [j];
		});
	
	free (array1);
	free (array2);
	return 0;
}

But now back to your original question: What is the fastest way to work with arrays? The fastest way is to learn how modern processors work, how caches work, how to use multiple processors, how to use vector units, write code, profile it with Shark, look for the bits that take most of the time, improve them, start understanding your algorithms, start understanding what work is done that doesn't need doing, learn where multiple passes can be combined into one, find where general algorithms can be replaced with specialised ones, profile it again, and so on.
 
Unfortunately, the cfitsio library that I use is not compatible with threads; I might be able to work around that but before I worry about threads, I need to rewrite my code into c.

The latest version is thread safe. There are even threaded example sources available.
 
Unfortunately, the cfitsio library that I use is not compatible with threads ... Which is why the original question; what's the fastest way to work with arrays?

I know "cfitsio". I would bet a lot that the loop that adds the array is not the bottleneck. I think it might be in the FITS reader library.

The two methods you list, one with two loops and one with only one loop, are so close in speed it hardly matters. What you are doing is first filling a bucket with rocks and then counting the rocks. That is too slow. Far better to count the rocks as you place each one in the bucket; then when the bucket is full you know the count in zero time.

There is another possible problem, as well as a huge (100x) potential speed-up. If the array does not fit in physical RAM, your CFITSIO library will have read it off the disk only to have the operating system swap the data back out to swap space on the disk. Then, if you are doubly unlucky, you are not reading back contiguous array locations. If you loop through the array A(1,n), A(2,n), ... and those elements are not close together on the swap disk, then each array access causes an entire 1024-byte block to be read in just to get one floating point value. You need to read through the array in an order such that consecutive accesses hit elements that are physically adjacent in memory. So: is the array read in column or row major order? Is it stored the same way?

These kinds of issues can have a two order of magnitude effect on speed.
 
I know "cfitsio". I bet a lot that the loop that adds the array is not the bottle neck. ... These kinds of issues can have a 2 order of magnitude effect on speed.

In other words, profile with Shark. But having heard that the farmer is in the teaching profession, and with his admission that he is not willing to learn, and having heard no evidence yet that he is using a Macintosh (which may explain why he is unable to debug anything), I doubt he will do that.

But Shark would tell you within 30 seconds where your bottlenecks are. Even Activity Monitor would.
 
In other words, profile with Shark. Shark would tell you within 30 seconds where your bottlenecks are. Even Activity Monitor would.

Which all comes back to my first reply (post #2 in this thread)...
 
Well, I certainly got more than I bargained for when I asked the original question. It all bears looking into. However, at the moment I don't have working code. I was functioning at a slightly higher level yesterday and managed to finish a routine that magnifies and shifts the image. It should be pretty straightforward from here to get a working version. At that point, I can try Shark. I might even post the whole code for amusement.

BTW.
Not a teacher, a researcher, and really not opposed to learning new things. I have to learn enough to help in the design of a spectropolarimeter. That comes before becoming a better programmer, although I admit my life might be easier if I were; but that's why I come to you for help. Who would you help, if it wasn't for guys like me? :)
 
Okay here's the code except that nothing happens when I hit the manage attachment button.
 
Okay, so I think I attached files. The bottleneck is magnify2andshift, which probably can't be modified to any great extent. However, the code from

// FOR SLICE THEN RING THEN QUAD ...

which is no longer the order of things, will lend itself to some of the ideas put forth earlier: multiprocessing, multithreading, and veclib.

comments welcome
 

Attachments

  • SR.txt (19.9 KB)
  • p1640.txt (7.1 KB)
 
Looking at magnify2andshift I can see a few potential speedups (and I've only looked at it for 20 seconds).

How big are rows and cols normally? If they are small numbers then my initial look at the first couple of loops won't make any difference.

Otherwise the first loop:
Code:
 for (row = 0; row < rows; row += 1)
    {
        x2[row] = (float) row;
        x1[row] = (float) row;
    }
might be able to be turned into a couple of veclib calls (need to look at it carefully)

and

Code:
        for (col = 0; col < cols; col++)
            {
            yb1[row][col] = y[row*cols +col];
            y2[row][col] = ms[row*cols +col];
            }
can probably be replaced with two calls to memcpy which will be faster.
 
thanks.
Your first comment made me realize that that is repeated at every call and can be taken out of the loop altogether.

re: Your second comment. I tried. I must have been doing something wrong because the spline routine that comes next didn't like it.
 
re: Your second comment. I tried. I must have been doing something wrong because the spline routine that comes next didn't like it.

Hmm, I may need to think about that/double check how C accesses 2-d arrays. In my head yb1[row][0] is followed in memory by yb1[row][1] and so on so you can memcpy to it. Perhaps it's not.

Edit to add:

Other things to look at:

Code:
for (frow = 0; frow < rows; frow += (float)1/magy, r++)

I don't see magy changing in the loop so pre-calculate the increment. At the moment that division is executed every time. Same for the inner loop.
 