Of course

done.

Did you also pre-calculate (r*rows*magy) before entering the inner loop? (You have to do it inside the outer loop, as you change r there.) You could pre-calculate rows*magy outside both loops, then compute r*(rows*magy) in the outer loop, leaving you to add only c when you index into mtmp.

All of this is probably not going to do much if the loops are small. How big are rows/cols on average?
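
A sketch of what that hoisting looks like (mtmp, magy, rows, cols are from the earlier posts; the zero assignment is just a stand-in for the real work):

Code:
void fill_mtmp(float *mtmp, int rows, int cols, int magy)
{
	int r, c;
	int rows_magy = rows * magy;		/* invariant: hoisted out of both loops */

	for (r = 0; r < rows; r++)
	{
		int row_base = r * rows_magy;	/* computed once per outer iteration */
		for (c = 0; c < cols; c++)
			mtmp[row_base + c] = 0.0f;	/* stand-in for the real assignment */
	}
}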
 
So we should be able to memcpy a linear section of one "flat" array into each row in a 2D array.

Given the way that row, rows, col, cols are used in the posted code, yes.

That's why I asked to see the code that didn't work. I'm never sure if farmerdoug is simply posting a tweet or status update ("I couldn't get it to work, I'll try something else, no assistance necessary"), or is making a request for assistance ("I couldn't get it to work, here's my code, please assist"). It starts with "I couldn't get it to work", and usually takes several additional requests before actual code gets posted. Not one of the problems posted has been solved without actual code. I was hoping the pattern for success had become obvious by now.
 
For some reason, I became fixated on the "quad" loop in SR.txt. This looks to me like a situation where you should simply unroll the loop yourself into sequential steps. case 1: is also missing; it simply recalculates the same value as the previous pass. My instinct would be to use the arguments directly and write the reassignment conditional five times, using constants. Any performance gain would be pretty negligible, though.
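
Purely as an illustration of the shape, since SR.txt isn't quoted in the thread (all names and constants below are invented):

Code:
extern float try_scale(float s);	/* stand-in for one pass of real work */

/* the loop-with-switch shape */
float best_looped(void)
{
	float best = 1e30f, scale = 0.0f, e;
	int quad;

	for (quad = 0; quad < 5; quad++) {
		switch (quad) {
		case 0: scale = 1.0f; break;
		/* no case 1: scale keeps the value from case 0 */
		case 2: scale = 2.0f; break;
		case 3: scale = 4.0f; break;
		case 4: scale = 8.0f; break;
		}
		e = try_scale(scale);
		if (e < best) best = e;
	}
	return best;
}

/* unrolled into sequential steps with constants;
   the redundant case-1 pass simply disappears */
float best_unrolled(void)
{
	float best = 1e30f, e;

	e = try_scale(1.0f); if (e < best) best = e;
	e = try_scale(2.0f); if (e < best) best = e;
	e = try_scale(4.0f); if (e < best) best = e;
	e = try_scale(8.0f); if (e < best) best = e;
	return best;
}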
 
I have been accused here of getting people to do my work for me. Well, make no mistake: this time I would like somebody to volunteer to work on a project to find exoplanets for the Astrophysics Department at the American Museum of Natural History. I think the best place to speed up my code is to rewrite the spline routines for threads. Anybody interested in taking this on?

doug.
 

Not really: I'm kind of busy. Can you write some sort of spec as to what they do? There may be VecLib functions that already basically do the same thing.

I suspect that you need to look at threading at a higher level though...
 
I didn't see anything in VecLib. These routines calculate a 2-D spline fit to the data (i.e. an interpolation). Much of the work involves calculating second derivatives and is done on a row-by-row basis, which makes it perfect for threads. Which reminds me: I should rewrite the code so that the machine can make better use of its architecture.
 

I disagree that it makes it perfect for threading. What I see is a function spline that takes two arrays x[] and y[] which are, I assume, rows. I cannot see any safe way to break the spline call into threads, as calculating a given value needs the values above and below it.

The correct place to thread is whatever calls spline.

What I would suggest you do is write out the overall algorithm in English/pseudocode and work out where there are calls that (a) take quite a bit of time and (b) are 100% independent. These calls should be in a loop. If they are, then this loop could potentially benefit from threading the calls.

Note that there is a definite cost to threading. Micro-threading your calls to perform a tiny bit of calculation in a thread may well actually slow your code down.
 
Sorry, at that point I wasn't actually referring to the specific function spline but to the fact that I am doing 2-D spline fits. In the code, I call the function spline cols times for each row. That could be split by four or eight (maybe simply with the proper for loop): not

for (i = 0; i < cols; i++)
{
	do_for(i);
}

but

for (i = 0; i < cols; i += 4)
{
	do_for(i);
	do_for(i + 1);
	do_for(i + 2);
	do_for(i + 3);
}

which would allow the compiler to send it out to four places simultaneously.

The calls to spline build a 2-D array.
That array is then used by splin2, which, as you can see, works a row at a time.
So the same approach can still be used.
So yes, you are right: it's the code that calls spline.
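
As a sketch of what threading that outer loop could look like with pthreads (spline_one_col is a made-up stand-in for the per-column work; this assumes the columns really are independent):

Code:
#include <pthread.h>

#define NTHREADS 4

extern void spline_one_col(int col);	/* stand-in for one per-column spline call */

static int g_cols;

static void *worker(void *arg)
{
	int id = (int)(long) arg;
	int col;

	/* columns are independent, so each thread takes every NTHREADS-th one */
	for (col = id; col < g_cols; col += NTHREADS)
		spline_one_col(col);
	return NULL;
}

void spline_all_cols(int cols)
{
	pthread_t th[NTHREADS];
	long i;

	g_cols = cols;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, worker, (void *) i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
}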
 

How long does each row take now? The overhead of sending out short tasks and gathering the results can make their parallel performance worse than their sequential performance.

This kind of thing is called micro-optimization. You haven't provided any numbers or analysis that suggest your proposed parallelization is worthwhile.

You would be far better off looking for larger independent tasks that currently take at least several seconds of pure computation time. You must factor out all the I/O time and evaluate it separately, because all I/O is a shared and contested resource, which can easily be the limiting bottleneck.
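
Before threading anything, it's worth measuring. A minimal sketch of how to get that per-row number (spline_row is a made-up stand-in for one row's worth of spline calls):

Code:
#include <stdio.h>
#include <sys/time.h>

extern void spline_row(int row);	/* stand-in for one row's worth of spline calls */

static double now_sec(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec * 1e-6;
}

void time_spline_rows(int rows)
{
	double t0, t1;
	int r;

	t0 = now_sec();
	for (r = 0; r < rows; r++)
		spline_row(r);
	t1 = now_sec();
	printf("%d rows: %.3f ms total, %.4f ms per row\n",
	       rows, (t1 - t0) * 1e3, (t1 - t0) * 1e3 / rows);
}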
 
I don't have a row time, but I know that using the spline functions is the slowest part of the code. The images are 175 * 175. I calculate a set of 2nd derivatives once for every 23 slices. That's not bad. But then I have to use the spline functions for magnification and shifting over and over again to find the best match between slices. I may be able to improve my search for a match, but I have so many images ...
 
Function splint, slightly modified (you'll see why):

Code:
void splint (float xa[], float ya[], float y2a[], size_t n, float x, float *y)
{
	size_t lo = 0, hi = n - 1;

	/* bisect to find the interval xa[lo] <= x <= xa[hi], with hi == lo + 1 */
	while (hi - lo > 1) {
		size_t k = (hi + lo) >> 1;
		if (xa [k] > x) hi = k;
		else lo = k;
	}

	/* cubic spline coefficients for this interval */
	float h = xa[hi] - xa[lo];
	float a = (xa [hi] - x) / h;
	float b = (x - xa [lo]) / h;
	float a3 = a * (a * a - 1.0f) * (h * h * (1.0f / 6.0f));
	float b3 = b * (b * b - 1.0f) * (h * h * (1.0f / 6.0f));

	*y = a * ya[lo] + b * ya[hi] + a3 * y2a [lo] + b3 * y2a[hi];
}

You're calling it in a loop:

Code:
	for ( j = 0; j < m; j++){
		splint (x2a, ya[j], y2a[j], n, x2, &yytmp[j]);
	}

Most of the time in splint () is spent (wasted) looking up the same value x2 in the same array x2a and calculating the same values a, b, a3 and b3. Only the very last line depends on ya and y2a. So you save about 95% of the execution time by combining everything like this:

Code:
void gnasher_splint (float xa[], float* ya[], float* y2a[], size_t n, float x, float *y, size_t m)
{
	size_t lo = 0, hi = n - 1, j, k;

	while (hi - lo > 1) {
		k = (hi + lo) >> 1;
		if (xa [k] > x) hi = k;
		else lo = k;
	}

	float h = xa[hi] - xa[lo];
	float a = (xa [hi] - x) / h;
	float b = (x - xa [lo]) / h;
	float a3 = a * (a * a - 1.0f) * (h * h * (1.0f / 6.0f));
	float b3 = b * (b * b - 1.0f) * (h * h * (1.0f / 6.0f));

	/* only this loop depends on ya/y2a: one multiply-add pass per row */
	for (j = 0; j < m; ++j) {
		float* yya = ya [j];
		float* yy2a = y2a [j];

		y [j] = a * yya[lo] + b * yya[lo + 1] + a3 * yy2a [lo] + b3 * yy2a[lo + 1];
	}
}

and calling

Code:
gnasher_splint (x2a, ya, y2a, n, x2, yytmp, m);

The number of operations is now m times (3 adds, 4 multiplies) + 1 time (5 adds/subtracts, 8 multiplies, 2 divides, about 8 iterations through the search loop), instead of m times (8 adds/subtracts, 12 multiplies, 2 divides, 8 iterations through the search loop).
 
We'll take this function and make it fast:

Code:
float sd_of_difference(float *tmpimage, float *tmptarget, int rowstart, int rowend, int colstart, int colend)
{
	float sum2 = 0, diff;
	int row, col;

	for (row = 0; row <= rowend; row++, tmptarget += reduced_size, tmpimage += reduced_size)
	{
		for (col = 0; col <= colend; col++)
		{
			/* only pixels in the annulus rowstart < r <= rowend contribute */
			if (row*row + col*col > rowstart*rowstart && row*row + col*col <= rowend*rowend)
			{
				diff = tmptarget[col] - tmpimage[col];
				sum2 += diff*diff;
			}
		}
	}

	return (sum2);
}

The horrible bit is the if-statement in the middle of the inner loop. As col grows, row*row + col*col grows. So there will be a few items where col is too small so that row*row + col*col is not greater than rowstart*rowstart. Then some items will get added up, and then row*row + col*col will be greater than rowend*rowend, and we stop adding more items. So we can change the function to

Code:
float  sd_of_difference(float *tmpimage,float* tmptarget,int rowstart,int rowend,int colstart,int colend)
{
	float sum2 = 0, diff;
	int row, col, first, last;
	
	for (row = 0; row <= rowend; row++, tmptarget += reduced_size, tmpimage += reduced_size)
	{
		first = ???; last = ???;
		for (col = first; col <= last; col++)
		{
			diff  = tmptarget[col] - tmpimage[col];
			sum2 += diff*diff;
		}
	}

	return (sum2);
}

The question is: what are the values of first and last? When row = 0, row*row + col*col > rowstart*rowstart for the first time when col = rowstart + 1, and row*row + col*col <= rowend*rowend for the last time when col = rowend. So we start with first = rowstart + 1 and last = rowend. Except that we also need col <= colend, so we let last = (rowend <= colend ? rowend : colend).

As row gets larger, we will have row*row + col*col > rowstart*rowstart for smaller values of col. If first - 1 >= 0 and row*row + (first-1)*(first-1) > rowstart*rowstart, then we can make first smaller. On the other hand, row*row + col*col <= rowend*rowend will also require smaller values of col, so if last >= 0 and row*row + last*last > rowend*rowend, we make last smaller. We end up with this code:

Code:
float sd_of_difference(float *tmpimage, float *tmptarget, int rowstart, int rowend, int colstart, int colend)
{
	float sum2 = 0, diff;
	int row, col;
	int first = rowstart + 1;
	int last = colend <= rowend ? colend : rowend;

	for (row = 0; row <= rowend; row++, tmptarget += reduced_size, tmpimage += reduced_size)
	{
		/* both bounds only ever move downward as row grows */
		while (first > 0 && row*row + (first-1)*(first-1) > rowstart*rowstart) --first;
		while (last >= 0 && row*row + last*last > rowend*rowend) --last;

		for (col = first; col <= last; col++)
		{
			diff = tmptarget[col] - tmpimage[col];
			sum2 += diff*diff;
		}
	}

	return (sum2);
}

And now we have a loop that can be vectorised using SSE instructions, adding four products at a time. Then you do a bit of loop unrolling and suddenly this code gets fast. Or see how fast the VecLib functions are.
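
For what it's worth, a sketch of that vectorised inner loop with SSE intrinsics (function and variable names are my own):

Code:
#include <xmmintrin.h>

/* sum of squared differences over [first, last], four floats at a time */
float row_sum_sse(const float *tmptarget, const float *tmpimage, int first, int last)
{
	__m128 acc = _mm_setzero_ps();
	int col = first;

	for (; col + 3 <= last; col += 4) {
		__m128 t = _mm_loadu_ps(tmptarget + col);
		__m128 m = _mm_loadu_ps(tmpimage + col);
		__m128 d = _mm_sub_ps(t, m);
		acc = _mm_add_ps(acc, _mm_mul_ps(d, d));	/* four products at a time */
	}

	float part[4];
	_mm_storeu_ps(part, acc);
	float sum2 = part[0] + part[1] + part[2] + part[3];

	for (; col <= last; col++) {			/* scalar leftovers */
		float diff = tmptarget[col] - tmpimage[col];
		sum2 += diff * diff;
	}
	return sum2;
}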
 
Sorry, been busy with other things. Will look at this as soon as possible.
Chown: the non-working code is commented out in the original code. It's here somewhere, but I couldn't find it just now.
 
Doug, who advised you to use interpolating splines to scale image data, when the purpose of the scaling is some kind of image matching? Splines are useful to approximate smooth data; image data isn't smooth. You should probably use a simple bilinear filter. You might not use any filtering at all, just look up the nearest pixel value, unless the things you try to recognise are really tiny (say, a few pixels in size), and then you _really_ want to talk to someone who knows maths before you proceed.
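
For reference, a minimal bilinear lookup might look like this (a sketch; img, rows, cols are my names, and the image is assumed row-major):

Code:
/* sample the image at the fractional position (x, y), clamped to the edges */
float bilinear_sample(const float *img, int rows, int cols, float x, float y)
{
	int x0 = (int) x, y0 = (int) y;
	float fx, fy, top, bot;
	const float *p;

	if (x0 < 0) x0 = 0;
	if (x0 > cols - 2) x0 = cols - 2;
	if (y0 < 0) y0 = 0;
	if (y0 > rows - 2) y0 = rows - 2;
	fx = x - x0;
	fy = y - y0;

	p = img + y0 * cols + x0;
	top = p[0]    + fx * (p[1]        - p[0]);	/* blend along the upper row */
	bot = p[cols] + fx * (p[cols + 1] - p[cols]);	/* blend along the lower row */
	return top + fy * (bot - top);			/* blend between the two rows */
}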
 

It's about halfway into SR.txt. Use Find and search for memcpy.

Code:
   for ( row = 0; row < rows; row++)
        for (col = 0; col < cols; col++)
            {
            yb1[row][col] = y[row*cols +col];
            y2[row][col] = ms[row*cols +col];
            }
    
    
    //for ( row = 0; row < rows; row++)
    //{
    //    memcpy(&yb1[row], &y[row*rows], sizeof(float)*rows);
    //    memcpy(&y2[row], &ms[row*rows], sizeof(float)*rows);
    //}

The calculations in the memcpy() versions are simply wrong. If it's not clear why after thinking about it, here is my guess (spoiler): consider whether rows should be cols, and whether &yb1[row] should be &yb1[row][0].
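
In other words, assuming yb1 and y2 really are 2-D arrays with cols floats per row, the fixed version should look something like this:

Code:
	for ( row = 0; row < rows; row++)
	{
		/* copy one cols-long row of the flat arrays into row 'row' */
		memcpy(&yb1[row][0], &y[row*cols], sizeof(float)*cols);
		memcpy(&y2[row][0], &ms[row*cols], sizeof(float)*cols);
	}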
 
Thanks. Will get to it. A number of other things have my attention at the moment.

I know you won't believe this but I've downloaded a copy of memwatch to help me clean up the code.
 