1. For multi-threading, go to developer.apple.com and enter "blocks" into the search box. That will show you how to distribute your work across any number of CPU cores with very little effort. Requires Snow Leopard.
2. Divide the work into chunks that fit into about 32KB and do all the work that is needed on that chunk of data before you proceed to the next. That makes sure all operations happen in L1 cache memory.
3. Perform operations on consecutive array elements in ascending order.
4. Use Apple's vector library (vecLib, part of the Accelerate framework). Again, type "veclib" into the search box on developer.apple.com.
5. In the compiler settings, set the optimisation level to fastest and turn on loop unrolling.
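To make tip 2 a bit more concrete, here is a minimal sketch of chunked processing. The function name, the two-step "scale, then add an offset" pipeline, and the chunk size constants are made up for illustration; 32KB is just the figure from the tip, and the best value depends on the CPU.

```c
#include <stddef.h>

/* Assumed chunk size: 32 KB, as suggested in tip 2. Tune for your CPU. */
#define CHUNK_BYTES (32 * 1024)
#define CHUNK_ELEMS (CHUNK_BYTES / sizeof (double))

/* Hypothetical two-step pipeline: scale every element, then add an offset.
   Both steps are finished on one chunk before moving on to the next. */
void scale_then_offset_chunked (double* data, size_t count, double scale, double offset)
{
    for (size_t start = 0; start < count; start += CHUNK_ELEMS) {
        size_t end = start + CHUNK_ELEMS;
        if (end > count)
            end = count;

        /* Step 1: first touch of this chunk pulls it into the cache. */
        for (size_t i = start; i < end; ++i)
            data [i] *= scale;

        /* Step 2: the chunk should still be cached from step 1. */
        for (size_t i = start; i < end; ++i)
            data [i] += offset;
    }
}
```

With only two trivial steps you would simply merge them into one loop; chunking pays off when the steps are separate functions or library calls that each want to walk over the data.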
Example for using multiple processors with Grand Central Dispatch; method 2 will use all cores on any Mac with about zero programming effort:
Code:
#include <stdlib.h>             /* calloc, free */
#include <dispatch/dispatch.h>

int main (void)
{
    size_t rows = 300;
    size_t columns = 520;
    double* array1 = (double *) calloc (rows*columns, sizeof (double));
    double* array2 = (double *) calloc (rows*columns, sizeof (double));
    if (array1 == NULL || array2 == NULL)
        return 1;

    /* Method 1, single threaded */
    for (size_t i = 0; i < rows; ++i) {
        double* p = array1 + i * columns;
        double* q = array2 + i * columns;
        for (size_t j = 0; j < columns; ++j)
            p [j] = 2.7 * p [j] + 3.14 * q [j];
    }

    /* Method 2, GCD: each row is an independent task, so rows can run in parallel */
    dispatch_queue_t theQueue = dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply (rows, theQueue, ^ (size_t i) {
        double* p = array1 + i * columns;
        double* q = array2 + i * columns;
        for (size_t j = 0; j < columns; ++j)
            p [j] = 2.7 * p [j] + 3.14 * q [j];
    });

    free (array1);
    free (array2);
    return 0;
}
But now back to your original question: What is the fastest way to work with arrays? The fastest way is to learn how modern processors work, how caches work, how to use multiple processors, and how to use vector units. Then write code, profile it with Shark, look for the bits that take most of the time, and improve them. Start understanding your algorithms, start understanding what work is done that doesn't need doing, learn where multiple passes can be combined into one, and find where general algorithms can be replaced with specialised ones. Then profile it again, and so on.
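As a small illustration of combining multiple passes into one, here is a sketch that computes the same expression as the row loop in the example above, once as two separate passes and once as a single pass. The function names are made up for this illustration.

```c
#include <stddef.h>

/* Two passes: every element of p is loaded and stored twice. */
void scale_and_add_two_passes (double* p, const double* q, size_t n)
{
    for (size_t j = 0; j < n; ++j)
        p [j] = 2.7 * p [j];
    for (size_t j = 0; j < n; ++j)
        p [j] = p [j] + 3.14 * q [j];
}

/* One pass: same arithmetic, but each element of p is loaded and stored once. */
void scale_and_add_one_pass (double* p, const double* q, size_t n)
{
    for (size_t j = 0; j < n; ++j)
        p [j] = 2.7 * p [j] + 3.14 * q [j];
}
```

Whether the single pass is actually faster for your data is something to measure, not assume; for arrays that already fit in cache the difference may be small. Depending on compiler settings, the two versions may also differ in the last bit of the result if the compiler fuses the multiply and add.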