
MrFusion

macrumors 6502a
Original poster
Jun 8, 2005
613
0
West-Europe
I am looking for advice on good software design with this post.
I have a program that reads a number of data files and plots the data, or calculates new data, depending on which entries are selected. It works great for a smaller number of data files (say, up to 500). For each file, a data-class instance is created. This instance reads the data, stores it in an NSString, and adds several more objects to itself based on the data it just read.

Now I have over 30,000 data files to process. Needless to say, my program is choking on them. I guess I have to add some kind of data caching, reading only the data that is actually in use. The question is: should I add cache logic to every data object (a bottom-up decision), should I make a new object that makes such decisions for the data objects (a top-down decision), or something else? What is the best method? I am looking for advice from someone with more experience.

Would using NSData objects, C arrays, and floats help, instead of NSMutableArray and NSNumber? I use the latter because they are easy to work with and to pass around between methods, without worrying too much about pointers and memory allocation.
 

Mac Player

macrumors regular
Jan 19, 2006
225
0
Is your algorithm asymptotically good?

Since you can't store native types in the NS containers, I usually use STL containers to do the job.
 

MrFusion

macrumors 6502a
Original poster
Jun 8, 2005
613
0
West-Europe

At the moment, my program is for taking a quick glance at the raw data.
It only calculates averages and extracts data for more specialised programs (written in Octave). There is too much data to use all of it directly in those programs, and not all of it is useful.
If I ever get around to re-implementing my Octave routines in my Cocoa program, I'll try to use GSL functions. I am not planning on programming my own matrix or fitting routines; such things vastly exceed my current programming skills.

What do you mean by STL containers? Sorry, CS is not my background. I only know Objective-C and some Cocoa.
 

Cromulent

macrumors 604
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
STL stands for Standard Template Library, and it is an important part of C++. If you don't know C++, I'd stay well clear. Hell, from what I have heard, even if you do know C++ they are not the easiest thing to get your head around.
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
I want to make sure I understand the problem you are trying to solve. You mention caching, so does that mean that the same data is read over and over again from disk, and this is what's slow? Or your design requires that all data be read into memory at once whether it's needed or not, and that's what is slow, so you want to have a means of selectively reading data instead?

These are very different problems, so if you give us more information we might be able to provide more assistance.

-Lee

Re: NSNumber, if the boxing and unboxing of native types to NSNumber and back is what's slowing you down, you're doing something very, very wrong. I doubt the data structures are the real problem here, but without more information I can't say for certain.
 

MrFusion

macrumors 6502a
Original poster
Jun 8, 2005
613
0
West-Europe

At the moment I don't do smart caching.
In the first version, all data was read at once and kept in memory. This was very slow at startup.
In the second version, data is read from disk when needed and kept in memory (this was the fastest way to adapt version 1). This is fast at startup, but slow when part of each data file needs to be extracted, because then all data files end up read and kept in memory. So it works fine (but slowly) until all memory is used.
I think in version 3 I need something that reads data when needed and deletes it when not needed. So something needs to decide what is kept in memory.

Each data file can stand on its own, but it can also be part of a group of data files for which "something" is calculated. At the moment that "something" is the average, but this will be extended at some point in the distant future.

I am willing to restart the design from the ground up to handle an efficient caching method.

It's not a matter of speed, I think, but of memory footprint. Doesn't NSNumber have a larger footprint than a float, for example?
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX

If it's a matter of memory, whenever you are done with any piece of data it should be discarded. This could lead to a lot of reads, though, if you need certain things frequently. The real issue with trying to cache here, it seems, is that you won't know how much memory is available to a process before things have to start swapping, etc. This means that you don't know when you're almost "full" and should start discarding some older/less frequently used data.

In terms of the memory footprint, NSNumber will certainly be larger than a float. You know a float is 4 bytes, period. NSNumber is going to have some amount of overhead on the heap to store runtime information, as well as at least 8 bytes of "backing" storage, as it appears that it can store up to a long long or double.

By the same token, using a C-style array will have almost no overhead, since it only needs to store a pointer (4 or 8 bytes) to the base of the array plus the actual storage needed for the elements. An NSArray or NSMutableArray will have a bit of extra overhead because it needs to store runtime information like current size, etc. It will only store pointers, so each element will be the size of a pointer. This may be a little more or a little less than the storage needed in a C-style array, depending on the type and platform. You are sacrificing "easy" dynamism with a C-style array for a fairly small reduction in memory usage. The real issue is that an NSArray can only store pointers to NSObjects, so you have to use a wrapper like NSNumber. You could make a small class that has a single float field and store that in an NSArray, but you're probably not going to save much over using NSNumber.

Unfortunately it doesn't sound like there is a magic bullet in this situation. If you have many gigabytes to process, and you might use any part of it at any time, you're in a tough spot. It sounds like your data is in a lot of separate files that are requested "on demand". It may be too simplistic, but I'd probably try to have the data in each file abstracted by a new class. It would contain an NSArray of the data or a C-style array of the data. You'd need a "master table" of these, perhaps stored in an NSDictionary. The key could be some unique identifier (set number, filename, etc.). The value would be this new object you've created. You can quickly check if the key is present in the table. If it is, the data is available; if it's not, you need to instantiate a new object of your new class, at which point it is assumed the data is needed, so the data is loaded from the file.

If you don't have an easy way to tell how many data points are in your file before reading (when the file is all floats, you can just take the file size in bytes divided by 4 to get the number of items), you might use a dynamic type like NSMutableArray while you're reading in, then use -count and malloc a C-style array of float, then loop over the results unboxing from NSNumber to float. If you don't do it like this, you'll have to take a best guess, malloc your C-style float array to that size, then possibly realloc to a new larger size by hand. This is doable, but it is a bit messier and requires a lot more manual manipulation.

When you're done with a particular set of data (either because you just KNOW or you've determined you need to clean up based on the number of elements in your dictionary, etc.), you need to free your C-style array or release your NSArray, then release your object.

You may have all of that covered already, and you're wanting help designing the clean-up algorithm. That's going to be a bit tougher. There are a ton of ways to do this... the two simplest are time-based (the oldest data set is flushed first), in which case you just need to track the time you read a value (probably just seconds since the epoch in a 4-byte value, as long as you know your code will not be used after 2038) and then have a fast way to find what to flush. Alternatively, you could use an ordered list and delete from the beginning. To avoid paying log(N) time to access the elements by key, you could keep this list separate from the NSDictionary that provides you fast access. You'd just have to remember to delete from the dictionary when you remove from the beginning of the list.

The other option is flushing the least-used data first. This will be a bit tougher, because you'll have to keep a usage count. This could be done by extending NSDictionary so each data element has a usage count that's incremented whenever you call objectForKey:, etc. If you didn't want to do that, you would have to track this manually... there are plenty of ways to do that: either keep a count in your new data-storage object, or keep another dictionary whose value is the count (though finding the item to flush becomes slower; you can mitigate this by keeping the list ordered, but that makes each reference a little slower).

This is only worth doing if you know you can do a better job of this than the system's virtual memory system.

-Lee
 

MrFusion

macrumors 6502a
Original poster
Jun 8, 2005
613
0
West-Europe

Thanks Lee, you gave me a few things to think about.
I guess it is not as easy as I hoped it would be.
 