Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.

ranguvar

macrumors 6502
Original poster
Sep 18, 2009
318
2
Hi,

I wrote a program that does the following:

1) It reads in a text file line by line using straight C:

Code:
NSString* readLineAsNSString(FILE *file) {
    char buffer[2048];
	NSString *result = [[NSString alloc] init];
	int charsRead;
	
    do {
        if(fscanf(file, "%2047[^\n]%n%*c", buffer, &charsRead) == 1) {
			char newbuffer[2048];
			int ofs = 0;
			for (int i = 0; i < (2048-ofs); i++) {
				// The "accent aigue" fix ;)
				if (buffer[i] == '\351') {
					newbuffer[i+ofs] = '\303';
					newbuffer[i+1+ofs] = '\251';
					ofs++;
				}
				else {
					newbuffer[i+ofs] = buffer[i];
				}
				
			}
			result = [NSString stringWithUTF8String:newbuffer];
		}
        else
            break;
    } while(charsRead == 2047);
	
    return result;
}


2) Using Cocoa, it creates and evaluates data from the line.
3) If the data seems fitting, it writes the line to a file, if not, the line gets discarded.

The text file is pretty big (almost 1GB). My code works, but using Activity Monitor, I found that it constantly allocates memory (about 4MB per second). I thought "Oh, there must be a leak", but Leaks from Instruments doesn't find any leak. Using Object Allocations from Instruments, I discovered that most of the allocations are done by CFString and NSData. I checked in my code, but I release all objects fine.

How does one go about solving such a problem? I have no idea where all those allocations are done, and why some objects are apparently not released. Have you ever encountered such a problem, and if yes, how did you solve it? Are there any tutorials on Instruments and how to use it to find problems?

Thanks for any help!
- ranguvar

(P.S.: I'm sorry I can't post the code, it's too much and proprietary...)
 
Are you using autorelease? If so, are you draining your pool regularly? If you're using NSString I believe that CFString is the backing store, or a member of the class cluster, something. So look at your uses of NSString. If there's a loop and you use the same NSString * in an assignment, are you always releasing it before the next iteration?

Are you using NSData directly?

-Lee

edit: now that you've posted code it's clear that your assertion that you release all objects fine is incorrect. You assign an object you own to result at the beginning of the function, then in your loop you assign new objects that you own without releasing the initial object you set result to, or the object you alloc each iteration of your loop.
 
Well for one thing on line 3 of this sample code you do
NSString *result = [[NSString alloc] init];

Then later you assign result to a new string? Seems to be a leak right there.
So you just leaked the empty string from the very beginning. But whats worse, is that every time through the while loop you "leak" an autoreleased ~2000 character NSString. Meaning that it gets created with ~2000 characters, doesn't seem to get used by you, and then gets destroyed when the autorelease pool drains.

Here it looks like putting the result = [NSString stringWithUTF8... ]; outside your while loop would at least help.
 
Oops. I said that you owned the new NSString allocated in the loop, but jaredkipe correctly points out that this is actually autoreleased. If you don't drain the pool, though, you're still going to leave this memory hanging.

-Lee
 
Thanks for the answers!

Oops. I said that you owned the new NSString allocated in the loop, but jaredkipe correctly points out that this is actually autoreleased. If you don't drain the pool, though, you're still going to leave this memory hanging.

-Lee

Unfortunately, when I try to drain the pool, the app crashes. I guess there might be some important objects in there?
 
It sounds to me like some objects got a release and autorelease sent, so they've already been destroyed, and the autorelease pool is now trying to release it again.

-Lee
 
It sounds to me like some objects got a release and autorelease sent, so they've already been destroyed, and the autorelease pool is now trying to release it again.

-Lee

I changed the code to:

Code:
NSString* readLineAsNSString(FILE *file) {
    char buffer[4096];
	NSString *result = nil;
	int charsRead;
	
    do {
        if(fscanf(file, "%4095[^\n]%n%*c", buffer, &charsRead) == 1) {
			char newbuffer[4096];
			int ofs = 0;
			for (int i = 0; i < (4096-ofs); i++) {
				// The "accent aigue" fix ;)
				if (buffer[i] == '\351') {
					newbuffer[i+ofs] = '\303';
					newbuffer[i+1+ofs] = '\251';
					ofs++;
				}
				else {
					newbuffer[i+ofs] = buffer[i];
				}
				
			}
			result = [NSString stringWithUTF8String:newbuffer];
		}
        else
            break;
    } while(charsRead == 4095);
	
    return [result autorelease];
}

It's still leaking, even though I made sure I don't send any release messages to the object anymore...
I also tried using a nested autorelease pool, but same thing: the app crashes as soon as I drain it.
 
Code:
[COLOR="Green"]			result = [NSString stringWithUTF8String:newbuffer];
[/COLOR]		}
        else
            break;
    } while(charsRead == 4095);
	
    return [[COLOR="Red"]result autorelease[/COLOR]];
}

The red-hilited code is wrong. The green-hilited code is the reason the red code is wrong.

You are autoreleasing an object that you don't own.

You don't understand the memory management rules. You should review Apple's guide, especially the parts on ownership, and what you should autorelease.
http://developer.apple.com/mac/library/documentation/cocoa/Conceptual/MemoryMgmt/MemoryMgmt.html

If you still can't get it to work, then consider making your function part of a well-isolated, compilable test program that someone else can compile and run. This will require the addition of a main() function, the file-opening code, etc.
 
Along with a self-contained, compilable example that chown33 described, data to test with would be required for us to run this.

-Lee
 
Your code has another bug. The inner 'for' loop is ignoring the value of charsRead. This will cause it to overscan in some cases.

By the way, your use of fscanf() is the most bizarre approach I've ever seen for reading a buffer of characters up to a newline. You should familiarize yourself with more of the stdio functions, in particular fgets(). Frankly, I wouldn't even use that, I'd just use getc() since the stream is probably buffered already, and your scanner has to examine every char read in anyway. That may not address your memory leakage problem, but simplifying the code almost certainly WILL address other problems, such as inability to see simple bugs.

EDIT:
Another bug: your code will malfunction if more than 4KB is read without a newline appearing. The main consequence of this is it makes failure data-dependent, e.g. if a binary file or a file with CR-only line-endings is read, it will fail. This may be related to the memory-leak problem, or it may not, since you haven't provided the input data that causes your failure.
 
Why are you creating so many autoreleased NSStrings that you don't use at all, and only sit on the heap until the autorelease pool gets drained?

hell, your last line could practically be
return [NSString stringWithUTF8: newbuffer];

The only time this WOULDNT work is if the fscanf failed.

You said that you were working with 1GB files, well it seems that this code would tend to take up 1GB+ waiting to be dealloced NSStrings. (little string bits that don't get used.)
 
Your code has another bug. The inner 'for' loop is ignoring the value of charsRead. This will cause it to overscan in some cases.

By the way, your use of fscanf() is the most bizarre approach I've ever seen for reading a buffer of characters up to a newline. You should familiarize yourself with more of the stdio functions, in particular fgets(). Frankly, I wouldn't even use that, I'd just use getc() since the stream is probably buffered already, and your scanner has to examine every char read in anyway. That may not address your memory leakage problem, but simplifying the code almost certainly WILL address other problems, such as inability to see simple bugs.

EDIT:
Another bug: your code will malfunction if more than 4KB is read without a newline appearing. The main consequence of this is it makes failure data-dependent, e.g. if a binary file or a file with CR-only line-endings is read, it will fail. This may be related to the memory-leak problem, or it may not, since you haven't provided the input data that causes your failure.

I was thinking the same thing, this seems like a very poor way of opening and looking at a file.
 
I was looking at the code, trying to figure out exactly what it tries to accomplish. The only thing I can see that differs from simply reading a line and returning it, is the "accent aigue fix". So in trying to figure out what that does, it seems to me you're simply reading bytes in a non-UTF8 encoding, translating only the e-with-acute-accent ('\351' or '\xe9') to a UTF8 represention (\xc3\xa9), then returning everything else as-is. It's being done badly, due to the bugs I already pointed out, but that seems to be its purpose.

As a text-encoding converter, this is inadequate. There may be accented characters other than e-with-acute-accent in the stream, and these will cause problems with NSString, which will try to interpret them as UTF8.

It's also very inefficient, because it fails to recognize that NSString can already accept byte-arrays in many encodings other than UTF8, and do its own translation to Unicode.

You should reread the NSString reference docs, and look at the other text-encodings that are accepted (type name NSStringEncoding, example: NSISOLatin1StringEncoding). The primary methods of interest are -initWithBytes:length:encoding: and its variants.

If this isn't what you're trying to accomplish, i.e. read a line in ISO-Latin-1 (or some other text encoding) and return it as NSString, then please explain what you're trying to accomplish in this function. It is possible there is a more effective alternative than what you have now.
 
I was looking at the code, trying to figure out exactly what it tries to accomplish. The only thing I can see that differs from simply reading a line and returning it, is the "accent aigue fix". So in trying to figure out what that does, it seems to me you're simply reading bytes in a non-UTF8 encoding, translating only the e-with-acute-accent ('\351' or '\xe9') to a UTF8 represention (\xc3\xa9), then returning everything else as-is. It's being done badly, due to the bugs I already pointed out, but that seems to be its purpose.

As a text-encoding converter, this is inadequate. There may be accented characters other than e-with-acute-accent in the stream, and these will cause problems with NSString, which will try to interpret them as UTF8.

It's also very inefficient, because it fails to recognize that NSString can already accept byte-arrays in many encodings other than UTF8, and do its own translation to Unicode.

You should reread the NSString reference docs, and look at the other text-encodings that are accepted (type name NSStringEncoding, example: NSISOLatin1StringEncoding). The primary methods of interest are -initWithBytes:length:encoding: and its variants.

If this isn't what you're trying to accomplish, i.e. read a line in ISO-Latin-1 (or some other text encoding) and return it as NSString, then please explain what you're trying to accomplish in this function. It is possible there is a more effective alternative than what you have now.

One of the files I'm reading is completely UTF-8 but contains a bug for the e-with-accute-accent. That's the purpose of the "accent aigue" fix.

EDIT: Nonsense removed.
 
Code:
NSString* readLineAsNSString(FILE *file) {
	char buffer[4096];
	char newbuffer[4096];
	fgets(buffer, 4095, file);
	
	int offset = 0;
	
	for (int i = 0; i < (4096 - offset); i++) {
		if (buffer[i] == '\351') {
			newbuffer[i+offset] = '\303';
			newbuffer[i+1+offset] = '\251';
			offset++;
		}
		else {
			newbuffer[i+offset] = buffer[i];
		}
	}
	
	return [NSString stringWithUTF8String:newbuffer];
}
This is your new code. It's still wrong.

The fgets() function returns a nul-terminated string (i.e. a C string). Your code doesn't take this into account, so regardless of how many characters are on the line, even a plain newline, your code still processes the entir 4095 characters into an NSString.

I recommend that you test your program with a small simple input file. Start with simple text that contains nothing but ASCII characters (all 0x7F or below). Make it work correctly for that case. Then give it a larger file of nothing but ASCII. Then give it a file with UTF8 but no 0xe9 bytes. Make sure everything still works, which means the output is exactly the same as the input, byte-for-byte. Finally, give it a small test file with a single 0xe9, then a file with multiple 0xe9's, then repeat with more inputs containing 0xe9.

Code:
	NSString *bundlePath = [[NSBundle mainBundle] bundlePath];
	bundlePath = [bundlePath stringByAppendingString:@"/Resources"];
	NSBundle *bundle = [NSBundle bundleWithPath:bundlePath];
	
	// Extract the paths for the resource files.
	NSString *pathForEnglishResource = [bundle pathForResource:@"dataen" ofType:@"txt"];
	NSString *pathForFrenchResource = [bundle pathForResource:@"datafr" ofType:@"txt"];
	NSString *pathForGermanResource = [bundle pathForResource:@"datade" ofType:@"txt"];
	NSString *pathForCitiesResource = [bundle pathForResource:@"cities" ofType:@"txt"];
	
	// Release no longer needed objects.
[COLOR="Red"]	[bundle release];
	[bundlePath release];
[/COLOR]

    ...code snipped here...

	// Again, release some objects.
[COLOR="Red"]	[pathForEnglishResource release];
	[pathForFrenchResource release];
	[pathForGermanResource release];
[/COLOR]

    ...code snipped here...

[COLOR="Red"]	[fileManager release];
[/COLOR]

This code is wrong. The red-hilited code releases objects you don't own.

You can't call release on every object you get. You must follow the rules.

You really need to reread the Memory Management Guide, especially the part about ownership.


I barely looked at any code past the [fileManager release], but you have a lot more errors involving release, including some involving use of an object after it's been released.
 
Okay, I lack complete understanding of memory management. I'm gonna go back to the drawing board now....
To all the mods: Please close this thread.

Thanks for all the help, though!
 
I took a look at the large input-data file (~800 MB unzipped).

I used Hex Fiend to find bytes with the hex value E9. It found a significant number of them, but I stopped after a while because I couldn't see a problem. All the E9s I found appeared to be correct Unicode bytes, as far as I could tell. They were always the start of a CJK unified ideograph, which I confirmed using TextEdit.app to read segments of the text as UTF8, and Character Palette to show the selected char in the app.

So your basic strategy of replacing a 0xE9 byte without looking at anything else nearby might be flawed, regardless of what computer language you attempt the replacement in.
 
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.