MrFusion

macrumors 6502a
Original poster
Jun 8, 2005
613
0
West-Europe
When reading a file, it is important to know the encoding, or so I have learned lately.
NSFileManager has a method, attributesOfItemAtPath:error:, which returns a dictionary.
The keys include e.g. NSFileSize, but also NSFileExtendedAttributes. NSFileAttributes is a category on NSDictionary.

These
Code:
[myDictionary valueForKey:NSFileSize];
[myDictionary fileSize];
give the same result.

But whatever I try, I can't get to the NSFileExtendedAttributes. In this snippet, d2 points to 0x0 and d3 causes a crash.
Code:
	NSDictionary *dict = [[NSFileManager defaultManager] attributesOfItemAtPath:aPath
													 error:nil];
	unsigned long long size = [dict fileSize];
	for (id key in [dict allKeys]) {
		NSLog(@"%@ %@",key, [dict valueForKey:key]);
	}
	id d2 = [dict valueForKey:@"NSFileExtendedAttributes"];
	id d3 = [dict fileExtendedAttributes];

NSFileExtendedAttributes {
"com.apple.TextEncoding" = <6d616369 6e746f73 683b30>
}

How can I get the TextEncoding, and how do I interpret its value?

Thanks!
 
I don't even see documentation for NSFileExtendedAttributes, so my guess is it's private and shouldn't be used. Second, I would guess these extended attributes don't exist for every file (assuming they're based on the getxattr(), setxattr() functions), so you shouldn't rely on them even being available. Third, text encodings are usually guessed based on the bytes of the file, or the BOM.
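
As a rough illustration of the BOM check mentioned above, here is a minimal sketch in plain C. The function name bom_encoding and the returned labels are made up for this example, and a file without a BOM simply yields NULL, so this can only ever be one hint among several.

```c
#include <stddef.h>
#include <string.h>

/* Returns a label for the byte-order mark at the start of buf,
   or NULL when no BOM is recognised. The labels are illustrative only. */
static const char *bom_encoding(const unsigned char *buf, size_t len)
{
    /* Check UTF-32 first: FF FE 00 00 also begins with the UTF-16LE BOM. */
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "utf-32le";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "utf-32be";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "utf-8";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "utf-16le";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "utf-16be";
    return NULL;
}
```

With an NSData object, the first few bytes from its bytes pointer could be passed in, together with its length.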
 
1) Then I won't.
2) I wasn't. I was going to go with UTF-8 if the extended attribute didn't exist, with an option for the user to specify another encoding.
3) There are so many encodings available, and I didn't find a routine in Cocoa that would make a guess for me. Maybe I overlooked it, though.

I want to manipulate the bytes directly (read from disk by NSData) before handing them over to an NSString.
 
I looked into this a little more. The method writeToFile:atomically:encoding:error: sets the attribute. It also gets set by TextEdit (probably using that method).

Here's the non-Cocoa way to get the attribute :)
Code:
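#include <sys/xattr.h>

char textEncoding[256];
// Leave room for a terminator: getxattr() does not NUL-terminate the value.
ssize_t attrSize = getxattr([path fileSystemRepresentation], "com.apple.TextEncoding", textEncoding, sizeof(textEncoding) - 1, 0, 0);
if (attrSize > 0) {  // -1 means an error, e.g. the attribute does not exist
    textEncoding[attrSize] = '\0';
    printf("textEncoding: %s\n", textEncoding);
}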

I suppose you could check for it, just using it as a hint. The com.apple.TextEncoding value is NSData, and it appears to be just normal ASCII text in the format of "xxx;yyy" where xxx is the encoding (e.g. "utf-8", "us-ascii", etc) and yyy appears to be some number.
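
For what it's worth, the bytes in the first post (<6d616369 6e746f73 683b30>) read as ASCII spell "macintosh;0", which fits that "xxx;yyy" pattern. Below is a minimal sketch of splitting such a value in plain C. The function name parse_text_encoding is made up for this example, and treating the part after the semicolon as a decimal number (possibly a CFStringEncoding value) is an assumption, not something confirmed in this thread. The sketch also rejects values where either part is missing, which may be stricter than necessary.

```c
#include <stdlib.h>
#include <string.h>

/* Splits an attribute value such as "macintosh;0" into its name and number.
   value/len are the raw attribute bytes (not NUL-terminated).
   Returns 0 on success, -1 if the value is malformed or the name does not fit. */
static int parse_text_encoding(const char *value, size_t len,
                               char *name, size_t name_cap, unsigned long *number)
{
    const char *semi = memchr(value, ';', len);
    if (semi == NULL) return -1;

    size_t name_len = (size_t)(semi - value);
    if (name_len == 0 || name_len >= name_cap) return -1;
    memcpy(name, value, name_len);
    name[name_len] = '\0';

    /* Copy the digits so strtoul sees a terminated string. */
    char digits[32];
    size_t num_len = len - name_len - 1;
    if (num_len == 0 || num_len >= sizeof digits) return -1;
    memcpy(digits, semi + 1, num_len);
    digits[num_len] = '\0';

    char *end;
    *number = strtoul(digits, &end, 10);
    return (*end == '\0') ? 0 : -1;
}
```

The buffer and length returned by getxattr() could be passed straight in as value and len.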

For guessing an encoding, there's a lot of info available. A few links:
http://stackoverflow.com/questions/1351151/guess-encoding-when-creating-an-nsstring-from-nsdata
http://developer.apple.com/mac/libr...Conceptual/Strings/Articles/readingFiles.html
 
Thanks!


Thanks for the links. To quote the first:
"In short, to be able to handle all available encodings you need to do what TextEdit does: shunt the decision over to the user."

Do you also happen to know if there is much difference between NSData and NSString in overhead? Is it more efficient to internally store my information as NSData than as NSString? If there is no difference, I am probably better off using the method mentioned in that first link:
"initWithContentsOfFile:usedEncoding:error: and it will guess the encoding of the file."
 
Probably depends on how you're going to use it. If you can provide more details on what you're doing that may help.
 
Sure. I am analyzing experimental data. These are xy data files from a Windows computer. One example contains over 1000 files and the directory is over 40 MB. Maybe that is not much in the world of IT, I don't know. But my programs are not that fast. Going with GCD will probably help.

The problem with these files is the carriage-return line endings, which I have to change to \n. I have to scan the bytes and replace them. After that, I have to read the xy data and turn it into two NSArrays. I also like to show the raw data to the user (that is, me) because I am paranoid about bugs in my own programming. The raw data also contains a header with more experimental information.
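
A minimal sketch of that byte-level replacement in plain C follows. The function name normalize_line_endings is made up for this example, and the sketch assumes a single-byte encoding or UTF-8, where the bytes 0x0D and 0x0A can only mean CR and LF; it would not be correct for UTF-16 data.

```c
#include <stddef.h>

/* Rewrites buf in place so that "\r\n" and a lone "\r" both become "\n".
   Returns the new length, which is never larger than len. */
static size_t normalize_line_endings(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        if (buf[in] == '\r') {
            /* Skip the '\n' of a CRLF pair so it is not emitted twice. */
            if (in + 1 < len && buf[in + 1] == '\n') in++;
            buf[out++] = '\n';
        } else {
            buf[out++] = buf[in];
        }
    }
    return out;
}
```

Because the result is never longer than the input, this could run directly on the mutableBytes of an NSMutableData, followed by setLength: with the returned value.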

I already know from experience that disk access is the bottleneck. Loading and interpreting all files at once is way too slow (maybe GCD will help here). So I moved to a cache-based system: I load the file locations into my program, and the data is analyzed when I want to access it. However, the file locations are inflexible. Moving the data around means that my program can't find it anymore.

I also don't care that much about disk space, more about not losing or corrupting data. And speed and ease of use, of course.
However, if I keep the data in memory, I will have it twice: once in the arrays and once as an NSString (the raw data). Maybe using NSData will keep the memory footprint down? Maybe I can use these new URL locations, but then how will that help if I send the file to someone else?

I have several programs, all dedicated to a different kind of data set. They all have some things in common, such as loading of data. That is why I am now rewriting this as a framework. The header is different each time, but that shouldn't be a problem anymore with blocks.

Maybe these are trivial problems for an experienced programmer, but I am self taught and within my group one of the better programmers. That doesn't say much, though, most people don't do any programming. So I really appreciate all the help this forum has given over the last few years!


Did this help, or do you need more details?
 
Sure. I am analyzing experimental data. These are xy data files from a Windows computer.

If the files originated on a Windows computer, then the likelihood of them having a com.apple.TextEncoding xattr is probably zero. Unless you've processed them beforehand with some program that added the xattr.

This would explain why you weren't seeing any xattrs before: the returned dictionary was nil, or the returned value was nil. When you ask a dictionary to get an object for a key, it may return nil. Your code must account for that.

I already know from experience that disk access is the bottleneck. Loading and interpreting all files at once is way too slow (maybe GCD will help here). So I moved to a cache-based system: I load the file locations into my program, and the data is analyzed when I want to access it. However, the file locations are inflexible. Moving the data around means that my program can't find it anymore.

Why do you think GCD will help? It won't make your disk faster.

Have you examined your app's disk-speed in actual use? Activity Monitor.app, an Apple-standard app located in /Applications/Utilities, has a Disk Activity tab that shows read and write data rates. You might want to look at its other capabilities, too.

Exactly what do you mean by "cache based system"? Do you mean you keep the pathnames without reading the files?

Consider posting code, or at least pseudo-code.


However, if I keep the data in memory, I will have it twice: once in the arrays and once as an NSString (the raw data). Maybe using NSData will keep the memory footprint down?

After the data has been read from the file and converted to arrays, there is no longer any need for the file's bytes in memory. You have the numbers in the arrays, so keeping an NSString or NSData representation of the original is pointless.


Maybe I can use these new URL locations, but then how will that help if I send the file to someone else?

This is confusing. Do the URLs refer to a publicly reachable data store? Exactly who is this "someone else", and where are they located on your network?

Have you considered a remote storage service like Amazon S3? Store your data there, and you can make it available to others by changing the ACLs. Read the docs for S3. Google for them.

How valuable is this data? Put a monetary figure on it. If it's worth a lot, then consider hiring a professional programmer to write code for you, using specifications you provide.
 
If the files originated on a Windows computer, then the likelihood of them having a com.apple.TextEncoding xattr is probably zero. Unless you've processed them beforehand with some program that added the xattr.

This would explain why you weren't seeing any xattrs before: the returned dictionary was nil, or the returned value was nil. When you ask a dictionary to get an object for a key, it may return nil. Your code must account for that.
No, although most files originate on Windows, the framework should also be able to handle non-Windows files.
I tested it on a Mac OS X file. The com.apple.TextEncoding xattr was present. Of course, I do not expect it to be present on a Windows file.


Why do you think GCD will help? It won't make your disk faster.

It won't. But it will free up the GUI and avoid the spinning beach ball. I can look at the data already imported.

Have you examined your app's disk-speed in actual use? Activity Monitor.app, an Apple-standard app located in /Applications/Utilities, has a Disk Activity tab that shows read and write data rates. You might want to look at its other capabilities, too.
No.
Doesn't Instruments do similar things?

Exactly what do you mean by "cache based system"? Do you mean you keep the pathnames without reading the files?
Yes. At the moment, I read the files when I need them. But this is not a good solution, as I cannot move the original files around.
Consider posting code, or at least pseudo-code.

In its most basic form:
1) read data from disk
2) replace carriage return by \n
3) convert data to arrays
4) calculate new data
5) "dump" everything in the GUI for the user to evaluate: original data, converted data, calculated data
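
Step 3 could look roughly like the following plain C sketch. The function name parse_xy is made up for this example, and the layout is an assumption: whitespace-separated "x y" number pairs with any header already skipped. The real file format isn't shown in this thread, so this is only a starting point.

```c
#include <stddef.h>
#include <stdlib.h>

/* Parses "x y" pairs from NUL-terminated text into xs and ys.
   Stops at the first token that is not a number, or once cap pairs are stored.
   Returns the number of complete pairs parsed. */
static size_t parse_xy(const char *text, double *xs, double *ys, size_t cap)
{
    size_t n = 0;
    const char *p = text;
    while (n < cap) {
        char *end;
        double x = strtod(p, &end);   /* strtod skips leading whitespace, including newlines */
        if (end == p) break;          /* no number found */
        p = end;
        double y = strtod(p, &end);
        if (end == p) break;          /* x without a matching y */
        p = end;
        xs[n] = x;
        ys[n] = y;
        n++;
    }
    return n;
}
```

The two C arrays could then be wrapped into NSArrays of NSNumber objects, or kept as plain doubles if the calculations in step 4 don't need objects.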

After the data has been read from the file and converted to arrays, there is no longer any need for the file's bytes in memory. You have the numbers in the arrays, so keeping an NSString or NSData representation of the original is pointless.
Not if I want to examine the original file and check if everything went okay.

This is confusing. Do the URLs refer to a publicly reachable data store? Exactly who is this "someone else", and where are they located on your network?
Ah, sorry. No, I mean NSURLs as opposed to NSString paths. Unless I'm mistaken, since Snow Leopard files can be referred to by an identifier as well as a path.

Sharing data is done by storing it on a local NAS drive, or by email. If I send a document with the analysis, the data should be included in the document. Preferably, in the original form as it was obtained. When in doubt about the analysis, one can always fall back on the original data (and a calculator or pen and paper).

How valuable is this data? Put a monetary figure on it. If it's worth a lot, then consider hiring a professional programmer to write code for you, using specifications you provide.

haha, good one. :)
 