
anandds

macrumors newbie
Original poster
May 8, 2009
Hi all,

We are facing some encoding problems related to Japanese special characters. The whole thing boils down to the following analysis:

Create a file named "ホンダ" whose contents are also ホンダ:

$ pwd
/test
$ ls
ホンダ
$ cat ホンダ
ホンダ
$ cat ホンダ | od -x
0000000 83e3 e39b b383 83e3 0a80
0000012
$ ls | od -x
0000000 83e3 e39b b383 82e3 e3bf 9982 000a
0000015


The question is: why does the output of 'ls' contain more bytes than the output of 'cat'? It looks like the file name is encoded differently from the file's contents.

Any help on this would be great.

Thanks,
Anand
 
Indeed the encoding IS different. The goofy thing is that file names are represented one way to the GUI bits of Mac OS X (Unicode) but another to some, but not all, of the CLI bits (the legacy Mac OS Roman encoding). This obviously can lead to issues reading and writing such files.
 
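1. "od -x" prints 16-bit words in the machine's byte order, so on a little-endian Mac each pair of bytes appears swapped. "od -t x1" will give the bytes in a sensible order. You got the UTF-8 sequences
e3839b, e383b3, e38380, 0a (file contents) vs
e3839b, e383b3, e382bf, e38299, 0a (file name)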

2. File names are always converted to canonically decomposed UTF-8. Look those code points up in Keyboard Viewer and it should be quite obvious: ダ (U+30C0) has been split into タ (U+30BF) followed by the combining voiced sound mark (U+3099). Remember that the same text can have multiple representations in Unicode; the file system uses a canonical representation. This has nothing to do with Japanese text specifically; try the same thing with ÄÖÜäöü and see what happens.
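
To make both points concrete, here is a minimal Python sketch that reproduces the byte sequences from the original post. It assumes that standard Unicode NFD is a close enough stand-in for the decomposition the file system applies to these particular characters, and the helper names (`hex_bytes`, `od_x_words`, `same_name`) are made up for illustration.

```python
import unicodedata


def hex_bytes(data):
    """Format bytes one at a time, in file order (like `od -t x1`)."""
    return " ".join(f"{b:02x}" for b in data)


def od_x_words(data):
    """Group bytes into 16-bit little-endian words (like `od -x` on a little-endian Mac)."""
    if len(data) % 2:
        data += b"\x00"  # an odd final byte is padded with a zero byte
    return " ".join(
        f"{int.from_bytes(data[i:i + 2], 'little'):04x}"
        for i in range(0, len(data), 2)
    )


def same_name(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Written with escapes so the literal is guaranteed to be precomposed.
name = "\u30db\u30f3\u30c0"  # ホンダ

composed = unicodedata.normalize("NFC", name).encode("utf-8")    # what `cat` showed
decomposed = unicodedata.normalize("NFD", name).encode("utf-8")  # what `ls` showed

print(hex_bytes(composed))              # e3 83 9b e3 83 b3 e3 83 80
print(hex_bytes(decomposed))            # e3 83 9b e3 83 b3 e3 82 bf e3 82 99
print(od_x_words(composed + b"\n"))     # 83e3 e39b b383 83e3 0a80
print(od_x_words(decomposed + b"\n"))   # 83e3 e39b b383 82e3 e3bf 9982 000a
print(same_name(composed.decode("utf-8"), decomposed.decode("utf-8")))  # True
```

If a program has to match file names against strings read from file contents, normalizing both sides to one form before comparing (as `same_name` does) avoids the mismatch described in the question.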


There is no use of MacRoman in the MacOS X file system at all.
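
One way to check this against the bytes in the original post is the following Python sketch. It only looks at the byte sequence that 'ls' printed (minus the trailing newline); it says nothing about other parts of the system. The codec name "mac_roman" is the one shipped with Python's standard library.

```python
# Bytes of the file name as printed by `ls | od -x`, in file order.
ls_bytes = bytes.fromhex("e3839be383b3e382bfe38299")

# The sequence is valid UTF-8 and decodes to the decomposed form of the name.
text = ls_bytes.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+30DB', 'U+30F3', 'U+30BF', 'U+3099']

# MacRoman is a single-byte encoding with no katakana, so it cannot hold this name at all.
try:
    text.encode("mac_roman")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False
```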
 