macOS Encoding problems in Mac

anandds · Jan 28, 2010

Hi all,

We are facing some encoding problems related to Japanese special characters. The whole thing boils down to the following analysis:

Create a file whose name is "ホンダ" and whose contents are ホンダ:

$ pwd
/test
$ ls
ホンダ
$ cat ホンダ
ホンダ
$ cat ホンダ | od -x
0000000 83e3 e39b b383 83e3 0a80
0000012
$ ls | od -x
0000000 83e3 e39b b383 82e3 e3bf 9982 000a
0000015

The question is: why does the output from 'ls' contain more bytes than the output from 'cat'? It looks like file names are encoded differently from the contents of each file.

Any help on this would be great.

Thanks,
Anand

wrldwzrd89 · Jan 28, 2010

Indeed the encoding IS different. The goofy thing is that file names are represented one way to the GUI bits of Mac OS X (Unicode) but another to some, but not all, of the CLI bits (the legacy Mac OS Roman encoding). This obviously can lead to issues reading and writing such files.

gnasher729 · Jan 28, 2010

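1. "od -t x1" will give the bytes one at a time, in file order. "od -x" prints 16-bit words, so on a little-endian machine each pair of bytes shows up swapped. You got the UTF-8 sequences
e3839b, e383b3, e38380, 0a (from cat) vs
e3839b, e383b3, e382bf, e38299, 0a (from ls)

2. File names are always converted to canonically decomposed UTF-8. Look those codes up in Keyboard Viewer and it should be quite obvious: the file contents hold ダ as the single code point U+30C0 (e38380), while the file name holds タ (U+30BF, e382bf) followed by the combining voiced sound mark U+3099 (e38299). Remember that the same text can have multiple representations in Unicode; the file system uses a canonical representation. This has nothing to do with Japanese text; try the same thing with ÄÖÜäöü and see what happens.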
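
If you want to check this without a character viewer, here is a rough Python sketch. It assumes the od output above came from a little-endian machine, and `od_x_to_bytes` is just a helper name made up for this post, not part of any library. It undoes the byte swap from "od -x" and compares the result with the composed (NFC) and decomposed (NFD) forms of the name.

```python
import unicodedata

def od_x_to_bytes(words, length):
    """Hypothetical helper: undo the byte swap in `od -x` output.

    Assumes the words were printed on a little-endian machine.
    `length` is the real byte count (od pads odd lengths with a zero byte).
    """
    raw = b"".join(int(w, 16).to_bytes(2, "little") for w in words.split())
    return raw[:length]

# "ホンダ" written with escapes so the literal is guaranteed to be precomposed
name = "\u30DB\u30F3\u30C0"
nfc = unicodedata.normalize("NFC", name)  # composed form, as typed into the file
nfd = unicodedata.normalize("NFD", name)  # canonically decomposed form

# Words copied from the two od -x dumps; lengths are octal 12 and 15
cat_bytes = od_x_to_bytes("83e3 e39b b383 83e3 0a80", 10)
ls_bytes = od_x_to_bytes("83e3 e39b b383 82e3 e3bf 9982 000a", 13)

print(cat_bytes == (nfc + "\n").encode("utf-8"))  # True
print(ls_bytes == (nfd + "\n").encode("utf-8"))   # True
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+30DB', 'U+30F3', 'U+30BF', 'U+3099']
```

So both dumps are valid UTF-8 for the same text; only the normalization form differs. If you need to compare a file name against a string from elsewhere, normalizing both sides to the same form first is the usual approach.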

wrldwzrd89 said:
Indeed the encoding IS different. The goofy thing is that file names are represented one way to the GUI bits of Mac OS X (Unicode) but another to some, but not all, of the CLI bits (the legacy Mac OS Roman encoding). This obviously can lead to issues reading and writing such files.

There is no use of MacRoman in the Mac OS X file system at all.
