Bug: TWiki on Mac OS X server with I18N generates odd-looking file names

InternationalisationEnhancements tested with Mac OS X generate odd-looking file names, due to HFS+ and UFS filesystem UnicodeNormalisation issues. TWiki does work OK, but the filenames are not very easy to use for administrators using the command line.

It's also possible that attachments using some I18N characters, uploaded from Mac clients and downloaded by Windows/Unix clients, could cause problems - not tested.

(See MozillaURLEncodingWithI18N for original bug report from InternationalisationEnhancements - turned out to be mainly Mozilla UTF8 URL encoding issues.)

Test case

See comment by StefanLindmark in InternationalisationEnhancements. Browser has not been configured in any way, but there are no configuration notes for Mozilla. I'm attaching TWiki.cfg and testenv output.

Environment

TWiki version: Alpha20021202
TWiki plugins: -
Server OS: Mac OS X 10.2.1
Web server: Apache 2.0.40
Perl version: 5.6.0
Client OS: Mac OS X 10.2.1
Web Browser: Mozilla 1.2

-- StefanLindmark - 03 Dec 2002

Follow up

Fix record

(From emails) MacOS is creating quite weird looking filenames, but TWiki is working fine, so I'm setting this to BugResolved. If people using TWiki I18N find the filenames annoying on MacOS X, please open a new bug. StefanLindmark is now testing on Perl 5.6.1 on Linux, which works fine.

-- RichardDonkin - 10 Dec 2002

I've done some more testing to shed some light on how file names are treated in OS X. What I did was:
  • Created the topic AnAufHinterInNebenÜberUnterVorZwischen in TWiki running on Linux stored on reiserfs filesystem
  • Created the topic AnAufHinterInNebenÜberUnterVorZwischen in TWiki running on OS X stored on HFS+ filesystem
  • Created the folder AnAufHinterInNebenÜberUnterVorZwischen in Finder running on OS X stored on HFS+ filesystem

Then I ran ls > filename on each of these, using ls in the same environment in which it was created. The resulting output files are attached to this topic. Hopefully they will be of use to the people working on further i18n development.

-- StefanLindmark - 11 Dec 2002

One implication that needs to be investigated is portability. If I run TWiki with i18n enhancements on a server running OS X, what happens if I want to move the site to a box with a different OS/filesystem? Would it be possible to transfer the files straight over to the new environment or would there be a need to recode the filenames?

-- StefanLindmark - 14 Dec 2002

Only one way to find out, so I tried it by doing this:
  • tar cvf an.tar AnAuf*
  • scp an.tar mysite.net:upload
  • ssh mysite.net
  • cd upload
  • tar xvf an.tar
  • ls AnAuf*
  • ls ../twiki/data/Sandbox/AnAuf*

The result can be seen below:

(Screenshot: untarred filenames.png)

So I guess this is something to worry about if you want to be able to move files around between different systems as your server platform may shift over time.

-- StefanLindmark - 17 Dec 2002

Interesting - however, I think the best longer term solution is to find out why MacOS X is UTF8-encoding filenames and see if it can be configured to avoid this, or to show the names to the user in ISO8859-1 (or perhaps to just support UTF8 filenames and topic names). Transforming UTF8 filenames into ISO8859-1 when moving server platforms would be another option.
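
As a rough illustration of the last option, here is a minimal Python sketch of converting one UTF-8 filename to ISO-8859-1. The function name utf8_name_to_latin1 is hypothetical, and the sketch assumes the name arrives as raw UTF-8 bytes, possibly in decomposed form. It composes the name first, since a base letter plus a combining accent has no single ISO-8859-1 byte.

```python
import unicodedata

def utf8_name_to_latin1(raw_name: bytes) -> bytes:
    """Convert a UTF-8 encoded filename to ISO-8859-1 bytes (hypothetical helper)."""
    text = raw_name.decode("utf-8")
    # Compose first: "U" + combining diaeresis cannot be encoded in
    # ISO-8859-1, but the single composed character can.
    composed = unicodedata.normalize("NFC", text)
    # Raises UnicodeEncodeError if a character is outside ISO-8859-1.
    return composed.encode("iso-8859-1")

name = "AnAufHinterInNeben\u00dcberUnterVorZwischen.txt"
# Simulate a decomposed UTF-8 name as it might come off another filesystem.
decomposed_utf8 = unicodedata.normalize("NFD", name).encode("utf-8")
converted = utf8_name_to_latin1(decomposed_utf8)
print(converted)
```

A real migration would also have to rename the matching RCS files and handle names that contain characters outside ISO-8859-1, which this sketch simply rejects.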

-- RichardDonkin - 19 Dec 2002

Apple technote #1150 documents the Unicode filename encoding of the HFS+ filesystem. With _trace enabled on the RCS operations, the debug.txt file shows that RCS commands are using 8-bit single character ISO-8859 encoding of filenames (i.e. "å" encoded as E5 hex). But files are still written with Unicode filenames. One idea could be to use RcsLite and see if ci and friends in their Apple-distributed form are the cause of this.

Transforming filenames on server platform moves makes data portable, but I'm sceptical about having TWiki running on OS X generating a lot of files on the backend that are difficult to browse, back up, restore, etc. I haven't even started thinking of how useful the available backup tools will be when filenames turn up with mixed charsets and script styles (e.g. starting with western chars, and then reversing script direction to right-to-left and using non-western characters). I guess it would be more difficult to handle that situation than having stray "?" replacing 8-bit chars, still leading to recognizable filenames. So my ambition is still to try to find out how to move TWiki away from this Unicode stuff on OS X and behave like other common Unix systems like Linux and Solaris.

-- StefanLindmark - 21 Dec 2002

I've been doing a lot more research into Unicode (see InternationalisationUTF8), and it's a bit clearer what was happening here from reading the Unicode section of the HFS+ docs (http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties). Basically, HFS+ appears to prefer to work in Unicode 2.1, storing characters internally in 16-bit values, and also normalises all filename characters into a decomposed form (i.e. "å" is encoded as "a" followed by the accent as a separate Unicode character). This can be seen in the Finder-generated attachment below, which presumably is correct.

The TWiki-generated attachment looks like UTF-8 encoding of the precomposed character (i.e. "å" as a single Unicode codepoint, encoded as two bytes in UTF-8).
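
To make the difference concrete, the following Python sketch prints the bytes for "å" in the encodings discussed here. It uses the standard Unicode normalisation forms from Python's unicodedata module, which is an assumption: Apple's variant of decomposition is not reproduced, although for a simple case like "å" the result should be the same. The helper name hex_bytes is purely illustrative.

```python
import unicodedata

precomposed = "\u00e5"  # "å" as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # "a" + U+030A combining ring

def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

print("ISO-8859-1         :", hex_bytes(precomposed, "iso-8859-1"))  # e5
print("UTF-8, precomposed :", hex_bytes(precomposed, "utf-8"))       # c3 a5
print("UTF-8, decomposed  :", hex_bytes(decomposed, "utf-8"))        # 61 cc 8a
print("UTF-16BE, decomposed:", hex_bytes(decomposed, "utf-16-be"))   # 00 61 03 0a
```

The two UTF-8 forms display as the same character but differ byte for byte, which is why a filename written in one form may not match a lookup made in the other.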

UPDATE: HFS+ actually uses an Apple-modified version of Unicode's Normalisation Form D (NFD, i.e. decomposed), whereas Unix/Linux and W3C (http://www.w3c.org) standards use Normalisation Form C (NFC, i.e. precomposed). MacOS X 10.2 seems to have recognised this issue and at least provides an API to normalise into NFC, but in any case TWiki would need to normalise filenames read out from the filesystem into NFC - without this, it appears that the conversion back to ISO-8859-1 doesn't work. This is really a MacOS X implementation issue but can be worked around. Possible solutions include:

  • Add TWiki code to do the normalisation to NFC. This should be configurable as something like $normaliseToUnicodeNFC in TWiki.cfg, enabled on HFS+ filesystems but not on the UFS (Unix style) filesystem. There are some Apple developer docs that describe this in more detail. This is the main option, since it enables non-NFD-capable browsers (e.g. Konqueror 3.1.1) to work with MacOS X and I18N.
  • Try using a UTF-8 or other locale setting when administering TWiki files, so that the conversion from Unicode NFD format to ISO-8859-1 is avoided or works properly. RCS may not work well with Unicode NFD format, though this should be largely transparent to RCS. This will also be necessary, since the first option doesn't change the use of NFD for filenames.
  • Research/test using Perl 5.8.x in case this has addressed the issue. Not covered by Perl 5.8; may be covered by Perl 6.
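
The first option could look roughly like the Python sketch below (TWiki itself is Perl, so this only illustrates the logic). The flag NORMALISE_TO_UNICODE_NFC and the function normalise_names are hypothetical stand-ins for the proposed $normaliseToUnicodeNFC setting. The sketch assumes standard NFC composition is adequate for names stored in Apple's modified NFD, which has not been verified here.

```python
import unicodedata

# Hypothetical switch, modelled on the proposed $normaliseToUnicodeNFC setting.
NORMALISE_TO_UNICODE_NFC = True

def normalise_names(names, enabled=NORMALISE_TO_UNICODE_NFC):
    """Return names as read from the filesystem, composed to NFC if enabled."""
    if not enabled:
        return list(names)
    return [unicodedata.normalize("NFC", n) for n in names]

# Names as a decomposing filesystem might return them: "U" + U+0308.
from_filesystem = ["U\u0308ber.txt", "WebHome.txt"]
print([ascii(n) for n in normalise_names(from_filesystem)])
```

Names that are already in NFC pass through unchanged, so leaving the switch on for a filesystem that does not decompose should be harmless apart from the extra work.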

Some useful links on Apple's NFD-based normalisation in HFS+:

On testing the Finder-generated file below, using IE5.5 in UTF-8 encoding mode, it was displayed correctly - so IE at least is able to display UTF-8 NFD filenames.

The TWiki-generated file has been corrupted somehow, since the capital Ü was transformed into 0xDBA2, which is an Asian character.

-- RichardDonkin - 11 Sep 2003

I now have a plan for how to solve this issue as part of ProposedUTF8SupportForI18N.

If you do need to convert a whole set of filenames from one character encoding to another, have a look at Bjoern Jacke's convmv (man page: http://j3e.de/linux/convmv/man/, download: http://j3e.de/linux/convmv/), suggested by the author in email.

-- RichardDonkin - 14 Oct 2003

It seems that UFS filesystems have the same NFD behaviour on Darwin (the FreeBSD-based Unix underlying MacOS), so it's not just HFS+ (see http://lists.apple.com/archives/unix-porting/2002/Mar/msg00147.html).

There's a related issue mentioned in this W3C list thread (http://lists.w3.org/Archives/Public/www-international/2003OctDec/0079.html): if a MacOS X user attaches a file with a Unicode NFD filename to a TWiki page, by default TWiki would store the filename in UTF-8 without changing the normalisation. This would then mean that users on some other platforms (e.g. Konqueror on Linux) would probably have the NFD filename rendered incorrectly even if the server is not MacOS based!

Also, when TWiki is in UTF-8 mode, MacOS X's builtin conversion of Unicode NFD to ISO-8859-1 etc does not apply - the unconverted Unicode NFD characters from the filesystem will remain in NFD mode, resulting in a similar problem.

So it seems that normalisation will be important if there are any MacOS clients or servers involved in a TWiki deployment, and hence for all public TWiki sites.
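
One way to check whether a deployment is affected is to look for stored names that are not already in NFC. The Python sketch below does this for a list of names. The function names_needing_nfc and the sample attachment names are made up for illustration, and it relies on standard Unicode normalisation rather than Apple's variant.

```python
import unicodedata

def names_needing_nfc(names):
    """Return the names that would change if normalised to NFC."""
    return [n for n in names if unicodedata.normalize("NFC", n) != n]

# Made-up sample: the first name is decomposed, the second is precomposed.
attachments = ["re\u0301sume\u0301.pdf", "r\u00e9sum\u00e9.pdf", "notes.txt"]
for name in names_needing_nfc(attachments):
    # ascii() makes the combining characters visible as escapes.
    print(ascii(name))
```

An empty result would suggest that no decomposed names have reached the server, at least among the names checked.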

Are there any Mac users out there who could test this?

-- RichardDonkin - 14 Feb 2004

Back in 2004, the Mozilla suite and Thunderbird fixed the problem of MacOS exposing Unicode NFD normalisation of filenames to the outside world (which caused a problem with MacOS clients attaching files). The solution was to convert data from MacOS clients from NFD into NFC (which is what the rest of the world uses); see MozillaBug:227547.

-- RichardDonkin - 01 Oct 2006

Interesting thread about I18N filenames and CGI upload (http://www.nntp.perl.org/group/perl.macosx/2005/04/msg8847.html) - may cause some problems on MacOS X at some point, due to use of NFD normalisation by HFS+. See also Bugs:Item3652 re other attachment issues.

-- RichardDonkin - 18 Mar 2007

 
Attachments:
| Attachment | Size | Date | Who | Comment |
| TWiki.cfg | 19 K | 03 Dec 2002 - 10:13 | UnknownUser | TWiki.cfg (used for testing) |
| TestTopic1.html | 4 K | 03 Dec 2002 - 10:24 | UnknownUser | HTML of rendered TestTopic1 with Dec02 code |
| anauf-linux-reiserfs.txt | 132 bytes | 12 Dec 2002 - 10:02 | UnknownUser | ls output on linux files created by TWiki |
| anauf-osx-from-finder-hfs.txt | 41 bytes | 12 Dec 2002 - 10:02 | UnknownUser | ls output on OS X files created by Finder |
| anauf-osx-from-twiki-hfs.txt | 132 bytes | 12 Dec 2002 - 10:03 | UnknownUser | ls output on OS X files created by TWiki |
| perl-V.txt | 2 K | 03 Dec 2002 - 10:30 | UnknownUser | perl -V output on the server |
| testenv.html | 9 K | 03 Dec 2002 - 10:11 | UnknownUser | testenv output |
Topic revision: r2 - 15 Dec 2008, WillNorris