UnicodeNormalisation

As part of the later phases of ProposedUTF8SupportForI18N, TWiki will by default expect and use Unicode NFC (Normalisation Form C), as Linux and the W3C standards do. Conversion of NFD (Normalisation Form D) to NFC will be a configurable option, to support MacOSXFilesystemEncodingWithI18N and any plugins that return data in NFD (e.g. from a database).

Some browsers support NFD (IE5.5) and some don't (Konqueror 3.1.1 and Mozilla Firefox 0.8). More importantly, use of NFD in web pages is against W3C standards, so for some UTF-8 TWiki sites it will be important to convert NFD to NFC. This is mainly an issue for filenames in Legacy.MacOS filesystems, not for file contents.

Even Mac-only sites will require conversion, since NFC is much more convenient for use within TWiki, at least for quite a few common European languages. With NFC, Unicode strings in most European languages can be processed and compared without considering more than one character at a time (e.g. a regex can match ä as a single character), while users and third-party data sources remain free to encode the same character in different ways (e.g. in Vietnamese, using two combining characters for accents in either order). However, really complete support of all Unicode writing systems in the longer term will ultimately require treating a combining character sequence as if it were a single character.
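The following is a minimal sketch of the point above, written in Python purely for illustration (it is not TWiki code, and all variable names are made up for this example). It shows a single-character regex class matching ä only in its NFC form, and two Vietnamese combining sequences in different orders converging on the same NFC character.

```python
import re
import unicodedata

nfc = "\u00e4"   # ä as one precomposed code point (NFC)
nfd = "a\u0308"  # a followed by COMBINING DIAERESIS (NFD)

# A regex character class works one code point at a time, so it matches
# the NFC form but not the decomposed form.
pattern = re.compile("^[\u00e4\u00f6\u00fc]$")
matches_nfc = bool(pattern.match(nfc))
matches_nfd = bool(pattern.match(nfd))
matches_after_conversion = bool(
    pattern.match(unicodedata.normalize("NFC", nfd))
)

# Vietnamese ệ entered with its two combining marks in either order.
order_1 = "e\u0302\u0323"  # e + circumflex + dot below
order_2 = "e\u0323\u0302"  # e + dot below + circumflex
same_after_nfc = (
    unicodedata.normalize("NFC", order_1)
    == unicodedata.normalize("NFC", order_2)
)
```

The two Vietnamese inputs differ as raw strings, but normalisation reorders the combining marks canonically before composing them, so both end up as the single code point U+1EC7.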

NFC also simplifies conversion to legacy non-Unicode character sets such as ISO-8859-*, even if some data sources (e.g. plugins) use the decomposed forms, i.e. with combining characters for accents etc. See MacOSXFilesystemEncodingWithI18N for more on this topic.
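As an illustrative sketch only (Python rather than TWiki code; the helper name `to_latin1` is hypothetical), the example below shows why normalising to NFC first helps: a decomposed string cannot be encoded directly as ISO-8859-1 because the combining accent has no equivalent there, whereas the precomposed form encodes cleanly.

```python
import unicodedata


def to_latin1(text):
    """Normalise to NFC, then encode as ISO-8859-1 (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")


decomposed = "Zu\u0308rich"  # u + COMBINING DIAERESIS, as an NFD source might send it

# Encoding the decomposed form directly fails: U+0308 is not in ISO-8859-1.
try:
    decomposed.encode("iso-8859-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# After NFC conversion the ü is one code point that ISO-8859-1 can represent.
converted = to_latin1(decomposed)
```

Characters with no precomposed form in the target charset would still fail to encode after NFC conversion, so this only covers the cases the legacy charset can represent at all.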

It seems that MacOS X transparently converts all NFD filenames back into ISO-8859-1 if that is the network charset. When TWiki is in UTF-8 mode, however, there would be no transparent conversion from NFD, since Apple expects applications to deal with NFD directly. If a browser expects NFC and the server sends NFD (particularly for attachments), the user won't see properly rendered accents.

UTF-8 sites that know all users will be NFC-based (e.g. no MacOS clients or servers) will be able to leave the NFD-to-NFC conversion parameter at its default 'off' setting, improving performance.
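One way such a switch might behave is sketched below in Python. The function name `incoming_text` and the parameter name `convert_nfd_to_nfc` are assumptions made for this example; they are not actual TWiki settings. With the switch off, text passes through untouched and no normalisation cost is paid.

```python
import unicodedata


def incoming_text(text, convert_nfd_to_nfc=False):
    """Return text for internal use, optionally converting NFD to NFC.

    convert_nfd_to_nfc is a hypothetical site setting, off by default.
    """
    if not convert_nfd_to_nfc:
        # Default: assume every client and server already uses NFC.
        return text
    return unicodedata.normalize("NFC", text)


attachment_name = "Universita\u0308t.pdf"  # NFD, as a MacOS filesystem might report it

unchanged = incoming_text(attachment_name)
normalised = incoming_text(attachment_name, convert_nfd_to_nfc=True)
```

A site with MacOS clients or servers would turn the setting on so that the attachment name reaches the browser in NFC.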

Normalisation is also involved in UnicodeCollation.

-- RichardDonkin - 17 Feb 2004, 21 July 2008
Topic revision: r4 - 17 May 2015, CrawfordCurrie