Unicode supports specific collation rules, used when sorting data - see the Unicode documentation and CPAN:Unicode::Collate. This may be important as part of UnicodeSupport, as it enables Perl locales (which provide pre-Unicode, locale-based collation rules, used when sorting topic names etc.) to finally be dropped. However, supporting locale-like sort orders using Unicode::Collate is more work than simply using locales.
for current discussions.
Using CPAN's Unicode collation package
CPAN:Unicode::Collate seems to make this quite easy - I don't believe we need any specific options to get default collation order working, just some code like the attached script (which includes UTF-8 data, so it can't be embedded within this page on TWiki.org, which uses ISO-8859-1). The default sort order, with or without Unicode collation, is not very useful in many situations and will typically need customization.
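A minimal sketch of what such a script might look like. This is not the attached script: the word list is made up for illustration, and it assumes a Perl installation that includes Unicode::Collate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample data, chosen only to show the difference
my @words = ('Zebra', 'apple', 'Århus', 'Aarhus', 'éclair', 'Banana');

# Default Unicode collation order (DUCET), no options given
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@words);

# Perl's built-in sort compares code points
my @plain = sort @words;

print "Collated: @collated\n";
print "Plain:    @plain\n";
```

With this data the collated list should come out as Aarhus, apple, Århus, Banana, éclair, Zebra, whereas the built-in sort puts the capitalised ASCII words first, then apple, then the accented words.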
Sites that need a language-specific order would need to customize the collation order in a plugin, as detailed on CPAN:Unicode::Collate. This could be put into a re-usable "language pack", though multi-language sites might need to merge several languages' sort orders, which may well conflict with each other. So the "language packs" might end up being customized for some sites, but many sites could simply use the standard pack for their language.
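One way a "language pack" lookup could be sketched, assuming the installed Unicode::Collate distribution includes the Unicode::Collate::Locale module. The %LANGUAGE_PACK table and the collator_for helper are hypothetical names for illustration, not part of TWiki or of the module, and merging conflicting packs is not attempted here.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical "language pack" table: site language => collation locale.
my %LANGUAGE_PACK = (
    da => 'da',    # Danish
    sv => 'sv',    # Swedish
);

# Return a collator for a site language, falling back to the default
# Unicode collation order (DUCET) when no pack is registered.
sub collator_for {
    my ($lang) = @_;
    my $locale = $LANGUAGE_PACK{$lang};
    return $locale
        ? Unicode::Collate::Locale->new(locale => $locale)
        : Unicode::Collate->new;
}

my @towns = ('Århus', 'Zagreb', 'Odense');

my @default = collator_for('en')->sort(@towns);
my @danish  = collator_for('da')->sort(@towns);

print "Default: @default\n";
print "Danish:  @danish\n";
```

The default order places Århus with the A words, while the Danish collator should place it after Zagreb, since Å sorts after Z in Danish.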
Example script and output
See the attachment for a simple test script you can run from the shell. The output is below - the non-Roman symbols are Hebrew Alef, U+05D0, and Bet, U+05D1. This is not intended to show a particularly good sort order, just how the Unicode::Collate package works in a very simple case. Most languages/cultures will require some customization of the collation order.
>>> Sorted with Unicode Collation <<<
>>> Default sorting without Unicode Collation <<<
The sort order with Unicode Collation is not ideal for some languages (e.g. in Danish, Århus would sort just after Aarhus, as Aa is equivalent to Å and these are two spellings of the same town), but it works for many languages without changes and is a lot better than the default order without Unicode Collation.
for examples of language variations in collation orders.
Normalisation of data before sorting
The Unicode collation algorithm specifies that, by default, normalisation is done as the first step in collation (http://unicode.org/reports/tr10/#Step_1). The Unicode::Collate package can do UnicodeNormalisation if needed, and makes this quite easy.
Even if TWiki assumes all data is in Normalisation Form C (NFC) as per W3C standards, and as planned in UnicodeNormalisation (apart from MacOS, which uses NFD), the Unicode collation standard says that all data must be converted to NFD before it is sorted. However, since this is done once per data item, it should not have a big performance impact.
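A small sketch of the point above, using Unicode::Normalize and Unicode::Collate. The word list is made up, and sorting on precomputed keys is just one way to make sure each item is normalised only once.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Normalize qw(NFC NFD);

binmode STDOUT, ':encoding(UTF-8)';

my $nfc = NFC("Århus");    # Å as a single code point, U+00C5
my $nfd = NFD("Århus");    # A followed by U+030A COMBINING RING ABOVE

# Unicode::Collate normalises to NFD by default (normalization => 'NFD')
my $collator = Unicode::Collate->new;

printf "Raw string comparison: %s\n", $nfc eq $nfd ? 'same' : 'different';
printf "Collation comparison:  %s\n",
    $collator->eq($nfc, $nfd) ? 'same' : 'different';

# Build each sort key once (normalisation happens here), then sort on
# the keys with a plain binary comparison.
my @words  = ($nfd, 'Zebra', 'apple', 'Banana');
my %key    = map { $_ => $collator->getSortKey($_) } @words;
my @sorted = sort { $key{$a} cmp $key{$b} } @words;

print "Sorted: @sorted\n";
```

The two spellings differ as raw strings but compare as equal under collation, so data stored in NFC and data arriving in NFD should still sort together.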
Locale information for Unicode
CLDR (the Common Locale Data Repository) is a Unicode Consortium repository of locale information for a huge range of languages/cultures. It may be a good starting point for customizing collation orders for specific languages.
Ideally we would use Unicode collation rules and configuration to control searching for Unicode data in TWiki (http://unicode.org/reports/tr10/#Searching). However, this is quite complex, with benefits only in very specific cases. Searching is also performance-critical for TWiki. Hence this is probably best left to a later phase.
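For a rough idea of what collation-based searching looks like, here is a sketch adapted from the example in the Unicode::Collate documentation, using the module's index method. It says nothing about how TWiki search would integrate this, or about its performance.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# level => 1 compares base letters only, ignoring case and accents.
# The module requires normalization => undef for its searching methods.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

my $text   = "Ich muß studieren Perl.";
my $needle = "MÜSS";

# In list context, index returns (position, length), or an empty list
# when nothing matches.
my ($pos, $len) = $collator->index($text, $needle);
my $match = defined $pos ? substr($text, $pos, $len) : undef;

print defined $match ? "Found '$match' at offset $pos\n" : "No match\n";
```

According to the module's documentation, "muß" is primary-equal to "MÜSS", so this kind of search finds it even though the two differ in case, accent and spelling of the sharp s.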
Unicode does provide features outside Unicode collation for case-folding etc.
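As an illustration of case-folding outside collation, Perl itself offers the fc function. This sketch assumes Perl 5.16 or later, and the sample strings are made up.

```perl
use strict;
use warnings;
use utf8;
use feature qw(fc unicode_strings);    # fc() needs Perl 5.16 or later

binmode STDOUT, ':encoding(UTF-8)';

# Pairs that a case-insensitive comparison should treat as equal
my @pairs = (
    [ 'TWiki',  'twiki'   ],
    [ 'Straße', 'STRASSE' ],
);

for my $pair (@pairs) {
    my ($x, $y) = @$pair;
    my $lc_match = lc($x) eq lc($y) ? 'yes' : 'no';
    my $fc_match = fc($x) eq fc($y) ? 'yes' : 'no';
    print "$x / $y: lc match = $lc_match, fc match = $fc_match\n";
}
```

Lower-casing alone misses the second pair, because ß has no single-character uppercase form, while full case folding maps both strings to "strasse".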
-- Contributors: RichardDonkin - 30 Jun 2008
I would not worry too much about Aa and Å in Danish. I think we have learned to live with that.
But if the sorting order above is done with Unicode Collation then the library is worthless because it sorts all wrong. The order of the real words should be (ignoring the Aa detail)
- 19 Jul 2008
It's quite feasible to get the precise collation order you want using code that makes use of Unicode::Collate to change the collation order - presumably you are talking about the Danish sort order here. See CPAN:Unicode::Collate for some examples of how this is done.
My example above only uses the default Unicode collation order (aka DUCET), and perhaps using Danish words was misleading, as the default order doesn't work well for them. Some languages sort accented characters near their unaccented versions, while others, such as Danish, sort them after 'Z'.
It's worth reading Unicode Technical Report 10 on Unicode collation (http://unicode.org/reports/tr10/), which gives more background and examples on this. The collation order should really depend on the user, not the site - e.g. if a German user is looking at some data including 'ä', they'll expect that to sort after 'a', while a Swedish user will expect it to sort after 'z'.
- 21 Jul 2008
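The German/Swedish case in the comment above could be sketched as follows, assuming Unicode::Collate::Locale is available. The word list is made up, and the default order is used here as a stand-in for what a German user would expect.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample data containing one accented word
my @words = ('zebra', 'äpple', 'apple', 'banan');

# Assumption: the default order (DUCET) is close enough to dictionary
# order for a German user; a Swedish user gets the 'sv' tailoring.
my $german_collator  = Unicode::Collate->new;
my $swedish_collator = Unicode::Collate::Locale->new(locale => 'sv');

my @german  = $german_collator->sort(@words);
my @swedish = $swedish_collator->sort(@words);

print "German user:  @german\n";
print "Swedish user: @swedish\n";
```

The same data should then come out with äpple right after apple for the German user and after zebra for the Swedish user, which is why the collator would ideally be chosen per user rather than per site.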