
UseUTF8

The more work I do on getting I18N support right in WYSIWYG, the more convinced I am that Foswiki goes out of its way to make life difficult for users, admins, and extension authors by not using UTF8.

UnderstandingEncodings is a detailed primer on character sets and a discussion of the problems inherent in trying to support non-UTF8 character sets in the Foswiki core. Please read it carefully before commenting. I also highly recommend the following overview of Unicode and UTF-8: http://www.cl.cam.ac.uk/~mgk25/unicode.html. RD also recommends The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which is a nice gentle introduction.

Proposal

The proposal here is to modify Foswiki to assume the use of UTF8 in all content. That means UTF8 would be assumed in:
  • topic content
  • topic and web names
  • template files
  • url parameters, including form content

Key concepts

  • UTF-8 is the character encoding - see UnderstandingEncodings
  • UTF-8 character mode (aka Perl utf8 mode) - Perl handles the 1 to N bytes of a Unicode character as a single character, not as N bytes. This is the target of this work. See perldoc perlunicode for details.
  • UTF-8 as bytes - Perl happens to be processing the 1 to N bytes of a Unicode character as N bytes not as a character. This is usually a mistake if you are trying to "Use UTF8", but is sort-of supported with current Foswiki versions (see InstallationWithI18N for when it's used) - you don't get WikiWord support and so on, but the characters aren't mangled.
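To make the distinction concrete, here is a minimal Perl sketch (the word "café" is just an illustration) showing the same data seen as bytes versus as characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café" encoded as UTF-8 occupies 5 bytes, but is only 4 characters.
my $bytes = "caf\xC3\xA9";             # raw UTF-8 bytes ("UTF-8 as bytes")
my $chars = decode( 'UTF-8', $bytes ); # Perl character mode

print length($bytes), "\n";   # Perl counts bytes here
print length($chars), "\n";   # Perl counts characters here
```

In character mode, regexes, `length`, `substr` and friends all operate on whole characters, which is what WikiWord matching and the rest of the core need.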

Technical Detail - What would need to be done?

  • In terms of the core changes required, this would mainly be a case of deleting code. Foswiki.pm especially contains a lot of special support which is primarily aimed at different character sets. Most of this support is poorly documented and used incorrectly or not at all in the code (for example, many regexes use [A-Z] incorrectly to represent word characters). Correcting and updating the documentation is therefore also essential.
    • RD: This isn't my perception (lots of special support) - can't think of anything I wrote that was specific to a character set, apart from EBCDIC which is a special case for TWikiOnMainframe and not a priority. I think some of this code may have been added in recent years though. You are right about [A-Z] though - see InternationalisationGuidelines for a new shell one-liner that helps detect such things, and also \w and \b which are very common and almost always wrong.
  • all streams opened by the store need to use :encoding(UTF-8) (not :utf8, which doesn't check for a valid UTF-8 encoding, leading to possible security holes)
  • stdout and stderr need to be re-opened in :encoding(UTF-8), using equivalent of binmode(STDxxx,':encoding(UTF-8)'); - this also needs to apply to ModPerl and similar CgiAccelerators which may not use stdout/stderr.
  • Increment the store version. Topics using old store versions will have to be read by compatibility code which uses the {CharSet} setting in configure.
  • Check very carefully whether input data from a form is indeed UTF-8 encoded - generally if you force the output page to UTF-8 using the HTTP header and HTML charset, the data returned in a POST or GET will also be in UTF-8. So this is mainly useful to guard against a user explicitly setting their browser to the wrong character set. Fortunately CPAN:Encode can do this very efficiently, certainly faster than the EncodeURLsWithUTF8 regex.
    • CGI.pm does not do any encoding. In fact it can't, because the encoding is not given with the HTTP request. (RD: However, CPAN:CGI does turn on Perl utf8 mode in some more recent versions, which has been a problem for the current pre-Unicode versions of Foswiki. We might need to test against specific CPAN:CGI versions if we get problems.)
  • Evaluate the encoding of all content which is retrieved via other protocols:
    • HTTP (e.g. %INCLUDE{http://somewhere}% and other Foswiki::Net interfaces)
    • Mail as in Plugins.MailInContrib
  • Define the encoding of all content which Foswiki sends elsewhere
    • Sent mail as in Foswiki's notifications (ouch: my mail client doesn't handle UTF-8 -- haj)
    • Command parameters for Sandbox commands (needs to divine the operating system's default encoding)
  • An audit of the core code to find cases where failure to acknowledge the encoding correctly has implicitly broken the code.
  • Unit testcases would be required for:
    • Existing topic in non-UTF-8 charset
    • Topic with broken UTF-8 encodings
    • Check encoding on all pages generated by Foswiki
  • Fixes for the following bugs would need to be confirmed: Foswikitask:Item3574 Foswikitask:Item4074 Foswikitask:Item2587 Foswikitask:Item3679 Foswikitask:Item4292 Foswikitask:Item4077 Foswikitask:Item4419 Foswikitask:Item5133 Foswikitask:Item5351 Foswikitask:Item5437 Foswikitask:Item4946
  • Corrections to the documentation
    • Add guidelines for adding localized templates or skins: they need to be (or to be converted to) UTF-8, too
    • Review all extensions (plugins, skins, contribs) for assumptions about character sets (e.g. /[A-Z]+/) and add guidelines for extensions authors
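As a rough sketch of the stream handling described above (the temp-file dance is illustrative only, not Foswiki's actual store code):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Re-open the standard handles so everything written out is checked UTF-8.
# ':encoding(UTF-8)' validates the encoding; the bare ':utf8' layer does not.
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

# The store would open topic files the same way:
my ( $fh, $path ) = tempfile( UNLINK => 1 );
binmode $fh, ':encoding(UTF-8)';
print $fh "S\x{E5}ng\n";               # "Sång" - encoded to UTF-8 bytes on write
close $fh;

open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;
# $line is now 4 *characters*, decoded and validated on the way in
```

Under ModPerl and similar CgiAccelerators the equivalent layer would have to be applied to whatever handles they substitute for stdout/stderr.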
The default character set for Foswiki would become unicode. This means that "old" topics (those that predate the change to unicode) could break Foswiki if they include high-bit characters for a non-unicode character set. To overcome this problem, there needs to be a way to kick into a "compatibility mode" when reading such topics. One possible algorithm is:
  • Read content using a byte stream
  • If {Site}{CharSet} is set to a non-UTF8 character set ({Site}{CharSet} is basically used as a legacy setting to say "this is the charset that used to be used by this site before the change to UTF8") then
    • if (1) content uses high-bit characters and (2) store version is prior to the current version
      • use Encode::decode({Site}{CharSet}, $text) to convert to the perl internal character representation
      • Note that the version in the TOPICINFO may not be useable if the
  • otherwise
    • use Encode::decode_utf8($text) to decode utf8 to the internal representation
  • RD: All this assumes we don't do a bulk migration using an offline tool - this is what I'd recommend, see material below.
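The algorithm above might be sketched as follows (sub and variable names are illustrative, not the real Foswiki internals):

```perl
use strict;
use warnings;
use Encode qw(decode);

# $rawBytes: topic text read through a byte stream
# $legacyCharSet: the old {Site}{CharSet} value, e.g. 'iso-8859-1'
# $isOldStore: true if the store version predates the UTF-8 switch
sub read_topic_text {
    my ( $rawBytes, $legacyCharSet, $isOldStore ) = @_;

    if (   $legacyCharSet
        && lc($legacyCharSet) !~ /^utf-?8$/    # a non-UTF8 legacy charset
        && $isOldStore
        && $rawBytes =~ /[\x80-\xFF]/ )        # high-bit characters present
    {
        # Compatibility path: decode from the old site charset
        return decode( $legacyCharSet, $rawBytes );
    }

    # Default path: strict UTF-8; FB_CROAK rejects broken encodings
    # instead of silently mangling them.
    return decode( 'UTF-8', $rawBytes, Encode::FB_CROAK );
}
```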

Some other considerations:
  • Security
    • Foswiki must take care to check that possible UTF-8 data in fact uses only valid UTF-8 codepoints (characters in the encoding) and is not using an 'overlong' encoding - both can lead to security holes.
  • Performance benchmarking and tweaking
    • RD: suggest that benchmarking is done very early so we get some good metrics of how the Unicode changes are affecting performance. Some optimisations may be possible though I have no idea what they are. My experience a few years back was a 3 times slowdown, hopefully Perl has improved since then.
  • Pre-Unicode charset support - are we going to still support pre-Unicode charsets? From a TinyMCE perspective I guess the answer is no as it's quite painful to convert to/from the site's pre-Unicode charset (e.g. ISO-8859-1). However sites that don't use TinyMCE might want to be able to do this.
  • MacOSX server support
    • MacOS X encodes filenames on HFS+ filesystems in a unique flavour of Unicode, using an Apple-enhanced NFD normalization type (see UnicodeNormalisation), whereas the rest of the world (W3C, Linux, Windows, etc) uses NFC normalisation - it actually stores them in a 16-bit encoding of Unicode, but NFD is the problem. This means that simply getting Foswiki to create Unicode filenames on most Mac server disks may give us some issues - Foswiki will try to create NFC UTF-8 filenames, which get converted to either UCS-2 or UTF-16 (16-bit Unicode encodings), but using NFD not NFC. The risk is that NFD is then presented back to the web or email client, and in most cases the I18N characters aren't viewed properly unless Foswiki has converted on the fly from NFD to NFC. This may "just work" without any extra code, but needs testing. MacOSXFilesystemEncodingWithI18N has the gory details. convmv has some support in this area for batch filename conversion to/from MacOS X's NFD flavour.
  • Windows server support
    • Windows may also have some issues perhaps with Unicode filenames, but it uses NFC so should be OK. Apache on Windows works best with UTF-8 URLs, so actually our Windows I18N support could improve with UTF-8.
    • Apache on Windows does have issues with non-UTF-8 PATH_INFO used by Foswiki, and some other CGI environment variables - it erroneously tries to convert these to UTF-16, which is what Windows uses in NTFS, despite these environment variables most likely having nothing to do with pathnames (and it does this even if the server is on a FAT filesystem that doesn't use UTF-16). I got a patch into Apache 2.0.54 for this (ApacheBug:32730 and ApacheBug:34985), but I think some bugs may still remain in this area.
  • Backward compatibility with Perl 5.6 and early 5.8.x's - this is more relevant if we support pre-Unicode site charsets, but early 5.8.x applies to Unicode as well.
    • Once we start doing UnicodeSupport, Foswiki will no longer work with Perl 5.6 due to its broken Unicode support. Also, Foswiki may only work on later 5.8 versions - some systems have older 5.8.x's with too many Unicode bugs to be usable.
    • So it will be important to survey our user base to see how they feel about this. If backward compatibility is seen as important, and we think it is worth the extra hassle, this would require some extra work - one idea on dynamically supporting both Unicode and non-Unicode mode is:

From Bugs:Item772: it's worth noting that the locale code needs re-working anyway to cover two cases when we do Unicode:
  1. Unicode - do a dynamic use open to set utf8 mode on all data read and written (this must also cover ModPerl, which doesn't use file descriptors to pass data to Foswiki scripts, unlike CGI). This code path must never do a use locale or equivalent, because mixing Unicode and locales breaks things quite comprehensively (a Perl bug-fest, I tried this...)
  2. Non-Unicode - should function as now (assuming this is just a bug)

  • Migration of topics and filenames - any pre-Unicode encoded non-ASCII data in the topics or filenames (including attachment filenames but not contents) will need to be converted if we don't support pre-Unicode charsets. There are some tools on most Unix systems that will handle this, but this requires an upgrade step, unlike all other Foswiki upgrades. Doing this topic by topic as they are updated will not work, because the older topics won't be viewable properly (unless you have per-topic Unicode mode which is fairly horrible IMO).
    • Automating the upgrade process could be quite hairy, particularly the filename changes in a deep directory hierarchy, and would probably not work on MacOS X at all due to MacOSXFilesystemEncodingWithI18N - probably best to ensure the upgrader does good backups and provide some scripts and docs. There might be similar problems on NTFS or FAT filesystems, possibly with variations depending on whether the OS is *nix or Windows - I believe that NTFS translates UTF-8 to UTF-16 but hopefully it doesn't do any UnicodeNormalisation.
    • I think a batch migration process is essential - I realise this goes against the Foswiki upgrade philosophy but this is quite a big and complex change to the entire pub and data trees, including existing filenames, so I don't see how you can do this 'online' (what if there are some topics that aren't updated for months, but other topics refer to them - which URL should you use? What about attachments where you don't want to re-upload them just to convert the filename on disk?). Batch migration also means that you can use tools such as convmv which converts the filenames from one format to another, and write a simple Perl script that converts the contents in place, using the original LocalSite.cfg spec for site charset to drive the conversion to Unicode.
  • Sorting support
    • Locales are very buggy when combined with Perl utf8 mode, in RD's experience - best avoided, and many Perl Unicode apps don't use locales at all.
    • UnicodeCollation support is the main alternative, not sure about performance though
  • Bugginess of Perl Unicode generally - should be better now, but I ran into some issues and we should expect to uncover and workaround some Perl bugs.
  • "Unicode mode" toggle
    • Despite the niceness of Unicode, I think it's important to have a simple toggle that globally disables Unicode usage for Foswiki. While this disables any I18N, it also allows the user or developer to:
      1. Avoid any Perl utf8 bugs
      2. Ensure best possible performance even if some strings get forced into Perl utf8 mode.
      3. Easily compare the non-utf8 and utf8 modes for unit and system testing.
      4. Run Foswiki in non-I18N mode on Perl 5.6
    • It may be a bit of a hassle initially to enable this (e.g. dynamic code in BEGIN blocks etc) but I think it's worth it.

Earlier work by RD on UnicodeSupport - I can provide my code, which is based on an old Foswiki alpha version; it did get to the point of running on a semi-public Unicode test site, in real "perl utf8" mode, not just "bytes" mode with UTF-8 encoding.

-- Contributors: CrawfordCurrie, HaraldJoerg, RichardDonkin

Discussion

CC asks: Isn't it easier just to say "if there is no accept-charset, UTF8 is assumed"? - sorry, I don't think it is that simple. In a HTTP request there might be an Accept-Charset header, but according to my experience it is simply an indication of browser capabilities, and not related to an accept-charset attribute of a form. If you add accept-charset="utf-8" to all templates, you can catch TWiki's own edit and search workflows, but you'll likely miss most TWiki applications with homegrown topic creators. Unfortunately the specification of the accept-charset attribute is crappy in itself: it allows a list of encodings to be specified in a form tag, but provides no way for a browser to indicate which one it actually used to encode. Browser implementations have been reported to behave differently, but I guess this is no longer true for current browsers (for certain values of current).

I see two alternatives:
  1. Use UTF-8 not only internally, but also externally: Always encode HTTP responses in UTF-8, and set the appropriate HTTP headers (and meta elements). There's no need then to use accept-charset attributes on forms. The configuration setting {Site}{Charset} would be just used for legacy file conversion. This solution assumes that UTF-8 is good enough for all TWiki users, and might cause problems for those who deliberately use ISO-8859 today because they use their TWiki together with external data sources (e.g. data bases or localized skins). TWiki would explicitly need to decode parameters from UTF-8, maybe with appropriate precautions to avoid a crash.
  2. Keep sending pages in {Site}{Charset} encoding, and rely on {Site}{Charset} being the encoding you get for parameters. Again this would not require forms to use accept-charset. Decoding of parameters would be needed iff {Site}{Charset} is a multibyte encoding. I am not sure, however, how browsers behave if users enter characters in a form field which can not be encoded in {Site}{Charset}.
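For option (1), the parameter-decoding precaution might look like this (a sketch only; decode_param is a hypothetical helper, not an existing TWiki function):

```perl
use strict;
use warnings;
use Encode ();

# Pages are served with 'Content-Type: text/html; charset=UTF-8', so browsers
# return form data as UTF-8 bytes. Decode defensively: Encode::FB_CROAK makes
# invalid sequences die here, where we can catch them, instead of crashing later.
sub decode_param {
    my ($bytes) = @_;
    my $chars = eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ) };
    return defined $chars ? $chars : undef;    # undef flags invalid UTF-8
}
```

The caller can then decide how to handle an undef result - reject the request, or fall back to a substitution character - rather than letting malformed bytes propagate into the store.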

Hell. Getting encoding right is tough, and it has been so since tapes were written in either ASCII or EBCDIC. TWiki got rather far without touching encoding, but it is unlikely to get any further in a world which is moving towards UTF-8.

-- HaraldJoerg - 12 Apr 2008

I think option (2) is dangerous. I have had a lot of bug reports resulting from UTF8 being interpreted as high-bit iso-8859 and crashes from illegal encodings as a result of high-bit iso-8859 being interpreted as UTF8. If TWiki chooses to move to UTF-8 I think it should move everywhere. Imagine the scenario; some poor Russkaya mafiya is trying to import a topic with a Russian name, written in Russian, by a Russian-speaking colleague working undercover in the US department of electronic security, and you are trying to give them support:
  1. What encoding is used in the US TWiki? iso-8859-1, KOI8-R, UTF8, other?
  2. What encoding is used in your local TWiki? KOI8-R, UTF8, other?
  3. What charset is used in your browser?
All questions that Josef Average will find all but impossible to answer. If, on the other hand, you always require UTF8, the worst you will be faced with (hopefully) is:
  1. "is the charset specified as utf8 or utf-8, because utf8 will cause a bug in internet explorer"
I can't imagine many scenarios where utf8 - and therefore unicode - would not be enough for users. Unless they are writing in Klingon (the Klingon character set is inexplicably missing from unicode). Or Elvish.

One possible scenario is where a site is trying to use a plugin that has been coded to assume iso-8859, and the admin doesn't have the will, or a way, to get it fixed. For that reason I added a review of extensions to the todo list above, which should cover your point about external datasources.

-- CrawfordCurrie - 13 Apr 2008

I think it is feasible to migrate TWiki towards pure UTF-8. It should work for all known languages on the planet Earth and will limit the platform we have to test on to ONE.

With respect to plugins the important steps will be
  • Ensure anything in Func is UTF-8 compatible
  • Migrate the most popular plugins to UTF-8 and in the process document what it takes to upgrade a plugin to UTF-8
  • Provide a safe upgrade method. I have tried to upgrade ASCII type topics by converting to UTF-8 and so far it has worked fine each time. This is a situation where it is difficult - maybe impossible - to do it on the fly. But unlike upgrade scripts suggested for syntax changes that are always bound to fail because we cannot predict the more advanced ways to make applications, upgrading to UTF-8 is happening at "byte level" and is a well known process which there are plenty of tools available for. But we need to have a good way to ensure that one does not double convert topics (convert a topic which is already converted).

Going to pure utf-8 will be a task that takes a lot of testing. I am testing UTF-8 at the moment in the 4.2.1 context and as you know there are a couple of new bugs I opened where I have seen that SEARCH and verbatim are not yet working in UTF-8. There will be many more test steps needed before we can let go of other charsets. But I think the step to make TWiki utf-8 only should be considered with a positive spirit because it will make TWiki fully I18N which it is not now and with a chance of being stable also for non-English users.

-- KennethLavrsen - 13 Apr 2008

I agree that UTF-8 should be the way to go, and I fully support moving towards encoding topics in UTF-8 as soon as possible. Easy moving of topics between TWiki installations needs a unique encoding, and UTF-8 seems to be without alternative for that purpose. Topics (and templates) have long-lived encodings, they tend to lie on disks for years without surreptitiously changing their encoding, hence the migration path needs to be carefully paved (as you did in your proposal).

My options do not refer to using UTF-8 for writing topics, but to the encoding used for TWiki's other interface, HTTP/HTML written for browsers. UTF-8 would work fine for me, and maybe for all installations (including Elvish). So probably we could jettison option (2) right now.

So what it boils down to is not that TWiki is using UTF-8 (because, strictly spoken, TWiki is using Perl's internal encoding all the time), but that TWiki expects all its external data interfaces to be encoded in UTF-8. From that point of view, topics are the easiest part because writing and reading topics is under TWiki's more or less exclusive control. As you wrote, we'll need to carefully collect assumptions about encodings, but also identify unjustified ignorance. Maybe you summarized these cases with your item "An audit of the core code to find cases where failure to acknowledge the encoding correctly has implicitly broken the code". I added some to the list above; hopefully it won't grow too much.

A minor note about UTF-8 vs. utf8: Internet protocols use 'UTF-8', case-insensitive, and Perl uses 'utf8', always lowercase.

-- HaraldJoerg - 13 Apr 2008

It would be good to look at UnicodeSupport and linked pages, which contain a lot of thinking about this. I've commented at UnicodeProblemsAndSolutionCandidates in detail on some of the issues that would need to be solved, which cover some of the points made above. It would be helpful if the various Unicode pages were interlinked - perhaps UnicodeSupport could be refactored into a 'landing page' for all these topics including latest discussions, to make it easier to find them.

Shame I missed this discussion - I haven't been tracking TWiki for a while now, but would be interested in participating if someone can email me. Unfortunately TWiki.org doesn't have a good way of monitoring 'only pages with certain keywords' that I'm aware of.

  • WebRss supports SEARCH statements to narrow down what you get notified of (and Crawford entered an enhancement request of mine for supporting SEARCH queries (full TML actually) in WebNotify) - SD

-- RichardDonkin - 14 Jun 2008

Thanks for the tip, Sven.

On the options - I think the best one is option 1, i.e. UTF-8 at the presentation level and internally. There should be very few systems these days where UTF-8 is not supported - even on an ancient 486 you can boot a live CD that supports UTF-8 in Lynx - but I'm sure someone will come up with one.

In a possible Phase 2 of UTF-8 adoption, we could implement some charset conversion at the presentation layer, e.g. if someone has a browser or email client that only does a legacy Russian or Japanese character set, perhaps, and they are unable to upgrade their clients. This could perhaps be driven by accept-charset. However, this adds complexity so let's not do it in the first phase of UseUTF8.

See more comments in text prefixed with RD.

-- RichardDonkin - 15 Jun 2008

I've added a key concepts section above to try to differentiate between "UTF-8 character mode" in Perl vs. processing UTF-8 as bytes (which is not what we want), as a result of commenting on Bugs:Item5566.

-- RichardDonkin - 26 Jun 2008

One remark on "need to use :utf8": Don't use ":utf8" use ":encoding(UTF-8)" instead. Why? Because with ":utf8" Perl doesn't check if it's really utf8 and because of this there can arise serious security problems. See PerlMonks: UTF8 related proof of concept exploit released at T-DOSE for an example.
  • RD: You can also use "utf-8-strict" as a synonym for "UTF-8" in Perl pragmas, which might be less vulnerable to typos.
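To illustrate the difference (the strict check below goes through Encode directly, rather than relying on what ':utf8' does with malformed input):

```perl
use strict;
use warnings;
use Encode ();

# ':encoding(UTF-8)' actually runs the input through Encode, so valid data
# is decoded and ill-formed data is caught; ':utf8' merely flags the bytes
# as UTF-8 without checking them.
my $valid = "caf\xC3\xA9";
open my $fh, '<:encoding(UTF-8)', \$valid or die $!;
my $text = <$fh>;
close $fh;
# $text is 4 validated characters

# An explicit strict check on suspect bytes: 0xFF can never occur in UTF-8
my $bad = "\xFF";
my $ok  = eval { Encode::decode( 'UTF-8', $bad, Encode::FB_CROAK ); 1 };
# $ok stays undef - the strict decoder rejects the bogus byte
```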

-- ChristianLudwig - 27 Jun 2008

Good point - have merged this above where it talks about :utf8 and also added a Security bullet under the 'other considerations' part.

I've added quite a lot of material above, comments would be useful.

One simple next step might be to agree whether we can dump the accept-charset idea which IMO is not required.

-- RichardDonkin - 28 Jun 2008

I think "keep it simple" has to be the guiding principle here. I think accept-charset falls the wrong side of that line, and should not be used.

The main support problem we have had with encoding support to date has been excessive flexibility coupled with a lack of documentation explaining in simple terms what the casual admin needs to do. I had to research quite a lot to reach my poor level of understanding, and it's unreasonable to expect yer averidge admin to do the same.

So, from a user perspective, I don't want to know it's using UTF8 (or any other encoding). configure should have no encoding options, just a single, simple option for setting the user interface language. If that means committing to a less-than-100%-flexible approach, then I'm in favour.

-- CrawfordCurrie - 28 Jun 2008

A less flexible approach should be possible since we won't be using locales, and I agree completely with going for simplicity. Some remaining issues though:

  • Batch migration of topics - this is essential to keep core code simple, so it only has to deal with UTF-8
  • Performance - early testing and tuning will be important, covering both the English-only and the I18N-heavy cases. If this can't be optimised, a Unicode-mode toggle as mentioned above will be important, but it could be based on a simple toggle such as {UseInternationalisation}.
  • Sorting - if we don't do locales, topic and table column sorting will need UnicodeCollation. This has to be based on "the language", which can most simply be derived from the user's language (for message internationalisation). Unicode obviously supports multiple languages but for collation you need to know which language the user is working in, and hence which Unicode collation order to use. The good news is that CPAN:Unicode::Collate does all this for you as long as it's used in any sort routines.
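A small illustration of CPAN:Unicode::Collate as the language-neutral default sort (using the module's default DUCET table; note it is noticeably slower to load than a plain codepoint sort):

```perl
use strict;
use warnings;
use Unicode::Collate;

# Raw codepoint order puts all capitals before all lowercase and accented
# characters last; the default Unicode collation interleaves them sensibly.
my @words = ( 'Zebra', "\x{C5}rhus", 'apple' );   # Zebra, Århus, apple

my @by_codepoint = sort @words;
# gives: Zebra, apple, Århus - because 'Z' < 'a' < 'Å' by codepoint

my @collated = Unicode::Collate->new->sort(@words);
# gives: apple, Århus, Zebra - a reasonable cross-language default
```

Language-specific tailoring (Swedish, Danish, etc) would sit on top of this, as discussed below.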

Some expert-level config options may be needed to work around brokenness, but we should try to avoid wherever possible (like the Unicode mode toggle). If we limit ourselves to Perl 5.8 only that will simplify matters - if Perl 5.6 must be supported it could turn off all I18N and use only ASCII.

-- RichardDonkin - 28 Jun 2008

  • I thought the user interface language code required the locale to work?
  • I'm torn on batch migration. Migration on the fly is seductive, and fairly easy to make work, but the performance is likely to stink. Batch migration has the potential to lose the history (unless it rebuilds it using the new encoding)
  • Another issue is Extensions. Authors need comprehensive support to make sure they don't fall into the /[A-Za-z]/ trap.
    • Case detection and conversion (whatever 'case' means)
    • Sort collation
    • Language/encoding information
My personal opinion is that Perl 5.6 is past its sell-by date and should be dropped. This might shut out some hosting providers; I'd be interested to hear if any are still using 5.6.

-- CrawfordCurrie - 29 Jun 2008

Will have to look at the UI language code but I think it only uses locales because the core does. If we go UTF-8 they simply have to convert all their translation files and make a slight adjustment (IMO, without looking at code yet.)

Non-batch migration is really hard as well as slow:
  • Most significantly, how do you search for an I18N word using grep across a mixed set of converted and unconverted pages? A: you have to run two grep searches. With all this complexity you are actually supporting the pre-Unicode character set forever, since you can never know when you will hit a page that nobody has yet viewed or edited.
  • How would you handle page A with WikiWord links to pages B and C, where A and B are ISO-8859-1, and C is already converted? A: You would convert page A into UTF-8; any generated URLs should use UTF-8 for TWiki pages and {Site}{CharSet} for attachments (due to EncodeURLsWithUTF8 and the need for the web server to directly serve attachments). Fortunately the inbound URL conversion of EncodeURLsWithUTF8 helps here with the link to page B, but you have to keep that logic around.

Batch migration would preserve history: since RCS is a text format (just checked with man rcsfile and this page documenting more details of RCS file format) and doesn't appear to have any length or checksum fields that would mess this up, it is fairly trivial - just use the iconv utility for the file, and convmv for the filename itself (and directory names). There may be some corner cases if people have embedded URL-encoded links within a TWiki page, but that's unlikely and not required with current I18N. The page linked here makes it clear that it is safe to embed UTF-8 in RCS files - the only problem might come if Asian sites have (against TWiki I18N recommendations at InstallationWithI18N) used a non-ASCII-safe double-byte character set such as Shift-JIS as the {Site}{CharSet} when we convert this to UTF-8, as RCS may have escaped a conflicting byte within a double-byte character. I suggest we don't bother with this case, as such character sets were never supported (see JapaneseAndChineseSupport as well).
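A sketch of the in-place content conversion described above (convert_file is a hypothetical helper; a real migration would also run convmv over the file and directory names, and take proper backups first):

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Copy qw(copy);

# Re-encode one topic (or RCS ,v history) file from the legacy site charset
# to UTF-8. RCS files are plain text, so the same treatment covers history.
sub convert_file {
    my ( $path, $fromCharset ) = @_;

    copy( $path, "$path.bak" ) or die "backup of $path failed: $!";

    open my $in, '<:raw', $path or die "read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    my $utf8 = encode( 'UTF-8', decode( $fromCharset, $bytes ) );

    open my $out, '>:raw', $path or die "write $path: $!";
    print $out $utf8;
    close $out;
}
```

Driving $fromCharset from the original LocalSite.cfg {Site}{CharSet}, as suggested above, keeps the script trivial.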

EncodeURLsWithUTF8 may need to be enhanced slightly - haven't thought about the details yet, but limiting ourselves to browsers supporting UTF-8 should help and might even simplify it. Attachment support through UTF-8 URLs will be the main remaining issue - however by making the browser use UTF-8 we force all its URLs to be in UTF-8 format. We might even find TWikiOnMainframe I18N works without special code...

Extensions are a problem, which InternationalisationGuidelines tries to address, but it's really down to the extension author and promoting I18N amongst authors. Many extensions aren't I18N-aware, but I think those that are already I18N-aware will have an easier time converting, and going Unicode makes life easier generally, particularly for extensions that interface to third-party systems that already use UTF-8.

Support for MacOS X will be something of a challenge I think, requiring UnicodeNormalisation but only on MacOS servers, as mentioned above. Support for Windows servers is not a problem, I now believe - I have updated the comments above to reflect this.

On collation, I did a bit of research yesterday: UnicodeCollation doesn't give you language specific sorting, but it does provide a good default sort order across all languages. People who want correct sorting in Swedish, Danish, Japanese, etc, will need a 'language sort module' that adds some language specific collation rules for their language. This should be done as with the UI internationalisation stuff, ideally so that when you do a translation someone a bit more techie defines the collation rules - these are available from various sources with a bit of luck but there are very few language specific modules on CPAN that help. Also, CPAN:Unicode::Collate involves loading a 1.3 MB default collation order file, which could have some performance impact... UnicodeCollation might need to be enabled only for those who want language-specific collation to be absolutely correct, with the default being to sort by UTF-8 codepoint, which doesn't look nice but at least is fairly fast - some performance testing needed, with and without ModPerl.

-- RichardDonkin - 29 Jun 2008

A few more updates above to my comment of 29 Jun, and also some updates to main text - in particular I've removed the accept-charset part since we are agreed we don't want to do this.

-- RichardDonkin - 01 Jul 2008

Any more thoughts on this? I've done some updates to UnicodeCollation including a test script - this isn't hard to do.

-- RichardDonkin - 15 Jul 2008

I'm with you on batch migration. I think extension authors will have to be left to sort out their own houses; though the most common extensions will need to be tested. I don't care much about OSX, and until an OSX user with hardware steps forward I doubt anyone else will.

The main problem I foresee is testing. I don't think it makes sense to do any of this without a testing strategy. My preference is for UTF8 testcases to be added to the existing unit test suite, as lack of unit tests in this area has been crippling in the past. And as you say, performance testing is required.

I'd like to make proper UTF8 support a feature of TWiki 5.0, but I think it requires a lot more concentrated effort from interested parties than just the two of us batting ideas around, especially as neither of us is likely to be actively coding anything. Specifically I'd like to hear from community members who actually want to actively use non-western charsets in their day-to-day work, as their experiences would be key to the success of the venture.

-- CrawfordCurrie - 15 Jul 2008

Will: Thanks for pulling this across, but what's with the Legacy web? I don't have permission to view topics like UnicodeNormalisation which are highly relevant even though old. Can someone point me to policy on copying pages from TWiki.org and purpose of the Legacy web? I can guess but it's good to know! I have removed the "Legacy." prefixes on this page as it's better to be able to actually click through on these links than be refused... RD

The legacy web contains topics that have been copied from the old twiki.org where the content is principally from a foswiki contributor, but not entirely (or we can't check). The web is not publicly readable to prevent it being indexed, and to avoid putting anyone's back up.

Readers note: Richard has added a lot of good stuff on the (tm)wiki version of this page.

-- CrawfordCurrie - 11 Mar 2010

http://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8

-- WillNorris - 05 Dec 2010

Time to act on this one. Reading the material, it seems there is consensus on providing a one-shot batch conversion script to move content to pure UTF-8, perhaps also flagging "legacy" topics that are not UTF-8 yet.
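The core of such a one-shot conversion is just re-encoding each topic file; a minimal sketch, assuming the legacy site charset was ISO-8859-1 (the function name is hypothetical, and a real script would also need to rename webs/topics and rewrite the ,v RCS histories):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Re-encode one topic file in place from a legacy charset to UTF-8.
sub convert_file {
    my ($file, $from) = @_;

    open(my $in, '<:raw', $file) or die "read $file: $!";
    my $octets = do { local $/; <$in> };
    close $in;

    # Decode from the legacy {Site}{CharSet}; croak on malformed input
    # rather than silently corrupting a topic.
    my $text = decode($from, $octets, Encode::FB_CROAK);

    open(my $out, '>:encoding(UTF-8)', $file) or die "write $file: $!";
    print $out $text;
    close $out;
}
```

Topics already in pure ASCII pass through unchanged, since ASCII is a subset of both ISO-8859-1 and UTF-8.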

-- MichaelDaum - 16 Feb 2011

Such a script can be found in the CharsetConverterContrib.

-- CrawfordCurrie - 03 Oct 2011

If UTF8 goes forward for trunk, what is that going to do to trunk.foswiki.org? Would we have to drop the symlinks into foswiki.org webs, or is there some possibility of coexistence? Or is that the "maybe as well as flag "legacy" topics not being UTF-8 yet" part? Any chance of handling that based on the Version string in topic metadata? Or is it too late by the time the topic has been read?

Moving the UTF8 discussion originally in RequirePerl588 to UseUTF8PerlRequirements.

-- GeorgeClark - 25 May 2012

There are two main topics on Foswiki Unicode: this one and UnicodeSupport. Because their statements are contradictory, I want to ask about the technical details:

Above:
all streams opened by the store need to use :encoding(UTF-8)

which means the topic text will be stored as UTF-8 text, and on reading, FW should open the file as UTF-8 encoded. This is in direct contradiction with the UnicodeSupport statement:
    1. The DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
      • DONE we never use this

which (if I understand correctly) says that Foswiki will read its files as octets.

Perl best practice says: when reading any text file, you should do the conversion at open time (or later with Encode).

So, the questions:
  • Will FW read data files as "octets" or as "utf8" text?

  • Will the core use utf8-flagged text or octets? E.g., will "\w" correctly match a multibyte Unicode character, or will it match one byte of the octets?

  • Will FW allow UTF-8 URLs? (This contradicts:)
  • DONE there is a lot of code that handles URLs and must use a-z, cos that's the definition.

  • Will Foswiki continue the bad practice of [A-Za-z0-9], or will it allow UTF-8 WikiWords? If yes, the following can't remain true; FW should use the correct \p{...} regex groups.
  1. Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
    • NOT DONE there isn't any, and hopefully none will be created
  2. Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
    • NOT DONE see above!

  • Will the core autolink Unicode WikiWords like ČaČaČa?

  • NFD decompositions
  1. Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form — which you had darned well better have remembered to put it in — becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD, you have to rely on UCA.
    • NOT DONE we don't need this kind of distinction anywhere in the core.

  • This is not true. Anywhere FW uses opendir on OS X, it should accept the NFD form, because on OS X filenames are in NFD form.

  • For collation, will Foswiki use the correct use Unicode::Collate::Locale; instead of "use locale;"?

  • Does/will Foswiki allow ISO sites? (Probably yes.) If so, the only correct way is to use utf8 internally and do the encoding/decoding at the I/O level, i.e. open($fh, "<:encoding(iso-8859-15)", ...);. Is this statement true from FW's point of view?
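Several of the questions above can be answered empirically. A small sketch (the strings are chosen arbitrarily) showing octets vs utf8-flagged text, the behaviour of \w on each, and NFC/NFD normalisation with the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use utf8;                                # this source file itself is UTF-8
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':encoding(UTF-8)');

# Octets vs utf8-flagged text: the same topic content, before and
# after decoding at the I/O boundary.
my $octets = "\xC4\x8Ca";                # raw UTF-8 bytes for "Ča"
my $chars  = decode('UTF-8', $octets);   # one wide char 'Č' plus 'a'

# \w against octets sees individual bytes; against utf8-flagged text
# it matches the full Unicode letter, so WikiWord regexes can work.
my ($octet_match) = $octets =~ /^(\w+)/; # no match: byte 0xC4 is not \w
my ($char_match)  = $chars  =~ /^(\w+)/; # matches "Ča"

# NFC vs NFD: 'ň' composed vs 'n' plus combining caron. opendir on
# OS X returns NFD filenames, so comparisons need normalisation first.
my $nfc = "\x{148}";                     # ň as a single code point
my $nfd = NFD($nfc);                     # "n" followed by U+030C
print length($nfc), ' vs ', length($nfd), "\n";  # 1 vs 2
print NFC($nfd) eq $nfc ? "equal after NFC\n" : "different\n";
```

The decode step is the "conversion at open time" point: swap 'UTF-8' for 'iso-8859-15' and the same pattern covers the ISO-sites question.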

There are more questions, but these are the really basic ones. Could someone elaborate a bit on how utf8 support is planned in FW?

-- JozefMojzis - 28 May 2012

Jozef, there have been many discussions on unicode support over the years, and not all have had the same end-goal in mind. I'll try to clarify here. There are two sensible ways to support unicode:
  1. Treat all data as byte data, and declare the encoding used for that data to the browser. This is the status quo, and is what UnicodeSupport assumes.
    • Pros:
      1. No need to worry about IO layers
      2. Because data is simply a byte stream, no need to worry about wide characters
      3. "Just" set $Foswiki::cfg{Site}{CharSet} to utf-8 and you support unicode (!)
      4. Historically has better performance than unicode
      5. Works with any perl version
    • Cons:
      1. Complex to transform data for collation sequences, searching etc.
      2. Always need foreknowledge of encoding used for data
      3. When things go wrong, almost impossible to debug
      4. Requires code authors to have a deep understanding of encoding schemes to use effectively
      5. Complex to set up correctly, easy to get wrong
  2. Treat all data as encoded using a single, unicode-compatible encoding, e.g. UTF-8. This is the UseUTF8 proposal.
    • Pros:
      1. Only one encoding to worry about
      2. Encoding is compatible with XHR and supported by all browsers (pretty much a standard now)
      3. Clean integration with Locale for collation etc
      4. All data is wide-character; easier for programmers to deal with, and easier to debug
      5. Easy to set up (no options, all data is unicode)
    • Cons:
      1. Legacy code (plugins) needs to know which encoding is in use
      2. Needs modern perl version (5.14 preferred)
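The practical difference between the two options shows up in even the smallest operations; a sketch with a single accented character:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\xA9";               # 'é' as two raw UTF-8 octets (option 1)
my $chars = decode('UTF-8', $bytes);  # 'é' as one wide character (option 2)

print length($bytes), "\n";           # 2: byte semantics, regexes see two bytes
print length($chars), "\n";           # 1: character semantics, uc() knows 'É'

# Encode back to octets only at the output boundary.
print encode('UTF-8', $chars) eq $bytes ? "round-trip ok\n" : "mismatch\n";
```

In option 1 the whole core works on the byte string and must never accidentally split it mid-character; in option 2 the decode/encode calls happen once, at the I/O layer.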

In my mind it is obvious that we should push for a single standard unicode encoding (UTF8). However, the lack of solid, reliable Unicode support in a widely-adopted Perl version is the major problem stopping us.

-- CrawfordCurrie - 29 May 2012
Topic revision: r18 - 19 May 2015, CrawfordCurrie