Unicode support

TODO

  • Crawford removed the {NameFilter} stuff, which means we need to review our XSS strategy before merging the unicode branch on github back on to svn (or at least, before we branch off a Release02x00)
  • Consider Tasks/Item10489: the Meta and Store APIs make it extremely hard to open a UTF-16 attachment and process it.
  • Follow recommendations in http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default as appropriate
    1. All source code should be in UTF-8 by default.
      • DONE by adding use utf8 to all sources.
    2. The DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
      • DONE we never use this
    3. Program arguments to scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPT=-CA.
    4. The standard input, output, and error streams should default to UTF-8.
      • DONE by ensuring the streams are binmoded in Sandbox.
    5. Any other handles opened by 🐪 should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPT=-CD would work. That makes -CSAD for all of them.
    6. Cover both bases plus all the streams you open with export PERL5OPT=-Mopen=:utf8,:std. See uniquote.
    7. You don’t want to miss UTF-8 encoding errors.
      • DONE by adding use warnings qw(FATAL utf8) to all modules
    8. Code points between 128–255 should be understood to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPT=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPT=-Mv5.12 or better will also get that.
    9. Named Unicode characters are not by default enabled, so add export PERL5OPT=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
      • NOT DONE we don't use named characters anywhere in the core, and should not start doing so.
    10. You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPT=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff from NFC. There’s no I/O layer for these yet that I’m aware of, but see nfc, nfd, nfkd, and nfkc.
    11. String comparisons using eq, ne, lc, cmp, sort, etc. are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPT=-MUnicode::Collate. You can cache the key for binary comparisons.
    12. 🐪 built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
    13. If you want them to count as integers, then you are going to have to run your \d+ captures through the CPAN:Unicode::UCD#num function because the built-in atoi(3) isn’t currently clever enough.
      • DONE PaulHarvey reviewed these, and reckons we are OK. Basically, we only support the arabic digits when we are expressing numbers (such as in dates)
        • NOT DONE But we should probably convert \d+ usage in core to the narrower [0-9] where possible. Added timing tests comparing \d+ with Unicode::UCD::num(\d+) in UTF8Tests
        • FAIL. CPAN:Unicode::UCD#num only exists in perl >= 5.14. Which makes it useless.
    14. You are going to have filesystem issues. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
    15. All your code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
      • DONE there is a lot of code that handles URLs and must use a-z, cos that's the definition.
    16. Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
      • NOT DONE there isn't any, and hopefully none will be created
    17. Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
      • NOT DONE see above!
    18. If you are looking for 🐪 variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn’t thinking about the punctuation variables or package variables.
      • NOT DONE we're not.
    19. If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
      • NOT DONE that assumption has never been made. \v is not an interesting TML entity, so treating it as a non-space character is not a problem.
    20. If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
      • NOT DONE and CDot considers it unnecessary. The plethora of unicode line terminators was a response to the plethora of possible encodings using those terminators. The code already deliberately ignores VT, FF and single CR. LS and PS are multi-byte, but are rarely used - they only exist to map from a few obscure encodings, and we have no evidence that they have ever been used. NEL is mapped to … in windows-1252 (the second most common encoding we encounter) so is already SNAFU.
    21. If you don’t know when and whether to call Unicode::Stringprep, then you had better learn.
      • Good luck finding a comprehensible description of this module. It may require careful study for security implications.
    22. Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module. Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the 🐪 built-ins.
    23. Sometimes that’s still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate::->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale=>"is",level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don’t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It’s tough, I tell you. You can play with ucsort to test some of these things out.
      • NOT DONE we switch on use locale which (on the assumption it is set correctly) already sets the appropriate collation sequence. There may be an issue here with separating LANGUAGE and LOCALE - if, for example, the user language is set to French while the locale is set to Klingon, will the appropriate collation sequence be used?
    24. Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form, which you had darned well better have remembered to put it in, becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD, you have to rely on UCA.
      • NOT DONE we don't need this kind of distinction anywhere in the core. As we move towards normalisation support (Tasks.Item13405) this may become more relevant.
  • Branch investigating unicode in the core at http://github.com/cdot/foswiki - DONE
  • RequirePerl588 - DONE
  • use warnings qw(utf8); use utf8 in all source modules - DONE
  • use locale conditional on $Foswiki::cfg{UseLocale} in all source modules - DONE
  • Extend Sandbox and RCS handler tests to cover unicode filenames and data - DONE
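
Items 10 and 22 to 24 of the list above are easier to judge with something concrete to run. The following is a minimal sketch, not code from the Foswiki core: it uses only the standard Unicode::Normalize and Unicode::Collate modules, and the sample strings are chosen purely for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate;

# "niño" with ñ as the single code point U+00F1 (composed form).
my $composed = "ni\x{F1}o";

# NFD splits ñ into "n" followed by U+0303 COMBINING TILDE.
my $decomposed = NFD($composed);
my $len_nfc    = length $composed;
my $len_nfd    = length $decomposed;

# NFC puts the pieces back together, so the round trip is lossless here.
my $roundtrip_ok = NFC($decomposed) eq $composed ? 1 : 0;

# A level-1 collator compares base letters only and ignores diacritics.
my $base_only  = Unicode::Collate->new( level => 1 );
my $base_equal = $base_only->eq( 'nino', $composed ) ? 1 : 0;

# Code point order puts "Äpfel" after "zebra"; UCA order does not.
my @words     = ( 'zebra', "\x{C4}pfel", 'apple' );
my @codepoint = sort @words;
my @uca       = Unicode::Collate->new->sort(@words);
```

With the default collation table, @uca starts with "Äpfel" and ends with "zebra", while @codepoint ends with "Äpfel". Whether the core needs this depends on the use locale question raised under item 23.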

Older discussion

So, let's investigate what's needed for full Unicode support.

From RichardDonkin on Bugs:Item772:
(Digression) It's worth noting that the locale code needs re-working anyway to cover two cases when we do Unicode, though that's not in scope for Dakar:
  1. Unicode - do a dynamic use open to set utf8 mode on all data read and written (must also cover ModPerl, which doesn't use file descriptors to pass data to TWiki scripts, unlike CGI). This code path must never do a use locale or equivalent, because mixing Unicode and locales breaks things quite comprehensively (a Perl bug-fest, I tried this...)
  2. Non-Unicode - should function as now (assuming this is just a bug)

The hard part is that the switch between (1) and (2) must be dynamic, based on a TWiki.cfg setting. It should NOT be based purely on locale matching /\.utf-?8$/, because some people may validly want to run with a UTF-8 locale and browser character set, but without Unicode mode.

Also, please don't do use utf8 to implement Unicode - it has an entirely different meaning between Perl 5.6 (where it means 'assume all data processed is UTF-8') and 5.8 (where it means 'variable names, literals, etc in this file can be UTF-8').

-- AntonioTerceiro - 21 Nov 2005
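
The run-time switch described above can be sketched roughly as follows. This only illustrates the I/O-layer side of the idea; the %cfg hash and its UseUnicode key are hypothetical names invented for the example (the discussion does not name the setting), and the locale handling for the non-Unicode path is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical configuration hash and key, standing in for a TWiki.cfg setting.
my %cfg = ( UseUnicode => 1 );

# Choose the I/O layer when a handle is opened, rather than hard-coding
# a compile-time "use open" for every file.
sub io_layer {
    return $cfg{UseUnicode} ? ':encoding(UTF-8)' : ':raw';
}

# Read a whole file (or, as below, an in-memory buffer) through that layer.
sub slurp {
    my ($source) = @_;
    open( my $fh, '<' . io_layer(), $source ) or die "Cannot open: $!";
    local $/;    # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

# "café" stored as UTF-8: five bytes on disk.
my $bytes = "caf\xC3\xA9";

# Unicode mode: the bytes are decoded into four characters.
my $unicode_len = length slurp( \$bytes );

# Non-Unicode mode: the same data comes back as five undecoded bytes.
$cfg{UseUnicode} = 0;
my $legacy_len = length slurp( \$bytes );
```

Because the layer is picked per open call, the decision depends on the configuration alone and not on whether the locale name happens to end in utf8.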

ProposedUTF8SupportForI18N has a lot of existing thinking and planning - should be a reasonable starting point, though it probably doesn't talk enough about performance issues. Requiring a recent Perl 5.8.x version is important too to avoid annoying bugs and perhaps help with performance.

It would also be worth considering GB18030 support, perhaps only in the browser - this is a 1-1 mapping from Unicode (i.e. really a Unicode Transformation Format analogous to UTF-8) that has been mandated by the Chinese government. More details: Wikipedia:GB18030 and http://www-128.ibm.com/developerworks/library/u-china.html IBM DeveloperWorks article. CPAN:Encode::HanExtra supports GB18030 conversion to/from Unicode.

-- RichardDonkin - 21 Nov 2005

Worth noting also that some more recent versions of CPAN:CGI set Unicode mode on characters, which can be a good thing for Unicode support, or a bad thing if you don't want Perl's Unicode mode turned on. For some pointers on this, see discussion on ProblemsWithInternationCharactersInOddPlaces.

-- RichardDonkin - 31 Jan 2006

One support request that really needs Unicode support is CentralEuropeanCharacters. Since the increasingly popular MediaWiki has excellent Unicode support, I think we need to do something here. Unfortunately I have virtually no time for coding but I have done a lot of research and am happy to advise and review.

-- RichardDonkin - 03 Sep 2006

One thing to watch out for is that Perl 5.8 now distinguishes between "utf8" and "UTF-8" - the former is Perl's looser interpretation, the latter is as specified by the Unicode standards, and is also known as "utf-8-strict". For details, see http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8 recent Encode documentation.
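
A small sketch of that difference, using only the core Encode module. The input bytes are picked for illustration: they spell out code point U+110000, which lies beyond the Unicode range, so the lax "utf8" decoder is expected to accept them while the strict "UTF-8" decoder rejects them. The helper name decodes_ok is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Four bytes that follow the UTF-8 bit pattern but encode U+110000,
# one past the highest code point the Unicode standard allows.
my $beyond_unicode = "\xF4\x90\x80\x80";

# Returns 1 if the bytes decode cleanly under the named encoding, else 0.
# FB_CROAK makes decode() die on malformed input instead of substituting.
sub decodes_ok {
    my ( $encoding, $bytes ) = @_;
    my $copy = $bytes;    # decode() may modify its input when a check is set
    return eval { decode( $encoding, $copy, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

my $lax_ok    = decodes_ok( 'utf8',  $beyond_unicode );    # Perl's loose form
my $strict_ok = decodes_ok( 'UTF-8', $beyond_unicode );    # utf-8-strict
```

For data arriving from browsers or files, the strict name is the safer default; the lax form is mainly useful for Perl's own internal extensions.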

There are also some interesting war stories about doing Unicode with Perl in this http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html blog entry.

Also, this http://dysphoria.net/2006/02/05/utf-8-a-go-go/ excellent blog entry provides some wrappers around CPAN:CGI and CPAN:DBI to make them work better with UTF-8.

-- RichardDonkin - 04 Nov 2006

Good http://www.simplicidade.org/notes/archives/2007/02/module_of_the_d_1.html blog posting about Perl UTF-8 coding including CPAN:encoding::warnings - handy for debugging Unicode.

-- RichardDonkin - 03 Apr 2007

Russian characters or encoding.
Why force UTF when not all browsers support it equally and HTML code editors do not copy/paste UTF correctly?
I build Russian (non-TWiki) pages with this heading:
<?xml version="1.0" encoding="windows-1251"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head xml:lang="ru" lang="ru"><meta http-equiv="content-type" content="text/html; charset=windows-1251" xml:lang="ru" lang="ru" />

Any browser displays such pages correctly. They validate perfectly, and I can read and edit the code with any editor, as it should be.

Using charset=iso-8859-15, as in current TWiki, never does the job for Russian.
You’d have to switch (any) browser manually to UTF-8 on each page change.
The question is: how do I make TWiki's Perl generate pages with the above Windows-1251 heading?

-- DimitriRytsk - 11 May 2007

See my other response. Windows-1251 is clearly a short term solution that is in no way comparable to proper Unicode support, which supports multiple languages simultaneously. Please don't ask the same question twice on pages that have nothing to do with Russian character set support.

-- RichardDonkin - 13 May 2007

WTF. CPAN:Unicode::UCD#num only exists in perl >= 5.14...

-- PaulHarvey - 15 Nov 2011
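
Since Unicode::UCD::num is unavailable before Perl 5.14, the narrower [0-9] suggested in the TODO list is the portable way to accept only ASCII digits. The sketch below shows why the two patterns differ; the sub names are made up for the example, and the sample string uses ARABIC-INDIC digits, which \d matches but Perl's built-in string-to-number conversion does not understand.

```perl
use strict;
use warnings;

my $ascii_year  = '2011';
my $arabic_year = "\x{0662}\x{0660}\x{0661}\x{0661}";    # ARABIC-INDIC 2 0 1 1

# \d follows the Unicode Digit property for character strings like these.
sub looks_like_unicode_digits { return $_[0] =~ /\A\d+\z/ ? 1 : 0 }

# [0-9] is limited to the ten ASCII digits, whatever the string contains.
sub looks_like_ascii_digits { return $_[0] =~ /\A[0-9]+\z/ ? 1 : 0 }

my $d_accepts_arabic     = looks_like_unicode_digits($arabic_year);
my $range_accepts_arabic = looks_like_ascii_digits($arabic_year);
my $range_accepts_ascii  = looks_like_ascii_digits($ascii_year);
```

A capture validated with [0-9]+ can be handed straight to numeric code such as date arithmetic, whereas one validated with \d+ may still not be usable as a number.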

While I agree with all things marked "done", IMHO, instead of adding "use utf8;" to all sources it would be better to add something like "use Foswiki::defaults;"

Foswiki::defaults should contain something like CPAN's uni::perl (so, all the common unicode settings). In the future we would then edit only Foswiki::defaults to add or change features in all sources.

See http://stackoverflow.com/questions/6412799/perl-how-to-make-use-mydefaults-with-modern-perl-utf8-defaults/6504836#6504836 too...

-- JozefMojzis - 20 Nov 2011
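
For readers unfamiliar with the technique, a "defaults" module of this kind can be sketched in a few lines. This is an illustration of the general pattern only, not the proposed Foswiki::defaults: the package name My::UnicodeDefaults is hypothetical, and the set of pragmas it turns on is just the ones already marked DONE in the list at the top of this page. It is defined inline in a BEGIN block so that the example fits in one file.

```perl
BEGIN {
    # Hypothetical stand-in for a project-wide defaults module.
    package My::UnicodeDefaults;

    require strict;
    require warnings;
    require utf8;

    # import() runs while the caller is being compiled, so the pragmas
    # enabled here take effect in the caller's lexical scope.
    sub import {
        strict->import;
        warnings->import;
        warnings->import( FATAL => 'utf8' );
        utf8->import;
    }

    # Mark the module as loaded so "use" does not look for a file on disk.
    $INC{'My/UnicodeDefaults.pm'} = __FILE__;
}

use My::UnicodeDefaults;

# With utf8 in effect, this literal is one character, not two bytes.
my $literal_len = length 'ñ';

# With strict in effect, an undeclared variable fails to compile.
my $strict_active = defined eval q{ $undeclared = 1; 1 } ? 0 : 1;
```

Every source file would then need only the single use line, and later additions (for example feature unicode_strings) would be made in one place.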

Crawford & I briefly discussed this, and perhaps that's what we'll eventually end up doing, however at this time we'd rather keep the number of "special" things down to a minimum - we're finding some stuff that doesn't seem to work as documented, depending on module/perl version, interpreter invocation, phase of the moon and so on...

So for now, my feeling is that this "phase 1" is to just focus on making it all work, which means minimising the number of fights we pick :-)

-- PaulHarvey - 21 Nov 2011

I pushed a merge of trunk to github - it seems I can't push directly to CDot's repo :-)

-- PaulHarvey - 29 Nov 2011

All my ram is gone (OOM killer starts) by the time RCSHandlerTests are run

-- PaulHarvey - 29 Nov 2011

JozefMojzis said ΑβγΔε and АбдЕжз should be valid greek & russian WikiWords; also pointed me to http://stackoverflow.com/questions/6322906/utf8-correct-regex-for-camelcase-wikiword-in-perl for tchrist's answer to a viable WikiWord regex

-- PaulHarvey - 18 Dec 2011
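
To make the WikiWord question concrete, here is a deliberately simplified sketch built on Unicode properties. It is neither Foswiki's actual WikiWord regex nor the one from the linked answer: it assumes a WikiWord is two or more "humps", each being one uppercase character followed by one or more lowercase characters, and it ignores digits and other details.

```perl
use strict;
use warnings;

# Simplified assumption: two or more humps of Upper followed by Lower+.
my $wikiword_re = qr/\A(?:\p{Upper}\p{Lower}+){2,}\z/;

sub is_wikiword { return $_[0] =~ $wikiword_re ? 1 : 0 }

# The two examples from the comment above, written as code points so the
# sketch does not depend on the encoding of this file.
my $greek   = "\x{391}\x{3B2}\x{3B3}\x{394}\x{3B5}";             # ΑβγΔε
my $russian = "\x{410}\x{431}\x{434}\x{415}\x{436}\x{437}";      # АбдЕжз

my $greek_ok   = is_wikiword($greek);
my $russian_ok = is_wikiword($russian);
my $ascii_ok   = is_wikiword('WikiWord');
my $single_ok  = is_wikiword('Wiki');      # only one hump
```

Under this definition both non-Latin examples qualify, which an [A-Z][a-z]-based pattern would not allow.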

While I'm not sure about the validity of the above, pharwey asked me for some examples. I generated all uppercase and all lowercase characters for the tests into AllUnicodeCharacters

-- JozefMojzis - 18 Dec 2011

Perl Unicode Cookbook

-- MichaelDaum - 02 Mar 2012

I've merged unicode branch with trunk again, so it isn't so hopelessly stale. It also runs to completion without any OOM errors any more :-)

-- PaulHarvey - 16 Jun 2012 - 10:30

Update: it only runs without OOM errors if you use RcsLite instead of RcsWrap.

So, notes:
  • FoswikiSuite dies with RcsWrap - massive memory leaks. So you need to use RcsLite until somebody fixes that.
  • FoswikiSuite dies on RobustnessTests because it tries to assert that a tainted variable is tainted, which seems to have problems for me in perl 5.14. I had to comment that out.

My FoswikiSuite results, mostly WYSIWYG related failures:
2984 of 3154 test cases passed(2980)+failed(4) ok from 3174 total, 20 skipped
0 + 170 = 170 incorrect results from unexpected passes + failures
1..77076

-- PaulHarvey - 20 Sep 2012

Update again: oops, dbb3596 is a FoswikiSuite run prior to today's merge.

-- PaulHarvey - 20 Sep 2012

On git 51df8bc

2882 of 3057 test cases passed(2875)+failed(7) ok from 3079 total, 22 skipped
0 + 175 = 175 incorrect results from unexpected passes + failures
1..73985

-- PaulHarvey - 20 Sep 2012

There has been a big effort to create a unicode-core Foswiki. See the utf8 branch of the distro, currently under test.

-- Main.CrawfordCurrie - 17 May 2015 - 09:56
 
Topic attachments
Attachment | Size | Date | Who | Comment
20120920_UnitTest.log | 133 K | 20 Sep 2012 - 02:58 | PaulHarvey | git dbb3596 using RcsLite, hot-wired RobustnessTests
20120920b_UnitTest.log | 171 K | 20 Sep 2012 - 04:41 | PaulHarvey | git 51df8bc using RcsLite, hot-wired RobustnessTests
Topic revision: r26 - 05 Jul 2015, GeorgeClark
