WikiWords such as VirksomhedskulturÆrø do not work in UTF-8

The recent UTF-8 improvements have revealed some problems, and here is another.

In UTF-8 Perl is supposed to know which characters are upper case and which are lower case.

But both in the WYSIWYG editor and in the rendering of topics, setting the locale to da-DK.UTF8 and the charset to utf-8 makes TWiki treat characters outside A-Z as non-letters when it comes to WikiWords.

I have tested and experimented quite a lot, and I am convinced that under Perl 5.8 our regexes for WikiWords in TWiki.pm are correct.

The problem must be that TWiki does not see the string as utf-8 somewhere.

In ISO8859, WikiWords with non A-Z letters work fine, except in the JavaScript for creating topics, where there are still regexes hard-coded to A-Z.

Here on Bugs, VirksomhedskulturÆrø points to a not yet created topic.

-- TWiki:Main/KennethLavrsen - 24 Apr 2008

"The problem must be that TWiki does not see the string as utf-8 somewhere" - yes, probably. There are several functions provided for converting from the encoded UTF8 that might come from a CGI query to Perl's internal Unicode representation. Unfortunately they don't always work, and they are called in a rather hit-and-miss fashion. On the WYSIWYG side, as far as I know there is no code anywhere that recognises WikiWords without using the TWiki regexes.

A general "it doesn't work" is a truism, but isn't a useful report. Anyone who tries to fix this needs to know exactly how to reproduce a problem involving UTF8, and the above description isn't detailed enough, so I'm kicking this back for more detail (note: this does not mean I intend to fix it, I'm just triaging the issue)

-- CrawfordCurrie - 24 Apr 2008

Yes my description is detailed enough.

But let me write it again differently

Set up TWiki for UTF8 - this step should be obvious.

Write the word VirksomhedskulturÆrø

Save

Look

-- KennethLavrsen - 30 Apr 2008

OK, thanks, that was exactly the sort of simple recipe I wanted. I did that, and I see that the resulting word is not a wikiword in view. When I look at the topic saved to disc I see that the string is correctly encoded using UTF8 characters, so we can eliminate WYSIWYG as a source of error. The problem has to be with the regexes that recognise wikiwords.

Confirmed as an I18N issue.

Note: I just had a lot of grief saving this topic, which suggests there are still issues with charsets in form fields.

-- CC - 01 May 2008

I did a lot of reading on the topic and I am convinced our

        $regex{upperAlpha} = '[:upper:]';
        $regex{lowerAlpha} = '[:lower:]';
        $regex{numeric}    = '[:digit:]';
        $regex{mixedAlpha} = '[:alpha:]';

will also work in UTF8, seeing ÆØÅÉÖ as upper case and æøåéö as lower case.

But it requires that perl at any given time sees the variable on which we use the regex as UTF8 and not as plain ASCII.

I would like to try to analyse this further using some poor man's debugging. How do I easily identify whether a variable holds what Perl sees as UTF8 vs ASCII? I need a one-liner I can print out to error_log or a debug file.

-- KJL - 01 May 2008
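
On the one-liner asked for above, here is a minimal sketch, assuming the built-in `utf8::is_utf8` is the check that is wanted. `describe_string` is a hypothetical helper name, not a TWiki function. The flag only tells how Perl stores the string internally, so a plain ASCII string can have it off and still be perfectly fine.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical debugging helper (not part of TWiki): reports whether Perl
# holds the string as characters (UTF8 flag on) or as plain bytes, plus a
# hex dump of each element so the output is safe to write to error_log.
sub describe_string {
    my ($s) = @_;
    my $kind = utf8::is_utf8($s) ? 'characters (UTF8 flag on)'
                                 : 'bytes (UTF8 flag off)';
    my $dump = join ' ', map { sprintf('%02X', ord($_)) } split //, $s;
    return sprintf('%s, length %d: %s', $kind, length($s), $dump);
}

my $octets = "\xc3\x86";                          # UTF-8 bytes for "Æ"
my $chars  = Encode::decode('UTF-8', $octets);    # one character, U+00C6

print STDERR describe_string($octets), "\n";
print STDERR describe_string($chars),  "\n";
```

Inside TWiki the call would be something like `print STDERR describe_string($text), "\n";` placed just before the regex of interest.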

I have tried to analyze more. But I am nowhere near being able to resolve it. The unicode/utf-8 encoding/decoding is still a bit of a mystery to me.

But I have learned something.

The rendering of wikiwords happens in lib/Render.pm in the sub getRenderedVersion.

The actual lines are

    unless( TWiki::isTrue( $prefs->getPreferencesValue('NOAUTOLINK')) ) {
        # Handle WikiWords
        $text = $this->takeOutBlocks( $text, 'noautolink', $removed );
        $text =~ s/$STARTWW(?:($TWiki::regex{webNameRegex})\.)?($TWiki::regex{wikiWordRegex}|$TWiki::regex{abbrevRegex})($TWiki::regex{anchorRegex})?/_handleWikiWord( $this,$theWeb,$1,$2,$3)/geom;
        $this->putBackBlocks( \$text, $removed, 'noautolink' );
    }

So my first trial was to see if the problem is the regexes or the _handleWikiWord. The conclusion is that it is the regexes that do not work on $text because of the encoding used.

I tried this as an experiment.

    unless( TWiki::isTrue( $prefs->getPreferencesValue('NOAUTOLINK')) ) {
        # Handle WikiWords
        $text = $this->takeOutBlocks( $text, 'noautolink', $removed );
        $text = Encode::decode($TWiki::cfg{Site}{CharSet}, $text) if $TWiki::cfg{Site}{CharSet};
        $text =~ s/$STARTWW(?:($TWiki::regex{webNameRegex})\.)?($TWiki::regex{wikiWordRegex}|$TWiki::regex{abbrevRegex})($TWiki::regex{anchorRegex})?/_handleWikiWord( $this,$theWeb,$1,$2,$3)/geom;
        $text = Encode::encode($TWiki::cfg{Site}{CharSet}, $text) if $TWiki::cfg{Site}{CharSet};
        $this->putBackBlocks( \$text, $removed, 'noautolink' );
    }

This makes the links appear correct for a not yet created topic.

But the minute I create the topic and view the original topic with the wikiword I get errors "Malformed UTF-8 character".

Another observation: take a WikiWord SomeTopicÆØÅWithDanish in a topic. If I print $text to STDERR, I see the WikiWord as SomeTopic\xc3\x86\xc3\x98\xc3\x85WithDanish

After Encode::decode it becomes SomeTopic\xc6\xd8\xc5WithDanish and then the regexes work again. Shouldn't the regex engine also work on utf-8 strings in Perl 5.8?

So the issue is again the coding of strings used inside TWiki. How do we fix this? I am stuck.

-- KennethLavrsen - 02 Jun 2008
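
The byte dump quoted above can be reproduced outside TWiki. The following is only a sketch: the `%regex` entries are copied from the earlier comment, but `$wikiword` is a simplified, anchored stand-in for the real WikiWord pattern and is an assumption made for illustration. It suggests that the POSIX classes only recognise the Danish letters once the string has been decoded into Perl characters.

```perl
use strict;
use warnings;
use Encode ();

# Character classes as quoted earlier in this item.
my %regex = (
    upperAlpha => '[:upper:]',
    lowerAlpha => '[:lower:]',
    numeric    => '[:digit:]',
    mixedAlpha => '[:alpha:]',
);

# Simplified, anchored stand-in for the WikiWord pattern (an assumption,
# not the exact regex from TWiki.pm).
my $wikiword = qr/^[$regex{upperAlpha}]+[$regex{lowerAlpha}$regex{numeric}]+[$regex{upperAlpha}]+[$regex{mixedAlpha}$regex{numeric}]*$/;

# The same bytes that were printed to STDERR above.
my $octets = "SomeTopic\xc3\x86\xc3\x98\xc3\x85WithDanish";
my $chars  = Encode::decode('UTF-8', $octets);

# On the byte string, \xc3 and friends are not letters to the regex engine;
# on the decoded string, the three Danish letters count as upper case.
my $bytes_match = ($octets =~ $wikiword) ? 1 : 0;
my $chars_match = ($chars  =~ $wikiword) ? 1 : 0;

print "byte string matches:      $bytes_match\n";
print "character string matches: $chars_match\n";
```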

I had similar problems with unicode and Perl. I described the steps that helped me in UnicodeProblemsAndSolutionCandidates.

-- ChristianLudwig - 09 Jun 2008

For 4.2.1 I am still a bit stuck and need a hand.

-- KennethLavrsen - 18 Jun 2008

This seems to be trying to do UseUTF8 aka UnicodeSupport, which is a rather large piece of work that affects many different parts of TWiki. Current versions of TWiki don't support Unicode at all - while you can set .utf8 in the locale etc, it's not recommended for European languages, only for those languages such as Chinese that don't care about I18N characters in WikiWords. In other words, this is not a bug, it's the missing UnicodeSupport feature that's needed here.

However, perhaps this is part of feature work on Unicode support. In which case it's a matter of ensuring that all strings processed by TWiki are not just UTF-8 bytes but are turned into Perl utf8 characters (i.e. Perl's utf8 mode as in perldoc perlunicode). This is presumably not happening somewhere, as you mention. Note that one side effect of Encode::decode is that UTF-8 byte strings turn into Perl utf8 character strings. Clearly, having the sequence of bytes in UTF-8 is not enough: each multi-byte UTF-8 sequence (2 bytes for Danish letters such as Æ, up to 4 bytes in general) is represented, when Perl is using utf8 mode for a string, as a single 'character', i.e. it's a unit of matching in regexes and a unit for other string operations. There are some Perl functions that will do this for you, and it also happens automatically in some cases but not all. However, moving TWiki to Perl utf8 mode is a large piece of work.

I do think this should be treated as feature work and done on a branch - getting utf8 right is quite disruptive potentially, though it could be mitigated if we have a simple 'utf8 mode on/off' flag as I suggested on UseUTF8.

See my comment on UseUTF8 - have added a new Key Concepts section there that is relevant to this distinction.

-- RichardDonkin - 26 Jun 2008
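
To make the bytes-versus-characters distinction above concrete, here is a small sketch using only the core Encode module. The variable names are illustrative and nothing here is TWiki code. It compares what `length`, a `.` in a regex, and `lc` operate on before and after decoding, and checks that encoding again gives back the original bytes.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xc3\x86r\xc3\xb8";                 # UTF-8 bytes for "Ærø"
my $chars  = Encode::decode('UTF-8', $octets);    # Perl character string

# length() counts bytes in the first case and characters in the second.
my $byte_len = length($octets);
my $char_len = length($chars);

# A "." in a regex consumes one byte or one character respectively.
my @byte_units = $octets =~ /(.)/gs;
my @char_units = $chars  =~ /(.)/gs;

# Case mapping understands the decoded string: U+00C6 becomes U+00E6.
my $lower = lc($chars);

# Encoding on the way out restores the original byte sequence.
my $round_trip = Encode::encode('UTF-8', $chars);

printf "bytes: length %d, regex units %d\n", $byte_len, scalar(@byte_units);
printf "chars: length %d, regex units %d\n", $char_len, scalar(@char_units);
printf "lc changed first character: %s\n", (ord($lower) == 0xE6 ? 'yes' : 'no');
printf "round trip matches input: %s\n", ($round_trip eq $octets ? 'yes' : 'no');
```

The usual discipline that follows from this is to decode once where data enters the program and encode once where it leaves, rather than around individual regexes.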

OK so I will downgrade this to normal then and I will do a small update on the help text in configure and probably also in some of the installation docs about the current support of UTF8.

-- KennethLavrsen - 26 Jun 2008

Duplicate of Item5230 but has more analysis so closed 5230 instead.

-- CrawfordCurrie - 04 Jan 2009

Found out that there is a difference between utf8 and UTF-8 in Perl. See http://jeremy.zawodny.com/blog/archives/010546.html . I think this might help with "malformed utf-8" type errors.

-- StefanosKouzof - 10 Feb 2010
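
The utf8 versus UTF-8 naming difference mentioned above can be seen directly in the Encode module. This is a sketch only, and whether it explains the "Malformed UTF-8 character" errors reported in this item is an open assumption. It shows which encoding object each name resolves to, and that a strict decode with `Encode::FB_CROAK` rejects ISO-8859-1 bytes instead of letting them through.

```perl
use strict;
use warnings;
use Encode ();

# The two names resolve to different encoding objects.
my $strict_name = Encode::find_encoding('UTF-8')->name;
my $lax_name    = Encode::find_encoding('utf8')->name;

# "Ærø" as ISO-8859-1 bytes: \xC6 followed by "r" is not valid UTF-8.
my $latin1_bytes = "\xC6r\xF8";

# FB_CROAK may modify its input, so decode a copy inside eval.
my $copy    = $latin1_bytes;
my $decoded = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
my $error   = $@;

print "UTF-8 resolves to: $strict_name\n";
print "utf8 resolves to:  $lax_name\n";
print "strict decode of Latin-1 bytes failed\n" if !defined($decoded) && $error;
```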

ItemTemplate

Summary WikiWords such as VirksomhedskulturÆrø do not work in UTF-8
ReportedBy TWiki:Main.KennethLavrsen
Codebase
SVN Range TWiki-5.0.0, Tue, 15 Apr 2008, build 16676
AppliesTo Engine
Component I18N
Priority Normal
CurrentState Confirmed
WaitingFor
Checkins
TargetRelease major
ReleasedIn
Topic revision: r14 - 10 Feb 2010, StefanosKouzof