Item758: Raw view breaks Chinese characters in UTF-8

Priority: Urgent
Current State: Closed
Released In: 1.1.0
Target Release: minor
Applies To: Engine
Component: I18N
Reported By: ChYang
Waiting For:
Last Change By: KennethLavrsen
Raw View will break some of the Chinese characters FF/FD.
RawView.png
You can test the Raw View on my FW 1.0 at http://ebm.twbbs.org/bin/view/Sandbox/ChineseTest


Chinese character 盧 (codepoint 30439) is represented by the byte-sequence e7 9b a7.

Something in the "raw view" processing is converting "Windows-1252" codes to character entities. For example, 0x9B in Windows-1252 corresponds to Unicode codepoint 8250. That conversion breaks the UTF-8 encoding.

There is a similar problem with ě which is represented by the byte-sequence C4 9B, as reported in Item8950.

This might be tricky to solve comprehensively, as the problem is caused by CGI::textarea. On trunk, it is this call.

Confirmed. This particular manifestation of what I consider to be a bug in CGI.pm will not cause data loss, but the same bug might cause data loss in forms. Setting it to urgent because I don't know what else might be affected. My CGI.pm is version 3.29.

-- MichaelTempest - 27 Jun 2010
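
The byte-level diagnosis in the comment above can be illustrated with a short sketch. It is written in Python rather than Perl and does not call CGI.pm; the function `entity_for_9b` is a hypothetical stand-in for whatever step in the raw view rewrites the byte, and only the byte values and codepoints come from the report.

```python
# Characters and byte sequences quoted in the report.
lu = "\u76e7"        # the Chinese character with codepoint 30439
e_caron = "\u011b"   # e with caron

lu_bytes = lu.encode("utf-8")            # e7 9b a7
e_caron_bytes = e_caron.encode("utf-8")  # c4 9b

# In Windows-1252 the single byte 0x9B is the character with codepoint 8250.
cp1252_char = bytes([0x9B]).decode("cp1252")


def entity_for_9b(data):
    """Hypothetical stand-in for the faulty step: rewrite byte 0x9B as an entity."""
    return data.replace(b"\x9b", b"&#%d;" % ord(cp1252_char))


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


corrupted = entity_for_9b(lu_bytes)
print(lu_bytes, "->", corrupted)
print("still valid UTF-8:", is_valid_utf8(corrupted))
```

The lead byte e7 announces two continuation bytes, so once the middle byte has been replaced by ASCII entity text the sequence can no longer be decoded, which would match the broken characters seen in the raw view.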

When I:
  1. Set up {UseLocale}=1; {Site}{CharSet}='UTF-8'; {Site}{Locale}='en-GB.UTF8'; (CGI version is 3.49)
  2. Edit Wiki Text
  3. Write 皍 in the text, save
  4. Looks good, I see the Chinese character
  5. Copy the character shown to the cut buffer
  6. Edit again, paste the character, and save
  7. Looks good (I see two instances of the character)
  8. View Wiki Text (still looks good)
So what am I doing wrong? How did you reproduce this? Do I need to view on Windows? Or has this simply been fixed in CGI?

-- CrawfordCurrie - 13 Jul 2010

I can reproduce the problem with the process you describe, using 蓋 or 盧, but not with 皍.

-- MichaelTempest - 14 Jul 2010
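
One possible explanation for why only some characters reproduce the problem is that only some UTF-8 sequences contain a byte that gets special treatment. The sketch below checks the three characters from this thread. The set `SUSPECT_BYTES` is an assumption: 0x9B is named in this report, while 0x8B is added from my reading of CGI::escapeHTML in CGI.pm releases of that era and is not confirmed here.

```python
# Assumption: these are the bytes rewritten when the charset is treated as Latin.
# 0x9B comes from the report; 0x8B is an unconfirmed addition.
SUSPECT_BYTES = {0x8B, 0x9B}


def suspect_bytes_in(char):
    """Return the UTF-8 bytes of char that fall in SUSPECT_BYTES."""
    return [b for b in char.encode("utf-8") if b in SUSPECT_BYTES]


# The three characters tried in the thread.
for ch in ["蓋", "盧", "皍"]:
    encoded = " ".join("%02x" % b for b in ch.encode("utf-8"))
    hits = suspect_bytes_in(ch)
    print("U+%04X" % ord(ch), encoded, "affected" if hits else "not affected")
```

Under that assumption the first two characters each contain one suspect byte and the third contains none, which is consistent with the reproduction results reported above.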

Dyslexics of the world unite! Confirmed.

-- CrawfordCurrie - 14 Jul 2010

OK, fixed. We have to ensure the auto-encoding in CGI uses the correct character set. CGI defaults to iso-8859-1, and CGI::escapeHTML has a special case for iso-8859-1 and windows-1252 that breaks UTF-8 content. If the character set is left at the default on a UTF-8 site, CGI will fail to encode certain UTF-8 characters correctly.

Note that this problem exists in earlier releases as well, but I have only committed to trunk. The same fix should be portable.

-- CrawfordCurrie - 14 Jul 2010
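
The effect of the fix described above can be modelled as follows. This is a simplified Python model, not CGI.pm's code and not the committed change: the function `escape_html_model`, its default charset, and the exact set of rewritten bytes are assumptions made for illustration. In CGI.pm itself the character set is normally declared through its `charset()` method; whether the commit uses that call is not shown in this item.

```python
LATIN_CHARSETS = {"ISO-8859-1", "WINDOWS-1252"}


def escape_html_model(data, charset="ISO-8859-1"):
    """Hypothetical model of charset-dependent HTML escaping on raw bytes."""
    out = (data.replace(b"&", b"&amp;")
               .replace(b"<", b"&lt;")
               .replace(b">", b"&gt;")
               .replace(b'"', b"&quot;"))
    # Assumed special case: only applied when the charset is a Latin one.
    if charset.upper() in LATIN_CHARSETS:
        out = out.replace(b"\x8b", b"&#8249;").replace(b"\x9b", b"&#8250;")
    return out


topic_text = "\u76e7 <b>".encode("utf-8")

# Default charset: the 0x9B byte inside the UTF-8 sequence is rewritten.
with_default = escape_html_model(topic_text)

# Charset declared as UTF-8: multi-byte sequences pass through untouched.
with_utf8 = escape_html_model(topic_text, charset="UTF-8")

print(with_default)
print(with_utf8.decode("utf-8"))
```

In this model the markup characters are escaped in both cases, and only the declared charset decides whether the bytes of a multi-byte character are left alone.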

ItemTemplate

Summary Raw view breaks Chinese characters in UTF-8
ReportedBy ChYang
Codebase 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.5 beta1, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, 1.0.0 beta3, 1.0.0 beta2, 1.0.0 beta1, trunk
SVN Range SVN 1972: Foswiki-1.0.0, Fri, 09 Jan 2009, build 1899
AppliesTo Engine
Component I18N
Priority Urgent
CurrentState Closed
WaitingFor
Checkins distro:94cbfbd0a9fd distro:7a09a9169186
TargetRelease minor
ReleasedIn 1.1.0

Attachment Size Date Who Comment
RawView.png 7.5 K 14 Jan 2009 - 04:48 ChYang Raw View breaks Chinese...
Topic revision: r9 - 04 Oct 2010, KennethLavrsen
 