Item9170: WYSIWYG-Editor (Tiny MCE) kills German Umlauts in certain circumstances

Priority: Normal
Current State: Closed
Released In: 1.1.0
Target Release: minor
Applies To: Extension
Component: WysiwygPlugin
Branches:
Reported By: ThomasDoetterl
Waiting For:
Last Change By: KennethLavrsen

PROBLEM: =======

If a Foswiki topic that contains special numerically encoded Unicode characters (like &#65533;) is edited with the WYSIWYG editor (TinyMCE), the complete source file seems to be saved as Unicode (instead of iso-8859-1).

In particular, all German umlauts are converted. When the topic is viewed, each German umlaut is displayed as 2 special characters (e.g. Ã¤, Ã¶, Ã¼).

Strangely enough, the numeric entity for & (= ampersand) does not trigger this behaviour.

enclosing the respective code with <sticky> will not prevent this behaviour.

HOW TO REPRODUCE: ==============

1. create a new topic in sandbox-web

2. edit in raw-mode
and write the following text :

line with german Umlauts : ö + ä + ü

line with Unicode-character : &#65533;

3. save

4. look at the topic :
you will see :
  • a line with correct german umlauts
  • a line with a little square at the end (= the Unicode-character)

5. edit with WYSIWYG and save (perhaps forcing a new revision)

6. voila : german umlauts and little square will be corrupted ...
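
The corruption in step 6 looks like what happens when text that was saved as UTF-8 is read back as a single-byte charset. A minimal Python sketch of that symptom, assuming iso-8859-1 as the site charset (the function name is made up for illustration, and this is not the WysiwygPlugin code):

```python
# Illustration only: UTF-8 bytes misread as a single-byte site charset.
SITE_CHARSET = "iso-8859-1"  # assumption for this sketch


def misread_as_site_charset(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as the site charset."""
    return text.encode("utf-8").decode(SITE_CHARSET)


original = "\u00f6 + \u00e4 + \u00fc"  # the umlauts from step 2
mangled = misread_as_site_charset(original)
print(mangled)  # each umlaut now shows up as two characters
```

Each umlaut takes two bytes in UTF-8, so a reader that expects one byte per character displays two characters per umlaut.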


Hi Thomas,

Thank you for the bug report; could you please go to /bin/configure and check under Localisation, and report the value of:
  • {Site}{Locale}
  • {Site}{CharSet}
  • {UseLocale}

And under "CGI Setup", version information:
  • Operating system
  • Perl version
  • Perl modules
    • Encode
    • HTML::Entities
    • HTML::Parser

I have added MichaelTempest, who is working on locale problems with WysiwygPlugin at the moment, to the WaitingFor field.

-- PaulHarvey - 17 Jun 2010

Hi Paul, thank you for the fast response and sorry - I forgot to mention explicitly that step "1. create a new topic in sandbox-web" could be done in the foswiki.org Sandbox too. I just reproduced the bug yesterday with topic TdTestKillUmlaute20100616

And here is the config of our foswiki-installation :

  • {Site}{Locale} = de_DE@euro
  • {Site}{CharSet} = iso-8859-15
  • {UseLocale} = true
  • CGI-Setup:
    • Operating system = Linux 2.6.27 (i586-linux-thread-multi)
    • Perl version = 5.010000 (linux) Note that by convention "Perl version 5.008" is referred to as "Perl version 5.8" and "Perl 5.008004" as "Perl 5.8.4" (i.e. ignore the leading zeros after the .)
    • Perl modules
      • Encode 2.23 installed
        Desc: may be required for international characters; Required, for WysiwygPlugin
      • Encode::compat Not installed. may be required for international characters
      • HTML::Entities 1.35 installed
        Desc: for WysiwygPlugin
      • HTML::Parser 3.56 installed
        Desc: for WysiwygPlugin
      • I18N::Langinfo 0.02 installed
        Desc: may be required for international characters
      • Lingua::EN::Sentence Not installed. may be required for generating new language files
      • Locale::Maketext::Lexicon 0.49 installed
        Desc: may be required for international characters
      • Symbol 1.06 installed
        Desc: may be required for international characters
      • Unicode::MapUTF8 Not installed. may be required for international characters
      • Win32::Console Not installed. may be required for Windows

-- ThomasDoetterl - 17 Jun 2010

Thomas, please review Development.UnicodeCharactersInNonUtf8Encodings and provide your input on what you expect should be done with unicode characters that cannot be represented using the site's configured character set encoding.

We have a fundamental problem that TinyMCE likes to convert most interesting characters to &code; entities, and so the html2tml converter tries to undo this operation by attempting to restore the character back to its "native" encoding when saving back out to a topic.

Obviously, when the site's charset cannot represent the character behind the entity, this fails.
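
One way to picture a safer "restore" step is to turn a numeric entity back into a character only when the site charset can encode it, and to leave the entity alone otherwise. The following Python sketch works under that assumption. `restore_entities` and `NUMERIC_ENTITY` are hypothetical names, and this is not how html2tml is actually implemented:

```python
import re

SITE_CHARSET = "iso-8859-15"  # the {Site}{CharSet} value reported above

# Matches decimal (&#65;) and hexadecimal (&#xEB;) numeric entities.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def restore_entities(text, charset=SITE_CHARSET):
    """Replace a numeric entity with its character only if charset can encode it."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
            char.encode(charset)  # raises if the charset lacks this character
        except (ValueError, OverflowError):
            return match.group(0)  # not representable: keep the entity as written
        return char

    return NUMERIC_ENTITY.sub(replace, text)


print(restore_entities("&#xEB; and &#65533;"))
```

With iso-8859-15, `&#xEB;` becomes ë while `&#65533;` stays an entity, because U+FFFD has no slot in that charset.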

I wonder what would happen if we tell TinyMCE to use "raw" character entity encoding as per Tasks.Item8085.

TdTestKillUmlaute20100616

-- PaulHarvey - 17 Jun 2010

wow, never thought that would be such a tough problem

my answer to the question in chapter Unresolved issues of Development.UnicodeCharactersInNonUtf8Encodings :
  • Numerical entities (e.g. &#65; &#x26; &#xEB; which is A & ë) in ordinary topic text:
    • Is it acceptable for TinyMCE/WysiwygPlugin to convert them to plain characters or to named entities?
      • TD:
        • No to plain chars (NEVER !)
        • perhaps to named entities :
          the code would be easier to read with named entities (imho) but I don't mind if numerical entities persist

but apart from that TinyMCE seems to be flawed :

scenario 1
  • if I create a topic (in raw-edit-mode) with german Umlauts as plain characters but without "dangerous" numerical entities
    I can edit+save this topic in WYSIWYG without any problems - Umlauts are not touched : everything is fine
    Example: TdTestKillUmlaute20100617

scenario 2
  • if I add (in raw-edit-mode) one of the dangerous entities e.g. &#65533;
    and then edit+save in WYSIWYG, the effect is the same as in TdTestKillUmlaute20100616 :
    • the numerical entity is replaced with "something" (plain unicode char ?)
    • all Umlauts in the whole topic are mangled
      • what happens ?
      • how could this be triggered by adding a numeric entity somewhere in the topic text ?

-- ThomasDoetterl - 18 Jun 2010

Hi Thomas

Thank you for taking the trouble to document these failure scenarios - it really does help.

A WYSIWYG edit-save cycle is a complex process, with at least four conversions between the site charset and UTF-8, two of which are done by Foswiki and two by your browser. Additionally, Foswiki, TinyMCE and your browser are all converting between characters and entities. There is a lot of room for things to go wrong.

If you click on the WikiText button in TinyMCE, Foswiki does the conversion that the browser normally does when you click "Save" in the WYSIWYG editor. Try it - you will see that Foswiki preserves the umlaut characters but converts the numeric entity to \x{fffd}. Also wrong, but different. I think I know why Foswiki converts the entity to \x{fffd} (and I am working on a fix for that). I will investigate why the wider corruption occurs when saving from TinyMCE.

With respect to Unresolved issues, I would like to clarify a few things that might make a difference:
  1. A "plain character" is one that can be represented directly in the site charset. That includes characters like "A" and "ö", for ISO-8859-1 and ISO-8859-15. I proposed that Foswiki should be allowed to convert entities to characters, where possible, to improve the readability of the TML.
  2. I am not proposing that Foswiki should be allowed to convert "ö" to "o". That would change the meaning of the text and would most definitely be wrong.
  3. I am fixing Foswiki so that it leaves entities alone in <sticky> blocks. If you need an entity somewhere (instead of the plain character), then you will be able to in future (once I have fixed that bug).
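
Point 3 can be pictured as splitting the topic text at <sticky> blocks and applying the entity conversion only to the parts outside them. A rough Python sketch follows. `convert_outside_sticky` is a hypothetical helper, `html.unescape` merely stands in for whatever conversion is applied, and the real WysiwygPlugin logic may differ:

```python
import html
import re

# One capturing group, so re.split keeps the sticky blocks in the result.
STICKY_BLOCK = re.compile(r"(<sticky>.*?</sticky>)", re.DOTALL)


def convert_outside_sticky(text, convert):
    """Apply convert() to everything except <sticky>...</sticky> blocks."""
    parts = STICKY_BLOCK.split(text)
    # After the split, odd indexes hold the sticky blocks, even indexes the rest.
    return "".join(
        part if index % 2 else convert(part) for index, part in enumerate(parts)
    )


topic = "&#65; <sticky>&#65;</sticky> &#66;"
print(convert_outside_sticky(topic, html.unescape))
```

The entity inside the sticky block comes through exactly as typed, while the entities around it are converted.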

You obviously feel quite strongly that Foswiki should not convert entities to plain characters. I would really like to understand why not.

-- MichaelTempest - 18 Jun 2010

Hi Michael,

there seems to be a slight misunderstanding on my side - sorry.
When answering the question in Unresolved issues
I thought about "plain characters" meaning "plain characters in UTF-8" since I have learned that TinyMCE works with UTF-8
and the question explicitly addressed TinyMCE ("Is it acceptable for TinyMCE/WysiwygPlugin to convert").

so my first thought was:
If the site's character encoding is not UTF-8,
converting numeric entities to "plain characters in UTF-8" in TinyMCE
could lead to conversion problems when converting back to the site's character set.

my second thought was :
numeric entities in the source-code come either from an existing source-file (opening a source-file is some kind of user-input)
or have been typed in TinyMCE as numeric entity by the user.
In both cases TinyMCE should not be allowed to radically change the user-input.
(I admit not being 100% consistent when saying "perhaps converting from numerical to named entity is ok" )

In my experience a lot of trouble and unnecessary complexity stem from automatic operations which are not completely clear to the user.
If there is a feature to beautify the source-code that's fine - but the operation should be deliberately triggered by the user.

-- ThomasDoetterl - 21 Jun 2010

Hi Thomas

Thanks for the feedback.

I agree that the first prize would be for TinyMCE not to make any automatic changes. Unfortunately, we are stuck with some automatic changes for now - and perhaps forever, because converting between TML and HTML is inherently lossy (in both directions), although we are making the conversions less lossy as time goes on. If you want to avoid automatic changes completely, then avoid WYSIWYG HTML editors. Some Foswiki sites disable TinyMCE completely for this very reason.

Regarding your first thought: As you said, TinyMCE uses Unicode characters internally (because Javascript uses Unicode). If you enter a Unicode character in TinyMCE, and click Save, the browser converts that character to the site charset, and produces a character entity if the site charset cannot represent that character. A similar thing happens when you click WikiText, but this time Foswiki does the conversion.
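
The "character entity as fallback" behaviour described above can be approximated in Python with the `xmlcharrefreplace` error handler. This is only a sketch of the idea, assuming the iso-8859-15 charset reported earlier in this item. `to_site_charset` is a hypothetical helper, and real browsers may differ in detail:

```python
SITE_CHARSET = "iso-8859-15"  # assumption: the charset reported earlier in this item


def to_site_charset(text, charset=SITE_CHARSET):
    """Encode text, replacing unencodable characters with numeric entities."""
    return text.encode(charset, errors="xmlcharrefreplace")


submitted = to_site_charset("\u00f6 and \ufffd")
print(submitted)
```

The umlaut fits into a single byte of the site charset, whereas U+FFFD cannot be represented and is sent as `&#65533;`.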

Regarding your second thought: You can disable TinyMCE for a whole site, for a whole web, or for whole topics with the NOWYSIWYG setting. You can disable WYSIWYG-editing of sections of a topic using <sticky> ... </sticky> . Please use these facilities if you would like to protect content from TinyMCE's automatic changes.

I have changed WysiwygPlugin to fix the problem you reported, on trunk. If you would like to try it out for yourself, please experiment in the Sandbox on trunk (you will have to log in to trunk.foswiki.org using the same username and password as on foswiki.org). At present, Foswiki does convert entities in non-protected-text to characters if it can.

-- MichaelTempest - 21 Jun 2010

Hi Michael, Hi Paul,

first of all, I have to thank you for your patient explanations and rapid response.
I have learned a lot the past few days.
Also I have learned that it is much more complicated than I naively thought and I still have to learn much more.

In terms of our problem I think the fix is ok.

Finally, I have to say that you all at Foswiki do an excellent job!

-- ThomasDoetterl - 23 Jun 2010

You're welcome. Changing status to "waiting for release"

-- MichaelTempest - 23 Jun 2010

ItemTemplate

Summary WYSIWYG-Editor (Tiny MCE) kills German Umlauts in certain circumstances
ReportedBy ThomasDoetterl
Codebase 1.0.9
SVN Range
AppliesTo Extension
Component WysiwygPlugin
Priority Normal
CurrentState Closed
WaitingFor
Checkins distro:a820f29b54a4
TargetRelease minor
ReleasedIn 1.1.0
Topic revision: r12 - 04 Oct 2010, KennethLavrsen