This question about Issue in browser: Asked

utf8 (char) does not map to Unicode

I am upgrading from 1.1.9 to 2.0.1. I moved my data files over to the new directory tree. (Note: I did not use bulk_copy.pl because it is currently broken.)

I opened Main/WebHome and realized that I had missed installing FlexWebListPlugin. After I did so, when I went back to Main/WebHome, I saw this error:

Foswiki detected an internal error - please check your Foswiki logs and webserver logs for more information.

utf8 "\x92" does not map to Unicode

I disabled FlexWebListPlugin and can view Main/WebHome again.

  • A web search tells me that byte 0x92 (146 decimal) is the right single quotation mark (a so-called smart quote) in the Windows-1252 encoding.
  • Given what I know about FlexWebListPlugin, I am guessing that the character is in the description text of one of my webs.
  • Trying to open System/SiteMap (after disabling FlexWebListPlugin) throws the same error.
    • This would seem to confirm my guess about where the problem is.
  • However, a brute-force attempt to uncover the problem file will be tedious (and should be something that can be automated).

Is there a script that I can run to locate all topic files that contain characters that do not map to Unicode?

Essentially, I want to run just the "find bad encodings" portion of bulk_copy and identify problems. I don't even need to have it automatically fix these, only identify them.

I can imagine that such a script could be useful for other people as well...

-- VickiBrown - 17 Sep 2015

The CharsetConverterContrib has an inspect mode and will report issues. It also has a repair option that will detect alternate encodings and will convert the topic. So in your case, it will see the "smart-quotes" that are part of the Windows cp-1252 codepage, and will attempt to convert the topic with that codepage.
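
To illustrate the cp-1252 case, here is a minimal Python sketch. It is not CharsetConverterContrib's actual code: the helper name `repair_cp1252` and the sample description text are made up for illustration. It shows byte 0x92 failing strict UTF-8 decoding, then decoding cleanly as Windows-1252 and being re-encoded as UTF-8.

```python
def repair_cp1252(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it from cp1252.

    Hypothetical helper for illustration only. Python's cp1252 codec rejects
    the five bytes Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D),
    so such input raises UnicodeDecodeError instead of being guessed at.
    """
    try:
        raw.decode("utf-8")  # strict: raises on the first invalid byte
        return raw
    except UnicodeDecodeError:
        # Assume the bytes are Windows-1252 and convert them to UTF-8.
        return raw.decode("cp1252").encode("utf-8")


sample = b"Vicki\x92s web"  # made-up web description with a cp1252 smart quote
fixed = repair_cp1252(sample)
print(fixed)  # the 0x92 byte becomes the three-byte UTF-8 form of U+2019
```

Always work on a copy of the data: guessing cp1252 for every non-UTF-8 file is only safe when you know that is where the stray bytes came from.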

We still have some challenges in the conversion tools, but it's getting closer. Remaining issues:
  • Topics containing more than one encoding (for example, someone pastes in smart quotes and also some UTF-8 characters).
  • Links to attachments with high characters in the attachment name. They are entity-encoded in the topic, so the topic is detected as plain ASCII and the links don't get converted.
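
The mixed-encoding case can at least be spotted by checking a topic line by line. The following is a rough Python sketch, not something the conversion tools are known to do; the function name `classify_lines` and the sample bytes are assumptions for illustration. A topic whose lines yield both labels probably contains more than one encoding.

```python
def classify_lines(raw: bytes):
    """Yield (line_number, label) for every line that contains non-ASCII bytes."""
    for number, line in enumerate(raw.splitlines(), start=1):
        if line.isascii():
            continue  # pure ASCII lines are valid in either encoding
        try:
            line.decode("utf-8")
            label = "utf-8"
        except UnicodeDecodeError:
            label = "not utf-8"  # likely a single-byte encoding such as cp1252
        yield number, label


# Made-up topic text: line 2 has a cp1252 smart quote, line 3 has UTF-8 "é".
topic = b"plain text\nsmart \x92 quote\ncaf\xc3\xa9\n"
results = list(classify_lines(topic))
labels = {label for _number, label in results}
print(results, "mixed" if len(labels) > 1 else "single")
```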

-- GeorgeClark - 17 Sep 2015

Actually, some sites with an installed base of Windows users are reporting better luck converting by just setting {Site}{CharSet} on the 1.1.9 system to 'cp-1252', so that the default source encoding includes the Windows characters.

-- GeorgeClark - 17 Sep 2015

FYI, this very simple grep command should work on Unix-based servers with GNU grep to hunt down topic files that contain non-ASCII bytes:

find "$@" -name '*.txt' -print0 | LC_ALL=C xargs -0 grep -lP '[\x80-\xFF]'
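
For anyone who wants the offending byte and its position as well, here is a small Python sketch along the same lines. It is an illustration under my own assumptions, not part of bulk_copy.pl or any Foswiki tool, and the function name `find_non_utf8` is made up. Unlike the grep above, it flags only files that fail strict UTF-8 decoding, so topics that are already valid UTF-8 are not reported.

```python
import os
import sys


def find_non_utf8(root, suffix=".txt"):
    """Yield (path, offset, byte) for each matching file that is not valid UTF-8."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            try:
                data.decode("utf-8")  # strict decoding, nothing is modified
            except UnicodeDecodeError as err:
                # err.start is the offset of the first byte that failed.
                yield path, err.start, data[err.start]


if __name__ == "__main__":
    # Pass one or more data directories on the command line.
    for root in sys.argv[1:]:
        for path, offset, byte in find_non_utf8(root):
            print(f"{path}: byte 0x{byte:02x} at offset {offset}")
```

It only reads files and reports the first bad byte in each one, so it should be safe to run against a live data directory.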

-- VickiBrown - 17 Sep 2015

 

Topic revision: r6 - 19 Sep 2015, VickiBrown