  • Tip Added By - GeorgeClark - 04 Jun 2015 - 16:59

Migrating to Unicode and UTF-8

The Problem

Back when Foswiki was invented, the internet had yet to settle on a consistent way to represent the many character sets used around the world, and Perl - the language Foswiki is written in - had poor support for anything except ASCII. So Foswiki was written to mostly ignore character encodings, and only use them when absolutely necessary. Over the years this has caused many, many problems; problems that will be going away in Foswiki 2.0 when we standardise on the UTF-8 character encoding. Using UTF-8 with a modern Perl greatly simplifies use of non-Latin character sets and improves multi-language support.

For a simple but more detailed explanation of Unicode and UTF-8, see WhatIsUtf8AllAbout.

If you want to upgrade to Foswiki 2.0, you will first have to make sure all your content is consistent (using the same character set encoding throughout), and (preferably) already using utf-8. This applies to all web names, topic names, attachment names, and topic content (though not to attachment content).

Once your content is consistent and encoded using utf-8, you can either use that content directly in 2.0 with the RcsLite or RcsWrap stores, or you can use the 2.0 tools/bulk_copy.pl tool to port existing content to the new install (this will be the recommended upgrade path for all future Foswiki upgrades).

If you have any questions about character encodings, it might help to read http://search.cpan.org/~rjbs/perl-5.22.0/pod/perlunifaq.pod

What do I have to do?

First, check your LocalSite.cfg for the setting {Site}{CharSet}. If this is already utf-8 (or utf8), you can relax: you don't have to do anything else. Otherwise, you may need to migrate your data to utf-8.

Second, utf-8 incorporates the entire ASCII character set, so if you are 100% sure that only ASCII characters have been used in your wiki content (possible if your native language has no accents, cedillas, or circumflexes), you can relax: just change {Site}{CharSet} to utf-8 and you are good to go.

'Wiki content' means:
  • Web names
  • Topic names
  • Attachment names
  • Topic content

Note that the default encoding in Foswiki 1.1.9 and earlier was iso-8859-1. This encoding uses all eight bits of every byte (ASCII uses only the seven least significant bits). You can easily check whether a file contains only ASCII using the following command:
perl -nlE 'say if /\P{Ascii}/' < /path/to/file/to/check
(\P{Ascii} is a negated "Unicode property" match: it matches any character outside the ASCII range. There are other properties you can use to explore your content - see http://www.regular-expressions.info/unicode.html for more on this.)
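
If you want to scan a whole directory tree rather than one file at a time, a short script along these lines will list every topic file containing non-ASCII characters. This is a minimal sketch using only core Perl modules; the start directory is an assumption - point it at your own data directory.

#!/usr/bin/perl
# Sketch: list every .txt file under a directory that contains
# non-ASCII bytes. The start directory is illustrative.
use strict;
use warnings;
use File::Find;

my $root = shift || '/path/to/foswiki/data';
find( sub {
    return unless -f && /\.txt$/;
    open( my $fh, '<', $_ )
      or do { warn "Cannot open $File::Find::name: $!"; return };
    while ( my $line = <$fh> ) {
        if ( $line =~ /\P{Ascii}/ ) {
            print "$File::Find::name: non-ASCII at line $.\n";
            last;
        }
    }
    close $fh;
}, $root );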

Third, if you are absolutely certain that all your wiki content consistently follows the same encoding (usually iso-8859-1), then the tools/bulk_copy.pl script in Foswiki 2.0 can safely transfer your content from your existing installation to your new Foswiki 2.0 installation. If you are changing store implementation at the same time - e.g. to PlainFileStore - then this is the route you should follow (but make sure your store is consistent first; see below).

Checking consistency and, if necessary, converting your content to utf-8 is a fairly painless process. The only challenge comes when your topics have become corrupted with a mixture of encodings (are inconsistent). This can happen as a result of things like:
  • Previous change of {Site}{CharSet} without proper migration of data
  • Topics created / maintained external to Foswiki
  • Cut/paste of encoded data into the wiki editor
Warning: do your users paste data from Windows applications like Microsoft Word? If they do, it is highly likely that your topics contain data encoded in windows-1252 (also known as cp-1252). Your conversion might work better if you set your {Site}{CharSet} to cp-1252.
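
The distinction matters because windows-1252 assigns printable characters (curly quotes, the em dash, the Euro sign) to the byte range 0x80-0x9F, which iso-8859-1 reserves for invisible control codes. A quick demonstration, as a sketch using the core Encode module:

#!/usr/bin/perl
# Sketch: the same byte decodes very differently depending on
# which legacy encoding you assume.
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x93";    # a typical Word "smart quote" byte
printf "cp-1252:    U+%04X\n", ord decode( 'cp1252',     $byte );    # U+201C, a curly quote
printf "iso-8859-1: U+%04X\n", ord decode( 'iso-8859-1', $byte );    # U+0093, a C1 control code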

Checking consistency of old encodings

If you have reason to suspect that mixed encodings may have been used in your data, then you should check first before doing anything else.

Note: This process requires direct server access. If it is not possible to access the server directly, you either need to run using your existing encoding, or copy the data off the server to a local Foswiki installation for conversion.

First install CPAN:Encode::Detect::Detector, then install the CharsetConverterContrib. This tool can analyse and, if necessary, repair your encoding.

Now you can perform a complete scan of all your webs, topics, and attachments, including their complete histories, and report any issues. This should be done on a 1.1.x system. There is no need to shut down the system while the scan is running.

cd /path/to/foswiki/tools
perl -I ../lib convert_charset.pl -i -r > ../convert.log 2>&1

  • The checking process generates a very large volume of output. The best way to use this is to capture the results to a file.
  • The -i option tells the converter not to actually do anything, just to dry-run and tell you what it would do.
  • The -r option tells the converter to use Encode::Detect::Detector to guess the character encoding(s).

The converter runs in two phases:
  • All file names are inspected for encoding issues (and renamed as necessary, unless -i).
  • All file contents are inspected for encoding issues (and re-encoded as necessary, unless -i)

Annotated examples of the messages you might see:

Move data/Main/FioCole.txt,v
  A file name is examined for non-ASCII characters and renamed if necessary.
WARNING windows-1252 encoding detected in name data/Sandbox/TestTpic.txt
  A file name contains non-ASCII characters; it will be renamed when converted to utf-8.
START conversion of Sandbox/SimpleJavaScriptSnippets
  Conversion of a topic has started.
WARNING windows-1252 encoding detected in content of SimpleJavaScriptSnippets,v
  Encode::Detect::Detector thinks the encoding of this file is different from the {Site}{CharSet}.
Converted history of SimpleJavaScriptSnippets (9 changes)
  9 instances of string encoding were changed in the topic history.
WARNING windows-1252 encoding detected in content of /usr/home/foswiki.org/public_html/data/Sandbox/SimpleJavaScriptSnippets.txt
  The same warning, for the topic file itself.
Converted .txt of SimpleJavaScriptSnippets
  The conversion of that topic file is complete.
CONVERSION FINISHED Moved 97677 (33 renamed) Converted 32249
  The run has completed.

Repairing inconsistent content

If convert_charset.pl doesn't report any problems, you can skip this step and go straight to "Converting to UTF-8", below.

You have a range of options for repairing problems:
  • You can simply delete any content that is wrongly encoded
  • You can manually repair small numbers of problems using a text editor, or by renaming webs/topics/attachments to names using the {Site}{CharSet}
  • You can repair the content using the CharsetConverterContrib

The best way to use the CharsetConverterContrib is to run it using the -i -r options, and fix topics / add options until it runs through cleanly. You can then proceed to the next section and use the options you have identified to run the actual conversion.

The CharsetConverterContrib -r option uses Encode::Detect::Detector, which uses significant magic to guess the encoding. Sometimes it gets it badly wrong. It is always best to grep the results for all warnings (a one-liner for this is sketched after the list below), and review the topics manually to determine whether the guess was correct. For example:

  • Use the quick check described above:
perl -nlE 'say if /\P{Ascii}/' < /path/to/file/to/check
  • View the topic with your 1.1.x system. Do you see the non-ASCII text? Does it display correctly?
  • If it displays correctly, then chances are the original encoding was correct, and the guess was wrong.
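
To pull just the warnings out of a large log, a one-liner in the same style as the checks above will do; this assumes you captured the dry run to ../convert.log as shown earlier.

perl -ne 'print if /^WARNING/' ../convert.log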

Dealing with bad guesses, etc.
  • If the guesses were wrong and the topics work correctly on Foswiki 1.1, then carry on and complete the conversion to utf-8, but without the -r option.
  • If you can't fix a broken topic manually and need to rely on the guesses, then you may have to tweak the converter options. The converter supports a number of options for mapping the encodings of file names and file contents.

Maybe I don't want to deal with this...

Foswiki can function with {Store}{Encoding} set to something other than utf-8, but the results are less than optimal. Here are the issues with remaining on the old Foswiki 1.x default:

Is your wiki really using iso-8859-1, or are your topics polluted with some combination of iso-8859-1, windows-1252 and whatever else end users inserted with cut/paste?
  Foswiki 1.x (and TWiki before it) simply stores whatever bytes the user enters into the editor. If they are not valid per the {Site}{CharSet}, it's up to the browser to deal with it. On Foswiki 2.0, running with a non-utf-8 store, these characters will be converted to &#xx; entities.
Do any attachment names contain "high characters" - umlauts, etc.?
  Foswiki 2.0 internally uses Unicode except when writing content and file names to the Store; those are written per the {Store}{Encoding}. However, links to attachments (and all browser interaction) use utf-8. When the browser attempts to follow a link to a utf-8 encoded attachment name, the web server will fail with a 404, because the file name on disk is encoded in your {Store}{Encoding}. Foswiki 2.0.2 provides a "helper" plugin, PubLinkFixupPlugin, to deal with this by rewriting pub/ URLs, but the results are not optimal.
Are you using older extensions that write directly to the Store?
  On Foswiki 2.0, anything that writes directly to the store without using the official APIs will probably create utf-8 file names. Foswiki 2.0 uses a hardcoded {Site}{CharSet} of utf-8. A mixture of utf-8 and iso-8859-x attachment names will be much more difficult to migrate at a future date.
  • ... (work in progress)

The bottom line: migration to Unicode is challenging, but unfortunately necessary to clean up a number of issues caused by indiscriminate cut/paste of questionable data into topics.

Converting to UTF-8

Once your content is consistent, or you have established the correct set of options to CharsetConverterContrib, you can run the conversion.

Warning: at this stage take a backup, and be certain you can easily restore it! Be absolutely certain that all files in your Foswiki installation are writable during the migration. All it takes is one write failure and the system will be left partially converted to utf-8, in which case you will need to restore your backup and try again.

To run the conversion, simply use the same command-line as before, but remove the -i option. Content - including all histories - will be converted in-place.

Other things to check

The CharsetConverterContrib only handles wiki content, but there are other files you might want to check:
  • Your .htpasswd file, if you are using it. If a user has used non-ASCII characters in either their WikiName or email address, the password file can become unusable with Foswiki 2.0. For example, the Foswiki.org .htpasswd file had one user with a non-ASCII email address, which had to be manually converted to utf-8.
  • .changes files in webs. These may contain strange encodings if CharsetConverterContrib had to fix the encoding of any web, topic or attachment names.
  • Log files, if you intend to keep them.
  • Locally developed (or modified) .tmpl files, in the templates directory
  • Attachments containing text (such as HTML files) that are included in-line into topics
If you have to fix these files, you need to know what encoding was used to generate them. Assuming you do, you can use Perl to do the conversion:
# Convert file-to-convert, known to be encoded using iso-8859-1, to utf-8, and write the result to converted-file
perl -MEncode -E '$/=undef; print encode_utf8(decode("iso-8859-1", <>))' <file-to-convert >converted-file
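
If you have more than a handful of such files, the same conversion can be wrapped in a small script. This is a sketch: it assumes iso-8859-1 input, and deliberately writes to a new file with a .utf8 suffix rather than converting in place, so verify the results before replacing the originals.

#!/usr/bin/perl
# Sketch: convert each file named on the command line from
# iso-8859-1 to utf-8, writing the result to <name>.utf8.
use strict;
use warnings;
use Encode qw(decode encode_utf8);

local $/;    # slurp whole files
for my $file (@ARGV) {
    open( my $in,  '<', $file )        or die "Cannot read $file: $!";
    open( my $out, '>', "$file.utf8" ) or die "Cannot write $file.utf8: $!";
    print $out encode_utf8( decode( 'iso-8859-1', <$in> ) );
    close $out;
    print "Converted $file -> $file.utf8\n";
}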

Finding invalid utf-8 data

After you've converted, it's still possible that you will run into corrupt topics. Several things can cause this:
  • An installed extension that ships topics with invalid encoding.
  • Files missed during the conversion.
  • Other manual changes to files that corrupted the data.

There is a handy Unix utility, isutf8, which checks whether files are correctly encoded. On Debian / Ubuntu distributions it's in the "moreutils" package.

sudo apt-get install moreutils
cd /var/www/foswiki
find -L . -name '*.txt' -exec isutf8 {} \;
./pub/System/FamFamFamSilkGeoSilkIcons/_readme.txt: line 2, char 1, byte offset 31: invalid UTF-8 code
perl -nlE 'say if /\P{Ascii}/' < ./pub/System/FamFamFamSilkGeoSilkIcons/_readme.txt
GeoSilk icon set by Rolando Pe�ate
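
If you can't install moreutils, a Perl one-liner can perform a similar (if less precisely reported) check using the core Encode module. A sketch:

perl -MEncode -ne 'my $c = $_; eval { decode("UTF-8", $c, Encode::FB_CROAK) }; print "line $.: $_" if $@' < /path/to/file/to/check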

Extensions

Foswiki Extensions, such as Plugins and Contribs, that include Perl and/or Javascript code need to be reviewed for compatibility.

Perl

In Foswiki 1.1.9 and earlier, all strings were converted to byte strings. Characters that didn't fit in 8 bits were encoded using the {Site}{CharSet} encoding. This meant that Extensions sometimes had to handle encoding and decoding strings themselves, especially when handling URL parameters. To do this, they will probably have used the CPAN Encode module - therefore any extension is likely to need work if it contains Encode:: anywhere in the Perl code. Be warned that this isn't the only way to encode and decode, just the most likely.

Foswiki 2.0 uses Perl character strings internally, which use Unicode to represent characters. All request parameters ($query->param() values) are already decoded to Perl characters, so extensions do not need to decode them. All HTML generated by Foswiki is generated using Perl characters, so extensions don't need to encode, either. Code that does either must be removed (or commented out, or conditionally disabled when $Foswiki::UNICODE is true).

You are most likely to encounter problems where an extension implements a REST handler. XmlHttpRequests from Javascript are always encoded using utf-8, irrespective of the encoding requested in the request. So under Foswiki 1.x a REST handler had to decode the utf-8 to characters, then re-encode to a byte string using the {Site}{CharSet}. In Foswiki 2.0 this is no longer required, so you may remove any code that does this (it won't break, but it's pointless).
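
For illustration, the legacy pattern and its conditional removal might look like the sketch below. The handler name and parameter name are hypothetical; $Foswiki::UNICODE and {Site}{CharSet} are as described above.

# Sketch of a hypothetical REST handler fragment. Under Foswiki 1.1
# the utf-8 bytes sent by the browser had to be re-encoded to the
# {Site}{CharSet}; under 2.0 the parameter is already a character
# string, so the conversion must be skipped.
sub restExample {
    my ( $session, $subject, $verb, $response ) = @_;
    my $text = Foswiki::Func::getRequestObject()->param('text');
    unless ($Foswiki::UNICODE) {
        # Legacy (1.1.x) path only
        require Encode;
        $text = Encode::encode( $Foswiki::cfg{Site}{CharSet},
            Encode::decode( 'utf-8', $text ) );
    }
    # ... work with $text ...
}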

Another place you might encounter problems is where an extension communicates with an external service. Again, Encode is likely to have been used to convert charsets.

If you don't see Encode and there's no other evidence of charset conversion, then you are probably good to go - but check your regular expressions first...

Regular Expressions

Another thing to be aware of is that Foswiki 2.0 matches regular expressions using Unicode. The definition of a WikiWord has changed to support all the different character sets around the world, with their differing definitions of 'upper' and 'lower' case. If you have regular expressions that match WikiWords and use something like this: /[A-Z]+[a-z]+[A-Z][A-Za-z0-9]+/, then you probably want to convert them to use the Foswiki internal regular expressions - see the Foswiki::Func API for more information. If you can't find the right expression there, you can use the POSIX character classes in your regular expressions - for example, [:upper:], [:lower:], [:alnum:] etc. Google for details.
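
For illustration, both approaches are sketched below. The sample text is hypothetical; getRegularExpression is the Foswiki::Func accessor for the internal expressions (see the Foswiki::Func documentation for the supported names).

use strict;
use warnings;
use utf8;    # the source itself contains a non-ASCII literal
use Foswiki::Func ();

binmode( STDOUT, ':encoding(utf-8)' );
my $text = 'A topic that links to ÜberWiki';

# Ask Foswiki for its own Unicode-aware WikiWord expression:
my $re = Foswiki::Func::getRegularExpression('wikiWordRegex');
print "found: $1\n" if $text =~ /($re)/;

# Or use POSIX character classes directly (note the double brackets):
print "found: $1\n" if $text =~ /([[:upper:]]+[[:lower:]]+[[:upper:]][[:alnum:]]*)/;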

print statements

If your extension does any work directly with files on disk, or sockets, or any other type of output stream, then it is likely to use the Perl print statement. By default, Perl opens all streams in their most raw, basic format - as byte streams. In Foswiki 1.1.9, because all internal strings were simply byte strings, you could safely print to any output stream without risk. In Foswiki 2.0, strings use Unicode characters, most of which don't fit in a byte. If you try to print a string containing such "wide" characters to a byte stream, you will see the dreaded Wide character in print error.

There is a range of solutions to this, but in general the simplest is to open all streams with a utf-8 encoding layer, for example open($fh, '>:encoding(utf-8)', $filename)

You will probably want to open any streams you read from with an encoding layer as well; open($fh, '<:encoding(utf-8)', $filename)

Foswiki 2.0 treats all internal strings as Perl character strings, and you want to convert your data to this format as early (or late) as possible.
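
Putting the pieces together, a minimal round trip looks like the sketch below (the file name is illustrative):

use strict;
use warnings;
use utf8;    # the source itself contains non-ASCII literals

# Write character data through a utf-8 encoding layer...
open( my $out, '>:encoding(utf-8)', '/tmp/example.txt' )
  or die "Cannot write: $!";
print $out "Grüße aus Köln\n";    # wide characters, no warning
close $out;

# ...and decode again on the way back in.
open( my $in, '<:encoding(utf-8)', '/tmp/example.txt' )
  or die "Cannot read: $!";
my $line = <$in>;                 # $line holds Perl characters
close $in;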

Note that in 2.0, STDOUT and STDERR are automatically assigned a utf-8 encoding layer (using binmode(STDOUT, ':encoding(utf-8)')); see perldoc perlunicode for details.

Responses

If your extension writes directly to the Foswiki::Response object (for example, REST handlers often do this) then you need to use the right method.
  • Use Foswiki::Response::body to output byte data.
  • Use Foswiki::Response::print to output strings that may contain Perl characters. These will automatically be encoded to utf-8.
In general you should use Foswiki::Response::print. The only case where you might want to use body is when you are outputting a complete prefabricated response, e.g. binary data.
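
A REST handler that returns text might therefore look like the sketch below; the handler name and message are hypothetical.

# Sketch: print() encodes character strings to utf-8 for us; body()
# would be reserved for raw bytes such as an image or a zip file.
sub restGreeting {
    my ( $session, $subject, $verb, $response ) = @_;
    $response->header( -type => 'text/plain', -charset => 'utf-8' );
    $response->print("Grüße aus Köln\n");    # auto-encoded to utf-8
    return undef;    # undef tells Foswiki the response is already complete
}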

Javascript

Javascript has always used Unicode internally, so most content is handled without problems. One thing to be aware of is that extensions may have used regular expressions built from ASCII character ranges. For example, a classic pattern used to match wikiwords is [A-Z]+[a-z][a-z0-9_]+[A-Z].*. This will still work for ASCII wikiwords, but in the Unicode Foswiki world it won't match wikiwords that use non-ASCII scripts.

Unfortunately, Javascript doesn't support the POSIX character classes used in Perl. However, JavascriptFiles/foswikiString.js contains a structure called foswiki.RE that defines Unicode character classes matching a subset of the POSIX standard:
POSIX character class    foswiki.RE equivalent
[:upper:]                foswiki.RE.upper
[:lower:]                foswiki.RE.lower
[:digit:]                foswiki.RE.digit
[:alnum:]                foswiki.RE.alnum

There is also a precompiled RegExp called foswiki.wikiword that matches a wikiword.