
Item14091: Foswiki page cache breaks UTF-8 characters, causing non-ASCII characters to become gibberish after caching.

Priority: Normal
Current State: Waiting for Feedback
Released In: n/a
Target Release: n/a
Applies To: Engine
Component: Cache
Branches:
Reported By: ShenZhouHong
Waiting For: PhilippeKehl
Last Change By: PhilippeKehl


Description

When the Foswiki page cache ({Cache}{Enabled}, under Tuning in Configure) is enabled, UTF-8 characters become gibberish, both in pages that use the %TRANSLATE% macro and in the default interface localisations themselves. This is most obvious with languages such as Russian, Bulgarian, or Chinese, but it also affects languages such as French.

I found out that this was due to the Foswiki page cache because only cached pages were affected - by loading pages using ?cache=refresh or visiting uncached pages, the user interface and wiki content loads correctly. However, when attempting to visit the page normally, broken content is presented once more.

broken-example.png
Example of the broken UTF-8 text that is presented when a cached page is retrieved

System Setup

I am using:
  • Ubuntu 16.04
  • Apache/2.4.18
  • MySQL 5.7.12-0ubuntu1
  • Foswiki 2.12

Debugging Process

At first, despite my strong assumption that the Foswiki page cache was responsible, I took steps to eliminate all other possible variables. I made sure to:

  • Disable CDN-level caching by setting Cloudflare to Development mode.
  • Disable apache2 mod_pagespeed in case page rewriting was the problem.
  • Test pages in Google Chrome incognito mode with the browser-level cache disabled.

Even after doing all of this, the above symptoms were still present. I narrowed the problem down to either how MySQL stores the cached pages, or how Foswiki enters data into the cache.

Debugging MySQL

I am using the Foswiki page cache with a locally run MySQL server. First, I checked the character set and collation of the cache database. They were latin1, not utf8, and I assumed this was the cause of the problem. I therefore stopped apache2, deleted the foswiki_cache_deps and foswiki_cache_pages tables in the database, and changed the character set and collation of the MySQL database to utf8 and utf8_general_ci.

I also added the following lines to the MySQL configuration file my.cnf, to make sure all tables created in future will be in utf8:

[client]
default-character-set=utf8


[mysql]
default-character-set=utf8

[mysqld] 
init_connect='SET collation_connection = utf8_unicode_ci' 
init_connect='SET NAMES utf8' 
character-set-server=utf8 
collation-server=utf8_unicode_ci 
skip-character-set-client-handshake

After doing all of this, I restarted the mysql client, and checked once more that the database is now in utf8. Then, I restarted apache2 and checked out the website.

The issue persists

However, despite explicitly configuring MySQL to use utf8, the issue persists! On the first visit, all non-ASCII characters render correctly, but any subsequent cached pages present broken gibberish rather than Cyrillic or Chinese characters. It is essential for my web to have non-Latin multilingual support. I've looked at the configuration section again, but there are no obvious options that would solve this problem. I think this is caused by the way Foswiki enters information into the cache databases: although my installation is UTF-8 by default (as it says in configure), it appears the cache still uses some other character encoding. How can I solve this issue?

Workaround

A workaround is to disable caching entirely. This is not a good solution, as it has negative performance implications. I hope we can work together and find a way to solve this problem.

-- ShenZhouHong - 11 Jun 2016

Oh, please tell me if you need any additional information to debug this problem. I'm extremely new to Perl and Foswiki, and this is my first time configuring something like this. I'll leave this page open and I'll be standing by if anything further is requested of me. It's my first time making a real bug report, so if I missed anything please let me know.

-- ShenZhouHong - 11 Jun 2016

For testing purposes (taken from the Gutenberg Project EBook of "Journey to the West"):

第一回 靈根育孕源流出 心性修持大道生

詩曰: 混沌未分天地亂,茫茫渺渺無人見。 自從盤古破鴻濛,開闢從茲清濁辨。 覆載群生仰至仁,發明萬物皆成善。 欲知造化會元功,須看西遊釋厄傳。

-- MarkusUeberall - 11 Jun 2016

I've added the text here. I've currently turned off caching so it displays perfectly fine - but when caching is turned on it becomes gibberish.

https://csc.uwc.wiki/Sandbox/UTF8Test

This is what the text becomes once caching is turned on:

第一回 éˆæ ¹è‚²å­•æºæµå‡ºã€€å¿ƒæ€§ä¿®æŒå¤§é“ç”Ÿ
詩曰: 混沌未分天地亂,茫茫渺渺無人見。 è‡ªå¾žç›¤å¤ç ´é´»æ¿›ï¼Œé–‹é—¢å¾žèŒ²æ¸…æ¿è¾¨ã€‚ 覆載群生仰至仁,發明萬物皆成善。 æ¬²çŸ¥é€ åŒ–æœƒå…ƒåŠŸï¼Œé ˆçœ‹è¥¿éŠé‡‹åŽ„å‚³ã€‚
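The pattern above is consistent with UTF-8 bytes being treated as single-byte (Latin-1) characters and then encoded to UTF-8 a second time. The following is only a minimal Perl sketch of that effect, not a diagnosis of this installation. It uses the core Encode module; the helper name double_encode is made up for illustration and is not part of Foswiki.

```perl
use strict;
use warnings;
use utf8;                        # string literals in this file are UTF-8
use Encode qw(encode decode);

# Hypothetical helper: simulate what happens when already-encoded UTF-8
# bytes are mistaken for Latin-1 characters and encoded once more.
sub double_encode {
    my ($text) = @_;
    my $bytes   = encode( 'UTF-8', $text );          # correct, single encoding
    my $misread = decode( 'ISO-8859-1', $bytes );    # every byte becomes one character
    return encode( 'UTF-8', $misread );              # second, unwanted encoding
}

my $original = "äöü";
my $garbled  = double_encode($original);

binmode( STDOUT, ':raw' );       # print the bytes exactly as they are
print $garbled, "\n";            # a UTF-8 terminal shows mojibake, not äöü
printf "%d bytes instead of %d\n",
  length($garbled), length( encode( 'UTF-8', $original ) );
```

Each non-ASCII byte of the correctly encoded text turns into two bytes, so the three umlauts grow from 6 bytes to 12.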

I've tried caching using the SQLite cache implementation as well, and the issue persists. It appears to be a problem with how Foswiki writes data to the cache itself. Internationalization support is one of Foswiki's priorities, and this bug should be fixed in order to allow proper internationalization.

-- ShenZhouHong - 11 Jun 2016

I have seen the same with the default (SQLite, I think) caching store and UTF-8 setting with western non-ASCII characters (öäüèčš etc.). It hasn't been annoying enough for me to dig into it yet.

-- PhilippeKehl - 11 Jun 2016

Glad to see that this bug can be replicated. I hope a solution can be found for it soon.

-- ShenZhouHong - 11 Jun 2016

Two short notes:
  1. On foswiki.org (f.o), foswiki_cache tables are still latin1 based, and as you can see below, caching this page still works.
  2. From the above, it's not clear whether the DB cache tables were recreated after changing the encoding; AFAIK, existing tables are not converted automatically when changing the DB system defaults.

-- MarkusUeberall - 11 Jun 2016

I've deleted the foswiki_cache_pages and foswiki_cache_deps tables after changing the encoding, and before starting apache. The tables were automatically created again, but I haven't deleted the whole database.

-- ShenZhouHong - 11 Jun 2016

Michael pointed out on IRC that the cache databases are only the indices to the cache, and are used to find and invalidate cache entries when topics are updated. The actual cached pages are written to the directory configured in $Foswiki::cfg{Cache}{RootDir}, typically the working/cache directory. So encoding issues in the database won't have anything at all to do with the cached data.

The cache files are named using a hashed filename, for example d538d946e4202519b18dfeb2342b97ae. If your cache encoding is corrupted, it's something related to writing/reading these files.
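One way to check such a file is to read it in raw mode and test whether its bytes look like UTF-8 that was encoded twice. The sketch below is a heuristic under stated assumptions: the helper name looks_double_encoded is hypothetical, and the sample strings stand in for the contents of a real cache file, which would be read with open( my $fh, '<:raw', $path ) for whatever $path applies on your system.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if $bytes looks like UTF-8 that was encoded
# twice. Heuristic only; plain ASCII and singly encoded text return false.
sub looks_double_encoded {
    my ($bytes) = @_;
    my $strict = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    # The first decode must succeed, otherwise this is not UTF-8 at all.
    my $once = eval { decode( 'UTF-8', $bytes, $strict ) };
    return 0 unless defined $once;

    # After one decode, double-encoded data only holds code points up to
    # U+00FF, and at least one of them is above the ASCII range.
    return 0 if $once =~ /[^\x00-\xFF]/;
    return 0 unless $once =~ /[\x80-\xFF]/;

    # Turn those code points back into bytes and try a second strict decode.
    my $inner = encode( 'ISO-8859-1', $once );
    my $twice = eval { decode( 'UTF-8', $inner, $strict ) };
    return defined $twice ? 1 : 0;
}

# Sample data standing in for the raw contents of a cache file.
my $single = encode( 'UTF-8', "\x{E4}\x{F6}\x{FC}" );               # "äöü" encoded once
my $double = encode( 'UTF-8', decode( 'ISO-8859-1', $single ) );    # encoded twice

printf "single: %s\n", looks_double_encoded($single) ? 'suspicious' : 'ok';
printf "double: %s\n", looks_double_encoded($double) ? 'suspicious' : 'ok';
```

A positive result only says the bytes on disk were encoded twice; it does not say whether the reader of the file compensates for that.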

-- GeorgeClark - 12 Jun 2016

The cache file is written in lib/Foswiki/PageCache/DBI.pm
    #writeDebug("saving data of $webTopic into $fileName");
    open( $FILE, '>:encoding(utf-8)', $fileName )
      or die "Can't create file $fileName - $!\n";
    print $FILE $variation->{data};
    close($FILE);
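
What this open() call writes depends on whether the data is a character string or a byte string that is already UTF-8. Which of the two $variation->{data} holds at this point is not established here, so the following is only a sketch of the Perl behaviour. It writes to an in-memory buffer instead of $fileName, and the helper name write_with_utf8_layer is hypothetical.

```perl
use strict;
use warnings;
use utf8;                        # string literals in this file are UTF-8
use Encode qw(encode);

# Hypothetical helper: print $data through the same layer as the snippet
# above, but into an in-memory buffer, and return the bytes produced.
sub write_with_utf8_layer {
    my ($data) = @_;
    my $buffer = '';
    open( my $fh, '>:encoding(utf-8)', \$buffer )
      or die "Can't open in-memory file - $!\n";
    print {$fh} $data;
    close($fh);
    return $buffer;
}

my $chars = "äöü";                        # character string, 3 characters
my $bytes = encode( 'UTF-8', $chars );    # byte string, already UTF-8, 6 bytes

# Characters are encoded once: 6 bytes come out.
printf "characters in: %d bytes out\n", length( write_with_utf8_layer($chars) );

# Bytes are treated as Latin-1 characters and encoded again: 12 bytes come out.
printf "bytes in:      %d bytes out\n", length( write_with_utf8_layer($bytes) );
```

So the layer is correct for character data and double-encodes data that was already encoded before the print.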

-- GeorgeClark - 12 Jun 2016

Shen, I was trying to register on your site in order to reproduce the error. However, the registration code did not make it through to me. The email server says
<www-data@dilijan>: Sender address rejected: need fully-qualified address

This is just a side note.

I then tried to reproduce the error on my installation with the above test text but was unable to get any encoding errors. Could you add some more info on which Perl version you are using and what your settings in LocalSite.cfg are? It would be best to attach it here, with any private information removed first, of course.

Please also make sure that no Cloudflare CDN or mod_pagespeed is activated. Whatever module might get in the middle: please disable it so that the raw results are delivered by Foswiki.

-- MichaelDaum - 13 Jun 2016

I have the problem on my production Foswiki but not in my dev-Foswiki. The only difference is that the first runs in mod_perl and the other as CGI. Otherwise it's the same host, httpd, Perl etc. In both instances the string "äöüč" appears as "äöü" in the cache file. Why's that? "äöü" is also what I see in the page mod_perl-served from the cache (Content-type headers etc. look okay). The topic.txt file has the correct "äöüč" in both installations.

I have {HttpCompress} disabled because I cannot get that to work in the mod_perl instance (I get weird Firefox "content" or "decoding" error pages, or something like that). It does work on the CGI installation. Maybe that's related? Some encoding weirdness messing up the gzipped data?

The problem does not occur on pages that have a <dirtyarea>.

Any ideas where to look?

-- PhilippeKehl - 17 Jun 2016

I'm wondering if this is somehow related to API differences in mod_perl vs. plain old CGI / FastCGI. We have many sites running with fcgi / FastCGI without issues including foswiki.org.

-- GeorgeClark - 18 Jun 2016

Michael, site registration is currently limited to holders of an @uwcchina.org address. You are right - the email doesn't seem to work either. I'm planning to rebuild the entire Foswiki site in light of the trouble I am facing, in the hope that I can reach a solution.

Philippe, I have the exact same content encoding problem when I turn on {HttpCompress}. It also disappears when pages are loaded with the ?cache=refresh URL parameter.

George - I'll try to perform a clean reinstall of the site with mod_perl rather than the CGI engine - since I am using FastCGI right now. Perhaps that is the root of the issues?

-- ShenZhouHong - 22 Jun 2016

Actually we are aware of several sites including foswiki.org successfully using FastCGI without any character set issues, so I doubt that switching to mod_perl will help. I'm not sure where to go now with this.

-- GeorgeClark - 23 Jun 2016

For those who encounter this problem: have you verified that all locale-specific settings are correct? (See "How to set up a clean UTF-8 environment in Linux" and "Unicode-processing issues in Perl and how to cope with it".)

-- MarkusUeberall - 27 Jun 2016

I have the problem with mod_perl, but not with CGI.

My system locales should be alright. The system default is en_GB.UTF-8. "locale -a" says that that is available. Also Perl should be alright (LC_ALL=en_GB.UTF-8 perl -e 'print "hello\n";' doesn't complain about a missing locale or so, which it would if the locale wasn't available).

I've now tried various combinations of LANG and LC_ALL settings in /etc/apache/envvars and of the {Site}{Locale}, {UseLocale} and {Store}{Encoding} settings without any success.

I'll try getting mod_perl with cache compression working. Somehow I'm unable to prevent Apache's mod_deflate from compressing the output again (which is why I have {HttpCompress} disabled).

Or I'll try fcgi.

-- PhilippeKehl - 01 Jul 2016

mod_perl is really not recommended when you also want performance. mod_deflate is a problem as well: Foswiki already caches compressed pages, so there is no need to compress the page again on every new request. I'd highly recommend fcgi. Sure, even if that fixes your encoding issues, we still wouldn't really know what caused them...meh

-- MichaelDaum - 01 Jul 2016

I was able to fix it for me (mod_perl, {HttpCompress} off) by removing the ":encoding(utf-8)" from the open() call in Foswiki::PageCache::DBI::setPageVariation(). I.e. I changed

open( $FILE, '>:encoding(utf-8)', $fileName )

to

open( $FILE, '>', $fileName )

And now it works.

Now also the working/cache/..... file shows the original äöü content instead of the garbled version. That somehow makes more sense to me. However, my CGI installation of foswiki has the garbled version in the cache file but all is fine and öäü displays correctly. I'm confused.

I haven't yet traced where $variation->{data} is filled in or what it looks like.

-- PhilippeKehl - 01 Jul 2016

I'll try fcgi sometime.

-- PhilippeKehl - 01 Jul 2016

It works fine with fcgi (and {HttpCompress} off, as I still cannot get that to work with Apache -- it insists on mod_deflate-ing the content).

I'm still puzzled by the garbled content in the cache file (e.g. "äöü" instead of the original "äöüč"), but the cached pages now display correctly. Why would the cache not store the "raw" contents?

I'm confused by all the ":encoding(utf-8)" vs ":raw" stuff. I never had to use any of those in the perl/CGI apps I wrote (which would store strings in files and databases).
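
At the Perl level, the difference can be sketched as follows. Bytes that are already UTF-8 and are written through ":encoding(utf-8)" look garbled on disk, but they come back intact if they are read through the same layer; they stay garbled only if they are read back raw. Whether this is what differs between the mod_perl and CGI code paths has not been confirmed here. The helper name round_trip is hypothetical, and an in-memory buffer stands in for the cache file.

```perl
use strict;
use warnings;
use utf8;                        # string literals in this file are UTF-8
use Encode qw(encode);

# Hypothetical helper: write $data to an in-memory file with one layer,
# read it back with another, and return what the reader sees.
sub round_trip {
    my ( $write_layer, $read_layer, $data ) = @_;
    my $storage = '';
    open( my $out, ">$write_layer", \$storage ) or die "write failed - $!\n";
    print {$out} $data;
    close($out);
    open( my $in, "<$read_layer", \$storage ) or die "read failed - $!\n";
    local $/;                    # slurp the whole buffer
    my $result = <$in>;
    close($in);
    return $result;
}

my $page = encode( 'UTF-8', "äöüč" );    # a page that is already UTF-8 bytes

my %cases = (
    'raw write, raw read'     => [ ':raw',             ':raw' ],
    'utf-8 write, utf-8 read' => [ ':encoding(utf-8)', ':encoding(utf-8)' ],
    'utf-8 write, raw read'   => [ ':encoding(utf-8)', ':raw' ],
);
for my $name ( sort keys %cases ) {
    my $got = round_trip( @{ $cases{$name} }, $page );
    printf "%-24s %s\n", $name, $got eq $page ? 'intact' : 'garbled';
}
```

The first two combinations report intact and the mismatched one reports garbled, which would be one way for a cache file to look wrong in an editor while the page still displays correctly.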

-- PhilippeKehl - 25 Sep 2016
 

Topic attachments
Attachment | Size | Date | Who | Comment
broken-example.png | 252 K | 11 Jun 2016 - 06:16 | ShenZhouHong | Example of the broken utf-8 characters that become gibberish when a cached page is served
Topic revision: r19 - 25 Sep 2016, PhilippeKehl - This page was cached on 20 Sep 2017 - 07:14.

The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License