Item11755: compare breaks charset encodings

pencil
Priority: Urgent
Current State: Closed
Released In: 1.1.6
Target Release: patch
Applies To: Extension
Component: CompareRevisionsAddOn
Branches: Release01x01 trunk
Reported By: MichaelDaum
Waiting For:
Last Change By: GeorgeClark
The plugin was using {UseLocale} to decide whether or not to decode the text before feeding it to the tree parser.

The problem only shows up using the compare action, not using rdiff

-- MichaelDaum - 13 Apr 2012

The fix (distro:70593672ab20) corrects some encoded characters but breaks others. In fact, I seem to have solved this problem more thoroughly by reverting that fix and, instead, doing something else.

A bit more background: HTML-Tree seems to like to assume that everything under the sun is Latin1. It will, essentially, look at bytes and treat them as characters.

The interesting part is that some of the elements that are treated are passed straight through that (i.e. after Michael's decode call they are now broken ISO8859-1 characters in our UTF-8 stream) and some others are magically encoded into entities. Here's the fun part: with Michael's fix that part works (the character value now happens to be equivalent to the corresponding ISO8859-1 byte value, so the right entity comes out). But now it outputs the parts that are not entity-encoded as broken raw ISO8859-1 bytes.

If I remove Michael's fix it's exactly the other way around: the raw values are now properly UTF-8-encoded, but the entities are made from the individual bytes of the UTF-8-encoded character, so, for example, ü turns into ü.

My solution is to get rid of Michael's decode statement and, instead, tell HTML::Entity not to convert high-bit bytes into entities, like so:

diff --git a/lib/Foswiki/Contrib/CompareRevisionsAddOn/Compare.pm b/lib/Foswiki/Contrib/CompareRevisionsAddOn/Compare.pm
index ffae467..4c2b412 100644
--- a/lib/Foswiki/Contrib/CompareRevisionsAddOn/Compare.pm
+++ b/lib/Foswiki/Contrib/CompareRevisionsAddOn/Compare.pm
@@ -447,7 +442,7 @@ sub _findSubChanges {
 sub _elementHash {
 
     # Purpose: Stringify HTML ELement for comparison in Algorithm::Diff
-    my $text = ref( $_[0] ) eq $HTMLElement ? $_[0]->as_HTML : "$_[0]";
+    my $text = ref( $_[0] ) eq $HTMLElement ? $_[0]->as_HTML('<>&') : "$_[0]";
 
     # Strip leading & trailing blanks in text and paragraphs
     $text =~ s/^\s*//;
@@ -538,7 +533,7 @@ sub _getTextWithClass {
 
     if ( ref($element) eq $HTMLElement ) {
         _addClass( $element, $class ) if $class;
-        return $element->as_HTML( undef, undef, {} );
+        return $element->as_HTML( '<>&', undef, {} );
     }
     elsif ($class) {
         return '<span class="' . $class . '">' . $element . '</span>';

-- JanKrueger - 19 Jul 2012

Michael, can you look at JanKrueger's fix and decide if his approach should go ahead? Also need to decide if we want to merge this into 1.1.6. With his feedback, I don't think it should go in as-is to 1.1.6 anyway.

-- GeorgeClark - 17 Oct 2012

Jan, George, please take over the initiative and the fix. I can't check this at the moment. Thanks.

-- MichaelDaum - 17 Oct 2012

Jan, please check in your changes for 1.2. We'll decide on 1.1.6 once they have a bit of soak time on 1.2.

-- GeorgeClark - 24 Oct 2012

This still looks strange on an utf-8 foswiki: http://trunk.foswiki.org/bin/compare/Sandbox/CompareEncodingTest?type=last

Note that while the unicode chars diff all right, the html entities don't.

CompareSnap1.png

-- MichaelDaum - 30 Oct 2012

Please look again. When I go to that same URL it looks just like it should.

PS. Okay, no. I misunderstood. Copying that page to a UTF-8 wiki does indeed break it. Apparently something transparently decodes the entities and uses Perl's default I/O character encoding, which is ISO-8859-1. So this will need some more work...

-- JanKrueger - 30 Oct 2012

I've disabled all entity processing (on trunk); seems to fix the issue Michael found, too. Moving this forward to NM state.

-- JanKrueger - 30 Oct 2012

Looks like distro:2eb64fb4013a introduced a dependency to HTML:Treebuilder >= 4.0. the attribute "no_expand_entities" was first available in that version. Currently the dependencies state that version >= 3.23 are required. Unfortunately e.g CentOS 6.3 comes with version 3.23.

-- FrankHoellering - 08 Jan 2013

I opened Item12337 for this issue. Are you able to install a newer version of Treebuilder? I see that Centos 6.3 is not all that old, but HTML::TreeBuilder 3.23 dates from 2010.

I made the task a release blocker for 1.1.7, but I'd really prefer not to delay 1.1.7 any further.

-- GeorgeClark - 09 Jan 2013
 

Topic revision: r17 - 09 Jan 2013, GeorgeClark
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy