Foswiki on GitHub is open for business! Next release meeting: Monday October 13, 1300Z

Item11814: HTML2TML breaks hrefs with URI encoding

Priority: CurrentState: AppliesTo: Component: WaitingFor:
Normal Closed Extension WysiwygPlugin  
Suppose that a user creates a weird link in a WYSIWYG editor, such as:

<a href="mailto:a@example.org?subject=Hi&body=Hi%21%0A%0ABye%21">hi</a>

When HTML2TML parses the link, it removes the URI encoding, even if it's not actually necessary (it probably is necessary for converting Web.Topic links, but not for arbitrary URLs). In the case above, the link totally breaks and will no longer be properly recognized by Foswiki (and the WYSIWYG editor is probably going to be confused, too):

[[mailto:a@example.org/subject=Hi&body=Hi!

Bye!][hi]]

Potential fix (against lib/Foswiki/Plugins/WysiwygEditor/HTML2TML/Node.pm):

--- Node.pm     2012-04-27 05:24:24.000000000 +0200
+++ Node_new.pm 2012-05-03 13:06:50.735491554 +0200
@@ -1457,7 +1457,7 @@
             return ( 0, $WC::CHECK1 . $nop . $text . $WC::CHECK2 );
         }
         if ( $text eq $href ) {
-            return ( 0, $WC::CHECKw . '[' . $nop . '[' . $href . ']]' );
+            return ( 0, $WC::CHECKw . '[' . $nop . '[' . $this->{attrs}{href} . ']]' );
         }
 
         # we must quote square brackets in [[...][...]] notation
@@ -1467,7 +1467,7 @@
         $href =~ s/[]]/%5D/g;
 
         return ( 0,
-            $WC::CHECKw . '[' . $nop . '[' . $href . '][' . $text . ']]' );
+            $WC::CHECKw . '[' . $nop . '[' . $this->{attrs}{href} . '][' . $text . ']]' );
     }
     elsif ( $this->{attrs}->{name} ) {
 

-- JanKrueger - 03 May 2012

Thanks for doing the debugging and suggesting a fix. It looks good. I've checked in the fix and a unit test to the trunk version of WysiwygPlugin.

-- GeorgeClark - 05 May 2012

Oops... Not quite. If the URI contains [ or ], they need to be encoded, or the TML markup is broken after save.

-- GeorgeClark - 05 May 2012

The above fix breaks the round-trip of the %ATTACHURL% macros in links. See Item11862. The $href variable goes through the conversion back to the %ATTACHURL macro, where as $this->{attrs}{href} does not.

I don't have any examples of where the URI encoding needs to be removed. The unit tests all pass with the code to remove URI encoding commented out.

Crawford, can you please look at this, particularly the most recent checkin to WysiwygPlugin/Node.pm - where I've commented out the URI Decode.

# SMELL:  Item11814 - this corrupts URL's that must be encoded,  ex. embedded Newline
# decode URL params in the href
#$href =~ s/%([0-9A-F]{2})/chr(hex($1))/gei;

All of the unit tests pass. I'm wondering what I've broken now.

-- GeorgeClark - 15 May 2012

Crawford, did some digging. The decode of URL params was in the initial import. Digging back into TWiki, it arrives in twiki changset 14723 and was a fix for TWiki bug Item4535. I still don't understand why the URL needs decoding, unless it was due to some TMCE strangness in older versions.

I've tried simple Topic name, Web.TopicName and Web/Subweb.TopicName ... and they all seem to work correctly in trunk Wysiwyg/TMCE.

I can make the decode conditional, and skip it for href's that start with a known protocol. That preserves the fix, and it doesn't change the unit test results. I'd like to come up with a reason for it and a unit test if this is needed.

        if ( !$href =~ /${WC::PROTOCOL}[^?]*$/ ) {
            $href =~ s/%([0-9A-F]{2})/chr(hex($1))/gei;
        }

-- GeorgeClark - 15 May 2012
 
Topic revision: r23 - 02 Dec 2012, GeorgeClark
 
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. see CopyrightStatement. Creative Commons License