Item817: Malformed header anchors if header contains non A-Za-z0-9_ characters - advanced solution

Priority: Urgent
CurrentState: Closed
AppliesTo: Engine
Component: I18N
WaitingFor:

Overview

See also Tasks.Item1096, Tasks.Item5689, Tasks.Item1416

Description of the problem

If useLocale is turned on and a header contains an umlaut, e.g.
---++ Übung
then the rendered HTML looks like this:
<h2><a name="%dcbung"></a><a name="_dcbung"></a> Übung </h2>
Neither is a valid name for an anchor.

Diagnosis

In Render.pm, the function makeAnchorName contains:
    if ( !defined( $Foswiki::cfg{Site}{CharSet} )
        || $Foswiki::cfg{Site}{CharSet} =~ /^iso-?8859-?/i )
    {
        $anchorName =~ s/[^$Foswiki::regex{mixedAlphaNum}]+/_/g;
    }   
    ...
    return Foswiki::urlEncode( $anchorName );
If useLocale is in effect, then mixedAlphaNum treats "Ü" as a letter; see the source code of Foswiki.pm:
    # Build up character class components for use in regexes.
    # Depends on locale mode and Perl version, and finally on
    # whether locale-based regexes are turned off.
    if (   not $Foswiki::cfg{UseLocale}
        or $] < 5.006
        or not $Foswiki::cfg{Site}{LocaleRegexes} )
    ...
So the "Ü" is not replaced by "_" in the substitution, but urlEncode takes a different view and encodes "Ü" as "%dc".

The "double anchoring" shown above (two <a name> elements for one header) is also a result of this; see the method _makeAnchorHeading for how it happens.
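The mismatch can be reproduced outside Foswiki. The following is a minimal sketch, assuming an ISO-8859-1 site: `$locale_alnum` is a hypothetical stand-in for `$Foswiki::regex{mixedAlphaNum}` under a locale that counts Latin-1 letters as alphabetic, and `url_encode` is a hypothetical stand-in for `Foswiki::urlEncode`, not its actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $Foswiki::regex{mixedAlphaNum} when the locale
# counts Latin-1 letters such as U-umlaut (byte 0xDC) as alphabetic.
my $locale_alnum = 'A-Za-z0-9\xC0-\xD6\xD8-\xF6\xF8-\xFF';

# Hypothetical stand-in for Foswiki::urlEncode: percent-encode every
# character outside a conservative ASCII set.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02x', ord($1))/ge;
    return $text;
}

sub make_anchor_name {
    my ($heading) = @_;
    my $anchor = $heading;
    $anchor =~ s/[^$locale_alnum]+/_/g;    # U-umlaut survives this step...
    return url_encode($anchor);            # ...and is percent-encoded here
}

print make_anchor_name("\xDCbung"), "\n";    # prints %dcbung
```

The two steps disagree about what counts as a safe character, which is how "%" ends up in the anchor name.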

-- ChristianLudwig - 20 Jan 2009

Possible Solutions

  • Rewrite anchors to be strictly ASCII and sequential

Consider the following headers
---+ Header 1
---+ Header 2 with Ümlaut
---+ &#31532;&#19977; Header
---++ Header 3.1

The anchors could be something like,
<a name="fw-header1">
<a name="fw-header2">
<a name="fw-header3">
<a name="fw-header3.1">

This way the anchors are always safe to use in a URI. We could also randomise the name using a fixed set of characters, to avoid depending on any naming convention and to reduce potential clashes.
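As an illustration of this proposal, here is a minimal sketch that derives such names from heading depths alone (1 for ---+, 2 for ---++, and so on). The `fw-header` prefix comes from the example above; the function name `sequential_anchors` and the exact numbering rule are assumptions, not existing Foswiki code.

```perl
use strict;
use warnings;

# Turn a list of heading depths into sequential, ASCII-only anchor names.
sub sequential_anchors {
    my @depths = @_;
    my @counters;    # one counter per heading level currently open
    my @anchors;
    for my $depth (@depths) {
        # Drop counters that belong to deeper levels than this heading.
        splice( @counters, $depth ) if @counters > $depth;
        # Open any skipped levels at 0.
        push @counters, 0 while @counters < $depth;
        $counters[-1]++;
        push @anchors, 'fw-header' . join( '.', @counters );
    }
    return @anchors;
}

# Depths of the four example headers above.
print join( "\n", sequential_anchors( 1, 1, 1, 2 ) ), "\n";
```

For the four example headers this yields fw-header1, fw-header2, fw-header3 and fw-header3.1. Note that such names change whenever a heading is inserted or removed above them.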

Comments

See also Item1096 in utf-8 context.

-- ChristianLudwig - 19 Feb 2009

Some remarks on the question "What characters and/or (escape) sequences are allowed in HTML anchors?"
  1. HTML 4 A-element: case-sensitive CDATA; for the allowed characters see HTML 4 type name: must begin with a character in [A-Za-z], may be followed by characters in [A-Za-z0-9_.:-].
  2. HTML 4 Syntax of anchor names: "Anchor names should be restricted to ASCII characters. Please consult the appendix for more information about non-ASCII characters in URI attribute values." That appendix contains a recommendation for user agents: UTF-8 and the %HH hexadecimal notation are allowed.

The first option avoids many problems, because some browsers have difficulties with the above recommendation.
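The first rule is easy to check mechanically. A minimal sketch follows; the function name `is_valid_html4_name` is made up for this illustration.

```perl
use strict;
use warnings;

# HTML 4 NAME token: a letter first, then letters, digits, "_", ".", ":", "-".
sub is_valid_html4_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z][A-Za-z0-9_.:-]*\z/ ? 1 : 0;
}

for my $name ( 'Header_1', '%dcbung', '_dcbung', 'fw-header3.1' ) {
    printf "%-14s %s\n", $name,
      is_valid_html4_name($name) ? 'ok' : 'invalid';
}
```

Both anchors from the problem description fail this check: "%dcbung" contains "%", and "_dcbung" does not start with a letter.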

-- ChristianLudwig - 19 Feb 2009

See Item5689 for a related issue with anchor in [[ ... ][ ... ]] links.

-- ChristianLudwig - 25 Feb 2009

See also Item1416.

-- ChristianLudwig - 07 Apr 2009

I believe this is fixed on trunk, via Item1448. Please give it a try and let me know.

-- CrawfordCurrie - 15 Apr 2009

I tested trunk revision 3452: there are no more "special" characters in the anchors. But the anchors are very long even for short headlines, e.g. for the headline "Аналитика и комментарии" I got the anchor
_38_351040 _59_38_351085 _59_38_351072 _59_38_351083 _59_38_351080 _59_38_351090 _59_38_351080 _59_38_351082 _59_38_351072 _59_32_38_351080 _59_32_38_351082 _59_38_351086 _59_38_351084 _59_38_351084 _59_38_351077 _59_38_351085 _59_38_351090 _59_38_351072 _59_38_351088 _59_38_351080 _59_38_351080 _59
(I've manually inserted spaces for layout)

And the anchor starts with an underscore "_"!

-- ChristianLudwig - 16 Apr 2009

That's correct. Unfortunately:
  • Anchors have to be unique
  • Anchors may only contain characters in the range [A-Za-z0-9:_.-]
Because of these constraints we need an encoding into ASCII that is unique over the whole range of Unicode characters. Any such encoding is going to be lengthy; if you look at this topic 'raw' you will see that your "short" heading is actually quite long as stored (&#1040;&#1085;&#1072;&#1083;&#1080;&#1090;&#1080;&#1082;&#1072; &#1080; &#1082;&#1086;&#1084;&#1084;&#1077;&#1085;&#1090;&#1072;&#1088;&#1080;&#1080; to be precise). Fortunately anchors (fragment identifiers) are not limited in length.
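The anchor quoted in the previous comment is consistent with a simple reading: each character of the stored entity text that is not an ASCII letter or digit is replaced by "_" followed by its decimal character code. The sketch below reproduces the quoted output under that assumption; it is inferred from the output alone and is not taken from the trunk source. The function name `observed_encoding` is made up.

```perl
use strict;
use warnings;

# Assumed reading of the quoted anchor: every character outside [A-Za-z0-9]
# becomes "_" followed by its decimal character code.
sub observed_encoding {
    my ($entity_text) = @_;
    $entity_text =~ s/([^A-Za-z0-9])/'_' . ord($1)/ge;
    return $entity_text;
}

print observed_encoding('&#1040;'), "\n";            # prints _38_351040_59
print observed_encoding('&#1072; &#1080;'), "\n";    # prints _38_351072_59_32_38_351080_59
```

Under this reading each Cyrillic letter costs 13 characters in the anchor, which accounts for the length reported above.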

However, you are correct that anchor names must start with an alphabetic character; that has been fixed. The HTML 4 specification says:
"ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")."
If you can suggest an alternative encoding that results in shorter, but still unique, fragment identifiers, then I'm all ears.

-- CrawfordCurrie - 17 Apr 2009

Hm. What about base64 encoding?
echo "Аналитика и комментарии" | base64
0JDQvdCw0LvQuNGC0LjQutCwINC4INC60L7QvNC80LXQvdGC0LDRgNC40LgK

That is a little shorter. But perhaps I've missed some other constraint on the encoding that makes base64 unsuitable.

Or is it really important to encode the string "&#1040;" instead of the character (number 1040): "А"?

-- ChristianLudwig - 17 Apr 2009

My goals for the encoding were:
  1. to make it unique, so two different headings always produce different anchors
  2. to make it legal, as per http://www.w3.org/TR/html401/struct/links.html (note that your base64 encoding is not)
  3. to make it reversible - i.e. if you have an encoded anchor, be able to reverse the encoding to recover the original heading
  4. to make it at least vaguely readable when used with iso-8859-1*.
(1) and (2) are obvious, and are the key requirements the old encoding fails. (3) is not so obvious. The reason for this requirement is so that a client working only with the anchor name is able to generate an AJAX request that can target a specific anchor in the original topic. You might think that if anchors are unique it might be possible to run the encoding algorithm again and compare the encoded unique anchors. Unfortunately the rendering process is quite complex, and this is easier said than done. So I decided to make the encoding reversible.
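To make the four goals concrete, here is a minimal sketch of one encoding that satisfies them. It is an illustration only and is not the encoding committed for this item: the `_HEX_` escape form, the `A:` prefix and the function names are all assumptions.

```perl
use strict;
use warnings;

# ASCII letters and digits pass through unchanged (readable); every other
# character becomes _HEX_, its code point in hex between underscores.
# If the result does not start with a letter, "A:" is prepended. A literal
# ":" is always escaped, so a leading "A:" can only be the prefix.
sub encode_anchor {
    my ($heading) = @_;
    my $anchor = $heading;
    $anchor =~ s/([^A-Za-z0-9])/sprintf('_%X_', ord($1))/ge;
    $anchor = "A:$anchor" unless $anchor =~ /\A[A-Za-z]/;
    return $anchor;
}

# Reverse the encoding: strip the optional prefix, then expand each escape.
sub decode_anchor {
    my ($anchor) = @_;
    my $heading = $anchor;
    $heading =~ s/\AA://;
    $heading =~ s/_([0-9A-F]+)_/chr(hex($1))/ge;
    return $heading;
}

print encode_anchor('fred'), "\n";          # prints fred
print encode_anchor("\x{DC}bung"), "\n";    # prints A:_DC_bung
```

Because each escape is delimited on both sides, decoding is unambiguous, and two different headings can never produce the same anchor. The result uses only letters, digits, "_" and ":", so it stays within the HTML 4 name rule.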

(4) is also not so obvious. It's required so that in extremis a human can write an anchor name in an href. echo fred | base64 gives ZnJlZAo=, which fails the vague readability test (though echo fred | base64 | base64 -d may be workable for extremely well educated users).
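The legality point (2) can be checked for base64 as well. The sketch below uses the core modules Encode and MIME::Base64 to reproduce what `echo ... | base64` does, including the newline that echo appends, and tests the result against the HTML 4 name rule quoted earlier in this topic. The helper names are made up for this illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# HTML 4 NAME token rule, as quoted earlier in this topic.
sub is_valid_html4_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z][A-Za-z0-9_.:-]*\z/ ? 1 : 0;
}

# Base64 of the UTF-8 bytes, plus the newline that `echo` appends.
sub base64_anchor {
    my ($heading) = @_;
    return encode_base64( encode( 'UTF-8', "$heading\n" ), '' );
}

for my $heading ( 'Аналитика и комментарии', 'fred' ) {
    my $anchor = base64_anchor($heading);
    printf "%s %s\n", $anchor,
      is_valid_html4_name($anchor) ? 'legal' : 'not legal';
}
```

Both results are rejected: the first starts with a digit, and the second ends in the "=" padding character. The base64 alphabet also includes "+" and "/", neither of which is allowed in a name.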

-- CrawfordCurrie - 18 Apr 2009

ItemTemplate

Summary: Malformed header anchors if header contains non A-Za-z0-9_ characters - advanced solution
ReportedBy: ChristianLudwig
Codebase: 1.0.0
SVN Range: Foswiki-1.0.0, Thu, 08 Jan 2009, build 1878
AppliesTo: Engine
Component: I18N
Priority: Urgent
CurrentState: Closed
WaitingFor:
Checkins: Foswikirev:3468 Foswikirev:3472 Foswikirev:3473
TargetRelease: minor
ReleasedIn: 1.1.0
Topic revision: r26 - 04 Oct 2010, KennethLavrsen