AddEncodingToTXTFiles

Add encoding to .TXT files

Currently .txt files are written by the back-end stores (RCSStoreContrib and PlainFileStoreContrib) without any encoding information. There's an assumption that the files will be read using whatever encoding is current in the configuration (the {Site}{CharSet}). This has caused us a world of problems over the years, as no-one can be sure what encoding has been used for data at what point in the process.

The proposal here is to add encoding information to the topics written by the file stores. This allows anyone reading these topics to immediately see what encoding is used. External scripts will be able to modify these topics - and even change the encoding if they want to - without bringing Foswiki down in flames.

See also UseUTF8

Here's my proposed spec:

Foswiki .txt files already use a META:TOPICINFO meta-datum to store information about the topic. This meta-datum is at a fixed offset in the file (the first line), is easy and fast to parse, and is trivial to extend. It makes sense to use it for storing encoding information.

  1. A new optional encoding parameter is added to META:TOPICINFO in .txt files that contains the IANA charset name of the encoding used to encode the data in the file
  2. If the encoding parameter is present in the META:TOPICINFO of a topic, the Store MUST respect that encoding when reading the data, irrespective of the current {Site}{CharSet}
  3. If the encoding is not set, the current {Site}{CharSet} SHOULD be assumed.
    1. A store MAY optionally add a configuration parameter specific to the store to indicate an alternative encoding to be assumed when reading. For example, a configuration may define {Site}{CharSet} to be utf-8, but also define {Contrib}{PlainFileStoreContrib}{DefaultEncoding} to be iso-8859-1. This will result in topics without an encoding parameter being read using iso-8859-1 (but always written using utf-8)
  4. The Store implementation MUST add the encoding parameter when it writes a topic. This will usually be the {Site}{CharSet} or the iso-8859-1 default.
  5. The Store implementation must delete the encoding parameter from the META:TOPICINFO when reading a topic. It must not 'bleed' into the Foswiki::Meta object, where it is irrelevant and would simply cause confusion. Foswiki::Meta objects are always encoded using {Site}{CharSet}.
  6. The new parameter MUST be supported by RCSStoreContrib and PlainFileStoreContrib.
  7. If a store reads a topic containing characters that cannot be encoded in the {Site}{CharSet}, it MUST report this as an error using throw Error. This is regarded as a configuration error.
A possible enhancement to this spec would be a script that adds the encoding to existing topics. However this is not strictly necessary, as the fallback is sufficient.
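
The read and write rules in the spec above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the store code: %cfg stands in for the Foswiki configuration, read_topic_bytes and write_topic_bytes are hypothetical names, and the TOPICINFO line is parsed with plain regexes rather than Foswiki::Attrs so that the sketch runs with core Perl only.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for the Foswiki configuration hash.
my %cfg = (
    Site    => { CharSet => 'utf-8' },
    Contrib => { PlainFileStoreContrib => { DefaultEncoding => 'iso-8859-1' } },
);

# Die on bad data, and do not modify the caller's strings.
my $CHECK = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Read side (rules 2, 3 and 5): take raw bytes from disk and return
# {Site}{CharSet} bytes with the encoding parameter removed.
sub read_topic_bytes {
    my ($raw) = @_;

    # Rule 3 and 3.1: fall back to the store default, then {Site}{CharSet}.
    my $from = $cfg{Contrib}{PlainFileStoreContrib}{DefaultEncoding}
      || $cfg{Site}{CharSet};

    if ( $raw =~ /^%META:TOPICINFO\{([^\r\n]*)\}%/ ) {
        my $attrs = $1;
        my $end   = $+[0];    # offset just past the TOPICINFO line match

        # Rule 2: an explicit encoding wins. Rule 5: strip it again.
        if ( $attrs =~ s/\s*\bencoding="([^"]*)"// ) {
            $from = $1;
            $attrs =~ s/^\s+//;
            substr( $raw, 0, $end, '%META:TOPICINFO{' . $attrs . '}%' );
        }
    }

    # Rule 7: FB_CROAK dies if the data cannot be represented.
    my $chars = Encode::decode( $from, $raw, $CHECK );
    return Encode::encode( $cfg{Site}{CharSet}, $chars, $CHECK );
}

# Write side (rule 4): the text is already {Site}{CharSet} bytes, so only
# the parameter has to be added. Assumes the text carries no encoding
# parameter yet, which holds if it came through read_topic_bytes.
sub write_topic_bytes {
    my ($text) = @_;
    my $enc = $cfg{Site}{CharSet};
    unless ( $text =~ s/^(%META:TOPICINFO\{[^\r\n]*?)\s*\}%/$1 encoding="$enc"}%/ ) {
        $text = qq(%META:TOPICINFO{encoding="$enc"}%\n) . $text;
    }
    return $text;
}

# A topic stored as iso-8859-1 is read into utf-8 and written back as utf-8.
my $on_disk = qq(%META:TOPICINFO{author="guest" encoding="iso-8859-1" version="1"}%\ncaf\xE9\n);
my $in_core = read_topic_bytes($on_disk);
my $written = write_topic_bytes($in_core);
```

A topic without the parameter takes the same path, except that the store default (here iso-8859-1) is assumed when decoding.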

The relevant Task is Tasks.Item1344

-- CrawfordCurrie - 30 May 2014

Discussion

Please consider finally publishing your excellent CharsetConvertorContrib. It makes it possible to ensure that all of the content is in a homogeneous charset encoding. Its advantages over the solution proposed here are:

  1. no changes required to the various store implementations out there (rcs, plain-file, ...)
  2. no extension of the TOPIC format specification required
  3. no runtime overhead: it runs entirely offline

I am particularly concerned about the runtime implications of mixing together content from sources differently encoded. Just imagine what this would mean for all sorts of transclusions that we have: INCLUDE, SEARCH, dirty areas in page caching, ... whatever macro that reads from other topics and brings in stuff to the current topic. These all have to transcode strings on the fly.

Note further that the core would have to transcode strings from various sources again and again and again, doing the very same transcoding process all of the time during runtime while the user waits for his page.

In any attempt to squeeze the most performance out of Foswiki, one would definitely try to prevent this overhead by using CharsetConvertorContrib to homogenize all content and switch off all knobs that may lure the store into wasting time in considering transcoding and rewriting content.

Or put the other way: what motivation would there be to pay the extra overhead online instead of converting all content in one and the same charset offline?

-- MichaelDaum - 30 May 2014

I think you are missing the point. As far as the core code - everything above the store - is concerned, topic data is already homogenous, because {Site}{CharSet} is the encoding used for all strings read from topics. The encoding doesn't touch the core at all - it is totally irrelevant to transclusion, caching or anything else, so long as they use the methods of the store implementation to read the topics. In the case of the PlainFileStoreContrib the support amounts to a few lines of code:
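    # Look for the TOPICINFO line at the very start of the raw topic text
    # (braces escaped, as newer perls reject some unescaped literal braces)
    if ($text && $text =~ /^%META:TOPICINFO\{([^\r\n]+)\}%/s) {
        # Named $attrs rather than $a, which perl reserves for sort blocks
        my $attrs = Foswiki::Attrs->new($1);
        if (defined $attrs->{encoding}
            && $attrs->{encoding} ne $Foswiki::cfg{Site}{CharSet}) {
            # Decode to perl internal string
            my $t = Encode::decode($attrs->{encoding}, $text, Encode::FB_CROAK);
            # Re-encode to {Site}{CharSet}
            $text = Encode::encode($Foswiki::cfg{Site}{CharSet}, $t, Encode::FB_CROAK);
        }
    }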

I would imagine the RCSStoreContrib will be similar.
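
To show what Encode::FB_CROAK does in the snippet above, here is a minimal self-contained sketch of the same decode and re-encode step. The helper name transcode is hypothetical, and the charsets are chosen only for illustration. In a real store the failure would presumably be turned into a thrown Error, as rule 7 of the spec requires; that part is not shown here.

```perl
use strict;
use warnings;
use Encode ();

# Transcode bytes between two charsets, dying if a character cannot be
# represented in the target. transcode() is a hypothetical helper name.
sub transcode {
    my ( $bytes, $from, $to ) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = Encode::decode( $from, $bytes, $check );
    return Encode::encode( $to, $chars, $check );
}

# e-acute exists in both charsets, so this succeeds.
my $ok = transcode( "caf\xC3\xA9", 'utf-8', 'iso-8859-1' );

# The euro sign (U+20AC) has no iso-8859-1 code point, so this dies.
my $bad   = eval { transcode( "\xE2\x82\xAC", 'utf-8', 'iso-8859-1' ) };
my $error = $@;
print "transcoding failed: $error" unless defined $bad;
```

The second call is the situation rule 7 calls a configuration error: the topic holds characters that the configured {Site}{CharSet} cannot express.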

The main goal is to bridge the gap between the current mess (where, for example, a change to {Site}{CharSet} can quickly result in the corruption of all topic data) and internal Unicode.

The CharsetConvertorContrib is only of use for RCS based stores. It has no value for any other kind of store. It's a hack, at best, and I don't see it as a long term solution.

-- CrawfordCurrie - 30 May 2014

CharsetConvertorContrib currently operates only on RCS based stores. I don't see a reason why it could not read from other backends.

It does not matter on which level the runtime transcoding happens, be it well hidden away down inside a store layer or not. You will have to pay for it anyway. And it surely doesn't come for free. What matters is that you are trying to do it online over and over again. Why is that preferable compared to doing it only once offline?

-- MichaelDaum - 30 May 2014

+1 for making the encoding of on-disk data explicit

But it seems wrong to re-encode topic data to {Site}{CharSet}. From what I have read about handling encodings, data should be decoded when read in and encoded on output, but processed as characters (perl internal format) instead of as a "byte stream" in between.
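
A small sketch of the decode on input, encode on output pattern described here, using only the core Encode module. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "caf\xC3\xA9";                      # UTF-8 bytes as read from disk
my $chars = Encode::decode( 'UTF-8', $bytes );  # decode once on input

my $byte_len = length $bytes;    # counts bytes
my $char_len = length $chars;    # counts characters

# Case mapping on the character string upper-cases the e-acute as well.
my $upper = uc $chars;

my $out = Encode::encode( 'UTF-8', $upper );    # encode once on output
```

The two lengths differ because the e-acute takes two bytes in UTF-8 but is a single character once decoded.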

-- FlorianSchlichting - 30 May 2014

If the cost of conversion is negligible (and it would be for my wikis with small user bases), then I would prefer online conversion that "just works" to offline conversion that requires a manual process.

I like the online charset conversion because of the increased robustness. Those administrators who don't want the performance hit can do an offline conversion.

There is still the cost of checking if conversion is required. I suspect that the performance hit from checking is negligible (but it is only a suspicion).
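
As a rough illustration of what that check could look like, the sketch below compares canonical encoding names rather than raw strings, so that aliases such as latin1 and ISO-8859-1 are not treated as different. needs_conversion is a hypothetical helper, not an existing Foswiki function, and no claim is made here about its actual cost.

```perl
use strict;
use warnings;
use Encode ();

# Return true if bytes in $from must be transcoded to reach $to.
# needs_conversion() is a hypothetical helper, not a Foswiki API.
sub needs_conversion {
    my ( $from, $to ) = @_;
    my $f = Encode::find_encoding($from) or die "Unknown encoding '$from'";
    my $t = Encode::find_encoding($to)   or die "Unknown encoding '$to'";

    # Compare canonical names so that case and aliases do not matter.
    return $f->name ne $t->name;
}

my $same_alias = needs_conversion( 'latin1',     'ISO-8859-1' );
my $same_case  = needs_conversion( 'utf-8',      'UTF-8' );
my $different  = needs_conversion( 'iso-8859-1', 'utf-8' );
```

Only the last pair needs transcoding. Note that Encode treats utf8 (lax) and UTF-8 (strict) as distinct encodings, so this comparison would report a conversion between those two names.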

-- MichaelTempest - 30 May 2014

There's a misunderstanding here about how the stores work. Once a topic has been loaded into Meta, the encoding is irrelevant. The CharsetConvertorContrib works below the level of Meta, to convert the encoding of RCS topics, including their histories. If we had the encoding parameter, then it would trivially add encoding={Site}{CharSet} to META:TOPICINFO in each revision. As it is, it works using Encode::encode, which works fine if you are very careful, but is fraught with risk.

Florian, you have to canonicalise the encoding of strings to something, and so much of the code assumes that the canonical form is a {Site}{CharSet} encoded byte stream that converting to perl strings would be hugely risky, and would require a modern perl not widely supported in Linux distributions. So we bounced that to FW 2.0.

-- CrawfordCurrie - 01 Jun 2014
 
Topic revision: r8 - 01 Jun 2014, CrawfordCurrie