AddEncodingToTXTFiles (01 Jun 2014, CrawfordCurrie)

Add encoding to .TXT files

Currently .txt files are written by the back-end stores (RCSStoreContrib and PlainFileStoreContrib) without any encoding information. There's an assumption that the files will be read using whatever encoding is current in the configuration (the {Site}{CharSet}). This has caused us a world of problems over the years, as no-one can be sure what encoding has been used for data at what point in the process.

The proposal here is to add encoding information to the topics written by the file stores. This allows anyone reading these topics to immediately see what encoding is used. External scripts will be able to modify these topics - and even change the encoding if they want to - without bringing Foswiki down in flames.
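As an illustration only, a reader of such a topic could pick the declared encoding out of the header line. The sketch below assumes the attribute is called `encoding` and lives in META:TOPICINFO, as in the code quoted later in the discussion; the helper name `topic_encoding` and the sample attribute values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the encoding declared in a topic's
# META:TOPICINFO header, or the supplied default when none is declared.
sub topic_encoding {
    my ( $text, $default ) = @_;
    if ( $text =~ /^%META:TOPICINFO\{([^\r\n]+)\}%/ ) {
        my $attrs = $1;
        return $1 if $attrs =~ /\bencoding="([^"]*)"/;
    }
    return $default;
}

# Made-up sample topic; the attribute values are placeholders.
my $topic = <<'END';
%META:TOPICINFO{author="ProjectContributor" date="1401580800" encoding="utf-8" version="1"}%
Topic text
END

# Fall back to an assumed site charset when the header says nothing.
print topic_encoding( $topic, 'iso-8859-1' ), "\n";
```

A topic without the attribute simply yields the default, so files written before the change would keep being read as {Site}{CharSet}.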

Discussion

Please consider finally publishing your excellent CharsetConvertorContrib. It makes it possible to ensure that all of the content is in a homogeneous charset encoding. Its advantages over the solution proposed here are:

no changes required to the various store implementations out there (rcs, plain-file, ...)
no extension of the TOPIC format specification required
no runtime overhead: it runs entirely offline

I am particularly concerned about the runtime implications of mixing together content from differently encoded sources. Just imagine what this would mean for all sorts of transclusions that we have: INCLUDE, SEARCH, dirty areas in page caching, ... any macro that reads from other topics and brings content into the current topic. These would all have to transcode strings on the fly.

Note further that the core would have to transcode strings from various sources again and again and again, doing the very same transcoding process all of the time during runtime while the user waits for his page.

In any attempt to squeeze the most performance out of Foswiki, one would definitely try to prevent this overhead by using CharsetConvertorContrib to homogenize all content and switch off all knobs that may lure the store into wasting time in considering transcoding and rewriting content.

Or, to put it the other way round: what motivation would there be to pay the extra overhead online instead of converting all content to one and the same charset offline?

-- MichaelDaum - 30 May 2014

I think you are missing the point. As far as the core code - everything above the store - is concerned, topic data is already homogeneous, because {Site}{CharSet} is the encoding used for all strings read from topics. The encoding doesn't touch the core at all - it is totally irrelevant to transclusion, caching or anything else, so long as they use the methods of the store implementation to read the topics. In the case of the PlainFileStoreContrib the support amounts to a few lines of code:

    # Only transcode when the topic declares an encoding other than the site one
    if ($text && $text =~ /^%META:TOPICINFO\{([^\r\n]+)\}%/s) {
        my $attrs = Foswiki::Attrs->new($1);
        if (defined $attrs->{encoding}
            && $attrs->{encoding} ne $Foswiki::cfg{Site}{CharSet}) {
            # Decode the stored bytes to a perl internal string
            my $t = Encode::decode($attrs->{encoding}, $text, Encode::FB_CROAK);
            # Re-encode to {Site}{CharSet}
            $text = Encode::encode($Foswiki::cfg{Site}{CharSet}, $t, Encode::FB_CROAK);
        }
    }

I would imagine the RCSStoreContrib will be similar.

The main goal is to bridge the gap between the current mess (where, for example, a change to {Site}{CharSet} can quickly result in the corruption of all topic data) and internal Unicode.

The CharsetConvertorContrib is only of use for RCS based stores. It has no value for any other kind of store. It's a hack, at best, and I don't see it as a long term solution.

-- CrawfordCurrie - 30 May 2014

CharsetConvertorContrib currently operates only on RCS based stores. I don't see a reason why it could not read from other backends.

It does not matter at which level the runtime transcoding happens, whether it is well hidden away inside a store layer or not. You will have to pay for it anyway, and it surely doesn't come for free. What matters is that you are trying to do it online over and over again. Why is that preferable to doing it only once offline?

-- MichaelDaum - 30 May 2014

+1 for making the encoding of on-disk data explicit

But it seems wrong to re-encode topic data to {Site}{CharSet} - from what I have read about handling encodings, data should be decoded when read in and encoded on output, but processed as characters (Perl internal format) instead of as a "byte stream" in between.
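A minimal sketch of that decode-on-input, encode-on-output pattern, using only the core Encode module. The sample bytes and the choice of ISO-8859-1 in and UTF-8 out are assumptions made for the example, not anything Foswiki does today.

```perl
use strict;
use warnings;
use Encode ();

# Bytes as they might sit on disk: "cafe" with an e-acute, in ISO-8859-1.
my $bytes_on_disk = "caf\xE9";

# Decode once at the input boundary into Perl's internal character string.
# LEAVE_SRC keeps Encode from modifying the source scalar in place.
my $chars = Encode::decode( 'iso-8859-1', $bytes_on_disk,
    Encode::FB_CROAK() | Encode::LEAVE_SRC() );

# In between, work on characters: length counts characters, not bytes.
my $char_count = length $chars;

# Encode once at the output boundary, here to UTF-8.
my $bytes_out = Encode::encode( 'UTF-8', $chars,
    Encode::FB_CROAK() | Encode::LEAVE_SRC() );
my $byte_count = length $bytes_out;

print "$char_count characters in, $byte_count bytes out as UTF-8\n";
```

The character count and the byte count differ because the accented character takes two bytes in UTF-8, which is exactly the distinction that gets lost when everything is handled as a byte stream.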

-- FlorianSchlichting - 30 May 2014

If the cost of conversion is negligible (and it would be for my wikis with small user bases), then I would prefer online conversion that "just works" to offline conversion that requires a manual process.

I like the online charset conversion because of the increased robustness. Those administrators who don't want the performance hit can do an offline conversion.

There is still the cost of checking if conversion is required. I suspect that the performance hit from checking is negligible (but it is only a suspicion).
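One way to test that suspicion would be to time the check on its own with the core Benchmark module. This is only a sketch: the `needs_conversion` helper, the sample header and the iteration count are all made up for the measurement, and the real store code may check differently.

```perl
use strict;
use warnings;
use Benchmark qw(timethis timestr);

# Made-up topic: one header line plus some filler text.
my $header = '%META:TOPICINFO{author="ProjectContributor" date="1401580800" encoding="utf-8" version="1"}%';
my $text   = $header . "\n" . ( "Some topic text.\n" x 200 );

# Hypothetical check: true only when the topic declares an encoding
# that differs from the site charset, i.e. when transcoding is needed.
sub needs_conversion {
    my ( $topic, $charset ) = @_;
    return 0
      unless $topic =~ /^%META:TOPICINFO\{[^\r\n]*\bencoding="([^"]*)"[^\r\n]*\}%/;
    return $1 ne $charset ? 1 : 0;
}

# Time the check alone; 'none' suppresses Benchmark's own output.
my $t = timethis( 100_000, sub { needs_conversion( $text, 'utf-8' ) },
    'check', 'none' );
print 'check only: ', timestr($t), "\n";
```

Comparing that figure with the time taken to read and render a topic on the same machine would show whether the check matters in practice.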

-- MichaelTempest - 30 May 2014

There's a misunderstanding here about how the stores work. Once a topic has been loaded into Meta, the encoding is irrelevant. The CharSetConvertorContrib works below the level of Meta, to convert the encoding of RCS topics, including their histories. If we had encoding then this would trivially add encoding={Site}{CharSet} to META:TOPICINFO in each revision. As it is, it works using Encode::encode which works fine if you are very careful, but is fraught with risk.

Florian, you have to canonicalise the encoding of strings to something, and so much of the code assumes that the canonical form is a {Site}{CharSet} encoded byte stream that converting to Perl strings would be hugely risky, and would require a modern Perl not widely supported in Linux distributions. So we bounced that to FW 2.0.

-- CrawfordCurrie - 01 Jun 2014