Foswiki > Tasks Web > Item13405 (03 Feb 2016, GeorgeClark)

Item13405: Unicode Normalisation

Priority: Normal
Current State: Closed
Released In: 2.1.0
Target Release: minor
Applies To: Engine
Component: I18N, Unicode
Branches: master Item13525 Item13405 Item13897
Reported By: JozefMojzis
Waiting For:
Last Change By: GeorgeClark

OS X specific. Error reproduced on OS X:

00000000: 55 cc 81 6e cc 8c 69 cc 81 63 cc 8c 6f cc 82 64  U..n..i..c..o..d
00000010: cc 8c 65 cc 8c 0a                                ..e...    

Background

The impacts need much more precise analysis. Many parts of this topic will be moved into UnicodeNormalisation.

Unicode characters can have several normalization forms. The most common are NFC and NFD.

E.g. 'Å' can be represented as U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) or as U+0041 U+030A (LATIN CAPITAL LETTER A + COMBINING RING ABOVE). They are the same character in two different forms.
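The two representations can be compared directly in Perl. This is only a minimal sketch using the core Unicode::Normalize module; the helper name codepoints and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Precomposed form: U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
my $composed = "\x{C5}";
# Decomposed form: U+0041 followed by U+030A COMBINING RING ABOVE
my $decomposed = "A\x{30A}";

# Illustrative helper: list the code points of a string as U+XXXX
sub codepoints { return join ' ', map { sprintf 'U+%04X', ord } split //, shift }

print "composed:   ", codepoints($composed),   "\n";
print "decomposed: ", codepoints($decomposed), "\n";
print "equal as raw strings: ", ( $composed eq $decomposed ? "yes" : "no" ), "\n";
print "equal after NFC:      ", ( NFC($composed) eq NFC($decomposed) ? "yes" : "no" ), "\n";
print "equal after NFD:      ", ( NFD($composed) eq NFD($decomposed) ? "yes" : "no" ), "\n";
```

The raw strings compare as different, while both normalised comparisons agree, which is the whole point of picking one form and sticking to it.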

When a client uploads a file whose name contains Unicode characters, the browser supplies the UTF-8 encoded filename as it got it from the underlying OS. Usually (e.g. when filenames are not specially crafted) the following applies:
  • Linux/FreeBSD - filenames are sent as NFC. (It is possible to craft NFD names too, e.g. two files could exist, one as Å and a second as Å.)
  • Mac OS X - NFD is enforced. It is impossible to craft an NFC filename, e.g. open Å and open Å always open the same file, and uploaded filenames are NFD. (I really don't understand why, because everywhere else OS X applications use NFC, e.g. for text.)
  • Windows - uses its own wchar (UTF-16) for filenames. Because the browser MUST (per RFC) send UTF-8 encoded filenames, Windows converts its internal wchar name to NFC -> UTF-8.

What does this mean for Foswiki?

  • When someone on Mac OS X uploads a file named "ŠáŘý" to a Foswiki installation on Linux, Foswiki saves it as received, i.e. with an NFD filename.
  • If someone else (on a non-Mac client, e.g. Linux) uploads the same file to the same topic, the file is uploaded with an NFC filename.
  • As a result the topic will have TWO attached files, both named "ŠáŘý" (one in NFC form, the other in NFD). More precisely, one name is the UTF-8 encoding of the NFC form and the other is the UTF-8 encoding of the NFD form.
  • See the following screenshot, taken on FreeBSD - two files with the same name:
    screenshot 71.png

The first is NFD, the second NFC.
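Why a byte-oriented filesystem treats these as two files can be seen from the encoded names. The sketch below is an illustration only: the hash %dir merely stands in for a directory that compares names byte by byte, and hexbytes is a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Unicode::Normalize qw(NFC NFD);

# The attachment name from the example above, written with escapes
my $name = "\x{160}\x{E1}\x{158}\x{FD}";

# Illustrative helper: hex dump of a byte string
sub hexbytes { return join ' ', map { sprintf '%02x', ord } split //, shift }

my $nfc_bytes = encode_utf8( NFC($name) );
my $nfd_bytes = encode_utf8( NFD($name) );
print "NFC: ", hexbytes($nfc_bytes), "\n";
print "NFD: ", hexbytes($nfd_bytes), "\n";

# A directory keyed by raw bytes sees two different entries
my %dir = (
    $nfc_bytes => 'uploaded from Linux',
    $nfd_bytes => 'uploaded from OS X',
);
print scalar( keys %dir ), " distinct directory entries\n";
```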

The problem

  • For topic editing, the client's browser always uses NFC.
  • So, when a file comes from OS X and someone wants to refer to it as %ATTACHURLPATH%/ŠáŘý,
  • Foswiki will search for the NFC filename.
  • If it is saved as NFD, the link will not work.
  • The same applies when someone addresses the attachment from a remote site: NFC will be used again, because it is the default for editors.

Solution

Foswiki should enforce NFC (or NFKC) internally for every Unicode string decoded from the network, e.g. it should use Unicode::Normalize::NFC( Encode::decode_utf8(...) ).
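As a minimal sketch of that idea (decode_nfc is a hypothetical helper name, not an existing Foswiki function), the same attachment name arriving as NFD octets from OS X and as NFC octets from Linux ends up as one and the same string:

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize ();

# Hypothetical helper, not Foswiki API: decode UTF-8 octets, then NFC-normalise
sub decode_nfc {
    my ($octets) = @_;
    return Unicode::Normalize::NFC( Encode::decode_utf8($octets) );
}

# The same name as it would arrive from two clients
my $from_osx   = "S\xCC\x8Ca\xCC\x81R\xCC\x8Cy\xCC\x81";    # NFD octets
my $from_linux = "\xC5\xA0\xC3\xA1\xC5\x98\xC3\xBD";        # NFC octets

print decode_nfc($from_osx) eq decode_nfc($from_linux)
  ? "same name\n"
  : "different names\n";
```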

COST

From the IRC
CDot   ok, so I understand the arguments for normalisation. What's the cost?
...
CDot   no, once for each incoming string   [08:45]
jast   well, that would be the "optimal" treatment   [08:45]
CDot   because you never know what params refer to filenames

Development effort

Can't estimate. A patch for Engine::CGI is attached. It may not be complete. The same certainly has to be done for the other Engines, and much testing is needed (plus some added UnitTests).

Performance

I tested the following, i.e. compared the usual Foswiki actions:
  • CGI query init
  • decode
  • and tested with and without NFC normalisation.
The results:
Benchmark: timing 100000 iterations of norm, pure...
      norm: 53.4442 wallclock secs (49.43 usr +  0.32 sys = 49.75 CPU) @ 2010.05/s (n=100000)
      pure: 51.4577 wallclock secs (47.79 usr +  0.28 sys = 48.07 CPU) @ 2080.30/s (n=100000)
The teststring is: 96 characters and 144 bytes long
In usual operation, NFC-ing every request costs about 2 seconds of wallclock time over 100_000 requests, i.e. approximately 20 microseconds per request. Compared with Foswiki's usual processing time, the performance loss is nearly zero ...
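The per-request figure follows directly from the two wallclock numbers in the benchmark output. A small sketch of the arithmetic (the variable names are illustrative):

```perl
use strict;
use warnings;

# Wallclock seconds taken from the benchmark output
my %wall       = ( norm => 53.4442, pure => 51.4577 );
my $iterations = 100_000;

my $extra_seconds  = $wall{norm} - $wall{pure};
my $per_request_us = $extra_seconds / $iterations * 1e6;
my $relative_pct   = $extra_seconds / $wall{pure} * 100;

printf "extra time in total: %.4f s\n", $extra_seconds;
printf "extra time per request: %.1f microseconds\n", $per_request_us;
printf "relative overhead of this micro-benchmark: %.1f%%\n", $relative_pct;
```

Note that the relative figure only describes this micro-benchmark (CGI init plus decode), not a complete Foswiki request.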

Concerns

Some comments on the IRC discussion, in the form of a table:

| Impact when | Description | Foswiki does NOT enforce NFC | Foswiki WILL enforce NFC |
| Mac user uploads a file into Foswiki. | The filename arrives as NFD. | The file is saved as NFD, so there is a problem addressing it. | The filename is converted to NFC. No problem addressing it. |
| Linux user uploads a file into Foswiki. | The filename arrives (if it is not crafted) as NFC. | The filename is saved as it comes, i.e. usually as NFC. | The filename is saved as NFC. No difference. |
| Windows user uploads a file. | Windows converts its wchar filenames into NFC -> UTF-8. | The file is saved as NFC. | The file is saved as NFC. No difference. |
| Serving files from /pub with Apache. | Apache tries to serve what it is asked for. Browsers usually send requests as NFC. | If the filename is in NFD, Apache will not find it when the request comes as NFC. Again, any attachment that comes from a Mac will be problematic. | If every filename is converted to NFC, no problem - Apache will serve the files as NFC. Minor possible problem: Apache gets an NFD encoded request for the file. Such a request must be specially crafted and does not happen in normal circumstances, so it can safely be ignored. |
| Foswiki installed on a Mac. | Mac forces every filename to NFD. | The files are saved as (enforced) NFD. When Foswiki (or any other application) tries to access the NFC filename, the NFD one is opened - i.e. no problem. | The files are saved as (enforced) NFD. When Foswiki (or any other application) tries to access the NFC filename, the NFD one is opened - i.e. no problem. |
| WebDAV | AFAIK WebDAV serves files as they are saved, and saves them as received. No enforcing policy, but browsers use NFC by default, so probably all filenames are in NFC. (I'm not sure whether it is even possible to get an NFD encoded filename with WebDAV.) | Foswiki uses the default NFC internally for the text content, i.e. no problem. | Foswiki uses the enforced NFC internally for the text content, i.e. again no problem. |
| NFS export of pub from Linux - Linux NFS client. | Linux serves files as they are in the filesystem. | It could serve an NFD filename, which could cause problems on the client side when the client uses NFC on its filesystem (the default for all Linux/FreeBSD, not enforced). | The filenames will be NFC - no problems on the client side. All filenames are consistent with the default. |
| NFS export of pub from Linux - Mac OS X NFS client. | Linux serves files as they are in the filesystem. | Any file copied from Linux is converted to an NFD filename on the Mac. No problem, same as every other file. | Any file copied from Linux is converted to an NFD filename on the Mac. No problem, same as every other file. |
| Samba export from Linux. | Don't know - no information. | Don't know - no information. | Don't know - no information. |
| Samba export from Mac to Windows. | Works without any problems; Windows can use and save international filenames on the Mac and the Mac sees them with the right name. | Works without any problems; Windows can use and save international filenames on the Mac and the Mac sees them with the right name. | Works without any problems; Windows can use and save international filenames on the Mac and the Mac sees them with the right name. |
| Any other concerns? Add them here. | | | |
| ? | | | |
| ? | | | |

For me the above means that Foswiki should enforce NFC (or NFKC) for problem-free operation.

All the above is an issue only when
  • an OS X user is using Foswiki
  • and wants to upload a file
  • and the filename contains Unicode characters.

i.e. NOT URGENT

PS: someone who really knows English should clean up the above a bit to make it understandable. My "perl" isn't the best, but it is still better than my English.

Impacts?

 


See also

Supplemental

Script for generating NFD NFC filenames...

#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use charnames qw(:full);
use Path::Tiny;
use Data::Dumper;

use utf8;
use Unicode::Normalize qw(NFD NFC);
use Encode;
use open qw(:std :utf8);

my $ch = {
    c => "\N{U+0010C}", # LATIN CAPITAL LETTER C WITH CARON
    a => "\N{U+000E1}", # LATIN SMALL LETTER A WITH ACUTE
    r => "\N{U+00158}", # LATIN CAPITAL LETTER R WITH CARON
    y => "\N{U+000FD}", # LATIN SMALL LETTER Y WITH ACUTE
};

my $str = join '', @$ch{qw(c a r y)};
my $name = {
    nfc => NFC($str),
    nfd => NFD($str),
};
say "$_: ", map { s/(..)/$1 /gr } unpack "H*", Encode::encode('utf8', $name->{$_}) for sort keys %$name;
say Dumper($name);

my $tmp = Path::Tiny->tempdir(DIR => ".", CLEANUP => 0);
say "Check the $tmp directory: find $tmp -type f -print | xxd";

$tmp->child($name->{nfc})->spew_utf8("this file has NFC filename [$name->{nfc}]\n");
die "$name->{nfd} exists" if $tmp->child($name->{nfd})->exists; #on the OS X dies..
$tmp->child($name->{nfd})->spew_utf8("this file has NFD filename [$name->{nfd}]\n");

Request benchmark script

#!/usr/bin/env perl
use 5.014;
use warnings;

use CGI;
use Encode;
use Unicode::Normalize;
use Benchmark qw(:hireswallclock cmpthese timethese);

my $data = join '', map s/\s+//gr, do { local $/; scalar <DATA> };

timethese(100000, {
    'pure' => sub {
        my $q = CGI->new("data=$data");
        my $arg = [$q->multi_param('data')]->[0];
        my $e = Encode::decode('utf8', $arg);
    },
    'norm' => sub {
        my $q = CGI->new("data=$data");
        my $arg = [$q->multi_param('data')]->[0];
        my $e = NFC(Encode::decode('utf8', $arg));
    },
});

my $q = CGI->new("data=$data");
my $d = [$q->multi_param('data')]->[0];
my $c = Encode::decode('utf8', $d);

binmode(STDOUT, ":utf8");
say "The teststring is: @{[length($c) ]} characters and @{[ length($d) ]} bytes long";
say "Contains: [$c]";

__DATA__
a%CC%81a%CC%88c%CC%8Cd%CC%8Ce%CC%81e%CC%8Ci%CC%81l%CC%81l%CC%8Cn%CC%8Co%CC%81o%C
C%82o%CC%88o%CC%8Br%CC%81r%CC%8Cs%CC%8Ct%CC%8Cu%CC%81u%CC%8Au%CC%88u%CC%8By%CC%8
1z%CC%8CA%CC%81A%CC%88C%CC%8CD%CC%8CE%CC%81E%CC%8CI%CC%81L%CC%81L%CC%8CN%CC%8CO%
CC%81O%CC%82O%CC%88O%CC%8BR%CC%81R%CC%8CS%CC%8CT%CC%8CU%CC%81U%CC%8AU%CC%88U%CC%
8BY%CC%81Z%CC%8C

Patch for normalization and sorting

Patch removed - now checked in.

-- Main.GeorgeClark - 01 Dec 2015 - 02:13

Reopening this task. Performance seems to be related to the perl version. 5.20.2 performs just fine. With 5.22 and 5.23, performance drops drastically. Perl 5.22 changed from Unicode 6.3 to Unicode 7.0. NYTProf shows all of the time spent in Unicode::Normalize.

| Perl version | Not normalized | Normalized |
| 5.8.8 | 0m0.920s | 0m0.999s |
| 5.20.2 | 0m0.753s | 0m0.784s |
| 5.22.0 | 0m0.658s | 0m3.368s |
| 5.23.3 | 0m0.626s | 0m3.117s |

Need to make normalization configurable.

Bottom line - my original fix was bogus. It was NFC normalizing ALL strings, not just strings received from the network. Configurable is not the right answer. The normalization should not be done in Foswiki::decode_utf8. This was all described in the original problem report and I missed it.

-- GeorgeClark - 13 Dec 2015

If you want consistently normalised strings inside Foswiki, i.e. you don't want to end up with strings of mixed normalisation, you need to do the normalisation at the borders. Otherwise you can get NFC from the network but NFD from the filesystem. There are many other scenarios too, e.g. executing external shell commands (which use NFD filenames) could return NFD responses.

For example, on OS X all filenames are forced to NFD, so NFC-ing only the network requests doesn't help at all.
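One such border is the filesystem. The following is only a sketch of normalising directory listings on the way in; nfc_readdir is a hypothetical helper name and is not claimed to match how Foswiki implements it.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize ();

# Hypothetical helper: read a directory (path given as octets) and
# return the entry names as NFC-normalised character strings.
sub nfc_readdir {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "Cannot open $dir: $!";
    my @names =
      map  { Unicode::Normalize::NFC( Encode::decode_utf8($_) ) }
      grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return @names;
}

# Usage: list the current directory, re-encoding the names for output
print Encode::encode_utf8($_), "\n" for nfc_readdir('.');
```

With this kind of wrapper a name stored as NFD on disk compares equal to the NFC name that arrived from the network.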

Also, I benchmarked different perl versions and different versions of CPAN:Unicode::Normalize using the following script:
use 5.014;
use warnings;

use utf8;
binmode STDOUT, ':utf8';

use Unicode::Normalize;
use Benchmark qw(:all);

say "perl version: $]";
say "Unicode::Normalize ver: ", $Unicode::Normalize::VERSION;

my $cnt = 1_000;
say "Long string $cnt times";
my $str = NFC(join '', grep { /\w/ } map { chr } 0xFF .. 0x10_FFFF);
doit($cnt, $str);

$cnt = 50_000;
say "Short string $cnt times";
$str = NFC(join '', grep { /[\p{Latin}\p{Greek}\p{Cyrillic}\p{Number}]/ } map { chr } 0xFF .. 0xFFFF);
doit($cnt, $str);

sub doit {
   my($n,$nfc) = @_;
   #say $nfc;
   say "NFC length: ", length($nfc);

   my $nfd = NFD($nfc);
   #say $nfd;
   say "NFD length: ", length($nfd);

   timethis($n, sub { NFC($nfd) } );
}

The above benchmark doesn't show any meaningful difference between the different versions. For 5.16.3:
perl version: 5.016003
Unicode::Normalize ver: 1.14
Long string 1000 times
NFC length: 103350
NFD length: 126430
timethis 1000:  9 wallclock secs ( 9.22 usr +  0.01 sys =  9.23 CPU) @ 108.34/s (n=1000)
Short string 50000 times
NFC length: 2590
NFD length: 3624
timethis 50000: 13 wallclock secs (12.88 usr +  0.04 sys = 12.92 CPU) @ 3869.97/s (n=50000)

For 5.20.3:
perl version: 5.020003
Unicode::Normalize ver: 1.17
Long string 1000 times
NFC length: 103352
NFD length: 126432
timethis 1000:  8 wallclock secs ( 8.82 usr +  0.01 sys =  8.83 CPU) @ 113.25/s (n=1000)
Short string 50000 times
NFC length: 2590
NFD length: 3624
timethis 50000: 13 wallclock secs (12.35 usr +  0.01 sys = 12.36 CPU) @ 4045.31/s (n=50000)

For 5.23.5:
perl version: 5.023005
Unicode::Normalize ver: 1.23
Long string 1000 times
NFC length: 112427
NFD length: 135515
timethis 1000:  8 wallclock secs ( 8.61 usr +  0.01 sys =  8.62 CPU) @ 116.01/s (n=1000)
Short string 50000 times
NFC length: 2706
NFD length: 3740
timethis 50000: 11 wallclock secs (10.44 usr +  0.01 sys = 10.45 CPU) @ 4784.69/s (n=50000)

So, the table:
| perl | Unicode::Normalize | long string (iterations/s) | short string (iterations/s) |
| 5.16.3 | 1.14 | 108.34 | 3869.97 |
| 5.20.3 | 1.17 | 113.25 | 4045.31 |
| 5.23.5 | 1.23 | 116.01 | 4784.69 |

So, on my notebook the fastest combination is perl v5.23.5 with Unicode::Normalize 1.23, but the differences aren't dramatic at all.

-- JozefMojzis - 13 Dec 2015

Also, as I said on IRC, the network-only NFC version reintroduces all the old bugs, like Tasks.Item13660.

-- JozefMojzis - 13 Dec 2015

After talking on IRC: the result of attaching a file named ČáŘý.png to http://trunk.foswiki.org/Sandbox/ŽuŽu needs testing.

Its puburl is
  • (copied): ČáŘý.png
  • written by hand: ČáŘý.png
  • copied the filename from the filesystem: ČáŘý.png
(all should be the same)

Results of some quick tests:
  • view wikitext - OK
  • click manage attachments, scroll to history - the comment has the wrong encoding
  • edit wikitext - OK
  • wysiwyg edit -> switch to wikitext - the link contains an encoded topic name, the filename is OK.

the filename in the OS X filesystem is:

$ ls ????.png | od -bc
0000000   103 314 214 141 314 201 122 314 214 171 314 201 056 160 156 147
           C    ̌  **   a    ́  **   R    ̌  **   y    ́  **   .   p   n   g
0000020   012                                                            
          \n         
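The od output above can be decoded to confirm that the stored name is NFD. A minimal sketch, with the octal byte values copied from the dump (trailing newline left out) and an illustrative codepoints helper:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFC);

# Octal byte values from the od dump, without the final 012 (newline)
my @octal  = qw(103 314 214 141 314 201 122 314 214 171 314 201 056 160 156 147);
my $octets = pack 'C*', map { oct } @octal;
my $name   = decode_utf8($octets);

# Illustrative helper: list the code points of a string as U+XXXX
sub codepoints { return join ' ', map { sprintf 'U+%04X', ord } split //, shift }

print "as stored: ", codepoints($name), "\n";
print "after NFC: ", codepoints( NFC($name) ), "\n";
```

Each base letter is followed by a combining mark (U+030C or U+0301) in the stored form, and NFC folds every pair into a single precomposed code point.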

-- JozefMojzis - 13 Dec 2015

Attached two files with NYTProf results from the command:
NYTPROF="file=/tmp/nytprof_n.out:addpid=1:endatexit=1" perl -d:NYTProf bin/view Sandbox.BigTopic
  • the file: nytprof_a.out.15001 - from the master version
  • the file: nytprof_n.out.15199 - from the HEAD detached at 4554e77
After running nytprofhtml -f.... both versions show only a few milliseconds in NFC - I wonder why it took much longer on Linux.

-- JozefMojzis - 13 Dec 2015

The above filename on trunk.foswiki.org, after upload from an OS X client:
0000000   305 275 165 305 275 165 057 057 304 214 303 241 305 230 303 275
         305 275   u 305 275   u   /   / 304 214 303 241 305 230 303 275
0000020   056 160 156 147 012                                            
           .   p   n   g  \n                                            
0000025

-- GeorgeClark - 13 Dec 2015

I've made some more fixes ... I missed NFC encoding in Engine::CGI. So that's checked into master.

The badly encoded comment history is an unrelated bug. It fails on 2.0.3 as well. RcsHandler is double-encoding the comment. Fixed in Item13894.

-- GeorgeClark - 14 Dec 2015

Confirmed for one perl version.

Installed FreeBSD 10.2 in VMware and installed plenty of perls using plenv.

Tested the BigTopic rendering using this script:
#!/usr/local/bin/bash

err() { echo "$@" >&2; exit 1; }

runtests() {
   while read -r ver
   do
      echo "======== $ver ==========="
      plenv local $ver
      perl -MUnicode::Normalize -E 'say "perl: $] Norm: $Unicode::Normalize::VERSION"'
      echo "From the System.PerlDependencyReport:" $(bin/view System.PerlDependencyReport | grep 'Perl version:' |sed 's/.* 5\./ 5./')
      for i in {1..10}
      do
         time bin/view Sandbox.BigTopic >/dev/null
      done 2> >(perl -lnE 'if(/real.*(\d+)m(\d+)\.(\d+)s/){$sec=(($1*60)+$2).".$3";$n++;$sum+=$sec;}}{say "BigTopic $n times, average: ",$sum/$n;') >/dev/null
      sleep 1
   done < <(plenv versions | perl -lnE 'say $1 if( !/system/&&/.*(5\.\d+\.\d+?).*/)' )
}

git pull

git checkout 4554e77fccb151ec98c6e0d80ba276fe46872c44
echo "RESULTS for the $(git status | head -1)"
runtests

git checkout master
echo "RESULTS for the $(git status | head -1)"
runtests

The results:
RESULTS for the HEAD detached at 4554e77
======== 5.16.3 ===========
perl: 5.016003 Norm: 1.14
From the System.PerlDependencyReport: 5.016003
BigTopic 10 times, average: 1.2498
======== 5.18.4 ===========
perl: 5.018004 Norm: 1.16
From the System.PerlDependencyReport: 5.018004
BigTopic 10 times, average: 1.0938
======== 5.20.3 ===========
perl: 5.020003 Norm: 1.17
From the System.PerlDependencyReport: 5.020003
BigTopic 10 times, average: 0.8872
======== 5.22.0 ===========
perl: 5.022000 Norm: 1.18
From the System.PerlDependencyReport: 5.022000
BigTopic 10 times, average: 3.7704
======== 5.23.5 ===========
perl: 5.023005 Norm: 1.23
From the System.PerlDependencyReport: 5.023005
BigTopic 10 times, average: 0.8783


RESULTS for the On branch master
======== 5.16.3 ===========
perl: 5.016003 Norm: 1.14
From the System.PerlDependencyReport: 5.016003
BigTopic 10 times, average: 0.8192
======== 5.18.4 ===========
perl: 5.018004 Norm: 1.16
From the System.PerlDependencyReport: 5.018004
BigTopic 10 times, average: 0.7956
======== 5.20.3 ===========
perl: 5.020003 Norm: 1.17
From the System.PerlDependencyReport: 5.020003
BigTopic 10 times, average: 0.6753
======== 5.22.0 ===========
perl: 5.022000 Norm: 1.18
From the System.PerlDependencyReport: 5.022000
BigTopic 10 times, average: 0.7378
======== 5.23.5 ===========
perl: 5.023005 Norm: 1.23
From the System.PerlDependencyReport: 5.023005
BigTopic 10 times, average: 0.6781
In table form:
| perl | Unicode::Normalize | time, full NFC (s) | time, partial NFC (s) | difference (s) |
| 5.016003 | 1.14 | 1.2498 | 0.8192 | 0.4306 |
| 5.018004 | 1.16 | 1.0938 | 0.7956 | 0.2982 |
| 5.020003 | 1.17 | 0.8872 | 0.6753 | 0.2119 |
| 5.022000 | 1.18 | 3.7704 | 0.7378 | 3.0326 |
| 5.023005 | 1.23 | 0.8783 | 0.6781 | 0.2002 |

For a big 1200-line topic where every character is Unicode (120k characters), the slowdown is about 0.3 s, except for one perl version. (This is a bug - more info here: https://rt.cpan.org/Public/Bug/Display.html?id=102766 ). In short: up to v1.17 the module is XS-based, from 1.18 it is pure perl, and from 1.23 it is XS-based again.
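A site could warn about the slow releases with a simple version check. This is a sketch only: the affected range (1.18 up to, but not including, 1.23) is taken from the statement above and is not verified independently, and is_slow_normalize_version is a made-up name.

```perl
use strict;
use warnings;
use version;
use Unicode::Normalize ();

# Assumption from this discussion: 1.18 .. 1.22 are pure perl and slow
sub is_slow_normalize_version {
    my ($v) = @_;
    my $ver = version->parse($v);
    return ( $ver >= version->parse('1.18') && $ver < version->parse('1.23') )
      ? 1
      : 0;
}

my $installed = $Unicode::Normalize::VERSION;
print "Unicode::Normalize $installed: ",
  ( is_slow_normalize_version($installed)
    ? "in the range reported as slow"
    : "not in the range reported as slow" ),
  "\n";
```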

Result: the speed difference is roughly 0.2-0.4 sec (0.20 to 0.43 in the table above) on the BUGFREE (XS) versions.

-- JozefMojzis - 14 Dec 2015

Here is one more patch to test before I merge it. It enables NFC normalization for filenames. I recreated the Item13405 branch and pushed it there. It uses a setting on the Store tab: NFC Normalize Filenames.

-- GeorgeClark - 23 Dec 2015

Oh... I also don't do any normalization in Configure, so you might find some issues if Foswiki is installed in an NFD Unicode directory name. Please check if anything is needed there too.

-- GeorgeClark - 23 Dec 2015

I checked out Foswiki (on branch Item13405) into a directory whose path contains the "ňfď" directory: /me/fw/sites/git/ňfď/foswiki/core

Pseudoinstall OK. After a few routine checks (topics, attachments, renames and the like) everything seems to be working.

Configure: works OK from the web; I can even change a filename to Unicode, like:

{LocalSitePreferences} to $Foswiki::cfg{UsersWebName}.SitePreferencesŽuŽu

and it works OK. Configure from the command line works too; the user just must not forget to specify perl -CA when using Unicode arguments. It would be nice if configure could check a few $ENV{LC_something} variables to determine the shell's locale and, if it finds utf-8 in some LC_ variable, apply decode_utf8 to @ARGV. Or something similar.
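Such a locale check might look roughly like the sketch below. Both helper names are hypothetical and nothing here is existing configure code; the variables are consulted in the usual POSIX precedence order (LC_ALL, then LC_CTYPE, then LANG).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true when the effective locale names a UTF-8 charset.
# Takes a hash ref (normally \%ENV) so that it can be tested in isolation.
sub locale_is_utf8 {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $val = $env->{$var};
        next unless defined $val && length $val;
        return $val =~ /utf-?8/i ? 1 : 0;    # the first variable that is set wins
    }
    return 0;
}

# Hypothetical helper: decode arguments only under a UTF-8 locale
sub decode_args {
    my ( $env, @args ) = @_;
    return @args unless locale_is_utf8($env);
    return map { Encode::decode_utf8($_) } @args;
}

my @args = decode_args( \%ENV, @ARGV );
```

The decoded arguments would still need NFC normalisation afterwards, like any other string entering from outside.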

Strangely, configure saves e.g.:
$Foswiki::cfg{PubDir} = "/me/fw/sites/git/n\x{30c}fd\x{30c}/foswiki/core/pub";
$Foswiki::cfg{LocalSitePreferences} = "\$Foswiki::cfg{UsersWebName}.SitePreferences\x{17d}u\x{17d}u";
i.e. an NFD directory name and an NFC filename, but it doesn't seem to hurt (at least not yet - only basic checks done).
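For illustration, this is what NFC does to the saved PubDir value. The path is the one from the configuration above; whether and where configure should apply this is not decided by this sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# PubDir exactly as saved above: "n" and "d" each followed by U+030C
my $pubdir = "/me/fw/sites/git/n\x{30c}fd\x{30c}/foswiki/core/pub";
my $nfc    = NFC($pubdir);

printf "saved length: %d characters\n", length($pubdir);
printf "NFC length:   %d characters\n", length($nfc);
print "combining carons left after NFC: ", ( $nfc =~ /\x{30c}/ ? "yes" : "no" ), "\n";
```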

-- JozefMojzis - 23 Dec 2015

The unit tests don't work - many errors - but I have never run the tests successfully, so maybe I'm doing something wrong. Here is the log, if interested: http://foswiki.org/pub/Sandbox/UnitTestLog/TESTLOG

-- JozefMojzis - 23 Dec 2015

It looks like the unit tests are not set up to run in anything but English. They check string responses from various tests and would require a lot of work to make them multi-lingual.

As for the NFD directory with an NFC filename: I'll bet bootstrap needs to NFC the detected paths. I'll look at that.

-- GeorgeClark - 23 Dec 2015

Pushed. NFC normalize the paths detected in bootstrap.

-- GeorgeClark - 23 Dec 2015

For the unit tests, try running them with a "default" configuration. (Remove LocalSite.cfg and run pseudo_install.pl -A ) I've made a couple of fixes for the wide char in print issues.

-- GeorgeClark - 24 Dec 2015

After the pull of ac918e07a59edb588d71d4687b8b0290c3c25938 my configure shows this.

screenshot 10.png

Note the missing {Store}{Implementation} too.

-- Main.JozefMojzis - 31 Dec 2015

This needs some more debugging if possible jomo, it's all working fine here. Just bootstrapped again, and everything is clean. No issues with Store Implementation either. Maybe something specific to the NFD detection? I didn't have a file system to test with. The Store Implementation issue is quite baffling, as that didn't change.

If you could either run a "git blame", or just go backwards a couple of commits to see what introduced the problem, that would be helpful.

-- GeorgeClark - 31 Dec 2015

OK, will try to figure out what is wrong with my checkout. Next year.

-- JozefMojzis - 31 Dec 2015

Two problems:
  • After changing readdir to _readdir, findPackages was unable to find packages when Foswiki is installed in a directory containing Unicode characters.
  • Also, the checking routine is missing one NFD call, so it always reports failure.

I modified the patch as follows; it is tested and works on OS X and detects NFD. Of course, you will probably want to rewrite it in Foswiki's "perl style".

However, it still doesn't do what I want: when it detects NFD it should set {NFCNormalizeFilenames} to 1 automatically, and currently it doesn't, so it still has to be checked manually. Unfortunately, the configure logic is too hard for me to understand, but at least it works "somewhat".

diff --git a/core/lib/Foswiki.spec b/core/lib/Foswiki.spec
index 3bfb40e..c21dedd 100644
--- a/core/lib/Foswiki.spec
+++ b/core/lib/Foswiki.spec
@@ -1359,7 +1359,7 @@ $Foswiki::cfg{PluralToSingular} = $TRUE;
 
 # **BOOLEAN LABEL="NFC Normalize Filenames" EXPERT **
 # Foswiki uses NFC normalization for all network operations, but assumes
-# that the file system is also NFC normalized.  Some systems such as OSx
+# that the file system is also NFC normalized.  Some systems such as OS X
 # enforce NFD normalization for filenames.  If Foswiki is installed on one
 # of these sysetms, or accesses such a system via a remote file system
 # like NFS, then all directory / filename read operations must be NFC
diff --git a/core/lib/Foswiki/Configure/Checkers/NFCNormalizeFilenames.pm b/core/lib/Foswiki/Configure/Checkers/NFCNormalizeFilenames.pm
index 4db5fbf..3ace4fd 100755
--- a/core/lib/Foswiki/Configure/Checkers/NFCNormalizeFilenames.pm
+++ b/core/lib/Foswiki/Configure/Checkers/NFCNormalizeFilenames.pm
@@ -6,48 +6,26 @@ use warnings;
 
 use Encode;
 use Unicode::Normalize;
+use Foswiki::Configure::FileUtil ();
 
 use Foswiki::Configure::Checker ();
 our @ISA = ('Foswiki::Configure::Checker');
 
 sub check_current_value {
     my ( $this, $reporter ) = @_;
-    my $e;
 
-# Determine if the file system is NFC or NFD.
-# Write a UTF8 filename to the data directory, and then read the directory.
-# If the filename is returned in NFD format, then the NFCNormalizeFilename flag is enabled.
+    my $nfcok = Foswiki::Configure::FileUtil::canNfcFilenames($Foswiki::cfg{DataDir});
 
-    my $testfile = 'ČáŘý.testCfgNFC';
-    if (
-        open(
-            my $F, '>', Encode::encode_utf8("$Foswiki::cfg{DataDir}/$testfile")
-        )
-      )
-    {
-        close($F);
-        opendir( my $dh, Encode::encode_utf8( $Foswiki::cfg{DataDir} ) )
-          or die $!;
-        my @list = grep { /testCfgNFC/ }
-          map { Encode::decode_utf8($_) } readdir($dh);
-        if ( scalar @list && $list[0] eq $testfile ) {
-            $e .= $reporter->NOTE("NFC Data Storage Detected");
-            $Foswiki::cfg{NFCNormalizeFilenames} = 0;
-        }
-        else {
-            if ( scalar @list && $testfile eq $list[0] ) {
-                $e .= $reporter->NOTE("NFD Data Storage Detected");
-                $e .= $reporter->ERROR(
-"Filename Normalization should be enabled on NFD File Systems."
-                ) unless ( $Foswiki::cfg{NFCNormalizeFilenames} );
-            }
-            else {
-                $e .= $reporter->WARN(
-"Unable to detect Normalization. Read/write of test file failed."
-                );
-            }
-        }
-        unlink "$Foswiki::cfg{DataDir}/$testfile";
+    if ( defined $nfcok && $nfcok == 1 ) {
+        $reporter->NOTE("Data Storage allows NFC filenames");
+    }
+    elsif ( defined($nfcok) && $nfcok == 0 ) {
+        $reporter->NOTE("Data Storage enforces NFD filenames");
+        $reporter->WARN( "Filename Normalization should be enabled on NFD File Systems.")
+            unless ( $Foswiki::cfg{NFCNormalizeFilenames} );
+    }
+    else {
+        $reporter->ERROR( "Unable to detect Normalization." );
     }
 }
 
diff --git a/core/lib/Foswiki/Configure/FileUtil.pm b/core/lib/Foswiki/Configure/FileUtil.pm
index 78b4f48..963be72 100644
--- a/core/lib/Foswiki/Configure/FileUtil.pm
+++ b/core/lib/Foswiki/Configure/FileUtil.pm
@@ -12,6 +12,7 @@ Basic file utilities used by Configure and admin scripts
 
 use strict;
 use warnings;
+use utf8;
 
 use Assert;
 
@@ -157,7 +158,8 @@ sub findPackages {
     $pattern =~ s/\*/.*/g;
     my @path = split( /::/, $pattern );
 
-    my $places = \@INC;
+    my @NFCINC = map { NFC( decode_utf8($_) ) } @INC;
+    my $places = \@NFCINC;
     my $dir;
 
     while ( scalar(@path) > 1 && @$places ) {
@@ -975,6 +977,46 @@ sub rewriteShebang {
     return '';
 }
 
+=begin TML
+
+---++ StaticMethod canNfcFilenames($testdir)
+Determine if the file system is NFC or NFD.
+Write a UTF8 filename to the data directory, and then read the directory.
+If the filename is returned in NFD format, then the NFCNormalizeFilename flag is enabled.
+
+returns:
+   * 1 if NFC filenames are accepted by the filesystem
+   * 0 if the NFC is converted to NFD
+   * undef in any other case (errors)
+
+=cut
+
+sub canNfcFilenames {
+    my $testdir = shift;
+
+    die "missing argument in canNfcFilenames" unless $testdir;
+    #die as BUG if the testdir contains non-ascii characters and it isn't unicode string
+    die "CORE bug, got a [$testdir] as bytes" if( $testdir =~ /\P{Ascii}/ && !utf8::is_utf8($testdir) );
+
+    my $ext = '.CfgNfcTmpFile';
+    my $testname = 'ÁčňÖüß'.$ext;
+    my $fullpath = NFC(File::Spec->catfile($testdir,$testname));   #ensure full NFC path
+    my $fsnorm;
+
+    if( open my $fd, '>', $fullpath ) {
+        close $fd;
+        opendir my $dh, $testdir or return; #or die?
+        my @list = grep { /$ext/ } map { decode_utf8($_) } readdir $dh;
+        closedir $dh;
+        return unless @list;    #or die?
+        #what if @list > 1 ??
+        $fsnorm = ( $list[0] eq $testname ) ? 1 : ($list[0] eq NFD($testname)) ? 0 : undef;
+        unlink $fullpath;
+    }
+    return $fsnorm;
+}
+
+
 1;
 __END__
 Foswiki - The Free and Open Source Wiki, http://foswiki.org/
diff --git a/core/lib/Foswiki/Configure/Load.pm b/core/lib/Foswiki/Configure/Load.pm
index aede2e9..1c3659a 100644
--- a/core/lib/Foswiki/Configure/Load.pm
+++ b/core/lib/Foswiki/Configure/Load.pm
@@ -16,7 +16,6 @@ package Foswiki::Configure::Load;
 
 use strict;
 use warnings;
-use utf8;    # Needed to probe NFC/NFD filesystem
 
 use Cwd qw( abs_path );
 use Assert;
@@ -635,38 +634,18 @@ sub _bootstrapStoreSettings {
         }
     }
 
-# Determine if the file system is NFC or NFD.
-# Write a UTF8 filename to the data directory, and then read the directory.
-# If the filename is returned in NFD format, then the NFCNormalizeFilename flag is enabled.
-
-    my $testfile = 'ČáŘý.testCfgNFC';
-    if (
-        open(
-            my $F, '>', Encode::encode_utf8("$Foswiki::cfg{DataDir}/$testfile")
-        )
-      )
-    {
-        close($F);
-        opendir( my $dh, Encode::encode_utf8( $Foswiki::cfg{DataDir} ) )
-          or die $!;
-        my @list = grep { /testCfgNFC/ }
-          map { Encode::decode_utf8($_) } readdir($dh);
-        if ( scalar @list && $list[0] eq $testfile ) {
-            print STDERR "AUTOCONFIG: NFC Data Storage Detected\n" if (TRAUTO);
-            $Foswiki::cfg{NFCNormalizeFilenames} = 0;
-        }
-        else {
-            if ( scalar @list && NFD($testfile) eq $list[0] ) {
-                print STDERR "AUTOCONFIG: NFD Data Storage Detected\n"
-                  if (TRAUTO);
-                $Foswiki::cfg{NFCNormalizeFilenames} = 1;
-            }
-            else {
-                print STDERR
-                  "AUTOCONFIG: WARNING: Unable to detect Normalization.\n";
-            }
-        }
-        unlink "$Foswiki::cfg{DataDir}/$testfile";
+    my $nfcok = Foswiki::Configure::FileUtil::canNfcFilenames($Foswiki::cfg{DataDir});
+    if ( defined $nfcok && $nfcok == 1 ) {
+        print STDERR "AUTOCONFIG: Data Storage allows NFC filenames\n" if (TRAUTO);
+        $Foswiki::cfg{NFCNormalizeFilenames} = 0;
+    }
+    elsif ( defined($nfcok) && $nfcok == 0 ) {
+        print STDERR "AUTOCONFIG: Data Storage enforces NFD filenames\n" if (TRAUTO);
+        $Foswiki::cfg{NFCNormalizeFilenames} = 1; #the configure's interface still shows unchecked - so, don't understand.. ;(
+    }
+    else {
+        print STDERR "AUTOCONFIG: WARNING: Unable to detect Normalization.\n";
+        $Foswiki::cfg{NFCNormalizeFilenames} = 1;   #enable too - safer as none
     }
 }
 

-- JozefMojzis - 01 Jan 2016

Thanks JozefMojzis for the rewrite. As for setting the NFCNormalize flag, that can only be done during bootstrap; a checker can never change the configuration. Ah... the value in Foswiki.spec needs to be commented out - bootstrapped settings never have a default.

I'll work it in and merge it all into master. Thanks!

-- GeorgeClark - 01 Jan 2016

Back to you. Incorporated your rewrite with some tweaks. Hopefully the flag in LSC will be bootstrapped correctly now.

Added NFC checks to the File System advanced checks. This should pick up on any issues where a directory is symlinked to an NFD-based file system.

-- Main.GeorgeClark - 02 Jan 2016

Fresh configuration for 2fdc6c1002085372f84c1a9054e423eba1d5f707 in a directory with Unicode in the path. TREEVIEW and other basic things work as expected.

-- JozefMojzis - 02 Jan 2016

Okay Thanks ... merged into master.

-- GeorgeClark - 03 Jan 2016
 

ItemTemplate

Summary Unicode Normalisation
ReportedBy JozefMojzis
Codebase 2.0.3, 2.0.2, 2.0.1, 2.0.0, trunk
SVN Range
AppliesTo Engine
Component I18N, Unicode
Priority Normal
CurrentState Closed
WaitingFor
Checkins ModPerlEngineContrib:6f0fe8ab6e70 FastCGIEngineContrib:4879ca5e0fb8 FastCGIEngineContrib:ef6f7417c19d distro:ee6fd5c3595d distro:57a2ddc335db distro:46c72a645194 distro:421aa603b459 distro:88e4f4b679d2 distro:0930aabb36b0 distro:d2e86890d298 distro:b1e66afabc15 distro:7475af5112cd distro:778ea0fb6f7a distro:3afce45cbc02 distro:a50caced1bd7 distro:678f28d9aeb8 distro:93a3f9d67ffa distro:3177833a7680 distro:c9e65de9af66 distro:c9bd4c203fbb distro:0ef26750b402 distro:571a35935ed8 distro:1db9068f2b41 distro:be33f8f40df9 distro:6f5332c3b245 distro:3f1cfd9be2d3 distro:86d5f50f15c2 distro:3a4a220afa3c distro:f43bcf550676 distro:ac918e07a59e distro:2fdc6c100208 distro:6510e317cf50
TargetRelease minor
ReleasedIn 2.1.0
CheckinsOnBranches master Item13525 Item13405 Item13897
trunkCheckins
masterCheckins ModPerlEngineContrib:6f0fe8ab6e70 FastCGIEngineContrib:4879ca5e0fb8 FastCGIEngineContrib:ef6f7417c19d distro:ee6fd5c3595d distro:57a2ddc335db distro:46c72a645194 distro:421aa603b459 distro:3afce45cbc02 distro:a50caced1bd7 distro:678f28d9aeb8 distro:93a3f9d67ffa distro:3177833a7680 distro:c9e65de9af66 distro:c9bd4c203fbb distro:0ef26750b402 distro:571a35935ed8 distro:1db9068f2b41 distro:be33f8f40df9 distro:6f5332c3b245 distro:3f1cfd9be2d3 distro:86d5f50f15c2 distro:3a4a220afa3c distro:f43bcf550676 distro:ac918e07a59e distro:2fdc6c100208 distro:6510e317cf50
ItemBranchCheckins distro:3a6e9fd1a139 distro:00bfab13e9c6 distro:d8e7b93b1044 distro:88e4f4b679d2 distro:0930aabb36b0 distro:d2e86890d298 distro:b1e66afabc15 distro:7475af5112cd distro:778ea0fb6f7a distro:571a35935ed8 distro:1db9068f2b41 distro:be33f8f40df9 distro:6f5332c3b245 distro:3f1cfd9be2d3 distro:86d5f50f15c2 distro:3a4a220afa3c distro:f43bcf550676 distro:ac918e07a59e distro:2fdc6c100208 distro:6510e317cf50
Release02x00Checkins
Release01x01Checkins
Topic attachments
| Attachment | Size | Date | Who | Comment |
| nytprof_a.out.15001 | 3 MB | 13 Dec 2015 - 19:27 | JozefMojzis | nytprof result on OS X for the "master" for the 1200 line BigFile |
| nytprof_n.out.15199 | 3 MB | 13 Dec 2015 - 19:30 | JozefMojzis | the result of nytprof on OS X for the HEAD detached at 4554e77 |
| screenshot_10.png | 213 K | 31 Dec 2015 - 16:28 | JozefMojzis | configure error after the pull ... c3c25938 |
| screenshot_71.png | 29 K | 15 May 2015 - 16:31 | JozefMojzis | Two files with the same name |
Topic revision: r42 - 03 Feb 2016, GeorgeClark

The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License