
Item401: OSX fails unit tests - notably UTF8 seg faults.

Priority: Normal
Current State: Closed
Released In: 1.0.0
Target Release: patch
Applies To: Engine
Component:
Branches:
Reported By: Foswiki:Main.SvenDowideit
Waiting For:
Last Change By: KwangErnLiew
Weird thing: it seems to me that /($regex{validUtf8StringRegex})*/ causes it to barf, so perhaps it's a resource limit that's making it go?

The patch below stops it, and I presume that, while horrid, it still works?
Index: /Users/svend/Sites/foswiki/core/lib/Foswiki.pm
===================================================================
--- Foswiki.pm   (revision 1184)
+++ Foswiki.pm   (working copy)
@@ -521,7 +521,10 @@
     return undef if ( $text =~ $regex{validAsciiStringRegex} );
 
     # If not UTF-8 - assume in site character set, no conversion required
-    return undef unless ( $text =~ $regex{validUtf8StringRegex} );
+    #return undef unless ( $text =~ $regex{validUtf8StringRegex} );  #<-- this seg faults on OSX leopard.
+    my $trial = $text;
+    $trial =~ s/$regex{validUtf8CharRegex}//g;
+    return unless (length($trial) == 0);
 
     # If site charset is already UTF-8, there is no need to convert anything:
     if ( $Foswiki::cfg{Site}{CharSet} =~ /^utf-?8$/i ) {
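
The idea behind the patch can be tried in isolation with the minimal sketch below: instead of matching the whole string against one large quantified pattern, it strips one well-formed UTF-8 character at a time and checks that nothing is left. The `$utf8_char` pattern is an assumed stand-in for `$regex{validUtf8CharRegex}` (the real definition in Foswiki.pm is not shown here), and `looks_like_utf8` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed stand-in for $regex{validUtf8CharRegex}: matches exactly one
# well-formed UTF-8 encoded character (ASCII included), operating on bytes.
my $utf8_char = qr{
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
}x;

# Hypothetical helper mirroring the patched lines: true if every byte of
# $text belongs to a well-formed UTF-8 character.
sub looks_like_utf8 {
    my ($text) = @_;
    my $trial = $text;             # work on a copy, as the patch does
    $trial =~ s/$utf8_char//g;     # remove each valid character in one pass
    return length($trial) == 0 ? 1 : 0;
}
```

Because s///g makes a single left-to-right pass and never rescans the text it has already joined up, any byte that does not start a valid character is left behind, so an empty remainder means the whole string was valid.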

Sorry, I can't really comment usefully here. I have been avoiding this piece of code like the plague. Richard is probably the only person who fully understands it.

I'm just wondering (and apparently I'm not the only one), why we're using a regexp where we could directly use Encode.

I know Encode is another module to require, thus another piece of code that gets loaded, but anyway some modules already require it (such as Wysiwyg).

In my humble opinion, if we want to go UTF-8, we will have to use a proper tool to do it, and Encode seems the appropriate choice.

Re-inventing the wheel using regexp can work, but...

Also, Encode uses XS, so it is much quicker than a regexp at achieving the same thing.

Funny: http://develop.twiki.org/trac/changeset/17776 Item6146: Adding Encode as a required CPAN module

Encode was first released with perl 5.007003 (patchlevel perl/15039, released on 2002-03-05)

But according to people using it, it makes no sense doing UTF-8 with anything older than perl 5.8.3.
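
For reference, an Encode-based validity check could look roughly like the sketch below. It is only an illustration of the suggestion, not the change that was committed; `is_valid_utf8` is a hypothetical helper name, and the sketch assumes a Perl new enough to ship Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    local $@;             # do not clobber the caller's $@
    # Strict 'UTF-8' decoding with FB_CROAK dies on malformed input;
    # the eval turns that into a false return value.
    my $ok = eval {
        Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() );
        1;
    };
    return $ok ? 1 : 0;
}
```

A caller could then write `return undef unless is_valid_utf8($text);` in place of the regex test, assuming $text still holds the raw bytes from the URL.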

The patch above does fix the segfault on my mac.

I've committed it, and am adding a new task for Olivier and his Encode replacement work: Item438.

I'm going to close this, even though there are still unit test failures. The remaining ones are the 'rename topic' issues, which are not OSX specific: they will occur on any case-insensitive file system, notably on Windows too (Item439).

-- SvenDowideit - 12 Dec 2008

If we are not supporting Perl 5.6 at all, we can just use Encode for this, or use Perl's feature to do the same check (see TWiki:Codev.UTF8 re the security part - can we get this page pulled into Foswiki btw?). I suspect the performance benefit of using Encode is tiny, if there is any, as this code only processes a small part of the URL (topic and web names), and the cost is only paid by sites with UTF-8 URLs, as I believe there's an earlier check for pure ASCII.

That regex was mostly used to work across 5.6 and 5.8, but it's clearly not that easy to read, and if we are only supporting 5.8 (for all sites, not just those with UTF-8 turned on) then Encode is the way to go.

-- RichardDonkin - 13 Dec 2008

ItemTemplate

Summary OSX fails unit tests - notably UTF8 seg faults.
ReportedBy Foswiki:Main.SvenDowideit
Codebase trunk
SVN Range TWiki-4.2.3, Wed, 06 Aug 2008, build 17396
AppliesTo Engine
Component
Priority Normal
CurrentState Closed
WaitingFor
Checkins distro:cc0f840cb215 distro:d1bd0da5dfb0 distro:0662369edc69 distro:a5df9523e7d5 distro:ceaa56226c9f
TargetRelease patch
ReleasedIn 1.0.0
Topic revision: r16 - 08 Jan 2009, KwangErnLiew