Foswiki on GitHub is open for business! Next release meeting: Monday October 13, 1300Z

Strategy for Regular Expression Support

Foswiki strives to support the rich Perl regular expression syntax for end users, for example in searching. However, because Foswiki has to interface with third party tools and libraries, it is not always possible to support all the features of Perl regular expressions in all places.

Any developer who implements an interface to such a third-party tool must make every effort to map all the functionality of Perl regular expressions to the tool. The following table lists the features of Perl regular expressions that are understood to be supported by a number of common third-party tools. The features are chosen from those described in http://www.regular-expressions.info/refflavors.html.

The Required by Foswiki column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Loss of one or more of the features in this column will affect Foswiki (or one or more important extensions) functionality in some way. This won't necessarily make Foswiki unusable, but should be borne in mind. Note also that Foswiki internal regexes may use meta-syntax that might need to be escaped/modified for different regex flavours e.g. ( and \(.

Developers working with regular expressions must take great care, when exposing features of non-Perl regular expressions to end users, that they don't use features which are sparsely supported.

Perl Regex Feature Required by Foswiki Often used in searches PCRE Sort Java XPath GNU ERE XML POSIX ERE Sort GNU BRE POSIX BRE .NET
Single-character-generating escapes
Backslash escapes one metacharacter choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-no choice-yes choice-yes
\d shorthand for digits choice-yes choice-yes ascii ascii choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
\w shorthand for word characters choice-yes choice-yes ascii ascii choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-yes
\s shorthand for whitespace choice-yes choice-yes ascii ascii ascii choice-yes ascii choice-no choice-yes choice-no choice-yes
\D, \W and \S shorthand negated character classes choice-yes   choice-yes choice-yes choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-yes
\x00 through \xFF (ASCII character) choice-yes   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\n (LF), \r (CR) and \t (tab) choice-yes choice-yes choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
. (dot; any character except line break)   choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes
\Q...\E escapes a string of metacharacters     choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\f (form feed) and \v (vtab)     choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\a (bell) and \e (escape)     choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\cA through \cZ (control character)     choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\ca through \cz (control character)     choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
Character classes features
[abc] character class choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes
[^abc] negated character class choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes
[a-z] character class range choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes
Hyphen in [\d-z] is a literal   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no ?
Backslash escapes one character class metacharacter   choice-yes choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
\Q...\E escapes a string of character class metacharacters   choice-yes Java 6 choice-no choice-no choice-no choice-no choice-no choice-no choice-no
[\b] backspace   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no
[:alpha:] POSIX character class   ascii choice-no choice-no choice-yes choice-no choice-yes choice-yes choice-yes choice-no
\p{IsAlpha} POSIX character class   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
Anchors
\b (at the beginning or end of a word) choice-yes ascii choice-yes choice-no ascii choice-no choice-no ascii choice-no choice-yes
\B (NOT at the beginning or end of a word)   ascii choice-yes choice-no ascii choice-no choice-no ascii choice-no choice-yes
^ (start of string/line)   choice-yes choice-yes choice-yes choice-yes choice-yes choice-no choice-yes choice-yes choice-yes choice-yes
$ (end of string/line)   choice-yes choice-yes choice-yes choice-yes choice-yes choice-no choice-yes choice-yes choice-yes choice-yes
\A (start of string)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\Z (end of string, before final line break)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\z (end of string)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
Grouping, references and quantifiers
(regex) (numbered capturing group) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \( \) \( \) choice-yes
\1 through \9 (backreferences) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-no choice-no choice-yes choice-yes choice-yes
| (alternation) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \| choice-no choice-yes
? (0 or 1) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \? choice-no choice-yes
* (0 or more) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes
+ (1 or more) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \+ choice-no choice-yes
{n} (exactly n) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \{n\} \{n\} choice-yes
{n,m} (between n and m) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \{n,m\} \{n,m\} choice-yes
{n,} (n or more) choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes choice-yes \{n,\} \{n,\} choice-yes
? after any of the above quantifiers to make it "lazy"   choice-yes choice-yes choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-yes
(?:regex) (non-capturing group)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\10 through \99 (backreferences)   choice-yes choice-yes choice-yes choice-no n/a n/a choice-no choice-no choice-yes
Forward references \1 through \9   choice-yes choice-yes choice-no choice-no n/a n/a choice-no choice-no choice-no
Nested references \1 through \9   choice-yes choice-yes choice-no choice-no n/a n/a choice-no choice-no ?
Backreferences non-existent groups are an error   choice-yes choice-yes choice-yes choice-yes n/a n/a choice-yes choice-yes ?
Backreferences to failed groups also fail   choice-yes choice-yes choice-yes choice-yes n/a n/a choice-yes choice-yes ?
(?>regex) (atomic group)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?=regex) (positive lookahead)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?!regex) (negative lookahead)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?<=text) (fixed length positive lookbehind)   choice-yes finite length choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?<!text) (fixed length negative lookbehind)   choice-yes finite length choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\G (start of match attempt)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?(?=regex)then|else) (using any lookaround)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
(?(1)then|else)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
Flags, Spacing and Comments
(?i) (case insensitive)   choice-yes choice-yes choice-yes flag choice-no choice-no choice-no choice-no choice-no choice-yes
(?s) (dot matches newlines)   choice-yes choice-yes flag choice-no choice-no choice-no choice-no choice-no choice-yes
(?m) (^ and $ match at line breaks)   choice-yes choice-yes flag choice-no choice-no choice-no choice-no choice-no choice-yes
(?x) (free-spacing mode)   choice-yes choice-yes flag choice-no choice-no choice-no choice-no choice-no choice-yes
(?-ismxn) (turn off mode modifiers)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?ismxn:group) (mode modifiers local to group)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
(?#comment)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
Free-spacing syntax supported   choice-yes choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-yes
Character class is a single token   choice-yes choice-no choice-yes n/a n/a n/a n/a n/a ?
# starts a comment   choice-yes choice-yes choice-no n/a n/a n/a n/a n/a choice-yes
Unicode support
\X (Unicode grapheme)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\x{0} through \x{FFFF} (Unicode character)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\pL through \pC (Unicode properties)   choice-yes choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{L} through \p{C} (Unicode properties)   choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
\p{Lu} through \p{Cn} (Unicode property)   choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{IsL} through \p{IsC} (Unicode properties)   choice-no choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{IsLu} through \p{IsCn} (Unicode property)   choice-no choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{Letter} through \p{Other} (Unicode properties)   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property)   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{IsLetter} through \p{IsOther} (Unicode properties)   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property)   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{Arabic} through \p{Yi} (Unicode script)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{IsArabic} through \p{IsYi} (Unicode script)   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-yes
\p{BasicLatin} through \p{Specials} (Unicode block)   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{InBasicLatin} through \p{InSpecials} (Unicode block)   choice-no choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block)   choice-no choice-no choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
Part between {} in all of the above is case insensitive   choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no
Spaces, hyphens and underscores allowed in all long names listed above
(e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)
  choice-no Java 5 choice-no choice-no choice-no choice-no choice-no choice-no choice-no
\P (negated variants of all \p as listed above)   choice-yes choice-yes choice-yes choice-no choice-yes choice-no choice-no choice-no choice-yes
\p{^...} (negated variants of all \p{...} as listed above)   choice-yes choice-no choice-no choice-no choice-no choice-no choice-no choice-no choice-no

In the event that an external tool supports regular expression syntax that is not compatible with Perl, the calling code must defuse the regex feature that is not perl compatible. This may result in some loss of functionality, but is necessary to avoid confusing users.

The PCRE library can compiled with Unicode support, but is not always. Check.

MySQL/MariaDB do have options to incorporate PCRE rather than POSIX ERE. Indeed for MariaDB 10 (when finally GA) will include PCRE with Unicode as standard.

As can be seen there is great variability in regular expressions support. This is especially true of SQL interfaces to databases, where the ANSI standard for pattern matching is so pathetic that most databases support some extension. Even where standards (such as POSIX) have been implemented, they are at times arbitrarily constrained or extended. The following table provides a guideline as to what is supported in SQL by a number of common database implementations.

Database Native With extensions
MySQL ANSI PCRE
MariaDB ANSI As this is a MySQL fork the above library should also work here. See also MariaDB 10 PCRE
Oracle Posix ERE with variations PCRE may be possible see this reference. However I have not found any docs as to how this can be installed or how to use from within Oracle
Postgresql ANSI, plus Tcl ARE, POSIX ERE, POSIX BRE depending on the SQL function used.  
SQLite ANSI PCRE is usually added as an extension
Microsoft SQL Server ANSI .NET, see http://www.codeproject.com/Articles/19502/A-T-SQL-Regular-Expression-Library-for-SQL-Server and http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
DB2 LUW Ansi PCRE Via User Defined Functions UDFs: http://www.ibm.com/developerworks/data/library/techarticle/0301stolze/0301stolze.html
TODO: add Tcl ARE to the top table


Foswiki internally uses \Q...\E to disable metacharacters within regular expressions. A quick search of a foswiki install finds a number of places. Is the above list only meant to list features that are pushed down into regexes used by the Store and Search engines?

Foswiki/Configure/Checker.pm
Foswiki/Plugins/SmiliesPlugin.pm
Foswiki/Plugins/EditTablePlugin/Core.pm
Foswiki/Plugins/WysiwygPlugin/TML2HTML.pm
Foswiki/Plugins/WysiwygPlugin/HTML2TML/Node.pm
Foswiki/Plugins/TwistyPlugin.pm
Foswiki/Prefs.pm
Foswiki/Contrib/BuildContrib/Targets/manifest.pm
Foswiki/Macros/WEBLIST.pm
Foswiki/Macros/TOPICLIST.pm
Foswiki/Macros/LANGUAGES.pm
Unit/TestRunner.pm
CPAN/lib/Text/Patch.pm
CPAN/lib/Crypt/PasswdMD5.pm
CPAN/lib/Locale/Maketext/Extract/Plugin/Base.pm
Foswiki.pm

-- GeorgeClark - 24 Dec 2013

I can only imagine that this is about Store and Search engines, but CrawfordCurrie was the original author so he'll need to confirm.

To summarise the usual SQL suspects we have:

  • MySQL: Posix ERE, PCRE via library
  • MariaDB: Posix ERE, PCRE via library (Standard from version 10)
  • Oracle: Posix ERE
  • PostgreSQL: Tcl ARE

The following databases support regex only after installing an extra library.

I would suggest the possibility of pruning some of the columns above. It appears to me we only want columns for actual known targets.

I suppose we may end up with similar issues with any NoSQL type stores, but the active development is currently around SQL stores.

I've amended the table to highlight a few Foswiki required rows. I find the choice-yes for Foswiki required gets lost when scrolling up and down. Do you agree this approach highlights these rows better? In which case I'll complete the job and remove the Foswiki required column. Alternatively, I could add an explicit choice-no when Foswiki support is not required and another flag when we're not quite sure ;).

It seems to me that there are other rows that should be marked Foswiki required, e.g '.', but maybe CrawfordCurrie had good reason not to include it.

-- JulianLevens - 25 Dec 2013

Correct; as described in the intro, the intention is to document constraints imposed on regexes by external third party tools. The "Foswiki required" column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Since '.' is not used in composing this regex, then Foswiki can continue to operate without it.

I'm not a fan of the highlighted rows; it puts undue focus on the Foswiki requires column. What we should do, though, is identify a subset of PCRE that is (1) well supported across the range of search engines and (2) adequate for end-user searching. At the moment the doc is weasely about what regex features are available (RegularExpression) to an end user; it really ought to say something definitive like "POSIX ERE".

JulianLevens can you please update the "Databases" table with what you know about the different SQL implementations? Thanks.

Really we need a Task to reduce the number of regex features that core depends on. For example, [0-9] is more widely supported than \d.

I just noticed that the column for GNU BRE is very wrong. However I have to go and cook now.

-- CrawfordCurrie - 25 Dec 2013

That could be me. I fixed up the table formatting by replacing | with %VBAR%. I have since tried to make the POSIX ERE, POSIX BRE, GNU ERE and GNU BRE columns consistent with http://www.regular-expressions.info but that site appears to be inconsistent with http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html which does not specify the use of \} with GNU BRE. I also saw that POSIX ERE and BRE both support [[:<:]] instead of \b.

-- MichaelTempest - 25 Dec 2013

I've updated the database table.

Crawford, I was indeed confused about the meaning of the 'Foswiki required' column in that I thought it related to a minimum requirement for user search. As you say we need to actually define what a FW user can expect to work.

BTW: I've found that the standard backlinks regex used in a regexp SQL clause e.g.

select * from metaText where value regexp 'DefaultPreferences([^A-Za-z0-9]|$)|Default([^A-Za-z0-9]*)Preferences([^A-Za-z0-9]|$)|System.DefaultPreferences([^A-Za-z0-9]|$)' \G

Took 43s to complete, and because of MySQL caching data, subsequent calls are a rapid 0s. However, the 3 separate parts of the alternation are much quicker (< 1s each). So, why the significant time required? They are all table scans so it's shouldn't be an io bottleneck, which suggests that this regex via a POSIX ERE has some surprisingly poor performers. This will need more investigation, note that a simple search for '.' with original perl regex as a post filter is much faster.

MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default|Preferences'\G -- AutoDocket([^A-Za-z0-9]|$)|Auto([^A-Za-z0-9]*)Docket([^A-Za-z0-9]|$)|System.AutoDocket([^A-Za-z0-9]|$)' \G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (9.13 sec)

MariaDB [foswiki]> select count(fobid) from metaText where value regexp '.'\G
*************************** 1. row ***************************
count(fobid): 418800
1 row in set (0.53 sec)

MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default' or value regexp 'Preferences'\G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (0.76 sec)

I suspect it's related to this: http://www.regular-expressions.info/posix.html see the POSIX ERE Alternation Returns The Longest Match section.

Note that these timings are not the complete process with perl and DBI where all the data is transferred to perl for processing. I.e. reading 418800 string into perl for further filtering may still be the overall worst performer, but that's not how it feels.

Why is it rarely simple frown ?

-- JulianLevens - 25 Dec 2013
Topic revision: r15 - 21 Feb 2014, CrawfordCurrie
 
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. see CopyrightStatement. Creative Commons License