Strategy for Regular Expression Support

Foswiki strives to support the rich Perl regular expression syntax for end users, for example in searching. However, because Foswiki has to interface with third party tools and libraries, it is not always possible to support all the features of Perl regular expressions in all places.

Any developer who implements an interface to such a third-party tool must make every effort to map all the functionality of Perl regular expressions to the tool. The following table lists the features of Perl regular expressions that are understood to be supported by a number of common third-party tools. The features are chosen from those described in http://www.regular-expressions.info/refflavors.html.

The Required by Foswiki column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Loss of one or more of the features in this column will affect Foswiki (or one or more important extensions) functionality in some way. This won't necessarily make Foswiki unusable, but should be borne in mind. Note also that Foswiki internal regexes may use meta-syntax that might need to be escaped/modified for different regex flavours e.g. ( and \(.

Developers working with regular expressions must take great care, when exposing features of non-Perl regular expressions to end users, that they don't use features which are sparsely supported.

Sorted ascending Perl Regex Feature Required by Foswiki Often used in searches PCRE Sort Java XPath GNU ERE XML POSIX ERE Sort GNU BRE POSIX BRE .NET
edit
Single-character-generating escapes
edit
Backslash escapes one metacharacter
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-yes
edit
\d shorthand for digits
choice-yes
choice-yes
ascii
ascii
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
\w shorthand for word characters
choice-yes
choice-yes
ascii
ascii
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-yes
edit
\s shorthand for whitespace
choice-yes
choice-yes
ascii
ascii
ascii
choice-yes
ascii
choice-no
choice-yes
choice-no
choice-yes
edit
\D, \W and \S shorthand negated character classes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-yes
edit
\x00 through \xFF (ASCII character)
choice-yes
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\n (LF), \r (CR) and \t (tab)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
. (dot; any character except line break)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
edit
\Q...\E escapes a string of metacharacters
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\f (form feed) and \v (vtab)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\a (bell) and \e (escape)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\cA through \cZ (control character)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\ca through \cz (control character)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
Character classes features
edit
[abc] character class
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
edit
[^abc] negated character class
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
edit
[a-z] character class range
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
edit
Hyphen in [\d-z] is a literal
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
?
edit
Backslash escapes one character class metacharacter
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
\Q...\E escapes a string of character class metacharacters
choice-yes
Java 6
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
[\b] backspace
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
[:alpha:] POSIX character class
ascii
choice-no
choice-no
choice-yes
choice-no
choice-yes
choice-yes
choice-yes
choice-no
edit
\p{IsAlpha} POSIX character class
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
Anchors
edit
\b (at the beginning or end of a word)
choice-yes
ascii
choice-yes
choice-no
ascii
choice-no
choice-no
ascii
choice-no
choice-yes
edit
\B (NOT at the beginning or end of a word)
ascii
choice-yes
choice-no
ascii
choice-no
choice-no
ascii
choice-no
choice-yes
edit
^ (start of string/line)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-yes
choice-yes
choice-yes
edit
$ (end of string/line)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-yes
choice-yes
choice-yes
edit
\A (start of string)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\Z (end of string, before final line break)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\z (end of string)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
Grouping, references and quantifiers
edit
(regex) (numbered capturing group)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\( \)
\( \)
choice-yes
edit
\1 through \9 (backreferences)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-no
choice-yes
choice-yes
choice-yes
edit
| (alternation)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\|
choice-no
choice-yes
edit
? (0 or 1)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\?
choice-no
choice-yes
edit
* (0 or more)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
edit
+ (1 or more)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\+
choice-no
choice-yes
edit
{n} (exactly n)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\{n\}
\{n\}
choice-yes
edit
{n,m} (between n and m)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\{n,m\}
\{n,m\}
choice-yes
edit
{n,} (n or more)
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
choice-yes
\{n,\}
\{n,\}
choice-yes
edit
? after any of the above quantifiers to make it "lazy"
choice-yes
choice-yes
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?:regex) (non-capturing group)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\10 through \99 (backreferences)
choice-yes
choice-yes
choice-yes
choice-no
n/a
n/a
choice-no
choice-no
choice-yes
edit
Forward references \1 through \9
choice-yes
choice-yes
choice-no
choice-no
n/a
n/a
choice-no
choice-no
choice-no
edit
Nested references \1 through \9
choice-yes
choice-yes
choice-no
choice-no
n/a
n/a
choice-no
choice-no
?
edit
Backreferences non-existent groups are an error
choice-yes
choice-yes
choice-yes
choice-yes
n/a
n/a
choice-yes
choice-yes
?
edit
Backreferences to failed groups also fail
choice-yes
choice-yes
choice-yes
choice-yes
n/a
n/a
choice-yes
choice-yes
?
edit
(?>regex) (atomic group)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?=regex) (positive lookahead)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?!regex) (negative lookahead)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?<=text) (fixed length positive lookbehind)
choice-yes
finite length
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?<!text) (fixed length negative lookbehind)
choice-yes
finite length
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\G (start of match attempt)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?(?=regex)then|else) (using any lookaround)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
(?(1)then|else)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
Flags, Spacing and Comments
edit
(?i) (case insensitive)
choice-yes
choice-yes
choice-yes
flag
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?s) (dot matches newlines)
choice-yes
choice-yes
flag
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?m) (^ and $ match at line breaks)
choice-yes
choice-yes
flag
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?x) (free-spacing mode)
choice-yes
choice-yes
flag
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?-ismxn) (turn off mode modifiers)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?ismxn:group) (mode modifiers local to group)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
(?#comment)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
Free-spacing syntax supported
choice-yes
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
Character class is a single token
choice-yes
choice-no
choice-yes
n/a
n/a
n/a
n/a
n/a
?
edit
# starts a comment
choice-yes
choice-yes
choice-no
n/a
n/a
n/a
n/a
n/a
choice-yes
edit
Unicode support
edit
\X (Unicode grapheme)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\x{0} through \x{FFFF} (Unicode character)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\pL through \pC (Unicode properties)
choice-yes
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{L} through \p{C} (Unicode properties)
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
\p{Lu} through \p{Cn} (Unicode property)
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{IsL} through \p{IsC} (Unicode properties)
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{IsLu} through \p{IsCn} (Unicode property)
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{Letter} through \p{Other} (Unicode properties)
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property)
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{IsLetter} through \p{IsOther} (Unicode properties)
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property)
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{Arabic} through \p{Yi} (Unicode script)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{IsArabic} through \p{IsYi} (Unicode script)
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-yes
edit
\p{BasicLatin} through \p{Specials} (Unicode block)
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{InBasicLatin} through \p{InSpecials} (Unicode block)
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block)
choice-no
choice-no
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
Part between {} in all of the above is case insensitive
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
Spaces, hyphens and underscores allowed in all long names listed above
(e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)
choice-no
Java 5
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
edit
\P (negated variants of all \p as listed above)
choice-yes
choice-yes
choice-yes
choice-no
choice-yes
choice-no
choice-no
choice-no
choice-yes
edit
\p{^...} (negated variants of all \p{...} as listed above)
choice-yes
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no
choice-no

In the event that an external tool supports regular expression syntax that is not compatible with Perl, the calling code must defuse the regex feature that is not perl compatible. This may result in some loss of functionality, but is necessary to avoid confusing users.

† The PCRE library can compiled with Unicode support, but is not always. Check.

‡ MySQL/MariaDB do have options to incorporate PCRE rather than POSIX ERE. Indeed for MariaDB 10 (when finally GA) will include PCRE with Unicode as standard.

As can be seen there is great variability in regular expressions support. This is especially true of SQL interfaces to databases, where the ANSI standard for pattern matching is so pathetic that most databases support some extension. Even where standards (such as POSIX) have been implemented, they are at times arbitrarily constrained or extended. The following table provides a guideline as to what is supported in SQL by a number of common database implementations.

Database Native With extensions
ANSI
ANSI
As this is a MySQL fork the above library should also work here. See also MariaDB 10 PCRE
Oracle
Posix ERE with variations
PCRE may be possible see this reference. However I have not found any docs as to how this can be installed or how to use from within Oracle
Postgresql
ANSI, plus Tcl ARE, POSIX ERE, POSIX BRE depending on the SQL function used.
SQLite
ANSI
PCRE is usually added as an extension
Microsoft SQL Server
ANSI
DB2 LUW
Ansi
TODO: add Tcl ARE to the top table


Foswiki internally uses \Q...\E to disable metacharacters within regular expressions. A quick search of a foswiki install finds a number of places. Is the above list only meant to list features that are pushed down into regexes used by the Store and Search engines?

Foswiki/Configure/Checker.pm
Foswiki/Plugins/SmiliesPlugin.pm
Foswiki/Plugins/EditTablePlugin/Core.pm
Foswiki/Plugins/WysiwygPlugin/TML2HTML.pm
Foswiki/Plugins/WysiwygPlugin/HTML2TML/Node.pm
Foswiki/Plugins/TwistyPlugin.pm
Foswiki/Prefs.pm
Foswiki/Contrib/BuildContrib/Targets/manifest.pm
Foswiki/Macros/WEBLIST.pm
Foswiki/Macros/TOPICLIST.pm
Foswiki/Macros/LANGUAGES.pm
Unit/TestRunner.pm
CPAN/lib/Text/Patch.pm
CPAN/lib/Crypt/PasswdMD5.pm
CPAN/lib/Locale/Maketext/Extract/Plugin/Base.pm
Foswiki.pm

-- GeorgeClark - 24 Dec 2013

I can only imagine that this is about Store and Search engines, but CrawfordCurrie was the original author so he'll need to confirm.

To summarise the usual SQL suspects we have:

  • MySQL: Posix ERE, PCRE via library
  • MariaDB: Posix ERE, PCRE via library (Standard from version 10)
  • Oracle: Posix ERE
  • PostgreSQL: Tcl ARE

The following databases support regex only after installing an extra library.

I would suggest the possibility of pruning some of the columns above. It appears to me we only want columns for actual known targets.

I suppose we may end up with similar issues with any NoSQL type stores, but the active development is currently around SQL stores.

I've amended the table to highlight a few Foswiki required rows. I find the choice-yes for Foswiki required gets lost when scrolling up and down. Do you agree this approach highlights these rows better? In which case I'll complete the job and remove the Foswiki required column. Alternatively, I could add an explicit choice-no when Foswiki support is not required and another flag when we're not quite sure ;).

It seems to me that there are other rows that should be marked Foswiki required, e.g '.', but maybe CrawfordCurrie had good reason not to include it.

-- JulianLevens - 25 Dec 2013

Correct; as described in the intro, the intention is to document constraints imposed on regexes by external third party tools. The "Foswiki required" column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Since '.' is not used in composing this regex, then Foswiki can continue to operate without it.

I'm not a fan of the highlighted rows; it puts undue focus on the Foswiki requires column. What we should do, though, is identify a subset of PCRE that is (1) well supported across the range of search engines and (2) adequate for end-user searching. At the moment the doc is weasely about what regex features are available (RegularExpression) to an end user; it really ought to say something definitive like "POSIX ERE".

JulianLevens can you please update the "Databases" table with what you know about the different SQL implementations? Thanks.

Really we need a Task to reduce the number of regex features that core depends on. For example, [0-9] is more widely supported than \d.

I just noticed that the column for GNU BRE is very wrong. However I have to go and cook now.

-- CrawfordCurrie - 25 Dec 2013

That could be me. I fixed up the table formatting by replacing | with %VBAR%. I have since tried to make the POSIX ERE, POSIX BRE, GNU ERE and GNU BRE columns consistent with http://www.regular-expressions.info but that site appears to be inconsistent with http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html which does not specify the use of \} with GNU BRE. I also saw that POSIX ERE and BRE both support [[:<:]] instead of \b.

-- MichaelTempest - 25 Dec 2013

I've updated the database table.

Crawford, I was indeed confused about the meaning of the 'Foswiki required' column in that I thought it related to a minimum requirement for user search. As you say we need to actually define what a FW user can expect to work.

BTW: I've found that the standard backlinks regex used in a regexp SQL clause e.g.

select * from metaText where value regexp 'DefaultPreferences([^A-Za-z0-9]|$)|Default([^A-Za-z0-9]*)Preferences([^A-Za-z0-9]|$)|System.DefaultPreferences([^A-Za-z0-9]|$)' \G

Took 43s to complete, and because of MySQL caching data, subsequent calls are a rapid 0s. However, the 3 separate parts of the alternation are much quicker (< 1s each). So, why the significant time required? They are all table scans so it's shouldn't be an io bottleneck, which suggests that this regex via a POSIX ERE has some surprisingly poor performers. This will need more investigation, note that a simple search for '.' with original perl regex as a post filter is much faster.

MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default|Preferences'\G -- AutoDocket([^A-Za-z0-9]|$)|Auto([^A-Za-z0-9]*)Docket([^A-Za-z0-9]|$)|System.AutoDocket([^A-Za-z0-9]|$)' \G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (9.13 sec)

MariaDB [foswiki]> select count(fobid) from metaText where value regexp '.'\G
*************************** 1. row ***************************
count(fobid): 418800
1 row in set (0.53 sec)

MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default' or value regexp 'Preferences'\G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (0.76 sec)

I suspect it's related to this: http://www.regular-expressions.info/posix.html see the POSIX ERE Alternation Returns The Longest Match section.

Note that these timings are not the complete process with perl and DBI where all the data is transferred to perl for processing. I.e. reading 418800 string into perl for further filtering may still be the overall worst performer, but that's not how it feels.

Why is it rarely simple frown, sad smile ?

-- JulianLevens - 25 Dec 2013
Topic revision: r15 - 21 Feb 2014, CrawfordCurrie - This page was cached on 04 Sep 2015 - 16:36. Get a fresh version here.
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License