Item10352: Query optimizer regexes beyond simple plain text fail

Priority: Urgent
Current State: Closed
Released In: 1.1.3
Target Release: patch
Applies To: Engine
Reported By: MichaelDaum
Waiting For:
Last Change By: KennethLavrsen
Given a web whose topics have DataForms with a TopicType field, and at least one topic whose formfield is set to

TopicType="BlogEntry, CategorizedTopic"

Let's search these.

You type:
%SEARCH{"TopicType=~'BlogEntry'" type="query" format="   1 [[$web.$topic][$topic]"}%

You get: zero results ERROR

Now let's try something different:

You type:
%SEARCH{"TopicType=~'.*BlogEntry.*'" type="query" format="   1 [[$web.$topic][$topic]"}%

You get the expected search hits.

Now, that's wrong according to the specs. Tracing the evaluation order shows that in the first try OP_match is never executed. Digging deeper, there is a Query::Node optimizer that tries to find out which parts of the query are "constant", and it considers 'BlogEntry' a static string that is not worth checking with OP_match anymore. Serious bug.
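
For reference, a minimal sketch of the expected behaviour, assuming =~ is an unanchored regex match. It uses Python's re.search as a stand-in for the Perl match; the op_match function here is a hypothetical helper, not the Foswiki OP_match code.

```python
import re

def op_match(field_value: str, pattern: str) -> bool:
    """Hypothetical stand-in for =~: an unanchored regex search."""
    return re.search(pattern, field_value) is not None

# Form field value taken from the report above.
topic_type = "BlogEntry, CategorizedTopic"

# Under unanchored matching, both queries should select the topic.
print(op_match(topic_type, "BlogEntry"))
print(op_match(topic_type, ".*BlogEntry.*"))
```

If that assumption holds, the plain 'BlogEntry' pattern and the '.*BlogEntry.*' pattern are equivalent for this value, so the two SEARCH calls should return the same hits.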

I'm not sure this analysis is correct, but something is going seriously wrong in that area. I lost the enthusiasm to dig any deeper.

-- MichaelDaum - 14 Feb 2011

That's a pretty big bug, and it clearly needs a failing unit test.

-- SvenDowideit - 15 Feb 2011

Shouldn't the query optimizer be moved from Query::Node to the search algorithm? An SQL or XQuery backend already does a pretty good job of optimizing the query before executing it. Trying to do something similar in Perl is specific to the current Perl/grep search algorithm and is not needed on real search engines.

Another observation: $query->simplify is called multiple times on the same query. That might be inevitable given the current code paths, so an isOptimized flag would speed this up. I'm not sure how expensive this is in the current code, but $query->simplify at least sounds frightening.
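
The isOptimized idea could look roughly like the sketch below. This is a toy Python model, not the Foswiki Query::Node class: the QueryNode type, the is_optimized flag and the simplify_runs counter are all hypothetical names used only to show the guard.

```python
class QueryNode:
    """Toy query node (hypothetical; not Foswiki's Query::Node)."""

    def __init__(self, op, *params):
        self.op = op
        self.params = list(params)
        self.is_optimized = False  # set once simplify() has run
        self.simplify_runs = 0     # counts real simplification passes

    def simplify(self):
        # Skip the work if this subtree was already simplified.
        if self.is_optimized:
            return self
        self.simplify_runs += 1
        for param in self.params:
            if isinstance(param, QueryNode):
                param.simplify()
        self.is_optimized = True
        return self


leaf = QueryNode("string", "BlogEntry")
root = QueryNode("match", QueryNode("name", "TopicType"), leaf)

root.simplify()
root.simplify()  # second call returns immediately
print(root.simplify_runs, leaf.simplify_runs)  # 1 1
```

The flag would have to be cleared whenever a node is modified after simplification, otherwise a stale result could be reused.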

-- MichaelDaum - 15 Feb 2011

I recently tried using the new =~ feature with a simple regex containing parentheses. That did not work either. In my experience the new regex query search feature does not work at all in practical use. I was in the middle of a peak workload and forgot to raise a bug report.

-- KennethLavrsen - 22 Feb 2011

FWIW =~ works fine for me in practical use, but I use trunk, and not 1.1.2. I think we had an OP_match fix that Sven did which I can't remember if it made it into 1.1.2 or not.

Anyway, on trunk you can query for literal parens if you escape them with \(. Added Item10399

-- PaulHarvey - 23 Feb 2011

There is something very broken, even on trunk:

This doesn't work either:

%SEARCH{"TopicType=~'.*\bBlogEntry\b.*'" type="query" format="   1 [[$web.$topic][$topic]"}%

So basic regex matching with the OP_match operator is not working anymore, even though the code in it is just fine. That's why the best guess is that something goes wrong before the operators are called.
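
As a sanity check of the pattern itself, here is a small sketch using Python's re module as a stand-in for the Perl regex engine (word boundaries behave the same way for these ASCII values). The second input string is a made-up negative case, not a value from the report.

```python
import re

pattern = r".*\bBlogEntry\b.*"

# Value from the report: BlogEntry stands alone as a word, so this matches.
print(bool(re.search(pattern, "BlogEntry, CategorizedTopic")))  # True

# Made-up negative case: no word boundary around BlogEntry here.
print(bool(re.search(pattern, "MyBlogEntryTopic")))  # False
```

Taken in isolation the regex matches the form field value, which is consistent with the guess that the failure happens before the operator is evaluated.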

-- MichaelDaum - 23 Feb 2011

The problem I had with (...) was not searching for literal ( or ). I used them the regex way. Specifically I tried to OR two words in a regex like

field =~ "Something(this|that)".

I could not get any regex to work with the =~ operator unless it was dead simple.

Example: %SEARCH{"TopicTitle =~ '.*Topic (1|2).*'" type="query"}% will not find the two topics that have the values 'Topic 1' and 'Topic 2'.

I have tried to analyse this. What is it that it does not understand?

It does not understand the (). TopicTitle =~ '.*Topic (1).*' fails to find anything

It does not understand the |. TopicTitle =~ '.*Topic 1|2.*' fails to find anything

It does not understand a character class. TopicTitle =~ '.*Topic [12].*' fails to find anything
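
For comparison, a sketch of which titles these four patterns select under ordinary unanchored regex matching, again using Python's re.search as a stand-in for =~. 'Topic 1' and 'Topic 2' come from the example above; 'Topic 3' is an extra control value added here for illustration.

```python
import re

values = ["Topic 1", "Topic 2", "Topic 3"]  # "Topic 3" is a made-up control
patterns = [
    r".*Topic (1|2).*",  # grouping with alternation
    r".*Topic (1).*",    # grouping only
    r".*Topic 1|2.*",    # top-level alternation: '.*Topic 1' OR '2.*'
    r".*Topic [12].*",   # character class
]

def hits(pattern):
    """Return the values that the pattern matches anywhere in the string."""
    return [v for v in values if re.search(pattern, v)]

for p in patterns:
    print(p, "->", hits(p))
```

Under these assumptions every pattern finds at least 'Topic 1', and all but the second also find 'Topic 2'. Expectations like these could serve as the starting point for the failing unit test requested earlier.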

In my view the query regex feature is totally broken. I cannot make any use of it. And several of my users have tried for hours to get it to work. A lot of users are wasting a lot of time with this.

I will not even build a 1.1.3 release candidate until we have found what breaks this new important 1.1 feature.

-- KennethLavrsen - 01 Mar 2011

It's worth noting that fields[value=~'<regex>'] works

-- PaulHarvey - 06 Mar 2011

As Sven pointed out - this needs a unit test. It's a lot easier for others to fix when a unit test exists. Otherwise we need to create both a failing test and figure out the fix.

-- GeorgeClark - 06 Mar 2011


I don't see the same diagnosis as Michael. OP_match is called, and returns true. Something else is going on.

-- PaulHarvey - 06 Mar 2011

Right, it gets called for QUERY... something bad happens with SEARCH.

-- PaulHarvey - 06 Mar 2011

My adventure led me to bruteforce search algo before I had to stop. I'm hoping Sven or Crawford might have time to offer suggestions. The interesting thing is that fields[name='Blah' value=~'<regex>'] works as both QUERY and SEARCH, whereas Blah=~'<regex>' works in a QUERY but not in a SEARCH.

-- PaulHarvey - 07 Mar 2011

Sven is planning on working on this tomorrow (9/3/2011 my time).

-- SvenDowideit - 08 Mar 2011

Fixed, I think. But given the lack of docco and unit tests, there is essentially no spec (see QuerySearch and RegexExpression), so it's pretty hard to be sure.

-- SvenDowideit - 09 Mar 2011

Topic revision: r22 - 16 Apr 2011, KennethLavrsen