Item10352: Query optimizer regexes beyond simple plain text fail

Priority: Urgent
Current State: Closed
Released In: 1.1.3
Target Release: patch
Applies To: Engine
Reported By: MichaelDaum
Waiting For:
Last Change By: KennethLavrsen
Given a web whose topics use DataForms with a TopicType field, and at least one topic that has the form field

TopicType="BlogEntry, CategorizedTopic"

Let's search these.

You type:
%SEARCH{"TopicType=~'BlogEntry'" type="query" format="   1 [[$web.$topic][$topic]]"}%

You get: zero results ERROR

Now let's try something different:

You type:
%SEARCH{"TopicType=~'.*BlogEntry.*'" type="query" format="   1 [[$web.$topic][$topic]]"}%

You get the expected search hits.

Now, that's wrong, according to the specs. Tracing the evaluation order it turns out that in the first try OP_match is never executed. Digging deeper there is a Query::Node optimizer that tries to find out which parts of the query are "constant" and considers the 'BlogEntry' being a static string not worth checking using OP_match anymore. Serious bug.
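To make the suspected failure mode concrete, here is a minimal Python sketch of a constant-folding pass over a toy query tree. It is not the Query::Node code; the class names (Const, Field, Match) and the functions are hypothetical. It shows the rule an optimizer has to respect: a match node is only constant when both of its operands are, so a constant regex on the right-hand side is not enough to skip the match.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass
class Const:
    value: str


@dataclass
class Field:
    name: str  # value is read from the topic at evaluation time


@dataclass
class Match:
    lhs: "Node"
    rhs: "Node"


Node = Union[Const, Field, Match]


def is_constant(node: Node) -> bool:
    """A node is constant only if nothing beneath it reads topic data."""
    if isinstance(node, Const):
        return True
    if isinstance(node, Field):
        return False
    return is_constant(node.lhs) and is_constant(node.rhs)


def evaluate(node: Node, topic: dict):
    """Evaluate a node against one topic's form fields."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Field):
        return topic.get(node.name, "")
    pattern = evaluate(node.rhs, topic)
    subject = evaluate(node.lhs, topic)
    return re.search(pattern, subject) is not None


def simplify(node: Node) -> Node:
    """Fold a match to a constant only when the whole subtree is constant."""
    if isinstance(node, Match) and is_constant(node):
        return Const("1" if evaluate(node, {}) else "")
    return node


query = Match(Field("TopicType"), Const("BlogEntry"))
topic = {"TopicType": "BlogEntry, CategorizedTopic"}
print(simplify(query) is query)          # the match survives simplification
print(evaluate(simplify(query), topic))  # and still runs against the field
```

An optimizer that looked only at the right-hand side would fold this query away before the match ever ran, which is the behaviour described above.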

I'm not sure this analysis is correct. Something is going seriously wrong in that area, but I lost the enthusiasm to dig any deeper.

-- MichaelDaum - 14 Feb 2011

That's a pretty big bug - and clearly needs a failing unit test.

-- SvenDowideit - 15 Feb 2011

Shouldn't the query optimizer be moved from Query::Node to the search algorithm? An SQL or XQuery backend already does a pretty nice job of optimizing the query before executing it. Trying to do something similar in Perl is specific to the current Perl/grep search algorithm and not needed on real search engines.

Another observation: $query->simplify is called multiple times on the same query. That might be inevitable looking at the current code paths, so an isOptimized flag would speed this up. I'm not sure how expensive this is, judging from the current code. But $query->simplify at least sounds frightening.
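The isOptimized idea can be sketched as a guard flag on the node. This is a hypothetical Python illustration, not the Foswiki implementation; the QueryNode class, its attributes and the simplify_runs counter are invented for the example.

```python
class QueryNode:
    """Toy query node whose simplify() does its work at most once."""

    def __init__(self, op, *children):
        self.op = op
        self.children = list(children)
        self.is_optimized = False
        self.simplify_runs = 0  # instrumentation for this example only

    def simplify(self):
        # Repeated calls on an already simplified node return immediately.
        if self.is_optimized:
            return self
        self.simplify_runs += 1
        for child in self.children:
            if isinstance(child, QueryNode):
                child.simplify()
        # Placeholder: the real folding work would happen here.
        self.is_optimized = True
        return self


query = QueryNode("match", QueryNode("field"), "BlogEntry")
query.simplify()
query.simplify()
print(query.simplify_runs)
```

The flag only helps if the tree is not modified after simplification; any code that rewrites a node would have to reset it.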

-- MichaelDaum - 15 Feb 2011

I recently tried using the new =~ feature with a simple regex containing parentheses. That did not work either. In my experience the new regex query search feature does not work at all in practical use. I was in the middle of a peak workload and forgot to raise a bug report.

-- KennethLavrsen - 22 Feb 2011

FWIW =~ works fine for me in practical use, but I use trunk, and not 1.1.2. I think we had an OP_match fix that Sven did which I can't remember if it made it into 1.1.2 or not.

Anyway, on trunk you can query for literal parens if you escape them with \(. Added Item10399
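For readers unsure about the difference between a literal paren and a grouping paren, here is a short Python sketch. Python's re module treats these constructs the same way Perl does; the sample value is made up.

```python
import re

value = "Release (beta)"  # hypothetical field value

# Escaped parens match the literal characters "(" and ")".
literal = re.search(r"Release \(beta\)", value) is not None

# Unescaped parens form a group, so this pattern looks for "Release beta".
grouped = re.search(r"Release (beta)", value) is not None

# re.escape builds the escaped form from a literal string.
escaped = re.search(re.escape("(beta)"), value) is not None

print(literal, grouped, escaped)
```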

-- PaulHarvey - 23 Feb 2011

There is something very broken, even on trunk:

This doesn't work either:

%SEARCH{"TopicType=~'.*\bBlogEntry\b.*'" type="query" format="   1 [[$web.$topic][$topic]]"}%

So basic regexing using the OP_match operator is not there anymore, even though the code in it is just fine. That's why my best guess is that something is going on before the operators are called.

-- MichaelDaum - 23 Feb 2011

The problem I had with (...) was not searching for literal ( or ). I used them the regex way. Specifically I tried to OR two words in a regex like

field =~ "Something(this|that)".

I could not get any regex to work with the =~ operator unless it was dead simple.

Example: %SEARCH{"TopicTitle =~ '.*Topic (1|2).*'" type="query"}% will not find the two topics that have the values 'Topic 1' and 'Topic 2'.

I have tried to analyse this. What is it that it does not understand?

It does not understand the (). TopicTitle =~ '.*Topic (1).*' fails to find anything

It does not understand the |. TopicTitle =~ '.*Topic 1|2.*' fails to find anything

It does not understand a character class. TopicTitle =~ '.*Topic [12].*'
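As a reference for what these patterns ought to match, the sketch below runs them through Python's re module, which agrees with Perl on grouping, alternation and character classes. The values 'Topic 1' and 'Topic 2' come from the report; 'Topic 3' is added here as a control, and the hits helper is just for the example.

```python
import re

titles = ["Topic 1", "Topic 2", "Topic 3"]

patterns = [
    r".*Topic (1|2).*",
    r".*Topic (1).*",
    r".*Topic 1|2.*",
    r".*Topic [12].*",
]


def hits(pattern, values):
    """Return the values in which the pattern is found."""
    return [v for v in values if re.search(pattern, v)]


for pattern in patterns:
    print(pattern, hits(pattern, titles))
```

Note that '.*Topic 1|2.*' is an alternation between '.*Topic 1' and '2.*', so it finds 'Topic 2' only because that value contains a 2. None of the four patterns should come back empty on this data.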

In my view the query regex feature is totally broken. I cannot make any use of it. And several of my users have tried for hours to get it to work. A lot of users are wasting a lot of time with this.

I will not even build a 1.1.3 release candidate until we have found what breaks this new important 1.1 feature.

-- KennethLavrsen - 01 Mar 2011

It's worth noting that fields[value=~'<regex>'] works

-- PaulHarvey - 06 Mar 2011

As Sven pointed out - this needs a unit test. It's a lot easier for others to fix when a unit test exists. Otherwise we need to create both a failing test and figure out the fix.

-- GeorgeClark - 06 Mar 2011

I don't see the same diagnosis as Michael. OP_match is called, and returns true. Something else is going on.

-- PaulHarvey - 06 Mar 2011

Right, it gets called for QUERY... something bad happens with SEARCH.

-- PaulHarvey - 06 Mar 2011

My adventure led me to bruteforce search algo before I had to stop. I'm hoping Sven or Crawford might have time to offer suggestions. The interesting thing is that fields[name='Blah' value=~'<regex>'] works as both QUERY and SEARCH, whereas Blah=~'<regex>' works in a QUERY but not in a SEARCH.
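The two query forms should be interchangeable. A minimal Python sketch over a toy topic structure shows the parity one would expect; the dictionary layout and both function names are assumptions made for illustration and say nothing about how Foswiki stores or evaluates form fields.

```python
import re

# Hypothetical in-memory stand-in for a topic's form fields.
topic = {
    "fields": [
        {"name": "TopicType", "value": "BlogEntry, CategorizedTopic"},
        {"name": "Summary", "value": "First post"},
    ]
}


def match_long_form(topic, name, pattern):
    """Mimics fields[name='<name>' value=~'<regex>']: filter the field list."""
    return [
        f for f in topic["fields"]
        if f["name"] == name and re.search(pattern, f["value"])
    ]


def match_short_form(topic, name, pattern):
    """Mimics <name>=~'<regex>': look the field up by name, then match."""
    values = {f["name"]: f["value"] for f in topic["fields"]}
    return re.search(pattern, values.get(name, "")) is not None


for pattern in ["BlogEntry", r".*\bBlogEntry\b.*", "NoSuchType"]:
    long_hit = bool(match_long_form(topic, "TopicType", pattern))
    short_hit = match_short_form(topic, "TopicType", pattern)
    print(pattern, long_hit, short_hit)
```

A parity check like this, run once through QUERY and once through SEARCH, could serve as the starting point for the failing unit test requested earlier in the thread.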

-- PaulHarvey - 07 Mar 2011

sven is planning on working on this tomorrow (9/3/2011 my time)

-- SvenDowideit - 08 Mar 2011

Fixed, I think. But given the lack of docco and unit tests, there is essentially no spec (see QuerySearch and RegexExpression), so it's pretty hard to be sure.

-- SvenDowideit - 09 Mar 2011

Topic revision: r22 - 16 Apr 2011, KennethLavrsen
