Code Refactoring For Fast Queries

Searching is absolutely key to everything Foswiki does. This topic is for developers working on the core to discuss the detailed code refactorings required to make searching work better.

Existing topics/background reading

The following feature proposals cover parts of the discussion below. Please don't duplicate their content here; this is intended as a high-level overview and brainstorming topic that ties all these threads together.

How search works in Foswiki 1.0.0

First, a quick summary of how search currently works.

There is a monster "Search" module that decomposes the %SEARCH macro. It then fires off a "search engine" specific to the type of the search. The search monster takes responsibility for the lists of topics to be searched and handles all the parameter processing; it also takes responsibility for rendering search results, and it contains a number of optimisations that assume the underlying database implementation.
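For orientation, here are two illustrative %SEARCH calls of the kinds the current code has to handle; the web, form, and field names are invented for the example:

<verbatim>
%SEARCH{
    "deployment"
    type="regex"
    web="Sandbox"
    nonoise="on"
    format="| $topic | $summary |"
}%

%SEARCH{
    "form.name='TaskForm' AND State='open'"
    type="query"
    web="Sandbox"
    nonoise="on"
    format="| $topic | $formfield(State) |"
}%
</verbatim>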

The search engines themselves are implemented in the store backend, hidden behind the searchInWeb* interfaces.

This has a number of problems.
  1. The search monster is a ... monster. Most of us are afraid to even read the code, it is so scary.
  2. No clear delegation of responsibility.
  3. The engines are drip-fed information about the search - for example, they are not told about limits on the search.
  4. Queries are pre-parsed by the search monster, which may be a waste of time if the search is cached.
  5. No result sets.
  6. Searches are not re-entrant - you can't start a search, process initial results, then complete the search later.
  7. Searches over multiple webs require the same search to be applied to each web separately.
  8. The search monster owns rendering the results, and doesn't do a very good job.
  9. Only very simple sorting is possible, and it is done too late in the process.
  10. No explicit documentation of search orders (search order must be maintained when lists of sources to search are defined).
  11. Data found by the search and included in the search result is ignored by the rendering process ($formfield has to go back and read the topic again).

Search backends are of two types: full text and query. The full-text engine relies on the availability of topic text with embedded meta-data. The query engine currently takes a parsed query and applies it to the stored data; the current (brute force) implementation rewrites the query as a regex and uses the full-text engine for the actual search.
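To make the "rewrites the query as a regex" step concrete, here is roughly what happens to a simple query. The regex the real code generates differs in detail; this only shows the shape of it:

<verbatim>
Query as written:   Priority='High'

Regex applied to the raw topic text (which carries the embedded
meta-data lines), approximately:

   ^%META:FIELD\{name="Priority".*\bvalue="High"
</verbatim>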

Ideas

Search needs to be broken down into layers. Let's assume that there will be various UI components that use search, and that they will all use a common search module. The search module needs to take an abstract description of the search to be performed and return a set of results. The existing search formatting needs to be moved out into a separate result-set rendering module. We want searches to be re-entrant, so it makes sense for searches to be performed by a "search object" that is constructed when the search is analysed.
  • Move command-specific processing into invoking code
  • Move rendering into a separate result set (ResultSets) rendering package (ExtractAndCentralizeFormattingRefactor)
  • Change Search.pm into a search object factory
  • The search needs to be able to search multiple webs in one pass, and needs to be cognizant of sorting
This should reduce Search.pm to a relatively simple dispatcher for the actual search algorithms implemented by the backend store modules (DelegateMoreProcessingToSearchAlgorithm). It's a given that different backends will take different approaches to searching, so it may be appropriate for Search.pm to decompose into a factory for search objects that are provided, in turn, by factories in the backend implementations (a rough sketch follows the list below).
  • Search.pm should avoid assuming that only one backend implementation is used for all webs (MultiStoreRefactor), though it's acknowledged that this may conflict with the requirement for efficient multi-web searches.
  • Backends should always provide some standard factories, as defined to date (type="regex", type="query", type="topic") but may also provide additional factories for backend-specific searches (type="sql", type="xquery")
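As a strawman for what the backend-provided search factories could look like, here is a minimal Perl sketch. All package and method names (makeQuerySearch(), nextResult(), etc.) are invented for illustration and are not existing Foswiki APIs:

<verbatim>
# Hypothetical sketch only: these package and method names are invented
# for illustration and are not existing Foswiki APIs.
package Foswiki::Store::ExampleBackend::SearchFactory;

use strict;
use warnings;

# Build a re-entrant search object for an already-parsed query over a
# list of webs. %options would carry limit, order, reverse, etc.
sub makeQuerySearch {
    my ( $class, $query, $webs, %options ) = @_;
    return Foswiki::Store::ExampleBackend::QuerySearch->new(
        $query, $webs, %options );
}

package Foswiki::Store::ExampleBackend::QuerySearch;

sub new {
    my ( $class, $query, $webs, %options ) = @_;
    return bless {
        query   => $query,
        webs    => [@$webs],
        options => \%options,
        matches => undef,     # lazily populated list of matching topics
    }, $class;
}

# Return the next matching topic name, or undef when exhausted. Because
# the object keeps its own cursor, a caller can stop after the first few
# results and resume the same search later (re-entrancy).
sub nextResult {
    my $this = shift;
    $this->{matches} = $this->_runSearch() unless defined $this->{matches};
    return shift @{ $this->{matches} };
}

# Placeholder: a real backend would run its native search here, e.g. the
# brute-force regex scan, a BerkeleyDB lookup, or an SQL query.
sub _runSearch {
    my $this = shift;
    return [];
}

1;
</verbatim>

The point of the sketch is only that the search object, not Search.pm, owns the cursor, the multi-web topic lists, and the sort order.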
Backends should be able to use toolkits of search parts to build up search solutions. For example, the current brute-force query search uses regex searches. This approach may be of value in other store implementations that use a flat text representation (e.g. a BerkeleyDB-based topic store). Having said that, we cannot allow backends to impose divergent syntax requirements on search expressions, so some modules, such as a query parser and a regex checker/transformer, have to be mandated for all backends.

Here's a list of ideas for the toolkit modules that are required (a short usage sketch follows the list):
  • Query parser (Foswiki::Query::Parser)
  • Regex checker (and possibly transformer?)
  • Regex hoister (convert queries into regexes for application to flat DB; Foswiki::Query::HoistREs)
  • Foswiki query to SQL transformation engine
  • Pluggable sort algorithms
(This is brainstorming; feel free to add more)
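For example, a minimal usage sketch of the parser and hoister ideas above, assuming a parse() method on the parser and a hoist() function in the hoister (neither signature is confirmed here):

<verbatim>
# Sketch only: assumes the toolkit modules listed above, with a parse()
# method on the parser and a hoist() function in the hoister.
use strict;
use warnings;

use Foswiki::Query::Parser   ();
use Foswiki::Query::HoistREs ();

my $parser = Foswiki::Query::Parser->new();

# Parse the query once, up front, so the parse tree can be cached or
# handed to whichever search object a backend factory provides.
my $query = $parser->parse("form.name='TaskForm' AND Priority='High'");

# A flat-text backend could hoist regexes out of the parse tree and feed
# them to its full-text engine ...
my @regexes = Foswiki::Query::HoistREs::hoist($query);

# ... while a database backend would walk the same tree and emit SQL.
</verbatim>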

Discussion

Initial version

-- CrawfordCurrie - 22 Jan 2009

I hope that the needs listed above, and a strategy for implementing them, will become much simpler when I start to commit some of the refactorings for the ExtractAndCentralizeFormattingRefactor (which I started over Christmas, and hopefully lost, so will do again with lessons learnt).

It's certainly a pretty good summary of the destination I have in mind for all the disparate FeatureRequests - which are driven partially by user-facing functionality, partially by backend work (Kino and Xapian indexing, DatabaseStore), and in part by disgust at our fear of the existing code.

-- SvenDowideit - 21 Jan 2009

Search engines have quite different query languages - XQuery or SQL, to name two of the most different. Other search engines have fuzzy search features, etc. And the other way around: not every search engine implements regex search (because it is so expensive and it is hard to reuse indexes for it).

I think it is not feasible to force everything through a query transformation engine. These kinds of language mappers are bound to fail and will cover just a boring subset of the features of your backend. Besides the principal problems, this kind of forcing a camel through the eye of a needle is most probably a performance problem as well. Let's call it driving a Porsche with square tires. ;-)

From what I see, the type parameter is there to solve that problem, by having type="sql" for explicit SQL and the like.

You also don't mention how to cache search objects between requests.

-- MichaelDaum - 21 Jan 2009

I certainly don't agree that query mappers are "certainly bound to fail". However it's true that mapping from the SEARCH query language covers a boring subset; I kinda think that's the whole point. If you stick to TML queries, it will work with all backends. If you want a richer search experience, select type="sql" and caveat emptor.
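To illustrate the intent (type="sql" is hypothetical here, as proposed above, not something that exists today):

<verbatim>
Portable, should work on any backend:
   %SEARCH{ "State='open'" type="query" web="Tasks" }%

Backend-specific, only where the store offers it, caveat emptor:
   %SEARCH{ "..." type="sql" web="Tasks" }%
</verbatim>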

Of course this prompts another aspect I hadn't covered - pluggable search language.

I hadn't considered caching search objects. Interesting problem. We need to bear it in mind when refactoring.

-- CrawfordCurrie - 21 Jan 2009

Sorting is also a crucial step in the search process. Somewhat related: GeneralSortingMechanism.

-- ArthurClemens - 21 Jan 2009

Refactored comments to date back into the text.

-- CrawfordCurrie - 22 Jan 2009

We have considered caching - see ResultSets.

-- SvenDowideit - 22 Jan 2009

I'd suggest that, in addition to the search redesign and optimization, careful consideration be given to instrumentation of the search engine and the topic rendering engine as well. While the Apache %T log variable can provide a look at total request processing time, there is little insight into exactly which components of the engine are consuming that time. The information gathered could range from simple begin/end/request time-stamping to more detailed logging of topics considered and other internal operations, depending upon the search type and implementation.

It probably deserves a separate topic, but a well-designed internal trace log with configurable log levels, and a consistent format to be used by plugins, the engine, and other core and optional components, would greatly assist in finding and resolving issues when a site experiences periods of sluggishness.
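As one minimal sketch of what such instrumentation could look like (the package name, trace level, and log format below are invented for illustration, not an existing Foswiki API):

<verbatim>
# Minimal sketch of begin/end timing instrumentation; names are
# illustrative only.
package Foswiki::Trace;

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

our $LEVEL = 1;    # configurable trace level, e.g. set from configure

# Run a code block, log how long it took, tagged with a component name.
sub timed {
    my ( $component, $code ) = @_;
    return $code->() unless $LEVEL;
    my $start   = [gettimeofday];
    my @results = $code->();
    printf STDERR "| %s | TRACE | %s took %.3f ms\n",
        scalar localtime, $component, 1000 * tv_interval($start);
    return wantarray ? @results : $results[0];
}

1;

# Usage, e.g. around the search dispatch:
#   my @hits = Foswiki::Trace::timed( 'SEARCH', sub { $search->run() } );
</verbatim>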

-- GeorgeClark - 23 Jan 2009

Agreed. I had something in mind, but stopped when I discovered DTrace and NYTProf - it'd be great to see someone adding more user-level requirements, because from a developer's point of view, Foswiki::Monitor and NYTProf are pretty good.

-- SvenDowideit - 23 Jan 2009

Profiling all of the engine is a completely different matter. Some of the findings have been well known for quite some years, i.e. Foswiki.pm being too fat, the rendering process being overly convoluted, plugins adding overhead even though they are not used, the lack of a central table parser in the core so that several core plugins each parse tables on every click, etc.

One other culprit definitely is %SEARCH. And that's what this proposal is about.

-- MichaelDaum - 23 Jan 2009

With MongoDBPlugin, we're getting awesome response times involving SEARCHes in large webs. Querying through ~60,000 topics is no problem at all, as long as you're only going to display content from ~20-100 topics at a time. Some more discussion at LargeSiteAclPerformance.

-- PaulHarvey - 28 Nov 2011
