Feature Proposal: Query across all topic revisions

Motivation

We really want to be able to report some stuff based on the evolution of a topic. Who are all the authors of a topic? Did a formfield ever contain a particular value, and if so when/what rev, whose change was it? What were the values of these fields at a particular time?

Description and Documentation

New query syntax to support such queries. A topic and all of its revisions become an array. Here's the text from QuerySearch explaining it:

Working with arrays

A number of fields in a topic are arrays, meaning that they refer to a list of values rather than a single value. Arrays are accessed using the square brackets ([]) and dot (.) operators.

The square brackets operator is actually a bit more clever than a simple query or array index. You can use the comma , operator inside the brackets to select a set of matching elements. For example,
versions[1,5]
will select version 1 and version 5.

You can even mix integer indices and conditions. For example, let's say we want to select all versions that are authored by 'RoaldDahl' and also select version 9.
versions[info.author='RoaldDahl', 9]
Note that if 'RoaldDahl' authored version 9, then you will get that version twice in the result.

The index of the first item in an array is [1], the second is at [2] etc. An index of [0] is used to refer to the last item in the array, [-1] refers to the second last, etc.

When the dot operator is applied to an array, it is applied to each element of the array, and the result is a new array containing the results of each application.

Working with versions

The versions field is mainly used with %QUERY to provide a powerful view into the history of a topic. It is an array of objects, one for each revision of the topic (including the most recent), where the oldest revision is in position [1] in the array, the second oldest in [2] etc. You can use the versions field to select old versions based on a query. For example,
versions[info.author='RoaldDahl']
will return a list of all versions authored by RoaldDahl. You can access field values in the topics as they were at the time - for example,
versions[info.author='RoaldDahl'].Published
will return an array of the values of the Published field in all versions authored by RoaldDahl.

The versions array is indexed by the revision number. Index [0] which always refers to the most recent version, [-1] to the next most recent etc.

Gotchas

  • Remember that in the query language, topic names are constants. You cannot write Main.UserTopic/UserForm.firstName because Main.UserTopic will be interpreted as a form field name. If you want to refer to topics by name you must enclose the topic name in quotes i.e. 'Main.UserTopic'/UserForm.firstName
  • You cannot use the result of a sub-query as an index into the versions array. For example, versions[UseThisVersion] where the field UseThisVersion is set to '2' will not work. Any index that requires reference back to the database (such as a field value) will be evaluated as a condition and not as an index. To select a version based on a field value, use versions[info.version=UseThisVersion]

Examples

revs NOT authored by PaulHarvey

%QUERY{"versions[info.author!='PaulHarvey']/info.version"}%

Result:

authors of revs matching 'CurrentState'='CommunityVote'
%QUERY{"versions[CurrentState='CommunityVote']/info.author"}%

Result:

revs created between the 5th and the 7th December 2010
%QUERY{"versions[info.date < d2n('2010-12-08') AND info.date > d2n('2010-12-05')]/info.version"}%

Result:

What about..

  • Just wanting the first/last rev where the condition matches?
    %QUERY{"first(versions[author='Waiting'])"}%
    maybe? LATER

Challenges

Impact

Performance. RCS store may be practical when using VarQUERY on a single topic, but will probably drown a server scanning any non-trivial sized web with VarSEARCH.

WhatDoesItAffect: %WHATDOESITAFFECT%

Implementation

CrawfordCurrie has implemented on trunk in Tasks.Item10129

IN ADDITION it may be desirable to enhance Foswiki::Meta API with a more convenient way to iterate over the revisions of a given $topicObject (let's assume that every revision will be its own stand-alone, self-contained $topicObj).

SvenDowideit said on IRC:
Meta->getRevIterator() => RevIterator where revIterator->next returns a new cacheable Meta, and if rev!=head, perhaps getWriteable fails

forcing that even more unusnal situation to be written by copying the contenst of the old rev into a head writable

Discussion

Exploring what you suggest above, I agree you need some way to statically express a specific revision of a topic. Right now we do that in a URL with Web/Topic?rev=2. Let's assume what you suggest, that this syntax isn't part of the query language, it's instead part of a web.topic reference string, so Web.Topic refers always to the latest rev of a topic, but Web.Topic?rev=2 refers to that specific rev of the topic. Then, you need some way to get all the revs of a topic. So, let's say the revs array postulated above contains a list of fully-qualified revision references, e.g. ['Web/Topic?rev=2', 'Web/Topic?rev=1']

You could then use the REF operator:
  • 'Web.Topic'/revs[author!='GungaDin'].version to get a list of all versions that were not authored by GungaDin.
  • revs/info.author to get an array of all the authors of all revs
An extension to this proposal would be to treat the ?rev=2 as part of the topic reference syntax, so writing QueryAcrossTopicRevisions?rev=1 anywhere, even outside squabs, would give a link to that specific rev of the topic.

-- CrawfordCurrie - 15 Sep 2010

Cool. I have mixed feelings about formalising Web.Topic?rev=n. I think it would be nice to come up with something that doesn't break simple string concatenation when you try to build a URL out of these topic references (ie. avoid too many '?'). Or is that being too geeky? I don't know.

Let's review rev access syntax:

... syntax comparison table moved to LoadDifferentTopicVersions

-- PaulHarvey - 16 Sep 2010

and please don't forget that addressing a specific rev is the least useful interface into the object history - more useful are before 'date', first after 'date' and between 'date' and 'date', followed by first, last and previous.

we should possibly consider rev as a second rate alias, existent only due to some implementation detail.

-- SvenDowideit - 17 Sep 2010

... more discussion moved to LoadDifferentTopicVersions

-- PaulHarvey - 09 Dec 2010

A few thoughts.

First, in the context of something like ApprovalPlugin surely we need the concept of [REV=APPROVED] and for this to be the default display (possibly PlugIn managed). After all this is the approved version, of course in a full implementation you need to be offered the choice of seeing the latest version, possibly with the option to make further changes. The crucial point is that a store(s) will need to support this.

Second, we should not get hung up on a maintaining something like RCS, if a store provides (or even must provide) revisions then it can simply store a complete copy for each revision (actually with META data broken out into separate fields that will only be a single field copied per revision, it's only text that could be large).

The text field could even be compressed (does any RDBMS support this?) which should increase performance if done right.

With multiple StoreEngines each of which can be configured may allow an Admin to make choices web by web: e.g. with/without fast revision searches.

-- JulianLevens - 27 Oct 2010

Hi Julian,

I'm hoping that the query language would support the ability to "get the latest rev which satisfies some criteria", something like
%QUERY{
  "latest(
    revisions[name='Web.SomeTopic' AND META:APPROVAL.state='APPROVED']
  )"
}%

This example above depends on us coming up with a topic path syntax which not only holds the web and topic, but a fully qualified identifier that has a revision component in it. So the above query might return Web.Topic--23 - revision 23 of Web.Topic. You should then be able to use this fully qualified topic identifier in subsequent queries, Eg. get the approver:
%QUERY{
  "'Web.Topic--23'/author"
}%

I also don't think we're getting hung up on RCS smile Just that it doesn't make sense to create a new query capability that can't be exploited with an out-of-the-box Foswiki (or a differently configured one anyway). OTOH, if this stuff becomes a capability as a result of QueryCustomCollections - then we could leave it up to a store plugin (that is NOT shipped with default Foswiki) to provide a 'revisions' collection that can be queried in this way.

Topics (and their revisions) constitute pretty simple data in terms of storage problems today. I'm not sure the overhead of compressing/decompressing is really worth it, but I guess that's up to each store implementation.

I have added a note regarding the Foswiki::Meta API: there might be some work required there, so that plugins may more conveniently iterate over the versions of a given $topicObject.

-- PaulHarvey - 14 Nov 2010

Thinking about Paul's -- syntax practically, I don't favour mixing up dates and revision numbers (see LoadDifferentTopicVersions for the original discussion) -- Main.Paul Harvey - 09 Dec 2010

The REF (/) operator is current a means of referring to another topic. Let's say we extend the concept to refer to another rev of the current topic. To illustrate:
  • 'Web.Topic'/field - refers to the value of 'field' in the latest revision of Web.Topic
  • 'Web.Topic'/(info.createdate < d2n('1 aug 2010')) - refers to the latest revision of Web.Topic iff it was created prior to 1/8/10
  • 'Web.Topic--1'/field - refers to the value of field in revision 1 of Web.Topic
  • 'Web.Topic--*'/field - refers to the value of 'field' in all revisions of Web.Topic
  • 'Web.Topic--1:9'/field - refers to the value of 'field' in revisions 1 through 9 of Web.Topic
  • 'Web.Topic--*'/(info.createdate < d2n('1 aug 2010')) - refers to an array of all revisions of Web.Topic prior to 1/8/10
and
  • field - refers to the value of 'field' in the latest revision of the queried topic
  • info.createdate < d2n('1 aug 2010') - refers to the latest revision of the queried topic iff it was created prior to 1/8/10
  • '--1'/field - refers to the value of field in revision 1 of the queried topic
  • '--*'/field - refers to the value of 'field' in all revisions of Web.Topic
  • '--1:9'/field - refers to the value of 'field' in revisions 1 through 9 of Web.Topic
  • '--*'/(info.createdate < d2n('1 aug 2010')) - refers to an array of all revisions of the queried topic prior to 1/8/10
-- CrawfordCurrie - 21 Nov 2010

Implementation thoughts. If a "topic" (the LHS of a REF) is an uncommitted (i.e. unloaded) meta object, until a getField is actually performed on it, then we can have the concept of -- as a "version resolution operator". Thus you could have a topic name in a field of a topic, then use the -- operator to focus on a specific (set of) revision(s) of that topic. -- would be a binary/ternary operator to allow for a range of revisions.

Actually, the more I think about it, the more I think it needs to be an operator (and not delegated to the meta-loader inside ''). Otherwise you will not be able to use computed ranges e.g. those where version numbers exist in dataform fields.

-- CrawfordCurrie - 01 Dec 2010

The more i think about this, the more I think it needs to be an operator. More, I have already implemented the comma operator for creating lists (required for IN) and having worked on arithmetic -, can't make -- work as an operator (easily confused with binary minus unary minus). So, propose the following: three new operators, unary @, binary @, and nonary *
  • 'Web.Topic'@1/field - refers to the value of field in revision 1 of Web.Topic
  • 'Web.Topic'@*/field - refers to the value of 'field' in all revisions of Web.Topic
  • 'Web.Topic'@(1,2,3,4,5,6,7,8,9)/field - refers to the value of 'field' in revisions 1 through 9 of Web.Topic
  • 'Web.Topic'@*/(info.createdate < d2n('1 aug 2010')) - refers to an array of all revisions of Web.Topic prior to 1/8/10
Unary @ is defined as an operation on the current topic, so
  • @1/field - refers to the value of field in revision 1 of the queried topic
  • @*/field - refers to the value of 'field' in all revisions of Web.Topic
  • @(1,2,3,4,5,6,7,8,9)'/field - refers to the value of 'field' in revisions 1 through 9 of Web.Topic
  • @*/(info.createdate < d2n('1 aug 2010')) - refers to an array of all revisions of the queried topic prior to 1/8/10
Note that the explicit expansion of what was a range operator (:) is because version names are not necessarily sequential.

This lets me select a revision based on the value of a field (i.e. 'Web.Topic'@VersionField) but still isn't quite right. The * operator is really hard to implement. And it would still be nice to be able to express ranges (and also parse lists of ranges out of field values). Possibly treat @ as a kind of "grep" operator i.e. "select all revisions that match the condition on the RHS" so 'Web.Topic'@(n IN (1,2,3)) or 'Web.Topic'@(n=1)

-- CrawfordCurrie - 04 Dec 2010

Hi Crawford, I really appreciate your input on this. As mentioned earlier, trac uses @ for version-specific page names also. I initially discounted it because I thought it might be more likely for users to have in their topic names already than '--'.

I take it this means you want QUERY to emit hits as 'Web.Topic@N' where the query matches specific versions of topics? What are your thoughts with '@' as a part of a fully qualified topic name+rev?

Also, I take it that this means QueryAcrossTopicRevisions won't need QueryCustomCollections. Earlier you said "... not delegated to the meta-loader inside '' ... Otherwise you will not be able to use computed ranges e.g. those where version numbers exist in dataform fields." - somehow I thought we could still do
%QUERY{"'Web.Topic'@%QUERY{"'Web.Topic'/VersionField"}%"}%

although this approach requires several requests to the db

-- PaulHarvey - 04 Dec 2010

You could indeed use nested queries, but as you observe it gets very clumsy very quickly.

There are (I think) 3 cases where you want to specify topic revisions. The first is in a URL. Well, the URL syntax is dictated by standards, and we really don't want to mess with encodings, so ?rev= is likely to remain for the foreseeable in this role. The second is in queries, and for this application I'm advocating the use of @ as an operator, as described above. The third is in in-topic used, such as [[Web.Topic@42]]. I think it's OK to allow this difference between the syntax employed in queries and that employed in squabs.

I'm going to have a stab at an implementation today, just to see how I get on.

-- CrawfordCurrie - 05 Dec 2010

With a range operator I would expect to be able to enter ranges, like

  • @(1..9)'/field - refers to the value of 'field' in revisions 1 through 9
  • @(..9)'/field - refers to the value of 'field' in revisions 1 through 9
  • @(1..)'/field - refers to the value of 'field' in revisions 1 through the latest revision
-- ArthurClemens - 05 Dec 2010

Agreed; but what does "range" mean in the context of revisions? Revision numbering is sequential in the RCS store, but that's not a requirement, and it won't be in (for example) git or svn stores. So a range operator has to enumerate all the available revisions of a topic, and match those that fall into the range.

A similar problem applies to the other idea I had, that of using negative indices to get versions relative to latest; e.g. @(-1) to get the revision before the current one. Add that to ranges and you have @(-6..0) to get the last 6 revisions - or does it? Doesn't it get the revisions that fall in the range from the current revision number minus 6, to the current revision?

Another problem I haven't solved is how to get "all revisions". '*' as a nonary operator doesn't cut it. Neither does a special '@*' operator (the parser has no concept of reverse-polish operators). I considered another reserved word, 'all', so @all would give you all revisions. But then that's yet another reserved word frown

-- CrawfordCurrie - 05 Dec 2010

For me, @() would make sense for "all". Because the ellipsis specify a range: @(specify_range) and if not passed, no range is specified, returning all.

But true, I haven't considered Git revisions. If we use Git revisions we need to use Git revision syntax.

-- ArthurClemens - 05 Dec 2010

With RCS, topics have a linear history. With another type of store, non-linear history might be possible, but I suspect that linear history will remain the most common because it is easy for users to get their heads around. In the context of a linear history, a "range" does make sense.

I would like to suggest that the query syntax for dealing with revisions (and ranges of revisions) should be store-agnostic. It should not matter if I use RCS, GIT, MongoDB, DBIStoreContrib or LittleGnomeWithPaperFoldersStoreContrib - the revisions syntax should remain the same. This will enable us to develop wiki applications and be able to use them on other foswikis with different stores.

The reality is that different stores use different revisions syntax. In URLs, we could consider using something like rev=2 to refer to the second revision in a linear history, and revid=76c5d170a9ae168f86c8cd98f74ee92773dbc914 for referencing an arbitrary revision using the store's own syntax. When querying across topic revisions, I suggest using RCS-style revision numbers (i.e. like rev, unlike revid ), regardless of the underlying store.

-- MichaelTempest - 06 Dec 2010

I have a (non-range) implementation of the @ operator ready to go. BTW @() loads all revs of the topic. Path attached to Tasks.Item10129

-- CrawfordCurrie - 06 Dec 2010

Requesting vote, so I can get this checked in. if you are against, please say why.

For Against Why
CrawfordCurrie    
KennethLavrsen    
GeorgeClark    
LynnwoodBrown    
  MichaelTempest Revision syntax is not specified
-- CrawfordCurrie - 07 Dec 2010

Formally the proposal has to be committed for 14 days to be accepted.

And this we need to stick to. But if I were you I would checkin the code now. All the usual "suspects" that raises concern have been part of the debate and there seem to be no major concern. So the risk that you later have to revert because the proposal is voted down is very slim.

The vote for a quick informal "temperature check" is OK but the declaration of acceptance will be in 14 days

-- KennethLavrsen - 07 Dec 2010

I have mixed feelings about this. I like the idea, but.... there are two problems.

(1) I am concerned that the syntax for revision is not defined for @revision.

This is the first time (as far as I am aware) that topic revisions become part of query syntax. As defined here (i.e. not defined), revision could be interpreted to mean "whatever the store wants it to mean". Any wiki apps developed for RCS store (e.g. mini-apps like utilities that ship with Foswiki itself) could break horribly on another store. How would you query the first revision of a topic on a store that uses GIT or SVN?

The problem is that this is unspecified, and I suspect we are all making our own assumptions about it. I also suspect it will be a can of worms. I'd like to string those worms up where we can see them smile

(2) The syntax for all revisions looks like a possible syntax for no revisions. Suppose that the specific-revision-syntax problem is resolved. Suppose that I use a nested query (bearing in mind that the inner query need not use QUERY, it might use something that MySecondPlugin provides) to first get a list of revisions, and then do a QUERY based on those revisions. For example: %QUERY{"SomeTopic@(%FOO{...}%)" ... }%. If FOO returns an empty list, then QUERY will see @(), which means all revisions, where FOO intended no revisions.

-- MichaelTempest - 08 Dec 2010

(1) Sloppy of me not to state here that a revision is identified by a positive non-zero integer. I thought this was stated elsewhere, but it's only in code so I'm adding it to the doc.

(2) Indeed. I chose the semantics of @() based on the requirement to return all revisions. The original concept, the @* operator, was not workable due to the limitations of the parser (no postfix operators). If you can propose an alternative syntax, we can discuss.

-- CrawfordCurrie - 08 Dec 2010

This was discussed at length on IRC. To summarise:
  • CDot pointed out that in order to implement @, he had to formalise the concept of "lists" in the query language, which has had knock-on effects on other operators. That needs further discussion.
  • The query language does not support lists of lists: ((1,2),3) == (1,2,3) == (1,(2,3))
  • There was lively discussion about the nature of revision numbers, which did not reach a clear consensus. (This probably needs further discussion.)
  • If we plan to use operators to refer to revisions, then we can defer specification of the syntax for specific revisions. In the meantime, we could specify that these operators return "nonsense that is understood by @()" My discussion about Web.Topic--123 forms is my attempt to resolve this. -- PaulHarvey - 08 Dec 2010
  • It should be possible to specify revision number(s) by referring to a field in (another) topic.
  • Several operators were proposed - which should be discussed in (separate?) feature proposals:
    • There is definitely a need for a revat() operator which gives the revision at a specific date (and time). Lavr described how this operator would be useful in other contexts than @().
    • There is a need to be able to refer to specific revisions, such as the first revision and the current revision.
    • MichaelDaum proposed JQuery-style :first, :n-th, :last etc similar to jquery's selectors. This notation was well-received.
  • There is a need to be able to specify an empty list of revisions.
  • One problem with using @() to mean "all revisions" is that it becomes hard to return an empty list of revisions from a macro or field. A blank field or macro result would somewhat-counter-intuitively mean "all revisions", and there isn't an easy way to say "%FOO%, or ':none' if %FOO% is blank" in the query language.
  • CDot noted that X/Y is probably synonymous with X@(:latest)[Y] , but said "not sure about the precedence rules there. That needs to be checked."
Crawford - you mentioned that () is a list-with-no-members. Would @( () ) be viable syntax for an empty list of revisions?

Otherwise, I suggest that @() should be the empty list of revisions, and that we should use something like @(:all) to refer to all revisions. I think that will be less surprising than using @() to mean all revisions.

-- MichaelTempest - 08 Dec 2010

What about this...

All topics not created by PaulHarvey:

         Could not perform search. Error was: Syntax error in '@versions(:first)[info.author!='PaulHarvey']' at '@versions(:first)[info.author!='PaulHarvey']'

All topics which ever had CurrentState='CommunityVote' :

         Could not perform search. Error was: Syntax error in '@versions(:all)[CurrentState='CommunityVote']' at '@versions(:all)[CurrentState='CommunityVote']'

All topics which had CurrentState='CommunityVote' in December: (ISO 8601 range)

         Could not perform search. Error was: Syntax error in '@date(2010-12-01/2010-12-31)[CurrentState='CommunityVote']' at '@date(2010-12-01/2010-12-31)[CurrentState='CommunityVote']'

See also: more about topic+rev identifiers/syntax/linking/etc at LoadDifferentTopicVersions

-- PaulHarvey - 08 Dec 2010

We've got a bit of a functionality split here; the @ operator in the Query, and the Content Access Syntax for getting at versions through URLs and in-topic specifiers. I'm committed to the first but can't commit to the second, so I cloned this last comment to LoadDifferentTopicVersions.

Let's keep this topic focused on %QUERY.

-- CrawfordCurrie - 09 Dec 2010

Crawford pointed out on IRC that [if FreezeRevisionNumberingScheme is accepted] then :first is not needed because all revisions start at 1.

-- MichaelTempest - 09 Dec 2010

Having fiddled with it some, writing queries, I'm coming round to the view that the versions syntax is actually more natural than the @ operator.

Consider some examples: authors of revs matching 'CurrentState'='CommunityVote'
%QUERY{"@()[CurrentState='CommunityVote'].info.author"}% OR
%QUERY{"versions[CurrentState='CommunityVote'].info.author"}%

%QUERY{"fieldname@[info.date < d2n('2010-12-08') AND info.date > d2n('2010-12-05')].info.version"}% OR
%QUERY{"fieldname/versions[info.date < d2n('2010-12-08') AND info.date > d2n('2010-12-05')].info.version"}%

%QUERY{"versions[1].info.author"}% - who created the topic
%QUERY{"versions[1,3,5].info.author"}% - ditto for the first three revs
%QUERY{"versions.info.author"}% - array of authors of all revisions

It's a (little bit) more work for the code, but it would mean we don't need to add the @ operator, and it just feels right.

Note that this doesn't affect the need to have version "identifiers" such as :last

-- CrawfordCurrie - 09 Dec 2010

I think you're right. Does this mean we have to reset the clock again? smile

-- PaulHarvey - 09 Dec 2010

I was not planning to interfere in your discussions, though I have been following them closely, because they are a bit too technical, in particular regarding the parser's constraints, which I am not familiar with at all.

As a user, @() does not mean nothing to me. As a programmer it could mean too many things depending on your programming background, and I don't like implicit parameters either, that makes the syntax more error prone. Using versions is better for a human.

About "version". In my current installation of Foswiki System.Macros#QUERY_45_45_get_the_value_of_meta_45data is already documenting that %QUERY% can take a parameter rev="version" that makes the query "operate on the given version of the current topic.". Is there a reason why we do not extend that? Will this be deprecated?

As you know rev is also already used in URL construction.

Also we already use in URLs ?rev1=3;rev2=2 when we display differences between two revisions. For consistency sake, wouldn't that be reasonable to consider rev1 and rev2 instead of creating yet another way to designate a range of revisions?

The value of rev could be an actual revision number (also "first" and "last"), or a date and time when you use revat (plus revat1 and revat2 for ranges)

For me, it does take a tremendous effort to find out what

%QUERY{"fieldname/versions[info.date < d2n('2010-12-08') AND info.date > d2n('2010-12-05')].info.version"}%

actually looks for and returns. There are slashes, parenthesis, square brackets, dots, ' and ". That is a lot. I'm not quite sure I understand it.

I feel that is a bit far from what System.Macros says "Macros are text strings - %MACRONAME% or %MACRONAME{ parameter="value" }% ".

Wouldn't that be more in the spirit of that statement:

%QUERY{"info.rev" revat1="2010-12-08" revat2="2010-12-05"}% 

Also, my feeling is that with such complex syntax, people will only be able to cut and paste snippets and examples blindly, and that before they can actually construct their own queries they will have to dedicate a lot of time and efforts (which ultimately they won't do).

I am not raising a concern for that (so you do whatever you think appropriate) because you are all here more knowledgeable about the parser and syntax constraints, but since we are rethinking the syntax I felt I had to say this.

-- RaulFRodriguez - 09 Dec 2010

I fear that you are probably right. You probably do have to have programming experience to 'get' QuerySearch syntax: the main thing for me was realising that some parts of the statement are an array of topics, some filter that array, and there's the TopicObjectModel 'map' for accessing bits of topic (or set of topics). But I personally think we need to pursue the syntax that we've discussed here, partially because this is also the syntax that %SEARCH{type="query" will use and I'm not sure it's a good idea to add even more parameters to that thing (but OTOH I could be convinced that it is good idea).

QuerySearch is more of a reference topic than anything else, which is good, but I'd like to spice up the cookbook a bit, work on a new "System.QuerySearch for humans" topic as well, which uses animations like we did in System.Macros. I hope this can mitigate some of the barriers to using QuerySearch syntax.

Maybe we can make a wrapper macro (you could probably even build a series of them with EasyMacroPlugin) for those that "just want to do X" but can't stomach QuerySearch syntax.

-- PaulHarvey - 09 Dec 2010

I understand where Raul is comming from. And yes, the Query syntax in its full flour is a 5 to 7 range on the NerdoMeter.

But it is also a very powerful feature and very advanced feature. The syntax would probably be different if we had started from scratch now. But it is not a bad syntax and the important thing is that it has a simple syntax for the most used cases. In 99% of my query searches I just look for form field values and then I only have to write the name of the field.

When you take into consideration that the syntax is used also in SEARCH and SEARCH already has quite many parameters, then I would not want to add more parameters to it.

But in future we will probably see some new macros that will target specific queries with a simplified syntax.

But in the context of this proposal, I do not see any way that will get the NerdoMeter score significantly down.

But as also Paul says - we need to add a lot more to the cookbook. We could use 3 times as many examples in it. There is nothing like good examples that show the most typical use cases.

-- KennethLavrsen - 09 Dec 2010

I was going to say pretty much the same. A cookbook is absolutely essential especially now the syntax has grown.

The trade-offs you make when designing such a language are between compactness, expressiveness, readibility and familiarity. For anyone already au fait with CSS selectors, then they will find familiar constructs in %QUERY. Similarly for anyone familiar with XPath, XQuery, and even SQL. Unfortunately, for anyone unfamiliar with any of those, there is a steep learning curve.

I have started a new feature request, WriteAQueryCookbook, to try and address this.

-- CrawfordCurrie - 10 Dec 2010

Excellent progress. I guess we have separated the first of MichaelTempest's concerns into LoadDifferentTopicVersions, so we just have to fix the fact that

%QUERY{"versions[].info.version"}%

Result: QUERY{ "versions[].info.version" }: Missing operator in 'versions[].info.version' at '.info.version'

... currently gives a list of all versions, instead of none. I suspect this is actually standard behaviour on all array "things" not just versions, so I'm not quite sure what should happen next.

-- PaulHarvey - 11 Dec 2010

That's quite logical. versions is an array of all versions. Remember, '[]' means 'where' in SQL-speak, so SELECT info.version FROM topics WHERE is what you wrote. It's consistent with all other usages of []. Aside: Note that the default evaluation of versions is inefficient compared to the @ operator, because it always loads all versions, whereas @ only loaded the revisions returned by the RHS. However it is more general, and I think it should be left up to stores to optimise it)

Of more concern (to me) are the semantics of an integer index. versions is an array of the N versions of the topic, with the most recent version occupying position 0 in the array, and the oldest occupying position N-1. So if I have a topic with 3 versions, the array will get [3,2,1] and versions[1].info.version will not return 1 as you might expect - it will return 2, because that's what's at index 1 in the array. To get version 1, you have to write versions[info.version=1] (or versions[-1]). This is totally consistent with all other uses of the [] operator, but is potentially confusing for users.

The reason I like the array populated youngest-first is that versions[0] is always the latest revision of the topic. There's also the fact that arrays are 0-based, while revision numbers are 1-based (we could reverse the array, but then revision 1 would be at index 0, revision N at index N-1).

MichaelTempest, are you still concerned?

Later: after writing a bunch of doc, I think the integer index is explainable. see http://trunk.foswiki.org/System/QuerySearch and WriteAQueryCookbook for more.

-- CrawfordCurrie - 12 Dec 2010

While farting about with this, I realised there's a strong case for an operator to uniqify a list. For example, let's say I want to create a list of all versions that were authored by Fred but always want the latest rev, even if it wasn't authored by Fred. If I write:
versions[0,info.author='Fred']

I will get that, but if Fred authored the latest rev, it will appear in the list twice.

A uniq operator would allow us to remove duplicates from the list:
uniq versions[0,info.author='Fred']

-- CrawfordCurrie - 13 Dec 2010

Both SQL and MongoDB queries use 'distinct' - I guess uniq makes sense for users coming from SSP?

Also: I'd like to use 'uniq' on any old list like
%QUERY{"unique('1,2,3,4')"}%

so would it be better as a function? Or does that make it more difficult to optimize queries?

BTW, we are already using length() and + operators in one of our wiki apps, very cool not having to use SSP smile

-- PaulHarvey - 13 Dec 2010

unary function or operator, it doesn't make any difference (a function is a unary operator that takes a list as it's only argument).

What's more important is how uniq maps to an underlying DB. I had in mind that a DISTINCT would be the optimisation target, but that doesn't quite work. In fact, I'm having serious misgivings about mixing integer indices with query expressions in lists, for the simple reason that mixing them is an implicit JOIN, which makes mapping hard.

I really wish someone would with a bit more SQL-fu would help with this. I'm a bit lost.

-- CrawfordCurrie - 14 Dec 2010

I am not convinced that it is a good idea to use/permit both query expressions and integer indices to "index" versions[].

Suppose I write a wiki app where the query is built up using info collected from a user via a form, which has "by me" and "recent" checkboxes. That could give me a query expression like versions[info.date > d2n('2010-12-08') AND info.author='MichaelTempest']. I should be able to replace either or both of those conditionals with 1 if the checkbox is not checked, giving something like this: versions[1 AND info.author='MichaelTempest']. Or even like this: versions[1 AND 1].

However, I cannot do that with the simpler expression versions[info.author='MichaelTempest'] because versions[1] means something else. Sure, I could write versions[1 AND info.author='MichaelTempest'], which is reduceable to versions[1 AND 1], but I do think this could bite.

We had @(1, 7) meaning revisions 1 and 7. One way to express that with versions would be versions[info.version=1 OR info.version=7]. That is a mouthful. Suppose we had a revs operator that takes a list of revision numbers and selects for those revisions. My example would then look like this: versions[revs(1, 7)]. We could then define revs() to match no revisions, which would then address my concern.

We could also eliminate integer indices by adding an index operator, that takes a list of indices as an argument. Then you could say versions[index(0)] to refer to the latest revision and it would also open up the door for things like versions[index(0, 1, 2, -1)]. The way we define index would be important. I have in mind that the arguments to index "let through" the specified elements of the version array. Thus, index(n, m) is identical to index(n) OR index(m). This also eliminates the need for uniq.

I make no claim to having any SQL-fu, and I suspect there are good reasons not to do as I have suggested (e.g. this might be really clumsy in practice), but here is the suggestion anyway, in the hope that it gives someone else a better idea smile

-- MichaelTempest - 14 Dec 2010

We have always supported integer indexes in [] - there is nothing new in that. The only thing I have done is to extend it that so that the comma operator can be used to access a range of items from the array on the LHS. Also this relates to the comma operator, not the and operator, so I'm a bit confused by your examples.

In the final analysis the main reason for wanting to support integer indices is to be able to use [0] and [-1] to access the first and last items in the list, though other use cases - such as "the 3 most recent revisions" are also present. However I am also uncomfortable with mixing the two index types. But it's kinda unavoidable; maybe we just don't shout about it.

BTW all this is implemented, so if you are in any doubt about how it works, please do build examples on t.f.o to illustrate your points. They will never be wasted, as the working ones can be drawn into WriteAQueryCookbook

-- CrawfordCurrie - 15 Dec 2010

From reading the docs in QuerySearch and from looking at the code, it appears that there is no way to index an array with a computed expression that involves lookups in topics.

There is definitely value in doing that. We are often asked if it is possible to show only the "approved" version of a topic to WikiGuest, or to people not in some group. I have not ever seen an answer explaining how it could be done, but I think it could be done with view templates and/or tweaking skin templates, something like this (I say "something like" that because %QUERY{"LastApprovedVersion"}% would return blank if there is no LastApprovedVersion field or if that field is blank, which would result in all versions of the topic being show, concatenated together. Instead, it would be better to show nothing. That problem could be solved but it would obfuscate my example):
%IF{ 
   "$ WIKINAME='WikiGuest' " 
   then="%QUERY{"versions[ %QUERY{"LastApprovedVersion"}% ].text"}%"
   else="%QUERY{"versions[0].text"}%"
}%

However, if we had an index operator (which would only make sense in the context of the [] operator), then we could write the then clause as:
   then="%QUERY{"versions[ index( versions[0].LastApprovedVersion ) ].text"}%"
   <!-- Note: We want the text of the version identified by the value of the LastApprovedVersion field in the latest version of the topic -->

which would eliminate the need for nested queries in this particular case as well as giving a more intuitive result if the version number came from a preference setting e.g. "versions[ index( %APPROVEDVERSION%) ].text".

If this idea is feasible e.g. it could be translated to SQL, then I can remove my concern. I am not asking for this index operator to be implemented immediately, I would just like to know if it is feasible.

-- MichaelTempest - 16 Dec 2010

This seems focused on querying a particular topic. That needs to be expanded into querying multiple topics simultaneously. Consider the following scenario:

A foswiki is used to maintain the documentation, design documents, and source code for a product that goes through MANY releases. A customer finds a problem in an old release of the product. The customer would like to check the documentation of the product AS OF the time of the particular release they are using. More than that, the internal customer support and development teams need to consult the design documents and source code for the particular release of the product in order to assure that they are addressing the problem in the particular release and are not fixing problems in later releases.

In foswiki, this means:
  1. We need to be able to easily refer to particular versions of a topic.
  2. We need to be able to refer to a particular version of a topic by a label (or date).
  3. We need to be able to pass in a label to use in fetching all topics in the foswiki.
  4. Different users should be able to pass in different labels so they have different views of the foswiki.
-- DavidMasterson - 02 May 2011

I think we need to change the versions array in a QUERY is an array of versions indexed most-recent first - as that does not provide a lasting way for an application to query for a specific revision (and I think its hella confusing to pretty much anyone that uses both the revision numbers they see in the UI (that are static) and then tries to query for a specific revision.

for example, if a user stores an versioning ID (one would expect rev, but atm, thats useless) and then in a viewtemplate tries to retreive a foeld value from that specific revision, they have to do something counter intuitive - like versions[info.version=StoredRev].Fieldvalue - whereas most will expect to be able to do versions[StoredRev].Fieldvalue.

additionally, we need user friendly ways to say first&current and 'the one before' etc - that don't also make it impossible to insta-convert these queries to non-perl sources - and index of -1 is very perlish - but imo would be useful for retreving in the style that is currently positive..

-- SvenDowideit - 03 May 2011

  • Testing:
  • Testing:
  • Testing:
  • Testing:

-- PaulHarvey - 03 May 2011

While I don't have a problem with the index operator and I agree that labelling topics with a release date would be good practice, as would short-cutting info.versions, these are all extensions to the general "select a version based on a query" principle I outlined in the proposal. I have come round to Sven's view that the versions array needs to be a 1-based index of revision numbers, with -1 indexing the most recent revision.

Moving this to community vote didn't work, and there is an open concern, so I can't just move this to accepted :-(, but I'd like to get it in 1.2 if possible - even given the performance issues. MichaelTempest can you please clarify your concern with implementing the proposal as written at the top of this topic?

-- CrawfordCurrie - 21 Feb 2012

It has been a while, so I've had to re-familiarise myself with all this. I can see that I got sidetracked in the discussion - sorry about that. As I understand it: versions[ ] is an array, indexable by a list of integers, and the proposal is extending the syntax to permit query expressions instead of a list of array indexes. There is a problem of ambiguity. Is 0 an (invalid) integer index or a query that means "no versions"? Is 1 an integer index that refers to the first revision, or is it a query that means "all versions" (i.e. the same as versions[ ] )? My concern now is that the proposal does not address the ambiguity. This might be covered in QuerySearch on t.f.o, but for reasons unknown I cannot log in on t.f.o. I'll have to check that later.

-- MichaelTempest - 21 Feb 2012

I handled it using a special interpretation of a constant index in the [] operator. If the content of the [] evaluates to a constant (the evaluator does not have to go back to the DB to get a value), it is treated as an index rather than a query. Of course this means you can't use a query - such as a query for a field containing an index - as an index. You can of course use versions[info.version='4'].

Any operator that has exclusively constant operands (or more precisely does not have to refer back to the DB for any operand) is also a constant index. This [1+2] is a constant index but [1+info.version] is not.

I did not cover this in QuerySearch (I thought it was too nerdy), but can do so if you think it's needed.

-- CrawfordCurrie - 21 Feb 2012

I think it is necessary to cover this in the documentation because it has the potential to surprise. I know there is a tradeoff between wtf's when reading over-nerdy documentation and wtf's when figuring out under-documented stuff. In this case, I think it needs to be documented.

-- MichaelTempest - 21 Feb 2012

Thanks, Crawford - I'm removing my concern.

-- MichaelTempest - 22 Feb 2012

I thought I could deliver this for 1.2, but the discussion today on IRC has disabused me. A lot more thought - and hopefully a few more inputs - are required. Taking off the list for 1.2.

http://irclogs.foswiki.org/bin/irclogger_log/foswiki?date=2012-02-24,Fri&sel=271#l267

Resetting to UnderInvestigation. If you have anything to say, for the love of god say it - please don't leave me to implement something you are unhappy with.

-- CrawfordCurrie - 24 Feb 2012

Some thoughts to provoke you, they are incomplete but ...

I am not suggesting any syntax in the following. I have only thrown together some code to explore the problem. The solution still eludes me let alone the syntax.

I started out by considering something like:
  • All('Main.WebHome'/author = 'Fred') would be true if every single version has 'Fred' as the same author
  • Any('Main.WebHome'/author = 'Fred') would be true if only one version has 'Fred' as the same author

Now the true/false is necessary to decide whether 'Main.WebHome' should be included in the result set. However, there is nothing in the above syntax to state that I only want the current 'Main.WebHome' or every version (or matching version) to be included in the result set. After all, if creating a table out of the result set you may wish every $topic to include the $rev and buld one row per $topic$rev. Or you may wish to create one row per topic with optionally some of the cells populated with some of the revision details.

In attempting to come up with something generic I ended up with:
   versions[when -1 AND -2 OR any(-3..-5) choose -1, -2;
                 when -6 choose -6]

Why you would need something as complex I am not sure, but I cannot escape the idea that all(), any() etc are effectively point solutions, although they may still be useful as syntactic sugar (once you work out how extend the syntax to define the content returned by these queries).

Alternative options are groupby(version), where multiple hits on the same topic can be flattened into one, However, how does that work neatly in conjunction with versions[] in the query? In addition, how do we express the format syntax to repeat the output (by $topic$rev) should we wish to do so.

And to complicate things even more, if the query returns multiple versions in the result set, do you need the content access during formatting to relate to the actual set of returned versions rather than the set in the main FW database. For example the DB contains r1..r10 and the query returns r3, r5, r8 in the result set. During formatting you do not know the version numbers that will be in the resultset. Therefore, you need to refer to them using something like [-1, -2, -3] except that refers to r8, r5 and r3 from the original database respectively.

What about being able to say 'Main.WebHome'[info.author[-1] = info.author[1]], i.e Current author is also the original author. Or even more powerfully, 'Main.WebHome'[info.author[*] = info.author[*-1]], would be true for any consecutive topics with the same author.

I saw this line in the description above:
  • Note that if 'RoaldDahl' authored version 9, then you will get that version twice in the result.

Will that happen naturally from an SQL store? I would have thought you would get that database row only once.

Another possibility is to consider a query to run against all topic revisions not against all topics. i.e. the following 'topics' are examined one by one against selection criteria and either match or not. If they match then they are included in the result set or are discarded if not.
  • Main.WebHome@1
  • Main.WebHome@2
  • Main.WebHome@3
  • Main.WebHome@4
  • Main.WebHome@5 (also -1 and current)

Of course there would need to be a default version[-1]. This would then necessarily need groupby and having clauses (or something better) but may well result in a much cleaner interface, as the responsibilities are better defined. The downside might be slightly less power, but only for extreme cases.

I know I need to think about this a lot more.

HTH

-- JulianLevens - 12 Mar 2012

Multiple hits on the same version ought to be flattened into one hit. My prior comment was reflection the implementation I had at that time.

The problem here really is just one of syntax. The strategy I had take was to say "topic.versions" i.e. each topic has a list of versions associated with it, each of which is a pointer to a very similar object, but populated with the data relative to that version. The cons were proposing instead to express the versions of an object using an operator. This lead on to questions such as "how do I express a search over all versions of all topics ".

It's a fine point, but an important one. I decided to back down from the implementation I had done because my head had started to explode.

-- CrawfordCurrie - 12 Mar 2012

A fine point I have not yet grokked. Why is "topic.versions" not a case of topic with the ".versions" operator?

I may well be going beyond not merely syntax but semantically what we want to be possible. When this topic came up earlier I did not feel qualified (or had enough time) to analyse/comment. I'm not sure I'm qualified even now. Of course my somewhat dumb outsider view may help to re-conceive this (or not).

You say: Multiple hits on the same version ought to be flattened into one hit

Why ought? Now while I actually like the fact we gain some simplicity by flattening, why should this be the only option imposed on the user?

Primarily, while on one hand I cannot imagine many use-cases for complex version searches. I also see that the criteria used to match a topic (with optional version(s) criteria) is orthogonal to the selection of the specific revision(s) to see in the result set. (Let's not concern ourselves yet with any correlations between the two).

  1. Flattening matching is a way to define these rules, while pushing selection of version(s) in the result set to the formatting phase.
  2. Allowing Topic@1 Topic@2 etc during the matching process complicates the matching process (albeit it's still well defined), while pushing a flattening option (groupby or somesuch) to the formatting phase.

I have no strong sense that either of the above is the better choice (and there may be other choices).

Why was flattening during matching felt to be superior?

I just have that gut feeling that we may end up painting ourselves into a corner. That feeling may in part be driven by the fact that I don't quite get it yet, but there again ...

-- JulianLevens - 12 Mar 2012

The trick is to try to think of the kind of queries people are likely to make, and organise to make their queries as easy as possible. For example,
  • "Show me all versions of all topics which I authored"
  • "Search all topics as they were at time T"
Then create a syntax which supports ll those types of query while still being consistent with the query syntax we already use.

-- CrawfordCurrie - 13 Mar 2012

I have a use case for you but first ...

If we try to think of the kind of queries people are likely to make and cater for those. We are in danger of creating some point solutions that do not cover other possibilities.

And now that use case:

Imagine an app using data-forms that tracks a process. As part of the tracking one of the fields contains a state, something like: New, Being-Worked-On, For-Approval, Rework, ...

Now Rework should never happen and you want a topic to show any process that went awry — to spot any patterns for process improvement. That is show any process-topic that ever had (in its history) some Rework. Furthermore, the required table to display should also show the current version to check it's really OK now. There should be one table row per revision sorted with the most recent first. i.e.

ProcessNo State Rework Notes
Process123@15 Approved  
Process123@12 Rework Squiggle not foodled
Process123@9 Rework Foodle not squiggled
Process456@10 Approved  
Process456@8 Rework Squiggle not foodled
Process789@20 Approved  
Process789@15 Rework Squiggle not foodled
Process789@10 Rework Foodle not barred

While I am not sure that this will be a common requirement, it's seems quite a reasonable one to me. I also see no reason why other more involved version queries and selections should not be possible.

How does the proposed method allow us to create this sort of result?

The more I think about this, it seems clearer to me that we need something to provide distinct revision matching and revision selection (into the result set). Getting this right will cost more time now, but save much grief later with a solution that proves too narrow.

This topic is QueryAcrossTopicRevisions. I believe there needs to be (at least logically) a SelectAcrossTopicRevisions and they need to work neatly together.

I appreciate that this is an involved and as you say head exploding topic, and I hope this is not a wafer thin mint too far big grin

In my experience when my head is about to explode with ideas, it's often an indication that I haven't thought everything through yet.

-- JulianLevens - 14 Mar 2012

I too would really love to be able to get a resultset iterator which returned Web.Topic@123 hits rather than just Web.Topic; and whilst in itself this doesn't seem (to me) to be hugely difficult, the impact on the rest of core is quite high - all the way up to thinking very hard about how to make SEARCH do something useful with such a thing (Eg. multiple="revisions" ? revisionseparator/header/footer ? Etc.).

So, rather than attempt to solve all that up-front, let's try to stick to a full spec on the query side of things, while at the same time trying to anticipate all the needs of the selection concerns. I think a separate SelectAcrossTopicRevisions proposal would be a great start; example use-cases are excellent for seeding discussion.

All that said, we can (albeit inefficiently) get all the revisions using existing TML & proposed QueryAcrossTopicRevisions syntax, Eg.

%SEARCH{
  "versions[State='Rework']"
  type="query"
  format="   * [[$web.$topic]] had these revisions in =State='Rework'=:
$percntINCLUDE{\"%TOPIC%\" section=\"display\" topic=\"$web.$topic\"}$percnt"
}%
<verbatim class="foswikiHidden">%STARTSECTION{"display"}%%FORMAT{
  "%QUERY{'%topic%'/versions[State='Rework'].address}%"
  format="      * [[$web.$topic@$rev]] - $formfield(ReworkNotes)"
}%%ENDSECTION{"display"}%</verbatim>
Assumptions:
  • There's a topic attribute called 'address' (in addition to web and name) which would be the full Web.Topic@123 tuple (I'm not proposing this as a great idea, although we are doing that in MongoDBPlugin)
  • FORMAT understands Web.Topic@123 address notation

-- PaulHarvey - 14 Mar 2012
Topic revision: r75 - 14 Mar 2012, PaulHarvey
 
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. see CopyrightStatement. Creative Commons License