Feature Proposal: Give DataForms, field types, registerMETA ability to specify type of data they hold

Motivation

1. Field entry validation

If a data form app is designed to use a number in a field, it should be able to rely on it - so that users don't need to put a check in every use of a field.

This means we need to work out a way to tell someone that changes a form definition that what they're 'requesting' is impossible, or requires 'some list' of topics to be modified... - see Tasks.Item11026 for a little idea that might lead in a useful direction.

Allow a form and field specific default for AddSortingBehaviourToSearch and SupportMultiKeySorting.

3. Index generation for fast query caches

DBIStoreContrib, SolrPlugin and MongoDBPlugin have some great benefits over the BruteForce algorithm:
  • Unlike grep, which must parse entire topics, they can efficiently scan an atomised form of the topics, which reduces the volume of data that must be processed.
  • If the data type is known, indexes can be used to simply lookup values - scanning often isn't required (or possible).
  • If the data type is known, external databases can pre-sort results so that Foswiki doesn't have to. When there are many thousands of results in a web of tens of thousands of topics, this means Foswiki only has to fetch & process only the topics that are going to be displayed.
But there is a problem. Foswiki currently doesn't care about the type of data it holds anywhere; it's "magical" about sorting results. It "knows" to sort numerically when the resultset's sort key only contains numbers. But if we want to avoid slow double-handling of data, preventing Foswiki from touching any topics it doesn't end up displaying, then an external database must do the sorting for us. To do that it needs to know the type of the data so that it can set indexes efficiently.

Giving hints about the data type also means that QuerySearch might be able match a value of '007' as being the same as '7'; and an external DB could properly ascending sort -5.06 after -6.

This proposal aims to build a specification for specifying types on data held in the TopicObjectModel.

Description and Documentation

Foswiki has always been agnostic about data types. Our DataForms, QuerySearch and VarSEARCH users haven't had to think about data types before. Here are some "high-level" types, that might be further specialised by using + separated modifiers (much like select+multi+values is still a 'select' type):

Data types

  • number
    • number+integer
    • number+real (base2 float)
    • number+decimal (base10 float)
  • datetime
    • datetime+date
  • boolean
  • string
    • string+varchar (for a long string)
    • string+varchar+html
    • string+char[255] (implied when the size ==255)
    • case-less (there are times when we should really store 'thursday' irrespective of what someone's js sends us
  • fwaddress, Eg.
    • Some/Web/
    • Some/Web.Topic
    • Some/Web.Topic@3
  • list (to benefit QuerySearch 's IN operator)
    • list+<type>, Eg.
      • list+string
      • list+number+integer
      • list+fwaddress

But in reality, PaulHarvey suspects that only number and date (and possibly list, for some usages of QuerySearch 's new IN operator) will be immediately required.

  • Sorting on number fields requires a number datatype
  • Sorting on date fields requires a datetime datatype, but this can be automatically defaulted in the date type's Foswiki::Form implementation
  • Using QuerySearch 's new IN operator where the RHS is a formfield requires the list datatype

DataForms

Sadly, we need a way to specify number data type on a numeric fields, if users want to be able to sort on those fields correctly.

But we can provide a way for the Foswiki::Form field type to automatically set a default datatype, so for example the date field type would automatically default to a datetime+date datatype.

So, in practice, it should be possible for users to only care about datatypes when they're trying to sort on numbers.

Field types and their default data types

!DataForm field type Default data type
textarea
string+varchar
text
string
radio
string
checkbox
list+string
select
string
select+multi
list+string
checkbox
list+string
textboxlist
list+string
date
datetime+date
color
string
label
string
rating
string
bibtex+code
string+varchar
bibtext+fragment
string
It's inappropriate for MongoDBPlugin to index textarea because there's a small limit to the amount of data that can be indexed in a field. So in the case of string+varchar datatype, MongoDBPlugin simply wouldn't set an index at all. But SolrPlugin would. And DBIStoreContrib, might try to use a varchar index type appropriate for the SQL db it's connected to.

Foswiki::Meta::registerMETA

SemanticLinksPlugin makes use of this API, for example it registers a META:LINK datum ( many => 1 ) and we would love to do a query like this, which would take advantage of indexes:

%SEARCH{
  "links[value='Some/Web.Topic']"
  type="query"
}%

or perhaps the equivalent:
%SEARCH{
  "'Some/Web.Topic' IN links.value"
  type="query"
}%

Examples

%WHATDOESITAFFECT%
edit

Implementation

-- Contributors: PaulHarvey - 10 Aug 2011

Discussion

For compatibility, it still has to behave correctly when data is not typed. There are three scenarios I can think of:
  1. data is typed and the searching algorithm can sort on that data (will return a sorted result set)
  2. data is typed but the search algortihm is unable to sort on that data for some reason (e.g. the data type is not supported by the search engine)
  3. data is not typed
So somehow the Foswiki default post-search sort has to be told that it doesn't need to sort (case 1), may need to sort (case 2) or must always sort on a field (case 3). For each field.

Secondly, a search such as Field='value' presents a problem because Field might exist in several different forms, attached to different topics. If the data type is specified the same in all those forms, then no worries, but if not, how do you handle it?

Note that the data typing applies equally to search terms. For example, Field='01-04-2011' could be a string match, or a date match if the Field has a date data type.

-- CrawfordCurrie - 10 Aug 2011

MongoDBPlugin is missing a post-sort-search feature; that would be cool. I'm hoping if you're only fetching the datum on which you need to sort on, from say a resultset of 35,000 topics, that's going to be much quicker than the minutes of saturated network i/o it was taking for Foswiki to load/parse/process ACLs on each member of said res suultset (the problem we had prior to ACL filtering).

Regarding the second problem, I guess search engines could group/collect/table/namespace the fields by the form they're coming from (in reality, I'm surprised we've got as far as we have without making MongoDB collection-per-dataform, rather than collection/database-per-web). I haven't thoroughly thought that through: I've only been thinking of the minimum changes required from a plugin author/dataform user perspective.

As for the problem of query values not resembling stored/normalised/indexed values, such as with date... I guess we need new foo2foo() QuerySearch functions, like... Field=date('2001-04-01') or Field=date(OtherField) ? Ew... frown, sad smile we could make the second form illegal (only constant values allowed in the type-casting functions).

-- PaulHarvey - 10 Aug 2011

lets see what i can put together quickly.

  • mentioning adding type ~~ string+html - perhaps this is a way to implement AddTextareaPlusRichToTWikiForms - '=Add a textarea+rich type to the TWikiForms definition that would cover the textarea with a wysiwyg html editor and store the contents of the field as html.=
  • caseless - so taht we store no-case, and search indexes can optimise for that too. (related to SearchOrderOnFormfieldNoCase - do we have case-less sorting in the *SortProposal*
  • adding some more important Motivations

-- SvenDowideit - 14 Aug 2011

Added the SemanticLinksPlugin use-case for the registerMETA API

-- PaulHarvey - 15 Aug 2011

See also: ConvertDatesToISO

-- PaulHarvey - 30 Nov 2011

Idea: add an optional type attribute to META:FORMFIELD. If the META:FORM cannot be loaded, then use the type; otherwise use the type from the META:FORM. Update this field whenever the topic is saved and a form definition is loaded. At the moment if the form cannot be found all fields are given "text" type; this is a simple way to improve on that.

-- CrawfordCurrie - 30 Nov 2011

Great minds think alike - in fact I've been experimenting with this on our wiki. The idea being to nominate some fields as type 'fwaddress' which can be safely followed with OP_ref, but the idea is the same.

-- PaulHarvey - 01 Dec 2011

please, continue these thoughts - but don't forget to consider the flow-on of reducing the significance of the DataForm definition.

for example, if the typing info is in each topic, but the type changes in the definition topic - and thus is expected to change when rendering (without having updated the data topic)

given that iirc, editing a topic will update (as in remove) fields that are no longer in the dataform, and update them to the latest schema, knowing what the type was sounds handy - but even though you know the type, you've not addressed the change in how its rendered, or how to manage the required change to queries.

basically - its complicated, and the 2 opposing solutions would be:
  1. manage changes to schema/view/query all at once - so if the dataform definition changes, all topics that use it are changed to suit
  2. consider the dataform definition as only an initial seed definition, but this requires a linkage to versioned query and view definitions, so that if a schema changes incompatibly, old topics can be viewed and queried the old way.

even so - storing the current data's type (and values limitations) will be useful.

-- SvenDowideit - 01 Dec 2011

OK, but I think there's a simpler (3) treat the type in the topic as only affecting the view, all other accesses to the topic go back to the form definition (such as determining value constraints).

(3) is quite like the problem I fixed recently of the cache topic being out of step with the RCS history. I found it useful there to consider the possible scenarios.
  1. Type in field is inconsistent with form definition.
  2. No type in field in topic
  3. Typed field exists in topic, but there is no corresponding field in form definition
  4. Field does not exist in topic but exists in form definition
Then we have some access scenarios:
  1. Topic view
  2. Topic edit
  3. Topic load (as in: I want to manipulate this topic in code)
  4. Query (search) over form data

-- CrawfordCurrie - 01 Dec 2011

Any redundancy is bad, besides for caching purposes. Dealing with cached type info and trying to keep it up to date seems to be a major hassle. We suffered from similar in the past as Crawford reminded us (rcs vs topicinfo). Sven's having an important point and I've got the feeling that dealing with this type information by conditionally ignoring it, is going to be very tricky up to the point it becomes unpredictable for the user when view and edit react differently on DataForm changes.

Is it worth duplicating type info on each record? I am not convinced yet.

-- MichaelDaum - 01 Dec 2011

Well, the case I've got is that I want to draw a directed graph of topic links. This means only following formfields whose values really are valid topic names, and not, eg. some GATC sequence or a date.

Presently I've got a solution that works quite well, and without any knowledge of the dataforms it expects to encounter

But, perhaps the store can transparently add fields.type from the DataForm definition, to avoid polluting the actual topic.txt with unnecessary META:FILED.type keys

Just as I desperately wish we could have form.name.topic='MyForm', instead of form.name='MyForm' OR form.name='Web.MyForm' OR form.name='Web/MyForm'...

  • acacia_pachyacra_LLB_sequence_view.png:
    acacia pachyacra LLB sequence view.png

-- PaulHarvey - 15 Dec 2011

Regarding date handling, I have created the proposal CreateISODateFormat. Does the "strictly typed" date field isodate fit in?

-- ArthurClemens - 22 Feb 2012

Change to "accepted" 24/2/12 - you might want to consider what release.

-- CrawfordCurrie - 24 Feb 2012

Note that multiple indexes are possible, for example: sort=SomeTextField will potentially be able to use an index (and sort quickly) on a back-end store that supports it. Whereas something simple like sort=uc(SomeTextField) would prevent the use of the index. To solve this either the index is defined as uc(SomeTextField) or if a raw and uc() searches are required then define both as indexes.

As noted sorting and indexing issues are close bedfellows. i.e. a query for 'LEVENS' = uc(SomeTextField? ) is likely to lose the benefit of the index.

On my work with VDBI, it has become clear that I may need to store fields two or more times: exactly as-is and indexable version(s).

Some extra indexes could be specified by the core or extensions

When VDBI is installed, it will be necessary to load the database from the text equivalent and I would like to store any field into a searchable form, but:
  1. What do I do if the DataForm says "date" but the embedded META:FIELD says its text?
  2. What do I do if the DataForm says "date" but the embedded META:FIELD isn't given?
  3. What do I do if the DataForm says "date" but the embedded META:FIELD is given but has a blank value?
  4. What do I do if the DataForm says "date" but the embedded META:FIELD is given but has an invalid value?

Some of this may be a case of the admin running an initial load with error reports, but ideally some sensible defaults would be (respectively to above)
  1. Take the type as date. With an SQL back-end, every field must be of the same type to be of any use.
  2. Store as a NULL? (undefined in QUERY speak)
  3. Store and hence query/sort with a date of '0000-00-00' or '9999-99-99'? Or also treat as NULL?
  4. Store and hence query/sort with a date of '0000-00-00' or '9999-99-99'?

Note that the original field can also be stored as-is, such that '12-March-2008' could still be returned from the store, this would minimise the impact to how an application would appear, but search and sort would be much faster.

Some of these need to be options on the DataForm.

Alternate indexes can be constructed, there was a request for a field like A-001-X567, but the numbers may not have leading zeroes, i.e A-1-X8 might also occur. All that requires is a mask to create the index with appropriate number of leading zeroes, but retrieve the original from the store. However, from where we are, I think that this is a step too far and this can be added without needing to prepare the way now.

It seems to me that we have 4 stages of data structure:
  1. Unstructured: i.e Natural language; paragraphs, sentences & words
  2. Loosely structured: Headings, Tables etc
  3. Formal structure: DataForms
    • Now under control of Foswiki via the UI
    • But can be damaged by direct manipulation of the text files
    • But over time datatypes can changes
  4. Strictly structured:
    • Now under DB control, types controlled across the board not just one field at a time
    • Direct DB manipulation can possibly damage structure but less likely, it's not a valid API in any sense
    • Versioning side must handle change types over time, but the change is necessarily across all fields not just the next update of one topic

Do some degree or other we are moving towards a more strictly structured foswiki. To gain performace thru indexes it's inevitable. My VDBI goal is to minimise this, but I do not believe it can be eliminated.

Note that the store will work without indexes, and as a pluggable replacement that's important. It's just that some performance benefits will not materialize without some indexing nous — i.e it's becoming rather geeky. Mitigating poor search performance can achieved by capturing query statistics for the FW admin/dba to analyse. Making it less geeky is somewhat harder.

The strictly structured concept suggests the possibility of a +strict modifier (or equivalent). OTOH, it will necessarily be a given in many cases.

Please also note that I have a SuperDataForm idea to deal with other issues so there is likely to be overlap. I'll write up a proposal real soon now — honest!

Note that when the DataForm is modified this could trigger some mass updates, but that will just need to be handled via the UI (are you sure? And then the updates occur etc)

A related issue is that the changed DataForm could be utter rubbish initially. Within VDBI I am planning to note which version of the DataForm was the last valid one. Only when a valid version is specified would any changes be accepted. Actually, because DataForms are currently updated in a suck-it and see style as encouraged by the wiki (loosely structured) and we now need a stricter nature for the DataForm, this is a conflict. We arguably need a UI to 'Approve' the new DataForm in addition to merely saving it. There, are alternatives: allow a temporary new DataForm and create a few test entries and approve the change when A-OK. There maybe better ideas, but I suspect that they are all non-trivial.

-- JulianLevens - 27 Feb 2012

All valid points, I think both Sven & Crawford have encountered the same design issues.

For what it's worth, a MongoDBPlugin -powered query in Foswiki is still much faster than a vanilla one even with all indexes turned off. Especially when it is pre-filtering the results according to ACLs for the user under which the query takes place; by delegating as much work as we can to the DB, we prevent too much double-handling of data by Foswiki in perl, and that already speeds things up enormously.

In my own experience, getting indexes right is great for overall query performance, but matters mostly (or at least: impact is more noticeable) for nested SEARCHes where we are firing dozens or hundreds of queries in a single page view.

In my mind, I think it's best to treat missing/invalid values as nulls in the index, rather than using zeroes/1970.

a query for 'LEVENS' = uc(SomeTextField? ) is likely to lose the benefit of the index. - but could be hoisted to a query that can make use of a case-insensitive index. With MongoDB I've got many queries that look like lc('%URLPARAM{"q"}%')=lc(SomeTextField) on webs with on the order of 50-60,000 topics, and lack of indexes isn't really an issue here (unless used in a nested SEARCH).

This multiple-index strategy is what Solr does best; in fact, I wish I had the time to write a query algo using Solr - although it's not expressive enough to run much of the QuerySearch language, many simpler expressions should be possible.

-- PaulHarvey - 27 Feb 2012

I have found that hybrid search algos - running simple hoisted searches, such as full-text searches using Sphinx - to narrow down the set of potential topics, then using brute-force to perform a final search, successfully accelerates 90% of searches (which are usually looking for a narrow set of topics). Don't get too hung up on complete coverage in the index - it's a good goal to have, but expect a leprechaun to be waiting for you with a pot of gold when you achieve it.

-- CrawfordCurrie - 28 Feb 2012

Crawford: I've been in SQL land quite a bit lately especially wrt indexes, as a result my thoughts above do give the impression that I'm trying to create perfect indexing. I also had the CreateISODateFormat proposal on my mind and it's specific needs.

Well I'm not trying to create perfect indexing.

VDBI will by its nature index every field, but only as a string. Actually that's simplified. Things like date and number could be indexed by storing a canonical text version that sorts (and hence indexes) appropriately. However, I'll probably add a table for all date fields for efficiency, specifically allowing the SQL engine to do a lot of the work. I will need something similar for numbers. Everything else can be a string.

Starting from where we are, this will be a significant speed-up (as you confirm) which for the general Wiki and App usage will be (I hope) more than enough. If someone needs more performance then there's nothing stopping some further tweaking, but let's wait for these needs to appear and then worry about them.

I'm very much in favour of the work here, I dare so we all have some misgivings, but we'll only sort those out as we work on things. My VDBI work is certainly generating a lot of questions and ideas in my mind.

Finally, I do not believe it is possible to have a completely store-agnostic solution to this. Actually it is, but only if you sacrifice making the best of the stores abilities — which is a mostly about performance.

-- JulianLevens - 28 Feb 2012

I really need to share what we've learnt with our Mongo adventure: if you want efficient queries over 10,000s of topics, it is crucial to avoid chattiness, to delegate as much of the query/filtering to the DB as possible. I can't stress that enough: this is drastically more important than fiddling with indexes - in fact I can have all indexes disabled and Foswiki still struggles to formulate a query convoluted enough that Mongo can't answer it (under production load!) inside of a second or three (we have ~230,000 topics, largest web ~65,000). (Actually, that's a lie - some d2n() queries (I think there's a bug wrt nulls) and especially OP_ref queries - these aren't native query ops, we execute a javascript program on the server - can take 10s of seconds on the larger webs).

Whenever Foswiki must load thousands (or even hundreds) of topics into memory, that totally dwarfs any DB query times, indexed or not. Example:
  • We had a web of ~35,000 topics
  • query looked something like "form.name='Web/Topic.DataForm' AND Status='Foo'"
  • Mongo's explain() indicated that this was a reasonably fast query (using indexes), even when loaded with real traffic on prod: circa 5-10ms
  • Nonetheless, with most of the topics ACLs set so that only SpecialGroup could view them, non-admin users experienced page view times of up to two minutes to get the first page of 25 hits!
  • Profiling indicated Foswiki was sucking in almost all 35,000 topics (~200MiB!) on each view - to check ACLs, and hide hidden stuff from the result
  • Sven then added the ACL filtering code, so that ACL filtering was delegated to mongo. WikiGuest, SpecialGroup, non-special users alike were all able experience ~1.6s page views

This is also why in MongoDBPlugin, we can get away with terribly anti-scalable implementation decisions ("thunking to JS" is terribly frowned-upon: according to the "webscale" crowd, you should always use native query ops, as executing javascript server-side is very expensive), and yet still enjoy such enormous performance boosts. Because, let's face it, even with our wiki at 230,000 topics.. we're nowhere near having a "big data" problem

That's not to say that good indexes aren't important - if it weren't for indexes, I wouldn't be able to have massively inefficient nested SEARCHes performing hundreds of queries in a given page view - I just want to stress that getting Foswiki to scale absolutely demands minimization of database chattiness.

-- PaulHarvey - 29 Feb 2012

I have been thinking about DataForms quite a bit and it suggests to me deeper problems that need to be addressed. I have sets of DataForms as part of work flows. In one work flow state I capture a date. In another state that date is now a label (so it cannot be changed during form edit). As a label it will not be seen as a date, so how can any database know to treat this field as a date. This could be fixed with date+label I guess.

As I've already stated, I believe that AllowTypedData needs to introduce strict types (as distinct from existing loose types) which should also properly consider rendering if not the whole MVC business.

There's another complication relating to my date+label thoughts, what about number+select? That is to say the the type is at heart a number and should be rendered as such, but it's also a select type for editing where a discrete range of numbers are offered.

-- JulianLevens - 10 May 2012

I suggested something along that idea with the column at the top, mapping traditional foswiki types to the 'new' data-type spec. In your example, the new-style way would be list+number, but I see I didn't tackle the problem of giving an appropriate view = select.

Hrm. I really will try to get my datum-handler thingy into a workable state, which tries to be an MV(P)C architecture to replace Foswiki::Form - initial inspiration was the disconnect between Foswiki's (X)HTML-centric renderForDisplay/Edit vs other views that we wanted (RDF, CSV, jqGrid, etc).

-- PaulHarvey - 10 May 2012 - 23:58

This feature proposal has lost its developer. Discussion is stale for 3 years. It still is unclear how types are assigned in DataForm definition. There doesn't seem to be consensus about it and no such definition in the proposal description as far as I can read. For this reason I have to raise concerns.

-- MichaelDaum - 21 Sep 2015
 

ChangeProposalForm edit

TopicSummary Allow data forms, Foswiki::Form field types and registerMETA to specify data/index types for query/search algo's w/indexes
CurrentState ParkedProposal
CommittedDeveloper
ReasonForDecision NoCommittedDeveloper
DateOfCommitment 14 Aug 2011
ConcernRaisedBy
BugTracking Tasks.Item10437
RelatedTopics Tasks.Item10721, ConvertDatesToISO
PlannedFor
I Attachment Action Size Date Who Comment
acacia_pachyacra_LLB_sequence_view.pngpng acacia_pachyacra_LLB_sequence_view.png manage 131 K 15 Dec 2011 - 23:39 PaulHarvey  
Topic revision: r25 - 21 Sep 2015, MichaelDaum
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy