HowToAddressNonMETAmetadata (10 Aug 2011, PaulHarvey)

There are a couple of "things" in Foswiki topics that are really meta-data - well, they are processed that way, though they appear in text as equal citizens. We really need to make them addressable.

Tables

Tables are used a lot for capturing structured data. There are many reasons for this, which I won't go into here; suffice it to say that tables store structured data, and we need to be able to get at it easily.

At the moment a "table" means different things to different people.

The Foswiki renderer can generate an HTML table from many different types of data. Different representations of tables exist at different phases in the rendering pipeline, e.g. before and after macro expansion, before and after TML rendering.
Plugins have their own idea what constitutes "a table" e.g. TablePlugin, EditTablePlugin, SpreadSheetPlugin, and how to address them.

As a result there is an abundance of application-specific table parsers (noted elsewhere by MichaelDaum).

There are a number of different ways in which we want to "deal" with tables. These include:

1. Making them addressable in the query language, so you can search on them
2. Capturing them in database caches for rapid manipulation (as in FormQueryPlugin)
3. Making them CRUD accessible via a REST interface (see RestPlugin for more on this)
4. Custom formatting in the rendering pipeline (such as that performed by SpreadSheetPlugin and TablePlugin)

For 1 through 3 we need a canonical addressing scheme that lets us get at the source of tables i.e. the TML table in the source topic. For 4, we need something different, because the formatting pipeline builds an HTML table by collating many sources. So, let's say for the purposes of this discussion we dismiss (4) and focus on 1 through 3.

So, how do we make tables addressable? One way would be to treat TML tables as meta-data (bear with me). Imagine that instead of writing tables as:

| tables | like | this |

we wrote them thus:

%META:TABLEDATA{col="2" data="this" index="0" row="0"}%

(I'm not suggesting we do this, just trying to illustrate the point that TML tables can be "read" at the same time as meta-data).

Obviously the META:TABLEDATA view maps immediately and cleanly to the extensible meta-data schema. You would then address the table in a query something like this:

%QUERY{"tables[index=0 AND row=0 AND col=0].data"}%

I'm not seriously suggesting this schema, BTW; just that reading tables as meta-data needn't be frightening.
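To make the "read tables as meta-data" idea concrete, here is a minimal sketch in Python (chosen purely for illustration; it is not Foswiki code). It assumes a table is any run of consecutive lines that start and end with `|`, ignores rowspans, colspans and escaped pipes, and the names `tml_table_cells` and `as_meta` are hypothetical.

```python
def tml_table_cells(topic_text):
    """Return one record per table cell; index, row and col are 0-based."""
    cells = []
    table_index = -1
    row = 0
    in_table = False
    for line in topic_text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            if not in_table:
                # First row of a new table
                in_table = True
                table_index += 1
                row = 0
            # Drop the outer pipes, then split on the inner ones
            for col, data in enumerate(stripped[1:-1].split("|")):
                cells.append({"index": table_index, "row": row,
                              "col": col, "data": data.strip()})
            row += 1
        else:
            in_table = False
    return cells


def as_meta(cell):
    """Render a cell record in the illustrative META:TABLEDATA form."""
    return ('%META:TABLEDATA{{col="{col}" data="{data}" '
            'index="{index}" row="{row}"}}%').format(**cell)
```

Each record carries the same four fields as the illustrative META:TABLEDATA line, so a query over `index`, `row` and `col` reduces to filtering this list.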

I have already done quite a bit of work in EditRowPlugin on making tables cell-addressable, and it's quite straightforward until you start to deal with rowspans and colspans (and even then it's not that hard). It's not hard to build a clean mapping from a "meta" model of the table to the physical |representation|in|the|topic| if you focus on the source of the topic and ignore the rendering pipeline. Cell addresses can even be passed on into the HTML representation using HTML tag attributes or similar.
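As a rough illustration of mapping a cell address back to the physical topic source (not EditRowPlugin's actual implementation), the sketch below rewrites one cell in place. It is Python for illustration only, assumes simple tables without rowspans or colspans, drops any leading indentation on the edited row, and `set_cell` is a hypothetical name.

```python
def set_cell(topic_text, table_index, row, col, new_data):
    """Return topic_text with one cell replaced; all addresses are 0-based."""
    lines = topic_text.split("\n")
    current_table = -1
    current_row = 0
    in_table = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            if not in_table:
                in_table = True
                current_table += 1
                current_row = 0
            if current_table == table_index and current_row == row:
                cells = stripped[1:-1].split("|")
                cells[col] = " " + new_data + " "  # IndexError if col is out of range
                lines[i] = "|" + "|".join(cells) + "|"
                return "\n".join(lines)
            current_row += 1
        else:
            in_table = False
    raise IndexError("no such table or row")
```

Because only the addressed line changes, the rest of the topic text round-trips untouched, which is the property a CRUD interface would need.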

So, anyone see a good way to make this cleaner/more usable? (a bunch of early IRC discussion can be found at http://irclogs.foswiki.org/bin/irclogger_log/foswiki?date=2010-11-06,Sat&sel=498#l494)

Preferences (and access controls)

* Set BLAH = is parsed out of the topic early in the processing cycle, and a massive great data structure built for every request, most of which is redundant. Also, there's no clean way to access a preference setting (not the value, the setting) which is needed for CRUD.

Taking the same approach to analysis as above,

   * Set BLAH = blah

is analogous to

%META:PREFERENCE{name="BLAH" value="blah"}%

So there's no particularly good reason why * Set statements in the topic body shouldn't be addressed in the same way as meta preferences.
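A minimal sketch of that equivalence, in Python for illustration only. The regular expression is a simplified assumption about the `* Set` syntax (indentation in multiples of three spaces or tabs, single-line values, `Set` only), and `preference_settings` and `as_meta` are hypothetical names. Recording the line number is what makes the setting itself addressable, not just its value.

```python
import re

# Simplified assumption: indentation in units of three spaces (or tabs),
# then "* Set NAME = value" on a single line.
SET_LINE = re.compile(r"^(?:\t| {3})+\*\s+Set\s+(\w+)\s*=\s*(.*)$")


def preference_settings(topic_text):
    """Return one record per * Set line, including where it lives in the source."""
    settings = []
    for line_no, line in enumerate(topic_text.splitlines()):
        match = SET_LINE.match(line)
        if match:
            settings.append({"name": match.group(1),
                             "value": match.group(2).strip(),
                             "line": line_no})
    return settings


def as_meta(setting):
    """Render a setting in the same shape as a META:PREFERENCE line."""
    return '%META:PREFERENCE{{name="{name}" value="{value}"}}%'.format(**setting)
```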

Headings

Headings, as in the things used by TOC, have been discussed as a structural element several times in the past.

-- CrawfordCurrie - 06 Nov 2010

You know, it seems natural now, but the first time I saw the * Set syntax in TWiki I thought it seemed a shortcoming - existing only because of a lack of a preferences editing UI. On reflection, I guess it's only half-true... it is a UI shortcut, however not an entirely bad one.

But users are often asking how to hide them. So what am I rambling about here... I guess I'm surprised to see that trying to make * Set lines addressable TOM elements is easier than just doing a preference editing UI that could make %META:PREFERENCES% just as accessible as in-line * Set PREFERENCES (to the extent that we could do away with them).

-- PaulHarvey - 07 Nov 2010

ok, I think the table example is a particularly good one.

I have always assumed that we would build a way to transform from a table to a subweb and query, and back - both as a 'once off' when the quick table becomes too big, as a way to collapse an application that has perhaps been retired, a way for some users to build applications and then 'deploy them', and importantly, as a touchstone for building a query - the presumption being that this mythical query should survive the transform.

so I think of tables as an automatically addressable sectional resultset - a la SEARCH{"log[user='sven']"}, but instead something like SEARCH{"'Web.Topic'/table2[col2='banana']"}, though obviously this works better if the table has a header naming the columns, and the table were named somehow...
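To show what a header-addressed table query might look like, here is a hedged Python sketch (illustration only, not an existing Foswiki API). It assumes the first row of each table is a header written as `*bold*` cells, addresses tables by position since they have no names, and `table_rows` and `select` are hypothetical names.

```python
def table_rows(topic_text, table_number):
    """Return the rows of the Nth table (1-based) as dicts keyed by header cell."""
    tables, current = [], None
    for line in topic_text.splitlines():
        s = line.strip()
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            if current is None:
                current = []
                tables.append(current)
            current.append([c.strip() for c in s[1:-1].split("|")])
        else:
            current = None
    header, *body = tables[table_number - 1]
    # Header cells are assumed to be written as *bold*
    names = [h.strip("*") for h in header]
    return [dict(zip(names, row)) for row in body]


def select(rows, **criteria):
    """Keep the rows whose named columns equal the given values."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]
```

With this, something in the spirit of `table1[fruit='banana']` becomes `select(table_rows(topic, 1), fruit="banana")`.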

THEN, we move to other non-META - things that are outside the topic.txt - like revision info, log files, performance data, you name it

-- SvenDowideit - 07 Nov 2010

Paul, good point. Quite a few "custom UIs" can be moved into client-side things, I suspect.

Sven, not sure how that works; any table, table row or table col needs an addressing mechanism. We discussed using table headers to do that, but it is fraught with problems. You can add %TABLE tags to provide a handle to the table, but it isn't a general mechanism. I guess I can't see how that could work with tables as we have them.

Good point on the "meta-ness" of other things, such as logs. Not quite sure how you address those things in the query language, which is inherently context-oriented - queries are asked in the context of a topic, so should the logs returned be those for the topic? If so, how do I get logs for the entire system? I think that needs some more designing.

-- CrawfordCurrie - 07 Nov 2010

Two thoughts.

First, most processing of topic text in Foswiki stems from its historical roots of treating it as plain text. While there is DataForms to add a bit of structure attached to the rest of the txt file, this doesn't qualify it as a real structured wiki. A really structured wiki would treat its topics as structured objects like a DOM document where all sorts of objects could be embedded at any point the user wishes, not only attaching some structure to the bottom of the page.

So a real structured wiki would deal with really structured documents ... which they aren't. They are still plain text files.

Second thought: another way to associate a structured object to a topic is to attach an xml document to it and use jqgrid to edit it. Extracting data from it can be done using regular xml query tools.

-- MichaelDaum - 07 Nov 2010

I'm happy to keep using topics-within-topics as my "DOM" and live without a proper structured DOM within each topic itself. TML's informal, freestyle flavour is quite accessible to non-programmers (actually wiki powerusers). I got SemanticLinksPlugin running and less than 3 hrs after building a test report page, a colleague enhanced it to emit a dot graph. I'd hate to lose that.

I don't want to sound too "anti-DOM" (actually, I think it'd be great to have proper list + table parser like we have a macro parser), but XML gives more bureaucracy than you want or need - and while that CAN bring its own benefits, requires much more effort to keep from tying yourself into knots. Anyway, if we really wanted to tangle ourselves up in XML, I'd much rather be doing it in RDF+SPARQL

-- PaulHarvey - 07 Nov 2010

We're at risk of confusing concerns here. As discussed in RestPlugin, content can be viewed in different formats, so an XML view of a topic/dataform/table is just another view of the topic, akin to JSON (or HTML, or any other DOM view, for that matter). As long as the model and the view can be reconciled in both directions, it'll work, and it's interesting but irrelevant to this particular discussion. Once we have a data model that lets us extract the structure from TML, then saving it as XML is perfectly reasonable. However what we are talking about here is a way of extracting that structure from existing TML syntax elements that are, at present, unstructured i.e. tables.

What's bugging me most at the moment is the question of "what's a source". The power of many plugins derives from the fact that they operate not on the source form of the topic, but on a partially processed form. For example, the SpreadSheetPlugin is a lot less powerful if you don't expand %MACROs before it's called. MichaelTempest proposed an "early eval" macro style that would work at template expansion time, and would at least allow you to import topic content from external sources before the table parser was invoked. However my concern with this is that without a coordinate object model, there is no way to distinguish such "meta tables" from the "hard tables" in the topic source, so tools don't have a route to drive changes back to the source. For example, I could write:

| Blah |
%GETSOMEROWSFROMGOOGLE%

and GETSOMEROWSFROMGOOGLE could expand to table rows, and you could parse those rows and pass them to another plugin, but you have no way to know those rows came from a different data source. An integrated table model would dictate an API for the plugin that implemented GETSOMEROWSFROMGOOGLE so that changes could be driven back.
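One way to picture what such an integrated model would have to carry is per-row provenance. The Python sketch below is only an illustration under stated assumptions: a macro occupies a whole line, handlers are plain callables returning TML rows, and `expand_with_provenance` and the handler registry are hypothetical, not an existing plugin API.

```python
def expand_with_provenance(lines, macro_handlers):
    """Expand whole-line macros, tagging every row with where it came from."""
    rows = []
    for line_no, line in enumerate(lines):
        s = line.strip()
        if s.startswith("%") and s.endswith("%") and s[1:-1] in macro_handlers:
            name = s[1:-1]
            # Rows produced by a macro remember which macro generated them
            for generated in macro_handlers[name]():
                rows.append({"text": generated, "source": "macro", "origin": name})
        else:
            # "Hard" rows remember their line number in the topic source
            rows.append({"text": s, "source": "topic", "origin": line_no})
    return rows
```

A tool editing a row could then check `source`: a "topic" row is written back to the recorded line, while a "macro" row would have to be handed to whatever write-back API the owning plugin provides.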

-- CrawfordCurrie - 08 Nov 2010

I think Michael's point (largely) is that if we were to magically drop everything, abandon TML and switch to XML we could lean on mature 3rd-party stuff to do querying and transformations of documents. And then my comment just muddied the waters.

Now, what I'm about to say hardly gets the discussion back on track, but IMHO "external" (as in, outside the wiki - on the web) data needs a much richer container than a table to start with

(in our work, maintaining links with original sources is critically important. So I've been using topic-per-datum, working under the assumption that MongoDBPlugin will make this feasible).

-- PaulHarvey - 08 Nov 2010

My point is: TML is too fragile to store structured information in an extensible and reliable way. TML tables are an okay way to - well - draw table data server side, a kind of shortcut for its HTML output. It is nothing more. The concept had better not be used for storing structured information. That's bound to fail for either usability or scalability reasons. XML does a better job in that respect. On the flip side, XML isn't "forgiving" to the degree we'd like when letting users write wiki content. However, even TML tables break all too easily and reach their manageable limit rather early, say at > 5 columns, > 5 rows, no matter if you are doing pure TML, edit table or wysiwyg.

That's why you'd either change something conceptually about topics, like making them a real DOM where TML table syntax is just a form of editing the real object underneath (which doesn't cut it anyway imho), or look somewhere else how to store structured metadata in a topic.

-- MichaelDaum - 08 Nov 2010

While I sympathise with your viewpoint, it's a bit like giving up before you started. What I'm trying to do here is explore ways to extract structural information from existing content. TML + embedded meta is close to end-of-life, I agree, but we're not quite there yet.

-- CrawfordCurrie - 08 Nov 2010

I'm beginning to think that the QuerySearch/TOM stuff - our structured data - might be legitimately different to WikiSyntax document markup.

Here are some WikiSyntax DOM APIs worth reading about:

X-Wiki's Rendering Module, which can input and output various syntax
MediaWiki parser (blog post intro)
perl wikipedia toolkit
WikiModel - "This project contains a set of wiki-related libraries, such as a parsers for various wiki syntaxes, and common wiki model (event- and object-based)." cool, it handles CommonSyntax, Creole, MediaWiki, Confluence, JSPWiki, XWiki and others.
doxia, which powers Maven's markup

And there are many others.

I've been experimenting with generating a parse tree for TML. If we want to take this seriously, there's a serious question of performance. I want to see how much worse (or better?) perf might be if we parsed to a tree first, and then generated HTML separately.

My goal is to be able to tune the rendered output for things other than HTML.

It would also be nice if WysiwygPlugin's TML2HTML was using some of the same code as Foswiki's core rendering system. They could share the same parser, arriving at the same parse tree, but apply a different HTML generator appropriate for those two consumers.

I also want a better way to integrate new syntax like SemanticLinksPlugin. The current spaghetti mess of regexes is extremely brittle and I'm terrified of extending it. Maybe I just need to add more tests... I dunno.

At this point I think if I ever find time to continue this exploration, it will be working towards generating a tree that is compatible with CPAN:XML::XPathEngine - then we can use their selector/traversal code, "for free" (I hope). Perhaps we could also cache bits of these parse trees to avoid recomputing it on each request.
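The general shape of that idea (parse to a tree once, then reuse a ready-made path selector) can be sketched as follows. This is not CPAN:XML::XPathEngine; it swaps in Python's standard `xml.etree.ElementTree`, which supports only a small XPath subset, and it handles just headings, simple tables and plain text lines. `parse_tml` and the element names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^---(\++)\s*(.*)$")


def parse_tml(topic_text):
    """Build a small element tree: <topic> holding heading, text and table nodes."""
    root = ET.Element("topic")
    table = None
    for line in topic_text.splitlines():
        s = line.strip()
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            if table is None:
                table = ET.SubElement(root, "table")
            row = ET.SubElement(table, "row")
            for col, data in enumerate(s[1:-1].split("|")):
                cell = ET.SubElement(row, "cell", col=str(col))
                cell.text = data.strip()
            continue
        table = None  # any non-table line ends the current table
        match = HEADING.match(s)
        if match:
            heading = ET.SubElement(root, "heading", level=str(len(match.group(1))))
            heading.text = match.group(2)
        elif s:
            ET.SubElement(root, "text").text = s
    return root
```

Selection then comes "for free" from the library, e.g. `parse_tml(src).findall("./table[1]/row/cell[@col='1']")` for the second column of the first table. An HTML generator, or any other output format, would walk the same tree.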

Then again, I'm not sure that CPAN:XML::XPathEngine has been written with CGI apps in mind... its performance might make it entirely unsuitable.

Or perhaps this entire approach is flawed... we'll see, I guess.

-- PaulHarvey - 09 Aug 2011