Feature Proposal: Separate out storage of files that are not managed by Store APIs

Motivation

We've had this discussion multiple times, and have never really settled on and documented a resolution. There are multiple ways to solve the issues, and we need to move toward a single solution. The root cause of these difficulties is that Foswiki currently intermixes "Attachments" and other static cached files within the same directory structure. This is blocking migration to alternative store implementations. It has even resulted in some developers reverting other developers' work in this area.

There is hopefully some agreement on requirements:
  • We need a way to allow the web server to serve files via a "fast path".
    • Critical for "non-attachments" like the .js and .css files.
    • Also critical for "semi-attachments" like the very large collections of icons.
    • It would be nice if "fast path" files could be handed off to a CDN for delivery.
  • There are files that never need visibility from Topics, and should never be revision controlled.
  • The "Store" API needs to be "correct", in that the data stored as Attachments and the files in the attachment "locations" should be consistent. If not, certain tools like bulk_copy give inconsistent results.
  • Plugins need a low-overhead API for creating and maintaining cached files available via the "fast path"

Current Practices address the above in various haphazard ways:
  1. Simply placing files in pub/Web/Topic directories without using the APIs. These may be discovered by Auto Attachments, or the UpdateAttachmentsPlugin.
  2. Naming files with leading underscore. Used to hide Webs (eg. _default), Topics, Attachments, and other pub files. Store can find these with attachmentExists, but will not include them in the iterators.
  3. Naming files with a leading dot. Used by Store (.changes) and the web server (.htaccess).
  4. Naming "web" or "topic" directories using a lower-case first character. Recent ImagePlugin uses that to hide the image cache in pub.
  5. Placing files in sub-directories, ie. pub/Web/Topic/subdir/... Used by several extensions, especially TinyMCE editor.
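
The five conventions above can be sketched as a classifier over paths relative to pub/. This is only an illustration, not existing Foswiki code: the function name, the category labels, and the assumption that the first two path components are the web and the topic (no nested webs) are all hypothetical.

```python
from pathlib import PurePosixPath


def classify_pub_path(relative_path: str) -> str:
    """Guess which current practice hides or exposes a path below pub/.

    Assumes a layout of Web/Topic/... with no nested webs (an assumption).
    """
    parts = PurePosixPath(relative_path).parts
    if not parts:
        raise ValueError("expected a path relative to pub/")
    *dirs, name = parts
    if name.startswith("."):
        return "dot-file"                 # practice 3: .changes, .htaccess
    if any(part.startswith("_") for part in parts):
        return "underscore-hidden"        # practice 2: _default and friends
    if any(d[0].islower() for d in dirs[:2]):
        return "lowercase-directory"      # practice 4: lower-case web/topic
    if len(dirs) > 2:
        return "sub-directory"            # practice 5: pub/Web/Topic/subdir/...
    # Practice 1 and genuine attachments look identical on disk.
    return "attachment-or-unmanaged"
```

The last branch is the point of the proposal: from the file system alone, a file dropped into pub/Web/Topic cannot be told apart from a real attachment.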

Description and Documentation

The following changes are needed.
  • pub/ and data/ structures remain as fully managed by whatever Store is in place. The "file based" stores will expect to fully manage the contents of these file paths.
    • Anything existing in either of these locations that is not managed by Store would be considered an error. (Exceptions TBD - data/mime.types, data/.htpasswd, data/.htaccess, pub/.htaccess and pub/<deeper>/.htaccess)
  • Define a new root level directory static/
    • foswiki/static/<Category>/<Optional>/somefile.css
    • Categories generally follow System topics, but could be anything.
    • Ex. foswiki/static/TinyMCEPlugin/tinymce/jscripts/tiny_mce/jquery.tinymce.js
  • Enhance or add macros to access static/ path ... with CDN capabilities.
    • %STATICURL{"path/to/file"}%
  • Consider also a cache/ path used for files managed by extensions? or mirrored by Store?
    • Existing PUBURL ATTACHURL macros could return a cached location if mirrored or supported by CDN
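
A minimal sketch of how a STATICURL-style lookup with an optional CDN might resolve a path. Nothing here is an existing Foswiki API: the function name, the /static default, and the cdn_base parameter are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import quote


def static_url(path: str, base: str = "/static",
               cdn_base: Optional[str] = None) -> str:
    """Build the URL for a file below static/, preferring a CDN if configured."""
    clean = path.lstrip("/")
    if ".." in clean.split("/"):
        raise ValueError("path must not escape the static root")
    # Hypothetical setting: when a CDN base is given, it replaces the local root.
    root = (cdn_base or base).rstrip("/")
    return f"{root}/{quote(clean)}"
```

For example, static_url("TinyMCEPlugin/tinymce/jscripts/tiny_mce/jquery.tinymce.js") would yield a path below /static, while passing cdn_base would point the same file at the CDN without any change to the topic text.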

Examples

Impact

%WHATDOESITAFFECT%

Implementation

-- Contributors: GeorgeClark - 04 Dec 2015

Discussion

.htpasswd in pub? Ouch. That's got to be an error. Other exceptions are OK, where critically required by a web server, but clearly documented.

/static is what I originally intended /working to be, but that ended up serving a different purpose. But yes, needed.

Not too keen on the /cache.... too many places, brain too small to cope.

-- Main.CrawfordCurrie - 04 Dec 2015 - 17:37

No.. .htpasswd is in data. That's me providing incomplete info. Now corrected. The first two exceptions listed, mime.types, and .htpasswd, are data/ files that appear in a "Store owned" place. But .htaccess would be in root of data & pub, and possibly deeper in pub as well.

Re /cache, I was iffy about that as well. But I added it when considering that static/* should never change except when extensions are updated, or files are manually placed there. Truly static information. But there is stuff that isn't static yet needs high-performance delivery: ImagePlugin, DirectedGraphPlugin, any plugins that generate data based upon topic content and want to put it somewhere other than the Store. That stuff needs more control than /static, possibly including:
  • conflict / locking?
  • More rigorous naming conventions
  • Some possible cleanup mechanism (Topic source of data was deleted, renamed, etc)

So the model I was proposing:
  • Installed/managed manually, or by configure extension_installer, etc.. /static
  • Installed/managed by extension operation, possibly derived from topic content /cache but outside of the Store API.
  • Attachments, Topics, managed by Store APIs, /pub and /data

Another way to think about static. If I make a new install, static is never copied or migrated. But cache could optionally be migrated, or would be created on demand on the new system. I'm trying not to mix data sources.

-- GeorgeClark - 04 Dec 2015

OK, so we have different classes of resource:
  1. Public static resources that are installed by core or plugins, and never change in normal day-to-day use
  2. Access-controlled attachments that are explicitly attached to topics by users
  3. Public dynamic resources that are generated by extensions, may be thrown away at any time, or generated on the fly
  4. Access-controlled dynamic resources generated by extensions, e.g. thumbnails of an access-controlled image, or other similar cache

Idea 1

Let's look at this a slightly different way. Instead of thinking about it as directories on disk, think about it as URL paths. The goal is to make the URL path to a resource as intention-revealing as possible.

Let's say I have a base URL to my site - call it http://foswiki. Under this I have the following URL paths:
  • /static - read-only resources, such as images, javascript etc.
  • /attachments - the root of attachments to Foswiki topics. Paths below this URL mirror the web/topic hierarchy
  • /scratch - the root of public dynamic resources (basically a scratch area)
  • /twitch - access controlled dynamic resources
So far so very similar to what you already proposed - but with one important difference. These are URL paths, and not directories on disk.

Let's say we impose the same constraints of web/topic hierarchy on all these URL paths. We could, by default, map these 1:1 to directories. Or, a sysadmin might choose, for compatibility reasons:

RewriteRule   ^/attachments/(.*)$  /pub/$1
RewriteRule   ^/static/(.*)$  /pub/$1
RewriteRule   ^/scratch/(.*)$  /pub/$1
RewriteRule   ^/twitch/(.*)$  /pub/$1

Of course there is some loss of functionality, as the DBs are currently too stupid to distinguish between attachments that were attached, and attachments that just sprung into being, but hey, TANSTAAFL. That's why we are looking at splitting the URL paths. As it is, should access control be required:

RewriteRule   ^/attachments/(.*)$  /bin/viewfile/$1
RewriteRule   ^/twitch/(.*)$  /bin/viewfile/$1
RewriteRule   ^/static/(.*)$  /pub/$1
RewriteRule   ^/scratch/(.*)$  /pub/$1

As far as I am aware, all web servers support some form of rewriting rules.
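
To make the four classes and their URL roots concrete, here is a small sketch that is independent of any particular web server. The class and function names are hypothetical; only the four roots and the /pub and /bin/viewfile targets come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceClass:
    access_controlled: bool
    dynamic: bool


# The four resource classes mapped to the URL roots of Idea 1.
URL_ROOTS = {
    ResourceClass(access_controlled=False, dynamic=False): "/static",
    ResourceClass(access_controlled=True, dynamic=False): "/attachments",
    ResourceClass(access_controlled=False, dynamic=True): "/scratch",
    ResourceClass(access_controlled=True, dynamic=True): "/twitch",
}
ACL_ROOTS = {root for kind, root in URL_ROOTS.items() if kind.access_controlled}


def url_for(kind: ResourceClass, web_topic_path: str) -> str:
    """Intention-revealing URL for a resource of the given class."""
    return f"{URL_ROOTS[kind]}/{web_topic_path.lstrip('/')}"


def backend_for(url_path: str, enforce_acl: bool = True) -> str:
    """Mimic the rewrite: access-controlled roots go through viewfile."""
    root, _, rest = url_path.lstrip("/").partition("/")
    root = "/" + root
    if root not in URL_ROOTS.values():
        raise ValueError(f"unknown URL root: {root}")
    target = "/bin/viewfile" if enforce_acl and root in ACL_ROOTS else "/pub"
    return f"{target}/{rest}"
```

With enforce_acl=False every root collapses onto /pub, which corresponds to the compatibility mapping; with the default, only /attachments and /twitch are routed through viewfile.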

Idea 2

OK, so that's a possible approach. But what happens with macros? We could have %STATIC, %SCRATCH etc. but that implies a lot of code/template changes. Or, we could be smarter, keep the idea of /pub but leverage what URL rewriting gives us:

RewriteRule   ^/pub/(System/ImagePlugin/.*\.[A-Za-z0-9]+)$  /scratch/$1
RewriteRule   ^/pub/(System/.*)$  /static/$1
RewriteRule   ^/pub/(.*\.[A-Za-z0-9]+)$  /attachments/$1
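
The effect of keeping /pub in topic text while classifying by pattern can be sketched as an ordered rule list. This is an approximation, not Apache's actual processing model: it assumes the first matching rule wins, that the patterns use `.*` for "anything", and that targets carry no trailing slash.

```python
import re

# Most specific rule first; order matters because the later rules are broader.
RULES = [
    (re.compile(r"^/pub/(System/ImagePlugin/.*\.[A-Za-z0-9]+)$"), r"/scratch/\1"),
    (re.compile(r"^/pub/(System/.*)$"), r"/static/\1"),
    (re.compile(r"^/pub/(.*\.[A-Za-z0-9]+)$"), r"/attachments/\1"),
]


def rewrite(url_path: str) -> str:
    """Return the rewritten path, or the input unchanged if no rule matches."""
    for pattern, target in RULES:
        match = pattern.match(url_path)
        if match:
            return match.expand(target)
    return url_path
```

Swapping the first two rules would send the ImagePlugin cache to /static, which shows how much this approach depends on rule order.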

-- Main.CrawfordCurrie - 05 Dec 2015 - 08:16

I'm really uneasy about moving complexity into the web server configuration. As it is, a lot of our installation issues seem to be web server related. ApacheConfigGenerator helps, but even with it, sites have issues. Then multiply that by lighttpd, nginx, and it goes downhill from there.

-- GeorgeClark - 06 Dec 2015

OOB there is no greater complexity in the web server config - you are just setting up five paths rather than the 2 (data and pub) at present.

Of course you could collapse static and scratch into one, at the cost of mixing shipped and site data, and attachments and twitch into one (a.k.a. pub). But every time you simplify the paths, you add a layer of complexity into the code as it tries to sort out access controls.

-- Main.CrawfordCurrie - 07 Dec 2015 - 17:20

In my view, this and similar decisions should be postponed until we decide when, how, and whether Foswiki will move to PSGI.

Reason: I fully agree with Crawford's point of view, but I also understand George's worries about the problematic FW deployment (installation) process (it is enough to check the IRC).

But in a PSGI-FW such server configuration will be moved (at least partially) into the app.psgi (as Plack::Builder mounts), i.e. it will be an integral part of PSGI-Foswiki, so the configuration worries can be minimized. (Of course, I'm talking about the full PSGI rework.)

Many similar discussions become less meaningful once we "deep dive" into PSGI. Therefore (IMHO) we need to make many partial decisions about the PSGI rework first; based on those, many solutions will (or could) fall into place by themselves.

-- JozefMojzis - 07 Dec 2015

Seems like we are missing some detail on your PSGI proposals, Jozef. Can you write a proposal, please?

-- Main.CrawfordCurrie - 08 Dec 2015 - 08:08

Crawford, as you already wrote, the PSGI proposal exists. So, once the PSGI/FW
  • uses middlewares (via Plack::Builder)
  • and we ship the app.psgi
then we could use CPAN:Plack::Middleware::Rewrite. The rewrite rules would then live directly in the shipped app.psgi and not in some external HTTP server, thus eliminating George's worries about the deployment process.

This decision is one of many. If the core-dev team decides not to ship the app.psgi, and the development direction stays with a monolithic app, then the approach is different, and George's approach is probably the right one. That's all I tried to say.

I do not fully understand what you mean by "write a proposal". Do you mean one more? Or somehow extending the existing one? Or adding some ideas to the brainstorming? Or?

-- JozefMojzis - 09 Dec 2015

This proposal was initiated because bulk_copy could not cope with copying any non-attachment files and sub-dirs; thus highlighting a deployment issue and that these are non-Store resources.

However, it seems to me that there are a number of problems here that we are trying to solve, of which deployment and non-Store resources are only a part.

I am concerned that some of the requirements above already suggest solutions before the whole problem has been specified, let alone analysed. Not that the whole problem has to be solved, but we should at least make conscious choices about what and how to solve, and what not to solve.

I am particularly concerned that we are painting ourselves into a corner with some of these suggestions.

My problem dump tries to include all related things, as such some of it may initially appear trivial. Subsequent analysis will weigh the different parts carefully.

As I see it we have the following Main Problem Areas (analysis of this follows):

  1. PUBURL and friends need to be converted in all topics (and templates?) to use the %PUBURL{ params }% form
    • It's an existing requirement
      • Tasks.Item13099 created the new PUBURL etc. forms. Another task (somewhere) states the need to convert all of them ASAP: +1000 to that
  2. Existing topic Resources (except for attachments) have never been categorized before
    • Therefore, information has never been stored about them and where applicable software has to guess what type of Resource it is
    • Support current Resource categories (or attributes?)
    • Support future Resource categories or we will have to revisit this again and again
  3. Some existing attachments are not recognized by stores (initial [*_.] ) but they are considered attachments elsewhere
    • Flexible on a topic-by-topic basis
  4. working should not be part of Store API; (I'm not sure that it's a different category of Resource though)
  5. bulk_copy was designed as a Store copy tool, we need a deployment tool

Solutions?

  1. PUBURL enhanced to include type= parameter with the default being 'attachment'
    1. New Resource types can easily be added
    2. Abstract responsibility for URL creation to different ResourceManagers based on type
    3. Could create URLs just like now: /pub/System/PlugIn/scripts/whatever.js OR /scripts/System/Plugin/whatever.js
    4. During deployment resources can be offered to ResourceManagers to copy/move elsewhere (or load to CDN etc) and when PUBURL called create the correct URL
      • Of course these choices impact the web-server config but we are not wedded to a particular choice
    5. Basic ResourceManager can be very simple and similar to current Store getAttachmentURL abstraction
  2. Deployment: How should plugins deliver and categorize their resources?
    1. Add something to MANIFEST to categorize resources?
    2. Add a .fw-resources file with categorization rules in each pub/Web/Topic directory?
  3. PlugIns: How do plugins inform a ResourceManager that a new file has been created (e.g. an svg from DirectedGraphPlugin) and to map it/move it/load it.

I need to flesh this out somewhat more, but I've run out of time for now.

A key point for me is using %PUBURL{}% and different ResourceManagers (of which one responsibility is URL generation) to abstract out some of the choices. Ultimately we will need a default implementation, but we do not need to limit the possibilities needlessly. The core abstraction is not that great. However, I would like to expand on this further.

-- JulianLevens - 15 Dec 2015

 
Topic revision: r13 - 15 Dec 2015, JulianLevens