Feature Proposal: Allow disabling of RCS control for binary attachments

Motivation

RCS is primarily a text-based tool. While it supports binary files, it makes no semantic sense to use it on them. Doing so just wastes disk space, unless you are providing a hex editor that shows the differences. (Who can read GIF data in hex?) -- How many people can read hex if only you and dead people can read hex? -- Babar

Description and Documentation

Provide a preference setting, with a default in configure, that disables using RCS for binary attachments. Functionally you should be able to disable RCS control on all, some, or no attachments. -- Edited by Babar

Examples

A user uploads releases as .tgz or .rpm files to Foswiki. Each upload uses twice the disk space for nothing, and the history will never be used.

Impact

%WHATDOESITAFFECT%

Implementation

-- Contributors: DaveHayes, OlivierRaginel - 30 Nov 2010

Discussion

Unless RCS is replaced by another mechanism, removing it means removing the ability to retrieve all previous releases of the binary. If someone uploads another file with the same name, it means LOSS OF DATA, something Foswiki is otherwise immune to.

I think such a feature is a security trap just to save what in practical life is very little disk space.

I fear people will enable such a feature without realizing the consequences.

I would be more open to alternative storage of old attachment versions. RCS is used for storing the versions. We have no Diff feature for attachments anyway, so that argument is irrelevant. But having a feature that removes this important wiki security element will end up being a misfeature.

-- KennethLavrsen - 30 Nov 2010

I also use the revision system to retrieve older versions of attachments - which is (I thought) the point of it - diffing is added sugar - not the key usage.

I'm not sure how you explain to a user that the previous versions are gone because the admin needed to save a few Meg on the multi-terabyte disk, but I do realise that different use cases have different needs.

-- SvenDowideit - 01 Dec 2010

I agree with Sven & Kenneth. We have to continue to support multiple versions of attachments in a way that is compatible with the existing RCS files, so that history of attachments is not lost. However an alternative versioning system of some sort would be good. The overhead of the RCS "checkin" of attachments is a huge delay. Bypassing the current attach API in the DirectedGraphPlugin can shave many seconds off a graph update. Rather than bypassing it we need it to be much faster.

Another thought is that it would be nice if attachment revisions could be synchronized with the topic revision, so that viewing an older rev. of a topic shows graphics and other attachments as they were at that revision level.

-- GeorgeClark - 01 Dec 2010

There definitely are use cases for NOT wanting to keep previous revisions of attachments. I have come across one or two myself, so I do see the point of a feature like that. Some ideas to (hopefully) make it more palatable:
  • Make it an expert option in configure, and put dire warnings in its documentation of the potential for data loss.
  • Make it only applicable to a specific set of webs or topics e.g. by listing them in configure or by enabling the feature when a specific preference is defined.
  • Make a store contrib that wraps any other store but bypasses it for some attachments (e.g. those in specific webs or topics) and provides a "cover" that shows a dire warning when manipulating affected attachments.
  • Provide a recipe in Support for how to do it with a cron job
  • Instead of keeping no previous revisions, provide a rotating-backup-style system that stores only the last n revisions, or that deletes revisions older than n days.
-- MichaelTempest - 01 Dec 2010

Hmm. Good proposals. Let me brain it further. If we have an alternative attachment storage method which has these two properties

  • Keeping N versions, where N can be 1 to infinite (infinite is the default)
  • Naming the N versions with a scheme that suffixes the filename with the revision number, ALWAYS starting with 1.
  • If you have N=3 and have uploaded 5 times you would have .3, .4 and .5
It would give TWO fixes to TWO problems.

  • Limit the storage space if you set N to non-infinite
  • Solve the problem that old revisions of the topic do not contain old revisions of the graphics.
We all know the problem that old revisions of a topic contain the latest uploaded .jpg, .gif, or .png images.
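
The keep-N scheme with revision suffixes can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and nothing here is an existing Foswiki API. It assumes that revision numbers always start at 1 and that the oldest revisions are dropped first.

```python
from typing import List, Optional


def retained_revisions(upload_count: int, keep: Optional[int] = None) -> List[int]:
    """Return the revision numbers still stored after `upload_count` uploads.

    keep=None models the proposed default of N = infinite (keep everything).
    """
    if upload_count < 0:
        raise ValueError("upload_count must not be negative")
    if keep is not None and keep < 1:
        raise ValueError("keep must be at least 1")
    # Revision numbering always starts at 1; only the newest `keep` survive.
    first = 1 if keep is None else max(1, upload_count - keep + 1)
    return list(range(first, upload_count + 1))


def revision_filenames(name: str, upload_count: int, keep: Optional[int] = None) -> List[str]:
    """Suffix the attachment name with each retained revision number."""
    return [f"{name}.{rev}" for rev in retained_revisions(upload_count, keep)]
```

With N=3 and five uploads, `revision_filenames("logo.png", 5, 3)` gives `logo.png.3`, `logo.png.4` and `logo.png.5`, matching the example in the list above.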

It requires the user changes the suffix in the topic text. Unless we make the attachment UI smart so it can detect the filename within SAME TOPIC and update the suffix.

With N set to some reasonable number you will have some protection against one or two mistaken uploads. I would really warn (with a skull and crossbones symbol) against an N=1 scheme, because people make mistakes: they "manage the wrong attachment" and upload a file to the wrong existing filename, and then the original is lost forever. I have seen this mistake done. I have made the mistake myself. Today you just download the previous version of the attached file and reupload it and then you are OK.

Another feature we may consider is repRev for attachments, i.e. the file is not given a new revision if you reupload within the one-hour window we also know from topic content. THAT would not be a security problem, because during that hour the person uploading has the original on his disk. And I bet 90% of all attachment revisions happen within the editing window of the same revision of a topic. Just think of the JHotDrawPlugin, which saves a new revision of the 2-3 attached files each time you save.
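
A minimal Python sketch of how repRev for attachments might behave, under stated assumptions: the class and field names are hypothetical, the window is the one hour mentioned above, it only applies when the same user uploads again, and it is measured from the first upload of the current revision. None of this is taken from the Foswiki store code.

```python
from dataclasses import dataclass, field
from typing import List

REPREV_WINDOW_SECONDS = 3600  # assumed: the one-hour window used for topics


@dataclass
class Revision:
    author: str
    created: float  # time of the first upload of this revision, in seconds
    data: bytes


@dataclass
class AttachmentHistory:
    revisions: List[Revision] = field(default_factory=list)

    def upload(self, author: str, now: float, data: bytes) -> int:
        """Store an upload and return the resulting revision number."""
        latest = self.revisions[-1] if self.revisions else None
        if (
            latest is not None
            and latest.author == author
            and now - latest.created < REPREV_WINDOW_SECONDS
        ):
            # Same user, still inside the window: replace the latest
            # revision in place instead of adding a new one.
            self.revisions[-1] = Revision(author, latest.created, data)
        else:
            self.revisions.append(Revision(author, now, data))
        return len(self.revisions)
```

Under these assumptions, the "small oops" reupload of a large file 30 minutes later would overwrite the latest revision rather than double the space used.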

An N revisions and a repRev for attachments is a much better approach than the original proposal.

Even when something can be configured we should still not give admins options that are bad practice. And the people that ask for it may not know the consequences of what they ask for no matter how much you warn them.

-- KennethLavrsen - 01 Dec 2010

I hate that attachments appear to be revisioned independently from the topic they are attached to; it causes all sorts of misunderstandings. Let's be absolutely clear about this; an attachment is part of a topic, and as I can recover an old version of a topic, I have to be able to recover the attachments to that old version of the topic as well. If you compromise that ability, then you compromise revisioning of topics; fatally so, IMHO. The question here should not be "how do we limit revisioning of normal attachments". I do not believe that should be considered, at any level.

Having said that, I think there is one (and only one) exceptional case where it makes sense, and that's exactly the case that Kenneth rejects; i.e. N=1. I think there is room for two types of attachment; "variable" and "constant". A variable attachment is revision controlled with full history, locked to the topic history, no limits, and is the default condition for an attachment. A constant attachment has one, and only one, version, and if it is overwritten, then start praying you have good backups. As such, "constant" status is something that should only be conferred with the greatest of care - possibly only by admin users. Examples of constant attachments would be things like logos, or all the little image files on DocumentGraphics.

(This is in fact Dave and Babar's original proposal of this feature. In implementation terms, a "constant attachment" can be trivially implemented: only create a new revision of an attachment if there is an existing ,v file; otherwise it's a constant. Use a trivial flag in other store impls. An API change is required for the store to tell the world that an attachment is a constant, and some UI changes are required to display the fact, with appropriate warnings if a new version is uploaded.)

-- CrawfordCurrie - 01 Dec 2010

I agree with Crawford, in theory.

In practice Foswiki builds up an independent repository for each single file, i.e. separate repositories for (a) the topics and (b) each attachment. On top of that, any revision change of an attachment triggers a new topic revision, because the attachment meta fields are part of the topic text itself, which is a rather unfortunate design decision.

Point is that in practice attachments are versioned independent of the topic mostly because foswiki versions per file.

Second point is, and I sort of share the pain this proposal stems from, that RCS is not quite appropriate for binary attachments.

I can't see the point of saving disk space. I'd be happy to find some way to justify terabyte disk drives other than gathering copyrighted material. What is more of a problem with large amounts of data is the disk IO needed to process it. More specifically, how much does it cost to get an old revision out of the attachment repository? Is this an issue we have to face? Not that I know.

Basically this makes two independent issues:

  1. revision silos: can't roll back a topic+attachments to a previous state
  2. better version control for binary attachments
It should be clear that RCS is not the right technology to tackle these two. And we know that.

Sure adding YAK (yet another knob) to tinker with the way attachments are stored might make sense. But also remember that the additional value is rather low and does not address the real problems that are surfacing here yet again.

Now, all of this discussion would be rather pointless if we used GIT instead of RCS. Part of the reason GIT is quite well suited to address both issues is that

  1. it doesn't store diffs, it stores the new version as is ... which is one of the core reasons GIT is superior, and not only to RCS.
  2. it allows topics or webs to be treated as one repository ... rolling back or forward to a specific point in time also takes care of deletions or insertions of attachments ... impossible to do right now ... which leads to the question: is this an adequate document management approach nowadays?
While YAK is a much easier way to satisfy this feature request, switching to GIT as a version control system for the store is the real answer.

-- MichaelDaum - 01 Dec 2010

Using git instead of rcs may reduce the version-control overhead, but it will not eliminate it. Will git's overhead be low enough that plugins like DirectedGraphPlugin can afford to work through it (given that, today, you have to bypass the store API to get decent performance out of DirectedGraphPlugin)? I don't know.

Using git instead of rcs will not reduce the disk usage significantly. IIRC, this proposal originally came about because someone asked on irc about not putting attachments under version control. That person was attaching files that are several hundred megabytes. Should people be doing that? Perhaps not, but they are doing it. People do it in my organisation, too.

I am not saying that switching to GIT is a bad idea. Indeed - I think it is a good idea. But I do not think it is the answer to all of the problems to be addressed.

-- MichaelTempest - 01 Dec 2010

I think CDot nailed the point on IRC. The whole idea here is that sometimes, people want read-only attachments, and maybe even read-only topics.

To address Kenneth's concern, maybe we can rename the proposal to:
  • Allow attachments to be read-only
This would mean that we would not store any history, and would prevent any overwriting. If a user wants to change an attachment, he would have to trash it first.

This would address Ian's (the user who requested that feature on IRC last night) first concern, and I think most of the other concerns raised.

And Micha, a 100 MB attachment doesn't need to be copyrighted material. One could upload some release of some software, which includes loads of dependencies. Simply imagine storing all Foswiki releases as attachments. What would be the need of history for those?

-- OlivierRaginel - 01 Dec 2010

I am all for supporting large blobs as attachments. Revisioning them is not really the problem. Generally, in a document management system everything shall be revisioned, even large blobs. That's part of the safety net revision control gives you. Foswiki isn't a DMS by far. Still, other DMSes do version large blobs. There is no fault in principle in that per se.

The argument is that history is always of value, even for read-only content. It wasn't always read-only; it was created as part of a longer workflow where these documents have been created step by step. Freezing content into a read-only state is normally achieved using quite different means within Foswiki. I can't see an argument against revision control here.

CDot had a far better idea on IRC: don't create a ,v file for the first revision uploaded

That's far better:
  • it reduces the amount of disk space (if you still insist on this being an argument),
  • it reduces disk io because the blob doesn't need to be checked in to bloody RCS,
  • but best of all: versioning is deferred until the same blob is checked in again.
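
The "no ,v file for the first revision" idea could look roughly like the Python sketch below. It is an illustration under assumptions, not the store implementation: it does not call RCS at all, the function name is hypothetical, and a sibling directory named `<name>,history` stands in for the real ,v file.

```python
import shutil
from pathlib import Path


def save_attachment(store_dir: Path, name: str, data: bytes) -> int:
    """Save an upload and return the number of revisions now recoverable."""
    current = store_dir / name
    # Hypothetical stand-in for the RCS ",v" file.
    history = store_dir / (name + ",history")

    if not current.exists():
        # First upload: write the blob only, create no history at all.
        current.write_bytes(data)
        return 1

    # Second or later upload: versioning starts (or continues) here.
    history.mkdir(exist_ok=True)
    archived = len(list(history.iterdir()))
    shutil.copyfile(current, history / str(archived + 1))
    current.write_bytes(data)
    return archived + 2  # archived copies plus the one just archived and the new current
```

A file that is uploaded once and never touched again costs exactly its own size; the history only appears when someone uploads to the same name a second time, so nothing is lost on overwrite.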
-- MichaelDaum - 01 Dec 2010

I found that some of my users were not updating versions of documents. They were just attaching another copy with a change in the name (Doc v1, Doc v2). I thought that this was a training issue, so I demonstrated how to do it properly and let Foswiki handle the revisions. This was a 2 MB Excel document. After waiting for way too long, I agreed that the users' alternative was the only practical option.

I've had the idea that we really need to implement some DifferenceEngines (homage to Charles Babbage here) for different document types (keyed by MIME type, I suspect). Of course this may not tie up very well with RCS, but that's the idea. It is further complicated by new backend stores, which I suspect will not use RCS at all. However, it will be a requirement of a backend store to provide revisions, so a generic set of DifferenceEngines for any backend store would be worthwhile. Apart from text and binary engines (often the fallback cases), XML may be common; I'm not sure of the value of other specialist cases for PDF, Word, Excel, etc.

It is also worth considering two other strategies, (the second is a refinement of the first)
  1. Keep the whole document for each revision
    • Not as daft as it initially sounds, certainly for us there is no space issue and the above experience means that's practically what we are doing for some documents
    • With the desire in the back end store discussions (or elsewhere?) to be able to do queries based on revision, fast retrieval would be easier with the whole document available. (Think also about Stringified documents as required for better search engines: won't they need revisioning as well for revisionable searches?)
  2. As above, but do the difference processing via a Cron job
    • This would allow time for more sophisticated processing (faster re-creation of a specific revision may then be possible)
    • Use config options:
      • Don't 'difference' small files (unless they have a high churn history)
      • Allow multiple uploads within a short time be coalesced into one
      • Keep a full copy every Nth revision
        • Faster recreation of older revisions
        • Better safety as you have a number of full copies, corruption of the main revision or a difference could render a lot of it worthless
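
To illustrate why a full copy every Nth revision speeds up recreation, here is a small Python sketch. It assumes forward deltas (each difference turns revision r-1 into r) and that revision 1 is always a full copy; RCS itself works differently, and the function names are hypothetical.

```python
from typing import List, Tuple


def is_full_copy(rev: int, every_n: int) -> bool:
    """Revisions 1, 1+N, 1+2N, ... are stored as full copies."""
    return (rev - 1) % every_n == 0


def reconstruction_plan(rev: int, every_n: int) -> Tuple[int, List[int]]:
    """Return the nearest earlier full copy and the deltas to apply on top of it."""
    if rev < 1 or every_n < 1:
        raise ValueError("rev and every_n must be at least 1")
    base = rev - (rev - 1) % every_n
    return base, list(range(base + 1, rev + 1))
```

With a full copy every 5 revisions, recreating revision 8 starts from the full copy at revision 6 and applies two differences, and at most N-1 differences are ever needed. A corrupted difference also damages at most the revisions up to the next full copy.
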
-- JulianLevens - 01 Dec 2010

Both the "read-only" and the "don't create ,v on first checkin" are great ideas, and I think they both deserve a look at implementation.

Keep in mind that the original use case that started this feature request was someone uploading an RPM file as an attachment. This was 700 MB in size. I know disk space is cheap these days but that shouldn't be an argument for using space unnecessarily; it's quite likely that this RPM has its own version control somewhere else and isn't at risk of being lost. I think "read-only" covers this idea perfectly.

Is there a compelling reason to create the ,v on first checkin for any files, including topics?

-- DaveHayes - 01 Dec 2010

The read-only proposal does not seem to solve any of the issues anyone has raised.

The graphics we distribute in System web do not have ,v files today in the distro (I removed them a long time ago because they are a waste). The whole distribution incl. all graphics unpacked is around 27 MBytes. Nothing to worry about.

It is the large uploads that are the concern. I can feel the worry about 700 MB uploads, and I must admit that all my Foswiki webs have the limit for uploads set to something much smaller than that, and we tell people to upload to a file storage system and link to this.

But what if someone wants to use Foswiki for file storage? And what if they need to upload 700 MB blobs?

I'd rather find a feature that fits this purpose than destroy the great security we have in a wiki. Remember that a main difference between a wiki and a normal file server is that, as the general principle, everybody can do anything. And this is why the overwriting is so dangerous. Anyone can make a mistake, and a vital file can be wiped forever. Especially if the backup schedule is weekly or monthly in a smaller company.

We need to think out of the box to find a good way to handle this special situation so an entire wiki can run with normal revisions saved for attachments and then for particular topics you can actively disable it. Could be a plugin that adds this feature. A "LargeAttachmentPlugin" that does not store revisions combined with a 20 MB limit set for the normal attachments may be a combination that could work for Dave's use case.

And again. The missing repRev for attachments is a huge space eater, because people will upload a figure or similar many times during document creation, and for binary attachments the RCS file grows by the size of the image file on each save. The repRev is what people are used to with the topics themselves. It would be natural for attachments to be handled the same way. If someone uploads an RPM and then finds a small oops and uploads it again 30 minutes later, it is suddenly 2 x 700 MB of space.

Crawford, yes I wish the rev of attachments and rev of topic would follow each other. Especially for images.

-- KennethLavrsen - 01 Dec 2010

The main reason for creating the ,v on first upload is to create the document history, which is part of the meta-data that the ,v stores. Of course that can be "faked" in the %META:FILEATTACHMENT, but there are risks.

Note that RCS does not handle diffs of binary files, so it stores the entire content for every version.

-- CrawfordCurrie - 02 Dec 2010

Part of this also could be handled by more intelligent revision / diff handling. A diff of a compressed file is totally useless, but looking at articles about other SCM systems - I don't recall which one did what, there are better ways to handle this.
  • Handle zip / tar and other compressed / archive formats by recording the diff of the expanded contents
  • images handled with incremental history, no diff capabilities, as done by rcs.

If we could be more intelligent about what we diff and when/why, some of this could be more useful. The example of the small oops with an RPM could potentially identify the actual change within the RPM if we could plug in a diff engine that understood the RPM format. Maybe too hard to do, but I could see it being really useful to know that the change between two versions of a plugin or rpm was an updated copyright notice, vs. a code change internally in the package, or a change to a packaged gif file, etc.

So maybe we need a pluggable diff engine:
  • Separate storage of the old versions from the diff of the versions
    • If possible to recreate original from the diff, then store differences, otherwise store full copies. (Considering storage costs, may be easier / safer to just store the entire attachment, but include reprev capability to "rewrite history" when appropriate).
  • PDF - use a pdf diff engine
  • Archives - operate on expanded components of the archive
  • Images - simple incrementing history of files.
  • OpenOffice - diff of the expanded contents
  • Other known - use pluggable difference engine.
  • Other unknown formats - incremented history.
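
A pluggable diff engine along these lines could be organised as a registry keyed by type, with a fallback for unknown formats. The Python sketch below is only an illustration: the registry, the function names and the use of MIME types as keys are assumptions, and it is unrelated to the existing rdiff code. Only the standard `difflib` module is used.

```python
import difflib
from typing import Callable, Dict, List

# A diff engine takes the old and new content and returns report lines.
DiffEngine = Callable[[bytes, bytes], List[str]]


def text_diff(old: bytes, new: bytes) -> List[str]:
    """Line-based unified diff for content that decodes as UTF-8 text."""
    return list(
        difflib.unified_diff(
            old.decode("utf-8").splitlines(),
            new.decode("utf-8").splitlines(),
            lineterm="",
        )
    )


def opaque_diff(old: bytes, new: bytes) -> List[str]:
    """Fallback for unknown formats: only report that the content changed."""
    if old == new:
        return []
    return [f"binary content changed ({len(old)} -> {len(new)} bytes)"]


# Hypothetical registry; engines for PDF, archives etc. would be added here.
ENGINES: Dict[str, DiffEngine] = {"text/plain": text_diff}


def register_engine(mime_type: str, engine: DiffEngine) -> None:
    ENGINES[mime_type] = engine


def diff_attachment(mime_type: str, old: bytes, new: bytes) -> List[str]:
    """Dispatch to the engine registered for the type, else the fallback."""
    return ENGINES.get(mime_type, opaque_diff)(old, new)
```

An archive engine, for example, would be registered with `register_engine("application/zip", ...)` and could compare the expanded members rather than the compressed bytes. Storage of old versions stays separate from this: the engines only report differences.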

-- GeorgeClark - 02 Dec 2010

Just to inspire you all a little - the rdiff functionality has been pluggable (to some degree) since the tmwiki Cairo release - not that anyone, not even its implementer, ever did any more work on it...

-- SvenDowideit - 03 Dec 2010

Changing to a Parked proposal. No action in >4 years. And the new PlainFileStoreContrib has eliminated RCS altogether when that store is used. That reduces some of the need for this proposal. Though a pluggable rdiff would be interesting.

-- GeorgeClark - 09 Feb 2015
 
Topic revision: r20 - 09 Feb 2015, GeorgeClark