Item1246: Add support for bypassing session creation for crawlers

Priority: Enhancement
Current State: Closed
Released In: 1.1.0
Target Release: minor
Applies To: Engine
Reported By: AndrewJones
Waiting For:
Last Change By: KennethLavrsen
We have found that each time our internal Google box hits the server, a new session file is created. In a short time there were so many session files that the tick script could not keep up with deleting them, and in the end we had >400,000 session files in a company of 1,700 people!

As a quick fix we added an exception in the loadSession subroutine that skips session creation if the user is gcrawler. It looks like this:

return $authUser if $authUser eq 'gcrawler';

Obviously this is not a very elegant fix and we would need to remember to add this each time we upgrade.

So is there a better way to do this, or can one be provided? Presumably we could have a setting in LocalSite.cfg containing a list of user names for which no session files are created, but is this the best way to do it?

If we can decide on a way to fix this then I don't mind doing the coding for it (assuming it's within my ability).

Note: we are actually using TWiki 4.2.4, but the Foswiki code looks very similar, so I assume the problem is the same. We are upgrading to Foswiki when 1.0.x is released, so we would like a more elegant solution if possible.

-- AndrewJones - 11 Mar 2009

Interesting report. We have disabled GSA for our intranet because of performance problems. I did not have a clue at the time what to do about it, and this might point to the cause.

-- ArthurClemens - 11 Mar 2009

Given the ubiquity of intranet crawlers like gcrawler, I'm regrading this from Normal to Urgent and confirming it. It's biting more people than realise it.

-- CrawfordCurrie - 05 Jun 2010

It occurs to me that Apache can support quite sophisticated rules for matching user agents, usernames, IP addresses and the like. The blockAccess rules in the default configuration demonstrate this. They use SetEnvIf to set environment variables based on matching criteria.

So, a possible approach would be to set a NO_FOSWIKI_SESSION environment variable according to the result of those Apache matching rules. For example:

BrowserMatch ^Google NO_FOSWIKI_SESSION

Then the Foswiki fix would be as simple as replacing your line with:

return $authUser if $ENV{NO_FOSWIKI_SESSION};

I'm not certain that these environment variables are passed on to Foswiki, but even if they aren't, this may be a fruitful line of investigation.

-- CrawfordCurrie - 10 Jun 2010
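
For reference, a minimal sketch of how such matching rules might look in the Apache configuration, assuming mod_setenvif is loaded. The googlebot pattern and the IP address are illustrative placeholders and do not come from this report; the gsa-crawler pattern is the one given further down in this task. Whether the variable actually reaches Foswiki depends on how the script is run, so test it on your own setup.

```apache
# Requires mod_setenvif. Each directive sets NO_FOSWIKI_SESSION
# when its pattern matches the incoming request.

# Match on the User-Agent header.
# BrowserMatch is shorthand for "SetEnvIf User-Agent".
BrowserMatch "^gsa-crawler" NO_FOSWIKI_SESSION

# Case-insensitive match anywhere in the User-Agent (placeholder pattern).
SetEnvIfNoCase User-Agent "googlebot" NO_FOSWIKI_SESSION

# Match on the client address. 192.0.2.10 is a placeholder from the
# documentation address range; substitute the crawler's real address.
SetEnvIf Remote_Addr "^192\.0\.2\.10$" NO_FOSWIKI_SESSION
```

Matching on the client address is harder to spoof than matching on User-Agent, which any client can set freely.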

Ok thanks, that sounds like a good approach. I will try and test it in the next week or two.

-- AndrewJones - 10 Jun 2010

Yeah, this fix seems to work great, thanks. Tested under plain CGI and FastCGI.

Is it ok to check this in? Also, where should the documentation go? I can't see anywhere suitable in the System web, so maybe just as an FAQ in the Support web?

Also, to match the Google Search Appliance, you need the following:

BrowserMatch "^gsa-crawler" NO_FOSWIKI_SESSION

-- AndrewJones - 13 Jul 2010

I can't see why not. The NO_FOSWIKI_SESSION check has to be clearly documented in-code, at least where your fix is but also in the header of the LoginManager module (the overview). Then, it also has to be documented in-code in the template httpd.conf and in the apache config generator.

I can't see any obvious security issues (after all, the effect of this is to remove auth, not grant it).

-- CrawfordCurrie - 14 Jul 2010

Done for 1.1.

-- AndrewJones - 14 Jul 2010

I wonder why and when a user needs the session file at all. Is it required by default? Does guest need one? Can we delay creating it until the first attempt to access it? How much would that buy us? At the least, none of the crawlers would trigger it.

-- MichaelDaum - 27 Aug 2010

Changed headline (was "Google crawler creates session files for each hit")

-- CrawfordCurrie - 06 Sep 2010

ItemTemplate

Summary: Add support for bypassing session creation for crawlers
ReportedBy: AndrewJones
Codebase: 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.5 beta1, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, trunk
SVN Range: Foswiki-1.0.0, Thu, 08 Jan 2009, build 1878
AppliesTo: Engine
Priority: Enhancement
CurrentState: Closed
Checkins: distro:40ddbc6e79b1
TargetRelease: minor
ReleasedIn: 1.1.0
Topic revision: r13 - 04 Oct 2010, KennethLavrsen