NeedBetterWayToControlBots

How can we better control bot scans?

The Foswiki.org logs show that the indexing bots, especially Bing, are generating a huge load by following links that are clearly marked rel=nofollow. Here are the log entries for just the WebStatistics topics over roughly an hour:
207.46.204.188 - - [03/Nov/2011:00:22:35 +0000] "GET /Main/WebStatistics?cover=print;sortcol=2;table=1;up=0 HTTP/1.1" 200 77692 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
207.46.204.241 - - [03/Nov/2011:00:23:45 +0000] "GET /Sandbox/WebStatistics?rev=9176;sortcol=1;table=1;up=0 HTTP/1.1" 200 64567 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
207.46.204.188 - - [03/Nov/2011:00:27:22 +0000] "GET /Home/WebStatistics?cover=print;rev=8184&rev=8184 HTTP/1.1" 200 39544 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
207.46.199.49 - - [03/Nov/2011:00:29:37 +0000] "GET /About/WebStatistics?rev=9321 HTTP/1.1" 200 49307 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
65.52.108.67 - - [03/Nov/2011:00:31:45 +0000] "GET /Extensions/WebStatistics?raw=on&rev=14051 HTTP/1.1" 200 43720 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
207.46.13.96 - - [03/Nov/2011:00:34:59 +0000] "GET /Sandbox/WebStatistics?rev=7634 HTTP/1.1" 200 57483 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
65.52.110.153 - - [03/Nov/2011:00:43:10 +0000] "GET /Download/WebStatistics?rev=8150;sortcol=4;table=1;up=0 HTTP/1.1" 200 58226 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
66.249.72.37 - - [03/Nov/2011:00:47:32 +0000] "GET /Support/WebStatistics?t=2011-10-29T07:03:18Z HTTP/1.1" 200 87494 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.37 - - [03/Nov/2011:00:48:40 +0000] "GET /Support/WebStatistics?cover=print; HTTP/1.1" 200 76989 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
207.46.204.241 - - [03/Nov/2011:00:50:52 +0000] "GET /Sandbox/WebStatistics?cover=print;rev=8898&rev=8898 HTTP/1.1" 200 50549 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
66.249.72.37 - - [03/Nov/2011:00:56:20 +0000] "GET /Sandbox/WebStatistics?cover=print;raw=on;rev=14169&rev=14169 HTTP/1.1" 200 34424 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.52.108.67 - - [03/Nov/2011:00:57:30 +0000] "GET /Download/WebStatistics?rev=14959 HTTP/1.1" 200 71546 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
66.249.66.106 - - [03/Nov/2011:00:59:09 +0000] "GET /Main/WebStatistics?cover=print;raw=on;rev=14413&rev=14413 HTTP/1.1" 200 38323 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.52.110.153 - - [03/Nov/2011:01:01:30 +0000] "GET /Main/WebStatistics?cover=print;rev=7416;rev=7416;sortcol=1;table=1;up=0 HTTP/1.1" 200 77800 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
65.52.110.153 - - [03/Nov/2011:01:02:45 +0000] "GET /Extensions/WebStatistics?cover=print;rev=9938;sortcol=;table=1;up=&rev=9938 HTTP/1.1" 200 57287 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

All of these requests come from bots, either bingbot or Googlebot, and they are clearly following rel=nofollow links: the table sort links and the rev= links are all marked nofollow.
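To put numbers on this, the excerpt can be tallied with a short script. The following is a minimal sketch, assuming the logs are in Apache combined log format as shown above; the function names and the list of bot markers are illustrative and cover only the two bots seen in this excerpt.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Combined log format:
# host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"$'
)

# Substrings taken from the user-agent strings in the excerpt (assumption:
# a real list would be longer).
BOT_MARKERS = ("bingbot", "Googlebot")


def bot_name(agent):
    """Return the first matching bot marker, or None for other clients."""
    lowered = agent.lower()
    for marker in BOT_MARKERS:
        if marker.lower() in lowered:
            return marker
    return None


def query_keys(target):
    """Return the set of parameter names in a request target.

    The logged URLs mix ';' and '&' as separators, so split on both.
    """
    query = urlsplit(target).query
    keys = set()
    for part in re.split(r"[;&]", query):
        if part:
            keys.add(part.split("=", 1)[0])
    return keys


def summarize(lines):
    """Count bot requests per bot and per query parameter name."""
    by_bot = Counter()
    by_param = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in the expected format
        name = bot_name(match.group("agent"))
        if name is None:
            continue  # not one of the bots we are tracking
        by_bot[name] += 1
        by_param.update(query_keys(match.group("target")))
    return by_bot, by_param
```

Feeding the access log through summarize() gives one counter of requests per bot and another showing which parameters (rev, sortcol, cover and so on) the bots are requesting most often.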

Possible solutions
  • Foswiki detects a bot and returns a static copy of the page with the nofollow links removed.
  • Point nofollow links to an alternative view script, viewnobot, that is banned in robots.txt.
  • ... (Any more?)
Is there a quick solution that doesn't involve building a new skin? Or a longer-term one that helps optimize bot scanning?
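The viewnobot idea can be checked without touching the live site. The following is a minimal sketch, assuming script URLs of the form /bin/view/Web/Topic rather than the short URLs seen in the logs above; the /bin/viewnobot path, the robots.txt content and the helper names are all hypothetical. It uses Python's standard urllib.robotparser to confirm that a well-behaved crawler would be allowed to fetch the plain view URL but not its viewnobot twin.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the proposed viewnobot script is banned.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bin/viewnobot
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def nobot_url(view_url):
    """Rewrite a /bin/view/ URL to its /bin/viewnobot/ twin.

    This is what the skin would do for links currently marked nofollow.
    """
    return view_url.replace("/bin/view/", "/bin/viewnobot/", 1)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in ("/bin/view/Main/WebStatistics",
                nobot_url("/bin/view/Main/WebStatistics?rev=9321")):
        print(url, parser.can_fetch("bingbot", url))
```

This only shows what a crawler that honours robots.txt would do. Whether a given bot actually obeys the ban would still have to be confirmed from the logs.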

Some discussion from IRC:
gac410:  mod_fcgid: stderr: [Thu Nov  3 07:59:48 2011] foswiki.fcgi: /usr/home/foswiki.org/public_html/data/Download/WebStatistics.txt,v is corrupt; parsed up to deltatext.log at /usr/home/foswiki.org/public_html/lib/Foswiki/Store/VC/RcsLiteHandler.pm line 350.
CDot: yikes, that is one humungous history
gac410: yeah -  as I said - we really need to clean them out.  Maybe archive and start a new history on 1/1 of each year
CDot: aye. RcsLite is not good with very long histories
CDot: it regenerates all the text of all revisions in memory
gac410: yikes
CDot: so will run like a drain on 15K+ revisions
gac410: And bingbot is hitting every rev of every Statistics topic from what I can see
CDot: ouch!
gac410: This rel=nofollow is totally useless.    I think we should add a "viewnobot" script,  and make the nofollow links  use viewnobot    so we can put them in the robots.txt
gac410: maybe *that* will fix it.   And if any robot loads a viewnobot link -  ban them.
CDot: bingbot doesn't respect rel-nofollow? tsk, bad bot
gac410: None of the bots seem to -   consensus is that nofollow means  "follow but do not index"
gac410: And robots.txt can't wildcard,   so no way to do a ban of  *... WebStatistics
CDot: http://www.inceptor.com/blog/2011/09/bingbot-unaffected-by-noindex-meta-tag/
gac410: From wikipedia:  Google states that their engine takes "nofollow" literally and does not "follow" the link at all. However, experiments conducted by SEOs show conflicting results. These studies reveal that Google does follow the link, but does not index the linked-to page, 
gac410: And I can see the proof in our logs -  bing is definitely following the rel=nofollow links
CDot: yeah, as per that link I pasted
CDot: nofollow is a weighting instruction, and does not preclude following :-(
CDot: there does not seem to be any way to block a follow.
gac410: All I could think of is replace the "view" script with a script that is blocked in robots.txt
CDot: y :-(
gac410: maybe I should create a proposal for that.   Probably a bit too much for a "fix".  But the bots are really killing our server following all of the revs of all of the WebStatistics.   And sorting every table by every column in every direction.
gac410: Most of the errors I've been chasing in the logs are because bots are hitting links not normally followed.
CDot: a URL param would be an alternative
CDot: ?bots=denied
CDot: ?bot=goaway
gac410: How would that work?   Can robots.txt ban based on URL param? 
CDot: no; but any page hit that way could have all onward links suppressed
CDot: or even simply 404
gac410: I don't understand, but I'll open a proposal to improve bot restrictions beyond the nofollow=    We have to be careful with 404 - in that we don't want to hurt pageranks of the page containing the "bad link"  
SvenDowideit: it should be pretty easy to make a couple of apache rewrite conditions
gac410: I don't understand page ranks.  But if every history link on the bottom of every page generated 404's for bots,  could it impact the rank of the base page that a user did want linked.
SvenDowideit: where agent~~bot or ip~~known bot + url contain nofollow
SvenDowideit: and make that goto static html
SvenDowideit: then again, i still wish i could be bothered just making all bots rewrite to publishplugin html with a simplified skin
SvenDowideit: that way we control what the bots see to our advantage
***SvenDowideit hasn't done it in 8+years of thinking it, which would be a reason i'm pissy with myself :/
CDot: "url contain nofollow" - how can you tell?
SvenDowideit: when its requested
CDot: unless you mean ?bot=goaway
SvenDowideit: though i have used rewrite_html :)
gac410: Okay - so a brainstorm topic -  how do we limit bots from killing a site.   The load on foswiki.org is huge 
SvenDowideit: ok, if you want a real proper solution
CDot: nofollow isn't part of the URL, it's on the <a
SvenDowideit: that is to publish the site to html using a non-rev, non oops etc skin
SvenDowideit: and then use re-write to send that content to any known bot
SvenDowideit: then, whenever you id a new bot, add it to the list
SvenDowideit: that way their load is minimal
SvenDowideit: and fully to our advantage
SvenDowideit: hell, this was even something i suggested on t.o when our server was too small - serving a static html version of the site to all guest users
CDot: that would work; it's a lot of reskinning, tho
SvenDowideit: lot?
CDot: lot
SvenDowideit: very little - use a simplified plain skin
SvenDowideit: add a logo and a little navigation
SvenDowideit: and done
SvenDowideit: course, i'm biased there - i've written how many skins?
CDot: no idea
SvenDowideit: :p
SvenDowideit: mmm, i even have code that can identify and dashboard probable bots
SvenDowideit: i wonder where that is, and how non -foswiki it is
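
The "publish to static html, then rewrite known bots to it" idea from the IRC discussion boils down to a small routing decision. The following is a minimal sketch of that decision only, not an Apache configuration. The /static prefix, the .html suffix, the function names and the bot list are assumptions for illustration; in practice the same rule might be expressed as rewrite conditions on the user agent.

```python
from urllib.parse import urlsplit

# Known bots, matched as lowercase substrings of the user agent.
# Assumption: the list starts with the two bots seen in the logs and is
# extended whenever a new bot is identified.
KNOWN_BOT_MARKERS = ("bingbot", "googlebot")

# Hypothetical location of the published static copy of the site.
STATIC_PREFIX = "/static"


def is_known_bot(agent):
    """Return True if the user agent contains a known bot marker."""
    lowered = agent.lower()
    return any(marker in lowered for marker in KNOWN_BOT_MARKERS)


def static_target(agent, target):
    """Return the static path to serve, or None to pass through to Foswiki.

    For known bots the query string (rev=, sortcol=, cover=, ...) is
    dropped, so every variant of a topic maps to one static page.
    """
    if not is_known_bot(agent):
        return None
    path = urlsplit(target).path
    return STATIC_PREFIX + path + ".html"
```

With this rule, every rev= and sortcol= variant of a WebStatistics topic requested by a known bot maps to the same static file, while ordinary users still reach the dynamic view script.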

Discussion

Topic revision: r2 - 03 Nov 2011, GeorgeClark