Item11632: Frequently corrupted cache

Priority: Normal
Current State: Closed
Released In: n/a
Target Release:
Applies To: Extension
Component: LdapContrib
Branches: trunk
Reported By: JayenAshar
Waiting For:
Last Change By: MichaelDaum
About once a week, access controls on our Foswiki 1.1.3 instance seem to stop working. Deleting the LDAP cache usually makes it start working again, so I suspect the cache is getting corrupted by concurrent accesses. The attached patch to LdapContrib 4.30 switches from DB_File to BerkeleyDB in CDS mode, which will hopefully fix the problem.

-- JayenAshar - 11 Mar 2012

Thanks a lot for this extremely valuable contribution.

Now, how often do you refresh the ldap cache? Do you refresh it using a cron job? This refresh procedure is the only write access to the cache file; all other access is read-only. I did not go the BerkeleyDB way up to now because you normally don't refresh the cache often enough to run into a collision. Normal means: refresh the ldap cache once a day at night.

-- MichaelDaum - 11 Mar 2012

MaxCacheAge is set to 86400 (one day). Precache is set to off, so I assume it is writing more than just once a day. We had problems, so we additionally clear the cache with a cron job at 4:15am daily. We still had problems, so I have been trying BerkeleyDB.

-- JayenAshar - 11 Mar 2012

Hrm, my patch seems to be dropping __db.### files in the bin directory. I needed to set -Home when creating the Env.
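For reference, a minimal sketch of setting the environment home with the BerkeleyDB CPAN module (the path is a placeholder and this is not the exact code from the attached patch):

  use BerkeleyDB;

  # Without -Home, BerkeleyDB drops its __db.### region files into the
  # current working directory, i.e. bin/ when run from a CGI script.
  my $env = BerkeleyDB::Env->new(
    -Home  => "/path/to/working/work_areas/LdapContrib",    # placeholder
    -Flags => DB_CREATE | DB_INIT_CDB | DB_INIT_MPOOL,      # CDS mode
  ) or die "Cannot open BerkeleyDB environment: $BerkeleyDB::Error";

  my %cache;
  tie %cache, 'BerkeleyDB::Hash',
    -Filename => "cache.db",
    -Flags    => DB_CREATE,
    -Env      => $env
    or die "Cannot open cache.db: $BerkeleyDB::Error";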

-- JayenAshar - 12 Mar 2012

With the attached patch, I'm seeing empty cache.db_tmp files on occasion. Probably when the cache expires and two users are using the wiki. Though this doesn't appear to cause any problems for the user, it does prevent the cache from being refreshed. The error log has this to say:
Cannot open file .../working/work_areas/LdapContrib/cache.db_tmp: Permission denied at .../lib/Foswiki/Contrib/LdapContrib.pm line 704.
Cannot open file .../working/work_areas/LdapContrib/cache.db_tmp: Permission denied at .../lib/Foswiki/Contrib/LdapContrib.pm line 704.
Cannot open file .../working/work_areas/LdapContrib/cache.db_tmp: Permission denied at .../lib/Foswiki/Contrib/LdapContrib.pm line 704.
Cannot open file .../working/work_areas/LdapContrib/cache.db_tmp: Permission denied at .../lib/Foswiki/Contrib/LdapContrib.pm line 704.
- LdapContrib - WARNING: already refreshing cache

Any ideas on how to make sure only one thread tries to refresh the cache? Thanks.

-- JayenAshar - 04 Apr 2012

The original patch had its share of teething bugs. I've uploaded a new patch; let's see if it fixes the occasional empty tmp files.

-- JayenAshar - 07 Apr 2012

The cache was corrupted today (even with the attached patch). First time in months, so it's at least better than 4.30.

-- JayenAshar - 24 Jul 2012

A "permission denied" means that some other user created the cache.db. This happens for instance when you run a cache maintenance cronjob as a different user than your web server's user. E.g. cronjob as root + web server as www-data = permission denied.

In any other case cache corruption can be eliminated by setting the {MaxCacheAge} configure setting to zero. This will switch off any automatic cache maintenance. Instead use a cronjob of the form cd bin; ./view refreshldap=on to update the cache regularly.
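A crontab entry along those lines might look like this (a sketch only; the path is a placeholder, and per the note above it must run as the web server's user so the cache file permissions match):

  # refresh the LdapContrib cache every night at 02:00
  0 2 * * *  cd /path/to/foswiki/bin && ./view refreshldap=on >/dev/null 2>&1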

See LdapContrib#Updating_the_LDAP_cache_using_a_cronjob for more info.

-- MichaelDaum - 17 Apr 2013

All cron jobs regarding Foswiki run as the web server user wwwrun. The permissions on the files are fine.

I'm no longer maintaining this instance of Foswiki, but I'll let them know to change MaxCacheAge and flush the cache regularly.

-- JayenAshar - 17 Apr 2013

Hi,

We realized that there is a timeframe in which a race condition can result in a corrupted cache, even when the cache.db has already been created. This happens when two concurrent processes try to add a user or group to the tied hash.

The "dangerous" timeslot is depending on the amount of LDAP entries and groups. So in case of many requests in combination with "slow" LDAP requests we were faced with a corrupted cache.db frequently.

In order to avoid ANY update triggered by normal usage, I would propose the following code change:
  1. Add another cfg parameter like $Foswiki::cfg{Ldap}{DisableUpdateCacheDB} = 1;
  2. Initialize that parameter in LdapContrib.pm (sub new):
     disableUpdateCacheDB => $Foswiki::cfg{Ldap}{DisableUpdateCacheDB} || 0
  3. Allow tying the cache in read-only mode in LdapContrib.pm. Change sub initCache from
  $this->{cacheDB} = tie %{$this->{data}}, 'DB_File', $this->{cacheFile}, O_CREAT | O_RDWR, 0664, $DB_HASH
    or die "Cannot open file $this->{cacheFile}: $!";
to
  # open cache
  unless ($this->{disableUpdateCacheDB}) {
    #writeDebug("opening ldap cache from $this->{cacheFile}");
    # tie the cache file in read-write mode
    $this->{cacheDB} = tie %{$this->{data}}, 'DB_File', $this->{cacheFile}, O_CREAT | O_RDWR, 0664, $DB_HASH
      or die "Cannot open file $this->{cacheFile}: $!";
  }
  else {
    if (-e $this->{cacheFile}) {
      # tie the cache file in read-only mode
      $this->{cacheDB} = tie %{$this->{data}}, 'DB_File', $this->{cacheFile}, O_RDONLY, 0664, $DB_HASH
        or die "Cannot open file $this->{cacheFile}: $!";
    }
    else {
      # no cache file exists yet, so build a fresh one
      $this->refreshCache(1);
      return;
    }
  }

-- ChristianKoehn - 25 Sep 2013

A DisableUpdateCacheDB flag would be the third parameter controlling when and how the cache.db is updated (a configuration sketch follows the list below):

  1. Precache ... if set to 0 will disable fetching all user info in one refresh operation; instead new user info is fetched as users drive by
  2. MaxCacheAge ... if set to 0 will disable auto-refreshing the cache.db ... only relevant in precache mode
  3. EnableUpdateCacheDB ... if set to 0 will disable any write access other than an explicit call to refreshCache()
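In LocalSite.cfg terms, these three knobs would look roughly like this (a sketch only; the key names follow this discussion, EnableUpdateCacheDB is merely proposed and does not exist in the released contrib, and the defaults may differ from your installation):

  # LDAP cache behaviour (sketch)
  $Foswiki::cfg{Ldap}{Precache}            = 1;      # fetch all users/groups during a refresh
  $Foswiki::cfg{Ldap}{MaxCacheAge}         = 86400;  # auto-refresh once a day; 0 disables it
  $Foswiki::cfg{Ldap}{EnableUpdateCacheDB} = 1;      # 0 = only refreshCache() may write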

Is it possible to merge the logic of these three flags? Or, put the other way: which use cases do we have for the combinations of these settings:

| # | Precache | EnableUpdateCacheDB | MaxCacheAge | Description |
| 1 | 0 | 0 | 0 | ??? |
| 2 | 0 | 0 | > 0 | ??? |
| 3 | 0 | 1 | 0 | cache.db is emptied via cron during an explicit call to refreshCache(); new records are added incrementally |
| 4 | 0 | 1 | > 0 | cache.db is emptied from time to time; new records are added incrementally |
| 5 | 1 | 0 | 0 | cache.db is fully updated via cron; opened for writing only during an explicit refreshCache() operation, otherwise in read-only mode |
| 6 | 1 | 0 | > 0 | cache.db is updated automatically; opened for writing only during an explicit refreshCache() operation, otherwise in read-only mode |
| 7 | 1 | 1 | 0 | cache.db is fully updated via cron; occasional cache misses are filled in incrementally |
| 8 | 1 | 1 | > 0 | default setting; cache.db is updated automatically from time to time |

I think that the current backend isn't really able to deal with Precache disabled. Updates to the cache.db may happen rapidly and concurrently once cache.db is emptied and users hit the site. DB_File isn't fit for this purpose as it doesn't have any locking whatsoever.

There's also a security problem in that name clashes aren't resolved in a stable way.

  1. Frank Smith hits the site.
  2. His user data is fetched from the ldap directory.
  3. He is known as FrankSmith from now on.
  4. Frank Smith Junior hits the site.
  5. His user data is fetched from the ldap directory.
  6. A name clash is detected as there's another Frank Smith already there using another DN in the ldap directory.
  7. He is known as FrankSmith1 from now on. This decision is remembered until a refreshCache() is performed in force mode to override all name clash decisions.
  8. When Precache is switched off and MaxCacheAge > 0, or refreshCache() is called explicitly from a cron job, the cache.db is emptied!!!
  9. Now, Frank Smith Junior hits the site before his father.
  10. His user data is fetched from the ldap directory.
  11. He is known as FrankSmith from now on. He does not get the FrankSmith1 ID again ... and now owns all data of his father.

-- MichaelDaum - 25 Sep 2013


Has anyone looked at Tie::DB_Lock, Tie::DB_LockFile or DB_File::Lock?

Another possibility could be to do a transition to http://search.cpan.org/dist/BerkeleyDB/BerkeleyDB.pod which also has support for locking.

It does, however, seem that a switch to BerkeleyDB would require the existing database to be converted.

-- TerjeNessAndersen - 02 Oct 2013

Concerning DB_File::Lock to battle race conditions:

I created a little script that ran `./view -user $user -topic Main.WebHome` for 10 users (present in LDAP) against an empty cache on an empty Foswiki (latest version).
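A rough reconstruction of such a driver (not the original script; the login names are placeholders for accounts that exist in LDAP):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Time::HiRes qw(time);

  my @users = map { "testuser$_" } 1 .. 10;    # placeholder login names

  my $start = time();
  for my $user (@users) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
      # child: hit the wiki once as this user via the command line
      my $out = `./view -user $user -topic Main.WebHome`;
      exit 0;
    }
  }
  1 while wait() != -1;    # reap all children
  printf "done in %.2f seconds\n", time() - $start;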

These calls were made in child processes, making them hit the wiki in parallel. I did three tests, removing the cache files before each test, with preCache set to off.
  • First test: 2 users present in cache, 7.06 seconds.
  • Second test: 2 users present in cache, 6.96 seconds.
  • Third test: 1 user present in cache, 6.47 seconds.

Obviously some race conditions going on here.

Then I altered the LdapContrib to use DB_File::Lock, and ran the script again three times:
  • First test: 10 users present in cache, 6.48 seconds
  • Second test: 10 users present in cache, 6.20 seconds
  • Third test: 10 users present in cache, 6.35 seconds

Seems to be working well. I wonder why the tests seem to go quicker with write locking on the cache file; shouldn't it have the opposite effect? More tests need to be done, but this seems promising. A diff of my changes to LdapContrib is attached.
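For readers without the attachment, the change essentially amounts to tying with DB_File::Lock instead of DB_File and passing an extra locking argument, roughly like this (a sketch, not the exact patch; the cache path is a placeholder):

  use DB_File;          # for $DB_HASH
  use DB_File::Lock;
  use Fcntl qw(O_CREAT O_RDWR);

  my $cacheFile = "/path/to/working/work_areas/LdapContrib/cache.db";    # placeholder
  my %data;

  # 'write' holds an exclusive lock for the lifetime of the tie;
  # 'read' takes a shared lock so readers don't block each other.
  tie %data, 'DB_File::Lock', $cacheFile, O_CREAT | O_RDWR, 0664, $DB_HASH, 'write'
    or die "Cannot open file $cacheFile: $!";

  $data{SOMEKEY} = 'some value';    # ... update the cache ...

  untie %data;    # flushes the data and releases the lock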

-- TerjeNessAndersen - 03 Oct 2013

Terje, thanks for the patch! Does it fix the problem initially reported, Jayen?

A possible explanation for it being faster with locking than without is I/O being serialized while talking to the ldap server and while talking to the hard disk. Otherwise you get the horde-of-burning-cows effect of a lot of similar things happening at the very same time on the server.

-- MichaelDaum - 09 Oct 2013

I've passed on the patch to the person now maintaining the wiki. I haven't looked at this for 18 months.

-- JayenAshar - 09 Oct 2013

Hi, thanks to Terje. This is for sure the preferred fix for the underlying problem.

-- ChristianKoehn - 10 Oct 2013

Terje, I've reviewed and merged your patch and it seems to work all right. There were two problems, though. One was that, under certain circumstances, the code reconnected to the cache file with write permissions but did not drop back to read-only because of an early return happening before that.

The other was that it dropped write permissions too early, while still writing to the tied hash.

See the modified patch attached. What do you think?

-- MichaelDaum - 10 Oct 2013

Seems good Michael, thank you for noticing the bugs!

There is one scenario that still could be an issue.

In refreshCache, what happens if one or more other processes try to acquire a write lock on the main cache prior to the following code sequence?
  # try to be transactional
  $this->untieCache();

  #writeDebug("replacing working copy");
  rename $this->{tempCacheFile}, $this->{cacheFile};

  # reconnect hash
  $this->tieCache('read');

This line of events illustrates it:
  1. PreCache is true
  2. Process 1: A cronjob or user starts refreshing the cache. The main cache now has a read lock, the temporary cache is being built.
  3. Process 2: A new, unknown user enters the wiki (unknown meaning not present in the cache).
  4. Process 2: checkCacheForLoginName is run, the user is determined to be unknown, and its LDAP entry is fetched.
  5. Process 2: It tries to take a write lock on the main cache and has to wait, since Process 1 (building the new cache) is still active.
  6. Process 3: Same as steps 3-4 for another unknown user. Also waiting for a write lock.
  7. Process 1: The temporary cache is now finished. Untie main cache. Rename temporary cache to main cache. Retie main cache.

  • What version of the cache will process 2 and 3 access when they finally acquire a write lock?
  • If they access the newly refreshed cache (built by process 1), will it matter? (They will probably write the same user entry again, since it was already picked up by the cache refresh.)
  • Could process 2 or 3 theoretically acquire a write lock on the main cache in the timeframe between the main cache being untied and renamed by process 1?
    • If so, will the file rename be unsuccessful?
    • If the rename -is- successful after process 2 or 3 has acquired a write lock, will they be writing to the new cache then?

The main problem by far would be if the write lock acquired by process 2 or 3 caused the rename in process 1 to fail.

I plan to investigate this further. Just thought it would be wise to share my thoughts on this here, in case you have some input as well.

-- TerjeNessAndersen - 22 Oct 2013
 

ItemTemplate

Summary Frequently corrupted cache
ReportedBy JayenAshar
Codebase 1.1.3
SVN Range
AppliesTo Extension
Component LdapContrib
Priority Normal
CurrentState Closed
WaitingFor
Checkins LdapContrib:6427a08e7c6e
ReleasedIn n/a
CheckinsOnBranches trunk
trunkCheckins LdapContrib:6427a08e7c6e
Release01x01Checkins
| Attachment | Size | Date | Who |
| LdapContrib.pm_DBFileLock.patch | 5 K | 10 Oct 2013 - 13:02 | MichaelDaum |
| db.patch | 2 K | 07 Apr 2012 - 08:48 | JayenAshar |