Item9059: Further counter measures needed to avoid truncated .htpasswd

pencil
Priority: Urgent
Current State: Closed
Released In: 1.0.10, 1.1.0
Target Release: patch
Applies To: Engine
Component: HtPasswdUser.pm
Branches:
Reported By: MatthiasThullner MichaelDeckert
Waiting For: Main.KennethLavrsen
Last Change By: KennethLavrsen
We got a truncated .htpasswd file in our productive system.

Behaviour seems similar (or same) as described in Item2537, we did not install the patch attached to Item2537

-- MatthiasThullner - 25 May 2010

While analysis we found this entry in our log201005.txt:

...
2010-05-25T05:30:51Z info <user name> changepasswd <user name>   10.8.141.73
2010-05-25T05:30:51Z info <user name> changepasswd <user name>   10.8.141.73
...

It is the only entry in the time frame the error occured, where two htpasswd changing actions have been started within some seconds. Thus it is most probable the reason for the fault.

-- MichaelDeckert - 26 May 2010

For analysis: we are using this system configuration:

  • OS-Name: CentOS release 5.4 (Final)
  • OS-Release: 2.6.18-164.11.1.el5
  • Platform: x86_64
  • Lang: en_US.UTF-8
  • HW-Type: VMware ESX Cluster Abt2

-- MichaelDeckert - 31 May 2010

Hi KennethLavrsen, the problem above could be related to Tasks.Item2537, which was fixed by you. Could you please have a look at our problem report? Could it be related to Tasks.Item2537? Can there be a problem with our server OS? Best Regards

-- MichaelDeckert - 01 Jun 2010

It sounds exactly like the same problem as in 2537.

The patch on that report was for the - at that time - current SVN version and cannot be applied to 1.0.8.

I implemented a fix in 1.0.9 using a simple file lock.

Are you guys on 1.0.9 or on an earlier version of Foswiki?

There is no issue with your OS I am sure. Same as I am developing on except for the VM.

I am concerned if you are on 1.0.9 and my lock file scheme is not working. If you are on 1.0.8 or earlier then please upgrade to 1.0.9.

-- KennethLavrsen - 02 Jun 2010

We are really using 1.0.9 archive. We have not installed 1.0.8 before, it's just 1.0.9 without any updates or fixes (as mentioned in fields below). So we also do not use your patch.

Another thing that we are facing is a strange "double loading" of topics. Especially the Main is affected. Worst case it is loaded 50 times in a row (according to the log file). Something similar is happening on other topics, but there only some components are loaded double. This might have caused the double trigger of changepasswd above. So maybe the two triggers came within some milliseconds.

Regarding your patch I was not able to follow the hole discussion. Why are you using "a simple file lock" instead of the proposed solution? What would be the benefit of the "proper" solution?

-- MichaelDeckert - 02 Jun 2010

Ahh, and one more finding: the file "./working/htpasswd.lock" is always there. Is it intended that way? I'm sorry, I can read your code, but I'm not used to these locking mechanisms on a linux server.

-- MichaelDeckert - 02 Jun 2010

OK you are on 1.0.9. I just had to get that confirmed. It is often that people do not pay attention to the Codebase field in bug reports.

The use of a lock file is the safe way to avoid race conditions.

And it is correct that the file working/htpasswd.lock is always there. The code uses the filesystem locks on this lock file.

http://interglacial.com/tpj/23/ was my primary source of information when I implemented the solution. It is pretty good at explaining what is happening. There is no doubt that protection with a lockfile is safer than using a lock directly on the file you want to protect. The reason is that the code that is used to open files and lock them is not "atomic". This means that even a simple perl code line is translated into quite many machine code instructions, and two processes can run almost at the same time and the code can start a read of a file in the nanoseconds between the other process is taking to open a file and locking it. The use of a lock file eliminates this problem.

The thing I did not implement in Item2537 was a check of the lockfile for normal read. This means that while a password is being changed, other people just arriving to the site and getting authenticated could run into being denied access because it happened just in the moment the password file was updated. This would be a very rare event and people could simply just try again. This is not your problem. You get a truncated password file and this should not be possible with my fix.

But. There are several things that can have gone wrong.

  • Maybe the file locking mechanism does not work in a VM machine? Or works less well.
  • Maybe you use some additional extensions that take care of users and passwords.
  • Maybe you have accidently changed the access rights to the file lock file so Apache cannot lock it anymore. This can happen if you run something from command line.
  • You obviously have another additional problem with all those reads you see. This needs to be investigated.
  • Finally - I can have implemented the file lock code wrong or missed one place in the code where I should have used the lock also.

Maybe we can code up some small test programs to check the lock file mechanism in VM. Maybe also add some debug lines in your code so we can see that things work as they should.

This is one I want to get to the bottom of.

Can you give additional info about your environment. Especially which extensions you run and which password managers you use.

-- KennethLavrsen - 03 Jun 2010

OK, so let's debug. Here are my answers to your ideas:

  • Maybe the file locking mechanism does not work in a VM machine? Or works less well.
    • I agree that we should do a test to ensure that this is not the problem.

  • Maybe you use some additional extensions that take care of users and passwords.
    • We are using
      {PasswordManager} = Foswiki::Users::HtPasswdUser 
      {MinPasswordLength} = 7
      {Htpasswd}{FileName} = /web/www/html/twiki/data/.htpasswd
      {Htpasswd}{Encoding} = crypt 
    • I don't think that we use other extensions. To ensure that could you give me a hint how I can do further checks to ensure that.

  • Maybe you have accidently changed the access rights to the file lock file so Apache cannot lock it anymore. This can happen if you run something from command line.
    • we have a cron job ensuring that this fault should not happen
      chown -R apache:apache /web/www/html/twiki/*  
    • I think your implementation should anyway ensure that the file doesn't get corrupt, the only problem would be that the passwords could not be changed

  • You obviously have another additional problem with all those reads you see. This needs to be investigated.
    • Your right, but all attempts to find out whats happening failed till now. The problem occurs quiet often so that it causes problems, nevertheless, when we try to analyze it's hard to reproduce frown, sad smile . Any ideas?

  • Finally - I can have implemented the file lock code wrong or missed one place in the code where I should have used the lock also.
    • I think I found an issue, but I would like you to analyze my finding. In the code below there is no try-catch, so if the locking fails, I currently think that the program continues.
      sub setEmails {
          my $this  = shift;
          my $login = shift;
          ASSERT($login) if DEBUG;
      
          my $lockHandle = _lockPasswdFile();
          my $db = $this->_readPasswd();
          unless ( $db->{$login} ) {
              $db->{$login}->{pass} = '';
          }
      
      #SMELL: this makes no sense. - the if above suggests that we can get to this point without $db->{$login}
      #  what use is going on if the user is not in the auth system?
          if ( defined( $_[0] ) ) {
              $db->{$login}->{emails} = join( ';', @_ );
          }
          else {
              $db->{$login}->{emails} = '';
          }
          _savePasswd($db);
          _unlockPasswdFile( $lockHandle );
          return 1;
      } 
    • I analyzed our internal call (RB Bugzilla 32785) again and found these activities (other logs skipped): ...
      2010-05-25T03:09:33Z info guest resetpasswd FangWang Mozilla 10.8.141.73
      2010-05-25T05:06:14Z info guest view System.UserRegistration Mozilla 10.47.32.55
      2010-05-25T05:10:45Z info guest resetpasswd KrishnamoorthiS Mozilla 10.47.32.55
      2010-05-25T05:12:15Z info KrishnamoorthiS changepasswd KrishnamoorthiS   10.47.32.55
      2010-05-25T05:20:58Z info admin save Main.WikiUsers   10.8.141.73
      2010-05-25T05:20:58Z info RegistrationAgent save Main.SongWeijin   10.8.141.73
      2010-05-25T05:20:59Z info SongWeijin register Main.SongWeijin <email removed> 10.8.141.73
      2010-05-25T05:25:56Z info guest resetpasswd RainHE Mozilla 10.8.141.73
      2010-05-25T05:30:51Z info RainHE changepasswd RainHE   10.8.141.73
      2010-05-25T05:30:51Z info RainHE changepasswd RainHE   10.8.141.73
      2010-05-25T05:35:44Z info admin save Main.WikiUsers repRev 100 by BaseUserMapping_333 2010/05/25 05:35:44 10.8.141.73
      2010-05-25T05:35:44Z info RegistrationAgent save Main.KevinHan   10.8.141.73
      2010-05-25T05:35:45Z info KevinHan register Main.KevinHan <email removed> 10.8.141.73
      ... These are all activities in the time frame the problem occurred. Functions "changepasswd", "resetpasswd" and "register" where used, so I think the problem must occur in one of them. I gess all of the functions in Foswiki::Users::HtPasswdUser are used at least once frown, sad smile .
    • The only thing I also saw in the logs above is that several users use the same IP address. In fact there are only two IP addresses used (one in India, one in China). This could be the case because most of our Chinese are working on thin clients, where the server is located in Germany and serves several users. Could that also lead to problems?

-- MichaelDeckert - 03 Jun 2010

ok, the first topic seems to be clarified. The locking works VM. I attached a small test file using the same mechanism. I started the file in two independent putty sessions, both with same user (root). The second one waited for the first to unlock. After that the lock was set and the second one continued and completed too. Thus I think the mechanism works, at least if the file is accessible to the user.

-- MichaelDeckert - 04 Jun 2010

I looked at the code I did. And I am pretty sure that the lock file approach does work. And I am also pretty sure that while the lock file is locked, other Foswiki processes will wait till the file is unlocked again.

You wrote above that there is no catch in case the my $lockHandle = _lockPasswdFile() fails.

Yes and No. Inside _lockPasswdFile. If the lockfile is not accessible the code will throw an error. And I have confirmed that if I change the lock file so Apache cannot access it, the passwords will not be set or reset and emails will not be changed.

There is however a user interface problem to be solved in this exception case because the user is told that all is fine even though no change is made. But that is another problem.

Looking at your log it seems that the same user is sending the same request in the same second. I wonder if the second request will abort the first. I am thinking that perhaps this is not at all related to lockfiles working or not, but more a problem with the code being aborted in the middle of writing the password file. I am not sure how the mechanism works in this case. I have a feeling this is where the problem is.

Ie. it is not simultaneous requests from multiple users we should care about but multiple requests from the same user with the same session ID.

-- KennethLavrsen - 04 Jun 2010

Well, I agree that an exception is an exception, no matter if there is a try-catch. So what's left?
  • the double-loading problem: it seems to be related to this bug.
  • the problem of interrupting the writing of the .htpasswd file

So to get some data - at least if the problem occurs again - I entered some debug entries in the functions "_lockPasswdFile", "_unlockPasswdFile", "setPassword", "removeUser" and "setEmails". So if the fault happens again, we can see the start and end of all these functions and the user involved. So we can ask him what he did and find out something about his PC.

Regarding the second problem I'm not sure if I can help. I'm not used to those mechanisms.

-- MichaelDeckert - 05 Jun 2010

It's notable that the htpasswd file in this case is truncated - not empty, not corrupted, truncated. So how can this happen? Inspecting the code, the write of the file is done thus:
    umask(077);
    open( FILE, '>', "$Foswiki::cfg{Htpasswd}{FileName}" )
      || throw Error::Simple(
        $Foswiki::cfg{Htpasswd}{FileName} . ' open failed: ' . $! );

    print FILE _dumpPasswd($db);
    close(FILE);
For the file to be truncated, rather than just empty, the write has to fail in the middle of that print statement, after dumpPasswd has completed successfully. Since that's a simple print of a string, that seems unlikely - the only way I can see that happening is if the process receives a signal, for example if apache times it out, or some similar external event.

Putting that aside, what if "truncated" actually means "emptied"? Let's say we have some random exception - say an undefined variable - that triggers during dumpPasswd. The open has already happened, so the file has been zeroed for write. The exception will be caught by the try..catch block in setPassword, which will result in the process terminating normally, closing the empty file behind it. It will also report an error to the logfile, which should be checked for. (lesson: the open-write-close shouldn't happen until after dumpPasswd has been completely executed and a static string obtained).

Of course if this happened during setPassword, you would be seeing errors in your apache error log. The error checking in the other calls - removeUser and setEmails is pretty abysmal, but there's no suggestion from your description that either of these is being called.

So, I'd recommend you check your apache error log (and any other logs) for errors printed in the same timeframe.

Later: after a bit of a code inspection, I can see an error scenario. It revolves around the fact that _readPasswd doesn't always read the file. The first time _readPasswd is called, the passwords are cached in $this->{passworddata}. This is returned by subsequent calls to _readPasswd. Now, consider this scenario. Two processes, A and B:
  1. A wants to change a password, so it locks the password file and opens it for write (this clears the password file)
  2. B also wants to change a password. First it has to check a password, so it calls _readPasswd.This caches the empty file just created by A. The password check passes (for whatever reason)
  3. B tries to lock the file, which makes it wait on A.
  4. A finishes writing, and closes and unlocks the new password file.
  5. B locks and writes the new password file, but it doesn't read the new file - instead, _readPasswd uses the empty cache.

Later still: I just checked in a change to trunk that addresses the scenario just described. Can't be certain it fixes your specific problem, but it does fix a problem.

Marked as waiting for correspondents attention, to get their assessment. If the scenario I just described is the problem, then I'd expect to see a password file with a single entry (the password changed by B). If some other type of "truncation" is happening, then this is probably not the problem.

-- CrawfordCurrie - 11 Jun 2010

Sorry, but
For the file to be truncated, rather than just empty, the write has to fail in the middle of that print statement, after dumpPasswd has completed successfully.
is wrong.

For the file to be truncated, the DB has to be tempered with, so between the _readPasswd and the _savePasswd. One very easy way, which seems to be happening here is the following scenario.
  1. A wants to change his password, so he clicks on the 'Change my password' button
  2. A is a user, so it double-clicks instead of clicking just once that button
And what happens is this (me thinks):
  1. A makes a request, which starts a perl process (CGI or whatever persistent engine shouldn't matter, but CGI might be easier to follow)
  2. Apache forks process perlA, which reads the DB, locks the files, and starts writing it back.
  3. A clicked a second time, so its browser will drop the first request, and create a new one.
  4. Apache sees that this request has been interrupted, therefore it kills process perlA because it's not needed anymore.
  5. process perlA is being asked by the system to terminate, therefore it closes its filehandles half-way through the write.

In theory, there might be a warning message in the Apache logs, but I guess one way to try and reproduce is to play with a huge password file (otherwise the write is too quick to be interrupted), on a slow filesystem (for the same reasons), and click 10 times in a row the 'Change my password' button.

I already mailed what I thought would be a proper fix for this.

-- OlivierRaginel - 11 Jun 2010

Sorry, my mail was wrong, so my proposed fix is:
  1. Obtain an exclusive lock for the .lock file
  2. Read (again) the normal password file to ensure no race-condition can have occured since the time we last read-it and the time we obtained an exclusive lock
  3. Write the new DB file in the .lock file
  4. Rename the .lock file to the normal one.

As CDot points out, renaming should only be atomic on the same filesystem, so we want to make sure we're writing most likely in the same directory as where the password file resides.

Otherwise, what we could do is to do it in place, hoping that the writing is fast enough for other requests not to be affected (they might only fail to find some users while the file is being written, but a reload should fix it).

-- OlivierRaginel - 11 Jun 2010

Another catch-net would be to keep a backup of the password file.

-- CrawfordCurrie - 11 Jun 2010

Also, this broke one valid unit test. If the registration process fails half-way through, like for RegisterTests::test_4061 which sets the WikiUsers topic read-only, the user is still added to the password file, That's because your fix doesn't take into account the removal of the user from the cached DB, because you're forcing a refresh of the DB.

So what we need to do, at least for registration, is:
  1. Obtain an exclusive lock
  2. Read (again) the normal password file
  3. Try registration
  4. Write new DB in the .lock file
  5. Rename .lock to normal

-- OlivierRaginel - 12 Jun 2010

Ok, implemented proper locking of the htpasswd file, so this should never happen again. Do we need to backport this to 1.0.9? I don't think so, as it works most of the times. But this won't help our users, so I'm backporting it...

Kenneth, as I have no idea how you can deal with this from a release manager level, I've put it to waiting for release, for both, but it's set to minor.

-- OlivierRaginel - 13 Jun 2010

so here are the other logs:
  • debug.txt = no relevant entry for that time frame
  • warn201005.txt = no relevant entry for that time frame

in the apache access_log log there are these entries (only change password is listed):
10.47.32.55 - - [25/May/2010:07:11:58 +0200] "POST /twiki/bin/login/System/ChangePassword HTTP/1.1" 302 - "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/login/System/ChangePassword?origurl=/twiki/bin/view/System/ChangePassword" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
10.47.32.55 - - [25/May/2010:07:12:15 +0200] "POST /twiki/bin/manage/System/WebHome HTTP/1.1" 200 6597 "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/view/System/ChangePassword?foswiki_redirect_cache=a09ca05c73f849f89202171250f79c36" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
10.8.141.73 - - [25/May/2010:07:30:29 +0200] "POST /twiki/bin/login/System/ChangePassword HTTP/1.1" 302 - "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/login/System/ChangePassword?origurl=/twiki/bin/view/System/ChangePassword" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
10.8.141.73 - - [25/May/2010:07:30:50 +0200] "POST /twiki/bin/manage/System/WebHome HTTP/1.1" 200 6570 "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/view/System/ChangePassword?foswiki_redirect_cache=ee58d45fbc3330b8b916ea8b79e04342" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
10.8.141.73 - - [25/May/2010:07:30:51 +0200] "POST /twiki/bin/manage/System/WebHome HTTP/1.1" 200 6570 "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/view/System/ChangePassword?foswiki_redirect_cache=ee58d45fbc3330b8b916ea8b79e04342" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
10.47.32.55 - - [25/May/2010:11:41:20 +0200] "POST /twiki/bin/login/System/ChangePassword HTTP/1.1" 403 9129 "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/login/System/ChangePassword?origurl=/twiki/bin/view/System/ChangePassword" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 by Peacy (.NET CLR 3.5.30729)"
10.47.32.55 - - [25/May/2010:11:47:37 +0200] "POST /twiki/bin/login/System/ChangePassword HTTP/1.1" 302 - "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/login/System/ChangePassword?origurl=/twiki/bin/view/System/ChangePassword" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 by Peacy (.NET CLR 3.5.30729)"
10.47.32.55 - - [25/May/2010:11:58:16 +0200] "POST /twiki/bin/manage/System/WebHome HTTP/1.1" 200 92085 "http://abt-ismtwiki.abt.de.bosch.com/twiki/bin/view/System/ChangePassword" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

For me that shows that the user most probably has pressed the button twice at 07:30:5x . So the scenario above seems plausible. Also I think renaming is more save. So the file is available or not. If not the reading process could wait for the file to be back.

The error_log does not contain relevant data.

Sorry for the long response times, we are also in a release process currently :-|

-- MichaelDeckert - 14 Jun 2010

http://trac.foswiki.org/changeset/7757 is what broken strawberry perl's unit tests very badly.

so I've reverted it frown, sad smile on trunk - I'm truely hating windows's inability to do patch reversion well.

my recollection is that doing the lockfiles with an explicit .lock file was needed to get around some braindead-ness when the data was stored on NFS, but it looks like its also busted on strawberry windows

just putting back
+    sysopen( my $fh, $Foswiki::cfg{Htpasswd}{FileName},
+        Fcntl::O_RDWR | Fcntl::O_CREAT, 0666 )
+      || throw Error::Simple( $Foswiki::cfg{Htpasswd}{FileName}
+          . ' open or create password file failed -'
+          . ' check access rights: '

causes it all to fail

-- SvenDowideit - 16 Jun 2010

You can't just put back that piece of code, of course, but I'm sure you know that.

Ok, then plan B... do it all in the .lock, and pray that the renaming works? I hate it. I'll investigate why it's failing on windows. Also, NFS offers no lock at all, so I don't see the point in the current scheme. Other than creating an exclusive lock with a non-existing file... Yeah, I could do that too... In fact, that's the "recommended" locking practice in the Camel (at least one they detail).

-- OlivierRaginel - 16 Jun 2010

What is happening to this important bug report?

And what is the situation for 1.0.10?

Some code was reverted but only on trunk.

And where is the lock file I introduced? I had a dammed good reason to use a lock file which I documented very well in the bug item in which I introduced it.

You cannot lock the password file itself without having a race condition situation. A lot of literature documents this.

-- KennethLavrsen - 19 Jul 2010

I am pretty sure I know what is happening. I will work on this. And I will go back to using my semaphone file. And I do not plan renaming files. That lowers the risk but it does not eliminate it.

We need to prevent reading the passwd file at any time while writing. See Item2537 and the references I point to why the lock file is the safest way.

-- KennethLavrsen - 20 Jul 2010

I have done a major rewrite of my original lockfile fix.

The problem we have is that we correctly protected the .htpasswd file from being written by two processes.

But we did not prevent that one process could read while the other is writing.

I have reverted the fix that tried to overwrite the .htpasswd file line by line That improved the robustness but not 100%. We can still loose a line or two this way I have also reverted the fic that tried to reread the password file before writing to it Not because it was wrong, but because we still had the issue that a near empty passwd file could be read by one process while the other was writing.

So instead we keep on setting an exclusive lock on a lockfile when we write.

But in addition we set a shared lock when we read. We only do this when needed. Ie when we read the file the first time. We do not lock anything when we can get the data from the cached password file data.

This way the only time we slow down response to the user is when someone changes password or email address or registers. And it will be a fraction of a second or few seconds if the password file is gigantic.

All unit tests pass.

-- KennethLavrsen - 20 Jul 2010
 
Topic revision: r38 - 08 Sep 2010, KennethLavrsen
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy