I've moved my blog to jmcneil.net. This is no longer being updated!

Wednesday, May 28, 2008

More NFS...

I am able to reproduce the issue at will now using both a Linux NFS
server and the NAS system. We can pretty safely rule out the NAS as the
culprit (though let's keep the ticket open, in case they've seen this
before).

As I've already explained, Linux caches file names so as to avoid hitting
the disk on every lookup (the dentry cache). When there is a cache hit
(either negative or positive), the kernel calls a 'validate' function to
ensure that the cached data is still good.

On local file systems, that function is simply a no-operation as all
updates need to go through the cache. Coherency is guaranteed.

On remote file systems (NFS, GFS, etc...) that call does "something" to
validate the contents of the cache. If the backing data has changed,
that entry is invalidated and the entire look up process starts over.

If there is a cache entry, the NFS validate function checks the cached
mtime on that entry's parent and compares it with the current mtime as
reported by the NFS server. If there is a difference, the entry is
invalidated.
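
For reference, a rough way to eyeball that comparison from userspace (same hosts and path as the steps below) is to ask two clients for the parent directory's mtime. It only shows each client's cached view rather than the server's answer, but a mismatch is a useful hint:

# Compare the parent directory's mtime as seen by the client holding the
# stale negative entry and by another client.
for h in cluster-www1a cluster-sshftp1a; do
    echo -n "$h: "
    ssh root@$h "stat -c %y /home/cluster1/data/f/f/nfsuser01/"
done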

That process appears not to work correctly when dealing with negative
entries. It seems to work fine for positive entries (file *changes* show
up, just not new files in some cases).

So, after spending some time tracing how this whole process actually
works, I was able to reproduce the problem at will:

  1. Stat a file that we know is non-existent. This populates the dentry
    cache with a negative entry.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "stat /home/cluster1/data/f/f/nfsuser01/test_nofile"
    stat: cannot stat `/home/cluster1/data/f/f/nfsuser01/test_nofile': No
    such file or directory


  2. Create that file on a different server. This also updates the
    mtime on the parent directory, so the NFS validation code on the dentry
    hit ought to catch it.

    [root@cluster-util ~]# ssh root@cluster-sshftp1a
    "touch /home/cluster1/data/f/f/nfsuser01/test_nofile"


  3. Try and stat the file again. Still broken.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "stat /home/cluster1/data/f/f/nfsuser01/test_nofile"
    stat: cannot stat `/home/cluster1/data/f/f/nfsuser01/test_nofile': No
    such file or directory


  4. Read the parent directory. This doesn't actually *repopulate* the
    cache as I first thought; it invalidates the parent directory itself
    (which is what should have happened the first time through). It
    doesn't invalidate it "the same way", though.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "ls /home/cluster1/data/f/f/nfsuser01/ | wc -l"
    16


  5. And now it's present, as evidenced by the stat command below.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "stat /home/cluster1/data/f/f/nfsuser01/test_nofile"
    File: `/home/cluster1/data/f/f/nfsuser01/test_nofile'
    Size: 0 Blocks: 0 IO Block: 4096 regular empty file
    Device: 15h/21d Inode: 4046108346 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2008-05-28 10:07:28.963000000 -0400
    Modify: 2008-05-28 10:07:28.963000000 -0400
    Change: 2008-05-28 10:07:28.963000000 -0400
    [root@cluster-util ~]#
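
For convenience, here are the same five steps crammed into one script (same hosts and paths as above), so the whole thing can be re-run in one shot:

#!/bin/bash
# Condensed reproduction of steps 1-5 above.
F=/home/cluster1/data/f/f/nfsuser01/test_nofile

ssh root@cluster-www1a    "stat $F"                 # 1. seeds a negative dentry
ssh root@cluster-sshftp1a "touch $F"                # 2. create it from another client
ssh root@cluster-www1a    "stat $F"                 # 3. still ENOENT when the bug hits
ssh root@cluster-www1a    "ls ${F%/*} > /dev/null"  # 4. readdir() on the parent
ssh root@cluster-www1a    "stat $F"                 # 5. now the file shows up
ssh root@cluster-sshftp1a "rm -f $F"                # clean up so it can be re-run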


The next step is to load up a debug kernel or stick a SystemTap script in there to try to determine why that negative entry isn't being invalidated as it should be.
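
A cheaper first pass, before going the debug kernel route, might be the NFS client's built-in debug logging (assuming the kernel exposes /proc/sys/sunrpc/nfs_debug, and I haven't verified how much of the lookup/revalidate path it actually reports on this kernel):

# On the affected client: turn on the NFS debug flags, poke the path, read
# the kernel log, then turn the flags back off.
echo 32767 > /proc/sys/sunrpc/nfs_debug     # enable all NFSDBG_* flags
stat /home/cluster1/data/f/f/nfsuser01/test_nofile
dmesg | tail -n 50
echo 0 > /proc/sys/sunrpc/nfs_debug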

Friday, May 23, 2008

Linux Caching Issues

Recently, we started seeing issues where a file would exist on a couple of the NFS clients, but not the others. The front-end Apache instance would return different results depending on which cluster node our load balancers directed us to. In some cases, we'd receive 404 errors. In other cases, we'd get the actual content. In a third scenario, we would get the default Red Hat index page (as the NFS client couldn't access the *real* DirectoryIndex page).

At first, I thought it was an Apache problem as we run some custom modules which handle URL translation.

However, another issue popped up with a PHP script. That PHP script was attempting to include another as a function library. In this case, that include was failing with an ENOENT on some systems, but not on others. Clearly this wasn't an Apache problem.

I've yet to be able to reproduce the problem, but I've a few existing instances to test with.

Given a situation where we're getting a 404 half of the time, I ran the following test against our NFS clients:


# for i in www1a www1b www1c www1d ; do echo $i; ssh cluster-$i-mgmt "stat /home/cluster1/data/s/c/user/html/index.htm"; done
www1a
stat: cannot stat `/home/cluster1/data/s/c/user/html/index.htm': No such file or directory
www1b
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
www1c
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
www1d
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400

When logging in to 'www1a', it's just not possible to read the file directly. I can't cat it, stat it, or ls it. However, once I step into the containing directory (html, in this case) and run an 'ls', the file is now available. It looks as though the readdir() triggered by my ls command updates the cache.
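
In other words, the manual workaround boils down to this (same path as above):

# On the affected client: reading the directory clears the stale negative entry.
cd /home/cluster1/data/s/c/user/html
stat index.htm     # fails with "No such file or directory"
ls > /dev/null     # readdir() on the parent directory
stat index.htm     # now succeeds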

So, my assumption at this point is that it's a directory cache problem. For some reason, we're getting negative entries (NULL inode structure pointers). I've no idea why.

The problem came up again this morning. As it turns out, Linux 2.6.16 allows users to dump various caches in order to free the memory being used. I tried it out on one of the systems experiencing the problem:


[root@cluster-www1c vm]# ls
block_dump                 drop_caches           max_map_count       overcommit_ratio          swappiness
dirty_background_ratio     hugetlb_shm_group     min_free_kbytes     pagecache                 swap_token_timeout
dirty_expire_centisecs     laptop_mode           nr_hugepages        page-cluster              vdso_enabled
dirty_ratio                legacy_va_layout      nr_pdflush_threads  panic_on_oom              vfs_cache_pressure
dirty_writeback_centisecs  lowmem_reserve_ratio  overcommit_memory   percpu_pagelist_fraction
[root@cluster-www1c vm]# stat /home/cluster1/data/s/c/user/html/index.htm
stat: cannot stat `/home/cluster1/data/s/c/user/html/index.htm': No such file or directory
[root@cluster-www1c vm]# sync
[root@cluster-www1c vm]# sync
[root@cluster-www1c vm]# echo '2' > drop_caches
[root@cluster-www1c vm]# !st
stat /home/cluster1/data/s/c/user/html/index.htm
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
[root@cluster-www1c vm]#


So, we've got a bit of a workaround, at least for the long weekend. I'll probably wind up setting up a cron job to dump the cache every half hour or so in order to avoid phone calls.
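
Something along these lines should do it (the cron.d file name is just a placeholder; '2' frees dentries and inodes without touching the page cache):

# /etc/cron.d/drop-dcache -- placeholder name
*/30 * * * *  root  sync && echo 2 > /proc/sys/vm/drop_caches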

The problem is clearly related to the dcache. I've no idea what is causing it, however. It could be our bind mount system. The fancy NAS unit may also be returning invalid responses causing Linux to do the Right Thing at the Wrong Time.
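
For what it's worth, the dcache can at least be watched from userspace while this is going on. None of this changes behaviour; it's just a way to see the cache counters move (the grep pattern is loose on purpose, since the slab is named dentry or dentry_cache depending on the kernel):

# Overall dentry cache counters (nr_dentry, nr_unused, age_limit, ...)
cat /proc/sys/fs/dentry-state

# Per-slab detail, including the NFS inode cache
grep -E 'dentry|nfs_inode_cache' /proc/slabinfo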