I've moved my blog to jmcneil.net. This is no longer being updated!

Wednesday, May 28, 2008

More NFS...

I am able to reproduce the issue at will now using both a Linux NFS
server and the NAS system. We can pretty safely rule out the NAS as the
culprit (though let's keep the ticket open, in case they've seen this
before).

As I've already explained, Linux caches file names so as to avoid hitting
the disk on every lookup (the dentry cache). When there is a cache hit
(either negative or positive), the kernel calls a 'validate' function to
ensure that the cached data is still good.

On local file systems, that function is simply a no-operation as all
updates need to go through the cache. Coherency is guaranteed.

On remote file systems (NFS, GFS, etc...) that call does "something" to
validate the contents of the cache. If the backing data has changed,
that entry is invalidated and the entire look up process starts over.

If there is a cache entry, the NFS validate function checks the cached
mtime on that entry's parent and compares it with the current mtime as
reported by the NFS server. If there is a difference, the entry is
invalidated.
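
For reference, a rough way to eyeball that comparison from userspace (same hosts and path as the steps below) is to ask two clients for the parent directory's mtime. It only shows each client's cached view rather than the server's answer, but a mismatch is a useful hint:

# Compare the parent directory's mtime as seen by the client holding the
# stale negative entry and by another client.
for h in cluster-www1a cluster-sshftp1a; do
    echo -n "$h: "
    ssh root@$h "stat -c %y /home/cluster1/data/f/f/nfsuser01/"
done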

That process appears not to work correctly when dealing with negative
entries. It seems to work fine for positive entries (file *changes* show
up, just not new files in some cases).

So, after spending some time tracing how this whole process actually
works, I was able to reproduce the problem at will:

  1. Stat a file that we know is non-existent. This populates the dentry
    cache with a negative entry.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "stat /home/cluster1/data/f/f/nfsuser01/test_nofile"
    stat: cannot stat `/home/cluster1/data/f/f/nfsuser01/test_nofile': No
    such file or directory


  2. Create that file on a different server. This also updates the
    mtime on the parent directory, so the NFS validation code on the dentry
    hit ought to catch it.

    [root@cluster-util ~]# ssh root@cluster-sshftp1a
    "touch /home/cluster1/data/f/f/nfsuser01/test_nofile"


  3. Try and stat the file again. Still broken.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "stat /home/cluster1/data/f/f/nfsuser01/test_nofile"
    stat: cannot stat `/home/cluster1/data/f/f/nfsuser01/test_nofile': No
    such file or directory


  4. Read the parent directory. This doesn't actually *repopulate* the
    cache as I first thought; it invalidates the parent directory itself
    (which is what should have happened the first time through). It
    doesn't invalidate it "the same way", though.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "ls /home/cluster1/data/f/f/nfsuser01/ | wc -l"
    16


  5. And now it's present, as evidenced by the stat command below.

    [root@cluster-util ~]# ssh root@cluster-www1a
    "stat /home/cluster1/data/f/f/nfsuser01/test_nofile"
    File: `/home/cluster1/data/f/f/nfsuser01/test_nofile'
    Size: 0 Blocks: 0 IO Block: 4096 regular empty file
    Device: 15h/21d Inode: 4046108346 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2008-05-28 10:07:28.963000000 -0400
    Modify: 2008-05-28 10:07:28.963000000 -0400
    Change: 2008-05-28 10:07:28.963000000 -0400
    [root@cluster-util ~]#
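
For convenience, here are the same five steps crammed into one script (same hosts and paths as above), so the whole thing can be re-run in one shot:

#!/bin/bash
# Condensed reproduction of steps 1-5 above.
F=/home/cluster1/data/f/f/nfsuser01/test_nofile

ssh root@cluster-www1a    "stat $F"                 # 1. seeds a negative dentry
ssh root@cluster-sshftp1a "touch $F"                # 2. create it from another client
ssh root@cluster-www1a    "stat $F"                 # 3. still ENOENT when the bug hits
ssh root@cluster-www1a    "ls ${F%/*} > /dev/null"  # 4. readdir() on the parent
ssh root@cluster-www1a    "stat $F"                 # 5. now the file shows up
ssh root@cluster-sshftp1a "rm -f $F"                # clean up so it can be re-run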


The next step is to load up a debug kernel or stick a SystemTap script in there to try to determine why that negative entry isn't being invalidated as it should be.
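
A cheaper first pass, before going the debug kernel route, might be the NFS client's built-in debug logging (assuming the kernel exposes /proc/sys/sunrpc/nfs_debug, and I haven't verified how much of the lookup/revalidate path it actually reports on this kernel):

# On the affected client: turn on the NFS debug flags, poke the path, read
# the kernel log, then turn the flags back off.
echo 32767 > /proc/sys/sunrpc/nfs_debug     # enable all NFSDBG_* flags
stat /home/cluster1/data/f/f/nfsuser01/test_nofile
dmesg | tail -n 50
echo 0 > /proc/sys/sunrpc/nfs_debug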

Friday, May 23, 2008

Linux Caching Issues

Recently, we started seeing issues where a file would exist on a couple of the NFS clients, but not the others. The front-end Apache instance would return different results depending on which cluster node our load balancers directed us to. In some cases, we'd receive 404 errors. In other cases, we'd get the actual content. In a third scenario, we would get the default Red Hat index page (as the NFS client couldn't access the *real* DirectoryIndex page).

At first, I thought it was an Apache problem as we run some custom modules which handle URL translation.

However, another issue popped up with a PHP script. That PHP script was attempting to include another as a function library. In this case, that include was failing with an ENOENT on some systems, but not on others. Clearly this wasn't an Apache problem.

I've yet to be able to reproduce the problem, but I've a few existing instances to test with.

Given a situation where we're getting a 404 half of the time, I ran the following test against our NFS clients:


# for i in www1a www1b www1c www1d ; do echo $i; ssh cluster-$i-mgmt "stat /home/cluster1/data/s/c/user/html/index.htm"; done
www1a
stat: cannot stat `/home/cluster1/data/s/c/user/html/index.htm': No such file or directory
www1b
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
www1c
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
www1d
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400

When logging in to 'www1a', it's just not possible to read the file directly. I can't cat it, stat it, or ls it. However, once I step into the containing directory (html, in this case) and run an 'ls', the file is now available. It looks as though the readdir() triggered by my ls command updates the cache.
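
In other words, the manual workaround boils down to this (same path as above):

# On the affected client: reading the directory clears the stale negative entry.
cd /home/cluster1/data/s/c/user/html
stat index.htm     # fails with "No such file or directory"
ls > /dev/null     # readdir() on the parent directory
stat index.htm     # now succeeds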

So, my assumption at this point is that it's a directory cache problem. For some reason, we're getting negative entries (NULL inode structure pointers). I've no idea why.

The problem came up again this morning. As it turns out, Linux 2.6.16 allows users to dump various caches in order to free the memory being used. I tried it out on one of the systems experiencing the problem:


[root@cluster-www1c vm]# ls
block_dump                 drop_caches           max_map_count       overcommit_ratio          swappiness
dirty_background_ratio     hugetlb_shm_group     min_free_kbytes     pagecache                 swap_token_timeout
dirty_expire_centisecs     laptop_mode           nr_hugepages        page-cluster              vdso_enabled
dirty_ratio                legacy_va_layout      nr_pdflush_threads  panic_on_oom              vfs_cache_pressure
dirty_writeback_centisecs  lowmem_reserve_ratio  overcommit_memory   percpu_pagelist_fraction
[root@cluster-www1c vm]# stat /home/cluster1/data/s/c/user/html/index.htm
stat: cannot stat `/home/cluster1/data/s/c/user/html/index.htm': No such file or directory
[root@cluster-www1c vm]# sync
[root@cluster-www1c vm]# sync
[root@cluster-www1c vm]# echo '2' > drop_caches
[root@cluster-www1c vm]# !st
stat /home/cluster1/data/s/c/user/html/index.htm
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
[root@cluster-www1c vm]#


So, we've got a bit of a workaround, at least for the long weekend. I'll probably wind up setting up a cron job to dump the cache every half hour or so in order to avoid phone calls.
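
Something along these lines should do it (the cron.d file name is just a placeholder; '2' frees dentries and inodes without touching the page cache):

# /etc/cron.d/drop-dcache -- placeholder name
*/30 * * * *  root  sync && echo 2 > /proc/sys/vm/drop_caches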

The problem is clearly related to the dcache. I've no idea what is causing it, however. It could be our bind mount system. The fancy NAS unit may also be returning invalid responses causing Linux to do the Right Thing at the Wrong Time.
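
For what it's worth, the dcache can at least be watched from userspace while this is going on. None of this changes behaviour; it's just a way to see the cache counters move (the grep pattern is loose on purpose, since the slab is named dentry or dentry_cache depending on the kernel):

# Overall dentry cache counters (nr_dentry, nr_unused, age_limit, ...)
cat /proc/sys/fs/dentry-state

# Per-slab detail, including the NFS inode cache
grep -E 'dentry|nfs_inode_cache' /proc/slabinfo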