I've moved my blog to jmcneil.net. This is no longer being updated!

Showing posts with label linux. Show all posts

Monday, February 23, 2009

I keep killing my X server...

I use GNU Emacs to edit most everything. Python, text files, C code, shell scripts... I even run all of my shell windows as 'ansi-term' buffers. Who needs a mouse?

That said, I have a terrible habit of smacking CTRL+ALT+BACKSPACE while attempting to hit CTRL+BACKSPACE. Yeah, I know. One kills the previous word and one kills the X server...

I've never really bothered to plug that up. This morning, I did it again. I did some digging:

Section "ServerFlags"
Option "DontZap" "yes"
EndSection

There. Now I can hit the wrong combination all day long and I won't bounce my X session. What a helpful little config option.

Thursday, January 15, 2009

Learning Xen

I've been diving into Xen over the past week or so. In an effort to learn how it really works, I decided to set up a new VM without using the virt-install utility that Red Hat documents. A lot of this I learned from http://wiki.xensource.com. Hopefully someone else finds this useful as well.

I'm doing this on Red Hat Enterprise Linux 5.2. My machine currently has 2GB of RAM and about 20GB of free space under /var, which is where I'll stick my disk image. I've a dual-core CPU, but it doesn't appear to have virtualization extensions available.

1. Setting up the host system.

This is a simple process. All of the Xen packages are available via Yum and are part of the base entitlement.

xenhost# yum install xen kernel-xen
xenhost# yum install virt-manager libvirt libvirt-python \
python-virtinst

The kernel-xen package updates the /etc/grub.conf file, but doesn't set the Xen kernel to boot by default. On my system, that meant setting the default kernel to '0' as opposed to '1', but the numbering will probably differ on yours. Then simply reboot.
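If you'd rather not count stanzas by eye, here is a minimal sketch that finds the index to put in the default= line. It assumes a GRUB legacy style grub.conf where entries are numbered from 0 in the order their title lines appear, and where the Xen entry's kernel line points at the hypervisor image. The function name and the sample text are made up for illustration and are not taken from my actual grub.conf.

```python
def find_xen_entry(grub_conf_text):
    """Return the 0-based index of the first entry whose kernel line
    mentions xen, or None if there is no such entry."""
    index = -1
    for line in grub_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("title"):
            # Each title line starts a new boot entry.
            index += 1
        elif index >= 0 and stripped.startswith("kernel") and "xen" in stripped:
            return index
    return None


# Illustrative sample only; a real file will have different versions and paths.
SAMPLE = """\
default=1
timeout=5
title Red Hat Enterprise Linux Server (2.6.18-92.1.22.el5xen)
    root (hd0,0)
    kernel /xen.gz-2.6.18-92.1.22.el5
    module /vmlinuz-2.6.18-92.1.22.el5xen ro root=LABEL=/
    module /initrd-2.6.18-92.1.22.el5xen.img
title Red Hat Enterprise Linux Server (2.6.18-92.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-92.el5 ro root=LABEL=/
    initrd /initrd-2.6.18-92.el5.img
"""

print(find_xen_entry(SAMPLE))
```

Whatever index it reports is the value that would go after default= in /etc/grub.conf.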

2. Creating Disk Images

Xen supports a few different block device types. It's possible to directly attach physical devices, use plain files, or use NBD devices. It's even possible to set up a copy-on-write configuration, which is probably very useful when testing installations that require rolling back. In this example, I'm going to use the "blktap" driver and a disk image.

The disk image itself is nothing but a dump of /dev/zero.

[root@xenhost images]# dd if=/dev/zero bs=1024 \
count=1500000 of=example.dsk
1500000+0 records in
1500000+0 records out
1536000000 bytes (1.5 GB) copied, 32.68 seconds, 47.0 MB/s

There. That gives us about 1.5GB of space, minus FS overhead. That ought to be more than enough to hold a minimal Red Hat Linux installation. The next step is to create a filesystem.

[root@xenhost images]# mkfs -t ext3 -j example.dsk
mke2fs 1.39 (29-May-2006)
example.dsk is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
187776 inodes, 375000 blocks
18750 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=385875968
12 block groups
32768 blocks per group, 32768 fragments per group
15648 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@xenhost images]# tune2fs -c0 -i0 ./example.dsk
tune2fs 1.39 (29-May-2006)
Setting maximal mount count to -1
Setting interval between checks to 0 seconds
[root@xenhost images]#

We'll also need swap space.

[root@xenhost images]# dd if=/dev/zero bs=1024 \
count=256000 of=example-swap.dsk
256000+0 records in
256000+0 records out
262144000 bytes (262 MB) copied, 2.82531 seconds, 92.8 MB/s
[root@xenhost images]# mkswap ./example-swap.dsk
Setting up swapspace version 1, size = 262139 kB
[root@xenhost images]#
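The byte counts dd reports above are just block size times block count. Here is a small sketch of the same arithmetic, plus a pure-Python stand-in for the dd invocation. The function names are my own, and create_image is only meant to illustrate what dd is doing, not to replace it.

```python
import os


def image_size(block_size, count):
    """Size in bytes of an image written as `count` blocks of `block_size`."""
    return block_size * count


def create_image(path, block_size=1024, count=1500000):
    """Write `count` zero-filled blocks to `path`, much like
    dd if=/dev/zero bs=<block_size> count=<count> of=<path>."""
    chunk = b"\0" * block_size
    with open(path, "wb") as image:
        for _ in range(count):
            image.write(chunk)
    return os.path.getsize(path)


root_bytes = image_size(1024, 1500000)
swap_bytes = image_size(1024, 256000)
print("root image: %d bytes (%.2f GiB)" % (root_bytes, root_bytes / 2.0 ** 30))
print("swap image: %d bytes (%.0f MiB)" % (swap_bytes, swap_bytes / 2.0 ** 20))
```

The two totals match the "bytes copied" lines in the transcripts: 1536000000 for the root image and 262144000 for swap.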

Now we need to mount the image up and install a base version of Linux. The common method of doing this is to use a loopback device (losetup and friends) and mount the image as a file system. I'm not going to do it that way. Remember, as we rebooted under the Xen kernel, we're running in the context of Domain0. It's possible to use the Xen tools to make this file system available as we would to any guest domain. The only trick? We specify '0' as our domain ID. This also helped to get me familiar with the Xen utilities.

[root@xenhost images]# modprobe xenblk
[root@xenhost images]# xm block-attach 0 \
tap:aio:/var/lib/xen/images/example.dsk xvda1 w
[root@xenhost images]# ls -al /dev/xvda1
brw-r----- 1 root disk 202, 1 Jan 15 16:46 /dev/xvda1

Lots going on there! So, in English: attach the block device located at /var/lib/xen/images/example.dsk to domain0 as /dev/xvda1. It should be writeable. Now, we can mount that file system up just like we would any other. No need for a loopback device.

[root@xenhost images]# mount /dev/xvda1 /mnt
[root@xenhost images]# ls /mnt
lost+found
[root@xenhost images]# mount
...
/dev/xvda1 on /mnt type ext3 (rw)

It would have been possible to specify "file://var/lib.." as opposed to "tap:aio", but from what I understand, blktap is the preferred mechanism, as the consistency of the guest OS isn't at the mercy of the host buffer cache contents (power outage, anyone?).

3. Install the Guest OS.

There are a few ways to do this, but the net result has to be the same: OS files need to make it to this FS. You can do this via Yum and the --installroot option, cp -r, or RPM and chroot. In my opinion, the easiest method is to use Yum.

[root@xenhost images]# mkdir -p /mnt/var/lib/yum
[root@xenhost images]# yum --installroot=/mnt groupinstall Base
...read repos...
Transaction Summary
=============================================================================
Install 333 Package(s)
Update 0 Package(s)
Remove 0 Package(s)

Total download size: 188 M
Is this ok [y/N]: y
...download and run transaction...
Complete!
[root@xenhost images]# yum install --installroot=/mnt -y kernel-xen
...
Complete!

Almost there. Now we need to chroot into the guest OS and configure a few things. All of this could easily be automated with a script, and probably should be if more than a couple of domU virtuals are set up.

[root@xenhost var]# chroot /mnt
bash-3.2# authconfig --useshadow --update
bash-3.2# passwd root
Changing password for user root.
New UNIX password:
BAD PASSWORD: it is based on a dictionary word
Retype new UNIX password:
passwd: all authentication tokens updated successfully.
bash-3.2# echo "127.0.0.1 localhost" > /etc/hosts
bash-3.2# cd /root && cp /etc/skel/.* .

Also, it's necessary to update the /etc/modprobe.conf on the guest to include the Xen drivers.

alias eth0 xennet
alias scsi_hostadapter xenblk

Lastly, we need an /etc/fstab that matches our Xen configuration. I've used the following in this example.

/dev/xvda1 / ext3 defaults 1 1
/dev/xvda2 none swap sw 0 0
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0
none /sys sysfs defaults 0 0
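Since this step is a good candidate for scripting, here is a minimal sketch that writes the two files shown above into a mounted guest root. The helper names are hypothetical, the column widths are arbitrary, and it assumes the guest image is mounted somewhere like /mnt when you call it.

```python
import os

# Same rows as the fstab above: device, mount point, type, options, dump, pass.
FSTAB_ROWS = [
    ("/dev/xvda1", "/", "ext3", "defaults", 1, 1),
    ("/dev/xvda2", "none", "swap", "sw", 0, 0),
    ("none", "/dev/pts", "devpts", "gid=5,mode=620", 0, 0),
    ("none", "/dev/shm", "tmpfs", "defaults", 0, 0),
    ("none", "/proc", "proc", "defaults", 0, 0),
    ("none", "/sys", "sysfs", "defaults", 0, 0),
]

MODPROBE_LINES = ["alias eth0 xennet", "alias scsi_hostadapter xenblk"]


def render_fstab(rows):
    """Format fstab rows as whitespace-aligned lines."""
    return "".join("%-12s %-10s %-8s %-18s %d %d\n" % row for row in rows)


def write_guest_files(guest_root):
    """Write etc/fstab and append the Xen aliases to etc/modprobe.conf
    underneath guest_root (for example, the /mnt mount point)."""
    etc = os.path.join(guest_root, "etc")
    os.makedirs(etc, exist_ok=True)
    with open(os.path.join(etc, "fstab"), "w") as fstab:
        fstab.write(render_fstab(FSTAB_ROWS))
    with open(os.path.join(etc, "modprobe.conf"), "a") as modprobe:
        modprobe.write("\n".join(MODPROBE_LINES) + "\n")
```

Calling write_guest_files("/mnt") before the umount would cover both edits in one go.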

Now we have a working, though limited, install. I didn't bother to set up networking or anything just yet; that's fairly textbook once the instance is running. We need to unmount the new OS directory and block-detach the xvda1 device.

[root@xenhost var]# umount /mnt
[root@xenhost var]# xm block-detach 0 xvda1

4. Building up a Xen Configuration File.

The next step is to create a working domain configuration file under /etc/xen. There are a couple of pieces of data we'll need to generate first. Both the MAC and the UUID need to be unique across systems. To do this, I put together a small Python script.

#!/usr/bin/python

import virtinst.util

print "New UUID: %s" %\
virtinst.util.uuidToString(virtinst.util.randomUUID())
print "New MAC: %s" % virtinst.util.randomMAC()

Running the above script outputs the following:

New UUID: 65ffda11-5fef-0876-23a4-76839888b36b
New MAC: 00:16:3e:22:15:51
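If virtinst isn't installed, something similar can be sketched with only the standard library. This is not what virtinst does internally as far as I can confirm; it simply assumes the 00:16:3e prefix seen in the output above (the OUI assigned to Xen) and, as a further assumption on my part, keeps the fourth octet at 0x7f or below. The function name is my own.

```python
import random
import uuid


def random_xen_mac(rng=random):
    """Return a random MAC address under the 00:16:3e prefix."""
    octets = [
        0x00, 0x16, 0x3E,
        rng.randint(0x00, 0x7F),  # assumption: keep the high bit clear
        rng.randint(0x00, 0xFF),
        rng.randint(0x00, 0xFF),
    ]
    return ":".join("%02x" % octet for octet in octets)


print("New UUID: %s" % uuid.uuid4())
print("New MAC: %s" % random_xen_mac())
```

Random values only make collisions unlikely, not impossible, so it's still worth checking new values against the guests you already have.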

So, now it's possible to put a Xen configuration together. Note the MAC and the UUID from the above script are used.

name = "example"
uuid = "65ffda11-5fef-0876-23a4-76839888b36b"
memory = 128
vcpus = 16 # Why not? ;-)
kernel = "/boot/vmlinuz-2.6.18-92.1.22.el5xen"
ramdisk = "/boot/initrd-2.6.18-92.1.22.el5xen-no-scsi.img"
disk = [ "tap:aio://var/lib/xen/images/example.dsk,xvda1,w",
"tap:aio://var/lib/xen/images/example-swap.dsk,xvda2,w"]
root= "/dev/xvda1 ro"
vif = ["mac=00:16:3e:22:15:51,bridge=xenbr0,script=vif-bridge" ]

The configuration is pretty straightforward. The vif line creates an eth0 device within the guest that's part of the xenbr0 bridge. This makes the new VM accessible via the same network that the host resides on. Configure that interface as you would a normal, physical device.

5. Start up the Virtual

This is the easy part. Simply run 'xm create example'. If everything was done correctly, the new virtual ought to start up. To watch the machine boot and log in, simply type 'xm console example'.

[root@xenhost xen]# xm console example

Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Kernel 2.6.18-92.1.22.el5xen on an i686

example login: root
Password:
Last login: Thu Jan 15 20:08:25 on xvc0
[root@example ~]# uname -a
Linux example 2.6.18-92.1.22.el5xen #1 SMP Fri Dec 5 10:29:16 EST 2008 i686 i686 i386 GNU/Linux
[root@example ~]# cat /proc/cpuinfo | grep processor | wc -l
16
[root@example ~]#

6. Configurations are Python!

The Xen configuration files not only look like Python, they *are* Python. This makes the entire configuration process extremely flexible. For (an extremely useless) example:

[root@xenhost xen]# cat example
name = "example"
uuid = "65ffda11-5fef-0876-23a4-76839888b36b"
memory = 128
vcpus = 16
kernel = "/boot/vmlinuz-2.6.18-92.1.22.el5xen"
ramdisk = "/boot/initrd-2.6.18-92.1.22.el5xen-no-scsi.img"
disk = [ "tap:aio://var/lib/xen/images/example.dsk,xvda1,w",
"tap:aio://var/lib/xen/images/example-swap.dsk,xvda2,w"]
root= "/dev/xvda1 ro"
vif = ["mac=00:16:3e:22:15:51,bridge=xenbr0,script=vif-bridge" ]
#vfb = [ "type=vnc,vncdisplay=2" ]

for i in disk:
    print "device %s" % i
[root@xenhost xen]# xm create example
Using config file "./example".
device tap:aio://var/lib/xen/images/example.dsk,xvda1,w
device tap:aio://var/lib/xen/images/example-swap.dsk,xvda2,w
Started domain example
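The same property means a config can be read from your own tooling by simply executing it. Here is a minimal sketch, assuming the config text is trusted (exec runs whatever is in it) and sticking to syntax that works in both Python 2 and 3. The load_xen_config name, the images variable and the sample text are hypothetical, not anything xm provides.

```python
def load_xen_config(text):
    """Execute config text and return the variables it defines."""
    namespace = {}
    exec(text, namespace)
    # Drop __builtins__ and friends that exec adds to the namespace.
    return {k: v for k, v in namespace.items() if not k.startswith("__")}


# Hypothetical config that builds its disk list instead of spelling it out.
CONFIG = '''
name = "example"
memory = 128
images = "/var/lib/xen/images"
disk = ["tap:aio:%s/%s,%s,w" % (images, filename, device)
        for filename, device in [("example.dsk", "xvda1"),
                                 ("example-swap.dsk", "xvda2")]]
'''

config = load_xen_config(CONFIG)
for entry in config["disk"]:
    print("device %s" % entry)
```

A slightly less useless use of this would be keeping one template and deriving the name, MAC and disk paths per guest.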

I know I left off a lot of important configuration, but I know more about Xen now than I did yesterday. I think I'll take a dive into libvirt over the next day or so. About the only thing cooler than Xen virtualization is Xen virtualization in Python. I've done a lot of automated installation work in the past, and this really exposes a lot of functionality I wish I'd had back then.

Thursday, October 16, 2008

Thanks, GoogleBot!

There are very few things I'll write in C these days. I just don't have a reason to use it. If I can write it in Python, I usually will unless I have a good reason not to. One of the exceptions is an Apache module that we use to dynamically manage virtual host data (as opposed to flat files). I pull all configurations out of LDAP. I'm able to get some sick scale using a dynamic approach and I never have to restart my server.

The code has been in existence for about 7 years now in various forms. I'm not the original author, though I've probably replaced about half of it as our needs change.

At the core, there is a shared memory cache that Apache children attach to. As data is pulled out of LDAP, it's jammed into that cache. I'm also storing negative entries in order to protect against DoS situations. Data is expired after a configurable time frame. The expiration is handled by a separate daemon.
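To make the idea concrete, here is a toy sketch of a cache with negative entries and a separate expiry pass. The real thing is a C module using shared memory; this is just an in-process dictionary, and every name in it (HostCache, lookup, expire, the backend callable standing in for LDAP) is invented for illustration.

```python
import time


class HostCache:
    """Cache of host -> config, where a config of None is a negative entry."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # host -> (config or None, expiry timestamp)

    def lookup(self, host, backend):
        """Return the cached config, asking backend(host) only on a miss."""
        entry = self.entries.get(host)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        config = backend(host)  # returns None for unknown hosts
        # Unknown hosts are cached too, so repeated junk requests
        # don't each turn into a directory lookup.
        self.entries[host] = (config, self.clock() + self.ttl)
        return config

    def expire(self):
        """Remove stale entries; this is the expiry daemon's job."""
        now = self.clock()
        stale = [h for h, (_, expiry) in self.entries.items() if expiry <= now]
        for host in stale:
            del self.entries[host]
        return len(stale)
```

The negative entries are what keep a flood of requests for bogus hostnames from hammering LDAP.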

So, about a week ago, we started having issues with nodes locking up. The expiry daemon was sitting at around 100% CPU and Apache would not answer requests. An strace on the expiry process showed no system calls.

These are the fun ones. Probably stuck in a while loop due to a buffer overrun or some such problem.

Well, I stuck some debug code into the expiration process and I see the following:


expire: domain.com
expire: domain2.com
expire: big long string with lots of spaces in it.. more than 128 bytes long ending in arealdomain.com???????


The question marks are the terminal's way of saying "I don't know how to render that!" That uncovered a bug: domains over 127 chars were not NUL-terminated when added to the negative cache.
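For anyone who hasn't been bitten by this class of bug, here is a simulation of it in Python. It assumes 128-byte slots laid out back to back and a bounded copy that skips the terminator when the input fills the slot, which is one common way this happens in C. I'm not showing the module's actual code; the slot size and function names are assumptions.

```python
SLOT = 128  # assumed size of one cache slot, in bytes


def store_unsafe(memory, offset, value):
    """Bounded copy that leaves no terminator when value fills the slot."""
    data = value[:SLOT]
    memory[offset:offset + len(data)] = data
    if len(data) < SLOT:
        memory[offset + len(data)] = 0


def store_safe(memory, offset, value):
    """Bounded copy that always reserves the last byte for the terminator."""
    data = value[:SLOT - 1]
    memory[offset:offset + len(data)] = data
    memory[offset + len(data)] = 0


def read_c_string(memory, offset):
    """Read bytes up to the next zero byte, as C string functions do."""
    end = memory.index(0, offset)
    return bytes(memory[offset:end])


# Two adjacent slots; the second holds unrelated junk, then a final zero byte.
memory = bytearray(SLOT * 2 + 1)
memory[SLOT:SLOT * 2] = b"\xfe" * SLOT

store_unsafe(memory, 0, b"a" * 130)
print(len(read_c_string(memory, 0)))  # runs on into the neighbouring slot

store_safe(memory, 0, b"a" * 130)
print(len(read_c_string(memory, 0)))  # stops inside its own slot
```

In the unsafe case the reader walks straight out of its own slot and into its neighbour's bytes, which is where trailing garbage like the question marks above comes from.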

In digging further, I checked my access logs. It turns out it was GoogleBot sending that big long string of junk as a 'Host:' header. Each time GoogleBot would hit a specific site on my platform, it would pass that in. It's amazing to me that we've not had this problem before and that GoogleBot was the first agent to trigger it...

Of course, it could always be a fraudulent user agent as I forgot to check the IP ownership before I ditched the logs...
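Had I kept the logs, the ownership check could look something like the sketch below: reverse-resolve the client IP, check the hostname's domain, then forward-resolve that hostname and make sure it maps back to the same IP. The accepted suffixes are an assumption based on my recollection of Google's published guidance, the function name is mine, and the forward lookup used here only handles IPv4.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot addresses.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google hostname that
    forward-resolves back to the same ip."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

The resolver functions are parameters only so the logic can be exercised without touching DNS; in normal use you'd call is_googlebot() with just the IP from the access log.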

Friday, June 13, 2008

NFS Problems

All of my NFS problems have just gone away. We've recently updated our kernels to the latest RHES available. Looks like there was something hidden in there that fixed it for us. Wonderful.

I do realize that it's normal to see some delay, due both to attribute cache semantics and to the one-second mtime granularity. The problem I had here was quite different in that the dentry cache *never* expired negative entries.

Friday, May 23, 2008

Linux Caching Issues

Recently, we started seeing issues where a file would exist on a couple of the NFS clients, but not the others. The front-end Apache instance would return different results depending on which cluster nodes our load balancers would direct us to. In some cases, we'd receive 404 errors. In other cases, we'd get the actual content. In a third scenario, we would get the default Red Hat index page (as the NFS client couldn't access the *real* DirectoryIndex page).

At first, I thought it was an Apache problem as we run some custom modules which handle URL translation.

However, another issue popped up with a PHP script. That PHP script was attempting to include another as a function library. In this case, that include was failing with an ENOENT on some systems, but not on others. Clearly this wasn't an Apache problem.

I've yet to be able to reproduce the problem, but I've a few existing instances to test with.

Given a situation where we're getting a 404 half of the time, I ran the following test against our NFS clients:


# for i in www1a www1b www1c www1d ; do echo $i; ssh cluster-$i-mgmt "stat /home/cluster1/data/s/c/user/html/index.htm"; done
www1a
stat: cannot stat `/home/cluster1/data/s/c/user/html/index.htm': No such file or directory
www1b
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
www1c
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
www1d
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
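The same check is easy to wrap up so it can be rerun whenever the problem resurfaces. Here is a minimal sketch, assuming the cluster-<node>-mgmt host naming from the loop above and key-based ssh access. The function names are my own, and the run parameter exists only so the logic can be tried without a cluster.

```python
import subprocess


def check_nodes(nodes, path, run=None):
    """Return {node: True/False} for whether `stat path` succeeds there."""
    if run is None:
        def run(node):
            # Same command as the shell loop: ssh to the node and stat the file.
            cmd = ["ssh", "cluster-%s-mgmt" % node, "stat", path]
            return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL)
    return {node: run(node) == 0 for node in nodes}


def missing_on(results):
    """List the nodes that could not see the file."""
    return sorted(node for node, ok in results.items() if not ok)
```

For the run above, missing_on(check_nodes(["www1a", "www1b", "www1c", "www1d"], path)) would have come back with just www1a.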

When logging in to 'www1a', it's just not possible to read the file directly. I can't cat it, stat it, or ls it. However, once I step into the containing directory (html, in this case) and run an 'ls', the file is now available. Looks as though the readdir() triggered by my ls command updates the cache.

So, my assumption at this point is that it's a directory cache problem. For some reason, we're getting negative entries (NULL inode structure pointers). I've no idea why.

The problem came up again this morning. As it turns out, Linux 2.6.16 allows users to dump various caches in order to free the memory being used. I tried it out on one of the systems experiencing the problem:


[root@cluster-www1c vm]# ls
block_dump drop_caches max_map_count
overcommit_ratio swappiness
dirty_background_ratio hugetlb_shm_group min_free_kbytes
pagecache swap_token_timeout
dirty_expire_centisecs laptop_mode nr_hugepages
page-cluster vdso_enabled
dirty_ratio legacy_va_layout nr_pdflush_threads
panic_on_oom vfs_cache_pressure
dirty_writeback_centisecs lowmem_reserve_ratio overcommit_memory
percpu_pagelist_fraction
[root@cluster-www1c vm]# stat /home/cluster1/data/s/c/user/html/index.htm
stat: cannot stat `/home/cluster1/data/s/c/user/html/index.htm': No such file or directory
[root@cluster-www1c vm]# sync
[root@cluster-www1c vm]# sync
[root@cluster-www1c vm]# echo '2' > drop_caches
[root@cluster-www1c vm]# !st
stat /home/cluster1/data/s/c/user/html/index.htm
File: `/home/cluster1/data/s/c/user/html/index.htm'
Size: 18838 Blocks: 40 IO Block: 4096 regular file
Device: 15h/21d Inode: 2733856169 Links: 1
Access: (0755/-rwxr-xr-x) Uid: (15953/ user) Gid: (15953/ user)
Access: 2008-05-23 09:55:03.029000000 -0400
Modify: 2008-05-23 09:55:03.029000000 -0400
Change: 2008-05-23 09:55:03.029000000 -0400
[root@cluster-www1c vm]#


So, we've got a bit of a workaround, at least for the long weekend. I'll probably wind up setting up a cron job to dump the cache every half hour or so in order to avoid phone calls.
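The cron job itself only needs to do what I did by hand: sync, then write '2' to drop_caches, which asks the kernel to free dentries and inodes. A minimal sketch of that follows, assuming it runs as root. The function name is mine, and the control file path is a parameter only so the logic can be tried against a scratch file.

```python
import os


def drop_dentry_cache(control_file="/proc/sys/vm/drop_caches"):
    """Flush dirty data, then ask the kernel to drop dentries and inodes."""
    os.sync()
    with open(control_file, "w") as control:
        # '1' frees the page cache, '2' dentries and inodes, '3' both.
        control.write("2\n")
```

Run from cron every half hour, that would match the workaround described above. It's a blunt instrument, since it throws away the good cache entries along with the bad ones.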

The problem is clearly related to the dcache. I've no idea what is causing it, however. It could be our bind mount system. The fancy NAS unit may also be returning invalid responses causing Linux to do the Right Thing at the Wrong Time.