How to set up a redundant NFS server with DRBD and Heartbeat in CentOS 5

This tutorial will guide you through the entire process of setting up a highly available NFS server.  To proceed, you must have the following:

1. Two servers with a similar hard disk setup (these will be used to create the redundant nfs server)
2. At least one server where the nfs share will be mounted
3. Static IPs
4. Basic knowledge of vi (:q! = quit, :wq = write and then quit, i = insert mode, esc = leave insert mode, dd = delete line when not in insert mode)

First off, install CentOS on both machines.  During the install process, create a separate blank partition on both machines to be used as your nfs mount.  Set the mount point to /data during installation.

From this point on I’m going to refer to both nfs servers by their IPs and hostnames.  Server1 will be nfs1 with IP 10.132.196.221 and server2 will be nfs2 with IP 10.132.196.222.  Your private IPs might be different, so make sure to put in the correct IPs where necessary within this tutorial.

Do the following on nfs1(10.132.196.221) and nfs2(10.132.196.222):

vi /etc/fstab

This will give you the mount points and devices on your system.  Look for the /data mount point and comment it out to prevent it from being mounted automatically on boot.  Take note of the device for the /data mount point.  Here is what my fstab looks like.

/dev/VolGroup00/LogVol00 /                       ext3    defaults        1 1
#/dev/VolGroup00/LogVol02 /data                   ext3    defaults        1 2
LABEL=/boot             /boot                   ext3    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/VolGroup00/LogVol01 swap                    swap    defaults        0 0

I’m using LVM so my device names are weird.  Yours might just be /dev/sda1, /dev/sda2, etc…

Now let’s go ahead and unmount that data partition, because heartbeat will take care of mounting the partition on the live nfs server.

umount /data

Make sure that ntp and ntpdate are installed on both of the nfs servers.

yum install ntp ntpdate

The time on both servers must be identical.

Now let’s check and make sure that the nfs service is not started on boot and that selinux is turned off.

setup

Go down to “Firewall Configuration” and disable selinux and the firewall.  Next, let’s go to “System Services” and make sure that the nfs service is not enabled.  You will need to reboot your machines for these settings to take effect.

Let’s go ahead and create our exports for nfs so that the nfs share can be mounted on other machines in your network.

vi /etc/exports

Add the following line to your exports file but make sure to replace the IP.  My private IPs are in the form of 10.132.196.1. Yours may be 192.168.0.1.

/data/export/ 10.132.196.0/255.255.255.0(rw,no_root_squash,no_all_squash,sync)

The above line in my exports file will allow me to mount the nfs share anywhere within my local network.  If you only want to allow a specific machine to mount the nfs share, then use that machine's IP on its own, without the netmask.  With a 255.255.255.0 netmask the entry still matches the whole 10.132.196.x network, no matter what the last number is.  For example, here is an exports file that allows only 10.132.196.24 to mount the nfs share.

/data/export/ 10.132.196.24(rw,no_root_squash,no_all_squash,sync)
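
If you are unsure which clients a network/netmask pair in /etc/exports covers, the check can be sketched in plain shell.  This is only an illustration of the netmask arithmetic, not something nfs itself runs; the function names ip_to_int and in_export_range are made up for this example.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if client $1 falls inside network $2 with netmask $3.
in_export_range() {
    client=$(ip_to_int "$1")
    network=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( client & mask )) -eq $(( network & mask )) ]
}

# The network used in this tutorial covers 10.132.196.24 ...
if in_export_range 10.132.196.24 10.132.196.0 255.255.255.0; then
    echo "10.132.196.24 is covered"
fi
# ... but not a machine on a neighbouring network.
if ! in_export_range 10.132.197.5 10.132.196.0 255.255.255.0; then
    echo "10.132.197.5 is not covered"
fi
```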

Now we need to install DRBD and the DRBD kernel module.

yum install drbd kmod-drbd

After installing drbd we need to set up the config file for it.

vi /etc/drbd.conf

Let me show you my config file and then we’ll go over it.

common {
    protocol C;

    syncer {
        rate          15M;
        al-extents    257;
    }
}

resource r0 {

    handlers {
        pri-on-incon-degr    "halt -f";
    }

    disk {
        on-io-error    detach;
    }

    startup {
        degr-wfc-timeout    120;
    }

    on nfs1 {
        device       /dev/drbd0;
        disk         /dev/VolGroup00/LogVol02;
        address      10.132.196.221:7789;
        meta-disk    internal;
    }

    on nfs2 {
        device       /dev/drbd0;
        disk         /dev/VolGroup00/LogVol02;
        address      10.132.196.222:7789;
        meta-disk    internal;
    }
}

So let’s start from the top.

  • Protocol – This is the method that drbd will use to sync both of the nfs servers.  There are 3 available options here: Protocol A, Protocol B and Protocol C.

    Protocol A is an asynchronous replication protocol.  The drbd.org manual states, “local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has been placed in the local TCP send buffer. In the event of forced fail-over, data loss may occur. The data on the standby node is consistent after fail-over, however, the most recent updates performed prior to the crash could be lost.”

    Protocol B is a memory synchronous (semi-synchronous) replication protocol.  The drbd.org manual states, “local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Normally, no writes are lost in case of forced fail-over. However, in the event of simultaneous power failure on both nodes and concurrent, irreversible destruction of the primary’s data store, the most recent writes completed on the primary may be lost.”

    Protocol C is a synchronous replication protocol.  The drbd.org manual states, “local write operations on the primary node are considered completed only after both the local and the remote disk write have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course, inevitable even with this replication protocol if both nodes (or their storage subsystems) are irreversibly destroyed at the same time.”

    You may choose your desired protocol but Protocol C is the most commonly used one and it is the safest method.

  • rate – The rate is the maximum speed at which data will be sent from one nfs server to the other while syncing.  This should be about a third of your maximum write speed.  In my case, I have only a single disk that can write about 45 MB/sec, so a third of that would be 15 MB/sec (15M in the config).  This number will usually be much higher for people with raid setups.  In some large raid setups, the bottleneck would be the network and not the disks, so set the rate accordingly.
  • al-extents – The data on the disk is cut up into slices for synchronization purposes.  For each slice there is an al-extent that is used to indicate any changes to that slice.  Larger al-extents values make synchronization slower but benefit from fewer writes to the metadata partition.  In my case, I’m using internal metadata, which means the drbd metadata is written to the same partition that my nfs data is on.  It would benefit me to have fewer metadata writes to prevent the disk arm from constantly moving back and forth and degrading performance.  If you are using a raid setup and a separate partition for the metadata, then set this number lower to benefit from faster synchronization.  This number MUST be a prime to gain the most possible performance because it is used in specific hashes that benefit from prime number sized structures.
  • pri-on-incon-degr – The “halt -f” command is executed if the node is primary, degraded and if the data is inconsistent.  I use this to make sure drbd is halted when there is some sort of data inconsistency to prevent a major mess from occurring.
  • on-io-error – This allows you to handle low level I/O errors.  The method I use is the “detach” method.  This is the recommended option by drbd.org. On the occurrence of a lower-level I/O error, the node drops its backing device, and continues in diskless mode.
  • degr-wfc-timeout – This is the amount of time in seconds that is allowed before a connection is timed out.  In case a degraded cluster (cluster with only one node left) is rebooted, this timeout value is used instead of wfc-timeout, because the peer is less likely to show up in time, if it had been dead before.
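
The “a third of your write speed” rule for the rate setting is simple enough to sketch as a shell helper.  The function name suggest_sync_rate is made up for this example, and the throughput you pass in is whatever you measured on your own disks (a whole number of MB/sec); treat the result as a starting point, not a tuned value.

```shell
# suggest_sync_rate MBPS: print a syncer rate equal to one third of the
# measured write throughput, in the "15M" form used in drbd.conf.
suggest_sync_rate() {
    write_mb_per_sec=$1
    # shell arithmetic is integer-only, so the result is rounded down
    echo "$(( write_mb_per_sec / 3 ))M"
}

suggest_sync_rate 45    # the single-disk figure from this tutorial, prints 15M
```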

The rest of the config is pretty self-explanatory.  Replace nfs1 and nfs2 with the hostnames of your nfs servers.  To get the hostnames use the following command on both servers:

uname -n

Then replace the disk value with the device name from your fstab file that you commented out.  Enter the IP address of each server and use port 7789.  The last part is the meta-disk.  I used an internal meta-disk because I only have one hard disk in the server and it would not give me any benefit to create a separate partition for the metadata.  If you have a raid setup or a separate disk from your data partition that you can use for the metadata, then go ahead and create a 150 MB partition.  Replace the word “internal” in the config file with the device name that you used for the metadata partition.

Now that we finally have our drbd.conf file ready we can move on.  Let’s go ahead and load the drbd kernel module.

modprobe drbd

Now that the kernel module is loaded, let’s start up drbd.

drbdadm up all

This will start drbd.  Now let’s check its status.

cat /proc/drbd

You can always use the above command to check the status of drbd.  The above command should show you something like this.

0: cs:Connected st:Secondary/Secondary ld:Inconsistent
ns:0 nr:0 dw:0 dr:0 al:0 bm:1548 lo:0 pe:0 ua:0 ap:0
1: cs:Unconfigured

You should get some more data before it but the above part is what we are interested in.  If you notice it shows that drbd is connected and both nodes are in secondary mode.  This is because we have not assigned which node is going to be the primary yet.  It also says the data is inconsistent because we have not done the initial sync yet.
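
If you want to pull those three values out in a script, the following is a small sketch.  The function name drbd_field is made up for this example, and it assumes the status line looks like the one shown above; the field names differ between drbd versions, so check your own /proc/drbd output before relying on it.

```shell
# drbd_field NAME [FILE]: print the value of NAME (cs, st or ld) for device 0.
# FILE defaults to /proc/drbd; pass another file to test against saved output.
drbd_field() {
    awk -v name="$1" '
        $1 == "0:" {
            for (i = 2; i <= NF; i++) {
                # a field looks like "cs:Connected"; print the part after "cs:"
                if (index($i, name ":") == 1) {
                    print substr($i, length(name) + 2)
                    exit
                }
            }
        }' "${2:-/proc/drbd}"
}

# Demonstrate against a saved copy of the status line.
sample=$(mktemp)
printf '%s\n' ' 0: cs:Connected st:Secondary/Secondary ld:Inconsistent' > "$sample"
drbd_field cs "$sample"
drbd_field st "$sample"
drbd_field ld "$sample"
rm -f "$sample"
```

On a real node you would call it without the file argument, for example drbd_field cs.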

I am going to set nfs1 to be my primary node and nfs2 to be my secondary node.  If nfs1 fails, nfs2 will take over, but if nfs1 comes back online then all the data from nfs2 will be synced back to nfs1 and nfs1 will take over again.

First of all, let’s go ahead and wipe the file system that was created on the /data partition during our initial OS installation.  Be very careful with the command below.  Make sure to use the appropriate device because the data on that device will be lost.

dd if=/dev/zero bs=1M count=1 of=/dev/VolGroup00/LogVol02; sync

Replace “/dev/VolGroup00/LogVol02” with your device for the /data partition.  The command overwrites the first megabyte of the device, which is enough to destroy the old file system.  Now that the partition has been wiped on both servers, let’s create the metadata.

drbdadm create-md r0

Do the following ONLY on nfs1(10.132.196.221):

Now that the metadata is created, we can move onto assigning a primary node and conducting the initial sync.  It is absolutely important that you only execute the following command on the primary node.  It doesn’t matter which node you choose to be the primary since they should be identical.  In my case, I decided to use nfs1 as the primary.

drbdadm -- --overwrite-data-of-peer primary r0

Ok, now we just have to sit back and wait for the initial sync to finish.  This is going to take some time even though there is no data on either device, because drbd has to sync every single block on the /data partition from nfs1 to nfs2.  You can check the status by using the following command.

cat /proc/drbd

Do the following on nfs1(10.132.196.221) and nfs2(10.132.196.222):

After the initial sync is finished, “cat /proc/drbd” should show something like this.

0: cs:Connected st:Primary/Secondary ld:Consistent
ns:37139 nr:0 dw:0 dr:49035 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
1: cs:Unconfigured

If you notice, we are still connected and have a primary and secondary node with consistent data.

Do the following ONLY on nfs1(10.132.196.221):

Now let’s make an ext3 file system on our drbd device and mount it.  Since drbd is running, the ext3 file system will also be created on the secondary node.

mkfs.ext3 /dev/drbd0

The above command will create an ext3 file system on the drbd device.  Now let’s go ahead and mount it.

mount -t ext3 /dev/drbd0 /data

NFS stores important information in /var/lib/nfs that it needs in order to function properly.  To preserve file locks and other such information, we need to have that data stored on the drbd device so that if the primary node fails, NFS on the secondary node will continue from right where the primary node left off.

mv /var/lib/nfs/ /data/
ln -s /data/nfs/ /var/lib/nfs
mkdir /data/export
umount /data

So let’s go over what we just did.  We moved the nfs folder from /var/lib to /data.  Then we created a symbolic link from /var/lib/nfs to /data/nfs since the operating system is still going to look for /var/lib/nfs when nfs is running.  Then we created an export directory in /data to store all the actual data that we are going to use for our nfs share.  Finally, we unmounted the /data partition since we finished what we were doing.

Do the following ONLY on nfs2(10.132.196.222):

Since we moved the nfs folder to /data, that was synced over to the secondary node as well.  We just need to create the symbolic link so that when the /data partition is mounted on nfs2 we have a link to the nfs data.

rm -rf /var/lib/nfs/
ln -s /data/nfs/ /var/lib/nfs

So we removed the nfs folder and created a symbolic link from /var/lib/nfs to /data/nfs.  The symbolic link will be broken for now since the /data partition is not mounted.  Don’t worry about that, because in the event of a failover that partition will be mounted and everything will work just fine =).

Do the following on nfs1(10.132.196.221) and nfs2(10.132.196.222):

Now onto heartbeat.  Heartbeat is going to make sure partitions are unmounted/mounted and services are stopped/started in the event of a failover.  So let’s get to it.

yum install heartbeat

Ok, now that we have heartbeat installed, let’s go ahead and create our 3 necessary config files.

vi /etc/ha.d/ha.cf

Paste the following data into ha.cf and save it (:wq).

logfacility     local0
keepalive 2
deadtime 10
bcast   eth0
node nfs1 nfs2

Replace nfs1 and nfs2 with your server hostnames.  You can retrieve the hostname for each server by executing the following command.

uname -n

Now let’s create our resource config file.

vi /etc/ha.d/haresources

Put the following data in there and save it.

nfs1  IPaddr::10.132.196.220/24/eth0 drbddisk::r0 Filesystem::/dev/drbd0::/data::ext3 nfslock nfs

The first word is the hostname of the primary server, and the line should be identical on both servers.  I have chosen nfs1 to be my primary server.  The next part is the virtual IP.  This is the virtual IP we are going to use for the live nfs server, whether it be nfs1 or nfs2.  This IP can be any IP that is not being used within your network.  For example, if your nfs servers have IPs 192.168.0.11 and 192.168.0.12 then maybe you can use 192.168.0.10 as your virtual IP.  It’s up to you.
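
Heartbeat brings the resources on this line up from left to right and takes them down in the reverse order, which is why the virtual IP and the drbd device come before the file system and the nfs services.  The sketch below just prints both orders for a given line; the function names start_order and stop_order are made up for this example and are not part of heartbeat.

```shell
# start_order LINE: print the resources of a haresources line, one per line,
# in the order they are started.  The first word (the node name) is skipped.
start_order() {
    set -- $1        # split the line on whitespace
    shift            # drop the node name
    for res in "$@"; do
        echo "$res"
    done
}

# stop_order LINE: the same resources in reverse, i.e. the stop order.
stop_order() {
    start_order "$1" | awk '{ lines[NR] = $0 } END { for (i = NR; i >= 1; i--) print lines[i] }'
}

line='nfs1  IPaddr::10.132.196.220/24/eth0 drbddisk::r0 Filesystem::/dev/drbd0::/data::ext3 nfslock nfs'
echo "start order:"
start_order "$line"
echo "stop order:"
stop_order "$line"
```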

Finally, let’s create our authentication file.

vi /etc/ha.d/authkeys

Put the following data in that file and save it.

auth 3
3 md5 somepassword12345

Replace “somepassword12345” with your own password.  This will be used by both of the heartbeat daemons on nfs1 and nfs2 to authenticate each other.  The file should be readable and writable only by root, so let’s go ahead and do that.

chmod 600 /etc/ha.d/authkeys
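
If you would rather not invent a password by hand, a random key can be generated instead.  The following is a sketch under a few assumptions: the function name write_authkeys is made up for this example, it keeps the “auth 3” / md5 layout used above, and it writes to a scratch file so nothing under /etc/ha.d is touched until you copy the result there yourself.

```shell
# write_authkeys FILE: write a heartbeat authkeys file with a random md5 key.
write_authkeys() {
    # 16 random bytes from the kernel, printed as 32 hex characters
    key=$(dd if=/dev/urandom bs=16 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
    # create the file with owner-only permissions from the start
    ( umask 077 && printf 'auth 3\n3 md5 %s\n' "$key" > "$1" )
    chmod 600 "$1"
}

# Write to a scratch file first and look at the result.
scratch=$(mktemp)
write_authkeys "$scratch"
cat "$scratch"
rm -f "$scratch"
```

The same file has to end up on both nfs1 and nfs2, so generate it once and copy it over rather than running the function on each server.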

That’s it!  Let’s just start drbd and heartbeat on both servers now.

/etc/init.d/drbd start
/etc/init.d/heartbeat start

Now we have a redundant NFS server running!  Let’s do a couple of tests on the primary nfs server.

ifconfig

We should see our virtual IP address show up.  Mine looks like this.

eth0      Link encap:Ethernet  HWaddr 00:14:22:7C:65:6B
inet addr:10.132.196.221  Bcast:10.132.196.255  Mask:255.255.255.0
inet6 addr: fe80::214:22ff:fe7c:656b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:897038670 errors:0 dropped:367 overruns:0 frame:0
TX packets:1204564630 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:245125213396 (228.2 GiB)  TX bytes:1194659917566 (1.0 TiB)
Interrupt:169

eth0:0    Link encap:Ethernet  HWaddr 00:14:22:7C:65:6B
inet addr:10.132.196.220  Bcast:10.132.196.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
Interrupt:169

lo        Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:372 errors:0 dropped:0 overruns:0 frame:0
TX packets:372 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:50327 (49.1 KiB)  TX bytes:50327 (49.1 KiB)

If you notice, eth0:0 has the virtual IP that I used for the nfs servers.  You should only see this on the nfs server that is live.

Now let’s check our partitions.

df -h

The primary nfs server should show the /data mounted while the secondary nfs should not show the drbd device mounted.  My primary nfs server looks like this.

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
15G  2.5G   11G  19% /
/dev/sda1              99M   19M   76M  20% /boot
tmpfs                 4.0G     0  4.0G   0% /dev/shm
/dev/drbd0            259G  172G   75G  70% /data

Now let’s go ahead and mount the nfs share on another server.  First let’s create the /data folder on the new server.  This doesn’t have to be named “data” and can be named whatever you like.

mkdir /data

Now let’s set up our nfs mount.

vi /etc/fstab

At the end of the file, add the following line.

10.132.196.220:/data/export  /data    nfs          rw            0    0

Replace the IP address with the virtual IP address that you chose for your nfs servers.  Now let’s mount the partition for the first time.

mount /data

That’s it!  Now you have a redundant nfs server and a client actually using it.  To simulate a failover, let’s test some stuff out.  Go ahead and create some files in the “/data” folder from your nfs client machine.

cd /data
touch testfile1.txt
mkdir testdirectory

Now that we have some data in the “/data” folder we can simulate a failed nfs server.  If we did everything right, the data that we just created was written on the primary nfs server and synced to the secondary nfs server via drbd.  Let’s stop heartbeat on nfs1 so that nfs2 thinks that nfs1 has failed.

/etc/init.d/heartbeat stop

Now that heartbeat is stopped on nfs1 run the following commands to make sure that the /data partition was unmounted and the virtual IP is gone.

df -h
ifconfig

When you check the same thing on nfs2, you should see that the /data partition has been mounted and the virtual IP is now live.

Now if you go back to your nfs client machine and do an “ls” in the /data directory, you should see that your data is still there.  Let’s change our test data around.

cd /data
mv testfile1.txt testfile2.txt
rm -rf testdirectory

Now let’s go back to nfs1 and start up heartbeat again.

/etc/init.d/heartbeat start

Give it a couple seconds and check the partition and virtual IP.

df -h
ifconfig

You should see that the partition is mounted again and the virtual IP is also live on nfs1.  If you check the same thing on nfs2, the partition will be un-mounted and the virtual IP should be gone.  If you check the /data directory on your nfs client machine, you should see that the “testfile2.txt” file is still there.

Congratulations, you have a fully functional and highly available nfs server!  Check out http://www.drbd.org for more information on DRBD.

3 Responses to “ How to setup a redundant NFS server with DRBD and Heartbeat in CentOS 5 ”

  1. Brandon says:

    Confused on this line:

    drbdadm — –overwrite-data-of-peer primary r0

    What’s with the dashes? Every combination of dashes and spaces fails, and the help file mentions nothing about this command.

  2. admin says:

    its actually double dashes. Wordpress autoformats the dashes to one big dash for some reason.

    the command is

    drbdadm [DASH][DASH] [DASH][DASH]overwrite-data-of-peer primary r0

  3. lxdorney says:

    Great Tutorial, soon I will setup mine to our video server

    thanks