
Now that the web server is running and the database is being backed up, it is time to create a failover web server. I had grandiose dreams back in 2006, when my main web server took a dive and I needed to buy a new one. Why not get 2 identical servers and run them in parallel? Because I don’t have that kind of money, that’s why! So along came this tiger catalog (I won’t say their full name…). Look, I can get 2 computers for the price of one Dell!!! BIG MISTAKE!!! One system went up w/o a problem, but the other has had a new power supply, memory, MB and CPU and still sits in my closet. I believe the MB is giving another error code. I will never buy an ABIT MB again.
After investigating how to create failover systems, I came upon the 2 articles below, which are good foundations for the principle. Linux-HA is a great website, but it reads like a dictionary, not a how-to site, so I went looking for a few articles that would guide me through the process. Here are a few:
Apache failover with heartbeat and mon
Welcome to this mini howto on how to install a two node Apache failover using heartbeat and mon.
This is probably the base setup you want to start with.
I’ll cover a base Linux installation, but it’s mostly the same on other UNIX distributions.
I’ve left out areas such as NFS mounted web roots and IP based heartbeat, for simplicity.
Introduction
Failover means you have at least two nodes: a server (master) and a hot standby server (slave), which takes over the resources when the master fails.
Only one node is active at a time. The active machine will set up a virtual IP address, which the clients connect to. They won’t know which server they are actually talking to.
In this example, Apache is served by the active node.
Heartbeat provides the backbone for making sure that the nodes know who’s online. Only one can be active at a time. If you want load balancing, see the Resources section.
Heartbeat will monitor that the two machines can talk to each other via serial cable and/or network cables.
To achieve resource monitoring, i.e. checking that Apache runs and answers requests, you need another package: Mon.
A common misunderstanding is that heartbeat will do the resource monitoring for you.
My experience is that the software or network connections are more likely to crash than the machine itself.
Goal
During normal operation, the master (active) owns the resources and the IP address.
In a failover scenario, the slave will take over the master’s resources:
it starts the same processes (Apache in our case) and configures the network card to use the same virtual IP address (172.17.10.30).
Heartbeat runs on both machines, but Apache and Mon are only running on the active one. If mon decides a service is down, it will shut down heartbeat gracefully on that node.
In essence, if any of the following happens, you want the slave to take over the resources:
- The master goes down (either physically crashes or no heartbeats are sent via serial or the NIC)
- Mon on the running machine decides a service is down
But what if mon can’t contact anything because the IP network is completely down or some server room operator yanked a cable? That’s when heartbeat itself will sense that a member is down and will initiate the takeover.
Setup
Hardware
You need two machines with at least one free serial port.
Get a null modem cable and hook it up between the machines.
On the master:
    # cat < /dev/ttyS0
On the slave:
    # echo "hello heartbeat" > /dev/ttyS0
You should see messages getting echoed on the master.
Configure the first machine (master) with 172.17.10.10 and the second one (slave) with 172.17.10.20. Also make sure both machines know where the other one is, by editing /etc/hosts.
Excerpt from /etc/hosts:
    172.17.10.10 master
    172.17.10.20 slave
If you have two network cards on the machines, hook up a cross over network cable to provide non-serial heartbeat too. If you don’t have a cross over, you can use regular cables, but go via a hub/switch. I’ll leave that out for simplicity.
Heartbeat
An easy getting started guide for Heartbeat can be read here. I explain the most basic setup below:
Install heartbeat on both the master and slave.
    # rpm -i heartbeat-0.4.9.2-1.i386.rpm
[Don’t try 1.0.1 yet, because it requires stonith, requires pils, requires udp-snmp, requires libcrypto…]
Create a file /etc/ha.d/ha.cf on both the master and the slave. It sends the log to /var/log/ha-log.
/etc/ha.d/ha.cf:
    logfile /var/log/ha-log
    keepalive 2
    deadtime 10
    serial /dev/ttyS0
    node master
    node slave
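In this ha.cf, keepalive is the number of seconds between heartbeats and deadtime is how long a node may stay silent before its peer declares it dead. A rough sketch of that arithmetic, assuming both values are whole seconds as above (the function name missed_beats is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: how many heartbeats must go missing before the
# peer is declared dead, according to the given ha.cf.
missed_beats() {
    awk '$1 == "keepalive" { k = $2 }
         $1 == "deadtime"  { d = $2 }
         END {
             # Fail if either setting is absent or not positive.
             if (k <= 0 || d <= 0) exit 1
             printf "%d\n", d / k
         }' "$1"
}
```

Called as missed_beats /etc/ha.d/ha.cf with the values above, this prints 5: the slave waits for five silent keepalive intervals, about 10 seconds, before it takes over.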
Create a file /etc/ha.d/haresources that looks exactly the same on both master and slave.
This is where you define the virtual IP. The line starts with the master’s name, so when the slave sees it, it knows it’s not supposed to take over the resources until the master is down.
/etc/ha.d/haresources:
    master 172.17.10.30 apache mon
Create /etc/ha.d/authkeys (same on both nodes):
/etc/ha.d/authkeys:
    auth 1
    1 crc
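Heartbeat complains about the authkeys permissions unless the file is mode 600 (see the troubleshooting notes further down), so it can help to create the file with the right mode from the start. A minimal sketch, assuming the /etc/ha.d directory that the log messages refer to; write_authkeys is a hypothetical helper name, not part of heartbeat:

```shell
#!/bin/sh
# Hypothetical helper: write the two-line authkeys file shown above
# into the given directory (default /etc/ha.d) with mode 600.
write_authkeys() {
    dir=${1:-/etc/ha.d}
    (
        umask 077    # a newly created file gets mode 600
        printf 'auth 1\n1 crc\n' > "$dir/authkeys"
    ) && chmod 600 "$dir/authkeys"    # also fixes a pre-existing file
}
```

Run it as root on both nodes, for example write_authkeys /etc/ha.d.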
Mon
Obtain the mon packages here. Do all these steps on both master and slave.
Install them under /etc/ha.d/mon:
    # cd /etc/ha.d
    # tar xzvf mon-0.99.2.tar.gz
    # mv mon-0.99.2 mon
Mon requires some external Perl modules. Use your nearest CPAN mirror; I’ve provided links so that you know where to start looking.
Unpack them and follow the install instructions, which are usually:
    # perl Makefile.PL
    # make
    # make test
    # make install
Create a /etc/ha.d/mon/mon.cf.
This configuration will try to get the page /test.html from the web server every 30s. It also tries to ping the router (to see if the node has “connectivity”). Replace 172.17.10.254 with your router address, or comment out the whole watch routers section if you don’t want that.
Replace operator@yourdomain.com with your email address.
/etc/ha.d/mon/mon.cf:
    cfbasedir = /etc/ha.d/mon/etc
    alertdir = /etc/ha.d/mon/alert.d
    mondir = /etc/ha.d/mon/mon.d
    statedir = /etc/ha.d/mon/state.d
    logdir = /var/log/
    maxprocs = 20
    histlength = 100
    randstart = 10s
    authtype = getpwnam

    hostgroup web-fe 172.17.10.30

    hostgroup routers 172.17.10.254

    watch web-fe
        service http
            interval 30s
            monitor http.monitor -p 80 -u /test.html
            allow_empty_group
            period wd {Mon-Sun}
                alert bring-ha-down.alert -S "web server node member down" \
                    operator@yourdomain.com
                upalert mail.alert -S "web server is back up" \
                    operator@yourdomain.com
                alertevery 600s
                alertafter 2

    watch routers
        service ping
            interval 10s
            monitor ping.monitor
            allow_empty_group
            period wd {Mon-Sun}
                alert bring-ha-down.alert -S "node member NIC down" \
                    operator@yourdomain.com
                upalert mail.alert -S "web server is back up" \
                    operator@yourdomain.com
                alertevery 10s
As you see in the alert section above, I’ve defined an alert script, bring-ha-down.alert.
Create such a script here: /etc/ha.d/mon/alert.d/bring-ha-down.alert.
It will call the mail alert and then bring down heartbeat on that node. Make it executable.
/etc/ha.d/mon/alert.d/bring-ha-down.alert:
    #!/bin/sh
    # Send the mail alert with the original arguments, then stop heartbeat on this node.
    /etc/ha.d/mon/alert.d/mail.alert "$@"
    /etc/rc.d/heartbeat stop
At this point, I usually copy the whole /etc/ha.d/mon directory to the slave.
Apache
Edit your httpd.conf (normally /usr/local/apache/conf/httpd.conf or /etc/httpd/httpd.conf).
Create a virtual host like this: (replace with your settings)
Excerpt from httpd.conf:
    NameVirtualHost 172.17.10.30
    <VirtualHost 172.17.10.30>
        ServerAdmin operator@yourdomain.com
        DocumentRoot /usr/local/apache2/htdocs/
        ServerName yourserver.yourdomain.com
        ErrorLog logs/test-errors_log
        CustomLog logs/test-access_log common
    </VirtualHost>
When this node is active, Apache will listen on 172.17.10.30.
test.html
Create a test.html containing the text MASTER on the master and SLAVE on the slave.
Put it in your web root (/usr/local/apache/htdocs or similar).
This way you can easily test which node is active.
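The same check can be scripted instead of done in a browser. The sketch below only interprets the page body; active_node is a made-up function name, and it assumes the page contains the literal word MASTER or SLAVE as described above:

```shell
#!/bin/sh
# Hypothetical helper: read the body of test.html on stdin and report
# which node served it.
active_node() {
    body=$(cat)
    case $body in
        *MASTER*) echo master ;;
        *SLAVE*)  echo slave ;;
        *)        echo unknown; return 1 ;;
    esac
}
```

Once the cluster is up, something like curl -s http://172.17.10.30/test.html | active_node should print master or slave, depending on which node currently holds the virtual IP.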
Start scripts
Create /etc/ha.d/mon/mon-start.
/etc/ha.d/mon/mon-start:
    #!/bin/bash
    MON_HOME=/etc/ha.d/mon
    case "$1" in
    start)
        if [ -f "$MON_HOME/mon.pid" ]; then
            echo "mon already started"
            exit
        fi
        echo "Starting Mon"
        "$MON_HOME/mon" -c "$MON_HOME/mon.cf" -L "$MON_HOME" -P "$MON_HOME/mon.pid" &
        ;;
    stop)
        if [ -f "$MON_HOME/mon.pid" ]; then
            echo "Stopping Mon"
            kill -9 "$(cat "$MON_HOME/mon.pid")"
            rm -f "$MON_HOME/mon.pid"
        else
            echo "no server pid, server doesn't seem to run"
        fi
        ;;
    status)
        echo "doing good"
        ;;
    *)
        echo "Usage: $0 {start|stop|status}"
        exit 1
    esac
    exit 0
Upon startup, Heartbeat will look in /etc/rc.d/init.d or /etc/ha.d/resource.d for its resource start scripts.
Create links in /etc/ha.d/resource.d:
    # cd /etc/ha.d/resource.d
    # ln -s ../mon/mon-start mon
    # ln -s /etc/rc.d/apache apache
OK, now heartbeat knows how to start/stop mon and apache. Check that you have start scripts like these in place.
Start the cluster
As we’ve configured Heartbeat, it will start apache and mon on the master.
Starting heartbeat on the slave while the master is healthy will put the slave in standby, where it patiently waits for the master to go down.
Start heartbeat on the master:
    # /etc/rc.d/heartbeat start
Do the same on the slave.
Check out the master’s log, /var/log/ha-log. This is what a good log looks like.
/var/log/ha-log:
    heartbeat: ... info: Configuration validated. Starting heartbeat 0.4.9.2
    heartbeat: ... info: heartbeat: version 0.4.9.2
    heartbeat: ... info: Heartbeat generation: 7
    heartbeat: ... notice: Starting serial heartbeat on tty /dev/ttyS0
    heartbeat: ... info: Local status now set to: 'up'
    heartbeat: ... info: Local status now set to: 'active'
    heartbeat: ... info: Heartbeat restart on node master
    heartbeat: ... info: Node master: status up
    heartbeat: ... info: Running /etc/ha.d/rc.d/status status
    heartbeat: ... info: Running /etc/ha.d/resource.d/IPaddr 172.17.10.30 status
    heartbeat: ... info: Node master: status active
    heartbeat: ... info: Resource acquisition completed.
    heartbeat: ... info: Running /etc/ha.d/rc.d/status status
    heartbeat: ... info: Running /etc/ha.d/rc.d/ip-request ip-request
    heartbeat: ... info: Running /etc/ha.d/resource.d/IPaddr 172.17.10.30 status
    heartbeat: ... info: Acquiring resource group: master 172.17.10.30 apache mon
    heartbeat: ... info: Running /etc/ha.d/resource.d/IPaddr 172.17.10.30 start
    heartbeat: ... info: ifconfig eth0:0 172.17.10.30 netmask 255.255.0.0 \
        broadcast 172.17.255.255
    heartbeat: ... info: Sending Gratuitous Arp for 172.17.10.30 on eth0:0 [eth0]
    heartbeat: ... info: Running /etc/ha.d/resource.d/apache start
    heartbeat: ... info: Running /etc/ha.d/resource.d/mon start
Here are some common troubleshooting scenarios:
    heartbeat: ... ERROR: Bad permissions on keyfile [/etc/ha.d/authkeys], 600 recommended.
Fix that with chmod 600 /etc/ha.d/authkeys.
    heartbeat: ... ERROR: Current node [mastah] not in configuration
Check your hostname, which has to be the same as you defined it in /etc/ha.d/ha.cf and /etc/ha.d/haresources.
    heartbeat: ... Cannot locate resource script apache
    heartbeat: ... Cannot locate resource script mon
Make sure you have start scripts in /etc/rc.d/init.d or /etc/ha.d/resource.d that take the start/stop argument and are executable. Try them manually first.
If mon shuts down heartbeat right away, it probably means you haven’t configured the http ping or router ping properly (thus making Mon think it’s not answering).
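The three errors above can be checked for before heartbeat is started. The following is only a sketch: ha_preflight is a hypothetical helper, the resource script names are passed in by hand rather than parsed from haresources, and stat -c is the GNU coreutils form, so adjust it on other systems.

```shell
#!/bin/sh
# Hypothetical preflight check for the three errors listed above.
# Usage: ha_preflight HA_DIR NODE_NAME [RESOURCE_SCRIPT...]
ha_preflight() {
    if [ "$#" -lt 2 ]; then
        echo "usage: ha_preflight HA_DIR NODE_NAME [RESOURCE_SCRIPT...]" >&2
        return 2
    fi
    ha_dir=$1
    node=$2
    shift 2
    status=0

    # 1. authkeys should be mode 600 (GNU stat syntax assumed).
    mode=$(stat -c %a "$ha_dir/authkeys" 2>/dev/null)
    if [ "$mode" != "600" ]; then
        echo "authkeys: mode is ${mode:-unknown}, expected 600"
        status=1
    fi

    # 2. The node name must appear on a "node" line in ha.cf.
    if ! awk -v n="$node" '$1 == "node" && $2 == n { found = 1 }
                           END { exit !found }' "$ha_dir/ha.cf" 2>/dev/null; then
        echo "ha.cf: no 'node $node' line"
        status=1
    fi

    # 3. Each resource script must be executable in one of the two
    #    directories heartbeat searches.
    for script in "$@"; do
        if [ ! -x "$ha_dir/resource.d/$script" ] &&
           [ ! -x "/etc/rc.d/init.d/$script" ]; then
            echo "$script: no executable start script found"
            status=1
        fi
    done

    return "$status"
}
```

On the master this could be run as ha_preflight /etc/ha.d "$(uname -n)" apache mon. It prints one line per problem and returns non-zero if anything looks wrong.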
Check that heartbeat brings up the virtual IP. This is what the master should look like:
    # ifconfig -a
    eth0:0    Link encap:Ethernet  HWaddr ....
              inet addr:172.17.10.30  Bcast:172.17.255.255  Mask:255.255.0.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              Interrupt:11 Base address:0x1000
Test failover scenarios
If you’ve come this far, good! Now it’s time to try to shake down the master.
Apache is now listening on that virtual ip, so point a browser to http://172.17.10.30/test.html.
You should see the test page displaying “MASTER”.
1. Turn off the master, or do an init 1 to get out of the multiuser runlevel.
It should take ~10 secs for the slave to determine that the master is down.
Soon enough, you’ll see this in the slave’s log:
    heartbeat: ... WARN: node master: is dead
    heartbeat: ... info: Link master:/dev/ttyS0 dead.
    heartbeat: ... info: Link master:eth1 dead.
    heartbeat: ... info: Running /etc/ha.d/rc.d/status status
    heartbeat: ... info: Running /etc/ha.d/rc.d/ifstat ifstat
    heartbeat: ... info: Running /etc/ha.d/rc.d/ifstat ifstat
    heartbeat: ... info: Taking over resource group 172.17.10.30
    heartbeat: ... info: Acquiring resource group: master 172.17.10.30 apache mon
    heartbeat: ... info: Running /etc/ha.d/resource.d/IPaddr 172.17.10.30 start
    heartbeat: ... info: ifconfig eth0:0 172.17.10.30 netmask 255.255.0.0 \
        broadcast 172.17.255.255
    heartbeat: ... info: Sending Gratuitous Arp for 172.17.10.30 on eth0:0 [eth0]
    heartbeat: ... info: Running /etc/ha.d/resource.d/apache start
    heartbeat: ... info: Running /etc/ha.d/resource.d/mon start
Do an ifconfig -a to see that the slave took over the IP address.
Point your browser to http://172.17.10.30/test.html. You should see “SLAVE”.
Start the master again to get back to normal operation. Start heartbeat if necessary.
2. Do a /etc/rc.d/apache stop on the master.
After a while (defined in mon.cf) you should see this in the /var/log/messages log on the master:
    Feb 21 18:10:13 master mon[28334]: failure for web-fe http 1045879813 172.17.10.30
Mon will sense this and call the alert script, which shuts down heartbeat.
This is equivalent to Apache having died: Mon notices that it can no longer fetch the page (with the mon.cf above, after two failed checks 30 secs apart) and brings down heartbeat.
Check the log on the slave to see that it brings up the interface.
In production, you’d have to fix the master and then start heartbeat on it again, which makes the slave release all the resources.
Now you’re ready to read about all the options of heartbeat and mon. Refer to the Resources section below.
If you have trouble getting heartbeat to start at boot, send me an email.
Resources
- Heartbeat homepage: http://www.linux-ha.org/
- Mon homepage: http://www.kernel.org/software/mon/
- Apache: httpd.apache.org
- Load balancing in mind?: Linux virtual server: http://www.linuxvirtualserver.org
thomas.olausson@home.se, 2003-02-05. Comments and questions are welcome.
Another Way…
Using a Linux failover router
Now turn on IP packet forwarding on the Linux box by changing the value of net.ipv4.ip_forward to 1 in the /etc/sysctl.conf file and executing the command:
    # sysctl -p
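If you would rather script the edit to /etc/sysctl.conf than make it by hand, something along these lines should work. It is a sketch: set_sysctl_conf is a made-up helper name, and it simply drops any existing line for the key (commented out or not) and appends the new setting at the end of the file.

```shell
#!/bin/sh
# Hypothetical helper: set KEY = VALUE in a sysctl.conf-style file,
# replacing any earlier line for the same key.
# Usage: set_sysctl_conf KEY VALUE [FILE]
set_sysctl_conf() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Keep every line that is not a (possibly commented-out) setting
    # of this key, then append the new setting.
    grep -v -E "^[#[:space:]]*${key}[[:space:]]*=" "$file" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For this step you would run set_sysctl_conf net.ipv4.ip_forward 1 as root and then sysctl -p as described above.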
Next, you need to configure iptables by adding certain rules, so that your internal LAN can route packets to the Internet. For this, issue the following commands as root:
    # iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    # iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
    # iptables -A FORWARD -s 10.0.0.0/24 -j ACCEPT
    # iptables -A FORWARD -d 10.0.0.0/24 -j ACCEPT
    # iptables -A FORWARD ! -s 10.0.0.0/24 -j DROP
The above commands turn on masquerading in the NAT table by appending a POSTROUTING rule (-A POSTROUTING) for all outgoing packets on the two Ethernet interfaces, eth0 and eth1. The next two lines accept forwarding of all packets to and from the 10.0.0.0/24 network. The last line drops the packets that do not come from the 10.0.0.0/24 network.
To make the iptables rules permanent, save them as follows:
    # iptables-save > /etc/sysconfig/iptables
Now you must restart your network, as well as iptables:
    # /etc/init.d/network restart
    # /etc/init.d/iptables restart
To see if your new iptables rules have gone into effect, type iptables -L (add -t nat to list the masquerading rules).
Enabling failover routing
After you have configured your network, the next step is to enable failover routing on your Linux box, so that if the first route dies the router will automatically switch over to the next route. To do so, you’ll need to add the default gateway routes provided to you by your ISPs for both your network cards:
    # route add default gw 61.16.130.97 dev eth0
    # route add default gw 200.15.110.90 dev eth1
Here, 61.16.130.97 is the gateway address given by ISP1 and 200.15.110.90 is the gateway address given by ISP2. Replace them with the addresses available to you. These routes will disappear every time you reboot the system. To make them permanent, add the above two commands to the /etc/rc.d/rc.local file, which is run at boot time.
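When adding the route commands to rc.local from a script, it is worth making sure that they are not added twice. A small sketch of one way to do that; append_once is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: append LINE to FILE unless an identical line
# is already there.
# Usage: append_once LINE FILE
append_once() {
    line=$1
    file=$2
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null ||
        printf '%s\n' "$line" >> "$file"
}

# Example (run as root, with your own gateway addresses):
#   append_once 'route add default gw 61.16.130.97 dev eth0' /etc/rc.d/rc.local
#   append_once 'route add default gw 200.15.110.90 dev eth1' /etc/rc.d/rc.local
```

Running the two example lines again leaves rc.local unchanged.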
Also make sure that all the computers on your internal LAN (10.0.0.0/24) have their default gateway address set as the IP address of the eth3 Ethernet interface (i.e. 10.0.0.1) of your failover router.
Finally, modify the /proc/sys/net/ipv4/route/gc_timeout file. This file contains a numerical value that denotes the time in seconds after which the kernel declares a route to be inactive and automatically switches to the other route if available. Open the file in any text editor and change its default value of 300 to some smaller value, say 10 or 15. Save the changes and exit.
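Files under /proc are kernel settings rather than ordinary files, so writing the value with echo is usually simpler than using an editor, and the value goes back to its default at the next reboot unless it is set again at boot time. A sketch of the change as a small function; set_gc_timeout is a made-up name, and the optional second argument exists only so the function can be tried out on an ordinary file:

```shell
#!/bin/sh
# Hypothetical helper: write a new route gc_timeout (in seconds).
# Usage: set_gc_timeout SECONDS [FILE]
set_gc_timeout() {
    seconds=$1
    target=${2:-/proc/sys/net/ipv4/route/gc_timeout}
    # Accept whole numbers only.
    case $seconds in
        ''|*[!0-9]*)
            echo "usage: set_gc_timeout SECONDS [FILE]" >&2
            return 1
            ;;
    esac
    echo "$seconds" > "$target"
}
```

As root, set_gc_timeout 10 has the same effect as sysctl -w net.ipv4.route.gc_timeout=10. To keep the value across reboots, add net.ipv4.route.gc_timeout = 10 to /etc/sysctl.conf.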
Now your Linux machine is ready to serve as a failover router, automatically and quickly switching to the secondary route every time the primary route fails.