Clustering, High Availability, Linux, Xen | July 8, 2012, 1:56 pm

Fencing Linux Cluster Nodes on XenServer/XCP Using XenAPI

As many of you already know, fencing is an important part of maintaining the health of your cluster. When a cluster node experiences issues, misbehaves, or just isn't playing nice with the rest of the nodes, it's important to bring that node down as fast as possible; otherwise you risk service interruption or, even worse, data corruption!

Before virtualization became prevalent in the data center, the most common way to fence a node was to log into its IPMI or DRAC card and issue a shutdown/reboot command. If the node didn't have a DRAC or IPMI card, you could also log into the PDU it was connected to and power off the outlet. Either method ensured that misbehaving cluster nodes could be quickly taken offline when necessary.

On virtualized cluster nodes, however, there isn't a dedicated IPMI or DRAC card. And you certainly wouldn't want to log into the PDU and power off the entire physical host. So the only methods left are those that require the node to self-fence, or those that require in-band access to the node to issue a reboot/shutdown command. Sadly, these methods are unreliable at best. For example, if a node is unstable and has partially crashed, SSH access may be unavailable, or parts of the OS may be too unstable for it to properly self-fence. So then, what is the best way to fence an unstable, VM-based cluster node?

Well, if you are using XenServer or XCP, I'd say the best way is through an agent or script that leverages XAPI. For example, if you could execute xe vm-reboot or xe vm-reset-powerstate whenever a VM cluster node became unstable or hung, that would be awesome. It would be even better if this script or agent were integrated with the cluster stack so that it worked from inside rather than outside (i.e. so that the cluster itself was responsible for detecting when it was necessary to fence a node, as opposed to an external script or agent). But does such a thing even exist?
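For reference, doing it by hand from the pool master looks something like this; the VM label node1-vm is just a placeholder, and vm-reset-powerstate is strictly a last resort for a VM whose power state is wedged:

    # ask XAPI to reboot the VM, forcing it if the guest won't cooperate
    xe vm-reboot vm=node1-vm force=true
    # last resort: forcibly reset the recorded power state of a wedged VM
    xe vm-reset-powerstate vm=node1-vm force=true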

Well kiddies, yes it does. And we have Matthew J Clark to thank for it. He wrote a pretty nice fencing agent for XenServer/XCP that does everything I described above. You can download it from his Google Code page or from the FedoraHosted git repo.
I recommend you pull it down from the git repository at fedorahosted.org. It works with CMAN (I think, lol) and Pacemaker clusters, though I've only tested it with the latter.
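For the record, the pull/build/install dance looks roughly like this on a Fedora/RHEL-ish box. The clone URL and package names below are my best guess at the usual locations, so adjust them for your distro and for wherever you actually grabbed the source:

    # the agent needs pexpect and python-suds at runtime; the autotools bits are for the build
    yum install pexpect python-suds autoconf automake libtool make
    # clone URL is illustrative; use whichever repo/mirror you pulled the agent from
    git clone git://git.fedorahosted.org/fence-agents.git
    cd fence-agents
    ./autogen.sh && ./configure && make && make install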

Pull it down, build it (it requires pexpect and python-suds), and install it. You should have /usr/sbin/fence_xenapi after you install it. Run stonith_admin -M -a fence_xenapi to check out its metadata:



fence_cxs is an I/O Fencing agent used on Citrix XenServer hosts. It uses the XenAPI, supplied by Citrix, to establish an XML-RPC session to a XenServer host. Once the session is established, further XML-RPC commands are issued in order to switch on, switch off, restart and query the status of virtual machines running on the host.

        
                
                
Its parameters, as described in the metadata, are:

  • Fencing Action
  • Login Name
  • Login password or passphrase
  • Script to retrieve password
  • Physical plug number or name of virtual machine
  • The URL of the XenServer host
  • The UUID of the virtual machine to fence
  • Verbose mode
  • Write debug information to given file
  • Display version information and exit
  • Display help and exit
  • Separator for CSV created by operation list
  • Test X seconds for status change after ON/OFF
  • Wait X seconds for cmd prompt after issuing command
  • Wait X seconds for cmd prompt after login
  • Wait X seconds after issuing ON/OFF
  • Wait X seconds before fencing is started
  • Count of attempts to retry power on
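Before wiring the agent into the cluster, you can drive it directly over stdin, the same way the cluster's fencing daemon does. The parameter names here (action, session_url, login, passwd, port) are the usual fence-agents ones and the values are placeholders, so double-check them against the metadata on your build:

    # ask the agent for the status of the VM labeled node1-vm
    printf 'action=status\nsession_url=https://xenserver.example.com\nlogin=root\npasswd=secret\nport=node1-vm\n' | /usr/sbin/fence_xenapi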

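Wiring it into Pacemaker looks something like the following in crm shell syntax. This is only a sketch: the host names, VM labels, and credentials are placeholders, and the agent parameter names (session_url, login, passwd) should be double-checked against the metadata above.

    primitive st-node1 stonith:fence_xenapi \
            params session_url="https://xenserver.example.com" login="root" passwd="secret" \
                   pcmk_host_map="node1:node1-vm" pcmk_host_check="static-list" \
            op monitor interval="60s"
    primitive st-node2 stonith:fence_xenapi \
            params session_url="https://xenserver.example.com" login="root" passwd="secret" \
                   pcmk_host_map="node2:node2-vm" pcmk_host_check="static-list" \
            op monitor interval="60s"
    location l-st-node1 st-node1 -inf: node1
    location l-st-node2 st-node2 -inf: node2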

Notice the use of pcmk_host_map and pcmk_host_check in the primitive. The agent isn't able to map cluster nodes back to VM labels on its own, so pcmk_host_map provides that mapping. Also, the location constraints ensure that a node cannot fence itself.
To test, simply run stonith_admin -B <node-name>; that will reboot the specified node.
After you've tested and you are ready to go to production, don't forget to set stonith-enabled=true and no-quorum-policy="ignore" in your cluster properties, and you'll be all set.
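If you're using the crm shell, those two properties are just:

    crm configure property stonith-enabled=true
    crm configure property no-quorum-policy=ignore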


1 Comment

  • matthias hoffmann

    Hello,
    I am trying to get that stonith primitive running. Could you please explain the pcmk_host and pcmk_host_check params? I have no idea which values they need. I have two physical servers running XenServer and want to reboot a VM via a stonith primitive.
    Could you please provide an example?
