Cluster Help Needed

Anthony Porcano Anthony_Porcano at kaplan.com
Wed Apr 28 11:22:01 CDT 2004


We have a two-node RHEL 2.1 cluster here connected to two 220s arrays used as shared storage. We are running an Oracle instance on each node, and we are trying to achieve host-level failover of an instance if one of the servers goes offline. During testing everything worked great after some tuning of the cluster configuration and the Oracle Listeners; however, now we are seeing some very disturbing problems with this setup.
 
Essentially, what we are seeing is that both nodes start up, start the cluster service, fire up the Oracle instance preferred on each server, and then start failing over to each other after 3-4 hours. During those first few hours everything works very well. However, after a certain amount of uptime we see messages like this in the cluster log on both hosts:
 
Apr 28 08:57:28 lunasosdb2 cluquorumd[18308]: <info> partnerHeartbeatActive: HB says partner is UP.
Apr 28 08:57:38 lunasosdb2 last message repeated 5 times
Apr 28 08:57:38 lunasosdb2 cluquorumd[18308]: <warning> shoot_partner: attempting to shoot partner.

These messages appear within seconds of each other on both hosts, and are immediately followed by one of the hosts rebooting itself. As far as I can gather, the reboot happens because the node tries to power cycle its partner and then realizes that there is still I/O coming from that node (we are running without a power switch). When it sees the other node is still there, it shoots itself instead.
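
To check how closely the shoot_partner warnings line up across the two nodes, something like the following could be run against a copy of each host's cluster log. This is only a rough sketch: it assumes the syslog-style line format shown in the excerpt above, it assumes the year (syslog lines do not carry one), and the function names and the 30-second window are made up for illustration and are not part of the cluster software.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
#   Apr 28 08:57:38 lunasosdb2 cluquorumd[18308]: <warning> shoot_partner: ...
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]: <(?P<level>\w+)> (?P<msg>.*)$"
)

def parse_events(lines, year=2004):
    """Return (timestamp, host, level, message) tuples for daemon log lines.

    Lines that do not match the layout (for example syslog's
    "last message repeated N times") are skipped.
    The year is an assumption because the log lines do not include one.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["host"], m["level"], m["msg"]))
    return events

def shoot_events(events):
    """Keep only the events whose message starts with shoot_partner."""
    return [e for e in events if e[3].startswith("shoot_partner")]

def near_simultaneous(events_a, events_b, window_seconds=30):
    """Pair shoot_partner events from two hosts that fall within the window.

    window_seconds is an arbitrary illustrative value, not a cluster setting.
    """
    pairs = []
    for a in shoot_events(events_a):
        for b in shoot_events(events_b):
            if abs((a[0] - b[0]).total_seconds()) <= window_seconds:
                pairs.append((a, b))
    return pairs

# The three lines from the excerpt above, with the wrapped line rejoined.
SAMPLE = [
    "Apr 28 08:57:28 lunasosdb2 cluquorumd[18308]: <info> partnerHeartbeatActive: HB says partner is UP.",
    "Apr 28 08:57:38 lunasosdb2 last message repeated 5 times",
    "Apr 28 08:57:38 lunasosdb2 cluquorumd[18308]: <warning> shoot_partner: attempting to shoot partner.",
]
```

Feeding each node's log lines through parse_events and then passing both results to near_simultaneous would list the shoot attempts that happened at roughly the same time on both sides, which may help show whether the two nodes really lose sight of each other at the same moment.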
 
This makes some sense to me, although I would think that when you opt not to use a power switch, the cluster should not expect to be able to power cycle a partner. What I really need some help understanding is what is triggering this in the first place. Why do the nodes function for several hours and then both start freaking out? What is that first message we see (partnerHeartbeatActive) indicative of? If anyone has worked with the RHEL cluster suite before and has some insight into this problem, I would be very eager to hear your thoughts.
 
Thanks,
 Anthony
 
_______________________
Anthony Porcano
Systems Engineer, Kaplan Inc.
 




More information about the Linux-PowerEdge mailing list