Cluster Help Needed
Anthony_Porcano at kaplan.com
Wed Apr 28 11:31:00 CDT 2004
Getting some more warnings in the logs in case anyone has insight on these as well.
Apr 28 11:00:19 lunasosdb2 clumibd: <warning> Lock: Held for too long (13 seconds)
Apr 28 11:00:19 lunasosdb2 clumibd: <warning> Lock: Increase fail-over time parameters.
Apr 28 11:56:39 lunasosdb2 clumibd: <warning> Lock: Held for too long (20 seconds)
Apr 28 11:56:39 lunasosdb2 clumibd: <warning> Lock: Increase fail-over time parameters.
Apr 28 12:07:27 lunasosdb2 cluquorumd: <warning> Skipped processing of pending requests 3 times!
Where exactly are the fail-over time parameters set? I don't remember configuring them via cluconfig.
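For anyone who wants to track how often these lock warnings show up and how long the lock was held each time, here is a minimal sketch of a log scanner. The pattern is only assumed from the clumibd lines quoted above, and the function name lock_hold_times is hypothetical; it is not part of the cluster tools.

```python
import re

# Pattern assumed from the clumibd warning lines quoted in this message;
# other releases may format these lines differently.
LOCK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+clumibd:\s+"
    r"<warning>\s+Lock: Held for too long \((?P<secs>\d+) seconds\)"
)


def lock_hold_times(lines):
    """Return (timestamp, host, seconds) for each 'Held for too long' warning."""
    hits = []
    for line in lines:
        match = LOCK_RE.match(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), int(match.group("secs")))
            )
    return hits


# Sample input copied from the log excerpt above.
sample = [
    "Apr 28 11:00:19 lunasosdb2 clumibd: <warning> Lock: Held for too long (13 seconds)",
    "Apr 28 11:00:19 lunasosdb2 clumibd: <warning> Lock: Increase fail-over time parameters.",
    "Apr 28 11:56:39 lunasosdb2 clumibd: <warning> Lock: Held for too long (20 seconds)",
    "Apr 28 12:07:27 lunasosdb2 cluquorumd: <warning> Skipped processing of pending requests 3 times!",
]

for stamp, host, secs in lock_hold_times(sample):
    print(stamp, host, secs)
```

To run it against a real log, pass an open file object instead of the sample list; the path to the cluster log depends on the local syslog configuration.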
From: linux-poweredge-admin at dell.com on behalf of Anthony Porcano
Sent: Wed 4/28/2004 12:21 PM
To: linux-poweredge at dell.com
Subject: Cluster Help Needed
We have a two-node RHEL 2.1 cluster here connected to two 220s arrays used as shared storage. We are running an Oracle instance on each node, and we are trying to achieve host-level failover of an instance if one of the servers goes offline. During testing everything worked great after some tuning of the cluster configuration and the Oracle Listeners; however, now we are seeing some very disturbing problems with this setup.
Essentially, what we are seeing is that both nodes start up, start the cluster service, fire up the Oracle instance preferred on each server, and then start failing over to each other after 3-4 hours. During those first couple of hours everything works very well. However, after a certain amount of uptime we see messages like this in the cluster log on both hosts:
Apr 28 08:57:28 lunasosdb2 cluquorumd: <info> partnerHeartbeatActive: HB says partner is UP.
Apr 28 08:57:38 lunasosdb2 last message repeated 5 times
Apr 28 08:57:38 lunasosdb2 cluquorumd: <warning> shoot_partner: attempting to shoot partner.
These messages appear within seconds of each other on both hosts, and are immediately followed by one of the hosts rebooting itself. I gather that the reboot is due to the host trying to power-cycle the other node and realizing that there is still I/O coming from that node. (We are running without a power switch.) When it sees the other node is still there, it shoots itself instead.
This makes some sense to me, although I would think that when you opt not to use a power switch, the cluster should not expect to be able to power-cycle a partner. What I really need some help understanding is what is triggering this in the first place. Why do the nodes function for several hours and then both start freaking out? What is that first message we see (partnerHeartbeatActive) indicative of? If anyone has worked with the RHEL cluster suite before and has some insight into this problem, I would be very eager to hear your thoughts.
Systems Engineer, Kaplan Inc.
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
Please read the FAQ at http://lists.us.dell.com/faq or search the list archives at http://lists.us.dell.com/htdig/