Cluster Help Needed
Anthony_Porcano at kaplan.com
Wed Apr 28 19:32:01 CDT 2004
Thanks so much for your response. Your information is very helpful and gives me hope of getting this working. One question: in your working setup with the 2650s, were you still using 2.1?
From: David Truchan-contr [mailto:David.Truchan-contr at trw.com]
Sent: Wed 4/28/2004 1:53 PM
To: linux-poweredge at dell.com; Anthony Porcano
Subject: Re: Cluster Help Needed
I have 2 Dell 2650 servers clustered with a Dell PowerVault 220S.
The only way I could get a stable configuration was by dedicating 2
disks on the PowerVault to quorum only.
I also found it useful to increase the polling interval of cluquorumd.
You can do this with the command: cludb --put cluquorumd&&pingInterval 3
This will increase the pingInterval of cluquorumd to 3 seconds instead
of 2 seconds. You may have to play with this setting to figure out
what works best for your configuration.
Also, try setting your PERC BIOS to no-readahead. When I ran bonnie
benchmarks, I actually saw better performance with readahead disabled.
One last thing: I had absolutely no luck trying to cluster 2 Dell 2600 servers with a PowerVault using Red Hat AS 2.1. No matter which setting or configuration I tweaked, the whole setup was completely unstable.
>>> "Anthony Porcano" <Anthony_Porcano at kaplan.com> 04/28/04 12:21PM >>>
We have a two-node RHEL 2.1 cluster here connected to two 220s arrays used as shared storage. We are running an Oracle instance on each node, and we are trying to achieve host-level failover of an instance if one of the servers goes offline. During testing everything worked great after some tuning of the cluster configuration and the Oracle Listeners; however, now we are seeing some very disturbing problems with this setup.
Essentially, what we are seeing is that both nodes start up, start the cluster service, fire up the Oracle instance preferred on each server, and then start failing over to each other after 3-4 hours. During those first couple of hours everything works very well. However, after a certain amount of uptime we see messages like this in the cluster log on both hosts:
Apr 28 08:57:28 lunasosdb2 cluquorumd: <info> partnerHeartbeatActive: HB says partner is UP.
Apr 28 08:57:38 lunasosdb2 last message repeated 5 times
Apr 28 08:57:38 lunasosdb2 cluquorumd: <warning> shoot_partner: attempting to shoot partner.
These messages appear within seconds of each other on both hosts and are immediately followed by one of the hosts rebooting itself. As far as I can gather, the reboot happens because the node tries to power-cycle its partner and then realizes that there is still I/O coming from that node (we are running without a power switch). When it sees the other node is still there, it shoots itself instead.
This makes some sense to me although I would think that when you opt to not use a power switch it should not expect to be able to power cycle a partner. What I really need some help understanding is what is triggering this in the first place. Why do they function for several hours and then both start freaking out? What is that first message we see (partnerHeartbeatActive) indicative of? If anyone has worked with the RHEL cluster suite before and has some insight into this problem I would be very eager to hear your thoughts.
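One way to get at the timing question above is to pull the cluquorumd entries out of each node's cluster log and compare when the heartbeat and shoot_partner messages occur on each host. The following is only a minimal sketch, assuming the syslog layout shown in the excerpt above with each entry on a single line; the function names and the year default are hypothetical and not part of the cluster suite.

```python
import re
from datetime import datetime

# Matches syslog-style lines in the layout shown in the excerpt, e.g.
# "Apr 28 08:57:38 lunasosdb2 cluquorumd: <warning> shoot_partner: ..."
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) cluquorumd: <(?P<level>\w+)> (?P<msg>.*)$"
)


def parse_cluquorumd(lines, year=2004):
    """Yield (timestamp, host, level, message) for each cluquorumd entry.

    Syslog timestamps carry no year, so one is supplied (an assumption).
    Lines from other sources, such as "last message repeated N times",
    are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        yield stamp, m["host"], m["level"], m["msg"]


def shoot_events(lines, year=2004):
    """Return only the entries in which a node tries to shoot its partner."""
    return [e for e in parse_cluquorumd(lines, year) if "shoot_partner" in e[3]]


if __name__ == "__main__":
    sample = [
        "Apr 28 08:57:28 lunasosdb2 cluquorumd: <info> partnerHeartbeatActive: HB says partner is UP.",
        "Apr 28 08:57:38 lunasosdb2 last message repeated 5 times",
        "Apr 28 08:57:38 lunasosdb2 cluquorumd: <warning> shoot_partner: attempting to shoot partner.",
    ]
    for stamp, host, level, msg in shoot_events(sample):
        print(stamp.isoformat(), host, level, msg)
```

Running the same scan over the logs from both nodes and lining up the timestamps would show which host logged the shoot attempt first and how long after the last heartbeat message it happened.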
Systems Engineer, Kaplan Inc.
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
Please read the FAQ at http://lists.us.dell.com/faq or search the list archives at http://lists.us.dell.com/htdig/