Vmware ESXi (3.5) and RHEL (5.1) : Timekeeping Woes

Brian O'Mahony brian.omahony at curamsoftware.com
Thu Sep 17 04:57:11 CDT 2009


Jack

This is a great, detailed explanation, thanks. I'll explain a little more about our scenario. I think you may have hit the nail on the head with this.

We have a machine with 4 GB of non-reserved memory. It can use up to 4 GB, but we have never seen anywhere near that level of usage. The machine uses between roughly 200 MB and 2 GB. The issues we have been seeing on this machine:


1. Time skew, at random. Fine for weeks, then it happened twice in three days. We have turned off the VMware Tools host-guest time sync, and I have edited ntp.conf (but not turned the service back on until I get clearance for it).

2. Random slowdowns, as you described. Memory and CPU usage in the Virtual Infrastructure client graphs are all nominal, about 25% usage. However, everything on the server was running slow. (The server hosts bug tracking, wikis and newsgroups.) Everything else on the host was running fine (all Windows systems).

3. Taking a clone of the machine while it was running has crashed it on two separate occasions. The VM completely hung.

4. "ethtool operation 3 not supported" constantly in dmesg. Unrelated, I know. However, I have a test version of this VM on our test network (pretty much a replica of the live network but without constant load/use) and this doesn't appear there at all.

I'm beginning to hate Linux on VMware. :P

B

From: Roehrig, Jack (John) [mailto:Jack.Roehrig at ask.com]
Sent: 16 September 2009 16:05
To: Brian O'Mahony; linux-poweredge at lists.us.dell.com
Subject: RE: Vmware ESXi (3.5) and RHEL (5.1) : Timekeeping Woes

Without knowing how the swapping algorithm works, it's difficult to be certain under what conditions the swap will be used. The conditions look correct though. I have seen many 2, 4, 6, and 8 GB RHEL 5.1 guests whose swap file is utilized by their hosts. These machines usually use between 100 MB and 2 GB of actual RAM. Certain conditions seem to exacerbate the problem as well. For example, a cluster of 45 machines all with the same load, purpose, and memory configuration will not experience synchronized global swapping issues. Instead, combinations of guests seem to create more of an issue. However, the problem still occurs even when the sum of total allocated guest memory is less than the available-to-guests physical ESX memory. Perhaps this is caused by paging out unused portions of RAM to disk while a VM resides on an overcommitted ESX host, followed by a migration to an ESX host which is not overcommitted, and a later page fault resulting in access to the swap.

In any case, convincing many developers that, despite all of their mlocking, ESX will still swap out their LRU junk to the SAN is much more difficult than setting resource allocations for each guest. Hard-allocating the total requested guest memory to a guest will cause the swap file to be created with a zero length, which I assume disables its use.

In any event, it's worth a shot. Try setting a resource allocation for a troublesome VM and see if your time stops skewing. If it doesn't work, experiment with combinations of other methods discussed in this thread. An associated, but much more devastating, manifestation of the swapping problem is horrible VM performance. During periods of bad performance, most system reporting data will show normal conditions (memory, CPU, disk I/O, etc.), but load average will skyrocket. This problem may be more noticeable on VMs whose swap resides on an oversaturated, slow SAN. If swapping is the issue, resource allocations may improve overall application response time and give better service to your customers.

HTH
-Jack Roehrig

From: Brian O'Mahony [mailto:brian.omahony at curamsoftware.com]
Sent: Wednesday, September 16, 2009 1:22 AM
To: Roehrig, Jack (John); linux-poweredge at lists.us.dell.com
Subject: RE: Vmware ESXi (3.5) and RHEL (5.1) : Timekeeping Woes

Jack

Thanks for the in-depth explanation. This is pretty much what I was looking for.

I'm going to disable the local clock lines (127.127.1.0 and fudge stratum 10) and add tinker panic 0.

Just as a matter of interest, the host has 48 GB of RAM, of which 4 GB is assigned to this guest, even though on average it uses between 400 MB and 1.4 GB. Would this cause the issues that we are seeing?

Thanks
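
A minimal sketch of the ntp.conf edit described above, written as a filter so it can be tried on a copy of the file first. The function name vm_ntp_conf is hypothetical, and the layout of the real ntp.conf on this guest is an assumption. "tinker panic 0" is emitted first because the ntpd documentation asks for tinker to come before other options.

```shell
#!/bin/sh
# Hypothetical helper: read an ntp.conf on stdin (or from file arguments)
# and write a modified copy to stdout.
#  - puts "tinker panic 0" at the top
#  - comments out the local clock driver lines (127.127.1.0)
# It does not check whether "tinker panic 0" is already present.
vm_ntp_conf() {
    echo 'tinker panic 0'
    sed -e 's/^[[:space:]]*server[[:space:]]\{1,\}127\.127\.1\.0/#&/' \
        -e 's/^[[:space:]]*fudge[[:space:]]\{1,\}127\.127\.1\.0/#&/' "$@"
}

# Example (paths are assumptions; review the result before installing it):
#   vm_ntp_conf /etc/ntp.conf > /tmp/ntp.conf.new
```

Real time servers, such as 172.16.164.100 from the log in the first message, pass through unchanged.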
B

From: Roehrig, Jack (John) [mailto:Jack.Roehrig at ask.com]
Sent: 15 September 2009 17:53
To: Brian O'Mahony; linux-poweredge at lists.us.dell.com
Subject: RE: Vmware ESXi (3.5) and RHEL (5.1) : Timekeeping Woes

While there are many timekeeping issues with VMs, most of them are specifically related to suspending machines and VMotioning machines. What you seem to be experiencing is an unstable clock frequency. When NTP detects clock frequency instability, it panics and quits. Several things can cause this clock tick problem, but the most common we've noticed is utilization of the VM's swap file. If a guest has far more memory allocated to it than is committed in RAM (check Committed_AS), the ESX host may detect the LRU memory and page it out to disk. When the VM accesses this memory, ESX will swap it in and out of physical RAM. This causes horrible slowness on the VM and terrible time skew. You can check to see if your ESX host has paged memory to disk with the following command:

/usr/bin/esxtop -b -d 2 -n 1 | cut -d',' -f 40 | grep -v esx | tr -d '"' | awk '{printf "%.0f",$1}'
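
A minimal sketch of how the reading from the command above could be turned into a pass/fail check, for example from a scheduled job. The function name check_swap_value and the default threshold of 0 are assumptions for illustration. Field 40 is taken from the command above as-is and may well differ between ESX versions, so it is worth confirming against the esxtop header line first.

```shell
#!/bin/sh
# Hypothetical helper: classify one swap reading against a threshold.
# $1 = value produced by the esxtop pipeline, $2 = threshold (default 0).
# Returns 0 when at or below the threshold, 1 when above, 2 when unparsable.
check_swap_value() {
    value="$1"
    threshold="${2:-0}"
    case "$value" in
        ''|*[!0-9]*) echo "UNKNOWN value=$value"; return 2 ;;
    esac
    if [ "$value" -gt "$threshold" ]; then
        echo "SWAPPING value=$value"
        return 1
    fi
    echo "OK value=$value"
    return 0
}

# Example use on the ESX host (not executed here):
#   v=$(/usr/bin/esxtop -b -d 2 -n 1 | cut -d',' -f 40 | grep -v esx | tr -d '"' | awk '{printf "%.0f",$1}')
#   check_swap_value "$v" 0 || logger "ESX swap check: $v"
```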

There are many conditions that can cause the VM's swap file to be accessed, but VMware engineers will not disclose the algorithms used to control this. If the sum of memory allocated to all guests is more than is available on the host, ESX may utilize the VMs' swap. We monitor our ESX hosts with a cron job and SNMP traps to detect when ESX machines are utilizing swap.
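
The overcommit condition described above is simple arithmetic, sketched here under stated assumptions: the function name overcommit_check is hypothetical, all sizes are whole megabytes, and the figures have to be supplied by hand (nothing here queries ESX). A passing result does not rule out swapping, since the problem has also been seen when the total is below host memory.

```shell
#!/bin/sh
# Hypothetical helper: compare the sum of guest allocations to host memory.
# $1 = host memory available to guests (MB), remaining args = guest sizes (MB).
overcommit_check() {
    host_mb="$1"
    shift
    total=0
    for g in "$@"; do
        total=$((total + g))
    done
    if [ "$total" -gt "$host_mb" ]; then
        echo "overcommitted: guests=${total}MB host=${host_mb}MB"
    else
        echo "within host memory: guests=${total}MB host=${host_mb}MB"
    fi
}

# Example: a 48 GB host carrying one 4 GB guest and one 8 GB guest
#   overcommit_check 49152 4096 8192
```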

HTH
-Jack Roehrig


From: linux-poweredge-bounces at lists.us.dell.com [mailto:linux-poweredge-bounces at lists.us.dell.com] On Behalf Of Brian O'Mahony
Sent: Tuesday, September 15, 2009 4:45 AM
To: linux-poweredge at lists.us.dell.com
Subject: Vmware ESXi (3.5) and RHEL (5.1) : Timekeeping Woes

I may have posted this here previously, and if not I know I spoke to Dell support, who advised me to make sure that the "Synchronize guest time with host" option was turned off.

Basically I have a RHEL 5.1 server. It has the ntpd service running. This seems to stop every now and again. It took me quite some time to organize downtime on this VM, so I only got to turn off the option in the settings last week.

Here is the log from messages:

messages:Sep 13 10:28:07 curzilla ntpd[2593]: synchronized to LOCAL(0), stratum 10
messages:Sep 13 11:35:50 curzilla ntpd[2593]: synchronized to 172.16.164.100, stratum 4
messages:Sep 13 15:18:10 curzilla ntpd[2593]: synchronized to LOCAL(0), stratum 10
messages:Sep 13 19:33:29 curzilla ntpd[2593]: synchronized to 172.16.164.100, stratum 4
messages:Sep 14 08:04:52 curzilla ntpd[2593]: synchronized to LOCAL(0), stratum 10
messages:Sep 14 09:30:14 curzilla ntpd[2593]: synchronized to 172.16.164.100, stratum 4
messages:Sep 14 22:01:36 curzilla ntpd[2593]: synchronized to LOCAL(0), stratum 10
messages:Sep 15 01:10:27 curzilla ntpd[2593]: synchronized to 172.16.164.100, stratum 4
messages:Sep 15 01:26:28 curzilla ntpd[2593]: time correction of 2232 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
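
A minimal sketch for pulling the two interesting events out of a log like the one above: how often ntpd fell back to LOCAL(0), and how large the offset was each time it hit the sanity limit. The function names are hypothetical and the log path in the usage comment is an assumption; the match strings are copied from the log lines above.

```shell
#!/bin/sh
# Hypothetical helpers; both read stdin or the files given as arguments.

# Count how many times ntpd fell back to the local clock.
ntpd_local_fallbacks() {
    grep -c 'synchronized to LOCAL(0)' "$@"
}

# Print the offset in seconds from each "time correction of N seconds" line.
ntpd_panic_offsets() {
    awk '/ntpd\[[0-9]+\]: time correction of/ {
        for (i = 1; i < NF - 1; i++)
            if ($i == "correction" && $(i + 1) == "of") print $(i + 2)
    }' "$@"
}

# Example (log path is an assumption):
#   ntpd_local_fallbacks /var/log/messages
#   ntpd_panic_offsets /var/log/messages
```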



Anyone have any ideas/suggestions?

B





The information in this email is confidential and may be legally privileged.
It is intended solely for the addressee. Access to this email by anyone else
is unauthorized. If you are not the intended recipient, any disclosure,
copying, distribution or any action taken or omitted to be taken in reliance
on it, is prohibited and may be unlawful. If you are not the intended
addressee please contact the sender and dispose of this e-mail. Thank you.
