Article

KB43866 - vTM Virtual Appliance whole-VM hangs and process monitor logs


Information

 
Last Modified Date: 10/15/2018 5:08 PM
Synopsis
This article describes causes of the 'procmon hung' error message and solutions.
Problem or Goal

The vTM VA runs as a virtual machine inside a virtualization environment (e.g. ESXi, KVM, Xen). As such, it depends on that environment to deliver physical resources (CPU cycles, I/O operations, network, and so on) to the virtual machine. Similarly, the software variant of vTM can be installed on an OS (e.g. CentOS, Ubuntu) running in a virtualized environment.

Sometimes the virtualized environment fails to deliver resources and stops running the virtual machine for a period of time. The whole VM then hangs, which is referred to as a "whole-VM hang". While the VM is hung, network traffic cannot flow. If the hang lasts more than a few seconds and the vTM is part of a cluster, failover to another vTM in the cluster is initiated.

It is important to understand that while this problem can be detected by vTM, it can only be resolved by finding and fixing the problem on the virtualized environment (hypervisor).

Cause
Solution

The vTM VA and software contain a process monitor (procmon) that checks for stalled processes, including itself, so it can detect whole-VM hangs. The procmon script is a standalone process that relies only on resources from the underlying OS. If this process is hanging, it is generally for one of two reasons: resource contention within the VM, or resource contention outside the VM (e.g. too many other VMs running on the host hardware).

First check whether the system resources (RAM/CPU) inside the VM are heavily used, e.g. with top. If they are, the usual recommendation is to increase whichever resource is overused, i.e. add more RAM or CPUs to the virtual machine. If they show very little use around the time of the procmon hangs, the resource problem is likely caused by a factor external to the VM.
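As a complement to top, the internal CPU check can be scripted. The following is a minimal sketch, not part of vTM, that assumes a Linux guest and reads the aggregate "cpu" line of /proc/stat twice to work out how the elapsed CPU time was spent. The function names are hypothetical. The "steal" figure is an addition not mentioned in this article: it counts time the hypervisor gave to other guests, but not every hypervisor exposes it, so a value of zero does not rule out host contention.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux (see proc(5)).
# The trailing guest/guest_nice fields are ignored: they are already counted
# inside user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat content as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent busy, in iowait, and in steal."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    busy = total - delta["idle"] - delta["iowait"] - delta["steal"]
    return {
        "busy": 100.0 * busy / total,
        "iowait": 100.0 * delta["iowait"] / total,
        "steal": 100.0 * delta["steal"] / total,
    }


def sample_cpu(interval=5.0, path="/proc/stat"):
    """Take two samples 'interval' seconds apart and return the percentages."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_percentages(before, after)
```

Calling sample_cpu() inside the VM returns a dictionary with the three percentages. A low "busy" value at the time of the hangs points away from internal CPU contention, in line with the reasoning above.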

Detected incidents are logged in the $ZEUSHOME/zxtm/log/procmon file. The same file is also included in the technical support report (TSR) and typically looks like the following:

[19/Jul/2018:12:34:54 +0000]     procmon hung for 3.814 seconds
[19/Jul/2018:12:35:46 +0000]     procmon hung for 3.419 seconds
[19/Jul/2018:12:35:47 +0000]     Process 23719 appears to be hanging
[19/Jul/2018:12:35:52 +0000]     procmon hung for 4.896 seconds
[19/Jul/2018:12:35:53 +0000]     Process 23719 has recovered
[19/Jul/2018:12:38:38 +0000]     Process 23720 appears to be hanging
[19/Jul/2018:12:38:42 +0000]     procmon hung for 4.202 seconds
[19/Jul/2018:12:38:43 +0000]     Process 23720 has recovered


When such entries are seen and the internal resources are not over-utilised, investigate the usage, configuration and performance metrics of the virtualized environment (hypervisor).
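To see how often and for how long the hangs occur, the procmon log can be summarised with a short script. This is a minimal sketch, not a vTM tool: it assumes every relevant line follows the "procmon hung for N seconds" format shown in the sample above, and the function names are hypothetical.

```python
import re
import sys
from datetime import datetime

# Matches lines such as:
# [19/Jul/2018:12:34:54 +0000]     procmon hung for 3.814 seconds
HUNG_RE = re.compile(r"^\[([^\]]+)\]\s+procmon hung for ([0-9.]+) seconds")


def parse_hangs(lines):
    """Return a list of (timestamp, seconds) for each 'procmon hung' line."""
    hangs = []
    for line in lines:
        match = HUNG_RE.match(line.strip())
        if match:
            # %b expects English month abbreviations (C/English locale).
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            hangs.append((stamp, float(match.group(2))))
    return hangs


def summarize(hangs):
    """Return the number of hangs, the longest hang, and the total hang time."""
    durations = [seconds for _, seconds in hangs]
    return {
        "count": len(durations),
        "longest": max(durations, default=0.0),
        "total": sum(durations),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to the procmon log file>
    with open(sys.argv[1]) as log_file:
        found = parse_hangs(log_file)
    print(summarize(found))
    for stamp, seconds in found:
        print(stamp.isoformat(), seconds)
```

The printed timestamps can then be lined up against hypervisor performance metrics for the same period.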

Possible causes and solutions:

1) Too many other virtual machines running on the host. In this case, for example, on ESXi, it is recommended to:

Reserve 100% of RAM and at least some CPU. For example, in the vSphere client, under the properties of the VM where the vTM is running:

- Right Click > Edit Settings > Resources > Memory > Reservation > (reserve the whole amount of RAM by moving the slider all the way to the right)

- Right Click > Edit Settings > Resources > CPU > Reservation > (reserve at least some CPU, 1000 MHz or more)

2) Another possible cause of hanging processes is storage with insufficient I/O bandwidth. Slow disk access can cause a process to hang while it waits for a read or write to return. A strong clue is the disk wait times (anything over 25 is considered high) in the sar output collected in the TSR. If high wait times coincide with the most recent procmon hangs, the problem can be attributed to a slow storage system, which should be investigated locally.

Created By: Andy Chernyak
