The vTM VA and software contain a process monitor (procmon) that checks for stalled processes, including itself, so it can also detect whole-VM hangs. The procmon script is a standalone process that relies only on resources from the underlying OS. If this process is hanging, it is generally for one of two reasons: resource contention within the VM, or resource contention outside the VM (e.g. too many other VMs running on the host hardware).
First check whether the system resources (RAM/CPU) internal to the VM are being heavily used, e.g. with top. If the internal resources are heavily utilised, the usual recommendation is to increase whichever resource is overused, i.e. add more RAM or CPUs to the virtual machine. If they show very little use around the time of the procmon hangs, the resource issue is likely caused by a factor external to the VM.
Detected incidents are logged in the $ZEUSHOME/zxtm/log/procmon file. The same file is also included in the technical support report (TSR), and typically looks like the following:
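As a complement to top, the internal RAM and CPU pressure can be sampled with a short script. The following is a minimal sketch, assuming a Linux guest whose kernel exposes MemAvailable in /proc/meminfo; the function names are illustrative and not part of vTM, and no threshold is implied by the returned values.

```python
def mem_available_fraction(meminfo_text):
    """Fraction of RAM reported as available, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value (kB for memory fields).
            fields[name.strip()] = int(parts[0])
    return fields["MemAvailable"] / fields["MemTotal"]


def load_per_cpu(loadavg_text, cpu_count):
    """1-minute load average divided by the CPU count, given the text of /proc/loadavg."""
    return float(loadavg_text.split()[0]) / cpu_count
```

On the VM itself these could be fed with the contents of /proc/meminfo and /proc/loadavg and with os.cpu_count(). A low available-memory fraction or a load per CPU well above 1 around the time of the hangs points towards contention inside the VM.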
[19/Jul/2018:12:34:54 +0000] procmon hung for 3.814 seconds
[19/Jul/2018:12:35:46 +0000] procmon hung for 3.419 seconds
[19/Jul/2018:12:35:47 +0000] Process 23719 appears to be hanging
[19/Jul/2018:12:35:52 +0000] procmon hung for 4.896 seconds
[19/Jul/2018:12:35:53 +0000] Process 23719 has recovered
[19/Jul/2018:12:38:38 +0000] Process 23720 appears to be hanging
[19/Jul/2018:12:38:42 +0000] procmon hung for 4.202 seconds
[19/Jul/2018:12:38:43 +0000] Process 23720 has recovered
When this is seen and the internal resources are not over-utilised, investigate the usage, configuration and performance metrics of the virtualized environment (hypervisor).
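To see how often and for how long the hangs occur, the procmon log entries shown above can be summarised with a short script. This is a sketch that assumes only the three message shapes visible in the sample; the function name parse_procmon is illustrative, and any other message types in a real log are simply skipped.

```python
import re
from datetime import datetime

# Line shape taken from the sample log: "[19/Jul/2018:12:34:54 +0000] message"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")
HUNG_RE = re.compile(r"^procmon hung for (?P<secs>\d+(?:\.\d+)?) seconds$")
PROC_RE = re.compile(r"^Process (?P<pid>\d+) (?P<state>appears to be hanging|has recovered)$")


def parse_procmon(lines):
    """Return (hangs, stalls) parsed from procmon log lines.

    hangs:  list of (timestamp, seconds) for "procmon hung" entries
    stalls: dict mapping pid to a list of hang-to-recovery durations in seconds
    """
    hangs = []
    stalls = {}
    pending = {}  # pid -> timestamp of a hang that has not yet recovered
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        try:
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # unexpected timestamp format; skip the line
        msg = m.group("msg")
        hung = HUNG_RE.match(msg)
        if hung:
            hangs.append((ts, float(hung.group("secs"))))
            continue
        proc = PROC_RE.match(msg)
        if not proc:
            continue
        pid = int(proc.group("pid"))
        if proc.group("state") == "appears to be hanging":
            pending[pid] = ts
        elif pid in pending:
            duration = (ts - pending.pop(pid)).total_seconds()
            stalls.setdefault(pid, []).append(duration)
    return hangs, stalls
```

Passing the lines of $ZEUSHOME/zxtm/log/procmon to parse_procmon gives the timestamps needed to compare the hangs against resource usage inside the VM and against hypervisor metrics for the same period.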
Possible causes and solutions:
1) Too many other virtual machines running on the host. In this case, on ESXi for example, it is recommended to:
Reserve 100% of RAM and at least some CPU. For example, in the vSphere client, under the properties of the VM where the vTM is running:
- Right Click > Edit Settings > Resources > Memory > Reservation > (reserve the whole amount of RAM by moving the slider all the way to the right)
- Right Click > Edit Settings > Resources > CPU > Reservation > (reserve at least some CPU, 1000 MHz or more)
2) Storage with insufficient I/O bandwidth. Slow disk access can cause a process to hang while it waits for a read or write to disk storage to return. A strong clue is the disk wait times (anything over 25 is considered high) in the SAR output that is collected in the TSR. If high wait times appear around the times of the most recent procmon hangs, the problem is most likely caused by a slow storage system, which should be investigated locally.
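The comparison between disk wait times and hang times can be sketched as follows. This assumes the wait samples have already been extracted from the SAR output into (timestamp, device, wait) tuples, since the exact column layout varies between sysstat versions. The threshold of 25 comes from the text above; the five-minute window and the function name are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

HIGH_WAIT = 25  # threshold from the text; same unit as the wait column in the SAR output


def high_wait_near_hangs(hang_times, wait_samples, window=timedelta(minutes=5)):
    """Return the wait samples above HIGH_WAIT that fall within `window` of any hang.

    hang_times:   iterable of datetime objects (times of procmon hangs)
    wait_samples: iterable of (datetime, device_name, wait_value) tuples
    """
    hang_times = list(hang_times)
    flagged = []
    for ts, device, wait in wait_samples:
        if wait <= HIGH_WAIT:
            continue
        if any(abs(ts - hang) <= window for hang in hang_times):
            flagged.append((ts, device, wait))
    return flagged
```

An empty result suggests that storage latency is unlikely to explain the hangs, while several flagged samples close to the hang times support the slow-storage explanation.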