KB9780 - Is my PCS cluster failing over because of Packet loss or System failure?

Last Modified Date: 3/9/2017 1:04 AM
Synopsis
This article outlines the possible causes of a cluster failover.
Problem or Goal
My cluster is failing over. What could be causing the failover?
Cause
Solution

There are two basic reasons for a cluster to fail over: packet loss and system failure. A system failure can be caused by any number of hardware failures (NIC, power supply, etc.). Packet loss can be due to a failing NIC, blocked ports, or any number of failing networking devices between the nodes in the cluster.

Some of the symptoms of a system failure are: the node won't power on, one or more ports are unreachable, and/or the serial console is not responding.

For information on how a cluster fails over, consult: KB9779 - How does an Active/Active cluster and an Active/Passive cluster failover?

 

The troubleshooting steps to determine if it is a system failure or a packet loss scenario are the following:

To view the flowchart for the steps listed below, select:  KB9780 Flowchart

Step 1 Is the node powered on? Verify that the problem node is powered on by checking the power LED on the front of the unit (on either the right or the left side). Also check that the link lights are on for the ports in use.

  • Yes - Continue with Step 2
  • No  - If the unit will not power on, an RMA may need to be created for the node. Go to "RMA the device" below.

Step 2 Is the node reachable? Verify that the problem node is reachable from the internal network. This can be done by browsing directly to the node via the hostname or the IP address, or by using the “ping” command from a command prompt.

  • Yes - Continue with Step 3
  • No  - If the node is powered on but cannot be reached either with the ping command or by browsing directly to it, try to connect to the node via the serial console. Use a null modem cable and a terminal program (HyperTerminal, PuTTY, etc.) with the following connection settings:

    [Image: serial console connection settings]

    With the serial console connection, try to reboot the node. To do this, select the System Operations or the Reboot/Shutdown/Restart option, then the Reboot this IVE option.

    Here is an example of the serial console output when rebooting a PCS running 5.4R1:

    [Image: serial console output during a reboot]

    If the node is still not reachable after the reboot, an RMA may need to be created for the node. 
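
The reachability checks in Step 2 can be scripted from a workstation on the internal network. The following is a minimal sketch, not a vendor tool. It assumes the system `ping` command is available and that the node's web interface listens on TCP port 443; the function names and the default port are illustrative and do not come from this article.

```python
import socket
import subprocess
import sys


def ping_command(host, count=3):
    """Build a ping command line (-n on Windows, -c elsewhere)."""
    flag = "-n" if sys.platform.startswith("win") else "-c"
    return ["ping", flag, str(count), host]


def responds_to_ping(host, count=3):
    """Return True if the system ping command exits with status 0."""
    try:
        result = subprocess.run(
            ping_command(host, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def accepts_tcp(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Port 443 is an assumption (HTTPS); adjust it for your deployment.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def node_reachable(host):
    """Treat the node as reachable if it answers ping or accepts TCP."""
    return responds_to_ping(host) or accepts_tcp(host)

# Example (hypothetical hostname): node_reachable("node1.example.com")
```

A node that fails both checks is not necessarily down, because ICMP or the port may be filtered along the path. Confirm with the serial console as described above.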

Step 3 Are the cluster nodes able to communicate with each other? To verify that the cluster nodes can communicate with each other, use the Cluster Troubleshooter tool. For information on using the tool, see: KB9746 - How to use the Cluster Troubleshooter tool

  • Yes - Continue with Step 4
  • No  - If communication between the nodes fails, a firewall between the nodes may be blocking ports that need to be opened. See: KB9682 - Cluster members cannot communicate what ports need to be opened?

    If the ports are open and the nodes are still unable to communicate, another networking device between the nodes may be failing. Work with the network administrator who oversees the devices (switches, routers, firewalls, etc.) between the two nodes to check for errors or issues, including issues with a WAN connection.

Step 4 Are there any system generated snapshots on the problem node? If the physical node is OK (powered on, reachable, etc.) and the communication between the nodes is OK, the issue may be with either the load on the cluster (the number of users or the throughput) or the software version the cluster is running. To check for system generated snapshots, log in as an administrator and navigate to Troubleshooting > System Snapshot.

[Image: System Snapshot page]

Here is an example of a snapshot that was generated by an Administrator from an IVE running 5.5R1:

[Image: administrator-generated snapshot]

Look for and collect any snapshots in the list that were not generated by an administrator (System generated, Process generated, etc.).

If there are no system generated snapshots, there may be errors in the events log. To view and collect the events log, navigate to Log/Monitoring > Events Log > Log

[Image: events log]

Check for any messages in the log that are Minor, Major, or Critical.

[Image: example log messages]

To collect the log, click on Save Log As...

[Image: Save Log As... button]
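
Once the events log has been saved, scanning it for Minor, Major, or Critical messages can be automated. The sketch below assumes the saved file is plain text and that each entry carries its severity as a whole, capitalized word on the same line. The saved log format is not described in this article, so treat the matching rule and the function names as assumptions and adjust them to the actual file.

```python
import re

# Severities that the article says to look for.
SEVERITIES = ("Minor", "Major", "Critical")
_PATTERN = re.compile(r"\b(" + "|".join(SEVERITIES) + r")\b")


def flag_lines(lines):
    """Return (line_number, severity, text) for each line naming a severity.

    Matching is case-sensitive and on whole words. A message whose text
    merely contains one of these words is flagged too, so review the
    hits by hand.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = _PATTERN.search(line)
        if match:
            hits.append((number, match.group(1), line.rstrip("\n")))
    return hits


def flag_file(path):
    """Run flag_lines over a saved log file (path is supplied by the caller)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return flag_lines(handle)
```

Calling `flag_file` with the path of the saved log returns the flagged entries with their line numbers, which can be referenced when opening a case.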

With the events log and/or snapshots, open a case with Pulse Secure Support.
 

Step 5 To open a case, either call +1 844-751-7629 (US) or log in to the Case Management tool via the Pulse Secure support site at https://my.pulsesecure.net/ and click "Create a Case".
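
The steps above can be summarized as a small decision function. This is only a sketch of one reading of the article: the function name, the argument names, and the mapping of each failed check to a likely cause are illustrative, and a node that is unreachable in Step 2 may also turn out to have a network problem rather than a hardware one.

```python
def next_action(powered_on, reachable, nodes_communicate, has_system_snapshots):
    """Return (likely_cause, action) for the first check that fails.

    Each argument is the yes/no answer to the question asked in
    Steps 1 through 4, in order.
    """
    if not powered_on:
        return ("system failure",
                "Step 1: unit will not power on; an RMA may be needed")
    if not reachable:
        return ("system failure",
                "Step 2: reboot from the serial console; "
                "an RMA may be needed if still unreachable")
    if not nodes_communicate:
        return ("packet loss",
                "Step 3: check firewall ports (KB9682) and the "
                "network devices between the nodes")
    if has_system_snapshots:
        return ("load or software",
                "Step 4: collect the system generated snapshots "
                "and open a case")
    return ("load or software",
            "Step 4: collect the events log and open a case")
```

For example, `next_action(True, True, False, False)` points at Step 3, the packet loss branch.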

RMA the device
To create an RMA, you will need the serial number from the problem node. The serial number is 16 digits long and can be found on a sticker on the back of the unit or on the underside of the unit. Once you have the serial number, open a case either by calling customer care at +1 844-751-7629 (US) or by logging in to the Case Management tool via the Pulse Secure Global Support Center (PSGSC) at https://my.pulsesecure.net/
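
Before opening the case, it can help to sanity-check the serial number as transcribed from the sticker. The sketch below takes "16 digits" literally, that is, exactly sixteen characters from 0 to 9. The function name is illustrative, and the check only catches transcription slips; it does not confirm that the serial number exists.

```python
def looks_like_serial(value):
    """Return True if value is exactly 16 ASCII digits after trimming spaces."""
    cleaned = value.strip()
    return len(cleaned) == 16 and cleaned.isascii() and cleaned.isdigit()
```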

Related Links
Attachment 1 
Created By: Data Deployment
