KB16782 - Common reasons for Active/Passive cluster VIP failover

Last Modified Date: 3/4/2017 8:58 PM
Synopsis
This article lists possible causes of an Active/Passive cluster VIP failover between nodes. It also provides instructions for capturing logs for further analysis by Pulse Secure Support.

Problems at the Physical, Data Link, Network, and Transport layers (OSI model) affect every network device, including PPS/PCS clustering services. High CPU utilization, high memory utilization, and memory leaks (especially "out of swap memory" conditions) also affect system services.
 
Problem or Goal
Delayed or missing responses from gateways affect cluster services and cause the cluster VIP to fluctuate between nodes. The following messages are logged to the event log:
Internal/external gateway 'x.x.x.x' unreachable
VIP deactivated on node X on all ports, reason internal/external port failed
VIP activated on node X on all ports, reason: other node yielded
VIP deactivated on Node X on all ports, reason other node is better

Note: This applies to both the internal and external (if enabled) network interfaces.
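
To gauge how often the VIP is moving, the messages above can be counted in an exported event log. The following is a minimal sketch in Python, not a Pulse Secure tool. It matches only the message text quoted in this article; the layout of the rest of each log line (timestamps, prefixes) is unknown here, and the function name `summarize_vip_events` is a hypothetical helper.

```python
import re
from collections import Counter

# Patterns based on the event messages quoted in this article.
# Assumption: the message text appears verbatim somewhere in each log line;
# any timestamp or prefix around it is ignored.
PATTERNS = {
    "gateway_unreachable": re.compile(r"gateway '([^']+)' unreachable", re.I),
    "vip_deactivated": re.compile(
        r"VIP deactivated on node (\S+) on all ports, reason:? (.+)", re.I
    ),
    "vip_activated": re.compile(
        r"VIP activated on node (\S+) on all ports, reason:? (.+)", re.I
    ),
}


def summarize_vip_events(lines):
    """Count matching events per (event type, gateway address or node name)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(name, match.group(1))] += 1
                break  # a line is counted for at most one event type
    return counts
```

A typical use is to read the saved event log file line by line and pass the lines to `summarize_vip_events`; a high count of activations and deactivations for the same nodes within a short period indicates that the VIP is fluctuating.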
Cause
Solution
Verify whether there are any general network outages or problems (Physical, Data Link). If there are none, check for the following:
  1. High CPU/memory utilization or out-of-memory symptoms, and critical event reports (if any), on the PCS/PPS node dashboard graphs and event logs.
    • Recommended data to review:  This information can be verified from the admin UI under System > Status > Overview.
  2. Mismatch in speed/duplex/MTU settings between the PCS/PPS network interfaces and the respective connected switch ports. 
    • Recommended data to review: This information can be verified from the admin UI under System > Network > Internal Port and/or External Port.
  3. High CPU on gateways/firewall devices resulting in latency, delayed responses and packet drops.
  4. ARP/Proxy ARP security settings/filters on the firewall gateway devices that either drop or do not respond to ARP between the two nodes.  
    • Recommended data to review:  This information can only be verified by a system snapshot while the problem is occurring.
  5. Incorrect ARP broadcast (sent from another device) received by one or both nodes, causing an incorrect MAC address entry on the PCS device.
    • Recommended data to review:  This information can only be verified by a system snapshot while the problem is occurring.
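
The speed/duplex/MTU comparison in check 2 can be sketched as a simple field-by-field comparison. This is an illustrative Python example only: the values are assumed to be copied by hand from the admin UI and from the switch configuration, and the function name and dictionary keys (`speed`, `duplex`, `mtu`) are hypothetical.

```python
def find_link_mismatches(node_port, switch_port, keys=("speed", "duplex", "mtu")):
    """Return {setting: (node value, switch value)} for settings that differ.

    Both arguments are plain dictionaries filled in manually; a setting
    missing from either side is reported as None for that side.
    """
    mismatches = {}
    for key in keys:
        node_value = node_port.get(key)
        switch_value = switch_port.get(key)
        if node_value != switch_value:
            mismatches[key] = (node_value, switch_value)
    return mismatches
```

For example, comparing a node port recorded as `{"speed": "1000", "duplex": "full", "mtu": 1500}` with a switch port recorded as `{"speed": "1000", "duplex": "half", "mtu": 1500}` reports only the duplex setting.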
If the observed symptom is caused by high CPU/memory utilization during peak usage hours, it may be triggered by a sudden ramp-up or burst in user activity.
  • If the ramp-up occurs at a consistent time every day, please engage Pulse Secure Support to analyze the load on the device.
    • Recommended data to gather:  Gather a system snapshot (while the problem is occurring) and screenshots of all system status graphs for 1 day and 2 days.  Note: Graphs should cover the time when the problem is occurring.
Note:  To further optimize performance, disable synchronization (if not required) for "log messages / user sessions / last access time" at System > Cluster > Properties > Synchronization.



Recommended logs to gather:


If the problem persists, perform the following steps on all nodes:
  1. Enable node monitoring at Maintenance > Troubleshooting > Monitoring > Node Monitor.
    1. Enter 30 as the maximum log size and 30 seconds as the monitoring interval.
    2. Select the checkboxes for Node monitoring enabled, ifconfig enabled, top enabled, cachesize enabled, and dsstatdump enabled, and click Save Changes.
  2. Enable debug logging at Maintenance > Troubleshooting > Monitoring, and specify the following:
    1. Detail Level: 10 / Size: 250 MB
    2. Enter the following event codes: dsnetd::garpsweep,dsnetd::health,dsnetd::ipat,DSCluster:dsipatd, -DSUtil,-DSLog,-DSConfig, and click Save Changes.

After 5-10 minutes (during the failover symptoms), collect the following on all nodes:
 
  • Under Maintenance > Troubleshooting > System Snapshot, select the checkboxes for Include debug log and Include system config, and click Take Snapshot.  Save the file locally.
  • Navigate to System > Log/Monitoring > Event Logs and click Save All Logs.  Save the file locally.
  • Save screenshots of all "System dashboard / system capacity graphs" on the System > Status page.
  • After the logs are captured, disable the node monitoring and debug logging that were enabled in steps 1 and 2.
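
Once the files from all nodes are saved locally, they can be bundled into one archive before being attached to the support request. The following Python sketch uses only the standard library; the function name `bundle_support_files` and any file names passed to it are hypothetical, and nothing here is required by Pulse Secure Support.

```python
import zipfile
from pathlib import Path


def bundle_support_files(paths, archive_path):
    """Write the existing files in `paths` into one ZIP archive.

    Files are stored under their base names, so names should be unique
    (for example, prefixed with the node name). Paths that do not exist
    are skipped. Returns the list of names that were added.
    """
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            if path.is_file():
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added
```

Comparing the returned list with the expected set of files (snapshot, event logs, and screenshots for each node) is a quick way to confirm that nothing was missed.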

Please open a Tech SR at https://my.pulsesecure.net and attach all logs.
Created By: Data Deployment
