联系:手机/微信(+86 17813235971) QQ(107644445)
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
以前写过一篇在linux平台rac环境,心跳网络通过网线直连,当其中一台机器关机之后,另外一个节点无法检测到心跳网络是active,导致无法启动的情况:私网直连后遗症:一节点无法启动导致另外节点haip无法启动
昨天晚上在aix环境中遇到类似情况,由于某种原因,需要关闭rac的一个节点,另外一个节点启动crs的过程中,haip始终无法启动,虽然haip起不来,但是过了一会儿,asm服务启动成功,磁盘组mount,数据库正常open(这个和linux环境有一定的区别,linux 下面11.2.0.4的rac,如果haip无法启动,默认情况启动asm服务),业务临时恢复
bash-4.2$ crsctl status res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE ONLINE db2 Started
ora.cluster_interconnect.haip
1 ONLINE OFFLINE
ora.crf
1 ONLINE ONLINE db2
ora.crsd
1 ONLINE ONLINE db2
ora.cssd
1 ONLINE ONLINE db2
ora.cssdmonitor
1 ONLINE ONLINE db2
ora.ctssd
1 ONLINE ONLINE db2 OBSERVER
ora.diskmon
1 OFFLINE OFFLINE
ora.drivers.acfs
1 ONLINE ONLINE db2
ora.evmd
1 ONLINE ONLINE db2
ora.gipcd
1 ONLINE ONLINE db2
ora.gpnpd
1 ONLINE ONLINE db2
ora.mdnsd
1 ONLINE ONLINE db2
分析haip对应的日志如下
[ USRTHRD][7257]{0:0:221} Starting Probe for ip 169.254.57.103
[ USRTHRD][7257]{0:0:221} Transitioning to Probe State
[ USRTHRD][7257]{0:0:221} Arp::sProbe {
[ USRTHRD][7257]{0:0:221} Arp::sSend: sending type 1
[ USRTHRD][7257]{0:0:221} [NetHAWork] thread hit OSD exception failed to send arp
[ USRTHRD][7257]{0:0:221} (null) category: -2, operation: write, loc: arpsend:1,os, OS error: 69, other:
[ USRTHRD][7257]{0:0:221} [NetHAWork] thread stopping
[ USRTHRD][7257]{0:0:221} Thread:[NetHAWork]isRunning is reset to false here
[ USRTHRD][5201]{0:0:221} use all detected INF
[ USRTHRD][5201]{0:0:221} Thread:[NetHAWork]thread constructor
[ USRTHRD][5201]{0:0:221} HAIP: Moving ip '' from inf 'en6' to inf 'en6'
[ USRTHRD][5201]{0:0:221} pausing thread
[ USRTHRD][5201]{0:0:221} posting thread
[ USRTHRD][5201]{0:0:221} Waiting for HAIP work thread to cleanup ARP
[ USRTHRD][5201]{0:0:221} timeout to wait thread to cleanup ARP
[ USRTHRD][5201]{0:0:221} Thread:[NetHAWork]start {
[ USRTHRD][5201]{0:0:221} Thread:[NetHAWork]start }
[ USRTHRD][7514]{0:0:221} [NetHAWork] thread started
[ USRTHRD][7514]{0:0:221} Arp::sCreateSocket {
[ USRTHRD][7514]{0:0:221} Arp::sCreateSocket }
[ USRTHRD][5201]{0:0:221} use all detected INF
[ USRTHRD][7514]{0:0:221} Failed to check 169.254.57.103 on en6
[ USRTHRD][7514]{0:0:221} (null) category: 0, operation: , loc: , OS error: 0, other:
这里初步看是把169.254.57.103这个ip增加到en6的网卡上,但是由于OS error: 69失败了.通过aix工程师分析,这个错误可能是物理网络不通导致,对网卡状态进行分析
bash-4.2# entstat -d ent6
-------------------------------------------------------------
ETHERNET STATISTICS (ent6) :
Device Type: 2-Port Gigabit Ethernet-SX PCI-Express Adapter (14103f03)
Hardware Address: 40:f2:e9:91:eb:7a
Elapsed Time: 0 days 1 hours 38 minutes 14 seconds
Transmit Statistics: Receive Statistics:
-------------------- -------------------
Packets: 4128 Packets: 5077
Bytes: 35215659 Bytes: 370511
Interrupts: 0 Interrupts: 4815
Transmit Errors: 0 Receive Errors: 0
Packets Dropped: 0 Packets Dropped: 0
Bad Packets: 0
Max Packets on S/W Transmit Queue: 1
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 0
Broadcast Packets: 12 Broadcast Packets: 0
Multicast Packets: 62 Multicast Packets: 66
No Carrier Sense: 0 CRC Errors: 0
DMA Underrun: 0 DMA Overrun: 0
Lost CTS Errors: 0 Alignment Errors: 0
Max Collision Errors: 0 No Resource Errors: 0
Late Collision Errors: 0 Receive Collision Errors: 0
Deferred: 0 Packet Too Short Errors: 0
SQE Test: 0 Packet Too Long Errors: 0
Timeout Errors: 0 Packets Discarded by Adapter: 0
Single Collision Count: 0 Receiver Start Count: 0
Multiple Collision Count: 0
Current HW Transmit Queue Length: 0
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 2000
Driver Flags: Up Broadcast Simplex
Limbo 64BitSupport ChecksumOffload
LargeSend DataRateSet
2-Port Gigabit Ethernet-SX PCI-Express Adapter (14103f03) Specific Statistics:
------------------------------------------------------------------------------
Link Status : Down <======表示网络链路状态异常(一般就是直连导致,如果通过交换机不会这样)
Media Speed Selected: Auto negotiation
Media Speed Running: Unknown
PCI Mode: PCI-Express X4
Relaxed Ordering: Enabled
TLP Size: 256
MRR Size: 4096
Jumbo Frames: Disabled
TCP Segmentation Offload: Enabled
TCP Segmentation Offload Packets Transmitted: 3625
TCP Segmentation Offload Packet Errors: 0
Transmit and Receive Flow Control Status: Enabled
XON Flow Control Packets Transmitted: 0
XON Flow Control Packets Received: 0
XOFF Flow Control Packets Transmitted: 0
XOFF Flow Control Packets Received: 0
Transmit and Receive Flow Control Threshold (High): 40960
Transmit and Receive Flow Control Threshold (Low): 20480
Transmit and Receive Storage Allocation (TX/RX): 4/44
通过解决掉异常问题,把故障主机启动之后,启动该机器之后,网络链路状态恢复正常,启动haip成功,但是由于该集群在haip异常的时候启动成功,心跳网络使用是直接的私网ip(没有使用haip),因此还是要对集群进行一次重启恢复到正常状态.







































