init.cssd startcheck: CRS fails to start because HP Service Guard is not running



Author: 惜分飞 © All rights reserved [Do not reproduce in any form without my consent; I reserve the right to pursue legal action.]

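This morning at a customer site, the customer told me that after the OCR and voting disks had been replaced in one environment, CRS would no longer start, and asked me to take a look. Environment: HP RAC (only one node in use) + Oracle Database 10.2.0.5.
crsctl start crs reports success, but the stack never actually comes up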

# /app/oracle/product/10.2.0/crs/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# ps -ef|grep crs
    root  6461     1  0  May 19  ?         0:00 /bin/sh /sbin/init.d/init.crsd run
    root 29719 23678  0 10:04:51 pts/tc    0:00 grep crs

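No new log entries are written either (the newest entries in the log directory are from May 24, while the current date is May 25)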

[xifenfei01][orawj][/root/xifenfei]#ls -ltr
total 148
drwxr-x---   2 oracle     dba             96 May 15  2014 admin
drwxr-x---   2 root       dba             96 May 15  2014 crsd
drwxr-x---   2 oracle     dba             96 May 15  2014 evmd
drwxrwxr-t   5 oracle     dba           1024 Jun  4  2014 racg
drwxr-x---   5 oracle     dba           1024 May 17 22:50 cssd
-rw-rw-r--   1 root       dba          61568 May 24 15:26 alertxifenfei01.log
drwxr-x---   2 oracle     dba           3072 May 24 15:43 client
[xifenfei01][orawj][/root/xifenfei]#date
Mon, May 25, 2015 11:30:09 AM

Voting disk and OCR information

[xifenfei01][orawj][/root/xifenfei]#ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :    1441492
         Used space (kbytes)      :       5972
         Available space (kbytes) :    1435520
         ID                       : 1714667730
         Device/File Name         : /dev/vgc01/rCMPR_VGC01_OCR1
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/vgc02/rCMPR_VGC02_OCR2
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded
[xifenfei01][orawj][/root/xifenfei]#crsctl query css votedisk
 0.     0    /dev/vgc01/rCMPR_VGC01_VOTE1
 1.     0    /dev/vgc02/rCMPR_VGC02_VOTE2
 2.     0    /dev/vgc03/rCMPR_VGC03_VOTE3
located 3 votedisk(s).

Contents of the ocr.loc file

# more /var/opt/oracle/ocr.loc
#Device/file /dev/vgc02/rCMPR_VGC02_OCR2 getting replaced by device /dev/vgc02/rCMPR_VGC02_OCR2
ocrconfig_loc=/dev/vgc01/rCMPR_VGC01_OCR1
ocrmirrorconfig_loc=/dev/vgc02/rCMPR_VGC02_OCR2
local_only=false

This shows that the voting disk and OCR configuration looks normal

The process list shows init.cssd startcheck processes

[xifenfei01][orawj][/root/xifenfei]#ps -ef|grep init
    root     1     0  0  May 19  ?         0:03 init
    root   119     0  0  May 19  ?         0:00 pagetable_init_daemon
    root   115     0  0  May 19  ?         0:00 mdep_initiator_thread
    root 26820 26792  0 10:49:53 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 26791     1  0 10:49:53 ?         0:00 /bin/sh /sbin/init.d/init.crsd run
    root 27183 23698  0 10:50:23 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 26792     1  0 10:49:53 ?         0:00 /bin/sh /sbin/init.d/init.cssd fatal
    root 23698     1  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.evmd run
    root 26816 26791  0 10:49:53 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
  oracle 20534 11033  0 11:30:35 pts/ta    0:00 grep init

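In most cases, init.cssd startcheck processes that linger like this mean that the storage cannot be accessed or that the third-party clusterware is unavailable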

Check the VG status

VG Name                     /dev/vgc01
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      9
Open LV                     9
Max PV                      255
Cur PV                      1
Act PV                      1
Max PE per PV               3200
VGDA                        2
PE Size (Mbytes)            32
Total PE                    3199
Alloc PE                    736
Free PE                     2463
Total PVG                   0
Total Spare PVs             0
Total Spare PVs in use      0
VG Version                  1.0
VG Max Size                 25500g
VG Max Extents              816000
VG Name                     /dev/vgc02
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      9
Open LV                     9
Max PV                      255
Cur PV                      1
Act PV                      1
Max PE per PV               3200
VGDA                        2
PE Size (Mbytes)            32
Total PE                    3199
Alloc PE                    736
Free PE                     2463
Total PVG                   0
Total Spare PVs             0
Total Spare PVs in use      0
VG Version                  1.0
VG Max Size                 25500g
VG Max Extents              816000
VG Name                     /dev/vgc03
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      6
Open LV                     6
Max PV                      255
Cur PV                      1
Act PV                      1
Max PE per PV               3200
VGDA                        2
PE Size (Mbytes)            32
Total PE                    3199
Alloc PE                    448
Free PE                     2751
Total PVG                   0
Total Spare PVs             0
Total Spare PVs in use      0
VG Version                  1.0
VG Max Size                 25500g
VG Max Extents              816000

This shows that all three VGs holding the voting disks and OCR are available

Check the permissions of the voting disks and OCR

# ls -l /dev/vgc0*/rCMPR*|grep -v .dbf|grep -v .log|grep -v .ctl
crw-r-----   1 oracle     dba         64 0x020008 May 24 14:40 /dev/vgc01/rCMPR_VGC01_OCR1
crw-r-----   1 oracle     dba         64 0x020009 May 24 14:41 /dev/vgc01/rCMPR_VGC01_VOTE1
crw-r-----   1 oracle     dba         64 0x030008 May 24 14:41 /dev/vgc02/rCMPR_VGC02_OCR2
crw-r-----   1 oracle     dba         64 0x030009 May 24 14:41 /dev/vgc02/rCMPR_VGC02_VOTE2
crw-r-----   1 oracle     dba         64 0x040006 May 24 14:41 /dev/vgc03/rCMPR_VGC03_VOTE3

Change the permissions to 777 and try again

# chmod 777 /dev/vgc0*/rCMPR*|grep -v .dbf|grep -v .log|grep -v .ctl
#  ls -l /dev/vgc0*/rCMPR*|grep -v .dbf|grep -v .log|grep -v .ctl
crwxrwxrwx   1 oracle     dba         64 0x020008 May 24 14:40 /dev/vgc01/rCMPR_VGC01_OCR1
crwxrwxrwx   1 oracle     dba         64 0x020009 May 24 14:41 /dev/vgc01/rCMPR_VGC01_VOTE1
crwxrwxrwx   1 oracle     dba         64 0x030008 May 24 14:41 /dev/vgc02/rCMPR_VGC02_OCR2
crwxrwxrwx   1 oracle     dba         64 0x030009 May 24 14:41 /dev/vgc02/rCMPR_VGC02_VOTE2
crwxrwxrwx   1 oracle     dba         64 0x040006 May 24 14:41 /dev/vgc03/rCMPR_VGC03_VOTE3
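
One detail in the chmod command above: the pipe into grep only filters chmod's (empty) output, so chmod itself is applied to every path the glob matches, including any .dbf, .log and .ctl devices. A minimal sketch of a more selective approach follows. The helper names are hypothetical, and it assumes the database devices can be told apart by a .dbf/.log/.ctl suffix, which the source does not confirm.

```shell
# Hypothetical helper: print the paths that are NOT datafile, redo log or
# control file devices, one per line. The suffix rule is an assumption.
list_cluster_devices() {
    for dev in "$@"; do
        case "$dev" in
            *.dbf|*.log|*.ctl)
                ;;                          # skip database devices
            *)
                if [ -e "$dev" ]; then
                    printf '%s\n' "$dev"
                fi
                ;;
        esac
    done
}

# Hypothetical helper: apply a mode only to the filtered paths.
chmod_cluster_devices() {
    mode=$1
    shift
    list_cluster_devices "$@" | while IFS= read -r dev; do
        chmod "$mode" "$dev"
    done
}

# Example (as root), restoring the mode seen in the first listing:
#   chmod_cluster_devices 640 /dev/vgc0*/rCMPR*
```

Since 777 turned out not to help, putting the devices back to the mode from the first listing (crw-r-----) after the experiment seems the safer choice.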

Kill the related processes and retry

# ps -ef|grep init
    root     1     0  0  May 19  ?         0:03 init
    root   119     0  0  May 19  ?         0:00 pagetable_init_daemon
    root   115     0  0  May 19  ?         0:00 mdep_initiator_thread
    root  6458     1  0  May 19  ?         0:00 /bin/sh /sbin/init.d/init.evmd run
    root 20975     1  0 10:40:11 ?         0:00 /bin/sh /sbin/init.d/init.crsd run
    root 20976     1  0 10:40:11 ?         0:00 /bin/sh /sbin/init.d/init.cssd fatal
    root 21006 20976  0 10:40:11 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 20997 20975  0 10:40:11 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 21152 23678  0 10:40:18 pts/tc    0:00 grep init
vi /etc/inittab
#h1:3:respawn:/sbin/init.d/init.evmd run >/dev/null 2>&1 </dev/null
#h2:3:respawn:/sbin/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
#h3:3:respawn:/sbin/init.d/init.crsd run >/dev/null 2>&1 </dev/null
# /sbin/init q
# ps -ef|grep init.c | grep -v grep | awk '{print $2}' |xargs kill -9
# ps -ef|grep init
    root     1     0  0  May 19  ?         0:03 init
    root   119     0  0  May 19  ?         0:00 pagetable_init_daemon
    root   115     0  0  May 19  ?         0:00 mdep_initiator_thread
    root 21744 23678  1 10:42:31 pts/tc    0:00 grep init

Restart the init processes

vi /etc/inittab
h1:3:respawn:/sbin/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:3:respawn:/sbin/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:3:respawn:/sbin/init.d/init.crsd run >/dev/null 2>&1 </dev/null
~
# /sbin/init q
# ps -ef|grep init
    root     1     0  0  May 19  ?         0:03 init
    root   119     0  0  May 19  ?         0:00 pagetable_init_daemon
    root   115     0  0  May 19  ?         0:00 mdep_initiator_thread
    root 23737 23706  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 23731 23698  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 23706     1  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.crsd run
    root 23698     1  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.evmd run
    root 23887 23678  1 10:45:28 pts/tc    0:00 grep init
    root 23746 23700  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.cssd startcheck
    root 23700     1  0 10:45:23 ?         0:00 /bin/sh /sbin/init.d/init.cssd fatal

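The problem persists after changing the LV permissions, so it is not caused by the permissions or ownership of the voting disks and OCR. Reading the devices with dd and strings also worked without errors.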
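
The dd read test mentioned above is not shown in the session. A minimal sketch of what such a check could look like follows; the helper name can_read and the 1 KB block size are illustrative choices, not the commands actually used on site.

```shell
# Hypothetical helper: read the first N KB of a device or file, discard the
# data, and report whether the read succeeded.
can_read() {
    dev=$1
    kb=${2:-1024}                           # default: first 1 MB
    if dd if="$dev" of=/dev/null bs=1024 count="$kb" 2>/dev/null; then
        echo "OK   $dev"
    else
        echo "FAIL $dev"
        return 1
    fi
}

# Example, run as the oracle user against the devices listed earlier:
#   for d in /dev/vgc01/rCMPR_VGC01_OCR1 /dev/vgc01/rCMPR_VGC01_VOTE1; do
#       can_read "$d"
#   done
# To eyeball the content instead of discarding it:
#   dd if=/dev/vgc01/rCMPR_VGC01_OCR1 bs=1024 count=1024 2>/dev/null | strings | head
```

Running it as the oracle user rather than root matters here, because root can read the devices regardless of their mode.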

Debug the /sbin/init.d/init.cssd startcheck process

[xifenfei01][orawj][/root/xifenfei]#sh -x  /sbin/init.d/init.cssd startcheck
+ ORA_CRS_HOME=/app/oracle/product/10.2.0/crs
+ ORACLE_USER=oracle
+ ORACLE_HOME=/app/oracle/product/10.2.0/crs
+ export ORACLE_HOME
+ export ORA_CRS_HOME
+ export ORACLE_USER
+ DISABLE_OPROCD=false
+ OPROCD_DEFAULT_TIMEOUT=1000
+ OPROCD_DEFAULT_MARGIN=500
+ OPROCD_CHECK_TIMEOUT=2000
+ OPROCD_STOP_TIMEOUT=2000
+ OPROCD_DEFAULT_HISTORGRAM=
+ HOSTN=/bin/hostname
+ EXPRN=/usr/bin/expr
+ CUT=/usr/bin/cut
+ AWK=/bin/awk
+ ECHO=echo
+ TR=/bin/tr
+ /bin/uname
+ [ SunOS = HP-UX ]
+ /bin/uname
+ [ Linux = HP-UX ]
+ + /bin/hostname
HOST=xifenfei01
+ + /usr/bin/expr xifenfei01 : .*
len1=8
+ + /usr/bin/expr match xifenfei01 [0-9]*\.[0-9]*\.[0-9]*\.[0-9]*
len2=0
+ [ 8 != 0 ]
+ + echo xifenfei01
+ /usr/bin/cut -d. -f1
HOST=xifenfei01
+ + echo xifenfei01
+ /bin/tr [:upper:] [:lower:]
HOST=xifenfei01
+ PS=/bin/ps
+ PSE=/bin/ps -e
+ PSEF=/bin/ps -ef
+ HEAD=/bin/head
+ GREP=/bin/grep
+ KILL=/bin/kill
+ KILLTERM=/bin/kill -TERM
+ KILLDIE=/bin/kill -9
+ KILLCHECK=/bin/kill -0 5852
+ SLEEP=/bin/sleep
+ NULL=/dev/null
+ UNAME=/bin/uname
+ CAT=/bin/cat
………………
+ eval /bin/true
+ /bin/true
+ [ 0 != 0 ]
+ eval /bin/ps -ef | /bin/grep '/usr/lbin/cm[g]msd' 1>/dev/null 2>/dev/null
+ /bin/grep /usr/lbin/cm[g]msd
+ /bin/ps -ef
+ 1> /dev/null 2> /dev/null
+ RC=1
+ [ 1 -ne 0 ]
+ /bin/logger -puser.err Oracle Cluster Ready Services waiting for HP-UX Service Guard to start.
+ /bin/sleep 60

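Tracing the shell script with -x shows that CRS is waiting for HP-UX Service Guard to start: the script greps the process list for /usr/lbin/cmgmsd, gets a non-zero return code, logs a message through logger, and sleeps for 60 seconds. This confirms that the cause is HP-UX Service Guard not running.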
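
The check seen in the trace can be reproduced by hand. The sketch below wraps the same grep pattern in a hypothetical function, sg_daemon_running, that reads ps -ef style output on stdin, so it can also be tried against saved output. The syslog path in the comment is the HP-UX default and may differ on a customized system.

```shell
# Same pattern as in the trace: look for /usr/lbin/cmgmsd in a process
# listing. The [g] bracket keeps a "grep ..." line from matching itself.
# Returns 0 if the daemon appears in the listing, non-zero otherwise.
sg_daemon_running() {
    grep '/usr/lbin/cm[g]msd' >/dev/null 2>&1
}

# Live check:
#   if ps -ef | sg_daemon_running; then echo "cmgmsd running"; else echo "cmgmsd not running"; fi
#
# The script reports its wait through logger (user.err), so the message
# should also be findable in syslog (default HP-UX location assumed):
#   grep 'waiting for HP-UX Service Guard' /var/adm/syslog/syslog.log
```

This also suggests why the CRS log directory stays silent: the wait message from startcheck goes to syslog through logger, not to the CRS logs.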

Check whether HP-UX Service Guard is running

[xifenfei01][orawj][/root/xifenfei]#cmviewcl
CLUSTER           STATUS
crmdb_b_cluster   down
  NODE           STATUS       STATE
  xifenfei01       down         unknown
  crmdbb02       down         unknown
UNOWNED_PACKAGES
    PACKAGE        STATUS           STATE            AUTO_RUN    NODE
    pkg1           down             halted           enabled     unowned
    pkg2           down             halted           enabled     unowned

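Combining this output with the customer's description (only one node was started, and the VGs on the other node were not activated), the conclusion is: because only one node is in use, the VGs were activated directly without starting Service Guard, and with Service Guard down, CRS cannot start.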
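
For repeated checks, the cluster state can be pulled out of the cmviewcl output with a small filter. This is a minimal sketch: cluster_status is a hypothetical helper, and it assumes the layout shown above, where the cluster line directly follows the CLUSTER / STATUS header.

```shell
# Hypothetical helper: read cmviewcl output on stdin and print the status
# column of the line right after the "CLUSTER STATUS" header.
cluster_status() {
    awk '$1 == "CLUSTER" && $2 == "STATUS" { getline; print $2; exit }'
}

# Example:
#   if [ "$(cmviewcl | cluster_status)" != "up" ]; then
#       echo "Serviceguard cluster is not up; CRS startcheck will keep waiting"
#   fi
```

The source does not record how the cluster was finally brought up. Serviceguard clusters are normally started with cmruncl, but the exact options for running with a single node should be taken from the Serviceguard documentation for the installed version.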