ERROR: diskgroup XXXX was not mounted

AIX, 10.2.0.5, two-node RAC. The node 2 system disk failed, so the node 1 system was mirrored and copied over to node 2, but the disk ordering on node 2 no longer matched node 1. After the AIX engineer made some adjustments and node 1 was rebooted, the DATADG diskgroup could no longer be mounted.

SQL> alter diskgroup datadg mount
Mon Jun 10 23:23:46 CST 2019
NOTE: cache registered group DATADG number=1 incarn=0x8cf61164
Mon Jun 10 23:23:46 CST 2019
NOTE: Hbeat: instance first (grp 1)
Mon Jun 10 23:23:50 CST 2019
NOTE: start heartbeating (grp 1)
Mon Jun 10 23:23:50 CST 2019
NOTE: cache dismounting group 1/0x8CF61164 (DATADG)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DATADG was not mounted

Check the DATADG diskgroup information (an earlier successful mount recorded in the alert log):

Tue Jan 29 19:21:45 CST 2019
NOTE: start heartbeating (grp 2)
NOTE: cache opening disk 0 of grp 2: DATADG_0000 path:/dev/rhdisk6
Tue Jan 29 19:21:45 CST 2019
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: DATADG_0001 path:/dev/rhdisk7
NOTE: cache opening disk 2 of grp 2: DATADG_0002 path:/dev/rhdisk8
NOTE: cache opening disk 3 of grp 2: DATADG_0003 path:/dev/rhdisk9
NOTE: cache mounting (first) group 2/0x60E59155 (DATADG)
* allocate domain 2, invalid = TRUE
Tue Jan 29 19:21:45 CST 2019
NOTE: attached to recovery domain 2
Tue Jan 29 19:21:45 CST 2019
NOTE: cache recovered group 2 to fcn 0.849668
Tue Jan 29 19:21:45 CST 2019
NOTE: LGWR attempting to mount thread 1 for disk group 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.849668 ABA
NOTE: seq=21 blk=5394
Tue Jan 29 19:21:46 CST 2019
NOTE: cache mounting group 2/0x60E59155 (DATADG) succeeded
SUCCESS: diskgroup DATADG was mounted

This shows that the DATADG diskgroup is made up of four disks, rhdisk6 through rhdisk9. Querying the related disk information shows:
[screenshot: asm-foreign]
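From the ASM instance the disk state can also be checked with a query along these lines (a generic example, not necessarily the exact query used here):

-- run in the +ASM instance; a member disk showing FOREIGN/CANDIDATE header_status indicates its ASM header is damaged or overwritten
select group_number, disk_number, name, path, header_status, mount_status
  from v$asm_disk
 where path like '/dev/rhdisk%'
 order by disk_number;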


This confirms that rhdisk7 is abnormal. Analyze the disk with kfed:

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                           34 ; 0x001: 0x22
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                   49407 ; 0x004: blk=49407
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                    58396 ; 0x010: 0x0000e41c
kfbh.fcn.wrap:                   131072 ; 0x014: 0x00020000
kfbh.spare1:                 4294967064 ; 0x018: 0xffffff18
kfbh.spare2:                 2105310074 ; 0x01c: 0x7d7c7b7a
005918A00 00002200 0000C0FF 00000000 00000000  [."..............]
005918A10 0000E41C 00020000 FFFFFF18 7D7C7B7A  [............z{|}]
005918A20 00000000 00000000 00000000 00000000  [................]
  Repeat 253 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd blkn=1
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
006EF8A00 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd blkn=2|more
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                33554432 ; 0x004: blk=33554432
kfbh.block.obj:                16777344 ; 0x008: file=128
kfbh.check:                  3844041089 ; 0x00c: 0xe51f6981
kfbh.fcn.base:               1297484544 ; 0x010: 0x4d560b00
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdatb10.aunum:                       0 ; 0x000: 0x00000000
kfdatb10.shrink:                  49153 ; 0x004: 0xc001
kfdatb10.ub2pad:                  20555 ; 0x006: 0x504b
kfdatb10.auinfo[0].link.next:      2048 ; 0x008: 0x0800
kfdatb10.auinfo[0].link.prev:      2048 ; 0x00a: 0x0800
kfdatb10.auinfo[0].free:              0 ; 0x00c: 0x0000
kfdatb10.auinfo[0].total:         49153 ; 0x00e: 0xc001
kfdatb10.auinfo[1].link.next:      4096 ; 0x010: 0x1000
kfdatb10.auinfo[1].link.prev:      4096 ; 0x012: 0x1000
kfdatb10.auinfo[1].free:              0 ; 0x014: 0x0000
kfdatb10.auinfo[1].total:             0 ; 0x016: 0x0000
kfdatb10.auinfo[2].link.next:      6144 ; 0x018: 0x1800
kfdatb10.auinfo[2].link.prev:      6144 ; 0x01a: 0x1800
kfdatb10.auinfo[2].free:              0 ; 0x01c: 0x0000
kfdatb10.auinfo[2].total:             0 ; 0x01e: 0x0000
kfdatb10.auinfo[3].link.next:      8192 ; 0x020: 0x2000
kfdatb10.auinfo[3].link.prev:      8192 ; 0x022: 0x2000
kfdatb10.auinfo[3].free:              0 ; 0x024: 0x0000

To gauge the extent of the damage: on AIX, ASM metadata blocks characteristically begin with 0082 (kfbh.endian=0x00, kfbh.hard=0x82), so the disk was opened in a disk/hex editor and searched for this marker for comparison.
Healthy disk:
[screenshot: 0082-good]


Damaged disk:
[screenshot: 0082-had]


From the analysis above, the damaged metadata on rhdisk7 is clearly not limited to blocks 0 and 1, so manually repairing the disk and continuing to use it was unlikely to work. Since the customer's database is not large, the chosen approach was to copy the datafiles, redo logs and controlfiles out to a filesystem and open the database from the local filesystem.
[screenshot: open_db]
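Once the files have been copied out, the database can be repointed at the new locations before opening. A minimal sketch, with hypothetical file names:

-- instance mounted against the copied controlfile; all paths below are placeholders
alter database rename file '+DATADG/orcl/datafile/users.259.123456789'
                        to '/oradata/orcl/users01.dbf';
alter database rename file '+DATADG/orcl/onlinelog/group_1.261.123456789'
                        to '/oradata/orcl/redo01.log';
alter database open;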


The recovery went smoothly and the database was opened with zero data loss.

WARNING: Read Failed. causing an ASM diskgroup failure

A customer expanded an ASM diskgroup; some time later the ASM DATA diskgroup was dismounted outright:

Wed May 29 18:37:25 2019
SUCCESS: ALTER DISKGROUP DATA ADD  DISK '/dev/oracleasm/disks/DATA_0028' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0027' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0026' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0025' SIZE 511993M /* ASMCA */
NOTE: starting rebalance of group 1/0x9e18e2f1 (DATA) at power 1
Wed May 29 18:37:26 2019
Starting background process ARB0
Wed May 29 18:37:26 2019
ARB0 started with pid=34, OS id=96638
NOTE: assigning ARB0 to group 1/0x9e18e2f1 (DATA) with 1 parallel I/O
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. No voting file found.
cellip.ora not found.
Wed May 29 19:21:43 2019
WARNING: Read Failed. group:1 disk:27 AU:0 offset:360448 size:4096
WARNING: cache failed reading from group=1(DATA) dsk=27 blk=88 count=1 from disk= 27
(DATA_0027) kfkist=0x20 status=0x02 osderr=0x0 file=kfc.c line=11596
ERROR: cache failed to read group=1(DATA) dsk=27 blk=88 from disk(s): 27(DATA_0027)
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 704
Additional information: -1
NOTE: cache initiating offline of disk 27 group DATA
NOTE: process _user31879_+asm1 (31879) initiating offline of disk 27.3915911747 (DATA_0027) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 27/0xe9681243, mask = 0x6a, op = clear
Wed May 29 19:21:43 2019
GMON updating disk modes for group 1 at 10 for pid 35, osid 31879
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Wed May 29 19:21:43 2019
NOTE: cache dismounting (not clean) group 1/0x9E18E2F1 (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 90256, image: oracle@ftz-db-o1 (B000)
Wed May 29 19:21:43 2019
NOTE: halting all I/Os to diskgroup 1 (DATA)
WARNING: Offline for disk DATA_0027 in mode 0x7f failed.
Wed May 29 19:21:43 2019
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=27.3207 last written ABA 27.3207
Wed May 29 19:21:43 2019
ERROR: ORA-15130 thrown in ARB0 for group number 1
Errors in file /oracle/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_96638.trc:
ORA-15130: diskgroup "" is being dismounted
ORA-15130: diskgroup "DATA" is being dismounted
Wed May 29 19:21:43 2019
NOTE: stopping process ARB0

The DATA diskgroup could subsequently be mounted again, but it was dismounted again almost immediately, with the same read error and non-clean dismount sequence repeating in the alert log.


The root cause of this behavior: after the new disks were added to the ASM diskgroup, a rebalance started, but a read error on disk 27 prevented the rebalance from completing, so the DATA diskgroup could not stay mounted. The approach taken was to patch the ASM diskgroup so that the rebalance is suppressed, keeping DATA from dismounting again, and then carry out the rest of the recovery.
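A related, simpler mitigation in this kind of situation (not the diskgroup patch described above) is to stop ASM from launching the rebalance automatically by setting the rebalance power to 0 before mounting, for example:

-- in the ASM instance: power 0 prevents rebalance operations from running
alter system set asm_power_limit=0 sid='*';
alter diskgroup data mount;

This keeps the pending operation from the earlier ADD DISK from kicking off, but it does nothing about the underlying read error on disk 27, which still has to be dealt with.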

Large numbers of WRI$_RCS tables in the recycle bin

While inspecting an Oracle 12.1.0.2 database, a large number of tables named WRI$_RCS_<number>_1 were found in the recycle bin. Judging by the naming, these are tables left behind by one of Oracle's automatic maintenance tasks that were never cleaned up properly.
[screenshot: wri$_rcs]


A MOS search confirms this is Bug 20114306 - Objects left in recyclebin after upgrade to 12.1.0.2 or with fix for bug 16851194 present - superseded (Doc ID 20114306.8).
[screenshot: 20114306]


You can try applying patch 21498770 or 23100700 and then setting _fix_control:

alter system set "_fix_control"='16851194:off' ;

To confirm whether this _fix_control can be set, query the v$system_fix_control view.
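For example, the check and the subsequent cleanup might look like this (the PURGE is illustrative; review what is actually in the recycle bin first):

-- confirm the fix control exists in this version before turning it off
select bugno, value, description
  from v$system_fix_control
 where bugno = 16851194;

-- then clear the leftover WRI$_RCS_* objects out of the recycle bin, e.g. as SYS
purge dba_recyclebin;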

ORA-600 kokasgi1 recovery

Database startup fails with ORA-600 [kokasgi1]:

SMON: enabling tx recovery
Database Characterset is WE8ISO8859P1
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_10056.trc  (incident=269259):
ORA-00600: internal error code, arguments: [kokasgi1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/incident/incdir_269259/xifenfei1_ora_10056_i269259.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_10056.trc:
ORA-00600: internal error code, arguments: [kokasgi1], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_10056.trc:
ORA-00600: internal error code, arguments: [kokasgi1], [], [], [], [], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 10056): terminating the instance due to error 600
Instance terminated by USER, pid = 10056
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (10056) as a result of ORA-1092
Sat May 25 09:40:21 2019
ORA-1092 : opitsk aborting process

MOS does not give a clear solution for this error, but a Google search turns up a case where it was reproduced by deleting rows from user$.
[screenshot: ora-600-kokasgi1]


Start the database with a 10046 trace enabled:
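The 10046 trace can be enabled in the session that performs the open, for example (level 12 captures binds and waits, which matches the trace output below; the exact level used in this case is not shown):

SQL> alter session set events '10046 trace name context forever, level 12';
SQL> alter database open;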

PARSING IN CURSOR #140185422046848 len=189 dep=1 uid=0 oct=3 lid=0 tim=1558756188092143 hv=186852205
ad='390983730' sqlid='2tkw12w5k68vd'
select user#,password,datats#,tempts#,type#,defrole,resource$, ptime,decode(defschclass,NULL,
'DEFAULT_CONSUMER_GROUP',defschclass),spare1,spare4,ext_username,spare2 from user$ where name=:1
END OF STMT
PARSE #140185422046848:c=0,e=784,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1558756188092141
BINDS #140185422046848:
 Bind#0
  oacdty=01 mxl=32(03) mxlc=00 mal=00 scl=00 pre=00
  oacflg=18 fl2=0001 frm=01 csi=31 siz=32 off=0
  kxsbbbfp=7f7f7648a230  bln=32  avl=03  flg=05
  value="SYS"
EXEC #140185422046848:c=1000,e=1432,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=1457651150,tim=1558756188093835
WAIT #140185422046848: nam='db file sequential read' ela= 5226 file#=1 block#=417 blocks=1 obj#=46 tim=1558756188099198
FETCH #140185422046848:c=1000,e=5465,p=1,cr=1,cu=0,mis=0,r=0,dep=1,og=4,plh=1457651150,tim=1558756188099349
STAT #140185422046848 id=1 cnt=0 pid=0 pos=1 obj=22 op='TABLE ACCESS BY INDEX ROWID USER$ (cr=1 pr=1 pw=0 time=5463 us)'
STAT #140185422046848 id=2 cnt=0 pid=1 pos=1 obj=46 op='INDEX UNIQUE SCAN I_USER1 (cr=1 pr=1 pw=0 time=5461 us)'
CLOSE #140185422046848:c=0,e=10,dep=1,type=0,tim=1558756188099578
ORA-00600: internal error code, arguments: [kokasgi1], [], [], [], [], [], [], [], [], [], [], []

It is fairly clear that the database raised ORA-00600: internal error code, arguments: [kokasgi1] because the query of user$ for the SYS user returned no rows. Further analysis of the USER$ table showed that sys and system had been renamed:

SQL> select name from user$ WHERE NAME LIKE 'SYS%';
NAME
------------------------------
SYSDW
SYSMAN
SYSTEMDW

With the problem pinpointed, the fix was straightforward: repair the user$ table during the open, after which the database was fully recovered.
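The post does not show the exact repair method, but conceptually the fix amounts to putting the bootstrap accounts back under their original names, along these lines (illustrative statements only; SYS is always user# 0, and editing the data dictionary directly is unsupported):

-- illustrative only, based on the renamed entries found above
update user$ set name = 'SYS'    where user# = 0 and name = 'SYSDW';
update user$ set name = 'SYSTEM' where name = 'SYSTEMDW';
commit;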

*** ALL YOUR WORK AND PERSONAL FILES HAVE BEEN ENCRYPTED ***

During a recent stretch of leave I received quite a few recovery requests for databases on Windows filesystems that had been encrypted, mostly by ransomware of the kind below. Our analysis shows that Oracle and SQL Server databases hit by this family of ransomware can be recovered almost completely.
1. Every directory containing encrypted files holds a file named !!! DECRYPT MY FILES !!!.txt with the following content:
[screenshot: ransom note]


2. Encrypted files are renamed by appending something like .id-3109967046_[Icanhelp@cock.li].firex3m to the original file name:
[screenshots: encrypted Oracle and SQL Server data files]


Our analysis shows that Oracle and SQL Server databases encrypted this way can essentially be recovered completely (the recovered databases run directly under the application):
[screenshots: recovered SQL Server and Oracle databases]


ORA-00470: LGWR process terminated with error

A customer's Windows 10.2.0.1 database could not start after a power outage, reporting ORA-00470:

SQL> startup mount;
ORACLE instance started.
Total System Global Area  293601280 bytes
Fixed Size                  1248600 bytes
Variable Size              92275368 bytes
Database Buffers          192937984 bytes
Redo Buffers                7139328 bytes
Database mounted.
SQL> recover database;
Media recovery complete.
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00470: LGWR process terminated with error

The alert log shows ORA-600 [kcrfwfl_nab] errors, which brought the background processes down:

Mon May 13 10:38:52 2019
alter database open
Mon May 13 10:38:52 2019
Thread 1 opened at log sequence 1699
  Current log# 3 seq# 1699 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\ORACLE\REDO03.LOG
Successful open of redo thread 1
Mon May 13 10:38:52 2019
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Mon May 13 10:38:52 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_lgwr_2308.trc:
ORA-00600: internal error code, arguments: [kcrfwfl_nab], [4294967295], [102401], [], [], [], [], []
Mon May 13 10:38:53 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_lgwr_2308.trc:
ORA-00600: internal error code, arguments: [kcrfwfl_nab], [4294967295], [102401], [], [], [], [], []
Mon May 13 10:38:53 2019
LGWR: terminating instance due to error 470
Mon May 13 10:38:53 2019
ORA-470 signalled during: alter database open...
Mon May 13 10:38:53 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_pmon_2928.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_reco_3224.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_smon_3064.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_ckpt_3640.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_dbw0_1976.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_mman_2356.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\bdump\oracle_psp0_3876.trc:
ORA-00470: LGWR process terminated with error
Mon May 13 10:38:54 2019
Instance terminated by LGWR, pid = 2308

Attempting to start the database in upgrade mode raises ORA-600 [2758]:

SQL> alter database open upgrade;
alter database open upgrade
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [2758], [3], [4294967295], [102400], [10], [], [], []

The corresponding alert log errors:

Mon May 13 10:45:50 2019
Completed redo application
Mon May 13 10:45:50 2019
Completed crash recovery at
 Thread 1: logseq 1699, block 4294967295, scn 56807170
 0 data blocks read, 0 data blocks written, 0 redo blocks read
Mon May 13 10:45:50 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\udump\oracle_ora_3220.trc:
ORA-00600: internal error code, arguments: [2758], [3], [4294967295], [102400], [10], [], [], []
Mon May 13 10:45:51 2019
Aborting crash recovery due to error 600
Mon May 13 10:45:51 2019
Errors in file d:\oracle\product\10.2.0\db_1\admin\oracle\udump\oracle_ora_3220.trc:
ORA-00600: internal error code, arguments: [2758], [3], [4294967295], [102400], [10], [], [], []
ORA-600 signalled during: alter database open upgrade...

From the errors above, together with the available material on ORA-600 [kcrfwfl_nab] and ORA-600 [2758], the cause was identified as damaged redo and controlfiles. The database was recovered by forcing it open:

SQL> recover database;
Media recovery complete.
SQL> alter database open resetlogs;
Database altered.

Rescuing a database after logical corruption on an active-active storage system

On the night before a planned vacation, a friend called for help: many tables in a hospital's core HIS Oracle database were reporting ORA-8103 and the business could not operate.
ora-8103


A dbv check of the datafiles found runs of contiguous corrupt blocks.


Based on past experience, errors like this usually point to a problem at a lower layer; the OS logs indeed contain large numbers of disk errors.


The timing of those errors roughly matches when the application reported problems, so the initial suspicion was a hardware or OS fault. Because many tables were reporting ORA-8103 and some files had contiguous regions zeroed out, the extent of the damage could not be assessed precisely (zeroed regions are physical corruption at the database level, while ORA-8103 is logical corruption that cannot be detected until the table is actually read). The next step was to review the customer's hardware environment and backup/DR situation and choose the best recovery approach.
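Because ORA-8103 is only raised when the damaged blocks are actually read, assessing the logical damage means forcing reads of the suspect segments, for example (owner and table names are placeholders):

-- force a full scan; ORA-8103 surfaces if a corrupt block is touched
select /*+ full(t) */ count(*) from his_owner.suspect_table t;

-- or validate the segment and its indexes
analyze table his_owner.suspect_table validate structure cascade;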
Discussion with the customer and checks of the database turned up the following:
1) The storage is an active-active solution from vendor xx. That is no help for this fault: the corruption is logical at the LUN level and was synchronized to both arrays.
2) Disaster recovery uses a vendor's CDP replication. The customer analyzed the CDP copy and found the replicated data was also corrupted, so that option was effectively unusable as well.
3) Backups: the storage holding the database backups had a failed battery and bad disks, so write I/O was extremely slow and the customer had stopped the filesystem RMAN backups three days earlier; there were TSM tape backups, but on inspection not a single one had ever succeeded.
The fault then got worse.
Given the situation, node 2 was clearly abnormal, so the plan was to stop the database and clusterware on node 2 and see whether node 1 improved. As soon as CRS on node 2 was stopped, the database on node 1 crashed. Analysis showed that the first few MB of one ASM disk had been zeroed (this had most likely happened before CRS was shut down, but nothing had touched that part of the disk header yet, so it had not triggered a problem). When a node shuts down it writes to the disk headers; ASM noticed the inconsistency, dismounted the diskgroup on node 1, and took the node 1 database down with it.




Where things stood:
1) The ASM diskgroup was damaged (the first few MB of one disk were gone), so repairing the original database in place was essentially impossible.
2) The CDP copy was corrupted and unusable.
3) A full backup from four days earlier was found on one of the database servers.
Recovery plan:
1. Have the customer prepare new space and restore the four-day-old backup to a local filesystem.
2. Use low-level tools to analyze the damaged ASM diskgroup and try to extract the archived logs and redo (to recover as much data as possible).
3. Restore the four-day-old backup and roll it forward with the extracted archived logs and redo, aiming for a complete recovery.
4. Risk: even if the archived logs and redo can be pulled out of the damaged ASM diskgroup, they may themselves be damaged, preventing recovery to the most recent data (i.e. data loss).
What actually happened:
1. The customer had added some undo datafiles after the failure the previous night, so a straightforward full restore database was not possible (the controlfile listed more datafiles than the backup set contained).
2. Restoring the 10.2.0.4 RAC backup to a single-instance server then hit ORA-600 [kgeade_is_1].


3. After the recovery completed, sqlplus worked against the database normally, but PL/SQL Developer and the application received ORA-27092.


In the end, after a fair amount of work, the database opened successfully and the application came back; the tables that had been failing in production no longer report ORA-8103, and dbv on the previously corrupt files is now clean.


A few reminders:
1) Make sure your backups actually work; run restore drills regularly.
2) Choose a DR solution that suits the database, and check or exercise it regularly.
3) Active-active storage covers hardware failures, but you still need an appropriate solution for logical corruption at the storage layer.

More SQL statements behind ORA-1555 during database open

The 2015 post summarizing ORA-01555 errors commonly hit while opening a database listed the SQL statements that can raise ORA-01555 during the open. Two new ones turned up in recent recoveries, so they are added here.
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1

Thu May 09 02:10:27 2019
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: bqbdby3c400p7, SCN: 0x0000.3e785fc7):
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_15929.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name "_SYSSMU91_1360910548$" too small
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_15929.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name "_SYSSMU91_1360910548$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 15929): terminating the instance due to error 704
Instance terminated by USER, pid = 15929
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (15929) as a result of ORA-1092
Thu May 09 02:10:28 2019
ORA-1092 : opitsk aborting process

select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null

NSA2 started with pid=41, OS id=32571518
ORA-01555 caused by SQL statement below (SQL ID: 3nkd3g3ju5ph1, Query Duration=0 sec, SCN: 0x0005.e4bea784):
select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from
    obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null
Errors in file /u01/app/oracle/diag/rdbms/xifenfei_std/xifenfei/trace/xifenfei_ora_18939904.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 7 with name "_SYSSMU7_542380376$" too small
Errors in file /u01/app/oracle/diag/rdbms/xifenfei_std/xifenfei/trace/xifenfei_ora_18939904.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 7 with name "_SYSSMU7_542380376$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 18939904): terminating the instance due to error 704
Instance terminated by USER, pid = 18939904
ORA-1092 signalled during: alter database open RESETLOGS...
opiodr aborting process unknown ospid (18939904) as a result of ORA-1092

ORA-00600 kcratr_scan_rc

An 11.2.0.4 database fails to start with ORA-600 [kcratr_scan_rc]:
[screenshot: kcratr_scan_rc]


Alert log entries related to the ORA-600 [kcratr_scan_rc] error:

Thu May 09 01:56:01 2019
alter database   open
Beginning crash recovery of 1 threads
 parallel recovery started with 23 processes
Started redo scan
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_9975.trc  (incident=171):
ORA-00600: internal error code, arguments: [kcratr_scan_rc], [4], [1], [39821], [190063], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_171/orcl_ora_9975_i171.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Aborting crash recovery due to error 600
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_9975.trc:
ORA-00600: internal error code, arguments: [kcratr_scan_rc], [4], [1], [39821], [190063], [], [], [], [], [], [], []
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_9975.trc:
ORA-00600: internal error code, arguments: [kcratr_scan_rc], [4], [1], [39821], [190063], [], [], [], [], [], [], []
ORA-600 signalled during: alter database   open...

This indicates the database cannot correctly apply the redo during instance (crash) recovery. Bypass the roll-forward recovery and force the database open:
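The /tmp/pfile used below normally carries the parameters that let the open bypass the failed crash recovery, roughly as follows (hypothetical contents, the actual pfile is not shown; such settings are for salvage only, and the database should afterwards be rebuilt or exported):

# hypothetical extract of /tmp/pfile, normal instance parameters omitted
*._allow_resetlogs_corruption=TRUE
*.undo_management=MANUAL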

SQL> startup mount pfile='/tmp/pfile'
ORACLE instance started.
Total System Global Area 6998261760 bytes
Fixed Size                  2266624 bytes
Variable Size            2684357120 bytes
Database Buffers         4294967296 bytes
Redo Buffers               16670720 bytes
Database mounted.
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name
"_SYSSMU91_1360910548$" too small
Process ID: 15929
Session ID: 5272 Serial number: 3

The database now reports the familiar ORA-01092 / ORA-00704 / ORA-00604 / ORA-01555 combination; the trace and alert output show:

Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu May 09 02:10:27 2019
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: bqbdby3c400p7, SCN: 0x0000.3e785fc7):
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_15929.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name "_SYSSMU91_1360910548$" too small
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_15929.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name "_SYSSMU91_1360910548$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 15929): terminating the instance due to error 704
Instance terminated by USER, pid = 15929
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (15929) as a result of ORA-1092
Thu May 09 02:10:28 2019
ORA-1092 : opitsk aborting process

This time the ORA-1555 comes from the less common statement select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1. After dealing with it using bbed:

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 17776
Session ID: 5272 Serial number: 3

The database then fails with ORA-03113; the alert log shows:

ORA-00600: internal error code, arguments: [4198], [], [], [], [], [], [], [], [], [], [], []
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_smon_17739.trc  (incident=128132):
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128132/orcl_smon_17739_i128132.trc
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_mmon_17743.trc  (incident=128151):
ORA-00600: internal error code, arguments: [4412], [0x1E6BC4DF8], [0x000000000], [1], [6283], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128151/orcl_mmon_17743_i128151.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
SMON: Parallel transaction recovery slave got internal error
SMON: Downgrading transaction recovery to serial
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_smon_17739.trc  (incident=128133):
ORA-00600: internal error code, arguments: [4137], [10.4.1100583], [0], [0], [], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128133/orcl_smon_17739_i128133.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_mmon_17743.trc  (incident=128152):
ORA-00600: internal error code, arguments: [4406], [0x1E6BC4DF8], [0x000000000], [2], [6289], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [4412], [0x1E6BC4DF8], [0x000000000], [1], [6283], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128152/orcl_mmon_17743_i128152.trc
ORACLE Instance orcl (pid = 15) - Error 600 encountered while recovering transaction (10, 4).
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_smon_17739.trc  (incident=128134):
ORA-00600: internal error code, arguments: [4137], [98.33.13158], [0], [0], [], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128134/orcl_smon_17739_i128134.trc
Starting background process SMCO
Thu May 09 02:16:25 2019
SMCO started with pid=21, OS id=18119
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_smon_17739.trc  (incident=128135):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128135/orcl_smon_17739_i128135.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Non-fatal internal error happenned while SMON was doing non-existent object cleanup.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Thu May 09 02:16:27 2019
Thu May 09 02:16:32 2019
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_18168.trc  (incident=128620):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
ORA-28000: the account is locked
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128620/orcl_ora_18168_i128620.trc
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
ORA-28000: the account is locked
Block recovery from logseq 3, block 239 to scn 2147483855
Recovery of Online Redo Log: Thread 1 Group 3 Seq 3 Reading mem 0
  Mem# 0: /home/u01/oradata/orcl/redo03.log
Block recovery stopped at EOT rba 3.241.16
Block recovery completed at rba 3.241.16, scn 0.2147483854
Block recovery from logseq 3, block 239 to scn 2147483853
Recovery of Online Redo Log: Thread 1 Group 3 Seq 3 Reading mem 0
  Mem# 0: /home/u01/oradata/orcl/redo03.log
Block recovery completed at rba 3.241.16, scn 0.2147483854
Thu May 09 02:16:33 2019
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_18170.trc  (incident=128621):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
ORA-28000: the account is locked
Incident details in: /home/u01/diag/rdbms/orcl/orcl/incident/incdir_128621/orcl_ora_18170_i128621.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu May 09 02:16:34 2019
PMON (ospid: 17710): terminating the instance due to error 474
Instance terminated by PMON, pid = 17710

Lots of ORA-600 [4198], [6006], [4412], [4137], [4406] and [kdsgrp1] errors. In past experience these are mostly caused by damaged undo; after handling that, the database opened normally and the data was exported logically.
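Handling this kind of undo corruption typically involves opening with manual undo management and marking the damaged rollback segments as corrupted, for example (a sketch; the segment names below are placeholders, the real ones come from the ORA-600 [4137]/[4193]/[4194] traces):

# hypothetical pfile entries for working around damaged undo
*.undo_management=MANUAL
*._corrupted_rollback_segments=('_SYSSMU10_1234567890$','_SYSSMU98_1234567890$')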

ORA-600 kfrValAcd30 recovery

A customer's storage controller failed. After the controller was repaired, ASM could not mount the diskgroup and reported ORA-600 [kfrValAcd30], and we were asked to provide support.
[screenshot: kfrValAcd30]


ASM alert log errors:

Wed Apr 03 16:50:57 2019
SQL> alter diskgroup data mount
NOTE: cache registered group DATA number=1 incarn=0x14248741
NOTE: cache began mount (first) of group DATA number=1 incarn=0x14248741
NOTE: Assigning number (1,0) to disk (ORCL:DATAVOL1)
Wed Apr 03 16:51:03 2019
NOTE: start heartbeating (grp 1)
kfdp_query(DATA): 15
kfdp_queryBg(): 15
NOTE: cache opening disk 0 of grp 1: DATAVOL1 label:DATAVOL1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (first) external redundancy group 1/0x14248741 (DATA)
Wed Apr 03 16:51:04 2019
* allocate domain 1, invalid = TRUE
Wed Apr 03 16:51:04 2019
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=27.2697 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_15951.trc  (incident=23394):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [27], [2697], [28], [2697], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_23394/+ASM2_ora_15951_i23394.trc
Abort recovery for domain 1
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [27], [2697], [28], [2697], [], [], [], [], []
ERROR: alter diskgroup data mount
NOTE: cache dismounting (clean) group 1/0x14248741 (DATA)
NOTE: lgwr not being msg'd to dismount
freeing rdom 1
Wed Apr 03 16:51:05 2019
Sweep [inc][23394]: completed
Sweep [inc2][23394]: completed
Wed Apr 03 16:51:05 2019
Trace dumping is performing id=[cdmp_20190403165105]
NOTE: detached from domain 1
NOTE: cache dismounted group 1/0x14248741 (DATA)
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0x14248741
kfdp_dismount(): 16
kfdp_dismountBg(): 16
NOTE: De-assigning number (1,0) from disk (ORCL:DATAVOL1)
ERROR: diskgroup DATA was not mounted
NOTE: cache deleting context for group DATA 1/337938241

Related MOS note:
Reference: ORA-600 [KFRVALACD30] in ASM (Doc ID 2123013.1)
[screenshot: kfrValAcd30-mos]


ORA-00600: internal error code, arguments: [kfrValAcd30] can be caused by a bug or by a hardware fault. Given this customer's situation, the most likely cause is that the hardware failure damaged the diskgroup's ACD (active change directory), so ACD recovery cannot complete and the mount fails. With luck, a kfed-based repair gets the diskgroup mounted again; otherwise the datafiles can be extracted directly from the underlying disks to recover as much data as possible.