悲剧的客户因为IBM p系列小机更换电源导致主机直接掉电,起来后发现数据库出现不少坏块,而且还有部分坏块中含有回滚事务,导致alert日志一直报smon回滚遇到坏块错误,该数据库版本是9.2.0.8 RAC,根据客户的备份情况,为了减少对业务的影响,决定使用blockrecover对其处理.这里通过10g数据库大概模拟出现含事务坏块的情况以及处理过程,重现了我们在处理的时候不确定的一些知识.
创建测试表
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> conn /as sysdba
Connected.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 231
Next log sequence to archive 233
Current log sequence 233
SQL> conn chf/xifenfei
Connected.
SQL> create table t_xifenfei
2 as
3 select * from dba_objects where rownum<10;
Table created.
SQL> select rowid,
2 dbms_rowid.rowid_relative_fno(rowid) rel_fno,
3 dbms_rowid.rowid_block_number(rowid) block
4 from chf.t_xifenfei;
ROWID REL_FNO BLOCK
------------------ ---------- ----------
AAANIqAAEAAAAAcAAA 4 28
AAANIqAAEAAAAAcAAB 4 28
AAANIqAAEAAAAAcAAC 4 28
AAANIqAAEAAAAAcAAD 4 28
AAANIqAAEAAAAAcAAE 4 28
AAANIqAAEAAAAAcAAF 4 28
AAANIqAAEAAAAAcAAG 4 28
AAANIqAAEAAAAAcAAH 4 28
AAANIqAAEAAAAAcAAI 4 28
9 rows selected.
当前的seq是233(也就是说我在233归档上创建了t_xienfei表)
dbv检查block
[oracle@xifenfei ~]$ dbv file='/u01/oracle/oradata/XFF/users01.dbf'
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 22:16:16 2013
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf
DBVERIFY - Verification complete
Total Pages Examined : 1280
Total Pages Processed (Data) : 904
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing (Index): 0
Total Pages Processed (Other): 44
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 294
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 3224018224 (2860.3224018224)
证明无任何坏块
切换归档
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 233
Next log sequence to archive 235
Current log sequence 235
现在已经切换seq到235
rman备份我们需要测试block(file 4 block 28)对应的数据文件
RMAN> backup datafile 4 format '/u01/oracle/oradata/xff_4.rman';
具体见:rman制造坏块,bbed修复坏块
模拟数据库进行其他操作
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 235
Next log sequence to archive 237
Current log sequence 237
SQL> conn chf/xifenfei
Connected.
SQL> create table t_xff
2 as
3 select * from dba_objects;
Table created.
SQL> alter system switch logfile;
System altered.
主要是为了模拟对其他block操作,对于block 28的恢复影响
对block 28进行操作
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 237
Next log sequence to archive 239
Current log sequence 239
SQL> update chf.t_xifenfei set object_name='www.xifenfei.com';
9 rows updated.
SQL> commit;
Commit complete.
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> /
System altered.
在seq为239的时候对block 28进行了一次update操作
模拟其他业务操作
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 240
Next log sequence to archive 242
Current log sequence 242
SQL> delete from chf.t_xff ;
50491 rows deleted.
SQL> commit;
Commit complete.
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 242
Next log sequence to archive 244
Current log sequence 244
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
这里可以知道在seq为246的时候做了备份归档操作
备份归档操作
RMAN> backup archivelog all format '/u01/oracle/oradata/xff_arch_%U' delete input;
模拟继续操作
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 248
Next log sequence to archive 250
Current log sequence 250
SQL> update chf.t_xifenfei set object_name='www.orasos.com' where rownum<5;
4 rows updated.
这里可以发现,在seq为250的时候我们再次对block 28进行了操作
使用rman制造坏块
RMAN> BLOCKRECOVER DATAFILE 4 block 28 clear;
dbv检查坏块
[oracle@xifenfei ~]$ dbv file='/u01/oracle/oradata/XFF/users01.dbf'
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 23:01:24 2013
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf
Page 28 is influx - most likely media corrupt
Corrupt block relative dba: 0x0100001c (file 4, block 28)
Fractured block found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x0100001c
last change scn: 0x0b2c.c02ab081 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xb0c4a6ea
check value in block header: 0x393f
computed block checksum: 0xc917
DBVERIFY - Verification complete
Total Pages Examined : 1280
Total Pages Processed (Data) : 903
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing (Index): 0
Total Pages Processed (Other): 47
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 291
Total Pages Marked Corrupt : 1
Total Pages Influx : 1
Highest block SCN : 3224022228 (2860.3224022228)
强制kill数据库
[oracle@xifenfei ~]$ ps -ef|grep pmon
oracle 9744 9638 0 23:03 pts/1 00:00:00 grep pmon
oracle 32156 1 0 14:17 ? 00:00:10 ora_pmon_XFF
[oracle@xifenfei ~]$ kill -9 32156
[oracle@xifenfei ~]$ ps -ef|grep pmon
oracle 9751 9638 0 23:03 pts/1 00:00:00 grep pmon
为了模拟含事务的block出现坏块
启动数据库
SQL> startup mount
ORACLE instance started.
Total System Global Area 306184192 bytes
Fixed Size 1267164 bytes
Variable Size 109054500 bytes
Database Buffers 188743680 bytes
Redo Buffers 7118848 bytes
Database mounted.
SQL> ALTER DATABASE OPEN;
Database altered.
数据库启动正常
查询坏块
SQL> select count(*) from chf.t_xifenfei;
select count(*) from chf.t_xifenfei
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 28)
ORA-01110: data file 4: '/u01/oracle/oradata/XFF/users01.dbf'
alert日志
Sun Jan 20 23:04:37 2013
SMON: enabling tx recovery
Sun Jan 20 23:04:37 2013
Database Characterset is ZHS16GBK
Sun Jan 20 23:04:37 2013
Hex dump of (file 4, block 28) in trace file /u01/oracle/admin/XFF/bdump/xff_smon_9775.trc
Corrupt block relative dba: 0x0100001c (file 4, block 28)
Fractured block found during buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x0100001c
last change scn: 0x0b2c.c02ab081 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xb0c4a6ea
check value in block header: 0x393f
computed block checksum: 0xc917
Reread of rdba: 0x0100001c (file 4, block 28) found same corrupted data
ORACLE Instance XFF (pid = 8) - Error 1578 encountered while recovering transaction (9, 37) on object 53802.
Sun Jan 20 23:04:38 2013
Errors in file /u01/oracle/admin/XFF/bdump/xff_smon_9775.trc:
ORA-01578: ORACLE data block corrupted (file # 4, block # 28)
ORA-01110: data file 4: '/u01/oracle/oradata/XFF/users01.dbf'
通过试验步骤和alert日志可以发现因为block有事务,但是被标记为了坏块,所以smon无法回滚该事务,从而出现alert中类似提示
继续切换归档
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> /
System altered.
SQL> /
System altered.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence 252
Next log sequence to archive 254
Current log sequence 254
移走归档
[oracle@xifenfei archivelog]$ ls -l
total 2224
-rw-r----- 1 oracle oinstall 360960 Jan 20 22:59 1_247_792679299.dbf
-rw-r----- 1 oracle oinstall 1024 Jan 20 22:59 1_248_792679299.dbf
-rw-r----- 1 oracle oinstall 1630208 Jan 20 23:04 1_249_792679299.dbf
-rw-r----- 1 oracle oinstall 249344 Jan 20 23:09 1_250_792679299.dbf
-rw-r----- 1 oracle oinstall 1024 Jan 20 23:09 1_251_792679299.dbf
-rw-r----- 1 oracle oinstall 4608 Jan 20 23:09 1_252_792679299.dbf
-rw-r----- 1 oracle oinstall 1024 Jan 20 23:09 1_253_792679299.dbf
[oracle@xifenfei archivelog]$ mkdir bak
[oracle@xifenfei archivelog]$ mv *.dbf bak
[oracle@xifenfei archivelog]$ ll
total 4
drwxr-xr-x 2 oracle oinstall 4096 Jan 20 23:11 bak
为了重现,当我们使用blockrecover恢复的时候,如果缺少归档会怎么样
尝试blockrecover
RMAN> BLOCKRECOVER DATAFILE 4 block 28;
Starting blockrecover at 20-JAN-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=135 devtype=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: sid=126 devtype=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: sid=125 devtype=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: sid=124 devtype=DISK
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00004
channel ORA_DISK_1: reading from backup piece /u01/oracle/oradata/xff_4.rman
channel ORA_DISK_1: restored block(s) from backup piece 1
piece handle=/u01/oracle/oradata/xff_4.rman tag=TAG20130120T222333
channel ORA_DISK_1: block restore complete, elapsed time: 00:00:01
starting media recovery
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of blockrecover command at 01/20/2013 23:11:41
RMAN-06053: unable to perform media recovery because of missing log
RMAN-06025: no backup of log thread 1 seq 253 lowscn 12286830510311 found to restore
RMAN-06025: no backup of log thread 1 seq 252 lowscn 12286830510307 found to restore
RMAN-06025: no backup of log thread 1 seq 251 lowscn 12286830510305 found to restore
RMAN-06025: no backup of log thread 1 seq 250 lowscn 12286830509979 found to restore
RMAN-06025: no backup of log thread 1 seq 249 lowscn 12286830489543 found to restore
RMAN-06025: no backup of log thread 1 seq 248 lowscn 12286830489541 found to restore
RMAN-06025: no backup of log thread 1 seq 247 lowscn 12286830489279 found to restore
alert日志记录
Sun Jan 20 23:11:38 2013
alter database recover datafile list clear
Sun Jan 20 23:11:38 2013
Completed: alter database recover datafile list clear
Sun Jan 20 23:11:38 2013
Starting block media recovery
Sun Jan 20 23:11:39 2013
Media Recovery Log /u01/oracle/oradata/XFF/archivelog/1_235_792679299.dbf
Sun Jan 20 23:11:41 2013
alter database recover cancel
Sun Jan 20 23:11:41 2013
Media Recovery Canceled
Completed: alter database recover cancel
blockrecover恢复途中或者异常终止,dbv检测
[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/XFF/users01.dbf
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 23:18:29 2013
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf
Page 28 is marked corrupt
Corrupt block relative dba: 0x0100001c (file 4, block 28)
Bad check value found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x0100001c
last change scn: 0x0b2c.c02b0248 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x02480601
check value in block header: 0x13fc
computed block checksum: 0x663b
DBVERIFY - Verification complete
Total Pages Examined : 1280
Total Pages Processed (Data) : 903
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing (Index): 0
Total Pages Processed (Other): 47
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 291
Total Pages Marked Corrupt : 1
Total Pages Influx : 0
Highest block SCN : 3224022228 (2860.3224022228)
在blockrecover处理过程,或者处理失败的后,block依然是坏块,如果遇到这类情况,没有事务可以直接指定seq/scn/time的方法来恢复,如果有事务,需要指定恢复时间点过该事务的时间点,让smon能够正常回滚,从而使得smon进程正常工作
还原归档后继续测试
[oracle@xifenfei archivelog]$ mv bak/* ./
RMAN> BLOCKRECOVER DATAFILE 4 block 28;
starting media recovery
archive log thread 1 sequence 247 is already on disk as file /u01/oracle/oradata/XFF/archivelog/1_247_792679299.dbf
…………
media recovery complete, elapsed time: 00:00:01
Finished blockrecover at 20-JAN-13
[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/XFF/users01.dbf
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 23:15:43 2013
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf
DBVERIFY - Verification complete
Total Pages Examined : 1280
Total Pages Processed (Data) : 904
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing (Index): 0
Total Pages Processed (Other): 47
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 291
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 3224023169 (2860.3224023169)
>
blockrecover处理alert日志
Sun Jan 20 23:15:01 2013
Media Recovery Log /u01/oracle/oradata/XFF/archivelog/1_250_792679299.dbf
Sun Jan 20 23:15:01 2013
Media Recovery Log /u01/oracle/oradata/XFF/archivelog/1_251_792679299.dbf
Sun Jan 20 23:15:01 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 252 Reading mem 0
Mem# 0: /u01/oracle/oradata/XFF/redo03.log
Sun Jan 20 23:15:01 2013
Recovery of Online Redo Log: Thread 1 Group 1 Seq 253 Reading mem 0
Mem# 0: /u01/oracle/oradata/XFF/redo01.log
Sun Jan 20 23:15:01 2013
Recovery of Online Redo Log: Thread 1 Group 2 Seq 254 Reading mem 0
Mem# 0: /u01/oracle/oradata/XFF/redo02.log
Sun Jan 20 23:15:02 2013
Completed block media recovery
补充说明
1.在9i中使用blockrecover会在Archive destination中生产block的备份文件,类似469_519791_3063_2442393528.bkd(file 469 block 519791),需要注意Archive destination目录结尾需要”/”,不然可能出现直接写入和该目录并列的Archive destination+blockrecover产生文件;10g中不生成该文件;如果该block在9i中未备份,也不会在生产相关文件,而是利用归档恢复.
2.blockrecover需要还原从该block从备份之后的所有归档(如果被备份起来需要还原出来),如果缺少归档可能导致恢复失败,包括基于scn/seq/time的等
3.blockrecover在执行过程中或者执行异常终止,该block依然是坏块,不会对其他block产生影响.在使用blockrecover出现异常终止后的block修复,在后续blog中提供解决方法
4.对于不含事务的坏块,如果数据允许丢失可以通过设置event跳过坏块(特殊block除外),然后重建对象;对于含event的block 坏块,建议使用blockrecover处理或者直接恢复数据文件,如果没有备份,考试使用event跳过事务回滚,然后屏蔽坏块处理