记录因磁盘头被重写,抢救redo恢复经历

联系:手机/微信(+86 17813235971) QQ(107644445)

标题:记录因磁盘头被重写,抢救redo恢复经历

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

客户使用赛门铁克做同城异地容灾部署extent rac,因某种情况导致主备容灾不同步,然后在主库中进行了若干操作,导致主库所有裸设备丢失,然后进行了一些列的操作,主库识别了裸设备,但是oracle出现异常
数据库裸设备异常

Fri May 31 22:07:39 2013
ORA-00202: control file: '/dev/rcontrol2'
ORA-27047: unable to read the header block of file
Additional information: 2
ORA-205 signalled during: ALTER DATABASE   MOUNT...

使用备份还原控制文件后,查询数据文件头v$datafile_header.error全部为”WRONG FILE TYPE”,使用bbed去查看,10个block以内全部是0,证明数据库文件头也损坏。因为客户的数据库虽然有rman备份,但涉及到memory,对redo的信息也很敏感,所以希望能够在他们当前的情况下评估redo是否可以应用,确保他们的数据不丢失.验证文件头

DATA FILE #1:
  (name #4) /dev/rsystem
creation size=128000 block size=8192 status=0xe head=4 tail=4 dup=1
 tablespace 0, index=1 krfil=1 prev_file=0
 unrecoverable scn: 0x0000.00000000 01/01/1988 00:00:00
 Checkpoint cnt:22469 scn: 0x0000.7b4f9d86 05/29/2013 22:09:50
 Stop scn: 0xffff.ffffffff 05/15/2013 00:08:31
 Creation Checkpointed at scn:  0x0000.00000009 05/20/2007 21:52:41
 thread:1 rba:(0x1.3.10)
 enabled  threads:  01000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000
 Offline scn: 0x0000.00000000 prev_range: 0
 Online Checkpointed at scn:  0x0000.00000000
 thread:0 rba:(0x0.0.0)
 enabled  threads:  00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000
 Hot Backup end marker scn: 0x0000.00000000
 aux_file is NOT DEFINED
File header version cannot be determined due to corruption
Dump may be suspect
 V10 STYLE FILE HEADER:
	Software vsn=0=0x0, Compatibility Vsn=0=0x0
	Db ID=0=0x0, Db Name=''
	Activation ID=0=0x0
	Control Seq=0=0x0, File size=0=0x0
	File Number=0, Blksiz=0, File Type=0 UNKNOWN
Tablespace #0 -   rel_fn:0
Creation   at   scn: 0x0000.00000000 01/01/1988 00:00:00
Backup taken at scn: 0x0000.00000000 01/01/1988 00:00:00 thread:0
 reset logs count:0x0 scn: 0x0000.00000000 reset logs terminal rcv data:0x0 scn: 0x0000.00000000
 prev reset logs count:0x0 scn: 0x0000.00000000 prev reset logs terminal rcv data:0x0 scn: 0x0000.00000000
 recovered at 01/01/1988 00:00:00
 status:0x0 root dba:0x00000000 chkpt cnt: 0 ctl cnt:0
begin-hot-backup file size: 0
Checkpointed at scn:  0x0000.00000000
 thread:0 rba:(0x0.0.0)
 enabled  threads:  00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000
Backup Checkpointed at scn:  0x0000.00000000
 thread:0 rba:(0x0.0.0)
 enabled  threads:  00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  00000000 00000000 00000000 00000000 00000000 00000000
External cache id: 0x0 0x0 0x0 0x0
Absolute fuzzy scn: 0x0000.00000000
Recovery fuzzy scn: 0x0000.00000000 01/01/1988 00:00:00
Terminal Recovery Stamp  01/01/1988 00:00:00

这里可以看出来文件头完全损坏,dump redo header

LOG FILE #1:
  (name #1) /dev/rredo11
 Thread 1 redo log links: forward: 2 backward: 0
 siz: 0x3c000 seq: 0x00003f9c hws: 0x2 bsz: 512 nab: 0x1763c flg: 0x1 dup: 1
 Archive links: fwrd: 0 back: 0 Prev scn: 0x0000.7b458379
 Low scn: 0x0000.7b4583ad 05/29/2013 01:32:39
 Next scn: 0x0000.7b4fa0e5 05/29/2013 22:17:59
 FILE HEADER:
	Software vsn=0=0x0, Compatibility Vsn=0=0x0
	Db ID=0=0x0, Db Name=''
	Activation ID=0=0x0
	Control Seq=0=0x0, File size=0=0x0
	File Number=0, Blksiz=0, File Type=0 UNKNOWN
 descrip:""
 thread: 0 nab: 0x0 seq: 0x00000000 hws: 0x0 eot: 0 dis: 0
 reset logs count: 0x0 scn: 0x0000.00000000
 Low scn: 0x0000.00000000 01/01/1988 00:00:00
 Next scn: 0x0000.00000000 01/01/1988 00:00:00
 Enabled scn: 0x0000.00000000 01/01/1988 00:00:00
 Thread closed scn: 0x0000.00000000 01/01/1988 00:00:00
 Log format vsn: 0x0 Disk cksum: 0x0 Calc cksum: 0x0
 Terminal Recovery Stop scn: 0x0000.00000000
 Terminal Recovery Stamp  01/01/1988 00:00:00
 Most recent redo scn: 0x0000.00000000
 Largest LWN: 0 blocks
 Miscellaneous flags: 0x0
 Thread internal enable indicator: thr: 0, seq: 0 scn: 0x0000.00000000

验证redo header已经异常,dump redo logfile全部提示文件头错误


现在情况已经很明显,客户的库因为使用了裸设备,online 磁盘的过程中,所有的裸设备卷已经重写了文件头,oracle的datafile header信息不在了,无法正常操作完成.我们决定使用rman备份来恢复该数据库,然后想办法处理redo

注册带库备份集
在rman恢复过程中,我们遇到一个问题,客户的库rman备份策略有问题,一周一个全备,每天一个增量备份,一次归档备份,最后一次增量备份后备份了控制文件,但是最后一次归档备份之后无控制文件,而且是归档的备份发生在增量备份之后,因为是使用了带库无catalog库,我们增量恢复之后,数据不一致需要归档,但是归档,而归档的备份未记录在还原出来的控制文件中,需要人工注册带库的备份集到控制文件中

--涉及到3个节点都有配置不同的NB_ORA_CLIENT=pysa,如果在同一个节点中还原归档日志,需要配置如下
CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' PARMS  'ENV=(NB_ORA_CLIENT=pysa)';
--如果在默认节点直接分配sbt通道即可
configure default device type to sbt_tape;
--注册带库备份集
catalog device type 'sbt_tape' backuppiece 'al_21395_1_816744765';

分析redo
通过ue打开dd出来的redo文件,我们分析得到20000h(10*8192)全部为0,应该是和赛门铁克存储管理系统有关系,后面开始是aix的设备头信息

正常redo文件信息

该库redo信息对比(获得aix偏移量)

对比正常redo起点信息和经验我们定位到aix的设备头偏移量为1000h(4096),整体偏移量为21000h(10*8192+4096)
该库的redo起点为21000h,也就是说,我们需要执行的dd语句为类似语句(redo 大小为120M)

dd if=/dev/vx/rdsk/dg/redo31 bs=512 skip=264 count=245761 of=/arch/xifenfei/redo31

dd出来所有数据后,对先要已经应用了归档的库,继续尝试recover redo

SQL> recover database using backup controlfile;
ORA-00279: change 2069348436 generated at 05/30/2013 16:18:08 needed for thread
1
ORA-00289: suggestion : /arch/1_16289_623109141.dbf
ORA-00280: change 2069348436 for thread 1 is in sequence #16289
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch/xifenfei/redo12
ORA-00279: change 2069348436 generated at 05/30/2013 16:18:08 needed for thread
2
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch/xifenfei/redo23
ORA-00279: change 2069348436 generated at 05/30/2013 16:18:07 needed for thread
3
ORA-00289: suggestion : /arch/3_3898_623109141.dbf
ORA-00280: change 2069348436 for thread 3 is in sequence #3898
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch/xifenfei/redo31
ORA-00279: change 2069455373 generated at 05/30/2013 20:03:26 needed for thread
1
ORA-00289: suggestion : /arch/1_16290_623109141.dbf
ORA-00280: change 2069455373 for thread 1 is in sequence #16290
ORA-00278: log file '/arch/xifenfei/redo12' no longer needed for this recovery
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch/xifenfei/redo11
ORA-00279: change 2069475956 generated at 05/30/2013 20:04:09 needed for thread
3
ORA-00289: suggestion : /arch/3_3899_623109141.dbf
ORA-00280: change 2069475956 for thread 3 is in sequence #3899
ORA-00278: log file '/arch/xifenfei/redo31' no longer needed for this recovery
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch/xifenfei/redo33
Log applied.
Media recovery complete.
SQL> alter database open resetlogs;
Database altered.

到这一步我们完整的通过dd跳过了由于赛门铁克管理磁盘导致的磁盘头损坏的块,从裸设备中复制出redo到文件系统,然后进行恢复,完整的抢救了客户的数据,减少了客户的损坏.这里温馨提示对于非常重要的系统(涉及钱),强烈建议redo多路冗余,光依靠存储容灾,备份,dg依然不够