接到客户恢复请求:把oracle asm datagroup中的一个磁盘增加到vg中,现在磁盘组无法mount,数据库无法正常启动.远程登录现场进行分析发现情况如下:
操作系统层面分析
history操作记录
这里比较明显把一个磁盘做成pv,并且加入到vg中,然后再分配199G给lv_home,系统层面分析lvm信息
--查看pv信息
[root@xff1 ~]# pvdisplay
--- Physical volume ---
PV Name /dev/sda2
VG Name VolGroup
PV Size 277.98 GiB / not usable 3.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 71161
Free PE 0
Allocated PE 71161
PV UUID F6QO3f-065n-mwTW-Xbq2-Xx2y-c8HD-Tkr7V7
--- Physical volume ---
PV Name /dev/sdg <----新加入的磁盘
VG Name VolGroup
PV Size 200.00 GiB / not usable 4.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 51199
Free PE 255
Allocated PE 50944
PV UUID i69vUG-nCIK-dtxL-FvpD-2WZd-bvLv-n7lwrb
[root@xff1 ~]# lvdisplay
--- Logical volume ---
LV Path /dev/VolGroup/lv_root
LV Name lv_root
VG Name VolGroup
LV UUID JUNnkN-m4zq-D0gh-h42b-cUM1-Wh1q-ZMtQE4
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2017-07-19 20:08:47 +0800
LV Status available
# open 1
LV Size 50.00 GiB
Current LE 12800
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0
--- Logical volume ---
LV Path /dev/VolGroup/lv_home
LV Name lv_home
VG Name VolGroup
LV UUID eZTkLt-cNGX-371i-m8Bd-VdD9-q6Hz-wYDRIJ
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2017-07-19 20:08:54 +0800
LV Status available
# open 1
LV Size 422.97 GiB <-----lv大小编程422G,应该是被扩了199G后结果
Current LE 108281
Segments 2
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2
--- Logical volume ---
LV Path /dev/VolGroup/lv_swap
LV Name lv_swap
VG Name VolGroup
LV UUID 54P9ok-VpwO-zM68-hvwY-9GBf-89yb-8xQAMn
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2017-07-19 20:09:23 +0800
LV Status available
# open 1
LV Size 4.00 GiB
Current LE 1024
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:1
[root@xff1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
50G 3.9G 43G 9% /
tmpfs 63G 509M 63G 1% /dev/shm
/dev/sda1 477M 44M 408M 10% /boot
/dev/mapper/VolGroup-lv_home
417G 226G 170G 58% /home <----增加了199g空间,剩余只剩170G,证明增加空间之后最少使用了30G以上
基于这样的情况,基本上可以确定sdg盘加入VolGroup中并且被分配给 lv_home中,而且还写入了数据(/home空闲空间只剩余170G,lv_home当时扩了199G).
asm层面分析
asm磁盘组无法mount,提示缺少一块磁盘
SQL> ALTER DISKGROUP DATA MOUNT /* asm agent *//* {1:12056:279} */
NOTE: cache registered group DATA number=1 incarn=0xa1dbff16
NOTE: cache began mount (first) of group DATA number=1 incarn=0xa1dbff16
NOTE: Assigning number (1,2) to disk (/dev/asmdisk3)
NOTE: Assigning number (1,1) to disk (/dev/asmdisk2)
Sat Apr 25 13:04:58 2020
ERROR: no read quorum in group: required 1, found 0 disks
NOTE: cache dismounting (clean) group 1/0xA1DBFF16 (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 81552, image: oracle@rac2db1 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0xA1DBFF16 (DATA)
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0xa1dbff16
NOTE: cache deleting context for group DATA 1/0xa1dbff16
GMON dismounting group 1 at 19 for pid 30, osid 81552
NOTE: Disk DATA_0001 in mode 0x9 marked for de-assignment
NOTE: Disk DATA_0002 in mode 0x9 marked for de-assignment
ERROR: diskgroup DATA was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15040: diskgroup is incomplete
ERROR: ALTER DISKGROUP DATA MOUNT /* asm agent *//* {1:12056:279} */
kfed分析磁盘信息
报错比较明显asm disk磁盘头被lvm的信息取代(因为asm disk 被加入到vg中),根据前面的分析,该磁盘被写入数据很可能超过30G,使用kfed分析一个随意au,确认被破坏,证明开始判断基本正确
root@xff1:/home/oracle11g$kfed read /dev/asmdisk1 aun=10000
kfbh.endian: 51 ; 0x000: 0x33
kfbh.hard: 55 ; 0x001: 0x37
kfbh.type: 32 ; 0x002: *** Unknown Enum ***
kfbh.datfmt: 42 ; 0x003: 0x2a
kfbh.block.blk: 1329801248 ; 0x004: blk=1329801248
kfbh.block.obj: 1128615502 ; 0x008: file=347726
kfbh.check: 1094999892 ; 0x00c: 0x41445f54
kfbh.fcn.base: 675103060 ; 0x010: 0x283d4154
kfbh.fcn.wrap: 1448232275 ; 0x014: 0x56524553
kfbh.spare1: 1598374729 ; 0x018: 0x5f454349
kfbh.spare2: 1162690894 ; 0x01c: 0x454d414e
7F7843EAD400 2A203733 4F432820 43454E4E 41445F54 [37 * (CONNECT_DA]
7F7843EAD410 283D4154 56524553 5F454349 454D414E [TA=(SERVICE_NAME]
7F7843EAD420 6361723D 29626432 44494328 5250283D [=rac2db)(CID=(PR]
7F7843EAD430 4152474F 3A443D4D 4341505C DFCF3153 [OGRAM=D:\PACS1..]
7F7843EAD440 B3BEB7BB 6369445C 65536D6F 72657672 [....\DicomServer]
7F7843EAD450 445C524D 6D6F6369 76726553 524D7265 [MR\DicomServerMR]
7F7843EAD460 6578652E 4F482829 573D5453 362D4E49 [.exe)(HOST=WIN-6]
7F7843EAD470 51414C38 54553645 28294A30 52455355 [8LAQE6UT0J)(USER]
7F7843EAD480 6D64413D 73696E69 74617274 2929726F [=Administrator))]
7F7843EAD490 202A2029 44444128 53534552 5250283D [) * (ADDRESS=(PR]
7F7843EAD4A0 434F544F 743D4C4F 28297063 54534F48 [OTOCOL=tcp)(HOST]
7F7843EAD4B0 2E30313D 2E303831 30332E31 4F502829 [=10.180.1.30)(PO]
7F7843EAD4C0 343D5452 37333539 2A202929 74736520 [RT=49537)) * est]
7F7843EAD4D0 696C6261 2A206873 63617220 20626432 [ablish * rac2db ]
7F7843EAD4E0 3231202A 0A343135 2D534E54 31353231 [* 12514.TNS-1251]
7F7843EAD4F0 54203A34 6C3A534E 65747369 2072656E [4: TNS:listener ]
7F7843EAD500 73656F64 746F6E20 72756320 746E6572 [does not current]
7F7843EAD510 6B20796C 20776F6E 7320666F 69767265 [ly know of servi]
7F7843EAD520 72206563 65757165 64657473 206E6920 [ce requested in ]
7F7843EAD530 6E6E6F63 20746365 63736564 74706972 [connect descript]
………………
7F7843EAE300 6F636944 7265536D 4D726576 69445C52 [DicomServerMR\Di]
7F7843EAE310 536D6F63 65767265 2E524D72 29657865 [comServerMR.exe)]
7F7843EAE320 534F4828 49573D54 4F302D4E 314B304A [(HOST=WIN-0OJ0K1]
7F7843EAE330 4955304E 55282954 3D524553 696D6441 [N0UIT)(USER=Admi]
7F7843EAE340 7473696E 6F746172 29292972 28202A20 [nistrator))) * (]
7F7843EAE350 52444441 3D535345 4F525028 4F434F54 [ADDRESS=(PROTOCO]
7F7843EAE360 63743D4C 48282970 3D54534F 312E3031 [L=tcp)(HOST=10.1]
7F7843EAE370 312E3038 2930332E 524F5028 35353D54 [80.1.30)(PORT=55]
7F7843EAE380 29383632 202A2029 61747365 73696C62 [268)) * establis]
7F7843EAE390 202A2068 32636172 2A206264 35323120 [h * rac2db * 125]
7F7843EAE3A0 540A3431 312D534E 34313532 4E54203A [14.TNS-12514: TN]
7F7843EAE3B0 696C3A53 6E657473 64207265 2073656F [S:listener does ]
7F7843EAE3C0 20746F6E 72727563 6C746E65 6E6B2079 [not currently kn]
7F7843EAE3D0 6F20776F 65732066 63697672 65722065 [ow of service re]
7F7843EAE3E0 73657571 20646574 63206E69 656E6E6F [quested in conne]
7F7843EAE3F0 64207463 72637365 6F747069 34320A72 [ct descriptor.24]
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][32]
通过上述kfed可以看到第10000 au的位置被写入的是数据库异常之后listener.log的信息(该数据库安装在/home目录中),进一步证明覆盖,通过以下信息证明sdg就是asmdisk1
[root@xff1 dev]# ls -l sdg
brw-rw---- 1 root disk 8, 96 Apr 25 00:05 sdg
[root@xff1 dev]# ls -l asmdisk1
brw-rw---- 1 grid asmadmin 8, 96 Apr 25 00:05 asmdisk1
基于现在的情况,data磁盘组是由三块 200G的磁盘组成,第一块磁盘被意外加入vg,并且写入数据大于30G,无法从asm层面直接通过kfed修复磁盘组,然后直接mount,只能通过oracle asm磁盘数据块重组技术(asm disk header 彻底损坏恢复)实现没有覆盖数据的恢复.
该客户运气还不错,通过仅剩的2019年12月份几天的不成功备份找出来所有的数据文件(无归档),然后强制拉库成功.通过碎片恢复的最新的数据文件数据结合2019年12月份备份,实现绝大部分业务数据恢复,最大限度减少客户损失.对于oracle rac数据库服务器磁盘操作需要谨慎.
如果不幸有类似oracle asm disk被破坏(格式化,dd部分,做成lv等),需要进行恢复支持,可以联系我们,做专业的恢复评估,最大限度,最快速度抢救数据,减少损失
Phone:17813235971 Q Q:107644445 E-Mail:dba@xifenfei.com
恢复过部分asm被格式化案例:
又一例asm格式化文件系统恢复
一次完美的asm disk被格式化ntfs恢复
oracle asm disk格式化恢复—格式化为ext4文件系统
oracle asm disk格式化恢复—格式化为ntfs文件系统