比特币加密勒索间隔加密

最近我们在一个客户的oracle恢复case中发现比特币文件系统勒索加密比较特殊,和大家做一个分享
文件加密后缀名为:.$ILICONE
$ILICONE


文件加密特点分析

DUL> dump datafile 5 block 1
Block Header:
block type=0x0b (file header)
block format=0xa2 (oracle 10)
block rdba=0x01400001 (file#=5, block#=1)
scn=0x0000.00000000, seq=1, tail=0x00000b01
block checksum value=0x6e7d=28285, flag=4
File Header:
Db Id=0xe1891cca=3783859402, Db Name=XIFENFEI, Root Dba=0x0
Software vsn=0x0, Compatibility Vsn=0xa200300, File Size=0x3ffffe=4194302 Blocks
File Type=0x3 (data file), File Number=5, Block Size=8192
Tablespace #7 - OA rel_fn:5
DUL> dump datafile 5 block 2
Block Header:
block type=0x63 (unknown)
block format=0x57 (unknown)
block rdba=0xc6538298 (file#=793, block#=1278616)
scn=0xe0ab.fdc4d8d0, seq=225, tail=0xa7b5cab5
block checksum value=0xfaa1=64161, flag=165
corrupted block.
DUL> dump datafile 5 block 3
Block Header:
block type=0x1e (LMT space map block)
block format=0xa2 (oracle 10)
block rdba=0x01400003 (file#=5, block#=3)
scn=0x0000.00246fbe, seq=1, tail=0x6fbe1e01
block checksum value=0xe495=58517, flag=4
DUL> dump datafile 5 block 4
Block Header:
block type=0x83 (unknown)
block format=0xa3 (unknown)
block rdba=0x17e4c9e4 (file#=95, block#=2410980)
scn=0xe3b2.fc505eea, seq=101, tail=0x6e2f1004
block checksum value=0x7f2e=32558, flag=196
corrupted block.
DUL> dump datafile 5 block 5
Block Header:
block type=0x1e (LMT space map block)
block format=0xa2 (oracle 10)
block rdba=0x01400005 (file#=5, block#=5)
scn=0x0000.00264875, seq=1, tail=0x48751e01
block checksum value=0xb25e=45662, flag=4
DUL> dump datafile 5 block 6
Block Header:
block type=0x68 (unknown)
block format=0x35 (unknown)
block rdba=0x7011e0e3 (file#=448, block#=1171683)
scn=0x47bf.9f2df54a, seq=207, tail=0x69ae0a91
block checksum value=0x49f8=18936, flag=174
corrupted block.

通过这里初步分析,确认加密是间隔方式加密,在数据库中表现明显的是每相隔8k进行加密,而且这里是偶数block被加密
确认加密文件结束位置

DUL> dump datafile 5 block 962818 header
Block Header:
block type=0x4d (unknown)
block format=0xde (unknown)
block rdba=0x0bab780d (file#=46, block#=2848781)
scn=0x056b.2c695f6b, seq=223, tail=0x2399e0cb
block checksum value=0x9706=38662, flag=212
corrupted block.
DUL> dump datafile 5 block 962820 header
Block Header:
block type=0x00 (blank block)
block format=0xa2 (oracle 10)
block rdba=0x014eb104 (file#=5, block#=962820)
scn=0x0000.00000000, seq=1, tail=0x00000001
block checksum value=0x174a=5962, flag=5

通过这里可以发现,对于一个32G的文件,一直被加密到block 962818,也就是7.34G(962818*8k),这里间隔加密,而且加密深度特别深,在以往的比特币文件系统加密中比较少见.
再次提醒
1. 不要把数据库暴露在外网
2. 相对linux而言,win更容易受到黑客的攻击
3. 数据库一定要做好备份,条件允许的情况下,配置数据实时同步到其他机器还是有必要的

Using mlock ulimits for SHM_HUGETLB is deprecated

Oracle 数据库运行在linux 6中,启用大页之后,我们经常在/var/log/messages里面会看到类似这样的记录:Mar 11 12:12:33 i-q2ghx82t kernel: oracle (3677): Using mlock ulimits for SHM_HUGETLB is deprecated,我这里的环境也有重现该问题
环境说明

--系统配置
[root@i-q2ghx82t ~]# more /etc/issue
CentOS release 6.8 (Final)
Kernel \r on an \m
[root@i-q2ghx82t ~]# ulimit  -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128331
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128331
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[oracle@i-q2ghx82t ~]$ cat /proc/meminfo|grep Hu
AnonHugePages:         0 kB
HugePages_Total:   10752
HugePages_Free:    10752
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
[root@i-q2ghx82t ~]# more  /proc/sys/vm/hugetlb_shm_group
[root@i-q2ghx82t ~]# id oracle
uid=1000(oracle) gid=1000(oinstall) groups=1000(oinstall),1001(dba),1002(oper),1005(asmdba)
--数据库参数
use_large_pages=only

在本环境中数据库启动正常,大页也可以正常使用,但是在系统日志中有类似Mar 11 12:12:33 i-q2ghx82t kernel: oracle (3677): Using mlock ulimits for SHM_HUGETLB is deprecated 这样的告警.通过分析,是由于少配置了hugetlb_shm_group参数导致(vm.hugetlb_shm_group 参数设置为有权使用 HugePages 的操作系统组。默认情况下,此参数设置为 0,从而允许所有组使用 HugePages。可以将此参数设置为 Oracle 数据库进程所属的操作系统组,如 oinstall),在本系统中在sysctl.conf中增加vm.hugetlb_shm_group=1000,然后重启系统(测试中,如果只是重启数据库,非系统重启后第一次重启数据库,不会出现该告警),系统日志没有出现相关告警.
在Linux 6中配置大页建议加上对应的hugetlb_shm_group参数

分布式存储crash导致ORACLE坏块原因分析

oracle运行在分布式存储中,我们在实验室模拟所有存储节点故障(整个存储断电),然后加电,数据库启动没有坏块,但是有客户在自己的环境做测试,发现数据库启动有大量坏块(几次测试问题依旧)。
客户异常环境报错汇总
模拟存储故障后,数据库报错信息,主要体现犹豫存储模拟故障,数据库读写io失败ORA-15081,ORA-27072,ORA-27061

Thread 2 cannot allocate new log, sequence 488
Checkpoint not complete
  Current log# 3 seq# 487 mem# 0: +DATADG/testdb/onlinelog/group_3.318.970157559
  Current log# 3 seq# 487 mem# 1: +ARCHDG/testdb/onlinelog/group_3.616.970157559
Thu Mar 08 16:17:08 2018
WARNING: Read Failed. group:2 disk:0 AU:6925 offset:49152 size:16384
WARNING: failed to read mirror side 1 of virtual extent 4 logical extent 0 of
file 313 in group [2.1286527068] from disk DATADG_0000  allocation unit 6925 reason error;
if possible, will try another mirror side
Thu Mar 08 16:17:08 2018
WARNING: Read Failed. group:2 disk:0 AU:6921 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0
of file 313 in group [2.1286527068] from disk DATADG_0000  allocation unit 6921 reason error;
 if possible, will try another mirror side
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_lmon_21091.trc:
ORA-00202: control file: '+DATADG/testdb/controlfile/current.313.970157461'
ORA-15081: failed to submit an I/O operation to a disk
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 56729696
Additional information: -1
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_lgwr_21111.trc:
ORA-00202: control file: '+DATADG/testdb/controlfile/current.313.970157461'
ORA-15081: failed to submit an I/O operation to a disk
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 56696864
Additional information: -1
Thu Mar 08 16:17:08 2018
WARNING: Write Failed. group:1 disk:0 AU:56970 offset:65536 size:16384
WARNING: Write Failed. group:2 disk:0 AU:6921 offset:65536 size:16384
…………
WARNING: failed to write mirror side 1 of virtual extent 147 logical extent
0 of file 335 in group 2 on disk 0 allocation unit 11789
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_dbw0_21109.trc:
ORA-15080: synchronous I/O operation to a disk failed
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 8192
WARNING: failed to write mirror side 1 of virtual extent 147 logical extent
0 of file 335 in group 2 on disk 0 allocation unit 11789
KCF: read, write or open error, block=0xddec online=1
        file=18 '+DATADG/testdb/datafile/soe.333.970157833'
        error=15081 txt: ''
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_dbw0_21109.trc:
Thu Mar 08 16:17:09 2018
System state dump requested by (instance=2, osid=21113 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_diag_21081_20180308161709.trc
Thu Mar 08 16:17:09 2018
ORA-1092 : opitsk aborting process
Thu Mar 08 16:17:12 2018
ORA-1092 : opitsk aborting process
Instance terminated by CKPT, pid = 21113

asm日志报错信息,主要也是体现在io的读写错误上Linux-x86_64 Error: 5: Input/output error

Thu Mar 08 16:17:20 2018
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_9636.trc:
ORA-17503: ksfdopn:2 Failed to open file +DATADG/testdb/spfiletestdb.ora
ORA-15001: diskgroup "DATADG" does not exist or is not mounted
Thu Mar 08 16:17:37 2018
WARNING: Write Failed. group:3 disk:0 AU:1 offset:4190208 size:4096
Thu Mar 08 16:17:37 2018
WARNING: Write Failed. group:3 disk:0 AU:22 offset:143360 size:4096
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc:
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 180504
Additional information: -1
WARNING: Hbeat write to PST disk 0.3916310704 in group 3 failed. [4]
WARNING: failed to write mirror side 1 of virtual extent 0 logical extent 0
of file 255 in group 3 on disk 0 allocation unit 22
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc:
ORA-15081: failed to submit an I/O operation to a disk
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 180504
Additional information: -1
Thu Mar 08 16:17:37 2018
NOTE: process _b001_+asm2 (9546) initiating offline of
disk 0.3916310704 (OCRDG_0000) with mask 0x7e in group 3
NOTE: initiating PST update: grp = 3, dsk = 0/0xe96e28b0, mask = 0x6a, op = clear
GMON updating disk modes for group 3 at 15 for pid 27, osid 9546
ERROR: no read quorum in group: required 1, found 0 disks
Thu Mar 08 16:17:37 2018
NOTE: cache dismounting (not clean) group 3/0x4CAED85D (OCRDG)
WARNING: Offline for disk OCRDG_0000 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 9505, image: oracle@dbtest02 (B000)
Thu Mar 08 16:17:37 2018
NOTE: halting all I/Os to diskgroup 3 (OCRDG)
Thu Mar 08 16:17:37 2018
NOTE: LGWR doing non-clean dismount of group 3 (OCRDG)
NOTE: LGWR sync ABA=11.6 last written ABA 11.6
WARNING: Write Failed. group:3 disk:0 AU:22 offset:102400 size:4096
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc:
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 180424
Additional information: -1
WARNING: failed to write mirror side 1 of virtual extent 0 logical extent
0 of file 255 in group 3 on disk 0 allocation unit 22
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc:
ORA-15081: failed to submit an I/O operation to a disk
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 180424
Additional information: -1
WARNING: Write Failed. group:3 disk:0 AU:22 offset:147456 size:4096
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc:
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
…………
Thu Mar 08 16:17:37 2018
ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 512)
)
ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 512)
)
ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 512)
)

存储恢复之后,asm mount磁盘组没有报任何错误

SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
NOTE: Diskgroups listed in ASM_DISKGROUPS are
         ARCHDG
         DATADG
NOTE: Diskgroup used for Voting files is:
         OCRDG
Diskgroup with spfile:OCRDG
Diskgroup used for OCR is:OCRDG
NOTE: cache registered group ARCHDG number=1 incarn=0xb81889ff
NOTE: cache began mount (first) of group ARCHDG number=1 incarn=0xb81889ff
NOTE: cache registered group DATADG number=2 incarn=0xb8288a00
NOTE: cache began mount (first) of group DATADG number=2 incarn=0xb8288a00
NOTE: cache registered group OCRDG number=3 incarn=0xb8288a01
NOTE: cache began mount (first) of group OCRDG number=3 incarn=0xb8288a01
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,0) to disk (/dev/oracleasm/disks/ARCHDISK01)
NOTE: Assigning number (2,0) to disk (/dev/oracleasm/disks/DATADISK01)
NOTE: Assigning number (3,0) to disk (/dev/oracleasm/disks/OCRDISK01)
NOTE: GMON heartbeating for grp 1
GMON querying group 1 at 5 for pid 23, osid 3303
NOTE: cache opening disk 0 of grp 1: ARCHDG_0000 path:/dev/oracleasm/disks/ARCHDISK01
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (first) external redundancy group 1/0xB81889FF (ARCHDG)
* allocate domain 1, invalid = TRUE
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=13.6287 group=1 (ARCHDG)
NOTE: starting recovery of thread=2 ckpt=12.7985 group=1 (ARCHDG)
NOTE: advancing ckpt for group 1 (ARCHDG) thread=1 ckpt=13.6287
NOTE: advancing ckpt for group 1 (ARCHDG) thread=2 ckpt=12.7985
NOTE: cache recovered group 1 to fcn 0.122934
NOTE: redo buffer size is 256 blocks (1053184 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (ARCHDG)
Process LGWR (pid 3287) is running at high priority QoS for Exadata I/O
NOTE: LGWR found thread 1 closed at ABA 13.6286
NOTE: LGWR mounted thread 1 for diskgroup 1 (ARCHDG)
NOTE: LGWR opening thread 1 at fcn 0.122934 ABA 14.6287
NOTE: cache mounting group 1/0xB81889FF (ARCHDG) succeeded
NOTE: cache ending mount (success) of group ARCHDG number=1 incarn=0xb81889ff
NOTE: GMON heartbeating for grp 2
GMON querying group 2 at 7 for pid 23, osid 3303
NOTE: cache opening disk 0 of grp 2: DATADG_0000 path:/dev/oracleasm/disks/DATADISK01
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (first) external redundancy group 2/0xB8288A00 (DATADG)
* allocate domain 2, invalid = TRUE
NOTE: attached to recovery domain 2
NOTE: starting recovery of thread=1 ckpt=12.2323 group=2 (DATADG)
NOTE: starting recovery of thread=2 ckpt=11.2681 group=2 (DATADG)
NOTE: advancing ckpt for group 2 (DATADG) thread=1 ckpt=12.2326
NOTE: advancing ckpt for group 2 (DATADG) thread=2 ckpt=11.2687
NOTE: cache recovered group 2 to fcn 0.21395
NOTE: redo buffer size is 512 blocks (2101760 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 2 (DATADG)
NOTE: LGWR found thread 1 closed at ABA 12.2325
NOTE: LGWR mounted thread 1 for diskgroup 2 (DATADG)
NOTE: LGWR opening thread 1 at fcn 0.21395 ABA 13.2326
NOTE: cache mounting group 2/0xB8288A00 (DATADG) succeeded
NOTE: cache ending mount (success) of group DATADG number=2 incarn=0xb8288a00
NOTE: GMON heartbeating for grp 3
GMON querying group 3 at 9 for pid 23, osid 3303
NOTE: cache opening disk 0 of grp 3: OCRDG_0000 path:/dev/oracleasm/disks/OCRDISK01
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (first) external redundancy group 3/0xB8288A01 (OCRDG)
* allocate domain 3, invalid = TRUE
Thu Mar 08 17:00:24 2018
NOTE: attached to recovery domain 3
NOTE: starting recovery of thread=1 ckpt=13.55 group=3 (OCRDG)
NOTE: starting recovery of thread=2 ckpt=11.7 group=3 (OCRDG)
NOTE: advancing ckpt for group 3 (OCRDG) thread=1 ckpt=13.55
NOTE: advancing ckpt for group 3 (OCRDG) thread=2 ckpt=11.7
NOTE: cache recovered group 3 to fcn 0.851
NOTE: redo buffer size is 512 blocks (2101760 bytes)
Thu Mar 08 17:00:24 2018
NOTE: LGWR attempting to mount thread 1 for diskgroup 3 (OCRDG)
NOTE: LGWR found thread 1 closed at ABA 13.54
NOTE: LGWR mounted thread 1 for diskgroup 3 (OCRDG)
NOTE: LGWR opening thread 1 at fcn 0.851 ABA 14.55
NOTE: cache mounting group 3/0xB8288A01 (OCRDG) succeeded
NOTE: cache ending mount (success) of group OCRDG number=3 incarn=0xb8288a01
Thu Mar 08 17:00:24 2018
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup ARCHDG was mounted
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
SUCCESS: diskgroup DATADG was mounted
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
SUCCESS: diskgroup OCRDG was mounted
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */

数据库虽然启动成功了,但是大量的坏块生产ORA-01578,ORA-01110

ALTER DATABASE OPEN /* db agent *//* {2:40834:2} */
This instance was first to open
Beginning crash recovery of 2 threads
 parallel recovery started with 7 processes
Started redo scan
Completed redo scan
 read 107566 KB redo, 10569 data blocks need recovery
Started redo application at
 Thread 1: logseq 767, block 101851
 Thread 2: logseq 486, block 91861
Recovery of Online Redo Log: Thread 1 Group 1 Seq 767 Reading mem 0
  Mem# 0: +DATADG/testdb/onlinelog/group_1.314.970157463
  Mem# 1: +ARCHDG/testdb/onlinelog/group_1.614.970157463
Recovery of Online Redo Log: Thread 2 Group 4 Seq 486 Reading mem 0
  Mem# 0: +DATADG/testdb/onlinelog/group_4.319.970157561
  Mem# 1: +ARCHDG/testdb/onlinelog/group_4.617.970157561
Recovery of Online Redo Log: Thread 1 Group 2 Seq 768 Reading mem 0
  Mem# 0: +DATADG/testdb/onlinelog/group_2.315.970157463
  Mem# 1: +ARCHDG/testdb/onlinelog/group_2.615.970157463
Recovery of Online Redo Log: Thread 2 Group 3 Seq 487 Reading mem 0
  Mem# 0: +DATADG/testdb/onlinelog/group_3.318.970157559
  Mem# 1: +ARCHDG/testdb/onlinelog/group_3.616.970157559
Thu Mar 08 17:01:11 2018
Hex dump of (file 12, block 126469) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p001_4090.trc
Corrupt block relative dba: 0x0301ee05 (file 12, block 126469)
Fractured block found during crash/instance recovery
Data in bad block:
 type: 32 format: 2 rdba: 0x0301ee05
 last change scn: 0x0000.00446cbb seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xeac22001
 check value in block header: 0xbd0c
 computed block checksum: 0x8679
Reading datafile '+DATADG/testdb/datafile/soe.327.970157825' for corruption at rdba: 0x0301ee05 (file 12, block 126469)
Reread (file 12, block 126469) found same corrupt data (no logical check)
Completed redo application of 84.56MB
Completed crash recovery at
 Thread 1: logseq 768, block 102395, scn 4586960
 Thread 2: logseq 487, block 101664, scn 4587050
 10569 data blocks read, 10569 data blocks written, 107566 redo k-bytes read
ARCH: STARTING ARCH PROCESSES
Thu Mar 08 17:01:12 2018
ARC0 started with pid=43, OS id=4151
Thu Mar 08 17:01:13 2018
ARC0: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Thu Mar 08 17:01:13 2018
ARC1 started with pid=44, OS id=4153
Thu Mar 08 17:01:13 2018
ARC2 started with pid=45, OS id=4155
ARC1: Archival started
ARC2: Archival started
Thu Mar 08 17:01:13 2018
ARC3 started with pid=46, OS id=4157
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
Thread 1 advanced to log sequence 769 (thread recovery)
Picked broadcast on commit scheme to generate SCNs
Thu Mar 08 17:01:13 2018
Thread 2 advanced to log sequence 488 (thread open)
Thread 2 opened at log sequence 488
  Current log# 4 seq# 488 mem# 0: +DATADG/testdb/onlinelog/group_4.319.970157561
  Current log# 4 seq# 488 mem# 1: +ARCHDG/testdb/onlinelog/group_4.617.970157561
Successful open of redo thread 2
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Mar 08 17:01:13 2018
SMON: enabling cache recovery
Archived Log entry 1249 added for thread 2 sequence 487 ID 0xa41f5c11 dest 1:
[4046] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:4294332240 end:4294332460 diff:220 (2 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Thu Mar 08 17:01:14 2018
GTX0 started with pid=55, OS id=4181
Starting background process RCBG
Thu Mar 08 17:01:14 2018
RCBG started with pid=57, OS id=4185
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Thu Mar 08 17:01:14 2018
QMNC started with pid=61, OS id=4195
Thu Mar 08 17:01:15 2018
Hex dump of (file 9, block 43523) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p009_4175.trc
Corrupt block relative dba: 0x0240aa03 (file 9, block 43523)
Fractured block found during buffer read
Data in bad block:
 type: 32 format: 2 rdba: 0x0240aa03
 last change scn: 0x0000.0044ccc4 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xd94f2001
 check value in block header: 0xb09c
 computed block checksum: 0x158b
Reading datafile '+DATADG/testdb/datafile/soe.324.970157821' for corruption at rdba: 0x0240aa03 (file 9, block 43523)
Reread (file 9, block 43523) found same corrupt data (no logical check)
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p009_4175.trc  (incident=216418):
ORA-01578: ORACLE data block corrupted (file # 9, block # 43523)
ORA-01110: data file 9: '+DATADG/testdb/datafile/soe.324.970157821'
Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216418/testdb2_p009_4175_i216418.trc
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Thu Mar 08 17:01:15 2018
Hex dump of (file 14, block 71173) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p003_4094.trc
Corrupt block relative dba: 0x03811605 (file 14, block 71173)
Fractured block found during buffer read
Data in bad block:
 type: 32 format: 2 rdba: 0x03811605
 last change scn: 0x0000.0043916d seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x93662001
 check value in block header: 0xad98
 computed block checksum: 0x20b
Reading datafile '+DATADG/testdb/datafile/soe.329.970157831' for corruption at rdba: 0x03811605 (file 14, block 71173)
Reread (file 14, block 71173) found same corrupt data (no logical check)
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p003_4094.trc  (incident=216306):
ORA-01578: ORACLE data block corrupted (file # 14, block # 71173)
ORA-01110: data file 14: '+DATADG/testdb/datafile/soe.329.970157831'
Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216306/testdb2_p003_4094_i216306.trc
Hex dump of (file 12, block 112385) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_smon_4005.trc
Corrupt block relative dba: 0x0301b701 (file 12, block 112385)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x0301b701
 last change scn: 0x0000.003cb146 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x1d030601
 check value in block header: 0x1e62
 computed block checksum: 0x7a47
Reading datafile '+DATADG/testdb/datafile/soe.327.970157825' for corruption at rdba: 0x0301b701 (file 12, block 112385)
Reread (file 12, block 112385) found same corrupt data (no logical check)
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_smon_4005.trc  (incident=216170):
ORA-01578: ORACLE data block corrupted (file # 12, block # 112385)
ORA-01110: data file 12: '+DATADG/testdb/datafile/soe.327.970157825'
Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216170/testdb2_smon_4005_i216170.trc
Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p003_4094.trc  (incident=216307):
ORA-01578: ORACLE data block corrupted (file # , block # )
Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216307/testdb2_p003_4094_i216307.trc
Completed: ALTER DATABASE OPEN /* db agent *//* {2:40834:2} */

rman 检查坏块信息,发现有大量坏块,而且主要坏块类型是FRACTURED(折断)

SQL> select * from v$database_block_corruption;
     FILE#     BLOCK#	  BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
	12     126469	       1		  0 FRACTURED
	12     112385	       1		  0 FRACTURED
	14	71173	       1		  0 FRACTURED
	 9	43523	       1		  0 FRACTURED
	15	44801	       1		  0 FRACTURED
	15	44855	       1		  0 FRACTURED
	15	44862	       1		  0 FRACTURED
	11	56252	       1		  0 FRACTURED
	12     116481	       1		  0 CHECKSUM
	12     116535	       1		  0 FRACTURED
	15	46903	       1		  0 FRACTURED
	12     108289	       1		  0 FRACTURED
	12     108343	       1		  0 FRACTURED
	 8	95872	       1		  0 FRACTURED
	 8	99712	       1		  0 FRACTURED
	 8     102783	       1		  0 FRACTURED
	 8     104974	       1		  0 FRACTURED
	 8     105856	       1		  0 FRACTURED
	 8     105998	       1		  0 FRACTURED
	 8     109120	       1		  0 FRACTURED
	 8     110016	       1		  0 FRACTURED
	 8     110092	       1		  0 FRACTURED
	 8     116226	       1		  0 FRACTURED
	 8     116231	       1		  0 FRACTURED
	 8     119295	       1		  0 FRACTURED
	 8     130559	       1		  0 FRACTURED
	12     113163	       1		  0 FRACTURED
	12     118283	       1		  0 FRACTURED
	12     119553	       1		  0 FRACTURED
	12     119608	       1		  0 FRACTURED
	12     120576	       1		  0 FRACTURED
	12     120632	       1		  0 FRACTURED
	12     120639	       1		  0 FRACTURED
	12     123407	       1		  0 FRACTURED
	 9	38411	       1		  0 FRACTURED
	 9	67647	       1		  0 FRACTURED
	 9     109068	       1		  0 FRACTURED
	 9     109070	       1		  0 FRACTURED
	11	99839	       1		  0 FRACTURED
	11     104960	       1		  0 FRACTURED
	13	63214	       1		  0 FRACTURED
	13	65097	       1		  0 FRACTURED
	13	71183	       1		  0 FRACTURED
	13	77319	       1		  0 FRACTURED
	13     103104	       1		  0 FRACTURED
	13     107072	       1		  0 FRACTURED
	13     110082	       1		  0 FRACTURED
	13     111106	       1		  0 FRACTURED
	13     114752	       1		  0 FRACTURED
	14	72196	       1		  0 FRACTURED
	14	75275	       1		  0 FRACTURED
	 7	76293	       1		  0 FRACTURED
	 7	76296	       1		  0 FRACTURED
	 7	76415	       1		  0 FRACTURED
	 7	76864	       1		  0 FRACTURED
	15	49976	       1		  0 FRACTURED
	15	81413	       1		  0 FRACTURED
	 2	61512	       1		  0 FRACTURED
	19	35338	       1		  0 FRACTURED
	19	42687	       1		  0 FRACTURED
60 rows selected.

对比实验室和客户环境配置
实验室环境

[oracle@i-q2ghx82t ~]$ more /etc/issue
CentOS release 6.8 (Final)
Kernel \r on an \m
SQL> select path from v$asm_disk;
PATH
----------------------------------------
/dev/asm-disk011
/dev/asm-disk001
/dev/asm-disk014
/dev/asm-disk002
/dev/asm-disk003
/dev/asm-disk015
/dev/asm-disk012
/dev/asm-disk013
8 rows selected.
[root@i-q2ghx82t ~]# fdisk -l|grep vd
Disk /dev/vda: 107.4 GB, 107374182400 bytes
/dev/vda1   *           1       13055   104856576   83  Linux
Disk /dev/vdb: 34.4 GB, 34359738368 bytes
Disk /dev/vdc: 107.4 GB, 107374182400 bytes
Disk /dev/vdd: 268.4 GB, 268435456000 bytes
/dev/vdd1               1       32635   262140606   83  Linux
Disk /dev/vde: 1073.7 GB, 1073741824000 bytes
Disk /dev/vdf: 53.7 GB, 53687091200 bytes
/dev/vdf1               1      104025    52428568+  83  Linux
Disk /dev/vdg: 107.4 GB, 107374182400 bytes
Disk /dev/vdh: 1073.7 GB, 1073741824000 bytes
Disk /dev/vdi: 107.4 GB, 107374182400 bytes
Disk /dev/vdj: 1073.7 GB, 1073741824000 bytes
Disk /dev/vdk: 1073.7 GB, 1073741824000 bytes
Disk /dev/vdl: 1073.7 GB, 1073741824000 bytes
[oracle@i-q2ghx82t ~]$ ls -l /dev/asm-*
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk001 -> vdg
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk002 -> vdc
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk003 -> vdi
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk011 -> vde
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk012 -> vdh
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk013 -> vdl
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk014 -> vdj
lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk015 -> vdk
[root@i-q2ghx82t ~]# fdisk -l /dev/vdj
Disk /dev/vdj: 1073.7 GB, 1073741824000 bytes
16 heads, 63 sectors/track, 2080507 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

客户环境

[oracle@dbtest02 ~]$ more /etc/issue
Red Hat Enterprise Linux Server release 6.6 (Santiago)
Kernel \r on an \m
SQL> select path from v$asm_disk;
PATH
----------------------------------------
/dev/oracleasm/disks/ARCHDISK01
/dev/oracleasm/disks/OCRDISK01
/dev/oracleasm/disks/DATADISK01
[root@dbtest02 ~]# oracleasm listdisks
ARCHDISK01
DATADISK01
OCRDISK01
[root@dbtest02 ~]# oracleasm querydisk -p DATADISK01
Disk "DATADISK01" is a valid ASM disk
/dev/vdc1: LABEL="DATADISK01" TYPE="oracleasm"
[root@dbtest02 ~]# oracleasm querydisk -p ARCHDISK01
Disk "ARCHDISK01" is a valid ASM disk
/dev/vde1: LABEL="ARCHDISK01" TYPE="oracleasm"
[root@dbtest02 ~]# oracleasm querydisk -p OCRDISK01
Disk "OCRDISK01" is a valid ASM disk
/dev/vdd1: LABEL="OCRDISK01" TYPE="oracleasm"
[root@dbtest02 ~]# fdisk -l|grep vd
Disk /dev/vda: 53.7 GB, 53687091200 bytes
/dev/vda1               1        6528    52427776   83  Linux
Disk /dev/vdb: 17.2 GB, 17179869184 bytes
Disk /dev/vdc: 1073.7 GB, 1073741824000 bytes
/dev/vdc1               1     2080507  1048575496+  83  Linux
Disk /dev/vdd: 107.4 GB, 107374182400 bytes
/dev/vdd1               1      208050   104857168+  83  Linux
Disk /dev/vde: 536.9 GB, 536870912000 bytes
/dev/vde1               1     1040253   524287480+  83  Linux
Disk /dev/vdf: 859.0 GB, 858993459200 bytes
/dev/vdf1               1     1664406   838860592+  83  Linux
[root@dbtest02 ~]# fdisk -l /dev/vdf
Disk /dev/vdf: 859.0 GB, 858993459200 bytes
16 heads, 63 sectors/track, 1664406 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x1bdbf439
   Device Boot      Start         End      Blocks   Id  System
/dev/vdf1               1     1664406   838860592+  83  Linux
Partition 1 does not start on physical sector boundary.

这里比较明显:1)实验室环境使用的是udev,而客户环境使用的是asmlib.2)客户环境使用了分区,而实验室环境直接使用裸盘.

尝试在实验室环境使用asmlib

[root@i-q2ghx82t tmp]# ls -l *oracleasm*
-rw-r--r-- 1 root root 36288 Mar 10 13:09 kmod-oracleasm-2.0.8-13.el6_8.x86_64.rpm
-rw-r--r-- 1 root root 17168 Mar 10 13:09 oracleasmlib-2.0.12-1.el6.x86_64.rpm
-rw-r--r-- 1 root root 74984 Mar 10 13:09 oracleasm-support-2.1.8-1.el6.x86_64.rpm
[root@i-q2ghx82t tmp]# rpm -ivh kmod-oracleasm-2.0.8-13.el6_8.x86_64.rpm
warning: kmod-oracleasm-2.0.8-13.el6_8.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY
Preparing...                ########################################### [100%]
   1:kmod-oracleasm         ########################################### [100%]
[root@i-q2ghx82t tmp]# rpm -ivh oracleasm*.rpm
warning: oracleasmlib-2.0.12-1.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY
Preparing...                ########################################### [100%]
   1:oracleasm-support      ########################################### [ 50%]
   2:oracleasmlib           ########################################### [100%]
[root@i-q2ghx82t tmp]#
[root@i-q2ghx82t tmp]# rpm -aq|grep oracleasm
kmod-oracleasm-2.0.8-13.el6_8.x86_64
oracleasmlib-2.0.12-1.el6.x86_64
oracleasm-support-2.1.8-1.el6.x86_64
[root@i-q2ghx82t dev]# oracleasm createdisk ASMVDL /dev/vdl
Device "/dev/vdl" is not a partition

测试比较明显,对于没有分区的分布式存储中的盘,无法直接被asmlib创建asm disk.如果要使用asmlib,需要对磁盘进行分区.

存储特征
通过咨询分布式存储开发,确认几点:
0. 整个存储是基于网络技术实现分布式存储内部数据同步
1. 整个分布式存储使用ssd卡,默认三副本
2. 三副本中有两个副本写入成功才会反馈给应用(数据库),io成功
3. 整个分布式存储,没有引入任何的cache.
4. 存储的最小io单元是4k,由于ssd卡决定的
5. 分布式存储划分出来的lun是自动实现快对齐的

故障原因
由于存储最小单元是4k,但是在os层面物理扇区为4k,逻辑扇区为512,客户为了使用asmlib,对磁盘进行了默认分区,而没有考虑块对齐.会使得在存储整体crash的时候,很多block由于没有块对齐,很可能被进行了拆分(本来一个4k的io,存储上对对应的一个原子io完成,但是现在这个os层面的4k被拆分成了多个io的可能性,使得os层面的4k的io无法有了原则性),进而使得在数据库主机,存储突发异常的时候,有可能发生坏块.

处理建议
1. 在linux 6开始,对于asmlib,建议使用udev
2. 如果坚持asmlib,分区的时候考虑块对齐,不然会出现较大的性能下降,而且还会引起坏块的风险

ORA-12754: Feature 'startup' is disabled due to missing capability 'Runtime Environment'

ORACLE 18C启动报ORA-12754
从http://edelivery.oracle.com网上可以下载oracle 18c for exadata,在普通的linux平台可以正常安装,但是无法正常启动实例(dbca无法创建库),报错信息为:ORA-12754: Feature ‘startup’ is disabled due to missing capability ‘Runtime Environment’.
dbca-12754


由于普通的linux而非真实oracle一体机,而该版本中做了限制,因此无法启动数据库.根据oracle的作风,对于引进的新特性,oracle一般都提供隐含参数或者event控制,通过分析确定有_exadata_feature_on参数进行控制.
修改脚本创建库参数
由于我们无法在dbca图形化的界面中增加新参数,因此可以通过生成创建库脚本,然后通过修改参数文件,增加_exadata_feature_on=true

[oracle@xifenfei scripts]$ ls -ltr *.ora
-rw-r----- 1 oracle oinstall 2252 Feb 28 09:22 init.ora
-rw-r----- 1 oracle oinstall 2264 Feb 28 09:28 initxffdbTempOMF.ora
-rw-r----- 1 oracle oinstall 2386 Feb 28 09:29 initxffdbTemp.ora

然后重新执行xffdb.sh脚本创建库

[oracle@xifenfei scripts]$ sh xffdb.sh
You should Add this entry in the /etc/oratab: xffdb:/u02/soft:Y
SQL*Plus: Release 18.0.0.0.0 Production on Wed Feb 28 09:29:48 2018
Version 18.1.0.0.0
Copyright (c) 1982, 2017, Oracle.  All rights reserved.
Enter new password for SYS:
Enter new password for SYSTEM:
Enter password for SYS:
Connected to an idle instance.
SQL> spool /u02/app/oracle/admin/xffdb/scripts/CloneRmanRestore.log append
SQL> startup mount pfile="/u02/app/oracle/admin/xffdb/scripts/initxffdbTempOMF.ora";
ORACLE instance started.
Total System Global Area 2432694552 bytes
Fixed Size                  8898840 bytes
Variable Size             654311424 bytes
Database Buffers         1761607680 bytes
Redo Buffers                7876608 bytes
Database mounted.
SQL> execute dbms_backup_restore.resetCfileSection(dbms_backup_restore.RTYP_DFILE_COPY);
PL/SQL procedure successfully completed.
SQL> execute dbms_backup_restore.resetCfileSection(13);
PL/SQL procedure successfully completed.
SQL> host /u02/soft/bin/rman @/u02/app/oracle/admin/xffdb/scripts/rmanRestoreDatafiles.sql &&sysPassword;
Recovery Manager: Release 18.0.0.0.0 - Production on Wed Feb 28 09:30:08 2018
Version 18.1.0.0.0
Copyright (c) 1982, 2018, Oracle and/or its affiliates.  All rights reserved.
RMAN> connect target *
2>
3> CATALOG START WITH   '/u02/soft/assistants/dbca/templates//Seed_Database.dfb'  NOPROMPT  ;
4>
5> RUN {
6>
7> set newname for datafile 1 to  '/u02/app/oracle/oradata/XFFDB/system01.dbf' ;
8>
9> set newname for datafile 3 to  '/u02/app/oracle/oradata/XFFDB/sysaux01.dbf' ;
10>
11> set newname for datafile 4 to  '/u02/app/oracle/oradata/XFFDB/undotbs01.dbf' ;
12>
13> set newname for datafile 7 to  '/u02/app/oracle/oradata/XFFDB/users01.dbf' ;
14>
15> restore datafile 1;
16>
17> restore datafile 3;
18>
19> restore datafile 4;
20>
21> restore datafile 7; }
22>
connected to target database: SEEDDATA (DBID=1214140, not open)
……………………
SQL> set echo on
SQL> spool /u02/app/oracle/admin/xffdb/scripts/postPDBCreation.log append
SQL> select TABLESPACE_NAME from cdb_tablespaces a,dba_pdbs b where a.con_id=b.con_id and UPPER(b.pdb_name)=UPPER('pdb');
TABLESPACE_NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
USERS
SQL> connect "SYS"/"&&sysPassword" as SYSDBA
Connected.
SQL> alter session set container=pdb;
Session altered.
SQL> set echo on
SQL> spool /u02/app/oracle/admin/xffdb/scripts/postPDBCreation.log append
SQL> Select count(*) from dba_registry where comp_id = 'DV' and status='VALID';
  COUNT(*)
----------
         1
SQL> alter session set container=CDB$ROOT;
Session altered.
SQL> exit;
Disconnected from Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
[oracle@xifenfei scripts]$

登录oracle 18c

[oracle@xifenfei scripts]$ sqlplus / as sysdba
SQL*Plus: Release 18.0.0.0.0 Production on Wed Feb 28 09:40:08 2018
Version 18.1.0.0.0
Copyright (c) 1982, 2017, Oracle.  All rights reserved.
Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
SQL> set lines 150
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
BANNER_FULL
------------------------------------------------------------------------------------------------------
BANNER_LEGACY                                                                        CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production                    0
SQL> col name for a52
SQL> col value for a24
SQL> col description for a50
SQL> set linesize 150
SQL> select a.ksppinm name,b.ksppstvl value,a.ksppdesc description
  2    from x$ksppi a,x$ksppcv b
  3   where a.inst_id = USERENV ('Instance')
  4     and b.inst_id = USERENV ('Instance')
  5     and a.indx = b.indx
  6     and upper(a.ksppinm) LIKE upper('%&param%')
  7  order by name
  8  /
Enter value for param: _exadata_feature_on
old   6:    and upper(a.ksppinm) LIKE upper('%&param%')
new   6:    and upper(a.ksppinm) LIKE upper('%_exadata_feature_on%')
NAME                                                 VALUE                    DESCRIPTION
---------------------------------------------------- ------------------------ -------------------------
_exadata_feature_on                                  TRUE                     Exadata Feature On
SQL> show parameter name;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cdb_cluster_name                     string
cell_offloadgroup_name               string
db_file_name_convert                 string
db_name                              string      xffdb
db_unique_name                       string      xffdb
global_names                         boolean     FALSE
instance_name                        string      xffdb
lock_name_space                      string
log_file_name_convert                string
pdb_file_name_convert                string
processor_group_name                 string
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
service_names                        string      xffdb
SQL>

OSD-04016: 异步 I/O 请求排队时出错

有某客户由于硬件故障,导致数据库无法启动,让我们介入处理
数据库启动报错

Mon Feb 26 17:28:24 2018
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
Started redo scan
Completed redo scan
 read 2054 KB redo, 509 data blocks need recovery
Started redo application at
 Thread 1: logseq 41341, block 54
Recovery of Online Redo Log: Thread 1 Group 1 Seq 41341 Reading mem 0
  Mem# 0: E:\ORADATA\ORCL\REDO01.LOG
Completed redo application of 1.77MB
KCF: read, write or open error, block=0x16439 online=1
        file=1 'E:\ORADATA\ORCL\SYSTEM01.DBF'
        error=27070 txt: 'OSD-04016: 异步 I/O 请求排队时出错。
O/S-Error: (OS 1) 函数不正确。'
Errors in file d:\oracle\diag\rdbms\orcl\orcl\trace\orcl_dbw0_4928.trc:
ORA-01243: system tablespace file suffered media failure
ORA-01114: IO error writing block to file 1 (block # 91193)
ORA-01110: data file 1: 'E:\ORADATA\ORCL\SYSTEM01.DBF'
ORA-27070: async read/write failed
OSD-04016: 异步 I/O 请求排队时出错。
O/S-Error: (OS 1) 函数不正确。
DBW0 (ospid: 4928): terminating the instance due to error 1243
Mon Feb 26 17:28:29 2018
Instance terminated by DBW0, pid = 4928

20180227223540


这里错误比较明显,由于io错误,在数据库实例恢复之时,写回block正好在该损坏位置从而使得数据库无法正常实例恢复,进而无法open.

dbv验证文件
通过专业工具对system文件进行了重构system文件,然后dbv检查结果如下

DBVERIFY: Release 11.2.0.1.0 - Production on 星期二 2月 27 17:20:14 2018
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
DBVERIFY - 开始验证: FILE = D:\OK\SYSTEM01.DBF
页 91156 标记为损坏
Corrupt block relative dba: 0x00416414 (file 1, block 91156)
Bad header found during dbv:
Data in bad block:
 type: 229 format: 6 rdba: 0xe1d9e3e7
 last change scn: 0xd682.ffc8c7cb seq: 0x8c flg: 0x8c
 spare1: 0xc0 spare2: 0xf6 spare3: 0x70b3
 consistency value in tail: 0x71f50602
 check value in block header: 0x8195
 computed block checksum: 0x8689
DBVERIFY - 验证完成
检查的页总数: 215040
处理的页总数 (数据): 178112
失败的页总数 (数据): 0
处理的页总数 (索引): 19070
失败的页总数 (索引): 0
处理的页总数 (其他): 3118
处理的总页数 (段)  : 1
失败的总页数 (段)  : 0
空的页总数: 14739
标记为损坏的总页数: 1
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 1638554501 (0.1638554501)

这里比较明显重构出来的system文件只有block 91156坏块,这里注意和没有处理之前的坏块不一样

通过dump分析坏块所属对象

Start dump data block from file D:\OK\SYSTEM01.DBF minblk 91156 maxblk 91156
 V10 STYLE FILE HEADER:
	Compatibility Vsn = 186646528=0xb200000
	Db ID=1383974140=0x527dc4fc, Db Name='ORCL'
	Activation ID=0=0x0
	Control Seq=806694=0xc4f26, File size=215040=0x34800
	File Number=1, Blksiz=8192, File Type=3 DATA
Dump all the blocks in range:
buffer tsn: 0 rdba: 0xe1d9e3e7 (903/1696743)
scn: 0xd682.ffc8c7cb seq: 0x8c flg: 0x8c tail: 0x71f50602
frmt: 0x06 chkval: 0x8195 type: 0xe5=unknown
Hex dump of corrupt header 4 = CORRUPT
Start dump data block from file D:\OK\SYSTEM01.DBF minblk 91155 maxblk 91155
 V10 STYLE FILE HEADER:
	Compatibility Vsn = 186646528=0xb200000
	Db ID=1383974140=0x527dc4fc, Db Name='ORCL'
	Activation ID=0=0x0
	Control Seq=806694=0xc4f26, File size=215040=0x34800
	File Number=1, Blksiz=8192, File Type=3 DATA
Dump all the blocks in range:
buffer tsn: 0 rdba: 0x00416413 (1/91155)
scn: 0x0000.613ad8d6 seq: 0x01 flg: 0x06 tail: 0xd8d60601
frmt: 0x02 chkval: 0xcc5f type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Block header dump:  0x00416413
 Object id on Block? Y
 seg/obj: 0x25  csc: 0x00.613ad8ce  itc: 2  flg: -  typ: 2 - INDEX
     fsl: 0  fnx: 0x0 ver: 0x01
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0009.000.0016fda0  0x00c0130e.6f38.01  CB--    0  scn 0x0000.3e6ed294
0x02   0x0009.008.0023e862  0x00c002cf.a217.13  --U-    1  fsc 0x0000.613ad8d6
Start dump data block from file D:\OK\SYSTEM01.DBF minblk 91157 maxblk 91157
 V10 STYLE FILE HEADER:
	Compatibility Vsn = 186646528=0xb200000
	Db ID=1383974140=0x527dc4fc, Db Name='ORCL'
	Activation ID=0=0x0
	Control Seq=806694=0xc4f26, File size=215040=0x34800
	File Number=1, Blksiz=8192, File Type=3 DATA
Dump all the blocks in range:
buffer tsn: 0 rdba: 0x00416415 (1/91157)
scn: 0x0000.6193dc0c seq: 0x01 flg: 0x06 tail: 0xdc0c0601
frmt: 0x02 chkval: 0x8c21 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Block header dump:  0x00416415
 Object id on Block? Y
 seg/obj: 0x25  csc: 0x00.6193dc04  itc: 2  flg: -  typ: 2 - INDEX
     fsl: 0  fnx: 0x0 ver: 0x01
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0006.00c.001ee103  0x00c008a5.8dbd.02  C---    0  scn 0x0000.57303e03
0x02   0x000b.00f.00053100  0x00c008af.1563.09  --U-    1  fsc 0x0000.6193dc0c

这里比较明显,可以确定坏块为index,object_id=0x25=37,通过查询其他库,确定为i_obj2(obj$的index)

使用该文件启动数据库

SQL> alter database rename file 'E:\ORADATA\ORCL\SYSTEM01.DBF' to 'd:\orcl\SYSTEM01.DBF';
数据库已更改。
SQL> recover database;
完成介质恢复。
SQL> alter database open;
数据库已更改。

检查alert日志
发现smon进程由于坏块的存储,出现大量报错,需要处理,不然数据库一段时间后就会crash.

Tue Feb 27 20:31:23 2018
QMNC started with pid=27, OS id=4652
Completed: alter database open
Tue Feb 27 20:31:25 2018
Starting background process CJQ0
Tue Feb 27 20:31:25 2018
CJQ0 started with pid=30, OS id=2172
Tue Feb 27 20:31:25 2018
db_recovery_file_dest_size of 3912 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Hex dump of (file 1, block 91156) in trace file d:\oracle\diag\rdbms\orcl\orcl\trace\orcl_cjq0_2172.trc
Corrupt block relative dba: 0x00416414 (file 1, block 91156)
Bad header found during multiblock buffer read
Data in bad block:
 type: 229 format: 6 rdba: 0xe1d9e3e7
 last change scn: 0xd682.ffc8c7cb seq: 0x8c flg: 0x8c
 spare1: 0xc0 spare2: 0xf6 spare3: 0x70b3
 consistency value in tail: 0x71f50602
 check value in block header: 0x8195
 computed block checksum: 0x8689
Reading datafile 'D:\ORCL\SYSTEM01.DBF' for corruption at rdba: 0x00416414 (file 1, block 91156)
Reread (file 1, block 91156) found same corrupt data
Errors in file d:\oracle\diag\rdbms\orcl\orcl\trace\orcl_smon_3992.trc  (incident=77085):
ORA-01578: ORACLE data block corrupted (file # 1, block # 91156)
ORA-01110: data file 1: 'D:\ORCL\SYSTEM01.DBF'
Incident details in: d:\oracle\diag\rdbms\orcl\orcl\incident\incdir_77085\orcl_smon_3992_i77085.trc
Errors in file d:\oracle\diag\rdbms\orcl\orcl\trace\orcl_cjq0_2172.trc  (incident=77221):
ORA-01578: ORACLE data block corrupted (file # 1, block # 91156)
ORA-01110: data file 1: 'D:\ORCL\SYSTEM01.DBF'
Incident details in: d:\oracle\diag\rdbms\orcl\orcl\incident\incdir_77221\orcl_cjq0_2172_i77221.trc

重建i_obj2 index,参考:http://www.xifenfei.com/?p=5566

18c新特性:alter system cancel sql

根据18c官方描述cancel sql功能是在18c中引起,但是实测发现在oracle 12.2中已经有了cancel sql功能,可以实现终止掉某个sql的当前sql正在执行的sql语句,而不是传统的直接kill某个会话.ALTER SYSTEM CANCEL SQL语句有四个参数分别为:
cancel_sql

--会话1
SQL> set lines 150
SQL> select * from v$version;
BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production              0
PL/SQL Release 12.2.0.1.0 - Production                                                    0
CORE    12.2.0.1.0      Production                                                        0
TNS for Linux: Version 12.2.0.1.0 - Production                                            0
NLSRTL Version 12.2.0.1.0 - Production                                                    0
SQL> select sid, serial# from v$session where sid in
  2  (select  sid from v$mystat where rownum=1);
       SID    SERIAL#
---------- ----------
       278       4019
SQL> create table t_xifenfei tablespace users as select * from dba_source;
Table created.
SQL> insert into t_xifenfei select * from t_xifenfei;
274132 rows created.                    <<===没有提交
SQL> select count(*)from t_xifenfei;
  COUNT(*)
----------
    548264
SQL> insert into t_xifenfei select * from t_xifenfei;
548264 rows created.     <<===没有提交
SQL> select count(*)from t_xifenfei;
  COUNT(*)
----------
   1096528
SQL> insert into t_xifenfei select * from t_xifenfei;
--会话2
SQL> select count(*)from t_xifenfei;
  COUNT(*)
----------
    274132
SQL> alter system cancel sql '278,4019';
System altered.
SQL> select count(*)from t_xifenfei;
  COUNT(*)
----------
    274132
--会话1
SQL> insert into t_xifenfei select * from t_xifenfei;
insert into t_xifenfei select * from t_xifenfei
*
ERROR at line 1:
ORA-01013: user requested cancel of current operation
SQL> select count(*)from t_xifenfei;
  COUNT(*)
----------
   1096528

这里可以看到会话1的最后一个insert被cancel,但是前面两个没有提交的insert没有被回滚/提交,看到了cancel sql的功能的实现.

Oracle中的主要字母缩写含义

ACD = Active Change Directory
ACFS = ASM Cluster File System
ADDM = Automatic Database Diagnostic Monitor
ADR = Automatic Diagnostic Repository
ADVM = ASM Dynamic Volume Manager
AIO = Asynchronous I/O
AMDU = ASM Metadata Dump Utility
AMM = Automatic Memory Management
ARC = Archive process
ASH = Active Session History
ASM = Automatic Storage Management
ASMB = ASM Background process
ASMCA = ASM Configuration Assistant
ASMCMD = ASM CoMmanD line utility
ASMLIB = ASM LIBrary tool
ASMM = Automatic Shared Memory Management
ASMSNMP = ASM Simple Network Management Protocol
AT = Allocation Table
ATA = Advanced Technology Attachment
AU = Allocation Unit
AWR = Automatic Workload Repository
BH = Block Header
BMC = Baseboard Management Controller
BS = Block Size
CBO = Cost-Based Optimizer
CF = Control File
CFS = Cluster FileSystem
CHM = Cluster Health Monitor
CIO = Concurrent I/O
CKPT = ChecKPoinT process
CLUVFY = CLUster VeriFy utility
COD = Continuing Operation Directory
CPU = Central Processing Unit or Critical Patch Update
CRM = Customer Relationship Management
CRS = Cluster Ready Services
CSS = Cluster Synchronization Services
CSV = Comma-Separated Values
CVM = Cluster Volume Manager
DB = DataBase
DBCA = DataBase Configuration Assistant
DBFS = DataBase File System
DBM = DataBase Machine
DBMS = DataBase Management Systems
DBPERF = DataBase PERFormance
DBV = DataBase Verification tool
DBW = DataBase Writer process
DD = Disk Directory or Data Description tool
DES = Database Excelleration Systems Inc.
DG = DiskGroup
DH = Disk Header
DIO = Direct I/O
DISM = Dynamic Intimate Shared Memory
DM = Device Mapper
DNFS = Direct Network File System
DRAM = Dynamic Random Access Memory
DSS = Decision Support System
DUL = Data UnLoader
DW = Data Warehouse
EIDE = Enhanced Integrated Drive Electronics
ERP = Enterprise Resource Planning
ETA = Estimated Time of Arrival
ETL = Extract Transform Load
FD = File Directory
FG = FailGroup
FRA = Flash Recovery Area or Fast Recovery Area
FS = FileSystem
FST = Free Segments Table
FTS = Full Table Scan
GC = Grid Control
GI = Grid Infrastructure
GUI = Graphical User Interface
HA = High Availability
HARD = Hardware Assisted Resilient Data
HB = Heart Beat
HBA = Host Bus Adapter
IDE = Integrated Drive Electronics
IIS = Internet Information Services
INST = INSTance
IO = Input/Output
IOPS = IO Per Second
IOT = Index Organized Table
IP = Internet Protocol
IPMI = Intelligent Platform Management Interface
JET = Joint Escalation Team
JFS = Journaled FileSystem
KB = KiloByte
KFED = Kernel Files metadata EDitor
KFOD = Kernel Files Osm disk[group] Discovery
LGWR = redo LoG WRiter process
LLT = Veritas Low Latency Transport protocol
LSNR = LiSteNeR
LUN = Logical Unit Number
LVM = Logical Volume Manager
MAA = Maximum Availability Architecture
MB = MegaByte
NAS = Network Attached Storage
NetApp = Network Appliance
NETCA = NETwork Configuration Assistant
NFS = Network FileSystem
NIC = Network Interface Controller
OCFS = Oracle Cluster FileSystem
OCR = Oracle Cluster Registry
ODM = Oracle Disk Manager
ODS = Operational Data Store
OEL = Oracle Enterprise Linux
OEM = Oracle Enterprise Manager
OID = Oracle Internet Directory
OLAP = OnLine Analytical Processing
OLTP = OnLine Transaction Processing
OMS = Oracle Management Service
OPATCH = Oracle PATCHing utility
OS = Operating System
OSCP = Oracle Storage Compatibility Program
OSW = OS Watcher
OSWFW = OS Watcher For Windows
OTN = Oracle Technology Network
OUI = Oracle Universal Installer
PB = PetaByte
PFILE = Parameter FILE
PGA = Program Global Area
PID = Proces ID
PL/SQL = Procedural Language/Structured Query Language
POC = Proof Of Concept
PROCWATCHER = PROCess WATCHER
PST = Partnership Status Table
PSU = Patch Set Update
PX = Parrallel eXecution
RAC = Real Application Cluster
RAID = Redundant Array of Independent Disks
RAM = Random Access Memory
RBAL = ReBALance process
RCA = Root Cause Analysis
RDA = Remote Diagnostic Agent
RDBMS = Relational DataBase Management System
RHEL = RedHat Enterprise Linux
RM = Resource Manager
RMAN = Recovery MANager
RPM = Resource Package Manager
SAN = Storage Area Network
SAS = Serial Attached SCSI
SATA = Serial Advanced Technology Attachment
SCAN = Single Client Access Name
SCSI = Small Computer System Interface
SGA = System Global Area
SLA = Service Level Agreement
SMF = Service Management Facility
SPFILE = Server Parameter FILE
SQL = Structured Query Language
SRDF = EMC Symmetrix Remote Data Facility
SSD = Solid State Disk
SVCTM = average SerViCe TiMe
SVM = Solaris Volume Manager
TB = TeraByte
TCP = Transmission Control Protocol
TDE = Transparent Data Encryption
TKPROF = Transient Kernel PROFile
TNS = Transparent Network Substrate
UDEV = Unix DEVice manager
UDP = User Datagram Protocol
UFG = Umbilicus ForeGround process
UFS = User FileSystem
VBG = Volume BackGround process
VCS = Veritas Cluster Server
VD = Voting Disk
VDBG = Volume Driver BackGround process
VIP = Virtual Internet Portocol
VLDB = Very Large DataBase
VM = Virtual Machine
VMB = Volume Membership BackGround process
VxFS = Veritas File System
VxVM = Veritas Volume Manager
XDB = XML DataBase

DBMS_STATS收集子分区表导致library cache lock等待

有客户反馈一个问题,业务中有一张表无论是查询还是dml操作都非常慢,让我们介入分析
数据库版本
打上了170814比较新的psu

[oracle@xffdb1 ~]$ $ORACLE_HOME/OPatch/opatch lsinventory
Oracle 中间补丁程序安装程序版本 11.2.0.3.15
版权所有 (c) 2018, Oracle Corporation。保留所有权利。
Oracle Home       : /u01/app/oracle/product/11.2.0/db_1
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/11.2.0/db_1/oraInst.loc
OPatch version    : 11.2.0.3.15
OUI version       : 11.2.0.4.0
Log file location : /u01/app/oracle/product/11.2.0/db_1/cfgtoollogs/opatch/opatch2018-01-06_16-20-22下午_1.log
--------------------------------------------------------------------------------
Local Machine Information::
Hostname: xffdb1
ARU platform id: 226
ARU platform description:: Linux x86-64
已安装的顶级产品 (1):
Oracle Database 11g                                                  11.2.0.4.0
此 Oracle 主目录中已安装 1 个产品。
中间补丁程序 (2) :
Patch  26609929     : applied on Fri Sep 15 14:28:51 CST 2017
Unique Patch ID:  21482966
Patch description:  "OCW Patch Set Update : 11.2.0.4.170814 (26609929)"
Patch  26609445     : applied on Fri Sep 15 14:28:04 CST 2017
Unique Patch ID:  21482382
Patch description:  "Database Patch Set Update : 11.2.0.4.170814 (26609445)"

awr数据结果
20180203235626
20180203235701


这里显示大量的library cache lock等待
做systemstate分析

                    Resource Holder State
   LOCK: Handle=0x325d938e08    74: 74: is waiting for PIN: Handle=0x325d938e08
    Enq TX-00E0001B-000004D3   488: 488: is waiting for 74:
              Mutex 1e7fbf6c   322: 322: is waiting for 74:
    Enq TX-02850009-00000086   630: 630: is waiting for 74:
    Enq TX-02BC000A-0000009E    ??? Blocker
    Enq TX-0075000E-000010BE    ??? Blocker
    PIN: Handle=0x325d938e08    ??? Blocker
    Enq TX-049A000C-00000002    ??? Blocker
    Enq TX-034C001D-00000038    ??? Blocker
    Enq TX-052C0012-00000002   669: 669: is waiting for 74:
    Enq TX-04620004-00000003    ??? Blocker
    Enq TX-009B0020-00000DA6    ??? Blocker
              Mutex af4db5a8   242: 242: is waiting for 74:
    Enq TX-041E0002-00000005    ??? Blocker
              Mutex 663b253d   727: 727: is waiting for 74:
    Enq TX-04FA0008-00000003   124: 124: is waiting for 69:
    Enq TX-031D0010-0000005B    ??? Blocker
    Enq TX-02BB0004-000000A2    ??? Blocker
    Enq TX-0248001A-000000CD    69: 69: is waiting for Enq TX-02BC000A-0000009E
    Enq TX-03D70008-00000002    ??? Blocker
    Enq TX-02B1001D-00000081    ??? Blocker
    Enq TX-0423001A-00000003    ??? Blocker
    Enq TX-051A0007-00000003   506: 506: is waiting for 74:
              Mutex 2aceb8e9   602: 602: is waiting for 74:
              Mutex c6b2e0f4   196: 196: is waiting for 74:
    Enq TX-00A30020-0000055A    ??? Blocker
    Enq TX-00D70015-00000315    ??? Blocker
    Enq TX-03B30006-00000012    ??? Blocker
    Enq TX-008C0003-00001005    ??? Blocker
    Enq TX-05470014-00000006   219: 219: is waiting for 74:
    Enq TX-054E0018-00000006   673: 673: is waiting for 74:
              Mutex f28c06f8   279: 279: is waiting for 74:
    Enq TX-03D30012-00000002    ??? Blocker
    Enq TX-055C001B-00000005   333: 333: is waiting for Enq TX-01CF000B-00000B2F
              Mutex 2ff03da9   276: 276: is waiting for 74:
    Enq TX-030A000A-000000B6    ??? Blocker
    Enq TX-00530004-000023DF   212: 212: is waiting for 74:
    Enq TX-05550017-00000002   469: 469: is waiting for 74:
              Mutex 724ba5e3   177: 177: is waiting for 74:
    Enq TX-03C10007-00000015    ??? Blocker
    Enq TX-02E30006-00000079    ??? Blocker
                       IPC 6      6 Blocker
    Enq TX-0289000B-0000005F   449: 449: is waiting for 74:
    Enq TX-021E0002-000001DA    ??? Blocker
              Mutex a9fcc7b8   131: 131: is waiting for 74:
    Enq TX-0483001F-00000009    ??? Blocker
    Enq TX-00800010-00000BA0    ??? Blocker
              Mutex c0af249e   666: 666: is waiting for 74:
              Mutex 94389ed4    78: 78: is waiting for 74:
              Mutex 4ac40611   143: 143: is waiting for 74:
              Mutex 24c1c387   168: 168: is waiting for 74:
              Mutex 930636c9   752: 752: is waiting for 74:
    Enq TX-04660001-00000003    ??? Blocker
    Enq TX-01B6000A-00000022    ??? Blocker
    Enq TX-0166000F-000002C7    ??? Blocker
    Enq TX-02FF0008-000000A1    ??? Blocker
    Enq TX-01E10001-00000023    ??? Blocker
    Enq TX-0327000E-0000009E    ??? Blocker
    Enq TX-05120016-00000002   615: 615: is waiting for 74:
    Enq TX-039F001D-00000011    ??? Blocker
    Enq TX-01CF000B-00000B2F    ??? Blocker
    Enq TX-01130020-000001B6    ??? Blocker
    Enq TX-052A0021-00000002   635: 635: is waiting for 74:
              Mutex ebf68fee   579: 579: is waiting for 74:
    Enq TX-02350017-000000D6    ??? Blocker
    Enq TX-00BB0000-00000598    ??? Blocker
    Enq TX-0443000E-00000003    ??? Blocker
              Mutex e7bae014   344: 344: is waiting for 74:
              Mutex fe251793   616: 616: is waiting for 74:
              Mutex 750494ae   180: 180: is waiting for 74:
    Enq TX-01230010-000001C0    ??? Blocker
    Enq TX-0543001C-00000003   119: 119: is waiting for 124:
    Enq TX-04E60014-00000005    ??? Blocker
    Enq TX-00F1000A-000002AB    ??? Blocker
              Mutex aa83fbd7   651: 651: is waiting for 74:
    Enq TX-058A0002-00000003    ??? Blocker
    Enq TX-03B00016-0000000F    ??? Blocker
              Mutex 26e065a4   150: 150: is waiting for 74:
    Enq TX-0219001B-000001BE    ??? Blocker
    Enq TX-00B9001D-00000069    ??? Blocker
    Enq TX-01110020-00000198    ??? Blocker
    Enq TX-04F3001B-00000002    ??? Blocker
    Enq TX-04A60015-00000002    ??? Blocker
    Enq TX-02D30015-0000008A    ??? Blocker
    Enq TX-01540018-000001B7    ??? Blocker
    Enq TX-02240019-0000001C    ??? Blocker
              Mutex e595002d   502: 502: is waiting for 74:
              Mutex 1661a4cb   732: 732: is waiting for 74:
    Enq TX-00320010-00003A12    ??? Blocker
    Enq TX-020F001F-00000027    ??? Blocker
    Enq TX-0387000A-0000004D   515: 515: is waiting for 74:
    Enq TX-01260014-0000012D    ??? Blocker
    Enq TX-02EE001D-0000005F    ??? Blocker
    Enq TX-004A0001-00000882   603: 603: is waiting for Enq TX-02240019-0000001C
              Mutex 4a18a781   258: 258: is waiting for 74:
   LOCK: Handle=0x325aa4c428   643: 643: is waiting for 74:
    Enq TX-037A0002-0000006E   118: 118: is waiting for 74:
    Enq TX-02460020-0000006D   493: 493: is waiting for Enq TX-00F1000A-000002AB
PROCESS 74: J000
  ----------------------------------------
  SO: 0x31a2b2e440, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x31a2b2e440, name=process, file=ksu.h LINE:12721, pg=0
  (process) Oracle pid:74, ser:153, calls cur/top: 0x305d24c378/0x305f3f5eb0
            flags : (0x0) -
            flags2: (0x10),  flags3: (0x10)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 80
              last post received-location: kji.h LINE:3691 ID:kjata: wake up enqueue owner
              last process to post me: 0x3182a6fca0 1 6
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:285 ID:ksasnd
              last process posted by me: 0x3182a75038 2 6
    (latch info) wait_event=0 bits=0x0
    Process Group: DEFAULT, pseudo proc: 0x3162ddbe90
    O/S info: user: oracle, term: UNKNOWN, ospid: 103036
    OSD pid info: Unix process pid: 103036, image: oracle@xffdb1 (J000)
    Short stack dump:
      ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10
      <-skgpwwait()+178<-ksliwat()+2046<-kslwaitctx()+163<-kjusuc()+3400
      <-ksipgetctxi()+1759<-kqlmPin()+2943<-kqlmClusterLock()+237<-kglpnal()+4072
      <-kglpin()+1381<-qostobkglcrt1()+640<-qostobkglcrt()+255<-qospsts()+1639
      <-spefcmpa()+196<-spefmccallstd()+235<-pextproc()+41<-peftrusted()+150
      <-psdexsp()+255<-rpiswu2()+1776<-psdextp()+700<-pefccal()+726<-pefcal()+224
      <-pevm_FCAL()+169<-pfrinstr_FCAL()+75<-pfrrun_no_tool()+63<-pfrrun()+627
      <-plsql_run()+649<-peidxr_run()+263<-peidxexe()+79<-kkxdexe()+338
      <-kkxmpexe()+241<-kgmexwi()+605<-kgmexec()+2193<-evapls()+813
      <-evaopn2()+808<-kkxmexcs()+133<-opiexe()+20880<-kpoal8()+2380
      <-opiodr()+917<-kpoodr()+1401<-upirtrc()+2436<-kpurcsc()+98
      <-kpuexec()+10790<-OCIStmtExecute()+39<-jslvec_execcb1()+672
      <-jslvswu()+56<-jslve_execute0()+2257<-jslve_execute()+332
      <-rpiswu2()+1776<-kkjex1e()+379<-kkjsexe()+711<-kkjrdp()+694
      <-opirip()+958<-opidrv()+603<-sou2o()+103<-opimai_real()+250
      <-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
    SO: 0x31c382f010, type: 4, owner: 0x31a2b2e440, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x31a2b2e440, name=session, file=ksu.h LINE:12729, pg=0
    (session) sid: 6001 ser: 679 trans: 0x311e36fa28, creator: 0x31a2b2e440
              flags: (0x8010041) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              edition#: 100              oct: 170, prv: 0, sql: 0x323de178e8, psql: 0x32b5d0fc98, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$USERS
    client details:
      O/S info: user: oracle, term: UNKNOWN, ospid: 103036
      machine: xffdb1 program: oracle@xffdb1 (J000)
      application name: DBMS_SCHEDULER, hash value=2478762354
      action name: ORA$AT_OS_OPT_SY_1594, hash value=1524069073
    Current Wait Stack:
     0: waiting for 'library cache pin'
        handle address=0x325d938e08, pin address=0x325baec3e0, 100*mode+namespace=0x850fe00010003
        wait_id=46488 seq_num=46502 snap_id=1
        wait times: snap=6 min 2 sec, exc=6 min 2 sec, total=6 min 2 sec
        wait times: max=15 min 0 sec, heur=6 min 2 sec
        wait counts: calls=727 os=727
        in_wait=1 iflags=0x15a2
    There are 2136 sessions blocked by this session.

发现主要是74号进程阻塞了其他的
做hanganalyze分析

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (xffdb.xffdbsv1)
                   os id: 82310
              process id: 224, oracle@xffdb1
              session id: 4
        session serial #: 12237
    }
    is waiting for 'enq: TX - row lock contention' with wait info:
    {
                      p1: 'name|mode'=0x54580006
                      p2: 'usn<<16 | slot'=0x21e0002
                      p3: 'sequence'=0x1da
            time in wait: 3 min 48 sec
           timeout after: never
                 wait id: 1260
                blocking: 0 sessions
             current sql: UPDATE t_xifenfei subpartition(p201801_s32)
                            SET INDICATORCODE   = :1,
                                LOWERLIMIT      = :2,
                                UPPERLIMIT      = :3,
                                AVGUNITPRICE    = :4,
                                TRADEITEMNAME   = :5,
             short stack: ………………
            wait history:
              * time between current wait and wait #1: 0.000220 sec
              1.       event: 'gc cr block 2-way'
                 time waited: 0.000341 sec
                     wait id: 1259            p1: ''=0x6
                                              p2: ''=0x3590
                                              p3: ''=0x44b
              * time between wait #1 and #2: 0.000284 sec
              2.       event: 'gc current block 3-way'
                 time waited: 0.000663 sec
                     wait id: 1258            p1: ''=0xa
                                              p2: ''=0x3ad02
                                              p3: ''=0x2000001
              * time between wait #2 and #3: 0.000383 sec
              3.       event: 'gc cr block 2-way'
                 time waited: 0.000288 sec
                     wait id: 1257            p1: ''=0x5
                                              p2: ''=0x2e20
                                              p3: ''=0x957
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 3 (xffdb.xffdbsv3)
                   os id: 39472
              process id: 587, oracle@xffdb3
              session id: 2215
        session serial #: 8213
    }
    which is waiting for 'library cache lock' with wait info:
    {
                      p1: 'handle address'=0x327f5ecea0
                      p2: 'lock address'=0x327a5ab660
                      p3: '100*mode+namespace'=0x850fe00010002
            time in wait: 3 min 43 sec
           timeout after: 11 min 16 sec
                 wait id: 2343
                blocking: 3 sessions
             current sql: SELECT COUNT(*) COUNT
                          FROM t_xifenfei subpartition(p201801_s32)
                         WHERE TRADEITEMID = '679EA6DE428F414D014B3D5EC8DE5E32'
             short stack:  ………………
                 wait history:
              * time between current wait and wait #1: 0.000908 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 0.001074 sec
                     wait id: 2342            p1: 'driver id'=0x54435000
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000033 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000003 sec
                     wait id: 2341            p1: 'driver id'=0x54435000
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000069 sec
              3.       event: 'SQL*Net message from client'
                 time waited: 0.001232 sec
                     wait id: 2340            p1: 'driver id'=0x54435000
                                              p2: '#bytes'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (xffdb.xffdbsv1)
                   os id: 103036
              process id: 74, oracle@xffdb1 (J000)
              session id: 6001
        session serial #: 679
    }
    which is waiting for 'library cache pin' with wait info:
    {
                      p1: 'handle address'=0x325d938e08
                      p2: 'pin address'=0x325baec3e0
                      p3: '100*mode+namespace'=0x850fe00010003
            time in wait: 3 min 47 sec
           timeout after: 11 min 12 sec
                 wait id: 46488
                blocking: 2099 sessions
             current sql: call dbms_stats.gather_database_stats_job_proc (  )
             short stack: …………
            wait history:
              * time between current wait and wait #1: 0.000038 sec
              1.       event: 'library cache lock'
                 time waited: 0.012353 sec
                     wait id: 46487           p1: 'handle address'=0x325d938e08
                                              p2: 'lock address'=0x3259aee680
                                              p3: '100*mode+namespace'=0x850fe00010003
              * time between wait #1 and #2: 0.002509 sec
              2.       event: 'enq: IV -  contention'
                 time waited: 0.001079 sec
                     wait id: 46486           p1: 'type|mode'=0x49560005
                                              p2: 'id1'=0x53594e43
                                              p3: 'id2'=0x15
              * time between wait #2 and #3: 0.000085 sec
              3.       event: 'enq: IV -  contention'
                 time waited: 0.001817 sec
                     wait id: 46485           p1: 'type|mode'=0x49560005
                                              p2: 'id1'=0x4c4f434b
                                              p3: 'id2'=0x15
    }
Chain 1 Signature: 'library cache pin'<='library cache lock'<='enq: TX - row lock contention'
Chain 1 Signature Hash: 0xa08ff7bf

到这里基本上可以确定是由于自动任务收集统计信息导致系统出现大量的library cache lock和library cache pin,另外可以确定大部分被阻塞在library cache 相关的select和dml语句都集中在t_xifenfei这个子分区表中,通过查询mos,发现相关bug:Bug 19790972 – “library cache lock” waits due to DBMS_STATS gather of stats for a subpartition
20180204002507


osw配置私网监控(oswprvtnet)

OS Watcher Black Box (OSWbb) 是一种 UNIX shell 脚本的集合,主要用于收集和归档操作系统和网络的度量,以便为诊断性能问题提供支持。OSWbb 作为服务器上的一组后台进程来运行,定期收集OS 数据,其调用 Unix 实用程序,如 vmstat、netstat 和 iostat 等。在最近的gi psu中已经把osw包含在tfa中了,但是默认不启用私网的监控,如果需要可以进行人工配置
OSWbb 使用 traceroute 命令获取这些私有网络的状态。每个操作系统对 traceroute 命令使用的参数都略有不同。基本 oswbb 目录中的示例Exampleprivate.net 文件中包含每个操作系统使用的语法。因此,这将导致在 UNIX 平台中的输出显示有所不同。OSWbb 按指定间隔运行 private.net 文件,并在归档目录下的 oswprvtnet 子目录中存储数据。数据存储在每小时归档的文件中。文件中的每个条目都包含一个时间戳,前缀为 ***,嵌入在顶部输出中。
tfa中含osw

[root@xifenfei01 ~]# ps -ef|grep osw
grid      17673      1  0 Jan30 ?        00:00:07 /bin/sh ./OSWatcher.sh 30 48 NONE /u01/app/grid/tfa/repository/suptools/xifenfei01/oswbb/grid/archive
grid      17841  17673  0 Jan30 ?        00:00:01 /bin/sh ./OSWatcherFM.sh 48 /u01/app/grid/tfa/repository/suptools/xifenfei01/oswbb/grid/archive
root      64159  64120  0 00:30 pts/0    00:00:00 grep osw

测试prvtnet.sh

[root@xifenfei01 oswbb]# cd /u01/app/grid/tfa/repository/suptools/xifenfei01/oswbb/grid/oswbb
[root@xifenfei01 oswbb]# more prvtnet.sh
echo "zzz ***"`date`
traceroute -r -F xifenfei01-priv
traceroute -r -F xifenfei02-priv
######################################################################
# DO NOT DELETE THE FOLLOWING LINE!!!!!!!!!!!!!!!!!!!!!
######################################################################
rm locks/lock.file
[root@xifenfei01 oswbb]# ./prvtnet.sh
zzz ***Tue Jan 30 18:02:49 CST 2018
traceroute to xifenfei01-priv (192.168.102.11), 30 hops max, 60 byte packets
 1  xifenfei01-priv (192.168.102.11)  0.010 ms  0.003 ms  0.002 ms
traceroute to xifenfei02-priv (192.168.102.14), 30 hops max, 60 byte packets
 1  xifenfei02-priv (192.168.102.14)  0.081 ms  0.076 ms  0.087 ms
rm: cannot remove `locks/lock.file': No such file or directory

配置private.net正式收集私网信息

[root@xifenfei01 oswbb]# find . -name OSWatcher.sh | xargs grep -i "private"
# Run traceroute for private networks if file private.net exists
if [ -x private.net ]; then
   ./private.net >> $OSWBB_ARCHIVE_DEST/oswprvtnet/${hostn}_prvtnet_$hour 2>&1 &

这里重要的是:
./OSWatcher.sh:# Run traceroute for private networks if file private.net exists
表示,在OSWatcher.sh中调用的文件名称是private.net
将prvtnet.sh复制为供OSWatcher.sh脚本调用的文件名称

[root@xifenfei01 oswbb]# cp prvtnet.sh private.net
[root@xifenfei01 oswbb]# chown grid:oinstall prvtnet.net
[root@xifenfei01 oswbb]# chmod +x prvtnet.net

关注内容

示例 1: 接口处于启动和响应状态:
traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 1492 byte packets
1 X.X.X.X 1.015 ms 0.766 ms 0.755 ms
示例 2: 目标接口不是在直接连接的网络上,所以需要验证地址
         是否正确或其插入的交换机是否位于相同的 VLAN 上(或其他问题):
traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets
traceroute: host X.X.X.X is not on a directly-attached network
示例 3: 网络无法访问:
traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets
Network is unreachable
示例 4: 网络无法访问:
xifenfei01_prvtnet_18.01.30.2000.dat-zzz ***Tue Jan 30 20:27:43 CST 2018
xifenfei01_prvtnet_18.01.30.2000.dat-traceroute to xifenfei01-priv (192.168.102.11), 30 hops max, 60 byte packets
xifenfei01_prvtnet_18.01.30.2000.dat- 1  xifenfei01-priv (192.168.102.11)  0.014 ms  0.004 ms  0.005 ms
xifenfei01_prvtnet_18.01.30.2000.dat-traceroute to xifenfei02-priv (192.168.102.14), 30 hops max, 60 byte packets
xifenfei01_prvtnet_18.01.30.2000.dat- 1  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 2  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 3  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 4  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 5  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 6  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 7  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 8  * * *
xifenfei01_prvtnet_18.01.30.2000.dat- 9  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-10  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-11  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-12  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-13  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-14  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-15  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-16  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-17  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-18  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-19  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-20  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-21  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-22  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-23  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-24  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-25  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-26  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-27  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-28  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-29  * * *
xifenfei01_prvtnet_18.01.30.2000.dat-30  * * *
xifenfei01_prvtnet_18.01.30.2000.dat:zzz ***Warning. Traceroute response is spanning snapshot intervals.

reliable message等待

在一次给客户的rac poc测试中,发现大量reliable message等待,严重影响性能
数据库版本
11.2.0.4+最新psu(180116)

SYS@xffdb1>select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE    11.2.0.4.0      Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production
Elapsed: 00:00:00.00
SYS@xffdb1>exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
[oracle@xifenfei01 ~]$ opatch lspatches
26609929;OCW Patch Set Update : 11.2.0.4.170814 (26609929)
26925576;Database Patch Set Update : 11.2.0.4.180116 (26925576)
OPatch succeeded.

监控em发现大量other等待
3


查询发现大量reliable message等

SYS@xffdb1>@event
         INST_ID USERNAME                       EVENT                                                      COUNT(*)
---------------- ------------------------------ -------------------------------------------------- ----------------
               1 SOE                            reliable message                                                 16
               1 SOE                            gc current request                                                4
               1 SOE                            gc cr request                                                     5
               1 SOE                            db file sequential read                                          17
               1 SYS                            db file scattered read                                            1
               2 SOE                            library cache: mutex X                                            2
               2 SOE                            latch: ges resource hash list                                     2
               2 SOE                            gc current request                                                1
               2 SOE                            gc cr request                                                     4
               2 SOE                            db file sequential read                                          12
               2                                db file parallel write                                            1
               2 SOE                            SQL*Net message to client                                         1
12 rows selected.

进一步查询发现大量Result Cache: Channel

SYS@xffdb1>SELECT CHANNEL,
  2    SUM(wait_count) sum_wait_count
  3  FROM GV$CHANNEL_WAITS
  4  GROUP BY CHANNEL
  5  ORDER BY SUM(wait_count) DESC;
CHANNEL                                                            SUM_WAIT_COUNT
---------------------------------------------------------------- ----------------
Result Cache: Channel                                                   379367698
RBR channel                                                                128968
kxfp control signal channel                                                115091
MMON remote action broadcast channel                                         4646
obj broadcast channel                                                         395
LCK0 ksbxic channel                                                            17
parameters to cluster db instances - broadcast channel                          4
service operations - broadcast channel                                          3
kill job broadcast - broadcast channel                                          2

查看result_cache_max_size参数

SYS@xffdb1>show parameter result_cache_max_size;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
result_cache_max_size                big integer 335552K

调整result_cache_max_size参数

SYS@xffdb1>alter system set result_cache_max_size=0;
System altered.
SYS@xffdb1>show parameter result_cache_max_size;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
result_cache_max_size                big integer 0

重启数据库,再次测试,发现相关等待消失,系统也更加稳定

SYS@xffdb1>@event
         INST_ID USERNAME                       EVENT                                                      COUNT(*)
---------------- ------------------------------ -------------------------------------------------- ----------------
               1 SOE                            library cache: mutex X                                            1
               1 SOE                            gc current request                                                3
               1 DBSNMP                         gc cr request                                                     1
               1 SOE                            gc cr request                                                     7
               1 SOE                            db file sequential read                                          18
               1                                db file parallel write                                            1
               2 SOE                            library cache: mutex X                                            2
               2 SOE                            latch free                                                        1
               2 SOE                            gc current request                                                4
               2 SOE                            gc cr request                                                     4
               2 SOE                            db file sequential read                                          12
               2                                db file parallel write                                            3
12 rows selected.

4


参考:Very High Waits for ‘reliable message’ After Upgrade to 11.2.0.4 When Using Result Cache (Doc ID 1951729.1)