2022年5月 – Database SOS

Oracle断电故障处理

异常断电导致数据库异常恢复文件报ORA-00283 ORA-00742 ORA-00312

 
D:\check_db>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on 星期二 5月 31 00:38:42 2022

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


连接到:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> recover datafile 1;
ORA-00283: 恢复会话因错误而取消
ORA-00742: 日志读取在线程 %d 序列 %d 块 %d 中检测到写入丢失情况
ORA-00312: 联机日志 3 线程 1:
'D:\APP\ADMINISTRATOR\FAST_RECOVERY_AREA\ORCL\ONLINELOG\O1_MF_3_HJ32KJD5_.LOG'

这个错误比较明显是由于异常断电引起的写丢失导致.而且这种故障在没有备份的情况下，没有什么好处理方法,只能屏蔽一致性强制拉库,尝试强制拉库报错如下

SQL> startup mount pfile='d:/pfile.txt'
ORACLE 例程已经启动。

Total System Global Area 2.0310E+10 bytes
Fixed Size                  2290000 bytes
Variable Size            3690991280 bytes
Database Buffers         1.6576E+10 bytes
Redo Buffers               40837120 bytes
数据库装载完毕。
SQL> recover database until cancel;
ORA-00279: 更改 18755939194213 (在  生成) 对于线程 1 是必需的


指定日志: {<RET>=suggested | filename | AUTO | CANCEL}
D:\APP\ADMINISTRATOR\FAST_RECOVERY_AREA\ORCL\ONLINELOG\O1_MF_3_HJ32KJD5_.LOG
ORA-00600: internal error code, arguments: [3020], [2], [78824], [8467432], [],
[], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 78824, file
offset is 645726208 bytes)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: 'D:\ORADATA\ORCL\SYSAUX01.DBF'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 80834


ORA-01112: 未启动介质恢复


SQL> alter database open resetlogs;
alter database open resetlogs
*
第 1 行出现错误:
ORA-00600: 内部错误代码, 参数: [krsi_al_hdr_update.15], [4294967295], [], [],[], [], [], [], [], [], [], []

ORA-600 krsi_al_hdr_update.15错误,主要是由于redo异常导致无法resetlogs成功,具体参考：Alter Database Open Resetlogs returns error ORA-00600: [krsi_al_hdr_update.15], (Doc ID 2026541.1)描述,处理这个问题之后,再次resetlogs库，报ORA-600 2662错误

SQL> alter database open resetlogs;
alter database open resetlogs
*
第 1 行出现错误:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00600: internal error code, arguments: [2662], [4366], [4112122046],
[4366], [4112228996], [12583040], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [2662], [4366], [4112122045],
[4366], [4112228996], [12583040], [], [], [], [], [], []
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00600: internal error code, arguments: [2662], [4366], [4112122040],
[4366], [4112228996], [12583040], [], [], [], [], [], []
进程 ID: 4644
会话 ID: 1701 序列号: 3

这个问题比较简单,通过修改scn即可绕过去,之后数据库open报ORA-600 4194等错误

SQL> alter database open ;
alter database open 
*
第 1 行出现错误:
ORA-00600: 内部错误代码, 参数: [4194], [

SMON: enabling tx recovery
Database Characterset is ZHS16GBK
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc  (incident=322982):
ORA-00600: internal error code, arguments: [4137], [10.33.3070116], [0], [0], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\incident\incdir_322982\orcl_smon_5112_i322982.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
replication_dependency_tracking turned off (no async multimaster replication found)
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_ora_3340.trc  (incident=323030):
ORA-00600: 内部错误代码, 参数: [4194], [
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\incident\incdir_323030\orcl_ora_3340_i323030.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue May 31 09:05:04 2022
Sweep [inc][322982]: completed
ORACLE Instance orcl (pid = 13) - Error 600 encountered while recovering transaction (10, 33).
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc:
ORA-00600: internal error code, arguments: [4137], [10.33.3070116], [0], [0], [], [], [], [], [], [], [], []
Checker run found 1 new persistent data failures
Tue May 31 09:05:05 2022
Sweep [inc][323030]: completed
Sweep [inc2][322982]: completed
Tue May 31 09:05:14 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc  (incident=322983):
ORA-00600: internal error code, arguments: [4193], [10.33.3070116], [0], [], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\incident\incdir_322983\orcl_smon_5112_i322983.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue May 31 09:05:14 2022
ORA-600 signalled during: alter database open...
Block recovery stopped at EOT rba 2.61.16
Block recovery completed at rba 2.61.16, scn 4366.4112429058
Block recovery from logseq 2, block 60 to scn 18755939643393
Recovery of Online Redo Log: Thread 1 Group 2 Seq 2 Reading mem 0
  Mem# 0: D:\APP\ADMINISTRATOR\FAST_RECOVERY_AREA\ORCL\ONLINELOG\O1_MF_2_K9BSVC11_.LOG
Block recovery completed at rba 2.61.16, scn 4366.4112429058
Dumping diagnostic data in directory=[cdmp_2022053],requested by(instance=1,osid=5112(SMON)),summary=[incident=322983].
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_smon_5112.trc:
ORA-01595: error freeing extent (3) of rollback segment (1))
ORA-00600: internal error code, arguments: [4193], [10.33.3070116], [3], [], [], [], [], [], [], [], [], []

对异常undo进行处理,数据库正常open成功

SQL> shutdown immediate;
ORA-00600: 内部错误代码, 参数: [4193], [


SQL> shutdown abort;
ORACLE 例程已经关闭。
SQL> startup mount
ORACLE 例程已经启动。

Total System Global Area 2.0310E+10 bytes
Fixed Size                  2290000 bytes
Variable Size            3690991280 bytes
Database Buffers         1.6576E+10 bytes
Redo Buffers               40837120 bytes
数据库装载完毕。
SQL> alter database open;

数据库已更改。

hcheck检测有一些字典不一致,建议客户逻辑导出,然后导入到新库中

HCheck Version 07MAY18 on 31-5月 -2022 09:12:22
----------------------------------------------
Catalog Version 11.2.0.4.0 (1102000400)
db_name: ORCL

                                   Catalog       Fixed
Procedure Name                     Version    Vs Release    Timestamp      Resul
t
------------------------------ ... ---------- -- ---------- -------------- -----
-
.- LobNotInObj                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- MissingOIDOnObjCol          ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- SourceNotInObj              ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- OversizedFiles              ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- PoorDefaultStorage          ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- PoorStorage                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- TabPartCountMismatch        ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- OrphanedTabComPart          ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- MissingSum$                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- MissingDir$                 ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- DuplicateDataobj            ... 1102000400 <=  *All Rel* 05/31 09:12:22 PASS
.- ObjSynMissing               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjSeqMissing               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedUndo                ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndex               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndexPartition      ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndexSubPartition   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedTable               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedTablePartition      ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedTableSubPartition   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- MissingPartCol              ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedSeg$                ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- OrphanedIndPartObj#         ... 1102000400 <=  *All Rel* 05/31 09:12:23 FAIL

HCKE-0024: Orphaned Index Partition Obj# (no OBJ$) (Doc ID 1360935.1)
ORPHAN INDPART$: OBJ#=149167 BO#=6378 - no OBJ$ row
ORPHAN INDPART$: OBJ#=149168 BO#=6378 - no OBJ$ row

.- DuplicateBlockUse           ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- FetUet                      ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- Uet0Check                   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- SeglessUET                  ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadInd$                     ... 1102000400 <=  *All Rel* 05/31 09:12:23 FAIL

HCKE-0030: OBJ$ INDEX entry has no IND$ or INDPART$/INDSUBPART$ entry (Doc ID 13
60528.1)
OBJ$ INDEX PARTITION has no INDPART$ entry: Obj#=148278 SYS Name=WRH$_FILESTATXS
_PK PARTITION=WRH$_FILEST_1572571104_16462
OBJ$ INDEX PARTITION has no INDPART$ entry: Obj#=148920 SYS Name=WRH$_FILESTATXS
_PK PARTITION=WRH$_FILEST_1572571104_16678

.- BadTab$                     ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadIcolDepCnt               ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjIndDobj                  ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- TrgAfterUpgrade             ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjType0                    ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadOwner                    ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- StmtAuditOnCommit           ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadPublicObjects            ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadSegFreelist              ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadDepends                  ... 1102000400 <=  *All Rel* 05/31 09:12:23 WARN

HCKW-0016: Dependency$ p_timestamp mismatch for VALID objects (Doc ID 1361045.1)

[E] - P_OBJ#=6376 D_OBJ#=6765

.- CheckDual                   ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- ObjectNames                 ... 1102000400 <=  *All Rel* 05/31 09:12:23 PASS
.- BadCboHiLo                  ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- ChkIotTs                    ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- NoSegmentIndex              ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- BadNextObject               ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- DroppedROTS                 ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- FilBlkZero                  ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- DbmsSchemaCopy              ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- OrphanedObjError            ... 1102000400 >  1102000000 05/31 09:12:24 PASS
.- ObjNotLob                   ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- MaxControlfSeq              ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- SegNotInDeferredStg         ... 1102000400 >  1102000000 05/31 09:12:24 PASS
.- SystemNotRfile1             ... 1102000400 >   902000000 05/31 09:12:24 PASS
.- DictOwnNonDefaultSYSTEM     ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- OrphanTrigger               ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
.- ObjNotTrigger               ... 1102000400 <=  *All Rel* 05/31 09:12:24 PASS
---------------------------------------
31-5月 -2022 09:12:24  Elapsed: 2 secs
---------------------------------------
Found 4 potential problem(s) and 1 warning(s)
Contact Oracle Support with the output and trace file
to check if the above needs attention or not

PL/SQL 过程已成功完成。

PostgreSQL恢复系列:pg_control异常恢复

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：PostgreSQL恢复系列:pg_control异常恢复

在PG中pg_control文件类似oracle数据库的control文件(控制文件),在Oracle中如果该文件丢失/损坏,可以通过alter database create controlfile命令进行创建,对于PG数据库来说也可以通过pg_resetwal命令来实现创建,由于pg_control文件损坏,需要人工指定一些参数完成pg_resetwal相关操作
pg_resetwal 使用说明

-bash-4.2$ pg_resetwal --help
pg_resetwal resets the PostgreSQL write-ahead log.

Usage:
  pg_resetwal [OPTION]... DATADIR

Options:
  -c, --commit-timestamp-ids=XID,XID
                                   set oldest and newest transactions bearing
                                   commit timestamp (zero means no change)
 [-D, --pgdata=]DATADIR            data directory
  -e, --epoch=XIDEPOCH             set next transaction ID epoch
  -f, --force                      force update to be done
  -l, --next-wal-file=WALFILE      set minimum starting location for new WAL
  -m, --multixact-ids=MXID,MXID    set next and oldest multitransaction ID
  -n, --dry-run                    no update, just show what would be done
  -o, --next-oid=OID               set next OID
  -O, --multixact-offset=OFFSET    set next multitransaction offset
  -u, --oldest-transaction-id=XID  set oldest transaction ID
  -V, --version                    output version information, then exit
  -x, --next-transaction-id=XID    set next transaction ID
      --wal-segsize=SIZE           size of WAL segments, in megabytes
  -?, --help                       show this help, then exit

Report bugs to <pgsql-bugs@lists.postgresql.org>.
PostgreSQL home page: <https://www.postgresql.org/>

确认现在业务表记录情况

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# select count(1) from ac_event;
 count  
--------
 246266
(1 row)

模拟pg_control文件异常

-bash-4.2$ ps -ef|grep postgres
postgres  37178      1  0 09:58 ?        00:00:00 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data
postgres  37179  37178  0 09:58 ?        00:00:00 postgres: logger 
postgres  37181  37178  0 09:58 ?        00:00:00 postgres: checkpointer 
postgres  37182  37178  0 09:58 ?        00:00:00 postgres: background writer 
postgres  37183  37178  0 09:58 ?        00:00:00 postgres: walwriter 
postgres  37184  37178  0 09:58 ?        00:00:00 postgres: autovacuum launcher 
postgres  37185  37178  0 09:58 ?        00:00:00 postgres: stats collector 
postgres  37186  37178  0 09:58 ?        00:00:00 postgres: logical replication launcher 
root      41368  41314  0 11:06 pts/1    00:00:00 su - postgres
postgres  41369  41368  0 11:06 pts/1    00:00:00 -bash
postgres  45071  41369  0 12:07 pts/1    00:00:00 ps -ef
postgres  45072  41369  0 12:07 pts/1    00:00:00 grep --color=auto postgres
-bash-4.2$ kill -9 37178
-bash-4.2$ ps -ef|grep postgres
root      41368  41314  0 11:06 pts/1    00:00:00 su - postgres
postgres  41369  41368  0 11:06 pts/1    00:00:00 -bash
postgres  45095  41369  0 12:08 pts/1    00:00:00 ps -ef
postgres  45096  41369  0 12:08 pts/1    00:00:00 grep --color=auto postgres
-bash-4.2$ pwd
/var/lib/pgsql/14/data/global
-bash-4.2$ ls -l pg_control 
-rw-------. 1 postgres postgres 8192 May 30 12:04 pg_control
-bash-4.2$ rm -rf pg_control 
-bash-4.2$ ls -l pg_control 
ls: cannot access pg_control: No such file or directory

PG启动失败

-bash-4.2$ pg_ctl start 
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....postgres: could not find the database system
Expected to find it in the directory "/var/lib/pgsql/14/data",
but could not open file "/var/lib/pgsql/14/data/global/pg_control": No such file or directory
 stopped waiting
pg_ctl: could not start server
Examine the log output.

创建空pg_control文件启动依旧失败

-bash-4.2$ touch /var/lib/pgsql/14/data/global/pg_control
-bash-4.2$ pg_ctl start 
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-30 12:09:43.953 CST [45215] PANIC:  
   could not read file "global/pg_control": read 0 of 296
 stopped waiting
pg_ctl: could not start server
Examine the log output.

设置next-wal-file
-l, –next-wal-file=WALFILE,这个参数设置下一个新的WAL文件的最小值，这个值可以从$PGDATA/pg_wal目录下去看最后一个WAL 文件，这个文件的id+1即可

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -l
total 16384
-rw-------. 1 postgres postgres 16777216 May 30 12:04 000000010000000000000014
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-bash-4.2$

这个文件+1，-l 000000010000000000000015
设置next-transaction
-x, –next-transaction-id=XID,这个参数设置pg_control中的下一个XID的值，这个值可以从pg_xact目录下的文件中查询

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_xact
-bash-4.2$ ls -ltr
total 8
-rw-------. 1 postgres postgres 8192 May 30 12:03 0000

最后一个是0000，那么下一个XID就是0001，然后乘以 1048576 (0×100000)，实际上后面直接加5个0就行了。注意，这个值是16进制的。-x 0×000100000
multixact-ids设置
-m, –multixact-ids=MXID1,MXID2，这个参数包含两个部分，MXID1和MXID2，都可以从$PGDATA/pg_multixact/offsets目录下获得。MXID1的值，首先找到最大值，+1，再乘以 65536 (0×10000，相当于后面加4个0)作为这个参数的前半部分。找到最小的值，后面加4个0，作为MXID2的值

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_multixact/offsets
-bash-4.2$ ls -ltr
total 8
-rw-------. 1 postgres postgres 8192 May 29 22:06 0000
-bash-4.2$

-m 0×00010000, 0×00000000（由于oldest multitransaction ID不能为0,因此后续这个值需要适当调整）
multixact-offset设置
-O, –multixact-offset=OFFSET,这个参数可以从$PGDATA/pg_multixact/members目录下获得。找到最大值，+1，乘以 52352 (0xCC80)

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_multixact/members
-bash-4.2$ ls -ltr
total 8
-rw-------. 1 postgres postgres 8192 May 24 02:20 0000

-O 0xCC80
尝试执行pg_resetwal

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00010000,0x00000000 -O 0xCC80 $PGDATA
pg_resetwal: error: oldest multitransaction ID (-m) must not be 0

multixact-ids值不对,进行调整后处理
postmaster.pid文件需要清理
由于PG库异常关闭,需要人工清理掉该文件

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 $PGDATA
pg_resetwal: error: lock file "postmaster.pid" exists
-bash-4.2$ rm -rf postmaster.pid

pg_resetwal结果预览

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 $PGDATA
pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it
Guessed pg_control values:

pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7103392535324046312
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          12000
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float8 argument passing:              by value
Data page checksum version:           0


Values to be changed:

First log segment after reset:        000000010000000000000015
NextMultiXactId:                      131072
OldestMultiXid:                       65536
OldestMulti's DB:                     0
NextMultiOffset:                      52352
NextXID:                              1048576
OldestXID:                            3
OldestXID's DB:                       0

If these values seem acceptable, use -f to force reset.

pg_resetwal进行创建pg_control并启动PG

-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 -f $PGDATA
pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it
Write-ahead log reset
-bash-4.2$ pg_ctl start
waiting for server to start....2022-05-30 13:33:28.266 CST [51437] LOG:  
redirecting log output to logging collector process
2022-05-30 13:33:28.266 CST [51437] HINT:  Future log output will appear in directory "log".
 done
server started

验证数据

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=#  select count(1) from ac_event;
 count  
--------
 245275
(1 row)

这种方法恢复之后,建议理解dump数据,然后导入到新库中

PostgreSQL恢复系列:wal日志丢失恢复

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：PostgreSQL恢复系列:wal日志丢失恢复

WAL是Write Ahead Log的简写，和oracle的redo日志类似，存放在$PGDATA/pg_xlog中，10版本以后在$PGDATA/pg_wal目录.在oracle数据库中,如果redo丢失,分为active/current和inactive的redo,分别有不同的处理方式,对于oracle需要实例恢复的redo丢失，需要屏蔽数据库一致性，强制打开数据库,对于PG数据库这部分日志丢失该如何恢复,主要是通过pg_resetwal/pg_resetxlog（10以前版本）命令来实现，这里通过一个测试来验证
创建测试表并强制kill数据库

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# create table t_xifenfei as select * from pg_database;
SELECT 4
postgres=# select count(1) from t_xifenfei;
 count 
-------
     4
(1 row)

postgres=# \q
-bash-4.2$ ps -ef|grep post
root       1819      1  0 May28 ?        00:00:00 /usr/libexec/postfix/master -w
postfix    1838   1819  0 May28 ?        00:00:00 qmgr -l -t unix -u
postgres  11102      1  0 05:49 ?        00:00:00 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data
postgres  11103  11102  0 05:49 ?        00:00:00 postgres: logger 
postgres  11105  11102  0 05:49 ?        00:00:00 postgres: checkpointer 
postgres  11106  11102  0 05:49 ?        00:00:00 postgres: background writer 
postgres  11107  11102  0 05:49 ?        00:00:00 postgres: walwriter 
postgres  11108  11102  0 05:49 ?        00:00:00 postgres: autovacuum launcher 
postgres  11109  11102  0 05:49 ?        00:00:01 postgres: stats collector 
postgres  11110  11102  0 05:49 ?        00:00:00 postgres: logical replication launcher 
root      22743  22300  0 18:26 pts/3    00:00:00 su - postgres
postgres  22744  22743  0 18:26 pts/3    00:00:00 -bash
postgres  22937  22744  0 18:28 pts/3    00:00:00 psql
postgres  22938  11102  0 18:28 ?        00:00:00 postgres: postgres postgres [local] idle
postfix   32623   1819  0 21:10 ?        00:00:00 pickup -l -t unix -u
root      33032  32912  0 21:15 pts/2    00:00:00 su - postgres
postgres  33033  33032  0 21:15 pts/2    00:00:00 -bash
postgres  35210  33033  0 21:51 pts/2    00:00:00 ps -ef
postgres  35211  33033  0 21:51 pts/2    00:00:00 grep --color=auto post
-bash-4.2$ kill -9 11102

删除wal日志

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -ltr
total 311296
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000014
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000015
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000016
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000017
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000018
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000019
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001A
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001B
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001C
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001D
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001E
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001F
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000020
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000021
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000022
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000023
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000024
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000025
-rw-------. 1 postgres postgres 16777216 May 29 21:51 000000010000000000000013
-bash-4.2$ rm -rf 0000000100000000000000*
-bash-4.2$ ls
archive_status

查询当时数据库需要的最小wal记录

-bash-4.2$ pg_controldata 
pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7100998319216817119
Database cluster state:               in production
pg_control last modified:             Sat 28 May 2022 09:36:11 PM CST
Latest checkpoint location:           0/13692F80
Latest checkpoint's REDO location:    0/13692F48
Latest checkpoint's REDO WAL file:    000000010000000000000013   <===需要的记录
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:17824
Latest checkpoint's NextOID:          32769
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        727
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  17824
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Sat 28 May 2022 09:31:41 PM CST

尝试启动PG

-bash-4.2$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-29 21:52:22.926 CST [35270] LOG:  
redirecting log output to logging collector process
2022-05-29 21:52:22.926 CST [35270] HINT:  Future log output will appear in directory "log".
. stopped waiting
pg_ctl: could not start server
Examine the log output.

启动pg失败,查看日志记录

2022-05-29 21:52:22.926 CST [35270] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, 
            compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-05-29 21:52:22.927 CST [35270] LOG:  listening on IPv6 address "::1", port 5432
2022-05-29 21:52:22.927 CST [35270] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-29 21:52:22.929 CST [35270] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-29 21:52:22.931 CST [35270] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-29 21:52:22.936 CST [35272] LOG:  database system was interrupted; last known up at 2022-05-28 21:36:11 CST
2022-05-29 21:52:23.049 CST [35272] LOG:  invalid primary checkpoint record
2022-05-29 21:52:23.049 CST [35272] PANIC:  could not locate a valid checkpoint record
2022-05-29 21:52:24.211 CST [35270] LOG:  startup process (PID 35272) was terminated by signal 6: Aborted
2022-05-29 21:52:24.211 CST [35270] LOG:  aborting startup due to startup process failure
2022-05-29 21:52:24.218 CST [35270] LOG:  database system is shut down

错误比较明显,无法定位到有效的checkpoint记录,在oracle里面的意思可以理解为无法进行实例恢复,pg启动失败

重设wal
由于数据库为不一致状态,需要使用-f进行强制重设

-bash-4.2$ pg_resetwal $PGDATA
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.
-bash-4.2$ pg_resetwal -f $PGDATA
Write-ahead log reset

启动PG成功

-bash-4.2$ pg_ctl start -D $PGDATA
waiting for server to start....2022-05-29 22:01:02.647 CST [37178] LOG:  
redirecting log output to logging collector process
2022-05-29 22:01:02.647 CST [37178] HINT:  Future log output will appear in directory "log".
 done
server started

日志记录

2022-05-29 22:01:02.647 CST [37178] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, 
                    compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-05-29 22:01:02.648 CST [37178] LOG:  listening on IPv6 address "::1", port 5432
2022-05-29 22:01:02.648 CST [37178] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-29 22:01:02.649 CST [37178] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-29 22:01:02.651 CST [37178] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-29 22:01:02.653 CST [37180] LOG:  database system was shut down at 2022-05-29 22:00:47 CST
2022-05-29 22:01:02.661 CST [37178] LOG:  database system is ready to accept connections

查看wal日志,产生新记录

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -ltr
total 16384
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-rw-------. 1 postgres postgres 16777216 May 29 22:01 000000010000000000000014

验证刚刚创建测试表

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# select count(1) from t_xifenfei;
ERROR:  relation "t_xifenfei" does not exist
LINE 1: select count(1) from t_xifenfei;
                             ^

由于需要进行实例恢复的wal日志丢失导致这表记录也丢失.由此可见这类操作可能导致数据丢失风险,对于生产环境,需要慎重,

pvcreate asm disk导致asm磁盘组异常恢复

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：pvcreate asm disk导致asm磁盘组异常恢复

一客户asm磁盘组异常，无法正常mount

SQL> alter diskgroup datadg mount 
2022-05-28T19:08:55.114960+08:00
NOTE: cache registered group DATADG 1/0x2B504997
NOTE: cache began mount (first) of group DATADG 1/0x2B504997
NOTE: Assigning number (1,3) to disk (/dev/oracleasm/disks/DATA05)
NOTE: Assigning number (1,2) to disk (/dev/oracleasm/disks/DATA03)
NOTE: Assigning number (1,1) to disk (/dev/oracleasm/disks/DATA02)
2022-05-28T19:08:55.150062+08:00
ERROR: no read quorum in group: required 1, found 0 disks
2022-05-28T19:08:55.150684+08:00
NOTE: cache dismounting (clean) group 1/0x2B504997 (DATADG)
NOTE: messaging CKPT to quiesce pins Unix process pid: 15103, image: oracle@XFF01 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: LGWR not being messaged to dismount
NOTE: cache dismounted group 1/0x2B504997 (DATADG)
NOTE: cache ending mount (fail) of group DATADG number=1 incarn=0x2b504997
NOTE: cache deleting context for group DATADG 1/0x2b504997
2022-05-28T19:08:55.191073+08:00
GMON dismounting group 1 at 36 for pid 37, osid 15103
2022-05-28T19:08:55.191258+08:00
NOTE: Disk DATA02 in mode 0x8 marked for de-assignment
NOTE: Disk DATA03 in mode 0x8 marked for de-assignment
NOTE: Disk DATA05 in mode 0x8 marked for de-assignment
ERROR: diskgroup DATADG was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATADG" cannot be mounted
ORA-15040: diskgroup is incomplete

通过报错信息,初步判断是由于少了asm disk导致（依据:1. ORA-15040,2.asmlib中的DATA01丢失),初步判断由于某种原因导致asmlib的磁盘异常,从而使得asm磁盘组无法正常mount,通过对dd 到本地的asm磁盘进行分析

C:\Users\XFF>kfed read H:\TEMP\asmdd\sdb6-o.dd
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
0066E8200 00000000 00000000 00000000 00000000  [................]
        Repeat 31 times
0066E8400 4542414C 454E4F4C 00000001 00000000  [LABELONE........]
0066E8410 4E06D490 00000020 324D564C 31303020  [...N ...LVM2 001]
0066E8420 34535542 476A7667 42546C48 6D384675  [BUS4gvjGHlTBuF8m]
0066E8430 7A385273 4B495777 73336242 33637449  [sR8zwWIKBb3sItc3]
0066E8440 48001000 000001E8 00100000 00000000  [...H............]
0066E8450 00000000 00000000 00000000 00000000  [................]
0066E8460 00000000 00000000 00001000 00000000  [................]
0066E8470 000FF000 00000000 00000000 00000000  [................]
0066E8480 00000000 00000000 00000002 00000000  [................]
0066E8490 00000000 00000000 00000000 00000000  [................]
  Repeat 214 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

通过这部分信息可以确认，一个asm disk被创建了pv,进一步分析pv信息

对于这样的情况，表示asm disk被创建了pv但是pv没有加入到任何vg中,也就意味着该disk没有太大破坏,通过信息确认

主要是这两个部分信息被损坏,可以通过一些方法对这两个block信息进行重构

C:\Users\XFF>kfed read H:\TEMP\asmdd\sdb6.dd|more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483648 ; 0x008: disk=0
kfbh.check:                  3196491921 ; 0x00c: 0xbe869891
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:   ORCLDISKDATA01 ; 0x000: length=14
kfdhdb.driver.reserved[0]:   1096040772 ; 0x008: 0x41544144
kfdhdb.driver.reserved[1]:        12592 ; 0x00c: 0x00003130
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                203424000 ; 0x020: 0x0c200100
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                  DATA01 ; 0x028: length=6
kfdhdb.grpname:                  DATADG ; 0x048: length=6
kfdhdb.fgname:                   DATA01 ; 0x068: length=6
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             33083792 ; 0x0a8: HOUR=0x10 DAYS=0xc MNTH=0x4 YEAR=0x7e3
kfdhdb.crestmp.lo:           2268043264 ; 0x0ac: USEC=0x0 MSEC=0x3e6 SECS=0x32 MINS=0x21
kfdhdb.mntstmp.hi:             33134479 ; 0x0b0: HOUR=0xf DAYS=0x1c MNTH=0x5 YEAR=0x7e6
-- More  --

C:\Users\XFF>kfed read H:\TEMP\asmdd\sdb6.dd blkn=1|more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
kfbh.datfmt:                          2 ; 0x003: 0x02
kfbh.block.blk:                       1 ; 0x004: blk=1
kfbh.block.obj:              2147483648 ; 0x008: disk=0
kfbh.check:                  2177715180 ; 0x00c: 0x81cd4bec
kfbh.fcn.base:                  3721754 ; 0x010: 0x0038ca1a
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdfsb.aunum:                         0 ; 0x000: 0x00000000
kfdfsb.max:                        1014 ; 0x004: 0x03f6
kfdfsb.cnt:                        1014 ; 0x006: 0x03f6
kfdfsb.bound:                         0 ; 0x008: 0x0000
kfdfsb.flag:                          1 ; 0x00a: B=1
kfdfsb.ub1spare:                      0 ; 0x00b: 0x00
kfdfsb.spare[0]:                      0 ; 0x00c: 0x00000000
kfdfsb.spare[1]:                      0 ; 0x010: 0x00000000
kfdfsb.spare[2]:                      0 ; 0x014: 0x00000000
kfdfse[0].fse:                        0 ; 0x018: FREE=0x0 FRAG=0x0
kfdfse[1].fse:                        0 ; 0x019: FREE=0x0 FRAG=0x0
kfdfse[2].fse:                        0 ; 0x01a: FREE=0x0 FRAG=0x0
kfdfse[3].fse:                        0 ; 0x01b: FREE=0x0 FRAG=0x0
kfdfse[4].fse:                        0 ; 0x01c: FREE=0x0 FRAG=0x0
kfdfse[5].fse:                        0 ; 0x01d: FREE=0x0 FRAG=0x0
kfdfse[6].fse:                        0 ; 0x01e: FREE=0x0 FRAG=0x0
kfdfse[7].fse:                        0 ; 0x01f: FREE=0x0 FRAG=0x0
kfdfse[8].fse:                        0 ; 0x020: FREE=0x0 FRAG=0x0

通过dd写入到原磁盘,通过oracleasm scandisks扫描磁盘

磁盘组mount成功

数据库顺利open

这个案例能够完美恢复,主要是客户没有做进一步破坏，没有把这个pv加入到vg中并且写入数据,以前有过类似案例因为写入了数据,恢复比这个难多了,效果也没有这个好asm disk被加入vg恢复
如果不幸有类似oracle asm disk被破坏(格式化,dd部分,做成lv等),需要进行恢复支持,可以联系我们,做专业的恢复评估,最大限度,最快速度抢救数据,减少损失
Phone:17813235971 Q Q:107644445 E-Mail:dba@xifenfei.com
恢复过部分asm异常案例:
删除分区 oracle asm disk 恢复
asm disk 磁盘部分被清空恢复
又一例asm格式化文件系统恢复
一次完美的asm disk被格式化ntfs恢复
oracle asm disk格式化恢复—格式化为ext4文件系统
oracle asm disk格式化恢复—格式化为ntfs文件系统
分享oracleasm createdisk重新创建asm disk后数据0丢失恢复案例

O/S-Error: (OS 23) 数据错误(循环冗余检查) 数据库恢复

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：O/S-Error: (OS 23) 数据错误(循环冗余检查) 数据库恢复

有客户数据库运行过程中突然crash,检测发现ORA-27070 OSD-04016 O/S-Error: (OS 23) 等报错

Thu May 12 11:25:53 2022
KCF: write/open error block=0x19e95f online=1
     file=57 H:\ORADATA\xifenfei\XFF51.DBF
     error=27070 txt: 'OSD-04016: 异步 I/O 请求排队时出错。
O/S-Error: (OS 23) 数据错误(循环冗余检查)。'
Thu May 12 11:25:53 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_dbw0_3532.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式
ORA-01114: 将块写入文件 57 时出现 IO 错误 (块 # 1698143)
ORA-01110: 数据文件 57: 'H:\ORADATA\xifenfei\XFF51.DBF'
ORA-27070: 异步读取/写入失败
OSD-04016: 异步 I/O 请求排队时出错。
O/S-Error: (OS 23) 数据错误(循环冗余检查)。

DBW0: terminating instance due to error 1242
Thu May 12 11:25:54 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_mman_3528.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:25:54 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_lgwr_3544.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:25:55 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_dbw1_3536.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:25:55 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_psp0_3524.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:25:55 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_ckpt_3548.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:25:55 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_pmon_3520.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:26:06 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_q002_37468.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:26:08 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_reco_3556.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:26:08 2022
Errors in file e:\oracle\product\10.2.0\admin\xifenfei\bdump\xifenfei_smon_3552.trc:
ORA-01242: 数据文件出现介质故障: 数据库处于 NOARCHIVELOG 模式

Thu May 12 11:26:10 2022
Instance terminated by DBW0, pid = 3532

再次重启数据库报错 ORA-27070: 异步读取/写入失败 OSD-04016: 异步 I/O 请求排队时出错。类似错误

dbv检查数据文件报异常

通过以上信息基本上可以确认是由于底层故障(文件系统或者硬件故障),导致数据库文件访问异常,检查系统日志发现异常

通过专业恢复软件对异常文件进行恢复,实现数据库正常open(跳过坏块)

数据库open报ORA-00959: tablespace ‘UNDOTBS1′ does not exist分析

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：数据库open报ORA-00959: tablespace ‘UNDOTBS1′ does not exist分析

有一个朋友找到我,说数据库重启之后无法正常启动,提示ORA-00959 UNDOTBS1表空间不存在

在数据库的启动过程中,这个是一个很常见的错误,一般出现这类错误的原因是由于undo_tablespace指定的undo表空间不存在导致.但是这个库比较明显,设置了undo_management=manual, undo_tablespace=system,依旧数据库需要找undotbs1表空间,进一步分析数据库当前表空间情况

该数据库确实没有undotbs1表空间,基于以上信息,初步怀疑很可能是undo回滚段异常,通过对于oracle基表进行分析,发现信息

基于上面的信息,可以确认回滚段中确实有四条记录指向被删除的undotbs1,而且还有一条undo回滚段信息为need recovery状态.进一步分析数据库alert日志

--创建undotbs2表空间,并重启数据库
Sun Sep 26 14:42:16 2021
create undo tablespace UNDOTBS2 datafile '/data/oradata/xifenfei/undotbs001.dbf' size 120G
reuse autoextend on next 500m maxsize unlimited
Sun Sep 26 14:46:46 2021
Completed: create undo tablespace UNDOTBS2 datafile '/data/oradata/xifenfei/undotbs001.dbf' size 120G
reuse autoextend on next 500m maxsize unlimited
Sun Sep 26 14:47:13 2021
[16927] Successfully onlined Undo Tablespace 10.
[16927] **** active transactions found in undo Tablespace 2 - moved to Pending Switch-Out state.
[16927] active transactions found/affinity dissolution incompletein undo tablespace 2 during switch-out.
ALTER SYSTEM SET undo_tablespace='UNDOTBS2' SCOPE=BOTH;
Sun Sep 26 14:49:25 2021
Shutting down instance (immediate)
Shutting down instance: further logons disabled
Stopping background process QMNC

---第一次尝试删除undotbs1失败
Sun Sep 26 15:06:30 2021
drop tablespace UNDOTBS1 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS
Sun Sep 26 15:47:26 2021
ORA-1013 signalled during: drop tablespace UNDOTBS1 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS
...
Sun Sep 26 15:48:40 2021
Shutting down instance (immediate)
Shutting down instance: further logons disabled
Stopping background process QMNC

--加上该隐含参数,继续重启库删除undotbs1,依旧删除失败
  _corrupted_rollback_segments= "_SYSSMU28_1306132068$"

Sun Sep 26 15:53:34 2021
QMNC started with pid=31, OS id=20454 
Completed: ALTER DATABASE OPEN
Sun Sep 26 15:55:32 2021
 drop tablespace UNDOTBS1 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS
…………
Sun Sep 26 16:45:47 2021
ORA-1013 signalled during:  drop tablespace UNDOTBS1 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS...
Sun Sep 26 16:46:00 2021
Shutting down instance (immediate)
Shutting down instance: further logons disabled
Stopping background process QMNC

---加上以下参数，并尝试重启数据库删除undotbs1成功
  _corrupted_rollback_segments= "_SYSSMU31_201790566$"
  _corrupted_rollback_segments= "_SYSSMU30_2395098326$"
  _corrupted_rollback_segments= "_SYSSMU29_306369076$"
  _corrupted_rollback_segments= "_SYSSMU28_1306132068$"

Sun Sep 26 16:47:24 2021
QMNC started with pid=31, OS id=23421 
Completed: ALTER DATABASE OPEN
Sun Sep 26 16:47:40 2021
drop tablespace UNDOTBS1 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS
…………
Mon Sep 27 01:44:56 2021
Deleted file /data/oradata/xifenfei/undotbs01.dbf
Deleted file /data/oradata/xifenfei/undotbs02.dbf
Deleted file /data/oradata/xifenfei/undotbs03.dbf
Deleted file /data/oradata/xifenfei/undotbs04.dbf
Deleted file /data/oradata/xifenfei/undotbs05.dbf
Deleted file /data/oradata/xifenfei/undotbs06.dbf
Completed: drop tablespace UNDOTBS1 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS

从这里基本上可以看出来,操作之人非常草率,在oracle回滚段还在被占用状态,直接尝试删除老undo表空间,在无法删除之后,直接暴力的使用undo回滚段异常参数,然后进行undo表空间删除.这样操作的后果屏蔽了事务的一致性,导致后续可能导致数据库一系列异常.对当前库启动过程进行跟踪发现

数据库启动的过程中查询undo$中的undotbs1信息,而该表空间不存在,所以出现此类报错,从而导致数据库无法正常启动.知道问题原因所在,那通过undo$记录,数据库即可正常启动.

ASM删除表空间恢复

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：ASM删除表空间恢复

前几天刚刚恢复了一个文件系统层面drop 表空间的case(分享运气超级好的一次drop tablespace 数据恢复),又一客户删除表空间(认为是不要的表空间),结果发现业务上丢失了很多表数据,通过分析和回顾以往事件,确认由于在以前数据迁移过来的过程中,数据写入了和原库一致的表空间,而没有恢复到本该恢复的新表空间中,这次删除该空间导致很多表数据丢失.而且该客户是asm环境,drop tablespace带上了including contents and datafiles语句,导致该表空间对应的数据文件也丢失.对于这类数据的恢复,一般情况下先通过asm层面恢复出来被删除的数据文件,然后再对被删除的数据文件按照丢失system的方式恢复里面的表数据(这个客户有历史备份便于整合)
在恢复被删除的文件之前,需要先确认对应的被删除的表空间信息和对应的文件信息,通过对底层字典分析file$,ts$,结合alert日志,可以确认被删除文件的文件号,文件名称等信息

由于文件已经从asm磁盘组中删除,无法直接恢复,通过对asm磁盘组进行扫描找出对应的block信息,参考:asm磁盘组操作不当导致数据文件丢失恢复类似处理方法,分析文件是否异常

初步判断文件恢复效果应该不错,恢复出来数据文件，然后进行dbv检查

后续的操作比较简单,使用oracle dul恢复出来按照类似方法:dul恢复drop表测试数据即可,业务进行核对即可.如果你遭遇到此类情况,而且无有效备份,尽可能保护现场(不要对asm/文件系统系统进行写入操作)，然后联系我们进行处理,最大限度恢复数据

ORA-12518 ORA-27302: 错误发生在: ssthrddcr

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：ORA-12518 ORA-27302: 错误发生在: ssthrddcr

11.2.0.1数据库运行在win 2008机器上,进程数配置为1000,但是数据库连接数到了170多之后,就开始报ORA-12518 TNS:监听程序无法分发客户机连接

检查alert日志发现后台日志报ORA-27300 ORA-27301 ORA-27302错误

Fri May 06 11:57:14 2022
Process startup failed, error stack:
Errors in file c:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_psp0_3052.trc:
ORA-27300: 操作系统系统相关操作: CreateThread 失败, 状态为: 8
ORA-27301: 操作系统故障消息: 存储空间不足，无法处理此命令。
ORA-27302: 错误发生在: ssthrddcr
Process J002 died, see its trace file
kkjcre1p: unable to spawn jobq slave process 
Errors in file c:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_cjq0_3808.trc:

通过mos分析确认问题为IRPStackSize参数值问题

参考:Ora-27300: OS System Dependent Operation:Createthread Failed With Status: 8 (Doc ID1310453.1)

plsql 插入blob

联系：手机/微信(+86 17813235971) QQ(107644445)

标题：plsql 插入blob

plsql 插入blob

SQL> create bigfile tablespace t_xifenfei datafile 'e:/oradata/orcl/t_blob.dbf' size 128M autoextend on;

表空间已创建。


SQL> create table u0425.t_blob(a int, b varchar(10), c blob) tablespace t_xifenfei ;

表已创建。


SQL> create or replace directory expdp_dir as 'e:/';

目录已创建。


SQL> declare
  2  b_file bfile;
  3  b_lob blob;
  4  begin
  5  insert into u0425.t_blob(a,b,c) values(1,'xifenfei', empty_blob()) return c into b_lob;
  6  b_file:=bfilename('EXPDP_DIR','0430.DMP');
  7  dbms_lob.open(b_file,dbms_lob.file_readonly);
  8  dbms_lob.loadfromfile(b_lob,b_file,dbms_lob.getlength(b_file));
  9  dbms_lob.close(b_file);
 10  commit;
 11  end;
 12  /

PL/SQL 过程已成功完成。

SQL> select * from u0425.t_blob where rownum<2;

         A B
---------- ----------
C
--------------------------------------------------------------------------------
         1 xifenfei
0301914E3B98006780BC00019FC70146A1A3633DBC4B089DCF3DFCD951D559000000010000100003
6907E6041E0D3037000000020200000077D10000000000FA1A880022535953222E225359535F4558


SQL> select a,b,dbms_lob.getlength(c) from u0425.t_blob where rownum<2;

         A B          DBMS_LOB.GETLENGTH(C)
---------- ---------- ---------------------
         1 xifenfei               142024704

插入的blob字段和实际文件大小，内容一致,插入是成功的

月度归档：5月 2022