PostgreSQL Recovery Series: Recovering from Lost WAL Files

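Author: 惜分飞 © All rights reserved. [May not be reproduced in any form without the author's consent; the author reserves the right to pursue legal action.]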

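WAL is short for Write Ahead Log. It is similar to Oracle's redo log and is stored in $PGDATA/pg_xlog, or in $PGDATA/pg_wal from version 10 onward. In Oracle, lost redo is handled differently depending on whether the log was active/current or inactive; when the lost redo is needed for instance recovery, the database consistency checks have to be bypassed and the database forced open. In PostgreSQL, when the equivalent WAL is lost, recovery is done mainly with pg_resetwal (pg_resetxlog before version 10). The test below verifies this.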
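
The directory and tool names depend on the major version. As a minimal sketch, the helper below maps a version string (such as the contents of $PGDATA/PG_VERSION) to the matching WAL directory and reset tool. The name wal_names_for_version is hypothetical and not part of PostgreSQL:

```shell
# Hypothetical helper: print the WAL directory name and the reset tool
# for a given PostgreSQL version string (e.g. "14" or "9.6").
wal_names_for_version() (
    major=${1%%.*}            # "9.6" -> "9", "14" -> "14"
    if [ "$major" -ge 10 ]; then
        echo "pg_wal pg_resetwal"
    else
        echo "pg_xlog pg_resetxlog"
    fi
)

# Example: wal_names_for_version "$(cat "$PGDATA/PG_VERSION")"
```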

Create a test table and force-kill the database

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# create table t_xifenfei as select * from pg_database;
SELECT 4
postgres=# select count(1) from t_xifenfei;
 count 
-------
     4
(1 row)

postgres=# \q
-bash-4.2$ ps -ef|grep post
root       1819      1  0 May28 ?        00:00:00 /usr/libexec/postfix/master -w
postfix    1838   1819  0 May28 ?        00:00:00 qmgr -l -t unix -u
postgres  11102      1  0 05:49 ?        00:00:00 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data
postgres  11103  11102  0 05:49 ?        00:00:00 postgres: logger 
postgres  11105  11102  0 05:49 ?        00:00:00 postgres: checkpointer 
postgres  11106  11102  0 05:49 ?        00:00:00 postgres: background writer 
postgres  11107  11102  0 05:49 ?        00:00:00 postgres: walwriter 
postgres  11108  11102  0 05:49 ?        00:00:00 postgres: autovacuum launcher 
postgres  11109  11102  0 05:49 ?        00:00:01 postgres: stats collector 
postgres  11110  11102  0 05:49 ?        00:00:00 postgres: logical replication launcher 
root      22743  22300  0 18:26 pts/3    00:00:00 su - postgres
postgres  22744  22743  0 18:26 pts/3    00:00:00 -bash
postgres  22937  22744  0 18:28 pts/3    00:00:00 psql
postgres  22938  11102  0 18:28 ?        00:00:00 postgres: postgres postgres [local] idle
postfix   32623   1819  0 21:10 ?        00:00:00 pickup -l -t unix -u
root      33032  32912  0 21:15 pts/2    00:00:00 su - postgres
postgres  33033  33032  0 21:15 pts/2    00:00:00 -bash
postgres  35210  33033  0 21:51 pts/2    00:00:00 ps -ef
postgres  35211  33033  0 21:51 pts/2    00:00:00 grep --color=auto post
-bash-4.2$ kill -9 11102
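
The ps listing above also matches unrelated postfix processes. As an alternative sketch, the postmaster PID can be read from the first line of postmaster.pid in the data directory. The name postmaster_pid is a hypothetical helper, not a PostgreSQL command:

```shell
# Hypothetical helper: print the postmaster PID recorded in
# <datadir>/postmaster.pid (the PID is the first line of that file).
postmaster_pid() (
    pidfile="$1/postmaster.pid"
    [ -r "$pidfile" ] || { echo "no postmaster.pid in $1" >&2; exit 1; }
    head -n 1 "$pidfile"
)

# Example (crash simulation on a test system only):
#   kill -9 "$(postmaster_pid "$PGDATA")"
```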

Delete the WAL files

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -ltr
total 311296
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000014
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000015
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000016
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000017
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000018
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000019
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001A
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001B
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001C
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001D
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001E
-rw-------. 1 postgres postgres 16777216 May 28 21:29 00000001000000000000001F
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000020
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000021
-rw-------. 1 postgres postgres 16777216 May 28 21:29 000000010000000000000022
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000023
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000024
-rw-------. 1 postgres postgres 16777216 May 28 21:30 000000010000000000000025
-rw-------. 1 postgres postgres 16777216 May 29 21:51 000000010000000000000013
-bash-4.2$ rm -rf 0000000100000000000000*
-bash-4.2$ ls
archive_status

Check the oldest WAL segment the database needs at this point

-bash-4.2$ pg_controldata 
pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7100998319216817119
Database cluster state:               in production
pg_control last modified:             Sat 28 May 2022 09:36:11 PM CST
Latest checkpoint location:           0/13692F80
Latest checkpoint's REDO location:    0/13692F48
Latest checkpoint's REDO WAL file:    000000010000000000000013   <=== WAL segment required for recovery
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:17824
Latest checkpoint's NextOID:          32769
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        727
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  17824
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Sat 28 May 2022 09:31:41 PM CST
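
The REDO WAL file name follows from the timeline ID and the REDO location. The sketch below reproduces that arithmetic, assuming the default 16 MB segment size (which matches the 16777216-byte files listed earlier). The name lsn_to_walfile is hypothetical:

```shell
# Hypothetical helper: derive the WAL segment file name from a timeline ID
# and an LSN of the form "hi/lo" (both parts hexadecimal).
# Usage: lsn_to_walfile TIMELINE LSN [SEGMENT_SIZE_BYTES]
lsn_to_walfile() (
    tli=$1
    lsn=$2
    segsize=${3:-16777216}                  # assumed default: 16 MB
    hi=${lsn%%/*}
    lo=${lsn##*/}
    pos=$(( (0x$hi << 32) + 0x$lo ))        # absolute byte position in the WAL
    segno=$(( pos / segsize ))              # segment number
    per_id=$(( 4294967296 / segsize ))      # segments per 4 GB "log id"
    printf '%08X%08X%08X\n' "$tli" "$(( segno / per_id ))" "$(( segno % per_id ))"
)

lsn_to_walfile 1 0/13692F48   # → 000000010000000000000013
```

Both the REDO location (0/13692F48) and the checkpoint record itself (0/13692F80) fall in segment 000000010000000000000013, which is among the files deleted above.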

Try to start PostgreSQL

-bash-4.2$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-29 21:52:22.926 CST [35270] LOG:  redirecting log output to logging collector process
2022-05-29 21:52:22.926 CST [35270] HINT:  Future log output will appear in directory "log".
. stopped waiting
pg_ctl: could not start server
Examine the log output.

PostgreSQL failed to start; check the log

2022-05-29 21:52:22.926 CST [35270] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-05-29 21:52:22.927 CST [35270] LOG:  listening on IPv6 address "::1", port 5432
2022-05-29 21:52:22.927 CST [35270] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-29 21:52:22.929 CST [35270] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-29 21:52:22.931 CST [35270] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-29 21:52:22.936 CST [35272] LOG:  database system was interrupted; last known up at 2022-05-28 21:36:11 CST
2022-05-29 21:52:23.049 CST [35272] LOG:  invalid primary checkpoint record
2022-05-29 21:52:23.049 CST [35272] PANIC:  could not locate a valid checkpoint record
2022-05-29 21:52:24.211 CST [35270] LOG:  startup process (PID 35272) was terminated by signal 6: Aborted
2022-05-29 21:52:24.211 CST [35270] LOG:  aborting startup due to startup process failure
2022-05-29 21:52:24.218 CST [35270] LOG:  database system is shut down

The error is clear: no valid checkpoint record can be located. In Oracle terms, this means instance recovery cannot be performed, so PostgreSQL fails to start.
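
This failure can be anticipated before starting the server by checking whether the REDO WAL file reported by pg_controldata still exists. A rough sketch follows; redo_wal_file and check_redo_wal are hypothetical helper names, and the presence of the file is a necessary condition only, not proof that recovery will succeed:

```shell
# Hypothetical helper: extract the REDO WAL file name from
# pg_controldata output read on stdin.
redo_wal_file() {
    sed -n "s/^Latest checkpoint's REDO WAL file:[[:space:]]*\([0-9A-F]\{24\}\).*/\1/p"
}

# Hypothetical helper: report whether that segment exists in WALDIR.
# Usage: pg_controldata "$PGDATA" | check_redo_wal "$PGDATA/pg_wal"
check_redo_wal() (
    waldir=$1
    seg=$(redo_wal_file)
    [ -n "$seg" ] || { echo "unknown"; exit 2; }
    if [ -f "$waldir/$seg" ]; then
        echo "present $seg"
    else
        echo "missing $seg"
    fi
)
```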

Reset the WAL
Because the database is in an inconsistent state, -f is required to force the reset.

-bash-4.2$ pg_resetwal $PGDATA
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.
-bash-4.2$ pg_resetwal -f $PGDATA
Write-ahead log reset
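
pg_resetwal also has a -n (--dry-run) option that prints the values it would write without modifying anything. A minimal sketch of a wrapper that always does the dry run first and only forces the reset on explicit confirmation is shown below. The wrapper name reset_wal_guarded and its --yes-lose-data argument are hypothetical, not pg_resetwal options:

```shell
# Hypothetical wrapper around pg_resetwal: dry run first, force only
# when the caller passes the (made-up) --yes-lose-data argument.
reset_wal_guarded() (
    datadir=$1
    confirm=${2:-}
    # Print the control values that would be used; changes nothing.
    pg_resetwal -n "$datadir" || exit 1
    if [ "$confirm" != "--yes-lose-data" ]; then
        echo "dry run only; pass --yes-lose-data to run pg_resetwal -f" >&2
        exit 0
    fi
    # Destructive step: discards the existing WAL.
    pg_resetwal -f "$datadir"
)

# Example: reset_wal_guarded "$PGDATA"                    (inspect only)
#          reset_wal_guarded "$PGDATA" --yes-lose-data    (actually reset)
```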

PostgreSQL starts successfully

-bash-4.2$ pg_ctl start -D $PGDATA
waiting for server to start....2022-05-29 22:01:02.647 CST [37178] LOG:  redirecting log output to logging collector process
2022-05-29 22:01:02.647 CST [37178] HINT:  Future log output will appear in directory "log".
 done
server started

Log records

2022-05-29 22:01:02.647 CST [37178] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-05-29 22:01:02.648 CST [37178] LOG:  listening on IPv6 address "::1", port 5432
2022-05-29 22:01:02.648 CST [37178] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-29 22:01:02.649 CST [37178] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-29 22:01:02.651 CST [37178] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-29 22:01:02.653 CST [37180] LOG:  database system was shut down at 2022-05-29 22:00:47 CST
2022-05-29 22:01:02.661 CST [37178] LOG:  database system is ready to accept connections

Check the WAL directory: a new segment has been created

-bash-4.2$ pwd
/var/lib/pgsql/14/data/pg_wal
-bash-4.2$ ls -ltr
total 16384
drwx------. 2 postgres postgres        6 May 24 02:20 archive_status
-rw-------. 1 postgres postgres 16777216 May 29 22:01 000000010000000000000014

Verify the test table created earlier

-bash-4.2$ psql
psql (14.3)
Type "help" for help.

postgres=# select count(1) from t_xifenfei;
ERROR:  relation "t_xifenfei" does not exist
LINE 1: select count(1) from t_xifenfei;
                             ^

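Because the WAL needed for instance recovery was lost, the table and its rows are lost as well. This shows that this kind of operation carries a risk of data loss, so it should be used with great caution in production environments.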
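
Since pg_resetwal -f cannot be undone, one cautious option is to take a file-level copy of the data directory while the server is stopped, before resetting the WAL. A minimal sketch, assuming enough free disk space; backup_datadir and the destination path in the example are hypothetical:

```shell
# Hypothetical helper: copy a stopped cluster's data directory to DEST,
# refusing to overwrite anything that already exists there.
backup_datadir() (
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "no such data directory: $src" >&2; exit 1; }
    [ ! -e "$dest" ] || { echo "refusing to overwrite: $dest" >&2; exit 1; }
    cp -Rp "$src" "$dest"     # recursive copy, preserving ownership and modes
)

# Example: backup_datadir "$PGDATA" /backup/pgdata_before_resetwal
```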