今天一位朋友在问pg_control文件丢失或损坏是否可以启动PostgreSQL的问题,于是在测试系统尝试了一下。
首先记录下正常情况下的pg_controldata信息如下:这个信息非常有用,需要用于恢复.
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5556352067747738614
Database cluster state: shut down
pg_control last modified: Fri 12 Aug 2011 03:12:16 PM CST
Latest checkpoint location: 23/50000020
Prior checkpoint location: 23/4C000020
Latest checkpoint's REDO location: 23/50000020
Latest checkpoint's TimeLineID: 6
Latest checkpoint's NextXID: 0/4710
Latest checkpoint's NextOID: 2082697
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 655
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Fri 12 Aug 2011 03:12:16 PM CST
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: hot_standby
Current max_connections setting: 2000
Current max_prepared_xacts setting: 50
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 1048576
WAL block size: 8192
Bytes per WAL segment: 67108864
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
然后删除$PGDATA/global/pg_control文件,启动数据库 .
pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
postgres: could not find the database system
Expected to find it in the directory "/database/pgdata/tbs1/pg_root",
but could not open file "/database/pgdata/tbs1/pg_root/global/pg_control": No such file or directory
pg_ctl: could not start server
Examine the log output.
是的,报错。
恢复pg_control如下:
首先 touch $PGDATA/global/pg_control
然后 pg_resetxlog $PGDATA
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Guessed pg_control values:
First log file ID after reset: 35
First log file segment after reset: 38
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5639864756195480479
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/3
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 1048576
WAL block size: 8192
Bytes per WAL segment: 67108864
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
If these values seem acceptable, use -f to force reset.
从上面的值可以看出以下数值已经和正常情况下不一样了:
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/3
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
接下来重建pg_control
pg_resetxlog -f $PGDATA
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Transaction log reset
重建完后,先不启动数据库,看看重建的pg_control当前的信息 :
pg_controldata
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5639864833504827199
Database cluster state: shut down
pg_control last modified: Fri 12 Aug 2011 03:15:59 PM CST
Latest checkpoint location: 23/98000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 23/98000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/3
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Fri 12 Aug 2011 03:15:59 PM CST
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 1048576
WAL block size: 8192
Bytes per WAL segment: 67108864
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
可以看出以下信息和正常的pg_control信息不一致:
Database system identifier: 5639864833504827199
pg_control last modified: Fri 12 Aug 2011 03:15:59 PM CST
Latest checkpoint location: 23/98000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 23/98000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/3
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Fri 12 Aug 2011 03:15:59 PM CST
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting: 0
接下来启动数据库
pg_ctl start $PGDATA
再次查看控制文件的信息
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5639864833504827199
Database cluster state: in production
pg_control last modified: Fri 12 Aug 2011 03:17:22 PM CST
Latest checkpoint location: 23/98000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 23/98000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/3
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Fri 12 Aug 2011 03:15:59 PM CST
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: hot_standby
Current max_connections setting: 2000
Current max_prepared_xacts setting: 50
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 1048576
WAL block size: 8192
Bytes per WAL segment: 67108864
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
某些信息已经和正常情况下的pg_control信息一致了,有一些是来自参数文件的内容。
现在不一致的信息如下:
Database system identifier: 5639864833504827199
pg_control last modified: Fri 12 Aug 2011 03:17:22 PM CST
Latest checkpoint location: 23/98000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 23/98000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/3
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Fri 12 Aug 2011 03:15:59 PM CST
这些没有办法从参数文件获取。
OK,pg_control文件修复完成,但是问题来了。所有的用户,库都看不见了,但是数据还在,并且能进库,查看数据,但是就是在pg_class这种数据表里面看不到数据,原因是MVCC带来的假象,数据看似消失了。(执行完VACUUM就真的消失了)
postgres@db-172-16-3-33-> psql -h 127.0.0.1 -l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access privileges
------+-------+----------+-----------+-------+-------------------
(0 rows)
修正消失的数据 , 通过修改控制文件的
Latest checkpoint's NextXID: 0/4710 :
TIMELINE我这里就没修正,不过最好能修正还是修正。
postgres@db-172-16-3-33-> pg_ctl stop -m fast
postgres@db-172-16-3-33-> pg_resetxlog -x 4710 $PGDATA
Transaction log reset
postgres@db-172-16-3-33-> pg_ctl start
server starting
postgres@db-172-16-3-33-> psql -h 127.0.0.1 -l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access privileges
-------------+-------------+----------+-----------+-------+-----------------------
digoal | digoal | UTF8 | C | C | digoal=CTc/digoal +
| | | | | role_b=c/digoal +
| | | | | =CTc/digoal
fingerprint | fingerprint | UTF8 | C | C |
pgwatch | pgwatch | UTF8 | C | C |
postgres | postgres | UTF8 | C | C |
template0 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
(6 rows)
现在就正常了。
以上恢复过程风险较大,请勿模仿。
可靠的恢复操作请见另一篇BLOG :
http://blog.163.com/digoal@126/blog/static/16387704020130109400557/