recovery_target_timeline and multiple slave behavior when master fails

Поиск
Список
Период
Сортировка
От Rick Pufky
Тема recovery_target_timeline and multiple slave behavior when master fails
Дата
Msg-id CAAaAz-JCe86N+5hGtpDvsSrNM4Od6C1t4EzQvAi64oOT3UD9gw@mail.gmail.com
обсуждение исходный текст
Ответы Re: recovery_target_timeline and multiple slave behavior when master fails  (Fujii Masao <masao.fujii@gmail.com>)
Список pgsql-general
I'm attempting to make use of the ability defined by the statement at the end of the first paragraph in this section of the documentation (http://www.postgresql.org/docs/9.1/static/warm-standby.html#STANDBY-SERVER-SETUP). Specifically, "If you plan to have multiple standby servers for high availability purposes, set recovery_target_timeline to latest, to make the standby server follow the timeline change that occurs at failover to another standby."

To attempt this, I've setup 3 pg machines, a master (192.168.56.5) and 2 pitr slaves (192.168.56.6, 192.168.56.7), both connected to the master. Below, the hostnames will remain the same, even if a slave is promoted to master, so slave 1 will always refer to 192.168.56.6. I've compiled/installed pg 9.1.2 and pgbench for testing purposes. I've made changes to the postgresql.conf file as follows (for ease of testing, using same postgresql.conf on all servers):
wal_level = hot_standby
archive_mode = off
max_wal_senders = 3
wal_keep_segments = 100
hot_standby = on
hot_standby_feedback = on

And on the slaves, I've got this for recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=192.168.56.5 port=5432 user=postgres'
trigger_file = '/pgsql/omnipitr/finish.recovery'
recovery_target_timeline = latest


I've initialized a fresh DB on master, then executed the following to create fresh slave copies (with .7 instead of .6 to create slave 2):
      pg_start_backup('start');
      rsync -avh --exclude=pg_log --exclude=postgresql.conf --exclude=postmaster.pid . 192.168.56.6:/pgsql/91/data
      pg_stop_backup();
      rsync -avh --exclude=pg_log --exclude=postgresql.conf --exclude=postmaster.pid . 192.168.56.6:/pgsql/91/data

Both slaves were brought online and to prove that streaming replication was working, I created a table and threw some simple data into it. These showed up on the slaves, so setup appears good so far. For further data on the master, I used pgbench (time /opt/pgsql/bin/pgbench -i -s 35 test). In the middle of the pgbench data load (I let it load more than 1 million tuples first), I killed the postgresql/pgbench processes on the master.
root@pg912-master:~# ps -ef |grep post
postgres  1955     1  0 11:39 pts/0    00:00:00 /opt/pgsql/bin/postmaster -D /pgsql/91/data
postgres  1956  1955  0 11:39 ?        00:00:00 postgres: logger process                   
postgres  1958  1955  0 11:39 ?        00:00:00 postgres: writer process                   
postgres  1959  1955  0 11:39 ?        00:00:00 postgres: wal writer process               
postgres  1960  1955  0 11:39 ?        00:00:00 postgres: autovacuum launcher process      
postgres  1961  1955  0 11:39 ?        00:00:00 postgres: stats collector process          
postgres  2013  1955  1 11:50 ?        00:00:11 postgres: wal sender process postgres 192.168.56.6(35184) streaming 0/2B000000
postgres  2015  1955  0 11:50 ?        00:00:02 postgres: wal sender process postgres 192.168.56.7(44052) streaming 0/2B000000
postgres  2049  1709  1 12:05 pts/1    00:00:00 /opt/pgsql/bin/pgbench -i -s 35 test
postgres  2050  1955 12 12:05 ?        00:00:00 postgres: postgres test [local] COPY       
root      2052  1602  0 12:05 pts/0    00:00:00 grep post
root@pg912-master:~# kill -9 1955 1956 1958 1959 1960 1961 2013 2015 2049 2050

After the master terminated, I promoted slave 1 to become the new master by touching the trigger file (touch /pgsql/omnipitr/finish.recovery). Some relevant snippets from slave 1's log file:
2011-12-15 12:05:56.191 EST    2580:LOG:  recovery restart point at 0/2D0BE0B0
2011-12-15 12:05:56.191 EST    2580:DETAIL:  last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:05:57.031 EST    2580:LOG:  restartpoint starting: xlog
2011-12-15 12:06:03.067 EST    2580:LOG:  restartpoint complete: wrote 98 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=4.907 s, sync=1.101 s, total=6.035 s; sync files=2, longest=1.070 s, average=0.550 s
2011-12-15 12:06:03.067 EST    2580:LOG:  recovery restart point at 0/300BE0B0
2011-12-15 12:06:03.067 EST    2580:DETAIL:  last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:06:05.148 EST    2581:FATAL:  could not receive data from WAL stream:
2011-12-15 12:06:09.769 EST    2579:LOG:  invalid record length at 0/33C92290
2011-12-15 12:06:09.938 EST    2608:FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.5" and accepting
                TCP/IP connections on port 5432?
... more of the same primary server connection refused error ...
2011-12-15 12:08:09.892 EST    2579:LOG:  trigger file found: /pgsql/omnipitr/finish.recovery
2011-12-15 12:08:09.892 EST    2579:LOG:  redo done at 0/33C921F0
2011-12-15 12:08:09.892 EST    2579:LOG:  last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:08:09.892 EST    2579:LOG:  selected new timeline ID: 2
2011-12-15 12:08:10.078 EST    2579:LOG:  archive recovery complete
2011-12-15 12:08:10.491 EST    2580:LOG:  restartpoint complete: wrote 2692 buffers (87.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=112.051 s, sync=0.274 s, total=112.361 s; sync files=2, longest=0.273 s, average=0.137 s
2011-12-15 12:08:10.492 EST    2580:LOG:  recovery restart point at 0/330BE0B0
2011-12-15 12:08:10.492 EST    2580:DETAIL:  last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:08:10.493 EST    2580:LOG:  checkpoint starting: end-of-recovery immediate wait
2011-12-15 12:08:10.497 EST    2580:LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.005 s; sync files=0, longest=0.000 s, average=0.000 s
2011-12-15 12:08:10.951 EST    2651:LOG:  autovacuum launcher started
2011-12-15 12:08:10.960 EST    2577:LOG:  database system is ready to accept connections


Slave 2 was still attempting to connect to the master, so I modified its recovery.conf file to look like this (connect to slave 1 instead of master) and restarted pg on slave 2:
standby_mode = 'on'
primary_conninfo = 'host=192.168.56.6 port=5432 user=postgres'
trigger_file = '/pgsql/omnipitr/finish.recovery'
recovery_target_timeline = latest

Relevant log snippets from slave 2:
2011-12-15 12:05:54.992 EST    2626:LOG:  recovery restart point at 0/2D0BE0B0
2011-12-15 12:05:54.992 EST    2626:DETAIL:  last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:05:56.295 EST    2626:LOG:  restartpoint starting: xlog
2011-12-15 12:06:05.979 EST    2627:FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

2011-12-15 12:06:06.191 EST    2625:LOG:  unexpected pageaddr 0/28E40000 in log file 0, segment 50, offset 14942208
2011-12-15 12:06:06.218 EST    2626:LOG:  restartpoint complete: wrote 185 buffers (6.0%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=9.692 s, sync=0.222 s, total=9.923 s; sync files=2, longest=0.204 s, average=0.111 s
2011-12-15 12:06:06.218 EST    2626:LOG:  recovery restart point at 0/300BE0B0
2011-12-15 12:06:06.218 EST    2626:DETAIL:  last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:06:21.419 EST    2652:FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.5" and accepting
                TCP/IP connections on port 5432?
... more of the same primary server connection refused error ...
2011-12-15 12:08:44.924 EST    2623:LOG:  received fast shutdown request
2011-12-15 12:08:44.935 EST    2623:LOG:  aborting any active transactions
2011-12-15 12:08:44.935 EST    2626:LOG:  shutting down
2011-12-15 12:08:44.945 EST    2626:LOG:  database system is shut down
2011-12-15 12:08:46.802 EST    2737:LOG:  database system was shut down in recovery at 2011-12-15 12:08:44 EST
2011-12-15 12:08:46.804 EST    2737:LOG:  entering standby mode
2011-12-15 12:08:46.818 EST    2737:LOG:  redo starts at 0/300BE0B0
2011-12-15 12:08:47.313 EST    2737:LOG:  consistent recovery state reached at 0/32E3FFF0
2011-12-15 12:08:47.313 EST    2735:LOG:  database system is ready to accept read only connections
2011-12-15 12:08:47.313 EST    2737:LOG:  unexpected pageaddr 0/28E40000 in log file 0, segment 50, offset 14942208
2011-12-15 12:08:47.437 EST    2740:FATAL:  timeline 2 of the primary does not match recovery target timeline 1
2011-12-15 12:08:52.323 EST    2742:FATAL:  timeline 2 of the primary does not match recovery target timeline 1
... repeated continuously ...


Any thoughts on the above snippets? Am I interpreting the documentation correctly? Is there any further information needed to debug this?

Thanks,
Rick

В списке pgsql-general по дате отправления:

Предыдущее
От: Yan Chunlu
Дата:
Сообщение: pg_dump and pg_restore make data dir doubled size?
Следующее
От: amit sehas
Дата:
Сообщение: indexes and tables