Thread: New Slave - timeline ERROR


New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:

I've started setting up a new SLAVE PostgreSQL server.

* NOTE: I ran pg_basebackup from another STANDBY SERVER, not from the MASTER.

1 - screen -t basebackup

2 - su - postgres

3 - cd ~/9.2/data/

4 - ssh postgres@slave01 'pg_basebackup --pgdata=- --format=tar --label=bb_master --progress --host=localhost --port=5432 --username=replicator --xlog | pv --quiet --rate-limit 100M' | tar -x --no-same-owner

5 - I've commented out the "primary_conninfo =" and "standby_mode =" lines so the slave can get the files from WAL_ARCHIVE

6 - After that I got the logs:

postgres(iostreams)[10037]:       2016-01-09 00:07:26.604 UTC|10085|LOG:  database system is ready to accept read only connections

7 - After the server finished the WAL_ARCHIVE, I turned on replication from MASTER on recovery.conf:

recovery.conf on the New Slave:

restore_command = 'exec nice -n 19 ionice -c 2 -n 7 ../../bin/restore_wal_segment.bash "../wal_archive/%f" "%p"'
archive_cleanup_command = 'exec nice -n 19 ionice -c 2 -n 7 ../../bin/pg_archivecleaup_mv.bash -d "../wal_archive" "%r"'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.100.XX port=5432 user=replicator application_name=replication_slave02'

But once I restarted PostgreSQL I got this error:

WAL segment `../wal_archive/00000005.history` not found
2016-01-09 01:13:39.183 UTC|774|FATAL:  timeline 2 of the primary does not match recovery target timeline 4
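A useful diagnostic detail here: the timeline ID is encoded in the first 8 hex digits of every WAL segment and .history file name, and pg_controldata reports the timeline a cluster is on. A minimal sketch (the helper name is hypothetical):

```shell
# The first 8 hex characters of a WAL segment or .history file name
# are the timeline ID. Hypothetical helper to extract it:
timeline_of() { printf '%s\n' "$1" | cut -c1-8; }

timeline_of 000000040000001A000000FF   # prints 00000004 (timeline 4)

# On each server, pg_controldata also reports the current timeline, e.g.:
#   pg_controldata /var/lib/pgsql/9.2/data | grep -i timeline
```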

What can I do to solve the problem?

It's really important, as it's a new production slave. Thank you!

Re: New Slave - timeline ERROR

From:
bricklen
Date:
On Fri, Jan 8, 2016 at 8:44 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:

Hi, I'm a bit too lazy to try to suss out the exact reasons for your failure, but here is a reasonably thorough guide to setting up replication:
http://dba.stackexchange.com/a/53546/24393

A few tips:
- Having the master ship WALs to the slaves is handy if you can pull it off. If you are doing it over the wire and using rsync, "-z" for compression is recommended. If you are using the tar format of pg_basebackup, you *must* have the master ship the WALs to the slave, otherwise it won't be able to synchronize (the "stream" method ships WALs over the wire, so the end result is a synchronized system).

- I always run pg_basebackup from the slave I am building, for simplicity.
- I create new slaves almost every day (we have thousands of databases) using a bash script and it almost never fails. In essence it is a big wrapper around the pg_basebackup command (though we are using pg93 mostly).

The base backup command that I run from the slave I am building:
pg_basebackup --pgdata=$PGDATA --host=$MASTER_IP --port=$PGPORT --username=replication --no-password --xlog-method=stream --format=plain --progress --verbose

The recovery.conf:
standby_mode = 'on'
primary_conninfo = 'user=replication host=$IP_OF_UPSTREAM_SLAVE_OR_MASTER port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
recovery_target_timeline = 'latest'
archive_cleanup_command = '/usr/pgsql-9.3/bin/pg_archivecleanup /path/to/WALs %r'
restore_command = 'cp /path/to/WALs/%f "%p" 2>> /your/PGDATA/path/pg_log/standby.log'

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
Hi @bricklen. Thanks for your reply. I've been working on it for 20h =(
So....

The master is currently shipping the WALs to the slave.
When pg_basebackup finished, I got a successful log:
postgres(iostreams)[10037]:       2016-01-09 00:07:26.604 UTC|10085|LOG:  database system is ready to accept read only connections
The problem happens when, after pg_basebackup, I change recovery.conf to replicate from live, changing these 2 lines:
standby_mode = on
primary_conninfo = 'host=IP_TO_THE_OTHER_SLAVE port=5432 user=replicator application_name=replication_slave02'

I read the link you have sent: http://dba.stackexchange.com/a/53546/24393
I'm doing everything exactly the same, but one parameter is different:

time pg_basebackup --pgdata=$PGDATA --host=IP_OF_MASTER --port=5432 --username=replication --password --xlog-method=stream --format=plain --progress --verbose
I'm not using --xlog-method=stream.
I'm using:
ssh postgres@slave1 'pg_basebackup --pgdata=- --format=tar --label=bb_master --progress --host=localhost --port=5432 --username=replicator --xlog | pv --quiet --rate-limit 100M' | tar -x --no-same-owner

On http://www.postgresql.org/docs/9.2/static/app-pgbasebackup.html it says:

--xlog

Using this option is equivalent to using -X with method fetch.

So... I have 2 TB of data. I wouldn't like to run pg_basebackup again with the --xlog-method=stream option without being sure that it's going to work....


Do you have any other idea? Do you know if --xlog is the problem and whether I should re-run pg_basebackup with the --xlog-method=stream option?

Thank you



Lucas Possamai



Re: New Slave - timeline ERROR

From:
bricklen
Date:

On Sat, Jan 9, 2016 at 12:36 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
Do you have any other idea? Do you know if --xlog is the problem and whether I should re-run pg_basebackup with the --xlog-method=stream option?


If the master is successfully shipping WALs to the slave you are setting up, you do not need the "stream" option.

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
Hi,

If the master is successfully shipping WALs to the slave you are setting up, you do not need the "stream" option.

yes.. the master is successfully shipping the WALs....

Is there anything else? Help, please hehehehe





Lucas Possamai



Re: New Slave - timeline ERROR

From:
bricklen
Date:
On Sat, Jan 9, 2016 at 1:49 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
Hi,

If the master is successfully shipping WALs to the slave you are setting up, you do not need the "stream" option.

yes.. the master is successfully shipping the WALs....

Is there anything else? Help, please hehehehe

If you are able to stop Postgres on the slave you are taking the base backup from, you could do this:

1). Stop postgres on slave1
2). Rsync slave1 to slave2 to copy only the deltas.
3). When you start up slave2 the WALs that the master has shipped to slave2 should apply and bring your system up to consistency.
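As a rough sketch (hosts and paths are hypothetical, following the 9.2 layout mentioned earlier in the thread), those three steps could look like:

```shell
# 1) On slave1: stop postgres so the data files stop changing
#    pg_ctl -D /var/lib/pgsql/9.2/data stop -m fast

# 2) On slave2: pull only the changed files (deltas) from slave1
rsync -a --progress --partial --exclude postmaster.pid \
    postgres@slave1:/var/lib/pgsql/9.2/data/ /var/lib/pgsql/9.2/data/

# 3) On slave2: start postgres; the WALs the master has been shipping
#    will be replayed to bring the cluster up to consistency
#    pg_ctl -D /var/lib/pgsql/9.2/data start
```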

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
Hi,

If you are able to stop Postgres on the slave you are taking the base backup from, you could do this:

I'm not... the database is 2 TB.
So, an rsync would take DAYS.....  And I'm not able to stop the SLAVE for that long

Lucas 



Lucas Possamai



Re: New Slave - timeline ERROR

From:
bricklen
Date:
On Sat, Jan 9, 2016 at 1:54 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
Hi,

If you are able to stop Postgres on the slave you are taking the base backup from, you could do this:

I'm not... the database is 2 TB.
So, an rsync would take DAYS.....  And I'm not able to stop the SLAVE for that long

Depending on when you took the base backup and how many changes have occurred at your source (slave1) database cluster, the rsync execution time may or may not take as long as a new base backup if it is only shipping deltas (changed files).

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
Hmm... I see...

Depending on when you took the base backup and how many changes have occurred at your source (slave1) database cluster, the rsync execution time may or may not take as long as a new base backup if it is only shipping deltas (changed files).

I could stop the slave then.... But I'm afraid of getting it back online and hitting some other errors hehehehe =\







Lucas Possamai



Re: New Slave - timeline ERROR

From:
bricklen
Date:
On Sat, Jan 9, 2016 at 2:10 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
I could stop the slave then.... But I'm afraid of getting it back online and hitting some other errors

At this point I think your options are slim. If you are feeling adventurous, you can try doing the rsync with the slave running, then do a second rsync with the slave stopped, or do it from the master after putting the master in backup mode (e.g. executing "pg_start_backup('slave_backup')").
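As a sketch, the adventurous two-pass approach could look like this (hosts and paths are hypothetical):

```shell
# Pass 1: bulk copy while slave1 is still running (the copy will be fuzzy)
rsync -a --partial --exclude postmaster.pid \
    postgres@slave1:/var/lib/pgsql/9.2/data/ /var/lib/pgsql/9.2/data/

# Pass 2: stop slave1 briefly, then re-run the exact same rsync; only the
# files that changed during pass 1 are transferred, so it is much shorter.

# Alternative: copy from the master instead, wrapped in a backup:
#   psql -c "SELECT pg_start_backup('slave_backup');"
#   ... rsync from the master ...
#   psql -c "SELECT pg_stop_backup();"
```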

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
rsync would be something like:

from slave1:
rsync -av data/pg_xlog slave2:/var/lib/postgresql/data/

Is that correct?

At this point I think your options are slim. If you are feeling adventurous, you can try doing the rsync with the slave running, then do a second rsync with the slave stopped, or do it from the master after putting the master in backup mode (e.g. executing "pg_start_backup('slave_backup')").

I didn't understand why to do the rsync twice... sorry

Lucas



Lucas Possamai



Re: New Slave - timeline ERROR

From:
bricklen
Date:


On Sat, Jan 9, 2016 at 2:22 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
rsync would be something like:

from slave1:
rsync -av data/pg_xlog slave2:/var/lib/postgresql/data/

Normally I run something like the following from the slave I am setting up.

rsync -azr --progress --partial postgres@$MASTER_IP:/var/lib/postgresql/data /var/lib/postgresql/data/ --exclude postmaster.pid
 
I didn't understand why to do the rsync twice... sorry

Unless the source db cluster you are rsync'ing from is stopped, there will be changes to the data files replicated from the master. The second rsync might not be necessary, given that WALs are shipping from the master to slave2.
 

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
rsync -azr --progress --partial postgres@$MASTER_IP:/var/lib/postgresql/data /var/lib/postgresql/data/ --exclude postmaster.pid

Ah ok! So this will do an incremental copy, right? It's not supposed to copy ALL of base/ again?





Lucas Possamai



Re: New Slave - timeline ERROR

From:
bricklen
Date:
On Sat, Jan 9, 2016 at 2:35 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
rsync -azr --progress --partial postgres@$MASTER_IP:/var/lib/postgresql/data /var/lib/postgresql/data/ --exclude postmaster.pid

Ah ok! So this will do an incremental copy, right? It's not supposed to copy ALL of base/ again?

Yes, this is for incremental copying from the upstream source.
Actually, you don't need the -r with -a (it is implied), and you can run it first with --dry-run to see what it _would_ do.
If you are not shipping over the WAN, then omit the -z flag as you do not need compression.
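Putting those tips together, a hedged example of the preview run (paths hypothetical):

```shell
# -a implies -r; --dry-run transfers nothing, only lists what would change
rsync -a --dry-run --itemize-changes --exclude postmaster.pid \
    postgres@$MASTER_IP:/var/lib/postgresql/data/ /var/lib/postgresql/data/

# Add -z only when copying over a WAN link; on a LAN it just burns CPU.
```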

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
Should I point of replication new slave to same DB?

Lucas



Re: New Slave - timeline ERROR

From:
John R Pierce
Date:
On 1/9/2016 4:33 PM, drum.lucas@gmail.com wrote:
Should I point of replication new slave to same DB?


I can't even guess what you're asking here.



-- 
john r pierce, recycling bits in santa cruz

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
What is the --pgdata=- in your original command? Are you perhaps in the wrong directory and not getting all the required files?

I run the pg_basebackup from the Slave on /var/lib/pgsql/9.2/data.
So I'm not in the wrong directory...

I'm out of fresh ideas. The rsync command is what I would go with, given that I can't think of any other commands to try.

I chose the pg_basebackup command precisely so I would not have to stop any database. Stopping even the slave is out of the question... sorry...

I really don't know what else to do. I have tried everything!

Lucas

On 10 January 2016 at 13:31, bricklen <bricklen@gmail.com> wrote:
Bottom-posting is the convention in the postgresql lists, and makes it easier to follow a long thread.

On Sat, Jan 9, 2016 at 3:16 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
My servers are not in the same network. A new pg_basebackup would take 30 hours to complete, as I use --rate-limit 100MB.

If you had enough bandwidth, you could do some shell magic to parallelize the rsync commands, or use something like http://moo.nac.uci.edu/~hjm/parsync/ to do that. If you are limited by bandwidth, then a single rsync run is probably what you're stuck with.
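A rough sketch of that "shell magic" (paths hypothetical; each subdirectory of base/ is one database's files, so they can be copied independently):

```shell
# Run up to 4 rsyncs at once, one per database directory under base/
ls /var/lib/pgsql/9.2/data/base | xargs -P 4 -I{} \
    rsync -a --partial \
    postgres@$MASTER_IP:/var/lib/pgsql/9.2/data/base/{}/ \
    /var/lib/pgsql/9.2/data/base/{}/
```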
 
I really need to put this server up! =\

If you were running zfs you could also take a snapshot of the fs and use that for your base backup, but I assume you would have mentioned that if it was an option.

 
I don't think that running pg_basebackup one more time will solve the problem, because I've already done that!
I could actually run it again, but the problem is that it takes 30h! hahahahah

What is the --pgdata=- in your original command? Are you perhaps in the wrong directory and not getting all the required files?
 

I'm out of fresh ideas. The rsync command is what I would go with, given that I can't think of any other commands to try.

 

Have a look:

Note that there are some limitations in an online backup from the standby:
 
The backup history file is not created in the database cluster backed up.
There is no guarantee that all WAL files required for the backup are archived at the end of backup. If you are planning to use the backup for an archive recovery and want to ensure that all required files are available at that moment, you need to include them into the backup by using -x option.

You had that in your original command I believe.

Re: New Slave - timeline ERROR

From:
"drum.lucas@gmail.com"
Date:
Hi guys..
I started a new pg_basebackup and it's working now.

The problem was the standby_mode = on line in recovery.conf on the STANDBY server.

After the previous base backup, I had commented out this line and then started PostgreSQL. BUT you shouldn't do that.

This last time I didn't comment it out, and it worked.

Thank you!



Lucas Possamai

