Обсуждение: Replication: slave is in permanent startup 'recovery'

Поиск
Список
Период
Сортировка

Replication: slave is in permanent startup 'recovery'

От
"Henry C."
Дата:
Greets,

Pg 9.0.3

This must be due to my own misconfiguration, so apologies if I'm not seeing
the obvious - I've noticed that my slave seems to be stuck in a permanent
startup/recovery state.  ps on the slave shows:

...
postgres: wal receiver process   streaming 190/A6C384A0
postgres: startup process   recovering 0000000100000190000000A6 (1)
...

(1) keeps incrementing to reflect each new log file in pg_xlog on the master
(even after the slave caught up with all the other log files after the initial
rsync).

If I try and execute a long-lived SQL query on the slave, it eventually fails
with "canceling statement due to conflict with recovery".  Replication is
definitely working (DML actions are propagated to the slave), but something is
amiss.

I'm trying Streaming replication.  Once I get this working reliably, the idea
is to use it on a rather busy server where the log files are used in case the
slave(s) fall behind the stream during peak periods.  (sorry if I'm using the
wrong idioms, or not quite understanding Pg's built in replication methods -
I'm used to Skype's Londiste)

MASTER CONFIG
wal_level = hot_standby
archive_mode = on
archive_command = 'cp -a "%p" /home/psql-wal-archive/"%f"'
max_wal_senders = 5
wal_keep_segments = 8

SLAVE CONFIG
wal_level = hot_standby
wal_keep_segments = 64
hot_standby = on


Any ideas what I'm doing wrong here?

Thanks


Re: Replication: slave is in permanent startup 'recovery'

От
"Henry C."
Дата:
Forgot to mention recovery.conf on slave:

standby_mode = 'on'
primary_conninfo = 'host..."
restore_command = 'cp /home/psql-wal-archive/%f "%p"'
archive_cleanup_command = 'pg_archivecleanup /home/psql-wal-archive %r'


The wiki states "If wal_keep_segments is a high enough number to retain the
WAL segments required for the standby server, this [restore_command] may not
be necessary."

Presumably this is referring to wal_keep_segments config on the master, right?


Re: Replication: slave is in permanent startup 'recovery'

От
Tomas Vondra
Дата:
Dne 13.4.2011 20:42, Henry C. napsal(a):
>
> Forgot to mention recovery.conf on slave:
>
> standby_mode = 'on'
> primary_conninfo = 'host..."
> restore_command = 'cp /home/psql-wal-archive/%f "%p"'
> archive_cleanup_command = 'pg_archivecleanup /home/psql-wal-archive %r'
>
>
> The wiki states "If wal_keep_segments is a high enough number to retain the
> WAL segments required for the standby server, this [restore_command] may not
> be necessary."
>
> Presumably this is referring to wal_keep_segments config on the master, right?

Right. If there are enough WAL segments on the master, the standby may
ask for them and the data will be streamed to the standby. So the
archive mode is not a requirement, although if you already use WAL
archiving, it's a good idea to use it (no additional overhead on the
primary etc.).

regards
Tomas

Re: Replication: slave is in permanent startup 'recovery'

От
"Henry C."
Дата:
On Wed, April 13, 2011 20:15, Henry C. wrote:
> If I try and execute a long-lived SQL query on the slave, it eventually fails
>  with "canceling statement due to conflict with recovery".  Replication is
> definitely working (DML actions are propagated to the slave), but something
> is amiss.

Let me ask the question another way.

Following the hints at http://wiki.postgresql.org/wiki/Streaming_Replication:

Between the primary and standby hosts, pg_current_xlog_location() and
pg_last_xlog_receive_location()/select pg_last_xlog_replay_location() show
what appears to be normal activity (ie, replication is happening successfully,
and the numbers indicate health).

primary% ps -ef | grep sender
standby% ps -ef | grep receiver

...show things talking nicely and the numbers match.

ps also shows a startup process recovering logs on the standby (which I assume
is normal behaviour).

However, a SELECT eventually fails with "canceling statement due to conflict
with recovery".

Where else can I check, or what else can I do to determine what the problem is?

Thanks


Re: Replication: slave is in permanent startup 'recovery'

От
Craig Ringer
Дата:
On 14/04/2011 2:15 AM, Henry C. wrote:
> Greets,
>
> Pg 9.0.3
>
> This must be due to my own misconfiguration, so apologies if I'm not seeing
> the obvious - I've noticed that my slave seems to be stuck in a permanent
> startup/recovery state.

That's what warm- and hot-standby slaves are. They're continuously
replaying WAL files from the master, essentially the same thing as
during recovery from a bad shutdown. The advantage is that it's
*extremely* well tested code.

> If I try and execute a long-lived SQL query on the slave, it eventually fails
> with "canceling statement due to conflict with recovery".

That's a limitation of streaming replication. It's a lot like the issue
Oracle has with running out of undo or redo log space. Essentially, my
understanding is that the hot standby server cannot replay WAL archives
to keep up with the master's changes at the same time as running
queries. To avoid getting too far behind the master because of a huge or
stuck query, it'll cancel very long-running queries.

Again from my limited understanding, the reason it can't replay WAL is
because the WAL records include overwrites of pages VACUUMed and re-used
on the master. HS is block-level replication; it cannot keep a page
in-place on the slave when the master has erased or overwritten it.

It's theoretically possible for the slave to copy blocks that're about
to be written out-of-line into a slave-side-only store of blocks that've
been erased on the master but are still needed by transactions on the
slave. The discussion I've read suggests that that'd be ... complicated
... to make work well especially with log replay happening concurrently.

> Replication is
> definitely working (DML actions are propagated to the slave), but something is
> amiss.

Nope, it's working as designed I'm afraid.

There are params you can tune to control how far slaves are allowed to
get behind the master before cancelling queries. I don't remember what
they are, but the manual will cover them. Do consider though that the
more behind the slave is, the more log files the master has to have
space to keep... and if the master runs out of space, things get ugly.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

Re: Replication: slave is in permanent startup 'recovery'

От
"Henry C."
Дата:
> However, a SELECT eventually fails with "canceling statement due to conflict
> with recovery".
>
> Where else can I check, or what else can I do to determine what the problem
> is?

...or maybe there _is_ no problem.

select count(*) from big_table; -- will fail because it's long-lived and rows
are changing (deletes, inserts, updates) underneath it.

select * from big_table where id=nnn; -- succeeds because the window for
something in that particular row to change during the select is too small.

All's good!

Sorry about the noise.

h


Re: Replication: slave is in permanent startup 'recovery'

От
"Henry C."
Дата:
> On 14/04/2011 2:15 AM, Henry C. wrote:
> Nope, it's working as designed I'm afraid.
>
> There are params you can tune to control how far slaves are allowed to
> get behind the master before cancelling queries...

Thanks Craig - this dawned on me eventually.

Re: Replication: slave is in permanent startup 'recovery'

От
Tomas Vondra
Дата:
Dne 14.4.2011 10:01, Craig Ringer napsal(a):
> That's a limitation of streaming replication. It's a lot like the issue
> Oracle has with running out of undo or redo log space. Essentially, my
> understanding is that the hot standby server cannot replay WAL archives
> to keep up with the master's changes at the same time as running
> queries. To avoid getting too far behind the master because of a huge or
> stuck query, it'll cancel very long-running queries.
>
> Again from my limited understanding, the reason it can't replay WAL is
> because the WAL records include overwrites of pages VACUUMed and re-used
> on the master. HS is block-level replication; it cannot keep a page
> in-place on the slave when the master has erased or overwritten it.

I guess it's worth mentioning there's another solution to this problem,
implemented in 9.1 - it's called feedback. That means the standby asks
the primary not to overwrite the blocks - see this

http://developer.postgresql.org/pgdocs/postgres/hot-standby.html#HOT-STANDBY-CONFLICT

But obviously this is not for free - you have to pay a price, in this
case a bloat on the primary (as it can't do VACUUM properly).

regards
Tomas

Re: Replication: slave is in permanent startup 'recovery'

От
"Henry C."
Дата:
> On 14/04/2011 2:15 AM, Henry C. wrote:
> Nope, it's working as designed I'm afraid.
>
> There are params you can tune to control how far slaves are allowed to
> get behind the master before cancelling queries...

Thanks Craig - this dawned on me eventually.