Thread: Temporary disabling a replica in a Patroni cluster

Temporary disabling a replica in a Patroni cluster

From
Victor Sudakov
Date:
Dear Colleagues,

Do you perchance know what is the correct procedure of temporarily
taking down a replica in a Patroni cluster, e.g. for 5-10 minutes of
hardware maintenance?

The problem is that after stopping the patroni process (service) on a
replica, patroni removes the corresponding physical replication slot
from the leader, and unless the wal_keep_size value is insanely high,
the replica, when up again, cannot restart streaming because the WAL
segments are already gone from the leader.

Well, you all know:
<%%%>LOG:  started streaming WAL from primary at B4A0/E2000000 on timeline 8
<%%%>FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000080000B4A0000000E2 has already been removed
<%%%>LOG:  waiting for WAL to become available at B4A0/E2002000
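
To tie the LSN in the first log line to the segment file named in the FATAL line, here is a minimal sketch of the mapping. It assumes the default 16 MB wal_segment_size, and the helper name wal_segment_name is made up for illustration; it is not part of Patroni or PostgreSQL.

```python
# Sketch: derive the WAL segment file name from an LSN and a timeline.
# Assumption: default wal_segment_size of 16 MB.
# wal_segment_name is a hypothetical helper, not a Patroni or PostgreSQL API.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn: str, timeline: int,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    # An LSN is printed as two hex numbers: high 32 bits / low 32 bits.
    high, low = (int(part, 16) for part in lsn.split("/"))
    position = (high << 32) | low
    segment_number = position // segment_size
    segments_per_xlogid = 0x100000000 // segment_size
    # File name: timeline, then the two halves of the segment number,
    # each as 8 upper-case hex digits.
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlogid,
        segment_number % segments_per_xlogid,
    )


print(wal_segment_name("B4A0/E2000000", 8))  # 000000080000B4A0000000E2
```

Both B4A0/E2000000 and B4A0/E2002000 fall into the same segment, which is why the replica keeps waiting on the file the leader has already recycled.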

Do you think there is a way to tell Patroni that a replica is down
temporarily and its replication slot should not be removed?

Or, what am I missing? 

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet



Re: Temporary disabling a replica in a Patroni cluster

From
"Georg H."
Date:
Hello Victor,


On 25.08.2023 at 13:18, Victor Sudakov wrote:
> Dear Colleagues,
>
> Do you perchance know what is the correct procedure of temporarily
> taking down a replica in a Patroni cluster, e.g. for 5-10 minutes of
> hardware maintenance?
>
> The problem is that after stopping the patroni process (service) on a
> replica, patroni removes the corresponding physical replication slot
> from the leader, and unless the wal_keep_size value is insanely high,
> the replica, when up again, cannot restart streaming because the WAL
> segments are already gone from the leader.
>
> Well, you all know:
> <%%%>LOG:  started streaming WAL from primary at B4A0/E2000000 on timeline 8
> <%%%>FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000080000B4A0000000E2 has already been removed
> <%%%>LOG:  waiting for WAL to become available at B4A0/E2002000
>
> Do you think there is a way to tell Patroni that a replica is down
> temporarily and its replication slot should not be removed?
>
> Or, what am I missing?


you may use patronictl pause + resume

keep in mind to set wal_keep_size (or wal_keep_segments, depending on
your PG version) high enough

regards

Georg




Re: Temporary disabling a replica in a Patroni cluster

From
Victor Sudakov
Date:
Georg H. wrote:
> Hello Victor,
> 
> 
> On 25.08.2023 at 13:18, Victor Sudakov wrote:
> > Dear Colleagues,
> >
> > Do you perchance know what is the correct procedure of temporarily
> > taking down a replica in a Patroni cluster, e.g. for 5-10 minutes of
> > hardware maintenance?
> >
> > The problem is that after stopping the patroni process (service) on a
> > replica, patroni removes the corresponding physical replication slot
> > from the leader, and unless the wal_keep_size value is insanely high,
> > the replica, when up again, cannot restart streaming because the WAL
> > segments are already gone from the leader.
> >
> > Well, you all know:
> > <%%%>LOG:  started streaming WAL from primary at B4A0/E2000000 on timeline 8
> > <%%%>FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000080000B4A0000000E2 has already been removed
> > <%%%>LOG:  waiting for WAL to become available at B4A0/E2002000
> >
> > Do you think there is a way to tell Patroni that a replica is down
> > temporarily and its replication slot should not be removed?
> >
> > Or, what am I missing?
> 
> 
> you may use patronictl pause + resume

I would like to do the maintenance on one node only and keep the rest
of the cluster functioning normally.

> 
> keep in mind to set wal_keep_size (or wal_keep_segments, depending on
> your PG version) high enough

I wrote above: "unless the wal_keep_size value is insanely high" :-)

Keeping wal_keep_size very high is a waste of disk space and still
provides no real guarantee, unfortunately. Why does Patroni use slots
at all then?
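
To put a rough number on that disk-space cost, here is a back-of-the-envelope sketch. The WAL generation rates are made-up illustrations, not measurements from this cluster, and wal_keep_size_mb is a hypothetical helper that assumes 16 MB segments.

```python
import math

# Sketch: estimate the wal_keep_size needed to cover a replica outage.
# Assumptions: 16 MB WAL segments; the WAL rates below are invented examples.
# wal_keep_size_mb is a hypothetical helper, not a Patroni or PostgreSQL API.


def wal_keep_size_mb(wal_rate_mb_per_s: float, downtime_s: float,
                     segment_mb: int = 16) -> int:
    # WAL written on the leader while the replica is down,
    # rounded up to whole segments.
    needed_mb = wal_rate_mb_per_s * downtime_s
    return math.ceil(needed_mb / segment_mb) * segment_mb


# A 10-minute maintenance window at a few example WAL rates.
for rate in (1, 5, 20):
    print(f"{rate} MB/s -> wal_keep_size >= {wal_keep_size_mb(rate, 10 * 60)} MB")
```

The estimate only holds while the real WAL rate and the real downtime stay within the assumed values, so it is a sizing aid and not a guarantee.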

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet



Re: Temporary disabling a replica in a Patroni cluster

From
Victor Sudakov
Date:
Victor Sudakov wrote:
> 
> Do you perchance know what is the correct procedure of temporarily
> taking down a replica in a Patroni cluster, e.g. for 5-10 minutes of
> hardware maintenance?
> 
> The problem is that after stopping the patroni process (service) on a
> replica, patroni removes the corresponding physical replication slot
> from the leader, and unless the wal_keep_size value is insanely high,
> the replica, when up again, cannot restart streaming because the WAL
> segments are already gone from the leader.
> 
> Well, you all know:
> <%%%>LOG:  started streaming WAL from primary at B4A0/E2000000 on timeline 8
> <%%%>FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000080000B4A0000000E2 has already been removed
> <%%%>LOG:  waiting for WAL to become available at B4A0/E2002000
> 
> Do you think there is a way to tell Patroni that a replica is down
> temporarily and its replication slot should not be removed?
> 
> Or, what am I missing? 

As WAL archiving (wal-g) is enabled in this cluster anyway, do you
think adding "postgresql.parameters.restore_command" to the Patroni
config will help in this situation? 

restore_command works very well in regular Postgres clusters when a
replica is catching up from a large replication lag, and it allows
wal_keep_size=0. However, does anyone know of any Patroni-specific
reasons not to use restore_command under Patroni?

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet