Thread: [GENERAL] BDR node removal and rejoin

[GENERAL] BDR node removal and rejoin

From

"Zhu, Joshua"

Date:

8 July 2017, 03:58:31

Hi, I am having difficulty removing a node from a BDR group (with nodes node1 through node5) and then rejoining it to the group.

Before a node is removed, BDR runs fine: a query on the bdr.bdr_nodes table shows every node with status ‘r’.

Here is what I did to remove node5 and rejoin it:

On node1, call bdr.bdr_part_by_node_names.

At this point the status of node5 in bdr.bdr_nodes becomes ‘k’.

On node5, call bdr.remove_bdr_from_local_node.
On node5, drop and recreate the database, then rejoin using bdr.bdr_group_join (with the same node name and external DSN).

At this point the status of node5 on node1 through node4 still remains ‘k’, and the status of node5 on node5 itself (where there is only one record) is ‘i’; both stay stuck at these status codes.

[Note: I also tried using a different node name when rejoining, with the same result.]

What have I done wrong, and what is the correct way to remove a node and rejoin it?

Thanks

Re: [GENERAL] BDR node removal and rejoin

From

"Zhu, Joshua"

Date:

11 July 2017, 03:49:02

An update: after manually removing the record for ‘node4’ from bdr.bdr_nodes, the corresponding record in bdr.bdr_connections, and the associated replication slot (with pg_drop_replication_slot), rejoining was successful.

I was under the impression that no manual cleanup is needed before a removed node (with its database dropped and recreated) rejoins a BDR group.
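
The manual cleanup described above can be sketched as a small Python helper that only builds the SQL, so the statements can be reviewed before anything is run. This is a sketch under assumptions, not a documented procedure: the catalog column names (node_sysid, node_timeline, node_dboid, node_name, node_status, conn_sysid, conn_timeline, conn_dboid) are taken from memory of the BDR 1.x catalogs and should be checked against the installed version, the `%s` placeholders assume a psycopg2-style driver, and the function names `cleanup_statements` and `drop_slot_statement` are hypothetical.

```python
from typing import List, Tuple

# A statement is a (sql, params) pair for a driver that uses %s placeholders.
Statement = Tuple[str, tuple]


def cleanup_statements(node_name: str) -> List[Statement]:
    """Build the cleanup statements for a node that has already been parted.

    Every statement is restricted to node_status = 'k', so a node that is
    still an active member of the group is never touched.
    Column names are assumptions; verify them against your BDR catalogs.
    """
    parted_node_key = (
        "SELECT node_sysid, node_timeline, node_dboid "
        "FROM bdr.bdr_nodes "
        "WHERE node_name = %s AND node_status = 'k'"
    )
    return [
        # 1. Remove the connection record that belongs to the parted node.
        (
            "DELETE FROM bdr.bdr_connections "
            "WHERE (conn_sysid, conn_timeline, conn_dboid) IN ("
            + parted_node_key
            + ")",
            (node_name,),
        ),
        # 2. Remove the node record itself so the name can be reused.
        (
            "DELETE FROM bdr.bdr_nodes "
            "WHERE node_name = %s AND node_status = 'k'",
            (node_name,),
        ),
        # 3. List inactive slots; the operator decides which one belonged
        #    to the parted node before dropping anything.
        (
            "SELECT slot_name FROM pg_replication_slots WHERE NOT active",
            (),
        ),
    ]


def drop_slot_statement(slot_name: str) -> Statement:
    """Build the call that drops one replication slot by name."""
    return ("SELECT pg_drop_replication_slot(%s)", (slot_name,))


if __name__ == "__main__":
    for sql, params in cleanup_statements("node5"):
        print(sql, params)
```

The connection record is deleted before the node record so that the lookup by node name still works for the first statement. The slot is deliberately not dropped automatically: picking the wrong slot would break replication to a healthy peer.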


Re: [GENERAL] BDR node removal and rejoin

From

Craig Ringer

Date:

12 July 2017, 14:58:30

On 11 July 2017 at 05:49, Zhu, Joshua <jzhu@vormetric.com> wrote:

> An update: after manually removing the record for ‘node4’ from bdr.bdr_nodes, the corresponding record in bdr.bdr_connections, and the associated replication slot (with pg_drop_replication_slot), rejoining was successful.
>
> I was under the impression that no manual cleanup is needed before a removed node (with its database dropped and recreated) rejoins a BDR group.

BDR1 requires that you manually remove the bdr.bdr_nodes entry if you intend to re-use the same node name.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [GENERAL] BDR node removal and rejoin

From

"Zhu, Joshua"

Date:

12 July 2017, 23:56:41

Thanks for the clarification.

It looks like I am running into a different issue. While trying to pin down the precise steps (and their order) needed to remove and rejoin a node, I repeated the removal/rejoin exercise a number of times, and it got stuck again:

The status of the rejoining node (node4) on the other nodes is “i”.
The status of the rejoining node on node4 itself started at “i”, changed to “o”, then got stuck there.
In the log file for node4, the following entries are generated constantly:

2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]DEBUG: 00000: received replication command: IDENTIFY_SYSTEM

2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]LOCATION: exec_replication_command, walsender.c:1309

2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]DEBUG: 08003: unexpected EOF on client connection

2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]LOCATION: SocketBackend, postgres.c:355

2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]DEBUG: 00000: received replication command: IDENTIFY_SYSTEM

2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]LOCATION: exec_replication_command, walsender.c:1309

2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]DEBUG: 08003: unexpected EOF on client connection

2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]LOCATION: SocketBackend, postgres.c:355

2017-07-12 10:37:46 PDT [24946:bdr (6334686760735153516,1,43845,):receive:::1(33885)]DEBUG: 00000: received replication command: IDENTIFY_SYSTEM

2017-07-12 10:37:46 PDT [24946:bdr (6334686760735153516,1,43845,):receive:::1(33885)]LOCATION: exec_replication_command, walsender.c:1309

2017-07-12 10:37:46 PDT [24946:bdr (6334686760735153516,1,43845,):receive:::1(33885)]DEBUG: 08003: unexpected EOF on client connection

2017-07-12 10:37:46 PDT [24946:bdr (6334686760735153516,1,43845,):receive:::1(33885)]LOCATION: SocketBackend, postgres.c:355

2017-07-12 10:37:49 PDT [24949:bdr (6394432535408825526,1,37325,):receive:::1(33892)]DEBUG: 00000: received replication command: IDENTIFY_SYSTEM

2017-07-12 10:37:49 PDT [24949:bdr (6394432535408825526,1,37325,):receive:::1(33892)]LOCATION: exec_replication_command, walsender.c:1309

2017-07-12 10:37:49 PDT [24949:bdr (6394432535408825526,1,37325,):receive:::1(33892)]DEBUG: 08003: unexpected EOF on client connection

2017-07-12 10:37:49 PDT [24949:bdr (6394432535408825526,1,37325,):receive:::1(33892)]LOCATION: SocketBackend, postgres.c:355

What do these entries mean, and what can be done to correct the situation? (Neither the postgres configuration nor the network configuration changed during the remove/rejoin exercise.)

Thanks
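
One way to make sense of a log like the one above is to group the repeating entries by the peer that opened each connection. The sketch below parses the bracketed prefix shown in these lines and counts the "unexpected EOF" disconnects (SQLSTATE 08003) per peer. It assumes that the tuple after `bdr` is the peer's (sysid, timeline, dboid) identity and that the prefix layout matches the lines quoted here; both the regular expression and the function name `eof_counts_by_peer` are illustrative and not part of BDR.

```python
import re
from collections import Counter

# Matches the prefix seen in this thread, for example:
# [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]DEBUG: 08003: ...
LINE_RE = re.compile(
    r"\[(?P<pid>\d+):bdr \((?P<sysid>\d+),(?P<timeline>\d+),(?P<dboid>\d+),[^)]*\)"
    r":(?P<role>\w+):(?P<host>.*?)\((?P<port>\d+)\)\]"
    r"(?P<level>[A-Z]+):\s+(?:(?P<sqlstate>[0-9A-Z]{5}): )?(?P<message>.*)"
)


def eof_counts_by_peer(lines):
    """Count 08003 (unexpected EOF) entries per (sysid, timeline, dboid)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("sqlstate") == "08003":
            peer = (
                match.group("sysid"),
                match.group("timeline"),
                match.group("dboid"),
            )
            counts[peer] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python eof_by_peer.py < postgresql.log
    for peer, count in eof_counts_by_peer(sys.stdin).most_common():
        print(peer, count)
```

If every peer shows up with a steadily growing count, the peers are connecting, issuing IDENTIFY_SYSTEM, and then dropping the connection, which suggests the reason for the failure is recorded in the peers' own logs rather than in this one.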


Re: [GENERAL] BDR node removal and rejoin

From

Craig Ringer

Date:

13 July 2017, 12:58:34

On 13 July 2017 at 01:56, Zhu, Joshua <jzhu@vormetric.com> wrote:

> Thanks for the clarification.
>
> It looks like I am running into a different issue. While trying to pin down the precise steps (and their order) needed to remove and rejoin a node, I repeated the removal/rejoin exercise a number of times, and it got stuck again:
>
> The status of the rejoining node (node4) on the other nodes is “i”.
> The status of the rejoining node on node4 itself started at “i”, changed to “o”, then got stuck there.
> In the log file for node4, the following entries are generated constantly:
>
> 2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> 2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]LOCATION: exec_replication_command, walsender.c:1309
> 2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]DEBUG: 08003: unexpected EOF on client connection
> 2017-07-12 10:37:46 PDT [24943:bdr (6334686800251932108,1,43865,):receive:::1(33883)]LOCATION: SocketBackend, postgres.c:355
> 2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> 2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]LOCATION: exec_replication_command, walsender.c:1309
> 2017-07-12 10:37:46 PDT [24944:bdr (6408408103171110238,1,24713,):receive:::1(33884)]DEBUG: 08003: unexpected EOF on client connection

Check the logs on the other end.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [GENERAL] BDR node removal and rejoin

From

"Zhu, Joshua"

Date:

13 July 2017, 22:09:22

I found these log entries on one of the other nodes:

t=2017-07-13 08:35:34 PDT p=27292 a=DEBUG: 00000: found valid replication identifier 15

t=2017-07-13 08:35:34 PDT p=27292 a=LOCATION: bdr_establish_connection_and_slot, bdr.c:604

t=2017-07-13 08:35:34 PDT p=27292 a=ERROR: 53400: no free replication state could be found for 15, increase max_replication_slots

After I increased max_replication_slots, things are looking good now, thanks.

This does bring up a couple of questions:

Given that this repeated removal/rejoin exercise did not actually increase the number of nodes, yet it used up the replication slots, shouldn’t removing a node also automatically free the replication slot allocated to that node? Or is there a way to manually free slots that are no longer needed? (They don’t seem to show up in the pg_replication_slots view; I made sure to use pg_drop_replication_slot whenever they did show up there.)
If there is such a thing, what is the rule of thumb for the best value of max_replication_slots (and is it somehow related to the value of max_wal_senders as well) with respect to, say, the maximum number of nodes to be supported?

Thanks


Re: [GENERAL] BDR node removal and rejoin

From

Craig Ringer

Date:

14 July 2017, 10:46:55

On 14 July 2017 at 00:09, Zhu, Joshua <jzhu@vormetric.com> wrote:

> I found these log entries on one of the other nodes:
>
> t=2017-07-13 08:35:34 PDT p=27292 a=DEBUG: 00000: found valid replication identifier 15
> t=2017-07-13 08:35:34 PDT p=27292 a=LOCATION: bdr_establish_connection_and_slot, bdr.c:604
> t=2017-07-13 08:35:34 PDT p=27292 a=ERROR: 53400: no free replication state could be found for 15, increase max_replication_slots
>
> After I increased max_replication_slots, things are looking good now, thanks.
>
> This does bring up a couple of questions:
>
> Given that this repeated removal/rejoin exercise did not actually increase the number of nodes, yet it used up the replication slots, shouldn’t removing a node also automatically free the replication slot allocated to that node?

Yes, it should. Open issue. A patch would be welcomed.

> Or is there a way to manually free slots that are no longer needed? (They don’t seem to show up in the pg_replication_slots view; I made sure to use pg_drop_replication_slot whenever they did show up there.)

It'll be complaining about replication identifiers ("origins" in 9.6); see pg_replication_identifier.

> If there is such a thing, what is the rule of thumb for the best value of max_replication_slots (and is it somehow related to the value of max_wal_senders as well) with respect to, say, the maximum number of nodes to be supported?

I think that's covered in the docs, but it's safe to err fairly high. The cost of extra slots is minimal.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
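
As a rough illustration of "err fairly high", the sizing question can be turned into a small calculation. This is a hedged rule of thumb, not a formula from the BDR documentation: it assumes each node needs one slot and one replication identifier per peer per BDR-enabled database, and it multiplies that by a safety factor to leave room for identifiers left behind by remove/rejoin cycles. The function name, the `safety_factor` parameter, and its default of 3 are all hypothetical choices. The same value is suggested for max_wal_senders on the assumption that every peer streaming from a slot occupies one WAL sender.

```python
def suggest_replication_settings(max_nodes, databases=1, safety_factor=3):
    """Return rule-of-thumb values for a node in a BDR group.

    max_nodes:     largest number of nodes the group is meant to reach
    databases:     number of BDR-enabled databases per node
    safety_factor: multiplier for headroom (an arbitrary assumption)
    """
    if max_nodes < 1 or databases < 1 or safety_factor < 1:
        raise ValueError("all arguments must be at least 1")
    peers = max_nodes - 1              # every node talks to all the others
    baseline = peers * databases       # one slot/identifier per peer per database
    suggested = max(baseline * safety_factor, 1)
    return {
        "max_replication_slots": suggested,
        "max_wal_senders": suggested,
    }


if __name__ == "__main__":
    # Five-node group with one replicated database, as in this thread.
    print(suggest_replication_settings(5))
```

For the five-node group discussed here, the baseline is 4 and the sketch suggests 12, which follows the advice that extra slots cost little compared with a join that stalls.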
