Thread: monitoring bdr nodes

monitoring bdr nodes

From
Dennis
Date:
I need some clarification on how to monitor BDR nodes, in particular determining replication lag. As an example, I
have a two-node cluster with nodes ‘A’ and ‘B’. I need to be able to look at node ‘B’ and determine if it is lagging
behind node ‘A’, by interrogating node ‘B’ only.

From the BDR documentation on monitoring:

SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,
       pid,
       application_name
  FROM pg_stat_replication;

Because it is querying the pg_stat_replication table, I will need to run this query on node ‘A’ to check the lag on
node ‘B’, is that true? I need to be able to run a query on node ‘B’ to determine if node ‘B’ is behind. I am not sure
the above query will work for that use case.




Re: monitoring bdr nodes

From
Craig Ringer
Date:
On 16 April 2015 at 23:58, Dennis <dennisr@visi.com> wrote:
> I need some clarification on how to monitor BDR nodes, in particular determining replication lag. As an example, I
> have a two-node cluster with nodes ‘A’ and ‘B’. I need to be able to look at node ‘B’ and determine if it is lagging
> behind node ‘A’, by interrogating node ‘B’ only.

You can't, that doesn't really make sense - in BDR, or in regular
PostgreSQL streaming replication.

For that to be possible, node 'B' would need some side-channel by
which it found out the current WAL insert position of node 'A'. Which
effectively means communicating in real time with node 'A'... so the
client might as well do it instead. We can't do this effectively on
the walsender stream without some kind of interrupt message that can
be priority-injected into the stream, and even then it wouldn't help
if the issue was packet loss causing connection issues, etc.

If you're in a position where node 'B' can make direct libpq
non-replication connections to 'A' but the client can't, you could use
postgres_fdw to expose a view of node A's
pg_current_xlog_insert_location(), plus the pg_replication_slots and
pg_stat_replication views. That seems a bit of an odd situation to me,
though.

> Because it is querying the pg_stat_replication table, I will need to run this query on node ‘A’ to check the lag on
> node ‘B’, is that true?

Correct. I'll make the docs more explicit about that.
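
For reference, the lag_bytes figure that the documented query returns on node 'A' is just the distance between two WAL positions. A minimal Python sketch of that arithmetic, assuming the usual 'hi/lo' hexadecimal text form of an LSN (the function names are illustrative, not part of PostgreSQL or BDR):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN in 'hi/lo' hex text form into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(upstream_insert_lsn: str, peer_flush_lsn: str) -> int:
    """Bytes of WAL the upstream has inserted that the peer has not flushed yet."""
    return lsn_to_int(upstream_insert_lsn) - lsn_to_int(peer_flush_lsn)
```

Both inputs come from node 'A': its own insert position and the flush position it has on record for 'B'. That is why the figure cannot be produced from 'B' alone.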

> I need to be able to run a query on node ‘B’ to determine if node ‘B’ is behind. I am not sure the above query will
> work for that use case.

It won't, and you really can't.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: monitoring bdr nodes

From
Dennis
Date:
OK thanks. One of the motivations for asking these questions is that we are investigating ways to implement automated
node removal from a VIP pool. We would like the VIP management software (currently a dumb load balancer)
to be able to query the health of a particular node directly and, if that node reports back to the VIP manager
that it is lagging too much, have the VIP manager take the node out of its pool of backend servers.

Currently it appears I will have to query the other nodes in the cluster to determine the replication health
status of a particular node, and figure out a way to send that status back to the VIP manager in a way it can act on.
 

Any suggestions on how to accomplish that would be appreciated.

Dennis




Re: monitoring bdr nodes

From
Craig Ringer
Date:


On 21 April 2015 at 04:21, Dennis <dennisr@visi.com> wrote:
> OK thanks. One of the motivations for asking these questions is that we are investigating ways to implement automated node removal from a VIP pool. We would like the VIP management software (currently a dumb load balancer) to be able to query the health of a particular node directly and, if that node reports back to the VIP manager that it is lagging too much, have the VIP manager take the node out of its pool of backend servers.

The challenge there is that the lagging node may have changes it has not replicated to its peers yet.

If you remove it from the BDR group without letting its downstreams replay up to its current write position, those changes will be "cut away" from the BDR group. They'll still be on the node you removed, but will never be replayed to the rest of the systems.
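
A rough Python sketch of the check this implies before removing a node. It assumes the application has already stopped writing to that node, and that `rows` holds the (lag_bytes, application_name) pairs returned by the pg_stat_replication lag query from the BDR docs, run on the node being removed. The function name is hypothetical:

```python
from typing import Iterable, List, Optional, Tuple


def peers_not_caught_up(rows: Iterable[Tuple[Optional[int], str]]) -> List[str]:
    """Return the peers that have not yet flushed everything this node wrote.

    A lag of None (no flush position reported) is treated as not caught up.
    """
    return [app for lag, app in rows if lag is None or int(lag) > 0]
```

An empty result only covers peers that are currently connected. A disconnected peer has no row in pg_stat_replication, so the caller would also need to compare the number of rows against the number of peers it expects.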

Your strategy is very reasonable for nodes where you only do reads. It's only an issue when every node is an active master accepting writes that all nodes must see.

One possible way to mitigate this would be adding support for synchronous_standby_names = 'all' in PostgreSQL and allowing a node to be switched into sync-write mode, where nothing commits locally until synced to all peers. It would thus be safe to remove the node at any time, even if it's badly lagging behind on its replay from upstream peers. (This would be significant feature development that is not currently targeted for BDR's roadmap).

Another, which is a current development target, is to force a node that's being removed into read-only mode and flush its replication queues before removing it. The read-only mode would preferably only restrict replicated changes, so you could still use TEMPORARY and UNLOGGED tables, etc, thus making it useful for enforcing read-only nodes in horizontal read-scaling use cases. There is no ETA on this planned feature yet.
 
> Currently it appears I will have to query the other nodes in the cluster to determine the replication health status of a particular node, and figure out a way to send that status back to the VIP manager in a way it can act on.

If you're in a design where all nodes are write masters, yes, that is correct.
 
> Any suggestions on how to accomplish that would be appreciated.

Just make direct libpq connections to each node from the monitoring host. You should generally be doing that anyway for your node health monitoring.
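
A minimal Python sketch of that polling loop, under some assumptions: the node names, DSNs and threshold are placeholders, psycopg2 is one possible libpq-based driver, and the query runner is passed in so the logic can be exercised without a database. Each row reported by a node describes a peer that is receiving changes from it, so a node's own lag shows up in the rows of the other nodes:

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# The lag query from the BDR monitoring docs, without the pid column.
LAG_QUERY = """
SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,
       application_name
  FROM pg_stat_replication
"""

Row = Tuple[Optional[int], str]
QueryFn = Callable[[str, str], Iterable[Row]]


def query_with_psycopg2(dsn: str, sql: str) -> List[Row]:
    """Run one query over a plain (non-replication) libpq connection."""
    import psycopg2  # imported lazily; assumed to be installed on the monitoring host

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


def collect_lag(node_dsns: Dict[str, str], run_query: QueryFn) -> Dict[Tuple[str, str], int]:
    """Map (upstream node, peer application_name) to the peer's lag in bytes."""
    lag = {}
    for upstream, dsn in node_dsns.items():
        for lag_bytes, app_name in run_query(dsn, LAG_QUERY):
            if lag_bytes is not None:  # skip peers with no flush position yet
                lag[(upstream, app_name)] = int(lag_bytes)
    return lag


def lagging(lag: Dict[Tuple[str, str], int], max_lag_bytes: int) -> List[Tuple[str, str]]:
    """Return the (upstream, peer) pairs whose lag exceeds the threshold."""
    return sorted(key for key, value in lag.items() if value > max_lag_bytes)
```

In use, something like `lagging(collect_lag(dsns, query_with_psycopg2), 16 * 1024 * 1024)` would give the monitoring host a list it can translate into VIP pool changes. How an application_name maps back to a node name depends on the deployment and is left out here.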

If you can't make inbound connections, do it on a push model, e.g. nsca-ng and Icinga's passive mode. This is something that's routinely done for clients and works well.
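
As a sketch of the push side, the node (or a host next to it) could turn a lag figure into a passive check result line. This assumes the tab-separated host, service, return code, output layout that send_nsca traditionally reads on standard input, so check the client documentation for your nsca-ng version. The host name, service name and thresholds below are made up:

```python
# Standard Nagios/Icinga plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def passive_check_line(host: str, service: str, lag_bytes: int,
                       warn_bytes: int, crit_bytes: int) -> str:
    """Format one passive service check result for a replication lag value."""
    if lag_bytes >= crit_bytes:
        code, label = CRITICAL, "CRITICAL"
    elif lag_bytes >= warn_bytes:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return f"{host}\t{service}\t{code}\t{label}: replication lag {lag_bytes} bytes\n"
```

The resulting line would be piped to the NSCA client on a schedule, and the VIP manager could then act on the service state in Icinga rather than querying PostgreSQL itself.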


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services