Discussion: [HACKERS] Some thoughts about multi-server sync rep configurations


[HACKERS] Some thoughts about multi-server sync rep configurations

From:
Thomas Munro
Date:
Hi,

Sync rep with multiple standbys allows queries run on standbys to see
transactions that haven't been flushed on the configured number of
standbys.  That means that it's susceptible to lost updates or a kind
of "dirty read" in certain cluster reconfiguration scenarios.  To
close that gap, we would need to introduce extra communication so that
standbys wait for flushes on other servers before a snapshot can be
used, in certain circumstances.  That doesn't sound like a popular
performance feature, and I don't have a concrete proposal for that,
but I wanted to raise the topic for discussion to see if others have
thought about it.

I speculate that there are three things that need to be aligned for
multi-server sync rep to be able to make complete guarantees about
durability, and pass the kind of testing that someone like Aphyr of
Jepsen[1] would throw at it.  Of course you might argue that the
guarantees about reconfiguration that I'm assuming in the following
are not explicitly made anywhere, and I'd be very interested to hear
what others have to say about them.  Suppose you have K servers and
you decide that you want to be able to lose N servers without data
loss.  Then as far as I can see the three things are:

1.  While automatic replication cluster management is not part of
Postgres, you must have a manual or automatic procedure with two
important properties in order for synchronous replication to be able
to reconfigure without data loss.  At cluster reconfiguration time on
the loss of the primary, you must be able to contact at least K - N
servers and of those you must promote the server that has the highest
LSN, otherwise there is no way to know that the latest successfully
committed transaction is present in the new timeline.  Furthermore,
you must be able to contact more than K / 2 servers (a majority) to
avoid split-brain syndrome; a small sketch of this promotion rule
follows after point 3.

2.  The primary must wait for N standby servers to acknowledge
flushing before returning from commit, as we do.

3.  No server must allow a transaction to be visible that hasn't been
flushed on N standby servers.  We already prevent that on the primary,
but not on standbys.  You might see a transaction on a given standby,
then lose that standby and the primary, and then a new primary might
be elected that doesn't have that transaction.  We don't have this
problem if you only run queries on the primary, and we don't have it
on single-standby configurations, i.e. K = 2 and N = 1.  But as soon
as K > 2 and N > 1, we can have the problem on standbys.
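
To make the arithmetic in point 1 concrete, here is a minimal sketch
of the promotion decision (plain Python; the node names, LSN values
and helper are invented for illustration and are not part of any
existing failover tool): refuse to promote unless at least K - N
servers and a strict majority are reachable, then pick the reachable
server with the highest flushed LSN.

    # Hypothetical sketch only; the node list and LSN values are made up
    # and stand in for whatever your failover tooling actually collects.

    def choose_new_primary(reachable_lsns, k, n):
        """reachable_lsns maps server name -> highest flushed LSN (int)
        for the servers contacted after losing the old primary."""
        # Must reach at least K - N servers, otherwise the server holding
        # the latest successfully committed transaction might be missing.
        if len(reachable_lsns) < k - n:
            raise RuntimeError("cannot prove the latest commit is present")
        # Must also reach a strict majority, otherwise the other partition
        # could promote its own primary (split brain).
        if len(reachable_lsns) <= k // 2:
            raise RuntimeError("no majority reachable; refusing to promote")
        # Promote the server that has flushed the most WAL.
        return max(reachable_lsns, key=reachable_lsns.get)

    # Example: K = 5, N = 2 -> must reach at least 3 servers, which also
    # happens to be a majority of 5.
    survivors = {"standby1": 0x3000120, "standby2": 0x3000208, "standby3": 0x30001F0}
    print(choose_new_primary(survivors, k=5, n=2))   # -> standby2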

Example:

You have 2 servers in London, 2 in New York and 2 in Tokyo.  You
enable synchronous replication with N = 3.  A transaction updates a
record and commits locally on host "london1", and begins waiting for 3
servers to respond.  A network fault prevents messages from London
reaching the other two data centres, because the rack is on fire.  But
"london2" receives and applies the WAL.  Now another session sees this
transaction on "london2" and reports a fact that it represents to a
customer.  Finally failover software or humans determine that it's
time to promote a new primary server in Tokyo.  The fact reported to
the customer has evaporated; that was a kind of "dirty read" that
might have consequences for your business.
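
A toy walk-through of that timeline (plain Python; the Node class,
LSNs and xid are invented purely to illustrate the ordering, this is
not PostgreSQL code) shows the race: london2 exposes the commit to
readers before the primary has collected its three acknowledgements,
so the server that gets promoted never had it.

    # Invented toy model of the London / New York / Tokyo scenario above.

    class Node:
        def __init__(self, name):
            self.name = name
            self.flushed_lsn = 100      # everyone starts in sync
            self.visible = set()        # xids visible to read-only queries

    standbys = {n: Node(n) for n in
                ["london2", "newyork1", "newyork2", "tokyo1", "tokyo2"]}

    # The primary london1 writes the commit record for xid 42 at LSN 200
    # and starts waiting for N = 3 standby flush acknowledgements.
    commit_lsn, xid = 200, 42

    # The rack fire: only london2 ever receives this WAL, so at most one
    # acknowledgement can arrive and the commit never returns on london1.
    s = standbys["london2"]
    s.flushed_lsn = commit_lsn
    s.visible.add(xid)                  # ...but readers on london2 see it now

    # A read-only session on london2 already sees xid 42 and reports it.
    assert xid in standbys["london2"].visible

    # Failover: London is lost; the four reachable servers elect the one
    # with the highest flushed LSN, per the rule in point 1.
    survivors = [standbys[n] for n in ["newyork1", "newyork2", "tokyo1", "tokyo2"]]
    new_primary = max(survivors, key=lambda node: node.flushed_lsn)

    # The fact reported to the customer has evaporated.
    assert xid not in new_primary.visible
    print(f"promoted {new_primary.name}; xid {xid} seen on london2 is gone")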

[1] https://aphyr.com/tags/jepsen

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Some thoughts about multi-server sync rep configurations

From:
Craig Ringer
Date:
On 28 December 2016 at 08:14, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

> 3.  No server must allow a transaction to be visible that hasn't been
> flushed on N standby servers.  We already prevent that on the primary

Only if the primary doesn't restart. We don't persist the xact masking
used by sync rep at the moment.

I suspect that solving that is probably tied to solving it on standbys.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Some thoughts about multi-server sync rep configurations

From:
Thomas Munro
Date:
On Wed, Dec 28, 2016 at 4:21 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 28 December 2016 at 08:14, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>
>> 3.  No server must allow a transaction to be visible that hasn't been
>> flushed on N standby servers.  We already prevent that on the primary
>
> Only if the primary doesn't restart. We don't persist the xact masking
> used by sync rep at the moment.

Right.  Maybe you could fix that gap by making the primary wait, after
recovery and before allowing any queries to run, until the rule in
synchronous_standby_names would be satisfied at the most conservative
possible synchronous_commit level (remote_apply)?
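
Something like this, hand-waving over all the details (a runnable
Python sketch of the idea only; collect_standby_apply_lsns is a made-up
stand-in for the apply LSNs reported in standby feedback, nothing here
is actual PostgreSQL code):

    import time

    def collect_standby_apply_lsns():
        # Stand-in for the apply LSNs the standbys report in their
        # feedback messages; hard-coded here just to make this runnable.
        return {"s1": 0x5000100, "s2": 0x5000100, "s3": 0x50000A0}

    def gate_connections_after_recovery(end_of_recovery_lsn, num_sync, poll=1.0):
        """Refuse to open for queries until at least num_sync standbys have
        applied everything up to the end of local recovery, i.e. until the
        synchronous_standby_names rule would hold at remote_apply level."""
        while True:
            applied = collect_standby_apply_lsns()
            if sum(lsn >= end_of_recovery_lsn for lsn in applied.values()) >= num_sync:
                return                  # now it's safe to allow snapshots
            time.sleep(poll)

    gate_connections_after_recovery(end_of_recovery_lsn=0x5000100, num_sync=2)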

> I suspect that solving that is probably tied to solving it on standbys.

Hmm.  I was imagining that for standbys it might involve extra
messages flowing from the primary carrying the consensus write and
flush LSN locations (since there isn't any other kind of inter-node
communication possible today), and then somehow teaching the standby
to see only the transactions whose commit record is <= the latest
consensus commit LSN (precisely, and no more!) when acquiring a
snapshot, if K > 2 and N > 1 and you have synchronous_commit set to a
level >= remote_write on the standby.

That could be done by simply waiting for the consensus write or flush
LSN (as appropriate) to be applied before taking a snapshot, but aside
from complicated interlocking requirements, that would slow down
snapshot acquisition unacceptably on write-heavy systems.

Another way to do it could be to maintain separate versions of the
snapshot data somehow for each synchronous_commit level on standbys,
so that you can quickly get your hands on a snapshot that can only see
xids whose commit record was <= the consensus write or flush location,
as appropriate.  That interacts weirdly with synchronous_commit =
remote_apply on the primary, though, because (assuming we want it to
be useful) it needs to wait until the LSN is applied on the standby(s)
*and* they can see it in this weird new time-delayed snapshot thing;
perhaps it would require a new level, remote_apply_consensus_flush,
which waits for the standby(s) to apply the commit and also to know
that the transaction has been flushed on enough nodes to allow it to
be seen...
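
To illustrate just the visibility-filter part of the first idea (the
consensus LSN feed, the xid -> commit LSN map and everything else in
this little Python sketch are invented, and none of the interlocking
or performance problems above are addressed):

    # Toy "time-delayed" standby snapshot: only transactions whose commit
    # record LSN is <= the consensus flush LSN from the primary are visible.

    def take_delayed_snapshot(applied_commits, consensus_flush_lsn):
        """applied_commits maps xid -> commit record LSN for transactions
        already applied locally; returns the xids a new snapshot may see."""
        return {xid for xid, lsn in applied_commits.items()
                if lsn <= consensus_flush_lsn}

    # Three commits are applied locally, but the primary's consensus
    # messages say only WAL up to LSN 250 is flushed on enough nodes.
    applied = {101: 180, 102: 240, 103: 300}
    print(take_delayed_snapshot(applied, consensus_flush_lsn=250))
    # -> xids 101 and 102; 103 stays hidden until the consensus catches up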

-- 
Thomas Munro
http://www.enterprisedb.com