Discussion: How to Qualifying or quantify risk of loss in asynchronous replication

How to Qualifying or quantify risk of loss in asynchronous replication

From
otheus uibk
Date:
I've been working with PG 9.1.8 for two years now, mainly with asynchronous replication. Recently, an IT admin of another group contended that PG's asynchronous replication can result in loss of data in a 1-node failure. After re-reading the documentation, I cannot determine to what extent this is true.

Back in 2008, Robert Haas made this post http://postgresql.nabble.com/Sync-Rep-First-Thoughts-on-Code-tp1998339p1998433.html in which he distinguishes between different levels of replication. 1-safe is guaranteed with PG WALs. Other possibilities include group-safe, both group-safe and 1-safe, and 2-safe.

How do we qualify PG when WALs are written (and archived) on the master, and streaming replication to a hot standby, albeit asynchronous, is used? Is it Group-safe?

My understanding is "Strictly speaking, No". 

But what precisely is the algorithm and timing involved with streaming WALs? 

Is it:
  * client issues COMMIT
  * master receives commit
  * master processes transaction internally
  * master creates WAL record
  | master appends WAL to local WAL file, flushes disk
  | master sends WAL record to all streaming clients
  * master sends OK to client
  * master applies WAL

So is this correct? Is it correct to say: PG async guarantees that the WAL is *sent* to the receivers, but not that it is received, before the client receives acknowledgement?

Can we make a case stronger than that? Assuming T0 is the round-trip time between master and client, and T1 is the round-trip time between master and slave, then as long as T1 <= T0, and provided both slave and master do not fail, is the system Group-safe?




--

Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
Thomas Munro
Date:
On Wed, Mar 16, 2016 at 6:26 AM, otheus uibk <otheus.uibk@gmail.com> wrote:
> I've been working with PG 9.1.8 for two years now, mainly asynchronous
> replication. Recently, an IT admin of another group contested that the PG's
> asynchronous replication can result in loss of data in a 1-node failure.
> After re-readinG the documentation, I cannot determine to what extent this
> is true.

It is true.  If the primary server is destroyed by a meteor, it is
entirely possible for recently written WAL records to be lost, because
they haven't even been sent to an asynchronous standby node yet, let
alone written.

> Back in 2008, Robert Haas made this post
> http://postgresql.nabble.com/Sync-Rep-First-Thoughts-on-Code-tp1998339p1998433.html
> in which he delineates between different levels of replication. 1-safe is
> guaranteed with PG WALs. Other possibilities include group-safe, both group
> safe and 1-safe, 2-safe.
>
> How do we qualify PG when WALs are written (and archived) on the master, and
> streaming replication to a hot standby, albeit asynchronous, is used? Is it
> Group-safe?
>
> My understanding is "Strictly speaking, No".

No.  There is no guarantee that any other node knows about your transaction.

> But what precisely is the algorithm and timing involved with streaming WALs?
>
> Is it:
>   * client issues COMMIT
>   * master receives commit
>   * master processes transaction internally
>   * master creates WAL record
>   | master appends WAL to local WAL file, flushes disk
>   | master sends WAL record to all streaming clients
>   * master sends OK to client
>   * master applies WAL
>
> So is this correct? Is it correct to say: PG async guarantees that the WAL
> is *sent* to the receivers, but not that they are received, before the
> client receives acknowledgement?

Async replication doesn't guarantee anything at all about receivers,
or even that there is one right at this moment.  Did you mean to write
"synchronous" instead of "asynchronous"?  In asynchronous replication,
the primary writes to the WAL and flushes the disk.  Then, for any
standbys that happen to be connected, a WAL sender process trundles
along behind feeding new WAL down the socket as soon as it can, but
it can be running arbitrarily far behind or not running at all (the
network could be down or saturated, the standby could be temporarily
down or up but not reading the stream fast enough, etc etc).

> Can we make a case stronger than that? Assuming the T0 is round trip time
> between master and client, and T1 is round trip time between master and
> slave, as long as T1 <= T0, and provided both Slave and Master do not fail,
> the system is Group-safe?

You might use that kind of thinking to reason about the probability
that a transaction has reached the standby in async mode by the time
your client gets a commit response back, but it's not any kind of
useful guarantee, and in any case only applies if your standby is
currently connected and keeping up.  That's why we have synchronous
replication.
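
A minimal sketch of that kind of probabilistic reasoning, under a deliberately simplified and hypothetical model: the acknowledgement reaches the client half a round trip (T0 / 2) after the local flush, while the WAL reaches the standby after a random walsender delay plus half a round trip (T1 / 2). The function name, the uniform delay distribution and all parameters are assumptions made for illustration; none of this is measured PostgreSQL behaviour.

```python
import random


def fraction_replicated_before_ack(t0_ms, t1_ms, sender_delay_ms,
                                   trials=10_000, seed=1):
    """Estimate how often the standby already has the commit record
    when the client sees the acknowledgement.

    Hypothetical model, not PostgreSQL internals:
      - the ack arrives at the client t0_ms / 2 after the local flush;
      - the WAL arrives at the standby after a uniformly random walsender
        delay in [0, sender_delay_ms] plus t1_ms / 2.
    """
    rng = random.Random(seed)
    ack_arrival = t0_ms / 2
    hits = 0
    for _ in range(trials):
        wal_arrival = rng.uniform(0, sender_delay_ms) + t1_ms / 2
        if wal_arrival <= ack_arrival:
            hits += 1
    return hits / trials
```

Only when the sender delay is assumed to be zero does T1 <= T0 make the fraction 1.0; any lag in the walsender turns it into a probability below 1, which is why this is an estimate rather than a guarantee.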

If you turn on synchronous replication (using the
synchronous_standby_names GUC which takes a list of standby
'application' names, or * for any) then you can stop COMMIT from
returning until one standby from that list has written out the WAL
record.  (In future we will probably support more than one).  That
works using the same approach as asynchronous replication, except that
there is a wait inserted into the committing transaction: it waits for
the current synchronous standby to report back that it has processed
the commit record.  There are two levels: synchronous_commit =
remote_write, meaning that the chosen standby has written the WAL but
not necessarily flushed it to disk yet (I think this may be called
"1-safe and group safe" in the terminology you referenced: it's
flushed locally AND betting that the other machine(s) won't (all)
crash), and synchronous_commit = on, meaning that the chosen standby
has written it and actually flushed it to disk (with fsync, fdatasync etc,
"2-safe").  The former might be faster, but could lose writes that are
in the OS page cache on the standby if power is lost before those
pages eventually hit the disk, so "on" is probably what most people
mean when they talk about synchronous replication.  Asynchronous
replication doesn't wait for anything except the local disk (so it is
"1-safe").
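
The three levels described above can be summarised in a small sketch. The safety labels are the ones used in this thread; the key "async" is a hypothetical shorthand for "no synchronous standby configured" rather than a real setting value, and the function is an illustrative model of the guarantees, not PostgreSQL code.

```python
# What COMMIT waits for in each mode, as described in the message above.
WAITS_FOR = {
    "async": "local flush only",
    "remote_write": "local flush, plus write (no flush) on the synchronous standby",
    "on": "local flush, plus flush to disk on the synchronous standby",
}

# Labels as used in this thread, not an official classification.
SAFETY_LABEL = {
    "async": "1-safe",
    "remote_write": "1-safe and group-safe",
    "on": "2-safe",
}


def acknowledged_commit_survives(mode, primary_destroyed, standby_lost_power):
    """Return True if an acknowledged commit is guaranteed to survive.

    Hypothetical model: "standby_lost_power" means the standby lost power
    before its OS page cache reached the disk.
    """
    if mode not in SAFETY_LABEL:
        raise ValueError(f"unknown mode: {mode!r}")
    if not primary_destroyed:
        return True   # the commit record was flushed to the primary's disk
    if mode == "on":
        return True   # the standby flushed the record before COMMIT returned
    if mode == "remote_write":
        return not standby_lost_power  # record may only be in the page cache
    return False      # async: the record may never have left the primary
```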

Waiting for the transaction to be durably stored (flushed to disk) on
two servers before COMMIT returns means that you can avoid this
situation:

1.  You commit a transaction, and COMMIT returns as soon as the WAL is
flushed to disk on the primary.
2.  You communicate a fact based on that transaction to a third party
("Thank you Dr Bowman, you are booked in seat A4, your reservation
number is JUPITER123").
3.  Your primary computer is destroyed by a meteor, and its WAL sender
hadn't yet got around to sending that transaction to the standby.
4.  You recover using the standby.
5.  The transaction has been forgotten ("I'm sorry Dave, I'm afraid we
have no record of booking JUPITER123, and the rocket is full.  The
next rocket leaves in 7 years, would you like to book a seat?").

If you enable synchronous replication, and you are careful to recover
in step 4 using the correct standby, then you can't lose a transaction
that you reported to external systems *after* (because) COMMIT
returned.  If your primary is destroyed after you executed COMMIT, but
*before* it returned, it is possible that the current synchronous
standby's WAL contains the transaction or doesn't contain the
transaction, but not for you to have taken any external action based
on the commit having returned, because it didn't.  (If your primary
crashes and restarts before COMMIT returns, and it had got as far as
flushing locally but not yet heard from the standby, then things may
be slightly more complicated).
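
The booking scenario can be reproduced with a toy model. All class and function names here are hypothetical, and the "synchronous" flag simply ships the record before returning; it is a sketch of the ordering argument only, not of how PostgreSQL implements either mode.

```python
class Standby:
    def __init__(self):
        self.wal = []              # records durably stored on the standby


class Primary:
    def __init__(self, standby, synchronous):
        self.wal = []              # records flushed to the primary's disk
        self.sent_upto = 0         # how far the WAL sender has got
        self.standby = standby
        self.synchronous = synchronous

    def ship_pending(self):
        """What the WAL sender does whenever it gets around to it."""
        self.standby.wal.extend(self.wal[self.sent_upto:])
        self.sent_upto = len(self.wal)

    def commit(self, record):
        self.wal.append(record)    # local WAL flush
        if self.synchronous:
            self.ship_pending()    # wait until the standby has the record
        return "COMMIT"            # acknowledgement to the client


def book_seat_then_meteor(synchronous):
    """Steps 1-5 above: commit, tell the customer, lose the primary, fail over."""
    standby = Standby()
    primary = Primary(standby, synchronous)
    ack = primary.commit("booking JUPITER123")
    told_customer = ack == "COMMIT"
    del primary                    # the meteor: nothing more is ever sent
    booking_known_after_failover = "booking JUPITER123" in standby.wal
    return told_customer, booking_known_after_failover
```

In this model `book_seat_then_meteor(False)` tells the customer about a booking the standby has never heard of, while `book_seat_then_meteor(True)` cannot, because COMMIT does not return until the record is on the standby.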

--
Thomas Munro
http://www.enterprisedb.com


Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
otheus uibk
Date:
Thomas, thanks for your input... But I'm not quite getting the answer I need....

>> But what precisely is the algorithm and timing involved with streaming WALs?
>>
>> Is it:
>>   * client issues COMMIT
>>   * master receives commit
>>   * master processes transaction internally
>>   * master creates WAL record
>>   | master appends WAL to local WAL file, flushes disk
>>   | master sends WAL record to all streaming clients
>>   * master sends OK to client
>>   * master applies WAL
>>
>> So is this correct? Is it correct to say: PG async guarantees that the WAL
>> is *sent* to the receivers, but not that they are received, before the
>> client receives acknowledgement?
>
> Async replication doesn't guarantee anything at all about receivers,
> or even that there is one right at this moment.  Did you mean to write
> "synchronous" instead of "asynchronous"?

I'm only concerned with async (for this thread). 


> In asynchronous replication,
> the primary writes to the WAL and flushes the disk.  Then, for any
> standbys that happen to be connected, a WAL sender process trundles
> along behind feeding new WAL down the socket as soon as it can, but
> it can be running arbitrarily far behind or not running at all (the
> network could be down or saturated, the standby could be temporarily
> down or up but not reading the stream fast enough, etc etc).


This is the *process* I want more detail about. The question is the same as above:
> (is it true that) PG async guarantees that the WAL
> is *sent* to the receivers, but not that they are received, before the
> client receives acknowledgement?


But I will refine what I mean by "sent": does PostgreSQL write the WAL to the socket and flush the socket before acknowledging the transaction to the client? Does it *always* do this? Or does it make a best effort? Or do the write to the socket and the return to the client happen asynchronously?

I realize that the data might not be *seen* at the client, I realize network buffers may take time to reach the network, and I realize various levels of synchronous replication provide higher guarantees. But for the purposes of this topic, I'm interested to know what PG actually does. I can't tell that from the documentation (because it is not clearly stated and because it is self-contradictory).



 


--

Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
otheus uibk
Date:
Apologies for the double-reply... This is to point out the ambiguity between the example you gave and stated documentation.

On Wednesday, March 16, 2016, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> Waiting for the transaction to be durably stored (flushed to disk) on
> two servers before COMMIT returns means that you can avoid this
> situation:
>
> 1.  You commit a transaction, and COMMIT returns as soon as the WAL is
> flushed to disk on the primary.
> 2.  You communicate a fact based on that transaction to a third party
> ("Thank you Dr Bowman, you are booked in seat A4, your reservation
> number is JUPITER123").
> 3.  Your primary computer is destroyed by a meteor, and its WAL sender
> hadn't yet got around to sending that transaction to the standby.

Section 25.2.5. "The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled."

This suggests that the record is on the network stack possibly before a flush to disk.

 Section 25.2.6 "If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover." 

Whence this replication delay? If the standby server is caught up and streaming asynchronously, what delays *in receiving* might there be other than network delays? 

Note: I am totally unconcerned with the possibility that both primary and standby go down at the same time. 


--

Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
Thomas Munro
Date:
On Wed, Mar 16, 2016 at 9:59 PM, otheus uibk <otheus.uibk@gmail.com> wrote:
>> In asynchronous replication,
>> the primary writes to the WAL and flushes the disk.  Then, for any
>> standbys that happen to be connected, a WAL sender process trundles
>> along behind feeding new WAL down the socket as soon as it can, but
>> it can be running arbitrarily far behind or not running at all (the
>> network could be down or saturated, the standby could be temporarily
>> down or up but not reading the stream fast enough, etc etc).
>
>
>
> This is the *process* I want more detail about. The question is the same as
> above:
>> (is it true that) PG async guarantees that the WAL
>> is *sent* to the receivers, but not that they are received, before the
>> client receives acknowledgement?

The primary writes WAL to disk, and then wakes up walsender processes,
and they read the WAL from disk (presumably straight out of the OS
page cache) in the background and send it down the network some time
later.  Async replication doesn't guarantee anything about the WAL
being sent.

Look for WalSndWakeupRequest() in xlog.c, which expands to a call to
WalSndWakeup in walsender.c which sets latches (= a mechanism for
waking processes) on all walsenders, and see the WaitLatchOrSocket
calls in walsender.c which wait for that to happen.
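
The wake-up pattern described here can be sketched with a `threading.Event` standing in for the latch. This is a hypothetical toy in Python, not a transcription of xlog.c or walsender.c; the class and method names are assumptions. What it illustrates is that `commit()` only sets the latch and returns, without waiting for the sender to do anything.

```python
import threading


class ToyPrimary:
    def __init__(self):
        self.flushed = []               # WAL records flushed locally
        self.standby = []               # records the standby has received
        self.latch = threading.Event()  # stands in for the walsender latch
        self.stop = threading.Event()
        self.lock = threading.Lock()

    def commit(self, record):
        with self.lock:
            self.flushed.append(record)  # write and flush WAL locally
        self.latch.set()                 # wake the walsender, do not wait
        return "COMMIT"                  # ack goes back to the client now

    def walsender(self):
        """Background loop: sleep on the latch, then ship whatever is new."""
        sent = 0
        while True:
            self.latch.wait()
            self.latch.clear()
            stopping = self.stop.is_set()
            with self.lock:
                pending = self.flushed[sent:]
            self.standby.extend(pending)  # "send it down the network"
            sent += len(pending)
            if stopping:
                return

    def shutdown(self):
        self.stop.set()
        self.latch.set()                  # wake the sender so it can exit
```

If the walsender thread is started only after several commits have returned, the standby list is still empty at that point, which mirrors the statement that async replication guarantees nothing about the WAL having been sent.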

--
Thomas Munro
http://www.enterprisedb.com


Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
otheus uibk
Date:
On Wednesday, March 16, 2016, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> In asynchronous replication, the primary writes to the WAL and flushes the disk. Then, for any standbys that happen to be connected, a WAL sender process trundles along behind feeding new WAL down the socket as soon as it can, but it can be running arbitrarily far behind or not running at all (the network could be down or saturated, the standby could be temporarily down or up but not reading the stream fast enough, etc etc).

Thanks for your help on finding the code. To be more precise, in the 9.1.8 code, I see this:

 1. [backend] WAL is flushed to disk
 2. [backend] WAL-senders are sent SIGUSR1 to wake up
 3. [backend] wait for responses from other SyncRep-Receiver, effectively skipped if none
     [wal-sender] wakes up
 4. [backend] end-of-xact cycle
     [wal-sender] reads WAL (XLogRead) up to MAX_SEND_SIZE (or less) bytes
 5. [backend] ? is there an ACK sent to client?
     [wal-sender] sends chunk to WAL-receiver using the pq_putmessage_noblock call
 6. [wal-sender] repeats reading-sending loop

So if the WAL record is bigger than whatever MAX_SEND_SIZE is (in my source, I see 8k * 16 = 128 kB, so roughly 1 Mbit), the WAL sender may end up sleeping (between iterations of 5 and 6).
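
The arithmetic and the read-send loop in steps 4 to 6 can be sketched as follows. The constant mirrors the 8k * 16 figure above, assuming the default 8 kB WAL block size; `send_chunks` is a hypothetical helper that only models the chunk sizes, not the real XLogRead or network calls.

```python
XLOG_BLCKSZ = 8192                # assumed default WAL block size, in bytes
MAX_SEND_SIZE = XLOG_BLCKSZ * 16  # 131072 bytes = 128 kB, roughly 1 Mbit


def send_chunks(backlog_bytes, max_send_size=MAX_SEND_SIZE):
    """Yield the size of each message a sender loop would emit for a backlog."""
    remaining = backlog_bytes
    while remaining > 0:
        chunk = min(remaining, max_send_size)
        yield chunk               # one read-and-send iteration
        remaining -= chunk
```

For example, a backlog of 300000 bytes takes three iterations under these assumptions: two full 131072-byte chunks followed by one of 37856 bytes.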

On Wed, Mar 16, 2016 at 10:21 AM, otheus uibk <otheus.uibk@gmail.com> wrote:
> Section 25.2.5. "The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled."
> Section 25.2.6 "If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover."

Both these statements, then, from the documentation perspective, are incorrect, at least to a pedant. For 25.2.5, the primary streams WAL records to the standby after they've been flushed to disk but without waiting for the file to be filled. For 25.2.6 it's not clear: it is possible for some transactions to have been *written* to the local WAL and reported as committed but not yet *sent* to the standby server.

Somehow, the documentation misleads (me) to believe the async replication algorithm at least guarantees WAL records are *sent* before responding "committed" to the client. I now know this is not the case. *grumble*. 

How can I help make the documentation clearer on this point?

--

Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
Andrew Sullivan
Date:
On Wed, Mar 16, 2016 at 10:40:03PM +0100, otheus uibk wrote:
> Somehow, the documentation misleads (me) to believe the async replication
> algorithm at least guarantees WAL records are *sent* before responding
> "committed" to the client. I now know this is not the case. *grumble*.
>
> How can I help make the documentation clearer on this point?

Well, I never had the understanding you apparently do, but you're
right that it's important to be clear.  If there were an additional
sentence, "Note that, in any available async option, the client can
receive a message that data is committed before any replication of the
data has commenced," would that help?

Best regards,

A

--
Andrew Sullivan
ajs@crankycanuck.ca


Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
Adrian Klaver
Date:
On 03/16/2016 02:40 PM, otheus uibk wrote:
> On Wednesday, March 16, 2016, Thomas Munro

>
> Somehow, the documentation misleads (me) to believe the async
> replication algorithm at least guarantees WAL records are *sent* before
> responding "committed" to the client. I now know this is not the case.
> *grumble*.
>
> How can I help make the documentation clearer on this point?

I thought it was already clear:

http://www.postgresql.org/docs/9.4/interactive/warm-standby.html
"It should be noted that log shipping is asynchronous, i.e., the WAL
records are shipped after transaction commit. As a result, there is a
window for data loss should the primary server suffer a catastrophic
failure; transactions not yet shipped will be lost. ..."

http://www.postgresql.org/docs/9.4/interactive/warm-standby.html#STREAMING-REPLICATION

"Streaming replication is asynchronous by default (see Section 25.2.8),
in which case there is a small delay between committing a transaction in
the primary and the changes becoming visible in the standby. This delay
is however much smaller than with file-based log shipping, typically
under one second assuming the standby is powerful enough to keep up with
the load. ..."

http://www.postgresql.org/docs/9.4/interactive/warm-standby.html#SYNCHRONOUS-REPLICATION

"PostgreSQL streaming replication is asynchronous by default. If the
primary server crashes then some transactions that were committed may
not have been replicated to the standby server, causing data loss. The
amount of data loss is proportional to the replication delay at the time
of failover.

Synchronous replication offers the ability to confirm that all changes
made by a transaction have been transferred to one synchronous standby
server. This extends the standard level of durability offered by a
transaction commit. This level of protection is referred to as 2-safe
replication in computer science theory. "

>
> --
> Otheus
> otheus.uibk@gmail.com <mailto:otheus.uibk@gmail.com>
> otheus.shelling@uibk.ac.at <mailto:otheus.shelling@uibk.ac.at>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: How to Qualifying or quantify risk of loss in asynchronous replication

From
otheus uibk
Date:

On Wed, Mar 16, 2016 at 11:51 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> I thought it was already clear:
 
Perhaps "Clarity is in the eye of the beholder". If you are very familiar with the internals and operation of the software, the documentation is clear. It's like hindsight; it's always "20/20".

 
> http://www.postgresql.org/docs/9.4/interactive/warm-standby.html
> "It should be noted that log shipping is asynchronous, i.e., the WAL records are shipped after transaction commit. As a result, there is a window for data loss should the primary server suffer a catastrophic failure; transactions not yet shipped will be lost. ..."

That refers to *log shipping* not streaming.
 
> http://www.postgresql.org/docs/9.4/interactive/warm-standby.html#STREAMING-REPLICATION
>
> "Streaming replication is asynchronous by default (see Section 25.2.8), in which case there is a small delay between committing a transaction in the primary and the changes becoming visible in the standby. This delay is however much smaller than with file-based log shipping, typically under one second assuming the standby is powerful enough to keep up with the load. ..."

Asynchronous to what? The next sentence indicates the delay is relevant to *becoming visible in the standby*. Thus, a message could be received by the standby, but before it is logged to disk, both it and the primary fail. Meanwhile, the client thinks its transaction was committed, but in fact, it was committed to only one side. Thus, it does NOT necessarily imply that "asynchronous" means with respect to the client receiving "transaction complete" acknowledgement. 
 
> http://www.postgresql.org/docs/9.4/interactive/warm-standby.html#SYNCHRONOUS-REPLICATION
>
> "PostgreSQL streaming replication is asynchronous by default. If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover.
>
> Synchronous replication offers the ability to confirm that all changes made by a transaction have been transferred to one synchronous standby server. This extends the standard level of durability offered by a transaction commit. This level of protection is referred to as 2-safe replication in computer science theory. "

Again, an asynchronous mode *could mean* that the WALs are sent before the commit is acknowledged, and that would be consistent with the above statements.