Re: Syncrep and improving latency due to WAL throttling

From: Tomas Vondra
Subject: Re: Syncrep and improving latency due to WAL throttling
Date:
Msg-id: 750b20d7-6a8c-4c3f-a4b3-23ad409b0046@enterprisedb.com
In reply to: Re: Syncrep and improving latency due to WAL throttling  (Andres Freund <andres@anarazel.de>)
Responses: Re: Syncrep and improving latency due to WAL throttling  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 11/8/23 07:40, Andres Freund wrote:
> Hi,
> 
> On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
>> scope
>> -----
>> Now, let's talk about scope - what the patch does not aim to do. The
>> patch is explicitly intended for syncrep clusters, not async. There have
>> been proposals to also support throttling for async replicas, logical
>> replication etc. I suppose all of that could be implemented, and I do
>> see the benefit of defining some sort of maximum lag even for async
>> replicas. But the agreement was to focus on the syncrep case, where it's
>> particularly painful, and perhaps extend it in the future.
> 
> Perhaps we should take care to make the configuration extensible in that
> direction in the future?
> 

Yes, if we can come up with a suitable configuration, it could also
cover the other use cases. I don't have a very good idea what to do
about replicas that may not be connected, or that have connected but
need to catch up. IMHO it would be silly to turn this into "almost a
sync rep".

> 
> Hm - is this feature really tied to replication, at all? Pretty much the same
> situation exists without. On an ok-ish local nvme I ran pgbench with 1 client
> and -P1. Guess where I started a VACUUM (on a fully cached table, so no
> continuous WAL flushes):
> 
> progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
> progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
> progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
> progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
> progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
> progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
> progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
> progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
> progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
> progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed
> 
> At 32 clients we go from ~10k to 2.5k, with a full 2s of 0.
> 
> Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the
> "good times" show a delay of ~8kB (note that this includes WAL records that
> are still being inserted). Once the VACUUM runs, it's ~2-3MB.
> 
> The picture with more clients is similar.
> 
> If I instead severely limit the amount of outstanding (but not the amount of
> unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less
> (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32).  Of course
> that's ridiculous and will completely trash performance in many other cases,
> but it shows that limiting the amount of outstanding WAL could help without
> replication as well.  With remote storage, that'd likely be a bigger
> difference.
> 

Yeah, that's an interesting idea. I think the idea of enforcing a
"maximum lag" is fairly general; the difference is which LSN the lag is
measured against. For the syncrep case it was the LSN confirmed by the
replica, while what you described would measure it against either the
flush LSN or the write LSN (which would be the "outstanding" case, I
think).

I guess the remote storage is somewhat similar to the syncrep case, in
that the lag includes some network communication.
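
Just to illustrate what I mean by measuring the lag against different
LSNs, a trivial sketch (the enum/function names are made up, not from
the patch):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    /* what the lag is measured against */
    typedef enum LagReference
    {
        LAG_REF_SYNCREP,    /* LSN confirmed by the sync replica */
        LAG_REF_FLUSH,      /* locally flushed LSN */
        LAG_REF_WRITE       /* locally written ("outstanding") LSN */
    } LagReference;

    static bool
    lag_exceeds_limit(XLogRecPtr insert_lsn, XLogRecPtr syncrep_lsn,
                      XLogRecPtr flush_lsn, XLogRecPtr write_lsn,
                      LagReference ref, uint64_t max_lag_bytes)
    {
        XLogRecPtr  reference;

        switch (ref)
        {
            case LAG_REF_SYNCREP:
                reference = syncrep_lsn;
                break;
            case LAG_REF_FLUSH:
                reference = flush_lsn;
                break;
            default:
                reference = write_lsn;
                break;
        }

        return (insert_lsn - reference) > max_lag_bytes;
    }

    int
    main(void)
    {
        /* e.g. insert at 10MB, sync replica confirmed 2MB, 4MB limit */
        printf("throttle: %d\n",
               lag_exceeds_limit(10 * 1024 * 1024, 2 * 1024 * 1024,
                                 8 * 1024 * 1024, 9 * 1024 * 1024,
                                 LAG_REF_SYNCREP, 4 * 1024 * 1024));
        return 0;
    }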

> 
>> problems
>> --------
>> Now let's talk about some problems - both conceptual and technical
>> (essentially review comments for the patch).
>>
>> 1) The goal of the patch is to limit the impact on latency, but the
>> relationship between WAL amounts and latency may not be linear. But we
>> don't have a good way to predict latency, and WAL lag is the only thing
>> we have, so there's that. Ultimately, it's a best effort.
> 
> It's indeed probably not linear. Realistically, to do better, we probably need
> statistics for the specific system in question - the latency impact will
> differ hugely between different storage/network.
> 

True. I can imagine two ways to measure that.

We could have a standalone tool similar to pg_test_fsync that would
mimic how we write/flush WAL, and measure the latency for different
amounts of flushed data. The DBA would then be responsible for somehow
using this to configure the database (perhaps the tool could calculate
some "optimal" value to flush).

Alternatively, we could collect timings in XLogFlush, tracking the
amount of data flushed and how long it took, and then use that to
calculate the expected latency (e.g. by binning by data size and using
the average latency for each bin). And then use that, somehow.

So you could say - maximum commit latency is 10ms, and the system would
be able to estimate that the maximum amount of unflushed WAL is 256kB,
and it'd enforce this distance.
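
A rough sketch of the bookkeeping I mean (all the structs/names here are
invented, nothing like this exists in the patch):

    #include <stdint.h>
    #include <stdio.h>

    #define NBINS 16    /* bin i covers flushes of up to (8kB << i) bytes */

    typedef struct FlushStats
    {
        uint64_t    count[NBINS];
        double      total_ms[NBINS];
    } FlushStats;

    static int
    size_to_bin(uint64_t bytes)
    {
        int         bin = 0;
        uint64_t    limit = 8192;

        while (bin < NBINS - 1 && bytes > limit)
        {
            limit <<= 1;
            bin++;
        }
        return bin;
    }

    /* would be called from the flush path with (bytes flushed, elapsed ms) */
    static void
    record_flush(FlushStats *stats, uint64_t bytes, double elapsed_ms)
    {
        int     bin = size_to_bin(bytes);

        stats->count[bin]++;
        stats->total_ms[bin] += elapsed_ms;
    }

    /* largest flush size whose average latency still fits the target */
    static uint64_t
    max_unflushed_for_target(const FlushStats *stats, double target_ms)
    {
        uint64_t    best = 0;
        uint64_t    limit = 8192;

        for (int bin = 0; bin < NBINS; bin++, limit <<= 1)
        {
            if (stats->count[bin] == 0)
                continue;
            if (stats->total_ms[bin] / stats->count[bin] <= target_ms)
                best = limit;
        }
        return best;
    }

    int
    main(void)
    {
        FlushStats  stats = {{0}, {0}};

        /* fake samples, just to exercise the code */
        record_flush(&stats, 8192, 1.2);
        record_flush(&stats, 262144, 4.8);
        record_flush(&stats, 2097152, 19.5);

        printf("budget for a 10ms target: %llu bytes\n",
               (unsigned long long) max_unflushed_for_target(&stats, 10.0));
        return 0;
    }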

Still only best effort, no guarantees, of course.

> 
>> 2) The throttling is per backend. That makes it simple, but it means
>> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
>> and with a single backend that works fine - the lag should not exceed
>> the 8MB value. But if there are N backends, the lag could be up to
>> N-times 8MB, I believe. That's a bit annoying, but I guess the only
>> solution would be to have some autovacuum-like cost balancing, with all
>> backends (or at least those running large stuff) doing the checks more
>> often. I'm not sure we want to do that.
> 
> Hm. The average case is likely fine - the throttling of the different backends
> will intersperse and flush more frequently - but the worst case is presumably
> part of the issue here. I wonder if we could deal with this by somehow
> offsetting the points at which backends flush at somehow.
> 

If I understand correctly, you want to ensure the backends start
measuring the WAL from different LSNs, in order to "distribute" them
uniformly within the WAL (and not "coordinate" them, which leads to
higher lag).

I guess we could do that, say by flushing only up to

    hash(pid) % maximum_allowed_lag

I'm not sure that'll really work, especially if the backends are
somewhat naturally "correlated".
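
To spell out what I was thinking of (names made up, just an
illustration, and a real version would avoid a zero/near-zero
threshold):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Per-backend flush threshold derived from the pid, so that backends
     * flush at different points within the allowed lag instead of all
     * flushing at the same boundary.
     */
    static uint64_t
    backend_flush_threshold(uint64_t maximum_allowed_lag, int pid)
    {
        uint64_t    h = (uint64_t) pid * 2654435761u;  /* cheap hash */

        return h % maximum_allowed_lag;
    }

    int
    main(void)
    {
        for (int pid = 1000; pid < 1005; pid++)
            printf("pid %d flushes after %llu bytes\n", pid,
                   (unsigned long long) backend_flush_threshold(1024 * 1024,
                                                                pid));
        return 0;
    }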

Maybe it could work if the backends explicitly coordinated the flushes
to distribute them, but then how is that different from just doing what
the autovacuum-like costing does, in principle?

However, perhaps there's an "adaptive" way to do this - each backend
knows how much WAL it produced since the last flush LSN, and it can
easily measure the actual lag (unflushed WAL). It could compare those
two, estimate what fraction of the lag it's likely responsible for, and
then adjust the "flush distance" based on that.

Imagine we aim for 1MB unflushed WAL, and you have two backends that
happen to execute at the same time. They both generate 1MB of WAL and
hit the throttling code at about the same time. They discover the actual
lag is not the desired 1MB but 2MB, so (requested_lag/actual_lag) = 0.5,
and they'd adjust the flush distance to 1/2MB. And from that point we
know the lag is 1MB even with two backends.

Then one of the backends terminates, and the other backend eventually
hits the 1/2MB limit again, but the desired lag is 1MB, and it doubles
the distance again.

Of course, in practice the behavior would be more complicated, thanks to
backends that generate WAL but don't really hit the threshold.

There'd probably also be some sort of ramp-up, i.e. the backend would
not start with the "full" 1MB limit, but perhaps something lower. We'd
need to be careful to keep it high enough to ignore the OLTP
transactions, though.
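
A rough, self-contained sketch of the adjustment I have in mind - all
the names and constants are made up, the ramp-up is ignored, and the
shared state is hand-waved away:

    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_LAG      (1024 * 1024)   /* desired unflushed WAL, 1MB */
    #define MIN_DISTANCE    (64 * 1024)

    /* per-backend state, would live in backend-local memory */
    typedef struct BackendThrottle
    {
        uint64_t    flush_distance;  /* flush/wait after this much own WAL */
    } BackendThrottle;

    /*
     * Called when a backend has produced 'own_wal' bytes since its last
     * flush and observes 'actual_lag' bytes of globally unflushed WAL.
     * Returns 1 if the backend should flush and wait now, scaling its
     * distance by what fraction of the lag it appears responsible for.
     */
    static int
    maybe_throttle(BackendThrottle *bt, uint64_t own_wal, uint64_t actual_lag)
    {
        if (own_wal < bt->flush_distance)
            return 0;

        if (actual_lag > 0)
        {
            /* lag is twice the target -> halve the distance, and so on */
            double  factor = (double) TARGET_LAG / (double) actual_lag;

            bt->flush_distance = (uint64_t) (bt->flush_distance * factor);

            if (bt->flush_distance < MIN_DISTANCE)
                bt->flush_distance = MIN_DISTANCE;
            if (bt->flush_distance > TARGET_LAG)
                bt->flush_distance = TARGET_LAG;
        }

        return 1;
    }

    int
    main(void)
    {
        /* two backends, each inserted 1MB, so the total lag is 2MB */
        BackendThrottle a = {TARGET_LAG}, b = {TARGET_LAG};

        maybe_throttle(&a, 1024 * 1024, 2 * 1024 * 1024);
        maybe_throttle(&b, 1024 * 1024, 2 * 1024 * 1024);
        printf("distances: %llu / %llu\n",
               (unsigned long long) a.flush_distance,
               (unsigned long long) b.flush_distance);

        /* backend b exits; a alone now only sees its own unflushed WAL */
        maybe_throttle(&a, a.flush_distance, a.flush_distance);
        printf("distance after b exits: %llu\n",
               (unsigned long long) a.flush_distance);
        return 0;
    }

With the numbers from the example above this prints 524288 / 524288 and
then 1048576, i.e. the two backends settle on 1/2MB each and the
remaining one climbs back to the full 1MB once it's alone.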


> I doubt we want to go for something autovacuum balancing like - that doesn't
> seem to work well - but I think we could take the amount of actually unflushed
> WAL into account when deciding whether to throttle. We have the necessary
> state in local memory IIRC. We'd have to be careful to not throttle every
> backend at the same time, or we'll introduce latency penalties that way. But
> what if we scaled synchronous_commit_wal_throttle_threshold depending on the
> amount of unflushed WAL? By still taking backendWalInserted into account, we'd
> avoid throttling everyone at the same time, but still would make throttling
> more aggressive depending on the amount of unflushed/unreplicated WAL.
> 

Oh! Perhaps similar to the adaptive behavior I explained above?

> 
>> 3) The actual throttling (flush and wait for syncrep) happens in
>> ProcessInterrupts(), which mostly works but it has two drawbacks:
>>
>>  * It may not happen "early enough" if the backends inserts a lot of
>> XLOG records without processing interrupts in between.
> 
> Does such code exist? And if so, is there a reason not to fix said code?
> 

Not sure. I thought maybe index builds might do something like that, but
it doesn't seem to be the case (at least for the built-in indexes). But
if adding CHECK_FOR_INTERRUPTS to more places is an acceptable fix, I'm
OK with that.

> 
>>  * It may happen "too early" if the backend inserts enough WAL to need
>> throttling (i.e. sets XLogDelayPending), but then after processing
>> interrupts it would be busy with other stuff, not inserting more WAL.
> 
>> I think ideally we'd do the throttling right before inserting the next
>> XLOG record, but there's no convenient place, I think. We'd need to
>> annotate a lot of places, etc. So maybe ProcessInterrupts() is a
>> reasonable approximation.
> 
> Yea, I think there's no way to do that with reasonable effort. Starting to
> wait with a bunch of lwlocks held would obviously be bad.
> 

OK.

> 
>> We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but
>> that seems reasonable.
> 
> And independently beneficial.
> 

OK.

> 
>> missing pieces
>> --------------
>> The thing that's missing is that some processes (like aggressive
>> anti-wraparound autovacuum) should not be throttled. If people set the
>> GUC in the postgresql.conf, I guess that'll affect those processes too,
>> so I guess we should explicitly reset the GUC for those processes. I
>> wonder if there are other cases that should not be throttled.
> 
> Hm, that's a bit hairy. If we just exempt it we'll actually slow down everyone
> else even further, even though the goal of the feature might be the opposite.
> I don't think that's warranted for anti-wraparound vacuums - they're normal. I
> think failsafe vacuums are a different story - there we really just don't care
> about impacting other backends, the goal is to prevent moving the cluster to
> read only pretty soon.
> 

Right, I confused those two autovacuum modes.

> 
>> tangents
>> --------
>> While discussing this with Andres a while ago, he mentioned a somewhat
>> orthogonal idea - sending unflushed data to the replica.
>>
>> We currently never send unflushed data to the replica, which makes sense
>> because this data is not durable and if the primary crashes/restarts,
>> this data will disappear. But it also means there may be a fairly large
>> chunk of WAL data that we may need to send at COMMIT and wait for the
>> confirmation.
>>
>> He suggested we might actually send the data to the replica, but the
>> replica would know this data is not flushed yet and so would not do the
>> recovery etc. And at commit we could just send a request to flush,
>> without having to transfer the data at that moment.
>>
>> I don't have a very good intuition about how large the effect would be,
>> i.e. how much unflushed WAL data could accumulate on the primary
>> (kilobytes/megabytes?),
> 
> Obviously heavily depends on the workloads. If you have anything with bulk
> writes it can be many megabytes.
> 
> 
>> and how big is the difference between sending a couple kilobytes or just a
>> request to flush.
> 
> Obviously heavily depends on the network...
> 

I know it depends on workload/network. I'm merely saying I don't have a
very good intuition about what value would be suitable for a particular
workload / network.

> 
> I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
> network (albeit with a crappy external card for my laptop), to put some
> numbers to this. I used -r $s,100 to test sending a variable sized data to the
> other size, with the other side always responding with 100 bytes (assuming
> that'd more than fit a feedback response).
> 
> Command:
> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;
> for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done
> 
> 10gbe:
> 
> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> 10              100             43              64.30           390             96              15526.084
> 100             100             57              75.12           428             122             13286.602
> 1000            100             47              74.41           270             108             13412.125
> 10000           100             89              114.63          712             152             8700.643
> 100000          100             167             255.90          584             312             3903.516
> 1000000         100             891             1015.99         2470            1143            983.708
> 
> 
> Same hosts, but with my workstation forced to use a 1gbit connection:
> 
> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> 10              100             78              131.18          2425            257             7613.416
> 100             100             81              129.25          425             255             7727.473
> 1000            100             100             162.12          1444            266             6161.388
> 10000           100             310             686.19          1797            927             1456.204
> 100000          100             1006            1114.20         1472            1199            896.770
> 1000000         100             8338            8420.96         8827            8498            118.410
> 
> I haven't checked, but I'd assume that 100bytes back and forth should easily
> fit a new message to update LSNs and the existing feedback response. Even just
> the difference between sending 100 bytes and sending 10k (a bit more than a
> single WAL page) is pretty significant on a 1gbit network.
> 

I'm on decaf so I may be a bit slow, but it's not very clear to me what
conclusion to draw from these numbers. What is the takeaway?

My understanding is that in both cases the latency is initially fairly
stable, independent of the request size. This applies to requests up to
~1000B. Then the latency starts increasing fairly quickly, even though
it shouldn't be hitting the bandwidth limit (except maybe for the 1MB
requests).

I don't think it says we should be replicating WAL in tiny chunks,
because if you need to send a chunk of data it's always more efficient
to send it at once (compared to sending multiple smaller pieces). But if
we manage to send most of this "in the background", only leaving the
last small bit to be sent at the very end, that'd help.

> Of course, the relatively low latency between these systems makes this more
> pronounced than if this were a cross regional or even cross continental link,
> were the roundtrip latency is more likely to be dominated by distance rather
> than throughput.
> 
> Testing between europe and western US:
> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> 10              100             157934          167627.12       317705          160000          5.652
> 100             100             161294          171323.59       324017          170000          5.530
> 1000            100             161392          171521.82       324629          170000          5.524
> 10000           100             163651          173651.06       328488          170000          5.456
> 100000          100             166344          198070.20       638205          170000          4.781
> 1000000         100             225555          361166.12       1302368         240000          2.568
> 
> 
> No meaningful difference before getting to 100k. But it's pretty easy to lag
> by 100k on a longer distance link...
> 

Right.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


