Reduce useless changes before reassembly during logical replication

From: li jie
Date:
Hi hackers,

During logical replication, if there is a large write transaction, some
spill files will be written to disk, depending on the setting of
logical_decoding_work_mem.

This behavior can effectively avoid OOM, but if the transaction generates
a lot of changes before commit, a large number of spill files may fill the
disk. For example, an update of a TB-scale table.

However, I found an inelegant behavior: if the modified large table is not
published, its changes are still written out as a large number of spill files.
Look at the example below:

publisher:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```

subscriber:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432
user=postgres dbname=postgres' PUBLICATION mypub;
```

publisher:
```
begin;
insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999999) i;
```

Later you will see a large number of spill files in the
"$PGDATA/pg_replslot/mysub/" directory.
```
$ ll -sh
total 4.5G
4.0K -rw------- 1 postgres postgres 200 Nov 30 09:24 state
17M -rw------- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-10000000.spill
12M -rw------- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-1000000.spill
17M -rw------- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-11000000.spill
......
```

We can see that table tbl_t1 is not published in mypub, and its changes
won't be sent downstream because it is not subscribed.
After the transaction is reassembled, the pgoutput decoding plugin filters out
changes to these unpublished relations when sending logical changes.
See the function pgoutput_change.
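
For reference, that filtering looks roughly like this (a simplified sketch
of pgoutput_change in src/backend/replication/pgoutput/pgoutput.c, trimmed
for illustration):
```
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
    RelationSyncEntry *relentry;

    if (!is_publishable_relation(relation))
        return;

    relentry = get_rel_sync_entry(data, relation);

    /* Skip the change if the relation's action is not published. */
    switch (change->action)
    {
        case REORDER_BUFFER_CHANGE_INSERT:
            if (!relentry->pubactions.pubinsert)
                return;
            break;
        case REORDER_BUFFER_CHANGE_UPDATE:
            if (!relentry->pubactions.pubupdate)
                return;
            break;
        case REORDER_BUFFER_CHANGE_DELETE:
            if (!relentry->pubactions.pubdelete)
                return;
            break;
        default:
            break;
    }

    /* ... build and send the logical replication message ... */
}
```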

Most importantly, if we filter out changes to unpublished relations after
constructing the changes but before queuing them into a transaction, will it
reduce the workload of logical decoding and avoid disk or memory growth as
much as possible?

The patch in the attachment is a prototype, which can effectively reduce the
memory and disk space usage during logical replication.

Design:
1. Added a callback LogicalDecodeFilterByRelCB for the output plugin
(a rough sketch of the wiring is shown after this list).

2. Added the callback function pgoutput_table_filter to the pgoutput plugin.
Its implementation is based on the table filter in the pgoutput_change
function. It determines, based on the publication parameters, whether a
change needs to be published, and filters the change out if not.

3. After constructing a change and before queuing it into a transaction,
use RelidByRelfilenumber to obtain the relation associated with the change,
just as the relation is obtained in the ReorderBufferProcessTXN function.

4. The relation may be a TOAST table, and there is no good way to get its
main table's relation from the TOAST relation. Here, I get the main table's
OID through the TOAST relation's name, and then get the main table's relation.

5. This filtering takes INSERT/UPDATE/DELETE into account. Other change
types have not been considered yet and can be added in the future.
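
To make the prototype's shape concrete, the wiring looks roughly like the
sketch below (illustrative only; the names follow the description above, and
the attached patch is authoritative and may differ in details such as error
context callbacks and TOAST handling):
```
/* New output plugin callback (sketch). */
typedef bool (*LogicalDecodeFilterByRelCB) (struct LogicalDecodingContext *ctx,
                                            Relation relation,
                                            ReorderBufferChangeType change_type);

/*
 * Called after a change has been constructed but before it is queued with
 * ReorderBufferQueueChange(); returns true if the change can be dropped.
 */
static bool
FilterByTable(LogicalDecodingContext *ctx, ReorderBufferChange *change)
{
    Oid         reloid;
    Relation    relation;
    bool        filter;

    if (ctx->callbacks.filter_by_table_cb == NULL)
        return false;

    /* Resolve the relation the same way ReorderBufferProcessTXN does. */
    reloid = RelidByRelfilenumber(change->data.tp.rlocator.spcOid,
                                  change->data.tp.rlocator.relNumber);
    if (!OidIsValid(reloid))
        return false;
    relation = RelationIdGetRelation(reloid);

    /* (TOAST changes are mapped back to their main table before filtering.) */

    filter = ctx->callbacks.filter_by_table_cb(ctx, relation, change->action);

    RelationClose(relation);
    return filter;
}
```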

Test:
1. Added a test case 034_table_filter.pl
2. Like the case above, create two tables, the published table tbl_pub and
the non-published table tbl_t1
3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use
pg_ls_replslotdir to record the total size of the slot directory
every second.
4. Compare the size of the slot directory at the beginning of the
transaction (size1), the size at the end of the transaction (size2), and
the average size over the entire process (size3).
5. Assert(size1==size2==size3)

I sincerely look forward to your feedback.
Regards, lijie

Attachments

Re: Reduce useless changes before reassembly during logical replication

From: Bharath Rupireddy
Date:
On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote:
>
> Hi hackers,
>
> During logical replication, if there is a large write transaction, some
> spill files will be written to disk, depending on the setting of
> logical_decoding_work_mem.
>
> This behavior can effectively avoid OOM, but if the transaction generates
> a lot of changes before commit, a large number of spill files may fill the
> disk. For example, an update of a TB-scale table.
>
> However, I found an inelegant behavior: if the modified large table is not
> published, its changes are still written out as a large number of spill files.
> Look at the example below:

Thanks. I agree that decoding and queuing the changes of unpublished
tables' data into reorder buffer is an unnecessary task for walsender.
It takes processing efforts (CPU overhead), consumes disk space and
uses memory configured via logical_decoding_work_mem for a replication
connection inefficiently.

> Later you will see a large number of spill files in the
>
> We can see that table tbl_t1 is not published in mypub, and its changes
> won't be sent downstream because it is not subscribed.
> After the transaction is reassembled, the pgoutput decoding plugin filters out
> changes to these unpublished relations when sending logical changes.
> See the function pgoutput_change.

Right. Here's my testing [1].

> Most importantly, if we filter out changes to unpublished relations after
> constructing the changes but before queuing them into a transaction, will it
> reduce the workload of logical decoding and avoid disk or memory growth as
> much as possible?

Right. It can.

> The patch in the attachment is a prototype, which can effectively reduce the
> memory and disk space usage during logical replication.
>
> Design:
> 1. Added a callback LogicalDecodeFilterByRelCB for the output plugin.
>
> 2. Added the callback function pgoutput_table_filter to the pgoutput plugin.
> Its implementation is based on the table filter in the pgoutput_change
> function. It determines, based on the publication parameters, whether a
> change needs to be published, and filters the change out if not.
>
> 3. After constructing a change and before queuing it into a transaction,
> use RelidByRelfilenumber to obtain the relation associated with the change,
> just as the relation is obtained in the ReorderBufferProcessTXN function.
>
> 4. The relation may be a TOAST table, and there is no good way to get its
> main table's relation from the TOAST relation. Here, I get the main table's
> OID through the TOAST relation's name, and then get the main table's relation.
>
> 5. This filtering takes INSERT/UPDATE/DELETE into account. Other change
> types have not been considered yet and can be added in the future.

The design of this patch is based on the principle of logical decoding
filtering things out early on, and looks very similar to
filter_prepare_cb_wrapper/pg_decode_filter_prepare and
filter_by_origin_cb/pgoutput_origin_filter. Per my understanding, this
design looks okay unless I'm missing anything.

> Test:
> 1. Added a test case 034_table_filter.pl
> 2. Like the case above, create two tables, the published table tbl_pub and
> the non-published table tbl_t1
> 3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use
> pg_ls_replslotdir to record the total size of the slot directory
> every second.
> 4. Compare the size of the slot directory at the beginning of the
> transaction (size1), the size at the end of the transaction (size2), and
> the average size over the entire process (size3).
> 5. Assert(size1==size2==size3)

I bet that the above test with 10K rows is going to take a noticeable
time on some buildfarm members (it took 6 seconds on my dev system
which is an AWS EC2 instance). And, the above test can get flaky.
Therefore, IMO, the concrete way of testing this feature is by looking
at the server logs for the following message using
PostgreSQL::Test::Cluster log_contains().

+filter_done:
+
+    if (result && RelationIsValid(relation))
+        elog(DEBUG1, "logical filter change by table %s",
+             RelationGetRelationName(relation));
+

Here are some comments on the v1 patch:
1.
@@ -1415,9 +1419,6 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
     TupleTableSlot *old_slot = NULL;
     TupleTableSlot *new_slot = NULL;

-    if (!is_publishable_relation(relation))
-        return;
-

Instead of removing is_publishable_relation from pgoutput_change, I
think it can just be turned into an assertion
Assert(is_publishable_relation(relation));, no?

2.
+    switch (change->action)
+    {
+            /* intentionally fall through */

Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the
code, otherwise a warning is thrown.

3. From the commit message:
Most of the code in the FilterByTable function is transplanted from
the ReorderBufferProcessTXN function, which can be called before the
ReorderBufferQueueChange function. It is

I think the above note can just be above the FilterByTable function
for better understanding.

+static bool
+FilterByTable(LogicalDecodingContext *ctx, ReorderBufferChange *change)
+{

4. Why is the FilterByTable(ctx, change) call placed after DecodeXLogTuple
in DecodeInsert, DecodeUpdate and DecodeDelete? Is there a use in the new
callback filter_by_table_cb for the tuples decoded by DecodeXLogTuple? If
not, can we move the FilterByTable call before DecodeXLogTuple to avoid
some extra processing?

5. Why is ReorderBufferChange needed as a parameter to FilterByTable and
filter_by_table_cb? Can't the LogicalDecodingContext, the relation name,
and the change action be enough to decide whether the table is publishable
or not? If done this way, it can avoid some more processing, no?

6. Please run pgindent and pgperltidy on the new source code and new
TAP test file respectively.

[1]
HEAD:
postgres=# BEGIN;
BEGIN
Time: 0.110 ms
postgres=*# insert into tbl_t1 select i,repeat('xyzzy',
i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,99999)
i;
INSERT 0 100000
Time: 379488.265 ms (06:19.488)
postgres=*#

ubuntu:~/postgres/pg17/bin$ du -sh
/home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
837M    /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
ubuntu:~/postgres/pg17/bin$ du -sh /home/ubuntu/postgres/pg17/bin/db17
2.6G    /home/ubuntu/postgres/pg17/bin/db17

PATCHED:
postgres=# BEGIN;
BEGIN
Time: 0.105 ms
postgres=*# insert into tbl_t1 select i,repeat('xyzzy',
i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,99999)
i;
INSERT 0 100000
Time: 380044.554 ms (06:20.045)

ubuntu:~/postgres$ du -sh /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
8.0K    /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
ubuntu:~/postgres$ du -sh /home/ubuntu/postgres/pg17/bin/db17
1.8G    /home/ubuntu/postgres/pg17/bin/db17

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reduce useless changes before reassembly during logical replication

From: Amit Kapila
Date:
On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > During logical replication, if there is a large write transaction, some
> > spill files will be written to disk, depending on the setting of
> > logical_decoding_work_mem.
> >
> > This behavior can effectively avoid OOM, but if the transaction generates
> > a lot of changes before commit, a large number of spill files may fill the
> > disk. For example, an update of a TB-scale table.
> >
> > However, I found an inelegant behavior: if the modified large table is not
> > published, its changes are still written out as a large number of spill files.
> > Look at the example below:
>
> Thanks. I agree that decoding and queuing the changes of unpublished
> tables' data into reorder buffer is an unnecessary task for walsender.
> It takes processing efforts (CPU overhead), consumes disk space and
> uses memory configured via logical_decoding_work_mem for a replication
> connection inefficiently.
>

This is all true but note that in successful cases (where the table is
published) all the work done by FilterByTable(accessing caches,
transaction-related stuff) can add noticeable overhead as anyway we do
that later in pgoutput_change(). I think I gave the same comment
earlier as well but didn't see any satisfactory answer or performance
data for successful cases to back this proposal. Note, users can configure
to stream in-progress transactions, in which case they shouldn't see such a
big problem. However, I agree that if we can find
some solution where there is no noticeable overhead then that would be
worth considering.

--
With Regards,
Amit Kapila.



Re: Reduce useless changes before reassembly during logical replication

From: Bharath Rupireddy
Date:
On Thu, Jan 18, 2024 at 2:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote:
> > >
> > > Hi hackers,
> > >
> > > During logical replication, if there is a large write transaction, some
> > > spill files will be written to disk, depending on the setting of
> > > logical_decoding_work_mem.
> > >
> > > This behavior can effectively avoid OOM, but if the transaction generates
> > > a lot of changes before commit, a large number of spill files may fill the
> > > disk. For example, an update of a TB-scale table.
> > >
> > > However, I found an inelegant behavior: if the modified large table is not
> > > published, its changes are still written out as a large number of spill files.
> > > Look at the example below:
> >
> > Thanks. I agree that decoding and queuing the changes of unpublished
> > tables' data into reorder buffer is an unnecessary task for walsender.
> > It takes processing efforts (CPU overhead), consumes disk space and
> > uses memory configured via logical_decoding_work_mem for a replication
> > connection inefficiently.
> >
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change().

Right. The overhead for published tables needs to be studied. A possible
way is to mark the checks performed in FilterByTable/filter_by_table_cb and
skip the same checks in pgoutput_change. I'm not sure if this works without
any issues though.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reduce useless changes before reassembly during logical replication

From: Andy Fan
Date:
Hi, 
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change(). I think I gave the same comment
> earlier as well but didn't see any satisfactory answer or performance
> data for successful cases to back this proposal.

I did some benchmarking yesterday at [1] and found it adds 20% CPU time.
Then I came up with a basic idea that I think deserves a share. The
"transaction-related stuff" comes from the syscache/systable access, apart
from the historic snapshot, and the syscache is required in the following
situations:

1. relfilenode (from WAL) -> relid.
2. relid -> namespace OID (to check if the relid is a TOAST relation).
3. if TOAST, get its original relid.
4. access the data from pg_publication_tables.
5. see if the relid is a partition; if yes, we may need to get its root
relation.

Actually we already have a RelationSyncCache for #4, and it *only* needs
to access the syscache when replicate_valid is false. I think this case
should be rare, but the caller doesn't know that, so the caller must
prepare the transaction stuff in advance even though in most cases it is
not used. So I think we can get an optimization here.

Then the attached patch was made.

Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Wed Feb 21 18:40:03 2024 +0800

    Make get_rel_sync_entry less dependent on transaction state.

    get_rel_sync_entry needs a transaction only when a replicate_valid = false
    entry is found, which should be a rare case. However, the caller can't
    know whether an entry is valid, so it has to prepare the transaction state
    before calling this function. Such preparation is expensive.

    This patch makes get_rel_sync_entry manage the transaction state itself,
    only when necessary, so the callers don't need to prepare it blindly.
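
The core of the idea, as a simplified sketch (illustrative only, not the
attached patch itself): enter a transaction inside get_rel_sync_entry only
when the cache entry actually has to be (re)built:

```
    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              &relid, HASH_ENTER, &found);

    if (!found || !entry->replicate_valid)
    {
        bool    started_xact = false;

        /*
         * Only now do we need catalog access, so set up a transaction if
         * the caller has not already done so.
         */
        if (!IsTransactionOrTransactionBlock())
        {
            StartTransactionCommand();
            started_xact = true;
        }

        /* ... rebuild the entry from the publication catalogs ... */

        if (started_xact)
            CommitTransactionCommand();
    }
```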

Then comes #1. Actually we have RelfilenumberMapHash as a cache; when the
cache is hit (suppose this is the usual case), no transaction stuff is
needed. I have two ideas then:

1. Optimize the cache-hit situation, like what we just did for
get_rel_sync_entry, for all 5 kinds of data, and only pay the effort in the
cache-miss case. For the data for #2, #3 and #5, all the keys are relid, so
I think a single HTAB should be OK.

2. Add the content for #1, #2, #3 and #5 to WAL when wal_level is set to
logical.

In either case, the changes to get_rel_sync_entry would still be needed.

> Note, users can configure to stream in-progress transactions, in which
> case they shouldn't see such a big problem.

People would see that the changes are spilled to disk, but the CPU cost for
the reorder buffer would still be paid, I think.

[1] https://www.postgresql.org/message-id/87o7cadqj3.fsf%40163.com

-- 
Best Regards
Andy Fan


Attachments

Re: Reduce useless changes before reassembly during logical replication

From: li jie
Date:
Hi,

Sorry for the late reply.

> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change().

You are correct. Frequently opening transactions and accessing the cache
causes a lot of overhead, which Andy has tested and proven.

The root cause is that every DML WAL record needs to do this, which is
really wasteful. I use a hash table, LocatorFilterCache, to solve this
problem. After getting a RelFileLocator, I look it up in the hash table to
get its PublicationActions and, based on those, decide whether the change
is published or should be filtered out.
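
In rough outline, the cache and the per-record lookup look like the sketch
below (illustrative only; the entry point in the attached v2 patch is
ReorderBufferFilterByLocator, and the helper name and details here are just
for explanation):

```
/* One entry per RelFileLocator, carrying the publish decision. */
typedef struct LocatorFilterEntry
{
    RelFileLocator locator;         /* hash key */
    PublicationActions pubactions;  /* which actions are published */
} LocatorFilterEntry;

static HTAB *LocatorFilterCache = NULL;

/*
 * Fast path, called per WAL record: a single hash lookup, no transaction
 * and no syscache access.  A cache miss falls back to a slow path that
 * builds the entry under a historic snapshot, as ReorderBufferProcessTXN
 * does today.
 */
static bool
ChangeIsFiltered(RelFileLocator locator, ReorderBufferChangeType action)
{
    LocatorFilterEntry *entry;
    bool        found;

    entry = (LocatorFilterEntry *) hash_search(LocatorFilterCache,
                                               &locator, HASH_FIND, &found);
    if (!found)
        return false;           /* let the slow path decide and fill the cache */

    switch (action)
    {
        case REORDER_BUFFER_CHANGE_INSERT:
            return !entry->pubactions.pubinsert;
        case REORDER_BUFFER_CHANGE_UPDATE:
            return !entry->pubactions.pubupdate;
        case REORDER_BUFFER_CHANGE_DELETE:
            return !entry->pubactions.pubdelete;
        default:
            return false;
    }
}
```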

The effect in my test is very obvious (perf record):
v1:
  Children      Self  Command   Shared O  Symbol
+ 22.04%     1.53%  postgres  postgres           [.] FilterByTable

v2:
  Children      Self  Command   Shared O  Symbol
+  0.58%     0.00%  postgres  postgres  [.] ReorderBufferFilterByLocator

v1 patch introduces 20% overhead, while v2 only has 0.58%.


> Note, users can configure to stream in-progress transactions, in which
> case they shouldn't see such a big problem.

Yes, stream mode can prevent these irrelevant changes from being written to
disk or sent downstream.
However, CPU and memory consumption will still be incurred when processing
these useless changes. Here is my simple test [1]:

Based on master:

CPU stat: perf stat -p pid -e cycles -I 1000
# time counts unit events
76.007070936 9,691,035 cycles
77.007163484 5,977,694 cycles
78.007252533 5,924,703 cycles
79.007346862 5,861,934 cycles
80.007438070 5,858,264 cycles
81.007527122 6,408,759 cycles
82.007615711 6,397,988 cycles
83.007705685 5,520,407 cycles
84.007794387 5,359,162 cycles
85.007884879 5,194,079 cycles
86.007979797 5,391,270 cycles
87.008069606 5,474,536 cycles
88.008162827 5,594,190 cycles
89.008256327 5,610,023 cycles
90.008349583 5,627,350 cycles
91.008437785 6,273,510 cycles
92.008527938 580,934,205 cycles
93.008620136 4,404,672 cycles
94.008711818 4,599,074 cycles
95.008805591 4,374,958 cycles
96.008894543 4,300,180 cycles
97.008987582 4,157,892 cycles
98.009077445 4,072,178 cycles
99.009163475 4,043,875 cycles
100.009254888 5,382,667 cycles

memory stat: pidstat -p pid -r 1 10
07:57:18 AM UID PID minflt/s majflt/s VSZ RSS %MEM Command
07:57:19 AM 1000 11848 233.00 0.00 386872 81276 0.01 postgres
07:57:20 AM 1000 11848 235.00 0.00 387008 82068 0.01 postgres
07:57:21 AM 1000 11848 236.00 0.00 387144 83124 0.01 postgres
07:57:22 AM 1000 11848 236.00 0.00 387144 83916 0.01 postgres
07:57:23 AM 1000 11848 236.00 0.00 387280 84972 0.01 postgres
07:57:24 AM 1000 11848 334.00 0.00 337000 36928 0.00 postgres
07:57:25 AM 1000 11848 3.00 0.00 337000 36928 0.00 postgres
07:57:26 AM 1000 11848 0.00 0.00 337000 36928 0.00 postgres
07:57:27 AM 1000 11848 0.00 0.00 337000 36928 0.00 postgres
07:57:28 AM 1000 11848 0.00 0.00 337000 36928 0.00 postgres
Average: 1000 11848 151.30 0.00 362045 60000 0.01 postgres

After the patch:
# time counts unit events
76.007623310 4,237,505 cycles
77.007717436 3,989,618 cycles
78.007813848 3,965,857 cycles
79.007906412 3,601,715 cycles
80.007998111 3,670,835 cycles
81.008092670 3,495,844 cycles
82.008187456 3,822,695 cycles
83.008281335 5,034,146 cycles
84.008374998 3,867,683 cycles
85.008470245 3,996,927 cycles
86.008563783 3,823,893 cycles
87.008658628 3,825,472 cycles
88.008755246 3,823,079 cycles
89.008849719 3,966,083 cycles
90.008945774 4,012,704 cycles
91.009044492 4,026,860 cycles
92.009139621 3,860,912 cycles
93.009242485 3,961,533 cycles
94.009346304 3,799,897 cycles
95.009440164 3,959,602 cycles
96.009534251 3,960,405 cycles
97.009625904 3,762,581 cycles
98.009716518 4,859,490 cycles
99.009807720 3,940,845 cycles
100.009901399 3,888,095 cycles

08:01:47 AM UID PID minflt/s majflt/s VSZ RSS %MEM Command
08:01:48 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:49 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:50 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:51 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:52 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:53 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:54 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:55 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:56 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
08:01:57 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres
Average: 1000 19466 0.00 0.00 324424 15140 0.00 postgres

Through comparison, it is found that the patch is also profitable for
stream mode. Of course, LocatorFilterCache also needs to deal with
invalidation, such as invalidation of the corresponding relation or changes
to pg_publication, just like RelationSyncCache and RelfilenumberMapHash.
But DDL is a small amount after all, which is insignificant compared to a
large amount of DML.
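
For invalidation, the sketch would register callbacks the same way
RelationSyncCache does in pgoutput (the callback function names below are
placeholders, not the names used in the patch):

```
/* Register once, when the cache is created. */
CacheRegisterRelcacheCallback(locator_filter_relcache_cb, (Datum) 0);
CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
                              locator_filter_publication_cb, (Datum) 0);
```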

Another problem is that the LocatorFilterCache looks redundant compared
to RelationSyncCache and RelfilenumberMapHash, like this:
1. RelfilenumberMapHash: relfilenode -> relation OID
2. RelationSyncCache: relation OID -> PublicationActions
3. LocatorFilterCache: RelFileLocator -> PublicationActions

The reason is that you cannot simply chain the two existing caches to go
from relfilenode to PublicationActions: you must use historic snapshots to
access transactions and the relcache in the middle. So there is no good
solution for this for the time being; it is ugly but effective.


> Therefore, IMO, the concrete way of testing this feature is by looking
> at the server logs for the following message using
> PostgreSQL::Test::Cluster log_contains().
Thanks, done.

> Instead of removing is_publishable_relation from pgoutput_change, I
> think it can just be turned into an assertion
> Assert(is_publishable_relation(relation));, no?
Yes, done.

> Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the
> code, otherwise a warning is thrown.
A /* intentionally fall through */ comment can also avoid warnings.

> Can't the LogicalDecodingContext, the relation name, and the change
> action be enough to decide whether the table is publishable or not? If
> done this way, it can avoid some more processing, no?
Yes, RelFileLocator filtering is used directly in v2, and the change is no
longer required as a parameter.

> Please run pgindent and pgperltidy on the new source code and new
> TAP test file respectively.
OK.

[1]: https://www.postgresql.org/message-id/CAGfChW62f5NTNbLsqO-6_CrmKPqBEQtWPcPDafu8pCwZznk%3Dxw%40mail.gmail.com

Attachments