RE: Forget close an open relation in ReorderBufferProcessTXN()

From: osumi.takamichi@fujitsu.com
Subject: RE: Forget close an open relation in ReorderBufferProcessTXN()
Date:
Msg-id: OSBPR01MB4888DCAC385E707D74B2957EED299@OSBPR01MB4888.jpnprd01.prod.outlook.com
In reply to: Re: Forget close an open relation in ReorderBufferProcessTXN()  (Amit Langote <amitlangote09@gmail.com>)
Responses: RE: Forget close an open relation in ReorderBufferProcessTXN()  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
    Re: Forget close an open relation in ReorderBufferProcessTXN()  (Amit Langote <amitlangote09@gmail.com>)
List: pgsql-hackers
On Thursday, May 20, 2021 9:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Here are updated/divided patches.
Thanks for your updates.

However, I've detected segmentation faults caused by the patch,
which can happen during 100_bugs.pl in src/test/subscription.
This happens more than once in ten runs.

This problem appears to be a timing issue and was already introduced by v3.
I applied v5 to HEAD as well and reproduced the failure, while
plain OSS HEAD does not reproduce it, even when I executed 100_bugs.pl 200 times in a tight loop.
I used the same commit id 4f586fe2 for all checks. The logs below are from v3.

* The message of the failure during TAP test.

# Postmaster PID for node "twoways" is 5015
Waiting for replication conn testsub's replay_lsn to pass pg_current_wal_lsn() on twoways
# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'testsub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# psql: error: connection to server on socket "/tmp/cs8dhFOtZZ/.s.PGSQL.59345" failed: No such file or directory
#       Is the server running locally and accepting connections on that socket?
timed out waiting for catchup at t/100_bugs.pl line 148.


The failure produces a core file; its backtrace is below.
My first guess at the cause: between the time get_rel_sync_entry() obtains an entry from hash_search()
and the time maybe_send_schema() sets the map via convert_tuples_by_name(), an invalidation message arrives,
and the callback tries to free the not-yet-set descriptors in the entry?

* core file backtrace

Core was generated by `postgres: twoways: walsender k5user [local] START_REPLICATION                 '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
1225                            FreeTupleDesc(entry->map->indesc);
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb) bt
#0  0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
#1  0x0000000000ae21f0 in LocalExecuteInvalidationMessage (msg=0x21d1de8) at inval.c:601
#2  0x00000000008dbd6e in ReorderBufferExecuteInvalidations (nmsgs=4, msgs=0x21d1db8) at reorderbuffer.c:3232
#3  0x00000000008da70a in ReorderBufferProcessTXN (rb=0x21d1a40, txn=0x2210b58, commit_lsn=25569096, snapshot_now=0x21d1e10, command_id=1, streaming=false) at reorderbuffer.c:2294
#4  0x00000000008dae56 in ReorderBufferReplay (txn=0x2210b58, rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0, origin_lsn=0) at reorderbuffer.c:2591
#5  0x00000000008daede in ReorderBufferCommit (rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0, origin_lsn=0) at reorderbuffer.c:2615
#6  0x00000000008cae06 in DecodeCommit (ctx=0x21e6880, buf=0x7fffb9383db0, parsed=0x7fffb9383c10, xid=748, two_phase=false) at decode.c:744
#7  0x00000000008ca1ed in DecodeXactOp (ctx=0x21e6880, buf=0x7fffb9383db0) at decode.c:278
#8  0x00000000008c9e76 in LogicalDecodingProcessRecord (ctx=0x21e6880, record=0x21e6c80) at decode.c:142
#9  0x0000000000901fcc in XLogSendLogical () at walsender.c:2876
#10 0x0000000000901327 in WalSndLoop (send_data=0x901f30 <XLogSendLogical>) at walsender.c:2306
#11 0x00000000008ffd5f in StartLogicalReplication (cmd=0x219aff8) at walsender.c:1206
#12 0x00000000009006ae in exec_replication_command (cmd_string=0x2123c20 "START_REPLICATION SLOT \"pg_16400_sync_16392_6964617299612181363\" LOGICAL 0/182D058 (proto_version '2', publication_names '\"testpub\"')") at walsender.c:1646
#13 0x000000000096ffae in PostgresMain (argc=1, argv=0x7fffb93840d0, dbname=0x214ef18 "d1", username=0x214eef8 "k5user") at postgres.c:4482

I'll update when I get more information.


Best Regards,
    Takamichi Osumi

