Discussion: [Incident report]Backend process crashed when executing 2pc transaction

[Incident report]Backend process crashed when executing 2pc transaction

From: "LIANGBO"
Date:
Dear all,

I ran into the following problem in our production environment. We tried to reproduce it,
but it occurs so rarely that we have not been able to.
 
1. Phenomenon
A backend process crashed while executing a 2PC transaction in Citus.

- coordinator

        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver WARNING:  server conn crashed?
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver CONTEXT:  while executing command on 10.230.27.117:6432
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver WARNING:  failed to commit transaction on 10.230.27.117:6432
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver WARNING:  failed to roll back prepared transaction 'citus_0_35791_4207001212_1287199'
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver HINT:  Run "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'" on 10.230.27.117:6432
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver WARNING:  server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver CONTEXT:  while executing command on 10.230.27.117:6432
        2019-11-24 11:08:09.914 CST 35791 10.246.66.182(6881) lobausr lobadbw2 PostgreSQL JDBC Driver LOG:  duration: 17123.210 ms  execute S_1: COMMIT
 

- worker

        2019-11-24 11:08:09.854 CST 14668     LOG:  server process (PID 14714) was terminated by signal 6: Aborted
        2019-11-24 11:08:09.854 CST 14668     DETAIL:  Failed process was running: COMMIT PREPARED 'citus_0_35791_4207001212_1287199'
        2019-11-24 11:08:09.854 CST 14668     LOG:  terminating any other active server processes
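
The prepared-transaction name that appears in both logs seems to encode its origin: its second number, 35791, matches the coordinator backend PID in the log prefix above. The sketch below splits such a name into its numeric parts. It assumes the layout citus_<group id>_<coordinator PID>_<transaction number>_<connection number>; those field names are an assumption made for illustration and are not confirmed anywhere in this report.

```python
from typing import NamedTuple


class CitusGid(NamedTuple):
    # Field names are assumptions for illustration; only the PID match
    # (35791) can be checked against the logs in this report.
    group_id: int
    coordinator_pid: int
    transaction_number: int
    connection_number: int


def parse_citus_gid(gid: str) -> CitusGid:
    """Split a 'citus_<a>_<b>_<c>_<d>' prepared-transaction name into integers."""
    prefix, *fields = gid.split("_")
    if prefix != "citus" or len(fields) != 4:
        raise ValueError(f"unexpected gid layout: {gid!r}")
    return CitusGid(*(int(field) for field in fields))


gid = parse_citus_gid("citus_0_35791_4207001212_1287199")
print(gid.coordinator_pid)  # 35791, same as the coordinator PID in the log prefix
```

Matching the PID this way can help pair a prepared transaction left on a worker with the coordinator session that created it.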
        
2. Conditions of occurrence
A distributed transaction issued by the application's business SQL.

PostgreSQL:10.7
citus:7.4.1
OS:RHEL6.3

3. Investigation
3.1 PG log
- worker

        *** glibc detected *** postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED: double free or corruption (!prev): 0x0000000001a977a0 ***
        ======= Backtrace: =========
        /lib64/libc.so.6[0x369e275f4e]
        /lib64/libc.so.6[0x369e278cf0]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(XLogReaderFree+0x57)[0x4ff947]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED[0x4e4387]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(FinishPreparedTransaction+0x139)[0x4e5849]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(standard_ProcessUtility+0x711)[0x72d601]
        /usr/pgsql-10/lib/citus.so(multi_ProcessUtility+0x741)[0x7f63a1ae97e1]
        /usr/pgsql-10/lib/pg_stat_statements.so(+0x4178)[0x7f63a11e1178]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED[0x729388]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED[0x72a2fd]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(PortalRun+0x238)[0x72aa98]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED[0x727051]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(PostgresMain+0x549)[0x728039]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(PostmasterMain+0x194a)[0x6bb43a]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED(main+0x7c0)[0x63b4d0]
        /lib64/libc.so.6(__libc_start_main+0xfd)[0x369e21ed5d]
        postgres: lobausr lobadbw2 127.0.0.1(39349) COMMIT PREPARED[0x477149]
        ======= Memory map: ========
        00400000-00a99000 r-xp 00000000 fd:08 266747                             /usr/pgsql-10/bin/postgres
        00c99000-00ca7000 rw-p 00699000 fd:08 266747                             /usr/pgsql-10/bin/postgres
        00ca7000-00d10000 rw-p 00000000 00:00 0 
        0190c000-019b3000 rw-p 00000000 00:00 0 
        019b3000-01c5d000 rw-p 00000000 00:00 0 
        369de00000-369de20000 r-xp 00000000 fd:01 3765                           /lib64/ld-2.12.so
        369e01f000-369e020000 r--p 0001f000 fd:01 3765                           /lib64/ld-2.12.so
        369e020000-369e021000 rw-p 00020000 fd:01 3765                           /lib64/ld-2.12.so
        369e021000-369e022000 rw-p 00000000 00:00 0 
        369e200000-369e38a000 r-xp 00000000 fd:01 3766                           /lib64/libc-2.12.so
        369e38a000-369e58a000 ---p 0018a000 fd:01 3766                           /lib64/libc-2.12.so
        369e58a000-369e58e000 r--p 0018a000 fd:01 3766                           /lib64/libc-2.12.so
        369e58e000-369e58f000 rw-p 0018e000 fd:01 3766                           /lib64/libc-2.12.so
        369e58f000-369e594000 rw-p 00000000 00:00 0 
        369e600000-369e602000 r-xp 00000000 fd:01 5167                           /lib64/libdl-2.12.so
        369e602000-369e802000 ---p 00002000 fd:01 5167                           /lib64/libdl-2.12.so
        369e802000-369e803000 r--p 00002000 fd:01 5167                           /lib64/libdl-2.12.so
        369e803000-369e804000 rw-p 00003000 fd:01 5167                           /lib64/libdl-2.12.so
        369ea00000-369ea17000 r-xp 00000000 fd:01 5174                           /lib64/libpthread-2.12.so
        369ea17000-369ec17000 ---p 00017000 fd:01 5174                           /lib64/libpthread-2.12.so
        369ec17000-369ec18000 r--p 00017000 fd:01 5174                           /lib64/libpthread-2.12.so
        369ec18000-369ec19000 rw-p 00018000 fd:01 5174                           /lib64/libpthread-2.12.so
        369ec19000-369ec1d000 rw-p 00000000 00:00 0 
        369ee00000-369ee07000 r-xp 00000000 fd:01 5175                           /lib64/librt-2.12.so
        369ee07000-369f006000 ---p 00007000 fd:01 5175                           /lib64/librt-2.12.so
        369f006000-369f007000 r--p 00006000 fd:01 5175                           /lib64/librt-2.12.so
        369f007000-369f008000 rw-p 00007000 fd:01 5175                           /lib64/librt-2.12.so
        369f200000-369f283000 r-xp 00000000 fd:01 5182                           /lib64/libm-2.12.so
        369f283000-369f482000 ---p 00083000 fd:01 5182                           /lib64/libm-2.12.so
        369f482000-369f483000 r--p 00082000 fd:01 5182                           /lib64/libm-2.12.so
        369f483000-369f484000 rw-p 00083000 fd:01 5182                           /lib64/libm-2.12.so
        369fa00000-369fa0c000 r-xp 00000000 fd:01 5170                           /lib64/libpam.so.0.82.2
        369fa0c000-369fc0c000 ---p 0000c000 fd:01 5170                           /lib64/libpam.so.0.82.2
        369fc0c000-369fc0d000 r--p 0000c000 fd:01 5170                           /lib64/libpam.so.0.82.2
        369fc0d000-369fc0e000 rw-p 0000d000 fd:01 5170                           /lib64/libpam.so.0.82.2
        369fe00000-369fe1d000 r-xp 00000000 fd:01 5179                           /lib64/libselinux.so.1
        369fe1d000-36a001c000 ---p 0001d000 fd:01 5179                           /lib64/libselinux.so.1
        36a001c000-36a001d000 r--p 0001c000 fd:01 5179                           /lib64/libselinux.so.1
        36a001d000-36a001e000 rw-p 0001d000 fd:01 5179                           /lib64/libselinux.so.1
        36a001e000-36a001f000 rw-p 00000000 00:00 0 
        36a0200000-36a0219000 r-xp 00000000 fd:08 17924                          /usr/lib64/libsasl2.so.2.0.23
        36a0219000-36a0418000 ---p 00019000 fd:08 17924                          /usr/lib64/libsasl2.so.2.0.23
        36a0418000-36a0419000 r--p 00018000 fd:08 17924                          /usr/lib64/libsasl2.so.2.0.23
        36a0419000-36a041a000 rw-p 00019000 fd:08 17924                          /usr/lib64/libsasl2.so.2.0.23
        36a0600000-36a0616000 r-xp 00000000 fd:01 3772                           /lib64/libgcc_s-4.4.6-20120305.so.1
        36a0616000-36a0815000 ---p 00016000 fd:01 3772                           /lib64/libgcc_s-4.4.6-20120305.so.1
        36a0815000-36a0816000 rw-p 00015000 fd:01 3772                           /lib64/libgcc_s-4.4.6-20120305.so.1
        36a0a00000-36a0a16000 r-xp 00000000 fd:01 5178                           /lib64/libresolv-2.12.so
        36a0a16000-36a0c16000 ---p 00016000 fd:01 5178                           /lib64/libresolv-2.12.so
        36a0c16000-36a0c17000 r--p 00016000 fd:01 5178                           /lib64/libresolv-2.12.so
        36a0c17000-36a0c18000 rw-p 00017000 fd:01 5178                           /lib64/libresolv-2.12.so
        36a0c18000-36a0c1a000 rw-p 00000000 00:00 0 
        36a0e00000-36a0e49000 r-xp 00000000 fd:01 1926                           /lib64/libldap-2.4.so.2.5.6
        36a0e49000-36a1048000 ---p 00049000 fd:01 1926                           /lib64/libldap-2.4.so.2.5.6
        36a1048000-36a1049000 r--p 00048000 fd:01 1926                           /lib64/libldap-2.4.so.2.5.6
        36a1049000-36a104b000 rw-p 00049000 fd:01 1926                           /lib64/libldap-2.4.so.2.5.6
        36a1600000-36a160e000 r-xp 00000000 fd:01 3777                           /lib64/liblber-2.4.so.2.5.6
        36a160e000-36a180d000 ---p 0000e000 fd:01 3777                           /lib64/liblber-2.4.so.2.5.6
        36a180d000-36a180e000 r--p 0000d000 fd:01 3777                           /lib64/liblber-2.4.so.2.5.6
        36a180e000-36a180f000 rw-p 0000e000 fd:01 3777                           /lib64/liblber-2.4.so.2.5.6
        36a1a00000-36a1a5d000 r-xp 00000000 fd:01 5168                           /lib64/libfreebl3.so
        36a1a5d000-36a1c5c000 ---p 0005d000 fd:01 5168                           /lib64/libfreebl3.so
        36a1c5c000-36a1c5d000 r--p 0005c000 fd:01 5168                           /lib64/libfreebl3.so
        36a1c5d000-36a1c5e000 rw-p 0005d000 fd:01 5168                           /lib64/libfreebl3.so
        36a1c5e000-36a1c62000 rw-p 00000000 00:00 0 
        36a1e00000-36a1e07000 r-xp 00000000 fd:01 5169                           /lib64/libcrypt-2.12.so
        36a1e07000-36a2007000 ---p 00007000 fd:01 5169                           /lib64/libcrypt-2.12.so
        36a2007000-36a2008000 r--p 00007000 fd:01 5169                           /lib64/libcrypt-2.12.so
        36a2008000-36a2009000 rw-p 00008000 fd:01 5169                           /lib64/libcrypt-2.12.so
        36a2009000-36a2037000 rw-p 00000000 00:00 0 
        36a2200000-36a2203000 r-xp 00000000 fd:01 5192                           /lib64/libcom_err.so.2.1
        36a2203000-36a2402000 ---p 00003000 fd:01 5192                           /lib64/libcom_err.so.2.1
        36a2402000-36a2403000 r--p 00002000 fd:01 5192                           /lib64/libcom_err.so.2.1
        36a2403000-36a2404000 rw-p 00003000 fd:01 5192                           /lib64/libcom_err.so.2.1
        36a2600000-36a2620000 r-xp 00000000 fd:08 5902                           /usr/lib64/libnssutil3.so
        36a2620000-36a281f000 ---p 00020000 fd:08 5902                           /usr/lib64/libnssutil3.so
        36a281f000-36a2825000 r--p 0001f000 fd:08 5902                           /usr/lib64/libnssutil3.so
        36a2825000-36a2826000 rw-p 00025000 fd:08 5902                           /usr/lib64/libnssutil3.so
        36a2a00000-36a2b47000 r-xp 00000000 fd:08 18039                          /usr/lib64/libxml2.so.2.7.6
        36a2b47000-36a2d46000 ---p 00147000 fd:08 18039                          /usr/lib64/libxml2.so.2.7.6
        36a2d46000-36a2d50000 rw-p 00146000 fd:08 18039                          /usr/lib64/libxml2.so.2.7.6
        36a2d50000-36a2d51000 rw-p 00000000 00:00 0 
        36a2e00000-36a2e03000 r-xp 00000000 fd:01 5188                           /lib64/libplds4.so
        36a2e03000-36a3002000 ---p 00003000 fd:01 5188                           /lib64/libplds4.so
        36a3002000-36a3003000 r--p 00002000 fd:01 5188                           /lib64/libplds4.so
        36a3003000-36a3004000 rw-p 00003000 fd:01 5188                           /lib64/libplds4.so
        36a3200000-36a3239000 r-xp 00000000 fd:01 663                            /lib64/libnspr4.so
        36a3239000-36a3438000 ---p 00039000 fd:01 663                            /lib64/libnspr4.so
        36a3438000-36a3439000 r--p 00038000 fd:01 663                            /lib64/libnspr4.so
        36a3439000-36a343b000 rw-p 00039000 fd:01 663                            /lib64/libnspr4.so
        36a343b000-36a343d000 rw-p 00000000 00:00 0 
        36a3600000-36a3638000 r-xp 00000000 fd:08 21660                          /usr/lib64/libssl3.so
        36a3638000-36a3838000 ---p 00038000 fd:08 21660                          /usr/lib64/libssl3.so
        36a3838000-36a383a000 r--p 00038000 fd:08 21660                          /usr/lib64/libssl3.so
        36a383a000-36a383b000 rw-p 0003a000 fd:08 21660                          /usr/lib64/libssl3.so
        36a383b000-36a383c000 rw-p 00000000 00:00 0 
        36a3a00000-36a3a04000 r-xp 00000000 fd:01 5187                           /lib64/libplc4.so
        36a3a04000-36a3c03000 ---p 00004000 fd:01 5187                           /lib64/libplc4.so
        36a3c03000-36a3c04000 r--p 00003000 fd:01 5187                           /lib64/libplc4.so
        36a3c04000-36a3c05000 rw-p 00004000 fd:01 5187                           /lib64/libplc4.so
        36a3e00000-36a3f33000 r-xp 00000000 fd:08 21659                          /usr/lib64/libnss3.so
        36a3f33000-36a4132000 ---p 00133000 fd:08 21659                          /usr/lib64/libnss3.so
        36a4132000-36a4137000 r--p 00132000 fd:08 21659                          /usr/lib64/libnss3.so
        36a4137000-36a4139000 rw-p 00137000 fd:08 21659                          /usr/lib64/libnss3.so
        36a4139000-36a413b000 rw-p 00000000 00:00 0 
        36a4200000-36a43ba000 r-xp 00000000 fd:08 20219                          /usr/lib64/libcrypto.so.1.0.1e
        36a43ba000-36a45b9000 ---p 001ba000 fd:08 20219                          /usr/lib64/libcrypto.so.1.0.1e
        36a45b9000-36a45d4000 r--p 001b9000 fd:08 20219                          /usr/lib64/libcrypto.so.1.0.1e
        36a45d4000-36a45e0000 rw-p 001d4000 fd:08 20219                          /usr/lib64/libcrypto.so.1.0.1e
        36a45e0000-36a45e4000 rw-p 00000000 00:00 0 
        36a4600000-36a4628000 r-xp 00000000 fd:08 20218                          /usr/lib64/libsmime3.so
        36a4628000-36a4828000 ---p 00028000 fd:08 20218                          /usr/lib64/libsmime3.so
        36a4828000-36a482b000 r--p 00028000 fd:08 20218                          /usr/lib64/libsmime3.so
        36a482b000-36a482c000 rw-p 0002b000 fd:08 20218                          /usr/lib64/libsmime3.so
        36a4a00000-36a4a02000 r-xp 00000000 fd:01 237                            /lib64/libkeyutils.so.1.3
        36a4a02000-36a4c01000 ---p 00002000 fd:01 237                            /lib64/libkeyutils.so.1.3
        36a4c01000-36a4c02000 r--p 00001000 fd:01 237                            /lib64/libkeyutils.so.1.3
        36a4c02000-36a4c03000 rw-p 00002000 fd:01 237                            /lib64/libkeyutils.so.1.3
        36a4e00000-36a4e0a000 r-xp 00000000 fd:01 5190                           /lib64/libkrb5support.so.0.1
        36a4e0a000-36a5009000 ---p 0000a000 fd:01 5190                           /lib64/libkrb5support.so.0.1
        36a5009000-36a500a000 r--p 00009000 fd:01 5190                           /lib64/libkrb5support.so.0.1
        36a500a000-36a500b000 rw-p 0000a000 fd:01 5190                           /lib64/libkrb5support.so.0.1
        36a5200000-36a52d4000 r-xp 00000000 fd:01 5193                           /lib64/libkrb5.so.3.3
        36a52d4000-36a54d4000 ---p 000d4000 fd:01 5193                           /lib64/libkrb5.so.3.3
        36a54d4000-36a54dd000 r--p 000d4000 fd:01 5193                           /lib64/libkrb5.so.3.3
        36a54dd000-36a54df000 rw-p 000dd000 fd:01 5193                           /lib64/libkrb5.so.3.3
        36a5600000-36a563f000 r-xp 00000000 fd:01 5194                           /lib64/libgssapi_krb5.so.2.2
        36a563f000-36a583f000 ---p 0003f000 fd:01 5194                           /lib64/libgssapi_krb5.so.2.2
        36a583f000-36a5840000 r--p 0003f000 fd:01 5194                           /lib64/libgssapi_krb5.so.2.2
        36a5840000-36a5842000 rw-p 00040000 fd:01 5194                           /lib64/libgssapi_krb5.so.2.2
        36a5a00000-36a5a51000 r-xp 00000000 fd:08 6885                           /usr/lib64/libcurl.so.4.1.1
        36a5a51000-36a5c50000 ---p 00051000 fd:08 6885                           /usr/lib64/libcurl.so.4.1.1
        36a5c50000-36a5c53000 rw-p 00050000 fd:08 6885                           /usr/lib64/libcurl.so.4.1.1
        36a5e00000-36a5e2a000 r-xp 00000000 fd:01 5191                           /lib64/libk5crypto.so.3.1
        36a5e2a000-36a6029000 ---p 0002a000 fd:01 5191                           /lib64/libk5crypto.so.3.1
        36a6029000-36a602b000 r--p 00029000 fd:01 5191                           /lib64/libk5crypto.so.3.1
        36a602b000-36a602c000 rw-p 0002b000 fd:01 5191                           /lib64/libk5crypto.so.3.1
        36a6200000-36a6226000 r-xp 00000000 fd:08 16520                          /usr/lib64/libssh2.so.1.0.1
        36a6226000-36a6426000 ---p 00026000 fd:08 16520                          /usr/lib64/libssh2.so.1.0.1
        36a6426000-36a6427000 rw-p 00026000 fd:08 16520                          /usr/lib64/libssh2.so.1.0.1
        36a7200000-36a7262000 r-xp 00000000 fd:08 17699                          /usr/lib64/libssl.so.1.0.1e
        36a7262000-36a7461000 ---p 00062000 fd:08 17699                          /usr/lib64/libssl.so.1.0.1e
        36a7461000-36a7465000 r--p 00061000 fd:08 17699                          /usr/lib64/libssl.so.1.0.1e
        36a7465000-36a746c000 rw-p 00065000 fd:08 17699                          /usr/lib64/libssl.so.1.0.1e
        7f61c8000000-7f61c8021000 rw-p 00000000 00:00 0 
        7f61c8021000-7f61cc000000 ---p 00000000 00:00 0 
        7f61ccdba000-7f61ccdc0000 r-xp 00000000 fd:08 266931                     /usr/pgsql-10/lib/btree_gin.so
        7f61ccdc0000-7f61ccfbf000 ---p 00006000 fd:08 266931                     /usr/pgsql-10/lib/btree_gin.so
        7f61ccfbf000-7f61ccfc0000 rw-p 00005000 fd:08 266931                     /usr/pgsql-10/lib/btree_gin.so
        7f61ccfc0000-7f61cd1df000 rw-p 00000000 00:00 0 
        7f61cd1df000-7f61cd1eb000 r-xp 00000000 fd:01 3754                       /lib64/libnss_files-2.12.so
        7f61cd1eb000-7f61cd3eb000 ---p 0000c000 fd:01 3754                       /lib64/libnss_files-2.12.so
        7f61cd3eb000-7f61cd3ec000 r--p 0000c000 fd:01 3754                       /lib64/libnss_files-2.12.so
        7f61cd3ec000-7f61cd3ed000 rw-p 0000d000 fd:01 3754                       /lib64/libnss_files-2.12.so
        7f61cd3f7000-7f61cd406000 rw-s 00000000 00:10 1491041                    /dev/shm/PostgreSQL.1869824892
        7f61cd406000-7f63a0bd6000 rw-s 00000000 00:04 1491038                    /dev/zero (deleted)
        7f63a0bd6000-7f63a0bd9000 r-xp 00000000 fd:08 268834                     /usr/pgsql-10/lib/timescaledb.so
        7f63a0bd9000-7f63a0dd8000 ---p 00003000 fd:08 268834                     /usr/pgsql-10/lib/timescaledb.so
        7f63a0dd8000-7f63a0dd9000 rw-p 00002000 fd:08 268834                     /usr/pgsql-10/lib/timescaledb.so
        7f63a0dd9000-7f63a0dda000 r-xp 00000000 fd:08 266927                     /usr/pgsql-10/lib/auth_delay.so
        7f63a0dda000-7f63a0fd9000 ---p 00001000 fd:08 266927                     /usr/pgsql-10/lib/auth_delay.so
        7f63a0fd9000-7f63a0fda000 rw-p 00000000 fd:08 266927                     /usr/pgsql-10/lib/auth_delay.so
        7f63a0fda000-7f63a0fdc000 r-xp 00000000 fd:08 266928                     /usr/pgsql-10/lib/auto_explain.so
        7f63a0fdc000-7f63a11dc000 ---p 00002000 fd:08 266928                     /usr/pgsql-10/lib/auto_explain.so
        7f63a11dc000-7f63a11dd000 rw-p 00002000 fd:08 266928                     /usr/pgsql-10/lib/auto_explain.so
        7f63a11dd000-7f63a11e5000 r-xp 00000000 fd:08 266954                     /usr/pgsql-10/lib/pg_stat_statements.so
        7f63a11e5000-7f63a13e4000 ---p 00008000 fd:08 266954                     /usr/pgsql-10/lib/pg_stat_statements.so
        7f63a13e4000-7f63a13e5000 rw-p 00007000 fd:08 266954                     /usr/pgsql-10/lib/pg_stat_statements.so
        7f63a13e5000-7f63a1433000 r-xp 00000000 fd:01 644                        /lib64/libldap_r-2.4.so.2.5.6
        7f63a1433000-7f63a1633000 ---p 0004e000 fd:01 644                        /lib64/libldap_r-2.4.so.2.5.6
        7f63a1633000-7f63a1634000 r--p 0004e000 fd:01 644                        /lib64/libldap_r-2.4.so.2.5.6
        7f63a1634000-7f63a1636000 rw-p 0004f000 fd:01 644                        /lib64/libldap_r-2.4.so.2.5.6
        7f63a1636000-7f63a1638000 rw-p 00000000 00:00 0 
        7f63a1638000-7f63a166a000 r-xp 00000000 fd:01 5189                       /lib64/libidn.so.11.6.1
        7f63a166a000-7f63a1869000 ---p 00032000 fd:01 5189                       /lib64/libidn.so.11.6.1
        7f63a1869000-7f63a186a000 rw-p 00031000 fd:01 5189                       /lib64/libidn.so.11.6.1
        7f63a186a000-7f63a18af000 r-xp 00000000 fd:08 266281                     /usr/pgsql-10/lib/libpq.so.5.10
        7f63a18af000-7f63a1aaf000 ---p 00045000 fd:08 266281                     /usr/pgsql-10/lib/libpq.so.5.10
        7f63a1aaf000-7f63a1ab2000 rw-p 00045000 fd:08 266281                     /usr/pgsql-10/lib/libpq.so.5.10
        7f63a1ab2000-7f63a1b5a000 r-xp 00000000 fd:08 268586                     /usr/pgsql-10/lib/citus.so
        7f63a1b5a000-7f63a1d5a000 ---p 000a8000 fd:08 268586                     /usr/pgsql-10/lib/citus.so
        7f63a1d5a000-7f63a1d5e000 rw-p 000a8000 fd:08 268586                     /usr/pgsql-10/lib/citus.so
        7f63a1d5e000-7f63a1d64000 rw-p 00000000 00:00 0 
        7f63a1d64000-7f63a7bf5000 r--p 00000000 fd:08 21655                      /usr/lib/locale/locale-archive
        7f63a7bf5000-7f63a7bfd000 rw-p 00000000 00:00 0 
        7f63a7bfd000-7f63a7ce5000 r-xp 00000000 fd:08 1530                       /usr/lib64/libstdc++.so.6.0.13
        7f63a7ce5000-7f63a7ee5000 ---p 000e8000 fd:08 1530                       /usr/lib64/libstdc++.so.6.0.13
        7f63a7ee5000-7f63a7eec000 r--p 000e8000 fd:08 1530                       /usr/lib64/libstdc++.so.6.0.13
        7f63a7eec000-7f63a7eee000 rw-p 000ef000 fd:08 1530                       /usr/lib64/libstdc++.so.6.0.13
        7f63a7eee000-7f63a7f03000 rw-p 00000000 00:00 0 
        7f63a7f03000-7f63a8e48000 r-xp 00000000 fd:08 22589                      /usr/lib64/libicudata.so.42.1
        7f63a8e48000-7f63a9047000 ---p 00f45000 fd:08 22589                      /usr/lib64/libicudata.so.42.1
        7f63a9047000-7f63a9048000 rw-p 00f44000 fd:08 22589                      /usr/lib64/libicudata.so.42.1
        7f63a9048000-7f63a904e000 rw-p 00000000 00:00 0 
        7f63a904e000-7f63a9065000 r-xp 00000000 fd:01 5166                       /lib64/libaudit.so.1.0.0
        7f63a9065000-7f63a9264000 ---p 00017000 fd:01 5166                       /lib64/libaudit.so.1.0.0
        7f63a9264000-7f63a9265000 r--p 00016000 fd:01 5166                       /lib64/libaudit.so.1.0.0
        7f63a9265000-7f63a926a000 rw-p 00017000 fd:01 5166                       /lib64/libaudit.so.1.0.0
        7f63a926a000-7f63a927f000 r-xp 00000000 fd:01 5181                       /lib64/libz.so.1.2.3
        7f63a927f000-7f63a947e000 ---p 00015000 fd:01 5181                       /lib64/libz.so.1.2.3
        7f63a947e000-7f63a947f000 r--p 00014000 fd:01 5181                       /lib64/libz.so.1.2.3
        7f63a947f000-7f63a9480000 rw-p 00015000 fd:01 5181                       /lib64/libz.so.1.2.3
        7f63a9480000-7f63a9481000 rw-p 00000000 00:00 0 
        7f63a9481000-7f63a95c0000 r-xp 00000000 fd:08 22601                      /usr/lib64/libicuuc.so.42.1
        7f63a95c0000-7f63a97c0000 ---p 0013f000 fd:08 22601                      /usr/lib64/libicuuc.so.42.1
        7f63a97c0000-7f63a97d1000 rw-p 0013f000 fd:08 22601                      /usr/lib64/libicuuc.so.42.1
        7f63a97d1000-7f63a97d3000 rw-p 00000000 00:00 0 
        7f63a97d3000-7f63a995b000 r-xp 00000000 fd:08 22591                      /usr/lib64/libicui18n.so.42.1
        7f63a995b000-7f63a9b5b000 ---p 00188000 fd:08 22591                      /usr/lib64/libicui18n.so.42.1
        7f63a9b5b000-7f63a9b68000 rw-p 00188000 fd:08 22591                      /usr/lib64/libicui18n.so.42.1
        7f63a9b68000-7f63a9b6d000 rw-p 00000000 00:00 0 
        7f63a9b75000-7f63a9b76000 rw-p 00000000 00:00 0 
        7f63a9b76000-7f63a9b77000 rw-s 00000000 00:04 262145                     /SYSV0052e2c1 (deleted)
        7f63a9b77000-7f63a9b78000 rw-p 00000000 00:00 0 
        7fff4a3fc000-7fff4a411000 rw-p 00000000 00:00 0                          [stack]
        7fff4a4c3000-7fff4a4c4000 r-xp 00000000 00:00 0                          [vdso]
        ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
        2019-11-24 11:07:53.735 CST 14713 127.0.0.1(39348) lobausr lobadbw2 citus - 10.230.239.36:49809 LOG:  process 14713 still waiting for ShareLock on transaction 971951123 after 1000.135 ms
        2019-11-24 11:07:53.735 CST 14713 127.0.0.1(39348) lobausr lobadbw2 citus - 10.230.239.36:49809 DETAIL:  Process holding the lock: 0. Wait queue: 14713.
        2019-11-24 11:07:53.735 CST 14713 127.0.0.1(39348) lobausr lobadbw2 citus - 10.230.239.36:49809 CONTEXT:  while deleting tuple (88025,10) in relation "loba_tt_ldcs_zcd_undo_103061"
        2019-11-24 11:07:53.735 CST 14713 127.0.0.1(39348) lobausr lobadbw2 citus - 10.230.239.36:49809 STATEMENT:  DELETE FROM lobauser.loba_tt_ldcs_zcd_undo_103061 loba_tt_ldcs_zcd_undo WHERE (((zcbsf)::text OPERATOR(pg_catalog.=) $1) AND ((exidv)::text OPERATOR(pg_catalog.=) $2))
        2019-11-24 11:07:53.751 CST 5930 127.0.0.1(38471) lobausr lobadbw2 citus - 10.230.239.36:47339 LOG:  process 5930 still waiting for AccessExclusiveLock on tuple (88025,10) of relation 129161 of database 16397 after 1000.107 ms
        2019-11-24 11:07:53.751 CST 5930 127.0.0.1(38471) lobausr lobadbw2 citus - 10.230.239.36:47339 DETAIL:  Process holding the lock: 14713. Wait queue: 5930, 14710, 13352.
        2019-11-24 11:07:53.751 CST 5930 127.0.0.1(38471) lobausr lobadbw2 citus - 10.230.239.36:47339 STATEMENT:  DELETE FROM lobauser.loba_tt_ldcs_zcd_undo_103061 loba_tt_ldcs_zcd_undo WHERE (((zcbsf)::text OPERATOR(pg_catalog.=) $1) AND ((exidv)::text OPERATOR(pg_catalog.=) $2))
        2019-11-24 11:07:53.767 CST 14710 127.0.0.1(39345) lobausr lobadbw2 citus - 10.230.239.36:49771 LOG:  process 14710 still waiting for AccessExclusiveLock on tuple (88025,10) of relation 129161 of database 16397 after 1000.065 ms
        2019-11-24 11:07:53.767 CST 14710 127.0.0.1(39345) lobausr lobadbw2 citus - 10.230.239.36:49771 DETAIL:  Process holding the lock: 14713. Wait queue: 5930, 14710, 13352.
        2019-11-24 11:07:53.767 CST 14710 127.0.0.1(39345) lobausr lobadbw2 citus - 10.230.239.36:49771 STATEMENT:  DELETE FROM lobauser.loba_tt_ldcs_zcd_undo_103061 loba_tt_ldcs_zcd_undo WHERE (((zcbsf)::text OPERATOR(pg_catalog.=) $1) AND ((exidv)::text OPERATOR(pg_catalog.=) $2))
        2019-11-24 11:07:53.776 CST 13352 127.0.0.1(39181) lobausr lobadbw2 citus - 10.230.239.36:19373 LOG:  process 13352 still waiting for AccessExclusiveLock on tuple (88025,10) of relation 129161 of database 16397 after 1000.071 ms
        2019-11-24 11:07:53.776 CST 13352 127.0.0.1(39181) lobausr lobadbw2 citus - 10.230.239.36:19373 DETAIL:  Process holding the lock: 14713. Wait queue: 5930, 14710, 13352.
        2019-11-24 11:07:53.776 CST 13352 127.0.0.1(39181) lobausr lobadbw2 citus - 10.230.239.36:19373 STATEMENT:  DELETE FROM lobauser.loba_tt_ldcs_zcd_undo_103061 loba_tt_ldcs_zcd_undo WHERE (((zcbsf)::text OPERATOR(pg_catalog.=) $1) AND ((exidv)::text OPERATOR(pg_catalog.=) $2))
 
        2019-11-24 11:08:09.854 CST 14668     LOG:  server process (PID 14714) was terminated by signal 6: Aborted
        2019-11-24 11:08:09.854 CST 14668     DETAIL:  Failed process was running: COMMIT PREPARED 'citus_0_35791_4207001212_1287199'
        2019-11-24 11:08:09.854 CST 14668     LOG:  terminating any other active server processes
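
To see which region the address in the glibc message (0x0000000001a977a0) belongs to, it can be looked up in the memory map above. The following is a minimal sketch: `MAPS` holds a short excerpt of that map with whitespace collapsed, and `find_mapping` is a hypothetical helper written for this illustration, not an existing tool.

```python
from typing import Optional, Tuple

# Excerpt of the "Memory map" section above (columns collapsed to single spaces).
MAPS = """\
00400000-00a99000 r-xp 00000000 fd:08 266747 /usr/pgsql-10/bin/postgres
00c99000-00ca7000 rw-p 00699000 fd:08 266747 /usr/pgsql-10/bin/postgres
00ca7000-00d10000 rw-p 00000000 00:00 0
0190c000-019b3000 rw-p 00000000 00:00 0
019b3000-01c5d000 rw-p 00000000 00:00 0
369e200000-369e38a000 r-xp 00000000 fd:01 3766 /lib64/libc-2.12.so
"""


def find_mapping(maps_text: str, address: int) -> Optional[Tuple[str, str, str]]:
    """Return (range, permissions, path) of the mapping containing address."""
    for line in maps_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        start_text, end_text = parts[0].split("-")
        start, end = int(start_text, 16), int(end_text, 16)
        if start <= address < end:  # the end address is exclusive
            path = parts[5] if len(parts) > 5 else "[anonymous]"
            return parts[0], parts[1], path
    return None


print(find_mapping(MAPS, 0x0000000001A977A0))
# ('019b3000-01c5d000', 'rw-p', '[anonymous]')
```

The address falls in an anonymous read-write mapping just above the postgres data segments. On Linux that is typically the process heap, though the map in this report does not label it as such, so treat that reading as an assumption.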

3.2 stacktrace

    (gdb) bt
    #0  0x000000369e232625 in raise () from /lib64/libc.so.6
    #1  0x000000369e233e05 in abort () from /lib64/libc.so.6
    #2  0x000000369e270537 in __libc_message () from /lib64/libc.so.6
    #3  0x000000369e275f4e in malloc_printerr () from /lib64/libc.so.6
    #4  0x000000369e278cf0 in _int_free () from /lib64/libc.so.6
    #5  0x00000000004ff947 in XLogReaderFree (state=0x1a403a8) at xlogreader.c:141
    #6  0x00000000004e4387 in XlogReadTwoPhaseData (lsn=32886947137584, buf=0x7fff4a40ec38, len=0x0) at twophase.c:1341
    #7  0x00000000004e5849 in FinishPreparedTransaction (gid=0x19d7830 "citus_0_35791_4207001212_1287199", isCommit=1 '\001') at twophase.c:1411
    #8  0x000000000072d601 in standard_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
        queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at utility.c:460
    #9  0x00007f63a1ae97e1 in multi_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
        queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at executor/multi_utility.c:254
    #10 0x00007f63a11e1178 in pgss_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
        queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at pg_stat_statements.c:998
    #11 0x0000000000729388 in PortalRunUtility (portal=0x1a27128, pstmt=0x19d7ba0, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x19d7c80,
        completionTag=<value optimized out>) at pquery.c:1178
    #12 0x000000000072a2fd in PortalRunMulti (portal=0x1a27128, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x19d7c80, altdest=0x19d7c80, completionTag=0x7fff4a40f260 "")
        at pquery.c:1331
    #13 0x000000000072aa98 in PortalRun (portal=0x1a27128, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x19d7c80, altdest=0x19d7c80, completionTag=0x7fff4a40f260 "")
        at pquery.c:799
    #14 0x0000000000727051 in exec_simple_query (query_string=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'") at postgres.c:1122
    #15 0x0000000000728039 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1952b48 "lobadbw2", username=<value optimized out>) at postgres.c:4117
    #16 0x00000000006bb43a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4405
    #17 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4077
    #18 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1755
    #19 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1363
    #20 0x000000000063b4d0 in main (argc=3, argv=0x190c520) at main.c:228
    (gdb) f 5
    #5  0x00000000004ff947 in XLogReaderFree (state=0x1a403a8) at xlogreader.c:141
    141    xlogreader.c: No such file or directory.
        in xlogreader.c
    (gdb) p state->readRecordBuf
    $3 = 0x1a977d8 "M\001"
    (gdb) p state->readRecordBufSize
    $7 = 40960
    (gdb) 
    


--
LIANGBO@suning.com




Re: [Incident report]Backend process crashed when executing 2pc transaction

From
Amit Langote
Date:
Hello,

On Wed, Nov 27, 2019 at 8:59 PM LIANGBO <liangboa@suning.com> wrote:
> I've met the following problem in our product environment. We tried to reproduce the problem, but because of the low
> probability of occurrence, we could not reproduce it.
> 1. phenomenon
> Backend process crashed when executing 2pc transaction in citus.
>
> 2. Occurrence condition
> Distributed transaction in business SQL
>
> PostgreSQL:10.7
> citus:7.4.1
> OS:RHEL6.3
>
> 3.2 stacktrace
>
>         (gdb) bt
>         #0  0x000000369e232625 in raise () from /lib64/libc.so.6
>         #1  0x000000369e233e05 in abort () from /lib64/libc.so.6
>         #2  0x000000369e270537 in __libc_message () from /lib64/libc.so.6
>         #3  0x000000369e275f4e in malloc_printerr () from /lib64/libc.so.6
>         #4  0x000000369e278cf0 in _int_free () from /lib64/libc.so.6
>         #5  0x00000000004ff947 in XLogReaderFree (state=0x1a403a8) at xlogreader.c:141
>         #6  0x00000000004e4387 in XlogReadTwoPhaseData (lsn=32886947137584, buf=0x7fff4a40ec38, len=0x0) at twophase.c:1341
>         #7  0x00000000004e5849 in FinishPreparedTransaction (gid=0x19d7830 "citus_0_35791_4207001212_1287199", isCommit=1 '\001') at twophase.c:1411
>         #8  0x000000000072d601 in standard_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
>             queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at utility.c:460
>         #9  0x00007f63a1ae97e1 in multi_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
>             queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at executor/multi_utility.c:254
>         #10 0x00007f63a11e1178 in pgss_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,

Have you considered *also* reporting this to Citus developers, because
while the crash seems to have occurred in the core PostgreSQL code
they may have a better chance reproducing this if at all.

Thanks,
Amit



Re: [Incident report]Backend process crashed when executing 2pc transaction

From
Michael Paquier
Date:
On Thu, Nov 28, 2019 at 01:24:00PM +0900, Amit Langote wrote:
> Have you considered *also* reporting this to Citus developers, because
> while the crash seems to have occurred in the core PostgreSQL code
> they may have a better chance reproducing this if at all.

Hard to fully conclude with the information at hand.  Still, if you
look at the backtrace, it complains about readRecordBuf being already
free'd, which is something that happens only if it is not NULL and
only when freeing the reader.  The thing is that this area is used
only as a temporary buffer for a record being read, which may
optionally get extended.  Please note as well that the stack trace
mentions multi_ProcessUtility(), which is not Postgres code.  So my
gut actually tells me that this is a Citus-only bug, and that there is
an issue with some memory context cleanup in a xact callback or such.
Just a guess, but this could explain why the memory area of
readRecordBuf just went magically away.
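
The buffer lifecycle described above can be sketched roughly as follows. This is a simplified stand-in, not the PostgreSQL source: the `SketchReader` type and the `sketch_reader_*` functions are hypothetical, plain `malloc`/`free` replace the server's own allocator, and only the field names `readRecordBuf` and `readRecordBufSize` are taken from the gdb session earlier in the thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified reader state (not the real XLogReaderState). */
typedef struct SketchReader
{
    char   *readRecordBuf;      /* temporary buffer for the record being read */
    size_t  readRecordBufSize;  /* current size of that buffer in bytes */
} SketchReader;

/* Allocate a reader together with its initial record buffer. */
static SketchReader *
sketch_reader_allocate(size_t initial_size)
{
    SketchReader *state = calloc(1, sizeof(SketchReader));

    if (state == NULL)
        return NULL;
    state->readRecordBuf = malloc(initial_size);
    if (state->readRecordBuf == NULL)
    {
        free(state);
        return NULL;
    }
    state->readRecordBufSize = initial_size;
    return state;
}

/* Optionally extend the buffer when a larger record has to fit. */
static int
sketch_reader_extend(SketchReader *state, size_t needed)
{
    char *newbuf;

    if (needed <= state->readRecordBufSize)
        return 1;               /* already large enough */
    newbuf = malloc(needed);
    if (newbuf == NULL)
        return 0;
    free(state->readRecordBuf); /* old contents are not needed */
    state->readRecordBuf = newbuf;
    state->readRecordBufSize = needed;
    return 1;
}

/* Free the reader; the buffer is released only if it is not NULL. */
static void
sketch_reader_free(SketchReader *state)
{
    /*
     * If something else had already freed this chunk, or had overwritten
     * the allocator's bookkeeping next to it, the C library may detect
     * that here and abort the process.
     */
    if (state->readRecordBuf)
        free(state->readRecordBuf);
    free(state);
}
```

In this sketch the buffer is owned by the reader from allocation to free, so an abort inside the final `free` would point to something outside these three functions having touched the memory in between.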

If you can produce a test case with just Postgres, that's another
story of course, and if it were a bug in Postgres, I would imagine
that a simple pgbench test running a lot of 2PC transactions in
parallel may be able to reproduce it after some time.
--
Michael


Re: [Incident report]Backend process crashed when executing 2pc transaction

From
"LIANGBO"
Date:
Hello:

> Have you considered *also* reporting this to Citus developers, because while the crash seems to have occurred in the
> core PostgreSQL code they may have a better chance reproducing this if at all.

I've sent this issue to the Citus community, and received the reply "Just a note that this appears to be a
bug in Postgres 2PC code."
https://github.com/citusdata/citus/issues/3228



-----Original Message-----
From: Amit Langote [mailto:amitlangote09@gmail.com]
Sent: November 28, 2019 12:24
To: LIANGBO
Cc: PostgreSQL Hackers
Subject: Re: [Incident report]Backend process crashed when executing 2pc transaction

Hello,

On Wed, Nov 27, 2019 at 8:59 PM LIANGBO <liangboa@suning.com> wrote:
> I've met the following problem in our product environment. We tried to reproduce the problem, but because of the low
> probability of occurrence, we could not reproduce it.
> 1. phenomenon
> Backend process crashed when executing 2pc transaction in citus.
>
> 2. Occurrence condition
> Distributed transaction in business SQL
>
> PostgreSQL:10.7
> citus:7.4.1
> OS:RHEL6.3
>
> 3.2 stacktrace
>
>         (gdb) bt
>         #0  0x000000369e232625 in raise () from /lib64/libc.so.6
>         #1  0x000000369e233e05 in abort () from /lib64/libc.so.6
>         #2  0x000000369e270537 in __libc_message () from /lib64/libc.so.6
>         #3  0x000000369e275f4e in malloc_printerr () from /lib64/libc.so.6
>         #4  0x000000369e278cf0 in _int_free () from /lib64/libc.so.6
>         #5  0x00000000004ff947 in XLogReaderFree (state=0x1a403a8) at xlogreader.c:141
>         #6  0x00000000004e4387 in XlogReadTwoPhaseData (lsn=32886947137584, buf=0x7fff4a40ec38, len=0x0) at twophase.c:1341
>         #7  0x00000000004e5849 in FinishPreparedTransaction (gid=0x19d7830 "citus_0_35791_4207001212_1287199", isCommit=1 '\001') at twophase.c:1411
>         #8  0x000000000072d601 in standard_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
>             queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at utility.c:460
>         #9  0x00007f63a1ae97e1 in multi_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
>             queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at executor/multi_utility.c:254
>         #10 0x00007f63a11e1178 in pgss_ProcessUtility (pstmt=0x19d7ba0, queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,

Have you considered *also* reporting this to Citus developers, because while the crash seems to have occurred in the
core PostgreSQL code they may have a better chance reproducing this if at all.

Thanks,
Amit






Re: [Incident report]Backend process crashed when executing 2pc transaction

From
Amit Langote
Date:
On Thu, Nov 28, 2019 at 2:00 PM LIANGBO <liangboa@suning.com> wrote:
>
> Hello:
>
> > Have you considered *also* reporting this to Citus developers, because while the crash seems to have occurred in
> > the core PostgreSQL code they may have a better chance reproducing this if at all.
>
> I've sent this issue to the citus community, and then received the reply with "Just a note that this appears to be a
> bug in Postgres 2PC code.".
> https://github.com/citusdata/citus/issues/3228

Interesting.  Still, I think you'd be in better position than anyone
else to come up with reproduction steps for vanilla PostgreSQL by
analyzing the stack trace if and when the crash next occurs (or using
the existing core dump).  It's hard to tell by only guessing what may
have gone wrong when there is external code involved, especially
something like Citus that hooks into many points within vanilla
PostgreSQL.

Thanks,
Amit



Re: [Incident report]Backend process crashed when executing 2pc transaction

From
Marco Slot
Date:
On Thu, Nov 28, 2019 at 6:18 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Interesting.  Still, I think you'd be in better position than anyone
> else to come up with reproduction steps for vanilla PostgreSQL by
> analyzing the stack trace if and when the crash next occurs (or using
> the existing core dump).  It's hard to tell by only guessing what may
> have gone wrong when there is external code involved, especially
> something like Citus that hooks into many points within vanilla
> PostgreSQL.

To clarify: In a Citus cluster you typically have a coordinator which
contains the "distributed tables" and one or more workers which
contain the data. All are PostgreSQL servers with the citus extension.
The coordinator uses every available hook in PostgreSQL to make the
distributed tables behave like regular tables. Any crash on the
coordinator is likely to be attributable to Citus, because most of the
code that is exercised is Citus code. The workers are used as regular
PostgreSQL servers with the coordinator acting as a regular client. On
the worker, the ProcessUtility hook will just pass on the arguments to
standard_ProcessUtility without any processing. The crash happened on
a worker.

One interesting thing is the prepared transaction name generated by
the coordinator, which follows the form: citus_<coordinator node
id>_<pid>_<server-wide transaction number >_<prepared transaction
number in session>. The server-wide transaction number is a 64-bit
counter that is kept in shared memory and starts at 1. That means that
over 4 billion (4207001212) transactions happened on the coordinator
since the server started, which quite possibly resulted in 4 billion
prepared transactions on this particular server. I'm wondering if some
counter is overflowing.
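
As a rough illustration of the name format described above, the sketch below splits a gid into its four fields and reports how far the transaction number is from the largest 32-bit unsigned value. The struct and both helper functions are hypothetical and are not taken from Citus or PostgreSQL. For the gid in this report, the transaction number 4207001212 is still about 88 million below 4294967295.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Fields of a gid of the form
 * citus_<node id>_<pid>_<server-wide transaction number>_<prepared number>.
 * Hypothetical type, for illustration only.
 */
typedef struct CitusGidParts
{
    int         node_id;
    int         pid;
    uint64_t    txn_number;
    uint32_t    prepared_number;
} CitusGidParts;

/* Returns true only if the whole string matches the layout above. */
static bool
parse_citus_gid(const char *gid, CitusGidParts *out)
{
    char    trailing;
    int     n;

    /* The final %c only matches if unexpected characters follow the number. */
    n = sscanf(gid, "citus_%d_%d_%" SCNu64 "_%" SCNu32 "%c",
               &out->node_id, &out->pid,
               &out->txn_number, &out->prepared_number, &trailing);
    return n == 4;
}

/* How many more transactions fit before the number exceeds 32 bits. */
static uint64_t
headroom_below_uint32_max(uint64_t txn_number)
{
    if (txn_number > UINT32_MAX)
        return 0;
    return (uint64_t) UINT32_MAX - txn_number;
}
```

Calling `parse_citus_gid("citus_0_35791_4207001212_1287199", &parts)` fills in node 0, pid 35791 (the coordinator backend seen in the log), transaction number 4207001212 and prepared number 1287199. Whether any 32-bit counter is actually involved in the crash remains an open question in this thread.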

cheers,
Marco



Re: [Incident report]Backend process crashed when executing 2pc transaction

From
Amit Langote
Date:
Hi Marco,

On Thu, Nov 28, 2019 at 5:02 PM Marco Slot <marco@citusdata.com> wrote:
>
> On Thu, Nov 28, 2019 at 6:18 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Interesting.  Still, I think you'd be in better position than anyone
> > else to come up with reproduction steps for vanilla PostgreSQL by
> > analyzing the stack trace if and when the crash next occurs (or using
> > the existing core dump).  It's hard to tell by only guessing what may
> > have gone wrong when there is external code involved, especially
> > something like Citus that hooks into many points within vanilla
> > PostgreSQL.
>
> To clarify: In a Citus cluster you typically have a coordinator which
> contains the "distributed tables" and one or more workers which
> contain the data. All are PostgreSQL servers with the citus extension.
> The coordinator uses every available hook in PostgreSQL to make the
> distributed tables behave like regular tables. Any crash on the
> coordinator is likely to be attributable to Citus, because most of the
> code that is exercised is Citus code. The workers are used as regular
> PostgreSQL servers with the coordinator acting as a regular client. On
> the worker, the ProcessUtility hook will just pass on the arguments to
> standard_ProcessUtility without any processing. The crash happened on
> a worker.

Thanks for clarifying.

> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form: citus_<coordinator node
> id>_<pid>_<server-wide transaction number >_<prepared transaction
> number in session>. The server-wide transaction number is a 64-bit
> counter that is kept in shared memory and starts at 1. That means that
> over 4 billion (4207001212) transactions happened on the coordinator
> since the server started, which quite possibly resulted in 4 billion
> prepared transactions on this particular server. I'm wondering if some
> counter is overflowing.

Interesting.  This does kind of get us closer to figuring out what
might have gone wrong, but it is hard to tell without the core dump at hand.

Thanks,
Amit



RE: [Incident report]Backend process crashed when executing 2pc transaction

From
Ranier Vilela
Date:
Marco wrote:
> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form: citus_<coordinator node
> id>_<pid>_<server-wide transaction number >_<prepared transaction
> number in session>. The server-wide transaction number is a 64-bit
> counter that is kept in shared memory and starts at 1. That means that
> over 4 billion (4207001212) transactions happened on the coordinator
> since the server started, which quite possibly resulted in 4 billion
> prepared transactions on this particular server. I'm wondering if some
> counter is overflowing.

Amit wrote:
> Interesting.  This does kind of get us closer to figuring out what
> might have gone wrong, but it is hard to tell without the core dump at hand.

If something is corrupting memory only rarely, it would be worth considering all the possibilities.
The MemSet call on line 785 (twophase.c) has an error alert.
The size of the buffer behind the "_vstart" variable is not a multiple of the size of type long.
Maybe it's filling more than it should.
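
To make the concern concrete: a loop that zeroes memory one long at a time would write past the end of a buffer whose length is not a multiple of sizeof(long), unless it checks alignment first. The sketch below shows such a guard with a memset fallback. All names are hypothetical and this is not the MemSet macro itself; whether the call on line 785 can actually reach an unguarded path is not established in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low bits that must be zero for a long-aligned address or length. */
#define SKETCH_LONG_ALIGN_MASK (sizeof(long) - 1)

/* True when both the start address and the length are multiples of sizeof(long). */
static int
sketch_is_long_aligned(const void *start, size_t len)
{
    return (((uintptr_t) start) & SKETCH_LONG_ALIGN_MASK) == 0 &&
           (len & SKETCH_LONG_ALIGN_MASK) == 0;
}

/*
 * Zero exactly len bytes starting at start.
 * Returns 1 if the word-at-a-time loop was used, 0 if it fell back to memset.
 */
static int
sketch_zero_fill(void *start, size_t len)
{
    if (sketch_is_long_aligned(start, len))
    {
        long *p = (long *) start;
        long *stop = (long *) ((char *) start + len);

        /* Safe only because len is a whole number of longs. */
        while (p < stop)
            *p++ = 0;
        return 1;
    }

    /* Unaligned start or odd length: let memset handle the exact byte count. */
    memset(start, 0, len);
    return 0;
}
```

With a guard of this kind, a length such as sizeof(long) + 1 takes the memset branch and the byte after the requested range stays untouched.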

Ranier Vilela