Discussion: logical decoding - reading a user catalog table
My logical decoding plugin is occasionally getting this error:

"could not resolve cmin/cmax of catalog tuple"

I get this when my output plugin is trying to read one of the user-defined
catalog tables (user_catalog_table=true). I am not sure if this is a bug in
the time-travel support in logical decoding or if I'm just using it wrong
(i.e. not getting a sufficient lock on the relation or something).

This is the interesting part of the stack trace:

#4  0x000000000091bbc8 in HeapTupleSatisfiesHistoricMVCC (htup=0x7fffcf42a900,
    snapshot=0x7f786ffe92d8, buffer=10568) at tqual.c:1631
#5  0x00000000004aedf3 in heapgetpage (scan=0x28d7080, page=0) at heapam.c:399
#6  0x00000000004b0182 in heapgettup_pagemode (scan=0x28d7080,
    dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:747
#7  0x00000000004b1ba6 in heap_getnext (scan=0x28d7080,
    direction=ForwardScanDirection) at heapam.c:1475
#8  0x00007f787002dbfb in lookupSlonyInfo (tableOid=91754, ctx=0x2826118,
    origin_id=0x7fffcf42ab8c, table_id=0x7fffcf42ab88, set_id=0x7fffcf42ab84)
    at slony_logical.c:663
#9  0x00007f787002b7a3 in pg_decode_change (ctx=0x2826118, txn=0x28cbec0,
    relation=0x7f787a3446a8, change=0x7f786ffe3268) at slony_logical.c:237
#10 0x00000000007497d4 in change_cb_wrapper (cache=0x28cbda8, txn=0x28cbec0,
    relation=0x7f787a3446a8, change=0x7f786ffe3268) at logical.c:704

Here is what the code in lookupSlonyInfo is doing
------------------

	sltable_oid = get_relname_relid("sl_table",slony_namespace);

	sltable_rel = relation_open(sltable_oid,AccessShareLock);
	tupdesc=RelationGetDescr(sltable_rel);
	scandesc=heap_beginscan(sltable_rel,
		GetCatalogSnapshot(sltable_oid),0,NULL);
	reloid_attnum = get_attnum(sltable_oid,"tab_reloid");

	if(reloid_attnum == InvalidAttrNumber)
		elog(ERROR,"sl_table does not have a tab_reloid column");
	set_attnum = get_attnum(sltable_oid,"tab_set");

	if(set_attnum == InvalidAttrNumber)
		elog(ERROR,"sl_table does not have a tab_set column");
	tableid_attnum = get_attnum(sltable_oid,"tab_id");

	if(tableid_attnum == InvalidAttrNumber)
		elog(ERROR,"sl_table does not have a tab_id column");

	while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))
On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
> My logical decoding plugin is occasionally getting this error
>
> "could not resolve cmin/cmax of catalog tuple"
>
> I get this when my output plugin is trying to read one of the user defined
> catalog tables (user_catalog_table=true)

Hm. That should obviously not happen.

Could you describe how that table is modified? Does that bug happen
initially, or only after a while?

Do you have a testcase that would allow me to easily reproduce the problem?

> I am not sure if this is a bug in the time-travel support in the logical
> decoding support or if I'm just using it wrong (ie not getting a
> sufficient lock on the relation or something).

I don't know yet...

> Here is what the code in lookupSlonyInfo is doing
> ------------------
>
> 	sltable_oid = get_relname_relid("sl_table",slony_namespace);
>
> 	sltable_rel = relation_open(sltable_oid,AccessShareLock);
> 	tupdesc=RelationGetDescr(sltable_rel);
> 	scandesc=heap_beginscan(sltable_rel,
> 		GetCatalogSnapshot(sltable_oid),0,NULL);
> 	reloid_attnum = get_attnum(sltable_oid,"tab_reloid");
>
> 	if(reloid_attnum == InvalidAttrNumber)
> 		elog(ERROR,"sl_table does not have a tab_reloid column");
> 	set_attnum = get_attnum(sltable_oid,"tab_set");
>
> 	if(set_attnum == InvalidAttrNumber)
> 		elog(ERROR,"sl_table does not have a tab_set column");
> 	tableid_attnum = get_attnum(sltable_oid, "tab_id");
>
> 	if(tableid_attnum == InvalidAttrNumber)
> 		elog(ERROR,"sl_table does not have a tab_id column");
>
> 	while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))

(Except missing spaces ;)) I don't see anything obviously wrong with this.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 10/28/2014 01:31 PM, Andres Freund wrote:
> On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
>> My logical decoding plugin is occasionally getting this error
>>
>> "could not resolve cmin/cmax of catalog tuple"
>>
>> I get this when my output plugin is trying to read one of the user
>> defined catalog tables (user_catalog_table=true)
>
> Hm. That should obviously not happen.
>
> Could you describe how that table is modified? Does that bug happen
> initially, or only after a while?

It doesn't happen right away; in this case it was maybe 4 minutes after
creating the slot. The error also doesn't always happen when I run this
test workload, but it is reproducible with some trying.

I don't do anything special to that table: it gets created, then I do
inserts on it. I don't do an ALTER TABLE or anything fancy like that. I was
running the slony failover test (all nodes under the same postmaster),
which involves the occasional dropping and recreating of databases along
with normal query load plus replication.

I'll send you a tar of the data directory off list with things in this
state.

> Do you have a testcase that would allow me to easily reproduce the
> problem?

I don't have an isolated test case that does this. The test that I'm
hitting this with does lots of stuff and doesn't even always hit this.

>> I am not sure if this is a bug in the time-travel support in the logical
>> decoding support or if I'm just using it wrong (ie not getting a
>> sufficient lock on the relation or something).
>
> I don't know yet...
Hi Steve,

On 2014-10-28 22:38:29 -0400, Steve Singer wrote:
> On 10/28/2014 01:31 PM, Andres Freund wrote:
>> On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
>>> My logical decoding plugin is occasionally getting this error
>>>
>>> "could not resolve cmin/cmax of catalog tuple"
>>>
>>> I get this when my output plugin is trying to read one of the user
>>> defined catalog tables (user_catalog_table=true)
>>
>> Hm. That should obviously not happen.

I've pushed a fix for a bug that could possibly also cause this. Although
it'd be odd that it always hits the user catalog table. Except if your
tests mostly modify the slony tables, but do not do much DDL otherwise?

> I'll send you a tar of the data directory off list with things in this
> state.
>
>> Do you have a testcase that would allow me to easily reproduce the
>> problem?
>
> I don't have an isolated test case that does this. The test that I'm
> hitting this with does lots of stuff and doesn't even always hit this.

If it still happens, could you send me instructions for how to reproduce
the problem after cloning the necessary source repositories? It's quite
hard to validate a possible fix otherwise.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 11/13/2014 02:44 PM, Andres Freund wrote:
> I've pushed a fix for a bug that could possibly also cause this. Although
> it'd be odd that it always hits the user catalog table. Except if your
> tests mostly modify the slony tables, but do not do much DDL otherwise?

The test I was running doesn't do DDL, so only the user catalog tables
would have changes being tracked. I still sometimes get the error. When I
get some time on the weekend I'll work on some detailed instructions on how
to grab and set up the various components to replicate this test.

Also, since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure
in a normal client connection, not the walsender:

#3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
    snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
    sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
    at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
    snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x00000000006c0e35 in exec_simple_query (
    query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
    ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
    \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
    coalesce(ev_provider_xid,\""...) at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,

I have no idea if this has anything to do with your recent changes or some
other change. I haven't so far been able to replicate that since the first
time I saw it.
On 11/13/2014 02:44 PM, Andres Freund wrote:
> If it still happens, could you send me instructions for how to reproduce
> the problem after cloning the necessary source repositories? It's quite
> hard to validate a possible fix otherwise.

1. Install PG 9.4.

2. Perform an initdb, then set:

      max_connections = 200
      wal_level = logical
      max_wal_senders = 50
      wal_keep_segments = 100
      wal_sender_timeout = 600s
      max_replication_slots = 120

   Create a user "slon" with superuser and a user "test" (set passwords for
   them; you'll use them later). Configure pg_hba.conf to allow slon to
   connect as a replication user.

3. Download slony from github
   (https://github.com/ssinger/slony1-engine.git) and check out the branch
   logical_remote_helper, then:

      ./configure --with-pgconfigdir=/path_to_your_pgbindir
      make
      make install

4. Download and compile clustertest from
   https://github.com/clustertest/clustertest-framework.

5. In the slony source tree, cd to the clustertest directory.

6. cp conf/disorder.properties.sample conf/disorder.properties and edit the
   file to have the correct paths, ports, users, and passwords.

7. cp conf/java.properties.sample conf/java.properties and edit the file to
   point at the JAR you downloaded earlier.

I run with all 5 databases on the same PG cluster. Performance will be much
better with 5 different clusters, but I recommend trying to emulate my
configuration as much as possible to replicate this.

To run the tests, do:

   ./run_all_disorder_tests.sh

At the moment (commit df5acfd1a3) it is configured to run just the failover
node test cases where I am seeing this, not the entire suite. It typically
takes between 30 minutes and an hour to run, though.

I installed things following the above steps on a different system than my
usual development laptop and I have been unable to reproduce the error so
far (on that system). But I am still able to reproduce it on occasion on my
normal development laptop.
On 11/16/2014 04:49 PM, Steve Singer wrote:
> I installed things following the above steps on a different system than
> my usual development laptop and I have been unable to reproduce the error
> so far (on that system). But I am still able to reproduce it on occasion
> on my normal development laptop.

After continuously running the test I was eventually able to reproduce the
error on the other system as well.
On 2014-11-17 10:33:52 -0500, Steve Singer wrote:
> On 11/16/2014 04:49 PM, Steve Singer wrote:
>> I installed things following the above steps on a different system than
>> my usual development laptop and I have been unable to reproduce the
>> error so far (on that system). But I am still able to reproduce it on
>> occasion on my normal development laptop.

Thanks for the testcase! I'll try to reproduce/fix.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
> The test I was running doesn't do DDL, so only the user catalog tables
> would have changes being tracked.
>
> I still sometimes get the error. When I get some time on the weekend I'll
> work on some detailed instructions on how to grab and set up the various
> components to replicate this test.
>
> Also, since updating (to 2c267e47afa4f9a7c) I've seen an assertion
> failure in a normal client connection, not the walsender:
>
> #3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
>     snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
>     sourcexid=sourcexid@entry=0) at predicate.c:1738
> #4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
>     at predicate.c:1517
> #5  GetSerializableTransactionSnapshot (
>     snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
> #6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
> #7  0x00000000006c0e35 in exec_simple_query (
>     query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
>     ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
>     \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
>     coalesce(ev_provider_xid,\""...) at postgres.c:959
> #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
>
> I have no idea if this has anything to do with your recent changes or
> some other change. I haven't so far been able to replicate that since the
> first time I saw it.

That crash is decidedly odd. Any chance you still have the full backtrace
around?

This is in the SSI code... I'm not immediately seeing how this could be
related to logical decoding. It can't even be an imported snapshot, because
the exported snapshot is marked repeatable read.

Are you actually using serializable transactions? If so, how and why?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 11/17/2014 10:37 AM, Andres Freund wrote:
> That crash is decidedly odd. Any chance you still have the full backtrace
> around?

Yes, I still have the core file.

> This is in the SSI code... I'm not immediately seeing how this could be
> related to logical decoding. It can't even be an imported snapshot,
> because the exported snapshot is marked repeatable read.
>
> Are you actually using serializable transactions? If so, how and why?

Yes, the test client that performs the 'simulated workload' does some
serializable transactions. It connects as a normal client via JDBC, does a
prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE"), and then
does some DML. Why? Because it seemed like a good thing to include in the
test suite.

You're right, this might have nothing to do with logical decoding. I
haven't been able to reproduce it again either; I can't even say if this
problem was introduced to 9.4 in the past month or if it has been around
much longer and I just haven't hit it before.
Hi,

Kevin: CCed you, because it doesn't really look like a logical decoding
related issue.

On 2014-11-17 11:25:40 -0500, Steve Singer wrote:
> On 11/17/2014 10:37 AM, Andres Freund wrote:
>> That crash is decidedly odd. Any chance you still have the full
>> backtrace around?
>
> Yes, I still have the core file.

Cool, could you show the full thing? Or did you just snip it because it's
just the Assert/ExceptionalCondition thing?

Could you print *snapshot in frame #3?

>> This is in the SSI code... I'm not immediately seeing how this could be
>> related to logical decoding. It can't even be an imported snapshot,
>> because the exported snapshot is marked repeatable read.
>>
>> Are you actually using serializable transactions? If so, how and why?
>
> Yes, the test client that performs the 'simulated workload' does some
> serializable transactions. It connects as a normal client via JDBC, does
> a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE"), and
> then does some DML. Why? Because it seemed like a good thing to include
> in the test suite.

Yes, it's a good reason, as the above backtrace proves ;). I'm just trying
to understand under which circumstances it happens.

The above backtrace looks like it can only be triggered if you do a BEGIN
TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; is that something you do?

> You're right, this might have nothing to do with logical decoding. I
> haven't been able to reproduce it again either; I can't even say if this
> problem was introduced to 9.4 in the past month or if it has been around
> much longer and I just haven't hit it before.

It's not hard to imagine that the safe/deferred snapshot code isn't all
that well exercised. I don't think many people use it yet.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 11/17/2014 11:34 AM, Andres Freund wrote:
> Cool, could you show the full thing? Or did you just snip it because it's
> just the Assert/ExceptionalCondition thing?

I just snipped the exception stuff. Here is the full thing:

(gdb) backtrace
#0  0x00007f6fb22e8295 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f6fb22eb438 in __GI_abort () at abort.c:90
#2  0x00000000007a08e7 in ExceptionalCondition (
    conditionName=conditionName@entry=0x918e88
    "!(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin))",
    errorType=errorType@entry=0x7da390 "FailedAssertion",
    fileName=fileName@entry=0x9182c3 "predicate.c",
    lineNumber=lineNumber@entry=1738) at assert.c:54
#3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
    snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
    sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
    at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
    snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x00000000006c0e35 in exec_simple_query (
    query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
    ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
    \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
    coalesce(ev_provider_xid,\""...) at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
    dbname=0x1f5ab30 "test2", username=<optimized out>) at postgres.c:4016
#9  0x000000000046a08e in BackendRun (port=0x1f83010) at postmaster.c:4123
#10 BackendStartup (port=0x1f83010) at postmaster.c:3797
#11 ServerLoop () at postmaster.c:1576
#12 0x0000000000665eae in PostmasterMain (argc=argc@entry=3,
    argv=argv@entry=0x1f59d00) at postmaster.c:1223
#13 0x000000000046ab58 in main (argc=3, argv=0x1f59d00) at main.c:227

> Could you print *snapshot in frame #3?

(gdb) p *snapshot
$1 = {satisfies = 0x7d0330 <HeapTupleSatisfiesMVCC>, xmin = 412635,
  xmax = 412637, xip = 0x1f86e40, xcnt = 1, subxcnt = 0,
  subxip = 0x1fd2190, suboverflowed = 0 '\000',
  takenDuringRecovery = 0 '\000', copied = 0 '\000', curcid = 0,
  active_count = 0, regd_count = 0}

> The above backtrace looks like it can only be triggered if you do a BEGIN
> TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; is that something you do?

The slony remote listener connection does this when it selects from
sl_event. We switched to this shortly after 9.1 came out.