Обсуждение: PANIC :Call AbortTransaction when transaction id is no normal
Hello,
The process crashed when running in bootstrap mode and received signal to shutdown.
From the call stack we can see that the transaction id is 1, which is BootstrapTransactionId.
During TransactionLogFetch function, which fetch commit status of specified transaction id, it will return COMITTED when transcation id is BootstrapTransactionId.
Then it will crash because we cannot abort transaction while it was already committed.
(gdb) bt
#0 0x00007f4598f02617 in raise () from /lib64/libc.so.6
#1 0x00007f4598f03d08 in abort () from /lib64/libc.so.6
#2 0x000000000106001d in errfinish (dummy=0) at elog.c:564
#3 0x0000000001064788 in elog_finish (elevel=22, fmt=0x1190408 "cannot abort transaction %u, it was already committed") at elog.c:1385
#4 0x0000000000633e5a in RecordTransactionAbort (isSubXact=false) at xact.c:1584
#5 0x00000000006361fa in AbortTransaction () at xact.c:2614
#6 0x000000000063a5b9 in AbortOutOfAnyTransaction () at xact.c:4423
#7 0x000000000108f417 in ShutdownPostgres (code=1, arg=0) at postinit.c:1221
#8 0x0000000000d3554a in shmem_exit (code=1) at ipc.c:239
#9 0x0000000000d35395 in proc_exit_prepare (code=1) at ipc.c:194
#10 0x0000000000d352bb in proc_exit (code=1) at ipc.c:107
#11 0x000000000105ffbb in errfinish (dummy=0) at elog.c:550
#12 0x0000000000d9a020 in ProcessInterrupts () at postgres.c:3019
#13 0x00000000005489aa in heapgetpage (scan=0x34895f0, page=1) at heapam.c:384
#14 0x000000000054c48d in heapgettup_pagemode (scan=0x34895f0, dir=ForwardScanDirection, nkeys=1, key=0x33cf150) at heapam.c:1052
#15 0x000000000054e3e0 in heap_getnext (scan=0x34895f0, direction=ForwardScanDirection) at heapam.c:1850
#16 0x00000000006fe364 in index_update_stats (rel=0x7f459b770030, hasindex=true, reltuples=0) at index.c:2167
#17 0x00000000006ff40b in index_build (heapRelation=0x7f459b770030, indexRelation=0x7f459b704ca8, indexInfo=0x345c008, isprimary=false, isreindex=false, parallel=false) at index.c:2398
#18 0x00000000006e3ae3 in build_indices () at bootstrap.c:1112
#19 0x00000000006dbc8b in boot_yyparse () at bootparse.y:399
#20 0x00000000006e17f4 in BootstrapModeMain () at bootstrap.c:516
#21 0x00000000006e1578 in AuxiliaryProcessMain (argc=6, argv=0x33299b8) at bootstrap.c:436
#22 0x0000000000aab0be in main (argc=7, argv=0x33299b0) at main.c:220
(gdb) f 4
#4 0x0000000000633e5a in RecordTransactionAbort (isSubXact=false) at xact.c:1584
1584 elog(PANIC, "cannot abort transaction %u, it was already committed",
(gdb) p xid
$4 = 1
I try to fix this issue and check whether it's normal transaction id before we do abort.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 20feeec327..dbf2bf567a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4504,8 +4504,13 @@ RollbackAndReleaseCurrentSubTransaction(void)
void
AbortOutOfAnyTransaction(void)
{
+ TransactionId xid = GetCurrentTransactionIdIfAny();
TransactionState s = CurrentTransactionState;
+ /* Check to see if the transaction ID is a permanent one because we cannot abort it */
+ if (!TransactionIdIsNormal(xid))
+ return;
+
/* Ensure we're not running in a doomed memory context */
AtAbort_Memory();
Can we fix in this way?
Thanks
Ray
I try to fix this issue and check whether it's normal transaction id before we do abort.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 20feeec327..dbf2bf567a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4504,8 +4504,13 @@ RollbackAndReleaseCurrentSubTransaction(void)
void
AbortOutOfAnyTransaction(void)
{
+ TransactionId xid = GetCurrentTransactionIdIfAny();
TransactionState s = CurrentTransactionState;
+ /* Check to see if the transaction ID is a permanent one because we cannot abort it */
+ if (!TransactionIdIsNormal(xid))
+ return;
+
/* Ensure we're not running in a doomed memory context */
AtAbort_Memory();Can we fix in this way?
On Mon, May 13, 2019 at 01:25:19PM +0530, Kuntal Ghosh wrote: > If we fix the issue in this way, we're certainly not going to do all > those portal,locks,memory,resource owner cleanups that are done > inside AbortTransaction() for a normal transaction ID. But, I'm not > sure how relevant those steps are since the database is anyway > shutting down. And it is happening in bootstrap, meaning that the data folder is most likely toast, and needs to be reinitialized. TransactionLogFetch() treats bootstrap and frozen XIDs as always committed, so from this perspective it is not wrong either to complain that this transaction has already been committed when attempting to abort it. Not sure what's a more user-friendly behavior in this case though. -- Michael
Вложения
Michael Paquier <michael@paquier.xyz> writes: > On Mon, May 13, 2019 at 01:25:19PM +0530, Kuntal Ghosh wrote: >> If we fix the issue in this way, we're certainly not going to do all >> those portal,locks,memory,resource owner cleanups that are done >> inside AbortTransaction() for a normal transaction ID. But, I'm not >> sure how relevant those steps are since the database is anyway >> shutting down. > And it is happening in bootstrap, meaning that the data folder is most > likely toast, and needs to be reinitialized. Indeed, initdb is going to remove the data directory if the bootstrap run crashes. If we do anything at all about this, my thought would just be to change bootstrap_signals() so that it points all the signal handlers at quickdie(), or maybe something equivalent to quickdie() but printing a more apropos message, or even just set them all to SIGDFL since that means process termination for all of these. die() isn't really the right thing, precisely because it thinks it can trigger transaction abort, which makes no sense in bootstrap mode. But ... that code's been like that for decades and nobody's complained before. Why are we worried about bootstrap's response to signals at all? regards, tom lane
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, May 13, 2019 at 01:25:19PM +0530, Kuntal Ghosh wrote:
>> If we fix the issue in this way, we're certainly not going to do all
>> those portal,locks,memory,resource owner cleanups that are done
>> inside AbortTransaction() for a normal transaction ID. But, I'm not
>> sure how relevant those steps are since the database is anyway
>> shutting down.
> And it is happening in bootstrap, meaning that the data folder is most
> likely toast, and needs to be reinitialized.
Indeed, initdb is going to remove the data directory if the bootstrap run
crashes.
But ... that code's been like that for decades and nobody's complained
before. Why are we worried about bootstrap's response to signals at all?
I wrote: > If we do anything at all about this, my thought would just be to change > bootstrap_signals() so that it points all the signal handlers at > quickdie(), or maybe something equivalent to quickdie() but printing > a more apropos message, or even just set them all to SIGDFL since that > means process termination for all of these. die() isn't really the right > thing, precisely because it thinks it can trigger transaction abort, > which makes no sense in bootstrap mode. After further thought I like the SIG_DFL answer, as per attached proposed patch. With this, you get SIGINT behavior like this if you manage to catch it in bootstrap mode (which is not that easy these days): selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default timezone ... America/New_York creating configuration files ... ok running bootstrap script ... ^Cchild process was terminated by signal 2: Interrupt initdb: removing data directory "/home/postgres/testversion/data" That seems perfectly fine from here. > But ... that code's been like that for decades and nobody's complained > before. Why are we worried about bootstrap's response to signals at all? I'm still wondering why the OP cares. Still, the PANIC message that you get right now is potentially confusing. regards, tom lane diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index d8776e1..825433d 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -558,11 +558,16 @@ bootstrap_signals(void) { Assert(!IsUnderPostmaster); - /* Set up appropriately for interactive use */ - pqsignal(SIGHUP, die); - pqsignal(SIGINT, die); - pqsignal(SIGTERM, die); - pqsignal(SIGQUIT, die); + /* + * We don't actually need any non-default signal handling in bootstrap + * mode; "curl up and die" is a sufficient response for all these cases. + * But let's just make sure the signals are set that way, since our parent + * process initdb has them set differently. + */ + pqsignal(SIGHUP, SIG_DFL); + pqsignal(SIGINT, SIG_DFL); + pqsignal(SIGTERM, SIG_DFL); + pqsignal(SIGQUIT, SIG_DFL); } /*
On Mon, May 13, 2019 at 09:37:32AM -0400, Tom Lane wrote: > But ... that code's been like that for decades and nobody's complained > before. Why are we worried about bootstrap's response to signals at all? Yeah, I don't think that it is something worth bothering either. As you mentioned the data folder would be removed by default. Or perhaps the reporter has another case in mind which could justify a change in the signal handlers? I am ready to hear that case, but there is nothing about the reason why it could be a benefit. The patch proposed upthread is not something I find correct anyway, I'd rather have the abort path complain loudly about a bootstrap transaction that fails instead of just ignoring it, because it is the kind of transaction which must never fail. And it seems to me that it can be handy for development purposes. -- Michael
Вложения
At 2019-05-14 07:53:36, "Michael Paquier" <michael@paquier.xyz> wrote: >On Mon, May 13, 2019 at 09:37:32AM -0400, Tom Lane wrote: >> But ... that code's been like that for decades and nobody's complained >> before. Why are we worried about bootstrap's response to signals at all? > >Yeah, I don't think that it is something worth bothering either. As >you mentioned the data folder would be removed by default. Or perhaps >the reporter has another case in mind which could justify a change in >the signal handlers? I am ready to hear that case, but there is >nothing about the reason why it could be a benefit. > >The patch proposed upthread is not something I find correct anyway, >I'd rather have the abort path complain loudly about a bootstrap >transaction that fails instead of just ignoring it, because it is the >kind of transaction which must never fail. And it seems to me that it >can be handy for development purposes. >-- >Michael
[ please don't top-post on the PG lists ] Thunder <thunder1@126.com> writes: > At 2019-05-14 07:53:36, "Michael Paquier" <michael@paquier.xyz> wrote: >> On Mon, May 13, 2019 at 09:37:32AM -0400, Tom Lane wrote: >>> But ... that code's been like that for decades and nobody's complained >>> before. Why are we worried about bootstrap's response to signals at all? > On our server when process crash and core dump file generated we will receive complaining phone call. > That's why i try to fix it. OK, that's fair. The SIG_DFL change I suggested will fix that problem for SIGINT etc (except SIGQUIT, for which you should be *expecting* a core file). I agree with Michael that we do not wish to change what happens for an internal error; but external signals do not represent a bug in PG, so forcing a PANIC for those seems unwarranted. regards, tom lane
On Mon, May 13, 2019 at 11:28:51PM -0400, Tom Lane wrote: > OK, that's fair. The SIG_DFL change I suggested will fix that problem > for SIGINT etc (except SIGQUIT, for which you should be *expecting* > a core file). I agree with Michael that we do not wish to change what > happens for an internal error; but external signals do not represent > a bug in PG, so forcing a PANIC for those seems unwarranted. No objections from here to change the signal handlers. Still, I would like to understand why the bootstrap process has been signaled to begin with, particularly for an initdb, which is not really something that should happen on a server where an instance runs. If you have a too aggressive monitoring job, you may want to revisit that as well, because it is able to complain just with an initdb. -- Michael
Вложения
Hi, On 2019-05-14 12:37:39 +0900, Michael Paquier wrote: > Still, I would like to understand why the bootstrap process has been > signaled to begin with, particularly for an initdb, which is not > really something that should happen on a server where an instance > runs. If you have a too aggressive monitoring job, you may want to > revisit that as well, because it is able to complain just with an > initdb. Shutdown, timeout, resource exhaustion all seem like possible causes. Don't think any of them warrant a core file - as the OP explains, that'll often trigger pages etc. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-05-14 12:37:39 +0900, Michael Paquier wrote: >> Still, I would like to understand why the bootstrap process has been >> signaled to begin with, particularly for an initdb, which is not >> really something that should happen on a server where an instance >> runs. If you have a too aggressive monitoring job, you may want to >> revisit that as well, because it is able to complain just with an >> initdb. > Shutdown, timeout, resource exhaustion all seem like possible > causes. Don't think any of them warrant a core file - as the OP > explains, that'll often trigger pages etc. Yeah. The case I was thinking about was mostly "start initdb, decide I didn't want to do that, hit control-C". That cleans up without much fuss *except* if you manage to hit the window where it's running bootstrap, and then it spews this scary-looking error. It's less scary-looking with the SIG_DFL patch, which I've now pushed. regards, tom lane