Discussion: self-deadlock at FATAL exit of bootstrap process on read error
I encountered a situation where the server can't shut down when a bootstrap process calls ReadBuffer() but gets a read error. I guess the problem may be like this: the bootstrap process can't read at this line:

    smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);

So it does a FATAL exit and shmem_exit() is called:

    while (--on_shmem_exit_index >= 0)
        (*on_shmem_exit_list[on_shmem_exit_index].function) (code,
                            on_shmem_exit_list[on_shmem_exit_index].arg);

where

    on_shmem_exit_list[0] = DummyProcKill
    on_shmem_exit_list[1] = AtProcExit_Buffers

The callbacks are called in stack order, so AtProcExit_Buffers() will call AbortBufferIO(), which then blocks on the "io_in_progress_lock" that the process itself still holds. (So the assumption in the comment does not hold here: "since LWLockReleaseAll has already been called, we're not holding the buffer's io_in_progress_lock".)

There may be other similar problems for the bootstrap process, so I am not sure what the best fix is ...

Regards,
Qingqing

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > I encounter a situation that the server can't shutdown when a boostrap > process does ReadBuffer() but gets an read error. Hm, AtProcExit_Buffers is assuming that we've done AbortTransaction, but the WAL-replay process doesn't do that because it's not running a transaction. Seems like we need to stack another on-proc-exit function to do the appropriate subset of AbortTransaction ... LWLockReleaseAll at least, not sure what else. Do you have a test case to reproduce this problem? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > Do you have a test case to reproduce this problem? > According to the error message, the problem happens during reading pg_database. I just tried to plug in this line in mdread(): + /* pretend there is an error reading pg_database */ + if (reln->smgr_rnode.relNode == 1262) + { + fprintf(stderr, "Ooops \n"); + return false; + } v = _mdfd_getseg(reln, blocknum, false); And it works. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> wrote >> Do you have a test case to reproduce this problem? > According to the error message, the problem happens during reading > pg_database. I just tried to plug in this line in mdread(): OK, patch applied for this. regards, tom lane