Обсуждение: ddd
Hi Robert, Todd, All, I think both I and commit e9baa5e9fa147e are confused. Mantid started to fail with the parallel hash join commit, with the following assert failure [1]: #2 0x00000000008698e7 in ExceptionalCondition (conditionName=conditionName@entry=0x8efee8 "!(CurrentTransactionState->parallelModeLevel== 0)", errorType=errorType@entry=0x8b35c9 "FailedAssertion", fileName=fileName@entry=0x8ec2bf"xact.c", lineNumber=lineNumber@entry=691) at assert.c:54 #3 0x000000000050c7fb in GetCurrentCommandId (used=used@entry=1 '\001') at xact.c:691 #4 0x00000000004d3f0c in toast_save_datum (value=value@entry=43175280, oldexternal=0x0, options=options@entry=2, rel=0x7f98a770d588,rel=0x7f98a770d588) at tuptoaster.c:1477 #5 0x00000000004d54f9 in toast_insert_or_update (rel=rel@entry=0x7f98a770d588, newtup=0x28deaf0, oldtup=oldtup@entry=0x0,options=2) at tuptoaster.c:814 #6 0x00000000004c0bf5 in heap_prepare_insert (relation=relation@entry=0x7f98a770d588, tup=tup@entry=0x28deaf0, xid=<optimizedout>, cid=cid@entry=34, options=options@entry=2) at heapam.c:2660 #7 0x00000000004c6eeb in heap_insert (relation=0x7f98a770d588, tup=0x28deaf0, cid=34, options=2, bistate=0x251d220) at heapam.c:2429 #8 0x00000000005b39a3 in intorel_receive (slot=<optimized out>, self=0x2359420) at createas.c:599 #9 0x000000000061cd58 in ExecutePlan (execute_once=<optimized out>, dest=0x2359420, direction=<optimized out>, numberTuples=0,sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x23effb0, estate=0x23efd98)at execMain.c:1753 #10 standard_ExecutorRun (queryDesc=0x24f3910, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:361 #11 0x00000000005b3c91 in ExecCreateTableAs (stmt=stmt@entry=0x2337e58, queryString=queryString@entry=0x2336bc8 "create tablewide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;", params=params@entry=0x0, queryEnv=queryEnv@entry=0x0,completionTag=completionTag@entry=0x7fff66986fc0 "") at createas.c:351 #12 0x000000000076a6a9 in ProcessUtilitySlow (pstate=pstate@entry=0x2359308, pstmt=pstmt@entry=0x23387d0, queryString=queryString@entry=0x2336bc8"create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') ast;", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, completionTag=completionTag@entry=0x7fff66986fc0"", dest=0x2338868) at utility.c:1454 #13 0x000000000076951a in standard_ProcessUtility (pstmt=0x23387d0, queryString=0x2336bc8 "create table wide as select generate_series(1,2) as id, rpad('', 320000, 'x') as t;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2338868,completionTag=0x7fff66986fc0 "") at utility.c:932 #14 0x0000000000766bb8 in PortalRunUtility (portal=0x239aaf8, pstmt=0x23387d0, isTopLevel=<optimized out>, setHoldSnapshot=<optimizedout>, dest=0x2338868, completionTag=0x7fff66986fc0 "") at pquery.c:1178 #15 0x0000000000767679 in PortalRunMulti (portal=portal@entry=0x239aaf8, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0'\000', dest=dest@entry=0x2338868, altdest=altdest@entry=0x2338868, completionTag=completionTag@entry=0x7fff66986fc0"") at pquery.c:1331 #16 0x0000000000768332 in PortalRun (portal=portal@entry=0x239aaf8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1'\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2338868, altdest=altdest@entry=0x2338868,completionTag=completionTag@entry=0x7fff66986fc0 "") at pquery.c:799 #17 0x00000000007640df in exec_simple_query (query_string=0x2336bc8 "create table wide as select generate_series(1, 2) asid, rpad('', 320000, 'x') as t;") at postgres.c:1120 the reason this confuses me is that the PHJ commit hasn't changed anything relevant here. Testing force_parallel_mode=regress locally also shows that the failure is present before the commit. Why mantid's earlier builds don't show the problem, even though I can locally reproduce the issue escapes me. Todd, did you recently change anything on Mantid? By my reading this is the fault of e9baa5e9fa147e [3]. Robert, Haribabu any idea? Greetings, Andres Freund [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2017-12-21%2011%3A07%3A06 [2] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=mantid&br=HEAD [3] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e9baa5e9fa147e00a2466ab2c40eb99c8a700824
Hi, Err, the subject ought to have been "force parallel mode vs CTAS" or such. On 2017-12-21 06:31:06 -0800, Andres Freund wrote: > Hi Robert, Todd, All, > > I think both I and commit e9baa5e9fa147e are confused. > > Mantid started to fail with the parallel hash join commit, with the following > assert failure [1]: > #2 0x00000000008698e7 in ExceptionalCondition (conditionName=conditionName@entry=0x8efee8 "!(CurrentTransactionState->parallelModeLevel== 0)", errorType=errorType@entry=0x8b35c9 "FailedAssertion", fileName=fileName@entry=0x8ec2bf"xact.c", lineNumber=lineNumber@entry=691) at assert.c:54 > #3 0x000000000050c7fb in GetCurrentCommandId (used=used@entry=1 '\001') at xact.c:691 > #4 0x00000000004d3f0c in toast_save_datum (value=value@entry=43175280, oldexternal=0x0, options=options@entry=2, rel=0x7f98a770d588,rel=0x7f98a770d588) at tuptoaster.c:1477 > #5 0x00000000004d54f9 in toast_insert_or_update (rel=rel@entry=0x7f98a770d588, newtup=0x28deaf0, oldtup=oldtup@entry=0x0,options=2) at tuptoaster.c:814 > #6 0x00000000004c0bf5 in heap_prepare_insert (relation=relation@entry=0x7f98a770d588, tup=tup@entry=0x28deaf0, xid=<optimizedout>, cid=cid@entry=34, options=options@entry=2) at heapam.c:2660 > #7 0x00000000004c6eeb in heap_insert (relation=0x7f98a770d588, tup=0x28deaf0, cid=34, options=2, bistate=0x251d220) atheapam.c:2429 > #8 0x00000000005b39a3 in intorel_receive (slot=<optimized out>, self=0x2359420) at createas.c:599 > #9 0x000000000061cd58 in ExecutePlan (execute_once=<optimized out>, dest=0x2359420, direction=<optimized out>, numberTuples=0,sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x23effb0, estate=0x23efd98)at execMain.c:1753 > #10 standard_ExecutorRun (queryDesc=0x24f3910, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:361 > #11 0x00000000005b3c91 in ExecCreateTableAs (stmt=stmt@entry=0x2337e58, queryString=queryString@entry=0x2336bc8 "createtable wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;", params=params@entry=0x0, queryEnv=queryEnv@entry=0x0,completionTag=completionTag@entry=0x7fff66986fc0 "") at createas.c:351 > #12 0x000000000076a6a9 in ProcessUtilitySlow (pstate=pstate@entry=0x2359308, pstmt=pstmt@entry=0x23387d0, queryString=queryString@entry=0x2336bc8"create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') ast;", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, completionTag=completionTag@entry=0x7fff66986fc0"", dest=0x2338868) at utility.c:1454 > #13 0x000000000076951a in standard_ProcessUtility (pstmt=0x23387d0, queryString=0x2336bc8 "create table wide as selectgenerate_series(1, 2) as id, rpad('', 320000, 'x') as t;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,dest=0x2338868, completionTag=0x7fff66986fc0 "") at utility.c:932 > #14 0x0000000000766bb8 in PortalRunUtility (portal=0x239aaf8, pstmt=0x23387d0, isTopLevel=<optimized out>, setHoldSnapshot=<optimizedout>, dest=0x2338868, completionTag=0x7fff66986fc0 "") at pquery.c:1178 > #15 0x0000000000767679 in PortalRunMulti (portal=portal@entry=0x239aaf8, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0'\000', dest=dest@entry=0x2338868, altdest=altdest@entry=0x2338868, completionTag=completionTag@entry=0x7fff66986fc0"") at pquery.c:1331 > #16 0x0000000000768332 in PortalRun (portal=portal@entry=0x239aaf8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1'\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2338868, altdest=altdest@entry=0x2338868,completionTag=completionTag@entry=0x7fff66986fc0 "") at pquery.c:799 > #17 0x00000000007640df in exec_simple_query (query_string=0x2336bc8 "create table wide as select generate_series(1, 2)as id, rpad('', 320000, 'x') as t;") at postgres.c:1120 > > the reason this confuses me is that the PHJ commit hasn't changed > anything relevant here. Testing force_parallel_mode=regress locally also > shows that the failure is present before the commit. > > Why mantid's earlier builds don't show the problem, even though I can > locally reproduce the issue escapes me. Todd, did you recently change > anything on Mantid? > > By my reading this is the fault of e9baa5e9fa147e [3]. Robert, Haribabu > any idea? > > Greetings, > > Andres Freund > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2017-12-21%2011%3A07%3A06 > [2] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=mantid&br=HEAD > [3] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e9baa5e9fa147e00a2466ab2c40eb99c8a700824
On 12/21/17 09:32, Andres Freund wrote: > >> Why mantid's earlier builds don't show the problem, even though I can >> locally reproduce the issue escapes me. Todd, did you recently change >> anything on Mantid? Not within the buildfarm set-up. Otherwise, the most recent change was on Dec. 14 when I applied some system package updates which affected the kernel, gcc, and glibc. If you need me to run additional tests on mantid, please don't hesitate to ask. -- todd
Great subject line! On Thu, Dec 21, 2017 at 9:31 AM, Andres Freund <andres@anarazel.de> wrote: > I think both I and commit e9baa5e9fa147e are confused. > > By my reading this is the fault of e9baa5e9fa147e [3]. Robert, Haribabu > any idea? If I run the regression tests with force_parallel_mode=on prior to the parallel hash join patch, they pass. If I run them now, they fail inside the parallel hash join tests here: create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t; I'm guessing that test case would have failed before, too, but we didn't have it. I'll analyze this further in a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Great subject line! fixed ... > If I run the regression tests with force_parallel_mode=on prior to the > parallel hash join patch, they pass. If I run them now, they fail > inside the parallel hash join tests here: > create table wide as select generate_series(1, 2) as id, rpad('', > 320000, 'x') as t; > I'm guessing that test case would have failed before, too, but we > didn't have it. I'll analyze this further in a bit. I've not updated my repo to today's sources yet. But on yesterday's HEAD: regression=# create table wide as select generate_series(1, 2) as id, rpad('', regression(# 320000, 'x') as t; SELECT 2 regression=# drop table wide; DROP TABLE regression=# set force_parallel_mode=on; SET regression=# create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. So yes, this is a pre-existing bug accidentally exposed by the PHJ tests. Log says TRAP: FailedAssertion("!(CurrentTransactionState->parallelModeLevel == 0)", File: "/home/postgres/pgsql/src/backend/access/transam/xact.c",Line: 691) 2017-12-21 10:30:23.131 EST [4958] LOG: server process (PID 14939) was terminated by signal 6: Aborted 2017-12-21 10:30:23.131 EST [4958] DETAIL: Failed process was running: create table wide as select generate_series(1, 2)as id, rpad('', 320000, 'x') as t; and here's the stack trace: #0 0x0000003b78a32495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x0000003b78a33c75 in abort () at abort.c:92 #2 0x000000000087183d in ExceptionalCondition ( conditionName=<value optimized out>, errorType=<value optimized out>, fileName=<value optimized out>, lineNumber=<value optimized out>) at /home/postgres/pgsql/src/backend/utils/error/assert.c:54 #3 0x00000000005070d4 in GetCurrentCommandId (used=<value optimized out>) at /home/postgres/pgsql/src/backend/access/transam/xact.c:691 #4 0x00000000004ccf18 in toast_save_datum (rel=0x7ff973c47938, value=19547872, oldexternal=0x0, options=2) at /home/postgres/pgsql/src/backend/access/heap/tuptoaster.c:1477 #5 0x00000000004cf7cd in toast_insert_or_update (rel=<value optimized out>, newtup=0x1256460, oldtup=0x7d8, options=2) at /home/postgres/pgsql/src/backend/access/heap/tuptoaster.c:814 #6 0x00000000004c320e in heap_insert (relation=0x7ff973c47938, tup=0x1256460, cid=5, options=2, bistate=0x1125f70) at /home/postgres/pgsql/src/backend/access/heap/heapam.c:2429 #7 0x00000000005b3243 in intorel_receive (slot=<value optimized out>, self=0x1065470) at /home/postgres/pgsql/src/backend/commands/createas.c:599 #8 0x00000000006249cc in ExecutePlan (queryDesc=0x10f7040, direction=<value optimized out>, count=0, execute_once=48 '0') at /home/postgres/pgsql/src/backend/executor/execMain.c:1753 #9 standard_ExecutorRun (queryDesc=0x10f7040, direction=<value optimized out>, count=0, execute_once=48 '0') at /home/postgres/pgsql/src/backend/executor/execMain.c:361 #10 0x00000000005b3511 in ExecCreateTableAs (stmt=<value optimized out>, queryString=0x103ff18 "create table wide as select generate_series(1, 2) as id, rpad('',\n320000, 'x') as t;", params=0x0,queryEnv=<value optimized out>, completionTag=0x7ffddedc4d80 "") at /home/postgres/pgsql/src/backend/commands/createas.c:351 #11 0x0000000000771f69 in ProcessUtilitySlow (pstate=0x1065358, pstmt=0x1041d40, queryString=0x103ff18 "create table wide as select generate_series(1, 2) as id, rpad('',\n320000, 'x') as t;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x10f2498, completionTag=0x7ffddedc4d80 "") at /home/postgres/pgsql/src/backend/tcop/utility.c:1454 #12 0x0000000000772c17 in standard_ProcessUtility (pstmt=0x1041d40, queryString=0x103ff18 "create table wide as select generate_series(1, 2) as id, rpad('',\n320000, 'x') as t;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x10f2498, completionTag=0x7ffddedc4d80 "") at /home/postgres/pgsql/src/backend/tcop/utility.c:932 #13 0x000000000076ecab in PortalRunUtility (portal=0x10a2f08, pstmt=0x1041d40, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x10f2498, completionTag=<value optimized out>) at /home/postgres/pgsql/src/backend/tcop/pquery.c:1178 regards, tom lane
On Thu, Dec 21, 2017 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: > If I run the regression tests with force_parallel_mode=on prior to the > parallel hash join patch, they pass. If I run them now, they fail > inside the parallel hash join tests here: > > create table wide as select generate_series(1, 2) as id, rpad('', > 320000, 'x') as t; > > I'm guessing that test case would have failed before, too, but we > didn't have it. I'll analyze this further in a bit. I think this is just a poorly-written assertion. currentCommandIdUsed is only used to skip redundant increments of the command counter, and CommandCounterIncrement() is categorically denied under parallelism anyway. Therefore, it's OK for this to happen in parallel mode; we just need to be in the leader, not the worker. Therefore, I proposed the attached patch, which fixes the regression test crash for me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
On Thu, Dec 21, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Great subject line! > > fixed ... Ack, sorry, I replied on the original thread, not realizing there was a second one. Basically, I think the assertion is wrong-headed and can be adjusted so it doesn't fail, and I attached a patch over there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Ack, sorry, I replied on the original thread, not realizing there was > a second one. Basically, I think the assertion is wrong-headed and > can be adjusted so it doesn't fail, and I attached a patch over there. Yeah, I saw. Looks plausible to me, but I haven't studied the fine points of the parallelism code enough to be confident in saying it's good. regards, tom lane
Hi, On 2017-12-21 10:35:06 -0500, Robert Haas wrote: > Great subject line! Had started written the email before finishing testing, forgot to fill in the blank. mutt annoyingly aborts with an empty subject, hence smashing one key a couple times. > On Thu, Dec 21, 2017 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > If I run the regression tests with force_parallel_mode=on prior to the > > parallel hash join patch, they pass. If I run them now, they fail > > inside the parallel hash join tests here: > > > > create table wide as select generate_series(1, 2) as id, rpad('', > > 320000, 'x') as t; > > > > I'm guessing that test case would have failed before, too, but we > > didn't have it. I'll analyze this further in a bit. > > I think this is just a poorly-written assertion. currentCommandIdUsed > is only used to skip redundant increments of the command counter, and > CommandCounterIncrement() is categorically denied under parallelism > anyway. Therefore, it's OK for this to happen in parallel mode; we > just need to be in the leader, not the worker. > > Therefore, I proposed the attached patch, which fixes the regression > test crash for me. > diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c > index d430e662e6..b37510c24f 100644 > --- a/src/backend/access/transam/xact.c > +++ b/src/backend/access/transam/xact.c > @@ -683,12 +683,12 @@ GetCurrentCommandId(bool used) > if (used) > { > /* > - * Forbid setting currentCommandIdUsed in parallel mode, because we > - * have no provision for communicating this back to the master. We > + * Forbid setting currentCommandIdUsed in a parallel worker, because > + * we have no provision for communicating this back to the master. We > * could relax this restriction when currentCommandIdUsed was already > * true at the start of the parallel operation. > */ > - Assert(CurrentTransactionState->parallelModeLevel == 0); > + Assert(!IsParallelWorker()); > currentCommandIdUsed = true; > } > return currentCommandId; Yea, that looks reasonable to me. Greetings, Andres Freund
On Thu, Dec 21, 2017 at 12:55 PM, Andres Freund <andres@anarazel.de> wrote: > Had started written the email before finishing testing, forgot to fill > in the blank. mutt annoyingly aborts with an empty subject, hence > smashing one key a couple times. Yeah, I figured that was what happened. It probably came across as sarcastic, sorry for that, but actually I was cracking up. > Yea, that looks reasonable to me. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 22, 2017 at 2:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> Great subject line!
fixed ...
Thanks for the fix Robert.
This could have been caught, if I have test that creates toast table
using CTAS during development of the patch.
Regards,
Hari Babu
Fujitsu Australia