Обсуждение: BUG #14344: string_agg(DISTINCT ..) crash

Поиск
Список
Период
Сортировка

BUG #14344: string_agg(DISTINCT ..) crash

От
lr@pcorp.us
Дата:
VGhlIGZvbGxvd2luZyBidWcgaGFzIGJlZW4gbG9nZ2VkIG9uIHRoZSB3ZWJz
aXRlOgoKQnVnIHJlZmVyZW5jZTogICAgICAxNDM0NApMb2dnZWQgYnk6ICAg
ICAgICAgIFJlZ2luYSBPYmUKRW1haWwgYWRkcmVzczogICAgICBsckBwY29y
cC51cwpQb3N0Z3JlU1FMIHZlcnNpb246IDkuNnJjMQpPcGVyYXRpbmcgc3lz
dGVtOiAgIFdpbmRvd3MgMjAxMiBhbmQgV2luZG93cyA3IDY0LWJpdApEZXNj
cmlwdGlvbjogICAgICAgIAoKTGV0IG1lIGtub3cgaWYgeW91IG5lZWQgbW9y
ZSBpbmZvIG9yIGlmIHRoaXMgaGFzIGFscmVhZHkgYmVlbiBmaXhlZC4NCg0K
SSB3YXMgdGVzdGluZyB1bmRlciANCg0KUG9zdGdyZVNRTCA5LjZyYzEsIGNv
bXBpbGVkIGJ5IFZpc3VhbCBDKysgYnVpbGQgMTgwMCwgNjQtYml0DQoNCkFu
ZCB3YXMgYWJsZSB0byB0cmlnZ2VyIGEgY3Jhc2ggd2l0aCB0aGlzOg0KDQot
LSBjcmVhdGUgdGVzdCB0YWJsZQ0KQ1JFQVRFIFRBQkxFIHN0cmluZ19hZ2df
Y3Jhc2hfdGVzdCBBUyANClNFTEVDVCBpIEFTIGlkLCBzdGF0ZXNbbW9kKGks
NikgKyAxXSBBUyBzdGF0ZSwgJ2p1bmsnIHx8IGNocihtb2QoaSwgMjYpICsg
MSkKQVMganVuaw0KRlJPTSBnZW5lcmF0ZV9zZXJpZXMoMSwxMDAwMDApIEFT
IGksIChTRUxFQ1QKJ3tOVUxMLE1BLEhJLElMLEtTLE5ZLE5KfSc6OnZhcmNo
YXJbXSBBUyBzdGF0ZXMpIEFTIHMgOw0KDQotLSB0cnkgdG8gc3VtbWFyaXpl
IGNyYXNoZXMgd2l0aCBwYXJhbGxlbCBvciBub3QgDQotLSBzbyBkb2Vzbid0
IHNlZW0gdG8gYmUgcGFyYWxsZWwgd29ya2VyIHJlbGF0ZWQuDQoNCnNldCBt
YXhfcGFyYWxsZWxfd29ya2Vyc19wZXJfZ2F0aGVyID0gMDsNClNFTEVDVCBz
dHJpbmdfYWdnKERJU1RJTkNUIHN0YXRlLCAnLCcgT1JERVIgQlkgc3RhdGUp
DQpGUk9NIHN0cmluZ19hZ2dfY3Jhc2hfdGVzdDsNCg0KLS0gaXQgbXVzdCBo
YXZlIHNvbWV0aGluZyB0byBkbyB3aXRoIHRoZSBESVNUSU5DVCwgYmVjYXVz
ZQ0KLS0gdGhpcyBkb2Vzbid0IGNyYXNoDQpzZXQgbWF4X3BhcmFsbGVsX3dv
cmtlcnNfcGVyX2dhdGhlciA9IDA7DQpTRUxFQ1Qgc3RyaW5nX2FnZyhzdGF0
ZSwgJywnIE9SREVSIEJZIHN0YXRlKQ0KRlJPTSBzdHJpbmdfYWdnX2NyYXNo
X3Rlc3Q7DQoNCg0KVGhlIGxvZ3MgZG9uJ3QgZ2l2ZSBtZSBhbnkgdXNlZnVs
IGluZm8ganVzdA0KDQpXQVJOSU5HOiAgdGVybWluYXRpbmcgY29ubmVjdGlv
biBiZWNhdXNlIG9mIGNyYXNoIG9mIGFub3RoZXIgc2VydmVyCnByb2Nlc3MN
CgoK

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Tom Lane
Дата:
lr@pcorp.us writes:
> Let me know if you need more info or if this has already been fixed.

Hm, works OK for me:

regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
FROM string_agg_crash_test;
    string_agg
-------------------
 HI,IL,KS,MA,MA,NY
(1 row)

But I see no post-rc1 fixes that look like they might explain this.

            regards, tom lane

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Michael Paquier
Дата:
On Wed, Sep 28, 2016 at 9:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> lr@pcorp.us writes:
>> Let me know if you need more info or if this has already been fixed.
> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
> FROM string_agg_crash_test;
>     string_agg
> -------------------
>  HI,IL,KS,MA,MA,NY
> (1 row)
>
> But I see no post-rc1 fixes that look like they might explain this.

No problems on OSX or Linux, but I am getting a crash on Windows
(Win7, MSVC 2010) even with latest REL9_6_STABLE. Oops.
--
Michael

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
"Regina Obe"
Дата:
> lr@pcorp.us writes:
> Let me know if you need more info or if this has already been fixed.

> Hm, works OK for me:

> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
>    string_agg
> -------------------
> HI,IL,KS,MA,MA,NY
> (1 row)

> But I see no post-rc1 fixes that look like they might explain this.

>            regards, tom lane


I was able to get it to crash under my mingw64 9.6beta4ish build too. Gdb
looks like this:

Reading symbols from
C:\ming64gcc48\projects\postgresql\rel\pg9.6w64gcc48\bin\postgres.exe...done
.
0x000000007732ae11 in ntdll!DbgBreakPoint ()
   from C:\Windows\SYSTEM32\ntdll.dll
(gdb) cont
Continuing.
[Thread 6944.0x61c exited with code 0]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 6944.0x2384]
toast_raw_datum_size (value=242751291) at tuptoaster.c:358
358     tuptoaster.c: No such file or directory.
(gdb) bt
#0  toast_raw_datum_size (value=242751291) at tuptoaster.c:358
#1  0x000000000078e616 in texteq (fcinfo=0x299e760) at varlena.c:1615
#2  0x00000000007bb045 in FunctionCall2Coll (flinfo=flinfo@entry=0x50ea400,
    collation=collation@entry=0, arg1=arg1@entry=85010827,
    arg2=<optimized out>) at fmgr.c:1319
#3  0x0000000000585548 in execTuplesMatch (slot1=slot1@entry=0x50e9e28,
    slot2=slot2@entry=0x50ea248, numCols=numCols@entry=2,
    matchColIdx=0x50ea350, eqfunctions=0x50ea3d0,
    evalContext=evalContext@entry=0x38d550) at execGrouping.c:100
#4  0x00000000005984e5 in process_ordered_aggregate_multi (
    pergroupstate=0x50e9980, pertrans=0x50e8968, aggstate=0x50aaaf0)
    at nodeAgg.c:1244
#5  finalize_aggregates (aggstate=aggstate@entry=0x50aaaf0,
    peraggs=peraggs@entry=0x50abfb0, pergroup=pergroup@entry=0x50e9980,
    currentSet=currentSet@entry=0) at nodeAgg.c:1552
#6  0x0000000000598c7a in agg_retrieve_direct (aggstate=0x50aaaf0)
    at nodeAgg.c:2157
#7  ExecAgg (node=node@entry=0x50aaaf0) at nodeAgg.c:1837
#8  0x000000000058b228 in ExecProcNode (node=node@entry=0x50aaaf0)
    at execProcnode.c:503
#9  0x00000000005871c5 in ExecutePlan (dest=0x50e5838,
    direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
    operation=CMD_SELECT, use_parallel_mode=<optimized out>,
    planstate=0x50aaaf0, estate=0x50aa9d8) at execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x5057288, direction=<optimized out>,
    count=0) at execMain.c:338
#11 0x00000000006b1fe6 in PortalRunSelect (portal=portal@entry=0x509e978,
    forward=forward@entry=1 '\001', count=0, count@entry=43643472,
    dest=dest@entry=0x0) at pquery.c:948
#12 0x00000000006b3686 in PortalRun (portal=0x299ee90,
    portal@entry=0x509e978, count=43643472, count@entry=2147483647,
    isTopLevel=isTopLevel@entry=0 '\000', dest=0x0, dest@entry=0x50e5838,
    altdest=altdest@entry=0x50e5838, completionTag=0x299f210 "",
    completionTag@entry=0x4000000007b <error: Cannot access memory at
address 0x4000000007b>) at pquery.c:789
#13 0x00000000006b0e54 in exec_simple_query (
    query_string=0x100000000000000 <error: Cannot access memory at address
0x100000000000000>) at postgres.c:1094
#14 PostgresMain (argc=<optimized out>, argv=argv@entry=0x2893e0,
    dbname=0x18001700160015 <error: Cannot access memory at address
0x18001700160015>, username=<optimized out>) at postgres.c:4074
#15 0x0000000000647aad in BackendRun (port=0x299f400) at postmaster.c:4262
#16 SubPostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2f7ea0)
    at postmaster.c:4752
#17 0x0000000000803d68 in main (argc=3, argv=0x2f7ea0) at main.c:205
(gdb)


I'll compile against the latest 9.6 branch then and see if it crashes there
too.  Might be windows specific perhaps.

Thanks,
Regina

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Michael Paquier
Дата:
On Wed, Sep 28, 2016 at 9:23 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:
> On Wed, Sep 28, 2016 at 9:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> lr@pcorp.us writes:
>>> Let me know if you need more info or if this has already been fixed.
>> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
>> FROM string_agg_crash_test;
>>     string_agg
>> -------------------
>>  HI,IL,KS,MA,MA,NY
>> (1 row)
>>
>> But I see no post-rc1 fixes that look like they might explain this.
>
> No problems on OSX or Linux, but I am getting a crash on Windows
> (Win7, MSVC 2010) even with latest REL9_6_STABLE. Oops.

After rolling in crashdumps/ and remembering some stuff about windbg:
- .sympath to point to symbols
- !analyze -v to load the backtrace:
00000000`0042eff0 00000001`40380293 : 000007fe`fda07a01 000007fe`fda07a11
00000000`ffff0000 00000000`000004e4 : postgres!toast_raw_datum_size+0x4
[c:\users\ioltas\git\postgres\src\backend\access\heap\tuptoaster.c @ 358]
00000000`0042f030 00000001`403ab6dc : 00000000`009f0000 00000000`00000002
00000000`008491f8 00000000`008f3caa : postgres!texteq+0x33
[c:\users\ioltas\git\postgres\src\backend\utils\adt\varlena.c @ 1616]
00000000`0042f060 00000001`4012a29e : 00000000`00000002 00000000`008491f8
00000000`008f3caa 00000000`00000001 : postgres!FunctionCall2Coll+0x5c
[c:\users\ioltas\git\postgres\src\backend\utils\fmgr\fmgr.c @ 1322]
00000000`0042f450 00000001`4013d05c : 00000000`00000002 00000000`01b33b30
00000000`008f3b00 00000000`008f3830 : postgres!execTuplesMatch+0xae
[c:\users\ioltas\git\postgres\src\backend\executor\execgrouping.c @ 101]
00000000`0042f4a0 00000001`4013d61c : 00000000`00895f00 00000000`00000000
00000000`01b359d8 00000000`00849308 :
postgres!process_ordered_aggregate_multi+0xfc
[c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 1248]
00000000`0042f520 00000001`4013f081 : 00000000`00000001 00000000`00000000
00000000`01b33f98 00000000`00000000 : postgres!finalize_aggregates+0x9c
[c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 1557]
00000000`0042f580 00000001`4013fe9a : 00000000`008ee800 00000000`00000003
00000000`01b33ee8 00000000`01b33f98 : postgres!agg_retrieve_direct+0x461
[c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 2163]
00000000`0042f620 00000001`40130352 : 00000000`00000054 00000001`4029eaea
00000000`00000000 00000000`00000001 : postgres!ExecAgg+0x4a
[c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 1838]
00000000`0042f650 00000001`4012c98d : 00000000`01b33a20 00000001`4000a42c
00000000`ffffffff 00000000`00000001 : postgres!ExecProcNode+0x202
[c:\users\ioltas\git\postgres\src\backend\executor\execprocnode.c @ 503]
00000000`0042f680 00000001`4012dc58 : 00000000`008d9b60 00000000`00000001
00000000`01b33a20 00000000`00000000 : postgres!ExecutePlan+0x7d
[c:\users\ioltas\git\postgres\src\backend\executor\execmain.c @ 1573]
00000000`0042f6d0 00000001`4029f659 : 00000000`008df510 00000000`00000000
00000000`008df510 00000001`403cb144 : postgres!standard_ExecutorRun+0xb8
[c:\users\ioltas\git\postgres\src\backend\executor\execmain.c @ 350]
00000000`0042f740 00000001`4029fe6a : 00000000`008df510 00000000`008df510
00000000`00843d50 00000000`0042f8b0 : postgres!PortalRunSelect+0x99
[c:\users\ioltas\git\postgres\src\backend\tcop\pquery.c @ 949]
00000000`0042f7a0 00000001`4029d723 : 00000000`00000000 00000000`00843d50
00000000`00000000 00000000`00000000 : postgres!PortalRun+0x1ca
[c:\users\ioltas\git\postgres\src\backend\tcop\pquery.c @ 796]
00000000`0042f980 00000001`4029e1fc : 00000000`00000051 00000000`00894e60
00000000`0042fb70 00000000`009f8220 : postgres!exec_simple_query+0x573
[c:\users\ioltas\git\postgres\src\backend\tcop\postgres.c @ 1101]
00000000`0042fa70 00000001`4022eaea : 00000001`405f4300 00000001`404a37b8
00000000`000010a6 00000001`404a3938 : postgres!PostgresMain+0x63c
[c:\users\ioltas\git\postgres\src\backend\tcop\postgres.c @ 4072]
00000000`0042fc60 00000001`40232645 : 00000001`00000001 00000000`000861df
00000000`1f7f5412 00000000`00000000 : postgres!BackendRun+0x1ca
[c:\users\ioltas\git\postgres\src\backend\postmaster\postmaster.c @ 4271]
00000000`0042fcc0 00000001`4016ca71 : 00000000`00000003 00000000`009f2329
00000000`00000003 00000000`00000000 : postgres!SubPostmasterMain+0x255
[c:\users\ioltas\git\postgres\src\backend\postmaster\postmaster.c @ 4762]
00000000`0042ff00 00000001`403f1bfa : 00000000`00000000 00000000`00000001
00000000`00000000 00000000`00000000 : postgres!main+0x3c1
[c:\users\ioltas\git\postgres\src\backend\main\main.c @ 216]
00000000`0042ff30 00000000`76cc59cd : 00000000`00000000 00000000`00000000
00000000`00000000 00000000`00000000 : postgres!__tmainCRTStartup+0x11a
[f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crtexe.c @ 555]
00000000`0042ff60 00000000`76efa2e1 : 00000000`00000000 00000000`00000000
00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`0042ff90 00000000`00000000 : 00000000`00000000 00000000`00000000
00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21


STACK_COMMAND:  ~0s; .ecxr ; kb

FAULTING_SOURCE_CODE:
   354: {
   355:     struct varlena *attr = (struct varlena *)
DatumGetPointer(value);
   356:     Size        result;
   357:
>  358:     if (VARATT_IS_EXTERNAL_ONDISK(attr))
   359:     {
   360:         /* va_rawsize is the size of the original datum --
including header */
   361:         struct varatt_external toast_pointer;
   362:
   363:         VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);

--
Michael

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Michael Paquier
Дата:
On Wed, Sep 28, 2016 at 9:28 PM, Regina Obe <lr@pcorp.us> wrote:
> I'll compile against the latest 9.6 branch then and see if it crashes there
> too.  Might be windows specific perhaps.

Confirmed. I got the same trace.That's an aggregate bug in the executor.
--
Michael

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
David Rowley
Дата:
On 29 September 2016 at 17:36, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 9:28 PM, Regina Obe <lr@pcorp.us> wrote:
>> I'll compile against the latest 9.6 branch then and see if it crashes there
>> too.  Might be windows specific perhaps.
>
> Confirmed. I got the same trace.That's an aggregate bug in the executor.

hmm, I would be surprised about that. The planner would never generate
a parallel aggregate plan for that given that string_agg has no
combine function, and there's also a DISTINCT in there, which is never
made parallel. The code for the non parallel case is not really
changed very much in 9.6.
My guess would be a sorting bug...

Let me check....

Looks like:

commit 0011c0091e886b874e485a46ff2c94222ffbf550
Author: Robert Haas <rhaas@postgresql.org>
Date:   Thu Mar 17 16:05:05 2016 -0400

    Improve memory management for external sorts.

    Introduce a new memory context which stores tuple data, and reset it
    at the end of each merge pass; this helps avoid memory fragmentation
    and, consequently, overallocation.  Also, for the final merge patch,
    eliminate memory context chunk header overhead entirely by allocating
    all of the memory used for buffering tuples during the merge in a
    single chunk.  Since this modestly increases the number of tuples we
    can store, grow the memtuples array a bit so that we're less likely to
    run short of slots there.

    Peter Geoghegan.  Review and testing of patches in this series by
    Jeff Janes, Greg Stark, Mithun Cy, and me.




--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
David Rowley
Дата:
On 29 September 2016 at 18:41, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> My guess would be a sorting bug...
>
> Let me check....
>
> Looks like:
>
> commit 0011c0091e886b874e485a46ff2c94222ffbf550

Just for the record I can only get this to crash on Windows too, but I
don't have much confidence that it's a windows only issue.

I compiled with TRACE_SORT on Windows and Linux to find out what the
sort does differently on each platform.

On windows:

LOG:  begin tuple sort: nkeys = 2, workMem = 4096, randomAccess = f
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  switching to external sort with 15 tapes: CPU 0.00s/0.03u sec
elapsed 0.02 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  performsort starting: CPU 0.00s/0.37u sec elapsed 0.37 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  finished writing run 1 to tape 0: CPU 0.00s/0.43u sec elapsed 0.43 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  finished writing final run 2 to tape 1: CPU 0.00s/0.48u sec
elapsed 0.48 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  grew memtuples 1.29x from 58253 (1366 KB) to 74896 (1756 KB) for
final merge
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  tape 0 initially used 878 KB of 1147 KB batch (0.765) and 37446
out of 37447 slots (1.000)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  tape 1 initially used 309 KB of 1147 KB batch (0.269) and 13182
out of 37447 slots (0.352)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  performsort done (except 2-way final merge): CPU 0.01s/0.48u sec
elapsed 0.49 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from
string_Agg_crash_test;
LOG:  server process (PID 10848) was terminated by exception 0xC0000005

On Linux:

LOG:  begin tuple sort: nkeys = 2, workMem = 4096, randomAccess = f
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  switching to external sort with 15 tapes: CPU 0.00s/0.03u sec
elapsed 0.03 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  replacement selection will sort 58253 first run tuples
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  performsort starting: CPU 0.00s/0.16u sec elapsed 0.16 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  finished incrementally writing first run 1 to tape 0: CPU
0.00s/0.20u sec elapsed 0.20 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  starting quicksort of run 2: CPU 0.00s/0.20u sec elapsed 0.20 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  finished quicksort of run 2: CPU 0.00s/0.20u sec elapsed 0.20 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  finished writing run 2 to tape 1: CPU 0.00s/0.21u sec elapsed 0.21 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  grew memtuples 1.29x from 58253 (1366 KB) to 74896 (1756 KB) for
final merge
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  tape 0 initially used 878 KB of 1147 KB batch (0.765) and 37446
out of 37447 slots (1.000)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  tape 1 initially used 309 KB of 1147 KB batch (0.269) and 13182
out of 37447 slots (0.352)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  performsort done (except 2-way final merge): CPU 0.00s/0.21u sec
elapsed 0.21 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;
LOG:  external sort ended, 177 disk blocks used: CPU 0.00s/0.24u sec
elapsed 0.24 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM
string_agg_crash_test;

Both versions are 6ad8ac6

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 6:41 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Let me check....
>
> Looks like:
>
> commit 0011c0091e886b874e485a46ff2c94222ffbf550

Why do you say that? My first suspicion is
f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee, fwiw.


--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
David Rowley
Дата:
On 30 September 2016 at 00:29, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 29, 2016 at 6:41 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Let me check....
>>
>> Looks like:
>>
>> commit 0011c0091e886b874e485a46ff2c94222ffbf550
>
> Why do you say that? My first suspicion is
> f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee, fwiw.

git bisect


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 12:30 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> git bisect

That commit might just change things in a way that makes the test case
accidentally not crash. The characteristics of the merge will be
significantly different. Could you tell me if reverting
f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee on Windows helps?

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 12:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> That commit might just change things in a way that makes the test case
> accidentally not crash.

Of course, I meant that reverting that commit might merely mask the
problem, rather than removing it.

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
David Rowley
Дата:
On 30 September 2016 at 00:34, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 29, 2016 at 12:30 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> git bisect
>
> That commit might just change things in a way that makes the test case
> accidentally not crash. The characteristics of the merge will be
> significantly different. Could you tell me if reverting
> f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee on Windows helps?

I didn't get a clean revert due to some conflicts. I manually ended up
with the attached, which, when applied still crashes on windows.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 1:04 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I didn't get a clean revert due to some conflicts. I manually ended up
> with the attached, which, when applied still crashes on windows.

Thanks.

I have been able to reproduce the problem, in a fashion, so can now
probably debug it easily enough. Specifically, I can get Valgrind to
complain about Regina's test case. Valgrind memcheck with
track_origins enabled, running Postgres git tip, prints the following:

2016-09-29 05:06:56.382 PDT 15292 LOG:  tape 1 initially used 409 KB
of 1147 KB batch (0.357) and 17448 out of 37447 slots (0.466)
2016-09-29 05:06:56.383 PDT 15292 LOG:  performsort done (except 2-way
final merge): CPU 0.02s/13.32u sec elapsed 13.35 sec
==15292== Invalid read of size 1
==15292==    at 0x4A2F13: toast_raw_datum_size (tuptoaster.c:358)
==15292==    by 0x7A41EF: texteq (varlena.c:1615)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==  Address 0xc3e573b is 399,099 bytes inside a block of size
1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:toast_raw_datum_size
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
==15292== Invalid read of size 1
==15292==    at 0x7CC7AF: pg_detoast_datum_packed (fmgr.c:2214)
==15292==    by 0x7A420C: texteq (varlena.c:1621)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==  Address 0xc3e573b is 399,099 bytes inside a block of size
1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:pg_detoast_datum_packed
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
==15292== Invalid read of size 1
==15292==    at 0x7A4214: texteq (varlena.c:1623)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==    by 0x670D36: BackendRun (postmaster.c:4260)
==15292==    by 0x670D36: BackendStartup (postmaster.c:3934)
==15292==    by 0x670D36: ServerLoop (postmaster.c:1691)
==15292==    by 0x670D36: PostmasterMain (postmaster.c:1299)
==15292==  Address 0xc3e573b is 399,099 bytes inside a block of size
1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
}
==15292== Invalid read of size 1
==15292==    at 0x4C33D28: __memcmp_sse4_1 (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7A4241: texteq (varlena.c:1623)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==  Address 0xc3e573c is 399,100 bytes inside a block of size
1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:__memcmp_sse4_1
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
==15292== Invalid read of size 1
==15292==    at 0x4C33D45: __memcmp_sse4_1 (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7A4241: texteq (varlena.c:1623)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==  Address 0xc3e573a is 399,098 bytes inside a block of size
1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:__memcmp_sse4_1
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
--15292-- warning: evaluate_Dwarf3_Expr: unhandled DW_OP_ 0xf2
--15292-- warning: evaluate_Dwarf3_Expr: unhandled DW_OP_ 0xf3
2016-09-29 05:06:58.308 PDT 15292 LOG:  external sort ended, 177 disk
blocks used: CPU 0.02s/15.24u sec elapsed 15.27 sec

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, works OK for me:
>
> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
> FROM string_agg_crash_test;
>     string_agg
> -------------------
>  HI,IL,KS,MA,MA,NY
> (1 row)

Actually, it doesn't work for you, since "MA" appears twice. :-(

A fix is in the works.

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Tom Lane
Дата:
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Sep 29, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, works OK for me:
>>
>> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
>> FROM string_agg_crash_test;
>> string_agg
>> -------------------
>> HI,IL,KS,MA,MA,NY
>> (1 row)

> Actually, it doesn't work for you, since "MA" appears twice. :-(

Hah ... it was too late at night for me to notice that ;-)

> A fix is in the works.

So you've identified the problem and the rest of us don't need to
chase it anymore?

            regards, tom lane

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So you've identified the problem and the rest of us don't need to
> chase it anymore?

Yes. Just verifying my understanding in detail. Shouldn't be long.

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 1:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I have been able to reproduce the problem, in a fashion, so can now
> probably debug it easily enough. Specifically, I can get Valgrind to
> complain about Regina's test case.

I see what the problem is. This use-after-free bug occurs because, for
some reason, tuplesort_gettupleslot()'s contract was not updated by me
alongside analogous routines like tuplesort_getindextuple() as part of
the batch memory commit. This meant that at least one particular case
(process_ordered_aggregate_multi() calls to tuplesort_gettupleslot(),
with distinct columns) felt entitled to reuse some slot's tuple, a
tuple located in memory managed by tuplesort.c as batch memory.  The
reuse was across calls to tuplesort_gettupleslot().

Attached patch fixes the bug, and updates the contract of
tuplesort_gettupleslot() so it's no longer the odd one out. However,
I'm not recommending this be committed without further discussion,
because the performance overhead could be a concern. Also, it looks
like hypothetical_dense_rank_final() requires similar treatment, which
isn't addressed by this patch. I did audit other callers of
tuplesort_gettupleslot() and analogous routines for other classes of
caller-tuple (e.g., tuplesort_getindextuple() callers were examined
too). Look like it's just these two tuplesort callers that are
affected.

How much could it hurt performance to fix the bug in this way? Does
anyone see an alternative? Pushing knowledge of this special case into
tuplesort.c seems like an unappealing alternative.

--
Peter Geoghegan

Вложения

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Tom Lane
Дата:
Peter Geoghegan <pg@heroku.com> writes:
> I see what the problem is. This use-after-free bug occurs because, for
> some reason, tuplesort_gettupleslot()'s contract was not updated by me
> alongside analogous routines like tuplesort_getindextuple() as part of
> the batch memory commit. This meant that at least one particular case
> (process_ordered_aggregate_multi() calls to tuplesort_gettupleslot(),
> with distinct columns) felt entitled to reuse some slot's tuple, a
> tuple located in memory managed by tuplesort.c as batch memory.

My immediate reaction to this is WTF.  It seems like you have completely
broken the expected contract of tuplesort_gettupleslot, which is that
it copies the data into caller-owned memory.  That cannot stand.  Even
if you are willing to hack all the core callers so that they no longer
expect that, you've probably broken --- subtly --- callers in extensions.

            regards, tom lane

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 4:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My immediate reaction to this is WTF.  It seems like you have completely
> broken the expected contract of tuplesort_gettupleslot, which is that
> it copies the data into caller-owned memory.  That cannot stand.

Other comparable routines, like tuplesort_getindextuple(), could be
argued to have always had a contract that allows for this kind of
thing (recycling batch memory) because they get to examine a
*should_free pointer -- do you accept that much?
tuplesort_gettupleslot() does that for callers that happen to want to
use a tuple slot, rather than a direct caller tuple. So, it doesn't
seem like there is a very hard distinction there; that could also have
broken something in an extension too, since the lifetime of memory
controlled by tuplesort (should_free = false cases, managed by
tuplesort memory context) was always a bit unclear.

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Heikki Linnakangas
Дата:
On 09/29/2016 07:04 PM, Peter Geoghegan wrote:
> On Thu, Sep 29, 2016 at 4:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My immediate reaction to this is WTF.  It seems like you have completely
>> broken the expected contract of tuplesort_gettupleslot, which is that
>> it copies the data into caller-owned memory.  That cannot stand.
>
> Other comparable routines, like tuplesort_getindextuple(), could be
> argued to have always had a contract that allows for this kind of
> thing (recycling batch memory) because they get to examine a
> *should_free pointer -- do you accept that much?

The comments in those other routines explicitly say that "If it is not
set [*should_free], caller should not use tuple following next call
here". I also didn't notice that tuplesort_gettupleslot doesn't contain
say that, until now.

That's unfortunate. AFAICS, we have no choice but palloc(), when
tuplesort_gettupleslot() is used. For version 10, maybe we should
reconsider that API.

- Heikki

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Sep 29, 2016 at 10:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The comments in those other routines explicitly say that "If it is not set
> [*should_free], caller should not use tuple following next call here". I
> also didn't notice that tuplesort_gettupleslot doesn't contain say that,
> until now.

They say that because the patch with the bug in question made them say
that. The fact that tuplesort_gettupleslot() did not have such a
comment added at the same time (and, more importantly, verification
that that would be okay for all existing callers) was arguably the
oversight in that patch/commit. That said, it probably wouldn't have
been a good idea to make it work. There is an implicit requirement for
tuplesort_gettupleslot() alone around the ownership of memory, per Tom
-- caller wants ownership of the memory returned. This is confused by
the fact that the memory is and was always allocated in sortcontext
memory context. So, caller wants ownership of the memory, but gets it
in the sortcontext. It looks like the callers sometimes ultimately
call ExecCopySlotTuple() again, just to be safe (e.g.,
agg_retrieve_direct()), possibly because the contained-in-slot tuple
may need to outlive the tuplesort itself (which represent the entire
sort operation's lifetime).

> That's unfortunate. AFAICS, we have no choice but palloc(), when
> tuplesort_gettupleslot() is used. For version 10, maybe we should reconsider
> that API.

We have to do something to fix this bug, and performing an extra
retail palloc() looks like the easiest solution. This does not
actually defeat the purpose of batch memory, which was mostly about
cache efficiency. I tend to think that the possible uses of the slot
by tuplesort_gettupleslot() callers are too complex to tie down.

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Wed, Oct 12, 2016 at 2:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> That's unfortunate. AFAICS, we have no choice but palloc(), when
>> tuplesort_gettupleslot() is used. For version 10, maybe we should reconsider
>> that API.
>
> We have to do something to fix this bug, and performing an extra
> retail palloc() looks like the easiest solution. This does not
> actually defeat the purpose of batch memory, which was mostly about
> cache efficiency. I tend to think that the possible uses of the slot
> by tuplesort_gettupleslot() callers are too complex to tie down.

Attached is a fix inspired by the similar routine in tuplestore.c:
tuplestore_gettupleslot(). This bug seems totally analogous to the one
fixed by 25bf7f8b -- history repeats itself.

This fix has us copy the MinimalTuple into sortcontext palloc() memory
within tuplesort_gettupleslot() (based on commit 25bf7f8b). This still
differs a little from tuplestore_gettupleslot(), which explicitly uses
current context of caller, but we've always done things that way for
tuplesort.c.

It wouldn't be very hard to also add a tuplestore_gettupleslot()-style
"copy" boolean argument to tuplesort_gettupleslot(), but I haven't
done that here. That way, callers that are happy to rely on the
current behavior (the sometimes dangerous behavior) could do so, and
so could still benefit from avoiding the new heap_copy_minimal_tuple()
call (doing so isn't broken for some callers -- those that are happy
to not reuse contents of tuple slot across calls). For now, I haven't
added that. I guess that's Postgres 10 work, as Heikki mentioned.

Apologies for the delay on this.

--
Peter Geoghegan

Вложения

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Wed, Oct 12, 2016 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This fix has us copy the MinimalTuple into sortcontext palloc() memory
> within tuplesort_gettupleslot() (based on commit 25bf7f8b). This still
> differs a little from tuplestore_gettupleslot(), which explicitly uses
> current context of caller, but we've always done things that way for
> tuplesort.c.

Actually, it's only true that tuplesort sortcontext context is used
when copy isn't needed, which is not predictable to caller, so the new
comment is a bit inaccurate. The inconsistency seems inconsequential,
since we've always assume that caller tuples allocated within
sortcontext may be "owned" by caller (when should_free = true),
despite not being in caller's own memory context.

Attached is revision with tiny tweak to relevant comment.

--
Peter Geoghegan

Вложения

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Heikki Linnakangas
Дата:
On 10/13/2016 03:02 AM, Peter Geoghegan wrote:
> On Wed, Oct 12, 2016 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> This fix has us copy the MinimalTuple into sortcontext palloc() memory
>> within tuplesort_gettupleslot() (based on commit 25bf7f8b). This still
>> differs a little from tuplestore_gettupleslot(), which explicitly uses
>> current context of caller, but we've always done things that way for
>> tuplesort.c.
>
> Actually, it's only true that tuplesort sortcontext context is used
> when copy isn't needed, which is not predictable to caller, so the new
> comment is a bit inaccurate. The inconsistency seems inconsequential,
> since we've always assume that caller tuples allocated within
> sortcontext may be "owned" by caller (when should_free = true),
> despite not being in caller's own memory context.
>
> Attached is revision with tiny tweak to relevant comment.

Hmm. That also adds a copy to the sorted-in-mem case. That's safe, but
should we be worried about performance. Or is the extra copy so cheap
that it doesn't matter?

We could easily also check for TSS_SORTEDINMEM, if that matters.

- Heikki

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Thu, Oct 13, 2016 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. That also adds a copy to the sorted-in-mem case. That's safe, but
> should we be worried about performance. Or is the extra copy so cheap that
> it doesn't matter?

I think that simply reading random locations in memory is the dominant
cost, but the exact overhead should be investigated before proceeding.

> We could easily also check for TSS_SORTEDINMEM, if that matters.

You're asking about this because TSS_SORTEDINMEM is a case that
changes (gets caller tuple copied) from 9.5 to a patched 9.6 for what
might seem like no particular reason. That's fair, but if you consider
the code and not the history, then TSS_SORTEDINMEM isn't really a
special case (plus, you can say the same thing about the equivalent
code within tuplestore.c).  That is, unnecessary copying will occur
for perhaps over 99% of calls here -- this waste occurs with all
TSS_SORTEDINMEM-state calls, but also with most TSS_FINALMERGE-state
calls.

The point I'm making is that we might be better off worrying about the
general problem, by adding a tuplestore_gettupleslot()-style "copy"
boolean argument at the same time, and having some callers pass
"false" to avoid copying (when they determine no risk of
use-after-free, by not keeping the contents of a slot active across
calls to tuplesort_gettupleslot()). You indicated that you don't
really want to go there for 9.6, but maybe it's worth reconsidering
that. For example, maybe ABI breakage is avoided by making
tuplesort_gettupleslot() into a shim. Or, maybe it's okay to put it in
the release notes of 9.6.1 -- I'm not sure how manageable that is.

--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Heikki Linnakangas
Дата:
On 10/13/2016 08:27 PM, Peter Geoghegan wrote:
> On Thu, Oct 13, 2016 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm. That also adds a copy to the sorted-in-mem case. That's safe, but
>> should we be worried about performance. Or is the extra copy so cheap that
>> it doesn't matter?
>
> I think that simply reading random locations in memory is the dominant
> cost, but the exact overhead should be investigated before proceeding.

Ok. In quick testing, the extra palloc() indeed didn't seem to cost much.

> The point I'm making is that we might be better off worrying about the
> general problem, by adding a tuplestore_gettupleslot()-style "copy"
> boolean argument at the same time, and having some callers pass
> "false" to avoid copying (when they determine no risk of
> use-after-free, by not keeping the contents of a slot active across
> calls to tuplesort_gettupleslot()). You indicated that you don't
> really want to go there for 9.6, but maybe it's worth reconsidering
> that. For example, maybe ABI breakage is avoided by making
> tuplesort_gettupleslot() into a shim. Or, maybe it's okay to put it in
> the release notes of 9.6.1 -- I'm not sure how manageable that is.

Pushed this isolated fix for now.

- Heikki

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Tom Lane
Дата:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 10/13/2016 08:27 PM, Peter Geoghegan wrote:
>> The point I'm making is that we might be better off worrying about the
>> general problem, by adding a tuplestore_gettupleslot()-style "copy"
>> boolean argument at the same time, and having some callers pass
>> "false" to avoid copying (when they determine no risk of
>> use-after-free, by not keeping the contents of a slot active across
>> calls to tuplesort_gettupleslot()). You indicated that you don't
>> really want to go there for 9.6, but maybe it's worth reconsidering
>> that. For example, maybe ABI breakage is avoided by making
>> tuplesort_gettupleslot() into a shim. Or, maybe it's okay to put it in
>> the release notes of 9.6.1 -- I'm not sure how manageable that is.

> Pushed this isolated fix for now.

FWIW, I think this is an appropriate solution for 9.6, but I'd favor
changing the function's API to be like tuplestore_gettupleslot's
in HEAD.

            regards, tom lane

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Mon, Oct 17, 2016 at 6:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I think this is an appropriate solution for 9.6, but I'd favor
> changing the function's API to be like tuplestore_gettupleslot's
> in HEAD.

In light of the fact that the minimal fix has low overhead, I agree.
I've drafted a patch to do this. I expect to be able to post it in the
next few days.


--
Peter Geoghegan

Re: BUG #14344: string_agg(DISTINCT ..) crash

От
Peter Geoghegan
Дата:
On Mon, Oct 17, 2016 at 6:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> In light of the fact that the minimal fix has low overhead, I agree.
> I've drafted a patch to do this. I expect to be able to post it in the
> next few days.

I've put this off until I can get some buy-in from Heikki about a more
general issue with "should_free" arguments in tuplesort.c [1].

[1] https://www.postgresql.org/message-id/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
--
Peter Geoghegan