Discussion: tuplestore API problem
By chance I discovered that this query in the regression tests

    SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four
    FROM tenk1 LIMIT 2;

stops working if work_mem is small enough: it either dumps core or
delivers wrong answers depending on platform.

After some tracing I found out the reason.  ExecWindowAgg() does this:

    if (!tuplestore_gettupleslot(winstate->buffer, true,
                                 winstate->ss.ss_ScanTupleSlot))
        elog(ERROR, "unexpected end of tuplestore");

and then goes off and calls the window functions (ntile() here), and
expects the ScanTupleSlot to still be valid afterwards.  However,
ntile() forces us to read to the end of the input to find out the
number of rows.  If work_mem is small enough, that means the
tuplestore is forced into dump-to-disk mode, which means it releases
all its in-memory tuples.  And guess what: the ScanTupleSlot is
pointing at one of those; it doesn't have its own copy of the tuple.
So we wind up trying to read from a trashed bit of memory.

A brute-force solution is to change tuplestore_gettupleslot() so that
it always copies the tuple, but this would be wasted cycles for most
uses of tuplestores.  I'm thinking of changing
tuplestore_gettupleslot's API to add a bool parameter specifying
whether the caller wants to force a copy.

Comments, better ideas?

BTW: this tells me that no one has tried to apply window functions to
nontrivial problems yet ... we'll need to encourage beta testers to
stress that code.

			regards, tom lane
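[For illustration, a minimal sketch of the proposed API change,
assuming the 8.4-era tuplestore internals: tuplestore stores
MinimalTuples, and tuplestore_gettuple() is tuplestore.c's internal
fetch routine.  The exact shape of the committed fix may differ.]

    bool
    tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
                            bool copy, TupleTableSlot *slot)
    {
        MinimalTuple tuple;
        bool         should_free;

        tuple = (MinimalTuple) tuplestore_gettuple(state, forward,
                                                   &should_free);
        if (tuple)
        {
            if (copy && !should_free)
            {
                /* The tuple still points into the tuplestore's own
                 * memory; take a private copy so a later dump-to-disk
                 * cannot invalidate the slot's contents. */
                tuple = heap_copy_minimal_tuple(tuple);
                should_free = true;
            }
            ExecStoreMinimalTuple(tuple, slot, should_free);
            return true;
        }
        ExecClearTuple(slot);
        return false;
    }

A caller such as ExecWindowAgg() that holds the slot across further
tuplestore activity would pass copy = true; a simple one-pass reader
could pass false and keep today's zero-copy behavior.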
On Thu, 2009-03-26 at 12:57 -0400, Tom Lane wrote:
> If work_mem is small enough, that means the tuplestore is
> forced into dump-to-disk mode, which means it releases all its
> in-memory tuples.  And guess what: the ScanTupleSlot is pointing at
> one of those, it doesn't have its own copy of the tuple.  So we wind
> up trying to read from a trashed bit of memory.
>
> A brute-force solution is to change tuplestore_gettupleslot() so that
> it always copies the tuple, but this would be wasted cycles for most
> uses of tuplestores.  I'm thinking of changing
> tuplestore_gettupleslot's API to add a bool parameter specifying
> whether the caller wants to force a copy.
>
> Comments, better ideas?

Sounds very similar to the solution that you just removed from the
hash join code for performance reasons.

Flushing memory when we overflow sounds like an artifact from the time
when tuplestore split from tuplesort.  Can't we keep the appropriate
rows in memory and scroll through them?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> Sounds very similar to the solution that you just removed from the
> hash join code for performance reasons.  Flushing memory when we
> overflow sounds like an artifact from the time when tuplestore split
> from tuplesort.  Can't we keep the appropriate rows in memory and
> scroll through them?

We're not doing a fundamental redesign of the window function support
right now.

			regards, tom lane
2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
> After some tracing I found out the reason.  ExecWindowAgg() does this:
>
>     if (!tuplestore_gettupleslot(winstate->buffer, true,
>                                  winstate->ss.ss_ScanTupleSlot))
>         elog(ERROR, "unexpected end of tuplestore");
>
> and then goes off and calls the window functions (ntile() here), and
> expects the ScanTupleSlot to still be valid afterwards.  However,
> ntile() forces us to read to the end of the input to find out the
> number of rows.  If work_mem is small enough, that means the
> tuplestore is forced into dump-to-disk mode, which means it releases
> all its in-memory tuples.  And guess what: the ScanTupleSlot is
> pointing at one of those; it doesn't have its own copy of the tuple.
> So we wind up trying to read from a trashed bit of memory.
>
> A brute-force solution is to change tuplestore_gettupleslot() so that
> it always copies the tuple, but this would be wasted cycles for most
> uses of tuplestores.  I'm thinking of changing
> tuplestore_gettupleslot's API to add a bool parameter specifying
> whether the caller wants to force a copy.
>
> Comments, better ideas?

Is this a tuplestore API problem?  ISTM it is a window function
problem.  I think my early code held a HeapTuple rather than a
TupleTableSlot for the current row.  At a glance, the issue appears
only for the window function's current row, which is fetched and then,
in some cases, used again after the following rows have been stored.
So a brute-force solution might be for ExecWindowAgg() to copy the
current row out of the tuplestore instead of pointing directly into
tuplestore memory, without changing the tuplestore API.

Regards,

-- 
Hitoshi Harada
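[A sketch of that alternative, on the assumption that ExecWindowAgg()
keeps the existing three-argument call and simply materializes the
slot right after fetching; this is illustrative and not necessarily
the shape of the attached patch.]

    if (!tuplestore_gettupleslot(winstate->buffer, true,
                                 winstate->ss.ss_ScanTupleSlot))
        elog(ERROR, "unexpected end of tuplestore");

    /* Force the slot to hold its own private copy of the tuple, so a
     * later dump-to-disk inside the tuplestore cannot leave the slot
     * pointing at freed memory. */
    ExecMaterializeSlot(winstate->ss.ss_ScanTupleSlot);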
2009/3/27 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
>> A brute-force solution is to change tuplestore_gettupleslot() so
>> that it always copies the tuple, but this would be wasted cycles
>> for most uses of tuplestores.  I'm thinking of changing
>> tuplestore_gettupleslot's API to add a bool parameter specifying
>> whether the caller wants to force a copy.
>>
>> Comments, better ideas?
>
> Is this a tuplestore API problem?  ISTM it is a window function
> problem.  ...  So a brute-force solution might be for ExecWindowAgg()
> to copy the current row out of the tuplestore instead of pointing
> directly into tuplestore memory, without changing the tuplestore API.

Here's the patch.  I hope there are no more bugs with the same root
cause.  It seems we would need to implement something like a garbage
collector in the tuplestore, marking and tracing each row reference,
if a complete solution is required.

Regards,

-- 
Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2009/3/27 Hitoshi Harada <umi.tanuki@gmail.com>:
>> 2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
>>> A brute-force solution is to change tuplestore_gettupleslot() so
>>> that it always copies the tuple, but this would be wasted cycles
>>> for most uses of tuplestores.  I'm thinking of changing
>>> tuplestore_gettupleslot's API to add a bool parameter specifying
>>> whether the caller wants to force a copy.

> Here's the patch.  I hope there are no more bugs with the same root
> cause.  It seems we would need to implement something like a garbage
> collector in the tuplestore, marking and tracing each row reference,
> if a complete solution is required.

I don't like this; I'm planning to go with the aforementioned API
change instead.  The way you have it guarantees an extra copy cycle
even when tuplestore is already making a copy internally; and it
doesn't help if we find similar problems elsewhere.  (While I'm making
the API change I'll take a close look at each call site to see if it
has any similar risk.)

			regards, tom lane
2009/3/28 Tom Lane <tgl@sss.pgh.pa.us>:
> I don't like this; I'm planning to go with the aforementioned API
> change instead.  The way you have it guarantees an extra copy cycle
> even when tuplestore is already making a copy internally; and it
> doesn't help if we find similar problems elsewhere.  (While I'm
> making the API change I'll take a close look at each call site to
> see if it has any similar risk.)

You're right; it kills performance even after dumptuples().

Thinking about it more, I found that the problem arises only around
dumptuples().  If you can trace the TupleTableSlots that point to
memtuples inside the tuplestore, you can materialize them just before
WRITETUP() in dumptuples().  So I tried passing EState.es_tupleTable
to tuplestore_begin_heap() to trace those TupleTableSlots.  Note that
if you pass NULL the behavior is as before, so nothing is broken.  The
regression tests pass, but I'm not quite sure EState.es_tupleTable is
the only place that holds TupleTableSlots passed to the tuplestore...

I know you propose adding a should_copy boolean parameter to
tuplestore_gettupleslot().  That always adds overhead, even when the
tuplestore *doesn't* dump tuples, and in that case there is no need to
copy, I guess.

Regards,

-- 
Hitoshi Harada
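[A rough sketch of that approach.  The extra constructor parameter,
the tupleTable field, and the dump-time helper are all hypothetical;
the stock 8.4 tuplestore_begin_heap() takes only the first three
arguments.]

    /* Hypothetical extended constructor; passing NIL keeps the old
     * behavior, so existing callers are unaffected. */
    Tuplestorestate *
    tuplestore_begin_heap(bool randomAccess, bool interXact,
                          int maxKBytes, List *tupleTable);

    /* Hypothetical step in dumptuples(), called before WRITETUP()
     * releases the in-memory tuples. */
    static void
    materialize_watched_slots(Tuplestorestate *state)
    {
        ListCell   *lc;

        foreach(lc, state->tupleTable)   /* hypothetical field */
        {
            TupleTableSlot *slot = (TupleTableSlot *) lfirst(lc);

            /* Any slot still pointing into memtuples[] gets its own
             * copy before that memory goes away. */
            if (!TupIsNull(slot))
                ExecMaterializeSlot(slot);
        }
    }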
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> So I tried passing EState.es_tupleTable to tuplestore_begin_heap() to
> trace those TupleTableSlots.  Note that if you pass NULL the behavior
> is as before, so nothing is broken.  The regression tests pass, but
> I'm not quite sure EState.es_tupleTable is the only place that holds
> TupleTableSlots passed to the tuplestore...

I've got zero confidence in that, and it seems pretty horrid from a
system structural point of view even if it worked: neither
tuplestore.c nor its direct callers have any business knowing where
all the slots are.  What might be a bit saner is to remember the slot
last passed to tuplestore_gettupleslot for each read pointer.  The
implication would be that we'd be assuming only one slot is used to
fetch from any one read pointer, but that is probably a reasonable
restriction.

> I know you propose adding a should_copy boolean parameter to
> tuplestore_gettupleslot().

I already did it, and concluded that not only were all the
tuplestore_gettupleslot calls in nodeWindowAgg broken, but so was
nodeCtescan.  I'm open to reverting that patch if you have a better
solution though.

			regards, tom lane
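[A sketch of what per-read-pointer tracking could look like.
TSReadPointer is tuplestore.c's per-pointer bookkeeping struct; the
lastslot field and its upkeep are hypothetical.]

    typedef struct
    {
        int     eflags;          /* capability flags */
        bool    eof_reached;     /* read has reached EOF */
        int     current;         /* next array index to read */
        TupleTableSlot *lastslot;   /* hypothetical: slot last filled
                                     * by tuplestore_gettupleslot() */
    } TSReadPointer;

    /* Hypothetical step in dumptuples(): before the in-memory tuples
     * are written out and released, rescue each read pointer's slot. */
    int     i;

    for (i = 0; i < state->readptrcount; i++)
    {
        TSReadPointer *ptr = &state->readptrs[i];

        if (ptr->lastslot && !TupIsNull(ptr->lastslot))
            ExecMaterializeSlot(ptr->lastslot);
    }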
2009/3/29 Tom Lane <tgl@sss.pgh.pa.us>:
> I've got zero confidence in that, and it seems pretty horrid from a
> system structural point of view even if it worked: neither
> tuplestore.c nor its direct callers have any business knowing where
> all the slots are.  What might be a bit saner is to remember the slot
> last passed to tuplestore_gettupleslot for each read pointer.  The
> implication would be that we'd be assuming only one slot is used to
> fetch from any one read pointer, but that is probably a reasonable
> restriction.

Hm, this choice is better than mine.  But if we take it, I suppose we
need to consider a way to lift the restriction, for the case where we
are someday forced to use many TupleTableSlots on one read pointer.
Without that, I don't think going this way is a good idea.

> I already did it, and concluded that not only were all the
> tuplestore_gettupleslot calls in nodeWindowAgg broken, but so was
> nodeCtescan.  I'm open to reverting that patch if you have a better
> solution though.

For now, yours is better than anything else.  As we're not staying
here forever, let's go with the copy approach.  I'll keep this in mind
and revisit it when I work on improving tuplestore performance,
together with the later window function patches.

Regards,

-- 
Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2009/3/29 Tom Lane <tgl@sss.pgh.pa.us>:
>> ... What might be a bit saner is to remember the slot last passed to
>> tuplestore_gettupleslot for each read pointer.  The implication
>> would be that we'd be assuming only one slot is used to fetch from
>> any one read pointer, but that is probably a reasonable restriction.

> Hm, this choice is better than mine.  But if we take it, I suppose we
> need to consider a way to lift the restriction, for the case where we
> are someday forced to use many TupleTableSlots on one read pointer.

I don't think we'd ever be "forced" to do that; and it would be easy
to add an Assert to tuplestore_gettupleslot to check that it gets the
same slot on each call.  Someone who needed to save previous tuples
would be advised to copy older tuples to some other slot after
fetching them.

			regards, tom lane
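[The Assert Tom suggests could be as small as this, again assuming the
hypothetical lastslot field on the read pointer.]

    /* In tuplestore_gettupleslot(): enforce one slot per read pointer. */
    Assert(readptr->lastslot == NULL || readptr->lastslot == slot);
    readptr->lastslot = slot;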
2009/3/30 Tom Lane <tgl@sss.pgh.pa.us>:
> I don't think we'd ever be "forced" to do that; and it would be easy
> to add an Assert to tuplestore_gettupleslot to check that it gets the
> same slot on each call.  Someone who needed to save previous tuples
> would be advised to copy older tuples to some other slot after
> fetching them.

I don't think advising or documenting such a restriction for future
programmers is a good idea from the point of view of encapsulation.
I've come up with an idea: the read pointers remember their last slot,
as you suggest, but when another slot comes into
tuplestore_gettupleslot() the former one is materialized and then
forgotten.  With this you don't need the restriction above, at a
minimal penalty that is paid if and only if you pass more than one
tuple slot to the tuplestore, which doesn't seem to happen currently.

Regards,

-- 
Hitoshi Harada
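[Hitoshi's variant replaces the Assert with a materialize-and-switch,
so an unexpected second slot is handled rather than rejected; same
hypothetical lastslot field as above.]

    /* In tuplestore_gettupleslot(): if a different slot shows up,
     * rescue whatever the previous slot still points at, then start
     * tracking the new one. */
    if (readptr->lastslot != NULL &&
        readptr->lastslot != slot &&
        !TupIsNull(readptr->lastslot))
        ExecMaterializeSlot(readptr->lastslot);
    readptr->lastslot = slot;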
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> I don't think advising or documenting such a restriction for future
> programmers is a good idea from the point of view of encapsulation.
> I've come up with an idea: the read pointers remember their last
> slot, as you suggest, but when another slot comes into
> tuplestore_gettupleslot() the former one is materialized and then
> forgotten.  With this you don't need the restriction above, at a
> minimal penalty that is paid if and only if you pass more than one
> tuple slot to the tuplestore, which doesn't seem to happen currently.

I think that the problem I found a couple days ago
http://archives.postgresql.org/pgsql-hackers/2009-03/msg01247.php
probably blows a hole in all these schemes.  After-the-fact
materialization of a TupleTableSlot won't protect any
pass-by-reference Datums that have already been fetched from that
slot.  Perhaps we could invent a coding rule that would prevent the
situation, but I think it would be awfully easy to mess up in any case
where you actually had a need to keep track of more than one current
tuple.

I now think that the CVS-HEAD arrangement is about as good as we
should expect to get for 8.4.  The experiments I've been doing suggest
that the "extra" tuple copying isn't a major bottleneck anyway...

			regards, tom lane
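[The hole Tom points out can be seen in a few lines: materializing the
slot after the fact copies the tuple, but not any pass-by-reference
Datum already extracted from it.  The slot and attribute number here
are illustrative.]

    Datum   d;
    bool    isnull;
    text   *t;

    /* For a pass-by-reference type, d points straight into the
     * tuple's memory, not at an independent copy. */
    d = slot_getattr(slot, 1, &isnull);

    /* ... a window function reads ahead; the tuplestore dumps to disk
     * and frees its in-memory tuples ... */

    /* Materializing now gives the slot a fresh copy of the tuple, but
     * d still points at the old, freed memory. */
    ExecMaterializeSlot(slot);

    t = DatumGetTextP(d);    /* dangling-pointer dereference */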