Thread: OOM in spgist insert
While testing something on SP-GiST, I found that at a certain point during insertion it goes into doPickSplit, but even after the split it cannot find a place to insert the tuple. It keeps going around that loop, apparently forever, and finally errors out with OOM: on each iteration we allocate memory for the tuple in the temp context, and since we never leave the loop the allocations accumulate until allocation fails.

To reproduce, load the data from the attached script 'data_load.sql' and run the commands below.

-- Load data before running this using 'data_load.sql'
-- Test case start
create extension spgist_name_ops;

select opcname, amvalidate(opc.oid)
from pg_opclass opc join pg_am am on am.oid = opcmethod
where amname = 'spgist' and opcname = 'name_ops';

-- warning expected here
select opcname, amvalidate(opc.oid)
from pg_opclass opc join pg_am am on am.oid = opcmethod
where amname = 'spgist' and opcname = 'name_ops_old';

create table t(f1 name, f2 integer, f3 text);
create index on t using spgist(f1) include(f2, f3);
\d+ t_f1_f2_f3_idx

insert into t
  select proname,
         case when length(proname) % 2 = 0 then pronargs else null end,
         prosrc
  from pg_proc_test;
-- Test case end

Memory allocation stack:

#1 0x0000000000bf96c5 in palloc0 (size=9696) at mcxt.c:1133
#2 0x000000000056b24b in spgFormLeafTuple (state=0x7ffedea15b80, heapPtr=0x27df306, datums=0x7ffedea15660, isnulls=0x7ffedea15640) at spgutils.c:892
#3 0x000000000055e15c in doPickSplit (index=0x7fa4b1ddd5c8, state=0x7ffedea15b80, current=0x7ffedea159c0, parent=0x7ffedea159a0, newLeafTuple=0x27df300, level=9, isNulls=false, isNew=true) at spgdoinsert.c:848
#4 0x0000000000561e53 in spgdoinsert (index=0x7fa4b1ddd5c8, state=0x7ffedea15b80, heapPtr=0x27718d8, datums=0x7ffedea15cc0, isnulls=0x7ffedea15ca0) at spgdoinsert.c:2115

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments
On Wed, 12 May 2021 at 11:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> While testing something on SP-GiST I found that at a certain point during
> insertion it goes into doPickSplit, but even after the split it cannot
> find a place to insert the tuple. It keeps going around that loop,
> apparently forever, and finally errors out with OOM, because in this loop
> we continuously allocate memory for the tuple in the temp context, and we
> never come out of the loop.
My first idea is that this is the case where the index tuple doesn't fit on one index page. Since INCLUDEd columns are added as-is, the tuple cannot be made shorter by prefix-stripping. It seems we should check every index tuple's length against the page size before insertion. I will look at the code a little later.
Thanks for the report!
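Pavel's suggested size check can be sketched outside the server code as a plain gate on the proposed leaf tuple size. This is a hypothetical simplification, not the actual patch: `TOY_PAGE_CAPACITY` is an illustrative stand-in for the real SPGIST_PAGE_CAPACITY, and `leaf_tuple_may_fit` is an invented name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the real page-capacity constant (value illustrative). */
#define TOY_PAGE_CAPACITY 8152

/*
 * Hypothetical pre-insertion gate: if the proposed leaf tuple cannot fit
 * on an empty page and the opclass has no way to shorten it (no
 * longValuesOK suffixing -- and INCLUDE columns are stored as-is, so they
 * never shrink), reject up front instead of looping forever.
 * Returns true if insertion may proceed.
 */
static bool
leaf_tuple_may_fit(size_t leaf_size, bool long_values_ok)
{
    if (leaf_size <= TOY_PAGE_CAPACITY)
        return true;            /* fits as-is */
    /* Oversized: acceptable only if the opclass can suffix it smaller. */
    return long_values_ok;
}
```

The point of checking before descending into the tree is that once the loop starts, no amount of picksplit work can help a tuple whose irreducible part is already over a page.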
On Wed, 12 May 2021 at 1:43 PM, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> My first idea is that this is the case when index tuple doesn't fit into
> one index page. As INCLUDED columns are added as is the tuple can not be
> made shorter by prefix-stripping. Seems we should check every index tuple
> length to fit the page before its insertion.
Thanks for looking into this.
> Will see the code little bit later.

Ok
On Wed, 12 May 2021 at 12:36, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks for looking into this.
>
>> Will see the code little bit later.
>
> Ok
PFA v1 patch. Does this help?
Attachments
On Wed, 12 May 2021 at 12:39, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> PFA v1 patch. Does this help?
I've made a mistake in the attribute count in v1. PFA v2
Attachments
On Wed, May 12, 2021 at 2:21 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> PFA v1 patch. Does this help?
>
> I've made a mistake in attributes count in v1. PFA v2

V2 works. Thanks for fixing this quickly. I think you can add a comment for the new error condition you added.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> V2 works. Thanks for fixing this quickly, I think you can add a
> comment for the new error condition you added.
Added comments. PFA v3
Attachments
On Wed, May 12, 2021 at 5:11 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> V2 works. Thanks for fixing this quickly, I think you can add a
>> comment for the new error condition you added.
>
> Added comments. PFA v3

Thanks.

+ *
+ * For indexes with INCLUDEd columns we do not know whether we can reduce
+ * index tuple size by suffixing its key part or we will go into the
+ * endless loop on pick-split (in case included columns data is big enough

INCLUDEd -- why have you used mixed case here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> INCLUDEd -- why have you used mixed case here?
It is current practice to write INCLUDE columns in capitals; you can find many places in the existing code. But the case mixture can indeed be avoided ))
PFA v4
Attachments
On Wed, May 12, 2021 at 5:35 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> INCLUDEd -> why you have used a mixed case here?
>
> It is current practice to call INCLUDE columns in capital, you can find
> many places in the current code. But case mixture can be avoided indeed ))
> PFA v4

Okay, that makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Pavel Borisov <pashkin.elfe@gmail.com> writes:
> [ v4-0001-When-there-are-INCLUDEd-columns-in-SpGist-check-t.patch ]

I don't like this patch one bit --- it's basically disabling a fairly important SPGiST feature as soon as there are included columns. What's more, it's not really giving us any better defense against the infinite-picksplit-loop problem than we had before.

I wonder if we should give up on the theory posited circa spgdoinsert.c:2213:

     * Note: if the opclass sets longValuesOK, we rely on the
     * choose function to eventually shorten the leafDatum
     * enough to fit on a page.  We could add a test here to
     * complain if the datum doesn't get visibly shorter each
     * time, but that could get in the way of opclasses that
     * "simplify" datums in a way that doesn't necessarily
     * lead to physical shortening on every cycle.

The argument that there might be a longValuesOK opclass that *doesn't* shorten the datum each time seems fairly hypothetical to me.

An alternative way of looking at things is that this is the opclass's fault: it should have realized that it's not making progress, and thrown an error. However, I'm not sure the opclass picksplit function has enough info to throw a meaningful error. It looks to me like the trouble case, from spg_text_picksplit's point of view, is that it's given a single tuple containing a zero-length string, so that there is no prefix it can strip. However, it seems possible that that could be a legitimate case. Even if it's not, the opclass function doesn't have (for instance) the name of the index to cite in an error message. Nor did we give it a way to return a failure indication, which is seeming like a mistake right now.

BTW, another nasty thing I discovered while testing this is that the CHECK_FOR_INTERRUPTS() at line 2146 is useless, because we're holding a buffer lock there, so InterruptHoldoffCount > 0. So once you get into this loop you can't even cancel the query. Seems like that needs a fix, too.

			regards, tom lane
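The zero-length-string trouble case Tom describes can be modelled with a toy prefix-stripping helper. This is purely illustrative (invented function names, nothing from spgtextproc.c itself); it just shows why a split that only removes a common prefix makes no progress once the leaf datum is empty.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model of prefix-stripping: the length of the common prefix of two
 * strings is how much a picksplit-style split could shave off each leaf
 * datum.  (spg_text_picksplit is considerably more involved.)
 */
static size_t
common_prefix_len(const char *a, const char *b)
{
    size_t  n = 0;

    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}

/*
 * Size of a datum after stripping a prefix of the given length.  For a
 * zero-length string there is nothing to strip, so the size never
 * decreases -- the "no progress" case.
 */
static size_t
size_after_strip(const char *datum, size_t prefix)
{
    size_t  len = strlen(datum);

    return (prefix <= len) ? len - prefix : 0;
}
```

With a non-empty shared prefix each cycle strictly shrinks the datum; with an empty datum the size is stuck at zero extra progress, which is exactly what an opclass-side "am I making progress?" check would have to detect.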
On 2021-May-13, Tom Lane wrote:

> BTW, another nasty thing I discovered while testing this is that
> the CHECK_FOR_INTERRUPTS() at line 2146 is useless, because
> we're holding a buffer lock there so InterruptHoldoffCount > 0.
> So once you get into this loop you can't even cancel the query.
> Seems like that needs a fix, too.

This comment made me remember a patch I've had for a while, which splits the CHECK_FOR_INTERRUPTS() definition in two -- one of them is INTERRUPTS_PENDING_CONDITION(), which lets us test the condition separately; that allows the lock we hold to be released prior to actually processing the interrupts.

The btree code modified here was found to be an actual problem in production, when a btree was corrupted in such a way that vacuum would get into an infinite loop. I don't remember the exact details, but I think we saw vacuum running for a couple of weeks, and had to restart the server in order to terminate it (since it wouldn't respond to signals).

-- 
Álvaro Herrera       Valdivia, Chile
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitation in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."                               (Fotis)
(http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
Attachments
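The shape of Álvaro's test-then-service split can be sketched as a standalone toy. The names mimic the real macros, but everything below is a stub model (invented `toy_*` functions and plain globals), not backend code: the loop only *tests* for a pending interrupt while "holding the lock", and actually services it only after the lock is released.

```c
#include <stdbool.h>

/* Stand-ins for the backend's interrupt state (toy globals). */
static bool interrupt_pending = false;
static int  holdoff_count = 0;      /* >0 while "holding a buffer lock" */
static int  interrupts_serviced = 0;

/* Test the condition without servicing it -- the split-out macro's role. */
static bool
toy_interrupts_pending(void)
{
    return interrupt_pending;
}

/* Service the interrupt, but only when no holdoff is in effect. */
static void
toy_check_for_interrupts(void)
{
    if (interrupt_pending && holdoff_count == 0)
    {
        interrupt_pending = false;
        interrupts_serviced++;
    }
}

/*
 * The pattern: inside the lock-holding loop, only *test* for a pending
 * interrupt and bail out; after releasing the lock, actually service it.
 * Returns false to tell the caller to retry.
 */
static bool
toy_do_insert(void)
{
    holdoff_count++;                /* "acquire buffer lock" */
    for (;;)
    {
        if (toy_interrupts_pending())
        {
            holdoff_count--;        /* "release buffer lock" */
            toy_check_for_interrupts(); /* now able to service it */
            return false;
        }
        /* ... real insertion work would happen here ... */
        break;
    }
    holdoff_count--;
    return true;
}

/*
 * Scenario: an interrupt arrives before the second insert; verify the
 * insert bails out, the lock is released, and the interrupt is serviced.
 */
static bool
toy_run_scenario(void)
{
    if (!toy_do_insert())           /* no interrupt: should succeed */
        return false;
    interrupt_pending = true;
    if (toy_do_insert())            /* pending interrupt: should bail */
        return false;
    return !interrupt_pending && holdoff_count == 0
        && interrupts_serviced == 1;
}
```

The design point is that servicing while a holdoff is in effect is a no-op (the real ProcessInterrupts() simply returns), so testing the flag separately is the only way for lock-holding code to react at all.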
On 2021-May-13, Alvaro Herrera wrote:

> The btree code modified was found to be an actual problem in production
> when a btree is corrupted in such a way that vacuum would get an
> infinite loop.

(Looking again, the nbtpage.c hunk might have been made obsolete by c34787f91058 and other commits.)

-- 
Álvaro Herrera       Valdivia, Chile
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-13, Tom Lane wrote:
>> BTW, another nasty thing I discovered while testing this is that
>> the CHECK_FOR_INTERRUPTS() at line 2146 is useless, because
>> we're holding a buffer lock there so InterruptHoldoffCount > 0.
>> So once you get into this loop you can't even cancel the query.
>> Seems like that needs a fix, too.

Attached is a WIP patch for that part. Basically, if it looks like there's an interrupt pending, make spgdoinsert() fall out of its loop, so it'll release the buffer locks, and then recheck CHECK_FOR_INTERRUPTS() at a point where it can really throw the error.

Now, the hole in this approach is: what if ProcessInterrupts still declines to throw the error? I've made the patch make use of the existing provision to retry spgdoinsert() in case of deadlock, so that it just goes around and tries again. But that doesn't seem terribly satisfactory, because if InterruptPending remains set then we'll just have an infinite loop at that outer level. IOW, this patch can only cope with the situation where there's no outer function holding a buffer lock (or other reason to prevent the query cancel from taking effect). Maybe that's good enough, but I'm not terribly pleased with it.

> This comment made me remember a patch I've had for a while, which splits
> the CHECK_FOR_INTERRUPTS() definition in two -- one of them is
> INTERRUPTS_PENDING_CONDITION() which let us test the condition
> separately; that allows the lock we hold to be released prior to
> actually processing the interrupts.

Hm. Yeah, I was feeling that this patch is unduly friendly with the innards of CHECK_FOR_INTERRUPTS(). So maybe some refactoring is called for, but I'm not quite sure what it should look like. Testing the condition separately is fine, but what about the case of ProcessInterrupts refusing to pull the trigger?
			regards, tom lane

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..82e68c4ab7 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1905,13 +1905,14 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
 			ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+	bool		result = true;
 	TupleDesc	leafDescriptor = state->leafTupDesc;
 	bool		isnull = isnulls[spgKeyColumn];
 	int			level = 0;
@@ -2019,9 +2020,17 @@ spgdoinsert(Relation index, SpGistState *state,
 		/*
 		 * Bail out if query cancel is pending.  We must have this somewhere
 		 * in the loop since a broken opclass could produce an infinite
-		 * picksplit loop.
+		 * picksplit loop.  Moreover, because it's typically the case that
+		 * we're holding buffer lock(s), ProcessInterrupts() won't be able to
+		 * throw an error now.  Hence, if we see that InterruptPending remains
+		 * true, break out of the loop and deal with the problem below.
 		 */
 		CHECK_FOR_INTERRUPTS();
+		if (unlikely(InterruptPending))
+		{
+			result = false;
+			break;
+		}
 
 		if (current.blkno == InvalidBlockNumber)
 		{
@@ -2140,10 +2149,15 @@ spgdoinsert(Relation index, SpGistState *state,
 					 * spgAddNode and spgSplitTuple cases will loop back to here to
 					 * complete the insertion operation.  Just in case the choose
 					 * function is broken and produces add or split requests
-					 * repeatedly, check for query cancel.
+					 * repeatedly, check for query cancel (see comments above).
 					 */
 	process_inner_tuple:
 			CHECK_FOR_INTERRUPTS();
+			if (unlikely(InterruptPending))
+			{
+				result = false;
+				break;
+			}
 
 			innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
 							PageGetItemId(current.page, current.offnum));
@@ -2267,5 +2281,12 @@ spgdoinsert(Relation index, SpGistState *state,
 		UnlockReleaseBuffer(parent.buffer);
 	}
 
-	return true;
+	/*
+	 * Finally, check for interrupts again.  If there was one,
+	 * ProcessInterrupts() will be able to throw the error now.  But should it
+	 * not do so for some reason, return false to cause a retry.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	return result;
 }
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> (Looking again, the nbtpage.c hunk might have been made obsolete by
> c34787f91058 and other commits).

OK. Here's a revision that adopts your idea, except that I left out the nbtpage.c change since you aren't sure of that part.

I added a macro that allows spgdoinsert to Assert that it's not called in a context where the infinite-loop-due-to-InterruptPending risk would arise. This is a little bit fragile, because it'd be easy for ill-considered changes to ProcessInterrupts to break it, but it's better than nothing.

			regards, tom lane

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..1c14fc7f76 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1905,13 +1905,14 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
 			ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+	bool		result = true;
 	TupleDesc	leafDescriptor = state->leafTupDesc;
 	bool		isnull = isnulls[spgKeyColumn];
 	int			level = 0;
@@ -2019,9 +2020,17 @@ spgdoinsert(Relation index, SpGistState *state,
 		/*
 		 * Bail out if query cancel is pending.  We must have this somewhere
 		 * in the loop since a broken opclass could produce an infinite
-		 * picksplit loop.
+		 * picksplit loop.  Moreover, because it's typically the case that
+		 * we're holding buffer lock(s), ProcessInterrupts() won't be able to
+		 * throw an error now.  Hence, if we see that the interrupt condition
+		 * remains true, break out of the loop and deal with the case below.
 		 */
 		CHECK_FOR_INTERRUPTS();
+		if (INTERRUPTS_PENDING_CONDITION())
+		{
+			result = false;
+			break;
+		}
 
 		if (current.blkno == InvalidBlockNumber)
 		{
@@ -2140,10 +2149,15 @@ spgdoinsert(Relation index, SpGistState *state,
 					 * spgAddNode and spgSplitTuple cases will loop back to here to
 					 * complete the insertion operation.  Just in case the choose
 					 * function is broken and produces add or split requests
-					 * repeatedly, check for query cancel.
+					 * repeatedly, check for query cancel (see comments above).
 					 */
 	process_inner_tuple:
 			CHECK_FOR_INTERRUPTS();
+			if (INTERRUPTS_PENDING_CONDITION())
+			{
+				result = false;
+				break;
+			}
 
 			innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
 							PageGetItemId(current.page, current.offnum));
@@ -2267,5 +2281,21 @@ spgdoinsert(Relation index, SpGistState *state,
 		UnlockReleaseBuffer(parent.buffer);
 	}
 
-	return true;
+	/*
+	 * We do not support being called while some outer function is holding a
+	 * buffer lock (or any other reason to postpone query cancels).  If that
+	 * were the case, telling the caller to retry would create an infinite
+	 * loop.
+	 */
+	Assert(INTERRUPTS_CAN_BE_PROCESSED());
+
+	/*
+	 * Finally, check for interrupts again.  If there was a query cancel,
+	 * ProcessInterrupts() will be able to throw the error now.  If it was
+	 * some other kind of interrupt that can just be cleared, return false to
+	 * tell our caller to retry.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	return result;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6200699ddd..dd2ade7bb6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -554,7 +554,7 @@ ProcessClientWriteInterrupt(bool blocked)
 	{
 		/*
 		 * Don't mess with whereToSendOutput if ProcessInterrupts wouldn't
-		 * do anything.
+		 * service ProcDiePending.
 		 */
 		if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
 		{
@@ -3118,6 +3118,12 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 * If an interrupt condition is pending, and it's safe to service it,
 * then clear the flag and accept the interrupt.  Called only when
 * InterruptPending is true.
+ *
+ * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
+ * is guaranteed to clear the InterruptPending flag before returning.
+ * (This is not the same as guaranteeing that it's still clear when we
+ * return; another interrupt could have arrived.  But we promise that
+ * any pre-existing one will have been serviced.)
 */
 void
 ProcessInterrupts(void)
@@ -3248,7 +3254,11 @@ ProcessInterrupts(void)
 	{
 		/*
 		 * Re-arm InterruptPending so that we process the cancel request as
-		 * soon as we're done reading the message.
+		 * soon as we're done reading the message.  (XXX this is seriously
+		 * ugly: it complicates INTERRUPTS_CAN_BE_PROCESSED(), and it means we
+		 * can't use that macro directly as the initial test in this function,
+		 * meaning that this code also creates opportunities for other bugs to
+		 * appear.)
 		 */
 		InterruptPending = true;
 	}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af..645766a4ab 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -57,6 +57,11 @@
 * allowing die interrupts: HOLD_CANCEL_INTERRUPTS() and
 * RESUME_CANCEL_INTERRUPTS().
 *
+ * Also, INTERRUPTS_PENDING_CONDITION() can be checked to see whether an
+ * interrupt needs to be serviced, without trying to do so immediately.
+ * Some callers are also interested in INTERRUPTS_CAN_BE_PROCESSED(),
+ * which tells whether ProcessInterrupts is sure to clear the interrupt.
+ *
 * Special mechanisms are used to let an interrupt be accepted when we are
 * waiting for a lock or when we are waiting for command input (but, of
 * course, only if the interrupt holdoff counter is zero).  See the
@@ -97,24 +102,27 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+/* Test whether an interrupt is pending */
 #ifndef WIN32
+#define INTERRUPTS_PENDING_CONDITION() \
+	(unlikely(InterruptPending))
+#else
+#define INTERRUPTS_PENDING_CONDITION() \
+	(unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \
+	 unlikely(InterruptPending))
+#endif
 
+/* Service interrupt if one is pending */
 #define CHECK_FOR_INTERRUPTS() \
 do { \
-	if (unlikely(InterruptPending)) \
-		ProcessInterrupts(); \
-} while(0)
-#else							/* WIN32 */
-
-#define CHECK_FOR_INTERRUPTS() \
-do { \
-	if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
-		pgwin32_dispatch_queued_signals(); \
-	if (unlikely(InterruptPending)) \
+	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
 } while(0)
-#endif							/* WIN32 */
 
+/* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
+#define INTERRUPTS_CAN_BE_PROCESSED() \
+	(InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
+	 QueryCancelHoldoffCount == 0)
 
 #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)
Here's a patch that attempts to prevent the infinite loop in spgdoinsert by checking whether the proposed leaf tuple is getting smaller at each iteration. We can't be totally rigid about that, because, for example, if the opclass shortens a 7-byte string to 5 bytes, that will make no difference in the tuple's size after alignment.

I first tried to handle that by checking datumGetSize() of the key datum itself, but observed that spgtextproc.c has some cases where it'll return an empty leaf-datum string at two levels before succeeding. Maybe it'd be okay to fail that on the grounds that it can't become any smaller later. But on the whole, and considering the existing comment's concerns about opclasses that don't shorten the datum every time, it seems like a good idea to allow some fuzz here.

So what this patch does is to allow up to 10 cycles with no reduction in the actual leaf tuple size before failing. That way we can handle slop due to alignment roundoff and slop due to opclass corner cases with a single, very cheap mechanism. It does mean that we might build a few more useless inner tuples before failing --- but that seems like a good tradeoff for *not* failing in cases where the opclass is able to shorten the leaf datum sufficiently.

I have not bothered to tease apart the query-cancel and infinite-loop parts of the patch, but probably should do that before committing.

What do people think about back-patching this? In existing branches, we've defined it to be an opclass bug if it fails to shorten the leaf datum enough. But not having any defenses against that seems like not a great idea. OTOH, the 10-cycles-to-show-progress rule could be argued to be an API break.
			regards, tom lane

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..e882b0cea4 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -669,7 +669,8 @@ checkAllTheSame(spgPickSplitIn *in, spgPickSplitOut *out, bool tooBig,
  * will eventually terminate if lack of balance is the issue.  If the tuple
  * is too big, we assume that repeated picksplit operations will eventually
  * make it small enough by repeated prefix-stripping.  A broken opclass could
- * make this an infinite loop, though.
+ * make this an infinite loop, though, so spgdoinsert() checks that the
+ * leaf datums get smaller each time.
  */
 static bool
 doPickSplit(Relation index, SpGistState *state,
@@ -1905,18 +1906,21 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
 			ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+	bool		result = true;
 	TupleDesc	leafDescriptor = state->leafTupDesc;
 	bool		isnull = isnulls[spgKeyColumn];
 	int			level = 0;
 	Datum		leafDatums[INDEX_MAX_KEYS];
 	int			leafSize;
+	int			prevLeafSize;
+	int			numNoProgressCycles = 0;
 	SPPageDesc	current,
 				parent;
 	FmgrInfo   *procinfo = NULL;
@@ -1987,9 +1991,10 @@ spgdoinsert(Relation index, SpGistState *state,
 
 	/*
 	 * If it isn't gonna fit, and the opclass can't reduce the datum size by
-	 * suffixing, bail out now rather than getting into an endless loop.
+	 * suffixing, bail out now rather than doing a lot of useless work.
 	 */
-	if (leafSize > SPGIST_PAGE_CAPACITY && !state->config.longValuesOK)
+	if (leafSize > SPGIST_PAGE_CAPACITY &&
+		(isnull || !state->config.longValuesOK))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
@@ -2019,9 +2024,17 @@ spgdoinsert(Relation index, SpGistState *state,
 		/*
 		 * Bail out if query cancel is pending.  We must have this somewhere
 		 * in the loop since a broken opclass could produce an infinite
-		 * picksplit loop.
+		 * picksplit loop.  Moreover, because it's typically the case that
+		 * we're holding buffer lock(s), ProcessInterrupts() won't be able to
+		 * throw an error now.  Hence, if we see that the interrupt condition
+		 * remains true, break out of the loop and deal with the case below.
 		 */
 		CHECK_FOR_INTERRUPTS();
+		if (INTERRUPTS_PENDING_CONDITION())
+		{
+			result = false;
+			break;
+		}
 
 		if (current.blkno == InvalidBlockNumber)
 		{
@@ -2140,10 +2153,15 @@ spgdoinsert(Relation index, SpGistState *state,
 					 * spgAddNode and spgSplitTuple cases will loop back to here to
 					 * complete the insertion operation.  Just in case the choose
 					 * function is broken and produces add or split requests
-					 * repeatedly, check for query cancel.
+					 * repeatedly, check for query cancel (see comments above).
 					 */
 	process_inner_tuple:
 			CHECK_FOR_INTERRUPTS();
+			if (INTERRUPTS_PENDING_CONDITION())
+			{
+				result = false;
+				break;
+			}
 
 			innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
 							PageGetItemId(current.page, current.offnum));
@@ -2196,6 +2214,7 @@ spgdoinsert(Relation index, SpGistState *state,
 					/* Adjust level as per opclass request */
 					level += out.result.matchNode.levelAdd;
 					/* Replace leafDatum and recompute leafSize */
+					prevLeafSize = leafSize;
 					if (!isnull)
 					{
 						leafDatums[spgKeyColumn] = out.result.matchNode.restDatum;
@@ -2204,19 +2223,50 @@ spgdoinsert(Relation index, SpGistState *state,
 						leafSize += sizeof(ItemIdData);
 					}
 
+					/*
+					 * Check new tuple size; fail if it can't fit, unless the
+					 * opclass says it can handle the situation by suffixing.
+					 * In that case, prevent an infinite loop by checking to
+					 * see if we are actually making progress by reducing the
+					 * leafSize size in each pass.  This is a bit tricky
+					 * though.  Because of alignment considerations, the total
+					 * tuple size might not decrease on every pass.  Also,
+					 * there are edge cases where the choose method might seem
+					 * to not make progress for a cycle or two.  Somewhat
+					 * arbitrarily, we allow up to 10 no-progress iterations
+					 * before failing.  (The limit should be more than
+					 * MAXALIGN, to accommodate opclasses that trim one byte
+					 * from the leaf datum per pass.)
+					 */
+					if (leafSize > SPGIST_PAGE_CAPACITY)
+					{
+						bool		ok = false;
+
+						if (state->config.longValuesOK && !isnull)
+						{
+							if (leafSize < prevLeafSize)
+							{
+								ok = true;
+								numNoProgressCycles = 0;
+							}
+							else if (++numNoProgressCycles < 10)
+								ok = true;
+						}
+						if (!ok)
+							ereport(ERROR,
+									(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+									 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
+											leafSize - sizeof(ItemIdData),
+											SPGIST_PAGE_CAPACITY - sizeof(ItemIdData),
+											RelationGetRelationName(index)),
+									 errhint("Values larger than a buffer page cannot be indexed.")));
+					}
+
 					/*
 					 * Loop around and attempt to insert the new leafDatum at
 					 * "current" (which might reference an existing child
 					 * tuple, or might be invalid to force us to find a new
 					 * page for the tuple).
-					 *
-					 * Note: if the opclass sets longValuesOK, we rely on the
-					 * choose function to eventually shorten the leafDatum
-					 * enough to fit on a page.  We could add a test here to
-					 * complain if the datum doesn't get visibly shorter each
-					 * time, but that could get in the way of opclasses that
-					 * "simplify" datums in a way that doesn't necessarily
-					 * lead to physical shortening on every cycle.
 					 */
 					break;
 				case spgAddNode:
@@ -2267,5 +2317,21 @@ spgdoinsert(Relation index, SpGistState *state,
 		UnlockReleaseBuffer(parent.buffer);
 	}
 
-	return true;
+	/*
+	 * We do not support being called while some outer function is holding a
+	 * buffer lock (or any other reason to postpone query cancels).  If that
+	 * were the case, telling the caller to retry would create an infinite
+	 * loop.
+	 */
+	Assert(INTERRUPTS_CAN_BE_PROCESSED());
+
+	/*
+	 * Finally, check for interrupts again.  If there was a query cancel,
+	 * ProcessInterrupts() will be able to throw the error now.  If it was
+	 * some other kind of interrupt that can just be cleared, return false to
+	 * tell our caller to retry.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	return result;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6200699ddd..dd2ade7bb6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -554,7 +554,7 @@ ProcessClientWriteInterrupt(bool blocked)
 	{
 		/*
 		 * Don't mess with whereToSendOutput if ProcessInterrupts wouldn't
-		 * do anything.
+		 * service ProcDiePending.
 		 */
 		if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
 		{
@@ -3118,6 +3118,12 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 * If an interrupt condition is pending, and it's safe to service it,
 * then clear the flag and accept the interrupt.  Called only when
 * InterruptPending is true.
+ *
+ * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
+ * is guaranteed to clear the InterruptPending flag before returning.
+ * (This is not the same as guaranteeing that it's still clear when we
+ * return; another interrupt could have arrived.  But we promise that
+ * any pre-existing one will have been serviced.)
 */
 void
 ProcessInterrupts(void)
@@ -3248,7 +3254,11 @@ ProcessInterrupts(void)
 	{
 		/*
 		 * Re-arm InterruptPending so that we process the cancel request as
-		 * soon as we're done reading the message.
+		 * soon as we're done reading the message.  (XXX this is seriously
+		 * ugly: it complicates INTERRUPTS_CAN_BE_PROCESSED(), and it means we
+		 * can't use that macro directly as the initial test in this function,
+		 * meaning that this code also creates opportunities for other bugs to
+		 * appear.)
 		 */
 		InterruptPending = true;
 	}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af..645766a4ab 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -57,6 +57,11 @@
 * allowing die interrupts: HOLD_CANCEL_INTERRUPTS() and
 * RESUME_CANCEL_INTERRUPTS().
 *
+ * Also, INTERRUPTS_PENDING_CONDITION() can be checked to see whether an
+ * interrupt needs to be serviced, without trying to do so immediately.
+ * Some callers are also interested in INTERRUPTS_CAN_BE_PROCESSED(),
+ * which tells whether ProcessInterrupts is sure to clear the interrupt.
+ *
 * Special mechanisms are used to let an interrupt be accepted when we are
 * waiting for a lock or when we are waiting for command input (but, of
 * course, only if the interrupt holdoff counter is zero).  See the
@@ -97,24 +102,27 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+/* Test whether an interrupt is pending */
 #ifndef WIN32
+#define INTERRUPTS_PENDING_CONDITION() \
+	(unlikely(InterruptPending))
+#else
+#define INTERRUPTS_PENDING_CONDITION() \
+	(unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \
+	 unlikely(InterruptPending))
+#endif
 
+/* Service interrupt if one is pending */
 #define CHECK_FOR_INTERRUPTS() \
 do { \
-	if (unlikely(InterruptPending)) \
-		ProcessInterrupts(); \
-} while(0)
-#else							/* WIN32 */
-
-#define CHECK_FOR_INTERRUPTS() \
-do { \
-	if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
-		pgwin32_dispatch_queued_signals(); \
-	if (unlikely(InterruptPending)) \
+	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
 } while(0)
-#endif							/* WIN32 */
 
+/* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
+#define INTERRUPTS_CAN_BE_PROCESSED() \
+	(InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
+	 QueryCancelHoldoffCount == 0)
 
 #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)
diff --git a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
index 0ac99d08fb..ac0ddcecea 100644
--- a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
+++ b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
@@ -61,6 +61,12 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid |  1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)
 
+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
 drop index t_f1_f2_f3_idx;
 create index on t using spgist(f1 name_ops_old) include(f2, f3);
 \d+ t_f1_f2_f3_idx
@@ -100,3 +106,7 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid |  1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)
 
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
diff --git a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
index 76e78ba41c..982f221a8b 100644
--- a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
+++ b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
@@ -27,6 +27,12 @@ select * from t
 where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
 order by 1;
 
+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default
+
 drop index
t_f1_f2_f3_idx; create index on t using spgist(f1 name_ops_old) include(f2, f3); @@ -39,3 +45,7 @@ select * from t select * from t where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p' order by 1; + +\set VERBOSITY sqlstate +insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000)); +\set VERBOSITY default
On 2021-May-13, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > (Looking again, the nbtpage.c hunk might have been made obsolete by
> > c34787f91058 and other commits).
>
> OK.  Here's a revision that adopts your idea, except that I left
> out the nbtpage.c change since you aren't sure of that part.

Thanks.

> I added a macro that allows spgdoinsert to Assert that it's not
> called in a context where the infinite-loop-due-to-InterruptPending
> risk would arise.  This is a little bit fragile, because it'd be
> easy for ill-considered changes to ProcessInterrupts to break it,
> but it's better than nothing.

Hmm, it looks OK to me, but I wonder why you kept the original
CHECK_FOR_INTERRUPTS()s, since those would be done once we've broken out
of the loop anyway.  I tested Dilip's original test case, and while we
still die on OOM, we're able to interrupt it before dying.

Not related to this patch -- I was bothered by the UnlockReleaseBuffer
calls at the bottom of spgdoinsert that leave the buffer still set in
the structs.  It's not a problem if you look only at this routine, but I
notice that callee doPickSplit does the same thing, so maybe there is
some weird situation in which you could end up with the buffer variable
set even though we hold neither lock nor pin on the page, so that an
attempt to clean up would break.  I don't know enough about spgist to
figure out how to craft a test case; maybe it's impossible to reach for
some reason, but it seems a glass-on-the-beach sort of thing.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype.  I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)
On 2021-May-13, Tom Lane wrote:

> What do people think about back-patching this?  In existing branches,
> we've defined it to be an opclass bug if it fails to shorten the leaf
> datum enough.  But not having any defenses against that seems like
> not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
> argued to be an API break.

I think if the alternative is to throw an error, we can afford to retry
quite a few more times than 10 in order not to have that called an API
break.  Say, retry (MAXIMUM_ALIGNOF << 3) times or so (if you want to
parameterize on maxalign).  It's not like this is going to be a
performance drag where it's not needed ... but I think leaving
back-branches unfixed is not great.

I did run Dilip's test case as well as your new regression test, and
both work as intended with your new code (and both OOM-crash the
original code).

-- 
Álvaro Herrera               Valdivia, Chile
I think it's good to back-patch the check, since it doesn't change any
acceptable behavior; it just replaces an infinite loop with an
acceptable error.
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Hmm, it looks OK to me, but I wonder why you kept the original
> CHECK_FOR_INTERRUPTS()s since these would be done once we've broken out
> of the loop anyway.  I tested Dilip's original test case and while we
> still die on OOM, we're able to interrupt it before dying.

Hm.  My thought was that in the cases where InterruptPending is set for
some reason other than a query cancel, we could let ProcessInterrupts
service it at less cost than abandoning and retrying the index
insertion.  On reflection though, that only works for the first
CHECK_FOR_INTERRUPTS at the top of the loop, and only the first time
through, because during later calls we'll be holding buffer locks.
Maybe the best idea is to have one CHECK_FOR_INTERRUPTS at the top of
the function, in hopes of clearing out any already-pending interrupts,
and then just use the condition test inside the loop.

> Not related to this patch -- I was bothered by the UnlockReleaseBuffer
> calls at the bottom of spgdoinsert that leave the buffer still set in
> the structs.  It's not a problem if you look only at this routine, but I
> notice that callee doPickSplit does the same thing, so maybe there is
> some weird situation in which you could end up with the buffer variable
> set, but we don't hold lock nor pin on the page, so an attempt to clean
> up would break.

Maybe I'm confused, but aren't those just local variables that are about
to go out of scope anyway?  Clearing them seems more likely to draw
compiler warnings about dead stores than accomplish something useful.

			regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-13, Tom Lane wrote:
>> What do people think about back-patching this?  In existing branches,
>> we've defined it to be an opclass bug if it fails to shorten the leaf
>> datum enough.  But not having any defenses against that seems like
>> not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
>> argued to be an API break.

> I think if the alternative is to throw an error, we can afford to retry
> quite a few more times than 10 in order not to have that called an API
> break.  Say, retry (MAXIMUM_ALIGNOF << 3) times or so (if you want to
> parameterize on maxalign).  It's not like this is going to be a
> performance drag where not needed .. but I think leaving back-branches
> unfixed is not great.

Hm.  Index bloat is not something to totally ignore, though, so I'm
not sure what the best cutoff is.

Anyway, here is a patch set teased apart into committable bites,
and with your other points addressed.

			regards, tom lane

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6200699ddd..dd2ade7bb6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -554,7 +554,7 @@ ProcessClientWriteInterrupt(bool blocked)
     {
         /*
          * Don't mess with whereToSendOutput if ProcessInterrupts wouldn't
-         * do anything.
+         * service ProcDiePending.
          */
         if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
         {
@@ -3118,6 +3118,12 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
  * If an interrupt condition is pending, and it's safe to service it,
  * then clear the flag and accept the interrupt.  Called only when
  * InterruptPending is true.
+ *
+ * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
+ * is guaranteed to clear the InterruptPending flag before returning.
+ * (This is not the same as guaranteeing that it's still clear when we
+ * return; another interrupt could have arrived.  But we promise that
+ * any pre-existing one will have been serviced.)
  */
 void
 ProcessInterrupts(void)
@@ -3248,7 +3254,11 @@ ProcessInterrupts(void)
     {
         /*
          * Re-arm InterruptPending so that we process the cancel request as
-         * soon as we're done reading the message.
+         * soon as we're done reading the message.  (XXX this is seriously
+         * ugly: it complicates INTERRUPTS_CAN_BE_PROCESSED(), and it means we
+         * can't use that macro directly as the initial test in this function,
+         * meaning that this code also creates opportunities for other bugs to
+         * appear.)
          */
         InterruptPending = true;
     }
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af..b95bb956b6 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -57,6 +57,15 @@
  * allowing die interrupts: HOLD_CANCEL_INTERRUPTS() and
  * RESUME_CANCEL_INTERRUPTS().
  *
+ * Note that ProcessInterrupts() has also acquired a number of tasks that
+ * do not necessarily cause a query-cancel-or-die response.  Hence, it's
+ * possible that it will just clear InterruptPending and return.
+ *
+ * INTERRUPTS_PENDING_CONDITION() can be checked to see whether an
+ * interrupt needs to be serviced, without trying to do so immediately.
+ * Some callers are also interested in INTERRUPTS_CAN_BE_PROCESSED(),
+ * which tells whether ProcessInterrupts is sure to clear the interrupt.
+ *
  * Special mechanisms are used to let an interrupt be accepted when we are
  * waiting for a lock or when we are waiting for command input (but, of
  * course, only if the interrupt holdoff counter is zero).  See the
@@ -97,24 +106,27 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+/* Test whether an interrupt is pending */
 #ifndef WIN32
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(InterruptPending))
+#else
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \
+     unlikely(InterruptPending))
+#endif
 
+/* Service interrupt if one is pending */
 #define CHECK_FOR_INTERRUPTS() \
 do { \
-    if (unlikely(InterruptPending)) \
-        ProcessInterrupts(); \
-} while(0)
-#else                           /* WIN32 */
-
-#define CHECK_FOR_INTERRUPTS() \
-do { \
-    if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
-        pgwin32_dispatch_queued_signals(); \
-    if (unlikely(InterruptPending)) \
+    if (INTERRUPTS_PENDING_CONDITION()) \
         ProcessInterrupts(); \
 } while(0)
-#endif                          /* WIN32 */
 
+/* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
+#define INTERRUPTS_CAN_BE_PROCESSED() \
+    (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
+     QueryCancelHoldoffCount == 0)
 
 #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..88ae2499dd 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1905,13 +1905,14 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
             ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+    bool        result = true;
     TupleDesc   leafDescriptor = state->leafTupDesc;
     bool        isnull = isnulls[spgKeyColumn];
     int         level = 0;
@@ -2012,6 +2013,14 @@ spgdoinsert(Relation index, SpGistState *state,
     parent.offnum = InvalidOffsetNumber;
     parent.node = -1;
 
+    /*
+     * Before entering the loop, try to clear any pending interrupt condition.
+     * If a query cancel is pending, we might as well accept it now not later;
+     * while if a non-canceling condition is pending, servicing it here avoids
+     * having to restart the insertion and redo all the work so far.
+     */
+    CHECK_FOR_INTERRUPTS();
+
     for (;;)
     {
         bool        isNew = false;
@@ -2019,9 +2028,18 @@ spgdoinsert(Relation index, SpGistState *state,
         /*
          * Bail out if query cancel is pending.  We must have this somewhere
          * in the loop since a broken opclass could produce an infinite
-         * picksplit loop.
+         * picksplit loop.  However, because we'll be holding buffer lock(s)
+         * after the first iteration, ProcessInterrupts() wouldn't be able to
+         * throw a cancel error here.  Hence, if we see that an interrupt is
+         * pending, break out of the loop and deal with the situation below.
+         * Set result = false because we must restart the insertion if the
+         * interrupt isn't a query-cancel-or-die case.
          */
-        CHECK_FOR_INTERRUPTS();
+        if (INTERRUPTS_PENDING_CONDITION())
+        {
+            result = false;
+            break;
+        }
 
         if (current.blkno == InvalidBlockNumber)
         {
@@ -2140,10 +2158,14 @@ spgdoinsert(Relation index, SpGistState *state,
              * spgAddNode and spgSplitTuple cases will loop back to here to
              * complete the insertion operation.  Just in case the choose
              * function is broken and produces add or split requests
-             * repeatedly, check for query cancel.
+             * repeatedly, check for query cancel (see comments above).
              */
     process_inner_tuple:
-            CHECK_FOR_INTERRUPTS();
+            if (INTERRUPTS_PENDING_CONDITION())
+            {
+                result = false;
+                break;
+            }
 
             innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
                                                         PageGetItemId(current.page, current.offnum));
@@ -2267,5 +2289,21 @@ spgdoinsert(Relation index, SpGistState *state,
         UnlockReleaseBuffer(parent.buffer);
     }
 
-    return true;
+    /*
+     * We do not support being called while some outer function is holding a
+     * buffer lock (or any other reason to postpone query cancels).  If that
+     * were the case, telling the caller to retry would create an infinite
+     * loop.
+     */
+    Assert(INTERRUPTS_CAN_BE_PROCESSED());
+
+    /*
+     * Finally, check for interrupts again.  If there was a query cancel,
+     * ProcessInterrupts() will be able to throw the error here.  If it was
+     * some other kind of interrupt that can just be cleared, return false to
+     * tell our caller to retry.
+     */
+    CHECK_FOR_INTERRUPTS();
+
+    return result;
 }
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 88ae2499dd..520983ede1 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -669,7 +669,8 @@ checkAllTheSame(spgPickSplitIn *in, spgPickSplitOut *out, bool tooBig,
  * will eventually terminate if lack of balance is the issue.  If the tuple
  * is too big, we assume that repeated picksplit operations will eventually
  * make it small enough by repeated prefix-stripping.  A broken opclass could
- * make this an infinite loop, though.
+ * make this an infinite loop, though, so spgdoinsert() checks that the
+ * leaf datums get smaller each time.
  */
 static bool
 doPickSplit(Relation index, SpGistState *state,
@@ -1918,6 +1919,8 @@ spgdoinsert(Relation index, SpGistState *state,
     int         level = 0;
     Datum       leafDatums[INDEX_MAX_KEYS];
     int         leafSize;
+    int         prevLeafSize;
+    int         numNoProgressCycles = 0;
     SPPageDesc  current,
                 parent;
     FmgrInfo   *procinfo = NULL;
@@ -1988,9 +1991,10 @@ spgdoinsert(Relation index, SpGistState *state,
 
     /*
      * If it isn't gonna fit, and the opclass can't reduce the datum size by
-     * suffixing, bail out now rather than getting into an endless loop.
+     * suffixing, bail out now rather than doing a lot of useless work.
      */
-    if (leafSize > SPGIST_PAGE_CAPACITY && !state->config.longValuesOK)
+    if (leafSize > SPGIST_PAGE_CAPACITY &&
+        (isnull || !state->config.longValuesOK))
         ereport(ERROR,
                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                  errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
@@ -2218,6 +2222,7 @@ spgdoinsert(Relation index, SpGistState *state,
                     /* Adjust level as per opclass request */
                     level += out.result.matchNode.levelAdd;
                     /* Replace leafDatum and recompute leafSize */
+                    prevLeafSize = leafSize;
                     if (!isnull)
                     {
                         leafDatums[spgKeyColumn] = out.result.matchNode.restDatum;
@@ -2226,19 +2231,50 @@ spgdoinsert(Relation index, SpGistState *state,
                         leafSize += sizeof(ItemIdData);
                     }
 
+                    /*
+                     * Check new tuple size; fail if it can't fit, unless the
+                     * opclass says it can handle the situation by suffixing.
+                     * In that case, prevent an infinite loop by checking to
+                     * see if we are actually making progress by reducing the
+                     * leafSize size in each pass.  This is a bit tricky
+                     * though.  Because of alignment considerations, the total
+                     * tuple size might not decrease on every pass.  Also,
+                     * there are edge cases where the choose method might seem
+                     * to not make progress for a cycle or two.  Somewhat
+                     * arbitrarily, we allow up to 10 no-progress iterations
+                     * before failing.  (The limit should be more than
+                     * MAXALIGN, to accommodate opclasses that trim one byte
+                     * from the leaf datum per pass.)
+                     */
+                    if (leafSize > SPGIST_PAGE_CAPACITY)
+                    {
+                        bool        ok = false;
+
+                        if (state->config.longValuesOK && !isnull)
+                        {
+                            if (leafSize < prevLeafSize)
+                            {
+                                ok = true;
+                                numNoProgressCycles = 0;
+                            }
+                            else if (++numNoProgressCycles < 10)
+                                ok = true;
+                        }
+                        if (!ok)
+                            ereport(ERROR,
+                                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                                     errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
+                                            leafSize - sizeof(ItemIdData),
+                                            SPGIST_PAGE_CAPACITY - sizeof(ItemIdData),
+                                            RelationGetRelationName(index)),
+                                     errhint("Values larger than a buffer page cannot be indexed.")));
+                    }
+
                     /*
                      * Loop around and attempt to insert the new leafDatum at
                      * "current" (which might reference an existing child
                      * tuple, or might be invalid to force us to find a new
                      * page for the tuple).
-                     *
-                     * Note: if the opclass sets longValuesOK, we rely on the
-                     * choose function to eventually shorten the leafDatum
-                     * enough to fit on a page.  We could add a test here to
-                     * complain if the datum doesn't get visibly shorter each
-                     * time, but that could get in the way of opclasses that
-                     * "simplify" datums in a way that doesn't necessarily
-                     * lead to physical shortening on every cycle.
                      */
                     break;
                 case spgAddNode:
diff --git a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
index 0ac99d08fb..ac0ddcecea 100644
--- a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
+++ b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
@@ -61,6 +61,12 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid | 1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)
 
+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
 drop index t_f1_f2_f3_idx;
 create index on t using spgist(f1 name_ops_old) include(f2, f3);
 \d+ t_f1_f2_f3_idx
@@ -100,3 +106,7 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid | 1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)
 
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
diff --git a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
index 76e78ba41c..982f221a8b 100644
--- a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
+++ b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
@@ -27,6 +27,12 @@ select * from t
 select * from t
   where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
   order by 1;
 
+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default
+
 drop index t_f1_f2_f3_idx;
 create index on t using spgist(f1 name_ops_old) include(f2, f3);
@@ -39,3 +45,7 @@ select * from t
 select * from t
   where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
   order by 1;
+
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default
I wrote:
> Anyway, here is a patch set teased apart into committable bites,
> and with your other points addressed.

Oh, maybe some docs would be a good thing too ...

			regards, tom lane

diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index cfb2b3c836..18f1f3cdbd 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -978,6 +978,18 @@ LANGUAGE C STRICT;
    fails to do that, the <acronym>SP-GiST</acronym> core resorts to
    extraordinary measures described in <xref linkend="spgist-all-the-same"/>.
   </para>
+
+  <para>
+   When <structfield>longValuesOK</structfield> is true, it is expected
+   that successive levels of the <acronym>SP-GiST</acronym> tree will
+   absorb more and more information into the prefixes and node labels of
+   the inner tuples, making the required leaf datum smaller and smaller,
+   so that eventually it will fit on a page.
+   To prevent bugs in operator classes from causing infinite insertion
+   loops, the <acronym>SP-GiST</acronym> core will raise an error if the
+   leaf datum does not become any smaller within ten cycles
+   of <function>choose</function> method calls.
+  </para>
  </sect2>

  <sect2 id="spgist-null-labels">
On Fri, May 14, 2021 at 6:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2021-May-13, Tom Lane wrote:
> >> What do people think about back-patching this?  In existing branches,
> >> we've defined it to be an opclass bug if it fails to shorten the leaf
> >> datum enough.  But not having any defenses against that seems like
> >> not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
> >> argued to be an API break.
>
> > I think if the alternative is to throw an error, we can afford to retry
> > quite a few more times than 10 in order not to have that called an API
> > break.  Say, retry (MAXIMUM_ALIGNOF << 3) times or so (if you want to
> > parameterize on maxalign).  It's not like this is going to be a
> > performance drag where not needed .. but I think leaving back-branches
> > unfixed is not great.
>
> Hm.  Index bloat is not something to totally ignore, though, so I'm
> not sure what the best cutoff is.
>
> Anyway, here is a patch set teased apart into committable bites,
> and with your other points addressed.

I have tested with my original issue, and this patch fixes it.  Thanks!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Now that a check for shortening of the leaf tuple has been added,
longValuesOK becomes a slightly misleading name.  I'd propose replacing
it with a more legible name, as longValuesOK doesn't mean long values
are guaranteed to be OK, just that we can try to shorten them.  What
about tryShortening, trySuffixing, or can(Try)ShortenTuple?
> Now that a check for shortening of the leaf tuple has been added,
> longValuesOK becomes a slightly misleading name.  I'd propose replacing
> it with a more legible name, as longValuesOK doesn't mean long values
> are guaranteed to be OK, just that we can try to shorten them.  What
> about tryShortening, trySuffixing, or can(Try)ShortenTuple?

Or maybe it's even more logical now to invert it and make it
dontTrySuffixing, for use in the opclasses for kd-trees, quadtrees,
etc., which are guaranteed to have the same key data length at any tree
level?
On Thu, May 13, 2021 at 6:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OTOH, the 10-cycles-to-show-progress rule could be
> argued to be an API break.

Not being familiar with this code, I don't really understand why 10
cycles to show progress wouldn't, like 640kB, be enough for anyone.

But as far as back-patching the code goes, I think the question is not
so much whether this restriction could hypothetically break anything,
but whether it will actually break anything, which leads to the
question of how many spgist opclasses we think exist outside of core.
I did a Google search and found some evidence that PostGIS might have
such things, and also this:

https://github.com/fake-name/pg-spgist_hamming

There might be other things, but I did not find them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Pavel Borisov <pashkin.elfe@gmail.com> writes:
> Now that a check for shortening of the leaf tuple has been added,
> longValuesOK becomes a slightly misleading name.  I'd propose replacing
> it with a more legible name, as longValuesOK doesn't mean long values
> are guaranteed to be OK, just that we can try to shorten them.  What
> about tryShortening, trySuffixing, or can(Try)ShortenTuple?

That field name is part of the opclass API.  I fear it's several years
too late to rename it for no compelling reason.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 13, 2021 at 6:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> OTOH, the 10-cycles-to-show-progress rule could be
>> argued to be an API break.

> Not being familiar with this code, I don't really understand why 10
> cycles to show progress wouldn't, like 640kB, be enough for anyone.

Yeah, after further thought I'm thinking that that ought to be plenty.
In released branches, that code will never execute at all unless the
datum-to-be-indexed is larger than ~8kB.  If you are starting with,
say, a 30kB string, you'd better be removing a heck of a lot more than
one byte per tree level, or your search performance is going to be
abysmal.  Maybe algorithmic oddities would sometimes result in
seemingly making no progress for one cycle, but I doubt there's need
for more slop than that.  In this light, a 10-cycle grace period seems
if anything excessive.

> But as far as back-patching the code goes, I think the question is
> not so much whether this restriction could hypothetically break
> anything but whether it will actually break anything, which leads to
> the question of how many spgist opclasses we think exist outside of
> core.

That is also an interesting viewpoint.  codesearch.debian.net knows of
no external SPGiST opclasses other than PostGIS'.  They don't seem to
have indexed the two you found on github, though.  None of those four
set longValuesOK to true, which means that the whole discussion is
moot for them.  So it's entirely plausible that spgtextproc.c is the
only affected code anywhere.

Of course, that conclusion could lead to the position that there's no
point in back-patching anyway, since there's no reason to think that
spgtextproc.c is buggy.

			regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> This comment made me remember a patch I've had for a while, which splits
> the CHECK_FOR_INTERRUPTS() definition in two -- one of them is
> INTERRUPTS_PENDING_CONDITION() which let us test the condition
> separately; that allows the lock we hold to be released prior to
> actually processing the interrupts.

I've now pushed that macro change ...

> The btree code modified was found to be an actual problem in production
> when a btree is corrupted in such a way that vacuum would get an
> infinite loop.  I don't remember the exact details but I think we saw
> vacuum running for a couple of weeks, and had to restart the server in
> order to terminate it (since it wouldn't respond to signals).

... but I think this bit still needs work, if we still want it at all.
The problem is that it seems to believe that ProcessInterrupts is
guaranteed not to return, which is far from the truth.  Maybe it was
true once, but we've grown a lot of accretions on it that will just
clear InterruptPending and return.  I see that the "return false" leads
directly to an "Assert(false)", which seems unhelpful.

			regards, tom lane