Thread: OOM in spgist insert


OOM in spgist insert

From:
Dilip Kumar
Date:
While testing something on spgist, I found that at a certain point during
insertion spgist goes into doPickSplit, but even after the split it cannot
find a place to insert the tuple. It seems to keep going around that loop
forever and finally errors out with OOM: in this loop we continuously
allocate memory for the tuple in a temporary context, and since we never
leave the loop, we run out of memory.

To reproduce, load the data using the attached script 'data_load.sql'
and then run the commands below:

------Load data before running this using 'data_load.sql'
-------Test case start---
create extension spgist_name_ops;

select opcname, amvalidate(opc.oid)
from pg_opclass opc join pg_am am on am.oid = opcmethod
where amname = 'spgist' and opcname = 'name_ops';

-- warning expected here
select opcname, amvalidate(opc.oid)
from pg_opclass opc join pg_am am on am.oid = opcmethod
where amname = 'spgist' and opcname = 'name_ops_old';

create table t(f1 name, f2 integer, f3 text);
create index on t using spgist(f1) include(f2, f3);
\d+ t_f1_f2_f3_idx

insert into t select proname, case when length(proname) % 2 = 0 then
pronargs else null end, prosrc from pg_proc_test;

---- Test case end

--Memory allocation stack----
#1 0x0000000000bf96c5 in palloc0 (size=9696) at mcxt.c:1133
#2 0x000000000056b24b in spgFormLeafTuple (state=0x7ffedea15b80,
heapPtr=0x27df306, datums=0x7ffedea15660, isnulls=0x7ffedea15640) at
spgutils.c:892
#3 0x000000000055e15c in doPickSplit (index=0x7fa4b1ddd5c8,
state=0x7ffedea15b80, current=0x7ffedea159c0, parent=0x7ffedea159a0,
newLeafTuple=0x27df300, level=9,
isNulls=false, isNew=true) at spgdoinsert.c:848
#4 0x0000000000561e53 in spgdoinsert (index=0x7fa4b1ddd5c8,
state=0x7ffedea15b80, heapPtr=0x27718d8, datums=0x7ffedea15cc0,
isnulls=0x7ffedea15ca0)
at spgdoinsert.c:2115


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: OOM in spgist insert

From:
Pavel Borisov
Date:
On Wed, May 12, 2021 at 11:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> While testing something on spgist, I found that at a certain point during
> insertion spgist goes into doPickSplit, but even after the split it cannot
> find a place to insert the tuple. It seems to keep going around that loop
> forever and finally errors out with OOM: in this loop we continuously
> allocate memory for the tuple in a temporary context, and since we never
> leave the loop, we run out of memory.

My first idea is that this is the case where the index tuple doesn't fit on one index page. As INCLUDE columns are added as-is, the tuple cannot be made shorter by prefix-stripping. It seems we should check that every index tuple fits on the page before inserting it. I will look at the code a little later.

Thanks for the report!

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: OOM in spgist insert

From:
Dilip Kumar
Date:
On Wed, 12 May 2021 at 1:43 PM, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> On Wed, May 12, 2021 at 11:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> While testing something on spgist, I found that at a certain point during
>> insertion spgist goes into doPickSplit, but even after the split it cannot
>> find a place to insert the tuple. It seems to keep going around that loop
>> forever and finally errors out with OOM: in this loop we continuously
>> allocate memory for the tuple in a temporary context, and since we never
>> leave the loop, we run out of memory.
>
> My first idea is that this is the case where the index tuple doesn't fit on one index page. As INCLUDE columns are added as-is, the tuple cannot be made shorter by prefix-stripping. It seems we should check that every index tuple fits on the page before inserting it.

Thanks for looking into this.

> I will look at the code a little later.

Ok
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: OOM in spgist insert

From:
Pavel Borisov
Date:
On Wed, May 12, 2021 at 12:36, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, 12 May 2021 at 1:43 PM, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> On Wed, May 12, 2021 at 11:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> While testing something on spgist, I found that at a certain point during
>>> insertion spgist goes into doPickSplit, but even after the split it cannot
>>> find a place to insert the tuple. It seems to keep going around that loop
>>> forever and finally errors out with OOM: in this loop we continuously
>>> allocate memory for the tuple in a temporary context, and since we never
>>> leave the loop, we run out of memory.
>>
>> My first idea is that this is the case where the index tuple doesn't fit on one index page. As INCLUDE columns are added as-is, the tuple cannot be made shorter by prefix-stripping. It seems we should check that every index tuple fits on the page before inserting it.
>
> Thanks for looking into this.
>
>> I will look at the code a little later.
>
> Ok
PFA v1 patch. Does this help?


--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Attachments

Re: OOM in spgist insert

From:
Pavel Borisov
Date:


On Wed, May 12, 2021 at 12:39, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> On Wed, May 12, 2021 at 12:36, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Wed, 12 May 2021 at 1:43 PM, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>> On Wed, May 12, 2021 at 11:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>> While testing something on spgist, I found that at a certain point during
>>>> insertion spgist goes into doPickSplit, but even after the split it cannot
>>>> find a place to insert the tuple. It seems to keep going around that loop
>>>> forever and finally errors out with OOM: in this loop we continuously
>>>> allocate memory for the tuple in a temporary context, and since we never
>>>> leave the loop, we run out of memory.
>>>
>>> My first idea is that this is the case where the index tuple doesn't fit on one index page. As INCLUDE columns are added as-is, the tuple cannot be made shorter by prefix-stripping. It seems we should check that every index tuple fits on the page before inserting it.
>>
>> Thanks for looking into this.
>>
>>> I will look at the code a little later.
>>
>> Ok
> PFA v1 patch. Does this help?

I've made a mistake in the attribute count in v1. PFA v2.


--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Attachments

Re: OOM in spgist insert

From:
Dilip Kumar
Date:
On Wed, May 12, 2021 at 2:21 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

>>
>> PFA v1 patch. Does this help?
>
> I've made a mistake in the attribute count in v1. PFA v2

V2 works. Thanks for fixing this quickly. I think you could add a
comment for the new error condition you introduced.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: OOM in spgist insert

From:
Pavel Borisov
Date:
> V2 works. Thanks for fixing this quickly. I think you could add a
> comment for the new error condition you introduced.

Added comments. PFA v3


--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Attachments

Re: OOM in spgist insert

From:
Dilip Kumar
Date:
On Wed, May 12, 2021 at 5:11 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>
>> V2 works.  Thanks for fixing this quickly,  I think you can add a
>> comment for the new error condition you added.
>
> Added comments. PFA v3

Thanks.

+ *
+ * For indexes with INCLUDEd columns we do not know whether we can reduce
+ * index tuple size by suffixing its key part or we will go into the
+ * endless loop on pick-split (in case included columns data is big enough

 INCLUDEd -> why have you used mixed case here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: OOM in spgist insert

From:
Pavel Borisov
Date:
> INCLUDEd -> why have you used mixed case here?
It is current practice to write INCLUDE columns in capitals; you can find many places in the existing code. But the case mixture can indeed be avoided ))
PFA v4
Attachments

Re: OOM in spgist insert

From:
Dilip Kumar
Date:
On Wed, May 12, 2021 at 5:35 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>
>>  INCLUDEd -> why you have used a mixed case here?
>
> It is current practice to write INCLUDE columns in capitals; you can find
> many places in the existing code. But the case mixture can indeed be avoided ))
> PFA v4

Okay, that makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: OOM in spgist insert

From:
Tom Lane
Date:
Pavel Borisov <pashkin.elfe@gmail.com> writes:
> [ v4-0001-When-there-are-INCLUDEd-columns-in-SpGist-check-t.patch ]

I don't like this patch one bit --- it's basically disabling a fairly
important SPGiST feature as soon as there are included columns.
What's more, it's not really giving us any better defense against
the infinite-picksplit-loop problem than we had before.

I wonder if we should give up on the theory posited circa
spgdoinsert.c:2213:

                     * Note: if the opclass sets longValuesOK, we rely on the
                     * choose function to eventually shorten the leafDatum
                     * enough to fit on a page.  We could add a test here to
                     * complain if the datum doesn't get visibly shorter each
                     * time, but that could get in the way of opclasses that
                     * "simplify" datums in a way that doesn't necessarily
                     * lead to physical shortening on every cycle.

The argument that there might be a longValuesOK opclass that *doesn't*
shorten the datum each time seems fairly hypothetical to me.

An alternative way of looking at things is that this is the opclass's
fault.  It should have realized that it's not making progress, and
thrown an error.  However, I'm not sure if the opclass picksplit
function has enough info to throw a meaningful error.  It looks to
me like the trouble case from spg_text_picksplit's point of view
is that it's given a single tuple containing a zero-length string,
so that there is no prefix it can strip.  However, it seems possible
that that could be a legitimate case.  Even if it's not, the opclass
function doesn't have (for instance) the name of the index to cite
in an error message.  Nor did we give it a way to return a failure
indication, which is seeming like a mistake right now.

BTW, another nasty thing I discovered while testing this is that
the CHECK_FOR_INTERRUPTS() at line 2146 is useless, because
we're holding a buffer lock there so InterruptHoldoffCount > 0.
So once you get into this loop you can't even cancel the query.
Seems like that needs a fix, too.

            regards, tom lane



Re: OOM in spgist insert

From:
Alvaro Herrera
Date:
On 2021-May-13, Tom Lane wrote:

> BTW, another nasty thing I discovered while testing this is that
> the CHECK_FOR_INTERRUPTS() at line 2146 is useless, because
> we're holding a buffer lock there so InterruptHoldoffCount > 0.
> So once you get into this loop you can't even cancel the query.
> Seems like that needs a fix, too.

This comment made me remember a patch I've had for a while, which splits
the CHECK_FOR_INTERRUPTS() definition in two -- one of them is
INTERRUPTS_PENDING_CONDITION(), which lets us test the condition
separately; that allows the lock we hold to be released prior to
actually processing the interrupts.

The btree code modified was found to be an actual problem in production
when a btree is corrupted in such a way that vacuum would get an
infinite loop.  I don't remember the exact details but I think we saw
vacuum running for a couple of weeks, and had to restart the server in
order to terminate it (since it wouldn't respond to signals).

-- 
Álvaro Herrera       Valdivia, Chile
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."                               (Fotis)
               (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)

Attachments

Re: OOM in spgist insert

From:
Alvaro Herrera
Date:
On 2021-May-13, Alvaro Herrera wrote:

> The btree code modified was found to be an actual problem in production
> when a btree is corrupted in such a way that vacuum would get an
> infinite loop.  I don't remember the exact details but I think we saw
> vacuum running for a couple of weeks, and had to restart the server in
> order to terminate it (since it wouldn't respond to signals).

(Looking again, the nbtpage.c hunk might have been made obsolete by
c34787f91058 and other commits).

-- 
Álvaro Herrera       Valdivia, Chile



Re: OOM in spgist insert

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-13, Tom Lane wrote:
>> BTW, another nasty thing I discovered while testing this is that
>> the CHECK_FOR_INTERRUPTS() at line 2146 is useless, because
>> we're holding a buffer lock there so InterruptHoldoffCount > 0.
>> So once you get into this loop you can't even cancel the query.
>> Seems like that needs a fix, too.

Attached is a WIP patch for that part.  Basically, if it looks
like there's an interrupt pending, make spgdoinsert() fall out of
its loop, so it'll release the buffer locks, and then recheck
CHECK_FOR_INTERRUPTS() at a point where it can really throw the
error.  Now, the hole in this approach is what if ProcessInterrupts
still declines to throw the error?  I've made the patch make use of
the existing provision to retry spgdoinsert() in case of deadlock,
so that it just goes around and tries again.  But that doesn't seem
terribly satisfactory, because if InterruptPending remains set then
we'll just have an infinite loop at that outer level.  IOW, this
patch can only cope with the situation where there's not any outer
function holding a buffer lock (or other reason to prevent the
query cancel from taking effect).  Maybe that's good enough, but
I'm not terribly pleased with it.

> This comment made me remember a patch I've had for a while, which splits
> the CHECK_FOR_INTERRUPTS() definition in two -- one of them is
> INTERRUPTS_PENDING_CONDITION() which let us test the condition
> separately; that allows the lock we hold to be released prior to
> actually processing the interrupts.

Hm.  Yeah, I was feeling that this patch is unduly friendly with
the innards of CHECK_FOR_INTERRUPTS().  So maybe some refactoring
is called for, but I'm not quite sure what it should look like.
Testing the condition separately is fine, but what about the case
of ProcessInterrupts refusing to pull the trigger?

            regards, tom lane

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..82e68c4ab7 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1905,13 +1905,14 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
             ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+    bool        result = true;
     TupleDesc    leafDescriptor = state->leafTupDesc;
     bool        isnull = isnulls[spgKeyColumn];
     int            level = 0;
@@ -2019,9 +2020,17 @@ spgdoinsert(Relation index, SpGistState *state,
         /*
          * Bail out if query cancel is pending.  We must have this somewhere
          * in the loop since a broken opclass could produce an infinite
-         * picksplit loop.
+         * picksplit loop.  Moreover, because it's typically the case that
+         * we're holding buffer lock(s), ProcessInterrupts() won't be able to
+         * throw an error now.  Hence, if we see that InterruptPending remains
+         * true, break out of the loop and deal with the problem below.
          */
         CHECK_FOR_INTERRUPTS();
+        if (unlikely(InterruptPending))
+        {
+            result = false;
+            break;
+        }

         if (current.blkno == InvalidBlockNumber)
         {
@@ -2140,10 +2149,15 @@ spgdoinsert(Relation index, SpGistState *state,
              * spgAddNode and spgSplitTuple cases will loop back to here to
              * complete the insertion operation.  Just in case the choose
              * function is broken and produces add or split requests
-             * repeatedly, check for query cancel.
+             * repeatedly, check for query cancel (see comments above).
              */
     process_inner_tuple:
             CHECK_FOR_INTERRUPTS();
+            if (unlikely(InterruptPending))
+            {
+                result = false;
+                break;
+            }

             innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
                                                         PageGetItemId(current.page, current.offnum));
@@ -2267,5 +2281,12 @@ spgdoinsert(Relation index, SpGistState *state,
         UnlockReleaseBuffer(parent.buffer);
     }

-    return true;
+    /*
+     * Finally, check for interrupts again.  If there was one,
+     * ProcessInterrupts() will be able to throw the error now.  But should it
+     * not do so for some reason, return false to cause a retry.
+     */
+    CHECK_FOR_INTERRUPTS();
+
+    return result;
 }

Re: OOM in spgist insert

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> (Looking again, the nbtpage.c hunk might have been made obsolete by
> c34787f91058 and other commits).

OK.  Here's a revision that adopts your idea, except that I left
out the nbtpage.c change since you aren't sure of that part.

I added a macro that allows spgdoinsert to Assert that it's not
called in a context where the infinite-loop-due-to-InterruptPending
risk would arise.  This is a little bit fragile, because it'd be
easy for ill-considered changes to ProcessInterrupts to break it,
but it's better than nothing.

            regards, tom lane

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..1c14fc7f76 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1905,13 +1905,14 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
             ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+    bool        result = true;
     TupleDesc    leafDescriptor = state->leafTupDesc;
     bool        isnull = isnulls[spgKeyColumn];
     int            level = 0;
@@ -2019,9 +2020,17 @@ spgdoinsert(Relation index, SpGistState *state,
         /*
          * Bail out if query cancel is pending.  We must have this somewhere
          * in the loop since a broken opclass could produce an infinite
-         * picksplit loop.
+         * picksplit loop.  Moreover, because it's typically the case that
+         * we're holding buffer lock(s), ProcessInterrupts() won't be able to
+         * throw an error now.  Hence, if we see that the interrupt condition
+         * remains true, break out of the loop and deal with the case below.
          */
         CHECK_FOR_INTERRUPTS();
+        if (INTERRUPTS_PENDING_CONDITION())
+        {
+            result = false;
+            break;
+        }

         if (current.blkno == InvalidBlockNumber)
         {
@@ -2140,10 +2149,15 @@ spgdoinsert(Relation index, SpGistState *state,
              * spgAddNode and spgSplitTuple cases will loop back to here to
              * complete the insertion operation.  Just in case the choose
              * function is broken and produces add or split requests
-             * repeatedly, check for query cancel.
+             * repeatedly, check for query cancel (see comments above).
              */
     process_inner_tuple:
             CHECK_FOR_INTERRUPTS();
+            if (INTERRUPTS_PENDING_CONDITION())
+            {
+                result = false;
+                break;
+            }

             innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
                                                         PageGetItemId(current.page, current.offnum));
@@ -2267,5 +2281,21 @@ spgdoinsert(Relation index, SpGistState *state,
         UnlockReleaseBuffer(parent.buffer);
     }

-    return true;
+    /*
+     * We do not support being called while some outer function is holding a
+     * buffer lock (or any other reason to postpone query cancels).  If that
+     * were the case, telling the caller to retry would create an infinite
+     * loop.
+     */
+    Assert(INTERRUPTS_CAN_BE_PROCESSED());
+
+    /*
+     * Finally, check for interrupts again.  If there was a query cancel,
+     * ProcessInterrupts() will be able to throw the error now.  If it was
+     * some other kind of interrupt that can just be cleared, return false to
+     * tell our caller to retry.
+     */
+    CHECK_FOR_INTERRUPTS();
+
+    return result;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6200699ddd..dd2ade7bb6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -554,7 +554,7 @@ ProcessClientWriteInterrupt(bool blocked)
         {
             /*
              * Don't mess with whereToSendOutput if ProcessInterrupts wouldn't
-             * do anything.
+             * service ProcDiePending.
              */
             if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
             {
@@ -3118,6 +3118,12 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
  * If an interrupt condition is pending, and it's safe to service it,
  * then clear the flag and accept the interrupt.  Called only when
  * InterruptPending is true.
+ *
+ * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
+ * is guaranteed to clear the InterruptPending flag before returning.
+ * (This is not the same as guaranteeing that it's still clear when we
+ * return; another interrupt could have arrived.  But we promise that
+ * any pre-existing one will have been serviced.)
  */
 void
 ProcessInterrupts(void)
@@ -3248,7 +3254,11 @@ ProcessInterrupts(void)
     {
         /*
          * Re-arm InterruptPending so that we process the cancel request as
-         * soon as we're done reading the message.
+         * soon as we're done reading the message.  (XXX this is seriously
+         * ugly: it complicates INTERRUPTS_CAN_BE_PROCESSED(), and it means we
+         * can't use that macro directly as the initial test in this function,
+         * meaning that this code also creates opportunities for other bugs to
+         * appear.)
          */
         InterruptPending = true;
     }
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af..645766a4ab 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -57,6 +57,11 @@
  * allowing die interrupts: HOLD_CANCEL_INTERRUPTS() and
  * RESUME_CANCEL_INTERRUPTS().
  *
+ * Also, INTERRUPTS_PENDING_CONDITION() can be checked to see whether an
+ * interrupt needs to be serviced, without trying to do so immediately.
+ * Some callers are also interested in INTERRUPTS_CAN_BE_PROCESSED(),
+ * which tells whether ProcessInterrupts is sure to clear the interrupt.
+ *
  * Special mechanisms are used to let an interrupt be accepted when we are
  * waiting for a lock or when we are waiting for command input (but, of
  * course, only if the interrupt holdoff counter is zero).  See the
@@ -97,24 +102,27 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);

+/* Test whether an interrupt is pending */
 #ifndef WIN32
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(InterruptPending))
+#else
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0,  \
+     unlikely(InterruptPending))
+#endif

+/* Service interrupt if one is pending */
 #define CHECK_FOR_INTERRUPTS() \
 do { \
-    if (unlikely(InterruptPending)) \
-        ProcessInterrupts(); \
-} while(0)
-#else                            /* WIN32 */
-
-#define CHECK_FOR_INTERRUPTS() \
-do { \
-    if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
-        pgwin32_dispatch_queued_signals(); \
-    if (unlikely(InterruptPending)) \
+    if (INTERRUPTS_PENDING_CONDITION()) \
         ProcessInterrupts(); \
 } while(0)
-#endif                            /* WIN32 */

+/* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
+#define INTERRUPTS_CAN_BE_PROCESSED() \
+    (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
+     QueryCancelHoldoffCount == 0)

 #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)


Re: OOM in spgist insert

From:
Tom Lane
Date:
Here's a patch that attempts to prevent the infinite loop in spgdoinsert
by checking whether the proposed leaf tuple is getting smaller at each
iteration.

We can't be totally rigid about that, because for example if the opclass
shortens a 7-byte string to 5 bytes, that will make no difference in the
tuple's size after alignment.  I first tried to handle that by checking
datumGetSize() of the key datum itself, but observed that spgtextproc.c
has some cases where it'll return an empty leaf-datum string at two
levels before succeeding.  Maybe it'd be okay to fail that on the
grounds that it can't become any smaller later.  But on the whole, and
considering the existing comment's concerns about opclasses that don't
shorten the datum every time, it seems like a good idea to allow some
fuzz here.  So what this patch does is to allow up to 10 cycles with no
reduction in the actual leaf tuple size before failing.  That way we can
handle slop due to alignment roundoff and slop due to opclass corner
cases with a single, very cheap mechanism.  It does mean that we might
build a few more useless inner tuples before failing --- but that seems
like a good tradeoff for *not* failing in cases where the opclass is
able to shorten the leaf datum sufficiently.

I have not bothered to tease apart the query-cancel and infinite-loop
parts of the patch, but probably should do that before committing.

What do people think about back-patching this?  In existing branches,
we've defined it to be an opclass bug if it fails to shorten the leaf
datum enough.  But not having any defenses against that seems like
not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
argued to be an API break.

            regards, tom lane

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..e882b0cea4 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -669,7 +669,8 @@ checkAllTheSame(spgPickSplitIn *in, spgPickSplitOut *out, bool tooBig,
  * will eventually terminate if lack of balance is the issue.  If the tuple
  * is too big, we assume that repeated picksplit operations will eventually
  * make it small enough by repeated prefix-stripping.  A broken opclass could
- * make this an infinite loop, though.
+ * make this an infinite loop, though, so spgdoinsert() checks that the
+ * leaf datums get smaller each time.
  */
 static bool
 doPickSplit(Relation index, SpGistState *state,
@@ -1905,18 +1906,21 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
             ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+    bool        result = true;
     TupleDesc    leafDescriptor = state->leafTupDesc;
     bool        isnull = isnulls[spgKeyColumn];
     int            level = 0;
     Datum        leafDatums[INDEX_MAX_KEYS];
     int            leafSize;
+    int            prevLeafSize;
+    int            numNoProgressCycles = 0;
     SPPageDesc    current,
                 parent;
     FmgrInfo   *procinfo = NULL;
@@ -1987,9 +1991,10 @@ spgdoinsert(Relation index, SpGistState *state,

     /*
      * If it isn't gonna fit, and the opclass can't reduce the datum size by
-     * suffixing, bail out now rather than getting into an endless loop.
+     * suffixing, bail out now rather than doing a lot of useless work.
      */
-    if (leafSize > SPGIST_PAGE_CAPACITY && !state->config.longValuesOK)
+    if (leafSize > SPGIST_PAGE_CAPACITY &&
+        (isnull || !state->config.longValuesOK))
         ereport(ERROR,
                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                  errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
@@ -2019,9 +2024,17 @@ spgdoinsert(Relation index, SpGistState *state,
         /*
          * Bail out if query cancel is pending.  We must have this somewhere
          * in the loop since a broken opclass could produce an infinite
-         * picksplit loop.
+         * picksplit loop.  Moreover, because it's typically the case that
+         * we're holding buffer lock(s), ProcessInterrupts() won't be able to
+         * throw an error now.  Hence, if we see that the interrupt condition
+         * remains true, break out of the loop and deal with the case below.
          */
         CHECK_FOR_INTERRUPTS();
+        if (INTERRUPTS_PENDING_CONDITION())
+        {
+            result = false;
+            break;
+        }

         if (current.blkno == InvalidBlockNumber)
         {
@@ -2140,10 +2153,15 @@ spgdoinsert(Relation index, SpGistState *state,
              * spgAddNode and spgSplitTuple cases will loop back to here to
              * complete the insertion operation.  Just in case the choose
              * function is broken and produces add or split requests
-             * repeatedly, check for query cancel.
+             * repeatedly, check for query cancel (see comments above).
              */
     process_inner_tuple:
             CHECK_FOR_INTERRUPTS();
+            if (INTERRUPTS_PENDING_CONDITION())
+            {
+                result = false;
+                break;
+            }

             innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
                                                         PageGetItemId(current.page, current.offnum));
@@ -2196,6 +2214,7 @@ spgdoinsert(Relation index, SpGistState *state,
                     /* Adjust level as per opclass request */
                     level += out.result.matchNode.levelAdd;
                     /* Replace leafDatum and recompute leafSize */
+                    prevLeafSize = leafSize;
                     if (!isnull)
                     {
                         leafDatums[spgKeyColumn] = out.result.matchNode.restDatum;
@@ -2204,19 +2223,50 @@ spgdoinsert(Relation index, SpGistState *state,
                         leafSize += sizeof(ItemIdData);
                     }

+                    /*
+                     * Check new tuple size; fail if it can't fit, unless the
+                     * opclass says it can handle the situation by suffixing.
+                     * In that case, prevent an infinite loop by checking to
+                     * see if we are actually making progress by reducing the
+                     * leafSize in each pass.  This is a bit tricky
+                     * though.  Because of alignment considerations, the total
+                     * tuple size might not decrease on every pass.  Also,
+                     * there are edge cases where the choose method might seem
+                     * to not make progress for a cycle or two.  Somewhat
+                     * arbitrarily, we allow up to 10 no-progress iterations
+                     * before failing.  (The limit should be more than
+                     * MAXALIGN, to accommodate opclasses that trim one byte
+                     * from the leaf datum per pass.)
+                     */
+                    if (leafSize > SPGIST_PAGE_CAPACITY)
+                    {
+                        bool        ok = false;
+
+                        if (state->config.longValuesOK && !isnull)
+                        {
+                            if (leafSize < prevLeafSize)
+                            {
+                                ok = true;
+                                numNoProgressCycles = 0;
+                            }
+                            else if (++numNoProgressCycles < 10)
+                                ok = true;
+                        }
+                        if (!ok)
+                            ereport(ERROR,
+                                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                                     errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
+                                            leafSize - sizeof(ItemIdData),
+                                            SPGIST_PAGE_CAPACITY - sizeof(ItemIdData),
+                                            RelationGetRelationName(index)),
+                                     errhint("Values larger than a buffer page cannot be indexed.")));
+                    }
+
                     /*
                      * Loop around and attempt to insert the new leafDatum at
                      * "current" (which might reference an existing child
                      * tuple, or might be invalid to force us to find a new
                      * page for the tuple).
-                     *
-                     * Note: if the opclass sets longValuesOK, we rely on the
-                     * choose function to eventually shorten the leafDatum
-                     * enough to fit on a page.  We could add a test here to
-                     * complain if the datum doesn't get visibly shorter each
-                     * time, but that could get in the way of opclasses that
-                     * "simplify" datums in a way that doesn't necessarily
-                     * lead to physical shortening on every cycle.
                      */
                     break;
                 case spgAddNode:
@@ -2267,5 +2317,21 @@ spgdoinsert(Relation index, SpGistState *state,
         UnlockReleaseBuffer(parent.buffer);
     }

-    return true;
+    /*
+     * We do not support being called while some outer function is holding a
+     * buffer lock (or any other reason to postpone query cancels).  If that
+     * were the case, telling the caller to retry would create an infinite
+     * loop.
+     */
+    Assert(INTERRUPTS_CAN_BE_PROCESSED());
+
+    /*
+     * Finally, check for interrupts again.  If there was a query cancel,
+     * ProcessInterrupts() will be able to throw the error now.  If it was
+     * some other kind of interrupt that can just be cleared, return false to
+     * tell our caller to retry.
+     */
+    CHECK_FOR_INTERRUPTS();
+
+    return result;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6200699ddd..dd2ade7bb6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -554,7 +554,7 @@ ProcessClientWriteInterrupt(bool blocked)
         {
             /*
              * Don't mess with whereToSendOutput if ProcessInterrupts wouldn't
-             * do anything.
+             * service ProcDiePending.
              */
             if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
             {
@@ -3118,6 +3118,12 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
  * If an interrupt condition is pending, and it's safe to service it,
  * then clear the flag and accept the interrupt.  Called only when
  * InterruptPending is true.
+ *
+ * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
+ * is guaranteed to clear the InterruptPending flag before returning.
+ * (This is not the same as guaranteeing that it's still clear when we
+ * return; another interrupt could have arrived.  But we promise that
+ * any pre-existing one will have been serviced.)
  */
 void
 ProcessInterrupts(void)
@@ -3248,7 +3254,11 @@ ProcessInterrupts(void)
     {
         /*
          * Re-arm InterruptPending so that we process the cancel request as
-         * soon as we're done reading the message.
+         * soon as we're done reading the message.  (XXX this is seriously
+         * ugly: it complicates INTERRUPTS_CAN_BE_PROCESSED(), and it means we
+         * can't use that macro directly as the initial test in this function,
+         * meaning that this code also creates opportunities for other bugs to
+         * appear.)
          */
         InterruptPending = true;
     }
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af..645766a4ab 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -57,6 +57,11 @@
  * allowing die interrupts: HOLD_CANCEL_INTERRUPTS() and
  * RESUME_CANCEL_INTERRUPTS().
  *
+ * Also, INTERRUPTS_PENDING_CONDITION() can be checked to see whether an
+ * interrupt needs to be serviced, without trying to do so immediately.
+ * Some callers are also interested in INTERRUPTS_CAN_BE_PROCESSED(),
+ * which tells whether ProcessInterrupts is sure to clear the interrupt.
+ *
  * Special mechanisms are used to let an interrupt be accepted when we are
  * waiting for a lock or when we are waiting for command input (but, of
  * course, only if the interrupt holdoff counter is zero).  See the
@@ -97,24 +102,27 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);

+/* Test whether an interrupt is pending */
 #ifndef WIN32
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(InterruptPending))
+#else
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0,  \
+     unlikely(InterruptPending))
+#endif

+/* Service interrupt if one is pending */
 #define CHECK_FOR_INTERRUPTS() \
 do { \
-    if (unlikely(InterruptPending)) \
-        ProcessInterrupts(); \
-} while(0)
-#else                            /* WIN32 */
-
-#define CHECK_FOR_INTERRUPTS() \
-do { \
-    if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
-        pgwin32_dispatch_queued_signals(); \
-    if (unlikely(InterruptPending)) \
+    if (INTERRUPTS_PENDING_CONDITION()) \
         ProcessInterrupts(); \
 } while(0)
-#endif                            /* WIN32 */

+/* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
+#define INTERRUPTS_CAN_BE_PROCESSED() \
+    (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
+     QueryCancelHoldoffCount == 0)

 #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)

diff --git a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
index 0ac99d08fb..ac0ddcecea 100644
--- a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
+++ b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
@@ -61,6 +61,12 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid           |  1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)

+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
 drop index t_f1_f2_f3_idx;
 create index on t using spgist(f1 name_ops_old) include(f2, f3);
 \d+ t_f1_f2_f3_idx
@@ -100,3 +106,7 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid           |  1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)

+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
diff --git a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
index 76e78ba41c..982f221a8b 100644
--- a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
+++ b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
@@ -27,6 +27,12 @@ select * from t
   where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
   order by 1;

+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default
+
 drop index t_f1_f2_f3_idx;

 create index on t using spgist(f1 name_ops_old) include(f2, f3);
@@ -39,3 +45,7 @@ select * from t
 select * from t
   where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
   order by 1;
+
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default

Re: OOM in spgist insert

From
Alvaro Herrera
Date:
On 2021-May-13, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > (Looking again, the nbtpage.c hunk might have been made obsolete by
> > c34787f91058 and other commits).
> 
> OK.  Here's a revision that adopts your idea, except that I left
> out the nbtpage.c change since you aren't sure of that part.

Thanks.

> I added a macro that allows spgdoinsert to Assert that it's not
> called in a context where the infinite-loop-due-to-InterruptPending
> risk would arise.  This is a little bit fragile, because it'd be
> easy for ill-considered changes to ProcessInterrupts to break it,
> but it's better than nothing.

Hmm, it looks OK to me, but I wonder why you kept the original
CHECK_FOR_INTERRUPTS()s since these would be done once we've broken out
of the loop anyway.  I tested Dilip's original test case and while we
still die on OOM, we're able to interrupt it before dying.


Not related to this patch -- I was bothered by the UnlockReleaseBuffer
calls at the bottom of spgdoinsert that leave the buffer still set in
the structs.  It's not a problem if you look only at this routine, but I
notice that the callee doPickSplit does the same thing, so maybe there is
some weird situation in which you could end up with the buffer variable
set while holding neither lock nor pin on the page, so an attempt to clean
up would break.  I don't know enough about spgist to figure out how to
craft a test case, maybe it's impossible to reach for some reason, but
it seems like a glass-on-the-beach sort of thing.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)



Re: OOM in spgist insert

From
Alvaro Herrera
Date:
On 2021-May-13, Tom Lane wrote:

> What do people think about back-patching this?  In existing branches,
> we've defined it to be an opclass bug if it fails to shorten the leaf
> datum enough.  But not having any defenses against that seems like
> not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
> argued to be an API break.

I think if the alternative is to throw an error, we can afford to retry
quite a few more times than 10 in order not to have that called an API
break.  Say, retry (MAXIMUM_ALIGNOF << 3) times or so (if you want to
parameterize on maxalign).  It's not like this is going to be a
performance drag where not needed .. but I think leaving back-branches
unfixed is not great.
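[Editor's note: Alvaro's alignment point can be seen in a standalone sketch. This is a toy simulation, not SP-GiST code; `converges`, `PAGE_CAPACITY`, and the 8-byte `MAXALIGN` are illustrative stand-ins. With 8-byte alignment, trimming one byte of datum per pass can leave the aligned tuple size unchanged for up to seven consecutive passes, which is why the no-progress limit must exceed MAXALIGN.]

```c
#include <assert.h>
#include <stdbool.h>

#define MAXALIGN(x)   (((x) + 7) & ~7)  /* 8-byte alignment, typical platform */
#define PAGE_CAPACITY 8000              /* stand-in for SPGIST_PAGE_CAPACITY */

/* Sketch of the patch's loop guard: tolerate a bounded number of passes
 * where the aligned tuple size fails to shrink.  datum_len drops by one
 * byte per pass, mimicking an opclass that strips one byte each time;
 * the aligned size only shrinks every eighth pass. */
static bool
converges(int datum_len, int max_no_progress)
{
    int leaf_size = MAXALIGN(datum_len);
    int no_progress = 0;

    while (leaf_size > PAGE_CAPACITY)
    {
        int prev = leaf_size;

        datum_len -= 1;                  /* one byte of real progress... */
        leaf_size = MAXALIGN(datum_len); /* ...often invisible after alignment */

        if (leaf_size < prev)
            no_progress = 0;
        else if (++no_progress >= max_no_progress)
            return false;                /* the patch would ereport(ERROR) here */
    }
    return true;
}
```

A limit of 10 survives the seven-pass alignment plateaus; a limit of 4 (below MAXALIGN) would reject this well-behaved opclass.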

I did run Dilip's test case as well as your new regression test, and
both work as intended with your new code (and both OOM-crash the
original code).

-- 
Álvaro Herrera       Valdivia, Chile



Re: OOM in spgist insert

From
Pavel Borisov
Date:
I think it's good to back-patch the check, as it doesn't change acceptable behavior; it just replaces an infinite loop with an acceptable error.

Re: OOM in spgist insert

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Hmm, it looks OK to me, but I wonder why you kept the original
> CHECK_FOR_INTERRUPTS()s since these would be done once we've broken out
> of the loop anyway.  I tested Dilip's original test case and while we
> still die on OOM, we're able to interrupt it before dying.

Hm.  My thought was that in the cases where InterruptPending is set for
some reason other than a query cancel, we could let ProcessInterrupts
service it at less cost than abandoning and retrying the index insertion.
On reflection though, that only works for the first CHECK_FOR_INTERRUPTS
at the top of the loop, and only the first time through, because during
later calls we'll be holding buffer locks.

Maybe the best idea is to have one CHECK_FOR_INTERRUPTS at the top of
the function, in hopes of clearing out any already-pending interrupts,
and then just use the condition test inside the loop.
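
[Editor's note: the control flow described above can be sketched in a self-contained simulation. The macro names mirror miscadmin.h, but their bodies here are stand-ins, and `insert_once`, `work_items`, and `inject` are invented for illustration: one CHECK_FOR_INTERRUPTS() before the loop, a bare condition test inside it (where the real code holds buffer locks and cannot throw), and a final CHECK_FOR_INTERRUPTS() after locks would be released.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real PostgreSQL globals and macros. */
static bool InterruptPending = false;
static int  InterruptHoldoffCount = 0;

#define INTERRUPTS_PENDING_CONDITION() (InterruptPending)
#define CHECK_FOR_INTERRUPTS() \
    do { \
        if (InterruptPending && InterruptHoldoffCount == 0) \
            InterruptPending = false;   /* "service" the interrupt */ \
    } while (0)

/* Returns true on success, false to ask the caller to retry,
 * matching the spgdoinsert() contract in the patch. */
static bool
insert_once(int work_items, bool inject)
{
    bool result = true;

    CHECK_FOR_INTERRUPTS();             /* clear anything already pending */

    for (int i = 0; i < work_items; i++)
    {
        if (INTERRUPTS_PENDING_CONDITION())
        {
            result = false;             /* can't throw here: "locks" held */
            break;
        }
        if (i == 2 && inject)
            InterruptPending = true;    /* interrupt arrives mid-loop */
    }

    CHECK_FOR_INTERRUPTS();             /* safe to service it now */
    return result;
}
```

A caller would retry on a false return; in the real code the final CHECK_FOR_INTERRUPTS() either throws the pending cancel error or clears a benign interrupt before the retry.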

> Not related to this patch -- I was bothered by the UnlockReleaseBuffer
> calls at the bottom of spgdoinsert that leave the buffer still set in
> the structs.  It's not a problem if you look only at this routine, but I
> notice that callee doPickSplit does the same thing, so maybe there is
> some weird situation in which you could end up with the buffer variable
> set, but we don't hold lock nor pin on the page, so an attempt to clean
> up would break.

Maybe I'm confused, but aren't those just local variables that are about
to go out of scope anyway?  Clearing them seems more likely to draw
compiler warnings about dead stores than accomplish something useful.

            regards, tom lane



Re: OOM in spgist insert

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-13, Tom Lane wrote:
>> What do people think about back-patching this?  In existing branches,
>> we've defined it to be an opclass bug if it fails to shorten the leaf
>> datum enough.  But not having any defenses against that seems like
>> not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
>> argued to be an API break.

> I think if the alternative is to throw an error, we can afford to retry
> quite a few more times than 10 in order not to have that called an API
> break.  Say, retry (MAXIMUM_ALIGNOF << 3) times or so (if you want to
> parameterize on maxalign).  It's not like this is going to be a
> performance drag where not needed .. but I think leaving back-branches
> unfixed is not great.

Hm.  Index bloat is not something to totally ignore, though, so I'm
not sure what the best cutoff is.

Anyway, here is a patch set teased apart into committable bites,
and with your other points addressed.

            regards, tom lane

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6200699ddd..dd2ade7bb6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -554,7 +554,7 @@ ProcessClientWriteInterrupt(bool blocked)
         {
             /*
              * Don't mess with whereToSendOutput if ProcessInterrupts wouldn't
-             * do anything.
+             * service ProcDiePending.
              */
             if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
             {
@@ -3118,6 +3118,12 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
  * If an interrupt condition is pending, and it's safe to service it,
  * then clear the flag and accept the interrupt.  Called only when
  * InterruptPending is true.
+ *
+ * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
+ * is guaranteed to clear the InterruptPending flag before returning.
+ * (This is not the same as guaranteeing that it's still clear when we
+ * return; another interrupt could have arrived.  But we promise that
+ * any pre-existing one will have been serviced.)
  */
 void
 ProcessInterrupts(void)
@@ -3248,7 +3254,11 @@ ProcessInterrupts(void)
     {
         /*
          * Re-arm InterruptPending so that we process the cancel request as
-         * soon as we're done reading the message.
+         * soon as we're done reading the message.  (XXX this is seriously
+         * ugly: it complicates INTERRUPTS_CAN_BE_PROCESSED(), and it means we
+         * can't use that macro directly as the initial test in this function,
+         * meaning that this code also creates opportunities for other bugs to
+         * appear.)
          */
         InterruptPending = true;
     }
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af..b95bb956b6 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -57,6 +57,15 @@
  * allowing die interrupts: HOLD_CANCEL_INTERRUPTS() and
  * RESUME_CANCEL_INTERRUPTS().
  *
+ * Note that ProcessInterrupts() has also acquired a number of tasks that
+ * do not necessarily cause a query-cancel-or-die response.  Hence, it's
+ * possible that it will just clear InterruptPending and return.
+ *
+ * INTERRUPTS_PENDING_CONDITION() can be checked to see whether an
+ * interrupt needs to be serviced, without trying to do so immediately.
+ * Some callers are also interested in INTERRUPTS_CAN_BE_PROCESSED(),
+ * which tells whether ProcessInterrupts is sure to clear the interrupt.
+ *
  * Special mechanisms are used to let an interrupt be accepted when we are
  * waiting for a lock or when we are waiting for command input (but, of
  * course, only if the interrupt holdoff counter is zero).  See the
@@ -97,24 +106,27 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);

+/* Test whether an interrupt is pending */
 #ifndef WIN32
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(InterruptPending))
+#else
+#define INTERRUPTS_PENDING_CONDITION() \
+    (unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0,  \
+     unlikely(InterruptPending))
+#endif

+/* Service interrupt if one is pending */
 #define CHECK_FOR_INTERRUPTS() \
 do { \
-    if (unlikely(InterruptPending)) \
-        ProcessInterrupts(); \
-} while(0)
-#else                            /* WIN32 */
-
-#define CHECK_FOR_INTERRUPTS() \
-do { \
-    if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
-        pgwin32_dispatch_queued_signals(); \
-    if (unlikely(InterruptPending)) \
+    if (INTERRUPTS_PENDING_CONDITION()) \
         ProcessInterrupts(); \
 } while(0)
-#endif                            /* WIN32 */

+/* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
+#define INTERRUPTS_CAN_BE_PROCESSED() \
+    (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
+     QueryCancelHoldoffCount == 0)

 #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f0..88ae2499dd 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -1905,13 +1905,14 @@ spgSplitNodeAction(Relation index, SpGistState *state,
  * Insert one item into the index.
  *
  * Returns true on success, false if we failed to complete the insertion
- * because of conflict with a concurrent insert.  In the latter case,
- * caller should re-call spgdoinsert() with the same args.
+ * (typically because of conflict with a concurrent insert).  In the latter
+ * case, caller should re-call spgdoinsert() with the same args.
  */
 bool
 spgdoinsert(Relation index, SpGistState *state,
             ItemPointer heapPtr, Datum *datums, bool *isnulls)
 {
+    bool        result = true;
     TupleDesc    leafDescriptor = state->leafTupDesc;
     bool        isnull = isnulls[spgKeyColumn];
     int            level = 0;
@@ -2012,6 +2013,14 @@ spgdoinsert(Relation index, SpGistState *state,
     parent.offnum = InvalidOffsetNumber;
     parent.node = -1;

+    /*
+     * Before entering the loop, try to clear any pending interrupt condition.
+     * If a query cancel is pending, we might as well accept it now not later;
+     * while if a non-canceling condition is pending, servicing it here avoids
+     * having to restart the insertion and redo all the work so far.
+     */
+    CHECK_FOR_INTERRUPTS();
+
     for (;;)
     {
         bool        isNew = false;
@@ -2019,9 +2028,18 @@ spgdoinsert(Relation index, SpGistState *state,
         /*
          * Bail out if query cancel is pending.  We must have this somewhere
          * in the loop since a broken opclass could produce an infinite
-         * picksplit loop.
+         * picksplit loop.  However, because we'll be holding buffer lock(s)
+         * after the first iteration, ProcessInterrupts() wouldn't be able to
+         * throw a cancel error here.  Hence, if we see that an interrupt is
+         * pending, break out of the loop and deal with the situation below.
+         * Set result = false because we must restart the insertion if the
+         * interrupt isn't a query-cancel-or-die case.
          */
-        CHECK_FOR_INTERRUPTS();
+        if (INTERRUPTS_PENDING_CONDITION())
+        {
+            result = false;
+            break;
+        }

         if (current.blkno == InvalidBlockNumber)
         {
@@ -2140,10 +2158,14 @@ spgdoinsert(Relation index, SpGistState *state,
              * spgAddNode and spgSplitTuple cases will loop back to here to
              * complete the insertion operation.  Just in case the choose
              * function is broken and produces add or split requests
-             * repeatedly, check for query cancel.
+             * repeatedly, check for query cancel (see comments above).
              */
     process_inner_tuple:
-            CHECK_FOR_INTERRUPTS();
+            if (INTERRUPTS_PENDING_CONDITION())
+            {
+                result = false;
+                break;
+            }

             innerTuple = (SpGistInnerTuple) PageGetItem(current.page,
                                                         PageGetItemId(current.page, current.offnum));
@@ -2267,5 +2289,21 @@ spgdoinsert(Relation index, SpGistState *state,
         UnlockReleaseBuffer(parent.buffer);
     }

-    return true;
+    /*
+     * We do not support being called while some outer function is holding a
+     * buffer lock (or any other reason to postpone query cancels).  If that
+     * were the case, telling the caller to retry would create an infinite
+     * loop.
+     */
+    Assert(INTERRUPTS_CAN_BE_PROCESSED());
+
+    /*
+     * Finally, check for interrupts again.  If there was a query cancel,
+     * ProcessInterrupts() will be able to throw the error here.  If it was
+     * some other kind of interrupt that can just be cleared, return false to
+     * tell our caller to retry.
+     */
+    CHECK_FOR_INTERRUPTS();
+
+    return result;
 }
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 88ae2499dd..520983ede1 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -669,7 +669,8 @@ checkAllTheSame(spgPickSplitIn *in, spgPickSplitOut *out, bool tooBig,
  * will eventually terminate if lack of balance is the issue.  If the tuple
  * is too big, we assume that repeated picksplit operations will eventually
  * make it small enough by repeated prefix-stripping.  A broken opclass could
- * make this an infinite loop, though.
+ * make this an infinite loop, though, so spgdoinsert() checks that the
+ * leaf datums get smaller each time.
  */
 static bool
 doPickSplit(Relation index, SpGistState *state,
@@ -1918,6 +1919,8 @@ spgdoinsert(Relation index, SpGistState *state,
     int            level = 0;
     Datum        leafDatums[INDEX_MAX_KEYS];
     int            leafSize;
+    int            prevLeafSize;
+    int            numNoProgressCycles = 0;
     SPPageDesc    current,
                 parent;
     FmgrInfo   *procinfo = NULL;
@@ -1988,9 +1991,10 @@ spgdoinsert(Relation index, SpGistState *state,

     /*
      * If it isn't gonna fit, and the opclass can't reduce the datum size by
-     * suffixing, bail out now rather than getting into an endless loop.
+     * suffixing, bail out now rather than doing a lot of useless work.
      */
-    if (leafSize > SPGIST_PAGE_CAPACITY && !state->config.longValuesOK)
+    if (leafSize > SPGIST_PAGE_CAPACITY &&
+        (isnull || !state->config.longValuesOK))
         ereport(ERROR,
                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                  errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
@@ -2218,6 +2222,7 @@ spgdoinsert(Relation index, SpGistState *state,
                     /* Adjust level as per opclass request */
                     level += out.result.matchNode.levelAdd;
                     /* Replace leafDatum and recompute leafSize */
+                    prevLeafSize = leafSize;
                     if (!isnull)
                     {
                         leafDatums[spgKeyColumn] = out.result.matchNode.restDatum;
@@ -2226,19 +2231,50 @@ spgdoinsert(Relation index, SpGistState *state,
                         leafSize += sizeof(ItemIdData);
                     }

+                    /*
+                     * Check new tuple size; fail if it can't fit, unless the
+                     * opclass says it can handle the situation by suffixing.
+                     * In that case, prevent an infinite loop by checking to
+                     * see if we are actually making progress by reducing the
+                     * leafSize in each pass.  This is a bit tricky
+                     * though.  Because of alignment considerations, the total
+                     * tuple size might not decrease on every pass.  Also,
+                     * there are edge cases where the choose method might seem
+                     * to not make progress for a cycle or two.  Somewhat
+                     * arbitrarily, we allow up to 10 no-progress iterations
+                     * before failing.  (The limit should be more than
+                     * MAXALIGN, to accommodate opclasses that trim one byte
+                     * from the leaf datum per pass.)
+                     */
+                    if (leafSize > SPGIST_PAGE_CAPACITY)
+                    {
+                        bool        ok = false;
+
+                        if (state->config.longValuesOK && !isnull)
+                        {
+                            if (leafSize < prevLeafSize)
+                            {
+                                ok = true;
+                                numNoProgressCycles = 0;
+                            }
+                            else if (++numNoProgressCycles < 10)
+                                ok = true;
+                        }
+                        if (!ok)
+                            ereport(ERROR,
+                                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                                     errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
+                                            leafSize - sizeof(ItemIdData),
+                                            SPGIST_PAGE_CAPACITY - sizeof(ItemIdData),
+                                            RelationGetRelationName(index)),
+                                     errhint("Values larger than a buffer page cannot be indexed.")));
+                    }
+
                     /*
                      * Loop around and attempt to insert the new leafDatum at
                      * "current" (which might reference an existing child
                      * tuple, or might be invalid to force us to find a new
                      * page for the tuple).
-                     *
-                     * Note: if the opclass sets longValuesOK, we rely on the
-                     * choose function to eventually shorten the leafDatum
-                     * enough to fit on a page.  We could add a test here to
-                     * complain if the datum doesn't get visibly shorter each
-                     * time, but that could get in the way of opclasses that
-                     * "simplify" datums in a way that doesn't necessarily
-                     * lead to physical shortening on every cycle.
                      */
                     break;
                 case spgAddNode:
diff --git a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
index 0ac99d08fb..ac0ddcecea 100644
--- a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
+++ b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
@@ -61,6 +61,12 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid           |  1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)

+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
 drop index t_f1_f2_f3_idx;
 create index on t using spgist(f1 name_ops_old) include(f2, f3);
 \d+ t_f1_f2_f3_idx
@@ -100,3 +106,7 @@ select * from t
  binary_upgrade_set_next_toast_pg_class_oid           |  1 | binary_upgrade_set_next_toast_pg_class_oid
 (9 rows)

+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+ERROR:  54000
+\set VERBOSITY default
diff --git a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
index 76e78ba41c..982f221a8b 100644
--- a/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
+++ b/src/test/modules/spgist_name_ops/sql/spgist_name_ops.sql
@@ -27,6 +27,12 @@ select * from t
   where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
   order by 1;

+-- Verify clean failure when INCLUDE'd columns result in overlength tuple
+-- The error message details are platform-dependent, so show only SQLSTATE
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default
+
 drop index t_f1_f2_f3_idx;

 create index on t using spgist(f1 name_ops_old) include(f2, f3);
@@ -39,3 +45,7 @@ select * from t
 select * from t
   where f1 > 'binary_upgrade_set_n' and f1 < 'binary_upgrade_set_p'
   order by 1;
+
+\set VERBOSITY sqlstate
+insert into t values(repeat('xyzzy', 12), 42, repeat('xyzzy', 4000));
+\set VERBOSITY default

Re: OOM in spgist insert

From
Tom Lane
Date:
I wrote:
> Anyway, here is a patch set teased apart into committable bites,
> and with your other points addressed.

Oh, maybe some docs would be a good thing too ...

            regards, tom lane

diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index cfb2b3c836..18f1f3cdbd 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -978,6 +978,18 @@ LANGUAGE C STRICT;
    fails to do that, the <acronym>SP-GiST</acronym> core resorts to
    extraordinary measures described in <xref linkend="spgist-all-the-same"/>.
   </para>
+
+  <para>
+   When <structfield>longValuesOK</structfield> is true, it is expected
+   that successive levels of the <acronym>SP-GiST</acronym> tree will
+   absorb more and more information into the prefixes and node labels of
+   the inner tuples, making the required leaf datum smaller and smaller,
+   so that eventually it will fit on a page.
+   To prevent bugs in operator classes from causing infinite insertion
+   loops, the <acronym>SP-GiST</acronym> core will raise an error if the
+   leaf datum does not become any smaller within ten cycles
+   of <function>choose</function> method calls.
+  </para>
  </sect2>
 
  <sect2 id="spgist-null-labels">
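[Editorial note: the documented no-progress guard can be sketched in standalone C. This is an illustrative model with invented names (ProgressGuard, check_progress, MAX_NO_PROGRESS_CYCLES), not the actual PostgreSQL source; the real check lives in spgdoinsert.c and reports an error via ereport.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the guard described above: remember the smallest
 * leaf datum size seen so far, and count consecutive choose() cycles that
 * fail to shrink it.  All names here are assumptions for illustration. */
#define MAX_NO_PROGRESS_CYCLES 10

typedef struct
{
    size_t best_leaf_size;   /* smallest leaf datum size seen so far */
    int    stalled_cycles;   /* consecutive cycles with no shrinkage */
} ProgressGuard;

/* Returns 0 if insertion may continue, -1 if the opclass failed to
 * shorten the datum and the core should raise an error instead of
 * looping (and allocating) forever. */
static int
check_progress(ProgressGuard *g, size_t leaf_size)
{
    if (leaf_size < g->best_leaf_size)
    {
        g->best_leaf_size = leaf_size;
        g->stalled_cycles = 0;       /* progress: reset the counter */
    }
    else if (++g->stalled_cycles >= MAX_NO_PROGRESS_CYCLES)
        return -1;                   /* no progress in 10 cycles: error out */
    return 0;
}
```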

Re: OOM in spgist insert

From
Dilip Kumar
Date:
On Fri, May 14, 2021 at 6:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2021-May-13, Tom Lane wrote:
> >> What do people think about back-patching this?  In existing branches,
> >> we've defined it to be an opclass bug if it fails to shorten the leaf
> >> datum enough.  But not having any defenses against that seems like
> >> not a great idea.  OTOH, the 10-cycles-to-show-progress rule could be
> >> argued to be an API break.
>
> > I think if the alternative is to throw an error, we can afford to retry
> > quite a few more times than 10 in order not have that called an API
> > break.  Say, retry (MAXIMUM_ALIGNOF << 3) times or so (if you want to
> > parameterize on maxalign).  It's not like this is going to be a
> > performance drag where not needed .. but I think leaving back-branches
> > unfixed is not great.
>
> Hm.  Index bloat is not something to totally ignore, though, so I'm
> not sure what the best cutoff is.
>
> Anyway, here is a patch set teased apart into committable bites,
> and with your other points addressed.

I have tested with my original issue, and this patch fixes it.  Thanks!


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: OOM in spgist insert

From
Pavel Borisov
Date:
Now that a check for leaf-tuple shortening has been added, longValuesOK has become slightly misleading. I'd propose replacing it with a more legible name: longValuesOK doesn't mean long values are guaranteed to be OK, only that we can try to shorten them. What about tryShortening, trySuffixing, or can(Try)ShortenTuple?


--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: OOM in spgist insert

From
Pavel Borisov
Date:

Now that a check for leaf-tuple shortening has been added, longValuesOK has become slightly misleading. I'd propose replacing it with a more legible name: longValuesOK doesn't mean long values are guaranteed to be OK, only that we can try to shorten them. What about tryShortening, trySuffixing, or can(Try)ShortenTuple?
Or maybe it's even more logical now to invert it, as dontTrySuffixing, for use in opclasses such as kd-tree and quadtree, which are guaranteed to have the same key data length at every tree level?

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: OOM in spgist insert

From
Robert Haas
Date:
On Thu, May 13, 2021 at 6:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OTOH, the 10-cycles-to-show-progress rule could be
> argued to be an API break.

Not being familiar with this code, I don't really understand why 10
cycles to show progress wouldn't, like 640kB, be enough for anyone.
But as far as back-patching goes, I think the question is
not so much whether this restriction could hypothetically break
anything but whether it will actually break anything, which leads to
the question of how many spgist opclasses we think exist outside of
core. I did a Google search and found some evidence that PostGIS might
have such things, and also this:

https://github.com/fake-name/pg-spgist_hamming

There might be other things, but I did not find them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: OOM in spgist insert

From
Tom Lane
Date:
Pavel Borisov <pashkin.elfe@gmail.com> writes:
> Now that a check for leaf-tuple shortening has been added, longValuesOK
> has become slightly misleading. I'd propose replacing it with a more
> legible name: longValuesOK doesn't mean long values are guaranteed to
> be OK, only that we can try to shorten them. What about tryShortening,
> trySuffixing, or can(Try)ShortenTuple?

That field name is part of the opclass API.  I fear it's several years
too late to rename it for no compelling reason.

            regards, tom lane



Re: OOM in spgist insert

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 13, 2021 at 6:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> OTOH, the 10-cycles-to-show-progress rule could be
>> argued to be an API break.

> Not being familiar with this code, I don't really understand why 10
> cycles to show progress wouldn't, like 640kB, be enough for anyone.

Yeah, after further thought I'm thinking that that ought to be plenty.
In released branches, that code will never execute at all unless the
datum-to-be-indexed is larger than ~8kB.  If you are starting with,
say, a 30kB string, you'd better be removing a heck of a lot more than
one byte per tree level, or your search performance is going to be
abysmal.  Maybe algorithmic oddities would sometimes result in
seemingly making no progress for one cycle, but I doubt there's need
for more slop than that.  In this light, a 10-cycle grace period seems
if anything excessive.
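[Editorial note: the arithmetic behind the point above can be made concrete. This is an illustrative back-of-the-envelope calculation using approximate numbers (a ~8kB page limit and a 30kB starting datum), not figures taken from the PostgreSQL source.]

```python
# Rough arithmetic behind the argument above: a leaf datum only needs
# shortening once it exceeds roughly one page, and an opclass that
# removes only one byte per tree level would need a huge number of
# levels before the datum fits.  All numbers here are approximations.
datum_size = 30 * 1024        # starting 30kB string
page_limit = 8 * 1024         # rough size at which a leaf tuple fits
bytes_removed_per_level = 1   # worst-case "progress" per tree level

levels_needed = (datum_size - page_limit) // bytes_removed_per_level
print(levels_needed)  # ~22528 tree levels -- clearly abysmal for search
```

With search depth proportional to tens of thousands of levels, any sane opclass must remove far more than one byte per level, which is why a 10-cycle grace period already looks generous.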

> But as far as back-patching goes, I think the question is
> not so much whether this restriction could hypothetically break
> anything but whether it will actually break anything, which leads to
> the question of how many spgist opclasses we think exist outside of
> core.

That is also an interesting viewpoint.  codesearch.debian.net knows
of no external SPGiST opclasses other than PostGIS'.  They don't
seem to have indexed the two you found on github, though.  None of
those four set longValuesOK to true, which means that the whole
discussion is moot for them.  So it's entirely plausible that
spgtextproc.c is the only affected code anywhere.

Of course, that conclusion could lead to the position that there's
no point in back-patching anyway, since there's no reason to think
that spgtextproc.c is buggy.

            regards, tom lane



Re: OOM in spgist insert

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> This comment made me remember a patch I've had for a while, which splits
> the CHECK_FOR_INTERRUPTS() definition in two -- one of them is
> INTERRUPTS_PENDING_CONDITION() which let us test the condition
> separately; that allows the lock we hold to be released prior to
> actually processing the interrupts.

I've now pushed that macro change ...

> The btree code modified was found to be an actual problem in production
> when a btree is corrupted in such a way that vacuum would get an
> infinite loop.  I don't remember the exact details but I think we saw
> vacuum running for a couple of weeks, and had to restart the server in
> order to terminate it (since it wouldn't respond to signals).

... but I think this bit still needs work, if we still want it at all.
The problem is that it seems to believe that ProcessInterrupts is
guaranteed not to return, which is far from the truth.  Maybe it was
true once, but we've grown a lot of accretions on it that will just
clear InterruptPending and return.  I see that the "return false"
leads directly to an "Assert(false)", which seems unhelpful.

            regards, tom lane
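[Editorial note: the split Alvaro describes, testing for a pending interrupt separately so a lock can be released before the interrupt is serviced, can be modeled in standalone C. This is a simplified simulation with stand-in names (release_lock, process_interrupts, maybe_handle_interrupt); the real PostgreSQL macros are CHECK_FOR_INTERRUPTS() and INTERRUPTS_PENDING_CONDITION() in miscadmin.h, and as Tom notes, the real ProcessInterrupts is not guaranteed to exit rather than return.]

```c
#include <assert.h>
#include <stdbool.h>

static volatile bool InterruptPending = false;  /* set by a signal handler */
static bool lock_held = false;

/* Stand-ins for LWLockRelease and ProcessInterrupts. */
static void release_lock(void)       { lock_held = false; }
static void process_interrupts(void) { InterruptPending = false; }

/* The pattern under discussion: test the condition first (the analogue
 * of INTERRUPTS_PENDING_CONDITION()), drop the lock we hold, and only
 * then service the interrupt.  This avoids processing interrupts while
 * still holding the lock, which is what makes the runaway-vacuum case
 * unkillable. */
static void
maybe_handle_interrupt(void)
{
    if (InterruptPending)       /* INTERRUPTS_PENDING_CONDITION() analogue */
    {
        release_lock();         /* safe: no lock held across the handler */
        process_interrupts();   /* CHECK_FOR_INTERRUPTS() body analogue */
    }
}
```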