Обсуждение: Question about using AggCheckCallContext in a C function

Поиск
Список
Период
Сортировка

Question about using AggCheckCallContext in a C function

От
Matt Solnit
Дата:
Hi everyone.  A while back, I wrote a function to sum array contents (think
1-D matrix addition), and am using it in a custom SUM aggregate.  Here's an
example:

CREATE TABLE foo (arr INT[]);
INSERT INTO foo VALUES ('{1, 2, 3}'), ('{4, 5, 6}');

SELECT SUM(arr) FROM foo;
   sum
---------
 {5,7,9}
(1 row)

This works, but I got some great advice from Craig Ringer about how I can
make it even faster using AggCheckCallContext().  See http://stackoverflow.com/a/16996606/6198

I'm trying to implement this now, and I'm running into occasional memory
corruption.  Sometimes the back-end segfaults, sometimes I get seemingly-
random "ERROR:  could not open relation with OID xyz" errors.

Here's the relevant code snippet:

// Assumptions:
// 1. We will never be called with a null array.  The CREATE FUNCTION call should specify STRICT to prevent PostgreSQL
fromdoing this. 
// 2. The arrays will never contain null values.  It's up to the caller to ensure this.
// 3. The arrays will always be the same length.  It's up to the caller to ensure this.

ArrayType *array1, *array2;
Datum *arrayData1, *arrayData2;
int arrayLength1;

array2 = PG_GETARG_ARRAYTYPE_P(1);

if (AggCheckCallContext(fcinfo, NULL))
{
  // We are being called by an aggregate.

  if (PG_ARGISNULL(0))
  {
    // This can happen on the very first call as PostgreSQL sets up the "temporary transition value" for the aggregate.

    // NOTE: This never seems to happen!  Is it because the function is STRICT?

    // Return a copy of the second array.
    PG_RETURN_ARRAYTYPE_P(copy_intArrayType(array2));
  }
  else
  {
    // This means that the first array is a "temporary transition value", and we can safely modify it directly with no
sideeffects. 
    // This avoids the overhead of creating a new array object.

    array1 = PG_GETARG_ARRAYTYPE_P(0);

    // Get a direct pointer to each array's contents.
    arrayData1 = ARR_DATA_PTR(array1);
    arrayData2 = ARR_DATA_PTR(array2);

    // Get the length of the first array (should be the same as the second).
    arrayLength1 = (ARR_DIMS(array1))[0];

    // Add the contents of the second array to the first.
    for (i = 0; i < arrayLength1; i++)
    {
      arrayData1[i] += arrayData2[i];
    }

    // Return the updated array.
    PG_RETURN_ARRAYTYPE_P(array1);
  }
}
else
{
  // We are not being called by an aggregate.
  // <omitted>
}

After poring over the code in nodeAgg.c, and looking at the in8inc()
function, I think I know what the problem is:  the typical use of
AggCheckCallContext() is not compatible with TOAST-able data types.

When the state transition function is STRICT (as mine is), and
advance_transition_function() comes across the first non-NULL value,
it makes a copy using datumCopy().  However, from what I can tell,
datumCopy() is not really "compatible" with TOAST for this
particular use case.

Am I on the right track here?  If so, what's the best way to solve
this problem?  I would appreciate any help you can offer.

Sincerely,
Matt Solnit <msolnit@soasta.com>

Re: Question about using AggCheckCallContext in a C function

От
Tom Lane
Дата:
Matt Solnit <msolnit@soasta.com> writes:
> After poring over the code in nodeAgg.c, and looking at the in8inc()
> function, I think I know what the problem is:  the typical use of
> AggCheckCallContext() is not compatible with TOAST-able data types.

That's nonsense.  There are several standard aggregates that use
that with array transition values.

Personally, I'd wonder about the blind-faith assumption in your code that
all the input arrays are exactly the same length, with no NULL elements.
At the very least a check for that would seem advisable.  An empty
(zero-dimensional) array could also make this code crash, so I'd be
inclined to put in a check that ARR_NDIM() is 1, too.

            regards, tom lane


Re: Question about using AggCheckCallContext in a C function

От
Matt Solnit
Дата:
On Aug 12, 2013, at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us>
 wrote:

> Matt Solnit <msolnit@soasta.com> writes:
>> After poring over the code in nodeAgg.c, and looking at the in8inc()
>> function, I think I know what the problem is:  the typical use of
>> AggCheckCallContext() is not compatible with TOAST-able data types.
>
> That's nonsense.  There are several standard aggregates that use
> that with array transition values.
>
> Personally, I'd wonder about the blind-faith assumption in your code that
> all the input arrays are exactly the same length, with no NULL elements.
> At the very least a check for that would seem advisable.  An empty
> (zero-dimensional) array could also make this code crash, so I'd be
> inclined to put in a check that ARR_NDIM() is 1, too.
>
>             regards, tom lane

Thanks for your reply, albeit gruffly-worded :-).  I'm certainly a
novice to the PostgreSQL source code, so I'm not surprised that my
hypothesis was wrong.

Regarding the assumptions, I am perfectly okay with them because I
have complete control over the inputs.  We're using this function
with a very precise data set.  I did, however, take a moment to
verify that *every* value in the table matches my assumptions, and
it does.

So where do I go from here?

Additional information that might helpful:

1. When the crash occurs, and I inspect using gdb, I consistently
find that the first array's contents are "garbage".  For example:

  (gdb) print array1->dataoffset
  $6 = -1795162110

  (gdb) print array1->ndim
  $9 = 989856262

while the second array looks fine:

  (gdb) print array2->dataoffset
  $7 = 0

  (gdb) print array2->ndim
  $10 = 1

2. The function seems to work consistently when I do a SELECT
SUM(mycol) without any GROUP BY.  It's only when I add grouping that
the failures happen.  I'm not sure if this is a real clue or a red
herring.

Finally, can you tell me what precisely happens when you call
datumCopy() with ArrayType?  If it's only returning a copy of
the TOAST reference, then how is it safe for the transition function
to modify the content?  I'm probably *completely* misunderstanding
how this works, so I would love to be enlightened :-).

Sincerely,
Matt Solnit


Re: Question about using AggCheckCallContext in a C function

От
Tom Lane
Дата:
Matt Solnit <msolnit@soasta.com> writes:
> 2. The function seems to work consistently when I do a SELECT
> SUM(mycol) without any GROUP BY.  It's only when I add grouping that
> the failures happen.  I'm not sure if this is a real clue or a red
> herring.

That isn't enormously surprising, since the memory management for
the transition values is different in the two cases.

> Finally, can you tell me what precisely happens when you call
> datumCopy() with ArrayType?  If it's only returning a copy of
> the TOAST reference, then how is it safe for the transition function
> to modify the content?  I'm probably *completely* misunderstanding
> how this works, so I would love to be enlightened :-).

You're right, datumCopy() won't expand a TOAST reference.  What does
expand it is PG_GETARG_ARRAYTYPE_P().  So if you have a case where the
system picks up a copy of an array input that happens to be toasted,
it's the GETARG step in the next invocation of the aggregate transition
function that expands the TOAST reference, and then after that you have an
in-memory copy that's safe to modify.  Maybe you're missing that somehow?
The code fragment you showed looked okay but ...

            regards, tom lane


Re: Question about using AggCheckCallContext in a C function

От
Matt Solnit
Дата:
On Aug 12, 2013, at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Matt Solnit <msolnit@soasta.com> writes:
>> 2. The function seems to work consistently when I do a SELECT
>> SUM(mycol) without any GROUP BY.  It's only when I add grouping that
>> the failures happen.  I'm not sure if this is a real clue or a red
>> herring.
>
> That isn't enormously surprising, since the memory management for
> the transition values is different in the two cases.
>
>> Finally, can you tell me what precisely happens when you call
>> datumCopy() with ArrayType?  If it's only returning a copy of
>> the TOAST reference, then how is it safe for the transition function
>> to modify the content?  I'm probably *completely* misunderstanding
>> how this works, so I would love to be enlightened :-).
>
> You're right, datumCopy() won't expand a TOAST reference.  What does
> expand it is PG_GETARG_ARRAYTYPE_P().  So if you have a case where the
> system picks up a copy of an array input that happens to be toasted,
> it's the GETARG step in the next invocation of the aggregate transition
> function that expands the TOAST reference, and then after that you have an
> in-memory copy that's safe to modify.  Maybe you're missing that somehow?
> The code fragment you showed looked okay but ...
>
>             regards, tom lane

I think I figured it out.  The problem is this line:

  Datum *arrayData1, *arrayData2;

Datum* was correct when I first started this journey, using deconstruct_array(),
but is incorrect when accessing the array's content directly using
ARR_DATA_PTR().  Changing these to int* fixes the problem, at least
on all the systems I've tried so far.

I've been wondering why the broken code worked without a GROUP BY,
and I think it was just dumb luck.  With no GROUP BY, I was only
overrunning a single buffer, and maybe the effects were not
immediately apparent.  With GROUP BY, however, there's a buffer
overrun for each group, and each one increases the chance of doing
something catastrophic.

Sincerely,
Matt Solnit