Discussion: BUG #15815: Upgraded from 9.6.8 > 9.6.12 on AWS Aurora: SELECTs causing segmentation fault


BUG #15815: Upgraded from 9.6.8 > 9.6.12 on AWS Aurora: SELECTs causing segmentation fault

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15815
Logged by:          Steve I
Email address:      postgres-ca@byerquest.com
PostgreSQL version: 9.6.12
Operating system:   Amazon Aurora
Description:

AWS Aurora, stable on 9.6.6/8 for a year, updated to 9.6.12:

LOG:  server process (PID 31294) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: 
simplified >>> SELECT value FROM {table} WHERE lower(substring(value,1,1000)) >= '{string}'
LOG:  terminating any other active server processes
FATAL:  Can't handle storage runtime process crash

This specific SQL causes a segfault on our dataset 100% of the time. If I
change any part of it, it doesn't: e.g. removing lower or substring, changing
> to <, or changing any part of the string. We have a few other variations,
but this example is the most often reported and the most reproducible.

Guidance on whether this is a known issue, how to provide additional
information to trace it further in an AWS environment, or how to work around
it would be much appreciated.


PG Bug reporting form <noreply@postgresql.org> writes:
> AWS Aurora, stable on 9.6.6/8 for a year, updated to 9.6.12:

> LOG:  server process (PID 31294) was terminated by signal 11: Segmentation
> fault
> DETAIL:  Failed process was running:
> simplified >>> SELECT value FROM {table} WHERE lower(substring(value,1,1000)) >= '{string}'
> LOG:  terminating any other active server processes

Huh.  Can you get a stack trace from that?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

Also, could we see the definition of the table (psql \d would be
helpful)?

            regards, tom lane



Re: BUG #15815: Upgraded from 9.6.8 > 9.6.12 on AWS Aurora: SELECTs causing segmentation fault

From
Euler Taveira
Date:
On Tue, May 21, 2019 at 11:27, PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> AWS Aurora, stable on 9.6.6/8 for a year, updated to 9.6.12:
>
Aurora is a Postgres fork, so you should report it to Amazon. However, ...

> LOG:  server process (PID 31294) was terminated by signal 11: Segmentation
> fault
> DETAIL:  Failed process was running:
> simplified >>> SELECT value FROM {table} WHERE lower(substring(value,1,1000)) >= '{string}'
> LOG:  terminating any other active server processes
> FATAL:  Can't handle storage runtime process crash
>
Could you reproduce it with stock Postgres? Could you provide a test case?


--
   Euler Taveira                                   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training



I probably need someone at AWS to get a stack trace from behind the curtain, so I started a thread there: https://forums.aws.amazon.com/thread.jspa?threadID=303488&tstart=0

     Column     |  Type   |                      Modifiers
----------------+---------+------------------------------------------------------
             a  | integer | not null default nextval('{table3}_seq'::regclass)
             b  | integer |
             c  | integer |
             d  | text    |
Indexes:
    … PRIMARY KEY, btree (a)
    … UNIQUE CONSTRAINT, btree (b, c)
    … btree (b)
    … btree (c)
    … btree (lower("substring"(d, 1, 1000)) text_pattern_ops, b)
    … btree (lower("substring"(d, 1, 1000)), b)
Foreign-key constraints:
    … FOREIGN KEY (b) REFERENCES {table2}(b)
    … FOREIGN KEY (c) REFERENCES {table1}(c)
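A test case of the kind Euler asks for could start from a schema like the one below. This is a minimal sketch for stock PostgreSQL that mirrors the \d output above. The names demo_table1, demo_table2, demo_table3, demo_table3_seq and every index name are hypothetical stand-ins for the redacted {table1}/{table2}/{table3} placeholders and the elided index names. It also assumes that the redacted `value` column in the failing query corresponds to column d. Loading representative data is not shown.

```sql
-- Hypothetical stand-ins for the redacted {table1} and {table2}.
CREATE TABLE demo_table1 (c integer PRIMARY KEY);
CREATE TABLE demo_table2 (b integer PRIMARY KEY);

-- Hypothetical stand-in for {table3} and its sequence.
CREATE SEQUENCE demo_table3_seq;
CREATE TABLE demo_table3 (
    a integer NOT NULL DEFAULT nextval('demo_table3_seq'::regclass) PRIMARY KEY,
    b integer REFERENCES demo_table2 (b),
    c integer REFERENCES demo_table1 (c),
    d text,
    UNIQUE (b, c)
);

-- Plain btree indexes on b and c, as in the \d output.
CREATE INDEX demo_table3_b_idx ON demo_table3 (b);
CREATE INDEX demo_table3_c_idx ON demo_table3 (c);

-- The two expression indexes on lower("substring"(d, 1, 1000)).
-- text_pattern_ops serves LIKE 'prefix%' style lookups; the default
-- operator class is the one that can serve a plain >= comparison.
CREATE INDEX demo_table3_d_pattern_idx
    ON demo_table3 (lower(substring(d, 1, 1000)) text_pattern_ops, b);
CREATE INDEX demo_table3_d_idx
    ON demo_table3 (lower(substring(d, 1, 1000)), b);

-- The failing query shape; 'example' is a placeholder for the redacted '{string}'.
SELECT d FROM demo_table3 WHERE lower(substring(d, 1, 1000)) >= 'example';
```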

The dataset is over 200 GB, so it will take time to run tests on stock Postgres.




Steve <postgres-ca@byrequest.com> writes:
>      Column     |  Type   |                      Modifiers
> ----------------+---------+------------------------------------------------------
>              a  | integer | not null default nextval('{table3}_seq'::regclass)
>              b  | integer |
>              c  | integer |
>              d  | text    |
> Indexes:
>     … PRIMARY KEY, btree (a)
>     … UNIQUE CONSTRAINT, btree (b, c)
>     … btree (b)
>     … btree (c)
>     … btree (lower("substring"(d, 1, 1000)) text_pattern_ops, b)
>     … btree (lower("substring"(d, 1, 1000)), b)
> Foreign-key constraints:
>     … FOREIGN KEY (b) REFERENCES {table2}(b)
>     … FOREIGN KEY (c) REFERENCES {table1}(c)

Hm, so this query is probably using the last of those indexes ---
could we see EXPLAIN output to confirm that?
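The EXPLAIN check Tom asks for could look roughly like the sketch below. crash_demo and crash_demo_d_idx are hypothetical names, and 'example' stands in for the redacted '{string}' literal. Plain EXPLAIN only plans the query. EXPLAIN ANALYZE would actually execute it, so on the affected instance it would presumably crash again.

```sql
-- Hypothetical stand-in table carrying only the columns of the suspect index.
CREATE TEMP TABLE crash_demo (b integer, d text);
CREATE INDEX crash_demo_d_idx ON crash_demo (lower(substring(d, 1, 1000)), b);

-- Show the chosen plan without executing the query.
-- 'example' is a placeholder for the redacted '{string}' literal.
EXPLAIN
SELECT d FROM crash_demo WHERE lower(substring(d, 1, 1000)) >= 'example';
```

On the real table, an Index Scan or Bitmap Index Scan node naming the (lower("substring"(d, 1, 1000)), b) index would confirm that this index is used. An empty temporary table like this one may well get a different plan.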

If so, a plausible explanation is that a portion of that index is corrupt,
although it's certainly not very nice that you're getting a crash rather
than an error report.

If you're in a hurry to restore functionality, dropping and recreating
that index would likely make the problem go away ... but it would also
destroy the evidence we'd need to find the cause of the crash.  So if
you can hold off till we see the stack trace, that'd be nice.
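The drop-and-recreate workaround could be sketched as follows. rebuild_demo and rebuild_demo_d_idx are hypothetical names. As noted above, rebuilding destroys the evidence, so it is best tried on a copy, for example a database restored from a snapshot.

```sql
-- Hypothetical stand-in table and index; only the indexed expression matches the thread.
CREATE TEMP TABLE rebuild_demo (b integer, d text);
CREATE INDEX rebuild_demo_d_idx ON rebuild_demo (lower(substring(d, 1, 1000)), b);

-- Option 1: rebuild the index in place. While it runs, this blocks writes
-- to the table and any reads that try to use this index.
REINDEX INDEX rebuild_demo_d_idx;

-- Option 2: drop the index and create it again explicitly.
DROP INDEX rebuild_demo_d_idx;
CREATE INDEX rebuild_demo_d_idx ON rebuild_demo (lower(substring(d, 1, 1000)), b);
```

On a busy production table, CREATE INDEX CONCURRENTLY avoids blocking writes during the rebuild, at the cost of a slower build. It cannot run inside a transaction block.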

            regards, tom lane



Yeah, I agree, and it is. Our first step was to rebuild all the indexes, but the segfault persists.

We're reproducing the case in a database restored from a snapshot. Perhaps I'll reindex again there. Since we have an AWS snapshot, we can jump back pretty fast to retest.

BINGO! An AWS Development Manager just stepped in… They've identified the problem and are deploying a patch release: https://forums.aws.amazon.com/thread.jspa?messageID=901775&#901775

Steve <postgres-ca@byrequest.com> writes:
> BINGO! An AWS Development Manager just stepped in…  They've identified the
> problem and are deploying a patch release
> https://forums.aws.amazon.com/thread.jspa?messageID=901775&#901775

Oh, so it was their bug not ours?  Sure wish there was more detail there.

            regards, tom lane



Re: BUG #15815: Upgraded from 9.6.8 > 9.6.12 on AWS Aurora: SELECTs causing segmentation fault

From
Michael Paquier
Date:
On Tue, May 21, 2019 at 03:25:11PM -0400, Tom Lane wrote:
> Steve <postgres-ca@byrequest.com> writes:
>> BINGO! An AWS Development Manager just stepped in…  They've identified the
>> problem and are deploying a patch release
>> https://forums.aws.amazon.com/thread.jspa?messageID=901775&#901775
>
> Oh, so it was their bug not ours?  Sure wish there was more detail there.

As far as I understand, Aurora uses a different engine than Postgres,
so we are likely not affected by that. Let's see if we get any
feedback.
--
Michael

Patched Aurora from 1.5.0 to 1.5.1 (which fixes an issue with index prefetch). The issue appears to be fully resolved.

Thanks for the help leading into this.
