Thread: Why is parula failing?


Why is parula failing?

From
Tom Lane
Date:
For the last few days, buildfarm member parula has been intermittently
failing the partition_prune regression test, due to unexpected plan
changes [1][2][3][4].  The symptoms can be reproduced exactly by
inserting a "vacuum" of one or another of the partitions of table
"ab", so we can presume that the underlying cause is an autovacuum run
against one of those tables.  But why is that happening?  None of
those tables receive any insertions during the test, so I don't
understand why autovacuum would trigger on them.
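
For reference, the documented autovacuum trigger rule is a simple threshold test; a rough sketch using the default settings (illustrative Python, not PostgreSQL's actual C code):

```python
# Rough sketch of autovacuum's documented trigger rule, using the
# default settings (autovacuum_vacuum_threshold = 50,
# autovacuum_vacuum_scale_factor = 0.2).  Illustrative only.
def needs_autovacuum(dead_tuples, reltuples,
                     vac_threshold=50, vac_scale_factor=0.2):
    # A table is chosen for vacuuming once its dead-tuple count
    # exceeds threshold + scale_factor * reltuples.
    return dead_tuples > vac_threshold + vac_scale_factor * reltuples

# A table that receives no writes accumulates no dead tuples, so it
# should never cross the threshold:
print(needs_autovacuum(dead_tuples=0, reltuples=1000))    # False
print(needs_autovacuum(dead_tuples=300, reltuples=1000))  # True
```

Which is why an autovacuum run on tables that see no insertions during the test would be surprising.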

I suppose we could attach "autovacuum=off" settings to these tables,
but it doesn't seem to me that that should be necessary.  These test
cases are several years old and haven't given trouble before.
Moreover, if that's necessary then there are a lot of other regression
tests that would presumably need the same treatment.

I'm also baffled why this wasn't happening before.  I scraped the
buildfarm logs for 3 months back and confirmed my impression that
this is a new failure mode.  But one of these four runs is on
REL_14_STABLE, eliminating the theory that the cause is a recent
HEAD-only change.

Any ideas?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-19%2016%3A09%3A02
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-18%2011%3A13%3A02
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-14%2011%3A40%3A02
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-14%2019%3A00%3A02



Re: Why is parula failing?

From
Matthias van de Meent
Date:
On Tue, 19 Mar 2024 at 20:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> For the last few days, buildfarm member parula has been intermittently
> failing the partition_prune regression test, due to unexpected plan
> changes [1][2][3][4].  The symptoms can be reproduced exactly by
> inserting a "vacuum" of one or another of the partitions of table
> "ab", so we can presume that the underlying cause is an autovacuum run
> against one of those tables.  But why is that happening?  None of
> those tables receive any insertions during the test, so I don't
> understand why autovacuum would trigger on them.
>
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary.  These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.
>
> I'm also baffled why this wasn't happening before.  I scraped the
> buildfarm logs for 3 months back and confirmed my impression that
> this is a new failure mode.  But one of these four runs is on
> REL_14_STABLE, eliminating the theory that the cause is a recent
> HEAD-only change.
>
> Any ideas?

This may be purely coincidental, but yesterday I also did have a
seemingly random failure in the recovery test suite locally, in
t/027_stream_regress.pl, where it changed the join order of exactly
one of the queries (that uses the tenk table multiple times, iirc 3x
or so). As the work I was doing wasn't related to join order-related
problems, this surprised me. After checking for test file changes
(none), I re-ran the tests without recompilation and the failure went
away, so I attributed this to an untimely autoanalyze. However, as
this was also an unexpected plan change in the tests this could be
related.

Sadly, I did not save the output of that test run, so this is just
about all the information I have. The commit I was testing on was
based on ca108be7, and system config is available if needed.

-Matthias



Re: Why is parula failing?

From
Matthias van de Meent
Date:
On Wed, 20 Mar 2024 at 11:50, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 19 Mar 2024 at 20:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > For the last few days, buildfarm member parula has been intermittently
> > failing the partition_prune regression test, due to unexpected plan
> > changes [1][2][3][4].  The symptoms can be reproduced exactly by
> > inserting a "vacuum" of one or another of the partitions of table
> > "ab", so we can presume that the underlying cause is an autovacuum run
> > against one of those tables.  But why is that happening?  None of
> > those tables receive any insertions during the test, so I don't
> > understand why autovacuum would trigger on them.

> This may be purely coincidental, but yesterday I also did have a
> seemingly random failure in the recovery test suite locally, in
> t/027_stream_regress.pl, where it changed the join order of exactly
> one of the queries (that uses the tenk table multiple times, iirc 3x
> or so).
[...]
> Sadly, I did not save the output of that test run, so this is just
> about all the information I have. The commit I was testing on was
> based on ca108be7, and system config is available if needed.

It looks like Cirrus CI reproduced my issue with recent commit
a0390f6c [0] and previously also with 4665cebc [1], 875e46a0 [2],
449e798c [3], and other older commits, on both the Windows Server 2019
build and Debian for a39f1a36 (with slightly different plan changes on
this Debian run).  That should rule out most of the environments.

After analyzing the logs produced by the various primaries, I can't
find a good explanation why they would have this issue.  The table is
vacuum analyzed before the regression tests start, and in this run
autovacuum/autoanalyze doesn't seem to kick in until (at least) several
seconds after this query was run.


Kind regards,

Matthias van de Meent

[0] https://cirrus-ci.com/task/6295909005262848
[1] https://cirrus-ci.com/task/5229745516838912
[2] https://cirrus-ci.com/task/5098544567156736
[3] https://cirrus-ci.com/task/4783906419900416



Re: Why is parula failing?

From
David Rowley
Date:
On Wed, 20 Mar 2024 at 08:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary.  These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.

Is it worth running that animal with log_autovacuum_min_duration = 0
so we can see what's going on in terms of auto-vacuum auto-analyze in
the log?

David



Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Is it worth running that animal with log_autovacuum_min_duration = 0
> so we can see what's going on in terms of auto-vacuum auto-analyze in
> the log?

Maybe, but I'm not sure.  I thought that if parula were somehow
hitting an ill-timed autovac/autoanalyze, it should be possible to
make that reproducible by inserting "pg_sleep(60)" or so in the test
script, to give the autovac daemon plenty of time to come around and
do the dirty deed.  No luck though --- the results didn't change for
me.  So now I'm not sure what is going on.

Perhaps though it is autovacuum, and there's some environment-specific
enabling condition that parula has and my machine doesn't (which
could also help explain why no other animal is doing this).
So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
that would yield a clue.

            regards, tom lane



Re: Why is parula failing?

From
David Rowley
Date:
On Thu, 21 Mar 2024 at 12:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> that would yield a clue.

FWIW, I agree with your earlier statement about it looking very much
like auto-vacuum has run on that table, but equally, if something like
the pg_index record was damaged we could get the same plan change.

We could also do something like the attached just in case we're
barking up the wrong tree.

David

Attachments

Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> We could also do something like the attached just in case we're
> barking up the wrong tree.

Yeah, checking indisvalid isn't a bad idea.  I'd put another
one further down, just before the DROP of table ab, so we
can see the state both before and after the unstable tests.

            regards, tom lane



Re: Why is parula failing?

From
David Rowley
Date:
On Thu, 21 Mar 2024 at 14:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > We could also do something like the attached just in case we're
> > barking up the wrong tree.
>
> Yeah, checking indisvalid isn't a bad idea.  I'd put another
> one further down, just before the DROP of table ab, so we
> can see the state both before and after the unstable tests.

So it's taken quite a while to finally fail again.

Effectively, we're getting:

    relname     | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count
----------------+----------+-----------+------------+------------------+-------------------
- ab_a2_b2      |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2      |        0 |        48 |            |                0 |                 0

I see AddNewRelationTuple() does set reltuples to -1, so I can't quite
figure out why 48 is in there.  Even if auto-analyze had somehow
mistakenly run and the autoanalyze_count stats just were not
up-to-date yet, the table has zero blocks, and I don't see how
acquire_sample_rows() would set *totalrows to anything other than 0.0
in this case.  For the vacuum case, I see that reltuples is set from:

/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
vacrel->scanned_pages,
vacrel->live_tuples);

Again, hard to see how that could come to anything other than zero
given that rel_pages and scanned_pages should be 0.
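
To make that concrete, here is a rough Python transcription of the vac_estimate_reltuples() logic (simplified from src/backend/commands/vacuum.c; treat it as a sketch, not the exact code):

```python
import math

def vac_estimate_reltuples(total_pages, scanned_pages, scanned_tuples,
                           old_rel_pages, old_rel_tuples):
    """Simplified sketch of vacuum's reltuples estimate."""
    # If we scanned the whole relation, the count is exact; for an
    # empty (0-page) table this branch always fires and returns 0.
    if scanned_pages >= total_pages:
        return scanned_tuples
    # Partial scan with no usable previous estimate: assume the
    # observed density holds across the whole relation.
    if old_rel_tuples < 0 or old_rel_pages == 0:
        return math.floor(scanned_tuples / scanned_pages * total_pages + 0.5)
    # Otherwise blend the previous density (for unscanned pages) with
    # the tuples actually counted in the scanned pages.
    old_density = old_rel_tuples / old_rel_pages
    unscanned_pages = total_pages - scanned_pages
    return math.floor(old_density * unscanned_pages + scanned_tuples + 0.5)

# A 0-page table can only ever come out as 0 tuples:
print(vac_estimate_reltuples(0, 0, 0, 0, -1.0))  # 0
```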

Looking at the binary representation of a float of -1 vs 48, they're
not nearly the same. 0xBF800000 vs 0x42400000, so it's not looking
like a flipped bit.
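
Those bit patterns can be double-checked in a few lines of Python:

```python
import struct

def float32_bits(x):
    """IEEE 754 single-precision bit pattern of x, as a hex string."""
    return '0x' + struct.pack('>f', x).hex().upper()

print(float32_bits(-1.0))  # 0xBF800000
print(float32_bits(48.0))  # 0x42400000

# The patterns differ in 9 bit positions, so no single bit flip can
# turn one into the other:
a = int.from_bytes(struct.pack('>f', -1.0), 'big')
b = int.from_bytes(struct.pack('>f', 48.0), 'big')
print(bin(a ^ b).count('1'))  # 9
```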

It would be good to have log_autovacuum_min_duration = 0 on this
machine for a while.

David



RE: Why is parula failing?

From
"Tharakan, Robins"
Date:
Hi David / Tom,

> David Rowley <dgrowleyml@gmail.com> writes:
> It would be good to have log_autovacuum_min_duration = 0 on this machine for a while.

- Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
- Also added REL_16_STABLE to the branches being tested (in case it matters here).

Let me know if I can help with any other changes here.
-
Robins | tharar@ | adelaide@australia

Re: Why is parula failing?

From
David Rowley
Date:
On Tue, 26 Mar 2024 at 21:03, Tharakan, Robins <tharar@amazon.com> wrote:
>
> > David Rowley <dgrowleyml@gmail.com> writes:
> > It would be good to have log_autovacuum_min_duration = 0 on this machine for a while.
>
> - Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
> - Also added REL_16_STABLE to the branches being tested (in case it matters here).

Thanks for doing that.

I see REL_16_STABLE has failed twice already with the same issue.  I
don't see any autovacuum / autoanalyze in the log, so I guess that
rules out autovacuum activity causing this.

Unfortunately, REL_16_STABLE does not have the additional debugging,
so we don't get to know what reltuples was set to.

David



Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Unfortunately, REL_16_STABLE does not have the additional debugging,
> so we don't get to know what reltuples was set to.

Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?

            regards, tom lane



Re: Why is parula failing?

From
David Rowley
Date:
On Wed, 27 Mar 2024 at 18:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Unfortunately, REL_16_STABLE does not have the additional debugging,
> > so we don't get to know what reltuples was set to.
>
> Let's wait a bit to see if it fails in HEAD ... but if not, would
> it be reasonable to back-patch the additional debugging output?

I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
I'm uncertain what a few more failures in master will tell us aside
from whether reltuples == 48 is consistent or whether that value is
going to fluctuate.

Let's give it a week and see if it fails a few more times.

David



Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Wed, 27 Mar 2024 at 18:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Let's wait a bit to see if it fails in HEAD ... but if not, would
>> it be reasonable to back-patch the additional debugging output?

> I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
> I'm uncertain what a few more failures in master will tell us aside
> from whether reltuples == 48 is consistent or whether that value is
> going to fluctuate.

> Let's give it a week and see if it fails a few more times.

We have another failure today [1] with the same symptom:

  ab_a2          |        0 |        -1 |            |                0 |                 0
- ab_a2_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b1       |        0 |        48 |            |                0 |                 0
  ab_a2_b1_a_idx |        1 |         0 | t          |                  |

Different table, same "48" reltuples.  But I have to confess that
I'd not looked closely enough at the previous failure, because
now that I have, this is well out in WTFF territory: how can
reltuples be greater than zero when relpages is zero?  This can't
be a state that autovacuum would have left behind, unless it's
really seriously broken.  I think we need to be looking for
explanations like "memory stomp" or "compiler bug".

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-29%2012%3A46%3A02



Re: Why is parula failing?

From
Tom Lane
Date:
I wrote:
> I'd not looked closely enough at the previous failure, because
> now that I have, this is well out in WTFF territory: how can
> reltuples be greater than zero when relpages is zero?  This can't
> be a state that autovacuum would have left behind, unless it's
> really seriously broken.  I think we need to be looking for
> explanations like "memory stomp" or "compiler bug".

... in connection with which, I can't help noticing that parula
is using a very old compiler:

configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)

From some quick checking around, that would have to be near the
beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
to a primary architecture until earlier that same year).  It's not
exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).

            regards, tom lane



Re: Why is parula failing?

From
David Rowley
Date:
On Sat, 30 Mar 2024 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I'd not looked closely enough at the previous failure, because
> > now that I have, this is well out in WTFF territory: how can
> > reltuples be greater than zero when relpages is zero?  This can't
> > be a state that autovacuum would have left behind, unless it's
> > really seriously broken.  I think we need to be looking for
> > explanations like "memory stomp" or "compiler bug".
>
> ... in connection with which, I can't help noticing that parula
> is using a very old compiler:
>
> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>
> From some quick checking around, that would have to be near the
> beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
> to a primary architecture until earlier that same year).  It's not
> exactly hard to believe that there were some lingering compiler bugs.
> I wonder why parula is using that when its underlying system seems
> markedly newer (the kernel at least has a recent build date).

It could be, but wouldn't the bug have to relate to the locking code
and be caused by some other backend stomping on the memory?
Otherwise, shouldn't it be failing consistently every run rather than
sporadically?

David



Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Sat, 30 Mar 2024 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... in connection with which, I can't help noticing that parula
>> is using a very old compiler:
>> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>> I wonder why parula is using that when its underlying system seems
>> markedly newer (the kernel at least has a recent build date).

> It could be, but wouldn't the bug have to relate to the locking code
> and be caused by some other backend stomping on the memory?
> Otherwise, shouldn't it be failing consistently every run rather than
> sporadically?

Your guess is as good as mine ... but we still haven't seen this
class of failure on any other animal, so the idea that it's strictly
a chance timing issue is getting weaker all the time.

            regards, tom lane



RE: Why is parula failing?

From
"Tharakan, Robins"
Date:
> ... in connection with which, I can't help noticing that parula is using a very old compiler:
>
> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>
> From some quick checking around, that would have to be near the beginning of aarch64
> support in RHEL (Fedora hadn't promoted aarch64 to a primary architecture until earlier
> that same year).  It's not exactly hard to believe that there were some lingering compiler bugs.

> I wonder why parula is using that when its underlying system seems markedly newer (the kernel at least has a recent build date).

Parula used GCC v7.3.1 since that's what came by default.
I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now.
-
Robins



RE: Why is parula failing?

From
"Tharakan, Robins"
Date:
> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now.

So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries,
and some of which were even "signal 6: Aborted".

FWIW, I compiled gcc v13.2 (default options) from source, which IIUC shouldn't be to blame. Two other possible
reasons could be that the buildfarm doesn't have an aarch64 + gcc 13.2 combination (in a quick check I couldn't
see any), or that this hardware is flaky.

Either way, this instance is a throw-away, so let me know if this isn't helping. I'll swap it out if there's not
much benefit to be had.

Also, it'd be great if someone could point me to a way to update the "Compiler" section in "System Detail" on
the buildfarm page (it wrongly shows GCC as v7.3.1).

-
thanks
robins



Re: Why is parula failing?

From
Tom Lane
Date:
"Tharakan, Robins" <tharar@amazon.com> writes:
>> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now.

> So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries,
> and some of which were even "signal 6: Aborted".

Ugh...

> Also, it'd be great if someone could point me to a way to update the "Compiler" section in "System Detail" on
> the buildfarm page (it wrongly shows GCC as v7.3.1).

The update_personality.pl script in the buildfarm client distro
is what to use to adjust OS version or compiler version data.

            regards, tom lane



Re: Why is parula failing?

From
Robins Tharakan
Date:

On Tue, 2 Apr 2024 at 15:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Tharakan, Robins" <tharar@amazon.com> writes:
> > So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries,
> > and some of which were even "signal 6: Aborted".
>
> Ugh...


parula didn't send any reports to buildfarm for the past 44 hours. Logged in
to see that postgres was stuck on pg_sleep(), which was quite odd! I captured
the backtrace and triggered another run on HEAD, which came out
okay.

I'll keep an eye on this instance more often for the next few days.
(Let me know if I could capture more if a run gets stuck again)


(gdb) bt
#0  0x0000ffff952ae954 in epoll_pwait () from /lib64/libc.so.6
#1  0x000000000083e9c8 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=297992, set=0x2816dac0) at latch.c:1570
#2  WaitEventSetWait (set=0x2816dac0, timeout=timeout@entry=600000, occurred_events=occurred_events@entry=0xffffc395ed28, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x000000000083ed84 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0000000000907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x0000000000696b10 in ExecInterpExpr (state=0x28384040, econtext=0x28383e38, isnull=<optimized out>) at execExprInterp.c:764
#6  0x00000000006ceef8 in ExecEvalExprSwitchContext (isNull=0xffffc395ee9f, econtext=0x28383e38, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x00000000006b7aec in ExecProcNode (node=0x28383d28) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x28383b38) at nodeGather.c:287
#11 ExecGather (pstate=0x28383b38) at nodeGather.c:222
#12 0x000000000069aa4c in ExecProcNode (node=0x28383b38) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x2831ffb0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x28383b38, estate=0x28383910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x283239c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x000000000086d454 in PortalRunSelect (portal=portal@entry=0x281f0fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2831ffb0) at pquery.c:924
#16 0x000000000086ec70 in PortalRun (portal=portal@entry=0x281f0fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2831ffb0, altdest=altdest@entry=0x2831ffb0, qc=qc@entry=0xffffc395f250) at pquery.c:768
#17 0x000000000086a944 in exec_simple_query (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x000000000086b480 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x0000000000866a0c in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:101
#20 0x00000000007c1738 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xffffc395f718 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xffffc395f720) at launch_backend.c:265
#21 0x00000000007c5120 in BackendStartup (client_sock=0xffffc395f720) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x00000000007c6dc8 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x2816d320) at postmaster.c:1372
#24 0x0000000000496bb8 in main (argc=8, argv=0x2816d320) at main.c:197
 
 
>
> The update_personality.pl script in the buildfarm client distro
> is what to use to adjust OS version or compiler version data.
>
Thanks. Fixed that.

-
robins

Re: Why is parula failing?

From
David Rowley
Date:
On Mon, 8 Apr 2024 at 23:56, Robins Tharakan <tharakan@gmail.com> wrote:
> #3  0x000000000083ed84 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
> #4  0x0000000000907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
> #17 0x000000000086a944 in exec_simple_query (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at postgres.c:1274

I have no idea why WaitLatch has timeout=600000.  That should be no
higher than timeout=100 for "SELECT pg_sleep(0.1);".  I have no
theories aside from a failing RAM module, cosmic ray or a well-timed
clock change between the first call to gettimeofday() in pg_sleep()
and the next one.
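
For context, pg_sleep() re-derives the remaining delay from the clock on every loop iteration and clamps the WaitLatch timeout at 10 minutes, which is where a timeout of exactly 600000 could come from if the clock misbehaved. A rough Python model of that loop (simplified from src/backend/utils/adt/misc.c; a sketch, not the exact code):

```python
import math

def pg_sleep_next_timeout(endtime, now):
    """Rough model of how pg_sleep() picks each WaitLatch timeout (ms).

    endtime was computed as clock + requested_secs when the sleep began;
    the remaining delay is recomputed from the clock on every iteration
    and clamped to 600000 ms (10 minutes).
    """
    delay = endtime - now
    if delay >= 600.0:
        return 600000          # the clamp seen in the stack traces
    elif delay > 0.0:
        return math.ceil(delay * 1000.0)
    else:
        return 0               # sleep finished

# Normal case: pg_sleep(0.1) waits at most ~100 ms per iteration.
print(pg_sleep_next_timeout(100.1, 100.0))         # 100

# If the clock jumped backwards by more than 10 minutes after endtime
# was computed, the remaining delay blows past the clamp:
print(pg_sleep_next_timeout(100.1, 100.0 - 3600))  # 600000
```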

I know this animal is running debug_parallel_query = regress, so that
0.1 Const did have to get serialized and copied to the worker, so
there's another opportunity for the sleep duration to be stomped on,
but that seems pretty unlikely.

I can't think of a reason why the erroneous reltuples=48 would be
consistent over 2 failing runs if it were failing RAM or a cosmic ray.

Still no partition_prune failures on master since the compiler version
change.  There has been one [1] in REL_16_STABLE. I'm thinking it
might be worth backpatching the partition_prune debug to REL_16_STABLE
to see if we can learn anything new from it.

David

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-04-08%2002%3A12%3A02



Re: Why is parula failing?

From
David Rowley
Date:
On Tue, 9 Apr 2024 at 15:48, David Rowley <dgrowleyml@gmail.com> wrote:
> Still no partition_prune failures on master since the compiler version
> change.  There has been one [1] in REL_16_STABLE. I'm thinking it
> might be worth backpatching the partition_prune debug to REL_16_STABLE
> to see if we can learn anything new from it.

Master failed today for the first time since the compiler upgrade.
Again reltuples == 48.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-04-10%2000%3A27%3A02

David



Re: Why is parula failing?

From
Robins Tharakan
Date:

On Wed, 10 Apr 2024 at 10:24, David Rowley <dgrowleyml@gmail.com> wrote:
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

From the buildfarm members page, parula seems to be the only aarch64 + gcc 13.2
combination today, so I wonder whether this is about gcc v13.2 maturity on aarch64.

I'll try to upgrade one of the other aarch64s I have (massasauga or snakefly) and
see if this is more about gcc 13.2 maturity on this architecture.
-
robins

Re: Why is parula failing?

From
Robins Tharakan
Date:

On Wed, 10 Apr 2024 at 10:24, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

Here's what I can add over the past few days:
- Almost all failures are either reltuples=48 or SIGABRTs
- Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
  - A little too coincidental? Recent crashes have stack traces if anyone's interested.

Barring the initial failures (during move to gcc 13.2), in the past week:
- v15 somehow hasn't had a failure yet
- v14 / v16 have had only 1 failure each
- but v12 / v13 are lit up, having failed multiple times.

-
robins

Re: Why is parula failing?

From
Robins Tharakan
Date:

On Mon, 8 Apr 2024 at 21:25, Robins Tharakan <tharakan@gmail.com> wrote:
>
>
> I'll keep an eye on this instance more often for the next few days.
> (Let me know if I could capture more if a run gets stuck again)

 
HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
Stack trace seems to be similar to last time.


$ pstack 24930
#0  0x0000ffffb8280954 in epoll_pwait () from /lib64/libc.so.6
#1  0x0000000000843408 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=600000, set=0x3b38dac0) at latch.c:1570
#2  WaitEventSetWait (set=0x3b38dac0, timeout=timeout@entry=600000, occurred_events=occurred_events@entry=0xfffffd1d66c8, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x00000000008437c4 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x000000000090c384 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x0000000000699350 in ExecInterpExpr (state=0x3b5a41a0, econtext=0x3b5a3f98, isnull=<optimized out>) at execExprInterp.c:764
#6  0x00000000006d1668 in ExecEvalExprSwitchContext (isNull=0xfffffd1d683f, econtext=0x3b5a3f98, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x00000000006ba26c in ExecProcNode (node=0x3b5a3e88) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x3b5a3c98) at nodeGather.c:287
#11 ExecGather (pstate=0x3b5a3c98) at nodeGather.c:222
#12 0x000000000069d28c in ExecProcNode (node=0x3b5a3c98) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x3b5ae8e0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x3b5a3c98, estate=0x3b5a3a70) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x3b59c250, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x00000000008720e4 in PortalRunSelect (portal=portal@entry=0x3b410fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x3b5ae8e0) at pquery.c:924
#16 0x0000000000873900 in PortalRun (portal=portal@entry=0x3b410fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x3b5ae8e0, altdest=altdest@entry=0x3b5ae8e0, qc=qc@entry=0xfffffd1d6bf0) at pquery.c:768
#17 0x000000000086f5d4 in exec_simple_query (query_string=query_string@entry=0x3b391c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0000000000870110 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x000000000086b6a0 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105
#20 0x00000000007c6268 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xfffffd1d70b8 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xfffffd1d70c0) at launch_backend.c:265
#21 0x00000000007c9c50 in BackendStartup (client_sock=0xfffffd1d70c0) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x00000000007cb8f8 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x3b38d320) at postmaster.c:1372
#24 0x0000000000496e18 in main (argc=8, argv=0x3b38d320) at main.c:197



 CPU% MEM%   TIME+  Command
.
.
  0.0  0.0  0:00.00 │     └─ /bin/sh -c cd /opt/postgres/build-farm-14 && PATH=/opt/gcc/home/ec2-user/proj/gcc/target/bin/
  0.0  0.1  0:00.07 │        └─ /usr/bin/perl ./run_build.pl --config=build-farm.conf HEAD --verbose
  0.0  0.0  0:00.00 │           └─ sh -c { cd pgsql.build/src/test/regress && make NO_LOCALE=1 check; echo $? > /opt/postg
  0.0  0.0  0:00.00 │              └─ make NO_LOCALE=1 check
  0.0  0.0  0:00.00 │                 └─ /bin/sh -c echo "# +++ regress check in src/test/regress +++" && PATH="/opt/postg
  0.0  0.0  0:00.10 │                    └─ ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=.
  0.0  0.0  0:00.01 │                       ├─ psql -X -a -q -d regression -v HIDE_TABLEAM=on -v HIDE_TOAST_COMPRESSION=on
  0.0  0.1  0:02.64 │                       └─ postgres -D /opt/postgres/build-farm-14/buildroot/HEAD/pgsql.build/src/test
  0.0  0.2  0:00.05 │                          ├─ postgres: postgres regression [local] SELECT
  0.0  0.0  0:00.06 │                          ├─ postgres: logical replication launcher
  0.0  0.1  0:00.36 │                          ├─ postgres: autovacuum launcher
  0.0  0.1  0:00.34 │                          ├─ postgres: walwriter
  0.0  0.0  0:00.32 │                          ├─ postgres: background writer
  0.0  0.3  0:00.05 │                          └─ postgres: checkpointer

-
robins

Re: Why is parula failing?

From
Tom Lane
Date:
Robins Tharakan <tharakan@gmail.com> writes:
> HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
> Stack trace seems to be similar to last time.

> #3  0x00000000008437c4 in WaitLatch (latch=<optimized out>,
> wakeEvents=wakeEvents@entry=41, timeout=600000,
> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
> #4  0x000000000090c384 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
> ...
> #17 0x000000000086f5d4 in exec_simple_query
> (query_string=query_string@entry=0x3b391c90
> "SELECT pg_sleep(0.1);") at postgres.c:1274

If we were only supposed to sleep 0.1 seconds, how is it waiting
for 600000 ms (and, presumably, repeating that)?  The logic in
pg_sleep is pretty simple, and it's hard to think of anything except
the system clock jumping (far) backwards that would make this
happen.  Any chance of extracting the local variables from the
pg_sleep stack frame?

            regards, tom lane



Re: Why is parula failing?

From
Tomas Vondra
Date:

On 4/9/24 05:48, David Rowley wrote:
> On Mon, 8 Apr 2024 at 23:56, Robins Tharakan <tharakan@gmail.com> wrote:
>> #3  0x000000000083ed84 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
 
>> #4  0x0000000000907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
> 
>> #17 0x000000000086a944 in exec_simple_query (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
> 
> I have no idea why WaitLatch has timeout=600000.  That should be no
> higher than timeout=100 for "SELECT pg_sleep(0.1);".  I have no
> theories aside from a failing RAM module, cosmic ray or a well-timed
> clock change between the first call to gettimeofday() in pg_sleep()
> and the next one.
> 
> I know this animal is running debug_parallel_query = regress, so that
> 0.1 Const did have to get serialized and copied to the worker, so
> there's another opportunity for the sleep duration to be stomped on,
> but that seems pretty unlikely.
> 

AFAIK that GUC is set only for HEAD, so it would not explain the
failures on the other branches.

> I can't think of a reason why the erroneous reltuples=48 would be
> consistent over 2 failing runs if it were failing RAM or a cosmic ray.
> 

Yeah, that seems very unlikely.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is parula failing?

From
Tomas Vondra
Date:
On 4/13/24 15:02, Robins Tharakan wrote:
> On Wed, 10 Apr 2024 at 10:24, David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> Master failed today for the first time since the compiler upgrade.
>> Again reltuples == 48.
> 
> Here's what I can add over the past few days:
> - Almost all failures are either reltuples=48 or SIGABRTs
> - Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
>   - A little too coincidental? Recent crashes have stack-traces, if
> interested.
> 
> Barring the initial failures (during move to gcc 13.2), in the past week:
> - v15 somehow hasn't had a failure yet
> - v14 / v16 have got only 1 failure each
> - but v12 / v13 are lit up - failed multiple times.
> 

I happened to have an unused rpi5, so I installed Armbian aarch64 with
gcc 13.2.0, built with exactly the same configure options as parula, and
did ~300 loops of "make check" without a single failure.

So either parula is packaged in a different way, or maybe it's more of
a timing issue and the rpi5 is way slower than the graviton3.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is parula failing?

From
Robins Tharakan
Date:

On Sun, 14 Apr 2024 at 00:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If we were only supposed to sleep 0.1 seconds, how is it waiting
> for 600000 ms (and, presumably, repeating that)?  The logic in
> pg_sleep is pretty simple, and it's hard to think of anything except
> the system clock jumping (far) backwards that would make this
> happen.  Any chance of extracting the local variables from the
> pg_sleep stack frame?

- I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
- I'll keep them (stuck) for this week, in case there's more we can get
from them (and to see how long they take)
- Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt - REL_16_STABLE)

A few things to add:
- To reiterate, this instance has gcc v13.2 compiled without any special
flags (my first time ever, TBH). IIRC 'make -k check' came out okay,
so at this point I don't think I did anything obviously wrong when
building gcc from git.
- I installed gcc v14.0.1 experimental on massasauga (also an aarch64
and built from git) and despite multiple runs, it seems to be doing okay [1].
- Next week (if I'm still scratching my head - and unless someone advises
otherwise), I'll upgrade parula to gcc 14 experimental to see if this is about
gcc maturity on graviton (for some reason). I don't expect much to come
out of it though (given Tomas's testing on the rpi5, but it doesn't hurt).

Ref:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

-
robins
Attachments

Re: Why is parula failing?

From
David Rowley
Date:
On Mon, 15 Apr 2024 at 16:10, Robins Tharakan <tharakan@gmail.com> wrote:
> - I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
> - I'll keep them (stuck) for this week, in case there's more we can get
> from them (and to see how long they take)
> - Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt - REL_16_STABLE)

Thanks for getting those.

#4  0x000000000090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
        delay = <optimized out>
        delay_ms = <optimized out>
        endtime = 0

This endtime looks like a problem. It seems unlikely to be caused by
gettimeofday's timeval fields being zeroed given that the number of
seconds should have been added to that.

I can't quite make sense of how we end up sleeping at all with a zero
endtime. Assuming the subsequent GetNowFloat() worked, "delay =
endtime - GetNowFloat();" would result in a negative sleep duration
and we'd break out of the sleep loop.

If GetNowFloat() somehow was returning a negative number then we could
end up with a large delay.  But if gettimeofday() was so badly broken
then wouldn't there be some evidence of this in the log timestamps on
failing runs?

I'm not that familiar with the buildfarm config, but I do see some
Valgrind-related settings in there. Is PostgreSQL running under
Valgrind on these runs?

David



Re: Why is parula failing?

From
Robins Tharakan
Date:

On Mon, 15 Apr 2024 at 14:55, David Rowley <dgrowleyml@gmail.com> wrote:
> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay.  But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

3 things stand out for me here, unsure if they're related somehow:

1. Issue where reltuples=48 (in essence runs complete, but few tests fail)
2. SIGABRT - most of which are DDLs (runs complete, but engine crashes + many tests fail)
3. pg_sleep() stuck - (runs never complete, IIUC never gets reported to buildfarm)

For #3, one thing I had done earlier (and then reverted) was to set the
'wait_timeout' from its current undef to 2 hours. I'll set it to 2 hours
again, in hopes that #3 starts getting reported to the buildfarm too.

> I'm not that familiar with the buildfarm config, but I do see some
> Valgrind related setting in there. Is PostgreSQL running under
> Valgrind on these runs?

Not yet. I was tempted, but valgrind has not yet been enabled on
this member. IIUC it's disabled by default.

                   'use_valgrind' => undef,
-
robins

Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> #4  0x000000000090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
>         delay = <optimized out>
>         delay_ms = <optimized out>
>         endtime = 0

> This endtime looks like a problem. It seems unlikely to be caused by
> gettimeofday's timeval fields being zeroed given that the number of
> seconds should have been added to that.

Yes, that is very odd.

> I can't quite make sense of how we end up sleeping at all with a zero
> endtime. Assuming the subsequent GetNowFloat() worked, "delay =
> endtime - GetNowFloat();" would result in a negative sleep duration
> and we'd break out of the sleep loop.

If GetCurrentTimestamp() were returning assorted random values, it
wouldn't be hard to imagine this loop sleeping for a long time.
But it's very hard to see how that theory leads to an "endtime"
of exactly zero rather than some other number, and even harder
to credit two different runs getting "endtime" of exactly zero.

> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay.  But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

And indeed that too.  I'm finding the "compiler bug" theory
palatable.  Robins mentioned having built the compiler from
source, which theoretically should work, but maybe something
went wrong?  Or it's missing some important bug fix?

It might be interesting to back the animal's CFLAGS down
to -O0 and see if things get more stable.

            regards, tom lane



Re: Why is parula failing?

From
Robins Tharakan
Date:
On Mon, 15 Apr 2024 at 16:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
> > If GetNowFloat() somehow was returning a negative number then we could
> > end up with a large delay.  But if gettimeofday() was so badly broken
> > then wouldn't there be some evidence of this in the log timestamps on
> > failing runs?
>
> And indeed that too.  I'm finding the "compiler bug" theory
> palatable.  Robins mentioned having built the compiler from
> source, which theoretically should work, but maybe something
> went wrong?  Or it's missing some important bug fix?
>
> It might be interesting to back the animal's CFLAGS down
> to -O0 and see if things get more stable.

The last 25 consecutive runs have passed [1] after switching
REL_12_STABLE to -O0! So I am wondering whether that confirms that
the compiler version is to blame, and while we're still here,
is there anything else I could try?

If not, by Sunday, I am considering switching parula to gcc v12 (or even
v14 experimental - given that massasauga [2] has been pretty stable since
its upgrade a few days back).

Reference:

Re: Why is parula failing?

From
David Rowley
Date:
On Tue, 16 Apr 2024 at 18:58, Robins Tharakan <tharakan@gmail.com> wrote:
> The last 25 consecutive runs have passed [1] after switching
> REL_12_STABLE to -O0 ! So I am wondering whether that confirms that
> the compiler version is to blame, and while we're still here,
> is there anything else I could try?

I don't think it's conclusive proof that it's a compiler bug.  -O0
could equally just have changed the timing sufficiently to not trigger
the issue.

It might be a long shot, but I wonder if it might be worth running a
workload such as:

psql -c "create table a (a int primary key, b text not null, c int not null); insert into a values(1,'',0);" postgres
echo "update a set b = repeat('a', random(1,10)), c=c+1 where a = 1;" > bench.sql
pgbench -n -t 12500 -c 8 -j 8 -f bench.sql postgres
psql -c "table a;" postgres

On a build with the original CFLAGS.  I expect the value of "c" to be
100k after running that. If it's not then something bad has happened.

That would exercise the locking code heavily and would show us if any
updates were missed due to locks not being correctly respected or seen
by other backends.

David



Re: Why is parula failing?

From
David Rowley
Date:
On Thu, 21 Mar 2024 at 13:53, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 21 Mar 2024 at 12:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> > that would yield a clue.
>
> FWIW, I agree with your earlier statement about it looking very much
> like auto-vacuum has run on that table, but equally, if something like
> the pg_index record was damaged we could get the same plan change.
>
> We could also do something like the attached just in case we're
> barking up the wrong tree.

I've not seen any recent failures from Parula that relate to this
issue.  The last one seems to have been about 4 weeks ago.

I'm now wondering if it's time to revert the debugging code added in
1db689715.  Does anyone think differently?

David



Re: Why is parula failing?

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I've not seen any recent failures from Parula that relate to this
> issue.  The last one seems to have been about 4 weeks ago.

> I'm now wondering if it's time to revert the debugging code added in
> 1db689715.  Does anyone think differently?

+1.  It seems like we wrote off the other issue as a compiler bug,
so I don't have much trouble assuming that this one was too.

            regards, tom lane



Re: Why is parula failing?

From
Robins Tharakan
Date:
On Tue, 14 May 2024 at 08:55, David Rowley <dgrowleyml@gmail.com> wrote:
I've not seen any recent failures from Parula that relate to this
issue.  The last one seems to have been about 4 weeks ago.

I'm now wondering if it's time to revert the debugging code added in
1db689715.  Does anyone think differently?

Thanks for keeping an eye on this. Sadly, the older machine was decommissioned,
and thus parula hasn't been sending results to the buildfarm for the past few days.

I'll try to build a similar machine (but newer gcc etc.) and reopen this thread in case I hit
something similar.
-
robins