Re: New instability in stats regression test

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: New instability in stats regression test
Дата
Msg-id 3510625.1700940880@sss.pgh.pa.us
обсуждение исходный текст
Ответ на New instability in stats regression test  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: New instability in stats regression test  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
I wrote:
> I'm a bit mystified by this.  This test was introduced in Andres'
> commit 10a082bf7 of 2023-02-11, and it seems to have been stable
> since then.  I trawled the buildfarm logs going back three months
> and found no similar failures.  So why's it failing now?  The
> most plausible theory seems to be that Michael's recent commits
> adding pg_stat_reset_xxx features destabilized the test somehow ...
> but I sure don't see how/why.

After a bit more looking around, I have part of a theory.
Commit 23c8c0c8f of 2023-11-12 added this, a little ways before
the problematic test:

-- Test that reset_shared with no argument resets all the stats types
-- supported (providing NULL as argument has the same effect).
SELECT pg_stat_reset_shared();

The test that is failing is of course

-- Test IO stats reset
SELECT pg_stat_have_stats('io', 0, 0);
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) +
sum(hits)AS io_stats_pre_reset 
  FROM pg_stat_io \gset
SELECT pg_stat_reset_shared('io');
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) +
sum(hits)AS io_stats_post_reset 
  FROM pg_stat_io \gset
SELECT :io_stats_post_reset < :io_stats_pre_reset;

So the observed failure could be explained if, between the
"pg_stat_reset_shared('io')" call and the subsequent scan of
pg_stat_io, concurrent sessions had done more I/O operations
than happened since that new pg_stat_reset_shared() call.
Previously, the "pre_reset" counts would be large enough to
make that a pretty ridiculous theory, but after 23c8c0c8f maybe
it's not.

To test this idea, I made the test print out the actual values
of the counts, like this:

@@ -1585,10 +1585,10 @@

 SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) +
sum(hits)AS io_stats_post_reset 
   FROM pg_stat_io \gset
-SELECT :io_stats_post_reset < :io_stats_pre_reset;
- ?column?
-----------
- t
+SELECT :io_stats_post_reset, :io_stats_pre_reset;
+ ?column? | ?column?
+----------+----------
+    10452 |   190087
 (1 row)

Of course, this makes it fail every time, but the idea is to get
a sense of the magnitude of the counts; and what I'm seeing is
that the "pre reset" counts are typically 10x more than the
"post reset" ones, even after 23c8c0c8f.  If I remove the
suspicious pg_stat_reset_shared() call, there's about 3 orders
of magnitude difference; but still you'd think a 10x safety
margin would be enough.  So this theory doesn't seem to quite
work as-is.  Perhaps there's some additional contributing factor
I didn't think to control.

Nonetheless, it seems like a really bad idea that this test
of I/O stats reset happens after the newly-added test.  It
is clearly now dependent on timing and the amount of concurrent
activity whether it will pass or not.  We should probably
re-order the tests to do the old test first; or else abandon
this test methodology and just test I/O reset the same way
we test the other cases (checking only for timestamp advance).
Or maybe we don't really need the pg_stat_reset_shared() test?

            regards, tom lane



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: New instability in stats regression test
Следующее
От: Alexander Lakhin
Дата:
Сообщение: Test 002_pg_upgrade fails with olddump on Windows