Re: stress test for parallel workers

From: Tom Lane
Subject: Re: stress test for parallel workers
Msg-id: 14878.1570820201@sss.pgh.pa.us
In reply to: Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
           Re: stress test for parallel workers  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

I wrote:
> What we've apparently got here is that signals were received
> so fast that the postmaster ran out of stack space.  I remember
> Andres complaining about this as a theoretical threat, but I
> hadn't seen it in the wild before.

> I haven't finished investigating though, as there are some things
> that remain to be explained.

I still don't have a good explanation for why this only seems to
happen in the pg_upgrade test sequence.  However, I did notice
something very interesting: the postmaster crashes after consuming
only about 1MB of stack space.  This is despite the prevailing
setting of "ulimit -s" being 8192 (8MB).  I also confirmed that
the value of max_stack_depth within the crashed process is 2048,
which implies that get_stack_depth_rlimit got some value larger
than 2MB from getrlimit(RLIMIT_STACK).  And yet, here we have
a crash, and the process memory map confirms that only 1MB was
allocated in the stack region.  So it's really hard to explain
that as anything except a kernel bug: sometimes, the kernel
doesn't give us as much stack as it promised it would.  And the
machine is not loaded enough for there to be any rational
resource-exhaustion excuse for that.
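
For readers who want to check what the kernel claims to allow, here is a minimal sketch of querying the soft stack limit with getrlimit(RLIMIT_STACK). The helper name stack_soft_limit_bytes is made up for illustration; it is not PostgreSQL's get_stack_depth_rlimit, only the same underlying call. Note that it reports the promised limit, not how much stack is actually mapped, which is exactly the gap described above.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper (not PostgreSQL code): report the soft RLIMIT_STACK
 * limit in bytes.  Returns -1 if the limit is unlimited and -2 if
 * getrlimit() itself fails.
 */
long long
stack_soft_limit_bytes(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -2;              /* could not query the limit */
    if (rl.rlim_cur == RLIM_INFINITY)
        return -1;              /* "ulimit -s unlimited" */
    return (long long) rl.rlim_cur;
}
```

With "ulimit -s 8192" in effect, one would expect this to return 8388608; comparing that figure against the size of the stack region in the process memory map is the check described in the paragraph above.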

This matches up with the intermittent infinite_recurse failures
we've been seeing in the buildfarm.  Those are happening across
a range of systems, but they're (almost) all Linux-based ppc64,
suggesting that there's a longstanding arch-specific kernel bug
involved.  For reference, I scraped the attached list of such
failures in the last three months.  I wonder whether we can get
the attention of any kernel hackers about that.

Anyway, as to what to do about it --- it occurred to me to wonder
why we are relying on having the signal handlers block and unblock
signals manually, when we could tell sigaction() that we'd like
signals blocked.  It is reasonable to expect that the signal support
is designed to not recursively consume stack space in the face of
a series of signals, while the way we are doing it clearly opens
us up to recursive space consumption.  The stack trace I showed
before proves that the recursion happens at the points where the
signal handlers unblock signals.

As a quick hack I made the attached patch, and it seems to fix the
problem on wobbegong's host.  I don't see crashes any more, and
watching the postmaster's stack space consumption, it stays
comfortably at a tad under 200KB (probably the default initial
allocation), while without the patch it tends to blow up to 700KB
or more even in runs that don't crash.
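
One possible way to watch a process's stack size on Linux is to read the VmStk line of /proc/<pid>/status. This is an assumption for illustration; the message does not say which tool was actually used. The sketch below parses that line; both function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan /proc/<pid>/status-style text for the "VmStk:"
 * line and return its value in kB, or -1 if no such line is present.
 */
long
parse_vmstk_kb(const char *status_text)
{
    const char *p = status_text;

    while (p != NULL && *p != '\0')
    {
        long        kb;

        /* Matches only when the current line starts with "VmStk:". */
        if (sscanf(p, "VmStk: %ld kB", &kb) == 1)
            return kb;

        /* Advance to the start of the next line, if any. */
        p = strchr(p, '\n');
        if (p != NULL)
            p++;
    }
    return -1;
}

/*
 * Read a status file (e.g. "/proc/self/status", Linux only) and return the
 * stack size in kB, or -1 on any failure.
 */
long
read_vmstk_kb(const char *path)
{
    char        buf[8192];
    size_t      n;
    FILE       *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_vmstk_kb(buf);
}
```

Polling read_vmstk_kb() with the postmaster's status path during a test run would give figures comparable to the numbers quoted above, assuming VmStk reflects the same stack region seen in the memory map.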

This patch isn't committable as-is because it will (I suppose)
break things on Windows; we still need the old way there for lack
of sigaction().  But that could be fixed with a few #ifdefs.
I'm also kind of tempted to move pqsignal_no_restart into
backend/libpq/pqsignal.c (where BlockSig is defined) and maybe
rename it, but I'm not sure to what.

This issue might go away if we switched to a postmaster implementation
that doesn't do work in the signal handlers, but I'm not entirely
convinced of that.  The existing handlers don't seem to consume a lot
of stack space in themselves (there aren't many local variables in them).
The bulk of the stack consumption is seemingly in the platform's signal
infrastructure, so that we might still have a stack consumption issue
even with fairly trivial handlers, if we don't tell sigaction to block
signals.  In any case, this fix seems potentially back-patchable,
while we surely wouldn't risk back-patching a postmaster rewrite.

Comments?

            regards, tom lane

   sysname    |   architecture   |  operating_system  |    sys_owner    |    branch     |      snapshot       |      stage      | err
--------------+------------------+--------------------+-----------------+---------------+---------------------+-----------------+---------------------------------------------------------------------------------------------------------------
 cavefish     | ppc64le (POWER9) | Ubuntu             | Mark Wong       | HEAD          | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was running: select infinite_recurse();
 pintail      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | REL_12_STABLE | 2019-07-13 19:36:51 | Check           | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL:  Failed process was running: select infinite_recurse();
 bonito       | ppc64le (POWER9) | Fedora             | Mark Wong       | HEAD          | 2019-07-19 23:13:01 | Check           | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL:  Failed process was running: select infinite_recurse();
 takin        | ppc64le          | opensuse           | Mark Wong       | HEAD          | 2019-07-24 08:24:56 | Check           | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL:  Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | HEAD          | 2019-07-31 02:00:07 | pg_upgradeCheck | 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL:  Failed process was running: select infinite_recurse();
 elasmobranch | ppc64le (POWER9) | openSUSE Leap      | Mark Wong       | HEAD          | 2019-08-01 03:13:38 | Check           | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL:  Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-08-02 00:10:23 | Check           | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL:  Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-08-02 05:43:46 | Check           | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL:  Failed process was running: select infinite_recurse();
 batfish      | ppc64le          | Ubuntu             | Mark Wong       | HEAD          | 2019-08-04 19:02:36 | pg_upgradeCheck | 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL:  Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck | 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL:  Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck | 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL:  Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | HEAD          | 2019-08-08 14:00:08 | Check           | 2019-08-08 15:02:59.770 BST [5d4c2b88.cad9:5] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL:  Failed process was running: select infinite_recurse();
 takin        | ppc64le          | opensuse           | Mark Wong       | HEAD          | 2019-08-11 08:02:48 | Check           | 2019-08-11 08:05:57.789 UTC [11500:67] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | HEAD          | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL:  Failed process was running: select infinite_recurse();
 demoiselle   | ppc64le (POWER9) | openSUSE Leap      | Mark Wong       | HEAD          | 2019-08-11 14:51:38 | pg_upgradeCheck | 2019-08-11 14:57:29.422 UTC [9436:70] DETAIL:  Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-08-15 00:09:57 | Check           | 2019-08-15 00:17:43.282 UTC [2831:68] DETAIL:  Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-08-19 06:28:34 | Check           | 2019-08-19 06:39:25.749 UTC [26357:66] DETAIL:  Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-08-21 06:34:47 | Check           | 2019-08-21 06:37:39.089 UTC [14505:73] DETAIL:  Failed process was running: select infinite_recurse();
 demoiselle   | ppc64le (POWER9) | openSUSE Leap      | Mark Wong       | REL_12_STABLE | 2019-09-04 14:42:08 | pg_upgradeCheck | 2019-09-04 14:56:15.219 UTC [11008:5] DETAIL:  Failed process was running: select infinite_recurse();
 pintail      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | REL_12_STABLE | 2019-09-07 19:22:48 | pg_upgradeCheck | 2019-09-07 19:27:20.789 UTC [25645:5] DETAIL:  Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | REL_12_STABLE | 2019-09-10 02:00:15 | Check           | 2019-09-10 03:03:17.711 BST [5d77045a.5776:5] DETAIL:  Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-09-17 23:12:33 | Check           | 2019-09-17 23:19:45.769 UTC [20920:77] DETAIL:  Failed process was running: select infinite_recurse();
 shoveler     | ppc64le (POWER8) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-18 13:52:55 | Check           | 2019-09-18 13:56:11.273 UTC [563:71] DETAIL:  Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-09-19 00:01:54 | Check           | 2019-09-19 00:09:30.734 UTC [11775:67] DETAIL:  Failed process was running: select infinite_recurse();
 gadwall      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-21 12:26:50 | Check           | 2019-09-21 12:31:16.199 UTC [7119:70] DETAIL:  Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | HEAD          | 2019-09-24 14:00:11 | pg_upgradeCheck | 2019-09-24 15:04:49.272 BST [5d8a2276.cba9:5] DETAIL:  Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-09-25 06:24:24 | Check           | 2019-09-25 06:31:54.876 UTC [26608:76] DETAIL:  Failed process was running: select infinite_recurse();
 pintail      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-26 19:33:59 | Check           | 2019-09-26 19:39:25.850 UTC [6259:69] DETAIL:  Failed process was running: select infinite_recurse();
 shoveler     | ppc64le (POWER8) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-28 13:54:16 | Check           | 2019-09-28 13:59:02.354 UTC [7586:71] DETAIL:  Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | REL_12_STABLE | 2019-09-28 23:14:23 | pg_upgradeCheck | 2019-09-28 23:22:13.987 UTC [20133:5] DETAIL:  Failed process was running: select infinite_recurse();
 gadwall      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-10-02 12:44:49 | Check           | 2019-10-02 12:50:17.823 UTC [10840:76] DETAIL:  Failed process was running: select infinite_recurse();
 cavefish     | ppc64le (POWER9) | Ubuntu             | Mark Wong       | HEAD          | 2019-10-04 04:37:58 | Check           | 2019-10-04 04:46:03.804 UTC [27768:69] DETAIL:  Failed process was running: select infinite_recurse();
 cavefish     | ppc64le (POWER9) | Ubuntu             | Mark Wong       | HEAD          | 2019-10-07 03:22:37 | pg_upgradeCheck | 2019-10-07 03:28:05.031 UTC [2991:68] DETAIL:  Failed process was running: select infinite_recurse();
 bufflehead   | ppc64le (POWER8) | openSUSE Leap      | Mark Wong       | HEAD          | 2019-10-09 20:46:56 | pg_upgradeCheck | 2019-10-09 20:51:47.408 UTC [18136:86] DETAIL:  Failed process was running: select infinite_recurse();
 vulpes       | ppc64le          | fedora             | Mark Wong       | HEAD          | 2019-10-11 08:53:50 | Check           | 2019-10-11 08:57:59.370 UTC [14908:77] DETAIL:  Failed process was running: select infinite_recurse();
 shoveler     | ppc64le (POWER8) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-10-11 13:54:38 | pg_upgradeCheck | 2019-10-11 14:01:53.903 UTC [5911:76] DETAIL:  Failed process was running: select infinite_recurse();
(38 rows)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 85f15a5..fff83b7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2640,8 +2640,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 {
     int            save_errno = errno;

-    PG_SETMASK(&BlockSig);
-
     if (Shutdown <= SmartShutdown)
     {
         ereport(LOG,
@@ -2700,8 +2698,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 #endif
     }

-    PG_SETMASK(&UnBlockSig);
-
     errno = save_errno;
 }

@@ -2714,8 +2710,6 @@ pmdie(SIGNAL_ARGS)
 {
     int            save_errno = errno;

-    PG_SETMASK(&BlockSig);
-
     ereport(DEBUG2,
             (errmsg_internal("postmaster received signal %d",
                              postgres_signal_arg)));
@@ -2880,8 +2874,6 @@ pmdie(SIGNAL_ARGS)
             break;
     }

-    PG_SETMASK(&UnBlockSig);
-
     errno = save_errno;
 }

@@ -2895,8 +2887,6 @@ reaper(SIGNAL_ARGS)
     int            pid;            /* process id of dead child process */
     int            exitstatus;        /* its exit status */

-    PG_SETMASK(&BlockSig);
-
     ereport(DEBUG4,
             (errmsg_internal("reaping dead processes")));

@@ -3212,8 +3202,6 @@ reaper(SIGNAL_ARGS)
     PostmasterStateMachine();

     /* Done with signal handler */
-    PG_SETMASK(&UnBlockSig);
-
     errno = save_errno;
 }

@@ -5114,8 +5102,6 @@ sigusr1_handler(SIGNAL_ARGS)
 {
     int            save_errno = errno;

-    PG_SETMASK(&BlockSig);
-
     /* Process background worker state change. */
     if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
     {
@@ -5272,8 +5258,6 @@ sigusr1_handler(SIGNAL_ARGS)
         signal_child(StartupPID, SIGUSR2);
     }

-    PG_SETMASK(&UnBlockSig);
-
     errno = save_errno;
 }

diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index ecb9ca2..93a039b 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -65,7 +65,11 @@ pqsignal(int signo, pqsigfunc func)
  *
  * On Windows, this would be identical to pqsignal(), so don't bother.
  */
-#ifndef WIN32
+#ifndef FRONTEND
+
+extern sigset_t UnBlockSig,
+            BlockSig,
+            StartupBlockSig;

 pqsigfunc
 pqsignal_no_restart(int signo, pqsigfunc func)
@@ -74,7 +78,7 @@ pqsignal_no_restart(int signo, pqsigfunc func)
                 oact;

     act.sa_handler = func;
-    sigemptyset(&act.sa_mask);
+    act.sa_mask = BlockSig;
     act.sa_flags = 0;
 #ifdef SA_NOCLDSTOP
     if (signo == SIGCHLD)
