Обсуждение: orangutan seizes up during isolation-check
Buildfarm member orangutan has failed chronically on both of the branches for which it still reports, HEAD and REL9_1_STABLE, for over two years. The postmaster appears to jam during isolation-check. Dave, orangutan currently has one such jammed postmaster for each branch. Could you gather some information about the running processes? Specifically, it would be helpful to see the output of "ps -el" and a stack trace for each running PostgreSQL process. (If there are enough PostgreSQL processes to make stack traces tedious to acquire, it will be almost as good to have traces for each postmaster and one autovacuum worker per postmaster.) Thanks. Best not to kill the processes yet, in case we need more information. The rest of this message is just a dump my observations from the data already available. The jammed postmasters fail to complete fast shutdown requests. Beyond that, the symptoms are different on HEAD versus 9.1. The 2014-07-09 run is representative for HEAD. multiple-row-versions.spec failed like this after having run for almost 21 hours: --- 1,2 ---- Parsed test spec with 4 sessions ! Connection 2 to database failed: \ No newline at end of file I don't know what would cause PQconnectdb() to hang for 21 hours before failing with a blank error message. Note that the hang duration and the spec in which the hang falls varies from failure to failure. All subsequent specs then fail like this: --- 1,4 ---- Parsed test spec with 2 sessions ! Connection 0 to database failed: could not connect to server: Connection refused ! Is the server running locally and accepting ! connections on Unix domain socket "/tmp/.s.PGSQL.5678"? One can get ECONNREFUSED from a Unix-domain socket when the listen() backlog is full. At this point, we've made only two connection attempts since the last successful one and only about 40 attempts since last postmaster startup. I have no good theories remaining at the moment. The postmaster log ends in 1211 copies of this message: WARNING: worker took too long to start; canceled. At the default autovacuum_naptime=1min, that represents 20:11:00 of autovacuum launch failures. The postmaster had been running about 20:55:42 by the time we collected that log, suggesting that autovacuum was healthy until 40-45 minutes into the doomed PQconnectdb() call. I'm hypothesizing that the postmaster ceased serving autovacuum launcher requests. A jammed postmaster tends to explain both the ECONNREFUSED symptom and the autovacuum symptom. In REL9_1_STABLE, isolation-check completes, but the StopDb-C:2 step that follows isolation-check fails to stop the server. (If you go back far enough in the history, suites other than isolation-check occasionally jam the server.) The server log ends like this: LOG: received fast shutdown request LOG: aborting any active transactions LOG: autovacuum launcher shutting down That suggests a postmaster stuck in PM_WAIT_BACKENDS. The process data should illuminate this situation.
Noah Misch <noah@leadboat.com> writes: > Buildfarm member orangutan has failed chronically on both of the branches for > which it still reports, HEAD and REL9_1_STABLE, for over two years. The > postmaster appears to jam during isolation-check. Dave, orangutan currently > has one such jammed postmaster for each branch. Could you gather some > information about the running processes? What's particularly odd is that orangutan seems to be running an only slightly out-of-date OS X release, which is hardly an unusual configuration. My own laptop gets through isolation-check just fine. Seems like there must be something nonstandard about orangutan's software ... but what? regards, tom lane
On Tue, Sep 02, 2014 at 12:25:39AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > Buildfarm member orangutan has failed chronically on both of the branches for > > which it still reports, HEAD and REL9_1_STABLE, for over two years. The > > postmaster appears to jam during isolation-check. Dave, orangutan currently > > has one such jammed postmaster for each branch. Could you gather some > > information about the running processes? > > What's particularly odd is that orangutan seems to be running an only > slightly out-of-date OS X release, which is hardly an unusual > configuration. My own laptop gets through isolation-check just fine. > Seems like there must be something nonstandard about orangutan's > software ... but what? Agreed. The difference is durable across OS X releases, because orangutan showed like symptoms under 10.7.3. Dave assisted me off-list with data collection and experimentation. Ultimately, --enable-nls was the key distinction, the absence of which spares the other OS X buildfarm animals. The explanation for ECONNREFUSED was more pedestrian than the reasons I had guessed. There were no jammed postmasters running as of the above writing. Rather, the postmasters were gone, but the socket directory entries remained. That happens when the postmaster suffers a "kill -9", a SIGSEGV, an assertion failure, or a similar abrupt exit. When I reproduced the problem, CountChildren() was attempting to walk a corrupt BackendList. Sometimes, the list had an entry such that e->next == e; these send CountChildren() into an infinite loop. Other times, testing "if (bp->dead_end)" prompted a segfault. That explains orangutan sometimes failing quickly and other times hanging for hours. Every crash showed at least two threads running in the postmaster. Multiple threads bring trouble in the form of undefined behavior for fork() w/o exec() and for sigprocmask(). The postmaster uses sigprocmask() to block most signals when doing something nontrivial; this allows it to do such nontrivial work in signal handlers. A sequence of 74 buildfarm runs caught 27 cases of a secondary thread running a signal handler, 14 cases of two signal handlers running at once, and one user-visible postmaster failure. libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific APIs to determine the default locale when $LANG and similar environment variables are empty, as they are during "make check NO_LOCALE=1". On OS X, it calls[1] CFLocaleCopyCurrent(), which in turn spins up a thread. See the end of this message for the postmaster thread stacks active upon hitting a breakpoint set at _dispatch_mgr_thread. I see two options for fixing this in pg_perm_setlocale(LC_x, ""): 1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale name through a pipe, and pass that name tosetlocale() in the original process. The short-lived child will get the extra threads, and the postmaster will remainclean. 2. On OS X, check for relevant environment variables. Finding none, set LC_x=C before calling setlocale(LC_x, ""). A variationis to raise ereport(FATAL) if sufficient environment variables aren't in place. Either way ensures the libintlsetlocale() will never call CFLocaleCopyCurrent(). This is simpler than (1), but it entails a behavior change: "LANG=initdb" will use LANG=C or fail rather than use the OS X user account locale. I'm skeptical of the value of looking up locale information using other OS X facilities when the usual environment variables are inconclusive, but I see no clear cause to reverse that decision now. I lean toward (1). Thanks, nm [1] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/localename.c;h=78dc344bba191417855670fb751210d3608db6e6;hb=HEAD#l2883 thread #1: tid = 0xeccea9, 0x00007fff9066b372 libsystem_notify.dylib`notify_register_check + 30, queue = 'com.apple.main-thread' frame #0: 0x00007fff9066b372 libsystem_notify.dylib`notify_register_check + 30 frame #1: 0x00007fff987cf261libsystem_info.dylib`__si_module_static_ds_block_invoke + 109 frame #2: 0x00007fff944d628d libdispatch.dylib`_dispatch_client_callout+ 8 frame #3: 0x00007fff944d61fc libdispatch.dylib`dispatch_once_f + 79 frame#4: 0x00007fff987cf1f2 libsystem_info.dylib`si_module_static_ds + 42 frame #5: 0x00007fff987cec65 libsystem_info.dylib`si_module_with_name+ 60 frame #6: 0x00007fff987cf0e7 libsystem_info.dylib`si_module_config_modules_for_category+ 168 frame #7: 0x00007fff987cedbd libsystem_info.dylib`__si_module_static_search_block_invoke+ 87 frame #8: 0x00007fff944d628d libdispatch.dylib`_dispatch_client_callout+ 8 frame #9: 0x00007fff944d61fc libdispatch.dylib`dispatch_once_f + 79 frame#10: 0x00007fff987ced64 libsystem_info.dylib`si_module_static_search + 42 frame #11: 0x00007fff987cec65 libsystem_info.dylib`si_module_with_name+ 60 frame #12: 0x00007fff987d0cf2 libsystem_info.dylib`getpwuid + 32 frame #13:0x00007fff8dce629c CoreFoundation`CFCopyHomeDirectoryURLForUser + 124 frame #14: 0x00007fff8dce5a84 CoreFoundation`+[CFPrefsSourcewithSourceForIdentifier:user:byHost:container:perform:] + 372 frame #15: 0x00007fff8dce58fbCoreFoundation`-[CFPrefsSearchListSource addSourceForIdentifier:user:byHost:] + 123 frame #16: 0x00007fff8dcec1bbCoreFoundation`+[CFPrefsSearchListSource withSnapshotSearchList:] + 331 frame #17: 0x00007fff8dcec037CoreFoundation`__CFXPreferencesCopyCurrentApplicationState + 151 frame #18: 0x00007fff8dcebb8c CoreFoundation`_CFLocaleCopyCurrentGuts+ 524 frame #19: 0x00000001006e6b87 libintl.8.dylib`_nl_locale_name_default + 47 frame #20: 0x00000001006e707b libintl.8.dylib`libintl_setlocale + 367 frame #21: 0x00000001002fdb52 postgres`pg_perm_setlocale(category=1,locale=<unavailable>) + 18 at pg_locale.c:153 frame #22: 0x00000001001a6ef7 postgres`main(argc=1,argv=0x0000000100907340) + 103 at main.c:127 thread #3: tid = 0xeccec2, 0x00007fff93349e6a libsystem_kernel.dylib`__workq_kernreturn + 10 frame #0: 0x00007fff93349e6alibsystem_kernel.dylib`__workq_kernreturn + 10 frame #1: 0x00007fff8ea13f08 libsystem_pthread.dylib`_pthread_wqthread+ 330 * thread #2: tid = 0xeccec3, 0x00007fff944d8102 libdispatch.dylib`_dispatch_mgr_thread, queue = 'com.apple.root.libdispatch-manager',stop reason = breakpoint 1.2 * frame #0: 0x00007fff944d8102 libdispatch.dylib`_dispatch_mgr_thread frame #1: 0x00007fff944d7f87 libdispatch.dylib`_dispatch_root_queue_drain + 75 frame #2: 0x00007fff944d7ed2 libdispatch.dylib`_dispatch_worker_thread + 119 frame #3: 0x00007fff8ea12899 libsystem_pthread.dylib`_pthread_body+ 138 frame #4: 0x00007fff8ea1272a libsystem_pthread.dylib`_pthread_start + 137
On 09/15/2014 07:51 AM, Noah Misch wrote: > libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific APIs > to determine the default locale when $LANG and similar environment variables > are empty, as they are during "make check NO_LOCALE=1". On OS X, it calls[1] > CFLocaleCopyCurrent(), which in turn spins up a thread. See the end of this > message for the postmaster thread stacks active upon hitting a breakpoint set > at _dispatch_mgr_thread. Ugh. I'd call that a bug in libintl. setlocale() has no business to make the process multi-threaded. Do we have the same problem in backends? At a quick glance, aside from postmaster we only use PG_SETMASK(&BlockSig) in signal handlers, to prevent another signal handler from running concurrently. > I see two options for fixing this in pg_perm_setlocale(LC_x, ""): > > 1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale > name through a pipe, and pass that name to setlocale() in the original > process. The short-lived child will get the extra threads, and the > postmaster will remain clean. > > 2. On OS X, check for relevant environment variables. Finding none, set > LC_x=C before calling setlocale(LC_x, ""). A variation is to raise > ereport(FATAL) if sufficient environment variables aren't in place. Either > way ensures the libintl setlocale() will never call CFLocaleCopyCurrent(). > This is simpler than (1), but it entails a behavior change: "LANG= initdb" > will use LANG=C or fail rather than use the OS X user account locale. > > I'm skeptical of the value of looking up locale information using other OS X > facilities when the usual environment variables are inconclusive, but I see no > clear cause to reverse that decision now. I lean toward (1). Both of those are horrible hacks. And who's to say that calling setlocale(LC_x, "foo") won't also call some function that makes the process multi-threaded. If not in any current OS X release, it might still happen in a future one. One idea would be to use an extra pthread mutex or similar, in addition to PG_SETMASK(). Whenever you do PG_SETMASK(&BlockSig), also grab the mutex, and release it when you do PG_SETMASK(&UnBlockSig). It would be nice to stop doing non-trivial things in the signal handler in the first place. It's pretty scary, even though it works when the process is single-threaded. I believe the reason it's currently implemented like that are the same problems that the latch code solves with the self-pipe trick: select() is not interrupted by a signal on all platforms, and even if it was, you would need pselect() with is not available (or does not work correctly even if it exists) on all platforms. I think we could use a latch in postmaster too. - Heikki
On Mon, Sep 15, 2014 at 10:11:57AM +0300, Heikki Linnakangas wrote: > On 09/15/2014 07:51 AM, Noah Misch wrote: > >libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific APIs > >to determine the default locale when $LANG and similar environment variables > >are empty, as they are during "make check NO_LOCALE=1". On OS X, it calls[1] > >CFLocaleCopyCurrent(), which in turn spins up a thread. See the end of this > >message for the postmaster thread stacks active upon hitting a breakpoint set > >at _dispatch_mgr_thread. > > Ugh. I'd call that a bug in libintl. setlocale() has no business to > make the process multi-threaded. Fair interpretation. libintl's options for mitigating the problem are even more constrained, unfortunately. > Do we have the same problem in backends? At a quick glance, aside > from postmaster we only use PG_SETMASK(&BlockSig) in signal > handlers, to prevent another signal handler from running > concurrently. In backends, we pass a specific value, not "", to pg_perm_setlocale(). (I expect the problem in EXEC_BACKEND builds, but I doubt anyone uses that as a production configuration on OS X.) Every frontend program does call setlocale(LC_ALL, "") via set_pglocale_pgservice(). None use sigprocmask(), but a few do use fork(). > >1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale > > name through a pipe, and pass that name to setlocale() in the original > > process. The short-lived child will get the extra threads, and the > > postmaster will remain clean. > >I'm skeptical of the value of looking up locale information using other OS X > >facilities when the usual environment variables are inconclusive, but I see no > >clear cause to reverse that decision now. I lean toward (1). > > Both of those are horrible hacks. And who's to say that calling > setlocale(LC_x, "foo") won't also call some function that makes the > process multi-threaded. If not in any current OS X release, it might > still happen in a future one. Right. Not just setlocale(); a large body of APIs could change that way. The CAVEATS section[1] of the OS X fork() manual page suggests Apple had that in mind at some point. To be 100% safe, we would write the postmaster as though it's always multithreaded, including making[2] EXEC_BACKEND the only supported configuration on OS X. Assuming we don't do that anytime soon, I did have in mind to make the postmaster raise an error if it becomes multithreaded. (Apple's pthread_is_threaded_np() is usable for this check.) Then, at least, we'll more-easily hear about new problems of this nature. > One idea would be to use an extra pthread mutex or similar, in > addition to PG_SETMASK(). Whenever you do PG_SETMASK(&BlockSig), > also grab the mutex, and release it when you do > PG_SETMASK(&UnBlockSig). I had considered and tested such a thing, and it sufficed to clear orangutan's present troubles. You still face undefined behavior after fork(). It's okay in practice if libdispatch threads are careful to register pthread_atfork() handlers for any relevant resources. It's a fair option, but running setlocale(LC_x, "") in a short-lived child assumes less about the system library implementation, seems no more of a hack to me, and isolates the overhead at postmaster start. [1] https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fork.2.html [2] http://www.postgresql.org/message-id/1349280184-sup-326@alvh.no-ip.org
On Mon, Sep 15, 2014 at 12:51:14AM -0400, Noah Misch wrote: > 1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale > name through a pipe, and pass that name to setlocale() in the original > process. The short-lived child will get the extra threads, and the > postmaster will remain clean. Here's an implementation thereof covering both backend and frontend use of setlocale(). A setlocale() wrapper, pg_setlocale(), injects use of the child process where necessary. I placed the new function in src/common/exec.c, because every executable already needed that file's code and translatable strings for set_pglocale_pgservice() and callees. (Some executables do miss exec.c translations; I did not update them.) src/port/chklocale.c was closer in subject matter, but libpq imports it; this function does not belong in libpq. Also, this function would benefit from a frontend ereport() implementation, which is likely to land in libpgcommon if it lands at all. The runtime changes are conditional on __darwin__ but not on --enable-nls. NLS builds are the production norm; I'd like non-NLS builds to consistently exercise this code rather than shave its bytes and cycles. pg_setlocale() relies on main() setting all six LC_<category> environment variables. The second attached patch seals the cracks in main()'s longstanding attempt to do so. This improves preexisting user-visible behavior in weird cases: "LANG=pt_BR.utf8 LC_ALL=invalid postgres -D nosuch" will now print untranslated text, not pt_BR text. Documentation for each of lc_monetary, lc_numeric and lc_time says "If this variable is set to the empty string (which is the default) then the value is inherited from the execution environment of the server in a system-dependent way." Not so; by design, setlocale(LC_x, "") just re-selects the last value L passed to pg_perm_setlocale(LC_x, L). I have not touched this; I mention it for the sake of the archives.
Вложения
On 10/10/14 8:24 PM, Noah Misch wrote: > Here's an implementation thereof covering both backend and frontend use of > setlocale(). A setlocale() wrapper, pg_setlocale(), injects use of the child > process where necessary. Would such a change not be better in gnulib itself?
On Fri, Oct 10, 2014 at 09:14:38PM -0400, Peter Eisentraut wrote: > On 10/10/14 8:24 PM, Noah Misch wrote: > > Here's an implementation thereof covering both backend and frontend use of > > setlocale(). A setlocale() wrapper, pg_setlocale(), injects use of the child > > process where necessary. > > Would such a change not be better in gnulib itself? Good question. It would be nice to make the change there, for the benefit of other consumers. The patch's setlocale_native_forked() assumes it never runs in a multithreaded process, but libintl_setlocale() must not assume that. I see a few ways libintl/gnulib might proceed: 1) exec() a helper program to run CFLocaleCopyCurrent(). Distributing that executable alongside libintl and locating itat runtime is daunting. 2) Audit the CFLocaleCopyCurrent() code to see if it will, in practice, run reliably in a fork of a multithreaded process. (That conclusion could change without notice.) Use the setlocale_native_forked() technique. 3) Don't change libintl_setlocale(), but add a libintl_setlocale_nothread() for single-threaded consumers to adopt. Eachconsumer will need to modify code. libintl has a lean API; I expect adding this would be controversial. Among those, I would probably adopt (2). We know much about the process conditions in which pg_setlocale() will run, so placing the workaround in PostgreSQL erases the problem of (2). I'm inclined to stick with placing the workaround in PostgreSQL, but there are fair arguments on both sides.
On 10/11/14 1:41 AM, Noah Misch wrote: > Good question. It would be nice to make the change there, for the benefit of > other consumers. The patch's setlocale_native_forked() assumes it never runs > in a multithreaded process, but libintl_setlocale() must not assume that. I > see a few ways libintl/gnulib might proceed: Yeah, it's difficult to see how they might proceed if they keep calling into Core Foundation, which might do anything, now or in the future. I went ahead and submitted a bug report to gettext: https://savannah.gnu.org/bugs/index.php?43404 (They way I understand it is that the files concerned originate in gettext and are copied to gnulib.) Let's see what they say. I like the idea of calling pthread_is_threaded_np() as a verification. This appears to be a OS X-specific function at the moment. If other platforms start adding it, then we'll run into the usual problems of how to link binaries that use pthread functions. Maybe that's not a realistic concern.
On Sat, Oct 11, 2014 at 09:07:46AM -0400, Peter Eisentraut wrote: > On 10/11/14 1:41 AM, Noah Misch wrote: > > Good question. It would be nice to make the change there, for the benefit of > > other consumers. The patch's setlocale_native_forked() assumes it never runs > > in a multithreaded process, but libintl_setlocale() must not assume that. I > > see a few ways libintl/gnulib might proceed: > > Yeah, it's difficult to see how they might proceed if they keep calling > into Core Foundation, which might do anything, now or in the future. > > I went ahead and submitted a bug report to gettext: > https://savannah.gnu.org/bugs/index.php?43404 > > (They way I understand it is that the files concerned originate in > gettext and are copied to gnulib.) > > Let's see what they say. The gettext maintainer was open to implementing the setlocale_native_forked() technique in gettext, though the last visible progress was in October. In any event, PostgreSQL builds will see older gettext for several years. If setlocale-darwin-fork-v1.patch is not wanted, I suggest making the postmaster check during startup whether it has become multithreaded. If multithreaded: FATAL: postmaster became multithreaded during startup HINT: Set the LC_ALL environment variable to a valid locale. I wondered whether to downgrade FATAL to LOG in back branches. Introducing a new reason to block startup is disruptive for a minor release, but having the postmaster deadlock at an unpredictable later time is even more disruptive. I am inclined to halt startup that way in all branches. > I like the idea of calling pthread_is_threaded_np() as a verification. > This appears to be a OS X-specific function at the moment. If other > platforms start adding it, then we'll run into the usual problems of how > to link binaries that use pthread functions. Maybe that's not a > realistic concern. True. As written, "configure" will report the function unavailable if it requires threading libraries. For a measure that's just a backstop against other bugs, that may be just right. I would like to go ahead and commit setlocale-main-harden-v1.patch, which is a good thing to have regardless of what happens with gettext. Thanks, nm
On 12/28/2014 04:58 PM, Noah Misch wrote: > On Sat, Oct 11, 2014 at 09:07:46AM -0400, Peter Eisentraut wrote: >> On 10/11/14 1:41 AM, Noah Misch wrote: >>> Good question. It would be nice to make the change there, for the benefit of >>> other consumers. The patch's setlocale_native_forked() assumes it never runs >>> in a multithreaded process, but libintl_setlocale() must not assume that. I >>> see a few ways libintl/gnulib might proceed: >> Yeah, it's difficult to see how they might proceed if they keep calling >> into Core Foundation, which might do anything, now or in the future. >> >> I went ahead and submitted a bug report to gettext: >> https://savannah.gnu.org/bugs/index.php?43404 >> >> (They way I understand it is that the files concerned originate in >> gettext and are copied to gnulib.) >> >> Let's see what they say. > The gettext maintainer was open to implementing the setlocale_native_forked() > technique in gettext, though the last visible progress was in October. In any > event, PostgreSQL builds will see older gettext for several years. If > setlocale-darwin-fork-v1.patch is not wanted, I suggest making the postmaster > check during startup whether it has become multithreaded. If multithreaded: > > FATAL: postmaster became multithreaded during startup > HINT: Set the LC_ALL environment variable to a valid locale. > > I wondered whether to downgrade FATAL to LOG in back branches. Introducing a > new reason to block startup is disruptive for a minor release, but having the > postmaster deadlock at an unpredictable later time is even more disruptive. I > am inclined to halt startup that way in all branches. Yeah. It should be easily fixable, AIUI, and startup is surely a good and obvious time to to that. > >> I like the idea of calling pthread_is_threaded_np() as a verification. >> This appears to be a OS X-specific function at the moment. If other >> platforms start adding it, then we'll run into the usual problems of how >> to link binaries that use pthread functions. Maybe that's not a >> realistic concern. > True. As written, "configure" will report the function unavailable if it > requires threading libraries. For a measure that's just a backstop against > other bugs, that may be just right. > > > I would like to go ahead and commit setlocale-main-harden-v1.patch, which is a > good thing to have regardless of what happens with gettext. > I'm OK with this, but on its own it won't fix orangutan's problems, will it? cheers andrew
On Sun, Dec 28, 2014 at 4:58 PM, Noah Misch <noah@leadboat.com> wrote: > I wondered whether to downgrade FATAL to LOG in back branches. Introducing a > new reason to block startup is disruptive for a minor release, but having the > postmaster deadlock at an unpredictable later time is even more disruptive. I > am inclined to halt startup that way in all branches. Jeepers. I'd rather not do that. From your report, this problem has been around for years. Yet, as far as I know, it's bothering very few real users, some of whom might be far more bothered by the postmaster suddenly failing to start. I'm fine with a FATAL in master, but I'd vote against doing anything that might prevent startup in the back-branches without more compelling justification. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Dec 28, 2014 at 07:20:04PM -0500, Andrew Dunstan wrote: > On 12/28/2014 04:58 PM, Noah Misch wrote: > >The gettext maintainer was open to implementing the setlocale_native_forked() > >technique in gettext, though the last visible progress was in October. In any > >event, PostgreSQL builds will see older gettext for several years. If > >setlocale-darwin-fork-v1.patch is not wanted, I suggest making the postmaster > >check during startup whether it has become multithreaded. If multithreaded: > > > > FATAL: postmaster became multithreaded during startup > > HINT: Set the LC_ALL environment variable to a valid locale. > >I would like to go ahead and commit setlocale-main-harden-v1.patch, which is a > >good thing to have regardless of what happens with gettext. > > > > I'm OK with this, but on its own it won't fix orangutan's problems, will it? Right; setlocale-main-harden-v1.patch fixes a bug not affecting orangutan at all. None of the above will make orangutan turn green. Checking multithreading during startup would merely let it fail cleanly.
On Wed, Dec 31, 2014 at 12:32:37AM -0500, Robert Haas wrote: > On Sun, Dec 28, 2014 at 4:58 PM, Noah Misch <noah@leadboat.com> wrote: > > I wondered whether to downgrade FATAL to LOG in back branches. Introducing a > > new reason to block startup is disruptive for a minor release, but having the > > postmaster deadlock at an unpredictable later time is even more disruptive. I > > am inclined to halt startup that way in all branches. > > Jeepers. I'd rather not do that. From your report, this problem has > been around for years. Yet, as far as I know, it's bothering very few > real users, some of whom might be far more bothered by the postmaster > suddenly failing to start. I'm fine with a FATAL in master, but I'd > vote against doing anything that might prevent startup in the > back-branches without more compelling justification. Clusters hosted on OS X fall into these categories: 1) Unaffected configuration. This includes everyone setting a valid messages locale via LANG, LC_ALL or LC_MESSAGES. 2) Affected configuration. Through luck and light use, the cluster would not experience the crashes/hangs. 3) Cluster would experience the crashes/hangs. DBAs in (3) want the FATAL at startup, but those in (2) want a LOG message instead. DBAs in (1) don't care. Since intermittent postmaster hangs are far worse than startup failure, if (2) and (3) have similar population, FATAL is the better bet. If (2) is sufficiently more populous than (3), then the many small pricks from startup failure do add up to hurt more than the occasional postmaster hang. Who knows how that calculation plays out.
So at this point removing the --enable-nls from my config will solve the build problem.
Everyone knows there is an issue so there is no point in continuing to have it fail.
On 31 December 2014 at 13:52, Noah Misch <noah@leadboat.com> wrote:
On Sun, Dec 28, 2014 at 07:20:04PM -0500, Andrew Dunstan wrote:
> On 12/28/2014 04:58 PM, Noah Misch wrote:
> >The gettext maintainer was open to implementing the setlocale_native_forked()
> >technique in gettext, though the last visible progress was in October. In any
> >event, PostgreSQL builds will see older gettext for several years. If
> >setlocale-darwin-fork-v1.patch is not wanted, I suggest making the postmaster
> >check during startup whether it has become multithreaded. If multithreaded:
> >
> > FATAL: postmaster became multithreaded during startup
> > HINT: Set the LC_ALL environment variable to a valid locale.
> >I would like to go ahead and commit setlocale-main-harden-v1.patch, which is a
> >good thing to have regardless of what happens with gettext.
> >
>
> I'm OK with this, but on its own it won't fix orangutan's problems, will it?
Right; setlocale-main-harden-v1.patch fixes a bug not affecting orangutan at
all. None of the above will make orangutan turn green. Checking
multithreading during startup would merely let it fail cleanly.
On Wed, Dec 31, 2014 at 01:55:23PM -0500, Dave Cramer wrote: > So at this point removing the --enable-nls from my config will solve the > build problem. > > Everyone knows there is an issue so there is no point in continuing to have > it fail. We hope all packagers will build with --enable-nls, so OS X buildfarm coverage thereof will be valuable. Let me see what I can pull together over the next several weeks toward getting it green.
On Wed, Dec 31, 2014 at 01:56:08PM -0500, Noah Misch wrote: > On Wed, Dec 31, 2014 at 12:32:37AM -0500, Robert Haas wrote: > > On Sun, Dec 28, 2014 at 4:58 PM, Noah Misch <noah@leadboat.com> wrote: > > > I wondered whether to downgrade FATAL to LOG in back branches. Introducing a > > > new reason to block startup is disruptive for a minor release, but having the > > > postmaster deadlock at an unpredictable later time is even more disruptive. I > > > am inclined to halt startup that way in all branches. > > > > Jeepers. I'd rather not do that. From your report, this problem has > > been around for years. Yet, as far as I know, it's bothering very few > > real users, some of whom might be far more bothered by the postmaster > > suddenly failing to start. I'm fine with a FATAL in master, but I'd > > vote against doing anything that might prevent startup in the > > back-branches without more compelling justification. > > Clusters hosted on OS X fall into these categories: > > 1) Unaffected configuration. This includes everyone setting a valid messages > locale via LANG, LC_ALL or LC_MESSAGES. > 2) Affected configuration. Through luck and light use, the cluster would not > experience the crashes/hangs. > 3) Cluster would experience the crashes/hangs. > > DBAs in (3) want the FATAL at startup, but those in (2) want a LOG message > instead. DBAs in (1) don't care. Since intermittent postmaster hangs are far > worse than startup failure, if (2) and (3) have similar population, FATAL is > the better bet. If (2) is sufficiently more populous than (3), then the many > small pricks from startup failure do add up to hurt more than the occasional > postmaster hang. Who knows how that calculation plays out. The first attached patch, for all branches, adds LOG-level messages and an assertion. So cassert builds will fail hard, while others won't. The second patch, for master only, changes the startup-time message to FATAL. If we decide to use FATAL in all branches, I would just squash them into one.
Вложения
On Fri, Jan 2, 2015 at 1:04 PM, Noah Misch <noah@leadboat.com> wrote: > The first attached patch, for all branches, adds LOG-level messages and an > assertion. So cassert builds will fail hard, while others won't. The second > patch, for master only, changes the startup-time message to FATAL. If we > decide to use FATAL in all branches, I would just squash them into one. + errdetail("Please report this to <pgsql-bugs@postgresql.org>."))); Er, is mentioning a mailing list in an error message really necessary? -- Michael
On Mon, Jan 05, 2015 at 02:25:09PM +0900, Michael Paquier wrote: > On Fri, Jan 2, 2015 at 1:04 PM, Noah Misch <noah@leadboat.com> wrote: > > The first attached patch, for all branches, adds LOG-level messages and an > > assertion. So cassert builds will fail hard, while others won't. The second > > patch, for master only, changes the startup-time message to FATAL. If we > > decide to use FATAL in all branches, I would just squash them into one. > > + errdetail("Please report this to <pgsql-bugs@postgresql.org>."))); > Er, is mentioning a mailing list in an error message really necessary? Necessary? No.
On Wed, Dec 31, 2014 at 01:52:49PM -0500, Noah Misch wrote: > On Sun, Dec 28, 2014 at 07:20:04PM -0500, Andrew Dunstan wrote: > > On 12/28/2014 04:58 PM, Noah Misch wrote: > > >The gettext maintainer was open to implementing the setlocale_native_forked() > > >technique in gettext, though the last visible progress was in October. In any > > >event, PostgreSQL builds will see older gettext for several years. If > > >setlocale-darwin-fork-v1.patch is not wanted, I suggest making the postmaster > > >check during startup whether it has become multithreaded. If multithreaded: > > > > > > FATAL: postmaster became multithreaded during startup > > > HINT: Set the LC_ALL environment variable to a valid locale. > > > >I would like to go ahead and commit setlocale-main-harden-v1.patch, which is a > > >good thing to have regardless of what happens with gettext. > > > > > > > I'm OK with this, but on its own it won't fix orangutan's problems, will it? > > Right; setlocale-main-harden-v1.patch fixes a bug not affecting orangutan at > all. None of the above will make orangutan turn green. Checking > multithreading during startup would merely let it fail cleanly. OS X --enable-nls buildfarm members should run tests under LANG=C instead of with locale environment variables unset (make check NO_LOCALE=1). I see two ways to arrange that: (1) add a build-farm.conf option, or (2) have pg_regress.c:initialize_environment() treat OS X like Windows. I mildly favor (2); see attached, untested patch. Windows and OS X --enable-nls share the characteristic that setlocale(LC_x, "") consults sources other than environment variables. (I do wonder why commit 4a6fd46 used LANG=en instead of LANG=C.) On the other hand, LANG=en has been inessential on Windows ever since "pg_regress --no-locale" started to use "initdb --no-locale". While I prefer to see the LANG= hack go away rather than proliferate, I can't cite a practical reason to care. Thanks, nm
Вложения
On 1/1/15 11:04 PM, Noah Misch wrote: >> Clusters hosted on OS X fall into these categories: >> >> 1) Unaffected configuration. This includes everyone setting a valid messages >> locale via LANG, LC_ALL or LC_MESSAGES. >> 2) Affected configuration. Through luck and light use, the cluster would not >> experience the crashes/hangs. >> 3) Cluster would experience the crashes/hangs. >> >> DBAs in (3) want the FATAL at startup, but those in (2) want a LOG message >> instead. DBAs in (1) don't care. Since intermittent postmaster hangs are far >> worse than startup failure, if (2) and (3) have similar population, FATAL is >> the better bet. If (2) is sufficiently more populous than (3), then the many >> small pricks from startup failure do add up to hurt more than the occasional >> postmaster hang. Who knows how that calculation plays out. > > The first attached patch, for all branches, adds LOG-level messages and an > assertion. So cassert builds will fail hard, while others won't. The second > patch, for master only, changes the startup-time message to FATAL. If we > decide to use FATAL in all branches, I would just squash them into one. What I'm seeing now is that the unaccent regression tests when run under make check-world abort with FATAL: postmaster became multithreaded during startup HINT: Set the LC_ALL environment variable to a valid locale. My locale settings for this purpose are LANG="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_ALL= The environment variable LANG is set, all the other ones are not set. I could presumably set LC_ALL, but then I lose the ability to set different locales for different categories. Running LC_ALL=$LANG make -C contrib/unaccent check doesn't fix it; I get the same error. The behavior is also a bit strange. The step ============== starting postmaster ============== hangs for one minute before failing. This probably has nothing to do with your change, but maybe pg_regress could detect postmaster startup failures better.
On Wed, Jan 14, 2015 at 04:48:53PM -0500, Peter Eisentraut wrote: > What I'm seeing now is that the unaccent regression tests when run under > make check-world abort with > > FATAL: postmaster became multithreaded during startup > HINT: Set the LC_ALL environment variable to a valid locale. contrib/unaccent/Makefile sets NO_LOCALE=1, so that makes sense. I expect the patch over here will fix it: http://www.postgresql.org/message-id/20150109063015.GA2491320@tornado.leadboat.com > The behavior is also a bit strange. The step > > ============== starting postmaster ============== > > hangs for one minute before failing. This probably has nothing to do > with your change, but maybe pg_regress could detect postmaster startup > failures better. Yeah, I think any postmaster startup failure has that effect. Not ideal.
On Thu, Jan 15, 2015 at 1:04 AM, Noah Misch <noah@leadboat.com> wrote: > On Wed, Jan 14, 2015 at 04:48:53PM -0500, Peter Eisentraut wrote: >> What I'm seeing now is that the unaccent regression tests when run under >> make check-world abort with >> >> FATAL: postmaster became multithreaded during startup >> HINT: Set the LC_ALL environment variable to a valid locale. > > contrib/unaccent/Makefile sets NO_LOCALE=1, so that makes sense. I expect the > patch over here will fix it: > http://www.postgresql.org/message-id/20150109063015.GA2491320@tornado.leadboat.com I just hit this same problem; are you going to commit that patch soon?It's rather annoying to have make check-world fail. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 15, 2015 at 09:24:01AM -0500, Robert Haas wrote: > On Thu, Jan 15, 2015 at 1:04 AM, Noah Misch <noah@leadboat.com> wrote: > > On Wed, Jan 14, 2015 at 04:48:53PM -0500, Peter Eisentraut wrote: > >> What I'm seeing now is that the unaccent regression tests when run under > >> make check-world abort with > >> > >> FATAL: postmaster became multithreaded during startup > >> HINT: Set the LC_ALL environment variable to a valid locale. > > > > contrib/unaccent/Makefile sets NO_LOCALE=1, so that makes sense. I expect the > > patch over here will fix it: > > http://www.postgresql.org/message-id/20150109063015.GA2491320@tornado.leadboat.com > > I just hit this same problem; are you going to commit that patch soon? > It's rather annoying to have make check-world fail. Sure, done. Dave, orangutan should now be able to pass with --enable-nls. Would you restore that option?
On Fri, Jan 16, 2015 at 05:43:44AM -0500, Dave Cramer wrote: > On 16 January 2015 at 01:33, Noah Misch <noah@leadboat.com> wrote: > > Sure, done. Dave, orangutan should now be able to pass with --enable-nls. > > Would you restore that option? > > I can, but is this for HEAD or all versions ? All versions.
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On 16 January 2015 at 01:33, Noah Misch <span dir="ltr"><<ahref="mailto:noah@leadboat.com" target="_blank">noah@leadboat.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnThu, Jan 15, 2015 at 09:24:01AM -0500, Robert Haas wrote:<br /> > On Thu, Jan 15, 2015 at 1:04 AM, Noah Misch<<a href="mailto:noah@leadboat.com">noah@leadboat.com</a>> wrote:<br /> > > On Wed, Jan 14, 2015 at 04:48:53PM-0500, Peter Eisentraut wrote:<br /> > >> What I'm seeing now is that the unaccent regression tests whenrun under<br /> > >> make check-world abort with<br /> > >><br /> > >> FATAL: postmasterbecame multithreaded during startup<br /> > >> HINT: Set the LC_ALL environment variable to a valid locale.<br/> > ><br /> > > contrib/unaccent/Makefile sets NO_LOCALE=1, so that makes sense. I expect the<br/> > > patch over here will fix it:<br /> > > <a href="http://www.postgresql.org/message-id/20150109063015.GA2491320@tornado.leadboat.com" target="_blank">http://www.postgresql.org/message-id/20150109063015.GA2491320@tornado.leadboat.com</a><br/> ><br /> >I just hit this same problem; are you going to commit that patch soon?<br /> > It's rather annoying to have makecheck-world fail.<br /><br /></span>Sure, done. Dave, orangutan should now be able to pass with --enable-nls.<br />Would you restore that option?<br /></blockquote></div><br /></div><div class="gmail_extra">I can, but is this for HEADor all versions ?</div></div>