Обсуждение: Plpython crashing the backend in one easy step
I'm going to look into this, but thought I'd share the misery: CREATE FUNCTION crash() RETURNS varchar AS ' plpy.execute("syntax error") ' language 'plpython'; SELECT crash(); -Brad
I need some expert guidance here. Suppose you have: CREATE FUNCTION crash() RETURNS varchar AS ' plpy.execute("syntax error") ' language 'plpython'; Here are three possible behaviors: (A) a123=# select crash(); ERROR: parser: parse error at or near "syntax" ERROR: plpython: Call of function `__plpython_procedure_crash_41133' failed. plpy.SPIError: Unknown error in PLy_spi_execute_query. FATAL 2: elog: error during error recovery, giving up! server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing therequest. The connection to the server was lost. Attempting reset: Failed. !# (B) a123=# select crash(); ERROR: parser: parse error at or near "syntax" a123=# (C) a123=# select crash(); ERROR: parser: parse error at or near "syntax" ERROR: plpython: Call of function `__plpython_procedure_crash_41133' failed. plpy.SPIError: Unknown error in PLy_spi_execute_query. a123=# Option (A) is the current code. Fixing this happens near line 2290 (could be off a bit, I have some other patches in this file), in the first if clause in PLy_spi_execute_query. The DECLARE_EXC, SAVE_EXC, TRAP_EXC, RESTORE_EXC, RERAISE_EXC macros are wrappers around sigsetjmp and longjmp, and are used to intercept the elog calls occurring when plpython calls spi. In Option (A), we return NULL, which causes the next level of code to call elog. Elog notices that we're already in an Error state, and shuts down the backend. In Option (B), I replace the 'return NULL' with a RERAISE_EXC, which allows elog to function normally, albeit without any cleanup of the plpython environment or useful messages. In Option (C), I set the global "InError" flag to false, and then return NULL, causing all of the error messages to come out and plpython to clean up gracefully, no backend crash. However, this seems to be an unprecedented approach, and I could be missing something big. There's probably an Option (D) that I'm overlooking. HELP! (thanks) -Brad
Bradley McLean <brad@bradm.net> writes: > In Option (C), I set the global "InError" flag to false, and then > return NULL, causing all of the error messages to come out and > plpython to clean up gracefully, no backend crash. However, this > seems to be an unprecedented approach, and I could be missing > something big. Yes, as in "it's totally unsafe". Suppressing an elog(ERROR) is a *big* no-no at present, because way too much stuff relies on post-abort cleanup to clean up whatever problem is being reported. You cannot allow the transaction to continue after the error, and you most certainly mustn't cavalierly reset the error handling state. The only things you should be doing with longjmp trapping are (a) doing any cleanup that Python itself has to have before you re-propagate the longjmp, or (b) issuing elog(NOTICE) to help identify the error location before you re-propagate the longjmp. plpgsql contains an example of doing (b). Not propagating the longjmp, which is what the code seems to be doing at present, is not acceptable. I had not looked at this code closely, but as-is it is a huge reliability hazard. I will insist on removing plpython from 7.2 entirely if this can't be fixed before release. regards, tom lane
Thanks, I assumed something like that. I can very quickly provide at implementation that meets these criteria (It's my option "b"), but it may be less informative in terms of messages. I'll have a look to see what I can do to supplement the messages. I also need to check if there's a memory leak after I fix it. I was aware of the example in plpgsql. -Brad * Tom Lane (tgl@sss.pgh.pa.us) [011113 17:08]: > > Yes, as in "it's totally unsafe". Suppressing an elog(ERROR) is > a *big* no-no at present, because way too much stuff relies on > post-abort cleanup to clean up whatever problem is being reported. > You cannot allow the transaction to continue after the error, and > you most certainly mustn't cavalierly reset the error handling state. > > The only things you should be doing with longjmp trapping are > (a) doing any cleanup that Python itself has to have before you > re-propagate the longjmp, or > (b) issuing elog(NOTICE) to help identify the error location > before you re-propagate the longjmp. > > plpgsql contains an example of doing (b). > > Not propagating the longjmp, which is what the code seems to be doing > at present, is not acceptable. I had not looked at this code closely, > but as-is it is a huge reliability hazard. I will insist on removing > plpython from 7.2 entirely if this can't be fixed before release. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Okay, the attached patch contains the following: - Kevin Jacob's patch to make plpython worthy of a 'trusted' language by eliminating the ability to arbitrarily read OS files. He also did a bunch of C declaration cleanups. (His patch has yet to appear on either hackers or patches, and I believe is also a prerequisite for plpython in 7.2 final). - Changed every place that catches elog() longjmps to re-propogate. - Added code to keep track of the current plpython function and add it to the elog messages - #ifdef 0 around some items that appear to be not in use. - changes to the expected output of the regression tests, because: Tom's requirement that we always reraise the longjmp is at odds with the original design of plpython, which attempted to allow a plpython function to use Python's exception handling to recover when an embedded SPI call failed. There is no way to do this that I can see without halting the propogation of the longjmp. Thus, the semantics of 'execute' calls from plpython have been changed: should the backend throw an error, the plpython function will be summarily aborted. Previously, it could attempt to catch a python exception. Note that this does create some redundant exception handling paths in upper level methods within this module that are no longer used. I have not yet removed them, but will happily do so (in a careful deliberate manner) once sure that there is no alternative way to restore the python level exception handling. Let me know if this isn't what's needed to get this up to snuff for 7.2. -Brad * Tom Lane (tgl@sss.pgh.pa.us) [011113 16:49]: > > Yes, as in "it's totally unsafe". Suppressing an elog(ERROR) is > a *big* no-no at present, because way too much stuff relies on > post-abort cleanup to clean up whatever problem is being reported. > You cannot allow the transaction to continue after the error, and > you most certainly mustn't cavalierly reset the error handling state. > > The only things you should be doing with longjmp trapping are > (a) doing any cleanup that Python itself has to have before you > re-propagate the longjmp, or > (b) issuing elog(NOTICE) to help identify the error location > before you re-propagate the longjmp. > > plpgsql contains an example of doing (b). > > Not propagating the longjmp, which is what the code seems to be doing > at present, is not acceptable. I had not looked at this code closely, > but as-is it is a huge reliability hazard. I will insist on removing > plpython from 7.2 entirely if this can't be fixed before release. > > regards, tom lane
Вложения
Bradley McLean <brad@bradm.net> writes: > Okay, the attached patch contains the following: I have checked over and applied this patch. It appeared that you incorporated the earlier version of Kevin's patch rather than the later one. I made the further changes indicated by the attached diff, which I think are all the differences between Kevin's two submissions, but I'd appreciate it if both of you would review what's now in CVS and confirm it's okay. (There'll probably be a beta3 release later today, so you can look at that if it's easier than CVS.) One thing I noticed while running the selftest on a LinuxPPC box is that the "feature" test output differed: --- feature.expected Fri May 25 11:48:33 2001 +++ feature.output Fri Nov 16 13:00:15 2001 @@ -29,7 +29,7 @@(1 row)SELECT import_fail(); -NOTICE: ('import socket failed -- untrusted dynamic module: _socket',) +NOTICE: ('import socket failed -- untrusted dynamic module: socket',) import_fail -------------------- failed asexpected I assume you guys both get the "expected" output? Perhaps this should be noted as a possible platform discrepancy. > Tom's requirement that we always reraise the longjmp is at odds with > the original design of plpython, which attempted to allow a plpython > function to use Python's exception handling to recover when an > embedded SPI call failed. There is no way to do this that I can > see without halting the propogation of the longjmp. > Thus, the semantics of 'execute' calls from plpython have been changed: > should the backend throw an error, the plpython function will be > summarily aborted. Previously, it could attempt to catch a python > exception. > Note that this does create some redundant exception handling paths > in upper level methods within this module that are no longer used. > I have not yet removed them, but will happily do so (in a careful > deliberate manner) once sure that there is no alternative way to > restore the python level exception handling. I would suggest leaving it as-is for now. Sooner or later we will probably try to reduce the severity of most errors, and at that point it'd become worthwhile to re-enable error trapping in plpython. regards, tom lane *** plpython.c~ Fri Nov 16 12:19:38 2001 --- plpython.c Fri Nov 16 12:24:43 2001 *************** *** 292,297 **** --- 292,298 ---- "binascii", "calendar", "cmath", + "codecs", "errno", "marshal", "math", *************** *** 325,330 **** --- 326,332 ---- "hexrevision", "maxint", "maxunicode", + "platform", "version", "version_info" };
* Kevin Jacobs (jacobs@penguin.theopalgroup.com) [011116 13:15]: > On Fri, 16 Nov 2001, Tom Lane wrote: > > --- feature.expected Fri May 25 11:48:33 2001 > > +++ feature.output Fri Nov 16 13:00:15 2001 > > @@ -29,7 +29,7 @@ > > (1 row) > > > > SELECT import_fail(); > > -NOTICE: ('import socket failed -- untrusted dynamic module: _socket',) > > +NOTICE: ('import socket failed -- untrusted dynamic module: socket',) > > import_fail > > -------------------- > > failed as expected > > > > I assume you guys both get the "expected" output? Perhaps this should > > be noted as a possible platform discrepancy. > > This diff is most likely a platform issue where the module printed can > change depending on the version of Python installed. Its annoying, but > harmless. I can confirm that: It occurs on stock python 1.5.2 (default on even the latest RH 7.2 linux). Betweeen 1.5 and 2.0 the low level code was moved from socket to _socket to make it possible to include a high level python wrapper called socket. I've reviewed the merge from CVS and it appears correct. Thanks! -Brad
Bradley McLean <brad@bradm.net> writes: > I can confirm that: It occurs on stock python 1.5.2 (default on even the > latest RH 7.2 linux). Bingo --- python 1.5.2 is what's installed on that machine. Seems to pass the self-test fine other than this one item. regards, tom lane
On Fri, 16 Nov 2001, Tom Lane wrote: > --- feature.expected Fri May 25 11:48:33 2001 > +++ feature.output Fri Nov 16 13:00:15 2001 > @@ -29,7 +29,7 @@ > (1 row) > > SELECT import_fail(); > -NOTICE: ('import socket failed -- untrusted dynamic module: _socket',) > +NOTICE: ('import socket failed -- untrusted dynamic module: socket',) > import_fail > -------------------- > failed as expected > > I assume you guys both get the "expected" output? Perhaps this should > be noted as a possible platform discrepancy. This diff is most likely a platform issue where the module printed can change depending on the version of Python installed. Its annoying, but harmless. -Kevin -- Kevin Jacobs The OPAL Group - Enterprise Systems Architect Voice: (216) 986-0710 x 19 E-mail: jacobs@theopalgroup.com Fax: (216) 986-0714 WWW: http://www.theopalgroup.com
Has this been addressed? --------------------------------------------------------------------------- > Bradley McLean <brad@bradm.net> writes: > > In Option (C), I set the global "InError" flag to false, and then > > return NULL, causing all of the error messages to come out and > > plpython to clean up gracefully, no backend crash. However, this > > seems to be an unprecedented approach, and I could be missing > > something big. > > Yes, as in "it's totally unsafe". Suppressing an elog(ERROR) is > a *big* no-no at present, because way too much stuff relies on > post-abort cleanup to clean up whatever problem is being reported. > You cannot allow the transaction to continue after the error, and > you most certainly mustn't cavalierly reset the error handling state. > > The only things you should be doing with longjmp trapping are > (a) doing any cleanup that Python itself has to have before you > re-propagate the longjmp, or > (b) issuing elog(NOTICE) to help identify the error location > before you re-propagate the longjmp. > > plpgsql contains an example of doing (b). > > Not propagating the longjmp, which is what the code seems to be doing > at present, is not acceptable. I had not looked at this code closely, > but as-is it is a huge reliability hazard. I will insist on removing > plpython from 7.2 entirely if this can't be fixed before release. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026