Discussion: Avoid orphaned objects dependencies, take 3
Hi,

This new thread is a follow-up of [1] and [2].

Problem description:

We have occasionally observed objects having an orphaned dependency; the most common case we have seen is functions not linked to any namespace.

Examples to produce such orphaned dependencies:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non-existing schema.

A patch was initially proposed to fix this particular (function-to-namespace) dependency (see [1]), but there could be many more scenarios (like the function-to-datatype one highlighted by Gilles in [1], which could lead to a function having an invalid parameter datatype).

As Tom said, there are dozens more cases that would need to be considered, and a global approach to avoid those race conditions should be considered instead.

A first global approach was attempted in [2], making use of a dirty snapshot when recording the dependency. But this approach appeared to be "scary" and it was still failing to close some race conditions (see [2] for details).

Then, Tom proposed another approach in [2], which is that "creation DDL will have to take a lock on each referenced object that'd conflict with a lock taken by DROP".

This is what the attached patch is trying to achieve. It does the following:

1) A new lock (that conflicts with a lock taken by DROP) is taken when the dependencies are being recorded. Thanks to it, the drop schema in scenario 2 would be blocked (resulting in an error should session 1 commit).

2) After locking the object while recording the dependency, the patch checks that the object still exists. Thanks to it, session 2 in scenario 1 would be blocked and would report an error once session 1 commits (that would not be the case should session 1 abort the transaction).

The patch also adds a few tests for some dependency cases (that would currently produce orphaned objects):

- schema and function (as in the above scenarios)
- function and type
- table and type (which I think is problematic enough, since it involves a table, to justify fixing this stuff as a whole).

[1]: https://www.postgresql.org/message-id/flat/a4f55089-7cbd-fe5d-a9bb-19adc6418ae9@darold.net#9af5cdaa9e80879beb1def3604c976e8
[2]: https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com

Please note that I'm not familiar with this area of the code, so the patch might not implement the option proposed by Tom the "right" way.

Adding the patch to the July CF.

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
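The two-step protocol in 1) and 2) can be sketched as a toy in-memory model. The following Python sketch is illustrative only (the class and method names are invented; nothing here is the patch's actual C API): DROP and dependency recording contend on the same per-object lock, and the recorder re-checks existence after acquiring it, so a dependency on a dropped object is never recorded.

```python
import threading

# In-memory toy model of the approach above (all names invented for
# illustration; this is not the patch's actual C API).
class Catalog:
    def __init__(self):
        self.objects = {"schem"}                     # existing objects
        self.locks = {"schem": threading.Lock()}     # per-object locks
        self.dependencies = []                       # (dependent, referenced)

    def record_dependency(self, dependent, referenced):
        # 1) take a lock that conflicts with the lock taken by DROP
        with self.locks[referenced]:
            # 2) re-check that the referenced object still exists
            if referenced not in self.objects:
                raise LookupError(f"{referenced} does not exist")
            self.dependencies.append((dependent, referenced))

    def drop(self, name):
        # DROP takes the same lock, so it serializes with recording
        with self.locks[name]:
            self.objects.discard(name)

cat = Catalog()
cat.record_dependency("func1", "schem")   # schema still exists: recorded
cat.drop("schem")
try:
    cat.record_dependency("func2", "schem")
except LookupError as e:
    print(e)                              # schem does not exist
```

Under this model, neither interleaving of the two scenarios above can leave an entry in `dependencies` that points to a dropped object; the late session gets an error instead.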
Attachments
Hi Bertrand,

22.04.2024 11:45, Bertrand Drouvot wrote:
> Hi,
>
> This new thread is a follow-up of [1] and [2].
>
> Problem description:
>
> We have occasionally observed objects having an orphaned dependency, the
> most common case we have seen is functions not linked to any namespaces.
>
> ...
>
> Looking forward to your feedback,

This has reminded me of bug #17182 [1].

Unfortunately, with the patch applied, the following script:

for ((i=1;i<=100;i++)); do
  ( { for ((n=1;n<=20;n++)); do echo "DROP SCHEMA s;"; done } | psql ) >psql1.log 2>&1 &
  echo "
CREATE SCHEMA s;
CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;';
CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;';
CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;';
CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;';
" | psql >psql2.log 2>&1 &
  wait
  psql -c "DROP SCHEMA s CASCADE" >psql3.log
done
echo "
SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM pg_proc pp
LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL" | psql

still ends with:

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|LOG: server process (PID 1388378) was terminated by signal 11: Segmentation fault
2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|DETAIL: Failed process was running: SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM pg_proc pp
        LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL

[1] https://www.postgresql.org/message-id/flat/17182-a6baa001dd1784be%40postgresql.org

Best regards,
Alexander
Hi, On Mon, Apr 22, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > 22.04.2024 11:45, Bertrand Drouvot wrote: > > Hi, > > > > This new thread is a follow-up of [1] and [2]. > > > > Problem description: > > > > We have occasionally observed objects having an orphaned dependency, the > > most common case we have seen is functions not linked to any namespaces. > > > > ... > > > > Looking forward to your feedback, > > This have reminded me of bug #17182 [1]. Thanks for the link to the bug! > Unfortunately, with the patch applied, the following script: > > for ((i=1;i<=100;i++)); do > ( { for ((n=1;n<=20;n++)); do echo "DROP SCHEMA s;"; done } | psql ) >psql1.log 2>&1 & > echo " > CREATE SCHEMA s; > CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;'; > CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;'; > CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;'; > CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;'; > CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;'; > " | psql >psql2.log 2>&1 & > wait > psql -c "DROP SCHEMA s CASCADE" >psql3.log > done > echo " > SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM pg_proc pp > LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL" | psql > > still ends with: > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|LOG: server process (PID > 1388378) was terminated by signal 11: Segmentation fault > 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|DETAIL: Failed process was > running: SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid > FROM pg_proc pp > LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL > Thanks for sharing the script. 
That's weird, I just launched it several times with the patch applied and I'm not able to see the seg fault (while I can see it consistently failing on the master branch).

Are you 100% sure you tested it against a binary with the patch applied?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
22.04.2024 13:52, Bertrand Drouvot wrote:
>
> That's weird, I just launched it several times with the patch applied and I'm not
> able to see the seg fault (while I can see it consistently failing on the master
> branch).
>
> Are you 100% sure you tested it against a binary with the patch applied?
>

Yes, at least I can't see what I'm doing wrong. Please try my self-contained script attached.

Best regards,
Alexander
Attachments
Hi, On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote: > 22.04.2024 13:52, Bertrand Drouvot wrote: > > > > That's weird, I just launched it several times with the patch applied and I'm not > > able to see the seg fault (while I can see it constently failing on the master > > branch). > > > > Are you 100% sure you tested it against a binary with the patch applied? > > > > Yes, at least I can't see what I'm doing wrong. Please try my > self-contained script attached. Thanks for sharing your script! Yeah your script ensures the patch is applied before the repro is executed. I do confirm that I can also see the issue with the patch applied (but I had to launch multiple attempts, while on master one attempt is enough). I'll have a look. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi,

On Tue, Apr 23, 2024 at 04:59:09AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> > 22.04.2024 13:52, Bertrand Drouvot wrote:
> > >
> > > That's weird, I just launched it several times with the patch applied and I'm not
> > > able to see the seg fault (while I can see it consistently failing on the master
> > > branch).
> > >
> > > Are you 100% sure you tested it against a binary with the patch applied?
> > >
> >
> > Yes, at least I can't see what I'm doing wrong. Please try my
> > self-contained script attached.
>
> Thanks for sharing your script!
>
> Yeah, your script ensures the patch is applied before the repro is executed.
>
> I do confirm that I can also see the issue with the patch applied (but I had to
> launch multiple attempts, while on master one attempt is enough).
>
> I'll have a look.

Please find attached v2, which should not produce the issue anymore (I launched a lot of attempts without any issues). v1 was not strong enough, as it was not always checking for the dependent object's existence. v2 now always checks if the object still exists after the additional lock acquisition attempt while recording the dependency.

I still need to think about v2, but in the meantime could you please also give v2 a try on your side?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
Hi,

On Tue, Apr 23, 2024 at 04:20:46PM +0000, Bertrand Drouvot wrote:
> Please find attached v2 that should not produce the issue anymore (I launched a
> lot of attempts without any issues). v1 was not strong enough as it was not
> always checking for the dependent object existence. v2 now always checks if the
> object still exists after the additional lock acquisition attempt while recording
> the dependency.
>
> I still need to think about v2 but in the meantime could you please also give
> v2 a try on your side?

I gave more thought to v2, and the approach seems reasonable to me. Basically, what it does is that in case the object is already dropped before we take the new lock (while recording the dependency), the error message is a "generic" one (meaning it does not provide the description of the "already" dropped object). I think it makes sense to write the patch that way, abandoning the patch's ambition to report the description of the dropped object in all cases.

Of course, I would be happy to hear others' thoughts about it.

Please find v3 attached (which is v2 with more comments).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
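As a toy illustration of that trade-off (a hedged Python sketch; the helper name and messages are invented stand-ins for the behavior described above, not the patch's actual code): when the object vanished before the new lock was taken, no description can be built anymore, so only a generic message can be raised.

```python
# Invented helper name; the messages mirror the behavior described above.
def report_missing_object(object_description):
    # Called when the re-check after locking finds the object gone.
    if object_description is not None:
        raise LookupError(f"{object_description} does not exist")
    # The object vanished before a description could be built.
    raise LookupError("a dependent object does not exist")

for desc in ("schema schem", None):
    try:
        report_missing_object(desc)
    except LookupError as e:
        print(e)
# prints:
#   schema schem does not exist
#   a dependent object does not exist
```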
Attachments
Hi Bertrand,

24.04.2024 11:38, Bertrand Drouvot wrote:
>> Please find attached v2 that should not produce the issue anymore (I launched a
>> lot of attempts without any issues). v1 was not strong enough as it was not
>> always checking for the dependent object existence. v2 now always checks if the
>> object still exists after the additional lock acquisition attempt while recording
>> the dependency.
>>
>> I still need to think about v2 but in the meantime could you please also give
>> v2 a try on your side?
> I gave more thought to v2 and the approach seems reasonable to me. Basically what
> it does is that in case the object is already dropped before we take the new lock
> (while recording the dependency) then the error message is a "generic" one (means
> it does not provide the description of the "already" dropped object). I think it
> makes sense to write the patch that way by abandoning the patch's ambition to
> tell the description of the dropped object in all the cases.
>
> Of course, I would be happy to hear others thought about it.
>
> Please find v3 attached (which is v2 with more comments).

Thank you for the improved version!

I can confirm that it fixes that case.

I've also tested other cases that fail on master (most of them also fail with v1); please try/look at the attached script. (There could be other broken-dependency cases, of course, but I think I've covered all the representative ones.)

All tested cases work correctly with v3 applied — I couldn't get broken dependencies, though concurrent create/drop operations can still produce the "cache lookup failed" error, which is probably okay, except that it is an INTERNAL_ERROR, which is assumed to be not easily triggered by users.

Best regards,
Alexander
Attachments
Hi,

On Wed, Apr 24, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> 24.04.2024 11:38, Bertrand Drouvot wrote:
> > I gave more thought to v2 and the approach seems reasonable to me. Basically what
> > it does is that in case the object is already dropped before we take the new lock
> > (while recording the dependency) then the error message is a "generic" one (means
> > it does not provide the description of the "already" dropped object). I think it
> > makes sense to write the patch that way by abandoning the patch's ambition to
> > tell the description of the dropped object in all the cases.
> >
> > Of course, I would be happy to hear others thought about it.
> >
> > Please find v3 attached (which is v2 with more comments).
>
> Thank you for the improved version!
>
> I can confirm that it fixes that case.

Great, thanks for the test!

> I've also tested other cases that fail on master (most of them also fail
> with v1), please try/look at the attached script.

Thanks for all those tests!

> (There could be other broken-dependency cases, of course, but I think I've
> covered all the representative ones.)

Agreed. Furthermore, the way the patch is written should be agnostic to the kinds of objects that are part of the dependency. Having said that, it does not hurt to add more tests in this patch, so v4 attached adds some of your tests (that would fail on the master branch without the patch applied).

The way the tests are written in the patch is less "racy" than when they are triggered with your script. Indeed, I think that in the patch the dependent object cannot be removed before the new lock is taken when recording the dependency. We may want to bring injection points into the game if we feel the need.

> All tested cases work correctly with v3 applied —

Yeah, same on my side, I did run them too and did not find any issues.

> I couldn't get broken
> dependencies,

Same here.

> though concurrent create/drop operations can still produce
> the "cache lookup failed" error, which is probably okay, except that it is
> an INTERNAL_ERROR, which assumed to be not easily triggered by users.

I did not see any of those "cache lookup failed" during my testing with/without your script. During which test(s) did you see those with v3 applied?

Attached v4, simply adding more tests to v3.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
Hi,
25.04.2024 08:00, Bertrand Drouvot wrote:
> > though concurrent create/drop operations can still produce
> > the "cache lookup failed" error, which is probably okay, except that it is
> > an INTERNAL_ERROR, which assumed to be not easily triggered by users.
>
> I did not see any of those "cache lookup failed" during my testing with/without
> your script. During which test(s) did you see those with v3 applied?
You can try, for example, table-trigger, or other tests that check for
"cache lookup failed" psql output only (maybe you'll need to increase the
iteration count). For instance, I've got (with v4 applied):
2024-04-25 05:48:08.102 UTC [3638763] ERROR: cache lookup failed for function 20893
2024-04-25 05:48:08.102 UTC [3638763] STATEMENT: CREATE TRIGGER modified_c1 BEFORE UPDATE OF c ON t
FOR EACH ROW WHEN (OLD.c <> NEW.c) EXECUTE PROCEDURE trigger_func('modified_c');
Or with function-function:
2024-04-25 05:52:31.390 UTC [3711897] ERROR: cache lookup failed for function 32190 at character 54
2024-04-25 05:52:31.390 UTC [3711897] STATEMENT: CREATE FUNCTION f1() RETURNS int LANGUAGE SQL RETURN f() + 1;
--
2024-04-25 05:52:37.639 UTC [3720011] ERROR: cache lookup failed for function 34465 at character 54
2024-04-25 05:52:37.639 UTC [3720011] STATEMENT: CREATE FUNCTION f1() RETURNS int LANGUAGE SQL RETURN f() + 1;
> Attached v4, simply adding more tests to v3.
Thank you for working on this!
Best regards,
Alexander
Hi,

On Thu, Apr 25, 2024 at 09:00:00AM +0300, Alexander Lakhin wrote:
> Hi,
>
> 25.04.2024 08:00, Bertrand Drouvot wrote:
> > > though concurrent create/drop operations can still produce
> > > the "cache lookup failed" error, which is probably okay, except that it is
> > > an INTERNAL_ERROR, which assumed to be not easily triggered by users.
> > I did not see any of those "cache lookup failed" during my testing with/without
> > your script. During which test(s) did you see those with v3 applied?
>
> You can try, for example, table-trigger, or other tests that check for
> "cache lookup failed" psql output only (maybe you'll need to increase the
> iteration count). For instance, I've got (with v4 applied):
> 2024-04-25 05:48:08.102 UTC [3638763] ERROR: cache lookup failed for function 20893
> 2024-04-25 05:48:08.102 UTC [3638763] STATEMENT: CREATE TRIGGER modified_c1 BEFORE UPDATE OF c ON t
> FOR EACH ROW WHEN (OLD.c <> NEW.c) EXECUTE PROCEDURE trigger_func('modified_c');
>
> Or with function-function:
> 2024-04-25 05:52:31.390 UTC [3711897] ERROR: cache lookup failed for function 32190 at character 54
> 2024-04-25 05:52:31.390 UTC [3711897] STATEMENT: CREATE FUNCTION f1() RETURNS int LANGUAGE SQL RETURN f() + 1;
> --
> 2024-04-25 05:52:37.639 UTC [3720011] ERROR: cache lookup failed for function 34465 at character 54
> 2024-04-25 05:52:37.639 UTC [3720011] STATEMENT: CREATE FUNCTION f1() RETURNS int LANGUAGE SQL RETURN f() + 1;

I see, so this is during object creation. It's easy to reproduce this kind of issue with gdb.
For example, set a breakpoint on SearchSysCache1() and, during the create function f1(), once it breaks on:

#0  SearchSysCache1 (cacheId=45, key1=16400) at syscache.c:221
#1  0x00005ad305beacd6 in func_get_detail (funcname=0x5ad308204d50, fargs=0x0, fargnames=0x0, nargs=0, argtypes=0x7ffff2ff9cc0, expand_variadic=true, expand_defaults=true, include_out_arguments=false, funcid=0x7ffff2ff9ba0, rettype=0x7ffff2ff9b9c, retset=0x7ffff2ff9b94, nvargs=0x7ffff2ff9ba4, vatype=0x7ffff2ff9ba8, true_typeids=0x7ffff2ff9bd8, argdefaults=0x7ffff2ff9be0) at parse_func.c:1622
#2  0x00005ad305be7dd0 in ParseFuncOrColumn (pstate=0x5ad30823be98, funcname=0x5ad308204d50, fargs=0x0, last_srf=0x0, fn=0x5ad308204da0, proc_call=false, location=55) at parse_func.c:266
#3  0x00005ad305bdffb0 in transformFuncCall (pstate=0x5ad30823be98, fn=0x5ad308204da0) at parse_expr.c:1474
#4  0x00005ad305bdd2ee in transformExprRecurse (pstate=0x5ad30823be98, expr=0x5ad308204da0) at parse_expr.c:230
#5  0x00005ad305bdec34 in transformAExprOp (pstate=0x5ad30823be98, a=0x5ad308204e20) at parse_expr.c:990
#6  0x00005ad305bdd1a0 in transformExprRecurse (pstate=0x5ad30823be98, expr=0x5ad308204e20) at parse_expr.c:187
#7  0x00005ad305bdd00b in transformExpr (pstate=0x5ad30823be98, expr=0x5ad308204e20, exprKind=EXPR_KIND_SELECT_TARGET) at parse_expr.c:131
#8  0x00005ad305b96b7e in transformReturnStmt (pstate=0x5ad30823be98, stmt=0x5ad308204ee0) at analyze.c:2395
#9  0x00005ad305b92213 in transformStmt (pstate=0x5ad30823be98, parseTree=0x5ad308204ee0) at analyze.c:375
#10 0x00005ad305c6321a in interpret_AS_clause (languageOid=14, languageName=0x5ad308204c40 "sql", funcname=0x5ad308204ad8 "f100", as=0x0, sql_body_in=0x5ad308204ee0, parameterTypes=0x0, inParameterNames=0x0, prosrc_str_p=0x7ffff2ffa208, probin_str_p=0x7ffff2ffa200, sql_body_out=0x7ffff2ffa210, queryString=0x5ad3082040b0 "CREATE FUNCTION f100() RETURNS int LANGUAGE SQL RETURN f() + 1;") at functioncmds.c:953
#11 0x00005ad305c63c93 in CreateFunction (pstate=0x5ad308186310, stmt=0x5ad308204f00) at functioncmds.c:1221

then drop function f() in another session. The create function f1() would then fail with:

postgres=# CREATE FUNCTION f() RETURNS int LANGUAGE SQL RETURN f() + 1;
ERROR: cache lookup failed for function 16400

This error appears before we get a chance to call the new depLockAndCheckObject() function.

I think this is what Tom was referring to in [1]:

"
So the only real fix for this would be to make every object lookup in the entire
system do the sort of dance that's done in RangeVarGetRelidExtended.
"

The fact that those kinds of errors appear also somehow ensures that no orphaned dependencies can be created. The patch does ensure that no orphaned dependencies can occur after those "initial" lookups are done (should the dependent object be dropped).

I'm tempted to not add extra complexity to avoid those kinds of errors and to keep the patch as it is: all of this serves the same goal, namely that no orphaned dependencies are created.

[1]: https://www.postgresql.org/message-id/2872252.1630851337%40sss.pgh.pa.us

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
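A hedged toy model of the two orderings discussed here (all names and messages are invented; this is not PostgreSQL code): whether the concurrent drop lands before the parse-analysis lookup or between the lookup and the lock-and-check step, the result is an error, never an orphaned dependency.

```python
# Toy model of the two orderings (names and messages invented).
catalog = {"f": 16400}    # function name -> oid

def syscache_lookup(name):
    # parse-analysis lookup; runs before the new lock is taken
    if name not in catalog:
        raise LookupError(f"cache lookup failed for function {name}")
    return catalog[name]

def lock_and_check(oid):
    # the patch's re-check after locking (simplified)
    if oid not in catalog.values():
        raise LookupError("a dependent object does not exist")

# Ordering 1: DROP wins before the lookup -> "cache lookup failed",
# the pre-existing error reproduced with gdb above.
saved = dict(catalog)
catalog.clear()
try:
    syscache_lookup("f")
except LookupError as e:
    print(e)          # cache lookup failed for function f

# Ordering 2: DROP lands between lookup and lock-and-check -> the
# re-check reports the missing object instead of recording an orphan.
catalog.update(saved)
oid = syscache_lookup("f")
catalog.clear()       # simulate the concurrent DROP in the window
try:
    lock_and_check(oid)
except LookupError as e:
    print(e)          # a dependent object does not exist
```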
Hi Bertrand,

25.04.2024 10:20, Bertrand Drouvot wrote:
> postgres=# CREATE FUNCTION f() RETURNS int LANGUAGE SQL RETURN f() + 1;
> ERROR: cache lookup failed for function 16400
>
> This stuff does appear before we get a chance to call the new depLockAndCheckObject()
> function.
>
> I think this is what Tom was referring to in [1]:
>
> "
> So the only real fix for this would be to make every object lookup in the entire
> system do the sort of dance that's done in RangeVarGetRelidExtended.
> "
>
> The fact that those kind of errors appear also somehow ensure that no orphaned
> dependencies can be created.

I agree; the only thing I'd change here is the error code.

But I've discovered yet another possibility to get a broken dependency. Please try this script:

res=0
numclients=20
for ((i=1;i<=100;i++)); do
  for ((c=1;c<=numclients;c++)); do
    echo "
CREATE SCHEMA s_$c;
CREATE CONVERSION myconv_$c FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
ALTER CONVERSION myconv_$c SET SCHEMA s_$c;
" | psql >psql1-$c.log 2>&1 &
    echo "DROP SCHEMA s_$c RESTRICT;" | psql >psql2-$c.log 2>&1 &
  done
  wait
  pg_dump -f db.dump || { echo "on iteration $i"; res=1; break; }
  for ((c=1;c<=numclients;c++)); do
    echo "DROP SCHEMA s_$c CASCADE;" | psql >psql3-$c.log 2>&1
  done
done
psql -c "SELECT * FROM pg_conversion WHERE connamespace NOT IN (SELECT oid FROM pg_namespace);"

It fails for me (with the v4 patch applied) as follows:

pg_dump: error: schema with OID 16392 does not exist
on iteration 1
  oid  | conname  | connamespace | conowner | conforencoding | contoencoding |      conproc      | condefault
-------+----------+--------------+----------+----------------+---------------+-------------------+------------
 16396 | myconv_6 |        16392 |       10 |              8 |             6 | iso8859_1_to_utf8 | f

Best regards,
Alexander
Hi,

On Tue, Apr 30, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
>
> But I've discovered yet another possibility to get a broken dependency.

Thanks for the testing and the report!

> Please try this script:
> res=0
> numclients=20
> for ((i=1;i<=100;i++)); do
> for ((c=1;c<=numclients;c++)); do
> echo "
> CREATE SCHEMA s_$c;
> CREATE CONVERSION myconv_$c FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
> ALTER CONVERSION myconv_$c SET SCHEMA s_$c;
> " | psql >psql1-$c.log 2>&1 &
> echo "DROP SCHEMA s_$c RESTRICT;" | psql >psql2-$c.log 2>&1 &
> done
> wait
> pg_dump -f db.dump || { echo "on iteration $i"; res=1; break; }
> for ((c=1;c<=numclients;c++)); do
> echo "DROP SCHEMA s_$c CASCADE;" | psql >psql3-$c.log 2>&1
> done
> done
> psql -c "SELECT * FROM pg_conversion WHERE connamespace NOT IN (SELECT oid FROM pg_namespace);"
>
> It fails for me (with the v4 patch applied) as follows:
> pg_dump: error: schema with OID 16392 does not exist
> on iteration 1
>   oid  | conname  | connamespace | conowner | conforencoding | contoencoding |      conproc      | condefault
> -------+----------+--------------+----------+----------------+---------------+-------------------+------------
>  16396 | myconv_6 |        16392 |       10 |              8 |             6 | iso8859_1_to_utf8 | f

Thanks for sharing the test; I'm able to reproduce the issue with v4.

Oh I see, your test updates an existing dependency. v4 took care of brand new dependency creation (recordMultipleDependencies()) but forgot to take care of changing an existing dependency (which is done in another code path: changeDependencyFor()).

Please find attached v5 that adds:

- a call to the new depLockAndCheckObject() function in changeDependencyFor().
- a test when altering an existing dependency.

With v5 applied, I don't see the issue anymore.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
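The shape of the v5 fix can be sketched as follows. This is a hedged Python model: only the two function names recordMultipleDependencies()/changeDependencyFor() come from the discussion above, and everything else is invented. Both the path recording a brand new dependency and the path rewriting an existing one must go through the same lock-and-check helper, which is the call that v4 was missing in the second path.

```python
# Python stand-ins for the two C code paths named above; only the two
# path names come from the mail, everything else is invented.
def dep_lock_and_check(catalog, referenced):
    # Simplified lock-then-recheck helper (the real one also takes a
    # lock conflicting with DROP before the existence check).
    if referenced not in catalog:
        raise LookupError("a dependent object does not exist")

def record_multiple_dependencies(catalog, deps, dependent, referenced):
    dep_lock_and_check(catalog, referenced)        # covered since v4
    deps.append((dependent, referenced))

def change_dependency_for(catalog, deps, dependent, new_referenced):
    dep_lock_and_check(catalog, new_referenced)    # the call v4 missed
    deps[:] = [(d, r) for (d, r) in deps if d != dependent]
    deps.append((dependent, new_referenced))

catalog = {"s_old", "s_new"}
deps = []
record_multiple_dependencies(catalog, deps, "myconv", "s_old")
change_dependency_for(catalog, deps, "myconv", "s_new")  # ALTER ... SET SCHEMA
catalog.discard("s_new")                                 # concurrent DROP SCHEMA
try:
    change_dependency_for(catalog, deps, "myconv", "s_new")
except LookupError as e:
    print(e)   # a dependent object does not exist
```

Without the helper call in `change_dependency_for`, the last call would silently re-point the dependency at the dropped schema, which is exactly the broken-conversion state the pg_dump run above exposed.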
Attachments
Hi Bertrand, 09.05.2024 15:20, Bertrand Drouvot wrote: > Oh I see, your test updates an existing dependency. v4 took care about brand new > dependencies creation (recordMultipleDependencies()) but forgot to take care > about changing an existing dependency (which is done in another code path: > changeDependencyFor()). > > Please find attached v5 that adds: > > - a call to the new depLockAndCheckObject() function in changeDependencyFor(). > - a test when altering an existing dependency. > > With v5 applied, I don't see the issue anymore. Me too. Thank you for the improved version! I will test the patch in the background, but for now I see no other issues with it. Best regards, Alexander
On Thu, May 09, 2024 at 12:20:51PM +0000, Bertrand Drouvot wrote:
> Oh I see, your test updates an existing dependency. v4 took care about brand new
> dependencies creation (recordMultipleDependencies()) but forgot to take care
> about changing an existing dependency (which is done in another code path:
> changeDependencyFor()).
>
> Please find attached v5 that adds:
>
> - a call to the new depLockAndCheckObject() function in changeDependencyFor().
> - a test when altering an existing dependency.
>
> With v5 applied, I don't see the issue anymore.

+ if (object_description)
+     ereport(ERROR, errmsg("%s does not exist", object_description));
+ else
+     ereport(ERROR, errmsg("a dependent object does not ex

This generates an internal error code. Is that intended?

--- /dev/null
+++ b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec

This introduces a module with only a single spec. I could get behind an extra module if splitting the tests into more specs makes sense or if there is a restriction on installcheck. However, for one spec file filled with a bunch of objects (and note that I'm OK to let this single spec grow more for this range of tests), it seems to me that this would be better under src/test/isolation/.

+ if (use_dirty_snapshot)
+ {
+     InitDirtySnapshot(DirtySnapshot);
+     snapshot = &DirtySnapshot;
+ }
+ else
+     snapshot = NULL;

I'm wondering if Robert has a comment about that. It looks backwards in a world where we try to encourage MVCC snapshots for catalog scans (aka 568d4138c646), especially for the part related to dependency.c and ObjectAddresses.
--
Michael
Attachments
Hi,

On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote:
> On Thu, May 09, 2024 at 12:20:51PM +0000, Bertrand Drouvot wrote:
> > Oh I see, your test updates an existing dependency. v4 took care about brand new
> > dependencies creation (recordMultipleDependencies()) but forgot to take care
> > about changing an existing dependency (which is done in another code path:
> > changeDependencyFor()).
> >
> > Please find attached v5 that adds:
> >
> > - a call to the new depLockAndCheckObject() function in changeDependencyFor().
> > - a test when altering an existing dependency.
> >
> > With v5 applied, I don't see the issue anymore.
>
> + if (object_description)
> +     ereport(ERROR, errmsg("%s does not exist", object_description));
> + else
> +     ereport(ERROR, errmsg("a dependent object does not ex
>
> This generates an internal error code. Is that intended?

Thanks for looking at it!

Yes, it's like when, say, you want to create an object in a schema that does not exist (see get_namespace_oid()).

> --- /dev/null
> +++ b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec
>
> This introduces a module with only one single spec. I could get
> behind an extra module if splitting the tests into more specs makes
> sense or if there is a restriction on installcheck. However, for
> one spec file filed with a bunch of objects, and note that I'm OK to
> let this single spec grow more for this range of tests, it seems to me
> that this would be better under src/test/isolation/.

Yeah, I was not sure about this one (the location is from take 2 mentioned up-thread). I'll look at moving the tests to src/test/isolation/.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi, On Tue, May 14, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > 09.05.2024 15:20, Bertrand Drouvot wrote: > > Oh I see, your test updates an existing dependency. v4 took care about brand new > > dependencies creation (recordMultipleDependencies()) but forgot to take care > > about changing an existing dependency (which is done in another code path: > > changeDependencyFor()). > > > > Please find attached v5 that adds: > > > > - a call to the new depLockAndCheckObject() function in changeDependencyFor(). > > - a test when altering an existing dependency. > > > > With v5 applied, I don't see the issue anymore. > > Me too. Thank you for the improved version! > I will test the patch in the background, but for now I see no other > issues with it. Thanks for confirming! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Wed, May 15, 2024 at 08:31:43AM +0000, Bertrand Drouvot wrote: > Hi, > > On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote: > > +++ b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec > > > > This introduces a module with only one single spec. I could get > > behind an extra module if splitting the tests into more specs makes > > sense or if there is a restriction on installcheck. However, for > > one spec file filed with a bunch of objects, and note that I'm OK to > > let this single spec grow more for this range of tests, it seems to me > > that this would be better under src/test/isolation/. > > Yeah, I was not sure about this one (the location is from take 2 mentioned > up-thread). I'll look at moving the tests to src/test/isolation/. Please find attached v6 (only diff with v5 is moving the tests as suggested above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachments
Hello Bertrand,

15.05.2024 11:31, Bertrand Drouvot wrote:
> On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote:
>
>> + if (object_description)
>> +     ereport(ERROR, errmsg("%s does not exist", object_description));
>> + else
>> +     ereport(ERROR, errmsg("a dependent object does not ex
>>
>> This generates an internal error code. Is that intended?
> Yes, it's like when say you want to create an object in a schema that does not
> exist (see get_namespace_oid()).

AFAICS, get_namespace_oid() throws not ERRCODE_INTERNAL_ERROR but ERRCODE_UNDEFINED_SCHEMA:

# SELECT regtype('unknown_schema.type');
ERROR:  schema "unknown_schema" does not exist
LINE 1: SELECT regtype('unknown_schema.type');
                       ^
# \echo :LAST_ERROR_SQLSTATE
3F000

Probably it's worth avoiding ERRCODE_INTERNAL_ERROR here in light of [1] and [2], as these errors are not that abnormal (not Assert-like).

[1] https://www.postgresql.org/message-id/Zic_GNgos5sMxKoa%40paquier.xyz
[2] https://commitfest.postgresql.org/48/4735/

Best regards,
Alexander
Hi,

On Sun, May 19, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand,
>
> Probably, it's worth avoiding ERRCODE_INTERNAL_ERROR here in light of [1]
> and [2], as these errors are not that abnormal (not Assert-like).
>
> [1] https://www.postgresql.org/message-id/Zic_GNgos5sMxKoa%40paquier.xyz
> [2] https://commitfest.postgresql.org/48/4735/

Thanks for mentioning the above examples, I agree that it's worth avoiding
ERRCODE_INTERNAL_ERROR here: please find attached v7 that makes use of a new
ERRCODE: ERRCODE_DEPENDENT_OBJECTS_DOES_NOT_EXIST

I thought about this name as it is close enough to the already existing
"ERRCODE_DEPENDENT_OBJECTS_STILL_EXIST" but I'm open to suggestions too.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
On Wed, May 15, 2024 at 6:31 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Please find attached v6 (only diff with v5 is moving the tests as suggested
> above).

I don't immediately know what to think about this patch. I've known about this
issue for a long time, but I didn't think this was how we would fix it.

I think the problem here is basically that we don't lock namespaces (schemas)
when we're adding and removing things from the schema. So I assumed that if we
ever did something about this, what we would do would be to add a bunch of
calls to lock schemas to the appropriate parts of the code. What you've done
here instead is add locking at a much lower level - whenever we are adding a
dependency on an object, we lock the object. The advantage of that approach is
that we definitely won't miss anything. The disadvantage of that approach is
that it means we have some very low-level code taking locks, which means it's
not obvious which operations are taking what locks. Maybe it could even result
in some redundancy, like the higher-level code taking a lock also (possibly in
a different mode) and then this code taking another one.

I haven't gone through the previous threads; it sounds like there's already
been some discussion of this, but I'm just telling you how it strikes me on
first look.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On Tue, May 21, 2024 at 08:53:06AM -0400, Robert Haas wrote:
> On Wed, May 15, 2024 at 6:31 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > Please find attached v6 (only diff with v5 is moving the tests as suggested
> > above).
>
> I don't immediately know what to think about this patch.

Thanks for looking at it!

> I've known about this issue for a long time, but I didn't think this was how we
> would fix it.

I started initially with [1] but it was focusing on the function-schema case
only. Then I proposed [2] making use of a dirty snapshot when recording the
dependency. But this approach appeared to be "scary" and it was still failing
to close some race conditions.

Then, Tom proposed another approach in [2] which is that "creation DDL will have
to take a lock on each referenced object that'd conflict with a lock taken by
DROP". This is the one the current patch is trying to implement.

> What you've done here instead is add locking at a much
> lower level - whenever we are adding a dependency on an object, we
> lock the object. The advantage of that approach is that we definitely
> won't miss anything.

Right, as there are many more dependency cases than the schema-related ones,
for example:

- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper

to name a few.

> The disadvantage of that approach is that it
> means we have some very low-level code taking locks, which means it's
> not obvious which operations are taking what locks.

Right, but the new operations are "only" the ones leading to recording or
altering a dependency.

> Maybe it could
> even result in some redundancy, like the higher-level code taking a
> lock also (possibly in a different mode) and then this code taking
> another one.

The one that is added here is in AccessShareLock mode.
It could conflict with the ones taken in AccessExclusiveLock mode (if I'm not
missing any):

- AcquireDeletionLock(): which is exactly what we want
- get_object_address()
- get_object_address_rv()
- ExecAlterObjectDependsStmt()
- ExecRenameStmt()
- ExecAlterOwnerStmt()
- RemoveObjects()
- AlterPublication()

I think there are two cases here:

First case: the "conflicting" lock mode is for one of our own locks, then
LockCheckConflicts() would report this as a NON conflict.

Second case: the "conflicting" lock mode is NOT for one of our own locks, then
LockCheckConflicts() would report a conflict. But I have the feeling that the
existing code would already lock those sessions.

One example where it would be the case:

Session 1: doing "BEGIN; ALTER FUNCTION noschemas.foo2() SET SCHEMA alterschema"
would acquire the lock in AccessExclusiveLock mode during
ExecAlterObjectSchemaStmt()->get_object_address()->LockDatabaseObject()
(in the existing code, and before the new call that would occur through
changeDependencyFor()->depLockAndCheckObject() with the patch in place).

Then, session 2: doing "alter function noschemas.foo2() owner to newrole;"
would be locked in the existing code while doing
ExecAlterOwnerStmt()->get_object_address()->LockDatabaseObject().

Means that in this example, the new lock this patch is introducing would not be
responsible for session 2 being locked.

Was your concern about "First case" or "Second case" or both?

[1]: https://www.postgresql.org/message-id/flat/5a9daaae-5538-b209-6279-e903c3ea2157%40amazon.com
[2]: https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, May 22, 2024 at 6:21 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> I started initially with [1] but it was focusing on function-schema only.

Yeah, that's what I thought we would want to do. And then just extend that to
the other cases.

> Then I proposed [2] making use of a dirty snapshot when recording the dependency.
> But this approach appeared to be "scary" and it was still failing to close
> some race conditions.

The current patch still seems to be using dirty snapshots for some reason,
which struck me as a bit odd. My intuition is that if we're relying on dirty
snapshots to solve problems, we likely haven't solved the problems correctly,
which seems consistent with your statement about "failing to close some race
conditions". But I don't think I understand the situation well enough to be
sure just yet.

> Then, Tom proposed another approach in [2] which is that "creation DDL will have
> to take a lock on each referenced object that'd conflict with a lock taken by DROP".
> This is the one the current patch is trying to implement.

It's a clever idea, but I'm not sure that I like it.

> I think there are two cases here:
>
> First case: the "conflicting" lock mode is for one of our own locks, then
> LockCheckConflicts() would report this as a NON conflict.
>
> Second case: the "conflicting" lock mode is NOT for one of our own locks, then
> LockCheckConflicts() would report a conflict. But I have the feeling that the
> existing code would already lock those sessions.
>
> Was your concern about "First case" or "Second case" or both?

The second case, I guess. It's bad to take a weaker lock first and then a
stronger lock on the same object later, because it can lead to deadlocks that
would have been avoided if the stronger lock had been taken at the outset.
Here, it seems like things would happen in the other order: if we took two
locks, we'd probably take the stronger lock in the higher-level code and then
the weaker lock in the dependency code.
That shouldn't break anything; it's just a bit inefficient. My concern was
really more about the maintainability of the code. I fear that if we add code
that takes heavyweight locks in surprising places, we might later find the
behavior difficult to reason about.

Tom, what is your thought about that concern?

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On Wed, May 22, 2024 at 10:48:12AM -0400, Robert Haas wrote:
> On Wed, May 22, 2024 at 6:21 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > I started initially with [1] but it was focusing on function-schema only.
>
> Yeah, that's what I thought we would want to do. And then just extend
> that to the other cases.
>
> > Then I proposed [2] making use of a dirty snapshot when recording the dependency.
> > But this approach appeared to be "scary" and it was still failing to close
> > some race conditions.
>
> The current patch still seems to be using dirty snapshots for some
> reason, which struck me as a bit odd. My intuition is that if we're
> relying on dirty snapshots to solve problems, we likely haven't solved
> the problems correctly, which seems consistent with your statement
> about "failing to close some race conditions". But I don't think I
> understand the situation well enough to be sure just yet.

The reason why we are using a dirty snapshot here is for the cases where we are
recording a dependency on a referenced object that we are creating at the same
time behind the scenes (for example, creating a composite type while creating a
relation). Without the dirty snapshot, the object we are creating behind the
scenes (the composite type) would not be visible and we would wrongly assume
that it has been dropped.

Note that the dirty snapshot is used only when the object is first reported as
"non existing" by the new ObjectByIdExist() function.

> > I think there are two cases here:
> >
> > First case: the "conflicting" lock mode is for one of our own locks, then
> > LockCheckConflicts() would report this as a NON conflict.
> >
> > Second case: the "conflicting" lock mode is NOT for one of our own locks, then
> > LockCheckConflicts() would report a conflict. But I have the feeling that the
> > existing code would already lock those sessions.
> >
> > Was your concern about "First case" or "Second case" or both?
>
> The second case, I guess.
> It's bad to take a weaker lock first and
> then a stronger lock on the same object later, because it can lead to
> deadlocks that would have been avoided if the stronger lock had been
> taken at the outset.

In the example I shared up-thread it would be the opposite: session 1 would
take an AccessExclusiveLock on the object before taking an AccessShareLock
during changeDependencyFor().

> Here, it seems like things would happen in the
> other order: if we took two locks, we'd probably take the stronger
> lock in the higher-level code and then the weaker lock in the
> dependency code.

Yeah, I agree.

> That shouldn't break anything; it's just a bit
> inefficient.

Yeah, the second lock is useless in that case (like in the example up-thread).

> My concern was really more about the maintainability of
> the code. I fear that if we add code that takes heavyweight locks in
> surprising places, we might later find the behavior difficult to
> reason about.

I think I understand your concern about code maintainability but I'm not sure
that adding locks while recording a dependency is that surprising.

> Tom, what is your thought about that concern?

+1

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> The reason why we are using a dirty snapshot here is for the cases where we are
> recording a dependency on a referenced object that we are creating at the same
> time behind the scenes (for example, creating a composite type while creating
> a relation). Without the dirty snapshot, the object we are creating behind
> the scenes (the composite type) would not be visible and we would wrongly assume
> that it has been dropped.

The usual reason for using a dirty snapshot is that you want to see uncommitted
work by other transactions. It sounds like you're saying you just need to see
uncommitted work by the same transaction. If that's true, I think using
HeapTupleSatisfiesSelf would be clearer. Or maybe we just need to put
CommandCounterIncrement() calls in the right places to avoid having the problem
in the first place. Or maybe this is another sign that we're doing the work at
the wrong level.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote:
> On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > The reason why we are using a dirty snapshot here is for the cases where we are
> > recording a dependency on a referenced object that we are creating at the same
> > time behind the scenes (for example, creating a composite type while creating
> > a relation). Without the dirty snapshot, the object we are creating behind
> > the scenes (the composite type) would not be visible and we would wrongly assume
> > that it has been dropped.
>
> The usual reason for using a dirty snapshot is that you want to see
> uncommitted work by other transactions. It sounds like you're saying
> you just need to see uncommitted work by the same transaction.

Right.

> If that's true, I think using HeapTupleSatisfiesSelf would be clearer.

Oh, thanks! I did not know about the SNAPSHOT_SELF snapshot type (I should have
checked all the snapshot types first though) and that's exactly what is needed
here.

Please find attached v8 making use of SnapshotSelf instead of a dirty snapshot.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com