Discussion: The lightbulb just went on...
... with a blinding flash ... The VACUUM funnies I was complaining about before may or may not be real bugs, but they are not what's biting Alfred. None of them can lead to the observed crashes AFAICT.

What's biting Alfred is the code that moves a tuple update chain, lines 1541 ff in REL7_0_PATCHES. This sets up a pointer to a source tuple in "tuple". Then it gets the destination page it plans to move the tuple to, and applies vc_vacpage to that page if it hasn't been done already. But when we're moving a tuple chain, *it is possible for the destination page to be the same as the source page*. Since vc_vacpage applies PageRepairFragmentation, all the live tuples on the page may get moved. Afterwards, tuple.t_data is out of date and pointing at some random chunk of some other tuple. The subsequent copy of the tuple copies garbage, which explains Alfred's several crashes in constructing index entries for the copied tuple (all of which bombed out from the index-build calls at lines 1634 ff, ie, for tuples being moved as part of a chain). Once in a while, the obsolete pointer will be pointing at the real header of a different tuple --- perhaps even the place where we are about to put the copy. This improbable case explains the one observed Assert crash in which a copied tuple's HEAP_MOVED_IN bit mysteriously got turned off. Reason: it was cleared through the old-tuple pointer just after being set via the new-tuple one.

Proof that this is happening can be seen in the core dumps for Alfred's index-construction-crash cases: tuple.t_data does not point at the same place that the tuple.ip_posid'th page line item points at. This could only happen if the page was reshuffled since the tuple pointer was set up. The explanation for the Assert crash is a bit of a leap of faith, but I feel confident that it's right.

The solution is to do everything we're going to do with the source tuple, especially copying it and updating its state, *before* we apply vc_vacpage to the destination page. Then we don't care if the source gets moved during vc_vacpage.

I will prepare a patch along this line and send it to Alfred for testing.

			regards, tom lane
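The hazard and the fix described above can be sketched with a toy model. Everything here (`ToyPage`, `toy_compact`, the slot layout) is invented for illustration; the real structures (`Page`, `HeapTupleData`, `PageRepairFragmentation`) are far more involved. The point is only the ordering: a raw pointer taken into a page is invalid after that page is compacted, so any copy through it must happen first.

```c
#include <string.h>

/* Toy model of a heap page: a few fixed-size slots holding tiny tuples. */
#define NSLOTS 4

typedef struct { int live; char data[8]; } ToyTuple;
typedef struct { ToyTuple slots[NSLOTS]; } ToyPage;

/* Analogue of PageRepairFragmentation: slide live tuples toward the
 * front of the page and wipe the vacated slots. */
static void toy_compact(ToyPage *pg)
{
    int dst = 0;
    for (int src = 0; src < NSLOTS; src++) {
        if (pg->slots[src].live) {
            if (dst != src) {
                pg->slots[dst] = pg->slots[src];
                memset(&pg->slots[src], 0, sizeof(ToyTuple));
            }
            dst++;
        }
    }
}

/* Two live tuples with gaps, so compaction will move things. */
static void toy_init(ToyPage *pg)
{
    memset(pg, 0, sizeof(*pg));
    pg->slots[1].live = 1; strcpy(pg->slots[1].data, "A");
    pg->slots[3].live = 1; strcpy(pg->slots[3].data, "B");
}

/* Buggy ordering (the 7.0.2 behavior): grab a pointer to the source
 * tuple, compact the page (which may be the same page!), then copy
 * through the now-stale pointer.  Returns the stale slot's live flag. */
int toy_compact_then_copy_bug(char out[8])
{
    ToyPage pg;
    toy_init(&pg);
    ToyTuple *tuple = &pg.slots[3];  /* pointer into the page */
    toy_compact(&pg);                /* "B" moves to slot 1; slot 3 is wiped */
    memcpy(out, tuple->data, 8);     /* copies garbage (zeros in this model) */
    return tuple->live;              /* 0: the slot was vacated */
}

/* Fixed ordering: snapshot the source tuple before any reshuffle. */
int toy_copy_then_compact_ok(char out[8])
{
    ToyPage pg;
    toy_init(&pg);
    ToyTuple *tuple = &pg.slots[3];
    memcpy(out, tuple->data, 8);     /* copy first ... */
    toy_compact(&pg);                /* ... then it's safe to compact */
    return 1;
}
```

In the model the stale read deterministically yields a wiped slot; in the real code it yields whatever bytes of whichever tuple landed there, which is exactly why the observed symptoms ranged from garbage index entries to a flag flipped on an unrelated tuple.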
Something to force a v7.0.3 ... ?

On Mon, 16 Oct 2000, Tom Lane wrote:
> ... with a blinding flash ...
> [...]
> I will prepare a patch along this line and send it to Alfred for
> testing.

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
The Hermit Hacker <scrappy@hub.org> writes:
> Something to force a v7.0.3 ... ?

Yes. We had plenty to force a 7.0.3 already, actually, but I was holding off recommending a release in hopes of finding Alfred's problem.

I will get this patch made up tonight for REL7_0; if Alfred doesn't see more failures after running it for a few days, then let's move forward on a 7.0.3 release.

			regards, tom lane
On Mon, 16 Oct 2000, Tom Lane wrote:
> The Hermit Hacker <scrappy@hub.org> writes:
> > Something to force a v7.0.3 ... ?
>
> Yes. We had plenty to force a 7.0.3 already, actually, but I was
> holding off recommending a release in hopes of finding Alfred's
> problem.

I thought so, about having plenty, but when I asked before SF, it sort of fell on deaf ears, so figured you weren't ready yet :)

> I will get this patch made up tonight for REL7_0; if Alfred doesn't
> see more failures after running it for a few days, then let's move
> forward on a 7.0.3 release.

That works for me ... I'm in Montreal for the weekend, so if we can get it out before Thursday, great, else we'll do it on Monday, 'k?
The Hermit Hacker <scrappy@hub.org> writes:
>> I will get this patch made up tonight for REL7_0; if Alfred doesn't
>> see more failures after running it for a few days, then let's move
>> forward on a 7.0.3 release.

> that works for me ... I'm in Montreal for the weekend, so if we can get it
> out before Thursday, great, else we'll do it on Monday, 'k?

I think he was seeing MTBF of several days anyway, so we won't have any confidence that the problem is gone before next week.

			regards, tom lane
Tom:

I think I may have been seeing this problem as well. We were getting crashes very often with 7.0.2 during VACUUMs if activity was going on against our database during the vacuum (even though the activity was light). Our solution in the meantime was to simply disable the applications during a vacuum to avoid any activity during the vacuum, and we have not had a crash on vacuum since then.

If this sounds consistent with the problem you think Alfred is having, then I would be willing to test your patch on our system as well. If you think it would help, feel free to send me the patch and I will do some testing on it for you.

Thanks.
Mike

On Mon, 16 Oct 2000, Tom Lane wrote:
> ... with a blinding flash ...
> [...]
Michael J Schout <mschout@gkg.net> writes:
> ... and we have not had a crash on vacuum since that happened. If this
> sounds consistent with the problem you think Alfred is having, [...]

Yes, it sure does.

The patch I have applies atop a previous change in the REL7_0_PATCHES branch, so what I would recommend is that you pull the current state of the REL7_0_PATCHES branch from our CVS server; then you can test what will shortly become 7.0.3. There are several other critical bug fixes in there since 7.0.2.

Dunno if you know how to use cvs, but the critical steps are explained at http://www.postgresql.org/docs/postgres/x28786.htm. Note that the given recipe will pull the current development tip, which is NOT what you want. In step 3, instead of doing

	co -P pgsql

do

	co -P -r REL7_0_PATCHES pgsql

Then configure and build as usual.

			regards, tom lane
* Michael J Schout <mschout@gkg.net> [001017 08:50] wrote:
> Tom:
>
> I think I may have been seeing this problem as well. [...]
> If you think it would help, feel free to send me the patch and I will
> do some testing on it for you.

I'm not sure if you've been subscribed to this list for long, but it would have been nice if you had spoken up when I initially reported the problems, so that the developers realized this wasn't a completely isolated incident.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
On Tue, 17 Oct 2000, Tom Lane wrote:
> > and we have not had a crash on vacuum since that happened. If this
> > sounds consistent with the problem you think Alfred is having,
>
> Yes, it sure does.
>
> The patch I have applies atop a previous change in the REL7_0_PATCHES
> branch [...]

Hi Tom.

I built from the REL7_0_PATCHES tree yesterday and did some testing on the database. So far no crashes during vacuum like I had been seeing with 7.0.2 :).

I am seeing a different problem (and I have seen this with 7.0.2 as well). If I run vacuum, sometimes this error pops up in the client application during the vacuum:

	ERROR: RelationClearRelation: relation 1668325 modified while in use

Relation 1668325 is a view named "sessions". What happens with sessions is that the client does:

	SELECT session_data, id FROM sessions WHERE id = ? FOR UPDATE

	.... client does some processing ...

	UPDATE sessions SET session_data = ? WHERE id = ?;

(this is where the error happens)

I think part of my problem might be that sessions is a view and not a table, but it is probably a bug that needs to be noted nonetheless. I am going to try converting "sessions" to a table and see if I can still reproduce it that way.

Mike
Michael J Schout <mschout@gkg.net> writes:
> ERROR: RelationClearRelation: relation 1668325 modified while in use
> relation 1668325 is a view named "sessions".

Hm. This message is coming out of the relation cache code when it sees an invalidate-your-cache-for-this-relation message from another backend and the relation in question has already been locked during the current transaction. Probably, what is happening is that the vacuum process is vacuuming the view (not too much to do there ;-) but it does it anyway) and sending out the cache inval message for it after the other client process has already started parsing of a query using the view.

This is a fairly subtle problem that I don't think we will be able to fix as a backpatch for 7.0.*. It's on the to-fix list for 7.1 though.

			regards, tom lane
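The race Tom describes can be reduced to a toy model: a cache entry carries a use count for the current transaction, and an invalidation message that arrives while that count is nonzero cannot simply flush-and-rebuild the entry, so it is reported as an error instead. The names and struct here are invented for illustration; the real logic lives in the backend's relcache code, not in anything this simple.

```c
#include <stddef.h>

/* Toy relation-cache entry.  refcount > 0 means the relation is open
 * (locked) somewhere in the current transaction. */
typedef struct {
    int oid;
    int refcount;
    int valid;
} ToyRelCacheEntry;

/* Toy analogue of handling a shared-invalidation message for a cached
 * relation.  A quiescent entry is just flushed (to be rebuilt on next
 * open); an entry still in use triggers the error Michael saw. */
const char *toy_handle_inval(ToyRelCacheEntry *rel)
{
    if (rel->refcount > 0)
        return "ERROR: RelationClearRelation: relation modified while in use";
    rel->valid = 0;   /* flush; next open rebuilds from the catalogs */
    return "ok: cache entry flushed";
}
```

In the reported failure, VACUUM sends the inval for the "sessions" view after the client's backend has already opened it for the UPDATE, so the `refcount > 0` branch fires.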