Обсуждение: ProcessStandbyHSFeedbackMessage can make global xmin go backwards
ProcessStandbyHSFeedbackMessage has a race condition: it thinks it can call GetOldestXmin and then the result will politely hold still while it considers what to do next. But in fact, whoever has the oldest xmin could exit their transaction, allowing the global minimum to advance. If a VACUUM process then inspects the ProcArray, it could compute an oldest xmin that is newer than the value that ProcessStandbyHSFeedbackMessage installs just afterwards. So much for keeping the data the standby wanted. AFAICS we have to do all the logic about choosing the new value for MyProc->xmin while holding ProcArrayLock, which IMO means that it should go into a function in procarray.c. The fact that walsender.c is taking ProcArrayLock and whacking MyProc->xmin around is already a horrid violation of modularity, even if it weren't incorrect. Also, it seems like using GetOldestXmin is quite wrong here anyway; we don't really want a result modified by vacuum_defer_cleanup_age do we? It looks to me like that would result in vacuum_defer_cleanup_age being applied twice: the walsender process sets its xmin the defer age into the past, and then subsequent calls of GetOldestXmin compute a result that is the defer age further back. IOW this is an independent mechanism that also results in the computed global xmin going backwards. (Now that we have a standby feedback mechanism, I'm a bit tempted to propose getting rid of vacuum_defer_cleanup_age altogether, rather than hacking things to avoid the above.) regards, tom lane
I wrote: > ProcessStandbyHSFeedbackMessage has a race condition: it thinks it can > call GetOldestXmin and then the result will politely hold still while > it considers what to do next. But in fact, whoever has the oldest xmin > could exit their transaction, allowing the global minimum to advance. > If a VACUUM process then inspects the ProcArray, it could compute an > oldest xmin that is newer than the value that > ProcessStandbyHSFeedbackMessage installs just afterwards. So much > for keeping the data the standby wanted. Actually, it's worse than that. Clamping the walsender's xmin to GetOldestXmin() is useless, and if it weren't useless, this code would be completely wrong even discounting the bugs I pointed out already. Consider what it is we're trying to do here: we'd like to prevent VACUUMs on the master from deleting dead rows with xmax newer than the xmin the standby is requesting. Setting the walsender's xmin will only affect VACUUMs launched after that moment; anything that's already running will have already determined its threshold xmin, and perhaps already removed rows. You can't retroactively fix that. So clamping the walsender's xmin to GetOldestXmin doesn't actually do anything for data integrity; it only ensures that the value of GetOldestXmin doesn't go backwards on successive calls. But that's possible anyway, as the comments for that function already note. What's worse is that the only thing that is prevented from going backwards is the value of GetOldestXmin with allDbs = true. But that's only used by vacuums on shared catalogs. If we wanted to prevent the values seen by vacuums on non-shared tables from going backwards, we'd have to do some much-more-complex calculation that identified the largest value of GetOldestXmin with allDbs = false, across all databases. And that would *still* be wrong, because then the walsender would only be protecting data that is newer than the newest such value, which might allow data in other databases to be reclaimed too early. You could only really make this work by maintaining per-database xmins, which the standby isn't sending and there's no place to put into the walsender's ProcArray entry either. So I've concluded that there's just no point in the GetOldestXmin clamping, and we should just apply the xmin value we get from the standby if it passes basic sanity checks (the epoch test), and hope that we're not too late to prevent loss of the data the standby wanted. This line of reasoning also suggests that it's a pretty bad idea for the walsender to invalidate its existing xmin if it gets a transiently bogus (out of epoch) value from the standby. I think it should sit on the xmin it has, instead, to avoid potential loss of needed data. regards, tom lane
Excerpts from Tom Lane's message of jue oct 20 19:20:19 -0300 2011: > So I've concluded that there's just no point in the GetOldestXmin > clamping, and we should just apply the xmin value we get from the > standby if it passes basic sanity checks (the epoch test), and hope that > we're not too late to prevent loss of the data the standby wanted. I am happy that the HS patch has introduced uses for the xid epoch. I had to use them for multixact truncation in the FK locks patch, and was nervous because I wasn't seeing any user in core code other than the txid functions; this lack of callers, added to the comments in GetNextXidAndEpoch ("we do not support such things"), made me feel a bit uncomfortable about using it. I'm happy to have this problem out of my mind now. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Oct 19, 2011 at 6:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ProcessStandbyHSFeedbackMessage has a race condition: it thinks it can > call GetOldestXmin and then the result will politely hold still while > it considers what to do next. But in fact, whoever has the oldest xmin > could exit their transaction, allowing the global minimum to advance. > If a VACUUM process then inspects the ProcArray, it could compute an > oldest xmin that is newer than the value that > ProcessStandbyHSFeedbackMessage installs just afterwards. So much > for keeping the data the standby wanted. > > AFAICS we have to do all the logic about choosing the new value for > MyProc->xmin while holding ProcArrayLock, which IMO means that it should > go into a function in procarray.c. The fact that walsender.c is taking > ProcArrayLock and whacking MyProc->xmin around is already a horrid > violation of modularity, even if it weren't incorrect. > > Also, it seems like using GetOldestXmin is quite wrong here anyway; we > don't really want a result modified by vacuum_defer_cleanup_age do we? > It looks to me like that would result in vacuum_defer_cleanup_age being > applied twice: the walsender process sets its xmin the defer age into > the past, and then subsequent calls of GetOldestXmin compute a result > that is the defer age further back. IOW this is an independent > mechanism that also results in the computed global xmin going backwards. > > (Now that we have a standby feedback mechanism, I'm a bit tempted to > propose getting rid of vacuum_defer_cleanup_age altogether, rather than > hacking things to avoid the above.) curious: are these bugs in production, and what would be the user visible symptoms of seeing them in the wild? merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Wed, Oct 19, 2011 at 6:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ProcessStandbyHSFeedbackMessage has a race condition: it thinks it can >> call GetOldestXmin and then the result will politely hold still while >> it considers what to do next. > curious: are these bugs in production, and what would be the user > visible symptoms of seeing them in the wild? There's no bug so far as data integrity on the master goes. The risk is that you'd see queries failing with replication conflicts on a hot-standby slave even though you thought you'd protected them by setting hot_standby_feedback = on. That would happen if any rows actually got vacuumed despite the standby's attempt to set an xmin that would protect them. This is possible anyway at walsender startup, but I think the logic errors made it more probable. regards, tom lane