Discussion: Thoughts on "killed tuples" index hint bits support on standby
Hello, hackers.

Currently, hint bits in index pages (dead tuples) are set and taken into
account only on the primary server. A standby simply ignores them. This
is done for good reasons, of course (see RelationGetIndexScan and [1]):

    * We do this because the xmin on the primary node could easily be
    * later than the xmin on the standby node, so that what the primary
    * thinks is killed is supposed to be visible on standby. So for correct
    * MVCC for queries during recovery we must ignore these hints and check
    * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to
support "ignore_killed_tuples" on standby.

I hope I know a way to support it correctly with a reasonable amount of
changes.

The first thing we need to consider is that checksums and wal_log_hints
are widely used these days. So, at any moment the master could send an
FPW page with new "killed tuples" hints and overwrite hints set by the
standby. Moreover, it is not possible to distinguish whether hints were
set by the primary or the standby.

And this is where hot_standby_feedback comes into play. The master node
considers the xmin of hot_standby_feedback replicas (RecentGlobalXmin)
while setting "killed tuples" bits. So, if hot_standby_feedback has been
enabled on a standby for a while, it can safely trust hint bits from the
master. Also, the standby can set its own hints using the xmin it sends
to the primary during feedback (but without marking the page as dirty).

Of course it is not all so easy; there are a few things and corner cases
to care about:

* It looks like RecentGlobalXmin can move backwards when a new replica
  with a lower xmin connects (or when some replica switches to
  hot_standby_feedback=on). We must ensure RecentGlobalXmin moves
  strictly forward (a small self-contained sketch of this rule follows
  at the end of this mail).

* hot_standby_feedback can be enabled on the fly. In such a case we need
  to distinguish transactions which are safe or unsafe to use hints
  with. The standby can receive a fresh RecentGlobalXmin as a response
  to a feedback message. All standby transactions with xmin >=
  RecentGlobalXmin are safe to use hints.

* hot_standby_feedback can be disabled on the fly. In that situation the
  standby needs to continue to send feedback while canceling all queries
  with ignore_killed_tuples=true. Once all such queries are canceled,
  feedback is no longer needed and can be disabled.

Could someone validate my thoughts, please? If the idea is mostly
correct, I could try to implement and test it.

[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
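A minimal, self-contained sketch of the "moved strictly forward" rule
from the first bullet above (plain C; the modular XID comparison mirrors
the core of PostgreSQL's TransactionIdPrecedes(), minus the
special-casing of permanent XIDs, and AdvanceKilledTuplesHorizon() is a
hypothetical name):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    /* Modular comparison, as PostgreSQL does for normal XIDs. */
    static int
    TransactionIdPrecedes(TransactionId id1, TransactionId id2)
    {
        int32_t diff = (int32_t) (id1 - id2);

        return diff < 0;
    }

    static TransactionId horizon = 3;   /* pretend RecentGlobalXmin */

    /* Accept a newly computed horizon only if it moves strictly forward. */
    static void
    AdvanceKilledTuplesHorizon(TransactionId newXmin)
    {
        if (TransactionIdPrecedes(horizon, newXmin))
            horizon = newXmin;  /* moved forward: accept */
        /* else: a late-joining replica pulled it backwards: ignore */
    }

    int
    main(void)
    {
        AdvanceKilledTuplesHorizon(10);     /* forward: horizon becomes 10 */
        AdvanceKilledTuplesHorizon(7);      /* backwards: ignored */
        printf("horizon = %u\n", horizon);  /* prints 10 */
        return 0;
    }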
Hi,

On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote:
> The first thing we need to consider is that checksums and wal_log_hints
> are widely used these days. So, at any moment the master could send an
> FPW page with new "killed tuples" hints and overwrite hints set by the
> standby. Moreover, it is not possible to distinguish whether hints were
> set by the primary or the standby.

Note that the FPIs are only going to be sent for the first write to a
page after a checkpoint. I don't think you're suggesting we rely on them
for correctness (this far into the email at least), but it still seems
worthwhile to point out.

> And this is where hot_standby_feedback comes into play. The master node
> considers the xmin of hot_standby_feedback replicas (RecentGlobalXmin)
> while setting "killed tuples" bits. So, if hot_standby_feedback has
> been enabled on a standby for a while, it can safely trust hint bits
> from the master.

Well, not easily. There's no guarantee that the node it reports
hot_standby_feedback to is actually the primary. It could be a cascading
replica setup that doesn't report hot_standby_feedback upstream. Also,
hot_standby_feedback only takes effect while a streaming connection is
active; if that is even temporarily interrupted, the primary will lose
all knowledge of the standby's horizon - unless replication slots are in
use, that is.

Additionally, we also need to handle the case where the replica has
restarted and is recovering using local WAL, and/or archive-based
recovery. In that case the standby could already have sent a *newer*
horizon as feedback upstream, but currently again have an older view. It
is entirely possible that the standby is consistent and queryable in
such a state (if nothing forced disk writes during WAL replay,
minRecoveryLSN will not be set to something too new).

> Also, the standby can set its own hints using the xmin it sends to the
> primary during feedback (but without marking the page as dirty).

We do something similar for heap hint bits already...

> Of course it is not all so easy; there are a few things and corner
> cases to care about:
>
> * It looks like RecentGlobalXmin can move backwards when a new replica
> with a lower xmin connects (or when some replica switches to
> hot_standby_feedback=on). We must ensure RecentGlobalXmin moves
> strictly forward.

I'm not sure this is a problem. If that happens we cannot rely on the
different xmin horizon anyway, because action may have been taken on the
old RecentGlobalXmin. Thus we need to be safe against that anyway.

> * hot_standby_feedback can be enabled on the fly. In such a case we
> need to distinguish transactions which are safe or unsafe to use hints
> with. The standby can receive a fresh RecentGlobalXmin as a response
> to a feedback message. All standby transactions with xmin >=
> RecentGlobalXmin are safe to use hints.
>
> * hot_standby_feedback can be disabled on the fly. In that situation
> the standby needs to continue to send feedback while canceling all
> queries with ignore_killed_tuples=true. Once all such queries are
> canceled, feedback is no longer needed and can be disabled.

I don't think we can rely on hot_standby_feedback at all. We can use it
to avoid unnecessary cancellations, etc., and even assume it's set up
reasonably for some configurations, but there always needs to be an
independent correctness backstop.

I think it might be more promising to improve the kill logic based on
the WAL-logged horizons from the primary.
All I think we need to do is to use a more conservatively computed
RecentGlobalXmin when determining whether tuples are dead. We already
regularly log an xl_running_xacts; adding information about the
primary's horizon to that, and stashing it in shared memory on the
standby, shouldn't be too hard. Then we can make correct, albeit likely
overly pessimistic, visibility determinations about tuples, and go on to
set LP_DEAD.

There are some complexities around how to avoid unnecessary query
cancellations. We'd not want to trigger recovery conflicts based on the
new xl_running_xacts field, as that'd make the conflict rate go through
the roof - but I think we could safely use the logical minimum of the
local RecentGlobalXmin and the primary's. That should allow us to set
additional LP_DEAD bits safely, I believe. We could even rely on those
LP_DEAD bits.

But: I'm less clear on how we can make sure that we can *rely* on
LP_DEAD to skip over entries during scans. The bits set as described
above would be safe, but we can also see LP_DEAD bits set by the primary
(and even by upstream cascading standbys, at least in the case of new
base backups taken from them), due to them not being WAL-logged. As we
don't WAL-log them, there is no conflict associated with the LP_DEADs
being set. My gut feeling is that it's going to be very hard to get
around this without adding WAL logging for _bt_killitems et al
(including an interface for kill_prior_tuple to report the used horizon
to the index).

I'm wondering if we could recycle BTPageOpaqueData.xact to store the
horizon applying to killed tuples on the page. We don't need to store
the level for leaf pages, because we have BTP_LEAF, so we could make
space for that (potentially signalled by a new BTP flag). Obviously we
have to be careful with storing xids in the index, due to potential
wraparound danger - but I think such a page would have to be vacuumed
anyway, before a potential wraparound. I think we could safely unset the
xid during nbtree single-page cleanup, and vacuum, by making sure no
LP_DEAD entries survive, and by including the horizon in the generated
WAL record.

That however still doesn't fully allow us to set LP_DEAD on standbys -
but it'd allow us to take the primary's LP_DEADs into account on a
standby. I think we'd run into torn-page issues if we were to do so
without WAL logging, because we'd rely on the LP_DEAD bits and
BTPageOpaqueData.xact being in sync. I *think* we might be safe to do so
*iff* the page's LSN indicates that there has been a WAL record covering
it since the last redo location.

I only had my first cup of coffee for the day, so I might also just be
entirely off base here.

Greetings,

Andres Freund
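As a rough illustration of the "logical minimum" idea above, reusing the
modular TransactionIdPrecedes() comparison from the previous sketch (the
function name here is invented):

    /*
     * Pick the logically older (more conservative) of the two horizons:
     * the standby-local RecentGlobalXmin and the horizon the primary
     * WAL-logged in xl_running_xacts.
     */
    static TransactionId
    ConservativeKillHorizon(TransactionId localXmin, TransactionId primaryXmin)
    {
        return TransactionIdPrecedes(localXmin, primaryXmin) ? localXmin
                                                             : primaryXmin;
    }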
On Thu, Jan 16, 2020 at 9:54 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think we can rely on hot_standby_feedback at all. We can use
> it to avoid unnecessary cancellations, etc., and even assume it's set
> up reasonably for some configurations, but there always needs to be an
> independent correctness backstop.

+1

> I'm less clear on how we can make sure that we can *rely* on LP_DEAD
> to skip over entries during scans. The bits set as described above
> would be safe, but we can also see LP_DEAD bits set by the primary
> (and even by upstream cascading standbys, at least in the case of new
> base backups taken from them), due to them not being WAL-logged. As we
> don't WAL-log them, there is no conflict associated with the LP_DEADs
> being set. My gut feeling is that it's going to be very hard to get
> around this without adding WAL logging for _bt_killitems et al
> (including an interface for kill_prior_tuple to report the used
> horizon to the index).

I agree.

What about calling _bt_vacuum_one_page() more often than strictly
necessary to avoid a page split on the primary? The B-Tree deduplication
patch sometimes does that, albeit for completely unrelated reasons. (We
don't want to have to unset an LP_DEAD bit in the case when a
new/incoming duplicate tuple has a TID that overlaps with the posting
list range of some existing duplicate posting list tuple.)

I have no idea how you'd determine that it was time to call
_bt_vacuum_one_page(). Seems worth considering.

> I'm wondering if we could recycle BTPageOpaqueData.xact to store the
> horizon applying to killed tuples on the page. We don't need to store
> the level for leaf pages, because we have BTP_LEAF, so we could make
> space for that (potentially signalled by a new BTP flag). Obviously we
> have to be careful with storing xids in the index, due to potential
> wraparound danger - but I think such a page would have to be vacuumed
> anyway, before a potential wraparound.

You would think that, but unfortunately we don't currently do it that
way. We store XIDs in deleted leaf pages that can sometimes be missed
until the next wraparound. We need to do something like commit
6655a7299d835dea9e8e0ba69cc5284611b96f29, but for B-Tree. It's somewhere
on my TODO list.

> I think we could safely unset the xid during nbtree single-page
> cleanup, and vacuum, by making sure no LP_DEAD entries survive, and by
> including the horizon in the generated WAL record.
>
> That however still doesn't fully allow us to set LP_DEAD on standbys -
> but it'd allow us to take the primary's LP_DEADs into account on a
> standby. I think we'd run into torn-page issues if we were to do so
> without WAL logging, because we'd rely on the LP_DEAD bits and
> BTPageOpaqueData.xact being in sync. I *think* we might be safe to do
> so *iff* the page's LSN indicates that there has been a WAL record
> covering it since the last redo location.

That sounds like a huge mess.

--
Peter Geoghegan
Hello again.

Andres, Peter, thanks for your comments.

Some of the issues you mentioned (reporting feedback to another cascaded
standby, processing queries after a restart when a newer xid was already
reported) could be fixed within the proposed design, but your intention
to have an "independent correctness backstop" is the right thing to do.

So, I was thinking about another approach which is:
* still not too tricky to implement
* easy to understand
* does not rely on hot_standby_feedback for correctness, but only for
  efficiency
* could be used with any kind of index
* does not generate a lot of WAL

Let's add a new type of WAL record like "some index killed tuple hint
bits were set according to RecentGlobalXmin=x" (without specifying a
page or even a relation). Let's call 'x' 'LastKilledIndexTuplesXmin' and
track it in standby memory. It is sent only in the case of
wal_log_hints=true. If hints cause an FPW, it is sent before the FPW
record. Also, it is not required to write such WAL every time the
primary marks an index tuple as dead. It should be done only in case
LastKilledIndexTuplesXmin has changed (moved forward).

On the standby, such a record is used to cancel queries. If a
transaction is executing with ignore_killed_tuples==true (set at
snapshot creation) and its xmin is less than the received
LastKilledIndexTuplesXmin - just cancel the query (because it could rely
on an invalid hint bit). A minimal sketch of this check follows below.

So, technically it should be correct to use hints received from the
master to skip tuples according to MVCC, but "the conflict rate goes
through the roof". To avoid any real conflicts the standby sets

    ignore_killed_tuples = (hot_standby_feedback is on)
        AND (wal_log_hints is on on the primary)
        AND (standby's new snapshot xmin >= last LastKilledIndexTuplesXmin received)
        AND (hot_standby_feedback is reported directly to the master).

So, the hot_standby_feedback loop effectively eliminates any conflicts
(because LastKilledIndexTuplesXmin is technically RecentGlobalXmin in
such a case). But if feedback is broken for some reason, the query
cancellation logic will keep everything safe.

For correctness, LastKilledIndexTuplesXmin (and, as a consequence,
RecentGlobalXmin) should be moved only forward.

To set killed bits on the standby, we should check tuple visibility
according to the last LastKilledIndexTuplesXmin received. That is just
like how the master sets these bits according to its own state - so it
is even safe to transfer them to another standby.

Does it look better now?

Thanks,
Michail.
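The cancellation rule above boils down to a single comparison. A minimal
sketch, assuming the TransactionIdPrecedes() helper from the earlier
sketch; all other names are hypothetical:

    #include <stdbool.h>

    typedef struct SnapshotSketch
    {
        TransactionId xmin;                 /* snapshot's xmin */
        bool          ignoreKilledTuples;   /* fixed at snapshot creation */
    } SnapshotSketch;

    /* Updated whenever the new WAL record type is replayed. */
    static TransactionId LastKilledIndexTuplesXmin;

    /*
     * A query must be cancelled if its snapshot trusts hint bits that
     * may have been set under a newer horizon than the snapshot's xmin.
     */
    static bool
    SnapshotMustBeCancelled(const SnapshotSketch *snap)
    {
        return snap->ignoreKilledTuples &&
               TransactionIdPrecedes(snap->xmin, LastKilledIndexTuplesXmin);
    }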
Hi Michail,

On Fri, Jan 24, 2020 at 6:16 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Some of the issues you mentioned (reporting feedback to another
> cascaded standby, processing queries after a restart when a newer xid
> was already reported) could be fixed within the proposed design, but
> your intention to have an "independent correctness backstop" is the
> right thing to do.

Attached is a very rough POC patch of my own, which makes item deletion
occur "non-opportunistically" in unique indexes. The idea is that we
exploit the uniqueness property of unique indexes to identify "version
churn" from non-HOT updates. If any single value on a leaf page has
several duplicates, then there is a good chance that we can safely
delete some of them. It's worth going to the heap to check whether
that's safe selectively, at the point where we'd usually have to split
the page. We only have to free one or two items to avoid splitting the
page. If we can avoid splitting the page immediately, we may well avoid
splitting it indefinitely, or forever.

This approach seems to be super effective. It can leave the PK on
pgbench_accounts in pristine condition (no page splits) after many hours
with a pgbench-like workload that makes all updates non-HOT updates.
Even with many clients, and a skewed distribution. Provided the index
isn't tiny to begin with, we can always keep up with controlling index
bloat -- once the client backends themselves begin to take active
responsibility for garbage collection, rather than just treating it as a
nice-to-have. I'm pretty sure that I'm going to be spending a lot of
time developing this approach, because it really works.

This seems fairly relevant to what you're doing. It makes almost all
index cleanup occur using the new delete infrastructure for some of the
most interesting workloads where deletion takes place in client
backends. In practice, a standby will almost be in the same position as
the primary in a workload that this approach really helps with, since
setting the LP_DEAD bit itself doesn't really need to happen (we can go
straight to deleting the items in the new deletion path).

To address the questions you've asked: I don't really like the idea of
introducing new rules around tuple visibility and WAL logging to set
more LP_DEAD bits like this at all. It seems very complicated. I suspect
that we'd be better off introducing ways of making the actual deletes
occur sooner on the primary, possibly much sooner, avoiding any need for
special infrastructure on the standby. This is probably not limited to
the special unique index case that my patch focuses on -- we can
probably push this general approach forward in a number of different
ways. I just started with unique indexes because that seemed most
promising. I have only worked on the project for a few days. I don't
really know how it will evolve.

--
Peter Geoghegan
Hello, Peter.

Thanks for your feedback.

> Attached is a very rough POC patch of my own, which makes item
> deletion occur "non-opportunistically" in unique indexes. The idea is
> that we exploit the uniqueness property of unique indexes to identify
> "version churn" from non-HOT updates. If any single value on a leaf
> page has several duplicates, then there is a good chance that we can
> safely delete some of them. It's worth going to the heap to check
> whether that's safe selectively, at the point where we'd usually have
> to split the page. We only have to free one or two items to avoid
> splitting the page. If we can avoid splitting the page immediately, we
> may well avoid splitting it indefinitely, or forever.

Yes, it is a brilliant idea to use uniqueness to avoid bloating the
index. I am not able to understand all the code now, but I'll check the
patch in more detail later.

> This seems fairly relevant to what you're doing. It makes almost all
> index cleanup occur using the new delete infrastructure for some of
> the most interesting workloads where deletion takes place in client
> backends. In practice, a standby will almost be in the same position
> as the primary in a workload that this approach really helps with,
> since setting the LP_DEAD bit itself doesn't really need to happen (we
> can go straight to deleting the items in the new deletion path).
>
> This is probably not limited to the special unique index case that my
> patch focuses on -- we can probably push this general approach forward
> in a number of different ways. I just started with unique indexes
> because that seemed most promising. I have only worked on the project
> for a few days. I don't really know how it will evolve.

Yes, it is relevant, but I think it is "located in a different plane"
and they complement each other. Most indexes are not unique these days,
and most standbys (and even primaries) have long snapshots (up to
minutes or hours) - so multiple versions of index records are still
required for them. Even if we could avoid multiple versions somehow, it
could lead to a very high rate of query cancellations on standby.

> To address the questions you've asked: I don't really like the idea of
> introducing new rules around tuple visibility and WAL logging to set
> more LP_DEAD bits like this at all. It seems very complicated.

I don't think it is too complicated. I have polished the idea a little
and now it even looks elegant to me :)

I'll try to explain the concept briefly (there are no new visibility
rules, and no changes to set more LP_DEAD bits than now - everything is
based on well-tested mechanics):

1) There is a horizon of xmin values the primary currently pushes to a
standby. All the standby's snapshots are required to satisfy this
horizon to access the heap and indexes. This is done by
*ResolveRecoveryConflictWithSnapshot* and corresponding WAL records (for
example, XLOG_BTREE_DELETE).

2) We could introduce a new WAL record (named XLOG_INDEX_HINT in the
patch for now) to define the horizon of xmin required for a standby's
snapshot to use LP_DEAD bits in indexes.

3) The master sends XLOG_INDEX_HINT in case it sets an LP_DEAD bit on an
index page (but before the possible FPW caused by hints) by calling
*LogIndexHintIfNeeded*. It is required to send such a record only if the
new xmin value is greater than the one sent before. I made tests to
estimate the amount of new WAL - it is really small (especially compared
to the FPW writes done because the LP_DEAD bit was set).
4) The new XLOG_INDEX_HINT contains only a database id and the value of
*latestIndexHintXid* (the new horizon position). For simplicity, the
primary could just set the horizon to *RecentGlobalXmin*. But for now,
in the patch, the horizon value is extracted from the heap in
*HeapTupleIsSurelyDead* to reduce the number of XLOG_INDEX_HINT records
even more.

5) There is a new field in the PGPROC structure -
*indexIgnoreKilledTuples*. If it is set to true, standby queries are
going to use LP_DEAD bits in index scans. In such a case the snapshot is
required to satisfy the new LP_DEAD horizon pushed by XLOG_INDEX_HINT
records. It is done by the same mechanism as used for the heap -
*ResolveRecoveryConflictWithSnapshot*.

6) The major thing here: it is safe to set *indexIgnoreKilledTuples* to
both 'true' and 'false' from the perspective of correctness. It is just
a kind of performance compromise - use LP_DEAD bits but be aware of the
XLOG_INDEX_HINT horizon, or vice versa.

7) What is the way to make the right decision about this compromise? It
is pretty simple - if hot_standby_feedback is on, and the primary has
confirmed that our feedback was received, then set
*indexIgnoreKilledTuples* to 'true' - since, while feedback is working
as expected, the query will never be canceled by the XLOG_INDEX_HINT
horizon! (A sketch of this decision follows at the end of this mail.)

8) To support cascading standby setups (with a possible break in the
feedback chain), an additional byte is added to the 'keep-alive' message
of the feedback protocol.

9) So, at this point we are safe to use LP_DEAD bits received from the
primary whenever we want to.

10) What about setting LP_DEAD bits on the standby? The main thing here:
*RecentGlobalXmin* on the standby is always lower than the
XLOG_INDEX_HINT horizon by definition - the standby is always behind the
primary. So, if something looks dead on the standby, it is definitely
dead on the primary.

11) Even if:
* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
- nothing can go wrong here. Because *HeapTupleIsSurelyDead* (and an
index LP_DEAD bit as a result) needs *HEAP* hint bits to be already set
on the standby. So, the same code decides whether to set hint bits in
the heap (this has been done on standbys for a long time already) and in
the index.

So, the only thing we pay is a few additional bytes of WAL and some
moderate additional code complexity. But support for hint bits on
standby is a huge advantage for many workloads. I was able to get more
than a 1000% performance boost (and it is not surprising - index hint
bits are just a great optimization). And it works for almost all index
types out of the box.

Another major thing here: everything is based on old, well-tested
mechanics - query cancellation because of snapshot conflicts, and
setting heap hint bits on standby. Most of the patch consists of
technical changes to support new query cancellation counters, a new WAL
record, a new PGPROC field, and so on.

There are some places I am not sure about yet, the naming is bad - it is
still a POC.

Thanks,
Michail.
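Sketching points 6-8 in plain C; all names here are hypothetical, and
the real patch presumably tracks these flags elsewhere:

    #include <stdbool.h>

    /*
     * Using LP_DEAD bits on a standby is always *correct* under this
     * scheme (point 6); this only decides whether it is also *cheap*,
     * i.e. unlikely to be punished by XLOG_INDEX_HINT cancellations.
     */
    static bool
    DecideIndexIgnoreKilledTuples(bool hotStandbyFeedbackOn,
                                  bool feedbackConfirmedByPrimary, /* point 7 */
                                  bool feedbackChainUnbroken)      /* point 8 */
    {
        return hotStandbyFeedbackOn &&
               feedbackConfirmedByPrimary &&
               feedbackChainUnbroken;
    }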
On Wed, Apr 8, 2020 at 5:23 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Yes, it is a brilliant idea to use uniqueness to avoid bloating the
> index. I am not able to understand all the code now, but I'll check
> the patch in more detail later.

The design is probably simpler than you imagine. The algorithm tries to
be clever about unique indexes in various ways, but don't get distracted
by those details. At a high level we're merely performing atomic actions
that can already happen, and already use the same high level rules.
There is LP_DEAD bit setting that's similar to what often happens in
_bt_check_unique() already, plus there is a new way in which we can call
_bt_delitems_delete(). Importantly, we're only changing the time and the
reasons for doing these things.

> Yes, it is relevant, but I think it is "located in a different plane"
> and they complement each other. Most indexes are not unique these
> days, and most standbys (and even primaries) have long snapshots (up
> to minutes or hours) - so multiple versions of index records are still
> required for them. Even if we could avoid multiple versions somehow,
> it could lead to a very high rate of query cancellations on standby.

I admit that I didn't really understand what you were trying to do
initially. I now understand a little better.

I think that there is something to be said for calling
_bt_delitems_delete() more frequently, not necessarily just for the
special case of unique indexes. That might also help on standbys, at the
cost of more recovery conflicts. I admit that it's unclear how much that
would help with the cases that you seem to really care about.

I'm not going to argue that the kind of thing I'm talking about
(actually doing deletion more frequently on the primary) is truly a
replacement for your patch, even if it was generalized further than my
POC patch -- it is not a replacement. At best, it is a simpler way of
"sending the LP_DEAD bits on the primary to the standby sooner". Even
still, I cannot help but wonder how much value there is in just doing
this much (or figuring out some way to make LP_DEAD bits from the
primary usable on the standby). That seems like a far less risky
project, even if it is less valuable to some users.

Let me make sure I understand your position:

You're particularly concerned about cases where there are relatively few
page splits, and the standby has to wait for VACUUM to run on the
primary before dead index tuples get cleaned up. The primary itself
probably has no problem with setting LP_DEAD bits to avoid having index
scans visiting the heap unnecessarily. Or maybe the queries are
different on the standby anyway, so it matters to the standby that
certain index pages get LP_DEAD bits set quickly, though not to the
primary (at least not for the same pages). Setting the LP_DEAD bits on
the standby (in about the same way as we can already on the primary) is
a "night and day" level difference. And we're willing to account for
FPIs on the primary (and the LP_DEAD bits set there) just to be able to
also set LP_DEAD bits on the standby.

Right?

--
Peter Geoghegan
Hello, Peter.

> Let me make sure I understand your position:
>
> You're particularly concerned about cases where there are relatively
> few page splits, and the standby has to wait for VACUUM to run on the
> primary before dead index tuples get cleaned up. The primary itself
> probably has no problem with setting LP_DEAD bits to avoid having
> index scans visiting the heap unnecessarily. Or maybe the queries are
> different on the standby anyway, so it matters to the standby that
> certain index pages get LP_DEAD bits set quickly, though not to the
> primary (at least not for the same pages). Setting the LP_DEAD bits on
> the standby (in about the same way as we can already on the primary)
> is a "night and day" level difference.
>
> Right?

Yes, exactly. My initial attempts were too naive (the first and second
letters) - but you and Andres gave me some hints on how to make it
reliable.

The main goal is to make the standby able to use and set LP_DEAD bits
almost the way the primary does. Of course, the standby could receive
LP_DEAD with an FPI from the primary at any moment - so some kind of
cancellation logic is required. Also, we should keep the frequency of
query cancellation at the same level - for that reason LP_DEAD bits are
better used only by standbys with hot_standby_feedback enabled. So, I am
just repeating myself from the previous letter here.

> And we're willing to account for FPIs on the primary (and the LP_DEAD
> bits set there) just to be able to also set LP_DEAD bits on the
> standby.

Yes. Metaphorically speaking, the master sends a WAL record with the
letter: "Attention, it is possible to receive an FPI from me with
LP_DEAD set for a tuple with xmax=ABCD; so, if you are using LP_DEAD,
your xmin should be greater, or you should cancel yourself". And such a
letter is required only when this horizon moves forward.

And... it looks like it works - queries are much faster, results look
correct, additional WAL traffic is low, cancellations stay at the same
level... As far as I can see, the basic concept is correct and effective
(but of course, I could be missing something).

The patch is hard to look into - I'll try to split it into several
patches later. And of course, a lot of polishing is required (and there
are a few places I am not sure about yet).

Thanks,
Michail.
Hello, hackers.
Sorry for necroposting, but in case someone is interested - I hope the patch is ready now; it is available in the other thread [1].
Thanks,
Michail.
[1] https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
On Wed, Jan 27, 2021 at 11:30 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Sorry for necroposting, but in case someone is interested - I hope the
> patch is ready now; it is available in the other thread [1].

I wonder if it would help to not actually use the LP_DEAD bit for this.
Instead, you could use the currently-unused-in-indexes LP_REDIRECT bit.

The general idea here is that this status bit is interpreted as a
"recently dead" bit in nbtree. It is always something that is true
*relative* to some *ephemeral* coarse-grained definition of recently
dead. Whether or not "recently dead" means "dead to my particular MVCC
snapshot" can be determined using some kind of in-memory state that
won't survive a crash (or a per-index-page epoch?).

We still have LP_DEAD bits in this world, which work in the same way as
now (and so unambiguously indicate that the index tuple is dead-to-all,
at least on the primary). I think that this idea of a "recently dead"
bit might be able to accomplish a few goals all at once, including your
specific goal of setting "hint bits" in indexes. The issue here is that
it would also be nice to use a "recently dead" bit on the primary, but
if you do that then maybe you're back to the original problem. OTOH, I
also think that we could repurpose the LP_NORMAL bit in index AMs, so we
could potentially have 3 different definitions of dead-ness without
great difficulty!

Perhaps this doesn't seem all that helpful, since I am expanding scope
here for a project that is already very difficult. And maybe it just
isn't helpful -- I don't know. But it is at least possible that
expanding scope could actually *help* your case a lot, even if you only
ever care about your original goal. My personal experience with nbtree
patches is just that: sometimes *expanding* scope actually makes things
*easier*, not harder. This is sometimes called "The Inventor's Paradox":

https://en.wikipedia.org/wiki/Inventor%27s_paradox

Consider the example of my work on nbtree in PostgreSQL 12. It was
actually about 6 or 7 enhancements, not just 1 big enhancement -- it is
easy to forget that now. I think that it was *necessary* to add at least
5 of these enhancements at the same time (maybe 2 or so really were
optional). This is deeply counter-intuitive, but still seems to be true
in my case. The specific reasons why I believe that this is true of the
PostgreSQL 12 work are complicated, but it boils down to this: the ~5
related-though-distinct enhancements were individually not all that
compelling (certainly not compelling enough to make on-disk changes
for), but were much greater than the sum of their parts when considered
together. Maybe I got lucky there.

More generally, the nbtree stuff that I worked on in 12, 13, and now 14
actually feels like one big project to me. I will refrain from
explaining exactly why that is right now, but you might be very
surprised at how closely related it all is. I didn't exactly plan it
that way, but trying to see things in the most general terms turned out
to be a huge asset to me. If there are several independent reasons to
move in one particular direction all at once, you can generally afford
to be wrong about a lot of things without it truly harming anybody. Plus
it's easy to see that you will not frustrate future work that is closely
related but distinct when that future work is *directly enabled* by what
you're doing.

What's more, the extra stuff I'm talking about probably has a bunch of
other benefits on the primary, if done well.
Consider how the deletion stuff with LP_DEAD bits now considers "extra"
index tuples to delete when they're close by. We could even do something
like that with these LP_REDIRECT/recently-dead bits on the primary.

I understand that it's hard to have a really long term outlook. I think
that that's almost necessary when working on a project like this,
though. Don't forget that this works both ways. Maybe a person that
wanted to do this "recently dead" stuff (which sounds really interesting
to me right now) would similarly be compelled to consider the bigger
picture, including the question of setting hint bits on standbys -- this
other person had better not make that harder in the future, for the same
reasons (obviously what you want to do here makes sense, it just isn't
everything to everybody). This is yet another way in which expanding
scope can help.

--
Peter Geoghegan
On Wed, Jan 27, 2021 at 5:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The issue here is that it would also be nice to use a "recently dead"
> bit on the primary, but if you do that then maybe you're back to the
> original problem. OTOH, I also think that we could repurpose the
> LP_NORMAL bit in index AMs, so we could potentially have 3 different
> definitions of dead-ness without great difficulty!

To be clear, what I mean is that you currently have two bits in line
pointers. In an nbtree leaf page we only currently use one -- the
LP_DEAD bit. But bringing the second bit into it allows us to have a
representation for two additional states (not just one), since of course
the meaning of the second bit can be interpreted using the LP_DEAD bit.

You could for example have encodings for each of the following distinct
per-LP states, while still preserving on-disk compatibility: "Not known
to be dead in any sense" (0), "Unambiguously dead to all" (what we now
simply call LP_DEAD), "recently dead on standby" (currently-spare bit is
set), and "recently dead on primary" (both 'lp_flags' bits set).

Applying FPIs on the standby would have to be careful to preserve a
standby-only bit. I'm probably not thinking of every subtlety, but
"putting all of the pieces of the puzzle together for further
consideration" is likely to be a useful exercise.

--
Peter Geoghegan
Hello, Peter.
> I wonder if it would help to not actually use the LP_DEAD bit for
> this. Instead, you could use the currently-unused-in-indexes
> LP_REDIRECT bit.
Hm... Sounds very promising - an additional bit is a lot in this situation.
> Whether or not "recently dead" means "dead to my
> particular MVCC snapshot" can be determined using some kind of
> in-memory state that won't survive a crash (or a per-index-page
> epoch?).
Do you have any additional information about this idea? (maybe some thread) What kind of "in-memory state that won't survive a crash" do you mean, and how would we deal with bits that were flushed before the crash?
> "Not known to be dead in any sense" (0), "Unambiguously dead to all"
> (what we now simply call LP_DEAD), "recently dead on standby"
> (currently-spare bit is set), and "recently dead on primary" (both
> 'lp_flags' bits set).
Hm. What about this way:
10 - dead to all on standby (LP_REDIRECT)
11 - dead to all on primary (LP_DEAD)
01 - future “recently DEAD” on primary (LP_NORMAL)
In such a case the standby could just always ignore LP_DEAD bits set by the primary (the standby will lose its own hints after an FPI - but that is not a big deal). So, we don't need any conflict resolution (or any additional WAL records). Also, the hot_standby_feedback-related stuff is not required anymore. All we need to do (leaving out the details) is to correctly check whether it is safe to set LP_REDIRECT on the standby according to `minRecoveryPoint` (to avoid consistency issues during crash recovery). Or, probably, introduce some kind of `indexHintMinRecoveryPoint`.

Also, it looks like both GiST and HASH indexes do not use LP_REDIRECT either.
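For reference, the two lp_flags bits already encode four states, and the
actual values in src/include/storage/itemid.h line up exactly with the
bit patterns listed above; only the comments describing the proposed
reinterpretation are new:

    #define LP_UNUSED    0  /* 00: unused (as today)                           */
    #define LP_NORMAL    1  /* 01: proposed: future "recently dead" on primary */
    #define LP_REDIRECT  2  /* 10: proposed: dead to all, set on standby       */
    #define LP_DEAD      3  /* 11: dead to all, set on primary (as today)      */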
So, it will remove more than 80% of the current patch complexity!
Also, btw, do you know any reason to keep minRecoveryPoint at a low value? It already blocks the standby from setting hint bits in the *heap*. And it will probably block the standby from setting *index* hint bits aggressively.
Thanks a lot,
Michail.
On Thu, Jan 28, 2021 at 10:16 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> > I wonder if it would help to not actually use the LP_DEAD bit for
> > this. Instead, you could use the currently-unused-in-indexes
> > LP_REDIRECT bit.
>
> Hm... Sounds very promising - an additional bit is a lot in this
> situation.

Yeah, it would help a lot. But those bits are precious. So it makes
sense to think about what to do with both of them in index AMs at the
same time. Otherwise we risk missing some important opportunity.

> > Whether or not "recently dead" means "dead to my particular MVCC
> > snapshot" can be determined using some kind of in-memory state that
> > won't survive a crash (or a per-index-page epoch?).
>
> Do you have any additional information about this idea? (maybe some
> thread) What kind of "in-memory state that won't survive a crash" do
> you mean, and how would we deal with bits that were flushed before the
> crash?

Honestly, that part wasn't very well thought out. A lot of things might
work. Some kind of "recently dead" bit is easier on the primary.

If we have recently dead bits set on the primary (using a dedicated LP
bit for original-execution recently-dead-ness), then we wouldn't even
necessarily have to change anything about index scans/visibility at all.
There would still be a significant benefit if we simply used the
recently dead bits when considering which heap blocks nbtree simple
deletion will visit inside _bt_deadblocks() -- in practice there would
probably be no real downside to assuming that the recently dead bits are
now fully dead (it would sometimes be wrong, but not enough to matter,
probably only when there is a snapshot held for way, way too long).
Deletion in indexes can work well while starting off with only an
*approximate* idea of which index tuples will be safe to delete -- this
is a high-level idea behind my recent commit d168b666823. It seems very
possible that that could be pushed even further here on the primary.

On standbys (which set standby recently dead bits) it will be different,
because you need "index hint bits" set that are attuned to the workload
on the standby, and because you don't ever use the bit to help with
deleting anything on the standby (that all happens during original
execution).

BTW, what happens when the page splits on the primary? Does your patch
"move over" the LP_DEAD bits to each half of the split?

> Hm. What about this way:
>
> 10 - dead to all on standby (LP_REDIRECT)
> 11 - dead to all on primary (LP_DEAD)
> 01 - future "recently DEAD" on primary (LP_NORMAL)

Not sure.

> Also, it looks like both GiST and HASH indexes do not use LP_REDIRECT
> either.

Right -- if we were to do this, the idea would be that it would apply to
all index AMs that currently have (or will ever have) something like the
LP_DEAD bit stuff. The GiST and hash support for index deletion is
directly based on the original nbtree version, and there is no reason
why we cannot eventually do all this stuff in at least those three AMs.

There are already some line-pointer level differences in index AMs:
LP_DEAD items have storage in index AMs, but not in heapam. This
all-table-AMs/all-index-AMs divide in how item pointers work would be
preserved.

> Also, btw, do you know any reason to keep minRecoveryPoint at a low
> value?

Not offhand.

--
Peter Geoghegan
Hello, Peter.
> Yeah, it would help a lot. But those bits are precious. So it makes
> sense to think about what to do with both of them in index AMs at the
> same time. Otherwise we risk missing some important opportunity.
Hm. I was trying to "expand the scope" as you said and got an idea... Why should we do any conflict resolution for hint bits at all? Or share the precious LP bits?

The only way a standby could get an "invalid" hint bit is via an FPI from the primary. We could just use the current *btree_mask* infrastructure and clear all "probably invalid" bits received from the primary on a standby while applying an FPI (in `XLogReadBufferForRedoExtended`)!

For binary compatibility, we could use one of the `btpo_flags` bits to mark the page as "primary bits masked". The same approach would work for hash/gist too.

No conflicts, only the LP_DEAD bit is used by the standby, `btpo_flags` has many free bits for now, it is easy to implement, and the page content of primary/standby already differs in these bits anyway...

Looks like an easy and effective solution to me. (A sketch of the masking step follows below.)
What do you think?
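A sketch of that masking step against PostgreSQL's page and item macros;
the BTP_PRIMARY_HINTS_MASKED flag and the function itself are invented
for illustration (the real hook would live in the FPI-apply path):

    #include "postgres.h"
    #include "access/nbtree.h"

    #define BTP_PRIMARY_HINTS_MASKED (1 << 12)  /* hypothetical btpo_flags bit */

    /*
     * Hypothetical: run once while applying an FPI on a hot standby, so
     * the standby never trusts LP_DEAD bits it did not set itself.
     */
    static void
    MaskPrimaryKilledTuples(Page page)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber off,
                    maxoff = PageGetMaxOffsetNumber(page);

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId      itemid = PageGetItemId(page, off);

            if (ItemIdIsDead(itemid))
                itemid->lp_flags = LP_NORMAL;   /* drop the primary's hint */
        }
        opaque->btpo_flags |= BTP_PRIMARY_HINTS_MASKED;
    }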
>> Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
> Not offhand.
If so, it looks like it is not a bad idea to move minRecoveryPoint forward from time to time (for more aggressive standby index hint bits). On each `xl_running_xacts` (roughly every 15s), for example - see the fragment below.
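For illustration, that could be as small as the following fragment in
standby_redo() - assuming UpdateMinRecoveryPoint() (currently private to
xlog.c) were exported; the placement is hypothetical:

    /* Hypothetical: in standby_redo(), while replaying XLOG_RUNNING_XACTS */
    if (info == XLOG_RUNNING_XACTS)
    {
        /* ... existing processing of the running-xacts snapshot ... */

        /*
         * Drag minRecoveryPoint up to this record, so that hint bits set
         * before this LSN can survive a crash of the standby.
         */
        UpdateMinRecoveryPoint(record->EndRecPtr, false);
    }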
> BTW, what happens when the page splits on the primary, btw? Does your
> patch "move over" the LP_DEAD bits to each half of the split?
That part is not changed in any way.
Thanks,
Michail.
On Sat, Jan 30, 2021 at 9:11 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> > Yeah, it would help a lot. But those bits are precious. So it makes
> > sense to think about what to do with both of them in index AMs at
> > the same time. Otherwise we risk missing some important opportunity.
>
> Hm. I was trying to "expand the scope" as you said and got an idea...
> Why should we do any conflict resolution for hint bits at all? Or
> share the precious LP bits?

What does it mean today when an LP_DEAD bit is set on a standby? I don't
think that it means nothing at all -- at least not if you think about it
in the most general terms. In a way it actually means something like
"recently dead" even today, at least in one narrow specific sense:
recovery may end, and then we actually *will* do *something* with the
LP_DEAD bit, without directly concerning ourselves with when or how each
LP_DEAD bit was set.

During recovery, we will probably always have to consider the
possibility that LP_DEAD bits that get set on the primary may be
received by a replica through some implementation detail (e.g. LP_DEAD
bits are set in FPIs we replay, maybe even some other thing that neither
of us has thought of). We can't really mask LP_DEAD bits from the
primary in recovery anyway, because of stuff like page-level checksums.
I suspect that it just isn't useful to fight against that. These
preexisting assumptions are baked into everything already. Why should we
assume that *anybody* understands all of the ways in which that is true?

Even today, "LP_DEAD actually means a limited kind of 'recently dead'
during recovery + hot standby" is something that is true (as I just went
into), but at the same time also has a fuzzy definition. My gut instinct
is to be conservative about that. As I suggested earlier, you could
invent some distinct kind of "recently dead" that achieves your goals in
a way that is 100% orthogonal.

The two unused LP dead bits (unused in indexes, though not tables) are
only precious if we assume that somebody will eventually use them for
something -- if nobody ever uses them then they're 100% useless. The
number of possible projects that might end up using the two bits for
something is not that high -- certainly not more than 3 or 4. Besides,
it is always a good idea to keep the on-disk format as simple and
explicit as possible -- it should be easy to analyze forensically in the
event of bugs or some other kind of corruption, especially by using
tools like amcheck.

As I said in my earlier email, we can even play tricks during page
deletion by treating certain kinds of "recently dead" bits as
approximate things. As a further example, we can "rely" on the
"dead-to-all but only on standby" bits when recovery ends, during
subsequent write transactions. We can do so by simply using them in
_bt_deadblocks() as if they were LP_DEAD (we'll recheck heap tuples in
heapam.c instead).

> The only way a standby could get an "invalid" hint bit is via an FPI
> from the primary. We could just use the current *btree_mask*
> infrastructure and clear all "probably invalid" bits received from the
> primary on a standby while applying an FPI (in
> `XLogReadBufferForRedoExtended`)!

I don't like that idea. Apart from what I said already, you're assuming
that setting LP_DEAD bits in indexes on the primary won't eventually
have some value on the standby after it is promoted and can accept
writes -- they really do have meaning and value on standbys.
Plus you'll introduce new work for this process during replay, which
creates significant overhead -- often most leaf pages have some LP_DEAD
bits set during recovery.

> For binary compatibility, we could use one of the `btpo_flags` bits to
> mark the page as "primary bits masked". The same approach would work
> for hash/gist too.

I agree that there are plenty of btpo_flags bits. However, I have my
doubts about this. Why shouldn't this break page-level checksums (or
wal_log_hints) in some way? What about pg_rewind, some eventual
implementation of incremental backups, etc? I suspect that it will be
necessary to invent some entirely new concept that is like a hint bit,
but works on standbys (without breaking anything else).

> No conflicts, only the LP_DEAD bit is used by the standby,
> `btpo_flags` has many free bits for now, it is easy to implement, and
> the page content of primary/standby already differs in these bits
> anyway...
>
> Looks like an easy and effective solution to me.

Note that the BTP_HAS_GARBAGE flag (which goes in btpo_flags) was
deprecated in commit cf2acaf4. It was never a reliable indicator of
whether or not some LP_DEAD bits are set in the page. And I think that
it never could be made to work like that.

> >> Also, btw, do you know any reason to keep minRecoveryPoint at a low
> >> value?
> > Not offhand.
>
> If so, it looks like it is not a bad idea to move minRecoveryPoint
> forward from time to time (for more aggressive standby index hint
> bits). On each `xl_running_xacts` (roughly every 15s), for example.

It's necessary for restartpoints (i.e. the standby equivalent of a
checkpoint) to do that in order to ensure that we won't break checksums.
This whole area is scary to me:

https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com

Since, as I said, it's already true that LP_DEAD bits on standbys are
some particular kind of "recently dead" bit, even today, any design that
uses LP_DEAD bits in some novel new way (also on the standby) is very
hard to test. It might easily have very subtle bugs -- obviously a
recently dead bit relates to a tuple pointing to a logical row that is
bound to become dead soon enough. The difference between dead and
recently dead is already blurred, and I would rather not go there.

If you invent some entirely new category of standby-only hint bit at a
level below the access method code, then you can use it inside access
method code such as nbtree. Maybe you don't have to play games with
minRecoveryPoint in code like the "if (RecoveryInProgress())" path from
the XLogNeedsFlush() function. Maybe you can do some kind of rudimentary
"masking" for the in-recovery case at the point that we *write out* a
buffer (*not* at the point hint bits are initially set) -- maybe this
could happen near to or inside FlushBuffer(), and maybe only when
checksums are enabled? I'm unsure.

> > BTW, what happens when the page splits on the primary? Does your
> > patch "move over" the LP_DEAD bits to each half of the split?
>
> That part is not changed in any way.

Maybe it's okay to assume that it's no loss to throw away hint bits set
on the standby, because in practice deletion on the primary will usually
do the right thing anyway. But you never said that. I think that you
should take an explicit position on this question -- make it a formal
part of your overall design.

--
Peter Geoghegan
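The write-out-time variant suggested near the end of the mail above
might look roughly like this; MaskStandbyOnlyBits() is an invented
helper, while RecoveryInProgress(), DataChecksumsEnabled() and
PageSetChecksumCopy() mirror what FlushBuffer() in bufmgr.c already
uses:

    /* Hypothetical fragment, modelled on FlushBuffer() in bufmgr.c */
    static char *
    PrepareBufferForWrite(Page page, BlockNumber blocknum)
    {
        /*
         * Mask the standby-only bits just before the checksum is
         * computed, so the on-disk image never depends on them.
         */
        if (RecoveryInProgress() && DataChecksumsEnabled())
            MaskStandbyOnlyBits(page);      /* invented helper */

        /* As in bufmgr.c: the checksum is computed on a copy at write time. */
        return PageSetChecksumCopy(page, blocknum);
    }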
On Sat, Jan 30, 2021 at 5:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in-recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

I should point out that hint bits in heap pages are really not like
LP_DEAD bits in indexes -- if they're similar at all then the similarity
is only superficial/mechanistic. In fact, the term "hint bits in
indexes" does not seem useful at all to me, for this reason.

Heap hint bits indicate whether or not the xmin or xmax in a heap tuple
header committed or aborted. We cache the commit or abort status of one
particular XID in the heap tuple header. Notably, this information alone
tells us nothing about whether or not the tuple should be visible to any
given MVCC snapshot. Except perhaps in cases involving aborted
transactions -- but that "exception" is just a limited special case (and
less than 1% of transactions are aborted in almost all environments
anyway).

In contrast, a set LP_DEAD bit in an index is all the information we
need to know that the tuple is dead, and can be ignored completely
(except during hot standby, where at least today we assume nothing about
the status of the tuple, since that would be unsafe). Generally
speaking, the index page LP_DEAD bit is "direct" visibility information
about the tuple, not information about XIDs that are stored in the tuple
header. So a set LP_DEAD bit in an index is actually like an
LP_DEAD-set line pointer in the heap (that's the closest equivalent in
the heap, by far). It's also like a frozen heap tuple (except it's
dead-to-all, not live-to-all).

The difference may be subtle, but it's important here -- it justifies
inventing a whole new type of LP_DEAD-style status bit that gets set
only on standbys. Even today, heap tuples can have hint bits
"independently" set on standbys, subject to certain limitations needed
to avoid breaking things like data page checksums. Hint bits are
ultimately just a thing that remembers the status of transactions that
are known committed or aborted, and so can be set immediately after the
relevant xact commits/aborts (at least on the primary, during original
execution). A long-held MVCC snapshot is never a reason not to set a
hint bit in a heap tuple header (during original execution or during hot
standby/recovery). Of course, a long-held MVCC snapshot *is* often a
reason why we cannot set an LP_DEAD bit in an index.

Conclusion: The whole minRecoveryPoint question that you're trying to
answer to improve things for your patch is just the wrong question.
Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
would be useful to set "true hint bits" on standbys earlier, and maybe
thinking about minRecoveryPoint would help with that problem, but that
doesn't help your index-related patch -- because indexes simply don't
have true hint bits.

--
Peter Geoghegan
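In code terms, the distinction above reads roughly like the following
fragments (HEAP_XMIN_COMMITTED and ItemIdIsDead() are the real macros;
the surrounding scan context is elided):

    /*
     * A heap hint bit caches one XID's commit status; by itself it says
     * nothing about visibility to a particular snapshot:
     */
    if (tuple->t_infomask & HEAP_XMIN_COMMITTED)
        ;   /* xmin known committed -- still must be tested against the snapshot */

    /*
     * An index LP_DEAD bit is direct visibility information: dead to
     * all, so a scan can skip the entry without ever visiting the heap:
     */
    if (ItemIdIsDead(itemid))
        continue;   /* dead-to-all: skip, no snapshot test needed */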
Hello, Peter.
Thanks a lot for your comments.
Here are some of my thoughts related to the "masked bits" solution and your comments:
> During recovery, we will probably always have to consider the
> possibility that LP_DEAD bits that get set on the primary may be
> received by a replica through some implementation detail (e.g. LP_DEAD
> bits are set in FPIs we replay, maybe even some other thing that
> neither of us have thought of).
It is fine for the standby to receive a page from any source: `btpo_flags` should have some kind of "LP_DEAD safe for standby" bit which, when set, allows new bits to be set and old ones to be read.
> We can't really mask LP_DEAD bits from
> the primary in recovery anyway, because of stuff like page-level
> checksums. I suspect that it just isn't useful to fight against that.
As far as I can see - there is no problem here. Checksums already differ for both heap and index pages on standby and primary. Checksums are calculated before the page is written to the disk (not after applying FPI). So, the masking page during *applying* the FPI is semantically the same as setting a bit in it 1 nanosecond after.
And `btree_mask` (and the other mask functions) are already used by the consistency checks to exclude LP_DEAD bits.
> Plus you'll introduce new overhead for this process during replay,
> which creates significant overhead -- often most leaf pages have some
> LP_DEAD bits set during recovery.
I expect the overhead to be small: FPIs are not that frequent, and the positive effect should easily outweigh the additional CPU cycles spent in `btree_mask` (the page is already in the CPU cache at that moment anyway).
> I don't like that idea. Apart from what I said already, you're
> assuming that setting LP_DEAD bits in indexes on the primary won't
> eventually have some value on the standby after it is promoted and can
> accept writes -- they really do have meaning and value on standbys.
I think it is fine to lose some LP_DEAD bits on promotion (in practice they can only be received via FPIs anyway); they will be set again on the first scan. Also, bits set by the standby may be used by the newly promoted primary if we honor the previous primary's OldestXmin while setting them -- we would just need to add OldestXmin to xl_running_xacts and include it in the dead-horizon on the standby, as sketched below.
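A sketch of what that could look like; the oldestXmin field below is an invented addition to the existing xl_running_xacts record (src/include/storage/standby.h), not current PostgreSQL code:

/* Hypothetical extension of xl_running_xacts: the oldestXmin field is
 * invented here so that a standby (and a just-promoted primary) can know
 * which LP_DEAD bits set under the old primary's horizon remain safe. */
typedef struct xl_running_xacts
{
	int			xcnt;			/* # of xact ids in xids[] */
	int			subxcnt;		/* # of subxact ids in xids[] */
	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
	TransactionId nextXid;		/* xid from ShmemVariableCache->nextXid */
	TransactionId oldestRunningXid; /* *not* oldestXmin */
	TransactionId latestCompletedXid;	/* so we can set xmax */
	TransactionId oldestXmin;	/* NEW (hypothetical): primary's OldestXmin */

	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
} xl_running_xacts;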
> Why shouldn't this break page-level checksums (or wal_log_hints) in
> some way? What about pg_rewind, some eventual implementation of
> incremental backups, etc? I suspect that it will be necessary to
> invent some entirely new concept that is like a hint bit, but works on
> standbys (without breaking anything else).
As I said before, applying the mask on the *standby* will not break any checksums, because the page is still dirty afterwards (and we could even call `PageSetChecksumInplace` for additional paranoia). The actual checksums on standby and primary already have different values (probably on most pages, because of LP_DEAD and “classic” hint bits).
> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.
I am not sure I fully understand your idea here, sorry.
> The difference may be subtle, but it's important here -- it justifies
> inventing a whole new type of LP_DEAD-style status bit that gets set
> only on standbys. Even today, heap tuples can have hint bits
> "independently" set on standbys, subject to certain limitations needed
> to avoid breaking things like data page checksums
Yes, and I see three major ways to implement it in the current infrastructure:
1) Use LP_REDIRECT (or another free value) instead of LP_DEAD on standby
2) Use LP_DEAD on standby, but add some kind of recovery-conflict handling (like here - https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com )
3) Mask index FPIs during replay on a hot standby and mark the page as “primary LP_DEAD free” in btpo_flags (a sketch of this variant follows the list)
Of course, each variant requires some additional machinery to keep everything safe.
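As a rough illustration of variant 3, a replay-time step could look like the sketch below. The BTP_LP_DEAD_MASKED flag name is hypothetical, and page-type/error handling is elided:

/* Sketch of variant 3: while applying a b-tree FPI on a hot standby, drop
 * the primary's LP_DEAD hints and flag the page so that any LP_DEAD bits
 * seen later are known to be standby-local. */
static void
mask_primary_lp_dead(Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	OffsetNumber offnum,
				maxoff;

	if (!P_ISLEAF(opaque))
		return;					/* only leaf pages carry LP_DEAD hints */

	maxoff = PageGetMaxOffsetNumber(page);
	for (offnum = FirstOffsetNumber; offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemId = PageGetItemId(page, offnum);

		if (ItemIdIsDead(itemId))
			itemId->lp_flags = LP_NORMAL;	/* forget the primary's hint */
	}

	opaque->btpo_flags |= BTP_LP_DEAD_MASKED;	/* hypothetical flag */
}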
As far as I can see, the limitations in SetHintBits are related to XLogNeedsFlush (which checks minRecoveryPoint in the standby case).
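For reference, the interlock in question looks roughly like this in SetHintBits() (src/backend/access/heap/heapam.c); the excerpt is paraphrased and abridged:

/* Paraphrased from SetHintBits(): with checksums/wal_log_hints, a hint
 * may only be set if the commit record of the xid is already guaranteed
 * durable -- on a standby, XLogNeedsFlush() answers that question by
 * consulting minRecoveryPoint. */
if (TransactionIdIsValid(xid))
{
	XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);

	if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
		BufferGetLSNAtomic(buffer) < commitLSN)
		return;					/* not safe to set the hint yet */
}
tuple->t_infomask |= infomask;
MarkBufferDirtyHint(buffer, true);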
> Conclusion: The whole minRecoveryPoint question that you're trying to
> answer to improve things for your patch is just the wrong question.
> Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
> would be useful to set "true hint bits" on standbys earlier, and maybe
> thinking about minRecoveryPoint would help with that problem, but that
> doesn't help your index-related patch -- because indexes simply don't
> have true hint bits.
Attention to minRecoveryPoint is required because of the possible situation described here - https://www.postgresql.org/message-id/flat/CANtu0ojwFcSQpyCxrGxJuLVTnOBSSrzKuF3cB_yCk0U-X-wpGw%40mail.gmail.com#4d8ef8754b18c5e35146ed589b25bf27
The source of the potential problem is the fact that setting LP_DEAD does not change the page LSN, which can cause consistency issues during crash recovery.
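To illustrate, the tail of _bt_killitems() (src/backend/access/nbtree/nbtutils.c) does roughly the following -- the page is dirtied only as a hint, no WAL record is written, and the page LSN stays put (abridged from the PostgreSQL sources):

/* Abridged from _bt_killitems(): LP_DEAD bits are set with no WAL record,
 * so the page LSN does not advance -- which is exactly why crash recovery
 * can resurface a page whose LP_DEAD bits were set after the last
 * WAL-logged change. */
if (killedsomething)
{
	opaque->btpo_flags |= BTP_HAS_GARBAGE;
	MarkBufferDirtyHint(so->currPos.buf, true);
}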
Thanks,
Michail.
On Mon, Feb 1, 2021 at 1:19 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
> It is fine for the standby to receive a page from any source:
> `btpo_flags` would carry some kind of “LP_DEAD safe for standby” bit
> that has to be set before new LP_DEAD bits may be set or existing ones
> trusted.
>
> > We can't really mask LP_DEAD bits from
> > the primary in recovery anyway, because of stuff like page-level
> > checksums. I suspect that it just isn't useful to fight against that.
>
> As far as I can see, there is no problem here. Checksums already
> differ between standby and primary for both heap and index pages.

AFAICT that's not true, at least not in any practical sense. See the comment in the middle of MarkBufferDirtyHint() that begins with "If we must not write WAL, due to a relfilenode-specific...", and see the "Checksums" section at the end of src/backend/storage/page/README. The last paragraph in the README is particularly relevant:

    New WAL records cannot be written during recovery, so hint bits set
    during recovery must not dirty the page if the buffer is not already
    dirty, when checksums are enabled. Systems in Hot-Standby mode may
    benefit from hint bits being set, but with checksums enabled, a page
    cannot be dirtied after setting a hint bit (due to the torn page
    risk). So, it must wait for full-page images containing the hint bit
    updates to arrive from the primary.

IIUC the intention is that MarkBufferDirtyHint() is a no-op during hot standby when we successfully set a hint bit, though only in the XLogHintBitIsNeeded() case. So we don't really dirty the page within SetHintBits() in this specific scenario. That is, the buffer header won't actually get marked BM_DIRTY or BM_JUST_DIRTIED within MarkBufferDirtyHint() when in Hot Standby + XLogHintBitIsNeeded().

What else could work at all? The only "alternative" is to write an FPI, just like on the primary -- but writing new WAL records is not possible during Hot Standby! A comment within MarkBufferDirtyHint() spells it out directly -- we can have hint bits set in hot standby independently of the primary, but it works in a way that makes sure that the hint bits never make it out to disk:

"We can set the hint, just not dirty the page as a result so the hint is lost when we evict the page or shutdown"

You may be right in some narrow sense -- checksums can differ on a standby. But that's like saying that it's sometimes okay to have a torn page on disk. Yes, it's okay, but only because we expect the problem during crash recovery, and can reliably repair it.

> Checksums are calculated when the page is written to disk (not when an
> FPI is applied). So, masking the page while *applying* the FPI is
> semantically the same as setting a bit in it one nanosecond afterwards.
>
> And `btree_mask` (and the other mask functions) are already used by the
> consistency checks to exclude LP_DEAD bits.

I don't see how that is relevant. btree_mask() is only used by wal_consistency_checking, which is mostly just for Postgres hackers.

--
Peter Geoghegan
Hello, Peter.
> AFAICT that's not true, at least not in any practical sense. See the
> comment in the middle of MarkBufferDirtyHint() that begins with "If we
> must not write WAL, due to a relfilenode-specific...", and see the
> "Checksums" section at the end of src/backend/storage/page/README. The
> last paragraph in the README is particularly relevant:
I have attached a TAP test to demonstrate how easily checksums on standby and primary start to differ. The test shows two different scenarios, for both heap and index pages (and the bit is set on both standby and primary).
Yes, MarkBufferDirtyHint does not mark the page as dirty… so hint bits on the standby can easily be lost. But it leaves the page dirty if it already is (and the page could also be marked dirty by WAL replay later). So, hint bits can easily be flushed and taken into account during checksum calculation on both standby and primary.
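The branch being discussed, abridged from MarkBufferDirtyHint() in bufmgr.c:

/* Abridged from MarkBufferDirtyHint(): when checksums or wal_log_hints
 * would require WAL to cover a hint-bit change but we are in recovery,
 * the function returns without setting BM_DIRTY, so the hint survives
 * only while the buffer stays in shared buffers -- unless the page was
 * already dirty for some other reason. */
if (XLogHintBitIsNeeded() &&
	(pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
{
	if (RecoveryInProgress())
		return;					/* hint set in memory, page left clean */
	/* ...otherwise an FPI may be emitted before dirtying; elided here... */
}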
> "We can set the hint, just not dirty the page as a result so the hint
> is lost when we evict the page or shutdown"
Yes, it is not allowed to mark a page dirty on a standby just because of hint bits, because we could end up with this:
CHECKPOINT
SET HINT BIT
TORN FLUSH + CRASH = BROKEN CHECKSUM, SERVER FAULT
But this scenario is totally fine:
CHECKPOINT
FPI (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED, EVERYTHING IS OK
And, as a result, this is fine too:
CHECKPOINT
FPI WITH MASKED LP_DEAD (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED + LP_DEAD MASKED AGAIN IF STANDBY
So, my point here is that it is fine to mask LP_DEAD bits during replay, because they already differ between standby and primary. And it is fine to set and flush hint bits (and LP_DEAD bits) on the standby, because they can already end up flushed anyway (we just need to take minRecoveryPoint into account and, probably, the primary's OldestXmin in the LP_DEAD case to make promotion easy).
>> And `btree_mask` (and the other mask functions) are already used by the consistency checks to exclude LP_DEAD bits.
> I don't see how that is relevant. btree_mask() is only used by
> wal_consistency_checking, which is mostly just for Postgres hackers.
I was thinking about the possibility of reusing these functions for masking during replay.
Thanks,
Michail.
Hello, Peter.

If you are interested, a possible patch (based on masking FPIs during replay) was sent with some additional explanation and graphs to [1]. At the moment I am unable to find any "incorrectness" in it.

Thanks again for your comments.

Michail.

[1] https://www.postgresql.org/message-id/flat/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com#4c81a4d623d8152f5e8889e97e750eec