Обсуждение: Rename dead_tuples to dead_items in vacuumlazy.c
Attached patch performs polishing within vacuumlazy.c, as follow-up work to the refactoring work in Postgres 14. This mainly consists of changing references of dead tuples to dead items, which reflects the fact that VACUUM no longer deals with TIDs that might point to remaining heap tuples with storage -- the TIDs in the array must now strictly point to LP_DEAD stub line pointers that remain in the heap, following pruning. I've also simplified header comments, and comments above the main entry point functions. These comments made much more sense back when lazy_scan_heap() was simpler, but wasn't yet broken up into smaller, better-scoped functions. If there are no objections, I'll move on this soon. It's mostly just mechanical changes. -- Peter Geoghegan
Вложения
On Wed, Nov 24, 2021 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached patch performs polishing within vacuumlazy.c, as follow-up > work to the refactoring work in Postgres 14. This mainly consists of > changing references of dead tuples to dead items, which reflects the > fact that VACUUM no longer deals with TIDs that might point to > remaining heap tuples with storage -- the TIDs in the array must now > strictly point to LP_DEAD stub line pointers that remain in the heap, > following pruning. > > I've also simplified header comments, and comments above the main > entry point functions. These comments made much more sense back when > lazy_scan_heap() was simpler, but wasn't yet broken up into smaller, > better-scoped functions. > > If there are no objections, I'll move on this soon. It's mostly just > mechanical changes. -#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6 +#define PROGRESS_VACUUM_MAX_DEAD_ITEMS 5 +#define PROGRESS_VACUUM_NUM_DEAD_ITEMS 6 Wouldn't this be more logical to change to DEAD_TIDS instead of DEAD_ITEMS? + /* Sorted list of TIDs to delete from indexes */ + ItemPointerData dead[FLEXIBLE_ARRAY_MEMBER]; Instead of just dead, why not deadtid or deaditem? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 24, 2021 at 2:46 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached patch performs polishing within vacuumlazy.c, as follow-up > work to the refactoring work in Postgres 14. This mainly consists of > changing references of dead tuples to dead items, which reflects the > fact that VACUUM no longer deals with TIDs that might point to > remaining heap tuples with storage -- the TIDs in the array must now > strictly point to LP_DEAD stub line pointers that remain in the heap, > following pruning. +1 > If there are no objections, I'll move on this soon. It's mostly just > mechanical changes. The patch renames dead tuples to dead items at some places and to dead TIDs at some places. For instance, it renames dead tuples to dead TIDs here: - * Return the maximum number of dead tuples we can record. + * Computes the number of dead TIDs that VACUUM will have to store in the + * worst case, where all line pointers are allocated, and all are LP_DEAD whereas renames to dead items here: - * extra cost of bsearch(), especially if dead tuples on the heap are + * extra cost of bsearch(), especially if dead items on the heap are I think it's more consistent if we change it to one side. I prefer "dead items". --- There is one more place where we can rename "dead tuples": /* * Allocate the space for dead tuples. Note that this handles parallel * VACUUM initialization as part of allocating shared memory space used * for dead_items. */ Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Nov 24, 2021 at 7:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think it's more consistent if we change it to one side. I prefer "dead items". I feel like "items" is quite a generic word, so I think I would prefer TIDs. But it's probably not a big deal. -- Robert Haas EDB: http://www.enterprisedb.com
On 2021-Nov-24, Robert Haas wrote: > On Wed, Nov 24, 2021 at 7:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I think it's more consistent if we change it to one side. I prefer > > "dead items". > > I feel like "items" is quite a generic word, so I think I would prefer > TIDs. But it's probably not a big deal. Is there clarity on what each term means? Since this patch only changes things that are specific to heap vacuuming, it seems OK to rely the convention that "item" means "heap item" (not just any generic item). However, I'm not sure that we fully agree exactly what a heap item is. Maybe if we agree to a single non ambiguous definition for each of those terms we can agree what terminology to use. It seems to me we have the following terms: - tuple - line pointer - [heap] item - TID My mental model is that "tuple" (in the narrow context of heap vacuum) is the variable-size on-disk representation of a row in a page; "line pointer" is the fixed-size struct at the bottom of each page that contains location, size and flags of a tuple: struct ItemIdData. The TID is the address of a line pointer -- an ItemPointerData. What is an item? Is an item the same as a line pointer? That seems confusing. I think "item" means the tuple as a whole. In that light, using the term TID for some of the things that the patch renames to "item" seems more appropriate. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Wed, Nov 24, 2021 at 9:37 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > My mental model is that "tuple" (in the narrow context of heap vacuum) > is the variable-size on-disk representation of a row in a page; "line > pointer" is the fixed-size struct at the bottom of each page that > contains location, size and flags of a tuple: struct ItemIdData. The > TID is the address of a line pointer -- an ItemPointerData. > > What is an item? Is an item the same as a line pointer? That seems > confusing. I think "item" means the tuple as a whole. In that light, > using the term TID for some of the things that the patch renames to > "item" seems more appropriate. Hmm. I think in my model an item and an item pointer and a line pointer are all the same thing, but a TID is different. When I talk about a TID, I mean the location of an item pointer, not its contents. So a TID is what tells me that I want block 5 and the 4th slot in the item pointer array. The item pointer tells me that the associate tuple is at a certain position in the page and has a certain length. -- Robert Haas EDB: http://www.enterprisedb.com
On 2021-Nov-24, Robert Haas wrote: > Hmm. I think in my model an item and an item pointer and a line > pointer are all the same thing, but a TID is different. When I talk > about a TID, I mean the location of an item pointer, not its contents. > So a TID is what tells me that I want block 5 and the 4th slot in the > item pointer array. The item pointer tells me that the associate tuple > is at a certain position in the page and has a certain length. OK, but you can have item pointers that don't have any item. LP_REDIRECT, LP_DEAD, LP_UNUSED item pointers don't have items. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Nunca confiaré en un traidor. Ni siquiera si el traidor lo he creado yo" (Barón Vladimir Harkonnen)
On 2021-Nov-24, Alvaro Herrera wrote: > On 2021-Nov-24, Robert Haas wrote: > > > Hmm. I think in my model an item and an item pointer and a line > > pointer are all the same thing, but a TID is different. When I talk > > about a TID, I mean the location of an item pointer, not its contents. > > So a TID is what tells me that I want block 5 and the 4th slot in the > > item pointer array. The item pointer tells me that the associate tuple > > is at a certain position in the page and has a certain length. > > OK, but you can have item pointers that don't have any item. > LP_REDIRECT, LP_DEAD, LP_UNUSED item pointers don't have items. Sorry to reply to myself, but I realized that I forgot to return to the main point of this thread. If we agree that "an LP_DEAD item pointer does not point to any item" (an assertion that gives a precise meaning to both those terms), then a patch that renames "tuples" to "items" is not doing anything useful IMO, because those two terms are synonyms. Now maybe Peter doesn't agree with the definitions I suggest, in which case I would like to know what his definitions are. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
On Wed, Nov 24, 2021 at 9:51 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2021-Nov-24, Robert Haas wrote: > > Hmm. I think in my model an item and an item pointer and a line > > pointer are all the same thing, but a TID is different. When I talk > > about a TID, I mean the location of an item pointer, not its contents. > > So a TID is what tells me that I want block 5 and the 4th slot in the > > item pointer array. The item pointer tells me that the associate tuple > > is at a certain position in the page and has a certain length. > > OK, but you can have item pointers that don't have any item. > LP_REDIRECT, LP_DEAD, LP_UNUSED item pointers don't have items. I guess so. I said before that I thought an item and an item pointer were the same, but on reflection, that doesn't entirely make sense. But I don't know that I like making item and tuple synonymous either. I think perhaps the term "item" by itself is not very clear. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Nov 24, 2021 at 7:16 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Sorry to reply to myself, but I realized that I forgot to return to the > main point of this thread. If we agree that "an LP_DEAD item pointer > does not point to any item" (an assertion that gives a precise meaning > to both those terms), then a patch that renames "tuples" to "items" is > not doing anything useful IMO, because those two terms are synonyms. TIDs (ItemPointerData structs) are of course not the same thing as line pointers (ItemIdData structs). There is a tendency to refer to the latter as "item pointers" all the same, which was confusing. I personally corrected/normalized this in commit ae7291ac in 2019. I think that it's worth being careful about precisely because they're closely related (but distinct) concepts. And so FWIW "LP_DEAD item pointer" is not a thing. I agree that an LP_DEAD item pointer has no tuple storage, and so you could say that it points to nothing (though only in heapam). I probably would just say that it has no tuple storage, though. > Now maybe Peter doesn't agree with the definitions I suggest, in which > case I would like to know what his definitions are. I agree with others that the term "item" is vague, but I don't think that that's necessarily a bad thing here -- I deliberately changed the comments to say either "TIDs" or "LP_DEAD items", emphasizing whatever the important aspect seemed to be in each context (they're LP_DEAD items to the heap structure, TIDs to index structures). I'm not attached to the term "item". To me the truly important point is what these items are *not*: they're not tuples. The renaming is intended to enforce the concepts that I went into at the end of the commit message for commit 8523492d. Now the pruning steps in lazy_scan_prune always avoiding keeping around a DEAD tuple with tuple storage on return to lazy_scan_heap (only LP_DEAD items can remain), since (as of that commit) lazy_scan_prune alone is responsible for things involving the "logical database". This means that index vacuuming and heap vacuuming can now be thought of as removing garbage items from physical data structures (they're purely "physical database" concepts), and nothing else. They don't need recovery conflicts. How could they? Where are you supposed to get the XIDs for that from, when you've only got LP_DEAD items? This is also related to the idea that pruning by VACUUM isn't necessarily all that special compared to earlier pruning or concurrent opportunistic pruning. As I go into on the other recent thread on removing special cases in vacuumlazy.c, ISTM that we ought to do everything except pruning itself (and freezing tuples, which effectively depends on pruning) without even acquiring a cleanup lock. Which is actually quite a lot of things. -- Peter Geoghegan
On 2021-Nov-24, Peter Geoghegan wrote: > TIDs (ItemPointerData structs) are of course not the same thing as > line pointers (ItemIdData structs). There is a tendency to refer to > the latter as "item pointers" all the same, which was confusing. I > personally corrected/normalized this in commit ae7291ac in 2019. I > think that it's worth being careful about precisely because they're > closely related (but distinct) concepts. And so FWIW "LP_DEAD item > pointer" is not a thing. I agree that an LP_DEAD item pointer has no > tuple storage, and so you could say that it points to nothing (though > only in heapam). I probably would just say that it has no tuple > storage, though. OK, this makes a lot more sense. I wasn't aware of ae7291ac (and I wasn't aware of the significance of 8523492d either, but that's not really relevant here.) > I agree with others that the term "item" is vague, but I don't think > that that's necessarily a bad thing here -- I deliberately changed the > comments to say either "TIDs" or "LP_DEAD items", emphasizing whatever > the important aspect seemed to be in each context (they're LP_DEAD > items to the heap structure, TIDs to index structures). I think we could say "LP_DEAD line pointer" and that would be perfectly clear. Given how nuanced we have to be if we want to be clear about this, I would rather not use "LP_DEAD item"; that seems slightly contradictory, since the item is the storage and such a line pointer does not have storage. Perhaps change that define in progress.h to PROGRESS_VACUUM_NUM_DEAD_LPS, and, in the first comment in vacuumlazy.c, use wording such as + * The major space usage for LAZY VACUUM is storage for the array of TIDs + * of dead line pointers that are to be removed from indexes. or + * The major space usage for LAZY VACUUM is storage for the array of TIDs + * of LP_DEAD line pointers that are to be removed from indexes. (The point being that TIDs are not dead themselves, only the line pointers that they refer to.) -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Most hackers will be perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith)
On Wed, Nov 24, 2021 at 9:53 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > OK, this makes a lot more sense. I wasn't aware of ae7291ac (and I > wasn't aware of the significance of 8523492d either, but that's not > really relevant here.) Thanks for hearing me out about the significance of 8523492d. Having the right formalisms seems to really matter here, because they enable decoupling, which is generally very useful. This makes it easy to understand (just for example) that index vacuuming and heap vacuuming are just additive, optional steps (in principle) -- an idea that will become even more important once we get Robert's pending TID conveyor belt design. I believe that that design goes one step further than what we have today, by making index vacuuming and heap vacuuming occur in a distinct operation to VACUUM proper (VACUUM would only need to set up the LP_DEAD item list for index vacuuming and heap vacuuming, which may or may not happen immediately after). An interesting question (at least to me) is: within a non-aggressive VACUUM, what remaining steps are *not* technically optional? I am pretty sure that they're all optional in principle (or will be soon), because soon we will be able to independently advance relfrozenxid without freezing all tuples with XIDs before our original FreezeLimit (FreezeLimit should only be used to decide which tuples to freeze, not to decide on a new relfrozenxid). Soon almost everything will be decoupled, without changing the basic invariants that we've had for many years. This flexibility seems really important to me. That just leaves avoiding pruning without necessarily avoiding ostensibly related processing for indexes. We can already independently prune without doing index/heap vacuuming (the bypass indexes optimization). We will also be able to do the opposite thing, with my new patch: we can perform index/heap vacuuming *without* pruning ourselves. This makes sense in the case where we cannot acquire a cleanup lock on a heap page with preexisting LP_DEAD items. > I think we could say "LP_DEAD line pointer" and that would be perfectly > clear. Given how nuanced we have to be if we want to be clear about > this, I would rather not use "LP_DEAD item"; that seems slightly > contradictory, since the item is the storage and such a line pointer > does not have storage. Perhaps change that define in progress.h to > PROGRESS_VACUUM_NUM_DEAD_LPS, and, in the first comment in vacuumlazy.c, > use wording such as I agree with all that, I think. But it's still not clear what the variable dead_tuples should be renamed to within the structure that you lay out (I imagine that you agree with me that dead_tuples is now a bad name). This one detail affects more individual lines of code than the restructuring of comments. -- Peter Geoghegan
On Wed, Nov 24, 2021 at 4:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > The patch renames dead tuples to dead items at some places and to > dead TIDs at some places. > I think it's more consistent if we change it to one side. I prefer "dead items". I just pushed a version of the patch that still uses both terms when talking about dead_items. But the final commit actually makes it clear why, in comments above the LVDeadItems struct itself: LVDeadItems is used by both index vacuuming and heap vacuuming. Thanks -- Peter Geoghegan
On Tue, Nov 30, 2021 at 3:00 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Nov 24, 2021 at 4:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The patch renames dead tuples to dead items at some places and to > > dead TIDs at some places. > > > I think it's more consistent if we change it to one side. I prefer "dead items". > > I just pushed a version of the patch that still uses both terms when > talking about dead_items. Thanks! I'll change my parallel vacuum refactoring patch accordingly. Regarding the commit, I think that there still is one place in lazyvacuum.c where we can change "dead tuples” to "dead items”: /* * Allocate the space for dead tuples. Note that this handles parallel * VACUUM initialization as part of allocating shared memory space used * for dead_items. */ dead_items_alloc(vacrel, params->nworkers); dead_items = vacrel->dead_items; Also, the commit doesn't change both PROGRESS_VACUUM_MAX_DEAD_TUPLES and PROGRESS_VACUUM_NUM_DEAD_TUPLES. Did you leave them on purpose? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Nov 29, 2021 at 7:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thanks! I'll change my parallel vacuum refactoring patch accordingly. Thanks again for working on that. > Regarding the commit, I think that there still is one place in > lazyvacuum.c where we can change "dead tuples” to "dead items”: > > /* > * Allocate the space for dead tuples. Note that this handles parallel > * VACUUM initialization as part of allocating shared memory space used > * for dead_items. > */ > dead_items_alloc(vacrel, params->nworkers); > dead_items = vacrel->dead_items; Oops. Pushed a fixup for that just now. > Also, the commit doesn't change both PROGRESS_VACUUM_MAX_DEAD_TUPLES > and PROGRESS_VACUUM_NUM_DEAD_TUPLES. Did you leave them on purpose? That was deliberate. It would be a bit strange to alter these constants without also updating the corresponding column names for the pg_stat_progress_vacuum system view. But if I kept the definition from system_views.sql in sync, then I would break user scripts -- for reasons that users don't care about. That didn't seem like the right approach. Also, the system as a whole still assumes "DEAD tuples and LP_DEAD items are the same, and are just as much of a problem in the table as they are in each index". As you know, this is not really true, which is an important problem for us. Fixing it (perhaps as part of adding something like Robert's conveyor belt design) will likely require revising this model quite fundamentally (e.g, the vacthresh calculation in autovacuum.c:relation_needs_vacanalyze() would be replaced). When this happens, we'll probably need to update system views that have columns with names like "dead_tuples" -- because maybe we no longer specifically count dead items/tuples at all. I strongly suspect that the approach to statistics that we take for pg_statistic optimizer stats just doesn't work for dead items/tuples -- statistical sampling only produces useful statistics for the optimizer because certain delicate assumptions are met (even these assumptions only really work with a properly normalized database schema). Maybe revising the model used for autovacuum scheduling wouldn't include changing pg_stat_progress_vacuum, since that isn't technically "part of the model" --- I'm not sure. But it's not something that I am in a hurry to fix. -- Peter Geoghegan
On Wed, Dec 1, 2021 at 4:42 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Nov 29, 2021 at 7:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thanks! I'll change my parallel vacuum refactoring patch accordingly. > > Thanks again for working on that. > > > Regarding the commit, I think that there still is one place in > > lazyvacuum.c where we can change "dead tuples” to "dead items”: > > > > /* > > * Allocate the space for dead tuples. Note that this handles parallel > > * VACUUM initialization as part of allocating shared memory space used > > * for dead_items. > > */ > > dead_items_alloc(vacrel, params->nworkers); > > dead_items = vacrel->dead_items; > > Oops. Pushed a fixup for that just now. Thanks! > > > Also, the commit doesn't change both PROGRESS_VACUUM_MAX_DEAD_TUPLES > > and PROGRESS_VACUUM_NUM_DEAD_TUPLES. Did you leave them on purpose? > > That was deliberate. > > It would be a bit strange to alter these constants without also > updating the corresponding column names for the > pg_stat_progress_vacuum system view. But if I kept the definition from > system_views.sql in sync, then I would break user scripts -- for > reasons that users don't care about. That didn't seem like the right > approach. Agreed. > > Also, the system as a whole still assumes "DEAD tuples and LP_DEAD > items are the same, and are just as much of a problem in the table as > they are in each index". As you know, this is not really true, which > is an important problem for us. Fixing it (perhaps as part of adding > something like Robert's conveyor belt design) will likely require > revising this model quite fundamentally (e.g, the vacthresh > calculation in autovacuum.c:relation_needs_vacanalyze() would be > replaced). When this happens, we'll probably need to update system > views that have columns with names like "dead_tuples" -- because maybe > we no longer specifically count dead items/tuples at all. I strongly > suspect that the approach to statistics that we take for pg_statistic > optimizer stats just doesn't work for dead items/tuples -- statistical > sampling only produces useful statistics for the optimizer because > certain delicate assumptions are met (even these assumptions only > really work with a properly normalized database schema). > > Maybe revising the model used for autovacuum scheduling wouldn't > include changing pg_stat_progress_vacuum, since that isn't technically > "part of the model" --- I'm not sure. But it's not something that I am > in a hurry to fix. Understood. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/