Обсуждение: visibility maps
<br /> /*<br /> * We don't need to lock the page, as we're only looking at a single bit.<br /> */<br /> result= (map[mapByte] & (1 << mapBit)) ? true : false;<br /><br /><br /> Isn't this a dangerous assumption to make? I am not so sure that even a bit can be read atomically on all platforms. I think the only caller of visibilitymap_test()is VACUUM which can live with a false information. But if this is indeed a problem, should we eitherfix this or have explicit comments there ?<br clear="all" /><br />BTW, my apologies for random comments. I haven'tfollowed the discussion well, neither done a full review. So these things might have been discussed before.<br /><br/> Thanks,<br />Pavan <br /><br />-- <br />Pavan Deolasee<br />EnterpriseDB <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/>
<br />/*<br /> * Size of the bitmap on each visibility map page, in bytes. There's no<br /> * extra headers, so the wholepage minus except for the standard page header<br /> * is used for the bitmap.<br /> */<br />#define MAPSIZE (BLCKSZ- SizeOfPageHeaderData)<br /><br /><br />ISTM that we should MAXALIGN the SizeOfPageHeaderData to compute MAPSIZE.PageGetContents() works that way and I believe that's the right thing to do.<br /><br />Thanks,<br />Pavan<br clear="all"/><br />-- <br />Pavan Deolasee<br /> EnterpriseDB <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/>
Pavan Deolasee wrote: > /* > * We don't need to lock the page, as we're only looking at a single > bit. > */ > result = (map[mapByte] & (1 << mapBit)) ? true : false; > > > Isn't this a dangerous assumption to make ? I am not so sure that even a bit > can be read atomically on all platforms. Umm, what non-atomic state could the bit be in? Half-set, half-cleared? Or do you think that if some other bit in proximity is changed, the other bit would temporarily flip 0->1->0, or something like that? I don't think that should happen. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
<br /><br /><div class="gmail_quote">On Sat, Dec 6, 2008 at 7:57 PM, Heikki Linnakangas <span dir="ltr"><<a href="mailto:heikki.linnakangas@enterprisedb.com">heikki.linnakangas@enterprisedb.com</a>></span>wrote:<br /><blockquoteclass="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left:1ex;"><div class="Ih2E3d"><br /></div> Umm, what non-atomic state could the bit be in? Half-set, half-cleared?Or do you think that if some other bit in proximity is changed, the other bit would temporarily flip 0->1->0,or something like that? I don't think that should happen.<br /><br /></blockquote></div><br clear="all" />Sincethe lock is not held, the bit can be flipped while we are reading, isn't it ? IOW, the test is not reliable is whatI fear.<br /><br />Thanks,<br />Pavan<br /><br />-- <br />Pavan Deolasee<br /> EnterpriseDB <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/>
Pavan Deolasee wrote: > On Sat, Dec 6, 2008 at 7:57 PM, Heikki Linnakangas < > heikki.linnakangas@enterprisedb.com> wrote: > >> Umm, what non-atomic state could the bit be in? Half-set, half-cleared? Or >> do you think that if some other bit in proximity is changed, the other bit >> would temporarily flip 0->1->0, or something like that? I don't think that >> should happen. >> > Since the lock is not held, the bit can be flipped while we are reading, > isn't it ? IOW, the test is not reliable is what I fear. If someone is changing the bit at the same time, it doesn't matter whether we read it as 1 or 0. Locking the page wouldn't change the situation: you would still read the old value if you got the lock before the concurrent updater, or the new value if you got the lock after. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote: > /* > * Size of the bitmap on each visibility map page, in bytes. There's no > * extra headers, so the whole page minus except for the standard page > header > * is used for the bitmap. > */ > #define MAPSIZE (BLCKSZ - SizeOfPageHeaderData) > > > ISTM that we should MAXALIGN the SizeOfPageHeaderData to compute MAPSIZE. > PageGetContents() works that way and I believe that's the right thing to do. Yep, you're right. Thanks, fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a): > Pavan Deolasee wrote: >> /* >> * We don't need to lock the page, as we're only looking at a single >> bit. >> */ >> result = (map[mapByte] & (1 << mapBit)) ? true : false; >> >> >> Isn't this a dangerous assumption to make ? I am not so sure that even >> a bit >> can be read atomically on all platforms. > > Umm, what non-atomic state could the bit be in? Half-set, half-cleared? > Or do you think that if some other bit in proximity is changed, the > other bit would temporarily flip 0->1->0, or something like that? I > don't think that should happen. > IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread) can access to the same memory address(es)* in same time*. The question is how compiler compile C code to assembler. But this code seems to me safe. I think we use same principle for TransactionID? Zdenek * Wide is architecture dependent. and of course it depends on alignment. I'm not sure how x86 CPUs read nonalignment memory. But if you enable this on SPARC it is handled in software thru TRAP handlers and it is not really atomic. But in our case we use byte access which is safe everywhere. ** IIRC, some AMD64 processors allows to disable cache coherence check, but it not used in standard OS, and we can ignore it.
On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: > >> > > IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread) > can access to the same memory address(es)* in same time*. The question is > how compiler compile C code to assembler. But this code seems to me safe. Yeah, I think the code is safe because we are just reading a bit. BTW, I wonder if we need to acquire EXCLUSIVE lock while writing the visibility map bit ? Since almost (8 * 8192) data blocks would map to the same visibility map page, the lock can certainly become a hot spot. I know we also update PageLSN during the set operation and that would require EXLUSIVE lock, but is that required for consistency given that the entire visibility map is just a hint ? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee napsal(a): > On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread) >> can access to the same memory address(es)* in same time*. The question is >> how compiler compile C code to assembler. But this code seems to me safe. > > Yeah, I think the code is safe because we are just reading a bit. > > BTW, I wonder if we need to acquire EXCLUSIVE lock while writing the > visibility map bit ? Yes, because it is not simple write operation. You need to read byte from memory to register, set bit and write it back. Write memory itself is atomic but somebody could change other bits between read and write. Zdenek
On Thu, Dec 11, 2008 at 7:03 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: > > Yes, because it is not simple write operation. You need to read byte from > memory to register, set bit and write it back. Write memory itself is atomic > but somebody could change other bits between read and write. > Yeah, but since its just a hint, we can possibly live with some corner cases. The benefit of avoiding contention on the VM page would easily out weigh the downside of wrong hints. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote: > On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread) >> can access to the same memory address(es)* in same time*. The question is >> how compiler compile C code to assembler. But this code seems to me safe. > > Yeah, I think the code is safe because we are just reading a bit. Right. I wonder if we should declare the char *map variable as volatile, though. Shouldn't make a difference in practice, it's only used once in the function, but it feels like the right thing to do given that it is accessing a piece of memory without a lock. > BTW, I wonder if we need to acquire EXCLUSIVE lock while writing the > visibility map bit ? Since almost (8 * 8192) data blocks would map to > the same visibility map page, the lock can certainly become a hot > spot. I know we also update PageLSN during the set operation and that > would require EXLUSIVE lock, but is that required for consistency > given that the entire visibility map is just a hint ? Yeah, if we accept that bits can be bogusly set. There is scenarios where that can happen already, but they involve crashing, not during normal operation and clean shut down. In the future, I'd like to move in the direction of making the visibility map *more* reliable, not less, ultimately allowing index-only-scans, so I'd rather not start relaxing that. Only the first update to a page needs to clear the bit in the visibility map, so I don't think it'll become a bottleneck in practice. Frequently updated pages will never have the bit set in the visibility map to begin with. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 11, 2008 at 7:15 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > > Yeah, if we accept that bits can be bogusly set. There is scenarios where > that can happen already, but they involve crashing, not during normal > operation and clean shut down. In the future, I'd like to move in the > direction of making the visibility map *more* reliable, not less, ultimately > allowing index-only-scans, so I'd rather not start relaxing that. > Do we have any tests to prove that the VM page lock does not indeed become a bottleneck ? I can do some if we don't have already. > Only the first update to a page needs to clear the bit in the visibility > map, so I don't think it'll become a bottleneck in practice. Frequently > updated pages will never have the bit set in the visibility map to begin > with. > Well that's true only if you reject my heap-prune patch :-) Otherwise, heap-prune will again set the bit (and I believe that's the right thing to do) Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread) >> can access to the same memory address(es)* in same time*. The question is >> how compiler compile C code to assembler. But this code seems to me safe. > Yeah, I think the code is safe because we are just reading a bit. There's no such thing as "just reading a bit" from shared memory. Yes, you will get *some* value, but it is not very clear which value. In particular, on machines with weak memory ordering guarantees (PPC for instance), we put memory fence instructions into the lock/unlock sequences to ensure that someone who obtains a lock guarding a shared-memory data structure will see any changes made by the previous holder of the lock. An access that is entirely free of any locking primitive might get a stale value --- meaning that it might be logically inconsistent with the apparent contents of other parts of shared memory examined just before or after this access. It's likely that there are other lock/unlock operations somewhere in the code that would prevent a visible failure; and in any case the usage of the visibility map is constrained in a way that means getting a slightly stale value isn't a problem. But it needs a lot closer analysis than the existing code comment suggests. I've been thinking for awhile that maybe we should expose the memory fence operations as separate primitives, similar to what's done inside the Linux kernel. This code would feel a lot safer if it had a read fence just before the fetch. IIRC there are some other places where we could use something similar instead of needing a full lock acquisition/release. regards, tom lane
Pavan Deolasee wrote: > On Thu, Dec 11, 2008 at 7:15 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > Do we have any tests to prove that the VM page lock does not indeed > become a bottleneck ? No. > I can do some if we don't have already. Oh, yes please! >> Only the first update to a page needs to clear the bit in the visibility >> map, so I don't think it'll become a bottleneck in practice. Frequently >> updated pages will never have the bit set in the visibility map to begin >> with. > > Well that's true only if you reject my heap-prune patch :-) Otherwise, > heap-prune will again set the bit (and I believe that's the right > thing to do) I'm not sure if we should set the bits in very aggressively. If we're more aggressive about setting the bits, it also means that we have to clear the bits more often, increasing the likelihood of contention that you were worried about. Also, skipping a few pages here and there in vacuum doesn't make it any faster in practice, because you're reading sequentially. You need long contiguous regions of pages that can be skipped until you see a benefit. Setting the PD_ALL_VISIBLE flag on the heap page itself more aggressively might be more interesting, because of the small seqscan optimization. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 11, 2008 at 8:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > > I'm not sure if we should set the bits in very aggressively. If we're more > aggressive about setting the bits, it also means that we have to clear the > bits more often, increasing the likelihood of contention that you were > worried about. Well, I would rather set bits aggressively and reduce contention by changing the lock type. If HOT is working well, VACUUM will have very few things to do, but visibility map wouldn't help as much as it can unless we set the bits after pruning. Another thing I noticed is the since VACUUM tries to set the bit in the first phase, it's working only because HOT prunes DEAD tuples just before we do another scan on line pointers (which I had earlier talked about getting rid of. May be its time I do that). Otherwise, the visibility bit won't be set even though the DEAD tuples will be removed in the second scan and the rest are all LIVE tuples. So if we at all want to take out the another scan of line pointers from the first pass, we should rather push the work setting bits in the prune code. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 11, 2008 at 8:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Pavan Deolasee wrote: > > >> I can do some if we don't have already. > > Oh, yes please! > I did some tests today with pgbench on a decent SMP machine. The goal was to see if multiple clients (20 in the test) tries to update tuples in different data blocks, if the EX lock on the VM page causes any contention. I can confirm that I haven't seen any drop in the tps with VM. I guess since the bit reset is a very small code compared to the entire UPDATE code path, may be its less likely than I thought previously that multiple clients attempt to reset the bit at the same time. I'll do some more tests to see if setting the bit in HOT-prune path leads to any contention or not. I can send details of the test I did, if anyone is interested. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Le 12 déc. 08 à 13:11, Pavan Deolasee a écrit : > I did some tests today with pgbench on a decent SMP machine. The goal > was to see if multiple clients (20 in the test) tries to update tuples > in different data blocks, if the EX lock on the VM page causes any > contention. If you want to test for a really high number of clients, you could have a try at tsung, which is designed for doing just this. http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php http://tsung.erlang-projects.org/ HTH, regards, - -- dim -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iEYEARECAAYFAklCWekACgkQlBXRlnbh1bmQdACgwloRjx9lZyhLpjGCSuY7K/Au xmUAoJSAlVoqerio175UHFPS1xVzI3iZ =45KY -----END PGP SIGNATURE-----
Pavan Deolasee wrote: > Another thing I noticed is the since VACUUM tries to set the bit in > the first phase, it's working only because HOT prunes DEAD tuples just > before we do another scan on line pointers (which I had earlier talked > about getting rid of. May be its time I do that). Otherwise, the > visibility bit won't be set even though the DEAD tuples will be > removed in the second scan and the rest are all LIVE tuples. So if we > at all want to take out the another scan of line pointers from the > first pass, we should rather push the work setting bits in the prune > code. I don't quite understand this paragraph. If there's any DEAD tuples or line-pointers, the all-visible flag can't be set. After an UPDATE or DELETE, it indeed takes two vacuums until the bits in the visibility map are set. Or did you mean that it only works because the prune phase sets the hint bits on the tuples? HeapTupleSatisfiesVacuum sets them too, so we're not relying on the prune phase to set them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 17, 2008 at 3:29 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > > I don't quite understand this paragraph. If there's any DEAD tuples or > line-pointers, the all-visible flag can't be set. After an UPDATE or DELETE, > it indeed takes two vacuums until the bits in the visibility map are set. > > Or did you mean that it only works because the prune phase sets the hint > bits on the tuples? HeapTupleSatisfiesVacuum sets them too, so we're not > relying on the prune phase to set them. > No, I am saying, HOT-prune removes all DEAD tuples from the page (not the DEAD line pointers though) and that's why you may not need two vacuums for the visibility bit to set because the first phase of vacuum would not find any DEAD tuples. Without HOT prune, two vacuums will be required to set the bit and that would be counter intuitive because somebody who has just run vacuum on the table would expect the next vacuum to be no-op if there are no updates/deletes in between. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On Wed, Dec 17, 2008 at 3:29 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I don't quite understand this paragraph. If there's any DEAD tuples or >> line-pointers, the all-visible flag can't be set. > No, I am saying, HOT-prune removes all DEAD tuples from the page (not > the DEAD line pointers though) and that's why you may not need two > vacuums for the visibility bit to set because the first phase of > vacuum would not find any DEAD tuples. I think what you are suggesting is that we should set the visibility map bit while dead line pointers (tombstones) still remain. If that's what you meant it's a bad idea. It would be correct in the sense of "there are no invisible rows here", but it's not correct in the sense of "there is no work for vacuum to do here". In particular that would be the wrong definition for the hoped-for future feature of index-only scans. regards, tom lane
On Wed, Dec 17, 2008 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > I think what you are suggesting is that we should set the visibility map > bit while dead line pointers (tombstones) still remain. If that's what > you meant it's a bad idea. No, I'm not suggesting that. I understand the problem there. I was merely explaining how HOT-prune may some time (when there are no DEAD line pointers created) help us set the visibility bit early. OTOH I think we can still set PD_ALL_VISIBLE page header flag even when there are DEAD line pointers. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > OTOH I think we can still set PD_ALL_VISIBLE page header flag even > when there are DEAD line pointers. That would mean the header flag means something different than the map bit does, which would mean extra tests need to be made before propagating the flag bit to the map. This is doable, no doubt, but it seems pretty error-prone ... regards, tom lane
Is there anything to do with the below issue? --------------------------------------------------------------------------- Pavan Deolasee wrote: > On Wed, Dec 17, 2008 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > I think what you are suggesting is that we should set the visibility map > > bit while dead line pointers (tombstones) still remain. If that's what > > you meant it's a bad idea. > > No, I'm not suggesting that. I understand the problem there. I was > merely explaining how HOT-prune may some time (when there are no DEAD > line pointers created) help us set the visibility bit early. > > OTOH I think we can still set PD_ALL_VISIBLE page header flag even > when there are DEAD line pointers. > > Thanks, > Pavan > > -- > Pavan Deolasee > EnterpriseDB http://www.enterprisedb.com > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > "Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > > OTOH I think we can still set PD_ALL_VISIBLE page header flag even > > when there are DEAD line pointers. > > That would mean the header flag means something different than the > map bit does, which would mean extra tests need to be made before > propagating the flag bit to the map. This is doable, no doubt, > but it seems pretty error-prone ... Are there any open items from this discussion? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Tom Lane wrote: >> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes: >>> OTOH I think we can still set PD_ALL_VISIBLE page header flag even >>> when there are DEAD line pointers. >> That would mean the header flag means something different than the >> map bit does, which would mean extra tests need to be made before >> propagating the flag bit to the map. This is doable, no doubt, >> but it seems pretty error-prone ... > > Are there any open items from this discussion? No. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com