Discussion: Pluggable Storage - Andres's take
Hi,

As I've previously mentioned, I had planned to spend some time polishing Haribabu's version of the pluggable storage patch and rebasing it on the vtable based slot approach from [1]. While doing so I found more and more things that I previously hadn't noticed. I started rewriting things into something closer to what I think we want architecturally.

The current state of my version of the patch is *NOT* ready for proper review (it doesn't even pass all tests, and there are FIXMEs / elog()s). But I think it's getting close enough to its eventual shape that more eyes, and potentially more hands on keyboards, can be useful.

The most fundamental issues I had with Haribabu's last version from [2] are the following:

- The use of TableTuple, a typedef of void *, is bad on multiple fronts. For one, it removes just about all type safety. There were numerous bugs in the patch where things were just cast from HeapTuple to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's a really, really bad idea to introduce a vague type like this for development purposes alone; it makes it way too hard to refactor, essentially throwing the biggest benefit of type-safe languages out of the window.

  Additionally, I think it's also the wrong approach architecturally. We shouldn't assume that a tuple can efficiently be represented as a single palloc'ed chunk. In fact, we should move *away* from relying on that so much.

  I've thus removed the TableTuple type entirely.

- Previous versions of the patchset exposed Buffers in the tableam.h API, and performed buffer locking / pinning / ExecStoreTuple() calls outside of it. That is wrong in my opinion, as various AMs will deal very differently with buffer pinning & locking. The relevant logic is largely moved within the AM. Bringing me to the next point:

- tableam exposed various operations based on HeapTuples/TableTuples (and their Buffers). This all needs to be slot based, as we can't otherwise represent the way each AM will deal with this.
  I've largely converted the API to be slot based. That has some fallout, but I think it largely works. Lots of outdated comments remain.

- I think moving the indexing from outside the table layer into the storage layer isn't a good idea. It led to having to pass EState into the tableam, a callback API to perform index updates, etc. This seems to have at least partially been triggered by the speculative insertion codepaths. I've reverted this part of the changes. The speculative insertion / confirm codepaths are now exposed in tableam.h - I think that's the right thing, because we'll likely want to have that functionality across more than a single tuple in the future.

- The visibility functions relied on the *caller* performing buffer locking. That's not a great idea, because generic code shouldn't know about the locking scheme a particular AM needs. I've changed the external visibility functions to instead take a slot, and to perform the necessary locking inside.

- There were numerous tableam callback uses inside heapam.c - that makes no sense, we know what the storage is therein. The relevant

- The integration between index lookups and heap lookups based on the results of an index lookup was IMO too tight. The index code dealt with heap tuples, which isn't great. I've introduced a new concept, an 'IndexFetchTableData' scan. It's initialized when building an index scan, and provides the necessary state (say, the current heap buffer) to do table lookups from within a heap.

- The AM of relations required for bootstrapping was set to 0 - I don't think that's a good idea. I changed it so that it's set to the heap AM as well.

- HOT was encoded in the API in a bunch of places. That doesn't look right to me. I tried to improve a bit on that, but I'm not yet quite sure I like it. Needs written explanation & arguments...

- The heap tableam did a heap_copytuple() nearly everywhere, leading to higher memory usage, because the resulting tuples weren't freed or anything.
  There might be a reason for doing such a change - we've certainly discussed that before - but I'm *vehemently* against doing that at the same time we introduce pluggable storage. Analyzing the performance effects will be hard enough without changes like this.

- I've for now backed out the heap rewrite changes, partially. Mostly because I didn't like the way the abstraction looks, but I haven't quite figured out how it should look.

- I did not like that speculative tokens were moved to slots. There's really no reason for them to live outside parameters to tableam.h functions.

- lotsa additional smaller changes.

- lotsa new bugs.

My current working state is at [3] (URLs to clone the repo are at [4]). This is *HEAVILY WIP*. I plan to continue working on it over the next days, but I'll temporarily focus on v11 work. If others want, I could move the repo to github and grant others write access.

I think the patch series should eventually look like:

- move vacuumlazy.c (and other similar files) into access/heap - there's really nothing generic there. This is a fairly independent task.
- slot'ify the FDW RefetchForeignRow_function
- vtable based slot API, based on [1]
- slot'ify trigger API
- redo EPQ based on slots (prototyped in my git tree)
- redo trigger API to be slot based
- tuple traversal API changes
- tableam infrastructure, with an error if a non-builtin AM is chosen
- move heap and calling code to be tableam based
- make vacuum callback based (not vacuum.c, just vacuumlazy.c)
- [other patches]
- allow other AMs
- introduce test AM

Tasks / Questions:

- split up patch
- Change the heap table AM to not allocate a handler function for each table; instead, allocate it statically. That avoids a significant amount of data duplication, and allows for a few more compiler optimizations.
- Merge tableam.h and tableamapi.h and make most tableam.c functions small inline functions. Having one-line tableam.c wrappers makes this more expensive than necessary.
  We'll have big enough trouble not regressing performance-wise.

- change scan level slot creation to use a tableam function for doing so
- get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
- COPY's multi_insert path should probably deal with a bunch of slots, rather than forming HeapTuples
- bitmap index scans probably need a new tableam.h callback, abstracting bitgetpage()
- I suspect IndexBuildHeapScan might need to move into the tableam.h API - it's not clear to me that it's realistically possible to do this in a generic manner.

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20180220224318.gw4oe5jadhpmcdnm%40alap3.anarazel.de
[2] http://archives.postgresql.org/message-id/CAJrrPGcN5A4jH0PJ-s=6k3+SLA4pozC4HHRdmvU1ZBuA20TE-A@mail.gmail.com
[3] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/pluggable-storage
[4] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> As I've previously mentioned I had planned to spend some time to polish
> Haribabu's version of the pluggable storage patch and rebase it on the
> vtable based slot approach from [1]. While doing so I found more and
> more things that I previously hadn't noticed. I started rewriting things
> into something closer to what I think we want architecturally.
Thanks for the deep review and changes.
> The current state of my version of the patch is *NOT* ready for proper
> review (it doesn't even pass all tests, there's FIXME / elog()s). But I
> think it's getting close enough to its eventual shape that more eyes,
> and potentially more hands on keyboards, can be useful.
I will try to update it to make sure that it passes all the tests, and
also try to reduce the FIXMEs.
> The most fundamental issues I had with Haribabu's last version from [2]
> are the following:
>
> - The use of TableTuple, a typedef of void *, is bad on multiple
>   fronts. For one, it removes just about all type safety. There were
>   numerous bugs in the patch where things were just cast from HeapTuple
>   to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's
>   a really, really bad idea to introduce a vague type like this for
>   development purposes alone; it makes it way too hard to refactor,
>   essentially throwing the biggest benefit of type-safe languages out of
>   the window.
My earlier intention was to remove HeapTuple usage entirely and replace
it with slots everywhere outside the tableam. But because of the heavy
use of HeapTuple, it ended up as TableTuple before reaching that stage.
> Additionally, I think it's also the wrong approach architecturally. We
> shouldn't assume that a tuple can efficiently be represented as a
> single palloc'ed chunk. In fact, we should move *away* from relying on
> that so much.
>
> I've thus removed the TableTuple type entirely.
Thanks for the changes. I haven't checked the code yet - so for now,
whenever a HeapTuple is required, it will be generated from the slot?
> - Previous versions of the patchset exposed Buffers in the tableam.h
>   API, and performed buffer locking / pinning / ExecStoreTuple() calls
>   outside of it. That is wrong in my opinion, as various AMs will deal
>   very differently with buffer pinning & locking. The relevant logic is
>   largely moved within the AM. Bringing me to the next point:
>
> - tableam exposed various operations based on HeapTuples/TableTuples
>   (and their Buffers). This all needs to be slot based, as we can't
>   represent the way each AM will deal with this. I've largely converted
>   the API to be slot based. That has some fallout, but I think it
>   largely works. Lots of outdated comments remain.
Yes, I agree with you.
> - I think moving the indexing from outside the table layer into the
>   storage layer isn't a good idea. It led to having to pass EState into
>   the tableam, a callback API to perform index updates, etc. This seems
>   to have at least partially been triggered by the speculative insertion
>   codepaths. I've reverted this part of the changes. The speculative
>   insertion / confirm codepaths are now exposed in tableam.h - I think
>   that's the right thing, because we'll likely want to have that
>   functionality across more than a single tuple in the future.
>
> - The visibility functions relied on the *caller* performing buffer
>   locking. That's not a great idea, because generic code shouldn't know
>   about the locking scheme a particular AM needs. I've changed the
>   external visibility functions to instead take a slot, and perform the
>   necessary locking inside.
When I first moved all the visibility functions into the tableam, I
noticed this problem; it will be good if the API takes care of buffer
locking etc. itself.
> - There were numerous tableam callback uses inside heapam.c - that makes
>   no sense, we know what the storage is therein. The relevant
>
> - The integration between index lookups and heap lookups based on the
>   results of an index lookup was IMO too tight. The index code dealt
>   with heap tuples, which isn't great. I've introduced a new concept, an
>   'IndexFetchTableData' scan. It's initialized when building an index
>   scan, and provides the necessary state (say, the current heap buffer)
>   to do table lookups from within a heap.
I agree that the new concept for accessing the heap from the index will
be good.
> - The AM of relations required for bootstrapping was set to 0 - I don't
>   think that's a good idea. I changed it so that it's set to the heap AM
>   as well.
>
> - HOT was encoded in the API in a bunch of places. That doesn't look
>   right to me. I tried to improve a bit on that, but I'm not yet quite
>   sure I like it. Needs written explanation & arguments...
>
> - The heap tableam did a heap_copytuple() nearly everywhere, leading to
>   higher memory usage, because the resulting tuples weren't freed or
>   anything. There might be a reason for doing such a change - we've
>   certainly discussed that before - but I'm *vehemently* against doing
>   that at the same time we introduce pluggable storage. Analyzing the
>   performance effects will be hard enough without changes like this.
How about using a slot instead of a tuple, and reusing it? I don't know
yet whether that is possible everywhere.
> - I've for now backed out the heap rewrite changes, partially. Mostly
>   because I didn't like the way the abstraction looks, but I haven't
>   quite figured out how it should look.
>
> - I did not like that speculative tokens were moved to slots. There's
>   really no reason for them to live outside parameters to tableam.h
>   functions.
>
> - lotsa additional smaller changes.
>
> - lotsa new bugs.
Thanks for all the changes.
> My current working state is at [3] (URLs to clone the repo are at [4]).
> This is *HEAVILY WIP*. I plan to continue working on it over the next
> days, but I'll temporarily focus on v11 work. If others want, I could
> move the repo to github and grant others write access.
Yes, I want to access the code and do further development on it.
> Tasks / Questions:
>
> - split up patch

How about first generating the refactoring changes as patches, based on
the code in your repo, as discussed here [1]?
> - Change the heap table AM to not allocate a handler function for each
>   table; instead, allocate it statically. That avoids a significant
>   amount of data duplication, and allows for a few more compiler
>   optimizations.

Some kind of static handler variable for each tableam, then - but I need
to check how we can access that static handler from the relation.
> - Merge tableam.h and tableamapi.h and make most tableam.c functions
>   small inline functions. Having one-line tableam.c wrappers makes this
>   more expensive than necessary. We'll have big enough trouble not
>   regressing performance-wise.

OK.
> - change scan level slot creation to use a tableam function for doing so
> - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid

So with this there shouldn't be any slot-to-TID mapping, or there should
be some other way to obtain it?
> - COPY's multi_insert path should probably deal with a bunch of slots,
>   rather than forming HeapTuples
OK.
> - bitmap index scans probably need a new tableam.h callback, abstracting
>   bitgetpage()
OK.
Regards,
Haribabu Kommi
Fujitsu Australia
Hi!

On Tue, Jul 3, 2018 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:

> As I've previously mentioned I had planned to spend some time to polish Haribabu's version of the pluggable storage patch and rebase it on the vtable based slot approach from [1]. While doing so I found more and more things that I previously hadn't noticed. I started rewriting things into something closer to what I think we want architecturally.

Great, thank you for working on this patchset!

> The current state of my version of the patch is *NOT* ready for proper review (it doesn't even pass all tests, there's FIXME / elog()s). But I think it's getting close enough to its eventual shape that more eyes, and potentially more hands on keyboards, can be useful.
>
> The most fundamental issues I had with Haribabu's last version from [2] are the following:
>
> - The use of TableTuple, a typedef of void *, is bad on multiple fronts. For one, it removes just about all type safety. There were numerous bugs in the patch where things were just cast from HeapTuple to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's a really, really bad idea to introduce a vague type like this for development purposes alone; it makes it way too hard to refactor, essentially throwing the biggest benefit of type-safe languages out of the window.
>
> Additionally, I think it's also the wrong approach architecturally. We shouldn't assume that a tuple can efficiently be represented as a single palloc'ed chunk. In fact, we should move *away* from relying on that so much.
>
> I've thus removed the TableTuple type entirely.

+1, TableTuple was a vague concept.

> - Previous versions of the patchset exposed Buffers in the tableam.h API, and performed buffer locking / pinning / ExecStoreTuple() calls outside of it. That is wrong in my opinion, as various AMs will deal very differently with buffer pinning & locking. The relevant logic is largely moved within the AM.
> Bringing me to the next point:
>
> - tableam exposed various operations based on HeapTuples/TableTuples (and their Buffers). This all needs to be slot based, as we can't represent the way each AM will deal with this. I've largely converted the API to be slot based. That has some fallout, but I think it largely works. Lots of outdated comments remain.

Makes sense to me. I like passing TupleTableSlot to the tableam API functions much more.

> - I think moving the indexing from outside the table layer into the storage layer isn't a good idea. It led to having to pass EState into the tableam, a callback API to perform index updates, etc. This seems to have at least partially been triggered by the speculative insertion codepaths. I've reverted this part of the changes. The speculative insertion / confirm codepaths are now exposed in tableam.h - I think that's the right thing, because we'll likely want to have that functionality across more than a single tuple in the future.

I agree that passing EState into the tableam doesn't look good. But I believe that the tableam needs way more control over indexes than it has in your version of the patch. Even if tableam-independent insertion into indexes on tuple insert is more or less OK, on update we need something smarter than just inserting index tuples depending on an "update_indexes" flag. A tableam-specific update method may decide to update only some of the indexes. For example, when zheap performs an update in-place, it inserts only into those indexes whose fields were updated. And I think any undo-log based storage would have similar behavior. Moreover, it might be required to do something with existing index tuples (for instance, as I understand it, zheap sets a "deleted" flag on index tuples related to previous values of the updated fields).
If we would like to move indexing outside of the tableam, then we might turn "update_indexes" from a bool into an enum with values like "don't insert index tuples", "insert all index tuples", "insert index tuples only for updated fields", and so on. But that looks more like a set of hardcoded cases for particular implementations than a proper API. So, probably we shouldn't move indexing outside of the tableam, but rather provide better wrappers for doing it inside the tableam?

> - The visibility functions relied on the *caller* performing buffer locking. That's not a great idea, because generic code shouldn't know about the locking scheme a particular AM needs. I've changed the external visibility functions to instead take a slot, and perform the necessary locking inside.

Makes sense to me. But would it cause extra locking/unlocking, and in turn a performance impact?

> - There were numerous tableam callback uses inside heapam.c - that makes no sense, we know what the storage is therein. The relevant

Ok.

> - The integration between index lookups and heap lookups based on the results of an index lookup was IMO too tight. The index code dealt with heap tuples, which isn't great. I've introduced a new concept, an 'IndexFetchTableData' scan. It's initialized when building an index scan, and provides the necessary state (say, the current heap buffer) to do table lookups from within a heap.

+1

> - The AM of relations required for bootstrapping was set to 0 - I don't think that's a good idea. I changed it so it's set to the heap AM as well.

+1

> - HOT was encoded in the API in a bunch of places. That doesn't look right to me. I tried to improve a bit on that, but I'm not yet quite sure I like it. Needs written explanation & arguments...

Yes, HOT is a heapam-specific feature. Other tableams might not have HOT. But it appears that we still expose the hot_search_buffer() function in the tableam API. That function has no usage, though, so it's just redundant and can be removed.
> - the heap tableam did a heap_copytuple() nearly everywhere, leading to higher memory usage, because the resulting tuples weren't freed or anything. There might be a reason for doing such a change - we've certainly discussed that before - but I'm *vehemently* against doing that at the same time we introduce pluggable storage. Analyzing the performance effects will be hard enough without changes like this.

I think that once we've switched to slots, doing heap_copytuple() so frequently is not required anymore.

> - I've for now backed out the heap rewrite changes, partially. Mostly because I didn't like the way the abstraction looks, but haven't quite figured out how it should look.

Yeah, it's the hard part, but we need to invent something in this area...

> - I did not like that speculative tokens were moved to slots. There's really no reason for them to live outside parameters to tableam.h functions.

Good.

> My current working state is at [3] (URLs to clone the repo are at [4]). This is *HEAVILY WIP*. I plan to continue working on it over the next days, but I'll temporarily focus on v11 work. If others want, I could move the repo to github and grant others write access.

Github would be more convenient for me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jul 5, 2018 at 3:25 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> > My current working state is at [3] (URLs to clone the repo are at [4]). This is *HEAVILY WIP*. I plan to continue working on it over the next days, but I'll temporarily focus on v11 work. If others want, I could move the repo to github and grant others write access.
>
> Github would be more convenient for me.

I have another note. It appears that you left my patch for locking the last version of a tuple in one call (the heapam_lock_tuple() function) almost without changes. During the PGCon 2018 Developer Meeting, I remember you were somewhat unhappy with this approach. So, do you have any notes about that for now?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

I've pushed up a new version to https://github.com/anarazel/postgres-pluggable-storage which now passes all the tests. Besides a lot of bugfixes, I've rebased the tree, moved TriggerData to be primarily slot based (with a conversion roundtrip when calling trigger functions), and a lot of other small things.

On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:

> On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:
> > The current state of my version of the patch is *NOT* ready for proper review (it doesn't even pass all tests, there's FIXME / elog()s). But I think it's getting close enough to its eventual shape that more eyes, and potentially more hands on keyboards, can be useful.
>
> I will try to update it to make sure that it passes all the tests and also try to reduce the FIXMEs.

Cool. Alexander, Haribabu, if you give me (privately) your github accounts, I'll give you write access to that repository.

> > The most fundamental issues I had with Haribabu's last version from [2] are the following:
> >
> > - The use of TableTuple, a typedef of void *, is bad on multiple fronts. For one, it removes just about all type safety. There were numerous bugs in the patch where things were just cast from HeapTuple to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's a really, really bad idea to introduce a vague type like this for development purposes alone; it makes it way too hard to refactor, essentially throwing the biggest benefit of type-safe languages out of the window.
>
> My earlier intention was to remove HeapTuple usage entirely and replace it with slots everywhere outside the tableam. But because of the heavy use of HeapTuple, it ended up as TableTuple before reaching that stage.

I don't think that's necessary - a lot of the system catalog accesses are going to continue to be heap tuple accesses.
And the conversions you did largely continued to access TableTuples as heap tuples - it was just that the compiler didn't warn about it anymore. A prime example of that is the way the rewriteheap / cluster integration was done. Cluster continued to sort tuples as heap tuples - even though that's likely incompatible with other tuple formats, which need different state.

> > Additionally, I think it's also the wrong approach architecturally. We shouldn't assume that a tuple can efficiently be represented as a single palloc'ed chunk. In fact, we should move *away* from relying on that so much.
> >
> > I've thus removed the TableTuple type entirely.
>
> Thanks for the changes. I haven't checked the code yet - so for now, whenever a HeapTuple is required, it will be generated from the slot?

Pretty much.

> > - the heap tableam did a heap_copytuple() nearly everywhere, leading to higher memory usage, because the resulting tuples weren't freed or anything. There might be a reason for doing such a change - we've certainly discussed that before - but I'm *vehemently* against doing that at the same time we introduce pluggable storage. Analyzing the performance effects will be hard enough without changes like this.
>
> How about using a slot instead of a tuple, and reusing it? I don't know yet whether that is possible everywhere.

Not quite sure what you mean?

> > Tasks / Questions:
> >
> > - split up patch
>
> How about first generating the refactoring changes as patches, based on the code in your repo, as discussed here [1]?

Sure - it was just too much work at the moment ;)

> > - Change the heap table AM to not allocate a handler function for each table; instead, allocate it statically. That avoids a significant amount of data duplication, and allows for a few more compiler optimizations.
>
> Some kind of static handler variable for each tableam, then - but I need to check how we can access that static handler from the relation.
I'm not sure what you mean by "how can we access"? We can just return a pointer to the constant data from the current handler function. Except that adding a bunch of consts would be good, the external interface wouldn't need to change.

> > - change scan level slot creation to use a tableam function for doing so
> > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
>
> So with this there shouldn't be any slot-to-TID mapping, or there should be some other way to obtain it?

I'm not sure I follow?

> > - bitmap index scans probably need a new tableam.h callback, abstracting bitgetpage()
>
> OK.

Any chance you could try to tackle this? I'm going to be mostly out this week, so we'd probably not run across each other's feet...

Greetings,

Andres Freund
Hi,

On 2018-07-05 15:25:25 +0300, Alexander Korotkov wrote:

> > - I think moving the indexing from outside the table layer into the storage layer isn't a good idea. It led to having to pass EState into the tableam, a callback API to perform index updates, etc. This seems to have at least partially been triggered by the speculative insertion codepaths. I've reverted this part of the changes. The speculative insertion / confirm codepaths are now exposed in tableam.h - I think that's the right thing, because we'll likely want to have that functionality across more than a single tuple in the future.
>
> I agree that passing EState into the tableam doesn't look good. But I believe that the tableam needs way more control over indexes than it has in your version of the patch. Even if tableam-independent insertion into indexes on tuple insert is more or less OK, on update we need something smarter than just inserting index tuples depending on an "update_indexes" flag. A tableam-specific update method may decide to update only some of the indexes. For example, when zheap performs an update in-place, it inserts only into those indexes whose fields were updated. And I think any undo-log based storage would have similar behavior. Moreover, it might be required to do something with existing index tuples (for instance, as I understand it, zheap sets a "deleted" flag on index tuples related to previous values of the updated fields).

I agree that we probably need more - I'm just inclined to think that we need a more concrete target to work against. Currently zheap's indexing logic is still fairly naive; I don't think we'll get the interface right without having worked further on the zheap side of things.

> > - The visibility functions relied on the *caller* performing buffer locking. That's not a great idea, because generic code shouldn't know about the locking scheme a particular AM needs.
> > I've changed the external visibility functions to instead take a slot, and perform the necessary locking inside.
>
> Makes sense to me. But would it cause extra locking/unlocking, and in turn a performance impact?

I don't think so - nearly all the performance-relevant cases do all the visibility logic inside the AM, where the underlying functions, which do not do the locking, can be used. Pretty much all the converted places just had manual LockBuffer calls.

> > - HOT was encoded in the API in a bunch of places. That doesn't look right to me. I tried to improve a bit on that, but I'm not yet quite sure I like it. Needs written explanation & arguments...
>
> Yes, HOT is a heapam-specific feature. Other tableams might not have HOT. But it appears that we still expose the hot_search_buffer() function in the tableam API. That function has no usage, though, so it's just redundant and can be removed.

Yea, that was a leftover.

> > - the heap tableam did a heap_copytuple() nearly everywhere, leading to higher memory usage, because the resulting tuples weren't freed or anything. There might be a reason for doing such a change - we've certainly discussed that before - but I'm *vehemently* against doing that at the same time we introduce pluggable storage. Analyzing the performance effects will be hard enough without changes like this.
>
> I think that once we've switched to slots, doing heap_copytuple() so frequently is not required anymore.

It's mostly gone now.

> > - I've for now backed out the heap rewrite changes, partially. Mostly because I didn't like the way the abstraction looks, but haven't quite figured out how it should look.
>
> Yeah, it's the hard part, but we need to invent something in this area...

I agree. But I really don't yet quite know what. I somewhat wonder if we should just add a cluster_rel() callback to the tableam and let it deal with everything :(.
As previously proposed, the interface wouldn't have worked with anything not losslessly encodable into a heaptuple, which is unlikely to be sufficient.

FWIW, I plan to be mostly out until Thursday this week; then I'll rebase onto the new version of the abstract slot patch and try to split up the patchset. Once that's done, I'll do a prototype conversion of zheap, which I'm sure will show up a lot of weaknesses in the current abstraction. After that, I hope we can collaborate / divide & conquer to get the individual pieces into commit shape.

If either of you wants to get a head start separating something out, let's try to organize who would do what? The EPQ and trigger slotification are probably good candidates.

Greetings,

Andres Freund
On Mon, Jul 16, 2018 at 11:35 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:
> On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:
> > The most fundamental issues I had with Haribabu's last version from [2]
> > are the following:
> >
> > - The use of TableTuple, a typedef from void *, is bad from multiple
> > fronts. For one it reduces just about all type safety. There were
> > numerous bugs in the patch where things were just cast from HeapTuple
> > to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's
> > a really, really bad idea to introduce a vague type like this for
> > development purposes alone, it makes it way too hard to refactor -
> > essentially throwing the biggest benefit of type safe languages out of
> > the window.
> >
>
> My earlier intention was to remove the HeapTuple usage entirely and
> replace it with slots everywhere outside the tableam. But it ended up
> with TableTuple before reaching that stage, because of the heavy use of
> HeapTuple.
I don't think that's necessary - a lot of the system catalog accesses
are going to continue to be heap tuple accesses. And the conversions you
did still largely access TableTuples as heap tuples - it was just that
the compiler didn't warn about it anymore.
A prime example of that is the way the rewriteheap / cluster
integration was done. Cluster continued to sort tuples as heap tuples -
even though that's likely incompatible with other tuple formats which
need different state.
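The type-safety hazard described here can be sketched in a few lines of toy C (the types below are simplified stand-ins, not PostgreSQL's real struct definitions): because TableTuple is a typedef of void *, the compiler silently accepts any pointer, including a TupleTableSlot *.

```c
#include <assert.h>

/* Toy stand-ins for the real PostgreSQL types. */
typedef void *TableTuple;               /* the problematic typedef */

typedef struct HeapTupleData { int t_len; } HeapTupleData;
typedef HeapTupleData *HeapTuple;
typedef struct TupleTableSlot { int tts_nvalid; } TupleTableSlot;

/* An accessor written against TableTuple: the cast compiles no matter
 * what kind of pointer the caller actually passed. */
static int
tuple_len(TableTuple tup)
{
    return ((HeapTuple) tup)->t_len;
}

/* A caller can hand over a TupleTableSlot * with no warning at all:
 *
 *     TupleTableSlot slot;
 *     tuple_len(&slot);    -- compiles cleanly, reads garbage at runtime
 *
 * With HeapTuple in the signature instead, the compiler would reject it. */
```

This is exactly the class of bug that removing the TableTuple type makes the compiler catch again.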
OK. Understood.
> > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
> > a higher memory usage, because the resulting tuples weren't freed or
> > anything. There might be a reason for doing such a change - we've
> > certainly discussed that before - but I'm *vehemently* against doing
> > that at the same time we introduce pluggable storage. Analyzing the
> > performance effects will be hard enough without changes like this.
> >
>
> How about using a slot instead of a tuple and reusing it? I don't know
> yet whether it is possible everywhere.
Not quite sure what you mean?
I thought using slots everywhere could reduce the use of
heap_copytuple(). I understand from your reply in the other mail that
you have already made those changes.
> > Tasks / Questions:
> >
> > - split up patch
> >
>
> How about generating refactoring changes as patches first based on
> the code in your repo as discussed here[1]?
Sure - it was just at the moment too much work ;)
Yes, it is too much work. How about doing this once most of the
open items are finished?
> > - Change heap table AM to not allocate handler function for each table,
> > instead allocate it statically. Avoids a significant amount of data
> > duplication, and allows for a few more compiler optimizations.
> >
>
> Some kind of static variable handler for each tableam, but we need to
> check how we can access that static handler from the relation.
I'm not sure what you mean by "how can we access"? We can just return a
pointer from the constant data from the current handler? Except that
adding a bunch of consts would be good, the external interface wouldn't
need to change?
I mean we may need to store some tableam ID in each table, so that based
on that ID we can get the static tableam handler, because at any given
time there may be tables from different tableam methods.
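A minimal sketch of the arrangement under discussion (hypothetical names, not the actual PostgreSQL API): each AM's handler function returns a pointer to a single statically allocated const callback table, and the relation only stores an AM identifier used to look it up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy callback table; the real TableAmRoutine has many more members. */
typedef struct TableAmRoutine
{
    const char *name;
    int        (*tuple_insert)(void);
} TableAmRoutine;

static int heap_tuple_insert(void) { return 1; }

/* One static const instance per AM - nothing is copied per relation. */
static const TableAmRoutine heap_routine = {
    .name = "heap",
    .tuple_insert = heap_tuple_insert,
};

static const TableAmRoutine *
heap_tableam_handler(void)
{
    return &heap_routine;
}

/* The relation only remembers which AM it uses (cf. pg_class.relam). */
typedef struct RelationData { int rd_amid; } RelationData;

static const TableAmRoutine *
table_am_for_relation(const RelationData *rel)
{
    /* A real lookup would consult the catalog by rd_amid. */
    return rel->rd_amid == 0 ? heap_tableam_handler() : NULL;
}
```

Since every relation of the same AM shares the one const table, nothing needs to be duplicated, and the compiler can treat the callbacks as constants.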
> > - change scan level slot creation to use tableam function for doing so
> > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
> >
>
> So with this there shouldn't be a way to map from slot to tid, or there
> should be some other way?
I'm not sure I follow?
With the replacement of heap tuples by slots, currently only the tid is
passed to the tableam methods via the slot. To get rid of the tid in the
slot, we may need some other way of passing it?
> > - bitmap index scans probably need a new tableam.h callback, abstracting
> > bitgetpage()
> >
>
> OK.
Any chance you could try to tackle this? I'm going to be mostly out
this week, so we'd probably not run across each other's feet...
OK, I will take care of the above point.
Regards,
Haribabu Kommi
Fujitsu Australia
On Tue, Jul 17, 2018 at 11:01 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Mon, Jul 16, 2018 at 11:35 PM Andres Freund <andres@anarazel.de> wrote:

On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:
> On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:
>> - bitmap index scans probably need a new tableam.h callback, abstracting
> > bitgetpage()
> >
>
> OK.
Any chance you could try to tackle this? I'm going to be mostly out
this week, so we'd probably not run across each other's feet...

OK, I will take care of the above point.
I added a new API in tableam.h to get all the visible tuples in a page,
abstracting the bitgetpage() function.
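A rough sketch of what such a callback could look like (toy code with hypothetical names; the real signature differs): the executor hands the AM one bitmap page result, and the AM returns the offsets of the visible tuples, instead of the executor touching heap pages directly.

```c
#include <assert.h>

#define MAX_TUPLES_PER_PAGE 4

/* Toy stand-in for the bitmap iterator result for one block. */
typedef struct TBMIterateResult
{
    int blockno;
    int ntuples;                        /* -1 means lossy: check whole page */
    int offsets[MAX_TUPLES_PER_PAGE];   /* exact case: candidate offsets */
} TBMIterateResult;

/* AM callback sketch: fill visible_out with the offsets of the visible
 * tuples on this page.  A toy visibility rule (even offsets are
 * visible) stands in for real snapshot checks. */
static int
toy_bitmap_pagescan(const TBMIterateResult *tbmres,
                    int visible_out[MAX_TUPLES_PER_PAGE])
{
    int n = 0;

    if (tbmres->ntuples < 0)
    {
        /* lossy page: every offset on the page must be checked */
        for (int off = 1; off <= MAX_TUPLES_PER_PAGE; off++)
            if (off % 2 == 0)
                visible_out[n++] = off;
    }
    else
    {
        /* exact page: only the offsets from the bitmap are checked */
        for (int i = 0; i < tbmres->ntuples; i++)
            if (tbmres->offsets[i] % 2 == 0)
                visible_out[n++] = tbmres->offsets[i];
    }
    return n;
}
```

The point of the abstraction is that pinning, locking, and visibility all stay inside the AM; the executor only ever sees the resulting offsets.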
>>- Merge tableam.h and tableamapi.h and make most tableam.c functions
>> small inline functions. Having one-line tableam.c wrappers makes this
>> more expensive than necessary. We'll have big enough trouble not
>> regressing performancewise.
I merged tableam.h and tableamapi.h into tableam.h and changed all the
functions to be inline. This change may have added some additional
header dependencies; I will check whether their need can be removed.
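The difference between the two wrapper styles can be sketched like this (toy dispatch table with illustrative names, not the real tableam interface): a one-line out-of-line wrapper adds a full function call on top of the indirect call, while a header-inline wrapper disappears at the call site.

```c
#include <assert.h>

/* Toy dispatch table (illustrative, not the real tableam interface). */
typedef struct TableAmRoutine
{
    int (*scan_getnext)(int cursor);
} TableAmRoutine;

typedef struct Relation
{
    const TableAmRoutine *rd_tableam;
} Relation;

/* What a one-line out-of-line tableam.c wrapper looks like: each call
 * pays a regular call on top of the indirect call. */
int
table_scan_getnext_outofline(Relation *rel, int cursor)
{
    return rel->rd_tableam->scan_getnext(cursor);
}

/* The header-inline version: the wrapper itself costs nothing at the
 * call site, leaving only the indirect call. */
static inline int
table_scan_getnext(Relation *rel, int cursor)
{
    return rel->rd_tableam->scan_getnext(cursor);
}

static int toy_heap_getnext(int cursor) { return cursor + 1; }
static const TableAmRoutine toy_heap_am = { toy_heap_getnext };
```

Both versions behave identically; the inline form just gives the compiler the chance to optimize the wrapper away in hot scan loops.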
Attached are the updated patches on top of your github tree.
Currently I am working on the following.
- I observed that there is a crash when running isolation tests.
- COPY's multi_insert path should probably deal with a bunch of slots,
rather than forming HeapTuples
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Tue, Jul 24, 2018 at 11:31 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Tue, Jul 17, 2018 at 11:01 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

> I added new API in the tableam.h to get all the page visible tuples to
> abstract the bitgetpage() function.
>
> >> - Merge tableam.h and tableamapi.h and make most tableam.c functions
> >> small inline functions. Having one-line tableam.c wrappers makes this
> >> more expensive than necessary. We'll have a big enough trouble not
> >> regressing performancewise.
>
> I merged tableam.h and tableamapi.h into tableam.h and changed all the
> functions as inline. This change may have added some additional headers,
> will check them if I can remove their need.
>
> Attached are the updated patches on top your github tree.
>
> Currently I am working on the following.
> - I observed that there is a crash when running isolation tests.
While investigating the crash, I observed that it is due to the many
FIXMEs in the code. So I made only minimal fixes and am looking into
correcting the FIXMEs first.
One thing I observed is that a missing relation pointer leads to a crash
in the EvalPlan* flow, because not all ROW_MARK types contain a relation
pointer.

I will continue working through the FIXME fixes.
> > - COPY's multi_insert path should probably deal with a bunch of slots,
> > rather than forming HeapTuples
Implemented support for slots in the COPY multi-insert path.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

I'm currently in the process of rebasing zheap onto the pluggable
storage work. The goal, which seems to work surprisingly well, is to
find issues that the current pluggable storage patch doesn't yet deal
with. I plan to push a tree including a lot of fixes and improvements
soon.

On 2018-08-03 12:35:50 +1000, Haribabu Kommi wrote:
> while investing the crash, I observed that it is due to the lot of
> FIXME's in the code. So I just fixed minimal changes and looking into
> correcting the FIXME's first.
>
> One thing I observed is lack relation pointer is leading to crash in the
> flow of EvalPlan* functions, because all ROW_MARK types doesn't
> contains relation pointer.
>
> will continue to check all FIXME fixes.

Thanks.

> > - COPY's multi_insert path should probably deal with a bunch of slots,
> > rather than forming HeapTuples
>
> Implemented supporting of slots in the copy multi insert path.

Cool. I've not yet looked at it, but I plan to do so soon. Will have to
rebase over the other copy changes first :(

- Andres
On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm currently in the process of rebasing zheap onto the pluggable
storage work. The goal, which seems to work surprisingly well, is to
find issues that the current pluggable storage patch doesn't yet deal
with. I plan to push a tree including a lot of fixes and improvements
soon.
Sorry for coming late to this thread.
That's good. Did you find any problems in porting zheap onto pluggable
storage? Does it need any API changes or new API requirements?
On 2018-08-03 12:35:50 +1000, Haribabu Kommi wrote:
> while investing the crash, I observed that it is due to the lot of FIXME's
> in
> the code. So I just fixed minimal changes and looking into correcting
> the FIXME's first.
>
> One thing I observed is lack relation pointer is leading to crash in the
> flow of EvalPlan* functions, because all ROW_MARK types doesn't
> contains relation pointer.
>
> will continue to check all FIXME fixes.
Thanks.
I fixed some of the isolation test problems. All the issues are related
to EPQ slot handling. Still more needs to be fixed.

Do the new TupleTableSlot abstraction patches in the recent thread [1]
fix any of these issues? If so, I can look into changing the FDW API to
return a slot instead of a tuple.
> > - COPY's multi_insert path should probably deal with a bunch of slots,
> > rather than forming HeapTuples
> >
>
> Implemented supporting of slots in the copy multi insert path.
Cool. I've not yet looked at it, but I plan to do so soon. Will have to
rebase over the other copy changes first :(
OK. Understood. There are many changes in the COPY flow that conflict
with my changes. Please let me know once you are done with the rebase; I
can fix those conflicts and regenerate the patch.
Attached is the patch with further fixes.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm currently in the process of rebasing zheap onto the pluggable
> > storage work. The goal, which seems to work surprisingly well, is to
> > find issues that the current pluggable storage patch doesn't yet deal
> > with. I plan to push a tree including a lot of fixes and improvements
> > soon.
>
> Sorry for coming late to this thread.

No worries.

> That's good. Did you find any problems in porting zheap into pluggable
> storage? Does it needs any API changes or new API requirement?

A lot, yes. The big changes are:
- removal of HeapPageScanDesc
- introduction of explicit support functions for tablesample & bitmap scans
- introduction of callbacks for vacuum_rel, cluster

And quite a bit more along those lines.

> Does the new TupleTableSlot abstraction patches has fixed any of these
> issues in the recent thread [1]? so that I can look into the change of
> FDW API to return slot instead of tuple.

Yea, that'd be a good thing to start with.

Greetings,

Andres Freund
On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm currently in the process of rebasing zheap onto the pluggable
> > storage work. The goal, which seems to work surprisingly well, is to
> > find issues that the current pluggable storage patch doesn't yet deal
> > with. I plan to push a tree including a lot of fixes and improvements
> > soon.
> >
> That's good. Did you find any problems in porting zheap into pluggable
> storage? Does it needs any API changes or new API requirement?
A lot, yes. The big changes are:
- removal of HeapPageScanDesc
- introduction of explicit support functions for tablesample & bitmap scans
- introduction of callbacks for vacuum_rel, cluster
And quite a bit more along those lines.
OK. Those are quite a lot of changes.
> Does the new TupleTableSlot abstraction patches has fixed any of these
> issues in the recent thread [1]? so that I can look into the change of
> FDW API to return slot instead of tuple.
Yea, that'd be a good thing to start with.
I found that only the RefetchForeignRow API needs the change, and did
the same. Along with that, I fixed all the issues with running make
check-world. Attached are patches for the same.

Now I will look into the remaining FIXMEs that don't conflict with your
further changes.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

On 2018-08-24 11:55:41 +1000, Haribabu Kommi wrote:
> On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> > > On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I'm currently in the process of rebasing zheap onto the pluggable
> > > > storage work. The goal, which seems to work surprisingly well, is to
> > > > find issues that the current pluggable storage patch doesn't yet deal
> > > > with. I plan to push a tree including a lot of fixes and improvements
> > > > soon.
> > >
> > > That's good. Did you find any problems in porting zheap into pluggable
> > > storage? Does it needs any API changes or new API requirement?
> >
> > A lot, yes. The big changes are:
> > - removal of HeapPageScanDesc
> > - introduction of explicit support functions for tablesample & bitmap scans
> > - introduction of callbacks for vacuum_rel, cluster
> >
> > And quite a bit more along those lines.
>
> OK. Those are quite a bit of changes.

I've pushed a current version of that to my git tree to the
pluggable-storage branch. It's not really a version that I think makes
sense to review or such, but it's probably more useful if you work based
on that. There's also the pluggable-zheap branch, which I found
extremely useful to develop against.

There's a few further changes since last time:
- Pluggable handlers are now stored in static global variables, and thus
  do not need to be copied anymore
- VACUUM FULL / CLUSTER is moved into one callback that does the actual
  copying. The various previous rewrite callbacks imo integrated at the
  wrong level.
- there's a GUC that allows to change the default table AM
- moving COPY to use slots (roughly based on your / Haribabu's patch)
- removed the AM specific shmem initialization callbacks
- various AMs are going to need the syncscan APIs, so moving that into
  AM callbacks doesn't make sense.
Missing:
- callback for the second scan of CREATE INDEX CONCURRENTLY
- commands/analyze.c integration (working on it)
- fixing your (Haribabu's) slotification of copy patch to compute memory
  usage somehow
- table creation callback; currently the pluggable-zheap patch has a few
  conditionals outside of access/zheap for that purpose (see
  RelationTruncate)
- header structure cleanup

And then:
- lotsa cleanups
- rebasing onto a newer version of the abstract slot patchset
- splitting out smaller patches

You'd moved the bulk insert into tableam callbacks - I don't quite get
why? There's not really anything AM specific in that code?

> > > Does the new TupleTableSlot abstraction patches has fixed any of these
> > > issues in the recent thread [1]? so that I can look into the change of
> > > FDW API to return slot instead of tuple.
> >
> > Yea, that'd be a good thing to start with.
>
> I found out only the RefetchForeignRow API needs the change and done the
> same. Along with that, I fixed all the issues of running make
> check-world. Attached patches for the same.

Thanks, that's really helpful! I'll try to merge these soon.

I'm starting to think that we're getting closer to something that
looks right from a high level, even though there's a lot of details to
clean up.

Greetings,

Andres Freund
On Fri, Aug 24, 2018 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-08-24 11:55:41 +1000, Haribabu Kommi wrote:
> On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
>
> > On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> > > On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I'm currently in the process of rebasing zheap onto the pluggable
> > > > storage work. The goal, which seems to work surprisingly well, is to
> > > > find issues that the current pluggable storage patch doesn't yet deal
> > > > with. I plan to push a tree including a lot of fixes and improvements
> > > > soon.
> > > >
> > > That's good. Did you find any problems in porting zheap into pluggable
> > > storage? Does it needs any API changes or new API requirement?
> >
> > A lot, yes. The big changes are:
> > - removal of HeapPageScanDesc
> > - introduction of explicit support functions for tablesample & bitmap scans
> > - introduction of callbacks for vacuum_rel, cluster
> >
> > And quite a bit more along those lines.
> >
>
> OK. Those are quite a bit of changes.
I've pushed a current version of that to my git tree to the
pluggable-storage branch. It's not really a version that I think makes
sense to review or such, but it's probably more useful if you work based
on that. There's also the pluggable-zheap branch, which I found
extremely useful to develop against.
OK. Thanks, will check that also.
There's a few further changes since last time:
- Pluggable handlers are now stored in static global variables, and thus
  do not need to be copied anymore
- VACUUM FULL / CLUSTER is moved into one callback that does the actual
  copying. The various previous rewrite callbacks imo integrated at the
  wrong level.
- there's a GUC that allows to change the default table AM
- moving COPY to use slots (roughly based on your / Haribabu's patch)
- removed the AM specific shmem initialization callbacks
- various AMs are going to need the syncscan APIs, so moving that into
  AM callbacks doesn't make sense.
OK.
Missing:
- callback for the second scan of CREATE INDEX CONCURRENTLY
- commands/analyze.c integration (Working on it)
- fixing your (Haribabu's) slotification of copy patch to compute memory
usage somehow
I will check it.
- table creation callback, currently the pluggable-zheap patch has a few
conditionals outside of access/zheap for that purpose (see RelationTruncate)
I will check it.
And then:
- lotsa cleanups
- rebasing onto a newer version of the abstract slot patchset
- splitting out smaller patches
You'd moved the bulk insert into tableam callbacks - I don't quite get
why? There's not really anything AM specific in that code?
The main reason for adding them to the AM was just to give the specific
AM control over deciding whether it can support bulk insert or not.

The current framework doesn't support passing AM-specific bulk insert
state from one function to another, and its structure is fixed. It needs
to be enhanced to also allow AM-specific private members.
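What that limitation means can be sketched like this (hypothetical layout, not a committed design): BulkInsertStateData would gain an opaque member that each AM can point at its own private state.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: a bulk insert state with room for AM-specific data. */
typedef struct BulkInsertStateData
{
    int   current_buf;   /* stand-in for the pinned buffer */
    void *am_private;    /* opaque AM-specific extension */
} BulkInsertStateData;

/* A toy "zheap" AM keeps its own counter in am_private. */
typedef struct ToyZheapBulkState { int pages_extended; } ToyZheapBulkState;

static BulkInsertStateData *
toy_zheap_get_bulk_insert_state(void)
{
    BulkInsertStateData *bi = calloc(1, sizeof(*bi));

    bi->am_private = calloc(1, sizeof(ToyZheapBulkState));
    return bi;
}
```

Only the AM that allocated am_private ever interprets it; the generic code just carries the pointer along.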
> > > Does the new TupleTableSlot abstraction patches has fixed any of these
> > > issues in the recent thread [1]? so that I can look into the change of
> > > FDW API to return slot instead of tuple.
> >
> > Yea, that'd be a good thing to start with.
> >
>
> I found out only the RefetchForeignRow API needs the change and done the
> same.
> Along with that, I fixed all the issues of running make check-world.
> Attached patches
> for the same.
Thanks, that's really helpful! I'll try to merge these soon.
I can share the rebased patches for the fixes, so that it will be easy to merge.
I'm starting to think that we're getting closer to something that
looks right from a high level, even though there's a lot of details to
clean up.
That's good.
Regards,
Haribabu Kommi
Fujitsu Australia
On Tue, Aug 28, 2018 at 1:48 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Fri, Aug 24, 2018 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:

Hi,
On 2018-08-24 11:55:41 +1000, Haribabu Kommi wrote:
> On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
>
> > On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> > > On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I'm currently in the process of rebasing zheap onto the pluggable
> > > > storage work. The goal, which seems to work surprisingly well, is to
> > > > find issues that the current pluggable storage patch doesn't yet deal
> > > > with. I plan to push a tree including a lot of fixes and improvements
> > > > soon.
> > > >
> > > That's good. Did you find any problems in porting zheap into pluggable
> > > storage? Does it needs any API changes or new API requirement?
> >
> > A lot, yes. The big changes are:
> > - removal of HeapPageScanDesc
> > - introduction of explicit support functions for tablesample & bitmap scans
> > - introduction of callbacks for vacuum_rel, cluster
> >
> > And quite a bit more along those lines.
> >
>
> OK. Those are quite a bit of changes.
I've pushed a current version of that to my git tree to the
pluggable-storage branch. It's not really a version that I think makes
sense to review or such, but it's probably more useful if you work based
on that. There's also the pluggable-zheap branch, which I found
extremely useful to develop against.

OK. Thanks, will check that also.

- fixing your (Haribabu's) slotification of copy patch to compute memory
usage somehow

I will check it.
Attached is the copy patch that brings back the size validation.

It computes the tuple size from the first tuple in the batch and uses
the same for the rest of the tuples in that batch. This way the
calculation overhead is also reduced. There may be cases where the first
tuple is very small and the rest are very large, but I feel those are
rare.
- table creation callback, currently the pluggable-zheap patch has a few
conditionals outside of access/zheap for that purpose (see RelationTruncate)

I will check it.
I found a couple of places where zheap uses some extra logic to verify
whether the AM is zheap or not, and based on that takes some extra
decisions. I am analyzing all of that extra code to see whether (and
how) callbacks can handle it. I will come back with more details later.
And then:
- lotsa cleanups
- rebasing onto a newer version of the abstract slot patchset
- splitting out smaller patches
You'd moved the bulk insert into tableam callbacks - I don't quite get
why? There's not really anything AM specific in that code?

The main reason of adding them to AM is just to provide a control to
the specific AM to decide whether they can support the bulk insert or
not?

Current framework doesn't support AM specific bulk insert state to be
passed from one function to another and it's structure is fixed. This
needs to be enhanced to add AM specific private members also.
Do you want me to work on it to make it generic to AM methods to extend
the structure?
> > > Does the new TupleTableSlot abstraction patches has fixed any of these
> > > issues in the recent thread [1]? so that I can look into the change of
> > > FDW API to return slot instead of tuple.
> >
> > Yea, that'd be a good thing to start with.
> >
>
> I found out only the RefetchForeignRow API needs the change and done the
> same.
> Along with that, I fixed all the issues of running make check-world.
> Attached patches
> for the same.
Thanks, that's really helpful! I'll try to merge these soon.

I can share the rebased patches for the fixes, so that it will be easy to merge.
Rebased FDW and check-world fixes patch is attached.

I will continue working on the rest of the missing items.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

Thanks for the patches!

On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found couple of places where the zheap is using some extra logic in
> verifying whether it is zheap AM or not, based on that it used to took
> some extra decisions. I am analyzing all the extra code that is done,
> whether any callbacks can handle it or not? and how? I can come back
> with more details later.

Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.

> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > Current framework doesn't support AM specific bulk insert state to be
> > passed from one function to another and it's structure is fixed. This
> > needs to be enhanced to add AM specific private members also.
>
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?

I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actually needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...
> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
>  			CommandId mycid, int hi_options,
>  			ResultRelInfo *resultRelInfo,
>  			BulkInsertState bistate,
> -			int nBufferedTuples, TupleTableSlot **bufferedSlots,
> +			int nBufferedSlots, TupleTableSlot **bufferedSlots,
>  			uint64 firstBufferedLineNo);
>  static bool CopyReadLine(CopyState cstate);
>  static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
>  	void	   *bistate;
>  	uint64		processed = 0;
>  	bool		useHeapMultiInsert;
> -	int			nBufferedTuples = 0;
> +	int			nBufferedSlots = 0;
>  	int			prev_leaf_part_index = -1;
> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000

What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.

> 	if (useHeapMultiInsert)
> 	{
> +		int tup_size;
> +
> 		/* Add this tuple to the tuple buffer */
> -		if (nBufferedTuples == 0)
> +		if (nBufferedSlots == 0)
> +		{
> 			firstBufferedLineNo = cstate->cur_lineno;
> -		Assert(bufferedSlots[nBufferedTuples] == myslot);
> -		nBufferedTuples++;
> +
> +			/*
> +			 * Find out the Tuple size of the first tuple in a batch and
> +			 * use it for the rest tuples in a batch. There may be scenarios
> +			 * where the first tuple is very small and rest can be large, but
> +			 * that's rare and this should work for majority of the scenarios.
> +			 */
> +			tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> +											  myslot->tts_values,
> +											  myslot->tts_isnull);
> +		}

This seems too expensive to me. I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does
that sound reasonable?

Greetings,

Andres Freund
On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Thanks for the patches!
On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found couple of places where the zheap is using some extra logic in
> verifying
> whether it is zheap AM or not, based on that it used to took some extra
> decisions.
> I am analyzing all the extra code that is done, whether any callbacks can
> handle it
> or not? and how? I can come back with more details later.
Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.
OK. I will list all the areas that I found, with my observations on how
to abstract them or leave them, and then implement around it.
> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >>
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > Current framework doesn't support AM specific bulk insert state to be
> > passed from one function to another and it's structure is fixed. This needs
> > to be enhanced to add AM specific private members also.
> >
>
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?
I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actuallly needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...
OK. Will remove them and share the patch.
> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
> CommandId mycid, int hi_options,
> ResultRelInfo *resultRelInfo,
> BulkInsertState bistate,
> - int nBufferedTuples, TupleTableSlot **bufferedSlots,
> + int nBufferedSlots, TupleTableSlot **bufferedSlots,
> uint64 firstBufferedLineNo);
> static bool CopyReadLine(CopyState cstate);
> static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
> void *bistate;
> uint64 processed = 0;
> bool useHeapMultiInsert;
> - int nBufferedTuples = 0;
> + int nBufferedSlots = 0;
> int prev_leaf_part_index = -1;
> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000
What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.
OK. I will correct it.
> if (useHeapMultiInsert)
> {
> + int tup_size;
> +
> /* Add this tuple to the tuple buffer */
> - if (nBufferedTuples == 0)
> + if (nBufferedSlots == 0)
> + {
> firstBufferedLineNo = cstate->cur_lineno;
> - Assert(bufferedSlots[nBufferedTuples] == myslot);
> - nBufferedTuples++;
> +
> + /*
> + * Find out the Tuple size of the first tuple in a batch and
> + * use it for the rest tuples in a batch. There may be scenarios
> + * where the first tuple is very small and rest can be large, but
> + * that's rare and this should work for majority of the scenarios.
> + */
> + tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> + myslot->tts_values,
> + myslot->tts_isnull);
> + }
This seems too expensive to me. I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does
that sound reasonable?
OK. The line length of the row can be used as the tuple length to limit
the size usage. Comments?
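The accounting being discussed could be sketched like this (toy code; the limits and names are illustrative, not COPY's real constants): the multi-insert buffer is flushed when either the tuple count or a byte budget, accumulated from the consumed input line lengths, is exceeded, avoiding a heap_compute_data_size() call per batch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BUFFERED_TUPLES 1000
#define TOY_MAX_BUFFERED_BYTES  65535

typedef struct ToyCopyBatch
{
    int    ntuples;
    size_t nbytes;
} ToyCopyBatch;

/* Account for one buffered row using the consumed input line length as
 * a cheap proxy for tuple size; returns 1 when the batch should be
 * flushed. */
static int
toy_copy_batch_add(ToyCopyBatch *batch, size_t line_len)
{
    batch->ntuples++;
    batch->nbytes += line_len;
    return batch->ntuples >= TOY_MAX_BUFFERED_TUPLES ||
           batch->nbytes >= TOY_MAX_BUFFERED_BYTES;
}
```

The line length is already known from parsing the input, so the proxy costs nothing extra per row.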
Regards,
Haribabu Kommi
Fujitsu Australia
On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:

Hi,
Thanks for the patches!
On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found couple of places where the zheap is using some extra logic in
> verifying
> whether it is zheap AM or not, based on that it used to took some extra
> decisions.
> I am analyzing all the extra code that is done, whether any callbacks can
> handle it
> or not? and how? I can come back with more details later.
Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.

OK. I will list all the areas that I found with my observation of how to
abstract or leaving it and then implement around it.
The following are the changes where the code is specific to checking
whether a relation is a zheap relation or not.

Overall I found that it needs new APIs at the following locations:

1. RelationSetNewRelfilenode
2. heap_create_init_fork
3. estimate_rel_size
4. A facility to provide handler options (like skip WAL, etc.)
_hash_vacuum_one_page:
xlrec.flags = RelationStorageIsZHeap(heapRel) ?
XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP : 0;
_bt_delitems_delete:
xlrec_delete.flags = RelationStorageIsZHeap(heapRel) ?
XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP : 0;
Storing the type of the handler, and adding a new API for special
handling when checking for these new types, can remove the need for the
above code.
RelationAddExtraBlocks:
if (RelationStorageIsZHeap(relation))
{
ZheapInitPage(page, BufferGetPageSize(buffer));
freespace = PageGetZHeapFreeSpace(page);
}
Adding new APIs for PageInit and PageGetHeapFreeSpace to redirect the
calls to the specific table AM handlers.
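A sketch of that redirection (the callback names here are hypothetical): RelationAddExtraBlocks would call through the AM's table instead of branching on RelationStorageIsZHeap().

```c
#include <assert.h>
#include <stddef.h>

/* Toy page-level callbacks (illustrative names only). */
typedef struct PageAmOps
{
    void   (*page_init)(char *page, size_t page_size);
    size_t (*page_get_free_space)(const char *page);
} PageAmOps;

static void toy_heap_page_init(char *page, size_t sz) { (void) sz; page[0] = 'H'; }
static size_t toy_heap_free_space(const char *page) { (void) page; return 100; }

static void toy_zheap_page_init(char *page, size_t sz) { (void) sz; page[0] = 'Z'; }
static size_t toy_zheap_free_space(const char *page) { (void) page; return 90; }

static const PageAmOps toy_heap_ops = { toy_heap_page_init, toy_heap_free_space };
static const PageAmOps toy_zheap_ops = { toy_zheap_page_init, toy_zheap_free_space };

/* The caller no longer needs a zheap-specific branch. */
static size_t
toy_extend_block(const PageAmOps *ops, char *page, size_t page_size)
{
    ops->page_init(page, page_size);
    return ops->page_get_free_space(page);
}
```

The same call site then serves heap, zheap, and any future AM without modification.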
visibilitymap_set:
if (RelationStorageIsZHeap(rel))
{
recptr = log_zheap_visible(rel->rd_node, heapBuf, vmBuf,
cutoff_xid, flags);
/*
* We do not have a page wise visibility flag in zheap.
* So no need to set LSN on zheap page.
*/
}
Handler options may remove the need for the above code.
validate_index_heapscan:
/* Set up for predicate or expression evaluation */
/* For zheap relations, the tuple is locally allocated, so free it. */
ExecStoreHeapTuple(heapTuple, slot, RelationStorageIsZHeap(heapRelation));
This will be solved after converting the validate_index_heapscan
function to use slots.
RelationTruncate:
/* Create the meta page for zheap */
if (RelationStorageIsZHeap(rel))
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
InvalidTransactionId,
InvalidMultiXactId);
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
rel->rd_rel->relkind != 'p')
{
heap_create_init_fork(rel);
if (RelationStorageIsZHeap(rel))
ZheapInitMetaPage(rel, INIT_FORKNUM);
}
New APIs in RelationSetNewRelfilenode and heap_create_init_fork can solve it.
cluster:
if (RelationStorageIsZHeap(rel))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a zheap table")));
No change required.
copyFrom:
/*
* In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
* See zheap_prepare_insert for details.
* PBORKED / ZBORKED: abstract
*/
if (!RelationStorageIsZHeap(cstate->rel) && !XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
How about requesting the table AM handler to provide options and using them here?
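One way the "handler options" idea could work, as a standalone sketch with
invented names (TABLE_AM_OPT_SKIP_WAL_ON_COPY is not a real flag): the AM
advertises capability bits, and COPY consults them instead of hard-coding a
zheap check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit the table AM handler could report */
#define TABLE_AM_OPT_SKIP_WAL_ON_COPY  (1u << 0)

typedef struct TableAmRoutine
{
    uint32_t options;           /* capability bits reported by the AM */
} TableAmRoutine;

static const TableAmRoutine heap_routine  = { TABLE_AM_OPT_SKIP_WAL_ON_COPY };
static const TableAmRoutine zheap_routine = { 0 };  /* zheap: no skip-WAL */

/* COPY-side decision, replacing "!RelationStorageIsZHeap(rel) && !XLogIsNeeded()" */
static bool
copy_may_skip_wal(const TableAmRoutine *am, bool xlog_is_needed)
{
    return !xlog_is_needed && (am->options & TABLE_AM_OPT_SKIP_WAL_ON_COPY);
}
```

The same bitmask would also cover the identical check in ATRewriteTable.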
ExecuteTruncateGuts:
// PBORKED: Need to abstract this
minmulti = GetOldestMultiXactId();
/*
* Need the full transaction-safe pushups.
*
* Create a new empty storage file for the relation, and assign it
* as the relfilenode value. The old storage file is scheduled for
* deletion at commit.
*
* PBORKED: needs to be a callback
*/
if (RelationStorageIsZHeap(rel))
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
InvalidTransactionId, InvalidMultiXactId);
else
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
RecentXmin, minmulti);
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
{
heap_create_init_fork(rel);
if (RelationStorageIsZHeap(rel))
ZheapInitMetaPage(rel, INIT_FORKNUM);
}
A new API inside RelationSetNewRelfilenode can handle it.
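A sketch of what that callback could look like, using stub types and invented
names (relation_new_filenode is hypothetical): RelationSetNewRelfilenode
invokes a per-AM hook after the new relfilenode is assigned, and the zheap
implementation stands in for ZheapInitMetaPage().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-AM callback invoked by RelationSetNewRelfilenode()
 * after the new relfilenode has been created. */
typedef struct TableAmRoutine
{
    void (*relation_new_filenode)(bool *metapage_created);
} TableAmRoutine;

static void
heapam_new_filenode(bool *metapage_created)
{
    /* plain heap: nothing extra to initialize */
    (void) metapage_created;
}

static void
zheapam_new_filenode(bool *metapage_created)
{
    /* zheap: stands in for ZheapInitMetaPage(relation, MAIN_FORKNUM) */
    *metapage_created = true;
}

/* The zheap-specific branch in the callers then collapses to one call: */
static void
relation_set_new_relfilenode_sketch(const TableAmRoutine *am,
                                    bool *metapage_created)
{
    /* ... create the new relfilenode, reset stats, etc. (elided) ... */
    am->relation_new_filenode(metapage_created);
}
```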
ATRewriteCatalogs:
/* Inherit the storage_engine reloption from the parent table. */
if (RelationStorageIsZHeap(rel))
{
static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
DefElem *storage_engine;
storage_engine = makeDefElemExtended("toast", "storage_engine",
(Node *) makeString("zheap"),
DEFELEM_UNSPEC, -1);
reloptions = transformRelOptions((Datum) 0,
list_make1(storage_engine),
"toast",
validnsps, true, false);
}
I don't think anything can be done in the API for this one.
ATRewriteTable:
/*
* In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
* See zheap_prepare_insert for details.
*
* ZFIXME / PFIXME: We probably need a different abstraction for this.
*/
if (!RelationStorageIsZHeap(newrel) && !XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
Handler options can solve this case as well.
estimate_rel_size:
if (curpages < 10 &&
(rel->rd_rel->relpages == 0 ||
(RelationStorageIsZHeap(rel) &&
rel->rd_rel->relpages == ZHEAP_METAPAGE + 1)) &&
!rel->rd_rel->relhassubclass &&
rel->rd_rel->relkind != RELKIND_INDEX)
curpages = 10;
/* report estimated # pages */
*pages = curpages;
/* quick exit if rel is clearly empty */
if (curpages == 0 || (RelationStorageIsZHeap(rel) &&
curpages == ZHEAP_METAPAGE + 1))
{
*tuples = 0;
*allvisfrac = 0;
break;
}
/* coerce values in pg_class to more desirable types */
relpages = (BlockNumber) rel->rd_rel->relpages;
reltuples = (double) rel->rd_rel->reltuples;
relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
/*
* If it's a zheap relation, then subtract the pages
* to account for the metapage.
*/
if (relpages > 0 && RelationStorageIsZHeap(rel))
{
curpages--;
relpages--;
}
An API may be needed to find the estimated size based on the handler type.
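Rather than a full estimation API, the AM could simply report how many fixed
overhead pages (such as zheap's metapage) it keeps, and estimate_rel_size
would subtract them generically. A minimal sketch with invented names
(overhead_pages is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

typedef struct TableAmRoutine
{
    /* number of fixed pages not holding user data (e.g. zheap's metapage) */
    uint32_t overhead_pages;
} TableAmRoutine;

/* Generic replacement for the "curpages == ZHEAP_METAPAGE + 1" and
 * "relpages--" special cases in estimate_rel_size. */
static uint32_t
usable_pages(const TableAmRoutine *am, uint32_t curpages)
{
    return curpages > am->overhead_pages
        ? curpages - am->overhead_pages
        : 0;
}
```

With this, "relation is clearly empty" becomes simply `usable_pages(...) == 0`
for every AM.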
pg_stat_get_tuples_hot_updated and others:
/*
* Counter tuples_hot_updated stores number of hot updates for heap table
* and the number of inplace updates for zheap table.
*/
if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
RelationStorageIsZHeap(rel))
result = 0;
else
result = (int64) (tabentry->tuples_hot_updated);
Is the special condition needed? The value should be 0 for zheap anyway, right?
RelationSetNewRelfilenode:
/* Initialize the metapage for zheap relation. */
if (RelationStorageIsZHeap(relation))
ZheapInitMetaPage(relation, MAIN_FORKNUM);
A new API in RelationSetNewRelfilenode can solve this problem.
> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >>
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > Current framework doesn't support AM specific bulk insert state to be
> > passed from one function to another and it's structure is fixed. This needs
> > to be enhanced to add AM specific private members also.
> >
>
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?
I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actually needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...

OK. I will remove them and share the patch.
Bulk insert API changes are removed.
> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
> CommandId mycid, int hi_options,
> ResultRelInfo *resultRelInfo,
> BulkInsertState bistate,
> - int nBufferedTuples, TupleTableSlot **bufferedSlots,
> + int nBufferedSlots, TupleTableSlot **bufferedSlots,
> uint64 firstBufferedLineNo);
> static bool CopyReadLine(CopyState cstate);
> static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
> void *bistate;
> uint64 processed = 0;
> bool useHeapMultiInsert;
> - int nBufferedTuples = 0;
> + int nBufferedSlots = 0;
> int prev_leaf_part_index = -1;
> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000
What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.

OK. I will correct it.
> if (useHeapMultiInsert)
> {
> + int tup_size;
> +
> /* Add this tuple to the tuple buffer */
> - if (nBufferedTuples == 0)
> + if (nBufferedSlots == 0)
> + {
> firstBufferedLineNo = cstate->cur_lineno;
> - Assert(bufferedSlots[nBufferedTuples] == myslot);
> - nBufferedTuples++;
> +
> + /*
> + * Find out the Tuple size of the first tuple in a batch and
> + * use it for the rest tuples in a batch. There may be scenarios
> + * where the first tuple is very small and rest can be large, but
> + * that's rare and this should work for majority of the scenarios.
> + */
> + tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> + myslot->tts_values,
> + myslot->tts_isnull);
> + }
This seems too expensive to me. I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does that
sound reasonable?

Yes, the cstate structure contains the line_buf member that holds the
line length of the row; this can be used as the tuple length to limit
the memory usage. Comments?
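The line-length-based limiting described above could be sketched like this
(standalone C with invented constants; in the real code the length would come
from cstate->line_buf.len, and flushing would be CopyFromInsertBatch):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUFFERED_TUPLES 1000
#define MAX_BUFFERED_BYTES  65535   /* illustrative memory cap */

typedef struct BatchState
{
    int    ntuples;   /* tuples currently buffered */
    size_t nbytes;    /* approximate buffered size, from input line lengths */
    int    flushes;   /* how many times the batch was written out */
} BatchState;

static void
flush_batch(BatchState *bs)
{
    /* stands in for CopyFromInsertBatch(): write out and reset */
    bs->flushes++;
    bs->ntuples = 0;
    bs->nbytes = 0;
}

/* Buffer one row, using its input line length as a cheap proxy for the
 * tuple size instead of calling heap_compute_data_size() per row. */
static void
buffer_row(BatchState *bs, size_t line_len)
{
    bs->ntuples++;
    bs->nbytes += line_len;
    if (bs->ntuples >= MAX_BUFFERED_TUPLES || bs->nbytes >= MAX_BUFFERED_BYTES)
        flush_batch(bs);
}
```

The proxy overestimates for rows with wide text representations, but that
only makes batches flush earlier, which is the safe direction.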
The attached patches fix the COPY FROM batch insert memory usage limit and
also provide grammar support for the USING method in CREATE TABLE AS.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Mon, Sep 10, 2018 at 1:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
> pg_stat_get_tuples_hot_updated and others:
> /*
> * Counter tuples_hot_updated stores number of hot updates for heap table
> * and the number of inplace updates for zheap table.
> */
> if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
> RelationStorageIsZHeap(rel))
> result = 0;
> else
> result = (int64) (tabentry->tuples_hot_updated);
>
> Is the special condition is needed? The values should be 0 because of zheap right?

I also think so. Beena/Mithun has worked on this part of the code, so
it is better if they also confirm once.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 10, 2018 at 7:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 10, 2018 at 1:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
>> pg_stat_get_tuples_hot_updated and others:
>> /*
>> * Counter tuples_hot_updated stores number of hot updates for heap table
>> * and the number of inplace updates for zheap table.
>> */
>> if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
>> RelationStorageIsZHeap(rel))
>> result = 0;
>> else
>> result = (int64) (tabentry->tuples_hot_updated);
>>
>> Is the special condition is needed? The values should be 0 because of zheap right?
>
> I also think so. Beena/Mithun has worked on this part of the code, so
> it is better if they also confirm once.

Yes, pg_stat_get_tuples_hot_updated should return 0 for zheap.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Hello,
On Mon, 10 Sep 2018, 19:33 Amit Kapila, <amit.kapila16@gmail.com> wrote:
On Mon, Sep 10, 2018 at 1:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
> pg_stat_get_tuples_hot_updated and others:
> /*
> * Counter tuples_hot_updated stores number of hot updates for heap table
> * and the number of inplace updates for zheap table.
> */
> if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
> RelationStorageIsZHeap(rel))
> result = 0;
> else
> result = (int64) (tabentry->tuples_hot_updated);
>
>
> Is the special condition is needed? The values should be 0 because of zheap right?
>
I also think so. Beena/Mithun has worked on this part of the code, so
it is better if they also confirm once.
We have used the hot_updated counter to count the number of in-place updates for zheap, to avoid introducing a new counter. Though, technically, hot updates are 0 for zheap, the counter could hold a non-zero value indicating the in-place updates.
Thank you
On Mon, Sep 10, 2018 at 5:42 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:

Hi,
Thanks for the patches!
On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found couple of places where the zheap is using some extra logic in
> verifying
> whether it is zheap AM or not, based on that it used to took some extra
> decisions.
> I am analyzing all the extra code that is done, whether any callbacks can
> handle it
> or not? and how? I can come back with more details later.
Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.

OK. I will list all the areas that I found, with my observations on how to
abstract them or leave them as-is, and then implement around that.
The following are the changes where the code specifically checks whether
it is a zheap relation or not.
Overall I found that it needs new APIs at the following locations:
1. RelationSetNewRelfilenode
2. heap_create_init_fork
3. estimate_rel_size
4. Facility to provide handler options like (skip WAL and etc).
During the porting of the Fujitsu in-memory columnar store on top of pluggable
storage, I found that the callers of "heap_beginscan" expect that
the returned data always contains all the records.
For example, in the sequential scan, the heap returns the slot with
the tuple or with value array of all the columns and then the data gets
filtered and later removed the unnecessary columns with projection.
This works fine for row-based storage. For columnar storage, if
the storage knows that the upper layers need only particular columns,
then it can directly return the specified columns and there is no
need for the projection step. This will also help the columnar storage
to return the proper columns in a faster way.
Is it good to pass the plan to the storage, so that it can find out
the columns that need to be returned? And if the projection
can be handled in the storage itself for some scenarios, the callers
need to be informed that there is no need to perform the projection
again.
comments?
Regards,
Haribabu Kommi
Fujitsu Australia
Hi,

On 2018-09-21 16:57:43 +1000, Haribabu Kommi wrote:
> During the porting of Fujitsu in-memory columnar store on top of pluggable
> storage, I found that the callers of the "heap_beginscan" are expecting
> the returned data is always contains all the records.

Right.

> For example, in the sequential scan, the heap returns the slot with
> the tuple or with value array of all the columns and then the data gets
> filtered and later removed the unnecessary columns with projection.
> This works fine for the row based storage. For columnar storage, if
> the storage knows that upper layers needs only particular columns,
> then they can directly return the specified columns and there is no
> need of projection step. This will help the columnar storage also
> to return proper columns in a faster way.

I think this is an important feature, but I feel fairly strongly that we
should only tackle it in a second version. This patchset is already
pretty darn large. It's imo not just helpful for columnar, but even for
heap - we e.g. spend a lot of time deforming columns that are never
accessed. That's particularly harmful when the leading columns are all
NOT NULL and fixed width, but even if not, it's painful.

> Is it good to pass the plan to the storage, so that they can find out
> the columns that needs to be returned?

I don't think that's the right approach - this should be a level *below*
plan nodes, not reference them. I suspect we're going to have to have a
new table_scan_set_columnlist() option or such.

> And also if the projection can handle in the storage itself for some
> scenarios, need to be informed the callers that there is no need to
> perform the projection extra.

I don't think that should be done in the storage layer - that's probably
better done introducing custom scan nodes and such. This has costing
implications etc, so this needs to happen *before* planning is finished.

Greetings,

Andres Freund
On Fri, Sep 21, 2018 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-09-21 16:57:43 +1000, Haribabu Kommi wrote:
> For example, in the sequential scan, the heap returns the slot with
> the tuple or with value array of all the columns and then the data gets
> filtered and later removed the unnecessary columns with projection.
> This works fine for the row based storage. For columnar storage, if
> the storage knows that upper layers needs only particular columns,
> then they can directly return the specified columns and there is no
> need of projection step. This will help the columnar storage also
> to return proper columns in a faster way.
I think this is an important feature, but I feel fairly strongly that we
should only tackle it in a second version. This patchset is already
pretty darn large. It's imo not just helpful for columnar, but even for
heap - we e.g. spend a lot of time deforming columns that are never
accessed. That's particularly harmful when the leading columns are all
NOT NULL and fixed width, but even if not, it's painful.
OK. Thanks for your opinion.
Then I will first try to clean up the open items of the existing patch.
> Is it good to pass the plan to the storage, so that they can find out
> the columns that needs to be returned?
I don't think that's the right approach - this should be a level *below*
plan nodes, not reference them. I suspect we're going to have to have a
new table_scan_set_columnlist() option or such.
The table_scan_set_columnlist() API can be a good solution to share
the columns that are expected.
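A rough idea of how such an API could be consumed, as a standalone sketch
(the scan-descriptor fields and the exact signature of
table_scan_set_columnlist() are assumptions, not the actual proposal): the
executor hands the scan a set of attribute numbers, and a columnar AM checks
it before fetching or deforming each column.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_COLS 32

/* Hypothetical scan descriptor carrying the requested column set */
typedef struct TableScanDesc
{
    bool needed[MAX_COLS];   /* which attributes the executor wants */
    bool have_columnlist;    /* false: no list given, fetch everything */
} TableScanDesc;

/* Sketch of the table_scan_set_columnlist() API discussed above */
static void
table_scan_set_columnlist(TableScanDesc *scan, const int *attnums, int natts)
{
    scan->have_columnlist = true;
    for (int i = 0; i < natts; i++)
        scan->needed[attnums[i]] = true;
}

/* A columnar AM would consult this before reading a column's data */
static bool
scan_needs_column(const TableScanDesc *scan, int attnum)
{
    return !scan->have_columnlist || scan->needed[attnum];
}
```

Defaulting to "all columns" when no list is set keeps existing callers
working unchanged, which matches the incremental approach discussed above.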
> And also if the projection can handle in the storage itself for some
> scenarios, need to be informed the callers that there is no need to
> perform the projection extra.
I don't think that should be done in the storage layer - that's probably
better done introducing custom scan nodes and such. This has costing
implications etc, so this needs to happen *before* planning is finished.
Sorry, my explanation was wrong. Assume a scenario where the target list
contains only plain columns of a table, these columns are already passed
to the storage using the proposed new API, and there is a one-to-one mapping
between them. Based on that information, it would be good to decide whether
the projection is required or not.
Regards,
Haribabu Kommi
Fujitsu Australia
On Fri, Aug 24, 2018 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed a current version of that to my git tree to the
> pluggable-storage branch. It's not really a version that I think makes
> sense to review or such, but it's probably more useful if you work based
> on that. There's also the pluggable-zheap branch, which I found
> extremely useful to develop against.

BTW, I'm going to take a look at the current shape of this patch and share
my thoughts. But where are the branches you're referring to? On your
postgres.org git repository the pluggable-storage branch was updated last
at June 7. And on github the branches are updated at August 5 and 14,
which is still much older than your email (August 24)...

1. https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/pluggable-storage
2. https://github.com/anarazel/postgres-pluggable-storage
3. https://github.com/anarazel/postgres-pluggable-storage/tree/pluggable-zheap

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Sep 24, 2018 at 5:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Aug 24, 2018 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed a current version of that to my git tree to the
> pluggable-storage branch. It's not really a version that I think makese
> sense to review or such, but it's probably more useful if you work based
> on that. There's also the pluggable-zheap branch, which I found
> extremely useful to develop against.
BTW, I'm going to take a look at the current shape of this patch and share
my thoughts. But where are the branches you're referring to? On your
postgres.org git repository the pluggable-storage branch was updated last
at June 7. And on github the branches are updated at August 5
and 14, which is still much older than your email (August 24)...
The code is the latest, but the commit timestamp is older; I think that is
because of a commit squash.
pluggable-storage is the branch where the pluggable storage code is present
and pluggable-zheap branch where zheap is rebased on top of pluggable
storage.
Regards,
Haribabu Kommi
Fujitsu Australia
On Mon, Sep 24, 2018 at 8:04 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Mon, Sep 24, 2018 at 5:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> On Fri, Aug 24, 2018 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
>> > I've pushed a current version of that to my git tree to the
>> > pluggable-storage branch. It's not really a version that I think makese
>> > sense to review or such, but it's probably more useful if you work based
>> > on that. There's also the pluggable-zheap branch, which I found
>> > extremely useful to develop against.
>>
>> BTW, I'm going to take a look at current shape of this patch and share
>> my thoughts. But where are the branches you're referring? On your
>> postgres.org git repository pluggable-storage brach was updates last
>> time at June 7. And on the github branches are updated at August 5
>> and 14, and that is still much older than your email (August 24)...
>
> The code is latest, but the commit time is older, I feel that is because of
> commit squash.
>
> pluggable-storage is the branch where the pluggable storage code is present
> and pluggable-zheap branch where zheap is rebased on top of pluggable
> storage.

Got it, thanks!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Sep 21, 2018 at 5:40 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Fri, Sep 21, 2018 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:Hi,
On 2018-09-21 16:57:43 +1000, Haribabu Kommi wrote:
> For example, in the sequential scan, the heap returns the slot with
> the tuple or with value array of all the columns and then the data gets
> filtered and later removed the unnecessary columns with projection.
> This works fine for the row based storage. For columnar storage, if
> the storage knows that upper layers needs only particular columns,
> then they can directly return the specified columns and there is no
> need of projection step. This will help the columnar storage also
> to return proper columns in a faster way.
I think this is an important feature, but I feel fairly strongly that we
should only tackle it in a second version. This patchset is already
pretty darn large. It's imo not just helpful for columnar, but even for
heap - we e.g. spend a lot of time deforming columns that are never
accessed. That's particularly harmful when the leading columns are all
NOT NULL and fixed width, but even if not, it's painful.

OK. Thanks for your opinion.
Then I will first try to clean up the open items of the existing patch.
Here I attached further cleanup patches.
1. Re-arrange the GUC variable
2. Added a check function hook for default_table_access_method GUC
3. Added a new hook validate_index. I tried to change the function
validate_index_heapscan to use slots, but that has many problems, as it
accesses some internals of the HeapScanDesc structure, the buffer, etc.
So I added a new hook and provided a callback to handle the index insert.
Please check and let me know your comments.
I will further add the new API's that are discussed for Zheap storage and
share the patch.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> Here I attached further cleanup patches.
> 1. Re-arrange the GUC variable
> 2. Added a check function hook for default_table_access_method GUC

Cool.

> 3. Added a new hook validate_index. I tried to change the function
> validate_index_heapscan to slotify, but that have many problems as it
> is accessing some internals of the heapscandesc structure and accessing
> the buffer and etc.

Oops, I also did that locally, in a way. I also made a validate a
callback, as the validation logic is going to be specific to the AMs.
Sorry for not pushing that up earlier. I'll try to do that soon,
there's a fair amount of change.

Greetings,

Andres Freund
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made a validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier. I'll try to do that soon,
> there's a fair amount of change.

I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending (and not redundant, by our concurrent
development), patches merged.

There's currently 3 regression test failures, that I'll look into
tomorrow:

- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
  a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
  I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
  what's going on there.

Amit Khandekar said he'll publish a new version of the slot-abstraction
patch tomorrow, so I'll rebase it onto that ASAP.

My next planned steps are a) to try to commit parts of the
slot-abstraction work b) to try to break out a few more pieces out of
the large pluggable storage patch.

Greetings,

Andres Freund
On Wed, Oct 3, 2018 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made a validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier. I'll try to do that soon,
> there's a fair amount of change.
I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending (and not redundant, by our concurrent
development), patches merged.
Yes, all the patches are merged.
There's currently 3 regression test failures, that I'll look into
tomorrow:
- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
what's going on there.
I also observed the failure of aggregates.sql, will look into it.
Amit Khandekar said he'll publish a new version of the slot-abstraction
patch tomorrow, so I'll rebase it onto that ASAP.
OK.
Here I attached two new API patches.
1. Set New Rel File node
2. Create Init fork
There is another patch, "External Relations", in the older patch set,
which is not included in the current git. That patch is for creating
external relations by extensions for their internal purposes (columnar
relations for the columnar storage). This new relkind can be used for
those relations; this way it provides the distinction between normal and
columnar relations. Do you have any other idea for supporting those types
of relations?
I also want to create a new API for heap_create_with_catalog
to let the pluggable storage engine create additional relations.
This API is not required for every storage engine, so instead of turning
the entire function into an API, how about adding an API at the end of the
function that is called only when it is set, like hook functions? In case
the storage engine doesn't need any of the heap_create_with_catalog
functionality, then creating a full API would be better.
Comments?
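The hook-style option above could be sketched like this (standalone C; the
hook name and signature are invented for illustration): the hook is called at
the end of heap_create_with_catalog only when an AM has installed it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical optional hook: called at the end of
 * heap_create_with_catalog() only when the AM sets it. */
typedef void (*create_extra_relations_hook_type)(int *ncreated);

static create_extra_relations_hook_type create_extra_relations_hook = NULL;

/* Example hook a columnar AM might install (hypothetical) */
static void
columnar_extra_relations(int *ncreated)
{
    /* e.g. create a companion relation for the columnar data */
    (*ncreated)++;
}

static void
heap_create_with_catalog_sketch(int *ncreated)
{
    /* ... normal catalog work elided ... */
    if (create_extra_relations_hook)
        create_extra_relations_hook(ncreated);
}
```

Storage engines that don't need extra relations simply never set the hook,
so the common path stays unchanged.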
My next planned steps are a) to try to commit parts of the
slot-abstraction work b) to try to break out a few more pieces out of
the large pluggable storage patch.
so that I can separate them from the larger patch.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi!

On Wed, Oct 3, 2018 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed an updated version, with a fair amount of pending changes,
> and I hope all your pending (and not redundant, by our concurrent
> development), patches merged.

I'd like to also share some patches. I've used the current state of
pluggable-zheap as the base of my patches.

* 0001-remove-extra-snapshot-functions.patch – removes the
snapshot_satisfiesUpdate() and snapshot_satisfiesVacuum() functions from
the tableam API. snapshot_satisfiesUpdate() was unused completely.
snapshot_satisfiesVacuum() was used only in heap_copy_for_cluster(), so
I've replaced it with a direct heapam_satisfies_vacuum().

* 0002-add-costing-function-to-API.patch – adds functions for costing
sequential and table sample scans to the tableam API. The zheap costing
functions are now copies of the heap costing functions; this should be
adjusted in the future. Estimation for heap lookups during index scans
should also be pluggable, but is not yet implemented (TODO).

I've examined the code in the pluggable-zheap branch and the EDB github [1]
and I didn't find anything related to "delete-marking" indexes as stated on
slide #25 of the presentation [2]. So, basically, the contract between heap
and indexes remains unchanged: once you update one indexed field, you have
to update all the others. Did I understand correctly that this is postponed?

And a couple more notes from me:

* Right now table_fetch_row_version() is called in most places with
SnapshotAny. That might be working in the majority of cases, because in
heap there couldn't be multiple tuples residing at the same TID, while
zheap always returns the most recent tuple residing at this TID. But I
think it would be better to provide some meaningful snapshot instead of
SnapshotAny. If even the best thing we can do is to ask for the most
recent tuple at some TID, we need a more consistent way of asking the
table AM for this. I'm going to elaborate more on this.

* I'm not really sure we need the ability to iterate over multiple tuples
referenced by an index entry. It seems that the only place which really
needs this is heap_copy_for_cluster(), which is itself table AM specific.
Also zheap doesn't seem to be able to return more than one tuple from
zheapam_fetch_follow(). So, I'm going to investigate more on this, and if
this iteration is really unneeded, I'll propose a patch to delete it.

1. https://github.com/EnterpriseDB/zheap
2. http://www.pgcon.org/2018/schedule/attachments/501_zheap-a-new-storage-format-postgresql-5.pdf

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments
On Tue, Oct 9, 2018 at 1:46 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Oct 3, 2018 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made a validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier. I'll try to do that soon,
> there's a fair amount of change.
> I've pushed an updated version, with a fair amount of pending changes,
> and I hope all your pending (and not redundant, by our concurrent
> development), patches merged.

Yes, all the patches are merged.

> There's currently 3 regression test failures, that I'll look into
> tomorrow:
> - partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
>   a bit confused as to why, but haven't really investigated yet.
> - fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
>   I'll have to redo that in a different way.
> - I occasionally see failures in aggregates.sql - I've not figured out
>   what's going on there.

I also observed the failure of aggregates.sql, will look into it.

> Amit Khandekar said he'll publish a new version of the slot-abstraction
> patch tomorrow, so I'll rebase it onto that ASAP.

OK.

Here I attached two new API patches:
1. Set new rel file node
2. Create init fork
The above patches have a problem: testing them leads to a crash.
Sorry for not testing them earlier. Index relations also create a new
relfilenode, and because that function was moved into the pluggable table
access method, and index relations have no tableam routine, it crashes.
So moving the storage creation methods into the table access method
doesn't work; we may need common access methods that are shared across
both tables and indexes.
Regards,
Haribabu Kommi
Fujitsu Australia
On Tue, Oct 16, 2018 at 12:37 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> I've examined code in pluggable-zheap branch and EDB github [1] and I
> didn't find anything related to "delete-marking" indexes as stated on
> slide #25 of presentation [2]. So, basically contract between heap
> and indexes remains unchanged: once you update one indexed field,
> you have to update all the others.
>

Yes, this will be the behavior for the first version.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 9, 2018 at 1:46 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I also observed the failure of aggregates.sql, will look into it.
The random failure of aggregates.sql is as follows
SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
! avg_32
! ---------------------
! 32.6666666666666667
(1 row)
-- In 7.1, avg(float4) is computed using float8 arithmetic.
--- 8,16 ----
(1 row)
SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
! avg_32
! --------
!
(1 row)
The same NULL result appears for another aggregate query on column b.
The aggtest table is accessed by two tests that run in parallel:
aggregates.sql and transactions.sql. In transactions.sql, a transaction
deletes all the records in aggtest and is then aborted. I suspect a race
condition in the visibility checks makes aggtest appear to have no
records, which produces the NULL result.
If I try the scenario manually, by opening a transaction and deleting the
records, the issue does not occur.
I have yet to find the cause of this problem.
Regards,
Haribabu Kommi
On Tue, Oct 16, 2018 at 6:06 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Hi!
>
> On Wed, Oct 3, 2018 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
> > I've pushed an updated version, with a fair amount of pending changes,
> > and I hope all your pending (and not redundant, by our concurrent
> > development), patches merged.
>
> I'd like to also share some patches. I've used current state of
> pluggable-zheap for the base of my patches.

Thanks for the review and patches.

> * 0001-remove-extra-snapshot-functions.patch – removes
> snapshot_satisfiesUpdate() and snapshot_satisfiesVacuum() functions
> from tableam API. snapshot_satisfiesUpdate() was unused completely.
> snapshot_satisfiesVacuum() was used only in heap_copy_for_cluster().
> So, I've replaced it with direct heapam_satisfies_vacuum().

Thanks for the correction.

> * 0002-add-costing-function-to-API.patch – adds function for costing
> sequential and table sample scan to tableam API. zheap costing
> function are now copies of heap costing function. This should be
> adjusted in future.

This patch is missing the new *_cost.c files that contain the specific
cost functions.

> Estimation for heap lookup during index scans
> should be also pluggable, but not yet implemented (TODO).

Yes. Is it possible to use the same API that was added by the above
patch?
Regards,
Haribabu Kommi
Fujitsu Australia
On Thu, Oct 18, 2018 at 6:28 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Tue, Oct 16, 2018 at 6:06 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> * 0002-add-costing-function-to-API.patch – adds function for costing
>> sequential and table sample scan to tableam API. zheap costing
>> function are now copies of heap costing function. This should be
>> adjusted in future.
>
> This patch misses the new *_cost.c files that are added specific cost
> functions.

Thank you for noticing. A revised patchset is attached.

>> Estimation for heap lookup during index scans
>> should be also pluggable, but not yet implemented (TODO).
>
> Yes, Is it possible to use the same API that is added by above
> patch?

I'm not yet sure; I'll elaborate more on that. I'd like to keep the
number of costing functions small. Handling the costing of index scan
heap fetches will probably require a function signature change.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments
On Thu, Oct 18, 2018 at 1:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I am yet to find the cause for this problem.
I have not yet been able to produce a test case where the above issue
occurs reliably for debugging; it happens randomly. I will try to add
some logs to track down the problem.
While checking the above problem, I found some corrections:
1. Remove the tableam_common.c file, as it is not used.
2. Remove the extra heap tuple visibility check in the
   heapgettup_pagemode function.
3. New API for the init fork.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Mon, Oct 22, 2018 at 6:16 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I am not yet able to generate a test case where the above issue can occur
> easily for debugging, it is happening randomly. I will try to add some
> logs to find out the problem.
I was able to produce a simple test and found the problem. The issue is
with the following SQL:

SELECT *
INTO TABLE xacttest
FROM aggtest;

During the processing of the above query, the tuple selected from aggtest
is sent to the intorel_receive() function, and the same tuple is reused
for the insert. Because of this, the tuple's xmin is overwritten, which
makes the data invisible when selected by another query. I fixed this
issue by materializing the slot.
During the above test run, I also found an issue in ANALYZE, which was
trying to access data at an invalid offset. A fix patch is attached.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Tue, Oct 23, 2018 at 5:49 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I am able to generate the simple test and found the problem. [...]
> I fixed this issue by materializing the slot.
The wrong patch was attached in the earlier mail, sorry for the
inconvenience. The proper fix patch is attached.
I will look into the isolation test failures.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Tue, Oct 23, 2018 at 6:11 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Wrong patch attached in the earlier mail, sorry for the inconvenience.
> Attached proper fix patch. I will look into isolation test failures.
Attached is a cumulative patch with all the fixes shared in my earlier
mails. Except for the fast_default test, the rest of the test failures
are fixed.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
> On Fri, 26 Oct 2018 at 13:25, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> Here I attached the cumulative patch with all fixes that are shared in earlier mails by me.
> Except fast_default test, rest of test failures are fixed.

Hi,

If I understand correctly, these patches are for the branch
"pluggable-storage" in [1] (at least I couldn't apply them cleanly to the
"pluggable-zheap" branch), right?

I've tried to experiment a bit with the current state of the patch, and
accidentally stumbled upon what seems to be an issue - when I run pgbench
against it with some significant number of clients and script [2]:

$ pgbench -T 60 -c 128 -j 64 -f zipfian.sql

I've got for some clients an error:

client 117 aborted in command 5 (SQL) of script 0; ERROR:
unrecognized heap_update status: 1

This problem couldn't be reproduced on the master branch, so I've tried
to investigate it. It comes from nodeModifyTable.c:1267, when we've got
HeapTupleInvisible as a result, and this value in turn comes from
table_lock_tuple. Everything points to the new way of handling the
HeapTupleUpdated result from heap_update, when the table_lock_tuple call
was introduced. Since I don't see anything similar in the master branch,
can anyone clarify why this lock is necessary here?

Out of curiosity I've rearranged the code that handles HeapTupleUpdated
back to a switch and removed table_lock_tuple (see the attached patch, it
can be applied on top of the latest two patches posted by Haribabu) and
it seems to solve the issue.

[1]: https://github.com/anarazel/postgres-pluggable-storage
[2]: https://gist.github.com/erthalion/c85ba0e12146596d24c572234501e756
Attachments
On Mon, Oct 29, 2018 at 7:40 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Fri, 26 Oct 2018 at 13:25, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> Here I attached the cumulative patch with all fixes that are shared in earlier mails by me.
> Except fast_default test, rest of test failures are fixed.
> Hi,
>
> If I understand correctly, these patches are for the branch "pluggable-storage"
> in [1] (at least I couldn't apply them cleanly to "pluggable-zheap" branch),
> right?
Yes, the patches attached are for pluggable-storage branch.
> I've tried to experiment a bit with the current status of the patch, and
> accidentally stumbled upon what seems to be an issue - when I run pgbench
> against it with some significant number of clients and script [2]:
>
> $ pgbench -T 60 -c 128 -j 64 -f zipfian.sql
Thanks for testing the patches.
> I've got for some client an error:
>
> client 117 aborted in command 5 (SQL) of script 0; ERROR:
> unrecognized heap_update status: 1
This error corresponds to the tuple state HeapTupleInvisible. As per the
comments in heap_lock_tuple, this is possible with ON CONFLICT ... DO
UPDATE, but because table_lock_tuple was reorganized out of
EvalPlanQual(), the invisible result is now returned in other cases as
well. That case was missed in the new code.
> This problem couldn't be reproduced on the master branch, so I've tried to
> investigate it. It comes from nodeModifyTable.c:1267, when we've got
> HeapTupleInvisible as a result, and this value in turn comes from
> table_lock_tuple. Everything points to the new way of handling HeapTupleUpdated
> result from heap_update, when table_lock_tuple call was introduced. Since I
> don't see anything similar in the master branch, can anyone clarify why is this
> lock necessary here?
In the master branch there is also a tuple lock, taken inside the
EvalPlanQual() function. In the pluggable-storage code, the lock is taken
outside it, with the function calls rearranged, to make it easier for
table access methods to provide their own MVCC implementation.
> Out of curiosity I've rearranged the code, that handles
> HeapTupleUpdated, back to switch and removed table_lock_tuple (see the attached
> patch, it can be applied on top of the latest two patches posted by Haribabu)
> and it seems to solve the issue.
In addition to the rearrangement in your mail, the attached draft patch,
which handles invisible tuples in the update and delete cases, should
also fix it.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
> On Mon, 29 Oct 2018 at 05:56, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>> This problem couldn't be reproduced on the master branch, so I've tried to
>> investigate it. It comes from nodeModifyTable.c:1267, when we've got
>> HeapTupleInvisible as a result, and this value in turn comes from
>> table_lock_tuple. Everything points to the new way of handling HeapTupleUpdated
>> result from heap_update, when table_lock_tuple call was introduced. Since I
>> don't see anything similar in the master branch, can anyone clarify why is this
>> lock necessary here?
>
> In the master branch code also, there is a tuple lock that is happening in
> EvalPlanQual() function, but pluggable-storage code, the lock is kept outside
> and also function call rearrangements, to make it easier for the table access
> methods to provide their own MVCC implementation.

Yes, now I see it, thanks. Also I can confirm that the attached patch
solves this issue.

FYI, alongside reviewing the code changes I've run a few performance
tests (that's why I hit this issue with pgbench in the first place). In
case of high concurrency, so far I see a small performance degradation in
comparison with the master branch (about 2-5% of average latency,
depending on the level of concurrency), but can't really say why exactly
(perf just shows barely noticeable overhead here and there; maybe what I
see is actually a cumulative impact).
On Wed, Oct 31, 2018 at 9:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Mon, 29 Oct 2018 at 05:56, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>> This problem couldn't be reproduced on the master branch, so I've tried to
>> investigate it. It comes from nodeModifyTable.c:1267, when we've got
>> HeapTupleInvisible as a result, and this value in turn comes from
>> table_lock_tuple. Everything points to the new way of handling HeapTupleUpdated
>> result from heap_update, when table_lock_tuple call was introduced. Since I
>> don't see anything similar in the master branch, can anyone clarify why is this
>> lock necessary here?
>
>
> In the master branch code also, there is a tuple lock that is happening in
> EvalPlanQual() function, but pluggable-storage code, the lock is kept outside
> and also function call rearrangements, to make it easier for the table access
> methods to provide their own MVCC implementation.
> Yes, now I see it, thanks. Also I can confirm that the attached patch solves
> this issue.
Thanks for the testing and confirmation.
> FYI, alongside with reviewing the code changes I've run a few performance tests
> (that's why I hit this issue with pgbench in the first place). In case of high
> concurrency so far I see small performance degradation in comparison with the
> master branch (about 2-5% of average latency, depending on the level of
> concurrency), but can't really say why exactly (perf just shows barely
> noticeable overhead here and there, maybe what I see is actually a cumulative
> impact).
Thanks for sharing your observation. I will also analyze it and try to
find the performance bottlenecks that are causing the overhead.
Attached are the cumulative fixes to the patches, new API additions for
zheap, and a basic outline of the documentation.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Fri, Nov 2, 2018 at 11:17 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Thanks for sharing your observation, I will also analyze and try to find
> out performance bottlenecks that are causing the overhead.
I tried running the pgbench performance tests with a small number of
clients on my laptop, and I didn't find any performance issues; maybe the
issue is visible only with more clients. Even with the perf tool, I am
not able to pinpoint a problem function. As you said, the combination of
all the changes may lead to some overhead.
Attached are the cumulative patches with further fixes, and also basic
syntax regression tests.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Ashwin (copied) and I got a chance to go through the latest code from
Andres' github repository. We would like to share some comments/questions:

The TupleTableSlot argument is well suited for row-oriented storage. For
a column-oriented storage engine, a projection list indicating the
columns to be scanned may be necessary. Is it possible to share this
information with the current interface?

We realized that DDLs such as heap_create_with_catalog() are not
generalized. Haribabu's latest patch that adds SetNewFileNode_function()
and CreateInitFork_function() is a step towards this end. However, the
current API assumes that the storage engine uses relation forks. Isn't
that too restrictive?

TupleDelete_function() accepts changingPart as a parameter to indicate
whether this deletion is part of a movement from one partition to
another. Partitioning is a higher-level abstraction as compared to
storage. Ideally, the storage layer should have no knowledge of
partitioning. The tuple delete API should not accept any parameter
related to partitioning.

The API needs to be more accommodating towards block sizes used in
storage engines. Currently, the same block size as heap seems to be
assumed, as evident from the types of some members of the generic scan
object:

typedef struct TableScanDescData
{
    /* state set up at initscan time */
    BlockNumber rs_nblocks;     /* total number of blocks in rel */
    BlockNumber rs_startblock;  /* block # to start at */
    BlockNumber rs_numblocks;   /* max number of blocks to scan */
    /* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
    bool        rs_syncscan;    /* report location to syncscan logic? */
} TableScanDescData;

Using bytes to represent this information would be more generic, e.g.
rs_startlocation as bytes/offset instead of rs_startblock, and so on.

Asim
On Thu, Nov 22, 2018 at 1:12 PM Asim R P <apraveen@pivotal.io> wrote:
> Ashwin (copied) and I got a chance to go through the latest code from
> Andres' github repository. We would like to share some
> comments/questions:
Thanks for the review.
> The TupleTableSlot argument is well suited for row-oriented storage.
> For a column-oriented storage engine, a projection list indicating the
> columns to be scanned may be necessary. Is it possible to share this
> information with current interface?
Currently all the interfaces are designed for row-oriented storage; as
you said, we need a new API for a projection list. The current patch set
is itself big and needs to be stabilized first; those new APIs useful
for columnar storage will then be added in the next set of patches.
> We realized that DDLs such as heap_create_with_catalog() are not
> generalized. Haribabu's latest patch that adds
> SetNewFileNode_function() and CreateInitFork_function() is a step
> towards this end. However, the current API assumes that the storage
> engine uses relation forks. Isn't that too restrictive?
The current set of APIs has many assumptions and uses the existing
framework. Thanks for your point; I will check how to enhance it.
> TupleDelete_function() accepts changingPart as a parameter to indicate
> if this deletion is part of a movement from one partition to another.
> Partitioning is a higher level abstraction as compared to storage.
> Ideally, storage layer should have no knowledge of partitioning. The
> tuple delete API should not accept any parameter related to
> partitioning.
Thanks for your point; I will look into how to extract that.
> The API needs to be more accommodating towards block sizes used in
> storage engines. Currently, the same block size as heap seems to be
> assumed, as evident from the type of some members of generic scan
> object:
>
> typedef struct TableScanDescData
> {
>     /* state set up at initscan time */
>     BlockNumber rs_nblocks;     /* total number of blocks in rel */
>     BlockNumber rs_startblock;  /* block # to start at */
>     BlockNumber rs_numblocks;   /* max number of blocks to scan */
>     /* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
>     bool        rs_syncscan;    /* report location to syncscan logic? */
> } TableScanDescData;
>
> Using bytes to represent this information would be more generic. E.g.
> rs_startlocation as bytes/offset instead of rs_startblock and so on.
Currently the API doesn't support different block sizes for different
storage interfaces. Thanks for your point, but this can definitely be
taken care of in the next set of patches.
Andres, now that the TupleTableSlot changes are committed, do you want me
to share the rebased pluggable storage patch, or are you already working
on it?
Regards,
Haribabu Kommi
Fujitsu Australia
Hi,

FWIW, now that oids are removed, and the tuple table slot abstraction got
in, I'm working on rebasing the pluggable storage patchset on top of that.

On 2018-11-27 12:48:36 +1100, Haribabu Kommi wrote:
> On Thu, Nov 22, 2018 at 1:12 PM Asim R P <apraveen@pivotal.io> wrote:
> > Ashwin (copied) and I got a chance to go through the latest code from
> > Andres' github repository. We would like to share some
> > comments/questions:
>
> Thanks for the review.
>
> > The TupleTableSlot argument is well suited for row-oriented storage.
> > For a column-oriented storage engine, a projection list indicating the
> > columns to be scanned may be necessary. Is it possible to share this
> > information with current interface?
>
> Currently all the interfaces are designed for row-oriented storage, as you
> said we need a new API for projection list. The current patch set itself
> is big and it needs to be stabilized and then in the next set of the patches,
> those new APIs will be added that will be useful for columnar storage.

Precisely.

> > TupleDelete_function() accepts changingPart as a parameter to indicate
> > if this deletion is part of a movement from one partition to another.
> > Partitioning is a higher level abstraction as compared to storage.
> > Ideally, storage layer should have no knowledge of partitioning. The
> > tuple delete API should not accept any parameter related to
> > partitioning.
>
> Thanks for your point, will look into it in how to change extract it.

I don't think that's actually a problem. The changingPart parameter is
just a marker that the deletion is part of moving a tuple across
partitions. For heap and everything compatible, it's used to include
information in the tuple so that concurrent modifications reaching such
a tuple via EPQ error out.

> Andres, as the tupletableslot changes are committed, do you want me to
> share the rebased pluggable storage patch? you already working on it?

Working on it.
Greetings, Andres Freund
Hi, On 2018/11/02 9:17, Haribabu Kommi wrote: > Here I attached the cumulative fixes of the patches, new API additions for > zheap and > basic outline of the documentation. I've read the documentation patch while also looking at the code and here are some comments. + Each table is stored as its own physical <firstterm>relation</firstterm> and so + is described by an entry in the <structname>pg_class</structname> catalog. I think the "so" in "and so is described by an entry in..." is not necessary. + The contents of an table are entirely under the control of its access method. "a" table + (All the access methods furthermore use the standard page layout described in + <xref linkend="storage-page-layout"/>.) Maybe write the two sentences above as: A table's content is entirely controlled by its access method, although all access methods use the same standard page layout described in <xref linkend="storage-page-layout"/>. + SlotCallbacks_function slot_callbacks; + + SnapshotSatisfies_function snapshot_satisfies; + SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate; + SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum; Like other functions, how about a one sentence comment for these, like: /* * Function to get an AM-specific set of functions for manipulating * TupleTableSlots */ SlotCallbacks_function slot_callbacks; /* AM-specific snapshot visibility determination functions */ SnapshotSatisfies_function snapshot_satisfies; SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate; SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum; + TupleFetchFollow_function tuple_fetch_follow; + + GetTupleData_function get_tuple_data; How about removing the empty line so that get_tuple_data can be seen as part of the group /* Operations on physical tuples */ + RelationVacuum_function relation_vacuum; + RelationScanAnalyzeNextBlock_function scan_analyze_next_block; + RelationScanAnalyzeNextTuple_function scan_analyze_next_tuple; + RelationCopyForCluster_function 
relation_copy_for_cluster; + RelationSync_function relation_sync; Add /* Operations to support VACUUM/ANALYZE */ as a description for this group? + BitmapPagescan_function scan_bitmap_pagescan; + BitmapPagescanNext_function scan_bitmap_pagescan_next; Add /* Operations to support bitmap scans */ as a description for this group? + SampleScanNextBlock_function scan_sample_next_block; + SampleScanNextTuple_function scan_sample_next_tuple; Add /* Operations to support sampling scans */ as a description for this group? + ScanEnd_function scan_end; + ScanRescan_function scan_rescan; + ScanUpdateSnapshot_function scan_update_snapshot; Move these two to be in the /* Operations on relation scans */ group? + BeginIndexFetchTable_function begin_index_fetch; + EndIndexFetchTable_function reset_index_fetch; + EndIndexFetchTable_function end_index_fetch; Add /* Operations to support index scans */ as a description for this group? + IndexBuildRangeScan_function index_build_range_scan; + IndexValidateScan_function index_validate_scan; Add /* Operations to support index build */ as a description for this group? + CreateInitFork_function CreateInitFork; Add /* Function to create an init fork for unlogged tables */? By the way, I can see the following two in the source code, but not in the documentation. 
EstimateRelSize_function EstimateRelSize; SetNewFileNode_function SetNewFileNode; + The table construction and maintenance functions that an table access + method must provide in <structname>TableAmRoutine</structname> are: "a" table access method + <para> +<programlisting> +TupleTableSlotOps * +slot_callbacks (Relation relation); +</programlisting> + API to access the slot specific methods; + Following methods are available; + <structname>TTSOpsVirtual</structname>, + <structname>TTSOpsHeapTuple</structname>, + <structname>TTSOpsMinimalTuple</structname>, + <structname>TTSOpsBufferTuple</structname>, + </para> Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or its relations to the TableAmRoutine abstraction, I think the text description could better be written as: "API to get the slot operations struct for a given table access method" It's not clear to me why various TTSOps* structs are listed here? Is the point that different AMs may choose one of the listed alternatives? For example, I see that heap AM implementation returns TTOpsBufferTuple, so it manipulates slots containing buffered tuples, right? Other AMs are free to return any one of these? For example, some AMs may never use buffer manager and hence not use TTOpsBufferTuple. Is that understanding correct? + <para> +<programlisting> +bool +snapshot_satisfies (TupleTableSlot *slot, Snapshot snapshot); +</programlisting> + API to check whether the provided slot is visible to the current + transaction according the snapshot. + </para> Do you mean: "API to check whether the tuple contained in the provided slot is visible...."? + <para> +<programlisting> +Oid +tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid, + int options, BulkInsertState bistate); +</programlisting> + API to insert the tuple and provide the <literal>ItemPointerData</literal> + where the tuple is successfully inserted. + </para> It's not clear from the signature where you get the ItemPointerData. 
Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think this should mention it a bit differently, like: API to insert the tuple contained in the provided slot and return its TID, that is, the location where the tuple is successfully inserted + API to insert the tuple with a speculative token. This API is similar + like <literal>tuple_insert</literal>, with additional speculative + information. How about: This API is similar to <literal>tuple_insert</literal>, although with additional information necessary for speculative insertion + <para> +<programlisting> +void +tuple_complete_speculative (Relation rel, + TupleTableSlot *slot, + uint32 specToken, + bool succeeded); +</programlisting> + API to complete the state of the tuple inserted by the API <literal>tuple_insert_speculative</literal> + with the successful completion of the index insert. + </para> How about: API to complete the speculative insertion of a tuple started by <literal>tuple_insert_speculative</literal>, invoked after finishing the index insert + <para> +<programlisting> +bool +tuple_fetch_row_version (Relation relation, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + Relation stats_relation); +</programlisting> + API to fetch and store the Buffered Heap tuple in the provided slot + based on the ItemPointer. + </para> It seems that this description is based on what heapam_fetch_row_version() does, but it should be more generic, maybe like: API to fetch a buffered tuple given its TID and store it in the provided slot + <para> +<programlisting> +HTSU_Result +TupleLock_function (Relation relation, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + CommandId cid, + LockTupleMode mode, + LockWaitPolicy wait_policy, + uint8 flags, + HeapUpdateFailureData *hufd); I guess you meant to write "tuple_lock" here, not "TupleLock_function". +</programlisting> + API to lock the specified the ItemPointer tuple and fetches the newest version of + its tuple and TID. 
+ </para> How about: API to lock the specified tuple and return the TID of its newest version + <para> +<programlisting> +void +tuple_get_latest_tid (Relation relation, + Snapshot snapshot, + ItemPointer tid); +</programlisting> + API to get the the latest TID of the tuple with the given itempointer. + </para> How about: API to get TID of the latest version of the specified tuple + <para> +<programlisting> +bool +tuple_fetch_follow (struct IndexFetchTableData *scan, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + bool *call_again, bool *all_dead); +</programlisting> + API to get the all the tuples of the page that satisfies itempointer. + </para> IIUC, "all the tuples of the page" in the above sentence means all the tuples in the HOT chain of a given heap tuple, making this description of the API slightly specific to the heap AM. Can we make the description more generic, or is the API itself so specific that it cannot be expressed in generic terms? Ignoring that for a moment, I think the sentence contains more "the"s than there need to be, so maybe write as: API to get all tuples on a given page that are linked to the tuple of the given TID + <para> +<programlisting> +tuple_data +get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags); +</programlisting> + API to return the internal structure members of the HeapTuple. + </para> I think this description doesn't mention enough details about either the information that needs to be specified when calling the function (what's in flags) or the information that's returned. + <para> +<programlisting> +bool +scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin, + double *liverows, double *deadrows, TupleTableSlot *slot)); +</programlisting> + API to analyze the block and fill the buffered heap tuple in the slot and also + provide the live and dead rows. 
+ </para> How about: API to get the next tuple from the block being scanned, which also updates the number of live and dead rows encountered + <para> +<programlisting> +void +relation_copy_for_cluster (Relation NewHeap, Relation OldHeap, Relation OldIndex, + bool use_sort, + TransactionId OldestXmin, TransactionId FreezeXid, MultiXactId MultiXactCutoff, + double *num_tuples, double *tups_vacuumed, double *tups_recently_dead); +</programlisting> + API to copy one relation to another relation eith using the Index or table scan. + </para> Typo: eith -> either But maybe, rewrite this as: API to make a copy of the content of a relation, optionally sorted using either the specified index or by sorting explicitly + <para> +<programlisting> +TableScanDesc +scan_begin (Relation relation, + Snapshot snapshot, + int nkeys, ScanKey key, + ParallelTableScanDesc parallel_scan, + bool allow_strat, + bool allow_sync, + bool allow_pagemode, + bool is_bitmapscan, + bool is_samplescan, + bool temp_snap); +</programlisting> + API to start the relation scan for the provided relation and returns the + <structname>TableScanDesc</structname> structure. + </para> How about: API to start a scan of a relation using specified options, which returns the <structname>TableScanDesc</structname> structure to be used for subsequent scan operations + <para> +<programlisting> +void +scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlks); +</programlisting> + API to fix the relation scan range limits. + </para> How about: API to set scan range endpoints + <para> +<programlisting> +bool +scan_bitmap_pagescan (TableScanDesc scan, + TBMIterateResult *tbmres); +</programlisting> + API to scan the relation and fill the scan description bitmap with valid item pointers + for the specified block. + </para> This says "to scan the relation", but seems to be concerned with only a page worth of data as the name also says. Also, it's not clear what "scan description bitmap" means. 
Maybe write as: API to scan the relation block specified in the scan descriptor to collect and return the tuples requested by the given bitmap + <para> +<programlisting> +bool +scan_bitmap_pagescan_next (TableScanDesc scan, + TupleTableSlot *slot); +</programlisting> + API to fill the buffered heap tuple data from the bitmap scanned item pointers and store + it in the provided slot. + </para> How about: API to select the next tuple from the set of tuples of a given page specified in the scan descriptor and return in the provided slot; returns false if no more tuples to return on the given page + <para> +<programlisting> +bool +scan_sample_next_block (TableScanDesc scan, struct SampleScanState *scanstate); +</programlisting> + API to scan the relation and fill the scan description bitmap with valid item pointers + for the specified block provided by the sample method. + </para> Looking at the code, this API selects the next block using the sampling method and nothing more, although I see that the heap AM implementation also does heapgetpage thus collecting live tuples in the array known only to heap AM. So, how about: API to select the next block of the relation using the given sampling method and set its information in the scan descriptor + <para> +<programlisting> +bool +scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState *scanstate, TupleTableSlot *slot); +</programlisting> + API to fill the buffered heap tuple data from the bitmap scanned item pointers based on the sample + method and store it in the provided slot. + </para> How about: API to select the next tuple using the given sampling method from the set of tuples collected from the block previously selected by the sampling method + <para> +<programlisting> +void +scan_rescan (TableScanDesc scan, ScanKey key, bool set_params, + bool allow_strat, bool allow_sync, bool allow_pagemode); +</programlisting> + API to restart the relation scan with provided data. 
+ </para> How about: API to restart the given scan using provided options, releasing any resources (such as buffer pins) already held by the scan + <para> +<programlisting> +void +scan_update_snapshot (TableScanDesc scan, Snapshot snapshot); +</programlisting> + API to update the relation scan with the new snapshot. + </para> How about: API to set the visibility snapshot to be used by a given scan + <para> +<programlisting> +IndexFetchTableData * +begin_index_fetch (Relation relation); +</programlisting> + API to prepare the <structname>IndexFetchTableData</structname> for the relation. + </para> This API is a bit vague. As in, it's not clear from the name when it's to be called and what's to be done with the returned struct. How about at least adding more details about what the returned struct is for, like: API to get the <structname>IndexFetchTableData</structname> to be assigned to an index scan on the specified relation + <para> +<programlisting> +void +reset_index_fetch (struct IndexFetchTableData* data); +</programlisting> + API to reset the prepared internal members of the <structname>IndexFetchTableData</structname>. + </para> This description seems wrong if I look at the code. Its purpose seems to be to reset the AM-specific members, such as releasing the buffer pin held in xs_cbuf in the heap AM's case. How about: API to release AM-specific resources held by the <structname>IndexFetchTableData</structname> of a given index scan + <para> +<programlisting> +void +end_index_fetch (struct IndexFetchTableData* data); +</programlisting> + API to clear and free the <structname>IndexFetchTableData</structname>. 
+ </para> Given above, how about: API to release AM-specific resources held by the <structname>IndexFetchTableData</structname> of a given index scan and free the memory of <structname>IndexFetchTableData</structname> itself + <para> +<programlisting> +double +index_build_range_scan (Relation heapRelation, + Relation indexRelation, + IndexInfo *indexInfo, + bool allow_sync, + bool anyvisible, + BlockNumber start_blockno, + BlockNumber end_blockno, + IndexBuildCallback callback, + void *callback_state, + TableScanDesc scan); +</programlisting> + API to perform the table scan with bounded range specified by the caller + and insert the satisfied records into the index using the provided callback + function pointer. + </para> This is a bit of a heavy API and the above description lacks some details. Also, isn't it a bit misleading to use the name end_blockno if it is interpreted as num_blocks by the internal APIs? How about: API to scan the specified blocks of the given table and insert them into the specified index using the provided callback function + <para> +<programlisting> +void +index_validate_scan (Relation heapRelation, + Relation indexRelation, + IndexInfo *indexInfo, + Snapshot snapshot, + struct ValidateIndexState *state); +</programlisting> + API to perform the table scan and insert the satisfied records into the index. + This API is similar like <function>index_build_range_scan</function>. This + is used in the scenario of concurrent index build. + </para> This one's a complicated API too. 
How about: API to scan the table according to the given snapshot and insert tuples satisfying the snapshot into the specified index, provided their TIDs are also present in the <structname>ValidateIndexState</structname> struct; this API is used as the last phase of a concurrent index build + <sect2> + <title>Table scanning</title> + + <para> + </para> + </sect2> + + <sect2> + <title>Table insert/update/delete</title> + + <para> + </para> + </sect2> + + <sect2> + <title>Table locking</title> + + <para> + </para> + </sect2> + + <sect2> + <title>Table vacuum</title> + + <para> + </para> + </sect2> + + <sect2> + <title>Table fetch</title> + + <para> + </para> + </sect2> Seems like you forgot to put the individual API descriptions under these sub-headers. Actually, I think it'd be better to try to format this page to look more like the following: https://www.postgresql.org/docs/devel/fdw-callbacks.html - Currently, only indexes have access methods. The requirements for index - access methods are discussed in detail in <xref linkend="indexam"/>. + Currently, only <literal>INDEX</literal> and <literal>TABLE</literal> have + access methods. The requirements for access methods are discussed in detail + in <xref linkend="am"/>. Hmm, I don't see why you decided to add literal tags to INDEX and TABLE. Couldn't this have been written as: Currently, only tables and indexes have access methods. The requirements for access methods are discussed in detail in <xref linkend="am"/>. + This variable specifies the default table access method using which to create + objects (tables and materialized views) when a <command>CREATE</command> command does + not explicitly specify a access method. "variable" is not wrong, but "parameter" is used more often for GUCs. "a access method" should be "an access method". 
Maybe you could write this as: This variable specifies the default table access method to use when creating tables or materialized views if the <command>CREATE</command> does not explicitly specify an access method. + If the value does not match the name of any existing table access methods, + <productname>PostgreSQL</productname> will automatically use the default + table access method of the current database. any existing table access methods -> any existing table access method Although, shouldn't that cause an error instead of silently using the database default access method? Thank you for working on this. Really looking forward to how this shapes up. :) Thanks, Amit
> On Fri, Nov 16, 2018 at 2:05 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > > I tried running the pgbench performance tests with minimal clients in my laptop and I didn't > find any performance issues, may be issue is visible only with higher clients. Even with > perf tool, I am not able to get a clear problem function. As you said, combining of all changes > leads to some overhead. Just out of curiosity I've also tried tpc-c from oltpbench (in the very same simple environment), and it doesn't show any significant difference from master either. > Here I attached the cumulative patches with further fixes and basic syntax regress tests also. While testing the latest version I've noticed that you didn't include the fix for HeapTupleInvisible (so I see the error again), was that intentional? > On Tue, Nov 27, 2018 at 2:55 AM Andres Freund <andres@anarazel.de> wrote: > > FWIW, now that oids are removed, and the tuple table slot abstraction > got in, I'm working on rebasing the pluggable storage patchset ontop of > that. Yes, please. I've tried it myself for reviewing purposes, but the rebasing speed was not impressive. I also want to suggest moving it off GitHub and making it a regular patchset, since it's already a bit confusing as to what goes where and which patch to apply on top of which branch.
Hi, Thanks for these changes. I've merged a good chunk of them. On 2018-11-16 12:05:26 +1100, Haribabu Kommi wrote: > diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c > index c3960dc91f..3254e30a45 100644 > --- a/src/backend/access/heap/heapam_handler.c > +++ b/src/backend/access/heap/heapam_handler.c > @@ -1741,7 +1741,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do > { > HeapScanDesc scan = (HeapScanDesc) sscan; > Page targpage; > - OffsetNumber targoffset = scan->rs_cindex; > + OffsetNumber targoffset; > OffsetNumber maxoffset; > BufferHeapTupleTableSlot *hslot; > > @@ -1751,7 +1751,9 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do > maxoffset = PageGetMaxOffsetNumber(targpage); > > /* Inner loop over all tuples on the selected page */ > - for (targoffset = scan->rs_cindex; targoffset <= maxoffset; targoffset++) > + for (targoffset = scan->rs_cindex ? scan->rs_cindex : FirstOffsetNumber; > + targoffset <= maxoffset; > + targoffset++) > { > ItemId itemid; > HeapTuple targtuple = &hslot->base.tupdata; I thought it was better to fix the initialization for rs_cindex - any reason you didn't go for that? > diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c > index 8233475aa0..7bad246f55 100644 > --- a/src/backend/access/heap/heapam_visibility.c > +++ b/src/backend/access/heap/heapam_visibility.c > @@ -1838,8 +1838,10 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer) > case NON_VACUUMABLE_VISIBILTY: > return HeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer); > break; > - default: > + case END_OF_VISIBILITY: > Assert(0); > break; > } > + > + return false; /* keep compiler quiet */ I don't understand why END_OF_VISIBILITY is good idea? I now removed END_OF_VISIBILITY, and the default case. 
> @@ -593,6 +594,10 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self) > if (myState->rel->rd_rel->relhasoids) > slot->tts_tupleOid = InvalidOid; > > + /* Materialize the slot */ > + if (!TTS_IS_VIRTUAL(slot)) > + ExecMaterializeSlot(slot); > + > table_insert(myState->rel, > slot, > myState->output_cid, What's the point of adding materialization here? > @@ -570,6 +563,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull) > Assert(TTS_IS_HEAPTUPLE(scanslot) || > TTS_IS_BUFFERTUPLE(scanslot)); > > + if (hslot->tuple == NULL) > + ExecMaterializeSlot(scanslot); > + > d = heap_getsysattr(hslot->tuple, attnum, > scanslot->tts_tupleDescriptor, > op->resnull); Same? > diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c > index e055c0a7c6..34ef86a5bd 100644 > --- a/src/backend/executor/execMain.c > +++ b/src/backend/executor/execMain.c > @@ -2594,7 +2594,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate, > * datums that may be present in copyTuple). As with the next step, this > * is to guard against early re-use of the EPQ query. > */ > - if (!TupIsNull(slot)) > + if (!TupIsNull(slot) && !TTS_IS_VIRTUAL(slot)) > ExecMaterializeSlot(slot); Same? > #if FIXME > @@ -2787,16 +2787,7 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate) > if (isNull) > continue; > > - elog(ERROR, "frak, need to implement ROW_MARK_COPY"); > -#ifdef FIXME > - // FIXME: this should just deform the tuple and store it as a > - // virtual one. > - tuple = table_tuple_by_datum(erm->relation, datum, erm->relid); > - > - /* store tuple */ > - EvalPlanQualSetTuple(epqstate, erm->rti, tuple); > -#endif > - > + ExecForceStoreHeapTupleDatum(datum, slot); > } > } > } Cool. 
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c > index 56880e3d16..36ca07beb2 100644 > --- a/src/backend/executor/nodeBitmapHeapscan.c > +++ b/src/backend/executor/nodeBitmapHeapscan.c > @@ -224,6 +224,18 @@ BitmapHeapNext(BitmapHeapScanState *node) > > BitmapAdjustPrefetchIterator(node, tbmres); > > + /* > + * Ignore any claimed entries past what we think is the end of the > + * relation. (This is probably not necessary given that we got at > + * least AccessShareLock on the table before performing any of the > + * indexscans, but let's be safe.) > + */ > + if (tbmres->blockno >= scan->rs_nblocks) > + { > + node->tbmres = tbmres = NULL; > + continue; > + } > + I moved this into the storage engine, there just was a minor bug preventing the already existing check from taking effect. I don't think we should expose this kind of thing to the outside of the storage engine. > diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y > index 54382aba88..ea48e1d6e8 100644 > --- a/src/backend/parser/gram.y > +++ b/src/backend/parser/gram.y > @@ -4037,7 +4037,6 @@ CreateStatsStmt: > * > *****************************************************************************/ > > -// PBORKED: storage option > CreateAsStmt: > CREATE OptTemp TABLE create_as_target AS SelectStmt opt_with_data > { > @@ -4068,14 +4067,16 @@ CreateAsStmt: > ; > > create_as_target: > - qualified_name opt_column_list OptWith OnCommitOption OptTableSpace > + qualified_name opt_column_list table_access_method_clause > + OptWith OnCommitOption OptTableSpace > { > $$ = makeNode(IntoClause); > $$->rel = $1; > $$->colNames = $2; > - $$->options = $3; > - $$->onCommit = $4; > - $$->tableSpaceName = $5; > + $$->accessMethod = $3; > + $$->options = $4; > + $$->onCommit = $5; > + $$->tableSpaceName = $6; > $$->viewQuery = NULL; > $$->skipData = false; /* might get changed later */ > } > @@ -4125,14 +4126,15 @@ CreateMatViewStmt: > ; > > create_mv_target: > - 
qualified_name opt_column_list opt_reloptions OptTableSpace > + qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace > { > $$ = makeNode(IntoClause); > $$->rel = $1; > $$->colNames = $2; > - $$->options = $3; > + $$->accessMethod = $3; > + $$->options = $4; > $$->onCommit = ONCOMMIT_NOOP; > - $$->tableSpaceName = $4; > + $$->tableSpaceName = $5; > $$->viewQuery = NULL; /* filled at analysis time */ > $$->skipData = false; /* might get changed later */ > } Cool. I wonder if we should also somehow support SELECT INTO w/ USING? You've apparently started to do so with? > diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out > index 47dd885c4e..a4094ca3f1 100644 > --- a/src/test/regress/expected/create_am.out > +++ b/src/test/regress/expected/create_am.out > @@ -99,3 +99,81 @@ HINT: Use DROP ... CASCADE to drop the dependent objects too. > -- Drop access method cascade > DROP ACCESS METHOD gist2 CASCADE; > NOTICE: drop cascades to index grect2ind2 > +-- Create a heap2 table am handler with heapam handler > +CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler; > +SELECT * FROM pg_am where amtype = 't'; > + amname | amhandler | amtype > +--------+----------------------+-------- > + heap | heap_tableam_handler | t > + heap2 | heap_tableam_handler | t > +(2 rows) > + > +CREATE TABLE tbl_heap2(f1 int, f2 char(100)) using heap2; > +INSERT INTO tbl_heap2 VALUES(generate_series(1,10), 'Test series'); > +SELECT count(*) FROM tbl_heap2; > + count > +------- > + 10 > +(1 row) > + > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'tbl_heap2'; > + relname | relkind | amname > +-----------+---------+-------- > + tbl_heap2 | r | heap2 > +(1 row) > + > +-- create table as using heap2 > +CREATE TABLE tblas_heap2 using heap2 AS select * from tbl_heap2; > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where 
a.oid = r.relam AND r.relname = 'tblas_heap2'; > + relname | relkind | amname > +-------------+---------+-------- > + tblas_heap2 | r | heap2 > +(1 row) > + > +-- > +-- select into doesn't support new syntax, so it should be > +-- default access method. > +-- > +SELECT INTO tblselectinto_heap from tbl_heap2; > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'tblselectinto_heap'; > + relname | relkind | amname > +--------------------+---------+-------- > + tblselectinto_heap | r | heap > +(1 row) > + > +DROP TABLE tblselectinto_heap; > +-- create materialized view using heap2 > +CREATE MATERIALIZED VIEW mv_heap2 USING heap2 AS > + SELECT * FROM tbl_heap2; > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'mv_heap2'; > + relname | relkind | amname > +----------+---------+-------- > + mv_heap2 | m | heap2 > +(1 row) > + > +-- Try creating the unsupported relation kinds with using syntax > +CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2; > +ERROR: syntax error at or near "USING" > +LINE 1: CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2... > + ^ > +CREATE SEQUENCE test_seq USING heap2; > +ERROR: syntax error at or near "USING" > +LINE 1: CREATE SEQUENCE test_seq USING heap2; > + ^ > +-- Drop table access method, but fails as objects depends on it > +DROP ACCESS METHOD heap2; > +ERROR: cannot drop access method heap2 because other objects depend on it > +DETAIL: table tbl_heap2 depends on access method heap2 > +table tblas_heap2 depends on access method heap2 > +materialized view mv_heap2 depends on access method heap2 > +HINT: Use DROP ... CASCADE to drop the dependent objects too. 
> +-- Drop table access method with cascade > +DROP ACCESS METHOD heap2 CASCADE; > +NOTICE: drop cascades to 3 other objects > +DETAIL: drop cascades to table tbl_heap2 > +drop cascades to table tblas_heap2 > +drop cascades to materialized view mv_heap2 > diff --git a/src/test/regress/sql/create_am.sql b/src/test/regress/sql/create_am.sql > index 3e0ac104f3..0472a60f20 100644 > --- a/src/test/regress/sql/create_am.sql > +++ b/src/test/regress/sql/create_am.sql > @@ -66,3 +66,49 @@ DROP ACCESS METHOD gist2; > > -- Drop access method cascade > DROP ACCESS METHOD gist2 CASCADE; > + > +-- Create a heap2 table am handler with heapam handler > +CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler; > + > +SELECT * FROM pg_am where amtype = 't'; > + > +CREATE TABLE tbl_heap2(f1 int, f2 char(100)) using heap2; > +INSERT INTO tbl_heap2 VALUES(generate_series(1,10), 'Test series'); > +SELECT count(*) FROM tbl_heap2; > + > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'tbl_heap2'; > + > +-- create table as using heap2 > +CREATE TABLE tblas_heap2 using heap2 AS select * from tbl_heap2; > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'tblas_heap2'; > + > +-- > +-- select into doesn't support new syntax, so it should be > +-- default access method. 
> +-- > +SELECT INTO tblselectinto_heap from tbl_heap2; > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'tblselectinto_heap'; > + > +DROP TABLE tblselectinto_heap; > + > +-- create materialized view using heap2 > +CREATE MATERIALIZED VIEW mv_heap2 USING heap2 AS > + SELECT * FROM tbl_heap2; > + > +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a > + where a.oid = r.relam AND r.relname = 'mv_heap2'; > + > +-- Try creating the unsupported relation kinds with using syntax > +CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2; > + > +CREATE SEQUENCE test_seq USING heap2; > + > + > +-- Drop table access method, but fails as objects depends on it > +DROP ACCESS METHOD heap2; > + > +-- Drop table access method with cascade > +DROP ACCESS METHOD heap2 CASCADE; > -- > 2.18.0.windows.1 Nice! Greetings, Andres Freund
Hi, On 2018-11-26 17:55:57 -0800, Andres Freund wrote: > FWIW, now that oids are removed, and the tuple table slot abstraction > got in, I'm working on rebasing the pluggable storage patchset ontop of > that. I've pushed a version of that to the git tree, including a rebased version of zheap: https://github.com/anarazel/postgres-pluggable-storage https://github.com/anarazel/postgres-pluggable-zheap I'm still working on moving some of the out-of-access/zheap modifications into pluggable storage (see e.g. the first commit of the pluggable-zheap series). But this should allow others to start on a more recent codebase. My next steps are: - make relation creation properly pluggable - remove the typedefs from tableam.h, instead move them into the TableAmRoutine struct. - Move rs_{nblocks, startblock, numblocks} out of TableScanDescData - Move HeapScanDesc and IndexFetchHeapData out of relscan.h - See if the slot in SysScanDescData can be avoided, it's not exactly free of overhead. - remove ExecSlotCompare(), it's entirely unrelated to these changes imo (and in the wrong place) - rename HeapUpdateFailureData et al to not reference Heap - split pluggable storage patchset, to commit earlier: - EvalPlanQual slotification - trigger slotification - split of IndexBuildHeapScan out of index.c I'm wondering whether we should add table_beginscan/table_getnextslot/index_getnext_slot using the old API in an earlier commit that then could be committed separately, allowing the tablecmds.c changes to be committed soon. I'm wondering whether we should change the table_beginscan* API so it provides a slot - pretty much every caller has to do so, and it seems just as easy to create/dispose via table_beginscan/endscan. Further tasks I'm not yet planning to tackle, that I'd welcome help on: - pg_dump support - pg_upgrade testing - I think we should consider removing HeapTuple->t_tableOid, it should imo live entirely in the slot Greetings, Andres Freund
> On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2018-11-26 17:55:57 -0800, Andres Freund wrote: > > FWIW, now that oids are removed, and the tuple table slot abstraction > > got in, I'm working on rebasing the pluggable storage patchset ontop of > > that. > > I've pushed a version to that to the git tree, including a rebased > version of zheap: > https://github.com/anarazel/postgres-pluggable-storage > https://github.com/anarazel/postgres-pluggable-zheap Great, thanks! As a side note, I assume the last reference should be this, right? https://github.com/anarazel/postgres-pluggable-storage/tree/pluggable-zheap > Further tasks I'm not yet planning to tackle, that I'd welcome help on: > - pg_dump support > - pg_upgrade testing > - I think we should consider removing HeapTuple->t_tableOid, it should > imo live entirely in the slot I would love to try to help with pg_dump support.
Hello. At Tue, 27 Nov 2018 14:58:35 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <080ce65e-7b96-adbf-1c8c-7c88d87eaeda@lab.ntt.co.jp> > + <para> > +<programlisting> > +TupleTableSlotOps * > +slot_callbacks (Relation relation); > +</programlisting> > + API to access the slot specific methods; > + Following methods are available; > + <structname>TTSOpsVirtual</structname>, > + <structname>TTSOpsHeapTuple</structname>, > + <structname>TTSOpsMinimalTuple</structname>, > + <structname>TTSOpsBufferTuple</structname>, > + </para> > > Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or > its relations to the TableAmRoutine abstraction, I think the text > description could better be written as: > > "API to get the slot operations struct for a given table access method" > > It's not clear to me why various TTSOps* structs are listed here? Is the > point that different AMs may choose one of the listed alternatives? For > example, I see that heap AM implementation returns TTOpsBufferTuple, so it > manipulates slots containing buffered tuples, right? Other AMs are free > to return any one of these? For example, some AMs may never use buffer > manager and hence not use TTOpsBufferTuple. Is that understanding correct? Yeah, I'm not sure why it should not be a pointer to the struct itself but a function. And the four struct doesn't seem relevant to table AMs. Perhaps clear, getsomeattrs and so on should be listed instead. > + <para> > +<programlisting> > +Oid > +tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid, > + int options, BulkInsertState bistate); > +</programlisting> > + API to insert the tuple and provide the <literal>ItemPointerData</literal> > + where the tuple is successfully inserted. > + </para> > > It's not clear from the signature where you get the ItemPointerData. 
> Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think > this should mention it a bit differently, like: > > API to insert the tuple contained in the provided slot and return its TID, > that is, the location where the tuple is successfully inserted It is actually an OID, not a TID in the current code. TID is internally handled. > + <para> > +<programlisting> > +bool > +tuple_fetch_follow (struct IndexFetchTableData *scan, > + ItemPointer tid, > + Snapshot snapshot, > + TupleTableSlot *slot, > + bool *call_again, bool *all_dead); > +</programlisting> > + API to get the all the tuples of the page that satisfies itempointer. > + </para> > > IIUC, "all the tuples of of the page" in the above sentence means all the > tuples in the HOT chain of a given heap tuple, making this description of > the API slightly specific to the heap AM. Can we make the description > more generic or is the API itself very specific that it cannot be > expressed in generic terms? Ignoring that for a moment, I think the > sentence contains more "the"s than there need to be, so maybe write as: > > API to get all tuples on a given page that are linked to the tuple of the > given TID Mmm. This is exposing MVCC matters to indexam. I suppose we should refactor this API. > + <para> > +<programlisting> > +tuple_data > +get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags); > +</programlisting> > + API to return the internal structure members of the HeapTuple. > + </para> > > I think this description doesn't mention enough details of both the > information that needs to be specified when calling the function (what's > in flags) and the information that's returned. (I suppose it will be described in later sections.)
> + <para>
> +<programlisting>
> +bool
> +scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin,
> +                         double *liverows, double *deadrows, TupleTableSlot *slot);
> +</programlisting>
> + API to analyze the block and fill the buffered heap tuple in the slot, and also
> + provide the live and dead rows.
> + </para>
>
> How about:
>
> API to get the next tuple from the block being scanned, which also updates
> the number of live and dead rows encountered

"live" and "dead" are MVCC terms. I suppose that we should stash the
deadrows somewhere else. (But the analyze code would need to be modified
if we do so.)

> +void
> +scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlks);
> +</programlisting>
> + API to fix the relation scan range limits.
> + </para>
>
> How about:
>
> API to set scan range endpoints

This sets the start point and the number of blocks. Just "API to set the
scan range" would be sufficient, referring to the parameter list.

> + <para>
> +<programlisting>
> +bool
> +scan_bitmap_pagescan (TableScanDesc scan,
> +                      TBMIterateResult *tbmres);
> +</programlisting>
> + API to scan the relation and fill the scan description bitmap with valid item pointers
> + for the specified block.
> + </para>
>
> This says "to scan the relation", but seems to be concerned with only a
> page worth of data, as the name also says. Also, it's not clear what "scan
> description bitmap" means. Maybe write it as:
>
> API to scan the relation block specified in the scan descriptor to collect
> and return the tuples requested by the given bitmap

"API to collect the tuples in a page requested by the given bitmap scan
result", or something like that? I think a detailed explanation would be
required apart from the one-line description. Anyway, the name
TBMIterateResult doesn't seem proper to expose.
> + <para>
> +<programlisting>
> +bool
> +scan_sample_next_block (TableScanDesc scan, struct SampleScanState *scanstate);
> +</programlisting>
> + API to scan the relation and fill the scan description bitmap with valid item pointers
> + for the specified block provided by the sample method.
> + </para>
>
> Looking at the code, this API selects the next block using the sampling
> method and nothing more, although I see that the heap AM implementation
> also does heapgetpage, thus collecting live tuples in an array known only
> to the heap AM. So, how about:
>
> API to select the next block of the relation using the given sampling
> method and set its information in the scan descriptor

"block" and "page" seem randomly chosen here and there. I don't mind that
when seen in the core, but..

> + <para>
> +<programlisting>
> +bool
> +scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState *scanstate, TupleTableSlot *slot);
> +</programlisting>
> + API to fill the buffered heap tuple data from the bitmap scanned item pointers based on the sample
> + method and store it in the provided slot.
> + </para>
>
> How about:
>
> API to select the next tuple using the given sampling method from the set
> of tuples collected from the block previously selected by the sampling method

I'm not sure "from the set of tuples collected" is true. Wouldn't just
"the state of the sample scan", or something like that, be fine?

> + <para>
> +<programlisting>
> +void
> +scan_rescan (TableScanDesc scan, ScanKey key, bool set_params,
> +             bool allow_strat, bool allow_sync, bool allow_pagemode);
> +</programlisting>
> + API to restart the relation scan with the provided data.
> + </para>
>
> How about:
>
> API to restart the given scan using the provided options, releasing any
> resources (such as buffer pins) already held by the scan

It looks too detailed to me, but "with provided data" looks too coarse..
> + <para>
> +<programlisting>
> +void
> +scan_update_snapshot (TableScanDesc scan, Snapshot snapshot);
> +</programlisting>
> + API to update the relation scan with the new snapshot.
> + </para>
>
> How about:
>
> API to set the visibility snapshot to be used by a given scan

If so, the function name should be "scan_set_snapshot". Anyway, the
current name looks like "the function to update a snapshot (itself)".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. (in the next branch:)

At Tue, 27 Nov 2018 14:58:35 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <080ce65e-7b96-adbf-1c8c-7c88d87eaeda@lab.ntt.co.jp>

> Thank you for working on this. Really looking forward to how this shapes
> up. :)

+1. I looked through the documentation part, as that is where I can do
something.

am.html:

> 61.1. Overview of Index access methods
> 61.1.1. Basic API Structure for Indexes
> 61.1.2. Index Access Method Functions
> 61.1.3. Index Scanning
> 61.2. Overview of Table access methods
> 61.2.1. Table access method API
> 61.2.2. Table Access Method Functions
> 61.2.3. Table scanning

Aren't 61.1 and 61.2 better in the reverse order? Is there a reason for
the difference in the titles between 61.1.1 and 61.2.1? The contents are
quite similar.

+ <sect2 id="table-api">
+ <title>Table access method API</title>

The member names of the index AM struct begin with "am", but they don't
have a unified prefix in the table AM. It seems a bit inconsistent.
Perhaps we should rename some long and internal names..

+ <sect2 id="table-functions">
+ <title>Table Access Method Functions</title>

Table AM functions are far finer-grained than the index AM's. I think
that AM developers need a more concrete description of what every API
function does, and an explanation of the various previously-internal
structs. I suppose that how the functions are used in core code paths
will be written in the following sections.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Dec 11, 2018 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
Thanks for these changes. I've merged a good chunk of them.
Thanks.
On 2018-11-16 12:05:26 +1100, Haribabu Kommi wrote:
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index c3960dc91f..3254e30a45 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1741,7 +1741,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
> {
> HeapScanDesc scan = (HeapScanDesc) sscan;
> Page targpage;
> - OffsetNumber targoffset = scan->rs_cindex;
> + OffsetNumber targoffset;
> OffsetNumber maxoffset;
> BufferHeapTupleTableSlot *hslot;
>
> @@ -1751,7 +1751,9 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
> maxoffset = PageGetMaxOffsetNumber(targpage);
>
> /* Inner loop over all tuples on the selected page */
> - for (targoffset = scan->rs_cindex; targoffset <= maxoffset; targoffset++)
> + for (targoffset = scan->rs_cindex ? scan->rs_cindex : FirstOffsetNumber;
> + targoffset <= maxoffset;
> + targoffset++)
> {
> ItemId itemid;
> HeapTuple targtuple = &hslot->base.tupdata;
I thought it was better to fix the initialization for rs_cindex - any
reason you didn't go for that?
No specific reason. Thanks for the correction.
> diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
> index 8233475aa0..7bad246f55 100644
> --- a/src/backend/access/heap/heapam_visibility.c
> +++ b/src/backend/access/heap/heapam_visibility.c
> @@ -1838,8 +1838,10 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer)
> case NON_VACUUMABLE_VISIBILTY:
> return HeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer);
> break;
> - default:
> + case END_OF_VISIBILITY:
> Assert(0);
> break;
> }
> +
> + return false; /* keep compiler quiet */
I don't understand why END_OF_VISIBILITY is a good idea. I've now removed
END_OF_VISIBILITY, and the default case.
OK.
> @@ -593,6 +594,10 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
> if (myState->rel->rd_rel->relhasoids)
> slot->tts_tupleOid = InvalidOid;
>
> + /* Materialize the slot */
> + if (!TTS_IS_VIRTUAL(slot))
> + ExecMaterializeSlot(slot);
> +
> table_insert(myState->rel,
> slot,
> myState->output_cid,
What's the point of adding materialization here?
In earlier testing I observed that the received slot is a buffered slot
pointing at the original tuple; but when it is inserted into the new table,
the transaction id changes, which leads to an invisible tuple. For that
reason I added the materialization here.
> @@ -570,6 +563,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
> Assert(TTS_IS_HEAPTUPLE(scanslot) ||
> TTS_IS_BUFFERTUPLE(scanslot));
>
> + if (hslot->tuple == NULL)
> + ExecMaterializeSlot(scanslot);
> +
> d = heap_getsysattr(hslot->tuple, attnum,
> scanslot->tts_tupleDescriptor,
> op->resnull);
Same?
> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index e055c0a7c6..34ef86a5bd 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -2594,7 +2594,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
> * datums that may be present in copyTuple). As with the next step, this
> * is to guard against early re-use of the EPQ query.
> */
> - if (!TupIsNull(slot))
> + if (!TupIsNull(slot) && !TTS_IS_VIRTUAL(slot))
> ExecMaterializeSlot(slot);
Same?
Earlier, materializing a virtual tuple was throwing an error; for that
reason I added that check.
> index 56880e3d16..36ca07beb2 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -224,6 +224,18 @@ BitmapHeapNext(BitmapHeapScanState *node)
>
> BitmapAdjustPrefetchIterator(node, tbmres);
>
> + /*
> + * Ignore any claimed entries past what we think is the end of the
> + * relation. (This is probably not necessary given that we got at
> + * least AccessShareLock on the table before performing any of the
> + * indexscans, but let's be safe.)
> + */
> + if (tbmres->blockno >= scan->rs_nblocks)
> + {
> + node->tbmres = tbmres = NULL;
> + continue;
> + }
> +
I moved this into the storage engine, there just was a minor bug
preventing the already existing check from taking effect. I don't think
we should expose this kind of thing to the outside of the storage
engine.
OK.
> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 54382aba88..ea48e1d6e8 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -4037,7 +4037,6 @@ CreateStatsStmt:
> *
> *****************************************************************************/
>
> -// PBORKED: storage option
> CreateAsStmt:
> CREATE OptTemp TABLE create_as_target AS SelectStmt opt_with_data
> {
> @@ -4068,14 +4067,16 @@ CreateAsStmt:
> ;
>
> create_as_target:
> - qualified_name opt_column_list OptWith OnCommitOption OptTableSpace
> + qualified_name opt_column_list table_access_method_clause
> + OptWith OnCommitOption OptTableSpace
> {
> $$ = makeNode(IntoClause);
> $$->rel = $1;
> $$->colNames = $2;
> - $$->options = $3;
> - $$->onCommit = $4;
> - $$->tableSpaceName = $5;
> + $$->accessMethod = $3;
> + $$->options = $4;
> + $$->onCommit = $5;
> + $$->tableSpaceName = $6;
> $$->viewQuery = NULL;
> $$->skipData = false; /* might get changed later */
> }
> @@ -4125,14 +4126,15 @@ CreateMatViewStmt:
> ;
>
> create_mv_target:
> - qualified_name opt_column_list opt_reloptions OptTableSpace
> + qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
> {
> $$ = makeNode(IntoClause);
> $$->rel = $1;
> $$->colNames = $2;
> - $$->options = $3;
> + $$->accessMethod = $3;
> + $$->options = $4;
> $$->onCommit = ONCOMMIT_NOOP;
> - $$->tableSpaceName = $4;
> + $$->tableSpaceName = $5;
> $$->viewQuery = NULL; /* filled at analysis time */
> $$->skipData = false; /* might get changed later */
> }
Cool. I wonder if we should also somehow support SELECT INTO w/ USING?
You've apparently started to do so with?
I thought the same, but SELECT INTO is deprecated syntax; is it fine to add
the new syntax to it?
Regards,
Haribabu Kommi
Fujitsu Australia
> On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> - pg_dump support
> - pg_upgrade testing
> - I think we should consider removing HeapTuple->t_tableOid, it should
>   imo live entirely in the slot

I'm a bit confused, but what kind of pg_dump support are you talking about?
After a quick glance I don't see any table access specific logic there so
far. To check it I've created a test access method (which is a copy of heap,
but with some small differences) and pg_dump worked as expected.

As a side note, in a table description I haven't found any mention of which
access method is used for the table; it's probably useful to show that with
\d+ (see the attached patch).
Attachments
Hi,

On 2018-12-15 20:15:12 +0100, Dmitry Dolgov wrote:
> > On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_dump support
> > - pg_upgrade testing
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
>
> I'm a bit confused, but what kind of pg_dump support you're talking about?
> After a quick glance I don't see so far any table access specific logic there.
> To check it I've created a test access method (which is a copy of heap, but
> with some small differences) and pg_dump worked as expected.

We need to dump the table access method at dump time, otherwise we lose
that information.

> As a side note, in a table description I haven't found any mention of which
> access method is used for this table, probably it's useful to show that with \d+
> (see the attached patch).

I'm not convinced that's really worth the cost of including it in \d
(rather than \d+ or such). When developing an alternative access method
it's extremely useful to be able to just change the default access method
and run the existing tests, which this makes harder. It's also a lot of
churn.

Greetings,

Andres Freund
> On Sat, Dec 15, 2018 at 8:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> We need to dump the table access method at dump time, otherwise we lose
> that information.

Oh, right. So, something like in the attached patch?

> > As a side note, in a table description I haven't found any mention of which
> > access method is used for this table, probably it's useful to show that with \d+
> > (see the attached patch).
>
> I'm not convinced that's really worth the cost of including it in \d
> (rather than \d+ or such).

Maybe I'm missing the point, but I meant exactly the same, and the patch
suggested in the previous email adds this info to \d+.
Attachments
On Mon, Dec 10, 2018 at 8:13 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Just out of curiosity I've also tried tpc-c from oltpbench (in the very same
> simple environment), it doesn't show any significant difference from master as
> well.

FWIW, I have found BenchmarkSQL to be significantly better than oltpbench,
having used both quite a bit now:

https://bitbucket.org/openscg/benchmarksql

For example, oltpbench requires a max_connections setting that far exceeds
the number of terminals/clients used by the benchmark, because the number
of connections used during bulk loading far exceeds what is truly required.
BenchmarkSQL also makes it easy to generate useful html reports, complete
with graphs.

--
Peter Geoghegan
> On Sat, Dec 15, 2018 at 8:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> We need to dump the table access method at dump time, otherwise we lose
> that information.

As a result of the discussion in [1] (btw, thanks for starting it), here is
a proposed solution with tracking of the current default_table_access_method.
Next I'll tackle the similar issue for psql, and probably add some tests for
both patches.

[1]: https://www.postgresql.org/message-id/flat/20190107235616.6lur25ph22u5u5av%40alap3.anarazel.de
Attachments
Hi,

On 2019-01-12 01:35:06 +0100, Dmitry Dolgov wrote:
> > On Sat, Dec 15, 2018 at 8:37 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > We need to dump the table access method at dump time, otherwise we lose
> > that information.
>
> As a result of the discussion in [1] (btw, thanks for starting it), here is
> proposed solution with tracking current default_table_access_method. Next I'll
> tackle similar issue for psql and probably add some tests for both patches.

Thanks!

> +/*
> + * Set the proper default_table_access_method value for the table.
> + */
> +static void
> +_selectTableAccessMethod(ArchiveHandle *AH, const char *tableam)
> +{
> +	PQExpBuffer cmd = createPQExpBuffer();
> +	const char *want, *have;
> +
> +	have = AH->currTableAm;
> +	want = tableam;
> +
> +	if (!want)
> +		return;
> +
> +	if (have && strcmp(want, have) == 0)
> +		return;
> +
> +	appendPQExpBuffer(cmd, "SET default_table_access_method = %s;", tableam);

This needs escaping, at the very least with "", but better with proper
routines for dealing with identifiers.

> @@ -5914,7 +5922,7 @@ getTables(Archive *fout, int *numTables)
>  			"tc.relfrozenxid AS tfrozenxid, "
>  			"tc.relminmxid AS tminmxid, "
>  			"c.relpersistence, c.relispopulated, "
> -			"c.relreplident, c.relpages, "
> +			"c.relreplident, c.relpages, am.amname AS amname, "

That AS doesn't do anything, does it?

> 	/* other fields were zeroed above */
>
> @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const char *name,
> 	 * post-data.
> 	 */
> 	ArchiveEntry(fout, nilCatalogId, createDumpId(),
> -				 tag->data, namespace, NULL, owner,
> +				 tag->data, namespace, NULL, owner, NULL,
> 				 "COMMENT", SECTION_NONE,
> 				 query->data, "", NULL,
> 				 &(dumpId), 1,

We really ought to move the arguments to a struct, so we don't generate
quite as much useless diff whenever we make a change around one of
these...

Greetings,

Andres Freund
> On Sat, Jan 12, 2019 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
>
> > +	appendPQExpBuffer(cmd, "SET default_table_access_method = %s;", tableam);
>
> This needs escaping, at the very least with "", but better with proper
> routines for dealing with identifiers.

Thanks for noticing, fixed.

> > @@ -5914,7 +5922,7 @@ getTables(Archive *fout, int *numTables)
> >  			"tc.relfrozenxid AS tfrozenxid, "
> >  			"tc.relminmxid AS tminmxid, "
> >  			"c.relpersistence, c.relispopulated, "
> > -			"c.relreplident, c.relpages, "
> > +			"c.relreplident, c.relpages, am.amname AS amname, "
>
> That AS doesn't do anything, does it?

Right, I renamed it a few times and forgot to get rid of it. Removed.

> > 	/* other fields were zeroed above */
> >
> > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const char *name,
> > 	 * post-data.
> > 	 */
> > 	ArchiveEntry(fout, nilCatalogId, createDumpId(),
> > -				 tag->data, namespace, NULL, owner,
> > +				 tag->data, namespace, NULL, owner, NULL,
> > 				 "COMMENT", SECTION_NONE,
> > 				 query->data, "", NULL,
> > 				 &(dumpId), 1,
>
> We really ought to move the arguments to a struct, so we don't generate
> quite as much useless diffs whenever we do a change around one of
> these...

That's what I thought too. Maybe then I'll suggest a mini-patch to the
master to refactor these arguments out into a separate struct, so we can
leverage it here.
Attachments
Thanks for the patch updates. A few comments so far from me:

+static void _selectTableAccessMethod(ArchiveHandle *AH, const char *tablespace);

tablespace => tableam

+_selectTableAccessMethod(ArchiveHandle *AH, const char *tableam)
+{
+	PQExpBuffer cmd = createPQExpBuffer();

createPQExpBuffer() should be moved after the below statement, so that it
does not leak memory:

	if (have && strcmp(want, have) == 0)
		return;

	char	   *tableam;	/* table access method, onlyt for TABLE tags */

Indentation is a bit misaligned. onlyt => only

@@ -2696,6 +2701,7 @@ ReadToc(ArchiveHandle *AH)
 		te->tablespace = ReadStr(AH);

 		te->owner = ReadStr(AH);
+		te->tableam = ReadStr(AH);

Above, I am not sure about this, but possibly we may require an
archive-version check, like how it is done for tablespace:

	if (AH->version >= K_VERS_1_10)
		te->tablespace = ReadStr(AH);

So how about bumping up the archive version and doing these checks?
Otherwise, if we run pg_restore using an old version, we may read some junk
into te->tableam, or possibly crash. As I said, I am not sure about this due
to my lack of a clear understanding of archive versioning, but let me know
if you indeed find this issue to be true.
> On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> createPQExpBuffer() should be moved after the below statement, so that
> it does not leak memory

Thanks for noticing, fixed.

> So how about bumping up the archive version and doing these checks ?

Yeah, you're right, I've added this check.
Attachments
On Tue, Dec 11, 2018 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
Further tasks I'm not yet planning to tackle, that I'd welcome help on:
- pg_upgrade testing
I did the pg_upgrade testing from older version with some tables and views
exists, and all of them are properly transformed into new server with heap
as the default access method.
I will add the dimitry pg_dump patch and test the pg_upgrade to confirm
the proper access method is retained on the upgraded database.
- I think we should consider removing HeapTuple->t_tableOid, it should
imo live entirely in the slot
I removed the t_tableOid from HeapTuple and during testing I found some
problems with triggers, will post the patch once it is fixed.
Regards,
Haribabu Kommi
Fujitsu Australia
Hi,

On 2019-01-15 18:02:38 +1100, Haribabu Kommi wrote:
> On Tue, Dec 11, 2018 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_upgrade testing
>
> I did the pg_upgrade testing from an older version with some tables and
> views present, and all of them were properly transformed into the new
> server with heap as the default access method.
>
> I will add Dmitry's pg_dump patch and test pg_upgrade to confirm that
> the proper access method is retained on the upgraded database.
>
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
>
> I removed the t_tableOid from HeapTuple and during testing I found some
> problems with triggers; I will post the patch once it is fixed.

Please note that I'm working on a heavily revised version of the patch
right now, trying to clean up a lot of things (you might have seen some
of the threads I started). I hope to post it ~Thursday. Local-ish
patches shouldn't be a problem though.

Greetings,

Andres Freund
On Sat, 12 Jan 2019 at 18:11, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Sat, Jan 12, 2019 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > 	/* other fields were zeroed above */
> > >
> > > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const char *name,
> > > 	 * post-data.
> > > 	 */
> > > 	ArchiveEntry(fout, nilCatalogId, createDumpId(),
> > > -				 tag->data, namespace, NULL, owner,
> > > +				 tag->data, namespace, NULL, owner, NULL,
> > > 				 "COMMENT", SECTION_NONE,
> > > 				 query->data, "", NULL,
> > > 				 &(dumpId), 1,
> >
> > We really ought to move the arguments to a struct, so we don't generate
> > quite as much useless diffs whenever we do a change around one of
> > these...
>
> That's what I thought too. Maybe then I'll suggest a mini-patch to the master to
> refactor these arguments out into a separate struct, so we can leverage it here.

Then for each of the calls, we would need to declare that structure
variable (with = {0}) and assign the required fields in that structure
before passing it to ArchiveEntry(). But a major reason for ArchiveEntry()
is to avoid doing exactly this and instead conveniently pass those fields
as parameters; the struct would cause unnecessarily more lines of code.

I think a better way is to have an ArchiveEntry() function with a limited
number of parameters, and an ArchiveEntryEx() with those extra parameters
which are not needed in the usual cases. E.g. we could make tablespace,
tableam, dumpFn and dumpArg the extra arguments of ArchiveEntryEx(),
because in most places these are passed as NULL. All future arguments
would go into ArchiveEntryEx(). Comments?

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Tue, 15 Jan 2019 at 12:27, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > createPQExpBuffer() should be moved after the below statement, so that
> > it does not leak memory
>
> Thanks for noticing, fixed.

Looks good.

> > So how about bumping up the archive version and doing these checks ?
>
> Yeah, you're right, I've added this check.

Need to bump K_VERS_MINOR as well.

On Mon, 14 Jan 2019 at 18:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> +static void _selectTableAccessMethod(ArchiveHandle *AH, const char *tablespace);
>
> tablespace => tableam

This is yet to be addressed.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
> On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Need to bump K_VERS_MINOR as well.

I've bumped it up, but somehow this change escaped the previous version.
Now it should be there, thanks!

> On Mon, 14 Jan 2019 at 18:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char *tablespace);
> > tablespace => tableam
>
> This is yet to be addressed.

Fixed. Also, I guess the other attached patch should address the psql part,
namely displaying a table access method with \d+ and the possibility to
hide it with a psql variable (HIDE_TABLEAM, but I'm open to suggestions
about the name).
Attachments
Hi,

On 2019-01-15 14:37:36 +0530, Amit Khandekar wrote:
> Then for each of the calls, we would need to declare that structure
> variable (with = {0}) and assign required fields in that structure
> before passing it to ArchiveEntry(). But a major reason of
> ArchiveEntry() is to avoid doing this and instead conveniently pass
> those fields as parameters. This will cause unnecessary more lines of
> code. I think better way is to have an ArchiveEntry() function with
> limited number of parameters, and have an ArchiveEntryEx() with those
> extra parameters which are not needed in usual cases.

I don't think that'll really solve the problem. I think it might be more
reasonable to rely on structs. Now that we can rely on designated
initializers for structs, we can do something like

    ArchiveEntry((ArchiveArgs){.tablespace = 3,
                               .dumpFn = somefunc,
                               ...});

and unused arguments will automatically be initialized to zero. Or we
could pass the struct as a pointer, which might be more efficient
(although I doubt it matters here):

    ArchiveEntry(&(ArchiveArgs){.tablespace = 3,
                                .dumpFn = somefunc,
                                ...});

What do others think? It'd probably be a good idea to start a new thread
about this.

Greetings,

Andres Freund
On Tue, 15 Jan 2019 at 17:58, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > Need to bump K_VERS_MINOR as well.
>
> I've bumped it up, but somehow this change escaped the previous version. Now
> should be there, thanks!
>
> > On Mon, 14 Jan 2019 at 18:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char *tablespace);
> > > tablespace => tableam
> >
> > This is yet to be addressed.
>
> Fixed.

Thanks, the patch looks good to me. Of course there's the other thread
about the ArchiveEntry arguments which may alter this patch, but otherwise
I have no more comments on this patch.

> Also I guess another attached patch should address the psql part, namely
> displaying a table access method with \d+ and possibility to hide it with a
> psql variable (HIDE_TABLEAM, but I'm open for suggestion about the name).

Will have a look at this one.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Fri, 18 Jan 2019 at 10:13, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Tue, 15 Jan 2019 at 17:58, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Also I guess another attached patch should address the psql part, namely
> > displaying a table access method with \d+ and possibility to hide it with a
> > psql variable (HIDE_TABLEAM, but I'm open for suggestion about the name).

I am ok with the name.

> Will have a look at this one.

--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -1,3 +1,4 @@
+\set HIDE_TABLEAM on
 CREATE TEMP TABLE x (

I thought we wanted to avoid having to add this setting in individual
regression tests. Can't we do this in pg_regress as a common setting?

+	/* Access method info */
+	if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
+		!(pset.hide_tableam && tableinfo.relam_is_default))
+	{
+		printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));

So this will make psql hide the access method if it's the same as the
default. I understand that this was kind of concluded in the other thread
"Displaying and dumping of table access methods". But IMHO, if
hide_tableam is false, we should *always* show the access method,
regardless of the default value. I mean, we can make it simple: off means
never show the table access method, on means always show it, regardless
of the default access method. And this will also work with the regression
tests. If some regression test specifically wants to output the access
method, it can have a "\set HIDE_TABLEAM off" command.

If we hide the method when it's the default, then for a regression test
that wants to forcibly show the table access method of all tables, it
won't show up for tables that have the default access method.

------------

+	if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&

If the server does not support relam, tableinfo.relam will be NULL anyway,
so I think the sversion check is not needed.
------------

+	printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));

fmtId is not required. In fact, we should display the access method name
as-is; fmtId is required only for identifiers present in SQL queries.

-----------

+	printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
+	printTableAddFooter(&cont, buf.data);
+	}
+
+
 }

The last two blank lines are not needed.

-----------

+	bool hide_tableam;
 } PsqlSettings;

These variables, it seems, are supposed to be grouped together by type.

-----------

I believe you are going to add a new regression testcase for the change?

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
> On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> --- a/src/test/regress/expected/copy2.out
> +++ b/src/test/regress/expected/copy2.out
> @@ -1,3 +1,4 @@
> +\set HIDE_TABLEAM on
>
> I thought we wanted to avoid having to add this setting in individual
> regression tests. Can't we do this in pg_regress as a common setting ?

Yeah, you're probably right. Actually, I couldn't find anything that looks
like "common settings", and so far I've placed it into psql_start_test as
a psql argument. But I'm not sure; maybe there is a better place.

> +	/* Access method info */
> +	if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
> +		!(pset.hide_tableam && tableinfo.relam_is_default))
> +	{
> +		printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
>
> So this will make psql hide the access method if it's same as the
> default. I understand that this was kind of concluded in the other
> thread "Displaying and dumping of table access methods". But IMHO, if
> the hide_tableam is false, we should *always* show the access method,
> regardless of the default value. I mean, we can make it simple: off
> means never show table-access, on means always show table-access,
> regardless of the default access method. And this also will work with
> regression tests. If some regression test wants specifically to output
> the access method, it can have a "\set HIDE_TABLEAM off" command.
>
> If we hide the method if it's default, then for a regression test that
> wants to forcibly show the table access method of all tables, it won't
> show up for tables that have default access method.

I can't imagine what kind of test would need to forcibly show the table
access method of all the tables. Even if you need to verify the tableam
for something, maybe it's even easier just to select it from pg_am?
> + if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
>
> If the server does not support relam, tableinfo.relam will be NULL
> anyway. So I think the sversion check is not needed.
> ------------
>
> + printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
>
> fmtId is not required.
> -----------
>
> + printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
> + printTableAddFooter(&cont, buf.data);
> + }
> +
> +
> }
>
> Last two blank lines are not needed.

Right, fixed.

> + bool hide_tableam;
> } PsqlSettings;
>
> These variables, it seems, are supposed to be grouped together by type.

Well, this grouping looks strange to me. But since I don't have a strong
opinion, I moved the variable.

> I believe you are going to add a new regression testcase for the change?

Yep.
Attachments
On Tue, Jan 15, 2019 at 6:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-01-15 18:02:38 +1100, Haribabu Kommi wrote:
> On Tue, Dec 11, 2018 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Hi,
> >
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_upgrade testing
> >
>
> I did pg_upgrade testing from an older version with some tables and views
> present, and all of them are properly transferred into the new server with
> heap as the default access method.
>
> I will add Dmitry's pg_dump patch and test pg_upgrade to confirm that
> the proper access method is retained on the upgraded database.
>
>
>
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> > imo live entirely in the slot
> >
>
> I removed the t_tableOid from HeapTuple and during testing I found some
> problems with triggers, will post the patch once it is fixed.
Please note that I'm working on a heavily revised version of the patch
right now, trying to clean up a lot of things (you might have seen some
of the threads I started). I hope to post it ~Thursday. Local-ish
patches shouldn't be a problem though.
I will rebase this patch once the revised code is available.
I am not able to completely remove t_tableOid from HeapTuple,
because of its use in triggers: the slot is not available in triggers,
so I need to store the tableOid as part of the tuple as well.
Currently, t_tableOid is set only when the tuple is formed
from the slot, and its use is otherwise replaced with the slot member.
Comments?
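The arrangement being described — keeping tableOid authoritatively in the slot and copying it into the tuple only when a HeapTuple is materialized — can be sketched with stand-in struct layouts (field names mirror the real ones, but the types here are minimal illustrations, not PostgreSQL's actual definitions):

```c
typedef unsigned int Oid;

/* stand-ins: only the fields relevant to this discussion */
typedef struct HeapTupleData
{
	Oid			t_tableOid;		/* still needed by trigger code */
} HeapTupleData;

typedef struct TupleTableSlot
{
	Oid			tts_tableOid;	/* authoritative copy lives in the slot */
	HeapTupleData tuple;
} TupleTableSlot;

/*
 * Only when a HeapTuple is formed from the slot (e.g. to hand to a trigger,
 * which has no slot) does t_tableOid get set from the slot's copy.
 */
static HeapTupleData *
slot_form_heap_tuple(TupleTableSlot *slot)
{
	slot->tuple.t_tableOid = slot->tts_tableOid;
	return &slot->tuple;
}
```

Code that runs with a slot in hand would read tts_tableOid directly; the copy into the tuple happens only at the slot-to-tuple boundary.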
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

(resending with compressed attachments, perhaps that'll go through)

On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > FWIW, now that oids are removed, and the tuple table slot abstraction
> > got in, I'm working on rebasing the pluggable storage patchset on top of
> > that.
>
> I've pushed a version of that to the git tree, including a rebased
> version of zheap:
> https://github.com/anarazel/postgres-pluggable-storage
> https://github.com/anarazel/postgres-pluggable-zheap

I've pushed the newest, substantially revised, version to the same
repository. Note that while the newest pluggable-zheap version is newer
than my last email, it's not based on the latest version, and
pluggable-zheap development is now happening in the main zheap repository.

> My next steps are:
> - make relation creation properly pluggable
> - remove the typedefs from tableam.h, instead move them into the
>   TableAmRoutine struct.
> - Move rs_{nblocks, startblock, numblocks} out of TableScanDescData
> - Move HeapScanDesc and IndexFetchHeapData out of relscan.h
> - remove ExecSlotCompare(), it's entirely unrelated to these changes imo
>   (and in the wrong place)

These are done.

> - split pluggable storage patchset, to commit earlier:
>   - EvalPlanQual slotification
>   - trigger slotification
>   - split of IndexBuildHeapScan out of index.c

The patchset is now pretty granularly split into individual pieces. There
are two commits that might be worthwhile to split up further:

1) The commit introducing table_beginscan et al currently also introduces
   index scans through tableam.
2) The commit introducing table_(insert|delete|update) also includes
   table_lock_tuple(), which in turn changes a bunch of EPQ related code.
   It's probably worthwhile to break that out.

I tried to make each individual commit make some sense, and pass all tests
on its own. That requires some changes that are then obsoleted in a later
commit, but it's not as much as I feared.

> - rename HeapUpdateFailureData et al to not reference Heap

I've not done that; I decided it's best to do that after all the work has
gone in.

> - See if the slot in SysScanDescData can be avoided, it's not exactly
>   free of overhead.

After reconsidering, I don't think it's worth doing so.

There are pretty substantial changes in this series, besides the things
mentioned above:

- I re-introduced parallel scan into pluggable storage, but added a set of
  helper functions to avoid having to duplicate the current block based
  logic from heap. That way it can be shared between most/all block based
  AMs.
- latestRemovedXid handling is moved into the table AM; that's required
  for correct replay on Hot Standby, where we do not know the AM of the
  current relation.
- the whole truncation and relation creation code has been overhauled
- the order of functions in tableam.h, heapam_handler.c etc has been made
  more sensible
- a number of callbacks have been obsoleted (relation_sync,
  relation_create_init_fork, scansetlimits)
- A bunch of prerequisite work has been merged
- (heap|relation)_(open|openrv|close) have been split into their own files
- To avoid having to care about the bulk-insert flags, code that uses a
  bulk-insert now unconditionally calls table_finish_bulk_insert(). The AM
  can then internally decide what it needs to do in case of e.g.
  HEAP_INSERT_SKIP_WAL. Zheap, for example, currently doesn't implement
  that (because UNDO handling is complicated), and this way it can just
  ignore the option, without needing call-site code for that.
- A *lot* of cleanups

Todo:
- merge psql / pg_dump support by Dmitry
- consider removing scan_update_snapshot
- consider removing table_gimmegimmeslot()
- add substantial docs for every callback
- consider revising the current table_lock_tuple() API, I'm not quite
  convinced that's right
- reconsider heap_fetch() API changes, causes unnecessary pain
- polish the split out trigger and EPQ changes, so they can be merged
  soon-ish

I plan to merge the first few commits pretty soon (as largely announced in
related threads).

While I saw an initial attempt at writing sgml docs for the table AM API,
I'm not convinced that's the best approach. I think it might make more
sense to have high-level docs in sgml, but then do all the per-callback
docs in tableam.h.

Greetings,

Andres Freund
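The table_finish_bulk_insert() arrangement described above — call sites invoke the callback unconditionally, and each AM decides internally whether anything needs doing — can be sketched with stand-in types (the flag value, struct layout, and return convention here are illustrative, not the patch's actual definitions):

```c
#include <stdbool.h>

#define HEAP_INSERT_SKIP_WAL 0x0002		/* stand-in flag value */

typedef struct TableAmRoutine
{
	/* returns true if extra end-of-copy work was done, e.g. a relation sync */
	bool		(*finish_bulk_insert) (int options);
} TableAmRoutine;

/* heap honours HEAP_INSERT_SKIP_WAL: if WAL was skipped, sync at the end */
static bool
heap_finish_bulk_insert(int options)
{
	return (options & HEAP_INSERT_SKIP_WAL) != 0;
}

/* a zheap-like AM can simply ignore the option */
static bool
zheap_finish_bulk_insert(int options)
{
	(void) options;				/* nothing to do */
	return false;
}

static const TableAmRoutine heap_routine = {
	.finish_bulk_insert = heap_finish_bulk_insert,
};
static const TableAmRoutine zheap_routine = {
	.finish_bulk_insert = zheap_finish_bulk_insert,
};

/* call sites invoke this unconditionally; the AM decides what to do */
static bool
table_finish_bulk_insert(const TableAmRoutine *am, int options)
{
	return am->finish_bulk_insert(options);
}
```

The point of the design is visible in the call site: COPY-like code never inspects the flags itself, so an AM that cannot (or need not) honour an option simply ignores it.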
Attachments
- v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch.gz
- v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch.gz
- v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch.gz
- v12-0004-Remove-superfluous-tqual.h-includes.patch.gz
- v12-0005-WIP-rename-tqual.c-to-heapam_visibility.c.patch.gz
- v12-0006-WIP-change-snapshot-type-to-enum-rather-than-cal.patch.gz
- v12-0007-Move-generic-snapshot-related-code-from-tqual.h-.patch.gz
- v12-0008-Move-heap-visibility-routines-to-heapam.h.patch.gz
- v12-0009-Rephrase-references-to-time-qualification.patch.gz
- v12-0010-WIP-Extend-tuples-when-getting-them-from-a-slot.patch.gz
- v12-0011-WIP-Move-page-initialization-from-RelationAddExt.patch.gz
- v12-0012-WIP-ForceStore-HeapTupleDatum.patch.gz
- v12-0013-slot-type-fixes.patch.gz
- v12-0014-Rename-RelationData.rd_amroutine-to-rd_indam.patch.gz
- v12-0015-Add-ExecStorePinnedBufferHeapTuple.patch.gz
- v12-0016-Store-HeapTupleData-in-Buffer-HeapTupleTableSlot.patch.gz
- v12-0017-Buffer-tuples-may-be-virtualized.patch.gz
- v12-0018-slot-tableoid-tid-support.patch.gz
- v12-0019-Set-tableoid-in-a-bunch-of-places.patch.gz
- v12-0020-WIP-Slotified-triggers.patch.gz
- v12-0021-WIP-Slotify-EPQ.patch.gz
- v12-0022-tableam-introduce-minimal-infrastructure.patch.gz
- v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch.gz
- v12-0024-tableam-Inquire-slot-type-from-AM-rather-than-ha.patch.gz
- v12-0025-tableam-introduce-slot-based-table-getnext-and-u.patch.gz
- v12-0026-tableam-Add-insert-delete-update-lock_tuple.patch.gz
- v12-0027-tableam-Add-fetch_row_version.patch.gz
- v12-0028-tableam-Add-use-tableam_fetch_follow_check.patch.gz
- v12-0029-tableam-Add-table_get_latest_tid.patch.gz
- v12-0030-tableam-multi_insert-and-slotify-COPY.patch.gz
- v12-0031-tableam-finish_bulk_insert.patch.gz
- v12-0032-tableam-slotify-CREATE-TABLE-AS-and-CREATE-MATER.patch.gz
- v12-0033-tableam-index-builds.patch.gz
- v12-0034-tableam-relation-creation-VACUUM-FULL-CLUSTER-SE.patch.gz
- v12-0035-tableam-VACUUM-and-ANALYZE.patch.gz
- v12-0036-tableam-planner-size-estimation.patch.gz
- v12-0037-tableam-Sample-Scan-Support.patch.gz
- v12-0038-tableam-bitmap-heap-scan.patch.gz
- v12-0039-tableam-remaining-stuff.patch.gz
- v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch.gz
- v12-0041-tableam-Add-function-to-determine-newest-xid-amo.patch.gz
- v12-0042-tableam-Fetch-tuples-for-triggers-EPQ-using-a-pr.patch.gz
On Sun, 20 Jan 2019 at 22:46, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > --- a/src/test/regress/expected/copy2.out
> > +++ b/src/test/regress/expected/copy2.out
> > @@ -1,3 +1,4 @@
> > +\set HIDE_TABLEAM on
> >
> > I thought we wanted to avoid having to add this setting in individual
> > regression tests. Can't we do this in pg_regress as a common setting?
>
> Yeah, you're probably right. Actually, I couldn't find anything that looks
> like "common settings", and so far I've placed it into psql_start_test as
> a psql argument. But I'm not sure, maybe there is a better place.

Yeah, psql_start_test() looks good to me. pg_regress does not seem to have
its own psqlrc file where we could have put this variable. Maybe later on,
if we want to have more such variables, we could devise this
infrastructure.

> > + /* Access method info */
> > + if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
> > + !(pset.hide_tableam && tableinfo.relam_is_default))
> > + {
> > + printfPQExpBuffer(&buf, _("Access method: %s"),
> > fmtId(tableinfo.relam));
> >
> > So this will make psql hide the access method if it's the same as the
> > default. I understand that this was kind of concluded in the other
> > thread "Displaying and dumping of table access methods". But IMHO, if
> > hide_tableam is false, we should *always* show the access method,
> > regardless of the default value. I mean, we can make it simple: off
> > means never show the table access method, on means always show it,
> > regardless of the default access method. And this will also work with
> > regression tests. If some regression test specifically wants to output
> > the access method, it can have a "\set HIDE_TABLEAM off" command.
> >
> > If we hide the method if it's the default, then for a regression test
> > that wants to forcibly show the table access method of all tables, it
> > won't show up for tables that have the default access method.
>
> I can't imagine what kind of test would need to forcibly show the table
> access method of all the tables. Even if you need to verify the tableam
> for something, maybe it's even easier just to select it from pg_am?

Actually my statement was wrong, sorry. For a regression test that wants to
forcibly show table access methods for all tables, it just needs to set
HIDE_TABLEAM to off. With your patch, if we set HIDE_TABLEAM to off, it
will *always* show the table access method, regardless of the default
access method. It is with HIDE_TABLEAM=on that your patch hides the table
access method conditionally (i.e. it shows it when the default value does
not match). It's in this case that I feel we should *unconditionally* hide
the table access method.

Regression tests that use \d+ to show the table details might not be
interested specifically in the table access method. But these will fail if
run with a modified default access method.

Besides, my general inclination is: keep the GUC behaviour simple; and
also, it looks like we can keep the regression test output consistent
without having to have this conditional behaviour.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
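The two HIDE_TABLEAM semantics being debated can be contrasted in a small self-contained sketch (the function names are stand-ins; only the boolean logic is taken from the quoted patch hunk and from Amit's counter-proposal):

```c
#include <stdbool.h>

/*
 * Semantics in the patch as posted: the "Access method:" footer is hidden
 * only when HIDE_TABLEAM is on AND the table uses the default AM.
 */
static bool
show_am_patch(bool hide_tableam, bool relam_is_default)
{
	return !(hide_tableam && relam_is_default);
}

/*
 * Simpler semantics argued for above: HIDE_TABLEAM alone decides,
 * regardless of whether the AM is the default.
 */
static bool
show_am_simple(bool hide_tableam, bool relam_is_default)
{
	(void) relam_is_default;	/* deliberately ignored */
	return !hide_tableam;
}
```

The two rules differ only when HIDE_TABLEAM is on and the table has a non-default AM: the posted patch still prints the footer there (which is what keeps regression output unstable under a changed default), while the simpler rule hides it.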
On Mon, Jan 21, 2019 at 1:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > FWIW, now that oids are removed, and the tuple table slot abstraction
> > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > that.
>
> I've pushed a version to that to the git tree, including a rebased
> version of zheap:
> https://github.com/anarazel/postgres-pluggable-storage
> https://github.com/anarazel/postgres-pluggable-zheap
I've pushed the newest, substantially revised, version to the same
repository. Note, that while the newest pluggable-zheap version is newer
than my last email, it's not based on the latest version, and the
pluggable-zheap development is now happening in the main zheap
repository.
Thanks for the new version of patches and changes.
Todo:
- consider removing scan_update_snapshot
Attached the patch for removal of scan_update_snapshot
and also the rebased patch of reduction in use of t_tableOid.
- consider removing table_gimmegimmeslot()
- add substantial docs for every callback
Will work on the above two.
While I saw an initial attempt at writing sgml docs for the table AM
API, I'm not convinced that's the best approach. I think it might make
more sense to have high-level docs in sgml, but then do all the
per-callback docs in tableam.h.
OK, I will update the sgml docs accordingly.
The index AM has per-callback docs in the sgml; should I refactor those as well?
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

Thanks!

On 2019-01-22 11:51:57 +1100, Haribabu Kommi wrote:
> Attached the patch for removal of scan_update_snapshot
> and also the rebased patch of reduction in use of t_tableOid.

I'll soon look at the latter.

> > - consider removing table_gimmegimmeslot()
> > - add substantial docs for every callback
>
> Will work on the above two.

I think it's easier if I do the first, because I can just do it while
rebasing, reducing unnecessary conflicts.

> > While I saw an initial attempt at writing sgml docs for the table AM
> > API, I'm not convinced that's the best approach. I think it might make
> > more sense to have high-level docs in sgml, but then do all the
> > per-callback docs in tableam.h.
>
> OK, I will update the sgml docs accordingly.
> Index AM has per-callback docs in the sgml, refactor them also?

I don't think it's a good idea to tackle the index docs at the same time -
this patchset is already humongously large...

> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index 62c5f9fa9f..3dc1444739 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2308,7 +2308,6 @@ static const TableAmRoutine heapam_methods = {
> .scan_begin = heap_beginscan,
> .scan_end = heap_endscan,
> .scan_rescan = heap_rescan,
> - .scan_update_snapshot = heap_update_snapshot,
> .scan_getnextslot = heap_getnextslot,
>
> .parallelscan_estimate = table_block_parallelscan_estimate,
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 59061c746b..b48ab5036c 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -954,5 +954,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
> node->pstate = pstate;
>
> snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
> - table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
> + Assert(IsMVCCSnapshot(snapshot));
> +
> + RegisterSnapshot(snapshot);
> + node->ss.ss_currentScanDesc->rs_snapshot = snapshot;
> + node->ss.ss_currentScanDesc->rs_temp_snap = true;
> }

I was rather thinking that we'd just move this logic into
table_scan_update_snapshot(), without it invoking a callback.

Greetings,

Andres Freund
On Tue, Jan 22, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
Thanks!
On 2019-01-22 11:51:57 +1100, Haribabu Kommi wrote:
> Attached the patch for removal of scan_update_snapshot
> and also the rebased patch of reduction in use of t_tableOid.
I'll soon look at the latter.
Thanks.
> > - consider removing table_gimmegimmeslot()
> > - add substantial docs for every callback
> >
>
> Will work on the above two.
I think it's easier if I do the first, because I can just do it while
rebasing, reducing unnecessary conflicts.
OK. I will work on the doc changes.
> > While I saw an initial attempt at writing sgml docs for the table AM
> > API, I'm not convinced that's the best approach. I think it might make
> > more sense to have high-level docs in sgml, but then do all the
> > per-callback docs in tableam.h.
> >
>
> OK, I will update the sgml docs accordingly.
> Index AM has per callback docs in the sgml, refactor them also?
I don't think it's a good idea to tackle the index docs at the same time
- this patchset is already humongously large...
OK.
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index 62c5f9fa9f..3dc1444739 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2308,7 +2308,6 @@ static const TableAmRoutine heapam_methods = {
> .scan_begin = heap_beginscan,
> .scan_end = heap_endscan,
> .scan_rescan = heap_rescan,
> - .scan_update_snapshot = heap_update_snapshot,
> .scan_getnextslot = heap_getnextslot,
>
> .parallelscan_estimate = table_block_parallelscan_estimate,
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 59061c746b..b48ab5036c 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -954,5 +954,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
> node->pstate = pstate;
>
> snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
> - table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
> + Assert(IsMVCCSnapshot(snapshot));
> +
> + RegisterSnapshot(snapshot);
> + node->ss.ss_currentScanDesc->rs_snapshot = snapshot;
> + node->ss.ss_currentScanDesc->rs_temp_snap = true;
> }
I was rather thinking that we'd just move this logic into
table_scan_update_snapshot(), without it invoking a callback.
OK. Changed accordingly.
But I moved the table_scan_update_snapshot() function into tableam.c,
to avoid including the additional header file snapmgr.h in tableam.h.
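The resulting helper, shaped after the hunk quoted earlier in the thread, is roughly the following; the types here are minimal stand-ins so the sketch is self-contained, not PostgreSQL's real definitions:

```c
#include <stdbool.h>

/* minimal stand-ins for the PostgreSQL structures involved */
typedef struct SnapshotData
{
	int			regcount;		/* stands in for snapshot registration */
} SnapshotData;
typedef SnapshotData *Snapshot;

typedef struct TableScanDescData
{
	Snapshot	rs_snapshot;
	bool		rs_temp_snap;
} TableScanDescData;
typedef TableScanDescData *TableScanDesc;

static void
RegisterSnapshot(Snapshot snapshot)		/* stand-in: pins the snapshot */
{
	snapshot->regcount++;
}

/*
 * A plain helper (now in tableam.c), no per-AM callback: register the
 * snapshot, point the scan at it, and mark it so endscan knows to
 * deregister it again.
 */
static void
table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
{
	RegisterSnapshot(snapshot);
	scan->rs_snapshot = snapshot;
	scan->rs_temp_snap = true;
}
```

Since the logic is purely about the common scan-descriptor fields, no AM-specific behaviour is involved, which is why a callback is unnecessary.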
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
> On Mon, Jan 21, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
>
> The patchset is now pretty granularly split into individual pieces.

Wow, thanks!

> On Mon, Jan 21, 2019 at 9:33 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Regression tests that use \d+ to show the table details might
> not be interested specifically in table access method. But these will
> fail if run with a modified default access method.

I see your point, but if a test is not interested specifically in a table
am, then I guess it wouldn't use a custom table am in the first place,
right? Anyway, I don't have a strong opinion here, so if everyone agrees
that HIDE_TABLEAM will show/hide the access method unconditionally, I'm
fine with that.
On Tue, 22 Jan 2019 at 15:29, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Jan 21, 2019 at 9:33 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > Regression tests that use \d+ to show the table details might
> > not be interested specifically in table access method. But these will
> > fail if run with a modified default access method.
>
> I see your point, but if a test is not interested specifically in a table
> am, then I guess it wouldn't use a custom table am in the first place,
> right?

Right, it wouldn't use a custom table am. But what I mean is, despite not
using a custom table am, the test would fail if the regression suite runs
with a changed default access method, because the regression output file
has only one particular am value in its output.

> Anyway, I don't have a strong opinion here, so if everyone agrees that
> HIDE_TABLEAM will show/hide the access method unconditionally, I'm fine
> with that.

Yeah, I agree it's subjective.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
> On Sun, Jan 20, 2019 at 6:17 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > I believe you are going to add a new regression testcase for the change?
>
> Yep.

So, here are these two patches for pg_dump/psql with a few regression tests.
Attachments
Hi,

Attached is a patch that adds some test scenarios for testing the
dependency of various object types on the table am. Besides simple tables,
it considers materialized views, partitioned tables, foreign tables, and
composite types, and verifies that the dependency is created only for those
object types that support a table access method.

This patch is based on commit 1bc7e6a4838 in
https://github.com/anarazel/postgres-pluggable-storage

Thanks
-Amit Khandekar
Attachments
On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
OK. I will work on the doc changes.
Sorry for the delay.
Attached a draft patch of doc and comments changes that I worked upon.
Currently I added comments to the callbacks that are present in the TableAmRoutine
structure and copied them into the docs. I am not sure whether that is a good approach or not.
I have yet to add a description of each parameter of the callbacks, for easier understanding.
Or should each callback be described in the docs, divided into the groups in which
they appear in the TableAmRoutine structure? Currently the following divisions
are available.
1. Table scan
2. Parallel table scan
3. Index scan
4. Manipulation of physical tuples
5. Non-modifying operations on individual tuples
6. DDL
7. Planner
8. Executor
Suggestions?
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Mon, 21 Jan 2019 at 08:31, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> (resending with compressed attachments, perhaps that'll go through)
>
> On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > got in, I'm working on rebasing the pluggable storage patchset on top
> > > of that.
> >
> > I've pushed a version of that to the git tree, including a rebased
> > version of zheap:
> > https://github.com/anarazel/postgres-pluggable-storage

I worked on a slight improvement to the
0040-WIP-Move-xid-horizon-computation-for-page-level patch. Instead of
pre-fetching all the required buffers beforehand, the attached WIP patch
pre-fetches the buffers keeping a constant distance ahead of the buffer
reads. It's a WIP patch because right now it just uses a hard-coded 5
buffers ahead. I haven't yet used effective_io_concurrency the way it is
done in nodeBitmapHeapscan.c; will do that next. But before that, any
comments on the way I did the improvements would be nice.

Note that for now, the patch is based on the latest pluggable-storage
commit; it does not replace the 0040 patch in the patch series.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments
On Wed, 6 Feb 2019 at 18:30, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 21 Jan 2019 at 08:31, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > (resending with compressed attachments, perhaps that'll go through)
> >
> > On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > > got in, I'm working on rebasing the pluggable storage patchset on top
> > > > of that.
> > >
> > > I've pushed a version of that to the git tree, including a rebased
> > > version of zheap:
> > > https://github.com/anarazel/postgres-pluggable-storage
>
> I worked on a slight improvement to the
> 0040-WIP-Move-xid-horizon-computation-for-page-level patch. Instead
> of pre-fetching all the required buffers beforehand, the attached WIP
> patch pre-fetches the buffers keeping a constant distance ahead of the
> buffer reads. It's a WIP patch because right now it just uses a
> hard-coded 5 buffers ahead. Haven't used effective_io_concurrency like
> how it is done in nodeBitmapHeapscan.c. Will do that next. But before
> that, any comments on the way I did the improvements would be nice.
>
> Note that for now, the patch is based on the pluggable-storage latest
> commit; it does not replace the 0040 patch in the patch series.

In the attached v1 patch, the prefetch_distance is calculated as
effective_io_concurrency + 10. It also has some cosmetic changes.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments
On Mon, Feb 4, 2019 at 2:31 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
OK. I will work on the doc changes.

Sorry for the delay.

Attached a draft patch of doc and comments changes that I worked upon.
Currently I added comments to the callbacks that are present in the
TableAmRoutine structure and I copied it into the docs. I am not sure
whether it is a good approach or not?
I am yet to add description for each parameter of the callbacks for easier
understanding.
Or, giving description of each callback in the docs with division of those
callbacks according to how they are divided in the TableAmRoutine
structure? Currently following divisions are available.
1. Table scan
2. Parallel table scan
3. Index scan
4. Manipulation of physical tuples
5. Non-modifying operations on individual tuples
6. DDL
7. Planner
8. Executor
Suggestions?
Here I attached the doc patches for pluggable storage. I divided the APIs into
the above specified groups and explained them in the docs. I can further add
more details if the approach seems fine.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
- 0008-Table-access-method-API-explanation.patch
- 0001-Docs-of-default_table_access_method-GUC.patch
- 0002-Rename-indexam.sgml-to-am.sgml.patch
- 0003-Reorganize-am-as-both-table-and-index.patch
- 0004-Doc-update-of-Create-access-method-type-table.patch
- 0005-Doc-update-of-create-materialized-view-.-USING-synta.patch
- 0006-Doc-update-of-CREATE-TABLE-.-USING-syntax.patch
- 0007-Doc-of-CREATE-TABLE-AS-.-USING-syntax.patch
On Tue, Nov 27, 2018 at 4:59 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi,
On 2018/11/02 9:17, Haribabu Kommi wrote:
> Here I attached the cumulative fixes of the patches, new API additions for
> zheap and
> basic outline of the documentation.
I've read the documentation patch while also looking at the code and here
are some comments.
Thanks for the review and apologies for the delay.
I have taken care of most of your comments in the latest version of the
doc patches.
+ <para>
+<programlisting>
+TupleTableSlotOps *
+slot_callbacks (Relation relation);
+</programlisting>
+ API to access the slot specific methods;
+ Following methods are available;
+ <structname>TTSOpsVirtual</structname>,
+ <structname>TTSOpsHeapTuple</structname>,
+ <structname>TTSOpsMinimalTuple</structname>,
+ <structname>TTSOpsBufferTuple</structname>,
+ </para>
Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
its relations to the TableAmRoutine abstraction, I think the text
description could better be written as:
"API to get the slot operations struct for a given table access method"
It's not clear to me why the various TTSOps* structs are listed here. Is the
point that different AMs may choose one of the listed alternatives? For
example, I see that heap AM implementation returns TTOpsBufferTuple, so it
manipulates slots containing buffered tuples, right? Other AMs are free
to return any one of these? For example, some AMs may never use buffer
manager and hence not use TTOpsBufferTuple. Is that understanding correct?
Yes, the AM can decide what type of slot it wants to use.
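The relationship being described can be pictured with a stripped-down, self-contained sketch; the struct contents are stand-ins (the real ops structs carry function pointers for materializing, copying, fetching attributes, etc.), only the shape of the dispatch matters here:

```c
/* stand-in: a real TupleTableSlotOps would hold slot-behaviour callbacks */
typedef struct TupleTableSlotOps
{
	const char *name;
} TupleTableSlotOps;

static const TupleTableSlotOps TTSOpsVirtual = { "virtual" };
static const TupleTableSlotOps TTSOpsBufferTuple = { "buffer tuple" };

typedef struct TableAmRoutine
{
	/* the AM reports which slot implementation callers should create */
	const TupleTableSlotOps *(*slot_callbacks) (void);
} TableAmRoutine;

/* heap keeps tuples in shared buffers, so it picks TTSOpsBufferTuple */
static const TupleTableSlotOps *
heapam_slot_callbacks(void)
{
	return &TTSOpsBufferTuple;
}

static const TableAmRoutine heapam_routine = {
	.slot_callbacks = heapam_slot_callbacks,
};
```

An AM that never touches the buffer manager could return &TTSOpsVirtual (or one of the other listed structs) instead; the executor just asks the routine and creates slots of whatever flavour comes back.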
Regards,
Haribabu Kommi
Fujitsu Australia
On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> In the attached v1 patch, the prefetch_distance is calculated as
> effective_io_concurrency + 10. Also it has some cosmetic changes.

I did a brief review of this patch and noticed the following things.

+} PrefetchState;

That name seems too generic.

+/*
+ * An arbitrary way to come up with a pre-fetch distance that grows with io
+ * concurrency, but is at least 10 and not more than the max effective io
+ * concurrency.
+ */

This comment is kinda useless, because it only tells us what the code does
(which is obvious anyway) and not why it does that. Saying that your
formula is arbitrary may not be the best way to attract support for it.

+ for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)

It looks strange to me that next_item is stored in prefetch_state and
nitems is passed around as an argument. Is there some reason why it's like
that?

+ /* prefetch a fixed number of pages beforehand. */

Not accurate -- the number of pages we prefetch isn't fixed. It depends on
effective_io_concurrency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 21 Feb 2019 at 04:17, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > In the attached v1 patch, the prefetch_distance is calculated as
> > effective_io_concurrency + 10. Also it has some cosmetic changes.
>
> I did a little brief review of this patch and noticed the following things.
>
> +} PrefetchState;
>
> That name seems too generic.

Ok, so something like XidHorizonPrefetchState? On similar lines, does the
prefetch_buffer() function name sound too generic as well?

> +/*
> + * An arbitrary way to come up with a pre-fetch distance that grows with io
> + * concurrency, but is at least 10 and not more than the max effective io
> + * concurrency.
> + */
>
> This comment is kinda useless, because it only tells us what the code
> does (which is obvious anyway) and not why it does that. Saying that
> your formula is arbitrary may not be the best way to attract support
> for it.

Well, I had checked the way the number of drive spindles
(effective_io_concurrency) is used to calculate the prefetch distance for
bitmap heap scans (ComputeIoConcurrency). Basically I think the intention
behind that method is to come up with a number that makes it highly likely
that we pre-fetch a block for each of the drive spindles. But I didn't get
how that exactly works, all the less for non-parallel bitmap scans. Same
is the case for the pre-fetching that we do here for the xid-horizon
stuff, where we do the block reads sequentially. Andres and I discussed
this offline, and he was of the opinion that this formula won't help here,
and that instead we should just keep a constant distance that is some
number greater than effective_io_concurrency. I agree that instead of
saying "arbitrary" we should explain why we have done that, and before
that, come up with an agreed-upon formula.

> + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)
>
> It looks strange to me that next_item is stored in prefetch_state and
> nitems is passed around as an argument. Is there some reason why it's
> like that?

We could keep the max count in the structure itself as well. There isn't
any specific reason for not keeping it there. It's just that this
function, prefetch_buffer(), is not a general function for maintaining a
prefetch state that spans across function calls; so we might as well just
pass the max count to that function instead of having another field in
that structure. I am not inclined specifically towards either of the
approaches.

> + /* prefetch a fixed number of pages beforehand. */
>
> Not accurate -- the number of pages we prefetch isn't fixed. It
> depends on effective_io_concurrency.

Yeah, will change that in the next patch version, according to what we
conclude about the prefetch distance calculation.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Feb 21, 2019 at 6:44 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Ok, so something like XidHorizonPrefetchState ? On similar lines, does > prefetch_buffer() function name sound too generic as well ? Yeah, that sounds good. And, yeah, then maybe rename the function too. > > +/* > > + * An arbitrary way to come up with a pre-fetch distance that grows with io > > + * concurrency, but is at least 10 and not more than the max effective io > > + * concurrency. > > + */ > > > > This comment is kinda useless, because it only tells us what the code > > does (which is obvious anyway) and not why it does that. Saying that > > your formula is arbitrary may not be the best way to attract support > > for it. > > Well, I had checked the way the number of drive spindles > (effective_io_concurrency) is used to calculate the prefetch distance > for bitmap heap scans (ComputeIoConcurrency). Basically I think the > intention behind that method is to come up with a number that makes it > highly likely that we pre-fetch a block of each of the drive spindles. > But I didn't get how that exactly works, all the less for non-parallel > bitmap scans. Same is the case for the pre-fetching that we do here > for xid-horizon stuff, where we do the block reads sequentially. Me > and Andres discussed this offline, and he was of the opinion that this > formula won't help here, and instead we just keep a constant distance > that is some number greater than effective_io_concurrency. I agree > that instead of saying "arbitrary" we should explain why we have done > that, and before that, come up with an agreed-upon formula. Maybe something like: We don't use the regular formula to determine how much to prefetch here, but instead just add a constant to effective_io_concurrency. 
That's because it seems best to do some prefetching here even when effective_io_concurrency is set to 0, but if the DBA thinks it's OK to do more prefetching for other operations, then it's probably OK to do more prefetching in this case, too. It may be that this formula is too simplistic, but at the moment we have no evidence of that or any idea about what would work better. > > + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++) > > > > It looks strange to me that next_item is stored in prefetch_state and > > nitems is passed around as an argument. Is there some reason why it's > > like that? > > We could keep the max count in the structure itself as well. There > isn't any specific reason for not keeping it there. It's just that > this function prefetch_state () is not a general function for > maintaining a prefetch state that spans across function calls; so we > might as well just pass the max count to that function instead of > having another field in that structure. I am not inclined specifically > towards either of the approaches. All right, count me as +0.5 for putting a copy in the structure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 21 Feb 2019 at 18:06, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Feb 21, 2019 at 6:44 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > Ok, so something like XidHorizonPrefetchState ? On similar lines, does > > prefetch_buffer() function name sound too generic as well ? > > Yeah, that sounds good. > And, yeah, then maybe rename the function too. Renamed the function to xid_horizon_prefetch_buffer(). > > > > +/* > > > + * An arbitrary way to come up with a pre-fetch distance that grows with io > > > + * concurrency, but is at least 10 and not more than the max effective io > > > + * concurrency. > > > + */ > > > > > > This comment is kinda useless, because it only tells us what the code > > > does (which is obvious anyway) and not why it does that. Saying that > > > your formula is arbitrary may not be the best way to attract support > > > for it. > > > > Well, I had checked the way the number of drive spindles > > (effective_io_concurrency) is used to calculate the prefetch distance > > for bitmap heap scans (ComputeIoConcurrency). Basically I think the > > intention behind that method is to come up with a number that makes it > > highly likely that we pre-fetch a block of each of the drive spindles. > > But I didn't get how that exactly works, all the less for non-parallel > > bitmap scans. Same is the case for the pre-fetching that we do here > > for xid-horizon stuff, where we do the block reads sequentially. Me > > and Andres discussed this offline, and he was of the opinion that this > > formula won't help here, and instead we just keep a constant distance > > that is some number greater than effective_io_concurrency. I agree > > that instead of saying "arbitrary" we should explain why we have done > > that, and before that, come up with an agreed-upon formula. > > Maybe something like: We don't use the regular formula to determine > how much to prefetch here, but instead just add a constant to > effective_io_concurrency. 
That's because it seems best to do some > prefetching here even when effective_io_concurrency is set to 0, but > if the DBA thinks it's OK to do more prefetching for other operations, > then it's probably OK to do more prefetching in this case, too. It > may be that this formula is too simplistic, but at the moment we have > no evidence of that or any idea about what would work better. Thanks for writing it down for me. I think this is good-to-go as a comment; so I put this as-is into the patch. > > > > + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++) > > > > > > It looks strange to me that next_item is stored in prefetch_state and > > > nitems is passed around as an argument. Is there some reason why it's > > > like that? > > > > We could keep the max count in the structure itself as well. There > > isn't any specific reason for not keeping it there. It's just that > > this function prefetch_state () is not a general function for > > maintaining a prefetch state that spans across function calls; so we > > might as well just pass the max count to that function instead of > > having another field in that structure. I am not inclined specifically > > towards either of the approaches. > > All right, count me as +0.5 for putting a copy in the structure. Have put the nitems into the structure. Thanks for the review. Attached v2. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachments
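The distance formula the thread converges on ("just add a constant to effective_io_concurrency") can be sketched as a standalone function. This is a hedged illustration, not the exact code from the patch: the function name, the cap constant and its value are invented stand-ins (the real patch bounds the distance by the maximum allowed effective_io_concurrency).

```c
#include <assert.h>

/* Illustrative cap; an invented stand-in for the real upper bound on
 * effective_io_concurrency. */
#define DEMO_MAX_IO_CONCURRENCY 1000

/*
 * Compute how far ahead to prefetch blocks for the xid-horizon work.
 * Per the discussion above, we don't use the spindle-based formula
 * (ComputeIoConcurrency); we just add a constant to
 * effective_io_concurrency, so that some prefetching happens even when
 * the GUC is set to 0.
 */
static int
xid_horizon_prefetch_distance(int effective_io_concurrency)
{
    int distance = effective_io_concurrency + 10;

    if (distance > DEMO_MAX_IO_CONCURRENCY)
        distance = DEMO_MAX_IO_CONCURRENCY;
    return distance;
}
```

With the GUC at its historical default of 1 this yields a distance of 11; with the GUC at 0 it still prefetches 10 blocks ahead, which is the point of the constant.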
On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Thanks for the review. Attached v2. Thanks. I took this, combined it with Andres's v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did some polishing of the code and comments, and pgindented. Here's what I ended up with; see what you think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Sat, 23 Feb 2019 at 01:22, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > Thanks for the review. Attached v2. > > Thanks. I took this, combined it with Andres's > v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did > some polishing of the code and comments, and pgindented. Here's what > I ended up with; see what you think. Thanks Robert ! The changes look good. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Hi, On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote: > I am not able to remove the complete t_tableOid from HeapTuple, > because of its use in triggers, as the slot is not available in triggers > and I need to store the tableOid also as part of the tuple. What precisely do you mean by "use in triggers"? You mean that a trigger might access a HeapTuple's t_tableOid directly, even though all of the information is available in the trigger context? Greetings, Andres Freund
On Wed, Feb 27, 2019 at 11:10 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> I am not able to remove the complete t_tableOid from HeapTuple,
> because of its use in triggers, as the slot is not available in triggers
> and I need to store the tableOid also as part of the tuple.
What precisely do you mean by "use in triggers"? You mean that a trigger
might access a HeapTuple's t_tableOid directly, even though all of the
information is available in the trigger context?
I forgot the exact scenario, but during trigger function execution, the
pl/pgsql function execution accesses the TableOidAttributeNumber from the stored
tuple using the heap_get* functions. Because of the lack of slot support in triggers,
we still need to maintain t_tableOid with the proper OID. The heap tuple's t_tableOid
member is updated whenever the heap tuple is generated from a slot.
Regards,
Haribabu Kommi
Fujitsu Australia
I haven't been following this thread closely, but I looked briefly at some of the patches posted here:

On 21/01/2019 11:01, Andres Freund wrote:
> The patchset is now pretty granularly split into individual pieces.

Wow, 42 patches, very granular indeed! That's nice for reviewing, but are you planning to squash them before committing? Seems a bit excessive for the git history.

Patches 1-4:
* v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
* v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
* v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
* v12-0004-Remove-superfluous-tqual.h-includes.patch

These look good to me. I think it would make sense to squash these together, and commit now.

Patches 20 and 21:
* v12-0020-WIP-Slotified-triggers.patch
* v12-0021-WIP-Slotify-EPQ.patch

I like this slotification of trigger and EPQ code. It seems like a nice thing to do, independently of the other patches. You said you wanted to polish that up to committable state, and commit separately: +1 on that. Perhaps do that even before patches 1-4.

> --- a/src/include/commands/trigger.h
> +++ b/src/include/commands/trigger.h
> @@ -35,8 +35,8 @@ typedef struct TriggerData
>     HeapTuple tg_trigtuple;
>     HeapTuple tg_newtuple;
>     Trigger *tg_trigger;
> -   Buffer tg_trigtuplebuf;
> -   Buffer tg_newtuplebuf;
> +   TupleTableSlot *tg_trigslot;
> +   TupleTableSlot *tg_newslot;
>     Tuplestorestate *tg_oldtable;
>     Tuplestorestate *tg_newtable;
> } TriggerData;

Do we still need tg_trigtuple and tg_newtuple? Can't we always use the corresponding slots instead? Is it for backwards-compatibility with user-defined triggers written in C? (Documentation also needs to be updated for the changes in this struct)

I didn't look at the rest of the patches yet...

- Heikki
Hi, On 2019-02-27 18:00:12 +0800, Heikki Linnakangas wrote: > I haven't been following this thread closely, but I looked briefly at some > of the patches posted here: Thanks! > On 21/01/2019 11:01, Andres Freund wrote: > > The patchset is now pretty granularly split into individual pieces. > > Wow, 42 patches, very granular indeed! That's nice for reviewing, but are > you planning to squash them before committing? Seems a bit excessive for the > git history. I've pushed a number of the preliminary patches since you replied. We're down to 23 in my local count... I do plan / did squash some, but not actually that many. I find that patches after a certain size are just too hard to do the necessary final polish to, especially if they do several independent things. Keeping things granular also allows to push incrementally, even when later patches aren't quite ready - imo pretty important for a project this size. > Patches 1-4: > > * v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch > * v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch > * v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch > * v12-0004-Remove-superfluous-tqual.h-includes.patch > > These look good to me. I think it would make sense to squash these together, > and commit now. I've pushed these a while ago. > Patches 20 and 21: > * v12-0020-WIP-Slotified-triggers.patch > * v12-0021-WIP-Slotify-EPQ.patch > > I like this slotification of trigger and EPQ code. It seems like a nice > thing to do, independently of the other patches. You said you wanted to > polish that up to committable state, and commit separately: +1 on > that. I pushed the trigger patch yesterday evening. Working to finalize the EPQ patch now - I've polished it a fair bit since the version posted on the list, but it still needs a bit more. Once the EPQ patch (and two other simple preliminary ones) is pushed, I plan to post a new rebased version to this thread. 
That's then really only the core table AM work.

> > --- a/src/include/commands/trigger.h
> > +++ b/src/include/commands/trigger.h
> > @@ -35,8 +35,8 @@ typedef struct TriggerData
> >     HeapTuple tg_trigtuple;
> >     HeapTuple tg_newtuple;
> >     Trigger *tg_trigger;
> > -   Buffer tg_trigtuplebuf;
> > -   Buffer tg_newtuplebuf;
> > +   TupleTableSlot *tg_trigslot;
> > +   TupleTableSlot *tg_newslot;
> >     Tuplestorestate *tg_oldtable;
> >     Tuplestorestate *tg_newtable;
> > } TriggerData;
>
> Do we still need tg_trigtuple and tg_newtuple? Can't we always use the
> corresponding slots instead? Is it for backwards-compatibility with
> user-defined triggers written in C?

Yes, the external trigger interface currently relies on those being there. I think we probably ought to revise that, at the very least because it'll otherwise be noticeably less efficient to have triggers on !heap tables, but also because it's just cleaner. But I feel like I don't want more significantly sized patches on my plate right now, so my current goal is to just put that on the todo after the pluggable storage work. Kinda wonder if we don't want to do that earlier in a release cycle too, so we can do other breaking changes to the trigger interface without breaking external code multiple times. There's probably also an argument for just not breaking the interface.

> (Documentation also needs to be updated for the changes in this
> struct)

Ah, nice catch, will do that next.

Greetings, Andres Freund
Hi,
While playing with the tableam, usage of which starts with commit v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch: should we check for a NULL function pointer before actually calling it, and ERROR out as NOT_SUPPORTED or something along those lines?
I understand it's the kind of thing that should get caught during development. But currently it still segfaults if some AM function is left undefined; it might be easier for iterative development to raise an error in a common place instead.
Or should there be an upfront check for NULL somewhere, if all the AM functions are mandatory and must not be left undefined?
Hi,

Thanks for looking!

On 2019-03-05 18:27:45 -0800, Ashwin Agrawal wrote:
> While playing with the tableam, usage of which starts with commit
> v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch: should we
> check for a NULL function pointer before actually calling it, and ERROR
> out as NOT_SUPPORTED or something along those lines?

Scans seem like an absolutely required part of the functionality, so I don't think there's much point in that. It'd just bloat code and runtime.

> I understand it's the kind of thing that should get caught during development.
> But currently it still segfaults if some AM function is left undefined;

The segfault itself doesn't bother me at all, it's just a NULL pointer dereference. If we were to put Asserts somewhere it'd crash very similarly. I think you have a point in that:

> it might be easier for iterative development to raise an error in a common place instead.

Would make it a tiny bit easier to implement a new AM. We could probably add a few asserts to GetTableAmRoutine(), to check that required functions are implemented. Don't think that'd make a meaningful difference for something like the scan functions, but it'd probably make it easier to forward port AMs to the next release - I'm pretty sure we're going to add required callbacks in the next few releases.

Greetings, Andres Freund
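Andres's suggestion - a few asserts in GetTableAmRoutine() so that a missing mandatory callback fails at AM-lookup time rather than as a segfault on first use - can be illustrated with a self-contained sketch. Every name here (DemoAmRoutine, demo_get_am_routine, the stub callbacks) is an invented stand-in, not the actual TableAmRoutine definition:

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for the real TableAmRoutine function-pointer table. */
typedef struct DemoAmRoutine
{
    void *(*scan_begin) (void);        /* required */
    void  (*scan_end) (void *scan);    /* required */
    void  (*scan_rescan) (void *scan); /* optional in this sketch */
} DemoAmRoutine;

/* Stub callbacks standing in for a concrete AM's implementations. */
static void *demo_scan_begin(void) { return NULL; }
static void demo_scan_end(void *scan) { (void) scan; }

/*
 * Validate the routine once, at lookup time: a missing required callback
 * trips an assertion here, instead of a NULL pointer dereference on the
 * first scan through the vtable.
 */
static DemoAmRoutine *
demo_get_am_routine(DemoAmRoutine *routine)
{
    assert(routine != NULL);
    assert(routine->scan_begin != NULL);
    assert(routine->scan_end != NULL);
    /* scan_rescan is deliberately unchecked: optional in this sketch. */
    return routine;
}
```

The design trade-off discussed above is visible here: the checks cost nothing per call (they run once at lookup), yet an AM author forgetting a required callback gets a clear assertion failure naming the lookup path rather than a crash deep inside a scan.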
Hi,

On 2019-02-27 09:29:31 -0800, Andres Freund wrote:
> Once the EPQ patch (and two other simple preliminary ones) is pushed, I
> plan to post a new rebased version to this thread. That's then really
> only the core table AM work.

That's now done. Here's my current submission of remaining patches. I've polished the first patch, adding DDL support, quite a bit; I'm planning to push that soon.

Changes:
- I've removed the ability to specify a table AM for partitioned tables, as discussed at [1]
- That happily shook out a number of bugs, where the partitioned table's AM was used when the leaf partition's AM should have been used (via the slot). In particular this necessitated refactoring the way slots are used for ON CONFLICT on partitioned tables. That's the new WIP patch in the series. But I think the result is actually clearer.
- I've integrated the pg_dump and psql patches, although I'd made HIDE_TABLEAM independent of whether \d+ is used on a table with the default AM or not.
- There's a good number of new tests in both the DDL and the pg_dump patch
- Lots of smaller cleanups

My next steps are:
- final polish & push the basic DDL and pg_dump patches
- cleanup & polish the ON CONFLICT refactoring
- cleanup & polish the patch adding the tableam based scan interface. That's by far the largest patch in the series. I might try to split it up further, but it's probably not worth it.
- improve documentation for the individual callbacks (integrating work done by others on this thread), in the header
- integrate the docs patch
- integrate the revised version of the xid horizon patch by Amit Khandekar (reviewed by Robert Haas)
- fix the naive implementation of slot based COPY, to not constantly drop/recreate slots upon partition change. I've hackplemented a better approach, which makes it faster than the current code in my testing.

Notes:
- I'm currently not targeting v13 with "tableam: Fetch tuples for triggers & EPQ using a proper snapshot.". 
While we need something like that for some AMs like zheap, I think it's a crap approach and needs more thought. Greetings, Andres Freund [1] https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
Attachments
- v15-0001-tableam-introduce-table-AM-infrastructure.patch.gz
- v15-0002-tableam-Add-pg_dump-support.patch.gz
- v15-0003-WIP-Use-per-partition-slots-for-ON-CONFLICT.patch.gz
- v15-0004-tableam-Introduce-and-use-begin-endscan-and-do-i.patch.gz
- v15-0005-tableam-Inquire-slot-type-from-AM-rather-than-ha.patch.gz
- v15-0006-tableam-introduce-slot-based-table-getnext-and-u.patch.gz
- v15-0007-tableam-Add-insert-delete-update-lock_tuple.patch.gz
- v15-0008-tableam-Add-fetch_row_version.patch.gz
- v15-0009-tableam-Add-use-tableam_fetch_follow_check.patch.gz
- v15-0010-tableam-Add-table_get_latest_tid.patch.gz
- v15-0011-tableam-multi_insert-and-slotify-COPY.patch.gz
- v15-0012-tableam-finish_bulk_insert.patch.gz
- v15-0013-tableam-slotify-CREATE-TABLE-AS-and-CREATE-MATER.patch.gz
- v15-0014-tableam-index-builds.patch.gz
- v15-0015-tableam-relation-creation-VACUUM-FULL-CLUSTER-SE.patch.gz
- v15-0016-tableam-VACUUM-and-ANALYZE.patch.gz
- v15-0017-tableam-planner-size-estimation.patch.gz
- v15-0018-tableam-Sample-Scan-Support.patch.gz
- v15-0019-tableam-bitmap-heap-scan.patch.gz
- v15-0020-tableam-remaining-stuff.patch.gz
- v15-0021-WIP-Move-xid-horizon-computation-for-page-level-.patch.gz
- v15-0022-tableam-Add-function-to-determine-newest-xid-amo.patch.gz
- v15-0023-tableam-Fetch-tuples-for-triggers-EPQ-using-a-pr.patch.gz
Hi, On 2019-03-05 23:07:21 -0800, Andres Freund wrote: > My next steps are: > - final polish & push the basic DDL and pg_dump patches Done and pushed. Some collation dependent fallout, I'm hoping I've just fixed that. > - cleanup & polish the ON CONFLICT refactoring Here's a cleaned up version of that patch. David, Alvaro, you also played in that area, any objections? I think this makes that part of the code easier to read actually. Robert, thanks for looking at that patch already. Greetings, Andres Freund
Attachments
On Thu, 7 Mar 2019 at 08:33, Andres Freund <andres@anarazel.de> wrote: > Here's a cleaned up version of that patch. David, Alvaro, you also > played in that area, any objections? I think this makes that part of the > code easier to read actually. Robert, thanks for looking at that patch > already. I only had a quick look and don't have a grasp of what the patch series is doing to tuple slots, but I didn't see anything I found alarming during the read. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2019-03-07 11:56:57 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 08:33, Andres Freund <andres@anarazel.de> wrote:
> > Here's a cleaned up version of that patch. David, Alvaro, you also
> > played in that area, any objections? I think this makes that part of the
> > code easier to read actually. Robert, thanks for looking at that patch
> > already.
>
> I only had a quick look and don't have a grasp of what the patch
> series is doing to tuple slots, but I didn't see anything I found
> alarming during the read.

Thanks for looking.

Re slots - the deal basically is that going forward low level operations, like fetching a row from a table etc, have to be done by a slot that's compatible with the "target" table. You can get compatible slot callbacks by calling table_slot_callbacks(), or directly create one by calling table_gimmegimmeslot() (likely to be renamed :)). The problem here was that the partition root's slot was used to fetch / store rows from a child partition. By moving mt_existing into ResultRelInfo that's not the case anymore.

- Andres
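The slot contract Andres describes - the callbacks must come from the target table's AM, so a child partition never reuses the partition root's slot - can be sketched standalone. Every name here (DemoRelation, demo_table_slot_create, the two ops tables) is invented; the sketch only mirrors the shape of table_slot_callbacks() and table_gimmegimmeslot():

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-ins for TupleTableSlotOps, TupleTableSlot and Relation. */
typedef struct DemoSlotOps { const char *name; } DemoSlotOps;
typedef struct DemoSlot { const DemoSlotOps *ops; } DemoSlot;
typedef struct DemoRelation { const DemoSlotOps *am_slot_ops; } DemoRelation;

static const DemoSlotOps demo_heap_ops = { "heap" };
static const DemoSlotOps demo_other_am_ops = { "some-other-am" };

/* Shape of table_slot_callbacks(): the relation's AM decides the ops;
 * the caller never picks them itself. */
static const DemoSlotOps *
demo_table_slot_callbacks(const DemoRelation *rel)
{
    return rel->am_slot_ops;
}

/* Shape of table_gimmegimmeslot(): build a slot compatible with rel. */
static DemoSlot *
demo_table_slot_create(const DemoRelation *rel)
{
    DemoSlot *slot = malloc(sizeof(DemoSlot));

    slot->ops = demo_table_slot_callbacks(rel);
    return slot;
}
```

The bug being fixed above is, in these terms, passing the partition root relation rather than the leaf partition into the creation step, yielding a slot with the wrong ops for the rows actually being fetched.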
On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote: > slot that's compatible with the "target" table. You can get compatible > slot callbacks by calling table_slot_callbacks(), or directly create one > by calling table_gimmegimmeslot() (likely to be renamed :)). Hmm. I assume the issue is that table_createslot() was already taken for another purpose, so then when you needed another callback you went with table_givemeslot(), and then when you needed a third API to do something in the same area the best thing available was table_gimmeslot(), which meant that the fourth API could only be table_gimmegimmeslot(). Does that sound about right? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-03-07 08:52:21 -0500, Robert Haas wrote: > On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote: > > slot that's compatible with the "target" table. You can get compatible > > slot callbacks by calling table_slot_callbacks(), or directly create one > > by calling table_gimmegimmeslot() (likely to be renamed :)). > > Hmm. I assume the issue is that table_createslot() was already taken > for another purpose, so then when you needed another callback you went > with table_givemeslot(), and then when you needed a third API to do > something in the same area the best thing available was > table_gimmeslot(), which meant that the fourth API could only be > table_gimmegimmeslot(). > > Does that sound about right? It was 3 AM, and I thought it was hilarious...
On Thu, Mar 7, 2019 at 12:49 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-03-07 08:52:21 -0500, Robert Haas wrote: > > On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote: > > > slot that's compatible with the "target" table. You can get compatible > > > slot callbacks by calling table_slot_callbacks(), or directly create one > > > by calling table_gimmegimmeslot() (likely to be renamed :)). > > > > Hmm. I assume the issue is that table_createslot() was already taken > > for another purpose, so then when you needed another callback you went > > with table_givemeslot(), and then when you needed a third API to do > > something in the same area the best thing available was > > table_gimmeslot(), which meant that the fourth API could only be > > table_gimmegimmeslot(). > > > > Does that sound about right? > > It was 3 AM, and I thought it was hilarious... It is. Just like me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes: > On 2019-03-07 08:52:21 -0500, Robert Haas wrote: >> On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote: >> > slot that's compatible with the "target" table. You can get compatible >> > slot callbacks by calling table_slot_callbacks(), or directly create one >> > by calling table_gimmegimmeslot() (likely to be renamed :)). >> >> Hmm. I assume the issue is that table_createslot() was already taken >> for another purpose, so then when you needed another callback you went >> with table_givemeslot(), and then when you needed a third API to do >> something in the same area the best thing available was >> table_gimmeslot(), which meant that the fourth API could only be >> table_gimmegimmeslot(). >> >> Does that sound about right? > > It was 3 AM, and I thought it was hilarious... ♫ Gimme! Gimme! Gimme! A slot after midnight ♫ - ilmari (SCNR) -- "I use RMS as a guide in the same way that a boat captain would use a lighthouse. It's good to know where it is, but you generally don't want to find yourself in the same spot." - Tollef Fog Heen
On Thu, Mar 7, 2019 at 6:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches
Done and pushed. Some collation dependent fallout, I'm hoping I've just
fixed that.
Thanks for the corrections that I missed, also for the extra changes.
Here I have attached the rebased patches that I shared earlier. I am adding
comments to explain the APIs in the code, and will share those patches later.
I observed a crash with the latest patch series in the COPY command.
I am not sure whether the problem is with the tableOid-reduction patch;
I will check it and correct the problem.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
- 0010-Table-access-method-API-explanation.patch
- 0001-Reduce-the-use-of-HeapTuple-t_tableOid.patch
- 0003-Docs-of-default_table_access_method-GUC.patch
- 0004-Rename-indexam.sgml-to-am.sgml.patch
- 0002-Removal-of-scan_update_snapshot-callback.patch
- 0005-Reorganize-am-as-both-table-and-index.patch
- 0006-Doc-update-of-Create-access-method-type-table.patch
- 0009-Doc-of-CREATE-TABLE-AS-.-USING-syntax.patch
- 0007-Doc-update-of-create-materialized-view-.-USING-synta.patch
- 0008-Doc-update-of-CREATE-TABLE-.-USING-syntax.patch
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches
> - cleanup & polish the ON CONFLICT refactoring

Those are pushed.

> - cleanup & polish the patch adding the tableam based scan
> interface. That's by far the largest patch in the series. I might try
> to split it up further, but it's probably not worth it.

I decided not to split it up further, and even merged two small commits into it. Subdividing it cleanly would have required making some changes just to undo them in a subsequent patch.

> - improve documentation for the individual callbacks (integrating
> work done by others on this thread), in the header

I've done that for the callbacks in the above commit.

Changes:
- I've added comments to all the callbacks in the first commit / the scan commit
- I've renamed table_gimmegimmeslot to table_slot_create
- I've made the callbacks and their wrappers more consistently named
- I've added asserts for necessary callbacks in the scan commit
- Lots of smaller cleanup
- Added a commit message

While 0001 is pretty bulky, the interesting bits concentrate on a comparatively small area. I'd appreciate it if somebody could give the comments added in tableam.h a read (both on callbacks, and their wrappers, as they have different audiences). It'd make sense to first read the commit message, to understand the goal (and I'd obviously also appreciate suggestions for improvements there as well). I'm pretty happy with the current state of the scan patch. I plan to do two more passes through it (formatting, comment polishing, etc. I don't know of any functional changes needed), and then commit it, unless somebody objects.

Greetings, Andres Freund
Attachments
- v18-0001-tableam-Add-and-use-scan-APIs.patch
- v18-0002-tableam-Only-allow-heap-in-a-number-of-contrib-m.patch
- v18-0003-tableam-Add-insert-delete-update-lock_tuple.patch
- v18-0004-tableam-Add-fetch_row_version.patch
- v18-0005-tableam-Add-use-tableam_fetch_follow_check.patch
- v18-0006-tableam-Add-table_get_latest_tid.patch
- v18-0007-tableam-multi_insert-and-slotify-COPY.patch
- v18-0008-tableam-finish_bulk_insert.patch
- v18-0009-tableam-slotify-CREATE-TABLE-AS-and-CREATE-MATER.patch
- v18-0010-tableam-index-builds.patch
- v18-0011-tableam-relation-creation-VACUUM-FULL-CLUSTER-SE.patch
- v18-0012-tableam-VACUUM-and-ANALYZE.patch
- v18-0013-tableam-planner-size-estimation.patch
- v18-0014-tableam-Sample-Scan-Support.patch
- v18-0015-tableam-bitmap-heap-scan.patch
- v18-0016-WIP-Move-xid-horizon-computation-for-page-level-.patch
- v18-0017-tableam-Add-function-to-determine-newest-xid-amo.patch
- v18-0018-tableam-Fetch-tuples-for-triggers-EPQ-using-a-pr.patch
Hi, On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote: > Here I attached the rebased patches that I shared earlier. I am adding the > comments to explain the API's in the code, will share those patches later. I've started to add those for the callbacks in the first commit. I'd appreciate a look! I think I'll include the docs patches, sans the callback documentation, in the next version. I'll probably merge them into one commit, if that's OK with you? > I observed a crash with the latest patch series in the COPY command. Hm, which version was this? I'd at some point accidentally posted a 'tmp' commit that was just a performance hack. Btw, your patches always are attached out of order: https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com 10, 1, 3, 4, 2 ... Greetings, Andres Freund
> On Sat, Mar 9, 2019 at 4:13 AM Andres Freund <andres@anarazel.de> wrote: > > While 0001 is pretty bulky, the interesting bits concentrate on a > comparatively small area. I'd appreciate if somebody could give the > comments added in tableam.h a read (both on callbacks, and their > wrappers, as they have different audiences). Potentially stupid question, but I'm curious about this one (couldn't find any discussion about it): +/* + * Generic descriptor for table scans. This is the base-class for table scans, + * which needs to be embedded in the scans of individual AMs. + */ +typedef struct TableScanDescData // ... bool rs_pageatatime; /* verify visibility page-at-a-time? */ bool rs_allow_strat; /* allow or disallow use of access strategy */ bool rs_allow_sync; /* allow or disallow use of syncscan */ + * allow_{strat, sync, pagemode} specify whether a scan strategy, + * synchronized scans, or page mode may be used (although not every AM + * will support those). // ... + TableScanDesc (*scan_begin) (Relation rel, The last commentary makes me think that those flags (allow_strat / allow_sync / pageatime) are more like AM specific, shouldn't they live in HeapScanDescData?
Hi,

On 2019-03-10 05:49:26 +0100, Dmitry Dolgov wrote:
> > On Sat, Mar 9, 2019 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > While 0001 is pretty bulky, the interesting bits concentrate on a
> > > comparatively small area. I'd appreciate if somebody could give the
> > > comments added in tableam.h a read (both on callbacks, and their
> > > wrappers, as they have different audiences).
>
> Potentially stupid question, but I'm curious about this one (couldn't find any
> discussion about it):

Not stupid...

> +/*
> + * Generic descriptor for table scans. This is the base-class for table scans,
> + * which needs to be embedded in the scans of individual AMs.
> + */
> +typedef struct TableScanDescData
> // ...
> 	bool rs_pageatatime; /* verify visibility page-at-a-time? */
> 	bool rs_allow_strat; /* allow or disallow use of access strategy */
> 	bool rs_allow_sync;  /* allow or disallow use of syncscan */
>
> + * allow_{strat, sync, pagemode} specify whether a scan strategy,
> + * synchronized scans, or page mode may be used (although not every AM
> + * will support those).
> // ...
> + TableScanDesc (*scan_begin) (Relation rel,
>
> The last commentary makes me think that those flags (allow_strat / allow_sync /
> pageatime) are more like AM specific, shouldn't they live in HeapScanDescData?

They're common enough across AMs, but more importantly calling code
currently specifies them in several places. As they're thus essentially
generic, rather than AM specific, I think it makes sense to have them in
the general scan struct.

Greetings,

Andres Freund
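[Editor's aside: the "base-class embedded in AM-specific scans" pattern being discussed can be sketched in miniature. The field names below mirror the quoted patch snippet, but the struct layouts are illustrative, not the real PostgreSQL definitions; the point is that the flags live in the generic struct because callers set them generically, even though not every AM honors them.]

```c
#include <assert.h>
#include <stdbool.h>

/* Generic descriptor: the "base class" embedded in every AM's scan. */
typedef struct TableScanDescData
{
    bool rs_allow_strat;  /* may use an access strategy */
    bool rs_allow_sync;   /* may use synchronized scans */
    bool rs_pageatatime;  /* may verify visibility page-at-a-time */
} TableScanDescData;

typedef TableScanDescData *TableScanDesc;

/* An AM-specific descriptor embeds the generic one as its first member... */
typedef struct HeapScanDescData
{
    TableScanDescData rs_scan;  /* base "class", must be first */
    int rs_cblock;              /* AM-private scan state */
} HeapScanDescData;

/* ...so generic executor code can operate on any AM's scan via a cast. */
static bool
scan_allows_sync(TableScanDesc scan)
{
    return scan->rs_allow_sync;
}
```

Generic code would then hand a `HeapScanDescData` around as a `TableScanDesc`, e.g. `scan_allows_sync((TableScanDesc) &hscan)`.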
On Sat, Mar 9, 2019 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> Here I attached the rebased patches that I shared earlier. I am adding the
> comments to explain the API's in the code, will share those patches later.
I've started to add those for the callbacks in the first commit. I'd
appreciate a look!
Thanks for the updated patches.
+ /* ------------------------------------------------------------------------
+ * Callbacks for hon-modifying operations on individual tuples
+ * ------------------------------------------------------------------------
Typo in tableam.h file. hon->non
I think I'll include the docs patches, sans the callback documentation,
in the next version. I'll probably merge them into one commit, if that's
OK with you?
OK.
For ease of review, I will still maintain 3 or 4 patches instead of the
current patch series.
> I observed a crash with the latest patch series in the COPY command.
Hm, which version was this? I'd at some point accidentally posted a
'tmp' commit that was just a performance hack
Yes, the version I checked did have that commit.
Maybe that is the reason for the failure.
Btw, your patches always are attached out of order:
https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
10, 1, 3, 4, 2 ...
Sorry about that.
I always wondered why they were ordered that way when I attached the patch files to the mail.
I thought it might be gmail behavior, but by experimenting I found that, when adding multiple
patches, the last selected patch is given preference and is listed as the first attachment.
I will take care that this problem doesn't happen again.
Regards,
Haribabu Kommi
Fujitsu Australia
Hi,

On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> Changes:
> - I've added comments to all the callbacks in the first commit / the
>   scan commit
> - I've renamed table_gimmegimmeslot to table_slot_create
> - I've made the callbacks and their wrappers more consistently named
> - I've added asserts for necessary callbacks in scan commit
> - Lots of smaller cleanup
> - Added a commit message
>
> While 0001 is pretty bulky, the interesting bits concentrate on a
> comparatively small area. I'd appreciate if somebody could give the
> comments added in tableam.h a read (both on callbacks, and their
> wrappers, as they have different audiences). It'd make sense to first
> read the commit message, to understand the goal (and I'd obviously also
> appreciate suggestions for improvements there as well).
>
> I'm pretty happy with the current state of the scan patch. I plan to do
> two more passes through it (formatting, comment polishing, etc. I don't
> know of any functional changes needed), and then commit it, lest
> somebody objects.

Here's a further polished version. Pretty boring changes:
- newlines
- put tableam.h into the correct place
- a few comment improvements, including typos
- changed reorderqueue_push() to accept the slot. That avoids an
  unnecessary heap_copytuple() in some cases

No meaningful changes in later patches.

Greetings,

Andres Freund
Attachments
- v19-0001-tableam-Add-and-use-scan-APIs.patch.gz
- v19-0002-tableam-Only-allow-heap-in-a-number-of-contrib-m.patch.gz
- v19-0003-tableam-Add-insert-delete-update-lock_tuple.patch.gz
- v19-0004-tableam-Add-fetch_row_version.patch.gz
- v19-0005-tableam-Add-use-tableam_fetch_follow_check.patch.gz
- v19-0006-tableam-Add-table_get_latest_tid.patch.gz
- v19-0007-tableam-multi_insert-and-slotify-COPY.patch.gz
- v19-0008-tableam-finish_bulk_insert.patch.gz
- v19-0009-tableam-slotify-CREATE-TABLE-AS-and-CREATE-MATER.patch.gz
- v19-0010-tableam-index-builds.patch.gz
- v19-0011-tableam-relation-creation-VACUUM-FULL-CLUSTER-SE.patch.gz
- v19-0012-tableam-VACUUM-and-ANALYZE.patch.gz
- v19-0013-tableam-planner-size-estimation.patch.gz
- v19-0014-tableam-Sample-Scan-Support.patch.gz
- v19-0015-tableam-bitmap-heap-scan.patch.gz
- v19-0016-WIP-Move-xid-horizon-computation-for-page-level-.patch.gz
- v19-0017-tableam-Add-function-to-determine-newest-xid-amo.patch.gz
- v19-0018-tableam-Fetch-tuples-for-triggers-EPQ-using-a-pr.patch.gz
On 2019-03-11 12:37:46 -0700, Andres Freund wrote:
> Hi,
>
> On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> > Changes:
> > - I've added comments to all the callbacks in the first commit / the
> >   scan commit
> > - I've renamed table_gimmegimmeslot to table_slot_create
> > - I've made the callbacks and their wrappers more consistently named
> > - I've added asserts for necessary callbacks in scan commit
> > - Lots of smaller cleanup
> > - Added a commit message
> >
> > While 0001 is pretty bulky, the interesting bits concentrate on a
> > comparatively small area. I'd appreciate if somebody could give the
> > comments added in tableam.h a read (both on callbacks, and their
> > wrappers, as they have different audiences). It'd make sense to first
> > read the commit message, to understand the goal (and I'd obviously also
> > appreciate suggestions for improvements there as well).
> >
> > I'm pretty happy with the current state of the scan patch. I plan to do
> > two more passes through it (formatting, comment polishing, etc. I don't
> > know of any functional changes needed), and then commit it, lest
> > somebody objects.
>
> Here's a further polished version. Pretty boring changes:
> - newlines
> - put tableam.h into the correct place
> - a few comment improvements, including typos
> - changed reorderqueue_push() to accept the slot. That avoids an
>   unnecessary heap_copytuple() in some cases
>
> No meaningful changes in later patches.

I pushed this. There's a failure on 32bit machines, unfortunately. The
problem comes from the ParallelTableScanDescData embedded in BTShared -
after the change the compiler can't see that that actually needs more
alignment, because ParallelTableScanDescData doesn't have any 8byte
members. That's a problem for just about all such "struct inheritance"
type tricks in postgres, but we normally just allocate them separately,
guaranteeing maxalign.
Given that we here already allocate enough space after the BTShared struct, it's probably easiest to just not embed the struct anymore. - Andres
On 2019-03-11 13:31:26 -0700, Andres Freund wrote:
> On 2019-03-11 12:37:46 -0700, Andres Freund wrote:
> > On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> > > Changes:
> > > - I've added comments to all the callbacks in the first commit / the
> > >   scan commit
> > > - I've renamed table_gimmegimmeslot to table_slot_create
> > > - I've made the callbacks and their wrappers more consistently named
> > > - I've added asserts for necessary callbacks in scan commit
> > > - Lots of smaller cleanup
> > > - Added a commit message
> > > [...]
> >
> > Here's a further polished version. Pretty boring changes:
> > - newlines
> > - put tableam.h into the correct place
> > - a few comment improvements, including typos
> > - changed reorderqueue_push() to accept the slot. That avoids an
> >   unnecessary heap_copytuple() in some cases
> >
> > No meaningful changes in later patches.
>
> I pushed this. There's a failure on 32bit machines, unfortunately. The
> problem comes from the ParallelTableScanDescData embedded in BTShared -
> after the change the compiler can't see that that actually needs more
> alignment, because ParallelTableScanDescData doesn't have any 8byte
> members. That's a problem for just about all such "struct inheritance"
> type tricks in postgres, but we normally just allocate them separately,
> guaranteeing maxalign. Given that we here already allocate enough space
> after the BTShared struct, it's probably easiest to just not embed the
> struct anymore.

I've pushed an attempt to fix this, which locally fixes 32bit builds.
It's copying the alignment logic for shm_toc_allocate, namely using
BUFFERALIGN for alignment. We should probably invent a more appropriate
define for this at some point...

Greetings,

Andres Freund
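[Editor's aside: the alignment hazard described here can be reproduced in isolation. The macro shapes below mirror postgres' TYPEALIGN/MAXALIGN/BUFFERALIGN from src/include/c.h, but the 32-byte buffer-alignment value is an assumption about the platform, and the descriptor struct is a stand-in, not the real ParallelTableScanDescData.]

```c
#include <assert.h>
#include <stdint.h>

/* Round LEN up to the next multiple of ALIGNVAL (a power of two). */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN)    TYPEALIGN(8, (LEN))
#define BUFFERALIGN(LEN) TYPEALIGN(32, (LEN))

/*
 * A descriptor with only 4-byte members: on a 32-bit target its size
 * and alignment can both be 4, so data laid out directly after it in a
 * shared chunk is not guaranteed 8-byte alignment.  Rounding the
 * trailing offset up with BUFFERALIGN (or MAXALIGN) restores the
 * guarantee, which is the shape of the committed fix.
 */
typedef struct FourByteOnlyDesc
{
    uint32_t phs_nblocks;
    uint32_t phs_startblock;
    uint32_t phs_cblock;
} FourByteOnlyDesc;
```

Here `sizeof(FourByteOnlyDesc)` is 12, so placing an 8-byte-aligned payload at that raw offset would misalign it; `BUFFERALIGN(12)` bumps the offset to 32.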
Hello.

I had a look at the patch set. I cannot see the thread structure
due to the depth and cannot get the whole picture of the patches,
but I have some comments. I apologize in advance for possible
duplicates with upthread.

0001-Reduce-the...

This doesn't apply to master.

> TupleTableSlot *
> ExecStoreHeapTuple(HeapTuple tuple,
>                    TupleTableSlot *slot,
> +                  Oid relid,
>                    bool shouldFree)

The comment for ExecStoreHeapTuple is missing a description of the
"relid" parameter.

> - if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, hscan->rs_cbuf))
> + if (HeapTupleSatisfiesVisibility(tuple, RelationGetRelid(hscan->rs_scan.rs_rd),
> +                                  &SnapshotDirty, hscan->rs_cbuf))

The second parameter seems to always be RelationGetRelid(...). Actually
only the relid is required, but isn't it better to take a Relation
instead of an Oid as the second parameter?

0005-Reorganize-am-as...

> + catalog. The contents of an table are entirely under the control of its

"of an table" => "of a table"

0006-Doc-update-of-Create-access..

> + be <type>index_am_handler</type> and for <literal>TABLE</literal>
> + access methods, it must be <type>table_am_handler</type>.
> + The C-level API that the handler function must implement varies
> + depending on the type of access method. The index access method API
> + is described in <xref linkend="index-access-methods"/> and the table access method
> + API is described in <xref linkend="table-access-methods"/>.

If table is the primary object, table-am should precede index-am?

0007-Doc-update-of-create-materi..

> + This clause specifies optional access method for the new materialize view;

"materialize view" => "materialized view"?

> + If this option is not specified, then the default table access method

I'm not sure the 'then' is needed.

> + is chosen for the new materialized view. see <xref linkend="guc-default-table-access-method"/>

"see" => "See"

0008-Doc-update-of-CREATE_TABLE..

> +[ USING <replaceable class="parameter">method</replaceable> ]

*I* prefer access_method to just method.

> + If this option is not specified, then the default table access method

Same as 0007: "I'm not sure the 'then' is needed."

> + is chosen for the new table. see <xref linkend="guc-default-table-access-method"/>

Same as 0007: "see" => "See"

0009-Doc-of-CREATE-TABLE-AS

Same as 0008.

0010-Table-access-method-API-

> + Any new <literal>TABLE ACCSESS METHOD</literal> developers can refer the exisitng <literal>HEAP</literal>
> + There are differnt type of API's that are defined and those details are below.

"differnt" => "different"

> + by the AM, in case if it support parallel scan.

"support" => "supports"

> + This API to return the total size that is required for the AM to perform

Total size of what? Shared memory chunk? Or parallel scan descriptor?

> + the parallel table scan. The minimum size that is required is
> + <structname>ParallelBlockTableScanDescData</structname>.

I don't get what "The minimum size" tells. Just reading this I would
always return the minimum size...

> + This API to perform the initialization of the <literal>parallel_scan</literal>
> + that is required for the parallel scan to be performed by the AM and also return
> + the total size that is required for the AM to perform the parallel table scan.

(Note: I'm not good at English..) Similar to the above. I cannot read
what the "size" is for. In the code it is used as:

> Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

(The variable name should be snapshot_offset.) It is the offset from
the beginning of the parallel scan descriptor, but it should be
described in some other way, which I'm not sure of..
Something like this?

> This API to initialize a parallel scan by the AM and also
> return the consumed size so far of parallel scan descriptor.

(Sorry for not finishing. Time's up today.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sat, Mar 9, 2019 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
While 0001 is pretty bulky, the interesting bits concentrate on a
comparatively small area. I'd appreciate if somebody could give the
comments added in tableam.h a read (both on callbacks, and their
wrappers, as they have different audiences). It'd make sense to first
read the commit message, to understand the goal (and I'd obviously also
appreciate suggestions for improvements there as well).
I'm pretty happy with the current state of the scan patch. I plan to do
two more passes through it (formatting, comment polishing, etc. I don't
know of any functional changes needed), and then commit it, lest
somebody objects.
I found a couple of typos in the committed patch; the attached patch fixes them.
I am not sure about one typo, please check it.
And I reviewed the 0002 patch, which is pretty simple and can be committed.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Tue, Mar 12, 2019 at 7:28 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.
I had a look at the patch set. I cannot see the thread structure
due to the depth and cannot get the whole picture of the patches,
but I have some comments. I apologize in advance for possible
duplicates with upthread.
Thanks for the review.
0001-Reduce-the...
This doesn't apply to master.
Yes, these patches don't apply to master.
They can only be applied on top of the code present in [1].
> TupleTableSlot *
> ExecStoreHeapTuple(HeapTuple tuple,
> TupleTableSlot *slot,
> + Oid relid,
> bool shouldFree)
The comment for ExecStoreHeapTuple is missing a description
of the "relid" parameter.
Corrected.
> - if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, hscan->rs_cbuf))
> + if (HeapTupleSatisfiesVisibility(tuple, RelationGetRelid(hscan->rs_scan.rs_rd),
> + &SnapshotDirty, hscan->rs_cbuf))
The second parameter seems to be always
RelationGetRelid(...). Actually only relid is required but isn't
it better to take Relation instead of Oid as the second
parameter?
Currently the passed relid is used only by the history MVCC
verification function. Passing the Relation pointer directly
will lessen the performance impact, as there is no need to
compute the relid.
Will update and share it.
0005-Reorganize-am-as...
> + catalog. The contents of an table are entirely under the control of its
"of an table" => "of a table"
Corrected.
0006-Doc-update-of-Create-access..
> + be <type>index_am_handler</type> and for <literal>TABLE</literal>
> + access methods, it must be <type>table_am_handler</type>.
> + The C-level API that the handler function must implement varies
> + depending on the type of access method. The index access method API
> + is described in <xref linkend="index-access-methods"/> and the table access method
> + API is described in <xref linkend="table-access-methods"/>.
If table is the primary object, table-am should precede index-am?
Changed.
0007-Doc-update-of-create-materi..
> + This clause specifies optional access method for the new materialize view;
"materialize view" => "materialized view"?
Corrected.
> + If this option is not specified, then the default table access method
I'm not sure the 'then' is needed.
> + is chosen for the new materialized view. see <xref linkend="guc-default-table-access-method"/>
"see" => "See"
0008-Doc-update-of-CREATE_TABLE..
> +[ USING <replaceable class="parameter">method</replaceable> ]
*I* prefer access_method to just method.
> + If this option is not specified, then the default table access method
Same to 0007. "I'm not sure the 'then' is needed.".
> + is chosen for the new table. see <xref linkend="guc-default-table-access-method"/>
Same to 0007. " "see" => "See" "
"
0009-Doc-of-CREATE-TABLE-AS
Same as 0008.
Corrected as per your suggestions.
0010-Table-access-method-API-
> + Any new <literal>TABLE ACCSESS METHOD</literal> developers can refer the exisitng <literal>HEAP</literal>
> + There are differnt type of API's that are defined and those details are below.
"differnt" => "different"
> + by the AM, in case if it support parallel scan.
"support" => "supports"
Corrected above both.
> + This API to return the total size that is required for the AM to perform
Total size of what? Shared memory chunk? Or parallel scan descriptor?
It returns the required parallel scan descriptor memory size.
> + the parallel table scan. The minimum size that is required is
> + <structname>ParallelBlockTableScanDescData</structname>.
I don't get what "The minimum size" tells. Just reading
this I would always return the minimum size...
> + This API to perform the initialization of the <literal>parallel_scan</literal>
> + that is required for the parallel scan to be performed by the AM and also return
> + the total size that is required for the AM to perform the parallel table scan.
(Note: I'm not good at English..) Similar to the above. I cannot
read what the "size" is for.
In the code it is used as:
> Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
(The variable name should be snapshot_offset.) It is the offset
from the beginning of the parallel scan descriptor, but it should be
described in some other way, which I'm not sure of..
Something like this?
> This API to initialize a parallel scan by the AM and also
> return the consumed size so far of parallel scan descriptor.
I updated the docs around those API's to make them easier to understand.
Can you please check whether that helps?
Updated patches are attached.
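[Editor's aside: the offset-returning contract being documented here can be modeled in miniature. All names below are hypothetical; only the shape of the contract follows the discussion: the AM initializes its part of a shared parallel-scan chunk and returns the offset at which the caller may place its own data, here the serialized snapshot.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy descriptor standing in for the AM's parallel scan state. */
typedef struct ToyParallelScanDesc
{
    int phs_nworkers;   /* generic bookkeeping */
    int phs_nextblock;  /* AM-private scan cursor */
} ToyParallelScanDesc;

/*
 * Initialize the AM's portion of the shared chunk and report how much
 * of it the AM consumed: the caller owns everything past that offset.
 */
static size_t
toy_parallelscan_initialize(void *pscan)
{
    ToyParallelScanDesc *scan = (ToyParallelScanDesc *) pscan;

    scan->phs_nworkers = 0;
    scan->phs_nextblock = 0;

    /* the caller stores its snapshot starting at this offset */
    return sizeof(ToyParallelScanDesc);
}
```

A caller would use it as `size_t snapshot_offset = toy_parallelscan_initialize(shared);` followed by copying the snapshot to `(char *) shared + snapshot_offset`, which is why the returned value is an offset rather than a total size.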
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On Sat, Mar 16, 2019 at 5:43 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Sat, Mar 9, 2019 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
While 0001 is pretty bulky, the interesting bits concentrate on a
comparatively small area. I'd appreciate if somebody could give the
comments added in tableam.h a read (both on callbacks, and their
wrappers, as they have different audiences). It'd make sense to first
read the commit message, to understand the goal (and I'd obviously also
appreciate suggestions for improvements there as well).
I'm pretty happy with the current state of the scan patch. I plan to do
two more passes through it (formatting, comment polishing, etc. I don't
know of any functional changes needed), and then commit it, lest
somebody objects.

I found a couple of typos in the committed patch; the attached patch fixes them.
I am not sure about one typo, please check it.
And I reviewed the 0002 patch, which is pretty simple and can be committed.
As you are modifying the 0003 patch for the modify API's, I went and
reviewed the existing patch and found a couple of corrections that are
needed, in case you have not already taken care of them.
+ /* Update the tuple with table oid */
+ slot->tts_tableOid = RelationGetRelid(relation);
+ if (slot->tts_tableOid != InvalidOid)
+ tuple->t_tableOid = slot->tts_tableOid;
The setting of slot->tts_tableOid is not required in this function;
the check happens right after the set. The above code is present in both
heapam_heap_insert and heapam_heap_insert_speculative.
+ slot->tts_tableOid = RelationGetRelid(relation);
In heapam_heap_update, I don't think there is a need to update
slot->tts_tableOid.
+ default:
+ elog(ERROR, "unrecognized heap_update status: %u", result);
heap_update --> table_update?
+ default:
+ elog(ERROR, "unrecognized heap_delete status: %u", result);
same as above?
+ /*hari FIXME*/
+ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
Removing the commented codes in both ExecDelete and ExecUpdate functions.
+ /**/
+ if (epqreturnslot)
+ {
+ *epqreturnslot = epqslot;
+ return NULL;
+ }
comment update missed?
Regards,
Haribabu Kommi
Fujitsu Australia
Hi,
The psql \dA command currently doesn't show the type of access methods
of type 'Table'.
postgres=# \dA heap
List of access methods
Name | Type
------+-------
heap |
(1 row)
Attached a simple patch that fixes the problem and outputs as follows.
postgres=# \dA heap
List of access methods
Name | Type
------+-------
heap | Table
(1 row)
The attached patch directly modifies the query that is sent to the server.
Servers before version 12 don't have access methods of type 'Table', but the
same query still works there because of the added CASE, as follows.
SELECT amname AS "Name",
CASE amtype WHEN 'i' THEN 'Index' WHEN 't' THEN 'Table' END AS "Type"
FROM pg_catalog.pg_am ...
Does anyone feel that it requires a separate query for servers < 12?
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

Attached is a version of just the first patch. I'm still updating it,
but it's getting closer to commit:

- There were no tests testing EPQ interactions with DELETE, and only an
  accidental test for EPQ in UPDATE with a concurrent DELETE. I've added
  tests. Plan to commit that ahead of the big change.
- I was pretty unhappy about how the EPQ integration looked before, I've
  changed that now.

  I still wonder if we should restore EvalPlanQualFetch and move the
  table_lock_tuple() calls in ExecDelete/Update into it. But it seems
  like it'd not gain that much, because there's custom surrounding code,
  and it's not that much code.
- I changed heapam_tuple_lock to return *WouldBlock rather than just
  the last result. I think that's one of the reasons Haribabu had
  neutered a few asserts.
- I moved comments from heapam.h to tableam.h where appropriate.
- I updated the name of HeapUpdateFailureData to TM_FailureData, and
  HTSU_Result to TM_Result; TM_Result's members now properly distinguish
  between update vs modifications (delete & update).
- I separated the HEAP_INSERT_ flags into TABLE_INSERT_* and
  HEAP_INSERT_*, with the latter being a copy of TABLE_INSERT_* with the
  sole addition of _SPECULATIVE. table_insert_speculative callers now
  don't specify that anymore.

Pending work:
- Wondering if table_insert/delete/update should rather be
  table_tuple_insert etc. Would be a bit more consistent with the
  callback names, but a bigger departure from existing code.
- I'm not yet happy with the TableTupleDeleted computation in heapam.c, I
  want to revise that further
- formatting
- commit message
- a few comments need a bit of polishing (ExecCheckTIDVisible,
  heapam_tuple_lock)
- Rename TableTupleMayBeModified to TableTupleOk, but also probably a
  s/TableTuple/TableMod/
- I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h
- two more passes through the patch

On 2019-03-21 15:07:04 +1100, Haribabu Kommi wrote:
> As you are modifying the 0003 patch for the modify API's, I went and
> reviewed the existing patch and found a couple of corrections that are
> needed, in case you have not already taken care of them.

Some I have...

> + /* Update the tuple with table oid */
> + slot->tts_tableOid = RelationGetRelid(relation);
> + if (slot->tts_tableOid != InvalidOid)
> +     tuple->t_tableOid = slot->tts_tableOid;
>
> The setting of slot->tts_tableOid is not required in this function;
> the check happens right after the set. The above code is present in both
> heapam_heap_insert and heapam_heap_insert_speculative.

I'm not following? Those functions are independent?

> + slot->tts_tableOid = RelationGetRelid(relation);
>
> In heapam_heap_update, I don't think there is a need to update
> slot->tts_tableOid.

Why?

> + default:
> +     elog(ERROR, "unrecognized heap_update status: %u", result);
>
> heap_update --> table_update?
>
> + default:
> +     elog(ERROR, "unrecognized heap_delete status: %u", result);
>
> same as above?

I've fixed that in a number of places.

> + /*hari FIXME*/
> + /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
>
> Removing the commented codes in both ExecDelete and ExecUpdate functions.

I don't think that's the right fix. I've refactored that code
significantly now, and restored the assert in what is, imo, a correct
version.

> + /**/
> + if (epqreturnslot)
> + {
> +     *epqreturnslot = epqslot;
> +     return NULL;
> + }
>
> comment update missed?

Well, you'd deleted a comment around there ;). I've added something back
now...

Greetings,

Andres Freund
Attachments
On Fri, Mar 22, 2019 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Attached is a version of just the first patch. I'm still updating it,
but it's getting closer to commit:
- There were no tests testing EPQ interactions with DELETE, and only an
accidental test for EPQ in UPDATE with a concurrent DELETE. I've added
tests. Plan to commit that ahead of the big change.
- I was pretty unhappy about how the EPQ integration looked before, I've
changed that now.
I still wonder if we should restore EvalPlanQualFetch and move the
table_lock_tuple() calls in ExecDelete/Update into it. But it seems
like it'd not gain that much, because there's custom surrounding code,
and it's not that much code.
- I changed heapam_tuple_lock to return *WouldBlock rather than just
the last result. I think that's one of the reason Haribabu had
neutered a few asserts.
- I moved comments from heapam.h to tableam.h where appropriate
- I updated the name of HeapUpdateFailureData to TM_FailureData,
HTSU_Result to TM_Result, TM_Results members now properly distinguish
between update vs modifications (delete & update).
- I separated the HEAP_INSERT_ flags into TABLE_INSERT_* and
HEAP_INSERT_ with the latter being a copy of TABLE_INSERT_ with the
sole addition of _SPECULATIVE. table_insert_speculative callers now
don't specify that anymore.
Pending work:
- Wondering if table_insert/delete/update should rather be
table_tuple_insert etc. Would be a bit more consistent with the
callback names, but a bigger departure from existing code.
- I'm not yet happy with TableTupleDeleted computation in heapam.c, I
want to revise that further
- formatting
- commit message
- a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)
- Rename TableTupleMayBeModified to TableTupleOk, but also probably a s/TableTuple/TableMod/
- I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h
- two more passes through the patch
Thanks for the corrections.
On 2019-03-21 15:07:04 +1100, Haribabu Kommi wrote:
> As you are modifying the 0003 patch for modify API's, I went and reviewed
> the
> existing patch and found couple corrections that are needed, in case if you
> are not
> taken care of them already.
Some I have...
> + /* Update the tuple with table oid */
> + slot->tts_tableOid = RelationGetRelid(relation);
> + if (slot->tts_tableOid != InvalidOid)
> + tuple->t_tableOid = slot->tts_tableOid;
>
> The setting of slot->tts_tableOid is not required in this function,
> After set the check is happening, the above code is present in both
> heapam_heap_insert and heapam_heap_insert_speculative.
I'm not following? Those functions are independent?
In those functions, slot->tts_tableOid is set and then, in the very next
statement, checked for validity. Callers of table_insert should have
already set it, so is setting the value and checking it on the next line
required? The value cannot be InvalidOid.
> + slot->tts_tableOid = RelationGetRelid(relation);
>
> In heapam_heap_update, i don't think there is a need to update
> slot->tts_tableOid.
Why?
slot->tts_tableOid should have been updated before the call to
heap_update; is setting it again after heap_update required?
I also observed slot->tts_tableOid being set after the table_insert_XXX
calls in the ExecInsert function.
Is this to make sure that the AM hasn't modified that value?
> + default:
> + elog(ERROR, "unrecognized heap_update status: %u", result);
>
> heap_update --> table_update?
>
>
> + default:
> + elog(ERROR, "unrecognized heap_delete status: %u", result);
>
> same as above?
I've fixed that in a number of places.
> + /*hari FIXME*/
> + /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
>
> Removing the commented codes in both ExecDelete and ExecUpdate functions.
I don't think that's the right fix. I've refactored that code
significantly now, and restored the assert in a, imo, correct version.
OK.
> + /**/
> + if (epqreturnslot)
> + {
> + *epqreturnslot = epqslot;
> + return NULL;
> + }
>
> comment update missed?
Well, you'd deleted a comment around there ;). I've added something back
now...
This is not the only problem I could have introduced; all the comments
listed were introduced by me ;).
Regards,
Haribabu Kommi
Fujitsu Australia
Hi,

On 2019-03-21 11:15:57 -0700, Andres Freund wrote:
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.

I've left this as is.

> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further

I changed that. Found a bunch of untested paths, I've pushed tests for
those already.

> - formatting

Done that.

> - commit message

Done that.

> - a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)

Done that.

> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a s/TableTuple/TableMod/

It's now TM_*.

/*
 * Result codes for table_{update,delete,lock}_tuple, and for visibility
 * routines inside table AMs.
 */
typedef enum TM_Result
{
	/*
	 * Signals that the action succeeded (i.e. update/delete performed, lock
	 * was acquired)
	 */
	TM_Ok,

	/* The affected tuple wasn't visible to the relevant snapshot */
	TM_Invisible,

	/* The affected tuple was already modified by the calling backend */
	TM_SelfModified,

	/*
	 * The affected tuple was updated by another transaction. This includes
	 * the case where tuple was moved to another partition.
	 */
	TM_Updated,

	/* The affected tuple was deleted by another transaction */
	TM_Deleted,

	/*
	 * The affected tuple is currently being modified by another session. This
	 * will only be returned if (update/delete/lock)_tuple are instructed not
	 * to wait.
	 */
	TM_BeingModified,

	/* lock couldn't be acquired, action skipped. Only used by lock_tuple */
	TM_WouldBlock
} TM_Result;

> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

Done.

> - two more passes through the patch

One of them completed. Which is good, because there was a subtle bug in
heapam_tuple_lock (*tid was adjusted to be the followup tuple after the
heap_fetch(), before going to heap_lock_tuple - but that's wrong, it
should only be adjusted when heap_fetch()ing the next version).

Greetings,

Andres Freund
Hi,

(sorry, I somehow miskeyed, and sent a partial version of this email before it was ready)

On 2019-03-21 11:15:57 -0700, Andres Freund wrote:
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.

I've left this as is.

> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further

I changed that. Found a bunch of untested paths, I've pushed tests for those already.

> - formatting

Done that.

> - commit message

Done that.

> - a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)

Done that.

> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a
>   s/TableTuple/TableMod/

It's now TM_*:

/*
 * Result codes for table_{update,delete,lock}_tuple, and for visibility
 * routines inside table AMs.
 */
typedef enum TM_Result
{
	/*
	 * Signals that the action succeeded (i.e. update/delete performed, lock
	 * was acquired)
	 */
	TM_Ok,

	/* The affected tuple wasn't visible to the relevant snapshot */
	TM_Invisible,

	/* The affected tuple was already modified by the calling backend */
	TM_SelfModified,

	/*
	 * The affected tuple was updated by another transaction. This includes
	 * the case where tuple was moved to another partition.
	 */
	TM_Updated,

	/* The affected tuple was deleted by another transaction */
	TM_Deleted,

	/*
	 * The affected tuple is currently being modified by another session. This
	 * will only be returned if (update/delete/lock)_tuple are instructed not
	 * to wait.
	 */
	TM_BeingModified,

	/* lock couldn't be acquired, action skipped. Only used by lock_tuple */
	TM_WouldBlock
} TM_Result;

> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

Done.

> - two more passes through the patch

One of them completed. Which is good, because there was a subtle bug in heapam_tuple_lock (*tid was adjusted to be the followup tuple after the heap_fetch(), before going to heap_lock_tuple - but that's wrong, it should only be adjusted when heap_fetch()ing the next version).

I'm pretty happy with that last version (of the first patch). I'm planning to do one more pass, and then push. There are no meaningful changes to later patches in the series besides followup changes required from changes in the first patch.

Greetings,

Andres Freund
Attachments
- v21-0001-tableam-Add-tuple_-insert-delete-update-lock-and.patch.gz
- v21-0002-tableam-Add-fetch_row_version.patch.gz
- v21-0003-tableam-Add-use-tableam_fetch_follow_check.patch.gz
- v21-0004-tableam-Add-table_get_latest_tid.patch.gz
- v21-0005-tableam-multi_insert-and-slotify-COPY.patch.gz
- v21-0006-tableam-finish_bulk_insert.patch.gz
- v21-0007-tableam-slotify-CREATE-TABLE-AS-and-CREATE-MATER.patch.gz
- v21-0008-tableam-index-builds.patch.gz
- v21-0009-tableam-relation-creation-VACUUM-FULL-CLUSTER-SE.patch.gz
- v21-0010-tableam-VACUUM-and-ANALYZE.patch.gz
- v21-0011-tableam-planner-size-estimation.patch.gz
- v21-0012-tableam-Sample-Scan-Support.patch.gz
- v21-0013-tableam-bitmap-heap-scan.patch.gz
- v21-0014-tableam-Only-allow-heap-in-a-number-of-contrib-m.patch.gz
- v21-0015-WIP-Move-xid-horizon-computation-for-page-level-.patch.gz
- v21-0016-tableam-Add-function-to-determine-newest-xid-amo.patch.gz
- v21-0017-tableam-Fetch-tuples-for-triggers-EPQ-using-a-pr.patch.gz
Hi,

On 2019-03-23 20:16:30 -0700, Andres Freund wrote:
> I'm pretty happy with that last version (of the first patch). I'm
> planning to do one more pass, and then push.

And done, after a bunch of mostly cosmetic changes (renaming ExecCheckHeapTupleVisible to ExecCheckTupleVisible, removing an unnecessary change in heap_lock_tuple parameters, a bunch of comments, stuff like that). Let's see what the buildfarm says.

The remaining commits luckily all are a good bit smaller.

Greetings,

Andres Freund
On Wed, Mar 27, 2019 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-02-22 14:52:08 -0500, Robert Haas wrote:
> On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Thanks for the review. Attached v2.
>
> Thanks. I took this, combined it with Andres's
> v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
> some polishing of the code and comments, and pgindented. Here's what
> I ended up with; see what you think.
I pushed this after some fairly minor changes, directly including the
patch to route the horizon computation through tableam. The only real
change is that I removed the table relfilenode from the nbtree/hash
deletion WAL record - it was only required to access the heap without
accessing the catalog and was unused now. Also added a WAL version
bump.
It seems possible that some other AM might want to generalize the
prefetch logic from heapam.c, but I think it's fair to defer that until
such an AM wants to do so.
As I see that you are fixing some typos in the code that has been committed,
I just want to share some more corrections that I found in the patches
committed so far.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
On 2019-03-29 18:38:46 +1100, Haribabu Kommi wrote:
> As I see that your are fixing some typos of the code that is committed,
> I just want to share some more corrections that I found in the patches
> that are committed till now.

Pushed both, thanks!
Hi,

On 2019-03-16 23:21:31 +1100, Haribabu Kommi wrote:
> updated patches are attached.

Now that nearly all of the tableam patches are committed (with the exception of the copy.c changes, which are for bad reasons discussed at [1]) I'm looking at the docs changes.

What made you rename indexam.sgml to am.sgml, instead of creating a separate tableam.sgml? Seems easier to just have a separate file?

I'm currently not planning to include the function-by-function API reference you have in your patchset, as I think it's more reasonable to centralize all of it in tableam.h. I think I've included most of the information there - could you check whether you agree?

[1] https://postgr.es/m/CAKJS1f98Fa%2BQRTGKwqbtz0M%3DCy1EHYR8Q-W08cpA78tOy4euKQ%40mail.gmail.com

Greetings,

Andres Freund
On Tue, Apr 2, 2019 at 10:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-03-16 23:21:31 +1100, Haribabu Kommi wrote:
> updated patches are attached.
Now that nearly all of the tableam patches are committed (with the
exception of the copy.c changes which are for bad reasons discussed at
[1]) I'm looking at the docs changes.
Thanks.
What made you rename indexam.sgml to am.sgml, instead of creating a
separate tableam.sgml? Seems easier to just have a separate file?
No specific reason, I just thought of adding all the access methods under one file.
I can change it to tableam.sgml.
I'm currently not planning to include the function-by-function API
reference you've in your patchset, as I think it's more reasonable to
centralize all of it in tableam.h. I think I've included most of the
information there - could you check whether you agree?
I checked, and all the comments and explanations provided in tableam.h are
good enough to understand. I have also updated the docs section to reflect
some more details from the tableam.h comments.
I can understand your point about avoiding a function-by-function API
reference, as the user can check the code comments directly. Still, I feel
some people may refer to the docs for API changes. I am fine with removing it,
based on your opinion.
I have attached the current set of doc patches for your reference.
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > What made you rename indexam.sgml to am.sgml, instead of creating a
> > separate tableam.sgml? Seems easier to just have a separate file?
>
> No specific reason, I just thought of adding all the access methods under
> one file.
> I can change it to tableam.sgml.

I'd rather keep it separate. It seems likely that both table and indexam docs will grow further over time, and they're not that closely related. Additionally, not moving sect1->sect2 etc will keep links stable (which we could also achieve with different sect1 names, I realize that).

> I can understand your point of avoiding function-by-function API reference,
> as the user can check directly the code comments, Still I feel some people
> may refer the doc for API changes. I am fine to remove based on your
> opinion.

I think having to keep both tableam.h and the sgml file current is too much overhead - and anybody that's going to create a new tableam is going to be able to look into the source anyway.

Greetings,

Andres Freund
On Tue, Apr 2, 2019 at 11:53 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > What made you rename indexam.sgml to am.sgml, instead of creating a
> > separate tableam.sgml? Seems easier to just have a separate file?
> >
>
> No specific reason, I just thought of adding all the access methods under
> one file.
> I can change it to tableam.sgml.
I'd rather keep it separate. It seems likely that both table and indexam
docs will grow further over time, and they're not that closely
related. Additionally not moving sect1->sect2 etc will keep links stable
(which we could also achieve with different sect1 names, I realize
that).
OK.
> I can understand your point of avoiding function-by-function API reference,
> as the user can check directly the code comments, Still I feel some people
> may refer the doc for API changes. I am fine to remove based on your
> opinion.
I think having to keep both tableam.h and the sgml file current is
too much overhead - and anybody that's going to create a new tableam is
going to be able to look into the source anyway.
Here I have attached updated patches as per the discussion.
Is the description of table access methods enough, or do you want me to
add some more details?
Regards,
Haribabu Kommi
Fujitsu Australia
Attachments
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:
> +    <varlistentry id="guc-default-table-access-method" xreflabel="default_table_access_method">
> +     <term><varname>default_table_access_method</varname> (<type>string</type>)
> +      <indexterm>
> +       <primary><varname>default_table_access_method</varname> configuration parameter</primary>
> +      </indexterm>
> +     </term>
> +     <listitem>
> +      <para>
> +       The value is either the name of a table access method, or an empty string
> +       to specify using the default table access method of the current database.
> +       If the value does not match the name of any existing table access method,
> +       <productname>PostgreSQL</productname> will automatically use the default
> +       table access method of the current database.
> +      </para>

Hm, this doesn't strike me as right (there's no such thing as "default table access method of the current database"). You just get an error in that case.

I think we should simply not allow setting to "" - what's the point in that?

Greetings,

Andres Freund
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:
> From a72cfcd523887f1220473231d7982928acc23684 Mon Sep 17 00:00:00 2001
> From: Hari Babu <kommi.haribabu@gmail.com>
> Date: Tue, 2 Apr 2019 15:41:17 +1100
> Subject: [PATCH 1/2] tableam : doc update of table access methods
>
> Providing basic explanation of table access methods
> including their structure details and reference heap
> implementation files.
> ---
>  doc/src/sgml/catalogs.sgml |  5 ++--
>  doc/src/sgml/filelist.sgml |  1 +
>  doc/src/sgml/postgres.sgml |  1 +
>  doc/src/sgml/tableam.sgml  | 56 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 2 deletions(-)
>  create mode 100644 doc/src/sgml/tableam.sgml
>
> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index f4aabf5dc7..200708e121 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -587,8 +587,9 @@
>     The catalog <structname>pg_am</structname> stores information about
>     relation access methods. There is one row for each access method supported
>     by the system.
> -   Currently, only indexes have access methods. The requirements for index
> -   access methods are discussed in detail in <xref linkend="indexam"/>.
> +   Currently, only table and indexes have access methods. The requirements for table
> +   access methods are discussed in detail in <xref linkend="tableam"/> and the
> +   requirements for index access methods are discussed in detail in <xref linkend="indexam"/>.
>    </para>

I also adapted pg_am.amtype.

> diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
> new file mode 100644
> index 0000000000..9eca52ee70
> --- /dev/null
> +++ b/doc/src/sgml/tableam.sgml
> @@ -0,0 +1,56 @@
> +<!-- doc/src/sgml/tableam.sgml -->
> +
> +<chapter id="tableam">
> + <title>Table Access Method Interface Definition</title>
> +
> + <para>
> +  This chapter defines the interface between the core <productname>PostgreSQL</productname>
> +  system and <firstterm>access methods</firstterm>, which manage <literal>TABLE</literal>
> +  types. The core system knows nothing about these access methods beyond
> +  what is specified here, so it is possible to develop entirely new access
> +  method types by writing add-on code.
> + </para>
> +
> + <para>
> +  All Tables in <productname>PostgreSQL</productname> system are the primary
> +  data store. Each table is stored as its own physical <firstterm>relation</firstterm>
> +  and so is described by an entry in the <structname>pg_class</structname>
> +  catalog. A table's content is entirely controlled by its access method.
> +  (All the access methods furthermore use the standard page layout described
> +  in <xref linkend="storage-page-layout"/>.)
> + </para>

I don't think there's actually any sort of dependency on the page layout. It's entirely conceivable to write an AM that doesn't use postgres' shared buffers.

> + <para>
> +  A table access method handler function must be declared to accept a single
> +  argument of type <type>internal</type> and to return the pseudo-type
> +  <type>table_am_handler</type>. The argument is a dummy value that simply
> +  serves to prevent handler functions from being called directly from SQL commands.
> +  The result of the function must be a palloc'd struct of type <structname>TableAmRoutine</structname>,
> +  which contains everything that the core code needs to know to make use of
> +  the table access method.

That's not been correct for a while...

> +  The <structname>TableAmRoutine</structname> struct,
> +  also called the access method's <firstterm>API struct</firstterm>, includes
> +  fields specifying assorted fixed properties of the access method, such as
> +  whether it can support bitmap scans. More importantly, it contains pointers
> +  to support functions for the access method, which do all of the real work to
> +  access tables. These support functions are plain C functions and are not
> +  visible or callable at the SQL level. The support functions are described
> +  in <structname>TableAmRoutine</structname> structure. For more details, please
> +  refer the file <filename>src/include/access/tableam.h</filename>.
> + </para>

This seems to not have been adapted after copying it from indexam?

I'm still working on this (in particular I think storage.sgml and probably some other places need updates to make clear they apply to heap, not generally; I think there need to be some references to generic WAL records in tableam.sgml, ...), but I got to run a few errands.

One thing I want to call out is that I made the reference to src/include/access/tableam.h a link to gitweb. I think that makes it much more useful to the casual reader. But it also means that, barring some infrastructure / procedure we don't have, the link will just continue to point to master. I'm not particularly concerned about that, but it seems worth pointing out, given that we've only a single link to gitweb in the sgml docs so far.

Greetings,

Andres Freund
Attachments
I reviewed new docs for $SUBJECT. Find attached proposed changes.

There's one XXX item I'm unsure what it's intended to say.

Justin

From a3d290bf67af2a34e44cd6c160daf552b56a13b5 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Thu, 4 Apr 2019 00:48:09 -0500
Subject: [PATCH v1] Fine tune documentation for tableam

Added at commit b73c3a11963c8bb783993cfffabb09f558f86e37
---
 doc/src/sgml/catalogs.sgml        |  2 +-
 doc/src/sgml/config.sgml          |  4 ++--
 doc/src/sgml/ref/select_into.sgml |  6 +++---
 doc/src/sgml/storage.sgml         | 17 ++++++++-------
 doc/src/sgml/tableam.sgml         | 44 ++++++++++++++++++++-------------------
 5 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 58c8c96..40ddec4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -587,7 +587,7 @@
    The catalog <structname>pg_am</structname> stores information about
    relation access methods. There is one row for each access method supported
    by the system.
-   Currently, only table and indexes have access methods. The requirements for table
+   Currently, only tables and indexes have access methods. The requirements for table
    and index access methods are discussed in detail in <xref linkend="tableam"/> and
    <xref linkend="indexam"/> respectively.
   </para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4a9a1e8..90b478d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7306,8 +7306,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter specifies the default table access method to use when
         creating tables or materialized views if the <command>CREATE</command>
         command does not explicitly specify an access method, or when
-        <command>SELECT ... INTO</command> is used, which does not allow to
-        specify a table access method. The default is <literal>heap</literal>.
+        <command>SELECT ... INTO</command> is used, which does not allow
+        specification of a table access method. The default is <literal>heap</literal>.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/ref/select_into.sgml b/doc/src/sgml/ref/select_into.sgml
index 17bed24..1443d79 100644
--- a/doc/src/sgml/ref/select_into.sgml
+++ b/doc/src/sgml/ref/select_into.sgml
@@ -106,11 +106,11 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
   </para>

   <para>
-   In contrast to <command>CREATE TABLE AS</command> <command>SELECT
-   INTO</command> does not allow to specify properties like a table's access
+   In contrast to <command>CREATE TABLE AS</command>, <command>SELECT
+   INTO</command> does not allow specification of properties like a table's access
    method with <xref linkend="sql-createtable-method" /> or the table's
    tablespace with <xref linkend="sql-createtable-tablespace" />. Use <xref
-   linkend="sql-createtableas"/> if necessary. Therefore the default table
+   linkend="sql-createtableas"/> if necessary. Therefore, the default table
    access method is chosen for the new table. See <xref
    linkend="guc-default-table-access-method"/> for more information.
   </para>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 62333e3..5dfca1b 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -189,11 +189,11 @@ there.
 </para>

 <para>
-Note that the following sections describe the way the builtin
+Note that the following sections describe the behavior of the builtin
 <literal>heap</literal> <link linkend="tableam">table access method</link>,
-and the builtin <link linkend="indexam">index access methods</link> work. Due
-to the extensible nature of <productname>PostgreSQL</productname> other types
-of access method might work similar or not.
+and the builtin <link linkend="indexam">index access methods</link>. Due
+to the extensible nature of <productname>PostgreSQL</productname>, other
+access methods might work differently.
 </para>

 <para>
@@ -703,11 +703,12 @@ erased (they will be recreated automatically as needed).
This section provides an overview of the page format used within
<productname>PostgreSQL</productname> tables and indexes.<footnote>
 <para>
-  Actually, neither table nor index access methods need not use this page
-  format. All the existing index methods do use this basic format, but the
+  Actually, use of this page format is not required for either table or index
+  access methods.
+  The <literal>heap</literal> table access method always uses this format.
+  All the existing index methods also use the basic format, but the
   data kept on index metapages usually doesn't follow the item layout
-  rules. The <literal>heap</literal> table access method also always uses
-  this format.
+  rules.
 </para>
 </footnote> Sequences and <acronym>TOAST</acronym> tables are formatted just like
a regular table.
diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
index 8d9bfd8..0a89935 100644
--- a/doc/src/sgml/tableam.sgml
+++ b/doc/src/sgml/tableam.sgml
@@ -48,54 +48,56 @@
   callbacks and their behavior is defined in the
   <structname>TableAmRoutine</structname> structure (with comments inside the
   struct defining the requirements for callbacks). Most callbacks have
-  wrapper functions, which are documented for the point of view of a user,
-  rather than an implementor, of the table access method. For details,
+  wrapper functions, which are documented from the point of view of a user
+  (rather than an implementor) of the table access method. For details,
   please refer to the <ulink
   url="https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/tableam.h;hb=HEAD">
   <filename>src/include/access/tableam.h</filename></ulink> file.
  </para>

  <para>
-  To implement a access method, an implementor will typically need to
-  implement a AM specific type of tuple table slot (see
+  To implement an access method, an implementor will typically need to
+  implement an AM-specific type of tuple table slot (see
   <ulink url="https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/executor/tuptable.h;hb=HEAD">
-  <filename>src/include/executor/tuptable.h</filename></ulink>) which allows
+  <filename>src/include/executor/tuptable.h</filename></ulink>), which allows
   code outside the access method to hold references to tuples of the AM, and
   to access the columns of the tuple.
  </para>

  <para>
-  Currently the the way an AM actually stores data is fairly
-  unconstrained. It is e.g. possible to use postgres' shared buffer cache,
-  but not required. In case shared buffers are used, it likely makes to
-  postgres' standard page layout described in <xref
+  Currently, the way an AM actually stores data is fairly
+  unconstrained. For example, it's possible but not required to use postgres'
+  shared buffer cache. In case it is used, it likely makes
+XXX something missing here ?
+  to postgres' standard page layout described in <xref
   linkend="storage-page-layout"/>.
  </para>

  <para>
   One fairly large constraint of the table access method API is that,
   currently, if the AM wants to support modifications and/or indexes, it is
-  necessary that each tuple has a tuple identifier (<acronym>TID</acronym>)
+  necessary for each tuple to have a tuple identifier (<acronym>TID</acronym>)
   consisting of a block number and an item number (see also <xref
   linkend="storage-page-layout"/>). It is not strictly necessary that the
-  sub-parts of <acronym>TIDs</acronym> have the same meaning they e.g. have
+  sub-parts of <acronym>TIDs</acronym> have the same meaning as
   used for <literal>heap</literal>, but if bitmap scan support is desired (it
   is optional), the block number needs to provide locality.
  </para>

  <para>
-  For crash safety an AM can use postgres' <link
-  linkend="wal"><acronym>WAL</acronym></link>, or a custom approach can be
-  implemented. If <acronym>WAL</acronym> is chosen, either <link
-  linkend="generic-wal">Generic WAL Records</link> can be used — which
-  implies higher WAL volume but is easy, or a new type of
-  <acronym>WAL</acronym> records can be implemented — but that
-  currently requires modifications of core code (namely modifying
+  For crash safety, an AM can use postgres' <link
+  linkend="wal"><acronym>WAL</acronym></link>, or a custom implementation.
+  If <acronym>WAL</acronym> is chosen, either <link
+  linkend="generic-wal">Generic WAL Records</link> can be used,
+  or a new type of <acronym>WAL</acronym> records can be implemented.
+  Generic WAL Records are easy, but imply higher WAL volume.
+  Implementation of a new type of WAL record
+  currently requires modifications to core code (specifically,
   <filename>src/include/access/rmgrlist.h</filename>).
  </para>

  <para>
   To implement transactional support in a manner that allows different table
-  access methods be accessed within a single transaction, it likely is
+  access methods be accessed within a single transaction, it's likely
   necessary to closely integrate with the machinery in
   <filename>src/backend/access/transam/xlog.c</filename>.
  </para>
@@ -103,8 +105,8 @@
  <para>
   Any developer of a new <literal>table access method</literal> can refer to
   the existing <literal>heap</literal> implementation present in
-  <filename>src/backend/heap/heapam_handler.c</filename> for more details of
-  how it is implemented.
+  <filename>src/backend/heap/heapam_handler.c</filename> for details of
+  its implementation.
  </para>
 </chapter>
-- 
2.1.4
Attachments
Hi,

On 2019-04-04 00:51:38 -0500, Justin Pryzby wrote:
> I reviewed new docs for $SUBJECT.
> Find attached proposed changes.
> There's one XXX item I'm unsure what it's intended to say.

Thanks! I applied most of these, and filled in the XXX.

I didn't like the s/allow to to specify properties/allow specification of properties/, so I left those out. But I could be convinced otherwise...

Greetings,

Andres Freund
Hi,

I wrote a little toy implementation that just returns constant data to play with this a little. Looks good overall.

There were a bunch of typos in the comments in tableam.h, see attached. Some of the comments could use more copy-editing and clarification, I think, but I stuck to fixing just typos and such for now.

index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM doesn't use normal data files, that won't work. I bumped into that with my toy implementation, which wouldn't need to create any data files, if it wasn't for this.

The comments for relation_set_new_relfilenode() callback say that the AM can set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits this assertion:

TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) 3)))", File: "vacuum.c", Line: 1323)

There's a little bug in index-only scan executor node, where it mixes up the slots to hold a tuple from the index, and from the table. That doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which uses a virtual slot, it caused warnings like this from index-only scans:

WARNING: problem in alloc set ExecutorState: detected write past chunk end in block 0x56419b0f88e8, chunk 0x56419b0f8f90

Attached is a patch with the toy implementation I used to test this. I'm not suggesting we should commit that - although feel free to do that if you think it's useful - but it shows how I bumped into these issues. The second patch fixes the index-only-scan slot confusion (untested, except with my toy AM).

- Heikki
Attachments
On Mon, Apr 8, 2019 at 9:34 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I wrote a little toy implementation that just returns constant data to
> play with this a little. Looks good overall.
>
> There were a bunch of typos in the comments in tableam.h, see attached.
> Some of the comments could use more copy-editing and clarification, I
> think, but I stuck to fixing just typos and such for now.
>
> index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> doesn't use normal data files, that won't work. I bumped into that with
> my toy implementation, which wouldn't need to create any data files, if
> it wasn't for this.
>
> The comments for relation_set_new_relfilenode() callback say that the AM
> can set *freezeXid and *minmulti to invalid. But when I did that, VACUUM
> hits this assertion:
>
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)
>
> There's a little bug in index-only scan executor node, where it mixes up
> the slots to hold a tuple from the index, and from the table. That
> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
> my toy AM, which uses a virtual slot, it caused warnings like this from
> index-only scans:
>
> WARNING: problem in alloc set ExecutorState: detected write past chunk
> end in block 0x56419b0f88e8, chunk 0x56419b0f8f90
>
> Attached is a patch with the toy implementation I used to test this.
> I'm not suggesting we should commit that - although feel free to do that
> if you think it's useful - but it shows how I bumped into these issues.
> The second patch fixes the index-only-scan slot confusion (untested,
> except with my toy AM).
>
Awesome... it built and ran the tests cleanly, but I got an assertion failure running VACUUM:
fabrizio=# vacuum toytab ;
TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) 3)))", File: "vacuum.c", Line: 1323)
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset:
2019-04-08 12:29:16.204 -03 [20844] LOG: server process (PID 24457) was terminated by signal 6: Aborted
2019-04-08 12:29:16.204 -03 [20844] DETAIL: Failed process was running: vacuum toytab ;
2019-04-08 12:29:16.204 -03 [20844] LOG: terminating any other active server processes
2019-04-08 12:29:16.205 -03 [24458] WARNING: terminating connection because of crash of another server process
And backtrace is:
(gdb) bt
#0 0x00007f813779f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f81377a102a in __GI_abort () at abort.c:89
#2 0x0000000000ec0de9 in ExceptionalCondition (conditionName=0x10e3bb8 "!(((classForm->relfrozenxid) >= ((TransactionId) 3)))", errorType=0x10e33f3 "FailedAssertion", fileName=0x10e345a "vacuum.c", lineNumber=1323) at assert.c:54
#3 0x0000000000893646 in vac_update_datfrozenxid () at vacuum.c:1323
#4 0x000000000089127a in vacuum (relations=0x26c4390, params=0x7ffeb1a3fb30, bstrategy=0x26c4218, isTopLevel=true) at vacuum.c:452
#5 0x00000000008906ae in ExecVacuum (pstate=0x26145b8, vacstmt=0x25f46f0, isTopLevel=true) at vacuum.c:196
#6 0x0000000000c3a883 in standard_ProcessUtility (pstmt=0x25f4a50, queryString=0x25f3be8 "vacuum toytab ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at utility.c:670
#7 0x0000000000c3977a in ProcessUtility (pstmt=0x25f4a50, queryString=0x25f3be8 "vacuum toytab ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at utility.c:360
#8 0x0000000000c3793e in PortalRunUtility (portal=0x265ba28, pstmt=0x25f4a50, isTopLevel=true, setHoldSnapshot=false, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:1175
#9 0x0000000000c37d7f in PortalRunMulti (portal=0x265ba28, isTopLevel=true, setHoldSnapshot=false, dest=0x25f4b48, altdest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:1321
#10 0x0000000000c36899 in PortalRun (portal=0x265ba28, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x25f4b48, altdest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:796
#11 0x0000000000c2a40e in exec_simple_query (query_string=0x25f3be8 "vacuum toytab ;") at postgres.c:1215
#12 0x0000000000c332a3 in PostgresMain (argc=1, argv=0x261fe68, dbname=0x261fca8 "fabrizio", username=0x261fc80 "fabrizio") at postgres.c:4249
#13 0x0000000000b051fc in BackendRun (port=0x2616d20) at postmaster.c:4429
#14 0x0000000000b042c3 in BackendStartup (port=0x2616d20) at postmaster.c:4120
#15 0x0000000000afc70a in ServerLoop () at postmaster.c:1703
#16 0x0000000000afb94e in PostmasterMain (argc=3, argv=0x25ed850) at postmaster.c:1376
#17 0x0000000000977de8 in main (argc=3, argv=0x25ed850) at main.c:228
Isn't it better to raise an exception, as you did in other functions?
static void
toyam_relation_vacuum(Relation onerel,
                      struct VacuumParams *params,
                      BufferAccessStrategy bstrategy)
{
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("function %s not implemented yet", __func__)));
}
Regards,
--
Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Hi, On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > There were a bunch of typos in the comments in tableam.h, see attached. Some > of the comments could use more copy-editing and clarification, I think, but > I stuck to fixing just typos and such for now. I pushed these after adding three boring changes by pgindent. Thanks for those! I'd greatly welcome more feedback on the comments - I've been pretty deep in this for so long that I don't see all of the issues anymore. And a mild dyslexia doesn't help... > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM > doesn't use normal data files, that won't work. I bumped into that with my > toy implementation, which wouldn't need to create any data files, if it > wasn't for this. Hm. That should be fixed. I've been burning the candle on both ends for too long, so I'll not get to it today. But I think we should fix it soon. I'll create an open item for it. > The comments for relation_set_new_relfilenode() callback say that the AM can > set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits > this assertion: > > TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) > 3)))", File: "vacuum.c", Line: 1323) Hm. That needs to be fixed - IIRC it previously worked, because zheap doesn't have relfrozenxid either. Probably broke it when trying to winnow down the tableam patches. I'm planning to rebase zheap onto the newest version soon, so I'll re-encounter this. > There's a little bug in index-only scan executor node, where it mixes up the > slots to hold a tuple from the index, and from the table. That doesn't cause > any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which > uses a virtual slot, it caused warnings like this from index-only scans: Hm. That's another one that I think I had fixed previously :(, and then concluded that it's not actually necessary for some reason. Your fix looks correct to me. Do you want to commit it? 
Otherwise I'll look at it after rebasing zheap, and checking it with that. > Attached is a patch with the toy implementation I used to test this. I'm not > suggesting we should commit that - although feel free to do that if you > think it's useful - but it shows how I bumped into these issues. Hm, probably not a bad idea to include something like it. It seems like we kinda would need non-stub implementations of more functions for it to test much / and to serve as an example. I'm mildly inclined to just do it via zheap / externally, but I'm not quite sure that's good enough. > +static Size > +toyam_parallelscan_estimate(Relation rel) > +{ > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("function %s not implemented yet", __func__))); > +} The other stubbed functions seem like we should require them, but I wonder if we should make the parallel stuff optional? Greetings, Andres Freund
On 08/04/2019 20:37, Andres Freund wrote: > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: >> There's a little bug in index-only scan executor node, where it mixes up the >> slots to hold a tuple from the index, and from the table. That doesn't cause >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which >> uses a virtual slot, it caused warnings like this from index-only scans: > > Hm. That's another one that I think I had fixed previously :(, and then > concluded that it's not actually necessary for some reason. Your fix > looks correct to me. Do you want to commit it? Otherwise I'll look at > it after rebasing zheap, and checking it with that. I found another slot type confusion bug, while playing with zedstore. In an Index Scan, if you have an ORDER BY key that needs to be rechecked, so that it uses the reorder queue, then it will sometimes use the reorder queue slot, and sometimes the table AM's slot, for the scan slot. If they're not of the same type, you get an assertion: TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File: "execExprInterp.c", Line: 1905) Attached is a test for this, again using the toy table AM, extended to be able to test this. And a fix. >> Attached is a patch with the toy implementation I used to test this. I'm not >> suggesting we should commit that - although feel free to do that if you >> think it's useful - but it shows how I bumped into these issues. > > Hm, probably not a bad idea to include something like it. It seems like > we kinda would need non-stub implementation of more functions for it to > test much / and to serve as an example. I'm mildy inclined to just do > it via zheap / externally, but I'm not quite sure that's good enough. Works for me. 
>> +static Size >> +toyam_parallelscan_estimate(Relation rel) >> +{ >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("function %s not implemented yet", __func__))); >> +} > > The other stubbed functions seem like we should require them, but I > wonder if we should make the parallel stuff optional? Yeah, that would be good. I would assume it to be optional. - Heikki
Attachments
On 08/04/2019 20:37, Andres Freund wrote: > Hi, > > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: >> There were a bunch of typos in the comments in tableam.h, see attached. Some >> of the comments could use more copy-editing and clarification, I think, but >> I stuck to fixing just typos and such for now. > I pushed these after adding three boring changes by pgindent. Thanks for > those! > > I'd greatly welcome more feedback on the comments - I've been pretty > deep in this for so long that I don't see all of the issues anymore. And > a mild dyslexia doesn't help... Here is another iteration on the comments. The patch is a mix of copy-editing and questions. The questions are marked with "HEIKKI:". I can continue the copy-editing, if you can reply to the questions, clarifying the intention on some parts of the API. (Or feel free to pick and push any of these fixes immediately, if you prefer.) - Heikki
Attachments
Hi, On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote: > Here is another iteration on the comments. The patch is a mix of > copy-editing and questions. The questions are marked with "HEIKKI:". I can > continue the copy-editing, if you can reply to the questions, clarifying the > intention on some parts of the API. (Or feel free to pick and push any of > these fixes immediately, if you prefer.) Thanks! > diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c > index f7f726b5aec..bbcab9ce31a 100644 > --- a/src/backend/utils/misc/guc.c > +++ b/src/backend/utils/misc/guc.c > @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] = > {"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT, > gettext_noop("Sets the default table access method for new tables."), > NULL, > - GUC_IS_NAME > + GUC_NOT_IN_SAMPLE | GUC_IS_NAME > }, > &default_table_access_method, > DEFAULT_TABLE_ACCESS_METHOD, Hm, I think we should rather add it to the sample file. That's an oversight, not intentional. > index 6fbfcb96c98..d4709563e7e 100644 > --- a/src/include/access/tableam.h > +++ b/src/include/access/tableam.h > @@ -91,8 +91,9 @@ typedef enum TM_Result > * xmax is the outdating transaction's XID. If the caller wants to visit the > * replacement tuple, it must check that this matches before believing the > * replacement is really a match. > + * HEIKKI: matches what? xmin, but that's specific to the heapam. It's basically just the old comment moved. I wonder if we can just get rid of that field - because the logic to follow update chains correctly is now inside the lock tuple callback. And as you say - it's not clear what callers can do with it for the purpose of following chains. The counter-argument is that having it makes it a lot less annoying to adapt external code that wants to adapt with a minimal set of changes, and is only really interested in supporting heap for now.
> * GetTableAmRoutine() asserts that required callbacks are filled in, remember > * to update when adding a callback. > @@ -179,6 +184,12 @@ typedef struct TableAmRoutine > * > * if temp_snap is true, the snapshot will need to be deallocated at > * scan_end. > + * > + * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's > + * a bit surprising for the AM, no? Can it be called when a scan is > + * already in progress? Yea, it can be called when the scan is in-progress. I think we probably should just fix calling code to not need that - it's imo weird that nodeBitmapHeapscan.c doesn't just delay starting the scan till it has the snapshot. This isn't new code, but it's now going to be exposed to more AMs, so I think there's a good argument to fix it now. Robert: You committed that addition, in commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9 Author: Robert Haas <rhaas@postgresql.org> Date: 2017-03-08 12:05:43 -0500 Support parallel bitmap heap scans. do you remember why that's done? > + * HEIKKI: A flags bitmask argument would be more readable than 6 booleans > */ > TableScanDesc (*scan_begin) (Relation rel, > Snapshot snapshot, I honestly don't have strong feelings about it. Not sure that I buy that bitmasks would be much more readable - but perhaps we could just use the struct trickery we started to use in commit f831d4accda00b9144bc647ede2e2f848b59f39d Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Date: 2019-02-01 11:29:42 -0300 Add ArchiveOpts to pass options to ArchiveEntry > @@ -194,6 +205,9 @@ typedef struct TableAmRoutine > /* > * Release resources and deallocate scan. If TableScanDesc.temp_snap, > * TableScanDesc.rs_snapshot needs to be unregistered. > + * > + * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller handle > + * deregistering it? > */ > void (*scan_end) (TableScanDesc scan); It's old logic, just wrapped new. 
I think there's some argument that some of this should be moved to tableam.c rather than the individual AMs. > @@ -221,6 +235,11 @@ typedef struct TableAmRoutine > /* > * Estimate the size of shared memory needed for a parallel scan of this > * relation. The snapshot does not need to be accounted for. > + * > + * HEIKKI: If this returns X, then the parallelscan_initialize() call > + * mustn't use more than X. So this is not just for optimization purposes, > + * for example. Not sure how to phrase that, but could use some > + * clarification. > */ > Size (*parallelscan_estimate) (Relation rel); Hm. I thought I'd covered that by adding the note that parallelscan_initialize() gets memory sized by parallelscan_estimate(). > /* > * Reset index fetch. Typically this will release cross index fetch > * resources held in IndexFetchTableData. > + * > + * HEIKKI: Is this called between every call to index_fetch_tuple()? > + * Between every call to index_fetch_tuple(), except when call_again is > + * set? Can it be a no-op? > */ > void (*index_fetch_reset) (struct IndexFetchTableData *data); It's basically just to release resources eagerly. I'll add a note. > @@ -272,19 +297,22 @@ typedef struct TableAmRoutine > * test, return true, false otherwise. > * > * Note that AMs that do not necessarily update indexes when indexed > - * columns do not change, need to return the current/correct version of > + * columns don't change, need to return the current/correct version of > * the tuple that is visible to the snapshot, even if the tid points to an > * older version of the tuple. > > * *call_again is false on the first call to index_fetch_tuple for a tid.
> - * If there potentially is another tuple matching the tid, *call_again > - * needs be set to true by index_fetch_tuple, signalling to the caller > + * If there potentially is another tuple matching the tid, the callback > + * needs to set *call_again to true, signalling to the caller > * that index_fetch_tuple should be called again for the same tid. > * > * *all_dead, if all_dead is not NULL, should be set to true by > * index_fetch_tuple iff it is guaranteed that no backend needs to see > - * that tuple. Index AMs can use that do avoid returning that tid in > + * that tuple. Index AMs can use that to avoid returning that tid in > * future searches. > + * > + * HEIKKI: Should the snapshot be given in index_fetch_begin()? Can it > + * differ between calls? > */ > bool (*index_fetch_tuple) (struct IndexFetchTableData *scan, > ItemPointer tid, Hm. It could very well differ between calls. E.g. _bt_check_unique() could benefit from that (although it currently uses the table_index_fetch_tuple_check() wrapper), as it does one lookup with SnapshotDirty, and then the next with SnapshotSelf. > @@ -302,6 +330,8 @@ typedef struct TableAmRoutine > * Fetch tuple at `tid` into `slot`, after doing a visibility test > * according to `snapshot`. If a tuple was found and passed the visibility > * test, returns true, false otherwise. > + * > + * HEIKKI: explain how this differs from index_fetch_tuple. > */ > bool (*tuple_fetch_row_version) (Relation rel, > ItemPointer tid, Currently the wrapper has: * See table_index_fetch_tuple's comment about what the difference between * these functions is. This function is the correct to use outside of * index entry->table tuple lookups. referencing * The difference between this function and table_fetch_row_version is that * this function returns the currently visible version of a row if the AM * supports storing multiple row versions reachable via a single index entry * (like heap's HOT). 
Whereas table_fetch_row_version only evaluates the * tuple exactly at `tid`. Outside of index entry ->table tuple lookups, * table_fetch_row_version is what's usually needed. Should we just duplicate that? > @@ -311,14 +341,17 @@ typedef struct TableAmRoutine > /* > * Return the latest version of the tuple at `tid`, by updating `tid` to > * point at the newest version. > + * > + * HEIKKI: the latest version visible to the snapshot? > */ > void (*tuple_get_latest_tid) (Relation rel, > Snapshot snapshot, > ItemPointer tid); It's such a bad interface :(. I'd love to just remove it. Based on https://www.postgresql.org/message-id/17ef5a8a-71cb-5cbf-1762-dbb71626f84e%40dream.email.ne.jp I think we can basically just remove currtid_byreloid/byrelname. I've not sufficiently thought about TidNext() yet. > /* > - * Does the tuple in `slot` satisfy `snapshot`? The slot needs to be of > - * the appropriate type for the AM. > + * Does the tuple in `slot` satisfy `snapshot`? > + * > + * The AM may modify the data underlying the tuple as a side-effect. > */ > bool (*tuple_satisfies_snapshot) (Relation rel, > TupleTableSlot *slot, Hm, this obviously should be moved here from the wrapper. But I now wonder if we can't phrase this better. Might try to come up with something. > + /* > + * Copy all data from `OldHeap` into `NewHeap`, as part of a CLUSTER or > + * VACUUM FULL. > + * > + * If `OldIndex` is valid, the data should be ordered according to the > + * given index. If `use_sort` is false, the data should be fetched from the > + * index, otherwise it should be fetched from the old table and sorted. > + * > + * OldestXmin, FreezeXid, MultiXactCutoff are currently valid values for > + * the table. > + * HEIKKI: What does "currently valid" mean? Valid for the old table? They are system-wide values, basically. Not sure into how much detail about that to go here?
> + * The callback should set *num_tuples, *tups_vacuumed, *tups_recently_dead > + * to statistics computed while copying for the relation. Not all might make > + * sense for every AM. > + * HEIKKI: What to do for the ones that don't make sense? Set to 0? > + */ I don't see much of an alternative, yea. I suspect we're going to have to expand vacuum's reporting once we have a better grasp about what other AMs want / need. > /* > * Prepare to analyze block `blockno` of `scan`. The scan has been started > - * with table_beginscan_analyze(). See also > - * table_scan_analyze_next_block(). > + * with table_beginscan_analyze(). > * > * The callback may acquire resources like locks that are held until > - * table_scan_analyze_next_tuple() returns false. It e.g. can make sense > + * table_scan_analyze_next_tuple() returns false. For example, it can make sense > * to hold a lock until all tuples on a block have been analyzed by > * scan_analyze_next_tuple. > + * HEIKKI: Hold a lock on what? A lwlock on the page? Yea, that's what heapam does. I'm not particularly happy with this, but I'm not sure how to do better. I expect that we'll have to revise this to be more general at some not too far away point. > @@ -618,6 +666,8 @@ typedef struct TableAmRoutine > * internally needs to perform mapping between the internal and a block > * based representation. > * > + * HEIKKI: What TsmRoutine? Where is that? I'm not sure what you mean. The SampleScanState has it's associated tablesample routine. Would saying something like "will call the NextSampleBlock() callback for the TsmRoutine associated with the SampleScanState" be better? > /* > * Like table_beginscan(), but table_beginscan_strat() offers an extended API > - * that lets the caller control whether a nondefault buffer access strategy > - * can be used, and whether syncscan can be chosen (possibly resulting in the > - * scan not starting from block zero). Both of these default to true with > - * plain table_beginscan. 
> + * that lets the caller to use a non-default buffer access strategy, or > + * specify that a synchronized scan can be used (possibly resulting in the > + * scan not starting from block zero). Both of these default to true, as > + * with plain table_beginscan. > + * > + * HEIKKI: I'm a bit confused by 'allow_strat'. What is the non-default > + * strategy that will get used if you pass allow_strat=true? Perhaps the flag > + * should be called "use_bulkread_strategy"? Or it should be of type > + * BufferAccessStrategyType, or the caller should create a strategy with > + * GetAccessStrategy() and pass that. > */ That's really just a tableam port of the pre-existing heapam interface. I don't like the API very much, but there's only so many things that were realistic to change during this project (I think, there were obviously lots of judgement calls). I don't think there's much reason to defend the current status - and I'm happy to collaborate on fixing that. But I think it's out of scope for 12. > /* > - * table_beginscan_sampling is an alternative entry point for setting up a > + * table_beginscan_sampling() is an alternative entry point for setting up a > * TableScanDesc for a TABLESAMPLE scan. As with bitmap scans, it's worth > * using the same data structure although the behavior is rather different. > * In addition to the options offered by table_beginscan_strat, this call > * also allows control of whether page-mode visibility checking is used. > + * > + * HEIKKI: What is 'pagemode'? > */ That's a good question. My not defining it is pretty much a cop-out, because there previously wasn't any explanation, and I wasn't sure there *is* a meaningful definition. I mean, it's basically largely an efficiency hack inside heapam.c, but it's currently externally determined e.g. in bernoulli.c (code from 11): * Use bulkread, since we're scanning all pages. But pagemode visibility * checking is a win only at larger sampling fractions.
The 25% cutoff * here is based on very limited experimentation. */ node->use_bulkread = true; node->use_pagemode = (percent >= 25); If you have a suggestion how to either get rid of it, or how to properly phrase this... > * TABLE_INSERT_NO_LOGICAL force-disables the emitting of logical decoding > * information for the tuple. This should solely be used during table rewrites > * where RelationIsLogicallyLogged(relation) is not yet accurate for the new > * relation. > + * HEIKKI: Is this optional, too? Can the AM ignore it? Hm. Currently logical decoding isn't really extensible automatically to an AM (it works via WAL and WAL isn't extensible) - so it'll currently not mean anything to non-heap AMs (or AMs that patch/are part of core). > * Note that most of these options will be applied when inserting into the > * heap's TOAST table, too, if the tuple requires any out-of-line data. > @@ -1041,6 +1100,8 @@ table_compute_xid_horizon_for_tuples(Relation rel, > * On return the slot's tts_tid and tts_tableOid are updated to reflect the > * insertion. But note that any toasting of fields within the slot is NOT > * reflected in the slots contents. > + * > + * HEIKKI: I think GetBulkInsertState() should be an AM-specific callback. > */ I agree. There was some of that in an earlier version of the patch, but the interface wasn't yet right. I think there's a lot such things that just need to be added incrementally. > @@ -1170,6 +1235,9 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid, > * update was done. However, any TOAST changes in the new tuple's > * data are not reflected into *newtup. > * > + * HEIKKI: There is no 'newtup'. > + * HEIKKI: HEAP_ONLY_TUPLE is AM-specific; do the callers peek into that, currently? No, callers currently don't. The callback does, and sets *update_indexes accordingly. 
> - * A side effect is to set indexInfo->ii_BrokenHotChain to true if we detect > + * A side effect is to set index_info->ii_BrokenHotChain to true if we detect > * any potentially broken HOT chains. Currently, we set this if there are any > * RECENTLY_DEAD or DELETE_IN_PROGRESS entries in a HOT chain, without trying > * very hard to detect whether they're really incompatible with the chain tip. > * This only really makes sense for heap AM, it might need to be generalized > * for other AMs later. > + * > + * HEIKKI: What does 'allow_sync' do? Heh, I'm going to be responsible for everything that was previously undocumented, aren't I ;). I guess we should say something vague like "When allow_sync is set to true, an AM may use scans synchronized with other backends, if that makes sense. For some AMs that determines whether tuples are going to be returned in TID order". It's vague, but I'm not sure we can do better. Thanks! Greetings, Andres Freund
On Thu, Apr 11, 2019 at 12:49 PM Andres Freund <andres@anarazel.de> wrote: > > @@ -179,6 +184,12 @@ typedef struct TableAmRoutine > > * > > * if temp_snap is true, the snapshot will need to be deallocated at > > * scan_end. > > + * > > + * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's > > + * a bit surprising for the AM, no? Can it be called when a scan is > > + * already in progress? > > Yea, it can be called when the scan is in-progress. I think we probably > should just fix calling code to not need that - it's imo weird that > nodeBitmapHeapscan.c doesn't just delay starting the scan till it has > the snapshot. This isn't new code, but it's now going to be exposed to > more AMs, so I think there's a good argument to fix it now. > > Robert: You committed that addition, in > > commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9 > Author: Robert Haas <rhaas@postgresql.org> > Date: 2017-03-08 12:05:43 -0500 > > Support parallel bitmap heap scans. > > do you remember why that's done? I don't think there was any brilliant idea behind it. Delaying the scan start until it has the snapshot seems like a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes: > On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote: >> + * HEIKKI: A flags bitmask argument would be more readable than 6 booleans > I honestly don't have strong feelings about it. Not sure that I buy that > bitmasks would be much more readable Sure they would be --- how's FLAG_FOR_FOO | FLAG_FOR_BAR not better than unlabeled "true" and "false"? > - but perhaps we could just use the > struct trickery we started to use in I find that rather ugly really. If we're doing something other than a dozen-or-so booleans, maybe it's the only viable option. But for cases where a flags argument will serve, that's our longstanding practice and I don't see a reason to deviate. regards, tom lane
Hi, On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > The comments for relation_set_new_relfilenode() callback say that the AM can > set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits > this assertion: > > TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) > 3)))", File: "vacuum.c", Line: 1323) Hm, that necessary change unfortunately escaped into the zheap tree (which indeed doesn't set relfrozenxid). That's why I'd not noticed this. How about something like the attached? I found a related problem in VACUUM FULL / CLUSTER while working on the above, not fixed in the attached yet. Namely even if a relation doesn't yet have a valid relfrozenxid/relminmxid before a VACUUM FULL / CLUSTER, we'll set one after that. That's not great. I suspect the easiest fix would be to make the relevant relation_copy_for_cluster() FreezeXid, MultiXactCutoff arguments into pointers, and allow the AM to reset them to an invalid value if that's the appropriate one. It'd probably be better if we just moved the entire xid limit computation into the AM, but I'm worried that we actually need to move it *further up* instead - independent of this change. I don't think it's quite right to allow a table with a toast table to be independently VACUUM FULL/CLUSTERed from the toast table. GetOldestXmin() can go backwards for a myriad of reasons (or limited by old_snapshot_threshold), and I'm fairly certain that e.g. VACUUM FULLing the toast table, setting a lower old_snapshot_threshold, and VACUUM FULLing the main table would result in failures. I think we need to fix this for 12, rather than wait for 13. Does anybody disagree? Greetings, Andres Freund
Attachments
Hi, On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM > doesn't use normal data files, that won't work. I bumped into that with my > toy implementation, which wouldn't need to create any data files, if it > wasn't for this. There are a few more of these: 1) index_update_stats(), computing pg_class.relpages Feels like the number of both heap and index blocks should be computed by the index build and stored in IndexInfo. That'd also get a bit closer towards allowing indexams not going through smgr (useful e.g. for memory only ones). 2) commands/analyze.c, computing pg_class.relpages This should imo be moved to the tableam callback. It's currently done a bit weirdly imo, with fdws computing relpages in the callback, but then also returning the acquirefunc. Seems like it should entirely be computed as part of calling acquirefunc. 3) nodeTidscan, skipping over too large tids I think this should just be moved into the AMs, there's no need to have this in nodeTidscan.c 4) freespace.c, used for the new small-rels-have-no-fsm paths. That's being revised currently anyway. But I'm not particularly concerned even if it stays as is - freespace use is optional anyway. And I can't quite see an AM that doesn't want to use postgres' storage mechanism wanting to use freespace.c Therefore I'm inclined not to touch this independent of fixing the others. I think none of these are critical issues for tableam, but we should fix them. I'm not sure about doing so for v12 though. 1) and 3) are fairly trivial, but 2) would involve changing the FDW interface, by changing the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH, we're not even in beta1. Comments? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > ... I think none of these are critical issues for tableam, but we should fix > them. > I'm not sure about doing so for v12 though. 1) and 3) are fairly > trivial, but 2) would involve changing the FDW interface, by changing > the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH, > we're not even in beta1. Probably better to fix those API issues now rather than later. regards, tom lane
On Tue, Apr 23, 2019 at 6:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > ... I think none of these are critical issues for tableam, but we should fix > > them. > > > I'm not sure about doing so for v12 though. 1) and 3) are fairly > > trivial, but 2) would involve changing the FDW interface, by changing > > the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH, > > we're not even in beta1. > > Probably better to fix those API issues now rather than later. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Heikki, Ashwin, Tom, On 2019-04-23 15:52:01 -0700, Andres Freund wrote: > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM > > doesn't use normal data files, that won't work. I bumped into that with my > > toy implementation, which wouldn't need to create any data files, if it > > wasn't for this. > > There are a few more of these: > I'm not sure about doing so for v12 though. 1) and 3) are fairly > trivial, but 2) would involve changing the FDW interface, by changing > the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH, > we're not even in beta1. Hm. I think some of those changes would be a bit bigger than I initially thought. Attached is a more minimal fix that'd route RelationGetNumberOfBlocksForFork() through tableam if necessary. I think it's definitely the right answer for 1), probably the pragmatic answer to 2), but certainly not for 3). I've for now made the AM return the size in bytes, and then convert that into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers are going to continue to want it internally as pages (otherwise there's going to be way too much churn, without a benefit I can see). So I think that's OK. There's also a somewhat weird bit of returning the total relation size for InvalidForkNumber - it's pretty likely that other AMs wouldn't use postgres' current forks, but have equivalent concepts. And without that there'd be no way to get that size. I'm not sure I like this, input welcome. But it seems good to offer the ability to get the entire size somehow. Btw, isn't RelationGetNumberOfBlocksForFork() currently weirdly placed? I don't see why bufmgr.c would be appropriate? Although I don't think it's particularly clear where it'd best reside - I'd tentatively say storage.c. Heikki, Ashwin, your inputs would be appreciated here, in particular the tid fetch bit below.
The attached patch isn't intended to be applied as-is, just a basis for discussion. > 1) index_update_stats(), computing pg_class.relpages > > Feels like the number of both heap and index blocks should be > computed by the index build and stored in IndexInfo. That'd also get > a bit closer towards allowing indexams not going through smgr (useful > e.g. for memory only ones). Due to parallel index builds that'd actually be hard. Given the number of places wanting to compute relpages for pg_class I think the above patch routing RelationGetNumberOfBlocksForFork() through tableam is the right fix. > 2) commands/analyze.c, computing pg_class.relpages > > This should imo be moved to the tableam callback. It's currently done > a bit weirdly imo, with fdws computing relpages the callback, but > then also returning the acquirefunc. Seems like it should entirely be > computed as part of calling acquirefunc. Here I'm not sure routing RelationGetNumberOfBlocksForFork() through tableam wouldn't be the right minimal approach too. It has the disadvantage of implying certain values for the RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative would be to return the desired sampling range in table_beginscan_analyze() - but that'd require some duplication because currently that just uses the generic scan_begin() callback. I suspect - as previously mentioned - that we're going to have to extend statistics collection beyond the current approach at some point, but I don't think that's now. At least to me it's not clear how to best represent the stats, and how to best use them, if the underlying storage is fundamentally not block based. Nor how we'd avoid code duplication... > 3) nodeTidscan, skipping over too large tids > I think this should just be moved into the AMs, there's no need to > have this in nodeTidscan.c I think here it's *not* actually correct at all to use the relation size.
It's currently doing:

/*
 * We silently discard any TIDs that are out of range at the time of scan
 * start. (Since we hold at least AccessShareLock on the table, it won't
 * be possible for someone to truncate away the blocks we intend to
 * visit.)
 */
nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);

which is fine (except for a certain abstraction leakage) for an AM like heap or zheap, but I suspect strongly that that's not ok for Ashwin & Heikki's approach where tid isn't tied to physical representation.

The obvious answer would be to just move that check into the table_fetch_row_version implementation (currently just calling heap_fetch()) - but that doesn't seem OK from a performance POV, because we'd then determine the relation size once for each tid, rather than once per tidscan. And it'd also check in cases where we know the tid is supposed to be valid (e.g. fetching trigger tuples and such).

The proper fix seems to be to introduce a new scan variant (e.g. table_beginscan_tid()), and then have table_fetch_row_version take a scan as a parameter. But it seems we'd have to introduce that as a separate tableam callback, because we'd not want to incur the overhead of creating an additional scan / RelationGetNumberOfBlocks() checks for triggers et al.

Greetings, Andres Freund
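The out-of-range check discussed above hinges on the block-number half of a TID. As a simplified sketch (the struct is a stand-in for PostgreSQL's ItemPointerData, which packs the block number into two 16-bit halves, bi_hi/bi_lo, in BlockIdData):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: block number split into two
 * 16-bit halves plus a line pointer offset within the block. */
typedef struct
{
    uint16_t bi_hi;   /* high 16 bits of the block number */
    uint16_t bi_lo;   /* low 16 bits of the block number */
    uint16_t offset;  /* line pointer number within the block */
} TidLayout;

/* Reassemble the 32-bit block number, as BlockIdGetBlockNumber() does. */
static uint32_t
tid_block_number(const TidLayout *tid)
{
    return ((uint32_t) tid->bi_hi << 16) | tid->bi_lo;
}
```

For a heap-like AM, "is this TID past the relation?" then reduces to comparing tid_block_number() against the nblocks value captured at scan start; the thread's point is that this comparison is meaningless for AMs whose TIDs aren't physical.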
Attachments
On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> thought. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.

I will provide my inputs; Heikki, please correct me or add your inputs.

I am not sure how much gain this practically provides, if the rest of the system continues to use the returned value in terms of blocks. I understand that being block based (and not just block based, but with all the blocks of a relation storing data and full tuples) is ingrained in the system. So, breaking out of it is a much larger change, and not just limited to the table AM API. I feel most of the issues discussed here will be faced by zheap as well, as not all blocks/pages contain data: TPD pages, for example, should be excluded from sampling and TID scans, etc...

> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size. I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

Yes, I do think we should have a mechanism to get the total size, and also the size for a specific purpose. Zedstore currently doesn't use forks. Just a thought: instead of calling the argument forknum, call it something like data and meta-data, or main-data and auxiliary-data, size.
Though I don't know if any usage exists that wishes to get the size of just some non-MAIN fork for heap/zheap, such pieces of code shouldn't be in generic areas, but in AM-specific code only.

> > 2) commands/analyze.c, computing pg_class.relpages
> >
> > This should imo be moved to the tableam callback. It's currently done
> > a bit weirdly imo, with fdws computing relpages the callback, but
> > then also returning the acquirefunc. Seems like it should entirely be
> > computed as part of calling acquirefunc.
>
> Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> tableam wouldn't be the right minimal approach too. It has the
> disadvantage of implying certain values for the
> RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
> would be to return the desired sampling range in
> table_beginscan_analyze() - but that'd require some duplication because
> currently that just uses the generic scan_begin() callback.

Yes, just routing the relation size via the AM layer, still using its returned value in terms of blocks, and performing block-based sampling on it, doesn't feel like it resolves the issue. Maybe we need to delegate sampling completely to the AM layer. Code duplication can be avoided by similar AMs (heap and zheap) possibly using some common utility functions to achieve the intended result.

> I suspect - as previously mentioned - that we're going to have to extend
> statistics collection beyond the current approach at some point, but I
> don't think that's now. At least to me it's not clear how to best
> represent the stats, and how to best use them, if the underlying storage
> is fundamentally not block based. Nor how we'd avoid code duplication...

Yes, will have to give more thought to this.

> > 3) nodeTidscan, skipping over too large tids
> > I think this should just be moved into the AMs, there's no need to
> > have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size.
> It's currently doing:
>
> /*
> * We silently discard any TIDs that are out of range at the time of scan
> * start. (Since we hold at least AccessShareLock on the table, it won't
> * be possible for someone to truncate away the blocks we intend to
> * visit.)
> */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.

Agreed, it's not nice to have that optimization performed based on the number of blocks in the generic layer. I feel it's not efficient for zheap either, due to the TPD pages mentioned above, as the number of blocks returned will be higher than the number of actual data blocks.

> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan. And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).

Agreed, checking the relation size per tuple is not a viable solution.

> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter. But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Thinking out loud here, we can possibly tackle this in multiple ways. First, the above-mentioned check seems to me more of an optimization than something functionally needed; correct me if I'm wrong. If that's true, we can check with the AM whether it wishes to apply that relation-size-based optimization or not.
For Zedstore, for example, instead of performing this optimization we could directly call fetch-row-version, and Zedstore can quickly bail out based on the TID passed to it, as its meta page has the highest allocated TID value. With concurrent inserts, though, it may perform more work.

Another alternative, instead of getting the relation size: add a callback to get the highest TID value from the AM. heap and zheap can return a TID built from the highest block number and the max TID that block can have; Zedstore can return the highest TID it has assigned so far. Then either use the TID itself to perform the check, rather than just the block number, or extract the block number from the TID and use that for the check. That would at least work for the AMs we know of so far; for AMs that don't exist yet, it's hard to imagine how this would be used.

Irrespective of how we solve this problem, ctids are displayed and need to be specified in (block, offset) fashion for tid scans :-)
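The "highest TID" alternative above can be sketched as a whole-TID comparison, block number first, then offset. This is an illustrative sketch under the assumptions stated in the text (the type and function names are invented, not an actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified TID: block number plus line pointer offset. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} TidSketch;

/* Hypothetical check: is the candidate TID within the highest TID the
 * AM reports having possibly assigned?  Compares the whole TID rather
 * than the block number alone, so it also works for AMs whose TIDs are
 * logical (Zedstore) rather than physical (heap/zheap). */
static bool
tid_within_highest(TidSketch candidate, TidSketch highest)
{
    if (candidate.block != highest.block)
        return candidate.block < highest.block;
    return candidate.offset <= highest.offset;
}
```

heap/zheap would report highest = (last block, max offset per page); Zedstore would report its highest allocated TID from the meta page.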
On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 08/04/2019 20:37, Andres Freund wrote: > > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > >> There's a little bug in index-only scan executor node, where it mixes up the > >> slots to hold a tuple from the index, and from the table. That doesn't cause > >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which > >> uses a virtual slot, it caused warnings like this from index-only scans: > > > > Hm. That's another one that I think I had fixed previously :(, and then > > concluded that it's not actually necessary for some reason. Your fix > > looks correct to me. Do you want to commit it? Otherwise I'll look at > > it after rebasing zheap, and checking it with that. > > I found another slot type confusion bug, while playing with zedstore. In > an Index Scan, if you have an ORDER BY key that needs to be rechecked, > so that it uses the reorder queue, then it will sometimes use the > reorder queue slot, and sometimes the table AM's slot, for the scan > slot. If they're not of the same type, you get an assertion: > > TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File: > "execExprInterp.c", Line: 1905) > > Attached is a test for this, again using the toy table AM, extended to > be able to test this. And a fix. > > >> Attached is a patch with the toy implementation I used to test this. I'm not > >> suggesting we should commit that - although feel free to do that if you > >> think it's useful - but it shows how I bumped into these issues. > > > > Hm, probably not a bad idea to include something like it. It seems like > > we kinda would need non-stub implementation of more functions for it to > > test much / and to serve as an example. I'm mildly inclined to just do > > it via zheap / externally, but I'm not quite sure that's good enough. > > Works for me. 
> > >> +static Size
> >> +toyam_parallelscan_estimate(Relation rel)
> >> +{
> >> +       ereport(ERROR,
> >> +                       (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> +                        errmsg("function %s not implemented yet", __func__)));
> >> +}
> >
> > The other stubbed functions seem like we should require them, but I
> > wonder if we should make the parallel stuff optional?
>
> Yeah, that would be good. I would assume it to be optional.

I was trying the toyam patch, and on make check it failed with a segmentation fault at:

static void
toyam_relation_set_new_filenode(Relation rel,
                                char persistence,
                                TransactionId *freezeXid,
                                MultiXactId *minmulti)
{
    *freezeXid = InvalidTransactionId;

Basically, running create table t (i int, j int) using toytable leads to this segmentation fault. Am I missing something here?

-- 
Regards,
Rafia Sabih
Hi, On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote: >On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas <hlinnaka@iki.fi> >wrote: >> >> On 08/04/2019 20:37, Andres Freund wrote: >> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: >> >> There's a little bug in index-only scan executor node, where it >mixes up the >> >> slots to hold a tuple from the index, and from the table. That >doesn't cause >> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy >AM, which >> >> uses a virtual slot, it caused warnings like this from index-only >scans: >> > >> > Hm. That's another one that I think I had fixed previously :(, and >then >> > concluded that it's not actually necessary for some reason. Your >fix >> > looks correct to me. Do you want to commit it? Otherwise I'll look >at >> > it after rebasing zheap, and checking it with that. >> >> I found another slot type confusion bug, while playing with zedstore. >In >> an Index Scan, if you have an ORDER BY key that needs to be >rechecked, >> so that it uses the reorder queue, then it will sometimes use the >> reorder queue slot, and sometimes the table AM's slot, for the scan >> slot. If they're not of the same type, you get an assertion: >> >> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File: >> "execExprInterp.c", Line: 1905) >> >> Attached is a test for this, again using the toy table AM, extended >to >> be able to test this. And a fix. >> >> >> Attached is a patch with the toy implementation I used to test >this. I'm not >> >> suggesting we should commit that - although feel free to do that >if you >> >> think it's useful - but it shows how I bumped into these issues. >> > >> > Hm, probably not a bad idea to include something like it. It seems >like >> > we kinda would need non-stub implementation of more functions for >it to >> > test much / and to serve as an example. 
I'm mildly inclined to just >do >> > it via zheap / externally, but I'm not quite sure that's good >enough. >> >> Works for me. >> >> >> +static Size >> >> +toyam_parallelscan_estimate(Relation rel) >> >> +{ >> >> + ereport(ERROR, >> >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> >> + errmsg("function %s not implemented yet", >__func__))); >> >> +} >> > >> > The other stubbed functions seem like we should require them, but I >> > wonder if we should make the parallel stuff optional? >> >> Yeah, that would be good. I would assume it to be optional. >> >I was trying the toyam patch and on make check it failed with >segmentation fault at > >static void >toyam_relation_set_new_filenode(Relation rel, > char persistence, > TransactionId *freezeXid, > MultiXactId *minmulti) >{ > *freezeXid = InvalidTransactionId; > >Basically, on running create table t (i int, j int) using toytable, >leads to this segmentation fault. > >Am I missing something here? I assume you got compiler warnings compiling it? The API for some callbacks changed a bit. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, May 6, 2019 at 7:14 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> >        char persistence,
> >        TransactionId *freezeXid,
> >        MultiXactId *minmulti)
> >{
> >        *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
> I assume you got compiler warnings compiling it? The API for some callbacks changed a bit.

The attached patch gets the toy table AM implementation to match the latest master API. The patch builds on top of the patch from Heikki in [1]. It compiles and works, but the test still continues to fail with the WARNING for the issue mentioned in [1].

I noticed a typo in the recently added comment for relation_set_new_filenode():

 * Note that only the subset of the relcache filled by
 * RelationBuildLocalRelation() can be relied upon and that the relation's
 * catalog entries either will either not yet exist (new relation), or
 * will still reference the old relfilenode.

seems it should be:

 * Note that only the subset of the relcache filled by
 * RelationBuildLocalRelation() can be relied upon and that the relation's
 * catalog entries will either not yet exist (new relation), or still
 * reference the old relfilenode.

Also wish to point out, while working on Zedstore, we realized that TupleDesc from Relation object can be trusted at AM layer for scan_begin() API. As for the ALTER TABLE rewrite case (ATRewriteTables()), the catalog is updated first, and hence the relation object passed to the AM layer reflects the new TupleDesc. For heapam it's fine, as it doesn't use the TupleDesc today during scans in the AM layer for scan_getnextslot(). Hence the only TupleDesc which can be trusted, and which matches the on-disk layout of the tuple for scans, is the one from the TupleTableSlot.
Which is a little unfortunate, as the TupleTableSlot is only available in scan_getnextslot(), and not in scan_begin(). That means any initialization an AM wishes to do based on the TupleDesc for scans can't be done in scan_begin(), and is forced to be delayed until it has access to a TupleTableSlot. We should at least add a comment for scan_begin() to strongly clarify not to trust the Relation object's TupleDesc. Or maybe another alternative would be to have a separate API for the rewrite case.

[1] https://www.postgresql.org/message-id/9a7fb9cc-2419-5db7-8840-ddc10c93f122%40iki.fi
Attachments
On Mon, 6 May 2019 at 16:14, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote: > >On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas <hlinnaka@iki.fi> > >wrote: > >> > >> On 08/04/2019 20:37, Andres Freund wrote: > >> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > >> >> There's a little bug in index-only scan executor node, where it > >mixes up the > >> >> slots to hold a tuple from the index, and from the table. That > >doesn't cause > >> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy > >AM, which > >> >> uses a virtual slot, it caused warnings like this from index-only > >scans: > >> > > >> > Hm. That's another one that I think I had fixed previously :(, and > >then > >> > concluded that it's not actually necessary for some reason. Your > >fix > >> > looks correct to me. Do you want to commit it? Otherwise I'll look > >at > >> > it after rebasing zheap, and checking it with that. > >> > >> I found another slot type confusion bug, while playing with zedstore. > >In > >> an Index Scan, if you have an ORDER BY key that needs to be > >rechecked, > >> so that it uses the reorder queue, then it will sometimes use the > >> reorder queue slot, and sometimes the table AM's slot, for the scan > >> slot. If they're not of the same type, you get an assertion: > >> > >> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File: > >> "execExprInterp.c", Line: 1905) > >> > >> Attached is a test for this, again using the toy table AM, extended > >to > >> be able to test this. And a fix. > >> > >> >> Attached is a patch with the toy implementation I used to test > >this. I'm not > >> >> suggesting we should commit that - although feel free to do that > >if you > >> >> think it's useful - but it shows how I bumped into these issues. > >> > > >> > Hm, probably not a bad idea to include something like it. 
It seems > >like > >> > we kinda would need non-stub implementation of more functions for > >it to > >> > test much / and to serve as an example. I'm mildly inclined to just > >do > >> > it via zheap / externally, but I'm not quite sure that's good > >enough. > >> > >> Works for me. > >> > >> >> +static Size > >> >> +toyam_parallelscan_estimate(Relation rel) > >> >> +{ > >> >> + ereport(ERROR, > >> >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > >> >> + errmsg("function %s not implemented yet", > >__func__))); > >> >> +} > >> > > >> > The other stubbed functions seem like we should require them, but I > >> > wonder if we should make the parallel stuff optional? > >> > >> Yeah, that would be good. I would assume it to be optional. > >> > >I was trying the toyam patch and on make check it failed with > >segmentation fault at > > > >static void > >toyam_relation_set_new_filenode(Relation rel, > > char persistence, > > TransactionId *freezeXid, > > MultiXactId *minmulti) > >{ > > *freezeXid = InvalidTransactionId; > > > >Basically, on running create table t (i int, j int) using toytable, > >leads to this segmentation fault. > > > >Am I missing something here? > > I assume you got compiler warnings compiling it? The API for some callbacks changed a bit. > Oh yeah it does. -- Regards, Rafia Sabih
On Mon, 6 May 2019 at 22:39, Ashwin Agrawal <aagrawal@pivotal.io> wrote: > > On Mon, May 6, 2019 at 7:14 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote: > > >I was trying the toyam patch and on make check it failed with > > >segmentation fault at > > > > > >static void > > >toyam_relation_set_new_filenode(Relation rel, > > > char persistence, > > > TransactionId *freezeXid, > > > MultiXactId *minmulti) > > >{ > > > *freezeXid = InvalidTransactionId; > > > > > >Basically, on running create table t (i int, j int) using toytable, > > >leads to this segmentation fault. > > > > > >Am I missing something here? > > > > I assume you got compiler warmings compiling it? The API for some callbacks changed a bit. > > Attached patch gets toy table AM implementation to match latest master API. > The patch builds on top of patch from Heikki in [1]. > Compiles and works but the test still continues to fail with WARNING > for issue mentioned in [1] > Thanks Ashwin, this works fine with the mentioned warnings of course. -- Regards, Rafia Sabih
On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
>
> Also wish to point out, while working on Zedstore, we realized that
> TupleDesc from Relation object can be trusted at AM layer for
> scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> catalog is updated first and hence the relation object passed to AM
> layer reflects new TupleDesc. For heapam its fine as it doesn't use
> the TupleDesc today during scans in AM layer for scan_getnextslot().
> Only TupleDesc which can trusted and matches the on-disk layout of the
> tuple for scans hence is from TupleTableSlot. Which is little
> unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> and not in scan_begin(). Means if AM wishes to do some initialization
> based on TupleDesc for scans can't be done in scan_begin() and forced
> to delay till has access to TupleTableSlot. We should at least add
> comment for scan_begin() to strongly clarify not to trust Relation
> object TupleDesc. Or maybe other alternative would be have separate
> API for rewrite case.
Just to correct my typo, I wish to say, TupleDesc from Relation object can't
be trusted at AM layer for scan_begin() API.
Andres, any thoughts on the above? I see you had proposed "change the table_beginscan* API so it
provides a slot" in [1], but it seems that received no response/comments at the time.
Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:
> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > thought. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
>
> I will provide my inputs, Heikki please correct me or add your inputs.
>
> I am not sure how much gain this practically provides, if rest of the
> system continues to use the value returned in-terms of blocks. I
> understand things being block based (and not just really block based
> but all the blocks of relation are storing data and full tuple) is
> ingrained in the system. So, breaking out of it is yes much larger
> change and not just limited to table AM API.

I don't think it's that ingrained in all that many parts of the system. Outside of the places I listed upthread, and the one index case that stashes extra info, which places are that "block based"?

> I feel most of the issues discussed here should be faced by zheap as
> well, as not all blocks/pages contain data like TPD pages should be
> excluded from sampling and TID scans, etc...

It's not a problem so far, and zheap works on tableam. You can just skip such blocks during sampling / analyze, and return nothing for tidscans.

> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > > This should imo be moved to the tableam callback.
> > > It's currently done a bit weirdly imo, with fdws computing relpages
> > > in the callback, but then also returning the acquirefunc. Seems like
> > > it should entirely be computed as part of calling acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
> > would be to return the desired sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
>
> Yes, just routing relation size via AM layer and using its returned
> value in terms of blocks still and performing sampling based on blocks
> based on it, doesn't feel resolves the issue. Maybe need to delegate
> sampling completely to AM layer. Code duplication can be avoided by
> similar AMs (heap and zheap) possible using some common utility
> functions to achieve intended result.

I don't know what this is actually proposing.

> > I suspect - as previously mentioned - that we're going to have to extend
> > statistics collection beyond the current approach at some point, but I
> > don't think that's now. At least to me it's not clear how to best
> > represent the stats, and how to best use them, if the underlying storage
> > is fundamentally not block based. Nor how we'd avoid code duplication...
>
> Yes, will have to give more thoughts into this.
>
> > > 3) nodeTidscan, skipping over too large tids
> > > I think this should just be moved into the AMs, there's no need to
> > > have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> > /*
> > * We silently discard any TIDs that are out of range at the time of scan
> > * start.
> > (Since we hold at least AccessShareLock on the table, it won't
> > * be possible for someone to truncate away the blocks we intend to
> > * visit.)
> > */
> > nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
>
> Agreed, it's not nice to have that optimization performed based on the
> number of blocks in the generic layer. I feel it's not efficient for
> zheap either, due to the TPD pages mentioned above, as the number of
> blocks returned will be higher than the number of actual data blocks.

I don't think there's a problem for zheap. The blocks are just interspersed. Having pondered this a lot more, I think this is the way to go for 12. Then we can improve this for v13, to be nice.

> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter. But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Thinking out loud here, we can possibly tackle this in multiple ways.
> First, the above-mentioned check seems more of an optimization to me than
> functionally needed, correct me if wrong. If that's true we can check
> with the AM whether it wishes to apply that optimization or not based on
> relation size.

It'd be really expensive to check this differently for heap. We'd have to check the relation size, which is out of the question imo.

Greetings, Andres Freund
Hi, On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote: > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote: > > Also wish to point out, while working on Zedstore, we realized that > > TupleDesc from Relation object can be trusted at AM layer for > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()), > > catalog is updated first and hence the relation object passed to AM > > layer reflects new TupleDesc. For heapam its fine as it doesn't use > > the TupleDesc today during scans in AM layer for scan_getnextslot(). > > Only TupleDesc which can trusted and matches the on-disk layout of the > > tuple for scans hence is from TupleTableSlot. Which is little > > unfortunate as TupleTableSlot is only available in scan_getnextslot(), > > and not in scan_begin(). Means if AM wishes to do some initialization > > based on TupleDesc for scans can't be done in scan_begin() and forced > > to delay till has access to TupleTableSlot. We should at least add > > comment for scan_begin() to strongly clarify not to trust Relation > > object TupleDesc. Or maybe other alternative would be have separate > > API for rewrite case. > > Just to correct my typo, I wish to say, TupleDesc from Relation object can't > be trusted at AM layer for scan_begin() API. > > Andres, any thoughts on above. I see you had proposed "change the > table_beginscan* API so it > provides a slot" in [1], but seems received no response/comments that time. > [1] > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de I don't think passing a slot at beginscan time is a good idea. There's several places that want to use different slots for the same scan, and we probably want to increase that over time (e.g. for batching), not decrease it. What kind of initialization do you want to do based on the tuple desc at beginscan time? Greetings, Andres Freund
On Wed, May 8, 2019 at 2:46 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote: > > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote: > > > Also wish to point out, while working on Zedstore, we realized that > > > TupleDesc from Relation object can be trusted at AM layer for > > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()), > > > catalog is updated first and hence the relation object passed to AM > > > layer reflects new TupleDesc. For heapam its fine as it doesn't use > > > the TupleDesc today during scans in AM layer for scan_getnextslot(). > > > Only TupleDesc which can trusted and matches the on-disk layout of the > > > tuple for scans hence is from TupleTableSlot. Which is little > > > unfortunate as TupleTableSlot is only available in scan_getnextslot(), > > > and not in scan_begin(). Means if AM wishes to do some initialization > > > based on TupleDesc for scans can't be done in scan_begin() and forced > > > to delay till has access to TupleTableSlot. We should at least add > > > comment for scan_begin() to strongly clarify not to trust Relation > > > object TupleDesc. Or maybe other alternative would be have separate > > > API for rewrite case. > > > > Just to correct my typo, I wish to say, TupleDesc from Relation object can't > > be trusted at AM layer for scan_begin() API. > > > > Andres, any thoughts on above. I see you had proposed "change the > > table_beginscan* API so it > > provides a slot" in [1], but seems received no response/comments that time. > > [1] > > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de > > I don't think passing a slot at beginscan time is a good idea. There's > several places that want to use different slots for the same scan, and > we probably want to increase that over time (e.g. for batching), not > decrease it. 
>
> What kind of initialization do you want to do based on the tuple desc at
> beginscan time?

For Zedstore (a column store), we need to allocate a map (array or bitmask) to mark which columns to project for the scan. We also need to allocate the AM-internal scan descriptors corresponding to the number of attributes for the scan. Hence, we need access to the number of attributes involved in the scan. Currently, since we are not able to trust the Relation's TupleDesc, for Zedstore we worked around this by allocating these things on the first call to getnextslot, when we have access to the slot (by switching to the memory context used during scan_begin()).
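The kind of per-scan initialization described above can be sketched as building a projection bitmap from the attribute count in the tuple descriptor. This is a hedged, stand-alone illustration, not Zedstore's actual code; the function name and parameters are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-scan setup for a column store: given the number of
 * attributes from the (trustworthy) tuple descriptor, allocate a bitmap
 * and mark the attributes the scan needs to project.  In the scenario
 * in the thread, this would ideally run in scan_begin(), but has to be
 * deferred until a TupleTableSlot is available. */
static bool *
build_projection_map(int natts, const int *projected, int nprojected)
{
    bool *map = calloc(natts, sizeof(bool));
    if (map == NULL)
        return NULL;
    for (int i = 0; i < nprojected; i++)
        map[projected[i]] = true;   /* mark attributes the scan reads */
    return map;
}
```

The point of the thread is exactly that natts must come from a TupleDesc that matches the on-disk layout, which during ALTER TABLE rewrite is the slot's, not the Relation's.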
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> thought. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.
>
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size. I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

I'm still reasonably happy with this. I'll polish it a bit and push.

> > 3) nodeTidscan, skipping over too large tids
> > I think this should just be moved into the AMs, there's no need to
> > have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
>  * We silently discard any TIDs that are out of range at the time of scan
>  * start. (Since we hold at least AccessShareLock on the table, it won't
>  * be possible for someone to truncate away the blocks we intend to
>  * visit.)
>  */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan. And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter. But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a

    table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)

callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that means we can use HeapScanDesc's rs_nblocks to filter out
invalid tids, and we only need to call RelationGetNumberOfBlocks() once,
rather than on every table_tuple_tid_valid() / table_get_latest_tid()
call. Which is a good improvement for nodeTidscan's
table_get_latest_tid() call (for WHERE CURRENT OF) - which previously
computed the relation size once per tuple.

Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
quite soon.

Greetings,

Andres Freund
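A minimal sketch of the bytes-to-blocks conversion described above, with hypothetical symbol names rather than PostgreSQL's actual definitions: the AM reports its size in bytes, and the RelationGetNumberOfBlocksForFork()-style caller converts that to pages, rounding up so a partially filled trailing page is still counted.

```c
#include <assert.h>
#include <stdint.h>

/* illustrative stand-in for PostgreSQL's BLCKSZ (commonly 8192) */
#define SKETCH_BLCKSZ 8192

/*
 * Convert an AM-reported relation size in bytes into a page count,
 * rounding up: a partial trailing block still occupies a page slot.
 */
uint64_t
size_bytes_to_blocks(uint64_t size_bytes)
{
    return (size_bytes + SKETCH_BLCKSZ - 1) / SKETCH_BLCKSZ;
}
```

Keeping the conversion in one place lets most callers keep thinking in pages while the AM callback itself stays unit-agnostic.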
On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> > 3) nodeTidscan, skipping over too large tids
> > I think this should just be moved into the AMs, there's no need to
> > have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
> * We silently discard any TIDs that are out of range at the time of scan
> * start. (Since we hold at least AccessShareLock on the table, it won't
> * be possible for someone to truncate away the blocks we intend to
> * visit.)
> */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
>
>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan. And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter. But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.
Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.
For heap that means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.
Needs a bit of polishing, but I think this is the right direction?
At a high level this looks good to me. Will look into full details
tomorrow.

This aligns with the high-level thought I had, but is implemented in a
much better way: consulting the AM about whether to perform the
optimization. So, using the new table_tuple_tid_valid() callback, an AM
can either implement some way to validate a TID and thereby optimize the
scan, or, if it has no way to check based on the scan descriptor, always
report the TID as valid and let table_fetch_row_version() handle things.
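The division of labor discussed here can be illustrated with a self-contained mock (illustrative names only, not the real TableScanDesc/HeapScanDesc definitions): a heap-like AM caches the relation's block count once at scan start, and then validates each incoming TID against that cached value instead of recomputing the relation size per tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* mock TID: block number plus line-pointer offset */
typedef struct MockItemPointer
{
    uint32_t block;
    uint16_t offset;    /* 0 is invalid, as for real ItemPointers */
} MockItemPointer;

/* mock scan descriptor; rs_nblocks is cached once, at scan start */
typedef struct MockScanDesc
{
    uint32_t rs_nblocks;
} MockScanDesc;

/*
 * Analogous to a heap implementation of the table_tuple_tid_valid()
 * callback: cheap per-TID check against the cached block count.  An AM
 * with no such notion could simply always return true and let the
 * fetch itself reject the TID.
 */
bool
mock_tuple_tid_valid(const MockScanDesc *scan, const MockItemPointer *tid)
{
    return tid->offset != 0 && tid->block < scan->rs_nblocks;
}
```

The per-tuple cost is then a comparison, not a storage-manager size lookup.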
Hi,

On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> Highlevel this looks good to me. Will look into full details tomorrow.

Ping?

I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...

- Andres
On Fri, May 17, 2019 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> Highlevel this looks good to me. Will look into full details tomorrow.
Ping?
I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...
The relation size API still doesn't address the analyze case, as you
mentioned, but that is something we can improve on later.
On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 08/04/2019 20:37, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> There's a little bug in index-only scan executor node, where it mixes up the
>> slots to hold a tuple from the index, and from the table. That doesn't cause
>> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
>> uses a virtual slot, it caused warnings like this from index-only scans:
>
> Hm. That's another one that I think I had fixed previously :(, and then
> concluded that it's not actually necessary for some reason. Your fix
> looks correct to me. Do you want to commit it? Otherwise I'll look at
> it after rebasing zheap, and checking it with that.
I found another slot type confusion bug, while playing with zedstore. In
an Index Scan, if you have an ORDER BY key that needs to be rechecked,
so that it uses the reorder queue, then it will sometimes use the
reorder queue slot, and sometimes the table AM's slot, for the scan
slot. If they're not of the same type, you get an assertion:
TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
"execExprInterp.c", Line: 1905)
Attached is a test for this, again using the toy table AM, extended to
be able to test this. And a fix.
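The tripped assertion can be illustrated with a small self-contained mock (the struct names are stand-ins for the real executor types): a compiled expression step remembers which slot implementation it was specialized for, and handing it a slot of a different implementation fails the check. That is exactly what happens when the reorder-queue slot and the table AM's scan slot have different types.

```c
#include <assert.h>
#include <stdbool.h>

/* mock of TupleTableSlotOps: identity is the pointer itself */
typedef struct MockSlotOps
{
    const char *name;
} MockSlotOps;

static const MockSlotOps TTSOpsVirtualMock = { "virtual" };
static const MockSlotOps TTSOpsHeapTupleMock = { "heaptuple" };

typedef struct MockSlot
{
    const MockSlotOps *tts_ops;
} MockSlot;

/* a compiled fetch step, specialized for one slot implementation */
typedef struct MockFetchStep
{
    const MockSlotOps *kind;
} MockFetchStep;

/* mirrors the "op->d.fetch.kind == slot->tts_ops" assertion */
bool
mock_fetch_kind_matches(const MockFetchStep *op, const MockSlot *slot)
{
    return op->kind == slot->tts_ops;
}
```

The fix in such cases is to make sure every slot handed to the same compiled expression uses the same slot implementation.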
On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.
For heap that means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.
Question on the patch, if not too late
Why call table_beginscan() in TidNext() and not in ExecInitTidScan()? It
seems cleaner to have it in ExecInitTidScan().
Hi,

On 2019-05-17 16:56:04 -0700, Ashwin Agrawal wrote:
> Question on the patch, if not too late
> Why call table_beginscan() in TidNext() and not in ExecInitTidScan() ?
> Seems cleaner to have it in ExecInitTidScan().

Largely because it's symmetrical to where most other scans are started
(c.f. nodeSeqscan.c, nodeIndexscan.c). But also, there's no need to
incur the cost of a smgrnblocks() etc. when the node might never
actually be reached during execution.

Greetings,

Andres Freund
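The cost argument in this answer can be sketched with a mock (illustrative names, not the real nodeTidscan code): starting the scan lazily on the first TidNext() call means the smgrnblocks()-style size lookup is never performed for a node that is initialized but never executed.

```c
#include <assert.h>
#include <stdbool.h>

/* mock executor-node state for a TID scan */
typedef struct MockTidScanState
{
    bool scan_started;  /* set once the first fetch runs */
    int  size_lookups;  /* counts expensive smgrnblocks()-style calls */
} MockTidScanState;

void
mock_exec_init_tidscan(MockTidScanState *state)
{
    /* init-time setup only; deliberately no relation-size lookup here */
    state->scan_started = false;
    state->size_lookups = 0;
}

void
mock_tid_next(MockTidScanState *state)
{
    if (!state->scan_started)
    {
        /* first fetch: only now pay for the scan start / size lookup */
        state->size_lookups++;
        state->scan_started = true;
    }
    /* ... fetch the next tuple by TID ... */
}
```

A node that is initialized but never reached (e.g. on the never-taken side of a join) thus performs zero size lookups.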
Hi,

On 2019-05-17 14:49:19 -0700, Ashwin Agrawal wrote:
> On Fri, May 17, 2019 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> > > Highlevel this looks good to me. Will look into full details tomorrow.
> >
> > Ping?
> >
> > I'll push the first of the patches soon, and unless you'll comment on
> > the second soon, I'll also push ahead. There's a beta upcoming...
>
> Sorry for the delay, didn't get to it yesterday. Looked into both the
> patches. They both look good to me, thank you.

Pushed both now.

> Relation size API still doesn't address the analyze case as you mentioned
> but sure something we can improve on later.

I'm much less concerned about that. You can just return a reasonable
block size from the size callback, and it'll work for block sampling
(and you can just skip pages in the analyze callback if needed, e.g. for
zheap's tpd pages). And we assume that a reasonable block size is
returned by the size callback anyway, for planning purposes (both in
relpages and for estimate_rel_size). We'll probably want to improve that
some day, but it doesn't strike me as hugely urgent.

Greetings,

Andres Freund
On 18/05/2019 01:19, Ashwin Agrawal wrote:
> On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 08/04/2019 20:37, Andres Freund wrote:
> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> There's a little bug in index-only scan executor node, where it mixes
> >> up the slots to hold a tuple from the index, and from the table. That
> >> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
> >> my toy AM, which uses a virtual slot, it caused warnings like this
> >> from index-only scans:
> >
> > Hm. That's another one that I think I had fixed previously :(, and then
> > concluded that it's not actually necessary for some reason. Your fix
> > looks correct to me. Do you want to commit it? Otherwise I'll look at
> > it after rebasing zheap, and checking it with that.
>
> I found another slot type confusion bug, while playing with zedstore. In
> an Index Scan, if you have an ORDER BY key that needs to be rechecked,
> so that it uses the reorder queue, then it will sometimes use the
> reorder queue slot, and sometimes the table AM's slot, for the scan
> slot. If they're not of the same type, you get an assertion:
>
> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> "execExprInterp.c", Line: 1905)
>
> Attached is a test for this, again using the toy table AM, extended to
> be able to test this. And a fix.
>
> It seems the two patches from email [1] fixing slot confusion in Index
> Scans are pending to be committed.
>
> [1] https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi

Pushed the first patch now. Andres already fixed the second issue in
commit b8b94ea129.

- Heikki
On 2019-Jun-06, Heikki Linnakangas wrote:
> Pushed the first patch now. Andres already fixed the second issue in
> commit b8b94ea129.

Please don't omit the "Discussion:" tag in commit messages.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/04/2019 19:49, Andres Freund wrote:
> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
>> index f7f726b5aec..bbcab9ce31a 100644
>> --- a/src/backend/utils/misc/guc.c
>> +++ b/src/backend/utils/misc/guc.c
>> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>>  		{"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
>>  			gettext_noop("Sets the default table access method for new tables."),
>>  			NULL,
>> -			GUC_IS_NAME
>> +			GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>>  		},
>>  		&default_table_access_method,
>>  		DEFAULT_TABLE_ACCESS_METHOD,
>
> Hm, I think we should rather add it to sample. That's an oversight, not
> intentional.

I just noticed that this is still an issue. default_table_access_method
is not in the sample config file, and it's not marked with
GUC_NOT_IN_SAMPLE. I'll add this to the open items list so we don't
forget.

- Heikki
On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> On 11/04/2019 19:49, Andres Freund wrote:
>> Hm, I think we should rather add it to sample. That's an oversight, not
>> intentional.
>
> I just noticed that this is still an issue. default_table_access_method is
> not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> I'll add this to the open items list so we don't forget.

I think that we should give it the same visibility as default_tablespace,
so adding it to the sample file sounds good to me.
--
Michael
On 2019-08-13 15:03:13 +0900, Michael Paquier wrote:
> On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> > On 11/04/2019 19:49, Andres Freund wrote:
> >> Hm, I think we should rather add it to sample. That's an oversight, not
> >> intentional.
> >
> > I just noticed that this is still an issue. default_table_access_method is
> > not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> > I'll add this to the open items list so we don't forget.

Thanks!

> I think that we should give it the same visibility as default_tablespace,
> so adding it to the sample file sounds good to me.

> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
> index 65a6da18b3..39fc787851 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -622,6 +622,7 @@
>  #default_tablespace = ''	# a tablespace name, '' uses the default
>  #temp_tablespaces = ''		# a list of tablespace names, '' uses
>  				# only default tablespace
> +#default_table_access_method = 'heap'

Pushed, thanks.

>  #check_function_bodies = on
>  #default_transaction_isolation = 'read committed'
>  #default_transaction_read_only = off

Hm. I find the current ordering there a bit weird. Unrelated to your
proposed change. The header of the group is

#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------

# - Statement Behavior -

but I don't quite see GUCs like default_tablespace, search_path (due to
determining a created table's schema), temp_tablespaces,
default_table_access_method fitting reasonably well under that heading.
They all can affect persistent state. That seems pretty different from a
number of other settings (client_min_messages,
default_transaction_isolation, lock_timeout, ...) which only have
transient effects.

Should we perhaps split that group? Not that I have a good proposal for
better names.

Greetings,

Andres Freund
On Fri, Aug 16, 2019 at 03:29:30PM -0700, Andres Freund wrote:
> but I don't quite see GUCs like default_tablespace, search_path (due to
> determining a created table's schema), temp_tablespace,
> default_table_access_method fit reasonably well under that heading. They
> all can affect persistent state. That seems pretty different from a
> number of other settings (client_min_messages,
> default_transaction_isolation, lock_timeout, ...) which only have
> transient effects.

Agreed.

> Should we perhaps split that group? Not that I have a good proposal for
> better names.

We could have a section for transaction-related parameters, and move the
vacuum ones into the section for autovacuum so that they get grouped,
renaming the section "autovacuum and vacuum". An idea for a group
holding search_path, temp_tablespaces, default_tablespace & co would be
"object parameters", or "relation parameters", covering all the
parameters which affect object definitions?
--
Michael