Thread: Custom Scans and private data


Custom Scans and private data

From:
Andres Freund
Date:
(sending again, forgot to cc hackers, sorry for the duplicate)

Hi,

I'm trying to use the custom scan API to replace code that currently
does everything via hooks and isn't safe against copyObject() (Yes,
that's not a grand plan).

To me dealing with CustomScan->custom_private seems to be
extraordinarily painful when dealing with nontrivial data
structures. And such aren't all that unlikely when dealing with a custom
scan over something more complex.

Now, custom scans likely modeled private data after FDWs. But it's
already rather painful there (in fact it's the one thing I heard people
complain about repeatedly besides the inability to push down
operations). Just look at the ugly hangups postgres_fdw goes through to
pass data around - storing it in lists with indexes determining the
content and such. That kinda works if only integers and such need to
be stored, but if you have more complex data it really is a
PITA. The best alternatives I found are a) to serialize data into a
string, base64 or so. b) use a Const node over a bytea datum.  It seems
rather absurd having to go through deserialization at every execution.
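To make the pattern concrete, here's a minimal, PostgreSQL-free sketch
(struct and function names invented) of flattening private state into a
blob the way one would stuff it into a bytea Const. It only works at all
because the struct happens to contain no pointers, and the
deserialization step has to run at every execution:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private state for a custom scan; names are invented. */
typedef struct ScanPrivate
{
    int     device_id;
    double  cost_factor;
    char    label[32];
} ScanPrivate;

/* Flatten the struct into a byte blob, as one would when stuffing it
 * into a bytea Const (or base64-ing it into a string node). */
unsigned char *
serialize_private(const ScanPrivate *p, size_t *len)
{
    unsigned char *buf = malloc(sizeof(ScanPrivate));

    memcpy(buf, p, sizeof(ScanPrivate));
    *len = sizeof(ScanPrivate);
    return buf;
}

/* The deserialization that has to run at every execution. */
ScanPrivate *
deserialize_private(const unsigned char *buf, size_t len)
{
    ScanPrivate *p;

    assert(len == sizeof(ScanPrivate));
    p = malloc(sizeof(ScanPrivate));
    memcpy(p, buf, len);
    return p;
}
```

As soon as the struct grows a pointer (a List, a string), this memcpy
approach breaks down and you have to write real serialization code.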

Since we already have CustomScan->methods, it seems to be rather
reasonable to have a CopyCustomScan callback and let it do the copying
of the private data if present? Or possibly of the whole node, which'd
allow to embed it into a bigger node?

I looked at pg-strom and it seems to go through rather ugly procedures
to deal with the problem:
https://github.com/pg-strom/devel/blob/master/src/gpujoin.c
form_gpujoin_info/deform_gpujoin_info.

What's the advantage of the current model?

Greetings,

Andres Freund



Re: Custom Scans and private data

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Since we already have CustomScan->methods, it seems to be rather
> reasonable to have a CopyCustomScan callback and let it do the copying
> of the private data if present? Or possibly of the whole node, which'd
> allow to embed it into a bigger node?

Weren't there rumblings of requiring plan trees to be deserializable/
reserializable, not just copiable, so that they could be passed to
background workers?  Not that I'm particularly in favor of that, but if
you're going to go in the direction of allowing private data formats to be
copied then I think you're likely to have to address the other thing.

In any case, since this convention already exists for FDWs I'm not
sure why we should make it different for CustomScan.
        regards, tom lane



Re: Custom Scans and private data

From:
Andres Freund
Date:
On 2015-08-25 14:42:32 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Since we already have CustomScan->methods, it seems to be rather
> > reasonable to have a CopyCustomScan callback and let it do the copying
> > of the private data if present? Or possibly of the whole node, which'd
> > allow to embed it into a bigger node?
> 
> Weren't there rumblings of requiring plan trees to be deserializable/
> reserializable, not just copiable, so that they could be passed to
> background workers?

Yes, I do recall that as well. That's going to require a good bit of
independent stuff - currently there's callbacks as pointers in nodes;
that's obviously not going to fly well across processes. Additionally
custom scan already has a TextOutCustomScan callback, even if it's
currently probably intended for debugging.

I rather doubt that adding out/readfuncs without the ability to do
something about what exactly is read in is going to work well. But I
admit I'm not too sure about it.

> In any case, since this convention already exists for FDWs I'm not
> sure why we should make it different for CustomScan.

I think it was a noticeable mistake in the fdw case, but we already
released with that. We shouldn't make the same mistake twice. Looking at
postgres_fdw and the pg-strom example linked upthread imo pretty clearly
shows how ugly this is.  There's also the rather noticeable difference
that we already have callbacks in the node for custom scans (used by
outfuncs), making this rather trivial to add.

Greetings,

Andres Freund



Re: Custom Scans and private data

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-08-25 14:42:32 -0400, Tom Lane wrote:
>> In any case, since this convention already exists for FDWs I'm not
>> sure why we should make it different for CustomScan.

> I think it was a noticeable mistake in the fdw case, but we already
> released with that. We shouldn't make the same mistake twice.

I don't agree that it was a mistake, and I do think there is value in
consistency.

In the case at hand, it would not be too hard to provide some utility
functions for some common cases; for instance, if you want to just store
a struct, we could offer convenience functions to wrap that in a bytea
constant and unwrap it again.  Those could be useful for both FDWs and
custom scans.

(The bigger picture here is that we always intended to offer a bunch of
support functions to make writing FDWs easier, once we'd figured out
what made sense.  The fact that we haven't done that work yet doesn't
make it a bad approach.  Nor does "shove it all into some callbacks"
mean that the callbacks will be easy to write.)
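A self-contained sketch of what such convenience functions might look
like follows; the names are invented, and a plain length-prefixed buffer
stands in for a real bytea Const, so treat this as a shape rather than
the actual API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Wrap a flat struct in a length-prefixed blob, in the spirit of a
 * varlena/bytea datum.  Hypothetical helper names. */
unsigned char *
wrap_struct(const void *data, uint32_t size)
{
    unsigned char *blob = malloc(4 + size);

    memcpy(blob, &size, 4);         /* length word, like a varlena header */
    memcpy(blob + 4, data, size);
    return blob;
}

/* ...and unwrap it again, with a cheap sanity check on the length. */
void *
unwrap_struct(const unsigned char *blob, uint32_t expected_size)
{
    uint32_t size;
    void *data;

    memcpy(&size, blob, 4);
    assert(size == expected_size);
    data = malloc(size);
    memcpy(data, blob + 4, size);
    return data;
}
```

The real helpers would presumably produce a Const node of type BYTEAOID,
so the blob survives copyObject() and outfuncs unchanged.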

> Looking at
> postgres_fdw and the pg-strom example linked upthread imo pretty clearly
> shows how ugly this is.  There's also the rather noticeable difference
> that we already have callbacks in the node for custom scans (used by
> outfuncs), making this rather trivial to add.

I will manfully refrain from taking that bait.
        regards, tom lane



Re: Custom Scans and private data

From:
Kouhei Kaigai
Date:
> On 2015-08-25 14:42:32 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > Since we already have CustomScan->methods, it seems to be rather
> > > reasonable to have a CopyCustomScan callback and let it do the copying
> > > of the private data if present? Or possibly of the whole node, which'd
> > > allow to embed it into a bigger node?
> >
> > Weren't there rumblings of requiring plan trees to be deserializable/
> > reserializable, not just copiable, so that they could be passed to
> > background workers?
>
> Yes, I do recall that as well. that's going to require a good bit of
> independent stuff - currently there's callbacks as pointers in nodes;
> that's obviously not going to fly well across processes. Additionally
> custom scan already has a TextOutCustomScan callback, even if it's
> currently probably intended for debugging.
>
> I rather doubt that adding out/readfuncs without the ability to do
> something about what exactly is read in is going to work well. But I
> admit I'm not too sure about it.
>
> > In any case, since this convention already exists for FDWs I'm not
> > sure why we should make it different for CustomScan.
>
> I think it was a noticeable mistake in the fdw case, but we already
> released with that. We shouldn't make the same mistake twice. Looking at
> postgres_fdw and the pg-strom example linked upthread imo pretty clearly
> shows how ugly this is.  There's also the rather noticeable difference
> that we already have callbacks in the node for custom scans (used by
> outfuncs), making this rather trivial to add.
>
As Tom pointed out, the primary reason why CustomScan requires providers
to save their private data in custom_exprs/custom_private was awareness
of copyObject().
In addition, custom_exprs is expected to hold the expression nodes to be
processed in setrefs.c and subselect.c, because the core implementation
cannot know which type of data is managed privately.

Is your concern about custom_private only? Even if we have extra
callbacks like CopyObjectCustomScan() and TextReadCustomScan(),
how do we handle the situation where the core implementation needs to
know the location of expression nodes? Is custom_exprs retained as is?

An earlier version of the CustomScan interface had individual
callbacks for setrefs.c and subselect.c; however, their structure
depended too much on the current implementation, so we moved to
an implementation with two individual private fields rather than
callbacks in outfuncs.c.

On the other hand, I'm inclined to think TextOutCustomScan() might
be a misdesign for supporting serialize/deserialize via readfuncs.c.
http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F80111D04F@BPXM15GP.gisp.nec.co.jp
I think it should be deprecated rather than enhanced.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>




Re: Custom Scans and private data

From:
Andres Freund
Date:
On 2015-08-25 19:22:04 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think it was a noticeable mistake in the fdw case, but we already
> > released with that. We shouldn't make the same mistake twice.
> 
> I don't agree that it was a mistake, and I do think there is value in
> consistency.

Having to write ad-hoc serialization code in a different way in each fdw
seems like a bad idea to me. The output of that serialization is rather
hard to read since there are no names and such, and it's quite
inefficient. I have a hard time seeing the benefits.

> In the case at hand, it would not be too hard to provide some utility
> functions for some common cases; for instance, if you want to just store
> a struct, we could offer convenience functions to wrap that in a bytea
> constant and unwrap it again.  Those could be useful for both FDWs and
> custom scans.

Yes, that'd already be an improvement. Right now it seems the best way
to achieve that is to either base64 the output into a Value node or use
bytea inside a Const node. Alternatively you can make it a list of ints
but that obviously isn't a grand idea.

But imo that's not really good enough as soon as you have more than a
single struct. For anything but that you'll have to write
serialization/deserialization code again.

Additionally you're in many cases forced to always use the serialized
representation, even if there's no copyObject() in between, since you
don't have a proper way to store non-serialized information!

Check out serializePlanData in
https://github.com/laurenz/oracle_fdw/blob/master/oracle_fdw.c - that's
something we shouldn't put people through.

Actually, *especially* if we at some point want to ship plans over the
wire, that kind of serialization is a bad idea - it's very hard to have
any sort of checks. out/readfuncs.c type serialization will have field
names, but you can't do the equivalent from these interfaces.

> Nor does "shove it all into some callbacks" mean that the callbacks
> will be easy to write.)

Being able to only perform copying/serialization when necessary, storing
the data in a non-serialized manner, seem pretty clear improvements. I
think it's important to be able to easily store core data in fields, but
that's just a copyObject() call away.

> > Looking at
> > postgres_fdw and the pg-strom example linked upthread imo pretty clearly
> > shows how ugly this is.  There's also the rather noticeable difference
> > that we already have callbacks in the node for custom scans (used by
> > outfuncs), making this rather trivial to add.
> 
> I will manfully refrain from taking that bait.

While that comment made me laugh, I'm not really sure why my examples
are bait. One is probably the most used fdw, the other the only user of
the custom scan interface I know of.

Greetings,

Andres Freund



Re: Custom Scans and private data

From:
Andres Freund
Date:
On 2015-08-26 00:55:48 +0000, Kouhei Kaigai wrote:
> As Tom pointed out, the primary reason why CustomScan required provider
> to save its private data on custom_exprs/custom_private were awareness
> of copyObject().

Well, a callback brings that with it as well. I do think it makes sense
to *allow* not to have a callback and rely on copyObject() to do the
work.

> In addition, custom_exprs are expected to link expression node to be
> processed in setrefs.c and subselect.c because core implementation
> cannot know which type of data is managed in private.

Right.

> Do you concern about custom_private only?

Yes, pretty much. There's very little chance you can expand the
expression tree with custom expressions and survive the experience. So
custom_exprs etc. makes sense.

> Even if we have extra
> callbacks like CopyObjectCustomScan() and TextReadCustomScan(),
> how do we care about the situation when core implementation needs to
> know the location of expression nodes? Is custom_exprs retained as is?

Yes.

> In the earlier version of CustomScan interface had individual
> callbacks on setrefs.c and subselect.c, however, its structure
> depended on the current implementation too much, then we moved
> to the implementation to have two individual private fields
> rather than callbacks on outfuncs.c.

I agree with that choice.

> On the other hands, I'm inclined to think TextOutCustomScan() might
> be a misdesign to support serialize/deserialize via readfuncs.c.
>   http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F80111D04F@BPXM15GP.gisp.nec.co.jp
> I think it shall be deprecated rather then enhancement.

Well, right now there's no support for reading/writing plans at all. But
if we add it, TextOutCustomScan() seems like a small problem in
comparison to others. CustomScan contains pointers, that's definitely
not something you can ship over the wire and expect to work. We'll
probably have to store a soname + function name instead.
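
A rough, PostgreSQL-free sketch of that idea (all names invented, and a
static table standing in for a real dlopen()/dlsym() lookup): the plan
carries a (soname, symbol) pair instead of raw pointers, and each
backend resolves it to a local method table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What would be stored in the serialized plan instead of pointers. */
typedef struct SerializedMethods
{
    char soname[64];    /* shared library providing the scan methods */
    char symbol[64];    /* symbol that yields the method table */
} SerializedMethods;

/* A stand-in for the real CustomScanMethods table. */
typedef struct MethodsStub
{
    const char *name;
} MethodsStub;

static MethodsStub gpuscan_methods = { "GpuScan" };

/* Each backend resolves the pair locally; a real implementation would
 * use load_external_function() / dlsym() here instead of a table. */
const MethodsStub *
resolve_methods(const SerializedMethods *sm)
{
    if (strcmp(sm->soname, "pg_strom") == 0 &&
        strcmp(sm->symbol, "gpuscan_methods") == 0)
        return &gpuscan_methods;
    return NULL;
}
```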

More generally I rather doubt that it'll always make sense to
serialize/deserialize in a generic manner between different backends. It
very well can make sense to refer to backend-local state in a plan - you
need to be able to take that into account upon
serialization/deserialization. Consider e.g. a connection id for an FDW
that uses pooling.

Greetings,

Andres Freund



Re: Custom Scans and private data

From:
Kouhei Kaigai
Date:
> On 2015-08-26 00:55:48 +0000, Kouhei Kaigai wrote:
> > As Tom pointed out, the primary reason why CustomScan required provider
> > to save its private data on custom_exprs/custom_private were awareness
> > of copyObject().
>
> Well, a callback brings that with it as well. I do think it makes sense
> to *allow* not to have a callback and rely on copyObject() to do the
> work.
>
> > In addition, custom_exprs are expected to link expression node to be
> > processed in setrefs.c and subselect.c because core implementation
> > cannot know which type of data is managed in private.
>
> Right.
>
> > Do you concern about custom_private only?
>
> Yes, pretty much. There's very little chance you can expand the
> expression tree with custom expressions and survive the experience. So
> custom_exprs etc. makes sense.
>
I can understand your concern; however, I'm not certain which is the best
way to do it.
Regarding CustomPath and CustomScanState, we expect the developer to
define their own data structure derived from these types, like:

    typedef struct {
        CustomPath  cpath;
        List       *host_quals;    /* RestrictInfo run on host */
        List       *dev_quals;     /* RestrictInfo run on device */
    } GpuScanPath;

On the other hand, the CustomScan node prohibits this manner because
of copyObject() awareness. If we could have a similar design, it would
look more harmonious.

Here is my random thought on how to realize it, even though the current
specification requires CustomScan to follow the existing copyObject()
manner.

- _copyCustomScan() needs to allow allocating larger than sizeof(CustomScan),
  according to the requirement of the custom-scan provider.
- We may need to have a utility function to copy the common part. It is not
  preferable to have the custom-scan provider implement it itself.
- For upcoming readfuncs.c support, I don't want to have READ_XXX_FIELD()
  macros on the extension side. Even though pg_strtok() is a public function,
  _readBitmapset() is a static function.
- Is custom_private deprecated? Also, do we force providers to have
  CopyObjectCustomScan() and potentially TextReadCustomScan()? I want to keep
  custom_private as-is, for the sake of very simple custom-scan providers
  which have at most several primitive values as private data.

> > Even if we have extra
> > callbacks like CopyObjectCustomScan() and TextReadCustomScan(),
> > how do we care about the situation when core implementation needs to
> > know the location of expression nodes? Is custom_exprs retained as is?
>
> Yes.
>
> > In the earlier version of CustomScan interface had individual
> > callbacks on setrefs.c and subselect.c, however, its structure
> > depended on the current implementation too much, then we moved
> > to the implementation to have two individual private fields
> > rather than callbacks on outfuncs.c.
>
> I agree with that choice.
>
> > On the other hands, I'm inclined to think TextOutCustomScan() might
> > be a misdesign to support serialize/deserialize via readfuncs.c.
> >
> http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F80111D04F@BPXM15GP.gisp.nec.co.jp
> > I think it shall be deprecated rather then enhancement.
>
> Well, right now there's no support for reading/writing plans at all. But
> if we add it, TextOutCustomScan() seems like a small problem in
> comparison to others. CustomScan contains pointers, that's definitely
> not something you can ship over the wire and expect to work. We'll
> probably have to store a soname + function name instead.
>
That may not be far in the future. The ParallelAppend feature I've
discussed in another thread needs the capability of plan-tree
serialization, and it is under development:

https://github.com/kaigai/sepgsql/blob/fappend/src/backend/nodes/readfuncs.c#L1506

I thought we might need to define a new DDL command to associate a
particular custom-scan name with a table of callback pointers, using a
persistent system catalog. Your idea, soname + symbol name, may be a
solution.

> More generally I rather doubt that it'll always make sense to
> serialize/deserialize in a generic manner between different backends. It
> very well can make sense to refer to backend-local state in a plan - you
> need to be able to take that into account upon
> serialization/deserialization. Consider e.g. a connection id for an FDW
> that uses pooling.
>
+1. I also prefer a more generic mechanism to serialize/deserialize,
rather than a custom-scan specific one.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: Custom Scans and private data

From:
Andres Freund
Date:
On 2015-08-26 23:55:16 +0000, Kouhei Kaigai wrote:
> - _copyCustomScan() needs to allow to allocate larger than sizeof(CustomScan),
>   according to the requirement by custom-scan provider.

I think it's somewhat nice to allow larger objects, but I don't think
it's absolutely necessary.

> - We may need to have a utility function to copy the common part.
>   It is not preferable to implement by custom-scan provider itself.

I think that could be done by having the CopyCustomScan callback simply
return a new node, *without* filling out those fields.

> - For upcoming readfuncs.c support, I don't want to have READ_XXX_FIELD()
>   macro on the extension side. Even though pg_strtok() is a public function,
>   _readBitmapset() is static function.

Yea, the general copying code for other node types should be reusable
for that. Invoking parseNodeString() should be usable.

> - Is custom_private deprecated? Also, do we force to have CopyObjectCustomScan()
>   and potentially TextReadCustomScan()?

I'd either remove it, or add a second void * private field which has to
be copied by CopyObjectCustomScan() while leaving the responsibility to
custom_private. I honestly don't see too much value in keeping it.

> It may not be a long future. The ParallelAppend feature, I've discussed
> in another thread, needs capability of plan tree serialization, and under
> the development:

Yes, I skimmed through those discussions.

Greetings,

Andres Freund



Re: Custom Scans and private data

From:
Robert Haas
Date:
On Wed, Aug 26, 2015 at 11:23 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Looking at
>> > postgres_fdw and the pg-strom example linked upthread imo pretty clearly
>> > shows how ugly this is.  There's also the rather noticeable difference
>> > that we already have callbacks in the node for custom scans (used by
>> > outfuncs), making this rather trivial to add.
>>
>> I will manfully refrain from taking that bait.
>
> While that comment made me laugh, I'm not really sure why my examples
> are bait. One is the probably most used fdw, the other the only user of
> the custom scan interface I know of.

I suspect what Tom is getting at is that he thinks the custom scan
interface is a giant piece of dung which I ought not to have committed
unless somebody was holding a gun to my head at the time, and that
putting function pointers in the structure at all was a particularly
poor design decision on par with the Roman decision to make their
water pipes out of a metal which is toxic if ingested.  He and I are
going to have to agree to disagree on that.  I think it's great that
you are experimenting with the feature, and I think we ought to try to
make it better, even if that requires incompatible changes to the API.

When we originally discussed this topic at a previous PGCon developer
meeting, it was suggested that we might want to have a facility for
custom nodes, and I at least had the impression that this might be
separate from particular facilities we might have for custom paths or
custom plans or custom plan-states.  For example, suppose you can do
this:

void
InitCustomNodeAndres(CustomNodeTemplate *tmpl)
{
    tmpl->outfunc = _outAndres;
    tmpl->copyfunc = _copyAndres;
    tmpl->equalfunc = _equalAndres;
    tmpl->readfunc = _readAndres;
}

void
_PG_init()
{
    RegisterCustomNodeType("Andres", "andres", "InitCustomNodeAndres");
}

...
   Andres *andres = makeCustomNode(Andres);

If we had something like that, then you could address your use case by
making the structures you want to pass around be nodes rather than
having to invent a way to smash whatever you have into a binary blob.
That would be pretty cool.  Maybe, with a bit of work, we could even
find a way to get rid of the idea of custom-path, custom-plan, and
custom scan-state as node types in their own right and instead have
some trick where they can just be general custom nodes that have one
or two extra methods defined for the stuff that is specific to the
planner and executor.  Not sure exactly how to set that up off-hand,
but maybe there's a good way.
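
The registry behind RegisterCustomNodeType() could be sketched like this
(a self-contained, invented implementation: Robert's hypothetical API
passes library and symbol names so other backends can re-resolve the
init function, whereas this sketch simplifies to a direct function
pointer):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-node-type callback table, loosely following the hypothetical
 * CustomNodeTemplate above. */
typedef struct CustomNodeTemplate
{
    char   name[64];
    void (*copyfunc)(void);     /* stand-ins for _copyAndres etc. */
    void (*outfunc)(void);
} CustomNodeTemplate;

#define MAX_CUSTOM_NODES 16
static CustomNodeTemplate registry[MAX_CUSTOM_NODES];
static int nregistered = 0;

/* Register a node type; init (if given) fills in the callbacks. */
void
RegisterCustomNodeType(const char *name, void (*init)(CustomNodeTemplate *))
{
    CustomNodeTemplate *tmpl;

    if (nregistered >= MAX_CUSTOM_NODES)
        return;                 /* a real version would report an error */
    tmpl = &registry[nregistered++];
    strncpy(tmpl->name, name, sizeof(tmpl->name) - 1);
    tmpl->name[sizeof(tmpl->name) - 1] = '\0';
    if (init)
        init(tmpl);
}

/* What makeCustomNode()/copyObject() would consult for dispatch. */
CustomNodeTemplate *
LookupCustomNodeType(const char *name)
{
    int i;

    for (i = 0; i < nregistered; i++)
        if (strcmp(registry[i].name, name) == 0)
            return &registry[i];
    return NULL;
}
```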

In general, I think we've made extensibility very hard in areas like
this.  There are a whole bunch of thorny problems here that are
interconnected.  For example, suppose you want to add a new kind of
SQL-visible object to the system, something EnterpriseDB has needed to
do repeatedly over the years.  You have to modify like 40 core source
files.  You need new parser rules: but the parser is not extensible.
You need new keywords: but the keyword table is not extensible.  To
represent the results of parsing, you need new parse nodes: but the
node system is not extensible.  You need to classify your statement as
DML for logging purposes, you need to hook it into the event trigger
system, you need to hook it into objectaddress.c, you need a new
command tag, you need to add your new command to the appropriate place
in ProcessUtility.  You can sort of get control in ProcessUtility from
a hook, but it's awkward, and the others can't be handled at all.  So
you just have to hack core.

It would be nice to fully solve this problem so that some of the EDB
stuff could exist as an extension rather than a modification of the
core code, but given the magnitude of the problem I don't expect to
get there any time soon.  Still, I think making core facilities more
accessible to extensions has a ton of merit. FDWs, background workers,
dynamic shared memory, logical decoding plugins, DDL deparsing, and,
yeah, even custom scans are all relatively recent examples of new
facilities we've been careful to make usable outside core, and there
is, I think, a good deal of evidence that every one of those
facilities has gotten use by developers other than the ones who
installed the core support, which I personally find really satisfying.
I predict that if we take steps to make the node system more
extensible, we will find that that capability gets plenty of use,
including some uses we can't now predict.

And even more broadly: I think the time has come to get really
skeptical about every place in our code where there's a big giant
switch.  All of those represent places where non-core code need not
apply.  And all of them represent places where, instead of each module
registering itself with all the correct subsystems, each subsystem
knows about all of the things that plug into it.  Whoever designed the
shared-memory initialization mechanism was smart: it knows very little
itself, and relies on its clients to tell it what they need.  There
are some things I don't like about that mechanism, but at least we got
that part right, and so extensions can use it.  A lot of other places,
unfortunately, took the simpler approach of just hard-coding
everything.  There are some places where performance or other concerns
justify that approach, but there are a lot of places where I think
another approach would be better.  EDB is far from the only company
that would benefit from making more of these subsystems extensible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Custom Scans and private data

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Aug 26, 2015 at 11:23 PM, Andres Freund <andres@anarazel.de> wrote:
>> While that comment made me laugh, I'm not really sure why my examples
>> are bait. One is the probably most used fdw, the other the only user of
>> the custom scan interface I know of.

> I suspect what Tom is getting at is that he thinks the custom scan
> interface is ...

[ROTFL]  No, actually, I'm not interested in fighting that battle.
But I do think that letting custom-scan extensions fool about with
the semantics of plan-copying (and plan-serialization for that matter)
is a risky choice that is not justified by some marginal gains in code
readability, especially when there are other feasible ways to attack
the readability problem.  copyObject needs to be a pretty self-contained
operation with not a lot of external dependencies, and I'm afraid there
would be too much temptation for extensions to take shortcuts that would
fail later.

> [ ruminations about how to improve the system's extensibility ]

Yeah, I've entertained similar thoughts in the past.  It's not very clear
how we get there from here, though --- for instance, any fundamental
change in the Node system would break so much code that it's unlikely
anyone would thank us for it, even assuming we ever finished the
conversion project.

I'm also less than convinced that "I should be able to install major new
features without touching the core code" is actually a sane or useful goal
to have.  It sounds great in the abstract, but long ago (at HP) I worked
on systems that were actually built to be able to do that ... and I can
tell you that the tradeoffs you have to make to do that are not very
pleasant.

As a concrete example, you mentioned the lack of extensibility of the
bison parser, which is undoubtedly a blocking issue if you want a new
feature that would require new SQL syntax.  While bison is certainly
a PITA in some ways, it has one saving grace that I think a lot of people
underappreciate: if it compiles your grammar, the grammar is (probably)
unambiguous and has no unreachable elements.  The extensible parser we
had at HP was, um, considerably less capable of detecting conflicts or
ambiguity.  Maybe we just didn't know how to do it, but I suspect that
that is a fundamental tradeoff when you have incomplete information.

I don't want to sound too negative --- there are probably improvements
we can make in this area.  But I'm skeptical of sticking in hooks that
haven't been pretty carefully thought through.  We've done a lot of that
in the past, and I think the net result is that we have a lot of useless
hooks, or at least hooks that can't be used in a way that we don't break
every release or two.
        regards, tom lane



Re: Custom Scans and private data

From:
Andres Freund
Date:
Hi,

On 2015-08-27 18:59:13 -0400, Tom Lane wrote:
> But I do think that letting custom-scan extensions fool about with
> the semantics of plan-copying (and plan-serialization for that matter)
> is a risky choice that is not justified by some marginal gains in code
> readability

To me the likelihood of having bugs in manual (de)serialization via lists
et al. is higher than introducing bugs via copyfuncs. Or even
out/readfuncs, especially as we could easily add a bunch of
verifications to the latter by verifying field names in READ_*. Whether a
prepared statement fails because a copyfuncs hook failed, or whether
it's in the fdw itself doesn't seem to make that much of a difference.

> especially when there are other feasible ways to attack the
> readability problem.

Let's at least add a Value variant that can store strings that aren't
null terminated, that'd already make things a lot better. Right now you
need to reuse Const nodes and stuff a bytea in there or such.

But even then, that only works if you have a single flat data structure
that doesn't have any pointers in it. copyObject() style copying can
easily cope with that, using a binary blob really can't.
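
Concretely (a self-contained sketch, names invented): with a pointer
member, a copyObject()-style deep copy does the right thing, while
treating the struct as a flat blob would silently share the pointed-to
array between the copies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A private structure with variable-length data behind a pointer,
 * like most core *Scan/*Join nodes have.  Names invented. */
typedef struct PrivateState
{
    int  nitems;
    int *items;
} PrivateState;

/* copyObject()-style recursive copy: the pointed-to array is
 * duplicated, so the copy is independent of the original. */
PrivateState *
copy_private(const PrivateState *src)
{
    PrivateState *dst = malloc(sizeof(PrivateState));

    dst->nitems = src->nitems;
    dst->items = malloc(src->nitems * sizeof(int));
    memcpy(dst->items, src->items, src->nitems * sizeof(int));
    return dst;
}
```

A memcpy of the struct alone would copy only the pointer value, which is
exactly why pointer-bearing private data can't live in a binary blob.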

As far as I can see you right now basically need to write code for
custom scans so that between planning and execution you unconditionally
do:
Planning:
1) create your internal data
2) serialize data,
Execution:
3) unserialize data
4) use unserialized data

There's no way to stash data between 2 and 3 anywhere, even if there's
no need for the whole copying rigamarole because it was just a plain
statement.

If you look at the various *Scan, *Join etc. nodes we have in core, most
of them have a bunch of variable length data structures and pointers in
them. All that you can't sanely do for a custom scan node.

> copyObject needs to be a pretty self-contained operation with not a
> lot of external dependencies

Why?

> > [ ruminations about how to improve the system's extensibility ]
> 
> Yeah, I've entertained similar thoughts in the past.  It's not very clear
> how we get there from here, though --- for instance, any fundamental
> change in the Node system would break so much code that it's unlikely
> anyone would thank us for it, even assuming we ever finished the
> conversion project.

How about adding a single 'ExtensibleNode' node type (inheriting from
Node as normal). Many of the switches over node types would gain one
additional entry for that, and do something like

    case T_ExtensibleNode:
        handler = LookupHandler(((ExtensibleNode *) node)->extended_type,
                                Handler_ExecInitNode);
        if (handler)
        {
            result = ((ExecInitNodeHandler *) handler)(node, estate, eflags);
            break;
        }
        /* fall through to error */
    default:
        elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));


that should be doable one-by-one in many of these bigger
switches. There are obviously some limits to where that's doable, but I
don't think it'd break much code?

How exactly to identify extended_type in a way that makes sense
e.g. between restarting pg, backends, primary/standby isn't trivial, but
I do think it should be doable.

> I'm also less than convinced that "I should be able to install major new
> features without touching the core code" is actually a sane or useful goal
> to have.

Yea, I have some doubts there as well. There are a lot of features,
though, which we really don't want in core, where some of the
limitations of the node system really do become problematic.

> As a concrete example, you mentioned the lack of extensibility of the
> bison parser, which is undoubtedly a blocking issue if you want a new
> feature that would require new SQL syntax.

While that's a problem where I really don't have any good answer, I
also think in many cases it's primarily DDL, and primarily extending
existing DDL. By far the most things I've seen want to add information
to tables and/or columns - and storage options actually kinda give you
that at the grammar level. We just don't allow you to sanely hook into
it, and there's a bunch of other pointless restrictions.

Greetings,

Andres Freund



Re: Custom Scans and private data

От
Andres Freund
Дата:
On 2015-08-27 23:23:54 +0100, Robert Haas wrote:
> I think it's great that you are experimenting with the feature, and I
> think we ought to try to make it better, even if that requires
> incompatible changes to the API.

Yea, I think it's too early to consider this API stable at all. That's
imo just normal when making something new extensible.

> When we originally discussed this topic at a previous PGCon developer
> meeting, it was suggested that we might want to have a facility for
> custom nodes, and I at least had the impression that this might be
> separate from particular facilities we might have for custom paths or
> custom plans or custom plan-states.

I think we can actually use Custom* to test-drive concepts we might want
to use more widely at a later stage.

> For example, suppose you can do this:
> 
> void
> InitCustomNodeAndres(CustomNodeTemplate *tmpl)
> {
>     tmpl->outfunc = _outAndres;
>     tmpl->copyfunc = _copyAndres;
>     tmpl->equalfunc = _equalAndres;
>     tmpl->readfunc = _readAndres;
> }
> 
> void
> _PG_init()
> {
>    RegisterCustomNodeType("Andres", "andres", "InitCustomNodeAndres");
> }
> 
> ...
> 
>     Andres *andres = makeCustomNode(Andres);
> 
> If we had something like that, then you could address your use case by
> making the structures you want to pass around be nodes rather than
> having to invent a way to smash whatever you have into a binary blob.

Yes, that'd be pretty helpful already. On the other hand it'd still not
make some of the CustomScan members superfluous - I do think exprs/tlist
make a fair amount of sense to handle setrefs.c et al.


> In general, I think we've made extensibility very hard in areas like
> this.

Well, a lot of it is just *hard* to make extensible. If we pay too much
attention to extensibility nobody is going to want to extend things
anymore, because postgres will move too slowly :(.

Also I do think we have to be careful about not making it too "cheap" to
have significant features outside of core. If you look at mysql the
massive fragmentation around storage engines really has cost them a lot.

> There are a whole bunch of thorny problems here that are
> interconnected.  For example, suppose you want to add a new kind of
> SQL-visible object to the system, something EnterpriseDB has needed to
> do repeatedly over the years.  You have to modify like 40 core source
> files.

FWIW, I think this is also a problem for core code. It's very easy to
forget individual pieces even if you're diligent and know what you're
doing.

I think there's a bunch of things that could massively make that easier,
without actually costing that much. Just off the top of my head:
1) Allow additional syscaches to be added at runtime, including emitting
   relevant invalidations
2) Allow registering a callback from dependency.c, add some generic
   output to objectaddress.c
3) Make relation/attribute reloptions actually extensible

With these three you can actually add new catalogs and database objects
in a fairly sane manner, even if the user interface isn't the prettiest.
I do think that relation options + utility hooks that turn "generic" DDL
into the normal DDL & your custom action can get you pretty far. But
having to write your own caches, and triggers that emit sys/relcache
invalidations, is painful. Handling dependencies in a meaningful manner
is, to my knowledge, impossible. Unless you want to default to CASCADE
and do stuff in the drop event trigger...

If WITH (...) or security labels are too ugly or crummy, I think you can
also get rather far using functions.

Good, because solving

> You need new parser rules: but the parser is not extensible.
> You need new keywords: but the keyword table is not extensible.

without massive performance and/or maintainability regressions seems hard.

Greetings,

Andres Freund