Обсуждение: xml data type implications of no =

Поиск
Список
Период
Сортировка

xml data type implications of no =

От
Mark Kirkwood
Дата:
Today I ran into some interesting consequences of the xml data type
being without an "=" operator. One I thought I'd post here because it
has a *possible* planner impact. I'm not sure it is actually a bug as
such, but this seemed the best forum to post in initially:

test=# \d bug
       Table "public.bug"
  Column |  Type   | Modifiers
--------+---------+-----------
  id     | integer |
  val    | xml     |

test=# explain select val::text from bug;
                           QUERY PLAN
--------------------------------------------------------------
  Seq Scan on bug  (cost=0.00..58127.78 rows=1000278 width=32)


Note the width estimate. However a more realistic estimate for width is:

test=# select 8192/(reltuples/relpages) as width from pg_class where
relname='bug';
       width
------------------
  394.130431739976

So we are going to massively underestimate the "size" of such a dataset.
Now this appears to be a consequence of no "=" operator (std_typanalyze
in analyze.c bails if there isn't one), so the planner has no idea about
how wide 'val' actually is. I'm wondering if it is worth having at least
an "=" operator to enable some minimal stats to be available for xml
columns.

regards

Mark

Re: xml data type implications of no =

От
Mark Kirkwood
Дата:
On 25/05/10 16:43, Mark Kirkwood wrote:
Today I ran into some interesting consequences of the xml data type being without an "=" operator. One I thought I'd post here because it has a *possible* planner impact. I'm not sure it is actually a bug as such, but this seemed the best forum to post in initially:

test=# \d bug
      Table "public.bug"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 val    | xml     |

test=# explain select val::text from bug;
                          QUERY PLAN                         
--------------------------------------------------------------
 Seq Scan on bug  (cost=0.00..58127.78 rows=1000278 width=32)


Note the width estimate. However a more realistic estimate for width is:

test=# select 8192/(reltuples/relpages) as width from pg_class where relname='bug';
      width      
------------------
 394.130431739976


So we are going to massively underestimate the "size" of such a dataset. Now this appears to be a consequence of no "=" operator (std_typanalyze in analyze.c bails if there isn't one), so the planner has no idea about how wide 'val' actually is. I'm wondering if it is worth having at least an "=" operator to enable some minimal stats to be available for xml columns.


Adding a minimal = op (see attached) and an analyze results in:

test=# explain select val::text from bug;
                          QUERY PLAN                          
---------------------------------------------------------------
 Seq Scan on bug  (cost=0.00..62632.08 rows=1000008 width=385)


which gives a much better indication of dataset size.

Вложения

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Mark Kirkwood
Дата:
On 27/05/10 13:37, Mark Kirkwood wrote:
> On 25/05/10 16:43, Mark Kirkwood wrote:
>> Today I ran into some interesting consequences of the xml data type
>> being without an "=" operator. One I thought I'd post here because it
>> has a *possible* planner impact. I'm not sure it is actually a bug as
>> such, but this seemed the best forum to post in initially:
>>
>> test=# \d bug
>>       Table "public.bug"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  id     | integer |
>>  val    | xml     |
>>
>> test=# explain select val::text from bug;
>>                           QUERY PLAN
>> --------------------------------------------------------------
>>  Seq Scan on bug  (cost=0.00..58127.78 rows=1000278 width=32)
>>
>>
>> Note the width estimate. However a more realistic estimate for width is:
>>
>> test=# select 8192/(reltuples/relpages) as width from pg_class where
>> relname='bug';
>>       width
>> ------------------
>>  394.130431739976
>>
>> So we are going to massively underestimate the "size" of such a
>> dataset. Now this appears to be a consequence of no "=" operator
>> (std_typanalyze in analyze.c bails if there isn't one), so the
>> planner has no idea about how wide 'val' actually is. I'm wondering
>> if it is worth having at least an "=" operator to enable some minimal
>> stats to be available for xml columns.
>>
>
> Adding a minimal = op (see attached) and an analyze results in:
>
> test=# explain select val::text from bug;
>                           QUERY PLAN
> ---------------------------------------------------------------
>  Seq Scan on bug  (cost=0.00..62632.08 rows=1000008 width=385)
>
> which gives a much better indication of dataset size.
>
>
>

Maybe I gave this guy a bad title - is it a concern that the 'width'
estimate is so far off for xml datatypes (because of no = op)? It seemed
to me that this could result in some bad plan choices (e.g in subqueries
etc).

regards

Mark

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Robert Haas
Дата:
On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> Maybe I gave this guy a bad title - is it a concern that the 'width'
> estimate is so far off for xml datatypes (because of no = op)? It seemed to
> me that this could result in some bad plan choices (e.g in subqueries etc).

It's possible.  I don't really see a reason not to add an = operator
for XML - does anyone else?

It would need to be done by updating src/include/catalog/pg_*.h,
rather than via SQL, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Mark Kirkwood
Дата:
On 09/06/10 15:22, Robert Haas wrote:
> On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz>  wrote:
>
>> Maybe I gave this guy a bad title - is it a concern that the 'width'
>> estimate is so far off for xml datatypes (because of no = op)? It seemed to
>> me that this could result in some bad plan choices (e.g in subqueries etc).
>>
> It's possible.  I don't really see a reason not to add an = operator
> for XML - does anyone else?
>
> It would need to be done by updating src/include/catalog/pg_*.h,
> rather than via SQL, of course.
>
>

Heh, sure should - I merely included the SQL stuff in case anyone else
wanted to reproduce what I was seeing!

Cheers

Mark

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> It's possible.  I don't really see a reason not to add an = operator
> for XML - does anyone else?

Yes, that was considered and rejected, IIRC.  What is your definition
of equality for xml?

            regards, tom lane

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Mark Kirkwood
Дата:
On 09/06/10 17:14, Tom Lane wrote:
> Robert Haas<robertmhaas@gmail.com>  writes:
>
>> It's possible.  I don't really see a reason not to add an = operator
>> for XML - does anyone else?
>>
> Yes, that was considered and rejected, IIRC.  What is your definition
> of equality for xml?
>

Yes - but in that previous discussion the optimizer (lack of)
information was not considered (or known I suspect), so maybe a rethink
is worthwhile?

It seems that the nub of this issue is that there are conceptually two
types of =, one for datatype specific comparison, and one for optimizer
statistical information calculation. However the system allows only the
first, so if you don't (or can't) have one then you lose some possibly
important optimization data.

regards

Mark

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Tom Lane
Дата:
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> It seems that the nub of this issue is that there are conceptually two
> types of =, one for datatype specific comparison, and one for optimizer
> statistical information calculation. However the system allows only the
> first, so if you don't (or can't) have one then you lose some possibly
> important optimization data.

Nonsense.  ANALYZE and the optimizer work with the datatype's usual
notion of '=', whatever it is.

It's possible that we should install a simplified code path in analyze.c
that can collect width data for a column even in the absence of any '='
operator.  I'm less than convinced that it's worth the trouble though.
Do you have an actual example where such data would have affected a
plan choice?

            regards, tom lane

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Robert Haas
Дата:
On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It's possible. =A0I don't really see a reason not to add an =3D operator
>> for XML - does anyone else?
>
> Yes, that was considered and rejected, IIRC. =A0What is your definition
> of equality for xml?

I'd vote for !memcmp().

There can be (and probably already are) other ways to test for other
kinds of equality, too, of course.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yes, that was considered and rejected, IIRC.  What is your definition
>> of equality for xml?

> I'd vote for !memcmp().

Surely not.  xml is not text.

            regards, tom lane

Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

От
Mark Kirkwood
Дата:
On 10/06/10 02:17, Tom Lane wrote:
> Mark Kirkwood<mark.kirkwood@catalyst.net.nz>  writes:
>
>> It seems that the nub of this issue is that there are conceptually two
>> types of =, one for datatype specific comparison, and one for optimizer
>> statistical information calculation. However the system allows only the
>> first, so if you don't (or can't) have one then you lose some possibly
>> important optimization data.
>>
> Nonsense.  ANALYZE and the optimizer work with the datatype's usual
> notion of '=', whatever it is.
>
>

Slow down the reading Tom... and read what I was actually saying - note
the"conceptually". Of course the code uses the datatype's defined "=".

> It's possible that we should install a simplified code path in analyze.c
> that can collect width data for a column even in the absence of any '='
> operator.
>

Yeah I was thinking along the same lines.

> Do you have an actual example where such data would have affected a
> plan choice?
>
>
>

Not at the moment, I was thinking that anywhere that used such datatypes
in a subquery of similar might be a likely case. I guess I was looking
at this as a case of "this is an area where we have less accurate
optimizer data that we could have", and thinking of ways to improve it.

regards

Mark