Обсуждение: xml data type implications of no =
Today I ran into some interesting consequences of the xml data type being without an "=" operator. One I thought I'd post here because it has a *possible* planner impact. I'm not sure it is actually a bug as such, but this seemed the best forum to post in initially: test=# \d bug Table "public.bug" Column | Type | Modifiers --------+---------+----------- id | integer | val | xml | test=# explain select val::text from bug; QUERY PLAN -------------------------------------------------------------- Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32) Note the width estimate. However a more realistic estimate for width is: test=# select 8192/(reltuples/relpages) as width from pg_class where relname='bug'; width ------------------ 394.130431739976 So we are going to massively underestimate the "size" of such a dataset. Now this appears to be a consequence of no "=" operator (std_typanalyze in analyze.c bails if there isn't one), so the planner has no idea about how wide 'val' actually is. I'm wondering if it is worth having at least an "=" operator to enable some minimal stats to be available for xml columns. regards Mark
On 25/05/10 16:43, Mark Kirkwood wrote:
Adding a minimal = op (see attached) and an analyze results in:
test=# explain select val::text from bug;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385)
which gives a much better indication of dataset size.
Today I ran into some interesting consequences of the xml data type being without an "=" operator. One I thought I'd post here because it has a *possible* planner impact. I'm not sure it is actually a bug as such, but this seemed the best forum to post in initially:
test=# \d bug
Table "public.bug"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
val | xml |
test=# explain select val::text from bug;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)
Note the width estimate. However a more realistic estimate for width is:
test=# select 8192/(reltuples/relpages) as width from pg_class where relname='bug';
width
------------------
394.130431739976
So we are going to massively underestimate the "size" of such a dataset. Now this appears to be a consequence of no "=" operator (std_typanalyze in analyze.c bails if there isn't one), so the planner has no idea about how wide 'val' actually is. I'm wondering if it is worth having at least an "=" operator to enable some minimal stats to be available for xml columns.
Adding a minimal = op (see attached) and an analyze results in:
test=# explain select val::text from bug;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385)
which gives a much better indication of dataset size.
Вложения
On 27/05/10 13:37, Mark Kirkwood wrote: > On 25/05/10 16:43, Mark Kirkwood wrote: >> Today I ran into some interesting consequences of the xml data type >> being without an "=" operator. One I thought I'd post here because it >> has a *possible* planner impact. I'm not sure it is actually a bug as >> such, but this seemed the best forum to post in initially: >> >> test=# \d bug >> Table "public.bug" >> Column | Type | Modifiers >> --------+---------+----------- >> id | integer | >> val | xml | >> >> test=# explain select val::text from bug; >> QUERY PLAN >> -------------------------------------------------------------- >> Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32) >> >> >> Note the width estimate. However a more realistic estimate for width is: >> >> test=# select 8192/(reltuples/relpages) as width from pg_class where >> relname='bug'; >> width >> ------------------ >> 394.130431739976 >> >> So we are going to massively underestimate the "size" of such a >> dataset. Now this appears to be a consequence of no "=" operator >> (std_typanalyze in analyze.c bails if there isn't one), so the >> planner has no idea about how wide 'val' actually is. I'm wondering >> if it is worth having at least an "=" operator to enable some minimal >> stats to be available for xml columns. >> > > Adding a minimal = op (see attached) and an analyze results in: > > test=# explain select val::text from bug; > QUERY PLAN > --------------------------------------------------------------- > Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385) > > which gives a much better indication of dataset size. > > > Maybe I gave this guy a bad title - is it a concern that the 'width' estimate is so far off for xml datatypes (because of no = op)? It seemed to me that this could result in some bad plan choices (e.g in subqueries etc). regards Mark
On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > Maybe I gave this guy a bad title - is it a concern that the 'width' > estimate is so far off for xml datatypes (because of no = op)? It seemed to > me that this could result in some bad plan choices (e.g in subqueries etc). It's possible. I don't really see a reason not to add an = operator for XML - does anyone else? It would need to be done by updating src/include/catalog/pg_*.h, rather than via SQL, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/06/10 15:22, Robert Haas wrote: > On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood > <mark.kirkwood@catalyst.net.nz> wrote: > >> Maybe I gave this guy a bad title - is it a concern that the 'width' >> estimate is so far off for xml datatypes (because of no = op)? It seemed to >> me that this could result in some bad plan choices (e.g in subqueries etc). >> > It's possible. I don't really see a reason not to add an = operator > for XML - does anyone else? > > It would need to be done by updating src/include/catalog/pg_*.h, > rather than via SQL, of course. > > Heh, sure should - I merely included the SQL stuff in case anyone else wanted to reproduce what I was seeing! Cheers Mark
Robert Haas <robertmhaas@gmail.com> writes: > It's possible. I don't really see a reason not to add an = operator > for XML - does anyone else? Yes, that was considered and rejected, IIRC. What is your definition of equality for xml? regards, tom lane
On 09/06/10 17:14, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: > >> It's possible. I don't really see a reason not to add an = operator >> for XML - does anyone else? >> > Yes, that was considered and rejected, IIRC. What is your definition > of equality for xml? > Yes - but in that previous discussion the optimizer (lack of) information was not considered (or known I suspect), so maybe a rethink is worthwhile? It seems that the nub of this issue is that there are conceptually two types of =, one for datatype specific comparison, and one for optimizer statistical information calculation. However the system allows only the first, so if you don't (or can't) have one then you lose some possibly important optimization data. regards Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes: > It seems that the nub of this issue is that there are conceptually two > types of =, one for datatype specific comparison, and one for optimizer > statistical information calculation. However the system allows only the > first, so if you don't (or can't) have one then you lose some possibly > important optimization data. Nonsense. ANALYZE and the optimizer work with the datatype's usual notion of '=', whatever it is. It's possible that we should install a simplified code path in analyze.c that can collect width data for a column even in the absence of any '=' operator. I'm less than convinced that it's worth the trouble though. Do you have an actual example where such data would have affected a plan choice? regards, tom lane
On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> It's possible. =A0I don't really see a reason not to add an =3D operator >> for XML - does anyone else? > > Yes, that was considered and rejected, IIRC. =A0What is your definition > of equality for xml? I'd vote for !memcmp(). There can be (and probably already are) other ways to test for other kinds of equality, too, of course. --=20 Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yes, that was considered and rejected, IIRC. What is your definition >> of equality for xml? > I'd vote for !memcmp(). Surely not. xml is not text. regards, tom lane
On 10/06/10 02:17, Tom Lane wrote: > Mark Kirkwood<mark.kirkwood@catalyst.net.nz> writes: > >> It seems that the nub of this issue is that there are conceptually two >> types of =, one for datatype specific comparison, and one for optimizer >> statistical information calculation. However the system allows only the >> first, so if you don't (or can't) have one then you lose some possibly >> important optimization data. >> > Nonsense. ANALYZE and the optimizer work with the datatype's usual > notion of '=', whatever it is. > > Slow down the reading Tom... and read what I was actually saying - note the"conceptually". Of course the code uses the datatype's defined "=". > It's possible that we should install a simplified code path in analyze.c > that can collect width data for a column even in the absence of any '=' > operator. > Yeah I was thinking along the same lines. > Do you have an actual example where such data would have affected a > plan choice? > > > Not at the moment, I was thinking that anywhere that used such datatypes in a subquery of similar might be a likely case. I guess I was looking at this as a case of "this is an area where we have less accurate optimizer data that we could have", and thinking of ways to improve it. regards Mark