Re: BUG #13614: Optimizer choosing "Nested Loop" due to bad estimate of "rows=1"

From: Matthew Bellew
Subject: Re: BUG #13614: Optimizer choosing "Nested Loop" due to bad estimate of "rows=1"
Date:
Msg-id: loom.20151014T055051-633@post.gmane.org
In reply to: BUG #13614: Optimizer choosing "Nested Loop" due to bad estimate of "rows=1"  (matthewb@labkey.com)
List: pgsql-bugs
This issue has been affecting a lot of our customers lately (perhaps nested
loop joins have become even more common in 9.4.5?).  I have created a repro
to show the problem.

Given two tables A and B joined together, with a filter of selectivity Asel on
A and Bsel on B, it is very optimistic to assume that the selectivity of the
rows in A JOIN B is Asel * Bsel.  But this seems to be what Postgres assumes.
A very conservative estimate would be that the join selectivity is
min(Asel, Bsel).  The conservative estimate is probably not ideal, but it is
less likely to lead to a pathological plan (one that is orders of magnitude
worse than if no filters had been applied at all) than the very optimistic
estimate.

Just guessing, the geometric mean of the optimistic and pessimistic estimates,
sqrt(min(Asel,Bsel) * (Asel*Bsel)), might be close to a good compromise.
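A quick numeric sketch (in Python, since the arithmetic is the point, not the
SQL) with hypothetical selectivities Asel = Bsel = 0.05 shows how far apart
the three estimates sit:

```python
import math

# Hypothetical per-table filter selectivities: 5% of rows pass each filter.
a_sel, b_sel = 0.05, 0.05

optimistic = a_sel * b_sel                         # what Postgres appears to assume
pessimistic = min(a_sel, b_sel)                    # very conservative upper bound
compromise = math.sqrt(pessimistic * optimistic)   # geometric mean of the two

print(round(optimistic, 4))   # 0.0025 -- a 20x smaller guess than the compromise
print(round(pessimistic, 4))  # 0.05
print(round(compromise, 5))   # 0.01118
```

On a 100,000-row table this is the difference between estimating 250 joined
rows (optimistic), 5,000 (pessimistic), and roughly 1,118 (compromise).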

Regardless of the exact estimation, it seems clear that:

a) the current estimate for joins over filtered tables is way too optimistic;
b) the estimate should not be allowed to float down to 1.  An estimate of 1
should only be used to mean exactly 1; any other estimate should be capped at
some sort of minimum to help avoid choosing joins that degrade badly if the
estimate is wrong.
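A minimal sketch of the capping idea in point (b); the function name,
parameters, and the floor of 2 rows are all hypothetical, the point is only
that an estimate of 1 should mean *exactly* 1:

```python
def capped_row_estimate(estimate, known_exact=False, min_rows=2):
    """Clamp a row estimate to a floor unless it is provably exact."""
    if known_exact:
        return estimate              # a provably-exact count may legitimately be 1
    return max(estimate, min_rows)   # never let a mere guess float down to 1

print(capped_row_estimate(1))                    # 2 -- a guess of 1 is clamped
print(capped_row_estimate(1, known_exact=True))  # 1 -- exactly-one stays 1
print(capped_row_estimate(500))                  # 500 -- larger guesses pass through
```

With a floor like this, a nested loop chosen on the strength of "rows=1" would
at least be costed against the possibility of a second row.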


SAMPLE 1 - no UNIQUE indexes; on my machine, despite the estimates, the
optimizer chooses MERGE JOINs and results are very fast.

DROP TABLE IF EXISTS  A;
DROP TABLE IF EXISTS  B;
DROP TABLE IF EXISTS  C;
DROP TABLE IF EXISTS  D;
DROP TABLE IF EXISTS  E;

SELECT * INTO A
FROM
(
  SELECT seq, 100000-seq as rev, (seq+111) % 100000 as rot,
  rand, rand%10 as ten, rand%100 as hundred, rand%1000 as thousand,
  md5(random()::text) AS descr
  FROM (SELECT generate_series(1,100000) AS seq,
               cast(floor(random()*10000) as int4) as rand) _
) _;
--CREATE UNIQUE INDEX ON A (seq);
--CREATE UNIQUE INDEX ON A (rev);
--CREATE UNIQUE INDEX ON A (rot);
CREATE INDEX ON A (hundred);

SELECT * INTO B FROM A;
--CREATE UNIQUE INDEX ON B (seq);
--CREATE UNIQUE INDEX ON B (rev);
--CREATE UNIQUE INDEX ON B (rot);
CREATE INDEX ON B (hundred);

SELECT * INTO C FROM A;
--CREATE UNIQUE INDEX ON C (seq);
--CREATE UNIQUE INDEX ON C (rev);
--CREATE UNIQUE INDEX ON C (rot);
CREATE INDEX ON C (hundred);

SELECT * INTO D FROM A;
--CREATE UNIQUE INDEX ON D (seq);
--CREATE UNIQUE INDEX ON D (rev);
--CREATE UNIQUE INDEX ON D (rot);
CREATE INDEX ON D (hundred);


SELECT * INTO E FROM A;
--CREATE UNIQUE INDEX ON E (seq);
--CREATE UNIQUE INDEX ON E (rev);
--CREATE UNIQUE INDEX ON E (rot);
CREATE INDEX ON E (hundred);


SELECT _a.seq, _b.rand, _c.ten, _d.hundred, _e.descr
FROM
   (SELECT * FROM A WHERE hundred in (1,2,3,4,5)) _a
   JOIN (SELECT * FROM B WHERE hundred in (1,2,3,4,5)) _b ON  _a.seq = _b.seq
   JOIN (SELECT * FROM C WHERE hundred in (1,2,3,4,5)) _c ON _b.rev = _c.rev
   JOIN (SELECT * FROM D WHERE hundred in (1,2,3,4,5)) _d ON _c.seq = _d.seq
   JOIN (SELECT * FROM E WHERE hundred in (1,2,3,4,5)) _e ON _d.rot = _e.rot;



SAMPLE 2 - with UNIQUE indexes; on my machine the optimizer starts using
NESTED LOOPS once the row estimate gets down to 1.  This case isn't tragically
bad (worse than SAMPLE 1), but we hit pathological cases in the wild that are
exceedingly hard to avoid.


DROP TABLE IF EXISTS  A;
DROP TABLE IF EXISTS  B;
DROP TABLE IF EXISTS  C;
DROP TABLE IF EXISTS  D;
DROP TABLE IF EXISTS  E;

SELECT * INTO A
FROM
(
  SELECT seq, 100000-seq as rev, (seq+111) % 100000 as rot,
  rand, rand%10 as ten, rand%100 as hundred, rand%1000 as thousand,
  md5(random()::text) AS descr
  FROM (SELECT generate_series(1,100000) AS seq,
               cast(floor(random()*10000) as int4) as rand) _
) _;
CREATE UNIQUE INDEX ON A (seq);
CREATE UNIQUE INDEX ON A (rev);
CREATE UNIQUE INDEX ON A (rot);
CREATE INDEX ON A (hundred);

SELECT * INTO B FROM A;
CREATE UNIQUE INDEX ON B (seq);
CREATE UNIQUE INDEX ON B (rev);
CREATE UNIQUE INDEX ON B (rot);
CREATE INDEX ON B (hundred);

SELECT * INTO C FROM A;
CREATE UNIQUE INDEX ON C (seq);
CREATE UNIQUE INDEX ON C (rev);
CREATE UNIQUE INDEX ON C (rot);
CREATE INDEX ON C (hundred);

SELECT * INTO D FROM A;
CREATE UNIQUE INDEX ON D (seq);
CREATE UNIQUE INDEX ON D (rev);
CREATE UNIQUE INDEX ON D (rot);
CREATE INDEX ON D (hundred);


SELECT * INTO E FROM A;
CREATE UNIQUE INDEX ON E (seq);
CREATE UNIQUE INDEX ON E (rev);
CREATE UNIQUE INDEX ON E (rot);
CREATE INDEX ON E (hundred);


SELECT _a.seq, _b.rand, _c.ten, _d.hundred, _e.descr
FROM
   (SELECT * FROM A WHERE hundred in (1,2,3,4,5)) _a
   JOIN (SELECT * FROM B WHERE hundred in (1,2,3,4,5)) _b ON  _a.seq = _b.seq
   JOIN (SELECT * FROM C WHERE hundred in (1,2,3,4,5)) _c ON _b.rev = _c.rev
   JOIN (SELECT * FROM D WHERE hundred in (1,2,3,4,5)) _d ON _c.seq = _d.seq
   JOIN (SELECT * FROM E WHERE hundred in (1,2,3,4,5)) _e ON _d.rot = _e.rot;
