Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From: Tomas Vondra
Subject: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date:
Msg-id: a06caba6-6821-9d6c-d799-8db1dd0228d2@enterprisedb.com
In reply to: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Quan Zongliang <quanzongliang@yeah.net>)
Responses: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Quan Zongliang <quanzongliang@yeah.net>)
List: pgsql-hackers

On 6/16/23 11:25, Quan Zongliang wrote:
> 
> We have a small table with only 23 rows and 21 distinct values.
> 
> The resulting MCV and histogram are as follows:
> stanumbers1 | {0.08695652,0.08695652}
> stavalues1  | {v1,v2}
> stavalues2  |
> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
> 
> The row count was misestimated when this table was hash-joined with
> a large table (about 2 million rows):
> 
> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
> time=1.943..1528.983 rows=3902 loops=1)
> 

That's interesting. I wonder how the estimate gets this bad simply by
skipping value entries with a single row in the sample, given that for
those we know the per-value selectivity pretty well.

I guess the explanation has to be something strange happening when
estimating the join condition selectivity, where we combine MCVs from
both sides of the join (which has to be happening here, otherwise it
would not matter what gets into the MCV).

It'd be interesting to know what's in the other MCV, and what are the
other statistics for the attributes (ndistinct etc.).
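
Something like this would show all of that (table and column names are
placeholders, since the schema wasn't posted):

    SELECT tablename, attname, n_distinct, most_common_vals,
           most_common_freqs, histogram_bounds
    FROM pg_stats
    WHERE (tablename, attname) IN (('small_table', 'val'),
                                   ('large_table', 'val'));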

Or even better, a reproducer SQL script that builds two tables and then
joins them.
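
Roughly this shape, say (a sketch only - the names and both value
distributions are guesses, it merely mirrors the 23-rows/21-values case
described above):

    -- Sketch of a possible reproducer; distributions are assumptions.
    CREATE TABLE small_table (val int);
    INSERT INTO small_table
    SELECT g FROM generate_series(1, 21) AS g        -- 21 distinct values
    UNION ALL
    SELECT v FROM (VALUES (1), (2)) AS t(v);         -- two repeats -> 23 rows

    CREATE TABLE large_table (val int);
    INSERT INTO large_table
    SELECT (random() * 1000)::int FROM generate_series(1, 2000000);

    ANALYZE small_table;
    ANALYZE large_table;

    EXPLAIN ANALYZE
    SELECT count(*) FROM large_table l JOIN small_table s USING (val);

Whether that actually reproduces the bad estimate will of course depend
on the large table's distribution, which is why the real script would
be so useful.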

> The reason is that the MCV of the small table excludes values that
> occur only once. Putting them into the MCV gives the correct result.
> 
> Using the conservative condition samplerows <= attstattarget doesn't
> completely solve this problem, but it does handle this case.
> 
> After modification we get statistics without histogram:
> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
> stavalues1  | {v1,v2, ... }
> 
> And we have the right estimates:
> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
> time=1.447..1268.385 rows=3902 loops=1)
> 

I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)?

That would need to happen later, because we don't have the ndistinct
estimate yet at this point - we'd have to do the loop a bit later (or
likely twice).

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


