Обсуждение: ToDo: KNN Search should to support DISTINCT clasuse?
Hello I should to search distinct values based on similarity postgres=# explain select nazobce, nazobce <-> 'Benešov' from obce order by nazobce <-> 'Benešov' limit 10 ; QUERY PLAN -------------------------------------------------------------------------------------------Limit (cost=0.00..0.86 rows=10width=10) -> Index Scan using obce_nazobce_idx on obce (cost=0.00..1433.14 rows=16644 width=10) Order By: (nazobce <-> 'Benešov'::text) (3 rows) Time: 0.576 ms but using DISTINCT breaks KNN searching optimization postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from obce order by nazobce <-> 'Benešov' limit 10 ; QUERY PLAN -----------------------------------------------------------------------------Limit (cost=600.45..600.47 rows=10 width=10) -> Sort (cost=600.45..613.80 rows=5341 width=10) Sort Key: ((nazobce <-> 'Benešov'::text)) -> HashAggregate (cost=418.27..485.03 rows=5341 width=10) -> Seq Scan on obce (cost=0.00..335.05 rows=16644width=10) (5 rows) Regards Pavel Stehule
Pavel Stehule <pavel.stehule@gmail.com> writes: > but using DISTINCT breaks KNN searching optimization > postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from > obce order by nazobce <-> 'Benešov' limit 10 Don't hold your breath. There are two ways the system could implement the DISTINCT clause: either sort and uniq, or hashaggregate. hashaggregate will destroy any input ordering, so there's no value in using the index as input. sort and uniq requires the input to be sorted by *all* the columns being distinct'ed, not just one, so again this index isn't useful. You could get a plan using the index if you only wanted the <-> output column, eg contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10; QUERY PLAN -------------------------------------------------------------------------------------Limit (cost=0.00..0.87 rows=10 width=12) -> Unique (cost=0.00..86.75 rows=1000 width=12) -> Index Scan using ti on test_trgm (cost=0.00..84.25rows=1000 width=12) Order By: (t <-> 'foo'::text) (4 rows) Perhaps it would be close enough to what you want to use DISTINCT ON: contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10; QUERY PLAN -------------------------------------------------------------------------------------Limit (cost=0.00..0.87 rows=10 width=12) -> Unique (cost=0.00..86.75 rows=1000 width=12) -> Index Scan using ti on test_trgm (cost=0.00..84.25rows=1000 width=12) Order By: (t <-> 'foo'::text) (4 rows) regards, tom lane
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> but using DISTINCT breaks KNN searching optimization > >> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from >> obce order by nazobce <-> 'Benešov' limit 10 > > Don't hold your breath. There are two ways the system could implement > the DISTINCT clause: either sort and uniq, or hashaggregate. > hashaggregate will destroy any input ordering, so there's no value in > using the index as input. sort and uniq requires the input to be sorted > by *all* the columns being distinct'ed, not just one, so again this > index isn't useful. You could get a plan using the index if you only > wanted the <-> output column, eg > > contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10; > QUERY PLAN > ------------------------------------------------------------------------------------- > Limit (cost=0.00..0.87 rows=10 width=12) > -> Unique (cost=0.00..86.75 rows=1000 width=12) > -> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12) > Order By: (t <-> 'foo'::text) > (4 rows) > > Perhaps it would be close enough to what you want to use DISTINCT ON: > > contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10; > QUERY PLAN > ------------------------------------------------------------------------------------- > Limit (cost=0.00..0.87 rows=10 width=12) > -> Unique (cost=0.00..86.75 rows=1000 width=12) > -> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12) > Order By: (t <-> 'foo'::text) > (4 rows) > > regards, tom lane good tip - it's working thank you Regards Pavel
On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> but using DISTINCT breaks KNN searching optimization > >> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from >> obce order by nazobce <-> 'Benešov' limit 10 > > Don't hold your breath. There are two ways the system could implement > the DISTINCT clause: either sort and uniq, or hashaggregate. > hashaggregate will destroy any input ordering, so there's no value in > using the index as input. Isn't that an implementation limitation though, rather than a fundamental limitation? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Don't hold your breath. There are two ways the system could implement >> the DISTINCT clause: either sort and uniq, or hashaggregate. >> hashaggregate will destroy any input ordering, so there's no value in >> using the index as input. > Isn't that an implementation limitation though, rather than a > fundamental limitation? Perhaps, but it's not a simple one to surmount, and I'm dubious about putting the amount of work that'd be required into such a corner case. regards, tom lane
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Don't hold your breath. There are two ways the system could implement >>> the DISTINCT clause: either sort and uniq, or hashaggregate. >>> hashaggregate will destroy any input ordering, so there's no value in >>> using the index as input. > >> Isn't that an implementation limitation though, rather than a >> fundamental limitation? > > Perhaps, but it's not a simple one to surmount, and I'm dubious about > putting the amount of work that'd be required into such a corner case. I don't think so this use case is too special - but workaround working well Regards Pavel > > regards, tom lane
On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Don't hold your breath. There are two ways the system could implement > the DISTINCT clause: either sort and uniq, or hashaggregate. > hashaggregate will destroy any input ordering, so there's no value in > using the index as input. sort and uniq requires the input to be sorted > by *all* the columns being distinct'ed, not just one, so again this > index isn't useful. We already have some bits that understand functional dependencies for distinct/group by don't we? Do we detect that col <-> 'foo' is functionally dependent on col? If so is it possible to construct a multicolumn index that can produce an ordering like [col <-> 'foo', col] which could be used to get distinct values of col in the knn order (since the first column is functionally dependent on the second it can be ignored for grouping purposes). Not that we can do this now but I wonder whether a lot of the pieces are already there and just need to be hooked up together. -- greg
On Mon, Oct 22, 2012 at 7:09 PM, Greg Stark <stark@mit.edu> wrote: > On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Don't hold your breath. There are two ways the system could implement >> the DISTINCT clause: either sort and uniq, or hashaggregate. >> hashaggregate will destroy any input ordering, so there's no value in >> using the index as input. sort and uniq requires the input to be sorted >> by *all* the columns being distinct'ed, not just one, so again this >> index isn't useful. > > We already have some bits that understand functional dependencies for > distinct/group by don't we? Do we detect that col <-> 'foo' is > functionally dependent on col? If so is it possible to construct a > multicolumn index that can produce an ordering like [col <-> 'foo', > col] which could be used to get distinct values of col in the knn > order (since the first column is functionally dependent on the second > it can be ignored for grouping purposes). > > Not that we can do this now but I wonder whether a lot of the pieces > are already there and just need to be hooked up together. I think the real issue is that a hash aggregate doesn't emit any rows until the entire input is read, so it doesn't play nice with LIMIT. In general there's no other possible strategy, because you could get another row belonging to an existing group at any time right up through the end of the input. However, when the HashAgg node is only implementing DISTINCT (ON), you can emit each new row as soon as you see it, and just make the hash table entry to be certain you don't emit it again. I think someone recently submitted a patch along these lines and we rejected it because the use case was too thin ... but this example is making me think that the use case might not be all that thin after all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 23, 2012 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > through the end of the input. However, when the HashAgg node is only > implementing DISTINCT (ON), you can emit each new row as soon as you > see it, and just make the hash table entry to be certain you don't > emit it again. I think someone recently submitted a patch along these > lines and we rejected it because the use case was too thin ... but > this example is making me think that the use case might not be all > that thin after all. If anyone wants to play around, the initial patch is available here: http://archives.postgresql.org/message-id/CA%2BCSw_uE-RCyQd_bXJNe%3DusrXkq%2BkeFrQrahkc%2B8ou%2BWs4Y%3DVw%40mail.gmail.com It looks to me like it should work for the distinct on KNN search case out of the box. However it needs some planner adjustment for more complex plans and maybe some refactoring to lose duplication in the executor to be worth considering committing. Regards, Ants Aasma
Pavel Stehule wrote: > 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>: >> Perhaps it would be close enough to what you want to use DISTINCT ON: >> >> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit10; > good tip - it's working If two or more values happen to be at exactly the same distance, wouldn't you just get one of them? -Kevin
"Kevin Grittner" <kgrittn@mail.com> writes: > Pavel Stehule wrote: >> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>: >>> Perhaps it would be close enough to what you want to use DISTINCT ON: >>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit10; >> good tip - it's working > If two or more values happen to be at exactly the same distance, > wouldn't you just get one of them? Yeah, that is a hazard. I'm not sure whether <->'s results are sufficiently quantized to make that a big problem in practice. regards, tom lane
Tom Lane wrote: > "Kevin Grittner" <kgrittn@mail.com> writes: > > Pavel Stehule wrote: > >> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>: > >>> Perhaps it would be close enough to what you want to use DISTINCT ON: > >>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit10; > > >> good tip - it's working > > > If two or more values happen to be at exactly the same distance, > > wouldn't you just get one of them? > > Yeah, that is a hazard. I'm not sure whether <->'s results are > sufficiently quantized to make that a big problem in practice. It doesn't seem too far-fetched for trigram queries: test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm); nm | ?column? ----------+----------anderson | 0.4andersen | 0.4andersly | 0.4 (3 rows) test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm)order by nm <-> 'anders' limit 3; nm | ?column? ----------+----------anderson | 0.4 (1 row) -Kevin
2012/10/25 Kevin Grittner <kgrittn@mail.com>: > Tom Lane wrote: >> "Kevin Grittner" <kgrittn@mail.com> writes: >> > Pavel Stehule wrote: >> >> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>: >> >>> Perhaps it would be close enough to what you want to use DISTINCT ON: >> >>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit10; >> >> >> good tip - it's working >> >> > If two or more values happen to be at exactly the same distance, >> > wouldn't you just get one of them? >> >> Yeah, that is a hazard. I'm not sure whether <->'s results are >> sufficiently quantized to make that a big problem in practice. > > It doesn't seem too far-fetched for trigram queries: > > test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm); > nm | ?column? > ----------+---------- > anderson | 0.4 > andersen | 0.4 > andersly | 0.4 > (3 rows) > > test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm)order by nm <-> 'anders' limit 3; > nm | ?column? > ----------+---------- > anderson | 0.4 > (1 row) yes it is issue - but I am thinking about simple "fuzzy" searching, so exact result is not strongly expected. On second hand if SELECT DISTINCT * will be supported it should be nice. Pavel > > -Kevin