Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

Поиск
Список
Период
Сортировка
От Chris Cleveland
Тема Re: Why is FOR ORDER BY function getting called when the index is handling ordering?
Дата
Msg-id CABSN6VfLK5msEDSR8wPeMh_h3xNyr2Xs5NJK8+uOZp0bVxw7Hg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Why is FOR ORDER BY function getting called when the index is handling ordering?  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers

> ... there's no reason the system needs to know the actual float value
> returned by rank_match(), the ordering operator distance function. In any
> case, that value can only be calculated based on information in the index
> itself, and can't be calculated by rank_match().

This seems to me to be a very poorly designed concept.  An index
ordering operator is an optimization that the planner may or may
not choose to employ.  If you've designed your code on the assumption
that that's the only possible plan, it will break for any but the
most trivial queries.

So the use-case here is an enterprise search engine built using an index access method. A search engine ranks its result set based on many, many factors. Among them: token-specific weights, token statistics calculated across the entire index, field lens, average field lens calculated across the index, various kinds of linguistic analysis (verbs? nouns?), additional terms added to the query drawn from other parts of the index, fuzzy terms based on ngrams from across the index, and a great many other magic tricks. There are also index-specific parameters that are specified at index time, both as parameters to the op classes attached to each column, and options specified by CREATE INDEX ... WITH (...).

If the system must generate a score for ranking using a simple ORDER BY column_op_constant, it can't, because all that information within the index itself isn't available.

In any case, I'm uninterested in making the world safe for a
design that's going to fail if the planner doesn't choose an
indexscan on a specific index.  That's too fragile.


But that's the reality of search engines. It's also the reason that the built-in pg full-text search has real limitations.

This isn't just a search engine problem. *Any* index type that depends on whole-table statistics, or index options, or knowledge about other items in the index will not be able to calculate a proper score without access to the index. This applies to certain types of vector search. It could also apply to a recommendation engine.

In general, it hasn't been a problem for my project because I adjust costs to force the use of the index. (Yes, I know that doing this is controversial, but I have little choice, and it works.)

The only area where it has become a problem is in the creation of the junk column.

I do understand the need for the index to report the value of the sort key up the food chain, because yes, all kinds of arbitrary re-ordering could occur. We already have a mechanism for that, though: IndexScanDesc.xs_orderbyvals. If there were a way for the system to use that instead of a call to the ordering function, we'd be all set.

It would also be nice if the orderbyval could be made available in the projection. That way we could report the score() in the result set.

Matthias' response and links touch on some of these issues.

--
Chris Cleveland
312-339-2677 mobile

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Noah Misch
Дата:
Сообщение: Re: Weird test mixup
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: BitmapHeapScan streaming read user and prelim refactoring