Обсуждение: Doing better at HINTing an appropriate column within errorMissingColumn()

Поиск

Список

Период

Сортировка

Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

27 марта 2014 г., 22:10:23

With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error to the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.

I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name. psql tab
completion is good, but not so good that this doesn't happen all the
time. It's good practice to consistently name columns and tables such
that it's possible to intuit the names of columns from the names of
tables and so on, but it's still pretty common to forget if a column
name from the table "orders" is "order_id", "orderid", or "ordersid",
particularly if you're someone who regularly interacts with many
databases. This problem is annoying in a low intensity kind of way.

Consider the following sample sessions of mine, made with the
dellstore2 sample database:

[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
orderid | orderdate | customerid | netamount | tax | totalamount |
orderlineid | orderid | prod_id | quantity | orderdate

---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------
1 | 2004-01-27 | 7888 | 313.24 | 25.84 | 339.08 |
1 | 1 | 9117 | 1 | 2004-01-27
(1 row)

[local]/postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2999
[local]/postgres=# select ol.ordersid from orders o join orderlines ol
on o.orderid = ol.orderid limit 1;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989

We try to give the most useful possible HINT here, charging extra for
a non-matching alias, and going through the range table in order and
preferring the first column observed to any subsequent column whose
name is of the same distance as an earlier Var. The fuzzy string
matching works well enough that it seems possible in practice to
successfully have the parser make the right suggestion, even when the
user's original guess was fairly far off. I've found it works best to
charge half as much for a character deletion, so that's what is
charged.

I have some outstanding concerns about the proposed patch:

* It may be the case that dense logosyllabic or morphographic writing
systems, for example Kanji might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.

* Perhaps I should have moved the Levenshtein distance functions into
core and be done with it. I thought that given the present restriction
that the implementation imposes on source and target string lengths,
it would be best to leave the user-facing SQL functions in contrib.
That restriction is not relevant to the internal use of Levenshtein
distance added here, though.

Thoughts?
--
Peter Geoghegan

Вложения

levenshtein_column_hint.v1.2014_03_27.patch.gz

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Pavel Stehule

Дата:

28 марта 2014 г., 11:01:15

Hello

I see only one risk - it can do some slowdown of exception processing.

Sometimes you can have a code like

BEGIN

WHILE ..

LOOP

BEGIN

INSERT INTO ...

EXCEPTION WHEN ..

; /* ignore this error */

END;

END LOOP;

without this risks, proposed feature is nice, but should be fast

Regards

Pavel

2014-03-27 20:10 GMT+01:00 Peter Geoghegan <pg@heroku.com>:

With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error to the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.

I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name. psql tab
completion is good, but not so good that this doesn't happen all the
time. It's good practice to consistently name columns and tables such
that it's possible to intuit the names of columns from the names of
tables and so on, but it's still pretty common to forget if a column
name from the table "orders" is "order_id", "orderid", or "ordersid",
particularly if you're someone who regularly interacts with many
databases. This problem is annoying in a low intensity kind of way.

Consider the following sample sessions of mine, made with the
dellstore2 sample database:

[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
orderid | orderdate | customerid | netamount | tax | totalamount |
orderlineid | orderid | prod_id | quantity | orderdate
---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------
1 | 2004-01-27 | 7888 | 313.24 | 25.84 | 339.08 |
1 | 1 | 9117 | 1 | 2004-01-27
(1 row)

[local]/postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2999
[local]/postgres=# select ol.ordersid from orders o join orderlines ol
on o.orderid = ol.orderid limit 1;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989

We try to give the most useful possible HINT here, charging extra for
a non-matching alias, and going through the range table in order and
preferring the first column observed to any subsequent column whose
name is of the same distance as an earlier Var. The fuzzy string
matching works well enough that it seems possible in practice to
successfully have the parser make the right suggestion, even when the
user's original guess was fairly far off. I've found it works best to
charge half as much for a character deletion, so that's what is
charged.

I have some outstanding concerns about the proposed patch:

* It may be the case that dense logosyllabic or morphographic writing
systems, for example Kanji might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.

* Perhaps I should have moved the Levenshtein distance functions into
core and be done with it. I thought that given the present restriction
that the implementation imposes on source and target string lengths,
it would be best to leave the user-facing SQL functions in contrib.
That restriction is not relevant to the internal use of Levenshtein
distance added here, though.

Thoughts?
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

28 марта 2014 г., 11:22:14

On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I see only one risk - it can do some slowdown of exception processing.

I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in
procedural code like that in practice. In any case it's worth noting
that I continually pass back a "max" to the Levenshtein distance
implementation, which is the current shortest distance observed. The
implementation is therefore not obliged to exhaustively find a
distance that is already known to be of no use. See commit 604ab0.


-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Pavel Stehule

Дата:

28 марта 2014 г., 11:25:56

2014-03-28 9:22 GMT+01:00 Peter Geoghegan <pg@heroku.com>:

On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I see only one risk - it can do some slowdown of exception processing.

I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in
procedural code like that in practice. In any case it's worth noting
that I continually pass back a "max" to the Levenshtein distance
implementation, which is the current shortest distance observed. The
implementation is therefore not obliged to exhaustively find a
distance that is already known to be of no use. See commit 604ab0.

if it is related to ERRCODE_UNDEFINED_COLUMN then it should be ok (from performance perspective)

but second issue can be usage from plpgsql - where is mix SQL identifiers and plpgsql variables.

Pavel

--
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Oleg Bartunov

Дата:

28 марта 2014 г., 11:37:29

<p dir="ltr">Very interesting idea,  I'd think about optionally add similarity hinting support to psql tab.  With, 
say, 80% of similarity matching,  it shouldn't be  very annoying.  For interactive usage there is no risk of slowdown.
<divclass="gmail_quote">On Mar 27, 2014 11:11 PM, "Peter Geoghegan" <<a
href="mailto:pg@heroku.com">pg@heroku.com</a>>wrote:<br type="attribution" /><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> With the addition of LATERAL subqueries, Tom
fixedup the mechanism<br /> for keeping track of which relations are visible for column references<br /> while the FROM
clauseis being scanned. That allowed<br /> errorMissingColumn() to give a more useful error to the one produced<br />
bythe prior coding of that mechanism, with an errhint sometimes<br /> proffering: 'There is a column named "foo" in
table"bar", but it<br /> cannot be referenced from this part of the query'.<br /><br /> I wondered how much further
thiscould be taken. Attached patch<br /> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code<br />
intocore without actually moving the relevant SQL functions too. That<br /> change allowed me to modify
errorMissingColumn()to make more useful<br /> suggestions as to what might have been intended under other<br />
circumstances,like when someone fat-fingers a column name. psql tab<br /> completion is good, but not so good that this
doesn'thappen all the<br /> time. It's good practice to consistently name columns and tables such<br /> that it's
possibleto intuit the names of columns from the names of<br /> tables and so on, but it's still pretty common to forget
ifa column<br /> name from the table "orders" is "order_id", "orderid", or "ordersid",<br /> particularly if you're
someonewho regularly interacts with many<br /> databases. This problem is annoying in a low intensity kind of way.<br
/><br/> Consider the following sample sessions of mine, made with the<br /> dellstore2 sample database:<br /><br />
[local]/postgres=#select * from orders o join orderlines ol on<br /> o.orderid = ol.orderids limit 1;<br /> ERROR:
 42703:column ol.orderids does not exist<br /> LINE 1: ...* from orders o join orderlines ol on o.orderid =
ol.orderid...<br/>                                                              ^<br /> HINT:  Perhaps you meant to
referencethe column "ol"."orderid".<br /> LOCATION:  errorMissingColumn, parse_relation.c:2989<br /> [local]/postgres=#
select* from orders o join orderlines ol on<br /> o.orderid = ol.orderid limit 1;<br />  orderid | orderdate  |
customerid| netamount |  tax  | totalamount |<br /> orderlineid | orderid | prod_id | quantity | orderdate<br />
---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------<br
/>       1 | 2004-01-27 |       7888 |    313.24 | 25.84 |      339.08 |<br />           1 |       1 |    9117 |      
 1| 2004-01-27<br /> (1 row)<br /><br /> [local]/postgres=# select ordersid from orders o join orderlines ol on<br />
o.orderid= ol.orderid limit 1;<br /> ERROR:  42703: column "ordersid" does not exist<br /> LINE 1: select ordersid from
orderso join orderlines ol on o.orderi...<br />                ^<br /> HINT:  Perhaps you meant to reference the column
"o"."orderid".<br/> LOCATION:  errorMissingColumn, parse_relation.c:2999<br /> [local]/postgres=# select ol.ordersid
fromorders o join orderlines ol<br /> on o.orderid = ol.orderid limit 1;<br /> ERROR:  42703: column ol.ordersid does
notexist<br /> LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...<br />                ^<br />
HINT: Perhaps you meant to reference the column "ol"."orderid".<br /> LOCATION:  errorMissingColumn,
parse_relation.c:2989<br/><br /> We try to give the most useful possible HINT here, charging extra for<br /> a
non-matchingalias, and going through the range table in order and<br /> preferring the first column observed to any
subsequentcolumn whose<br /> name is of the same distance as an earlier Var. The fuzzy string<br /> matching works well
enoughthat it seems possible in practice to<br /> successfully have the parser make the right suggestion, even when
the<br/> user's original guess was fairly far off. I've found it works best to<br /> charge half as much for a
characterdeletion, so that's what is<br /> charged.<br /><br /> I have some outstanding concerns about the proposed
patch:<br/><br /> * It may be the case that dense logosyllabic or morphographic writing<br /> systems, for example
Kanjimight consistently present, say, Japanese<br /> users with a suggestion that just isn't very useful, to the point
of<br/> being annoying. Perhaps some Japanese hackers can comment on the<br /> actual risks here.<br /><br /> * Perhaps
Ishould have moved the Levenshtein distance functions into<br /> core and be done with it. I thought that given the
presentrestriction<br /> that the implementation imposes on source and target string lengths,<br /> it would be best to
leavethe user-facing SQL functions in contrib.<br /> That restriction is not relevant to the internal use of
Levenshtein<br/> distance added here, though.<br /><br /> Thoughts?<br /> --<br /> Peter Geoghegan<br /><br /><br />
--<br/> Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your
subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div>

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Albe Laurenz

Дата:

28 марта 2014 г., 11:47:45

Peter Geoghegan wrote:
> With the addition of LATERAL subqueries, Tom fixed up the mechanism
> for keeping track of which relations are visible for column references
> while the FROM clause is being scanned. That allowed
> errorMissingColumn() to give a more useful error to the one produced
> by the prior coding of that mechanism, with an errhint sometimes
> proffering: 'There is a column named "foo" in table "bar", but it
> cannot be referenced from this part of the query'.
> 
> I wondered how much further this could be taken. Attached patch
> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
> into core without actually moving the relevant SQL functions too. That
> change allowed me to modify errorMissingColumn() to make more useful
> suggestions as to what might have been intended under other
> circumstances, like when someone fat-fingers a column name.

> [local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1;
> ERROR:  42703: column ol.orderids does not exist
> LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
>                                                              ^
> HINT:  Perhaps you meant to reference the column "ol"."orderid".

This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim

Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.

Yours,
Laurenz Albe

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Christoph Berg

Дата:

28 марта 2014 г., 12:55:08

Re: Albe Laurenz 2014-03-28 <A737B7A37273E048B164557ADEF4A58B17CE8DEA@ntex2010i.host.magwien.gv.at>
> > ERROR:  42703: column ol.orderids does not exist
> > LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
> >                                                              ^
> > HINT:  Perhaps you meant to reference the column "ol"."orderid".
>
> This sounds like a mild version of DWIM:
> http://www.jargondb.org/glossary/dwim
>
> Maybe it is just me, but I get uncomfortable when a program tries
> to second-guess what I really want.

I find it very annoying when zsh asks me "did you mean foo [y/n]" and
I need to confirm that, but I'd find a mere HINT that I can easily
ignore a very useful feature. +1 for the idea.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

28 марта 2014 г., 15:57:08

On Fri, Mar 28, 2014 at 4:10 AM, Peter Geoghegan <pg@heroku.com> wrote:
> With the addition of LATERAL subqueries, Tom fixed up the mechanism
> for keeping track of which relations are visible for column references
> while the FROM clause is being scanned. That allowed
> errorMissingColumn() to give a more useful error to the one produced
> by the prior coding of that mechanism, with an errhint sometimes
> proffering: 'There is a column named "foo" in table "bar", but it
> cannot be referenced from this part of the query'.
>
> I wondered how much further this could be taken. Attached patch
> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
> into core without actually moving the relevant SQL functions too. That
> change allowed me to modify errorMissingColumn() to make more useful
> suggestions as to what might have been intended under other
> circumstances, like when someone fat-fingers a column name. psql tab
> completion is good, but not so good that this doesn't happen all the
> time. It's good practice to consistently name columns and tables such
> that it's possible to intuit the names of columns from the names of
> tables and so on, but it's still pretty common to forget if a column
> name from the table "orders" is "order_id", "orderid", or "ordersid",
> particularly if you're someone who regularly interacts with many
> databases. This problem is annoying in a low intensity kind of way.
>
> Consider the following sample sessions of mine, made with the
> dellstore2 sample database:
>
> [local]/postgres=# select * from orders o join orderlines ol on
> o.orderid = ol.orderids limit 1;
> ERROR:  42703: column ol.orderids does not exist
> LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
>                                                              ^
> HINT:  Perhaps you meant to reference the column "ol"."orderid".
> LOCATION:  errorMissingColumn, parse_relation.c:2989
> [local]/postgres=# select * from orders o join orderlines ol on
> o.orderid = ol.orderid limit 1;
>  orderid | orderdate  | customerid | netamount |  tax  | totalamount |
> orderlineid | orderid | prod_id | quantity | orderdate
>
---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------
>        1 | 2004-01-27 |       7888 |    313.24 | 25.84 |      339.08 |
>           1 |       1 |    9117 |        1 | 2004-01-27
> (1 row)
>
> [local]/postgres=# select ordersid from orders o join orderlines ol on
> o.orderid = ol.orderid limit 1;
> ERROR:  42703: column "ordersid" does not exist
> LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
>                ^
> HINT:  Perhaps you meant to reference the column "o"."orderid".
> LOCATION:  errorMissingColumn, parse_relation.c:2999
> [local]/postgres=# select ol.ordersid from orders o join orderlines ol
> on o.orderid = ol.orderid limit 1;
> ERROR:  42703: column ol.ordersid does not exist
> LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
>                ^
> HINT:  Perhaps you meant to reference the column "ol"."orderid".
> LOCATION:  errorMissingColumn, parse_relation.c:2989
>
> We try to give the most useful possible HINT here, charging extra for
> a non-matching alias, and going through the range table in order and
> preferring the first column observed to any subsequent column whose
> name is of the same distance as an earlier Var. The fuzzy string
> matching works well enough that it seems possible in practice to
> successfully have the parser make the right suggestion, even when the
> user's original guess was fairly far off. I've found it works best to
> charge half as much for a character deletion, so that's what is
> charged.
What about the overhead that this processing creates if error
processing needs to scan a schema with let's say hundreds of tables?

> * It may be the case that dense logosyllabic or morphographic writing
> systems, for example Kanji might consistently present, say, Japanese
> users with a suggestion that just isn't very useful, to the point of
> being annoying. Perhaps some Japanese hackers can comment on the
> actual risks here.
As long as Hiragana-only words (basic alphabet for Japanese words),
and more particularly Katakana only-words (to write phonetically
foreign words) are compared (even Kanji-only things compared),
Levenstein could play its role pretty well. But once a comparison is
made with two words using different alphabet, well Levenstein is not
going to work well. A simple example is 'ramen' (Japanese noodles),
that you can find written sometimes in Hiragana, or even in Katakana,
and here Levenstein performs poorly:
=# select levenshtein('ラーメン', 'らあめん');levenshtein
-------------          4
(1 row)
Regards,
-- 
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

28 марта 2014 г., 20:51:58

On Fri, Mar 28, 2014 at 5:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> What about the overhead that this processing creates if error
> processing needs to scan a schema with let's say hundreds of tables?

It doesn't work that way. I've extended searchRangeTableForCol() so
that when it calls scanRTEForColumn(), it considers Levenshtein
distance, and not just plain string equality, which is what happens
today. The code only looks through ParseState.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

01 апреля 2014 г., 17:25:37

On Fri, Mar 28, 2014 at 4:47 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> Peter Geoghegan wrote:
>> With the addition of LATERAL subqueries, Tom fixed up the mechanism
>> for keeping track of which relations are visible for column references
>> while the FROM clause is being scanned. That allowed
>> errorMissingColumn() to give a more useful error to the one produced
>> by the prior coding of that mechanism, with an errhint sometimes
>> proffering: 'There is a column named "foo" in table "bar", but it
>> cannot be referenced from this part of the query'.
>>
>> I wondered how much further this could be taken. Attached patch
>> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
>> into core without actually moving the relevant SQL functions too. That
>> change allowed me to modify errorMissingColumn() to make more useful
>> suggestions as to what might have been intended under other
>> circumstances, like when someone fat-fingers a column name.
>
>> [local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1;
>> ERROR:  42703: column ol.orderids does not exist
>> LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
>>                                                              ^
>> HINT:  Perhaps you meant to reference the column "ol"."orderid".
>
> This sounds like a mild version of DWIM:
> http://www.jargondb.org/glossary/dwim
>
> Maybe it is just me, but I get uncomfortable when a program tries
> to second-guess what I really want.

It's not really DWIM, because the backend is still throwing an error.
It's just trying to help you sort out the error, along the way.
Still, I share some of your discomfort.  I see Peter's patch as an
example of a broader class of things that we could do - but I'm not
altogether sure that we want to do them.  There's a risk of adding not
only CPU cycles but also clutter.  If we do things that encourage
people to crank the log verbosity down, I think that's going to be bad
more often than it's good.  It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible.  So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop.  That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

01 апреля 2014 г., 21:04:54

On Tue, Apr 1, 2014 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> There's a risk of adding not
> only CPU cycles but also clutter.  If we do things that encourage
> people to crank the log verbosity down, I think that's going to be bad
> more often than it's good.

While I share your concern here, I think that this is something that
is only likely to be seen in an interactive psql session, where it is
seen quite frequently. I am reasonably confident that it's highly
unusual to see ERRCODE_UNDEFINED_COLUMN in other settings. Not having
to do a mental context switch when writing an ad-hoc query has
considerable value. Even C compilers like Clang have this kind of
feedback.   This is a patch that was written out of personal
frustration with the experience of interacting with many different
databases. Things like the Python REPL don't do so much of this kind
of thing, but presumably that's because of Python's dynamic typing.
This is a HINT that can be given with fairly high confidence that
it'll be helpful - there just won't be that many things that the user
could have meant to choose from. I think it's even useful when the
suggested column is distant from the original suggestion (i.e.
errorMissingColumn() offers only what is clearly a "wild guess"),
because then the user knows that he or she has got it quite wrong.
Frequently, this will be because the wrong synonym for what should
have been written was used.

> It strains credulity to think that this
> patch alone would have that effect, but there might be quite a few
> similar improvements that are possible.  So I think it would be good
> to consider how far we want to go in this direction and where we think
> we might want to stop.  That's not to say, let's not ever do this,
> just, let's think carefully about where we want to end up.

Fair enough.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Jim Nasby

Дата:

02 апреля 2014 г., 00:38:38

On 4/1/14, 1:04 PM, Peter Geoghegan wrote:
>> It strains credulity to think that this
>> >patch alone would have that effect, but there might be quite a few
>> >similar improvements that are possible.  So I think it would be good
>> >to consider how far we want to go in this direction and where we think
>> >we might want to stop.  That's not to say, let's not ever do this,
>> >just, let's think carefully about where we want to end up.
> Fair enough.

I agree with the concern, but also have to say that I can't count how many times I could have used this. A big +1, at
leastin this case.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Greg Stark

Дата:

02 апреля 2014 г., 23:09:18

<p dir="ltr">Normally I'm not for adding gucs that just gate new features. But I think a simple guc to turn this on or
offwould be fine and alleviate any concerns. I think users would appreciate it quite a lot<p dir="ltr">It would even
havea positive effect of helping raise awareness of the feature. I often scan the list of config options to get an idea
ofnew features when I'm installing new software or upgrading.<p dir="ltr">-- <br /> greg<div class="gmail_quote">On 1
Apr2014 17:38, "Jim Nasby" <<a href="mailto:jim@nasby.net">jim@nasby.net</a>> wrote:<br type="attribution"
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On 4/1/14, 1:04
PM,Peter Geoghegan wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex"><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">It strains credulity to think that this<br /> >patch alone would have that effect, but there
mightbe quite a few<br /> >similar improvements that are possible.  So I think it would be good<br /> >to
considerhow far we want to go in this direction and where we think<br /> >we might want to stop.  That's not to say,
let'snot ever do this,<br /> >just, let's think carefully about where we want to end up.<br /></blockquote> Fair
enough.<br/></blockquote><br /> I agree with the concern, but also have to say that I can't count how many times I
couldhave used this. A big +1, at least in this case.<br /> -- <br /> Jim C. Nasby, Data Architect                    
 <a href="mailto:jim@nasby.net" target="_blank">jim@nasby.net</a><br /> 512.569.9461 (cell)                         <a
href="http://jim.nasby.net"target="_blank">http://jim.nasby.net</a><br /><br /><br /> -- <br /> Sent via pgsql-hackers
mailinglist (<a href="mailto:pgsql-hackers@postgresql.org" target="_blank">pgsql-hackers@postgresql.org</a>)<br /> To
makechanges to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/<u></u>mailpref/pgsql-hackers</a><br/></blockquote></div>

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Andres Freund

Дата:

02 апреля 2014 г., 23:16:11

On 2014-04-02 21:08:47 +0100, Greg Stark wrote:
> Normally I'm not for adding gucs that just gate new features. But I think a
> simple guc to turn this on or off would be fine and alleviate any concerns.
> I think users would appreciate it quite a lot

I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.

> It would even have a positive effect of helping raise awareness of the
> feature. I often scan the list of config options to get an idea of new
> features when I'm installing new software or upgrading.

Really? Should we now add GUCs for every feature then?

Greetings,

Andres Freund

PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

02 апреля 2014 г., 23:33:39

On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't have strong feelings about the feature, but introducing a guc
> for it feels entirely ridiculous to me. This is a minor detail in an
> error message, not more.

I agree. It's just a HINT. It's quite helpful in certain particular
contexts, but in the grand scheme of things isn't all that important.
I am being quite conservative in trying to anticipate cases where on
balance it'll actually hurt more than it will help. I doubt that there
actually are any.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Greg Stark

Дата:

03 апреля 2014 г., 07:49:01

On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.

I've been responding a lot from the phone. Unfortunately the Gmail client on the phone makes it nearly impossible to format messages well. I'm beginning to think it would be better to just not quote at all any more. I'm normally not doing a point-by-point response anyways.

--
greg

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Andres Freund

Дата:

03 апреля 2014 г., 12:10:05

On 2014-04-03 00:48:12 -0400, Greg Stark wrote:
> On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > PS: Could you please start to properly quote again? You seem to have
> > stopped doing that entirely in the last few months.
>
> I've been responding a lot from the phone. Unfortunately the Gmail client
> on the phone makes it nearly impossible to format messages well. I'm
> beginning to think it would be better to just not quote at all any more.
> I'm normally not doing a point-by-point response anyways.

I really don't care where you're answering from TBH. It's unreadable,
misses context and that's it. If $device doesn't work for you, don't use
it.
I don't mind an occasional quick answer that's badly formatted, but for
other things it's really annoying.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Josh Berkus

Дата:

17 июня 2014 г., 02:05:09

On 04/02/2014 01:16 PM, Andres Freund wrote:
> On 2014-04-02 21:08:47 +0100, Greg Stark wrote:
>> Normally I'm not for adding gucs that just gate new features. But I think a
>> simple guc to turn this on or off would be fine and alleviate any concerns.
>> I think users would appreciate it quite a lot
> 
> I don't have strong feelings about the feature, but introducing a guc
> for it feels entirely ridiculous to me. This is a minor detail in an
> error message, not more.
> 
>> It would even have a positive effect of helping raise awareness of the
>> feature. I often scan the list of config options to get an idea of new
>> features when I'm installing new software or upgrading.
> 
> Really? Should we now add GUCs for every feature then?

-1 for having a GUC for this.

+1 on the feature.

Review with functional test coming up.

Question: How should we handle the issues with East Asian languages
(i.e. Japanese, Chinese) and this Hint?  Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?If so, how do we get that list?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

17 июня 2014 г., 02:32:00

On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Question: How should we handle the issues with East Asian languages
> (i.e. Japanese, Chinese) and this Hint?  Should we just avoid hinting
> for a selected list of languages which don't work well with levenshtein?
>  If so, how do we get that list?

I think that how useful Levenshtein distance is for users based in
east Asia generally, and how useful this patch is to those users are
two distinct questions. I have no idea how common it is for Japanese
users to just use Roman characters as table and attribute names. Since
they're very probably already writing application code that uses Roman
characters (except in the comments, user strings and so on), it might
make sense to do the same in the database. I would welcome further
input on that question. I don't know what the trends are in the real
world.

Also note that the patch scans the range table parse state to pick the
most probable candidate among all Vars/columns that already appear
there. The query would raise an error at an earlier point if a
non-existent relation was referenced, for example. We're only choosing
from a minimal list of possibilities, and pick one that is very
probably what was intended. Even if Levenshtein distance works badly
with Kanji (which is not obviously the case, at least to me), it might
not matter here.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Ian Barwick

Дата:

17 июня 2014 г., 03:31:12

On 14/06/17 8:31, Peter Geoghegan wrote:
> On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Question: How should we handle the issues with East Asian languages
>> (i.e. Japanese, Chinese) and this Hint?  Should we just avoid hinting
>> for a selected list of languages which don't work well with levenshtein?
>>  If so, how do we get that list?
> 
> I think that how useful Levenshtein distance is for users based in
> east Asia generally, and how useful this patch is to those users are
> two distinct questions. I have no idea how common it is for Japanese
> users to just use Roman characters as table and attribute names. Since
> they're very probably already writing application code that uses Roman
> characters (except in the comments, user strings and so on), it might
> make sense to do the same in the database. I would welcome further
> input on that question. I don't know what the trends are in the real
> world.

From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.


Regards

Ian Barwick

-- Ian Barwick                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

17 июня 2014 г., 03:47:46

On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:
> From what I've seen in the wild in Japan, Roman/ASCII characters are
> widely used for object/attribute names, as generally it's much less
> hassle than switching between input methods, dealing with different
> encodings etc. The only place where I've seen Japanese characters widely
> used is in tutorials, examples etc. However that's only my personal
> observation for one particular non-Roman language.
And I agree to this remark, that's a PITA to manage database object
names with Japanese characters directly. I have ever seen some
applications using such ways to define objects though in the past, not
*that* many I concur..
-- 
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

17 июня 2014 г., 03:54:05

Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:
>> From what I've seen in the wild in Japan, Roman/ASCII characters are
>> widely used for object/attribute names, as generally it's much less
>> hassle than switching between input methods, dealing with different
>> encodings etc. The only place where I've seen Japanese characters widely
>> used is in tutorials, examples etc. However that's only my personal
>> observation for one particular non-Roman language.

> And I agree to this remark, that's a PITA to manage database object
> names with Japanese characters directly. I have ever seen some
> applications using such ways to define objects though in the past, not
> *that* many I concur..

What exactly is the rationale for thinking that Levenshtein distance is
useless in non-Roman alphabets?  AFAIK it just counts insertions and
deletions of characters, which seems like a concept rather independent
of what those characters are.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Ian Barwick

Дата:

17 июня 2014 г., 05:10:32

On 14/06/17 9:53, Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:
>>> From what I've seen in the wild in Japan, Roman/ASCII characters are
>>> widely used for object/attribute names, as generally it's much less
>>> hassle than switching between input methods, dealing with different
>>> encodings etc. The only place where I've seen Japanese characters widely
>>> used is in tutorials, examples etc. However that's only my personal
>>> observation for one particular non-Roman language.
> 
>> And I agree to this remark, that's a PITA to manage database object
>> names with Japanese characters directly. I have ever seen some
>> applications using such ways to define objects though in the past, not
>> *that* many I concur..
> 
> What exactly is the rationale for thinking that Levenshtein distance is
> useless in non-Roman alphabets?  AFAIK it just counts insertions and
> deletions of characters, which seems like a concept rather independent
> of what those characters are.

With Japanese (which doesn't have an alphabet, but two syllabaries and
a bunch of logographic characters), Levenshtein distance is pretty useless
for examining similarities with words which can be written in either
syllabary (Michael's "ramen" example earlier in the thread); and when
catching "typos" caused by erroneous conversion from phonetic input to
characters - e.g. intending to input "成長" (seichou, growth) but
accidentally selecting "清聴" (seichou, courteous attention).

Howver in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.

Regards

Ian Barwick

-- Ian Barwick                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

17 июня 2014 г., 05:57:10

On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
> Howver in this particular use case, as long as it doesn't produce false
> positives (I haven't looked at the patch) I don't think it would cause
> any problems (of the kind which would require actively excluding certain
> languages/character sets), it just wouldn't be quite as useful.

I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Ian Barwick

Дата:

17 июня 2014 г., 06:39:00

On 14/06/17 11:57, Peter Geoghegan wrote:
> On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
>> Howver in this particular use case, as long as it doesn't produce false
>> positives (I haven't looked at the patch) I don't think it would cause
>> any problems (of the kind which would require actively excluding certain
>> languages/character sets), it just wouldn't be quite as useful.
> 
> I'm not sure what you mean by false positives. The patch just shows a
> HINT, where before there was none. It's possible for any number of
> reasons that it isn't the most useful possible suggestion, since
> Levenshtein distance is used as opposed to any other scheme that might
> be better sometimes. I think that the hint given is a generally useful
> piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
> error. Obviously I think the patch is worthwhile, but fundamentally
> the HINT given is just a guess, as with the existing HINTs.

I mean, does it come up with a suggestion in every case, even if there is
no remotely similar column? E.g. would
 SELECT foo FROM some_table

bring up column "bar" as a suggestion if "bar" is the only column in
the table?

Anyway, is there an up-to-date version of the patch available? The one from
March doesn't seem to apply cleanly to HEAD.


Thanks

Ian Barwick


-- Ian Barwick                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

17 июня 2014 г., 06:56:54

Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
>> Howver in this particular use case, as long as it doesn't produce false
>> positives (I haven't looked at the patch) I don't think it would cause
>> any problems (of the kind which would require actively excluding certain
>> languages/character sets), it just wouldn't be quite as useful.

> I'm not sure what you mean by false positives. The patch just shows a
> HINT, where before there was none. It's possible for any number of
> reasons that it isn't the most useful possible suggestion, since
> Levenshtein distance is used as opposed to any other scheme that might
> be better sometimes. I think that the hint given is a generally useful
> piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
> error. Obviously I think the patch is worthwhile, but fundamentally
> the HINT given is just a guess, as with the existing HINTs.

Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance.  I'm not sure how much
better is "clearly better", but I exclude "zero" from that.  I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.

You could possibly answer this objection by making the HINT list *all*
the alternatives meeting the minimum Levenshtein distance.  But I think
that's probably overcomplicated and of uncertain value anyhow.  I'd rather
have a rule that "we print only the choice that is at least K units better
than any other choice", where K remains to be determined exactly.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

17 июня 2014 г., 07:22:04

On Mon, Jun 16, 2014 at 8:38 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
> I mean, does it come up with a suggestion in every case, even if there is
> no remotely similar column? E.g. would
>
>   SELECT foo FROM some_table
>
> bring up column "bar" as a suggestion if "bar" is the only column in
> the table?

Yes, it would, but I think that's the correct behavior.

> Anyway, is there an up-to-date version of the patch available? The one from
> March doesn't seem to apply cleanly to HEAD.

Are you sure? I think it might just be that patch is confused about
the deleted file contrib/fuzzystrmatch/levenshtein.c.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

17 июня 2014 г., 07:52:00

On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not having looked at the patch, but: I think the probability of
> useless-noise HINTs could be substantially reduced if the code prints a
> HINT only when there is a single available alternative that is clearly
> better than the others in Levenshtein distance.  I'm not sure how much
> better is "clearly better", but I exclude "zero" from that.  I see that
> the original description of the patch says that it will arbitrarily
> choose one alternative when there are several with equal Levenshtein
> distance, and I'd say that's a bad idea.

I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from. I think that it might be particularly
annoying to not show some suggestion in the event of a would-be
ambiguous column reference where the column name is itself wrong,
since both mistakes are common. For example, "order_id" was specified
instead of one of either "o.orderid" or "ol.orderid", as in my
original examples. If some correct alias was specified, that would
make the new code prefer the appropriate Var, but it might not be, and
that should be okay in my view.

I'm not trying to remove the need for human judgement here. We've all
heard stories about people who did things like input "Portland" into
their GPS only to end up in Maine rather than Oregon, but I think in
general you can only go so far in worrying about those cases.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

17 июня 2014 г., 23:46:50

On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not having looked at the patch, but: I think the probability of
>> useless-noise HINTs could be substantially reduced if the code prints a
>> HINT only when there is a single available alternative that is clearly
>> better than the others in Levenshtein distance.  I'm not sure how much
>> better is "clearly better", but I exclude "zero" from that.  I see that
>> the original description of the patch says that it will arbitrarily
>> choose one alternative when there are several with equal Levenshtein
>> distance, and I'd say that's a bad idea.
>
> I disagree. I happen to think that making some guess is better than no
> guess at all here, given the fact that there aren't too many
> possibilities to choose from. I think that it might be particularly
> annoying to not show some suggestion in the event of a would-be
> ambiguous column reference where the column name is itself wrong,
> since both mistakes are common. For example, "order_id" was specified
> instead of one of either "o.orderid" or "ol.orderid", as in my
> original examples. If some correct alias was specified, that would
> make the new code prefer the appropriate Var, but it might not be, and
> that should be okay in my view.
>
> I'm not trying to remove the need for human judgement here. We've all
> heard stories about people who did things like input "Portland" into
> their GPS only to end up in Maine rather than Oregon, but I think in
> general you can only go so far in worrying about those cases.

Emitting a suggestion with a large distance seems like it could be
rather irritating.  If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received.  Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.

One existing tool that does something along these lines is 'git',
which seems to have some kind of a heuristic to know when to give up:

[rhaas pgsql]$ git gorp
git: 'gorp' is not a git command. See 'git --help'.

Did you mean this?   grep
[rhaas pgsql]$ git goop
git: 'goop' is not a git command. See 'git --help'.

Did you mean this?   grep
[rhaas pgsql]$ git good
git: 'good' is not a git command. See 'git --help'.
[rhaas pgsql]$ git puma
git: 'puma' is not a git command. See 'git --help'.

Did you mean one of these?   pull   push

I suspect that the maximum useful distance is a function of the string
length.  Certainly, if the distance is greater than or equal to the
length of one of the strings involved, it's just a totally unrelated
string and thus not worth suggesting.  A useful heuristic might be
something like "distance at most 3, or at most half the string length,
whichever is less".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

17 июня 2014 г., 23:59:25

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> I disagree. I happen to think that making some guess is better than no
>> guess at all here, given the fact that there aren't too many
>> possibilities to choose from.

> Emitting a suggestion with a large distance seems like it could be
> rather irritating.  If the user types in SELECT prodct_id FROM orders,
> and that column does not exist, suggesting "product_id", if such a
> column exists, will likely be well-received.  Suggesting a column
> named, say, "price", however, will likely make at least some users say
> "no I didn't mean that you stupid @%!#" - because probably the issue
> there is that the user selected from the completely wrong table,
> rather than getting 6 of the 9 characters they typed incorrect.

Yeah, that's my point exactly.  There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large.  I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.

> One existing tool that does something along these lines is 'git',
> which seems to have some kind of a heuristic to know when to give up:

I wouldn't necessarily hold up git as a model of user interface
engineering ;-) ... but still, it might be interesting to take a look
at exactly what heuristics they used here.  I'm sure there are other
precedents we could look at, too.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Kevin Grittner

Дата:

18 июня 2014 г., 00:15:25

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wouldn't necessarily hold up git as a model of user interface
> engineering ;-) ... but still, it might be interesting to take a
> look at exactly what heuristics they used here.  I'm sure there
> are other precedents we could look at, too.

On my Ubuntu machine, bash does something similar.  A few examples
chosen completely arbitrarily:

kgrittn@Kevin-Desktop:~$ got
No command 'got' found, did you mean:
 Command 'go' from package 'golang-go' (universe)
 Command 'gout' from package 'scotch' (universe)
 Command 'jot' from package 'athena-jot' (universe)
 Command 'go2' from package 'go2' (universe)
 Command 'git' from package 'git' (main)
 Command 'gpt' from package 'gpt' (universe)
 Command 'gom' from package 'gom' (universe)
 Command 'goo' from package 'goo' (universe)
 Command 'gst' from package 'gnu-smalltalk' (universe)
 Command 'dot' from package 'graphviz' (main)
 Command 'god' from package 'god' (universe)
 Command 'god' from package 'ruby-god' (universe)
got: command not found
kgrittn@Kevin-Desktop:~$ groupad
No command 'groupad' found, did you mean:
 Command 'groupadd' from package 'passwd' (main)
 Command 'groupd' from package 'cman' (main)
groupad: command not found
kgrittn@Kevin-Desktop:~$ asdf
No command 'asdf' found, did you mean:
 Command 'asdfg' from package 'aoeui' (universe)
 Command 'sadf' from package 'sysstat' (main)
 Command 'sdf' from package 'sdf' (universe)
asdf: command not found
kgrittn@Kevin-Desktop:~$ zxcv
zxcv: command not found

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Josh Berkus

Дата:

18 июня 2014 г., 00:29:09

On 06/17/2014 01:59 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Emitting a suggestion with a large distance seems like it could be
>> rather irritating.  If the user types in SELECT prodct_id FROM orders,
>> and that column does not exist, suggesting "product_id", if such a
>> column exists, will likely be well-received.  Suggesting a column
>> named, say, "price", however, will likely make at least some users say
>> "no I didn't mean that you stupid @%!#" - because probably the issue
>> there is that the user selected from the completely wrong table,
>> rather than getting 6 of the 9 characters they typed incorrect.
> 
> Yeah, that's my point exactly.  There's no very good reason to assume that
> the intended answer is in fact among the set of column names we can see;
> and if it *is* there, the Levenshtein distance to it isn't going to be
> all that large.  I think that suggesting "foobar" when the user typed
> "glorp" is not only not helpful, but makes us look like idiots.

Well, there's two different issues:

(1) offering a suggestion which is too different from what the user
typed.  This is easily limited by having a max distance (most likely a
distance/length ratio, with a max of say, 0.5).  The only drawback of
this would be the extra cpu cycles to calculate it, and some arguments
about what the max distance should be.  But for the sake of the
children, let's not have a GUC for it.

(2) If there are multiple columns with the same levenschtien distance,
which one do you suggest?  The current code picks a random one, which
I'm OK with.  The other option would be to list all of the columns.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

18 июня 2014 г., 00:35:59

On Tue, Jun 17, 2014 at 1:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, that's my point exactly.  There's no very good reason to assume that
> the intended answer is in fact among the set of column names we can see;
> and if it *is* there, the Levenshtein distance to it isn't going to be
> all that large.  I think that suggesting "foobar" when the user typed
> "glorp" is not only not helpful, but makes us look like idiots.

Maybe that's just a matter of phrasing the message appropriately. A
more guarded message, that suggests that "foobar" is the *best* match
is correct at least on its own terms (terms that are self evident).
This does pretty effectively communicate to the user that they should
totally rethink not just the column name, but perhaps the entire
query. On the other hand, showing nothing communicates nothing.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

18 июня 2014 г., 00:36:32

Josh Berkus <josh@agliodbs.com> writes:
> (2) If there are multiple columns with the same levenschtien distance,
> which one do you suggest?  The current code picks a random one, which
> I'm OK with.  The other option would be to list all of the columns.

I objected to that upthread.  I don't think that picking a random one is
sane at all.  Listing them all might be OK (I notice that that seems to be
what both bash and git do).

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff.  The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Josh Berkus

Дата:

18 июня 2014 г., 00:41:32

On 06/17/2014 02:36 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> (2) If there are multiple columns with the same levenschtien distance,
>> which one do you suggest?  The current code picks a random one, which
>> I'm OK with.  The other option would be to list all of the columns.
> 
> I objected to that upthread.  I don't think that picking a random one is
> sane at all.  Listing them all might be OK (I notice that that seems to be
> what both bash and git do).
> 
> Another issue is whether to print only those having exactly the minimum
> observed Levenshtein distance, or to print everything less than some
> cutoff.  The former approach seems to me to be placing a great deal of
> faith in something that's only a heuristic.

Well, that depends on what the cutoff is.  If it's high, like 0.5, that
could be a LOT of columns.  Like, I plan to test this feature with a
3-table join that has a combined 300 columns.  I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 columns names.

So if we want to list everything below a cutoff, we'd need to make that
cutoff fairly narrow, like 0.2.  But that means we'd miss a lot of
potential matches on short column names.

I really think we're overthinking this: it is just a HINT, and we can
improve it in future PostgreSQL versions, and most of our users will
ignore it anyway because they'll be using a client which doesn't display
HINTs.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

18 июня 2014 г., 00:50:35

Peter Geoghegan <pg@heroku.com> writes:
> Maybe that's just a matter of phrasing the message appropriately. A
> more guarded message, that suggests that "foobar" is the *best* match
> is correct at least on its own terms (terms that are self evident).
> This does pretty effectively communicate to the user that they should
> totally rethink not just the column name, but perhaps the entire
> query. On the other hand, showing nothing communicates nothing.

I don't especially buy that argument.  As soon as the user's gotten used
to hints of this sort, the absence of a hint communicates plenty.

In any case, people have now cited two different systems with suggestion
capability, and neither of them behaves as you're arguing for.  The lack
of precedent should give you pause, unless you can point to widely-used
systems that do what you have in mind.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

18 июня 2014 г., 00:53:57

Josh Berkus <josh@agliodbs.com> writes:
> On 06/17/2014 02:36 PM, Tom Lane wrote:
>> Another issue is whether to print only those having exactly the minimum
>> observed Levenshtein distance, or to print everything less than some
>> cutoff.  The former approach seems to me to be placing a great deal of
>> faith in something that's only a heuristic.

> Well, that depends on what the cutoff is.  If it's high, like 0.5, that
> could be a LOT of columns.  Like, I plan to test this feature with a
> 3-table join that has a combined 300 columns.  I can completely imagine
> coming up with a string which is within 0.5 or even 0.3 of 40 columns names.

I think Levenshtein distances are integers, though that's just a minor
point.

> So if we want to list everything below a cutoff, we'd need to make that
> cutoff fairly narrow, like 0.2.  But that means we'd miss a lot of
> potential matches on short column names.

I'm not proposing an immutable cutoff.  Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do.  I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.

> I really think we're overthinking this: it is just a HINT, and we can
> improve it in future PostgreSQL versions, and most of our users will
> ignore it anyway because they'll be using a client which doesn't display
> HINTs.

Agreed that we can make it better later.  But whether it prints exactly
one suggestion, and whether it does that no matter how silly the
suggestion is, are rather fundamental decisions.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Josh Berkus

Дата:

18 июня 2014 г., 00:58:28

On 06/17/2014 02:53 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 06/17/2014 02:36 PM, Tom Lane wrote:
>>> Another issue is whether to print only those having exactly the minimum
>>> observed Levenshtein distance, or to print everything less than some
>>> cutoff.  The former approach seems to me to be placing a great deal of
>>> faith in something that's only a heuristic.
> 
>> Well, that depends on what the cutoff is.  If it's high, like 0.5, that
>> could be a LOT of columns.  Like, I plan to test this feature with a
>> 3-table join that has a combined 300 columns.  I can completely imagine
>> coming up with a string which is within 0.5 or even 0.3 of 40 columns names.
> 
> I think Levenshtein distances are integers, though that's just a minor
> point.

I was giving distance/length ratios.  That is, 0.5 would mean that up to
50% of the characters could be replaced/changed.  0.2 would mean that
only one character could be changed at lengths of five characters.  Etc.

The problem with these ratios is that they behave differently with long
strings than short ones.  I think realistically we'd need a double
threshold, i.e. ( distance >= 2 OR ratio <= 0.4 ).  Otherwise the
obvious case, getting two characters wrong in a 4-character column name
(or one in a two character name), doesn't get a HINT.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

18 июня 2014 г., 01:05:56

On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not proposing an immutable cutoff.  Something that scales with the
> string length might be a good idea, or we could make it a multiple of
> the minimum observed distance, or probably there are a dozen other things
> we could do.  I'm just saying that if we have an alternative at distance
> 3, and another one at distance 4, it's not clear to me that we should
> assume that the first one is certainly what the user had in mind.
> Especially not if all the other alternatives are distance 10 or more.

The patch just looks for the match with the lowest distance, passing
the lowest observed distance so far as a "max" to the distance
calculation function. That could have some value in certain cases.
People have already raised general concerns about added cycles and/or
clutter.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Gavin Flower

Дата:

18 июня 2014 г., 01:16:52

On 18/06/14 10:05, Peter Geoghegan wrote:
> On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not proposing an immutable cutoff.  Something that scales with the
>> string length might be a good idea, or we could make it a multiple of
>> the minimum observed distance, or probably there are a dozen other things
>> we could do.  I'm just saying that if we have an alternative at distance
>> 3, and another one at distance 4, it's not clear to me that we should
>> assume that the first one is certainly what the user had in mind.
>> Especially not if all the other alternatives are distance 10 or more.
> The patch just looks for the match with the lowest distance, passing
> the lowest observed distance so far as a "max" to the distance
> calculation function. That could have some value in certain cases.
> People have already raised general concerns about added cycles and/or
> clutter.
>
How about a list of miss spellings and the likely targets.  (grop, grap, ...) ==> (grep, grape, grope...)
type of thing? Possibly with some kind of adaptive learning algorithm.

I suspect that while this might be a useful research project, it is out 
of scope for the current discussion!


Cheers,
Gavin

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

18 июня 2014 г., 03:18:43

On Tue, Jun 17, 2014 at 5:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> (2) If there are multiple columns with the same levenschtien distance,
>> which one do you suggest?  The current code picks a random one, which
>> I'm OK with.  The other option would be to list all of the columns.
>
> I objected to that upthread.  I don't think that picking a random one is
> sane at all.  Listing them all might be OK (I notice that that seems to be
> what both bash and git do).

What bash does is annoying and stupid, and any time I find a system
with that obnoxious behavior enabled I immediately disable it, so I
don't consider that a good precedent for anything.  I think what the
bash algorithm demonstrates is that while it may be sane to list more
than one option, listing 10 or 20 or 150 is unbearably obnoxious.
Filling the user's *entire terminal window* with a list of suggestions
when they make a minor typo is more like a punishment than an aid.
git's behavior of limiting itself to one or two options, while
somewhat useless, is at least not annoying.

> Another issue is whether to print only those having exactly the minimum
> observed Levenshtein distance, or to print everything less than some
> cutoff.  The former approach seems to me to be placing a great deal of
> faith in something that's only a heuristic.

Well, we've got lots of heuristics.  Many of them serve us quite well.
I might do something like this:

(1) Set the maximum levenshtein distance to half the length of the
string, rounded down, but not more than 3.
(2) If there are more than 2 matches, reduce the maximum distance by 1
and repeat this step.
(3) If there are no remaining matches, print no hint; else print the 1
or 2 matching items.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

18 июня 2014 г., 04:18:27

On Tue, Jun 17, 2014 at 5:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> What bash does is annoying and stupid, and any time I find a system
> with that obnoxious behavior enabled I immediately disable it, so I
> don't consider that a good precedent for anything.

I happen to totally agree with you here. Bash sometimes does awful
things with its completion.

>> Another issue is whether to print only those having exactly the minimum
>> observed Levenshtein distance, or to print everything less than some
>> cutoff.  The former approach seems to me to be placing a great deal of
>> faith in something that's only a heuristic.
>
> Well, we've got lots of heuristics.  Many of them serve us quite well.
> I might do something like this:
>
> (1) Set the maximum levenshtein distance to half the length of the
> string, rounded down, but not more than 3.
> (2) If there are more than 2 matches, reduce the maximum distance by 1
> and repeat this step.
> (3) If there are no remaining matches, print no hint; else print the 1
> or 2 matching items.

I could do that. I can prepare a revision if others feel that's
acceptable. My only concern with this is that a more sophisticated
scheme implies more clutter in the parser, although it should not
imply wasted cycles.

What I particularly wanted to avoid in our choice of completion scheme
is doing nothing because there is an ambiguity about what is best,
which Tom suggested. In practice, that ambiguity will frequently be
something that our users will not care about, and not really see as an
ambiguity, as in my "o.orderid or ol.orderid?" example. However, if
there are 3 equally distant Vars, and not just 2, that's very probably
because none are useful, and so we really ought to show nothing. This
seems most sensible.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Abhijit Menon-Sen

Дата:

29 июня 2014 г., 17:22:10

So, what's the status of this patch?

There's been quite a lot of discussion (though only about the approach;
no formal code/usage review has yet been posted), but as far as I can
tell, it just tapered off without any particular consensus.

-- Abhijit

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

29 июня 2014 г., 17:30:41

Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> So, what's the status of this patch?
> There's been quite a lot of discussion (though only about the approach;
> no formal code/usage review has yet been posted), but as far as I can
> tell, it just tapered off without any particular consensus.

AFAICT, people generally agree that this would probably be useful,
but there's not consensus on how far the code should be willing to
"reach" for a match, nor on what to do when there are multiple
roughly-equally-plausible candidates.

Although printing all candidates seems to be what's preferred by
existing systems with similar facilities, I can see the point that
constructing the message in a translatable fashion might be difficult.
So personally I'd be willing to abandon insistence on that.  I still
think though that printing candidates with very large distances
would be unhelpful.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

03 июля 2014 г., 01:51:21

On Sun, Jun 29, 2014 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Although printing all candidates seems to be what's preferred by
> existing systems with similar facilities, I can see the point that
> constructing the message in a translatable fashion might be difficult.
> So personally I'd be willing to abandon insistence on that.  I still
> think though that printing candidates with very large distances
> would be unhelpful.

Attached revision factors in everyone's concerns here, I think.

I've addressed your concern about the closeness of the match proposed
in the HINT - the absolute as opposed to relative quality of the
match. There is a normalized distance threshold that must always be
exceeded to prevent ludicrous suggestions. This works along similar
lines to those sketched by Robert. Furthermore, I've made it
occasionally possible to see 2 suggestions, when they're equally
distant and when each suggestion comes from a different range table
entry. However, if the two best suggestions (overall or within an RTE)
come from within the same RTE, then that RTE is ignored for the
purposes of picking a suggestion (although the lowest observed
distance from an ignored RTE may still be used as the distance for
later RTEs to beat to get their attributes suggested in the HINT).

The idea here is that this quality-bar for suggestions doesn't come at
the cost of ignoring my concern about the presumably somewhat common
case where there is an unqualified and therefore ambiguous column
reference that happens to also be misspelled. An ambiguous column
reference and an incorrectly spelled column name are both very common,
and so it seems likely that momentary lapses where the user gets both
things wrong at once are also common. We do all this without going
overboard, since as outlined by Robert, when there are 3 or more
equally distant candidates (even if they all come from different
RTEs), we give no HINT at all. The big picture here is to make mental
context switches cheap when writing ad-hoc queries in psql.

A lot of the HINTs that popped up in the regression tests that seemed
kind of questionable no longer appear. These new measures make the
coding somewhat more complex than that of the initial version,
although overall the parser code added by this patch is almost
entirely confined to code paths concerned only with producing
diagnostic messages to help users.

--
Peter Geoghegan

Вложения

levenshtein_column_hint.v2.2014_07_02.patch.gz

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Abhijit Menon-Sen

Дата:

04 июля 2014 г., 18:46:52

At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote:
>
> Attached revision factors in everyone's concerns here, I think.

Is anyone planning to review Peter's revised patch?

> These new measures make the coding somewhat more complex than that of
> the initial version, although overall the parser code added by this
> patch is almost entirely confined to code paths concerned only with
> producing diagnostic messages to help users.

Yes, the new patch looks quite a bit more involved than earlier, but if
that's what it takes to provide a useful HINT, I guess it's not too bad.

-- Abhijit

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

08 июля 2014 г., 16:58:56

On Sat, Jul 5, 2014 at 12:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote:
>>
>> Attached revision factors in everyone's concerns here, I think.
>
> Is anyone planning to review Peter's revised patch?

I have been doing some functional tests, and looked quickly at the
code to understand what it does:
1) Compiles without warnings, passes regression tests
2) Checking process goes through all the existing columns of a
relation even a difference of 1 with some other column(s) has already
been found. As we try to limit the number of hints returned, this
seems like a waste of resources.
3) distanceName could be improved, by for example having some checks
on the string lengths of target and source columns, and immediately
reject the match if for example the length of the source string is the
double/half of the length of target.
4) This is not nice, could it be possible to remove the stuff from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, should we really move varstr_leven to core?
This clearly needs to be reworked as not just a copy-paste of the
things in fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think.
5) Do we want hints on system columns as well? For example here we
could get tableoid as column hint:
=# select tablepid from foo;
ERROR:  42703: column "tablepid" does not exist
LINE 1: select tablepid from foo;              ^
LOCATION:  errorMissingColumn, parse_relation.c:3123
Time: 0.425 ms
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR:  42703: column "ab" does not exist
LINE 1: select ab from foo;              ^
LOCATION:  errorMissingColumn, parse_relation.c:3123
7) Performance penalty with a table with 1600 columns:
=# CREATE FUNCTION create_long_table(tabname text, columns int)
RETURNS void
LANGUAGE plpgsql
as $$
declare first_col bool = true; count int; query text;
begin query := 'CREATE TABLE ' || tabname || ' ('; for count in 0..columns loop   query := query || 'col' || count ||
'int';   if count <> columns then     query := query || ', ';   end if; end loop; query := query || ')'; execute
query;
end;
$$;
=# SELECT create_long_table('aa', 1599);create_long_table
-------------------

(1 row)
Then tested queries like that: SELECT col888a FROM aa;
Patched version: 2.100ms~2.200ms
master branch (6048896): 0.956 ms~0.990 ms
So the performance impact seems limited.

Regards,
-- 
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

08 июля 2014 г., 19:49:45

Hi,

On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> 2) Checking process goes through all the existing columns of a
> relation even a difference of 1 with some other column(s) has already
> been found. As we try to limit the number of hints returned, this
> seems like a waste of resources.

In general it's possible that an exact match will later be found
within the RTE, and exact matches don't have to pay the "wrong alias"
penalty, and are immediately returned. It is therefore not a waste of
resources, but even if it was that would be pretty inconsequential as
your benchmark shows.

> 3) distanceName could be improved, by for example having some checks
> on the string lengths of target and source columns, and immediately
> reject the match if for example the length of the source string is the
> double/half of the length of target.

I don't think it's a good idea to tie distanceName() to the ultimate
behavior of errorMissingColumn() hinting, since there may be other
callers in the future. Besides, that isn't going to help much.

> 4) This is not nice, could it be possible to remove the stuff from varlena.c?
> +/* Expand each Levenshtein distance variant */
> +#include "levenshtein.c"
> +#define LEVENSHTEIN_LESS_EQUAL
> +#include "levenshtein.c"
> +#undef LEVENSHTEIN_LESS_EQUAL
> Part of the same comment: only varstr_leven_less_equal is used to
> calculate the distance, should we really move varstr_leven to core?
> This clearly needs to be reworked as not just a copy-paste of the
> things in fuzzystrmatch.
> The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think.

So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.

> 5) Do we want hints on system columns as well?

I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

> 6) Sometimes no hints are returned... Even in simple cases like this one:
> =# create table foo (aa int, bb int);
> CREATE TABLE
> =# select ab from foo;
> ERROR:  42703: column "ab" does not exist
> LINE 1: select ab from foo;
>                ^
> LOCATION:  errorMissingColumn, parse_relation.c:3123

That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

08 июля 2014 г., 23:44:20

Peter Geoghegan wrote:

> > 6) Sometimes no hints are returned... Even in simple cases like this one:
> > =# create table foo (aa int, bb int);
> > CREATE TABLE
> > =# select ab from foo;
> > ERROR:  42703: column "ab" does not exist
> > LINE 1: select ab from foo;
> >                ^
> > LOCATION:  errorMissingColumn, parse_relation.c:3123
> 
> That's because those two candidates come from a single RTE and have an
> equal distance -- you'd see both suggestions if you joined two tables
> with each candidate, assuming that each table being joined didn't
> individually have the same issue. I think that that's probably
> considered the correct behavior by most.

It seems pretty silly to me actually.  Was this designed by a committee?
I agree with the general principle that showing a large number of
candidates (a la bash) is a bad idea, but failing to show two of them ...

Words fail me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

08 июля 2014 г., 23:55:21

On Tue, Jul 8, 2014 at 1:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> That's because those two candidates come from a single RTE and have an
>> equal distance -- you'd see both suggestions if you joined two tables
>> with each candidate, assuming that each table being joined didn't
>> individually have the same issue. I think that that's probably
>> considered the correct behavior by most.
>
> It seems pretty silly to me actually.  Was this designed by a committee?
> I agree with the general principle that showing a large number of
> candidates (a la bash) is a bad idea, but failing to show two of them ...

I guess it was designed by a committee. But we don't fail to show both
because they're equally distant. Rather, it's because they're equally
distant and from the same RTE. This is a contrived example, but
typically showing equally distant columns is useful when they're in a
foreign-key relationship - I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't. I think that
in most realistic cases it wouldn't be all that useful to show two
columns from the same table when they're equally distant. It's easy to
imagine that reflecting that no match is good in absolute terms, and
we're somewhat conservative about showing any match. While I think
this general behavior is defensible, I must admit that it did suit me
to write it that way because to do otherwise would have necessitated
more invasive code in the existing general purpose scanRTEForColumn()
function.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

08 июля 2014 г., 23:57:15

On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I was worried about the common case where a
> column name is misspelled that would otherwise be ambiguous, which is
> why that shows a HINT while the single RTE case doesn't

To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

09 июля 2014 г., 09:10:50

On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I was worried about the common case where a
> column name is misspelled that would otherwise be ambiguous, which is
> why that shows a HINT while the single RTE case doesn't

To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.

Showing up to 2 hints is fine as it does not pollute the error output with perhaps unnecessary messages. That's even more protective than for example git that prints all the equidistant candidates. However I can't understand why it does not show up hints even if there are two equidistant candidates from the same RTE. I think it should.
--
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

09 июля 2014 г., 09:25:16

On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan <pg@heroku.com> wrote:

> 4) This is not nice, could it be possible to remove the stuff from varlena.c?
> +/* Expand each Levenshtein distance variant */
> +#include "levenshtein.c"
> +#define LEVENSHTEIN_LESS_EQUAL
> +#include "levenshtein.c"
> +#undef LEVENSHTEIN_LESS_EQUAL
> Part of the same comment: only varstr_leven_less_equal is used to
> calculate the distance, should we really move varstr_leven to core?
> This clearly needs to be reworked as not just a copy-paste of the
> things in fuzzystrmatch.
> The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think.

So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.

No. The main difference between varstr_leven_less_equal and varstr_leven is the use of the extra argument max_d in the former. My argument here is instead of blindly cut-pasting into core the code you are interested in to evaluate the string distances, is to refactor it to have a unique function, and to let the business with LEVENSHTEIN_LESS_EQUAL within contrib/fuzzystrmatch. This will require some reshuffling of the distance function, but by looking at this patch I am getting the feeling that this is necessary, and should even be split into a first patch for fuzzystrmatch that would facilitate its integration into core.
Also why is rest_of_char_same within varlena.c?

> 5) Do we want hints on system columns as well?
I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

This may sound ridiculous, but I have already found myself mistyping ctid by tid and cid while working on patches and modules that played with page format, and needing a couple of minutes to understand what was going on (bad morning). I would have welcomed such hints in those cases.
--
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

09 июля 2014 г., 09:29:06

On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Showing up to 2 hints is fine as it does not pollute the error output with
> perhaps unnecessary messages. That's even more protective than for example
> git that prints all the equidistant candidates. However I can't understand
> why it does not show up hints even if there are two equidistant candidates
> from the same RTE. I think it should.

Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better then what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

09 июля 2014 г., 09:57:12

On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> So there'd be one variant within core and one within
>> contrib/fuzzystrmatch? I don't think that's an improvement.
>
> No. The main difference between varstr_leven_less_equal and varstr_leven is
> the use of the extra argument max_d in the former. My argument here is
> instead of blindly cut-pasting into core the code you are interested in to
> evaluate the string distances, is to refactor it to have a unique function,
> and to let the business with LEVENSHTEIN_LESS_EQUAL within
> contrib/fuzzystrmatch. This will require some reshuffling of the distance
> function, but by looking at this patch I am getting the feeling that this is
> necessary, and should even be split into a first patch for fuzzystrmatch
> that would facilitate its integration into core.
> Also why is rest_of_char_same within varlena.c?

Just as before, rest_of_char_same() exists for the express purpose of
being called by the two variants varstr_leven_less_equal() and
varstr_leven(). Why wouldn't I copy it over too along with those two?
Where do you propose to put it?

Obviously the existing macro hacks (that I haven't changed) that build
the two variants are not terribly pretty, but they're not arbitrary
either. They reflect the fact that there is no natural way to add
callbacks or something like that. If you pretended that the core code
didn't have to care about one case or the other, and that contrib was
somehow obligated to hook in its own handler for the
!LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd
end up with an even bigger mess. Besides, with the patch the core code
is calling varstr_leven_less_equal(), which is the bigger of the two
variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the
!LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of
building HINTs. In short, I don't know what you mean. What would that
reshuffling actually look like?

>> > 5) Do we want hints on system columns as well?
>> I think it's obvious that the answer must be no. That's going to
>> frequently result in suggestions of columns that users will complain
>> aren't even there. If you know about the system columns, you can just
>> get it right. They're supposed to be hidden for most purposes.
>
> This may sound ridiculous, but I have already found myself mistyping ctid by
> tid and cid while working on patches and modules that played with page
> format, and needing a couple of minutes to understand what was going on (bad
> morning).

I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

09 июля 2014 г., 17:50:17

Peter Geoghegan wrote:
> On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:

> >> > 5) Do we want hints on system columns as well?
> >> I think it's obvious that the answer must be no. That's going to
> >> frequently result in suggestions of columns that users will complain
> >> aren't even there. If you know about the system columns, you can just
> >> get it right. They're supposed to be hidden for most purposes.
> >
> > This may sound ridiculous, but I have already found myself mistyping ctid by
> > tid and cid while working on patches and modules that played with page
> > format, and needing a couple of minutes to understand what was going on (bad
> > morning).
> 
> I think that it's clearly not worth it, even if it is true that a
> minority sometimes make this mistake. Most users don't know that there
> are system columns. It's not even close to being worth it to bring
> that into this.

I agree with Peter.  This is targeted at regular users.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

09 июля 2014 г., 18:07:02

On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> > I was worried about the common case where a
>> > column name is misspelled that would otherwise be ambiguous, which is
>> > why that shows a HINT while the single RTE case doesn't
>>
>> To be clear - I mean a HINT with two suggestions rather than just one.
>> If there are 3 or more equally distant suggestions (even if they're
>> all from different RTEs) we also give no HINT in the proposed patch.
>
> Showing up to 2 hints is fine as it does not pollute the error output with
> perhaps unnecessary messages. That's even more protective than for example
> git that prints all the equidistant candidates. However I can't understand
> why it does not show up hints even if there are two equidistant candidates
> from the same RTE. I think it should.

Me, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Greg Stark

Дата:

09 июля 2014 г., 18:08:57

On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Everyone is going to have an opinion on something like that. I was
> showing deference to the general concern about the absolute (as
> opposed to relative) quality of the HINTs in the event of equidistant
> matches by having no two suggestions come from within a single RTE,
> while still covering the case I thought was important by having two
> suggestions if there were two equidistant matches across RTEs. I think
> that's marginally better then what you propose, because your case
> deals with two equidistant though distinct columns, bringing into
> question the validity of both would-be suggestions. I'll defer to
> whatever the consensus is.

I agree this is bike shedding. But as long as we're bike shedding...

A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance. If they match exactly except for the case and the
unmatched column is all lower case add a comment that quoting is
required due to the mixed case.

Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up some some suggestions is fine.

-- 
greg

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

09 июля 2014 г., 22:43:42

On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote:
> A simple rule is easier for users to understand as well as to code. I
> would humbly suggest the following: take all the unqualified column
> names, downcase them, check which ones match most closely the
> unmatched column. Show the top 3 matches if they're within some
> arbitrary distance.

That's harder than it sounds. You need even more translatable strings
for variant ereports(). I don't think that an easy to understand rule
is necessarily of much value - I'm already charging half price for
deletion because I found representative errors more useful in certain
cases by doing so. I think we want something that displays the most
useful suggestion as often as is practically possible, and does not
display unhelpful suggestions to the extent that it's practical to
avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff
to scanRTEForColumn() than I already have.

> Honestly the current logic and the previous logic both seemed
> reasonable to me. They're not going to be perfect in every case so
> anything that comes up some some suggestions is fine.

I think that the most recent revision is somewhat better due to the
feedback of Tom and Robert. I didn't feel as strongly as they did
about erring on the side of not showing a HINT, but I think the most
recent revision is a good compromise. But yes, at this point we're
certainly chasing diminishing returns. There are almost any number of
variants of this basic idea that could be suggested.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

09 июля 2014 г., 22:59:38

On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Showing up to 2 hints is fine as it does not pollute the error output with
>> perhaps unnecessary messages. That's even more protective than for example
>> git that prints all the equidistant candidates. However I can't understand
>> why it does not show up hints even if there are two equidistant candidates
>> from the same RTE. I think it should.
>
> Me, too.

The idea is that each RTE gets one best suggestion, because if there
are two best suggestions within an RTE they're probably both wrong.
Whereas across RTEs, it's probably just that there is a foreign key
relationship between the two (and the user accidentally failed to
qualify the particular column of interest on top of the misspelling, a
qualification that would be sufficient to have the code prefer the
qualified-but-misspelled column). Clearly if I was to do what you
suggest it would be closer to a wild guess, and Tom has expressed
concerns about that.

Now, I don't actually ensure that the column names of the two columns
(each from separate RTEs) are identical save for their would-be alias,
but that's just a consequence of the implementation. Also, as I've
mentioned, I don't want to put more stuff in scanRTEForColumn() than I
already have, due to your earlier concern about adding clutter.

I think we're splitting hairs at this point, and frankly I'll do it
that way if it gets the patch closer to being committed. While I
thought it was important to get the unqualified and misspelled case
right (which I did in the first revision, but perhaps at the expense
of Tom's concern about absolute suggestion quality), I don't feel
strongly about this detail either way.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

10 июля 2014 г., 00:19:55

Peter Geoghegan wrote:
> On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote:
> > A simple rule is easier for users to understand as well as to code. I
> > would humbly suggest the following: take all the unqualified column
> > names, downcase them, check which ones match most closely the
> > unmatched column. Show the top 3 matches if they're within some
> > arbitrary distance.
> 
> That's harder than it sounds. You need even more translatable strings
> for variant ereports().

Maybe it is possible to rephrase the message so that the translatable
part doesn't need to concern with how many suggestions there are.  For
instance something like "perhaps you meant a name from the following
list: foo, bar, baz".  Couple with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

10 июля 2014 г., 00:44:13

On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> That's harder than it sounds. You need even more translatable strings
>> for variant ereports().
>
> Maybe it is possible to rephrase the message so that the translatable
> part doesn't need to concern with how many suggestions there are.  For
> instance something like "perhaps you meant a name from the following
> list: foo, bar, baz".  Couple with the errmsg_plural stuff, you then
> don't need to worry too much about providing different strings for 1, 2,
> N suggestions.

That's not really the problem. I already have a lot of things to test
in each of the two ereport() calls. More importantly, showing the
closet, say, 3 matches under an arbitrary distance does not weigh
concerns about that indicating that they're all bad. It's not like
bash tab completion - if there is one best match, that's probably
because that's what the user meant. Whereas if there are two or more
within a single RTE, that's probably because both are unhelpful. They
both happened to require the same number of substitutions to get to,
while not being quite bad enough matches to be excluded by the final
check against a normalized distance threshold (the final check that
prevents ludicrous suggestions).

The fact that there were multiple equally plausible candidates (that
are not identically named and just from different RTEs) tells us
plenty, unlike with tab completion. It's not hard for one column to be
a better match than another, and so it doesn't seem unreasonable to
insist upon that within a single RTE where they cannot be identical,
since a conservative approach seems to be what is generally favored.
In any case I'm just trying to weigh everyone's concerns here. I hope
it's actually possible to compromise, but right now I don't know what
I can do to make useful progress.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

10 июля 2014 г., 01:19:17

On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> 6) Sometimes no hints are returned... Even in simple cases like this one:
> =# create table foo (aa int, bb int);
> CREATE TABLE
> =# select ab from foo;
> ERROR:  42703: column "ab" does not exist
> LINE 1: select ab from foo;
>                ^
> LOCATION:  errorMissingColumn, parse_relation.c:3123

In this example, it seems obvious that both "aa" and "bb" should be
suggested when they are not. But what if there were far more columns,
as might be expected in realistic cases (suppose all other columns
have at least 3 characters)? That's another kettle of fish. The
assumption that it's probably one of those two equally distant columns
is now on very shaky ground. After all, the user can only have meant
one particular column. If we apply a limited kind of Turing test to
this second case, how does the most recent revision's algorithm do?
What would a human suggest? I'm pretty sure the answer is that the
human would shrug. Maybe he or she would say "I guess you might have
meant one of either aa or bb, but that really isn't obvious at all".
That doesn't inspire much confidence.

Now, maybe I should be more optimistic about it being one of the two
because there are only two possibilities to begin with. That seems
pretty dubious, though. In general I find it much more plausible based
on what we know that the user should rethink everything. And, as Tom
pointed out, showing nothing conveys something in itself once users
have been trained to expect something.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

17 июля 2014 г., 16:34:39

On Wed, Jul 9, 2014 at 3:56 PM, Peter Geoghegan <pg@heroku.com> wrote:

What would that reshuffling actually look like?

Something like the patch 1 attached...

Btw, re-reading this thread, everybody seem to agree that this is a useful feature, but we still do not have clear definitions of the circumstances under which column hints should be produced, except the number (up to two). So, putting my hands on it and biting the bullet, I have finished with the two patches attached making the implementation clearer:
- Patch 1 moves levenshtein functions from fuzzystrmatch to core.

- Patch 2 implements the column hints, rather unchanged from original proposition.

Patch 1 does a couple of things:

- fuzzystrmatch is dumped to 1.1, as Levenshtein functions are not part of it anymore, and moved to core.

- Removal of the LESS_EQUAL flag that made the original submission patch harder to understand. All the Levenshtein functions wrap a single common function.

- Documentation is moved, and regression tests for Levenshtein functions are added.

- Functions with costs are renamed with a suffix with costs.

After hacking this feature, I came up with the conclusion that it would be better for the user experience to move directly into backend code all the Levenshtein functions, instead of only moving in the common wrapper as Peter did in his original patches. This is done this way to avoid keeping portions of the same feature in two different places of the code (backend with common routine, fuzzystrmatch with levenshtein functions) and concentrate all the logic in a single place. Now, we may as well consider renaming the levenshtein functions into smarter names, like str_distance, and keep fuzzystrmatch to 1.0, having the functions levenshteing_* calling only the str_distance functions.

Having a set of in-core distance functions for strings would serve more general purposes like other object hinting (constraint names, tables, etc.).

Patch 2 is a rebase of the feature of Peter that can be applied on top of patch 1. The code is rather untouched (haven't much played with Peter's thingies), well-commented, but I think that this needs more work, particularly when a query has a single RTE like in this case where no hints are proposed to the user (mentioned upthread):
create table foo (aa int, bb int);

select ab from foo; -- no hints

Before doing anything more with patch 2, we still need to define clearly how hints should be produced, so that's clearly out-of-scope for this CF. Patch 1, though, prepares the field for hints of all kinds, so perhaps we could argue more on that first?

Regards,

--
Michael

Вложения

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

17 июля 2014 г., 21:54:33

I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Patch 2 is a rebase of the feature of Peter that can be applied on top of
> patch 1. The code is rather untouched (haven't much played with Peter's
> thingies), well-commented, but I think that this needs more work,
> particularly when a query has a single RTE like in this case where no hints
> are proposed to the user (mentioned upthread):

The only source of disagreement that I am aware of at this point is
the question of whether or not we should accept two candidates from
the same RTE. I lean slightly towards "no", as already explained [1]
[2], but it's not as if I feel that strongly either way - this
approach of looking for only a single best candidate per RTE taken in
deference to the concerns of others.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

[1] http://www.postgresql.org/message-id/CAM3SWZTrm4PmqMmL9=eYx-8f-Vx-ha7DmE4KOmS2vCOMOzGHrw@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM3SWZS6kiQEqJz4pV3Fkp6cgw1wS26exOQTjb_XMW3zE5b6mA@mail.gmail.com
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

18 июля 2014 г., 10:47:22

On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I am not opposed to moving the contrib code into core in the manner
> that you oppose. I don't feel strongly either way.
>
> I noticed in passing that your revision says this *within* levenshtein.c:
>
> + * Guaranteed to work with Name datatype's cstrings.
> + * For full details see levenshtein.c.
Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.

> I imagined that when a committer picked this up, an executive decision
> would be made one way or the other. I am quite willing to revise the
> patch to alter this behavior at the request of a committer.
Fine for me. I'll move this patch to the next stage then.
--
Michael

On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:
> There are several possible methods of doing that, but I think the best
> one is just to leave the SQL-callable C functions in fuzzystrmatch and
> move only the underlying code that supports into core.

I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.

It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.

Yes, that's a point I raised upthread as well. What about renaming those functions as string_distance and string_distance_less_than? Then have only fuzzystrmatch do some DirectFunctionCall using the in-core functions?

--
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

24 июля 2014 г., 02:18:35

Peter Geoghegan wrote:

> Maybe that would be marginally better than classic Levenshtein
> distance, but I doubt it would pay for itself. It's just more code to
> maintain. Are we really expecting to not get the best possible
> suggestion due to some number of transposition errors very frequently?
> You still have to have a worse suggestion spuriously get ahead of
> yours, and typically there just aren't that many to begin with. I'm
> not targeting spelling errors so much as thinkos around plurals and
> whether or not an underscore was used. Damerau-Levenshtein seems like
> an algorithm with fairly specialized applications.

Yes, it's for typos.  I guess it's an unfrequent scenario to have both a
typoed column and a column that's missing the plural declension, which
is the case in which Damerau-Lvsh would be a win.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Heikki Linnakangas

Дата:

06 октября 2014 г., 10:09:37

On 07/18/2014 10:47 AM, Michael Paquier wrote:
> On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> I am not opposed to moving the contrib code into core in the manner
>> that you oppose. I don't feel strongly either way.
>>
>> I noticed in passing that your revision says this *within* levenshtein.c:
>>
>> + * Guaranteed to work with Name datatype's cstrings.
>> + * For full details see levenshtein.c.
> Yeah, I looked at what I produced yesterday night again and came
> across a couple of similar things :) And reworked a couple of things
> in the version attached, mainly wordsmithing and adding comments here
> and there, as well as making the naming of the Levenshtein functions
> in core the same as the ones in fuzzystrmatch 1.0.
>
>> I imagined that when a committer picked this up, an executive decision
>> would be made one way or the other. I am quite willing to revise the
>> patch to alter this behavior at the request of a committer.
> Fine for me. I'll move this patch to the next stage then.

There are a bunch of compiler warnings:

parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’ 
[-Wunused-variable]

- Heikki

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

07 ноября 2014 г., 23:58:05

On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 07/18/2014 10:47 AM, Michael Paquier wrote:
>>
>> On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
>>>
>>> I am not opposed to moving the contrib code into core in the manner
>>> that you oppose. I don't feel strongly either way.
>>>
>>> I noticed in passing that your revision says this *within* levenshtein.c:
>>>
>>> + * Guaranteed to work with Name datatype's cstrings.
>>> + * For full details see levenshtein.c.
>>
>> Yeah, I looked at what I produced yesterday night again and came
>> across a couple of similar things :) And reworked a couple of things
>> in the version attached, mainly wordsmithing and adding comments here
>> and there, as well as making the naming of the Levenshtein functions
>> in core the same as the ones in fuzzystrmatch 1.0.
>>
>>> I imagined that when a committer picked this up, an executive decision
>>> would be made one way or the other. I am quite willing to revise the
>>> patch to alter this behavior at the request of a committer.
>>
>> Fine for me. I'll move this patch to the next stage then.
>
>
> There are a bunch of compiler warnings:
>
> parse_relation.c: In function ‘errorMissingColumn’:
> parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
> parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
> levenshtein.c: In function ‘levenshtein_common’:
> levenshtein.c:107:6: warning: unused variable ‘start_column_local’
> [-Wunused-variable]

Based on this review from a month ago, I'm going to mark this Waiting
on Author.  If nobody updates the patch in a few days, I'll mark it
Returned with Feedback.  Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

10 ноября 2014 г., 07:48:42

On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Based on this review from a month ago, I'm going to mark this Waiting
> on Author.  If nobody updates the patch in a few days, I'll mark it
> Returned with Feedback.  Thanks.

Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].

Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.

This is not a completion mechanism; it is supposed to work on
*complete* column references with slight misspellings (e.g. incorrect
use of plurals, or column references with an omitted underscore
character). Weighing Tom's concerns about suggestions that are of
absolute low quality is what makes me conclude that this is the thing
to do.

[1] http://www.postgresql.org/message-id/CAM3SWZTzQO=OY4jmfB-65ieFie8iHUkDErK-0oLJETm8dSrSpw@mail.gmail.com
--
Peter Geoghegan

Вложения

0001-Levenshtein-distance-column-HINT.patch

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

10 ноября 2014 г., 07:56:50

<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Nov 10, 2014 at 1:48 PM, Peter
Geoghegan<span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br
/><blockquoteclass="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"><spanclass="">On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> > Based on this review from a month
ago,I'm going to mark this Waiting<br /> > on Author.  If nobody updates the patch in a few days, I'll mark it<br />
>Returned with Feedback.  Thanks.<br /><br /></span>Attached revision fixes the compiler warning that Heikki
complained<br/> about. I maintain SQL-callable stub functions from within contrib,<br /> rather than follow Michael's
approach.In other words, very little has<br /> changed from my revision from July last [1].<span class=""><font
color="#888888"><br/></font></span></blockquote></div>FWIW, I still find this bit of code that this patch adds in
varlena.cugly:<br />+#include "levenshtein.c"<br />+#define LEVENSHTEIN_LESS_EQUAL<br />+#include "levenshtein.c"<br
/>+#undefLEVENSHTEIN_LESS_EQUAL<br /></div><div class="gmail_extra">-- <br /><div class="gmail_signature">Michael<br
/></div></div></div>

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

10 ноября 2014 г., 08:03:50

On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> FWIW, I still find this bit of code that this patch adds in varlena.c ugly:
>
> +#include "levenshtein.c"
> +#define LEVENSHTEIN_LESS_EQUAL
> +#include "levenshtein.c"
> +#undef LEVENSHTEIN_LESS_EQUAL

Okay, but this is the coding that currently appears within contrib's
fuzzystrmatch.c, more or less unchanged. The "#undef
LEVENSHTEIN_LESS_EQUAL" line that I added ought to be unnecessary.
I'll give the final word on that to whoever picks this up.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

11 ноября 2014 г., 07:13:25

<div dir="ltr">On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan <span dir="ltr"><<a href="mailto:pg@heroku.com"
target="_blank">pg@heroku.com</a>></span>wrote:<br /><div class="gmail_extra"><div class="gmail_quote"><blockquote
class="gmail_quote"style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> Reminder:
Imaintain a slight preference for only offering one<br /> suggestion per relation RTE, which is what this revision does
(sono<br /> change there). If a committer who picks this up wants me to alter<br /> that, I don't mind doing so; since
onlyMichael spoke up on this, I've<br /> kept things my way.<span class=""><font color="#888888"><br
/></font></span></blockquote></div>Hm.The last version of this patch has not really changed since since my first
review,and I have no more feedback to provide about it except what I already mentioned. I honestly don't think that
thispatch is ready for committer as-is... If someone wants to review it further, well extra opinions I am sure are
welcome.<br/>-- <br /><div class="gmail_signature">Michael<br /></div></div></div>

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

11 ноября 2014 г., 09:29:44

On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Hm. The last version of this patch has not really changed since since my
> first review, and I have no more feedback to provide about it except what I
> already mentioned. I honestly don't think that this patch is ready for
> committer as-is... If someone wants to review it further, well extra
> opinions I am sure are welcome.

Why not?

You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.

What does that leave?

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

11 ноября 2014 г., 09:52:50

On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Why not?
>
> You've already said that you're happy to defer to whatever committer
> picks this up with regard to whether or not more than a single
> suggestion can come from an RTE. I agreed with this (i.e. I said I'd
> defer to their opinion too), and once again drew particular attention
> to this state of affairs alongside my most recent revision.
>
> What does that leave?

I see you've marked this "Needs Review", even though your previously
marked it "Ready for Committer" a few months back (Robert marked it
"Waiting on Author" very recently because of the compiler warning, and
then I marked it back to "Ready for Committer" once that was
addressed, before you finally marked it back to "Needs Review" and
removed yourself as the reviewer just now).

I'm pretty puzzled by this. Other than our "agree to disagree and
defer to committer" position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].

Did I miss something else?

[1] http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-BVqzgP0qA1Y_xDC_1g@mail.gmail.com
[2] http://www.postgresql.org/message-id/CA+TgmoYKiiq8MC0UJ5i5XfkTYBg1qqfN4YRCkZ60YDUnumkzzQ@mail.gmail.com
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

13 ноября 2014 г., 00:00:03

On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Based on this review from a month ago, I'm going to mark this Waiting
>> on Author.  If nobody updates the patch in a few days, I'll mark it
>> Returned with Feedback.  Thanks.
>
> Attached revision fixes the compiler warning that Heikki complained
> about. I maintain SQL-callable stub functions from within contrib,
> rather than follow Michael's approach. In other words, very little has
> changed from my revision from July last [1].

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors.  If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future.  Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs.  I think you need to
change that if you want to get anywhere with this.  On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting.  I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

13 ноября 2014 г., 00:14:00

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree with your proposed approach to moving Levenshtein into core.
> However, I think this should be separated into two patches, one of
> them moving the Levenshtein functionality into core, and the other
> adding the new treatment for missing column errors.  If you can do
> that relatively soon, I'll make an effort to get the refactoring patch
> committed in the near future.  Once that's done, we can focus in on
> the interesting part of the patch, which is the actual machinery for
> suggesting alternatives.

Okay, thanks. I think I can do that fairly soon.

> On that topic, I think there's unanimous consensus against the design
> where equally-distant matches are treated differently based on whether
> they are in the same RTE or different RTEs.  I think you need to
> change that if you want to get anywhere with this.

Alright. It wasn't as if I felt very strongly about it either way.

> On a related note,
> the use of the additional parameter AttrNumber closest[2] to
> searchRangeTableForCol() and of the additional parameters AttrNumber
> *matchedatt and int *distance to scanRTEForColumn() is less than
> self-documenting.  I suggest creating a structure called something
> like FuzzyAttrMatchState and passing a pointer to it down to both
> functions.

Sure.
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

13 ноября 2014 г., 03:55:06

On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree with your proposed approach to moving Levenshtein into core.
>> However, I think this should be separated into two patches, one of
>> them moving the Levenshtein functionality into core, and the other
>> adding the new treatment for missing column errors.  If you can do
>> that relatively soon, I'll make an effort to get the refactoring patch
>> committed in the near future.  Once that's done, we can focus in on
>> the interesting part of the patch, which is the actual machinery for
>> suggesting alternatives.
>
> Okay, thanks. I think I can do that fairly soon.

Attached patch moves the Levenshtein distance implementation into core.

You're missing patch 2 of 2 here, because I have yet to incorporate
your feedback on the HINT itself -- when I've done that, I'll post a
newly rebased patch 2/2, with those items taken care of. As you
pointed out, there is no reason to wait for that.

--
Peter Geoghegan

Вложения

.0001-Move-Levenshtein-distance-implementation-into-core.patch.swp

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

13 ноября 2014 г., 04:00:54

On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch moves the Levenshtein distance implementation into core.

Oops. Somehow managed to send a *.patch.swp file.  :-)

Here is the actual patch.

--
Peter Geoghegan

Вложения

0001-Move-Levenshtein-distance-implementation-into-core.patch

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

13 ноября 2014 г., 06:42:33

On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm pretty puzzled by this. Other than our "agree to disagree and
> defer to committer" position on the question of whether or not more
> than one suggestion can come from a single RTE, which you were fine
> with before [1], I have only restored the core/contrib separation to a
> state recently suggested by Robert as the best and simplest all around
> [2].
> Did I miss something else?
My point is: I am not sure I can be defined as a reviewer of this
patch or take any credit in this patch review knowing that the latest
version submitted is a simple rebase of the version I did my first
review on. Hence, code speaking, this patch is in the same state as
when it has been firstly submitted.
Thanks,
-- 
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

13 ноября 2014 г., 19:57:50

On Wed, Nov 12, 2014 at 10:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'm pretty puzzled by this. Other than our "agree to disagree and
>> defer to committer" position on the question of whether or not more
>> than one suggestion can come from a single RTE, which you were fine
>> with before [1], I have only restored the core/contrib separation to a
>> state recently suggested by Robert as the best and simplest all around
>> [2].
>> Did I miss something else?
> My point is: I am not sure I can be defined as a reviewer of this
> patch or take any credit in this patch review knowing that the latest
> version submitted is a simple rebase of the version I did my first
> review on. Hence, code speaking, this patch is in the same state as
> when it has been firstly submitted.

Of course you can.  Time spent reviewing is time spent reviewing,
whether it results in changes to the patch or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

13 ноября 2014 г., 20:16:10

On Thu, Nov 13, 2014 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> My point is: I am not sure I can be defined as a reviewer of this
>> patch or take any credit in this patch review knowing that the latest
>> version submitted is a simple rebase of the version I did my first
>> review on. Hence, code speaking, this patch is in the same state as
>> when it has been firstly submitted.
>
> Of course you can.  Time spent reviewing is time spent reviewing,
> whether it results in changes to the patch or not.


My thoughts exactly. I thought Michael did a good job, even if I
didn't agree with everything he said.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

13 ноября 2014 г., 20:36:39

On Wed, Nov 12, 2014 at 8:00 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached patch moves the Levenshtein distance implementation into core.
>
> Oops. Somehow managed to send a *.patch.swp file.  :-)
>
> Here is the actual patch.

Committed.  I changed varstr_leven() to varstr_levenshtein() because
abbrvs cn mk the code hrd to undstnd.  And to grep.  And I removed the
StaticAssertStmt you added, because it's not actually used for
anything that necessitates that, yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

13 ноября 2014 г., 20:50:34

On Thu, Nov 13, 2014 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Committed.  I changed varstr_leven() to varstr_levenshtein() because
> abbrvs cn mk the code hrd to undstnd.  And to grep.

Thanks. I'll produce a revision of patch 2/2 soon.

> And I removed the
> StaticAssertStmt you added, because it's not actually used for
> anything that necessitates that, yet.

I'll add it back in in patch 2/2, so.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

16 ноября 2014 г., 03:36:13

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On that topic, I think there's unanimous consensus against the design
> where equally-distant matches are treated differently based on whether
> they are in the same RTE or different RTEs.  I think you need to
> change that if you want to get anywhere with this.  On a related note,
> the use of the additional parameter AttrNumber closest[2] to
> searchRangeTableForCol() and of the additional parameters AttrNumber
> *matchedatt and int *distance to scanRTEForColumn() is less than
> self-documenting.  I suggest creating a structure called something
> like FuzzyAttrMatchState and passing a pointer to it down to both
> functions.

Attached patch incorporates this feedback.

The only user-visible difference between this revision and the
previous revision is that it's quite possible for two suggestion to
originate from the same RTE (there is exactly one change in the
regression test's expected output as compared to the last revision for
this reason. The regression tests are otherwise unchanged). It's still
not possible to see more than 2 suggestions under any circumstances,
no matter where they might have originated from, which I think is
appropriate -- we continue to not present any HINT in the event of 3
or more equidistant matches.

I think that the restructuring required to pass around a state
variable has resulted in somewhat clearer code.

--
Peter Geoghegan

Вложения

0001-Levenshtein-distance-column-HINT.patch

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

17 ноября 2014 г., 21:15:07

On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch incorporates this feedback.
>
> The only user-visible difference between this revision and the
> previous revision is that it's quite possible for two suggestion to
> originate from the same RTE (there is exactly one change in the
> regression test's expected output as compared to the last revision for
> this reason. The regression tests are otherwise unchanged). It's still
> not possible to see more than 2 suggestions under any circumstances,
> no matter where they might have originated from, which I think is
> appropriate -- we continue to not present any HINT in the event of 3
> or more equidistant matches.
>
> I think that the restructuring required to pass around a state
> variable has resulted in somewhat clearer code.

Cool!

I'm grumpy about the distanceName() function.  That seems too generic.
If we're going to keep this as it is, I suggest something like
computeRTEColumnDistance().  But see below.

On a related note, I'm also grumpy about this comment:

+    /* Charge half as much per deletion as per insertion or per substitution */
+    return varstr_levenshtein_less_equal(actual, len, match, match_len,
+                                         2, 1, 2, max);

The purpose of a code comment is to articulate WHY we did something,
rather than simply to restate what the code quite obviously does.  I
haven't heard a compelling argument for why this should be 2, 1, 2
rather than the default 1, 1, 1; and I'm inclined to do the latter
unless you can make some very good argument for this combination of
weights.  And if you can make such an argument, then there should be
comments so that the next person to come along and look at this code
doesn't go, huh, that's whacky, and change it.

+                 int location, FuzzyAttrMatchState *rtestate)

I suggest calling this "fuzzystate" rather than "rtestate"; it's not
the state of the RTE, but the state of the fuzzy matching.

Within the scanRTEForColumn block, we have a rather large chunk of
code protected by if (rtestate), which contains the only call to
distanceName().  I suggest that we move all of this logic to a
separate, static function, and merge distanceName into it.  I also
suggest testing against NULL explicitly instead of implicitly.  So
this block of code would end up as something like:

if (fuzzystate != NULL)   updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate);

In searchRangeTableForCol, I'm fairly certain that you've changed the
behavior by adding a check for if (rte->rtekind == RTE_JOIN) before
the call to scanRTEForColumn().  Why not instead put this check into
updateFuzzyAttrMatchState?  Then you can be sure you're not changing
the behavior in any other case.

On a similar note, I think the dropped-column test should happen early
as well, probably again in updateFuzzyAttrMatchState().  There's
little point in adding a suggestion only to throw it away again.

+            /*
+             * Charge extra (for inexact matches only) when an alias was
+             * specified that differs from what might have been used to
+             * correctly qualify this RTE's closest column
+             */
+            if (wrongalias)
+                rtestate.distance += 3;

I don't understand what situation this is catering to.  Can you
explain?  It seems to account for a good deal of complexity.
ERROR:  column "oid" does not existLINE 1: select oid > 0, * from altwithoid;               ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".

That seems like a stretch.  I think I suggested before using a
distance threshold of at most 3 or half the word length, whichever is
less.  For a three-letter column name that means not suggesting
anything if more than one character is different.  What you
implemented here is close to that, yet somehow we've got a suggestion
slipping through that has 2 out of 3 characters different.  I'm not
quite sure I see how that's getting through, but I think it shouldn't.
ERROR:  column fullname.text does not existLINE 1: select fullname.text from fullname;               ^
+HINT:  Perhaps you meant to reference the column "fullname"."last".

Same problem, only worse!  They've only got one letter of four in common.
ERROR:  column "xmin" does not existLINE 1: select xmin, * from fooview;               ^
+HINT:  Perhaps you meant to reference the column "fooview"."x".

Basically the same problem again.  I think the distance threshold in
this case should be half the shorter column name, i.e. 0.

Your new test cases include no negative test cases; that is, cases
where the machinery declines to suggest a hint because of, say, 3
equally good possibilities.  They probably should have something like
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

17 ноября 2014 г., 23:04:46

On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm grumpy about the distanceName() function.  That seems too generic.
> If we're going to keep this as it is, I suggest something like
> computeRTEColumnDistance().  But see below.

Fair point.

> On a related note, I'm also grumpy about this comment:
>
> +    /* Charge half as much per deletion as per insertion or per substitution */
> +    return varstr_levenshtein_less_equal(actual, len, match, match_len,
> +                                         2, 1, 2, max);
>
> The purpose of a code comment is to articulate WHY we did something,
> rather than simply to restate what the code quite obviously does.  I
> haven't heard a compelling argument for why this should be 2, 1, 2
> rather than the default 1, 1, 1; and I'm inclined to do the latter
> unless you can make some very good argument for this combination of
> weights.  And if you can make such an argument, then there should be
> comments so that the next person to come along and look at this code
> doesn't go, huh, that's whacky, and change it.

Okay. I agree that that deserves a comment. The actual argument for
this formulation is that it just seems to work better that way. For
example:

"""
postgres=# \d orderlines    Table "public.orderlines"  Column    |   Type   | Modifiers
-------------+----------+-----------orderlineid | integer  | not nullorderid     | integer  | not nullprod_id     |
integer | not nullquantity    | smallint | not nullorderdate   | date     | not null

postgres=# select qty from orderlines ;
ERROR:  42703: column "qty" does not exist
LINE 1: select qty from orderlines ;              ^
HINT:  Perhaps you meant to reference the column "orderlines"."quantity".
"""

The point is that the fact that the user supplied "qty" string has so
many fewer characters than what was obviously intended - "quantity" -
deserves to be weighed less. If you change the costing to weigh
character deletion as being equal to substitution/addition, this
example breaks. I also think it's pretty common to have noise words in
every attribute (e.g. every column in the "orderlines" table matches
"orderlines_*"), which might otherwise mess things up by overcharging
for deletion. Having extra characters in the correctly spelled column
name seems legitimately less significant to me.

Or, in other words: having actual characters from the misspelling
match the correct spelling (and having actual characters given not
fail to match) is most important. What was given by the user is more
important than what was not given but should have been, which is not
generally true for uses of Levenshtein distance. I reached this
conclusion through trying out the patch with a couple of real schemas,
and seeing what works best.

It's hard to express that idea tersely, in a comment, but I guess I'll try.

> +                 int location, FuzzyAttrMatchState *rtestate)
>
> I suggest calling this "fuzzystate" rather than "rtestate"; it's not
> the state of the RTE, but the state of the fuzzy matching.

The idea here was to differentiate this state from the overall range
table state (in general, FuzzyAttrMatchState may be one or the other).
But okay.

> Within the scanRTEForColumn block, we have a rather large chunk of
> code protected by if (rtestate), which contains the only call to
> distanceName().  I suggest that we move all of this logic to a
> separate, static function, and merge distanceName into it.  I also
> suggest testing against NULL explicitly instead of implicitly.  So
> this block of code would end up as something like:
>
> if (fuzzystate != NULL)
>     updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate);

Okay.

> In searchRangeTableForCol, I'm fairly certain that you've changed the
> behavior by adding a check for if (rte->rtekind == RTE_JOIN) before
> the call to scanRTEForColumn().  Why not instead put this check into
> updateFuzzyAttrMatchState?  Then you can be sure you're not changing
> the behavior in any other case.

I thought that I had avoided changing things (beyond what was
advertised as changed in relation to this most recent revision)
because I also changed things WRT multiple matches per RTE. It's
fuzzy. Anyway, yeah, I could do it there instead.

> On a similar note, I think the dropped-column test should happen early
> as well, probably again in updateFuzzyAttrMatchState().  There's
> little point in adding a suggestion only to throw it away again.

Agreed.

> +            /*
> +             * Charge extra (for inexact matches only) when an alias was
> +             * specified that differs from what might have been used to
> +             * correctly qualify this RTE's closest column
> +             */
> +            if (wrongalias)
> +                rtestate.distance += 3;
>
> I don't understand what situation this is catering to.  Can you
> explain?  It seems to account for a good deal of complexity.

Two cases:

1. Distinguishing between the case where there was an exact match to a
column that isn't visible (i.e. the existing reason for
errorMissingColumn() to call here), and the case where there is a
visible column, but our alias was the wrong one. I guess that could
live in errorMissingColumn(), but overall it's more convenient to do
it here, so that errorMissingColumn() handles things almost uniformly
and doesn't really have to care.

2. For non-exact (fuzzy) matches, it seems more useful to give one
match rather than two when the user gave an alias that matches one
particular RTE. Consider this:

"""
postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR:  42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...              ^
HINT:  Perhaps you meant to reference the column "o"."orderid" or the
column "ol"."orderid".
LOCATION:  errorMissingColumn, parse_relation.c:3166

postgres=# select ol.ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR:  42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...              ^
HINT:  Perhaps you meant to reference the column "ol"."orderid".
LOCATION:  errorMissingColumn, parse_relation.c:3147
"""

One suggestion is better than two if it's evident that that single
suggestion is a better fit. And, more broadly, the fact that an alias
was given and matches ought to be weighed.

>  ERROR:  column "oid" does not exist
>  LINE 1: select oid > 0, * from altwithoid;
>                 ^
> +HINT:  Perhaps you meant to reference the column "altwithoid"."col".
>
> That seems like a stretch.  I think I suggested before using a
> distance threshold of at most 3 or half the word length, whichever is
> less.  For a three-letter column name that means not suggesting
> anything if more than one character is different.  What you
> implemented here is close to that, yet somehow we've got a suggestion
> slipping through that has 2 out of 3 characters different.  I'm not
> quite sure I see how that's getting through, but I think it shouldn't.

It's because I don't apply the test on smaller strings. I felt that it
was riskier to apply an absolute quality test when the user-supplied
column reference does not exceed 6 characters. Consider my "qty" vs
"quantity" example. If I make this change:

--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,7 @@ searchRangeTableForCol(ParseState *pstate, const
char *alias, char *colname,        * seen when 6 deletions are required against actual attribute
name, or 3        * insertions/substitutions.        */
-       if (state->distance > 6 && state->distance > strlen(colname) / 2)
+       if (state->distance > strlen(colname) / 2)       {               state->rsecond = state->rfirst = NULL;
     state->second = state->first = InvalidAttrNumber;

Then a lot of the examples you complain about are fixed. But the "qty"
example is broken. Plus, this happens when the regression tests are
run:

*** /home/pg/postgresql/src/test/regress/expected/alter_table.out
2014-11-17 11:50:16.476426191 -0800
--- /home/pg/postgresql/src/test/regress/results/alter_table.out
2014-11-17 11:57:40.776410110 -0800
***************
*** 1343,1349 **** ERROR:  column "f1" does not exist LINE 1: select f1 from c1;                ^
- HINT:  Perhaps you meant to reference the column "c1"."f2". drop table p1 cascade; NOTICE:  drop cascades to table c1
createtable p1 (f1 int, f2 int);

And:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-17
11:50:16.480426191 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-17
11:57:08.916411263 -0800
***************
*** 3452,3458 **** ERROR:  column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s
join               ^

- HINT:  Perhaps you meant to reference the column "atts"."indexrelid". -- -- Test LATERAL --

(So no hint given in either case)

>  ERROR:  column fullname.text does not exist
>  LINE 1: select fullname.text from fullname;
>                 ^
> +HINT:  Perhaps you meant to reference the column "fullname"."last".

> Basically the same problem again.  I think the distance threshold in
> this case should be half the shorter column name, i.e. 0.

Well, there is always going to be the most marginal possible case that
still gets to see a suggestion. These are non-organic examples from
the regression tests. I'm more worried about having the suggestions
work well for organic/representative cases than I am about suppressing
non-useful suggestions in non-organic/non-representative cases. As I
mentioned, the costing is more or less derived by what I found to work
well in what I thought to be representative cases.

> Your new test cases include no negative test cases; that is, cases
> where the machinery declines to suggest a hint because of, say, 3
> equally good possibilities.  They probably should have something like
> that.

I'll think about that.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

19 ноября 2014 г., 02:29:42

On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> postgres=# select qty from orderlines ;
> ERROR:  42703: column "qty" does not exist
> LINE 1: select qty from orderlines ;
>                ^
> HINT:  Perhaps you meant to reference the column "orderlines"."quantity".
> """

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous.  The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

>> +            /*
>> +             * Charge extra (for inexact matches only) when an alias was
>> +             * specified that differs from what might have been used to
>> +             * correctly qualify this RTE's closest column
>> +             */
>> +            if (wrongalias)
>> +                rtestate.distance += 3;
>>
>> I don't understand what situation this is catering to.  Can you
>> explain?  It seems to account for a good deal of complexity.
>
> Two cases:
>
> 1. Distinguishing between the case where there was an exact match to a
> column that isn't visible (i.e. the existing reason for
> errorMissingColumn() to call here), and the case where there is a
> visible column, but our alias was the wrong one. I guess that could
> live in errorMissingColumn(), but overall it's more convenient to do
> it here, so that errorMissingColumn() handles things almost uniformly
> and doesn't really have to care.
>
> 2. For non-exact (fuzzy) matches, it seems more useful to give one
> match rather than two when the user gave an alias that matches one
> particular RTE. Consider this:
>
> """
> postgres=# select ordersid from orders o join orderlines ol on
> o.orderid = ol.orderid;
> ERROR:  42703: column "ordersid" does not exist
> LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
>                ^
> HINT:  Perhaps you meant to reference the column "o"."orderid" or the
> column "ol"."orderid".
> LOCATION:  errorMissingColumn, parse_relation.c:3166
>
> postgres=# select ol.ordersid from orders o join orderlines ol on
> o.orderid = ol.orderid;
> ERROR:  42703: column ol.ordersid does not exist
> LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
>                ^
> HINT:  Perhaps you meant to reference the column "ol"."orderid".
> LOCATION:  errorMissingColumn, parse_relation.c:3147
> """

I guess I'm confused at a broader level.  If the alias is wrong, why
are we considering names in this RTE *at all*?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 03:13:11

On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> postgres=# select qty from orderlines ;
>> ERROR:  42703: column "qty" does not exist
>> LINE 1: select qty from orderlines ;
>>                ^
>> HINT:  Perhaps you meant to reference the column "orderlines"."quantity".
>> """
>
> I don't buy this example, because it would give you the same hint if
> you told it you wanted to access a column called ant, or uay, or tit.
> And that's clearly ridiculous.  The reason why quantity looks like a
> reasonable suggestion for qty is because it's a conventional
> abbreviation, but an extremely high percentage of comparable cases
> won't be.

Is that so terrible? Yes, if those *exact* strings are tried, that'll
happen. But the vast majority of 3 letter strings will not do that
(including many 3 letter strings that include one of the letters 'q',
't' and 'y', such as "qqq", "ttt", and "yyy"). Why, in practice, would
someone even attempt those strings? I'm worried about Murphy, not
Machiavelli. That seems like a pretty important distinction here.

I maintain that omission of part of the correct spelling should be
weighed less. I am optimizing for the case where the user has a rough
idea of the structure and spelling of things - if they're typing in
random strings, or totally distinct synonyms, there is little we can
do about that. As I said, there will always be the most marginal case
that still gets a suggestion. I see no reason to hurt the common case
where we help in order to save the user from seeing a "ridiculous"
suggestion. I have a final test for the absolute quality of a
suggestion, but I think we could easily be too conservative about
that. At worst, our "ridiculous" suggestion makes apparent that the
user's incorrect spelling was itself ridiculous. With larger strings,
we can afford to be more conservative, and we are, because we have
more information to go on. Terse column names are not uncommon,
though.

>>> +            /*
>>> +             * Charge extra (for inexact matches only) when an alias was
>>> +             * specified that differs from what might have been used to
>>> +             * correctly qualify this RTE's closest column
>>> +             */
>>> +            if (wrongalias)
>>> +                rtestate.distance += 3;
>>>
>>> I don't understand what situation this is catering to.  Can you
>>> explain?  It seems to account for a good deal of complexity.

> I guess I'm confused at a broader level.  If the alias is wrong, why
> are we considering names in this RTE *at all*?

Because it's a common mistake when writing ad-hoc queries. People may
forget which exact table their column comes from. We certainly want to
weigh the fact that an alias was specified, but it shouldn't totally
limit our guess to that RTE. If nothing else, the fact that there was
a much closer match from another RTE ought to result in forcing there
to be no suggestion (due to there being too many equally good
suggestions). That's because, as I said, an *absolute* test for the
quality of a match is problematic (which, again, is why I err on the
side of letting the final, "absolute quality" test not limit
suggestions, particularly with short strings).

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

19 ноября 2014 г., 04:03:53

Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> postgres=# select qty from orderlines ;
>>> ERROR:  42703: column "qty" does not exist
>>> HINT:  Perhaps you meant to reference the column "orderlines"."quantity".

>> I don't buy this example, because it would give you the same hint if
>> you told it you wanted to access a column called ant, or uay, or tit.
>> And that's clearly ridiculous.  The reason why quantity looks like a
>> reasonable suggestion for qty is because it's a conventional
>> abbreviation, but an extremely high percentage of comparable cases
>> won't be.

> I maintain that omission of part of the correct spelling should be
> weighed less.

I would say that omission of the first letter should completely disqualify
suggestions based on this heuristic; but it might make sense to weight
omissions less after the first letter.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

19 ноября 2014 г., 16:43:59

On Tue, Nov 18, 2014 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> postgres=# select qty from orderlines ;
>>>> ERROR:  42703: column "qty" does not exist
>>>> HINT:  Perhaps you meant to reference the column "orderlines"."quantity".
>
>>> I don't buy this example, because it would give you the same hint if
>>> you told it you wanted to access a column called ant, or uay, or tit.
>>> And that's clearly ridiculous.  The reason why quantity looks like a
>>> reasonable suggestion for qty is because it's a conventional
>>> abbreviation, but an extremely high percentage of comparable cases
>>> won't be.
>
>> I maintain that omission of part of the correct spelling should be
>> weighed less.
>
> I would say that omission of the first letter should completely disqualify
> suggestions based on this heuristic; but it might make sense to weight
> omissions less after the first letter.

I think we would be well-advised not to start inventing our own
approximate matching algorithm.  Peter's suggestion boils down to a
guess that the default cost parameters for Levenshtein suck, and your
suggestion boils down to a guess that we can fix the problems with
Peter's suggestion by bolting another heuristic on top of it - and
possibly running Levenshtein twice with different sets of cost
parameters.  Ugh.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 20:34:07

On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we would be well-advised not to start inventing our own
> approximate matching algorithm.  Peter's suggestion boils down to a
> guess that the default cost parameters for Levenshtein suck, and your
> suggestion boils down to a guess that we can fix the problems with
> Peter's suggestion by bolting another heuristic on top of it - and
> possibly running Levenshtein twice with different sets of cost
> parameters.  Ugh.

I agree.

While I am perfectly comfortable with the fact that we are guessing
here, my guesses are based on what I observed to work well with real
schemas, and simulated errors that I thought were representative of
human error. Obviously it's possible that another scheme will do
better sometimes, including for example a scheme that picks a match
entirely at random. But on average, I think that what I have here will
do better than anything else proposed so far.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

19 ноября 2014 г., 20:52:32

On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we would be well-advised not to start inventing our own
>> approximate matching algorithm.  Peter's suggestion boils down to a
>> guess that the default cost parameters for Levenshtein suck, and your
>> suggestion boils down to a guess that we can fix the problems with
>> Peter's suggestion by bolting another heuristic on top of it - and
>> possibly running Levenshtein twice with different sets of cost
>> parameters.  Ugh.
>
> I agree.
>
> While I am perfectly comfortable with the fact that we are guessing
> here, my guesses are based on what I observed to work well with real
> schemas, and simulated errors that I thought were representative of
> human error. Obviously it's possible that another scheme will do
> better sometimes, including for example a scheme that picks a match
> entirely at random. But on average, I think that what I have here will
> do better than anything else proposed so far.

If you agree, then I'm not being clear enough.  I don't think think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number characters in the
shorter word.  Suggesting "col" for "oid" or "x" for "xmax", as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.

If we had a large database of examples showing what the user typed and
what they intended, we could try different algorithms against it and
see which one performs best with fewest false positives.  But if we
don't have that, we should do things that are like the things that
other people have done before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 21:22:21

On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> If you agree, then I'm not being clear enough.  I don't think think
> that tinkering with the Levenshtein cost factors is a good idea, and I
> think it's unhelpful to suggest something when the suggestion and the
> original word differ by more than 50% of the number characters in the
> shorter word.

I agree - except for very short strings, where there is insufficient
information to go on.

I was talking about the difficulty of bolting something on top of
Levenshtein distance that looked at the first character. That would
not need two costings, but would require an encoding aware matching of
the first code point.

> Suggesting "col" for "oid" or "x" for "xmax", as crops
> up in the regression tests with this patch applied, shows the folly of
> this: the user didn't mean the other named column; rather, the user
> was confused about whether a particular system column existed for that
> table.

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes things
in the regression test that are reasonable and helpful. As I mentioned
before, requiring a greater than 50% of total string size distance
breaks this, just within the regression tests:

""" ERROR:  column "f1" does not exist LINE 1: select f1 from c1;                ^
- HINT:  Perhaps you meant to reference the column "c1"."f2".
"""

And:

""" ERROR:  column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join
 ^

- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
"""

Those are really useful suggestions! And, they're much more
representative of real user error.

The downside of weighing deletion less than substitution and insertion
is much smaller than the upside. It's worth it. The downside is only
that the user gets to see the best suggestion that isn't all that good
in an absolute sense (which we have a much harder time concluding
using simple tests for short misspellings).

> If we had a large database of examples showing what the user typed and
> what they intended, we could try different algorithms against it and
> see which one performs best with fewest false positives.  But if we
> don't have that, we should do things that are like the things that
> other people have done before.

That seems totally impractical. No one has that kind of data that I'm aware of.

How about git as a kind of precedent? It is not at all conservative
about showing *some* suggestion:

"""
$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
diff
add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
add
"""

And why wouldn't git be? As far as its concerned, you can only have
meant one of those small number of things. Similarly, with the patch,
the number of things we can pick from is fairly limited at this stage,
since we are actually fairly far along with parse analysis.

Now, this won't give a suggestion:

"""
$ git aaaa
git: 'aaaa' is not a git command. See 'git --help'.
"""

So it looks like git similarly weighs deletion less than
insertion/substitution. Are its suggestions any less ridiculous than
your examples of questionable hints from the modified regression test
expected output? This is just a guidance mechanism, and at worst we'll
show the best match (on the mechanisms own terms, which isn't too
bad).

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 21:33:25

On Wed, Nov 19, 2014 at 10:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Those are all very terse strings. What you're overlooking is what is
> broken by using straight Levenshtein distance, which includes things
> in the regression test that are reasonable and helpful. As I mentioned
> before, requiring a greater than 50% of total string size distance
> breaks this, just within the regression tests:

Maybe you'd prefer if there was a more gradual ramp-up to requiring a
distance of no greater than 50% of the string size (normalized to take
account of my non-default costings). Right now it's a step function of
the number of characters in the string - there is no "absolute
quality" requirement for strings of 6 or fewer requirements.
Otherwise, there is the 50% distance absolute quality test (the test
that you want to be applied generally). I think that would be better,
without being much more complicated.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 21:34:12

On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
> there is no "absolute
> quality" requirement for strings of 6 or fewer requirements.

I meant 6 or fewer *characters*, obviously.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 22:00:48

On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Maybe you'd prefer if there was a more gradual ramp-up to requiring a
> distance of no greater than 50% of the string size (normalized to take
> account of my non-default costings)

I made this modification:

diff --git a/src/backend/parser/parse_relation.c
b/src/backend/parser/parse_relation.c
index 40c69d7..cca075f 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,8 @@ searchRangeTableForCol(ParseState *pstate, const
char *alias, char *colname,        * seen when 6 deletions are required against actual attribute
name, or 3        * insertions/substitutions.        */
-       if (state->distance > 6 && state->distance > strlen(colname) / 2)
+       if ((state->distance > 3 && state->distance > strlen(colname)) ||
+               (state->distance > 6 && state->distance > strlen(colname) / 2))       {               state->rsecond =
state->rfirst= NULL;               state->second = state->first = InvalidAttrNumber;

When I run the regression tests now, then all the cases that you found
objectionable in the regression tests' previous expected output
disappear, while all the cases I think are useful that were previously
removed by applying a broad 50% standard remain. While I'm not 100%
sure that this exact formulation is the best one, I think that we can
reach a compromise on this point, that allows the costing to remain
the same without offering particularly bad suggestions for short
strings.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

19 ноября 2014 г., 22:13:17

On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you agree, then I'm not being clear enough.  I don't think think
>> that tinkering with the Levenshtein cost factors is a good idea, and I
>> think it's unhelpful to suggest something when the suggestion and the
>> original word differ by more than 50% of the number characters in the
>> shorter word.
>
> I agree - except for very short strings, where there is insufficient
> information to go on.

That's precisely the time I think it's *most* important.  In a very
long string, the threshold should be LESS than 50%.  My original
proposal was "no more than 3 characters of difference, but in any
event not more than half the length of the shorter string".

>> Suggesting "col" for "oid" or "x" for "xmax", as crops
>> up in the regression tests with this patch applied, shows the folly of
>> this: the user didn't mean the other named column; rather, the user
>> was confused about whether a particular system column existed for that
>> table.
>
> Those are all very terse strings. What you're overlooking is what is
> broken by using straight Levenshtein distance, which includes things
> in the regression test that are reasonable and helpful. As I mentioned
> before, requiring a greater than 50% of total string size distance
> breaks this, just within the regression tests:
>
> """
>   ERROR:  column "f1" does not exist
>   LINE 1: select f1 from c1;
>                  ^
> - HINT:  Perhaps you meant to reference the column "c1"."f2".
> """

That's exactly 50%, not more than 50%.

(I'm also on the fence about whether the hint is actually helpful in
that case, but the rule I proposed wouldn't prohibit it.)

> And:
>
> """
>   ERROR:  column atts.relid does not exist
>   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
>                  ^
> - HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
> """
>
> Those are really useful suggestions! And, they're much more
> representative of real user error.

That one's right at 50% too, but it's certainly more than 3 characters
of difference.  I think it's going to be pretty hard to emit a
suggestion in that case but not in a whole lot of cases that don't
make any sense.

> How about git as a kind of precedent? It is not at all conservative
> about showing *some* suggestion:
>
> """
> $ git aa
> git: 'aa' is not a git command. See 'git --help'.
>
> Did you mean this?
> am
>
> $ git d
> git: 'd' is not a git command. See 'git --help'.
>
> Did you mean one of these?
> diff
> add
>
> $ git ddd
> git: 'ddd' is not a git command. See 'git --help'.
>
> Did you mean this?
> add
> """
>
> And why wouldn't git be? As far as its concerned, you can only have
> meant one of those small number of things. Similarly, with the patch,
> the number of things we can pick from is fairly limited at this stage,
> since we are actually fairly far along with parse analysis.

I went and found the actual code git uses for this.  It's here:

https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c

And the underlying Levenshtein implementation is here:

https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered.  And that does
overall seem to give pretty good suggestions.  However, an interesting
point about the git algorithm is that it's not hard to make it do
stupid things on short strings:

[rhaas pgsql]$ git xy
git: 'xy' is not a git command. See 'git --help'.

Did you mean one of these?   am   gc   mv   rm

Maybe they should adopt my idea of a lower cutoff for short strings.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 22:20:57

On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Apparently what they're doing is charging 0 for a transposition (which
> we don't have as a separate concept), 2 for a substitution, 1 for an
> insertion, and 3 for a deletion, with the constraint that anything
> with a total distance of more than 6 isn't considered.  And that does
> overall seem to give pretty good suggestions.

The git people know that no git command is longer than 4 or 5
characters. That doesn't apply to us. I certainly would not like to
have an absolute distance test of n characters.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 22:33:30

On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> That's precisely the time I think it's *most* important.  In a very
> long string, the threshold should be LESS than 50%.  My original
> proposal was "no more than 3 characters of difference, but in any
> event not more than half the length of the shorter string".

We can only hint based on the information given by the user. If they
give a lot of badly matching information, we have something to go on.

> That one's right at 50% too, but it's certainly more than 3 characters
> of difference.  I think it's going to be pretty hard to emit a
> suggestion in that case but not in a whole lot of cases that don't
> make any sense.

I don't think that's the case. Other RTEs are penalized for having
non-matching aliases here.

In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

19 ноября 2014 г., 22:34:16

Robert Haas wrote:

> And the underlying Levenshtein implementation is here:
> 
> https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c
> 
> Apparently what they're doing is charging 0 for a transposition (which
> we don't have as a separate concept), 2 for a substitution, 1 for an
> insertion, and 3 for a deletion, with the constraint that anything
> with a total distance of more than 6 isn't considered.

0 for a transposition, wow. I suggested adding transpositions but there
was no support for that idea.  I suggested it because I thikn it's the
most common form of typo, and charging 2 for a deletion plus 1 for an
insertion makes a single transposition mistaek count as 3, which seems
wrong -- particularly seeing the git precedent.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 22:45:06

On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> 0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just isn't that many transposition errors possible with a 4 character
string.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

19 ноября 2014 г., 22:55:40

Peter Geoghegan wrote:
> On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > 0 for a transposition, wow.
> 
> Again, they're optimizing for short strings (git commands) only. There
> just isn't that many transposition errors possible with a 4 character
> string.

If there's logic in your statement, I can't see it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

19 ноября 2014 г., 22:57:39

On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Peter Geoghegan wrote:
>> On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> > 0 for a transposition, wow.
>>
>> Again, they're optimizing for short strings (git commands) only. There
>> just isn't that many transposition errors possible with a 4 character
>> string.
>
> If there's logic in your statement, I can't see it.

The point is that transposition errors should not have no cost. If git
did not have an absolute quality test of a distance of 6, which they
can only have because all git commands are terse, then you could
construct a counter example.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Alvaro Herrera

Дата:

19 ноября 2014 г., 23:10:15

Peter Geoghegan wrote:
> On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Peter Geoghegan wrote:
> >> On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
> >> <alvherre@2ndquadrant.com> wrote:
> >> > 0 for a transposition, wow.
> >>
> >> Again, they're optimizing for short strings (git commands) only. There
> >> just isn't that many transposition errors possible with a 4 character
> >> string.
> >
> > If there's logic in your statement, I can't see it.
> 
> The point is that transposition errors should not have no cost. If git
> did not have an absolute quality test of a distance of 6, which they
> can only have because all git commands are terse, then you could
> construct a counter example.

Okay.  My point is just that whatever the string length, I think we'd do
well to regard transpositions as "cheap" in terms of error cost.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

20 ноября 2014 г., 18:32:48

On Wed, Nov 19, 2014 at 2:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't think that's the case. Other RTEs are penalized for having
> non-matching aliases here.

The point I made did not, as far as I can see, have anything to do
with non-matching aliases; it could arise with just a single RTE.

> In general, I think the cost of a bad suggestion is much lower than
> the benefit of a good one. You seem to be suggesting that they're
> equal. Or that they're equally likely in an organic situation. In my
> estimation, this is not the case at all.

The way I see it, the main cost of a bad suggestion is that it annoys
the user with clutter which they may brand as "stupid".  Think about
how much vitriol has been spewed over the years against progress bars
(or estimated completion) times that don't turn out to mirror reality.
Microsoft has gotten more cumulative flack about their inaccurate
progress bars over the years than they would have for dropping an
elevator on a cute baby.  Or think about how annoying it is when a
spell-checker or grammar-checker underlines something you've written
that is, in your own opinion, correctly spelled or grammatical.  Maybe
that kind of thing doesn't annoy you, but it definitely annoys me, and
I think probably a lot of other people.  My general experience is that
people get quite pissed off by bad suggestions from a computer.  At
least in my experience, users' actual level of agitation is often all
out of proportion to what might seem justified, but we are designing
this software for actual users, so their likely emotional reactions
are relevant.

> I'm curious about your thoughts on the compromise of a ramped up
> distance threshold to apply a test for the absolute quality of a
> match. I think that the fact that git gives bad suggestions with terse
> strings tells us a lot, though. Note that unlike git, with terse
> strings we may well have a good deal more equidistant matches, and as
> soon as the number of would-be matches exceeds 2, we actually give no
> matches at all. So that's an additional protection against poor
> matches with terse strings.

I don't know what you mean by a ramped-up distance threshold, exactly.
I think it's good for the distance threshold to be lower for small
strings and higher for large ones.  I think I'm somewhat open to
negotiation on the details, but I think any system that's going to
suggest "quantity" for "tit" is going too far.  If the user types
"qty" when they meant "quantity", they probably don't really need the
hint, because they're going to say to themselves "wait, I guess I
didn't abbreviate that".  The time when they need the hint is when
they typed "quanttiy", because it's quite possible to read a query
with that sort of typo multiple times and not realize that you've made
one.  You're sitting there puzzling over where the quantity column
went, and asking yourselves how you can be mis-remembering the schema,
and saying "wait, didn't I just see that column in the \d output" ...
and you don't even think to check carefully for a spelling mistake.
The hint may well clue you in to what the real problem is.

In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o.  We
won't be able to do the latter accurately enough to make it more
useful than annoying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

20 ноября 2014 г., 19:05:59

Robert Haas <robertmhaas@gmail.com> writes:
> ... In other words, I think there's value in trying to clue somebody in
> when they've made a typo, but not when they've made a think-o.  We
> won't be able to do the latter accurately enough to make it more
> useful than annoying.

FWIW, I concur with Robert's analysis that wrong suggestions are likely to
be annoying.  We should be erring on the side of not making a suggestion
rather than making one that's a low-probability guess.

I'm not particularly convinced that the "f1" -> "f2" example is a useful
behavior, and I'm downright horrified by the "qty" -> "quantity" case.
If the hint mechanism thinks the latter two are close enough together
to suggest, it's going to be spewing a whole lot of utterly ridiculous
suggestions.  I'm going to be annoyed way more times than I'm going to
be helped.

The big picture is that this is more or less our first venture into
heuristic suggestions.  I think we should start slow with a very
conservative set of heuristics.  If it's a success maybe we can get more
aggressive over time --- but if we go over the top here, the entire
concept will be discredited in this community for the next ten years.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

20 ноября 2014 г., 21:08:54

On Thu, Nov 20, 2014 at 8:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not particularly convinced that the "f1" -> "f2" example is a useful
> behavior, and I'm downright horrified by the "qty" -> "quantity" case.
> If the hint mechanism thinks the latter two are close enough together
> to suggest, it's going to be spewing a whole lot of utterly ridiculous
> suggestions.  I'm going to be annoyed way more times than I'm going to
> be helped.

I happen to think that that isn't the case, because the number of
possible suggestions is fairly low anyway, and people don't tend to
make those kind of errors. Robert's examples of "ridiculous"
suggestions of "quantity" based on three letter strings other than
"qty" (e.g. "tit") were rather contrived. In fact, most 3 letter
strings will not offer a suggestion. 3 or more Equidistant would-be
matches tend to offer a lot of additional protection against bad
suggestions for these terse strings.

> The big picture is that this is more or less our first venture into
> heuristic suggestions.  I think we should start slow with a very
> conservative set of heuristics.  If it's a success maybe we can get more
> aggressive over time --- but if we go over the top here, the entire
> concept will be discredited in this community for the next ten years.

I certainly see your point here. It's not as if we have an *evolved*
understanding of the usability issues. Besides, as Robert pointed out,
most of the value of this patch is added by simple cases, like a
failure to pluralize or not pluralize, or the omission of an
underscore.

I still think we should charge half for deletion, but I will concede
that it's prudent to apply a more restrictive absolute quality final
test.
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

20 ноября 2014 г., 21:30:53

On Thu, Nov 20, 2014 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> In general, I think the cost of a bad suggestion is much lower than
>> the benefit of a good one. You seem to be suggesting that they're
>> equal. Or that they're equally likely in an organic situation. In my
>> estimation, this is not the case at all.
>
> The way I see it, the main cost of a bad suggestion is that it annoys
> the user with clutter which they may brand as "stupid".  Think about
> how much vitriol has been spewed over the years against progress bars
> (or estimated completion) times that don't turn out to mirror reality.

Well, you can judge the quality of the suggestion immediately. I
imagined a mechanism that gives a little bit more than the minimum
amount of guidance for things like contractions/abbreviations.

> Microsoft has gotten more cumulative flack about their inaccurate
> progress bars over the years than they would have for dropping an
> elevator on a cute baby.

I haven't used a more recent version of Windows than Windows Vista,
but I'm pretty sure that they kept it up.

>> I'm curious about your thoughts on the compromise of a ramped up
>> distance threshold to apply a test for the absolute quality of a
>> match. I think that the fact that git gives bad suggestions with terse
>> strings tells us a lot, though. Note that unlike git, with terse
>> strings we may well have a good deal more equidistant matches, and as
>> soon as the number of would-be matches exceeds 2, we actually give no
>> matches at all. So that's an additional protection against poor
>> matches with terse strings.
>
> I don't know what you mean by a ramped-up distance threshold, exactly.
> I think it's good for the distance threshold to be lower for small
> strings and higher for large ones.  I think I'm somewhat open to
> negotiation on the details, but I think any system that's going to
> suggest "quantity" for "tit" is going too far.

I mean the suggestion of raising the cost threshold more gradually,
not as a step function of the number of characters in the string [1]
where it's either over 6 characters and must pass the 50% test, or
isn't and has no absolute quality test. The exact modification I
described will FWIW remove the "quantity" for "qty" suggestion, as
well as all the similar suggestions that you found objectionable (like
"tit" also offering a suggestion of "quantity").

If you look at the regression tests, none of the sensible suggestions
are lost (some would be by an across the board 50% absolute quality
threshold, as I previously pointed out [2]), but all the bad ones are.
I attach failed regression test output showing the difference between
the previous expected values, and actual values with that small
modification - it looks like most or all bad cases are now fixed.

> If the user types
> "qty" when they meant "quantity", they probably don't really need the
> hint, because they're going to say to themselves "wait, I guess I
> didn't abbreviate that".  The time when they need the hint is when
> they typed "quanttiy", because it's quite possible to read a query
> with that sort of typo multiple times and not realize that you've made
> one.

I agree that that's a more important case.

> In other words, I think there's value in trying to clue somebody in
> when they've made a typo, but not when they've made a think-o.  We
> won't be able to do the latter accurately enough to make it more
> useful than annoying.

That's certainly true; I think that we only disagree about the exact
point at which we enter the think-o correction business.

[1] http://www.postgresql.org/message-id/CAM3SWZT+7hH29Go6ZuY2OrCS40=6yPVM_nt9NjfovP3XwjixDw@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM3SWZTSGokNhT8rK+0Eed7spNJg4pAdMbqqYi0FH9bWcNvTGA@mail.gmail.com
--
Peter Geoghegan

Вложения

regression.diffs

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

20 ноября 2014 г., 22:03:54

On Thu, Nov 20, 2014 at 1:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I mean the suggestion of raising the cost threshold more gradually,
> not as a step function of the number of characters in the string [1]
> where it's either over 6 characters and must pass the 50% test, or
> isn't and has no absolute quality test. The exact modification I
> described will FWIW remove the "quantity" for "qty" suggestion, as
> well as all the similar suggestions that you found objectionable (like
> "tit" also offering a suggestion of "quantity").
>
> If you look at the regression tests, none of the sensible suggestions
> are lost (some would be by an across the board 50% absolute quality
> threshold, as I previously pointed out [2]), but all the bad ones are.
> I attach failed regression test output showing the difference between
> the previous expected values, and actual values with that small
> modification - it looks like most or all bad cases are now fixed.

That does seem to give better results, but it still seems awfully
complicated.  If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

20 ноября 2014 г., 23:01:04

On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> That does seem to give better results, but it still seems awfully
> complicated.  If we just used Levenshtein with all-default cost
> factors and a distance cap equal to Max(strlen(what_user_typed),
> strlen(candidate_match), 3), what cases that you think are important
> would be harmed?

Well, just by plugging in default Levenshtein cost factors, I can see
the following regression:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
10:17:55.042291912 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
11:42:15.670108745 -0800
***************
*** 3452,3458 **** ERROR:  column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s
join               ^

- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".

Within the catalogs, the names of attributes are prefixed as a form of
what you might call internal namespacing. For example, pg_index has
attributes that all begin with "ind*". You could easily omit something
like that, while still more or less knowing what you're looking for.

In more concrete terms, this gets no suggestion:

postgres=# select key from pg_index;
ERROR:  42703: column "key" does not exist
LINE 1: select key from pg_index;              ^

Only this does:

postgres=# select ikey from pg_index;
ERROR:  42703: column "ikey" does not exist
LINE 1: select ikey from pg_index;              ^
HINT:  Perhaps you meant to reference the column "pg_index"."indkey".
postgres=#

The git people varied their Levenshtein costings for a reason.

I also think that a one size fits all cap will break things. It will
independently break the example above, as well as the more marginal
"c1"."f2". vs "c1"."f2" case (okay, maybe that case was exactly on the
threshold, but others won't be).

I don't see that different costings actually saves any complexity.
Similarly, the final cap is quite straightforward. Anything with any
real complexity happens before that.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

20 ноября 2014 г., 23:14:59

On Thu, Nov 20, 2014 at 3:00 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That does seem to give better results, but it still seems awfully
>> complicated.  If we just used Levenshtein with all-default cost
>> factors and a distance cap equal to Max(strlen(what_user_typed),
>> strlen(candidate_match), 3), what cases that you think are important
>> would be harmed?
>
> Well, just by plugging in default Levenshtein cost factors, I can see
> the following regression:
>
> *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
> 10:17:55.042291912 -0800
> --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
> 11:42:15.670108745 -0800
> ***************
> *** 3452,3458 ****
>   ERROR:  column atts.relid does not exist
>   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
>                  ^
> - HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
>
> Within the catalogs, the names of attributes are prefixed as a form of
> what you might call internal namespacing. For example, pg_index has
> attributes that all begin with "ind*". You could easily omit something
> like that, while still more or less knowing what you're looking for.
>
> In more concrete terms, this gets no suggestion:
>
> postgres=# select key from pg_index;
> ERROR:  42703: column "key" does not exist
> LINE 1: select key from pg_index;
>                ^
>
> Only this does:
>
> postgres=# select ikey from pg_index;
> ERROR:  42703: column "ikey" does not exist
> LINE 1: select ikey from pg_index;
>                ^
> HINT:  Perhaps you meant to reference the column "pg_index"."indkey".
> postgres=#

Seems fine to me.  If you typed relid rather than indexrelid or key
rather than indkey, that's a thinko, not a typo.  ikey for indkey
could plausible be a typo, though you'd have to be having a fairly bad
day at the keyboard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Andres Freund

Дата:

20 ноября 2014 г., 23:20:09

On 2014-11-20 12:00:51 -0800, Peter Geoghegan wrote:
> In more concrete terms, this gets no suggestion:
> 
> postgres=# select key from pg_index;
> ERROR:  42703: column "key" does not exist
> LINE 1: select key from pg_index;
>                ^

I don't think that's a bad thing. Yes, for a human those look pretty
similar, but it's easy to construct cases where that gives completely
hilarious results.

I think something simplistic like levenshtein, even with modified
distances, is good to catch typos. But not to find terms that are
related in more complex ways.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

20 ноября 2014 г., 23:20:15

On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Seems fine to me.  If you typed relid rather than indexrelid or key
> rather than indkey, that's a thinko, not a typo.  ikey for indkey
> could plausible be a typo, though you'd have to be having a fairly bad
> day at the keyboard.

I can tell that I have no chance of convincing you otherwise. While I
think you're mistaken to go against the precedent set by git, you're
the one with the commit bit, and I think we've already spent enough
time discussing this. So default costings it is.
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

20 ноября 2014 г., 23:26:29

Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That does seem to give better results, but it still seems awfully
>> complicated.  If we just used Levenshtein with all-default cost
>> factors and a distance cap equal to Max(strlen(what_user_typed),
>> strlen(candidate_match), 3), what cases that you think are important
>> would be harmed?

> Well, just by plugging in default Levenshtein cost factors, I can see
> the following regression:

> *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
> 10:17:55.042291912 -0800
> --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
> 11:42:15.670108745 -0800
> ***************
> *** 3452,3458 ****
>   ERROR:  column atts.relid does not exist
>   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
>                  ^
> - HINT:  Perhaps you meant to reference the column "atts"."indexrelid".


I do not have a problem with deciding that that is not a "regression";
in fact, not giving that hint seems like a good conservative behavior
here.  By your logic, we should also be prepared to suggest
"supercalifragilisticexpialidocious" when the user enters "ocious".
It's simply a bridge too far for what is supposed to be a hint for
minor typos.  You sound like you want to turn it into something that
will look up column names for people who are too lazy to even try to
type the right thing.  While I can see the value of such a tool within
certain contexts, firing completed queries at a live SQL engine
is not one of them.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

20 ноября 2014 г., 23:46:16

On Thu, Nov 20, 2014 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I do not have a problem with deciding that that is not a "regression";
> in fact, not giving that hint seems like a good conservative behavior
> here.  By your logic, we should also be prepared to suggest
> "supercalifragilisticexpialidocious" when the user enters "ocious".

That's clearly not true. I just want to be a bit more forgiving of
omissions. That clearly isn't my logic, since that isn't a suggestion
that the implementation will give, or would be anywhere close to
giving - my weighing of deletions is only twice that of substitutions
or insertions, not ten times. git does not use Levenshtein default
costings either.

> It's simply a bridge too far for what is supposed to be a hint for
> minor typos.

Minor typos and minor omissions. My example was on the edge of what
would be tolerable under my proposed cost model.

> You sound like you want to turn it into something that
> will look up column names for people who are too lazy to even try to
> type the right thing.  While I can see the value of such a tool within
> certain contexts, firing completed queries at a live SQL engine
> is not one of them.

It's just a hint; a convenience. Users who imagine that it takes away
the need for putting any thought into their SQL queries have bigger
problems.

Anyway, that's all that needs to be said on that, since I've already
given up on a non-default costing. Also, we have default costing, and
we always apply a 50% standard (I see no point in doing otherwise with
default costings).

Robert: Where does that leave us? What about suggestions across RTEs?
Alias costing, etc?
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

20 ноября 2014 г., 23:50:26

On Thu, Nov 20, 2014 at 3:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Seems fine to me.  If you typed relid rather than indexrelid or key
>> rather than indkey, that's a thinko, not a typo.  ikey for indkey
>> could plausible be a typo, though you'd have to be having a fairly bad
>> day at the keyboard.
>
> I can tell that I have no chance of convincing you otherwise. While I
> think you're mistaken to go against the precedent set by git, you're
> the one with the commit bit, and I think we've already spent enough
> time discussing this. So default costings it is.

I've got a few +1s, too, if you notice.

I'm willing to be outvoted, but not by a majority of one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

20 ноября 2014 г., 23:57:31

On Thu, Nov 20, 2014 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've got a few +1s, too, if you notice.

Then maybe I spoke too soon.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

David G Johnston

Дата:

21 ноября 2014 г., 00:46:29

Andres Freund-3 wrote
> I think something simplistic like levenshtein, even with modified
> distances, is good to catch typos. But not to find terms that are
> related in more complex ways.


Tom Lane-2 wrote
> The big picture is that this is more or less our first venture into 
> heuristic suggestions.  I think we should start slow with a very 
> conservative set of heuristics.  If it's a success maybe we can get more 
> aggressive over time --- but if we go over the top here, the entire 
> concept will be discredited in this community for the next ten years. 

+1 for both of these conclusions.

The observations regarding standard column prefixes and thinking that
abbreviations are in use when in fact the names are spelled out are indeed
in-the-wild behaviors that should be considered but a levenshtein distance
algorithm is likely not going to be useful in pointing out mistakes in those
situations.  Limiting the immediate focus to "fat/thin-fingering of keys" -
for which levenshtein is well suited - is useful and will provide data
points that can then guide future artificial intelligence endeavors.

David J.




--
View this message in context:
http://postgresql.nabble.com/Doing-better-at-HINTing-an-appropriate-column-within-errorMissingColumn-tp5797700p5827786.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

26 ноября 2014 г., 03:13:42

Alright, so let me summarize what I think are the next steps in
working towards getting this patch committed. I should produce a new
revision which:

* Uses default costings.

* Applies a generic final quality check that enforces a distance of no
greater than 50% of the total string size. (The use of default
costings removes any reason to continue to do this)

* Work through Robert's suggestions on other aspects that need work
[1], most of which I already agreed to.

What is unclear is whether or not I should continue to charge extra
for non-matching user supplied alias (and, I think more broadly,
consider multiple RTEs iff the user did use an alias) - Robert was
skeptical, but didn't seem to have made his mind up. I still think I
should cost things based on aliases, and consider multiple RTEs even
when the user supplied an alias (the penalty should just be a distance
of 1 and not 3, though, in light of other changes to the
weighing/costing). If I don't hear anything in the next day or two,
I'll more or less preserve aliases-related aspects of the patch.

Did I miss something else?

[1] http://www.postgresql.org/message-id/CA+TgmoZLwzgyv=JAYfi6XfAK8OcBuTPYYhP5TbOqsS=YWVvzUw@mail.gmail.com
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

01 декабря 2014 г., 04:17:02

On Tue, Nov 25, 2014 at 4:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> If I don't hear anything in the next day or two,
> I'll more or less preserve aliases-related aspects of the patch.

FYI, I didn't go ahead and work on this, because I thought that the
thanksgiving holiday in the US probably kept you from giving feedback.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

03 декабря 2014 г., 00:11:12

On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Alright, so let me summarize what I think are the next steps in
> working towards getting this patch committed. I should produce a new
> revision which:
>
> * Uses default costings.
>
> * Applies a generic final quality check that enforces a distance of no
> greater than 50% of the total string size. (The use of default
> costings removes any reason to continue to do this)
>
> * Work through Robert's suggestions on other aspects that need work
> [1], most of which I already agreed to.

Sounds good so far.

> What is unclear is whether or not I should continue to charge extra
> for non-matching user supplied alias (and, I think more broadly,
> consider multiple RTEs iff the user did use an alias) - Robert was
> skeptical, but didn't seem to have made his mind up. I still think I
> should cost things based on aliases, and consider multiple RTEs even
> when the user supplied an alias (the penalty should just be a distance
> of 1 and not 3, though, in light of other changes to the
> weighing/costing). If I don't hear anything in the next day or two,
> I'll more or less preserve aliases-related aspects of the patch.

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name.  I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case ("HINT: You must refer to this as column
as "prime_minister.id" rather than "cameron.id").

Another idea, which I think I like less well, is to check the
Levenshtein distance between the allowed alias and the entered alias
and, if that's within the half-the-shorter-length threshold, consider
possible matches from that RTE, charge the distance between the
correct alias and the entered alias as a penalty to each potential
column match.

What I think won't do is to look at a situation where the user has
entered automobile.id and suggest that maybe they meant student.iq, or
even student.id. The amount of difference between the names has got to
matter for the RTE names, just as it does for the column names.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

04 декабря 2014 г., 05:21:37

On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Basically, the case in which I think it's helpful to issue a
> suggestion here is when the user has used the table name rather than
> the alias name.  I wonder if it's worth checking for that case
> specifically, in lieu of what you've done here, and issuing a totally
> different hint in that case ("HINT: You must refer to this as column
> as "prime_minister.id" rather than "cameron.id").

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

> Another idea, which I think I like less well, is to check the
> Levenshtein distance between the allowed alias and the entered alias
> and, if that's within the half-the-shorter-length threshold, consider
> possible matches from that RTE, charge the distance between the
> correct alias and the entered alias as a penalty to each potential
> column match.

I don't about that either. Aliases are often totally arbitrary,
particularly for ad-hoc queries, which is what this is aimed at.

> What I think won't do is to look at a situation where the user has
> entered automobile.id and suggest that maybe they meant student.iq, or
> even student.id.

I'm not sure I follow. If there is an automobile.ip, then it will be
suggested. If there is no automobile column that's much of a match (so
no "automobile.ip", say), then student.id will be suggested (and not
student.iq, *even if there is no student.id* - the final quality check
saves us). So this is possible:

postgres=# select iq, * from student, automobile;
ERROR:  42703: column "iq" does not exist
LINE 1: select iq, * from student, automobile;
               ^
HINT:  Perhaps you meant to reference the column "student"."id".
postgres=# select automobile.iq, * from student, automobile;
ERROR:  42703: column automobile.iq does not exist
LINE 1: select automobile.iq, * from student, automobile;
               ^

(note that using the table name makes us *not* see a suggestion where
we otherwise would).

The point is that there is a fixed penalty for a wrong user-specified
alias, but all relation RTEs are considered.

> The amount of difference between the names has got to
> matter for the RTE names, just as it does for the column names.

I think it makes sense that it matters by a fixed amount. Besides,
this seems complicated enough already - I don't won't to add more
complexity to worry about equidistant (but still actually valid)
RTE/table/alias names.

It sounds like your concern here is mostly a concern about the
relative distance among multiple matches, as opposed to the absolute
quality of suggestions. The former seems a lot less controversial than
the latter was, though - the user always gets the best match, or the
join pair of best matches, or no match when this new hinting mechanism
is involved.

I attach a new revision. The revision:

* Uses default costs for Levenshtein distance.

* Still charges extra for a non-alias-matching match (although it only
charges a fixed distance of 1 extra). This has regression test
coverage.

* Applies a generic final quality check that enforces a requirement
that a hint have a distance of no greater than 50% of the total string
size. No special treatment of shorter strings is involved anymore.

* Moves almost everything out of scanRTEForColumn() as you outlined
(into a new function, updateFuzzyAttrMatchState(), per your
suggestion).

* Moves dropped column detection into updateFuzzyAttrMatchState(), per
your suggestion.

* Still does the "if (rte->rtekind == RTE_JOIN)" thing in the existing
function searchRangeTableForCol().

I am quite confident that a suggestion from a join RTE will never be
useful, to either the existing use of searchRangeTableForCol() or this
expanded use, and it makes more sense to me to put it there. In fact,
the existing use of searchRangeTableForCol() is really rather similar
to this, and will give up on the first identical match (which is taken
as evidence that there is a attribute of that name, but isn't visible
at this level of the query). So I have not followed your suggestion
here.

Thoughts?
--
Peter Geoghegan

Вложения

0001-Levenshtein-distance-column-HINT.patch

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

05 декабря 2014 г., 23:33:19

On Wed, Dec 3, 2014 at 9:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Basically, the case in which I think it's helpful to issue a
>> suggestion here is when the user has used the table name rather than
>> the alias name.  I wonder if it's worth checking for that case
>> specifically, in lieu of what you've done here, and issuing a totally
>> different hint in that case ("HINT: You must refer to this as column
>> as "prime_minister.id" rather than "cameron.id").
>
> Well, if an alias is used, and you refer to an attribute using a
> non-alias name (i.e. the original table name), then you'll already get
> an error suggesting that the alias be used instead -- of course,
> that's nothing new. It doesn't matter to the existing hinting
> mechanism if the attribute name is otherwise wrong. Once you fix the
> code to use the alias suggested, you'll then get this new
> Levenshtein-based hint.

In that case, I think I favor giving no hint at all when the RTE name
is specified but doesn't match exactly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

05 декабря 2014 г., 23:45:44

On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, if an alias is used, and you refer to an attribute using a
>> non-alias name (i.e. the original table name), then you'll already get
>> an error suggesting that the alias be used instead -- of course,
>> that's nothing new. It doesn't matter to the existing hinting
>> mechanism if the attribute name is otherwise wrong. Once you fix the
>> code to use the alias suggested, you'll then get this new
>> Levenshtein-based hint.
>
> In that case, I think I favor giving no hint at all when the RTE name
> is specified but doesn't match exactly.

I don't follow. The existing mechanism only concerns what to do when
the original table name was used when an alias should have been used
instead. What does that have to do with this patch?

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

08 декабря 2014 г., 20:32:04

On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, if an alias is used, and you refer to an attribute using a
>>> non-alias name (i.e. the original table name), then you'll already get
>>> an error suggesting that the alias be used instead -- of course,
>>> that's nothing new. It doesn't matter to the existing hinting
>>> mechanism if the attribute name is otherwise wrong. Once you fix the
>>> code to use the alias suggested, you'll then get this new
>>> Levenshtein-based hint.
>>
>> In that case, I think I favor giving no hint at all when the RTE name
>> is specified but doesn't match exactly.
>
> I don't follow. The existing mechanism only concerns what to do when
> the original table name was used when an alias should have been used
> instead. What does that have to do with this patch?

Just that that's the case in which it seems useful to give a hint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

08 декабря 2014 г., 20:44:02

On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Just that that's the case in which it seems useful to give a hint.

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint. Besides,
considering every visible RTE (while penalizing non-exact alias names
iff the user provided an alias name) is actually going to make bad
hints less likely, by increasing the number of equidistant low quality
matches in a way that swamps the mechanism into providing no actual
match at all. That's an important additional protection against low
quality matches.

What do other people think here?
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

08 декабря 2014 г., 20:52:45

On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I think it's very possible that the wrong alias may be provided by the
> user, and that we should consider that when providing a hint.

Note that the existing mechanism (the mechanism that I'm trying to
improve) only ever shows this error message:

"There is a column named \"%s\" in table \"%s\", but it cannot be
referenced from this part of the query."

I think it's pretty clear that this general class of user error is common.
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

15 декабря 2014 г., 04:24:51

On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> I think it's very possible that the wrong alias may be provided by the
>> user, and that we should consider that when providing a hint.
>
> Note that the existing mechanism (the mechanism that I'm trying to
> improve) only ever shows this error message:
>
> "There is a column named \"%s\" in table \"%s\", but it cannot be
> referenced from this part of the query."
>
> I think it's pretty clear that this general class of user error is common.
Moving this patch to CF 2014-12 as work is still going on, note that
it is currently marked with Robert as reviewer and that its current
status is "Needs review".
-- 
Michael

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

15 декабря 2014 г., 19:45:02

On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
>>> I think it's very possible that the wrong alias may be provided by the
>>> user, and that we should consider that when providing a hint.
>>
>> Note that the existing mechanism (the mechanism that I'm trying to
>> improve) only ever shows this error message:
>>
>> "There is a column named \"%s\" in table \"%s\", but it cannot be
>> referenced from this part of the query."
>>
>> I think it's pretty clear that this general class of user error is common.
> Moving this patch to CF 2014-12 as work is still going on, note that
> it is currently marked with Robert as reviewer and that its current
> status is "Needs review".

The status here is more like "waiting around to see if anyone else has
an opinion".  The issue is what should happen when you enter qualified
name like alvaro.herrera and there is no column named anything like
herrara in the RTE named alvaro, but there is some OTHER RTE that
contains a column with a name that is only a small Levenshtein
distance away from herrera, like roberto.correra.  The questions are:

1. Should we EVER give a you-might-have-meant hint in a case like this?
2. If so, does it matter whether the RTE name is just a bit different
from the actual RTE or whether it's completely different?  In other
words, might we skip the hint in the above case but give one for
alvara.correra?

My current feeling is that we should answer #1 "no", but Peter prefers
to answer it "yes".  My further feeling is that if we do decide to say
"yes" to #1, then I would answer #2 as "yes" also, but Peter would
answer it "no", assigning a fixed penalty for a mismatched RTE rather
than one that varies by the Levenshtein distance between the RTEs.

If no one else expresses an opinion, I'm going to insist on doing it
my way, but I'm happy to have other people weigh in.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Tom Lane

Дата:

15 декабря 2014 г., 20:04:51

Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Moving this patch to CF 2014-12 as work is still going on, note that
>> it is currently marked with Robert as reviewer and that its current
>> status is "Needs review".

> The status here is more like "waiting around to see if anyone else has
> an opinion".  The issue is what should happen when you enter qualified
> name like alvaro.herrera and there is no column named anything like
> herrara in the RTE named alvaro, but there is some OTHER RTE that
> contains a column with a name that is only a small Levenshtein
> distance away from herrera, like roberto.correra.  The questions are:

> 1. Should we EVER give a you-might-have-meant hint in a case like this?
> 2. If so, does it matter whether the RTE name is just a bit different
> from the actual RTE or whether it's completely different?  In other
> words, might we skip the hint in the above case but give one for
> alvara.correra?

It would be astonishingly silly to not care about the RTE name's distance,
if you ask me.  This is supposed to detect typos, not thinkos.

I think there might be some value in a separate heuristic that, when
you typed foo.bar and that doesn't match but there is a baz.bar, suggests
that maybe you meant baz.bar, even if baz is not close typo-wise.  This
would be addressing the thinko case not the typo case, so the rules ought
to be quite different --- in particular I doubt that it'd be a good idea
to hint this way if the column names don't match exactly.  But in any
case the key point is that this is a different heuristic addressing a
different failure mode.  We should not try to make the
levenshtein-distance heuristic address that case.

So my two cents is that when considering a qualified name, this patch
should take levenshtein distance across the two components equally.
There's no good reason to suppose that typos will attack one name
component more (nor less) than the other.
        regards, tom lane

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Stephen Frost

Дата:

16 декабря 2014 г., 23:18:34

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> So my two cents is that when considering a qualified name, this patch
> should take levenshtein distance across the two components equally.
> There's no good reason to suppose that typos will attack one name
> component more (nor less) than the other.

Agreed (since it seems like folks are curious for the opinion's of
mostly bystanders).

+1 to the above for my part.
Thanks,
    Stephen

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

21 декабря 2014 г., 02:18:08

On Tue, Dec 16, 2014 at 12:18 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> So my two cents is that when considering a qualified name, this patch
>> should take levenshtein distance across the two components equally.
>> There's no good reason to suppose that typos will attack one name
>> component more (nor less) than the other.
>
> Agreed (since it seems like folks are curious for the opinion's of
> mostly bystanders).
>
> +1 to the above for my part.

Okay, then. Attached patch implements this scheme. It is identical to
the previous revision, except that iff there was an alias specified
and that alias does not match the correct name (alias/table name) of
the RTE currently under consideration, we charge the distance between
the differing aliases rather than a fixed distance of 1.

--
Peter Geoghegan

Вложения

0001-Levenshtein-distance-column-HINT.patch

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

21 декабря 2014 г., 03:30:27

On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch implements this scheme.

I had another thought: "NAMEDATALEN + 1" is a better representation of
"infinity" for matching purposes than INT_MAX. I probably should have
made that change, too. It would then not have been necessary to
"#include <limits.h>". I think that this is a useful
belt-and-suspenders precaution against integer overflow. It almost
certainly won't matter, since it's very unlikely that the best match
within an RTE will end up being a dropped column, but we might as well
do it that way (Levenshtein distance is costed in multiples of code
point changes, but the maximum density is 1 byte per codepoint).

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

22 декабря 2014 г., 16:50:50

On Sat, Dec 20, 2014 at 7:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached patch implements this scheme.
>
> I had another thought: "NAMEDATALEN + 1" is a better representation of
> "infinity" for matching purposes than INT_MAX. I probably should have
> made that change, too. It would then not have been necessary to
> "#include <limits.h>". I think that this is a useful
> belt-and-suspenders precaution against integer overflow. It almost
> certainly won't matter, since it's very unlikely that the best match
> within an RTE will end up being a dropped column, but we might as well
> do it that way (Levenshtein distance is costed in multiples of code
> point changes, but the maximum density is 1 byte per codepoint).

Good idea.

Looking over the latest patch, I think we could simplify the code so
that you don't need multiple FuzzyAttrMatchState objects.  Instead of
creating a separate one for each RTE and then merging them, just have
one.  When you find an inexact-RTE name match, set a field inside the
FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
and pass down the same state object.  Now let
updateFuzzyAttrMatchState() work out what it needs to do based on the
observed inter-column distance and the currently-in-force RTE penalty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

23 декабря 2014 г., 03:34:39

On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Looking over the latest patch, I think we could simplify the code so
> that you don't need multiple FuzzyAttrMatchState objects.  Instead of
> creating a separate one for each RTE and then merging them, just have
> one.  When you find an inexact-RTE name match, set a field inside the
> FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
> Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
> and pass down the same state object.  Now let
> updateFuzzyAttrMatchState() work out what it needs to do based on the
> observed inter-column distance and the currently-in-force RTE penalty.

I'm afraid I don't follow. I think doing things that way makes things
less clear. Merging is useful because it allows us to consider that an
exact match might exist, which this searchRangeTableForCol() is
already tasked with today. We now look for the best match
exhaustively, or magically return immediately in the event of an exact
match, without caring about the alias correctness or distance.

Having a separate object makes this pattern apparent from the top
level, within searchRangeTableForCol(). I feel that's better.
updateFuzzyAttrMatchState() is the wrong place to put that, because
that task rightfully belongs in searchRangeTableForCol(), where the
high level diagnostic-report-generating control flow lives.

To put it another way, creating a separate object obfuscates
scanRTEForColumn(), since it's the only client of
updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
function, and right now I am only making it slightly less clear by
tasking it with caring about distance of names on top of strict binary
equality of attribute names. I don't want to push it any further.
-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

23 декабря 2014 г., 03:43:12

On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> To put it another way, creating a separate object obfuscates
> scanRTEForColumn(), since it's the only client of
> updateFuzzyAttrMatchState().

Excuse me. I mean *not* creating a separate object -- having a unified
state representation for the entire range-table, rather than having
one per RTE and merging them one by one into an overall/final range
table object.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Michael Paquier

Дата:

13 февраля 2015 г., 10:41:46

<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Tue, Dec 23, 2014 at 9:43 AM, Peter
Geoghegan<span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span
class="">OnMon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan <<a href="mailto:pg@heroku.com">pg@heroku.com</a>>
wrote:<br/> > To put it another way, creating a separate object obfuscates<br /> > scanRTEForColumn(), since it's
theonly client of<br /> > updateFuzzyAttrMatchState().<br /><br /><br /></span>Excuse me. I mean *not* creating a
separateobject -- having a unified<br /> state representation for the entire range-table, rather than having<br /> one
perRTE and merging them one by one into an overall/final range<br /> table object.<span class="HOEnZb"></span><br
/></blockquote></div><brclear="all" /></div><div class="gmail_extra">Patch moved to CF 2015-02.<br /></div><div
class="gmail_extra">--<br /><div class="gmail_signature">Michael<br /></div></div></div>

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

10 марта 2015 г., 21:28:23

On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Looking over the latest patch, I think we could simplify the code so
>> that you don't need multiple FuzzyAttrMatchState objects.  Instead of
>> creating a separate one for each RTE and then merging them, just have
>> one.  When you find an inexact-RTE name match, set a field inside the
>> FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
>> Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
>> and pass down the same state object.  Now let
>> updateFuzzyAttrMatchState() work out what it needs to do based on the
>> observed inter-column distance and the currently-in-force RTE penalty.
>
> I'm afraid I don't follow. I think doing things that way makes things
> less clear. Merging is useful because it allows us to consider that an
> exact match might exist, which this searchRangeTableForCol() is
> already tasked with today. We now look for the best match
> exhaustively, or magically return immediately in the event of an exact
> match, without caring about the alias correctness or distance.
>
> Having a separate object makes this pattern apparent from the top
> level, within searchRangeTableForCol(). I feel that's better.
> updateFuzzyAttrMatchState() is the wrong place to put that, because
> that task rightfully belongs in searchRangeTableForCol(), where the
> high level diagnostic-report-generating control flow lives.
>
> To put it another way, creating a separate object obfuscates
> scanRTEForColumn(), since it's the only client of
> updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
> function, and right now I am only making it slightly less clear by
> tasking it with caring about distance of names on top of strict binary
> equality of attribute names. I don't want to push it any further.

I don't buy this.  What you're essentially doing is using the
FuzzyAttrMatchState object in two ways that are not entirely
compatible with each other - updateFuzzyAttrMatchState doesn't set the
RTE fields, so searchRangeTableForCol has to do it.  So there's an
unspoken contract that in some parts of the code, you can rely on
those fields being set, and in others, you can't.  That pretty much
defeats the whole point of making the state its own object, AFAICS.
Furthermore, you end up with two copies of the state-combining logic,
one in FuzzyAttrMatchState and a second in searchRangeTableForCol.
That's ugly and unnecessary.

I decided to rework this patch myself today; my version is attached.
I believe that this version is significantly easier to understand than
yours, both as to the code and the comments.  I put quite a bit of
work into both.  I also suspect it's more efficient, because it avoids
computing the Levenshtein distances for column names when we already
know that those column names can't possibly be sufficiently-good
matches for us to care about the details; and when it does compute the
Levenshtein distance it keeps the max-distance threshold as low as
possible.  That may not really matter much, but it can't hurt.  More
importantly, essentially all of the fuzzy-matching logic is now
isolated in FuzzyAttrMatchState(); the volume of change in
scanRTEForColumn is the same as in your version, but the volume of
change in searchRangeTableForCol is quite a bit less, so the patch is
smaller overall.

I'm prepared to commit this version if nobody finds a problem with it.
It passes the additional regression tests you wrote.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

column-hint-rmh.patch

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Peter Geoghegan

Дата:

10 марта 2015 г., 23:03:26

On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm prepared to commit this version if nobody finds a problem with it.
> It passes the additional regression tests you wrote.

Looks good to me. Thanks.

-- 
Peter Geoghegan

Re: Doing better at HINTing an appropriate column within errorMissingColumn()

От

Robert Haas

Дата:

11 марта 2015 г., 17:58:46

On Tue, Mar 10, 2015 at 4:03 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm prepared to commit this version if nobody finds a problem with it.
>> It passes the additional regression tests you wrote.
>
> Looks good to me. Thanks.

OK, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Doing better at HINTing an appropriate column within errorMissingColumn()

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения