Обсуждение: Doing better at HINTing an appropriate column within errorMissingColumn()
With the addition of LATERAL subqueries, Tom fixed up the mechanism for keeping track of which relations are visible for column references while the FROM clause is being scanned. That allowed errorMissingColumn() to give a more useful error to the one produced by the prior coding of that mechanism, with an errhint sometimes proffering: 'There is a column named "foo" in table "bar", but it cannot be referenced from this part of the query'. I wondered how much further this could be taken. Attached patch modifies contrib/fuzzystrmatch, moving its Levenshtein distance code into core without actually moving the relevant SQL functions too. That change allowed me to modify errorMissingColumn() to make more useful suggestions as to what might have been intended under other circumstances, like when someone fat-fingers a column name. psql tab completion is good, but not so good that this doesn't happen all the time. It's good practice to consistently name columns and tables such that it's possible to intuit the names of columns from the names of tables and so on, but it's still pretty common to forget if a column name from the table "orders" is "order_id", "orderid", or "ordersid", particularly if you're someone who regularly interacts with many databases. This problem is annoying in a low intensity kind of way. Consider the following sample sessions of mine, made with the dellstore2 sample database: [local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1; ERROR: 42703: column ol.orderids does not exist LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid... ^ HINT: Perhaps you meant to reference the column "ol"."orderid". LOCATION: errorMissingColumn, parse_relation.c:2989 [local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderid limit 1; orderid | orderdate | customerid | netamount | tax | totalamount | orderlineid | orderid | prod_id | quantity | orderdate ---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------ 1 | 2004-01-27 | 7888 | 313.24 | 25.84 | 339.08 | 1 | 1 | 9117 | 1 | 2004-01-27 (1 row) [local]/postgres=# select ordersid from orders o join orderlines ol on o.orderid = ol.orderid limit 1; ERROR: 42703: column "ordersid" does not exist LINE 1: select ordersid from orders o join orderlines ol on o.orderi... ^ HINT: Perhaps you meant to reference the column "o"."orderid". LOCATION: errorMissingColumn, parse_relation.c:2999 [local]/postgres=# select ol.ordersid from orders o join orderlines ol on o.orderid = ol.orderid limit 1; ERROR: 42703: column ol.ordersid does not exist LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord... ^ HINT: Perhaps you meant to reference the column "ol"."orderid". LOCATION: errorMissingColumn, parse_relation.c:2989 We try to give the most useful possible HINT here, charging extra for a non-matching alias, and going through the range table in order and preferring the first column observed to any subsequent column whose name is of the same distance as an earlier Var. The fuzzy string matching works well enough that it seems possible in practice to successfully have the parser make the right suggestion, even when the user's original guess was fairly far off. I've found it works best to charge half as much for a character deletion, so that's what is charged. I have some outstanding concerns about the proposed patch: * It may be the case that dense logosyllabic or morphographic writing systems, for example Kanji might consistently present, say, Japanese users with a suggestion that just isn't very useful, to the point of being annoying. Perhaps some Japanese hackers can comment on the actual risks here. * Perhaps I should have moved the Levenshtein distance functions into core and be done with it. I thought that given the present restriction that the implementation imposes on source and target string lengths, it would be best to leave the user-facing SQL functions in contrib. That restriction is not relevant to the internal use of Levenshtein distance added here, though. Thoughts? -- Peter Geoghegan
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Pavel Stehule
Дата:
Hello
I see only one risk - it can do some slowdown of exception processing. BEGIN
INSERT INTO ... ; /* ignore this error */
END;2014-03-27 20:10 GMT+01:00 Peter Geoghegan <pg@heroku.com>:
With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error to the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.
I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name. psql tab
completion is good, but not so good that this doesn't happen all the
time. It's good practice to consistently name columns and tables such
that it's possible to intuit the names of columns from the names of
tables and so on, but it's still pretty common to forget if a column
name from the table "orders" is "order_id", "orderid", or "ordersid",
particularly if you're someone who regularly interacts with many
databases. This problem is annoying in a low intensity kind of way.
Consider the following sample sessions of mine, made with the
dellstore2 sample database:
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
orderid | orderdate | customerid | netamount | tax | totalamount |
orderlineid | orderid | prod_id | quantity | orderdate
---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------
1 | 2004-01-27 | 7888 | 313.24 | 25.84 | 339.08 |
1 | 1 | 9117 | 1 | 2004-01-27
(1 row)
[local]/postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2999
[local]/postgres=# select ol.ordersid from orders o join orderlines ol
on o.orderid = ol.orderid limit 1;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
We try to give the most useful possible HINT here, charging extra for
a non-matching alias, and going through the range table in order and
preferring the first column observed to any subsequent column whose
name is of the same distance as an earlier Var. The fuzzy string
matching works well enough that it seems possible in practice to
successfully have the parser make the right suggestion, even when the
user's original guess was fairly far off. I've found it works best to
charge half as much for a character deletion, so that's what is
charged.
I have some outstanding concerns about the proposed patch:
* It may be the case that dense logosyllabic or morphographic writing
systems, for example Kanji might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.
* Perhaps I should have moved the Levenshtein distance functions into
core and be done with it. I thought that given the present restriction
that the implementation imposes on source and target string lengths,
it would be best to leave the user-facing SQL functions in contrib.
That restriction is not relevant to the internal use of Levenshtein
distance added here, though.
Thoughts?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > I see only one risk - it can do some slowdown of exception processing. I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in procedural code like that in practice. In any case it's worth noting that I continually pass back a "max" to the Levenshtein distance implementation, which is the current shortest distance observed. The implementation is therefore not obliged to exhaustively find a distance that is already known to be of no use. See commit 604ab0. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Pavel Stehule
Дата:
2014-03-28 9:22 GMT+01:00 Peter Geoghegan <pg@heroku.com>:
On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in
> I see only one risk - it can do some slowdown of exception processing.
procedural code like that in practice. In any case it's worth noting
that I continually pass back a "max" to the Levenshtein distance
implementation, which is the current shortest distance observed. The
implementation is therefore not obliged to exhaustively find a
distance that is already known to be of no use. See commit 604ab0.
if it is related to ERRCODE_UNDEFINED_COLUMN then it should be ok (from performance perspective)
but second issue can be usage from plpgsql - where is mix SQL identifiers and plpgsql variables.
Pavel
--
Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Oleg Bartunov
Дата:
<p dir="ltr">Very interesting idea, I'd think about optionally add similarity hinting support to psql tab. With, say, 80% of similarity matching, it shouldn't be very annoying. For interactive usage there is no risk of slowdown. <divclass="gmail_quote">On Mar 27, 2014 11:11 PM, "Peter Geoghegan" <<a href="mailto:pg@heroku.com">pg@heroku.com</a>>wrote:<br type="attribution" /><blockquote class="gmail_quote" style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> With the addition of LATERAL subqueries, Tom fixedup the mechanism<br /> for keeping track of which relations are visible for column references<br /> while the FROM clauseis being scanned. That allowed<br /> errorMissingColumn() to give a more useful error to the one produced<br /> bythe prior coding of that mechanism, with an errhint sometimes<br /> proffering: 'There is a column named "foo" in table"bar", but it<br /> cannot be referenced from this part of the query'.<br /><br /> I wondered how much further thiscould be taken. Attached patch<br /> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code<br /> intocore without actually moving the relevant SQL functions too. That<br /> change allowed me to modify errorMissingColumn()to make more useful<br /> suggestions as to what might have been intended under other<br /> circumstances,like when someone fat-fingers a column name. psql tab<br /> completion is good, but not so good that this doesn'thappen all the<br /> time. It's good practice to consistently name columns and tables such<br /> that it's possibleto intuit the names of columns from the names of<br /> tables and so on, but it's still pretty common to forget ifa column<br /> name from the table "orders" is "order_id", "orderid", or "ordersid",<br /> particularly if you're someonewho regularly interacts with many<br /> databases. This problem is annoying in a low intensity kind of way.<br /><br/> Consider the following sample sessions of mine, made with the<br /> dellstore2 sample database:<br /><br /> [local]/postgres=#select * from orders o join orderlines ol on<br /> o.orderid = ol.orderids limit 1;<br /> ERROR: 42703:column ol.orderids does not exist<br /> LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...<br/> ^<br /> HINT: Perhaps you meant to referencethe column "ol"."orderid".<br /> LOCATION: errorMissingColumn, parse_relation.c:2989<br /> [local]/postgres=# select* from orders o join orderlines ol on<br /> o.orderid = ol.orderid limit 1;<br /> orderid | orderdate | customerid| netamount | tax | totalamount |<br /> orderlineid | orderid | prod_id | quantity | orderdate<br /> ---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------<br /> 1 | 2004-01-27 | 7888 | 313.24 | 25.84 | 339.08 |<br /> 1 | 1 | 9117 | 1| 2004-01-27<br /> (1 row)<br /><br /> [local]/postgres=# select ordersid from orders o join orderlines ol on<br /> o.orderid= ol.orderid limit 1;<br /> ERROR: 42703: column "ordersid" does not exist<br /> LINE 1: select ordersid from orderso join orderlines ol on o.orderi...<br /> ^<br /> HINT: Perhaps you meant to reference the column "o"."orderid".<br/> LOCATION: errorMissingColumn, parse_relation.c:2999<br /> [local]/postgres=# select ol.ordersid fromorders o join orderlines ol<br /> on o.orderid = ol.orderid limit 1;<br /> ERROR: 42703: column ol.ordersid does notexist<br /> LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...<br /> ^<br /> HINT: Perhaps you meant to reference the column "ol"."orderid".<br /> LOCATION: errorMissingColumn, parse_relation.c:2989<br/><br /> We try to give the most useful possible HINT here, charging extra for<br /> a non-matchingalias, and going through the range table in order and<br /> preferring the first column observed to any subsequentcolumn whose<br /> name is of the same distance as an earlier Var. The fuzzy string<br /> matching works well enoughthat it seems possible in practice to<br /> successfully have the parser make the right suggestion, even when the<br/> user's original guess was fairly far off. I've found it works best to<br /> charge half as much for a characterdeletion, so that's what is<br /> charged.<br /><br /> I have some outstanding concerns about the proposed patch:<br/><br /> * It may be the case that dense logosyllabic or morphographic writing<br /> systems, for example Kanjimight consistently present, say, Japanese<br /> users with a suggestion that just isn't very useful, to the point of<br/> being annoying. Perhaps some Japanese hackers can comment on the<br /> actual risks here.<br /><br /> * Perhaps Ishould have moved the Levenshtein distance functions into<br /> core and be done with it. I thought that given the presentrestriction<br /> that the implementation imposes on source and target string lengths,<br /> it would be best to leavethe user-facing SQL functions in contrib.<br /> That restriction is not relevant to the internal use of Levenshtein<br/> distance added here, though.<br /><br /> Thoughts?<br /> --<br /> Peter Geoghegan<br /><br /><br /> --<br/> Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div>
Peter Geoghegan wrote: > With the addition of LATERAL subqueries, Tom fixed up the mechanism > for keeping track of which relations are visible for column references > while the FROM clause is being scanned. That allowed > errorMissingColumn() to give a more useful error to the one produced > by the prior coding of that mechanism, with an errhint sometimes > proffering: 'There is a column named "foo" in table "bar", but it > cannot be referenced from this part of the query'. > > I wondered how much further this could be taken. Attached patch > modifies contrib/fuzzystrmatch, moving its Levenshtein distance code > into core without actually moving the relevant SQL functions too. That > change allowed me to modify errorMissingColumn() to make more useful > suggestions as to what might have been intended under other > circumstances, like when someone fat-fingers a column name. > [local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1; > ERROR: 42703: column ol.orderids does not exist > LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid... > ^ > HINT: Perhaps you meant to reference the column "ol"."orderid". This sounds like a mild version of DWIM: http://www.jargondb.org/glossary/dwim Maybe it is just me, but I get uncomfortable when a program tries to second-guess what I really want. Yours, Laurenz Albe
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Christoph Berg
Дата:
Re: Albe Laurenz 2014-03-28 <A737B7A37273E048B164557ADEF4A58B17CE8DEA@ntex2010i.host.magwien.gv.at> > > ERROR: 42703: column ol.orderids does not exist > > LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid... > > ^ > > HINT: Perhaps you meant to reference the column "ol"."orderid". > > This sounds like a mild version of DWIM: > http://www.jargondb.org/glossary/dwim > > Maybe it is just me, but I get uncomfortable when a program tries > to second-guess what I really want. I find it very annoying when zsh asks me "did you mean foo [y/n]" and I need to confirm that, but I'd find a mere HINT that I can easily ignore a very useful feature. +1 for the idea. Christoph -- cb@df7cb.de | http://www.df7cb.de/
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Fri, Mar 28, 2014 at 4:10 AM, Peter Geoghegan <pg@heroku.com> wrote: > With the addition of LATERAL subqueries, Tom fixed up the mechanism > for keeping track of which relations are visible for column references > while the FROM clause is being scanned. That allowed > errorMissingColumn() to give a more useful error to the one produced > by the prior coding of that mechanism, with an errhint sometimes > proffering: 'There is a column named "foo" in table "bar", but it > cannot be referenced from this part of the query'. > > I wondered how much further this could be taken. Attached patch > modifies contrib/fuzzystrmatch, moving its Levenshtein distance code > into core without actually moving the relevant SQL functions too. That > change allowed me to modify errorMissingColumn() to make more useful > suggestions as to what might have been intended under other > circumstances, like when someone fat-fingers a column name. psql tab > completion is good, but not so good that this doesn't happen all the > time. It's good practice to consistently name columns and tables such > that it's possible to intuit the names of columns from the names of > tables and so on, but it's still pretty common to forget if a column > name from the table "orders" is "order_id", "orderid", or "ordersid", > particularly if you're someone who regularly interacts with many > databases. This problem is annoying in a low intensity kind of way. > > Consider the following sample sessions of mine, made with the > dellstore2 sample database: > > [local]/postgres=# select * from orders o join orderlines ol on > o.orderid = ol.orderids limit 1; > ERROR: 42703: column ol.orderids does not exist > LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid... > ^ > HINT: Perhaps you meant to reference the column "ol"."orderid". > LOCATION: errorMissingColumn, parse_relation.c:2989 > [local]/postgres=# select * from orders o join orderlines ol on > o.orderid = ol.orderid limit 1; > orderid | orderdate | customerid | netamount | tax | totalamount | > orderlineid | orderid | prod_id | quantity | orderdate > ---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------ > 1 | 2004-01-27 | 7888 | 313.24 | 25.84 | 339.08 | > 1 | 1 | 9117 | 1 | 2004-01-27 > (1 row) > > [local]/postgres=# select ordersid from orders o join orderlines ol on > o.orderid = ol.orderid limit 1; > ERROR: 42703: column "ordersid" does not exist > LINE 1: select ordersid from orders o join orderlines ol on o.orderi... > ^ > HINT: Perhaps you meant to reference the column "o"."orderid". > LOCATION: errorMissingColumn, parse_relation.c:2999 > [local]/postgres=# select ol.ordersid from orders o join orderlines ol > on o.orderid = ol.orderid limit 1; > ERROR: 42703: column ol.ordersid does not exist > LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord... > ^ > HINT: Perhaps you meant to reference the column "ol"."orderid". > LOCATION: errorMissingColumn, parse_relation.c:2989 > > We try to give the most useful possible HINT here, charging extra for > a non-matching alias, and going through the range table in order and > preferring the first column observed to any subsequent column whose > name is of the same distance as an earlier Var. The fuzzy string > matching works well enough that it seems possible in practice to > successfully have the parser make the right suggestion, even when the > user's original guess was fairly far off. I've found it works best to > charge half as much for a character deletion, so that's what is > charged. What about the overhead that this processing creates if error processing needs to scan a schema with let's say hundreds of tables? > * It may be the case that dense logosyllabic or morphographic writing > systems, for example Kanji might consistently present, say, Japanese > users with a suggestion that just isn't very useful, to the point of > being annoying. Perhaps some Japanese hackers can comment on the > actual risks here. As long as Hiragana-only words (basic alphabet for Japanese words), and more particularly Katakana only-words (to write phonetically foreign words) are compared (even Kanji-only things compared), Levenstein could play its role pretty well. But once a comparison is made with two words using different alphabet, well Levenstein is not going to work well. A simple example is 'ramen' (Japanese noodles), that you can find written sometimes in Hiragana, or even in Katakana, and here Levenstein performs poorly: =# select levenshtein('ラーメン', 'らあめん');levenshtein ------------- 4 (1 row) Regards, -- Michael
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Fri, Mar 28, 2014 at 5:57 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > What about the overhead that this processing creates if error > processing needs to scan a schema with let's say hundreds of tables? It doesn't work that way. I've extended searchRangeTableForCol() so that when it calls scanRTEForColumn(), it considers Levenshtein distance, and not just plain string equality, which is what happens today. The code only looks through ParseState. -- Peter Geoghegan
On Fri, Mar 28, 2014 at 4:47 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote: > Peter Geoghegan wrote: >> With the addition of LATERAL subqueries, Tom fixed up the mechanism >> for keeping track of which relations are visible for column references >> while the FROM clause is being scanned. That allowed >> errorMissingColumn() to give a more useful error to the one produced >> by the prior coding of that mechanism, with an errhint sometimes >> proffering: 'There is a column named "foo" in table "bar", but it >> cannot be referenced from this part of the query'. >> >> I wondered how much further this could be taken. Attached patch >> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code >> into core without actually moving the relevant SQL functions too. That >> change allowed me to modify errorMissingColumn() to make more useful >> suggestions as to what might have been intended under other >> circumstances, like when someone fat-fingers a column name. > >> [local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1; >> ERROR: 42703: column ol.orderids does not exist >> LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid... >> ^ >> HINT: Perhaps you meant to reference the column "ol"."orderid". > > This sounds like a mild version of DWIM: > http://www.jargondb.org/glossary/dwim > > Maybe it is just me, but I get uncomfortable when a program tries > to second-guess what I really want. It's not really DWIM, because the backend is still throwing an error. It's just trying to help you sort out the error, along the way. Still, I share some of your discomfort. I see Peter's patch as an example of a broader class of things that we could do - but I'm not altogether sure that we want to do them. There's a risk of adding not only CPU cycles but also clutter. If we do things that encourage people to crank the log verbosity down, I think that's going to be bad more often than it's good. It strains credulity to think that this patch alone would have that effect, but there might be quite a few similar improvements that are possible. So I think it would be good to consider how far we want to go in this direction and where we think we might want to stop. That's not to say, let's not ever do this, just, let's think carefully about where we want to end up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Apr 1, 2014 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > There's a risk of adding not > only CPU cycles but also clutter. If we do things that encourage > people to crank the log verbosity down, I think that's going to be bad > more often than it's good. While I share your concern here, I think that this is something that is only likely to be seen in an interactive psql session, where it is seen quite frequently. I am reasonably confident that it's highly unusual to see ERRCODE_UNDEFINED_COLUMN in other settings. Not having to do a mental context switch when writing an ad-hoc query has considerable value. Even C compilers like Clang have this kind of feedback. This is a patch that was written out of personal frustration with the experience of interacting with many different databases. Things like the Python REPL don't do so much of this kind of thing, but presumably that's because of Python's dynamic typing. This is a HINT that can be given with fairly high confidence that it'll be helpful - there just won't be that many things that the user could have meant to choose from. I think it's even useful when the suggested column is distant from the original suggestion (i.e. errorMissingColumn() offers only what is clearly a "wild guess"), because then the user knows that he or she has got it quite wrong. Frequently, this will be because the wrong synonym for what should have been written was used. > It strains credulity to think that this > patch alone would have that effect, but there might be quite a few > similar improvements that are possible. So I think it would be good > to consider how far we want to go in this direction and where we think > we might want to stop. That's not to say, let's not ever do this, > just, let's think carefully about where we want to end up. Fair enough. -- Peter Geoghegan
On 4/1/14, 1:04 PM, Peter Geoghegan wrote: >> It strains credulity to think that this >> >patch alone would have that effect, but there might be quite a few >> >similar improvements that are possible. So I think it would be good >> >to consider how far we want to go in this direction and where we think >> >we might want to stop. That's not to say, let's not ever do this, >> >just, let's think carefully about where we want to end up. > Fair enough. I agree with the concern, but also have to say that I can't count how many times I could have used this. A big +1, at leastin this case. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
<p dir="ltr">Normally I'm not for adding gucs that just gate new features. But I think a simple guc to turn this on or offwould be fine and alleviate any concerns. I think users would appreciate it quite a lot<p dir="ltr">It would even havea positive effect of helping raise awareness of the feature. I often scan the list of config options to get an idea ofnew features when I'm installing new software or upgrading.<p dir="ltr">-- <br /> greg<div class="gmail_quote">On 1 Apr2014 17:38, "Jim Nasby" <<a href="mailto:jim@nasby.net">jim@nasby.net</a>> wrote:<br type="attribution" /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On 4/1/14, 1:04 PM,Peter Geoghegan wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">It strains credulity to think that this<br /> >patch alone would have that effect, but there mightbe quite a few<br /> >similar improvements that are possible. So I think it would be good<br /> >to considerhow far we want to go in this direction and where we think<br /> >we might want to stop. That's not to say, let'snot ever do this,<br /> >just, let's think carefully about where we want to end up.<br /></blockquote> Fair enough.<br/></blockquote><br /> I agree with the concern, but also have to say that I can't count how many times I couldhave used this. A big +1, at least in this case.<br /> -- <br /> Jim C. Nasby, Data Architect <a href="mailto:jim@nasby.net" target="_blank">jim@nasby.net</a><br /> 512.569.9461 (cell) <a href="http://jim.nasby.net"target="_blank">http://jim.nasby.net</a><br /><br /><br /> -- <br /> Sent via pgsql-hackers mailinglist (<a href="mailto:pgsql-hackers@postgresql.org" target="_blank">pgsql-hackers@postgresql.org</a>)<br /> To makechanges to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/<u></u>mailpref/pgsql-hackers</a><br/></blockquote></div>
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Andres Freund
Дата:
On 2014-04-02 21:08:47 +0100, Greg Stark wrote: > Normally I'm not for adding gucs that just gate new features. But I think a > simple guc to turn this on or off would be fine and alleviate any concerns. > I think users would appreciate it quite a lot I don't have strong feelings about the feature, but introducing a guc for it feels entirely ridiculous to me. This is a minor detail in an error message, not more. > It would even have a positive effect of helping raise awareness of the > feature. I often scan the list of config options to get an idea of new > features when I'm installing new software or upgrading. Really? Should we now add GUCs for every feature then? Greetings, Andres Freund PS: Could you please start to properly quote again? You seem to have stopped doing that entirely in the last few months. -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I don't have strong feelings about the feature, but introducing a guc > for it feels entirely ridiculous to me. This is a minor detail in an > error message, not more. I agree. It's just a HINT. It's quite helpful in certain particular contexts, but in the grand scheme of things isn't all that important. I am being quite conservative in trying to anticipate cases where on balance it'll actually hurt more than it will help. I doubt that there actually are any. -- Peter Geoghegan
On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.
I've been responding a lot from the phone. Unfortunately the Gmail client on the phone makes it nearly impossible to format messages well. I'm beginning to think it would be better to just not quote at all any more. I'm normally not doing a point-by-point response anyways.
greg
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Andres Freund
Дата:
On 2014-04-03 00:48:12 -0400, Greg Stark wrote: > On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com>wrote: > > PS: Could you please start to properly quote again? You seem to have > > stopped doing that entirely in the last few months. > > I've been responding a lot from the phone. Unfortunately the Gmail client > on the phone makes it nearly impossible to format messages well. I'm > beginning to think it would be better to just not quote at all any more. > I'm normally not doing a point-by-point response anyways. I really don't care where you're answering from TBH. It's unreadable, misses context and that's it. If $device doesn't work for you, don't use it. I don't mind an occasional quick answer that's badly formatted, but for other things it's really annoying. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 04/02/2014 01:16 PM, Andres Freund wrote: > On 2014-04-02 21:08:47 +0100, Greg Stark wrote: >> Normally I'm not for adding gucs that just gate new features. But I think a >> simple guc to turn this on or off would be fine and alleviate any concerns. >> I think users would appreciate it quite a lot > > I don't have strong feelings about the feature, but introducing a guc > for it feels entirely ridiculous to me. This is a minor detail in an > error message, not more. > >> It would even have a positive effect of helping raise awareness of the >> feature. I often scan the list of config options to get an idea of new >> features when I'm installing new software or upgrading. > > Really? Should we now add GUCs for every feature then? -1 for having a GUC for this. +1 on the feature. Review with functional test coming up. Question: How should we handle the issues with East Asian languages (i.e. Japanese, Chinese) and this Hint? Should we just avoid hinting for a selected list of languages which don't work well with levenshtein?If so, how do we get that list? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote: > Question: How should we handle the issues with East Asian languages > (i.e. Japanese, Chinese) and this Hint? Should we just avoid hinting > for a selected list of languages which don't work well with levenshtein? > If so, how do we get that list? I think that how useful Levenshtein distance is for users based in east Asia generally, and how useful this patch is to those users are two distinct questions. I have no idea how common it is for Japanese users to just use Roman characters as table and attribute names. Since they're very probably already writing application code that uses Roman characters (except in the comments, user strings and so on), it might make sense to do the same in the database. I would welcome further input on that question. I don't know what the trends are in the real world. Also note that the patch scans the range table parse state to pick the most probable candidate among all Vars/columns that already appear there. The query would raise an error at an earlier point if a non-existent relation was referenced, for example. We're only choosing from a minimal list of possibilities, and pick one that is very probably what was intended. Even if Levenshtein distance works badly with Kanji (which is not obviously the case, at least to me), it might not matter here. -- Peter Geoghegan
On 14/06/17 8:31, Peter Geoghegan wrote: > On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Question: How should we handle the issues with East Asian languages >> (i.e. Japanese, Chinese) and this Hint? Should we just avoid hinting >> for a selected list of languages which don't work well with levenshtein? >> If so, how do we get that list? > > I think that how useful Levenshtein distance is for users based in > east Asia generally, and how useful this patch is to those users are > two distinct questions. I have no idea how common it is for Japanese > users to just use Roman characters as table and attribute names. Since > they're very probably already writing application code that uses Roman > characters (except in the comments, user strings and so on), it might > make sense to do the same in the database. I would welcome further > input on that question. I don't know what the trends are in the real > world. From what I've seen in the wild in Japan, Roman/ASCII characters are widely used for object/attribute names, as generally it's much less hassle than switching between input methods, dealing with different encodings etc. The only place where I've seen Japanese characters widely used is in tutorials, examples etc. However that's only my personal observation for one particular non-Roman language. Regards Ian Barwick -- Ian Barwick http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote: > From what I've seen in the wild in Japan, Roman/ASCII characters are > widely used for object/attribute names, as generally it's much less > hassle than switching between input methods, dealing with different > encodings etc. The only place where I've seen Japanese characters widely > used is in tutorials, examples etc. However that's only my personal > observation for one particular non-Roman language. And I agree to this remark, that's a PITA to manage database object names with Japanese characters directly. I have ever seen some applications using such ways to define objects though in the past, not *that* many I concur.. -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote: >> From what I've seen in the wild in Japan, Roman/ASCII characters are >> widely used for object/attribute names, as generally it's much less >> hassle than switching between input methods, dealing with different >> encodings etc. The only place where I've seen Japanese characters widely >> used is in tutorials, examples etc. However that's only my personal >> observation for one particular non-Roman language. > And I agree to this remark, that's a PITA to manage database object > names with Japanese characters directly. I have ever seen some > applications using such ways to define objects though in the past, not > *that* many I concur.. What exactly is the rationale for thinking that Levenshtein distance is useless in non-Roman alphabets? AFAIK it just counts insertions and deletions of characters, which seems like a concept rather independent of what those characters are. regards, tom lane
On 14/06/17 9:53, Tom Lane wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote: >>> From what I've seen in the wild in Japan, Roman/ASCII characters are >>> widely used for object/attribute names, as generally it's much less >>> hassle than switching between input methods, dealing with different >>> encodings etc. The only place where I've seen Japanese characters widely >>> used is in tutorials, examples etc. However that's only my personal >>> observation for one particular non-Roman language. > >> And I agree to this remark, that's a PITA to manage database object >> names with Japanese characters directly. I have ever seen some >> applications using such ways to define objects though in the past, not >> *that* many I concur.. > > What exactly is the rationale for thinking that Levenshtein distance is > useless in non-Roman alphabets? AFAIK it just counts insertions and > deletions of characters, which seems like a concept rather independent > of what those characters are. With Japanese (which doesn't have an alphabet, but two syllabaries and a bunch of logographic characters), Levenshtein distance is pretty useless for examining similarities with words which can be written in either syllabary (Michael's "ramen" example earlier in the thread); and when catching "typos" caused by erroneous conversion from phonetic input to characters - e.g. intending to input "成長" (seichou, growth) but accidentally selecting "清聴" (seichou, courteous attention). Howver in this particular use case, as long as it doesn't produce false positives (I haven't looked at the patch) I don't think it would cause any problems (of the kind which would require actively excluding certain languages/character sets), it just wouldn't be quite as useful. Regards Ian Barwick -- Ian Barwick http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote: > Howver in this particular use case, as long as it doesn't produce false > positives (I haven't looked at the patch) I don't think it would cause > any problems (of the kind which would require actively excluding certain > languages/character sets), it just wouldn't be quite as useful. I'm not sure what you mean by false positives. The patch just shows a HINT, where before there was none. It's possible for any number of reasons that it isn't the most useful possible suggestion, since Levenshtein distance is used as opposed to any other scheme that might be better sometimes. I think that the hint given is a generally useful piece of information in the event of an ERRCODE_UNDEFINED_COLUMN error. Obviously I think the patch is worthwhile, but fundamentally the HINT given is just a guess, as with the existing HINTs. -- Peter Geoghegan
On 14/06/17 11:57, Peter Geoghegan wrote: > On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote: >> Howver in this particular use case, as long as it doesn't produce false >> positives (I haven't looked at the patch) I don't think it would cause >> any problems (of the kind which would require actively excluding certain >> languages/character sets), it just wouldn't be quite as useful. > > I'm not sure what you mean by false positives. The patch just shows a > HINT, where before there was none. It's possible for any number of > reasons that it isn't the most useful possible suggestion, since > Levenshtein distance is used as opposed to any other scheme that might > be better sometimes. I think that the hint given is a generally useful > piece of information in the event of an ERRCODE_UNDEFINED_COLUMN > error. Obviously I think the patch is worthwhile, but fundamentally > the HINT given is just a guess, as with the existing HINTs. I mean, does it come up with a suggestion in every case, even if there is no remotely similar column? E.g. would SELECT foo FROM some_table bring up column "bar" as a suggestion if "bar" is the only column in the table? Anyway, is there an up-to-date version of the patch available? The one from March doesn't seem to apply cleanly to HEAD. Thanks Ian Barwick -- Ian Barwick http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Peter Geoghegan <pg@heroku.com> writes: > On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote: >> Howver in this particular use case, as long as it doesn't produce false >> positives (I haven't looked at the patch) I don't think it would cause >> any problems (of the kind which would require actively excluding certain >> languages/character sets), it just wouldn't be quite as useful. > I'm not sure what you mean by false positives. The patch just shows a > HINT, where before there was none. It's possible for any number of > reasons that it isn't the most useful possible suggestion, since > Levenshtein distance is used as opposed to any other scheme that might > be better sometimes. I think that the hint given is a generally useful > piece of information in the event of an ERRCODE_UNDEFINED_COLUMN > error. Obviously I think the patch is worthwhile, but fundamentally > the HINT given is just a guess, as with the existing HINTs. Not having looked at the patch, but: I think the probability of useless-noise HINTs could be substantially reduced if the code prints a HINT only when there is a single available alternative that is clearly better than the others in Levenshtein distance. I'm not sure how much better is "clearly better", but I exclude "zero" from that. I see that the original description of the patch says that it will arbitrarily choose one alternative when there are several with equal Levenshtein distance, and I'd say that's a bad idea. You could possibly answer this objection by making the HINT list *all* the alternatives meeting the minimum Levenshtein distance. But I think that's probably overcomplicated and of uncertain value anyhow. I'd rather have a rule that "we print only the choice that is at least K units better than any other choice", where K remains to be determined exactly. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Jun 16, 2014 at 8:38 PM, Ian Barwick <ian@2ndquadrant.com> wrote: > I mean, does it come up with a suggestion in every case, even if there is > no remotely similar column? E.g. would > > SELECT foo FROM some_table > > bring up column "bar" as a suggestion if "bar" is the only column in > the table? Yes, it would, but I think that's the correct behavior. > Anyway, is there an up-to-date version of the patch available? The one from > March doesn't seem to apply cleanly to HEAD. Are you sure? I think it might just be that patch is confused about the deleted file contrib/fuzzystrmatch/levenshtein.c. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Not having looked at the patch, but: I think the probability of > useless-noise HINTs could be substantially reduced if the code prints a > HINT only when there is a single available alternative that is clearly > better than the others in Levenshtein distance. I'm not sure how much > better is "clearly better", but I exclude "zero" from that. I see that > the original description of the patch says that it will arbitrarily > choose one alternative when there are several with equal Levenshtein > distance, and I'd say that's a bad idea. I disagree. I happen to think that making some guess is better than no guess at all here, given the fact that there aren't too many possibilities to choose from. I think that it might be particularly annoying to not show some suggestion in the event of a would-be ambiguous column reference where the column name is itself wrong, since both mistakes are common. For example, "order_id" was specified instead of one of either "o.orderid" or "ol.orderid", as in my original examples. If some correct alias was specified, that would make the new code prefer the appropriate Var, but it might not be, and that should be okay in my view. I'm not trying to remove the need for human judgement here. We've all heard stories about people who did things like input "Portland" into their GPS only to end up in Maine rather than Oregon, but I think in general you can only go so far in worrying about those cases. -- Peter Geoghegan
On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not having looked at the patch, but: I think the probability of >> useless-noise HINTs could be substantially reduced if the code prints a >> HINT only when there is a single available alternative that is clearly >> better than the others in Levenshtein distance. I'm not sure how much >> better is "clearly better", but I exclude "zero" from that. I see that >> the original description of the patch says that it will arbitrarily >> choose one alternative when there are several with equal Levenshtein >> distance, and I'd say that's a bad idea. > > I disagree. I happen to think that making some guess is better than no > guess at all here, given the fact that there aren't too many > possibilities to choose from. I think that it might be particularly > annoying to not show some suggestion in the event of a would-be > ambiguous column reference where the column name is itself wrong, > since both mistakes are common. For example, "order_id" was specified > instead of one of either "o.orderid" or "ol.orderid", as in my > original examples. If some correct alias was specified, that would > make the new code prefer the appropriate Var, but it might not be, and > that should be okay in my view. > > I'm not trying to remove the need for human judgement here. We've all > heard stories about people who did things like input "Portland" into > their GPS only to end up in Maine rather than Oregon, but I think in > general you can only go so far in worrying about those cases. Emitting a suggestion with a large distance seems like it could be rather irritating. If the user types in SELECT prodct_id FROM orders, and that column does not exist, suggesting "product_id", if such a column exists, will likely be well-received. Suggesting a column named, say, "price", however, will likely make at least some users say "no I didn't mean that you stupid @%!#" - because probably the issue there is that the user selected from the completely wrong table, rather than getting 6 of the 9 characters they typed incorrect. One existing tool that does something along these lines is 'git', which seems to have some kind of a heuristic to know when to give up: [rhaas pgsql]$ git gorp git: 'gorp' is not a git command. See 'git --help'. Did you mean this? grep [rhaas pgsql]$ git goop git: 'goop' is not a git command. See 'git --help'. Did you mean this? grep [rhaas pgsql]$ git good git: 'good' is not a git command. See 'git --help'. [rhaas pgsql]$ git puma git: 'puma' is not a git command. See 'git --help'. Did you mean one of these? pull push I suspect that the maximum useful distance is a function of the string length. Certainly, if the distance is greater than or equal to the length of one of the strings involved, it's just a totally unrelated string and thus not worth suggesting. A useful heuristic might be something like "distance at most 3, or at most half the string length, whichever is less". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote: >> I disagree. I happen to think that making some guess is better than no >> guess at all here, given the fact that there aren't too many >> possibilities to choose from. > Emitting a suggestion with a large distance seems like it could be > rather irritating. If the user types in SELECT prodct_id FROM orders, > and that column does not exist, suggesting "product_id", if such a > column exists, will likely be well-received. Suggesting a column > named, say, "price", however, will likely make at least some users say > "no I didn't mean that you stupid @%!#" - because probably the issue > there is that the user selected from the completely wrong table, > rather than getting 6 of the 9 characters they typed incorrect. Yeah, that's my point exactly. There's no very good reason to assume that the intended answer is in fact among the set of column names we can see; and if it *is* there, the Levenshtein distance to it isn't going to be all that large. I think that suggesting "foobar" when the user typed "glorp" is not only not helpful, but makes us look like idiots. > One existing tool that does something along these lines is 'git', > which seems to have some kind of a heuristic to know when to give up: I wouldn't necessarily hold up git as a model of user interface engineering ;-) ... but still, it might be interesting to take a look at exactly what heuristics they used here. I'm sure there are other precedents we could look at, too. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Kevin Grittner
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wouldn't necessarily hold up git as a model of user interface > engineering ;-) ... but still, it might be interesting to take a > look at exactly what heuristics they used here. I'm sure there > are other precedents we could look at, too. On my Ubuntu machine, bash does something similar. A few examples chosen completely arbitrarily: kgrittn@Kevin-Desktop:~$ got No command 'got' found, did you mean: Command 'go' from package 'golang-go' (universe) Command 'gout' from package 'scotch' (universe) Command 'jot' from package 'athena-jot' (universe) Command 'go2' from package 'go2' (universe) Command 'git' from package 'git' (main) Command 'gpt' from package 'gpt' (universe) Command 'gom' from package 'gom' (universe) Command 'goo' from package 'goo' (universe) Command 'gst' from package 'gnu-smalltalk' (universe) Command 'dot' from package 'graphviz' (main) Command 'god' from package 'god' (universe) Command 'god' from package 'ruby-god' (universe) got: command not found kgrittn@Kevin-Desktop:~$ groupad No command 'groupad' found, did you mean: Command 'groupadd' from package 'passwd' (main) Command 'groupd' from package 'cman' (main) groupad: command not found kgrittn@Kevin-Desktop:~$ asdf No command 'asdf' found, did you mean: Command 'asdfg' from package 'aoeui' (universe) Command 'sadf' from package 'sysstat' (main) Command 'sdf' from package 'sdf' (universe) asdf: command not found kgrittn@Kevin-Desktop:~$ zxcv zxcv: command not found -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/17/2014 01:59 PM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Emitting a suggestion with a large distance seems like it could be >> rather irritating. If the user types in SELECT prodct_id FROM orders, >> and that column does not exist, suggesting "product_id", if such a >> column exists, will likely be well-received. Suggesting a column >> named, say, "price", however, will likely make at least some users say >> "no I didn't mean that you stupid @%!#" - because probably the issue >> there is that the user selected from the completely wrong table, >> rather than getting 6 of the 9 characters they typed incorrect. > > Yeah, that's my point exactly. There's no very good reason to assume that > the intended answer is in fact among the set of column names we can see; > and if it *is* there, the Levenshtein distance to it isn't going to be > all that large. I think that suggesting "foobar" when the user typed > "glorp" is not only not helpful, but makes us look like idiots. Well, there's two different issues: (1) offering a suggestion which is too different from what the user typed. This is easily limited by having a max distance (most likely a distance/length ratio, with a max of say, 0.5). The only drawback of this would be the extra cpu cycles to calculate it, and some arguments about what the max distance should be. But for the sake of the children, let's not have a GUC for it. (2) If there are multiple columns with the same levenschtien distance, which one do you suggest? The current code picks a random one, which I'm OK with. The other option would be to list all of the columns. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jun 17, 2014 at 1:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, that's my point exactly. There's no very good reason to assume that > the intended answer is in fact among the set of column names we can see; > and if it *is* there, the Levenshtein distance to it isn't going to be > all that large. I think that suggesting "foobar" when the user typed > "glorp" is not only not helpful, but makes us look like idiots. Maybe that's just a matter of phrasing the message appropriately. A more guarded message, that suggests that "foobar" is the *best* match is correct at least on its own terms (terms that are self evident). This does pretty effectively communicate to the user that they should totally rethink not just the column name, but perhaps the entire query. On the other hand, showing nothing communicates nothing. -- Peter Geoghegan
Josh Berkus <josh@agliodbs.com> writes: > (2) If there are multiple columns with the same levenschtien distance, > which one do you suggest? The current code picks a random one, which > I'm OK with. The other option would be to list all of the columns. I objected to that upthread. I don't think that picking a random one is sane at all. Listing them all might be OK (I notice that that seems to be what both bash and git do). Another issue is whether to print only those having exactly the minimum observed Levenshtein distance, or to print everything less than some cutoff. The former approach seems to me to be placing a great deal of faith in something that's only a heuristic. regards, tom lane
On 06/17/2014 02:36 PM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >> (2) If there are multiple columns with the same levenschtien distance, >> which one do you suggest? The current code picks a random one, which >> I'm OK with. The other option would be to list all of the columns. > > I objected to that upthread. I don't think that picking a random one is > sane at all. Listing them all might be OK (I notice that that seems to be > what both bash and git do). > > Another issue is whether to print only those having exactly the minimum > observed Levenshtein distance, or to print everything less than some > cutoff. The former approach seems to me to be placing a great deal of > faith in something that's only a heuristic. Well, that depends on what the cutoff is. If it's high, like 0.5, that could be a LOT of columns. Like, I plan to test this feature with a 3-table join that has a combined 300 columns. I can completely imagine coming up with a string which is within 0.5 or even 0.3 of 40 columns names. So if we want to list everything below a cutoff, we'd need to make that cutoff fairly narrow, like 0.2. But that means we'd miss a lot of potential matches on short column names. I really think we're overthinking this: it is just a HINT, and we can improve it in future PostgreSQL versions, and most of our users will ignore it anyway because they'll be using a client which doesn't display HINTs. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Peter Geoghegan <pg@heroku.com> writes: > Maybe that's just a matter of phrasing the message appropriately. A > more guarded message, that suggests that "foobar" is the *best* match > is correct at least on its own terms (terms that are self evident). > This does pretty effectively communicate to the user that they should > totally rethink not just the column name, but perhaps the entire > query. On the other hand, showing nothing communicates nothing. I don't especially buy that argument. As soon as the user's gotten used to hints of this sort, the absence of a hint communicates plenty. In any case, people have now cited two different systems with suggestion capability, and neither of them behaves as you're arguing for. The lack of precedent should give you pause, unless you can point to widely-used systems that do what you have in mind. regards, tom lane
Josh Berkus <josh@agliodbs.com> writes: > On 06/17/2014 02:36 PM, Tom Lane wrote: >> Another issue is whether to print only those having exactly the minimum >> observed Levenshtein distance, or to print everything less than some >> cutoff. The former approach seems to me to be placing a great deal of >> faith in something that's only a heuristic. > Well, that depends on what the cutoff is. If it's high, like 0.5, that > could be a LOT of columns. Like, I plan to test this feature with a > 3-table join that has a combined 300 columns. I can completely imagine > coming up with a string which is within 0.5 or even 0.3 of 40 columns names. I think Levenshtein distances are integers, though that's just a minor point. > So if we want to list everything below a cutoff, we'd need to make that > cutoff fairly narrow, like 0.2. But that means we'd miss a lot of > potential matches on short column names. I'm not proposing an immutable cutoff. Something that scales with the string length might be a good idea, or we could make it a multiple of the minimum observed distance, or probably there are a dozen other things we could do. I'm just saying that if we have an alternative at distance 3, and another one at distance 4, it's not clear to me that we should assume that the first one is certainly what the user had in mind. Especially not if all the other alternatives are distance 10 or more. > I really think we're overthinking this: it is just a HINT, and we can > improve it in future PostgreSQL versions, and most of our users will > ignore it anyway because they'll be using a client which doesn't display > HINTs. Agreed that we can make it better later. But whether it prints exactly one suggestion, and whether it does that no matter how silly the suggestion is, are rather fundamental decisions. regards, tom lane
On 06/17/2014 02:53 PM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >> On 06/17/2014 02:36 PM, Tom Lane wrote: >>> Another issue is whether to print only those having exactly the minimum >>> observed Levenshtein distance, or to print everything less than some >>> cutoff. The former approach seems to me to be placing a great deal of >>> faith in something that's only a heuristic. > >> Well, that depends on what the cutoff is. If it's high, like 0.5, that >> could be a LOT of columns. Like, I plan to test this feature with a >> 3-table join that has a combined 300 columns. I can completely imagine >> coming up with a string which is within 0.5 or even 0.3 of 40 columns names. > > I think Levenshtein distances are integers, though that's just a minor > point. I was giving distance/length ratios. That is, 0.5 would mean that up to 50% of the characters could be replaced/changed. 0.2 would mean that only one character could be changed at lengths of five characters. Etc. The problem with these ratios is that they behave differently with long strings than short ones. I think realistically we'd need a double threshold, i.e. ( distance >= 2 OR ratio <= 0.4 ). Otherwise the obvious case, getting two characters wrong in a 4-character column name (or one in a two character name), doesn't get a HINT. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not proposing an immutable cutoff. Something that scales with the > string length might be a good idea, or we could make it a multiple of > the minimum observed distance, or probably there are a dozen other things > we could do. I'm just saying that if we have an alternative at distance > 3, and another one at distance 4, it's not clear to me that we should > assume that the first one is certainly what the user had in mind. > Especially not if all the other alternatives are distance 10 or more. The patch just looks for the match with the lowest distance, passing the lowest observed distance so far as a "max" to the distance calculation function. That could have some value in certain cases. People have already raised general concerns about added cycles and/or clutter. -- Peter Geoghegan
On 18/06/14 10:05, Peter Geoghegan wrote: > On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm not proposing an immutable cutoff. Something that scales with the >> string length might be a good idea, or we could make it a multiple of >> the minimum observed distance, or probably there are a dozen other things >> we could do. I'm just saying that if we have an alternative at distance >> 3, and another one at distance 4, it's not clear to me that we should >> assume that the first one is certainly what the user had in mind. >> Especially not if all the other alternatives are distance 10 or more. > The patch just looks for the match with the lowest distance, passing > the lowest observed distance so far as a "max" to the distance > calculation function. That could have some value in certain cases. > People have already raised general concerns about added cycles and/or > clutter. > How about a list of miss spellings and the likely targets. (grop, grap, ...) ==> (grep, grape, grope...) type of thing? Possibly with some kind of adaptive learning algorithm. I suspect that while this might be a useful research project, it is out of scope for the current discussion! Cheers, Gavin
On Tue, Jun 17, 2014 at 5:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> (2) If there are multiple columns with the same levenschtien distance, >> which one do you suggest? The current code picks a random one, which >> I'm OK with. The other option would be to list all of the columns. > > I objected to that upthread. I don't think that picking a random one is > sane at all. Listing them all might be OK (I notice that that seems to be > what both bash and git do). What bash does is annoying and stupid, and any time I find a system with that obnoxious behavior enabled I immediately disable it, so I don't consider that a good precedent for anything. I think what the bash algorithm demonstrates is that while it may be sane to list more than one option, listing 10 or 20 or 150 is unbearably obnoxious. Filling the user's *entire terminal window* with a list of suggestions when they make a minor typo is more like a punishment than an aid. git's behavior of limiting itself to one or two options, while somewhat useless, is at least not annoying. > Another issue is whether to print only those having exactly the minimum > observed Levenshtein distance, or to print everything less than some > cutoff. The former approach seems to me to be placing a great deal of > faith in something that's only a heuristic. Well, we've got lots of heuristics. Many of them serve us quite well. I might do something like this: (1) Set the maximum levenshtein distance to half the length of the string, rounded down, but not more than 3. (2) If there are more than 2 matches, reduce the maximum distance by 1 and repeat this step. (3) If there are no remaining matches, print no hint; else print the 1 or 2 matching items. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jun 17, 2014 at 5:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: > What bash does is annoying and stupid, and any time I find a system > with that obnoxious behavior enabled I immediately disable it, so I > don't consider that a good precedent for anything. I happen to totally agree with you here. Bash sometimes does awful things with its completion. >> Another issue is whether to print only those having exactly the minimum >> observed Levenshtein distance, or to print everything less than some >> cutoff. The former approach seems to me to be placing a great deal of >> faith in something that's only a heuristic. > > Well, we've got lots of heuristics. Many of them serve us quite well. > I might do something like this: > > (1) Set the maximum levenshtein distance to half the length of the > string, rounded down, but not more than 3. > (2) If there are more than 2 matches, reduce the maximum distance by 1 > and repeat this step. > (3) If there are no remaining matches, print no hint; else print the 1 > or 2 matching items. I could do that. I can prepare a revision if others feel that's acceptable. My only concern with this is that a more sophisticated scheme implies more clutter in the parser, although it should not imply wasted cycles. What I particularly wanted to avoid in our choice of completion scheme is doing nothing because there is an ambiguity about what is best, which Tom suggested. In practice, that ambiguity will frequently be something that our users will not care about, and not really see as an ambiguity, as in my "o.orderid or ol.orderid?" example. However, if there are 3 equally distant Vars, and not just 2, that's very probably because none are useful, and so we really ought to show nothing. This seems most sensible. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Abhijit Menon-Sen
Дата:
So, what's the status of this patch? There's been quite a lot of discussion (though only about the approach; no formal code/usage review has yet been posted), but as far as I can tell, it just tapered off without any particular consensus. -- Abhijit
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > So, what's the status of this patch? > There's been quite a lot of discussion (though only about the approach; > no formal code/usage review has yet been posted), but as far as I can > tell, it just tapered off without any particular consensus. AFAICT, people generally agree that this would probably be useful, but there's not consensus on how far the code should be willing to "reach" for a match, nor on what to do when there are multiple roughly-equally-plausible candidates. Although printing all candidates seems to be what's preferred by existing systems with similar facilities, I can see the point that constructing the message in a translatable fashion might be difficult. So personally I'd be willing to abandon insistence on that. I still think though that printing candidates with very large distances would be unhelpful. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Sun, Jun 29, 2014 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Although printing all candidates seems to be what's preferred by > existing systems with similar facilities, I can see the point that > constructing the message in a translatable fashion might be difficult. > So personally I'd be willing to abandon insistence on that. I still > think though that printing candidates with very large distances > would be unhelpful. Attached revision factors in everyone's concerns here, I think. I've addressed your concern about the closeness of the match proposed in the HINT - the absolute as opposed to relative quality of the match. There is a normalized distance threshold that must always be exceeded to prevent ludicrous suggestions. This works along similar lines to those sketched by Robert. Furthermore, I've made it occasionally possible to see 2 suggestions, when they're equally distant and when each suggestion comes from a different range table entry. However, if the two best suggestions (overall or within an RTE) come from within the same RTE, then that RTE is ignored for the purposes of picking a suggestion (although the lowest observed distance from an ignored RTE may still be used as the distance for later RTEs to beat to get their attributes suggested in the HINT). The idea here is that this quality-bar for suggestions doesn't come at the cost of ignoring my concern about the presumably somewhat common case where there is an unqualified and therefore ambiguous column reference that happens to also be misspelled. An ambiguous column reference and an incorrectly spelled column name are both very common, and so it seems likely that momentary lapses where the user gets both things wrong at once are also common. We do all this without going overboard, since as outlined by Robert, when there are 3 or more equally distant candidates (even if they all come from different RTEs), we give no HINT at all. The big picture here is to make mental context switches cheap when writing ad-hoc queries in psql. A lot of the HINTs that popped up in the regression tests that seemed kind of questionable no longer appear. These new measures make the coding somewhat more complex than that of the initial version, although overall the parser code added by this patch is almost entirely confined to code paths concerned only with producing diagnostic messages to help users. -- Peter Geoghegan
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Abhijit Menon-Sen
Дата:
At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote: > > Attached revision factors in everyone's concerns here, I think. Is anyone planning to review Peter's revised patch? > These new measures make the coding somewhat more complex than that of > the initial version, although overall the parser code added by this > patch is almost entirely confined to code paths concerned only with > producing diagnostic messages to help users. Yes, the new patch looks quite a bit more involved than earlier, but if that's what it takes to provide a useful HINT, I guess it's not too bad. -- Abhijit
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Sat, Jul 5, 2014 at 12:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote: >> >> Attached revision factors in everyone's concerns here, I think. > > Is anyone planning to review Peter's revised patch? I have been doing some functional tests, and looked quickly at the code to understand what it does: 1) Compiles without warnings, passes regression tests 2) Checking process goes through all the existing columns of a relation even a difference of 1 with some other column(s) has already been found. As we try to limit the number of hints returned, this seems like a waste of resources. 3) distanceName could be improved, by for example having some checks on the string lengths of target and source columns, and immediately reject the match if for example the length of the source string is the double/half of the length of target. 4) This is not nice, could it be possible to remove the stuff from varlena.c? +/* Expand each Levenshtein distance variant */ +#include "levenshtein.c" +#define LEVENSHTEIN_LESS_EQUAL +#include "levenshtein.c" +#undef LEVENSHTEIN_LESS_EQUAL Part of the same comment: only varstr_leven_less_equal is used to calculate the distance, should we really move varstr_leven to core? This clearly needs to be reworked as not just a copy-paste of the things in fuzzystrmatch. The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think. 5) Do we want hints on system columns as well? For example here we could get tableoid as column hint: =# select tablepid from foo; ERROR: 42703: column "tablepid" does not exist LINE 1: select tablepid from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 Time: 0.425 ms 6) Sometimes no hints are returned... Even in simple cases like this one: =# create table foo (aa int, bb int); CREATE TABLE =# select ab from foo; ERROR: 42703: column "ab" does not exist LINE 1: select ab from foo; ^ LOCATION: errorMissingColumn, parse_relation.c:3123 7) Performance penalty with a table with 1600 columns: =# CREATE FUNCTION create_long_table(tabname text, columns int) RETURNS void LANGUAGE plpgsql as $$ declare first_col bool = true; count int; query text; begin query := 'CREATE TABLE ' || tabname || ' ('; for count in 0..columns loop query := query || 'col' || count || 'int'; if count <> columns then query := query || ', '; end if; end loop; query := query || ')'; execute query; end; $$; =# SELECT create_long_table('aa', 1599);create_long_table ------------------- (1 row) Then tested queries like that: SELECT col888a FROM aa; Patched version: 2.100ms~2.200ms master branch (6048896): 0.956 ms~0.990 ms So the performance impact seems limited. Regards, -- Michael
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
Hi, On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > 2) Checking process goes through all the existing columns of a > relation even a difference of 1 with some other column(s) has already > been found. As we try to limit the number of hints returned, this > seems like a waste of resources. In general it's possible that an exact match will later be found within the RTE, and exact matches don't have to pay the "wrong alias" penalty, and are immediately returned. It is therefore not a waste of resources, but even if it was that would be pretty inconsequential as your benchmark shows. > 3) distanceName could be improved, by for example having some checks > on the string lengths of target and source columns, and immediately > reject the match if for example the length of the source string is the > double/half of the length of target. I don't think it's a good idea to tie distanceName() to the ultimate behavior of errorMissingColumn() hinting, since there may be other callers in the future. Besides, that isn't going to help much. > 4) This is not nice, could it be possible to remove the stuff from varlena.c? > +/* Expand each Levenshtein distance variant */ > +#include "levenshtein.c" > +#define LEVENSHTEIN_LESS_EQUAL > +#include "levenshtein.c" > +#undef LEVENSHTEIN_LESS_EQUAL > Part of the same comment: only varstr_leven_less_equal is used to > calculate the distance, should we really move varstr_leven to core? > This clearly needs to be reworked as not just a copy-paste of the > things in fuzzystrmatch. > The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think. So there'd be one variant within core and one within contrib/fuzzystrmatch? I don't think that's an improvement. > 5) Do we want hints on system columns as well? I think it's obvious that the answer must be no. That's going to frequently result in suggestions of columns that users will complain aren't even there. If you know about the system columns, you can just get it right. They're supposed to be hidden for most purposes. > 6) Sometimes no hints are returned... Even in simple cases like this one: > =# create table foo (aa int, bb int); > CREATE TABLE > =# select ab from foo; > ERROR: 42703: column "ab" does not exist > LINE 1: select ab from foo; > ^ > LOCATION: errorMissingColumn, parse_relation.c:3123 That's because those two candidates come from a single RTE and have an equal distance -- you'd see both suggestions if you joined two tables with each candidate, assuming that each table being joined didn't individually have the same issue. I think that that's probably considered the correct behavior by most. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > > 6) Sometimes no hints are returned... Even in simple cases like this one: > > =# create table foo (aa int, bb int); > > CREATE TABLE > > =# select ab from foo; > > ERROR: 42703: column "ab" does not exist > > LINE 1: select ab from foo; > > ^ > > LOCATION: errorMissingColumn, parse_relation.c:3123 > > That's because those two candidates come from a single RTE and have an > equal distance -- you'd see both suggestions if you joined two tables > with each candidate, assuming that each table being joined didn't > individually have the same issue. I think that that's probably > considered the correct behavior by most. It seems pretty silly to me actually. Was this designed by a committee? I agree with the general principle that showing a large number of candidates (a la bash) is a bad idea, but failing to show two of them ... Words fail me. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jul 8, 2014 at 1:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> That's because those two candidates come from a single RTE and have an >> equal distance -- you'd see both suggestions if you joined two tables >> with each candidate, assuming that each table being joined didn't >> individually have the same issue. I think that that's probably >> considered the correct behavior by most. > > It seems pretty silly to me actually. Was this designed by a committee? > I agree with the general principle that showing a large number of > candidates (a la bash) is a bad idea, but failing to show two of them ... I guess it was designed by a committee. But we don't fail to show both because they're equally distant. Rather, it's because they're equally distant and from the same RTE. This is a contrived example, but typically showing equally distant columns is useful when they're in a foreign-key relationship - I was worried about the common case where a column name is misspelled that would otherwise be ambiguous, which is why that shows a HINT while the single RTE case doesn't. I think that in most realistic cases it wouldn't be all that useful to show two columns from the same table when they're equally distant. It's easy to imagine that reflecting that no match is good in absolute terms, and we're somewhat conservative about showing any match. While I think this general behavior is defensible, I must admit that it did suit me to write it that way because to do otherwise would have necessitated more invasive code in the existing general purpose scanRTEForColumn() function. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote: > I was worried about the common case where a > column name is misspelled that would otherwise be ambiguous, which is > why that shows a HINT while the single RTE case doesn't To be clear - I mean a HINT with two suggestions rather than just one. If there are 3 or more equally distant suggestions (even if they're all from different RTEs) we also give no HINT in the proposed patch. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:To be clear - I mean a HINT with two suggestions rather than just one.
> I was worried about the common case where a
> column name is misspelled that would otherwise be ambiguous, which is
> why that shows a HINT while the single RTE case doesn't
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.
Showing up to 2 hints is fine as it does not pollute the error output with perhaps unnecessary messages. That's even more protective than for example git that prints all the equidistant candidates. However I can't understand why it does not show up hints even if there are two equidistant candidates from the same RTE. I think it should.
--
Michael
--
Michael
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan <pg@heroku.com> wrote:
This may sound ridiculous, but I have already found myself mistyping ctid by tid and cid while working on patches and modules that played with page format, and needing a couple of minutes to understand what was going on (bad morning). I would have welcomed such hints in those cases.> 4) This is not nice, could it be possible to remove the stuff from varlena.c?> +/* Expand each Levenshtein distance variant */So there'd be one variant within core and one within
> +#include "levenshtein.c"
> +#define LEVENSHTEIN_LESS_EQUAL
> +#include "levenshtein.c"
> +#undef LEVENSHTEIN_LESS_EQUAL
> Part of the same comment: only varstr_leven_less_equal is used to
> calculate the distance, should we really move varstr_leven to core?
> This clearly needs to be reworked as not just a copy-paste of the
> things in fuzzystrmatch.
> The flag LEVENSHTEIN_LESS_EQUAL should be let within fuzzystrmatch I think.
contrib/fuzzystrmatch? I don't think that's an improvement.
No. The main difference between varstr_leven_less_equal and varstr_leven is the use of the extra argument max_d in the former. My argument here is instead of blindly cut-pasting into core the code you are interested in to evaluate the string distances, is to refactor it to have a unique function, and to let the business with LEVENSHTEIN_LESS_EQUAL within contrib/fuzzystrmatch. This will require some reshuffling of the distance function, but by looking at this patch I am getting the feeling that this is necessary, and should even be split into a first patch for fuzzystrmatch that would facilitate its integration into core.
Also why is rest_of_char_same within varlena.c?
Also why is rest_of_char_same within varlena.c?
> 5) Do we want hints on system columns as well?I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.
--
Michael
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Showing up to 2 hints is fine as it does not pollute the error output with > perhaps unnecessary messages. That's even more protective than for example > git that prints all the equidistant candidates. However I can't understand > why it does not show up hints even if there are two equidistant candidates > from the same RTE. I think it should. Everyone is going to have an opinion on something like that. I was showing deference to the general concern about the absolute (as opposed to relative) quality of the HINTs in the event of equidistant matches by having no two suggestions come from within a single RTE, while still covering the case I thought was important by having two suggestions if there were two equidistant matches across RTEs. I think that's marginally better then what you propose, because your case deals with two equidistant though distinct columns, bringing into question the validity of both would-be suggestions. I'll defer to whatever the consensus is. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier <michael.paquier@gmail.com> wrote: >> So there'd be one variant within core and one within >> contrib/fuzzystrmatch? I don't think that's an improvement. > > No. The main difference between varstr_leven_less_equal and varstr_leven is > the use of the extra argument max_d in the former. My argument here is > instead of blindly cut-pasting into core the code you are interested in to > evaluate the string distances, is to refactor it to have a unique function, > and to let the business with LEVENSHTEIN_LESS_EQUAL within > contrib/fuzzystrmatch. This will require some reshuffling of the distance > function, but by looking at this patch I am getting the feeling that this is > necessary, and should even be split into a first patch for fuzzystrmatch > that would facilitate its integration into core. > Also why is rest_of_char_same within varlena.c? Just as before, rest_of_char_same() exists for the express purpose of being called by the two variants varstr_leven_less_equal() and varstr_leven(). Why wouldn't I copy it over too along with those two? Where do you propose to put it? Obviously the existing macro hacks (that I haven't changed) that build the two variants are not terribly pretty, but they're not arbitrary either. They reflect the fact that there is no natural way to add callbacks or something like that. If you pretended that the core code didn't have to care about one case or the other, and that contrib was somehow obligated to hook in its own handler for the !LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd end up with an even bigger mess. Besides, with the patch the core code is calling varstr_leven_less_equal(), which is the bigger of the two variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the !LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of building HINTs. In short, I don't know what you mean. What would that reshuffling actually look like? >> > 5) Do we want hints on system columns as well? >> I think it's obvious that the answer must be no. That's going to >> frequently result in suggestions of columns that users will complain >> aren't even there. If you know about the system columns, you can just >> get it right. They're supposed to be hidden for most purposes. > > This may sound ridiculous, but I have already found myself mistyping ctid by > tid and cid while working on patches and modules that played with page > format, and needing a couple of minutes to understand what was going on (bad > morning). I think that it's clearly not worth it, even if it is true that a minority sometimes make this mistake. Most users don't know that there are system columns. It's not even close to being worth it to bring that into this. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > >> > 5) Do we want hints on system columns as well? > >> I think it's obvious that the answer must be no. That's going to > >> frequently result in suggestions of columns that users will complain > >> aren't even there. If you know about the system columns, you can just > >> get it right. They're supposed to be hidden for most purposes. > > > > This may sound ridiculous, but I have already found myself mistyping ctid by > > tid and cid while working on patches and modules that played with page > > format, and needing a couple of minutes to understand what was going on (bad > > morning). > > I think that it's clearly not worth it, even if it is true that a > minority sometimes make this mistake. Most users don't know that there > are system columns. It's not even close to being worth it to bring > that into this. I agree with Peter. This is targeted at regular users. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote: >> > I was worried about the common case where a >> > column name is misspelled that would otherwise be ambiguous, which is >> > why that shows a HINT while the single RTE case doesn't >> >> To be clear - I mean a HINT with two suggestions rather than just one. >> If there are 3 or more equally distant suggestions (even if they're >> all from different RTEs) we also give no HINT in the proposed patch. > > Showing up to 2 hints is fine as it does not pollute the error output with > perhaps unnecessary messages. That's even more protective than for example > git that prints all the equidistant candidates. However I can't understand > why it does not show up hints even if there are two equidistant candidates > from the same RTE. I think it should. Me, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan <pg@heroku.com> wrote: > Everyone is going to have an opinion on something like that. I was > showing deference to the general concern about the absolute (as > opposed to relative) quality of the HINTs in the event of equidistant > matches by having no two suggestions come from within a single RTE, > while still covering the case I thought was important by having two > suggestions if there were two equidistant matches across RTEs. I think > that's marginally better then what you propose, because your case > deals with two equidistant though distinct columns, bringing into > question the validity of both would-be suggestions. I'll defer to > whatever the consensus is. I agree this is bike shedding. But as long as we're bike shedding... A simple rule is easier for users to understand as well as to code. I would humbly suggest the following: take all the unqualified column names, downcase them, check which ones match most closely the unmatched column. Show the top 3 matches if they're within some arbitrary distance. If they match exactly except for the case and the unmatched column is all lower case add a comment that quoting is required due to the mixed case. Honestly the current logic and the previous logic both seemed reasonable to me. They're not going to be perfect in every case so anything that comes up some some suggestions is fine. -- greg
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote: > A simple rule is easier for users to understand as well as to code. I > would humbly suggest the following: take all the unqualified column > names, downcase them, check which ones match most closely the > unmatched column. Show the top 3 matches if they're within some > arbitrary distance. That's harder than it sounds. You need even more translatable strings for variant ereports(). I don't think that an easy to understand rule is necessarily of much value - I'm already charging half price for deletion because I found representative errors more useful in certain cases by doing so. I think we want something that displays the most useful suggestion as often as is practically possible, and does not display unhelpful suggestions to the extent that it's practical to avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff to scanRTEForColumn() than I already have. > Honestly the current logic and the previous logic both seemed > reasonable to me. They're not going to be perfect in every case so > anything that comes up some some suggestions is fine. I think that the most recent revision is somewhat better due to the feedback of Tom and Robert. I didn't feel as strongly as they did about erring on the side of not showing a HINT, but I think the most recent revision is a good compromise. But yes, at this point we're certainly chasing diminishing returns. There are almost any number of variants of this basic idea that could be suggested. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Showing up to 2 hints is fine as it does not pollute the error output with >> perhaps unnecessary messages. That's even more protective than for example >> git that prints all the equidistant candidates. However I can't understand >> why it does not show up hints even if there are two equidistant candidates >> from the same RTE. I think it should. > > Me, too. The idea is that each RTE gets one best suggestion, because if there are two best suggestions within an RTE they're probably both wrong. Whereas across RTEs, it's probably just that there is a foreign key relationship between the two (and the user accidentally failed to qualify the particular column of interest on top of the misspelling, a qualification that would be sufficient to have the code prefer the qualified-but-misspelled column). Clearly if I was to do what you suggest it would be closer to a wild guess, and Tom has expressed concerns about that. Now, I don't actually ensure that the column names of the two columns (each from separate RTEs) are identical save for their would-be alias, but that's just a consequence of the implementation. Also, as I've mentioned, I don't want to put more stuff in scanRTEForColumn() than I already have, due to your earlier concern about adding clutter. I think we're splitting hairs at this point, and frankly I'll do it that way if it gets the patch closer to being committed. While I thought it was important to get the unqualified and misspelled case right (which I did in the first revision, but perhaps at the expense of Tom's concern about absolute suggestion quality), I don't feel strongly about this detail either way. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote: > > A simple rule is easier for users to understand as well as to code. I > > would humbly suggest the following: take all the unqualified column > > names, downcase them, check which ones match most closely the > > unmatched column. Show the top 3 matches if they're within some > > arbitrary distance. > > That's harder than it sounds. You need even more translatable strings > for variant ereports(). Maybe it is possible to rephrase the message so that the translatable part doesn't need to concern with how many suggestions there are. For instance something like "perhaps you meant a name from the following list: foo, bar, baz". Couple with the errmsg_plural stuff, you then don't need to worry too much about providing different strings for 1, 2, N suggestions. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> That's harder than it sounds. You need even more translatable strings >> for variant ereports(). > > Maybe it is possible to rephrase the message so that the translatable > part doesn't need to concern with how many suggestions there are. For > instance something like "perhaps you meant a name from the following > list: foo, bar, baz". Couple with the errmsg_plural stuff, you then > don't need to worry too much about providing different strings for 1, 2, > N suggestions. That's not really the problem. I already have a lot of things to test in each of the two ereport() calls. More importantly, showing the closet, say, 3 matches under an arbitrary distance does not weigh concerns about that indicating that they're all bad. It's not like bash tab completion - if there is one best match, that's probably because that's what the user meant. Whereas if there are two or more within a single RTE, that's probably because both are unhelpful. They both happened to require the same number of substitutions to get to, while not being quite bad enough matches to be excluded by the final check against a normalized distance threshold (the final check that prevents ludicrous suggestions). The fact that there were multiple equally plausible candidates (that are not identically named and just from different RTEs) tells us plenty, unlike with tab completion. It's not hard for one column to be a better match than another, and so it doesn't seem unreasonable to insist upon that within a single RTE where they cannot be identical, since a conservative approach seems to be what is generally favored. In any case I'm just trying to weigh everyone's concerns here. I hope it's actually possible to compromise, but right now I don't know what I can do to make useful progress. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > 6) Sometimes no hints are returned... Even in simple cases like this one: > =# create table foo (aa int, bb int); > CREATE TABLE > =# select ab from foo; > ERROR: 42703: column "ab" does not exist > LINE 1: select ab from foo; > ^ > LOCATION: errorMissingColumn, parse_relation.c:3123 In this example, it seems obvious that both "aa" and "bb" should be suggested when they are not. But what if there were far more columns, as might be expected in realistic cases (suppose all other columns have at least 3 characters)? That's another kettle of fish. The assumption that it's probably one of those two equally distant columns is now on very shaky ground. After all, the user can only have meant one particular column. If we apply a limited kind of Turing test to this second case, how does the most recent revision's algorithm do? What would a human suggest? I'm pretty sure the answer is that the human would shrug. Maybe he or she would say "I guess you might have meant one of either aa or bb, but that really isn't obvious at all". That doesn't inspire much confidence. Now, maybe I should be more optimistic about it being one of the two because there are only two possibilities to begin with. That seems pretty dubious, though. In general I find it much more plausible based on what we know that the user should rethink everything. And, as Tom pointed out, showing nothing conveys something in itself once users have been trained to expect something. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Wed, Jul 9, 2014 at 3:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
Something like the patch 1 attached...What would that reshuffling actually look like?
Btw, re-reading this thread, everybody seem to agree that this is a useful feature, but we still do not have clear definitions of the circumstances under which column hints should be produced, except the number (up to two). So, putting my hands on it and biting the bullet, I have finished with the two patches attached making the implementation clearer:
- Patch 1 moves levenshtein functions from fuzzystrmatch to core.
- Patch 1 moves levenshtein functions from fuzzystrmatch to core.
- Patch 2 implements the column hints, rather unchanged from original proposition.
Patch 1 does a couple of things:
- fuzzystrmatch is dumped to 1.1, as Levenshtein functions are not part of it anymore, and moved to core.
- Removal of the LESS_EQUAL flag that made the original submission patch harder to understand. All the Levenshtein functions wrap a single common function.
- Documentation is moved, and regression tests for Levenshtein functions are added.
- Functions with costs are renamed with a suffix with costs.
After hacking this feature, I came up with the conclusion that it would be better for the user experience to move directly into backend code all the Levenshtein functions, instead of only moving in the common wrapper as Peter did in his original patches. This is done this way to avoid keeping portions of the same feature in two different places of the code (backend with common routine, fuzzystrmatch with levenshtein functions) and concentrate all the logic in a single place. Now, we may as well consider renaming the levenshtein functions into smarter names, like str_distance, and keep fuzzystrmatch to 1.0, having the functions levenshteing_* calling only the str_distance functions.
Having a set of in-core distance functions for strings would serve more general purposes like other object hinting (constraint names, tables, etc.).
Patch 2 is a rebase of the feature of Peter that can be applied on top of patch 1. The code is rather untouched (haven't much played with Peter's thingies), well-commented, but I think that this needs more work, particularly when a query has a single RTE like in this case where no hints are proposed to the user (mentioned upthread):
create table foo (aa int, bb int);
create table foo (aa int, bb int);
select ab from foo; -- no hints
Before doing anything more with patch 2, we still need to define clearly how hints should be produced, so that's clearly out-of-scope for this CF. Patch 1, though, prepares the field for hints of all kinds, so perhaps we could argue more on that first?
Regards,
--
Michael
Michael
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
I am not opposed to moving the contrib code into core in the manner that you oppose. I don't feel strongly either way. I noticed in passing that your revision says this *within* levenshtein.c: + * Guaranteed to work with Name datatype's cstrings. + * For full details see levenshtein.c. On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > Patch 2 is a rebase of the feature of Peter that can be applied on top of > patch 1. The code is rather untouched (haven't much played with Peter's > thingies), well-commented, but I think that this needs more work, > particularly when a query has a single RTE like in this case where no hints > are proposed to the user (mentioned upthread): The only source of disagreement that I am aware of at this point is the question of whether or not we should accept two candidates from the same RTE. I lean slightly towards "no", as already explained [1] [2], but it's not as if I feel that strongly either way - this approach of looking for only a single best candidate per RTE taken in deference to the concerns of others. I imagined that when a committer picked this up, an executive decision would be made one way or the other. I am quite willing to revise the patch to alter this behavior at the request of a committer. [1] http://www.postgresql.org/message-id/CAM3SWZTrm4PmqMmL9=eYx-8f-Vx-ha7DmE4KOmS2vCOMOzGHrw@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM3SWZS6kiQEqJz4pV3Fkp6cgw1wS26exOQTjb_XMW3zE5b6mA@mail.gmail.com -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote: > I am not opposed to moving the contrib code into core in the manner > that you oppose. I don't feel strongly either way. > > I noticed in passing that your revision says this *within* levenshtein.c: > > + * Guaranteed to work with Name datatype's cstrings. > + * For full details see levenshtein.c. Yeah, I looked at what I produced yesterday night again and came across a couple of similar things :) And reworked a couple of things in the version attached, mainly wordsmithing and adding comments here and there, as well as making the naming of the Levenshtein functions in core the same as the ones in fuzzystrmatch 1.0. > I imagined that when a committer picked this up, an executive decision > would be made one way or the other. I am quite willing to revise the > patch to alter this behavior at the request of a committer. Fine for me. I'll move this patch to the next stage then. -- Michael
Вложения
On Thu, Jul 17, 2014 at 9:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > Patch 1 does a couple of things: > - fuzzystrmatch is dumped to 1.1, as Levenshtein functions are not part of > it anymore, and moved to core. > - Removal of the LESS_EQUAL flag that made the original submission patch > harder to understand. All the Levenshtein functions wrap a single common > function. > - Documentation is moved, and regression tests for Levenshtein functions are > added. > - Functions with costs are renamed with a suffix with costs. > After hacking this feature, I came up with the conclusion that it would be > better for the user experience to move directly into backend code all the > Levenshtein functions, instead of only moving in the common wrapper as Peter > did in his original patches. This is done this way to avoid keeping portions > of the same feature in two different places of the code (backend with common > routine, fuzzystrmatch with levenshtein functions) and concentrate all the > logic in a single place. Now, we may as well consider renaming the > levenshtein functions into smarter names, like str_distance, and keep > fuzzystrmatch to 1.0, having the functions levenshteing_* calling only the > str_distance functions. This is not cool. Anyone who is running a 9.4 or earlier database using fuzzystrmatch and upgrades, either via dump-and-restore or pg_upgrade, to a version with this patch applied will have a broken database. They will still have the catalog entries for the 1.0 definitions, but those definitions won't be resolvable inside the new cluster's .so file. The user will get a fairly-unfriendly error message that won't go away until they upgrade the extension, which may involve dealing with dependency hell since the new definitions are in a different place than the old definitions, and there may be dependencies on the old definitions. One of the great advantages of extension packaging is that this kind of problem is quite easily avoidable, so let's avoid it. There are several possible methods of doing that, but I think the best one is just to leave the SQL-callable C functions in fuzzystrmatch and move only the underlying code that supports into core. Then, the whole thing will be completely transparent to users. They won't need to upgrade their fuzzystrmatch definitions at all, and everything will just work; under the covers, the fuzzystrmatch code will now be calling into core code rather than to code located in that same module, but the user doesn't need to know or care about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > There are several possible methods of doing that, but I think the best > one is just to leave the SQL-callable C functions in fuzzystrmatch and > move only the underlying code that supports into core. I hadn't been paying close attention to this thread, but I'd just assumed that that would be the approach. It might be worth introducing new differently-named pg_proc entries for the same functions in core, but only if we can agree that there are better names for them than what the extension uses. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Jul 23, 2014 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > There are several possible methods of doing that, but I think the best > one is just to leave the SQL-callable C functions in fuzzystrmatch and > move only the underlying code that supports into core. For some reason I thought that that was what Michael was proposing - a more comprehensive move of code into core than the structuring that I proposed. I actually thought about a Levenshtein distance operator at one point months ago, before I entirely gave up on that. The MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein distance functions are not suitable for core as is (although that doesn't matter for my purposes, since all I need is something that accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a considerable limitation for an in-core feature. I didn't get around to forming an opinion on how and if that should be fixed. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > For some reason I thought that that was what Michael was proposing - a > more comprehensive move of code into core than the structuring that I > proposed. I actually thought about a Levenshtein distance operator at > one point months ago, before I entirely gave up on that. The > MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein > distance functions are not suitable for core as is (although that > doesn't matter for my purposes, since all I need is something that > accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a > considerable limitation for an in-core feature. I didn't get around to > forming an opinion on how and if that should be fixed. I had two thoughts: 1. Should we consider making levenshtein available to frontend programs as well as backend? 2. Would it provide better matching to use Damerau-Levenshtein[1] instead of raw Levenshtein? .oO(Would anyone be so bold as to attempt to implement bitap[2] using bitmapsets ...) [1] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance [2] http://en.wikipedia.org/wiki/Bitap_algorithm -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Jul 23, 2014 at 1:10 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I had two thoughts: > > 1. Should we consider making levenshtein available to frontend programs > as well as backend? I don't think so. Why would that be useful? > 2. Would it provide better matching to use Damerau-Levenshtein[1] instead > of raw Levenshtein? Maybe that would be marginally better than classic Levenshtein distance, but I doubt it would pay for itself. It's just more code to maintain. Are we really expecting to not get the best possible suggestion due to some number of transposition errors very frequently? You still have to have a worse suggestion spuriously get ahead of yours, and typically there just aren't that many to begin with. I'm not targeting spelling errors so much as thinkos around plurals and whether or not an underscore was used. Damerau-Levenshtein seems like an algorithm with fairly specialized applications. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
-- Robert Haas <robertmhaas@gmail.com> writes:I hadn't been paying close attention to this thread, but I'd just assumed
> There are several possible methods of doing that, but I think the best
> one is just to leave the SQL-callable C functions in fuzzystrmatch and
> move only the underlying code that supports into core.
that that would be the approach.
It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.
Yes, that's a point I raised upthread as well. What about renaming those functions as string_distance and string_distance_less_than? Then have only fuzzystrmatch do some DirectFunctionCall using the in-core functions?
Michael
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > Maybe that would be marginally better than classic Levenshtein > distance, but I doubt it would pay for itself. It's just more code to > maintain. Are we really expecting to not get the best possible > suggestion due to some number of transposition errors very frequently? > You still have to have a worse suggestion spuriously get ahead of > yours, and typically there just aren't that many to begin with. I'm > not targeting spelling errors so much as thinkos around plurals and > whether or not an underscore was used. Damerau-Levenshtein seems like > an algorithm with fairly specialized applications. Yes, it's for typos. I guess it's an unfrequent scenario to have both a typoed column and a column that's missing the plural declension, which is the case in which Damerau-Lvsh would be a win. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Heikki Linnakangas
Дата:
On 07/18/2014 10:47 AM, Michael Paquier wrote: > On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote: >> I am not opposed to moving the contrib code into core in the manner >> that you oppose. I don't feel strongly either way. >> >> I noticed in passing that your revision says this *within* levenshtein.c: >> >> + * Guaranteed to work with Name datatype's cstrings. >> + * For full details see levenshtein.c. > Yeah, I looked at what I produced yesterday night again and came > across a couple of similar things :) And reworked a couple of things > in the version attached, mainly wordsmithing and adding comments here > and there, as well as making the naming of the Levenshtein functions > in core the same as the ones in fuzzystrmatch 1.0. > >> I imagined that when a committer picked this up, an executive decision >> would be made one way or the other. I am quite willing to revise the >> patch to alter this behavior at the request of a committer. > Fine for me. I'll move this patch to the next stage then. There are a bunch of compiler warnings: parse_relation.c: In function ‘errorMissingColumn’: parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized in this function [-Wmaybe-uninitialized] parse_relation.c:3066:8: note: ‘closestcol1’ was declared here parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized in this function [-Wmaybe-uninitialized] parse_relation.c:3067:8: note: ‘closestcol2’ was declared here levenshtein.c: In function ‘levenshtein_common’: levenshtein.c:107:6: warning: unused variable ‘start_column_local’ [-Wunused-variable] - Heikki
On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 07/18/2014 10:47 AM, Michael Paquier wrote: >> >> On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote: >>> >>> I am not opposed to moving the contrib code into core in the manner >>> that you oppose. I don't feel strongly either way. >>> >>> I noticed in passing that your revision says this *within* levenshtein.c: >>> >>> + * Guaranteed to work with Name datatype's cstrings. >>> + * For full details see levenshtein.c. >> >> Yeah, I looked at what I produced yesterday night again and came >> across a couple of similar things :) And reworked a couple of things >> in the version attached, mainly wordsmithing and adding comments here >> and there, as well as making the naming of the Levenshtein functions >> in core the same as the ones in fuzzystrmatch 1.0. >> >>> I imagined that when a committer picked this up, an executive decision >>> would be made one way or the other. I am quite willing to revise the >>> patch to alter this behavior at the request of a committer. >> >> Fine for me. I'll move this patch to the next stage then. > > > There are a bunch of compiler warnings: > > parse_relation.c: In function ‘errorMissingColumn’: > parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized > in this function [-Wmaybe-uninitialized] > parse_relation.c:3066:8: note: ‘closestcol1’ was declared here > parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized > in this function [-Wmaybe-uninitialized] > parse_relation.c:3067:8: note: ‘closestcol2’ was declared here > levenshtein.c: In function ‘levenshtein_common’: > levenshtein.c:107:6: warning: unused variable ‘start_column_local’ > [-Wunused-variable] Based on this review from a month ago, I'm going to mark this Waiting on Author. If nobody updates the patch in a few days, I'll mark it Returned with Feedback. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Based on this review from a month ago, I'm going to mark this Waiting > on Author. If nobody updates the patch in a few days, I'll mark it > Returned with Feedback. Thanks. Attached revision fixes the compiler warning that Heikki complained about. I maintain SQL-callable stub functions from within contrib, rather than follow Michael's approach. In other words, very little has changed from my revision from July last [1]. Reminder: I maintain a slight preference for only offering one suggestion per relation RTE, which is what this revision does (so no change there). If a committer who picks this up wants me to alter that, I don't mind doing so; since only Michael spoke up on this, I've kept things my way. This is not a completion mechanism; it is supposed to work on *complete* column references with slight misspellings (e.g. incorrect use of plurals, or column references with an omitted underscore character). Weighing Tom's concerns about suggestions that are of absolute low quality is what makes me conclude that this is the thing to do. [1] http://www.postgresql.org/message-id/CAM3SWZTzQO=OY4jmfB-65ieFie8iHUkDErK-0oLJETm8dSrSpw@mail.gmail.com -- Peter Geoghegan
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan<span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><spanclass="">On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <<a href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> > Based on this review from a month ago,I'm going to mark this Waiting<br /> > on Author. If nobody updates the patch in a few days, I'll mark it<br /> >Returned with Feedback. Thanks.<br /><br /></span>Attached revision fixes the compiler warning that Heikki complained<br/> about. I maintain SQL-callable stub functions from within contrib,<br /> rather than follow Michael's approach.In other words, very little has<br /> changed from my revision from July last [1].<span class=""><font color="#888888"><br/></font></span></blockquote></div>FWIW, I still find this bit of code that this patch adds in varlena.cugly:<br />+#include "levenshtein.c"<br />+#define LEVENSHTEIN_LESS_EQUAL<br />+#include "levenshtein.c"<br />+#undefLEVENSHTEIN_LESS_EQUAL<br /></div><div class="gmail_extra">-- <br /><div class="gmail_signature">Michael<br /></div></div></div>
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > FWIW, I still find this bit of code that this patch adds in varlena.c ugly: > > +#include "levenshtein.c" > +#define LEVENSHTEIN_LESS_EQUAL > +#include "levenshtein.c" > +#undef LEVENSHTEIN_LESS_EQUAL Okay, but this is the coding that currently appears within contrib's fuzzystrmatch.c, more or less unchanged. The "#undef LEVENSHTEIN_LESS_EQUAL" line that I added ought to be unnecessary. I'll give the final word on that to whoever picks this up. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
<div dir="ltr">On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan <span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span>wrote:<br /><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote"style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> Reminder: Imaintain a slight preference for only offering one<br /> suggestion per relation RTE, which is what this revision does (sono<br /> change there). If a committer who picks this up wants me to alter<br /> that, I don't mind doing so; since onlyMichael spoke up on this, I've<br /> kept things my way.<span class=""><font color="#888888"><br /></font></span></blockquote></div>Hm.The last version of this patch has not really changed since since my first review,and I have no more feedback to provide about it except what I already mentioned. I honestly don't think that thispatch is ready for committer as-is... If someone wants to review it further, well extra opinions I am sure are welcome.<br/>-- <br /><div class="gmail_signature">Michael<br /></div></div></div>
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Hm. The last version of this patch has not really changed since since my > first review, and I have no more feedback to provide about it except what I > already mentioned. I honestly don't think that this patch is ready for > committer as-is... If someone wants to review it further, well extra > opinions I am sure are welcome. Why not? You've already said that you're happy to defer to whatever committer picks this up with regard to whether or not more than a single suggestion can come from an RTE. I agreed with this (i.e. I said I'd defer to their opinion too), and once again drew particular attention to this state of affairs alongside my most recent revision. What does that leave? -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan <pg@heroku.com> wrote: > Why not? > > You've already said that you're happy to defer to whatever committer > picks this up with regard to whether or not more than a single > suggestion can come from an RTE. I agreed with this (i.e. I said I'd > defer to their opinion too), and once again drew particular attention > to this state of affairs alongside my most recent revision. > > What does that leave? I see you've marked this "Needs Review", even though your previously marked it "Ready for Committer" a few months back (Robert marked it "Waiting on Author" very recently because of the compiler warning, and then I marked it back to "Ready for Committer" once that was addressed, before you finally marked it back to "Needs Review" and removed yourself as the reviewer just now). I'm pretty puzzled by this. Other than our "agree to disagree and defer to committer" position on the question of whether or not more than one suggestion can come from a single RTE, which you were fine with before [1], I have only restored the core/contrib separation to a state recently suggested by Robert as the best and simplest all around [2]. Did I miss something else? [1] http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-BVqzgP0qA1Y_xDC_1g@mail.gmail.com [2] http://www.postgresql.org/message-id/CA+TgmoYKiiq8MC0UJ5i5XfkTYBg1qqfN4YRCkZ60YDUnumkzzQ@mail.gmail.com -- Peter Geoghegan
On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Based on this review from a month ago, I'm going to mark this Waiting >> on Author. If nobody updates the patch in a few days, I'll mark it >> Returned with Feedback. Thanks. > > Attached revision fixes the compiler warning that Heikki complained > about. I maintain SQL-callable stub functions from within contrib, > rather than follow Michael's approach. In other words, very little has > changed from my revision from July last [1]. I agree with your proposed approach to moving Levenshtein into core. However, I think this should be separated into two patches, one of them moving the Levenshtein functionality into core, and the other adding the new treatment for missing column errors. If you can do that relatively soon, I'll make an effort to get the refactoring patch committed in the near future. Once that's done, we can focus in on the interesting part of the patch, which is the actual machinery for suggesting alternatives. On that topic, I think there's unanimous consensus against the design where equally-distant matches are treated differently based on whether they are in the same RTE or different RTEs. I think you need to change that if you want to get anywhere with this. On a related note, the use of the additional parameter AttrNumber closest[2] to searchRangeTableForCol() and of the additional parameters AttrNumber *matchedatt and int *distance to scanRTEForColumn() is less than self-documenting. I suggest creating a structure called something like FuzzyAttrMatchState and passing a pointer to it down to both functions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I agree with your proposed approach to moving Levenshtein into core. > However, I think this should be separated into two patches, one of > them moving the Levenshtein functionality into core, and the other > adding the new treatment for missing column errors. If you can do > that relatively soon, I'll make an effort to get the refactoring patch > committed in the near future. Once that's done, we can focus in on > the interesting part of the patch, which is the actual machinery for > suggesting alternatives. Okay, thanks. I think I can do that fairly soon. > On that topic, I think there's unanimous consensus against the design > where equally-distant matches are treated differently based on whether > they are in the same RTE or different RTEs. I think you need to > change that if you want to get anywhere with this. Alright. It wasn't as if I felt very strongly about it either way. > On a related note, > the use of the additional parameter AttrNumber closest[2] to > searchRangeTableForCol() and of the additional parameters AttrNumber > *matchedatt and int *distance to scanRTEForColumn() is less than > self-documenting. I suggest creating a structure called something > like FuzzyAttrMatchState and passing a pointer to it down to both > functions. Sure. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I agree with your proposed approach to moving Levenshtein into core. >> However, I think this should be separated into two patches, one of >> them moving the Levenshtein functionality into core, and the other >> adding the new treatment for missing column errors. If you can do >> that relatively soon, I'll make an effort to get the refactoring patch >> committed in the near future. Once that's done, we can focus in on >> the interesting part of the patch, which is the actual machinery for >> suggesting alternatives. > > Okay, thanks. I think I can do that fairly soon. Attached patch moves the Levenshtein distance implementation into core. You're missing patch 2 of 2 here, because I have yet to incorporate your feedback on the HINT itself -- when I've done that, I'll post a newly rebased patch 2/2, with those items taken care of. As you pointed out, there is no reason to wait for that. -- Peter Geoghegan
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote: > Attached patch moves the Levenshtein distance implementation into core. Oops. Somehow managed to send a *.patch.swp file. :-) Here is the actual patch. -- Peter Geoghegan
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan <pg@heroku.com> wrote: > I'm pretty puzzled by this. Other than our "agree to disagree and > defer to committer" position on the question of whether or not more > than one suggestion can come from a single RTE, which you were fine > with before [1], I have only restored the core/contrib separation to a > state recently suggested by Robert as the best and simplest all around > [2]. > Did I miss something else? My point is: I am not sure I can be defined as a reviewer of this patch or take any credit in this patch review knowing that the latest version submitted is a simple rebase of the version I did my first review on. Hence, code speaking, this patch is in the same state as when it has been firstly submitted. Thanks, -- Michael
On Wed, Nov 12, 2014 at 10:42 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan <pg@heroku.com> wrote: >> I'm pretty puzzled by this. Other than our "agree to disagree and >> defer to committer" position on the question of whether or not more >> than one suggestion can come from a single RTE, which you were fine >> with before [1], I have only restored the core/contrib separation to a >> state recently suggested by Robert as the best and simplest all around >> [2]. >> Did I miss something else? > My point is: I am not sure I can be defined as a reviewer of this > patch or take any credit in this patch review knowing that the latest > version submitted is a simple rebase of the version I did my first > review on. Hence, code speaking, this patch is in the same state as > when it has been firstly submitted. Of course you can. Time spent reviewing is time spent reviewing, whether it results in changes to the patch or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 13, 2014 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> My point is: I am not sure I can be defined as a reviewer of this >> patch or take any credit in this patch review knowing that the latest >> version submitted is a simple rebase of the version I did my first >> review on. Hence, code speaking, this patch is in the same state as >> when it has been firstly submitted. > > Of course you can. Time spent reviewing is time spent reviewing, > whether it results in changes to the patch or not. My thoughts exactly. I thought Michael did a good job, even if I didn't agree with everything he said. -- Peter Geoghegan
On Wed, Nov 12, 2014 at 8:00 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Attached patch moves the Levenshtein distance implementation into core. > > Oops. Somehow managed to send a *.patch.swp file. :-) > > Here is the actual patch. Committed. I changed varstr_leven() to varstr_levenshtein() because abbrvs cn mk the code hrd to undstnd. And to grep. And I removed the StaticAssertStmt you added, because it's not actually used for anything that necessitates that, yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 13, 2014 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Committed. I changed varstr_leven() to varstr_levenshtein() because > abbrvs cn mk the code hrd to undstnd. And to grep. Thanks. I'll produce a revision of patch 2/2 soon. > And I removed the > StaticAssertStmt you added, because it's not actually used for > anything that necessitates that, yet. I'll add it back in in patch 2/2, so. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On that topic, I think there's unanimous consensus against the design > where equally-distant matches are treated differently based on whether > they are in the same RTE or different RTEs. I think you need to > change that if you want to get anywhere with this. On a related note, > the use of the additional parameter AttrNumber closest[2] to > searchRangeTableForCol() and of the additional parameters AttrNumber > *matchedatt and int *distance to scanRTEForColumn() is less than > self-documenting. I suggest creating a structure called something > like FuzzyAttrMatchState and passing a pointer to it down to both > functions. Attached patch incorporates this feedback. The only user-visible difference between this revision and the previous revision is that it's quite possible for two suggestion to originate from the same RTE (there is exactly one change in the regression test's expected output as compared to the last revision for this reason. The regression tests are otherwise unchanged). It's still not possible to see more than 2 suggestions under any circumstances, no matter where they might have originated from, which I think is appropriate -- we continue to not present any HINT in the event of 3 or more equidistant matches. I think that the restructuring required to pass around a state variable has resulted in somewhat clearer code. -- Peter Geoghegan
Вложения
On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan <pg@heroku.com> wrote: > Attached patch incorporates this feedback. > > The only user-visible difference between this revision and the > previous revision is that it's quite possible for two suggestion to > originate from the same RTE (there is exactly one change in the > regression test's expected output as compared to the last revision for > this reason. The regression tests are otherwise unchanged). It's still > not possible to see more than 2 suggestions under any circumstances, > no matter where they might have originated from, which I think is > appropriate -- we continue to not present any HINT in the event of 3 > or more equidistant matches. > > I think that the restructuring required to pass around a state > variable has resulted in somewhat clearer code. Cool! I'm grumpy about the distanceName() function. That seems too generic. If we're going to keep this as it is, I suggest something like computeRTEColumnDistance(). But see below. On a related note, I'm also grumpy about this comment: + /* Charge half as much per deletion as per insertion or per substitution */ + return varstr_levenshtein_less_equal(actual, len, match, match_len, + 2, 1, 2, max); The purpose of a code comment is to articulate WHY we did something, rather than simply to restate what the code quite obviously does. I haven't heard a compelling argument for why this should be 2, 1, 2 rather than the default 1, 1, 1; and I'm inclined to do the latter unless you can make some very good argument for this combination of weights. And if you can make such an argument, then there should be comments so that the next person to come along and look at this code doesn't go, huh, that's whacky, and change it. + int location, FuzzyAttrMatchState *rtestate) I suggest calling this "fuzzystate" rather than "rtestate"; it's not the state of the RTE, but the state of the fuzzy matching. Within the scanRTEForColumn block, we have a rather large chunk of code protected by if (rtestate), which contains the only call to distanceName(). I suggest that we move all of this logic to a separate, static function, and merge distanceName into it. I also suggest testing against NULL explicitly instead of implicitly. So this block of code would end up as something like: if (fuzzystate != NULL) updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate); In searchRangeTableForCol, I'm fairly certain that you've changed the behavior by adding a check for if (rte->rtekind == RTE_JOIN) before the call to scanRTEForColumn(). Why not instead put this check into updateFuzzyAttrMatchState? Then you can be sure you're not changing the behavior in any other case. On a similar note, I think the dropped-column test should happen early as well, probably again in updateFuzzyAttrMatchState(). There's little point in adding a suggestion only to throw it away again. + /* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ + if (wrongalias) + rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. ERROR: column "oid" does not existLINE 1: select oid > 0, * from altwithoid; ^ +HINT: Perhaps you meant to reference the column "altwithoid"."col". That seems like a stretch. I think I suggested before using a distance threshold of at most 3 or half the word length, whichever is less. For a three-letter column name that means not suggesting anything if more than one character is different. What you implemented here is close to that, yet somehow we've got a suggestion slipping through that has 2 out of 3 characters different. I'm not quite sure I see how that's getting through, but I think it shouldn't. ERROR: column fullname.text does not existLINE 1: select fullname.text from fullname; ^ +HINT: Perhaps you meant to reference the column "fullname"."last". Same problem, only worse! They've only got one letter of four in common. ERROR: column "xmin" does not existLINE 1: select xmin, * from fooview; ^ +HINT: Perhaps you meant to reference the column "fooview"."x". Basically the same problem again. I think the distance threshold in this case should be half the shorter column name, i.e. 0. Your new test cases include no negative test cases; that is, cases where the machinery declines to suggest a hint because of, say, 3 equally good possibilities. They probably should have something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm grumpy about the distanceName() function. That seems too generic. > If we're going to keep this as it is, I suggest something like > computeRTEColumnDistance(). But see below. Fair point. > On a related note, I'm also grumpy about this comment: > > + /* Charge half as much per deletion as per insertion or per substitution */ > + return varstr_levenshtein_less_equal(actual, len, match, match_len, > + 2, 1, 2, max); > > The purpose of a code comment is to articulate WHY we did something, > rather than simply to restate what the code quite obviously does. I > haven't heard a compelling argument for why this should be 2, 1, 2 > rather than the default 1, 1, 1; and I'm inclined to do the latter > unless you can make some very good argument for this combination of > weights. And if you can make such an argument, then there should be > comments so that the next person to come along and look at this code > doesn't go, huh, that's whacky, and change it. Okay. I agree that that deserves a comment. The actual argument for this formulation is that it just seems to work better that way. For example: """ postgres=# \d orderlines Table "public.orderlines" Column | Type | Modifiers -------------+----------+-----------orderlineid | integer | not nullorderid | integer | not nullprod_id | integer | not nullquantity | smallint | not nullorderdate | date | not null postgres=# select qty from orderlines ; ERROR: 42703: column "qty" does not exist LINE 1: select qty from orderlines ; ^ HINT: Perhaps you meant to reference the column "orderlines"."quantity". """ The point is that the fact that the user supplied "qty" string has so many fewer characters than what was obviously intended - "quantity" - deserves to be weighed less. If you change the costing to weigh character deletion as being equal to substitution/addition, this example breaks. I also think it's pretty common to have noise words in every attribute (e.g. every column in the "orderlines" table matches "orderlines_*"), which might otherwise mess things up by overcharging for deletion. Having extra characters in the correctly spelled column name seems legitimately less significant to me. Or, in other words: having actual characters from the misspelling match the correct spelling (and having actual characters given not fail to match) is most important. What was given by the user is more important than what was not given but should have been, which is not generally true for uses of Levenshtein distance. I reached this conclusion through trying out the patch with a couple of real schemas, and seeing what works best. It's hard to express that idea tersely, in a comment, but I guess I'll try. > + int location, FuzzyAttrMatchState *rtestate) > > I suggest calling this "fuzzystate" rather than "rtestate"; it's not > the state of the RTE, but the state of the fuzzy matching. The idea here was to differentiate this state from the overall range table state (in general, FuzzyAttrMatchState may be one or the other). But okay. > Within the scanRTEForColumn block, we have a rather large chunk of > code protected by if (rtestate), which contains the only call to > distanceName(). I suggest that we move all of this logic to a > separate, static function, and merge distanceName into it. I also > suggest testing against NULL explicitly instead of implicitly. So > this block of code would end up as something like: > > if (fuzzystate != NULL) > updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate); Okay. > In searchRangeTableForCol, I'm fairly certain that you've changed the > behavior by adding a check for if (rte->rtekind == RTE_JOIN) before > the call to scanRTEForColumn(). Why not instead put this check into > updateFuzzyAttrMatchState? Then you can be sure you're not changing > the behavior in any other case. I thought that I had avoided changing things (beyond what was advertised as changed in relation to this most recent revision) because I also changed things WRT multiple matches per RTE. It's fuzzy. Anyway, yeah, I could do it there instead. > On a similar note, I think the dropped-column test should happen early > as well, probably again in updateFuzzyAttrMatchState(). There's > little point in adding a suggestion only to throw it away again. Agreed. > + /* > + * Charge extra (for inexact matches only) when an alias was > + * specified that differs from what might have been used to > + * correctly qualify this RTE's closest column > + */ > + if (wrongalias) > + rtestate.distance += 3; > > I don't understand what situation this is catering to. Can you > explain? It seems to account for a good deal of complexity. Two cases: 1. Distinguishing between the case where there was an exact match to a column that isn't visible (i.e. the existing reason for errorMissingColumn() to call here), and the case where there is a visible column, but our alias was the wrong one. I guess that could live in errorMissingColumn(), but overall it's more convenient to do it here, so that errorMissingColumn() handles things almost uniformly and doesn't really have to care. 2. For non-exact (fuzzy) matches, it seems more useful to give one match rather than two when the user gave an alias that matches one particular RTE. Consider this: """ postgres=# select ordersid from orders o join orderlines ol on o.orderid = ol.orderid; ERROR: 42703: column "ordersid" does not exist LINE 1: select ordersid from orders o join orderlines ol on o.orderi... ^ HINT: Perhaps you meant to reference the column "o"."orderid" or the column "ol"."orderid". LOCATION: errorMissingColumn, parse_relation.c:3166 postgres=# select ol.ordersid from orders o join orderlines ol on o.orderid = ol.orderid; ERROR: 42703: column ol.ordersid does not exist LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord... ^ HINT: Perhaps you meant to reference the column "ol"."orderid". LOCATION: errorMissingColumn, parse_relation.c:3147 """ One suggestion is better than two if it's evident that that single suggestion is a better fit. And, more broadly, the fact that an alias was given and matches ought to be weighed. > ERROR: column "oid" does not exist > LINE 1: select oid > 0, * from altwithoid; > ^ > +HINT: Perhaps you meant to reference the column "altwithoid"."col". > > That seems like a stretch. I think I suggested before using a > distance threshold of at most 3 or half the word length, whichever is > less. For a three-letter column name that means not suggesting > anything if more than one character is different. What you > implemented here is close to that, yet somehow we've got a suggestion > slipping through that has 2 out of 3 characters different. I'm not > quite sure I see how that's getting through, but I think it shouldn't. It's because I don't apply the test on smaller strings. I felt that it was riskier to apply an absolute quality test when the user-supplied column reference does not exceed 6 characters. Consider my "qty" vs "quantity" example. If I make this change: --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -929,7 +929,7 @@ searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname, * seen when 6 deletions are required against actual attribute name, or 3 * insertions/substitutions. */ - if (state->distance > 6 && state->distance > strlen(colname) / 2) + if (state->distance > strlen(colname) / 2) { state->rsecond = state->rfirst = NULL; state->second = state->first = InvalidAttrNumber; Then a lot of the examples you complain about are fixed. But the "qty" example is broken. Plus, this happens when the regression tests are run: *** /home/pg/postgresql/src/test/regress/expected/alter_table.out 2014-11-17 11:50:16.476426191 -0800 --- /home/pg/postgresql/src/test/regress/results/alter_table.out 2014-11-17 11:57:40.776410110 -0800 *************** *** 1343,1349 **** ERROR: column "f1" does not exist LINE 1: select f1 from c1; ^ - HINT: Perhaps you meant to reference the column "c1"."f2". drop table p1 cascade; NOTICE: drop cascades to table c1 createtable p1 (f1 int, f2 int); And: *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-17 11:50:16.480426191 -0800 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-17 11:57:08.916411263 -0800 *************** *** 3452,3458 **** ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column "atts"."indexrelid". -- -- Test LATERAL -- (So no hint given in either case) > ERROR: column fullname.text does not exist > LINE 1: select fullname.text from fullname; > ^ > +HINT: Perhaps you meant to reference the column "fullname"."last". > Basically the same problem again. I think the distance threshold in > this case should be half the shorter column name, i.e. 0. Well, there is always going to be the most marginal possible case that still gets to see a suggestion. These are non-organic examples from the regression tests. I'm more worried about having the suggestions work well for organic/representative cases than I am about suppressing non-useful suggestions in non-organic/non-representative cases. As I mentioned, the costing is more or less derived by what I found to work well in what I thought to be representative cases. > Your new test cases include no negative test cases; that is, cases > where the machinery declines to suggest a hint because of, say, 3 > equally good possibilities. They probably should have something like > that. I'll think about that. -- Peter Geoghegan
On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote: > postgres=# select qty from orderlines ; > ERROR: 42703: column "qty" does not exist > LINE 1: select qty from orderlines ; > ^ > HINT: Perhaps you meant to reference the column "orderlines"."quantity". > """ I don't buy this example, because it would give you the same hint if you told it you wanted to access a column called ant, or uay, or tit. And that's clearly ridiculous. The reason why quantity looks like a reasonable suggestion for qty is because it's a conventional abbreviation, but an extremely high percentage of comparable cases won't be. >> + /* >> + * Charge extra (for inexact matches only) when an alias was >> + * specified that differs from what might have been used to >> + * correctly qualify this RTE's closest column >> + */ >> + if (wrongalias) >> + rtestate.distance += 3; >> >> I don't understand what situation this is catering to. Can you >> explain? It seems to account for a good deal of complexity. > > Two cases: > > 1. Distinguishing between the case where there was an exact match to a > column that isn't visible (i.e. the existing reason for > errorMissingColumn() to call here), and the case where there is a > visible column, but our alias was the wrong one. I guess that could > live in errorMissingColumn(), but overall it's more convenient to do > it here, so that errorMissingColumn() handles things almost uniformly > and doesn't really have to care. > > 2. For non-exact (fuzzy) matches, it seems more useful to give one > match rather than two when the user gave an alias that matches one > particular RTE. Consider this: > > """ > postgres=# select ordersid from orders o join orderlines ol on > o.orderid = ol.orderid; > ERROR: 42703: column "ordersid" does not exist > LINE 1: select ordersid from orders o join orderlines ol on o.orderi... > ^ > HINT: Perhaps you meant to reference the column "o"."orderid" or the > column "ol"."orderid". > LOCATION: errorMissingColumn, parse_relation.c:3166 > > postgres=# select ol.ordersid from orders o join orderlines ol on > o.orderid = ol.orderid; > ERROR: 42703: column ol.ordersid does not exist > LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord... > ^ > HINT: Perhaps you meant to reference the column "ol"."orderid". > LOCATION: errorMissingColumn, parse_relation.c:3147 > """ I guess I'm confused at a broader level. If the alias is wrong, why are we considering names in this RTE *at all*? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote: >> postgres=# select qty from orderlines ; >> ERROR: 42703: column "qty" does not exist >> LINE 1: select qty from orderlines ; >> ^ >> HINT: Perhaps you meant to reference the column "orderlines"."quantity". >> """ > > I don't buy this example, because it would give you the same hint if > you told it you wanted to access a column called ant, or uay, or tit. > And that's clearly ridiculous. The reason why quantity looks like a > reasonable suggestion for qty is because it's a conventional > abbreviation, but an extremely high percentage of comparable cases > won't be. Is that so terrible? Yes, if those *exact* strings are tried, that'll happen. But the vast majority of 3 letter strings will not do that (including many 3 letter strings that include one of the letters 'q', 't' and 'y', such as "qqq", "ttt", and "yyy"). Why, in practice, would someone even attempt those strings? I'm worried about Murphy, not Machiavelli. That seems like a pretty important distinction here. I maintain that omission of part of the correct spelling should be weighed less. I am optimizing for the case where the user has a rough idea of the structure and spelling of things - if they're typing in random strings, or totally distinct synonyms, there is little we can do about that. As I said, there will always be the most marginal case that still gets a suggestion. I see no reason to hurt the common case where we help in order to save the user from seeing a "ridiculous" suggestion. I have a final test for the absolute quality of a suggestion, but I think we could easily be too conservative about that. At worst, our "ridiculous" suggestion makes apparent that the user's incorrect spelling was itself ridiculous. With larger strings, we can afford to be more conservative, and we are, because we have more information to go on. Terse column names are not uncommon, though. >>> + /* >>> + * Charge extra (for inexact matches only) when an alias was >>> + * specified that differs from what might have been used to >>> + * correctly qualify this RTE's closest column >>> + */ >>> + if (wrongalias) >>> + rtestate.distance += 3; >>> >>> I don't understand what situation this is catering to. Can you >>> explain? It seems to account for a good deal of complexity. > I guess I'm confused at a broader level. If the alias is wrong, why > are we considering names in this RTE *at all*? Because it's a common mistake when writing ad-hoc queries. People may forget which exact table their column comes from. We certainly want to weigh the fact that an alias was specified, but it shouldn't totally limit our guess to that RTE. If nothing else, the fact that there was a much closer match from another RTE ought to result in forcing there to be no suggestion (due to there being too many equally good suggestions). That's because, as I said, an *absolute* test for the quality of a match is problematic (which, again, is why I err on the side of letting the final, "absolute quality" test not limit suggestions, particularly with short strings). -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> postgres=# select qty from orderlines ; >>> ERROR: 42703: column "qty" does not exist >>> HINT: Perhaps you meant to reference the column "orderlines"."quantity". >> I don't buy this example, because it would give you the same hint if >> you told it you wanted to access a column called ant, or uay, or tit. >> And that's clearly ridiculous. The reason why quantity looks like a >> reasonable suggestion for qty is because it's a conventional >> abbreviation, but an extremely high percentage of comparable cases >> won't be. > I maintain that omission of part of the correct spelling should be > weighed less. I would say that omission of the first letter should completely disqualify suggestions based on this heuristic; but it might make sense to weight omissions less after the first letter. regards, tom lane
On Tue, Nov 18, 2014 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@heroku.com> writes: >> On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote: >>>> postgres=# select qty from orderlines ; >>>> ERROR: 42703: column "qty" does not exist >>>> HINT: Perhaps you meant to reference the column "orderlines"."quantity". > >>> I don't buy this example, because it would give you the same hint if >>> you told it you wanted to access a column called ant, or uay, or tit. >>> And that's clearly ridiculous. The reason why quantity looks like a >>> reasonable suggestion for qty is because it's a conventional >>> abbreviation, but an extremely high percentage of comparable cases >>> won't be. > >> I maintain that omission of part of the correct spelling should be >> weighed less. > > I would say that omission of the first letter should completely disqualify > suggestions based on this heuristic; but it might make sense to weight > omissions less after the first letter. I think we would be well-advised not to start inventing our own approximate matching algorithm. Peter's suggestion boils down to a guess that the default cost parameters for Levenshtein suck, and your suggestion boils down to a guess that we can fix the problems with Peter's suggestion by bolting another heuristic on top of it - and possibly running Levenshtein twice with different sets of cost parameters. Ugh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think we would be well-advised not to start inventing our own > approximate matching algorithm. Peter's suggestion boils down to a > guess that the default cost parameters for Levenshtein suck, and your > suggestion boils down to a guess that we can fix the problems with > Peter's suggestion by bolting another heuristic on top of it - and > possibly running Levenshtein twice with different sets of cost > parameters. Ugh. I agree. While I am perfectly comfortable with the fact that we are guessing here, my guesses are based on what I observed to work well with real schemas, and simulated errors that I thought were representative of human error. Obviously it's possible that another scheme will do better sometimes, including for example a scheme that picks a match entirely at random. But on average, I think that what I have here will do better than anything else proposed so far. -- Peter Geoghegan
On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think we would be well-advised not to start inventing our own >> approximate matching algorithm. Peter's suggestion boils down to a >> guess that the default cost parameters for Levenshtein suck, and your >> suggestion boils down to a guess that we can fix the problems with >> Peter's suggestion by bolting another heuristic on top of it - and >> possibly running Levenshtein twice with different sets of cost >> parameters. Ugh. > > I agree. > > While I am perfectly comfortable with the fact that we are guessing > here, my guesses are based on what I observed to work well with real > schemas, and simulated errors that I thought were representative of > human error. Obviously it's possible that another scheme will do > better sometimes, including for example a scheme that picks a match > entirely at random. But on average, I think that what I have here will > do better than anything else proposed so far. If you agree, then I'm not being clear enough. I don't think think that tinkering with the Levenshtein cost factors is a good idea, and I think it's unhelpful to suggest something when the suggestion and the original word differ by more than 50% of the number characters in the shorter word. Suggesting "col" for "oid" or "x" for "xmax", as crops up in the regression tests with this patch applied, shows the folly of this: the user didn't mean the other named column; rather, the user was confused about whether a particular system column existed for that table. If we had a large database of examples showing what the user typed and what they intended, we could try different algorithms against it and see which one performs best with fewest false positives. But if we don't have that, we should do things that are like the things that other people have done before. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > If you agree, then I'm not being clear enough. I don't think think > that tinkering with the Levenshtein cost factors is a good idea, and I > think it's unhelpful to suggest something when the suggestion and the > original word differ by more than 50% of the number characters in the > shorter word. I agree - except for very short strings, where there is insufficient information to go on. I was talking about the difficulty of bolting something on top of Levenshtein distance that looked at the first character. That would not need two costings, but would require an encoding aware matching of the first code point. > Suggesting "col" for "oid" or "x" for "xmax", as crops > up in the regression tests with this patch applied, shows the folly of > this: the user didn't mean the other named column; rather, the user > was confused about whether a particular system column existed for that > table. Those are all very terse strings. What you're overlooking is what is broken by using straight Levenshtein distance, which includes things in the regression test that are reasonable and helpful. As I mentioned before, requiring a greater than 50% of total string size distance breaks this, just within the regression tests: """ ERROR: column "f1" does not exist LINE 1: select f1 from c1; ^ - HINT: Perhaps you meant to reference the column "c1"."f2". """ And: """ ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column "atts"."indexrelid". """ Those are really useful suggestions! And, they're much more representative of real user error. The downside of weighing deletion less than substitution and insertion is much smaller than the upside. It's worth it. The downside is only that the user gets to see the best suggestion that isn't all that good in an absolute sense (which we have a much harder time concluding using simple tests for short misspellings). > If we had a large database of examples showing what the user typed and > what they intended, we could try different algorithms against it and > see which one performs best with fewest false positives. But if we > don't have that, we should do things that are like the things that > other people have done before. That seems totally impractical. No one has that kind of data that I'm aware of. How about git as a kind of precedent? It is not at all conservative about showing *some* suggestion: """ $ git aa git: 'aa' is not a git command. See 'git --help'. Did you mean this? am $ git d git: 'd' is not a git command. See 'git --help'. Did you mean one of these? diff add $ git ddd git: 'ddd' is not a git command. See 'git --help'. Did you mean this? add """ And why wouldn't git be? As far as its concerned, you can only have meant one of those small number of things. Similarly, with the patch, the number of things we can pick from is fairly limited at this stage, since we are actually fairly far along with parse analysis. Now, this won't give a suggestion: """ $ git aaaa git: 'aaaa' is not a git command. See 'git --help'. """ So it looks like git similarly weighs deletion less than insertion/substitution. Are its suggestions any less ridiculous than your examples of questionable hints from the modified regression test expected output? This is just a guidance mechanism, and at worst we'll show the best match (on the mechanisms own terms, which isn't too bad). -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 10:22 AM, Peter Geoghegan <pg@heroku.com> wrote: > Those are all very terse strings. What you're overlooking is what is > broken by using straight Levenshtein distance, which includes things > in the regression test that are reasonable and helpful. As I mentioned > before, requiring a greater than 50% of total string size distance > breaks this, just within the regression tests: Maybe you'd prefer if there was a more gradual ramp-up to requiring a distance of no greater than 50% of the string size (normalized to take account of my non-default costings). Right now it's a step function of the number of characters in the string - there is no "absolute quality" requirement for strings of 6 or fewer requirements. Otherwise, there is the 50% distance absolute quality test (the test that you want to be applied generally). I think that would be better, without being much more complicated. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan <pg@heroku.com> wrote: > there is no "absolute > quality" requirement for strings of 6 or fewer requirements. I meant 6 or fewer *characters*, obviously. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan <pg@heroku.com> wrote: > Maybe you'd prefer if there was a more gradual ramp-up to requiring a > distance of no greater than 50% of the string size (normalized to take > account of my non-default costings) I made this modification: diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c index 40c69d7..cca075f 100644 --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -929,7 +929,8 @@ searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname, * seen when 6 deletions are required against actual attribute name, or 3 * insertions/substitutions. */ - if (state->distance > 6 && state->distance > strlen(colname) / 2) + if ((state->distance > 3 && state->distance > strlen(colname)) || + (state->distance > 6 && state->distance > strlen(colname) / 2)) { state->rsecond = state->rfirst= NULL; state->second = state->first = InvalidAttrNumber; When I run the regression tests now, then all the cases that you found objectionable in the regression tests' previous expected output disappear, while all the cases I think are useful that were previously removed by applying a broad 50% standard remain. While I'm not 100% sure that this exact formulation is the best one, I think that we can reach a compromise on this point, that allows the costing to remain the same without offering particularly bad suggestions for short strings. -- Peter Geoghegan
On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> If you agree, then I'm not being clear enough. I don't think think >> that tinkering with the Levenshtein cost factors is a good idea, and I >> think it's unhelpful to suggest something when the suggestion and the >> original word differ by more than 50% of the number characters in the >> shorter word. > > I agree - except for very short strings, where there is insufficient > information to go on. That's precisely the time I think it's *most* important. In a very long string, the threshold should be LESS than 50%. My original proposal was "no more than 3 characters of difference, but in any event not more than half the length of the shorter string". >> Suggesting "col" for "oid" or "x" for "xmax", as crops >> up in the regression tests with this patch applied, shows the folly of >> this: the user didn't mean the other named column; rather, the user >> was confused about whether a particular system column existed for that >> table. > > Those are all very terse strings. What you're overlooking is what is > broken by using straight Levenshtein distance, which includes things > in the regression test that are reasonable and helpful. As I mentioned > before, requiring a greater than 50% of total string size distance > breaks this, just within the regression tests: > > """ > ERROR: column "f1" does not exist > LINE 1: select f1 from c1; > ^ > - HINT: Perhaps you meant to reference the column "c1"."f2". > """ That's exactly 50%, not more than 50%. (I'm also on the fence about whether the hint is actually helpful in that case, but the rule I proposed wouldn't prohibit it.) > And: > > """ > ERROR: column atts.relid does not exist > LINE 1: select atts.relid::regclass, s.* from pg_stats s join > ^ > - HINT: Perhaps you meant to reference the column "atts"."indexrelid". > """ > > Those are really useful suggestions! And, they're much more > representative of real user error. That one's right at 50% too, but it's certainly more than 3 characters of difference. I think it's going to be pretty hard to emit a suggestion in that case but not in a whole lot of cases that don't make any sense. > How about git as a kind of precedent? It is not at all conservative > about showing *some* suggestion: > > """ > $ git aa > git: 'aa' is not a git command. See 'git --help'. > > Did you mean this? > am > > $ git d > git: 'd' is not a git command. See 'git --help'. > > Did you mean one of these? > diff > add > > $ git ddd > git: 'ddd' is not a git command. See 'git --help'. > > Did you mean this? > add > """ > > And why wouldn't git be? As far as its concerned, you can only have > meant one of those small number of things. Similarly, with the patch, > the number of things we can pick from is fairly limited at this stage, > since we are actually fairly far along with parse analysis. I went and found the actual code git uses for this. It's here: https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c And the underlying Levenshtein implementation is here: https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c Apparently what they're doing is charging 0 for a transposition (which we don't have as a separate concept), 2 for a substitution, 1 for an insertion, and 3 for a deletion, with the constraint that anything with a total distance of more than 6 isn't considered. And that does overall seem to give pretty good suggestions. However, an interesting point about the git algorithm is that it's not hard to make it do stupid things on short strings: [rhaas pgsql]$ git xy git: 'xy' is not a git command. See 'git --help'. Did you mean one of these? am gc mv rm Maybe they should adopt my idea of a lower cutoff for short strings. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Apparently what they're doing is charging 0 for a transposition (which > we don't have as a separate concept), 2 for a substitution, 1 for an > insertion, and 3 for a deletion, with the constraint that anything > with a total distance of more than 6 isn't considered. And that does > overall seem to give pretty good suggestions. The git people know that no git command is longer than 4 or 5 characters. That doesn't apply to us. I certainly would not like to have an absolute distance test of n characters. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > That's precisely the time I think it's *most* important. In a very > long string, the threshold should be LESS than 50%. My original > proposal was "no more than 3 characters of difference, but in any > event not more than half the length of the shorter string". We can only hint based on the information given by the user. If they give a lot of badly matching information, we have something to go on. > That one's right at 50% too, but it's certainly more than 3 characters > of difference. I think it's going to be pretty hard to emit a > suggestion in that case but not in a whole lot of cases that don't > make any sense. I don't think that's the case. Other RTEs are penalized for having non-matching aliases here. In general, I think the cost of a bad suggestion is much lower than the benefit of a good one. You seem to be suggesting that they're equal. Or that they're equally likely in an organic situation. In my estimation, this is not the case at all. I'm curious about your thoughts on the compromise of a ramped up distance threshold to apply a test for the absolute quality of a match. I think that the fact that git gives bad suggestions with terse strings tells us a lot, though. Note that unlike git, with terse strings we may well have a good deal more equidistant matches, and as soon as the number of would-be matches exceeds 2, we actually give no matches at all. So that's an additional protection against poor matches with terse strings. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Robert Haas wrote: > And the underlying Levenshtein implementation is here: > > https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c > > Apparently what they're doing is charging 0 for a transposition (which > we don't have as a separate concept), 2 for a substitution, 1 for an > insertion, and 3 for a deletion, with the constraint that anything > with a total distance of more than 6 isn't considered. 0 for a transposition, wow. I suggested adding transpositions but there was no support for that idea. I suggested it because I thikn it's the most common form of typo, and charging 2 for a deletion plus 1 for an insertion makes a single transposition mistaek count as 3, which seems wrong -- particularly seeing the git precedent. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > 0 for a transposition, wow. Again, they're optimizing for short strings (git commands) only. There just isn't that many transposition errors possible with a 4 character string. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > 0 for a transposition, wow. > > Again, they're optimizing for short strings (git commands) only. There > just isn't that many transposition errors possible with a 4 character > string. If there's logic in your statement, I can't see it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Peter Geoghegan wrote: >> On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >> > 0 for a transposition, wow. >> >> Again, they're optimizing for short strings (git commands) only. There >> just isn't that many transposition errors possible with a 4 character >> string. > > If there's logic in your statement, I can't see it. The point is that transposition errors should not have no cost. If git did not have an absolute quality test of a distance of 6, which they can only have because all git commands are terse, then you could construct a counter example. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Alvaro Herrera
Дата:
Peter Geoghegan wrote: > On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Peter Geoghegan wrote: > >> On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera > >> <alvherre@2ndquadrant.com> wrote: > >> > 0 for a transposition, wow. > >> > >> Again, they're optimizing for short strings (git commands) only. There > >> just isn't that many transposition errors possible with a 4 character > >> string. > > > > If there's logic in your statement, I can't see it. > > The point is that transposition errors should not have no cost. If git > did not have an absolute quality test of a distance of 6, which they > can only have because all git commands are terse, then you could > construct a counter example. Okay. My point is just that whatever the string length, I think we'd do well to regard transpositions as "cheap" in terms of error cost. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 19, 2014 at 2:33 PM, Peter Geoghegan <pg@heroku.com> wrote: > I don't think that's the case. Other RTEs are penalized for having > non-matching aliases here. The point I made did not, as far as I can see, have anything to do with non-matching aliases; it could arise with just a single RTE. > In general, I think the cost of a bad suggestion is much lower than > the benefit of a good one. You seem to be suggesting that they're > equal. Or that they're equally likely in an organic situation. In my > estimation, this is not the case at all. The way I see it, the main cost of a bad suggestion is that it annoys the user with clutter which they may brand as "stupid". Think about how much vitriol has been spewed over the years against progress bars (or estimated completion) times that don't turn out to mirror reality. Microsoft has gotten more cumulative flack about their inaccurate progress bars over the years than they would have for dropping an elevator on a cute baby. Or think about how annoying it is when a spell-checker or grammar-checker underlines something you've written that is, in your own opinion, correctly spelled or grammatical. Maybe that kind of thing doesn't annoy you, but it definitely annoys me, and I think probably a lot of other people. My general experience is that people get quite pissed off by bad suggestions from a computer. At least in my experience, users' actual level of agitation is often all out of proportion to what might seem justified, but we are designing this software for actual users, so their likely emotional reactions are relevant. > I'm curious about your thoughts on the compromise of a ramped up > distance threshold to apply a test for the absolute quality of a > match. I think that the fact that git gives bad suggestions with terse > strings tells us a lot, though. Note that unlike git, with terse > strings we may well have a good deal more equidistant matches, and as > soon as the number of would-be matches exceeds 2, we actually give no > matches at all. So that's an additional protection against poor > matches with terse strings. I don't know what you mean by a ramped-up distance threshold, exactly. I think it's good for the distance threshold to be lower for small strings and higher for large ones. I think I'm somewhat open to negotiation on the details, but I think any system that's going to suggest "quantity" for "tit" is going too far. If the user types "qty" when they meant "quantity", they probably don't really need the hint, because they're going to say to themselves "wait, I guess I didn't abbreviate that". The time when they need the hint is when they typed "quanttiy", because it's quite possible to read a query with that sort of typo multiple times and not realize that you've made one. You're sitting there puzzling over where the quantity column went, and asking yourselves how you can be mis-remembering the schema, and saying "wait, didn't I just see that column in the \d output" ... and you don't even think to check carefully for a spelling mistake. The hint may well clue you in to what the real problem is. In other words, I think there's value in trying to clue somebody in when they've made a typo, but not when they've made a think-o. We won't be able to do the latter accurately enough to make it more useful than annoying. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... In other words, I think there's value in trying to clue somebody in > when they've made a typo, but not when they've made a think-o. We > won't be able to do the latter accurately enough to make it more > useful than annoying. FWIW, I concur with Robert's analysis that wrong suggestions are likely to be annoying. We should be erring on the side of not making a suggestion rather than making one that's a low-probability guess. I'm not particularly convinced that the "f1" -> "f2" example is a useful behavior, and I'm downright horrified by the "qty" -> "quantity" case. If the hint mechanism thinks the latter two are close enough together to suggest, it's going to be spewing a whole lot of utterly ridiculous suggestions. I'm going to be annoyed way more times than I'm going to be helped. The big picture is that this is more or less our first venture into heuristic suggestions. I think we should start slow with a very conservative set of heuristics. If it's a success maybe we can get more aggressive over time --- but if we go over the top here, the entire concept will be discredited in this community for the next ten years. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 20, 2014 at 8:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not particularly convinced that the "f1" -> "f2" example is a useful > behavior, and I'm downright horrified by the "qty" -> "quantity" case. > If the hint mechanism thinks the latter two are close enough together > to suggest, it's going to be spewing a whole lot of utterly ridiculous > suggestions. I'm going to be annoyed way more times than I'm going to > be helped. I happen to think that that isn't the case, because the number of possible suggestions is fairly low anyway, and people don't tend to make those kind of errors. Robert's examples of "ridiculous" suggestions of "quantity" based on three letter strings other than "qty" (e.g. "tit") were rather contrived. In fact, most 3 letter strings will not offer a suggestion. 3 or more Equidistant would-be matches tend to offer a lot of additional protection against bad suggestions for these terse strings. > The big picture is that this is more or less our first venture into > heuristic suggestions. I think we should start slow with a very > conservative set of heuristics. If it's a success maybe we can get more > aggressive over time --- but if we go over the top here, the entire > concept will be discredited in this community for the next ten years. I certainly see your point here. It's not as if we have an *evolved* understanding of the usability issues. Besides, as Robert pointed out, most of the value of this patch is added by simple cases, like a failure to pluralize or not pluralize, or the omission of an underscore. I still think we should charge half for deletion, but I will concede that it's prudent to apply a more restrictive absolute quality final test. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 20, 2014 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> In general, I think the cost of a bad suggestion is much lower than >> the benefit of a good one. You seem to be suggesting that they're >> equal. Or that they're equally likely in an organic situation. In my >> estimation, this is not the case at all. > > The way I see it, the main cost of a bad suggestion is that it annoys > the user with clutter which they may brand as "stupid". Think about > how much vitriol has been spewed over the years against progress bars > (or estimated completion) times that don't turn out to mirror reality. Well, you can judge the quality of the suggestion immediately. I imagined a mechanism that gives a little bit more than the minimum amount of guidance for things like contractions/abbreviations. > Microsoft has gotten more cumulative flack about their inaccurate > progress bars over the years than they would have for dropping an > elevator on a cute baby. I haven't used a more recent version of Windows than Windows Vista, but I'm pretty sure that they kept it up. >> I'm curious about your thoughts on the compromise of a ramped up >> distance threshold to apply a test for the absolute quality of a >> match. I think that the fact that git gives bad suggestions with terse >> strings tells us a lot, though. Note that unlike git, with terse >> strings we may well have a good deal more equidistant matches, and as >> soon as the number of would-be matches exceeds 2, we actually give no >> matches at all. So that's an additional protection against poor >> matches with terse strings. > > I don't know what you mean by a ramped-up distance threshold, exactly. > I think it's good for the distance threshold to be lower for small > strings and higher for large ones. I think I'm somewhat open to > negotiation on the details, but I think any system that's going to > suggest "quantity" for "tit" is going too far. I mean the suggestion of raising the cost threshold more gradually, not as a step function of the number of characters in the string [1] where it's either over 6 characters and must pass the 50% test, or isn't and has no absolute quality test. The exact modification I described will FWIW remove the "quantity" for "qty" suggestion, as well as all the similar suggestions that you found objectionable (like "tit" also offering a suggestion of "quantity"). If you look at the regression tests, none of the sensible suggestions are lost (some would be by an across the board 50% absolute quality threshold, as I previously pointed out [2]), but all the bad ones are. I attach failed regression test output showing the difference between the previous expected values, and actual values with that small modification - it looks like most or all bad cases are now fixed. > If the user types > "qty" when they meant "quantity", they probably don't really need the > hint, because they're going to say to themselves "wait, I guess I > didn't abbreviate that". The time when they need the hint is when > they typed "quanttiy", because it's quite possible to read a query > with that sort of typo multiple times and not realize that you've made > one. I agree that that's a more important case. > In other words, I think there's value in trying to clue somebody in > when they've made a typo, but not when they've made a think-o. We > won't be able to do the latter accurately enough to make it more > useful than annoying. That's certainly true; I think that we only disagree about the exact point at which we enter the think-o correction business. [1] http://www.postgresql.org/message-id/CAM3SWZT+7hH29Go6ZuY2OrCS40=6yPVM_nt9NjfovP3XwjixDw@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM3SWZTSGokNhT8rK+0Eed7spNJg4pAdMbqqYi0FH9bWcNvTGA@mail.gmail.com -- Peter Geoghegan
Вложения
On Thu, Nov 20, 2014 at 1:30 PM, Peter Geoghegan <pg@heroku.com> wrote: > I mean the suggestion of raising the cost threshold more gradually, > not as a step function of the number of characters in the string [1] > where it's either over 6 characters and must pass the 50% test, or > isn't and has no absolute quality test. The exact modification I > described will FWIW remove the "quantity" for "qty" suggestion, as > well as all the similar suggestions that you found objectionable (like > "tit" also offering a suggestion of "quantity"). > > If you look at the regression tests, none of the sensible suggestions > are lost (some would be by an across the board 50% absolute quality > threshold, as I previously pointed out [2]), but all the bad ones are. > I attach failed regression test output showing the difference between > the previous expected values, and actual values with that small > modification - it looks like most or all bad cases are now fixed. That does seem to give better results, but it still seems awfully complicated. If we just used Levenshtein with all-default cost factors and a distance cap equal to Max(strlen(what_user_typed), strlen(candidate_match), 3), what cases that you think are important would be harmed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: > That does seem to give better results, but it still seems awfully > complicated. If we just used Levenshtein with all-default cost > factors and a distance cap equal to Max(strlen(what_user_typed), > strlen(candidate_match), 3), what cases that you think are important > would be harmed? Well, just by plugging in default Levenshtein cost factors, I can see the following regression: *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 10:17:55.042291912 -0800 --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 11:42:15.670108745 -0800 *************** *** 3452,3458 **** ERROR: column atts.relid does not exist LINE 1: select atts.relid::regclass, s.* from pg_stats s join ^ - HINT: Perhaps you meant to reference the column "atts"."indexrelid". Within the catalogs, the names of attributes are prefixed as a form of what you might call internal namespacing. For example, pg_index has attributes that all begin with "ind*". You could easily omit something like that, while still more or less knowing what you're looking for. In more concrete terms, this gets no suggestion: postgres=# select key from pg_index; ERROR: 42703: column "key" does not exist LINE 1: select key from pg_index; ^ Only this does: postgres=# select ikey from pg_index; ERROR: 42703: column "ikey" does not exist LINE 1: select ikey from pg_index; ^ HINT: Perhaps you meant to reference the column "pg_index"."indkey". postgres=# The git people varied their Levenshtein costings for a reason. I also think that a one size fits all cap will break things. It will independently break the example above, as well as the more marginal "c1"."f2". vs "c1"."f2" case (okay, maybe that case was exactly on the threshold, but others won't be). I don't see that different costings actually saves any complexity. Similarly, the final cap is quite straightforward. Anything with any real complexity happens before that. -- Peter Geoghegan
On Thu, Nov 20, 2014 at 3:00 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> That does seem to give better results, but it still seems awfully >> complicated. If we just used Levenshtein with all-default cost >> factors and a distance cap equal to Max(strlen(what_user_typed), >> strlen(candidate_match), 3), what cases that you think are important >> would be harmed? > > Well, just by plugging in default Levenshtein cost factors, I can see > the following regression: > > *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 > 10:17:55.042291912 -0800 > --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 > 11:42:15.670108745 -0800 > *************** > *** 3452,3458 **** > ERROR: column atts.relid does not exist > LINE 1: select atts.relid::regclass, s.* from pg_stats s join > ^ > - HINT: Perhaps you meant to reference the column "atts"."indexrelid". > > Within the catalogs, the names of attributes are prefixed as a form of > what you might call internal namespacing. For example, pg_index has > attributes that all begin with "ind*". You could easily omit something > like that, while still more or less knowing what you're looking for. > > In more concrete terms, this gets no suggestion: > > postgres=# select key from pg_index; > ERROR: 42703: column "key" does not exist > LINE 1: select key from pg_index; > ^ > > Only this does: > > postgres=# select ikey from pg_index; > ERROR: 42703: column "ikey" does not exist > LINE 1: select ikey from pg_index; > ^ > HINT: Perhaps you meant to reference the column "pg_index"."indkey". > postgres=# Seems fine to me. If you typed relid rather than indexrelid or key rather than indkey, that's a thinko, not a typo. ikey for indkey could plausible be a typo, though you'd have to be having a fairly bad day at the keyboard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Andres Freund
Дата:
On 2014-11-20 12:00:51 -0800, Peter Geoghegan wrote: > In more concrete terms, this gets no suggestion: > > postgres=# select key from pg_index; > ERROR: 42703: column "key" does not exist > LINE 1: select key from pg_index; > ^ I don't think that's a bad thing. Yes, for a human those look pretty similar, but it's easy to construct cases where that gives completely hilarious results. I think something simplistic like levenshtein, even with modified distances, is good to catch typos. But not to find terms that are related in more complex ways. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Seems fine to me. If you typed relid rather than indexrelid or key > rather than indkey, that's a thinko, not a typo. ikey for indkey > could plausible be a typo, though you'd have to be having a fairly bad > day at the keyboard. I can tell that I have no chance of convincing you otherwise. While I think you're mistaken to go against the precedent set by git, you're the one with the commit bit, and I think we've already spent enough time discussing this. So default costings it is. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> That does seem to give better results, but it still seems awfully >> complicated. If we just used Levenshtein with all-default cost >> factors and a distance cap equal to Max(strlen(what_user_typed), >> strlen(candidate_match), 3), what cases that you think are important >> would be harmed? > Well, just by plugging in default Levenshtein cost factors, I can see > the following regression: > *** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 > 10:17:55.042291912 -0800 > --- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 > 11:42:15.670108745 -0800 > *************** > *** 3452,3458 **** > ERROR: column atts.relid does not exist > LINE 1: select atts.relid::regclass, s.* from pg_stats s join > ^ > - HINT: Perhaps you meant to reference the column "atts"."indexrelid". I do not have a problem with deciding that that is not a "regression"; in fact, not giving that hint seems like a good conservative behavior here. By your logic, we should also be prepared to suggest "supercalifragilisticexpialidocious" when the user enters "ocious". It's simply a bridge too far for what is supposed to be a hint for minor typos. You sound like you want to turn it into something that will look up column names for people who are too lazy to even try to type the right thing. While I can see the value of such a tool within certain contexts, firing completed queries at a live SQL engine is not one of them. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 20, 2014 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I do not have a problem with deciding that that is not a "regression"; > in fact, not giving that hint seems like a good conservative behavior > here. By your logic, we should also be prepared to suggest > "supercalifragilisticexpialidocious" when the user enters "ocious". That's clearly not true. I just want to be a bit more forgiving of omissions. That clearly isn't my logic, since that isn't a suggestion that the implementation will give, or would be anywhere close to giving - my weighing of deletions is only twice that of substitutions or insertions, not ten times. git does not use Levenshtein default costings either. > It's simply a bridge too far for what is supposed to be a hint for > minor typos. Minor typos and minor omissions. My example was on the edge of what would be tolerable under my proposed cost model. > You sound like you want to turn it into something that > will look up column names for people who are too lazy to even try to > type the right thing. While I can see the value of such a tool within > certain contexts, firing completed queries at a live SQL engine > is not one of them. It's just a hint; a convenience. Users who imagine that it takes away the need for putting any thought into their SQL queries have bigger problems. Anyway, that's all that needs to be said on that, since I've already given up on a non-default costing. Also, we have default costing, and we always apply a 50% standard (I see no point in doing otherwise with default costings). Robert: Where does that leave us? What about suggestions across RTEs? Alias costing, etc? -- Peter Geoghegan
On Thu, Nov 20, 2014 at 3:20 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Seems fine to me. If you typed relid rather than indexrelid or key >> rather than indkey, that's a thinko, not a typo. ikey for indkey >> could plausible be a typo, though you'd have to be having a fairly bad >> day at the keyboard. > > I can tell that I have no chance of convincing you otherwise. While I > think you're mistaken to go against the precedent set by git, you're > the one with the commit bit, and I think we've already spent enough > time discussing this. So default costings it is. I've got a few +1s, too, if you notice. I'm willing to be outvoted, but not by a majority of one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Thu, Nov 20, 2014 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I've got a few +1s, too, if you notice. Then maybe I spoke too soon. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
David G Johnston
Дата:
Andres Freund-3 wrote > I think something simplistic like levenshtein, even with modified > distances, is good to catch typos. But not to find terms that are > related in more complex ways. Tom Lane-2 wrote > The big picture is that this is more or less our first venture into > heuristic suggestions. I think we should start slow with a very > conservative set of heuristics. If it's a success maybe we can get more > aggressive over time --- but if we go over the top here, the entire > concept will be discredited in this community for the next ten years. +1 for both of these conclusions. The observations regarding standard column prefixes and thinking that abbreviations are in use when in fact the names are spelled out are indeed in-the-wild behaviors that should be considered but a levenshtein distance algorithm is likely not going to be useful in pointing out mistakes in those situations. Limiting the immediate focus to "fat/thin-fingering of keys" - for which levenshtein is well suited - is useful and will provide data points that can then guide future artificial intelligence endeavors. David J. -- View this message in context: http://postgresql.nabble.com/Doing-better-at-HINTing-an-appropriate-column-within-errorMissingColumn-tp5797700p5827786.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
Alright, so let me summarize what I think are the next steps in working towards getting this patch committed. I should produce a new revision which: * Uses default costings. * Applies a generic final quality check that enforces a distance of no greater than 50% of the total string size. (The use of default costings removes any reason to continue to do this) * Work through Robert's suggestions on other aspects that need work [1], most of which I already agreed to. What is unclear is whether or not I should continue to charge extra for non-matching user supplied alias (and, I think more broadly, consider multiple RTEs iff the user did use an alias) - Robert was skeptical, but didn't seem to have made his mind up. I still think I should cost things based on aliases, and consider multiple RTEs even when the user supplied an alias (the penalty should just be a distance of 1 and not 3, though, in light of other changes to the weighing/costing). If I don't hear anything in the next day or two, I'll more or less preserve aliases-related aspects of the patch. Did I miss something else? [1] http://www.postgresql.org/message-id/CA+TgmoZLwzgyv=JAYfi6XfAK8OcBuTPYYhP5TbOqsS=YWVvzUw@mail.gmail.com -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Nov 25, 2014 at 4:13 PM, Peter Geoghegan <pg@heroku.com> wrote: > If I don't hear anything in the next day or two, > I'll more or less preserve aliases-related aspects of the patch. FYI, I didn't go ahead and work on this, because I thought that the thanksgiving holiday in the US probably kept you from giving feedback. -- Peter Geoghegan
On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan <pg@heroku.com> wrote: > Alright, so let me summarize what I think are the next steps in > working towards getting this patch committed. I should produce a new > revision which: > > * Uses default costings. > > * Applies a generic final quality check that enforces a distance of no > greater than 50% of the total string size. (The use of default > costings removes any reason to continue to do this) > > * Work through Robert's suggestions on other aspects that need work > [1], most of which I already agreed to. Sounds good so far. > What is unclear is whether or not I should continue to charge extra > for non-matching user supplied alias (and, I think more broadly, > consider multiple RTEs iff the user did use an alias) - Robert was > skeptical, but didn't seem to have made his mind up. I still think I > should cost things based on aliases, and consider multiple RTEs even > when the user supplied an alias (the penalty should just be a distance > of 1 and not 3, though, in light of other changes to the > weighing/costing). If I don't hear anything in the next day or two, > I'll more or less preserve aliases-related aspects of the patch. Basically, the case in which I think it's helpful to issue a suggestion here is when the user has used the table name rather than the alias name. I wonder if it's worth checking for that case specifically, in lieu of what you've done here, and issuing a totally different hint in that case ("HINT: You must refer to this as column as "prime_minister.id" rather than "cameron.id"). Another idea, which I think I like less well, is to check the Levenshtein distance between the allowed alias and the entered alias and, if that's within the half-the-shorter-length threshold, consider possible matches from that RTE, charge the distance between the correct alias and the entered alias as a penalty to each potential column match. What I think won't do is to look at a situation where the user has entered automobile.id and suggest that maybe they meant student.iq, or even student.id. The amount of difference between the names has got to matter for the RTE names, just as it does for the column names. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Basically, the case in which I think it's helpful to issue a > suggestion here is when the user has used the table name rather than > the alias name. I wonder if it's worth checking for that case > specifically, in lieu of what you've done here, and issuing a totally > different hint in that case ("HINT: You must refer to this as column > as "prime_minister.id" rather than "cameron.id"). Well, if an alias is used, and you refer to an attribute using a non-alias name (i.e. the original table name), then you'll already get an error suggesting that the alias be used instead -- of course, that's nothing new. It doesn't matter to the existing hinting mechanism if the attribute name is otherwise wrong. Once you fix the code to use the alias suggested, you'll then get this new Levenshtein-based hint. > Another idea, which I think I like less well, is to check the > Levenshtein distance between the allowed alias and the entered alias > and, if that's within the half-the-shorter-length threshold, consider > possible matches from that RTE, charge the distance between the > correct alias and the entered alias as a penalty to each potential > column match. I don't about that either. Aliases are often totally arbitrary, particularly for ad-hoc queries, which is what this is aimed at. > What I think won't do is to look at a situation where the user has > entered automobile.id and suggest that maybe they meant student.iq, or > even student.id. I'm not sure I follow. If there is an automobile.ip, then it will be suggested. If there is no automobile column that's much of a match (so no "automobile.ip", say), then student.id will be suggested (and not student.iq, *even if there is no student.id* - the final quality check saves us). So this is possible: postgres=# select iq, * from student, automobile; ERROR: 42703: column "iq" does not exist LINE 1: select iq, * from student, automobile; ^ HINT: Perhaps you meant to reference the column "student"."id". postgres=# select automobile.iq, * from student, automobile; ERROR: 42703: column automobile.iq does not exist LINE 1: select automobile.iq, * from student, automobile; ^ (note that using the table name makes us *not* see a suggestion where we otherwise would). The point is that there is a fixed penalty for a wrong user-specified alias, but all relation RTEs are considered. > The amount of difference between the names has got to > matter for the RTE names, just as it does for the column names. I think it makes sense that it matters by a fixed amount. Besides, this seems complicated enough already - I don't won't to add more complexity to worry about equidistant (but still actually valid) RTE/table/alias names. It sounds like your concern here is mostly a concern about the relative distance among multiple matches, as opposed to the absolute quality of suggestions. The former seems a lot less controversial than the latter was, though - the user always gets the best match, or the join pair of best matches, or no match when this new hinting mechanism is involved. I attach a new revision. The revision: * Uses default costs for Levenshtein distance. * Still charges extra for a non-alias-matching match (although it only charges a fixed distance of 1 extra). This has regression test coverage. * Applies a generic final quality check that enforces a requirement that a hint have a distance of no greater than 50% of the total string size. No special treatment of shorter strings is involved anymore. * Moves almost everything out of scanRTEForColumn() as you outlined (into a new function, updateFuzzyAttrMatchState(), per your suggestion). * Moves dropped column detection into updateFuzzyAttrMatchState(), per your suggestion. * Still does the "if (rte->rtekind == RTE_JOIN)" thing in the existing function searchRangeTableForCol(). I am quite confident that a suggestion from a join RTE will never be useful, to either the existing use of searchRangeTableForCol() or this expanded use, and it makes more sense to me to put it there. In fact, the existing use of searchRangeTableForCol() is really rather similar to this, and will give up on the first identical match (which is taken as evidence that there is a attribute of that name, but isn't visible at this level of the query). So I have not followed your suggestion here. Thoughts? -- Peter Geoghegan
Вложения
On Wed, Dec 3, 2014 at 9:21 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Basically, the case in which I think it's helpful to issue a >> suggestion here is when the user has used the table name rather than >> the alias name. I wonder if it's worth checking for that case >> specifically, in lieu of what you've done here, and issuing a totally >> different hint in that case ("HINT: You must refer to this as column >> as "prime_minister.id" rather than "cameron.id"). > > Well, if an alias is used, and you refer to an attribute using a > non-alias name (i.e. the original table name), then you'll already get > an error suggesting that the alias be used instead -- of course, > that's nothing new. It doesn't matter to the existing hinting > mechanism if the attribute name is otherwise wrong. Once you fix the > code to use the alias suggested, you'll then get this new > Levenshtein-based hint. In that case, I think I favor giving no hint at all when the RTE name is specified but doesn't match exactly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, if an alias is used, and you refer to an attribute using a >> non-alias name (i.e. the original table name), then you'll already get >> an error suggesting that the alias be used instead -- of course, >> that's nothing new. It doesn't matter to the existing hinting >> mechanism if the attribute name is otherwise wrong. Once you fix the >> code to use the alias suggested, you'll then get this new >> Levenshtein-based hint. > > In that case, I think I favor giving no hint at all when the RTE name > is specified but doesn't match exactly. I don't follow. The existing mechanism only concerns what to do when the original table name was used when an alias should have been used instead. What does that have to do with this patch? -- Peter Geoghegan
On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Well, if an alias is used, and you refer to an attribute using a >>> non-alias name (i.e. the original table name), then you'll already get >>> an error suggesting that the alias be used instead -- of course, >>> that's nothing new. It doesn't matter to the existing hinting >>> mechanism if the attribute name is otherwise wrong. Once you fix the >>> code to use the alias suggested, you'll then get this new >>> Levenshtein-based hint. >> >> In that case, I think I favor giving no hint at all when the RTE name >> is specified but doesn't match exactly. > > I don't follow. The existing mechanism only concerns what to do when > the original table name was used when an alias should have been used > instead. What does that have to do with this patch? Just that that's the case in which it seems useful to give a hint. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Just that that's the case in which it seems useful to give a hint. I think it's very possible that the wrong alias may be provided by the user, and that we should consider that when providing a hint. Besides, considering every visible RTE (while penalizing non-exact alias names iff the user provided an alias name) is actually going to make bad hints less likely, by increasing the number of equidistant low quality matches in a way that swamps the mechanism into providing no actual match at all. That's an important additional protection against low quality matches. What do other people think here? -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote: > I think it's very possible that the wrong alias may be provided by the > user, and that we should consider that when providing a hint. Note that the existing mechanism (the mechanism that I'm trying to improve) only ever shows this error message: "There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query." I think it's pretty clear that this general class of user error is common. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote: >> I think it's very possible that the wrong alias may be provided by the >> user, and that we should consider that when providing a hint. > > Note that the existing mechanism (the mechanism that I'm trying to > improve) only ever shows this error message: > > "There is a column named \"%s\" in table \"%s\", but it cannot be > referenced from this part of the query." > > I think it's pretty clear that this general class of user error is common. Moving this patch to CF 2014-12 as work is still going on, note that it is currently marked with Robert as reviewer and that its current status is "Needs review". -- Michael
On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan <pg@heroku.com> wrote: >> On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote: >>> I think it's very possible that the wrong alias may be provided by the >>> user, and that we should consider that when providing a hint. >> >> Note that the existing mechanism (the mechanism that I'm trying to >> improve) only ever shows this error message: >> >> "There is a column named \"%s\" in table \"%s\", but it cannot be >> referenced from this part of the query." >> >> I think it's pretty clear that this general class of user error is common. > Moving this patch to CF 2014-12 as work is still going on, note that > it is currently marked with Robert as reviewer and that its current > status is "Needs review". The status here is more like "waiting around to see if anyone else has an opinion". The issue is what should happen when you enter qualified name like alvaro.herrera and there is no column named anything like herrara in the RTE named alvaro, but there is some OTHER RTE that contains a column with a name that is only a small Levenshtein distance away from herrera, like roberto.correra. The questions are: 1. Should we EVER give a you-might-have-meant hint in a case like this? 2. If so, does it matter whether the RTE name is just a bit different from the actual RTE or whether it's completely different? In other words, might we skip the hint in the above case but give one for alvara.correra? My current feeling is that we should answer #1 "no", but Peter prefers to answer it "yes". My further feeling is that if we do decide to say "yes" to #1, then I would answer #2 as "yes" also, but Peter would answer it "no", assigning a fixed penalty for a mismatched RTE rather than one that varies by the Levenshtein distance between the RTEs. If no one else expresses an opinion, I'm going to insist on doing it my way, but I'm happy to have other people weigh in. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> Moving this patch to CF 2014-12 as work is still going on, note that >> it is currently marked with Robert as reviewer and that its current >> status is "Needs review". > The status here is more like "waiting around to see if anyone else has > an opinion". The issue is what should happen when you enter qualified > name like alvaro.herrera and there is no column named anything like > herrara in the RTE named alvaro, but there is some OTHER RTE that > contains a column with a name that is only a small Levenshtein > distance away from herrera, like roberto.correra. The questions are: > 1. Should we EVER give a you-might-have-meant hint in a case like this? > 2. If so, does it matter whether the RTE name is just a bit different > from the actual RTE or whether it's completely different? In other > words, might we skip the hint in the above case but give one for > alvara.correra? It would be astonishingly silly to not care about the RTE name's distance, if you ask me. This is supposed to detect typos, not thinkos. I think there might be some value in a separate heuristic that, when you typed foo.bar and that doesn't match but there is a baz.bar, suggests that maybe you meant baz.bar, even if baz is not close typo-wise. This would be addressing the thinko case not the typo case, so the rules ought to be quite different --- in particular I doubt that it'd be a good idea to hint this way if the column names don't match exactly. But in any case the key point is that this is a different heuristic addressing a different failure mode. We should not try to make the levenshtein-distance heuristic address that case. So my two cents is that when considering a qualified name, this patch should take levenshtein distance across the two components equally. There's no good reason to suppose that typos will attack one name component more (nor less) than the other. regards, tom lane
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Stephen Frost
Дата:
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > So my two cents is that when considering a qualified name, this patch > should take levenshtein distance across the two components equally. > There's no good reason to suppose that typos will attack one name > component more (nor less) than the other. Agreed (since it seems like folks are curious for the opinion's of mostly bystanders). +1 to the above for my part. Thanks, Stephen
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Dec 16, 2014 at 12:18 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> So my two cents is that when considering a qualified name, this patch >> should take levenshtein distance across the two components equally. >> There's no good reason to suppose that typos will attack one name >> component more (nor less) than the other. > > Agreed (since it seems like folks are curious for the opinion's of > mostly bystanders). > > +1 to the above for my part. Okay, then. Attached patch implements this scheme. It is identical to the previous revision, except that iff there was an alias specified and that alias does not match the correct name (alias/table name) of the RTE currently under consideration, we charge the distance between the differing aliases rather than a fixed distance of 1. -- Peter Geoghegan
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote: > Attached patch implements this scheme. I had another thought: "NAMEDATALEN + 1" is a better representation of "infinity" for matching purposes than INT_MAX. I probably should have made that change, too. It would then not have been necessary to "#include <limits.h>". I think that this is a useful belt-and-suspenders precaution against integer overflow. It almost certainly won't matter, since it's very unlikely that the best match within an RTE will end up being a dropped column, but we might as well do it that way (Levenshtein distance is costed in multiples of code point changes, but the maximum density is 1 byte per codepoint). -- Peter Geoghegan
On Sat, Dec 20, 2014 at 7:30 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Attached patch implements this scheme. > > I had another thought: "NAMEDATALEN + 1" is a better representation of > "infinity" for matching purposes than INT_MAX. I probably should have > made that change, too. It would then not have been necessary to > "#include <limits.h>". I think that this is a useful > belt-and-suspenders precaution against integer overflow. It almost > certainly won't matter, since it's very unlikely that the best match > within an RTE will end up being a dropped column, but we might as well > do it that way (Levenshtein distance is costed in multiples of code > point changes, but the maximum density is 1 byte per codepoint). Good idea. Looking over the latest patch, I think we could simplify the code so that you don't need multiple FuzzyAttrMatchState objects. Instead of creating a separate one for each RTE and then merging them, just have one. When you find an inexact-RTE name match, set a field inside the FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the Levenshtein distance between the RTEs. Then call scanRTEForColumn() and pass down the same state object. Now let updateFuzzyAttrMatchState() work out what it needs to do based on the observed inter-column distance and the currently-in-force RTE penalty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Looking over the latest patch, I think we could simplify the code so > that you don't need multiple FuzzyAttrMatchState objects. Instead of > creating a separate one for each RTE and then merging them, just have > one. When you find an inexact-RTE name match, set a field inside the > FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the > Levenshtein distance between the RTEs. Then call scanRTEForColumn() > and pass down the same state object. Now let > updateFuzzyAttrMatchState() work out what it needs to do based on the > observed inter-column distance and the currently-in-force RTE penalty. I'm afraid I don't follow. I think doing things that way makes things less clear. Merging is useful because it allows us to consider that an exact match might exist, which this searchRangeTableForCol() is already tasked with today. We now look for the best match exhaustively, or magically return immediately in the event of an exact match, without caring about the alias correctness or distance. Having a separate object makes this pattern apparent from the top level, within searchRangeTableForCol(). I feel that's better. updateFuzzyAttrMatchState() is the wrong place to put that, because that task rightfully belongs in searchRangeTableForCol(), where the high level diagnostic-report-generating control flow lives. To put it another way, creating a separate object obfuscates scanRTEForColumn(), since it's the only client of updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important function, and right now I am only making it slightly less clear by tasking it with caring about distance of names on top of strict binary equality of attribute names. I don't want to push it any further. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan <pg@heroku.com> wrote: > To put it another way, creating a separate object obfuscates > scanRTEForColumn(), since it's the only client of > updateFuzzyAttrMatchState(). Excuse me. I mean *not* creating a separate object -- having a unified state representation for the entire range-table, rather than having one per RTE and merging them one by one into an overall/final range table object. -- Peter Geoghegan
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Michael Paquier
Дата:
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Tue, Dec 23, 2014 at 9:43 AM, Peter Geoghegan<span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnMon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan <<a href="mailto:pg@heroku.com">pg@heroku.com</a>> wrote:<br/> > To put it another way, creating a separate object obfuscates<br /> > scanRTEForColumn(), since it's theonly client of<br /> > updateFuzzyAttrMatchState().<br /><br /><br /></span>Excuse me. I mean *not* creating a separateobject -- having a unified<br /> state representation for the entire range-table, rather than having<br /> one perRTE and merging them one by one into an overall/final range<br /> table object.<span class="HOEnZb"></span><br /></blockquote></div><brclear="all" /></div><div class="gmail_extra">Patch moved to CF 2015-02.<br /></div><div class="gmail_extra">--<br /><div class="gmail_signature">Michael<br /></div></div></div>
On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Looking over the latest patch, I think we could simplify the code so >> that you don't need multiple FuzzyAttrMatchState objects. Instead of >> creating a separate one for each RTE and then merging them, just have >> one. When you find an inexact-RTE name match, set a field inside the >> FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the >> Levenshtein distance between the RTEs. Then call scanRTEForColumn() >> and pass down the same state object. Now let >> updateFuzzyAttrMatchState() work out what it needs to do based on the >> observed inter-column distance and the currently-in-force RTE penalty. > > I'm afraid I don't follow. I think doing things that way makes things > less clear. Merging is useful because it allows us to consider that an > exact match might exist, which this searchRangeTableForCol() is > already tasked with today. We now look for the best match > exhaustively, or magically return immediately in the event of an exact > match, without caring about the alias correctness or distance. > > Having a separate object makes this pattern apparent from the top > level, within searchRangeTableForCol(). I feel that's better. > updateFuzzyAttrMatchState() is the wrong place to put that, because > that task rightfully belongs in searchRangeTableForCol(), where the > high level diagnostic-report-generating control flow lives. > > To put it another way, creating a separate object obfuscates > scanRTEForColumn(), since it's the only client of > updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important > function, and right now I am only making it slightly less clear by > tasking it with caring about distance of names on top of strict binary > equality of attribute names. I don't want to push it any further. I don't buy this. What you're essentially doing is using the FuzzyAttrMatchState object in two ways that are not entirely compatible with each other - updateFuzzyAttrMatchState doesn't set the RTE fields, so searchRangeTableForCol has to do it. So there's an unspoken contract that in some parts of the code, you can rely on those fields being set, and in others, you can't. That pretty much defeats the whole point of making the state its own object, AFAICS. Furthermore, you end up with two copies of the state-combining logic, one in FuzzyAttrMatchState and a second in searchRangeTableForCol. That's ugly and unnecessary. I decided to rework this patch myself today; my version is attached. I believe that this version is significantly easier to understand than yours, both as to the code and the comments. I put quite a bit of work into both. I also suspect it's more efficient, because it avoids computing the Levenshtein distances for column names when we already know that those column names can't possibly be sufficiently-good matches for us to care about the details; and when it does compute the Levenshtein distance it keeps the max-distance threshold as low as possible. That may not really matter much, but it can't hurt. More importantly, essentially all of the fuzzy-matching logic is now isolated in FuzzyAttrMatchState(); the volume of change in scanRTEForColumn is the same as in your version, but the volume of change in searchRangeTableForCol is quite a bit less, so the patch is smaller overall. I'm prepared to commit this version if nobody finds a problem with it. It passes the additional regression tests you wrote. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
Re: Doing better at HINTing an appropriate column within errorMissingColumn()
От
Peter Geoghegan
Дата:
On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm prepared to commit this version if nobody finds a problem with it. > It passes the additional regression tests you wrote. Looks good to me. Thanks. -- Peter Geoghegan
On Tue, Mar 10, 2015 at 4:03 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I'm prepared to commit this version if nobody finds a problem with it. >> It passes the additional regression tests you wrote. > > Looks good to me. Thanks. OK, committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company