Обсуждение: "_" in a serach pattern

Поиск
Список
Период
Сортировка

"_" in a serach pattern

От
Jessica Richard
Дата:
I have a huge table that has a char(80) name column (indexed).

The name pattern is like

A_B_C-D.123.456.pdf
A_B_C-D.123.333.doc.2

When I select name from table where name like 'A_B%', it was doing a table scan;

I guess the underscore "_" was treated like a special character, instead of a character "_".

When I use select name from table where name like 'A\\_B%', it was doing an index scan with my result returned very fast, but I got a warning about the \\:

WARNING:  nonstandard use of \\ in a string literal
LINE 1: ...elect name from table where file_name like 'A\\_B...
                                                             ^
HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.

By the way, the wild card works if it is at the end of the query, it does not seem to work if it is in the middle:

select name from table where name like 'A\\_B%'  -- fine;

select name from table were name like 'A\\_B%.pdf'; -- returns nothing.

questions:

1. How do I get rid of the nonstandard warning, but still using the index search?

2. How do I search with a wild card % in the middle of the pattern? Would varchar(80) or char(80) make a difference about the wild card search (% in the middle)?

Thanks!


Need a vacation? Get great deals to amazing places on Yahoo! Travel.

Re: "_" in a serach pattern

От
Michael Fuhr
Дата:
On Fri, Jul 20, 2007 at 09:06:53AM -0700, Jessica Richard wrote:
> I have a huge table that has a char(80) name column (indexed).

Do you really nead char(N) instead of varchar(N) or text?

> The name pattern is like
>
> A_B_C-D.123.456.pdf
> A_B_C-D.123.333.doc.2
>
> When I select name from table where name like 'A_B%', it was doing a
> table scan;

What does EXPLAIN ANALYZE show?  When you created the database,
what locale did you use?  What do "show lc_ctype" and "show
server_encoding" return?

> I guess the underscore "_" was treated like a special character,
> instead of a character "_".

Yes -- see the Pattern Matching section of the documentation:

http://www.postgresql.org/docs/8.2/interactive/functions-matching.html#FUNCTIONS-LIKE

> When I use select name from table where name like 'A\\_B%', it was doing
> an index scan with my result returned very fast,

What does EXPLAIN ANALYZE show for this query?

> but I got a warning about the \\:
>
> WARNING:  nonstandard use of \\ in a string literal
> LINE 1: ...elect name from table where file_name like 'A\\_B...
>                                                              ^
> HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.

As the hint suggests, you can avoid the warning by using E'A\\_B%'.

> By the way, the wild card works if it is at the end of the query, it
> does not seem to work if it is in the middle:
>
> select name from table where name like 'A\\_B%'  -- fine;
>
> select name from table were name like 'A\\_B%.pdf'; -- returns nothing.

This is probably due to the column type being char(N) instead of
varchar(N) or text:

test=> select 'abc'::char(3) like 'a%c';
 ?column?
----------
 t
(1 row)

test=> select 'abc'::char(4) like 'a%c';
 ?column?
----------
 f
(1 row)

test=> select 'abc'::char(4) like 'a%c%';
 ?column?
----------
 t
(1 row)

test=> select 'abc'::char(4) like 'a%c ';
 ?column?
----------
 t
(1 row)

I had thought that char(N)'s padding spaces were supposed to be
semantically insignificant but I don't know if that applies to
pattern matching.

> questions:
>
> 1. How do I get rid of the nonstandard warning, but still using the index search?

Do what the HINT suggests: use E'pattern' instead of 'pattern', or
use dollar quotes.

http://www.postgresql.org/docs/8.2/interactive/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING

If you're using a non-C locale then you'll need to create an index
using one of the pattern_ops operator classes to get pattern matches
to use an index.

http://www.postgresql.org/docs/8.2/interactive/indexes-opclass.html

> 2. How do I search with a wild card % in the middle of the pattern? Would
>  varchar(80) or char(80) make a difference about the wild card search
>  (% in the middle)?

See the above examples.  I'd recommend using varchar(N) or text
unless you have a good reason to use char(N).

--
Michael Fuhr

Re: "_" in a serach pattern

От
Peter Koczan
Дата:
Hi, Jessica,
> 1. How do I get rid of the nonstandard warning, but still using the
> index search?
You have two options.

- Turn off the warnings in the postgresql.conf file. Use this with
caution (or don't use it at all) as it does pose a potential threat for
SQL injections if other options aren't properly set. Read up at
http://www.postgresql.org/docs/8.2/static/runtime-config-compatible.html
for more detail.
- Use escape-string formatting. This is the best practice since it is
standards-conforming and more secure. You can do it, for instance, as
    select name from table where name like 'A!_B%' escape '!';
You can escape with most characters, and it's mostly a matter of
personal preference.
>
> 2. How do I search with a wild card % in the middle of the pattern?
> Would varchar(80) or char(80) make a difference about the wild card
> search (% in the middle)?
I think that postgres is seeing the trailing whitespace on the end of
the char type and not accounting for it in the search. In this case,
varchar or text types would make a difference since they don't do
whitespace padding (unless you force it in). You should remember that if
you plan on converting the data type, trim the extraneous whitespace
first. However, I would switch to varchar/text so it saves some space
and saves you these headaches, unless there's an absolute need for
fixed-length char fields. I use text almost exclusively for string data
since it's arbitrary-length, I almost never have to worry about overflow
or later administration.

If you don't want to convert data, you can use the rtrim() function
(i.e. "select rtrim(name) from ...").

Peter

> ------------------------------------------------------------------------
> Need a vacation? Get great deals to amazing places
>
<http://us.rd.yahoo.com/evt=48256/*http://travel.yahoo.com/;_ylc=X3oDMTFhN2hucjlpBF9TAzk3NDA3NTg5BHBvcwM1BHNlYwNncm91cHMEc2xrA2VtYWlsLW5jbQ-->on

> Yahoo! Travel.


Re: "_" in a serach pattern

От
Jessica Richard
Дата:
Peter,

thanks a lot for your reply.

Could you please tell me more....

What is the difference between varchar and text? what is the benefit of each one? and is text alway better than varchar ?--- when it comes to a string column...

thanks

Peter Koczan <pjkoczan@gmail.com> wrote:
Hi, Jessica,
> 1. How do I get rid of the nonstandard warning, but still using the
> index search?
You have two options.

- Turn off the warnings in the postgresql.conf file. Use this with
caution (or don't use it at all) as it does pose a potential threat for
SQL injections if other options aren't properly set. Read up at
http://www.postgresql.org/docs/8.2/static/runtime-config-compatible.html
for more detail.
- Use escape-string formatting. This is the best practice since it is
standards-conforming and more secure. You can do it, for instance, as
select name from table where name like 'A!_B%' escape '!';
You can escape with most characters, and it's mostly a matter of
personal preference.
>
> 2. How do I search with a wild card % in the middle of the pattern?
> Would varchar(80) or char(80) make a difference about the wild card
> search (% in the middle)?
I think that postgres is seeing the trailing whitespace on the end of
the char type and not accounting for it in the search. In this case,
varchar or text types would make a difference since they don't do
whitespace padding (unless you force it in). You should remember that if
you plan on converting the data type, trim the extraneous whitespace
first. However, I would switch to varchar/text so it saves some space
and saves you these headaches, unless there's an absolute need for
fixed-length char fields. I use text almost exclusively for string data
since it's arbitrary-length, I almost never have to worry about overflow
or later administration.

If you don't want to convert data, you can use the rtrim() function
(i.e. "select rtrim(name) from ...").

Peter

> ------------------------------------------------------------------------
> Need a vacation? Get great deals to amazing places
> on
> Yahoo! Travel.



Yahoo! oneSearch: Finally, mobile search that gives answers, not web links.

Re: "_" in a serach pattern

От
Peter Koczan
Дата:
The only difference is that varchar can have a length limit, text is
always unlimited length. There's no significant performance difference
between either. For most intents and purposes, they're equal.

varchar is better than text in that limits are already built in. You can
always impose limits on text types using rules and domains, but at a
slight performance hit and having to create and maintain your own rules
and domains. If you need limits, you might consider using varchar over
text as everything is there.

I use text since I don't want to worry about overflow. Personally,
specifying a varchar(500) column to be used as a "big string" field
makes little sense to me. But, if you need a limit for a display or a
program, you probably should use varchar. Besides, if you ever need to
convert data types, any of the postgres 8.* releases make it easy.

Peter

P.S. You might want to file a bug report about your initial problem.
According to
http://www.postgresql.org/docs/8.2/static/datatype-character.html

Values of type character are physically padded with spaces to the
specified width /n/, and are stored and displayed that way. However, the
padding spaces are treated as semantically insignificant. Trailing
spaces are disregarded when comparing two values of type character, and
they will be removed when converting a character value to one of the
other string types. Note that trailing spaces /are/ semantically
significant in character varying and text values.

It should have ignored the spaces in the char field and didn't.

Peter

Jessica Richard wrote:
> Peter,
>
> thanks a lot for your reply.
>
> Could you please tell me more....
>
> What is the difference between varchar and text? what is the benefit
> of each one? and is text alway better than varchar ?--- when it comes
> to a string column...
>
> thanks
>
> */Peter Koczan <pjkoczan@gmail.com>/* wrote:
>
>     Hi, Jessica,
>     > 1. How do I get rid of the nonstandard warning, but still using the
>     > index search?
>     You have two options.
>
>     - Turn off the warnings in the postgresql.conf file. Use this with
>     caution (or don't use it at all) as it does pose a potential
>     threat for
>     SQL injections if other options aren't properly set. Read up at
>     http://www.postgresql.org/docs/8.2/static/runtime-config-compatible.html
>
>     for more detail.
>     - Use escape-string formatting. This is the best practice since it is
>     standards-conforming and more secure. You can do it, for instance, as
>     select name from table where name like 'A!_B%' escape '!';
>     You can escape with most characters, and it's mostly a matter of
>     personal preference.
>     >
>     > 2. How do I search with a wild card % in the middle of the pattern?
>     > Would varchar(80) or char(80) make a difference about the wild card
>     > search (% in the middle)?
>     I think that postgres is seeing the trailing whitespace on the end of
>     the char type and not accounting for it in the search. In this case,
>     varchar or text types would make a difference since they don't do
>     whitespace padding (unless you force it in). You should remember
>     that if
>     you plan on converting the data type, trim the extraneous whitespace
>     first. However, I would switch to varchar/text so it saves some space
>     and saves you these headaches, unless there's an absolute need for
>     fixed-length char fields. I use text almost exclusively for string
>     data
>     since it's arbitrary-length, I almost never have to worry about
>     overflow
>     or later administration.
>
>     If you don't want to convert data, you can use the rtrim() function
>     (i.e. "select rtrim(name) from ...").
>
>     Peter
>
>     >
>     ------------------------------------------------------------------------
>     > Need a vacation? Get great deals to amazing places
>     > on
>     > Yahoo! Travel.
>
>
> ------------------------------------------------------------------------
> Yahoo! oneSearch: Finally, mobile search that gives answers
> <http://us.rd.yahoo.com/evt=48252/*http://mobile.yahoo.com/mobileweb/onesearch?refer=1ONXIC>,
> not web links.


Re: "_" in a serach pattern

От
Ben Kim
Дата:
> The only difference is that varchar can have a length limit, text is always
> unlimited length. There's no significant performance difference between
> either. For most intents and purposes, they're equal.

Forgive me for diversion, but out of curiosity, would it make some
difference whether strings of known length are compared or strings of
unknown length are compared, like in join condition "using(address)"?

Or, is the actual length stored in a separate field and read first so
strings of different lengths wouldn't even need to be matched?

Is there some reference on this subject other than the source code?


Thanks.

Ben K.
Developer
http://benix.tamu.edu

Re: "_" in a serach pattern

От
Peter Koczan
Дата:
Most database systems can figure out the size (in bytes) of a column
rather quickly from offsets within a record. I doubt that postgres is
any different. The only difference I can think of is that it might be
slower to filter out trailing spaces of a char column before comparing,
though the actual string length may be cached somewhere. Semantically,
it should be no different. I haven't perused the source code enough to
know for sure.

Any of the developers that patrol this list can probably answer your
questions better, especially the question of matching or not matching
strings of different length.

Peter

Ben Kim wrote:
>> The only difference is that varchar can have a length limit, text is
>> always unlimited length. There's no significant performance
>> difference between either. For most intents and purposes, they're equal.
>
> Forgive me for diversion, but out of curiosity, would it make some
> difference whether strings of known length are compared or strings of
> unknown length are compared, like in join condition "using(address)"?
>
> Or, is the actual length stored in a separate field and read first so
> strings of different lengths wouldn't even need to be matched?
>
> Is there some reference on this subject other than the source code?
>
>
> Thanks.
>
> Ben K.
> Developer
> http://benix.tamu.edu
>