Обсуждение: repeated characters in SQL

Поиск
Список
Период
Сортировка

repeated characters in SQL

От
Govind Chettiar
Дата:
I have a simple table consisting of a bunch of English words.  I am trying to find words that have repeated characters in them, for example
apple
tattoo

but not

orange 
lemon

I know that only a maximum of one repetition can occur

I tried various options like
SELECT word FROM  public."SpellItWords" 
 WHERE word ~ E'(.)\1{2,}'

SELECT word FROM  public."SpellItWords" 
 WHERE word ~ E'([a-z])\1{2}'

What finally worked was this
SELECT word FROM  public."SpellItWords" 
 WHERE word ~ E'(.)\\1'

But I don't really understand what this does...Can you explain?

Thanks!

Re: repeated characters in SQL

От
David Rowley
Дата:
On 24 January 2016 at 12:44, Govind Chettiar <rashapoo@gmail.com> wrote:
> I have a simple table consisting of a bunch of English words.  I am trying
> to find words that have repeated characters in them, for example
> apple
> tattoo
>
> but not
>
> orange
> lemon
>
> I know that only a maximum of one repetition can occur
>
> I tried various options like
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'(.)\1{2,}'
>
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'([a-z])\1{2}'
>
> What finally worked was this
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'(.)\\1'
>
> But I don't really understand what this does...Can you explain?

The ~ operator is a regular expression matching operator, and the
(.)\1 is a regular expression. More details here
http://www.postgresql.org/docs/current/static/functions-matching.html

The regular expression . matches a single character, since that . is
wrapped in () the regex engine captures the match and stores it in a
variable, this is called a capture group. Since this is the first such
capture group in the regular expression, then the value matching the .
gets stored in the variable \1, so your regex basically says; "match a
single character which has the same single character to its immediate
right hand side". The extra \ is just an escape character.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: repeated characters in SQL

От
rashapoo@gmail.com
Дата:
Thanks David...so it's looking at each character, storing it in /1, then comparing the "next" character with what is in /1.
I guess the escape character (which is not needed in, say, Notepad++) threw me a bit.

On Sun, Jan 24, 2016 at 2:32 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 24 January 2016 at 12:44, Govind Chettiar <rashapoo@gmail.com> wrote:
> I have a simple table consisting of a bunch of English words.  I am trying
> to find words that have repeated characters in them, for example
> apple
> tattoo
>
> but not
>
> orange
> lemon
>
> I know that only a maximum of one repetition can occur
>
> I tried various options like
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'(.)\1{2,}'
>
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'([a-z])\1{2}'
>
> What finally worked was this
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'(.)\\1'
>
> But I don't really understand what this does...Can you explain?

The ~ operator is a regular expression matching operator, and the
(.)\1 is a regular expression. More details here
http://www.postgresql.org/docs/current/static/functions-matching.html

The regular expression . matches a single character, since that . is
wrapped in () the regex engine captures the match and stores it in a
variable, this is called a capture group. Since this is the first such
capture group in the regular expression, then the value matching the .
gets stored in the variable \1, so your regex basically says; "match a
single character which has the same single character to its immediate
right hand side". The extra \ is just an escape character.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: repeated characters in SQL

От
Francisco Olarte
Дата:
On Sun, Jan 24, 2016 at 7:05 PM,  <rashapoo@gmail.com> wrote:
> I guess the escape character (which is not needed in, say, Notepad++) threw
> me a bit.

Notepad ++ is, AFAIK,  an editor, it SHOULD (within reason) let you
write any text.

The double quote is needed due to the quoting rules of the language.
You want the regexp engine to see the characters leftp, dot, rightp,
backslash, one. But backslah is the scape character in strings ( in
many languages ), so you need to escape it too.

The same happens in, for example, C and Java. To put those five chars
in a string you need "(.)\\1". If you send "(.)\1" to a C compiler it
will build the string leftp, dot, rightp, SOH=(char)(1). It will
arrive to the backslash when parsing, see it is followed by a digit
less than 8, interpret it as an octal escape, and emit the SOH.

Francisco Olarte.


Re: repeated characters in SQL

От
Tom Lane
Дата:
Francisco Olarte <folarte@peoplecall.com> writes:
> On Sun, Jan 24, 2016 at 7:05 PM,  <rashapoo@gmail.com> wrote:
>> I guess the escape character (which is not needed in, say, Notepad++) threw
>> me a bit.

> Notepad ++ is, AFAIK,  an editor, it SHOULD (within reason) let you
> write any text.

> The double quote is needed due to the quoting rules of the language.
> You want the regexp engine to see the characters leftp, dot, rightp,
> backslash, one. But backslah is the scape character in strings ( in
> many languages ), so you need to escape it too.

In this particular case, the extra backslash was needed only because
the OP used E'...' syntax for his string literal.  In a plain SQL
string literal, backslash isn't special.

            regards, tom lane