Обсуждение: Re: tsearch parser inefficiency if text includes urls or emails - new version

Поиск

Список

Период

Сортировка

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

08 декабря 2009 г., 14:23:26

I wrote:
> Frankly, I'd be amazed if there was a performance regression,
OK, I'm amazed.  While it apparently helps some cases dramatically
(Andres had a case where run time was reduced by 93.2%), I found a
pretty routine case where run time was increased by 3.1%.  I tweaked
the code and got that down to a 2.5% run time increase.  I'm having
troubles getting it any lower than that.  And yes, this is real, not
noise -- the slowest unpatched time for this test is faster than the
fastest time with any version of the patch.  :-(
Andres, could you provide more information on the test which showed
the dramatic improvement?  In particular, info on OS, CPU, character
set, encoding scheme, and what kind of data was used for the test.
I'll do some more testing and try to figure out how the patch is
slowing things down and post with details.
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Andres Freund

Дата:

08 декабря 2009 г., 14:26:23

On Tuesday 08 December 2009 16:23:11 Kevin Grittner wrote:
> I wrote:
> > Frankly, I'd be amazed if there was a performance regression,
> 
> OK, I'm amazed.  While it apparently helps some cases dramatically
> (Andres had a case where run time was reduced by 93.2%), I found a
> pretty routine case where run time was increased by 3.1%.  I tweaked
> the code and got that down to a 2.5% run time increase.  I'm having
> troubles getting it any lower than that.  And yes, this is real, not
> noise -- the slowest unpatched time for this test is faster than the
> fastest time with any version of the patch.  :-(
> 
> Andres, could you provide more information on the test which showed
> the dramatic improvement?  In particular, info on OS, CPU, character
> set, encoding scheme, and what kind of data was used for the test.
> 
> I'll do some more testing and try to figure out how the patch is
> slowing things down and post with details.
Could you show your testcase? I dont see why it could get slower?

I tested with various data, the one benefiting most was some changelog where 
each entry was signed by an email.

OS: Debian Sid, Core2 Duo, UTF-8, and I tried both C and de_DE.UTF8.

Andres

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

08 декабря 2009 г., 15:15:57

Andres Freund <andres@anarazel.de> wrote: 
> Could you show your testcase?
OK.  I was going to try to check other platforms first, and package
up the information better, but here goes.
I created 10000 lines with random IP-based URLs for a test.  The
first few lines are:
create table t1 (c1 int not null primary key, c2 text);
insert into t1 values (2, 

'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.htmlhttp://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.htmlhttp://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html
Actually, the special character was initially the word "the", but I
wanted to see if having non-ASCII characters in the value made any
difference.  It didn't.
Unfortunately, I was testing at home last night and forgot to bring
the exact test query with me, but it was this or something close to
it:
\timing
select to_tsvector(c2) from t1, (select generate_series(1,200)) x where c1 = 2;
I was running on Ubuntu 9.10, an AMD dual core CPU (don't have the
model number handy), UTF-8, en_US.UTF8.
> I dont see why it could get slower?
I don't either.  The best I can tell, following the pointer from
orig to any of its elements seems to be way more expensive than I
would ever have guessed.  The only thing that seemed to improve the
speed was minimizing that by using a local variable to capture any
element referenced more than once.  (Although, there is overlap
between the timings for the original patch and the one which seemed
a slight improvement; I would need to do more testing to really rule
out noise and have complete confidence that my changes actually are
an improvement on the original patch.)
Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
CPU?  (I'm looking forward to testing this today on a 64 bit build
on an Intel CPU.)
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Andres Freund

Дата:

08 декабря 2009 г., 15:17:30

On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
> > Could you show your testcase?
Will hopefully look into this later.

> > I dont see why it could get slower?
> I don't either.  The best I can tell, following the pointer from
> orig to any of its elements seems to be way more expensive than I
> would ever have guessed.  The only thing that seemed to improve the
> speed was minimizing that by using a local variable to capture any
> element referenced more than once.  (Although, there is overlap
> between the timings for the original patch and the one which seemed
> a slight improvement; I would need to do more testing to really rule
> out noise and have complete confidence that my changes actually are
> an improvement on the original patch.)
> 
> Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
> CPU?  (I'm looking forward to testing this today on a 64 bit build
> on an Intel CPU.)
Did you test that with a optimized build?

Andres

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

08 декабря 2009 г., 15:40:43

Andres Freund <andres@anarazel.de> wrote:
> Did you test that with a optimized build?
After running a series of tests with --enable-cassert to confirm
that nothing squawked, I disabled that before doing any
performance testing.  Going from memory, I used --enable-debug
--with-libxml and --prefix.  I didn't explicitly use or disable any
compiler optimizations.
Possibly relevant is that I was using Greg Smith's peg tool (as best
I could):
http://github.com/gregs1104/peg/ 
I'm not aware of it doing anything to the compile options, but I
didn't look for that, either.  I guess I should poke around that
some more to be sure.  After having peg do the checkout and set up
the directories, did my own ./configure and make runs so that I
could be sure of my configure settings.
Are there any settings which you feel I should be using which
PostgreSQL doesn't set up as you would recommend during the
./configure run?
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Greg Smith

Дата:

08 декабря 2009 г., 18:09:38

Kevin Grittner wrote:
> After running a series of tests with --enable-cassert to confirm
> that nothing squawked, I disabled that before doing any
> performance testing.  Going from memory, I used --enable-debug
> --with-libxml and --prefix.  I didn't explicitly use or disable any
> compiler optimizations.
>  
> Possibly relevant is that I was using Greg Smith's peg tool (as best
> I could):
>  
> http://github.com/gregs1104/peg/ 
>  
> I'm not aware of it doing anything to the compile options, but I
> didn't look for that, either.
Now that you ask, of course I just spotted a bug in there such that the 
documented behavior for the PGDEBUG feature doesn't actually work.  If 
you were using that to turn off asserts, that may not have worked as 
expected.  Don't know what you did there exactly.  Fix now pushed to the 
repo. 

I general, when doing performance testing, now matter how careful I've 
been I try to do a:

show debug_assertions;

To confirm I'm not running with those on.

Andres, are using any optimization flags when you're testing? 

Would have preferred that the first mention of my new project didn't 
involve a bug report, but that's software I guess.  For everyone here 
who's not on the reviewers mailing list:  peg is a scripting tool I've 
put together recently that makes it easier to setup the environment 
(source code, binaries, database) for testing PostgreSQL in a 
development content.  It takes care of grabbing the latest source, 
building, starting the database, all that boring stuff.  I tried to make 
it automate the most tedious parts of both development and patch 
review.  Documentation and the program itself are at the git repo Kevin 
mentioned.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

08 декабря 2009 г., 18:20:51

Greg Smith <greg@2ndquadrant.com> wrote:
> Now that you ask, of course I just spotted a bug in there such
> that the documented behavior for the PGDEBUG feature doesn't
> actually work.   If you were using that to turn off asserts, that
> may not have worked as expected.  Don't know what you did there
> exactly.  Fix now pushed to the repo. 
Thanks.  I was going to reply to your original message with my
experiences (and probably still will), but it seemed like it might
be relevant here.  I did check pg_config results before doing
anything, and saw that the debug and cassert weren't set, so that's
why I did explicit configure and make commands.  Even without that
covered, peg was a nice convenience -- I can say that it saved me
more time already than it took to install and read the docs.  Nice
work!
Anyway, I'm not sure whether your reply directly answers the point
I was raising -- peg doesn't do anything with the compiler
optimization flags under the covers, does it?
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Andres Freund

Дата:

08 декабря 2009 г., 18:25:32

Hi,

On Tuesday 08 December 2009 20:09:22 Greg Smith wrote:
> Andres, are using any optimization flags when you're testing?
I was testing with and without debug/cassert - and did not get adverse results
in both...

Unfortunately it looks like I wont get to test today, but only tomorrow
morning its 9pm and I am not yet fully finished with work....

Andres

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Greg Smith

Дата:

08 декабря 2009 г., 18:40:21

Kevin Grittner wrote:
> Anyway, I'm not sure whether your reply directly answers the point
> I was raising -- peg doesn't do anything with the compiler
> optimization flags under the covers, does it?
>  
Not really.  It does this:

PGDEBUG="--enable-cassert --enable-debug"
./configure --prefix=$PGINST/$PGPROJECT --enable-depend 
--enable-thread-safety $PGDEBUG

Which are pretty standard options.  The idea is that you'd use the 
default normally, then just set PGDEBUG=" " for a non-debug build--or to 
otherwise change the configure flags but still get things done 
automatically for you.  If it's set before the script starts, it doesn't 
change it.

I did try to design things so that you could do any step in the 
automated series manually and not have that screw things up.  There's 
hundreds of lines of code in there just for things like figuring out 
whether configure has been run or not yet when it decides you need to 
build, so it can try to do the right thing in either case.  My hope was 
that anyone who tried peg out would find it a net positive time savings 
after a single use, glad to hear I accomplished that in your case.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Tom Lane

Дата:

08 декабря 2009 г., 18:43:25

Greg Smith <greg@2ndquadrant.com> writes:
> Kevin Grittner wrote:
>> Anyway, I'm not sure whether your reply directly answers the point
>> I was raising -- peg doesn't do anything with the compiler
>> optimization flags under the covers, does it?
>> 
> Not really.  It does this:

> PGDEBUG="--enable-cassert --enable-debug"
> ./configure --prefix=$PGINST/$PGPROJECT --enable-depend 
> --enable-thread-safety $PGDEBUG

--enable-cassert might have a noticeable performance impact.
We usually try to not have that on when doing performance testing.
        regards, tom lane

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Greg Smith

Дата:

08 декабря 2009 г., 18:51:26

Tom Lane wrote:
> --enable-cassert might have a noticeable performance impact.
> We usually try to not have that on when doing performance testing.
>   
All covered in the tool's documentation, and it looks like Kevin did the 
right thing during his tests by checking the pg_config output to confirm 
the right flags were used.  I think we can rule out simple pilot error 
here and work under the assumption Kevin found a mild performance 
regression under some circumstances with the patch.  Hopefully Andres 
will be able to replicate the problem, and it sounds like Kevin might be 
able to provide more information about it tonight too.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

08 декабря 2009 г., 18:55:30

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> --enable-cassert might have a noticeable performance impact.
> We usually try to not have that on when doing performance testing.
Right.  I turned it on for some initial tests to confirm that we had
no assertion failures; then turned it off for the performance
testing.  I did leave --enable-debug on during performance testing. 
My understanding is that debug info doesn't affect performance, but
I guess it never hurts to try an empirical test to confirm.
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

08 декабря 2009 г., 21:07:19

I wrote:
> Perhaps it is some quirk of using 32 bit pointers on the 64 bit
> AMD CPU?  (I'm looking forward to testing this today on a 64 bit
> build on an Intel CPU.)
The exact same test on 64 bit OS (SuSE Enterprise Server) on Intel
gave very different results. With 10 runs each of 200 iterations of
parsing the 10000 URLs, the patch Andres submitted ran 0.4% faster
than HEAD, and my attempt to improve on it ran 0.6% slower than
HEAD.  I'll try to run the numbers to get the percentage chance that
a random distribution would have generated a spread as large as
either of those; but I think it's safe to say that the submitted
patch doesn't hurt there and that my attempt to improve on it was
misdirected.  :-/
I would like to independently confirm the dramatic improvement
reported by Andres.  Could I get a short snippet from the log which
was used for that, along with an indication of the size of the text
parsed in that test?  (Since the old code looks like it might have
O(N^2) performance in some situations, while the patch changes that
to O(N), I might not be testing with a big enough N.)
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Andres Freund

Дата:

09 декабря 2009 г., 17:06:58

On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
> > Could you show your testcase?
> 
> OK.  I was going to try to check other platforms first, and package
> up the information better, but here goes.
> 
> I created 10000 lines with random IP-based URLs for a test.  The
> first few lines are:
> 
> create table t1 (c1 int not null primary key, c2 text);
> insert into t1 values (2,
> 'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
>  http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
>  http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html
I think you see no real benefit, because your strings are rather short - the 
documents I scanned when noticing the issue where rather long.
If your strings are short, they and the copy will fit into cpu cache anyway, so 
copying them around/converting them to some other string format is not that 
expensive compared to the rest of the work done.

Also after each copying step for large strings the complete cache is filled 
with unrelated information (namely the end of the string). So every charwise 
access will need to wait for a memory access.

A rather extreme/contrived example:


postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 3.740 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1000) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 115.027 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
10000) g(i)), ' -  ')); ?column? ----------        1(1 row) 
Time: 24355.339 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
20000) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 47276.739 ms


One easily can see the quadratic complexity here. The quadratic complexity 
lies in the length/amount of emails/urls of the strings, not in the number of 
to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'andres@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
20000) g(i)), ' -  ')); ?column? ----------        1(1 row)
Time: 168.384 ms


I could not reproduce the slowdown you mentioned:


Without patch:
postgres=# SELECT to_tsvector('andres@anarazel.de 
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i) 
ORDER BY g.i LIMIT 1;                                  to_tsvector
-------------------------------------------------------------------------------'/1':4 'andres@anarazel.de':1
'www.postgresql.org':3
 
'www.postgresql.org/1':2(1 row)
Time: 1109.833 ms

With patch:
postgres=# SELECT to_tsvector('andres@anarazel.de 
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i) 
ORDER BY g.i LIMIT 1;                                  to_tsvector
-------------------------------------------------------------------------------'/1':4 'andres@anarazel.de':1
'www.postgresql.org':3
 
'www.postgresql.org/1':2(1 row)
Time: 1036.689 ms

So on the hardware I tried its even a little bit faster for small strings 
(Older Xeon32bit, Core2 Duo, 64bit, Nehalem based Xeon 64bit).


I could not reproduce any difference with strings not involving urls or emails:

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) 
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;                      to_tsvector
--------------------------------------------------------'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6(1 row)
 
Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) 
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;                      to_tsvector
--------------------------------------------------------'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6(1 row)
 
Time: 975.339 ms


So at least in my testing I do see no danger in the patch ;-)

Andres



PS: I averaged all the time result over multiple runs where it was relevant.

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

10 декабря 2009 г., 16:01:24

Andres Freund <andres@anarazel.de> wrote:
> I think you see no real benefit, because your strings are rather
> short - the documents I scanned when noticing the issue where
> rather long.
The document I used in the test which showed the regression was
672,585 characters, containing 10,000 URLs.
> A rather extreme/contrived example:
> postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
> 'andres@anarazel.de http://www.postgresql.org/'::text FROM 
> generate_series(1, 
> 20000) g(i)), ' -  '));
The most extreme of your examples uses a 979,996 character string,
which is less than 50% larger than my test.  I am, however, able to
see the performance difference for this particular example, so I now
have something to work with.  I'm seeing some odd behavior in terms
of when there is what sort of difference.  Once I can categorize it
better, I'll follow up.
Thanks for the sample which shows the difference.
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

10 декабря 2009 г., 17:10:39

I wrote:
> Thanks for the sample which shows the difference.
Ah, once I got on the right track, there is no problem seeing
dramatic improvements with the patch.  It changes some nasty O(N^2)
cases to O(N).  In particular, the fixes affect parsing of large
strings encoded with multi-byte character encodings and containing
email addresses or URLs with a non-IP-address host component.  It
strikes me as odd that URLs without a slash following the host
portion, or with an IP address, are treated so differently in the
parser, but if we want to address that, it's a matter for another
patch.
I'm inclined to think that the minimal differences found in some of
my tests probably have more to do with happenstance of code
alignment than the particulars of the patch.
I did find one significant (although easily solved) problem.  In the
patch, the recursive setup of usewide, pgwstr, and wstr are not
conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
version.  Unless there's a good reason for that, the #ifdef should
be added.
Less critical, but worth fixing one way or the other, TParserClose
does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
TParserCopyClose does.  I think this should be consistent.
Finally, there's that spelling error in the comment for
TParserCopyInit.  Please fix.
If a patch is produced with fixes for these three things, I'd say
it'll be ready for committer.  I'm marking it as Waiting on Author
for fixes to these three items.
Sorry for the delay in review.  I hope there's still time to get
this committed in this CF.
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

10 декабря 2009 г., 18:06:07

I wrote:
> I did find one significant (although easily solved) problem.  In
> the patch, the recursive setup of usewide, pgwstr, and wstr are
> not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version.  Unless there's a good reason for that, the #ifdef should
> be added.
That should read:
I did find one significant (although easily solved) problem.  In
the patch, the recursive setup of usewide, pgwstr, and wstr are
not conditioned by #ifdef USE_WIDE_UPPER_LOWER as they are in the
non-patched version.  Unless there's a good reason for that, the
#ifdef should be added.
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Andres Freund

Дата:

10 декабря 2009 г., 18:06:19

On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
> 
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch.  It changes some nasty O(N^2)
> cases to O(N).  In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component.  It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
> patch.
Same here. Generally I do have to say that I dont find that parser very nice - 
and actually I think its not very efficient as well because it searches a big 
part of the search space all the time.
I think a generated parser would be more efficient and way much easier to 
read...

> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.
Yes, I think that as well.

> I did find one significant (although easily solved) problem.  In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version.  Unless there's a good reason for that, the #ifdef should
> be added.
Uh. Sorry. Fixed.

> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does.  I think this should be consistent.
Actually there was *some* thinking behind that: The end of parsing is quite 
obvious (the call returns, and its so verbose you will never do more than one 
call anyway) - but knowing where the copy ends is quite important to 
understand the parse flow.
I added a log there because in the end that line is not going to make any 
difference ;-)

> Sorry for the delay in review.  I hope there's still time to get
> this committed in this CF.
Thanks for your reviewing!

Actually I dont mind very much if it gets delayed or not. Its trivial enough 
that it shouldnt cause much work/conflicts/whatever next round and I am running 
patched versions anyway, so ...



Andres

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

"Kevin Grittner"

Дата:

10 декабря 2009 г., 18:47:53

Andres Freund <andres@anarazel.de> wrote:
> [concerns addressed in new patch version]
Looks good to me.  I'm marking this Ready for Committer.
-Kevin

Re: tsearch parser inefficiency if text includes urls or emails - new version

От

Tom Lane

Дата:

15 декабря 2009 г., 19:40:08

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Andres Freund <andres@anarazel.de> wrote:
>> [concerns addressed in new patch version]
> Looks good to me.  I'm marking this Ready for Committer.

Applied with minor editorialization --- improving the comments mostly.
        regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Re: tsearch parser inefficiency if text includes urls or emails - new version