Discussion: Re: tsearch parser inefficiency if text includes urls or emails - new version
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
I wrote:
> Frankly, I'd be amazed if there was a performance regression,

OK, I'm amazed. While it apparently helps some cases dramatically (Andres had a case where run time was reduced by 93.2%), I found a pretty routine case where run time was increased by 3.1%. I tweaked the code and got that down to a 2.5% run time increase. I'm having trouble getting it any lower than that. And yes, this is real, not noise -- the slowest unpatched time for this test is faster than the fastest time with any version of the patch. :-(

Andres, could you provide more information on the test which showed the dramatic improvement? In particular, info on OS, CPU, character set, encoding scheme, and what kind of data was used for the test.

I'll do some more testing, try to figure out how the patch is slowing things down, and post with details.

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: Andres Freund
Date:
On Tuesday 08 December 2009 16:23:11 Kevin Grittner wrote:
> I wrote:
> > Frankly, I'd be amazed if there was a performance regression,
>
> OK, I'm amazed. While it apparently helps some cases dramatically
> (Andres had a case where run time was reduced by 93.2%), I found a
> pretty routine case where run time was increased by 3.1%. I tweaked
> the code and got that down to a 2.5% run time increase. I'm having
> trouble getting it any lower than that. And yes, this is real, not
> noise -- the slowest unpatched time for this test is faster than the
> fastest time with any version of the patch. :-(
>
> Andres, could you provide more information on the test which showed
> the dramatic improvement? In particular, info on OS, CPU, character
> set, encoding scheme, and what kind of data was used for the test.
>
> I'll do some more testing and try to figure out how the patch is
> slowing things down and post with details.

Could you show your testcase? I don't see why it could get slower.

I tested with various data; the one benefiting most was some changelog where each entry was signed with an email address.

OS: Debian Sid, Core 2 Duo, UTF-8, and I tried both C and de_DE.UTF8.

Andres
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote:
> Could you show your testcase?

OK. I was going to try to check other platforms first and package up the information better, but here goes.

I created 10000 lines with random IP-based URLs for a test. The first few lines are:

create table t1 (c1 int not null primary key, c2 text);
insert into t1 values (2,
'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html

Actually, the special character was initially the word "the", but I wanted to see if having non-ASCII characters in the value made any difference. It didn't.

Unfortunately, I was testing at home last night and forgot to bring the exact test query with me, but it was this or something close to it:

\timing
select to_tsvector(c2) from t1, (select generate_series(1,200)) x where c1 = 2;

I was running on Ubuntu 9.10, an AMD dual core CPU (I don't have the model number handy), UTF-8, en_US.UTF8.

> I don't see why it could get slower.

I don't either. The best I can tell, following the pointer from orig to any of its elements seems to be way more expensive than I would ever have guessed. The only thing that seemed to improve the speed was minimizing that by using a local variable to capture any element referenced more than once. (Although there is overlap between the timings for the original patch and the one which seemed a slight improvement; I would need to do more testing to really rule out noise and have complete confidence that my changes actually are an improvement on the original patch.)

Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD CPU? (I'm looking forward to testing this today on a 64 bit build on an Intel CPU.)

-Kevin
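[For anyone who wants to reproduce this, test data of the shape Kevin describes can be regenerated with a short script. This is a sketch, not Kevin's actual generator: the URL path, the placeholder row, and the 10000-URL count come from his description; everything else is assumption.]

```python
import random

def random_url():
    # A random IPv4-based URL matching the path/query pattern in
    # Kevin's example rows.
    ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
    return f"http://{ip}/*/quick/brown/fox?jumps&over&*&lazy&dog.html"

def make_document(n_urls=10000):
    # One large text value containing n_urls URLs, one per line.
    return "\n".join(random_url() for _ in range(n_urls))

if __name__ == "__main__":
    doc = make_document()
    # Emit SQL like the snippet above; double any single quotes for the
    # SQL literal (none occur in these URLs, but be safe).
    print("create table t1 (c1 int not null primary key, c2 text);")
    print("insert into t1 values (2, '%s');" % doc.replace("'", "''"))
```

Piping the output into psql and then running the \timing query above should approximate the test.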
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: Andres Freund
Date:
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
> > Could you show your testcase?

Will hopefully look into this later.

> > I don't see why it could get slower.
>
> I don't either. The best I can tell, following the pointer from
> orig to any of its elements seems to be way more expensive than I
> would ever have guessed. The only thing that seemed to improve the
> speed was minimizing that by using a local variable to capture any
> element referenced more than once. (Although there is overlap
> between the timings for the original patch and the one which seemed
> a slight improvement; I would need to do more testing to really rule
> out noise and have complete confidence that my changes actually are
> an improvement on the original patch.)
>
> Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
> CPU? (I'm looking forward to testing this today on a 64 bit build
> on an Intel CPU.)

Did you test that with an optimized build?

Andres
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote:
> Did you test that with an optimized build?

After running a series of tests with --enable-cassert to confirm that nothing squawked, I disabled that before doing any performance testing. Going from memory, I used --enable-debug, --with-libxml, and --prefix. I didn't explicitly use or disable any compiler optimizations.

Possibly relevant is that I was using Greg Smith's peg tool (as best I could):

http://github.com/gregs1104/peg/

I'm not aware of it doing anything to the compile options, but I didn't look for that, either. I guess I should poke around that some more to be sure. After having peg do the checkout and set up the directories, I did my own ./configure and make runs so that I could be sure of my configure settings.

Are there any settings which you feel I should be using which PostgreSQL doesn't set up as you would recommend during the ./configure run?

-Kevin
Kevin Grittner wrote:
> After running a series of tests with --enable-cassert to confirm
> that nothing squawked, I disabled that before doing any
> performance testing. Going from memory, I used --enable-debug,
> --with-libxml, and --prefix. I didn't explicitly use or disable any
> compiler optimizations.
>
> Possibly relevant is that I was using Greg Smith's peg tool (as best
> I could):
>
> http://github.com/gregs1104/peg/
>
> I'm not aware of it doing anything to the compile options, but I
> didn't look for that, either.

Now that you ask, of course I just spotted a bug in there such that the documented behavior for the PGDEBUG feature doesn't actually work. If you were using that to turn off asserts, it may not have worked as expected; I don't know what you did there exactly. A fix is now pushed to the repo.

In general, when doing performance testing, no matter how careful I've been, I try to do a:

show debug_assertions;

to confirm I'm not running with those on. Andres, are you using any optimization flags when you're testing?

I would have preferred that the first mention of my new project didn't involve a bug report, but that's software, I guess. For everyone here who's not on the reviewers mailing list: peg is a scripting tool I've put together recently that makes it easier to set up the environment (source code, binaries, database) for testing PostgreSQL in a development context. It takes care of grabbing the latest source, building, starting the database -- all that boring stuff. I tried to make it automate the most tedious parts of both development and patch review. Documentation and the program itself are at the git repo Kevin mentioned.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote:
> Now that you ask, of course I just spotted a bug in there such
> that the documented behavior for the PGDEBUG feature doesn't
> actually work. If you were using that to turn off asserts, that
> may not have worked as expected. Don't know what you did there
> exactly. Fix now pushed to the repo.

Thanks. I was going to reply to your original message with my experiences (and probably still will), but it seemed like it might be relevant here. I did check pg_config results before doing anything, and saw that the debug and cassert flags weren't set, so that's why I did explicit configure and make commands. Even without that covered, peg was a nice convenience -- I can say that it has already saved me more time than it took to install and read the docs. Nice work!

Anyway, I'm not sure whether your reply directly answers the point I was raising -- peg doesn't do anything with the compiler optimization flags under the covers, does it?

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: Andres Freund
Date:
Hi,

On Tuesday 08 December 2009 20:09:22 Greg Smith wrote:
> Andres, are you using any optimization flags when you're testing?

I was testing with and without debug/cassert, and did not get adverse results in either case...

Unfortunately it looks like I won't get to test today, only tomorrow morning -- it's 9pm here and I am not yet fully finished with work...

Andres
Kevin Grittner wrote:
> Anyway, I'm not sure whether your reply directly answers the point
> I was raising -- peg doesn't do anything with the compiler
> optimization flags under the covers, does it?

Not really. It does this:

PGDEBUG="--enable-cassert --enable-debug"
./configure --prefix=$PGINST/$PGPROJECT --enable-depend --enable-thread-safety $PGDEBUG

Which are pretty standard options. The idea is that you'd use the default normally, then just set PGDEBUG=" " for a non-debug build -- or otherwise change the configure flags but still get things done automatically for you. If it's set before the script starts, the script doesn't change it.

I did try to design things so that you could do any step in the automated series manually and not have that screw things up. There are hundreds of lines of code in there just for things like figuring out whether configure has been run yet when it decides you need to build, so it can try to do the right thing in either case.

My hope was that anyone who tried peg out would find it a net time savings after a single use; glad to hear I accomplished that in your case.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> writes:
> Kevin Grittner wrote:
>> Anyway, I'm not sure whether your reply directly answers the point
>> I was raising -- peg doesn't do anything with the compiler
>> optimization flags under the covers, does it?
>
> Not really. It does this:
>
> PGDEBUG="--enable-cassert --enable-debug"
> ./configure --prefix=$PGINST/$PGPROJECT --enable-depend
> --enable-thread-safety $PGDEBUG

--enable-cassert might have a noticeable performance impact. We usually try to not have that on when doing performance testing.

            regards, tom lane
Tom Lane wrote:
> --enable-cassert might have a noticeable performance impact.
> We usually try to not have that on when doing performance testing.

All covered in the tool's documentation, and it looks like Kevin did the right thing during his tests by checking the pg_config output to confirm the right flags were used. I think we can rule out simple pilot error here and work under the assumption that Kevin found a mild performance regression under some circumstances with the patch. Hopefully Andres will be able to replicate the problem, and it sounds like Kevin might be able to provide more information about it tonight too.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> --enable-cassert might have a noticeable performance impact.
> We usually try to not have that on when doing performance testing.

Right. I turned it on for some initial tests to confirm that we had no assertion failures, then turned it off for the performance testing. I did leave --enable-debug on during performance testing. My understanding is that debug info doesn't affect performance, but I guess it never hurts to run an empirical test to confirm.

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
I wrote:
> Perhaps it is some quirk of using 32 bit pointers on the 64 bit
> AMD CPU? (I'm looking forward to testing this today on a 64 bit
> build on an Intel CPU.)

The exact same test on a 64 bit OS (SuSE Enterprise Server) on Intel gave very different results. With 10 runs each of 200 iterations of parsing the 10000 URLs, the patch Andres submitted ran 0.4% faster than HEAD, and my attempt to improve on it ran 0.6% slower than HEAD. I'll try to run the numbers to get the percentage chance that a random distribution would have generated a spread as large as either of those; but I think it's safe to say that the submitted patch doesn't hurt there, and that my attempt to improve on it was misdirected. :-/

I would like to independently confirm the dramatic improvement reported by Andres. Could I get a short snippet from the log which was used for that, along with an indication of the size of the text parsed in that test? (Since the old code looks like it might have O(N^2) performance in some situations, while the patch changes that to O(N), I might not be testing with a big enough N.)

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: Andres Freund
Date:
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
> > Could you show your testcase?
>
> OK. I was going to try to check other platforms first, and package
> up the information better, but here goes.
>
> I created 10000 lines with random IP-based URLs for a test. The
> first few lines are:
>
> create table t1 (c1 int not null primary key, c2 text);
> insert into t1 values (2,
> 'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html

I think you see no real benefit because your strings are rather short -- the documents I scanned when noticing the issue were rather long. If your strings are short, they and the copy will fit into the CPU cache anyway, so copying them around/converting them to some other string format is not that expensive compared to the rest of the work done. Also, after each copying step for large strings the complete cache is filled with unrelated information (namely the end of the string), so every charwise access will need to wait for a memory access.

A rather extreme/contrived example:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres@anarazel.de http://www.postgresql.org/'::text FROM
generate_series(1, 1) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 3.740 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres@anarazel.de http://www.postgresql.org/'::text FROM
generate_series(1, 1000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 115.027 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres@anarazel.de http://www.postgresql.org/'::text FROM
generate_series(1, 10000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 24355.339 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres@anarazel.de http://www.postgresql.org/'::text FROM
generate_series(1, 20000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 47276.739 ms

One can easily see the quadratic complexity here. The quadratic complexity lies in the length/number of emails and urls in the strings, not in the number of to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres@anarazel.de http://www.postgresql.org/'::text FROM
generate_series(1, 20000) g(i)), ' - '));
 ?column?
----------
        1
(1 row)

Time: 168.384 ms

I could not reproduce the slowdown you mentioned:

Without patch:

postgres=# SELECT to_tsvector('andres@anarazel.de
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i)
ORDER BY g.i LIMIT 1;
                                   to_tsvector
--------------------------------------------------------------------------------
 '/1':4 'andres@anarazel.de':1 'www.postgresql.org':3 'www.postgresql.org/1':2
(1 row)

Time: 1109.833 ms

With patch:

postgres=# SELECT to_tsvector('andres@anarazel.de
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i)
ORDER BY g.i LIMIT 1;
                                   to_tsvector
--------------------------------------------------------------------------------
 '/1':4 'andres@anarazel.de':1 'www.postgresql.org':3 'www.postgresql.org/1':2
(1 row)

Time: 1036.689 ms

So on the hardware I tried it is even a little bit faster for small strings (older Xeon, 32bit; Core 2 Duo, 64bit; Nehalem-based Xeon, 64bit).

I could not reproduce any difference with strings not involving urls or emails:

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text)
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
                       to_tsvector
--------------------------------------------------------
 'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text)
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
                       to_tsvector
--------------------------------------------------------
 'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 975.339 ms

So at least in my testing I see no danger in the patch ;-)

Andres

PS: I averaged all the time results over multiple runs where it was relevant.
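[The quadratic behavior demonstrated above comes from re-copying/re-converting the remaining input for every token. A language-neutral sketch of the pattern -- this is not the actual tsearch parser code, and both function names are made up for illustration:]

```python
def tokens_with_recopy(text):
    # Re-materializes the tail of the input at each step, the way a
    # parser that reconverts the whole remaining string per token does.
    # Total work is O(N^2) in the text length.
    pos, count = 0, 0
    while pos < len(text):
        tail = text[pos:]              # O(N - pos) copy every iteration
        end = tail.find(" ")
        pos += len(tail) if end == -1 else end + 1
        count += 1
    return count

def tokens_with_offset(text):
    # Keeps an offset into the original buffer instead of copying,
    # which is roughly what the patch achieves: O(N) overall.
    pos, count = 0, 0
    while pos < len(text):
        end = text.find(" ", pos)
        pos = len(text) if end == -1 else end + 1
        count += 1
    return count
```

Both walk the same token boundaries; only the per-token copying differs, which is why doubling the input roughly quadruples the unpatched run time while the patched time merely doubles.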
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote:
> I think you see no real benefit because your strings are rather
> short -- the documents I scanned when noticing the issue were
> rather long.

The document I used in the test which showed the regression was 672,585 characters, containing 10,000 URLs.

> A rather extreme/contrived example:
> postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
> 'andres@anarazel.de http://www.postgresql.org/'::text FROM
> generate_series(1, 20000) g(i)), ' - '));

The most extreme of your examples uses a 979,996 character string, which is less than 50% larger than my test. I am, however, able to see the performance difference for this particular example, so I now have something to work with. I'm seeing some odd behavior in terms of when there is what sort of difference. Once I can categorize it better, I'll follow up.

Thanks for the sample which shows the difference.

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
I wrote:
> Thanks for the sample which shows the difference.

Ah, once I got on the right track, there was no problem seeing dramatic improvements with the patch. It changes some nasty O(N^2) cases to O(N). In particular, the fixes affect parsing of large strings encoded with multi-byte character encodings and containing email addresses or URLs with a non-IP-address host component. It strikes me as odd that URLs without a slash following the host portion, or with an IP address, are treated so differently in the parser; but if we want to address that, it's a matter for another patch.

I'm inclined to think that the minimal differences found in some of my tests probably have more to do with the happenstance of code alignment than the particulars of the patch.

I did find one significant (although easily solved) problem. In the patch, the recursive setup of usewide, pgwstr, and wstr are not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched version. Unless there's a good reason for that, the #ifdef should be added.

Less critical, but worth fixing one way or the other: TParserClose does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but TParserCopyClose does. I think this should be consistent.

Finally, there's that spelling error in the comment for TParserCopyInit. Please fix.

If a patch is produced with fixes for these three things, I'd say it'll be ready for committer. I'm marking it as Waiting on Author for fixes to these three items.

Sorry for the delay in review. I hope there's still time to get this committed in this CF.

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
I wrote:
> I did find one significant (although easily solved) problem. In
> the patch, the recursive setup of usewide, pgwstr, and wstr are
> not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version. Unless there's a good reason for that, the #ifdef should
> be added.

That should read:

I did find one significant (although easily solved) problem. In the patch, the recursive setup of usewide, pgwstr, and wstr is not conditioned by #ifdef USE_WIDE_UPPER_LOWER as it is in the non-patched version. Unless there's a good reason for that, the #ifdef should be added.

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: Andres Freund
Date:
On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
>
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch. It changes some nasty O(N^2)
> cases to O(N). In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component. It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
> patch.

Same here. Generally I have to say that I don't find that parser very nice -- and actually I think it's not very efficient either, because it searches a big part of the search space all the time. I think a generated parser would be more efficient and much easier to read...

> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.

Yes, I think that as well.

> I did find one significant (although easily solved) problem. In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version. Unless there's a good reason for that, the #ifdef should
> be added.

Uh. Sorry. Fixed.

> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does. I think this should be consistent.

Actually there was *some* thinking behind that: the end of parsing is quite obvious (the call returns, and it's so verbose you will never do more than one call anyway) -- but knowing where the copy ends is quite important for understanding the parse flow. I added a log there, because in the end that one line is not going to make any difference ;-)

> Sorry for the delay in review. I hope there's still time to get
> this committed in this CF.

Thanks for your reviewing! Actually I don't mind very much whether it gets delayed or not. It's trivial enough that it shouldn't cause much work/conflicts/whatever next round, and I am running patched versions anyway, so ...

Andres
Re: tsearch parser inefficiency if text includes urls or emails - new version
From: "Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote:
> [concerns addressed in new patch version]

Looks good to me. I'm marking this Ready for Committer.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Andres Freund <andres@anarazel.de> wrote:
>> [concerns addressed in new patch version]
>
> Looks good to me. I'm marking this Ready for Committer.

Applied with minor editorialization --- improving the comments mostly.

            regards, tom lane