Discussion: pg_trgm version 1.2
create extension pg_trgm version "1.1";
create table foo as select
md5(random()::text)|| case when random()<0.000005 then 'lmnop' else '123' end ||
md5(random()::text) as bar
from generate_series(1,10000000);
create index on foo using gin (bar gin_trgm_ops);
--some queries
alter extension pg_trgm update to "1.2";
--close, reopen, more queries
select count(*) from foo where bar like '%12344321lmnabcddd%';
V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
V1.2: Time: 2.839 ms --- after repeated execution to warm the cache
You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in core) from 4 to some higher value (which it probably should be anyway, but there will always be a case where it needs to be higher than you can afford it to be, so a real solution is needed).
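For readers who have not looked at that part of core: as I understand it, when an opclass provides only a boolean consistent function, GIN emulates the ternary answer by trying every true/false assignment of the "maybe" entries and gives up once there are more than MAX_MAYBE_ENTRIES of them. The following is a rough standalone sketch of that idea, not the server's code. The type, the constants, and the names shim_triconsistent and all_keys_present are stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the server's ternary type; values chosen for this sketch. */
typedef char TernaryValue;
#define T_FALSE 0
#define T_TRUE  1
#define T_MAYBE 2

#define MAX_MAYBE 4   /* mirrors the limit of 4 mentioned above */
#define MAX_KEYS  32  /* arbitrary bound for this sketch */

typedef bool (*BoolConsistentFn)(const bool *check, size_t nkeys);

/*
 * Emulate a ternary answer using only a boolean consistent function:
 * evaluate every assignment of the MAYBE entries. If all assignments
 * agree, the answer is definite; otherwise it is MAYBE.
 */
TernaryValue
shim_triconsistent(BoolConsistentFn fn, const TernaryValue *check, size_t nkeys)
{
    size_t   maybe_idx[MAX_MAYBE];
    size_t   nmaybe = 0;
    bool     bcheck[MAX_KEYS];
    bool     first;
    unsigned combo;

    assert(nkeys <= MAX_KEYS);

    for (size_t i = 0; i < nkeys; i++)
    {
        if (check[i] == T_MAYBE)
        {
            if (nmaybe >= MAX_MAYBE)
                return T_MAYBE;     /* too many unknowns: give up */
            maybe_idx[nmaybe++] = i;
        }
        bcheck[i] = (check[i] == T_TRUE);
    }

    /* Combination 0: every MAYBE entry treated as false. */
    first = fn(bcheck, nkeys);

    /* Remaining 2^nmaybe - 1 combinations. */
    for (combo = 1; combo < (1u << nmaybe); combo++)
    {
        for (size_t j = 0; j < nmaybe; j++)
            bcheck[maybe_idx[j]] = ((combo >> j) & 1u) != 0;
        if (fn(bcheck, nkeys) != first)
            return T_MAYBE;
    }
    return first ? T_TRUE : T_FALSE;
}

/* Hypothetical LIKE-style consistent function: every key must be present. */
bool
all_keys_present(const bool *check, size_t nkeys)
{
    for (size_t i = 0; i < nkeys; i++)
        if (!check[i])
            return false;
    return true;
}
```

Note the last failure mode: with five "maybe" keys and one key known to be absent, the sketch answers "maybe" even though the tuple cannot match. Raising the limit only moves that cliff, and the work grows as 2^n, which is why a native ternary function is the real fix.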
I wasn't sure if this should be a new version of pg_trgm or not, because there is no user visible change other than to performance. But there may be some cases where it results in performance reduction and so it is nice to provide options. Also, I'd like to use it in a back-branch, so versioning seems to be the right way to go there.
There is a lot of code duplication between the binary consistent function and the ternary one. I thought the duplication was necessary in order to support both 1.1 and 1.2 from the same code base.
There may also be some gains in the similarity and regex cases, but I didn't really analyze those for performance.
I've thought about how to document this change. Looking at other examples of contrib modules with multiple versions, I decided that we don't document them, other than in the release notes.
The same patch applies to 9.4 code with a minor conflict in the Makefile, and gives benefits there as well.
Cheers,
Jeff
On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> This patch implements version 1.2 of contrib module pg_trgm.
>
> This supports the triconsistent function, introduced in version 9.4 of the
> server, to make it faster to implement indexed queries where some keys are
> common and some are rare.
>
> I've included the paths to both upgrade and downgrade between 1.1 and 1.2,
> although after doing so you must close and restart the session before you
> can be sure the change has taken effect. There is no change to the on-disk
> index structure
>
> This shows the difference it can make in some cases:
>
> create extension pg_trgm version "1.1";
>
> create table foo as select
> md5(random()::text)|| case when random()<0.000005 then 'lmnop' else '123'
> end ||
> md5(random()::text) as bar
> from generate_series(1,10000000);
>
> create index on foo using gin (bar gin_trgm_ops);
>
> --some queries
>
> alter extension pg_trgm update to "1.2";
>
> --close, reopen, more queries
>
> select count(*) from foo where bar like '%12344321lmnabcddd%';
>
> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
>
> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

Wow! I'm going to test this. I have some data sets for which trigram searching isn't really practical...if the search string touches trigrams with a lot of duplication the algorithm can have trouble beating brute force searches.

trigram searching is important: it's the only way currently to search string encoded structures for partial strings quickly.

merlin
On Mon, Jun 29, 2015 at 7:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
>>
>> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache
>
> Wow! I'm going to test this. I have some data sets for which trigram
> searching isn't really practical...if the search string touches
> trigrams with a lot of duplication the algorithm can have trouble
> beating brute force searches.
>
> trigram searching is important: it's the only way currently to search
> string encoded structures for partial strings quickly.

I ran your patch against stock 9.4 and am happy to confirm massive speedups of pg_trgm; results of 90% reduction in runtime are common. Also, with the new changes it's hard to get the indexed search to significantly underperform brute force searching, which is a huge improvement vs the stock behavior, something that made me very wary of using these kinds of searches in the past.

datatable: 'test2'
rows: ~ 2 million
heap size: 3.3GB (includes several unrelated fields)
index size: 1GB

9.4: stock
9.5: patched

match 50% rows, brute force seq scan
9.4: 11.5s
9.5: 9.1s

match 50% rows, indexed (time is quite variable with 9.4 giving > 40 sec times)
9.4: 21.0s
9.5: 11.8s

match 1% rows, indexed (>90% time reduction!)
9.4: .566s
9.5: .046s

match .1% rows, one selective one non-selective search term, selective term first
9.4: .563s
9.5: .028s

match .1% rows, one selective one non-selective search term, selective term last
9.4: 1.014s
9.5: 0.093s

very nice!

Recently, I examined pg_trgm for an attribute searching system -- it failed due to response time variability and lack of tools to control that. Were your patch in place, I would have passed it. I had a 'real world' data set though.

With this, pg_trgm is basically outperforming SOLR search engine for all cases we're interested in, whereas before, low selectivity cases were having all kinds of trouble.

merlin
This patch implements version 1.2 of contrib module pg_trgm.
This supports the triconsistent function, introduced in version 9.4 of the server, to make it faster to implement indexed queries where some keys are common and some are rare.
I've included the paths to both upgrade and downgrade between 1.1 and 1.2, although after doing so you must close and restart the session before you can be sure the change has taken effect. There is no change to the on-disk index structure.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> This patch implements version 1.2 of contrib module pg_trgm.
>
> This supports the triconsistent function, introduced in version 9.4 of the
> server, to make it faster to implement indexed queries where some keys are
> common and some are rare.

Thank you for the patch! Lack of pg_trgm triconsistent support was a significant miss after the "fast scan" implementation.

> I've included the paths to both upgrade and downgrade between 1.1 and 1.2,
> although after doing so you must close and restart the session before you
> can be sure the change has taken effect. There is no change to the on-disk
> index structure

pg_trgm--1.1.sql and pg_trgm--1.1--1.2.sql are useful for debug, but do you expect them in final commit? As far as I can see, in other contribs we have only the last version and upgrade scripts.

> You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
> core) from 4 to some higher value (which it probably should be anyway, but
> there will always be a case where it needs to be higher than you can afford
> it to be, so a real solution is needed).

Actually, it depends on how long it takes to calculate the consistent function. The cheaper the consistent function is, the higher MAX_MAYBE_ENTRIES could be. Since every function in PostgreSQL may define its cost, GIN could calculate MAX_MAYBE_ENTRIES from the cost of the consistent function.

> There may also be some gains in the similarity and regex cases, but I
> didn't really analyze those for performance.
>
> I've thought about how to document this change. Looking to other example
> of other contrib modules with multiple versions, I decided that we don't
> document them, other than in the release notes.
>
> The same patch applies to 9.4 code with a minor conflict in the Makefile,
> and gives benefits there as well.

Some other notes about the patch:
* You allocate boolcheck array and don't use it.
* Check coding style and formatting, in particular "check[i]==GIN_TRUE" should be "check[i] == GIN_TRUE".
* I think some comments are needed in gin_trgm_triconsistent() about trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph() that way because trigramsMatchGraph() implements a monotonic boolean function.
Jeff Janes <jeff.janes@gmail.com> writes:
> On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov <
> a.korotkov@postgrespro.ru> wrote:
>> pg_trgm--1.1.sql and pg_trgm--1.1--1.2.sql are useful for debug, but do you
>> expect them in final commit? As I can see in other contribs we have only
>> last version and upgrade scripts.

> I did see another downgrade path for different module, but on closer
> inspection it was one that I wrote while trying to analyze that module, and
> then never removed. I have no objection to removing pg_trgm--1.2--1.1.sql
> before the commit, but I don't see what we gain by doing so. If a
> downgrade is feasible and has been tested, why not include it?

Because we don't want to support 1.1 anymore once 1.2 exists. You're supposing that just because you wrote the downgrade script and think it works, there's no further costs associated with having that. Personally, I don't even want to review such a script, let alone document its existence and why someone might want to use it, let alone support 1.1 into the far future.

regards, tom lane
On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> This patch implements version 1.2 of contrib module pg_trgm.
>>
>> This supports the triconsistent function, introduced in version 9.4 of the
>> server, to make it faster to implement indexed queries where some keys are
>> common and some are rare.
>
> Thank you for the patch! Lack of pg_trgm triconsistent support was
> significant miss after "fast scan" implementation.
>
>> I've included the paths to both upgrade and downgrade between 1.1 and 1.2,
>> although after doing so you must close and restart the session before you
>> can be sure the change has taken effect. There is no change to the on-disk
>> index structure
>
> pg_trgm--1.1.sql and pg_trgm--1.1--1.2.sql are useful for debug, but do you
> expect them in final commit? As I can see in other contribs we have only
> last version and upgrade scripts.

I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but I see that that is not the case.

I did see another downgrade path for different module, but on closer inspection it was one that I wrote while trying to analyze that module, and then never removed. I have no objection to removing pg_trgm--1.2--1.1.sql before the commit, but I don't see what we gain by doing so. If a downgrade is feasible and has been tested, why not include it?

>> You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
>> core) from 4 to some higher value (which it probably should be anyway, but
>> there will always be a case where it needs to be higher than you can afford
>> it to be, so a real solution is needed).
>
> Actually, it depends on how long it takes to calculate consistent function.
> The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES could be.
> Since all functions in PostgreSQL may define its cost, GIN could calculate
> MAX_MAYBE_ENTRIES from the cost of consistent function.

The consistent function gets called on every candidate tuple, so if it is expensive then it is also all the more worthwhile to reduce the set of candidate tuples. Perhaps MAX_MAYBE_ENTRIES could be calculated from the log of the maximum of the predictNumberResult entries? Anyway, a subject for a different day....

>> There may also be some gains in the similarity and regex cases, but I
>> didn't really analyze those for performance.
>>
>> I've thought about how to document this change. Looking to other example
>> of other contrib modules with multiple versions, I decided that we don't
>> document them, other than in the release notes.
>>
>> The same patch applies to 9.4 code with a minor conflict in the Makefile,
>> and gives benefits there as well.
>
> Some other notes about the patch:
> * You allocate boolcheck array and don't use it.

That was a bug (at least notionally). trigramsMatchGraph was supposed to be getting boolcheck, not check, passed in to it.

It may not have been a bug in practise, because GIN_MAYBE and GIN_TRUE both test as true when cast to booleans. But it seems wrong to rely on that. Or was it intended to work this way?

I'm surprised the compiler suite doesn't emit some kind of warning on that.

> * Check coding style and formatting, in particular "check[i]==GIN_TRUE"
> should be "check[i] == GIN_TRUE".

Sorry about that, fixed. I also changed it in other places to check[i] != GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE. At first I was concerned we might decide to add a 4th option to the type which would render != GIN_FALSE the wrong way to test for it. But since it is called GinTernaryValue, I think we wouldn't add a fourth thing to it. But perhaps the more verbose form is clearer to some people.

> * I think some comments needed in gin_trgm_triconsistent() about
> trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
> that way because trigramsMatchGraph() implements monotonous boolean
> function.

I have a function-level comment that in all cases, GIN_TRUE is at least as favorable to inclusion of a tuple as GIN_MAYBE. Should I reiterate that comment before trigramsMatchGraph() as well? Calling it a monotonic boolean function is precise and concise, but probably less understandable to people reading the code. At least, if I saw that, I would need to go do some reading before I knew what it meant.
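To make the boolcheck point concrete, here is a small standalone sketch of promoting a ternary check array to a boolean one with the != GIN_FALSE test. It is only an illustration under stated assumptions: TernaryValue and the T_* constants are stand-ins defined here rather than the server's GinTernaryValue, malloc stands in for palloc, and promote_to_bool is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the server's ternary type; values chosen for this sketch. */
typedef char TernaryValue;
#define T_FALSE 0
#define T_TRUE  1
#define T_MAYBE 2

/*
 * Build an explicit boolean array from a ternary one. Both TRUE and MAYBE
 * become true, so nothing depends on how the ternary constants happen to
 * behave when used directly as booleans.
 * Returns a malloc'd array that the caller must free, or NULL on failure.
 */
bool *
promote_to_bool(const TernaryValue *check, size_t nkeys)
{
    bool *boolcheck;

    assert(check != NULL && nkeys > 0);

    boolcheck = malloc(nkeys * sizeof(bool));
    if (boolcheck == NULL)
        return NULL;

    for (size_t i = 0; i < nkeys; i++)
        boolcheck[i] = (check[i] != T_FALSE);

    return boolcheck;
}
```

The explicit loop costs one allocation and one pass over the keys. In exchange, a boolean-only function such as trigramsMatchGraph() is handed real booleans, with no reliance on non-zero ternary values testing as true.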
See Tom Lane's comment about downgrade scripts. I think just removing it is the right solution.

Let's consider the '^(?!.*def).*abc' regular expression as an example. It matches strings which contain 'abc' and don't contain 'def'.

# SELECT 'abc' ~ '^(?!.*def).*abc';
 ?column?
----------
 t
(1 row)

# SELECT 'def abc' ~ '^(?!.*def).*abc';
 ?column?
----------
 f
(1 row)

# SELECT 'abc def' ~ '^(?!.*def).*abc';
 ?column?
----------
 f
(1 row)

Theoretically, our trigram regex processing could support negative matching of the 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false) = true but trigramsMatchGraph(abc = true, def = true) = false. Actually, it doesn't, because trigramsMatchGraph() implements a monotonic function. I just think it should be stated explicitly.
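A toy illustration of why monotonicity matters here, as a sketch only: the two match functions below are hypothetical stand-ins for what trigramsMatchGraph() computes for this regex, not its real logic, and tri_by_promotion is a simplified model of treating "maybe" as "true" before calling a boolean function. Returning "maybe" for every passing case is an assumption of this sketch, on the grounds that a trigram match still needs a recheck.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the server's ternary type; values chosen for this sketch. */
typedef char TernaryValue;
#define T_FALSE 0
#define T_TRUE  1
#define T_MAYBE 2

typedef bool (*MatchFn)(bool abc, bool def);

/* Monotonic: turning an input from false to true never turns the result false. */
bool
match_monotone(bool abc, bool def)
{
    (void) def;                 /* the lookahead is ignored at the trigram level */
    return abc;
}

/* Non-monotonic: what exploiting the negative lookahead would require. */
bool
match_negative(bool abc, bool def)
{
    return abc && !def;
}

/*
 * Ternary evaluation by promotion: treat MAYBE as true and call the boolean
 * function once. A false result is reported as a definite FALSE, which is
 * only safe when fn is monotonic.
 */
TernaryValue
tri_by_promotion(MatchFn fn, TernaryValue abc, TernaryValue def)
{
    assert(abc >= T_FALSE && abc <= T_MAYBE);
    assert(def >= T_FALSE && def <= T_MAYBE);

    if (!fn(abc != T_FALSE, def != T_FALSE))
        return T_FALSE;
    return T_MAYBE;
}
```

With match_negative, tri_by_promotion(match_negative, T_TRUE, T_MAYBE) reports a definite false, yet the tuple would match if 'def' turned out to be absent. With match_monotone that cannot happen, which is the property the comment should spell out.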
On Tue, Jul 7, 2015 at 6:33 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> See Tom Lane's comment about downgrade scripts. I think just remove it is a
> right solution.

The new patch removes the downgrade path and the ability to install the old version.

(If anyone wants an easy downgrade path for testing, they can keep using the prior patch--no functional changes)

It also added a comment before the trigramsMatchGraph call.

I retained the palloc and the loop to promote the ternary array to a binary array. While I also think it is tempting to get rid of that by abusing the type system and would do it that way in my own standalone code, it seems contrary to the way the project usually does things. And I couldn't measure a difference in performance.

> ....
> Let's consider '^(?!.*def).*abc' regular expression as an example. It
> matches strings which contains 'abc' and don't contains 'def'.
>
> # SELECT 'abc' ~ '^(?!.*def).*abc';
>  ?column?
> ----------
>  t
> (1 row)
>
> # SELECT 'def abc' ~ '^(?!.*def).*abc';
>  ?column?
> ----------
>  f
> (1 row)
>
> # SELECT 'abc def' ~ '^(?!.*def).*abc';
>  ?column?
> ----------
>  f
> (1 row)
>
> Theoretically, our trigram regex processing could support negative matching
> of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false) = true
> but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
> doesn't because trigramsMatchGraph() implements a monotonic function. I
> just think it should be stated explicitly.

Do you think it is likely to change to stop being monotonic and so support the (def=GIN_TRUE) => false case?

^(?!.*def) seems like a profoundly niche situation. (Although one that I might actually start using myself now that I know it isn't just a Perl-ism).

It doesn't make any difference to this patch, other than perhaps how to word the comments.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com