Discussion: How to find greatest record before known values fast


How to find greatest record before known values fast

From: "Andrus"
I’m looking for a way to increase select statement speed in Postgres 9.0.
 
The table has the required index present. The desired result could be obtained immediately using the (kuupaev, kellaaeg) index.
However, Postgres scans all rows:
 
explain analyze SELECT
        max( kuupaev||kellaaeg ) as res
      from firma2.ALGSA
      where laonr=1 and kuupaev <=current_date and
         (kuupaev,kellaaeg) <= ( current_date, '23 59'  )
 
    "Aggregate  (cost=6932.65..6932.67 rows=1 width=10) (actual time=1608.590..1608.592 rows=1 loops=1)"
    "  ->  Seq Scan on algsa  (cost=0.00..6571.49 rows=144464 width=10) (actual time=0.032..922.431 rows=144458 loops=1)"
    "        Filter: ((laonr = 1::numeric) AND (kuupaev <= ('now'::text)::date) AND (ROW(kuupaev, kellaaeg) <= ROW(('now'::text)::date, '23 59'::bpchar)))"
    "Total runtime: 1608.846 ms"
 
In the real query, variable parameters are used instead of 1, current_date and '23 59'.
 
The table has both indexes present, but Postgres will not use them.
The indexes can be changed and the query can be rewritten if that helps.
The table structure cannot be changed: char columns cannot be replaced with varchar columns, kuupaev must stay of type date and kellaaeg of type char(5).
 
The query contains the redundant condition `kuupaev <= current_date`, but the index is still not used.
 
I also tried `SELECT max( (kuupaev, kellaaeg) )` but got an error that the max() function does not exist.
 
How can I speed up this query?
 
 
The table structure is:
 
    CREATE TABLE firma2.algsa
    (
      id serial NOT NULL,
      laonr numeric(2,0),
      kuupaev date NOT NULL,
      kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
      osak character(10) NOT NULL,
      toode character(20) NOT NULL,
      partii character(15),
      kogus numeric(12,4) NOT NULL DEFAULT 0,
      hind numeric(15,5) NOT NULL DEFAULT 0,
      kulum numeric(15,5) NOT NULL DEFAULT 0,
      tegkogus numeric(12,4),
      stkuupaev date,
      klient character(12),
      masin character(5),
      CONSTRAINT algsa_pkey PRIMARY KEY (id)
    );
   
   
    CREATE INDEX algsa_kuupaev_idx
      ON firma2.algsa
      USING btree
      (kuupaev);
   
    CREATE INDEX algsa_kuupaev_kellaaeg_idx
      ON firma2.algsa
      USING btree
      (kuupaev, kellaaeg);
 
using
 
"PostgreSQL 9.0.3, compiled by Visual C++ build 1500, 32-bit"
 
Posted also in
http://stackoverflow.com/questions/26165745/how-find-greatest-tuple-before-given-2-column-tuple-in-postgres-fast
 
 
Andrus.

Re: How to find greatest record before known values fast

From: Jim Nasby
So kellaaeg is a time? Your best bet here would be to create an index that is an actual timestamp comprised of both
kuupaev and kellaaeg. You could do this with to_timestamp by concatenating both fields together, or it may be easier to
replace the space in kellaaeg with a colon and cast it to time, then add the two:

   kuupaev + replace( kellaaeg, ' ', ':' )::time

I know you can't alter the table, but can you create a view on top of the table? If you did that, you could have a real
timestamp field in the view that is calculated from kuupaev and kellaaeg, and you can create a functional index that uses
the same calculation. That would be the easiest way to use this.
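
A minimal sketch of that approach (the object names below are invented). Note that the cast from text to time is only
STABLE, since the input parser also accepts strings like 'now', so for an index the expression has to be wrapped in a
function declared IMMUTABLE; that is safe as long as kellaaeg only ever holds literal 'HH MM' values:

    -- hypothetical names throughout; a sketch, not a tested recipe
    CREATE FUNCTION firma2.algsa_ts(kuupaev date, kellaaeg char)
    RETURNS timestamp LANGUAGE sql IMMUTABLE AS
    $$ SELECT $1 + replace($2, ' ', ':')::time $$;

    -- expression index matching the view's computed column
    CREATE INDEX algsa_ts_idx
      ON firma2.algsa (firma2.algsa_ts(kuupaev, kellaaeg));

    -- view exposing the derived timestamp next to the original columns
    CREATE VIEW firma2.algsa_ts_v AS
      SELECT a.*, firma2.algsa_ts(a.kuupaev, a.kellaaeg) AS toimumisaeg
      FROM firma2.algsa a;

A query filtering or aggregating on that derived column can then match the expression index, although the laonr
condition is still not covered (see Tom Lane's reply below).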

On 10/2/14, 3:49 PM, Andrus wrote:
> I’m looking for a way to increase select statement speed in Postgres 9.0.
> Table has required index present. Desired result can obtained using index (kuupaev,kellaaeg) immediately.
> However Postgres scans all rows:
> explain analyze SELECT
>         max( kuupaev||kellaaeg ) as res
>       from firma2.ALGSA
>       where laonr=1 and kuupaev <=current_date and
>          (kuupaev,kellaaeg) <= ( current_date, '23 59'  )
>     "Aggregate  (cost=6932.65..6932.67 rows=1 width=10) (actual time=1608.590..1608.592 rows=1 loops=1)"
>     "  ->  Seq Scan on algsa (cost=0.00..6571.49 rows=144464 width=10) (actual time=0.032..922.431 rows=144458 loops=1)"
>     "        Filter: ((laonr = 1::numeric) AND (kuupaev <= ('now'::text)::date) AND (ROW(kuupaev, kellaaeg) <= ROW(('now'::text)::date, '23 59'::bpchar)))"
>     "Total runtime: 1608.846 ms"
> In real query instead of 1, current_date and '23 59' there are variable parameters.
> Table has both indexes present but postgres will not use them.
> Indexes can changed and query can re-written if this helps.
> Table structure cannot changed. char columns cannot replaced with varchar columns. kuupaev must be date and kellaaeg must be char(5) type.
> Query contains reduntant condition `kuupaev <=current_date` but index is still not used.
> I tried also `SELECT max( (kuupaev,kellaaeg ))` but got error that max() function does not exist.
> How to speed this query ?
> Table structure is :
>     CREATE TABLE firma2.algsa
>     (
>       id serial NOT NULL,
>       laonr numeric(2,0),
>       kuupaev date NOT NULL,
>       kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
>       osak character(10) NOT NULL,
>       toode character(20) NOT NULL,
>       partii character(15),
>       kogus numeric(12,4) NOT NULL DEFAULT 0,
>       hind numeric(15,5) NOT NULL DEFAULT 0,
>       kulum numeric(15,5) NOT NULL DEFAULT 0,
>       tegkogus numeric(12,4),
>       stkuupaev date,
>       klient character(12),
>       masin character(5),
>       CONSTRAINT algsa_pkey PRIMARY KEY (id)
>     );
>     CREATE INDEX algsa_kuupaev_idx
>       ON firma2.algsa
>       USING btree
>       (kuupaev);
>     CREATE INDEX algsa_kuupaev_kellaaeg_idx
>       ON firma2.algsa
>       USING btree
>       (kuupaev, kellaaeg);
> using
> "PostgreSQL 9.0.3, compiled by Visual C++ build 1500, 32-bit"
> Posted also in
> http://stackoverflow.com/questions/26165745/how-find-greatest-tuple-before-given-2-column-tuple-in-postgres-fast
> Andrus.



Re: How to find greatest record before known values fast

From: David G Johnston
Andrus Moor wrote
> Table has both indexes present but postgres will not use them.
> Indexes can changed and query can re-written if this helps.
> Table structure cannot changed. char columns cannot replaced with varchar
> columns. kuupaev must be date and kellaaeg must be char(5) type.

Don't discard the option to add additional trigger-maintained columns to the
table and then modify the where clause to use those new columns while still
outputting the originals.
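
One way that could look, using a trigger-maintained timestamp column (all names below are hypothetical, and this is only
a sketch of the idea; existing rows would also need a one-off UPDATE to populate the new column):

    -- hypothetical column, function and trigger names
    ALTER TABLE firma2.algsa ADD COLUMN toimumine timestamp;

    CREATE FUNCTION firma2.algsa_set_toimumine() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
      -- keep the derived column in sync with the original date + char(5) pair
      NEW.toimumine := NEW.kuupaev + replace(NEW.kellaaeg, ' ', ':')::time;
      RETURN NEW;
    END $$;

    CREATE TRIGGER algsa_toimumine_trg
      BEFORE INSERT OR UPDATE ON firma2.algsa
      FOR EACH ROW EXECUTE PROCEDURE firma2.algsa_set_toimumine();

    CREATE INDEX algsa_laonr_toimumine_idx ON firma2.algsa (laonr, toimumine);

The WHERE clause would then filter on laonr and toimumine, while the SELECT list keeps returning kuupaev and kellaaeg
unchanged.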

You also need to explain whether you can update from 9.0.3 to 9.0.18.

As Craig Ringer already said, working with seldom-used types in unusual ways
on ancient releases means that you may be stuck with second-rate workarounds
or just living with the performance degradation.  I guess at worst you could
delete data so the scan doesn't take as long...

David J.





Re: How to find greatest record before known values fast

From: Tom Lane
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> So kellaaeg is a time? Your best bet here would be to create an index
> that is an actual timestamp comprised of both kuupaev and kellaaeg.

The real problem with this query, or at least with the index design,
is that the index design isn't accounting for the need to constrain
"laonr".  The best way to create the index is with laonr first, and
instead of the max() write something like

regression=# create table foo (f1 int, f2 date, f3 bpchar(5));
CREATE TABLE
regression=# create index on foo (f1, f2, f3);
CREATE INDEX
regression=# explain select * from foo where f1 = 1 and (f2,f3) <= (current_date, '23 59') order by f2 desc, f3 desc limit 1;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.01..5.44 rows=1 width=17)
   ->  Index Scan Backward using foo_f1_f2_f3_idx on foo  (cost=0.01..16.32 rows=3 width=17)
         Index Cond: ((f1 = 1) AND (ROW(f2, f3) <= ROW(('now'::text)::date, '23 59'::bpchar)))
(3 rows)

(tested on 9.0.18, should work on newer versions too)
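
Applied to the original table, the same idea would look roughly like this (index name invented, untested against the
poster's data):

    CREATE INDEX algsa_laonr_kuupaev_kellaaeg_idx
      ON firma2.algsa (laonr, kuupaev, kellaaeg);

    SELECT kuupaev, kellaaeg
      FROM firma2.algsa
     WHERE laonr = 1
       AND (kuupaev, kellaaeg) <= (current_date, '23 59')
     ORDER BY kuupaev DESC, kellaaeg DESC
     LIMIT 1;

With equality on the leading column, a backward scan of that index can satisfy both the row comparison and the
ORDER BY, so the LIMIT 1 replaces the max().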

BTW, the fact that newer versions are refusing to create an index on
"kuupaev||kellaaeg" should not be dismissed as mere pedantry.
The reason for that is that this expression involves a cast from
date to text, and the format of the text you get depends on DateStyle.
And that should call your attention to the fact that *the original query
gives the wrong answer*, or at least an answer that I bet is not the one
you want, unless DateStyle chances to be ISO.
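
For illustration, a sketch of the DateStyle hazard (not part of the original message):

    SET datestyle = 'ISO';
    SELECT DATE '2014-10-02' || '23 59';   -- '2014-10-0223 59'  (sorts year-first, so max() happens to work)
    SET datestyle = 'German';
    SELECT DATE '2014-10-02' || '23 59';   -- '02.10.201423 59'  (sorts day-first, so max() picks the wrong rows)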

IMO, one of the ten deadly sins of database design is using text-string
mashing to accomplish operations that are not naturally textual.  This
query and the table design itself are in dire need of sackcloth and ashes.
Had the table designer had the wit to use a timestamp field rather than
this unholy mashup, we'd not be having this conversation.

            regards, tom lane


Re: How to find greatest record before known values fast

From: "Andrus"
Hi!

>So kellaaeg is a time? Your best bet here would be to create an index that
>is an actual timestamp comprised of both kuupaev and kellaaeg. You could
>do this with to_timestamp by concatenating both fields together, or it may
>be easier to replace the space in kellaaeg with a colon and cast it to
>time, then add the two:
>   kuupaev + replace( kellaaeg, ' ', ':' )::time
>I know you can't alter the table, but can you create a view on top of the
>table? If you did that, you could have a real timestamp field in the view
>that is calculated from kuupaev and kellaaeg and you can create a
>functional index that uses the same calculation. That would be the easiest
>way to use this.

Thank you.
I solved this by creating a composite index on 3 columns and rewriting the query
as Tom recommended.
Tom's recommendation was the simpler option for me.

Andrus.



Re: How to find greatest record before known values fast

From: Merlin Moncure
On Fri, Oct 3, 2014 at 1:28 AM, Andrus <kobruleht2@hot.ee> wrote:
> Hi!
>
>> So kellaaeg is a time? Your best bet here would be to create an index that
>> is an actual timestamp comprised of both kuupaev and kellaaeg. You could do
>> this with to_timestamp by concatenating both fields together, or it may be
>> easier to replace the space in kellaaeg with a colon and cast it to time,
>> then add the two:
>>   kuupaev + replace( kellaaeg, ' ', ':' )::time
>> I know you can't alter the table, but can you create a view on top of the
>> table? If you did that, you could have a real timestamp field in the view
>> that is calculated from kuupaev and kellaaeg and you can create a functional
>> index that uses the same calculation. That would be the easiest way to use
>> this.
>
>
> Thank you.
> I solved this by creating composite index on 3 columns and re-writing query
> as Tom recommended.
> It looks like Tom's recommendation is simpler for me.

Also,

*) quit using char() type.  use varchar() -- at least in postgres,
it's better in every respect.  the char type pads out the fields on
disk.  (this is a common noobie error in postgres since that may not
necessarily be true in other databases)

*) numeric type gives fixed point operations and clean comparisons and
so is generally necessary, but in some cases floating point
(float4/float8) is more compact and gives better performance without
much downside.

*) 9.0 is up to 9.0.18.  Time to upgrade. (it's a binary only replacement).

*) indexes like this: (kuupaev), (kuupaev, kellaaeg) are mostly
redundant because the composite index can service queries on kuupaev
nearly as well as the single field index.  Either kill the single
field index to get better memory utilization or reverse the fields in
the composite index to (kellaaeg, kuupaev) if you make frequent
searches on 'kellaaeg'.

Indexes match quals in left to right order to give the best
performance.   So, an index on a,b,c gives good performance for
searches on (a), (a,b), and (a,b,c).  There are certain limited
exceptions to this rule but it's a good design principle to know.
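
A sketch of the cleanup described above, assuming no other queries depend on the single-column index:

    -- the composite index already serves kuupaev-only searches via its leading column
    DROP INDEX firma2.algsa_kuupaev_idx;

    -- e.g. still satisfied by algsa_kuupaev_kellaaeg_idx:
    --   SELECT * FROM firma2.algsa WHERE kuupaev = DATE '2014-10-03';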

merlin


faster way to calculate top "tags" for a "resource" based on a column

From: Jonathan Vanasco
I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one.  I'm hoping someone has another strategy.

I have 2 tables:
    resource
    resource_2_tag

I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table.

both tables have around 1.6 million records.

If the database needs to warm up and read into cache, this can take 60 seconds to read the data off disk.
If the database doesn't need to warm up, it averages 1.76 seconds.

The 1.76s time is troubling me.
Searching for the discrete elements of this is pretty lightweight.

here's an explain --  http://explain.depesz.com/s/PndC

I tried a subquery instead of a join, and the planner optimized it to the same plan.

I'm hoping someone will see something that I just don't see.



      Table "public.resource_2_tag"
        Column         |  Type   | Modifiers
-----------------------+---------+-----------
 resource_id           | integer |
 tag_id                | integer |
Indexes:
    "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
    "_idx_speed_resource_2_tag__tag_id" btree (tag_id)

                                                  Table "public.resource"
               Column                |            Type             |                        Modifiers
      

-------------------------------------+-----------------------------+----------------------------------------------------------
 id                                  | integer                     | not null default
nextval('resource_id_seq'::regclass)
resource_attribute1_id               | integer                     |
lots of other columns                |                             |
Indexes:
    "resource_attribute1_idx" btree (resource_attribute1_id)

--------------------------------------------------------------------------------

select count(*) from resource;
-- 1669729

select count(*) from resource_2_tag;
-- 1676594

select count(*) from resource where resource_attribute1_id = 614;
-- 5184
-- 4.386ms

select id from resource where resource_attribute1_id = 614;
-- 5184
-- 87.303ms

popping the 5k elements into an "in" clause will run the query in around 100ms.


EXPLAIN ANALYZE
SELECT
    resource_2_tag.tag_id AS resource_2_tag_tag_id,
    count(resource_2_tag.tag_id) AS counted
FROM
    resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE
    resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;

--------------------------------------------------------------------------------

 Limit  (cost=76659.61..76659.68 rows=25 width=4) (actual time=1502.902..1502.913 rows=25 loops=1)
   ->  Sort  (cost=76659.61..76672.47 rows=5141 width=4) (actual time=1502.900..1502.906 rows=25 loops=1)
         Sort Key: (count(resource_2_tag.tag_id))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=76463.13..76514.54 rows=5141 width=4) (actual time=1487.016..1495.206 rows=13887 loops=1)
               ->  Hash Join  (cost=35867.88..76437.42 rows=5141 width=4) (actual time=97.654..1453.337 rows=27068 loops=1)
                     Hash Cond: (resource_2_tag.resource_id = resource.id)
                     ->  Seq Scan on resource_2_tag  (cost=0.00..25847.94 rows=1676594 width=8) (actual time=0.032..513.046 rows=1676594 loops=1)
                     ->  Hash  (cost=35803.88..35803.88 rows=5120 width=4) (actual time=97.576..97.576 rows=5184 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 183kB
                           ->  Bitmap Heap Scan on resource  (cost=272.68..35803.88 rows=5120 width=4) (actual time=5.911..90.264 rows=5184 loops=1)
                                 Recheck Cond: (resource_attribute1_id = 614)
                                 ->  Bitmap Index Scan on resource_attribute1_idx  (cost=0.00..271.40 rows=5120 width=0) (actual time=3.575..3.575 rows=5184 loops=1)
                                       Index Cond: (resource_attribute1_id = 614)
 Total runtime: 1503.146 ms




Re: How to find greatest record before known values fast

From: "Andrus"
Hi!

Thank you for explanations.

>  the char type pads out the fields on disk.

It looks like you wrote that char takes more disk space.

from

http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf

page 28:

Unlike many databases, char(n) is NOT stored as a fixed-sized field
in Postgres. It is treated exactly the same as varchar(n) except for
being padded

So char type does not take more space than varchar.

Andrus.



Re: How to find greatest record before known values fast

From: Merlin Moncure
On Fri, Oct 3, 2014 at 3:28 PM, Andrus <kobruleht2@hot.ee> wrote:
> Hi!
>
> Thank you for explanations.
>
>>  the char type pads out the fields on disk.
>
>
> It looks like you wrote that char takes more disk space.
>
> from
>
> http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf
>
> page 28:
>
> Unlike many databases, char(n) is NOT stored as a fixed-sized field in
> Postgres. It is treated exactly the same as varchar(n) except for being
> padded
>
> So char type does not take more space than varchar.

I beg to differ:

postgres=#  create table t1(v char(100));
CREATE TABLE
postgres=#  create table t2(v varchar(100));
CREATE TABLE
postgres=# insert into t1 select '' from generate_series(1,1000000);
INSERT 0 1000000
Time: 5951.023 ms
postgres=# insert into t2 select '' from generate_series(1,1000000);
INSERT 0 1000000
Time: 2083.323 ms
postgres=# select pg_size_pretty(pg_relation_size(oid)) from pg_class
where relname = 't1';
 pg_size_pretty
────────────────
 128 MB
(1 row)

postgres=# select pg_size_pretty(pg_relation_size(oid)) from pg_class
where relname = 't2';
 pg_size_pretty
────────────────
 35 MB
(1 row)

merlin


Re: How to find greatest record before known values fast

From: Adrian Klaver
On 10/03/2014 01:28 PM, Andrus wrote:
> Hi!
>
> Thank you for explanations.
>
>>  the char type pads out the fields on disk.
>
> It looks like you wrote that char takes more disk space.
>
> from
>
> http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf
>
>
> page 28:
>
> Unlike many databases, char(n) is NOT stored as a fixed-sized field in
> Postgres. It is treated exactly the same as varchar(n) except for being
> padded
>
> So char type does not take more space than varchar.

Which directly contradicts the information on page 27:

Character Types (or Strings)

  Name          Description
  varchar(n)    variable-length with limit
  char(n)       fixed-length, blank padded
  text          variable unlimited length


and the docs:

http://www.postgresql.org/docs/9.3/interactive/datatype-character.html

Values of type character are physically padded with spaces to the
specified width n, and are stored and displayed that way. However, the
padding spaces are treated as semantically insignificant. Trailing
spaces are disregarded when comparing two values of type character, and
they will be removed when converting a character value to one of the
other string types. Note that trailing spaces are semantically
significant in character varying and text values, and when using pattern
matching, e.g. LIKE, regular expressions.


Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a few
extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance advantages
in some other database systems, there is no such advantage in
PostgreSQL; in fact character(n) is usually the slowest of the three
because of its additional storage costs. In most situations text or
character varying should be used instead.


>
> Andrus.
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: How to find greatest record before known values fast

From: Tom Lane
Adrian Klaver <adrian.klaver@aklaver.com> writes:
>> page 28:
>>
>> Unlike many databases, char(n) is NOT stored as a fixed-sized field
>> in Postgres. It is treated exactly the same as varchar(n) except for
>> being padded

> Which directly contradicts the information on page 27:

This info is probably not as well worded as it could be, but it's not
really wrong.  The key point is that char(N) is blank-padded (thereby
wasting space) to be N *characters*, but that is not necessarily N
*bytes*, because of possible multi-byte characters.  Therefore the engine
has to treat it as a variable-length datatype.
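
A quick illustration, assuming a UTF-8 database:

    -- both values are padded to 5 characters, but not to the same number of bytes
    SELECT octet_length('23 59'::char(5)) AS ascii_bytes,   -- 5
           octet_length('ää 59'::char(5)) AS utf8_bytes;    -- 7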

I believe in some other DBMSes, char(N) means N *bytes* and is
treated as a fixed-size datatype.  Dunno what they do with multibyte
encodings.

            regards, tom lane


Re: How to find greatest record before known values fast

From: Adrian Klaver
On 10/03/2014 03:59 PM, Tom Lane wrote:
> Adrian Klaver <adrian.klaver@aklaver.com> writes:
>>> page 28:
>>>
>>> Unlike many databases, char(n) is NOT stored as a fixed-sized field
>>> in Postgres. It is treated exactly the same as varchar(n) except for
>>> being padded
>
>> Which directly contradicts the information on page 27:
>
> This info is probably not as well worded as it could be, but it's not
> really wrong.  The key point is that char(N) is blank-padded (thereby
> wasting space) to be N *characters*, but that is not necessarily N
> *bytes*, because of possible multi-byte characters.  Therefore the engine
> has to treat it as a variable-length datatype.

Well that is a distinction I had missed, thanks for the heads up. So the
bottom line is char(N) is variable length, but values that do not reach
length N will require more storage space than varchar(n).

>
> I believe in some some other DBMSes, char(N) means N *bytes* and is
> treated as a fixed-size datatype.  Dunno what they do with multibyte
> encodings.
>
>             regards, tom lane
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: How to find greatest record before known values fast

From: "Andrus"
Hi!

Thank you.

People often look for sales over different periods using different filters.
There are a lot of sales, and every sale is an individual record in the sales table.
So increasing sequential scan speed is important.

I tried

create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);

and after that measured speed of

select sum(p) from t1

and

select sum(p) from t2

both of them took approximately 800 ms

So it looks like there is no difference in sequential scan speed and thus no
need to change the char types.

Andrus



Re: How to find greatest record before known values fast

From: "Andrus"
Hi!

Thank you.

In my db people often look for sales over different periods using different
filters and sum them.
There are a lot of sales, and every sale is an individual record in the sales table.
So increasing sequential scan speed is important.

I tried

create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);

and after that measured speed of

select sum(p) from t1

and

select sum(p) from t2

Both of them took approximately 800 ms

Also, select max(length(v)) from t1 and select max(length(v)) from t2
run at the same speed.

I also tested with a production database containing a 1600000-row table with a
char(100) column, by running

select sum(decimalcolumn) from testtable

before and after running

alter table testtable alter char100column type varchar(100)

The select sum command took 1700 ms in both cases.

So there is no difference in sequential scan speed.
Replacing char with varchar requires re-writing some parts of code.
Disk space is a minor issue compared to the cost of a code rewrite.
It looks like it is not reasonable to replace char with varchar.

Andrus



Re: How to find greatest record before known values fast

From: Merlin Moncure
On Sat, Oct 4, 2014 at 3:46 PM, Andrus <kobruleht2@hot.ee> wrote:
> In my db people often looks for different period sales using different
> filters and will sum
> There are lot of sales and every sale is individual record in sales table.
> So increasing sequential scan speed is important.
>
> I tried
>
> create table t1(v char(100), p numeric(12,5));
> create table t2(v varchar(100), p numeric(12,5));
> insert into t1 select '', generate_series from generate_series(1,1000000);
> insert into t2 select '', generate_series from generate_series(1,1000000);
>
> and after that measured speed of
>
> select sum(p) from t1
>
> and
>
> select sum(p) from t2
>
> Both of them took approximately 800 ms
>
> So there is no difference in sequential scan speed.
> Replacing char with varchar requires re-writing some parts of code.
> Disk space is minor issue compared to cost of code-rewrite.
> It looks like it is not reasonable to replace char with varchar.

Sure, in this trivial case it's not different (both tables are small,
fit in cache, and the numeric calculation is where the bulk of time is
getting spent).   But if your table is double the size it's going to
have impacts on many real world workloads.  I'm not in any way saying
to go change up your database but I'd definitely avoid char() for all
new code.

merlin


Re: faster way to calculate top "tags" for a "resource" based on a column

From: Jim Nasby
On 10/3/14, 11:21 AM, Jonathan Vanasco wrote:
> I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one.  I'm hoping someone has another strategy.
>
> I have 2 tables:
>     resource
>     resource_2_tag
>
> I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table.
>
> both tables have around 1.6million records.
>
> If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
> If the database doesn't need to warm up, it averages 1.76seconds.
>
> The 1.76s time is troubling me.
> Searching for the discrete elements of this is pretty lightweight.
>
> here's an explain --  http://explain.depesz.com/s/PndC
>
> I tried a subquery instead of a join, and the query optimized the plan to the same.
>
> i'm hoping someone will see something that I just don't see.
>
>
>
>        Table "public.resource_2_tag"
>          Column         |  Type   | Modifiers
> -----------------------+---------+-----------
>   resource_id           | integer |
>   tag_id                | integer |
> Indexes:
>      "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
>      "_idx_speed_resource_2_tag__tag_id" btree (tag_id)
>
>                                                    Table "public.resource"
>                 Column                |            Type             |                        Modifiers
> -------------------------------------+-----------------------------+----------------------------------------------------------
>   id                                  | integer                     | not null default nextval('resource_id_seq'::regclass)
> resource_attribute1_id               | integer                     |
> lots of other columns                |                             |
> Indexes:
>     "resource_attribute1_idx" btree (resource_attribute1_id)
>
> --------------------------------------------------------------------------------
>
> select count(*) from resource;
> -- 1669729
>
> select count(*) from resource_2_tag;
> -- 1676594
>
> select count(*) from resource where resource_attribute1_id = 614;
> -- 5184
> -- 4.386ms
>
> select id from resource where resource_attribute1_id = 614;
> -- 5184
> -- 87.303ms
>
> popping the 5k elements into an "in" clause, will run the query in around 100ms.
>
>
> EXPLAIN ANALYZE
> SELECT
>     resource_2_tag.tag_id AS resource_2_tag_tag_id,
>     count(resource_2_tag.tag_id) AS counted
> FROM
>     resource_2_tag
> JOIN resource ON resource.id = resource_2_tag.resource_id
> WHERE
>     resource.resource_attribute1_id = 614
> GROUP BY resource_2_tag.tag_id
> ORDER BY counted DESC
> LIMIT 25 OFFSET 0;
Don't join to the resource table; there's no reason to because you're not pulling anything from it.

If for some reason you do need data out of the resource table, do the LIMIT 25 first, in a sub-select:

SELECT r.*, counted
     FROM resource r
         JOIN (
             SELECT tag_id, count(*) AS counted
                 FROM resource_2_tag
                 GROUP BY tag_id
                 ORDER BY counted DESC
                 LIMIT 25
         ) t ON ...
;

--
Jim Nasby, Data Architect, Blue Treble
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: faster way to calculate top "tags" for a "resource" based on a column

From: Jonathan Vanasco
On Oct 6, 2014, at 5:56 PM, Jim Nasby wrote:

> Don't join to the resource table; there's no reason to because you're not pulling anything from it.

Thanks the reply!

I'm not pulling anything from the resource table, but the join is necessary because I'm filtering based on it. ( see the WHERE clause )

I'm not trying to find the most used overall tags, but the ones that are used by resources with a (variable) id on a column in the resources table.



Re: faster way to calculate top "tags" for a "resource" based on a column

From: Marc Mamin
>I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one.  I'm hoping someone has another strategy.
>
>I have 2 tables:
>        resource
>        resource_2_tag
>
>I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table.
>
>both tables have around 1.6million records.
>
>If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
>If the database doesn't need to warm up, it averages 1.76seconds.
>
>The 1.76s time is troubling me.
>Searching for the discrete elements of this is pretty lightweight.
>
>here's an explain --  http://explain.depesz.com/s/PndC
>
>I tried a subquery instead of a join, and the query optimized the plan to the same.
>
>i'm hoping someone will see something that I just don't see.
>
>
>
>      Table "public.resource_2_tag"
>        Column         |  Type   | Modifiers
>-----------------------+---------+-----------
> resource_id           | integer |
> tag_id                | integer |
>Indexes:
>    "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
>    "_idx_speed_resource_2_tag__tag_id" btree (tag_id)
>
>                                                  Table "public.resource"
>               Column                |            Type             |                        Modifiers
>-------------------------------------+-----------------------------+----------------------------------------------------------
> id                                  | integer                     | not null default nextval('resource_id_seq'::regclass)
>resource_attribute1_id               | integer                     |
>lots of other columns                |                             |
>Indexes:
>        "resource_attribute1_idx" btree (resource_attribute1_id)
>
>--------------------------------------------------------------------------------
>
>select count(*) from resource;
>-- 1669729
>
>select count(*) from resource_2_tag;
>-- 1676594
>
>select count(*) from resource where resource_attribute1_id = 614;
>-- 5184
>-- 4.386ms
>
>select id from resource where resource_attribute1_id = 614;
>-- 5184
>-- 87.303ms
>
>popping the 5k elements into an "in" clause, will run the query in around 100ms.
>
>
>EXPLAIN ANALYZE
>SELECT
>        resource_2_tag.tag_id AS resource_2_tag_tag_id,
>        count(resource_2_tag.tag_id) AS counted
>FROM
>        resource_2_tag
>JOIN resource ON resource.id = resource_2_tag.resource_id
>WHERE
>        resource.resource_attribute1_id = 614
>GROUP BY resource_2_tag.tag_id
>ORDER BY counted DESC
>LIMIT 25 OFFSET 0;

Hi,
it seems to me that your subquery may deliver duplicate ids.
And with the selectivity of your example, I would expect index usage
instead of a table scan. You may check how up to date your statistics are
and try raising the statistics target on the column resource_2_tag.tag_id
(a sketch of that follows the query below).
Also try a CTE form for your query:


WITH A as (SELECT DISTINCT id FROM resource
           WHERE resource_attribute1_id = 614
          )
SELECT
        resource_2_tag.tag_id AS resource_2_tag_tag_id,
        count(resource_2_tag.tag_id) AS counted
FROM
        resource_2_tag
JOIN A ON A.id = resource_2_tag.resource_id
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25;
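
A sketch of the statistics-target suggestion above (the value 500 is just an example):

    ALTER TABLE resource_2_tag ALTER COLUMN tag_id SET STATISTICS 500;
    ANALYZE resource_2_tag;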

regards,
Marc Mamin

Re: faster way to calculate top "tags" for a "resource" based on a column

From: Jonathan Vanasco
On Oct 7, 2014, at 10:02 AM, Marc Mamin wrote:

> Hi,
> it seems to me that your subquery may deliver duplicate ids.
> And with the selectivity of your example, I would expect an index usage
> instead of a table scan. You may check how up to date your statistics are
> and try to raise the statistic target on the column resource_2_tag.tag_id.
> Also try a CTE form for your query:


It shouldn't be able to deliver duplicate ids.

=> SELECT COUNT(*) FROM (SELECT DISTINCT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

=> SELECT COUNT(*) FROM (SELECT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

However, adding in the DISTINCT drastically changed the query plan, and did give a speedup.

Your comment made me focus on the notion of a Table Scan. I assumed it did the seq scan - and there would not be much savings otherwise - because the table is just 2 ids.

I was wrong.

I noticed that I never put a PRIMARY KEY constraint on that table.

So I tried adding a PRIMARY KEY constraint, then running vacuum analyze...

And that solved all my problems.

The original query ended up being the fastest at 260ms (down from 1760ms):

Join - 260ms
Subquery w/DISTINCT - 300ms
CTE - 330
CTE w/DISTINCT - 345ms
Subquery (no DISTINCT) - 1500ms
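
For reference, the fix described above amounts to roughly the following; the thread does not spell out which columns the
new primary key covered, so (resource_id, tag_id) is an assumption:

    -- assumed key columns; adjust to whatever actually identifies a row
    ALTER TABLE resource_2_tag ADD PRIMARY KEY (resource_id, tag_id);
    VACUUM ANALYZE resource_2_tag;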