Discussion: Querying a time range across multiple partitions
Hi everyone,
I'm trying to run a select query across a span of child partitions, separated into daily tables, in Postgres 9.1.5. The parent looks like this:

# \d logins
Table "public.logins"
Column | Type | Modifiers
-------------+-----------------------------+-----------
username | character varying(50) |
event | character varying(20) |
time | timestamp without time zone |
host | character varying(18) |
hash | character varying(32) |
Triggers:
logins_log_trigger BEFORE INSERT ON logins FOR EACH ROW EXECUTE PROCEDURE logins_insert_trigger()
Number of child tables: 1581 (Use \d+ to list them.)
# \d logins_20140904
Table "public.logins_20140904"
Column | Type | Modifiers
-------------+-----------------------------+-----------
username | character varying(50) |
event | character varying(20) |
time | timestamp without time zone |
host | character varying(18) |
hash | character varying(32) |
Indexes:
"logins_20140904_event" hash (event)
"logins_20140904_event_time" btree (event, "time")
"logins_20140904_username" hash (username)
"logins_20140904_username_time" btree (username, "time")
Check constraints:
"logins_20140904_time_check" CHECK ("time" >= '2014-09-04 00:00:00'::timestamp without time zone AND "time" <= '2014-09-04 23:59:59.99'::timestamp without time zone)
Inherits: logins
explain analyze select time,event from logins
where username='bob' and hash='1234' and time > current_date - interval '1 week';
Result (cost=0.00..765.11 rows=1582 width=14)
-> Append (cost=0.00..765.11 rows=1582 width=14)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
-> Index Scan using logins_20100501_username_time on logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
...
explain analyze select time,event from logins
where username='bob' and hash='1234' and time in ('2014-09-04', '2014-09-05', '2014-09-03');
Result (cost=0.00..2.41 rows=3 width=31) (actual time=0.060..0.060 rows=0 loops=1)
-> Append (cost=0.00..2.41 rows=3 width=31) (actual time=0.060..0.060 rows=0 loops=1)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66) (actual time=0.001..0.001 rows=0 loops=1)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text) AND ("time" = ANY ('{"2014-09-04 00:00:00","2014-09-05 00:00:00","2014-09-03 00:00:00"}'::timestamp without time zone[])))
-> Bitmap Heap Scan on logins_20140903 logins (cost=1.09..1.20 rows=1 width=14) (actual time=0.039..0.039 rows=0 loops=1)
...
So far, I've been messing around with generating a series, then collecting it back into an array, but nothing I've tried seems to work.
A few examples:
explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
generate_series(current_date - interval '3 days', current_date, interval '1 day')
);
ERROR: argument of IN must not return a set
explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
select array_agg(series)
from generate_series(current_date - interval '3 days', current_date, interval '1 day')
as u(series)
);
ERROR: operator does not exist: timestamp without time zone = timestamp without time zone[]
explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
select unnest(array_agg(date_trunc('day',series))) from
generate_series(current_date - interval '3 days', current_date, interval '1 day') as u(series)
);
On 05 Sep 2014, at 19:31, Cal Heldenbrand <cal@fbsdata.com> wrote:
> I'm attempting to run a query that looks something like this:
>
> explain analyze select time,event from logins
> where username='bob' and hash='1234' and time > current_date - interval '1 week';
>
> Result (cost=0.00..765.11 rows=1582 width=14)
>   -> Append (cost=0.00..765.11 rows=1582 width=14)
>     -> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
>       Filter: (((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
>     -> Index Scan using logins_20100501_username_time on logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
> ...
>
> This shows that it's attempting to run the query against all of my 1500 child tables.

What about:

explain analyze select time,event from logins
where username='bob' and hash='1234' and time > (current_date - interval '1 week')::timestamp without time zone;

Also, you don't appear to have an index that starts with "time", so none of the indexes will be particularly efficient at finding a specific time range. It's quite possible that that makes PG think that "time" is not a very good candidate to filter on, simply because the optimizer doesn't look that far.

Alban Hertroys
--
If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
On 9/5/2014 10:31 AM, Cal Heldenbrand wrote:
Number of child tables: 1581
that's an insane number of children. We try and limit it to 50 or so child tables; for instance, 6 months retention by week, of data with millions of rows/day.
-- john r pierce 37N 122W somewhere on the middle of the left coast
> What about:
>
> explain analyze select time,event from logins
> where username='bob' and hash='1234' and time > (current_date - interval '1 week')::timestamp without time zone;

That didn't seem to work either. The thought did occur to me that the query planner wasn't using my combined-column indexes. I tried adding just a btree index on time, and it still had the same problem.

> Also, you don't appear to have an index that starts with "time", so none of the indexes will be particularly efficient at finding a specific time range. It's quite possible that that makes PG think that "time" is not a very good candidate to filter on, simply because the optimizer doesn't look that far.

Your example query there also still scans tables going all the way back to 2010.
Interestingly enough, this query actually works:
# explain analyze select time,event from logins
where username='bob' and hash='1234' and time > '2014-08-29';
Result (cost=0.00..8.21 rows=8 width=20) (actual time=0.074..0.074 rows=0 loops=1)
-> Append (cost=0.00..8.21 rows=8 width=20) (actual time=0.074..0.074 rows=0 loops=1)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (("time" > '2014-08-29 00:00:00'::timestamp without time zone) AND ((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text))
-> Index Scan using logins_20140829_username on logins_20140829 logins (cost=0.00..1.21 rows=1 width=14) (actual time=0.021..0.021 rows=0 loops=1)
Index Cond: ((username)::text = 'bob'::text)
So it seems to me that the query planner isn't pre-evaluating "current_date - interval", but a statically defined timestamp *does* work.
Doesn't that seem weird?
This particular use case is for user behavior data mining. The hardware is beefy, and has tablespaces split out onto SSD/spindle for new & old data. All of my queries are pretty much a nightly cron process, and I don't really care too much about the speed. Scanning the full 4 years of data takes about 30 seconds per query anyway... but I thought it'd be nice to speed it up when the difference is milliseconds vs 30 seconds.
On Fri, Sep 5, 2014 at 4:17 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 9/5/2014 10:31 AM, Cal Heldenbrand wrote:
>> Number of child tables: 1581
>
> that's an insane number of children. We try and limit it to 50 or so child tables; for instance, 6 months retention by week, of data with millions of rows/day.
Cal Heldenbrand wrote:
> explain analyze select time,event from logins
> where username='bob' and hash='1234' and time in (
>   select array_agg(series)
>   from generate_series(current_date - interval '3 days', current_date, interval '1 day')
>   as u(series)
> );
> ERROR: operator does not exist: timestamp without time zone = timestamp without time zone[]

Probably doesn't help, but the above would have to be written:

WHERE time = ANY( SELECT array_agg(...) FROM ... )

I don't know whether the planner is smart enough to optimize on "= ANY()" - whether statically generated or determined at runtime. I am pretty sure that constraint exclusion must be done during the planning stages, and that you have to pass in a literal set of values (or an array) that you generate in a previous query.

David J.
On Fri, Sep 5, 2014 at 10:31 AM, Cal Heldenbrand <cal@fbsdata.com> wrote:
> This shows that it's attempting to run the query against all of my 1500 child tables.
>
> explain analyze select time,event from logins
> where username='bob' and hash='1234' and time > current_date - interval '1 week';
>
> Result (cost=0.00..765.11 rows=1582 width=14)
>   -> Append (cost=0.00..765.11 rows=1582 width=14)
>     -> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
>       Filter: (((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
>     -> Index Scan using logins_20100501_username_time on logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
> ...
I believe the problem is that the planner (which does the partition pruning) is not willing to materialize the value of current_date, so it can't use a specific value to prune partitions. After all, the date might change between the planner and the executor, if you leave the plan open for a long time, or make the call very close to midnight.
You will probably have to select the current_date (or get it from your system, or cron, or whoever triggers the script), and then hardcode it into the query.
Cheers,
Jeff
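Jeff's suggestion — freeze the date outside the query and splice it in as a literal — might look like this in a nightly cron script (a sketch; the Python wrapper is invented, only the SQL shape comes from the thread):

```python
from datetime import date, timedelta

# Freeze "today" once at job start; in the real job this would be date.today().
today = date(2014, 9, 5)
cutoff = today - timedelta(days=7)

# The literal cutoff lets constraint exclusion prune partitions at plan time,
# unlike current_date - interval '1 week', which stays unevaluated in the plan.
sql = (
    "select time,event from logins "
    "where username='bob' and hash='1234' "
    f"and time > '{cutoff.isoformat()}';"
)
print(sql)
```

With `today` fixed at 2014-09-05 this produces the `time > '2014-08-29'` form that the earlier EXPLAIN showed pruning correctly.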
On Fri, Sep 5, 2014 at 2:17 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 9/5/2014 10:31 AM, Cal Heldenbrand wrote:
>> Number of child tables: 1581
>
> that's an insane number of children. We try and limit it to 50 or so child tables; for instance, 6 months retention by week, of data with millions of rows/day.
I've used more than that many for testing purposes, and there was little problem. The main thing is that your insert trigger (if you have one on the master table) needs to be structured as a binary-search-like nesting of if..elsif, not a linear chain of tests. Unless, of course, almost all inserts go into the newest partition; then it might make more sense to do the linear search with that being the first test. But for performance, it's better to just insert directly into the correct child table.
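The binary-search dispatch Jeff describes can be sketched outside plpgsql (a hypothetical illustration; in the actual trigger this would be nested IF/ELSIF blocks splitting the date range in half at each level):

```python
from bisect import bisect_right
from datetime import date, timedelta

# Hypothetical routing table: the start date of each daily partition, sorted.
partition_starts = [date(2014, 9, 1) + timedelta(days=k) for k in range(7)]

def route(ts_date):
    """Find the child partition containing ts_date in O(log n) comparisons,
    like a trigger built as a balanced nesting of if..elsif tests.
    (Dates past the last start simply map to the newest partition here.)"""
    i = bisect_right(partition_starts, ts_date) - 1
    if i < 0:
        raise ValueError(f"no partition covers {ts_date}")
    return f"logins_{partition_starts[i]:%Y%m%d}"

print(route(date(2014, 9, 4)))  # -> logins_20140904
```

With 1581 partitions the linear chain does up to 1581 comparisons per insert, while the balanced nesting does about 11.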
I don't know if that many tables is a good idea in this case, but it is not prima facie insane.
Cheers,
Jeff
On 9/7/2014 12:55 PM, Jeff Janes wrote:
> On Fri, Sep 5, 2014 at 2:17 PM, John R Pierce <pierce@hogranch.com> wrote:
>> On 9/5/2014 10:31 AM, Cal Heldenbrand wrote:
>>> Number of child tables: 1581
>>
>> that's an insane number of children. We try and limit it to 50 or so child tables; for instance, 6 months retention by week, of data with millions of rows/day.
>
> I've used more than that many for testing purposes, and there was little problem. The main thing is that your insert trigger (if you have one on the master table) needs to be structured as a binary search-like nesting of if..elsif, not a linear-searching like structure. Unless of course almost all inserts go into the newest partition, then it might make more sense to do the linear search with that being the first test. But for performance, better to just insert directly into the correct child table.
Any select that can't be pre-planned to a specific child will need to check all 1500 children. This is far less efficient than checking, say, 50 and letting the b-tree index of each child reject or narrow down to the specific row(s). The one is roughly 1500*log(N/1500) while the other is 50*log(N/50), at least to a first-order approximation.
-- john r pierce 37N 122W somewhere on the middle of the left coast
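John's first-order estimate can be checked numerically (a sketch; N and the cost model are illustrative only, counting one b-tree descent per probed child):

```python
from math import log2

def probe_cost(c, n_rows):
    # c equal-sized children of ~n_rows/c rows each: one b-tree descent per child.
    return c * log2(n_rows / c)

N = 1_000_000_000  # illustrative total row count (~4 years of data)

cost_1500 = probe_cost(1500, N)  # every one of 1500 children probed
cost_50 = probe_cost(50, N)      # only 50 children probed
cost_single = log2(N)            # one big unpartitioned table

print(cost_1500, cost_50, cost_single)
```

The numbers come out roughly 29000 vs 1200 vs 30, which matches both directions of the argument: fewer probed partitions is much cheaper, and a single index is cheapest of all for a pure lookup.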
Thanks Jeff! That's what I wanted to confirm, that I need to hard code / pregenerate my dates in the query. I was mainly curious why it didn't work with current_date, and that answers it.
And BTW, all my inserts happen on the most recent table, so my insert trigger with a linear search is ordered date-descending. It seems to work fine, since my use case is to insert data once and never change it again. Of course, my only problem is when a select query confuses the planner and searches my entire set. ;-)

On Sun, Sep 7, 2014 at 2:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 10:31 AM, Cal Heldenbrand <cal@fbsdata.com> wrote:
>> This shows that it's attempting to run the query against all of my 1500 child tables.
>>
>> explain analyze select time,event from logins
>> where username='bob' and hash='1234' and time > current_date - interval '1 week';
>>
>> Result (cost=0.00..765.11 rows=1582 width=14)
>>   -> Append (cost=0.00..765.11 rows=1582 width=14)
>>     -> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
>>       Filter: (((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
>>     -> Index Scan using logins_20100501_username_time on logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
>> ...
>
> I believe the problem is that the planner (which does the partition pruning) is not willing to materialize the value of current_date, so it can't use a specific value to prune partitions. After all, the date might change between the planner and the executor, if you leave the plan open for a long time, or make the call very close to midnight.
>
> You will probably have to select the current_date (or get it from your system, or cron, or whoever triggers the script), and then hardcode it into the query.
>
> Cheers,
> Jeff
John,

> On Fri, Sep 5, 2014 at 2:17 PM, John R Pierce <pierce@hogranch.com> wrote:
>> On 9/5/2014 10:31 AM, Cal Heldenbrand wrote:
>> Number of child tables: 1581
>> that's an insane number of children. We try and limit it to 50 or so child tables; for instance, 6 months retention by week, of data with millions of rows/day.
>>
>> I've used more than that many for testing purposes, and there was little problem. The main thing is that your insert trigger (if you have one on the master table) needs to be structured as a binary search-like nesting of if..elsif, not a linear-searching like structure. Unless of course almost all inserts go into the newest partition, then it might make more sense to do the linear search with that being the first test. But for performance, better to just insert directly into the correct child table.
>
> any select that can't be preplanned to a specific child will need to check all 1500 children. this is far less efficient than checking, say, 50 and letting the b-tree index of each child reject or narrow down to the specific row(s). The one is roughly 1500*log(N/1500) while the other is 50*log(N/50), at least to a first order approximation.

can you explain that further? In the end, that argument sounds like it would always be more efficient to use a single table and its index instead, rather than partitioning it (log(N) < c*log(N/c) for any c > 1, if I'm not totally lost today).

So, is there any insight into how many partitions are still useful? I have tables with roughly 1000 partitions and did not have any issues so far. Even with INSERT rules that are lined up worst-case (from past to current, while data is always being inserted for the current date), I haven't seen any considerable degradation of INSERT performance so far.

Thanks,
Andreas
On 9/8/2014 1:40 PM, Andreas Brandl wrote:
> can you explain that further? In the end, that argument sounds like it would always be more efficient to use a single table and its index instead, rather than partitioning it (log(N) < c*log(N/c) for any c > 1, if I'm not totally lost today).

it indeed would. good reasons for partitioning include...

* efficient date-based bulk deletion (we have a very large table that has 6 months retention, so we partition by week and delete the oldest week when a new week starts... dropping a partition is far faster than deleting 20 million records by date)
* needing to put data across several tablespaces - I haven't had to do this.
* more efficient vacuuming - really really large tables, like 100 GB, take a LONG time to vacuum. sane-sized partitions will vacuum in less time, and since older time-based partitions aren't typically updated, they can be frozen.

--
john r pierce 37N 122W somewhere on the middle of the left coast
On Mon, Sep 8, 2014 at 1:40 PM, Andreas Brandl <ml@3.141592654.de> wrote:
John,
> On Fri, Sep 5, 2014 at 2:17 PM, John R Pierce <pierce@hogranch.com>
> wrote:
>> On 9/5/2014 10:31 AM, Cal Heldenbrand wrote:
>> Number of child tables: 1581
>> that's an insane number of children. We try and limit it to 50 or so
>> child tables, for instance, 6 months retention by week, of data will
>> millions of rows/day.
>>
>> I've used more than that many for testing purposes, and there was
>> little problem. The main thing is that your insert trigger (if you
>> have one on the master table) needs to be structured as a binary
>> search-like nesting of if..elsif, not a linear-searching like
>> structure. Unless of course almost all inserts go into the newest
>> partition, then it might make more sense to do the linear search
>> with that being the first test. But for performance, better to just
>> insert directly into the correct child table.
> any select that can't be preplanned to a specific child will need to
> check all 1500 children. this is far less efficient than checking,
> say, 50 and letting the b-tree index of each child reject or narrow
> down to the specific row(s). The one is roughly 1500*log(N/1500)
> while the other is 50*log(N/50) at least to a first order
> approximation.
can you explain that further? In the end, that argument sounds like it would always be more efficient to use a single table and its index instead, rather than partitioning it (log(N) < c*log(N/c) for any c > 1, if I'm not totally lost today).
Right. Partitioning is NOT a generic way to improve query performance. With rare exceptions, partitioning makes query performance worse, and then partition pruning may or may not serve to recover most of that lost performance.
Partitioning allows you to conspire with PostgreSQL to micromanage the layout of the data in ways that improve manageability and maintainability. For example, you pay a little overhead each time you run a query and (perhaps) each time you insert a row, but in exchange for that you can "delete" a year of data with a nearly-instantaneous command.
Cheers,
Jeff
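For example, "deleting" a year of the thread's daily partitions is just one DROP per child table; a sketch of generating those commands (the table names follow the logins_YYYYMMDD pattern shown earlier; everything else is illustrative):

```python
from datetime import date, timedelta

def drop_statements(year):
    # One DROP TABLE per daily child partition; dropping a partition is
    # nearly instantaneous compared with DELETEing the same rows.
    d = date(year, 1, 1)
    stmts = []
    while d.year == year:
        stmts.append(f"DROP TABLE logins_{d:%Y%m%d};")
        d += timedelta(days=1)
    return stmts

stmts = drop_statements(2010)
print(len(stmts))              # 365 for a non-leap year
print(stmts[0], stmts[-1])
```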