Re: Random resultset retrieving -> performance bottleneck
От | Cédric Dufour |
---|---|
Тема | Re: Random resultset retrieving -> performance bottleneck |
Дата | |
Msg-id | NDBBIFNBODNADCAOFDOAAEJOCDAA.cedric.dufour@freesurf.ch обсуждение исходный текст |
Ответ на | Re: Random resultset retrieving -> performance bottleneck ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>) |
Список | pgsql-sql |
> -----Original Message----- > From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au] > Sent: Saturday, August 03, 2002 12:10 > To: Cédric Dufour; pgsql-sql@postgresql.org > Subject: Re: [SQL] Random resultset retrieving -> performance bottleneck > > > > I'm running into a performance problem when considering the following > > scenario: I have a fairly large table (1mio rows) related to > other smaller > > tables (between 100 and 10000 rows) and would like to retrieve > the joined > > data (through a view) in random order. In order to do so, the main table > > contains a 'Random' field (which is updated on a regular basis, in order > to > > re-randomize the data set), on which an index is created: > > Have you tried adding ORDER BY RANDOM() onto your select query? > > Chris > Yes, but the problem remains the same: even tough a LIMIT clause is given, the full set seems to be evaluated before the sort occurs (and it is even slower than storing a random value in the table itself, since random() has to be computed for each row instead or retrieving the pre-computed one). Thus... SELECT my_view ORDER BY random() LIMIT 100 takes many minutes (actually, after 10 minutes running, I gave up and canceled the query), while... SELECT my_view LIMIT 100 takes only 200 milliseconds So far, I managed to re-order the rows in random, turning off all triggers and forced the planner to use only 'hash' joins (thus forcing it to make 'seq scan' and keeping it from re-ordering the rows based on the indexes). Now, turning off 'nestloop' and 'mergejoin' (in order to have only 'hash' joins) had a rather disorienting side-effect -> See thread '[GENERAL] b1 OR b2 <-> ( CASE WHEN b1 THE true ELSE b2 END ): performance bottleneck on logical OR' from this month (Tom, here is the reason why I copy you this mail, so you understand where my messing around with the planner comes from ;-) ) So I turned all joins sorts ('hash' AND 'nestloop' and 'mergejoin') on again and suceeded having the planner do what I want on this particular query by surrounding the WHERE expression by a ( CASE ... END ): SELECT ... WHERE exp -> SELECT ... WHERE ( CASE WHEN exp THEN true ELSE false END ) Thus having the planner add a 'seq scan' of the randomized table just before returning the resultset: Limit (cost=17623.97..3321903.97 rows=25 width=7855) (actual time=2659.00..2732.00 rows=100 loops=1) -> Hash Join (cost=17623.97..3321923.80 rows=25 width=7855) (actual time=2659.00..2730.00 rows=101 loops=1) -> Hash Join (cost=17429.64..3321729.33 rows=25 width=7819) (actual time=2651.00..2698.00 rows=101 loops=1) -> Hash Join (cost=17235.92..3321535.49 rows=25 width=7783) (actual time=2646.00..2681.00 rows=101 loops=1) -> Hash Join (cost=17041.36..3321315.62 rows=5000 width=7226) (actual time=2635.00..2663.00 rows=101 loops=1) -> Hash Join (cost=16847.45..3316059.18 rows=1000006 width=6669) (actual time=2629.00..2652.00 rows=101 loops=1) -> Hash Join (cost=16795.64..3311007.27 rows=1000006 width=4839) (actual time=2627.00..2648.00 rows=101 loops=1) -> Hash Join (cost=16743.82..3305955.36 rows=1000006 width=4831) (actual time=2626.00..2640.00 rows=101 loops=1) ********** -> Seq Scan on tb_item item (cost=0.00..34708.06 rows=1000006 width=2723) (actual time=847.00..853.00 rows=487 loops=1) -> Hash (cost=14110.82..14110.82 rows=10001 width=2108) (actual time=1775.00..1775.00 rows=0 loops=1) -> Hash Join (cost=13858.17..14110.82 rows=10001 width=2108) (actual time=1041.00..1742.00 rows=10001 loops=1) -> Hash Join (cost=13806.36..14008.94 rows=10001 width=1310) (actual time=1040.00..1488.00 rows=10001 loops=1) [...] This looks real dirty to me and I don't understand why I obtain what I want... but 'la fin justifie les moyens', that is my query runs in approximately 2-3 seconds and I obtain a randomized resultset ;-) Would there be a 'by the book' way to tell the planner to preserve the order of a given table ? Thx for your help, Cedric Dufour
В списке pgsql-sql по дате отправления: