Обсуждение: separating improperly grouped page views

Поиск
Список
Период
Сортировка

separating improperly grouped page views

От
Jeff Frost
Дата:
It seems a user visit cookie's expiration has been improperly set and so the 
tables which represent a user's clickstream through the website are grouped 
improperly.  I'd like to run a set-wise query to regroup them based on periods 
of inactivity greather than 1 hour.  So, what we want to do is adjust the 
visit_id's in the page_view table to point to separate visits whenever the 
user has been idle for more than 1 hour. The query I've come up with looks 
like this:

CREATE TABLE visit_clean AS
SELECT id     ,       CASE       WHEN stamp - last_stamp > INTERVAL '1 hour'    OR last_stamp IS NULL THEN
nextval('tracking_cleanup_seq')      ELSE currval('tracking_cleanup_seq')       END AS visit_id     , visit_id AS
old_visit_id    , uri     , params     , stamp     , cindex     , tindex     , method     , source_address     ,
server_name FROM (    SELECT id        , visit_id        , uri        , params        , stamp        , cindex        ,
tindex       , method        , source_address        , (        SELECT max(pv2.stamp)          FROM page_view pv2
WHERE pv2.visit_id            = pv1.visit_id          AND pv2.stamp               < pv1.stamp          ) AS last_stamp
     , server_name      FROM page_view pv1       ) x;
 

It works reasonably well on a small data set.  But, the plan for this is 
horrible on the real data.
                                                          QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1  (cost=0.00..2634763281.70 rows=3588374 width=239)   SubPlan     ->  Result
(cost=367.09..367.10rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..367.09 rows=1 width=8)
          ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..158215.86 rows=431 width=8)
                     Index Cond: (stamp < $1)                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
   ->  Result  (cost=367.09..367.10 rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..367.09 rows=1
width=8)                  ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..158215.86
rows=431width=8)                         Index Cond: (stamp < $1)                         Filter: ((stamp IS NOT NULL)
AND(visit_id = $0))
 

The page_view table is 829MB in size and has 3,590,185 rows.

I let the query run on my test server for about 26hrs without finishing, so I 
can't provide explain analyze output yet.

Does anyone have a better method of separating this data out?

---
Jeff Frost, Owner       <jeff@frostconsultingllc.com>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954



Re: separating improperly grouped page views

От
Tom Lane
Дата:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> [ umpteen million iterations of: ]
>               ->  Limit  (cost=0.00..367.09 rows=1 width=8)
>                     ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..158215.86
rows=431width=8)
 
>                           Index Cond: (stamp < $1)
>                           Filter: ((stamp IS NOT NULL) AND (visit_id = $0))

Perhaps an index on (visit_id, stamp) would help.  This one is doing the
best it can, but if the visit_id's you want are thinly scattered, it'll
still suck...
        regards, tom lane


Re: separating improperly grouped page views

От
Jeff Frost
Дата:
On Mon, 18 Jun 2007, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> [ umpteen million iterations of: ]
>>               ->  Limit  (cost=0.00..367.09 rows=1 width=8)
>>                     ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..158215.86
rows=431width=8)
 
>>                           Index Cond: (stamp < $1)
>>                           Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
>
> Perhaps an index on (visit_id, stamp) would help.  This one is doing the
> best it can, but if the visit_id's you want are thinly scattered, it'll
> still suck...

Good idea Tom!  In fact the planner seems to like that much better:
 Seq Scan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239)
vs Seq Scan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239)   SubPlan     ->  Result  (cost=1.58..1.59
rows=1width=0)           InitPlan             ->  Limit  (cost=0.00..1.58 rows=1 width=8)                   ->  Index
ScanBackward using page_view_visit_id_stamp_idx on page_view pv2  (cost=0.00..625.39 rows=397 width=8)
      Index Cond: ((visit_id = $0) AND (stamp < $1))                         Filter: (stamp IS NOT NULL)     ->  Result
(cost=1.58..1.59 rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..1.58 rows=1 width=8)
     ->  Index Scan Backward using page_view_visit_id_stamp_idx on page_view pv2  (cost=0.00..625.39 rows=397 width=8)
                      Index Cond: ((visit_id = $0) AND (stamp < $1))                         Filter: (stamp IS NOT
NULL)
(14 rows)

Compared to:



-------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)   SubPlan     ->  Result
(cost=364.56..364.57rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..364.56 rows=1 width=8)
          ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..153481.58 rows=421 width=8)
                     Index Cond: (stamp < $1)                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
   ->  Result  (cost=364.56..364.57 rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..364.56 rows=1
width=8)                  ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..153481.58
rows=421width=8)                         Index Cond: (stamp < $1)                         Filter: ((stamp IS NOT NULL)
AND(visit_id = $0))
 
(14 rows)



-- 
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954


Re: separating improperly grouped page views

От
Jeff Frost
Дата:
On Sun, 17 Jun 2007, Jeff Frost wrote:

>
-------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239)
>   SubPlan
>     ->  Result  (cost=1.58..1.59 rows=1 width=0)
>           InitPlan
>             ->  Limit  (cost=0.00..1.58 rows=1 width=8)
>                   ->  Index Scan Backward using page_view_visit_id_stamp_idx 
> on page_view pv2  (cost=0.00..625.39 rows=397 width=8)
>                         Index Cond: ((visit_id = $0) AND (stamp < $1))
>                         Filter: (stamp IS NOT NULL)
>     ->  Result  (cost=1.58..1.59 rows=1 width=0)
>           InitPlan
>             ->  Limit  (cost=0.00..1.58 rows=1 width=8)
>                   ->  Index Scan Backward using page_view_visit_id_stamp_idx 
> on page_view pv2  (cost=0.00..625.39 rows=397 width=8)
>                         Index Cond: ((visit_id = $0) AND (stamp < $1))
>                         Filter: (stamp IS NOT NULL)
> (14 rows)
>
> Compared to:
>
>
>
-------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)
>   SubPlan
>     ->  Result  (cost=364.56..364.57 rows=1 width=0)
>           InitPlan
>             ->  Limit  (cost=0.00..364.56 rows=1 width=8)
>                   ->  Index Scan Backward using page_view_stamp_idx on 
> page_view pv2  (cost=0.00..153481.58 rows=421 width=8)
>                         Index Cond: (stamp < $1)
>                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
>     ->  Result  (cost=364.56..364.57 rows=1 width=0)
>           InitPlan
>             ->  Limit  (cost=0.00..364.56 rows=1 width=8)
>                   ->  Index Scan Backward using page_view_stamp_idx on 
> page_view pv2  (cost=0.00..153481.58 rows=421 width=8)
>                         Index Cond: (stamp < $1)
>                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
> (14 rows)

And throwing the ORDER BY back in reduces the cost even more!
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
SubqueryScan x  (cost=0.00..5815824.15 rows=3629753 width=1186)   ->  Index Scan using page_view_visit_idx on page_view
pv1 (cost=0.00..5743229.09 rows=3629753 width=237)         SubPlan           ->  Result  (cost=1.51..1.52 rows=1
width=0)                InitPlan                   ->  Limit  (cost=0.00..1.51 rows=1 width=8)
-> Index Scan Backward using page_view_visit_id_stamp_idx on page_view pv2  (cost=0.00..608.41 rows=402 width=8)
                      Index Cond: ((visit_id = $0) AND (stamp < $1))                               Filter: (stamp IS
NOTNULL)
 
(9 rows)

Now we only have to do that index scan once. :-)  I had foolishly taken that 
out to see if the sort was killing me and forgot to put it back in.

So now it's:
 Subquery Scan x  (cost=0.00..5815824.15 rows=3629753 width=1186) vs
Seq Scan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239) vs
Seq Scan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)

-- 
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954


Re: separating improperly grouped page views

От
Jeff Frost
Дата:
On Sun, 17 Jun 2007, Jeff Frost wrote:

> On Mon, 18 Jun 2007, Tom Lane wrote:
>
>> Jeff Frost <jeff@frostconsultingllc.com> writes:
>>> [ umpteen million iterations of: ]
>>>               ->  Limit  (cost=0.00..367.09 rows=1 width=8)
>>>                     ->  Index Scan Backward using page_view_stamp_idx on 
>>> page_view pv2  (cost=0.00..158215.86 rows=431 width=8)
>>>                           Index Cond: (stamp < $1)
>>>                           Filter: ((stamp IS NOT NULL) AND (visit_id = 
>>> $0))
>> 
>> Perhaps an index on (visit_id, stamp) would help.  This one is doing the
>> best it can, but if the visit_id's you want are thinly scattered, it'll
>> still suck...
>
> Good idea Tom!  In fact the planner seems to like that much better:
>
> Seq Scan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239)
> vs
> Seq Scan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)

Now that this is working in a reasonable amount of time to process the entire 
data set, I need to work out how to process the new information that comes in 
every so often and still assign it a correct visit_id until the fix for this 
can get through QA.  The cleanup query looks like this:

CREATE TEMP TABLE tmpmaxpvid AS SELECT MAX(id) AS id FROM 
reporting.page_view_clean;

INSERT INTO reporting.page_view_clean
SELECT id     ,       CASE       WHEN stamp - last_stamp > INTERVAL '1 hour'    OR last_stamp IS NULL THEN
nextval('reporting.tracking_cleanup_seq')      ELSE currval('reporting.tracking_cleanup_seq')       END AS visit_id
,uri     , params     , stamp     , visit_id AS old_visit_id  FROM (    SELECT id        , visit_id        , uri
,params        , stamp        , (        SELECT MAX(pv2.stamp)          FROM page_view pv2        WHERE pv2.visit_id
       = pv1.visit_id          AND pv2.stamp               < pv1.stamp          ) AS last_stamp      FROM page_view pv1
  WHERE pv1.stamp               < now() - INTERVAL '1 hour'      AND pv1.id                  > (        SELECT CASE
                WHEN id IS NULL THEN 0                      ELSE id                      END AS id          FROM
tmpmaxpvid         )    ORDER BY pv1.visit_id        , pv1.stamp       ) x;
 


The problem is the page_views that straddle the time before and after the 
query get an artificially different visit_id.

This case statement is what kills me:
       CASE       WHEN stamp - last_stamp > INTERVAL '1 hour'    OR last_stamp IS NULL THEN
nextval('reporting.tracking_cleanup_seq')      ELSE currval('reporting.tracking_cleanup_seq')       END AS visit_id
 

If I change it to this:
       CASE       WHEN last_stamp IS NULL         THEN visit_id       WHEN stamp - last_stamp > INTERVAL '1 hour'
 THEN nextval('reporting.tracking_cleanup_seq')       ELSE currval('reporting.tracking_cleanup_seq')       END AS
visit_id

Then it gives the first page_view in the sequence the correct visit_id, but 
then continues on with a different visit_id for the rest.

I've also tried changing how I select which data to act on like so:
    WHERE pv1.id                  > (        SELECT CASE              WHEN id IS NULL THEN 0              ELSE id
      END AS id          FROM tmpmaxpvid          )      AND pv1.visit_id IN (        SELECT visit_id          FROM
page_viewpv3        GROUP BY visit_id        HAVING max(stamp)              < now() - INTERVAL '1 hour'          )
ORDERBY pv1.visit_id        , pv1.stamp       ) x;
 

But that leaves me skipping some page views because they haven't reached their 
conclusion and because of the id > clause, I'll never go back to see them.

Anyone have any suggestions?

-- 
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954