Discussion: odd intermittent query hanging issue
Greetings,

I run a handful of queries overnight when traffic is at its lowest on our system. One particular query will run perfectly fine (around 5 seconds) for several weeks, then suddenly decide to hang indefinitely and never finish. It needs to be killed manually after several hours (I've intentionally let it run to see if it would ever finish) in order for it to die.

The fix _seems to be_ to drop and rebuild the index on xrefchannelmember.member_id. The query then goes back to running in the 5 seconds and has no problem again for weeks.... until it happens again.

Has anyone heard of such a thing? And if anyone can maybe point me in directions to investigate, it would be much appreciated.

PostgreSQL v9.1.3
Ubuntu v11.10
Linux db 3.0.0-16-generic #29-Ubuntu SMP Tue Feb 14 12:48:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Query, explain plan (when it works) and table structure below:

(Thanking you in advance for any help)

select distinct(id) from member
left join xrefchannelmember xrcm on id=xrcm.member_id
where id not in (select memberid from xrefcampaignmember)
and xrcm.channel_id in (1)
order by id asc;

explain plan:
------------------------------
 Unique  (cost=237234.66..239267.33 rows=406533 width=4) (actual time=4790.823..4922.621 rows=418843 loops=1)
   ->  Sort  (cost=237234.66..238251.00 rows=406533 width=4) (actual time=4790.818..4826.201 rows=418879 loops=1)
         Sort Key: member.id
         Sort Method: quicksort  Memory: 31923kB
         ->  Hash Join  (cost=167890.31..199359.99 rows=406533 width=4) (actual time=3357.406..4532.952 rows=418879 loops=1)
               Hash Cond: (xrcm.member_id = member.id)
               ->  Seq Scan on xrefchannelmember xrcm  (cost=0.00..19273.69 rows=813066 width=4) (actual time=0.015..219.259 rows=814421 loops=1)
                     Filter: (channel_id = 1)
               ->  Hash  (cost=162586.21..162586.21 rows=424328 width=4) (actual time=3357.001..3357.001 rows=444626 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 15632kB
                     ->  Seq Scan on member  (cost=66114.02..162586.21 rows=424328 width=4) (actual time=2357.280..3216.076 rows=444626 loops=1)
                           Filter: (NOT (hashed SubPlan 1))
                           SubPlan 1
                             ->  Seq Scan on xrefcampaignmember  (cost=0.00..57931.82 rows=3272882 width=4) (actual time=0.021..670.086 rows=3272870 loops=1)
 Total runtime: 4963.134 ms

Table structure:

             Table "public.xrefcampaignmember"
    Column    |           Type           |         Modifiers
--------------+--------------------------+----------------------------------------------------
 campaignid   | integer                  | not null
 memberid     | integer                  | not null
 joined       | timestamp with time zone | default ('now'::text)::timestamp without time zone
 reservedslot | integer                  | default 0
Indexes:
    "XrefCampaignMember_pkey" PRIMARY KEY, btree (campaignid, memberid)
    "xcm_campaignid_idx" btree (campaignid)
    "xcm_joined_idx" btree (joined)
    "xcm_memberid_idx" btree (memberid)
    "xcm_reservedslot_idx" btree (reservedslot)
On 05/18/2012 09:17 AM, Aaron Burnett wrote:
> Greetings,
>
> I run a handful of queries overnight when traffic is at its lowest on our
> system. One particular query will run perfectly fine (around 5 seconds)
> for several weeks, then suddenly decide to hang indefinitely and never
> finish. It needs to be killed manually after several hours (I've
> intentionally let it run to see if it would ever finish) in order for it
> to die.
>
> The fix _seems to be_ to drop and rebuild the index on
> xrefchannelmember.member_id. The query then goes back to running in the 5
> seconds and has no problem again for weeks.... until it happens again.
>
> Has anyone heard of such a thing? And if anyone can maybe point me in
> directions to investigate, it would be much appreciated.

Few answers but several questions...

Is the machine busy processing the query or is it idle?

Does it start happening consistently, or is it an individual query? I.e. if you run the query again, will it hang or complete? Can you get a query plan when the query is failing?

Are there any other queries that may be locking your tables (check pg_locks)?

Anything of interest in the PostgreSQL or system logs?

Is autovacuum running properly? Are you running any large updates/deletes/etc. just prior to the query, such that the statistics the planner is using do not reflect reality?

Does the network connection to the server still exist?

Any other weirdness happening on the machine?

> select distinct(id) from member left join xrefchannelmember xrcm on
> id=xrcm.member_id where id not in (select memberid from
> xrefcampaignmember) and xrcm.channel_id in (1) order by id asc;

BTW, I have an unsubstantiated gut feeling that this query has room for improvement. I would experiment with substituting "...where not exists (select..." for the "...not in (select...". You could also try using "select ... except select memberid from xrefcampaignmember ...".
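If it helps, something along these lines (untested sketch, written against the 9.1 catalogs where pg_stat_activity still has procpid/current_query) should show any backend stuck waiting on a lock and what it is waiting for:

```sql
-- Sketch only: list sessions waiting on ungranted locks
SELECT w.pid                  AS waiting_pid,
       w.locktype,
       w.relation::regclass   AS locked_relation,  -- NULL for non-relation locks
       w.mode                 AS wanted_mode,
       a.current_query
FROM pg_locks w
JOIN pg_stat_activity a ON a.procpid = w.pid
WHERE NOT w.granted;
```

If that returns no rows while the query is hung, a lock conflict is probably not the culprit.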
I'm also not sure I understand the left join, since you have the "xrcm.channel_id in (1)" condition, so you won't return records in "member" without a corresponding record in xrefchannelmember anyway.

I also don't understand the "xrcm.channel_id in (1)" instead of "xrcm.channel_id = 1", unless this is a generated query and there could be multiple ids in that condition.

Will one of the following (untested) queries return the results you want and have a better query plan?

select distinct(id) from member
join xrefchannelmember xrcm on id=xrcm.member_id
where xrcm.channel_id=1
except
select memberid from xrefcampaignmember
order by id asc;

(actually with except the distinct is probably superfluous as well)

or:

select distinct(id) from member
join xrefchannelmember xrcm on id=xrcm.member_id
where xrcm.channel_id=1
and not exists (select 1 from xrefcampaignmember x where x.memberid = member.id)
order by id asc;

or:

select distinct(id) from member
where exists (select 1 from xrefchannelmember where member_id = member.id and channel_id=1)
and not exists (select 1 from xrefcampaignmember where memberid = member.id)
order by id asc;

Cheers,
Steve
Thanks Steve,

Answers are inserted below:

On 5/18/12 11:09 AM, "Steve Crawford" <scrawford@pinpointresearch.com> wrote:

>On 05/18/2012 09:17 AM, Aaron Burnett wrote:
>> Greetings,
>>
>> I run a handful of queries overnight when traffic is at its lowest on
>> our system. One particular query will run perfectly fine (around 5
>> seconds) for several weeks, then suddenly decide to hang indefinitely
>> and never finish. It needs to be killed manually after several hours
>> (I've intentionally let it run to see if it would ever finish) in order
>> for it to die.
>>
>> The fix _seems to be_ to drop and rebuild the index on
>> xrefchannelmember.member_id. The query then goes back to running in the
>> 5 seconds and has no problem again for weeks.... until it happens again.
>>
>> Has anyone heard of such a thing? And if anyone can maybe point me in
>> directions to investigate, it would be much appreciated.
>
>Few answers but several questions...
>
>Is the machine busy processing the query or is it idle?

It is processing, and in fact drives the load up a bit.

>Does it start happening consistently, or is it an individual query? I.e.
>if you run the query again, will it hang or complete? Can you get a query
>plan when the query is failing?

The query will run fine many times a night for many nights, then it will hang. Killing that query and running it again results in the same indefinite hang. The only fix so far to get the query to run again is to drop and rebuild the xrefcampaignmember.memberid index (I misquoted which index earlier), and then it will once again run fine for many, many days.

>Are there any other queries that may be locking your tables (check
>pg_locks)?

I did check, and no, no locks.

>Anything of interest in the PostgreSQL or system logs?

Nope.

>Is autovacuum running properly? Are you running any large
>updates/deletes/etc. just prior to the query, such that the statistics
>the planner is using do not reflect reality?
No large updates or deletes to any of the tables involved in the query. The statistics seem just fine.

>Does the network connection to the server still exist?

Yes.

>Any other weirdness happening on the machine?

Nothing at all.

>> select distinct(id) from member left join xrefchannelmember xrcm on
>> id=xrcm.member_id where id not in (select memberid from
>> xrefcampaignmember) and xrcm.channel_id in (1) order by id asc;
>
>BTW, I have an unsubstantiated gut feeling that this query has room for
>improvement. I would experiment with substituting "...where not exists
>(select..." for the "...not in (select...". You could also try using
>"select ... except select memberid from xrefcampaignmember ...".
>
>I'm also not sure I understand the left join, since you have the
>"xrcm.channel_id in (1)" condition, so you won't return records in
>"member" without a corresponding record in xrefchannelmember anyway.
>
>I also don't understand the "xrcm.channel_id in (1)" instead of
>"xrcm.channel_id = 1", unless this is a generated query and there could
>be multiple ids in that condition.
>
>Will one of the following (untested) queries return the results you want
>and have a better query plan?
>
>select distinct(id) from member
>join xrefchannelmember xrcm on id=xrcm.member_id
>where xrcm.channel_id=1
>except
>select memberid from xrefcampaignmember
>order by id asc;
>
>(actually with except the distinct is probably superfluous as well)
>
>or:
>
>select distinct(id) from member
>join xrefchannelmember xrcm on id=xrcm.member_id
>where xrcm.channel_id=1
>and not exists (select 1 from xrefcampaignmember x where x.memberid =
>member.id)
>order by id asc;
>
>or:
>
>select distinct(id) from member
>where
>exists (select 1 from xrefchannelmember where member_id = member.id and
>channel_id=1)
>and not exists (select 1 from xrefcampaignmember where memberid =
>member.id)
>order by id asc;

Yeah, the query is poo... autogenerated... the LEFT JOIN is not needed, as I have pointed out to the person responsible for the code many times, and the 'in (1)' may indeed have many categories in there. But the ONLY one that hangs is the 'in (1)'.

>Cheers,
>Steve
On 05/18/2012 11:01 AM, Aaron Burnett wrote:
>>> ... One particular query will run perfectly fine (around 5 seconds)
>>> for several weeks, then suddenly decide to hang indefinitely and never
>>> finish....
>
>> Is the machine busy processing the query or is it idle?
>
> It is processing, and in fact drives the load up a bit.

What CPU, disk and memory is it using? It would be very interesting to see the query plan when things go south.

> Yeah, the query is poo... autogenerated... the LEFT JOIN is not needed,
> as I have pointed out to the person responsible for the code many times,
> and the 'in (1)' may indeed have many categories in there. But the ONLY
> one that hangs is the 'in (1)'.

Is "1" the largest category?

Also, how have you tuned work_mem (show work_mem;)? When the query goes bad, do you see PostgreSQL spilling to temp files (watch the files in PGDATA/base/DB_OID/pgsql_tmp)? Note that work_mem can be set per connection, so if it is too small for your nighttime maintenance you can adjust it for those operations only.

Cheers,
Steve
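The two checks Steve suggests can be sketched as follows (untested; the zero threshold is illustrative, and setting log_temp_files requires superuser privileges):

```sql
-- Show the current work_mem for this session
SHOW work_mem;

-- Have the server log every temporary file a query creates, so spills
-- into PGDATA/base/DB_OID/pgsql_tmp show up in the PostgreSQL log.
-- The value is a size threshold in kB; 0 means "log all temp files".
SET log_temp_files = 0;
```

With that in place, the nightly log should say directly whether the hanging run coincides with temp-file activity.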
Aaron Burnett wrote:
>>> I run a handful of queries overnight when traffic is at its lowest on
>>> our system. One particular query will run perfectly fine (around 5
>>> seconds) for several weeks, then suddenly decide to hang indefinitely
>>> and never finish. It needs to be killed manually after several hours
>>> (I've intentionally let it run to see if it would ever finish) in
>>> order for it to die.
>>>
>>> The fix _seems to be_ to drop and rebuild the index on
>>> xrefchannelmember.member_id. The query then goes back to running in
>>> the 5 seconds and has no problem again for weeks.... until it happens
>>> again.
>
>> Is the machine busy processing the query or is it idle?
>
> It is processing, and in fact drives the load up a bit.
>
>> Are there any other queries that may be locking your tables (check
>> pg_locks)?
>
> I did check, and no, no locks.

Try to get the execution plan of the query when it takes long and see how it compares to the plan when it runs ok.

Yours,
Laurenz Albe
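One way to capture that plan without babysitting the session overnight is the auto_explain contrib module (available in 9.1). A sketch, assuming the module is installed and loaded in the session that issues the nightly query:

```sql
-- Load auto_explain into this session only (no server restart needed)
LOAD 'auto_explain';

-- Log the plan of any statement in this session slower than 10 seconds
SET auto_explain.log_min_duration = '10s';

-- Optionally include actual timings and row counts (adds some overhead)
SET auto_explain.log_analyze = on;
```

If the hang really is a pathological plan or a spill, the logged plan from a bad night can then be compared against the good one posted earlier.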
On 05/20/2012 07:45 AM, Aaron Burnett wrote:
> Hey Steve,
>
> Just wanted to send a note of thanks for pointing me in a few new
> directions on this.
>
> Turns out that the query would swap, but not all the time. When it
> swapped, it wouldn't finish; if it didn't swap, it would finish in the
> expected time. Not sure why this isn't consistent...
>
> The work_mem was set at 100MB, so as a test I increased it to 200MB on
> the session, and so far it has consistently finished every time it has
> run. I'm not sure why this has reared its head now, as we run far more
> complex and data-intensive queries than this one and they don't swap...
> so I guess I have some forensics to do.

You're welcome.

Assuming that changing work_mem is a confirmed solution, it sounds like the query is running right on the edge of running out of work_mem, and when nudged over that threshold it starts to swap. PostgreSQL, like most other things in computers, does not degrade linearly: things work great until you run out of some resource, then degrade dramatically.

The work_mem setting specifies the memory available for internal sorts and hash operations.
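Since work_mem can be set per session or per transaction, the nightly job can raise it just for this one query rather than server-wide. A sketch, using the 200MB figure from the test above:

```sql
-- SET LOCAL applies only until this transaction commits or rolls back,
-- so the rest of the connection keeps the server default
BEGIN;
SET LOCAL work_mem = '200MB';
-- ... run the nightly query here ...
COMMIT;
```

A plain SET (without LOCAL) would also work and lasts until the connection closes; either way the global setting is untouched.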
Looking at the original explain plan you posted, I see lots of such operations - perhaps your other queries don't need them:

explain plan:
------------------------------
 Unique  (cost=237234.66..239267.33 rows=406533 width=4) (actual time=4790.823..4922.621 rows=418843 loops=1)
   ->  Sort  (cost=237234.66..238251.00 rows=406533 width=4) (actual time=4790.818..4826.201 rows=418879 loops=1)
         Sort Key: member.id
         Sort Method: quicksort  Memory: 31923kB
         ->  Hash Join  (cost=167890.31..199359.99 rows=406533 width=4) (actual time=3357.406..4532.952 rows=418879 loops=1)
               Hash Cond: (xrcm.member_id = member.id)
               ->  Seq Scan on xrefchannelmember xrcm  (cost=0.00..19273.69 rows=813066 width=4) (actual time=0.015..219.259 rows=814421 loops=1)
                     Filter: (channel_id = 1)
               ->  Hash  (cost=162586.21..162586.21 rows=424328 width=4) (actual time=3357.001..3357.001 rows=444626 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 15632kB
                     ->  Seq Scan on member  (cost=66114.02..162586.21 rows=424328 width=4) (actual time=2357.280..3216.076 rows=444626 loops=1)
                           Filter: (NOT (hashed SubPlan 1))
                           SubPlan 1
                             ->  Seq Scan on xrefcampaignmember  (cost=0.00..57931.82 rows=3272882 width=4) (actual time=0.021..670.086 rows=3272870 loops=1)
 Total runtime: 4963.134 ms

I'm still bothered by a couple of things, but I don't have enough understanding of the internals to provide definitive advice:

1. Why does reindexing seem to temporarily solve the problem?...
2. ...especially when I don't see indexes being used in the query.

I would definitely run EXPLAIN on some of the alternative versions of the query posted earlier to see if they do a better job of using available indexes or reducing the amount of internal hashing/sorting. If they do, you will have some "ammo" to take back to your developer.

Cheers,
Steve