Thread: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Thomas Munro
Date:
Hi hackers,

I heard a report of a 10.1 cluster hanging with several 'BtreePage'
wait_events showing in pg_stat_activity.  The query plan involved
Parallel Index Only Scan, and the table is concurrently updated quite
heavily.  I tried and failed to make a reproducer, but from the clues
available it seemed clear that somehow *all* participants in a
Parallel Index Scan must be waiting for someone else to advance the
scan.  The report came with a back trace[1] that was the same in all 3
backends (leader + 2 workers), which I'll summarise here:

  ConditionVariableSleep
  _bt_parallel_seize
  _bt_readnextpage
  _bt_steppage
  _bt_next
  btgettuple
  index_getnext_tid
  IndexOnlyNext

I think _bt_steppage() called _bt_parallel_seize(), and then it called
_bt_readnextpage(), which I guess must have encountered a BTP_DELETED-
or BTP_HALF_DEAD-marked page and so didn't take this early break out of
the loop:

                        /* check for deleted page */
                        if (!P_IGNORE(opaque))
                        {
                                PredicateLockPage(rel, blkno, scan->xs_snapshot);
                                /* see if there are any matches on this page */
                                /* note that this will clear moreRight if we can stop */
                                if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
                                        break;
                        }

... and then it called _bt_parallel_seize() itself, in violation of
the rule (by my reading of the code) that you must call
_bt_parallel_release() (via _bt_readpage()) or _bt_parallel_done()
after seizing the scan.  If you call _bt_parallel_seize() again
without doing that first, you'll finish up waiting for yourself
forever.  Does this theory make sense?
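For illustration, the seize/release contract described above can be modeled in a few lines. This is a hypothetical single-process sketch (class and names invented; the real implementation in the nbtree code uses shared memory and a condition variable), just to show why a second seize without an intervening release can never succeed:

```python
# Toy model of the parallel btree scan protocol: only one participant may
# hold ("seize") the shared scan state at a time, and whoever seizes it
# must release it with the next block number (or mark the scan done)
# before anyone -- including itself -- can seize it again.

IDLE, SEIZED, DONE = "idle", "seized", "done"

class ParallelScan:
    def __init__(self):
        self.state = IDLE
        self.next_block = 0

    def seize(self):
        # The real code sleeps on a condition variable here; we just
        # report whether the seize could ever succeed.  If the caller
        # itself still holds the scan, nobody will ever wake it up.
        if self.state != IDLE:
            return None
        self.state = SEIZED
        return self.next_block

    def release(self, next_block):
        self.next_block = next_block
        self.state = IDLE   # wakes waiters in the real code

scan = ParallelScan()

# Live-page path: seize, read the page, release with the right sibling.
assert scan.seize() == 0
scan.release(next_block=7)

# Buggy P_IGNORE path: seize again *without* releasing first.
assert scan.seize() == 7       # fine: the previous seize was released
assert scan.seize() is None    # self-deadlock: nobody will ever release
```

The last line is the situation in the backtrace: every participant is waiting for a release that no one will perform.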

[1] http://dpaste.com/05PGJ4E

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Thomas Munro
Date:
On Mon, Dec 11, 2017 at 3:51 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I heard a report of a 10.1 cluster hanging with several 'BtreePage'
> wait_events showing in pg_stat_activity.

I forgot to add, for bug reporter credit purposes: thanks to Patrick
Hemmer for the off-list report and backtrace.  He's able to work
around this recurring problem for now by disabling parallelism.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Amit Kapila
Date:
On Mon, Dec 11, 2017 at 8:21 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi hackers,
>
>
> ... and then it called _bt_parallel_seize() itself, in violation of
> the rule (by my reading of the code) that you must call
> _bt_parallel_release() (via _bt_readpage()) or _bt_parallel_done()
> after seizing the scan.  If you call _bt_parallel_seize() again
> without doing that first, you'll finish up waiting for yourself
> forever.  Does this theory make sense?
>

Yes, I think if the current page is half-dead or deleted, we need to
set the next page to be scanned and release the parallel scan.  This
has to be done for both forward and backward scans.

Thanks for looking into it.  I will see if we can write some test.  In
the meantime if possible, can you please request Patrick Hemmer to
verify the attached patch?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Thomas Munro
Date:
On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 8:21 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> ... and then it called _bt_parallel_seize() itself, in violation of
>> the rule (by my reading of the code) that you must call
>> _bt_parallel_release() (via _bt_readpage()) or _bt_parallel_done()
>> after seizing the scan.  If you call _bt_parallel_seize() again
>> without doing that first, you'll finish up waiting for yourself
>> forever.  Does this theory make sense?
>>
>
> Yes, I think if the current page is half-dead or deleted, we need to
> set the next page to be scanned and release the parallel scan.  This
> has to be done for both forward and backward scans.

Your patch seems to match what _bt_readpage() would do in the regular
live page path, namely _bt_parallel_release(scan, opaque->btpo_next)
to advance to the right and _bt_parallel_release(scan,
BufferGetBlockNumber(so->currPos.buf)) to advance to the left.  I
haven't tested it, but it certainly looks correct on that basis (I
admit that the asymmetry threw me for a moment but I get it now).
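As an illustration of that asymmetry (a toy model with invented names, not the actual nbtree code): a forward scan hands the next worker the *right sibling* of the page it just inspected, while a backward scan hands over the *current* block, because _bt_walk_left() itself does the stepping left from there.

```python
# Pages modeled as a doubly linked list of block numbers, mirroring the
# leaf level of a btree (1 <-> 2 <-> 4, as in the reproducer later in
# this thread).
links = {1: {"left": None, "right": 2},
         2: {"left": 1, "right": 4},
         4: {"left": 2, "right": None}}

def release_target(blkno, direction):
    """Which block number gets handed to _bt_parallel_release()."""
    if direction == "forward":
        return links[blkno]["right"]   # opaque->btpo_next in the real code
    # backward: BufferGetBlockNumber(so->currPos.buf) -- the current page,
    # since _bt_walk_left() starts from it
    return blkno

assert release_target(2, "forward") == 4
assert release_target(2, "backward") == 2
```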

> Thanks for looking into it.  I will see if we can write some test.  In
> the meantime if possible, can you please request Patrick Hemmer to
> verify the attached patch?

Our discussion was on the #postgresql Freenode channel.  I pointed him
at this thread, but I'm not sure if he'll see my message or be able to
test.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Kuntal Ghosh
Date:
On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> Thanks for looking into it.  I will see if we can write some test.  In
>> the meantime if possible, can you please request Patrick Hemmer to
>> verify the attached patch?
>
> Our discussion was on the #postgresql Freenode channel.  I pointed him
> at this thread, but I'm not sure if he'll see my message or be able to
> test.
After discussing with Amit, I was able to reproduce the scenario in a
master-standby setup. The issue occurs when we perform a parallel
index(-only) scan on a BTP_HALF_DEAD-marked page. (If a page is
marked as BTP_DELETED, it has already been unlinked from the index.)

When a btree page is deleted during vacuum, it is first marked
BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked BTP_DELETED
in _bt_unlink_halfdead_page, without releasing the cleanup lock on the
buffer in between. Hence, no scan node can lock the same buffer. So,
the issue can't be reproduced on master.

However, after replaying the XLOG_BTREE_MARK_PAGE_HALFDEAD record, the
standby releases the lock on the same buffer. If we force parallelism,
an index scan on the same page will hang the standby server.
Following is an (unpleasant) way to reproduce the issue:

In master (with autovacuum = off):
1. create table t1(a int primary key);
2. insert into t1 select * from generate_series(1,1000); --generates 3
leaf nodes (block no 1,2,4) and 1 root node (block no 3)
3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by leaf 2
4. analyze t1; --update the stats
5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
that the next vacuum will consider leaf 2 for page deletion
Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
can't unlink the page.

In standby,
1. force parallelism.
2. explain analyze select * from t1 where a>=1 and a<=1000; and the
parallel workers hang at the above-discussed place!
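For convenience, the master-side steps above consolidated into one script, including the manual VACUUM step (step 6) that is supplied later in this thread:

```sql
-- on the primary, with autovacuum = off
create table t1 (a int primary key);
insert into t1 select * from generate_series(1, 1000);  -- 3 leaf pages (1,2,4), root page 3
delete from t1 where a >= 367 and a <= 735;             -- empty out leaf 2
analyze t1;                                             -- update the stats
explain analyze select * from t1 where a >= 1 and a <= 1000;
-- put a breakpoint at _bt_unlink_halfdead_page, then:
vacuum t1;                                              -- step 6, added later in the thread
```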

The attached patch fixes the issue. One comment on the same:
+       else if (scan->parallel_scan != NULL)
+       {
+               /* allow next page to be processed by parallel worker */
+               _bt_parallel_release(scan, opaque->btpo_next);
+       }

  /* nope, keep going */
  if (scan->parallel_scan != NULL)

IMHO, there is no need to add an extra if condition;
_bt_parallel_release can be included in the next one.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Amit Kapila
Date:
On Tue, Dec 12, 2017 at 4:00 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>> Thanks for looking into it.  I will see if we can write some test.  In
>>> the meantime if possible, can you please request Patrick Hemmer to
>>> verify the attached patch?
>>
>> Our discussion was on the #postgresql Freenode channel.  I pointed him
>> at this thread, but I'm not sure if he'll see my message or be able to
>> test.
> After discussing with Amit, I'm able to reproduce the scenario in a
> master-standby setup. The issue occurs when we perform parallel
> index(-only) scan on a BTP_HALF_DEAD -marked page. (If a page is
> marked as BTP_DELETED, it's already unlinked from the index).
>
> When a btree page is deleted during vacuum, it's first marked as
> BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked as BTP_DELETED
> in _bt_unlink_halfdead_page without releasing cleanup lock on the
> buffer. Hence, any scan node cannot lock the same buffer. So, the
> issue can't be reproduced on master.
>
> However, after replaying XLOG_BTREE_MARK_PAGE_HALFDEAD record, standby
> releases the lock on the same buffer. If we force parallelism, an
> index scan on the same page will hang the standby server.
> Following is an (unpleasant) way to reproduce the issue:
>
> In master (with autovacuum = off):
> 1. create table t1(a int primary key);
> 2. insert into t1 select * from generate_series(1,1000); --generates 3
> leaf nodes (block no 1,2,4) and 1 root node (block no 3)
> 3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by leaf 2
> 4. analyze t1; --update the stats
> 5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
> that the next vacuum will consider leaf 2 for page deletion

What do you mean by next vacuum, here autovacuum is off?  Are you
missing any step which manually performs the vacuum?

> Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
> can't unlink the page.
>
> In standby,
> 1. force parallelism.
> 2. explain analyze select * from t1 where a>=1 and a<=1000; and the
> parallel workers hang at the above-discussed place!
>
> The attached patch fixes the issue. One comment on the same:
> +       else if (scan->parallel_scan != NULL)
> +       {
> +               /* allow next page to be processed by parallel worker */
> +               _bt_parallel_release(scan, opaque->btpo_next);
> +       }
>
>   /* nope, keep going */
>   if (scan->parallel_scan != NULL)
>
> IMHO, There is no need to add an extra if condition.
> _bt_parallel_release can be included in the next one.
>

I don't think so because it is quite possible that _bt_readpage has
already released it (consider the case where _bt_readpage returns
false).
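To illustrate the point with a hypothetical control-flow model (invented simplification, not the actual patch): on a live page, _bt_readpage() performs the release itself, so the new P_IGNORE branch must do its own release -- and only that branch, or a parallel participant would either release twice or not at all.

```python
# Model of the patched forward-scan step: every parallel path that keeps
# going must have released the scan exactly once before looping around to
# seize it again.
def read_next_page(page_is_ignorable, readpage_found_matches, parallel):
    released = False
    if not page_is_ignorable:
        # _bt_readpage() internally calls _bt_parallel_release()
        released = parallel
        if readpage_found_matches:
            return "stop", released
    elif parallel:
        released = True   # the patch's new release for half-dead/deleted pages
    return "keep going", released

# With the patch, no parallel path leaves the scan seized:
for ignorable in (True, False):
    for matches in (True, False):
        _, released = read_next_page(ignorable, matches, parallel=True)
        assert released
```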

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Thomas Munro
Date:
Hi,

Here's a reproducer which enabled me to reach this stuck state:

  pid  |  wait_event   |                                    query
-------+---------------+-----------------------------------------------------------------------------
 64617 |               | select pid, wait_event, query from pg_stat_activity where state = 'active';
 64619 | BufferPin     | VACUUM jobs
 64620 | ExecuteGather | SELECT COUNT(*) FROM jobs
 64621 | ExecuteGather | SELECT COUNT(*) FROM jobs
 64622 | ExecuteGather | SELECT COUNT(*) FROM jobs
 64623 | ExecuteGather | SELECT COUNT(*) FROM jobs
 84167 | BtreePage     | SELECT COUNT(*) FROM jobs
 84168 | BtreePage     | SELECT COUNT(*) FROM jobs
 96440 |               | SELECT COUNT(*) FROM jobs
 96438 |               | SELECT COUNT(*) FROM jobs
 96439 |               | SELECT COUNT(*) FROM jobs
(11 rows)

The main thread deletes stuff in the middle of the key range (not sure
if this is important) and vacuums in a loop, and meanwhile 4 threads
(the count is probably not important; it might as well be 1) run
Parallel Index Scans over the whole range, in the hope of hitting the
interesting case.  In the locked-up case I just saw, opaque->btpo_flags
had the BTP_DELETED bit set, not BTP_HALF_DEAD (I could tell because I
added logging).  Clearly pages are periodically being marked half-dead,
but I haven't yet managed to get an index scan to hit one of those.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Amit Kapila
Date:
On Wed, Dec 13, 2017 at 7:02 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi,
>
> Here's a reproducer which enabled me to reach this stuck state:
>
>   pid  |  wait_event   |                                    query
> -------+---------------+-----------------------------------------------------------------------------
>  64617 |               | select pid, wait_event, query from pg_stat_activity where state = 'active';
>  64619 | BufferPin     | VACUUM jobs
>  64620 | ExecuteGather | SELECT COUNT(*) FROM jobs
>  64621 | ExecuteGather | SELECT COUNT(*) FROM jobs
>  64622 | ExecuteGather | SELECT COUNT(*) FROM jobs
>  64623 | ExecuteGather | SELECT COUNT(*) FROM jobs
>  84167 | BtreePage     | SELECT COUNT(*) FROM jobs
>  84168 | BtreePage     | SELECT COUNT(*) FROM jobs
>  96440 |               | SELECT COUNT(*) FROM jobs
>  96438 |               | SELECT COUNT(*) FROM jobs
>  96439 |               | SELECT COUNT(*) FROM jobs
> (11 rows)
>
> The main thread deletes stuff in the middle of the key range (not sure
> if this is important) and vacuum in a loop, and meanwhile 4 threads
> (probably not important, might as well be 1) run Parallel Index Scans
> over the whole range, in the hope of hitting the interesting case.  In
> the locked-up case I just saw now opaque->btpo_flags had the
> BTP_DELETED bit set, not BTP_HALF_DEAD (I could tell because I added
> logging).
>

Good.  I hope that the patch I posted above resolves this problem.  I
am asking because you haven't explicitly confirmed that.

>  Clearly pages are periodically being marked half-dead but I
> haven't yet managed to get an index scan to hit one of those.
>

I think Kuntal has already been able to hit that case, so maybe that is enough.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Thomas Munro
Date:
On Wed, Dec 13, 2017 at 3:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Good.  I hope that the patch I have posted above is able to resolve
> this problem.  I am asking as you haven't explicitly mentioned that.

I can confirm that your patch fixes the problem for forward scans.
That is, I can see it reaching the BTP_DELETED case via an extra LOG
statement I added, and it worked correctly.  Good.

I don't know how to make it hit the backwards scan case.  I can get a
backward scan in a worker by changing the query to "select count(*)
from (select * from jobs where id + 1 > id order by status desc) ss"
but I suspect that _bt_walk_left() may be hiding deleted pages from us
so the condition may not be reachable with this technique.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Thomas Munro
Date:
On Wed, Dec 13, 2017 at 4:20 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Dec 13, 2017 at 3:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Good.  I hope that the patch I have posted above is able to resolve
>> this problem.  I am asking as you haven't explicitly mentioned that.
>
> I can confirm that your patch fixes the problem for forward scans.
> That is, I can see it reaching the BTP_DELETED case via an extra LOG
> statement I added, and it worked correctly.  Good.
>
> I don't know how to make it hit the backwards scan case.  I can get a
> backward scan in a worker by changing the query to "select count(*)
> from (select * from jobs where id + 1 > id order by status desc) ss"
> but I suspect that _bt_walk_left() may be hiding deleted pages from us
> so the condition may not be reachable with this technique.

Hmm, no, that's not right: clearly it can return half-dead or deleted
pages to the caller.  So I don't know why I never seem to encounter
any, despite concurrent vacuums producing them; maybe it's something to
do with the interlocking you get with vacuum when you traverse the
btree by walking left -- my btree-fu is not yet strong enough.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Kuntal Ghosh
Date:
On Tue, Dec 12, 2017 at 5:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 12, 2017 at 4:00 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>>> Thanks for looking into it.  I will see if we can write some test.  In
>>>> the meantime if possible, can you please request Patrick Hemmer to
>>>> verify the attached patch?
>>>
>>> Our discussion was on the #postgresql Freenode channel.  I pointed him
>>> at this thread, but I'm not sure if he'll see my message or be able to
>>> test.
>> After discussing with Amit, I'm able to reproduce the scenario in a
>> master-standby setup. The issue occurs when we perform parallel
>> index(-only) scan on a BTP_HALF_DEAD -marked page. (If a page is
>> marked as BTP_DELETED, it's already unlinked from the index).
>>
>> When a btree page is deleted during vacuum, it's first marked as
>> BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked as BTP_DELETED
>> in _bt_unlink_halfdead_page without releasing cleanup lock on the
>> buffer. Hence, any scan node cannot lock the same buffer. So, the
>> issue can't be reproduced on master.
>>
>> However, after replaying XLOG_BTREE_MARK_PAGE_HALFDEAD record, standby
>> releases the lock on the same buffer. If we force parallelism, an
>> index scan on the same page will hang the standby server.
>> Following is an (unpleasant) way to reproduce the issue:
>>
>> In master (with autovacuum = off):
>> 1. create table t1(a int primary key);
>> 2. insert into t1 select * from generate_series(1,1000); --generates 3
>> leaf nodes (block no 1,2,4) and 1 root node (block no 3)
>> 3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by leaf 2
>> 4. analyze t1; --update the stats
>> 5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
>> that the next vacuum will consider leaf 2 for page deletion
>
> What do you mean by next vacuum, here autovacuum is off?  Are you
> missing any step which manually performs the vacuum?
>
Yeah, you have to manually vacuum the table:
6. vacuum t1.

>> Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
>> can't unlink the page.
>>
>> In standby,
>> 1. force parallelism.
>> 2. explain analyze select * from t1 where a>=1 and a<=1000; and the
>> parallel workers hang at the above-discussed place!
>>



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Kuntal Ghosh
Date:
On Wed, Dec 13, 2017 at 10:33 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Tue, Dec 12, 2017 at 5:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, Dec 12, 2017 at 4:00 PM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>> On Mon, Dec 11, 2017 at 2:26 PM, Thomas Munro
>>> <thomas.munro@enterprisedb.com> wrote:
>>>> On Mon, Dec 11, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>>> Thanks for looking into it.  I will see if we can write some test.  In
>>>>> the meantime if possible, can you please request Patrick Hemmer to
>>>>> verify the attached patch?
>>>>
>>>> Our discussion was on the #postgresql Freenode channel.  I pointed him
>>>> at this thread, but I'm not sure if he'll see my message or be able to
>>>> test.
>>> After discussing with Amit, I'm able to reproduce the scenario in a
>>> master-standby setup. The issue occurs when we perform parallel
>>> index(-only) scan on a BTP_HALF_DEAD -marked page. (If a page is
>>> marked as BTP_DELETED, it's already unlinked from the index).
>>>
>>> When a btree page is deleted during vacuum, it's first marked as
>>> BTP_HALF_DEAD in _bt_mark_page_halfdead and then marked as BTP_DELETED
>>> in _bt_unlink_halfdead_page without releasing cleanup lock on the
>>> buffer. Hence, any scan node cannot lock the same buffer. So, the
>>> issue can't be reproduced on master.
>>>
>>> However, after replaying XLOG_BTREE_MARK_PAGE_HALFDEAD record, standby
>>> releases the lock on the same buffer. If we force parallelism, an
>>> index scan on the same page will hang the standby server.
>>> Following is an (unpleasant) way to reproduce the issue:
>>>
>>> In master (with autovacuum = off):
>>> 1. create table t1(a int primary key);
>>> 2. insert into t1 select * from generate_series(1,1000); --generates 3
>>> leaf nodes (block no 1,2,4) and 1 root node (block no 3)
>>> 3. delete from t1 where a>=367 and a<=735; --delete all tuples pointed by leaf 2
>>> 4. analyze t1; --update the stats
>>> 5. explain analyze select * from t1 where a>=1 and a<=1000; --ensures
>>> that the next vacuum will consider leaf 2 for page deletion
>>
>> What do you mean by next vacuum, here autovacuum is off?  Are you
>> missing any step which manually performs the vacuum?
>>
> Yeah, you've to manually vacuum the table.
> 6. vacuum t1.
>
>>> Now, put a break point at _bt_unlink_halfdead_page, so that vacuum
>>> can't unlink the page.
>>>
>>> In standby,
>>> 1. force parallelism.
>>> 2. explain analyze select * from t1 where a>=1 and a<=1000; and the
>>> parallel workers hang at the above-discussed place!
>>>
I've also verified the backward scan case with the query provided by
Thomas. In the standby:
2. explain analyze select * from t1 where a+1>a order by a desc; and
the parallel workers hang.
The patch fixes the issue.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Robert Haas
Date:
On Wed, Dec 13, 2017 at 12:35 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> I've also verified the backward scan case with the query provided by
> Thomas. In standby,
> 2. explain analyze select * from t1 where a+1>a order by a desc; and
> the parallel workers hang.
> The patch fixes the issue.

Committed and back-patched to v10.

Sorry, Kuntal, I forgot to credit you in the commit message.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

From: Amit Kapila
Date:
On Thu, Dec 14, 2017 at 2:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 13, 2017 at 12:35 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've also verified the backward scan case with the query provided by
>> Thomas. In standby,
>> 2. explain analyze select * from t1 where a+1>a order by a desc; and
>> the parallel workers hang.
>> The patch fixes the issue.
>
> Committed and back-patched to v10.
>

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com