Thread: Performance lossage in checkpoint dumping


Performance lossage in checkpoint dumping

From: Tom Lane
Date:
While poking at Peter Schmidt's comments about pgbench showing worse
performance than for 7.0 (using -F in both cases), I noticed that given
enough buffer space, FileWrite never seemed to get called at all.  A
little bit of sleuthing revealed the following:

1. Under WAL, we don't write dirty buffers out of the shared memory at
every transaction commit.  Instead, as long as a dirty buffer's slot
isn't needed for something else, it just sits there until the next
checkpoint or shutdown.  CreateCheckpoint calls FlushBufferPool which
writes out all the dirty buffers in one go.  This is a Good Thing; it
lets us consolidate multiple updates of a single datafile page by
successive transactions into one disk write.  We need this to buy back
some of the extra I/O required to write the WAL logfile.

2. However, this means that a lot of the dirty-buffer writes get done by
the periodic checkpoint process, not by the backends that originally
dirtied the buffers.  And that means that every last one gets done by
blind write, because the checkpoint process isn't going to have opened
any relation cache entries --- maybe a couple of system catalog
relations, but for sure it won't have any for user relations.  If you
look at BufferSync, any page that the current process doesn't have an
already-open relcache entry for is sent to smgrblindwrt not smgrwrite.

3. Blind write is gratuitously inefficient: it does separate open,
seek, write, close kernel calls for every request.  This was the right
thing in 7.0.*, because backends relatively seldom did blind writes and
even less often needed to blindwrite multiple pages of a single relation
in succession.  But the typical usage has changed a lot.


I am thinking it'd be a good idea if blind write went through fd.c and
thus was able to re-use open file descriptors, just like normal writes.
This should improve the efficiency of dumping dirty buffers during
checkpoint by a noticeable amount.

Comments?
        regards, tom lane


Re: Performance lossage in checkpoint dumping

From: Bruce Momjian
Date:
> 3. Blind write is gratuitously inefficient: it does separate open,
> seek, write, close kernel calls for every request.  This was the right
> thing in 7.0.*, because backends relatively seldom did blind writes and
> even less often needed to blindwrite multiple pages of a single relation
> in succession.  But the typical usage has changed a lot.
> 
> 
> I am thinking it'd be a good idea if blind write went through fd.c and
> thus was able to re-use open file descriptors, just like normal writes.
> This should improve the efficiency of dumping dirty buffers during
> checkpoint by a noticeable amount.

I totally agree the current code is broken.  I am reading what you say
and am thinking, "Oh well, we lose there, but at least we only open a
relation once and do them in one shot."  Now I am hearing that is not
true, and it is a performance problem.

This is not a total surprise.  We have that stuff pretty well
streamlined for the old behaviour.  Now that things have changed, I can
see the need to reevaluate stuff.

Not sure how to handle the beta issue though.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance lossage in checkpoint dumping

From: Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I am thinking it'd be a good idea if blind write went through fd.c and
>> thus was able to re-use open file descriptors, just like normal writes.
>> This should improve the efficiency of dumping dirty buffers during
>> checkpoint by a noticeable amount.

> Not sure how to handle the beta issue though.

After looking a little more, I think this is too big a change to risk
making for beta.  I was thinking it might be an easy change, but it's
not; there's no place to store the open-relation reference if we don't
have a relcache entry.  But we don't want to pay the price of opening a
relcache entry just to dump some buffers.

I recall Vadim speculating about decoupling the storage manager's notion
of open files from the relcache, and having a much more lightweight
open-relation mechanism at the smgr level.  That might be a good way
to tackle this.  But I'm not going to touch it for 7.1...
        regards, tom lane


Re: Performance lossage in checkpoint dumping

From: Bruce Momjian
Date:
> After looking a little more, I think this is too big a change to risk
> making for beta.  I was thinking it might be an easy change, but it's
> not; there's no place to store the open-relation reference if we don't
> have a relcache entry.  But we don't want to pay the price of opening a
> relcache entry just to dump some buffers.
> 
> I recall Vadim speculating about decoupling the storage manager's notion
> of open files from the relcache, and having a much more lightweight
> open-relation mechanism at the smgr level.  That might be a good way
> to tackle this.  But I'm not going to touch it for 7.1...

No way to group the writes so you can keep the most recent one open?



Re: Performance lossage in checkpoint dumping

From: Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> But I'm not going to touch it for 7.1...

> No way to group the writes so you can keep the most recent one open?

Don't see an easy way, do you?
        regards, tom lane


Re: Performance lossage in checkpoint dumping

From: Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> But I'm not going to touch it for 7.1...
> 
> > No way to group the writes so you can keep the most recent one open?
> 
> Don't see an easy way, do you?
> 

No, but I haven't looked at it.  I am now much more concerned with the
delay, and am wondering if I should start thinking about trying my idea
of looking for near-committers and post the patch to the list to see if
anyone likes it for 7.1 final.  Vadim will not be back in enough time to
write any new code in this area, I am afraid.

We could look to fix this in 7.1.1.  Let's see what the pgbench tester
comes back with when he sets the delay to zero.



Re: Performance lossage in checkpoint dumping

From: The Hermit Hacker
Date:
On Fri, 16 Feb 2001, Bruce Momjian wrote:

>
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > >> But I'm not going to touch it for 7.1...
> >
> > > No way to group the writes so you can keep the most recent one open?
> >
> > Don't see an easy way, do you?
> >
>
> No, but I haven't looked at it.  I am now much more concerned with the
> delay, and am wondering if I should start thinking about trying my idea
> of looking for near-committers and post the patch to the list to see if
> anyone likes it for 7.1 final.  Vadim will not be back in enough time to
> write any new code in this area, I am afraid.

Near committers? *puzzled look*




Re: Performance lossage in checkpoint dumping

From: Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> No way to group the writes so you can keep the most recent one open?
> Don't see an easy way, do you?
>> 
>> No, but I haven't looked at it.  I am now much more concerned with the
>> delay,

I concur.  The blind write business is not important enough to hold up
the release --- for one thing, it has nothing to do with the pgbench
results we're seeing, because these tests don't run long enough to
include any checkpoint cycles.  The commit delay, on the other hand,
is a big problem.

>> and am wondering if I should start thinking about trying my idea
>> of looking for near-committers and post the patch to the list to see if
>> anyone likes it for 7.1 final.  Vadim will not be back in enough time to
>> write any new code in this area, I am afraid.

> Near committers? *puzzled look*

Processes nearly ready to commit.  I'm thinking that any mechanism for
detecting that might be overkill, however, especially compared to just
setting commit_delay to zero by default.

I've been sitting here running pgbench under various scenarios, and so
far I can't find any condition where commit_delay>0 is materially better
than commit_delay=0, even under heavy load.  It's either the same or
much worse.  Numbers to follow...
        regards, tom lane


Re: Performance lossage in checkpoint dumping

From: Bruce Momjian
Date:
> > No, but I haven't looked at it.  I am now much more concerned with the
> > delay, and am wondering if I should start thinking about trying my idea
> > of looking for near-committers and post the patch to the list to see if
> > anyone likes it for 7.1 final.  Vadim will not be back in enough time to
> > write any new code in this area, I am afraid.
> 
> Near committers? *puzzled look*

Umm, uh, it means backends that have entered COMMIT and will be issuing
an fsync() of their own very soon.  It took me a while to remember what
I meant too because I was thinking of CVS committers.



Re: Performance lossage in checkpoint dumping

From: The Hermit Hacker
Date:
On Sat, 17 Feb 2001, Bruce Momjian wrote:

> > > No, but I haven't looked at it.  I am now much more concerned with the
> > > delay, and am wondering if I should start thinking about trying my idea
> > > of looking for near-committers and post the patch to the list to see if
> > > anyone likes it for 7.1 final.  Vadim will not be back in enough time to
> > > write any new code in this area, I am afraid.
> >
> > Near committers? *puzzled look*
>
> Umm, uh, it means backends that have entered COMMIT and will be issuing
> an fsync() of their own very soon.  It took me a while to remember what
> I mean too because I was thinking of CVS committers.

That's what I was thinking too, which was what was confusing the hell out
of me ... like, a near committer ... is that the guy sitting beside you
while you commit? :)




Re: Performance lossage in checkpoint dumping

From: The Hermit Hacker
Date:
On Sat, 17 Feb 2001, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > No way to group the writes so you can keep the most recent one open?
> > Don't see an easy way, do you?
> >>
> >> No, but I haven't looked at it.  I am now much more concerned with the
> >> delay,
>
> I concur.  The blind write business is not important enough to hold up
> the release --- for one thing, it has nothing to do with the pgbench
> results we're seeing, because these tests don't run long enough to
> include any checkpoint cycles.  The commit delay, on the other hand,
> is a big problem.
>
> >> and am wondering if I should start thinking about trying my idea
> >> of looking for near-committers and post the patch to the list to see if
> >> anyone likes it for 7.1 final.  Vadim will not be back in enough time to
> >> write any new code in this area, I am afraid.
>
> > Near committers? *puzzled look*
>
> Processes nearly ready to commit.  I'm thinking that any mechanism for
> detecting that might be overkill, however, especially compared to just
> setting commit_delay to zero by default.
>
> I've been sitting here running pgbench under various scenarios, and so
> far I can't find any condition where commit_delay>0 is materially better
> than commit_delay=0, even under heavy load.  It's either the same or
> much worse.  Numbers to follow...

Okay, if the whole commit_delay is purely meant as a performance thing,
I'd say go with lowering the default to zero for v7.1, and once Vadim gets
back, we can properly determine why it appears to improve performance in
his case ... since I believe his OS of choice is FreeBSD, and you
mentioned doing tests on it, I can't see how he'd have a finer-grained
select() than you have for testing ...