Discussion: Performance lossage in checkpoint dumping
While poking at Peter Schmidt's comments about pgbench showing worse
performance than for 7.0 (using -F in both cases), I noticed that given
enough buffer space, FileWrite never seemed to get called at all. A
little bit of sleuthing revealed the following:

1. Under WAL, we don't write dirty buffers out of the shared memory at
every transaction commit. Instead, as long as a dirty buffer's slot
isn't needed for something else, it just sits there until the next
checkpoint or shutdown. CreateCheckpoint calls FlushBufferPool which
writes out all the dirty buffers in one go. This is a Good Thing; it
lets us consolidate multiple updates of a single datafile page by
successive transactions into one disk write. We need this to buy back
some of the extra I/O required to write the WAL logfile.

2. However, this means that a lot of the dirty-buffer writes get done by
the periodic checkpoint process, not by the backends that originally
dirtied the buffers. And that means that every last one gets done by
blind write, because the checkpoint process isn't going to have opened
any relation cache entries --- maybe a couple of system catalog
relations, but for sure it won't have any for user relations. If you
look at BufferSync, any page that the current process doesn't have an
already-open relcache entry for is sent to smgrblindwrt not smgrwrite.

3. Blind write is gratuitously inefficient: it does separate open,
seek, write, close kernel calls for every request. This was the right
thing in 7.0.*, because backends relatively seldom did blind writes and
even less often needed to blindwrite multiple pages of a single relation
in succession. But the typical usage has changed a lot.

I am thinking it'd be a good idea if blind write went through fd.c and
thus was able to re-use open file descriptors, just like normal writes.
This should improve the efficiency of dumping dirty buffers during
checkpoint by a noticeable amount.

Comments?

			regards, tom lane

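[Editorial illustration, not part of the original message.] The cost
difference described in point 3 can be sketched roughly as follows. This
is a minimal Python model, not PostgreSQL code: `blind_write`, `FdCache`
and `flush_demo` are hypothetical names, the 8192-byte page size is
assumed, and the real fd.c also caps the number of open descriptors,
which this sketch leaves out. It only counts how many open() calls each
strategy needs to flush the same set of pages.

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size for the illustration


def blind_write(path, block_no, page):
    """Open, seek, write, close for every single page (the blind-write pattern)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.lseek(fd, block_no * PAGE_SIZE, os.SEEK_SET)
        os.write(fd, page)
    finally:
        os.close(fd)
    return 1  # one open() was needed for this page


class FdCache:
    """Hypothetical stand-in for fd.c: keeps descriptors open between writes."""

    def __init__(self):
        self._fds = {}   # path -> open file descriptor
        self.opens = 0   # number of open() calls actually issued

    def write(self, path, block_no, page):
        fd = self._fds.get(path)
        if fd is None:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            self._fds[path] = fd
            self.opens += 1
        os.lseek(fd, block_no * PAGE_SIZE, os.SEEK_SET)
        os.write(fd, page)

    def close_all(self):
        for fd in self._fds.values():
            os.close(fd)
        self._fds.clear()


def flush_demo(n_pages=16):
    """Write n_pages pages of one file both ways; return the open() counts and file size."""
    page = b"\0" * PAGE_SIZE
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relfile")
        blind_opens = sum(blind_write(path, i, page) for i in range(n_pages))
        cache = FdCache()
        for i in range(n_pages):
            cache.write(path, i, page)
        cache.close_all()
        size = os.path.getsize(path)
    return blind_opens, cache.opens, size
```

Calling `flush_demo(16)` flushes sixteen pages of a single file each
way. The blind path pays for an open and a close per page, while the
cached path opens the file once, which is the saving the proposal is
after when a checkpoint dumps many pages of the same relation.
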
> 3. Blind write is gratuitously inefficient: it does separate open,
> seek, write, close kernel calls for every request. This was the right
> thing in 7.0.*, because backends relatively seldom did blind writes and
> even less often needed to blindwrite multiple pages of a single relation
> in succession. But the typical usage has changed a lot.
>
> I am thinking it'd be a good idea if blind write went through fd.c and
> thus was able to re-use open file descriptors, just like normal writes.
> This should improve the efficiency of dumping dirty buffers during
> checkpoint by a noticeable amount.

I totally agree the current code is broken. I am reading what you say
and am thinking, "Oh well, we lose there, but at least we only open a
relation once and do them in one shot." Now I am hearing that is not
true, and it is a performance problem.

This is not a total surprise. We have that stuff pretty well
streamlined for the old behavior. Now that things have changed, I can
see the need to reevaluate stuff.

Not sure how to handle the beta issue though.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I am thinking it'd be a good idea if blind write went through fd.c and
>> thus was able to re-use open file descriptors, just like normal writes.
>> This should improve the efficiency of dumping dirty buffers during
>> checkpoint by a noticeable amount.

> Not sure how to handle the beta issue though.

After looking a little more, I think this is too big a change to risk
making for beta. I was thinking it might be an easy change, but it's
not; there's no place to store the open-relation reference if we don't
have a relcache entry. But we don't want to pay the price of opening a
relcache entry just to dump some buffers.

I recall Vadim speculating about decoupling the storage manager's notion
of open files from the relcache, and having a much more lightweight
open-relation mechanism at the smgr level. That might be a good way
to tackle this. But I'm not going to touch it for 7.1...

			regards, tom lane

> After looking a little more, I think this is too big a change to risk
> making for beta. I was thinking it might be an easy change, but it's
> not; there's no place to store the open-relation reference if we don't
> have a relcache entry. But we don't want to pay the price of opening a
> relcache entry just to dump some buffers.
>
> I recall Vadim speculating about decoupling the storage manager's notion
> of open files from the relcache, and having a much more lightweight
> open-relation mechanism at the smgr level. That might be a good way
> to tackle this. But I'm not going to touch it for 7.1...

No way to group the writes so you can keep the most recent one open?

-- 
  Bruce Momjian

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> But I'm not going to touch it for 7.1...

> No way to group the writes so you can keep the most recent one open?

Don't see an easy way, do you?

			regards, tom lane

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> But I'm not going to touch it for 7.1...
>
> > No way to group the writes so you can keep the most recent one open?
>
> Don't see an easy way, do you?
>

No, but I haven't looked at it. I am now much more concerned with the
delay, and am wondering if I should start thinking about trying my idea
of looking for near-committers and post the patch to the list to see if
anyone likes it for 7.1 final. Vadim will not be back in enough time to
write any new code in this area, I am afraid.

We could look to fix this in 7.1.1. Let's see what the pgbench tester
comes back with when he sets the delay to zero.

-- 
  Bruce Momjian

On Fri, 16 Feb 2001, Bruce Momjian wrote:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > >> But I'm not going to touch it for 7.1...
> >
> > > No way to group the writes so you can keep the most recent one open?
> >
> > Don't see an easy way, do you?
> >
>
> No, but I haven't looked at it. I am now much more concerned with the
> delay, and am wondering if I should start thinking about trying my idea
> of looking for near-committers and post the patch to the list to see if
> anyone likes it for 7.1 final. Vadim will not be back in enough time to
> write any new code in this area, I am afraid.

Near committers?  *puzzled look*

The Hermit Hacker <scrappy@hub.org> writes:
> No way to group the writes so you can keep the most recent one open?
> Don't see an easy way, do you?
>>
>> No, but I haven't looked at it. I am now much more concerned with the
>> delay,

I concur. The blind write business is not important enough to hold up
the release for --- for one thing, it has nothing to do with the pgbench
results we're seeing, because these tests don't run long enough to
include any checkpoint cycles. The commit delay, on the other hand,
is a big problem.

>> and am wondering if I should start thinking about trying my idea
>> of looking for near-committers and post the patch to the list to see if
>> anyone likes it for 7.1 final. Vadim will not be back in enough time to
>> write any new code in this area, I am afraid.

> Near committers?  *puzzled look*

Processes nearly ready to commit. I'm thinking that any mechanism for
detecting that might be overkill, however, especially compared to just
setting commit_delay to zero by default.

I've been sitting here running pgbench under various scenarios, and so
far I can't find any condition where commit_delay>0 is materially better
than commit_delay=0, even under heavy load. It's either the same or
much worse. Numbers to follow...

			regards, tom lane

> > No, but I haven't looked at it. I am now much more concerned with the
> > delay, and am wondering if I should start thinking about trying my idea
> > of looking for near-committers and post the patch to the list to see if
> > anyone likes it for 7.1 final. Vadim will not be back in enough time to
> > write any new code in this area, I am afraid.
>
> Near committers?  *puzzled look*

Umm, uh, it means backends that have entered COMMIT and will be issuing
an fsync() of their own very soon. It took me a while to remember what
I meant too because I was thinking of CVS committers.

-- 
  Bruce Momjian

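[Editorial illustration, not part of the original message.] The thread
does not include the patch, so the following is only a guess at the
shape of the "near-committers" check: sleep before fsync() only when
some other backend has already entered COMMIT and could share the
flush. The function name, its parameters and the threshold are all
hypothetical; only commit_delay is a setting mentioned in the thread.

```python
def commit_delay_to_apply(commit_delay, backends_in_commit, min_near_committers=1):
    """Return how long (in the same units as commit_delay) to sleep before fsync.

    commit_delay        -- the configured delay; 0 disables waiting entirely
    backends_in_commit  -- hypothetical count of *other* backends that have
                           entered COMMIT and will fsync very soon
    min_near_committers -- hypothetical threshold below which waiting is
                           assumed to be pure overhead
    """
    if commit_delay <= 0:
        return 0  # delay disabled: fsync immediately
    if backends_in_commit < min_near_committers:
        return 0  # nobody to share the fsync with, so do not wait
    return commit_delay  # someone else is about to commit: wait for them
```

With commit_delay set to zero the check is never consulted, which is
why simply changing the default is the cheaper option being weighed
against it in this thread.
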
On Sat, 17 Feb 2001, Bruce Momjian wrote:

> > > No, but I haven't looked at it. I am now much more concerned with the
> > > delay, and am wondering if I should start thinking about trying my idea
> > > of looking for near-committers and post the patch to the list to see if
> > > anyone likes it for 7.1 final. Vadim will not be back in enough time to
> > > write any new code in this area, I am afraid.
> >
> > Near committers?  *puzzled look*
>
> Umm, uh, it means backends that have entered COMMIT and will be issuing
> an fsync() of their own very soon. It took me a while to remember what
> I meant too because I was thinking of CVS committers.

That's what I was thinking too, which was what was confusing the hell out
of me ... like, a near committer ... is that the guy sitting beside you
while you commit? :)

On Sat, 17 Feb 2001, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > No way to group the writes so you can keep the most recent one open?
> > Don't see an easy way, do you?
> >>
> >> No, but I haven't looked at it. I am now much more concerned with the
> >> delay,
>
> I concur. The blind write business is not important enough to hold up
> the release for --- for one thing, it has nothing to do with the pgbench
> results we're seeing, because these tests don't run long enough to
> include any checkpoint cycles. The commit delay, on the other hand,
> is a big problem.
>
> >> and am wondering if I should start thinking about trying my idea
> >> of looking for near-committers and post the patch to the list to see if
> >> anyone likes it for 7.1 final. Vadim will not be back in enough time to
> >> write any new code in this area, I am afraid.
>
> > Near committers?  *puzzled look*
>
> Processes nearly ready to commit. I'm thinking that any mechanism for
> detecting that might be overkill, however, especially compared to just
> setting commit_delay to zero by default.
>
> I've been sitting here running pgbench under various scenarios, and so
> far I can't find any condition where commit_delay>0 is materially better
> than commit_delay=0, even under heavy load. It's either the same or
> much worse. Numbers to follow...

Okay, if the whole commit_delay is purely meant as a performance thing,
I'd say go with lowering the default to zero for v7.1, and once Vadim gets
back, we can properly determine why it appears to improve performance in
his case ... since I believe his OS of choice is FreeBSD, and you
mentioned doing tests on it, I can't see how he'd have a finer-grained
select() than you have for testing ...