Discussion: It's past time to redo the smgr API
A long time ago Vadim proposed that we should revise smgr.c's API so that it does not depend on Relations (relcache entries); rather, only a RelFileNode value should be needed to access a file in smgr and lower levels. This would allow us to get rid of the concept of "blind writes". As of CVS tip, with most writes being done by the bgwriter, most writes are blind. Try tracing the bgwriter: every write involves a file open and close, which has got to be hurting its performance. The same has been true for a while now of the background checkpointer.

I'm motivated to do something about this today because I'm reviewing the PITR patch that J.R. Nield and Patrick Macdonald were working on, and one of the things I don't care for is the kluges it adds to the smgr API to work around the lack of a Relation in some operations.

I propose the following revisions:

* smgr.c should internally keep a hashtable (indexed by RelFileNode) with "struct SMgrRelation" entries for each relation that has recently been accessed. File access information for the lower-level modules (md.c, fd.c) would appear in this hashtable entry.

* smgropen takes a RelFileNode and returns a pointer to a hashtable entry, which is used as the argument to all other smgr operations. smgropen itself does not cause any actual I/O; thus it doesn't matter if the file doesn't exist yet. smgropen followed by smgrcreate is the way to create a new relation on disk. Without smgrcreate, file existence will be checked at the first read or write attempt.

* smgrclose closes the md.c-level file and drops the hashtable entry. Hashtable entries remain valid unless explicitly closed (thus, multiple smgropens for the same file are legal).

* The rd_fd field of relcache nodes will be replaced by an "SMgrRelation *" pointer, so that callers having a Relation reference can access the smgr file easily.
  (In particular, this means that most calls to the bufmgr will still pass Relations, and we don't need to propagate the API change up to the bufmgr level.)

* The existing places in the bufmgr that do RelationNodeCacheGetRelation followed by either a regular or a blind write would be replaced by smgropen followed by smgrwrite. Since smgropen is just a hash table lookup, it is no more expensive than the RelationNodeCacheGetRelation it replaces. Note these places would *not* do smgrclose afterwards.

* Because we don't smgrclose after a write, it is possible to have "dangling" smgr entries that aren't useful any more, as well as open file descriptors underneath them. This isn't too big a deal on Unix but it will be fatal for the Windows port, since it would prevent a DROP TABLE if some other backend happens to have touched the table. What I propose to do about this is:

  1. In the bgwriter, at each checkpoint do "smgrcloseall" to close all open files.

  2. In regular backends, receipt of a relcache flush message will result in smgrclose(), even if there is not a relcache entry for the given relation ID. A global cache flush event (sinval buffer overflow) causes smgrcloseall.

  This ensures that open descriptors for a dead relation will not persist past the next checkpoint. We had already agreed, I think, that the best solution for Windows' problem with deleting open files is to retry pending deletes at checkpoint. This smgr rewrite will not contribute to actually making that happen, but it won't make things worse either.

* I'm going to get rid of the "which" argument that once selected a particular smgr implementation (md.c vs mm.c). It's vestigial at present and is misdefined anyway. If we were to resurrect the concept of multiple storage managers, we'd need the selection to be determinable from the RelFileNode value, rather than being a separate argument.
  (When we add tablespaces, we could imagine associating smgrs with tablespaces, but it would have to be the responsibility of smgr.c to determine which smgr to use based on the tablespace ID.)

* We can remove the "dummy cache" support in relcache.c, as well as RelationNodeCacheGetRelation() and the hashtable that supports it.

* The smgr hashtable entries could possibly be used for recording other status; for instance keeping track of which relations need fsync in the bgwriter. I'm also thinking of merging smgr.c's existing list-of-rels-to-be-deleted into this data structure.

* AFAICS the only downside of not having a Relation available in smgr.c and md.c is that error messages could only refer to the RelFileNode numbers and not to the relation name. I'm not sure this is bad, since in my experience what you want to know about such errors is the actual disk filename, which RelFileNode tells you and relation name doesn't. We could preserve the current behavior by passing the relation name to smgropen when available, and saving the name in struct SMgrRelation. But I'm inclined not to.

Comments?

			regards, tom lane
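The proposed hashtable could look roughly like the following. This is a hypothetical, heavily simplified sketch: the real smgr.c would use PostgreSQL's dynahash facility and delegate file handling to md.c/fd.c, whereas the fixed-size table, hash function, and field names here are illustrative only. The key points it demonstrates are that smgropen is a pure lookup-or-create with no I/O, and that repeated smgropens for the same RelFileNode yield the same entry.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int Oid;

typedef struct RelFileNode
{
    Oid tblNode;   /* database (later, perhaps, tablespace) */
    Oid relNode;   /* relation file number */
} RelFileNode;

typedef struct SMgrRelation
{
    RelFileNode smgr_rnode;  /* hash key */
    int         md_fd;       /* md.c-level descriptor; -1 = not opened yet */
    int         in_use;      /* slot occupied? */
} SMgrRelation;

#define SMGR_TABLE_SIZE 64
static SMgrRelation SMgrTable[SMGR_TABLE_SIZE];

static unsigned hash_rnode(RelFileNode r)
{
    return (r.tblNode * 31u + r.relNode) % SMGR_TABLE_SIZE;
}

/* Lookup-or-create; performs no file I/O, so the file need not exist. */
SMgrRelation *smgropen(RelFileNode rnode)
{
    unsigned start = hash_rnode(rnode);
    int free_slot = -1;

    /* One full scan: look for an existing entry, remembering the first
     * free slot in case we have to create one. */
    for (unsigned probe = 0; probe < SMGR_TABLE_SIZE; probe++)
    {
        unsigned i = (start + probe) % SMGR_TABLE_SIZE;
        if (SMgrTable[i].in_use)
        {
            if (memcmp(&SMgrTable[i].smgr_rnode, &rnode, sizeof rnode) == 0)
                return &SMgrTable[i];       /* already have an entry */
        }
        else if (free_slot < 0)
            free_slot = (int) i;
    }
    if (free_slot < 0)
        return NULL;                        /* table full; real code grows it */

    SMgrTable[free_slot].smgr_rnode = rnode;
    SMgrTable[free_slot].md_fd = -1;        /* no I/O here: file opened lazily */
    SMgrTable[free_slot].in_use = 1;
    return &SMgrTable[free_slot];
}

/* Close the md.c-level file (if open) and drop the hashtable entry. */
void smgrclose(SMgrRelation *reln)
{
    /* real code would FileClose(reln->md_fd) if it is open */
    reln->md_fd = -1;
    reln->in_use = 0;
}
```

Since smgropen is just a table lookup, the bufmgr call sites that today do RelationNodeCacheGetRelation followed by a regular or blind write could do smgropen followed by smgrwrite at comparable cost.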
On Thu, 5 Feb 2004, Tom Lane wrote:

> * Because we don't smgrclose after a write, it is possible to have
> "dangling" smgr entries that aren't useful any more, as well as open
> file descriptors underneath them. This isn't too big a deal on Unix
> but it will be fatal for the Windows port, since it would prevent a
> DROP TABLE if some other backend happens to have touched the table.
> What I propose to do about this is:
> 1. In the bgwriter, at each checkpoint do "smgrcloseall" to close
>    all open files.
> 2. In regular backends, receipt of a relcache flush message will
>    result in smgrclose(), even if there is not a relcache entry
>    for the given relation ID. A global cache flush event (sinval
>    buffer overflow) causes smgrcloseall.
> This ensures that open descriptors for a dead relation will not persist
> past the next checkpoint. We had already agreed, I think, that the
> best solution for Windows' problem with deleting open files is to retry
> pending deletes at checkpoint. This smgr rewrite will not contribute
> to actually making that happen, but it won't make things worse either.

'k, only comment is on this one ... would it not be a bit more efficient to add a flag to the "SMgrRelation *" structure that acts as a timer? If -1, the relation has been deleted; if >0, make it the epoch of the time that the Relation was last accessed. Then your 'smgrcloseall', instead of closing all files (even active/undeleted ones), would only close those that have either been idle for x minutes (maybe a postgresql.conf tunable there?) or been DROP'd.

Assuming, that is, that your end goal is to reduce the overall # of open/closes of files ... this way, instead of closing 20 just to reopen 19 of them, you only close the one that needs it ...

----
Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
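Marc's suggestion might be sketched like this. Everything here is hypothetical (the struct, field, tunable name, and helper are illustrative, not part of any actual patch): a per-entry last-access timestamp, where -1 marks a dropped relation, lets the checkpoint-time sweep skip entries that were recently in use.

```c
#include <time.h>

/* Illustrative entry for Marc's idea: -1 means the relation was dropped
 * and the entry must be closed unconditionally; any positive value is
 * the epoch of the last access. */
typedef struct SmgrEntry
{
    int    in_use;
    time_t last_access;   /* (time_t) -1 => relation dropped */
    /* ... RelFileNode key, file descriptors, etc. ... */
} SmgrEntry;

/* Hypothetical postgresql.conf tunable: idle time before closing. */
int smgr_idle_close_secs = 300;

/* Would a checkpoint-time sweep close this entry?  Dropped entries
 * always; live ones only if idle longer than the threshold. */
int entry_should_close(const SmgrEntry *e, time_t now)
{
    if (e->last_access == (time_t) -1)
        return 1;                               /* DROP'd: always close */
    return (now - e->last_access) > smgr_idle_close_secs;
}
```

A smgrcloseall variant built on this predicate would close only the one file that needs it instead of closing 20 files just to reopen 19, which is the tradeoff Marc is describing.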
"Marc G. Fournier" <scrappy@postgresql.org> writes: > 'k, only comment is on this one ... would it not be a bit more efficient > to add a flag to the "SMgrRelation *" structure that acts as a timer? Hm, we could try that, although I'm not sure it would help much. You'd have to set the timeout to be longer than a checkpoint interval to make any difference. In the back of my mind is the thought that the Windows guys are going to end up passing file-delete requests over to the bgwriter anyway, which would largely eliminate the issue --- the bgwriter would know which files need to be sgmrclose'd and wouldn't have to do smgrcloseall. (If they don't do this, how are they going to cope with backends that exit before their file deletion is completed?) I'll do it the easy way for now and we can refine it after we see what the file-close solution for Windows ends up looking like. regards, tom lane
On Thu, Feb 05, 2004 at 02:05:46PM -0500, Tom Lane wrote:

> * smgrclose closes the md.c-level file and drops the hashtable entry.
> Hashtable entries remain valid unless explicitly closed (thus, multiple
> smgropens for the same file are legal).

So, will there be a refcount on each cache entry? Else, how would smgrclose know when to drop the cache entry if there had been multiple smgropen calls for the same file? (Unless, of course, each smgropen call yields a different hash entry?)

> I'm also thinking of merging smgr.c's existing
> list-of-rels-to-be-deleted into this data structure.

Please don't. In the nested transaction environment, each subxact has to keep track of which files should be deleted. That's why I was trying to keep the list inside a transaction status stack. Another way to do it would be keeping the list of files to delete along with the deleting Xid, but that would also require keeping a list of which xacts aborted and which ones didn't.

> * AFAICS the only downside of not having a Relation available in smgr.c
> and md.c is that error messages could only refer to the RelFileNode
> numbers and not to the relation name. I'm not sure this is bad, since
> in my experience what you want to know about such errors is the actual
> disk filename, which RelFileNode tells you and relation name doesn't.

I agree it's best to have the filename in the error message, but IMHO the relation name should also be presented to the user for clarity.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Harden yourself, but never lose tenderness" (E. Guevara)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Thu, Feb 05, 2004 at 02:05:46PM -0500, Tom Lane wrote:
>> * smgrclose closes the md.c-level file and drops the hashtable entry.
>> Hashtable entries remain valid unless explicitly closed (thus, multiple
>> smgropens for the same file are legal).

> So, will there be a refcount on each cache entry?

No, I deliberately intended not. There is at most one persistent reference to an smgr hashtable entry; that's the one in the relcache entry for the relation. The only caller of smgrclose will be relcache.c, and it'll do it only when clearing the relcache reference (or, in the sinval case, upon determining that there is no relcache entry for the relation). Adding a reference count would just complicate matters --- for instance, we'd have to be able to reset them after an error, which would be rather hard from smgr's point of view since it really shouldn't be looking into the relcache to see if there's a reference there.

>> I'm also thinking of merging smgr.c's existing
>> list-of-rels-to-be-deleted into this data structure.

> Please don't. In the nested transaction environment, each subxact has
> to keep track of which files should be deleted. That's why I was trying
> to set the list inside a transaction status stack.

Hm. Okay, I'll leave that separate for now, although I think we do want to merge the structures eventually.

> Another way to do it would be keeping the list of files to delete along
> with the deleting Xid, but that would also require keeping a list of
> which xacts aborted and which ones didn't.

Really? When a subtransaction finishes, you either delete its files immediately (ie, if it aborted, you can drop files it created) or you reassign them as the responsibility of the parent (since any effects of the subtransaction really depend on whether the parent commits, you can't actually do its deletes yet). I don't see the need to remember subtransaction state further than that.
So ISTM all you need is one field added to the existing list entries to remember which subtransaction is currently "on the hook" for a given file-deletion request.

			regards, tom lane
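The scheme Tom describes could be sketched as follows. This is a hypothetical illustration only: PendingDelete, schedule_unlink, and at_subxact_end are made-up names standing in for smgr.c's real pending-deletes machinery, and an int stands in for RelFileNode. Each entry carries the nesting level of the subtransaction "on the hook" for it; subcommit reassigns entries to the parent level, while subabort acts on them immediately.

```c
#include <assert.h>

typedef struct PendingDelete
{
    int relnode;     /* stand-in for a RelFileNode */
    int atCommit;    /* nonzero: delete if the xact commits (a DROP);
                        zero: delete if the xact aborts (a CREATE) */
    int nestLevel;   /* subtransaction nesting level responsible */
    int live;        /* entry still pending? */
} PendingDelete;

#define MAX_PENDING 32
static PendingDelete pending[MAX_PENDING];
static int npending = 0;

/* Record a file-deletion obligation owned by the given nesting level. */
void schedule_unlink(int relnode, int atCommit, int nestLevel)
{
    PendingDelete pd = { relnode, atCommit, nestLevel, 1 };
    pending[npending++] = pd;
}

/* Subxact at 'level' ends.  On commit, hand its entries to the parent;
 * on abort, act on them now: drop files it created, cancel its DROPs.
 * Returns the number of files "physically" removed (real code would
 * unlink() them). */
int at_subxact_end(int level, int committed)
{
    int unlinked = 0;
    for (int i = 0; i < npending; i++)
    {
        if (!pending[i].live || pending[i].nestLevel != level)
            continue;
        if (committed)
            pending[i].nestLevel = level - 1;  /* parent now on the hook */
        else if (pending[i].atCommit)
            pending[i].live = 0;               /* DROP never happened */
        else
        {
            pending[i].live = 0;               /* remove file created here */
            unlinked++;
        }
    }
    return unlinked;
}
```

The one extra field is nestLevel, matching Tom's point that no further per-subtransaction state (such as a list of which xacts aborted) needs to be remembered.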
On Thu, Feb 05, 2004 at 04:55:44PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > Another way to do it would be keeping the list of files to delete along
> > with the deleting Xid, but that would also require keeping a list of
> > which xacts aborted and which ones didn't.
>
> Really? When a subtransaction finishes, you either delete its files
> immediately (ie, if it aborted, you can drop files it created) or you
> reassign them as the responsibility of the parent (since any effects of
> the subtransaction really depend on whether the parent commits, you
> can't actually do its deletes yet).

Right, that's the mechanism I originally envisioned. (A similar one is needed for deferred triggers, async notifies, heavy locks and on-commit actions.)

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
And God said: "Let there be Satan, so people don't blame everything on me.
And let there be lawyers, so people don't blame everything on Satan."
On Thu, 5 Feb 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > 'k, only comment is on this one ... would it not be a bit more efficient
> > to add a flag to the "SMgrRelation *" structure that acts as a timer?
>
> Hm, we could try that, although I'm not sure it would help much. You'd
> have to set the timeout to be longer than a checkpoint interval to make
> any difference.

Why? Setting it to the checkpoint interval itself should be sufficient, no? All you want to do is avoid closing any files that were used during that last checkpoint interval, since there is a good chance you'd have to once more reopen them in the checkpoint interval ...
"Marc G. Fournier" <scrappy@postgresql.org> writes: > Why? Setting it to the checkpoint interval itself should be sufficient, > no? All you want to do is avoid closing any files that were used during > that last checkpoint interval, since there is a good chance you'd have to > once more reopen them in the checkpoint interval ... If we did that then (on Windows) every DROP TABLE would take one extra checkpoint interval to take effect in terms of freeing disk space. Not sure if this is a good tradeoff for avoiding some file opens. In any case, I think we should leave it to be debated after we see what the Windows file-closing solution turns out to be. It might become a non-issue. regards, tom lane
On Thu, 5 Feb 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > Why? Setting it to the checkpoint interval itself should be sufficient,
> > no? All you want to do is avoid closing any files that were used during
> > that last checkpoint interval, since there is a good chance you'd have to
> > once more reopen them in the checkpoint interval ...
>
> If we did that then (on Windows) every DROP TABLE would take one extra
> checkpoint interval to take effect in terms of freeing disk space.
> Not sure if this is a good tradeoff for avoiding some file opens.

k, but that would be a different scenario, no? As I mentioned in my original, a DROP TABLE would reset its timeout to -1, meaning to close it and drop it on the next checkpoint interval ...
"Marc G. Fournier" <scrappy@postgresql.org> writes: > k, but that would be a different scenario, no? As I mentioned in my > original, a DROP TABLE would reset its timeout to -1, meaning to close it > and drop it on the next checkpoint interval ... How would it do that? These structs are local to particular backends, so there's no way for a DROP TABLE occurring in one backend to reach over and reset the timeout in the bgwriter process. If we add communication that lets the bgwriter know the table is being dropped, then the problem is solved directly. regards, tom lane
On Thu, 5 Feb 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > k, but that would be a different scenario, no? As I mentioned in my
> > original, a DROP TABLE would reset its timeout to -1, meaning to close it
> > and drop it on the next checkpoint interval ...
>
> How would it do that? These structs are local to particular backends,
> so there's no way for a DROP TABLE occurring in one backend to reach
> over and reset the timeout in the bgwriter process. If we add
> communication that lets the bgwriter know the table is being dropped,
> then the problem is solved directly.

D'oh, okay, took me a second to clue into what it was I wasn't cluing into ... got it now, thanks ...
> * AFAICS the only downside of not having a Relation available in smgr.c
> and md.c is that error messages could only refer to the RelFileNode
> numbers and not to the relation name. I'm not sure this is bad, since
> in my experience what you want to know about such errors is the actual
> disk filename, which RelFileNode tells you and relation name doesn't.
> We could preserve the current behavior by passing the relation name to
> smgropen when available, and saving the name in struct SMgrRelation.
> But I'm inclined not to.
>
> Comments?

That all sounds pretty nice. From my point of view, I recall you saying a long time ago that this would need to be done for tablespaces --- so I just request that the rewrite be done with future tablespaces in mind :)

Chris