Обсуждение: It's past time to redo the smgr API

Поиск
Список
Период
Сортировка

It's past time to redo the smgr API

От
Tom Lane
Дата:
A long time ago Vadim proposed that we should revise smgr.c's API so
that it does not depend on Relations (relcache entries); rather, only
a RelFileNode value should be needed to access a file in smgr and lower
levels.  This would allow us to get rid of the concept of "blind
writes".  As of CVS tip, with most writes being done by the bgwriter,
most writes are blind.  Try tracing the bgwriter: every write involves
a file open and close, which has got to be hurting its performance.
The same has been true for awhile now of the background checkpointer.

I'm motivated to do something about this today because I'm reviewing the
PITR patch that J.R. Nield and Patrick Macdonald were working on, and
one of the things I don't care for is the kluges it adds to the smgr API
to work around lack of a Relation in some operations.

I propose the following revisions:

* smgr.c should internally keep a hashtable (indexed by RelFileNode)
with "struct SMgrRelation" entries for each relation that has recently
been accessed.  File access information for the lower-level modules
(md.c, fd.c) would appear in this hashtable entry.

* smgropen takes a RelFileNode and returns a pointer to a hashtable
entry, which is used as the argument to all other smgr operations.
smgropen itself does not cause any actual I/O; thus it doesn't matter
if the file doesn't exist yet.  smgropen followed by smgrcreate is the
way to create a new relation on disk.  Without smgrcreate, file
existence will be checked at the first read or write attempt.

* smgrclose closes the md.c-level file and drops the hashtable entry.
Hashtable entries remain valid unless explicitly closed (thus, multiple
smgropens for the same file are legal).

* The rd_fd field of relcache nodes will be replaced by an "SMgrRelation *"
pointer, so that callers having a Relation reference can access the smgr
file easily.  (In particular, this means that most calls to the bufmgr
will still pass Relations, and we don't need to propagate the API change
up to the bufmgr level.)

* The existing places in the bufmgr that do RelationNodeCacheGetRelation
followed by either regular or blind write would be replaced by smgropen
followed by smgrwrite.  Since smgropen is just a hash table lookup, it
is no more expensive than the RelationNodeCacheGetRelation it replaces.
Note these places would *not* do smgrclose afterwards.

* Because we don't smgrclose after a write, it is possible to have
"dangling" smgr entries that aren't useful any more, as well as open
file descriptors underneath them.  This isn't too big a deal on Unix
but it will be fatal for the Windows port, since it would prevent a
DROP TABLE if some other backend happens to have touched the table.
What I propose to do about this is: 1. In the bgwriter, at each checkpoint do "smgrcloseall" to close    all open
files.2. In regular backends, receipt of a relcache flush message will    result in smgrclose(), even if there is not a
relcacheentry    for the given relation ID.  A global cache flush event (sinval    buffer overflow) causes
smgrcloseall.
This ensures that open descriptors for a dead relation will not persist
past the next checkpoint.  We had already agreed, I think, that the
best solution for Windows' problem with deleting open files is to retry
pending deletes at checkpoint.  This smgr rewrite will not contribute
to actually making that happen, but it won't make things worse either.

* I'm going to get rid of the "which" argument that once selected a
particular smgr implementation (md.c vs mm.c).  It's vestigial at
present and is misdefined anyway.  If we were to resurrect the concept
of multiple storage managers, we'd need the selection to be determinable
from the RelFileNode value, rather than being a separate argument.
(When we add tablespaces, we could imagine associating smgrs with
tablespaces, but it would have to be the responsibility of smgr.c to
determine which smgr to use based on the tablespace ID.)

* We can remove the "dummy cache" support in relcache.c, as well as
RelationNodeCacheGetRelation() and the hashtable that supports it.

* The smgr hashtable entries could possibly be used for recording other
status; for instance keeping track of which relations need fsync in the
bgwriter.  I'm also thinking of merging smgr.c's existing
list-of-rels-to-be-deleted into this data structure.

* AFAICS the only downside of not having a Relation available in smgr.c
and md.c is that error messages could only refer to the RelFileNode
numbers and not to the relation name.  I'm not sure this is bad, since
in my experience what you want to know about such errors is the actual
disk filename, which RelFileNode tells you and relation name doesn't.
We could preserve the current behavior by passing the relation name to
smgropen when available, and saving the name in struct SMgrRelation.
But I'm inclined not to.

Comments?
        regards, tom lane


Re: It's past time to redo the smgr API

От
"Marc G. Fournier"
Дата:
On Thu, 5 Feb 2004, Tom Lane wrote:

> * Because we don't smgrclose after a write, it is possible to have
> "dangling" smgr entries that aren't useful any more, as well as open
> file descriptors underneath them.  This isn't too big a deal on Unix
> but it will be fatal for the Windows port, since it would prevent a
> DROP TABLE if some other backend happens to have touched the table.
> What I propose to do about this is:
>   1. In the bgwriter, at each checkpoint do "smgrcloseall" to close
>      all open files.
>   2. In regular backends, receipt of a relcache flush message will
>      result in smgrclose(), even if there is not a relcache entry
>      for the given relation ID.  A global cache flush event (sinval
>      buffer overflow) causes smgrcloseall.
> This ensures that open descriptors for a dead relation will not persist
> past the next checkpoint.  We had already agreed, I think, that the
> best solution for Windows' problem with deleting open files is to retry
> pending deletes at checkpoint.  This smgr rewrite will not contribute
> to actually making that happen, but it won't make things worse either.

'k, only comment is on this one ... would it not be a bit more efficient
to add a flag to the "SMgrRelation *" structure that acts as a timer?  if
-1, then relation is deleted, is >0, make it epoch of the time that
Relation was last accessed?  then your 'smgrcloseall', instead of closing
all files (even active/undeleted ones), would only close those that are
either idle for x minutes (maybe a postgresql.conf tunable there?) or
those that have been DROP'd?  Assuming, that is, that your end goal is to
reduce the overall # of open/closes of files ... this way, instead of
closing 20 just to reopen 19 of them, you only close that one that needs
it ...




----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: It's past time to redo the smgr API

От
Tom Lane
Дата:
"Marc G. Fournier" <scrappy@postgresql.org> writes:
> 'k, only comment is on this one ... would it not be a bit more efficient
> to add a flag to the "SMgrRelation *" structure that acts as a timer?

Hm, we could try that, although I'm not sure it would help much.  You'd
have to set the timeout to be longer than a checkpoint interval to make
any difference.

In the back of my mind is the thought that the Windows guys are going to
end up passing file-delete requests over to the bgwriter anyway, which
would largely eliminate the issue --- the bgwriter would know which
files need to be sgmrclose'd and wouldn't have to do smgrcloseall.
(If they don't do this, how are they going to cope with backends that
exit before their file deletion is completed?)

I'll do it the easy way for now and we can refine it after we see what
the file-close solution for Windows ends up looking like.
        regards, tom lane


Re: It's past time to redo the smgr API

От
Alvaro Herrera
Дата:
On Thu, Feb 05, 2004 at 02:05:46PM -0500, Tom Lane wrote:

> * smgrclose closes the md.c-level file and drops the hashtable entry.
> Hashtable entries remain valid unless explicitly closed (thus, multiple
> smgropens for the same file are legal).

So, will there be a refcount on each cache entry?  Else, how would
smgrclose when to drop the cache entry if there had been multiple
smgropen calls to the same file?  (unless, of course, each smgropen call
yields a different hash entry?)


> I'm also thinking of merging smgr.c's existing
> list-of-rels-to-be-deleted into this data structure.

Please don't.  In the nested transaction environment, each subxact has
to keep track of which files should be deleted.  That's why I was trying
to set the list inside a transaction status stack.

Another way to do it would be keeping the list of files to delete along
with the deleting Xid, but that would also require keeping a list of
which xacts aborted and which ones didn't.

> * AFAICS the only downside of not having a Relation available in smgr.c
> and md.c is that error messages could only refer to the RelFileNode
> numbers and not to the relation name.  I'm not sure this is bad, since
> in my experience what you want to know about such errors is the actual
> disk filename, which RelFileNode tells you and relation name doesn't.

I agree it's best to have the filename in the error message, but IMHO
the relation name should also be presented to the user for clarity.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Endurecerse, pero jamás perder la ternura" (E. Guevara)


Re: It's past time to redo the smgr API

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Thu, Feb 05, 2004 at 02:05:46PM -0500, Tom Lane wrote:
>> * smgrclose closes the md.c-level file and drops the hashtable entry.
>> Hashtable entries remain valid unless explicitly closed (thus, multiple
>> smgropens for the same file are legal).

> So, will there be a refcount on each cache entry?

No, I deliberately intended not.  There is at most one persistent
reference to an smgr hashtable entry; that's the one in the relcache
entry for the relation.  The only caller of smgrclose will be
relcache.c, and it'll do it only when clearing the relcache reference
(or, in the sinval case, upon determining that there is no relcache
entry for the relation).  Adding a reference count would just complicate
matters --- for instance, we'd have to be able to reset them after an
error, which would be rather hard from smgr's point of view since it
really shouldn't be looking into the relcache to see if there's a
reference there.

>> I'm also thinking of merging smgr.c's existing
>> list-of-rels-to-be-deleted into this data structure.

> Please don't.  In the nested transaction environment, each subxact has
> to keep track of which files should be deleted.  That's why I was trying
> to set the list inside a transaction status stack.

Hm.  Okay, I'll leave that separate for now, although I think we do want
to merge the structures eventually.

> Another way to do it would be keeping the list of files to delete along
> with the deleting Xid, but that would also require keeping a list of
> which xacts aborted and which ones didn't.

Really?  When a subtransaction finishes, you either delete its files
immediately (ie, if it aborted, you can drop files it created) or you
reassign them as the responsibility of the parent (since any effects of
the subtransaction really depend on whether the parent commits, you
can't actually do its deletes yet).  I don't see the need to remember
subtransaction state further than that.  So ISTM all you need is one
field added to the existing list entries to remember which
subtransaction is currently "on the hook" for a given file-deletion
request.
        regards, tom lane


Re: It's past time to redo the smgr API

От
Alvaro Herrera
Дата:
On Thu, Feb 05, 2004 at 04:55:44PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

> > Another way to do it would be keeping the list of files to delete along
> > with the deleting Xid, but that would also require keeping a list of
> > which xacts aborted and which ones didn't.
> 
> Really?  When a subtransaction finishes, you either delete its files
> immediately (ie, if it aborted, you can drop files it created) or you
> reassign them as the responsibility of the parent (since any effects of
> the subtransaction really depend on whether the parent commits, you
> can't actually do its deletes yet).

Right, that's the mechanism I originally envisioned.  (A similar one is
needed for deferred triggers, async notifies, heavy locks and on commit
actions.)

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Y dijo Dios: "Que sea Satanás, para que la gente no me culpe de todo a mí."
"Y que hayan abogados, para que la gente no culpe de todo a Satanás"


Re: It's past time to redo the smgr API

От
"Marc G. Fournier"
Дата:
On Thu, 5 Feb 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > 'k, only comment is on this one ... would it not be a bit more efficient
> > to add a flag to the "SMgrRelation *" structure that acts as a timer?
>
> Hm, we could try that, although I'm not sure it would help much.  You'd
> have to set the timeout to be longer than a checkpoint interval to make
> any difference.

Why?  Setting it to the checkpoint interval itself should be sufficient,
no?  All you want to do is avoid closing any files that were used during
that last checkpoint interval, since there is a good chance you'd have to
once more reopen them in the checkpoint interval ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: It's past time to redo the smgr API

От
Tom Lane
Дата:
"Marc G. Fournier" <scrappy@postgresql.org> writes:
> Why?  Setting it to the checkpoint interval itself should be sufficient,
> no?  All you want to do is avoid closing any files that were used during
> that last checkpoint interval, since there is a good chance you'd have to
> once more reopen them in the checkpoint interval ...

If we did that then (on Windows) every DROP TABLE would take one extra
checkpoint interval to take effect in terms of freeing disk space.
Not sure if this is a good tradeoff for avoiding some file opens.

In any case, I think we should leave it to be debated after we see what
the Windows file-closing solution turns out to be.  It might become a
non-issue.
        regards, tom lane


Re: It's past time to redo the smgr API

От
"Marc G. Fournier"
Дата:
On Thu, 5 Feb 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > Why?  Setting it to the checkpoint interval itself should be sufficient,
> > no?  All you want to do is avoid closing any files that were used during
> > that last checkpoint interval, since there is a good chance you'd have to
> > once more reopen them in the checkpoint interval ...
>
> If we did that then (on Windows) every DROP TABLE would take one extra
> checkpoint interval to take effect in terms of freeing disk space.
> Not sure if this is a good tradeoff for avoiding some file opens.

k, but that would be a different scenario, no?  As I mentioned in my
original, a DROP TABLE would reset its timeout to -1, meaning to close it
and drop it on the next checkpoint interval ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: It's past time to redo the smgr API

От
Tom Lane
Дата:
"Marc G. Fournier" <scrappy@postgresql.org> writes:
> k, but that would be a different scenario, no?  As I mentioned in my
> original, a DROP TABLE would reset its timeout to -1, meaning to close it
> and drop it on the next checkpoint interval ...

How would it do that?  These structs are local to particular backends,
so there's no way for a DROP TABLE occurring in one backend to reach
over and reset the timeout in the bgwriter process.  If we add
communication that lets the bgwriter know the table is being dropped,
then the problem is solved directly.
        regards, tom lane


Re: It's past time to redo the smgr API

От
"Marc G. Fournier"
Дата:
On Thu, 5 Feb 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > k, but that would be a different scenario, no?  As I mentioned in my
> > original, a DROP TABLE would reset its timeout to -1, meaning to close it
> > and drop it on the next checkpoint interval ...
>
> How would it do that?  These structs are local to particular backends,
> so there's no way for a DROP TABLE occurring in one backend to reach
> over and reset the timeout in the bgwriter process.  If we add
> communication that lets the bgwriter know the table is being dropped,
> then the problem is solved directly.

D'oh, okay, took me a second to clue into what it was I wasn't cluing into
... got it now, thanks ...


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: It's past time to redo the smgr API

От
Christopher Kings-Lynne
Дата:
> * AFAICS the only downside of not having a Relation available in smgr.c
> and md.c is that error messages could only refer to the RelFileNode
> numbers and not to the relation name.  I'm not sure this is bad, since
> in my experience what you want to know about such errors is the actual
> disk filename, which RelFileNode tells you and relation name doesn't.
> We could preserve the current behavior by passing the relation name to
> smgropen when available, and saving the name in struct SMgrRelation.
> But I'm inclined not to.
> 
> Comments?

That all sounds pretty nice.  From my point of view I recall you saying 
that this would need to be done for tablespaces a long time ago - so I 
just request that the rewrite be done with future tablespaces in mind :)

Chris