Обсуждение: Improving backend launch time by preloading relcache
I spent the weekend fooling around trying to reduce the time needed to start a fresh backend. Profiling seemed to indicate that much of the time was going into loading entries into the relcache: relcache entry setup normally requires fetching rows from several different system catalogs. The obvious way to fix that is to preload entries somehow. It turns out we already have a mechanism for this (the pg_internal.init file), but it was only being used to preload entries for a few critical system indexes --- "critical" meaning "relcache/catcache initialization becomes an infinite recursion otherwise". I rearranged things so that pg_internal.init could cache entries for both plain relations and indexes, and then set it up to cache all the system catalogs and indexes that are referenced by catalog caches. (This is a somewhat arbitrary choice, but was easy to implement.) As near as I can tell, this reduces the user-space CPU time involved in a backend launch by about a factor of 5; and there's also a very significant reduction in traffic to shared memory, which should reduce contention problems when multiple backends are involved. It's difficult to measure this stuff, however ... profiling is of limited reliability when you can only get a few clock samples per process launch. I'm planning to commit these changes when 7.3 opens, unless I hear objections. A possible objection is that caching more system catalog descriptors makes it more difficult to support user alterations to the system catalogs; but we don't support those anyway, and I haven't heard of anyone working to remove the other obstacles to it. (Note that this wouldn't completely prevent such things; it would just be necessary to figure out when to delete the pg_internal.init cache file when making schema changes.) regards, tom lane
Tom Lane wrote: > > I spent the weekend fooling around trying to reduce the time needed to > start a fresh backend. Profiling seemed to indicate that much of the > time was going into loading entries into the relcache: relcache entry > setup normally requires fetching rows from several different system > catalogs. The obvious way to fix that is to preload entries somehow. > It turns out we already have a mechanism for this (the pg_internal.init > file), but it was only being used to preload entries for a few critical > system indexes --- "critical" meaning "relcache/catcache initialization > becomes an infinite recursion otherwise". I rearranged things so that > pg_internal.init could cache entries for both plain relations and > indexes, and then set it up to cache all the system catalogs and indexes > that are referenced by catalog caches. (This is a somewhat arbitrary > choice, but was easy to implement.) While examining this issue I found the following change about REINDEX. Subject: [COMMITTERS] pgsql/src/backend catalog/index.c commands/ind ... Date: Mon, 19 Nov 2001 21:46:13 -0500 (EST) From: tgl@postgresql.org To: pgsql-committers@postgresql.org CVSROOT: /cvsroot Module name: pgsql Changes by: tgl@postgresql.org 01/11/19 21:46:13 Modified files: src/backend/catalog: index.c src/backend/commands: indexcmds.c src/backend/tcop: utility.c Log message: Some minor tweaks of REINDEX processing: grab exclusivelock a little earlier, make error checks more uniform. The change on tcop/utility.c seems to inhibit the execution of REINDEX of system indexes under postmaster which I allowed except some system indexes in 7.1. Please put it back in 7.2.1. Inhibited relations are the indexes of the followings. [Shared relations] pg_database, pg_shadow, pg_group [Nailed relations] pg_class, pg_type, pg_attribute, pg_proc There are some trial stuff to handle nailed relations (mostly #ifdef'd ENABLE_REINDEX_NAILED_RELATIONS). Especially setNewRelfilenode() unlinks the pg_internal.init file in case the relation is nailed. However I don't rely on the mechanism so much that I can't feel like removing the #ifdef's. regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > The change on tcop/utility.c seems to inhibit the execution > of REINDEX of system indexes under postmaster which I allowed > except some system indexes in 7.1. That strikes me as a fairly dangerous idea. Do you really believe it's safe? Also, why would it be safe to allow reindex at the table level and not at the index level, which is what the code did before I touched it? > Especially setNewRelfilenode() unlinks the pg_internal.init > file in case the relation is nailed. Probably with this change I'm planning, it'll be necessary to unlink pg_internal.init for any system relation, not only nailed ones. Thanks for pointing that out. regards, tom lane
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > The change on tcop/utility.c seems to inhibit the execution > > of REINDEX of system indexes under postmaster which I allowed > > except some system indexes in 7.1. > > That strikes me as a fairly dangerous idea. Do you really > believe it's safe? Also, why would it be safe to allow reindex > at the table level and not at the index level, which is what > the code did before I touched it? REINDEX uses the relfilenode mechanism since 7.1 which lets the replacement of index files be under transactional control. I think it's safe enough. One thing I had to worry about REINDEX on system indexes is how to tell that the target index mustn't be used during the REINDEX operation. Turning off the relhasindex column in pg_class tells PG system that the indexes are unavailable now. It was implemented by me before 7.0. I didn't provided the way to inactivate indexes individually however. > > > Especially setNewRelfilenode() unlinks the pg_internal.init > > file in case the relation is nailed. > > Probably with this change I'm planning, it'll be necessary to unlink > pg_internal.init for any system relation, not only nailed ones. > Thanks for pointing that out. What I meant was to confirm if it's really reliable. Currently e.g. the failure of rename of temporary init file to pg_internal.init isn't fatal but it may be fatal if we include many relcache info in the pg_internal.init file. regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > REINDEX uses the relfilenode mechanism since 7.1 which > lets the replacement of index files be under transactional > control. I think it's safe enough. Okay, in that case tcop/utility is being too picky about all three cases, no? > What I meant was to confirm if it's really reliable. > Currently e.g. the failure of rename of temporary > init file to pg_internal.init isn't fatal but it > may be fatal if we include many relcache info in > the pg_internal.init file. Certainly not --- it must always be possible for a freshly started backend to build the pg_internal.init file from scratch. The reason for unlinking pg_internal.init after changing a catalog schema tuple is that future backends won't know you changed it unless pg_internal.init is rebuilt. Hmm ... what that says is that unlinking pg_internal.init in setRelfilenode is the wrong place. The right place is *after* committing your transaction and *before* sending shared cache inval messages. You can't unlink before you commit, or someone may rebuild using the old information. (A backend that's already logged into the PROC array when you send SI inval will find out about the changes via SI inval. One that is not yet logged in must be prevented from reading the now-obsolete pg_internal.init file. The startup sequence logs into PROC before trying to read pg_internal.init, so that part is done in the right order.) So we need a flag that will cause the unlink to happen at the right time in post-commit cleanup. VACUUM's got this same timing bug, although its change is only one of updating relpages/reltuples which is not so critical... regards, tom lane
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > REINDEX uses the relfilenode mechanism since 7.1 which > > lets the replacement of index files be under transactional > > control. I think it's safe enough. > > Okay, in that case tcop/utility is being too picky about all three > cases, no? Probably we don't have to keep the relhasindex info in the db any longer and we had better keep some info about REINDEX in memory local to the backend. In fact in the current implementation the relhasindex is local to the backend and any backend couldn't see the column committed to the status off. > Hmm ... what that says is that unlinking pg_internal.init in > setRelfilenode is the wrong place. Possibly. I couldn't find the appropriate place(way) then and so #ifdef's are still there. regards, Hiroshi Inoue
On Mon, 28 Jan 2002, Tom Lane wrote: > I spent the weekend fooling around trying to reduce the time needed to > start a fresh backend. Profiling seemed to indicate that much of the > time was going into loading entries into the relcache: relcache entry > setup normally requires fetching rows from several different system > catalogs. The obvious way to fix that is to preload entries somehow. > It turns out we already have a mechanism for this (the pg_internal.init > file), but it was only being used to preload entries for a few critical > system indexes --- "critical" meaning "relcache/catcache initialization > becomes an infinite recursion otherwise". I rearranged things so that > pg_internal.init could cache entries for both plain relations and > indexes, and then set it up to cache all the system catalogs and indexes > that are referenced by catalog caches. (This is a somewhat arbitrary > choice, but was easy to implement.) > > As near as I can tell, this reduces the user-space CPU time involved in > a backend launch by about a factor of 5; and there's also a very > significant reduction in traffic to shared memory, which should reduce Tom, what's about absolute timings ? It's quite interesting, because many people have to keep persistent connections to backend and if statup time would be small ( as in MySQL case ), it'd be possible just not waste a system resources ( in some situations ). > contention problems when multiple backends are involved. It's difficult > to measure this stuff, however ... profiling is of limited reliability > when you can only get a few clock samples per process launch. > > I'm planning to commit these changes when 7.3 opens, unless I hear > objections. A possible objection is that caching more system catalog > descriptors makes it more difficult to support user alterations to the > system catalogs; but we don't support those anyway, and I haven't heard > of anyone working to remove the other obstacles to it. (Note that this > wouldn't completely prevent such things; it would just be necessary to > figure out when to delete the pg_internal.init cache file when making > schema changes.) > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > Probably we don't have to keep the relhasindex info in the > db any longer and we had better keep some info about REINDEX > in memory local to the backend. I never did much care for the "change relhasindex" hack. Why isn't IsIgnoringSystemIndexes a sufficient solution? I don't really care if REINDEX is a little bit slower than it might be, so just turning off use of *all* system indexes seems like an adequate answer. >> Hmm ... what that says is that unlinking pg_internal.init in >> setRelfilenode is the wrong place. > Possibly. I couldn't find the appropriate place(way) then > and so #ifdef's are still there. Okay. I'll work on that when I commit the patches I have. regards, tom lane
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > Probably we don't have to keep the relhasindex info in the > > db any longer and we had better keep some info about REINDEX > > in memory local to the backend. > > I never did much care for the "change relhasindex" hack. Why isn't > IsIgnoringSystemIndexes a sufficient solution? I don't really care > if REINDEX is a little bit slower than it might be, so just turning > off use of *all* system indexes seems like an adequate answer. It may be a reasonable solution. I thought of another idea while reading the thread [HACKERS] sequence indexes. Currently REINDEX recreates indexes from the heap relations because the indexes may be corrupted. However we can recreate indexes from existent ones if they are sane. It would be a lot faster than the current way for large tables. Comments ? regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > I thought of another idea while reading the thread [HACKERS] > sequence indexes. Currently REINDEX recreates indexes from > the heap relations because the indexes may be corrupted. > However we can recreate indexes from existent ones if > they are sane. It would be a lot faster than the current > way for large tables. Hmm ... you are thinking about the case where REINDEX is being used not to recover from corruption, but just to shrink indexes that have accumulated too much free space. Okay, that's a reasonable case to try to optimize, though I'd like to think the problem will go away in a release or two when we implement VACUUM-time index shrinking. However, I'm not sure about the "lot faster" part. The only win I can see is that when rebuilding a btree index, you could skip the sort step by reading the old index in index order. This'd require hacking things deep in the guts of the btree index method, not at the level of the present REINDEX code. And AFAICS it doesn't translate at all to the other index types. Not sure it's worth the trouble. I'd rather see us expend the same effort on shrinking indexes on-the-fly in VACUUM. regards, tom lane
Tom Lane wrote: > > Hmm ... you are thinking about the case where REINDEX is being used > not to recover from corruption, but just to shrink indexes that have > accumulated too much free space. Yes. > Okay, that's a reasonable case to > try to optimize, though I'd like to think the problem will go away > in a release or two when we implement VACUUM-time index shrinking. > > However, I'm not sure about the "lot faster" part. The only win > I can see is that when rebuilding a btree index, you could skip > the sort step by reading the old index in index order. Don't we have to scan the (possibly larger) heap table ? regards, Hiroshi Inoue
> > Okay, that's a reasonable case to > > try to optimize, though I'd like to think the problem will go away > > in a release or two when we implement VACUUM-time index shrinking. > > > > However, I'm not sure about the "lot faster" part. The only win > > I can see is that when rebuilding a btree index, you could skip > > the sort step by reading the old index in index order. > > Don't we have to scan the (possibly larger) heap table ? Yes, but that is done with a seq scan, but the index has to be read in random order, since the index pages are not physically sorted on disk from lowest to highest value. Of course you can spare the sort, but overall ... Imho spending effort on VACUUM is more fruitful, and has the potential to allow much more concurrency than REINDEX, no ? Andreas
Zeugswetter Andreas SB SD wrote: > > > > Okay, that's a reasonable case to > > > try to optimize, though I'd like to think the problem will go away > > > in a release or two when we implement VACUUM-time index shrinking. > > > > > > However, I'm not sure about the "lot faster" part. The only win > > > I can see is that when rebuilding a btree index, you could skip > > > the sort step by reading the old index in index order. > > > > Don't we have to scan the (possibly larger) heap table ? > > Yes, but that is done with a seq scan, but the index has to be read in > random order, since the index pages are not physically sorted on disk > from lowest to highest value. Of course you can spare the sort, > but overall ... Reading a index file is not faster than reading the heap file ? Does sorting finish in a moment ? If so we have to use sequential scan much more often. Anyway there seems no point on changing REINDEX. The only thing I have to do is to remove the needless check in tcop/utility.c as soon as 7.2 is released. regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > Anyway there seems no point on changing REINDEX. > The only thing I have to do is to remove the needless > check in tcop/utility.c as soon as 7.2 is released. I don't believe it's needless, and I suggest you not remove it, until we do something about making the pg_internal unlink happen at the right time. With the unlink where it is, I think it's quite unsafe to reindex system indexes in a live database. regards, tom lane
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > Anyway there seems no point on changing REINDEX. > > The only thing I have to do is to remove the needless > > check in tcop/utility.c as soon as 7.2 is released. > > I don't believe it's needless, and I suggest you not remove it, > until we do something about making the pg_internal unlink happen > at the right time. With the unlink where it is, I think it's quite > unsafe to reindex system indexes in a live database. Currently there are just a few relations info kept in pg_internal.init and they are all nailed. I'm not allowing REINDEX for nailed relations though there's a #ifdef'd trial implementation. I'm intending the change for 7.2.1 not 7.3. If it isn't allowed in 7.2.x I would strongly object to the 7.2 release itself. regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > Currently there are just a few relations info kept in > pg_internal.init and they are all nailed. I'm not > allowing REINDEX for nailed relations Oh, okay. > a #ifdef'd trial implementation. I'm intending the > change for 7.2.1 not 7.3. If it isn't allowed in 7.2.x > I would strongly object to the 7.2 release itself. If you think it should be changed then change it now. I see no reason to wait. regards, tom lane