Discussion: Performance TODO items
I have thought of a few new TODO performance items:

1) Someone at O'Reilly suggested that we order our duplicate index
entries by tid, so if we are hitting the heap for lots of duplicates,
the hits will be on sequential pages. Seems like a nice idea.

2) After Tatsuo's report of running 1000 backends on pgbench, and from
a Solaris report, I think we should have a queue of backends waiting
for a spinlock, rather than having them sleep and try again. This is
particularly important for multi-processor machines.

3) I am reading the Solaris Internals book and there is mention of a
"free behind" capability for large sequential scans. When a large
sequential scan happens that would wipe out all the old cache entries,
the kernel detects this and places its previous pages first on the
free list. For our code, if we do a sequential scan of a table that is
larger than our buffer cache size, I think we should detect this and
do the same. See http://techdocs.postgresql.org for my performance
paper for an example.

New TODO entries are:

* Order duplicate index entries by tid
* Add queue of backends waiting for spinlock
* Add free-behind capability for large sequential scans

I will modify them with any comments people have.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  + If your life is a hard drive,      |  830 Blythe Avenue
  + Christ can be your backup.         |  Drexel Hill, Pennsylvania 19026
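Item 1 amounts to a secondary sort key at index build time: order the
entries by key, then by heap position, so runs of equal keys point at
sequential heap pages. A minimal sketch, assuming a toy `IndexEntry`
type with the TID packed into one integer (the real code would compare
block number and line offset separately):

```c
#include <stdlib.h>

/* Hypothetical flattened index entry.  The tid packs (blockno, offset)
 * so that comparing the packed value compares heap position. */
typedef struct
{
    long key;   /* stand-in for the real index key datum */
    long tid;   /* e.g. (blockno << 16) | offset */
} IndexEntry;

/* Order by key first, then by tid, so duplicate-key runs land on
 * sequential heap pages when the index is built. */
static int
entry_cmp(const void *a, const void *b)
{
    const IndexEntry *x = a;
    const IndexEntry *y = b;

    if (x->key != y->key)
        return (x->key < y->key) ? -1 : 1;
    if (x->tid != y->tid)
        return (x->tid < y->tid) ? -1 : 1;
    return 0;
}

void
sort_index_entries(IndexEntry *entries, size_t n)
{
    qsort(entries, n, sizeof(IndexEntry), entry_cmp);
}
```

This only orders entries at creation time; as Bruce notes below, no
extra work is done to keep the order as the index is updated.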
> > New TODO entries are:
> >
> > * Order duplicate index entries by tid
>
> In other words - add tid to index key: very old idea.

I was thinking during index creation it would be nice to order them by
tid, but not do lots of work to keep it that way.

> > * Add queue of backends waiting for spinlock
>
> We shouldn't mix two different approaches for different kinds of
> short-time internal locks - in one case we need a light lmgr (when
> we're going to keep the lock long enough, e.g. for I/O), and in other
> cases we'd do better to proceed with POSIX mutexes or semaphores
> instead of spinlocks. Queueing backends waiting for a spinlock sounds
> like nonsense - how are you going to protect such a queue? With
> spinlocks? -:)

Yes, I guess so, but hopefully we can spin waiting for the queue lock
rather than sleep. We could use POSIX spinlocks/semaphores now, but we
don't because of performance, right?

Should we be spinning waiting for the spinlock on multi-cpu machines?
Is that the answer?
> New TODO entries are:
>
> * Order duplicate index entries by tid

In other words - add tid to index key: very old idea.

> * Add queue of backends waiting for spinlock

We shouldn't mix two different approaches for different kinds of
short-time internal locks - in one case we need a light lmgr (when
we're going to keep the lock long enough, e.g. for I/O), and in other
cases we'd do better to proceed with POSIX mutexes or semaphores
instead of spinlocks. Queueing backends waiting for a spinlock sounds
like nonsense - how are you going to protect such a queue? With
spinlocks? -:)

Vadim
> > > * Order duplicate index entries by tid
> >
> > In other words - add tid to index key: very old idea.
>
> I was thinking during index creation it would be nice to order them
> by tid, but not do lots of work to keep it that way.

I hear this "not do lots of work" so often from you -:)
Days of simplicity are gone, Bruce. To continue, this project requires
more and more complex solutions.

> > > * Add queue of backends waiting for spinlock
> >
> > We shouldn't mix two different approaches for different kinds of
> > short-time internal locks - in one case we need a light lmgr (when
> > we're going to keep the lock long enough, e.g. for I/O), and in
> > other cases we'd do better to proceed with POSIX mutexes or
> > semaphores instead of spinlocks. Queueing backends waiting for a
> > spinlock sounds like nonsense - how are you going to protect such a
> > queue? With spinlocks? -:)
>
> Yes, I guess so, but hopefully we can spin waiting for the queue lock
> rather than sleep. We could use POSIX spinlocks/semaphores now, but
> we don't because of performance, right?

No. As long as no one has proved with a test that mutexes are bad for
performance... Funny, such a test would require ~ 1 day of work.

> Should we be spinning waiting for the spinlock on multi-cpu machines?
> Is that the answer?

What do you mean?

Vadim
> > > > * Order duplicate index entries by tid
> > >
> > > In other words - add tid to index key: very old idea.
> >
> > I was thinking during index creation it would be nice to order them
> > by tid, but not do lots of work to keep it that way.
>
> I hear this "not do lots of work" so often from you -:)
> Days of simplicity are gone, Bruce. To continue, this project
> requires more and more complex solutions.

Yep. I can dream. :-)

> > > > * Add queue of backends waiting for spinlock
> > >
> > > We shouldn't mix two different approaches for different kinds of
> > > short-time internal locks - in one case we need a light lmgr
> > > (when we're going to keep the lock long enough, e.g. for I/O),
> > > and in other cases we'd do better to proceed with POSIX mutexes
> > > or semaphores instead of spinlocks. Queueing backends waiting for
> > > a spinlock sounds like nonsense - how are you going to protect
> > > such a queue? With spinlocks? -:)
> >
> > Yes, I guess so, but hopefully we can spin waiting for the queue
> > lock rather than sleep. We could use POSIX spinlocks/semaphores
> > now, but we don't because of performance, right?
>
> No. As long as no one has proved with a test that mutexes are bad for
> performance... Funny, such a test would require ~ 1 day of work.

Good question. I know the number of function calls to the spinlock
stuff is huge. Seems real semaphores may be a big win on multi-cpu
boxes.

> > Should we be spinning waiting for the spinlock on multi-cpu
> > machines? Is that the answer?
>
> What do you mean?

On a single-cpu machine, if you can't get the spinlock, you should
just sleep and let the process that holds it continue. On a multi-cpu
machine, you perhaps should just keep trying, hoping that the process
that holds it finishes soon. I wonder if we should find some kind of
test on postmaster startup to see if we have multiple cpu's, and
change from spinlock sleeping to spinlock spin-waiting.
Add to this the fact that we can't sleep for less than one clock tick
on most machines (10ms), and the spinlock stuff clearly needs work.

TODO updated:

* Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
  spinning to obtain lock on multi-cpu systems
> > > We could use POSIX spinlocks/semaphores now but we
> > > don't because of performance, right?
> >
> > No. As long as no one has proved with a test that mutexes are bad
> > for performance... Funny, such a test would require ~ 1 day of
> > work.
>
> Good question. I know the number of function calls to the spinlock
> stuff is huge. Seems real semaphores may be a big win on multi-cpu
> boxes.

Ok, being tired of endless discussions, I'll try to use mutexes
instead of spinlocks and run pgbench on my Solaris WS 10 and E4500
(4 CPU) boxes.

Vadim
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> New TODO entries are:
> * Add queue of backends waiting for spinlock

I already see:

* Create spinlock sleepers queue so everyone doesn't wake up at once

BTW, I agree with Vadim's opinion that we should add a new type of
lock (intermediate between spinlocks and lock manager locks) rather
than try to add new semantics onto spinlocks. For example, it'd be
very nice to distinguish read-only and read-write access in this new
kind of lock, but we can't expect spinlocks to be able to do that.

regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Should we be spinning waiting for spinlock on multi-cpu machines? Is
> that the answer?

A multi-CPU machine is actually the only case where a true spinlock
*does* make sense. On a single CPU you might as well yield the CPU
immediately, because you have no chance of getting the lock until the
current holder is allowed to run again. On a multi-CPU machine it's a
reasonable guess that the current holder is running on one of the
other CPUs and may release the lock soon ("soon" == less than a
process dispatch cycle, hence busy-wait is better than releasing the
CPU).

We are currently using spinlocks for a lot of situations where the
mean time spent holding the lock is probably larger than "soon" as
defined above. We should have a different lock implementation for
those cases. True spinlocks should be reserved for protecting code
where the maximum time spent holding the lock is guaranteed to be very
short.

regards, tom lane
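The policy Tom describes - busy-wait only when the holder may be
running on another CPU, otherwise give up the processor immediately -
can be sketched roughly as follows. The names are hypothetical, the
CPU count would be detected once at postmaster start, and the GCC
atomic builtins stand in for the platform TAS code:

```c
#include <sched.h>

#define SPINS_BEFORE_YIELD 100

static volatile int lock_word = 0;
static long s_lock_ncpus = 1;   /* would be detected at postmaster start */

void
s_lock_acquire(void)
{
    for (;;)
    {
        /* On a single CPU, spinning is pointless: the holder cannot
         * release the lock until we give up the processor.  On a
         * multi-CPU box, spin a while first - the holder may release
         * the lock in less than a dispatch cycle. */
        int spins = (s_lock_ncpus > 1) ? SPINS_BEFORE_YIELD : 1;

        while (spins-- > 0)
            if (__sync_lock_test_and_set(&lock_word, 1) == 0)
                return;         /* got it */

        sched_yield();          /* give up the CPU, but avoid the
                                 * 10ms minimum select() sleep */
    }
}

void
s_lock_release(void)
{
    __sync_lock_release(&lock_word);
}
```

`sched_yield()` only sidesteps the clock-tick problem where the OS
provides it; the 10ms `select()` sleep Bruce mentions elsewhere in the
thread was the portable fallback of the era.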
> 3) I am reading the Solaris Internals book and there is mention of a
> "free behind" capability for large sequential scans. When a large
> sequential scan happens that would wipe out all the old cache
> entries, the kernel detects this and places its previous pages first
> on the free list. For our code, if we do a sequential scan of a table
> that is larger than our buffer cache size, I think we should detect
> this and do the same. See http://techdocs.postgresql.org for my
> performance paper for an example.
>
> New TODO entries are:
>
> * Order duplicate index entries by tid
> * Add queue of backends waiting for spinlock
> * Add free-behind capability for large sequential scans

So why do we cache sequentially-read pages? Or at least, why not have
an option to control it?

Oracle (to the best of my knowledge) does NOT cache pages read by a
sequential scan, for at least two reasons/assumptions (two being all
that I can recall):

1. Caching pages for sequential scans over sufficiently large tables
will just cycle the cache. The pages that will be cached at the end of
the query will be the last N pages of the table, so when the same
sequential query is run again, the scan from the beginning of the
table will start flushing the oldest cached pages, which are more than
likely going to be the ones that will be needed at the end of the
scan, etc., etc. In a multi-user environment, the effect is worse.

2. Concurrent or consecutive queries in a dynamic database will not
generate plans that use the same sequential scans, so they will tend
to thrash the cache.

Now there are some databases where the same general queries are run
time after time, and caching the pages from sequential scans does make
sense, but in larger, enterprise-type systems, indices are created to
help speed up the most-used queries, and the sequential cache entries
only serve to clutter the cache and flush the useful pages.
Is there any way that caching pages read in by a sequential scan could
be made a configurable option? Any chance someone could run pgbench on
a test system set up to not cache sequential reads?

Darren
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > New TODO entries are:
> > * Add queue of backends waiting for spinlock
>
> I already see:
>
> * Create spinlock sleepers queue so everyone doesn't wake up at once

That is an old copy of the TODO. I reworded it. You will only see this
now:

* Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
  spinning to obtain lock on multi-cpu systems

> BTW, I agree with Vadim's opinion that we should add a new type of
> lock (intermediate between spinlocks and lock manager locks) rather
> than try to add new semantics onto spinlocks. For example, it'd be
> very nice to distinguish read-only and read-write access in this new
> kind of lock, but we can't expect spinlocks to be able to do that.

Yes, I agree too. On a uniprocessor machine, if I can't get the
spinlock, I want to yield the cpu, ideally for less than 10ms. On a
multi-cpu machine, if the lock is held by another CPU and that process
is running, we want to spin waiting for the lock to free. If not, we
can sleep. We basically need some more sophisticated semantics around
these locks, or a move to OS semaphores.

In fact, can't we sleep on an interruptible system call, and send
signals to processes when we release the lock? That way we don't have
the 10ms sleep limitation. One idea is to have a byte for each backend
in shared memory and have each backend set it when it is waiting for
the semaphore. There would be no contention with multiple backends
registering their sleep at the same time.

We have seen reports of 4-cpu systems having poor performance while
the system is only 12% busy, perhaps because the processes are all
sleeping waiting for the next tick. I think multi-cpu machines are
going to give us new challenges.
> So why do we cache sequentially-read pages? Or at least, why not have
> an option to control it?
>
> Oracle (to the best of my knowledge) does NOT cache pages read by a
> sequential scan, for at least two reasons/assumptions (two being all
> that I can recall):
>
> 1. Caching pages for sequential scans over sufficiently large tables
> will just cycle the cache. The pages that will be cached at the end
> of the query will be the last N pages of the table, so when the same
> sequential query is run again, the scan from the beginning of the
> table will start flushing the oldest cached pages, which are more
> than likely going to be the ones that will be needed at the end of
> the scan, etc., etc. In a multi-user environment, the effect is
> worse.
>
> 2. Concurrent or consecutive queries in a dynamic database will not
> generate plans that use the same sequential scans, so they will tend
> to thrash the cache.
>
> Now there are some databases where the same general queries are run
> time after time, and caching the pages from sequential scans does
> make sense, but in larger, enterprise-type systems, indices are
> created to help speed up the most-used queries, and the sequential
> cache entries only serve to clutter the cache and flush the useful
> pages.
>
> Is there any way that caching pages read in by a sequential scan
> could be made a configurable option?
>
> Any chance someone could run pgbench on a test system set up to not
> cache sequential reads?

Yep, that is the issue. If the whole table fits in the cache, it is
great. If not, it is useless or worse, because it forces out other
pages. Right now the cache is oldest-out and doesn't know anything
about access patterns. We would need to get that info passed into the
cache, probably via some function parameter.
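The free-behind policy under discussion amounts to a hint about where
a released page goes in the buffer free list: a page read by a large
sequential scan goes to the evict-first end, so the scan recycles its
own buffers instead of flushing everything else. A toy sketch with
made-up types; the `seq_scan` flag is the access-pattern information
that would have to be passed into the cache:

```c
#include <string.h>

#define NBUFFERS 4

/* Toy free list: index 0 is the evict-first end. */
typedef struct
{
    int pages[NBUFFERS];
    int n;
} FreeList;

void
freelist_init(FreeList *fl)
{
    fl->n = 0;
}

void
freelist_release(FreeList *fl, int page, int seq_scan)
{
    if (fl->n == NBUFFERS)          /* full: evict the front entry */
    {
        memmove(fl->pages, fl->pages + 1, (NBUFFERS - 1) * sizeof(int));
        fl->n--;
    }

    if (seq_scan)                   /* free-behind: evict-first end */
    {
        memmove(fl->pages + 1, fl->pages, fl->n * sizeof(int));
        fl->pages[0] = page;
        fl->n++;
    }
    else                            /* normal page: evict-last end */
        fl->pages[fl->n++] = page;
}
```

A real implementation would hang this off the buffer manager's
replacement policy rather than a standalone list, but the point is the
same: sequential-scan pages become the first eviction candidates.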
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I have thought of a few new TODO performance items:
> 1) Someone at O'Reilly suggested that we order our duplicate index
> entries by tid so if we are hitting the heap for lots of duplicates,
> the hits will be on sequential pages. Seems like a nice idea.

A more general solution is for indexscan to collect up a bunch of TIDs
from the index, sort them in-memory by TID order, and then probe into
the heap with those TIDs. This is better than the above because you
get nice ordering of the heap accesses across multiple key values, not
just among the tuples with the same key. (In a unique or near-unique
index, the above idea is nearly worthless.)

In the best case there are few enough TIDs retrieved from the index
that you can just do this once, but even if there are lots of TIDs, it
should be a win to do this in batches of a few thousand TIDs.
Essentially we decouple indexscans into separate index-access and
heap-access phases.

One big problem is that this doesn't interact well with concurrent
VACUUM: our present solution for concurrent VACUUM assumes that
indexscans hold a pin on an index page until they've finished fetching
the pointed-to heap tuples. Another objection is that we'd have a
harder time implementing the TODO item of marking an indextuple dead
when its associated heaptuple is dead. Anyone see a way around these
problems?

regards, tom lane
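The decoupled two-phase scan described above could look roughly like
this. Everything here is a sketch: `next_tid` and `fetch_heap_tuple`
are hypothetical callbacks (pull the next TID from the index; probe
the heap), and the TID is packed into a single integer so numeric
order is heap order. The demo producer/consumer at the bottom exists
only to make the sketch self-contained:

```c
#include <stdlib.h>

#define BATCH_SIZE 4096         /* "a few thousand TIDs" per batch */

typedef unsigned long TID;      /* packed (blockno, offset) */

static int
tid_cmp(const void *a, const void *b)
{
    TID x = *(const TID *) a;
    TID y = *(const TID *) b;
    return (x < y) ? -1 : (x > y);
}

/* Index-access phase fills a batch of TIDs; heap-access phase visits
 * them in physical order.  Returns the number of tuples fetched. */
long
batched_indexscan(int (*next_tid)(TID *), void (*fetch_heap_tuple)(TID))
{
    TID batch[BATCH_SIZE];
    long fetched = 0;
    size_t n;

    do
    {
        for (n = 0; n < BATCH_SIZE && next_tid(&batch[n]); n++)
            ;
        qsort(batch, n, sizeof(TID), tid_cmp);  /* heap order */
        for (size_t i = 0; i < n; i++)
            fetch_heap_tuple(batch[i]);
        fetched += n;
    } while (n == BATCH_SIZE);
    return fetched;
}

/* --- tiny demo producer/consumer, for illustration only --- */
static TID demo_tids[] = {42, 7, 99, 7, 13};
static size_t demo_pos = 0;
static TID demo_visited[5];
static size_t demo_nvisited = 0;

static int
demo_next(TID *out)
{
    if (demo_pos >= 5)
        return 0;
    *out = demo_tids[demo_pos++];
    return 1;
}

static void
demo_fetch(TID t)
{
    demo_visited[demo_nvisited++] = t;
}
```

Note that the sketch deliberately ignores the two problems Tom raises:
it drops any index-page pin before the heap-access phase, and it gives
the heap fetch no way to reach back and mark the index tuple dead.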
> A more general solution is for indexscan to collect up a bunch of
> TIDs from the index, sort them in-memory by TID order, and then probe
> into the heap with those TIDs. This is better than the above because
> you get nice ordering of the heap accesses across multiple key
> values, not just among the tuples with the same key. (In a unique or
> near-unique index, the above idea is nearly worthless.)
>
> In the best case there are few enough TIDs retrieved from the index
> that you can just do this once, but even if there are lots of TIDs,
> it should be a win to do this in batches of a few thousand TIDs.
> Essentially we decouple indexscans into separate index-access and
> heap-access phases.
>
> One big problem is that this doesn't interact well with concurrent
> VACUUM: our present solution for concurrent VACUUM assumes that
> indexscans hold a pin on an index page until they've finished
> fetching the pointed-to heap tuples. Another objection is that we'd
> have a harder time implementing the TODO item of marking an
> indextuple dead when its associated heaptuple is dead. Anyone see a
> way around these problems?

Interesting. I figured the cache could keep most pages in such a case.
I was thinking more of helping file system readahead by requesting the
earlier block first in a multi-block request. Not sure how valuable
that would be.
On Mon, 30 Jul 2001, Bruce Momjian wrote:

> * Improve spinlock code, perhaps with OS semaphores, sleeper queue,
> or spinning to obtain lock on multi-cpu systems

You may be interested in a discussion which happened over on
linux-kernel a few months ago. Quite a lot of people want a
lightweight userspace semaphore, and for pretty much the same reasons.

Linus proposed a pretty interesting solution which has the same
minimal overhead as the current spinlocks in the non-contention case,
but avoids the spin where there's contention:

http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html

Matthew.
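The general shape of such a lightweight semaphore (this family of
ideas is what later became Linux futexes) is an atomic counter in
user-visible memory, with a kernel call only on contention. A rough
single-lock sketch, not the code from that thread: the GCC atomic
builtins and the POSIX semaphore used as the kernel-side wait object
are both stand-ins:

```c
#include <semaphore.h>

typedef struct
{
    volatile int count;         /* 1 = free, <= 0 = held (negative: waiters) */
    sem_t wait;                 /* kernel object, touched only on contention */
} FastLock;

void
fastlock_init(FastLock *l)
{
    l->count = 1;
    sem_init(&l->wait, 1, 0);   /* pshared, initially no wakeups pending */
}

void
fastlock_acquire(FastLock *l)
{
    /* Uncontended fast path: one atomic op, no kernel call. */
    if (__sync_fetch_and_sub(&l->count, 1) <= 0)
        sem_wait(&l->wait);     /* contended: sleep in the kernel */
}

void
fastlock_release(FastLock *l)
{
    if (__sync_fetch_and_add(&l->count, 1) < 0)
        sem_post(&l->wait);     /* someone is asleep: wake one waiter */
}
```

The property that matters for the TODO item is exactly what Matthew
describes: in the no-contention case the cost is one atomic operation,
the same as a spinlock, while under contention waiters sleep instead
of spinning or taking a fixed 10ms nap.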
> On Mon, 30 Jul 2001, Bruce Momjian wrote:
>
> > * Improve spinlock code, perhaps with OS semaphores, sleeper queue,
> > or spinning to obtain lock on multi-cpu systems
>
> You may be interested in a discussion which happened over on
> linux-kernel a few months ago. Quite a lot of people want a
> lightweight userspace semaphore, and for pretty much the same
> reasons.
>
> Linus proposed a pretty interesting solution which has the same
> minimal overhead as the current spinlocks in the non-contention case,
> but avoids the spin where there's contention:
>
> http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html

Yes, many OS's have user-space spinlocks, for the same performance
reason (no kernel call).
> > > > We could use POSIX spinlocks/semaphores now but we
> > > > don't because of performance, right?
> > >
> > > No. As long as no one has proved with a test that mutexes are bad
> > > for performance... Funny, such a test would require ~ 1 day of
> > > work.
> >
> > Good question. I know the number of function calls to the spinlock
> > stuff is huge. Seems real semaphores may be a big win on multi-cpu
> > boxes.
>
> Ok, being tired of endless discussions, I'll try to use mutexes
> instead of spinlocks and run pgbench on my Solaris WS 10 and E4500
> (4 CPU) boxes.

I have updated the TODO list with:

* Improve spinlock code
  o use SysV semaphores or queue of backends waiting on the lock
  o wake up sleeper or sleep for less than one clock tick
  o spin for lock on multi-cpu machines, yield on single-cpu machines
  o read/write locks