Synchronized scans versus relcache reinitialization
From | Tom Lane |
---|---|
Subject | Synchronized scans versus relcache reinitialization |
Date | |
Msg-id | 27386.1338059658@sss.pgh.pa.us |
Replies | Re: Synchronized scans versus relcache reinitialization |
| Re: Synchronized scans versus relcache reinitialization |
List | pgsql-hackers |
I've been poking at Jeff Frost's and Greg Mullane's recent reports of high load due to many processes getting "stuck" in relcache init file rebuild operations. I can reproduce a similar behavior here by creating a database containing a whole lot of many-column views, thereby bloating pg_attribute to the gigabyte range, then manually removing the pg_internal.init file (simulating what would happen after a relcache inval on any system catalog), and then throwing a bunch of new connections at the database simultaneously. Each new connection tries to rebuild the init file, and together they basically saturate the machine. I don't believe this case quite matches what happened to either Jeff or Greg, but it's quite reproducible nonetheless, and it needs to be fixed.

I can identify three sub-issues:

1. If pg_attribute is larger than 1/4th of shared_buffers, the synchronized-scan logic kicks in when we do seqscans to fill the tuple descriptors for core system catalogs. For this particular use case that's not merely unhelpful, it's positively disastrous. The reason is that the desired rows are almost always in the first couple dozen blocks of pg_attribute, and the reading code in RelationBuildTupleDesc knows this: it is coded to stop once it has collected the expected number of pg_attribute rows for the particular catalog. So even with a very large pg_attribute, not much work should be expended here. But the syncscan logic causes some of the heapscans to start from points later than block zero, so they miss the rows they need, and the scan has to run to the end and wrap around before it finds them all. In my test case on HEAD, with a single backend doing this in isolation, that happens just once out of the eleven heapscans that occur in this phase, yet it increases the startup time from a few milliseconds to about eight-tenths of a second, due to having to scan all of pg_attribute. (In my test case, pg_attribute is fully cached in RAM, but most of it is in kernel buffers not PG buffers.) Bad as that is, it gets rapidly worse with multiple incoming connections: all of them get swept up in the full-table syncscan started by the first arrival, so that all rather than only some of their heapscans start from a point later than block zero, meaning that all eleven of their heapscans, rather than just one, are unduly expensive.

It seems clear to me that we should just disable syncscans for the relcache reload heapscans; allowing them has lots of downside (it breaks the early-exit optimization in RelationBuildTupleDesc) and basically no upside. I'm inclined to just modify systable_beginscan to prevent use of syncscan whenever indexOK is false; a sketch of that change follows the numbered points below. If we wanted to change its API we could make this happen only for RelationBuildTupleDesc's calls, but I don't see any upside to allowing syncscans for other forced-heapscan callers either.

2. The larger problem is that when we have N incoming connections, we let all N of them try to rebuild the init file independently. This doesn't make things faster for any one of them, and once N gets large enough it makes things slower for all of them. We would be better off letting the first arrival do the rebuild work while the others just sleep waiting for it. I believe this fix would probably have ameliorated Jeff's and Greg's cases, even though those do not seem to have triggered the syncscan logic.

3. Having now spent a good deal of time poking at this, I think the syncscan logic is in need of more tuning, and I am wondering whether we should even have it turned on by default. It appears to be totally useless for fully-cached-in-RAM scenarios, even when most of the relation is out in kernel buffers rather than in shared buffers. The best case I saw was less than a 2X speedup over N times the single-client cost, and that wasn't very reproducible, and it didn't happen at all unless I hacked BAS_BULKREAD mode to use a ring buffer size many times larger than the current 256K setting (otherwise the timing requirements are too tight for multiple backends to stay in sync --- a seqscan can blow through that much data in a fraction of a millisecond these days, if it's reading from kernel buffers). The current tuning may be all right for cases where you're actually reading from spinning rust, but that seems to be a decreasing fraction of real-world use cases.
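To make the fix for sub-issue 1 concrete, here is a sketch (untested, showing only the forced-heapscan path in systable_beginscan) of the kind of change meant above: switch from plain heap_beginscan() to heap_beginscan_strat(), which lets the caller disallow synchronized scans:

    else
    {
        /*
         * Disallow syncscan here: callers such as RelationBuildTupleDesc
         * depend on seeing the rows in physical order starting at block
         * zero, and stop early once they've found the rows they expect.
         * A scan that starts midway through the table defeats that and
         * forces a full-table scan plus wraparound.
         */
        sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
                                             nkeys, key,
                                             true,    /* allow_strat */
                                             false);  /* allow_sync */
        sysscan->iscan = NULL;
    }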
Anyway, I think we definitely need to fix systable_beginscan to not use syncscans; that's about a one-line change and seems plenty safe to back-patch. I also intend to look at avoiding concurrent relcache rebuilds, which I think should also be simple enough if we are willing to introduce an additional LWLock; a rough sketch of the locking pattern appears below. (That would prevent concurrent init file rebuilds even in different databases, but it's not clear that very many people care about such scenarios.) I am inclined to back-patch that as well; it's a bit riskier than the first change, but the first change is apparently not going to fix the two cases reported from the field.

Comments?

regards, tom lane
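For concreteness, a rough sketch of the rebuild-serialization idea from sub-issue 2. The names here (RelCacheInitLock, load_relcache_init_file, write_relcache_init_file, and the rebuild helper) are illustrative, not an actual patch; the point is just the acquire-and-recheck pattern:

    /*
     * Serialize relcache init file rebuilds: the first backend to arrive
     * takes the lock and does the expensive catalog heapscans once; later
     * arrivals sleep on the lock and then simply read the file it wrote.
     */
    static void
    RelationCacheLoadOrRebuild(void)
    {
        /* Fast path: normally the init file is present and valid. */
        if (load_relcache_init_file())
            return;

        LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);

        /* Recheck: another backend may have rebuilt it while we waited. */
        if (load_relcache_init_file())
        {
            LWLockRelease(RelCacheInitLock);
            return;
        }

        /* We are the first arrival; rebuild and rewrite the file. */
        BuildCoreRelcacheEntries();     /* illustrative name */
        write_relcache_init_file();

        LWLockRelease(RelCacheInitLock);
    }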