Discussion: BUG #9161: wal_writer_delay is limited to 10s
The following bug has been logged on the website:

Bug reference: 9161
Logged by: Clemens Eisserer
Email address: linuxhippy@gmail.com
PostgreSQL version: 9.2.6
Operating system: Linux

Description:
It seems wal_writer_delay is limited to 10s without any technical reason. I am using postgresql in an embedded system, and writing every 10s burns flash at an amazing rate (a 32GB SDHC card survived 4 days with only a few GB written). I've now switched from ext4 to nilfs2, which is able to distribute writes more evenly, but I am still not confident this setup would last longer than a few weeks/months. A wal_writer_delay of around ~1h would be a lot more efficient and solve this issue.
linuxhippy@gmail.com writes:
> It seems wal_writer_delay is limited to 10s without any technical reason.

The "technical reason" is that values much larger than that would be a performance and reliability disaster for typical installations.

> I am using postgresql in an embedded system and writing every 10s
> burns flash at an amazing rate (a 32GB SDHC card survived 4 days with
> only a few GB written) - I've now switched from ext4 to nilfs2 which is able
> to distribute writes more evenly - but I am still not confident this setup
> would last longer than a few weeks/months.
> A wal_writer_delay of around ~1h would be a lot more efficient and solve
> this issue.

It's not apparent to me what usage pattern that would actually help with. If the system is idle, it doesn't matter whether the WAL writer wakes up or not. But if you're generating WAL, it's going to get written anyway whenever a transaction commits. You could forestall that with fsync=off, perhaps, but if you do that then the WAL writer won't fsync either, so it shouldn't matter how often it wakes up. Checkpoints will force WAL output more often than once per hour by default, too. So I'm wondering exactly what combination of other settings you envision using this with, and what the workload of the database server is.

While it could certainly be argued that the limit of 10s is a bit too tight, allowing values as large as 1h would be a large-caliber foot gun for most people. So I'm very hesitant to raise it that far without a more convincing argument that useful behavior can be had there.

The bigger picture is that if you allow WAL to not reach disk for 1h, that's up to 1h worth of data that you lose irretrievably in a crash. So even if this adjustment allowed you to get that behavior, it's not very attractive behavior. If you actually are OK with such little data security, maybe you should consider other approaches.
Perhaps you could keep the whole database on a RAM disk (tmpfs) and rsync it to flash every hour, or some variant of that?

Also, if I understand your problem correctly, it's not so much that you'd like the WAL writer to write less often as that you'd like it to not write partial blocks. We might be willing to accept a patch that added a setting and some logic to control things that way. But I'm pretty dubious that just changing the upper limit of wal_writer_delay is going to have useful results, even in your scenario.

regards, tom lane
Hi Tom,

Thanks for taking a look at my bug report.

> But if you're generating WAL, it's going to get written anyway
> whenever a transaction commits.

As far as I understood, this is only the case with synchronous_commit=on, which is the reason why I turned synchronous commit off.

> You could forestall that with fsync=off,
> perhaps, but if you do that then the WAL writer won't fsync either, so
> it shouldn't matter how often it wakes up. Checkpoints will force WAL
> output more often than once per hour by default, too. So I'm wondering
> exactly what combination of other settings you envision using this with,
> and what's the workload of the database server.

Because the system should be crash-resistant, my goal would be to have postgresql issue fsyncs only every few minutes instead of every 10s, by running the wal writer only every now and then (I know there is only very little data in the WAL anyway).

For now I have a single insert transaction every 10s with synchronous_commit=off; however, because the wal writer wakes up every 10s, data is immediately written to the SD card, dramatically reducing its lifespan.

Losing data isn't a real concern; a worst-case scenario, however, would be if, after a crash / power loss, the database were corrupted and refused to initialize.

Regards, Clemens
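The setup described above amounts to roughly the following postgresql.conf fragment. This is a sketch reconstructed from the thread, not Clemens' actual configuration; checkpoint_timeout is shown at its default only because checkpoints are discussed as another source of periodic writes:

```
synchronous_commit = off      # commits return before their WAL reaches disk
wal_writer_delay = 10s        # the current upper limit under discussion
checkpoint_timeout = 5min     # default; checkpoints also force WAL to disk
```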
On 2014-02-14 23:03:44 +0100, Clemens Eisserer wrote:
> As far as I understood, this is only the case with
> synchronous_commit=on, which is the reason why I turned synchronous
> commit off.

The WAL writer will be woken up anyway if a transaction commits (c.f. XLogSetAsyncXactLSN()).

> Because the system should be crash-resistant, my goal would be to have
> postgresql issue fsyncs only every few minutes instead of every 10s,
> by running the wal writer only every now and then (I know there is
> only very little data in the WAL anyway).
>
> For now I have a single insert transaction every 10s with
> synchronous_commit=off; however, because the wal writer wakes up every
> 10s, data is immediately written to the SD card, dramatically reducing
> its lifespan.

The WAL writer shouldn't write anything if there's nothing to write. Maybe you have checkpoint_timeout set to something low and it's performing checkpoints regularly?

You might also rethink the quality of the media you're using...

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 14, 2014 at 7:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> For now I have a single insert transaction every 10s with
>> synchronous_commit=off; however, because the wal writer wakes up every
>> 10s, data is immediately written to the SD card, dramatically reducing
>> its lifespan.
>
> The WAL writer shouldn't write anything if there's nothing to
> write. Maybe you have checkpoint_timeout set to something low and it's
> performing checkpoints regularly?
>
> You might also rethink the quality of the media you're using...

Indeed, from my experience, SD cards are terribly unreliable because they have terrible wear leveling (if any at all). You're better off using a USB stick if you have a USB port. It may cost you a bit more power, but it will both last a lot longer and perform better.
Hi,

> The WAL writer shouldn't write anything if there's nothing to
> write. Maybe you have checkpoint_timeout set to something low and it's
> performing checkpoints regularly?

However, there is something to write - I perform a single insert operation every 10s. So what I would like to achieve is to have those inserts every 10s, but a sync only every few minutes or so - to let the data written by postgresql stay in the block cache.

> You might also rethink the quality of the media you're using...
> Indeed, from my experience, SD cards are terribly unreliable because
> they have terrible wear leveling (if any at all).

I already switched to a patched nilfs2 version, which makes even SD cards with horrible wear leveling happy - and I switched to better cards. However, this is just my pet project, and all I ask is why a parameter is artificially clipped ... it's not like I am requesting some magical new feature.

Regards, Clemens
On 2014-02-15 09:59:25 +0100, Clemens Eisserer wrote:
> However, there is something to write - I perform a single insert
> operation every 10s.
>
> However, this is just my pet project, and all I ask is why a parameter
> is artificially clipped ... it's not like I am requesting some magical
> new feature.

Again: changing that parameter to a higher value would change *nothing*, because a commit *does* wake up the wal writer *independently* of the parameter. So it is a new feature you're requesting.

Greetings,

Andres Freund
On Sun, Feb 9, 2014 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> linuxhippy@gmail.com writes:
>> It seems wal_writer_delay is limited to 10s without any technical reason.
>
> The "technical reason" is that values much larger than that would be a
> performance and reliability disaster for typical installations.

How so? I think most installations run with synchronous_commit on. We seem to be arguing both that making it larger would do absolutely nothing, and that making it larger would do something very bad.

> While it could certainly be argued that the limit of 10s is a bit too
> tight, allowing values as large as 1h would be a large-caliber foot gun
> for most people. So I'm very hesitant to raise it that far without a
> more convincing argument that useful behavior can be had there.

We let them turn fsync off. How much bigger of a foot gun could we possibly hand them? If we make people turn fsync off, rather than letting them use wal_writer_delay to do the very thing that it is intended to do, I don't see that as doing them any favors.

> The bigger picture is that if you allow WAL to not reach disk for 1h,
> that's up to 1h worth of data that you lose irretrievably in a crash.
> So even if this adjustment allowed you to get that behavior, it's not
> very attractive behavior. If you actually are OK with such little data
> security, maybe you should consider other approaches. Perhaps you could
> keep the whole database on a RAM disk (tmpfs) and rsync it to flash every
> hour, or some variant of that?

And hope that the rsync didn't leave it in an inconsistent state because it fired up just when a change was starting to get written. Ick. I think that having knobs that do what they say and say what they do seems much safer than forcing people to jump through dangerous hoops.

Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> We seem to be arguing both that making it larger would do absolutely
> nothing, and that making it larger would do something very bad.

The question to me is not so much whether it would do anything bad, as whether it would do anything good. I think it's entirely unproven that merely raising this limit would solve anything for Clemens' use-case; it seems likely that additional behavioral changes would be needed, and we've not seen what's under that rock. What it definitely would do, however, is encourage people to raise the setting, perhaps to larger values than are good for them. The system is currently tuned on the assumption that wal_writer_delay is a fractional second; we have no data on how it performs with significantly larger settings.

Also, a closer look shows that the wal writer doesn't write anything merely because it's woken up. Unless I'm misreading XLogBackgroundFlush, it will write only (1) any completed full pages of WAL, or (2) data up to an async commit. As for (1), writing out completed pages (which are not going to get overwritten) doesn't seem like it should be a problem for a flash drive. If it is a problem, maybe recompiling with a bigger XLOG page size is indicated. As for (2), while I grant Clemens' claim that he's willing to tolerate loss of async commits, it's a bit hard to credit that he's willing to lose three hours' worth of async commits. Which is what he'd be risking with wal_writer_delay set to 1h. There's also the existing "hibernate" mode, which has unclear interactions with a desire to put off writes for such long intervals.

Basically, this proposal needs a lot more evidence than has been provided, IMO.

regards, tom lane
On Fri, Feb 14, 2014 at 2:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The WAL writer will be woken up anyway if a transaction commits
> (c.f. XLogSetAsyncXactLSN()).

That only happens if an 8 KB boundary has been crossed. Depending on how big the records are that he is inserting, you can fit a lot of them in 8 KB. If the records are small, or the table is unlogged so only commit messages get logged, then you could go many multiples of 10s without triggering a wake up due to crossing an 8 KB boundary.

So if we did change the upper limit (or if Clemens adds 2 zeros to guc.c and compiles his own), he should be able to get quite a bit of mileage out of it.

Why do we awaken it at every 8 KB boundary? I thought there was code to wake it when wal_buffers was 1/3 full, but I don't see that code anymore.

Cheers, Jeff
On 2014-02-17 13:48:34 -0800, Jeff Janes wrote:
> That only happens if an 8 KB boundary has been crossed. Depending on how
> big the records are that he is inserting, you can fit a lot of them in 8
> KB. If the records are small, or the table is unlogged so only commit
> messages get logged, then you could go many multiples of 10s without
> triggering a wake up due to crossing an 8 KB boundary.

Unless I miss something, we wake it up if *either* an 8 KB boundary has been crossed *or* the async lsn hasn't been flushed yet.

Greetings,

Andres Freund
On Mon, Feb 17, 2014 at 1:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Unless I miss something, we wake it up if *either* an 8 KB boundary has
> been crossed *or* the async lsn hasn't been flushed yet.

I think it is either an 8 KB boundary crossed, or the wal_writer is in "deep" sleep (i.e. 25 times the ordinary sleep). I don't see anything there which refers to the async lsn in deciding whether to wake.

Anyway, stracing the wal_writer shows that with a slow steady drip of small transactions, it is definitely wal_writer_delay which is in control, not the latch wake up. Also, referring to Tom's email, it does issue an fdatasync every time it wakes up, because of this:

/* if we have already flushed that far, consider async commit records */

For this purpose, async commit could mean either commits done with synchronous_commit=off, or commits of transactions which touched only unlogged tables.
So it writes the same 8 KB block as last time (but with new data at the logical end of the block) and flushes it.

Cheers, Jeff
Hi Jeff,

> That only happens if an 8 KB boundary has been crossed. Depending on how
> big the records are that he is inserting, you can fit a lot of them in 8
> KB. If the records are small, or the table is unlogged so only commit
> messages get logged, then you could go many multiples of 10s without
> triggering a wake up due to crossing an 8 KB boundary.

Thanks a lot for the in-depth information. Is the 8 KB boundary configurable at compile time?

> /* if we have already flushed that far, consider async commit records */
>
> For this purpose, async commit could mean either commits done with
> synchronous_commit=off, or commits of transactions which touched only
> unlogged tables. So it writes the same 8 KB block as last time (but with
> new data at the logical end of the block) and flushes it.

Ah, that explains why my experiments using unlogged tables also weren't successful in avoiding the frequent writes.

Regards, Clemens
Clemens Eisserer <linuxhippy@gmail.com> writes:
> Thanks a lot for the in-depth information.
> Is the 8 KB boundary configurable at compile time?

There is a configure option for XLOG_BLCKSZ, but it won't go beyond 64K, which is probably not enough to fix your problem. (I'm not sure offhand if there are technical reasons for 64K in particular as the limit, but in any case I suspect making it really large would create performance issues of its own.)

I think what you'd really want is to tweak the walwriter logic so that it avoids writing individual blocks and instead waits for wal_buffers to be, say, half full before writing.

regards, tom lane
On 2014-02-17 14:24:09 -0800, Jeff Janes wrote:
> I think it is either an 8 KB boundary crossed, or the wal_writer is in
> "deep" sleep (i.e. 25 times the ordinary sleep).

Ick. Yes, I forgot that sleeping was defined that way (which is a bad idea for the majority of cases, imo).

Greetings,

Andres Freund