Discussion: PANIC caused by open_sync on Linux
I encountered PANICs on CentOS 5.0 when I ran a write-mostly workload. It occurs only if wal_sync_method is set to open_sync; there was no problem with fdatasync. It occurred on both Postgres 8.2.5 and 8.3dev.

    PANIC: could not write to log file 0, segment 212 at offset 3399680, length 737280: Input/output error
    STATEMENT: COMMIT;

My nearby Linux guy says mixed usage of buffered I/O and direct I/O could cause errors (EIO) on many versions of Linux kernels. If we use buffered I/O before direct I/O, Linux could fail to discard the kernel buffer cache of the region and report EIO -- yes, it's a bug in Linux.

We use buffered I/O on WAL segments even if wal_sync_method is open_sync. We initialize segments with zeros using buffered I/O, and after that, we re-open them with the specified sync options.

The behavior of the bug differs between RHEL 4 and 5:

    RHEL 4 -> No error is reported even though the kernel cache is inconsistent.
    RHEL 5 -> write() fails with EIO (Input/output error).

PANIC occurs only on RHEL 5, but RHEL 4 also has a problem. If a WAL archiver reads the inconsistent cache of WAL segments, it could archive wrong contents, and PITR might fail at the corrupted archived file.

I'd recommend that users on Linux not use open_sync until the bug is fixed. However, are there any ideas for avoiding the bug while still using direct I/O? Mixed usage of buffered and direct I/O is legal, but imposes complexity on kernels. If we simplify it, things would be more relaxed. For example, dropping zero-filling and only using direct I/O. Is it possible?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
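The sequence described in the message above (zero-fill with buffered I/O, then re-open the same file with sync options) can be sketched roughly as follows. This is an illustration in Python, not PostgreSQL's C code: the function names, the 8192-byte write unit, and the fsync after zero-filling are assumptions made for the sketch. It opens the file with O_SYNC only; adding O_DIRECT through `extra_flags` would also require aligned buffers and a filesystem that supports it, which this sketch does not attempt.

```python
import os
import tempfile

SEGMENT_SIZE = 16 * 1024 * 1024  # 16MB WAL segment size mentioned in the thread
BLOCK_SIZE = 8192                # write unit assumed for this sketch


def zero_fill_segment(path, size=SEGMENT_SIZE):
    """Step 1: create the file and fill it with zeros through the page cache."""
    zeros = bytes(BLOCK_SIZE)
    with open(path, "wb") as f:
        for _ in range(size // BLOCK_SIZE):
            f.write(zeros)
        f.flush()
        os.fsync(f.fileno())  # assumption: flush the zero-filled file before reuse


def reopen_for_sync_writes(path, extra_flags=0):
    """Step 2: re-open the same file with O_SYNC, plus any extra flags."""
    return os.open(path, os.O_RDWR | os.O_SYNC | extra_flags)


def write_record(fd, offset, payload):
    """Step 3: write WAL-like data at an offset through the sync descriptor."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.write(fd, payload)


def demo(size=SEGMENT_SIZE):
    """Run the three steps on a temporary file and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment")
        zero_fill_segment(path, size)
        fd = reopen_for_sync_writes(path)
        try:
            write_record(fd, BLOCK_SIZE, b"x" * BLOCK_SIZE)
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            return f.read()
```

On a kernel without the reported bug, `demo()` simply returns a zero-filled buffer with one block of data in it. The point of the sketch is only to show where buffered and sync access to the same file region meet.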
On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:

> My nearby Linux guy says mixed usage of buffered I/O and direct I/O
> could cause errors (EIO) on many versions of Linux kernels.

I'd be curious to get some more information about this, specifically which versions have the problems. I'd heard about some weird bugs in the sync write code in versions between RHEL 4 (2.6.9) and 5 (2.6.18), but I wasn't aware of anything wrong with those two stable ones in this area. I have a RHEL 5 system here and will see if I can replicate this EIO error.

> Mixed usage of buffered and direct I/O is legal, but imposes complexity
> on kernels. If we simplify it, things would be more relaxed. For
> example, dropping zero-filling and only using direct I/O. Is it possible?

It's possible, but performance suffers considerably. I played around with this at one point when looking into doing all database writes as sync writes. Having to wait until the entire 16MB WAL segment made its way to disk before more WAL could be written can cause a nasty pause in activity, even with direct I/O sync writes. Even the current buffered zero-filled write of that size can be a bit of a drag on performance for the clients that get caught behind it; making it any sort of sync write would be far worse.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
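One way to get a feel for the cost described above is to time the same zero-fill done once with ordinary buffered writes and once with O_SYNC on every write. The following is a rough sketch only: the function names and the 8192-byte write unit are assumptions, it uses O_SYNC rather than O_DIRECT, and the numbers it produces depend entirely on the kernel, filesystem, and storage underneath.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # write unit assumed for this sketch


def timed_zero_fill(path, size, sync_each_write):
    """Zero-fill `path` and return the elapsed time in seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if sync_each_write:
        flags |= os.O_SYNC  # every write() waits for the data to reach disk
    zeros = bytes(BLOCK_SIZE)
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(size // BLOCK_SIZE):
            os.write(fd, zeros)
        if not sync_each_write:
            os.fsync(fd)  # buffered case: one flush at the end
    finally:
        os.close(fd)
    return time.perf_counter() - start


def compare(size=16 * 1024 * 1024):
    """Time both variants on temporary files; returns seconds per variant."""
    with tempfile.TemporaryDirectory() as tmp:
        return {
            "buffered": timed_zero_fill(os.path.join(tmp, "a"), size, False),
            "sync": timed_zero_fill(os.path.join(tmp, "b"), size, True),
        }
```

Calling `compare()` on the volume that holds the WAL directory gives a crude idea of how long a client could be stalled behind segment creation in each case.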
Greg Smith <gsmith@gregsmith.com> writes:

> On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:
>> Mixed usage of buffered and direct I/O is legal, but imposes complexity
>> on kernels. If we simplify it, things would be more relaxed. For
>> example, dropping zero-filling and only using direct I/O. Is it possible?

> It's possible, but performance suffers considerably. I played around with
> this at one point when looking into doing all database writes as sync
> writes. Having to wait until the entire 16MB WAL segment made its way to
> disk before more WAL could be written can cause a nasty pause in activity,
> even with direct I/O sync writes. Even the current buffered zero-filled
> write of that size can be a bit of a drag on performance for the clients
> that get caught behind it; making it any sort of sync write would be far
> worse.

This ties into a loose end we didn't get to yet: being more aggressive about creating future WAL segments. ISTM there is no good reason for clients ever to have to wait for WAL segment creation --- the bgwriter, or possibly the walwriter, ought to handle that in the background. But we only check for the case once per checkpoint and we don't create a segment unless there's very little space left.

regards, tom lane
On 10/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> This ties into a loose end we didn't get to yet: being more aggressive
> about creating future WAL segments. ISTM there is no good reason for
> clients ever to have to wait for WAL segment creation --- the bgwriter,
> or possibly the walwriter, ought to handle that in the background.

Agreed.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris@enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/
On Fri, Oct 26, 2007 at 08:34:49AM -0400, Tom Lane wrote:

> we only check for the case once per checkpoint and we don't create a
> segment unless there's very little space left.

Sort of a filthy hack, but what about always having an _extra_ segment around? The bgwriter could do that, no?

A

--
Andrew Sullivan | ajs@crankycanuck.ca
On Fri, 26 Oct 2007, Andrew Sullivan wrote:

> Sort of a filthy hack, but what about always having an _extra_
> segment around? The bgwriter could do that, no?

Now it could. The bgwriter in <=8.2 stops executing when there's a checkpoint going on, and needing more WAL segments because a checkpoint is taking too long is one of the major failure cases where proactively creating additional segments would be most helpful.

The 8.3 bgwriter keeps running even during checkpoints, so it's feasible to add such a feature now. But that only became true well into the 8.3 feature freeze, after some changes Heikki made just before the "load distributed checkpoint" patch was committed. Before that, it was hard to implement this feature; afterwards, it was too late to fit the change into the 8.3 release. Should be easy enough to add to 8.4 one day.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
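The "always keep an extra segment around" idea discussed above can be illustrated with a toy model. This is a hypothetical sketch, not PostgreSQL code: `SegmentPool`, `background_tick`, and `switch_segment` are names invented here, and segment creation is reduced to handing out an integer. The counter `foreground_creations` tracks how often a client would have had to create a segment itself.

```python
from collections import deque


class SegmentPool:
    """Toy model of proactive WAL segment creation (hypothetical names)."""

    def __init__(self, min_spare=1):
        self.min_spare = min_spare     # spare segments to keep ready
        self.spare = deque()           # pre-created segments, oldest first
        self.next_id = 0
        self.foreground_creations = 0  # times a client had to wait

    def _create(self):
        """Stand-in for the expensive zero-fill of a new segment."""
        seg = self.next_id
        self.next_id += 1
        return seg

    def background_tick(self):
        """Run by a background process: top up spares, return how many were made."""
        created = 0
        while len(self.spare) < self.min_spare:
            self.spare.append(self._create())
            created += 1
        return created

    def switch_segment(self):
        """Run by a client when the current segment fills up."""
        if self.spare:
            return self.spare.popleft()  # no waiting: segment already exists
        self.foreground_creations += 1   # client pays for the creation
        return self._create()
```

As long as `background_tick()` runs at least once between segment switches, `foreground_creations` stays at zero. If the background process stalls, as the <=8.2 bgwriter does during a checkpoint, clients fall back to creating segments themselves.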
Greg Smith <gsmith@gregsmith.com> writes:

> The 8.3 bgwriter keeps running even during checkpoints, so it's feasible
> to add such a feature now.

I wonder though whether the walwriter wouldn't be a better place for it.

regards, tom lane
On Fri, 26 Oct 2007, Tom Lane wrote:

>> The 8.3 bgwriter keeps running even during checkpoints, so it's feasible
>> to add such a feature now.
> I wonder though whether the walwriter wouldn't be a better place for it.

I do, too, but that wasn't available until too late in the 8.3 cycle to consider adding this feature there either.

There are a couple of potential to-do list ideas that build on the changes in this area in 8.3:

-Aggressively pre-allocate WAL segments
-Space out checkpoint fsync requests in addition to disk writes
-Consider re-inserting a smarter bgwriter all-scan that writes sorted by usage count during idle periods

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> wrote:

> There's a couple of potential to-do list ideas that build on the changes
> in this area in 8.3:
>
> -Aggressively pre-allocate WAL segments
> -Space out checkpoint fsync requests in addition to disk writes
> -Consider re-inserting a smarter bgwriter all-scan that writes sorted by
> usage count during idle periods

I'd like to add:

- Remove "filling with zero" before we recycle WAL segments.

If it is not needed, we can avoid buffered I/O with open_sync except for the first allocation of segments. I think we can do it if we have more robust WAL records that can ignore garbage data written before.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

> I'd like to add:
> - Remove "filling with zero" before we recycle WAL segments.

Huh? We have never done that.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > I'd like to add:
> > - Remove "filling with zero" before we recycle WAL segments.
>
> Huh? We have never done that.

Oh, sorry. I misread the code. I could avoid the PANIC if I have enough segments at startup. I'll test the configuration.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Fri, Oct 26, 2007 at 10:39:12PM -0400, Greg Smith wrote:

> There's a couple of potential to-do list ideas that build on the changes
> in this area in 8.3:

I think that's the right way to go. It's too bad that this may still happen in 8.3, but we're way past the point that this is a bug fix, IMO.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data. --Roger Brinner
Added to TODO:

* Be more aggressive about creating WAL files

  http://archives.postgresql.org/pgsql-hackers/2007-10/msg01325.php

---------------------------------------------------------------------------

Tom Lane wrote:

> This ties into a loose end we didn't get to yet: being more aggressive
> about creating future WAL segments. ISTM there is no good reason for
> clients ever to have to wait for WAL segment creation --- the bgwriter,
> or possibly the walwriter, ought to handle that in the background. But
> we only check for the case once per checkpoint and we don't create a
> segment unless there's very little space left.
>
> regards, tom lane

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +