Discussion: COMMIT NOWAIT Performance Option
Proposal: Implement a new option for COMMIT, for enhancing performance, providing a MySQL-like trade-off between performance and robustness for *only* those that want it.

COMMIT NOWAIT

This form of COMMIT will *not* perform XLogFlush(), but will rely on a special background process to perform regular WAL fsyncs (see later).

COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not threaten the consistency or robustness of other COMMIT modes. Read that again and think about it, before we go further, please. Normal COMMIT still guarantees to flush all of WAL up to the point of the commit, whether or not the previous commits have requested that.

Mixing COMMIT NOWAIT with other modes does not affect the performance of other backends - those that specify that mode are faster, those that do not simply go at the same speed they did before. This is important, because it allows us to have a fully robust server, yet with certain critical applications going along much faster. No need for an all-or-nothing approach at db cluster level.

Unlike fsync = off, WAL is always consistent and the server can be recovered easily, though with some potential for data loss for transactions that chose the COMMIT NOWAIT option. Sounds like a hole there: normal COMMITs that rely on data written by COMMIT NOWAIT transactions are still safe, because the normal COMMIT is still bound by the guarantee to go to disk. The buffer manager/WAL interlock is not affected by this change and remains in place, as it should.

This implements the TODO item:

--Allow buffered WAL writes and fsync
"Instead of guaranteeing recovery of all committed transactions, this would provide improved performance by delaying WAL writes and fsync so an abrupt operating system restart might lose a few seconds of committed transactions but still be consistent. We could perhaps remove the 'fsync' parameter (which results in an inconsistent database) in favor of this capability."

Why do we want this??
Because some apps have *lots* of data and many really don't care whether they lose a few records. Honestly, I've met people that want this, even after 2 hours of discussion and understanding. Plus probably lots of MySQLers also.

User Control
------------

New commit mode is available by explicit command, or as a default setting that will be applied to all COMMITs, or both.

The full syntax would be COMMIT [WRITE] NOWAIT [IMMEDIATE], for Oracle compatibility (why choose incompatibility?). Note that this is not a transaction start setting like Isolation Level; this happens at end of transaction. The syntax for END is unchanged, defaulting to normal behaviour unless overridden.

New userset GUC, commit_wait_default = on (default) | off

We change the meaning of the commit_delay parameter:

- If commit_delay = 0 then commit_wait_default cannot be set off.

- WAL will be flushed every commit_delay milliseconds; if no flush is required this will do nothing very quickly, so there is little overhead if no COMMIT NOWAIT commits have been made.

Implementation
--------------

COMMIT NOWAIT in xact.c simply skips XLogFlush() and returns.

Who does the XLogFlush? Well, my recommendation is a totally new process, WALWriter. But I can see that many of you will say bgwriter should be the one to do this work. IMHO doing WAL flushes will take time, and that's time that bgwriter really needs to do other things; plus it can't really guarantee to flush regularly when it's doing checkpoints.

When commit_delay > 0 then the WALWriter will start up, or shut down if commit_delay = 0.

WALWriter will XLogFlush every commit_delay milliseconds.

A prototype patch is posted to -patches, which is WORK IN PROGRESS.

The following TODO items remain:
1. Discuss which process will issue regular XLogFlush(). If agreed, implement WALWriter process to perform this task. (Yes, the patch isn't fully implemented, yet.)
2. Remove fsync parameter.
3. Prevent COMMIT NOWAIT when commit_delay = 0.
4. Discuss whether commit_delay is OK to usurp; 'twas just an earlier suggestion from someone else, can go either way.
5. Docs.

The remaining items can be completed very quickly if this proposal is acceptable. (I wrote this over Christmas, so it turning up now isn't a rushed proposal and I'm pretty certain it ain't broke.)

Comments?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
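[Editorial note: the division of labour proposed above - COMMIT NOWAIT skipping XLogFlush() while a WALWriter process flushes on a commit_delay timer - can be sketched as a toy model. This is an illustration only, not the patch: the names (WalModel, walwriter, commit) and the list-plus-watermark representation of WAL are invented here.]

```python
import threading
import time

class WalModel:
    """Toy model of WAL: an append-only list plus a flushed watermark."""
    def __init__(self):
        self.lock = threading.Lock()
        self.records = []        # records written to WAL buffers
        self.flushed_upto = 0    # how many records are durably on "disk"

    def append(self, rec):
        with self.lock:
            self.records.append(rec)
            return len(self.records)  # LSN-like position of this record

    def flush(self, upto=None):
        """XLogFlush(): everything up to `upto` (default: all) becomes durable."""
        with self.lock:
            target = len(self.records) if upto is None else upto
            self.flushed_upto = max(self.flushed_upto, target)

def commit(wal, xid, nowait):
    lsn = wal.append(("commit", xid))
    if not nowait:
        wal.flush(upto=lsn)   # normal COMMIT waits for its record to hit disk
    # COMMIT NOWAIT returns immediately; the walwriter thread flushes later

def walwriter(wal, delay_s, stop):
    """Background WALWriter: flush all WAL every commit_delay interval."""
    while not stop.is_set():
        time.sleep(delay_s)
        wal.flush()

wal = WalModel()
stop = threading.Event()
threading.Thread(target=walwriter, args=(wal, 0.01, stop), daemon=True).start()

commit(wal, 1, nowait=True)    # not yet durable (at most one timer tick away)
commit(wal, 2, nowait=False)   # durable, and drags xid 1's record along
assert wal.flushed_upto >= 2   # normal COMMIT flushed all WAL up to its record

commit(wal, 3, nowait=True)
time.sleep(0.1)                # give the walwriter several cycles
assert wal.flushed_upto == 3   # background flush caught up
stop.set()
```

The second assert is the key property claimed in the proposal: a normal COMMIT flushes all WAL up to its own commit record, so it carries any earlier NOWAIT commits to disk with it, whether or not they asked for that.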
> Why do we want this?? Because some apps have *lots* of data and many
> really don't care whether they lose a few records. Honestly, I've met
> people that want this, even after 2 hours of discussion and
> understanding. Plus probably lots of MySQLers also.

Most users will take speed over data loss any day, whether we want to admit it or not. I think this feature is a good middle ground.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
Simon Riggs wrote:
> Proposal: Implement a new option for COMMIT, for enhancing performance,
> providing a MySQL-like trade-off between performance and robustness for
> *only* those that want it.
>
> [...]
>
> Unlike fsync = off, WAL is always consistent and the server can be
> recovered easily, though with some potential for data loss for
> transactions that chose the COMMIT NOWAIT option. Sounds like a hole
> there: normal COMMITs that rely on data written by COMMIT NOWAIT
> transactions are still safe, because the normal COMMIT is still bound by
> the guarantee to go to disk. The buffer manager/WAL interlock is not
> affected by this change and remains in place, as it should.

OK, so do I have this right?

A is issuing COMMIT NOWAIT once a second
B is issuing COMMIT every other second
Checkpoints happen every 10 seconds

Every 10 seconds we have a checkpoint and all WAL+data are on-disk.
Every 2 seconds a standard COMMIT occurs and A+B are synced to WAL.
In-between, COMMIT NOWAIT occurs and no data is written to disk - neither WAL nor data-blocks.
So, if I have a load of connections issuing standard COMMITs regularly then I'll not see much performance gain. However, I can guarantee any data visible to those transactions is committed to disk.

If I have only one connection to the database and that uses only COMMIT NOWAIT, it should approach the speed of fsync=off.

So this is for:
1. Data-loading where I don't want to use fsync=off (for whatever reason)
2. Data-logging type apps

Have I got the gist of that?

--
Richard Huxton
Archonet Ltd
On Mon, 2007-02-26 at 23:25 +0000, Richard Huxton wrote:
> OK, so do I have this right?
>
> A is issuing COMMIT NOWAIT once a second
> B is issuing COMMIT every other second
> Checkpoints happen every 10 seconds
>
> Every 10 seconds we have a checkpoint and all WAL+data are on-disk.
> Every 2 seconds a standard COMMIT occurs and A+B are synced to WAL.
> In-between, COMMIT NOWAIT occurs and no data is written to disk -
> neither WAL nor data-blocks.

Right, but there'd also be a WAL flush every 100ms or so, according to how you set the parameter. So the window of data loss is controllable for both efficiency and user-selected robustness for those transactions that want it.

(Plus the point that a COMMIT doesn't send data blocks to disk anyway, only WAL. This proposal does nothing to change that, I should add.)

> So, if I have a load of connections issuing standard COMMITs regularly
> then I'll not see much performance gain. However, I can guarantee any
> data visible to those transactions is committed to disk.
>
> If I have only one connection to the database and that uses only COMMIT
> NOWAIT, it should approach the speed of fsync=off.
>
> So this is for:
> 1. Data-loading where I don't want to use fsync=off (for whatever reason)
> 2. Data-logging type apps
>
> Have I got the gist of that?

Pretty much. Only the people using COMMIT NOWAIT see a performance gain, and of course only if they are doing write transactions of relatively short duration.

The interesting point is you can have a huge data-grinding app, yet with other tables alongside that hold more important data. In that scenario, 90% of the data would be COMMIT NOWAIT, whilst the small important data is safe.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
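[Editorial note: Richard's timeline combined with the periodic background flush can be worked through numerically. Below is a small model - function name and timings invented for illustration - of which commits are durable at an arbitrary crash point.]

```python
def durable_at_crash(commits, crash_t, flush_interval):
    """
    commits: list of (t, xid, nowait) commit events, t in seconds.
    A normal COMMIT flushes all WAL at its own time t; the walwriter
    also flushes everything at every multiple of flush_interval.
    Returns the set of xids whose commit record reached disk by crash_t.
    """
    # Times at which *all* WAL written so far becomes durable:
    flush_times = [t for t, _, nowait in commits if not nowait]
    k = int(crash_t / flush_interval)
    flush_times += [i * flush_interval for i in range(1, k + 1)]
    last_flush = max([t for t in flush_times if t <= crash_t], default=None)
    if last_flush is None:
        return set()
    return {xid for t, xid, _ in commits if t <= last_flush}

# A: COMMIT NOWAIT; B: normal COMMIT; background flush every 0.1s
# (the "100ms or so" above).
commits = [(1.0, "A1", True), (2.0, "B1", False), (2.05, "A2", True)]

# Crash at t=2.07: A2 committed at 2.05, but the last flush was at t=2.0,
# so A2 is lost - at most ~100ms of NOWAIT commits can disappear.
assert durable_at_crash(commits, 2.07, 0.1) == {"A1", "B1"}

# Crash at t=2.2: the walwriter flushed again at t=2.1, so A2 survived.
assert durable_at_crash(commits, 2.2, 0.1) == {"A1", "B1", "A2"}
```

Note that A1, although a NOWAIT commit, is durable in both cases: B1's normal COMMIT at t=2.0 flushed all WAL up to its own record.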
Simon Riggs wrote:
> The interesting point is you can have a huge data-grinding app, yet with
> other tables alongside that hold more important data. In that scenario,
> 90% of the data would be COMMIT NOWAIT, whilst the small important data
> is safe.

Does this mean that the regular COMMIT is slower because it has to force more data to disk? I imagine that this isn't the case, because it's not the write itself that's slow; rather, it's the wait until the fsync on WAL is reported complete. However, did you measure this?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Feb 26, 2007, at 18:58, Simon Riggs wrote:
>>> Proposal: Implement a new option for COMMIT, for enhancing performance,
>>> providing a MySQL-like trade-off between performance and robustness for
>>> *only* those that want it.
>>>
>>> COMMIT NOWAIT

Isn't the domain of transient data relegated to certain tables instead of specific transactions? Wouldn't it be easier to create un-WAL-logged global temp tables?

-M
On Mon, 2007-02-26 at 22:56 +0000, Simon Riggs wrote:
> Proposal: Implement a new option for COMMIT, for enhancing performance,
> providing a MySQL-like trade-off between performance and robustness for
> *only* those that want it.
>
> COMMIT NOWAIT
>
> This form of COMMIT will *not* perform XLogFlush(), but will rely on a
> special background process to perform regular WAL fsyncs (see later).

I would think that the type of people who would want this feature would want it to be a property of the connection, not per-statement. For one thing, many applications use solitary SQL statements without surrounding them in BEGIN/COMMIT.

Regards,
Jeff Davis
Jeff Davis wrote:
> I would think that the type of people who would want this feature would
> want it to be a property of the connection, not per-statement. For one
> thing, many applications use solitary SQL statements without surrounding
> them in BEGIN/COMMIT.

You can set the GUC locally in your connection, of course. (You can even do it per role or per database by using ALTER/SET commands.)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
"Simon Riggs" <simon@2ndquadrant.com> writes:
> COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> threaten the consistency or robustness of other COMMIT modes. Read that
> again and think about it, before we go further, please.

I read that, and thought about it, and don't think I believe it. The problem is that there are more interconnections between different transactions than you're allowing for. In particular you need to justify that the behavior is safe for non-transactional operations like btree page splits and pg_clog buffer page writes. The idea that's particularly bothering me at the moment is that after a system crash, we might come back up in a state where a NOWAIT transaction appears committed when its updates didn't all get to disk. "Database corrupt" is a situation that threatens all your transactions...

> New commit mode is available by explicit command, or as a default
> setting that will be applied to all COMMITs, or both.

I dislike introducing new nonstandard syntax ("Oracle compatible" is not standard). If we did this I'd vote for control via a GUC setting only; I think that is more useful anyway, as an application can be made to run with such a setting without invasive source code changes.

			regards, tom lane
On Tue, February 27, 2007 06:06, Joshua D. Drake wrote:
>> Why do we want this?? Because some apps have *lots* of data and many
>> really don't care whether they lose a few records. Honestly, I've met
>> people that want this, even after 2 hours of discussion and
>> understanding. Plus probably lots of MySQLers also.
>
> Most users will take speed over data loss any day. Whether we want to
> admit it or not.

In that case, wouldn't it make just as much sense to have an equivalent for this special transaction mode on individual statements, without transaction context? I'm guessing that people who don't really know or want transactions would never start one, running lots of loose statements instead that otherwise get committed individually.

Jeroen
On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> 2. remove fsync parameter

Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still want this for things like database restores.

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, Feb 27, 2007 at 11:05:45AM +0700, Jeroen T. Vermeulen wrote:
> In that case, wouldn't it make just as much sense to have an equivalent
> for this special transaction mode on individual statements, without
> transaction context? I'm guessing that people who don't really know or
> want transactions would never start one, running lots of loose
> statements instead that otherwise get committed individually.

I don't think it makes sense to optimize for people who can't be bothered to learn about a transaction. In any case, that option is there; you just set the GUC in the session.

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Mon, 2007-02-26 at 22:50 -0600, Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote:
> > 2. remove fsync parameter
>
> Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still
> want this for things like database restores.

Well, it seemed to be part of the documented TODO, so I included it as a discussion point. I'm happy to leave it, personally, FWIW.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-02-26 at 21:20 -0300, Alvaro Herrera wrote:
> Does this mean that the regular COMMIT is slower because it has to
> force more data to disk? I imagine that this isn't the case, because
> it's not the write itself that's slow; rather, it's the wait until the
> fsync on WAL is reported complete. However, did you measure this?

No, I've not measured that aspect specifically. But the overall effect depends upon your hardware. (Laptops work great :-)

A COMMIT will write all of WAL, no matter how much that is. This is no different from now, where a normal COMMIT executing while another backend is doing work may have to write more than just its own WAL. We actually consider this a good thing: piggyback writes go out of their way to ensure we write as much WAL as possible in one go.

There is no doubt a point where the fsync delay has been set wrong *and* there are so few normal COMMITs that they do become slower. But slower isn't the same as "starved"; it's just the result of having to do the work of others. The WALWriter will be doing the heavy lifting, if set correctly.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-02-26 at 23:04 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not
> > threaten the consistency or robustness of other COMMIT modes. Read that
> > again and think about it, before we go further, please.
>
> I read that, and thought about it, and don't think I believe it.

I don't mind saying it's taken a while to believe it myself. And I very much want to have the concept tested so we all believe it. Doubt => risk.

> The problem is that there are more interconnections between different
> transactions than you're allowing for. In particular you need to
> justify that the behavior is safe for non-transactional operations like
> btree page splits and pg_clog buffer page writes. The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...

OK, I can see it's possible for the following WAL history to occur:

t1 btree split part1
t2 COMMIT; <---------- risk of inconsistent WAL
t1 btree split part2

The COMMIT by t2 flushes WAL, so if a crash happens at the point shown we have an incomplete multi-part WAL operation. At this stage, let's assume that no data writes by either t1 or t2 have made it to disk, which is normal and likely.

On recovery, we apply all the changes in WAL up to the end of WAL. At the last minute t2 sneaks in and gets committed. t1 was never committed, so we never found out whether it would be a COMMIT NOWAIT or a COMMIT. *Whichever* it would have been, the btree split is still incomplete, and recovery understands this and acts accordingly. So the situation that sounds nasty is actually a normal situation currently, and presents no threat because we already handle this correctly. Multi-part operations seem good to me.
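[Editorial note: the btree-split scenario above can be modelled as a replay loop in which a multi-part operation is only applied once all of its parts are present in the durable WAL, no matter whose COMMIT forced the early parts to disk. A toy sketch - the record format (op_id, part, total) is invented here:]

```python
def replay(durable_wal):
    """
    durable_wal: ordered records (op_id, part, total) that reached disk.
    Single-record operations have total=1. Multi-part operations (like a
    btree page split) are applied only when all parts are present; at
    end-of-WAL anything still incomplete is left for cleanup, regardless
    of whether an unrelated COMMIT flushed its first part to disk.
    """
    incomplete = {}    # op_id -> set of parts seen so far
    applied = []
    for op_id, part, total in durable_wal:
        parts = incomplete.setdefault(op_id, set())
        parts.add(part)
        if len(parts) == total:
            applied.append(op_id)   # complete: apply the operation
            del incomplete[op_id]
    cleaned = sorted(incomplete)    # incomplete multi-part ops to fix up
    return applied, cleaned

# t1 writes split part 1 of 2; t2's COMMIT flushes WAL; crash before part 2:
applied, cleaned = replay([("split-A", 1, 2), ("commit-t2", 1, 1)])
assert applied == ["commit-t2"]     # t2's commit replays normally
assert cleaned == ["split-A"]       # the half-done split is detected
```

This is the point of the paragraph above: the half-flushed split is already a normal crash state today, so COMMIT NOWAIT does not create a new hazard there.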
> The idea that's
> particularly bothering me at the moment is that after a system crash,
> we might come back up in a state where a NOWAIT transaction appears
> committed when its updates didn't all get to disk. "Database corrupt"
> is a situation that threatens all your transactions...

Well, I've spent 2 hours thinking this through, gone pale, thought hard and gone down a few blind alleys along the way.

The bottom line is that COMMIT NOWAIT isn't fully safe, and that's part of the explicit non-full guarantee, written in big letters on the tin. If a transaction commits and then we crash before we flush WAL, then the transaction will be incomplete. As we define COMMIT now, this is broken, and I use that word accurately: if you use COMMIT NOWAIT, you risk data loss, *but* you have the choice of which transactions this applies to. However, WAL *is* consistent and there is no danger of database corruption. Simply put, this is not group commit.

So if you issue COMMIT NOWAIT and then crash, the transactions that were marked committed need to be marked aborted in clog. So during recovery we will need to keep track of which transactions are in progress, so we can mark them explicitly aborted, rather than relying on the current implicit mechanism.

This would need to be done carefully, since a clog page that was created at transaction start may never have made it to disk by the time of the crash. We must extend clog in that case, even if the WAL record for that clog extension never made it to disk either, so that we can record the aborted state of the Xids that exist somewhere on disk.

Keeping track of transactions in progress won't take long. It isn't required at all during archive recovery, but it might form the basis for later use as a Snapshot creation mechanism for read-only access during startup.
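[Editorial note: the recovery bookkeeping described above - track which xids were in progress, and explicitly mark as aborted any whose commit record never reached disk - can be sketched as follows. This is a simplification: real clog page extension and WAL record formats are ignored, and the record shapes are invented here.]

```python
def recover_clog(durable_wal):
    """
    durable_wal: ordered WAL records that reached disk before the crash,
    e.g. ("insert", xid) or ("commit", xid).
    Returns {xid: "committed" | "aborted"}. Any xid seen doing work whose
    commit record is missing from the durable WAL is explicitly marked
    aborted - even though a clog page on disk may have said "committed",
    because the NOWAIT transaction committed in memory but the WAL flush
    never happened.
    """
    in_progress, status = set(), {}
    for kind, xid in durable_wal:
        if kind == "commit":
            in_progress.discard(xid)
            status[xid] = "committed"
        else:
            in_progress.add(xid)
    for xid in in_progress:
        status[xid] = "aborted"   # clog must be (re)extended to record this
    return status

# xid 5 committed normally; xid 7 used COMMIT NOWAIT and its commit
# record never reached disk before the crash:
wal = [("insert", 5), ("commit", 5), ("insert", 7)]
assert recover_clog(wal) == {5: "committed", 7: "aborted"}
```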
It isn't possible for a normal non-read-only transaction to change data that has been changed by a commit-nowait transaction, and then have the normal transaction commit, yet without also implicitly committing the commit-nowait transaction. Hopefully, I've got that right?

> > New commit mode is available by explicit command, or as a default
> > setting that will be applied to all COMMITs, or both.
>
> I dislike introducing new nonstandard syntax ("Oracle compatible" is not
> standard). If we did this I'd vote for control via a GUC setting only;
> I think that is more useful anyway, as an application can be made to run
> with such a setting without invasive source code changes.

OK.

Having read through all of the above things again, ISTM that we should make this functionality available by a new GUC, commit_fsync_delay, which must be set explicitly > 0 before this feature can be used at all. If I confused Tom by using commit_delay, then I'll confuse others also; group commit and deferred fsync are different techniques with different robustness guarantees. When enabled it should have a clear message in the log to show that some commits might be using commit_nowait.

I'd even welcome a more descriptive term that summed up the relaxed transaction guarantee implied by the use of the deferred fsync technique. Perhaps even a very explicit USERSET GUC:

transaction_guarantee = on (default) | off

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 27, 2007 at 10:49:32AM +0000, Simon Riggs wrote:
> Having read through all of the above things again, ISTM that we should
> make this functionality available by a new GUC, commit_fsync_delay, which
> must be set explicitly > 0 before this feature can be used at all.
>
> [...]
>
> I'd even welcome a more descriptive term that summed up the relaxed
> transaction guarantee implied by the use of the deferred fsync
> technique. Perhaps even a very explicit USERSET GUC:
>
> transaction_guarantee = on (default) | off

So would you set commit_fsync_delay on a per-transaction basis? That doesn't make much sense to me... I guess I'm not seeing how you would explicitly mark transactions that you didn't want to fsync immediately.

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Simon,

One of the things I love about doing informal online user support in the PostgreSQL community, and formal user support for Sun's customers, is the almost-ironclad guarantee that if a user has a corrupt database or data loss, one of three things is true:
a) they didn't apply some recommended PG update;
b) they have a bad disk controller or disk config;
c) they have bad RAM.

It seriously narrows down the problem space to know that PostgreSQL does *not* allow data loss if it's physically possible to prevent it.

Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for PostgreSQL, I'd like to see the answers to two questions:

a) Please give some examples of performance gain on applications using COMMIT NOWAIT. The performance gain needs to be substantial (like, 50% to 100%) to justify a compromise like this.

b) Why this and not global temporary tables or queuing?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
"Jim C. Nasby" <jim@nasby.net> writes:
> So would you set commit_fsync_delay on a per-transaction basis? That
> doesn't make much sense to me... I guess I'm not seeing how you would
> explicitly mark transactions that you didn't want to fsync immediately.

My assumption was that most of the time you'd want this behavior per-session, and so the existing mechanisms for setting a GUC variable would work perfectly well. If you really want it per-transaction then changing the variable on the fly is possible (perhaps using SET LOCAL). You'll be issuing nonstandard commands either way...

			regards, tom lane
On Tue, 2007-02-27 at 11:32 -0600, Jim C. Nasby wrote:
> So would you set commit_fsync_delay on a per-transaction basis? That
> doesn't make much sense to me... I guess I'm not seeing how you would
> explicitly mark transactions that you didn't want to fsync immediately.

There are 2 GUCs that would control the behaviour here:

transaction_guarantee = on | off
	Specifies whether following transaction commits will guarantee
	WAL has been flushed prior to reporting commit. If no guarantee
	is requested (=off), then data loss may result even after the
	transaction has reported its COMMIT message.
	USERSET, but listed in postgresql.conf where default = on.
	Set this at role, individual session or transaction level to
	improve performance of non-critical user data.
	Use of this setting does not interfere with the
	transaction_guarantee that other transactions may choose,
	i.e. if somebody else chooses to take risks with their data it
	will not affect the transaction guarantees the server offers
	to you.
	Can only be set off by a transaction if commit_fsync_delay has
	been enabled. Use this parameter with care; if you find
	yourself wanting to use this parameter all of the time you
	should consult a psychiatrist or change open source databases.

commit_fsync_delay = 0...10000 microseconds (0 = off, default)
	Controls how often the WALWriter issues an XLogFlush().
	SIGHUP, so set once for each server, in postgresql.conf.
	This provides a maximum time window of potential data loss
	in the event of a server crash for transactions that choose
	transaction_guarantee = off. This parameter has no effect on
	transactions that choose transaction_guarantee = on.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
> There are 2 GUCs that would control the behaviour here: > > transaction_guarantee = on | off > has been enabled. Use this parameter with care; if you find > yourself wanting to use this parameter all of the time you > should consult a psychiatrist or change open source databases. If you guarantee your customers that you won't lose a transaction that has already been committed, you need to at least have synchronous replication to a remote site. Since not many installations have that, I do find it funny that people imply so much safety only by syncing the WAL. Without sync replication a "buffered wal" as proposed only increases the chances that you lose something. It certainly is no change from safe haven to abysmal hell. So I think the part after the semicolon can safely be dropped. Many will be able to use it always, without changing to another db :-) > commit_fsync_delay = 0...10000 microseconds (0 = off, default) > Controls how often the WALWriter issues an XLogFlush() > SIGHUP, so set once for each server, in postgresql.conf > This provides a maximum time window of potential data loss > in the event of a server crash for transactions that choose > transaction_guarantee = off. This parameter has no effect > on transactions that choose transaction_guarantee = on. The WAL sync method probably needs to be considered? If the WAL is opened with open_datasync, how does that affect the performance, or do you omit the write and leave that to the WALWriter also? You probably also want more wal_buffers in such a setup. It may be better to trigger the WALWriter with wal_buffer fill-level instead of an extra parameter? It is imho great that you are working on this. I always thought it impossible, because WAL (write ahead) implied to me that you are not allowed to do some data/index page changes before WAL is on disk. Andreas
On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: > It seriously narrows down the problem space to know that PostgreSQL does *not* > allow data loss if it's physically possible to prevent it. Seems like we're trying to protect users from themselves again. This is not a PostgreSQL database issue; it's a feature designed for application developers and cases where there is the possibility of acceptable data loss for certain types of operations or transactions. We already have the capacity for loss when fsync is disabled; this is just a more controlled feature. Under Oracle, NOWAIT is an asynchronous commit... anyone that uses it should understand that it's still not on-disk and that they can lose it in the event of a failure. That's what Oracle's docs even say. It's just a risk vs. reward trade off. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
Jonah, > Under Oracle, NOWAIT is an asynchronous commit... anyone that uses it > should understand that it's still not on-disk and that they can lose > it in the event of a failure. That's what Oracle's docs even say. > It's just a risk vs. reward trade off. You're missing my point, which is that nobody has demonstrated that there is any real performance gain from this. I see no reason to implement it if there is no performance gain. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus wrote: > Simon, > > One of the things I love about doing informal online user support in the > PostgreSQL community, and formal user support for Sun's customers, is the > almost-ironclad guarantee that if a user has a corrupt database or data loss, > one of three things is true: > a) they didn't apply some recommended PG update; > b) they have a bad disk controller or disk config; > c) they have bad ram. That is pretty spot on. > > It seriously narrows down the problem space to know that PostgreSQL does *not* > allow data loss if it's physically possible to prevent it. But we do, don't we? fsync = off, full_page_writes = off? > > Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for > PostgreSQL, I'd like to see the answers to two questions: I agree with this. > > a) Please give some examples of performance gain on applications using COMMIT > NOWAIT. The performance gain needs to be substantial (like, 50% to 100%) to > justify a compromise like this. WOAH... that seems excessive. There are a couple of things going on here. 1. We have a potential increase in performance for certain workloads. This is good, but must be proven. Is that proof 50%? Bah... let's talk 15-25%. 2. We have to accept that not everyone wants IRON clad data integrity. We have many, many options for dealing with that now, including PITR and REPLICATION. > > b) Why this and not global temporary tables or queuing? /me would love global temp tables. Many of the PostgreSQL users out there today will happily lose 15 minutes of data if it means their data is served 25% faster. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Joshua D. Drake wrote: > 2. We have to accept that not everyone wants IRON clad data integrity. > We have many, many options for dealing with that now, including PITR and > REPLICATION. 100% agreed - our own stats collector is extremely similar (in that it may drop data under high load) to a system I designed a few years back which collected stats from snmp devices and windows boxes via WMI on a regular basis. We didn't care if we lost a few values here and there for that data and would happily have taken the riskier but higher performance option had we needed to (and had we ever got round to ever actually building the system!). I do agree that some benchmarks are appropriate though. Regards, Dave
I am not sure about some of this. The Oracle option does not change the engine fsync behavior, I believe. All that is changed is whether the client side waits for the completion of the fsync or not. If this is true, the data store, logs, etc, are all protected. The user may still experience a data loss if a network or system failure occurred just after the client issued the commit. This would be something like I send the message and exit. However, prior to the engine receiving the message, a network component fails and the message is never delivered. This will turn into an aborted transaction as far as the engine is concerned. Of course, the exact details are in the protocol between the client and the server. The commit nowait is async with respect to the response to the user, not the underlying engine, I think. Therefore performance gains are purely a user perspective, not an engine perspective. Perhaps some network traffic could be pruned, not sending the response. Jordan Henderson On Tuesday 27 February 2007, Joshua D. Drake wrote: > Josh Berkus wrote: > > Simon, > > > > One of the things I love about doing informal online user support in the > > PostgreSQL community, and formal user support for Sun's customers, is the > > almost-ironclad guarentee that if a user has a corrupt database or data > > loss, one of three things is true: > > a) they didn't apply some recommended PG update; > > b) they have a bad disk controller or disk config; > > c) they have bad ram. > > That is pretty spot on. > > > It seriously narrows down the problem space to know that PostgreSQL does > > *not* allow data loss if it's physically possible to prevent it. > > But we do don't we? fsync = off, full_page_writes = off? > > > Therefore, if we're going to arm a foot-gun as big as COMMIT NOWAIT for > > PostgreSQL, I'd like to see the answers to two questions: > > I agree with this. > > > a) Please give some examples of performance gain on applications using > > COMMIT NOWAIT. 
The performance gain needs to be substantial (like, 50% > > to 100%) to justify a compromise like this. > > WOAH... that seems excessive. There are a couple of things going on here. > > 1. We have a potential increase in performance for certain workloads. > This is good, but must be proven. IS that proof 50%? Bah.. let's talk > 15-25%. > > 2. We have to accept that not everyone wants IRON clad data integrity. > We have many, many options for dealing with that now, including PITR and > REPLICATION. > > > b) Why this and not global temporary tables or queuing? > > /me would love global temp tables. > > Much of the PostgreSQL Users out there today, will happily loose a 15 > minutes of data if it means their data is served 25% faster. > > Sincerely, > > Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Josh Berkus wrote: >> It seriously narrows down the problem space to know that PostgreSQL does *not* >> allow data loss if it's physically possible to prevent it. > But we do don't we? fsync = off, full_page_writes = off? One of the things that's really attractive about the proposed mode is that it does *not* create a risk of data corruption (assuming that Simon's analyzed it correctly --- I think the clog code in particular needs a look). What you risk is that when the database comes back up, its state may reflect an instant up to X seconds before the time of the crash, rather than exactly the crash time. It seems to me that that's way better than fsync = off, which allows unlimited corruption. I agree that we ought to look at some performance numbers before accepting the patch, but I think Josh's argument that this opens us up to major corruption problems is probably wrong. The question is whether your application can tolerate loss of "very recent" transactions, and I think there are plenty where it can. regards, tom lane
On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: > You're missing my point, which is that nobody has demonstrated that there > is any real performance gain from this. I see no reason to implement it > if there is no performance gain. While I'll back your request for results, it seems nearly impossible to expect no performance gain using this. Results never hurt though. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > COMMIT NOWAIT can co-exist with the normal form of COMMIT and does not > > threaten the consistency or robustness of other COMMIT modes. Read that > > again and think about it, before we go further, please. > > I read that, and thought about it, and don't think I believe it. The > problem is that there are more interconnections between different > transactions than you're allowing for. In particular you need to > justify that the behavior is safe for non-transactional operations like > btree page splits and pg_clog buffer page writes. The idea that's > particularly bothering me at the moment is that after a system crash, > we might come back up in a state where a NOWAIT transaction appears > committed when its updates didn't all get to disk. "Database corrupt" > is a situation that threatens all your transactions... > > > New commit mode is available by explicit command, or as a default > > setting that will be applied to all COMMITs, or both. > > I dislike introducing new nonstandard syntax ("Oracle compatible" is not > standard). If we did this I'd vote for control via a GUC setting only; > I think that is more useful anyway, as an application can be made to run > with such a setting without invasive source code changes. Agreed on the GUC-only. I don't see many people using the per-COMMIT setting without wanting it to be for many transactions in the session. (And, YES, I am VERY excited we are getting this feature.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Jonah H. Harris wrote: > On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: > > You're missing my point, which is that nobody has demonstrated that there > > is any real performance gain from this. I see no reason to implement it > > if there is no performance gain. > > While I'll back your request for results, it seems nearly impossible > to expect no performance gain using this. > > Results never hurt though. The results are going to be very close to fsync off for sufficiently high values of the delay, and we _know_ fsync off is a performance win. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Jim C. Nasby wrote: > On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote: > > 2. remove fsync parameter > > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still > want this for things like database restores. I think we will remove fsync in favor of the new delay, and allow -1 to be the same behavior as fsync off. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Jonah H. Harris wrote: >> On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: >>> You're missing my point, which is that nobody has demonstrated that there >>> is any real performance gain from this. I see no reason to implement it >>> if there is no performance gain. >> While I'll back your request for results, it seems nearly impossible >> to expect no performance gain using this. >> >> Results never hurt though. > > The results are going to be very close to fsync off for sufficiently > high values delay, and we _know_ fsync off is a performance win. This is an assumption. Yes, we know that fsync off is a performance win. We do not know that COMMIT NOWAIT is a performance win. Yes, we can all sit here and stare at what we *think* will be the result, and yes, I actually concur that it will be a performance win. However, I strongly concur that we need at least some evidence. It could easily be that a misstep in the code causes a loop over the wrong set, and all the performance we thought we would get is invalid, not because of theory or what should happen, but because of actual implementation. Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote: > Jim C. Nasby wrote: > > On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote: > > > 2. remove fsync parameter > > > > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still > > want this for things like database restores. > > I think we will remove fsync in favor of the new delay, and allow -1 to > be the same behavior as fsync off. Well, presumably we'd still allow fsync for some number of versions... Actually, I don't know that combining both settings is a wise move. The delay should still provide crash protection, whereas with fsync=off you've got absolutely no protection from anything. That's a huge difference, and one that IMHO warrants a separate setting (and a big, fat WARNING in the comment for that setting). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote: > > Jim C. Nasby wrote: > > > On Mon, Feb 26, 2007 at 10:56:58PM +0000, Simon Riggs wrote: > > > > 2. remove fsync parameter > > > > > > Why? Wouldn't fsync=off still speed up checkpoints? ISTM you'd still > > > want this for things like database restores. > > > > I think we will remove fsync in favor of the new delay, and allow -1 to > > be the same behavior as fsync off. > > Well, presumably we'd still allow fsync for some number of versions... No. > Actually, I don't know that combining both settings is a wise move. The > delay should still provide crash protection, whereas with fsync=off > you've got absolutely no protection from anything. That's a huge > difference, and one that IMHO warrants a separate setting (and a big, > fat WARNING in the comment for that setting). Yes, it needs a warning, or perhaps we just tell people to set it to something high and that is all they can do. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Feb 27, 2007 at 07:17:37PM -0500, Bruce Momjian wrote: > > Actually, I don't know that combining both settings is a wise move. The > > delay should still provide crash protection, whereas with fsync=off > > you've got absolutely no protection from anything. That's a huge > > difference, and one that IMHO warrants a separate setting (and a big, > > fat WARNING in the comment for that setting). > > Yes, it needs a warning, or perhaps we just tell people to set it to > something high and that is all they can do. Before doing that I'd want to see how the performance compares to fsync=off. My guess is that fsync=off is a big gain during checkpoints. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Joshua D. Drake" <jd@commandprompt.com> writes: >>> On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: >>>> >>>> I see no reason to implement it if there is no performance gain. > However, I strongly concur that we need at least some evidence. It could > easily be that a misstep in the code, causes a loop over the wrong set > and all the performance we thought we would get is invalid, not because > of theory or what should happen, but because of actual implementation. It rather sounds like you're asking for a proof that Simon can write bug-free code before you allow him to write any code. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom, > One of the things that's really attractive about the proposed mode is > that it does *not* create a risk of data corruption Oh, ok. That wasn't how I understood Simon's case. > I agree that we ought to look at some performance numbers before > accepting the patch, but I think Josh's argument that this opens us > up to major corruption problems is probably wrong. OK. I've seen no performance numbers yet though. It just seems to me that any performance patch proposal should start a discussion of what amount of performance we expect to gain. Unfortunately, this is *not* a patch I can test on TPCE or SpecJ, because both of those have ACID requirements which I don't think this would satisfy. I'd have to modify the benchmark, and I already have 4 performance patches queued which don't require that. -- Josh Berkus PostgreSQL @ Sun San Francisco
Gregory Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > >>>> On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: >>>>> I see no reason to implement it if there is no performance gain. > >> However, I strongly concur that we need at least some evidence. It could >> easily be that a misstep in the code, causes a loop over the wrong set >> and all the performance we thought we would get is invalid, not because >> of theory or what should happen, but because of actual implementation. > > It rather sounds like you're asking for a proof that Simon can write bug-free > code before you allow him to write any code. Well wouldn't that be great! :) but no, not quite. I would just like to see some metrics showing that it is a benefit. Besides the patch needs to work for the metrics to be run. Joshua D. Drake > > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Joshua D. Drake wrote: > Gregory Stark wrote: > > "Joshua D. Drake" <jd@commandprompt.com> writes: > > > >>>> On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: > >>>>> I see no reason to implement it if there is no performance gain. > > > >> However, I strongly concur that we need at least some evidence. It could > >> easily be that a misstep in the code, causes a loop over the wrong set > >> and all the performance we thought we would get is invalid, not because > >> of theory or what should happen, but because of actual implementation. > > > > It rather sounds like you're asking for a proof that Simon can write bug-free > > code before you allow him to write any code. > > Well wouldn't that be great! :) but no, not quite. I would just like to > see some metrics showing that it is a benefit. Besides the patch needs > to work for the metrics to be run. I don't understand the great demand for metrics at this point. Once the patch is ready, people can run the patch on their workloads to get real-world metrics. Metrics are only needed before the patch is applied, not before it is discussed. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
All, > > But we do don't we? fsync = off, full_page_writes = off? BTW, our testing seems to indicate that full_page_writes = off is safe on Solaris 10 on good hardware. At least, we haven't been able to break it yet. -- Josh Berkus PostgreSQL @ Sun San Francisco
On Tue, 2007-02-27 at 17:40 -0800, Josh Berkus wrote: > All, > > > > But we do don't we? fsync = off, full_page_writes = off? > > BTW, our testing seems to indicate that full_page_writes = off is safe on > Solaris 10 on good hardware. At least, we haven't been able to break it yet. > Is that an OS-dependent parameter? I always assumed it depended entirely on hardware. I have no way to test it for myself though, so I just leave full_page_writes=on to be safe. Regards, Jeff Davis
Jeff, > Is that an OS-dependent parameter? I always assumed it depended entirely > on hardware. I have no way to test it for myself though, so I just leave > full_page_writes=on to be safe. It's a question of whether your HW+OS can guarantee no torn page writes for the xlog. Running on Sun hardware combined with Solaris 10 with the xlog mounted forcedirectio, the Solaris folks are convinced we are torn-page-proof and so far we haven't been able to prove them wrong. And, on Solaris it's a substantial performance gain (like, 8-10% on OLTP benchmarks). -- Josh Berkus PostgreSQL @ Sun San Francisco
"Josh Berkus" <josh@agliodbs.com> writes: > OK. I've seen no performance numbers yet though. It just seems to me that > any performance patch proposal should start a discussion of what amount of > performance we expect to gain. There exist proposals that can be prototyped and measured to see what potential they have. I don't see how this is one of them though. The only way I could see to prototype this is would, as Bruce said, be to turn fsync off and consider that the extreme end of the spectrum. The reality will lie somewhere in between and exactly where will depend on the value of the delay. > Unfortunately, this is *not* a patch I can test on TPCE or SpecJ, because both > of those have ACID requirements which I don't think this would satisfy. I'd > have to modify the benchmark, and I already have 4 performance patches queue > which don't require that. I haven't read TPCE but if it's like TPCC then the simple addition of a UPS satisfies the ACID requirements even with fsync off entirely. Pretty lame but it's explicitly sufficient. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote: > Joshua D. Drake wrote: >> Gregory Stark wrote: >>> "Joshua D. Drake" <jd@commandprompt.com> writes: >>> >>>>>> On 2/27/07, Josh Berkus <josh@agliodbs.com> wrote: >>>>>>> I see no reason to implement it if there is no performance gain. >>>> However, I strongly concur that we need at least some evidence. It could >>>> easily be that a misstep in the code, causes a loop over the wrong set >>>> and all the performance we thought we would get is invalid, not because >>>> of theory or what should happen, but because of actual implementation. >>> It rather sounds like you're asking for a proof that Simon can write bug-free >>> code before you allow him to write any code. >> Well wouldn't that be great! :) but no, not quite. I would just like to >> see some metrics showing that it is a benefit. Besides the patch needs >> to work for the metrics to be run. > > I don't understand the great demand for metrics at this point. Once the > patch is ready, people can run the patch on their workloads to get > real-world metrics. Metrics are only needed before the patch is > applied, not before it is discussed. I don't disagree. Maybe I missed something here, but my whole argument for metrics was purely for the feature to be accepted. I would certainly expect full discussion. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
"Josh Berkus" <josh@agliodbs.com> writes: > It's a question of whether your HW+OS can guarentee no torn page writes for > the xlog. no, the data files. torn pages in the xlog is also a problem but we protect ourselves with a CRC and stop replay if it the CRC doesn't match. So the cost there is a bit of cpu, not extra i/o. > Running on Sun hardware combined with Solaris 10 with the xlog mounted > forcedirectio, the Solaris folks are convinced we are torn-page-proof and so > far we haven't been able to prove them wrong. And, on Solaris it's a > substantial performance gain (like, 8-10% on OLTP benchmarks). I would expect you to need a small non-volatile cache, either in the controller or the drive itself to be torn-page-proof. Or failing that to have drives that operate on 8kb sectors and guarantee that whole sectors get written using residual power. I don't think any drives operate in 8k sectors though. The scary thing about torn pages with full_page_writes off is that we don't offer any way to detect them. If both halves of the 8kb page look reasonable you could conceivably end up continuing without ever knowing your data is corrupt. That could happen if, for example, the change that was being written isn't very dramatic. Perhaps all that's missing is an update chain pointer for example. So you could have two versions of the same record but be missing the chain pointer in the old record. That would eventually lead to having two visible versions of the same record but no crashes or other red flags. I suggested a while back implementing torn page detection by writing a sequential number ever 512 bytes in the blocks. (I was talking about WAL at the time but the same principle applies.) Do it at the smgr layer using readv/writev and the upper layers need never know their data wasn't contiguous on disk. The only effect would be to shorten page sizes by 16 bytes which would be annoying but much less so than full_page_writes. 
-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
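The stamping scheme Gregory describes can be sketched in a few lines of Python (a toy model; the placement of the stamp bytes, the wraparound handling, and the increment-on-write-out he mentions later in the thread are all illustrative guesses, not the actual smgr design):

```python
PAGE = 8192
SECTOR = 512
STAMPS = PAGE // SECTOR          # 16 one-byte stamps; 16 bytes lost per page

def stamp_page(payload: bytes, seq: int) -> bytes:
    """Write the same sequence byte at the head of every 512-byte
    sector; the usable payload shrinks to 8192 - 16 bytes."""
    assert len(payload) == PAGE - STAMPS
    out, pos = bytearray(), 0
    for _ in range(STAMPS):
        out.append(seq % 256)               # wraps, as a real counter would
        out += payload[pos:pos + SECTOR - 1]
        pos += SECTOR - 1
    return bytes(out)

def is_torn(page: bytes) -> bool:
    """All 16 stamps must agree; a write torn across sector boundaries
    mixes stamps from the old and new page images."""
    return len({page[i] for i in range(0, PAGE, SECTOR)}) != 1

payload = bytes(PAGE - STAMPS)
old = stamp_page(payload, seq=7)
new = stamp_page(payload, seq=8)            # write-out increments the stamp
assert not is_torn(old) and not is_torn(new)

# A crash mid-write leaves the first sectors new and the rest stale:
torn = new[:4 * SECTOR] + old[4 * SECTOR:]
assert is_torn(torn)
```

The detection only works because disks write 512-byte sectors atomically, so a torn 8K write can differ only at sector granularity; it says nothing about corruption within a sector.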
On 2/27/07, Gregory Stark <stark@enterprisedb.com> wrote: > I suggested a while back implementing torn page detection by writing a > sequential number ever 512 bytes in the blocks. (I was talking about WAL at > the time but the same principle applies.) Do it at the smgr layer using > readv/writev and the upper layers need never know their data wasn't contiguous > on disk. The only effect would be to shorten page sizes by 16 bytes which > would be annoying but much less so than full_page_writes. First, rather than using 16-bytes per page and having to deal with handling the non-contiguous space, why not just use a page-level checksum like everyone else? Most of the systems I've seen seem to employ a simple CRC16 or CRC32. Second, unless I'm missing something, I don't see how your algorithm is going to work as each 512 byte chunk of the block will *always* have the same sequential byte value. That is, unless you have some way of preventing wraparound at 255 without adding additional block overhead. Lastly, from a performance perspective, it's going to be faster to compute the entire page's checksum than it would be to check the sequence every 512 bytes and perform the space adjustment. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On Wed, February 28, 2007 06:59, Jim C. Nasby wrote: > On Tue, Feb 27, 2007 at 05:18:28PM -0500, Bruce Momjian wrote: >> I think we will remove fsync in favor of the new delay, and allow -1 to >> be the same behavior as fsync off. > > Well, presumably we'd still allow fsync for some number of versions... I'd hate to lose the ability to disable fsync. I run tons of tests that don't require any protection against server crashes or hardware failures, but their speed does matter. I know it's not the most important requirement in the world, but speeding up those tests means I can run more of them, on more hardware, more often. Test time also affects my development cycle. My main worry is where the option is set, though. For my situation, selecting a "fast and sloppy" mode when starting the server is clearly the best choice. It'd be possible, though awkward, to change my code to use COMMIT NOWAIT. But then am I really sure I'm still testing the same thing? Plus it introduces a risk of binaries (distributed by others) accidentally doing COMMIT NOWAIT, as for testing, in production code. Jeroen
> > > > But we do don't we? fsync = off, full_page_writes = off? > > > > BTW, our testing seems to indicate that full_page_writes = off is safe > > on Solaris 10 on good hardware. At least, we haven't been able to break it yet. > > > > Is that an OS-dependent parameter? I always assumed it depended entirely > on hardware. I have no way to test it for myself though, so I just leave > full_page_writes=on to be safe. It also depends on the FS implementation. The OS/FS must guarantee that it does not chunk single data page write calls. Usually that is the case when OS/FS pagesize and pg pagesize are identical. And the HW needs to guarantee atomicity for single calls. e.g. on AIX you need to reduce the pg page size to 4k to be able to give those guarantees. Andreas
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > First, rather than using 16-bytes per page and having to deal with > handling the non-contiguous space, why not just use a page-level > checksum like everyone else? Most of the systems I've seen seem to > employ a simple CRC16 or CRC32. I think a CRC would be a useful feature for people who want an extra degree of protection from faulty hardware. But we've already seen that CRC checks can be expensive. Not everyone will want to take the cpu hit. Storing a byte counter in every block is cheap. And the idea came from what someone said MSSQL does, so "like everyone else" -- which isn't a very compelling argument to begin with -- doesn't argue against it. > Second, unless I'm missing something, I don't see how your algorithm > is going to work as each 512 byte chunk of the block will *always* > have the same sequential byte value. That is, unless you have some > way of preventing wraparound at 255 without adding additional block > overhead. I think the way you would work is to have the smgr note the sequential value it found when it read in a page and then when it writes it out increment that value by one. Conveniently the pages would be 16 bytes shorter than an 8kb page so you have 16 bytes available with every buffer to note information like the last sequential tag the buffer used. > Lastly, from a performance perspective, it's going to be faster to > compute the entire page's checksum than it would be to check the > sequence every 512 bytes and perform the space adjustment. That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already suffering a copy due to our use of read/write the difference between read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be non-zero but very small. Thousands of times quicker than the CRC. If we went to direct-io then it would entail an additional memory-copy which would be annoying. But that would still be much much cheaper than a CRC check. 
The best we could do in that case would be to do a CRC check at the same time as the memory move. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote:
> But we've already seen that CRC checks can be expensive. Not everyone will
> want to take the cpu hit. Storing a byte counter in every block is cheap.

CRC checking a page is most certainly the simplest. And I disagree that it
would be worse than either a sequence counter or the full page write. Block
checksumming is done at read/write time... which is something that needs to
be improved anyway. With a properly tuned bgwriter, the write itself should
barely be noticeable. How fast is a CRC of 8K? Last time I checked it was
something on the scale of ~95 usec for CRC32 and ~33 usec for sb8.

> And the idea came from what someone said MSSQL does, so "like everyone else"
> -- which isn't a very compelling argument to begin with -- doesn't argue
> against it.

Rather than basing designs on poor second-hand information, maybe you and
the person who mentioned this idea should get up to date and read the SQL
Server storage engine architecture.

As of SQL Server 2005, blocks *are* checksummed with CRC32. And, just for
the record, previous versions of SQL Server performed a bit-flipping
technique for every 512 bytes in the page header; it did *not* waste a byte
for every 512 bytes written.

> I think the way you would work is to have the smgr note the sequential value
> it found when it read in a page and then when it writes it out increment that
> value by one. Conveniently the pages would be 16 bytes shorter than an 8kb
> page so you have 16 bytes available with every buffer to note information like
> the last sequential tag the buffer used.

This proposed design is overcomplicated and a waste of space. I mean, we
reduce storage overhead using phantom command id and variable varlena, but
let's just fill it up again with unnecessary junk bytes.

> That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
> suffering a copy due to our use of read/write; the difference between
> read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
> non-zero but very small. Thousands of times quicker than the CRC.

Prove it.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > This proposed design is overcomplicated and a waste of space. I mean, > we reduce storage overhead using phantom command id and variable > varlena, but let's just fill it up again with unnecessary junk bytes. We reduced storage overhead using phantom command id by 8 bytes *per tuple*. I hardly think 8 bytes per page is much of a concern. You're already losing an average of 1/2 a tuple per page to rounding and that's a minimum of 16 bytes for the narrowest of tuples. >> That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already >> suffering a copy due to our use of read/write the difference between >> read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be >> non-zero but very small. Thousands of times quicker than the CRC. > > Prove it. We've already seen wal CRC checking show up at the top of profiles. Do you really doubt that memcpy is faster than CRC32 checking? Especially when you're already doing memcpy anyways and the only overhead is the few unaligned bytes at the end and the 8 one-byte copies? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
>
>> This proposed design is overcomplicated and a waste of space. I mean,
>> we reduce storage overhead using phantom command id and variable
>> varlena, but let's just fill it up again with unnecessary junk bytes.
>
> We reduced storage overhead using phantom command id by 8 bytes *per
> tuple*. I hardly think 8 bytes per page is much of a concern. You're
> already losing an average of 1/2 a tuple per page to rounding and that's
> a minimum of 16 bytes for the narrowest of tuples.
>
>>> That seems pretty unlikely. CRC checks are expensive cpu-wise, we're
>>> already suffering a copy due to our use of read/write; the difference
>>> between read/write of 8192 bytes and readv/writev of 511b*16+1*6 is
>>> going to be non-zero but very small. Thousands of times quicker than
>>> the CRC.
>>
>> Prove it.
>
> We've already seen wal CRC checking show up at the top of profiles.

yeah - on fast boxes (disk-I/O-wise) WAL CRC checking is nearly always at
the very top of profiles of WAL-intensive workloads.

Stefan
On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote:
> We reduced storage overhead using phantom command id by 8 bytes *per tuple*. I
> hardly think 8 bytes per page is much of a concern. You're already losing an
> average of 1/2 a tuple per page to rounding and that's a minimum of 16 bytes
> for the narrowest of tuples.

Again, it goes back to competent design. Save space here, waste it there.
SQL Server's bit-flipping technique is still *much* better than wasting 1
byte for every 512.

> We've already seen wal CRC checking show up at the top of profiles.

Well, when you consider we're performing a CRC for every log record rather
than at the block level, like most other systems, I wouldn't be that
surprised. Don't try and use that example as a reason to kill the checksum;
it's a completely different use case.

> Do you really doubt that memcpy is faster than CRC32 checking? Especially when
> you're already doing memcpy anyway and the only overhead is the few unaligned
> bytes at the end and the 8 one-byte copies?

I'm saying the complexity and implementation of it is going to get you a
bit more than you think. If it didn't, you'd already have the thing coded
and would be proving me wrong with the patch rather than theory. I can code
up a checksum version in an hour or less if you want to give it a go.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
>> Do you really doubt that memcpy is faster than CRC32 checking? >> Especially when >> you're already doing memcpy anyways and the only overhead is the few >> unaligned >> bytes at the end and the 8 one-byte copies? > > I'm saying the complexity and implementation of it is going to get you > a bit more than you think. If it didn't, you'd already have the thing > coded and would be proving me wrong with the patch rather than theory. > I can code up a checksum version in an hour or less if you want to > give it a go. LIVE FROM THE WWE, CAGE MATCH! Jonah (the Theorist) Harris versus Greg (the Brain) Stark. What is going to happen between these two brothers in arms when they must both prove their theory! Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
I think we need to think about when these CRCs would be read and written.
It would be written when the page hits the disk, hopefully by the
background writer, and I think after a server crash all pages would have to
be read and checked. The good news is that both of these are non-critical
paths.

---------------------------------------------------------------------------

Jonah H. Harris wrote:
> On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote:
> > But we've already seen that CRC checks can be expensive. Not everyone will
> > want to take the cpu hit. Storing a byte counter in every block is cheap.
>
> CRC checking a page is most certainly the simplest. And, I disagree
> that it would be worse than either a sequence counter or the full page
> write. Block checksumming is done at read/write time... which is
> something that needs to be improved anyway. With a properly tuned
> bgwriter, the write itself should barely be noticeable. How fast is a
> CRC of 8K? Last time I checked it was something on the scale of ~95
> usec for CRC32 and ~33 usec for sb8.
>
> > And the idea came from what someone said MSSQL does, so "like everyone else"
> > -- which isn't a very compelling argument to begin with -- doesn't argue
> > against it.
>
> Rather than basing designs on poor second-hand information, maybe you
> and the person who mentioned this idea should get up-to-date and read
> the SQL Server storage engine architecture.
>
> As of SQL Server 2005, blocks *are* checksummed with CRC32. And, just
> for the record, previous versions of SQL Server performed a bit
> flipping technique for every 512 bytes in the page header; it did
> *not* waste a byte for every 512 bytes written.
>
> > I think the way you would work is to have the smgr note the sequential value
> > it found when it read in a page and then when it writes it out increment that
> > value by one. Conveniently the pages would be 16 bytes shorter than an 8kb
> > page so you have 16 bytes available with every buffer to note information like
> > the last sequential tag the buffer used.
>
> This proposed design is overcomplicated and a waste of space. I mean,
> we reduce storage overhead using phantom command id and variable
> varlena, but let's just fill it up again with unnecessary junk bytes.
>
> > That seems pretty unlikely. CRC checks are expensive cpu-wise, we're already
> > suffering a copy due to our use of read/write; the difference between
> > read/write of 8192 bytes and readv/writev of 511b*16+1*6 is going to be
> > non-zero but very small. Thousands of times quicker than the CRC.
>
> Prove it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
"Gregory Stark" <stark@enterprisedb.com> writes: > We've already seen wal CRC checking show up at the top of profiles. > > Do you really doubt that memcpy is faster than CRC32 checking? Especially when > you're already doing memcpy anyways and the only overhead is the few unaligned > bytes at the end and the 8 one-byte copies? Well color me surprised, writev is not nearly so much faster than CRC as I had expected: lseek+write syscall overhead: 7.95 us CRC32 32.54 us writev 26.56 us The reason there's lseek overhead in there is because I had it seek back to the same block repeatedly to (hopefully) avoid any i/o. It seems to have worked as I find it hard to believe these numbers could be so low if there's any i/o being included. I think part of the reason writev is slow is because I'm including the time it took to set up the iovec array. That's 64 word copies write there. And then writev has to read those 64 words back and do 64 extra branches and so on... This is on an Intel T2500 (2Ghz). There is a side issue that tagging each sector is 100% guaranteed to detect torn pages whereas checksums still have a chance of missing them. But usually the scenario where that comes into play is where you have many checksum failures and are ignoring them assuming they never fail. In the case of torn pages there'll only be one torn page and we're going to scream bloody murder if we see it so I don't think that's a big issue. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Joshua D. Drake" <jd@commandprompt.com> writes: > LIVE FROM THE WWE, CAGE MATCH! > > Jonah (the Theorist) Harris versus Greg (the Brain) Stark. > > What is going to happen between these two brothers in arms when they > must both prove their theory! Darn, I wish I had seen this post before I posted the results of my testing. So what happens to the loser in WWE cage matches? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 2/28/07, Joshua D. Drake <jd@commandprompt.com> wrote: > LIVE FROM THE WWE, CAGE MATCH! > > Jonah (the Theorist) Harris versus Greg (the Brain) Stark. > > What is going to happen between these two brothers in arms when they > must both prove their theory! Heh, I like it :) -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
"Bruce Momjian" <bruce@momjian.us> writes: > I think we need to think about when these CRCs would be read and > written. It would be written when it hits the disk, hopefully by the > background writer, and I think after a server crash, all pages would > have to be read and checked. The good news is that both of these are > non-critical paths. If you're protecting against torn pages then yes, if the system is shut down uncleanly by a system crash or power failure you would in theory have to scan every page of every table and index before starting up. But if the system was shut down uncleanly as the result of a Postgres crash or fast shutdown of Postgres then that isn't an issue. And many users may prefer to bring the system up as soon as possible as long as they know any corrupt pages will be spotted and throw errors as soon as it's seen. So I think you need a mode that only checks checksums when a page is read from disk. That would protect against torn pages (but not necessarily before bringing up the system) and against bad i/o hardware. Unfortunately memory errors are far more common than disk errors and I it would be much harder to protect against them. You can't check it when someone may be writing to the buffer, which limits you to checking it only when you acquire some form of lock on the buffer. It also means you would have to write it before you release a lock if you've made any changes. Worse, I'm not sure how to handle hint bits though. We currently don't require any lock at all to set hint bits which means someone may think they can check a checksum while or after you've fiddled some bits. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > >> LIVE FROM THE WWE, CAGE MATCH! >> >> Jonah (the Theorist) Harris versus Greg (the Brain) Stark. >> >> What is going to happen between these two brothers in arms when they >> must both prove their theory! > > Darn, I wish I had seen this post before I posted the results of my testing. > > So what happens to the loser in WWE cage matches? Usually the are left bloody and staring at the winner through the cage. Occasionally they are carried out on a stretcher. The good news is, the loser gets a rematch at the next pay per view. Sincerely, Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
>
> > I think we need to think about when these CRCs would be read and
> > written. It would be written when it hits the disk, hopefully by the
> > background writer, and I think after a server crash, all pages would
> > have to be read and checked. The good news is that both of these are
> > non-critical paths.
>
> If you're protecting against torn pages then yes, if the system is shut down
> uncleanly by a system crash or power failure you would in theory have to scan
> every page of every table and index before starting up.
>
> But if the system was shut down uncleanly as the result of a Postgres crash or
> fast shutdown of Postgres then that isn't an issue. And many users may prefer
> to bring the system up as soon as possible as long as they know any corrupt
> pages will be spotted and throw errors as soon as it's seen.

I don't think we should start up a system and only detect the errors
later.

> So I think you need a mode that only checks checksums when a page is read from
> disk. That would protect against torn pages (but not necessarily before
> bringing up the system) and against bad i/o hardware.
>
> Unfortunately memory errors are far more common than disk errors and it
> would be much harder to protect against them. You can't check it when someone
> may be writing to the buffer, which limits you to checking it only when you
> acquire some form of lock on the buffer. It also means you would have to write
> it before you release a lock if you've made any changes.
>
> Worse, I'm not sure how to handle hint bits though. We currently don't require
> any lock at all to set hint bits which means someone may think they can check
> a checksum while or after you've fiddled some bits.

Yep, a problem.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote:
> > But if the system was shut down uncleanly as the result of a Postgres crash or
> > fast shutdown of Postgres then that isn't an issue. And many users may prefer
> > to bring the system up as soon as possible as long as they know any corrupt
> > pages will be spotted and throw errors as soon as it's seen.
>
> I don't think we should start up a system and only detect the errors
> later.

Which is, of course, how everyone else does it. On block access, the
checksum is verified (if you've turned checksum checking on). I *really*
doubt you want to pull in every page in the database at startup time to
verify the checksum or sequence. Even pages from the last checkpoint would
be a killer.

All of the databases (Oracle, SQL Server, DB2) have a way to perform a
database corruption check which does go out and verify all checksums.

If consistency is stored at the block level, which is pretty much the only
way to avoid full page writes, you have to accept some level of possible
corruption.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote:
> > > But if the system was shut down uncleanly as the result of a Postgres crash or
> > > fast shutdown of Postgres then that isn't an issue. And many users may prefer
> > > to bring the system up as soon as possible as long as they know any corrupt
> > > pages will be spotted and throw errors as soon as it's seen.
> >
> > I don't think we should start up a system and only detect the errors
> > later.
>
> Which is, of course, how everyone else does it. On block access, the
> checksum is verified (if you've turned checksum checking on). I
> *really* doubt you want to pull in every page in the database at
> startup time to verify the checksum or sequence. Even pages from the
> last checkpoint would be a killer.
>
> All of the databases (Oracle, SQL Server, DB2) have a way to perform a
> database corruption check which does go out and verify all checksums.
>
> If consistency is stored at the block level, which is pretty much the
> only way to avoid full page writes, you have to accept some level of
> possible corruption.

I am not comfortable starting and having something fail later. How other
databases do it is not an issue for me.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote:
> > I am not comfortable starting and having something fail later.
>
> Then do you have some other idea for protecting pages from being torn
> without storing an entire backup copy or performing a block-level
> consistency check?

We have several methods suggested to check the blocks, like CRC. My point
was that, whatever check method we use, we should be prepared to check on
startup, or at least make it the default for a crash restart.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote: > We have several methods suggested to check the blocks, like CRC. My > point was that, whatever check method we use, we should be prepared to > check on startup, or at least make it the default for a crash restart. Sounds like it should be a guc. I most certainly wouldn't check the entire database, especially if it was over a gigabyte. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On Wed, 2007-02-28 at 14:10 -0500, Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote:
> > > But if the system was shut down uncleanly as the result of a Postgres crash or
> > > fast shutdown of Postgres then that isn't an issue. And many users may prefer
> > > to bring the system up as soon as possible as long as they know any corrupt
> > > pages will be spotted and throw errors as soon as it's seen.
> >
> > I don't think we should start up a system and only detect the errors
> > later.
>
> Which is, of course, how everyone else does it. On block access, the
> checksum is verified (if you've turned checksum checking on). I
> *really* doubt you want to pull in every page in the database at
> startup time to verify the checksum or sequence. Even pages from the
> last checkpoint would be a killer.

Under normal operations, shutting down the database does a checkpoint,
right? So unless you're in recovery mode, there's no additional cost. And I
can't think of any reason you'd need to see any pages from before the last
checkpoint (unless you don't trust your disk and just want to check all the
pages, which is more than we can do now anyway).

So the additional cost of doing CRCs every time would be the CPU cost, and
also the cost during recovery of reading in all the data pages since the
last checkpoint. That's 5 minutes of data, in the default configuration.

Regards,
Jeff Davis
Jonah H. Harris wrote: > On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote: > > We have several methods suggested to check the blocks, like CRC. My > > point was that, whatever check method we use, we should be prepared to > > check on startup, or at least make it the default for a crash restart. > > Sounds like it should be a guc. I most certainly wouldn't check the > entire database, especially if it was over a gigabyte. Keep in mind if you don't check it on startup, you will be checking it for every read, which for rare crashes, might not be wise. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote: > Keep in mind if you don't check it on startup, you will be checking it > for every read, which for rare crashes, might not be wise. Well understood. That's how most everyone configures their database systems; they certainly don't optimize for torn page detection on a crash. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On 2/28/07, Jeff Davis <pgsql@j-davis.com> wrote: > That's 5 minutes of data, in the default configuration. Right, but I don't know anyone that keeps checkpoints at 5 minutes. At least not on OLTP configurations. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
Jonah H. Harris wrote:
> On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote:
>> > But if the system was shut down uncleanly as the result of a
>> > Postgres crash or fast shutdown of Postgres then that isn't an
>> > issue. And many users may prefer to bring the system up as soon as
>> > possible as long as they know any corrupt pages will be spotted and
>> > throw errors as soon as it's seen.
>>
>> I don't think we should start up a system and only detect the errors
>> later.
>
> Which is, of course, how everyone else does it. On block access, the
> checksum is verified (if you've turned checksum checking on). I
> *really* doubt you want to pull in every page in the database at
> startup time to verify the checksum or sequence. Even pages from the
> last checkpoint would be a killer.

AFAICS pages from the last checkpoint would be sufficient, and not that
expensive, given that we have to pull all the pages touched since the last
checkpoint from disk to do the WAL replay anyway.

> All of the databases (Oracle, SQL Server, DB2) have a way to perform a
> database corruption check which does go out and verify all checksums.

I think that's to protect from random disk errors rather than just torn
pages. Which might be a useful thing, but I think that level of protection
belongs to the filesystem or storage hardware.

What's the use case for the torn-page detection, anyway? If you don't
trust your OS/hardware to protect you from torn pages, but you do care
about your data, surely you would use full_page_writes = on. If you do
trust your OS/hardware, just turn it off. And if you don't care about your
data, what's the point?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Jonah H. Harris wrote: > On 2/28/07, Jeff Davis <pgsql@j-davis.com> wrote: >> That's 5 minutes of data, in the default configuration. > > Right, but I don't know anyone that keeps checkpoints at 5 minutes. > At least not on OLTP configurations. Uhmm... most do because most don't ever touch the postgresql.conf and those that do, don't touch checkpoints because they don't understand it. Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
On Wed, 2007-02-28 at 14:54 -0500, Jonah H. Harris wrote:
> On 2/28/07, Jeff Davis <pgsql@j-davis.com> wrote:
> > That's 5 minutes of data, in the default configuration.
>
> Right, but I don't know anyone that keeps checkpoints at 5 minutes.
> At least not on OLTP configurations.

It's got a hard maximum of 1 hour. Also, the need to make checkpoints far
apart has been reduced with the bgwriter, and it will be further reduced
with the patch that allows people to tune the bgwriter for their needs.

Recovery has to run through all those WAL segments anyway. It's not like
we're making a 5-second recovery take 20 minutes; we'd be taking an already
long recovery and making it longer (I'm not sure how much longer, but it
can't be more than twice as long).

I'm not saying there's no cost, but the extra recovery cost seems lower to
me than the CRC cost on every data page read during operation.

Also, if we find an error, do we even have the ability to correct it? A CRC
doesn't tell us which pages were written and which weren't, so we could
detect the error but not correct it, right?

Regards,
Jeff Davis
On 2/28/07, Jeff Davis <pgsql@j-davis.com> wrote: > I'm not saying there's no cost, but the extra recovery cost seems lower > to me than the CRC cost on every data page read during operation. I agree, I just think it should be configurable. > Also, if we find an error, do we even have the ability to correct it? A > CRC doesn't tell us which pages were written and which weren't, so we > could detect the error but not correct it, right? Correct. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On 2/28/07, Joshua D. Drake <jd@commandprompt.com> wrote: > > Right, but I don't know anyone that keeps checkpoints at 5 minutes. > > At least not on OLTP configurations. > > Uhmm... most do because most don't ever touch the postgresql.conf and > those that do, don't touch checkpoints because they don't understand it. Yes, I guess I should've qualified that to specify people who do understand the system better. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
Jonah H. Harris wrote:
> On 2/28/07, Joshua D. Drake <jd@commandprompt.com> wrote:
>> > Right, but I don't know anyone that keeps checkpoints at 5 minutes.
>> > At least not on OLTP configurations.
>>
>> Uhmm... most do, because most don't ever touch the postgresql.conf, and
>> those that do don't touch checkpoints because they don't understand them.
>
> Yes, I guess I should've qualified that to specify people who do
> understand the system better.

I am not trying to pick on the issue, but I do think it is important to
recognize that literally only those in the know are ever going to touch the
postgresql.conf. I deal with lots of customers who employ lots of
programmers, from whom I have often heard, "There is a postgresql.conf?"
Why? Because they are user-space programmers using an ORM or other such
tech that allows them to never actually log in to postgresql or a shell or
anything once they edit the pg_hba.conf.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
On 2/28/07, Joshua D. Drake <jd@commandprompt.com> wrote: > I am not trying to pick on the issue but I do think it is important to > recognize that literally only those in the know, are going to ever touch > the postgresql.conf. I agree. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > Which is, of course, how everyone else does it. I happen to agree with your conclusion but this line of argument is exceptionally unconvincing. In fact in this crowd you'll tend to turn people off and lose people if you say things like that rather than convince anyone of anything. > Even pages from the last checkpoint would be a killer. Hm that's an interesting thought. We only really have to check pages that would have received a full page write since the last checkpoint. So if we made turning full page writes off still record the page ids of the pages it *would* have written then we just need the code that normally replays full page writes to check the checksum if the page data isn't available. I can't see how that would be a killer. No matter how large a system you're talking about you're going to tune checkpoints to be occurring at about the same interval anyways. So the amount of time the wal replay checksum checking takes will be more or less constant. In fact we're already reading in most, if not all, of those pages anyways since we're replaying wal records that touch them after all. Would we even have to do anything extra? If we check checksums whenever we read in a page surely the wal replay code would automatically detect any torn pages without any special attention. That also makes it clear just how awful full page writes are for scalability. As you scale up the system but try to keep checkpoint intervals constant you're less and less likely to ever see the same page twice between two checkpoints. So as you scale the system up more and more of the wal will consist of full page writes. > All of the databases (Oracle, SQL Server, DB2) have a way to perform a > database corruption check which does go out and verify all checksums. Which is pretty poor design. 
If we implemented a fsck-like tool I would be far more interested in checking things like "tuples don't overlap" or "hint bits are set correctly" and so on. Checksums do nothing to protect against software failures which is the only kind of failure with a good rationale for being in an external tool. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: > > > Which is, of course, how everyone else does it. > > I happen to agree with your conclusion but this line of argument is > exceptionally unconvincing. In fact in this crowd you'll tend to turn people > off and lose people if you say things like that rather than convince anyone of > anything. Rather than reinventing the wheel, it often pays to piggyback on the solutions others in similar situations have encountered. I'm just stating how others provide similar functionality or capabilities. If someone dislikes an idea just because the major vendors have done it that way, that's their own problem. It's up to the community to decide how to proceed given the information at hand. > > Even pages from the last checkpoint would be a killer. > > Hm that's an interesting thought. We only really have to check pages that > would have received a full page write since the last checkpoint. That's the only way I see that it could possibly be acceptable from a time-to-recover performance standpoint. I would still prefer a guc. > Which is pretty poor design. If we implemented a fsck-like tool I would be far > more interested in checking things like "tuples don't overlap" or "hint bits > are set correctly" and so on. Checksums do nothing to protect against software > failures which is the only kind of failure with a good rationale for being in > an external tool. Regardless of whether it's better as a separate tool or in the database itself, they provide a corruption-finding/consistency-checking capability. As far as other checks that could be performed, SQL Server and Oracle do have their own internal structure checks; many of which execute at runtime, not as a separate tool or process. -- Jonah H. 
Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote: > I am not comfortable starting and having something fail later. Then do you have some other idea for protecting pages from being torn without storing an entire backup copy or performing a block-level consistency check? > How other databases do it is not an issue for me. Not saying it is, just stating the facts. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On Wed, 2007-02-28 at 21:13 +0000, Gregory Stark wrote: > Hm that's an interesting thought. We only really have to check pages that > would have received a full page write since the last checkpoint. So if we made Do we ever do a partial page write, or is what you're saying equivalent to "we only have to check pages that have been written to since the last checkpoint"? And if it is the same, can't we get the pages that were written to from the ctids in the wal records? Regards, Jeff Davis
On Tue, 2007-02-27 at 09:32 -0800, Josh Berkus wrote: > Simon, I think if you address me in a mail, it would be best not to explicitly *remove* my name from the address list. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Jonah H. Harris wrote: > On 2/28/07, Bruce Momjian <bruce@momjian.us> wrote: > > I am not comfortable starting and having something fail later. > > Then do you have some other idea for protecting pages from being torn > without storing an entire backup copy or performing a block-level > consistency check? I want the startup to fail. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote: >> "Jonah H. Harris" <jonah.harris@gmail.com> writes: >> >> > Which is, of course, how everyone else does it. >> >> I happen to agree with your conclusion but this line of argument is >> exceptionally unconvincing. In fact in this crowd you'll tend to turn people >> off and lose people if you say things like that rather than convince anyone of >> anything. > > Rather than reinventing the wheel, it often pays to piggyback on the > solutions others in similar situations have encountered. I'm just > stating how others provide similar functionality or capabilities. If > someone dislikes an idea just because the major vendors have done it > that way, that's their own problem. It's up to the community to > decide how to proceed given the information at hand. Except that's not what you're doing. There's nothing wrong with saying "foo does this clever thing I think we should copy because <insert argument>". Nor even "foo does this thing, would that help us?" But what you seem to be saying is "*Because* foo does this thing we can conclude it's a clever thing and we should do it". -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Simon, > I think if you address me in a mail, it would be best not to explicitly > *remove* my name from the address list. I was trying to remove everyone but the list address. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
"Jeff Davis" <pgsql@j-davis.com> writes: > On Wed, 2007-02-28 at 21:13 +0000, Gregory Stark wrote: >> Hm that's an interesting thought. We only really have to check pages that >> would have received a full page write since the last checkpoint. So if we made > > Do we ever do a partial page write, or is what you're saying equivalent > to "we only have to check pages that have been written to since the last > checkpoint"? "full page writes" is a term of art in Postgres. When we first modify a page we stuff a copy of the entire page in the WAL so that recovery can restore it. We never write only part of a page to data files intentionally but if the system crashes while it's in the process of being written only part of it might end up getting written. Normally only the modified bits get put in WAL. > And if it is the same, can't we get the pages that were written to from > the ctids in the wal records? yes, though it may still be convenient to use the full page write fields -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote: > Except that's not what you're doing. There's nothing wrong with saying "foo > does this clever thing I think we should copy because <insert argument>". Nor > even "foo does this thing, would that help us?" But what you seem to be saying > is "*Because* foo does this thing we can conclude it's a clever thing and we > should do it". Oracle, Microsoft, IBM, Sybase, Teradata, MySQL, and Firebird have a clever feature called page checksumming which I think we should copy because it's simple and effective at detecting page-level corruption due to torn pages and/or faulty storage hardware. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
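The mechanics being proposed are simple to sketch. The layout below is hypothetical (PostgreSQL's page header had no checksum field at the time of this thread) and the checksum function is a deliberately naive stand-in for the CRC32/Adler32/Fletcher candidates discussed here: stamp the page just before it is written out, verify on read-in, and treat a mismatch as a torn page or bad storage hardware.

```c
#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Hypothetical page layout: a dedicated slot for the checksum, with the
 * rest of the block covered by it. Real page headers hold far more. */
typedef struct {
    uint32_t      checksum;
    unsigned char data[BLCKSZ - sizeof(uint32_t)];
} Page;

/* Placeholder position-sensitive checksum; a real implementation would
 * substitute CRC32, Adler32, or Fletcher as debated in this thread. */
static uint32_t page_checksum(const Page *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof(page->data); i++)
        sum = sum * 31 + page->data[i];
    return sum;
}

/* Stamp the page immediately before writing it to the data file. */
static void page_set_checksum(Page *page)
{
    page->checksum = page_checksum(page);
}

/* On read-in: a mismatch signals a torn page or storage corruption. */
static int page_verify(const Page *page)
{
    return page->checksum == page_checksum(page);
}
```

The contentious questions in this thread are not these mechanics but when to verify (every read vs. recovery only) and what to do on failure.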
On 2/28/07, Gregory Stark <stark@enterprisedb.com> wrote: > Well color me surprised, writev is not nearly so much faster than CRC as I had > expected: All fun aside, are you going to be submitting a patch for this? -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On Feb 28, 2007, at 4:40 PM, Jonah H. Harris wrote: > Oracle, Microsoft, IBM, Sybase, Teradata, MySQL, and Firebird have a > clever feature called page checksumming which I think we should copy > because it's simple and effective at detecting page-level corruption > due to torn pages and/or faulty storage hardware. Any system that does enough I/O will occasionally generate corrupted pages in the complete absence of any other detectable fault or hardware failure. A fair amount has been written about it with respect to large-scale computing, and it happens often enough when systems start getting large that just about everyone implements software I/O checksumming eventually to deal with the problem. I simply assumed that PostgreSQL was doing the same since it is definitely used for systems that are large enough that this becomes a statistically significant issue. A popular alternative to CRC32 for this purpose is the significantly cheaper and almost as effective Adler32 algorithm. I know Google used this algorithm when they added checksumming to their database to tame inexplicable transient corruption. Cheers, J. Andrew Rogers
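For reference, Adler-32 is short enough to quote in full. This is a straightforward, unoptimized transcription of the RFC 1950 (zlib) definition — production versions batch the modulo reductions instead of reducing every byte:

```c
#include <stdint.h>
#include <stddef.h>

/* Adler-32 (RFC 1950): two running sums modulo 65521, the largest
 * prime below 2^16. sum1 starts at 1, and the result packs sum2 in
 * the high 16 bits and sum1 in the low 16 bits. */
static uint32_t adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}
```

Its speed comes from using only additions in the inner loop; the trade-off is weaker mixing than CRC32 for short or low-entropy inputs, which is why it is described above as "almost as effective".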
On 2/28/07, J. Andrew Rogers <jrogers@neopolitan.com> wrote: > A popular alternative to CRC32 for this purpose is the significantly > cheaper and almost as effective Adler32 algorithm. I know > Google used this algorithm when they added checksumming to their > database to tame inexplicable transient corruption. Last time I checked, Teradata used a modified Fletcher checksum as well; I wasn't aware of Adler32. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On 2/28/07, Jonah H. Harris <jonah.harris@gmail.com> wrote: > Last time I checked, Teradata used a modified Fletcher checksum as > well; I wasn't aware of Adler32. I added adler32 to Gurjeet Singh's crc testing utility, compiled it with gcc 4.0.0 on a single-core Opteron running FC4, and configured it for 8K block sizes. GCC options were -mtune=opteron and the optimizations stated below:

Generic optimization:
test parameters:: SF=1, block_size=8, list_size=1000
Generating data... done. Time taken : 93 milliseconds
Generating CRC using PG (vanilla) algo... done. Time taken : 117 milliseconds
Verifying CRC using PG (vanilla) algo... done. Time taken : 89 milliseconds
Generating CRC using SliceBy8 algo... done. Time taken : 44 milliseconds
Verifying CRC using SliceBy8 algo... done. Time taken : 44 milliseconds
Generating CRC using Adler32 algo... done. Time taken : 52 milliseconds
Verifying CRC using Adler32 algo... done. Time taken : 53 milliseconds

Using -O2:
test parameters:: SF=1, block_size=8, list_size=1000
Generating data... done. Time taken : 85 milliseconds
Generating CRC using PG (vanilla) algo... done. Time taken : 31 milliseconds
Verifying CRC using PG (vanilla) algo... done. Time taken : 30 milliseconds
Generating CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Verifying CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Generating CRC using Adler32 algo... done. Time taken : 14 milliseconds
Verifying CRC using Adler32 algo... done. Time taken : 14 milliseconds

Using -O3:
test parameters:: SF=1, block_size=8, list_size=1000
Generating data... done. Time taken : 87 milliseconds
Generating CRC using PG (vanilla) algo... done. Time taken : 31 milliseconds
Verifying CRC using PG (vanilla) algo... done. Time taken : 31 milliseconds
Generating CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Verifying CRC using SliceBy8 algo... done. Time taken : 30 milliseconds
Generating CRC using Adler32 algo... done. Time taken : 14 milliseconds
Verifying CRC using Adler32 algo... done. Time taken : 15 milliseconds

Interesting... -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
On 2/28/07, Jonah H. Harris <jonah.harris@gmail.com> wrote: > I added adler32 to Gurjeet Singh's crc testing utility, compiled it > with gcc 4.0.0 on a single-core Opteron running FC4, and configured it > for 8K block sizes. Tested Fletcher and seeing similar results ratio-wise. Using -O3, it's:

Generating CRC using Fletcher algo... done. Time taken : 11 milliseconds
Verifying CRC using Fletcher algo... done. Time taken : 11 milliseconds

-- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
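Teradata's "modified" variant isn't public, but for comparison the textbook Fletcher-32 over 16-bit words is below (unoptimized sketch; like Adler-32, fast versions defer the modulo over long runs). The addition-only inner loop explains timings in the same range as Adler32:

```c
#include <stdint.h>
#include <stddef.h>

/* Textbook Fletcher-32: two running sums over 16-bit words, modulo
 * 65535. Odd-length byte inputs are conventionally zero-padded to a
 * whole number of words before calling this. */
static uint32_t fletcher32(const uint16_t *data, size_t words)
{
    uint32_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < words; i++) {
        sum1 = (sum1 + data[i]) % 65535;
        sum2 = (sum2 + sum1) % 65535;
    }
    return (sum2 << 16) | sum1;
}
```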
> Under normal operations, shutting down the database does a > checkpoint, right? So unless you're in recovery mode, there's > no additional cost. > And I can't think of any reason you'd need to see any pages > before the last checkpoint (unless you don't trust your disk > and just want to check all the pages, which is more than we > can do now anyway). The problem is you don't necessarily have the info about which pages changed *after* the last checkpoint. Especially since in the mentioned scenario you don't have full-page images. Maybe a suitable replacement for full-page would be to sync the first WAL record for a page change before writing the buffer (startup recovery would then check all pages it touches). This syncing does not sound promising though. And it is also not only one page we are looking for, but one per controller (for battery-backed) or one per phys disk. Andreas
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes: > Maybe a suitable replacement for full-page would be to sync the first WAL > record for a page change before writing the buffer We *always* sync WAL records for page changes before writing the buffer for the page. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
> > Maybe a suitable replacement for full-page would be to sync the first > > WAL record for a page change before writing the buffer > > We *always* sync WAL records for page changes before writing > the buffer for the page. Um, is that so ? And how is that done ? (e.g. bgwriter would need to wait for or even initiate a WAL record sync) Seems this is not clear to others either. Why else would people think we need a full data scan after crash without full-pages ? If this is true, all pages with possibly pending IO during crash will be touched and could thus be checked during startup recovery. I can see that we need to sync wal before write for certain changes, but I don't see why we would currently need it for e.g. a heap insert, that gets a free slot. Andreas
Zeugswetter Andreas ADI SD wrote: >>> Maybe a suitable replacement for full-page would be to sync the first >>> WAL record for a page change before writing the buffer >> We *always* sync WAL records for page changes before writing >> the buffer for the page. > > Um, is that so ? And how is that done ? (e.g. bgwriter would need to > wait for or even initiate a WAL record sync) > Seems this is not clear to others either. Yes. The bgwriter or backend that flushes the buffer indeed initiates a WAL sync if the WAL record corresponding to the LSN of the buffer hasn't been flushed yet. > Why else would people think we need a full data scan after crash without > full-pages ? They're confused. ;-) > If this is true, all pages with possibly pending IO during crash will be > touched and could thus be checked during startup recovery. That's right. Torn page detection wouldn't require any extra I/O, just CPU. > I can see that we need to sync wal before write for certain changes, but > I don't see why we would currently need it for e.g. a heap insert, that > gets a free slot. The corresponding index insert might hit the disk but the heap insert might not. Now if someone after the crash uses the same heap slot, you'd end up with the index tuple of the aborted transaction pointing to the wrong heap tuple. There might be other error scenarios; that's just the first example I could think of. In practice, it's extremely rare for a writer to need to do the WAL flush. When a buffer is chosen as a victim, it's been unused for some time, and the WAL has already been flushed. It does happen when you're vacuuming, see Simon's recent thread, but we should fix that anyway. It also happens on checkpoints, but it's not a big performance issue. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
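The WAL-before-data rule being described can be captured in a toy model (names illustrative, not PostgreSQL's actual bufmgr API): before a dirty page goes to the data file, the WAL is forced out at least up to that page's LSN. This is also why the interlock is usually free — by the time a buffer becomes a flush victim, the WAL normally already covers its LSN.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Illustrative model state: WAL position flushed to disk so far. */
static XLogRecPtr wal_flushed = 0;

/* Stands in for fsync()ing the WAL up to the requested position. */
static void xlog_flush(XLogRecPtr upto)
{
    if (upto > wal_flushed)
        wal_flushed = upto;
}

typedef struct {
    XLogRecPtr page_lsn;   /* LSN of last WAL record touching this page */
    int        dirty;
} Buffer;

/* WAL-before-data: never write a data page before the WAL record that
 * last modified it is safely on disk. Usually a no-op, since the WAL
 * has long since been flushed past page_lsn. */
static void flush_buffer(Buffer *buf)
{
    xlog_flush(buf->page_lsn);
    /* ... write the page to the data file here ... */
    buf->dirty = 0;
}
```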
On Wed, Feb 28, 2007 at 12:16:10PM -0500, Bruce Momjian wrote: > background writer, and I think after a server crash, all pages would > have to be read and checked. The good news is that both of these are Would they? If you're doing recovery you'd have to read all pages dirtied since the last checkpoint... could there be pages other than those that had been torn? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Monday 26 February 2007 19:27, A.M. wrote: > On Feb 26, 2007, at 18:58 , Simon Riggs wrote: > > On Mon, 2007-02-26 at 23:25 +0000, Richard Huxton wrote: > >> Simon Riggs wrote: > >>> Proposal: Implement a new option for COMMIT, for enhancing > >>> performance, > >>> providing a MySQL-like trade-off between performance and > >>> robustness for > >>> *only* those that want it. > >>> > >>> COMMIT NOWAIT > > Isn't the domain of transient data relegated to certain tables > instead of specific transactions? Wouldn't it be easier to create > un-wal-logged global temp tables? A similar idea we've been kicking around would be having a set storage parameter = nologging option for alter table which would, as its name implies, cause the system to ignore writing wal logs for the table, much like it does for temp tables now. One caveat would be you would probably need to forbid such a table from being the parent side of a FK relationship, but otherwise this should be fairly safe even for replay since alter table needs an exclusive lock, so you have finite points where data for the table would be written or ignored. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Mar 3, 2007, at 23:19, Robert Treat wrote: > A similar idea we've been kicking around would be having a set storage > parameter = nologging option for alter table which would, as its name > implies, cause the system to ignore writing wal logs for the table, much > like it does for temp tables now. One caveat would be you would probably > need to forbid such a table from being the parent side of a FK > relationship, but otherwise this should be fairly safe even for replay > since alter table needs an exclusive lock, so you have finite points > where data for the table would be written or ignored. I can't think of any reason why a global temp table couldn't be referenced by another global temp table, but maybe I'm missing something. Whenever postgres starts, it would simply truncate the tables to ensure "temporary" compliance (the no-foot-gun approach). I could then place such tables in a ramdisk tablespace and make postgresql a transaction-safe memcached replacement. This would also be great for materialized views. Cheers, M