Discussion: redolog - for discussion
Hi, here are some details I would like to discuss before being so deep in the implementation that it's hard to change them later.

Point 1 is an extension of the pg_database relation, required to see the actual redolog state of a database. The new fields are:

    lastbackup    datetime
    redomode      int4
    redoseq1      int4

lastbackup is the time the last successful full backup was taken; more precisely, the time when pg_dump switched the backend into online backup mode. redomode is defined as 0=none, 1=async-logging, 2=sync-logging, 4=restore, 5=recover, 6=error. redoseq1 is the sequence number of the redolog file begun when pg_dump switched to online backup mode (this command implies a logfile switch).

Point 2 is the extension of the query language. All the statements are restricted to superusers or the owner of the database. The SQL statements to control the whole process are:

    ALTER DATABASE REDOMODE {NONE | ASYNCHRONOUS | SYNCHRONOUS};

Turns logging for the database on or off. The database must be in normal operation mode for it (not restore or recover mode).

    ALTER DATABASE BEGIN BACKUP;

Issued by pg_dump before doing anything else. The command stops ALL other activity in the database, so pg_dump has time to pull out at least the information about sequences (actually it does this while getting tables; this might require some changes there so the database becomes accessible again soon).

    ALTER DATABASE ONLINE BACKUP;

Issued by pg_dump when it has finished the things that require totally exclusive database access. At this time a logfile switch is done (only if the actual database is really logged) and the sequence number of the new logfile plus the current datetime are remembered. The behaviour of pg_dump's backend changes: it will see a snapshot of this time (implemented in the tqual code) in any subsequent command, and it is totally unable to do anything that would update the database. Until the final END BACKUP is given, no VACUUM or DROP TABLE etc. commands can be run.
If they are issued, the command will be delayed until pg_dump has finished.

    ALTER DATABASE END BACKUP;

This turns off the special behaviour of pg_dump's backend. Additionally, the remembered time and redolog sequence are stored in pg_database; pg_dump can read them out for the final statement in the dump output (see below).

    ALTER DATABASE BEGIN RESTORE;

This command checks that the actual database has just been created and that not one single command has been executed before. It is the first command in pg_dump's output if the database being dumped is a logged one. It switches the database into restore mode. In this mode, the first command on a new database connection must be the special command RECOVER DATABASE AS USER 'uname' or an ALTER DATABASE END RESTORE ...; When doing the ACL stuff, pg_dump must output a reconnect (\c) to the database without the additional username and then issue the special command.

    ALTER DATABASE END RESTORE [RECOVERY FROM redolog_seq];

This ends the restore mode. The additional RECOVERY FROM is put in by pg_dump for logged databases only; it reads out this information after END BACKUP. If not given, the database is switched into normal operation mode without logging. But if given, the sequence number is stored in pg_database and the database is put into recover mode. In that mode, only RECOVER commands can be issued.

    RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};

The database must be in recover mode. If RESET is given, the mode is switched to ASYNC logging, the lastbackup field is set to 'epoch' and redoseq1 is set to 0. It resets the database to the state at the backup snapshot time. For the others, the backend starts the recovery program, which reads the redolog files, establishes database connections as required and reruns all the commands in them. If a required logfile isn't found, it tells the backend and waits for the reply. The backend tells the user what happened on error (redolog file with seq n required but not found ...).
So the user can put back the required redolog files and let recovery resume (actually the user pgsql or root must put them back :-). If the recovery is interrupted (the controlling backend terminates), the database is set into error mode and only a RECOVER DATABASE RESET will help. If the recovery finally succeeds, the same as for RESET happens: the database is online in async logmode. Since the "destroydb" is also remembered in the redolog, recovery will stop at the latest if it hits that for the database actually being recovered. This is to prevent faulty recovery, which could occur if someone destroys one database, creates a new one with the same name but different contents that is logged, destroys it again and then wants to restore and recover the first.

    RECOVER DATABASE CONTINUE;

After being told to restore some more redolog files, this command lets the recovery resume.

    RECOVER DATABASE AS USER 'uname';

A special command used in restore and recover mode only. This is restricted to superusers with the usecatupd right (not the db owner) and modifies the current username in the backend. It's ugly, but the problem is that ordinary users should not be able to use the database while it is in restore or recover mode, so the connection cannot be established like with '\c - username'. For the restore and recover this means that a real superuser with unrestricted access is needed to restore a database that was dumped with ACL info. But otherwise someone with createdb but not superuser rights could put a CREATE USER into a file, create a new database and "restore" that as user pgsql. I'm sure we don't want this.

Whew - hopefully I didn't forget anything. All that might look very complicated, but the only commands someone would really use manually are ALTER DATABASE REDOMODE ...; and RECOVER DATABASE ...; Anything else is used by pg_dump and the recovery program.
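As a reading aid (not part of the proposal), the redomode field and the legal transitions implied by the commands above can be sketched in a few lines of Python. The mode names and the transition table are my paraphrase of the rules described here, not actual PostgreSQL code:

```python
# Toy sketch of the proposed per-database redo modes and the state
# transitions the ALTER DATABASE / RECOVER DATABASE commands imply.
REDOMODES = {0: "none", 1: "async", 2: "sync", 4: "restore", 5: "recover", 6: "error"}

# command -> (modes it is allowed in, resulting mode)
TRANSITIONS = {
    "REDOMODE NONE":             ({"none", "async", "sync"}, "none"),
    "REDOMODE ASYNCHRONOUS":     ({"none", "async", "sync"}, "async"),
    "REDOMODE SYNCHRONOUS":      ({"none", "async", "sync"}, "sync"),
    "END RESTORE":               ({"restore"}, "none"),
    "END RESTORE RECOVERY FROM": ({"restore"}, "recover"),
    "RECOVER RESET":             ({"recover", "error"}, "async"),
    # successful recovery ends like RESET: online in async logmode
    "RECOVER ALL":               ({"recover"}, "async"),
}

def apply(mode, command):
    """Return the new mode, or raise if the command is illegal in this mode."""
    allowed, target = TRANSITIONS[command]
    if mode not in allowed:
        raise ValueError(f"command {command!r} not allowed in mode {mode!r}")
    return target
```

For example, a freshly restored logged database goes restore -> recover via END RESTORE RECOVERY FROM, and only RECOVER commands can move it out of recover (or error) mode again.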
What I'm trying to implement with it is a behaviour that makes it possible to backup, restore and recover a database on a production system without running a closed shop. Doing so for one database will not affect the others in the instance, since no modification of the hba conf or anything else will be required. Only the database actually being restored is closed for normal usage.

Hmmm - I just had the next idea right now: one more field in pg_database could tell that the db is shut down or restricted. Someone could disable a single database or restrict usage to superusers/dbowner for some time, make the database readonly etc.

Anyway - does anybody see problems with the above? Do we need more functionality? Oh yeah - another utility that follows the log and replicates databases onto other systems on the fly. But let me get this all running first please :-).

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
Jan Wieck wrote:
>     ALTER DATABASE BEGIN BACKUP;
>
> Issued by pg_dump before doing anything else.
>
> The command stops ALL other activity in the database, so
> pg_dump has time to pull out at least the information
> about sequences (actually it does this while getting
> tables, might require some changes there so the database
> gets back accessible soon).
>
>     ALTER DATABASE ONLINE BACKUP;
>
> Issued by pg_dump when it finished the things that
> require total exclusive database access.
>
> At this time, a logfile switch is done (only if the
> actual database is really logged) and the sequence number
> of the new logfile plus the current datetime remembered.
> The behaviour of pg_dump's backend changes. It will see a
> snapshot of this time (implemented in tqual code) in any
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that I'm implementing multi-version concurrency control (MVCC) for 6.5: pg_dump will have to run all its queries in one transaction in SERIALIZED mode to get a snapshot of the transaction's begin time...

> subsequent command and it is totally unable to do
> anything that would update the database.
>
> Until the final END BACKUP is given, no VACUUM or DROP
> TABLE etc. commands can be run. If they are issued, the
> command will be delayed until pg_dump finished.

Vacuum will not delete records in which any active backend is interested - don't worry.

...

> All that might look very complicated, but the only commands
  ^^^^^^^^^^^^^^^^

Yes -:) We could copy/move pg_dump's stuff into the backend... This way pg_dump would just execute one command:

    ALTER DATABASE ONLINE BACKUP;  -- as I understand

- the backend will do all that it needs and pg_dump will just write the backend's output to a file.

I think that it would be nice to have code in the backend to generate CREATE statements from the catalog and to extend EXPLAIN to handle something like EXPLAIN TABLE xxx etc. We could call EXPLAIN for all \dXXXX in psql and when dumping the schema in pg_dump.

Comments?

Vadim
Vadim wrote:
> Jan Wieck wrote:
> >
> > At this time, a logfile switch is done (only if the
> > actual database is really logged) and the sequence number
> > of the new logfile plus the current datetime remembered.
> > The behaviour of pg_dump's backend changes. It will see a
> > snapshot of this time (implemented in tqual code) in any
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
> Note that I'm implementing multi-version concurrency control
> (MVCC) for 6.5: pg_dump will have to run all its queries
> in one transaction in SERIALIZED mode to get a snapshot of
> the transaction's begin time...

Sounds good and would make things easier. I'll keep my hands off the tqual code and wait for that.

But what about sequence values while in SERIALIZED transaction mode? Sequences get overwritten in place! And for a dump/restore/recover it is important that the sequences get restored ALL at once, in the state they were in.

> > subsequent command and it is totally unable to do
> > anything that would update the database.
> >
> > Until the final END BACKUP is given, no VACUUM or DROP
> > TABLE etc. commands can be run. If they are issued, the
> > command will be delayed until pg_dump finished.
>
> Vacuum will not delete records in which any active
> backend is interested - don't worry.

That's the vacuum part, but I still need to delay DROP TABLE/VIEW/SEQUENCE until the backup is complete.

> > All that might look very complicated, but the only commands
>   ^^^^^^^^^^^^^^^^
> Yes -:)
> We could copy/move pg_dump's stuff into the backend...
> This way pg_dump would just execute one command:
>
>     ALTER DATABASE ONLINE BACKUP;  -- as I understand
>
> - the backend will do all that it needs and pg_dump will
> just write the backend's output to a file.
>
> I think that it would be nice to have code in the backend to
> generate CREATE statements from the catalog and to extend EXPLAIN
> to handle something like EXPLAIN TABLE xxx etc.
> We could call EXPLAIN for all \dXXXX in psql and
> when dumping the schema in pg_dump.
>
> Comments?

Indeed :-)

If we have a serialized transaction that covers sequences, only BEGIN and END BACKUP must remain: BEGIN to force the logfile switch and END to flag that the dump is complete and the backend can update pg_database.

So you want to put major parts of pg_dump's functionality into the backend. Hmmm - that would be cool. And it would give us a chance to include tests for most of the dump-related code in the regression suite.

Jan
Jan Wieck wrote:
> > > At this time, a logfile switch is done (only if the
> > > actual database is really logged) and the sequence number
> > > of the new logfile plus the current datetime remembered.
> > > The behaviour of pg_dump's backend changes. It will see a
> > > snapshot of this time (implemented in tqual code) in any
> >                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Note that I'm implementing multi-version concurrency control
> > (MVCC) for 6.5: pg_dump will have to run all its queries
> > in one transaction in SERIALIZED mode to get a snapshot of
> > the transaction's begin time...
>
> Sounds good and would make things easier. I'll keep my hands
> off the tqual code and wait for that.
>
> But what about sequence values while in SERIALIZED
> transaction mode? Sequences get overwritten in place! And for
> a dump/restore/recover it is important that the sequences
> get restored ALL at once, in the state they were in.

It's time to re-implement sequences! When they were implemented ~1.5 years ago there was no GRANT/REVOKE on VIEWs, so I had to create a table for each sequence.

There should be one system table - pg_sequence. One record for each sequence will be inserted into this table and one VIEW will be created:

    CREATE VIEW _seqname_ AS
        SELECT * FROM pg_sequence WHERE sequence_name = '_seqname_';

GRANT/REVOKE on a sequence's VIEW will control the rights to read the sequence using SELECT and the rights to change the sequence using nextval/setval. Having _one_ sequences table, it will be easy to lock all sequences at once and read all their values.

> > Vacuum will not delete records in which any active
> > backend is interested - don't worry.
>
> That's the vacuum part, but I still need to delay DROP
> TABLE/VIEW/SEQUENCE until the backup is complete.

Yes. And ALTER too.

Vadim
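Vadim's single-table idea can be illustrated with a toy model: if every sequence is a row in one shared structure, then "lock everything and read all current values at once" is a single operation, which is exactly what a consistent dump needs. The class and method names here are invented for illustration; this is not the backend implementation:

```python
import threading

class SequenceCatalog:
    """Toy model of a single pg_sequence-style table: one row per sequence."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}              # sequence_name -> last value handed out

    def create(self, name, start=1):
        with self._lock:
            self._rows[name] = start - 1

    def nextval(self, name):
        with self._lock:
            self._rows[name] += 1
            return self._rows[name]

    def snapshot(self):
        # With one shared table, a consistent view of ALL sequences is a
        # single lock acquisition - the property Vadim points out above.
        with self._lock:
            return dict(self._rows)
```

The contrast with one-table-per-sequence is that there a dump would have to visit and lock each sequence separately, with no moment at which all values are known to be mutually consistent.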
> It's time to re-implement sequences! When they were implemented
> ~1.5 years ago there was no GRANT/REVOKE on VIEWs, so
> I had to create a table for each sequence.
> There should be one system table - pg_sequence. One record
> for each sequence will be inserted into this table and
> one VIEW will be created:

I thought you wanted a single table to prevent concurrent access/update contention?

--
  Bruce Momjian                    |  http://www.op.net/~candle
  maillist@candle.pha.pa.us        |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
On Sat, 12 Dec 1998, Bruce Momjian wrote:
> > It's time to re-implement sequences! When they were implemented
> > ~1.5 years ago there was no GRANT/REVOKE on VIEWs, so
> > I had to create a table for each sequence.
> > There should be one system table - pg_sequence. One record
> > for each sequence will be inserted into this table and
> > one VIEW will be created:
>
> I thought you wanted a single table to prevent concurrent access/update
> contention?

let's revise what vadim stated *grin*

    "we should re-implement sequences for v6.5, since row-level
     locking will exist at that time"? *rofl*

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
> On Sat, 12 Dec 1998, Bruce Momjian wrote:
> > > It's time to re-implement sequences! When they were implemented
> > > ~1.5 years ago there was no GRANT/REVOKE on VIEWs, so
> > > I had to create a table for each sequence.
> > > There should be one system table - pg_sequence. One record
> > > for each sequence will be inserted into this table and
> > > one VIEW will be created:
> >
> > I thought you wanted a single table to prevent concurrent access/update
> > contention?
>
> let's revise what vadim stated *grin*
>
>     "we should re-implement sequences for v6.5, since row-level
>      locking will exist at that time"? *rofl*

Oh, now I understand. :-)

Vadim, don't be so modest. :-)

-- Bruce Momjian
Jan Wieck wrote:
>     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> ...
> For the others, the backend starts the recovery program
> which reads the redolog files, establishes database
> connections as required and reruns all the commands in
                                ^^^^^^^^^^^^^^^^^^^^^^^
> them. If a required logfile isn't found, it tells the
  ^^^^

I foresee problems with using _command_ logging for recovery/replication -:((

Let's consider two concurrent updates in READ COMMITTED mode:

    update test set x = 2 where y = 1;

and

    update test set x = 3 where y = 1;

The result of both committed transactions will be x = 2 if the 1st transaction updated the row _after_ the 2nd transaction, and x = 3 if the 2nd transaction got the row after the 1st one. The order of updates is not defined by the order in which the commands began, and so the order in which the commands should be rerun will be unknown...

Comments?

Vadim
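The hazard Vadim describes can be shown with a toy replay: two logged commands that both set x on the same row leave different final states depending on apply order, so a redolog ordered by command *start* time cannot reproduce the committed state. This is an illustration only, not PostgreSQL code:

```python
def replay(initial_row, logged_updates):
    """Toy replay of logged 'UPDATE test SET x = n WHERE y = 1' commands, in list order."""
    row = dict(initial_row)
    for new_x in logged_updates:
        if row["y"] == 1:            # the WHERE y = 1 predicate
            row["x"] = new_x
    return row

# Two concurrent READ COMMITTED updates: one sets x = 2, the other x = 3.
# Whichever command is replayed LAST wins, so replaying in the order the
# commands began can disagree with the order in which they actually
# reached the row under locking.
began_order_result  = replay({"x": 1, "y": 1}, [2, 3])  # x=2 began first
commit_order_result = replay({"x": 1, "y": 1}, [3, 2])  # but x=3 got the row first
```

Since `began_order_result` and `commit_order_result` differ, a replay has to be ordered by when the updates actually took effect, not by when the commands were issued.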
Vadim wrote:
> Jan Wieck wrote:
> >     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> > ...
> > For the others, the backend starts the recovery program
> > which reads the redolog files, establishes database
> > connections as required and reruns all the commands in
> > them. If a required logfile isn't found, it tells the
>
> I foresee problems with using _command_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
>     update test set x = 2 where y = 1;
>
> and
>
>     update test set x = 3 where y = 1;
>
> The result of both committed transactions will be x = 2
> if the 1st transaction updated the row _after_ the 2nd transaction,
> and x = 3 if the 2nd transaction got the row after the 1st one.
> The order of updates is not defined by the order in which the
> commands began, and so the order in which the commands should be
> rerun will be unknown...

Yepp, the order in which the commands began is of absolutely no interest. Locking could already delay the execution of one command until another one started later has finished and released the lock. It's a classic race condition.

Thus, my plan was to log the queries just before the call to CommitTransactionCommand() in tcop. This has the advantage that queries which bail out with errors don't get into the log at all and must not be rerun. And I can set a static flag to false before starting the command, which is set to true in the buffer manager when a buffer is written (marked dirty), so filtering out queries that do no updates at all is easy.

Unfortunately, query-level logging gets hit by the current implementation of sequence numbers. If a query that gets aborted somewhere in the middle (maybe by a trigger) called nextval() for rows processed earlier, the sequence number isn't advanced at recovery time, because the query is suppressed entirely.
And sequences aren't locked, so for concurrently running queries getting numbers from the same sequence, the results aren't reproducible. If some application selects a value resulting from a sequence and uses it later in another query, how could the redolog know that this has changed? It's a Const in the logged query, and all that corrupts the whole thing.

All of that is painful, and I don't see another solution yet than to hook into nextval(), log out the numbers generated in normal operation and hand back the same numbers in redo mode.

The whole thing gets more and more complicated :-(

Jan
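The workaround Jan sketches - record each value nextval() hands out during normal operation, and in redo mode serve the recorded values back instead of advancing the counter - might look roughly like this toy version (class name and structure invented for illustration; Vadim's follow-up notes the recorded values could simply be buffered in memory):

```python
from collections import deque

class LoggedSequence:
    """Toy nextval() that logs generated values and replays them in redo mode."""

    def __init__(self):
        self._value = 0
        self._redo_log = deque()     # stands in for the on-disk redolog
        self.redo_mode = False

    def nextval(self):
        if self.redo_mode:
            # Redo: hand back the value recorded during normal operation,
            # so a rerun query sees exactly the number it originally saw,
            # regardless of how concurrent queries interleaved.
            return self._redo_log.popleft()
        self._value += 1
        self._redo_log.append(self._value)
        return self._value
```

The point is that the replayed stream is byte-for-byte the original one, so even a value that was copied into a later query as a Const stays consistent.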
Jan Wieck wrote:
> > I foresee problems with using _command_ logging for
> > recovery/replication -:((
> ...
> Yepp, the order in which the commands began is of absolutely no
> interest. Locking could already delay the execution of one
> command until another one started later has finished and
> released the lock. It's a classic race condition.
>
> Thus, my plan was to log the queries just before the call to
> CommitTransactionCommand() in tcop. This has the advantage,

Oh, I see - you're right!

> Unfortunately, query-level logging gets hit by the current
> implementation of sequence numbers.
> ...
> All of that is painful, and I don't see another solution yet than
> to hook into nextval(), log out the numbers generated in

Not so bad - these numbers can be buffered in memory...

> normal operation and hand back the same numbers in redo
> mode.
>
> The whole thing gets more and more complicated :-(

As usual -:))

Vadim
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org] On Behalf Of Vadim Mikheev
> Sent: Thursday, December 17, 1998 5:03 PM
> To: Jan Wieck
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] redolog - for discussion
>
> Jan Wieck wrote:
> > Yepp, the order in which the commands began is of absolutely no
> > interest. Locking could already delay the execution of one
> > command until another one started later has finished and
> > released the lock. It's a classic race condition.
> >
> > Thus, my plan was to log the queries just before the call to
> > CommitTransactionCommand() in tcop. This has the advantage,
>
> Oh, I see - you're right!

If image-level logging is used, it's probably OK. But what if query (command) level logging is used? If the isolation level of all transactions is SERIALIZABLE, it's probably OK, because they are serializable in the order in which they were committed. But if there are transactions whose isolation level is READ COMMITTED, they are not serializable. So must the commands be issued according to the original order in which they were issued? If the same locking mechanism is used at recovery time, will the order of locks caused by the rerun commands be the same????? I'm not confident.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp