Discussion: warm standby server stops doing checkpoints after a while
Hello list,

I have a problem running a warm standby server as described in the PostgreSQL 8.2 documentation. I set up two servers, both running PostgreSQL 8.2.3-1PGDG on Fedora Core 6 (x86_64); the master is driven by an AMD Opteron, the slave runs on an Intel Xeon. The master copies its WAL logs to the slave using rsync over SSH, and the slave replays them with a little Perl script that does not really do much more than the pseudo code in the documentation. To this point everything works just fine.

To reduce the space consumption of the log archive, my restore script checks pg_controldata whenever it enters a wait cycle and deletes all log files that are older than the last checkpoint segment. (Checkpointing happens about every 5 to 10 minutes.)

The problem is that the slave server stops checkpointing after some hours of work (about 24 to 48 hours of continued log replay). After that point the slave does not do checkpoints anymore, but it still replays every log it gets from the master. The master works on as expected, checkpointing every 5 to 10 minutes. I raised the log level to debug5, but there's nothing interesting in the logs: just log replay, and absolutely no sign of anything related to checkpoints. A restart does not help; only starting over with a new base backup brings the slave back to working as expected. Both servers are configured identically except for the archiving/recovery settings.

Does somebody have a clue what could cause this behavior of the slave server? Could upgrading to 8.2.4 help? I didn't find anything related in the release notes. Could this be a bug?

Greetings,
Frank Wittig
Frank Wittig <fw@weisshuhn.de> writes:
> The problem is that the slave server stops checkpointing after some
> hours of working (about 24 to 48 hours of continued log replay).

Hm ... look at RecoveryRestartPoint() in xlog.c. Could there be something wrong with this logic?

	/*
	 * Do nothing if the elapsed time since the last restartpoint is less than
	 * half of checkpoint_timeout.  (We use a value less than
	 * checkpoint_timeout so that variations in the timing of checkpoints on
	 * the master, or speed of transmission of WAL segments to a slave, won't
	 * make the slave skip a restartpoint once it's synced with the master.)
	 * Checking true elapsed time keeps us from doing restartpoints too often
	 * while rapidly scanning large amounts of WAL.
	 */
	elapsed_secs = time(NULL) - ControlFile->time;
	if (elapsed_secs < CheckPointTimeout / 2)
		return;

The idea is that the slave (once in sync with the master) ought to checkpoint every time it sees a checkpoint record in the master's output. I'm not seeing a flaw, but maybe there is one here, or somewhere nearby. Are you sure the master is checkpointing?

			regards, tom lane
Tom Lane schrieb:
> Are you sure the master is checkpointing?

Yes. I double checked using pg_controldata on the master. On both servers checkpoint_segments is set to 16 and checkpoint_timeout is at its 300 second default. There were two messages in the master's logs that checkpointing happened too often during bulk imports last night. It's possible that the problem occurs during periods of high database activity, when checkpointing on the master happens very often. It would be OK if the slave skipped checkpoints while it is falling behind, but I'd expect it to checkpoint again once it's back in sync.

I watched the servers for a while after it happened today and found that the slave had caught up and was waiting for new logs. But it didn't do any checkpointing. The last checkpoint was over 2 hours and about 500 log segments ago.

I upgraded the slave to 8.2.4-1PGDG today and will upgrade the master too as soon as I can get a maintenance window for it.

Greetings,
Frank Wittig
On Thu, 2007-05-31 at 10:23 -0400, Tom Lane wrote:
> Hm ... look at RecoveryRestartPoint() in xlog.c. Could there be
> something wrong with this logic?
> [...]
> The idea is that the slave (once in sync with the master) ought to
> checkpoint every time it sees a checkpoint record in the master's
> output. I'm not seeing a flaw but maybe there is one here, or somewhere
> nearby. Are you sure the master is checkpointing?

Hmmm. This can happen if a backend crashes while half-way through any set of changes that causes safe_restartpoint() to return false. Or it might be that one of the index AMs doesn't correctly clear its multi-WAL actions in some corner cases. Or it could be that the mdsync looping problem has been worse than we thought and checkpoints have been avoided completely for some time.

Frank,

This is repeatable, yes? Has anything crashed on your server? Are you using GIN or GiST indexes?

I'll look at putting some debug information in there that logs whether multi-WAL actions remain unresolved for any length of time. Continuing to think about this one....

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs schrieb:

> This is repeatable, yes?

Yes, it occurs every time I begin with a new base backup. And it seems to happen during recreation of tsearch2 vectors for large numbers of data sets.

> Has anything crashed on your server?

No. No crashes occurred during those times.

> Are you using GIN or GIST indexes?

I'm using GIN indexes on tsearch2 vectors over a very large number of data sets. (About 3.8 million data sets, of which about 30 to 50 thousand are recreated and indexed when the described behavior occurs.)

> I'll look at putting some debug information in there that logs whether
> multi-WAL actions remain unresolved for any length of time.

Extra debug info would be great. I tested myself, adding some debug output to the function Tom Lane mentioned, and found that after the server stopped checkpointing, every time the function is called it exits at this point:

	/*
	 * Is it safe to checkpoint?  We must ask each of the resource managers
	 * whether they have any partial state information that might prevent a
	 * correct restart from this point.  If so, we skip this opportunity, but
	 * return at the next checkpoint record for another try.
	 */
	for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
	{
		if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
			if (!(RmgrTable[rmid].rm_safe_restartpoint()))
				return;
	}

It exits every time with the same value for rmid.
Logs look like this (the quoted lines repeat):

<2007-06-01 13:10:28.936 CEST:%> DEBUG: 00000: executing restore command "/var/lib/pgsql/restore.pl /mnt/wal_archive/00000001000000C9000000C2 pg_xlog/RECOVERYXLOG"
<2007-06-01 13:10:28.936 CEST:%> LOCATION: RestoreArchivedFile, xlog.c:2474
<2007-06-01 13:11:29.055 CEST:%> LOG: 00000: restored log file "00000001000000C9000000C2" from archive
<2007-06-01 13:11:29.055 CEST:%> LOCATION: RestoreArchivedFile, xlog.c:2504
<2007-06-01 13:11:29.364 CEST:%> DEBUG: 00000: found Checkpoint in XLOG
<2007-06-01 13:11:29.364 CEST:%> CONTEXT: xlog redo checkpoint: redo C9/C20DE050; undo 0/0; tli 1; xid 0/36130541; oid 241990328; multi 8; offset 15; online
<2007-06-01 13:11:29.364 CEST:%> LOCATION: RecoveryRestartPoint, xlog.c:5739
<2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13) has partial state information
<2007-06-01 13:11:29.365 CEST:%> CONTEXT: xlog redo checkpoint: redo C9/C20DE050; undo 0/0; tli 1; xid 0/36130541; oid 241990328; multi 8; offset 15; online
<2007-06-01 13:11:29.365 CEST:%> LOCATION: RecoveryRestartPoint, xlog.c:5769

best regards,
Frank Wittig
On Fri, 2007-06-01 at 13:33 +0200, Frank Wittig wrote:
> It exits every time with the same value for rmid.
> [...]
> <2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13)
> has partial state information

To me, this points clearly to there being an improperly completed action in resource manager 13 (GIN). In summary, it appears that there may be an issue with the GIN code for WAL recovery, and this is affecting the warm standby.

I'll reply again soon with a patch to allow data of incomplete splits to be logged so we can see what's going on. After that we should continue to discuss this on -hackers.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
>> <2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13)
>> has partial state information
> To me, this points clearly to there being an improperly completed action
> in resource manager 13 (GIN).

Hmm. I found that gin_xlog_cleanup doesn't reset the incomplete_splits list. Is this a possible reason for the bug? The attached patch fixes it.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

*** ./src/backend/access/gin/ginxlog.c.orig	Fri Jun  1 16:47:47 2007
--- ./src/backend/access/gin/ginxlog.c	Fri Jun  1 16:53:47 2007
***************
*** 594,599 ****
--- 594,600 ----
  	MemoryContextSwitchTo(topCtx);
  	MemoryContextDelete(opCtx);
+ 	incomplete_splits = NIL;
  }
  
  bool
Teodor Sigaev schrieb:
> Hmm. I found that gin_xlog_cleanup doesn't reset the incomplete_splits
> list. Is this a possible reason for the bug?

Sounds reasonable to me. That would explain why the system never recovered from that state. I'll test your patch and report results to this list.

Greetings,
Frank Wittig
On Fri, 2007-06-01 at 16:58 +0400, Teodor Sigaev wrote:
> Hmm. I found that gin_xlog_cleanup doesn't reset the incomplete_splits
> list. Is this a possible reason for the bug?

Hi Teodor,

Hmm, well, the list should be empty by that point anyway. That code is only executed at the end of xlog replay, not half-way through as we are seeing here. There are two possibilities:

1. There are some incomplete splits, pointing to a likely bug in GIN.

2. There are so many index splits that we aren't able to make a successful restartpoint using the current mechanism. Not a bug, but it would be an issue with how restartpoints interact with GIN (and possibly other index types too).

When we wrote this I thought (2) would be a problem, but it's not shown up to be so for btrees (yet, I guess). I have some ideas if it's (2).

The attached patch should show which of these it is. I'll dress it up a little better so we have a debug option on this. Please note I've not tested this patch myself, so Frank, if you don't mind me splatting something at you, we'll see what we see.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Hi Simon,

> The attached patch should show which of these it is. I'll dress it up a
> little better so we have a debug option on this. Please note I've not
> tested this patch myself, so Frank if you don't mind me splatting
> something at you we'll see what we see.

I'll test that. I have an idea on how to trigger the problem on purpose, so I don't have to wait until it occurs on its own. Please feel free to send me any test you want to run. I really need that WAL standby, so I will make every effort I can to help get it stable.

Greetings,
Frank Wittig
Hi Simon,

> The attached patch should show which of these it is.

As I thought, I can trigger the issue by letting the database recreate and reindex some ten thousand data sets. Here is what happened in the log (I cleaned it to the lines your patch provides):

<2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete splits=1
<2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:29:13.912 CEST:%> LOG: GIN incomplete splits=1
<2007-06-01 16:29:13.912 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:29:51.594 CEST:%> LOG: GIN incomplete splits=1
<2007-06-01 16:29:51.594 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:30:19.351 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:30:19.351 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:30:19.351 CEST:%> LOG: GIN incomplete split root:4269 l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:31:41.991 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:31:41.991 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:31:41.991 CEST:%> LOG: GIN incomplete split root:4269 l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:34:09.217 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:34:09.218 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:34:09.218 CEST:%> LOG: GIN incomplete split root:4269 l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:35:31.933 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:35:31.933 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:35:31.933 CEST:%> LOG: GIN incomplete split root:4269 l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:36:56.391 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:36:56.391 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:36:56.392 CEST:%> LOG: GIN incomplete split root:4269 l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:11118 l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:4269 l:102123 r:111741 at redo CA/EEAD8B80

It didn't do us the favor of producing more incomplete splits. But these two are enough to keep the standby server from doing checkpoints.

Greetings,
Frank Wittig
> <2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete split root:11118
> l:45303 r:111740 at redo CA/C8243C28
...
> <2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:11118
> l:45303 r:111740 at redo CA/C8243C28

Looks like a bug in GIN. I'll play with it. Can you provide more details about your test?

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
I am following this thread with interest, so as a request: if the discussion is kept in this thread, it is easier for me to follow. Just recently I put a WAL standby into production using 8.2.3 and would like to know under what circumstances I will need to upgrade. Hence the interest.

Regards
Dhaval
On Fri, 2007-06-01 at 19:11 +0400, Teodor Sigaev wrote:
> Looks like a bug in GIN. I'll play with it. Can you provide more details
> about your test?

Agreed, sorry about that.

I note that forgetIncompleteSplit() doesn't report anything if it fails to find an incomplete split when it looks for it. ISTM that either we have:

1) sometimes a starting split, but no ending split, or

2) a starting split, but we try to remove the wrong split later for some reason, i.e. we make the change correctly but misjudge the xlog somehow.

(1) would show up on data retrieval tests, but (2) would show up only in recovery, so I would think it is (2).

I'd suggest we throw an error, as shown in the enclosed patch. Frank, can you give that a whirl to provide Teodor with something more to work with? Thanks.

Neither GiST nor B-tree throws an error in the corresponding locations either, so the potential for not being able to track this is high. I'd want to throw errors in those locations too.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
> I'd suggest we throw an error, as shown in the enclosed patch. Frank,
> can you give that a whirl to provide Teodor with something more to work
> with? Thanks.

I've already made a test suite which reproduces the problem - it leaves incomplete splits. But I discovered one more problem: a deadlock on a buffer lock. So right now I am investigating that.

> Neither GiST nor B-tree throws an error in the corresponding locations
> either, so the potential for not being able to track this is high. I'd
> want to throw errors in those locations too.

Agreed, I'll add more checks.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, 2007-06-01 at 09:22 -0700, Dhaval Shah wrote:
> Just recently I have put the WAL standby in production using 8.2.3 and
> would like to know the circumstances where I will need to upgrade.
> Hence the interest.

OK, in summary: we have a rare bug in WAL recovery for GIN indexes that has been noticed during warm standby recovery. Currently this affects the ability to perform restartable recovery, which would cause some administrative difficulty if the *standby* server fails. We're working on a fix.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2007-06-01 at 20:49 +0400, Teodor Sigaev wrote:
> I've already made a test suite which reproduces the problem - it leaves
> incomplete splits. But I discovered one more problem: a deadlock on a
> buffer lock. So right now I am investigating that.

OK, I'll leave the code to you from here....

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Found the reason: if the parent page is fully backed up (as a full-page image) after the child's split, then forgetIncompleteSplit() isn't called at all. I hope the attached patch fixes that. Please test it.

PS: I'm going away for the weekend, so I'll not be online until Monday.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, 2007-06-01 at 22:42 +0400, Teodor Sigaev wrote:
> Found the reason: if the parent page is fully backed up after the
> child's split, then forgetIncompleteSplit() isn't called at all.

i.e. full_page_writes = on

> I hope the attached patch fixes that. Please test it.
>
> PS: I'm going away for the weekend, so I'll not be online until Monday.

OK, I'll check the other AMs to see if we have similar problems. Have a good weekend.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Hi Teodor,

Teodor Sigaev schrieb:
> I hope the attached patch fixes that. Please test it.

The patch is running. I'll keep reporting. Have a nice weekend.

Greetings,
Frank Wittig
Teodor Sigaev schrieb:
> I hope the attached patch fixes that. Please test it.

It still happens. The log is full of incomplete split dumps (each LOG line was followed by the same CONTEXT line; I trimmed the repeats for brevity):

<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete splits=8
<2007-06-01 23:00:00.001 CEST:%> CONTEXT: xlog redo checkpoint: redo D0/28020F48; undo 0/0; tli 1; xid 0/36761651; oid 243915448; multi 8; offset 15; online
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:696 l:93996 r:111778 at redo CF/4B669DF0
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:562 l:4491 r:111780 at redo CF/9FEF39F0
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:200 l:63015 r:111781 at redo CF/B45AC9B0
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:318 l:322 r:111782 at redo CF/B9939B58
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:374 l:1912 r:111783 at redo CF/B9B6CC00
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:1552 l:1555 r:111784 at redo CF/F1C6D770
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:221 l:222 r:111785 at redo D0/ED2F6F0
<2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete split root:1147 l:111771 r:111786 at redo D0/1DDE64C8
<2007-06-01 23:00:00.814 CEST:%> LOG: restored log file "00000001000000D000000029" from archive
<2007-06-01 23:00:03.325 CEST:%> LOG: restored log file "00000001000000D00000002A" from archive
<2007-06-01 23:00:06.145 CEST:%> LOG: restored log file "00000001000000D00000002B" from archive
<2007-06-01 23:00:10.996 CEST:%> LOG: restored log file "00000001000000D00000002C" from archive
<2007-06-01 23:00:14.283 CEST:%> LOG: restored log file "00000001000000D00000002D" from archive
<2007-06-01 23:00:17.086 CEST:%> LOG: restored log file "00000001000000D00000002E" from archive
<2007-06-01 23:01:19.587 CEST:%> LOG: restored log file "00000001000000D00000002F" from archive
<2007-06-01 23:01:22.700 CEST:%> LOG: restored log file "00000001000000D000000030" from archive
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete splits=9
<2007-06-01 23:01:24.038 CEST:%> CONTEXT: xlog redo checkpoint: redo D0/30018F68; undo 0/0; tli 1; xid 0/36763650; oid 243915448; multi 8; offset 15; online
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:696 l:93996 r:111778 at redo CF/4B669DF0
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:562 l:4491 r:111780 at redo CF/9FEF39F0
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:200 l:63015 r:111781 at redo CF/B45AC9B0
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:318 l:322 r:111782 at redo CF/B9939B58
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:374 l:1912 r:111783 at redo CF/B9B6CC00
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:1552 l:1555 r:111784 at redo CF/F1C6D770
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:221 l:222 r:111785 at redo D0/ED2F6F0
<2007-06-01 23:01:24.038 CEST:%> LOG: GIN incomplete split root:1147 l:111771 r:111786 at redo D0/1DDE64C8
<2007-06-01 23:01:24.039 CEST:%> LOG: GIN incomplete split root:34 l:1704 r:111787 at redo D0/289FC958
On Fri, 2007-06-01 at 23:14 +0200, Frank Wittig wrote:
> It still happens. The log is full of incomplete split dumps:
>
> <2007-06-01 23:00:00.001 CEST:%> LOG: GIN incomplete splits=8
> [...]
> <2007-06-01 23:01:24.039 CEST:%> LOG: GIN incomplete split root:34
> l:1704 r:111787 at redo D0/289FC958

OK, here's what I see:

1. After a certain point, consecutive GIN index splits cause a problem. The new RHS block numbers are consecutive from 111780 onwards.

2. The incomplete splits stay around indefinitely after creation, and we aren't trying to remove the wrong split at any point. We're either never creating an xlog record, or we are ignoring it in recovery, or we are somehow making multiple entries and then not removing all of them.

3. The root seems to move, which isn't what I personally was expecting to see. It seems root refers to the highest parent involved in the split.

4. We're writing lots of redo in between failed page splits, so *almost* everything is working correctly.

5. This starts to happen when we have very large indexes. This may be coincidental, but the first relation file is fairly full (900+ MB).

I'll do a patch to identify whether these are entry or data parts of the GIN index, and whether we are ignoring records somehow when they hit gin_redo. Perhaps we can have identified the exact issue by the time Teodor returns.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
> <2007-06-01 23:00:00.001 CEST:%> LOG:  GIN incomplete splits=8

Just to be sure: the patch fixes the *creation* of the WAL log, not its replay. So the primary DB should be patched too.

During the weekend I found a possible deadlock in GIN's locking protocol between concurrent UPDATE and VACUUM queries involving the same GIN index. Strangely, I didn't see it in 8.2, and even now I can't reproduce it there. It's easy to reproduce only on HEAD, with the recently added ReadBufferWithStrategy() call instead of ReadBuffer(). ReadBufferWithStrategy() was added to implement a limited-size "ring" of buffers for VACUUM. Nevertheless, it's a possible scenario in 8.2. The attached patch fixes that deadlock bug too.

Also, the previous version of my patch had a mistake which is observable on a CREATE INDEX ... USING GIN query.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Attachments
> 1. After a certain point, consecutive GIN index splits cause a problem.
> The new RHS block numbers are consecutive from 111780+

That's the newly created page. The page being split might have any block number.

> 2. The incomplete splits stay around indefinitely after creation and we
> aren't trying to remove the wrong split at any point. We're either never
> creating an xlog record, or we are ignoring it in recovery, or we are
> somehow making multiple entries then not removing all of them.

Agreed.

> 3. The root seems to move, which isn't what I personally was expecting
> to see. It seems root refers to the highest parent involved in the
> split.

"root" in this context means the parent of the page being split. Actually, there are a lot of B-trees inside GIN; see http://www.sigaev.ru/gin/GinStructure.pdf

> 4. We're writing lots of redo in between failed page splits. So *almost*
> everything is working correctly.
>
> 5. This starts to happen when we have very large indexes. This may be
> coincidental but the first relation file is fairly full (900+ MB).

Yes. It seems to me that the conditions for the error are very rare: the B-tree over ItemPointers (the second level of GIN) has a big capacity, 1000+ items per page, so splits occur rather rarely.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Ooops, the patch doesn't apply cleanly. New version attached.

> Attached patch fixes that deadlock bug too. And, previous version of my
> patch has a mistake which is observable on CREATE INDEX .. USING GIN query.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Attachments
After observing massive reindexing of some hundred thousand data sets, it seems to me that the slave no longer skips checkpoints, apart from those legitimately skipped by the CheckPointTimeout logic. I'll keep an eye on it and report back any news on the issue.

Thank you for the good work!

Regards,
Frank Wittig

Teodor Sigaev schrieb:
> Ooops. Patch doesn't apply cleanly. New version.
>
>> Attached patch fixes that deadlock bug too. And, previous version of
>> my patch has a mistake which is observable on CREATE INDEX .. USING
>> GIN query.
Attachments
> After some observation of massive reindexing of some hundred thousand
> data sets it seems to me that the slave doesn't skip checkpoints
> anymore. (Apart from those skipped because of the CheckpointTimeout thing)
> I'll keep an eye on it and report back any news on the issue.

Nice, committed. Thanks for your report and testing.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/