Load Distributed Checkpoints, final patch

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Load Distributed Checkpoints, final patch
Дата
Msg-id 46815FAB.7030205@enterprisedb.com
обсуждение исходный текст
Ответы Re: Load Distributed Checkpoints, final patch  (Michael Glaesemann <grzm@seespotcode.net>)
Re: Load Distributed Checkpoints, final patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Load Distributed Checkpoints, final patch  (Greg Smith <gsmith@gregsmith.com>)
Re: Load Distributed Checkpoints, final patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-patches
Here's latest revision of Itagaki-sans Load Distributed Checkpoints patch:

* bgwriter all-scan is gone. We might or might not improve the LRU-sweep
later so that it can perform any duties the all-sweep might have had
besides reducing the impact of a checkpoint.

* one new GUC variable, called checkpoint_completion_target. Default is
0.5, which should be more than enough to smooth checkpoints on a system
that's not already overloaded. It's also not too large to hurt recovery
times too much on a system that's not already struggling to hit its
recovery time requirements. You can set it to 0 if you want the old
checkpoint behavior for some reason. Maximum is 0.9, to leave some
headroom for fsync and any other things that need to happen during a
checkpoint.

* The minimum rate we write at during a checkpoint is 1 page /
bgwriter_delay.

* Added a paragraph to user manual to describe the feature. Also updated
the formula for expected number of WAL segments, new formula is (2 +
checkpoint_completion_target) * checkpoint_segments + 1. I believe the
comments in xlog.c regarding XLOGfileslop are still valid.

* The signaling in bgwriter.c is based on a spinlock. Tom advised to not
use the spinlock when not strictly necessary, but IMHO it's easier to
understand this way. Feel free to revert that when committing if you
disagree.

* The algorithm for estimating progress wrt. checkpoint_segments is the
same as before. Bursty WAL activity will lead to bursty checkpoint
activity, but I wanted to keep it simple for now. In any case, the I/O
rate will be smoother than without the patch.

* There's some DEBUG elogs which we might want to replace with better
ones later, per the patch in the patch queue by Greg Smith. The ones
that are there now are useful for testing this feature, but are a bit
crude for DBAs to use.

Barring any objections from committer, I'm finished with this patch.

I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
Smith's suggestion just to see what happens, but I doubt that will
change my mind on the above decisions.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.127
diff -c -r1.127 config.sgml
*** doc/src/sgml/config.sgml    19 Jun 2007 20:13:21 -0000    1.127
--- doc/src/sgml/config.sgml    26 Jun 2007 16:31:28 -0000
***************
*** 1565,1570 ****
--- 1565,1586 ----
        </listitem>
       </varlistentry>

+      <varlistentry id="guc-checkpoint-completion-target" xreflabel="checkpoint_completion_target">
+       <term><varname>checkpoint_completion_target</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_completion_target</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the target length of checkpoints, as a fraction of
+         the checkpoint interval. The default is 0.5.
+
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml    31 Jan 2007 20:56:19 -0000    1.43
--- doc/src/sgml/wal.sgml    26 Jun 2007 16:53:17 -0000
***************
*** 217,224 ****
    </para>

    <para>
     There will be at least one WAL segment file, and will normally
!    not be more than 2 * <varname>checkpoint_segments</varname> + 1
     files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
--- 217,242 ----
    </para>

    <para>
+    To avoid flooding the OS and I/O subsystem with sudden burst of page writes,
+    writing dirty buffers during a checkpoint is spread over a period of time.
+    That period is controlled <varname>checkpoint_completion_target</varname>.
+    It's given as a fraction of the checkpoint interval, as defined by
+    <varname>checkpoint_timeout</varname> and
+    <varname>checkpoint_segments</varname>. The WAL segment consumption
+    and elapsed time is monitored and the I/O rate is adjusted during
+    checkpoint so that it's finished when the given fraction of elapsed time
+    or WAL segments has passed, whichever is sooner. For example, with the
+    default of 0.5, checkpoint writes are paced so that the checkpoint finishes
+    after 50% of checkpoint_timeout or checkpoint_segments has elapsed. On
+    system that's very close to maximum I/O throughput during normal operation,
+    you might want to increase <varname>checkpoint_completion_target</varname>,
+    though prolonging checkpoints affects recovery time, because more WAL need
+    to be kept around and replayed in recovery.
+   </para>
+
+   <para>
     There will be at least one WAL segment file, and will normally
!    not be more than (2 + checkpoint_completion_target) * <varname>checkpoint_segments</varname> + 1
     files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c    31 May 2007 15:13:01 -0000    1.272
--- src/backend/access/transam/xlog.c    26 Jun 2007 17:51:09 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestCheckpoint(false, true);
                      }
                  }
              }
--- 1608,1614 ----
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestXLogFillCheckpoint();
                      }
                  }
              }
***************
*** 5110,5116 ****
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true);

          /*
           * Close down recovery environment
--- 5110,5116 ----
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true, true);

          /*
           * Close down recovery environment
***************
*** 5319,5324 ****
--- 5319,5347 ----
  }

  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  *
+  * NOTE: The value *actually* returned is the position of the last full
+  * xlog page. It lags behind the real insert position by at most 1 page.
+  * For that, we don't need to acquire WALInsertLock which can be quite
+  * heavily contended, and an approximation is enough for the current
+  * usage of this function.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+     /* use volatile pointer to prevent code rearrangement */
+     volatile XLogCtlData *xlogctl = XLogCtl;
+     XLogRecPtr recptr;
+
+     SpinLockAcquire(&xlogctl->info_lck);
+     recptr = xlogctl->LogwrtRqst.Write;
+     SpinLockRelease(&xlogctl->info_lck);
+
+     return recptr;
+ }
+
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
***************
*** 5383,5389 ****
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5406,5412 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5395,5405 ****
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
--- 5418,5431 ----
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
+  * If immediate is true, we try to finish the checkpoint as fast as we can,
+  * ignoring checkpoint_completion_target parameter.
+  *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool immediate, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
***************
*** 5591,5597 ****
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo);

      START_CRIT_SECTION();

--- 5617,5623 ----
       */
      END_CRIT_SECTION();

!     CheckPointGuts(checkPoint.redo, immediate);

      START_CRIT_SECTION();

***************
*** 5693,5708 ****
  /*
   * Flush all data in shared memory to disk, and fsync
   *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool();            /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
--- 5719,5737 ----
  /*
   * Flush all data in shared memory to disk, and fsync
   *
+  * If immediate is true, try to finish as quickly as possible, ignoring
+  * the GUC variables to throttle checkpoint I/O.
+  *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
      CheckPointCLOG();
      CheckPointSUBTRANS();
      CheckPointMultiXact();
!     FlushBufferPool(immediate);        /* performs all required fsyncs */
      /* We deliberately delay 2PC checkpointing as long as possible */
      CheckPointTwoPhase(checkPointRedo);
  }
***************
*** 5710,5716 ****
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckpoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
--- 5739,5745 ----
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
***************
*** 5751,5757 ****
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo);

      /*
       * Update pg_control so that any subsequent crash will restart from this
--- 5780,5786 ----
      /*
       * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, true);

      /*
       * Update pg_control so that any subsequent crash will restart from this
***************
*** 6177,6183 ****
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestCheckpoint(true, false);

          /*
           * Now we need to fetch the checkpoint record location, and also its
--- 6206,6212 ----
           * have different checkpoint positions and hence different history
           * file names, even if nothing happened in between.
           */
!         RequestLazyCheckpoint();

          /*
           * Now we need to fetch the checkpoint record location, and also its
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c    7 Mar 2007 13:35:02 -0000    1.233
--- src/backend/bootstrap/bootstrap.c    19 Jun 2007 15:29:51 -0000
***************
*** 489,495 ****

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true);

      /* Clean up and exit */
      cleanup();
--- 489,495 ----

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true, true);

      /* Clean up and exit */
      cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c    1 Jun 2007 19:38:07 -0000    1.195
--- src/backend/commands/dbcommands.c    20 Jun 2007 09:36:24 -0000
***************
*** 404,410 ****
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
       * up-to-date for the copy.  (We really only need to flush buffers for the
       * source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(true);

      /*
       * Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 507,513 ****
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestCheckpoint(true, false);

          /*
           * Close pg_database, but keep lock till commit (this is important to
--- 507,513 ----
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestImmediateCheckpoint();

          /*
           * Close pg_database, but keep lock till commit (this is important to
***************
*** 661,667 ****
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestCheckpoint(true, false);
  #endif

      /*
--- 661,667 ----
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestImmediateCheckpoint();
  #endif

      /*
***************
*** 1427,1433 ****
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync();

          /*
           * Copy this subdirectory to the new location
--- 1427,1433 ----
           * up-to-date for the copy.  (We really only need to flush buffers for
           * the source database, but bufmgr.c provides no API for that.)
           */
!         BufferSync(true);

          /*
           * Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c    27 May 2007 03:50:39 -0000    1.38
--- src/backend/postmaster/bgwriter.c    26 Jun 2007 18:14:47 -0000
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"

  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>

***************
*** 59,64 ****
--- 60,66 ----
  #include "storage/pmsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
+ #include "storage/spin.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 70,101 ****
   *
   * The ckpt counters allow backends to watch for completion of a checkpoint
   * request they send.  Here's how it works:
!  *    * At start of a checkpoint, bgwriter increments ckpt_started.
   *    * On completion of a checkpoint, bgwriter sets ckpt_done to
   *      equal ckpt_started.
!  *    * On failure of a checkpoint, bgwrite first increments ckpt_failed,
!  *      then sets ckpt_done to equal ckpt_started.
!  * All three fields are declared sig_atomic_t to ensure they can be read
!  * and written without explicit locking.  The algorithm for backends is:
!  *    1. Record current values of ckpt_failed and ckpt_started (in that
!  *       order!).
   *    2. Send signal to request checkpoint.
   *    3. Sleep until ckpt_started changes.  Now you know a checkpoint has
   *       begun since you started this algorithm (although *not* that it was
   *       specifically initiated by your signal).
   *    4. Record new value of ckpt_started.
!  *    5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
!  *       arithmetic here in case counters wrap around.)  Now you know a
!  *       checkpoint has started and completed, but not whether it was
!  *       successful.
   *    6. If ckpt_failed is different from the originally saved value,
   *       assume request failed; otherwise it was definitely successful.
   *
!  * An additional field is ckpt_time_warn; this is also sig_atomic_t for
!  * simplicity, but is only used as a boolean.  If a backend is requesting
!  * a checkpoint for which a checkpoints-too-close-together warning is
!  * reasonable, it should set this field TRUE just before sending the signal.
!  *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
   * fields are protected by BgWriterCommLock.
--- 72,103 ----
   *
   * The ckpt counters allow backends to watch for completion of a checkpoint
   * request they send.  Here's how it works:
!  *    * At start of a checkpoint, bgwriter reads the request flags
!  *      and increments ckpt_started, while holding ckpt_lck.
   *    * On completion of a checkpoint, bgwriter sets ckpt_done to
   *      equal ckpt_started.
!  *    * On failure of a checkpoint, bgwriter increments ckpt_failed,
!  *      and sets ckpt_done to equal ckpt_started.
!  *
!  * The algorithm for backends is:
!  *    1. Record current values of ckpt_failed and ckpt_started, and
!  *     set any request flags, while holding ckpt_lck.
   *    2. Send signal to request checkpoint.
   *    3. Sleep until ckpt_started changes.  Now you know a checkpoint has
   *       begun since you started this algorithm (although *not* that it was
   *       specifically initiated by your signal).
   *    4. Record new value of ckpt_started.
!  *    5. Sleep until ckpt_done >= saved value of ckpt_started.
!  *       Now you know a checkpoint has started and completed, but not whether
!  *       it was successful.
   *    6. If ckpt_failed is different from the originally saved value,
   *       assume request failed; otherwise it was definitely successful.
   *
!  * There's three request flags that indicate what kind of a checkpoint
!  * should be performed next. They are accumulative; if for example an
!  * immediate checkpoint is requested, another request for a non-immediate
!  * checkpoint won't clear the flag.
!  *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
   * fields are protected by BgWriterCommLock.
***************
*** 112,122 ****
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     sig_atomic_t ckpt_started;    /* advances when checkpoint starts */
!     sig_atomic_t ckpt_done;        /* advances when checkpoint done */
!     sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

!     sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
--- 114,128 ----
  {
      pid_t        bgwriter_pid;    /* PID of bgwriter (0 if not started) */

!     slock_t        ckpt_lck;        /* protects all the ckpt_* fields */

!     int            ckpt_started;    /* advances when checkpoint starts */
!     int            ckpt_done;        /* advances when checkpoint done */
!     int            ckpt_failed;    /* advances when checkpoint fails */
!
!     bool    ckpt_rqst_time_warn;    /* warn if too soon since last ckpt */
!     bool    ckpt_rqst_immediate;    /* an immediate ckpt has been requested */
!     bool    ckpt_rqst_force;        /* checkpoint even if no WAL activity */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
***************
*** 131,136 ****
--- 137,143 ----
  int            BgWriterDelay = 200;
  int            CheckPointTimeout = 300;
  int            CheckPointWarning = 30;
+ double        CheckPointCompletionTarget = 0.5;

  /*
   * Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,154 ****
--- 153,180 ----

  static bool ckpt_active = false;

+ /*
+  * Current time and WAL insert location when checkpoint was started.
+  * Used to track progress during the checkpoint and throttle I/O
+  * to hit checkpoint_completion_target.
+  */
+ static time_t        ckpt_start_time;
+ static XLogRecPtr    ckpt_start_recptr;
+
+ static double        ckpt_cached_elapsed;
+
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;

+ /* Prototypes for private functions */
+
+ static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force);
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(void);
+ static bool IsCheckpointOnSchedule(double progress);
+ static bool ImmediateCheckpointRequested(void);
+
+ /* Signal handlers */

  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
***************
*** 170,175 ****
--- 196,202 ----

      Assert(BgWriterShmem != NULL);
      BgWriterShmem->bgwriter_pid = MyProcPid;
+     SpinLockInit(&BgWriterShmem->ckpt_lck);
      am_bg_writer = true;

      /*
***************
*** 281,288 ****
--- 308,318 ----
              /* use volatile pointer to prevent code rearrangement */
              volatile BgWriterShmemStruct *bgs = BgWriterShmem;

+             SpinLockAcquire(&bgs->ckpt_lck);
              bgs->ckpt_failed++;
              bgs->ckpt_done = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
              ckpt_active = false;
          }

***************
*** 328,337 ****
      for (;;)
      {
          bool        do_checkpoint = false;
-         bool        force_checkpoint = false;
          time_t        now;
          int            elapsed_secs;
-         long        udelay;

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
--- 358,365 ----
***************
*** 354,360 ****
          {
              checkpoint_requested = false;
              do_checkpoint = true;
-             force_checkpoint = true;
              BgWriterStats.m_requested_checkpoints++;
          }
          if (shutdown_requested)
--- 382,387 ----
***************
*** 377,387 ****
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             if (!force_checkpoint)
!                 BgWriterStats.m_timed_checkpoints++;
          }

          /*
--- 404,413 ----
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (!do_checkpoint && elapsed_secs >= CheckPointTimeout)
          {
              do_checkpoint = true;
!             BgWriterStats.m_timed_checkpoints++;
          }

          /*
***************
*** 390,395 ****
--- 416,449 ----
           */
          if (do_checkpoint)
          {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+             bool time_warn;
+             bool immediate;
+             bool force;
+
+             /*
+              * Atomically check the request flags to figure out what
+              * kind of a checkpoint we should perform, and increase the
+              * started-counter to acknowledge that we've started
+              * a new checkpoint.
+              */
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+
+             time_warn = bgs->ckpt_rqst_time_warn;
+             bgs->ckpt_rqst_time_warn = false;
+
+             immediate = bgs->ckpt_rqst_immediate;
+             bgs->ckpt_rqst_immediate = false;
+
+             force = bgs->ckpt_rqst_force;
+             bgs->ckpt_rqst_force = false;
+
+             bgs->ckpt_started++;
+
+             SpinLockRelease(&bgs->ckpt_lck);
+
              /*
               * We will warn if (a) too soon since last checkpoint (whatever
               * caused it) and (b) somebody has set the ckpt_time_warn flag
***************
*** 397,417 ****
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (BgWriterShmem->ckpt_time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!             BgWriterShmem->ckpt_time_warn = false;

              /*
!              * Indicate checkpoint start to any waiting backends.
               */
              ckpt_active = true;
!             BgWriterShmem->ckpt_started++;

!             CreateCheckPoint(false, force_checkpoint);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 451,477 ----
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (time_warn &&
                  elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!

              /*
!              * Initialize bgwriter-private variables used during checkpoint.
               */
              ckpt_active = true;
!             ckpt_start_recptr = GetInsertRecPtr();
!             ckpt_start_time = now;
!             ckpt_cached_elapsed = 0;
!
!             elog(DEBUG1, "CHECKPOINT: start");
!
!             CreateCheckPoint(false, immediate, force);

!             elog(DEBUG1, "CHECKPOINT: end");

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 422,428 ****
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
              ckpt_active = false;

              /*
--- 482,491 ----
              /*
               * Indicate checkpoint completion to any waiting backends.
               */
!             SpinLockAcquire(&bgs->ckpt_lck);
!             bgs->ckpt_done = bgs->ckpt_started;
!             SpinLockRelease(&bgs->ckpt_lck);
!
              ckpt_active = false;

              /*
***************
*** 436,446 ****
              BgBufferSync();

          /*
!          * Check for archive_timeout, if so, switch xlog files.  First we do a
!          * quick check using possibly-stale local state.
           */
!         if (XLogArchiveTimeout > 0 &&
!             (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
--- 499,531 ----
              BgBufferSync();

          /*
!          * Check for archive_timeout and switch xlog files if necessary.
           */
!         CheckArchiveTimeout();
!
!         /* Nap for the configured time. */
!         BgWriterNap();
!     }
! }
!
! /*
!  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
!  *        if needed
!  */
! static void
! CheckArchiveTimeout(void)
! {
!     time_t        now;
!
!     if (XLogArchiveTimeout <= 0)
!         return;
!
!     now = time(NULL);
!
!     /* First we do a quick check using possibly-stale local state. */
!     if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
!         return;
!
          {
              /*
               * Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****

              last_xlog_switch_time = Max(last_xlog_switch_time, last_time);

-             /* if we did a checkpoint, 'now' might be stale too */
-             if (do_checkpoint)
-                 now = time(NULL);
-
              /* Now we can do the real check */
              if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
              {
--- 535,540 ----
***************
*** 478,483 ****
--- 559,573 ----
                  last_xlog_switch_time = now;
              }
          }
+ }
+
+ /*
+  * BgWriterNap -- Nap for the configured time or until a signal is received.
+  */
+ static void
+ BgWriterNap(void)
+ {
+     long        udelay;

          /*
           * Send off activity statistics to the stats collector
***************
*** 495,502 ****
           *
           * We absorb pending requests after each short sleep.
           */
!         if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
!             (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0))
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
--- 585,592 ----
           *
           * We absorb pending requests after each short sleep.
           */
!         if ((bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) ||
!             ckpt_active)
              udelay = BgWriterDelay * 1000L;
          else if (XLogArchiveTimeout > 0)
              udelay = 1000000L;    /* One second */
***************
*** 505,522 ****

          while (udelay > 999999L)
          {
!             if (got_SIGHUP || checkpoint_requested || shutdown_requested)
                  break;
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!         if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
              pg_usleep(udelay);
      }
  }


  /* --------------------------------
   *        signal handler routines
--- 595,774 ----

          while (udelay > 999999L)
          {
!             if (got_SIGHUP || shutdown_requested ||
!                 (!ckpt_active && checkpoint_requested) ||
!                 (ckpt_active && ImmediateCheckpointRequested()))
                  break;
              pg_usleep(1000000L);
              AbsorbFsyncRequests();
              udelay -= 1000000L;
          }

!
!         if (!(got_SIGHUP || shutdown_requested ||
!               (!ckpt_active && checkpoint_requested) ||
!               (ckpt_active && ImmediateCheckpointRequested())))
              pg_usleep(udelay);
+ }
+
+ /*
+  * Returns true if an immediate checkpoint request is pending.
+  */
+ static bool
+ ImmediateCheckpointRequested()
+ {
+     if (checkpoint_requested)
+     {
+         volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+         /*
+          * We don't need to acquire the ckpt_lck in this case because we're
+          * only looking at a single boolean field.
+          */
+         if (bgs->ckpt_rqst_immediate)
+             return true;
      }
+     return false;
  }

+ #define WRITES_PER_ABSORB 1000
+
+ /*
+  * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
+  *
+  * To give bgwriter a chance to perform its normal activities while we're
+  * performing a checkpoint, this function is called after each page written
+  * from the buffer cache.
+  *
+  * To throttle the checkpoint I/O rate, we also sleep periodically so that
+  * we hit checkpoint_completion_target.
+  *
+  * 'progress' is an estimate of how much of the writes has been done, as a
+  * fraction between 0.0 meaning none, and 1.0 meaning all done. If 'immediate'
+  * is true, this is an immediate checkpoint and we should not sleep.
+  */
+ void
+ CheckpointWriteDelay(bool immediate, double progress)
+ {
+     static int absorb_counter = WRITES_PER_ABSORB;
+
+     /*
+      * Take a nap and perform the usual bgwriter duties, unless we're behind
+      * schedule, in which case we just try to catch up as quickly as possible.
+      */
+     if (!immediate && am_bg_writer && !shutdown_requested &&
+         !ImmediateCheckpointRequested() &&
+         IsCheckpointOnSchedule(progress))
+     {
+         if (got_SIGHUP)
+         {
+             got_SIGHUP = false;
+             ProcessConfigFile(PGC_SIGHUP);
+         }
+         CheckArchiveTimeout();
+         BgBufferSync();
+         BgWriterNap();
+
+         AbsorbFsyncRequests();
+         absorb_counter = WRITES_PER_ABSORB;
+     }
+     else if (--absorb_counter <= 0)
+     {
+         /*
+          * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+          * operations when we don't sleep, to prevent overflow of the fsync
+          * request queue.
+          */
+         AbsorbFsyncRequests();
+         absorb_counter = WRITES_PER_ABSORB;
+     }
+ }
+
+ /*
+  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+  *         in time?
+  *
+  * Compares the current progress against the time/segments elapsed since last
+  * checkpoint, and returns true if the progress we've made this far is greater
+  * than the elapsed time/segments.
+  */
+ static bool
+ IsCheckpointOnSchedule(double progress)
+ {
+     struct timeval    now;
+     XLogRecPtr        recptr;
+     double            elapsed_time,
+                     elapsed_xlogs;
+
+     Assert(ckpt_active);
+
+     /* Scale progress according to checkpoint_completion_target. */
+     progress *= CheckPointCompletionTarget;
+
+     /*
+      * Check against the cached value first. Only do the more expensive
+      * calculations once we reach the target previously calculated. Since
+      * neither time or WAL insert pointer moves backwards, a freshly
+      * calculated value can only be greater than or equal to the cached value.
+      */
+     if (progress < ckpt_cached_elapsed)
+     {
+         elog(DEBUG3, "IsCheckpointOnSchedule: Still behind cached=%.3f, scaled progress=%.3f",
+              ckpt_cached_elapsed, progress);
+         return false;
+     }
+
+     /*
+      * Check progress against time elapsed and checkpoint_timeout.
+      */
+     gettimeofday(&now, NULL);
+
+     elapsed_time = ((double) (now.tv_sec - ckpt_start_time) +
+         now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+     if (progress < elapsed_time)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, scaled progress=%.3f",
+              elapsed_time, progress);
+
+         ckpt_cached_elapsed = elapsed_time;
+
+         return false;
+     }
+
+     /*
+      * Check progress against WAL segments written and checkpoint_segments.
+      *
+      * We compare the current WAL insert location against the location
+      * computed before calling CreateCheckPoint. The code in XLogInsert that
+      * actually triggers a checkpoint when checkpoint_segments is exceeded
+      * compares against RedoRecptr, so this is not completely accurate.
+      * However, it's good enough for our purposes, we're only calculating
+      * an estimate anyway.
+      */
+     recptr = GetInsertRecPtr();
+     elapsed_xlogs =
+         (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+         CheckPointSegments;
+
+     if (progress < elapsed_xlogs)
+     {
+         elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, scaled progress=%.3f",
+              elapsed_xlogs, progress);
+
+         ckpt_cached_elapsed = elapsed_xlogs;
+
+         return false;
+     }
+
+     /* It looks like we're on schedule. */
+
+     elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f, scaled progress=%.3f",
+          elapsed_time, elapsed_xlogs, progress);
+
+     return true;
+ }

  /* --------------------------------
   *        signal handler routines
***************
*** 618,626 ****
  }

  /*
!  * RequestCheckpoint
   *        Called in backend processes to request an immediate checkpoint
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
   * immediately.
--- 870,919 ----
  }

  /*
!  * RequestImmediateCheckpoint
   *        Called in backend processes to request an immediate checkpoint
   *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestImmediateCheckpoint()
+ {
+     RequestCheckpoint(true, false, true, true);
+ }
+
+ /*
+  * RequestImmediateCheckpoint
+  *        Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_completion_target GUC variable, and
+  * can therefore take a lot longer time.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint()
+ {
+     RequestCheckpoint(true, false, false, true);
+ }
+
+ /*
+  * RequestXLogFillCheckpoint
+  *        Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, return
+  * immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint()
+ {
+     RequestCheckpoint(false, true, false, false);
+ }
+
+ /*
+  * RequestCheckpoint
+  *        Common subroutine for all the above Request*Checkpoint variants.
+  *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
   * immediately.
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     sig_atomic_t old_failed = bgs->ckpt_failed;
!     sig_atomic_t old_started = bgs->ckpt_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         CreateCheckPoint(false, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
--- 921,950 ----
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
+  *
+  * If immediate is true, the checkpoint should be finished ASAP.
+  *
+  * If force is true, force a checkpoint even if no XLOG activity has occured
+  * since the last one.
   */
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!     int old_failed, old_started;

      /*
       * If in a standalone backend, just do it ourselves.
       */
      if (!IsPostmasterEnvironment)
      {
!         /*
!          * There's no point in doing lazy checkpoints in a standalone
!          * backend, because there's no other backends the checkpoint could
!          * disrupt.
!          */
!         CreateCheckPoint(false, true, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
***************
*** 653,661 ****
          return;
      }

!     /* Set warning request flag if appropriate */
      if (warnontime)
!         bgs->ckpt_time_warn = true;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
--- 955,982 ----
          return;
      }

!     /*
!      * Atomically set the request flags, and take a snapshot of the counters.
!      * This ensures that when we see that ckpt_started > old_started,
!      * we know the flags we set here have been seen by bgwriter.
!      *
!      * Note that we effectively OR the flags with any existing flags, to
!      * avoid overriding a "stronger" request by another backend.
!      */
!     SpinLockAcquire(&bgs->ckpt_lck);
!
!     old_failed = bgs->ckpt_failed;
!     old_started = bgs->ckpt_started;
!
!     /* Set request flags as appropriate */
      if (warnontime)
!         bgs->ckpt_rqst_time_warn = true;
!     if (immediate)
!         bgs->ckpt_rqst_immediate = true;
!     if (force)
!         bgs->ckpt_rqst_force = true;
!
!     SpinLockRelease(&bgs->ckpt_lck);

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
***************
*** 674,701 ****
       */
      if (waitforit)
      {
!         while (bgs->ckpt_started == old_started)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
-         old_started = bgs->ckpt_started;

          /*
!          * We are waiting for ckpt_done >= old_started, in a modulo sense.
!          * This is a little tricky since we don't know the width or signedness
!          * of sig_atomic_t.  We make the lowest common denominator assumption
!          * that it is only as wide as "char".  This means that this algorithm
!          * will cope correctly as long as we don't sleep for more than 127
!          * completed checkpoints.  (If we do, we will get another chance to
!          * exit after 128 more checkpoints...)
           */
!         while (((signed char) (bgs->ckpt_done - old_started)) < 0)
          {
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!         if (bgs->ckpt_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
--- 995,1036 ----
       */
      if (waitforit)
      {
!         int new_started, new_failed;
!
!         /* Wait for a new checkpoint to start. */
!         for(;;)
          {
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_started = bgs->ckpt_started;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if (new_started != old_started)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }

          /*
!          * We are waiting for ckpt_done >= new_started, in a modulo sense.
           */
!         for(;;)
          {
+             int new_done;
+
+             SpinLockAcquire(&bgs->ckpt_lck);
+             new_done = bgs->ckpt_done;
+             new_failed = bgs->ckpt_failed;
+             SpinLockRelease(&bgs->ckpt_lck);
+
+             if(new_done - new_started >= 0)
+                 break;
+
              CHECK_FOR_INTERRUPTS();
              pg_usleep(100000L);
          }
!
!         if (new_failed != old_failed)
              ereport(ERROR,
                      (errmsg("checkpoint request failed"),
                       errhint("Consult recent messages in the server log for details.")));
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.221
diff -c -r1.221 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    18 Jun 2007 00:47:20 -0000    1.221
--- src/backend/storage/buffer/bufmgr.c    26 Jun 2007 17:01:30 -0000
***************
*** 71,79 ****
  /* GUC variables */
  bool        zero_damaged_pages = false;
  double        bgwriter_lru_percent = 1.0;
- double        bgwriter_all_percent = 0.333;
  int            bgwriter_lru_maxpages = 5;
- int            bgwriter_all_maxpages = 5;


  long        NDirectFileRead;    /* some I/O's are direct file access. bypass
--- 71,77 ----
***************
*** 644,650 ****
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

--- 642,648 ----
       * at 1 so that the buffer can survive one clock-sweep pass.)
       */
      buf->tag = newTag;
!     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR);
      buf->flags |= BM_TAG_VALID;
      buf->usage_count = 1;

***************
*** 999,1042 ****
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
!     int            buf_id;
      int            num_to_scan;
!     int            absorb_counter;

      /*
       * Find out where to start the circular scan.
       */
!     buf_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers.
       */
      num_to_scan = NBuffers;
!     absorb_counter = WRITES_PER_ABSORB;
      while (num_to_scan-- > 0)
      {
!         if (SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;

              /*
!              * If in bgwriter, absorb pending fsync requests after each
!              * WRITES_PER_ABSORB write operations, to prevent overflow of the
!              * fsync request queue.  If not in bgwriter process, this is a
!              * no-op.
               */
!             if (--absorb_counter <= 0)
!             {
!                 AbsorbFsyncRequests();
!                 absorb_counter = WRITES_PER_ABSORB;
!             }
          }
          if (++buf_id >= NBuffers)
              buf_id = 0;
--- 997,1098 ----
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
+  * If 'immediate' is true, write them all ASAP, otherwise throttle the
+  * I/O rate according to checkpoint_write_rate GUC variable, and perform
+  * normal bgwriter duties periodically.
   */
  void
! BufferSync(bool immediate)
  {
!     int            buf_id, start_id;
      int            num_to_scan;
!     int            num_to_write;
!     int            num_written;

      /*
       * Find out where to start the circular scan.
       */
!     start_id = StrategySyncStart();

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * Loop over all buffers, and mark the ones that need to be written with
!      * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
!      * can estimate how much work needs to be done.
!      *
!      * This allows us to only write those pages that were dirty when the
!      * checkpoint began, and haven't been flushed to disk since. Whenever a
!      * page with BM_CHECKPOINT_NEEDED is written out, either by us later
!      * in this function, or by normal backends or the bgwriter LRU-scan, the
!      * flag is cleared. Any buffer dirtied after this point won't have the
!      * flag set.
       */
      num_to_scan = NBuffers;
!     num_to_write = 0;
!     buf_id = start_id;
      while (num_to_scan-- > 0)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!
!         /*
!          * Header spinlock is enough to examine BM_DIRTY, see comment in
!          * SyncOneBuffer.
!          */
!         LockBufHdr(bufHdr);
!
!         if (bufHdr->flags & BM_DIRTY)
!         {
!             bufHdr->flags |= BM_CHECKPOINT_NEEDED;
!             num_to_write++;
!         }
!
!         UnlockBufHdr(bufHdr);
!
!         if (++buf_id >= NBuffers)
!             buf_id = 0;
!     }
!
!     elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
!
!     /*
!      * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.
!      */
!     num_to_scan = NBuffers;
!     num_written = 0;
!     buf_id = start_id;
!     while (num_to_scan-- > 0 && num_written < num_to_write)
!     {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!         bool needs_flush;
!
!         /* We don't need to acquire the lock here, because we're
!          * only looking at a single bit. It's possible that someone
!          * else writes the buffer and clears the flag right after we
!          * check, but that doesn't matter. This assumes that no-one
!          * clears the flag and sets it again while holding info_lck,
!          * expecting no-one to see the intermediary state.
!          */
!         needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
!
!         if (needs_flush && SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;
+             num_written++;

              /*
!              * Perform normal bgwriter duties and sleep to throttle
!              * our I/O rate.
!              *
!              * Note that num_written doesn't include buffers written
!              * by other backends, or the bgwriter LRU-sweep. That means
!              * that the esimate of how much progress we've made is
!              * conservative.
               */
!             CheckpointWriteDelay(immediate,
!                                  (double) num_written / num_to_write);
          }
          if (++buf_id >= NBuffers)
              buf_id = 0;
***************
*** 1051,1058 ****
  void
  BgBufferSync(void)
  {
!     static int    buf_id1 = 0;
!     int            buf_id2;
      int            num_to_scan;
      int            num_written;

--- 1107,1113 ----
  void
  BgBufferSync(void)
  {
!     int            buf_id;
      int            num_to_scan;
      int            num_written;

***************
*** 1060,1104 ****
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * To minimize work at checkpoint time, we want to try to keep all the
!      * buffers clean; this motivates a scan that proceeds sequentially through
!      * all buffers.  But we are also charged with ensuring that buffers that
       * will be recycled soon are clean when needed; these buffers are the ones
!      * just ahead of the StrategySyncStart point.  We make a separate scan
!      * through those.
       */

      /*
-      * This loop runs over all buffers, including pinned ones.    The starting
-      * point advances through the buffer pool on successive calls.
-      *
-      * Note that we advance the static counter *before* trying to write. This
-      * ensures that, if we have a persistent write failure on a dirty buffer,
-      * we'll still be able to make progress writing other buffers. (The
-      * bgwriter will catch the error and just call us again later.)
-      */
-     if (bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0)
-     {
-         num_to_scan = (int) ((NBuffers * bgwriter_all_percent + 99) / 100);
-         num_written = 0;
-
-         while (num_to_scan-- > 0)
-         {
-             if (++buf_id1 >= NBuffers)
-                 buf_id1 = 0;
-             if (SyncOneBuffer(buf_id1, false))
-             {
-                 if (++num_written >= bgwriter_all_maxpages)
-                 {
-                     BgWriterStats.m_maxwritten_all++;
-                     break;
-                 }
-             }
-         }
-         BgWriterStats.m_buf_written_all += num_written;
-     }
-
-     /*
       * This loop considers only unpinned buffers close to the clock sweep
       * point.
       */
--- 1115,1126 ----
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

      /*
!      * The purpose of this sweep is to ensure that buffers that
       * will be recycled soon are clean when needed; these buffers are the ones
!      * just ahead of the StrategySyncStart point.
       */

      /*
       * This loop considers only unpinned buffers close to the clock sweep
       * point.
       */
***************
*** 1107,1117 ****
          num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
          num_written = 0;

!         buf_id2 = StrategySyncStart();

          while (num_to_scan-- > 0)
          {
!             if (SyncOneBuffer(buf_id2, true))
              {
                  if (++num_written >= bgwriter_lru_maxpages)
                  {
--- 1129,1139 ----
          num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
          num_written = 0;

!         buf_id = StrategySyncStart();

          while (num_to_scan-- > 0)
          {
!             if (SyncOneBuffer(buf_id, true))
              {
                  if (++num_written >= bgwriter_lru_maxpages)
                  {
***************
*** 1119,1126 ****
                      break;
                  }
              }
!             if (++buf_id2 >= NBuffers)
!                 buf_id2 = 0;
          }
          BgWriterStats.m_buf_written_lru += num_written;
      }
--- 1141,1148 ----
                      break;
                  }
              }
!             if (++buf_id >= NBuffers)
!                 buf_id = 0;
          }
          BgWriterStats.m_buf_written_lru += num_written;
      }
***************
*** 1340,1348 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
!     BufferSync();
      smgrsync();
  }

--- 1362,1373 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
!     elog(DEBUG1, "CHECKPOINT: write phase");
!     BufferSync(immediate);
!
!     elog(DEBUG1, "CHECKPOINT: sync phase");
      smgrsync();
  }

***************
*** 2131,2137 ****
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~BM_DIRTY;
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
--- 2156,2162 ----
      Assert(buf->flags & BM_IO_IN_PROGRESS);
      buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
!         buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
      buf->flags |= set_flag_bits;

      UnlockBufHdr(buf);
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.280
diff -c -r1.280 utility.c
*** src/backend/tcop/utility.c    30 May 2007 20:12:01 -0000    1.280
--- src/backend/tcop/utility.c    20 Jun 2007 09:36:31 -0000
***************
*** 1089,1095 ****
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestCheckpoint(true, false);
              break;

          case T_ReindexStmt:
--- 1089,1095 ----
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestImmediateCheckpoint();
              break;

          case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.400
diff -c -r1.400 guc.c
*** src/backend/utils/misc/guc.c    20 Jun 2007 18:31:39 -0000    1.400
--- src/backend/utils/misc/guc.c    26 Jun 2007 16:58:23 -0000
***************
*** 1569,1583 ****
      },

      {
-         {"bgwriter_all_maxpages", PGC_SIGHUP, RESOURCES,
-             gettext_noop("Background writer maximum number of all pages to flush per round."),
-             NULL
-         },
-         &bgwriter_all_maxpages,
-         5, 0, 1000, NULL, NULL
-     },
-
-     {
          {"log_rotation_age", PGC_SIGHUP, LOGGING_WHERE,
              gettext_noop("Automatic log file rotation will occur after N minutes."),
              NULL,
--- 1569,1574 ----
***************
*** 1830,1844 ****
      },

      {
-         {"bgwriter_all_percent", PGC_SIGHUP, RESOURCES,
-             gettext_noop("Background writer percentage of all buffers to flush per round."),
-             NULL
-         },
-         &bgwriter_all_percent,
-         0.333, 0.0, 100.0, NULL, NULL
-     },
-
-     {
          {"seed", PGC_USERSET, UNGROUPED,
              gettext_noop("Sets the seed for random-number generation."),
              NULL,
--- 1821,1826 ----
***************
*** 1865,1870 ****
--- 1847,1861 ----
          0.1, 0.0, 100.0, NULL, NULL
      },

+     {
+         {"checkpoint_completion_target", PGC_SIGHUP, WAL_CHECKPOINTS,
+             gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
+             NULL
+         },
+         &CheckPointCompletionTarget,
+         0.5, 0.0, 0.9, NULL, NULL
+     },
+
      /* End-of-list marker */
      {
          {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    3 Jun 2007 17:08:15 -0000    1.216
--- src/backend/utils/misc/postgresql.conf.sample    26 Jun 2007 17:02:05 -0000
***************
*** 168,173 ****
--- 168,174 ----

  #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min        # range 30s-1h
+ #checkpoint_completion_target = 0.5    # checkpoint duration, range 0.0 - 0.9
  #checkpoint_warning = 30s        # 0 is off

  # - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h    30 May 2007 20:12:02 -0000    1.78
--- src/include/access/xlog.h    19 Jun 2007 14:10:07 -0000
***************
*** 171,179 ****
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
--- 171,180 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);

  #endif   /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h    5 Jan 2007 22:19:57 -0000    1.9
--- src/include/postmaster/bgwriter.h    25 Jun 2007 10:33:38 -0000
***************
*** 20,29 ****
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;

  extern void BackgroundWriterMain(void);

! extern void RequestCheckpoint(bool waitforit, bool warnontime);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 20,33 ----
  extern int    BgWriterDelay;
  extern int    CheckPointTimeout;
  extern int    CheckPointWarning;
+ extern double CheckPointCompletionTarget;

  extern void BackgroundWriterMain(void);

! extern void RequestImmediateCheckpoint(void);
! extern void RequestLazyCheckpoint(void);
! extern void RequestXLogFillCheckpoint(void);
! extern void CheckpointWriteDelay(bool immediate, double progress);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h    30 May 2007 20:12:03 -0000    1.90
--- src/include/storage/buf_internals.h    12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
  #define BM_IO_ERROR                (1 << 4)        /* previous I/O failed */
  #define BM_JUST_DIRTIED            (1 << 5)        /* dirtied since write started */
  #define BM_PIN_COUNT_WAITER        (1 << 6)        /* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED    (1 << 7)        /* this needs to be written in checkpoint */

  typedef bits16 BufFlags;

Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h    30 May 2007 20:12:03 -0000    1.104
--- src/include/storage/bufmgr.h    25 Jun 2007 11:28:36 -0000
***************
*** 33,41 ****
  /* in bufmgr.c */
  extern bool zero_damaged_pages;
  extern double bgwriter_lru_percent;
- extern double bgwriter_all_percent;
  extern int    bgwriter_lru_maxpages;
- extern int    bgwriter_all_maxpages;

  /* in buf_init.c */
  extern DLLIMPORT char *BufferBlocks;
--- 33,39 ----
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 134,140 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,167 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(void);
  extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);
--- 159,165 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
  extern void BgBufferSync(void);

  extern void AtProcExit_LocalBuffers(void);

В списке pgsql-patches по дате отправления:

Предыдущее
От: Greg Smith
Дата:
Сообщение: Re: Load Distributed Checkpoints, take 3
Следующее
От: Michael Glaesemann
Дата:
Сообщение: Re: Load Distributed Checkpoints, final patch