Discussion: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From: Bruce Momjian
Date:

> I am concerned that the bgwriter will not be able to keep up with the
> I/O generated by even a single backend restoring a database, let alone a
> busy system.  To me, the write() performed by the bgwriter, because it
> is I/O, will typically be the bottleneck on any system that is I/O bound
> (especially as the kernel buffers fill) and will not be able to keep up
> with active backends now freed from writes.
>
> The idea to fall back when the bgwriter cannot keep up is to have the
> backends sync the data, which seems like it would just slow down an
> I/O-bound system further.
>
> I talked to Magnus about this, and we considered various ideas, but
> could not come up with a clean way of having the backends communicate to
> the bgwriter about their own non-sync writes.  We had the ideas of using
> shared memory or a socket, but these seemed like choke-points.
>
> Here is my new idea.  (I will keep throwing out ideas until I hit on a
> good one.)  The bgwriter is going to have to check before every write to
> determine if the file is already recorded as needing fsync during
> checkpoint.  My idea is to have that checking happen during the bgwriter
> buffer scan, rather than at write time.  If we add a shared memory
> boolean for each buffer, backends needing to write buffers can write
> buffers already recorded as safe to write by the bgwriter scanner.  I
> don't think the bgwriter is going to be able to keep up with I/O bound
> backends, but I do think it can scan and set those booleans fast enough
> for the backends to then perform the writes.  (We might need a separate
> bgwriter thread to do this or a separate process.)
>
> As I remember, our new queue system has a list of buffers that are most
> likely to be replaced, so the bgwriter can scan those first and make
> sure they have their booleans set.
>
> There is an issue that these booleans are set without locking, so
> backends might need to check them twice: first before the write, then
> again afterward, just before they replace the buffer.  The bgwriter
> would clear the bits before the checkpoint starts.

Now that no one is ill from my fsync buffer boolean idea, let me give
some implementation details.  :-)

First, we need to add a bit to each shared buffer descriptor (sbufdesc)
that indicates whether the background writer (bgwriter) has recorded the
file associated with the buffer as needing fsync. This bit will be set
only by the background writer, usually during its normal buffer scan
looking for buffers to write. The background writer doesn't write all
dirty buffers on each buffer pass, but it could record the buffers that
need fsync on each pass, allowing backends to write those buffers if
buffer space becomes limited. (I am not sure, but perhaps setting the
bit could be done with only a shared lock on the buffer because no one
else sets the bit.)

(One idea would be to move the fsync bit into its own byte in shared
memory so it is more centralized and no locking is required to set the
bit. Also, should we have one byte per shared buffer to indicate dirty
buffers so the bgwriter can find them more efficiently?)
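
To make the one-byte-per-buffer idea concrete, here is a minimal sketch.
All names (SketchBufferFlags, sketch_next_dirty, SKETCH_NFLAGS) are
hypothetical, and it assumes each flag byte holds exactly 0 or 1. With
the dirty flags packed into one contiguous array, finding the next dirty
buffer becomes a single memchr() call instead of a walk over the
descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NFLAGS 1024   /* hypothetical number of shared buffers */

/* Hypothetical flag arrays kept apart from the buffer descriptors. */
typedef struct
{
    unsigned char dirty[SKETCH_NFLAGS];           /* 1 = buffer is dirty */
    unsigned char fsync_recorded[SKETCH_NFLAGS];  /* 1 = file recorded for fsync */
} SketchBufferFlags;

/*
 * Return the index of the first dirty buffer at or after 'start',
 * or -1 if there is none.  Relies on dirty[] holding only 0 or 1.
 */
int
sketch_next_dirty(const SketchBufferFlags *f, int start)
{
    const unsigned char *hit;

    assert(f != NULL);
    if (start < 0 || start >= SKETCH_NFLAGS)
        return -1;

    hit = memchr(f->dirty + start, 1, (size_t) (SKETCH_NFLAGS - start));
    return hit ? (int) (hit - f->dirty) : -1;
}
```

A scan would call sketch_next_dirty(&flags, 0), then continue from the
returned index plus one until it gets -1.  Real shared memory would
also need rules about who may write each byte; this sketch ignores that.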

The bit can be cleared if either the background writer writes the page,
or a backend writes the page.
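
The set and clear rules for the bit can be sketched roughly as follows.
This is only an illustration: the struct and function names are made up,
it is not the real sbufdesc layout, and it leaves out all locking and
memory-ordering concerns.  The scan records the file before it sets the
bit, so a backend that sees the bit can assume the file is already on
the fsync list.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NBUFFERS  8    /* hypothetical sizes */
#define SKETCH_MAX_FILES 16

/* Hypothetical stand-in for a shared buffer descriptor. */
typedef struct
{
    int  file_id;         /* index of the file backing this buffer */
    bool dirty;           /* buffer has unwritten changes */
    bool fsync_recorded;  /* set only by the bgwriter scan */
} SketchBufferDesc;

typedef struct
{
    SketchBufferDesc bufs[SKETCH_NBUFFERS];
    bool needs_fsync[SKETCH_MAX_FILES];  /* files to fsync at checkpoint */
} SketchShared;

/*
 * bgwriter scan: for every dirty buffer not yet marked, record its file
 * as needing fsync and set the bit.  Returns how many bits were set.
 */
int
sketch_bgwriter_scan(SketchShared *s)
{
    int marked = 0;

    for (int i = 0; i < SKETCH_NBUFFERS; i++)
    {
        SketchBufferDesc *d = &s->bufs[i];

        if (!d->dirty || d->fsync_recorded)
            continue;
        assert(d->file_id >= 0 && d->file_id < SKETCH_MAX_FILES);
        s->needs_fsync[d->file_id] = true;  /* record the file first */
        d->fsync_recorded = true;           /* then set the bit */
        marked++;
    }
    return marked;
}

/*
 * Backend: write the buffer only if the bgwriter has already recorded
 * its file.  Returns true if the (simulated) write happened.
 */
bool
sketch_backend_try_write(SketchShared *s, int buf)
{
    SketchBufferDesc *d = &s->bufs[buf];

    if (!d->dirty || !d->fsync_recorded)
        return false;

    /* The real write() of the page would go here. */
    d->dirty = false;
    d->fsync_recorded = false;  /* bit is cleared once the page is written */
    return true;
}
```

A backend that gets false back would have to fall back to some other
path, such as waiting for the next bgwriter scan.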

Right now, the checkpoint process writes out all dirty buffers. We might
need to change this so that the background writer does it, because only
it can record files needing fsync.  During checkpoint, the background
writer should write out all buffers. It will not be recording any new
fsync bits during this scan because it is writing every dirty buffer.
(If it did, it could set an fsync bit for a buffer that ends up written
only during or after the fsync it performs later.)

Once it is done, it should move the hash of files needing fsync to a
backup pointer, create a new empty one, and do a scan so backends can
do writes. A subprocess should then fsync all of the saved files, either
by using fork() and having the child read the hash through the saved
pointer, or, for EXEC_BACKEND, by writing a temp file that the child
can read.
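
The pointer swap could look something like the sketch below.  The names
are hypothetical, a small array with linear de-duplication stands in for
the hash, and fsync() is called directly rather than from a forked child
or through a temp file.  It assumes a POSIX system for fsync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>   /* fsync(), POSIX */

#define SKETCH_MAX_FDS 64   /* hypothetical capacity */

/* Stand-in for the hash of files needing fsync. */
typedef struct
{
    int fds[SKETCH_MAX_FDS];
    int count;
} SketchFileSet;

typedef struct
{
    SketchFileSet  sets[2];
    SketchFileSet *pending;  /* files recorded since the last swap */
    SketchFileSet *saved;    /* the "backup pointer" handed to the syncer */
} SketchSyncState;

void
sketch_init(SketchSyncState *st)
{
    st->sets[0].count = 0;
    st->sets[1].count = 0;
    st->pending = &st->sets[0];
    st->saved = &st->sets[1];
}

/* Record a file; returns false if already present or the set is full. */
bool
sketch_record(SketchSyncState *st, int fd)
{
    SketchFileSet *p = st->pending;

    for (int i = 0; i < p->count; i++)
        if (p->fds[i] == fd)
            return false;
    if (p->count == SKETCH_MAX_FDS)
        return false;
    p->fds[p->count++] = fd;
    return true;
}

/* Move the pending set to the backup pointer and start a new empty one. */
void
sketch_swap_for_checkpoint(SketchSyncState *st)
{
    SketchFileSet *old_saved = st->saved;

    st->saved = st->pending;
    st->pending = old_saved;
    st->pending->count = 0;
}

/* fsync every saved file; returns the number of fsync() failures. */
int
sketch_fsync_saved(const SketchSyncState *st)
{
    int failures = 0;

    assert(st->saved != NULL);
    for (int i = 0; i < st->saved->count; i++)
        if (fsync(st->saved->fds[i]) != 0)
            failures++;
    return failures;
}
```

After the swap, new sketch_record() calls land in the fresh set, so the
saved set stays stable while the fsync pass runs.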

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073