Обсуждение: DROP DATABASE vs patch to not remove files right away

Поиск
Список
Период
Сортировка

DROP DATABASE vs patch to not remove files right away

От
Tom Lane
Дата:
Over the last couple days I twice saw complaints like this during
DROP DATABASE:

WARNING:  could not remove file or directory "base/80750/80825": No such file or directory
WARNING:  could not remove database directory "base/80750"

I poked at it for awhile and was eventually able to extract a
repeatable test case:

while true
do   psql -c "create database foo;" postgres || exit 1   psql -c "create table foo(f1 int primary key);" foo || exit 1
psql -c "drop table foo;" foo || exit 1   psql -c "checkpoint" postgres &   psql -c "drop database foo;" postgres ||
exit1
 
done

On my machine this fairly consistently draws warnings in both 8.3 and
HEAD.  I believe what is happening is that the bgwriter has a
PendingUnlinkEntry for table foo, and completion of the checkpoint
prompts it to exercise that.  Meanwhile in the DROP DATABASE, rmtree is
working through a list of files to drop, and when it hits the
already-deleted one it complains --- and not only does it complain,
it stops trying to delete any more.  (The second WARNING is quite
misleading, because what it really means is "I stopped trying".)

Without the CHECKPOINT, what we get instead is that each cycle builds up
some more PendingUnlinkEntrys, which will all fail when the checkpoint
comes.  The bgwriter is coded to not report ENOENT, so you don't see any
evidence of that, but it's clearly a possible case and the comment
saying it shouldn't happen is misleading.

Actually ... what if the same DB OID and relfilenode get re-made before
the checkpoint?  Then we'd be unlinking live data.  This is improbable
but hardly less so than the scenario the PendingUnlinkEntry code was
put in to prevent.

ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
This should prevent the unlink-live-data scenario, I think.
Even then, concurrent deletion attempts are probably possible (since
ForgetDatabaseFsyncRequests is asynchronous) and rmtree() is being far
too fragile about dealing with them.  I think that it should be coded
to ignore ENOENT the same as the bgwriter does, and that it should press
on and keep trying to delete things even if it gets a failure.

Thoughts?
        regards, tom lane


Re: DROP DATABASE vs patch to not remove files right away

От
Alvaro Herrera
Дата:
Tom Lane wrote:

> I think that it should be coded
> to ignore ENOENT the same as the bgwriter does, and that it should press
> on and keep trying to delete things even if it gets a failure.

Perhaps, if it gets ENOENT, record this fact -- and after rmtree
returns, retry the whole thing.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: DROP DATABASE vs patch to not remove files right away

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I think that it should be coded
>> to ignore ENOENT the same as the bgwriter does, and that it should press
>> on and keep trying to delete things even if it gets a failure.

> Perhaps, if it gets ENOENT, record this fact -- and after rmtree
> returns, retry the whole thing.

Er, retry what?  There was, presumably, nothing to delete.
        regards, tom lane


Re: DROP DATABASE vs patch to not remove files right away

От
Andrew Dunstan
Дата:

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>   
>> Tom Lane wrote:
>>     
>>> I think that it should be coded
>>> to ignore ENOENT the same as the bgwriter does, and that it should press
>>> on and keep trying to delete things even if it gets a failure.
>>>       
>
>   
>> Perhaps, if it gets ENOENT, record this fact -- and after rmtree
>> returns, retry the whole thing.
>>     
>
> Er, retry what?  There was, presumably, nothing to delete.
>
>             
>   

Yeah. I agree rmtree() should treat ENOENT as non-fatal.

cheers

andrew


Re: DROP DATABASE vs patch to not remove files right away

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
> causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
> This should prevent the unlink-live-data scenario, I think.
> Even then, concurrent deletion attempts are probably possible (since
> ForgetDatabaseFsyncRequests is asynchronous) and rmtree() is being far
> too fragile about dealing with them.  I think that it should be coded
> to ignore ENOENT the same as the bgwriter does, and that it should press
> on and keep trying to delete things even if it gets a failure.

Yep. I can write a patch for that, unless you're onto it already?

However, this makes me reconsider Florian's suggestion to just make
relfilenode larger and avoid reusing them altogether. It would simplify
the code quite a bit, and make it more robust. That is good because even 
if we fix these problems per your suggestion, I'm left wondering if 
we've missed some even weirder corner-cases.

Florian suggested a scheme where the xid and epoch is embedded in the 
filename, but that's unnecessarily complex. We could just make 
relfilenode a 64-bit integer. 2^64 should be enough for everyone.

You listed these problems with Florian's suggestion back then:

> 1. Zero chance of ever backpatching.  (I know I said I wasn't excited
>    about that, but it's still a strike against a proposed fix.)

Still true. We would need to do what you suggested for 8.3, but 
simplifying the code would be good thing in the long run.

> 2. Adds new fields to RelFileNode, which will be a major code change,
>    and possibly a noticeable performance hit (bigger hashtable keys).

We talked about this wrt. map forks, and concluded that it's not an 
issue. If we add the map forks as well, BufferTag struct would grow from  16 bytes to 24 bytes. It's worth doing some
moremicro-benchmarking 
 
with that, but it's probably acceptable. Or we could allocate a few bits 
of the 64-bit relfilenode field in RelFileNode to indicate the map fork.

> 3. Adds new columns to pg_class, which is a real PITA ...

We would only have to change relfilenode from oid to int64.

> 4. Breaks oid2name and all similar code that knows about relfilenode.

True, but they're not hard to fix.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



Re: DROP DATABASE vs patch to not remove files right away

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Florian suggested a scheme where the xid and epoch is embedded in the 
> filename, but that's unnecessarily complex. We could just make 
> relfilenode a 64-bit integer. 2^64 should be enough for everyone.

Doesn't fix the problem unless DB and TS OIDs become int64 too;
in fact, given that we generate relfilenodes off the OID counter,
it's difficult to see how you do this without making *all* OIDs
64-bit.

Plus you're assuming that the machine has working 64-bit ints.
There's a large difference in my mind between saying "bigint
doesn't work right if you don't have working int64" and "we don't
guarantee the safety of your data if you don't have int64".
I'm not prepared to rip out the non-collision code until such
time as we irrevocably refuse to build on machines without int64.
        regards, tom lane


Re: DROP DATABASE vs patch to not remove files right away

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Florian suggested a scheme where the xid and epoch is embedded in the 
>> filename, but that's unnecessarily complex. We could just make 
>> relfilenode a 64-bit integer. 2^64 should be enough for everyone.
> 
> Doesn't fix the problem unless DB and TS OIDs become int64 too;

Remember what the original scenario was:

1. DROP TABLE foo;
2. BEGIN; CREATE TABLE bar; COPY TO bar ...; COMMIT -- bar gets the same 
relfilenode as "foo", by chance
3. <crash>
4. WAL replay of "DROP TABLE foo" deletes the data inserted at step 2. 
Because the table was created in the same transaction, we skipped WAL 
logging the inserts, and the data is lost.

Even if we can get a collision in DB or tablespace OIDs, we can't get a 
collision at step 2 if we don't reuse relfilenodes.

> in fact, given that we generate relfilenodes off the OID counter,
> it's difficult to see how you do this without making *all* OIDs
> 64-bit.

By generating them off a new 64-bit counter instead of the OID counter. 
Or by extending the OID counter to 64-bits, but only using the lower 64 
bits for OIDs.

> Plus you're assuming that the machine has working 64-bit ints.
> There's a large difference in my mind between saying "bigint
> doesn't work right if you don't have working int64" and "we don't
> guarantee the safety of your data if you don't have int64".
> I'm not prepared to rip out the non-collision code until such
> time as we irrevocably refuse to build on machines without int64.

Hmm. We could make it a struct of two int4s if we have to, couldn't we?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: DROP DATABASE vs patch to not remove files right away

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> Actually ... what if the same DB OID and relfilenode get re-made before
> the checkpoint?  Then we'd be unlinking live data.  This is improbable
> but hardly less so than the scenario the PendingUnlinkEntry code was
> put in to prevent.
> 
> ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
> causes PendingUnlinkEntrys for the doomed DB to be thrown away too.

Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
still isn't enough, I'm afraid.

1. DROP TABLE footable;
2. Checkpoint begins.
3. DROP DATABASE foodb;
4. CREATE DATABASE bardb; -- bardb gets same OID as foodb, and a table 
copied from template database, let's call it bartable, gets same OID as 
footable
5. Checkpoint processes pending unlink for footable, but removes 
bartable instead

Needs more thought, after some sleep...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: DROP DATABASE vs patch to not remove files right away

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
>> causes PendingUnlinkEntrys for the doomed DB to be thrown away too.

> Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
> still isn't enough, I'm afraid.

Hm.  I notice that there is no bug on Windows because dropdb forces a
checkpoint before it starts to remove files.  Maybe the sanest solution
is to just do that on all platforms.
        regards, tom lane


Re: DROP DATABASE vs patch to not remove files right away

От
Andrew Dunstan
Дата:

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>   
>> Tom Lane wrote:
>>     
>>> ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
>>> causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
>>>       
>
>   
>> Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
>> still isn't enough, I'm afraid.
>>     
>
> Hm.  I notice that there is no bug on Windows because dropdb forces a
> checkpoint before it starts to remove files.  Maybe the sanest solution
> is to just do that on all platforms.
>
>             
>   

IIRC we did that to avoid an unlink problem :-)

I certainly don't see any particular reason not to do it everywhere.

cheers

andrew


Re: DROP DATABASE vs patch to not remove files right away

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
>>> causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
>
>> Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this
>> still isn't enough, I'm afraid.
>
> Hm.  I notice that there is no bug on Windows because dropdb forces a
> checkpoint before it starts to remove files.  Maybe the sanest solution
> is to just do that on all platforms.

Yeah, I don't see any other simple solution.

Patch attached that does the three changes we've talked about:'
- make ForgetDatabaseFsyncRequests forget unlink requests as well
- make rmtree() not fail on ENOENT
- force checkpoint on in dropdb on all platforms

plus some comment changes reflecting what we now know. I will apply this
tomorrow if there's no objections.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.205
diff -c -r1.205 dbcommands.c
*** src/backend/commands/dbcommands.c    26 Mar 2008 21:10:37 -0000    1.205
--- src/backend/commands/dbcommands.c    17 Apr 2008 18:53:17 -0000
***************
*** 696,713 ****
      pgstat_drop_database(db_id);

      /*
!      * Tell bgwriter to forget any pending fsync requests for files in the
!      * database; else it'll fail at next checkpoint.
       */
      ForgetDatabaseFsyncRequests(db_id);

      /*
!      * On Windows, force a checkpoint so that the bgwriter doesn't hold any
!      * open files, which would cause rmdir() to fail.
       */
- #ifdef WIN32
      RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
- #endif

      /*
       * Remove all tablespace subdirs belonging to the database.
--- 696,714 ----
      pgstat_drop_database(db_id);

      /*
!      * Tell bgwriter to forget any pending fsync and unlink requests for files
!      * in the database; else it'll fail at next checkpoint, or worse it will
!      * delete files that belong to a newly created database with the same OID.
       */
      ForgetDatabaseFsyncRequests(db_id);

      /*
!      * Force a checkpoint to make sure the bgwriter has received the message
!      * sent by ForgetDatabaseFsyncRequests. On Windows, this also ensures that
!      * the bgwriter doesn't hold any open files, which would cause rmdir() to
!      * fail.
       */
      RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

      /*
       * Remove all tablespace subdirs belonging to the database.
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.136
diff -c -r1.136 md.c
*** src/backend/storage/smgr/md.c    10 Mar 2008 20:06:27 -0000    1.136
--- src/backend/storage/smgr/md.c    17 Apr 2008 19:26:16 -0000
***************
*** 1196,1203 ****
          if (unlink(path) < 0)
          {
              /*
!              * ENOENT shouldn't happen either, but it doesn't really matter
!              * because we would've deleted it now anyway.
               */
              if (errno != ENOENT)
                  ereport(WARNING,
--- 1196,1206 ----
          if (unlink(path) < 0)
          {
              /*
!              * There's a race condition, when the database is dropped at the
!              * same time that we process the pending unlink requests. If the
!              * DROP DATABASE deletes the file before we do, we will get ENOENT
!              * here. rmtree() also has to ignore ENOENT errors, to deal with
!              * the possibility that we delete the file first.
               */
              if (errno != ENOENT)
                  ereport(WARNING,
***************
*** 1321,1327 ****
--- 1324,1334 ----
          /* Remove any pending requests for the entire database */
          HASH_SEQ_STATUS hstat;
          PendingOperationEntry *entry;
+         ListCell   *cell,
+                    *prev,
+                    *next;

+         /* Remove fsync requests */
          hash_seq_init(&hstat, pendingOpsTable);
          while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
***************
*** 1331,1336 ****
--- 1338,1359 ----
                  entry->canceled = true;
              }
          }
+
+         /* Remove unlink requests */
+         prev = NULL;
+         for (cell = list_head(pendingUnlinks); cell; cell = next)
+         {
+             PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+             next = lnext(cell);
+             if (entry->rnode.dbNode == rnode.dbNode)
+             {
+                 pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+                 pfree(entry);
+             }
+             else
+                 prev = cell;
+         }
      }
      else if (segno == UNLINK_RELATION_REQUEST)
      {
***************
*** 1386,1392 ****
  }

  /*
!  * ForgetRelationFsyncRequests -- ensure any fsyncs for a rel are forgotten
   */
  void
  ForgetRelationFsyncRequests(RelFileNode rnode)
--- 1409,1415 ----
  }

  /*
!  * ForgetRelationFsyncRequests -- forget any fsyncs for a rel
   */
  void
  ForgetRelationFsyncRequests(RelFileNode rnode)
***************
*** 1419,1425 ****
  }

  /*
!  * ForgetDatabaseFsyncRequests -- ensure any fsyncs for a DB are forgotten
   */
  void
  ForgetDatabaseFsyncRequests(Oid dbid)
--- 1442,1448 ----
  }

  /*
!  * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
   */
  void
  ForgetDatabaseFsyncRequests(Oid dbid)
Index: src/port/dirmod.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/port/dirmod.c,v
retrieving revision 1.53
diff -c -r1.53 dirmod.c
*** src/port/dirmod.c    11 Apr 2008 23:53:00 -0000    1.53
--- src/port/dirmod.c    17 Apr 2008 19:08:15 -0000
***************
*** 406,413 ****
      {
          snprintf(filepath, MAXPGPATH, "%s/%s", path, *filename);

          if (lstat(filepath, &statbuf) != 0)
!             goto report_and_fail;

          if (S_ISDIR(statbuf.st_mode))
          {
--- 406,429 ----
      {
          snprintf(filepath, MAXPGPATH, "%s/%s", path, *filename);

+         /*
+          * It's ok if the file is not there anymore; we were just about to
+          * delete it anyway.
+          *
+          * This is not an academic possibility. One scenario where this
+          * happens is when bgwriter has a pending unlink request for a file
+          * in a database that's being dropped. In dropdb(), we call
+          * ForgetDatabaseFsyncRequests() to flush out any such pending unlink
+          * requests, but because that's asynchronous, it's not guaranteed
+          * that the bgwriter receives the message in time.
+          */
          if (lstat(filepath, &statbuf) != 0)
!         {
!             if (errno != ENOENT)
!                 goto report_and_fail;
!             else
!                 continue;
!         }

          if (S_ISDIR(statbuf.st_mode))
          {
***************
*** 422,428 ****
          else
          {
              if (unlink(filepath) != 0)
!                 goto report_and_fail;
          }
      }

--- 438,447 ----
          else
          {
              if (unlink(filepath) != 0)
!             {
!                 if (errno != ENOENT)
!                     goto report_and_fail;
!             }
          }
      }


Re: DROP DATABASE vs patch to not remove files right away

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Patch attached that does the three changes we've talked about:'
> - make ForgetDatabaseFsyncRequests forget unlink requests as well
> - make rmtree() not fail on ENOENT
> - force checkpoint on in dropdb on all platforms

This looks fine as far as it goes, but I'm still thinking it's a bad
idea for rmtree() to fall over on the first failure.  I propose that it
ought to keep trying to delete the rest of the target directory tree.
Otherwise, one permissions problem (for example) could lead to most
of a dropped database not getting freed up,

If there aren't objections, I'll make that happen after Heikki applies
his patch.
        regards, tom lane


Re: DROP DATABASE vs patch to not remove files right away

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Patch attached that does the three changes we've talked about:'
>> - make ForgetDatabaseFsyncRequests forget unlink requests as well
>> - make rmtree() not fail on ENOENT
>> - force checkpoint on in dropdb on all platforms
> 
> This looks fine as far as it goes

Ok, committed.

> but I'm still thinking it's a bad
> idea for rmtree() to fall over on the first failure.  I propose that it
> ought to keep trying to delete the rest of the target directory tree.
> Otherwise, one permissions problem (for example) could lead to most
> of a dropped database not getting freed up,>
> If there aren't objections, I'll make that happen after Heikki applies
> his patch.

Agreed. Be my guest.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com