Thread: Directory fsync and other fun

Directory fsync and other fun

From
Andres Freund
Date:
Hi all,

I started setting up a halfway automated method of simulating hard crashes,
and even while setting that up I found some pretty unsettling results...
Now it's not unlikely that my testing is flawed, but unfortunately I don't see
where right now (it's 3am now and I have an 8h train ride behind me, so ...)

The simple test setup I have so far:
Server script:
* set up disk
* start pg
* wait to get killed
* set up disk
* start pg

Client side:
* CREATE DATABASE ... TEMPLATE crashtemplate
* CHECKPOINT
* make the device read-only, not allowing any cache flushes or such (using
  devicemapper)
* kill server
* connect to database (some of the time it errors here)
* select * from $every_table (some of the time it errors here)

At first pg survived that nicely without any problems. Then I came to my senses
and started adding some background I/O, like:
dd if=/dev/zero of=/mnt/test/foobar bs=10M count=1000

That's where things started failing. All of these are logs from after the crash:

1: 
FATAL:  could not read relation mapping file "base/140883/pg_filenode.map": Interrupted system call
DEBUG:  autovacuum: processing database "postgres"
FATAL:  could not read relation mapping file "base/140883/pg_filenode.map": Success
DEBUG:  autovacuum: processing database "postgres"
...
FATAL:  could not read relation mapping file "base/58963/pg_filenode.map": No such file or directory

2:
FATAL:  "base/165459" is not a valid data directory
DETAIL:  File "base/165459/PG_VERSION" does not contain valid data.
HINT:  You might need to initdb.

3:
You are now connected to database "test".
test=# SELECT execute('SELECT * FROM table_'||g.i) FROM generate_series(1, 3000) g(i);
ERROR:  XX001: could not read block 0 in file "base/124499/11652": read only 0 of 8192 bytes
LOCATION:  mdread, md.c:656
(that one I did not see with -o data=ordered,barrier=1,commit=300)


I tried the following mount options/filesystems so far:
-t ext4 -o data=writeback,barrier=1,commit=300,noauto_da_alloc
-t ext4 -o data=writeback,barrier=1,commit=300
-t ext4 -o data=writeback,barrier=0,commit=300
-t ext4 -o data=ordered,barrier=0,commit=300,noauto_da_alloc
-t ext4 -o data=ordered,barrier=1,commit=300,noauto_da_alloc
-t ext4 -o data=ordered,barrier=1,commit=300

The same with s/ext4/ext3/ and with commit=5. With the latter the errors
were much harder to reproduce (not that surprisingly) but still occurred.

I attached my preliminary scripts/hacks... They even contain a comment or two. 
Note though that they are a bit of a loaded gun...

I guess it would be sensible to do some more extensive tests on a setup
like that... All I have tested so far is CREATE DATABASE :-(

Andres


Re: Directory fsync and other fun

From
Takahiro Itagaki
Date:
Andres Freund <andres@anarazel.de> wrote:

> I started setting up a halfway automated method of simulating hard crashes,
> and even while setting that up I found some pretty unsettling results...
> Now it's not unlikely that my testing is flawed, but unfortunately I don't see
> where right now (it's 3am now and I have an 8h train ride behind me, so ...)

I think the reported behavior is a TODO item to research:
* Research use of fsync on a newly created directory. There is no guarantee
  that mkdir() flushes the metadata of the directory.
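
To make the TODO item concrete, here is a minimal sketch in Python (not PostgreSQL's C code) of creating a directory and then explicitly fsyncing both the new directory and its parent. The helper names fsync_dir and durable_mkdir are hypothetical, and whether this is sufficient on every file system is exactly the open research question; the sketch assumes a Linux-like system where a directory can be opened read-only and fsynced.

```python
import os
import tempfile


def fsync_dir(path):
    """Open a directory read-only and fsync it (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_mkdir(path):
    """Create a directory, then fsync it and its parent (hypothetical helper).

    The fsync of the parent is aimed at the new directory entry, which
    mkdir() alone is not guaranteed to have flushed.
    """
    os.mkdir(path)
    fsync_dir(path)
    fsync_dir(os.path.dirname(os.path.abspath(path)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        durable_mkdir(os.path.join(base, "newdir"))
```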

Also, I heard ext4 has a "feature" in that rename() might truncate the
renamed file to zero bytes on crash. The user data in the file might be
lost if the machine crashes just after rename().
* Research whether our use of rename() is safe on various file systems.
  Especially on ext4, the contents of the renamed file might be lost on crash.
 

Comments and suggestions for improving the wording are welcome.
If there are no objections, I'll add them to the Fsync section of the TODO page.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center




Re: Directory fsync and other fun

From
Tom Lane
Date:
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
> Andres Freund <andres@anarazel.de> wrote:
>> I started setting up a halfway automated method of simulating hard crashes,
>> and even while setting that up I found some pretty unsettling results...
>> Now it's not unlikely that my testing is flawed, but unfortunately I don't see
>> where right now (it's 3am now and I have an 8h train ride behind me, so ...)

> I think the reported behavior is a TODO item to research:

We have since found out that the copydir code was completely broken and
wasn't fsyncing anything.  So those tests need to be re-run against HEAD
before jumping to any conclusions.
        regards, tom lane
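
For readers unfamiliar with what "fsyncing" a directory copy involves, the following is a rough Python sketch of a flat copy that flushes every copied file and then the directories involved. It is not PostgreSQL's copydir implementation; the function names copy_dir_durably and fsync_dir are hypothetical, and skipping subdirectories is a simplification made for this sketch.

```python
import os
import shutil
import tempfile


def fsync_dir(path):
    """Open a directory read-only and fsync it (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def copy_dir_durably(src, dst):
    """Copy the regular files in src into a new directory dst, with fsyncs."""
    os.mkdir(dst)
    for name in sorted(os.listdir(src)):
        src_file = os.path.join(src, name)
        if not os.path.isfile(src_file):
            continue  # simplification: subdirectories are skipped
        with open(src_file, "rb") as fin, open(os.path.join(dst, name), "wb") as fout:
            shutil.copyfileobj(fin, fout)
            fout.flush()              # drain Python's userspace buffer
            os.fsync(fout.fileno())   # ask the kernel to write the file data
    fsync_dir(dst)                                      # entries inside dst
    fsync_dir(os.path.dirname(os.path.abspath(dst)))    # dst's entry in its parent


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        template = os.path.join(base, "template")
        os.mkdir(template)
        with open(os.path.join(template, "example"), "w") as f:
            f.write("data\n")
        copy_dir_durably(template, os.path.join(base, "copy"))
```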


Re: Directory fsync and other fun

From
Greg Stark
Date:
On Wed, Feb 24, 2010 at 2:51 AM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:
> Also, I heard ext4 has a "feature" in that rename() might truncate the
> renamed file to zero bytes on crash. The user data in the file might be
> lost if the machine crashes just after rename().

In our case I think this is the one thing that cannot happen. This
happens when you write out the new file and rename it over the old
file without ever fsyncing the new file. If the rename succeeds but
all the writes get lost you end up with neither the new nor old file.

The ext4 guys want you to do an fsync of the new file before doing
the rename. This is terrible for most of the applications that were
doing this -- the latency hit for interactive apps that didn't really
need an fsync is awful -- but in our case we were already doing fsyncs
in every case where we do renames.
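
The write, fsync, rename ordering described above can be sketched in Python as follows. This is an illustration only, not PostgreSQL code: replace_file_durably and the ".tmp" suffix are hypothetical, and the final fsync of the containing directory is an extra step borrowed from the directory-fsync discussion earlier in the thread rather than something stated in this message.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace path with data using write, fsync, rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # new contents flushed before the rename
    os.rename(tmp_path, path)  # replaces the old file atomically on POSIX
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)       # assumption: also flush the rename itself
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        target = os.path.join(base, "example.map")
        replace_file_durably(target, b"old contents")
        replace_file_durably(target, b"new contents")
```

Skipping the fsync of the temporary file is the case the ext4 discussion is about: the rename can reach disk before the data does, leaving a zero-length file after a crash.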




-- 
greg