Discussion: Directory fsync and other fun

Hi all,

I started setting up some halfway automated method of simulating hard crashes
and even while setting those up I found some pretty unsettling results...

Now it's not unlikely that my testing is flawed but unfortunately I don't see
where right now (it's 3am now and I have an 8h train ride behind me, so ...)

The simple test setup I have so far:

Server script:
* set up disk
* start pg
* wait to get killed
* set up disk
* start pg

Client side:
* CREATE DATABASE ... TEMPLATE crashtemplate
* CHECKPOINT
* make the device read-only, not allowing any cache flushes or such (using device mapper)
* kill server
* connect to database (some of the time it errors here)
* select * from $every_table (sometimes it errors here)

At first pg survived that nicely without any problems. Then I came to my
senses and started adding some background I/O, like:

    dd if=/dev/zero of=/mnt/test/foobar bs=10M count=1000

That's where things started failing. All of these are logs from after the crash:

1:
    FATAL: could not read relation mapping file "base/140883/pg_filenode.map": Interrupted system call
    DEBUG: autovacuum: processing database "postgres"
    FATAL: could not read relation mapping file "base/140883/pg_filenode.map": Success
    DEBUG: autovacuum: processing database "postgres"
    ...
    FATAL: could not read relation mapping file "base/58963/pg_filenode.map": No such file or directory

2:
    FATAL: "base/165459" is not a valid data directory
    DETAIL: File "base/165459/PG_VERSION" does not contain valid data.
    HINT: You might need to initdb.

3:
    You are now connected to database "test".
    test=# SELECT execute('SELECT * FROM table_'||g.i) FROM generate_series(1, 3000) g(i);
    ERROR: XX001: could not read block 0 in file "base/124499/11652": read only 0 of 8192 bytes
    LOCATION: mdread, md.c:656

(that one I did not see with -o data=ordered,barrier=1,commit=300)

I tried the following mount options/filesystems so far:

    -t ext4 -o data=writeback,barrier=1,commit=300,noauto_da_alloc
    -t ext4 -o data=writeback,barrier=1,commit=300
    -t ext4 -o data=writeback,barrier=0,commit=300
    -t ext4 -o data=ordered,barrier=0,commit=300,noauto_da_alloc
    -t ext4 -o data=ordered,barrier=1,commit=300,noauto_da_alloc
    -t ext4 -o data=ordered,barrier=1,commit=300

The same with s/ext4/ext3/ and with commit=5. With the latter the errors were
much harder to reproduce (not that surprisingly) but still occurred.

I attached my preliminary scripts/hacks... They even contain a comment or two.
Note though that they are a bit of a loaded gun...

I guess it would be sensible to do some more extensive tests on a setup like
that... All I have tested so far is CREATE DATABASE :-(

Andres

Andres Freund <andres@anarazel.de> wrote:

> I started setting up some halfway automated method of simulating hard crashes
> and even while setting those up I found some pretty unsettling results...
> Now it's not unlikely that my testing is flawed but unfortunately I don't see
> where right now (it's 3am now and I have an 8h train ride behind me, so ...)

I think the reported behavior is a TODO item to research:

* Research use of fsync on a newly created directory.
  There is no guarantee that mkdir() flushes the metadata of the directory.

Also, I heard ext4 has a "feature" in that rename() might truncate the
renamed file to zero bytes on crash. The user data in the file might be
lost if the machine crashes just after rename().

* Research whether our use of rename() is safe on various file systems.
  Especially on ext4, the contents of the renamed file might be lost on crash.

Comments and suggestions for improving the wording are welcome. If there are
no objections, I'll add them to the Fsync section of the TODO page.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
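
The first TODO item above can be illustrated with a short C sketch: after mkdir(), fsync the new directory and then its parent, so that the parent's entry for the new directory is also flushed. This is only an illustration under stated assumptions, not PostgreSQL code. The helper names fsync_dir_path and mkdir_durable are hypothetical, and whether fsync() on a directory descriptor is supported and sufficient depends on the platform and file system.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a directory read-only and fsync it.
 * Returns 0 on success, -1 on failure with errno set.
 */
int fsync_dir_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved_errno = errno;    /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return rc;
}

/*
 * Hypothetical helper: create `path` (a direct child of `parent`), then
 * flush the new directory and the parent directory that now refers to it.
 * Returns 0 on success, -1 on failure.
 */
int mkdir_durable(const char *parent, const char *path, mode_t mode)
{
    if (mkdir(path, mode) != 0)
        return -1;
    if (fsync_dir_path(path) != 0)
        return -1;
    return fsync_dir_path(parent);
}
```

A caller would use it as mkdir_durable("base", "base/12345", 0700), where both paths are made up for the example. The sketch does not remove the directory again if one of the fsync calls fails; a real caller would have to decide how to handle that.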

Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
> Andres Freund <andres@anarazel.de> wrote:
>> I started setting up some halfway automated method of simulating hard crashes
>> and even while setting those up I found some pretty unsettling results...
>> Now it's not unlikely that my testing is flawed but unfortunately I don't see
>> where right now (it's 3am now and I have an 8h train ride behind me, so ...)

> I think the reported behavior is a TODO item to research:

We have since found out that the copydir code was completely broken and
wasn't fsyncing anything. So those tests need to be re-run against HEAD
before jumping to any conclusions.

regards, tom lane

On Wed, Feb 24, 2010 at 2:51 AM, Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> wrote:
> Also, I heard ext4 has a "feature" in that rename() might truncate the
> renamed file to zero bytes on crash. The user data in the file might be
> lost if the machine crashes just after rename().

In our case I think this is the one thing that cannot happen. This happens
when you write out the new file and rename it over the old file without ever
fsyncing the new file. If the rename succeeds but all the writes get lost,
you end up with neither the new nor the old file.

The ext4 guys want you to do an fsync of the new file before doing the
rename. This is terrible for most of the applications that were doing this
-- the latency hit for interactive apps that didn't really need an fsync is
awful -- but in our case we were already doing fsyncs in every case where we
do renames.

--
greg
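
The ordering described in the message above (write the new file, fsync it, and only then rename it over the old one) can be sketched in C roughly as follows. This is a minimal illustration, not PostgreSQL's implementation: the function names write_all and replace_file_durably and all paths are hypothetical. The final fsync of the containing directory is an added assumption that is not stated in the thread; it is a common extra step to make the rename itself durable.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying on EINTR and
 * short writes. Returns 0 on success, -1 on failure. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Hypothetical helper: replace `final_path` with new contents.
 * `tmp_path` must be in the same directory `dir` as `final_path`.
 * Returns 0 on success, -1 on failure.
 */
int replace_file_durably(const char *dir, const char *tmp_path,
                         const char *final_path,
                         const char *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Step 1: get the new contents onto disk BEFORE the rename. */
    if (write_all(fd, data, len) != 0 || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Step 2: atomically swap the new file into place. */
    if (rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Step 3 (assumption, not from the thread): flush the directory
     * so the rename itself survives a crash. */
    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0)
        return -1;
    int rc = fsync(dir_fd);
    close(dir_fd);
    return rc;
}
```

Skipping step 1 gives the pattern the message warns about: the rename can reach disk while the data writes are lost, leaving neither the old nor the new contents.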