Discussion: WAL & ZFS
I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/
I would recommend a separate pg_wal filesystem with the recordsize set to match the WAL page size; in my case 16k. I have kept the default recordsize of 128k for the data volume, and that configuration has worked well for supporting large DSS while using 16k database blocks.

> On Mar 30, 2022, at 5:32 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
> I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???
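The layout Rui describes can be sketched with OpenZFS commands along these lines. This is a hypothetical sketch, not his exact setup: the pool name `tank` and the dataset names are made up, and note that PostgreSQL's WAL page size is 8k in stock builds (16k only if built with a non-default `--with-wal-blocksize`):

```shell
# Hypothetical sketch of the layout described above; pool and
# dataset names are invented, properties can differ per dataset.

# Data volume: keep the default 128k recordsize, enable lz4.
zfs create -o recordsize=128k -o compression=lz4 tank/pgdata

# Separate pg_wal filesystem: recordsize matched to the WAL page
# size (16k per Rui's build; stock PostgreSQL builds use 8k).
zfs create -o recordsize=16k -o compression=lz4 tank/pgwal
```

The cluster's pg_wal directory would then be a symlink to, or a mount point on, the second dataset.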
The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.

Is this correct?

On 31.03.22 at 23:32, Rui DeSousa wrote:
> I would recommend a separate pg_wal filesystem with the recordsize set to match the WAL page size; in my case 16k. I have kept the default recordsize of 128k for the data volume, and that configuration has worked well for supporting large DSS while using 16k database blocks.

--
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
> On Mar 31, 2022, at 4:47 PM, Holger Jakobs <holger@jakobs.com> wrote:
>
> The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.
>
> Is this correct?

I could see that being reasonable. In this case I don't control the number and size of drives, and have a layout where it's really best in terms of utilization to put them all into the zpool. So my options are limited to ZFS options.
> The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.
>
> Is this correct?
Journaling file systems journal file operations like open, close, create, extend, rename, or delete, not file block operations. File blocks are not protected by file system journaling, just the inodes. The file system journal prevents you from losing files in case of a sudden machine crash, like when the machine loses power. It has nothing to do with block change journaling, which is the role of WAL files: WAL files record the changes to database blocks.

With ext2, it would be possible to lose a WAL segment in a sudden crash, and that might prevent cluster recovery. File system journaling has nothing to do with WAL; I would strongly advise against using ext2 for WAL.
Regards
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
> On Mar 31, 2022, at 5:09 PM, Mladen Gogala <gogala.mladen@gmail.com> wrote:
>
> Journaling file systems journal file operations like open, close, create, extend, rename or delete, not file block operations. File blocks are not protected by file system journaling, just the inodes. The file system journal prevents you from losing files in case of sudden machine crash, like when the machine loses power. It has nothing to do with the block change journaling, which is the role of WAL files.

LOL, I *knew* that about journaling file systems, thanks for reminding me!
> On Mar 30, 2022, at 5:32 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
> I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???
Phoronix has tested ZFS against Ext3, Ext4 and XFS. ZFS was consistently performing worse than all other file systems. Here is the test with Oracle:
https://blog.docbert.org/oracle-on-zfs/
Here are several articles that caution against ZFS:

https://forums.servethehome.com/index.php?threads/very-slow-zfs-raidz2-performance-on-truenas-12.33094/

https://serverfault.com/questions/791154/zfs-good-read-but-poor-write-speeds
https://www.phoronix.com/scan.php?page=article&item=ubuntu1910-ext4-zfs&num=3
And finally, this: https://storytime.ivysaur.me/posts/why-not-zfs/
I would consider Linux ZFS only for toy databases that do not hold any serious data.
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
> I would consider Linux ZFS only for toy databases that do not hold any serious data.

We have tested performance extensively, and ZFS performance is fine for our database (5TB, something like 100GB/day data inserted and deleted). I would consider hardly anything else for serious data, since hardly anything else deals properly with bit rot.

A few comments on the articles:

> Phoronix has tested ZFS against Ext3, Ext4 and XFS. ZFS was consistently performing worse than all other file systems. Here is the test with Oracle:
>
> https://blog.docbert.org/oracle-on-zfs/
>
> Here are several articles that caution against ZFS:
>
> https://forums.servethehome.com/index.php?threads/very-slow-zfs-raidz2-performance-on-truenas-12.33094/

One person opens the thread saying their performance is terrible. Others say their performance is OK and give tuning hints. The OP says "Block size has been settled to 128Kb, I just use it for too much everything to go too big or too small. so here we sit." Which, wow. You can create as many ZFS file systems as you want on a pool, with different recordsizes. Seriously: create a file system for your database, configured for your database.

> https://serverfault.com/questions/791154/zfs-good-read-but-poor-write-speeds

The post is 5 years old, well before lz4 compression was in OpenZFS, and lz4 is key to getting decent DB write performance. Also note that the poster was testing against an iSCSI block device with no idea whatsoever about the actual disk configuration behind it.

> https://www.phoronix.com/scan.php?page=article&item=ubuntu1910-ext4-zfs&num=3

This shows ZFS being slower for SQLite, faster for RocksDB, and no tests of PostgreSQL. Also, it is an out-of-the-box config with no attempt to tune for database use.

> And finally, this: https://storytime.ivysaur.me/posts/why-not-zfs/

Wow. This is an absolutely terrible post. It is filled with so much misinformation it would take an entire long post to deconstruct all the errors and biases.
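The per-dataset tuning Scott describes is cheap to verify. As a sketch (the pool and dataset names are hypothetical), each dataset on the same pool carries its own properties, and `zfs get` shows what is actually in effect, including the compression ratio lz4 is achieving:

```shell
# Each dataset on one pool can have its own recordsize and
# compression; inspect the effective settings and the ratio:
zfs get recordsize,compression,compressratio tank/pgdata tank/pgwal
```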
A journaled filesystem is only going to journal the metadata. ZFS will guarantee that your WAL page is either written or not; that is not the case with a journaled filesystem. I wouldn't recommend doing that; if you are using ZFS, then use it for both data and WALs.

> On Mar 31, 2022, at 6:47 PM, Holger Jakobs <holger@jakobs.com> wrote:
>
> The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.
>
> Is this correct?
> On Apr 1, 2022, at 11:49 AM, Rui DeSousa <rui@crazybean.net> wrote:
>
> If you're using RAIDZ# then performance is going to be heavily impacted, and I would highly recommend NOT using RAIDZ# for a database server.

Actually, I found even the performance of RAIDZ1 to be acceptable after appropriate configuration: current versions, lz4, etc.
> On Apr 1, 2022, at 1:56 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
> Actually, I found even the performance of RAIDZ1 to be acceptable after appropriate configuration: current versions, lz4, etc.

It might be for a low-IOPS system; however, I would still recommend against it. I haven't used RAIDZ in years; it might be good for an archive system, but I don't see the value of it in a production database server. You also have to account for drive failures and replacement time. A replacement in a RAIDZ configuration is much more expensive than replacing a disk in a mirrored set. Disks today are larger as well, and the risk of another failure during a rebuild is exponentially increased, thus the need for RAIDZ2 and RAIDZ3.

Personally, and for logical reasons, I would build a RAIDZ in powers of 2, i.e. 2, 4, or 8 drives plus parity, and then have a stripe set of RAIDZ2. So the first option would require 4 drives (2D + 2P) and would have the same storage as a RAID10 configuration; however, the RAID10 would perform better under load. The 4+P option seems to be the sweet spot, as the rebuild times on larger sets are not worth it, nor is it worth spreading 128k over 8 drives. Of course one could use a larger record size, but would you want to? For me 128k/16k is only 8 database blocks; reminds me of using Oracle's readahead=8 option :).

Note: RAIDZ does not always stripe across all drives in the set like a traditional RAID set does; i.e. it might only use 2+P instead of 8+P as configured. It depends on the size of the current ZFS record being written out and on free space.
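For concreteness, the two 4-drive layouts being compared could be created along these lines (pool name and device names are hypothetical):

```shell
# RAID10-style striped mirrors: better performance under database
# load, and fast, cheap resilver when a drive is replaced.
zpool create tank mirror da0 da1 mirror da2 da3

# 2D+2P RAIDZ2: the same usable capacity, survives any two drive
# failures, but slower under load and costlier to rebuild.
zpool create tank raidz2 da0 da1 da2 da3
```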
> On Apr 1, 2022, at 2:23 PM, Rui DeSousa <rui@crazybean.net> wrote:
>
> Personally, and for logical reasons, I would build a RAIDZ in powers of 2, i.e. 2, 4, or 8 drives plus parity, and then have a stripe set of RAIDZ2.

I think your advice is all good for larger systems. For the time being, I am using 4 drives because of purchasing decisions made before I got here, and it turns out that RAIDZ1 is fast enough for our use, so in this case it makes sense for the larger space. And rebuild time is not terribly important, since we have enough boxes to have multiple streaming replicas.