Discussion: WAL & ZFS


WAL & ZFS

From:
Scott Ribe
Date:
I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???
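
For concreteness, the commonly cited settings mentioned above look something like the following; the pool and dataset names are hypothetical, and this is a sketch of the usual advice, not a verified recipe:

    # ZFS side: enable lz4 and match recordsize to PostgreSQL's 8k page
    zfs set compression=lz4 tank/pgdata
    zfs set recordsize=8K tank/pgdata
    # PostgreSQL side: the usual rationale is that ZFS's copy-on-write
    # prevents torn pages, so full_page_writes can be turned off
    psql -c "ALTER SYSTEM SET full_page_writes = off;"
    psql -c "SELECT pg_reload_conf();"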

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/






Re: WAL & ZFS

From:
Rui DeSousa
Date:
I would recommend a separate pg_wal filesystem with the record size set to match the WAL page size; in my case 16k.  I have kept the default record size at 128k for the data volume, and that configuration has worked well for supporting large DSS while using 16k data blocks.
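
A minimal sketch of that layout, assuming a pool named tank and the 16k figure above (names and sizes are illustrative):

    # WAL dataset: recordsize matched to the WAL page size
    zfs create -o recordsize=16K -o compression=lz4 tank/pgwal
    # data dataset: keep the default 128k recordsize
    zfs create -o compression=lz4 tank/pgdata
    # then move pg_wal onto the new dataset and symlink it back, e.g.:
    #   mv $PGDATA/pg_wal/* /tank/pgwal/
    #   rmdir $PGDATA/pg_wal && ln -s /tank/pgwal $PGDATA/pg_wal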

> On Mar 30, 2022, at 5:32 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
> I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???
>
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> https://www.linkedin.com/in/scottribe/
>
>
>
>
>




Re: WAL & ZFS

From:
Holger Jakobs
Date:
The WAL is a journal itself and doesn't need another journal for safety. 
Therefore, a common recommendation is using ext2 (which has no journal) 
for the WAL partition.

Is this correct?

On 31.03.22 at 23:32, Rui DeSousa wrote:
> I would recommend a separate pg_wal filesystem with the record size to match the WAL page size; in my case 16k.  I have kept the default record size at 128k for the data volume and that configuration has worked well for supporting large DSS while using 16k data blocks.
>
>> On Mar 30, 2022, at 5:32 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>>
>> I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???
>>
>> --
>> Scott Ribe
>> scott_ribe@elevated-dev.com
>> https://www.linkedin.com/in/scottribe/
>>
>>
>>
>>
>>
>
>
-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012



Re: WAL & ZFS

From:
Scott Ribe
Date:
> On Mar 31, 2022, at 4:47 PM, Holger Jakobs <holger@jakobs.com> wrote:
>
> The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.
>
> Is this correct?

I could see that being reasonable. In this case I don't control the number and size of drives, and have a layout where it's really best in terms of utilization to put them all into the zpool. So my options are limited to ZFS options.


Re: WAL & ZFS

From:
Mladen Gogala
Date:
On 3/31/22 18:47, Holger Jakobs wrote:
The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.

Is this correct?

On 31.03.22 at 23:32, Rui DeSousa wrote:
I would recommend a separate pg_wal filesystem with the record size to match the WAL page size; in my case 16k.  I have kept the default record size at 128k for the data volume and that configuration has worked well for supporting large DSS while using 16k data blocks.

On Mar 30, 2022, at 5:32 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:

I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/







Journaling file systems journal file operations like open, close, create, extend, rename, or delete, not file block operations. File blocks are not protected by file system journaling, just the inodes. The file system journal prevents you from losing files in the case of a sudden machine crash, like when the machine loses power. It has nothing to do with block change journaling, which is the role of WAL files. WAL files record the changes to database blocks. With ext2, it would be possible to lose a WAL log in a sudden crash, and that might prevent cluster recovery. I would strongly advise against using ext2 for WAL logs.
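
To make the metadata-versus-data distinction concrete, ext4 exposes it directly as mount options (device and mount point here are hypothetical; a sketch):

    # default: only metadata goes through the journal; file blocks do not
    mount -o data=ordered /dev/sdb1 /var/lib/pgwal
    # full data journaling: file contents also pass through the journal
    mount -o data=journal /dev/sdb1 /var/lib/pgwal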

Regards

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: WAL & ZFS

From:
Scott Ribe
Date:
> On Mar 31, 2022, at 5:09 PM, Mladen Gogala <gogala.mladen@gmail.com> wrote:
>
> Journaling file systems journal file operations like open, close, create, extend, rename, or delete, not file block operations. File blocks are not protected by file system journaling, just the inodes. The file system journal prevents you from losing files in the case of a sudden machine crash, like when the machine loses power. It has nothing to do with block change journaling, which is the role of WAL files.

LOL, I *knew* that about journaling file systems, thanks for reminding me!


Re: WAL & ZFS

From:
Mladen Gogala
Date:
On 3/30/22 17:32, Scott Ribe wrote:
I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/





Phoronix has tested ZFS against Ext3, Ext4, and XFS, and ZFS consistently performed worse than all the other file systems. Here is the test with Oracle:

https://blog.docbert.org/oracle-on-zfs/

Here are several articles that caution against ZFS:

https://forums.servethehome.com/index.php?threads/very-slow-zfs-raidz2-performance-on-truenas-12.33094/

https://serverfault.com/questions/791154/zfs-good-read-but-poor-write-speeds

https://www.phoronix.com/scan.php?page=article&item=ubuntu1910-ext4-zfs&num=3

And finally, this: https://storytime.ivysaur.me/posts/why-not-zfs/

I would consider Linux ZFS only for toy databases that do not hold any serious data.


-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: WAL & ZFS

From:
Scott Ribe
Date:
> I would consider Linux ZFS only for toy databases that do not hold any serious data.

We have tested performance extensively, and ZFS performance is fine for our database (5TB, something like 100GB/day of data inserted and deleted). I would consider hardly anything else for serious data, since hardly anything else deals properly with bit rot.

A few comments on the articles:

> Phoronix has tested ZFS against Ext3, Ext4 and XFS. ZFS was consistently performing worse than all other file systems. Here is the test with Oracle:
>
> https://blog.docbert.org/oracle-on-zfs/
>
> Here are several articles that caution against ZFS:
>
> https://forums.servethehome.com/index.php?threads/very-slow-zfs-raidz2-performance-on-truenas-12.33094/

One person opens the thread saying their performance is terrible. Others say their performance is OK and give tuning hints. The OP says "Block size has been settled to 128Kb, I just use it for too much everything to go too big or too small. so here we sit." Which, wow. You can create as many ZFS file systems as you want on a pool, with different recordsizes. Seriously, create a file system for your database, configured for your database.
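
In ZFS terms, that is just per-workload datasets on one pool, each with its own recordsize (names here are hypothetical; a sketch):

    zfs create -o recordsize=128K tank/bulk    # large sequential files
    zfs create -o recordsize=8K tank/pgdata    # match the database page size
    zfs get recordsize tank/bulk tank/pgdata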

> https://serverfault.com/questions/791154/zfs-good-read-but-poor-write-speeds

Post is 5 years old, well before lz4 compression was in OpenZFS, and lz4 is key to getting decent DB write performance. Also note that the poster was testing against an iSCSI block device with no idea whatsoever about the actual disk configuration behind it.

> https://www.phoronix.com/scan.php?page=article&item=ubuntu1910-ext4-zfs&num=3

This shows ZFS being slower for SQLite, faster for RocksDB, with no tests of PostgreSQL. Also, it's an out-of-the-box config with no attempt to tune for database use.

> And finally, this: https://storytime.ivysaur.me/posts/why-not-zfs/


Wow. This is an absolutely terrible post. It is filled with so much misinformation that it would take an entire long post to deconstruct all the errors and biases.




Re: WAL & ZFS

From:
Rui DeSousa
Date:
Not quite. A journaling filesystem is only going to journal the metadata. ZFS will guarantee that your WAL page is either written or not; that is not the case with a journaled filesystem.

I wouldn’t recommend doing that; if you are using ZFS then use it for both data and WALs.



> On Mar 31, 2022, at 6:47 PM, Holger Jakobs <holger@jakobs.com> wrote:
>
> The WAL is a journal itself and doesn't need another journal for safety. Therefore, a common recommendation is using ext2 (which has no journal) for the WAL partition.
>
> Is this correct?
>
> On 31.03.22 at 23:32, Rui DeSousa wrote:
>> I would recommend a separate pg_wal filesystem with the record size to match the WAL page size; in my case 16k.  I have kept the default record size at 128k for the data volume and that configuration has worked well for supporting large DSS while using 16k data blocks.
>>
>>> On Mar 30, 2022, at 5:32 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>>>
>>> I've read all the info I could find re running PG on ZFS: turn off full page writes, turn on lz4, tweak recordsize so as to take advantage of compression, etc. One thing I haven't seen is whether a separate volume for WAL would benefit from a larger recordsize. Or any other tweaks???
>>>
>>> --
>>> Scott Ribe
>>> scott_ribe@elevated-dev.com
>>> https://www.linkedin.com/in/scottribe/
>>>
>>>
>>>
>>>
>>>
>>
>>
> --
> Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
>




Re: WAL & ZFS

From:
Rui DeSousa
Date:

I call bullshit. I have used ZFS with better performance, and it offers a higher degree of manageability of the system.

I looked at one of the articles you pointed out and it was using RAIDZ2; well, there you go. I would not recommend using anything other than RAID10, i.e. a stripe of mirrors.

If you’re using RAIDZ# then performance is going to be heavily impacted, and I would highly recommend NOT using RAIDZ# for a database server.
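
For reference, a stripe of mirrors is a one-liner at pool creation time (device names are hypothetical; a sketch):

    # RAID10-style: two mirrored pairs striped together
    zpool create tank mirror sda sdb mirror sdc sdd
    zpool status tank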


Re: WAL & ZFS

From:
Scott Ribe
Date:
> On Apr 1, 2022, at 11:49 AM, Rui DeSousa <rui@crazybean.net> wrote:
>
> If you’re using RAIDZ# then performance is going to be heavily impacted and I would highly recommend NOT using RAIDZ# for a database server.

Actually, I found even the performance of RAIDZ1 to be acceptable after appropriate configuration--current versions, lz4, etc.


Re: WAL & ZFS

From:
Rui DeSousa
Date:

> On Apr 1, 2022, at 1:56 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
>> On Apr 1, 2022, at 11:49 AM, Rui DeSousa <rui@crazybean.net> wrote:
>>
>> If you’re using RAIDZ# then performance is going to be heavily impacted and I would highly recommend NOT using RAIDZ# for a database server.
>
> Actually, I found even the performance of RAIDZ1 to be acceptable after appropriate configuration--current versions, lz4, etc.
>

It might be for a low-iops system; however, I would still recommend against it. I haven’t used RAIDZ in years; it might be good for an archive system, but I don’t see the value of it in a production database server. You also have to account for drive failures and replacement time. A replacement in a RAIDZ configuration is much more expensive than replacing a disk in a mirrored set. Disks today are larger as well, and the risk of another failure during a rebuild is exponentially increased, thus the need for RAIDZ2 and RAIDZ3.

Personally, and for logical reasons, I would build a RAIDZ in powers of 2, i.e. 2, 4, or 8 drives plus parity, and then have a stripe set of RAIDZ2. So the first option would require 4 drives (2D + 2P) and would have the same storage as a RAID10 configuration; however, the RAID10 would perform better under load. The 4+p option seems to be the sweet spot, as the rebuild times on larger sets are not worth it, nor is it worth spreading 128k out over 8 drives. Of course one could use a larger record size, but would you want to? For me, 128k/16k is only 8 database blocks; reminds me of using Oracle’s readahead=8 option :).

Note: raidz does not always stripe across all drives in the set like a traditional raid set does; i.e., it might only use 2+p instead of 8+p as configured, depending on the size of the current ZFS record being written out and on free space.


Re: WAL & ZFS

From:
Scott Ribe
Date:
> On Apr 1, 2022, at 2:23 PM, Rui DeSousa <rui@crazybean.net> wrote:
>
> Personally, and for logical reasons, I would build a RAIDZ in powers of 2, i.e. 2, 4, or 8 drives plus parity, and then have a stripe set of RAIDZ2.

I think your advice is all good for larger systems. For the time being, I am using 4 drives because of purchasing decisions made before I got here, and it turns out that RAIDZ1 is fast enough for our use, so in this case it makes sense for the larger space. And rebuild time is not terribly important, since we have enough boxes to have multiple streaming replicas.