Thread: O_DIRECT in freebsd

O_DIRECT in freebsd

From
Christopher Kings-Lynne
Date:
FreeBSD 4.9 was released today.  In the release notes was:

2.2.6 File Systems

A new DIRECTIO kernel option enables support for read operations that 
bypass the buffer cache and put data directly into a userland buffer. 
This feature requires that the O_DIRECT flag is set on the file 
descriptor and that both the offset and length for the read operation 
are multiples of the physical media sector size.

Is that of any use?

Chris
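
The alignment rule in the release note (offset and length both multiples of the sector size) can be illustrated with a little arithmetic. This is only a sketch: the 512-byte default sector size and the helper names `is_direct_io_compatible` and `aligned_range` are assumptions made for the example, not anything taken from FreeBSD or PostgreSQL.

```python
from typing import Tuple


def is_direct_io_compatible(offset: int, length: int, sector_size: int = 512) -> bool:
    """True if both offset and length are multiples of the sector size."""
    return offset % sector_size == 0 and length % sector_size == 0


def aligned_range(offset: int, length: int, sector_size: int = 512) -> Tuple[int, int]:
    """Return (offset, length) of the smallest sector-aligned range that
    covers the requested bytes [offset, offset + length)."""
    if sector_size <= 0 or offset < 0 or length < 0:
        raise ValueError("sector_size must be positive; offset and length non-negative")
    start = (offset // sector_size) * sector_size               # round down
    end = -(-(offset + length) // sector_size) * sector_size    # round up
    return start, end - start
```

A caller that wants bytes 1000..1099 would have to read the aligned range returned by `aligned_range(1000, 100)` and discard the surplus. An 8 kB page read at an 8 kB boundary already satisfies the rule for any sector size that divides 8192.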



Re: O_DIRECT in freebsd

From
Doug McNaught
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

> FreeBSD 4.9 was released today.  In the release notes was:
> 
> 2.2.6 File Systems
> 
> A new DIRECTIO kernel option enables support for read operations that
> bypass the buffer cache and put data directly into a userland
> buffer. This feature requires that the O_DIRECT flag is set on the
> file descriptor and that both the offset and length for the read
> operation are multiples of the physical media sector size.
> 
> Is that of any use?

Linux and Solaris have had this for a while.  I'm pretty sure it's
been discussed before--search the archives.  I think the consensus
was that it might be useful for WAL writes, but would be a fair amount
of work and would introduce portability issues...

-Doug


Re: O_DIRECT in freebsd

From
"scott.marlowe"
Date:
On 29 Oct 2003, Doug McNaught wrote:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> 
> > FreeBSD 4.9 was released today.  In the release notes was:
> > 
> > 2.2.6 File Systems
> > 
> > A new DIRECTIO kernel option enables support for read operations that
> > bypass the buffer cache and put data directly into a userland
> > buffer. This feature requires that the O_DIRECT flag is set on the
> > file descriptor and that both the offset and length for the read
> > operation are multiples of the physical media sector size.
> > 
> > Is that of any use?
> 
> Linux and Solaris have had this for a while.  I'm pretty sure it's
> been discussed before--search the archives.  I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

I would think the biggest savings could come from using directIO for 
vacuuming, so it doesn't cause the kernel to flush buffers.

Would that be just as hard to implement?  



Re: O_DIRECT in freebsd

From
Doug McNaught
Date:
"scott.marlowe" <scott.marlowe@ihs.com> writes:

> I would think the biggest savings could come from using directIO for 
> vacuuming, so it doesn't cause the kernel to flush buffers.
> 
> Would that be just as hard to implement?  

Two words: "cache coherency".

-Doug


Re: O_DIRECT in freebsd

From
Tom Lane
Date:
Doug McNaught <doug@mcnaught.org> writes:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> A new DIRECTIO kernel option enables support for read operations that
>> bypass the buffer cache and put data directly into a userland
>> buffer. This feature requires that the O_DIRECT flag is set on the
>> file descriptor and that both the offset and length for the read
>> operation are multiples of the physical media sector size.

> Linux and Solaris have had this for a while.  I'm pretty sure it's
> been discussed before--search the archives.  I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

Not for WAL --- we never read the WAL at all in normal operation.  (If
it works for writes, then we would want to use it for writing WAL, but
that's not apparent from what Christopher quoted.)

IIRC there was speculation that this would be useful for large seqscans
and for vacuuming.  It'd take some hacking to propagate the knowledge of
that context down to where the fopen occurs, though.
        regards, tom lane


Re: O_DIRECT in freebsd

From
Manfred Spraul
Date:
Tom Lane wrote:

> Not for WAL --- we never read the WAL at all in normal operation.  (If
> it works for writes, then we would want to use it for writing WAL, but
> that's not apparent from what Christopher quoted.)
>
At least under Linux, it works for writes. Oracle uses O_DIRECT to 
access (both read and write) disks that are shared between multiple 
nodes in a cluster - their database kernel must know when the data is 
visible to the other nodes.
One problem for WAL is that O_DIRECT would disable the write cache - 
each operation would block until the data arrived on disk, and that 
might block other backends that try to access WALWriteLock.
Perhaps a dedicated backend that does the writeback could fix that.

Has anyone tried to use posix_fadvise for the wal logs?
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

Linux supports posix_fadvise, it seems to be part of xopen2k.

--   Manfred
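
The "dedicated backend that does the writeback" idea can be sketched roughly as follows. This is a toy model, not PostgreSQL code: a thread and a queue stand in for the dedicated backend, the class name `LogWriter` and its methods are invented for the example, and plain `write` plus `fsync` stands in for whatever synchronous write path (O_DIRECT or otherwise) would really be used.

```python
import os
import queue
import threading


class LogWriter:
    """Hypothetical dedicated writer: callers enqueue records, and a single
    thread performs the blocking write + fsync on their behalf."""

    def __init__(self, path: str) -> None:
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, record: bytes) -> threading.Event:
        """Queue a record; the returned event is set once it is on disk."""
        done = threading.Event()
        self._queue.put((record, done))
        return done

    def _run(self) -> None:
        stop = False
        while not stop:
            item = self._queue.get()          # block until there is work
            if item is None:
                break
            batch = [item]
            # Drain anything else already queued so one fsync covers it all.
            while True:
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            view = memoryview(b"".join(rec for rec, _ in batch))
            while len(view) > 0:              # handle partial writes
                written = os.write(self._fd, view)
                view = view[written:]
            os.fsync(self._fd)
            for _, done in batch:
                done.set()

    def close(self) -> None:
        self._queue.put(None)                 # sentinel: finish and exit
        self._thread.join()
        os.close(self._fd)
```

A caller only waits on the returned event when it actually needs durability (for example at commit), so the slow synchronous write is not performed while the caller holds a shared lock.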



Re: O_DIRECT in freebsd

From
Greg Stark
Date:
Manfred Spraul <manfred@colorfullife.com> writes:

> One problem for WAL is that O_DIRECT would disable the write cache -
> each operation would block until the data arrived on disk, and that might block
> other backends that try to access WALWriteLock.
> Perhaps a dedicated backend that does the writeback could fix that.

aio seems a better fit.

> Has anyone tried to use posix_fadvise for the wal logs?
> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
> 
> Linux supports posix_fadvise, it seems to be part of xopen2k.

Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
using to do this tweaking.

This is the only option that seems useful for postgres for both the WAL and
vacuum (though in other threads it seems the problems with vacuum lie
elsewhere):

       POSIX_FADV_DONTNEED attempts to free cached pages associated with the
       specified region. This is useful, for example, while streaming large
       files. A program may periodically request the kernel to free cached
       data that has already been used, so that more useful cached pages are
       not discarded instead.

       Pages that have not yet been written out will be unaffected, so if the
       application wishes to guarantee that pages will be released, it should
       call fsync or fdatasync first.

Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
backend before starting a sequential scan or index scan, but I kind of doubt
it.

-- 
greg
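
The "streaming large files" case from the quoted man page text might look roughly like this. It is a sketch under assumptions: the function name `stream_file`, the chunk size, and the `drop_every` interval are made up for illustration, and the advice calls are treated as purely optional hints (skipped where `os.posix_fadvise` is unavailable, ignored if the kernel rejects them).

```python
import os


def _advise(fd: int, offset: int, length: int, advice_name: str) -> None:
    """Issue posix_fadvise if the platform has it; the hint is best-effort."""
    if not hasattr(os, "posix_fadvise"):
        return
    try:
        os.posix_fadvise(fd, offset, length, getattr(os, advice_name))
    except OSError:
        pass  # advice only; a refusal does not affect correctness


def stream_file(path: str, chunk_size: int = 1 << 20, drop_every: int = 8) -> int:
    """Read a file front to back and periodically ask the kernel to drop the
    pages already consumed. Returns the number of bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        _advise(fd, 0, 0, "POSIX_FADV_SEQUENTIAL")   # length 0 = to end of file
        total = 0
        dropped_upto = 0
        chunks = 0
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            total += len(data)
            chunks += 1
            if chunks % drop_every == 0:
                # Release only the range that has already been processed.
                _advise(fd, dropped_upto, total - dropped_upto, "POSIX_FADV_DONTNEED")
                dropped_upto = total
        return total
    finally:
        os.close(fd)
```

Whether this helps a sequential scan or vacuum in practice is exactly the open question in this thread; the sketch only shows the call pattern.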



Re: O_DIRECT in freebsd

From
Manfred Spraul
Date:
Greg Stark wrote:

> Manfred Spraul <manfred@colorfullife.com> writes:
>
>> One problem for WAL is that O_DIRECT would disable the write cache -
>> each operation would block until the data arrived on disk, and that might block
>> other backends that try to access WALWriteLock.
>> Perhaps a dedicated backend that does the writeback could fix that.
>
> aio seems a better fit.
>
>> Has anyone tried to use posix_fadvise for the wal logs?
>> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>>
>> Linux supports posix_fadvise, it seems to be part of xopen2k.
>
> Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
> using to do this tweaking.

At least in 2.6: linux/mm/fadvise.c, the syscall is fadvise64 or fadvise64_64

>This is the only option that seems useful for postgres for both the WAL and
>vacuum (though in other threads it seems the problems with vacuum lie
>elsewhere):
>
>       POSIX_FADV_DONTNEED attempts to free cached pages associated with the
>       specified region. This is useful, for example, while streaming large
>       files. A program may periodically request the kernel to free cached
>       data that has already been used, so that more useful cached pages are
>       not discarded instead.
>
>       Pages that have not yet been written out will be unaffected, so if the
>       application wishes to guarantee that pages will be released, it should
>       call fsync or fdatasync first.
>  
>
I agree. Either immediately after each flush syscall, or just before 
closing a log file and switching to the next.

>Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
>backend before starting a sequential scan or index scan, but I kind of doubt
>it.
>  
>
IIRC the recommendation is ~20% total memory for the postgres user space 
buffers. That's quite a lot - it might be sufficient to protect that 
cache from vacuum or sequential scans. AddBufferToFreeList already 
contains a comment that this is the right place to try buffer 
replacement strategies.

--   Manfred
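
The "flush, then POSIX_FADV_DONTNEED" sequence discussed above for log files could be sketched like this. The function names `flush_and_drop` and `write_segment` are hypothetical, and the example drops the cache once just before closing the file, which is one of the two placements suggested in this message; it is an illustration, not a description of what PostgreSQL does.

```python
import os
from typing import Iterable


def flush_and_drop(fd: int) -> bool:
    """fsync first (unwritten pages are unaffected by DONTNEED), then ask the
    kernel to release the file's cached pages. Returns True if the advice
    call was issued successfully."""
    os.fsync(fd)
    if not hasattr(os, "posix_fadvise"):
        return False                      # platform without posix_fadvise
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # 0, 0 = whole file
    except OSError:
        return False                      # advice refused; data is still safe
    return True


def write_segment(path: str, records: Iterable[bytes]) -> int:
    """Write records to a log segment, then flush and drop before closing.
    Returns the number of bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        for rec in records:
            view = memoryview(rec)
            while len(view) > 0:          # handle partial writes
                n = os.write(fd, view)
                view = view[n:]
                written += n
        flush_and_drop(fd)
        return written
    finally:
        os.close(fd)
```

Calling `flush_and_drop` after every flush instead would be the other placement; it trades more system calls for keeping fewer log pages in the cache at any time.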



Re: O_DIRECT in freebsd

From
Sailesh Krishnamurthy
Date:
DB2 supports cooked and raw file systems - SMS (System Managed Space)
and DMS (Database Managed Space) tablespaces. 

The DB2 experience is that DMS tends to outperform SMS but requires
considerable tuning and administrative overhead to see these wins. 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: O_DIRECT in freebsd

From
Jordan Henderson
Date:
My experience with DB2 showed that properly set up DMS tablespaces provided a 
significant performance benefit.  I have also seen that the average DBA does 
not generally understand the data or access patterns in the database.  Given 
that, they don't correctly set up table spaces in general, filesystem or raw.  
Likewise, where it is possible to tie a tablespace to a memory buffer pool, 
the average DBA does not set it up to a performance advantage either.  
However, are we talking about well tuned setups by someone who does 
understand the data and the general access patterns?  For a DBA like that, 
they should be able to take advantage of these features and get significantly 
better results.  I would not say it requires considerable tuning, but an 
understanding of data, storage and access patterns.  Additionally, these 
features did not cause our group considerable administrative overhead.

Jordan Henderson

On Thursday 30 October 2003 12:55, Sailesh Krishnamurthy wrote:
> DB2 supports cooked and raw file systems - SMS (System Managed Space)
> and DMS (Database Managed Space) tablespaces.
>
> The DB2 experience is that DMS tends to outperform SMS but requires
> considerable tuning and administrative overhead to see these wins.



Re: O_DIRECT in freebsd

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Jordan Henderson [mailto:jordan_henders@yahoo.com]
> Sent: Thursday, October 30, 2003 4:31 PM
> To: sailesh@cs.berkeley.edu; Doug McNaught
> Cc: Christopher Kings-Lynne; PostgreSQL-development
> Subject: Re: [HACKERS] O_DIRECT in freebsd
>
> My experience with DB2 showed that properly set up DMS tablespaces provided a
> significant performance benefit.  I have also seen that the average DBA does
> not generally understand the data or access patterns in the database.  Given
> that, they don't correctly set up table spaces in general, filesystem or raw.
> Likewise, where it is possible to tie a tablespace to a memory buffer pool,
> the average DBA does not set it up to a performance advantage either.
> However, are we talking about well tuned setups by someone who does
> understand the data and the general access patterns?  For a DBA like that,
> they should be able to take advantage of these features and get significantly
> better results.  I would not say it requires considerable tuning, but an
> understanding of data, storage and access patterns.  Additionally, these
> features did not cause our group considerable administrative overhead.

If it is possible for a human with knowledge of this domain to make good
decisions, it ought to be possible to store the same information into an
algorithm that operates off of collected statistics.  After some time
has elapsed, and an average access pattern of some sort has been
reached, the available resources could be divided in a fairly efficient
way.  It might be nice to be able to tweak it, but I would rather have
the computer make the calculations for me.

Just a thought.
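
As a very rough illustration of dividing resources from collected statistics, the sketch below splits a fixed number of buffer pages among tablespaces in proportion to observed access counts. Everything here is hypothetical: the function name `divide_buffers`, the idea that access counts are the only input, and the largest-remainder rounding are choices made for the example, not a proposal for how DB2 or PostgreSQL should do it.

```python
from typing import Dict


def divide_buffers(access_counts: Dict[str, int], total_pages: int) -> Dict[str, int]:
    """Split total_pages among tablespaces proportionally to access counts.
    Fractional pages are handed out by largest remainder so the shares
    always sum to total_pages."""
    if not access_counts:
        return {}
    observed = sum(access_counts.values())
    # With no statistics collected yet, fall back to an equal split.
    weights = {k: (v if observed else 1) for k, v in access_counts.items()}
    weight_sum = sum(weights.values())

    shares = {k: total_pages * w // weight_sum for k, w in weights.items()}
    leftover = total_pages - sum(shares.values())
    by_remainder = sorted(
        weights, key=lambda k: (total_pages * weights[k]) % weight_sum, reverse=True
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares
```

With counts of 900 and 100 accesses and 1000 pages, the split is 900 and 100 pages. A real system would presumably also weigh recency, working-set size, and the cost of a miss, which is where the manual tweaking mentioned above would come in.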


Re: O_DIRECT in freebsd

From
Sailesh Krishnamurthy
Date:
>>>>> "Jordan" == Jordan Henderson <jordan_henders@yahoo.com> writes:
    Jordan> significantly better results.  I would not say it requires
    Jordan> considerable tuning, but an understanding of data, storage
    Jordan> and access patterns.  Additionally, these features did not
    Jordan> cause our group considerable administrative overhead.

I won't dispute the specifics. I have only worked on the DB2 engine -
never written an app for it nor administered it. You're right - the
bottom line is that you can get a significant performance advantage
provided you care enough to understand what's going on. 

Anyway, I merely responded to provide a data point. Will PostgreSQL
users/administrators care for additional knobs or is there a
preference for "keep it simple, stupid" ?

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: O_DIRECT in freebsd

From
Jordan Henderson
Date:
Personally, I think it is useful to have features.  I quite understand the 
difficulties in maintaining some features however.  Also having worked on 
internals for commercial DB engines, I have seen specifically how code/data paths 
can be shortened.  I would not make the choice for someone to be forced into 
using a product in a specific manner.  Instead, I would let them decide 
whether to choose a simple setup or, if they are up to it, something with 
better performance.  I would not prune the options out.  In doing so, we 
limit what a knowledgeable person can do a priori.

Jordan Henderson

On Thursday 30 October 2003 19:59, Sailesh Krishnamurthy wrote:
> >>>>> "Jordan" == Jordan Henderson <jordan_henders@yahoo.com> writes:
>
>     Jordan> significantly better results.  I would not say it requires
>     Jordan> considerable tuning, but an understanding of data, storage
>     Jordan> and access patterns.  Additionally, these features did not
>     Jordan> cause our group considerable administrative overhead.
>
> I won't dispute the specifics. I have only worked on the DB2 engine -
> never written an app for it nor administered it. You're right - the
> bottom line is that you can get a significant performance advantage
> provided you care enough to understand what's going on.
>
> Anyway, I merely responded to provide a data point. Will PostgreSQL
> users/administrators care for additional knobs or is there a
> preference for "keep it simple, stupid" ?