Discussion: O_DIRECT in FreeBSD
FreeBSD 4.9 was released today. In the release notes was:

    2.2.6 File Systems

    A new DIRECTIO kernel option enables support for read operations that
    bypass the buffer cache and put data directly into a userland
    buffer. This feature requires that the O_DIRECT flag is set on the
    file descriptor and that both the offset and length for the read
    operation are multiples of the physical media sector size.

Is that of any use?

Chris

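The alignment rule in the release note can be made concrete with a small C sketch. It assumes a 512-byte sector purely for illustration (the real size depends on the device), and `direct_range` and `align_for_direct_read` are hypothetical names, not FreeBSD or PostgreSQL APIs. The helper only does the arithmetic: it widens an arbitrary byte range to the enclosing sector-aligned range that a read on an O_DIRECT descriptor would have to request.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical type: a sector-aligned byte range suitable for a direct read. */
typedef struct {
    off_t  offset;  /* rounded down to a multiple of the sector size */
    size_t length;  /* multiple of the sector size covering the request */
} direct_range;

/*
 * Hypothetical helper: widen [offset, offset + len) so that both the start
 * and the length are multiples of `sector`, as the release note requires.
 */
static direct_range align_for_direct_read(off_t offset, size_t len, size_t sector)
{
    assert(sector > 0 && offset >= 0);

    direct_range r;
    off_t start = offset - (offset % (off_t) sector);  /* round start down */
    off_t end   = offset + (off_t) len;
    off_t rem   = end % (off_t) sector;

    if (rem != 0)
        end += (off_t) sector - rem;                    /* round end up */

    r.offset = start;
    r.length = (size_t) (end - start);
    return r;
}
```

For example, a request for 100 bytes at offset 700 with 512-byte sectors becomes a 512-byte read at offset 512, and the caller then copies out the bytes it actually wanted. The userland buffer typically needs suitable alignment as well (for instance via `posix_memalign`), which this sketch leaves out.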
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

> FreeBSD 4.9 was released today. In the release notes was:
>
> 2.2.6 File Systems
>
> A new DIRECTIO kernel option enables support for read operations that
> bypass the buffer cache and put data directly into a userland
> buffer. This feature requires that the O_DIRECT flag is set on the
> file descriptor and that both the offset and length for the read
> operation are multiples of the physical media sector size.
>
> Is that of any use?

Linux and Solaris have had this for a while. I'm pretty sure it's
been discussed before--search the archives. I think the consensus
was that it might be useful for WAL writes, but would be a fair amount
of work and would introduce portability issues...

-Doug

On 29 Oct 2003, Doug McNaught wrote:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>
> > FreeBSD 4.9 was released today. In the release notes was:
> >
> > 2.2.6 File Systems
> >
> > A new DIRECTIO kernel option enables support for read operations that
> > bypass the buffer cache and put data directly into a userland
> > buffer. This feature requires that the O_DIRECT flag is set on the
> > file descriptor and that both the offset and length for the read
> > operation are multiples of the physical media sector size.
> >
> > Is that of any use?
>
> Linux and Solaris have had this for a while. I'm pretty sure it's
> been discussed before--search the archives. I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

I would think the biggest savings could come from using directIO for
vacuuming, so it doesn't cause the kernel to flush buffers.

Would that be just as hard to implement?

"scott.marlowe" <scott.marlowe@ihs.com> writes:

> I would think the biggest savings could come from using directIO for
> vacuuming, so it doesn't cause the kernel to flush buffers.
>
> Would that be just as hard to implement?

Two words: "cache coherency".

-Doug

Doug McNaught <doug@mcnaught.org> writes:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> A new DIRECTIO kernel option enables support for read operations that
>> bypass the buffer cache and put data directly into a userland
>> buffer. This feature requires that the O_DIRECT flag is set on the
>> file descriptor and that both the offset and length for the read
>> operation are multiples of the physical media sector size.
>
> Linux and Solaris have had this for a while. I'm pretty sure it's
> been discussed before--search the archives. I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

Not for WAL --- we never read the WAL at all in normal operation. (If
it works for writes, then we would want to use it for writing WAL, but
that's not apparent from what Christopher quoted.)

IIRC there was speculation that this would be useful for large seqscans
and for vacuuming. It'd take some hacking to propagate the knowledge of
that context down to where the fopen occurs, though.

regards, tom lane

Tom Lane wrote:

> Not for WAL --- we never read the WAL at all in normal operation. (If
> it works for writes, then we would want to use it for writing WAL, but
> that's not apparent from what Christopher quoted.)

At least under Linux, it works for writes. Oracle uses O_DIRECT to
access (both read and write) disks that are shared between multiple
nodes in a cluster - their database kernel must know when the data is
visible to the other nodes.

One problem for WAL is that O_DIRECT would disable the write cache -
each operation would block until the data arrived on disk, and that
might block other backends that try to access WALWriteLock.
Perhaps a dedicated backend that does the writeback could fix that.

Has anyone tried to use posix_fadvise for the WAL logs?
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

Linux supports posix_fadvise, it seems to be part of xopen2k.

--
Manfred

Manfred Spraul <manfred@colorfullife.com> writes:

> One problem for WAL is that O_DIRECT would disable the write cache -
> each operation would block until the data arrived on disk, and that might block
> other backends that try to access WALWriteLock.
> Perhaps a dedicated backend that does the writeback could fix that.

aio seems a better fit.

> Has anyone tried to use posix_fadvise for the wal logs?
> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>
> Linux supports posix_fadvise, it seems to be part of xopen2k.

Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
using to do this tweaking.

This is the only option that seems useful for postgres for both the WAL and
vacuum (though in other threads it seems the problems with vacuum lie
elsewhere):

    POSIX_FADV_DONTNEED attempts to free cached pages associated with the
    specified region. This is useful, for example, while streaming large
    files. A program may periodically request the kernel to free cached
    data that has already been used, so that more useful cached pages are
    not discarded instead.

    Pages that have not yet been written out will be unaffected, so if the
    application wishes to guarantee that pages will be released, it should
    call fsync or fdatasync first.

Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
backend before starting a sequential scan or index scan, but I kind of doubt
it.

--
greg

Greg Stark wrote:

> Manfred Spraul <manfred@colorfullife.com> writes:
>
>> One problem for WAL is that O_DIRECT would disable the write cache -
>> each operation would block until the data arrived on disk, and that might block
>> other backends that try to access WALWriteLock.
>> Perhaps a dedicated backend that does the writeback could fix that.
>
> aio seems a better fit.
>
>> Has anyone tried to use posix_fadvise for the wal logs?
>> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>>
>> Linux supports posix_fadvise, it seems to be part of xopen2k.
>
> Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
> using to do this tweaking.

At least in 2.6: linux/mm/fadvise.c, the syscall is fadvise64 or 64_64

> This is the only option that seems useful for postgres for both the WAL and
> vacuum (though in other threads it seems the problems with vacuum lie
> elsewhere):
>
>     POSIX_FADV_DONTNEED attempts to free cached pages associated with the
>     specified region. This is useful, for example, while streaming large
>     files. A program may periodically request the kernel to free cached
>     data that has already been used, so that more useful cached pages are
>     not discarded instead.
>
>     Pages that have not yet been written out will be unaffected, so if the
>     application wishes to guarantee that pages will be released, it should
>     call fsync or fdatasync first.

I agree. Either immediately after each flush syscall, or just before
closing a log file and switching to the next.

> Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
> backend before starting a sequential scan or index scan, but I kind of doubt
> it.

IIRC the recommendation is ~20% total memory for the postgres user space
buffers. That's quite a lot - it might be sufficient to protect that
cache from vacuum or sequential scans.

AddBufferToFreeList already contains a comment that this is the right
place to try buffer replacement strategies.

--
Manfred

DB2 supports cooked and raw file systems - SMS (System Managed Space)
and DMS (Database Managed Space) tablespaces.

The DB2 experience is that DMS tends to outperform SMS but requires
considerable tuning and administrative overhead to see these wins.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

My experience with DB2 showed that properly set up DMS tablespaces
provided a significant performance benefit. I have also seen that the
average DBA does not generally understand the data or access patterns in
the database. Given that, they don't correctly set up tablespaces in
general, filesystem or raw. Likewise, where it is possible to tie a
tablespace to a memory buffer pool, the average DBA does not set it up
to a performance advantage either.

However, are we talking about well tuned setups by someone who does
understand the data and the general access patterns? A DBA like that
should be able to take advantage of these features and get significantly
better results. I would not say it requires considerable tuning, but an
understanding of data, storage and access patterns. Additionally, these
features did not cause our group considerable administrative overhead.

Jordan Henderson

On Thursday 30 October 2003 12:55, Sailesh Krishnamurthy wrote:
> DB2 supports cooked and raw file systems - SMS (System Managed Space)
> and DMS (Database Managed Space) tablespaces.
>
> The DB2 experience is that DMS tends to outperform SMS but requires
> considerable tuning and administrative overhead to see these wins.

> -----Original Message-----
> From: Jordan Henderson [mailto:jordan_henders@yahoo.com]
> Sent: Thursday, October 30, 2003 4:31 PM
> To: sailesh@cs.berkeley.edu; Doug McNaught
> Cc: Christopher Kings-Lynne; PostgreSQL-development
> Subject: Re: [HACKERS] O_DIRECT in freebsd
>
> My experience with DB2 showed that properly set up DMS tablespaces
> provided a significant performance benefit. I have also seen that the
> average DBA does not generally understand the data or access patterns
> in the database. Given that, they don't correctly set up tablespaces
> in general, filesystem or raw. Likewise, where it is possible to tie a
> tablespace to a memory buffer pool, the average DBA does not set it up
> to a performance advantage either.
> However, are we talking about well tuned setups by someone who does
> understand the data and the general access patterns? A DBA like that
> should be able to take advantage of these features and get
> significantly better results. I would not say it requires considerable
> tuning, but an understanding of data, storage and access patterns.
> Additionally, these features did not cause our group considerable
> administrative overhead.

If it is possible for a human with knowledge of this domain to make good
decisions, it ought to be possible to encode the same knowledge in an
algorithm that operates on collected statistics.

After some time has elapsed, and an average access pattern of some sort
has been reached, the available resources could be divided in a fairly
efficient way.

It might be nice to be able to tweak it, but I would rather have the
computer make the calculations for me.

Just a thought.

>>>>> "Jordan" == Jordan Henderson <jordan_henders@yahoo.com> writes:

    Jordan> significantly better results. I would not say it requires
    Jordan> considerable tuning, but an understanding of data, storage
    Jordan> and access patterns. Additionally, these features did not
    Jordan> cause our group considerable administrative overhead.

I won't dispute the specifics. I have only worked on the DB2 engine -
never written an app for it nor administered it. You're right - the
bottom line is that you can get a significant performance advantage
provided you care enough to understand what's going on.

Anyway, I merely responded to provide a data point. Will PostgreSQL
users/administrators care for additional knobs or is there a
preference for "keep it simple, stupid"?

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

Personally, I think it is useful to have features. I quite understand
the difficulties in maintaining some features, however. Also, having
worked on internals for commercial DB engines, I have seen specifically
how code/data paths can be shortened. I would not make the choice for
someone to be forced into using a product in a specific manner. Instead,
I would let them decide whether to choose a simple setup or, if they are
up to it, something with better performance. I would not prune the
options out. In doing so, we limit a priori what a knowledgeable person
can do.

Jordan Henderson

On Thursday 30 October 2003 19:59, Sailesh Krishnamurthy wrote:
> >>>>> "Jordan" == Jordan Henderson <jordan_henders@yahoo.com> writes:
>
>     Jordan> significantly better results. I would not say it requires
>     Jordan> considerable tuning, but an understanding of data, storage
>     Jordan> and access patterns. Additionally, these features did not
>     Jordan> cause our group considerable administrative overhead.
>
> I won't dispute the specifics. I have only worked on the DB2 engine -
> never written an app for it nor administered it. You're right - the
> bottom line is that you can get a significant performance advantage
> provided you care enough to understand what's going on.
>
> Anyway, I merely responded to provide a data point. Will PostgreSQL
> users/administrators care for additional knobs or is there a
> preference for "keep it simple, stupid"?