Discussion: Safe/Fast I/O ...
Has anyone looked into this? I'm just getting ready to download it and play with it, see what's involved in using it. From what I can see, it's essentially an optimized stdio library... URL is at: http://www.research.att.com/sw/tools/sfio Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
On Sun, 12 Apr 1998, The Hermit Hacker wrote: > Has anyone looked into this? I'm just getting ready to download it and > play with it, see what's involved in using it. From what I can see, its > essentially an optimized stdio library... > > URL is at: http://www.research.att.com/sw/tools/sfio Using mmap and/or AIO would be better... FreeBSD and Solaris support AIO I believe. Given past trends Linux will as well. /* Matthew N. Dodd | A memory retaining a love you had for life winter@jurai.net | As cruel as it seems nothing ever seems to http://www.jurai.net/~winter | go right - FLA M 3.1:53 */
On Sun, 12 Apr 1998, Matthew N. Dodd wrote: > On Sun, 12 Apr 1998, The Hermit Hacker wrote: > > Has anyone looked into this? I'm just getting ready to download it and > > play with it, see what's involved in using it. From what I can see, its > > essentially an optimized stdio library... > > > > URL is at: http://www.research.att.com/sw/tools/sfio > > Using mmap and/or AIO would be better... > > FreeBSD and Solaris support AIO I believe. Given past trends Linux will > as well. I hate to have to ask, but how is MMAP or AIO better than sfio? I haven't had enough time to research any of this, and am just starting to look at it... Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
On Sun, 12 Apr 1998, The Hermit Hacker wrote: > I hate to have to ask, but how is MMAP or AIO better then sfio? I > haven't had enough time to research any of this, and am just starting to > look at it... If it's simple to compile, works as a drop-in replacement, AND is faster, I see no reason why PostgreSQL shouldn't try to link with it. Keep in mind, though, that in order to use MMAP or AIO you'd be restructuring the code to be more efficient rather than doing more of the same old thing but optimized. Only testing will prove me right or wrong though. :) /* Matthew N. Dodd | A memory retaining a love you had for life winter@jurai.net | As cruel as it seems nothing ever seems to http://www.jurai.net/~winter | go right - FLA M 3.1:53 */
On Sun, 12 Apr 1998, Matthew N. Dodd wrote: > On Sun, 12 Apr 1998, The Hermit Hacker wrote: > > I hate to have to ask, but how is MMAP or AIO better then sfio? I > > haven't had enough time to research any of this, and am just starting to > > look at it... > > If its simple to compile and works as a drop in replacement AND is faster, > I see no reason why PostgreSQL shouldn't try to link with it. That didn't really answer the question :( > Keep in mind though that in order to use MMAP or AIO you'd be > restructuring the code to be more efficient rather than doing more of the > same old thing but optimized. I don't know anything about AIO, so if you can give me a pointer to where I can read up on it, please do... ...but, with MMAP, unless I'm mistaken, you'd essentially be reading the file(s) into memory and then manipulating the file(s) there. Which means one helluva large amount of RAM being required...no? Using stdio vs sfio, to read a 1.2 million line file, the time to complete goes from 7sec to 5sec ... that makes for a substantial savings in time, if it's applicable. The problem, as I see it right now, is that the docs for it suck ... so, right now, I'm fumbling through figuring it all out :) Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Marc G. Fournier wrote: > On Sun, 12 Apr 1998, Matthew N. Dodd wrote: > > On Sun, 12 Apr 1998, The Hermit Hacker wrote: > > > I hate to have to ask, but how is MMAP or AIO better then sfio? I > > > haven't had enough time to research any of this, and am just starting to > > > look at it... > > > > If its simple to compile and works as a drop in replacement AND is faster, > > I see no reason why PostgreSQL shouldn't try to link with it. > > That didn't really answer the question :( > > > Keep in mind though that in order to use MMAP or AIO you'd be > > restructuring the code to be more efficient rather than doing more of the > > same old thing but optimized. > > I don't know anything about AIO, so if you can give me a pointer > to where I can read up on it, please do... > > ...but, with MMAP, unless I'm mistaken, you'd essentially be > reading the file(s) into memory and then manipulating the file(s) there. > Which means one helluva large amount of RAM being required...no? > > Using stdio vs sfio, to read a 1.2million line file, the time to > complete goes from 7sec to 5sec ... that makes for a substantial savings > in time, if its applicable. > > the problem, as I see it right now, is the docs for it suck ... > so, right now, I'm fumbling through figuring it all out :) One of the options when building perl5 is to use sfio instead of stdio. I haven't tried it, but they seem to think it works. That said, the only place I see this helping pgsql is in copyin and copyout, as these use the stdio interfaces: fread(), fwrite(), etc. Everywhere else we use the system call IO interfaces: read(), write(), recv(), send(), select() etc, and do our own buffering. My prediction is that sfio vs stdio will have undetectable impact on sql performance and only very minor impact on copyin and copyout (as most of the overhead is in pgsql, not libc). As far as IO goes, the problem we have is fsync().
To get rid of it means doing a real write-ahead log system and (maybe) aio to the log. As soon as we get real logging then we don't need to force datapages out so we can get rid of all the fsync and (given how slow we are otherwise) completely eliminate IO as a bottleneck. Pgsql was built for comfort, not for speed. Fine tuning and code tweaking and microoptimization are fine as far as they go. But there is probably a maximum 2x speed up to be had that way. Total. We need a 10x speedup to play with serious databases. This will take real architectural changes. If you are interested in what is necessary, I highly recommend the book "Transaction Processing" by Jim Gray (and someone whose name escapes me just now). It is a great big thing and will take a while to get through, but it is decently written and very well worth the time. It pretty much gives away the whole candy store as far as building high performance, reliable, and scalable database and TP systems. I wish it had been available 10 years ago when I got into the DB game. -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612 - Linux. Not because it is free. Because it is better.
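The write-ahead idea above can be sketched roughly as follows. This is only an illustration of the general technique, not PostgreSQL's design: the record layout, the page size `WAL_PAGE_SIZE`, and the function `wal_commit_page` are all hypothetical names made up for this sketch. The point it tries to show is that only the sequential log is forced to disk at commit, while the data page is written without fsync().

```c
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX declarations (pwrite, fsync) */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_PAGE_SIZE 8192       /* hypothetical page size for this sketch */

/* Hypothetical log record: which page changed, and its new contents. */
struct log_record {
    uint32_t page_no;
    char     image[WAL_PAGE_SIZE];
};

/* Commit one page update. log_fd is assumed to be opened with O_APPEND.
 * Returns 0 on success, -1 on error. */
int wal_commit_page(int log_fd, int data_fd, uint32_t page_no, const char *image)
{
    assert(image != NULL);

    struct log_record rec;
    rec.page_no = page_no;
    memcpy(rec.image, image, WAL_PAGE_SIZE);

    /* 1. Append the record to the log. */
    if (write(log_fd, &rec, sizeof rec) != (ssize_t) sizeof rec)
        return -1;

    /* 2. Force only the log; this single sequential sync makes the commit durable. */
    if (fsync(log_fd) != 0)
        return -1;

    /* 3. Write the data page without fsync(); the OS may flush it whenever it
     *    likes, because after a crash the page could be redone from the log. */
    if (pwrite(data_fd, image, WAL_PAGE_SIZE,
               (off_t) page_no * WAL_PAGE_SIZE) != WAL_PAGE_SIZE)
        return -1;

    return 0;
}
```

A real system would also need recovery code that replays the log, plus checkpoints so the log can be truncated; both are left out here.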
> > On Sun, 12 Apr 1998, The Hermit Hacker wrote: > > I hate to have to ask, but how is MMAP or AIO better then sfio? I > > haven't had enough time to research any of this, and am just starting to > > look at it... > > If its simple to compile and works as a drop in replacement AND is faster, > I see no reason why PostgreSQL shouldn't try to link with it. > > Keep in mind though that in order to use MMAP or AIO you'd be > restructuring the code to be more efficient rather than doing more of the > same old thing but optimized. > > Only testing will prove me right or wrong though. :) As David Gould mentioned, we need to do pre-fetching of data pages somehow. When doing a sequential scan on a table, the OS is doing a one-page prefetch, which is probably enough. The problem is index scans of the table. Those are not sequential in the main heap table (unless it is clustered on the index), so a prefetch would help here a lot. That is where we need async i/o. I am looking in BSDI, and I don't see any way to do async i/o. The only way I can think of doing it is via threads. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
> If you are interested in what is necessary, I highly recommend the book > "Transaction Processing" by Jim Gray (and someone whose name escapes me > just now). It is a great big thing and will take a while to get through, but > is is decently written and very well worth the time. It pretty much gives > away the whole candy store as far as building high performance, reliable, > and scalable database and TP systems. I wish it had been available 10 > years ago when I got into the DB game. David is 100% correct here. We need a major overhaul. He is also 100% correct about the book he is recommending. I got it last week, and was going to make a big pitch for this, but now that he has mentioned it again, let me support it. His quote, "It pretty much gives away the whole candy store...", is right on the mark. This book is big and meaty. Date has it listed in his bibliography, and says: If any computer science text ever deserved the epithet "instant classic," it is surely this one. Its size is daunting at first (over 1000 pages), but the authors display an enviable lightness of touch that makes even the driest aspects of the subject enjoyable reading. In their preface, they state their intent as being "to help...solve real problems"; the book is "pragmatic, covering basic transaction issues in considerable detail"; and the presentation "is full of code fragments showing...basic algorithm and data structures" and is not "encyclopedic." Despite this last claim, the book is (not surprisingly) comprehensive, and is surely destined to become the standard work. Strongly recommended. What more can I say? I will add this book recommendation to tools/FAQ_DEV. The book is not cheap, at ~$90. The book is "Transaction Processing: Concepts and Techniques," by Jim Gray and Andreas Reuter, Morgan Kaufmann publishers, ISBN 1-55860-190-2.
-- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
> As David Gould mentioned, we need to do pre-fetching of data pages > somehow. > > When doing a sequential scan on a table, the OS is doing a one-page > prefetch, which is probably enough. The problem is index scans of the > table. Those are not sequential in the main heap table (unless it is > clustered on the index), so a prefetch would help here a lot. > > That is where we need async i/o. I am looking in BSDI, and I don't see > any way to do async i/o. The only way I can think of doing it is via > threads. I found it. It is an fcntl option. From man fcntl: O_ASYNC Enable the SIGIO signal to be sent to the process group when I/O is possible, e.g., upon availability of data to be read. Who else supports this? -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
> When doing a sequential scan on a table, the OS is doing a one-page > prefetch, which is probably enough. The problem is index scans of the > table. Those are not sequential in the main heap table (unless it is > clustered on the index), so a prefetch would help here a lot. > > That is where we need async i/o. I am looking in BSDI, and I don't see > any way to do async i/o. The only way I can think of doing it is via > threads. O_ASYNC Enable the SIGIO signal to be sent to the process group when I/O is possible, e.g., upon availability of data to be read. Now I am questioning this. I am not sure this is actually for file i/o, or only tty i/o. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
On Sun, 12 Apr 1998, Bruce Momjian wrote: > > As David Gould mentioned, we need to do pre-fetching of data pages > > somehow. > > > > When doing a sequential scan on a table, the OS is doing a one-page > > prefetch, which is probably enough. The problem is index scans of the > > table. Those are not sequential in the main heap table (unless it is > > clustered on the index), so a prefetch would help here a lot. > > > > That is where we need async i/o. I am looking in BSDI, and I don't see > > any way to do async i/o. The only way I can think of doing it is via > > threads. > > I found it. It is an fcntl option. From man fcntl: > > O_ASYNC Enable the SIGIO signal to be sent to the process group when > I/O is possible, e.g., upon availability of data to be read. > > Who else supports this? FreeBSD... Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
>> When doing a sequential scan on a table, the OS is doing a one-page >> prefetch, which is probably enough. The problem is index scans of the >> table. Those are not sequential in the main heap table (unless it is >> clustered on the index), so a prefetch would help here a lot. >> >> That is where we need async i/o. I am looking in BSDI, and I don't see >> any way to do async i/o. The only way I can think of doing it is via >> threads. > > > O_ASYNC Enable the SIGIO signal to be sent to the process group when > I/O is possible, e.g., upon availability of data to be read. > >Now I am questioning this. I am not sure this acually for file i/o, or >only tty i/o. > async file calls: aio_cancel aio_error aio_read aio_return -- gets status of pending io call aio_suspend aio_write And yes the Gray book is great! Jordan Henderson
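As a rough sketch of how the calls listed above fit together, the following C helper queues a read with aio_read(), waits with aio_suspend(), and collects the result with aio_error() and aio_return(). The helper name `read_block_async` is hypothetical and exists only for this illustration. It assumes a system that implements the POSIX AIO option; on some older systems the program must also be linked against a realtime library (for example `-lrt` with older glibc).

```c
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX declarations */
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read nbytes at offset from path using POSIX AIO.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t read_block_async(const char *path, void *buf, size_t nbytes, off_t offset)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = nbytes;
    cb.aio_offset = offset;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;   /* no signal; we poll instead */

    if (aio_read(&cb) != 0) {    /* queues the request and returns immediately */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* A real caller would do other work here, e.g. process the previous page. */

    const struct aiocb *const pending[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        (void) aio_suspend(pending, 1, NULL);    /* sleep until done or interrupted */

    int err = aio_error(&cb);        /* 0 on success, otherwise an errno value */
    ssize_t n = aio_return(&cb);     /* called exactly once per request */
    close(fd);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

The read() call itself is not changed by any of this; the request is described in a `struct aiocb` and the caller decides when to wait for it. Waiting immediately, as this sketch does, gains nothing over a plain read(); the benefit only appears when useful work is done between queuing and waiting.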
On Sun, 12 Apr 1998, Bruce Momjian wrote: > I found it. It is an fcntl option. From man fcntl: > > O_ASYNC Enable the SIGIO signal to be sent to the process group when > I/O is possible, e.g., upon availability of data to be read. > > Who else supports this? FreeBSD and NetBSD appear to. Linux and Solaris appear not to. I was really speaking of the POSIX 1003.1B AIO/LIO calls when I originally brought this up. (aio_read/aio_write) /* Matthew N. Dodd | A memory retaining a love you had for life winter@jurai.net | As cruel as it seems nothing ever seems to http://www.jurai.net/~winter | go right - FLA M 3.1:53 */
> > async file calls: > aio_cancel > aio_error > aio_read > aio_return -- gets status of pending io call > aio_suspend > aio_write Can you elaborate on this? Does it cause a read() to return right away, and signal when data is ready? > > And yes the Gray book is great! > > Jordan Henderson > > -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
> > async file calls:
> >     aio_cancel
> >     aio_error
> >     aio_read
> >     aio_return -- gets status of pending io call
> >     aio_suspend
> >     aio_write
>
> Can you elaborate on this? Does it cause a read() to return right away,
> and signal when data is ready?

These are posix calls. Many systems support them and they are fairly easy to emulate (with threads or io processes) on systems that don't. If we are going to do Async IO, I suggest that we code to the posix interface and build emulators for the systems that don't have the posix calls.

I think there is an implementation of this for Linux, but it is a separate package, not part of the base system as far as I know. Of course with Linux anything you know it didn't do two weeks ago, it will do next week...

Here is the Solaris man page for aio_read() and aio_write:

-dg

-----------------------------------------------------------------------------
SunOS 5.5.1          Last change: 19 Aug 1993          1

aio_read(3R)          Realtime Library          aio_read(3R)

NAME
     aio_read, aio_write - asynchronous read and write operations

SYNOPSIS
     cc [ flag ... ] file ... -lposix4 [ library ... ]

     #include <aio.h>

     int aio_read(struct aiocb *aiocbp);
     int aio_write(struct aiocb *aiocbp);

     struct aiocb {
          int             aio_fildes;      /* file descriptor */
          volatile void   *aio_buf;        /* buffer location */
          size_t          aio_nbytes;      /* length of transfer */
          off_t           aio_offset;      /* file offset */
          int             aio_reqprio;     /* request priority offset */
          struct sigevent aio_sigevent;    /* signal number and offset */
          int             aio_lio_opcode;  /* listio operation */
     };

     struct sigevent {
          int             sigev_notify;    /* notification mode */
          int             sigev_signo;     /* signal number */
          union sigval    sigev_value;     /* signal value */
     };

     union sigval {
          int             sival_int;       /* integer value */
          void            *sival_ptr;      /* pointer value */
     };

MT-LEVEL
     MT-Safe

DESCRIPTION
     aio_read() queues an asynchronous read request, and returns control immediately. Rather than blocking until completion, the read operation continues concurrently with other activity of the process. Upon enqueuing the request, the calling process reads aiocbp->nbytes from the file referred to by aiocbp->fildes into the buffer pointed to by aiocbp->aio_buf. aiocbp->offset marks the absolute position from the beginning of the file (in bytes) at which the read begins.

     aio_write() queues an asynchronous write request, and returns control immediately. Rather than blocking until completion, the write operation continues concurrently with other activity of the process. Upon enqueuing the request, the calling process writes aiocbp->nbytes from the buffer pointed to by aiocbp->aio_buf into the file referred to by aiocbp->fildes. If O_APPEND is set for aiocbp->fildes, aio_write() operations append to the file in the same order as the calls were made. If O_APPEND is not set for the file descriptor, then the write operation will occur at the absolute position from the beginning of the file plus aiocbp->offset (in bytes).

     These asynchronous operations are submitted at a priority equal to the calling process' scheduling priority minus aiocbp->aio_reqprio.

     aiocb->aio_sigevent defines both the signal to be generated and how the calling process will be notified upon I/O completion. If aio_sigevent.sigev_notify is SIGEV_NONE, then no signal will be posted upon I/O completion, but the error status and the return status for the operation will be set appropriately. If aio_sigevent.sigev_notify is SIGEV_SIGNAL, then the signal specified in aio_sigevent.sigev_signo will be sent to the process. If the SA_SIGINFO flag is set for that signal number, then the signal will be queued to the process and the value specified in aio_sigevent.sigev_value will be the si_value component of the generated signal (see siginfo(5)).

RETURN VALUES
     If the I/O operation is successfully queued, aio_read() and aio_write() return 0, otherwise, they return -1, and set errno to indicate the error condition. aiocbp may be used as an argument to aio_error(3R) and aio_return(3R) in order to determine the error status and the return status of the asynchronous operation while it is proceeding.

ERRORS
     EAGAIN       The requested asynchronous I/O operation was not queued due to system resource limitations.

     ENOSYS       aio_read() or aio_write() is not supported by this implementation.

     EBADF        If the calling function is aio_read(), and aiocbp->fildes is not a valid file descriptor open for reading. If the calling function is aio_write(), and aiocbp->fildes is not a valid file descriptor open for writing.

     EINVAL       The file offset value implied by aiocbp->aio_offset would be invalid, aiocbp->aio_reqprio is not a valid value, or aiocbp->aio_nbytes is an invalid value.

     ECANCELED    The requested I/O was canceled before the I/O completed due to an explicit aio_cancel(3R) request.

     EINVAL       The file offset value implied by aiocbp->aio_offset would be invalid.

SEE ALSO
     close(2), exec(2), exit(2), fork(2), lseek(2), read(2), write(2), aio_cancel(3R), aio_return(3R), lio_listio(3R), siginfo(5)

NOTES
     For portability, the application should set aiocb->aio_reqprio to 0.

     Applications compiled under Solaris 2.3 and 2.4 and using POSIX aio must be recompiled to work correctly when Solaris supports the Asynchronous Input and Output option.

BUGS
     In Solaris 2.5, these functions always return -1 and set errno to ENOSYS, because this release does not support the Asynchronous Input and Output option. It is our intention
Bruce Momjian wrote: > > > > > On Sun, 12 Apr 1998, The Hermit Hacker wrote: > > > I hate to have to ask, but how is MMAP or AIO better then sfio? I > > > haven't had enough time to research any of this, and am just starting to > > > look at it... > > > > If its simple to compile and works as a drop in replacement AND is faster, > > I see no reason why PostgreSQL shouldn't try to link with it. > > > > Keep in mind though that in order to use MMAP or AIO you'd be > > restructuring the code to be more efficient rather than doing more of the > > same old thing but optimized. > > > > Only testing will prove me right or wrong though. :) > > As David Gould mentioned, we need to do pre-fetching of data pages > somehow. > > When doing a sequential scan on a table, the OS is doing a one-page > prefetch, which is probably enough. The problem is index scans of the > table. Those are not sequential in the main heap table (unless it is > clustered on the index), so a prefetch would help here a lot. > > That is where we need async i/o. I am looking in BSDI, and I don't see > any way to do async i/o. The only way I can think of doing it is via > threads. I have heard the glibc version 2.0 will support the Posix AIO spec. Solaris currently has AN implementation of AIO, but it is not the POSIX one. This prefetch could be done in another process or thread, rather than tying the code to a given AIO implementation. Ocie
The Hermit Hacker wrote: > ...but, with MMAP, unless I'm mistaken, you'd essentially be > reading the file(s) into memory and then manipulating the file(s) there. > Which means one helluva large amount of RAM being required...no? Not exactly. Memory mapping only maps the file into a range of memory addresses; it does not read the whole file into memory. Disk sectors are copied into memory on demand: when a mmapped page is accessed, it is copied from disk into memory. The main reason for using memory mapping is that you don't have to create unnecessary buffers. Normally, for every operation you have to create an in-memory buffer, copy the data there, do some operations, and put the data back into the file. With memory mapping you can avoid creating those buffers, and moreover you make fewer system calls. There are also additional savings (less memory copying, and memory is shared if several processes map the same file). I don't think a more efficient solution exists. Mike -- WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340 add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
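The on-demand behaviour described above can be illustrated with a small C sketch. The function name `count_lines_mmap` is hypothetical and chosen only for this example. It maps a file read-only and counts newlines by walking the mapping directly, with no read() calls and no private copy buffer; pages are brought in by the kernel as they are touched.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX declarations */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count newline characters in a file via mmap().
 * Returns the count, or -1 on error. */
long count_lines_mmap(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {           /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    size_t len = (size_t) st.st_size;

    /* This only sets up the address range; no file data has been read yet. */
    const char *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping stays valid after close() */
    if (base == MAP_FAILED)
        return -1;

    long lines = 0;
    for (size_t i = 0; i < len; i++) /* first touch of each page faults it in */
        if (base[i] == '\n')
            lines++;

    munmap((void *) base, len);
    return lines;
}
```

Whether this beats buffered reads for a given workload is something only measurement can settle; the sketch shows the mechanism, not a benchmark result.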