Discussion: Safe/Fast I/O ...

Safe/Fast I/O ...

From
The Hermit Hacker
Date:
Has anyone looked into this?  I'm just getting ready to download it and
play with it, see what's involved in using it.  From what I can see, it's
essentially an optimized stdio library...

URL is at: http://www.research.att.com/sw/tools/sfio

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] Safe/Fast I/O ...

From
"Matthew N. Dodd"
Date:
On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> Has anyone looked into this?  I'm just getting ready to download it and
> play with it, see what's involved in using it.  From what I can see, its
> essentially an optimized stdio library...
>
> URL is at: http://www.research.att.com/sw/tools/sfio

Using mmap and/or AIO would be better...

FreeBSD and Solaris support AIO I believe.  Given past trends Linux will
as well.

/*
   Matthew N. Dodd        | A memory retaining a love you had for life
   winter@jurai.net        | As cruel as it seems nothing ever seems to
   http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/


Re: [HACKERS] Safe/Fast I/O ...

From
The Hermit Hacker
Date:
On Sun, 12 Apr 1998, Matthew N. Dodd wrote:

> On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > Has anyone looked into this?  I'm just getting ready to download it and
> > play with it, see what's involved in using it.  From what I can see, its
> > essentially an optimized stdio library...
> >
> > URL is at: http://www.research.att.com/sw/tools/sfio
>
> Using mmap and/or AIO would be better...
>
> FreeBSD and Solaris support AIO I believe.  Given past trends Linux will
> as well.

    I hate to have to ask, but how is MMAP or AIO better than sfio?  I
haven't had enough time to research any of this, and am just starting to
look at it...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] Safe/Fast I/O ...

From
"Matthew N. Dodd"
Date:
On Sun, 12 Apr 1998, The Hermit Hacker wrote:
>     I hate to have to ask, but how is MMAP or AIO better then sfio?  I
> haven't had enough time to research any of this, and am just starting to
> look at it...

If it's simple to compile and works as a drop-in replacement AND is faster,
I see no reason why PostgreSQL shouldn't try to link with it.

Keep in mind though that in order to use MMAP or AIO you'd be
restructuring the code to be more efficient rather than doing more of the
same old thing but optimized.

Only testing will prove me right or wrong though. :)

/*
   Matthew N. Dodd        | A memory retaining a love you had for life
   winter@jurai.net        | As cruel as it seems nothing ever seems to
   http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/


Re: [HACKERS] Safe/Fast I/O ...

From
The Hermit Hacker
Date:
On Sun, 12 Apr 1998, Matthew N. Dodd wrote:

> On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> >     I hate to have to ask, but how is MMAP or AIO better then sfio?  I
> > haven't had enough time to research any of this, and am just starting to
> > look at it...
>
> If its simple to compile and works as a drop in replacement AND is faster,
> I see no reason why PostgreSQL shouldn't try to link with it.

    That didn't really answer the question :(

> Keep in mind though that in order to use MMAP or AIO you'd be
> restructuring the code to be more efficient rather than doing more of the
> same old thing but optimized.

    I don't know anything about AIO, so if you can give me a pointer
to where I can read up on it, please do...

    ...but, with MMAP, unless I'm mistaken, you'd essentially be
reading the file(s) into memory and then manipulating the file(s) there.
Which means one helluva large amount of RAM being required...no?

    Using stdio vs sfio to read a 1.2 million line file, the time to
complete goes from 7 sec to 5 sec ... that makes for a substantial savings
in time, if it's applicable.

    the problem, as I see it right now, is the docs for it suck ...
so, right now, I'm fumbling through figuring it all out :)

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] Safe/Fast I/O ...

From
dg@illustra.com (David Gould)
Date:
Marc G. Fournier wrote:
> On Sun, 12 Apr 1998, Matthew N. Dodd wrote:
> > On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > >     I hate to have to ask, but how is MMAP or AIO better then sfio?  I
> > > haven't had enough time to research any of this, and am just starting to
> > > look at it...
> >
> > If its simple to compile and works as a drop in replacement AND is faster,
> > I see no reason why PostgreSQL shouldn't try to link with it.
>
>     That didn't really answer the question :(
>
> > Keep in mind though that in order to use MMAP or AIO you'd be
> > restructuring the code to be more efficient rather than doing more of the
> > same old thing but optimized.
>
>     I don't know anything about AIO, so if you can give me a pointer
> to where I can read up on it, please do...
>
>     ...but, with MMAP, unless I'm mistaken, you'd essentially be
> reading the file(s) into memory and then manipulating the file(s) there.
> Which means one helluva large amount of RAM being required...no?
>
>     Using stdio vs sfio, to read a 1.2million line file, the time to
> complete goes from 7sec to 5sec ... that makes for a substantial savings
> in time, if its applicable.
>
>     the problem, as I see it right now, is the docs for it suck ...
> so, right now, I'm fumbling through figuring it all out :)

One of the options when building perl5 is to use sfio instead of stdio. I
haven't tried it, but they seem to think it works.

That said, the only place I see this helping pgsql is in copyin and copyout,
as these use the stdio interfaces: fread(), fwrite(), etc.

Everywhere else we use the system call IO interfaces: read(), write(),
recv(), send(), select() etc, and do our own buffering.

My prediction is that sfio vs stdio will have undetectable performance
impact on sql performance and only very minor impact on copyin, copyout (as
most of the overhead is in pgsql, not libc).

As far as IO, the problem we have is fsync(). To get rid of it means doing
a real writeahead log system and (maybe) aio to the log. As soon as we get
real logging then we don't need to force datapages out so we can get rid
of all the fsync and (given how slow we are otherwise) completely eliminate
IO as a bottleneck.
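
To make the fsync() point concrete, here is a minimal sketch of the write-ahead idea: a commit appends one record to a log and forces only the log to disk, while the data page itself is written in place with no fsync(). This illustrates the general technique only, not any existing pgsql code. The names `log_commit`, `write_page_lazily`, `struct log_record` and `SKETCH_PAGE_SIZE` are hypothetical, and crash recovery (replaying the log) is left out entirely.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size, for illustration only */

/* Hypothetical log record: which page changed, plus its new contents. */
struct log_record {
    uint32_t page_no;
    char     image[SKETCH_PAGE_SIZE];
};

/* Append one record to the log and force only the log to disk.
 * log_fd is assumed to be opened with O_APPEND.
 * Returns 0 on success, -1 on error. */
int log_commit(int log_fd, uint32_t page_no, const char *image)
{
    struct log_record rec;

    rec.page_no = page_no;
    memcpy(rec.image, image, SKETCH_PAGE_SIZE);

    if (write(log_fd, &rec, sizeof rec) != (ssize_t) sizeof rec)
        return -1;
    return fsync(log_fd);       /* the only synchronous write per commit */
}

/* Write the data page in place without fsync(). The OS may flush it
 * whenever it likes, because the log record is enough to redo the
 * change after a crash. Returns 0 on success, -1 on error. */
int write_page_lazily(int data_fd, uint32_t page_no, const char *image)
{
    off_t off = (off_t) page_no * SKETCH_PAGE_SIZE;

    if (pwrite(data_fd, image, SKETCH_PAGE_SIZE, off) != SKETCH_PAGE_SIZE)
        return -1;
    return 0;
}
```

The log is written sequentially, so the single fsync() per commit is cheap compared with forcing scattered data pages out one by one.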

Pgsql was built for comfort, not for speed. Fine tuning and code
tweaking and microoptimization are fine as far as they go. But there is
probably a maximum 2x speed up to be had that way. Total.

We need a 10x speedup to play with serious databases. This will take real
architectural changes.

If you are interested in what is necessary, I highly recommend the book
"Transaction Processing" by Jim Gray (and someone whose name escapes me
just now). It is a great big thing and will take a while to get through, but
it is decently written and very well worth the time. It pretty much gives
away the whole candy store as far as building high performance, reliable,
and scalable database and TP systems. I wish it had been available 10
years ago when I got into the DB game.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - Linux. Not because it is free. Because it is better.


Re: [HACKERS] Safe/Fast I/O ...

From
Bruce Momjian
Date:
>
> On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> >     I hate to have to ask, but how is MMAP or AIO better then sfio?  I
> > haven't had enough time to research any of this, and am just starting to
> > look at it...
>
> If its simple to compile and works as a drop in replacement AND is faster,
> I see no reason why PostgreSQL shouldn't try to link with it.
>
> Keep in mind though that in order to use MMAP or AIO you'd be
> restructuring the code to be more efficient rather than doing more of the
> same old thing but optimized.
>
> Only testing will prove me right or wrong though. :)

As David Gould mentioned, we need to do pre-fetching of data pages
somehow.

When doing a sequential scan on a table, the OS is doing a one-page
prefetch, which is probably enough.  The problem is index scans of the
table.  Those are not sequential in the main heap table (unless it is
clustered on the index), so a prefetch would help here a lot.

That is where we need async i/o.  I am looking in BSDI, and I don't see
any way to do async i/o.  The only way I can think of doing it is via
threads.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Book recommendation, was Re: [HACKERS] Safe/Fast I/O ...

From
Bruce Momjian
Date:
> If you are interested in what is necessary, I highly recommend the book
> "Transaction Processing" by Jim Gray (and someone whose name escapes me
> just now). It is a great big thing and will take a while to get through, but
> is is decently written and very well worth the time. It pretty much gives
> away the whole candy store as far as building high performance, reliable,
> and scalable database and TP systems. I wish it had been available 10
> years ago when I got into the DB game.

David is 100% correct here.  We need a major overhaul.

He is also 100% correct about the book he is recommending.  I got it
last week, and was going to make a big pitch for this, but now that he
has mentioned it again, let me support it.  His quote:

    It pretty much gives away the whole candy store...

is right on the mark.  This book is big, and meaty.  Date has it listed
in his bibliography, and says:

If any computer science text ever deserved the epithet "instant
classic," it is surely this one.  Its size is daunting at first (over
1000 pages), but the authors display an enviable lightness of touch that
makes even the driest aspects of the subject enjoyable reading.  In
their preface, they state their intent as being "to help...solve real
problems";  the book is "pragmatic, covering basic transaction issues in
considerable detail"; and the presentation "is full of code fragments
showing...basic algorithm and data structures" and is not
"encyclopedic."  Despite this last claim, the book is (not surprisingly)
comprehensive, and is surely destined to become the standard work.
Strongly recommended.

What more can I say.  I will add this book recommendation to
tools/FAQ_DEV.  The book is not cheap, at ~$90.

The book is "Transaction Processing:  Concepts and Techniques," by Jim
Gray and Andreas Reuter, Morgan Kaufmann publishers, ISBN 1-55860-190-2.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Safe/Fast I/O ...

From
Bruce Momjian
Date:
> As David Gould mentioned, we need to do pre-fetching of data pages
> somehow.
>
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough.  The problem is index scans of the
> table.  Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
>
> That is where we need async i/o.  I am looking in BSDI, and I don't see
> any way to do async i/o.  The only way I can think of doing it is via
> threads.

I found it.  It is an fcntl option.  From man fcntl:

     O_ASYNC      Enable the SIGIO signal to be sent to the process group when
                  I/O is possible, e.g., upon availability of data to be read.

Who else supports this?

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Safe/Fast I/O ...

From
Bruce Momjian
Date:
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough.  The problem is index scans of the
> table.  Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
>
> That is where we need async i/o.  I am looking in BSDI, and I don't see
> any way to do async i/o.  The only way I can think of doing it is via
> threads.


     O_ASYNC      Enable the SIGIO signal to be sent to the process group when
                  I/O is possible, e.g., upon availability of data to be read.

Now I am questioning this.  I am not sure this is actually for file i/o, or
only tty i/o.
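
For reference, this is roughly what turning on O_ASYNC looks like. It is a sketch only; `enable_sigio`, `on_sigio` and `sigio_seen` are made-up names. Note that SIGIO only reports that I/O is possible on the descriptor; it does not start a read in the background. On Linux, fcntl(2) documents it for terminals, sockets, pipes and FIFOs, which fits the doubt above about regular files.

```c
#define _DEFAULT_SOURCE         /* glibc hides O_ASYNC in strict ISO C modes */
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t io_ready = 0;

/* SIGIO handler: only records that the kernel reported readiness. */
static void on_sigio(int signo)
{
    (void) signo;
    io_ready = 1;
}

/* Returns nonzero once SIGIO has been delivered. */
int sigio_seen(void)
{
    return io_ready;
}

/* Ask the kernel to send SIGIO to this process when I/O becomes
 * possible on fd. Returns 0 on success, -1 on error. */
int enable_sigio(int fd)
{
    struct sigaction sa;
    int flags;

    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    if (fcntl(fd, F_SETOWN, getpid()) < 0)      /* who receives SIGIO */
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) < 0)
        return -1;
    return 0;
}
```

After `enable_sigio(fd)` succeeds, the process still has to issue an ordinary read() once the signal arrives, so this does not give the page prefetch discussed above by itself.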


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Safe/Fast I/O ...

From
The Hermit Hacker
Date:
On Sun, 12 Apr 1998, Bruce Momjian wrote:

> > As David Gould mentioned, we need to do pre-fetching of data pages
> > somehow.
> >
> > When doing a sequential scan on a table, the OS is doing a one-page
> > prefetch, which is probably enough.  The problem is index scans of the
> > table.  Those are not sequential in the main heap table (unless it is
> > clustered on the index), so a prefetch would help here a lot.
> >
> > That is where we need async i/o.  I am looking in BSDI, and I don't see
> > any way to do async i/o.  The only way I can think of doing it is via
> > threads.
>
> I found it.  It is an fcntl option.  From man fcntl:
>
>      O_ASYNC      Enable the SIGIO signal to be sent to the process group when
>                   I/O is possible, e.g., upon availability of data to be read.
>
> Who else supports this?

    FreeBSD...


Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] Safe/Fast I/O ...

From
Jordan Henderson
Date:
>> When doing a sequential scan on a table, the OS is doing a one-page
>> prefetch, which is probably enough.  The problem is index scans of the
>> table.  Those are not sequential in the main heap table (unless it is
>> clustered on the index), so a prefetch would help here a lot.
>>
>> That is where we need async i/o.  I am looking in BSDI, and I don't see
>> any way to do async i/o.  The only way I can think of doing it is via
>> threads.
>
>
>     O_ASYNC      Enable the SIGIO signal to be sent to the process group when
>                  I/O is possible, e.g., upon availability of data to be read.
>
>Now I am questioning this.  I am not sure this acually for file i/o, or
>only tty i/o.
>

async file calls:
    aio_cancel
    aio_error
    aio_read
    aio_return -- gets status of pending io call
    aio_suspend
    aio_write

And yes the Gray book is great!

Jordan Henderson


Re: [HACKERS] Safe/Fast I/O ...

From
"Matthew N. Dodd"
Date:
On Sun, 12 Apr 1998, Bruce Momjian wrote:
> I found it.  It is an fcntl option.  From man fcntl:
>
>      O_ASYNC      Enable the SIGIO signal to be sent to the process group when
>                   I/O is possible, e.g., upon availability of data to be read.
>
> Who else supports this?

FreeBSD and NetBSD appear to.

Linux and Solaris appear not to.

I was really speaking of the POSIX 1003.1B AIO/LIO calls when I originally
brought this up. (aio_read/aio_write)

/*
   Matthew N. Dodd        | A memory retaining a love you had for life
   winter@jurai.net        | As cruel as it seems nothing ever seems to
   http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/


Re: [HACKERS] Safe/Fast I/O ...

From
Bruce Momjian
Date:
>
> async file calls:
>     aio_cancel
>     aio_error
>     aio_read
>     aio_return -- gets status of pending io call
>     aio_suspend
>     aio_write

Can you elaborate on this?  Does it cause a read() to return right away,
and signal when data is ready?

>
> And yes the Gray book is great!
>
> Jordan Henderson
>
>


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Safe/Fast I/O ...

From
dg@illustra.com (David Gould)
Date:
> > async file calls:
> >     aio_cancel
> >     aio_error
> >     aio_read
> >     aio_return -- gets status of pending io call
> >     aio_suspend
> >     aio_write
>
> Can you elaborate on this?  Does it cause a read() to return right away,
> and signal when data is ready?


These are posix calls. Many systems support them and they are fairly easy
to emulate (with threads or io processes) on systems that don't. If we
are going to do Async IO, I suggest that we code to the posix interface and
build emulators for the systems that don't have the posix calls.
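
To answer the question directly: yes, aio_read() returns right away, and a signal on completion is optional. Here is a minimal sketch of how the calls fit together, assuming a system that provides the POSIX interface in `<aio.h>`. The helper name `read_overlapped` and its `other_work` callback are invented for illustration; this version polls with aio_error()/aio_suspend() instead of asking for a signal.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to len bytes at offset from path, running other_work()
 * (may be NULL) while the read is in flight.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t read_overlapped(const char *path, void *buf, size_t len,
                        off_t offset, void (*other_work)(void))
{
    struct aiocb cb;
    const struct aiocb *pending[1];
    ssize_t n;
    int err;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal; we poll */

    if (aio_read(&cb) < 0) {        /* queues the request, returns at once */
        close(fd);
        return -1;
    }

    if (other_work)
        other_work();               /* overlaps with the pending read */

    pending[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(pending, 1, NULL);  /* sleep until done; retry on EINTR */

    err = aio_error(&cb);           /* 0 on success, else an errno value */
    n = aio_return(&cb);            /* final status; call exactly once */
    close(fd);

    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

On some systems the aio functions live in a separate library (the Solaris page below links with -lposix4), so the link line may need adjusting.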

I think there is an implementation of this for Linux, but it is a separate
package, not part of the base system as far as I know. Of course with Linux
anything you know it didn't do two weeks ago, it will do next week...

Here is the Solaris man page for aio_read() and aio_write():

-dg

-----------------------------------------------------------------------------


SunOS 5.5.1         Last change: 19 Aug 1993                    1
aio_read(3R)            Realtime Library             aio_read(3R)


NAME
     aio_read, aio_write - asynchronous read and write operations

SYNOPSIS
     cc [ flag ... ] file ...  -lposix4 [ library ... ]

     #include <aio.h>

     int aio_read(struct aiocb *aiocbp);

     int aio_write(struct aiocb *aiocbp);

     struct aiocb {
        int               aio_fildes;     /* file descriptor */
        volatile void     *aio_buf;       /* buffer location */
        size_t            aio_nbytes;     /* length of transfer */
        off_t             aio_offset;     /* file offset */
        int               aio_reqprio;    /* request priority offset */
        struct sigevent   aio_sigevent;   /* signal number and offset */
        int               aio_lio_opcode; /* listio operation */
     };

     struct sigevent {
        int               sigev_notify;   /* notification mode */
        int               sigev_signo;    /* signal number */
        union sigval      sigev_value;    /* signal value */
     };

     union sigval {
        int               sival_int;      /* integer value */
        void              *sival_ptr;     /* pointer value */
     };

MT-LEVEL
     MT-Safe

DESCRIPTION
     aio_read() queues an asynchronous read request, and  returns
     control immediately.  Rather than blocking until completion,
     the  read  operation  continues  concurrently   with   other
     activity of the process.

     Upon  enqueuing  the  request,  the  calling  process  reads
     aiocbp->nbytes  from  the file referred to by aiocbp->fildes
     into the buffer pointed to by aiocbp->aio_buf.
     aiocbp->offset marks the absolute position from the beginning
     of the file (in bytes) at which the read begins.

     aio_write()  queues  an  asynchronous  write  request,   and
     returns  control  immediately.   Rather  than blocking until
     completion, the write operation continues concurrently  with
     other activity of the process.

     Upon enqueuing the request, the calling process writes
     aiocbp->nbytes from the buffer pointed to by aiocbp->aio_buf
     into the file referred to by aiocbp->fildes.  If O_APPEND is
     set for aiocbp->fildes, aio_write() operations append to the
     file in the same order as the calls were made.

     If O_APPEND is not set for the  file  descriptor,  then  the
     write operation will occur at the absolute position from the
     beginning of the file plus aiocbp->offset (in bytes).

     These asynchronous operations are submitted  at  a  priority
     equal  to  the  calling  process'  scheduling priority minus
     aiocbp->aio_reqprio.

     aiocbp->aio_sigevent defines both the signal to be generated
     and how the calling process will be notified upon I/O
     completion.  If aio_sigevent.sigev_notify is SIGEV_NONE, then
     no signal will be posted upon I/O completion, but the error
     status and the return status for the operation will be set
     appropriately.  If aio_sigevent.sigev_notify is SIGEV_SIGNAL,
     then the signal specified in aio_sigevent.sigev_signo will be
     sent to the process.  If the SA_SIGINFO flag is set for that
     signal number, then the signal will be queued to the process
     and the value specified in aio_sigevent.sigev_value will be
     the si_value component of the generated signal (see
     siginfo(5)).

RETURN VALUES
     If the I/O operation is successfully queued, aio_read()  and
     aio_write()  return  0,  otherwise,  they return -1, and set
     errno to indicate the error condition.  aiocbp may  be  used
     as  an argument to aio_error(3R) and aio_return(3R) in order
     to determine the error status and the return status  of  the
     asynchronous operation while it is proceeding.

ERRORS
     EAGAIN         The requested asynchronous I/O operation was
                    not queued due to system resource limitations.

     ENOSYS         aio_read() or aio_write() is not supported by
                    this implementation.

     EBADF          If the calling function  is  aio_read(),  and
                    aiocbp->fildes is not a valid file descriptor
                    open for reading.  If the calling function is
                    aio_write(),  and  aiocbp->fildes  is  not  a
                    valid file descriptor open for writing.

     EINVAL         The file offset value implied by aiocbp->aio_offset
                    would be invalid,
                    aiocbp->aio_reqprio is not a  valid  value,
                    or aiocbp->aio_nbytes is an invalid value.

     ECANCELED      The requested I/O was canceled before the I/O
                    completed  due  to an explicit aio_cancel(3R)
                    request.

     EINVAL         The file offset value implied by
                    aiocbp->aio_offset would be invalid.

SEE ALSO
     close(2),  exec(2),  exit(2),  fork(2),  lseek(2),  read(2),
     write(2),  aio_cancel(3R),  aio_return(3R),  lio_listio(3R),
     siginfo(5)

NOTES
     For portability, the application should set aiocbp->aio_reqprio
     to 0.

     Applications compiled under Solaris 2.3 and  2.4  and  using
     POSIX  aio must be recompiled to work correctly when Solaris
     supports the Asynchronous Input and Output option.

BUGS
     In Solaris 2.5, these functions always return -1 and set
     errno  to  ENOSYS, because this release does not support the
     Asynchronous Input and Output option.  It is  our  intention


Re: [HACKERS] Safe/Fast I/O ...

From
ocie@paracel.com
Date:
Bruce Momjian wrote:
>
> >
> > On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > >     I hate to have to ask, but how is MMAP or AIO better then sfio?  I
> > > haven't had enough time to research any of this, and am just starting to
> > > look at it...
> >
> > If its simple to compile and works as a drop in replacement AND is faster,
> > I see no reason why PostgreSQL shouldn't try to link with it.
> >
> > Keep in mind though that in order to use MMAP or AIO you'd be
> > restructuring the code to be more efficient rather than doing more of the
> > same old thing but optimized.
> >
> > Only testing will prove me right or wrong though. :)
>
> As David Gould mentioned, we need to do pre-fetching of data pages
> somehow.
>
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough.  The problem is index scans of the
> table.  Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
>
> That is where we need async i/o.  I am looking in BSDI, and I don't see
> any way to do async i/o.  The only way I can think of doing it is via
> threads.

I have heard that glibc version 2.0 will support the POSIX AIO spec.
Solaris currently has AN implementation of AIO, but it is not the
POSIX one.  This prefetch could be done in another process or thread,
rather than tying the code to a given AIO implementation.
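
A rough sketch of the thread variant, assuming POSIX threads and pread() are available. `struct prefetch`, `prefetch_start` and `prefetch_wait` are invented names, and one thread per request is deliberately crude; a real version would more likely keep a small pool of I/O workers.

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One outstanding prefetch request. The caller owns buf and must keep
 * both buf and this struct alive until prefetch_wait() returns. */
struct prefetch {
    pthread_t tid;
    int       fd;
    void     *buf;
    size_t    len;
    off_t     offset;
    ssize_t   result;   /* filled in by the worker thread */
};

/* Worker: an ordinary blocking pread(), but off the main thread. */
static void *prefetch_worker(void *arg)
{
    struct prefetch *p = arg;

    p->result = pread(p->fd, p->buf, p->len, p->offset);
    return NULL;
}

/* Start reading len bytes at offset in the background.
 * Returns 0 on success, -1 if the thread could not be created. */
int prefetch_start(struct prefetch *p, int fd, void *buf,
                   size_t len, off_t offset)
{
    p->fd = fd;
    p->buf = buf;
    p->len = len;
    p->offset = offset;
    p->result = -1;
    return pthread_create(&p->tid, NULL, prefetch_worker, p) == 0 ? 0 : -1;
}

/* Wait for the background read to finish.
 * Returns the byte count from pread(), or -1 on error. */
ssize_t prefetch_wait(struct prefetch *p)
{
    if (pthread_join(p->tid, NULL) != 0)
        return -1;
    return p->result;
}
```

An index scan could call `prefetch_start()` for the next heap page it expects to need, keep processing the current one, and call `prefetch_wait()` only when it actually gets to that page.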

Ocie

Re: [HACKERS] Safe/Fast I/O ...

From
Michal Mosiewicz
Date:
The Hermit Hacker wrote:

>         ...but, with MMAP, unless I'm mistaken, you'd essentially be
> reading the file(s) into memory and then manipulating the file(s) there.
> Which means one helluva large amount of RAM being required...no?

Not exactly. Memory mapping only maps the file into a range of memory
addresses; it does not load the file into memory. Disk sectors are copied
into memory on demand: when an mmapped page is accessed, it is copied from
disk into memory.

The main reason for using memory mapping is that you don't have to create
unnecessary buffers. Normally, for every operation you have to create
an in-memory buffer, copy the data there, do some operations, and put the
data back into the file. With memory mapping you can avoid creating
those buffers, and moreover you call system functions less frequently.
There are also additional savings: less memory copying, and memory is
shared if several processes map the same file.

I don't think there exist more efficient solutions.
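
As an illustration of the on-demand behaviour described above, here is a small sketch that counts the lines in a file through a mapping instead of read()ing it into a private buffer. `count_lines_mmap` is a hypothetical helper, not anything from the PostgreSQL source. The mmap() call itself only reserves address space; pages are brought in from disk as the loop touches them.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count newline characters in a file via a read-only mapping.
 * Returns the count, or -1 on error. */
long count_lines_mmap(const char *path)
{
    struct stat st;
    char *p;
    long lines = 0;
    off_t i;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    /* No data is copied here; pages are faulted in on first access. */
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    for (i = 0; i < st.st_size; i++)
        if (p[i] == '\n')
            lines++;

    munmap(p, (size_t) st.st_size);
    return lines;
}
```

With MAP_SHARED, several processes mapping the same file share the same physical pages, which is the memory reuse mentioned above.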

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND