Discussion: Some thoughts on NFS

Some thoughts on NFS

From: Thomas Munro
Hello hackers,

As discussed in various threads, PostgreSQL-on-NFS is viewed with
suspicion.  Perhaps others knew this already, but I first learned of
the specific mechanism (or at least one of them) for corruption from
Craig Ringer's writing[1] about fsync() on Linux.

The problem is that close() and fsync() could report ENOSPC,
indicating that your dirty data has been dropped from the Linux page
cache, and then future fsync() operations could succeed as if nothing
happened.  It's easy to see that happening[2].
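
(To make the mechanism concrete, here is a minimal sketch of the
sequence of calls involved; this is not PostgreSQL code, and the path
is a placeholder:)

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("/mnt/nfs/pgdata/base/1/1234", O_RDWR);  /* placeholder */
    char    page[8192];

    if (fd < 0)
        return 1;
    memset(page, 'x', sizeof(page));
    pwrite(fd, page, sizeof(page), 0);  /* only dirties the client's page cache */

    if (fsync(fd) < 0 && errno == ENOSPC)
    {
        /*
         * The kernel reported the writeback failure once and may have
         * dropped the dirty pages.  On affected kernels a retry can now
         * return success even though nothing reached the server.
         */
        if (fsync(fd) == 0)
            printf("retry \"succeeded\": dirty data silently gone\n");
    }
    close(fd);
    return 0;
}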

Since Craig's report, we committed a patch based on his PANIC
proposal: we now panic on fsync() and close() failure.  Recovering
from the WAL may or may not be possible, but at no point will we allow
a checkpoint to retry and bogusly succeed.

So, does this mean we fixed the problems with NFS?  Not sure, but I do
see a couple of problems (and they're problems Craig raised in his
thread):

The first is practical.  Running out of disk space (or quota) is not
all that rare (much more common than EIO from a dying disk, I'd
guess), and definitely recoverable by an administrator: just create
more space.  It would be really nice to avoid panicking for an
*expected* condition.

To do that, I think we'd need to move the ENOSPC error back to
relation extension time (when we call pwrite()), as happens for local
filesystems.  Luckily NFS 4.2 provides a mechanism to do that: the NFS
4.2 ALLOCATE[3] command.  To make this work, I think there are two
subproblems to solve:

1.  Figure out how to get the ALLOCATE command all the way through the
stack from PostgreSQL to the remote NFS server, and know for sure that
it really happened.  On the Debian buster Linux 4.18 system I checked,
fallocate() reports EOPNOTSUPP, and posix_fallocate()
appears to succeed but it doesn't really do anything at all (though I
understand that some versions sometimes write zeros to simulate
allocation, which in this case would be equally useless as it doesn't
reserve anything on an NFS server).  We need the server and NFS client
and libc to be of the right version and cooperate and tell us that
they have really truly reserved space, but there isn't currently a way
as far as I can tell (a probing sketch follows this list).  How can we
achieve that, without writing our own NFS client?

2.  Deal with the resulting performance suckage.  Extending 8kb at a
time with synchronous network round trips won't fly.
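
(To illustrate subproblem 1, here is the kind of probe I mean; the
function is hypothetical, not PostgreSQL code:)

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical probe: do we have a *confirmed* space reservation?
 * On the NFS client described above, fallocate() fails with
 * EOPNOTSUPP, and posix_fallocate() "succeeds" without reserving
 * anything on the server, so its return value answers nothing.
 */
static bool
have_real_reservation(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0)
        return true;    /* kernel claims a real allocation (NFS 4.2 ALLOCATE) */

    if (errno == EOPNOTSUPP)
        return false;   /* no way to reserve space on this mount */

    return false;       /* ENOSPC etc.: nothing reserved */
}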

A theoretical question I thought of is whether there are any
interleavings of operations that allow a checkpoint to complete
bogusly, while a concurrent close() in a regular backend fails with
EIO for data that was included in the checkpoint, and panics.  I
*suspect* the answer is that every interleaving is safe for 4.16+
kernels that report IO errors to every descriptor.  In older kernels I
wonder if there could be a schedule where an arbitrary backend eats
the error while closing, then the checkpointer calls fsync()
successfully and then logs a checkpoint, and then the arbitrary
backend panics (too late).  I suspect EIO on close() doesn't happen in
practice on regular local filesystems, which is why I mention it in
the context of NFS, but I could be wrong about that.
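
(Spelled out, the schedule I'm worried about on older kernels looks
something like this:)

Backend A:     close() -> EIO      (error consumed, not yet acted on)
Checkpointer:  fsync() -> SUCCESS  (error already eaten by A)
Checkpointer:  log checkpoint
Backend A:     PANIC               (too late: the checkpoint is on disk)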

Everything I said above about NFS may also apply to CIFS, I dunno.

[1]
https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEepm%3D1FGo%3DACPKRmAxvb53mBwyVC%3DTDwTE0DMzkWjdbAYw7sw%40mail.gmail.com
[3] https://tools.ietf.org/html/rfc7862#page-64

-- 
Thomas Munro
https://enterprisedb.com


Re: Some thoughts on NFS

From: Robert Haas
On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> How can we achieve that, without writing our
> own NFS client?

<dons crash helmet>

Instead of writing our own NFS client, how about writing our own
network storage protocol?  Imagine a stripped-down postmaster process
running on the NFS server that essentially acts as a block server.
Through some sort of network chatter, it can exchange blocks with the
real postmaster running someplace else.  The mini-protocol would
contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
the smgr layer are.  And the user would see the remote server as a
tablespace mapped to a special smgr.
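
(For concreteness, the wire format could look something like this; the
command names are the ones above, everything else is invented for
illustration:)

#include <stdint.h>

typedef enum SmgrCommand
{
    READ_BLOCK,
    WRITE_BLOCK,
    EXTEND_RELATION,
    FSYNC_SEGMENT
} SmgrCommand;

typedef struct SmgrRequest
{
    uint32_t    command;    /* one of SmgrCommand */
    uint32_t    spcoid;     /* tablespace OID */
    uint32_t    dboid;      /* database OID */
    uint32_t    reloid;     /* relation OID */
    uint32_t    forknum;    /* main/FSM/VM fork */
    uint32_t    blocknum;   /* block number within the relation */
    /* a WRITE_BLOCK request is followed by an 8kB page payload */
} SmgrRequest;

typedef struct SmgrResponse
{
    uint32_t    status;     /* 0 = OK, otherwise an errno-style code */
    /* a READ_BLOCK response is followed by an 8kB page payload */
} SmgrResponse;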

As compared with your proposal, this has both advantages and
disadvantages.  The main advantage is that we aren't dependent on
being able to make NFS behave in any particular way; indeed, this type
of solution could be used not only to work around problems with NFS,
but also problems with any other network filesystem.  We get to reuse
all of the work we've done to try to make local operation reliable;
the remote server can run the same code that would be run locally
whenever the master tells it to do so.  And you can even imagine
trying to push more work to the remote side in some future version of
the protocol.  The main disadvantage is that it doesn't help unless
you can actually run software on the remote box.  If the only access
you have to the remote side is that it exposes an NFS interface, then
this sort of thing is useless.  And that's probably a pretty common
scenario.

So that brings us back to your proposal.  I don't know whether there's
any way of solving the problem you postulate: "We need the server and
NFS client and libc to be of the right version and cooperate and tell
us that they have really truly reserved space."  If there's not a set
of APIs that can be used to make that happen, then I don't know how we
can ever solve this problem without writing our own client.  Well, I
guess we could submit patches to every libc in the world to add those
APIs.  But that seems like a painful way forward.

I'm kinda glad you're thinking about this problem because I think the
unreliability of PostgreSQL on NFS is a real problem for users and kind
of a black eye for the project.  However, I am not sure that I see an
easy solution in what you wrote, or in general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Some thoughts on NFS

From: Magnus Hagander
On Tue, Feb 19, 2019 at 4:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > How can we achieve that, without writing our
> > own NFS client?
>
> <dons crash helmet>

You'll need it :)


> Instead of writing our own NFS client, how about writing our own
> network storage protocol?  Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else.  The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are.  And the user would see the remote server as a
> tablespace mapped to a special smgr.
>
> As compared with your proposal, this has both advantages and
> disadvantages.  The main advantage is that we aren't dependent on
> being able to make NFS behave in any particular way; indeed, this type
> of solution could be used not only to work around problems with NFS,
> but also problems with any other network filesystem.  We get to reuse
> all of the work we've done to try to make local operation reliable;
> the remote server can run the same code that would be run locally
> whenever the master tells it to do so.  And you can even imagine
> trying to push more work to the remote side in some future version of
> the protocol.  The main disadvantage is that it doesn't help unless
> you can actually run software on the remote box.  If the only access
> you have to the remote side is that it exposes an NFS interface, then
> this sort of thing is useless.  And that's probably a pretty common
> scenario.

In my experience, that covers approximately 100% of the use cases.

In the only cases I've run into of people wanting to use Postgres on NFS, the NFS server is a big filer from NetApp or Hitachi or whomever. And you're not going to be able to run something like that on top of it.

There might be a use-case for the split that you mention, absolutely, but it's not going to solve the people-who-want-NFS situation. You'd solve more of that by having the middle layer speak "raw device" underneath and be able to sit on top of things like iSCSI (yes, really).

--

Re: Some thoughts on NFS

From: Joe Conway
On 2/19/19 10:59 AM, Magnus Hagander wrote:
> On Tue, Feb 19, 2019 at 4:46 PM Robert Haas <robertmhaas@gmail.com
> <mailto:robertmhaas@gmail.com>> wrote:
>
>     On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro <thomas.munro@gmail.com
>     <mailto:thomas.munro@gmail.com>> wrote:
>     > How can we achieve that, without writing our
>     > own NFS client?
>
>     <dons crash helmet>
>
>
> You'll need it :)
>
>
>     Instead of writing our own NFS client, how about writing our own
>     network storage protocol?  Imagine a stripped-down postmaster process
>     running on the NFS server that essentially acts as a block server.
>     Through some sort of network chatter, it can exchange blocks with the
>     real postmaster running someplace else.  The mini-protocol would
>     contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
>     FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
>     the smgr layer are.  And the user would see the remote server as a
>     tablespace mapped to a special smgr.
>
>     As compared with your proposal, this has both advantages and
>     disadvantages.  The main advantage is that we aren't dependent on
>     being able to make NFS behave in any particular way; indeed, this type
>     of solution could be used not only to work around problems with NFS,
>     but also problems with any other network filesystem.  We get to reuse
>     all of the work we've done to try to make local operation reliable;
>     the remote server can run the same code that would be run locally
>     whenever the master tells it to do so.  And you can even imagine
>     trying to push more work to the remote side in some future version of
>     the protocol.  The main disadvantage is that it doesn't help unless
>     you can actually run software on the remote box.  If the only access
>     you have to the remote side is that it exposes an NFS interface, then
>     this sort of thing is useless.  And that's probably a pretty common
>     scenario.
>
>
> In my experience, that covers approximately 100% of the use cases.
>
> In the only cases I've run into of people wanting to use Postgres on
> NFS, the NFS server is a big filer from NetApp or Hitachi or whomever.
> And you're not going to be able to run something like that on top of it.


Exactly my experience too.


> There might be a use-case for the split that you mention, absolutely,
> but it's not going to solve the people-who-want-NFS situation. You'd
> solve more of that by having the middle layer speak "raw device"
> underneath and be able to sit on top of things like iSCSI (yes, really).

Interesting idea but sounds ambitious ;-)

Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Some thoughts on NFS

From: Stephen Frost
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > How can we achieve that, without writing our
> > own NFS client?
>
> <dons crash helmet>
>
> Instead of writing our own NFS client, how about writing our own
> network storage protocol?  Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else.  The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are.  And the user would see the remote server as a
> tablespace mapped to a special smgr.

In reading this, I honestly thought somewhere along the way you'd say
"and then you have WAL, so just run a replica and forget this whole
network filesystem business."

WAL replay being single-process is a practical issue, though.  It
seems like your mini-protocol was going in a direction that would have
allowed multiple processes to work between the PG system and the
storage system concurrently, avoiding the single-threaded issue with
WAL, but also making it such that the replica wouldn't be able to be
used for read-only queries (without some much larger changes
happening, anyway).  I'm not sure the use-case is big enough, but it
does seem to me that we're getting to a point where people are
generating enough WAL, on systems they care an awful lot about, that
they might be willing to forgo the ability to perform read-only
queries on the replica as long as they know that they can flip traffic
over to the replica without losing data.

So, what this all really boils down to is that I think this idea of a
different protocol that would allow PG to essentially replicate to a
remote system, or possibly run entirely off of the remote system without
any local storage, could be quite interesting in some situations.

On the other hand, I pretty much agree 100% with Magnus that the NFS
use-case is almost entirely because someone bought a big piece of
hardware that talks NFS and no, you don't get to run whatever code you
want on it.

Thanks!

Stephen


Re: Some thoughts on NFS

From: Robert Haas
On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander <magnus@hagander.net> wrote:
> In the only cases I've run into of people wanting to use Postgres on NFS, the NFS server is a big filer from
> NetApp or Hitachi or whomever. And you're not going to be able to run something like that on top of it.

Yeah.  :-(

It seems, however, we have no way of knowing to what extent that big
filer actually implements the latest NFS specs and does so correctly.
And if it doesn't, and data goes down the tubes, people are going to
blame PostgreSQL, not the big filer, either because they really
believe we ought to be able to handle it, or because they know that
filing a trouble ticket with NetApp isn't likely to provoke any sort
of swift response.  If PostgreSQL itself is speaking NFS, we might at
least have a little more information about what behavior the filer
claims to implement, but even then it could easily be "lying."  And if
we're just seeing it as a filesystem mount, then we're just ... flying
blind.

> There might be a use-case for the split that you mention, absolutely, but it's not going to solve the
> people-who-want-NFS situation. You'd solve more of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

Not sure I follow this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Some thoughts on NFS

From: Tomas Vondra
On 2/19/19 5:20 PM, Robert Haas wrote:
> On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander <magnus@hagander.net> wrote:
>> In the only cases I've run into of people wanting to use Postgres on
>> NFS, the NFS server is a big filer from NetApp or Hitachi or whomever.
>> And you're not going to be able to run something like that on top of it.
> 
> Yeah.  :-(
> 
> It seems, however, we have no way of knowing to what extent that big
> filer actually implements the latest NFS specs and does so correctly.
> And if it doesn't, and data goes down the tubes, people are going to
> blame PostgreSQL, not the big filer, either because they really
> believe we ought to be able to handle it, or because they know that
> filing a trouble ticket with NetApp isn't likely to provoke any sort
> of swift response.  If PostgreSQL itself is speaking NFS, we might at
> least have a little more information about what behavior the filer
> claims to implement, but even then it could easily be "lying."  And if
> we're just seeing it as a filesystem mount, then we're just ... flying
> blind.
> 

Perhaps we should have something like pg_test_nfs, then?
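
(No such tool exists today; as a sketch, one of the checks it might
perform is probing for a working fallocate() on the mount under test:)

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char    path[4096];
    int     fd;

    /* Create a scratch file on the mount under test. */
    snprintf(path, sizeof(path), "%s/pg_test_nfs.XXXXXX",
             argc > 1 ? argv[1] : ".");
    fd = mkstemp(path);
    if (fd < 0)
    {
        perror("mkstemp");
        return 1;
    }

    /* Can we get a real space reservation from the server? */
    if (fallocate(fd, 0, 0, 8192) == 0)
        printf("fallocate: supported; ENOSPC should surface at extension time\n");
    else if (errno == EOPNOTSUPP)
        printf("fallocate: not supported; ENOSPC may only surface at fsync time\n");
    else
        perror("fallocate");

    unlink(path);
    close(fd);
    return 0;
}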

>> There might be a use-case for the split that you mention, 
>> absolutely, but it's not going to solve the people-who-want-NFS 
>> situation. You'd solve more of that by having the middle layer 
>> speak "raw device" underneath and be able to sit on top of things
>> like iSCSI (yes, really).
>
> Not sure I follow this part.
>

I think Magnus says that people running PostgreSQL on NFS generally
don't do that because they somehow chose NFS, but because that's what
their company uses for network storage. Even if we support the custom
block protocol, they probably won't be able to use it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Some thoughts on NFS

From: Christoph Moench-Tegeder
## Magnus Hagander (magnus@hagander.net):

>  You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

Back in ye olden days we called these middle layers "kernel" and
"filesystem" and had that maintained by specialists.

Regards,
Christoph

-- 
Spare Space.


Re: Some thoughts on NFS

From: Andres Freund
Hi,

On 2019-02-19 20:03:05 +1300, Thomas Munro wrote:
> The first is practical.  Running out of disk space (or quota) is not
> all that rare (much more common than EIO from a dying disk, I'd
> guess), and definitely recoverable by an administrator: just create
> more space.  It would be really nice to avoid panicking for an
> *expected* condition.

Well, that's true, but OTOH, we don't even handle that properly on local
filesystems for WAL. And while people complain, it's not *that* common.


> 1.  Figure out how to get the ALLOCATE command all the way through the
> stack from PostgreSQL to the remote NFS server, and know for sure that
> it really happened.  On the Debian buster Linux 4.18 system I checked,
> fallocate() reports EOPNOTSUPP, and posix_fallocate()
> appears to succeed but it doesn't really do anything at all (though I
> understand that some versions sometimes write zeros to simulate
> allocation, which in this case would be equally useless as it doesn't
> reserve anything on an NFS server).  We need the server and NFS client
> and libc to be of the right version and cooperate and tell us that
> they have really truly reserved space, but there isn't currently a way
> as far as I can tell.  How can we achieve that, without writing our
> own NFS client?
> 
> 2.  Deal with the resulting performance suckage.  Extending 8kb at a
> time with synchronous network round trips won't fly.

I think I'd just go for fsync();pwrite();fsync(); as the extension
mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
to make sure there's no previous errors that we could mistake for
ENOSPC, the pwrite to extend, the second fsync to make sure there's
actually space. Then we can detect ENOSPC properly.  That possibly does
leave some errors where we could mistake ENOSPC as something more benign
than it is, but the cases seem pretty narrow, due to the previous
fsync() (maybe the other side could be thin provisioned and get an
ENOSPC there - but in that case we didn't actually lose any data. The
only dangerous scenario I can come up with is that the remote side is on
thinly provisioned CoW system, and a concurrent write to an earlier
block runs out of space - but seriously, good riddance to you).
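
(As a sketch of that sequence, with made-up names; the real thing
would live in the md/smgr layer:)

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative sketch of the fsync();pwrite();fsync() extension dance. */
static int
nfs_safe_extend(int fd, off_t old_size, const char *zero_page, size_t page_size)
{
    /* 1. Flush first, so an older I/O error can't masquerade as ENOSPC. */
    if (fsync(fd) < 0)
        return -1;          /* pre-existing failure: PANIC territory */

    /* 2. Extend; on NFS this typically only dirties the client's cache. */
    if (pwrite(fd, zero_page, page_size, old_size) < 0)
        return -1;

    /*
     * 3. Flush again: only now do we know the server really had space.
     * ENOSPC here arrives before the new page holds any real data, so
     * it can be reported as an ordinary, recoverable error.
     */
    if (fsync(fd) < 0)
        return -1;          /* ENOSPC: report to the user, don't PANIC */

    return 0;
}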

Given the current code we'll already try to extend in bigger chunks
when there's contention; we just need to combine the writes for those,
which ought not to be that hard now that we don't initialize
bulk-extended pages anymore.  That won't solve the issue of extending
during single-threaded writes, but I feel like that's secondary to
actually being correct.  And using bulk-extension in more cases
doesn't sound too hard to me.


Greetings,

Andres Freund


Re: Some thoughts on NFS

From: Magnus Hagander

On Tue, Feb 19, 2019 at 5:33 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> On 2/19/19 5:20 PM, Robert Haas wrote:
> > On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander <magnus@hagander.net> wrote:
>
> >> There might be a use-case for the split that you mention,
> >> absolutely, but it's not going to solve the people-who-want-NFS
> >> situation. You'd solve more of that by having the middle layer
> >> speak "raw device" underneath and be able to sit on top of things
> >> like iSCSI (yes, really).
> >
> > Not sure I follow this part.
> >
>
> I think Magnus is saying that people running PostgreSQL on NFS
> generally don't do so because they specifically chose NFS, but because
> that's what their company uses for network storage.  Even if we
> supported the custom block protocol, they probably wouldn't be able to
> use it.

Yes, with the addition that they also often export iSCSI endpoints today, so if we wanted to sit on top of something, that could also work.  But not sit on top of a custom block protocol we invent ourselves.

--

Re: Some thoughts on NFS

From: Magnus Hagander
On Tue, Feb 19, 2019 at 5:38 PM Christoph Moench-Tegeder <cmt@burggraben.net> wrote:
> ## Magnus Hagander (magnus@hagander.net):
>
> >  You'd solve more
> > of that by having the middle layer speak "raw device" underneath and be
> > able to sit on top of things like iSCSI (yes, really).
>
> Back in ye olden days we called these middle layers "kernel" and
> "filesystem" and had that maintained by specialists.

Yeah.  Unfortunately, in a number of cases those specialists turned out to consider fsync unimportant, or to drop data from the buffer cache without reporting errors.

But what I'm mainly saying is that if we want to run Postgres on top of a block device protocol, we should go all the way and do it, not stop somewhere halfway where we're unable to help most people.  I'm not saying that we *should*; there is a very big if in that.

--

Re: Some thoughts on NFS

From: Andres Freund
Hi,

On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> There might be a use-case for the split that you mention, absolutely, but
> it's not going to solve the people-who-want-NFS situation. You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

There's decent iSCSI implementations in several kernels, without the NFS
problems. I'm not sure what we'd gain by reimplementing those?

Greetings,

Andres Freund


Re: Some thoughts on NFS

From: Robert Haas
On Tue, Feb 19, 2019 at 1:17 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> > There might be a use-case for the split that you mention, absolutely, but
> > it's not going to solve the people-who-want-NFS situation. You'd solve more
> > of that by having the middle layer speak "raw device" underneath and be
> > able to sit on top of things like iSCSI (yes, really).
>
> There's decent iSCSI implementations in several kernels, without the NFS
> problems. I'm not sure what we'd gain by reimplementing those?

Is that a new thing?  I ran across PostgreSQL-over-iSCSI a number of
years ago and the evidence strongly suggested that it did not reliably
report disk errors back to PostgreSQL, leading to corruption.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Some thoughts on NFS

From: Andres Freund
Hi,

On 2019-02-19 13:21:21 -0500, Robert Haas wrote:
> On Tue, Feb 19, 2019 at 1:17 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> > > There might be a use-case for the split that you mention, absolutely, but
> > > it's not going to solve the people-who-want-NFS situation. You'd solve more
> > > of that by having the middle layer speak "raw device" underneath and be
> > > able to sit on top of things like iSCSI (yes, really).
> >
> > There's decent iSCSI implementations in several kernels, without the NFS
> > problems. I'm not sure what we'd gain by reimplementing those?
> 
> Is that a new thing?  I ran across PostgreSQL-over-iSCSI a number of
> years ago and the evidence strongly suggested that it did not reliably
> report disk errors back to PostgreSQL, leading to corruption.

How many years ago are we talking? I think it's been mostly robust in
the last 6-10 years, maybe? But note that the postgres + linux fsync
issues would have plagued that use case just as well as it did local
storage, at a likely higher incidence of failures (i.e. us forgetting to
retry fsyncs in checkpoints, and linux throwing away dirty data after
fsync failure would both have caused problems that aren't dependent on
iSCSI).  And I think it's not that likely that we'd not screw up a
number of times implementing iSCSI ourselves - not to speak of the fact
that that seems like an odd place to focus development on, given that
it'd basically require all the infrastructure also needed for local DIO,
which'd likely gain us much more.

Greetings,

Andres Freund


Re: Some thoughts on NFS

From: Robert Haas
On Tue, Feb 19, 2019 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
> > Is that a new thing?  I ran across PostgreSQL-over-iSCSI a number of
> > years ago and the evidence strongly suggested that it did not reliably
> > report disk errors back to PostgreSQL, leading to corruption.
>
> How many years ago are we talking? I think it's been mostly robust in
> the last 6-10 years, maybe?

I think it was ~9 years ago.

> But note that the postgres + linux fsync
> issues would have plagued that use case just as well as it did local
> storage, at a likely higher incidence of failures (i.e. us forgetting to
> retry fsyncs in checkpoints, and linux throwing away dirty data after
> fsync failure would both have caused problems that aren't dependent on
> iSCSI).

IIRC, and obviously that's difficult to do after so long, there were
clearly disk errors in the kernel logs, but no hint of a problem in
the PostgreSQL logs.  So it wasn't just a case of us responding to
errors with insufficient vigor -- either they weren't being reported at
all, or only to system calls we weren't checking, e.g. close or
something.

> And I think it's not that likely that we'd not screw up a
> number of times implementing iSCSI ourselves - not to speak of the fact
> that that seems like an odd place to focus development on, given that
> it'd basically require all the infrastructure also needed for local DIO,
> which'd likely gain us much more.

I don't really disagree with you here, but I also think it's important
to be honest about what size hammer is likely to be sufficient to fix
the problem.  Project policy for many years has been essentially
"let's assume the kernel guys know what they are doing," but, I don't
know, color me a little skeptical at this point.  We've certainly made
lots of mistakes all of our own, and it's certainly true that
reimplementing large parts of what the kernel does in user space is
not very appealing ... but on the other hand it looks like filesystem
error reporting isn't even really reliable for local operation (unless
we do an incredibly complicated fd-passing thing that has deadlock
problems we don't know how to solve and likely performance problems
too, or convert the whole backend to use threads) or for NFS operation
(though maybe your suggestion will fix that) so the idea that iSCSI is
just going to be all right seems a bit questionable to me.  Go ahead,
call me a pessimist...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Some thoughts on NFS

From: Andres Freund
Hi,

On 2019-02-19 13:45:28 -0500, Robert Haas wrote:
> On Tue, Feb 19, 2019 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
> > And I think it's not that likely that we'd not screw up a
> > number of times implementing iSCSI ourselves - not to speak of the fact
> > that that seems like an odd place to focus development on, given that
> > it'd basically require all the infrastructure also needed for local DIO,
> > which'd likely gain us much more.
> 
> I don't really disagree with you here, but I also think it's important
> to be honest about what size hammer is likely to be sufficient to fix
> the problem.  Project policy for many years has been essentially
> "let's assume the kernel guys know what they are doing," but, I don't
> know, color me a little skeptical at this point.

Yea, and I think around e.g. using the kernel page cache / not using
DIO, several people, including kernel developers and, say, me, told us
that's stupid.


> We've certainly made lots of mistakes all of our own, and it's
> certainly true that reimplementing large parts of what the kernel does
> in user space is not very appealing ... but on the other hand it looks
> like filesystem error reporting isn't even really reliable for local
> operation (unless we do an incredibly complicated fd-passing thing
> that has deadlock problems we don't know how to solve and likely
> performance problems too, or convert the whole backend to use threads)
> or for NFS operation (though maybe your suggestion will fix that) so
> the idea that iSCSI is just going to be all right seems a bit
> questionable to me.  Go ahead, call me a pessimist...

My point is that for iSCSI to be performant we'd need *all* the
infrastructure we also need for direct IO *and* a *lot* more. And that
it seems insane to invest very substantial resources into developing our
own iSCSI client when we don't even have DIO support. And DIO support
would allow us to address the error reporting issues, while also
drastically improving performance in a lot of situations. And we'd not
have to essentially develop our own filesystem etc.

Greetings,

Andres Freund


Re: Some thoughts on NFS

From: Robert Haas
On Tue, Feb 19, 2019 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> My point is that for iSCSI to be performant we'd need *all* the
> infrastructure we also need for direct IO *and* a *lot* more. And that
> it seems insane to invest very substantial resources into developing our
> own iSCSI client when we don't even have DIO support. And DIO support
> would allow us to address the error reporting issues, while also
> drastically improving performance in a lot of situations. And we'd not
> have to essentially develop our own filesystem etc.

OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
and then the iSCSI patch can go in on Thursday.  OK?  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Some thoughts on NFS

From: Magnus Hagander

On Tue, Feb 19, 2019 at 7:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> > My point is that for iSCSI to be performant we'd need *all* the
> > infrastructure we also need for direct IO *and* a *lot* more. And that
> > it seems insane to invest very substantial resources into developing our
> > own iSCSI client when we don't even have DIO support. And DIO support
> > would allow us to address the error reporting issues, while also
> > drastically improving performance in a lot of situations. And we'd not
> > have to essentially develop our own filesystem etc.
>
> OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
> and then the iSCSI patch can go in on Thursday.  OK?  :-)

C'mon Robert.

Surely you know that such patches should be landed on *Fridays*, not Thursdays. 

--

Re: Some thoughts on NFS

From: Robert Haas
On Tue, Feb 19, 2019 at 2:05 PM Magnus Hagander <magnus@hagander.net> wrote:
> C'mon Robert.
>
> Surely you know that such patches should be landed on *Fridays*, not Thursdays.

Oh, right.  And preferably via airplane wifi from someplace over the
Atlantic ocean, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Some thoughts on NFS

From: Thomas Munro
On Tue, Feb 19, 2019 at 8:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> A theoretical question I thought of is whether there are any
> interleavings of operations that allow a checkpoint to complete
> bogusly, while a concurrent close() in a regular backend fails with
> EIO for data that was included in the checkpoint, and panics.  I
> *suspect* the answer is that every interleaving is safe for 4.16+
> kernels that report IO errors to every descriptor.  In older kernels I
> wonder if there could be a schedule where an arbitrary backend eats
> the error while closing, then the checkpointer calls fsync()
> successfully and then logs a checkpoint, and then the arbitrary
> backend panics (too late).  I suspect EIO on close() doesn't happen in
> practice on regular local filesystems, which is why I mention it in
> the context of NFS, but I could be wrong about that.

Ugh.  It looks like Linux NFS doesn't even use the new errseq_t
machinery in 4.16+.  So even if we had the fd-passing patch, I think
there may be a dangerous schedule like this:

A: close() -> EIO, clears AS_EIO flag
B: fsync() -> SUCCESS, log a checkpoint
A: panic!  (but it's too late, we already logged a checkpoint but
didn't flush all the dirty data that belonged to it)

-- 
Thomas Munro
https://enterprisedb.com


Re: Some thoughts on NFS

From: Thomas Munro
On Wed, Feb 20, 2019 at 5:52 AM Andres Freund <andres@anarazel.de> wrote:
> > 1.  Figure out how to get the ALLOCATE command all the way through the
> > stack from PostgreSQL to the remote NFS server, and know for sure that
> > it really happened.  On the Debian buster Linux 4.18 system I checked,
> > fallocate() reports EOPNOTSUPP, and posix_fallocate()
> > appears to succeed but it doesn't really do anything at all (though I
> > understand that some versions sometimes write zeros to simulate
> > allocation, which in this case would be equally useless as it doesn't
> > reserve anything on an NFS server).  We need the server and NFS client
> > and libc to be of the right version and cooperate and tell us that
> > they have really truly reserved space, but there isn't currently a way
> > as far as I can tell.  How can we achieve that, without writing our
> > own NFS client?
> >
> > 2.  Deal with the resulting performance suckage.  Extending 8kb at a
> > time with synchronous network round trips won't fly.
>
> I think I'd just go for fsync();pwrite();fsync(); as the extension
> mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
> to make sure there's no previous errors that we could mistake for
> ENOSPC, the pwrite to extend, the second fsync to make sure there's
> actually space. Then we can detect ENOSPC properly.  That possibly does
> leave some errors where we could mistake ENOSPC as something more benign
> than it is, but the cases seem pretty narrow, due to the previous
> fsync() (maybe the other side could be thin provisioned and get an
> ENOSPC there - but in that case we didn't actually lose any data. The
> only dangerous scenario I can come up with is that the remote side is on
> thinly provisioned CoW system, and a concurrent write to an earlier
> block runs out of space - but seriously, good riddance to you).

This seems to make sense, and has the advantage that it uses
interfaces that exist right now.  But it seems a bit like we'll have
to wait for them to finish building out the errseq_t support for NFS
to avoid various races around the mapping's AS_EIO flag (A: fsync() ->
EIO, B: fsync() -> SUCCESS, log checkpoint; A: panic), and then maybe
we'd have to get at least one of { fd-passing, direct IO, threads }
working on our side ...

-- 
Thomas Munro
https://enterprisedb.com


Re: Some thoughts on NFS

From: Andres Freund
Hi,

On 2019-02-20 11:25:22 +1300, Thomas Munro wrote:
> This seems to make sense, and has the advantage that it uses
> interfaces that exist right now.  But it seems a bit like we'll have
> to wait for them to finish building out the errseq_t support for NFS
> to avoid various races around the mapping's AS_EIO flag (A: fsync() ->
> EIO, B: fsync() -> SUCCESS, log checkpoint; A: panic), and then maybe
> we'd have to get at least one of { fd-passing, direct IO, threads }
> working on our side ...

I think we could "just" make use of DIO for relation extensions when
detecting NFS. Given that we just about never actually read the result
of the file extension write, just converting that write to DIO shouldn't
have that bad an overall impact - of course it'll cause slowdowns, but
only while extending files. And that ought to handle ENOSPC correctly,
while leaving the EIO handling separate?
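
(Roughly like this, as an illustrative sketch; the names and the
alignment choice are assumptions:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative sketch: extend via a separate O_DIRECT descriptor so
 * that ENOSPC comes back synchronously from pwrite() itself.
 */
static int
extend_with_dio(const char *path, off_t old_size, size_t page_size)
{
    int     fd = open(path, O_WRONLY | O_DIRECT);
    void   *zero_page;
    int     rc = -1;

    if (fd < 0)
        return -1;

    /* O_DIRECT needs aligned buffers; old_size is block-aligned in PG. */
    if (posix_memalign(&zero_page, 4096, page_size) != 0)
    {
        close(fd);
        return -1;
    }
    memset(zero_page, 0, page_size);

    /* The write bypasses the page cache, so ENOSPC surfaces right here. */
    if (pwrite(fd, zero_page, page_size, old_size) == (ssize_t) page_size)
        rc = 0;

    free(zero_page);
    close(fd);
    return rc;
}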

Greetings,

Andres Freund


Re: Some thoughts on NFS

From: Thomas Munro
On Wed, Feb 20, 2019 at 7:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> > My point is that for iSCSI to be performant we'd need *all* the
> > infrastructure we also need for direct IO *and* a *lot* more. And that
> > it seems insane to invest very substantial resources into developing our
> > own iSCSI client when we don't even have DIO support. And DIO support
> > would allow us to address the error reporting issues, while also
> > drastically improving performance in a lot of situations. And we'd not
> > have to essentially develop our own filesystem etc.
>
> OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
> and then the iSCSI patch can go in on Thursday.  OK?  :-)

Not something I paid a lot of attention to as an application
developer, but in a past life I have seen a lot of mission critical
DB2 and Oracle systems running on ext4 or XFS over (kernel) iSCSI
plugged into big monster filers, and I think perhaps also cases
of NFS; but those systems use DIO by default (and the latter has its
own NFS client, IIUC).  So I suspect if you can just get DIO merged today
we can probably skip the userland iSCSI and call it done.  :-P


--
Thomas Munro
https://enterprisedb.com