Discussion: Some thoughts on NFS
Hello hackers,

As discussed in various threads, PostgreSQL-on-NFS is viewed with suspicion. Perhaps others knew this already, but I first learned of the specific mechanism (or at least one of them) for corruption from Craig Ringer's writing[1] about fsync() on Linux. The problem is that close() and fsync() could report ENOSPC, indicating that your dirty data has been dropped from the Linux page cache, and then future fsync() operations could succeed as if nothing happened. It's easy to see that happening[2].

Since Craig's report, we committed a patch based on his PANIC proposal: we now panic on fsync() and close() failure. Recovering from the WAL may or may not be possible, but at no point will we allow a checkpoint to retry and bogusly succeed.

So, does this mean we fixed the problems with NFS? Not sure, but I do see a couple of problems (and they're problems Craig raised in his thread):

The first is practical. Running out of disk space (or quota) is not all that rare (much more common than EIO from a dying disk, I'd guess), and definitely recoverable by an administrator: just create more space. It would be really nice to avoid panicking for an *expected* condition. To do that, I think we'd need to move the ENOSPC error back to relation extension time (when we call pwrite()), as happens for local filesystems. Luckily NFS 4.2 provides a mechanism to do that: the NFS 4.2 ALLOCATE[3] command. To make this work, I think there are two subproblems to solve:

1. Figure out how to get the ALLOCATE command all the way through the stack from PostgreSQL to the remote NFS server, and know for sure that it really happened. On the Debian buster Linux 4.18 system I checked, fallocate() reports EOPNOTSUPP, and posix_fallocate() appears to succeed but doesn't really do anything at all (though I understand that some versions sometimes write zeros to simulate allocation, which in this case would be equally useless as it doesn't reserve anything on an NFS server). We need the server and NFS client and libc to be of the right version and cooperate and tell us that they have really truly reserved space, but there isn't currently a way as far as I can tell. How can we achieve that, without writing our own NFS client?

2. Deal with the resulting performance suckage. Extending 8kb at a time with synchronous network round trips won't fly.

A theoretical question I thought of is whether there are any interleavings of operations that allow a checkpoint to complete bogusly, while a concurrent close() in a regular backend fails with EIO for data that was included in the checkpoint, and panics. I *suspect* the answer is that every interleaving is safe for 4.16+ kernels that report IO errors to every descriptor. In older kernels I wonder if there could be a schedule where an arbitrary backend eats the error while closing, then the checkpointer calls fsync() successfully and logs a checkpoint, and then the arbitrary backend panics (too late). I suspect EIO on close() doesn't happen in practice on regular local filesystems, which is why I mention it in the context of NFS, but I could be wrong about that.

Everything I said above about NFS may also apply to CIFS, I dunno.

[1] https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEepm%3D1FGo%3DACPKRmAxvb53mBwyVC%3DTDwTE0DMzkWjdbAYw7sw%40mail.gmail.com
[3] https://tools.ietf.org/html/rfc7862#page-64

--
Thomas Munro
https://enterprisedb.com
On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> How can we achieve that, without writing our
> own NFS client?

<dons crash helmet>

Instead of writing our own NFS client, how about writing our own network storage protocol? Imagine a stripped-down postmaster process running on the NFS server that essentially acts as a block server. Through some sort of network chatter, it can exchange blocks with the real postmaster running someplace else. The mini-protocol would contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION, FSYNC_SEGMENT, etc. - basically whatever the relevant operations at the smgr layer are. And the user would see the remote server as a tablespace mapped to a special smgr.

As compared with your proposal, this has both advantages and disadvantages. The main advantage is that we aren't dependent on being able to make NFS behave in any particular way; indeed, this type of solution could be used not only to work around problems with NFS, but also problems with any other network filesystem. We get to reuse all of the work we've done to try to make local operation reliable; the remote server can run the same code that would be run locally whenever the master tells it to do so. And you can even imagine trying to push more work to the remote side in some future version of the protocol. The main disadvantage is that it doesn't help unless you can actually run software on the remote box. If the only access you have to the remote side is that it exposes an NFS interface, then this sort of thing is useless. And that's probably a pretty common scenario.

So that brings us back to your proposal. I don't know whether there's any way of solving the problem you postulate: "We need the server and NFS client and libc to be of the right version and cooperate and tell us that they have really truly reserved space." If there's not a set of APIs that can be used to make that happen, then I don't know how we can ever solve this problem without writing our own client. Well, I guess we could submit patches to every libc in the world to add those APIs. But that seems like a painful way forward.

I'm kinda glad you're thinking about this problem because I think the unreliability of PostgreSQL on NFS is a real problem for users and kind of a black eye for the project. However, I am not sure that I see an easy solution in what you wrote, or in general.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 19, 2019 at 4:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> How can we achieve that, without writing our
> own NFS client?
<dons crash helmet>
You'll need it :)
Instead of writing our own NFS client, how about writing our own
network storage protocol? Imagine a stripped-down postmaster process
running on the NFS server that essentially acts as a block server.
Through some sort of network chatter, it can exchange blocks with the
real postmaster running someplace else. The mini-protocol would
contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
the smgr layer are. And the user would see the remote server as a
tablespace mapped to a special smgr.
As compared with your proposal, this has both advantages and
disadvantages. The main advantage is that we aren't dependent on
being able to make NFS behave in any particular way; indeed, this type
of solution could be used not only to work around problems with NFS,
but also problems with any other network filesystem. We get to reuse
all of the work we've done to try to make local operation reliable;
the remote server can run the same code that would be run locally
whenever the master tells it to do so. And you can even imagine
trying to push more work to the remote side in some future version of
the protocol. The main disadvantage is that it doesn't help unless
you can actually run software on the remote box. If the only access
you have to the remote side is that it exposes an NFS interface, then
this sort of thing is useless. And that's probably a pretty common
scenario.
In my experience, that covers approximately 100% of the use cases.
In the only cases I've run into of people wanting to use postgres on NFS, the NFS server is a big filer from NetApp or Hitachi or whomever. And you're not going to be able to run something like that on top of it.
There might be a use-case for the split that you mention, absolutely, but it's not going to solve the people-who-want-NFS situation. You'd solve more of that by having the middle layer speak "raw device" underneath and be able to sit on top of things like iSCSI (yes, really).
On 2/19/19 10:59 AM, Magnus Hagander wrote:
> In my experience, that covers approximately 100% of the use cases.
>
> The only case I've run into people wanting to use postgres on NFS, the
> NFS server is a big filer from netapp or hitachi or whomever. And you're
> not going to be able to run something like that on top of it.

Exactly my experience too.

> There might be a use-case for the split that you mention, absolutely,
> but it's not going to solve the people-who-want-NFS situation. You'd
> solve more of that by having the middle layer speak "raw device"
> underneath and be able to sit on top of things like iSCSI (yes, really).

Interesting idea but sounds ambitious ;-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> Instead of writing our own NFS client, how about writing our own
> network storage protocol? Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else. The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are. And the user would see the remote server as a
> tablespace mapped to a special smgr.

In reading this, I honestly thought somewhere along the way you'd say "and then you have WAL, so just run a replica and forget this whole network filesystem business." The practical issue of WAL replay being single-process is an issue though.

It seems like your mini-protocol was going in a direction that would have allowed multiple processes to be working between the PG system and the storage system concurrently, avoiding the single-threaded issue with WAL but also making it such that the replica wouldn't be able to be used for read-only queries (without some much larger changes happening anyway).

I'm not sure the use-case is big enough but it does seem to me that we're getting to a point where people are generating enough WAL with systems that they care an awful lot about that they might be willing to forgo having the ability to perform read-only queries on the replica as long as they know that they can flip traffic over to the replica without losing data.

So, what this all really boils down to is that I think this idea of a different protocol that would allow PG to essentially replicate to a remote system, or possibly run entirely off of the remote system without any local storage, could be quite interesting in some situations. On the other hand, I pretty much agree 100% with Magnus that the NFS use-case is almost entirely because someone bought a big piece of hardware that talks NFS and no, you don't get to run whatever code you want on it.

Thanks!

Stephen
On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander <magnus@hagander.net> wrote:
> The only case I've run into people wanting to use postgres on NFS, the
> NFS server is a big filer from netapp or hitachi or whomever. And you're
> not going to be able to run something like that on top of it.

Yeah. :-(

It seems, however, we have no way of knowing to what extent that big filer actually implements the latest NFS specs and does so correctly. And if it doesn't, and data goes down the tubes, people are going to blame PostgreSQL, not the big filer, either because they really believe we ought to be able to handle it, or because they know that filing a trouble ticket with NetApp isn't likely to provoke any sort of swift response. If PostgreSQL itself is speaking NFS, we might at least have a little more information about what behavior the filer claims to implement, but even then it could easily be "lying." And if we're just seeing it as a filesystem mount, then we're just ... flying blind.

> There might be a use-case for the split that you mention, absolutely,
> but it's not going to solve the people-who-want-NFS situation. You'd
> solve more of that by having the middle layer speak "raw device"
> underneath and be able to sit on top of things like iSCSI (yes, really).

Not sure I follow this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/19/19 5:20 PM, Robert Haas wrote:
> It seems, however, we have no way of knowing to what extent that big
> filer actually implements the latest NFS specs and does so correctly.
> [...] And if we're just seeing it as a filesystem mount, then we're
> just ... flying blind.

Perhaps we should have something like pg_test_nfs, then?

>> There might be a use-case for the split that you mention,
>> absolutely, but it's not going to solve the people-who-want-NFS
>> situation. You'd solve more of that by having the middle layer
>> speak "raw device" underneath and be able to sit on top of things
>> like iSCSI (yes, really).
>
> Not sure I follow this part.

I think Magnus says that people running PostgreSQL on NFS generally don't do that because they somehow chose NFS, but because that's what their company uses for network storage. Even if we support the custom block protocol, they probably won't be able to use it.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
## Magnus Hagander (magnus@hagander.net):
> You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

Back in ye olden days we called these middle layers "kernel" and "filesystem" and had that maintained by specialists.

Regards,
Christoph

--
Spare Space.
Hi,

On 2019-02-19 20:03:05 +1300, Thomas Munro wrote:
> The first is practical. Running out of disk space (or quota) is not
> all that rare (much more common than EIO from a dying disk, I'd
> guess), and definitely recoverable by an administrator: just create
> more space. It would be really nice to avoid panicking for an
> *expected* condition.

Well, that's true, but OTOH, we don't even handle that properly on local filesystems for WAL. And while people complain, it's not *that* common.

> 1. Figure out how to get the ALLOCATE command all the way through the
> stack from PostgreSQL to the remote NFS server, and know for sure that
> it really happened. [...] How can we achieve that, without writing our
> own NFS client?
>
> 2. Deal with the resulting performance suckage. Extending 8kb at a
> time with synchronous network round trips won't fly.

I think I'd just go for fsync();pwrite();fsync(); as the extension mechanism, iff we're detecting a tablespace is on NFS. The first fsync() to make sure there's no previous errors that we could mistake for ENOSPC, the pwrite to extend, the second fsync to make sure there's actually space. Then we can detect ENOSPC properly.

That possibly does leave some errors where we could mistake ENOSPC as something more benign than it is, but the cases seem pretty narrow, due to the previous fsync() (maybe the other side could be thin provisioned and get an ENOSPC there - but in that case we didn't actually lose any data. The only dangerous scenario I can come up with is that the remote side is on a thinly provisioned CoW system, and a concurrent write to an earlier block runs out of space - but seriously, good riddance to you).

Given the current code we'll already try to extend in bigger chunks when there's contention, we just need to combine the writes for those, that ought to not be that hard now that we don't initialize bulk-extended pages anymore. That won't solve the issue of extending during single-threaded writes, but I feel like that's secondary to actually being correct. And using bulk extension in more cases doesn't sound too hard to me.

Greetings,

Andres Freund
On Tue, Feb 19, 2019 at 5:33 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 2/19/19 5:20 PM, Robert Haas wrote:
> On Tue, Feb 19, 2019 at 10:59 AM Magnus Hagander <magnus@hagander.net> wrote:
>> There might be a use-case for the split that you mention,
>> absolutely, but it's not going to solve the people-who-want-NFS
>> situation. You'd solve more of that by having the middle layer
>> speak "raw device" underneath and be able to sit on top of things
>> like iSCSI (yes, really).
>
> Not sure I follow this part.
>
I think Magnus says that people running PostgreSQL on NFS generally
don't do that because they somehow chose NFS, but because that's what
their company uses for network storage. Even if we support the custom
block protocol, they probably won't be able to use it.
Yes, with the addition that they also often export iSCSI endpoints today, so if we wanted to sit on top of something that could also work. But not sit on top of a custom block protocol we invent ourselves.
On Tue, Feb 19, 2019 at 5:38 PM Christoph Moench-Tegeder <cmt@burggraben.net> wrote:
## Magnus Hagander (magnus@hagander.net):
> You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).
Back in ye olden days we called these middle layers "kernel" and
"filesystem" and had that maintained by specialists.
Yeah. Unfortunately, in a number of cases those specialists turned out to be ones who considered fsync unimportant, or who dropped data from the buffer cache without reporting errors.
But what I'm mainly saying is that if we want to run postgres on top of a block device protocol, we should go all the way and do it, not stop somewhere halfway that will be unable to help most people. I'm not saying that we *should*; there is a very big "if" in that.
Hi,

On 2019-02-19 16:59:35 +0100, Magnus Hagander wrote:
> There might be a use-case for the split that you mention, absolutely, but
> it's not going to solve the people-who-want-NFS situation. You'd solve more
> of that by having the middle layer speak "raw device" underneath and be
> able to sit on top of things like iSCSI (yes, really).

There's decent iSCSI implementations in several kernels, without the NFS problems. I'm not sure what we'd gain by reimplementing those?

Greetings,

Andres Freund
On Tue, Feb 19, 2019 at 1:17 PM Andres Freund <andres@anarazel.de> wrote:
> There's decent iSCSI implementations in several kernels, without the NFS
> problems. I'm not sure what we'd gain by reimplementing those?

Is that a new thing? I ran across PostgreSQL-over-iSCSI a number of years ago and the evidence strongly suggested that it did not reliably report disk errors back to PostgreSQL, leading to corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-02-19 13:21:21 -0500, Robert Haas wrote:
> Is that a new thing? I ran across PostgreSQL-over-iSCSI a number of
> years ago and the evidence strongly suggested that it did not reliably
> report disk errors back to PostgreSQL, leading to corruption.

How many years ago are we talking? I think it's been mostly robust in the last 6-10 years, maybe? But note that the postgres + linux fsync issues would have plagued that use case just as well as it did local storage, at a likely higher incidence of failures (i.e. us forgetting to retry fsyncs in checkpoints, and linux throwing away dirty data after fsync failure would both have caused problems that aren't dependent on iSCSI).

And I think it's not that likely that we'd not screw up a number of times implementing iSCSI ourselves - not to speak of the fact that that seems like an odd place to focus development on, given that it'd basically require all the infrastructure also needed for local DIO, which'd likely gain us much more.

Greetings,

Andres Freund
On Tue, Feb 19, 2019 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
> How many years ago are we talking? I think it's been mostly robust in
> the last 6-10 years, maybe?

I think it was ~9 years ago.

> But note that the postgres + linux fsync issues would have plagued that
> use case just as well as it did local storage, at a likely higher
> incidence of failures (i.e. us forgetting to retry fsyncs in
> checkpoints, and linux throwing away dirty data after fsync failure
> would both have caused problems that aren't dependent on iSCSI).

IIRC, and obviously that's difficult to do after so long, there were clearly disk errors in the kernel logs, but no hint of a problem in the PostgreSQL logs. So it wasn't just a case of us not responding to errors with sufficient vigor -- either they weren't being reported at all, or only to system calls we weren't checking, e.g. close or something.

> And I think it's not that likely that we'd not screw up a number of
> times implementing iSCSI ourselves - not to speak of the fact that that
> seems like an odd place to focus development on, given that it'd
> basically require all the infrastructure also needed for local DIO,
> which'd likely gain us much more.

I don't really disagree with you here, but I also think it's important to be honest about what size hammer is likely to be sufficient to fix the problem. Project policy for many years has been essentially "let's assume the kernel guys know what they are doing," but, I don't know, color me a little skeptical at this point. We've certainly made lots of mistakes all of our own, and it's certainly true that reimplementing large parts of what the kernel does in user space is not very appealing ... but on the other hand it looks like filesystem error reporting isn't even really reliable for local operation (unless we do an incredibly complicated fd-passing thing that has deadlock problems we don't know how to solve and likely performance problems too, or convert the whole backend to use threads) or for NFS operation (though maybe your suggestion will fix that), so the idea that iSCSI is just going to be all right seems a bit questionable to me. Go ahead, call me a pessimist...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-02-19 13:45:28 -0500, Robert Haas wrote:
> I don't really disagree with you here, but I also think it's important
> to be honest about what size hammer is likely to be sufficient to fix
> the problem. Project policy for many years has been essentially
> "let's assume the kernel guys know what they are doing," but, I don't
> know, color me a little skeptical at this point.

Yea, and I think around e.g. using the kernel page cache / not using DIO, several people, including kernel developers and say me, told us that's stupid.

> We've certainly made lots of mistakes all of our own, and it's
> certainly true that reimplementing large parts of what the kernel does
> in user space is not very appealing ... [...] so the idea that iSCSI
> is just going to be all right seems a bit questionable to me. Go
> ahead, call me a pessimist...

My point is that for iSCSI to be performant we'd need *all* the infrastructure we also need for direct IO *and* a *lot* more. And that it seems insane to invest very substantial resources into developing our own iSCSI client when we don't even have DIO support.

And DIO support would allow us to address the error reporting issues, while also drastically improving performance in a lot of situations. And we'd not have to essentially develop our own filesystem etc.

Greetings,

Andres Freund
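As a tiny illustration of why DIO needs its own infrastructure: with O_DIRECT the buffer, file offset, and transfer length all have to satisfy alignment constraints, so even writing a single page needs an aligned bounce buffer (a hypothetical helper, not PostgreSQL code; 4096-byte alignment is a common requirement, but the real value is device-dependent):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one page to `path`, bypassing the kernel page cache.  With
 * O_DIRECT, a misaligned buffer or length typically fails with EINVAL,
 * which is exactly the kind of extra bookkeeping a DIO-based smgr
 * would have to take on.  Returns 0 on success, -1 with errno set. */
int dio_write_page(const char *path, const void *src, size_t len)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0600);
    if (fd < 0)
        return -1;

    /* O_DIRECT needs an aligned buffer; copy the caller's data in. */
    void *buf;
    if (posix_memalign(&buf, 4096, len) != 0)
    {
        close(fd);
        errno = ENOMEM;
        return -1;
    }
    memcpy(buf, src, len);

    ssize_t rc = pwrite(fd, buf, len, 0);   /* aligned direct write */
    int saved_errno = errno;
    free(buf);
    close(fd);
    errno = saved_errno;
    return rc == (ssize_t) len ? 0 : -1;
}
```

Note that some filesystems (tmpfs, for instance) reject O_DIRECT entirely, so even this trivial helper needs a fallback path in practice.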
On Tue, Feb 19, 2019 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> My point is that for iSCSI to be performant we'd need *all* the
> infrastructure we also need for direct IO *and* a *lot* more. And that
> it seems insane to invest very substantial resources into developing our
> own iSCSI client when we don't even have DIO support. And DIO support
> would allow us to address the error reporting issues, while also
> drastically improving performance in a lot of situations. And we'd not
> have to essentially develop our own filesystem etc.

OK, got it. So, I'll merge the patch for direct I/O support tomorrow, and then the iSCSI patch can go in on Thursday. OK? :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 19, 2019 at 7:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Feb 19, 2019 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> My point is that for iSCSI to be performant we'd need *all* the
> infrastructure we also need for direct IO *and* a *lot* more. And that
> it seems insane to invest very substantial resources into developing our
> own iSCSI client when we don't even have DIO support. And DIO support
> would allow us to address the error reporting issues, while also
> drastically improving performance in a lot of situations. And we'd not
> have to essentially develop our own filesystem etc.
OK, got it. So, I'll merge the patch for direct I/O support tomorrow,
and then the iSCSI patch can go in on Thursday. OK? :-)
C'mon Robert.
Surely you know that such patches should be landed on *Fridays*, not Thursdays.
On Tue, Feb 19, 2019 at 2:05 PM Magnus Hagander <magnus@hagander.net> wrote:
> C'mon Robert.
>
> Surely you know that such patches should be landed on *Fridays*, not Thursdays.

Oh, right. And preferably via airplane wifi from someplace over the Atlantic ocean, right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 19, 2019 at 8:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> A theoretical question I thought of is whether there are any
> interleavings of operations that allow a checkpoint to complete
> bogusly, while a concurrent close() in a regular backend fails with
> EIO for data that was included in the checkpoint, and panics. I
> *suspect* the answer is that every interleaving is safe for 4.16+
> kernels that report IO errors to every descriptor. In older kernels I
> wonder if there could be a schedule where an arbitrary backend eats
> the error while closing, then the checkpointer calls fsync()
> successfully and then logs a checkpoint, and then the arbitrary
> backend panics (too late).

Ugh. It looks like Linux NFS doesn't even use the new errseq_t machinery in 4.16+. So even if we had the fd-passing patch, I think there may be a dangerous schedule like this:

A: close() -> EIO, clears AS_EIO flag
B: fsync() -> SUCCESS, log a checkpoint
A: panic! (but it's too late, we already logged a checkpoint but didn't flush all the dirty data that belonged to it)

--
Thomas Munro
https://enterprisedb.com
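The dangerous schedule can be modelled in a few lines of C (a pure simulation of the clear-on-report AS_EIO semantics described above, with no kernel involvement; the names are invented for illustration):

```c
#include <stdbool.h>

/* Toy model of a mapping's AS_EIO flag under the pre-errseq_t
 * semantics: whichever caller reports the writeback error first
 * also clears it. */
struct toy_mapping
{
    bool as_eio;            /* a writeback error is pending */
};

/* What close()/fsync() effectively did before errseq_t: return the
 * error once and clear the flag, so later callers see success. */
int toy_check_and_clear(struct toy_mapping *m)
{
    if (m->as_eio)
    {
        m->as_eio = false;  /* error consumed by the first reporter */
        return -1;          /* EIO */
    }
    return 0;               /* "no error" */
}
```

Replaying the schedule against this model: after a writeback failure sets the flag, backend A's close() returns -1 and clears it, so checkpointer B's subsequent fsync() returns 0 and the checkpoint is logged even though dirty data was lost. Per-file-descriptor errseq_t sampling avoids exactly this, because B's descriptor would still observe the error.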
On Wed, Feb 20, 2019 at 5:52 AM Andres Freund <andres@anarazel.de> wrote:
> > 1. Figure out how to get the ALLOCATE command all the way through the
> > stack from PostgreSQL to the remote NFS server, and know for sure that
> > it really happened.  On the Debian buster Linux 4.18 system I checked,
> > fallocate() reports EOPNOTSUPP, and posix_fallocate() appears to
> > succeed but it doesn't really do anything at all (though I understand
> > that some versions sometimes write zeros to simulate allocation, which
> > in this case would be equally useless as it doesn't reserve anything
> > on an NFS server).  We need the server and NFS client and libc to be
> > of the right version and cooperate and tell us that they have really
> > truly reserved space, but there isn't currently a way as far as I can
> > tell.  How can we achieve that, without writing our own NFS client?
> >
> > 2. Deal with the resulting performance suckage.  Extending 8kb at a
> > time with synchronous network round trips won't fly.
>
> I think I'd just go for fsync();pwrite();fsync(); as the extension
> mechanism, iff we're detecting a tablespace is on NFS.  The first fsync()
> to make sure there's no previous errors that we could mistake for
> ENOSPC, the pwrite to extend, the second fsync to make sure there's
> actually space.  Then we can detect ENOSPC properly.  That possibly does
> leave some errors where we could mistake ENOSPC as something more benign
> than it is, but the cases seem pretty narrow, due to the previous
> fsync() (maybe the other side could be thin provisioned and get an
> ENOSPC there - but in that case we didn't actually lose any data.  The
> only dangerous scenario I can come up with is that the remote side is on
> a thinly provisioned CoW system, and a concurrent write to an earlier
> block runs out of space - but seriously, good riddance to you).

This seems to make sense, and has the advantage that it uses
interfaces that exist right now.  But it seems a bit like we'll have
to wait for them to finish building out the errseq_t support for NFS
to avoid various races around the mapping's AS_EIO flag (A: fsync() ->
EIO, B: fsync() -> SUCCESS, log checkpoint; A: panic), and then maybe
we'd have to get at least one of { fd-passing, direct IO, threads }
working on our side ...

--
Thomas Munro
https://enterprisedb.com
Hi,

On 2019-02-20 11:25:22 +1300, Thomas Munro wrote:
> This seems to make sense, and has the advantage that it uses
> interfaces that exist right now.  But it seems a bit like we'll have
> to wait for them to finish building out the errseq_t support for NFS
> to avoid various races around the mapping's AS_EIO flag (A: fsync() ->
> EIO, B: fsync() -> SUCCESS, log checkpoint; A: panic), and then maybe
> we'd have to get at least one of { fd-passing, direct IO, threads }
> working on our side ...

I think we could "just" make use of DIO for relation extensions when
detecting NFS.  Given that we just about never actually read the
result of the file extension write, just converting that write to DIO
shouldn't have that bad an overall impact - of course it'll cause
slowdowns, but only while extending files.  And that ought to handle
ENOSPC correctly, while leaving the EIO handling separate?

Greetings,

Andres Freund
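[For readers following along: a rough sketch of what a DIO-only extension write could look like, assuming Linux/glibc semantics. dio_extend is a hypothetical name; note that O_DIRECT requires aligned buffers and is refused outright by some filesystems (e.g. tmpfs), which the sketch reports rather than treating as a hard error:]

```c
#define _GNU_SOURCE             /* for O_DIRECT on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Hypothetical direct-I/O relation extension: the write bypasses the page
 * cache, so ENOSPC comes back synchronously from pwrite() itself instead
 * of being deferred to a later fsync().
 *
 * Returns 0 on success, 1 if the filesystem refuses O_DIRECT, -1 on error.
 */
static int
dio_extend(const char *path, off_t offset)
{
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0)
        return (errno == EINVAL) ? 1 : -1;  /* e.g. tmpfs rejects O_DIRECT */

    void *buf;
    if (posix_memalign(&buf, 4096, BLCKSZ) != 0)  /* DIO needs aligned memory */
    {
        close(fd);
        return -1;
    }
    memset(buf, 0, BLCKSZ);

    ssize_t n = pwrite(fd, buf, BLCKSZ, offset);  /* ENOSPC surfaces here */
    int saved_errno = errno;
    free(buf);
    if (n != BLCKSZ)
    {
        close(fd);
        return (saved_errno == EINVAL) ? 1 : -1;
    }
    return (close(fd) == 0) ? 0 : -1;
}
```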
On Wed, Feb 20, 2019 at 7:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
> > My point is that for iSCSI to be performant we'd need *all* the
> > infrastructure we also need for direct IO *and* a *lot* more.  And that
> > it seems insane to invest very substantial resources into developing our
> > own iSCSI client when we don't even have DIO support.  And DIO support
> > would allow us to address the error reporting issues, while also
> > drastically improving performance in a lot of situations.  And we'd not
> > have to essentially develop our own filesystem etc.
>
> OK, got it.  So, I'll merge the patch for direct I/O support tomorrow,
> and then the iSCSI patch can go in on Thursday.  OK? :-)

Not something I paid a lot of attention to as an application
developer, but in a past life I have seen a lot of mission-critical
DB2 and Oracle systems running on ext4 or XFS over (kernel) iSCSI
plugged into big monster filers, and I think perhaps also some cases
of NFS, but those systems use DIO by default (and the latter has its
own NFS client IIUC).  So I suspect if you can just get DIO merged
today we can probably skip the userland iSCSI and call it done. :-P

--
Thomas Munro
https://enterprisedb.com