Discussion: [PATCH] COPY .. COMPRESSED
Greetings, Attached is a patch to add a 'COMPRESSED' option to COPY which will cause COPY to expect a gzip'd file on input and which will output a gzip'd file on output. Included is support for backend COPY, psql's \copy, regression tests for both, and documentation. On top of this I plan to submit a trivial patch to add support for this to file_fdw, allowing creation of FDW tables which operate directly on compressed files (including CSVs, which is what I need this patch for). I've also begun working on a patch to allow this capability to be used through pg_dump/pg_restore which would reduce the bandwidth used between the client and the server for backups and restores. Ideally, one would also be able to use custom format dumps, with compression, even if the client-side pg_dump/pg_restore wasn't compiled with zlib support. Thanks, Stephen
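For illustration, a minimal usage sketch of the proposed option; the exact placement of the COMPRESSED keyword is an assumption based on the description above and on examples later in the thread, and may differ from the patch itself:

    -- server-side COPY reading and writing a gzip'd file on the server
    COPY mytable FROM '/data/mytable.csv.gz' COMPRESSED;
    COPY mytable TO '/data/mytable.csv.gz' COMPRESSED;

    -- psql \copy: the gzip'd stream itself travels over the wire, so the
    -- compression (and decompression on load) is done by the server
    \copy mytable to 'mytable.csv.gz' compressed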
Stephen Frost <sfrost@snowman.net> writes: > Attached is a patch to add a 'COMPRESSED' option to COPY which will > cause COPY to expect a gzip'd file on input and which will output a > gzip'd file on output. Included is support for backend COPY, psql's > \copy, regression tests for both, and documentation. I don't think it's a very good idea to invent such a specialized option, nor to tie it to gzip, which is widely considered to be old news. There was discussion (and, I think, a patch in the queue) for allowing COPY to pipe into or out of an arbitrary shell pipe. Why would that not be enough to cover this use-case? That is, instead of a hard-wired capability, people would do something like COPY TO '| gzip >file.gz'. Or they could use bzip2 or whatever struck their fancy. regards, tom lane
Tom, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > Attached is a patch to add a 'COMPRESSED' option to COPY which will > > cause COPY to expect a gzip'd file on input and which will output a > > gzip'd file on output. Included is support for backend COPY, psql's > > \copy, regression tests for both, and documentation. > > I don't think it's a very good idea to invent such a specialized option, > nor to tie it to gzip, which is widely considered to be old news. We're already using gzip/zlib for pg_dump/pg_restore, so it was simple and straight-forward to add and would allow utilizing this option while keeping the custom dump format the same. It also happens to match what I need. While gzip might be 'old hat' it's still extremely popular. I'd be happy to add support for bzip2 or something else that people are interested in, and support compression options for zlib if necessary too. This was intended to get the ball rolling on something as the last discussion that I had seen while hunting through the archives was from 2006, obviously I missed the boat on the last set of patches. > There was discussion (and, I think, a patch in the queue) for allowing > COPY to pipe into or out of an arbitrary shell pipe. Why would that not > be enough to cover this use-case? That is, instead of a hard-wired > capability, people would do something like COPY TO '| gzip >file.gz'. > Or they could use bzip2 or whatever struck their fancy. Sounds like a nice idea, but I can't imagine it'd be available to anyone except for superusers, and looking at that patch, that's exactly the restriction which is in place for it. In addition, that patch's support for "\copy" implements everything locally, making it little different from "zcat mycsv.csv.gz | psql". The patch that I proposed actually sent the compressed stream across the wire, reducing bandwidth utilization. All that said, I've nothing against having the pipe option for the backend COPY command; a bit annoyed with myself for somehow missing that patch. I don't like what it's doing with psql's \copy command and would rather we figure out a way to support PROGRAM .. TO STDOUT, but that still would require superuser privileges. I don't see any easy way to support compressed data streaming to/from the server for COPY w/o defining what methods are available or coming up with some ACL system for what programs can be called by the backend. Thanks, Stephen
On 14 January 2013 13:43, Stephen Frost <sfrost@snowman.net> wrote: > Tom, > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Stephen Frost <sfrost@snowman.net> writes: >> > Attached is a patch to add a 'COMPRESSED' option to COPY which will >> > cause COPY to expect a gzip'd file on input and which will output a >> > gzip'd file on output. Included is support for backend COPY, psql's >> > \copy, regression tests for both, and documentation. >> >> I don't think it's a very good idea to invent such a specialized option, >> nor to tie it to gzip, which is widely considered to be old news. > > We're already using gzip/zlib for pg_dump/pg_restore, so it was simple > and straight-forward to add and would allow utilizing this option while > keeping the custom dump format the same. Both thoughts are useful, I think. There is a new option being added to pre/post process data, so it seems like the best way to add new features - in general. Specifically, we do support compressed output so a simple patch to allow re-loading of the compressed data we generate does seem sensible and reasonable. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Simon Riggs (simon@2ndQuadrant.com) wrote: > There is a new option being added to pre/post process data, so it > seems like the best way to add new features - in general. That structure appears to have no option for passing compressed data to or from a client connection. Instead, it actually overloads the typical meaning for options sent to \copy (which, imv, is "run COPY on the server with these options and have the results stored locally") to mean something different (run part of the COPY command on the server and part of it locally). > Specifically, we do support compressed output so a simple patch to > allow re-loading of the compressed data we generate does seem sensible > and reasonable. Right, we're already using gzip for pg_dump/pg_restore. This just gives an option to move that compression over to the server side. Also, I'd be happy to add support for other compression options. I do like the idea of a generalized answer which just runs a user-provided command on the server but that's always going to require superuser privileges. Thanks, Stephen
On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote: > I do like the idea of a generalized answer which just runs a > user-provided command on the server but that's always going to require > superuser privileges. Unless it's one of a set of superuser-authorized compression tools.
Stephen Frost <sfrost@snowman.net> writes: > I do like the idea of a generalized answer which just runs a > user-provided command on the server but that's always going to require > superuser privileges. The design that was being kicked around allowed pipes to be used on the client side too, ie \copy foo to '| gzip ...'. That form would not require any special privileges, and might be thought preferable for another reason too: it offloads the work from the server. regards, tom lane
* Claudio Freire (klaussfreire@gmail.com) wrote: > On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote: > > I do like the idea of a generalized answer which just runs a > > user-provided command on the server but that's always going to require > > superuser privileges. > > Unless it's one of a set of superuser-authorized compression tools. Which would require a new ACL system for handling that, as I mentioned.. That certainly isn't what the existing patch does. What would that look like? How would it operate? How would a user invoke it or even know what options are available? Would we provide anything by default? It's great to consider that possibility but there's a lot of details involved. I'm a bit nervous about having a generalized system which can run anything on the system when called by a superuser but when called by a regular user we're on the hook to verify the request against a superuser-provided list and to then make sure nothing goes wrong. Thanks, Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > I do like the idea of a generalized answer which just runs a > > user-provided command on the server but that's always going to require > > superuser privileges. > > The design that was being kicked around allowed pipes to be used on the > client side too, ie \copy foo to '| gzip ...'. That form would not > require any special privileges, and might be thought preferable for > another reason too: it offloads the work from the server. It's a different use-case which, imv, is really already trivially covered: psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz While there is no option currently for having the server do the compression before sending the data over the wire. Thanks, Stephen
On 1/14/13 11:28 AM, Stephen Frost wrote: > While there is no option currently for having the server do the > compression before sending the data over the wire. OpenSSL?
* Peter Eisentraut (peter_e@gmx.net) wrote: > On 1/14/13 11:28 AM, Stephen Frost wrote: > > While there is no option currently for having the server do the > > compression before sending the data over the wire. > > OpenSSL? To be honest, I expected that to come up earlier in this discussion. It'd be redundant to use OpenSSL for compression and then ALSO do compression on the client side to save into a custom format dump. There's also plenty of reasons to not want to deal with OpenSSL just to have compression support. Now, protocol-level on-the-wire compression is another option, but there's quite a few drawbacks to that and quite a bit of work involved. Having support for COPY-based compression could be an answer for many cases where on-the-wire compression is desirable. Being able to use pipes for the backend side of COPY is a good solution for that. I'm looking forward to having it and plan to review the patch. That said, I'd like to find an answer to some of these other use cases, if possible. Thanks, Stephen
On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Peter Eisentraut (peter_e@gmx.net) wrote: >> On 1/14/13 11:28 AM, Stephen Frost wrote: >> > While there is no option currently for having the server do the >> > compression before sending the data over the wire. >> >> OpenSSL? > > To be honest, I expected that to come up earlier in this discussion. > It'd be redundant to use OpenSSL for compression and then ALSO do > compression on the client side to save into a custom format dump. For what it's worth there was a security announcement not long ago that made OpenSSL disable compression in streams by default. I'm not sure if it's relevant to Postgres or not. -- greg
On Mon, Jan 14, 2013 at 11:33 PM, Stephen Frost <sfrost@snowman.net> wrote: > Now, protocol-level on-the-wire compression > is another option, but there's quite a few drawbacks to that and quite a > bit of work involved. Having support for COPY-based compression could > be an answer for many cases where on-the-wire compression is desirable. Like? Postgres' packetized protocol lends itself quite well for this kind of thing. It could even be done on a packet-by-packet basis. The only drawback I see, is that it pretty much rules out piping through arbitrary commands (a protocol needs to be very clearly defined).
On 1/13/13 9:16 PM, Stephen Frost wrote: > On top of this I plan to submit a trivial patch to add support for > this to file_fdw, allowing creation of FDW tables which operate > directly on compressed files (including CSVs, which is what I need > this patch for). > > I've also begun working on a patch to allow this capability to be used > through pg_dump/pg_restore which would reduce the bandwidth used > between the client and the server for backups and restores. Ideally, > one would also be able to use custom format dumps, with compression, > even if the client-side pg_dump/pg_restore wasn't compiled with zlib > support. I think a problem is that this code is now serving such different uses. Operating on compressed files transparently in file_fdw is obviously useful, but why only gzip? The gold standard is GNU tar, which can operate on any compressed file in a variety of compression formats without even having to specify an option. Writing compressed COPY output files on the backend has limited uses, at least none have been clearly explained, and the popen patch might address those better. Writing compressed COPY output on the frontend can already be done differently. Compression on the wire is a different debate and it probably shouldn't be snuck in through this backdoor. Putting compressed COPY output from the backend straight into a compressed pg_dump file sounds interesting, but this patch doesn't do that yet, and I think there will be more issues to solve there.
* Peter Eisentraut (peter_e@gmx.net) wrote: > Operating on compressed files transparently in file_fdw is obviously > useful, but why only gzip? This isn't really an argument, imv. It's only gzip *right this moment* because that's all that I implemented. I've already offered to add bzip2 or whatever else people would like. > The gold standard is GNU tar, which can > operate on any compressed file in a variety of compression formats > without even having to specify an option. Yes, that's what I was hoping to get to, eventually. > Writing compressed COPY output files on the backend has limited uses, at > least none have been clearly explained, and the popen patch might > address those better. I do see value in the popen patch for server-side operations. > Writing compressed COPY output on the frontend can already be done > differently. Certainly. In a similar vein, I'm not convinced that the popen patch for psql's \copy is really a great addition. > Compression on the wire is a different debate and it probably shouldn't > be snuck in through this backdoor. Considering the COPY-COMPRESSED-to-FE piece is the vast majority of the patch, I hope you understand that it certainly wasn't my intent to try and 'sneak it in'. Support for reading and writing compressed files with COPY directly from the FE was one of my goals from the start on this. > Putting compressed COPY output from the backend straight into a > compressed pg_dump file sounds interesting, but this patch doesn't do > that yet, and I think there will be more issues to solve there. Let me just vent my dislike for the pg_dump code. :) Probably half the time spent on this overall patch was fighting with that to make it work and it's actually about 90% of the way there, imv. Getting the compressed data into pg_dump is working in my local branch, going to a directory-format dump output, but the custom format is causing me some difficulties, which I believe are related to the blocking that's used and to the fact that the blocks coming off the wire were 'full-size', if you will, instead of being chunked down to 4KB by the client-side compression. I've simply not had time to debug it and fix it and wanted to get the general patch out for discussion (which I'm glad that I did, given that there's other work going on that's related). Thanks, Stephen
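To make the file_fdw use-case discussed above concrete, a sketch of what a foreign table over a gzip'd CSV might look like; the 'compressed' option here is hypothetical, mirroring the COPY option rather than anything in the posted patch:

    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

    -- hypothetical 'compressed' option; file_fdw would pass it through
    -- to the backend COPY machinery along with its other options
    CREATE FOREIGN TABLE measurements (
        city     text,
        reading  numeric,
        taken_at timestamptz
    ) SERVER csv_files
      OPTIONS (filename '/data/measurements.csv.gz', format 'csv', compressed 'true');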
* Claudio Freire (klaussfreire@gmail.com) wrote: > Postgres' packetized protocol lends itself quite well for this kind of > thing. It could even be done on a packet-by-packet basis. The only > drawback I see, is that it pretty much rules out piping through > arbitrary commands (a protocol needs to be very clearly defined). Actually, wouldn't PG's packet-based protocol be exactly the wrong way to do any kind of good on-the-wire compression? You certainly wouldn't want to waste time compressing small packets, such as a single INSERT command or similar, as you'll always have to send a packet out anyway. Even doing it at the protocol level with something ssl-like, where you wrap the entire connection, wouldn't help if the client has a process along the lines of: send INSERT command wait for response send INSERT command wait for response .. .. Since you'd still have to flush after each small bit of data. Where it does work well is when you move into a bulk-data mode (ala COPY) and can compress relatively large amounts of data into a smaller number of full-size packets to be sent. Thanks, Stephen
On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote: > Where it does work well is when you move into a bulk-data mode (ala > COPY) and can compress relatively large amounts of data into a smaller > number of full-size packets to be sent. Well... exactly. COPY is one case, big result sets is another. And packet headers can include whether each packet is compressed or not, which is quite transparent and easy to handle. There could even be a negotiation phase and make it backwards-compatible.
On Tue, Jan 15, 2013 at 01:35:57PM +0000, Greg Stark wrote: > On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote: > > * Peter Eisentraut (peter_e@gmx.net) wrote: > >> On 1/14/13 11:28 AM, Stephen Frost wrote: > >> > While there is no option currently for having the server do the > >> > compression before sending the data over the wire. > >> > >> OpenSSL? > > > > To be honest, I expected that to come up earlier in this discussion. > > It'd be redundant to use OpenSSL for compression and then ALSO do > > compression on the client side to save into a custom format dump. > > For what it's worth there was a security announcement not long ago > that made OpenSSL disable compression in streams by default. I'm not > sure if it's relevant to Postgres or not. It's an interesting question. It might be. I thought at first it wouldn't be relevant, but on reflection it is. This attack is called the CRIME attack. This class of attacks stems from reuse of a dictionary across some sort of confidentiality boundary. The attacker looks at the traffic and notices 'how big' the network response is. This tells the attacker whether the compressor has already seen the text. So imagine I have a website and I keep session cookies, user names, and password crypts in a database. Imagine the session key is a long hexadecimal number. As an attacker, I could send guesses of the prefix of somebody else's session key in the username field of the login form while peeking at the encrypted traffic from the script to the database. As the prefix match length increases, the database reply gets shorter. Essentially, it's a side-channel attack that ends up reducing guessing to a radix search by prefix of all valid session keys. In this attack, I don't have access to the database except through the web form, but I can see the encrypted database traffic. It is not a huge vulnerability, but yeah, in some use cases, if PostgreSQL used compression it might provide a difficult, but possible, route. Garick
* Claudio Freire (klaussfreire@gmail.com) wrote: > On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote: > > Where it does work well is when you move into a bulk-data mode (ala > > COPY) and can compress relatively large amounts of data into a smaller > > number of full-size packets to be sent. > > Well... exactly. COPY is one case, big result sets is another. > And packet headers can include whether each packet is compressed or > not, which is quite transparent and easy to handle. There could even > be a negotiation phase and make it backwards-compatible. COPY and a large result set are the only cases, and a large result set could easily be put inside of a COPY statement. I agree that large result sets outside of COPY could benefit from compression and perhaps we can formulate a way to support that also. Thanks, Stephen
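For example, since COPY accepts a query, an arbitrary result set can already be funneled through the bulk-transfer path (table and column names below are only illustrative):

    -- a large SELECT routed through COPY rather than the normal result path
    COPY (SELECT id, payload FROM events WHERE logged_at >= '2013-01-01')
        TO STDOUT WITH (FORMAT csv);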
On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote: > * Peter Eisentraut (peter_e@gmx.net) wrote: > > Operating on compressed files transparently in file_fdw is obviously > > useful, but why only gzip? > > This isn't really an argument, imv. It's only gzip *right this moment* > because that's all that I implemented. I've already offered to add > bzip2 or whatever else people would like. And this leads to support-my-compression-binary-of-the-day mess. Why not just allow them to do '|compression-binary'? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
* Bruce Momjian (bruce@momjian.us) wrote: > On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote: > > * Peter Eisentraut (peter_e@gmx.net) wrote: > > > Operating on compressed files transparently in file_fdw is obviously > > > useful, but why only gzip? > > > > This isn't really an argument, imv. It's only gzip *right this moment* > > because that's all that I implemented. I've already offered to add > > bzip2 or whatever else people would like. > > And this leads to support-my-compression-binary-of-the-day mess. Why > not just allow them to do '|compression-binary'? You're right, to clarify, for *file_fdw*, which is a backend-only operation, the popen patch is great (thought I made that clear before). The popen patch doesn't support the '|compression-binary' option through the FE protocol. Even if it did, it would only be available for superusers as we can't allow regular users to run arbitrary commands on the server-side. Thanks, Stephen
On Tue, Jan 15, 2013 at 2:53 PM, Stephen Frost <sfrost@snowman.net> wrote: > The popen patch doesn't support the '|compression-binary' option through > the FE protocol. Even if it did, it would only be available for > superusers as we can't allow regular users to run arbitrary commands on > the server-side. That points towards a fix that involves having a set of non-arbitrary commands that we allow plain users to use. Hmm. There's an interesting thought... How about having a "pg_filters" table in pg_catalog which allows capturing labels and names of known-to-be-safe binary filters: insert into pg_filters (label, location) values ('zcat', '/usr/bin/zcat'), ('bzip2', '/usr/bin/bzip2'), ('bunzip2', '/usr/bin/bunzip2'); And then having some capability to grant permissions to roles to use these filters. That's not a "version 1" capability... Suppose we have, in 9.3, that there are direct references to "|/usr/bin/zcat" (and such), and then hope, in 9.4, to tease this out to be a non-superuser-capable facility via the above pg_filters? These filters should be useful for FDWs as well as for COPY. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
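To sketch how that might eventually be exposed — everything below is hypothetical syntax extrapolated from the proposal above, not from any existing patch:

    -- hypothetical: a superuser allows a role to use a registered filter
    GRANT USAGE ON FILTER zcat TO report_loader;

    -- hypothetical: a non-superuser then refers to the filter by its label
    -- instead of supplying an arbitrary shell command
    COPY mytable FROM '/data/mytable.csv.gz' WITH (FORMAT csv, FILTER 'zcat');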
On Tue, Jan 15, 2013 at 03:37:07PM -0500, Christopher Browne wrote: > On Tue, Jan 15, 2013 at 2:53 PM, Stephen Frost <sfrost@snowman.net> wrote: > > The popen patch doesn't support the '|compression-binary' option through > > the FE protocol. Even if it did, it would only be available for > > superusers as we can't allow regular users to run arbitrary commands on > > the server-side. > > That points towards a fix that involves having a set of non-arbitrary commands > that we allow plain users to use. > > Hmm. There's an interesting thought... > > How about having a "pg_filters" table in pg_catalog which allows capturing > labels and names of known-to-be-safe binary filters: > > insert into pg_filters (label, location) > values > ('zcat', '/usr/bin/zcat'), > ('bzip2', '/usr/bin/bzip2'), > ('bunzip2', '/usr/bin/bunzip2'); > > And then having some capability to grant permissions to roles to use > these filters. > > That's not a "version 1" capability... Suppose we have, in 9.3, that there are > direct references to "|/usr/bin/zcat" (and such), and then hope, in > 9.4, to tease > this out to be a non-superuser-capable facility via the above pg_filters? > > These filters should be useful for FDWs as well as for COPY. Well, COPY is super-user only, so it seems only useful for FDW, no? We already have lots of user-configuration FDW commands, so I can see adding this one too. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
* Christopher Browne (cbbrowne@gmail.com) wrote: > How about having a "pg_filters" table in pg_catalog which allows capturing > labels and names of known-to-be-safe binary filters: I was considering that (though I was thinking they'd be "transformations" rather than filters; filter implies that you're removing something, imv), but as I mentioned upthread, there are dangers in that direction and having a default set of options strikes me as a lot more challenging to provide. > insert into pg_filters (label, location) > values > ('zcat', '/usr/bin/zcat'), > ('bzip2', '/usr/bin/bzip2'), > ('bunzip2', '/usr/bin/bunzip2'); We'd need to include which direction is supported also, I think. > And then having some capability to grant permissions to roles to use > these filters. Yes, an additional ACL system, as I mentioned upthread, would be required for this. > That's not a "version 1" capability... Suppose we have, in 9.3, that there are > direct references to "|/usr/bin/zcat" (and such), and then hope, in > 9.4, to tease > this out to be a non-superuser-capable facility via the above pg_filters? It would be good to flesh out what the syntax, etc, would look like for this, if we're going to support it, before we go down a road that limits us in what we can do. For example, if we implement the existing popen call, and then later want to allow non-superusers to use certain filters, how would the non-superuser specify the filter? I really don't think we want to be taking the shell-like command provided by a non-superuser and then trying to match that against a list of commands in a table. > These filters should be useful for FDWs as well as for COPY. I'm not sure I see how any FDW beyond file_fdw would really benefit from this? I don't think a MySQL FDW or Redis FDW would gain anything... Thanks, Stephen
* Bruce Momjian (bruce@momjian.us) wrote: > Well, COPY is super-user only, so it seems only useful for FDW, no? We > already have lots of user-configuration FDW commands, so I can see > adding this one too. COPY is most certainly not superuser-only.. COPY w/ popen, if that popen can call anything, would certainly have to be superuser-only. COPY TO STDOUT / FROM STDIN is available to and used a huge amount by non-superusers. Would be great if we could allow that to work with compressed data also, imv. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Bruce Momjian (bruce@momjian.us) wrote: >> And this leads to support-my-compression-binary-of-the-day mess. Why >> not just allow them to do '|compression-binary'? > The popen patch doesn't support the '|compression-binary' option through > the FE protocol. Even if it did, it would only be available for > superusers as we can't allow regular users to run arbitrary commands on > the server-side. I find the argument that this supports compression-over-the-wire to be quite weak, because COPY is only one form of bulk data transfer, and one that a lot of applications don't ever use. If we think we need to support transmission compression for ourselves, it ought to be integrated at the wire protocol level, not in COPY. Just to not look like I'm rejecting stuff without proposing alternatives, here is an idea about a backwards-compatible design for doing that: we could add an option that can be set in the connection request packet. Say, "transmission_compression = gzip". regards, tom lane
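A usage sketch of what that could look like from the client side; the parameter name is the hypothetical one above and does not exist today:

    # hypothetical connection option: everything on this connection is
    # compressed by the sender and transparently decompressed on receipt
    psql "host=db.example.com dbname=sales transmission_compression=gzip" \
         -c "COPY big_table TO STDOUT" > big_table.csv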
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > I find the argument that this supports compression-over-the-wire to be > quite weak, because COPY is only one form of bulk data transfer, and > one that a lot of applications don't ever use. If we think we need to > support transmission compression for ourselves, it ought to be > integrated at the wire protocol level, not in COPY. As far as I can tell, COPY is the option which is strongly recommended for bulk-data-operations. I can see the use-case for wanting SELECT results to be compressed, but it strikes me as the 10% case for PG users rather than the 90% one. Ditto for COPY vs. large INSERT .. VALUES. Compressing every small packet seems like it'd be overkill and might surprise people by actually reducing performance in the case of lots of small requests. It also strikes me as a bit silly to do something like: zcat myfile.gz | \ psql -Z -c "COPY mytable FROM STDIN;" > Just to not look like I'm rejecting stuff without proposing > alternatives, here is an idea about a backwards-compatible design for > doing that: we could add an option that can be set in the connection > request packet. Say, "transmission_compression = gzip". Alright, do we want/need to support multiple options there? What do people think we should support? Any other particular details or issues that come to mind with such an implementation? I'm willing to work through that if it's the route everyone agrees with. Thanks, Stephen
On Tue, Jan 15, 2013 at 04:22:48PM -0500, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > Well, COPY is super-user only, so it seems only useful for FDW, no? We > > already have lots of user-configuration FDW commands, so I can see > > adding this one too. > > COPY is most certainly not superuser-only.. COPY w/ popen, if that > popen can call anything, would certainly have to be superuser-only. COPY with a file name is super-user-only. I am unclear how you would use STDIN/STDOUT in any meaningful way with binary data produced by compression. I guess you could with libpq. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
* Bruce Momjian (bruce@momjian.us) wrote: > COPY with a file name is super-user-only. I am unclear how you would > use STDIN/STDOUT in any meaningful way with binary data produced by > compression. I guess you could with libpq. The patch that I posted provided this: psql -h myhost -c "COPY mytable FROM STDIN COMPRESSED;" < myfile.gz With the compressed file being transferred unmolested to the server side where it was decompressed and processed by the server. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> I find the argument that this supports compression-over-the-wire to be >> quite weak, because COPY is only one form of bulk data transfer, and >> one that a lot of applications don't ever use. If we think we need to >> support transmission compression for ourselves, it ought to be >> integrated at the wire protocol level, not in COPY. > As far as I can tell, COPY is the option which is strongly recommended > for bulk-data-operations. I can see the use-case for wanting SELECT > results to be compressed, but it strikes me as the 10% case for PG users > rather than the 90% one. Ditto for COPY vs. large INSERT .. VALUES. Really? Given that libpq provides no useful support for doing anything with COPY data, much less higher-level packages such as Perl DBI, I'd venture that the real-world ratio is more like 90/10. If not 99/1. There might be a few souls out there who are hardy enough and concerned enough with performance to have made their apps speak COPY protocol, and not given up on it the first time they hit a quoting/escaping bug ... but not many, I bet. > Compressing every small packet seems like it'd be overkill and might > surprise people by actually reducing performance in the case of lots of > small requests. Yeah, proper selection and integration of a compression method would be critical, which is one reason that I'm not suggesting a plugin for this. You couldn't expect any-random-compressor to work well. I think zlib would be okay though when making use of its stream compression features. The key thing there is to force a stream buffer flush (too lazy to look up exactly what zlib calls it, but they have the concept) exactly when we're about to do a flush to the socket. That way we get cross-packet compression but don't have a problem with the compressor failing to send the last partial message when we need it to. (My suggestion of an expansible option is for future-proofing, not because I think we'd try to support more than one option today.) regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Really? Given that libpq provides no useful support for doing anything > with COPY data, much less higher-level packages such as Perl DBI, I'd > venture that the real-world ratio is more like 90/10. If not 99/1. Perhaps I'm taking a bit too narrow view of the world, but my thinking is OLTP won't want things compressed, as it increases latency of requests, while OLAP users are operating with enough data that they'll go through the effort to use COPY. > There might be a few souls out there who are hardy enough and concerned > enough with performance to have made their apps speak COPY protocol, > and not given up on it the first time they hit a quoting/escaping bug > ... but not many, I bet. The Perl/PHP/Ruby/etc users that are writing OLTP systems aren't likely going to be interested in this. The OLAP users are likely using psql (it's what we're using to load terabytes of data via COPY; JDBC, DBI, etc., have all been tried and pale in comparison). > The key thing there is to force a stream buffer flush (too lazy to look > up exactly what zlib calls it, but they have the concept) exactly when > we're about to do a flush to the socket. That way we get cross-packet > compression but don't have a problem with the compressor failing to send > the last partial message when we need it to. Yes, I'm familiar with it. > (My suggestion of an expansible option is for future-proofing, not > because I think we'd try to support more than one option today.) Fair enough. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Really? Given that libpq provides no useful support for doing anything >> with COPY data, much less higher-level packages such as Perl DBI, I'd >> venture that the real-world ratio is more like 90/10. If not 99/1. > Perhaps I'm taking a bit too narrow view of the world, but my thinking > is OLTP won't want things compressed, as it increases latency of > requests, while OLAP users are operating with enough data that they'll > go through the effort to use COPY. I should think the argument for or against wire-protocol compression depends mainly on your network environment, not the nature of your application. Either bytes sent are more expensive than CPU cycles at each end, or vice versa. Latency could be a big deal if we weren't going to force compressor flushes at synchronization boundaries, but if we are, any added latency is a matter of a few cycles at most. regards, tom lane
Stephen Frost <sfrost@snowman.net> writes: > Perhaps I'm taking a bit too narrow view of the world, but my thinking > is OLTP won't want things compressed, as it increases latency of > requests, while OLAP users are operating with enough data that they'll > go through the effort to use COPY. Also, if there are so many people using COPY in their apps, why have we never seen any submitted patches to extend libpq with functions to construct/deconstruct COPY data? Surely somebody would have said to themselves that they shouldn't be responsible for knowing those escaping rules. regards, tom lane
On 01/15/2013 06:22 PM, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: >> Perhaps I'm taking a bit too narrow view of the world, but my thinking >> is OLTP won't want things compressed, as it increases latency of >> requests, while OLAP users are operating with enough data that they'll >> go through the effort to use COPY. > Also, if there are so many people using COPY in their apps, why have we > never seen any submitted patches to extend libpq with functions to > construct/deconstruct COPY data? Surely somebody would have said to > themselves that they shouldn't be responsible for knowing those escaping > rules. There are perfectly good libraries in Perl and other languages for constructing/deconstructing CSV data. If we didn't have CSV import/export I suspect we would have heard lots more howls by now. cheers andrew
On Tue, Jan 15, 2013 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Compressing every small packet seems like it'd be overkill and might >> surprise people by actually reducing performance in the case of lots of >> small requests. > > Yeah, proper selection and integration of a compression method would be > critical, which is one reason that I'm not suggesting a plugin for this. > You couldn't expect any-random-compressor to work well. I think zlib > would be okay though when making use of its stream compression features. > The key thing there is to force a stream buffer flush (too lazy to look > up exactly what zlib calls it, but they have the concept) exactly when > we're about to do a flush to the socket. That way we get cross-packet > compression but don't have a problem with the compressor failing to send > the last partial message when we need it to. Just a "stream flush bit" (or stream reset bit) on the packet header would do. First packet on any stream would be marked, and that's it.
On 1/15/13 2:53 PM, Stephen Frost wrote: > You're right, to clarify, for *file_fdw*, which is a backend-only > operation, the popen patch is great (thought I made that clear before). I would think that if we get writable FDWs, you would want file_fdw to go through zlib so that it can write directly to the file.
* Peter Eisentraut (peter_e@gmx.net) wrote: > On 1/15/13 2:53 PM, Stephen Frost wrote: > > You're right, to clarify, for *file_fdw*, which is a backend-only > > operation, the popen patch is great (thought I made that clear before). > > I would think that if we get writable FDWs, you would want file_fdw to > go through zlib so that it can write directly to the file. With the popen patch, I expect it could be defined as '|gzip > myfile.gz'.. I believe that patch did that. It'd be ideal to add support for that to file_fdw also, certainly. That shouldn't be hard as file_fdw is essentially a wrapper around backend COPY already and this would just be adding a few additional options to pass through. Thanks, Stephen
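A sketch of how that could look if file_fdw exposes the popen capability, reusing the csv_files server from the earlier sketch; the 'program' option name is an assumption about how it might be spelled, not something the posted patches provide:

    -- hypothetical 'program' option: the server runs the command via popen
    -- and COPYs from its stdout, so decompression happens on the server side
    CREATE FOREIGN TABLE archived_logs (
        logged_at timestamptz,
        message   text
    ) SERVER csv_files
      OPTIONS (program 'gzip -dc /data/archived_logs.csv.gz', format 'csv');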
On Mon, Jan 14, 2013 at 11:28 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Stephen Frost <sfrost@snowman.net> writes: >> > I do like the idea of a generalized answer which just runs a >> > user-provided command on the server but that's always going to require >> > superuser privileges. >> >> The design that was being kicked around allowed pipes to be used on the >> client side too, ie \copy foo to '| gzip ...'. That form would not >> require any special privileges, and might be thought preferable for >> another reason too: it offloads the work from the server. > > It's a different use-case which, imv, is really already trivially > covered: > > psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz > > While there is no option currently for having the server do the > compression before sending the data over the wire. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 15, 2013 at 3:37 PM, Christopher Browne <cbbrowne@gmail.com> wrote: > That points towards a fix that involves having a set of non-arbitrary commands > that we allow plain users to use. > > Hmm. There's an interesting thought... > > How about having a "pg_filters" table in pg_catalog which allows capturing > labels and names of known-to-be-safe binary filters: > > insert into pg_filters (label, location) > values > ('zcat', '/usr/bin/zcat'), > ('bzip2', '/usr/bin/bzip2'), > ('bunzip2', '/usr/bin/bunzip2'); > > And then having some capability to grant permissions to roles to use > these filters. I suspect that's going to be less efficient than using a compression library that's linked into the backend, because you have to copy all the data through the kernel to another process and back. And it's certainly a lot more complex. If it greatly broadened the applicability of this feature I might think it was worthwhile, but I can't see that it does. I suspect that supporting zlib, which we already linked against, would cater to something well upwards of 90% of the use cases here. Sure, there are other things, but zlib is very widely used and bzip2 IME is far too slow to be taken seriously for this kind of application. The additional space savings that you get for the additional CPU investment is typically small, and if you really need it, having to un-gzip and re-bzip2 on the client is always an option. If you're using bzip2 you obviously have CPU time to burn. At any rate, I think it would be good to avoid letting our desire for infinite flexibility get in the way of doing something useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 16, 2013 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 15, 2013 at 3:37 PM, Christopher Browne <cbbrowne@gmail.com> wrote: >> That points towards a fix that involves having a set of non-arbitrary commands >> that we allow plain users to use. >> >> Hmm. There's an interesting thought... >> >> How about having a "pg_filters" table in pg_catalog which allows capturing >> labels and names of known-to-be-safe binary filters: >> >> insert into pg_filters (label, location) >> values >> ('zcat', '/usr/bin/zcat'), >> ('bzip2', '/usr/bin/bzip2'), >> ('bunzip2', '/usr/bin/bunzip2'); >> >> And then having some capability to grant permissions to roles to use >> these filters. > > I suspect that's going to be less efficient than using a compression > library that's linked into the backend, because you have to copy all > the data through the kernel to another process and back. And it's > certainly a lot more complex. More complex, certainly. By spawning a separate process, we'd get benefit of multicore CPUs, so I'm not sure I agree that it's necessarily slower. > At any rate, I think it would be good to avoid letting our desire for > infinite flexibility get in the way of doing something useful. Oh, agreed. I was actively thinking of the cooler bits of this pointing more towards 9.4 than 9.3. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I find the argument that this supports compression-over-the-wire to be > quite weak, because COPY is only one form of bulk data transfer, and > one that a lot of applications don't ever use. If we think we need to > support transmission compression for ourselves, it ought to be > integrated at the wire protocol level, not in COPY. > > Just to not look like I'm rejecting stuff without proposing > alternatives, here is an idea about a backwards-compatible design for > doing that: we could add an option that can be set in the connection > request packet. Say, "transmission_compression = gzip". But presumably this would transparently compress at one end and decompress at the other end, which is again a somewhat different use case. To get compressed output on the client side, you have to decompress and recompress. Maybe that's OK, but it's not quite the same thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Bruce Momjian (bruce@momjian.us) wrote: > On Wed, Jan 16, 2013 at 06:19:09PM -0500, Robert Haas wrote: > > But presumably this would transparently compress at one end and > > decompress at the other end, which is again a somewhat different use > > case. To get compressed output on the client side, you have to > > decompress and recompress. Maybe that's OK, but it's not quite the > > same thing. > > Is there a TODO here? For some reason, I thought there was a patch... Thanks, Stephen
On Wed, Jan 16, 2013 at 06:19:09PM -0500, Robert Haas wrote: > On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I find the argument that this supports compression-over-the-wire to be > > quite weak, because COPY is only one form of bulk data transfer, and > > one that a lot of applications don't ever use. If we think we need to > > support transmission compression for ourselves, it ought to be > > integrated at the wire protocol level, not in COPY. > > > > Just to not look like I'm rejecting stuff without proposing > > alternatives, here is an idea about a backwards-compatible design for > > doing that: we could add an option that can be set in the connection > > request packet. Say, "transmission_compression = gzip". > > But presumably this would transparently compress at one end and > decompress at the other end, which is again a somewhat different use > case. To get compressed output on the client side, you have to > decompress and recompress. Maybe that's OK, but it's not quite the > same thing. Is there a TODO here? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Jan 16, 2013 at 8:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I find the argument that this supports compression-over-the-wire to be >> quite weak, because COPY is only one form of bulk data transfer, and >> one that a lot of applications don't ever use. If we think we need to >> support transmission compression for ourselves, it ought to be >> integrated at the wire protocol level, not in COPY. >> >> Just to not look like I'm rejecting stuff without proposing >> alternatives, here is an idea about a backwards-compatible design for >> doing that: we could add an option that can be set in the connection >> request packet. Say, "transmission_compression = gzip". > > But presumably this would transparently compress at one end and > decompress at the other end, which is again a somewhat different use > case. To get compressed output on the client side, you have to > decompress and recompress. Maybe that's OK, but it's not quite the > same thing. Well, libpq could give some access to raw compressed streams, but, really, even with double compression on the client, it solves the bandwidth issue, not only for pg_dump, pg_restore, and copy, but also for all other transfer-intensive applications. I do think it's the best option.