Discussion: a faster compression algorithm for pg_dump
I'd like to revive the discussion about offering another compression algorithm than zlib, at least for pg_dump. There has been a previous discussion here: http://archives.postgresql.org/pgsql-performance/2009-08/msg00053.php and it ended without any real result. The results so far were:

- There exist BSD-licensed compression algorithms
- Nobody knows of a patent that is in our way
- Nobody can confirm that no patent is in our way

I do see a very real demand for replacing zlib, which compresses quite well but is slow as hell. For pg_dump what people want is cheap compression; they usually prefer an algorithm that compresses less optimally but is really fast.

One question that I do not yet see answered is: do we risk violating a patent even if we just link against a compression library, for example liblzf, without shipping the actual code?

I have checked what other projects do, especially with liblzf, which would be my favorite choice (BSD license, available for quite some time...), and there are other projects that actually ship the lzf code (I haven't found a project that just links to it). The most prominent projects are:

- KOffice (implements a derived version in koffice-2.1.2/libs/store/KoXmlReader.cpp)
- VirtualBox (ships it in vbox-ose-1.3.8/src/libs/liblzf-1.51)
- TuxOnIce (formerly known as suspend2, a Linux kernel patch; ships it in the patch)

We have pg_lzcompress.c, which implements the compression routines for the tuple toaster. Are we sure that we don't violate any patents with this algorithm?

Joachim
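The speed-for-ratio tradeoff described above is easy to see even within zlib itself, by varying the compression level. A quick Python sketch (the sample data is a made-up stand-in for a dump stream, and the exact timings will vary by machine):

```python
import time
import zlib

# Repetitive sample data standing in for pg_dump output.
data = b"INSERT INTO t VALUES (1, 'some repetitive row data');\n" * 20000

for level in (1, 6, 9):  # 6 is zlib's default level
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out)} bytes in {elapsed * 1000:.2f} ms")
```

An algorithm like lzf pushes this tradeoff further than zlib level 1 can: noticeably worse ratio, but typically several times faster.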
On Fri, Apr 9, 2010 at 12:17 AM, Joachim Wieland <joe@mcknight.de> wrote:
> One question that I do not yet see answered is, do we risk violating a
> patent even if we just link against a compression library, for example
> liblzf, without shipping the actual code?

Generally, patents are infringed when the process is used, so whether we link against the code or ship it isn't really relevant. The user using the software would need a patent license either way. We want Postgres to be usable without being dependent on any copyright or patent licenses.

Linking against it as an option isn't nearly as bad, since the user compiling it can choose whether to include the restricted feature or not. That's what we do with readline. However, it's not nearly as attractive when it restricts what file formats Postgres supports -- it means someone might generate backup dump files that they later discover they don't have a legal right to read and restore :(

-- 
greg
On Fri, Apr 9, 2010 at 5:51 AM, Greg Stark <gsstark@mit.edu> wrote:
> Linking against as an option isn't nearly as bad since the user
> compiling it can choose whether to include the restricted feature or
> not. That's what we do with readline. However it's not nearly as
> attractive when it restricts what file formats Postgres supports -- it
> means someone might generate backup dump files that they later
> discover they don't have a legal right to read and restore :(

If we only linked against it, we'd leave it up to the user to weigh the risk, as long as we are not aware of any such violation. Our top priority is to make sure that the project would not be harmed if one day such a patent showed up. If I understood you correctly, this is not an issue even if we included lzf, and even less so if we only link against it.

The rest is about user education. If we used lzf only in pg_dump and not for toasting, we could show a message in pg_dump when lzf is chosen, to make the user aware of the possible issues.

If we still cannot do this, then what I am asking is: what does the project need to be able to at least link against such a compression algorithm? Is it a list of 10, 20, 50 or more other projects using it, or is it a lawyer saying "there is no patent"? But then, how can we be sure that the lawyer is right? Or couldn't we include it even if we had both, because again, we couldn't be sure...?

Joachim
Joachim Wieland <joe@mcknight.de> writes:
> If we still cannot do this, then what I am asking is: What does the
> project need to be able to at least link against such a compression
> algorithm?

Well, what we *really* need is a convincing argument that it's worth taking some risk for. I find that not obvious. You can pipe the output of pg_dump into your-choice-of-compressor, for example, and that gets you the ability to spread the work across multiple CPUs in addition to eliminating legal risk to the PG project. And in any case the general impression seems to be that the main dump-speed bottleneck is on the backend side, not in pg_dump's compression.

			regards, tom lane
Tom Lane wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> If we still cannot do this, then what I am asking is: What does the
>> project need to be able to at least link against such a compression
>> algorithm?
>
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project. And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

Legal risks aside (I'm not a lawyer, so I cannot comment on that), the current situation imho is:

* for a plain pg_dump, the backend is the bottleneck
* for pg_dump -Fc with compression, compression is a huge bottleneck
* for pg_dump | gzip, it is usually compression (or bytea and some other datatypes in <9.0)
* for a parallel dump, you can either dump uncompressed and compress afterwards, which increases disk space requirements (and if you need parallel dump you usually have a large database) and complexity (because you would have to think about how to manually parallelize the compression)
* for a parallel dump that compresses inline, you are limited by the compression algorithm on a per-core basis, and given that the current inline compression overhead is huge, you lose a lot of the benefits of parallel dump

Stefan
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project.

Well, I like -Fc and playing with the catalog to restore in staging environments only the "interesting" data. I even automated all the catalog mangling in pg_staging so that I just have to set up which schema I want, with only the DDL or with the DATA too.

The fun is when you want to exclude functions that are used in triggers based on the schema where the function lives, not the trigger, BTW, but that's another story.

So yes, having both -Fc and another compression facility than plain gzip would be good news. And benefiting from a better compression in TOAST would be good too, I guess (small size hit, lots faster, would fit).

Summary: my convincing argument is using the dumps for efficiently preparing development and testing environments from production data, thanks to -Fc. That includes skipping data to restore.

Regards,
-- 
dim
Dimitri Fontaine wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Well, what we *really* need is a convincing argument that it's worth
> > taking some risk for. I find that not obvious. You can pipe the output
> > of pg_dump into your-choice-of-compressor, for example, and that gets
> > you the ability to spread the work across multiple CPUs in addition to
> > eliminating legal risk to the PG project.
>
> Well, I like -Fc and playing with the catalog to restore in staging
> environments only the "interesting" data. I even automated all the
> catalog mangling in pg_staging so that I just have to setup which
> schema I want, with only the DDL or with the DATA too.
>
> The fun is when you want to exclude functions that are used in
> triggers based on the schema where the function lives, not the
> trigger, BTW, but that's another story.
>
> So yes having both -Fc and another compression facility than plain gzip
> would be good news. And benefiting from a better compression in TOAST
> would be good too I guess (small size hit, lots faster, would fit).
>
> Summary: my convincing argument is using the dumps for efficiently
> preparing development and testing environments from production data,
> thanks to -Fc. That includes skipping data to restore.

I assume people realize that if they are using pg_dump -Fc and then compressing the output later, they should turn off compression in pg_dump, or is that something we should document/suggest?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
On Tue, Apr 13, 2010 at 03:03:58PM -0400, Tom Lane wrote:
> Joachim Wieland <joe@mcknight.de> writes:
> > If we still cannot do this, then what I am asking is: What does the
> > project need to be able to at least link against such a compression
> > algorithm?
>
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project. And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

My client uses pg_dump -Fc and produces about 700GB of compressed postgresql dump nightly from multiple hosts. They also depend on being able to read and filter the dump catalog. A faster compression algorithm would be a huge benefit for dealing with this volume.

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.