Discussion: compression in LO and other fields
Hi, the TODO list has the item "Allow compression of large fields or a compressed field type". It is a good idea, but it probably needs a binary field type first (or not?). I have looked at inv_api and the other LO routines, and my idea is to add support for a bzip2 stream to inv_api and to allow the current LO routines to use compression. Is it a good idea? Karel Zak ------------------------------------------------------------------------------ Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/ Docs: http://docs.linux.cz (big docs archive) Kim Project: http://home.zf.jcu.cz/~zakkr/kim/ (process manager) FTP: ftp://ftp2.zf.jcu.cz/users/zakkr/ (C/ncurses/PgSQL) ------------------------------------------------------------------------------
Karel Zak - Zakkr <zakkr@zf.jcu.cz> writes: > I have looked at inv_api and the other LO routines, and my idea is to > add support for a bzip2 stream to inv_api and to allow the current LO > routines to use compression. Is it a good idea? LO is a dead end. What we really want to do is eliminate tuple-size restrictions and then have large ordinary fields (probably of type bytea) in regular tuples. I'd suggest working on compression in that context, say as a new data type called "bytez" or something like that. bytez would act just like bytea except the on-disk representation would be compressed. A compressed variant of type "text" would also be useful. In the long run this will be much more useful and easier to work with than adding another frammish to large objects. regards, tom lane
> LO is a dead end. What we really want to do is eliminate tuple-size > restrictions and then have large ordinary fields (probably of type > bytea) in regular tuples. I'd suggest working on compression in that > context, say as a new data type called "bytez" or something like that. It sounds ideal, but I remember that Vadim said inserting a 2GB record is not a good idea since it will be written into the log too. If it's a necessary limitation from the point of view of WAL, we have to accept it, I think. BTW, I still don't have enough time to run the huge sort tests on 6.5.x. I will probably have a chance to do that next week... --- Tatsuo Ishii
Tatsuo Ishii wrote: > > LO is a dead end. What we really want to do is eliminate tuple-size > > restrictions and then have large ordinary fields (probably of type > > bytea) in regular tuples. I'd suggest working on compression in that > > context, say as a new data type called "bytez" or something like that. > > It sounds ideal, but I remember that Vadim said inserting a 2GB record > is not a good idea since it will be written into the log too. If it's a > necessary limitation from the point of view of WAL, we have to accept > it, I think. Just in case someone wants to implement a complete compressed data type (including comparison functions, operators and a default operator class for indexing): I already made some tests with a type I called 'lztext' locally. Only the input/output functions exist so far and, as the name might suggest, it would be an alternative to 'text'. It uses a simple but fast, byte-oriented LZ backward-pointing method. No Huffman coding or variable offset/size tagging. The first byte of a chunk tells bitwise whether each of the next 8 items is a raw byte to copy or a 12-bit offset, 4-bit size copy instruction. That gives a max back offset of 4096 and a max match size of 17 bytes. What made it my preferred method was the fact that decompression is done entirely using the already decompressed portion of the data, so it does not need any code tables or the like at that time. It is really FASTEST on decompression, which I assume would be the most frequent operation on huge data types. With some care, comparison could be done on the fly while decompressing two values, so that the entire comparison can be aborted at the occurrence of the first difference. The compression rates aren't that gigantic. I've got 30-50% for rule plan strings (size limit on views!!!). And the method used only allows for buffer back references of 4K offsets at most, so the rate will not grow for larger data chunks. That's a heavy tradeoff between compression rate and no memory leakage for sure and speed, I know, but I prefer not to force it; instead I usually use a bigger hammer (the tuple size limit is still our original problem - and another IBM 72GB disk doing 22-37 MB/s will make any compressing data type obsolete then). Sorry for the compression-specific slang here. Well, anyone interested in the code? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
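A minimal sketch of the decompression loop for the tag layout Jan describes, written in C from the description above only - the exact bit packing (which nibble carries the match size, the +2 length bias, the off-by-one on the offset) is assumed here, not taken from the actual lztext code:

#include <stddef.h>
#include <stdint.h>

/* Decompress one chunk stream into dst; returns the output size.
 * Assumes well-formed input and a large-enough dst buffer. */
size_t
lz_decompress(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    const uint8_t *send = src + srclen;
    uint8_t       *d = dst;

    while (src < send)
    {
        uint8_t ctrl = *src++;              /* one flag bit per item */

        for (int i = 0; i < 8 && src < send; i++, ctrl >>= 1)
        {
            if (ctrl & 1)
            {
                /* tagged item: 12-bit back offset, 4-bit match size */
                int      off = ((src[0] & 0x0F) << 8) | src[1]; /* 0..4095 */
                int      len = (src[0] >> 4) + 2;               /* 2..17   */
                uint8_t *cp  = d - off - 1;

                src += 2;
                while (len-- > 0)           /* byte-wise: regions may overlap */
                    *d++ = *cp++;
            }
            else
                *d++ = *src++;              /* plain literal byte */
        }
    }
    return (size_t) (d - dst);
}

Note how the copy loop reads only from the already emitted output, which is exactly why no code tables are needed at decompression time.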
> The compression rates aren't that gigantic. I've got 30-50% > for rule plan strings (size limit on views!!!). And the > method used only allows for buffer back references of 4K > offsets at most, so the rate will not grow for larger data > chunks. That's a heavy tradeoff between compression rate and > no memory leakage for sure and speed, I know, but I prefer > not to force it; instead I usually use a bigger hammer (the > tuple size limit is still our original problem - and another > IBM 72GB disk doing 22-37 MB/s will make any compressing data > type obsolete then). > > Sorry for the compression-specific slang here. Well, anyone > interested in the code? In contrib? -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >> LO is a dead end. What we really want to do is eliminate tuple-size >> restrictions and then have large ordinary fields (probably of type >> bytea) in regular tuples. I'd suggest working on compression in that >> context, say as a new data type called "bytez" or something like that. > It sounds ideal, but I remember that Vadim said inserting a 2GB record > is not a good idea since it will be written into the log too. If it's a > necessary limitation from the point of view of WAL, we have to accept > it, I think. LO won't make that any better: the data still goes into a table. You'd have 2GB worth of WAL entries either way. The only thing LO would do for you is divide the data into block-sized tuples, so there would be a bunch of little WAL entries instead of one big one. But that'd probably be easy to duplicate too. If we implement big tuples by chaining together disk-block-sized segments, which seems like the most likely approach, couldn't WAL log each segment as a separate log entry? If so, there's almost no difference between LO and an inline field for logging purposes. regards, tom lane
On Fri, 12 Nov 1999, Tom Lane wrote: > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > >> LO is a dead end. What we really want to do is eliminate tuple-size > >> restrictions and then have large ordinary fields (probably of type > >> bytea) in regular tuples. I'd suggest working on compression in that > >> context, say as a new data type called "bytez" or something like that. --- cut --- > > The only thing LO would do for you is divide the data into block-sized > tuples, so there would be a bunch of little WAL entries instead of one > big one. But that'd probably be easy to duplicate too. If we implement > big tuples by chaining together disk-block-sized segments, which seems > like the most likely approach, couldn't WAL log each segment as a > separate log entry? If so, there's almost no difference between LO and > an inline field for logging purposes. > I'm not sure that LO is a dead end for every user. Big (blob) fields go through the SQL engine (?), but why - if I don't need to use this data as typical SQL data (I don't need indexing or searching in, for example, gif files)? It would be a pity if LO development went down. I still think that LO compression is not a bad idea :-) Other eventual compression questions: * some applications allow a compressed stream between client and server over slow networks - and PostgreSQL? * MySQL's dump tool can make a compressed dump file, which is good - and PostgreSQL? Karel
On Fri, 12 Nov 1999, Jan Wieck wrote: > Just in case someone wants to implement a complete compressed > data type (including comparison functions, operators and > a default operator class for indexing): > > I already made some tests with a type I called 'lztext' > locally. Only the input/output functions exist so far and, > as the name might suggest, it would be an alternative to > 'text'. It uses a simple but fast, byte-oriented LZ backward-pointing > method. No Huffman coding or variable offset/size > tagging. The first byte of a chunk tells bitwise whether each > of the next 8 items is a raw byte to copy or a 12-bit offset, > 4-bit size copy instruction. That gives a max back offset of 4096 > and a max match size of 17 bytes. Is this your original implementation, or do you use some existing compression code? I tried bzip2, but the output of this algorithm is totally binary; I don't know how to use this in PgSQL if all the (in/out) routines in the backend use char* (yes, I'm a newbie at PgSQL hacking :-). > > What made it my preferred method was the fact that > decompression is done entirely using the already decompressed > portion of the data, so it does not need any code tables or > the like at that time. > > It is really FASTEST on decompression, which I assume would > be the most frequent operation on huge data types. With > some care, comparison could be done on the fly while > decompressing two values, so that the entire comparison can > be aborted at the occurrence of the first difference. > > The compression rates aren't that gigantic. I've got 30-50% Isn't it a problem that your implementation compresses all the data at once? Typically compression uses a stream, and compresses only a small buffer in each cycle. > for rule plan strings (size limit on views!!!). And the > method used only allows for buffer back references of 4K > offsets at most, so the rate will not grow for larger data > chunks. That's a heavy tradeoff between compression rate and > no memory leakage for sure and speed, I know, but I prefer > not to force it; instead I usually use a bigger hammer (the > tuple size limit is still our original problem - and another > IBM 72GB disk doing 22-37 MB/s will make any compressing data > type obsolete then). > > Sorry for the compression-specific slang here. Well, anyone > interested in the code? Yes, for me - I'm finishing the to_char()/to_date() Oracle-compatible routines (Thomas, are you still quiet?) and this is a new appeal for me :-) Karel
> > It sounds ideal, but I remember that Vadim said inserting a 2GB record > > is not a good idea since it will be written into the log too. If it's a > > necessary limitation from the point of view of WAL, we have to accept > > it, I think. > > LO won't make that any better: the data still goes into a table. > You'd have 2GB worth of WAL entries either way. What I had in mind was an LO that is not under transaction control. I would not say this is a good thing, but I'm afraid we might need this kind of beast in WAL. > The only thing LO would do for you is divide the data into block-sized > tuples, so there would be a bunch of little WAL entries instead of one > big one. But that'd probably be easy to duplicate too. If we implement > big tuples by chaining together disk-block-sized segments, which seems > like the most likely approach, couldn't WAL log each segment as a > separate log entry? If so, there's almost no difference between LO and > inline field for logging purposes. Right. BTW, does anybody know how BLOBs are handled by WAL in commercial DBMSs? --- Tatsuo Ishii
Karel Zak - Zakkr wrote: > On Fri, 12 Nov 1999, Jan Wieck wrote: > > > I already made some tests with a type I called 'lztext' > > locally. Only the input/output functions exist so far and, > > Is this your original implementation, or do you use some existing > compression code? I tried bzip2, but the output of this algorithm is > totally binary; I don't know how to use this in PgSQL if all the > (in/out) routines in the backend use char* (yes, I'm a newbie at PgSQL > hacking :-). The internal storage format is based on an article I found at: http://www.neutralzone.org/home/faqsys/docs/slz_art.txt Simple Compression using an LZ buffer Part 3 Revision 1.d: An introduction to compression on the Amiga by Adisak Pochanayon Freely Distributable as long as reproduced completely. Copyright 1993 Adisak Pochanayon I've written the code from scratch. The internal representation is binary, for sure. It's a PostgreSQL variable-length data format as usual. I don't know if there's a compression library available that fits our needs. First and most important, it must have a license that permits us to include it in the distribution under our existing license. Second, its implementation must not cause any problems in the backend like memory leakage or the like. > > The compression rates aren't that gigantic. I've got 30-50% > > Isn't it a problem that your implementation compresses all the data at > once? Typically compression uses a stream, and compresses only a small > buffer in each cycle. No, that's no problem. On type input, the original value is completely in memory, given as a char*, and the internal representation is returned as a palloc()'d Datum. For output it's vice versa. O.K., some details on the compression rate. I've used 112 .html files with a total size of 1188346 bytes this time. The smallest one was 131 bytes, the largest one 114549 bytes, and most of the files are somewhere between 3-12K. Compression results on the binary level are: gzip -9 outputs 398180 bytes (66.5% rate) gzip -1 outputs 447597 bytes (62.3% rate) my code outputs 529420 bytes (55.4% rate) Html input might be somewhat optimal for Adisak's storage format, but taking into account that my source implementing the type input and output functions is smaller than 600 lines, I think 11% difference to a gzip -9 is a good result anyway. > > Sorry for the compression-specific slang here. Well, anyone > > interested in the code? > > Yes, for me - I'm finishing the to_char()/to_date() Oracle-compatible > routines (Thomas, are you still quiet?) and this is a new appeal for me :-) Bruce suggested the contrib area, but I'm not sure if that's the right place. If it goes into the distribution at all, I'd like to use this data type for rule plan strings and function source text in the system catalogs. I don't expect we'll have a general solution for tuples split across multiple blocks for v7.0. And using lztext for rules and function sources would lower some FRP's. But using it in the catalogs requires it to be builtin. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
wieck@debis.com (Jan Wieck) writes: > Html input might be somewhat optimal for Adisak's storage > format, but taking into account that my source implementing > the type input and output functions is smaller than 600 > lines, I think 11% difference to a gzip -9 is a good result > anyway. These strike me as very good results. I'm not at all sure that using gzip or bzip would give much better results in practice in Postgres, because those compressors are optimized for relatively large files, whereas a compressed-field datatype would likely be getting relatively small field values to work on. (So your test data set is probably a good one for our purposes --- do the numbers change if you exclude all the files over, say, 10K?) > Bruce suggested the contrib area, but I'm not sure if that's > the right place. If it goes into the distribution at all, I'd > like to use this data type for rule plan strings and function > source text in the system catalogs. Right, if we are going to bother with it at all, we should put it into the core so that we can use it for rule plans. > I don't expect we'll have > a general solution for tuples split across multiple blocks > for v7.0. I haven't given up hope of that yet --- but even if we do, compressing the data is an attractive choice to reduce the frequency with which tuples must be split across blocks. It occurred to me last night that applying compression to individual fields might not be the best approach. Certainly a "bytez" data type is the easiest thing to fit into the existing system, but it's leaving some space savings on the table. What about compressing the *whole* data contents of a tuple on-disk, as a single entity? That should save more space than field-by-field compression. It could be triggered in the tuple storage routines whenever the uncompressed size exceeds some threshold. (We'd need a flag in the tuple header to indicate compressed data, but I think there are bits to spare.) When we get around to having split tuples, the code would still be useful because it'd be applied as a first resort before splitting a large tuple; it'd reduce the frequency of splits and the number of sections big tuples get split into. All automatic and transparent, too --- the user doesn't have to change data declarations at all. Also, if we do it that way, then it would *automatically* apply to both regular tuples and LO, because the current LO implementation is just tuples. (Tatsuo's idea of a non-transaction-controlled LO would need extra work, of course, if we decide that's a good idea...) regards, tom lane
> Yes, for me - I'm finishing the to_char()/to_date() Oracle-compatible > routines (Thomas, are you still quiet?) and this is a new appeal for me :-) Ack! When I saw this I rolled up my primary Netscape window and found your almost-completed reply underneath. I've sent it along now. - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
On Fri, 12 Nov 1999, Jan Wieck wrote: > I don't know if there's a compression library available that > fits our needs. First and most important, it must have a > license that permits us to include it in the distribution > under our existing license. Second, its implementation must > not cause any problems in the backend like memory leakage or > the like. Is this something that could be a configure option? Put the stubs in place, and if someone wants to enable that feature, they can install the compression library first and run with it? Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Tom Lane wrote: > wieck@debis.com (Jan Wieck) writes: > > Html input might be somewhat optimal for Adisak's storage > > format, but taking into account that my source implementing > > the type input and output functions is smaller than 600 > > lines, I think 11% difference to a gzip -9 is a good result > > anyway. > > These strike me as very good results. I'm not at all sure that using > gzip or bzip would give much better results in practice in Postgres, > because those compressors are optimized for relatively large files, > whereas a compressed-field datatype would likely be getting relatively > small field values to work on. (So your test data set is probably a > good one for our purposes --- do the numbers change if you exclude > all the files over, say, 10K?) Will give it a try. > It occurred to me last night that applying compression to individual > fields might not be the best approach. Certainly a "bytez" data type > is the easiest thing to fit into the existing system, but it's leaving > some space savings on the table. What about compressing the *whole* > data contents of a tuple on-disk, as a single entity? That should save > more space than field-by-field compression. But it requires decompression of every tuple into palloc()'d memory during heap access. AFAIK, the heap access routines currently return a pointer to the tuple inside the shm buffer. Don't know what its performance impact would be. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
Marc G. Fournier wrote: > On Fri, 12 Nov 1999, Jan Wieck wrote: > > > I don't know if there's a compression library available that > > fits our needs. First and most important, it must have a > > license that permits us to include it in the distribution > > under our existing license. Second, its implementation must > > not cause any problems in the backend like memory leakage or > > the like. > > Is this something that could be a configure option? Put the stubs in > place, and if someone wants to enable that feature, they can install the > compression library first and run with it? If the new type is used in the system catalogs, the only option could be what kind of compression to use. And we need our own default compression code shipped anyway. Of course, the config could determine which types are used in the syscat. But making the catalog headers things that are shipped as a .in isn't really that good IMHO. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
wieck@debis.com (Jan Wieck) writes: > Tom Lane wrote: >> It occurred to me last night that applying compression to individual >> fields might not be the best approach. Certainly a "bytez" data type >> is the easiest thing to fit into the existing system, but it's leaving >> some space savings on the table. What about compressing the *whole* >> data contents of a tuple on-disk, as a single entity? That should save >> more space than field-by-field compression. > But it requires decompression of every tuple into palloc()'d > memory during heap access. AFAIK, the heap access routines > currently return a pointer to the tuple inside the shm > buffer. Don't know what its performance impact would be. Good point, but the same will be needed when a tuple is split across multiple blocks. I would expect that (given a reasonably fast decompressor) there will be a net performance *gain* due to having less disk I/O to do. Also, this won't be happening for "every" tuple, just those exceeding a size threshold --- we'd be able to tune the threshold value to trade off speed and space. One thing that does occur to me is that we need to store the uncompressed as well as the compressed data size, so that the working space can be palloc'd before starting the decompression. Also, in case it wasn't clear, I was envisioning leaving the tuple header uncompressed, so that time quals etc can be checked before decompressing the tuple data. regards, tom lane
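As a sketch of the layout Tom describes - tuple header left uncompressed, both sizes stored up front so the working space can be palloc()'d before inflating - the compressed tuple body might look like this (the struct and field names are illustrative stand-ins, not actual PostgreSQL declarations):

#include <stdint.h>

typedef struct CompressedTupleData
{
    int32_t rawsize;    /* uncompressed data size, for the palloc() */
    int32_t compsize;   /* number of compressed bytes that follow   */
    char    data[1];    /* compressed tuple data (variable length)  */
} CompressedTupleData;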
> >BTW, does anybody know how BLOBs are handled by WAL in commercial >DBMSs? Dec/RDB stores them in its equivalent of the WAL; full rollback etc. is supported. If you load lots of large blobs, you need to make sure you have enough disk space for the journal copies. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
On Fri, 12 Nov 1999, Jan Wieck wrote: > > I don't know if there's a compression library available that > fits our needs. First and most important, it must have a > license that permits us to include it in the distribution IMHO the bzip2 compression algorithm is free and open source - from the README in the bzip2 source: "bzip2-0.9.5 is distributed under a BSD-style license. For details, see the file LICENSE" > Bruce suggested the contrib area, but I'm not sure if that's > the right place. If it goes into the distribution at all, I'd Is there any space (on the postgresql ftp?) for this unstable code? A good project has an incoming ftp area for devel versions... If you don't move this code to contrib, please send it to me (as a patch?). Karel
Tom Lane wrote: > wieck@debis.com (Jan Wieck) writes: > > > But it requires decompression of every tuple into palloc()'d > > memory during heap access. AFAIK, the heap access routines > > currently return a pointer to the tuple inside the shm > > buffer. Don't know what its performance impact would be. > > Good point, but the same will be needed when a tuple is split across > multiple blocks. I would expect that (given a reasonably fast > decompressor) there will be a net performance *gain* due to having > less disk I/O to do. Also, this won't be happening for "every" tuple, > just those exceeding a size threshold --- we'd be able to tune the > threshold value to trade off speed and space. Right, this time it's your good point. All of those problems will be there for the tuple split implementation anyway. The major problem I see is that a palloc()'d tuple should be pfree()'d after the fetcher is done with it. Since they are actually in the buffer now, the fetcher doesn't have to care. > One thing that does occur to me is that we need to store the > uncompressed as well as the compressed data size, so that the > working space can be palloc'd before starting the decompression. Yepp - and I'm doing so. Only during compression is the result size not known in advance. But there is a well-known maximum: the header overhead plus the data size times 1.125 plus 2 bytes (the total worst case, on incompressible data). And a general mechanism working on the tuple level would fall back to storing uncompressed data in case the compressed size is bigger. > Also, in case it wasn't clear, I was envisioning leaving the tuple > header uncompressed, so that time quals etc can be checked before > decompressing the tuple data. Of course. Well, you asked for the rates on the smaller html files only. 78 files, 131 bytes min, 10000 bytes max, 4582 bytes avg, 357383 bytes total. gzip -9 outputs 145659 bytes (59.2%) gzip -1 outputs 155113 bytes (56.6%) my code outputs 184109 bytes (48.5%) 67 files, 2000 bytes min, 10000 bytes max, 5239 bytes avg, 351006 bytes total. gzip -9 outputs 141772 bytes (59.6%) gzip -1 outputs 151150 bytes (56.9%) my code outputs 179428 bytes (48.9%) The threshold will surely be a tuning parameter of interest. Another tuning option must be to allow/deny compression per table at all. Then we could have both options: using a compressing field type to define which portion of a tuple to compress, or allowing compression of entire tuples. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
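The 1.125 factor follows directly from the format: in the worst case every item is a raw literal, so each 8 input bytes cost one extra control byte. A bound along the lines Jan states could be written as below (hdrsz standing in for the header overhead; the trailing "+ 2" follows his stated figure):

/* Worst-case output size for the 1-control-byte-per-8-items scheme:
 * all literals means 9 output bytes per 8 input bytes (factor 1.125). */
#define LZ_MAX_OUTPUT(srclen, hdrsz) \
    ((hdrsz) + (srclen) + (((srclen) + 7) / 8) + 2)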
On Fri, Nov 12, 1999 at 10:16:21AM +0100, Karel Zak - Zakkr wrote: > > > Other eventual compression questions: > > * MySQL's dump tool can make a compressed dump file, which is good - and PostgreSQL? > pg_dump database | gzip >db.out.gz Ross -- Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu> NSBRI Research Scientist/Programmer Computer and Information Technology Institute Rice University, 6100 S. Main St., Houston, TX 77005
wieck@debis.com (Jan Wieck) writes: > The major problem I see is that a palloc()'d tuple should be > pfree()'d after the fetcher is done with it. Since they are > actually in the buffer now, the fetcher doesn't have to care. I think this may not be as big a problem as it looks. Most places in the executor keep tuples in TupleTableSlots, which are responsible for pfree'ing the tuple if (and only if) necessary; all that code is ready for this change already. There are probably some routines in heapam/indexam that assume they only work with tuples that never need to be freed, but I don't think the fixes will be pervasive. And we're going to have to do that work in any case to support big tuples (assuming we do it by splitting tuples into segments that fit in disk pages). > And a general > mechanism working on the tuple level would fall back to storing > uncompressed data in case the compressed size is bigger. Right. Another possible place for speed-vs-space tuning would be to store the uncompressed representation unless the compressed version is at least X percent smaller, not just at-least-one-byte smaller. regards, tom lane
On Fri, 12 Nov 1999, Ross J. Reedstrom wrote: > On Fri, Nov 12, 1999 at 10:16:21AM +0100, Karel Zak - Zakkr wrote: > > > > > > Other eventual compression questions: > > > > * MySQL's dump tool can make a compressed dump file, which is good - and PostgreSQL? > > > > pg_dump database | gzip >db.out.gz Thanks... :-)) But mysqldump --compress is very nice. Karel
> > Bruce suggested the contrib area, but I'm not sure if that's > > the right place. If it goes into the distribution at all, I'd > > like to use this data type for rule plan strings and function > > source text in the system catalogs. > > Right, if we are going to bother with it at all, we should put it > into the core so that we can use it for rule plans. Agreed. I suggested contrib so the code doesn't just disappear. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
At 05:15 PM 11/12/99 +0100, Karel Zak - Zakkr wrote: >> pg_dump database | gzip >db.out.gz > Thanks... :-)) > But mysqldump --compress is very nice. Why add functionality for something that can be done so easily by piping the output of pg_dump? This is exactly the kind of coupling of tools that pipes were invented for. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Jan Wieck wrote: > > Of course. > > Well, you asked for the rates on the smaller html files only. > 78 files, 131 bytes min, 10000 bytes max, 4582 bytes avg, > 357383 bytes total. > > gzip -9 outputs 145659 bytes (59.2%) > gzip -1 outputs 155113 bytes (56.6%) > my code outputs 184109 bytes (48.5%) > > 67 files, 2000 bytes min, 10000 bytes max, 5239 bytes avg, > 351006 bytes total. > > gzip -9 outputs 141772 bytes (59.6%) > gzip -1 outputs 151150 bytes (56.9%) > my code outputs 179428 bytes (48.9%) > > The threshold will surely be a tuning parameter of interest. > Another tuning option must be to allow/deny compression per > table at all. Then we could have both options: using a > compressing field type to define which portion of a tuple to > compress, or allowing compression of entire tuples. The next step would be tweaking the costs for sequential scans vs. index scans. I guess that the indexes would stay uncompressed? ------ Hannu
Don Baccus wrote: > > At 05:15 PM 11/12/99 +0100, Karel Zak - Zakkr wrote: > > >> pg_dump database | gzip >db.out.gz > > > Thanks... :-)) > > > But mysqldump --compress is very nice. > > Why add functionality for something that can be done so > easily by piping the output of pg_dump? > > This is exactly the kind of coupling of tools that pipes > were invented for. Exactly! Another version of the same is used when dumping databases bigger than 2GB (or whatever the file system size limit is); just do: pg_dump database | gzip | split -b 1000000000 - db.out.gz. and restore it using cat db.out.gz* | gunzip | psql database you could also do other fancy things with your dumps - send them directly to tape storage in Japan or whatever ;) If you need that functionality often enough, then write a shell script. --------------- Hannu
> > Jan Wieck wrote: > > > > Of course. > > > > Well, you asked for the rates on the smaller html files only. > > 78 files, 131 bytes min, 10000 bytes max, 4582 bytes avg, > > 357383 bytes total. > > > > gzip -9 outputs 145659 bytes (59.2%) > > gzip -1 outputs 155113 bytes (56.6%) > > my code outputs 184109 bytes (48.5%) > > > > 67 files, 2000 bytes min, 10000 bytes max, 5239 bytes avg, > > 351006 bytes total. > > > > gzip -9 outputs 141772 bytes (59.6%) > > gzip -1 outputs 151150 bytes (56.9%) > > my code outputs 179428 bytes (48.9%) > > > > The threshold will surely be a tuning parameter of interest. > > Another tuning option must be to allow/deny compression per > > table at all. Then we could have both options: using a > > compressing field type to define which portion of a tuple to > > compress, or allowing compression of entire tuples. > > The next step would be tweaking the costs for sequential scans vs. > index scans. > > I guess that the indexes would stay uncompressed? > > ------ > Hannu > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
On Fri, 12 Nov 1999, Tom Lane wrote: > wieck@debis.com (Jan Wieck) writes: > > Tom Lane wrote: > >> It occurred to me last night that applying compression to individual > >> fields might not be the best approach. Certainly a "bytez" data type > >> is the easiest thing to fit into the existing system, but it's leaving > >> some space savings on the table. What about compressing the *whole* > >> data contents of a tuple on-disk, as a single entity? That should save > >> more space than field-by-field compression. > > > But it requires decompression of every tuple into palloc()'d > > memory during heap access. AFAIK, the heap access routines > > currently return a pointer to the tuple inside the shm > > buffer. Don't know what its performance impact would be. > > Good point, but the same will be needed when a tuple is split across > multiple blocks. I would expect that (given a reasonably fast > decompressor) there will be a net performance *gain* due to having > less disk I/O to do. Right now, we're dealing in theory...my concern is what Jan points out - "what its performance impact would be"...how much harder would it be to extend our "CREATE TABLE" syntax to do something like: CREATE TABLE classname ( .. ) compressed; Or something similar? Something that leaves the ability to do this in the core, but makes using it the choice of the admin? *Assuming* that I'm also reading this thread correctly, it should almost be extended into "ALTER TABLE classname SET COMPRESSED on;", or something like that. Where all new records are *written* compressed (or uncompressed), but any read checks if compressed size == uncompressed size, and decompresses accordingly... Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Ech - wrong key :-) Hannu Krosing wrote: > Jan Wieck wrote: > > > > The next step would be tweaking the costs for sequential scans vs. > index scans. > > I guess that the indexes would stay uncompressed? I'm sure about this. On a database of significant size, anyone indexing a field with a possible size over 100 bytes is doing something wrong (and only idiots go above 500 bytes). They are IMPLEMENTING a not well thought out database DESIGN. A database engine should support indices on bigger fields, but it's still a bad schema and thus idiotic. Currently, we don't check the size of indexed fields. And the only problems I've seen with it were some reports that huge PL functions could not be created because there was an unused (idiotic) index on the prosrc attribute and they exceeded the 4K limit for index tuples. I've removed this index already in the v7.0 tree. The ?bug? in the btree code, failing to split a page if the key values exceed 4K, is still there. But I don't think anyone really cares for it. Thus, I assume there aren't many idiots out there. And I don't expect that anyone would ever create an index on a compressed data type. ?bug? -> The difference between a bug and a feature is DOCUMENTATION. Thomas, would you please add this limit on index tuples to the docs so we have a new FEATURE to tell in the v7.0 announcement? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
Marc G. Fournier wrote: > On Fri, 12 Nov 1999, Tom Lane wrote: > > wieck@debis.com (Jan Wieck) writes: > > > Tom Lane wrote: > Right now, we're dealing in theory...my concern is what Jan points > out - "what its performance impact would be"...how much harder would it > be to extend our "CREATE TABLE" syntax to do something like: > > CREATE TABLE classname ( .. ) compressed; > > Or something similar? Something that leaves the ability to do > this in the core, but makes using it the choice of the admin? Yepp, exactly that's what I meant by making tuple compression a per-table option. Obviously, ALTER TABLE ... must be supported too - that's simply a parser -> utility -> flip flag in pg_class thing (90% cut&paste). I think the part about deciding what to compress is easy, because the flag telling whether heap access should try to compress a tuple on append (i.e. INSERT or UPDATE) has to be in pg_class. And the content of a relation's pg_class entry is somewhere below the Relation struct (thus already known after heap_open). The idea was to use another bit in the tuple header to tell whether an existing heap tuple's data is compressed or not. So the heap fetching always looks at the bit in the tuple header, and the heap appending looks at the flag in the relation pointer. That's exactly what you want, no? The major part is to make all callers of heap_fetch() and sisters treat in-memory decompressed (or reconstructed-from-block-split) tuples the right way. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
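A rough sketch of that append-path decision, also folding in Tom's earlier point about requiring a minimum saving before keeping the compressed form. All names here (HEAP_COMPRESSED, the thresholds, the DemoTuple stand-in) are hypothetical, not real backend symbols:

#include <string.h>

#define HEAP_COMPRESSED    0x0800  /* hypothetical spare t_infomask bit */
#define COMPRESS_THRESHOLD 256     /* tunable size threshold, in bytes  */
#define MIN_SAVINGS_PCT    10      /* require at least 10% space saved  */

/* stand-in for the real on-disk tuple representation */
typedef struct DemoTuple
{
    unsigned short t_infomask;  /* per-tuple flag bits */
    int            t_len;       /* current data length */
    char          *t_data;      /* tuple data, header excluded */
} DemoTuple;

/* compressor as in the lztext code; returns the compressed size */
extern int lz_compress(const char *src, int srclen, char *dst);

/* On append: rel_compress mirrors the per-table flag in pg_class.
 * Readers never look at the table flag, only at HEAP_COMPRESSED. */
static void
maybe_compress_tuple(DemoTuple *tup, int rel_compress, char *workbuf)
{
    int complen;

    if (!rel_compress || tup->t_len <= COMPRESS_THRESHOLD)
        return;                         /* store the raw form */

    complen = lz_compress(tup->t_data, tup->t_len, workbuf);

    /* keep the compressed form only if it saves enough space */
    if (complen <= tup->t_len * (100 - MIN_SAVINGS_PCT) / 100)
    {
        memcpy(tup->t_data, workbuf, complen);
        tup->t_len = complen;
        tup->t_infomask |= HEAP_COMPRESSED;
    }
}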
wieck@debis.com (Jan Wieck) writes: > The idea was to use another bit in the tuple header to tell > whether an existing heap tuple's data is compressed or not. So the > heap fetching always looks at the bit in the tuple header, > and the heap appending looks at the flag in the relation > pointer. That's exactly what you want, no? Right. Compressed tuples must be unambiguously marked as such on-disk. Whether to compress a tuple when writing it out is a decision that can be made on-the-fly, using strategies that could change from time to time, without invalidating the data that's already out there or affecting the tuple-reading code. If we choose to provide also a way of compressing individual fields rather than whole tuples, it would be good to provide the same flexibility at the field level. Some tuples might contain the field in compressed form, some in uncompressed form. The reading logic should not need to be aware of the way that the writing logic chooses which to do. regards, tom lane
Tom Lane wrote: > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > >> LO is a dead end. What we really want to do is eliminate tuple-size > >> restrictions and then have large ordinary fields (probably of type > >> bytea) in regular tuples. I'd suggest working on compression in that > >> context, say as a new data type called "bytez" or something like that. > > > It sounds ideal, but I remember that Vadim said inserting a 2GB record > > is not a good idea since it will be written into the log too. If it's a > > necessary limitation from the point of view of WAL, we have to accept > > it, I think. > > LO won't make that any better: the data still goes into a table. > You'd have 2GB worth of WAL entries either way. > > The only thing LO would do for you is divide the data into block-sized > tuples, so there would be a bunch of little WAL entries instead of one > big one. But that'd probably be easy to duplicate too. If we implement > big tuples by chaining together disk-block-sized segments, which seems > like the most likely approach, couldn't WAL log each segment as a > separate log entry? If so, there's almost no difference between LO and > inline field for logging purposes. > I don't know LO well, but it seems LO allows partial updates of big tuples. If so, isn't that a significant difference? Regards. Hiroshi Inoue Inoue@tpf.co.jp
Sorry, I sent a mail by mistake. Ignore my previous mail. inoue wrote: > Tom Lane wrote: > > > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > >> LO is a dead end. What we really want to do is eliminate tuple-size > > >> restrictions and then have large ordinary fields (probably of type > > >> bytea) in regular tuples. I'd suggest working on compression in that > > >> context, say as a new data type called "bytez" or something like that. > > > > > It sounds ideal, but I remember that Vadim said inserting a 2GB record > > > is not a good idea since it will be written into the log too. If it's a > > > necessary limitation from the point of view of WAL, we have to accept > > > it, I think. > > > > LO won't make that any better: the data still goes into a table. > > You'd have 2GB worth of WAL entries either way. > > > > The only thing LO would do for you is divide the data into block-sized > > tuples, so there would be a bunch of little WAL entries instead of one > > big one. But that'd probably be easy to duplicate too. If we implement > > big tuples by chaining together disk-block-sized segments, which seems > > like the most likely approach, couldn't WAL log each segment as a > > separate log entry? If so, there's almost no difference between LO and > > inline field for logging purposes. > > > > I don't know LO well, but it seems > LO allows partial updates of big tuples. > If so, isn't that a significant difference? > > Regards. > > Hiroshi Inoue > Inoue@tpf.co.jp