Discussion: Proposed adjustments in MaxTupleSize and toast thresholds
I've been looking into Pavan Deolasee's recent discovery that when storing a maximum-length toast tuple, heap_insert uselessly recurses to toast_insert_or_update, wasting a nontrivial number of cycles. It turns out there are several interrelated mistakes here, which are wasting space as well as cycles.

First off, as to the exact nature of what's happening: the toast code is designed so that when breaking down a large datum, it's divided into rows with data payloads of exactly TOAST_MAX_CHUNK_SIZE bytes each. On a 4-byte-MAXALIGN machine, this means the rows have total t_len of exactly TOAST_TUPLE_THRESHOLD, which is what was intended. However, that value is not a multiple of 4. Hence when heapam.c compares

	MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD

it decides the tuple needs re-toasting.

I noted before that this does not happen on an 8-byte-MAXALIGN machine, but had not understood exactly why. The reason is the outer MAXALIGN call in the definition

#define TOAST_MAX_CHUNK_SIZE	(TOAST_TUPLE_THRESHOLD -			\
				 MAXALIGN(					\
					 MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + \
					 sizeof(Oid) +				\
					 sizeof(int32) +			\
					 VARHDRSZ))

On a 4-byte machine that call doesn't do anything, but on an 8-byte machine it causes the value of TOAST_MAX_CHUNK_SIZE to be reduced by 4, which means that t_len of a toast row comes out 4 bytes smaller than on a 4-byte machine, which makes it smaller than TOAST_TUPLE_THRESHOLD even after maxalign'ing. Hence no recursion.

That MAXALIGN is actually *wrong* now that I look at it: it's effectively supposing that there is padding alignment after the varlena length word for the chunk data, which of course there is not. But we can't change TOAST_MAX_CHUNK_SIZE without forcing an initdb. Instead we can fix the recursion by removing the MAXALIGN() operations in heapam.c and tuptoaster.c that compare tuple lengths to the thresholds.
This effectively moves the threshold for tuple compression up a couple bytes, which is a safe change to make, and makes the comparisons slightly cheaper to boot. I propose doing that in 8.2 (and maybe older branches after we get a bit more testing of it).

But the real problem is that we've got sloppy choices of the thresholds and sizes. In the first place, TOAST_MAX_CHUNK_SIZE is being set at a value that makes every toast row have two wasted padding bytes after it (it turns out it's the same on both 4- and 8-byte machines, though the specific size of the rows differs). This is silly: we should be using a TOAST_MAX_CHUNK_SIZE that makes the actual row length come out at exactly a MAXALIGN multiple.

In the second place, examination of toast tables will show you that on a page with four maximum-length toast rows, there are 12 free bytes on a 4-byte machine and 28 free on an 8-byte machine (not counting the aforementioned padding bytes after each row). That's fine at first glance; because of alignment considerations it's actually the best we can do. The trouble is that TOAST_TUPLE_THRESHOLD is derived from MaxTupleSize, which is derived on the assumption that we should leave 32 bytes for "special space" on heap pages. If we actually had such special space, it wouldn't fit. This happens because the threshold calculation is just

#define TOAST_TUPLE_THRESHOLD	(MaxTupleSize / 4)

which fails to account for the line pointers needed for all but the first tuple. These errors cancel out at the moment, but wouldn't if we changed anything about the page header or special space layout.

What I suggest we do about this in HEAD is:

1. Rename MaxTupleSize to MaxHeapTupleSize, and get rid of the MaxSpecialSpace allotment in its calculation. We don't use special space on heap pages and we shouldn't be artificially restricting tuple length to allow for something that's unlikely to appear in the future.
(Note: yes, I know it's been suggested to keep free-space maps in some heap pages, but that need not factor into a MaxHeapTupleSize limit: big tuples can simply go into a page without any free-space map.)

2. Fix TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET to be correctly calculated (properly allowing for line pointers) and to be MAXALIGN multiples. The threshold value should be exactly the size of the largest tuple that you can put four of onto one page. Fix TOAST_MAX_CHUNK_SIZE so that it is *not* necessarily a MAXALIGN multiple, but rather causes the total length of a toast tuple to come out that way. This guarantees minimum space wastage on toast pages.

This will force initdb due to changing chunk sizes in toast tables, but unless we're going to reject Heikki's patch to merge cmin/cmax, there is no hope of an in-place upgrade for 8.3 anyway.

BTW, while I was looking at this I noticed that BTMaxItemSize is incorrectly calculated as well: it's coming out a couple bytes smaller than it could safely be. And with a different page header size it could come out a couple bytes larger, instead :-(. Perhaps this is related to Heikki's recent observation that there always seemed to be some extra free space on btree pages? I think the correct calculation would be

#define BTMaxItemSize(page) \
	MAXALIGN_DOWN((PageGetPageSize(page) - \
				   MAXALIGN(sizeof(PageHeaderData) + 2 * sizeof(ItemIdData)) - \
				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

where MAXALIGN_DOWN rounds down to the nearest maxalign multiple, instead of up.

Comments?

			regards, tom lane
On Fri, 2007-02-02 at 15:11 -0500, Tom Lane wrote:
> 2. Fix TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET to be correctly
> calculated (properly allowing for line pointers) and to be MAXALIGN
> multiples. The threshold value should be exactly the size of the
> largest tuple that you can put four of onto one page. Fix
> TOAST_MAX_CHUNK_SIZE so that it is *not* necessarily a MAXALIGN
> multiple, but rather causes the total length of a toast tuple to come
> out that way. This guarantees minimum space wastage on toast pages.

Jan suggested to me a while back that having a configurable toast threshold would be a useful thing for tables that are also updated reasonably frequently. ISTM like a good idea, so a prototype has been written - nothing to do with Pavan's comments though.

As you might expect, it does help in cases where we would otherwise produce lots of UPDATEd versions of a 1000-byte row, as well as on MIS queries that often don't pay much attention to text strings. This then allows the user some control over how much data gets toasted out of the main row. Many applications have long text fields of 100s of characters; for example, a customer's stated, cleaned and previous addresses might together be VARCHAR(750), yet we may also want to UPDATE the same rows regularly to store their current_balance.

TOAST_MAX_CHUNK_SIZE can stay fixed, though TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET could be settable for a table using a WITH parameter. It would seem like overkill to allow the threshold and target to differ when setting the parameter. If configurable, only MAXALIGNed values would be allowed.

Sounds like a good time to suggest making these values configurable, within certain reasonable bounds to avoid bad behaviour.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Sounds like a good time to suggest making these values configurable, > within certain reasonable bounds to avoid bad behaviour. Actually, given what we've just learned --- namely that choosing these values at random is a bad idea --- I'd want to see a whole lot of positive evidence before adding such a configuration knob. regards, tom lane
On Mon, 2007-02-05 at 11:52 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Sounds like a good time to suggest making these values configurable,
> > within certain reasonable bounds to avoid bad behaviour.
>
> Actually, given what we've just learned --- namely that choosing these
> values at random is a bad idea --- I'd want to see a whole lot of
> positive evidence before adding such a configuration knob.

Sure. My understanding of the process we'd like to follow on this sort of thing is:

1. make proposal, test for unseen negative effects or basic rejections
2. code performance prototype
3. assemble performance evidence
4. debate utility
5. complete coding
6. further review

Step 3 is always there for performance work, so even if you don't mention it, I'll assume everybody wants to see that as soon as possible before we progress.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On 2/5/2007 11:52 AM, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Sounds like a good time to suggest making these values configurable,
>> within certain reasonable bounds to avoid bad behaviour.
>
> Actually, given what we've just learned --- namely that choosing these
> values at random is a bad idea --- I'd want to see a whole lot of
> positive evidence before adding such a configuration knob.

Some of the evidence is TOAST itself. Every time you do not SET a column that has been toasted into external storage during an UPDATE, you win: the column's data isn't read during the scan for the row to update, it isn't read during heap_update(), it isn't actually updated at all (the toast reference is copied as is and the external value reused), and not a single byte of the external data bloats WAL. If someone knows that 99% of their updates will not hit certain text columns in their tables, actually forcing those columns to be compressed no matter what, and to be stored external if they exceed 100 bytes, will be a win.

Of course, this is a bit different from Simon's approach. What I describe here is a per-pg_attribute configuration to enforce a certain new toaster behavior. Since we already have something that gives the toaster a per-column cluestick (like not to bother trying to compress), it might be much easier to implement than Simon's proposal. It would require that the toaster goes over the initial heap tuple for those specially configured columns even if the tuple is below the toast threshold, which suggests that a pg_class.relhasspecialtoastneeds could be useful. But I think as far as fine-tuning capabilities go, a column-insensitive maximum tuple size is insufficient anyway.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
"Simon Riggs" <simon@2ndquadrant.com> wrote: > > Actually, given what we've just learned --- namely that choosing these > > values at random is a bad idea --- I'd want to see a whole lot of > > positive evidence before adding such a configuration knob. > > 3. assemble performance evidence > > Step 3 is always there for performance work, so even if you don't > mention it, I'll assume everybody wants to see that as soon as possible > before we progress. There was a performance evidence using TOAST in order to partial updates. It added a flag of force toasting. The toast threshold suggested now is more flexible than it, but I think it is one of the evidences. Vertical Partitioning with TOAST http://archives.postgresql.org/pgsql-hackers/2005-12/msg00013.php Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Feb 5, 2007, at 10:45 AM, Simon Riggs wrote:
> Jan suggested to me a while back that having a configurable toast
> threshold would be a useful thing, when that table is also updated
> reasonably frequently.

While we're in there, it probably makes sense to allow a configurable value for when to compress as well.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On Mon, 2007-02-05 at 19:18 -0500, Jan Wieck wrote:
> On 2/5/2007 11:52 AM, Tom Lane wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> Sounds like a good time to suggest making these values configurable,
> >> within certain reasonable bounds to avoid bad behaviour.
> >
> > Actually, given what we've just learned --- namely that choosing these
> > values at random is a bad idea --- I'd want to see a whole lot of
> > positive evidence before adding such a configuration knob.
>
> Some of the evidence is TOAST itself. Every time you do not SET a column
> that has been toasted into external storage during an UPDATE, you win
> because the column's data isn't read during the scan for the row to
> update, it isn't read during heap_update(), it isn't actually updated at
> all (the toast reference is copied as is and the external value reused),
> and not a single byte of the external data is bloating WAL. If someone
> knows that 99% of their updates will not hit certain text columns in
> their tables, actually forcing them to be compressed no matter what and
> to be stored external if they exceed 100 bytes will be a win.

Yes, that's the main use case.

> Of course, this is a bit different from Simon's approach. What I
> describe here is a per-pg_attribute configuration to enforce a certain
> new toaster behavior. Since we already have something that gives the
> toaster a per-column cluestick (like not to bother trying to compress),
> it might be much easier to implement than Simon's proposal. It would
> require that the toaster goes over the initial heap tuple for those
> specially configured columns even if the tuple is below the toast
> threshold, which suggests that a pg_class.relhasspecialtoastneeds could
> be useful. But I think as for fine tuning capabilities, a column
> insensitive maximum tuple size is insufficient anyway.

Well, sounds like we both want the same thing. The only discussion seems to be about user interface.
Setting it per column is much better for very fine tuning, but setting columns in isolation doesn't help decide what to do when you have lots of medium-length strings whose sum exceeds the toast target.

IMHO it would be better to have a col-level "storage priority" (default 0) and then a table-level settable toast target. So we start applying the storage handling mechanisms on the highest-priority columns and keep going in descending order until we are under the limit for the table.

ALTER TABLE foo
	ALTER COLUMN foocol
	SET STORAGE EXTERNAL PRIORITY 5
	WITH (toast_target = 400);	/* must be a MAXALIGNed value */

Equal priorities are allowed, in which case the lowest attribute id wins, i.e. current behaviour remains the default.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Tue, 2007-02-06 at 12:10 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
> > > Actually, given what we've just learned --- namely that choosing these
> > > values at random is a bad idea --- I'd want to see a whole lot of
> > > positive evidence before adding such a configuration knob.
> >
> > 3. assemble performance evidence
> >
> > Step 3 is always there for performance work, so even if you don't
> > mention it, I'll assume everybody wants to see that as soon as possible
> > before we progress.
>
> There is prior performance evidence for using TOAST to get partial
> updates: a patch that added a flag to force toasting. The configurable
> toast threshold suggested now is more flexible than that patch, but I
> think it counts as one piece of evidence.
>
> Vertical Partitioning with TOAST
> http://archives.postgresql.org/pgsql-hackers/2005-12/msg00013.php

Apologies to Junji-san. I'd thought my idea was original, but it seems we think along similar lines. That is the kind of performance gain I see possible.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Added to TODO:

* Consider allowing configuration of TOAST thresholds

  http://archives.postgresql.org/pgsql-hackers/2007-02/msg00213.php

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, 2007-02-05 at 19:18 -0500, Jan Wieck wrote:
> > On 2/5/2007 11:52 AM, Tom Lane wrote:
> > > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > >> Sounds like a good time to suggest making these values configurable,
> > >> within certain reasonable bounds to avoid bad behaviour.
> > >
> > > Actually, given what we've just learned --- namely that choosing these
> > > values at random is a bad idea --- I'd want to see a whole lot of
> > > positive evidence before adding such a configuration knob.
> >
> > Some of the evidence is TOAST itself. Every time you do not SET a column
> > that has been toasted into external storage during an UPDATE, you win
> > because the column's data isn't read during the scan for the row to
> > update, it isn't read during heap_update(), it isn't actually updated at
> > all (the toast reference is copied as is and the external value reused),
> > and not a single byte of the external data is bloating WAL. If someone
> > knows that 99% of their updates will not hit certain text columns in
> > their tables, actually forcing them to be compressed no matter what and
> > to be stored external if they exceed 100 bytes will be a win.
>
> Yes, that's the main use case.
>
> > Of course, this is a bit different from Simon's approach. What I
> > describe here is a per-pg_attribute configuration to enforce a certain
> > new toaster behavior. Since we already have something that gives the
> > toaster a per-column cluestick (like not to bother trying to compress),
> > it might be much easier to implement than Simon's proposal. It would
> > require that the toaster goes over the initial heap tuple for those
> > specially configured columns even if the tuple is below the toast
> > threshold, which suggests that a pg_class.relhasspecialtoastneeds could
> > be useful. But I think as for fine tuning capabilities, a column
> > insensitive maximum tuple size is insufficient anyway.
>
> Well, sounds like we both want the same thing. The only discussion seems
> to be about user interface.
>
> Setting it per column is much better for very fine tuning, but setting
> columns in isolation doesn't help decide what to do when you have lots of
> medium-length strings whose sum exceeds the toast target.
>
> IMHO it would be better to have a col-level "storage priority" (default
> 0) and then a table-level settable toast target. So we start applying
> the storage handling mechanisms on the highest-priority columns and keep
> going in descending order until we are under the limit for the table.
>
> ALTER TABLE foo
>     ALTER COLUMN foocol
>     SET STORAGE EXTERNAL PRIORITY 5
>     WITH (toast_target = 400);	/* must be a MAXALIGNed value */
>
> Equal priorities are allowed, in which case the lowest attribute id wins,
> i.e. current behaviour remains the default.
>
> --
> Simon Riggs
> EnterpriseDB   http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

       http://www.postgresql.org/docs/faq

--
Bruce Momjian  <bruce@momjian.us>          http://momjian.us
EnterpriseDB                               http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +