Обсуждение: Cache last known per-tuple offsets to speed long tuple access
I made a patch for "Cache last known per-tuple offsets to speed long tuple access" that is in TODO list. This problem was discussed on hackers-list as "Terrible performance on wide selects". The point of this problem is nocachegetattr() used from ExecEvalVar(). If tuple has many columns, and it has varlen column or null data, time spent in nocachegetattr() is O(N^2) in the number of fields. I referred URL below for implementation. http://archives.postgresql.org/pgsql-performance/2003-01/msg00262.php The point of this patch is as follows: (1)heap_deformtuple_incr() is added. This function can extract attributes of tupple, incrementally. (2)The cache which keeps the result of heap_deformtuple_incr(), is added inside TupleTableSlot. (3)In ExecEvalVar(), heap_deformtuple_incr() is used in place of nocachegetattr(). This would reduce the time from O(N^2) to O(N). In order to measure the effect, I executed the test below. ------------------- Table has 15,000tuples, 200columns. All data type is text. Table name is aaa. Column name is t001...t200. Executed SQL is, select t100, t110, t120, t130, t140, t150, t160, t170, t180, t190, t200 from aaa; The profile result of original code is as follows. ------------------- Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 70.05 1.31 1.31 163846 0.00 0.00 nocachegetattr 8.02 1.46 0.15 163840 0.00 0.00 FunctionCall3 1.87 1.50 0.04 397763 0.00 0.00 AllocSetFreeIndex 1.60 1.52 0.03 163840 0.00 0.00 ExecEvalVar 1.34 1.55 0.03 200414 0.00 0.00 AllocSetAlloc The profile result after the patch applying is as follows. ------------------- Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 39.73 0.29 0.29 180224 0.00 0.00 heap_deformtuple_incr 9.59 0.36 0.07 163840 0.00 0.00 FunctionCall3 6.85 0.41 0.05 16384 0.00 0.02 ExecTargetList 5.48 0.45 0.04 23477 0.00 0.00 hash_any 4.11 0.48 0.03 200414 0.00 0.00 AllocSetAlloc Regards, --- Atsushi Ogawa (a_ogawa@hi-ho.ne.jp)
Вложения
a_ogawa <a_ogawa@hi-ho.ne.jp> writes: > I made a patch for "Cache last known per-tuple offsets to speed > long tuple access" that is in TODO list. > I referred URL below for implementation. > http://archives.postgresql.org/pgsql-performance/2003-01/msg00262.php This wasn't quite what I had in mind. In particular I think it's a bad idea to put a significant part of the implementation directly into ExecEvalVar, because we'll have to duplicate that code anywhere else that proves to be a hot-spot. I think the right way is to make a new function that is just like heap_getattr except it is given a TupleTableSlot instead of separate tuple and tupdesc arguments, and encapsulate the knowledge about optimizing access using previously determined offsets inside that. Also, heap_deformtuple is a hot spot in its own right now, so slowing it down so that it can share code with the TupleTableSlot optimization doesn't seem to me like a win. There's not that much code in it, and the data structures involved are very unlikely to change much, so I think it'd be better to just duplicate the code and then optimize separately. This would also avoid the confusion you've introduced into heap_deformtuple about the roles of the various variables (the first comment is effectively backwards now...) We did not have heap_deformtuple in its current incarnation at the time of the discussion you reference, so there may be some additional thought needed. It'd be nice if heap_deformtuple could make use of already-computed info, for instance. I'm not sure if that can be done without fairly wide-ranging API changes (most of its callers don't have Slots available), so it may be something to leave for a future patch. Lastly, I don't see any point in introducing DeformTupleState and deformtuple_cache as abstractions in their own right. I think it'd be simpler and more readable just to regard all this as fields of a TupleTableSlot. If there were any other contexts that we might want to cache partial tuple disassembly info in, then it'd be worth making a separation, but I don't foresee any such application. BTW, why is it that your profile shows *more* calls to heap_deformtuple_incr after the patch than there were nocachegetattr calls before? Seems like there should be fewer, or at least the same number. I'm now wondering whether there is any point to the "incremental" aspect at all, as opposed to just doing a heap_deformtuple call the first time a column value is demanded and always extracting all the fields at that time. (Of course, this will always look like a win in the context of test cases that access most or all of the fields. What we need to worry about is the penalty incurred in cases that don't.) regards, tom lane
Thank you for advice. I am going to remake a patch, in order to make it simple. The plan of a new patch is as follows. (1)Add fields to TupleTableSlot and TupleTableData. This fields are for holding the tuple disassembly information. (2)Add the codes which initializes/cleans new fields. These codes are added to execTuples.c. (3)Add two functions to execQual.c. One function is just like heap_deformtuple. It is given a TupleTableSlot. And it extracts the field of tuple incrementary using the new fields of TupleTableSlot. The meaning of incrementary is as the following example. Example: The tupple has 100 columns. - first call to get col5 will fill first 5 positions in the array. - next call to get col75 will fill starts from 5 and up to 75. - next call to get col60 will only refer to array, because already extracted. Another function is just like heap_getattr and fast_getattr. It is given a TupleTableSlot. And this function uses new function(like a heap_deformtuple), instead of nocachegetattr. (4)ExecEvalVar uses new function(like a heap_getattr) instead of heap_getattr. With a new patch, only three files of tuptable.h, and execTuple.c and execQual.c are due to be changed. > BTW, why is it that your profile shows *more* calls to > heap_deformtuple_incr after the patch than there were nocachegetattr > calls before? Many one is for printtup. (printtup -> heap_deformtuple -> heap_deformtuple_incr) Since the code of heap_deformtuple and heap_deformtuple_incr has been share, heap_deformtuple_incr looks many calls than nocachegetattr. If a part for the number of calls of heap_deformtuple_incr by printtup is removed, heap_deformtuple_incr and nocachegetattr will be the same number of calls. With my test being to access the column in ascending order (select t100, t110 ...), heap_deformtuple_incr and nocachegetattr is the same calls. If the column is accessed in descending order(select t200, t190...), number of calls of heap_deformtuple_incr will decrease sharply. It is because a result is cached by the first call to get t200. regards, --- Atsushi Ogawa a_ogawa@hi-ho.ne.jp
I remaked patch for "Cache last known per-tuple offsets to speed long tuple access" that is in TODO list. The point of this patch is as follows: (1)Add fields to TupleTableSlot and TupleTableData. This fields are for holding the tuple disassembly information. (2)Add the codes which initializes/cleans new fields. These codes are added to execTuples.c. (3)Add two functions to execQual.c. This function name is slot_deformtuple and this is just like heap_deformtuple. This function can be resumed from the previous execution using the information encapsulated in the TupleTableSlot. Another function is just like heap_getattr and fast_getattr. This function name is slot_getattr. This is just like heap_getattr and fast_getattr macro, except it is given a TupleTableSlot, and this function uses slot_deformtuple insetead of nocachegetattr. (4)ExecEvalVar uses new function slot_getattr instead of heap_getattr. I executed the test below. ------------------- Table has 16,384tuples, 200columns. All data type is text. Table name is aaa. Column name is t001...t200. Executed SQL is, select t100, t110, t120, t130, t140, t150, t160, t170, t180, t190, t200 from aaa; The profile result of original code is as follows. ------------------- Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 74.19 1.38 1.38 163846 0.00 0.00 nocachegetattr 4.30 1.46 0.08 163840 0.00 0.00 FunctionCall3 1.61 1.49 0.03 397750 0.00 0.00 AllocSetFreeIndex 1.61 1.52 0.03 16384 0.00 0.00 ExecTargetList 1.08 1.54 0.02 344152 0.00 0.00 appendBinaryStringInfo The profile result after the patch applying is as follows. ------------------- Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 30.38 0.24 0.24 163840 0.00 0.00 slot_deformtuple 10.13 0.32 0.08 163840 0.00 0.00 FunctionCall3 5.06 0.36 0.04 163840 0.00 0.00 slot_getattr 5.06 0.40 0.04 16384 0.00 0.00 heap_deformtuple 3.80 0.43 0.03 49159 0.00 0.00 ExecClearTuple regards, --- Atsushi Ogawa a_ogawa@hi-ho.ne.jp
Вложения
This has been saved for the 8.1 release: http:/momjian.postgresql.org/cgi-bin/pgpatches2 --------------------------------------------------------------------------- a_ogawa wrote: > > I remaked patch for "Cache last known per-tuple offsets to speed > long tuple access" that is in TODO list. > > The point of this patch is as follows: > (1)Add fields to TupleTableSlot and TupleTableData. > This fields are for holding the tuple disassembly information. > > (2)Add the codes which initializes/cleans new fields. > These codes are added to execTuples.c. > > (3)Add two functions to execQual.c. > This function name is slot_deformtuple and this is > just like heap_deformtuple. This function can be resumed > from the previous execution using the information > encapsulated in the TupleTableSlot. > > Another function is just like heap_getattr and fast_getattr. > This function name is slot_getattr. This is just like > heap_getattr and fast_getattr macro, except it is given a > TupleTableSlot, and this function uses slot_deformtuple > insetead of nocachegetattr. > > (4)ExecEvalVar uses new function slot_getattr instead of > heap_getattr. > > I executed the test below. > ------------------- > Table has 16,384tuples, 200columns. All data type is text. > Table name is aaa. Column name is t001...t200. > Executed SQL is, > select t100, t110, t120, t130, t140, t150, > t160, t170, t180, t190, t200 > from aaa; > > The profile result of original code is as follows. > ------------------- > Each sample counts as 0.01 seconds. > % cumulative self self total > time seconds seconds calls s/call s/call name > 74.19 1.38 1.38 163846 0.00 0.00 nocachegetattr > 4.30 1.46 0.08 163840 0.00 0.00 FunctionCall3 > 1.61 1.49 0.03 397750 0.00 0.00 AllocSetFreeIndex > 1.61 1.52 0.03 16384 0.00 0.00 ExecTargetList > 1.08 1.54 0.02 344152 0.00 0.00 appendBinaryStringInfo > > The profile result after the patch applying is as follows. > ------------------- > Each sample counts as 0.01 seconds. > % cumulative self self total > time seconds seconds calls ms/call ms/call name > 30.38 0.24 0.24 163840 0.00 0.00 slot_deformtuple > 10.13 0.32 0.08 163840 0.00 0.00 FunctionCall3 > 5.06 0.36 0.04 163840 0.00 0.00 slot_getattr > 5.06 0.40 0.04 16384 0.00 0.00 heap_deformtuple > 3.80 0.43 0.03 49159 0.00 0.00 ExecClearTuple > > regards, > > --- Atsushi Ogawa > a_ogawa@hi-ho.ne.jp [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
a_ogawa <a_ogawa@hi-ho.ne.jp> writes: > I remaked patch for "Cache last known per-tuple offsets to speed > long tuple access" that is in TODO list. I've applied this patch with some revisions (mostly to put the functionality in places that seemed more appropriate than execQual.c). Many thanks! regards, tom lane