Discussion: User Defined Functions/AMs inherently slow?
I've created a stub AM that literally does nothing. It indexes nothing. It scans for nothing. Nada. It does just enough work to specify return values that prevent PG from dumping core. What I've found is that this stub AM, compiled with the same options as postgres itself (-O2 -fno-strict-aliasing), is roughly 4 times slower than the built-in btree AM that actually does something useful! The test table contains 1 column and 1 row.

My stub AM:
    explain analyze select * from test where a ==> '1';
    Total runtime: 0.254 ms

Builtin btree AM:
    explain analyze select * from test where a = '1';
    Total runtime: 0.058 ms

(I ran each one a number of times, with basically the same results.)

What gives? *scratches head* I know btrees are efficient, but geez, can they really be more efficient than O(zero)? :)

Looking at the backtrace from the beginscan function of each AM, PG doesn't appear to do anything different for user-provided AMs.

My platform is OS X 10.3.2, using PG 7.4, gcc version 3.3 (Apple's build #1495). Any insight would be greatly appreciated. Thanks!

eric
Eric Ridge <ebr@tcdi.com> writes:
> I've created a stub AM that literally does nothing.

It's not possible for an index AM to "do nothing", at least not for an indexscan. It has to return tuple pointers. What are you doing for that?

regards, tom lane
On Jan 17, 2004, at 10:22 PM, Tom Lane wrote:
> Eric Ridge <ebr@tcdi.com> writes:
>> I've created a stub AM that literally does nothing.
>
> It's not possible for an index AM to "do nothing", at least not for an
> indexscan. It has to return tuple pointers. What are you doing for
> that?

costestimate:
    {
        PG_RETURN_VOID();
    }

beginscan:
    {
        Relation index = (Relation) PG_GETARG_POINTER(0);
        int keysz = PG_GETARG_INT32(1);
        ScanKey scankey = (ScanKey) PG_GETARG_POINTER(2);
        IndexScanDesc scan = RelationGetIndexScan(index, keysz, scankey);

        PG_RETURN_POINTER(scan);
    }

rescan:
    {
        PG_RETURN_VOID();
    }

gettuple:
    {
        PG_RETURN_BOOL(false);
    }

endscan:
    {
        PG_RETURN_VOID();
    }

I think the above is about as close to "nothing" as one can get. In trying to track down some performance issues with my real AM, I decided to make this stub AM just to see what the overhead is... then I started seeing these crazy results.

eric
On Jan 17, 2004, at 10:22 PM, Tom Lane wrote:
> Eric Ridge <ebr@tcdi.com> writes:
>> I've created a stub AM that literally does nothing.
>
> It's not possible for an index AM to "do nothing", at least not for an
> indexscan. It has to return tuple pointers. What are you doing for
> that?

I should have included the entire explain output:

stub AM:
    Index Scan using idxa_stub on test2  (cost=0.00..2.68 rows=1 width=5) (actual time=0.002..0.002 rows=0 loops=1)
      Index Cond: (a ==> '1'::text)
    Total runtime: 0.247 ms

builtin btree AM:
    Index Scan using idxa_btree on test2  (cost=0.00..4.68 rows=1 width=5) (actual time=0.024..0.026 rows=1 loops=1)
      Index Cond: (a = '1'::text)
    Total runtime: 0.060 ms

If the "actual time" numbers are really a measure of the amount of time spent in (at least) the index, it seems the stub should report a smaller "total runtime", but alas, it doesn't.

eric
Eric Ridge <ebr@tcdi.com> writes:
> costestimate:
>     {
>         PG_RETURN_VOID();
>     }

This at least needs to set some values into the output parameters --- zeroes are okay, not setting them at all isn't. I'm surprised the planner doesn't go nuts. It looks from your EXPLAIN results like the values are coming up zero anyway, but seeing that cost_index() doesn't do anything to initialize those local variables, I'd expect fairly unpredictable behavior.

The other stubs look fine on quick review.

regards, tom lane
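For illustration, a costestimate stub that does what Tom asks for might look something like the sketch below. It assumes the 7.4-era amcostestimate calling convention, with the four output parameters arriving in argument positions 4 through 7; the function name and the header for the Cost/Selectivity typedefs are assumptions, and the only point is that the outputs get written to instead of being left as garbage.

#include "postgres.h"
#include "fmgr.h"
#include "nodes/nodes.h"        /* assumed location of the Cost and Selectivity typedefs */

/*
 * Hypothetical do-nothing cost estimator.  The first four arguments
 * (root, rel, index, indexQuals) are ignored by the stub; the last four
 * are output parameters that must be initialized, even by a stub.
 */
Datum
stub_costestimate(PG_FUNCTION_ARGS)
{
    Cost       *indexStartupCost = (Cost *) PG_GETARG_POINTER(4);
    Cost       *indexTotalCost = (Cost *) PG_GETARG_POINTER(5);
    Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(6);
    double     *indexCorrelation = (double *) PG_GETARG_POINTER(7);

    *indexStartupCost = 0.0;
    *indexTotalCost = 0.0;
    *indexSelectivity = 0.0;
    *indexCorrelation = 0.0;

    PG_RETURN_VOID();
}

Whether all-zero estimates are sensible for a real AM is a separate question; they just keep the planner from reading uninitialized memory.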
Eric Ridge <ebr@tcdi.com> writes:
> I should have included the entire explain output:
>
> stub AM:
>     Index Scan using idxa_stub on test2  (cost=0.00..2.68 rows=1 width=5) (actual time=0.002..0.002 rows=0 loops=1)
>       Index Cond: (a ==> '1'::text)
>     Total runtime: 0.247 ms
>
> builtin btree AM:
>     Index Scan using idxa_btree on test2  (cost=0.00..4.68 rows=1 width=5) (actual time=0.024..0.026 rows=1 loops=1)
>       Index Cond: (a = '1'::text)
>     Total runtime: 0.060 ms
>
> If the "actual time" numbers are really a measure of the amount of time
> spent in (at least) the index, it seems the stub should report a
> smaller "total runtime", but alas, it doesn't.

The difference between "total runtime" and the top plan node's runtime has to represent plan startup/shutdown time. I'm suspicious that your stubs are somehow not initializing something, though on first glance I do not see what.

Theory B would be that there's some huge overhead in calling non-built-in functions on your platform. We do know that looking up a "C" function is significantly more expensive than looking up a "builtin" function, but there should only be half a dozen such calls involved in this test case; it's hard to credit that that takes 0.2 msec. Does the time drop at all on second and subsequent repetitions in a single backend run?

regards, tom lane
On Jan 17, 2004, at 11:27 PM, Tom Lane wrote:
> Eric Ridge <ebr@tcdi.com> writes:
>> costestimate:
>>     {
>>         PG_RETURN_VOID();
>>     }
>
> This at least needs to set some values into the output parameters ---
> zeroes are okay, not setting them at all isn't. I'm surprised the
> planner doesn't go nuts. It looks from your EXPLAIN results like
> the values are coming up zero anyway, but seeing that cost_index()
> doesn't do anything to initialize those local variables, I'd expect
> fairly unpredictable behavior.

I have tried setting them all to zero, and even using the backend/utils/adt/selfuncs.c:genericcostestimate() code (it's not exported, it seems), but no matter what sane (or crazy!) numbers I provide for the cost estimate, the "fact" remains: my AM, which does "nothing", is slower than the builtin btree AM.

Could this just be related to the fact that my AM is in a .so, and there's just some operating system/C runtime overhead in calling functions in dynamically loaded libraries?

eric
On Jan 17, 2004, at 11:33 PM, Tom Lane wrote:
> The difference between "total runtime" and the top plan node's runtime
> has to represent plan startup/shutdown time. I'm suspicious that your
> stubs are somehow not initializing something, though on first glance I
> do not see what.

I can't see anything either... which is why I brought it up. I'm still a noob with this stuff, and thought maybe I was just missing something.

> Theory B would be that there's some huge overhead in calling non-built-in
> functions on your platform. We do know that looking up a "C" function is
> significantly more expensive than looking up a "builtin" function, but
> there should only be half a dozen such calls involved in this test case;
> it's hard to credit that that takes 0.2 msec. Does the time drop at all
> on second and subsequent repetitions in a single backend run?

Yes, it drops from about 0.680 ms to the 0.250-ish that I posted.

I suppose I should try compiling this little stub into postgres, eh?

eric
>> Theory B would be that there's some huge overhead in calling non-built-in
>> functions on your platform. We do know that looking up a "C" function is
>> significantly more expensive than looking up a "builtin" function, but
>> there should only be half a dozen such calls involved in this test case;
>> it's hard to credit that that takes 0.2 msec. Does the time drop at all
>> on second and subsequent repetitions in a single backend run?
>
> Yes, it drops from about 0.680 ms to the 0.250-ish that I posted.
>
> I suppose I should try compiling this little stub into postgres, eh?

What if you try the new preload_libraries (or whatever it's called) config variable in postgresql.conf in the 7.4 release?

Chris
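The setting Chris has in mind is preload_libraries in 7.4's postgresql.conf. A line along these lines, with the library name here standing in for wherever the stub AM's shared object actually lives, loads the library once at postmaster start instead of on first use in each backend:

    # postgresql.conf (7.4): preload the AM's shared library at server start,
    # so the first query in a backend doesn't pay the dlopen cost
    preload_libraries = '$libdir/stub_am'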
On Jan 18, 2004, at 1:34 AM, Christopher Kings-Lynne wrote:
>>> Theory B would be that there's some huge overhead in calling non-built-in
>>> functions on your platform. We do know that looking up a "C" function is
>>> significantly more expensive than looking up a "builtin" function, but
>>> there should only be half a dozen such calls involved in this test case;
>>> it's hard to credit that that takes 0.2 msec. Does the time drop at all
>>> on second and subsequent repetitions in a single backend run?
>>
>> Yes, it drops from about 0.680 ms to the 0.250-ish that I posted.
>> I suppose I should try compiling this little stub into postgres, eh?
>
> What if you try the new preload_libraries (or whatever it's called)
> config variable in postgresql.conf in the 7.4 release?

Yes, that takes care of the initial load time of the library itself (bringing the initial 0.680 ms back to 0.250-ish for the first run), but this is not the problem.

eric
>> Theory B would be that there's some huge overhead in calling
>> non-built-in functions on your platform.

I've done some profiling and convinced myself that indeed there's pretty steep overhead involved in fmgr_info() for a "C"-language function. Much of it isn't platform-dependent either --- as best I can tell, the lion's share of the time is being eaten in expand_dynamic_library_name(). In scenarios where a function is called many times per query, we cache the results of fmgr_info() ... but we do not do so for operations like ambeginscan that are done just once per query.

Every other function language uses shortcuts or caching to reduce the cost of fmgr_info() lookup; external C language is the only one that hasn't been optimized in this way. I shall see what I can do about that. ISTM we can have a hash table that maps function OID to function address using the same sorts of techniques that plpgsql et al use.

regards, tom lane
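A rough sketch of that idea, using the backend's dynahash routines, might look like the following. This is not the actual patch: the type, variable, and function names are invented, error handling and the cache invalidation a real implementation would need are omitted, and header locations are assumed to match the 7.4 sources.

#include "postgres.h"
#include "fmgr.h"
#include "utils/hsearch.h"

/*
 * Hypothetical cache: map a pg_proc OID to the already-resolved address of a
 * dynamically loaded "C" function, so the expensive library-name expansion
 * and symbol lookup happen only once per backend, not once per query.
 */
typedef struct CFuncCacheEntry
{
    Oid         fn_oid;         /* hash key: the function's pg_proc OID */
    PGFunction  user_fn;        /* resolved function address, or NULL */
} CFuncCacheEntry;

static HTAB *CFuncCache = NULL;

static CFuncCacheEntry *
cfunc_cache_lookup(Oid fn_oid, bool *found)
{
    if (CFuncCache == NULL)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(CFuncCacheEntry);
        ctl.hash = tag_hash;
        CFuncCache = hash_create("C function lookup cache", 100, &ctl,
                                 HASH_ELEM | HASH_FUNCTION);
    }

    /*
     * HASH_ENTER creates the entry if it isn't there yet; on a miss the
     * caller does the slow dynamic-library lookup once and stores the
     * result in user_fn for every later fmgr_info() call in this backend.
     */
    return (CFuncCacheEntry *) hash_search(CFuncCache, (void *) &fn_oid,
                                           HASH_ENTER, found);
}

On a cache hit, fmgr_info() could then skip expand_dynamic_library_name() and the dynamic loader entirely, which is where the profile said the time was going.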
On Jan 18, 2004, at 7:28 PM, Tom Lane wrote:
>>> Theory B would be that there's some huge overhead in calling
>>> non-built-in functions on your platform.
>
> I've done some profiling and convinced myself that indeed there's pretty
> steep overhead involved in fmgr_info() for a "C"-language function.
> Much of it isn't platform-dependent either --- as best I can tell,
> the lion's share of the time is being eaten in
> expand_dynamic_library_name(). In scenarios where a function is called
> many times per query, we cache the results of fmgr_info() ... but we do
> not do so for operations like ambeginscan that are done just once per
> query.

Wow, thanks for spending the time on this. What about for gettuple? Do calls to it take advantage of the cache? If not, this likely explains some of my custom AM's performance troubles.

> Every other function language uses shortcuts or caching to reduce the
> cost of fmgr_info() lookup; external C language is the only one that
> hasn't been optimized in this way. I shall see what I can do about that.
> ISTM we can have a hash table that maps function OID to function address
> using the same sorts of techniques that plpgsql et al use.

If there's anything I can do to help, let me know. I'll be happy to test any patches you might come up with too.

eric
"Eric B. Ridge" <ebr@tcdi.com> writes: > Wow, thanks for spending the time on this. What about for gettuple? > Do calls to it take advantage of the cache? If not, this likely > explains some of my custom am's performance troubles. gettuple is looked up once at the start of a scan, so there's no per-tuple overhead involved there. As I said before, we're usually pretty good about avoiding per-tuple overheads --- the cost you identified here is a per-query overhead. > If there's anything I can do to help, let me know. I'll be happy to > test any patches you might come up with too. I have committed a patch into CVS HEAD --- give it a try. regards, tom lane
I don't have the original thread in my inbox anymore, so for reference:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=20129.1074484755%40sss.pgh.pa.us

Tom Lane last said:
> I have committed a patch into CVS HEAD --- give it a try.

It took me a while, but I finally got around to trying this (by plopping it into my 7.4 sources), and yes, the hash lookup does significantly improve the performance of calling "C" functions. I agree with your commit message that they're now roughly on par with built-in functions.

thanks!

eric