Thread: HASH_CHUNK_SIZE vs malloc rounding
Hi hackers,

HASH_CHUNK_SIZE is defined as 1024 * 32 = 0x8000. The size of the chunks that nodeHash.c passes to palloc is that plus offsetof(HashJoinMemoryChunkData, data), which is 0x20 here. So we ask aset.c for 0x8020 bytes. Sizes in that range are sent directly to malloc, after adding ALLOC_BLOCKHDRSZ and ALLOC_CHUNKHDRSZ. That brings the grand total arriving into malloc on this system to 0x8058.

macOS's malloc seems to round this up to the nearest 512 bytes, so we waste 424 bytes. That's 1.2% extra memory overhead, unaccounted for in work_mem. Maybe that's not too bad. FreeBSD's malloc (jemalloc) seems to be even worse though. I haven't figured out the details, but I think it might finish up eating 36KB! I bet other allocators also do badly with "32KB plus a smidgen".

To minimise overhead we'd probably need to try to arrange for exactly 32KB (or some other power of 2 or at least factor of common page/chunk size?) to arrive into malloc, which means accounting for both nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a bit horrible. It may not be worth doing anything about.

Sadly/happily, glibc doesn't seem to have this problem. I looked at the addresses of successive allocations: there is clearly an 8 or 16 byte overhead between them, but otherwise no rounding of this kind, which I suppose means that this isn't a problem for the majority of users.

I was thinking about this because in my shared hash patch I use the DSA allocator, which currently handles allocations in this size range with 4096 byte pages, so 32KB + a smidgen finishes up costing 36KB of memory (~12.5% wasted overhead). I'll probably adjust the chunk size in the patch to avoid that for shared hash tables, but I figured people might want to hear about the existing malloc wastage on certain OSes.

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> I bet other allocators also do badly with "32KB plus a smidgen". To
> minimise overhead we'd probably need to try to arrange for exactly
> 32KB (or some other power of 2 or at least factor of common page/chunk
> size?) to arrive into malloc, which means accounting for both
> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
> bit horrible. It may not be worth doing anything about.

Yeah, the other problem is that without a lot more knowledge of the specific allocator, we shouldn't really assume that it's good or bad with an exact-power-of-2 request --- it might well have its own overhead.

It is an issue though, and not only in nodeHash.c. I'm pretty sure that StringInfo also makes exact-power-of-2 requests for no essential reason, and there are probably many other places.

We could imagine providing an mmgr API function along the lines of "adjust this request size to the nearest thing that can be allocated efficiently". That would avoid the need for callers to know about aset.c overhead explicitly. I'm not sure how it could deal with platform-specific malloc vagaries though :-(

			regards, tom lane
On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I bet other allocators also do badly with "32KB plus a smidgen". To
>> minimise overhead we'd probably need to try to arrange for exactly
>> 32KB (or some other power of 2 or at least factor of common page/chunk
>> size?) to arrive into malloc, which means accounting for both
>> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
>> bit horrible. It may not be worth doing anything about.
>
> Yeah, the other problem is that without a lot more knowledge of the
> specific allocator, we shouldn't really assume that it's good or bad with
> an exact-power-of-2 request --- it might well have its own overhead.
> It is an issue though, and not only in nodeHash.c. I'm pretty sure that
> StringInfo also makes exact-power-of-2 requests for no essential reason,
> and there are probably many other places.

Right, enlargeStringInfo doubles the size whenever it runs out, a common technique. Aside from the "plus a smidgen" thing mentioned above, there is something else interesting about that: Andrew Koenig wrote a widely referenced comp.lang.c++.moderated post[1] about why you should probably use a growth factor of 1.5 instead, and the topic comes up from time to time in standard library implementation discussions[2]. I have no idea whether it really matters in practice, or how the space vs time trade-off works out for whatever actual string growth patterns someone might be optimising for, but it's fun to think about...

[1] https://groups.google.com/forum/#!msg/comp.lang.c++.moderated/asH_VojWKJw/Lo51JEmLVboJ
[2] https://gcc.gnu.org/ml/libstdc++/2013-03/msg00058.html

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I bet other allocators also do badly with "32KB plus a smidgen". To
>> minimise overhead we'd probably need to try to arrange for exactly
>> 32KB (or some other power of 2 or at least factor of common page/chunk
>> size?) to arrive into malloc, which means accounting for both
>> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
>> bit horrible. It may not be worth doing anything about.
>
> Yeah, the other problem is that without a lot more knowledge of the
> specific allocator, we shouldn't really assume that it's good or bad with
> an exact-power-of-2 request --- it might well have its own overhead.
> It is an issue though, and not only in nodeHash.c. I'm pretty sure that
> StringInfo also makes exact-power-of-2 requests for no essential reason,
> and there are probably many other places.
>
> We could imagine providing an mmgr API function along the lines of "adjust
> this request size to the nearest thing that can be allocated efficiently".
> That would avoid the need for callers to know about aset.c overhead
> explicitly. I'm not sure how it could deal with platform-specific malloc
> vagaries though :-(

Someone pointed out to me off-list that jemalloc's next size class after 32KB is in fact 40KB by default[1]. So PostgreSQL uses 25% more memory for hash joins than it thinks it does on some platforms. Ouch.

It doesn't seem that crazy to expose aset.c's overhead size to client code does it? Most client code wouldn't care, but things that are doing something closer to memory allocator work themselves like dense_alloc could care. It could deal with its own overhead itself, and subtract aset.c's overhead using a macro.

[1] https://www.freebsd.org/cgi/man.cgi?jemalloc(3)

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could imagine providing an mmgr API function along the lines of "adjust
>> this request size to the nearest thing that can be allocated efficiently".
>> That would avoid the need for callers to know about aset.c overhead
>> explicitly. I'm not sure how it could deal with platform-specific malloc
>> vagaries though :-(

> Someone pointed out to me off-list that jemalloc's next size class
> after 32KB is in fact 40KB by default[1]. So PostgreSQL uses 25% more
> memory for hash joins than it thinks it does on some platforms. Ouch.

> It doesn't seem that crazy to expose aset.c's overhead size to client
> code does it? Most client code wouldn't care, but things that are
> doing something closer to memory allocator work themselves like
> dense_alloc could care. It could deal with its own overhead itself,
> and subtract aset.c's overhead using a macro.

Seeing that we now have several allocators with different overheads, I think that exposing this directly to clients is exactly what not to do. I still like the idea I sketched above of a context-type-specific function to adjust a request size to something that's efficient.

But there's still the question of how do we know what's an efficient-sized malloc request. Is there good reason to suppose that powers of 2 are OK?

			regards, tom lane
On Fri, Nov 24, 2017 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It doesn't seem that crazy to expose aset.c's overhead size to client
>> code does it? Most client code wouldn't care, but things that are
>> doing something closer to memory allocator work themselves like
>> dense_alloc could care. It could deal with its own overhead itself,
>> and subtract aset.c's overhead using a macro.
>
> Seeing that we now have several allocators with different overheads,
> I think that exposing this directly to clients is exactly what not to do.
> I still like the idea I sketched above of a context-type-specific function
> to adjust a request size to something that's efficient.
>
> But there's still the question of how do we know what's an efficient-sized
> malloc request. Is there good reason to suppose that powers of 2 are OK?

It looks like current glibc versions don't do that sort of thing until you reach M_MMAP_THRESHOLD (default 128kB), after which they work in 4kB pages. Below that, the manual says glibc doesn't do any rounding beyond alignment[1].

I haven't come across any other allocator that is so forgiving, but I haven't checked Illumos, AIX or Windows. Only the first of those has source code available, but they confuse you by shipping a bunch of different allocators in their tree. I didn't check the other BSDs. tcmalloc[2], jemalloc and macOS malloc round up to (respectively) a 4kB page, an 8kB size class and a 512 byte size class when you allocate 32kB + 1 byte.

[1] https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html
[2] http://goog-perftools.sourceforge.net/doc/tcmalloc.html

--
Thomas Munro
http://www.enterprisedb.com