Discussion: HASH_CHUNK_SIZE vs malloc rounding


HASH_CHUNK_SIZE vs malloc rounding

From
Thomas Munro
Date:
Hi hackers,

HASH_CHUNK_SIZE is defined as 1024 * 32 = 0x8000.  The size of the
chunks that nodeHash.c passes to palloc is that +
offsetof(HashJoinMemoryChunkData, data), which is 0x20 here.  So we
ask aset.c for 0x8020 bytes.  Sizes in that range are sent directly to
malloc, after adding ALLOC_BLOCKHDRSZ and ALLOC_CHUNKHDRSZ.  That
brings the grand total arriving into malloc on this system to 0x8058.

macOS's malloc seems to round this up to the nearest 512 bytes, so we
waste 424 bytes.  That's about 1.3% extra memory overhead, unaccounted for
in work_mem.  Maybe that's not too bad.
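
For concreteness, here is that arithmetic as a tiny standalone C program
(the 0x20 chunk header and 0x38 of aset.c overhead are the values observed
on this particular system, not portable constants):

    #include <stdio.h>

    int
    main(void)
    {
        /* Values taken from the message above; they vary by build/platform. */
        size_t hash_chunk_size = 1024 * 32;                      /* 0x8000 */
        size_t chunk_header = 0x20;                              /* offsetof(HashJoinMemoryChunkData, data) */
        size_t palloc_request = hash_chunk_size + chunk_header;  /* 0x8020 */
        size_t aset_overhead = 0x38;                             /* ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ here */
        size_t malloc_request = palloc_request + aset_overhead;  /* 0x8058 = 32856 */

        /* macOS malloc rounds a request this size up to a 512 byte boundary. */
        size_t rounded = (malloc_request + 511) & ~(size_t) 511; /* 33280 */

        printf("malloc gets %zu, sets aside %zu, wasting %zu bytes (%.1f%%)\n",
               malloc_request, rounded, rounded - malloc_request,
               100.0 * (rounded - malloc_request) / malloc_request);
        return 0;
    }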

FreeBSD's malloc (jemalloc) seems to be even worse though.  I haven't
figured out the details, but I think it might finish up eating 36KB!

I bet other allocators also do badly with "32KB plus a smidgen".  To
minimise overhead we'd probably need to try to arrange for exactly
32KB (or some other power of 2 or at least factor of common page/chunk
size?) to arrive into malloc, which means accounting for both
nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
bit horrible.  It may not be worth doing anything about.

Sadly/happily glibc doesn't seem to have this problem.  I looked at
the addresses of successive allocations: there is clearly an 8 or 16
byte overhead between them, but otherwise no rounding of this kind,
which I suppose means that this isn't a problem for the majority of
users.

I was thinking about this because in my shared hash patch, I use the
DSA allocator which currently handles allocations in this size range
with 4096 byte pages, so 32KB + a smidgen finishes up costing 36KB of
memory (~12.5% wasted overhead).  I'll probably adjust the chunk size
in the patch to avoid that for shared hash tables, but I figured
people might want to hear about the existing malloc wastage on certain
OSes.
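
A quick check of that DSA arithmetic (a sketch, using the 4096 byte page
size described above):

    #include <stdio.h>

    int
    main(void)
    {
        size_t request = 32 * 1024 + 0x20;              /* 32KB plus a smidgen */
        size_t pages = (request + 4096 - 1) / 4096;     /* 9 whole 4096-byte pages */

        printf("%zu bytes requested, %zu bytes of pages used\n",
               request, pages * 4096);                  /* 32800 -> 36864 */
        return 0;
    }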

-- 
Thomas Munro
http://www.enterprisedb.com



Re: HASH_CHUNK_SIZE vs malloc rounding

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> I bet other allocators also do badly with "32KB plus a smidgen".  To
> minimise overhead we'd probably need to try to arrange for exactly
> 32KB (or some other power of 2 or at least factor of common page/chunk
> size?) to arrive into malloc, which means accounting for both
> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
> bit horrible.  It may not be worth doing anything about.

Yeah, the other problem is that without a lot more knowledge of the
specific allocator, we shouldn't really assume that it's good or bad with
an exact-power-of-2 request --- it might well have its own overhead.
It is an issue though, and not only in nodeHash.c.  I'm pretty sure that
StringInfo also makes exact-power-of-2 requests for no essential reason,
and there are probably many other places.

We could imagine providing an mmgr API function along the lines of "adjust
this request size to the nearest thing that can be allocated efficiently".
That would avoid the need for callers to know about aset.c overhead
explicitly.  I'm not sure how it could deal with platform-specific malloc
vagaries though :-(
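
For illustration only, such a function might look roughly like this
(hypothetical name and signature; nothing like it exists in mmgr today):

    /*
     * Hypothetical sketch: round "size" up to the largest request that fits
     * in the same underlying allocation the context would make anyway, so
     * callers can grow buffers without knowing about aset.c block/chunk
     * headers or malloc size classes.
     */
    extern Size MemoryContextGoodAllocSize(MemoryContext context, Size size);

    /* e.g. a grow-the-buffer caller might then do: */
    newlen = MemoryContextGoodAllocSize(CurrentMemoryContext, needed);
    buf = repalloc(buf, newlen);
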
        regards, tom lane



Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

From
Thomas Munro
Date:
On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I bet other allocators also do badly with "32KB plus a smidgen".  To
>> minimise overhead we'd probably need to try to arrange for exactly
>> 32KB (or some other power of 2 or at least factor of common page/chunk
>> size?) to arrive into malloc, which means accounting for both
>> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
>> bit horrible.  It may not be worth doing anything about.
>
> Yeah, the other problem is that without a lot more knowledge of the
> specific allocator, we shouldn't really assume that it's good or bad with
> an exact-power-of-2 request --- it might well have its own overhead.
> It is an issue though, and not only in nodeHash.c.  I'm pretty sure that
> StringInfo also makes exact-power-of-2 requests for no essential reason,
> and there are probably many other places.

Right, enlargeStringInfo doubles the size whenever it runs out, a
common technique.  Aside from the "plus a smidgen" thing mentioned
above, there is something else interesting about that:  Andrew Koenig
wrote a widely referenced comp.lang.c++.moderated post[1] about why
you should probably use a growth factor of 1.5, and the topic comes up
from time to time in standard library implementation discussions[2].
I have no idea whether it really matters in reality and how the space
vs time trade off works with whatever actual string growth patterns
someone might be optimising for, but it's fun to think about...

[1] https://groups.google.com/forum/#!msg/comp.lang.c++.moderated/asH_VojWKJw/Lo51JEmLVboJ
[2] https://gcc.gnu.org/ml/libstdc++/2013-03/msg00058.html
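
A toy model of that argument (an idealised allocator that can coalesce and
reuse adjacent freed blocks; nothing to do with the real StringInfo code):

    #include <stdio.h>

    int
    main(void)
    {
        /*
         * With doubling, the space freed by all previous buffers
         * (1 + 2 + ... + 2^(k-1) = 2^k - 1) is always just short of the
         * next request (2^k), so it can never be reused.  With a growth
         * factor below the golden ratio (~1.618), it eventually can be.
         */
        double cap = 1.0, freed = 0.0;

        for (int i = 0; i < 8; i++)
        {
            double next = cap * 1.5;

            freed += cap;       /* the old buffer is freed after the copy */
            printf("next=%.2f freed-so-far=%.2f reusable=%s\n",
                   next, freed, freed >= next ? "yes" : "no");
            cap = next;
        }
        return 0;
    }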

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

From
Thomas Munro
Date:
On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I bet other allocators also do badly with "32KB plus a smidgen".  To
>> minimise overhead we'd probably need to try to arrange for exactly
>> 32KB (or some other power of 2 or at least factor of common page/chunk
>> size?) to arrive into malloc, which means accounting for both
>> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
>> bit horrible.  It may not be worth doing anything about.
>
> Yeah, the other problem is that without a lot more knowledge of the
> specific allocator, we shouldn't really assume that it's good or bad with
> an exact-power-of-2 request --- it might well have its own overhead.
> It is an issue though, and not only in nodeHash.c.  I'm pretty sure that
> StringInfo also makes exact-power-of-2 requests for no essential reason,
> and there are probably many other places.
>
> We could imagine providing an mmgr API function along the lines of "adjust
> this request size to the nearest thing that can be allocated efficiently".
> That would avoid the need for callers to know about aset.c overhead
> explicitly.  I'm not sure how it could deal with platform-specific malloc
> vagaries though :-(

Someone pointed out to me off-list that jemalloc's next size class
after 32KB is in fact 40KB by default[1].  So PostgreSQL uses 25% more
memory for hash joins than it thinks it does on some platforms.  Ouch.

It doesn't seem that crazy to expose aset.c's overhead size to client
code does it?  Most client code wouldn't care, but things that are
doing something closer to memory allocator work themselves like
dense_alloc could care.  It could deal with its own overhead itself,
and subtract aset.c's overhead using a macro.

[1] https://www.freebsd.org/cgi/man.cgi?jemalloc(3)
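
A rough sketch of what that could look like (hypothetical macro names;
aset.c exposes no such thing today), so that the request arriving at
malloc comes out at exactly 32KB rather than 32KB plus a smidgen:

    /*
     * Hypothetical sketch only -- none of these macros exist in this form.
     * If aset.c exposed its per-request overhead, dense_alloc could size
     * its chunks so that malloc sees exactly HASH_CHUNK_TARGET bytes.
     */
    #define ALLOC_REQUEST_OVERHEAD   (ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ)
    #define HASH_CHUNK_HEADER_SIZE   offsetof(HashJoinMemoryChunkData, data)
    #define HASH_CHUNK_TARGET        (32 * 1024)   /* what malloc should see */
    #define HASH_CHUNK_PAYLOAD \
        (HASH_CHUNK_TARGET - ALLOC_REQUEST_OVERHEAD - HASH_CHUNK_HEADER_SIZE)

    /* ... and in dense_alloc(): */
    chunk = palloc(HASH_CHUNK_HEADER_SIZE + HASH_CHUNK_PAYLOAD);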

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could imagine providing an mmgr API function along the lines of "adjust
>> this request size to the nearest thing that can be allocated efficiently".
>> That would avoid the need for callers to know about aset.c overhead
>> explicitly.  I'm not sure how it could deal with platform-specific malloc
>> vagaries though :-(

> Someone pointed out to me off-list that jemalloc's next size class
> after 32KB is in fact 40KB by default[1].  So PostgreSQL uses 25% more
> memory for hash joins than it thinks it does on some platforms.  Ouch.

> It doesn't seem that crazy to expose aset.c's overhead size to client
> code does it?  Most client code wouldn't care, but things that are
> doing something closer to memory allocator work themselves like
> dense_alloc could care.  It could deal with its own overhead itself,
> and subtract aset.c's overhead using a macro.

Seeing that we now have several allocators with different overheads,
I think that exposing this directly to clients is exactly what not to do.
I still like the idea I sketched above of a context-type-specific function
to adjust a request size to something that's efficient.

But there's still the question of how do we know what's an efficient-sized
malloc request.  Is there good reason to suppose that powers of 2 are OK?
        regards, tom lane


Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

From
Thomas Munro
Date:
On Fri, Nov 24, 2017 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It doesn't seem that crazy to expose aset.c's overhead size to client
>> code does it?  Most client code wouldn't care, but things that are
>> doing something closer to memory allocator work themselves like
>> dense_alloc could care.  It could deal with its own overhead itself,
>> and subtract aset.c's overhead using a macro.
>
> Seeing that we now have several allocators with different overheads,
> I think that exposing this directly to clients is exactly what not to do.
> I still like the idea I sketched above of a context-type-specific function
> to adjust a request size to something that's efficient.
>
> But there's still the question of how do we know what's an efficient-sized
> malloc request.  Is there good reason to suppose that powers of 2 are OK?

It looks like current glibc versions don't do that sort of rounding
until you reach M_MMAP_THRESHOLD (default 128kB), at which point they
start working in 4kB pages.  Below that, the manual says glibc doesn't
do any rounding beyond alignment[1].  I haven't come across any other
allocator that is so forgiving, but I haven't checked Illumos, AIX or
Windows.  Only the first of those has source code available, and they
confuse you by shipping a bunch of different allocators in their tree.
I didn't check the other BSDs.  tcmalloc[2], jemalloc and macOS malloc
round up to (respectively) a 4kB page, an 8kB size class and a 512
byte size class when you allocate 32kB + 1 byte.

[1] https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html
[2] http://goog-perftools.sourceforge.net/doc/tcmalloc.html
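
For anyone who wants to check their own platform, here is a minimal probe
(glibc-flavoured: malloc_usable_size() lives in <malloc.h> there; FreeBSD
has it in <malloc_np.h>, and macOS spells it malloc_size() in
<malloc/malloc.h>):

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /* Roughly what aset.c sends to malloc for one hash join chunk. */
        size_t request = 32 * 1024 + 0x58;
        void   *p = malloc(request);

        /*
         * The usable size understates the allocator's total bookkeeping,
         * but it does reveal size-class rounding like jemalloc's 40kB.
         */
        printf("requested %zu, usable %zu\n", request, malloc_usable_size(p));
        free(p);
        return 0;
    }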

-- 
Thomas Munro
http://www.enterprisedb.com