Thread: BUG #13530: sort receives "unexpected out-of-memory situation during sort"
BUG #13530: sort receives "unexpected out-of-memory situation during sort"
From:
brent_despain@selinc.com
Date:
The following bug has been logged on the website:

Bug reference: 13530
Logged by: Brent DeSpain
Email address: brent_despain@selinc.com
PostgreSQL version: 9.3.4
Operating system: Ubuntu

Description:

We are occasionally receiving "unexpected out-of-memory situation during sort". This happens mostly when we are running our version of pg_dump, which returns the data from each table in primary key order. The amount of data in each table is relatively small (< 100k in the largest tables). The error is not predictably reproducible. work_mem is set to 1536kB; setting work_mem back to the default seems to decrease the occurrence of the error. We did not have this problem in PostgreSQL 8.3.x.

<< Speculation >>

I have not contributed to PostgreSQL before, so take this with a grain of salt. I reviewed the code and found that the error comes from grow_memtuples(Tuplesortstate *state) in src\backend\utils\sort\tuplesort.c or src\backend\utils\sort\tuplestore.c. The problem seems to come from the new 9.x algorithm that grows memtupsize by grow_ratio when doubling memtupsize is no longer possible. It seems that if we get really close to availMem, the addition of the chunk header returned by GetMemoryChunkSpace() pushes us over the edge and LACKMEM() becomes TRUE. This is my speculation.
brent_despain@selinc.com writes:
> We are occasionally receiving "unexpected out-of-memory situation during
> sort".

Hmm. Looking at the code here, it suddenly strikes me that it's assuming that LACKMEM() wasn't true to begin with, and that this is not guaranteed, because we adjust the memory consumption counter *before* we call puttuple_common. So it would fail only if we exhausted allowedMem on the same tuple that would cause an enlargement of the memtuples[] array, which would be unusual enough to explain why this isn't terribly reproducible.

Could you try something like this at the head of grow_memtuples() to see if it makes things better?

     /* Forget it if we've already maxed out memtuples, per comment above */
     if (!state->growmemtuples)
         return false;

+    /* Forget it if we already exhausted memory */
+    if (LACKMEM(state))
+        return false;

regards, tom lane
I wrote:
> brent_despain@selinc.com writes:
>> We are occasionally receiving "unexpected out-of-memory situation during
>> sort".

> Hmm. Looking at the code here, it suddenly strikes me that it's assuming
> that LACKMEM() wasn't true to begin with, and that this is not guaranteed,
> because we adjust the memory consumption counter *before* we call
> puttuple_common.

In the light of day that theory doesn't hold up, because if LACKMEM were true on entry (ie, availMem < 0) then we'd compute a grow_ratio less than one, so that the "Must enlarge array by at least one element" check would trigger, and we'd never get to the code that's failing.

Another idea that occurred to me is that the "memory chunk overhead won't increase" assumption could fail if sizeof(SortTuple) weren't a multiple of MAXALIGN (because repalloc internally rounds the request up to a MAXALIGN boundary), but that doesn't seem plausible either. I'd expect that struct to be 16 bytes on 32-bit or 24 bytes on 64-bit, so it should be maxaligned on any hardware I know about.

So I'm about out of ideas. Could you modify your copy of the code to log interesting details when you get this error, like the old and new memtupsize and chunk space measurements? That might give us a clue what's the problem.

regards, tom lane
Re: BUG #13530: sort receives "unexpected out-of-memory situation during sort"
From:
brent_despain@selinc.com
Date:
Thanks Tom for the help. We will see if we can capture some additional information to help with the investigation.

Brent DeSpain
Automation & Integration Engineering
Schweitzer Engineering Laboratories, Inc.
Boise, ID
Wk: 509-334-8007
mailto:brent_despain@selinc.com

From: Tom Lane <tgl@sss.pgh.pa.us>
To: brent_despain@selinc.com
Cc: pgsql-bugs@postgresql.org
Date: 08/01/2015 11:56 AM
Subject: Re: [BUGS] BUG #13530: sort receives "unexpected out-of-memory situation during sort"
Re: BUG #13530: sort receives "unexpected out-of-memory situation during sort"
From:
brent_despain@selinc.com
Date:
Tom,

Here is what we found. The error is from tuplestore.c in the grow_memtuples() function. It probably applies to tuplesort.c as well.

WARNING: orig_memtupsize: 1024
WARNING: orig_memtuples_size: 4104
WARNING: orig_availMem: 2544
WARNING: newmemtupsize: 1025
WARNING: new_memtuples_size: 8200
WARNING: new_availMem: -1552
WARNING: code_path: 1

code_path indicates that the growth_ratio algorithm path was taken.

It looks like the culprit is the chunk size algorithm. The orig_memtuples_size is 4096 bytes (plus an 8-byte header). It is fully used (1024 (orig_memtupsize) x 4 (sizeof(void *))). The growth_ratio algorithm determines that it can fit 1 more tuple, setting newmemtupsize to 1025. The repalloc requests 4100 bytes, but the chunk size algorithm uses chunks of size 2^N, so the next chunk is 8192 bytes plus an 8-byte header. This blows past orig_availMem of 2544 and causes LACKMEM to be TRUE.

I see a few options.
- Don't check LACKMEM here. It hardly gets checked anywhere else.
- Don't use growth_ratio if the new repalloc request will be <= 8kB.
- When the new repalloc request is <= 8kB, over-allocate allowedMem and increase newmemtupsize to fully use the new chunk size.

Let me know what your plans are.

Brent DeSpain
Automation & Integration Engineering
Schweitzer Engineering Laboratories, Inc.
Boise, ID
Wk: 509-334-8007
mailto:brent_despain@selinc.com
brent_despain@selinc.com writes:
> The error is from tuplestore.c in the grow_memtuples() function. It
> probably applies to tuplesort.c as well.

> WARNING: orig_memtupsize: 1024
> WARNING: orig_memtuples_size: 4104
> WARNING: orig_availMem: 2544
> WARNING: newmemtupsize: 1025
> WARNING: new_memtuples_size: 8200
> WARNING: new_availMem: -1552

Ohhh ... I was supposing that this was happening in tuplesort.c. The issue cannot arise there because sizeof(SortTuple) is at least 16 bytes, so with even the minimum memtupsize of 1024, we are requesting a chunk of at least 16K; which is more than is needed to shove palloc into its separate-chunk allocation path, which has constant chunk allocation overhead as this code expects.

But in tuplestore.c, the array elements are just "void *". On a 32-bit machine that's small enough that the very first allocation is a standard palloc chunk, not a separate chunk. So then the allocation overhead *does* increase at the first enlargement, and if you're close to exhausting work_mem then LACKMEM can happen due to that. The new code that made sorting run closer to full memory utilization may have increased the odds of this, but it did not create the problem.

So you need both a 32-bit machine and fairly small work_mem to trigger this failure; that explains why I could not reproduce it here (I was using a 64-bit machine).

> I see a few options.
> - Don't check LACKMEM here. It hardly gets checked anywhere else.

Not workable. If we ignore LACKMEM here, we may have enlarged the array so much that even dumping all the tuples to disk would not restore positive availMem; this is the "space management algorithm will go nuts" hazard referred to in the comment. (The new array-sizing logic might've reduced the odds of this, but I doubt it would be safe to assume it can't happen.)

> - Don't use growth_ratio if the new repalloc request will be <= 8kB.
> - When the new repalloc request is <= 8kB, over-allocate allowedMem and
>   increase newmemtupsize to fully use the new chunk size.

AFAICS neither of these will fix the problem, because the issue is that the chunk overhead will increase as soon as we enlarge the initial array request. I think the best fix really is to increase the initial array size so that it's at least large enough to force palloc to do what we want. As a quick hack you could just s/1024/2048/ in tuplestore_begin_common, but what would be more maintainable in the long run is for the memory allocation code to export a #define for its "large chunk" threshold, and make the initial allocation here depend on that.

regards, tom lane
I wrote:
> But in tuplestore.c, the array elements are just "void *". On a 32-bit
> machine that's small enough that the very first allocation is a standard
> palloc chunk not a separate chunk. So then the allocation overhead *does*
> increase at the first enlargement, and if you're close to exhausting
> work_mem then LACKMEM can happen due to that. The new code that made
> sorting run closer to full memory utilization may have increased the odds
> of this, but it did not create the problem.

> So you need both a 32-bit machine and fairly small work_mem to trigger
> this failure; that explains why I could not reproduce it here (I was
> using a 64-bit machine).

Actually that's not true; the problem can happen on 64-bit as well. I was thinking that the behavior change occurred at 8K allocation requests, but actually it happens for requests larger than 8K.

> I think the best fix really is to increase the initial array size
> so that it's at least large enough to force palloc to do what we want.
> As a quick hack you could just s/1024/2048/ in tuplestore_begin_common,
> but what would be more maintainable in the long run is for the memory
> allocation code to export a #define for its "large chunk" threshold,
> and make the initial allocation here depend on that.

If you want to use the "official" patch for this, see
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8bd45a394958c3fd7400654439ef2a113043f8f5

regards, tom lane