Обсуждение: memory leak in e94568ecc10 (pre-reading in external sort)
Hi, it seems e94568ecc10 has a pretty bad memory leak. A simple pgbench -i -s 300 allocates ~32GB of memory before it fails vacuum... set primary keys... ERROR: out of memory DETAIL: Failed on request of size 134184960. The relevant bit from the memory context stats: TupleSort main: 33278738504 total in 263 blocks; 78848 free (23 chunks); 33278659656 used regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10/06/2016 07:50 AM, Tomas Vondra wrote: > it seems e94568ecc10 has a pretty bad memory leak. A simple Oops, fixed, thanks for the report! To be precise, this wasn't a memory leak, just a gross overallocation of memory. The new code in tuplesort.c assumes that it's harmless to call LogicalTapeRewind() on never-used tapes, but in fact, it allocated the read buffer for the tape. I fixed that by changing LogicalTapeRewind() to not allocate it, if the tape was completely empty. This is related to earlier the discussion with Peter G, on whether we should change state->maxTapes to reflect the actual number of tape that were used, when that's less than maxTapes. I think his confusion about the loop in init_tape_buffers(), in CAM3SWZQ7=FCy1iUZ6jNzUUNnktAG6uitC1i-DoNxScP-9ZsHwQ@mail.gmail.com would also have been avoided, if we had done that. So I think we should reconsider that. - Heikki
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > This is related to earlier the discussion with Peter G, on whether we should > change state->maxTapes to reflect the actual number of tape that were used, > when that's less than maxTapes. I think his confusion about the loop in > init_tape_buffers(), in > CAM3SWZQ7=FCy1iUZ6jNzUUNnktAG6uitC1i-DoNxScP-9ZsHwQ@mail.gmail.com would > also have been avoided, if we had done that. So I think we should reconsider > that. -1 on that from me. I don't think that you should modify a variable that is directly linkable to Knuth's description of polyphase merge -- doing so seems like a bad idea. state->maxTapes (Knuth's T) really is something that is established pretty early on, and doesn't change. While the fix you pushed was probably a good idea anyway, I still think you should not use state->maxTapes to exhaustively call LogicalTapeAssignReadBufferSize() on every tape, even non-active tapes. That's the confusing part. It's not as if your need for the number of input tapes (the number of maybe-active tapes) is long lived; you just need to instruct logtape.c on buffer sizing once, at the start of mergeruns(). Besides, what I propose to do is really exactly the same as what you also want to do, except it avoids actually changing state->maxTapes. We'd just pass down what you propose to assign to state->maxTapes directly, which differs (and not just in the common case where there are inactive tapes -- it's always at least off-by-one). Right? -- Peter Geoghegan
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geoghegan <pg@heroku.com> wrote: > Besides, what I propose to do is really exactly the same as what you > also want to do, except it avoids actually changing state->maxTapes. > We'd just pass down what you propose to assign to state->maxTapes > directly, which differs (and not just in the common case where there > are inactive tapes -- it's always at least off-by-one). Right? What I mean is that you should pass down numTapes alongside numInputTapes. The function init_tape_buffers() could either have an additional argument (numTapes), or derive numTapes itself. -- Peter Geoghegan
On 10/06/2016 06:44 PM, Peter Geoghegan wrote: > While the fix you pushed was probably a good idea anyway, I still > think you should not use swhichtate->maxTapes to exhaustively call > LogicalTapeAssignReadBufferSize() on every tape, even non-active > tapes. That's the confusing part. Admittedly that's confusing. Thinking about this some more, I came up with the attached. I removed the separate LogicalTapeAssignReadBufferSize() call altogether - the read buffer size is now passed as argument to the LogicalTapeRewindForRead() call. I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally challenging, because when reading a call site, you have to remember which way round the boolean is. And now you also pass the extra buffer_size argument when rewinding for reading, but not when writing. I gave up on the logic to calculate the quotient and reminder of the available memory, and assigning one extra buffer to some of the tapes. I'm now just rounding it down to the nearest BLCKSZ boundary. That simplifies the code further, although we're now not using all the memory we could. I'm pretty sure that's OK, although I haven't done any performance testing of this. - Heikki
Вложения
On Mon, Oct 10, 2016 at 5:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Admittedly that's confusing. Thinking about this some more, I came up with > the attached. I removed the separate LogicalTapeAssignReadBufferSize() call > altogether - the read buffer size is now passed as argument to the > LogicalTapeRewindForRead() call. > > I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and > LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally > challenging, because when reading a call site, you have to remember which > way round the boolean is. And now you also pass the extra buffer_size > argument when rewinding for reading, but not when writing. I like the general idea here. > I gave up on the logic to calculate the quotient and reminder of the > available memory, and assigning one extra buffer to some of the tapes. I'm > now just rounding it down to the nearest BLCKSZ boundary. That simplifies > the code further, although we're now not using all the memory we could. I'm > pretty sure that's OK, although I haven't done any performance testing of > this. The only thing I notice is that this new struct field could use a comment: > --- a/src/backend/utils/sort/tuplesort.c > +++ b/src/backend/utils/sort/tuplesort.c > @@ -366,6 +366,8 @@ struct Tuplesortstate > char *slabMemoryEnd; /* end of slab memory arena */ > SlabSlot *slabFreeHead; /* head of free list */ > > + size_t read_buffer_size; > + Also, something about trace_sort here: > +#ifdef TRACE_SORT > + if (trace_sort) > + elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes", > + (state->availMem) / 1024, numInputTapes); > +#endif > + > + state->read_buffer_size = state->availMem / numInputTapes; > + USEMEM(state, state->availMem); Maybe you should just make the trace_sort output talk about blocks at this point? -- Peter Geoghegan
On 10/11/2016 12:56 AM, Peter Geoghegan wrote: > Also, something about trace_sort here: > >> +#ifdef TRACE_SORT >> + if (trace_sort) >> + elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes", >> + (state->availMem) / 1024, numInputTapes); >> +#endif >> + >> + state->read_buffer_size = state->availMem / numInputTapes; >> + USEMEM(state, state->availMem); > > Maybe you should just make the trace_sort output talk about blocks at > this point? With # of blocks, you then have to mentally divide by 8 to get the actual memory used. I think kB is nicer in messages that are meant to be read by humans. The bigger problem with this message is that it's not very accurate anymore. The actual amount of memory used is rounded down, and capped by MaxAllocSize*numInputTapes. And would it be better to print the per-tape buffer size, rather than the total? - Heikki