Discussion: Sort memory not being released
Wasn't sure if this should go to hackers or not, so here it is... It appears that after a pgsql backend uses memory for sort, it doesn't get released again, at least on solaris. IE:

 17660 jnasby  1  0    0  430M  423M cpu3   82:49  0.00% postgres
 17662 jnasby  1 58    0   94M   81M sleep   6:29  0.00% postgres
 17678 jnasby  1 58    0   80M   72M sleep   1:06  0.00% postgres
 17650 jnasby  1 58    0   46M 2112K sleep   0:00  0.00% postgres

In this case, I have shared buffers set low and sort mem set really high (5000 and 350,000) to try and avoid sorts going to disk. Even if my settings are bad for whatever reason, shouldn't sort memory be released after it's been used? Before I had sort set to 20,000 and I still saw pgsql processes using more than 50M even when idle.

Of course this isn't an issue if you disconnect frequently, but it really would hurt if you were using connection pooling.

--
Jim C. Nasby (aka Decibel!)  jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > It appears that after a pgsql backend uses memory for sort, it doesn't > get released again, at least on solaris. IE: That's true on many Unixen --- malloc'd space is never released back to the OS until the process dies. Not much we can do about it. If you're concerned, start a fresh session, or use another Unix (or at least a better malloc package). regards, tom lane
On Mon, Jun 16, 2003 at 10:13:02AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > It appears that after a pgsql backend uses memory for sort, it doesn't
> > get released again, at least on solaris. IE:
>
> That's true on many Unixen --- malloc'd space is never released back to
> the OS until the process dies. Not much we can do about it. If you're
> concerned, start a fresh session, or use another Unix (or at least a
> better malloc package).

Holy ugly batman...

This is on solaris; is there a different/better malloc I could use? Also, would doing the sort in a separate process solve the problem? IE: doing a 'blocking fork()' on entry to the sort routine? (Sorry, I'm not a C coder...)

--
Jim C. Nasby (aka Decibel!)  jim@nasby.net
> > > It appears that after a pgsql backend uses memory for sort, it doesn't
> > > get released again, at least on solaris. IE:
> >
> > That's true on many Unixen --- malloc'd space is never released back to
> > the OS until the process dies. Not much we can do about it. If you're
> > concerned, start a fresh session, or use another Unix (or at least a
> > better malloc package).
>
> Holy ugly batman...
>
> This is on solaris; is there a different/better malloc I could use?

See if there's an madvise(2) call on Slowaris, specifically, look for something akin to (taken from FreeBSD):

     MADV_FREE    Gives the VM system the freedom to free pages, and tells
                  the system that information in the specified page range
                  is no longer important.  This is an efficient way of
                  allowing malloc(3) to free pages anywhere in the address
                  space, while keeping the address space valid.  The next
                  time that the page is referenced, the page might be
                  demand zeroed, or might contain the data that was there
                  before the MADV_FREE call.  References made to that
                  address space range will not make the VM system page the
                  information back in from backing store until the page is
                  modified again.

That'll allow data that's malloc'ed to be reused by the OS. If there is such a call, it might be prudent to stick one in the sort code just before or after the relevant free() call.

-sc

--
Sean Chittenden
On Mon, Jun 16, 2003 at 02:58:55PM -0700, Sean Chittenden wrote:
> See if there's an madvise(2) call on Slowaris, specifically, look for
> something akin to (taken from FreeBSD):
>
>      MADV_FREE    Gives the VM system the freedom to free pages, [snip]
>
> That'll allow data that's malloc'ed to be reused by the OS. If there
> is such a call, it might be prudent to stick one in the sort code just
> before or after the relevant free() call. -sc

Seems not worth it. That just means you have to remember to re-madvise() it when you allocate it again. The memory will be used by the OS again (after swapping it out, I guess).

For large allocations glibc tends to mmap(), which does get unmapped. There's a threshold of 4KB I think. Of course, thousands of allocations for a few bytes will never trigger it.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organized violence.
> Westerners often forget this fact, non-Westerners never do."
> - Samuel P. Huntington
Martijn van Oosterhout <kleptog@svana.org> writes:
> For large allocations glibc tends to mmap() which does get unmapped.
> There's a threshold of 4KB I think. Of course, thousands of allocations
> for a few bytes will never trigger it.

But essentially all our allocation traffic goes through palloc, which bunches small allocations together. In typical scenarios malloc will only see requests of 8K or more, so we should be in good shape on this front.

(Not that this is very relevant to Jim's problem, since he's not using glibc...)

			regards, tom lane
On Tue, Jun 17, 2003 at 10:45:39AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > For large allocations glibc tends to mmap() which does get unmapped.
> > There's a threshold of 4KB I think. Of course, thousands of allocations
> > for a few bytes will never trigger it.
>
> But essentially all our allocation traffic goes through palloc, which
> bunches small allocations together. In typical scenarios malloc will
> only see requests of 8K or more, so we should be in good shape on this
> front.

Ah, bad news. The threshold appears to be closer to 64-128KB, so for small allocations normal brk() calls will be made until the third or fourth expansion. This can be tuned (mallopt()) but that's probably not too good an idea.

On the other hand, there is a function malloc_trim():

    /* Release all but __pad bytes of freed top-most memory back to
       the system.  Return 1 if successful, else 0. */
    extern int malloc_trim __MALLOC_P ((size_t __pad));

Not entirely sure if it will help at all. Obviously memory fragmentation is your enemy here.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Jun 17, 2003 at 10:45:39AM -0400, Tom Lane wrote:
>> But essentially all our allocation traffic goes through palloc, which
>> bunches small allocations together. In typical scenarios malloc will
>> only see requests of 8K or more, so we should be in good shape on this
>> front.

> Ah, bad news. The threshold appears to be closer to 64-128KB, so for small
> allocations normal brk() calls will be made until the third or fourth
> expansion.

That's probably good, actually. I'd imagine that mmap'ing for every 8K would be a bad idea ... until a context gets up to a few hundred K, you shouldn't get too worried about whether you can eventually give it back to the OS.

> Obviously memory fragmentation is your enemy here.

True. I think the memory-context structure helps on that, but it cannot solve it completely. (AFAIK, no one has yet done any studies to see what sorts of memory fragmentation issues may exist in a backend that's been running for a long while. It'd be an interesting little project if anyone wants to take it up.)

			regards, tom lane
> > > For large allocations glibc tends to mmap() which does get
> > > unmapped. There's a threshold of 4KB I think. Of course,
> > > thousands of allocations for a few bytes will never trigger it.
> >
> > But essentially all our allocation traffic goes through palloc,
> > which bunches small allocations together. In typical scenarios
> > malloc will only see requests of 8K or more, so we should be in
> > good shape on this front.
>
> Ah, bad news. The threshold appears to be closer to 64-128KB, so for
> small allocations normal brk() calls will be made until the third or
> fourth expansion. This can be tuned (mallopt()) but that's probably
> not too good an idea.
[snip]
> Not entirely sure if it will help at all. Obviously memory
> fragmentation is your enemy here.

Depending on data use constraints and the malloc() routine in use (this works with phk malloc() on FreeBSD; don't know about glibc or Slowaris' routines), there's a cute trick you can do to help with this scenario, so that the large malloc()'ed region is at the end of the data segment and therefore the process can be sbrk()'ed and shrink when free() is called on the large allocated region:

*) malloc() the memory used in normal operations
*) malloc() the memory needed for sorting
*) free() the memory used in normal operations
*) Do whatever needs to be done with the region of memory allocated for sorting
*) free() the memory used for sorting

Because phk malloc() works through chains and regions, if the 1st malloc is big enough to handle all malloc() requests during the sort operations, the process's memory will remain reasonably unfragmented, as the chain originally malloc()'ed for normal operations will be split up and used to handle the requests during the sort operations. Once the sort ops are done and the sort mem is free()'ed, phk malloc will sbrk(-1 * sort_mem) the process (shrinks the process space/releases the top end of the data segment back to the OS).

If the malloc order happens the other way: malloc() sort mem, malloc() normal ops, you're hosed, because the normal-ops mem region is at the top of the address space and potentially persists longer than the sort region (likely what's happening now); the proc can't be sbrk()'ed, and the process remains huge until the proc dies, or until the regions at the top of the data segment are free()'ed, collapsed into free contiguous regions at the top of BSS, and then sbrk()'ed.

For long running servers and processes that grow quite large when processing something, but that you'd like to have a small footprint when not processing data, this is what I have to do as a chump defrag routine. Works well for platforms that have a halfway decent malloc(). Another option is to mmap() private anonymous regions, though I haven't done this for anything huge yet, as someone reported being able to mmap() less than they were able to malloc()... something I need to test.

Anyway, food for thought.

-sc

--
Sean Chittenden
On Tue, 17 Jun 2003, Sean Chittenden wrote:
> Depending on data use constraints and the malloc() routine in use
> (this works with phk malloc() on FreeBSD, don't know about glibc or
> Slowaris' routines) there's a cute trick that you can do help with
> this scenario so that a large malloc()'ed region is at the end of the
> data segment and therefore a process can be sbrk()'ed and shrink when
> free() is called on the large allocated region.

Glibc allows malloc parameters to be tuned through environment variables. Linux Journal had an article about tuning malloc in May's issue. The article is available online at http://www.linuxjournal.com/article.php?sid=6390

--
Antti Haapala
On Tue, Jun 17, 2003 at 10:45:39AM -0400, Tom Lane wrote:
> But essentially all our allocation traffic goes through palloc, which
> bunches small allocations together. In typical scenarios malloc will
> only see requests of 8K or more, so we should be in good shape on this
> front.
>
> (Not that this is very relevant to Jim's problem, since he's not using
> glibc...)

Maybe it would be helpful to describe why I noticed this... I've been doing some things that require very large sorts. I generally have very few connections though, so I thought I'd set sort_mem to about 1/3 of my memory. My thought was that it's better to suck down a ton of memory and blow out the disk cache if it means we can avoid hitting the disk for a sort at all. Of course I wasn't planning on sucking down a bunch of memory and holding on to it. :)

I've read through the sort code, and it seems that the pre-buffering once you go to disk will probably hurt with a huge sort_mem setting, since the data could be double or even triple buffered (in memtuples[], in pgsql's shared buffers, and by the OS). I think that a more ideal scenario (which I've been meaning to email hackers about) would be something like this:

If the OS is running low on free physical memory, a sort will use less than sort_mem, as an attempt to avoid swapping. sort_mem is the maximum amount of sort memory a single sort (or maybe a single connection) can take.

If sort_mem is over X size, then use only Y for pre-buffering. (How much does a large sort_mem help if you have to spill to disk?)

If it's pretty clear that the sort won't fit in memory (due to sort_mem or system free memory being low), I think it might help if tuplesort just went to disk right away, instead of waiting until all the memory was used up, but again, I'm not sure how the sort algorithm works when it goes to tape.

This should mean that you can set the system up to allow very large sorts before spilling to disk... if there's not a lot of sorts sucking down memory, a large sort will be able to avoid overflowing to disk, which is obviously a huge performance gain. If the system is busy/memory bound though, sorts will overflow to disk, rather than using swap space, which I'm sure would be a lot worse.

--
Jim C. Nasby (aka Decibel!)  jim@nasby.net
"Jim C. Nasby" <jim@nasby.net> writes:
> Of course I wasn't planning on sucking down a bunch of memory and
> holding on to it. :)

Sure. But when you're done with the big sort, just start a fresh session. I don't see that this is worth agonizing over.

> If sort_mem is over X size, then use only Y for pre-buffering (How much
> does a large sort_mem help if you have to spill to disk?)

It still helps quite a lot, because the average initial run length is (if I recall Knuth correctly) twice the working buffer size. I can't see a reason for cutting back usage once you've been forced to start spilling.

The bigger problem with your discussion is the assumption that we can find out "if the OS is running low on free physical memory". That seems (a) unportable and (b) a moving target.

			regards, tom lane
On Tue, Jun 17, 2003 at 05:38:36PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Of course I wasn't planning on sucking down a bunch of memory and
> > holding on to it. :)
>
> Sure. But when you're done with the big sort, just start a fresh
> session. I don't see that this is worth agonizing over.

In this case I could do that, but that's not always possible. It would certainly wreak havoc with connection pooling, for example.

> > If sort_mem is over X size, then use only Y for pre-buffering (How much
> > does a large sort_mem help if you have to spill to disk?)
>
> It still helps quite a lot, because the average initial run length is
> (if I recall Knuth correctly) twice the working buffer size. I can't
> see a reason for cutting back usage once you've been forced to start
> spilling.

Only because of double/triple buffering. If having the memory around helps the algorithm then it should be used, at least up to the point of diminishing returns.

> The bigger problem with your discussion is the assumption that we can
> find out "if the OS is running low on free physical memory". That seems
> (a) unportable and (b) a moving target.

Well, there's other ways to do what I'm thinking of that don't rely on getting a free memory number from the OS. For example, there could be a 'total_sort_mem' parameter that specifies the total amount of memory that can be used for all sorts on the entire machine.

--
Jim C. Nasby (aka Decibel!)  jim@nasby.net
"Jim C. Nasby" <jim@nasby.net> writes:
> Well, there's other ways to do what I'm thinking of that don't rely on
> getting a free memory number from the OS. For example, there could be a
> 'total_sort_mem' parameter that specifies the total amount of memory
> that can be used for all sorts on the entire machine.

How would you find out how many other sorts are going on (and how much memory they're actually using)? And probably more to the point, what do you do if you want to sort and the parameter's already exhausted?

			regards, tom lane
On Tue, Jun 17, 2003 at 10:02:07AM -0700, Sean Chittenden wrote:
> For long running servers and processes that grow quite large when
> processing something, but you'd like to have a small foot print when
> not processing data, this is what I have to do as a chump defrag
> routine. Works well for platforms that have a halfway decent
> malloc(). Another option is to mmap() private anonymous regions,
> though I haven't done this for anything huge yet as someone reported
> being able to mmap() less than they were able to malloc()... something
> I need to test.

Look at the process memory layout. On Linux, stack+heap is limited to 2GB. To access the rest you need to mmap(). This would vary depending on the OS.

IMHO glibc's approach (use mmap() for large allocations) is a good one, since the sort memory will generally be mmap()ed (at least it was in my quick test last night).

As Tom pointed out, some study into memory fragmentation would be useful.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
On Tue, Jun 17, 2003 at 08:08:29PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Well, there's other ways to do what I'm thinking of that don't rely on
> > getting a free memory number from the OS. For example, there could be a
> > 'total_sort_mem' parameter that specifies the total amount of memory
> > that can be used for all sorts on the entire machine.
>
> How would you find out how many other sorts are going on (and how much
> memory they're actually using)? And probably more to the point, what do
> you do if you want to sort and the parameter's already exhausted?

There'd need to be a list of sorts kept in the backend, presumably in shared memory. When a sort started, it would add an entry to the list specifying how much memory it intended to use. Once it was done, it could update with how much was actually used. Actually, I guess a simple counter would suffice instead of a list.

As for when memory runs out, there's two things you can do. Obviously, you can just sleep until more memory becomes available (presumably the lock on the shared list/counter would prevent more than one backend from starting a sort at a time). A more elegant solution would be to start decreasing how much memory a sort will use as the limit is approached. A possible algorithm would be:

    IF total_sort_mem - active_sort_mem < desired_sort_mem THEN
        desired_sort_mem = (total_sort_mem - active_sort_mem) / 2

So if you can't get all the memory you'd like, take half of whatever's available. Obviously there would have to be a limit to this... you can't sort in 100 bytes. If (total_sort_mem - active_sort_mem) drops below a certain threshold, you would either ignore it and use some small amount of memory to do the sort, or you'd sleep until memory became available.

I know this might sound like a lot of added complexity, but if it means you have a much better chance of being able to perform large sorts in-memory instead of on-disk, I think it's well worth it.

--
Jim C. Nasby (aka Decibel!)  jim@nasby.net
In article <20030617212500.GO40542@flake.decibel.org>, Jim C. Nasby <jim@nasby.net> wrote:
> Of course I wasn't planning on sucking down a bunch of memory and
> holding on to it. :)

What are you worried about? The unused portions will eventually be paged out to disk. On the next sort, you'll spend a little less time allocating the memory (saving time) and a little more time paging the disk in (taking time). Probably, all in all, you'll end up breaking even.

Just because your process has access to a lot of memory doesn't mean that it's all in physical memory at once. Unless your system ran out of physical memory and/or swap, there shouldn't be an issue. It may well be that when you up the sort memory, you may also have to up swap space. No big deal.

mrc

--
Mike Castle  dalgoda@ix.netcom.com  www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc