Discussion: pg_dump: large object behavior issues (possible bug)
Hello,

I have been working a problem with Andrew Gierth in regards to pg_dump. Here is the basic breakdown:

FreeBSD 10.1
PostgreSQL 9.3.6
~64GB memory
500GB database
228GB of large objects (106M objects)

The database dumps fine as long as we don't dump large objects. However, if we try to dump the large objects, FreeBSD will kill pg_dump as it will consume all free memory and swap. With Andrew's help we were able to determine the following:

There is a memory cost of about 160 bytes per large object. Based on the number of large objects we have, that would be about 16GB of memory. Also, when pg_dump is reading in the large object list, there is a point where pg_dump has a PGresult containing the entire contents of pg_largeobject_metadata plus a malloc'd array into which it is going to copy that data. That could easily get above 40GB, thus causing FreeBSD to kill the process.

tl;dr: The memory issue comes down to the fact that in the prep stage, pg_dump creates a TOC entry for every individual large object. It seems that pg_dump should be much more efficient about dumping these objects.

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
>>>>> "Joshua" == Joshua D Drake <jd@commandprompt.com> writes:

 Joshua> The database dumps fine as long as we don't dump large
 Joshua> objects. However, if we try to dump the large objects, FreeBSD
 Joshua> will kill pg_dump as it will consume all free memory and
 Joshua> swap. With Andrew's help we were able to determine the
 Joshua> following:

 Joshua> There is a memory cost of about 160 bytes per large object.

I may have the exact number here wrong; it was just a quick eyeball of the data structures (and depends on malloc overheads anyway).

The relevant code is getBlobs in pg_dump.c, which queries the whole of pg_largeobject_metadata without using a cursor (so the PGresult is already huge thanks to having >100 million rows), and then mallocs a BlobInfo array and populates it from the PGresult, also using pg_strdup for the oid string, owner name, and ACL if any.

--
Andrew (irc:RhodiumToad)
On 04/23/2015 04:04 PM, Andrew Gierth wrote:
>>>>>> "Joshua" == Joshua D Drake <jd@commandprompt.com> writes:
> Joshua> The database dumps fine as long as we don't dump large
> Joshua> objects. However, if we try to dump the large objects, FreeBSD
> Joshua> will kill pg_dump as it will consume all free memory and
> Joshua> swap. With Andrew's help we were able to determine the
> Joshua> following:
>
> Joshua> There is a memory cost of about 160 bytes per large object.
>
> I may have the exact number here wrong, it was just a quick eyeball of
> the data structures (and depends on malloc overheads anyway).
>
> The relevant code is getBlobs in pg_dump.c, which queries the whole of
> pg_largeobject_metadata without using a cursor (so the PGresult is
> already huge thanks to having >100 million rows), and then mallocs a
> BlobInfo array and populates it from the PGresult, also using pg_strdup
> for the oid string, owner name, and ACL if any.

I'm surprised this hasn't come up before. I have a client that I persuaded to convert all their LOs to bytea fields because of problems with pg_dump handling millions of LOs, and kept them on an older postgres version until they made that change.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> On 04/23/2015 04:04 PM, Andrew Gierth wrote:
>> The relevant code is getBlobs in pg_dump.c, which queries the whole of
>> pg_largeobject_metadata without using a cursor (so the PGresult is
>> already huge thanks to having >100 million rows), and then mallocs a
>> BlobInfo array and populates it from the PGresult, also using pg_strdup
>> for the oid string, owner name, and ACL if any.

> I'm surprised this hasn't come up before. I have a client that I
> persuaded to convert all their LOs to bytea fields because of problems
> with pg_dump handling millions of LOs, and kept them on an older
> postgres version until they made that change.

Yeah, this was brought up when we added per-large-object metadata; it was obvious that that patch would cause pg_dump to choke on large numbers of large objects. The (perhaps rather lame) argument was that you wouldn't have that many of them.

Given that large objects don't have any individual dependencies, one could envision fixing this by replacing the individual large-object DumpableObjects by a single placeholder to participate in the sort phase, and then when it's time to dump that, scan the large objects using a cursor and create/print/delete the information separately for each one. This would likely involve some rather painful refactoring in pg_dump, however.

			regards, tom lane
On 04/24/2015 03:41 PM, Tom Lane wrote:
> Given that large objects don't have any individual dependencies,
> one could envision fixing this by replacing the individual large-object
> DumpableObjects by a single placeholder to participate in the sort phase,
> and then when it's time to dump that, scan the large objects using a
> cursor and create/print/delete the information separately for each one.
> This would likely involve some rather painful refactoring in pg_dump
> however.

Andrew G. mentioned something about using a cursor for the main query that pulls the info. He said that it wasn't a solution but might be a band-aid (my words). Is that something we may want to look into as a stop-gap?

--
The most kicking donkey PostgreSQL Infrastructure company in existence. The oldest, the most experienced, the consulting company to the stars. Command Prompt, Inc. http://www.commandprompt.com/ +1-503-667-4564 - 24x7 - 365 - Proactive and Managed Professional Services!
On 04/24/2015 06:41 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 04/23/2015 04:04 PM, Andrew Gierth wrote:
>>> The relevant code is getBlobs in pg_dump.c, which queries the whole of
>>> pg_largeobject_metadata without using a cursor (so the PGresult is
>>> already huge thanks to having >100 million rows), and then mallocs a
>>> BlobInfo array and populates it from the PGresult, also using pg_strdup
>>> for the oid string, owner name, and ACL if any.
>> I'm surprised this hasn't come up before. I have a client that I
>> persuaded to convert all their LOs to bytea fields because of problems
>> with pg_dump handling millions of LOs, and kept them on an older
>> postgres version until they made that change.
> Yeah, this was brought up when we added per-large-object metadata; it was
> obvious that that patch would cause pg_dump to choke on large numbers of
> large objects. The (perhaps rather lame) argument was that you wouldn't
> have that many of them.
>
> Given that large objects don't have any individual dependencies,
> one could envision fixing this by replacing the individual large-object
> DumpableObjects by a single placeholder to participate in the sort phase,
> and then when it's time to dump that, scan the large objects using a
> cursor and create/print/delete the information separately for each one.
> This would likely involve some rather painful refactoring in pg_dump
> however.

I think we need to think about this some more. TBH, I'm not convinced that the changes made back in 9.0 were well conceived. Having separate TOC entries for each LO seems wrong in principle, although I understand why it was done.

For now, my advice would be to avoid use of pg_dump/pg_restore if you have large numbers of LOs. The good news is that these days there are alternative methods of doing backup/restore, albeit not 100% equivalent to pg_dump/pg_restore.

One useful thing might be to provide pg_dump with --no-blobs/--blobs-only switches so you could at least easily segregate the blobs into their own dump file. That would be in addition to dealing with the memory problems pg_dump has with millions of LOs, of course.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> On 04/24/2015 06:41 PM, Tom Lane wrote:
>> Yeah, this was brought up when we added per-large-object metadata; it was
>> obvious that that patch would cause pg_dump to choke on large numbers of
>> large objects. The (perhaps rather lame) argument was that you wouldn't
>> have that many of them.
>> Given that large objects don't have any individual dependencies,
>> one could envision fixing this by replacing the individual large-object
>> DumpableObjects by a single placeholder to participate in the sort phase,
>> and then when it's time to dump that, scan the large objects using a
>> cursor and create/print/delete the information separately for each one.
>> This would likely involve some rather painful refactoring in pg_dump
>> however.

> I think we need to think about this some more, TBH, I'm not convinced
> that the changes made back in 9.0 were well conceived. Having separate
> TOC entries for each LO seems wrong in principle, although I understand
> why it was done.

Perhaps. One advantage of doing it this way is that you can get pg_restore to extract a single LO from an archive file; though it's debatable whether that's worth the potential resource-consumption hazards.

Another issue is that restore options such as --no-owner and --no-privileges would not work for LOs (at least not without messy hacks) if we go back to a scheme where all the LO information is just SQL commands inside a single TOC object.

After further thought I realized that if we simply hack pg_dump to emit the LOs in a streaming fashion, but keep the archive-file representation the same as it is now, then we haven't really fixed the problem, because pg_restore is still likely to choke when it tries to read the archive's TOC. So my proposal above isn't enough either.

Perhaps what we need is some sort of "second-level TOC" which is only ever processed in a streaming fashion, by both pg_dump and pg_restore. This would not support dependency resolution or re-ordering, but we don't need those abilities for LOs.

			regards, tom lane
On 04/25/2015 12:32 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 04/24/2015 06:41 PM, Tom Lane wrote:
>>> Yeah, this was brought up when we added per-large-object metadata; it was
>>> obvious that that patch would cause pg_dump to choke on large numbers of
>>> large objects. The (perhaps rather lame) argument was that you wouldn't
>>> have that many of them.
>>> Given that large objects don't have any individual dependencies,
>>> one could envision fixing this by replacing the individual large-object
>>> DumpableObjects by a single placeholder to participate in the sort phase,
>>> and then when it's time to dump that, scan the large objects using a
>>> cursor and create/print/delete the information separately for each one.
>>> This would likely involve some rather painful refactoring in pg_dump
>>> however.
>> I think we need to think about this some more, TBH, I'm not convinced
>> that the changes made back in 9.0 were well conceived. Having separate
>> TOC entries for each LO seems wrong in principle, although I understand
>> why it was done.
> Perhaps. One advantage of doing it this way is that you can get
> pg_restore to extract a single LO from an archive file; though it's
> debatable whether that's worth the potential resource-consumption hazards.

In my view it isn't worth it.

> Another issue is that restore options such as --no-owner and
> --no-privileges would not work for LOs (at least not without messy hacks)
> if we go back to a scheme where all the LO information is just SQL
> commands inside a single TOC object.
>
> After further thought I realized that if we simply hack pg_dump to emit
> the LOs in a streaming fashion, but keep the archive-file representation
> the same as it is now, then we haven't really fixed the problem because
> pg_restore is still likely to choke when it tries to read the archive's
> TOC. So my proposal above isn't enough either.

Yep, that's certainly true.

> Perhaps what we need is some sort of "second-level TOC" which is only ever
> processed in a streaming fashion, by both pg_dump and pg_restore. This
> would not support dependency resolution or re-ordering, but we don't need
> those abilities for LOs.

+1. I had a similar thought, half-formed, but you've expressed it better than I could have.

cheers

andrew