Re: Parallel pg_dump for 9.1
From | Robert Haas |
---|---|
Subject | Re: Parallel pg_dump for 9.1 |
Date | |
Msg-id | 603c8f071003291036vf85b254n236143f6c3af257c@mail.gmail.com |
In response to | Re: Parallel pg_dump for 9.1 (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>) |
List | pgsql-hackers |
On Mon, Mar 29, 2010 at 1:16 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> Robert Haas wrote:
>> On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
>
> [...]
>
>>> - Regarding the output of pg_dump I am proposing two solutions. The
>>> first one is to introduce a new archive type "directory" where each
>>> table and each blob is a file in a directory, similar to the
>>> experimental "files" archive type. Also the idea has come up that you
>>> should be able to specify multiple directories in order to make use of
>>> several physical disk drives. Thinking this further, in order to
>>> manage all the mess that you can create with this, every file of the
>>> same backup needs to have a unique identifier and pg_restore should
>>> have a check parameter that tells you if your backup directory is in a
>>> sane and complete state (think about moving a file from one backup
>>> directory to another one or trying to restore from two directories
>>> which are from different backup sets...).
>>
>> I think that specifying several directories is a piece of complexity
>> that would be best left alone for a first version of this. But a
>> single directory with multiple files sounds pretty reasonable. Of
>> course we'll also need to support that format in non-parallel mode,
>> and in pg_restore.
>>
>>> The second solution to the single-file-problem is to generate no
>>> output at all, i.e. whatever you export from your source database you
>>> import directly into your target database, which in the end turns out
>>> to be a parallel form of "pg_dump | psql".
>>
>> This is a very interesting idea but you might want to get the other
>> thing merged first, as it's going to present a different set of
>> issues.
>
> I had some prior discussion with joachim (and I suspect I had some
> influence in him trying to implement that) on that.
> The reason why this is really needed is that the current pg_restore -j is
> actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that are
> basically "duplicate this database to that location" (or any migration
> really).
>
> The example at hand is a 240GB production database with around 850
> tables; it takes ~145min to dump that database single-threaded
> (completely CPU bound), simply loading the SQL using psql can restore it
> in ~150min (again CPU bound, both for COPY and index creation), and -j8
> brings that down to ~55min.
>
> So if you do the math (and a bit of handwaving):
>
> * using pg_dump | psql you get greatest(145,150) -> 150min.
> * using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
> * using a theoretical parallel pg_dump and the existing parallel restore
> you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
> * a parallel dump & restore that can pipeline would end up at
> greatest(50,55) -> 55min
>
> So a parallel dump alone would only give you a 50% speedup in total time
> for doing a migration/upgrade/dump-to-devbox, despite the fact that it
> uses 8x the resources. A pipelined solution would result in a ~3x speedup
> in total time, and you don't even have to think about stuff that might be
> a problem, like having available disk space on the source/destination to
> hold a full temporary dump (if you don't, you might even have to add some
> transfer time as well).

It's a great idea - but there are two features here. I've seen many
patches implementing two features during my relatively short time with
the project, and if the rejection rate hasn't been 100% it's certainly
been close. If Joachim thinks he's got it all working, by all means
submit both patches. One can apply over the other if they are
interdependent. But I STRONGLY suggest separating this into two pieces
- it is MUCH easier to get things applied that way, for good and valid
reasons.

...Robert
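[Archive note: the "directory" archive type proposed in this thread can be sketched as a pair of commands. The -Fd and -j flags shown here are how this functionality eventually shipped in later PostgreSQL releases; at the time of this thread they were only a proposal, so treat them as illustrative, not as options available in 9.1.]

```shell
# Sketch of the proposed "directory" archive type: one file per
# table/blob in a directory, dumped and restored with parallel workers.
# Flag names (-Fd, -j) reflect later releases, not this proposal's 9.1 target.
pg_dump -Fd -j 8 -f /backups/mydb.dir mydb
pg_restore -j 8 -d mydb_copy /backups/mydb.dir
```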
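[Archive note: Stefan's handwaving math above can be written out explicitly. A minimal sketch in Python, using his figures (the 50-minute parallel-dump number is his guess, as he notes):]

```python
# Timings (minutes) from the example above: 240GB database, ~145min
# single-threaded dump, ~150min psql load, ~55min pg_restore -j8,
# and a guessed ~50min for a hypothetical parallel pg_dump.
dump, load, restore_j8, par_dump = 145, 150, 55, 50

scenarios = {
    # pg_dump | psql: dump and load overlap, so the slower side dominates
    "pipe (pg_dump | psql)":           max(dump, load),
    # dump to an archive first, then parallel restore: phases add up
    "pg_dump -Z0 -Fc + pg_restore -j8": dump + restore_j8,
    # parallel dump, then parallel restore: still two sequential phases
    "parallel dump + pg_restore -j8":   par_dump + restore_j8,
    # pipelined parallel dump/restore: phases overlap again
    "pipelined parallel dump/restore":  max(par_dump, restore_j8),
}

for name, minutes in scenarios.items():
    print(f"{name}: {minutes} min")
# -> 150, 200, 105, and 55 minutes respectively
```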