Discussion: bulk inserts
Hi, I'm fairly new to postgres and am having trouble finding what I'm looking for. Is there a feature that allows bulk inserts into tables? My setup is Win XPe 2002 SP3 and PostgreSQL 8.3. I need to add entries from a file where each file contains 250 - 500 records. The files are created by a third-party and read/written as binary. The table looks like the following:

CREATE TABLE data_log_20msec_table
(
log_id bigserial NOT NULL,
timestamp_dbl double precision,
data bytea,
CONSTRAINT data_log_20msec_table_pkey PRIMARY KEY (log_id)
)
WITH (OIDS=FALSE);
ALTER TABLE data_log_20msec_table OWNER TO postgres;

CREATE INDEX ...
ON data_log_20msec_table
USING btree
(timestamp_dbl);

The current method for transferring records from the file to postgres is using a prepared statement that is called iteratively on each record read from the file:

INSERT INTO data_log_20msec_table (timestamp_dbl,data) VALUES ($1::double precision,$2::bytea)

Using COPY is out of the question as the file is not formatted for that and since other operations need to occur, the file needs to be read sequentially anyway.

Dave
On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> Hi, I'm fairly new to postgres and am having trouble finding what I'm
> looking for. Is there a feature that allows bulk inserts into tables?
> My setup is Win XPe 2002 SP3 and PostgreSQL 8.3. I need to add
> entries from a file where each file contains 250 - 500 records. The
> files are created by a third-party and read/written as binary. The
> table looks like the following:

<snip>

> The current method for transferring records from the file to postgres
> is using a prepared statement that is called iteratively on each
> record read from the file:
>
> INSERT INTO data_log_20msec_table (timestamp_dbl,data) VALUES ($1::double precision,$2::bytea)
>
> Using COPY is out of the question as the file is not formatted for
> that and since other operations need to occur, the file needs to be
> read sequentially anyway.

The usual approach is to use COPY FROM STDIN, then using PQputCopyData (or whatever it's called). That way you can perform any necessary munging and don't require the file to be on disk at all.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
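A minimal sketch of the approach Martijn describes, using libpq's COPY entry points against the table from the original post. This assumes the text COPY format and omits most error handling; the sample row is made up:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Stream rows to the server with COPY ... FROM STDIN instead of
   issuing one INSERT per record. */
int copy_records(PGconn *conn)
{
    PGresult *res = PQexec(conn,
        "COPY data_log_20msec_table (timestamp_dbl, data) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN) {
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    /* One line per row, columns separated by tabs (the default text
       format); this is where the per-record munging of the binary
       file would happen. The row below is a made-up example. */
    const char *row = "1254152285.02\t\\\\000\\\\001\\\\002\n";
    if (PQputCopyData(conn, row, (int) strlen(row)) != 1)
        return -1;

    if (PQputCopyEnd(conn, NULL) != 1)  /* NULL = no error, end the COPY */
        return -1;

    res = PQgetResult(conn);            /* fetch the COPY's final status */
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok ? 0 : -1;
}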
On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> Using COPY is out of the question as the file is not formatted for
> that and since other operations need to occur, the file needs to be
> read sequentially anyway.

Just to expand on what Martin said; if you can generate a set of EXECUTE commands, you can certainly generate a COPY command to insert the same data. The advantage is a large drop in parse time for inserting larger numbers of rows. As you're saying you want to insert 500 rows, I'd suggest at least trying to get COPY working.

--
Sam http://samason.me.uk/
Thanks, Sam and Martijn. I am attempting to use the COPY command now. I had misunderstood what was meant by STDIN and assumed I could only use a file for my application and wasn't aware of PQputCopyData(). One assumption I am operating under right now is that the format of the binary file is the same as the buffer in PQputCopyData, including the header. If I am wrong, could someone please let me know?

Thanks,
Dave
On Mon, Sep 28, 2009 at 04:35:53PM -0500, Dave Huber wrote:
> One assumption I am operating under right now is
> that the format of the binary file is the same as the buffer in
> PQputCopyData, including the header. If I am wrong, could someone
> please let me know? Thanks,

I've always used ASCII representations of the data; no need to worry about formats then. Not sure what sort of performance considerations you have, but it's certainly less of a worry for me. I'm reasonably sure the binary format changes more often than you'd like, maybe something like:

http://libpqtypes.esilo.com/

would help. I can't see anything about COPY support, but it should help with other things.

--
Sam http://samason.me.uk/
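If you go with the ASCII representation Sam suggests, the only fiddly column in this table is the bytea one. Here is a sketch of one safe (if verbose) way to render it, using the octal-escape form that bytea input accepts; the function name is made up and the caller must supply an output buffer of 5*n+1 bytes:

#include <stdio.h>
#include <stddef.h>

/* Render n raw bytes as a bytea field for COPY's text format.
   Every byte becomes \\nnn: the octal escape bytea understands,
   with the backslash doubled so COPY passes it through intact.
   Five output chars per input byte, so out needs 5*n+1 bytes. */
void bytea_to_copy_text(const unsigned char *in, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++) {
        sprintf(out, "\\\\%03o", in[i]);
        out += 5;
    }
    *out = '\0';
}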
INSERTs/UPDATEs are historically slow, especially with autocommit on (implied autocommit on): the database writer actually stops any processing and applies that one record to the database.

Most bulk operations such as import/export and copy are well worth their weight as they apply en masse before any commit. Remember the DB actually stops, flushes its buffers to disk, and then resumes. The only solution here is to disable autocommit, but be wary: you may have 100's of statements waiting to be committed, and then someone does a quit on your session and all your work is lost.
Good call on COPY:
http://www.postgresql.org/docs/8.1/interactive/populate.html
cheers,
Martin Gainty
> Date: Mon, 28 Sep 2009 21:52:36 +0100
> From: sam@samason.me.uk
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] bulk inserts
>
> On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> > Using COPY is out of the question as the file is not formatted for
> > that and since other operations need to occur, the file needs to be
> > read sequentially anyway.
>
> Just to expand on what Martin said; if you can generate a set of EXECUTE
> commands, you can certainly generate a COPY command to insert the same
> data. The advantage is a large drop in parse time for inserting larger
> numbers of rows. As you're saying you want to insert 500 rows, I'd
> suggest at least trying to get COPY working.
>
> --
> Sam http://samason.me.uk/
On Mon, Sep 28, 2009 at 08:33:45PM -0400, Martin Gainty wrote:
> INSERTS/UPDATES are historically slow especially with autocommit is
> on (implied autocommit on) the Database writer actually stops any
> processing and applies that one record to the database

That seems to be overstating the issue somewhat. Each connection is still independent and (assuming no locks are taken) will carry on as such.

> Most bulk operations such as import/export and copy are well worth
> their weight as they apply en-masse before any commit ..

?? I'm not sure what you're implying about the semantics here, but it doesn't seem right. COPY doesn't somehow break out of ACID semantics, it's only an *optimization* that allows you to get large quantities of data into the database faster. The main reason it's faster is because parsing CSV data is easier than parsing SQL.

At least I think that's the only difference; anybody know better?

> remember the DB actually stops flushes its buffers to Disk and then
> resumes..

The DB as a whole does not stop if you issue a commit; just your session/connection.

> the only solution here is to disable autocommit but be wary
> you may have 100's of statements waiting to be commited and then
> someone does a quit on your session..all your work is lost

I'm not sure what you're saying here. These are normal transactional semantics and are what all ACID databases are specified to do. You need to issue a "COMMIT" for data to be committed.

--
Sam http://samason.me.uk/
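To make that concrete: if COPY really were off the table, the usual middle ground is to batch the prepared INSERTs inside one explicit transaction, so the commit is paid once per file rather than once per row. A rough sketch, where the statement name "ins_log" (for the prepared INSERT from the original post) and the pre-rendered parameter arrays are assumptions:

#include <libpq-fe.h>

/* Batch n rows into a single transaction; assumes the INSERT from the
   original post was prepared earlier under the name "ins_log". */
int insert_batch(PGconn *conn, const char *const *ts,
                 const char *const *data, int n)
{
    PQclear(PQexec(conn, "BEGIN"));
    for (int i = 0; i < n; i++) {
        const char *values[2] = { ts[i], data[i] };
        PGresult *r = PQexecPrepared(conn, "ins_log", 2, values,
                                     NULL, NULL, 0);
        if (PQresultStatus(r) != PGRES_COMMAND_OK) {
            PQclear(r);
            PQclear(PQexec(conn, "ROLLBACK"));  /* drop the partial batch */
            return -1;
        }
        PQclear(r);
    }
    PQclear(PQexec(conn, "COMMIT"));  /* one commit for the whole batch */
    return 0;
}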
All I have to say is wow! COPY works sooo much faster than the iterative method I was using. Even after having to read the entire binary file and reformat the data into the binary format that postgres needs, it is an order of magnitude faster than using a prepared INSERT. At least that's what my prelim testing is showing me. I will have to let it run for a while with a larger table to see how it performs in the long run, but initially this is very positive.
Thanks, guys!
Dave
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> ?? I'm not sure what you're implying about the semantics here, but it
> doesn't seem right. COPY doesn't somehow break out of ACID semantics,
> it's only an *optimization* that allows you to get large quantities of
> data into the database faster. The main reason it's faster is because
> parsing CSV data is easier than parsing SQL.
>
> At least I think that's the only difference; anybody know better?

I think a big reason is also that the client can stream the data without waiting for a network round trip ack on every statement.

--
"No animals were harmed in the recording of this episode. We tried but that damn monkey was just too fast."
On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
> On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> > it's faster is because parsing CSV data is easier than parsing SQL.
> >
> > At least I think that's the only difference; anybody know better?
>
> I think a big reason is also that the client can stream the data without
> waiting for a network round trip ack on every statement.

I don't think so. I'm pretty sure you can send multiple statements in a single round trip. libpq is defined to work in such cases anyway:

http://www.postgresql.org/docs/current/static/libpq-exec.html

--
Sam http://samason.me.uk/
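For what it's worth, the pattern that page describes looks roughly like the following: one PQsendQuery carrying several statements, then draining one PGresult per statement. The table name here is made up:

#include <stdio.h>
#include <libpq-fe.h>

/* Several statements in one network round trip: PQsendQuery sends the
   whole string, then PQgetResult returns one result per statement
   until it returns NULL. */
int send_batch(PGconn *conn)
{
    if (!PQsendQuery(conn,
            "INSERT INTO t VALUES (1); INSERT INTO t VALUES (2);"))
        return -1;

    PGresult *r;
    while ((r = PQgetResult(conn)) != NULL) {
        if (PQresultStatus(r) != PGRES_COMMAND_OK)
            fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(r);
    }
    return 0;
}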
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> > I think a big reason is also that the client can stream the data
> > without waiting for a network round trip ack on every statement.
>
> I don't think so. I'm pretty sure you can send multiple statements in a
> single round trip. libpq is defined to work in such cases anyway:
>
> http://www.postgresql.org/docs/current/static/libpq-exec.html

I'm sure you probably _can_, but how many programming loops do so?

--
"No animals were harmed in the recording of this episode. We tried but that damn monkey was just too fast."
Sam Mason <sam@samason.me.uk> writes:
> On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
>> I think a big reason is also that the client can stream the data without
>> waiting for a network round trip ack on every statement.

> I don't think so. I'm pretty sure you can send multiple statements in a
> single round trip.

You can, but that doesn't scale to megabytes of data (at least not well). I think the big points are elimination of per-row network and transaction commit overhead ... but there are some other optimizations in the COPY path too.

regards, tom lane
On Tue, Sep 29, 2009 at 09:11:19AM -0700, Alan Hodgson wrote:
> On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> > I'm pretty sure you can send multiple statements in a
> > single round trip. libpq is defined to work in such cases anyway:
> >
> > http://www.postgresql.org/docs/current/static/libpq-exec.html
>
> I'm sure you probably _can_, but how many programming loops do so?

It's not a very big sample, but I'm pretty sure I do! :)

--
Sam http://samason.me.uk/
On Tue, Sep 29, 2009 at 12:17:51PM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
> >> I think a big reason is also that the client can stream the data without
> >> waiting for a network round trip ack on every statement.
>
> > I don't think so. I'm pretty sure you can send multiple statements in a
> > single round trip.
>
> You can, but that doesn't scale to megabytes of data (at least not
> well).

No, but I didn't think that was being talked about. I was thinking network round trip time does seem to become a thousand times less important when you're putting a thousand statements together. This would seem to imply that network latency can be almost arbitrarily reduced.

> I think the big points are elimination of per-row network and
> transaction commit overhead ...

Well, if you start including transaction commit then you've just changed semantics away from COPY. I was implicitly thinking of what changes when you keep the same semantics as COPY.

> but there are some other optimizations
> in the COPY path too.

Cool, I'll continue to prefer COPY then!

--
Sam http://samason.me.uk/
On 2009-09-29, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
>> ?? I'm not sure what you're implying about the semantics here, but it
>> doesn't seem right. COPY doesn't somehow break out of ACID semantics,
>> it's only an *optimization* that allows you to get large quantities of
>> data into the database faster. The main reason it's faster is because
>> parsing CSV data is easier than parsing SQL.
>>
>> At least I think that's the only difference; anybody know better?
>
> I think a big reason is also that the client can stream the data without
> waiting for a network round trip ack on every statement.

A single INSERT statement can insert many rows. The win with COPY is more that the data part can be parsed one record at a time, whereas for the INSERT the whole statement must be parsed. Also, the quoting rules are simpler for CSV or pg's tab-delimited format.
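For anyone finding this in the archives: the multi-row form looks like the following (supported from 8.2 on; the values are made up, and the E'' escape syntax renders the bytea octal escapes):

INSERT INTO data_log_20msec_table (timestamp_dbl, data)
VALUES (1254152285.02, E'\\000\\001'),
       (1254152285.04, E'\\002\\003');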