Discussion: Using COPY to import large xml file


Using COPY to import large xml file

From: Anto Aravinth
Hello Everyone,

I have downloaded the Stack Overflow posts XML (it contains all SO questions to date); the file is around 70GB. I want to import the data from that XML into my table. Is there a way to do so in Postgres?


Thanks, 
Anto.

Re: Using COPY to import large xml file

From: Adrien Nayrat
On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> Hello Everyone,
>
> I have downloaded the Stackoverflow posts xml (contains all SO questions till
> date).. the file is around 70GB.. I wanna import the data in those xml to my
> table.. is there a way to do so in postgres?
>
>
> Thanks, 
> Anto.

Hello Anto,

I used this tool :
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info



Re: Using COPY to import large xml file

From: Anto Aravinth
Thanks for the response. I'm not sure how long this tool takes for the 70GB of data.

I used Node to stream the XML file into INSERTs, which was very slow. The XML contains about 40 million records, and the first 10 million took around 2 hrs using Node.js. Hence I thought I would use the COPY command, as suggested on the internet.

I will definitely try the code and let you know. But it looks like it uses plain INSERT, not COPY; it will be interesting to see if it runs quickly on my machine.

On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:

Hello Anto,

I used this tool :
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info


Re: Using COPY to import large xml file

From: Adrien Nayrat
On 06/24/2018 06:07 PM, Anto Aravinth wrote:
> Thanks for the response. I'm not sure, how long does this tool takes for the
> 70GB data.

From memory it took several hours. I can't remember whether the XML
conversion or the inserts took longer.

>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took around
> 2 hrs using nodejs. Hence, I thought will use COPY command, as suggested on the
> internet. 
>
> Definitely, will try the code and let you know.. But looks like it uses the same
> INSERT, not copy.. interesting if it runs quick on my machine. 

Yes, it uses INSERT; it is probably not difficult to change the code to
use COPY instead.
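
For illustration, a rough sketch of that kind of change (my own example,
not the tool's actual code; the "posts" table and its columns are
placeholders): once a batch of rows has been parsed out of the XML,
psycopg2 can push it through COPY instead of per-row INSERTs.

import io
import psycopg2

def copy_rows(conn, rows):
    # rows: one batch of (id, title, body) tuples already parsed from the XML.
    # "posts" and its columns are placeholder names.
    buf = io.StringIO()
    for rid, title, body in rows:
        # COPY text format: tab-separated fields, \N for NULL, and backslash
        # escapes for tabs/newlines (post bodies do contain them).
        fields = []
        for v in (rid, title, body):
            if v is None:
                fields.append("\\N")
            else:
                fields.append(str(v).replace("\\", "\\\\")
                                    .replace("\t", "\\t")
                                    .replace("\n", "\\n"))
        buf.write("\t".join(fields) + "\n")
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_from(buf, "posts", columns=("id", "title", "body"))
    conn.commit()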

--
Adrien NAYRAT
https://blog.anayrat.info



Re: Using COPY to import large xml file

From: Adrian Klaver
On 06/24/2018 08:25 AM, Anto Aravinth wrote:
> Hello Everyone,
> 
> I have downloaded the Stackoverflow posts xml (contains all SO questions 
> till date).. the file is around 70GB.. I wanna import the data in those 
> xml to my table.. is there a way to do so in postgres?

It is going to require some work. You will need to deal with:

1) The row schema inside the XML is described here:

https://ia800107.us.archive.org/27/items/stackexchange/readme.txt

(see the **posts**.xml section)

2) The rows are inside a <posts> tag.

Seems to me you have two options:

1) Drop each row into a single XML field and deal with extracting the 
row components in the database.

2) Break down the row into column components before entering them into 
the database.

Adrien has pointed you at a Python program that covers the above:

https://github.com/Networks-Learning/stackexchange-dump-to-postgres

If you are comfortable in Python you can take a look at:

https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/row_processor.py

to see how the rows are broken down into elements.
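
For a rough idea of what option 2 looks like, here is a small sketch (my
own, not row_processor.py; attribute names follow the readme and may
need checking against the actual dump):

import xml.etree.ElementTree as ET

# One line from posts.xml; attributes per the readme include Id, PostTypeId,
# Title, Body and ParentId (Title is absent on answers, ParentId on questions).
sample = '<row Id="4" PostTypeId="1" Title="Example?" Body="&lt;p&gt;Hi&lt;/p&gt;" />'

def row_to_columns(line):
    attrs = ET.fromstring(line).attrib
    return (
        int(attrs["Id"]),
        attrs.get("Title"),
        attrs.get("Body"),
        int(attrs["ParentId"]) if "ParentId" in attrs else None,
    )

print(row_to_columns(sample))   # (4, 'Example?', '<p>Hi</p>', None)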

I would try this out first on one of the smaller datasets found here:

https://archive.org/details/stackexchange

I personally took a look at:

https://archive.org/download/stackexchange/beer.stackexchange.com.7z

because why not?

> 
> 
> Thanks,
> Anto.


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Using COPY to import large xml file

From: Christoph Moench-Tegeder
## Adrien Nayrat (adrien.nayrat@anayrat.info):

> I used this tool :
> https://github.com/Networks-Learning/stackexchange-dump-to-postgres

That will be awfully slow: this tool commits each INSERT on its own,
see the loop in
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/load_into_pg.py#L83

With only small changes - prepare the INSERT, execute it for all (or at
least a huge lot of) rows, COMMIT at the end - you can save quite a lot
of time (500 rows is not "a huge lot"). And when you do that, for
heaven's sake, do not try to create the INSERT statement as a string
with the values in it - Bobby Tables will eat you. See the psycopg
documentation on how it's done (especially watch the warnings):
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
On prepared statements with psycopg2, see
http://initd.org/psycopg/articles/2012/10/01/prepared-statements-psycopg/
(Python makes stuff rather hard, but I'll spare you the snark and wait
until the language has matured to at least version 5).

Using the COPY protocol with psycopg2 seems to require some hoop-jumping,
but could improve matters even more.
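
For illustration, a minimal psycopg2 sketch of that advice (table and
column names here are placeholders): values are passed as parameters,
execute_values() batches many rows per statement, and there is a single
COMMIT at the end.

import psycopg2
from psycopg2.extras import execute_values

def bulk_insert(dsn, rows):
    # rows: iterable of (id, title, body) tuples; "posts" is a placeholder table.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # No string building with the values -- psycopg2 does the quoting,
            # and execute_values() packs page_size rows into each INSERT.
            execute_values(
                cur,
                "INSERT INTO posts (id, title, body) VALUES %s",
                rows,
                page_size=10000,
            )
        conn.commit()   # one commit at the end, not one per row
    finally:
        conn.close()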

Regards,
Christoph

-- 
Spare Space.


Re: Using COPY to import large xml file

From: Tim Cross
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:

> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.

If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. It takes
between 40 and 50 minutes to insert 60 million+ records, and we are
doing additional calculations on the values, not just inserting them;
plus we are inserting over the network, into a database which is also
performing other processing.

We found a significant speed improvement with COPY over blocks of
insert transactions, which in turn was faster than individual inserts.
The only downside of using COPY is that it either completely works or
completely fails, and when it fails it can be tricky to work out which
record is causing the failure. A benefit of using blocks of
transactions is that you have more fine-grained control, allowing you
to recover from some errors or provide more specific detail about the
cause of the error.

Be wary of what indexes you're defining on your table. Depending on
their type and number, these can have a significant impact on insert
times as well.


-- 
Tim Cross


Re: Using COPY to import large xml file

From: Anto Aravinth


On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx@gmail.com> wrote:

If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. Takes between
40 to 50 minutes to insert 60 Million+ records and we are doing
additional calculations on the values, not just inserting them,
plus we are inserting into a database over the network and into a database which is
also performing other processing.

We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have more fine grained control, allowing you to recover from
some errors or providing more specific detail regarding the cause of the
error.

Sure, let me try that. I have a question here: COPY usually works when you move data from files to your Postgres instance, right? Now, processing the whole file in Node.js, can I use COPY programmatically, like COPY Stackoverflow <calculated value at run time>? Because from the doc:

https://www.postgresql.org/docs/9.2/static/sql-copy.html

I don't see that it's possible. Maybe I need to convert the file into something COPY understands first?

Anto.


Re: Using COPY to import large xml file

From: Tim Cross


On Mon, 25 Jun 2018 at 11:38, Anto Aravinth <anto.aravinth.cse@gmail.com> wrote:


Sure, let me try that. I have a question here: COPY usually works when you move data from files to your Postgres instance, right? Now, processing the whole file in Node.js, can I use COPY programmatically, like COPY Stackoverflow <calculated value at run time>? Because from the doc:

https://www.postgresql.org/docs/9.2/static/sql-copy.html

I don't see that it's possible. Maybe I need to convert the file into something COPY understands first?

Anto.



Yes. Essentially what you do is create a stream and feed whatever information you want to copy into that stream. PG sees the data as if it were reading each line from a file, so you push data onto the stream where each item is separated by a tab (or whatever). Here is the basic low-level function I use (don't know how the formatting will go!)

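// Note: this snippet relies on a few externals from the surrounding module
// (assumed): Readable from node's "stream", copyFrom = require("pg-copy-streams").from,
// VError from the "verror" package, node's built-in assert, and the author's
// own getClient()/releaseClient() connection-pool helpers.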
async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function", "The stringifyFN arg must be a function");
  
  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;
        
        function done() {
          releaseClient(client);
          client = undefined; 
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable;
        rs._read = arrayRead;
        rs.on("error", onError);
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);
        rs.pipe(stream);
      });
    })
    .catch(err => {
      throw new VError(err, `${logName} Failed COPY insert`);
    });
}

and I will call it like

copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are

  const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
        + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
        + "vprp_09,vprp_15,wind_speed) FROM STDIN";

  function stringifyClimateRecord(rec) {
    return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
      + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
  }

stringifyClimateRecord returns a record to be inserted as a 'line' into the stream, with values separated by tabs. records is an array of data records where each record is itself an array.


--
regards,

Tim

--
Tim Cross

Re: Using COPY to import large xml file

From: Christoph Moench-Tegeder
## Anto Aravinth (anto.aravinth.cse@gmail.com):

> Sure, let me try that.. I have a question here, COPY usually works when you
> move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY
> programmatically like COPY Stackoverflow <calculated value at run time>?
> Because from doc:
> 
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
> 
> I don't see its possible. May be I need to convert the files to copy
> understandable first?

"COPY ... FROM STDIN"
 STDIN Specifies that input comes from the client application.

It's on the page...
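
In other words, the data never needs to exist as a file on the server;
the client streams it. A tiny sketch with psycopg2 (any driver that
speaks the COPY protocol can do this, pg-copy-streams being the node
equivalent; the DSN and table here are placeholders):

import io
import psycopg2

conn = psycopg2.connect("dbname=so")                       # placeholder DSN
data = io.StringIO("1\tfirst post\n2\tsecond post\n")      # any file-like object

with conn, conn.cursor() as cur:
    # STDIN here means "whatever the client application sends",
    # not the server's standard input. "posts" is a placeholder table.
    cur.copy_expert("COPY posts (id, title) FROM STDIN", data)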

Regards,
Christoph

-- 
Spare Space.


Re: Using COPY to import large xml file

From: Anto Aravinth
Thanks a lot. But I ran into a lot of challenges! It looks like the SO data contains a lot of tabs within itself, so a tab delimiter didn't work for me. I thought I could use a special delimiter, but it looks like PostgreSQL's COPY allows only a single character as the delimiter :(

Sad; I guess the only way is to use INSERTs, or do a thorough serialization of my data into something that COPY can understand.



Re: Using COPY to import large xml file

From: Nicolas Paris

2018-06-25 16:25 GMT+02:00 Anto Aravinth <anto.aravinth.cse@gmail.com>:
Thanks a lot. But I do got lot of challenges! Looks like SO data contains lot of tabs within itself.. So tabs delimiter didn't work for me. I thought I can give a special demiliter but looks like Postrgesql copy allow only one character as delimiter :(

Sad, I guess only way is to insert or do a through serialization of my data into something that COPY can understand.

The easiest way would be:
xml -> csv -> \copy

By CSV, I mean regular quoted CSV: simply wrap each field in double quotes, and escape any quotes contained in the field with another double quote.

PostgreSQL's COPY CSV parser is one of the most robust I have ever tested.
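
A sketch of that pipeline for the posts dump (my own illustration; attribute names follow the readme and may need adjusting): iterparse streams the large file without loading it all into memory, and csv.writer applies exactly the quoting rules described above.

import csv
import xml.etree.ElementTree as ET

def posts_xml_to_csv(xml_path, csv_path):
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)            # default: comma-delimited, embedded quotes doubled
        context = ET.iterparse(xml_path, events=("start", "end"))
        _, root = next(context)             # the enclosing <posts> element
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                a = elem.attrib
                writer.writerow([a["Id"], a.get("Title", ""),
                                 a.get("Body", ""), a.get("ParentId", "")])
                root.clear()                # drop processed rows, keep memory flat

# then, in psql, something like:
#   \copy so2 (id, title, posts, parent_id) FROM 'posts.csv' WITH (FORMAT csv)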

Re: Using COPY to import large xml file

From: Anto Aravinth


On Mon, Jun 25, 2018 at 8:20 PM, Nicolas Paris <niparisco@gmail.com> wrote:

2018-06-25 16:25 GMT+02:00 Anto Aravinth <anto.aravinth.cse@gmail.com>:
Thanks a lot. But I do got lot of challenges! Looks like SO data contains lot of tabs within itself.. So tabs delimiter didn't work for me. I thought I can give a special demiliter but looks like Postrgesql copy allow only one character as delimiter :(

Sad, I guess only way is to insert or do a through serialization of my data into something that COPY can understand.

​easiest way would be:
xml -> csv -> \copy

​by csv, I mean regular quoted csv (Simply wrap csv field with double quote, and escape
enventually contained quotes with an other double quote.).

I tried, but no luck. Here is a sample of the CSV I wrote from my XML converter:

1       "Are questions about animations or comics inspired by Japanese culture or styles considered on-topic?"  "pExamples include a href=""http://www.imdb.com/title/tt0417299/"" rel=""nofollow""Avatar/a, a href=""http://www.imdb.com/title/tt1695360/"" rel=""nofollow""Korra/a and, to some extent, a href=""http://www.imdb.com/title/tt0278238/"" rel=""nofollow""Samurai Jack/a. They're all widely popular American cartoons, sometimes even referred to as ema href=""https://en.wikipedia.org/wiki/Anime-influenced_animation"" rel=""nofollow""Amerime/a/em./p


pAre questions about these series on-topic?/p

"       "pExamples include a href=""http://www.imdb.com/title/tt0417299/"" rel=""nofollow""Avatar/a, a href=""http://www.imdb.com/title/tt1695360/"" rel=""nofollow""Korra/a and, to some extent, a href=""http://www.imdb.com/title/tt0278238/"" rel=""nofollow""Samurai Jack/a. They're all widely popular American cartoons, sometimes even referred to as ema href=""https://en.wikipedia.org/wiki/Anime-influenced_animation"" rel=""nofollow""Amerime/a/em./p


pAre questions about these series on-topic?/p

"       "null"

the schema of my table is:

  CREATE TABLE so2 (
    id  INTEGER NOT NULL PRIMARY KEY,
    title varchar(1000) NULL,
    posts text,
    body TSVECTOR,
    parent_id INTEGER NULL,
    FOREIGN KEY (parent_id) REFERENCES so1(id)
);

and when I run:

COPY so2 from '/Users/user/programs/js/node-mbox/file.csv';

I get:

CONTEXT:  COPY so2, line 1: "1 "Are questions about animations or comics inspired by Japanese culture or styles considered on-top..."

Not sure what I'm missing; perhaps the above CSV is breaking because I have newlines within my content. But the error message is very hard to debug.

Postgresql copy csv parser is one of the most robust I ever tested
before.

Re: Using COPY to import large xml file

From: Anto Aravinth


On Mon, Jun 25, 2018 at 8:54 PM, Anto Aravinth <anto.aravinth.cse@gmail.com> wrote:


and when I run:

COPY so2 from '/Users/user/programs/js/node-mbox/file.csv';


I get:

ERROR:  missing data for column "body"
CONTEXT:  COPY so2, line 1: "1 "Are questions about animations or comics inspired by Japanese culture or styles considered on-top..."

Not sure what I'm missing; perhaps the above CSV is breaking because I have newlines within my content. But the error message is very hard to debug.

Postgresql copy csv parser is one of the most robust I ever tested
before.


Re: Using COPY to import large xml file

From: Nicolas Paris


2018-06-25 17:30 GMT+02:00 Anto Aravinth <anto.aravinth.cse@gmail.com>:


and when I run:

COPY so2 from '/Users/user/programs/js/node-mbox/file.csv';


I get:

ERROR:  missing data for column "body"
CONTEXT:  COPY so2, line 1: "1 "Are questions about animations or comics inspired by Japanese culture or styles considered on-top..."

Not sure what I'm missing; perhaps the above CSV is breaking because I have newlines within my content. But the error message is very hard to debug.



What you are missing is the configuration of the COPY statement (please refer to https://www.postgresql.org/docs/9.2/static/sql-copy.html): options such as format, delimiter, quote and escape.
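
For example, if the converted file really is tab-delimited with double-quoted fields, something along these lines (a sketch; adjust the column list and options to the actual file):

import psycopg2

conn = psycopg2.connect("dbname=so")       # placeholder DSN

copy_sql = """
    COPY so2 (id, title, posts, parent_id)
    FROM STDIN
    WITH (FORMAT csv, DELIMITER E'\\t', QUOTE '"', ESCAPE '"')
"""

with conn, conn.cursor() as cur, open("file.csv", encoding="utf-8") as f:
    # FORMAT csv is what lets quoted fields contain newlines and doubled quotes;
    # the explicit column list must match the columns actually present in the file.
    cur.copy_expert(copy_sql, f)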

Re: Using COPY to import large xml file

From: Adrian Klaver
On 06/25/2018 07:25 AM, Anto Aravinth wrote:
> Thanks a lot. But I do got lot of challenges! Looks like SO data 
> contains lot of tabs within itself.. So tabs delimiter didn't work for 
> me. I thought I can give a special demiliter but looks like Postrgesql 
> copy allow only one character as delimiter :(

I use | as it is rarely found in data itself.

> 
> Sad, I guess only way is to insert or do a through serialization of my 
> data into something that COPY can understand.


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Using COPY to import large xml file

From: Tim Cross
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:

> Thanks a lot. But I do got lot of challenges! Looks like SO data contains
> lot of tabs within itself.. So tabs delimiter didn't work for me. I thought
> I can give a special demiliter but looks like Postrgesql copy allow only
> one character as delimiter :(
>
> Sad, I guess only way is to insert or do a through serialization of my data
> into something that COPY can understand.
>

The COPY command has a number of options, including setting what is
used as the delimiter - it doesn't have to be tab. You also need to
look at the logs/output to see exactly why the copy fails.

I'd recommend first pre-processing your input data to make sure it is
'clean' and all the fields actually match the DDL you have used to
define your db tables etc. I'd then select a small subset and try
different parameters to the copy command until you get the right
combination of data format and copy definition.

It may take some effort to get the right combination, but the result
is probably worth it given your data set size, i.e. the difference
between hours and days.
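
One cheap way to build such a subset from an already-converted CSV (a
sketch; file names are placeholders) - going through the csv module
keeps quoted multi-line records intact, which a plain head -n would not:

import csv
import itertools

# Copy the first 1,000 logical CSV records (not raw lines) into a small file
# that different COPY option combinations can be tried against quickly.
with open("posts.csv", newline="", encoding="utf-8") as src, \
     open("posts_sample.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in itertools.islice(csv.reader(src), 1000):
        writer.writerow(row)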

--
Tim Cross


Re: Using COPY to import large xml file

From: Anto Aravinth
Thanks a lot everyone. After playing around with a small dataset, I was able to produce datasets that COPY handles easily. Creating around 50GB of datasets took about 2 hours (I can definitely improve on this).

For the 54M records, COPY took around 35 minutes! Awesome.. :) :)

In the meantime, I understood a few things like vacuum etc.

Really loving Postgres!

Thanks,
Anto.
