Discussion: any solution for doing a data file import spawning it on multiple processes


any solution for doing a data file import spawning it on multiple processes

From: "hb@101-factory.eu"
Date:
hi there,

I am trying to import large data files into pg.
For now I used the xargs Linux command to spawn an import process for each line, set to use the maximum number of available connections.

We use pgpool as the connection pool for the database, and so try to maximize the concurrent import of the file.

The problem is that it seems to work well, but we miss a line once in a while, and that is not acceptable. It also creates zombie processes ;(.

Does anybody have any other tricks that will do the job?

thanks,

Henk
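
A note on the failure mode described above: with one spawned process and one connection per line, a worker that dies silently drops its line, and child processes that are never reaped become zombies. Below is a minimal sketch of the usual alternative, batching all lines over a single JDBC connection in one transaction; the connection settings, the table name "mytable", and the two-column CSV layout are assumptions for illustration, not part of the original post.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchImport {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
        conn.setAutoCommit(false); // one transaction for the whole file
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO mytable (col1, col2) VALUES (?, ?)");
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        long count = 0;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(",", 2); // assumes a two-column CSV
            if (fields.length < 2) continue;      // skip malformed lines
            ps.setString(1, fields[0]);
            ps.setString(2, fields[1]);
            ps.addBatch();
            if (++count % 1000 == 0) {
                ps.executeBatch(); // send 1000 rows per round trip
            }
        }
        ps.executeBatch(); // flush the remainder
        conn.commit();     // either every line is imported, or none are
        in.close();
        conn.close();
    }
}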

Re: any solution for doing a data file import spawning it on multiple processes

From: Edson Richter
Date:
On 16/06/2012 12:04, hb@101-factory.eu wrote:
> hi there,
>
> I am trying to import large data files into pg.
> For now I used the xargs Linux command to spawn an import process for each line, set to use the maximum number of available connections.
>
> We use pgpool as the connection pool for the database, and so try to maximize the concurrent import of the file.
>
> The problem is that it seems to work well, but we miss a line once in a while, and that is not acceptable. It also creates zombie processes ;(.
>
> Does anybody have any other tricks that will do the job?
>
> thanks,
>
> Henk

I've used a custom Java application with connection pooling (limited to
1000 connections, meaning 1000 concurrent file imports).

I'm able to import more than 64000 XML files (about 13 KB each) in 5
minutes, without memory leaks or zombies, and (of course) no
missing records.

Besides having each thread import a separate file, I have another situation
where separate threads import different lines of the same file.
No problems at all. Do not forget to check your OS "file open" limits
(it was a big issue for me in the past due to Lucene indexes generated
during import).
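
For illustration, a minimal sketch of that second pattern: several worker threads import different lines of the same file by draining a shared queue, each worker holding its own connection. The class and table names and the connection settings are assumptions, not Edson's actual code; the blocking queue guarantees every line is handed to exactly one worker.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ParallelLineImport {
    private static final String EOF = "\u0000EOF"; // sentinel telling a worker to stop

    public static void main(String[] args) throws Exception {
        final int workers = 8; // roughly one per core
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        for (int i = 0; i < workers; i++) {
            new Thread(new Runnable() {
                public void run() {
                    Connection conn = null;
                    try {
                        // each worker holds its own connection (or one taken from the pool)
                        conn = DriverManager.getConnection(
                                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
                        PreparedStatement ps = conn.prepareStatement(
                                "INSERT INTO mytable (line) VALUES (?)");
                        String line;
                        while (!(line = queue.take()).equals(EOF)) {
                            ps.setString(1, line);
                            ps.executeUpdate();
                        }
                    } catch (Exception ex) {
                        ex.printStackTrace();
                    } finally {
                        try { if (conn != null) conn.close(); } catch (Exception ignore) {}
                    }
                }
            }).start();
        }

        // A single reader puts every line on the queue exactly once, so no
        // line can be silently skipped; put()/take() block instead of dropping.
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            queue.put(line);
        }
        in.close();
        for (int i = 0; i < workers; i++) {
            queue.put(EOF); // one sentinel per worker
        }
    }
}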

Server: 8-core Xeon, 16 GB RAM, 15000 RPM SAS disks, PostgreSQL 9.1.3, Linux
CentOS 5, Sun Java 1.6.27.

Regards,

Edson Richter


Re: any solution for doing a data file import spawning it on multiple processes

From: "hb@101-factory.eu"
Date:
Thanks. I thought about splitting the file, but that did not work out well.

We receive 2 files every 30 seconds and need to import them as fast as possible.

We do not run Java currently, but maybe it's an option.
Are you willing to share your code?

I was also thinking of using Perl for it.


henk

On 16 jun. 2012, at 17:37, Edson Richter <edsonrichter@hotmail.com> wrote:

> On 16/06/2012 12:04, hb@101-factory.eu wrote:
>> [...]
>
> I've used a custom Java application with connection pooling (limited to 1000 connections, meaning 1000 concurrent file imports).
>
> I'm able to import more than 64000 XML files (about 13 KB each) in 5 minutes, without memory leaks or zombies, and (of course) no missing records.
>
> Besides having each thread import a separate file, I have another situation where separate threads import different lines of the same file. No problems at all. Do not forget to check your OS "file open" limits (it was a big issue for me in the past due to Lucene indexes generated during import).
>
> Server: 8-core Xeon, 16 GB RAM, 15000 RPM SAS disks, PostgreSQL 9.1.3, Linux CentOS 5, Sun Java 1.6.27.
>
> Regards,
>
> Edson Richter

Re: any solution for doing a data file import spawning it on multiple processes

From: Bosco Rama
Date:
hb@101-factory.eu wrote:
> Thanks. I thought about splitting the file, but that did not work out well.
>
> We receive 2 files every 30 seconds and need to import them as fast as possible.
>
> We do not run Java currently, but maybe it's an option.
> Are you willing to share your code?
>
> I was also thinking of using Perl for it.

Not sure if this will help, but have you looked at pgloader?

<http://pgloader.projects.postgresql.org/>

Bosco.

Re: any solution for doing a data file import spawning it on multiple processes

From: Edson Richter
Date:
On 16/06/2012 12:59, hb@101-factory.eu wrote:
> Thanks. I thought about splitting the file, but that did not work out well.
>
> We receive 2 files every 30 seconds and need to import them as fast as possible.
>
> We do not run Java currently, but maybe it's an option.
> Are you willing to share your code?
>
> I was also thinking of using Perl for it.
>
>
> henk
>
> On 16 jun. 2012, at 17:37, Edson Richter <edsonrichter@hotmail.com> wrote:
>
>> [...]
I'm not allowed to publish my company's code, but the logic is very easy
to understand (you will have to "invent" your own solution; the code below
is bare-bones):

import java.io.File;
import java.io.FileFilter;

class MainThread implements Runnable {
    // volatile, so the flag set by stopWorker() is visible to the worker thread
    private volatile boolean keepRunning = true;

    public void run() {
        while (keepRunning) {
            try {
                executeFiles();
                Thread.sleep(30000); // sleep 30 seconds
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

    private void executeFiles() {
        File monitorDir = new File("/var/mydatafolder/");
        File processingDir = new File("/var/myprocessingfolder/");

        // I'll import only files with names like "data20120621.csv":
        FileFilter fileFilter = new FileFilter() {
            public boolean accept(File file) {
                if (!file.isFile() || file.isHidden()) return false;
                String fname = file.getName();
                return fname.startsWith("data") && fname.endsWith("csv");
            }
        };

        // listFiles() returns an array, or null if the directory cannot be read
        File[] forProcessing = monitorDir.listFiles(fileFilter);
        if (forProcessing == null) return;

        for (File fileFound : forProcessing) {
            // FileUtil is a utility class you will have to create yourself;
            // your move method will vary according to your operating system.
            FileUtil.move(fileFound, processingDir);
            // ProcessFile is a class that implements Runnable; do your stuff there...
            Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
            t.start();
        }
    }

    /** Use this method to stop the thread from another place in your complex system! */
    public synchronized void stopWorker() {
        keepRunning = false;
    }

    public static void main(String[] args) {
        Thread t = new Thread(new MainThread());
        t.start();
    }
}
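
ProcessFile is left unspecified above. As a minimal sketch, assuming the incoming files are CSV matching the target table, one way to fill it in is the PostgreSQL JDBC driver's CopyManager, which loads a whole file with a single COPY statement; the connection settings and the table name "mytable" are placeholders.

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

class ProcessFile implements Runnable {
    private final File dir;
    private final String fileName;

    ProcessFile(File dir, String fileName) {
        this.dir = dir;
        this.fileName = fileName;
    }

    public void run() {
        Connection conn = null;
        try {
            // in a real system this connection would come from the pool
            conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
            Reader in = new FileReader(new File(dir, fileName));
            CopyManager copier = new CopyManager((BaseConnection) conn);
            // one COPY per file: fast, atomic, and no line can be skipped
            long rows = copier.copyIn("COPY mytable FROM STDIN WITH (FORMAT csv)", in);
            System.out.println(fileName + ": " + rows + " rows imported");
            in.close();
        } catch (Exception ex) {
            ex.printStackTrace(); // production code would move the file to an error folder
        } finally {
            try { if (conn != null) conn.close(); } catch (Exception ignore) {}
        }
    }
}

Since COPY runs as a single statement, a failed file imports no rows at all rather than a partial set, which sidesteps the missing-line problem.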




Re: any solution for doing a data file import spawning it on multiple processes

From: "hb@101-factory.eu"
Date:
Thanks all, I will be looking into it.

Kind regards,

Henk

On 16 jun. 2012, at 18:23, Edson Richter <edsonrichter@hotmail.com> wrote:

> [...]