Обсуждение: COPY in Java?
I'm looking at the JDBC documentation at https://jdbc.postgresql.org/documentation/head/index.html and not seeing any mention of COPY and any semantics for using COPY FROM STDIN or COPY TO STDOUT... google has a link to a 404 page with some sort of CopyManager we may need to develop a specialized sort of sql loader to replace an existing Oracle tool, we need to intercept bad rows and put them in an exceptions file, while bulk loading the good data. I'm thinking we batch N rows into COPY, and if a row faults, put it in the exception pool then retry the batch without it. the source data needs to be massaged before it can be fed to copy, so we can't use COPY FROM filename ... -- john r pierce, recycling bits in santa cruz
John R Pierce schrieb am 18.07.2016 um 22:34: > I'm looking at the JDBC documentation at > https://jdbc.postgresql.org/documentation/head/index.html and not > seeing any mention of COPY and any semantics for using COPY FROM > STDIN or COPY TO STDOUT... google has a link to a 404 page with some > sort of CopyManager https://jdbc.postgresql.org/documentation/publicapi/index.html indeed returns a 404 > we may need to develop a specialized sort of sql loader to replace an > existing Oracle tool, we need to intercept bad rows and put them in > an exceptions file, while bulk loading the good data. I'm thinking > we batch N rows into COPY, and if a row faults, put it in the > exception pool then retry the batch without it. the source data > needs to be massaged before it can be fed to copy, so we can't use > COPY FROM filename ... The CopyManager is actually quite easy to use: org.postgresql.core.BaseConnection con = (BaseConnection)DriverManager.getConnection(....); CopyManager mgr = new CopyManager(con); Reader in = new BuffereReader(new FileReader("...")); mgr.copyIn("copy target_table from stdin with (...)", in); But I don't think it's suitable for what you need to do because: * The input file needs to be formatted so that COPY can handle it. * there is no way you can catch the bad rows with that. I assume the first problem could be solved by implementing your own Reader that "massages" each line while it reads the sourcefile. But I don't think there is an efficient way to "catch" the bad rows through the Copy API. But you might be interested in http://pgloader.io/ which is a tool similar to SQL*Loader it supports more formats then COPYdoes. I have not worked with it however Thomas
On Mon, Jul 18, 2016 at 2:16 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:
I assume the first problem could be solved by implementing your own Reader that "massages" each line while it reads the source file.
But I don't think there is an efficient way to "catch" the bad rows through the Copy API.
I've done exactly this. A long time ago, but so long as you don't mind redoing the entire COPY without the bad rows, it works just fine. If you get lots of failures in any given batch, you'll be better off doing individual inserts in large transactions and catching the failures there, or writing a more sophisticated reader which catches the bad rows and prevents them from even being attempted.
But you might be interested in http://pgloader.io/ which is a tool similar to SQL*Loader it supports more formats then COPY does.
I have not worked with it however
Is Kettle still around (java-based ETL tool)? It always had its difficulties, in my experience, but for basic bulk inserts with redirection of bad rows to a log, it works very well. Not the fastest to execute relative to implementing a custom COPY reader, but if you aren't already dealing with overloaded infrastructure and don't need to absolutely maximize efficiency, it will get you up and running really quickly (once you are familiar with the tool and over the learning curve, of course) and isn't database-specific, so much more portable. I found it really easy to write Java/Groovy/jython code which could instantiate the kettle tools and run them in the context of other tools/scripts, making it quite modular. It was all a long time ago or I'd provide more detail. Sorry. And none of that should discredit the pgloader.io suggestion. I just have no experience of it.
--sam
John R Pierce schrieb am 18.07.2016 um 22:34: > I'm looking at the JDBC documentation at > https://jdbc.postgresql.org/documentation/head/index.html and not > seeing any mention of COPY and any semantics for using COPY FROM > STDIN or COPY TO STDOUT... google has a link to a 404 page with some > sort of CopyManager https://jdbc.postgresql.org/documentation/publicapi/index.html indeed returns a 404 > we may need to develop a specialized sort of sql loader to replace an > existing Oracle tool, we need to intercept bad rows and put them in > an exceptions file, while bulk loading the good data. I'm thinking > we batch N rows into COPY, and if a row faults, put it in the > exception pool then retry the batch without it. the source data > needs to be massaged before it can be fed to copy, so we can't use > COPY FROM filename ... The CopyManager is actually quite easy to use: org.postgresql.core.BaseConnection con = (BaseConnection)DriverManager.getConnection(....); CopyManager mgr = new CopyManager(con); Reader in = new BuffereReader(new FileReader("...")); mgr.copyIn("copy target_table from stdin with (...)", in); But I don't think it's suitable for what you need to do because: * The input file needs to be formatted so that COPY can handle it. * there is no way you can catch the bad rows with that. I assume the first problem could be solved by implementing your own Reader that "massages" each line while it reads the sourcefile. But I don't think there is an efficient way to "catch" the bad rows through the Copy API. But you might be interested in http://pgloader.io/ which is a tool similar to SQL*Loader it supports more formats then COPYdoes. I have not worked with it however Thomas
On 7/18/2016 2:16 PM, Thomas Kellerer wrote:
John R Pierce schrieb am 18.07.2016 um 22:34:I'm looking at the JDBC documentation at
https://jdbc.postgresql.org/documentation/head/index.html and not
seeing any mention of COPY and any semantics for using COPY FROM
STDIN or COPY TO STDOUT... google has a link to a 404 page with some
sort of CopyManager
https://jdbc.postgresql.org/documentation/publicapi/index.html
indeed returns a 404
so where is the documentation on CopyManager ?
-- john r pierce, recycling bits in santa cruz
On Mon, Jul 18, 2016 at 2:16 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:
I assume the first problem could be solved by implementing your own Reader that "massages" each line while it reads the source file.
But I don't think there is an efficient way to "catch" the bad rows through the Copy API.
I've done exactly this. A long time ago, but so long as you don't mind redoing the entire COPY without the bad rows, it works just fine. If you get lots of failures in any given batch, you'll be better off doing individual inserts in large transactions and catching the failures there, or writing a more sophisticated reader which catches the bad rows and prevents them from even being attempted.
But you might be interested in http://pgloader.io/ which is a tool similar to SQL*Loader it supports more formats then COPY does.
I have not worked with it however
Is Kettle still around (java-based ETL tool)? It always had its difficulties, in my experience, but for basic bulk inserts with redirection of bad rows to a log, it works very well. Not the fastest to execute relative to implementing a custom COPY reader, but if you aren't already dealing with overloaded infrastructure and don't need to absolutely maximize efficiency, it will get you up and running really quickly (once you are familiar with the tool and over the learning curve, of course) and isn't database-specific, so much more portable. I found it really easy to write Java/Groovy/jython code which could instantiate the kettle tools and run them in the context of other tools/scripts, making it quite modular. It was all a long time ago or I'd provide more detail. Sorry. And none of that should discredit the pgloader.io suggestion. I just have no experience of it.
--sam
On 7/18/2016 2:16 PM, Thomas Kellerer wrote:
John R Pierce schrieb am 18.07.2016 um 22:34:I'm looking at the JDBC documentation at
https://jdbc.postgresql.org/documentation/head/index.html and not
seeing any mention of COPY and any semantics for using COPY FROM
STDIN or COPY TO STDOUT... google has a link to a 404 page with some
sort of CopyManager
https://jdbc.postgresql.org/documentation/publicapi/index.html
indeed returns a 404
so where is the documentation on CopyManager ?
-- john r pierce, recycling bits in santa cruz
On 19.07.2016 05:11, John R Pierce wrote: > On 7/18/2016 2:16 PM, Thomas Kellerer wrote: >> John R Pierce schrieb am 18.07.2016 um 22:34: >>> I'm looking at the JDBC documentation at >>> https://jdbc.postgresql.org/documentation/head/index.html and not >>> seeing any mention of COPY and any semantics for using COPY FROM >>> STDIN or COPY TO STDOUT... google has a link to a 404 page with some >>> sort of CopyManager >> >> https://jdbc.postgresql.org/documentation/publicapi/index.html >> >> indeed returns a 404 > > so where is the documentation on CopyManager ? > I'm also getting HTTP 404 when I try to access https://jdbc.postgresql.org/documentation/publicapi/index.html. The cited URL should contain the JavaDocs. You can get the archive from the Maven Central repository. This is the link for the latest version: http://central.maven.org/maven2/org/postgresql/postgresql/9.4.1209/postgresql-9.4.1209-javadoc.jar Directory browsing is enabled so if this is not the version you are using, you can browse to the directory with the appropriate version. -- Поздрави, | Best regards, Радослав Петров | Radoslav Petrov
On 19.07.2016 05:11, John R Pierce wrote: > On 7/18/2016 2:16 PM, Thomas Kellerer wrote: >> John R Pierce schrieb am 18.07.2016 um 22:34: >>> I'm looking at the JDBC documentation at >>> https://jdbc.postgresql.org/documentation/head/index.html and not >>> seeing any mention of COPY and any semantics for using COPY FROM >>> STDIN or COPY TO STDOUT... google has a link to a 404 page with some >>> sort of CopyManager >> >> https://jdbc.postgresql.org/documentation/publicapi/index.html >> >> indeed returns a 404 > > so where is the documentation on CopyManager ? > I'm also getting HTTP 404 when I try to access https://jdbc.postgresql.org/documentation/publicapi/index.html. The cited URL should contain the JavaDocs. You can get the archive from the Maven Central repository. This is the link for the latest version: http://central.maven.org/maven2/org/postgresql/postgresql/9.4.1209/postgresql-9.4.1209-javadoc.jar Directory browsing is enabled so if this is not the version you are using, you can browse to the directory with the appropriate version. -- Поздрави, | Best regards, Радослав Петров | Radoslav Petrov