Discussion: importing a messy text file
Hi,
I have a 56GB text file that I want to import into postgres. I deleted the header with the column names (using sed) so that I could use COPY with the non-CSV text format (because some of the text values contain quotes).
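For illustration, a minimal sketch of the two steps described, on a throwaway sample file (file, table, and database names are placeholders; `sed -i` assumes GNU sed):

```shell
# The actual import would then be something like (placeholder names):
#   psql -d mydb -c "COPY mytable FROM '/path/to/data.txt' (FORMAT text)"
# Text format does not treat quote characters specially, unlike CSV.
printf 'col1\tcol2\nval1\tval2\n' > /tmp/data_sample.txt
sed -i '1d' /tmp/data_sample.txt   # delete the first (header) line in place
cat /tmp/data_sample.txt
```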
--
Willy-Bas Loos
On Wed, Apr 30, 2014 at 10:07:09AM +0200, Willy-Bas Loos wrote:
> it would probably not be feasible to manually add tabs for every record
> that misses one, although i don't know how many it would be.

Why not a) let a script do that, b) remove tabs from the end of lines as needed?

All in all it sounds like you want to use a scripting language to sanitize the file.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wed, 30 Apr 2014 10:07:09 +0200 Willy-Bas Loos <willybas@gmail.com> wrote:
> Hi,
>
> I have a 56GB textfile that i want to import into postgres.
!!!
> The problem is this:
> There is a tab after the last column, in many but not all records.

You could clean it up with something like s/^I$// , couldn't you?

--
Alberto Cabello Sánchez <alberto@unex.es>
Alberto, it would be hard to use sed (s/^I$//), because there is no explicit NULL value and there are many NULLs in the last column.
So I can't be sure how many tabs should be at the end of each line.
--
Willy-Bas Loos
On Wed, 30 Apr 2014 10:47:12 +0200 Willy-Bas Loos <willybas@gmail.com> wrote:
> Alberto, it would be hard to use sed (s/^I$//), because there is no
> explicit NULL value and there are many NULLs in the last column.
> So i can't be sure how many tabs should be at the end of each line.

Ok, I understand (I think). What about using "cut" to strip the extra fields?

--
Alberto Cabello Sánchez <alberto@unex.es>
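For the record, a sketch of the cut approach on made-up data, assuming (hypothetically) the table has 5 tab-separated columns: cut drops everything past the 5th tab, including the empty trailing field that an extra tab creates.

```shell
# Tab is cut's default delimiter. A line with a trailing tab has an empty
# 6th field, which is dropped; a line that already has exactly 5 fields
# passes through unchanged.
printf 'a\tb\tc\td\te\t\nf\tg\th\ti\tj\n' | cut -f1-5
```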
Hey,
with the latest version, 9.3, you can use COPY FROM with a program as argument.
I had a similar problem (an extra space at the end of each line), so I used sed in a pipeline to remove it before feeding it to the database.
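A minimal sketch of that 9.3 feature for the trailing-tab case (database, table, and file names are placeholders; the `\t` escape assumes GNU sed):

```shell
# The real import would run the filter server-side, e.g. (placeholder names):
#   psql -d mydb -c "COPY mytable FROM PROGRAM 'sed -e \"s/\t\$//\" /path/to/data.txt' (FORMAT text, NULL '')"
# The filter itself just deletes one trailing tab per line:
printf 'a\tb\t\nc\td\n' | sed -e 's/\t$//'
```

Note that PROGRAM runs on the database server, not the client, so the file path must be readable there.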
Rémi-C
2014-04-30 10:47 GMT+02:00 Willy-Bas Loos <willybas@gmail.com>:
Yes, Karsten, maybe scripting would be the easiest way to fix this, i would then probably insert it line by line. That's a possibility. It might take a long time to run, but that's not much of an issue.
Alberto, it would be hard to use sed (s/^I$//), because there is no explicit NULL value and there are many NULLs in the last column.
So i can't be sure how many tabs should be at the end of each line.
--
Willy-Bas Loos
On Wed, Apr 30, 2014 at 11:03 AM, Alberto Cabello Sánchez <alberto@unex.es> wrote:
What about using "cut" to strip the extra fields?
Wow, I didn't know "cut", this looks promising.
Thanks.
--
Willy-Bas Loos
On Wed, Apr 30, 2014 at 11:06 AM, Rémi Cura <remi.cura@gmail.com> wrote:
If you have an older version you can use the mkfifo command to "trick" postgres into believing it will copy from a file.
With the latest version, 9.3, you can use COPY FROM with a program as argument.
You can use the unix | pipe for most of the stuff.
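A sketch of the mkfifo trick for pre-9.3 servers (all names are placeholders). On a real import the reader would be psql, e.g. `psql -d mydb -c "COPY mytable FROM '/tmp/clean_data' (FORMAT text, NULL '')"`; here `cat` stands in for psql to show the mechanics:

```shell
# Create a named pipe, feed cleaned data into it from a background
# process, and let the reader consume it as if it were a regular file.
mkfifo /tmp/clean_data
printf 'a\tb\t\nc\td\n' | sed -e 's/\t$//' > /tmp/clean_data &
cat /tmp/clean_data
rm /tmp/clean_data
```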
Hi, you're right, I can see the new feature in the docs.
But since I am working from the (bash) command line already, I don't see any use for this.
I already pass the data like this:
cat <file> |psql -c "copy <table> from stdin NULL ''" <db> > <file>.log 2>&1
It's especially handy if you want to stay in SQL (transaction), I guess.
Cheers,
Cheers,
--
Willy-Bas Loos
On 04/30/2014 03:50 AM, Willy-Bas Loos wrote:
> On Wed, Apr 30, 2014 at 11:06 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>> If you have an older version you can use the mkfifo command to "trick"
>> postgres into believing it will copy from a file.
>> With the latest version, 9.3, you can use COPY FROM with a program as argument.
>> You can use the unix | pipe for most of the stuff.
>
> Hi, you're right i can see the new feature in the docs.
> But since i am working from the (bash) command line already, i don't see any use for this.
> I already pass the data like this:
> cat <file> |psql -c "copy <table> from stdin NULL ''" <db> > <file>.log 2>&1
> It's especially handy if you want to stay in sql(transaction), i gues.
> Cheers,
> --
> Willy-Bas Loos

Unless you know this is the only problem (extra/missing tab) I would triage the file with a scripting language. One favourite check is:

awk '{a[NF]++} END {print "Field count : Record count"; for (i in a) {printf "%11d : %d\n", i, a[i]}}'

If you only have the tab problem you will get two lines of output with field count = N, N-1. Take care in setting the field separator.
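A runnable sketch of that check on made-up data, with the field separator set to a tab explicitly — the caveat about the separator matters, because with awk's default FS a trailing tab is stripped as whitespace and the extra field would go unnoticed:

```shell
# Count how many records have each field count; output is sorted because
# awk's for-in iteration order is unspecified.
printf 'a\tb\tc\nd\te\tf\t\ng\th\ti\n' |
awk -F'\t' '{a[NF]++} END {for (i in a) printf "%d fields: %d records\n", i, a[i]}' |
sort
```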
On Wed, Apr 30, 2014 at 1:07 AM, Willy-Bas Loos <willybas@gmail.com> wrote:
> Hi,
> I have a 56GB textfile that i want to import into postgres.
> The file is tab delimited and not quoted.
Would Pgloader be an option? http://tapoueh.org/pgloader/
On Thu, May 1, 2014 at 1:30 AM, bricklen <bricklen@gmail.com> wrote:
> On Wed, Apr 30, 2014 at 1:07 AM, Willy-Bas Loos <willybas@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a 56GB textfile that i want to import into postgres.
>> The file is tab delimited and not quoted.
>
> Would Pgloader be an option? http://tapoueh.org/pgloader/

Or pg_bulkload? Contrary to the in-core COPY, it can filter "incorrect" tuple data at the line level.
--
Michael