Discussion: free(): invalid pointer when restoring data to a new server
Hi
I have a problem with a new server that crashes when restoring data to it. It's around 1.5 TB of raw data, both spatial and raster.
I use Postgres 12.4 and POSTGIS="3.0.2 2fb2a18" [EXTENSION] PGSQL="120" GEOS="3.8.1-CAPI-1.13.3" SFCGAL="1.3.7" PROJ="7.1.1" GDAL="GDAL 3.1.3, released 2020/09/01" LIBXML="2.9.7" LIBJSON="0.13.1" LIBPROTOBUF="1.3.0" WAGYU="0.4.3 (Internal)" TOPOLOGY RASTER
(I have also tested with PostGIS 3.1 and the latest from master.)
For the OS I use CentOS Linux release 8.2.2004 (Core) with kernel 4.18.0-193.19.1.el8_2.x86_64, on a server with 64 “Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz” cores.
First I restored 3 schemas with no errors in the logs:
388G data/sl/schema_grl_met_wisline_v2
49G data/sl/schema_met_vaer_dyrkbar
118G data/sl/schema_met_vaer_wisline
using these commands:
pg_restore -j 50 --format=d -d sl data/sl/schema_grl_met_wisline_v2/ 2> log/sl/restor_grl_met_wisline_v2.log
pg_restore -j 50 --format=d -d sl data/sl/schema_met_vaer_wisline/ 2> log/sl/restor_met_vaer_wisline.log
pg_restore -j 50 --format=d -d sl data/sl/schema_met_vaer_dyrkbar/ 2> log/sl/restor_met_vaer_dyrkbar.log
Then I started restoring the rest of the data below.
744G data/sl/schema_sl_rest_ofd
Here I split the restore job into 3 parts, as below.
Command 1 below went OK, with no errors in the client log file or on the server side.
pg_restore --section=pre-data -1 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_pre_data.log
Command 2 below went OK according to the client-side logs, but not according to the server-side logs, and the last table, with more than one billion rows, is empty.
pg_restore --section=data -j 50 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_data.log
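After a parallel restore that "succeeds" on the client side, one way to spot silently empty tables is to compare planner row estimates. A hedged sketch: the database name "sl" comes from the commands above, and autovacuum/ANALYZE must have run for n_live_tup to be meaningful.

```shell
# Sketch: list the 20 "emptiest" user tables by planner estimate, so a
# silently-truncated billion-row table stands out. Assumes database "sl".
psql -d sl -Atc "
  SELECT schemaname || '.' || relname, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_live_tup ASC
  LIMIT 20;"
```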
I do not understand why I do not have any errors in the client log file, because on the server side I see this:
< [local] postgres, sl, 2020-10-22 00:52:47.823 CEST >LOG: duration: 3417759.209 ms statement: COPY grl_sk.dtm10_norge_retiled (rid, rast, filename) FROM stdin;
< [local] postgres, sl, 2020-10-22 00:56:05.771 CEST >LOG: duration: 3615706.705 ms statement: COPY sl_sos.lsk_arome_v1 (flateid, point_uid_ref, epoch_hours, tg_l03, tg_l04, tswi_l03, tswi_l04, tg_l05, tswi_l05, pr, t2m, sfcwindl, snw, twpr) FROM stdin;
free(): invalid pointer
(the line above is repeated 27 times in the log)
< , , 2020-10-22 00:56:05.830 CEST >LOG: server process (PID 516912) was terminated by signal 6: Aborted
< , , 2020-10-22 00:56:05.830 CEST >LOG: terminating any other active server processes
< , , 2020-10-22 00:56:05.830 CEST >WARNING: terminating connection because of crash of another server process
< , , 2020-10-22 00:56:05.830 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< , , 2020-10-22 00:56:05.830 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
(the WARNING/DETAIL/HINT block above is repeated for each terminated backend)
< , , 2020-10-22 00:56:10.892 CEST >LOG: all server processes terminated; reinitializing
< , , 2020-10-22 00:56:11.324 CEST >LOG: database system was interrupted; last known up at 2020-10-22 00:52:33 CEST
< , , 2020-10-22 00:56:11.480 CEST >LOG: database system was not properly shut down; automatic recovery in progress
< , , 2020-10-22 00:56:11.521 CEST >LOG: redo starts at B9C/EA83BBF0
< , , 2020-10-22 00:56:35.074 CEST >LOG: redo done at BA0/FDFFEF48
< , , 2020-10-22 00:56:51.250 CEST >LOG: database system is ready to accept connections
I don't care about the last table, we don't need that data, so I ran the last command below.
pg_restore --section=post-data -j 1 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_post_data.log
Then I saw this in the logs, so I had to run the above command multiple times.
First time :
< [local] postgres, sl, 2020-10-22 03:46:27.349 CEST >LOG: duration: 1746.370 ms statement: CREATE INDEX o_8_dtm10_norge_retiled_st_convexhull_idx ON grl_sk.o_8_dtm10_norge_retiled USING gist (public.st_convexhull(rast));
free(): invalid pointer
free(): invalid pointer
< , , 2020-10-22 03:46:30.283 CEST >LOG: background worker "parallel worker" (PID 658814) was terminated by signal 6: Aborted
< , , 2020-10-22 03:46:30.283 CEST >LOG: terminating any other active server processes
< [local] postgres, sl, 2020-10-22 03:46:30.283 CEST >WARNING: terminating connection because of crash of another server process
< [local] postgres, sl, 2020-10-22 03:46:30.283 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< [local] postgres, sl, 2020-10-22 03:46:30.283 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< , , 2020-10-22 03:46:30.283 CEST >WARNING: terminating connection because of crash of another server process
< , , 2020-10-22 03:46:30.283 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< , , 2020-10-22 03:46:30.283 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< , , 2020-10-22 03:46:30.599 CEST >LOG: all server processes terminated; reinitializing
< , , 2020-10-22 03:46:31.019 CEST >LOG: database system was interrupted; last known up at 2020-10-22 03:41:23 CEST
< , , 2020-10-22 03:46:31.178 CEST >LOG: database system was not properly shut down; automatic recovery in progress
< , , 2020-10-22 03:46:31.224 CEST >LOG: redo starts at BAA/B07178F8
< , , 2020-10-22 03:46:37.209 CEST >LOG: redo done at BAC/7F3DFD70
Second time :
< [local] postgres, sl, 2020-10-22 08:13:08.882 CEST >LOG: duration: 417316.937 ms statement: CREATE INDEX geoidx_org_forestmap____ostl3_predictions ON org_forestmap.ostl3_predictions USING gist (geo);
< [local] munin, template1, 2020-10-22 08:13:36.797 CEST >FATAL: role "munin" does not exist
< [local] postgres, sl_help, 2020-10-22 08:13:39.106 CEST >FATAL: database "sl_help" does not exist
(the two FATAL lines above, from unrelated monitoring connections, repeat several times between 08:13 and 08:18)
free(): invalid pointer
free(): invalid pointer
< , , 2020-10-22 08:19:58.932 CEST >LOG: background worker "parallel worker" (PID 835249) was terminated by signal 6: Aborted
< , , 2020-10-22 08:19:58.932 CEST >LOG: terminating any other active server processes
< [local] postgres, sl, 2020-10-22 08:19:58.932 CEST >WARNING: terminating connection because of crash of another server process
< [local] postgres, sl, 2020-10-22 08:19:58.932 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< [local] postgres, sl, 2020-10-22 08:19:58.932 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< , , 2020-10-22 08:19:58.932 CEST >WARNING: terminating connection because of crash of another server process
< , , 2020-10-22 08:19:58.932 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< , , 2020-10-22 08:19:58.932 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.932 CEST >WARNING: terminating connection because of crash of another server process
< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.932 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.932 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.970 CEST >FATAL: the database system is in recovery mode
< , , 2020-10-22 08:19:59.041 CEST >LOG: all server processes terminated; reinitializing
< , , 2020-10-22 08:19:59.476 CEST >LOG: database system was interrupted; last known up at 2020-10-22 08:11:17 CEST
< , , 2020-10-22 08:20:01.901 CEST >LOG: database system was not properly shut down; automatic recovery in progress
< , , 2020-10-22 08:20:02.005 CEST >LOG: redo starts at BB1/AA4309A0
< , , 2020-10-22 08:20:03.517 CEST >LOG: invalid record length at BB1/FBF33FB0: wanted 24, got 0
< , , 2020-10-22 08:20:03.517 CEST >LOG: redo done at BB1/FBF33F78
< , , 2020-10-22 08:20:05.573 CEST >LOG: database system is ready to accept connections
Any idea why this happens and what I can do to fix it?
Do you believe Postgres 13 will work better?
Sent: Thursday, October 22, 2020 9:32 AM
To: pgsql-admin@lists.postgresql.org <pgsql-admin@lists.postgresql.org>
Subject: free(): invalid pointer , when to dumping data to a new server
Lars
$ yum search debuginfo | grep postgres
Last metadata expiration check: 0:17:59 ago on Mon 26 Oct 2020 10:53:03 AM CET.
postgresql12-odbc-debuginfo.x86_64 : Debug information for package postgresql12-odbc
On Mon, 2020-10-26 at 10:22 +0000, Lars Aksel Opsahl wrote:
> I have done some more testing and postgres also crashes right after 'pg_restore --section=pre-data -1'
> is done. Here I also see the 'free(): invalid pointer' in the postgres log.
>
> This happens with Postgres 12 and Postgres 13, and any version of postgis I have tested with.
>
> The pre-data input is a little more than 200,000 lines with more than 6,000 CREATE statements;
> if I test with the first 100 lines it does not crash. It seems like no data are lost here because
> the server does recovery at startup.
>
> I tested on another server with CentOS and the same software and it worked OK.
>
> The server diagnostics do not indicate any memory errors.
>
> Dnf search does not give me any debuginfo file for postgres.

You'll have to edit "/etc/yum.repos.d/pgdg-fedora-all.repo" and set the
"PostgreSQL 13 for Fedora $releasever - $basearch - Debuginfo" repository to "enabled=1".

Then you can install the debuginfo packages and generate a stack trace.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
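Laurenz's instructions can be sketched as a couple of shell commands. The file path, section match, and package name below are assumptions (on CentOS 8 the file is typically pgdg-redhat-all.repo rather than the Fedora one), so verify them on the actual system first.

```shell
# Hedged sketch: flip enabled=0 to enabled=1 in the Debuginfo repo section,
# then install the debug symbols for the server package.
# File path, section match, and package name are assumptions; check first.
sudo sed -i '/Debuginfo/,/^$/ s/enabled=0/enabled=1/' \
    /etc/yum.repos.d/pgdg-redhat-all.repo
sudo dnf install postgresql12-debuginfo
```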
Hi
Lars
On 2020-Oct-27, Lars Aksel Opsahl wrote:
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/pgsql-12/bin/postmaster...Reading symbols from /usr/lib/debug/usr/pgsql-12/bin/postgres-12.4-1PGDG.rhel8.x86_64.debug...done.
> done.
> BFD: warning: /var/lib/systemd/coredump/core.postmaster.26.6f7d98107f544085a41c6f8b876418d0.298120.1603810048000000 is truncated: expected core file size >= 2326896640, found: 2147483648
>
> warning: core file may not match specified executable file.

Uh :-(  The core file would have been too large, so the system wrote a truncated copy, with the result that GDB cannot interpret it correctly.

I don't know if you can raise the 2TB limit on core files, but one thing you can do is change the "core filter" so that it doesn't dump shared memory, which is often where the bloat comes from. This will result in a much smaller core file, but still useful to find the problem. You can do this:

echo 0x01 > /proc/$pid/coredump_filter

where the PID can be the postmaster's, before running pg_dump.
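Alvaro's advice can be put together as a short sketch. The data directory shown is an assumption; the first line of postmaster.pid is the postmaster's PID.

```shell
# Sketch: drop shared-memory mappings from future core dumps of the postmaster
# and its children. Bit 0x01 keeps only anonymous private memory; the full
# bit map is documented in core(5).
PGDATA=/var/lib/pgsql/12/data              # assumed data directory
PID=$(head -1 "$PGDATA/postmaster.pid")    # first line is the postmaster PID
echo 0x01 | sudo tee "/proc/$PID/coredump_filter"
# coredump_filter is inherited by backends forked after this point.
```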
Lars Aksel Opsahl <Lars.Opsahl@nibio.no> writes:
> Here is the new stack trace, not too much info that helps me though.

This is probably the reason why "bt" is producing garbage:

> BFD: warning: /var/lib/systemd/coredump/core.postmaster.26.6f7d98107f544085a41c6f8b876418d0.298120.1603810048000000 is truncated: expected core file size >= 2326896640, found: 2147483648

So the core file is being truncated at 2GB, but the process image was a little bit larger than that. You need to get a complete core file.

First thing I'd check is the "ulimit -c" setting that the postmaster is being started under. I see that systemd is capturing the core dump, so it could also be that there's some systemd configuration you need to adjust ("man systemd-coredump" might help here).

regards, tom lane
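Tom's two checks can be sketched as below. ProcessSizeMax and ExternalSizeMax are real coredump.conf(5) settings (on many systems they default to 2G, which may match the truncation at 2147483648 bytes seen above); the 8G values and the data directory path are just examples.

```shell
# 1. What core-file limit is the postmaster actually running under?
PID=$(head -1 /var/lib/pgsql/12/data/postmaster.pid)   # assumed data directory
grep 'core file' "/proc/$PID/limits"

# 2. Raise systemd-coredump's own size caps so large cores are not truncated.
sudo mkdir -p /etc/systemd/coredump.conf.d
printf '[Coredump]\nProcessSizeMax=8G\nExternalSizeMax=8G\n' | \
  sudo tee /etc/systemd/coredump.conf.d/large-cores.conf
```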
Lars Aksel Opsahl <Lars.Opsahl@nibio.no> writes:
> (gdb) bt
> #0  0x00007fa9c241e70f in raise () from /lib64/libc.so.6
> #1  0x00007fa9c2408b25 in abort () from /lib64/libc.so.6
> #2  0x00007fa9c2461897 in __libc_message () from /lib64/libc.so.6
> #3  0x00007fa9c2467fdc in malloc_printerr () from /lib64/libc.so.6
> #4  0x00007fa9c24698dc in _int_free () from /lib64/libc.so.6
> #5  0x00007f992fe099f9 in osgeo::proj::common::UnitOfMeasure::~UnitOfMeasure() () from /lib64/libproj.so.15
> #6  0x00007fa9c2420e9c in __run_exit_handlers () from /lib64/libc.so.6
> #7  0x00007fa9c2420fd0 in exit () from /lib64/libc.so.6
> #8  0x000000000075beb0 in proc_exit (code=code@entry=0) at ipc.c:152
> #9  0x000000000077fd21 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x182b6d8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4455
> #10 0x00000000007085a1 in BackendRun (port=0x1819b90, port=0x1819b90) at postmaster.c:4448
> #11 BackendStartup (port=0x1819b90) at postmaster.c:4139
> #12 ServerLoop () at postmaster.c:1704
> #13 0x00000000007094d0 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x17e7800) at postmaster.c:1377
> #14 0x0000000000482c7e in main (argc=3, argv=0x17e7800) at main.c:228

OK, I'd say that definitely puts the bug outside of core Postgres. It looks like proj has installed an atexit() handler that is trying to free something that was already freed. This might be proj's fault directly, or perhaps a caller did something wrong (a likely bet, perhaps, is using palloc for something that proj expects to survive till program exit).

Probably the next step is to take this to the postgis support lists and see what they think about it.

regards, tom lane
On Tue, Oct 27, 2020 at 5:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [backtrace trimmed; same as above]
>
> OK, I'd say that definitely puts the bug outside of core Postgres.
> It looks like proj has installed an atexit() handler that is trying
> to free something that was already freed. This might be proj's
> fault directly, or perhaps a caller did something wrong (a likely
> bet, perhaps, is using palloc for something that proj expects to
> survive till program exit).
>
> Probably the next step is to take this to the postgis support lists
> and see what they think about it.

FYI, looking at this one in a bit more detail, it looks like it's standard C++ destructors -- they're added to the list of exit handlers automatically when C++ is used, AIUI.
Still leaves the problem with postgis/proj of course, but as a noteworthy detail, it's not as simple as an explicit atexit().

However, mostly this looks like a packaging issue on RHEL8, or more likely a combination of a packaging issue and some updated packages that somehow trigger a new issue. It seems to load multiple versions of some of the required C++ libraries at the same time, and I bet they're conflicting with each other.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
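Magnus's "multiple library versions loaded at once" theory can be checked from outside the process. A hedged sketch: the PID is a placeholder for a live backend, and the library names are just the likely suspects.

```shell
# Sketch: list proj / C++ runtime libraries mapped into a running backend.
# PID is a placeholder; use a real backend pid (e.g. from pg_stat_activity).
PID=12345
grep -oE '/[^ ]*lib(proj|stdc\+\+|gcc_s)[^ ]*' "/proc/$PID/maps" | sort -u
# Two different libproj.so.* paths in the output would support the theory
# that conflicting versions are loaded into the same process.
```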
Hi,

> FYI, looking at this one in a bit more detail it looks like it's
> standard C++ destructors -- they're added to the list of exit handlers
> automatically when C++ is used, AIUI.
>
> Still leaves the problem with postgis/proj of course, but as a
> noteworthy detail. It's not as easy as an explicit atexit().

Yes, there has to be a bug leading to the memory already having been freed or such. Probably worth trying to run this under valgrind or such.

Andres