Discussion: Segmentation Fault
PostgreSQL v15.4
Running on Ubuntu Jammy
ARM64 x8
64 GB RAM
I have been trying to tune the performance of a query and wasn't having much luck. I decided to run VACUUM ANALYZE to see if that would help.
The Postgres server crashed.
I then went through the tables individually and ran VACUUM ANALYZE on each one. That completed successfully.
I ran my query to see if that had helped performance. Segmentation fault. The server restarted itself and came back to life. Reran the query and it crashed again. Every time I run the query I get the same segmentation fault. Different queries run just fine.
I tried attaching GDB to the PID that the log file identifies as the main startup PID. That seems to have worked, in that it shows various libraries being loaded. I'm not sure whether that is sufficient; I'm pretty sure it isn't.
I set up GDB following DebuggingProgramCrash - Ubuntu Wiki, which says, "If you are on Ubuntu Jammy or later, you do not need to worry about installing debug symbol packages anymore." I am running Jammy, so that should be covered.
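(For what it's worth, the "no debug symbol packages" behavior on Jammy relies on gdb fetching symbols on demand from Ubuntu's debuginfod server. A minimal sketch of making sure that is active; the URL is the standard Ubuntu one, but verify for your release:)

```shell
# Point gdb at Ubuntu's debuginfod server so debug symbols are downloaded
# on demand instead of from -dbgsym packages.
export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com"

# Inside gdb it can also be toggled explicitly:
#   (gdb) set debuginfod enabled on

echo "$DEBUGINFOD_URLS"
```

If frames still show as `?? ()`, symbols were not actually resolved, which is worth checking before trusting any backtrace.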
I then tried getting a Valgrind trace from PostgreSQL. I used "systemctl stop postgresql" to stop the service, then launched Valgrind using the command line below my sig. Everything worked fine, except the query took 10x as long as it had before I started the VACUUM ANALYZE.
So I restarted the service, reran my query, and it failed with the same log entries showing a Signal 11: Segmentation fault.
Running it inside of Valgrind stopped the error from happening which is strange to me.
What can I do to provide enough information to figure out why this is crashing?
rjsjr
Valgrind Command
valgrind --leak-check=no --gen-suppressions=all \
--suppressions=src/tools/valgrind.supp --time-stamp=yes \
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
--log-file=/datadrive/%p.log --trace-children=yes \
/usr/lib/postgresql/15/bin/postgres --config-file=/etc/postgresql/15/main/postgresql.conf \
--log_line_prefix="%m %p " \
--log_statement=all --shared_buffers=64MB 2>&1 | tee /datadrive/postmaster.log
GDB Backtrace
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffffb9c6e844 in __GI___select (nfds=9, readfds=0xffffd0849280, writefds=0x0, exceptfds=0x0, timeout=0xffffd08491e8) at ../sysdeps/unix/sysv/linux/select.c:69
69 ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
(gdb) bt
#0 0x0000ffffb9c6e844 in __GI___select (nfds=9, readfds=0xffffd0849280, writefds=0x0, exceptfds=0x0, timeout=0xffffd08491e8) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x0000aaaab91b4a1c in ?? ()
#2 0x0000aaaab91b63cc in PostmasterMain ()
#3 0x0000aaaab8f053d8 in main ()
PostgreSQL Log File
2023-09-13 20:20:21.198 UTC [295031] LOG: starting PostgreSQL 15.4 (Ubuntu 15.4-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
2023-09-13 20:20:21.198 UTC [295031] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-09-13 20:20:21.198 UTC [295031] LOG: listening on IPv6 address "::", port 5432
2023-09-13 20:20:21.203 UTC [295031] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-13 20:20:21.218 UTC [295037] LOG: database system was shut down at 2023-09-13 20:20:20 UTC
2023-09-13 20:20:21.228 UTC [295031] LOG: database system is ready to accept connections
2023-09-13 20:21:21.604 UTC [295031] LOG: server process (PID 295102) was terminated by signal 11: Segmentation fault
2023-09-13 20:21:21.604 UTC [295031] DETAIL: Failed process was running: --explain analyze
select
a.equipmenttype
, a.technology
, count(*)
from
newfaultentrylineitems a
where
a.databaseguid = '9a277d83-30eb-4736-99f2-d028ff84611e'
and a.assessmentyearmonth >= 202209
and a.assessmentyearmonth < 202309
group by a.equipmenttype, a.technology
order by count(*) desc
2023-09-13 20:21:21.604 UTC [295031] LOG: terminating any other active server processes
2023-09-13 20:21:21.610 UTC [295031] LOG: all server processes terminated; reinitializing
2023-09-13 20:21:22.074 UTC [295109] LOG: database system was interrupted; last known up at 2023-09-13 20:20:21 UTC
2023-09-13 20:21:22.137 UTC [295112] FATAL: the database system is in recovery mode
2023-09-13 20:21:22.187 UTC [295109] LOG: database system was not properly shut down; automatic recovery in progress
2023-09-13 20:21:22.195 UTC [295109] LOG: redo starts at 17/6AFACF40
2023-09-13 20:21:22.195 UTC [295109] LOG: invalid record length at 17/6AFACF78: wanted 24, got 0
2023-09-13 20:21:22.195 UTC [295109] LOG: redo done at 17/6AFACF40 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-09-13 20:21:22.208 UTC [295110] LOG: checkpoint starting: end-of-recovery immediate wait
2023-09-13 20:21:22.383 UTC [295113] FATAL: the database system is not yet accepting connections
2023-09-13 20:21:22.383 UTC [295113] DETAIL: Consistent recovery state has not been yet reached.
2023-09-13 20:21:22.428 UTC [295110] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.199 s, sync=0.004 s, total=0.224 s; sync files=2, longest=0.003>
2023-09-13 20:21:22.435 UTC [295031] LOG: database system is ready to accept connections
2023-09-13 20:40:24.416 UTC [295031] LOG: server process (PID 295140) was terminated by signal 11: Segmentation fault
2023-09-13 20:40:24.416 UTC [295031] DETAIL: Failed process was running: --explain analyze
select
a.equipmenttype
, a.technology
, count(*)
from
newfaultentrylineitems a
where
a.databaseguid = '9a277d83-30eb-4736-99f2-d028ff84611e'
and a.assessmentyearmonth >= 202209
and a.assessmentyearmonth < 202309
group by a.equipmenttype, a.technology
order by count(*) desc
2023-09-13 20:40:24.416 UTC [295031] LOG: terminating any other active server processes
2023-09-13 20:40:24.666 UTC [295523] FATAL: the database system is in recovery mode
2023-09-13 20:40:24.922 UTC [295524] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.182 UTC [295525] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.379 UTC [295526] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.574 UTC [295527] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.775 UTC [295528] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.977 UTC [295529] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.179 UTC [295530] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.377 UTC [295531] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.578 UTC [295532] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.777 UTC [295533] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.973 UTC [295534] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.166 UTC [295535] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.372 UTC [295536] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.568 UTC [295537] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.766 UTC [295538] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.961 UTC [295539] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.163 UTC [295540] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.389 UTC [295541] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.589 UTC [295542] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.784 UTC [295543] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.978 UTC [295544] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.114 UTC [295031] LOG: issuing SIGKILL to recalcitrant children
2023-09-13 20:40:29.178 UTC [295545] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.403 UTC [295546] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.603 UTC [295547] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.798 UTC [295548] FATAL: the database system is in recovery mode
2023-09-13 20:40:30.017 UTC [295549] FATAL: the database system is in recovery mode
2023-09-13 20:40:30.214 UTC [295550] FATAL: the database system is in recovery mode
2023-09-13 20:40:58.414 UTC [295031] LOG: all server processes terminated; reinitializing
2023-09-13 20:40:58.876 UTC [295553] LOG: database system was interrupted; last known up at 2023-09-13 20:21:22 UTC
2023-09-13 20:40:58.986 UTC [295553] LOG: database system was not properly shut down; automatic recovery in progress
2023-09-13 20:40:58.992 UTC [295553] LOG: redo starts at 17/6AFACFF0
2023-09-13 20:40:58.992 UTC [295553] LOG: invalid record length at 17/6AFAD028: wanted 24, got 0
2023-09-13 20:40:58.992 UTC [295553] LOG: redo done at 17/6AFACFF0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-09-13 20:40:59.006 UTC [295554] LOG: checkpoint starting: end-of-recovery immediate wait
2023-09-13 20:40:59.228 UTC [295554] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.198 s, sync=0.004 s, total=0.228 s; sync files=2, longest=0.003>
2023-09-13 20:40:59.237 UTC [295031] LOG: database system is ready to accept connections
I realized that when I was running GDB I had just taken a snapshot of the moment it failed. Here I've added the output after continuing and requesting a backtrace.
rjsjr
Continuing.
[Detaching after fork from child process 297277]
[Detaching after fork from child process 297278]
[Detaching after fork from child process 297279]
[Detaching after fork from child process 297280]
[Detaching after fork from child process 297281]
[Detaching after fork from child process 297282]
[Detaching after fork from child process 297283]
[Detaching after fork from child process 297284]
[Detaching after fork from child process 297286]
[Detaching after fork from child process 297287]
[Detaching after fork from child process 297289]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) bt
#0 0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x0000aaaab8124a1c in ?? ()
#2 0x0000aaaab81263cc in PostmasterMain ()
#3 0x0000aaaab7e753d8 in main ()
(gdb) c
Continuing.
[Detaching after fork from child process 297301]
[Detaching after fork from child process 297304]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) bt
#0 0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x0000aaaab8124a1c in ?? ()
#2 0x0000aaaab81263cc in PostmasterMain ()
#3 0x0000aaaab7e753d8 in main ()
(gdb) c
Continuing.
[Detaching after fork from child process 297306]
[Detaching after fork from child process 297311]
[Detaching after fork from child process 297320]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297330]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297332]
[Detaching after fork from child process 297333]
[Detaching after fork from child process 297334]
[Detaching after fork from child process 297335]
[Detaching after fork from child process 297336]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297337]
[Detaching after fork from child process 297338]
[Detaching after fork from child process 297339]
[Detaching after fork from child process 297344]
[Detaching after fork from child process 297347]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297351]
[Detaching after fork from child process 297352]
[Detaching after fork from child process 297353]
[Detaching after fork from child process 297354]
[Detaching after fork from child process 297355]
Robert Sanford <wobbet@gmail.com> writes:
> I realized that when I was running GDB I just took a snap of when it
> failed. Here I've added after continuing and requesting a backtrace.

This is still quite unhelpful; it appears to show normal behavior in
a process that's not the failing one. There is some advice about
collecting useful backtraces at

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

I'd counsel the method of enabling core dumps and then attaching
gdb to a core file, rather than dealing with live processes.

regards, tom lane
On Wed, 2023-09-13 at 16:53 -0500, Robert Sanford wrote:
> The Postgres server crashed.

The PostgreSQL log file may contain interesting information.

Yours,
Laurenz Albe
Woke up this morning excited to create a core dump and... it worked.
Not only does it work, it works fast. Query time dropped from 5.5s to 1.2s. That was what I was hoping to get out of the VACUUM ANALYZE.
Very confused. I don't get it. I hope it doesn't happen again, but if it does I'll get the core dump.
rjsjr
And now it's happening again...
I'm trying to get a core dump but I can't seem to actually generate one.
When I try to set ulimit -c from the command line, I get a message that I'm not allowed to do that. So I went in and updated /etc/security/limits.conf for the postgres user as follows.
postgres hard core unlimited
postgres soft core unlimited
I then restarted the postgres server. When I look at the limits in the pid folder I see...
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
My core_pattern is set to "/tmp/core-%e-%s-%u-%g-%p-%t"
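(For reference, the specifiers in that pattern come from core(5): %e is the executable name, %s the signal number, %u the real UID, %g the real GID, %p the PID, and %t a UNIX timestamp. The value the kernel is actually using can be confirmed with:)

```shell
# Show the core_pattern the kernel will use when a process dumps core.
cat /proc/sys/kernel/core_pattern
```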
I run the query that forces the crash, but there is no core file in any of:
/tmp/
/etc/postgresql/15
/etc/postgresql/15/main
/var/lib/postgresql/15
/var/lib/postgresql/15/main
/usr/lib/postgresql/15
/usr/lib/postgresql/15/lib
/usr/lib/postgresql/15/bin
GDB (more below) is even able to detect that there's a segmentation fault.
What am I doing wrong? Why can't I get a core file anywhere?
rjsjr
From the GDB shell... which looks identical to what's in the log file...
(gdb) set pagination off
(gdb) set logging file /datadrive/pgdebug.txt
(gdb) set logging on
Warning: 'set logging on', an alias for the command 'set logging enabled', is deprecated.
Use 'set logging enabled on'.
Copying output to /datadrive/pgdebug.txt.
Copying debug output to /datadrive/pgdebug.txt.
(gdb) b errfinish
Breakpoint 1 at 0xaaaabef843b8
(gdb) cont
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
(gdb) bt
#0 0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) bt
#0 0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) cont
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
(gdb) bt
#0 0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) cont
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000fff7add7a000 in ?? ()
(gdb) bt
#0 0x0000fff7add7a000 in ?? ()
#1 0x0000aaaacc627ca0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) cont
Continuing.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb)
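(One way to check what limit the *running* postmaster actually has is its /proc limits file. Shown here against the current shell's PID so it runs anywhere; substitute the PID from postmaster.pid for the real check:)

```shell
# "Max core file size" must be non-zero in the *target process's* limits,
# not just in your login shell. $$ is used here as a stand-in PID.
grep -i "max core file size" /proc/$$/limits
```

If that line still shows a soft limit of 0 for the postmaster, no core file will ever be written regardless of core_pattern.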
On Thu, Sep 14, 2023, at 4:29 PM, Robert Sanford wrote:
> And now it's happening again... I'm trying to get a core dump but I
> can't seem to actually generate one. [...]
Are you using systemd? If so, execute
sudo systemctl show -p LimitCORE postgresql@15-main.service
(Replace service name accordingly.)
If it is not showing infinity as value, you need to edit the service file using
the following command:
sudo systemctl edit postgresql@15-main.service
and add
[Service]
LimitCORE=infinity
Restart the service after it and check again with systemctl show command.
You should also check where the operating system is saving the core files.
cat /proc/sys/kernel/core_pattern
If it says core, it means it is storing it in PGDATA directory. Some OSes can
use a command/script to send the core files to another location.
Robert Sanford <wobbet@gmail.com> writes:
> I'm trying to get a core dump but I can't seem to actually generate one.
> ...
> I then restarted the postgres server. When I look at the limits in the pid
> folder I see...
> Limit Soft Limit Hard Limit Units
> Max cpu time unlimited unlimited seconds
> Max file size unlimited unlimited bytes
> Max data size unlimited unlimited bytes
> Max stack size 8388608 unlimited bytes
> Max core file size 0 unlimited bytes

This shows that you are now allowed to set "ulimit -c unlimited",
but you have not actually done so, at least not in the session that
you're using to inspect these settings.

If you are starting the server "by hand" (from a manual shell
command), it should work to do "ulimit -c unlimited" before issuing
the command to start the server. If you are using infrastructure such
as systemd or initscripts, then you'll need to figure out how to get
it to set the ulimit in the environment of the server process.

BTW, have you tried disabling JIT to see if that stops the crashes?

regards, tom lane
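(Both suggestions — raising the core limit for a manual start, and disabling JIT — can be sketched as below. The psql line assumes superuser access; `jit` is a standard PostgreSQL 11+ parameter:)

```shell
# For a manual start: raise the core limit in the SAME shell that will
# launch postgres, then verify it took effect.
ulimit -c unlimited
ulimit -c    # should print "unlimited"

# To rule out JIT without restarting the server:
#   psql -c "ALTER SYSTEM SET jit = off;" -c "SELECT pg_reload_conf();"
```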
Turning JIT off seems to have solved the problem so far. But in the past I've gone a couple of hours without it happening and then it starts again, so I'm still focused on trying to get a core dump.
Here are some of my notes from trying to get one last night...
I typically launch postgres with systemctl but have also tested running it manually. If I get another failure, I will try adding the ulimit call before the next manual run. If I don't get another failure, I'll re-enable JIT and try to induce it again.
I have 32GB of shared memory configured for PostgreSQL but only 10GB of drive space on my OS drive. I'm wondering if that's preventing the core dump from being written. I'm going to move my core dump location to my data drive, which has well over 100GB free...
When running systemctl list I notice that I have versions 12 and 14 of postgresql also running. I would hope that doesn't cause an issue but it also doesn't seem ideal. How do I remove them from systemd startup?
Running the systemctl show I got the following info...
LimitCPU=infinity
LimitCPUSoft=infinity
LimitFSIZE=infinity
LimitFSIZESoft=infinity
LimitDATA=infinity
LimitDATASoft=infinity
LimitSTACK=infinity
LimitSTACKSoft=8388608
LimitCORE=infinity
LimitCORESoft=0
LimitRSS=infinity
LimitRSSSoft=infinity
LimitNOFILE=524288
LimitNOFILESoft=1024
My pgdata directory is set in my postgresql.conf file as
data_directory = '/var/lib/postgresql/15/main'
PGDATA is not set in the environment.
My core_pattern is set to "/tmp/core-%e-%s-%u-%g-%p-%t"
The contents of that directory are
azureuser@kpidb-dev:/etc$ sudo ls -al /var/lib/postgresql/15/main
total 120
drwx------ 20 postgres postgres 4096 Sep 14 19:13 .
drwxr-xr-x 3 postgres postgres 4096 Aug 18 20:44 ..
-rw------- 1 postgres postgres 3 Aug 18 20:44 PG_VERSION
drwx------ 5 postgres postgres 4096 Aug 18 20:44 base
-rw------- 1 postgres postgres 44 Sep 14 19:13 current_logfiles
drwx------ 2 postgres postgres 4096 Sep 14 19:39 global
drwx------ 2 postgres postgres 4096 Sep 14 19:13 log
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_commit_ts
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_dynshmem
drwx------ 4 postgres postgres 4096 Sep 14 19:39 pg_logical
drwx------ 4 postgres postgres 4096 Aug 18 20:44 pg_multixact
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_notify
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_replslot
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_serial
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_snapshots
drwx------ 2 postgres postgres 4096 Sep 14 19:13 pg_stat
drwx------ 2 postgres postgres 4096 Aug 29 01:44 pg_stat_tmp
drwx------ 2 postgres postgres 4096 Sep 5 14:36 pg_subtrans
drwx------ 2 postgres postgres 4096 Aug 30 14:05 pg_tblspc
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_twophase
drwx------ 3 postgres postgres 20480 Sep 13 20:16 pg_wal
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_xact
-rw------- 1 postgres postgres 88 Aug 18 20:44 postgresql.auto.conf
-rw------- 1 postgres postgres 238 Sep 14 19:13 postmaster.opts
-rw------- 1 postgres postgres 101 Sep 14 19:39 postmaster.pid