Discussion: Segmentation Fault
PostgreSQL v15.4
Running on Ubuntu Jammy
ARM64 x8
64 GB RAM
I have been trying to tune the performance of a query and wasn't having much luck. I decided to run VACUUM ANALYZE to see if that would help.
The Postgres server crashed.
I then went through the tables individually and ran VACUUM ANALYZE on each one. That completed successfully.
I ran my query to see if that had helped performance. Segmentation fault. The server restarted itself and came back to life. Reran the query and it crashed again. Every time I run the query I get the same segmentation fault. Different queries run just fine.
I tried attaching GDB to the PID that the log file identifies as the main startup PID. That seems to have worked, in that it shows various libraries being loaded. I'm not sure whether that is sufficient; I'm pretty sure it isn't.
I set up GDB following DebuggingProgramCrash - Ubuntu Wiki, which says, "If you are on Ubuntu Jammy or later, you do not need to worry about installing debug symbol packages anymore." I am running Jammy, so that should be covered.
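(For what it's worth, the "no debug symbol packages" behavior on Jammy relies on gdb fetching symbols on demand from Ubuntu's debuginfod server. A minimal sketch of making sure that is active; the URL is the standard Ubuntu one, but verify for your release:)

```shell
# Point gdb at Ubuntu's debuginfod server so debug symbols are downloaded
# on demand instead of from -dbgsym packages.
export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com"

# Inside gdb it can also be toggled explicitly:
#   (gdb) set debuginfod enabled on

echo "$DEBUGINFOD_URLS"
```

If frames still show as `?? ()`, symbols were not actually resolved, which is worth checking before trusting any backtrace.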
I then tried getting a Valgrind trace from PostgreSQL. I used "systemctl stop postgresql" to stop the service, then launched Valgrind using the command line below my sig. Everything worked fine, except the query took 10x as long as it had before I started the VACUUM ANALYZE.
So I restarted the service, reran my query, and it failed with the same log entries showing a Signal 11: Segmentation fault.
Running it inside of Valgrind stopped the error from happening which is strange to me.
What can I do to provide enough information to figure out why this is crashing?
rjsjr
Valgrind Command
valgrind --leak-check=no --gen-suppressions=all \
--suppressions=src/tools/valgrind.supp --time-stamp=yes \
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
--log-file=/datadrive/%p.log --trace-children=yes \
/usr/lib/postgresql/15/bin/postgres --config-file=/etc/postgresql/15/main/postgresql.conf \
--log_line_prefix="%m %p " \
--log_statement=all --shared_buffers=64MB 2>&1 | tee /datadrive/postmaster.log
GDB Backtrace
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffffb9c6e844 in __GI___select (nfds=9, readfds=0xffffd0849280, writefds=0x0, exceptfds=0x0, timeout=0xffffd08491e8) at ../sysdeps/unix/sysv/linux/select.c:69
69 ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
(gdb) bt
#0 0x0000ffffb9c6e844 in __GI___select (nfds=9, readfds=0xffffd0849280, writefds=0x0, exceptfds=0x0, timeout=0xffffd08491e8) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x0000aaaab91b4a1c in ?? ()
#2 0x0000aaaab91b63cc in PostmasterMain ()
#3 0x0000aaaab8f053d8 in main ()
PostgreSQL Log File
2023-09-13 20:20:21.198 UTC [295031] LOG: starting PostgreSQL 15.4 (Ubuntu 15.4-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
2023-09-13 20:20:21.198 UTC [295031] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-09-13 20:20:21.198 UTC [295031] LOG: listening on IPv6 address "::", port 5432
2023-09-13 20:20:21.203 UTC [295031] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-13 20:20:21.218 UTC [295037] LOG: database system was shut down at 2023-09-13 20:20:20 UTC
2023-09-13 20:20:21.228 UTC [295031] LOG: database system is ready to accept connections
2023-09-13 20:21:21.604 UTC [295031] LOG: server process (PID 295102) was terminated by signal 11: Segmentation fault
2023-09-13 20:21:21.604 UTC [295031] DETAIL: Failed process was running: --explain analyze
select
a.equipmenttype
, a.technology
, count(*)
from
newfaultentrylineitems a
where
a.databaseguid = '9a277d83-30eb-4736-99f2-d028ff84611e'
and a.assessmentyearmonth >= 202209
and a.assessmentyearmonth < 202309
group by a.equipmenttype, a.technology
order by count(*) desc
2023-09-13 20:21:21.604 UTC [295031] LOG: terminating any other active server processes
2023-09-13 20:21:21.610 UTC [295031] LOG: all server processes terminated; reinitializing
2023-09-13 20:21:22.074 UTC [295109] LOG: database system was interrupted; last known up at 2023-09-13 20:20:21 UTC
2023-09-13 20:21:22.137 UTC [295112] FATAL: the database system is in recovery mode
2023-09-13 20:21:22.187 UTC [295109] LOG: database system was not properly shut down; automatic recovery in progress
2023-09-13 20:21:22.195 UTC [295109] LOG: redo starts at 17/6AFACF40
2023-09-13 20:21:22.195 UTC [295109] LOG: invalid record length at 17/6AFACF78: wanted 24, got 0
2023-09-13 20:21:22.195 UTC [295109] LOG: redo done at 17/6AFACF40 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-09-13 20:21:22.208 UTC [295110] LOG: checkpoint starting: end-of-recovery immediate wait
2023-09-13 20:21:22.383 UTC [295113] FATAL: the database system is not yet accepting connections
2023-09-13 20:21:22.383 UTC [295113] DETAIL: Consistent recovery state has not been yet reached.
2023-09-13 20:21:22.428 UTC [295110] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.199 s, sync=0.004 s, total=0.224 s; sync files=2, longest=0.003>
2023-09-13 20:21:22.435 UTC [295031] LOG: database system is ready to accept connections
2023-09-13 20:40:24.416 UTC [295031] LOG: server process (PID 295140) was terminated by signal 11: Segmentation fault
2023-09-13 20:40:24.416 UTC [295031] DETAIL: Failed process was running: --explain analyze
select
a.equipmenttype
, a.technology
, count(*)
from
newfaultentrylineitems a
where
a.databaseguid = '9a277d83-30eb-4736-99f2-d028ff84611e'
and a.assessmentyearmonth >= 202209
and a.assessmentyearmonth < 202309
group by a.equipmenttype, a.technology
order by count(*) desc
2023-09-13 20:40:24.416 UTC [295031] LOG: terminating any other active server processes
2023-09-13 20:40:24.666 UTC [295523] FATAL: the database system is in recovery mode
2023-09-13 20:40:24.922 UTC [295524] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.182 UTC [295525] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.379 UTC [295526] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.574 UTC [295527] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.775 UTC [295528] FATAL: the database system is in recovery mode
2023-09-13 20:40:25.977 UTC [295529] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.179 UTC [295530] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.377 UTC [295531] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.578 UTC [295532] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.777 UTC [295533] FATAL: the database system is in recovery mode
2023-09-13 20:40:26.973 UTC [295534] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.166 UTC [295535] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.372 UTC [295536] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.568 UTC [295537] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.766 UTC [295538] FATAL: the database system is in recovery mode
2023-09-13 20:40:27.961 UTC [295539] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.163 UTC [295540] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.389 UTC [295541] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.589 UTC [295542] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.784 UTC [295543] FATAL: the database system is in recovery mode
2023-09-13 20:40:28.978 UTC [295544] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.114 UTC [295031] LOG: issuing SIGKILL to recalcitrant children
2023-09-13 20:40:29.178 UTC [295545] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.403 UTC [295546] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.603 UTC [295547] FATAL: the database system is in recovery mode
2023-09-13 20:40:29.798 UTC [295548] FATAL: the database system is in recovery mode
2023-09-13 20:40:30.017 UTC [295549] FATAL: the database system is in recovery mode
2023-09-13 20:40:30.214 UTC [295550] FATAL: the database system is in recovery mode
2023-09-13 20:40:58.414 UTC [295031] LOG: all server processes terminated; reinitializing
2023-09-13 20:40:58.876 UTC [295553] LOG: database system was interrupted; last known up at 2023-09-13 20:21:22 UTC
2023-09-13 20:40:58.986 UTC [295553] LOG: database system was not properly shut down; automatic recovery in progress
2023-09-13 20:40:58.992 UTC [295553] LOG: redo starts at 17/6AFACFF0
2023-09-13 20:40:58.992 UTC [295553] LOG: invalid record length at 17/6AFAD028: wanted 24, got 0
2023-09-13 20:40:58.992 UTC [295553] LOG: redo done at 17/6AFACFF0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-09-13 20:40:59.006 UTC [295554] LOG: checkpoint starting: end-of-recovery immediate wait
2023-09-13 20:40:59.228 UTC [295554] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.198 s, sync=0.004 s, total=0.228 s; sync files=2, longest=0.003>
2023-09-13 20:40:59.237 UTC [295031] LOG: database system is ready to accept connections
I realized that when I was running GDB I had just taken a snapshot of the moment it failed. Here I've added the output after continuing and requesting a backtrace.
rjsjr
Continuing.
[Detaching after fork from child process 297277]
[Detaching after fork from child process 297278]
[Detaching after fork from child process 297279]
[Detaching after fork from child process 297280]
[Detaching after fork from child process 297281]
[Detaching after fork from child process 297282]
[Detaching after fork from child process 297283]
[Detaching after fork from child process 297284]
[Detaching after fork from child process 297286]
[Detaching after fork from child process 297287]
[Detaching after fork from child process 297289]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) bt
#0 0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x0000aaaab8124a1c in ?? ()
#2 0x0000aaaab81263cc in PostmasterMain ()
#3 0x0000aaaab7e753d8 in main ()
(gdb) c
Continuing.
[Detaching after fork from child process 297301]
[Detaching after fork from child process 297304]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) bt
#0 0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x0000aaaab8124a1c in ?? ()
#2 0x0000aaaab81263cc in PostmasterMain ()
#3 0x0000aaaab7e753d8 in main ()
(gdb) c
Continuing.
[Detaching after fork from child process 297306]
[Detaching after fork from child process 297311]
[Detaching after fork from child process 297320]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297330]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297332]
[Detaching after fork from child process 297333]
[Detaching after fork from child process 297334]
[Detaching after fork from child process 297335]
[Detaching after fork from child process 297336]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297337]
[Detaching after fork from child process 297338]
[Detaching after fork from child process 297339]
[Detaching after fork from child process 297344]
[Detaching after fork from child process 297347]
Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297351]
[Detaching after fork from child process 297352]
[Detaching after fork from child process 297353]
[Detaching after fork from child process 297354]
[Detaching after fork from child process 297355]
Robert Sanford <wobbet@gmail.com> writes:
> I realized that when I was running GDB I just took a snap of when it
> failed. Here I've added after continuing and requesting a backtrace.

This is still quite unhelpful; it appears to show normal behavior in
a process that's not the failing one. There is some advice about
collecting useful backtraces at

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

I'd counsel the method of enabling core dumps and then attaching
gdb to a core file, rather than dealing with live processes.

regards, tom lane
On Wed, 2023-09-13 at 16:53 -0500, Robert Sanford wrote:
> The Postgres server crashed.

The PostgreSQL log file may contain interesting information.

Yours,
Laurenz Albe
Woke up this morning excited to create a core dump and... it worked.
Not only does it work, it works fast. Query time dropped from 5.5s to 1.2s. That was what I was hoping to get out of the VACUUM ANALYZE.
Very confused. I don't get it. I hope it doesn't happen again, but if it does I'll get the core dump.
rjsjr
And now it's happening again...
I'm trying to get a core dump but I can't seem to actually generate one.
When I try to set ulimit -c from the command line, I get a message that I'm not allowed to do that. So I went in and updated /etc/security/limits.conf for the postgres user as follows.
postgres hard core unlimited
postgres soft core unlimited
I then restarted the postgres server. When I look at the limits in the pid folder I see...
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
My core_pattern is set to "/tmp/core-%e-%s-%u-%g-%p-%t"
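(For reference, the specifiers in that pattern come from core(5): %e is the executable name, %s the signal number, %u the real UID, %g the real GID, %p the PID, and %t a UNIX timestamp. The value the kernel is actually using can be confirmed with:)

```shell
# Show the core_pattern the kernel will use when a process dumps core.
cat /proc/sys/kernel/core_pattern
```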
I run the query that forces the crash, but there is no core file in any of:
/tmp/
/etc/postgresql/15
/etc/postgresql/15/main
/var/lib/postgresql/15
/var/lib/postgresql/15/main
/usr/lib/postgresql/15
/usr/lib/postgresql/15/lib
/usr/lib/postgresql/15/bin
GDB (more below) is even able to detect that there's a segmentation fault.
What am I doing wrong? Why can't I get a core file anywhere?
rjsjr
From the GDB shell... which looks identical to what's in the log file...
(gdb) set pagination off
(gdb) set logging file /datadrive/pgdebug.txt
(gdb) set logging on
Warning: 'set logging on', an alias for the command 'set logging enabled', is deprecated.
Use 'set logging enabled on'.
Copying output to /datadrive/pgdebug.txt.
Copying debug output to /datadrive/pgdebug.txt.
(gdb) b errfinish
Breakpoint 1 at 0xaaaabef843b8
(gdb) cont
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
(gdb) bt
#0 0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) bt
#0 0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) cont
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
(gdb) bt
#0 0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) cont
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000fff7add7a000 in ?? ()
(gdb) bt
#0 0x0000fff7add7a000 in ?? ()
#1 0x0000aaaacc627ca0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) cont
Continuing.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb)
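(One way to check what limit the *running* postmaster actually has is its /proc limits file. Shown here against the current shell's PID so it runs anywhere; substitute the PID from postmaster.pid for the real check:)

```shell
# "Max core file size" must be non-zero in the *target process's* limits,
# not just in your login shell. $$ is used here as a stand-in PID.
grep -i "max core file size" /proc/$$/limits
```

If that line still shows a soft limit of 0 for the postmaster, no core file will ever be written regardless of core_pattern.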
On Thu, Sep 14, 2023, at 4:29 PM, Robert Sanford wrote:
> And now it's happening again... I'm trying to get a core dump but I
> can't seem to actually generate one. [...]
Are you using systemd? If so, execute
sudo systemctl show -p LimitCORE postgresql@15-main.service
(Replace service name accordingly.)
If it is not showing infinity as value, you need to edit the service file using
the following command:
sudo systemctl edit postgresql@15-main.service
and add
[Service]
LimitCORE=infinity
Restart the service after it and check again with systemctl show command.
You should also check where the operating system is saving the core files.
cat /proc/sys/kernel/core_pattern
If it says core, it means it is storing it in PGDATA directory. Some OSes can
use a command/script to send the core files to another location.
Robert Sanford <wobbet@gmail.com> writes:
> I'm trying to get a core dump but I can't seem to actually generate one.
> ...
> I then restarted the postgres server. When I look at the limits in the pid
> folder I see...
> Limit Soft Limit Hard Limit Units
> Max cpu time unlimited unlimited seconds
> Max file size unlimited unlimited bytes
> Max data size unlimited unlimited bytes
> Max stack size 8388608 unlimited bytes
> Max core file size 0 unlimited bytes

This shows that you are now allowed to set "ulimit -c unlimited",
but you have not actually done so, at least not in the session that
you're using to inspect these settings.

If you are starting the server "by hand" (from a manual shell
command), it should work to do "ulimit -c unlimited" before issuing
the command to start the server. If you are using infrastructure such
as systemd or initscripts, then you'll need to figure out how to get
it to set the ulimit in the environment of the server process.

BTW, have you tried disabling JIT to see if that stops the crashes?

regards, tom lane
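(Both suggestions — raising the core limit for a manual start, and disabling JIT — can be sketched as below. The psql line assumes superuser access; `jit` is a standard PostgreSQL 11+ parameter:)

```shell
# For a manual start: raise the core limit in the SAME shell that will
# launch postgres, then verify it took effect.
ulimit -c unlimited
ulimit -c    # should print "unlimited"

# To rule out JIT without restarting the server:
#   psql -c "ALTER SYSTEM SET jit = off;" -c "SELECT pg_reload_conf();"
```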
Turning JIT off seems to have solved the problem so far. But in the past I've gone a couple of hours without it happening and then it starts again, so I'm still focused on trying to get a core dump.
Here are some of my notes from trying to get one last night...
I typically launch postgres with systemctl but have also tested running it manually. If I get another failure, I will try adding the ulimit call before the next manual run. If I don't get another failure, I'll re-enable JIT and try to induce it again.
I have 32GB of shared memory configured for PostgreSQL but only 10GB of drive space on my OS drive. I'm wondering if that's preventing the core dump from being written. I'm going to move my core dump location to my data drive, which has well over 100GB free...
When running systemctl list I notice that I have versions 12 and 14 of postgresql also running. I would hope that doesn't cause an issue but it also doesn't seem ideal. How do I remove them from systemd startup?
Running the systemctl show I got the following info...
LimitCPU=infinity
LimitCPUSoft=infinity
LimitFSIZE=infinity
LimitFSIZESoft=infinity
LimitDATA=infinity
LimitDATASoft=infinity
LimitSTACK=infinity
LimitSTACKSoft=8388608
LimitCORE=infinity
LimitCORESoft=0
LimitRSS=infinity
LimitRSSSoft=infinity
LimitNOFILE=524288
LimitNOFILESoft=1024
My pgdata directory is set in my postgresql.conf file as
data_directory = '/var/lib/postgresql/15/main'
PGDATA is not set in the environment.
My core_pattern is set to "/tmp/core-%e-%s-%u-%g-%p-%t"
The contents of that directory are
azureuser@kpidb-dev:/etc$ sudo ls -al /var/lib/postgresql/15/main
total 120
drwx------ 20 postgres postgres 4096 Sep 14 19:13 .
drwxr-xr-x 3 postgres postgres 4096 Aug 18 20:44 ..
-rw------- 1 postgres postgres 3 Aug 18 20:44 PG_VERSION
drwx------ 5 postgres postgres 4096 Aug 18 20:44 base
-rw------- 1 postgres postgres 44 Sep 14 19:13 current_logfiles
drwx------ 2 postgres postgres 4096 Sep 14 19:39 global
drwx------ 2 postgres postgres 4096 Sep 14 19:13 log
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_commit_ts
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_dynshmem
drwx------ 4 postgres postgres 4096 Sep 14 19:39 pg_logical
drwx------ 4 postgres postgres 4096 Aug 18 20:44 pg_multixact
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_notify
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_replslot
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_serial
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_snapshots
drwx------ 2 postgres postgres 4096 Sep 14 19:13 pg_stat
drwx------ 2 postgres postgres 4096 Aug 29 01:44 pg_stat_tmp
drwx------ 2 postgres postgres 4096 Sep 5 14:36 pg_subtrans
drwx------ 2 postgres postgres 4096 Aug 30 14:05 pg_tblspc
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_twophase
drwx------ 3 postgres postgres 20480 Sep 13 20:16 pg_wal
drwx------ 2 postgres postgres 4096 Aug 18 20:44 pg_xact
-rw------- 1 postgres postgres 88 Aug 18 20:44 postgresql.auto.conf
-rw------- 1 postgres postgres 238 Sep 14 19:13 postmaster.opts
-rw------- 1 postgres postgres 101 Sep 14 19:39 postmaster.pid