Обсуждение: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

Поиск
Список
Период
Сортировка

BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17977
Logged by:          Cory Albrecht
Email address:      coryca@gmail.com
PostgreSQL version: 14.6
Operating system:   TrueNAS-13.0-U5.1 (FreeBSD 13.1-RELEASE-p7)
Description:

PorstGreSQL running in a jail crashes randomly with signal 10 bus error. It
is the only application so affected.

Jun 14 16:25:10 postgresql-3 postgres[5026]: [11-1] [2023-06-14 16:25:10.563
EDT][6489e39c.13a2] <%>LOG:  00000: server process (PID 40166) was
terminated by signal 10: Bus error
Jun 14 16:25:10 postgresql-3 postgres[5026]: [11-2] [2023-06-14 16:25:10.563
EDT][6489e39c.13a2] <%>LOCATION:  LogChildExit, postmaster.c:3759
Jun 14 16:25:10 postgresql-3 postgres[5026]: [12-1] [2023-06-14 16:25:10.563
EDT][6489e39c.13a2] <%>LOG:  00000: terminating any other active server
processes
Jun 14 16:25:10 postgresql-3 postgres[5026]: [12-2] [2023-06-14 16:25:10.563
EDT][6489e39c.13a2] <%>LOCATION:  HandleChildCrash, postmaster.c:3482
Jun 14 16:25:10 postgresql-3 postgres[5026]: [13-1] [2023-06-14 16:25:10.577
EDT][6489e39c.13a2] <%>LOG:  00000: all server processes terminated;
reinitializing
Jun 14 16:25:10 postgresql-3 postgres[5026]: [13-2] [2023-06-14 16:25:10.577
EDT][6489e39c.13a2] <%>LOCATION:  PostmasterStateMachine,
postmaster.c:4038
Jun 14 16:25:10 postgresql-3 postgres[40194]: [14-1] [2023-06-14
16:25:10.603 EDT][648a2226.9d02] <%>LOG:  00000: database system was
interrupted; last known up at 2023-06-14 16:24:10 EDT
Jun 14 16:25:10 postgresql-3 postgres[40194]: [14-2] [2023-06-14
16:25:10.603 EDT][648a2226.9d02] <%>LOCATION:  StartupXLOG, xlog.c:6582
Jun 14 16:25:10 postgresql-3 postgres[40194]: [15-1] [2023-06-14
16:25:10.689 EDT][648a2226.9d02] <%>LOG:  00000: database system was not
properly shut down; automatic recovery in progress
Jun 14 16:25:10 postgresql-3 postgres[40194]: [15-2] [2023-06-14
16:25:10.689 EDT][648a2226.9d02] <%>LOCATION:  StartupXLOG, xlog.c:7106
Jun 14 16:25:10 postgresql-3 postgres[40194]: [16-1] [2023-06-14
16:25:10.696 EDT][648a2226.9d02] <%>LOG:  00000: redo starts at 6/B918C0C0
Jun 14 16:25:10 postgresql-3 postgres[40194]: [16-2] [2023-06-14
16:25:10.696 EDT][648a2226.9d02] <%>LOCATION:  StartupXLOG, xlog.c:7384
Jun 14 16:25:10 postgresql-3 postgres[40194]: [17-1] [2023-06-14
16:25:10.696 EDT][648a2226.9d02] <%>LOG:  00000: invalid record length at
6/B918C0F8: wanted 24, got 0
Jun 14 16:25:10 postgresql-3 postgres[40194]: [17-2] [2023-06-14
16:25:10.696 EDT][648a2226.9d02] <%>LOCATION:  ReadRecord, xlog.c:4454
Jun 14 16:25:10 postgresql-3 postgres[40194]: [18-1] [2023-06-14
16:25:10.696 EDT][648a2226.9d02] <%>LOG:  00000: redo done at 6/B918C0C0
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Jun 14 16:25:10 postgresql-3 postgres[40194]: [18-2] [2023-06-14
16:25:10.696 EDT][648a2226.9d02] <%>LOCATION:  StartupXLOG, xlog.c:7648
Jun 14 16:25:10 postgresql-3 postgres[5026]: [14-1] [2023-06-14 16:25:10.731
EDT][6489e39c.13a2] <%>LOG:  00000: database system is ready to accept
connections

In this case the last few log lines added by PID 40166 were:

Jun 14 16:24:10 postgresql-3 postgres[40166]: [11-1] [2023-06-14
16:24:10.353 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT regclass('pg_class')::oid AS oidselect;BEGIN;DECLARE oidcursor BINARY
CURSOR FOR SELECT regclass('pg_class')::oid AS oidbinarycursor;FETCH FORWARD
1 FROM oidcursor;CLOSE oidcursor;COMMIT;
Jun 14 16:24:10 postgresql-3 postgres[40166]: [11-2] [2023-06-14
16:24:10.353 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [12-1] [2023-06-14
16:24:10.355 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT postgis_version()
Jun 14 16:24:10 postgresql-3 postgres[40166]: [12-2] [2023-06-14
16:24:10.355 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [13-1] [2023-06-14
16:24:10.545 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT postgis_geos_version(), postgis_proj_version()
Jun 14 16:24:10 postgresql-3 postgres[40166]: [13-2] [2023-06-14
16:24:10.545 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [14-1] [2023-06-14
16:24:10.545 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT has_schema_privilege(n.oid, 'usage') AND has_table_privilege(t.oid,
'select') AND has_table_privilege(l.oid, 'select') FROM pg_namespace n,
pg_class t, pg_class l WHERE n.nspname = 'topology' AND t.relnamespace =
n.oid AND l.relnamespace = n.oid AND t.relname = 'topology' AND l.relname =
'layer'
Jun 14 16:24:10 postgresql-3 postgres[40166]: [14-2] [2023-06-14
16:24:10.545 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [15-1] [2023-06-14
16:24:10.549 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT has_table_privilege(c.oid, 'select') AND has_table_privilege(f.oid,
'select') FROM pg_class c, pg_class f, pg_namespace n, pg_extension e WHERE
c.relnamespace = n.oid AND c.relname = 'pointcloud_columns' AND
f.relnamespace = n.oid AND f.relname = 'pointcloud_formats' AND n.oid =
e.extnamespace AND e.extname = 'pointcloud'
Jun 14 16:24:10 postgresql-3 postgres[40166]: [15-2] [2023-06-14
16:24:10.549 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [16-1] [2023-06-14
16:24:10.550 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT has_table_privilege(c.oid, 'select') FROM pg_class c, pg_namespace n,
pg_type t WHERE c.relnamespace = n.oid AND n.oid = t.typnamespace AND
c.relname = 'raster_columns' AND t.typname = 'raster'
Jun 14 16:24:10 postgresql-3 postgres[40166]: [16-2] [2023-06-14
16:24:10.550 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [17-1] [2023-06-14
16:24:10.551 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement: SET
extra_float_digits=3; SET application_name='QGIS3 desktop'; SET
datestyle='ISO'
Jun 14 16:24:10 postgresql-3 postgres[40166]: [17-2] [2023-06-14
16:24:10.551 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-1] [2023-06-14
16:24:10.551 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT COUNT(*)
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-2]
                                       FROM pg_class t, pg_class i,
pg_namespace ns, pg_index ix, pg_attribute a
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-3]
                                       WHERE
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-4]
                                           t.oid=ix.indrelid
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-5]
                                           AND t.relnamespace=ns.oid
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-6]
                                           AND i.oid=ix.indexrelid
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-7]
                                           AND a.attrelid=t.oid
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-8]
                                           AND a.attnum=ANY(ix.indkey)
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-9]
                                           AND t.relkind IN ('r', 'm')
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-10]
                                            AND ns.nspname='public'
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-11]
                                            AND
t.relname='ne_10m_rivers_lake_centerlines_scale_rank'
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-12]
                                            AND a.attname='geom';
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-13]
                                        
Jun 14 16:24:10 postgresql-3 postgres[40166]: [18-14] [2023-06-14
16:24:10.551 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016
Jun 14 16:24:10 postgresql-3 postgres[40166]: [19-1] [2023-06-14
16:24:10.555 EDT][648a21ea.9ce6] <maps%physgeo>LOG:  00000: statement:
SELECT oid, typname FROM pg_type WHERE oid IN (20)
Jun 14 16:24:10 postgresql-3 postgres[40166]: [19-2] [2023-06-14
16:24:10.555 EDT][648a21ea.9ce6] <maps%physgeo>LOCATION:  exec_simple_query,
postgres.c:1016

Which was the user application QGIS reconnecting right after the previous
signal 10 bus error crash. 

I have my suspicions that QGIS is doing something funky as I have had no
success making PostgreSQL crash with some scripts opening up several dozen
simultaneous connections and making dozens of SQL queries per connection.
QGIS is my heaviest database user, everything else is fairly tiny and quick
and never causes PostgreSQL to crash like this. However, I don;t really know
how to proceed to prove that it is QGIS or what, exactly, it is doing.

Let me know what logging you want me to turn on and how and I can gather
that data.


Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> PorstGreSQL running in a jail crashes randomly with signal 10 bus error.

Can you collect a core dump and get a stack trace from it?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

            regards, tom lane



Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Cory Albrecht
Дата:
To get a core dump I would need to rebuild. PostrgreSQL from ports didn;t have what I needed pre built.

# This file is auto-generated by 'make config'.
# Options for postgresql14-server-14.6_1
_OPTIONS_READ=postgresql14-server-14.6_1
_FILE_COMPLETE_OPTIONS_LIST=DEBUG DOCS DTRACE GSSAPI INTDATE LDAP LLVM LZ4 NLS OPTIMIZED_CFLAGS PAM SSL TZDATA XML
OPTIONS_FILE_UNSET+=DEBUG
OPTIONS_FILE_UNSET+=DOCS
OPTIONS_FILE_UNSET+=DTRACE
OPTIONS_FILE_SET+=GSSAPI
OPTIONS_FILE_SET+=INTDATE
OPTIONS_FILE_SET+=LDAP
OPTIONS_FILE_SET+=LLVM
OPTIONS_FILE_SET+=LZ4
OPTIONS_FILE_SET+=NLS
OPTIONS_FILE_SET+=OPTIMIZED_CFLAGS
OPTIONS_FILE_SET+=PAM
OPTIONS_FILE_SET+=SSL
OPTIONS_FILE_UNSET+=TZDATA
OPTIONS_FILE_SET+=XML

Also, which PID do I attach GDB to?

e.g.

root@postgresql-3:~ # ps awxu | grep post
postgres 15892   0.0  0.0 189272  36072  -  SsJ  02:04   0:05.72 /usr/local/bin/postgres -D /var/db/postgres/data14
postgres 15894   0.0  0.0 189272  46632  -  IsJ  02:04   0:00.02 postgres: checkpointer  (postgres)
postgres 15895   0.0  0.0 189272  36108  -  SsJ  02:04   0:00.13 postgres: background writer  (postgres)
postgres 15896   0.0  0.0 189272  36100  -  SsJ  02:04   0:00.13 postgres: walwriter  (postgres)
postgres 15897   0.0  0.0 191832  36472  -  SsJ  02:04   0:00.50 postgres: autovacuum launcher  (postgres)
postgres 15898   0.0  0.0 189272  36052  -  SsJ  02:04   0:00.07 postgres: archiver  (postgres)
postgres 15899   0.0  0.0  35752  20004  -  SsJ  02:04   0:01.55 postgres: stats collector  (postgres)
postgres 15900   0.0  0.0 191832  36312  -  IsJ  02:04   0:00.02 postgres: logical replication launcher  (postgres)
postgres 15920   0.0  0.0 191832  38756  -  SsJ  02:05   0:00.29 postgres: walsender repmgr pg2.db.vpn.cory.albrecht.name(44629)  (postgres)
postgres 15947   0.0  0.1 228380 131336  -  IsJ  02:06   0:00.28 postgres: maps Alternate History 10.128.6.9(56280)  (postgres)
postgres 15990   0.0  0.1 228380 122996  -  IsJ  02:09   0:00.21 postgres: maps Alternate History 10.128.6.9(56536)  (postgres)
root     39871   0.0  0.0  12840   2328  2  S+J  14:40   0:00.00 grep post

If it's the individual database ones like 15947 or 15990 in the above example, I have no way of knowing which one will get a SIGBUS. Is the main process (e.g. 15892) OK?



On Thu, Jun 15, 2023 at 9:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> PorstGreSQL running in a jail crashes randomly with signal 10 bus error.

Can you collect a core dump and get a stack trace from it?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

                        regards, tom lane

Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Tom Lane
Дата:
Cory Albrecht <coryca@gmail.com> writes:
> Also, which PID do I attach GDB to?
> If it's the individual database ones like 15947 or 15990 in the
> above example, I have no way of knowing which one will get a SIGBUS. Is the
> main process (e.g. 15892) OK?

If you don't know that, the best way is to collect a core file and
then gdb that.  Starting the postmaster under "ulimit -c unlimited"
ought to be enough.

            regards, tom lane



Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Cory Albrecht
Дата:
After things percolated overnight, and realising that scripts O had made to try and overload the server had connected with SSL, not Keberos, encryption, I decided to turn off the GSS/Kerberos settings so that everything connected only with SSL encryption. So far today in working with QGIS I have no SIGBUS crashes. Perhaps that reminds anybody of any previous issues?

I've currently got my poudriere machine rebuilding the the package (I needed to upgrade the servers from 14.6 to 14.8 anyways) to get a coredump, but that will take some time.

On Thu, Jun 15, 2023 at 3:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cory Albrecht <coryca@gmail.com> writes:
> Also, which PID do I attach GDB to?
> If it's the individual database ones like 15947 or 15990 in the
> above example, I have no way of knowing which one will get a SIGBUS. Is the
> main process (e.g. 15892) OK?

If you don't know that, the best way is to collect a core file and
then gdb that.  Starting the postmaster under "ulimit -c unlimited"
ought to be enough.

                        regards, tom lane

Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Tom Lane
Дата:
Cory Albrecht <coryca@gmail.com> writes:
> After things percolated overnight, and realising that scripts O had made to
> try and overload the server had connected with SSL, not Keberos,
> encryption, I decided to turn off the GSS/Kerberos settings so that
> everything connected only with SSL encryption. So far today in working with
> QGIS I have no SIGBUS crashes. Perhaps that reminds anybody of any previous
> issues?

Hmm, interesting.  Is the GSS support built against MIT Kerberos,
or against Heimdal?  We've recently given up on supporting Heimdal
because it's so under-maintained.  I've not heard specific reports
of bugs matching this symptom, but if you're working with Heimdal
(i.e., FreeBSD's native GSS support) it might be worth trying the
other.

            regards, tom lane



Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Cory Albrecht
Дата:
No, I am using MIT KRB5 as my servers are mostly FreeBSD (with one Debian) but the laptops & desktops are Ubuntu LTS. 

On Thu, Jun 15, 2023 at 4:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cory Albrecht <coryca@gmail.com> writes:
> After things percolated overnight, and realising that scripts O had made to
> try and overload the server had connected with SSL, not Keberos,
> encryption, I decided to turn off the GSS/Kerberos settings so that
> everything connected only with SSL encryption. So far today in working with
> QGIS I have no SIGBUS crashes. Perhaps that reminds anybody of any previous
> issues?

Hmm, interesting.  Is the GSS support built against MIT Kerberos,
or against Heimdal?  We've recently given up on supporting Heimdal
because it's so under-maintained.  I've not heard specific reports
of bugs matching this symptom, but if you're working with Heimdal
(i.e., FreeBSD's native GSS support) it might be worth trying the
other.

                        regards, tom lane

Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Stephen Frost
Дата:
Greetings,

* Cory Albrecht (coryca@gmail.com) wrote:
> No, I am using MIT KRB5 as my servers are mostly FreeBSD (with one Debian)
> but the laptops & desktops are Ubuntu LTS.

You're sure that the GSS library that PG is actually linked against and
running with is the MIT Krb5 library on the FreeBSD system?

Were you able to get a core dump of the SIGBUS with GSS?  You're using
GSSAPI authentication and encryption?

Thanks,

Stephen

Вложения

Re: BUG #17977: PorstGreSQL in a jail crashes randomly with Signal 10 bus error

От
Cory Albrecht
Дата:
Yes. I build everything with poudriere.

On Wed, Jul 12, 2023 at 8:57 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Cory Albrecht (coryca@gmail.com) wrote:
> No, I am using MIT KRB5 as my servers are mostly FreeBSD (with one Debian)
> but the laptops & desktops are Ubuntu LTS.

You're sure that the GSS library that PG is actually linked against and
running with is the MIT Krb5 library on the FreeBSD system?

Were you able to get a core dump of the SIGBUS with GSS?  You're using
GSSAPI authentication and encryption?

Thanks,

Stephen