Discussion: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors
The following bug has been logged on the website:

Bug reference:      18076
Logged by:          Robert Sanford
Email address:      wobbet@wobbet.com
PostgreSQL version: 15.4
Operating system:   Linux 5.15.0-1042-azure #49~20.04.1-Ubuntu aarch64

Description:

Installed v15 via Docker on my local dev machine (Intel x64). Created the
database schema, ran a bunch of tests. All very good.

pg_dump'ed the database and then imported it into a new instance running on an
ARM64 VM in Azure (8 cores, 64 GB RAM, Ubuntu 20.04.1). This is a full VM, not
a Docker instance. The ARM64 CPUs are, according to Azure, physical CPUs and
not virtual.

Logs from the Azure machine show Signal 7 and Signal 11 errors consistently in
the same place in the processing every time (log entries below). It's
executing a specific function. The server restarts and has to catch up from
the WAL. None of the other functions show this behavior. Running the stored
procedure from DBeaver does not cause the same issues.

My ETL app is a .NET Core 6 console application running on my local machine
and connecting to either the DB on my local machine or the VM in Azure. It
connects via Npgsql. It doesn't seem to be horribly expensive in terms of CPU
usage and generally finishes in < 5s. That said, there can be as many as 32
simultaneous connections running the specific stored procedure.

After reading other bug reports of Signal 7 I set huge_pages=off and disabled
features such as autoanalyze. Other non-default config entries include:

- max_connections = 100
- shared_buffers = 8GB
- work_mem = 8GB
- maintenance_work_mem = 1GB
- max_wal_size = 4GB
- min_wal_size = 500MB

I am hopeful that someone will be able to help me out with this. Let me know
if there are any other details that I need to provide.

Thanks!
rjsjr

Log entries...

2023-08-29 01:50:57.225 UTC [432307] LOG: server process (PID 433701) was terminated by signal 7: Bus error
2023-08-29 01:50:57.225 UTC [432307] DETAIL: Failed process was running: select * from process_route_compliance($1, $2, $3)
2023-08-29 01:50:57.225 UTC [432307] LOG: terminating any other active server processes
2023-08-29 01:50:57.231 UTC [432307] LOG: all server processes terminated; reinitializing
2023-08-29 01:50:57.441 UTC [438947] LOG: database system was interrupted; last known up at 2023-08-29 01:44:22 UTC
2023-08-29 01:50:57.547 UTC [438947] LOG: database system was not properly shut down; automatic recovery in progress
2023-08-29 01:50:57.553 UTC [438947] LOG: redo starts at F/3CD6C068
2023-08-29 01:50:57.896 UTC [438950] FATAL: the database system is not yet accepting connections
2023-08-29 01:50:57.896 UTC [438950] DETAIL: Consistent recovery state has not been yet reached.
2023-08-29 01:50:58.612 UTC [438947] LOG: unexpected pageaddr E/99864000 in log segment 000000010000000F00000050, offset 8798208
2023-08-29 01:50:58.612 UTC [438947] LOG: redo done at F/50863FB0 system usage: CPU: user: 0.66 s, system: 0.32 s, elapsed: 1.05 s
2023-08-29 01:50:58.629 UTC [438948] LOG: checkpoint starting: end-of-recovery immediate wait
2023-08-29 01:51:07.996 UTC [438948] LOG: checkpoint complete: wrote 35042 buffers (3.3%); 0 WAL file(s) added, 20 removed, 0 recycled; write=8.989 s, sync=0.197 s, total=9.372 s; sync files=254, longest=0.193 s, average=0.001 s; distance=322528 kB, estimate=322528 kB
2023-08-29 01:51:08.027 UTC [432307] LOG: database system is ready to accept connections
2023-08-29 01:52:07.472 UTC [432307] LOG: server process (PID 440110) was terminated by signal 7: Bus error
2023-08-29 01:52:07.472 UTC [432307] DETAIL: Failed process was running: select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:07.472 UTC [432307] LOG: terminating any other active server processes
2023-08-29 01:52:07.500 UTC [432307] LOG: all server processes terminated; reinitializing
2023-08-29 01:52:07.689 UTC [440184] LOG: database system was interrupted; last known up at 2023-08-29 01:51:07 UTC
2023-08-29 01:52:07.797 UTC [440184] LOG: database system was not properly shut down; automatic recovery in progress
2023-08-29 01:52:07.803 UTC [440184] LOG: redo starts at F/508640C0
2023-08-29 01:52:07.892 UTC [440184] LOG: unexpected pageaddr E/C8C98000 in log segment 000000010000000F00000051, offset 13205504
2023-08-29 01:52:07.892 UTC [440184] LOG: redo done at F/51C97F28 system usage: CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.08 s
2023-08-29 01:52:07.908 UTC [440185] LOG: checkpoint starting: end-of-recovery immediate wait
2023-08-29 01:52:08.085 UTC [440185] LOG: checkpoint complete: wrote 1216 buffers (0.1%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.140 s, sync=0.011 s, total=0.181 s; sync files=39, longest=0.005 s, average=0.001 s; distance=20688 kB, estimate=20688 kB
2023-08-29 01:52:08.104 UTC [432307] LOG: database system is ready to accept connections
2023-08-29 01:52:08.153 UTC [440187] FATAL: the database system is not yet accepting connections
2023-08-29 01:52:08.153 UTC [440187] DETAIL: Consistent recovery state has not been yet reached.
2023-08-29 01:52:09.716 UTC [432307] LOG: server process (PID 440192) was terminated by signal 11: Segmentation fault
2023-08-29 01:52:09.716 UTC [432307] DETAIL: Failed process was running: select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:09.716 UTC [432307] LOG: terminating any other active server processes
2023-08-29 01:52:09.720 UTC [432307] LOG: all server processes terminated; reinitializing
2023-08-29 01:52:09.873 UTC [440196] LOG: database system was interrupted; last known up at 2023-08-29 01:52:08 UTC
2023-08-29 01:52:09.984 UTC [440196] LOG: database system was not properly shut down; automatic recovery in progress
2023-08-29 01:52:09.998 UTC [440196] LOG: redo starts at F/51C980C0
2023-08-29 01:52:10.001 UTC [440196] LOG: unexpected pageaddr E/C8CEA000 in log segment 000000010000000F00000051, offset 13541376
2023-08-29 01:52:10.001 UTC [440196] LOG: redo done at F/51CE9AB0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-08-29 01:52:10.024 UTC [440197] LOG: checkpoint starting: end-of-recovery immediate wait
2023-08-29 01:52:10.132 UTC [440197] LOG: checkpoint complete: wrote 58 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.077 s, sync=0.014 s, total=0.113 s; sync files=26, longest=0.005 s, average=0.001 s; distance=328 kB, estimate=328 kB
2023-08-29 01:52:10.141 UTC [432307] LOG: database system is ready to accept connections
2023-08-29 01:52:11.501 UTC [432307] LOG: server process (PID 440272) was terminated by signal 11: Segmentation fault
2023-08-29 01:52:11.501 UTC [432307] DETAIL: Failed process was running: select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:11.501 UTC [432307] LOG: terminating any other active server processes
2023-08-29 01:52:11.504 UTC [432307] LOG: all server processes terminated; reinitializing
2023-08-29 01:52:11.648 UTC [440279] LOG: database system was interrupted; last known up at 2023-08-29 01:52:10 UTC
2023-08-29 01:52:11.748 UTC [440279] LOG: database system was not properly shut down; automatic recovery in progress
2023-08-29 01:52:11.754 UTC [440279] LOG: redo starts at F/51CEA0C0
2023-08-29 01:52:11.759 UTC [440279] LOG: unexpected pageaddr E/C8D86000 in log segment 000000010000000F00000051, offset 14180352
2023-08-29 01:52:11.759 UTC [440279] LOG: redo done at F/51D85F90 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-08-29 01:52:11.778 UTC [440280] LOG: checkpoint starting: end-of-recovery immediate wait
2023-08-29 01:52:11.874 UTC [440280] LOG: checkpoint complete: wrote 89 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.062 s, sync=0.016 s, total=0.103 s; sync files=31, longest=0.008 s, average=0.001 s; distance=624 kB, estimate=624 kB
2023-08-29 01:52:11.883 UTC [432307] LOG: database system is ready to accept connections
2023-08-29 01:53:16.516 UTC [432307] LOG: server process (PID 440285) was terminated by signal 11: Segmentation fault
2023-08-29 01:53:16.516 UTC [432307] DETAIL: Failed process was running: select * from process_route_compliance($1, $2, $3)
Note - one thing I read was that work_mem is applied per connection (strictly, per sort/hash operation), so 8GB x 32 active connections = far more RAM than the machine has.
I decreased it to 128MB and re-ran the tests. It got further than before, but failed in the same place. Then I decreased it to 32MB, with the same result.
Using the Azure VM monitor I can see that available memory has never dropped below 50%, so work_mem may not be the issue.
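For reference, the worst-case arithmetic for those settings works out like this (a rough sketch only; work_mem caps each sort/hash operation, not a whole connection, so a single query can use several multiples of it):

```shell
# Worst-case memory math for the settings above (illustrative only).
WORK_MEM_GB=8
MAX_ACTIVE_CONNECTIONS=32
RAM_GB=64

WORST_CASE_GB=$(( WORK_MEM_GB * MAX_ACTIVE_CONNECTIONS ))
echo "worst case: ${WORST_CASE_GB} GB of work_mem vs ${RAM_GB} GB of RAM"
# prints: worst case: 256 GB of work_mem vs 64 GB of RAM

# After the reduction to 128MB, the same math stays comfortably inside RAM:
echo "reduced:    $(( 128 * MAX_ACTIVE_CONNECTIONS )) MB"
# prints: reduced:    4096 MB
```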
rjsjr
On Tue, Aug 29, 2023 at 3:31 PM PG Bug reporting form <noreply@postgresql.org> wrote:
Robert Sanford <wobbet@gmail.com> writes:
> Using the Azure VM monitor I see that available memory has not ever gone
> below 50% so work_mem may not be the issue.

There was never any reason to think that.

>> 2023-08-29 01:50:57.225 UTC [432307] LOG: server process (PID 433701) was
>> terminated by signal 7: Bus error

The usual cause of SIGBUS is a misaligned memory access on hardware that's
picky about that. (Intel chips aren't, so it's often possible for such bugs
to slip through developer testing. But your ARM system might be.)

Unfortunately, there's about zero chance of locating the bug from the
information you've provided. What'd be helpful is to capture stack traces
from a few of the failed processes. See

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

for directions. Please try to get traces from both the SIG7 and SIG11 cases,
as it's not very clear whether they have the same root.

regards, tom lane
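For the archives, the core-dump route from that wiki page looks roughly like this on Ubuntu (a sketch under assumptions: a packaged PostgreSQL 15 install with a systemd service named postgresql; the dump path, binary path, and the example PID 433701 from the log above are illustrative):

```shell
# 1. Point kernel core dumps somewhere writable (path is illustrative).
sudo mkdir -p /var/coredumps && sudo chmod 1777 /var/coredumps
echo '/var/coredumps/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# 2. Allow the server processes to dump core. With systemd, run
#    `sudo systemctl edit postgresql` and add:
#        [Service]
#        LimitCORE=infinity
#    then restart the service so the new limit takes effect.
sudo systemctl restart postgresql

# 3. After the next SIGBUS/SIGSEGV, open the core file in gdb and save the
#    backtrace (install debug symbols first if your repo provides them).
sudo apt-get install -y gdb
gdb /usr/lib/postgresql/15/bin/postgres /var/coredumps/core.postgres.433701 \
    -batch -ex 'bt full' > backtrace.txt
```

A trace captured this way from each of the SIG7 and SIG11 crashes is what the request above is asking for.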
Tom,
Thanks for that! The symlink was the issue. I recreated the symlink and it found everything. I started seeing some weird errors when running a simple test but realized that they were actually bugs in my code!
Now I'm rerunning the full suite and am no longer getting the Signal 7 or Signal 11 errors. I'm hoping that the OS upgrade is what made things work, but that also worries me.
If (when?!?!) I am able to recreate the bug I'll get a stack trace.
rjsjr
On Wed, Aug 30, 2023 at 9:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Sanford <wobbet@gmail.com> writes:
> The update dropped the mount point where the database was stored. I've
> restored the mount point but now PostgreSQL can't find it.
> It's looking for pg_tblspc/16392/PG_15_202209061 when that folder is
> located instead on /datadrive/pgsql/PG_15_202209061.
pg_tblspc/16392/PG_15_202209061 should be a symlink to wherever the
tablespace really is. You could make that link by hand if it
disappeared somehow, but I think you are probably misdiagnosing
the problem. Shouldn't you be adding something to the system's
list of mounts, instead?
regards, tom lane
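Concretely, the two repairs described above might look like this (an illustrative sketch; the tablespace OID 16392 and the /datadrive path come from the error message, while the filesystem UUID and the data directory location are placeholders to adjust for the actual system):

```shell
# Preferred fix: put the data drive back in the system's mount list so a
# reboot or upgrade can't drop it (find the real UUID with `sudo blkid`).
echo 'UUID=<uuid-of-data-disk> /datadrive ext4 defaults,nofail 0 2' \
    | sudo tee -a /etc/fstab
sudo mount -a

# Hand repair, only if the symlink itself really vanished: pg_tblspc/<oid>
# must point at the tablespace directory (16392 is the OID from the error,
# so pg_tblspc/16392/PG_15_202209061 resolves to /datadrive/pgsql/PG_15_202209061).
sudo -u postgres ln -s /datadrive/pgsql /var/lib/postgresql/15/main/pg_tblspc/16392
```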