Обсуждение: BUG #16174: PostgreSQL 9.6 service can not start after unexpected shutdown
BUG #16174: PostgreSQL 9.6 service can not start after unexpected shutdown
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16174 Logged by: Enrico La Torre Email address: pg.dba.iit.team@gmail.com PostgreSQL version: 9.6.12 Operating system: Windows Server 2016 Standard 10.0.14393 Description: Hi, in my company we run several thousands of PostgreSQL clusters currently with version 9.6.12 (one cluster per server). We run PostgreSQL on a virtual machine in Hyper-V Manager. A minor upgrade to the latest minor version is planned but I did not find any fix for this incident in any minor upgrade from 9.6.12 to 9.6.16. The same incident has been reported in https://www.postgresql.org/message-id/CAJtpbKR2NZKmoxveMSdJh%3Dzng1%2B6NwqK3EEbwQs0ac6oHE%3D4eg%40mail.gmail.com for PostgresSQL 11 but it has not been responded to yet. However, we experience the same incident in with PostgreSQL 9.6. We can only correlate this incident to a reboot after an unexpected shutdown of Windows. During restart it can(!) happen that the PostgreSQL service can not be started by the Service Control Manager because the PostgreSQL cluster is already running in the background. A pg_ctl status will return a running database server. This behavior must be related to a race condition, because the incident does not occur every time or can not be reproduced easily. The unexpected shutdowns originate from either power outages or more likely due to restarts of the server that hosts the Hyper-V Manager. If this server gets patched and nees to restart it basically kills all running VMs in that process, if they don't shutdown fast enough. Of course you can argue that we must get rid of these unexpected shutdowns. But isn't the boot process always the same after a shutdown nor matter if it was shutdown cleanly or not? Why is the PostgreSQL cluster started in the background and not by the Service Control Manager? Is there a bug in the interplay between pg_ctl and the Windows Service Control Manager? At the bottom I attached an excerpt of the relevant log files. Kind regards PostgreSQL version number you are running: 9.6.12 How you installed PostgreSQL: We initally started with PostgreSQL 9.6.6 and the EnterpriseDB installer postgresql-9.6.6-2-windows-x64.exe The minor upgrade to 9.6.12 was an installation with EnterpriseDB installer postgresql-9.6.12-1-windows-x64.exe and then simply a distribution of all 9.6.12 binary files to all other servers. Changes made to the settings in the postgresql.conf file: version -------------------------------------------------------------- PostgreSQL 9.6.12, compiled by Visual C++ build 1800, 64-bit name | current_setting | source ------------------------------+-------------------------------------------------+---------------------- application_name | psql | client archive_command | copy %p I:\Backup\PostgreSQL-WAL-Archive\%f | configuration file archive_mode | on | configuration file archive_timeout | 1h | configuration file checkpoint_completion_target | 0.9 | configuration file client_encoding | WIN1252 | client cluster_name | xxxxxxxxxxxx | configuration file DateStyle | ISO, DMY | configuration file default_text_search_config | simple | configuration file dynamic_shared_memory_type | windows | configuration file effective_cache_size | 3GB | configuration file lc_messages | C | configuration file lc_monetary | C | configuration file lc_numeric | C | configuration file lc_time | C | configuration file listen_addresses | * | configuration file log_checkpoints | on | configuration file log_connections | on | configuration file log_destination | stderr | configuration file log_disconnections | on | configuration file log_hostname | on | configuration file log_line_prefix | %t [%p]: [%l] %quser=%u,db=%d,app=%a,client=%h | configuration file log_temp_files | 100kB | configuration file log_timezone | America/Chicago | configuration file logging_collector | on | configuration file maintenance_work_mem | 256MB | configuration file max_connections | 100 | configuration file max_stack_depth | 2MB | environment variable max_wal_senders | 10 | configuration file max_wal_size | 4GB | configuration file min_wal_size | 2GB | configuration file port | 5432 | configuration file search_path | "$user", public | database shared_buffers | 512MB | configuration file ssl | on | configuration file ssl_cert_file | ssl-cert-snakeoil.pem | configuration file ssl_key_file | ssl-cert-snakeoil.key | configuration file TimeZone | America/Chicago | configuration file track_activity_query_size | 8192 | configuration file wal_level | replica | configuration file wal_sync_method | fsync | configuration file work_mem | 61166kB | configuration file Operating system and version: Windows Server 2016 Standard 10.0.14393 What program you're using to connect to PostgreSQL: psql, pgadmin4, ODBC driver PostgreSQL Unicode(x64) 9.06.04.00 Is there anything relevant or unusual in the PostgreSQL server logs?: in pg_log 2019-12-19 03:35:06 CST [4756]: [1] LOG: database system was interrupted; last known up at 2019-12-19 03:21:56 CST 2019-12-19 03:35:06 CST [3924]: [1] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: connection received: host=446-091STL01.aldi-499.loc port=62037 2019-12-19 03:35:06 CST [3788]: [1] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: connection received: host=446-091STL01.aldi-499.loc port=62042 2019-12-19 03:35:06 CST [5844]: [1] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: connection received: host=446-091STL01.aldi-499.loc port=62044 2019-12-19 03:35:07 CST [3780]: [1] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: connection received: host=446-091STL01.aldi-499.loc port=62043 2019-12-19 03:35:07 CST [3924]: [2] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: could not accept SSL connection: An established connection was aborted by the software in your host machine. 2019-12-19 03:35:07 CST [3788]: [2] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: could not accept SSL connection: An established connection was aborted by the software in your host machine. 2019-12-19 03:35:07 CST [3780]: [2] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: could not accept SSL connection: An established connection was aborted by the software in your host machine. 2019-12-19 03:35:08 CST [5844]: [2] user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc FATAL: the database system is starting up 2019-12-19 03:35:09 CST [4144]: [1] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: connection received: host=446-091STL01.aldi-499.loc port=62046 2019-12-19 03:35:09 CST [4144]: [2] user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc FATAL: the database system is starting up 2019-12-19 03:35:10 CST [4548]: [1] user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc LOG: connection received: host=446-091STL01.aldi-499.loc port=62048 [...] These messages continue until 2019-12-19 03:36:05 CST [5236]: [2] user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc FATAL: the database system is starting up 2019-12-19 03:37:29 CST [4756]: [2] LOG: database system was not properly shut down; automatic recovery in progress 2019-12-19 03:37:29 CST [4756]: [3] LOG: redo starts at 26/23141D00 2019-12-19 03:37:29 CST [4756]: [4] LOG: invalid record length at 26/23151158: wanted 24, got 0 2019-12-19 03:37:29 CST [4756]: [5] LOG: redo done at 26/23151120 2019-12-19 03:37:30 CST [4756]: [6] LOG: last completed transaction was at log time 2019-12-19 03:23:00.369744-06 2019-12-19 03:37:30 CST [4756]: [7] LOG: checkpoint starting: end-of-recovery immediate 2019-12-19 03:37:30 CST [4756]: [8] LOG: checkpoint complete: wrote 10 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.006 s, sync=0.414 s, total=0.657 s; sync files=5, longest=0.098 s, average=0.082 s; distance=61 kB, estimate=61 kB 2019-12-19 03:37:30 CST [4756]: [9] LOG: MultiXact member wraparound protections are now enabled 2019-12-19 03:37:30 CST [5380]: [3] LOG: database system is ready to accept connections in Event Viewer / Application Level Date and Time Source Event ID Task Category Information 12/19/2019 3:34:23 AM PostgreSQL 0 None Waiting for server startup... Information 12/19/2019 3:35:05 AM PostgreSQL 0 None "2019-12-19 03:35:05 CST [5380]: [1] LOG: redirecting log output to logging collector process2019-12-19 03:35:05 CST [5380]: [2] HINT: Future log output will appear in directory ""pg_log""." Error 12/19/2019 3:36:06 AM PostgreSQL 0 None Timed out waiting for server startup Information 12/19/2019 3:39:29 AM PostgreSQL 0 None Waiting for server startup... Error 12/19/2019 3:39:29 AM PostgreSQL 0 None "2019-12-19 03:39:29 CST [5216]: [1] FATAL: lock file ""postmaster.pid"" already exists2019-12-19 03:39:29 CST [5216]: [2] HINT: Is another postmaster (PID 5380) running in data directory ""G:/databases/PostgreSQL/9.6""?" Error 12/19/2019 3:39:30 AM PostgreSQL 0 None Timed out waiting for server startup in Event Viewer / System Level Date and Time Source Event ID Task Category Information 12/19/2019 3:31:58 AM Microsoft-Windows-Kernel-General 12 (1) The operating system started at system time 2019-12-19T09:31:58.494227200Z. Information 12/19/2019 3:31:58 AM Microsoft-Windows-Kernel-Boot 20 (31) The last shutdown's success status was false. The last boot's success status was true. Critical 12/19/2019 3:32:08 AM Microsoft-Windows-Kernel-Power 41 (63) The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly. Error 12/19/2019 3:39:30 AM Service Control Manager 7001 None "The xxx-Server service depends on the postgresql - PostgreSQL Server 9.6 service which failed to start because of the following error: The operation completed successfully." Information 12/19/2019 3:39:30 AM Service Control Manager 7036 None The postgresql - PostgreSQL Server 9.6 service entered the stopped state. What you were doing when the error happened / how to cause the error: with PowerShell: # stop the cluster pg_ctl stop -D G:\databases\PostgreSQL\9.6 -m fast # start the cluster via Service Control Manager Start-Service -Name postgresql