Обсуждение: [BUGS] BUG #14714: long running sessions from remote instance seems to hangsome times

Поиск
Список
Период
Сортировка

[BUGS] BUG #14714: long running sessions from remote instance seems to hangsome times

От
josef.machytka@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      14714
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system:   Debian jessie
Description:

On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.

Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.

When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."

We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...

Do you please have any advice what to check or what setting might help?




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

josef.machytka@gmail.com writes:
> Sometimes some session goes into some kind of "drowsy" state and runs
> incredibly slowly. It causes some very low disk IO and runs like 20x slower
> then usual.

Is it using a lot of CPU time according to ps or vmstat?

> Do you please have any advice what to check or what setting might help?

If it's eating CPU time, try using "perf" to find out where.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Pavel Stehule
Дата:


2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:

Bug reference:      14714
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system:   Debian jessie
Description:

On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.

Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.

When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."

We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...

Do you please have any advice what to check or what setting might help?

it can be spin lock issue. Try to use "perf top". Install postgresql-server debug symbols before.

what is your share buffers configuration?

Regards

Pavel 




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Jeff Janes
Дата:
On Mon, Jun 19, 2017 at 7:49 AM, <josef.machytka@gmail.com> wrote:
The following bug has been logged on the website:

Bug reference:      14714
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system:   Debian jessie
Description:

On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.

Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.

One possibility is that another process inserted a bunch of new tuples and committed them, and now the existing session has to dig through all those new tuples and ignore them because they are too new to be visible to it.  This would particularly be a problem where the queries have things like "MAX(indexed_column)" or "ORDER BY indexed_column desc LIMIT 1" and where the newly insert tuples all have values which are at the end of the range being probed by that query.  Although I don't know why this would have gotten worse from 9.5 to 9.6.  The change should have been in the other direction.

Is this on a master or on a hot standby?

Cheers,

Jeff 

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Josef Machytka
Дата:
Today I got several sessions in this "drowsy" state and it almost caused collapse of the whole reporting system because aggregations were not fully calculated.
Looks like now on PG 9.6.3 is this problem even worse then on previous 9.6.x subversions...

I checked "top", "iotop -u postgres", "perf top" and "perf top -u postgres" and as far as I could see those sessions are running although they are marked as "S" sleeping in "top" command almost all the time.
From time to time they cause very low CPU load, something like from 0,5 to 0,8% - which is visible only in "perf" output. In "top" I see almost all the time 0.0% on CPU.
The same with disk IO - iotop shows only a few MB/s from time to time.
They are marked as "active" in "pg_stat_activity" and no wait events are shown.

GCE Instance has Debian 8, 10 CPUs, 46GB RAM
Our database has 2390 GB is distributed over 3 different tablespaces on 3 different GCE standard persistent disks, wal logs and temp tablespace have separate disk.

Out pg config is:
shared_buffers = 8GB
temp_buffers = 512MB
work_mem = 192MB
maintenance_work_mem = 2GB
max_connections = 150
max_wal_size = 8GB
min_wal_size = 4GB
max_worker_processes = 50
max_parallel_workers_per_gather = 4

There are several procedures which set work_mem to 256MB because tests showed it speeds up processing really significantly.
Reason is when we switched from PG 9.5 to PG 9.6 we found that many old queries had suddenly very bad explain plan and very long run time.
Only setting work_mem from previous 64MB to the current values helped to get the same or better results for those queries on 9.6.
Of course we upgraded instance with more memory. But we do not use swap.

Shall I look for some special perf symbols in output? Or is there a possibility to debug it deeper?

Thanks

Josef


On 19 June 2017 at 17:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:


2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:

Bug reference:      14714
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system:   Debian jessie
Description:

On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.

Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.

When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."

We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...

Do you please have any advice what to check or what setting might help?

it can be spin lock issue. Try to use "perf top". Install postgresql-server debug symbols before.

what is your share buffers configuration?

Regards

Pavel 




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs


Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Pavel Stehule
Дата:


2017-06-23 11:26 GMT+02:00 Josef Machytka <josef.machytka@gmail.com>:
Today I got several sessions in this "drowsy" state and it almost caused collapse of the whole reporting system because aggregations were not fully calculated.
Looks like now on PG 9.6.3 is this problem even worse then on previous 9.6.x subversions...

I checked "top", "iotop -u postgres", "perf top" and "perf top -u postgres" and as far as I could see those sessions are running although they are marked as "S" sleeping in "top" command almost all the time.
From time to time they cause very low CPU load, something like from 0,5 to 0,8% - which is visible only in "perf" output. In "top" I see almost all the time 0.0% on CPU.
The same with disk IO - iotop shows only a few MB/s from time to time.
They are marked as "active" in "pg_stat_activity" and no wait events are shown.

GCE Instance has Debian 8, 10 CPUs, 46GB RAM
Our database has 2390 GB is distributed over 3 different tablespaces on 3 different GCE standard persistent disks, wal logs and temp tablespace have separate disk.

Out pg config is:
shared_buffers = 8GB
temp_buffers = 512MB
work_mem = 192MB
maintenance_work_mem = 2GB
max_connections = 150
max_wal_size = 8GB
min_wal_size = 4GB
max_worker_processes = 50
max_parallel_workers_per_gather = 4

There are several procedures which set work_mem to 256MB because tests showed it speeds up processing really significantly.
Reason is when we switched from PG 9.5 to PG 9.6 we found that many old queries had suddenly very bad explain plan and very long run time.
Only setting work_mem from previous 64MB to the current values helped to get the same or better results for those queries on 9.6.
Of course we upgraded instance with more memory. But we do not use swap.

Shall I look for some special perf symbols in output? Or is there a possibility to debug it deeper?


what is perf top output?

you can install debug symbols for postgres and for kernel  https://wiki.debian.org/DebugPackage

Regards

Pavel

Thanks

Josef


On 19 June 2017 at 17:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:


2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:

Bug reference:      14714
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system:   Debian jessie
Description:

On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.

Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.

When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."

We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...

Do you please have any advice what to check or what setting might help?

it can be spin lock issue. Try to use "perf top". Install postgresql-server debug symbols before.

what is your share buffers configuration?

Regards

Pavel 




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs



Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Josef Machytka
Дата:
sorry, here is file with sample - basically several group by aggregations over different partitioned tables have been locked in this "drowsy" state

On 23 June 2017 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:


2017-06-23 11:26 GMT+02:00 Josef Machytka <josef.machytka@gmail.com>:
Today I got several sessions in this "drowsy" state and it almost caused collapse of the whole reporting system because aggregations were not fully calculated.
Looks like now on PG 9.6.3 is this problem even worse then on previous 9.6.x subversions...

I checked "top", "iotop -u postgres", "perf top" and "perf top -u postgres" and as far as I could see those sessions are running although they are marked as "S" sleeping in "top" command almost all the time.
From time to time they cause very low CPU load, something like from 0,5 to 0,8% - which is visible only in "perf" output. In "top" I see almost all the time 0.0% on CPU.
The same with disk IO - iotop shows only a few MB/s from time to time.
They are marked as "active" in "pg_stat_activity" and no wait events are shown.

GCE Instance has Debian 8, 10 CPUs, 46GB RAM
Our database has 2390 GB is distributed over 3 different tablespaces on 3 different GCE standard persistent disks, wal logs and temp tablespace have separate disk.

Out pg config is:
shared_buffers = 8GB
temp_buffers = 512MB
work_mem = 192MB
maintenance_work_mem = 2GB
max_connections = 150
max_wal_size = 8GB
min_wal_size = 4GB
max_worker_processes = 50
max_parallel_workers_per_gather = 4

There are several procedures which set work_mem to 256MB because tests showed it speeds up processing really significantly.
Reason is when we switched from PG 9.5 to PG 9.6 we found that many old queries had suddenly very bad explain plan and very long run time.
Only setting work_mem from previous 64MB to the current values helped to get the same or better results for those queries on 9.6.
Of course we upgraded instance with more memory. But we do not use swap.

Shall I look for some special perf symbols in output? Or is there a possibility to debug it deeper?


what is perf top output?

you can install debug symbols for postgres and for kernel  https://wiki.debian.org/DebugPackage

Regards

Pavel

Thanks

Josef


On 19 June 2017 at 17:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:


2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:

Bug reference:      14714
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system:   Debian jessie
Description:

On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.

Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.

When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."

We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...

Do you please have any advice what to check or what setting might help?

it can be spin lock issue. Try to use "perf top". Install postgresql-server debug symbols before.

what is your share buffers configuration?

Regards

Pavel 




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs




Вложения
Josef Machytka <josef.machytka@gmail.com> writes:
> sorry, here is file with sample - basically several group by aggregations
> over different partitioned tables have been locked in this "drowsy" state

The amount of time being spent in the kernel is strikingly high.  I wonder
if you're seeing some variant of the old "context swap storm" problem.
Try watching the output of "vmstat 1" for awhile to see if the cs rate
is high.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Josef Machytka
Дата:
If there are no queries on database I see number of context switches from ~900 to ~1500.

I checked our monitoring based on node_exporter, Prometheus and Grafana and when this "drowsiness" of sessions happens I see number of context switches from ~10 000 to ~25 000 with a few peeks ~28 000.

But similar numbers or even slightly higher I see also in other cases of high load on the database and they did not caused any "drowsy" sessions. In fact monitoring does not show any significant differences between cases with "drowsy" sessions and without them...If you have some advices about what else to monitor I will add.

On 23 June 2017 at 16:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Josef Machytka <josef.machytka@gmail.com> writes:
> sorry, here is file with sample - basically several group by aggregations
> over different partitioned tables have been locked in this "drowsy" state

The amount of time being spent in the kernel is strikingly high.  I wonder
if you're seeing some variant of the old "context swap storm" problem.
Try watching the output of "vmstat 1" for awhile to see if the cs rate
is high.

                        regards, tom lane

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Andres Freund
Дата:
Hi,

On 2017-06-19 14:49:07 +0000, josef.machytka@gmail.com wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      14714
> Logged by:          Josef Machytka
> Email address:      josef.machytka@gmail.com
> PostgreSQL version: 9.6.3
> Operating system:   Debian jessie
> Description:        
> 
> On pg 9.6.* we have sometimes strange problem with some sessions. We
> experienced it from bash scripts, golang program and node.js applications.
> Environment are Google Compute Engine instances with Debian 8.
> 
> Sometimes some session goes into some kind of "drowsy" state and runs
> incredibly slowly. It causes some very low disk IO and runs like 20x slower
> then usual. This happens mainly when database is under heavy load and it
> happens randomly to different tasks. It happens randomly to both connections
> from local instance (cronjob running on instance with PG) and remote
> connections from other instances.
> 
> When we restart task it runs normally. I cannot see any changes in process
> priority in OS, process switches between sleeping and running state and I do
> not see any locks or waits in pg_stat_activity or pg_locks. Usually all
> other processes finish OK and this "drowsy" one runs "for ever..."

Which version of the os & kernel is this?  This kind of sounds like the
symptoms that we've observed more frequently a while ago with
the kernel's transparent_hugepages management.

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Josef Machytka
Дата:
Debian 8:
Linux <hostname> 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u1 (2017-06-18) x86_64 GNU/Linux

On 23 June 2017 at 18:57, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-06-19 14:49:07 +0000, josef.machytka@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14714
> Logged by:          Josef Machytka
> Email address:      josef.machytka@gmail.com
> PostgreSQL version: 9.6.3
> Operating system:   Debian jessie
> Description:
>
> On pg 9.6.* we have sometimes strange problem with some sessions. We
> experienced it from bash scripts, golang program and node.js applications.
> Environment are Google Compute Engine instances with Debian 8.
>
> Sometimes some session goes into some kind of "drowsy" state and runs
> incredibly slowly. It causes some very low disk IO and runs like 20x slower
> then usual. This happens mainly when database is under heavy load and it
> happens randomly to different tasks. It happens randomly to both connections
> from local instance (cronjob running on instance with PG) and remote
> connections from other instances.
>
> When we restart task it runs normally. I cannot see any changes in process
> priority in OS, process switches between sleeping and running state and I do
> not see any locks or waits in pg_stat_activity or pg_locks. Usually all
> other processes finish OK and this "drowsy" one runs "for ever..."

Which version of the os & kernel is this?  This kind of sounds like the
symptoms that we've observed more frequently a while ago with
the kernel's transparent_hugepages management.

Greetings,

Andres Freund

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Andres Freund
Дата:
Hi,

(please don't top-quote on this list)

On 2017-06-26 12:27:04 +0200, Josef Machytka wrote:
> Debian 8:
> Linux <hostname> 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u1 (2017-06-18)
> x86_64 GNU/Linux

You should chekc the following parameters:
cat /proc/sys/vm/zone_reclaim_mode
cat /sys/kernel/mm/transparent_hugepage/enabled
and disable them:
echo 0 > /proc/sys/vm/zone_reclaim_mode
echo never >  /sys/kernel/mm/transparent_hugepage/enabled

- Andres

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14714: long running sessions from remote instanceseems to hang some times

От
Josef Machytka
Дата:
Hi,
thank you for an advice - I set it and tested for several days.
Good news is we do not have "drowsy" sessions any more.
But performance of queries on huge tables went seriously down. Even with parallel processing.

It is understandable regarding memory management but not sustainable for the future since our tables are growing almost exponentially. In these days we have weekly increase in size >100GB and it goes still higher. Database now have 2,5 TB. We had to move some of our queries from PostgreSQL to Bigquery because performance was really bad...

I checked history of this database and previously PostgreSQL 9.5 was running on Google GCE instance with Ubuntu 14.04 now we have new instance with Debian 8 and PostgeSQL 9.6.

On previous instance with Ubuntu we did not have all these problems at all. I checked and Ubuntu 14.04 has by default hugepages enabled (value “always”) + fixed setting of number of huge pages but Debian 8 has default setting “madvise” - memory adviser.

Please has someone here some deeper experiences with differences between Ubuntu and Debian as environment for PostgreSQL as highly used OLAP database? Because so far it looks like switching to Debian was rather bad choice....

Thanks

Josef