Обсуждение: pgagent hangs forever in "r" state

Поиск
Список
Период
Сортировка

pgagent hangs forever in "r" state

От
Josh Berkus
Дата:
All,

On the same server with the SQL errors (sequences are cleaned up now),
we're having pgagent start all jobs in the "r" state, log the first step
in each job in the "r" state, and then hang forever (or, at least for
three days).  At this point, 8 different jobs are in "r" state.

In pg_stat_activity, pgagent is connected, but hasn't run anything
except the query to get the next step.

We're debugging away, but would appreciate advice on other ways to
figure out what pgagent is doing.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: pgagent hangs forever in "r" state

От
Josh Berkus
Дата:
On 07/06/2015 10:26 AM, Josh Berkus wrote:
> All,
> 
> On the same server with the SQL errors (sequences are cleaned up now),
> we're having pgagent start all jobs in the "r" state, log the first step
> in each job in the "r" state, and then hang forever (or, at least for
> three days).  At this point, 8 different jobs are in "r" state.

... as an update, what it appears to be doing is executing the first
step of each job, successfully.  Then, for some reason, it attempts to
destroy that jobthread after just the first step.  I say "attempts"
because that child backend is still running, and none of pga_job,
pga_joblog, or pga_jobsteplog get updated.

I've looked in jobs.cpp and pgAgent.cpp, and I can't figure out what
would cause it to stop after the first step.  All of the steps I've
looked at are regular queries, and can be successfully executed by hand.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: pgagent hangs forever in "r" state

От
Dave Page
Дата:
On Tue, Jul 7, 2015 at 12:02 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/06/2015 10:26 AM, Josh Berkus wrote:
>> All,
>>
>> On the same server with the SQL errors (sequences are cleaned up now),
>> we're having pgagent start all jobs in the "r" state, log the first step
>> in each job in the "r" state, and then hang forever (or, at least for
>> three days).  At this point, 8 different jobs are in "r" state.
>
> ... as an update, what it appears to be doing is executing the first
> step of each job, successfully.  Then, for some reason, it attempts to
> destroy that jobthread after just the first step.  I say "attempts"
> because that child backend is still running, and none of pga_job,
> pga_joblog, or pga_jobsteplog get updated.
>
> I've looked in jobs.cpp and pgAgent.cpp, and I can't figure out what
> would cause it to stop after the first step.  All of the steps I've
> looked at are regular queries, and can be successfully executed by hand.

Are you able to get a stacktrace from a running instance that's hung?

Also, does setting the logging level to debug reveal anything useful
in the filesystem logs? (pgagent -l2 ....)

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgagent hangs forever in "r" state

От
Josh Berkus
Дата:
On 07/10/2015 07:16 AM, Dave Page wrote:
> Are you able to get a stacktrace from a running instance that's hung?

Yes, but it's not doing anything, it's just polling for the next job.

> Also, does setting the logging level to debug reveal anything useful
> in the filesystem logs? (pgagent -l2 ....)

Nope.  And I scanned the postgresql log for SQL errors, and there's nothing.

What appears to be happening is:

1. pgagent picks up job

2. pgagent finds first step in that job

3. pgagent executes first step (successfully, according to the
postgresql log, and according to a fake "touch file" job I added).

4. pgagent abandons the job without updating the status of either the
job or the step.

5. pgagent starts polling for new jobs.

6. Repeat 1-5 until all jobs are in "r" state with an assigned agent,
but not running.

Note that this particular user has now abandoned pgagent and is
switching to centrally managed cron, so my ability to get further
troubleshooting information will be limited.

According to apt-get, this is the latest pgagent package.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: pgagent hangs forever in "r" state

От
Dave Page
Дата:
Sanket, can you reproduce this?

On Fri, Jul 10, 2015 at 5:41 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/10/2015 07:16 AM, Dave Page wrote:
>> Are you able to get a stacktrace from a running instance that's hung?
>
> Yes, but it's not doing anything, it's just polling for the next job.
>
>> Also, does setting the logging level to debug reveal anything useful
>> in the filesystem logs? (pgagent -l2 ....)
>
> Nope.  And I scanned the postgresql log for SQL errors, and there's nothing.
>
> What appears to be happening is:
>
> 1. pgagent picks up job
>
> 2. pgagent finds first step in that job
>
> 3. pgagent executes first step (successfully, according to the
> postgresql log, and according to a fake "touch file" job I added).
>
> 4. pgagent abandons the job without updating the status of either the
> job or the step.
>
> 5. pgagent starts polling for new jobs.
>
> 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent,
> but not running.
>
> Note that this particular user has now abandoned pgagent and is
> switching to centrally managed cron, so my ability to get further
> troubleshooting information will be limited.
>
> According to apt-get, this is the latest pgagent package.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com



-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgagent hangs forever in "r" state

От
Sanket Mehta
Дата:
Sure Dave,

I will try and let you know.

Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb

On Mon, Jul 13, 2015 at 7:44 PM, Dave Page <dpage@pgadmin.org> wrote:
Sanket, can you reproduce this?

On Fri, Jul 10, 2015 at 5:41 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/10/2015 07:16 AM, Dave Page wrote:
>> Are you able to get a stacktrace from a running instance that's hung?
>
> Yes, but it's not doing anything, it's just polling for the next job.
>
>> Also, does setting the logging level to debug reveal anything useful
>> in the filesystem logs? (pgagent -l2 ....)
>
> Nope.  And I scanned the postgresql log for SQL errors, and there's nothing.
>
> What appears to be happening is:
>
> 1. pgagent picks up job
>
> 2. pgagent finds first step in that job
>
> 3. pgagent executes first step (successfully, according to the
> postgresql log, and according to a fake "touch file" job I added).
>
> 4. pgagent abandons the job without updating the status of either the
> job or the step.
>
> 5. pgagent starts polling for new jobs.
>
> 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent,
> but not running.
>
> Note that this particular user has now abandoned pgagent and is
> switching to centrally managed cron, so my ability to get further
> troubleshooting information will be limited.
>
> According to apt-get, this is the latest pgagent package.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com



--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pgagent hangs forever in "r" state

От
Sanket Mehta
Дата:
Hi Dave,

I have tried to reproduce the same in my system by resetting the pgagent sequences to 1.

Below is the results I have came across:

         the job is not getting executed and its current status is "i" (no steps found ) although I have specified 2 steps for that job.

Below is the log entry in postgreSQL logs:

         2015-07-20 15:35:14 IST ERROR:  duplicate key value violates unique constraint "pga_joblog_pkey"
         2015-07-20 15:35:14 IST DETAIL:  Key (jlgid)=(3) already exists.
         2015-07-20 15:35:14 IST STATEMENT:  INSERT INTO pgagent.pga_joblog(jlgid, jlgjobid, jlgstatus) VALUES (3, 1, 'r')

So my analysis is:

         as sequences are getting reset before the job starts, it tries to insert jlgid directly from nextvalue of its sequence which is already present in the table. same thing happens for jobsteplog and joblog tables also.

Suggestion to resolve the issue:

         get the max value of ID from table before insert query and run the sequence till we get the max(ID)+1 and then we execute the insert statement with this new value as ID.

Please provide your suggestion for the same.


Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb

On Tue, Jul 14, 2015 at 11:07 AM, Sanket Mehta <sanket.mehta@enterprisedb.com> wrote:
Sure Dave,

I will try and let you know.

Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb

On Mon, Jul 13, 2015 at 7:44 PM, Dave Page <dpage@pgadmin.org> wrote:
Sanket, can you reproduce this?

On Fri, Jul 10, 2015 at 5:41 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/10/2015 07:16 AM, Dave Page wrote:
>> Are you able to get a stacktrace from a running instance that's hung?
>
> Yes, but it's not doing anything, it's just polling for the next job.
>
>> Also, does setting the logging level to debug reveal anything useful
>> in the filesystem logs? (pgagent -l2 ....)
>
> Nope.  And I scanned the postgresql log for SQL errors, and there's nothing.
>
> What appears to be happening is:
>
> 1. pgagent picks up job
>
> 2. pgagent finds first step in that job
>
> 3. pgagent executes first step (successfully, according to the
> postgresql log, and according to a fake "touch file" job I added).
>
> 4. pgagent abandons the job without updating the status of either the
> job or the step.
>
> 5. pgagent starts polling for new jobs.
>
> 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent,
> but not running.
>
> Note that this particular user has now abandoned pgagent and is
> switching to centrally managed cron, so my ability to get further
> troubleshooting information will be limited.
>
> According to apt-get, this is the latest pgagent package.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com



--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgagent hangs forever in "r" state

От
Dave Page
Дата:
Hi

On Mon, Jul 20, 2015 at 12:49 PM, Sanket Mehta
<sanket.mehta@enterprisedb.com> wrote:
> Hi Dave,
>
> I have tried to reproduce the same in my system by resetting the pgagent
> sequences to 1.
>
> Below is the results I have came across:
>
>          the job is not getting executed and its current status is "i" (no
> steps found ) although I have specified 2 steps for that job.
>
> Below is the log entry in postgreSQL logs:
>
>          2015-07-20 15:35:14 IST ERROR:  duplicate key value violates unique
> constraint "pga_joblog_pkey"
>          2015-07-20 15:35:14 IST DETAIL:  Key (jlgid)=(3) already exists.
>          2015-07-20 15:35:14 IST STATEMENT:  INSERT INTO
> pgagent.pga_joblog(jlgid, jlgjobid, jlgstatus) VALUES (3, 1, 'r')

OK - I'm not sure that's related to the issue in this thread though.

> So my analysis is:
>
>          as sequences are getting reset before the job starts, it tries to
> insert jlgid directly from nextvalue of its sequence which is already
> present in the table. same thing happens for jobsteplog and joblog tables
> also.
>
> Suggestion to resolve the issue:
>
>          get the max value of ID from table before insert query and run the
> sequence till we get the max(ID)+1 and then we execute the insert statement
> with this new value as ID.

We're not doing that - it'll be a performance killer. The user needs
to not muck about with their sequences.

The real questions are:

1) Why doesn't the insert error appear in the pgAgent log?

2) Can you reproduce Josh's issue where the first step in a job runs,
but the second doesn't (without touching the sequences)?

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgagent hangs forever in "r" state

От
Josh Berkus
Дата:
On 07/20/2015 08:01 AM, Dave Page wrote:
>> Suggestion to resolve the issue:
>>
>>          get the max value of ID from table before insert query and run the
>> sequence till we get the max(ID)+1 and then we execute the insert statement
>> with this new value as ID.
> 
> We're not doing that - it'll be a performance killer. The user needs
> to not muck about with their sequences.

Yeah, pgagent doesn't need to cope with sequence failures.  That's the
DBA's job to fix.

> 
> The real questions are:
> 
> 1) Why doesn't the insert error appear in the pgAgent log?
> 
> 2) Can you reproduce Josh's issue where the first step in a job runs,
> but the second doesn't (without touching the sequences)?

Particularly, *after* the sequence errors are fixed, do the jobs now run
normally?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: pgagent hangs forever in "r" state

От
Sanket Mehta
Дата:
Hi Dave,

As per our discussion, I have tried to reproduce the issue, but it seems jobs are running fine without any problem.
Hence the issue is not reproducible on my system.

Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb

On Tue, Jul 21, 2015 at 9:52 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 07/20/2015 08:01 AM, Dave Page wrote:
>> Suggestion to resolve the issue:
>>
>>          get the max value of ID from table before insert query and run the
>> sequence till we get the max(ID)+1 and then we execute the insert statement
>> with this new value as ID.
>
> We're not doing that - it'll be a performance killer. The user needs
> to not muck about with their sequences.

Yeah, pgagent doesn't need to cope with sequence failures.  That's the
DBA's job to fix.

>
> The real questions are:
>
> 1) Why doesn't the insert error appear in the pgAgent log?
>
> 2) Can you reproduce Josh's issue where the first step in a job runs,
> but the second doesn't (without touching the sequences)?

Particularly, *after* the sequence errors are fixed, do the jobs now run
normally?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


--
Sent via pgadmin-support mailing list (pgadmin-support@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgadmin-support

Re: pgagent hangs forever in "r" state

От
Ashesh Vashi
Дата:
On Wed, Jul 22, 2015 at 11:56 AM, Sanket Mehta <sanket.mehta@enterprisedb.com> wrote:
Hi Dave,

As per our discussion, I have tried to reproduce the issue, but it seems jobs are running fine without any problem.
Hence the issue is not reproducible on my system.
Josh,

As we're unable to reproduce the issue at our end, can you please generate the logs after setting the log level to debug2 (i.e. -l2) for analyzing the issue?

--

Thanks & Regards,

Ashesh Vashi
EnterpriseDB INDIA: Enterprise PostgreSQL Company


http://www.linkedin.com/in/asheshvashi



Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb

On Tue, Jul 21, 2015 at 9:52 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 07/20/2015 08:01 AM, Dave Page wrote:
>> Suggestion to resolve the issue:
>>
>>          get the max value of ID from table before insert query and run the
>> sequence till we get the max(ID)+1 and then we execute the insert statement
>> with this new value as ID.
>
> We're not doing that - it'll be a performance killer. The user needs
> to not muck about with their sequences.

Yeah, pgagent doesn't need to cope with sequence failures.  That's the
DBA's job to fix.

>
> The real questions are:
>
> 1) Why doesn't the insert error appear in the pgAgent log?
>
> 2) Can you reproduce Josh's issue where the first step in a job runs,
> but the second doesn't (without touching the sequences)?

Particularly, *after* the sequence errors are fixed, do the jobs now run
normally?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


--
Sent via pgadmin-support mailing list (pgadmin-support@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgadmin-support


Re: pgagent hangs forever in "r" state

От
Josh Berkus
Дата:
On 07/21/2015 11:29 PM, Ashesh Vashi wrote:
> On Wed, Jul 22, 2015 at 11:56 AM, Sanket Mehta
> <sanket.mehta@enterprisedb.com <mailto:sanket.mehta@enterprisedb.com>>
> wrote:
> 
>     Hi Dave,
> 
>     As per our discussion, I have tried to reproduce the issue, but it
>     seems jobs are running fine without any problem.
>     Hence the issue is not reproducible on my system.
> 
> Josh,
> 
> As we're unable to reproduce the issue at our end, can you please
> generate the logs after setting the log level to debug2 (i.e. -l2) for
> analyzing the issue?

As I said before, the logs show nothing of interest.  There are no
errors. just polling for new jobs.

Anyway, the failing system has now been shut down, so we'll need to
chalk this particular failure up to "wierd stuff".


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: pgagent hangs forever in "r" state

От
Leonard Boyce
Дата:
Just chiming in quickly,

We experienced exactly the same behavior with pg 9.4.
It was consistent across the 18 database servers that we upgraded from
9.3 to 9.4.

A job would complete successfully, and then just remain in 'running' state.

We tried purging and re-installing pg and pgadmin/agent cleanly and
still had the same results.

Eventually in frustration we just copied over the pgagent binary
(nothing else) from a server running 9.3 and amazingly stuff started
working.

All I can really contribute to this issue discussion is our "path to
the problem".

Initial OS: Ubuntu 12 LTS x64 running pg 9.1 from the pgdg repos
We were running 9.3 without any issues on those servers. All pgagent
jobs would complete and run as expected
We upgraded the OS to Ubuntu 14.04 LTS
We then upgraded pg to 9.4

Post-upgrade, pgagent (which was also upgraded in the process) started
exhibiting the behavior as described in the thread

Leonard


On Wed, Jul 22, 2015 at 12:33 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/21/2015 11:29 PM, Ashesh Vashi wrote:
>> On Wed, Jul 22, 2015 at 11:56 AM, Sanket Mehta
>> <sanket.mehta@enterprisedb.com <mailto:sanket.mehta@enterprisedb.com>>
>> wrote:
>>
>>     Hi Dave,
>>
>>     As per our discussion, I have tried to reproduce the issue, but it
>>     seems jobs are running fine without any problem.
>>     Hence the issue is not reproducible on my system.
>>
>> Josh,
>>
>> As we're unable to reproduce the issue at our end, can you please
>> generate the logs after setting the log level to debug2 (i.e. -l2) for
>> analyzing the issue?
>
> As I said before, the logs show nothing of interest.  There are no
> errors. just polling for new jobs.
>
> Anyway, the failing system has now been shut down, so we'll need to
> chalk this particular failure up to "wierd stuff".
>
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>
>
> --
> Sent via pgadmin-support mailing list (pgadmin-support@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgadmin-support