Обсуждение: Unexpected "canceling statement due to user request" error

Поиск
Список
Период
Сортировка

Unexpected "canceling statement due to user request" error

От
Will Storey
Дата:
Hi!

I have a query that fails due to this error and I'm trying to understand
why.

My understanding is I should only see this error if I cancel a query
manually, such as with kill -INT or with pg_cancel_backend(). However I
can't find anything doing that.

The query looks like this:

    SELECT *
    FROM (
      SELECT
        c1, c2, [...],
        EXTRACT(EPOCH FROM time) AS epoch,
        to_rfc3339_us(time) AS time_rfc3339
      FROM t1
      WHERE
        NOT EXISTS (SELECT 1 FROM t2 WHERE c2 = ?) AND
        c2 = ? AND
        time < ? AND
        time > ?::timestamptz - ? * interval '1 day'
      LIMIT ?
    ) AS s
    ORDER BY epoch DESC

t1 is partitioned on time, by month.

I run it using Perl's DBI with DBD::Pg. I wrap it in a transaction where I
first run:

    SET LOCAL statement_timeout TO 1000

I know this query can time out, and it does, resulting in the error I
expect: "canceling statement due to statement timeout". The problem is
occasionally I see this other error: "canceling statement due to user
request".

Looking at the query logs, when the query fails with the user request
error, the query reached the statement timeout (at least in cases I
checked). In one instance the duration shows as 1283ms for example.

Could there be any situation where a query getting cancelled due to a
statement timeout be reported as cancelled due to user request? Or do you
have any ideas about what might be going on?

This is on PostgreSQL 9.6.14 on Ubuntu Xenial.

Thank you!



Re: Unexpected "canceling statement due to user request" error

От
Adrian Klaver
Дата:
On 8/16/19 2:02 PM, Will Storey wrote:
> Hi!
> 
> I have a query that fails due to this error and I'm trying to understand
> why.
> 
> My understanding is I should only see this error if I cancel a query
> manually, such as with kill -INT or with pg_cancel_backend(). However I
> can't find anything doing that.
> 
> The query looks like this:
> 
>      SELECT *
>      FROM (
>        SELECT
>          c1, c2, [...],
>          EXTRACT(EPOCH FROM time) AS epoch,
>          to_rfc3339_us(time) AS time_rfc3339
>        FROM t1
>        WHERE
>          NOT EXISTS (SELECT 1 FROM t2 WHERE c2 = ?) AND
>          c2 = ? AND
>          time < ? AND
>          time > ?::timestamptz - ? * interval '1 day'
>        LIMIT ?
>      ) AS s
>      ORDER BY epoch DESC
> 
> t1 is partitioned on time, by month.
> 
> I run it using Perl's DBI with DBD::Pg. I wrap it in a transaction where I
> first run:
> 
>      SET LOCAL statement_timeout TO 1000
> 
> I know this query can time out, and it does, resulting in the error I
> expect: "canceling statement due to statement timeout". The problem is
> occasionally I see this other error: "canceling statement due to user
> request".
> 
> Looking at the query logs, when the query fails with the user request
> error, the query reached the statement timeout (at least in cases I
> checked). In one instance the duration shows as 1283ms for example.
> 
> Could there be any situation where a query getting cancelled due to a
> statement timeout be reported as cancelled due to user request? Or do you
> have any ideas about what might be going on?

Here is the relevant source:

https://doxygen.postgresql.org/postgres_8c.html#ab7af3ee6ad40efb8454c6f2f790d9588

Start at around line 3078.

> 
> This is on PostgreSQL 9.6.14 on Ubuntu Xenial.
> 
> Thank you!
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Unexpected "canceling statement due to user request" error

От
Will Storey
Дата:
On Sat 2019-08-17 10:32:40 -0700, Adrian Klaver wrote:
> > I know this query can time out, and it does, resulting in the error I
> > expect: "canceling statement due to statement timeout". The problem is
> > occasionally I see this other error: "canceling statement due to user
> > request".
> > 
> > Looking at the query logs, when the query fails with the user request
> > error, the query reached the statement timeout (at least in cases I
> > checked). In one instance the duration shows as 1283ms for example.
> > 
> > Could there be any situation where a query getting cancelled due to a
> > statement timeout be reported as cancelled due to user request? Or do you
> > have any ideas about what might be going on?
> 
> Here is the relevant source:
> 
> https://doxygen.postgresql.org/postgres_8c.html#ab7af3ee6ad40efb8454c6f2f790d9588
> 
> Start at around line 3078.

Thanks for the pointer!

It looks like we'd have to be entering that section and finding the
statement timeout not set, otherwise we'd not fall through to the "user
request" case.

But as far as I can tell, the only reason my query would be cancelled is
because of the statement timeout. I wonder if there is some way the
statement timeout flag is being lost/cleared. I see there's situations
where we clear timeout flags.

Something I forgot to mention: The queries are against a hot standby
server.



Re: Unexpected "canceling statement due to user request" error

От
Tom Lane
Дата:
Will Storey <will@summercat.com> writes:
> Thanks for the pointer!

> It looks like we'd have to be entering that section and finding the
> statement timeout not set, otherwise we'd not fall through to the "user
> request" case.

> But as far as I can tell, the only reason my query would be cancelled is
> because of the statement timeout. I wonder if there is some way the
> statement timeout flag is being lost/cleared. I see there's situations
> where we clear timeout flags.

I poked at this for awhile and concluded that what you must be seeing is
that the statement timeout interrupt triggers, but no CHECK_FOR_INTERRUPTS
call happens thereafter, until we get to the disable_statement_timeout()
call in finish_xact_command().  That throws away the timeout-happened
indicator, but it does nothing to reset InterruptPending or
QueryCancelPending.  Those flags will also be thrown away eventually,
after we've read a new client command:

        /*
         * (4) disable async signal conditions again.
         *
         * Query cancel is supposed to be a no-op when there is no query in
         * progress, so if a query cancel arrived while we were idle, just
         * reset QueryCancelPending. ProcessInterrupts() has that effect when
         * it's called when DoingCommandRead is set, so check for interrupts
         * before resetting DoingCommandRead.
         */
        CHECK_FOR_INTERRUPTS();
        DoingCommandRead = false;

Ignoring the timeout if it happened too late is what we're supposed
to be doing, so that's fine so far.

But: what happens if some other CHECK_FOR_INTERRUPTS occurs before we
reach the main input loop?  Of course, exactly the symptom you saw:
a bogus "canceling statement due to user request" error report.
What's worse, that happens after we've completed the transaction, leaving
the client with a very misleading idea of whether the command committed
or not.

So how could that happen?  After experimenting, I found a pretty
obvious candidate: statement duration logging happens after we
do finish_xact_command, and there's a CHECK_FOR_INTERRUPTS in
errfinish().  So I hypothesize that you have log_duration or
log_min_duration_statement enabled.

I was able to replicate the problem by lobotomizing a couple of
CHECK_FOR_INTERRUPTS that would normally happen in a simple
"select pg_sleep()" query (see attached for-demonstration-
purposes-only patch).  Then I get

regression=# set statement_timeout TO '1s';
SET
regression=# select pg_sleep(2.0);  -- completes normally after 2 sec
 pg_sleep 
----------
 
(1 row)

regression=# set log_min_duration_statement TO '100ms';
SET
regression=# select pg_sleep(2.0);  -- sits for 2 sec, then:
ERROR:  canceling statement due to user request

So that's not very good and we ought to do something about it.
We can't have disable_statement_timeout() clear QueryCancelPending,
because that might lose state about some other cause for query
cancel.  I'm inclined to think that we need to split QueryCancelPending
into more than one flag.

A separate question is how come the particular query you're complaining
about has (seemingly) a fairly wide window where it never does any
CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's someplace
we need to sprinkle a few more of those.

            regards, tom lane

diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 80ed9cc..d09ac95 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -72,7 +72,7 @@ ExecResult(PlanState *pstate)
     PlanState  *outerPlan;
     ExprContext *econtext;

-    CHECK_FOR_INTERRUPTS();
+//    CHECK_FOR_INTERRUPTS();

     econtext = node->ps.ps_ExprContext;

diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d330a88..9f515db 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -369,7 +369,7 @@ pg_sleep(PG_FUNCTION_ARGS)
         float8        delay;
         long        delay_ms;

-        CHECK_FOR_INTERRUPTS();
+//        CHECK_FOR_INTERRUPTS();

         delay = endtime - GetNowFloat();
         if (delay >= 600.0)

Re: Unexpected "canceling statement due to user request" error

От
Will Storey
Дата:
On Sun 2019-09-01 19:46:19 -0400, Tom Lane wrote:
> I poked at this for awhile and concluded that what you must be seeing is
> that the statement timeout interrupt triggers, but no CHECK_FOR_INTERRUPTS
> call happens thereafter, until we get to the disable_statement_timeout()
> call in finish_xact_command().  That throws away the timeout-happened
> indicator, but it does nothing to reset InterruptPending or
> QueryCancelPending.  Those flags will also be thrown away eventually,
> after we've read a new client command:
> 
>         /*
>          * (4) disable async signal conditions again.
>          *
>          * Query cancel is supposed to be a no-op when there is no query in
>          * progress, so if a query cancel arrived while we were idle, just
>          * reset QueryCancelPending. ProcessInterrupts() has that effect when
>          * it's called when DoingCommandRead is set, so check for interrupts
>          * before resetting DoingCommandRead.
>          */
>         CHECK_FOR_INTERRUPTS();
>         DoingCommandRead = false;
> 
> Ignoring the timeout if it happened too late is what we're supposed
> to be doing, so that's fine so far.
> 
> But: what happens if some other CHECK_FOR_INTERRUPTS occurs before we
> reach the main input loop?  Of course, exactly the symptom you saw:
> a bogus "canceling statement due to user request" error report.
> What's worse, that happens after we've completed the transaction, leaving
> the client with a very misleading idea of whether the command committed
> or not.
> 
> So how could that happen?  After experimenting, I found a pretty
> obvious candidate: statement duration logging happens after we
> do finish_xact_command, and there's a CHECK_FOR_INTERRUPTS in
> errfinish().  So I hypothesize that you have log_duration or
> log_min_duration_statement enabled.

Yes, I have log_min_duration_statement set to 0! (log_duration is off)

> I was able to replicate the problem by lobotomizing a couple of
> CHECK_FOR_INTERRUPTS that would normally happen in a simple
> "select pg_sleep()" query (see attached for-demonstration-
> purposes-only patch).  Then I get
> 
> regression=# set statement_timeout TO '1s';
> SET
> regression=# select pg_sleep(2.0);  -- completes normally after 2 sec
>  pg_sleep 
> ----------
>  
> (1 row)
> 
> regression=# set log_min_duration_statement TO '100ms';
> SET
> regression=# select pg_sleep(2.0);  -- sits for 2 sec, then:
> ERROR:  canceling statement due to user request
> 
> So that's not very good and we ought to do something about it.
> We can't have disable_statement_timeout() clear QueryCancelPending,
> because that might lose state about some other cause for query
> cancel.  I'm inclined to think that we need to split QueryCancelPending
> into more than one flag.
> 
> A separate question is how come the particular query you're complaining
> about has (seemingly) a fairly wide window where it never does any
> CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's someplace
> we need to sprinkle a few more of those.

Yeah, it is strange. There are many queries in this system with statement
timeouts and this is the only one where I see this and it happens several
times a day.

Thank you so much for looking into this. An impressive analysis!



Re: Unexpected "canceling statement due to user request" error

От
Tom Lane
Дата:
Will Storey <will@summercat.com> writes:
> On Sun 2019-09-01 19:46:19 -0400, Tom Lane wrote:
>> A separate question is how come the particular query you're complaining
>> about has (seemingly) a fairly wide window where it never does any
>> CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's someplace
>> we need to sprinkle a few more of those.

> Yeah, it is strange. There are many queries in this system with statement
> timeouts and this is the only one where I see this and it happens several
> times a day.

It'd be interesting to see EXPLAIN ANALYZE output from a typical execution
of that query.  Even better would be a test case (w/ tables and dummy
data) so somebody else could reproduce it; but maybe looking at EXPLAIN
ANALYZE would be enough to determine where we're missing checks.

            regards, tom lane



Re: Unexpected "canceling statement due to user request" error

От
Will Storey
Дата:
On Sun 2019-09-01 20:58:30 -0400, Tom Lane wrote:
> >> A separate question is how come the particular query you're complaining
> >> about has (seemingly) a fairly wide window where it never does any
> >> CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's someplace
> >> we need to sprinkle a few more of those.
> 
> > Yeah, it is strange. There are many queries in this system with statement
> > timeouts and this is the only one where I see this and it happens several
> > times a day.
> 
> It'd be interesting to see EXPLAIN ANALYZE output from a typical execution
> of that query.  Even better would be a test case (w/ tables and dummy
> data) so somebody else could reproduce it; but maybe looking at EXPLAIN
> ANALYZE would be enough to determine where we're missing checks.

I put output from a query with values that caused the error here:


https://gist.githubusercontent.com/horgh/653e60b8c2f071d859424517d653eb4e/raw/5d16e11d0ac884ed9dc120c4521e2de8f2e7c3d6/test.txt

I sanitised some of the names and values again. Sorry about that.

As far as a test case: I'm not sure how to reproduce it myself. It just
happens periodically (15 times in the past 24 hours). The tables are
somewhat large (60+ GiB per partition). I could try to generate dummy
versions if you like, but they would have to be significantly dummied, so
I'm not sure it would end up being representative.