Обсуждение: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'

Поиск
Список
Период
Сортировка

PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
Hello,

I encounter from time to time that one of our ESQL/C clients hangs with
the details described below for ever, while accessing on startup a table 
'sisisinst' which has only some 50 rows, all of CHAR columns. Other ESQL/C clients
start up fine (they all check this table 'sisisinst' on startup to
verify the correct version of our application) Also SQL works
fine.

What can I do to get to the source of the problem? Thanks

    matthias


ESQL/C statement:

   EXEC SQL PREPARE sid_sisisinst  FROM :select_anw;

host variable :select_anw contains:

(gdb) p select_anw
$2 = "SELECT ctid, * from sisisinst WHERE version  = :v1", '\000'

process PID 23957 is hanging in poll() on network socket to PostgreSQL server:
(Version 11.4 on Linux)

(gdb) bt
#0  0x00007ff9020c2664 in poll () from /lib64/libc.so.6
#1  0x00007ff90144c22f in pqSocketPoll () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#2  0x00007ff90144c0ff in pqSocketCheck () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#3  0x00007ff90144bf83 in pqWaitTimed () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#4  0x00007ff90144bf55 in pqWait () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#5  0x00007ff901447d3a in PQgetResult () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#6  0x00007ff901448478 in PQexecFinish () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#7  0x00007ff90144819a in PQprepare () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5
#8  0x00007ff901cc788d in prepare_common () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6
#9  0x00007ff901cc7a26 in ECPGprepare () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6
#10 0x00007ff9044a86ee in select_record (scroll=1, lock=0, key=2, 
    sel_anw=0x7ffed86186c0 "SELECT ctid, * from %s", where_anw=0x7ffed8617330 "%s = :v1", 
    p_daten=0x7ff904dd7700 <hrec_sisisinst>, i_between=0, p_oben=0x7ffed86171d0, order_by=0x0, 
    auf_ab=0x0, group_by=0x0, having=0x0, into_temp=0x0, count=0x7ffed8617328) at sisisinst.pgc:925
#11 0x00007ff9044a7a8d in sisisinst (zugriff=1, scroll=1, lock=0, key=2, sto=-20000, 
    p_daten=0x7ffed8618c30, sel_anw=0x7ffed86186c0 "SELECT ctid, * from %s", 
    where_anw=0x7ffed8617330 "%s = :v1", p_btw_daten=0x0, order_by=0x0, auf_ab=0x0, group_by=0x0, 
    having=0x0, into_temp=0x0, count=0x7ffed8617328) at sisisinst.pgc:450
#12 0x00007ff9043718a5 in DB_rdir (tabmodul=0x7ff9044a7639 <sisisinst>, key=2, scroll=1, lock=0, 
    p_daten=0x7ffed8618c30) at dbcall.pgc:1665
#13 0x00007ff90436f6c9 in DB_ChkVer () at dbcall.pgc:307
#14 0x00007ff9043708d3 in DB_opdbP (mode=1) at dbcall.pgc:840
#15 0x00007ff90436fff8 in DB_opdb () at dbcall.pgc:543
#16 0x0000000000405efa in InitIndex (setid=FSTAB_Personen) at ./INDEX.c:1000
#17 0x0000000000405241 in main (argc=2, argv=0x7ffed8619fa8) at ./INDEX.c:412

running PostgreSQL processes:

# ps ax | grep postgres:
  329 pts/13   S+     0:00 grep --color=auto postgres:
 2354 ?        Ss     0:17 postgres: checkpointer
 2355 ?        Ss     0:09 postgres: background writer
 2356 ?        Ss     0:10 postgres: walwriter
 2357 ?        Ss     0:04 postgres: autovacuum launcher
 2358 ?        Ss     0:00 postgres: archiver   last was 000000010000000F00000062
 2359 ?        Ss     0:31 postgres: stats collector
 2360 ?        Ss     0:00 postgres: logical replication launcher
14044 ?        Ss     0:00 postgres: sisis sisis ::1(53426) idle
14100 ?        Ss     0:00 postgres: sisis sisis ::1(53432) idle
16852 ?        Ss     0:00 postgres: sisis sisis ::1(53486) idle
17575 ?        Ss     0:00 postgres: sisis sisis ::1(48736) idle
17577 ?        Ss     0:00 postgres: sisis sisis ::1(48742) idle
17583 ?        Ss     0:00 postgres: sisis sisis 10.23.33.19(49630) idle
17584 ?        Ss     0:00 postgres: sisis sisis 10.23.33.19(49632) idle
17585 ?        Ss     0:00 postgres: sisis sisis 10.23.33.19(49634) idle
17587 ?        Ss     0:00 postgres: sisis sisis ::1(48762) idle
17590 ?        Ss     0:00 postgres: sisis sisis ::1(48770) idle
17593 ?        Ss     0:00 postgres: sisis sisis ::1(48778) idle
17609 ?        Ss     0:04 postgres: sisis sisis ::1(48794) idle
17612 ?        Ss     0:00 postgres: sisis sisis ::1(48802) idle
17615 ?        Ss     0:04 postgres: sisis sisis ::1(48810) idle
17618 ?        Ss     0:00 postgres: sisis sisis ::1(48818) idle
17621 ?        Ss     0:08 postgres: sisis sisis ::1(48826) idle
17866 ?        Ss     0:00 postgres: sisis sisis 10.23.33.19(49758) idle
17868 ?        Ss     0:00 postgres: sisis sisis ::1(48886) idle
17907 ?        Ss     0:00 postgres: sisis sisis ::1(48900) idle
17948 ?        Ss     0:00 postgres: sisis sisis ::1(48914) idle
21457 ?        Ss     0:00 postgres: sisis sisis ::1(53664) idle
23958 ?        Ss     0:00 postgres: sisis sisis ::1(53742) idle
                                                     ^^^^^


lsof -P -p 23957 | grep 5432
INDEX   23957 sisis    2u  IPv6          117475261      0t0       TCP localhost:53742->localhost:5432 (ESTABLISHED)

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'

От
Tom Lane
Дата:
Matthias Apitz <guru@unixarea.de> writes:
> I encounter from time to time that one of our ESQL/C clients hangs with
> the details described below for ever, while accessing on startup a table
> 'sisisinst' which has only some 50 rows, all of CHAR columns. Other ESQL/C clients
> start up fine (they all check this table 'sisisinst' on startup to
> verify the correct version of our application) Also SQL works
> fine.

> What can I do to get to the source of the problem? Thanks

Can you get a stack trace from the connected backend?  Or even just look
at what it's doing according to pg_stat_activity?  (If it's a reasonably
modern server, the wait-condition fields would be interesting.)

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > I encounter from time to time that one of our ESQL/C clients hangs with
> > the details described below for ever, while accessing on startup a table 
> > 'sisisinst' which has only some 50 rows, all of CHAR columns. Other ESQL/C clients
> > start up fine (they all check this table 'sisisinst' on startup to
> > verify the correct version of our application) Also SQL works
> > fine.
> 
> > What can I do to get to the source of the problem? Thanks
> 
> Can you get a stack trace from the connected backend?  Or even just look
> at what it's doing according to pg_stat_activity?  (If it's a reasonably
> modern server, the wait-condition fields would be interesting.)

srap32dxr1:/opt/lib/sisis/catserver/bin # lsof -P | grep 53742
INDEX     23957                      sisis    2u     IPv6          117475261       0t0        TCP
localhost:53742->localhost:5432(ESTABLISHED)
 
postmaste 23958                   postgres    9u     IPv6          117478789       0t0        TCP
localhost:5432->localhost:53742(ESTABLISHED)
 

'INDEX' is the hanging proc; PID 23958 its server; gdb shows:

# gdb /usr/local/sisis-pap/pgsql/bin/postgres 23958
...
(gdb) bt
#0  0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6
#1  0x000000000084476c in WaitEventSetWaitBlock ()
#2  0x0000000000844647 in WaitEventSetWait ()
#3  0x00000000006f89d2 in secure_read ()
#4  0x0000000000707425 in pq_recvbuf ()
#5  0x0000000000707709 in pq_discardbytes ()
#6  0x0000000000707aba in pq_getmessage ()
#7  0x000000000086b478 in SocketBackend ()
#8  0x000000000086b4c4 in ReadCommand ()
#9  0x000000000086fda9 in PostgresMain ()
#10 0x00000000007dde15 in BackendRun ()
#11 0x00000000007dd595 in BackendStartup ()
#12 0x00000000007d9c9b in ServerLoop ()
#13 0x00000000007d9556 in PostmasterMain ()
#14 0x000000000070e2a5 in main ()

sisis=# select * from pg_stat_activity where client_port = 53742;

 datid | datname |  pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |
 backend_start          | xact_start | query_start |          state_change           | wait_event_type | wait_event |
state| backend_xid | backend_xmin | query |  backend_type  
 

 93831 | sisis   | 23958 |    16384 | sisis   |                  | ::1         |                 |       53742 |
27.04.202012:37:46.941441 CEST |            |             | 27.04.2020 12:37:46.943073 CEST | Client          |
ClientRead| idle  |             |              |       | client backend
 
(1 Zeile)


HIH && Kinde Regards

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
As the client seems to be waiting on the server and the server on the
client... could it have been caused by this in /etc/hosts (what our IT
department configured):

/etc/hosts:

127.0.0.1       localhost

# special IPv6 addresses
::1             localhost ipv6-localhost ipv6-loopback


i.e. the client has sent its request on IPv4 and the server answered on
IPv6?

I will remove the word 'localhost' from the IPv6 addr and restart
everything, or do you want to see any additional inforation?

Thanks

    matthias
-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'

От
Tom Lane
Дата:
Matthias Apitz <guru@unixarea.de> writes:
> El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió:
>> Can you get a stack trace from the connected backend?

> (gdb) bt
> #0  0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6
> #1  0x000000000084476c in WaitEventSetWaitBlock ()
> #2  0x0000000000844647 in WaitEventSetWait ()
> #3  0x00000000006f89d2 in secure_read ()
> #4  0x0000000000707425 in pq_recvbuf ()
> #5  0x0000000000707709 in pq_discardbytes ()
> #6  0x0000000000707aba in pq_getmessage ()
> #7  0x000000000086b478 in SocketBackend ()
> #8  0x000000000086b4c4 in ReadCommand ()
> #9  0x000000000086fda9 in PostgresMain ()

Oh, that is *very* interesting, because there is only one caller of
pq_discardbytes:

        /*
         * Allocate space for message.  If we run out of room (ridiculously
         * large message), we will elog(ERROR), but we want to discard the
         * message body so as not to lose communication sync.
         */
        PG_TRY();
        {
            enlargeStringInfo(s, len);
        }
        PG_CATCH();
        {
            if (pq_discardbytes(len) == EOF)
                ereport(COMMERROR,
                        (errcode(ERRCODE_PROTOCOL_VIOLATION),
                         errmsg("incomplete message from client")));
            ...

What this code intends to handle is the case where the client has sent a
message that is so long that the backend hasn't enough memory to buffer
it.  What's actually happened, more likely, is that the received message
length is corrupt and just appears to be large, since the client-side
trace shows that libpq has sent what it has to send and is now waiting for
a reply.  If the received length were correct then the pq_discardbytes
call would have completed after eating the message.

So what it looks like is that something is corrupting data on its
way from the client to the server.  Flaky firewall maybe?  If you're
using SSL, maybe an SSL library bug?  I'm reduced to speculation at
this point.  It's hard even to say what to do to gather more info.
If you could reproduce it then I'd suggest watching the connection
with wireshark or the like to see what data is actually going across
the wire ... but IIUC it's pretty random, so that approach seems
unlikely to work.

If you're in a position to run a modified server, you could try
inserting a debug log message:

        }
        PG_CATCH();
        {
+           elog(COMMERROR, "bogus received message length: %d", len);
            if (pq_discardbytes(len) == EOF)
                ereport(COMMERROR,

(This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.)
While this seems unlikely to teach us a huge amount, perhaps the
value of the incorrect length would be informative.

Are you always seeing this error at the exact same place so far as
the client side is concerned?  It's hard to see why a transport-level
problem would preferentially affect one particular query ...

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'

От
Tom Lane
Дата:
Matthias Apitz <guru@unixarea.de> writes:
> As the client seems to be waiting on the server and the server on the
> client... could it have been caused by this in /etc/hosts (what our IT
> department configured):

Doubt it.  Your connection evidently got through (at least) the
authentication exchange, so basic data transmission seems to be
working.  I speculate that there's something data-dependent about
the command that's failing.

As an example, it would not be terribly surprising to get a behavior
like this if there were something deciding to do an LF -> CRLF or
similar transformation on the data stream.  The substitution might
accidentally not trigger during authentication, only to break some
actual query a bit later.  But there really shouldn't be anything
making such changes in our data stream.

Still, looking at recent network-level configuration changes wouldn't
be a bad idea.

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
I've added the proposed log line to the server code in src/backend/libpq/pqcomm.c
and restarted the server with this. In the log file /data/postgresql11/data/serverlog
are other messages about length of packages, like:

2020-04-28 09:49:49.877 CEST [16229] LOG:  invalid length of startup packet
2020-04-28 09:49:50.117 CEST [16231] LOG:  invalid length of startup packet

They're coming from:

grep 'invalid length of startup packet' src/backend/*/*.c
src/backend/postmaster/postmaster.c:                             errmsg("invalid length of startup packet")));

I will change the code there to to get the 'len' printed with the
message

...
        if (len < (int32) sizeof(ProtocolVersion) ||
                len > MAX_STARTUP_PACKET_LENGTH)
        {
                ereport(COMMERROR,
                                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                                 errmsg("invalid length %d of startup packet", len)));
                return STATUS_ERROR;
        }

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió:
> >> Can you get a stack trace from the connected backend?
> 
> > (gdb) bt
> > #0  0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6
> > #1  0x000000000084476c in WaitEventSetWaitBlock ()
> > #2  0x0000000000844647 in WaitEventSetWait ()
> > #3  0x00000000006f89d2 in secure_read ()
> > #4  0x0000000000707425 in pq_recvbuf ()
> > #5  0x0000000000707709 in pq_discardbytes ()
> > #6  0x0000000000707aba in pq_getmessage ()
> > #7  0x000000000086b478 in SocketBackend ()
> > #8  0x000000000086b4c4 in ReadCommand ()
> > #9  0x000000000086fda9 in PostgresMain ()
> 
> Oh, that is *very* interesting, because there is only one caller of
> pq_discardbytes:
> 
>         /*
>          * Allocate space for message.  If we run out of room (ridiculously
>          * large message), we will elog(ERROR), but we want to discard the
>          * message body so as not to lose communication sync.
>          */
>         PG_TRY();
>         {
>             enlargeStringInfo(s, len);
>         }
>         PG_CATCH();
>         {
>             if (pq_discardbytes(len) == EOF)
>                 ereport(COMMERROR,
>                         (errcode(ERRCODE_PROTOCOL_VIOLATION),
>                          errmsg("incomplete message from client")));
>             ...
> 
> What this code intends to handle is the case where the client has sent a
> message that is so long that the backend hasn't enough memory to buffer
> it.  What's actually happened, more likely, is that the received message
> length is corrupt and just appears to be large, since the client-side
> trace shows that libpq has sent what it has to send and is now waiting for
> a reply.  If the received length were correct then the pq_discardbytes
> call would have completed after eating the message.
> 
> So what it looks like is that something is corrupting data on its
> way from the client to the server.  Flaky firewall maybe?  If you're
> using SSL, maybe an SSL library bug?  I'm reduced to speculation at
> this point.  It's hard even to say what to do to gather more info.
> If you could reproduce it then I'd suggest watching the connection
> with wireshark or the like to see what data is actually going across
> the wire ... but IIUC it's pretty random, so that approach seems
> unlikely to work.
> 
> If you're in a position to run a modified server, you could try
> inserting a debug log message:
> 
>         }
>         PG_CATCH();
>         {
> +           elog(COMMERROR, "bogus received message length: %d", len);
>             if (pq_discardbytes(len) == EOF)
>                 ereport(COMMERROR,
> 
> (This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.)
> While this seems unlikely to teach us a huge amount, perhaps the
> value of the incorrect length would be informative.
> 
> Are you always seeing this error at the exact same place so far as
> the client side is concerned?  It's hard to see why a transport-level
> problem would preferentially affect one particular query ...

I've added the printout of the length in this case and another one, and
see this in the server's log file:


2020-05-04 10:05:49.977 CEST [32092] LOG:  invalid length 33554940 of startup packet
2020-05-04 10:05:50.207 CEST [32094] LOG:  invalid length 33554940 of startup packet
2020-05-04 12:32:50.781 CEST [20334] LOG:  bogus received message length: 1650422894
2020-05-04 12:36:41.293 CEST [20380] LOG:  bogus received message length: 1650422894
2020-05-04 12:39:39.461 CEST [20441] LOG:  bogus received message length: 1650422894
2020-05-04 13:01:50.566 CEST [24222] LOG:  bogus received message length: 1650422894

Any comments? Thanks

    matthias


-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Matthias Apitz <guru@unixarea.de> writes:
> El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió:
>> If you're in a position to run a modified server, you could try
>> inserting a debug log message:

> I've added the printout of the length in this case and another one, and
> see this in the server's log file:

> 2020-05-04 10:05:49.977 CEST [32092] LOG:  invalid length 33554940 of startup packet
> 2020-05-04 10:05:50.207 CEST [32094] LOG:  invalid length 33554940 of startup packet
> 2020-05-04 12:32:50.781 CEST [20334] LOG:  bogus received message length: 1650422894
> 2020-05-04 12:36:41.293 CEST [20380] LOG:  bogus received message length: 1650422894
> 2020-05-04 12:39:39.461 CEST [20441] LOG:  bogus received message length: 1650422894
> 2020-05-04 13:01:50.566 CEST [24222] LOG:  bogus received message length: 1650422894

Hmph.  That confirms that we're getting a bogus message length, but not
much more.  It's quite interesting though that the bogus value is always
the same.  According to my calculator 1650422894 corresponds to ASCII
"b_tn", or possibly "nt_b" depending on what you want to assume about
endianness, which looks tantalizingly like it could be a fragment of a
SQL query.  So I'm still leaning to the idea that the client has sent
a malformed query stream --- but how?  Or if the data got corrupted in
transit, how did that happen?

Can you work backwards to what the client was doing just before the
point at which it hangs?  It's likely that the particular PQprepare
call it's stuck on is just the victim of prior misfeasance.  If you
can find "b_tn" or "nt_b" in any strings the client should have been
sending up to this point, that might shed light too.

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día martes, mayo 05, 2020 a las 04:18:02p. m. -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió:
> >> If you're in a position to run a modified server, you could try
> >> inserting a debug log message:
> 
> > I've added the printout of the length in this case and another one, and
> > see this in the server's log file:
> 
> > 2020-05-04 10:05:49.977 CEST [32092] LOG:  invalid length 33554940 of startup packet
> > 2020-05-04 10:05:50.207 CEST [32094] LOG:  invalid length 33554940 of startup packet
> > 2020-05-04 12:32:50.781 CEST [20334] LOG:  bogus received message length: 1650422894
> > 2020-05-04 12:36:41.293 CEST [20380] LOG:  bogus received message length: 1650422894
> > 2020-05-04 12:39:39.461 CEST [20441] LOG:  bogus received message length: 1650422894
> > 2020-05-04 13:01:50.566 CEST [24222] LOG:  bogus received message length: 1650422894
> 
> Hmph.  That confirms that we're getting a bogus message length, but not
> much more.  It's quite interesting though that the bogus value is always
> the same.  According to my calculator 1650422894 corresponds to ASCII
> "b_tn", or possibly "nt_b" depending on what you want to assume about
> endianness, which looks tantalizingly like it could be a fragment of a
> SQL query.  So I'm still leaning to the idea that the client has sent
> a malformed query stream --- but how?  Or if the data got corrupted in
> transit, how did that happen?
> 
> Can you work backwards to what the client was doing just before the
> point at which it hangs?  It's likely that the particular PQprepare
> call it's stuck on is just the victim of prior misfeasance.  If you
> can find "b_tn" or "nt_b" in any strings the client should have been
> sending up to this point, that might shed light too.

$ printf "0x%x\n" 1650422894
0x625f746e
$ printf "0x%x\n" 1650422894 | xxd -r
b_tn

I will try to add in the case of the 'bogus received message length...'
more logging of the content of the message.

The client in question is written in ESQL/C and I will enable the (very
good) ESQL/C logging feature of this interface to get an understanding
what the client was sending before this connection gets stuck. We will
produce tons of log files, but apart of Linux strace or tcpdump with
even more tons, I don't see any way at the moment. The situation is
relatively seldom.

Thanks

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Dienstag, April 28, 2020 a las 10:10:18 +0200, Matthias Apitz escribió:

> 
> I've added the proposed log line to the server code in src/backend/libpq/pqcomm.c
> and restarted the server with this. In the log file /data/postgresql11/data/serverlog
> are other messages about length of packages, like:
> 
> 2020-04-28 09:49:49.877 CEST [16229] LOG:  invalid length of startup packet
> 2020-04-28 09:49:50.117 CEST [16231] LOG:  invalid length of startup packet
> 
> They're coming from:
> 
> grep 'invalid length of startup packet' src/backend/*/*.c
> src/backend/postmaster/postmaster.c:                             errmsg("invalid length of startup packet")));
> 
> I will change the code there to to get the 'len' printed with the
> message
> 
> ...
>         if (len < (int32) sizeof(ProtocolVersion) ||
>                 len > MAX_STARTUP_PACKET_LENGTH)
>         {
>                 ereport(COMMERROR,
>                                 (errcode(ERRCODE_PROTOCOL_VIOLATION),
>                                  errmsg("invalid length %d of startup packet", len)));
>                 return STATUS_ERROR;
>         }

I'm still trying to catch from where (i.e. from which client connection) the above
message are originated. Over the actual weeken(!), I see the following
picture:

# egrep 'bogus|incomplete message from|invalid length' /data/postgresql11/data/serverlog
...
2020-05-09 05:35:20.124 CEST [6905] LOG:  invalid length 33554940 of startup packet
2020-05-09 05:35:20.368 CEST [6907] LOG:  invalid length 33554940 of startup packet
2020-05-09 05:38:14.381 CEST [7186] LOG:  invalid length 33554940 of startup packet
2020-05-09 05:38:14.669 CEST [7246] LOG:  invalid length 33554940 of startup packet
2020-05-10 05:34:56.716 CEST [13241] LOG:  invalid length 33554940 of startup packet
2020-05-10 05:34:56.942 CEST [13243] LOG:  invalid length 33554940 of startup packet
2020-05-10 05:37:46.599 CEST [13921] LOG:  invalid length 33554940 of startup packet
2020-05-10 05:37:49.577 CEST [14017] LOG:  invalid length 33554940 of startup packet

i.e. the messages only appear short after 5:3x am; this is exactly the
time when (based on cronjobs) our software is deployed to this server
and all servers (the clients for PostgreSQL) are started. Following
questions:

1. In addition to the above code, can I get somehow the connecting client's PID? If
   so, I could get from its PID its command line from the /proc file
   system. This would help to instrument the client side with logging.
   There are some hundred clients of different server types booting up
   at this time frame after the deployment. 

2. Can I get something out of the buffer of the startup packet? For
   example, with adding to the above change something like:

   {
   char buf[1024+1];
   memset (buf, 0, sizeof(buf));
   pq_getbytes(buf, 1024);
   buf[1024] = '\0';
   ereport(COMMERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION),
          errmsg("first 1024 bytes of startup packet [%s]", buf)));
   }

Thanks

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Matthias Apitz <guru@unixarea.de> writes:
> 1. In addition to the above code, can I get somehow the connecting client's PID? If

Not in any portable way.  If you believe that the offending client is on
the local system, you could try building something involving netstat,
lsof, etc.  getpeereid() might be useful too.

> 2. Can I get something out of the buffer of the startup packet?

I'm fairly sure that at the point where the error is thrown, we've
only read the putative length word, so there's nothing buffered.
But sure, you could try to collect whatever is available from the
socket and dump that --- probably a hex dump would be advisable,
because it'll contain \0 bytes.  Seems like a good idea.

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Sonntag, Mai 10, 2020 a las 11:33:41 -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > 1. In addition to the above code, can I get somehow the connecting client's PID? If
> 
> Not in any portable way.  If you believe that the offending client is on
> the local system, you could try building something involving netstat,
> lsof, etc.  getpeereid() might be useful too.

I launched a shell script to get some more information in this case:

    ...
        if (len < (int32) sizeof(ProtocolVersion) ||
                len > MAX_STARTUP_PACKET_LENGTH)
        {
                int pid = getpid();
                char cmd[80];
                sprintf(cmd, "/home/sisis/guru/getClient.sh %d", pid);
                system(cmd);
                
                ereport(COMMERROR,
                                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                                 errmsg("invalid length %d of startup packet", len)));
                return STATUS_ERROR;
        }
    ...

The process with the PID in question is 'postmaster':

PS: 14622 ? Ss 0:00 /usr/local/sisis-pap/pgsql/bin/postmaster -D /data/postgresql11/data

lsof -P -p 14622

COMMAND     PID     USER   FD      TYPE    DEVICE  SIZE/OFF      NODE NAME
postmaste 14622 postgres  cwd       DIR     254,0      4096   5373968 /data/postgresql11/data
postmaste 14622 postgres  rtd       DIR     254,0      4096         2 /
postmaste 14622 postgres  txt       REG     254,0   9959488   5636120 /usr/local/sisis-pap/pgsql/bin/postgres
postmaste 14622 postgres  DEL       REG       0,5           131559958 /dev/zero
...
postmaste 14622 postgres    9u     IPv4 131903453       0t0       TCP
srap32dxr1.dev.oclc.org:5432->srap32dxr1.dev.oclc.org:44836(ESTABLISHED)
 


...
Interestingly, the other side of this connection of the port 5432 is not
visible, i.e. a 'lsof -P | grep 5432' does not show it in this moment.

    matthias


-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Matthias Apitz <guru@unixarea.de> writes:
> I launched a shell script to get some more information in this case:

ok

> The process with the PID in question is 'postmaster':

Yeah, this should actually be a newly-forked postmaster child process,
but it won't have done anything yet to change its ps-visible command
line.

> postmaste 14622 postgres    9u     IPv4 131903453       0t0       TCP
srap32dxr1.dev.oclc.org:5432->srap32dxr1.dev.oclc.org:44836(ESTABLISHED) 

OK, so it's gotten a TCP not Unix-socket connection; that's already
going to narrow things a little bit.

> Interestingly, the other side of this connection of the port 5432 is not
> visible, i.e. a 'lsof -P | grep 5432' does not show it in this moment.

Permissions problem maybe?  I'm not sure that lsof will tell you much
about non-postgres-owned processes, unless you run it as root.

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Montag, Mai 11, 2020 a las 09:50:12 -0400, Tom Lane escribió:

> 
> > postmaste 14622 postgres    9u     IPv4 131903453       0t0       TCP
srap32dxr1.dev.oclc.org:5432->srap32dxr1.dev.oclc.org:44836(ESTABLISHED)
 
> 
> OK, so it's gotten a TCP not Unix-socket connection; that's already
> going to narrow things a little bit.
> 
> > Interestingly, the other side of this connection of the port 5432 is not
> > visible, i.e. a 'lsof -P | grep 5432' does not show it in this moment.
> 
> Permissions problem maybe?  I'm not sure that lsof will tell you much
> about non-postgres-owned processes, unless you run it as root.

Ah, yes. As this is just a dev host only, I set 'chmod 4755 /usr/bin/lsof'
and now I see, the connecting client is one of our Java tools and I can
look at the problem from this side.

    matthias


-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Montag, Mai 11, 2020 a las 06:24:52 +0200, Matthias Apitz escribió:

> Ah, yes. As this is just a dev host only, I set 'chmod 4755 /usr/bin/lsof'
> and now I see, the connecting client is one of our Java tools and I can
> look at the problem from this side.
> 

With the knowledge which Java process it is causing, I'm able to simple
reproduce the message and with knowing the 4-tuple of the TCP connection to
identify the exchange in a capture with TCPDUMP sniffing on localhost
port 5432.

Below is the exchange. The visible strings are: 'sisis' (the PG user),
'sisis123' (its password in PG, a dummy password used for testing) and
'srap32dxr1' the hostname. The server is terminating the connection with
a F-pkg:

19:54:02.940205 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [S], seq 3950072774, win 43690, options [mss
65495,sackOK,TSval 3334863612 ecr 0,nop,wscale 7], length 0
 
    0x0000:  4500 003c e98b 4000 4006 fadc 0a17 2113  E..<..@.@.....!.
    0x0010:  0a17 2113 bd36 1538 eb71 53c6 0000 0000  ..!..6.8.qS.....
    0x0020:  a002 aaaa 5682 0000 0204 ffd7 0402 080a  ....V...........
    0x0030:  c6c5 fafc 0000 0000 0103 0307            ............
19:54:02.940217 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [S.], seq 749639996, ack 3950072775, win 43690, options
[mss65495,sackOK,TS val 3334863612 ecr 3334863612,nop,wscale 7], length 0
 
    0x0000:  4500 003c 0000 4000 4006 e468 0a17 2113  E..<..@.@..h..!.
    0x0010:  0a17 2113 1538 bd36 2cae 993c eb71 53c7  ..!..8.6,..<.qS.
    0x0020:  a012 aaaa 5682 0000 0204 ffd7 0402 080a  ....V...........
    0x0030:  c6c5 fafc c6c5 fafc 0103 0307            ............
19:54:02.940226 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [.], ack 1, win 342, options [nop,nop,TS val 3334863612
ecr3334863612], length 0
 
    0x0000:  4500 0034 e98c 4000 4006 fae3 0a17 2113  E..4..@.@.....!.
    0x0010:  0a17 2113 bd36 1538 eb71 53c7 2cae 993d  ..!..6.8.qS.,..=
    0x0020:  8010 0156 567a 0000 0101 080a c6c5 fafc  ...VVz..........
    0x0030:  c6c5 fafc                                ....
19:54:02.948877 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [P.], seq 1:513, ack 1, win 342, options [nop,nop,TS val
3334863621ecr 3334863612], length 512
 
    0x0000:  4500 0234 e98d 4000 4006 f8e2 0a17 2113  E..4..@.@.....!.
    0x0010:  0a17 2113 bd36 1538 eb71 53c7 2cae 993d  ..!..6.8.qS.,..=
    0x0020:  8018 0156 587a 0000 0101 080a c6c5 fb05  ...VXz..........
    0x0030:  c6c5 fafc 0200 0200 0000 0000 7372 6170  ............srap
    0x0040:  3332 6478 7231 0000 0000 0000 0000 0000  32dxr1..........
    0x0050:  0000 0000 0000 0000 0000 0a73 6973 6973  ...........sisis
    0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
    0x0070:  0000 0000 0000 0000 0005 7369 7369 7331  ..........sisis1
    0x0080:  3233 0000 0000 0000 0000 0000 0000 0000  23..............
    0x0090:  0000 0000 0000 0000 0800 0000 0000 0000  ................
    0x00a0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
    0x00b0:  0000 0000 0000 0001 0200 0604 0801 0000  ................
    0x00c0:  0000 0002 0000 0000 0000 0000 0000 0000  ................
    0x00d0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
    0x00e0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
    0x00f0:  0000                                     ..
19:54:02.948886 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [.], ack 513, win 350, options [nop,nop,TS val
3334863621ecr 3334863621], length 0
 
    0x0000:  4500 0034 6004 4000 4006 846c 0a17 2113  E..4`.@.@..l..!.
    0x0010:  0a17 2113 1538 bd36 2cae 993d eb71 55c7  ..!..8.6,..=.qU.
    0x0020:  8010 015e 567a 0000 0101 080a c6c5 fb05  ...^Vz..........
    0x0030:  c6c5 fb05                                ....
19:54:02.948970 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [P.], seq 513:612, ack 1, win 342, options [nop,nop,TS
val3334863621 ecr 3334863621], length 99
 
    0x0000:  4500 0097 e98e 4000 4006 fa7e 0a17 2113  E.....@.@..~..!.
    0x0010:  0a17 2113 bd36 1538 eb71 55c7 2cae 993d  ..!..6.8.qU.,..=
    0x0020:  8018 0156 56dd 0000 0101 080a c6c5 fb05  ...VV...........
    0x0030:  c6c5 fb05 0201 0063 0000 0000 0000 0000  .......c........
    0x0040:  0000 0600 0000 0000 0800 0000 0000 0000  ................
    0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
    0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
    0x0070:  0135 3132 0000 0003 0000 0000 e200 1801  .512............
    0x0080:  0c07 cdff 85ee ef65 7fff ffff d602 0800  .......e........
    0x0090:  0680 0648 0000 00                        ...H...
19:54:02.948974 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [.], ack 612, win 350, options [nop,nop,TS val
3334863621ecr 3334863621], length 0
 
    0x0000:  4500 0034 6005 4000 4006 846b 0a17 2113  E..4`.@.@..k..!.
    0x0010:  0a17 2113 1538 bd36 2cae 993d eb71 562a  ..!..8.6,..=.qV*
    0x0020:  8010 015e 567a 0000 0101 080a c6c5 fb05  ...^Vz..........
    0x0030:  c6c5 fb05                                ....
19:54:04.176794 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [F.], seq 1, ack 612, win 350, options [nop,nop,TS val
3334864848ecr 3334863621], length 0
 
    0x0000:  4500 0034 6006 4000 4006 846a 0a17 2113  E..4`.@.@..j..!.
    0x0010:  0a17 2113 1538 bd36 2cae 993d eb71 562a  ..!..8.6,..=.qV*
    0x0020:  8011 015e 567a 0000 0101 080a c6c5 ffd0  ...^Vz..........
    0x0030:  c6c5 fb05                                ....
19:54:04.176861 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [.], ack 2, win 342, options [nop,nop,TS val 3334864849
ecr3334864848], length 0
 
    0x0000:  4500 0034 e98f 4000 4006 fae0 0a17 2113  E..4..@.@.....!.
    0x0010:  0a17 2113 bd36 1538 eb71 562a 2cae 993e  ..!..6.8.qV*,..>
    0x0020:  8010 0156 567a 0000 0101 080a c6c5 ffd1  ...VVz..........
    0x0030:  c6c5 ffd0                                ....
19:54:04.177085 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [F.], seq 612, ack 2, win 342, options [nop,nop,TS val
3334864849ecr 3334864848], length 0
 
    0x0000:  4500 0034 e990 4000 4006 fadf 0a17 2113  E..4..@.@.....!.
    0x0010:  0a17 2113 bd36 1538 eb71 562a 2cae 993e  ..!..6.8.qV*,..>
    0x0020:  8011 0156 567a 0000 0101 080a c6c5 ffd1  ...VVz..........
    0x0030:  c6c5 ffd0                                ....
19:54:04.177096 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [.], ack 613, win 350, options [nop,nop,TS val
3334864849ecr 3334864849], length 0
 
    0x0000:  4500 0034 0000 4000 4006 e470 0a17 2113  E..4..@.@..p..!.
    0x0010:  0a17 2113 1538 bd36 2cae 993e eb71 562b  ..!..8.6,..>.qV+
    0x0020:  8010 015e b7e3 0000 0101 080a c6c5 ffd1  ...^............
    0x0030:  c6c5 ffd1                                ....


-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Matthias Apitz <guru@unixarea.de> writes:
> Below is the exchange. The visible strings are: 'sisis' (the PG user),
> 'sisis123' (its password in PG, a dummy password used for testing) and
> 'srap32dxr1' the hostname. The server is terminating the connection with
> a F-pkg:

[ squint... ]  That looks nothing like a Postgres-protocol exchange
to me.  If it weren't for the runs of zeroes, I'd wonder if the
connection had SSL encryption turned on.  Perhaps you captured the
wrong session?

            regards, tom lane



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Montag, Mai 11, 2020 a las 02:41:29 -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > Below is the exchange. The visible strings are: 'sisis' (the PG user),
> > 'sisis123' (its password in PG, a dummy password used for testing) and
> > 'srap32dxr1' the hostname. The server is terminating the connection with
> > a F-pkg:
> 
> [ squint... ]  That looks nothing like a Postgres-protocol exchange
> to me.  If it weren't for the runs of zeroes, I'd wonder if the
> connection had SSL encryption turned on.  Perhaps you captured the
> wrong session?

I don't think so. It is exactly the problematic connection started by
the Java process against the port 5432; look the SYN pkg:

19:54:02.940205 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [S], seq 3950072774, win 43690,
options [mss 65495,sackOK,TS val 3334863612 ecr 0,nop,wscale 7], length 0

As you say, that this is not any Postgres-protocol exchange, I will
check the configuration of this part of our software. It can be some
kind of misconfiguration, some part is talking TCP/IP to the wrong server.

Thanks for the hint.

    matthias
-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'

От
Matthias Apitz
Дата:
El día Montag, Mai 11, 2020 a las 02:41:29 -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > Below is the exchange. The visible strings are: 'sisis' (the PG user),
> > 'sisis123' (its password in PG, a dummy password used for testing) and
> > 'srap32dxr1' the hostname. The server is terminating the connection with
> > a F-pkg:
> 
> [ squint... ]  That looks nothing like a Postgres-protocol exchange
> to me.  If it weren't for the runs of zeroes, I'd wonder if the
> connection had SSL encryption turned on.  Perhaps you captured the
> wrong session?

It turned out, that the messages:

2020-05-12 05:33:06.828 CEST [6271] LOG:  invalid length 33554940 of startup packet
2020-05-12 05:33:08.305 CEST [6291] LOG:  invalid length 33554940 of startup packet
2020-05-12 05:34:07.495 CEST [6596] LOG:  invalid length 33554940 of startup packet
2020-05-12 05:34:09.710 CEST [6927] LOG:  invalid length 33554940 of startup packet

have been caused by a misconfiguration of the component in question: the
application used the Sybase JDBC driver to talk to the PostgreSQL server
:-(

This part of the thread is hereby solved.

Remaining is still:

2020-04-30 15:14:03.367 CEST [3843] LOG:  bogus received message length: 1650422894
2020-04-30 15:57:48.926 CEST [10380] LOG:  bogus received message length: 1650422894
2020-04-30 15:58:31.851 CEST [10380] LOG:  incomplete message from client
2020-04-30 15:58:53.578 CEST [10439] LOG:  bogus received message length: 1650422894
2020-04-30 16:00:13.500 CEST [10439] LOG:  incomplete message from client

    matthias


-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub