Thread: 7.0RC1: possible query and backend problem


7.0RC1: possible query and backend problem

From:
Michael Blakeley
Date:
I've just upgraded to 7.0RC1 on a Solaris 2.6+patches system (U5 with
256MB). I'm having some trouble with a brand-new query, so I don't
know if this is a new bug or an old one. It has two parts: query
execution and backend robustness.

First, I'm getting a crash with this query (via Apache mod_perl and
DBI, or via psql):

Table U: id varchar(128) primary key, stop datetime, start datetime
Table S1: id varchar(128) primary key, url varchar(512);
Table S2: id varchar(128) primary key, url varchar(512);

    SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url,S2.url
    FROM U WHERE 'now'::datetime-start<'1 month'::interval
    AND (id=S1.id OR id=S2.id) GROUP BY id,S1.url,S2.url;

The idea is to select whichever of S1.url and S2.url is non-null.
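For readers reconstructing the intent, here is a minimal sketch of "whichever of S1.url and S2.url is non-null", using sqlite3 as a stand-in engine (table and column names follow the post; the data is made up): LEFT JOIN both tables and let COALESCE pick the first non-null url.

```python
# Sketch only: sqlite3 stands in for the engine, and the rows are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE U  (id TEXT PRIMARY KEY);
    CREATE TABLE S1 (id TEXT PRIMARY KEY, url TEXT);
    CREATE TABLE S2 (id TEXT PRIMARY KEY, url TEXT);
    INSERT INTO U  VALUES ('a'), ('b');
    INSERT INTO S1 VALUES ('a', 'http://one/');
    INSERT INTO S2 VALUES ('b', 'http://two/');
""")
rows = con.execute("""
    SELECT U.id, COALESCE(S1.url, S2.url) AS url
    FROM U
    LEFT JOIN S1 ON U.id = S1.id
    LEFT JOIN S2 ON U.id = S2.id
    ORDER BY U.id;
""").fetchall()
print(rows)  # [('a', 'http://one/'), ('b', 'http://two/')]
```

Whether COALESCE and outer joins were practical in the 7.0RC1 planner of the day is a separate question; this only illustrates the shape of the query.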

From psql, I get:

    pqReadData() -- backend closed the channel unexpectedly.
         This probably means the backend terminated abnormally
         before or while processing the request.
    connection to server was lost

And now there are no postgres processes running.

I suspect that the problem is that, in my data set, either S1.url or
S2.url will be null for any given row, and this is causing problems
for GROUP BY (I can also believe that my SQL is bad - but that
shouldn't crash postgres, should it?). If I stick to just S1, the
query runs ok.

But... now there are no postgres processes running at all.

# ps -fu postgres
      UID   PID  PPID  C    STIME TTY      TIME CMD

In the postgres log:

>Server process (pid 11189) exited with status 139 at Wed Apr 19 16:37:25 2000
>Terminating any active server processes...
>NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend died
>abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am going to
>terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
>Server processes were terminated at Wed Apr 19 16:37:25 2000
>Reinitializing shared memory and semaphores
>IpcMemoryCreate: shmget failed (Invalid argument) key=5432110,
>size=144, permission=700
>This type of error is usually caused by an improper
>shared memory or System V IPC semaphore configuration.
>For more information, see the FAQ and platform-specific
>FAQ's in the source directory pgsql/doc or on our
>web site at http://www.postgresql.org.
>IpcMemoryIdGet: shmget failed (Invalid argument) key=5432110,
>size=144, permission=0
>IpcMemoryAttach: shmat failed (Invalid argument) id=-2
>FATAL 1:  AttachSLockMemory: could not attach segment

Now, the real problem here is that postgres seems to be trying to
restart, and fails. It complains about shmget - my /etc/system says:
set shmsys:shminfo_shmmax =  0x40000000
set shmsys:shminfo_shmmni = 100
set shmsys:shminfo_shmseg = 32
set semsys:seminfo_semmns = 1024
set semsys:seminfo_semmni = 1024
set semsys:seminfo_semmnu = 1024
set semsys:seminfo_semmsl = 1024
set semsys:seminfo_semmap = 1024

So I don't think this is really a config problem (if it were, the db
wouldn't start up to begin with). Rather, I'd guess that postgres
isn't freeing a memory block during the "Terminating any active
server processes..." phase.

Running ipcs -a immediately after the crash shows several segments
(although oddly, I have no trouble starting the db again as root -
maybe there's a permissions issue at work here?):

# ipcs -a
IPC status from <running system> as of Wed Apr 19 16:45:42 2000
T         ID      KEY        MODE        OWNER    GROUP  CREATOR
CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
Shared Memory:
m        800   0x0052e32e --rw------- postgres postgres postgres
postgres      0        120 12737 12737 13:01:36 13:01:36 13:01:36
m        801   0x0052e325 --rw------- postgres postgres postgres
postgres      0   15373312 12737 12737 13:01:36 13:01:36 13:01:36
m        802   0x0052e32b --rw------- postgres postgres postgres
postgres      0     102184 12737 12737 13:01:36 13:01:36 13:01:36
m          3   0x500005a3 --rw-r--r--     root     root     root
root      1         68   232   232 15:29:35 15:29:35 15:29:35

Anyway, I hope this report is helpful. Thanks to the developers for
their hard work on PG7.

-- Mike

Re: 7.0RC1: possible query and backend problem

From:
Michael Blakeley
Date:
At 5:16 PM -0700 4/19/2000, Michael Blakeley wrote:
>I've just upgraded to 7.0RC1 on a Solaris 2.6+patches system (U5
>with 256MB). I'm having some trouble with a brand-new query, so I
>don't know if this is a new bug or an old one. It has two parts:
>query execution and backend robustness.
>
>First, I'm getting a crash with this query (via Apache mod_perl and
>DBI, or via psql):
>
>Table U: id varchar(128) primary key, stop datetime, start datetime
>Table S1: id varchar(128) primary key, url varchar(512);
>Table S2: id varchar(128) primary key, url varchar(512);
>
>    SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url,S2.url
>    FROM U WHERE 'now'::datetime-start<'1 month'::interval
>    AND (id=S1.id OR id=S2.id) GROUP BY id,S1.url,S2.url;

For the benefit of anyone trying to work around crashes like this,
the query works fine for me as a UNION, instead.

    SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url
    FROM U WHERE 'now'::datetime-start<'1 month'::interval
    AND id=S1.id GROUP BY id,S1.url
    UNION
    SELECT id,date_part('epoch',sum(stop-start)),count(*),S2.url
    FROM U WHERE 'now'::datetime-start<'1 month'::interval
    AND id=S2.id GROUP BY id,S2.url;

Here's a backtrace, too:

# gdb ./bin/postgres ./data/base/sonicbox/core
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.6"...
Core was generated by `/usr/local/pgsql/bin/postmaster -i -N 512 -B
1024 -o -F'.
Program terminated with signal 11, Segmentation Fault.
Reading symbols from /usr/lib/libnsl.so.1...done.
Reading symbols from /usr/lib/libsocket.so.1...done.
Reading symbols from /usr/lib/libdl.so.1...done.
Reading symbols from /usr/lib/libm.so.1...done.
Reading symbols from /usr/lib/libc.so.1...done.
Reading symbols from /usr/lib/libmp.so.2...done.
Reading symbols from /usr/platform/SUNW,Ultra-5_10/lib/libc_psr.so.1...done.
Reading symbols from /usr/lib/nss_files.so.1...done.
#0  0x110f6c in make_one_rel_by_joins ()
(gdb) bt
#0  0x110f6c in make_one_rel_by_joins ()
#1  0x110d40 in make_one_rel ()
#2  0x11be74 in subplanner ()
#3  0x11bd98 in query_planner ()
#4  0x11ca40 in union_planner ()
#5  0x11c3fc in subquery_planner ()
#6  0x11c13c in planner ()
#7  0x16314c in pg_plan_query ()
#8  0x163428 in pg_exec_query_dest ()
#9  0x163270 in pg_exec_query ()
#10 0x164f60 in PostgresMain ()
#11 0x131738 in DoBackend ()
#12 0x130f08 in BackendStartup ()
#13 0x12f41c in ServerLoop ()
#14 0x12e978 in PostmasterMain ()
#15 0xd89f8 in main ()

-- Mike

Re: 7.0RC1: possible query and backend problem

From:
Tom Lane
Date:
Michael Blakeley <mike@blakeley.com> writes:
> I've just upgraded to 7.0RC1 on a Solaris 2.6+patches system (U5 with
> 256MB). I'm having some trouble with a brand-new query, so I don't
> know if this is a new bug or an old one. It has two parts: query
> execution and backend robustness.

>     SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url,S2.url
>     FROM U WHERE 'now'::datetime-start<'1 month'::interval
>     AND (id=S1.id OR id=S2.id) GROUP BY id,S1.url,S2.url;

>     pqReadData() -- backend closed the channel unexpectedly.
>          This probably means the backend terminated abnormally
>          before or while processing the request.

I've fixed this problem, I think --- the three-way join clause was
confusing the planner :-(.

> I suspect that the problem is that, in my data set, either S1.url or
> S2.url will be null for any given row, and this is causing problems
> for GROUP BY

That shouldn't be a problem, although you do need to be careful
when dealing with NULLs --- it's easy to write the query so that
the WHERE condition will produce NULL, which is interpreted as FALSE.
But the above should work OK, because (NULL OR TRUE) will produce
TRUE.
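The three-valued logic Tom describes can be illustrated with any SQL engine; a minimal sketch, using sqlite3 as a stand-in:

```python
# Sketch of SQL three-valued logic: NULL OR TRUE is TRUE, NULL AND TRUE
# is unknown, and a WHERE clause that comes out NULL behaves like FALSE.
import sqlite3

con = sqlite3.connect(":memory:")
# NULL OR TRUE yields TRUE (1), so an (id=S1.id OR id=S2.id) clause can
# still match even when one side is NULL:
or_val = con.execute("SELECT NULL OR 1").fetchone()[0]
print(or_val)    # 1
# NULL AND TRUE stays unknown (None):
and_val = con.execute("SELECT NULL AND 1").fetchone()[0]
print(and_val)   # None
# A WHERE condition that evaluates to NULL drops the row:
con.execute("CREATE TABLE t (x)")
con.execute("INSERT INTO t VALUES (NULL)")
matched = con.execute("SELECT count(*) FROM t WHERE x = x").fetchone()[0]
print(matched)   # 0
```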

>> IpcMemoryCreate: shmget failed (Invalid argument) key=5432110,
>> size=144, permission=700

Hmm, that is odd.  The thing that looks peculiar to me is that
it seems to be calculating a different size for the segment than
it did the first time through:

> # ipcs -a
> IPC status from <running system> as of Wed Apr 19 16:45:42 2000
> T         ID      KEY        MODE        OWNER    GROUP  CREATOR
> CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
> Shared Memory:
> m        800   0x0052e32e --rw------- postgres postgres postgres
> postgres      0        120 12737 12737 13:01:36 13:01:36 13:01:36

See the difference?  120 vs 144?  What's causing that I wonder...
and would it explain the failure to reattach?

            regards, tom lane

Re: 7.0RC1: possible query and backend problem

From:
Tom Lane
Date:
I wrote:
>>> IpcMemoryCreate: shmget failed (Invalid argument) key=5432110,
>>> size=144, permission=700

> Hmm, that is odd.  The thing that looks peculiar to me is that
> it seems to be calculating a different size for the segment than
> it did the first time through:

>> # ipcs -a
>> IPC status from <running system> as of Wed Apr 19 16:45:42 2000
>> T         ID      KEY        MODE        OWNER    GROUP  CREATOR
>> CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
>> Shared Memory:
>> m        800   0x0052e32e --rw------- postgres postgres postgres
>> postgres      0        120 12737 12737 13:01:36 13:01:36 13:01:36

> See the difference?  120 vs 144?  What's causing that I wonder...
> and would it explain the failure to reattach?

After further investigation, "Invalid argument" is the typical kernel
error code from shmget() if one tries to attach to an existing shared
memory segment that is smaller than one asked for.  So that's
consistent.  The size requested for the spinlock segment (which is
the only one of Postgres' three shmem segments that could be as small
as 144 bytes) is computed by "sizeof(struct foo)"; there is no way
that that is going to change from one invocation to the next.  But
the numbers 120 and 144 are consistent with the theory that the shmem
segment was originally created by Postgres 6.5 and you are now trying
to attach to it with Postgres 7.0 --- 7.0 has more spinlocks than 6.5
did.

Your trace appeared to show a working 7.0 postmaster getting this error
while trying to reinitialize.  That doesn't make any sense to me; if
the 7.0 postmaster had managed to start up originally, then it must
have found or created a suitably-sized shmem segment.  So I'm confused
about the details, but I've got to think that we are looking at some
sort of interference between 6.5 and 7.0 installations.  One possibility
is that after you started the 7.0 postmaster, you accidentally tried to
start a 6.5 postmaster on the same port number, and the 6.5 code managed
to resize the shared mem segment before failing because of the port
number conflict.  Not sure if that could happen --- my shmget() man
page doesn't say anything about changing the size of an already-existing
shmem segment, but maybe your Unix works differently.

Comments anyone?  If this actually is what happened, we should reorder
the startup sequence to check for port-number conflicts before any
shared memory segments are touched.  But I'm not sure about it.

            regards, tom lane

PS: if you don't see the connection between port number and shmem,
it's this: the key numbers used for shmem segments are computed from
the port number.  So different postmasters can coexist on one machine
if they have different port numbers; they'll get different shmem
segments.  But starting two postmasters on the same port is bad news.
I thought we had adequate interlocks against that, but now I'm wondering.
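The thread doesn't show the key derivation itself, but the logged key (5432110) together with the default port (5432) is consistent with a scheme along the lines of port * 1000 + a per-segment offset. A sketch of that inference (the function name and the offset are hypothetical, read off the log rather than taken from the source):

```python
# Hypothetical reconstruction of the port -> IPC key mapping implied by
# the log line "shmget failed ... key=5432110" on the default port 5432.
DEFAULT_PORT = 5432

def ipc_key(port, offset):
    """Distinct ports land in disjoint key ranges, so postmasters on
    different ports get different shmem segments."""
    return port * 1000 + offset

# The spinlock segment's key as seen in the log:
print(ipc_key(DEFAULT_PORT, 110))  # 5432110
# A postmaster on another port would use a non-overlapping key range:
print(ipc_key(5433, 110))          # 5433110
```

This also makes the failure mode concrete: two postmasters on the same port compute identical keys and fight over the same segments.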