Discussion: Problem on AIX with current
Per Tom's request (1000 concurrent backends), I tried current on IBM AIX 5L and found that make check hangs at: parallel group (13 tests): float4 oid varchar. pgbench hangs too if more than 4 or so concurrent backends are involved. Unfortunately gdb does not work well on AIX, so I'm stuck. Maybe the new locking code? BTW PostgreSQL 7.1.3 works fine. -- Tatsuo Ishii
> Per Tom's request (1000 concurrent backends), I tried current on IBM > AIX 5L and found that make check hangs at: > > parallel group (13 tests): float4 oid varchar > > pgbench hangs too if more than 4 or so concurrent backends are > involved. I once had hangs during make check on AIX 4, but after make distclean and a rebuild I was never able to reproduce them. Can you read the man page for cs(3)? On AIX 4 it says cs is not recommended and suggests using compare_and_swap; maybe AIX 5 has more to say? > Unfortunately gdb does not work well on AIX, so I'm stuck. > Maybe the new locking code? Use dbx (and ddd)? I don't have access to AIX 5. Andreas
> > Per Tom's request (1000 concurrent backends), I tried current on IBM
> > AIX 5L and found that make check hangs at:
> >
> > parallel group (13 tests): float4 oid varchar
> >
> > pgbench hangs too if more than 4 or so concurrent backends are
> > involved.
>
> I once had hangs during make check on AIX 4, but after make distclean
> and a rebuild I was never able to reproduce them.
I think I did make distclean.
> Can you read the man page for cs(3)? On AIX 4 it says cs is not recommended
> and suggests using compare_and_swap; maybe AIX 5 has more to say?
Note: The cs subroutine is only provided to support binary compatibility with AIX Version 3 applications. When writing new applications, it is not recommended to use this subroutine; it may cause reduced performance in the future. Applications should use the compare_and_swap (compare_and_swap Subroutine) subroutine, unless they need to use unaligned memory locations.
Seems the same as AIX 4?
> > Unfortunately gdb does not work well on AIX, so I'm stuck.
> > Maybe the new locking code?
>
> Use dbx (and ddd)?
Here is a stack trace using dbx.
semop(??, ??, ??) at 0xd02be73c
IpcSemaphoreLock(??, ??, ??), line 425 in "ipc.c"
LWLockAcquire(??, ??), line 270 in "lwlock.c"
LockAcquire(??, ??, ??, ??, ??), line 482 in "lock.c"
LockRelation(??, ??), line 153 in "lmgr.c"
heap_openr(??, ??), line 512 in "heapam.c"
scan_pg_rel_ind(??, ??), line 380 in "relcache.c"
ScanPgRelation(??, ??), line 307 in "relcache.c"
IndexedAccessMethodInitialize(??, ??, ??), line 994 in "relcache.c"
RelationNameGetRelation(??), line 1484 in "relcache.c"
heap_openr(??, ??), line 502 in "heapam.c"
setTargetTable(??, ??, ??, ??), line 136 in "parse_clause.c"
transformUpdateStmt(??, ??), line 2416 in "analyze.c"
transformStmt(??, ??), line 228 in "analyze.c"
parse_analyze(??, ??), line 92 in "analyze.c"
pg_analyze_and_rewrite(??), line 428 in "postgres.c"
unnamed block $b1877, line 740 in "postgres.c"
unnamed block $b1876, line 740 in "postgres.c"
unnamed block $b1872, line 740 in "postgres.c"
pg_exec_query_string(??, ??, ??), line 740 in "postgres.c"
PostgresMain(??, ??, ??, ??, ??), line 1943 in "postgres.c"
DoBackend(??), line 2104 in "postmaster.c"
BackendStartup(??), line 1837 in "postmaster.c"
unnamed block $b1665, line 917 in "postmaster.c"
ServerLoop(), line 917 in "postmaster.c"
PostmasterMain(??, ??), line 712 in "postmaster.c"
main(argc = 0, argv = (nil)), line 178 in "main.c"
... cs(3) > > Seems same as AIX 4? Yes, identical. > > Hmm, does anyone want to produce new s_lock code for AIX that uses > compare_and_swap? But I'm not sure that's the problem here. I did once, but performance was worse, so I discarded it. Since AIX 5 still has it, I see no reason to change it. Still, testing it on AIX 5 might reveal that compare_and_swap is now faster, Tatsuo ? Andreas
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >> Can you read the man page for cs(3)? On AIX 4 it says cs is not recommended >> and suggests using compare_and_swap; maybe AIX 5 has more to say? > Note: The cs subroutine is only provided to support binary > compatibility with AIX Version 3 applications. When writing new > applications, it is not recommended to use this subroutine; it may cause > reduced performance in the future. Applications should use the > compare_and_swap (compare_and_swap Subroutine) subroutine, unless they > need to use unaligned memory locations. > Seems the same as AIX 4? Hmm, does anyone want to produce new s_lock code for AIX that uses compare_and_swap? But I'm not sure that's the problem here. > Here is a stack trace using dbx. > semop(??, ??, ??) at 0xd02be73c > IpcSemaphoreLock(??, ??, ??), line 425 in "ipc.c" > LWLockAcquire(??, ??), line 270 in "lwlock.c" > LockAcquire(??, ??, ??, ??, ??), line 482 in "lock.c" This process is waiting to acquire the LockMgr lock. You need to look at the rest of the processes and try to figure out who's got the lock. regards, tom lane
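For readers who want to see what spinlock code built on a compare-and-swap primitive might look like, here is a minimal sketch. It uses portable C11 atomics as a stand-in: it does not call the AIX compare_and_swap subroutine and it is not PostgreSQL's s_lock code. The type and function names (demo_slock_t, demo_tas, demo_lock, demo_unlock) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. Not PostgreSQL's slock_t. */
typedef atomic_int demo_slock_t;

/*
 * One test-and-set attempt expressed as compare-and-swap:
 * atomically replace 0 with 1. Returns true if the lock was taken.
 */
bool demo_tas(demo_slock_t *lock)
{
    int expected = 0;

    /* Succeeds only if *lock still equals 'expected' (0) at this instant. */
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

/* Spin until the lock is acquired (no backoff or sleep in this sketch). */
void demo_lock(demo_slock_t *lock)
{
    while (!demo_tas(lock))
        ;   /* busy-wait */
}

/* Release the lock; the caller must currently hold it. */
void demo_unlock(demo_slock_t *lock)
{
    assert(atomic_load(lock) == 1);
    atomic_store(lock, 0);
}
```

A caller would wrap a critical section in demo_lock(&l) and demo_unlock(&l). A real implementation would also need a backoff or sleep strategy after repeated failed attempts, which this sketch leaves out.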
> > Here is a stack trace using dbx. > > > semop(??, ??, ??) at 0xd02be73c > > IpcSemaphoreLock(??, ??, ??), line 425 in "ipc.c" > > LWLockAcquire(??, ??), line 270 in "lwlock.c" > > LockAcquire(??, ??, ??, ??, ??), line 482 in "lock.c" > > This process is waiting to acquire the LockMgr lock. You need to look > at the rest of the processes and try to figure out who's got the lock. Strangely enough, there's no other backend here (except the stats collectors, of course). I made sure of this with ps and the pg_stat_activity view. BTW the pg_stat_activity view shows: 16556 | test | 197378 | 1 | postgres | update accounts set abalance = abalance + 406, filler = 'added amount toabalance is 406' where aid = 1447 -- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > Strangely enough, there's no other backend here (except the stats > collectors, of course). I made sure of this with ps and the pg_stat_activity view. If you have no better way of determining what's going on, it might help to recompile with LOCK_DEBUG defined, then enable trace_lwlocks in postgresql.conf (better turn on debug_print_query, log_timestamp, and log_pid too). This will generate rather voluminous log output, perhaps enough to provide a clue. regards, tom lane
> If you have no better way of determining what's going on, it might help > to recompile with LOCK_DEBUG defined, then enable trace_lwlocks in > postgresql.conf (better turn on debug_print_query, log_timestamp, and > log_pid too). This will generate rather voluminous log output, perhaps > enough to provide a clue. When I recompiled with LOCK_DEBUG and set trace_lwlocks = true, it *works* (and I saw lots of lock debugging messages, of course). However, if I turn trace_lwlocks off, the backend gets stuck again. Is there anything I can do? Note the machine has 4 processors. Is that related? -- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > When I recompiled with LOCK_DEBUG and set trace_lwlocks = true, it *works* > (and I saw lots of lock debugging messages, of course). However, if I > turn trace_lwlocks off, the backend gets stuck again. Ugh ... ye classic Heisenbug ... > Is there anything I can do? Apparently the problem is timing-sensitive, which is hardly surprising for a lock issue. You might find that it occurs some of the time if you repeat the test over and over. > Note the machine has 4 processors. Is that related? Hard to tell at this point, but considering that no one else has reported a problem so far, it does seem like multiple CPUs at least help to make the failure more probable. But it could just be a portability problem. Do you have another machine with identical OS and fewer processors to try for comparison? Andreas, have you tried CVS tip lately on AIX? What are your results? regards, tom lane
> Andreas, have you tried CVS tip lately on AIX? What's your results? All 77 ok, no hangs, with make check on single CPU AIX 4.3.2. Only problem on AIX is, that the argv[0] stuff does not work anymore (I think since we don't exec() anymore), which is rather annoying. Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > Only problem on AIX is, that the argv[0] stuff does not work anymore > (I think since we don't exec() anymore), which is rather annoying. Hmm, perhaps we are selecting the wrong PS_STRINGS method for AIX? Please look at src/backend/utils/misc/ps_status.c and see if one of the other methods will work on AIX. regards, tom lane
> > Only problem on AIX is, that the argv[0] stuff does not work anymore > > (I think since we don't exec() anymore), which is rather annoying. > > Hmm, perhaps we are selecting the wrong PS_STRINGS method for AIX? > Please look at src/backend/utils/misc/ps_status.c and see if one of > the other methods will work on AIX. Yes, I see. Quite silly that I did not look earlier. The compiler does not define _AIX4 or _AIX3, no idea who thought that. It only defines _AIX, _AIX32, _AIX41 and _AIX43. I am quite sure that all AIX Versions accept the CLOBBER method, thus I ask you to apply the following patch, to make it work. Andreas
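The attached patch itself is not reproduced in this thread. As a rough illustration of the idea only, the sketch below selects a ps-status method from compiler-predefined macros by testing _AIX (which the message above reports is defined) rather than _AIX3 or _AIX4 (which it reports are not). The macro DEMO_PS_METHOD and the function demo_ps_method are hypothetical names, and this is not the actual code in ps_status.c.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical selection logic, assuming (per the message above) that
 * the AIX compiler defines _AIX but not _AIX3 or _AIX4.
 * The method names are labels taken from the thread, used as strings.
 */
#if defined(_AIX)
#define DEMO_PS_METHOD "CLOBBER"        /* overwrite the argv area */
#else
#define DEMO_PS_METHOD "PS_USE_NONE"    /* no ps display update */
#endif

/* Return the method label chosen at compile time. */
const char *demo_ps_method(void)
{
    return DEMO_PS_METHOD;
}

/* True if some ps-status update method was selected. */
int demo_uses_ps_magic(void)
{
    const char *m = demo_ps_method();

    assert(m != NULL);
    return strcmp(m, "PS_USE_NONE") != 0;
}
```

Compiled on a platform where _AIX is not predefined, demo_ps_method() falls through to "PS_USE_NONE", which mirrors the symptom described in the thread when the tested macro is never defined.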
Your patch has been added to the PostgreSQL unapplied patches list at: http://candle.pha.pa.us/cgi-bin/pgpatches I will try to apply it within the next 48 hours. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > > Only problem on AIX is, that the argv[0] stuff does not work anymore > > > (I think since we don't exec() anymore), which is rather annoying. > > > > Hmm, perhaps we are selecting the wrong PS_STRINGS method for AIX? > > Please look at src/backend/utils/misc/ps_status.c and see if one of > > the other methods will work on AIX. > > Yes, I see. Quite silly that I did not look earlier. > The compiler does not define _AIX4 or _AIX3, no idea who thought that. > It only defines _AIX, _AIX32, _AIX41 and _AIX43. > > I am quite sure that all AIX Versions accept the CLOBBER method, > thus I ask you to apply the following patch, to make it work. CLOBBER does not work with AIX5L, nor does CHANGE_ARGV. (SETPROCTITLE, PSTAT and PS_STRINGS cannot be used since AIX5L does not have the appropriate header files.) -- Tatsuo Ishii
Patch rejected, please resubmit: CLOBBER does not work with AIX5L, nor does CHANGE_ARGV. (SETPROCTITLE, PSTAT and PS_STRINGS cannot be used since AIX5L does not have the appropriate header files.) -- Tatsuo Ishii -- Bruce Momjian
> > I am quite sure that all AIX Versions accept the CLOBBER method, > > thus I ask you to apply the following patch, to make it work. > > CLOBBER does not work with AIX5L, nor does CHANGE_ARGV. (SETPROCTITLE, > PSTAT and PS_STRINGS cannot be used since AIX5L does not have the > appropriate header files.) Have you actually tried my patch, and what was the effect? The previous code was wrong, since it did not do any PS magic; it defaulted to PS_USE_NONE. Otherwise, can you please tell me a predefine for AIX5? Thanks. Andreas
> > > I am quite sure that all AIX Versions accept the CLOBBER method, > > > thus I ask you to apply the following patch, to make it work. > > > > CLOBBER does not work with AIX5L, nor does CHANGE_ARGV. (SETPROCTITLE, > > PSTAT and PS_STRINGS cannot be used since AIX5L does not have the > > appropriate header files.) > > Have you actually tried my patch, and what was the effect? > The previous code was wrong, since it did not do any PS magic; > it defaulted to PS_USE_NONE. To make sure I did everything correctly, I checked out fresh sources from CVS and applied your patches again. The result: it works fine! I don't know why, but I must have done something wrong. :-< Sorry for the wrong info. Bruce, please apply the patches. BTW, I'm still getting the stuck backends. New info: a snapshot dated 10/3 works fine. -- Tatsuo Ishii
> BTW, I'm still getting the stuck backends. New info: a snapshot > dated 10/3 works fine. I always have trouble with those different date formats. Do you mean that the problem is fixed as of 3 October, or that an old snapshot from 10 March still worked? The snapshot of 1 Oct 2001 does not hang in "make check" on an AIX 4.3.2 4-CPU machine. So it seems to be a problem on AIX5L only :-( Maybe a semaphore bug? Andreas
> > BTW, I'm still getting the stuck backends. New info: a snapshot > > dated 10/3 works fine. > > I always have trouble with those different date formats. Do you > mean that the problem is fixed as of 3 October, or that an old > snapshot from 10 March still worked? Of course the working source is from 3rd October. > The snapshot of 1 Oct 2001 does not hang in "make check" on an AIX 4.3.2 > 4-CPU machine. Oh, you have a 4-way machine too? > So it seems to be a problem on AIX5L only :-( Maybe a semaphore bug? Maybe. BTW, what is your compiler? I'm using xlc. -- Tatsuo Ishii
> > > BTW, I'm still getting the stuck backends. New info: a snapshot > > > dated 10/3 works fine. > > > > I always have trouble with those different date formats. Do you > > mean that the problem is fixed as of 3 October, or that an old > > snapshot from 10 March still worked? > > Of course the working source is from 3rd October. Tom, do you have an idea what you might have fixed to that effect? > > > The snapshot of 1 Oct 2001 does not hang in "make check" on an AIX 4.3.2 > > 4-CPU machine. > > Oh, you have a 4-way machine too? Well, the company I work for has all sorts of AIX hardware, but no AIX5 yet. I usually use a 43P 150 with one 604e CPU for development and testing, but "borrowed" another one to test the 4 CPU hang :-) > > So it seems to be a problem on AIX5L only :-( Maybe a > semaphore bug? > > Maybe. BTW, what is your compiler? I'm using xlc. Same here, xlc from VisualAge C++, maybe a different version: vac.C 5.0.1.3 COMMITTED C for AIX Compiler. In my experience, gcc-compiled code is somewhat slower. Andreas
> > > > I am quite sure that all AIX Versions accept the CLOBBER method, > > > thus I ask you to apply the following patch, to make it work. > > > > CLOBBER does not work with AIX5L, nor does CHANGE_ARGV. (SETPROCTITLE, > > PSTAT and PS_STRINGS cannot be used since AIX5L does not have the > > appropriate header files.) > > Have you actually tried my patch, and what was the effect? > The previous code was wrong, since it did not do any PS magic; > it defaulted to PS_USE_NONE. > > Otherwise, can you please tell me a predefine for AIX5? Thanks. Patch applied. Thanks. -- Bruce Momjian
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> Of course the working source is 3rd October. > Tom, do you have an idea what you might have fixed to that effect ? No idea. I've been fixing some portability issues in dynahash.c, but AFAIK they only affected the pgstats collector process not backends. Also, that breakage had existed for months... regards, tom lane