Thread: Another nasty cache problem
I'm down to the point where the parallel tests mostly work with a small
SI buffer --- but they do still sometimes fail.  I've realized that
there is a whole class of bugs along the following lines:

There are plenty of routines that do two or more SearchSysCacheTuple
calls to get the information they need.  As the code stands, it is
unsafe to continue accessing the tuple returned by SearchSysCacheTuple
after making a second such call, because the second call could possibly
cause an SI cache reset message to be processed, thereby flushing the
contents of the caches.

heap_open and CommandCounterIncrement are other routines that could
cause cache entries to be dropped.

This is a very insidious kind of bug because the probability of
occurrence is very low (at normal SI buffer size a reset is unlikely,
and even if it happens, you won't observe a failure unless the pfree'd
tuple is actually overwritten before you're done with it).  So we cannot
hope to catch these things by testing.

I am not sure what to do about it.  One solution path is to make all
the potential trouble spots do SearchSysCacheTupleCopy and then pfree
the copied tuple when done.  However, that adds a nontrivial amount of
overhead, and it'd be awfully easy to miss some trouble spots or to
introduce new ones in the future.

Another possibility is to introduce some sort of notion of a reference
count, and to make the standard usage pattern be:

	tuple = SearchSysCacheTuple(...);
	... use tuple ...
	ReleaseSysCacheTuple(tuple);

The idea here is that a tuple with positive refcount would not be
deleted during a cache reset, but would simply be removed from its
cache, and then finally deleted when released (or during elog
recovery).  This might allow us to get rid of SearchSysCacheTupleCopy,
too, since the refcount should be just as good as palloc'ing one's own
copy for most purposes.  I haven't looked at the callers of
SearchSysCacheTuple to see whether this would be a practical change to
make.

I was wondering if anyone had any comments or better ideas...

			regards, tom lane
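[The refcount scheme proposed above can be illustrated with a toy sketch. This is not PostgreSQL code: the names (CacheEntry, cache_search, cache_reset, cache_release) and the linked-list layout are invented for illustration. The point is only the lifetime rule: a reset frees unpinned entries immediately, but a pinned entry is merely unlinked and survives until its holder releases it.]

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sketch of the proposed refcount scheme -- invented names,
 * not actual backend code. */
typedef struct CacheEntry
{
    int key;
    int refcount;               /* pins held by callers */
    int in_cache;               /* still linked into the cache? */
    struct CacheEntry *next;
} CacheEntry;

static CacheEntry *cache_head = NULL;

/* Look up (or create) an entry and pin it. */
CacheEntry *
cache_search(int key)
{
    CacheEntry *e;

    for (e = cache_head; e; e = e->next)
        if (e->key == key)
        {
            e->refcount++;
            return e;
        }
    e = malloc(sizeof *e);
    e->key = key;
    e->refcount = 1;
    e->in_cache = 1;
    e->next = cache_head;
    cache_head = e;
    return e;
}

/* SI reset: free unpinned entries outright; pinned entries are only
 * unlinked from the cache, so callers holding pointers stay safe. */
void
cache_reset(void)
{
    CacheEntry *e = cache_head, *next;

    cache_head = NULL;
    for (; e; e = next)
    {
        next = e->next;
        if (e->refcount > 0)
        {
            e->in_cache = 0;
            e->next = NULL;
        }
        else
            free(e);
    }
}

/* Caller is done; free the entry now if the reset already dropped it. */
void
cache_release(CacheEntry *e)
{
    if (--e->refcount == 0 && !e->in_cache)
        free(e);
}
```

[In a real implementation, elog recovery would also have to release any pins held by the aborted transaction, as the message notes.]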
> I'm down to the point where the parallel tests mostly work with a small
> SI buffer --- but they do still sometimes fail.  I've realized that
> there is a whole class of bugs along the following lines:
>
> There are plenty of routines that do two or more SearchSysCacheTuple
> calls to get the information they need.  As the code stands, it is
> unsafe to continue accessing the tuple returned by SearchSysCacheTuple
> after making a second such call, because the second call could possibly
> cause an SI cache reset message to be processed, thereby flushing the
> contents of the caches.

Yes, I have always been aware of this problem.  The issue is that since
cache entries are removed on an oldest-removed-first basis, I never
thought that several cache lookups would be a problem.  If you do many
cache lookups and expect very old ones to still exist, that could be a
problem.

However, a full reset of the cache could cause major problems.  Is there
a way to re-load the cache after the reset with the most recently cached
entries?  Seems that would be easier.  However, your issue is probably
that the new cache entries would have different locations from the old
entries.  Is it possible to delay the cache reset of the five most
recent cache entries, and do them later?

I don't see many places where more than 2-3 cache entries are kept.
Maybe we need to keep them around somehow during cache reset.

> I am not sure what to do about it.  One solution path is to make
> all the potential trouble spots do SearchSysCacheTupleCopy and then
> pfree the copied tuple when done.  However, that adds a nontrivial
> amount of overhead, and it'd be awfully easy to miss some trouble
> spots or to introduce new ones in the future.

Sounds like a lot of overhead to do the copy.

> Another possibility is to introduce some sort of notion of a reference
> count, and to make the standard usage pattern be:
>
> 	tuple = SearchSysCacheTuple(...);
> 	... use tuple ...
> 	ReleaseSysCacheTuple(tuple);
>
> The idea here is that a tuple with positive refcount would not be
> deleted during a cache reset, but would simply be removed from its
> cache, and then finally deleted when released (or during elog
> recovery).

If you can do that, can't you just keep the few most recent ones by
default?  Seems that would be very clean.

> This might allow us to get rid of SearchSysCacheTupleCopy, too,
> since the refcount should be just as good as palloc'ing one's own
> copy for most purposes.

Yes, that would be nice.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yes, I have always been aware of this problem.  The issue is that since
> cache entries are removed on an oldest-removed-first basis, I never
> thought that several cache lookups would be a problem.

They're not, under normal circumstances...

> However, a full reset of the cache could cause major problems.  Is there
> a way to re-load the cache after the reset with the most recently cached
> entries?  Seems that would be easier.  However, your issue is probably
> that the new cache entries would have different locations from the old
> entries.  Is it possible to delay the cache reset of the five most
> recent cache entries, and do them later?

I don't think that's a good answer; what if one of those entries is the
one that the SI messages wanted us to update?  With a scheme like that,
you might be protecting a cache entry that actually isn't being used
anymore.  With a refcount you'd at least know whether it was safe to
throw the entry away.

Of course this just begs the question of what to do when an SI update
message arrives for a tuple that is locked down by refcount.  Maybe we
have to kick out an elog(ERROR) then.  Could be messy.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, I have always been aware of this problem.  The issue is that since
> > cache entries are removed on an oldest-removed-first basis, I never
> > thought that several cache lookups would be a problem.
>
> They're not, under normal circumstances...
>
> > However, a full reset of the cache could cause major problems.  Is there
> > a way to re-load the cache after the reset with the most recently cached
> > entries?  Seems that would be easier.  However, your issue is probably
> > that the new cache entries would have different locations from the old
> > entries.  Is it possible to delay the cache reset of the five most
> > recent cache entries, and do them later?
>
> I don't think that's a good answer; what if one of those entries is the
> one that the SI messages wanted us to update?  With a scheme like that,
> you might be protecting a cache entry that actually isn't being used
> anymore.  With a refcount you'd at least know whether it was safe to
> throw the entry away.
>
> Of course this just begs the question of what to do when an SI update
> message arrives for a tuple that is locked down by refcount.  Maybe
> we have to kick out an elog(ERROR) then.  Could be messy.

Yep, that was my question.  You can re-load it, but if it is the one
that just got invalidated, what do you reload?  My guess is that you
keep using the same cache entry until the current transaction finishes,
at which point you can throw it away.

Now, if we did proper locking, no SI message could arrive for such an
entry.

My assumption is that these are mostly system cache entries, and they
rarely change, right?  If someone is operating on a table that gets an
SI entry, odds are that later on the system will fail because the table
is changed in some way, right?

Actually, don't we have a transaction id for the transaction that
loaded that cache entry?  We can add a transaction id to the cache
record that shows the transaction that last accessed that cache entry.
Then we can say that if any SI message comes in for a cache entry that
was accessed by the current transaction, we throw an elog.

--
  Bruce Momjian  |  pgman@candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Now, if we did proper locking, no SI message could arrive for such an
> entry.

> My assumption is that these are mostly system cache entries, and they
> rarely change, right?  If someone is operating on a table that gets an
> SI entry, odds are that later on the system will fail because the table
> is changed in some way, right?

If the tuple is actually *changed* then that's true (and locking should
have prevented it anyway).  But we also issue cache flushes against
whole system tables in order to handle VACUUM of a system table.
There, the only thing that's actually been modified is the tuple's
physical location (ctid).  We don't want to blow away transactions that
are just looking at cache entries when a VACUUM happens.

Perhaps the caches shouldn't store ctid?  Not sure.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Now, if we did proper locking, no SI message could arrive for such an
> > entry.
>
> > My assumption is that these are mostly system cache entries, and they
> > rarely change, right?  If someone is operating on a table that gets an
> > SI entry, odds are that later on the system will fail because the table
> > is changed in some way, right?
>
> If the tuple is actually *changed* then that's true (and locking should
> have prevented it anyway).  But we also issue cache flushes against
> whole system tables in order to handle VACUUM of a system table.  There,
> the only thing that's actually been modified is the tuple's physical
> location (ctid).  We don't want to blow away transactions that are just
> looking at cache entries when a VACUUM happens.
>
> Perhaps the caches shouldn't store ctid?  Not sure.

I am guilty of that.  There are a few places where I grab the tuple
from the cache, then use that to update the heap.  I thought it was a
nifty solution at the time.  I thought I used the CacheCopy calls for
that, but I am not positive.  Even if I did, that doesn't help, because
the copy probably has an invalid tid at that point, though I have
opened the table.  Maybe I have to make sure I open the table before
getting the tid from the cache.

Is it only the tid that is of concern?  If so, that can probably be
fixed somehow.

--
  Bruce Momjian  |  pgman@candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Perhaps the caches shouldn't store ctid?  Not sure.

> I am guilty of that.  There are a few places where I grab the tuple from
> the cache, then use that to update the heap.  I thought it was a nifty
> solution at the time.  I thought I used the CacheCopy calls for that,
> but I am not positive.  Even if I did, that doesn't help because the
> copy probably has an invalid tid at that point, though I have opened
> the table.  Maybe I have to make sure I open the table before getting
> the tid from the cache.

I believe we worked that out and fixed it a few months ago: it's safe
to use the cache to find a tuple you want to update, if you open and
lock the containing table *before* doing the cache lookup.  Then you
know VACUUM's not running on that table (since you have it locked) and
you have an up-to-date TID for the tuple (since the open+lock would
have processed any pending shared-inval messages).  I went around and
made sure that's true everywhere.

What I was thinking about was adding code to the caches that would
(a) maintain refcounts on cached tuples, (b) reread rather than discard
a tuple if it is invalidated while refcount > 0, and (c) kick out an
error if the reread shows that the tuple has in fact changed.  It seems
that we would need to ignore the TID when deciding if a tuple has
changed, however.

			regards, tom lane
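[Step (c) of the plan above — deciding whether a reread tuple has "really" changed — can be sketched in a few lines. This is an invented illustration, not backend code: the struct layout and the name tuple_really_changed are hypothetical. The key point is that the TID (physical location) is excluded from the comparison, since VACUUM moves tuples without changing their contents.]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical cached-tuple layout for illustration only. */
typedef struct
{
    long tid;                   /* physical location -- VACUUM may change it */
    int  oid;
    char data[32];              /* stand-in for the rest of the tuple body */
} CachedTuple;

/* Returns 1 if the tuples differ in anything other than TID.
 * A mere TID change (VACUUM) is harmless; a content change while the
 * tuple is pinned would have to raise an elog(ERROR). */
int
tuple_really_changed(const CachedTuple *oldtup, const CachedTuple *newtup)
{
    return oldtup->oid != newtup->oid ||
           memcmp(oldtup->data, newtup->data, sizeof oldtup->data) != 0;
}
```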
> I believe we worked that out and fixed it a few months ago: it's safe
> to use the cache to find a tuple you want to update, if you open and
> lock the containing table *before* doing the cache lookup.  Then you
> know VACUUM's not running on that table (since you have it locked)
> and you have an up-to-date TID for the tuple (since the open+lock
> would have processed any pending shared-inval messages).  I went
> around and made sure that's true everywhere.

Good.

> What I was thinking about was adding code to the caches that would
> (a) maintain refcounts on cached tuples, (b) reread rather than
> discard a tuple if it is invalidated while refcount > 0, and (c)
> kick out an error if the reread shows that the tuple has in fact
> changed.  It seems that we would need to ignore the TID when deciding
> if a tuple has changed, however.

Yes, that is one solution.  We can do it the same way heap_fetch
works: it requires a Buffer pointer, which it uses to return a value,
and the caller calls ReleaseBuffer() when completed.

However, would just throwing an elog on any cache invalidate of a cache
row looked up in the current transaction/command counter make more
sense?  Sometimes you are using that cache oid in some later actions
that really can't be properly unlocked at the end.  Would be less code.

--
  Bruce Momjian  |  pgman@candle.pha.pa.us
On Sun, 30 Jan 2000, Tom Lane wrote:

> There are plenty of routines that do two or more SearchSysCacheTuple
> calls to get the information they need.  As the code stands, it is
> unsafe to continue accessing the tuple returned by SearchSysCacheTuple
> after making a second such call, because the second call could possibly
> cause an SI cache reset message to be processed, thereby flushing the
> contents of the caches.
>
> heap_open and CommandCounterIncrement are other routines that could
> cause cache entries to be dropped.

This sort of thing should be documented, at least in the comment on top
of the function.  From the developer's FAQ I gathered something like
"these tuples can be used for a short while", which is of course very
exact.

Anyway, I just counted 254 uses of SearchSysCacheTuple in the backend
tree, and a majority of these are probably obviously innocent.  Since I
don't have any more developing planned, I would volunteer to take a
look at all of those and look for violations of second cache look-up,
heap_open, and CommandCounterIncrement, fixing them where possible, or
at least pointing them out to more experienced people.  That might save
you from going out of your way and instituting some reference count or
whatever, and it would be an opportunity for me to read some code.

--
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
On Sun, 30 Jan 2000, Bruce Momjian wrote:

> > Perhaps the caches shouldn't store ctid?  Not sure.
>
> I am guilty of that.  There are a few places where I grab the tuple from
> the cache, then use that to update the heap.  I thought it was a nifty
> solution at the time.  I thought I used the CacheCopy calls for that,
> but I am not positive.  Even if I did, that doesn't help because the
> copy probably has an invalid tid at that point, though I have opened
> the table.  Maybe I have to make sure I open the table before getting
> the tid from the cache.

Urgh, I better check my code for that as well ... :(

--
Peter Eisentraut  |  peter_e@gmx.net  |  http://yi.org/peter-e/
Peter Eisentraut <e99re41@DoCS.UU.SE> writes:
> This sort of thing should be documented,

... or changed ...

> Anyway, I just counted 254 uses of SearchSysCacheTuple in the backend tree
> and a majority of these are probably obviously innocent.  Since I don't
> have any more developing planned, I would volunteer to take a look at all
> of those and look for violations of second cache look up, heap_open, and
> CommandCounterIncrement, fixing them where possible, or at least pointing
> them out to more experienced people.  That might save you from going out of
> your way and instituting some reference count or whatever, and it would be
> an opportunity for me to read some code.

I appreciate the offer, but I don't really want to fix it that way.  If
that's how things have to work, then the code will be *extremely*
fragile --- any routine that opens a relation or looks up a cache tuple
will potentially break its callers as well as itself.  And since the
probability of failure is so low, we'll never find it; we'll just keep
getting the occasional irreproducible failure report from the field.
I think we need a designed-in solution rather than a restrictive coding
rule.

Also, I am not sure that the existing uses are readily fixable.  For
example, I saw a number of crashes in the parser last night, most of
which traced to uses of Operator or Type pointers --- which are really
SearchSysCacheTuple results, but the parser passes them around with
wild abandon.  I don't see any easy way of restructuring that code to
avoid this.

I am starting to think that Bruce's idea might be the way to go: lock
down any cache entry that's been referenced since the last transaction
start or CommandCounterIncrement, and elog() if it's changed by
invalidation.  Then the only coding rule needed is "cached tuples don't
stay valid across CommandCounterIncrement", which is relatively simple
to check for.

			regards, tom lane
Tom Lane wrote:
> I'm down to the point where the parallel tests mostly work with a small
> SI buffer --- but they do still sometimes fail.

Have you committed your changes?  I tried the parallel tests with cvs
of around 5pm GMT 31 Jan, and they were all fine (I just ran out of
procs at one point).  This is much better than last week!  Thanks!

I also tried that nonsensical join from the other day, and it failed in
the same way again:

newnham=# select * from crsids,"tblPerson" where
newnham-# crsids.crsid != "tblPerson"."CRSID";
Backend sent B message without prior T
D21Enter data to be copied followed by a newline.  End with a backslash
and a period on a line by itself.

After \. :

Unknown protocol character 'M' read from backend.  (The protocol
character is the first character the backend sends in response to a
query it receives).
PQendcopy: resetting connection
Asynchronous NOTIFY 'ndropoulou' from backend with pid '1818589281' received.
Asynchronous NOTIFY 'ndropoulou' from backend with pid '1818589281' received.
pq_flush: send() failed: Broken pipe
FATAL: pq_endmessage failed: errno=32

but no NOTICEs about SI anywhere any more, in fact no messages at all
until the "Unknown protocol character" bit above.  The psql frontend
process grows to about 120Mb in size before this, if that matters
(200Mb swap free).

(Looking at why pg_dumpall creates unique indices for each different
type of index at the moment...)

Cheers,

Patrick
> I am starting to think that Bruce's idea might be the way to go: lock
> down any cache entry that's been referenced since the last transaction
> start or CommandCounterIncrement, and elog() if it's changed by
> invalidation.  Then the only coding rule needed is "cached tuples don't
> stay valid across CommandCounterIncrement", which is relatively
> simple to check for.

Yea, I had a good idea ...

--
  Bruce Momjian  |  pgman@candle.pha.pa.us
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> Tom Lane wrote:
>> I'm down to the point where the parallel tests mostly work with a small
>> SI buffer --- but they do still sometimes fail.

> Have you committed your changes?  I tried the parallel tests with cvs
> of around 5pm GMT 31 Jan, and they were all fine (I just ran out of
> procs at one point).  This is much better than last week!  Thanks!

Yes, I committed what I had last night (about 04:35 GMT 1/31).

There are cache-flush-related bugs still left to deal with, but they
seem to be far lower in probability than the ones squashed so far.
I'm finding that even with MAXNUMMESSAGES set to 8, the parallel tests
usually pass; so it seems we need some other way of testing to nail
down the remaining problems.

> I also tried that nonsensical join from the other day, and it failed in
> the same way again:
> newnham=# select * from crsids,"tblPerson" where
> newnham-# crsids.crsid != "tblPerson"."CRSID";
> Backend sent B message without prior T

Hmm.  Can you provide a self-contained test case (a script to build the
failing tables, preferably)?

			regards, tom lane
On Mon, Jan 31, 2000 at 09:02:30PM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> > Tom Lane wrote:
>
> There are cache-flush-related bugs still left to deal with, but they
> seem to be far lower in probability than the ones squashed so far.
> I'm finding that even with MAXNUMMESSAGES set to 8, the parallel tests
> usually pass; so it seems we need some other way of testing to nail down
> the remaining problems.
>
> > I also tried that nonsensical join from the other day, and it failed in
> > the same way again:
> > newnham=# select * from crsids,"tblPerson" where
> > newnham-# crsids.crsid != "tblPerson"."CRSID";
> > Backend sent B message without prior T
>
> Hmm.  Can you provide a self-contained test case (a script to build the
> failing tables, preferably)?

It seems this is a memory exhaustion thing: I have 128Mb real memory.
Attached below is the C program used to create some random data in
tables test and test2 of database test (which needs to exist).
Executing the nonsensical query

    select * from test,test2 where test.i!=test2.i;

should result in 2600*599=1557400 (ie lots of) rows being returned.
The process's memory consumption during this select grows to 128Mb,
and after a moment or two:

Backend sent D message without prior T
Backend sent D message without prior T
...

Which isn't quite the same message as before, but is of the same type.

59 processes: 2 running, 57 sleeping
CPU states:  2.3% user, 86.4% nice,  9.3% system,  0.0% interrupt,  1.9% idle
Memory: 74M Act, 37M Inact, 184K Wired, 364K Free, 95M Swap, 262M Swap free

  PID USERNAME PRI NICE  SIZE   RES STATE  TIME    WCPU     CPU COMMAND
 1547 prlw1     50    0  128M  516K run    1:28  59.28%  59.28% psql
 1552 postgres  50    0 1920K  632K run    1:37  24.32%  24.32% postgres

later, while the "Backend sent..." messages appear:

 1547 prlw1     -5    0  128M   68M sleep  1:41  23.00%  23.00% psql
 1552 postgres   2    0 1920K    4K sleep  1:41 141.00%   6.88% <postgres>

Note that there is still plenty of swap space.
The 128Mb number seems to be more than a coincidence (how to prove?)

So, is this only happening to me?  How can lack of real memory affect
timing of interprocess communication?

Cheers,

Patrick

==========================================================================
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

const char *progname;

PGresult *send_query(PGconn *db, const char *query)
{
    PGresult *res;

    res = PQexec(db, query);
    switch (PQresultStatus(res))
    {
        case PGRES_EMPTY_QUERY:
            printf("PGRES_EMPTY_QUERY: %s\n", query);
            break;
        case PGRES_COMMAND_OK:
            printf("PGRES_COMMAND_OK: %s\n", query);
            break;
        case PGRES_TUPLES_OK:
            printf("PGRES_TUPLES_OK: %s\n", query);
            break;
        case PGRES_COPY_OUT:
            printf("PGRES_COPY_OUT: %s\n", query);
            break;
        case PGRES_COPY_IN:
            printf("PGRES_COPY_IN: %s\n", query);
            break;
        case PGRES_BAD_RESPONSE:
            printf("PGRES_BAD_RESPONSE: %s\n", query);
            exit(1);
            break;
        case PGRES_NONFATAL_ERROR:
            printf("PGRES_NONFATAL_ERROR: %s\n", query);
            break;
        case PGRES_FATAL_ERROR:
            printf("PGRES_FATAL_ERROR: %s\n", query);
            exit(1);
            break;
        default:
            fprintf(stderr, "Error from %s: Unknown response from "
                    "PQresultStatus()\n", progname);
            exit(1);
            break;
    }
    return res;
}

char get_letter(void)
{
    int c;

    do
        c = (int) random() % 128;
    while (!(isascii(c) && isalpha(c)));
    return (char) tolower(c);
}

unsigned int get_num(void)
{
    return random() % 100;
}

int main(int argc, char *argv[])
{
    char id[7], query[2048];
    int i;
    PGconn *db;
    PGresult *res;

    progname = argv[0];
    srandom(42);                /* same data each time hopefully */
    db = PQconnectdb("dbname=test");
    if (PQstatus(db) == CONNECTION_BAD)
    {
        fprintf(stderr, "Error from %s: Unable to connect to database \"test\".\n",
                progname);
        exit(1);
    }
    res = send_query(db, "create table test (txt text, var varchar(7), i integer)");
    PQclear(res);
    res = send_query(db, "create table test2 (txt text, var varchar(7), i integer)");
    PQclear(res);
    for (i = 1; i <= 2600; ++i)
    {
        sprintf(id, "%c%c%c%c%03u", get_letter(), get_letter(), get_letter(),
                get_letter(), get_num());
        sprintf(query, "insert into test values ('%s','%s','%i')", id, id, i);
        res = send_query(db, query);
        PQclear(res);
    }
    for (i = 1; i <= 600; ++i)
    {
        sprintf(id, "%c%c%c%c%03u", get_letter(), get_letter(), get_letter(),
                get_letter(), get_num());
        sprintf(query, "insert into test2 values ('%s','%s','%i')", id, id, i);
        res = send_query(db, query);
        PQclear(res);
    }
    PQfinish(db);
    return 0;
}
On Thu, Feb 03, 2000 at 11:24:34AM +0000, Patrick Welche wrote:
>     char id[7], query[2048];
              ^
should be 8 to be safe..
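[The off-by-one above is easy to verify: four letters plus a three-digit number is 7 characters, and sprintf() appends a terminating NUL, so 8 bytes are required. A small sketch (id_bytes_needed is an invented helper, not from the original program) makes the requirement visible via snprintf()'s return value:]

```c
#include <stdio.h>

/* Hypothetical helper: how many bytes does the id format actually need?
 * snprintf() returns the formatted length excluding the NUL, so the
 * buffer must be that plus one. */
int
id_bytes_needed(void)
{
    char buf[16];
    int  n = snprintf(buf, sizeof buf, "%c%c%c%c%03u",
                      'a', 'b', 'c', 'd', 42u);

    return n + 1;               /* 7 characters + NUL terminator = 8 */
}
```

Writing 8 bytes through `sprintf` into `char id[7]` overruns the array by one byte, which is exactly the kind of quiet stack corruption the reply warns about.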
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> Note that there is still plenty of swap space.  The 128Mb number seems
> to be more than a coincidence (how to prove?)

The majority of Unix systems have a process size limit kernel
parameter, which is normally set to less than the amount of available
swap space (you don't want a single process running wild to chew up all
your swap and make other stuff start failing for lack of swap...)
Check your kernel parameters.

It sounds to me like the backend has hit the size limit and is not
reacting gracefully to failure of malloc() to allocate more space.
It ought to exit with an elog(FATAL), probably.  Sigh, time to take
another pass through the code to cast a suspicious eye on everyplace
that calls malloc() directly.

There's a separate question about *why* such a simple query is chewing
up so much memory.  What query plan does EXPLAIN show for your test
query?  You said this was with current sources, right?

			regards, tom lane
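[The "react gracefully" rule Tom describes amounts to checking every direct malloc() and turning failure into a clean elog(FATAL)-style exit instead of a later crash. A minimal sketch, assuming invented names (checked_malloc, fatal_error stands in for elog(FATAL); this is not the actual backend routine):]

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for elog(FATAL): report and exit cleanly. */
static void
fatal_error(const char *msg)
{
    fprintf(stderr, "FATAL: %s\n", msg);
    exit(1);
}

/* Wrapper for direct malloc() calls: allocation failure becomes a
 * controlled shutdown rather than a NULL dereference later on. */
void *
checked_malloc(size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        fatal_error("out of memory");
    return p;
}
```

Auditing "everyplace that calls malloc() directly" then reduces to replacing bare malloc() calls with the checked wrapper.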
Tom Lane wrote:
> There's a separate question about *why* such a simple query is chewing
> up so much memory.  What query plan does EXPLAIN show for your test
> query?

I can show a similar problem.

> You said this was with current sources, right?

This is with current sources: I managed to kill the backend before it
had used up all swap.  If left to run on 6.5.3 or CVS as of 2 weeks
back it would kill the whole machine; I haven't let it get that far
today.

bray=# explain select * from pg_operator as a, pg_operator as b;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=12604.88 rows=258064 width=162)
  ->  Seq Scan on pg_operator b  (cost=24.76 rows=508 width=81)
  ->  Seq Scan on pg_operator a  (cost=24.76 rows=508 width=81)

EXPLAIN

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                                 http://www.lfix.co.uk/oliver
PGP key from public servers; key ID 32B8FAA1
========================================
"O come, let us worship and bow down; let us kneel
 before the LORD our maker."        Psalms 95:6
On Thu, Feb 03, 2000 at 12:00:21PM -0500, Tom Lane wrote:
> The majority of Unix systems have a process size limit kernel parameter,
> which is normally set to less than the amount of available swap space
> (you don't want a single process running wild to chew up all your swap
> and make other stuff start failing for lack of swap...)  Check your
> kernel parameters.

Probably to do with the shell limit:

    memoryuse       125460 kbytes

> There's a separate question about *why* such a simple query is chewing
> up so much memory.  What query plan does EXPLAIN show for your test
> query?

test=# explain select * from test,test2 where test.i!=test2.i;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=64104.80 rows=1559400 width=56)
  ->  Seq Scan on test2  (cost=24.80 rows=600 width=28)
  ->  Seq Scan on test  (cost=106.80 rows=2600 width=28)

EXPLAIN

> You said this was with current sources, right?

They're about 2 days old now.  (Well, after your SI buffer overrun
fixes)

Cheers,

Patrick
"Oliver Elphick" <olly@lfix.co.uk> writes:
> Tom Lane wrote:
>> There's a separate question about *why* such a simple query is chewing
>> up so much memory.  What query plan does EXPLAIN show for your test
>> query?

> I can show a similar problem.

> bray=# explain select * from pg_operator as a, pg_operator as b;
> NOTICE:  QUERY PLAN:
> Nested Loop  (cost=12604.88 rows=258064 width=162)
>   ->  Seq Scan on pg_operator b  (cost=24.76 rows=508 width=81)
>   ->  Seq Scan on pg_operator a  (cost=24.76 rows=508 width=81)

OK, I sussed this one --- there's a (longstanding) memory leak in
catcache.c.  When entering a system-table tuple into the cache, it
forgot to free the copy of the tuple that had been created in
transaction-local memory context.  Cause enough cache entries to be
created within one transaction, and you'd start to notice the leak.

The above query exhibits the problem because it produces about 250K
tuples each with six regproc columns, and each regprocout call does a
cache lookup to convert regproc OID to procedure name.  Since you're
cycling through 500-plus different procedure names, and the cache only
keeps ~ 300 entries, there's going to be a fresh cache entry made every
time :-(

With the fix I just committed, current sources execute the above query
in constant backend memory space.  psql's space usage still goes to the
moon, of course, since it's trying to buffer the whole query result :-(
... but there's no way around that short of a major redesign of libpq's
API.  When and if we switch over to CORBA, we really need to rethink
the client access API so that buffering the query result in the
client-side library is an option, not a requirement.

I do not think this is the same problem that Patrick Welche is
complaining of, unfortunately.

			regards, tom lane
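[The leak pattern described above can be sketched with a toy allocator. This is an illustration, not the actual catcache.c code: the names (cache_enter, track_alloc, live_allocs) are invented. A tuple is duplicated into cache-owned memory, and the fix is simply to free the transaction-local working copy once the cache has its own copy; before the fix, that free was missing, leaking one copy per new cache entry per transaction.]

```c
#include <stdlib.h>
#include <string.h>

/* Counter of outstanding allocations, so the leak (or its absence)
 * is observable. */
static int live_allocs = 0;

static void *
track_alloc(size_t n)
{
    live_allocs++;
    return malloc(n);
}

static void
track_free(void *p)
{
    live_allocs--;
    free(p);
}

/* Enter a tuple into the "cache": keep a private copy, then release
 * the caller's transaction-local working copy.  The track_free() call
 * is the step that was missing -- without it, every new cache entry
 * leaked one transaction-local copy. */
char *
cache_enter(char *local_copy, size_t len)
{
    char *cached = track_alloc(len);

    memcpy(cached, local_copy, len);
    track_free(local_copy);     /* the fix: free the working copy */
    return cached;
}
```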
Tom Lane wrote:
> With the fix I just committed, current sources execute the above query
> in constant backend memory space.  psql's space usage still goes to the
> moon, of course, since it's trying to buffer the whole query result :-(
> ... but there's no way around that short of a major redesign of libpq's
> API.  When and if we switch over to CORBA, we really need to rethink
> the client access API so that buffering the query result in the client-
> side library is an option not a requirement.

What about portals?  Doesn't psql use portals?
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>   PID USERNAME PRI NICE  SIZE   RES STATE  TIME    WCPU     CPU COMMAND
>  1547 prlw1     50    0  128M  516K run    1:28  59.28%  59.28% psql
>  1552 postgres  50    0 1920K  632K run    1:37  24.32%  24.32% postgres

Sigh, I shoulda read this closely enough to notice that you were
complaining of psql memory overrun, not backend memory overrun :-(

The major problem here is that libpq's API is designed on the
assumption that libpq will buffer the whole query result in application
memory before letting the app see any of it.  I see no way around that
without a fundamental redesign of the API.  Which will happen someday,
but not today.

The minor problem is that libpq doesn't react very gracefully to
running out of memory.  It detects it OK, but then aborts query
processing, which means it gets out of step with the backend.  It needs
to be fixed so that it continues to absorb tuples (but drops them on
the floor) until the backend is done.  I've known of this problem for
some time, but have had many higher-priority problems to worry about.
Perhaps someone else would like to take it on...

			regards, tom lane
Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
> What about portals?  Doesn't psql use portals?

No ... portals are a backend concept ...

			regards, tom lane
Tom Lane wrote:
> Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
> > What about portals?  Doesn't psql use portals?
>
> No ... portals are a backend concept ...

Since when?  According to the old doco you do...

    select portal XX * from table_name where ...;
    fetch 20 into XX;

If PQexec() is called with "fetch 20" at a time, wouldn't this mean
that you don't exhaust front-end memory with a big query?
Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
>> No ... portals are a backend concept ...

> Since when?
> According to the old doco you do...
> select portal XX * from table_name where ...;
> fetch 20 into XX.

That still works if you spell it in the SQL-approved way, DECLARE CURSOR
followed by FETCH.

> If the PQexec() is called with "fetch 20" at a time
> wouldn't this mean that you wouldn't exhaust front-end
> memory with a big query?

Sure, and that's how you work around the problem.  Nonetheless this
requires the user to structure his queries to avoid sucking up a lot
of data in a single query.  If the user doesn't have any particular
reason to need random access into a query result, it'd be nicer to be
able to read the result in a streaming fashion without buffering it
anywhere *or* making arbitrary divisions in it.

In any case, psql doesn't (and IMHO shouldn't) convert a SELECT into
a series of FETCHes for you.

			regards, tom lane
It seems that I am still tracking problems, but each time they turn out
to have a different cause:

A slight variant on the select that caused memory to run out gives

newnham=# select crsids.surname, "tblPerson"."Surname" from crsids,"tblPerson"
          where crsids.usn="tblPerson"."USN"::int4;
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Nested Loop  (cost=66496.62 rows=34359 width=40)
  ->  Seq Scan on tblPerson  (cost=157.62 rows=2625 width=24)
  ->  Seq Scan on crsids  (cost=25.27 rows=584 width=16)

This is the table I based the memory hog on (2600*600). The backend
closes instantly, i.e., no memory usage! And, as before, it is hard to
find a test case that fails as repeatably (i.e., my test case never
crashes, while the above case crashes every single time). "tblPerson",
as its strange capitalisation suggests, was imported from M$ Access
via ODBC.

select test.txt,test2.var from test,test2 where test2.i=test.var::int4;

Nested Loop  (cost=63504.80 rows=2600 width=40)
  ->  Seq Scan on test2  (cost=24.80 rows=600 width=16)
  ->  Seq Scan on test  (cost=105.80 rows=2600 width=24)

works fine. Any thoughts on where to look?

Cheers,

Patrick
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> newnham=# select crsids.surname, "tblPerson"."Surname" from crsids,"tblPerson" where crsids.usn="tblPerson"."USN"::int4;
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

> Any thoughts on where to look?

Is there anything in the postmaster log?  Is there a core file (look
in the database subdirectory, ie .../data/base/yourdatabase/core)?
If so, compiling the backend with -g and extracting a backtrace from
the resulting corefile with gdb would be very useful info.

			regards, tom lane
On Fri, Feb 04, 2000 at 03:58:57PM -0500, Tom Lane wrote:
>
> > Any thoughts on where to look?
>
> Is there anything in the postmaster log?

DEBUG:  Data Base System is in production state at Fri Feb  4 17:11:05 2000
Server process (pid 3588) exited with status 11 at Fri Feb  4 17:14:57 2000
Terminating any active server processes...
Server processes were terminated at Fri Feb  4 17:14:57 2000
Reinitializing shared memory and semaphores

> Is there a core file (look
> in the database subdirectory, ie .../data/base/yourdatabase/core)?

But no core file ... so who knows where the SIGSEGV comes from. (Don't
worry, coredumpsize is unlimited.)

> If so, compiling the backend with -g and extracting a backtrace from
> the resulting corefile with gdb would be very useful info.

(I already have the -g.) Still looking...

Cheers,

Patrick
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>> Is there anything in the postmaster log?

> DEBUG:  Data Base System is in production state at Fri Feb  4 17:11:05 2000
> Server process (pid 3588) exited with status 11 at Fri Feb  4 17:14:57 2000

> But no core file ... so who knows what the sigsegv comes from. (don't worry
> coredumpsize unlimited)

There sure oughta be a corefile after a SIGSEGV.  Hmm.  How are you
starting the postmaster --- is it from a system startup script?
It might work better to start it from an ordinary user process.

I discovered the other day on a Linux box that the system just plain
would not dump a core file from a process started by root, even though
the process definitely had nonzero "ulimit -c" and had set its euid
to a nonprivileged userid.  But start the same process by hand from an
unprivileged login, and it would dump a core file.  Weird.  Dunno if
your platform behaves the same way, but it's worth trying.

			regards, tom lane
The killer query was:

select crsids.surname,"tblPerson"."Surname" from crsids,"tblPerson"
where crsids.usn="tblPerson"."USN"::int4;

and the reason for the SIGSEGV is that somehow text_int4(text *string)
in src/backend/utils/adt/int.c is called with string=(text *)0x0, so
obviously this is a problem! crsids.usn is integer, "tblPerson"."USN"
is varchar(9).

Oddly enough, text_int4 is called from fmgr.c:136, which is in the case
statement for n_arguments=2, yet that should be 1:

(gdb) print {FmgrInfo}0x8221a30
$4 = {fn_addr = 0x80f9dbc <text_int4>, fn_plhandler = 0, fn_oid = 1620,
  fn_nargs = 1}

unless gdb is reporting the wrong line number. values->data[0]=0=string.

I have a backtrace and a pretty-printed copy of the query tree if
useful... Still trying to make a small test case... Any suggestions
appreciated!

Cheers,

Patrick (source of 31st Jan)

On Sat, Feb 05, 2000 at 12:18:29PM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> >> Is there anything in the postmaster log?
>
> > DEBUG: Data Base System is in production state at Fri Feb  4 17:11:05 2000
> > Server process (pid 3588) exited with status 11 at Fri Feb  4 17:14:57 2000
>
> > But no core file ... so who knows what the sigsegv comes from. (don't worry
> > coredumpsize unlimited)
>
> There sure oughta be a corefile after a SIGSEGV.  Hmm.  How are you
> starting the postmaster --- is it from a system startup script?
> It might work better to start it from an ordinary user process.
> I discovered the other day on a Linux box that the system just plain
> would not dump a core file from a process started by root, even though
> the process definitely had nonzero "ulimit -c" and had set its euid
> to a nonprivileged userid.  But start the same process by hand from an
> unprivileged login, and it would dump a core file.  Weird.  Dunno if
> your platform behaves the same way, but it's worth trying.
>
> 			regards, tom lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> and the reason for the SIGSEGV, is that somehow, text_int4(text *string) in
> src/backend/utils/adt/int.c is called with string=(text *)0x0, so obviously
> this is a problem!

Um.  Probably you have a NULL value in "tblPerson"."USN" somewhere?
There are a lot of functions without adequate defenses against NULL
inputs :-( --- we've been cleaning them up slowly, but evidently you
found another one.

			regards, tom lane
On Fri, Feb 11, 2000 at 06:18:32PM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> > and the reason for the SIGSEGV, is that somehow, text_int4(text *string) in
> > src/backend/utils/adt/int.c is called with string=(text *)0x0, so obviously
> > this is a problem!
>
> Um.  Probably you have a NULL value in "tblPerson"."USN" somewhere?

Yes, of course! Naturally I was looking for something far too
complicated, and the trees got in the way... And that's why my test
case didn't work.

> There are a lot of functions without adequate defenses against NULL
> inputs :-( --- we've been cleaning them up slowly, but evidently you
> found another one.

So the trouble is, if the function returns an int and you want to say
"return null", there really isn't a value that can be stuck into the
int that represents null?

In the meantime, I think this might help, so I would have seen:

newnham=# select crsids.surname,"tblPerson"."Surname" from crsids,"tblPerson"
          where crsids.usn="tblPerson"."USN"::int4;
ERROR:  Trying to convert NULL text to integer (int4)

Cheers,

Patrick

Index: int.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/utils/adt/int.c,v
retrieving revision 1.32
diff -c -r1.32 int.c
*** int.c	2000/01/26 05:57:14	1.32
--- int.c	2000/02/14 11:22:32
***************
*** 277,282 ****
--- 277,285 ----
  	int		len;
  	char	   *str;
  
+ 	if (!string)
+ 		elog(ERROR, "Trying to convert NULL text to integer (int2)");
+ 
  	len = (VARSIZE(string) - VARHDRSZ);
  	str = palloc(len + 1);
***************
*** 317,322 ****
--- 320,328 ----
  	int		len;
  	char	   *str;
  
+ 	if (!string)
+ 		elog(ERROR, "Trying to convert NULL text to integer (int4)");
  
  	len = (VARSIZE(string) - VARHDRSZ);
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> +	if (!string)
> +		elog(ERROR, "Trying to convert NULL text to integer (int2)");

This is unreasonable behavior.  The correct patch is just

	if (!string)
		return 0;

which will allow the function manager to plow ahead with returning the
NULL that it's going to return anyway.  See the past pghackers threads
about redesigning the function manager interface if you don't understand
what's going on here.

			regards, tom lane
On Mon, Feb 14, 2000 at 11:00:16AM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>
> > +	if (!string)
> > +		elog(ERROR, "Trying to convert NULL text to integer (int2)");
>
> This is unreasonable behavior.  The correct patch is just
>
> 	if (!string)
> 		return 0;
>
> which will allow the function manager to plow ahead with returning the
> NULL that it's going to return anyway.  See the past pghackers threads
> about redesigning the function manager interface if you don't understand
> what's going on here.

Off the top of my head, that means that NULL and the string "0" both
return 0... OK - I'll look for the mail thread.

Cheers,

Patrick