Discussion: Understanding memory usage
Hello,
I'm trying to understand how Psycopg behaves memory-wise. Using the original query:
SELECT pagecontent FROM test WHERE siteid = 3;
loaded 0
VmSize: 3251264 kB
cleared 0
VmSize: 112924 kB
loaded 1
VmSize: 3250496 kB
cleared 1
VmSize: 112928 kB
So far, so good -- incidentally, the lowest memory usage seems to be 112 MB, whereas it is reported as 78 MB in bug report #78. Any idea why?
Now, using this very simple query (column1 being a NUMBER, around 2 million rows):
SELECT column1 FROM table1
loaded 0
VmSize: 259392 kB
cleared 0
VmSize: 232960 kB
loaded 1
VmSize: 259392 kB
cleared 1
VmSize: 226676 kB
So why isn't the memory reclaimed as in the first case?
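For reference, the harness behind the "VmSize" lines is roughly the following sketch (the helper names are made up; the figures come from /proc/self/status on Linux):

```python
# Simplified sketch of the measurement harness (helper names are made up):
# the VmSize/VmRSS figures in this thread come from /proc/self/status.
import gc

def parse_vm_metric(status_text, key):
    """Return the value in kB of a VmSize:/VmRSS: line from /proc/*/status."""
    for line in status_text.splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])   # "VmSize:  3251264 kB" -> 3251264
    raise KeyError(key)

def current_vm(key="VmSize"):
    with open("/proc/self/status") as f:
        return parse_vm_metric(f.read(), key)
```

Each "loaded n" line prints current_vm() right after cursor.fetchall(); each "cleared n" line prints it after deleting the data and calling gc.collect().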
I've also done some tests where the fetchall() call is commented out altogether. The results are:
SELECT pagecontent FROM test WHERE siteid = 3;
loaded 0
VmSize: 2226860 kB
cleared 0
VmSize: 112920 kB
loaded 1
VmSize: 2226488 kB
cleared 1
VmSize: 112924 kB
SELECT column1 FROM table1
loaded 0
VmSize: 159772 kB
cleared 0
VmSize: 113036 kB
loaded 1
VmSize: 162356 kB
cleared 1
VmSize: 112920 kB
In that case, the memory is effectively reclaimed for the second query, returning to the 112 MB floor I mentioned before.
But why does Psycopg need memory just to run a query, without fetching the results?
Thanks for your help!
--
Damiano Albani
On Mon, Oct 28, 2013 at 9:28 PM, Damiano Albani <damiano.albani@gmail.com> wrote:
> But what is the reason why Psycopg needs memory to just run a query, without fetching results?
Because the result is returned to the client as the response to the query and is stored inside the cursor. fetch*() only returns it to Python.
The behaviour of actually getting the result on fetch*() is obtained using a server-side cursor. The price you pay is more network round trips and more resources used by the server.
-- Daniele
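A minimal sketch of the server-side cursor approach described above (the connection, table name, and batch size are placeholders, not from the thread):

```python
# Minimal sketch of a server-side cursor in psycopg2 (table name and batch
# size are placeholders). A *named* cursor keeps the result set on the
# server; rows travel to the client in batches of `itersize`.
import math

def expected_roundtrips(total_rows, itersize):
    # one FETCH round trip per batch of itersize rows
    return max(1, math.ceil(total_rows / itersize))

def stream_rows(conn, itersize=2000):
    cur = conn.cursor(name="stream_cur")   # a name makes it server-side
    cur.itersize = itersize                # rows per network round trip
    cur.execute("SELECT column1 FROM table1")
    for row in cur:                        # fetched lazily, batch by batch
        yield row
    cur.close()
```

Note that a named cursor must run inside a transaction, which psycopg2 provides by default (autocommit off).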
Hello,
On Tue, Oct 29, 2013 at 12:23 AM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
> Because the result is returned to the client as the response to the query and is stored inside the cursor. fetch*() only returns it to Python.
So why does calling "fetch*()" use additional memory then? Does it copy the data returned from the database?
By the way, I've re-run my tests but focused on the VmRSS metric, which represents how much actual physical memory is used by the process.
And I got the same behaviour, that is, almost no memory is reclaimed after fetching a large number of rows.
For instance, if I fetch 2 million small rows, memory usage peaks around 500 MB and then only drops to ~450 MB after the data is freed.
On the other hand, fetching 100 large rows amounts to a 3 GB peak, which subsequently falls back to 10 MB.
So is it a problem related to Psycopg itself or rather how Python handles memory in general?
Regards,
Damiano Albani
On Wed, Oct 30, 2013 at 5:24 PM, Damiano Albani <damiano.albani@gmail.com> wrote:
> Hello,
>
> > Because the result is returned to the client as the response to the query and is stored inside the cursor. fetch*() only returns it to Python.
>
> So why does calling "fetch*()" use additional memory then? Does it copy the data returned from the database?
Data in the cursor is stored in the form of a PQresult structure, which is an opaque object for which the libpq provides access functions.
Such data is converted into Python objects when fetch*() is used. This usually implies a copy, because e.g. Python strings own their data, but even returning numbers to Python generally implies creating new instances.
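A back-of-the-envelope illustration of that per-object cost (the exact sizes are CPython-version and platform dependent):

```python
# Rough arithmetic for why millions of small rows cost so much once they
# become Python objects: every value and its wrapping tuple carry
# interpreter overhead on top of the payload. Sizes vary by CPython build.
import sys

payload = b"x" * 100            # a ~100-byte value as stored in the PQresult
row = (payload,)                # each record becomes at least a 1-tuple

per_row_overhead = (sys.getsizeof(payload) - 100) + sys.getsizeof(row)
total_overhead_mb = 2_000_000 * per_row_overhead / (1024 * 1024)
# On a typical 64-bit build, the interpreter overhead alone runs to well
# over 100 MB on top of the ~200 MB of raw data.
```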
> By the way, I've re-run my tests but focused on the VmRSS metric, which represents how much actual physical memory is used by the process. And I got the same behaviour, that is, almost no memory is reclaimed after fetching a large number of rows. For instance, if I fetch 2 million small rows, memory usage peaks around 500 MB and then only drops to ~450 MB after the data is freed.
What do you mean by "freed"? Have you deleted the cursor and made sure the gc reclaimed it? The cursor doesn't destroy the internal data until it is deleted or another query is run (because after fetchall() you can invoke scroll(0) and return it to Python again). And of course, when the data returned by fetch*() is released depends on the client usage. After a big query you may see memory usage go down as soon as you execute "SELECT 1", because the result is replaced by a smaller one.
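As a toy illustration of why the cursor must hold on to its result (a stand-in, not the real libpq-backed implementation):

```python
# Toy stand-in (not the real implementation) illustrating why the cursor
# keeps its result: scroll(0) lets you replay rows already fetched, so the
# data can only be freed by a new execute() or by dropping the cursor.
class ToyCursor:
    def __init__(self):
        self._result = None              # stands in for the libpq PQresult
        self._pos = 0

    def execute(self, rows):
        self._result = list(rows)        # the previous result is freed here
        self._pos = 0

    def fetchall(self):
        rows = self._result[self._pos:]
        self._pos = len(self._result)
        return rows

    def scroll(self, value, mode="absolute"):
        self._pos = value if mode == "absolute" else self._pos + value
```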
> On the other hand, fetching 100 large rows amounts to a 3 GB peak, which subsequently falls back to 10 MB. So is it a problem related to Psycopg itself or rather to how Python handles memory in general?
The only "problem" you may attribute to Psycopg is if you find unbounded use of memory. If you run some piece of code in a loop and see memory increasing linearly, you have found a leak. Otherwise you can attribute the artefacts you see to the Python GC.
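The loop test can be sketched as follows (the sample values are assumed to be VmRSS readings in kB, taken after each iteration of the workload):

```python
# Sketch of the leak test described above: run the workload repeatedly,
# sample memory after each iteration, and flag sustained growth. The
# samples are assumed to be VmRSS readings in kB.
def looks_like_leak(samples, tolerance_kb=1024):
    """True if memory keeps climbing past the plateau of iteration 1."""
    if len(samples) < 3:
        return False                      # not enough iterations to judge
    plateau = samples[1]                  # iteration 0 may warm up caches
    growing = all(s > plateau + tolerance_kb for s in samples[2:])
    return growing and samples[-1] > samples[2]
```

A flat curve after the first iteration is the expected, healthy pattern; linear growth is the leak signature.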
-- Daniele
On Wed, Oct 30, 2013 at 7:27 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
> What do you mean by "freed"? Have you deleted the cursor and made sure the gc reclaimed it? The cursor doesn't destroy the internal data until it is deleted or another query is run (because after fetchall() you can invoke scroll(0) and return it to Python again). And of course, when the data returned by fetch*() is released depends on the client usage.
By "freed", I mean doing as in bug report #78:
del data
areadcur.close()
acon.close()
del areadcur
del acon
> After a big query you may see memory usage go down as soon as you execute "SELECT 1", because the result is replaced by a smaller one.
That's not the result that I get. Doing a query returning 2 million rows followed by a "SELECT 1" has no effect on RSS memory usage in my case.
> The only "problem" you may attribute to Psycopg is if you find unbounded use of memory. If you run some piece of code in a loop and see memory increasing linearly, you have found a leak. Otherwise you can attribute the artefacts you see to the Python GC.
Indeed, there's no memory leak that I can see. But don't you find it strange that Python / Psycopg memory management differs between two roughly equivalent queries:
- a query returning 20 rows × 10 MB each
- a query returning 2 million rows × 100 bytes each
As far as I could test, in my environment, they're clearly not equal in terms of side effects. For the first, I can reclaim the memory after getting the results. For the second, I can't.
--
Damiano Albani
On Wed, Oct 30, 2013 at 7:27 PM, Damiano Albani <damiano.albani@gmail.com> wrote:
> By "freed", I mean doing as in bug report #78:
>
> del data
> areadcur.close()
> acon.close()
> del areadcur
> del acon
Please provide a complete and repeatable script (not as the guy in #78 did: repeatable means don't fetch from a table I can't see: use generate_series() and repeat() to generate a synthetic dataset).
> That's not the result that I get. Doing a query returning 2 million rows followed by a "SELECT 1" has no effect on RSS memory usage in my case.
I've run some tests and I see what you mean: after returning *to Python* 1M of 100b strings, the VmRSS doesn't go down from its peak, even after deleting the cursor and the objects and running gc.collect(). What I also see is:
- If the dataset remains in the cursor (execute() called, not fetchall()), the memory shrinks as expected.
- Using 10 strings of 10M each also shows the expected memory usage.
- Repeating the test with 1M of 100b strings more than once doesn't bring the memory usage higher than the first time.
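For what it's worth, a repeatable version of the two workloads compared in this thread can be generated synthetically along these lines (connection handling is omitted; the function name is made up):

```python
# Synthetic, repeatable versions of the two workloads, built with
# generate_series() and repeat() so no pre-existing table is needed.
import gc

FEW_LARGE_ROWS = """
    SELECT repeat('x', 10 * 1024 * 1024)   -- 20 rows of ~10 MB each
    FROM generate_series(1, 20)
"""

MANY_SMALL_ROWS = """
    SELECT repeat('x', 100)                -- 2 million rows of ~100 B each
    FROM generate_series(1, 2000000)
"""

def run_and_release(conn, query):
    """Fetch everything, then drop all references and force a collection."""
    cur = conn.cursor()
    cur.execute(query)
    data = cur.fetchall()
    del data
    cur.close()
    del cur
    gc.collect()
```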
> Indeed, there's no memory leak that I can see. But don't you find it strange that Python / Psycopg memory management differs between two roughly equivalent queries:
>
> - a query returning 20 rows × 10 MB each
> - a query returning 2 million rows × 100 bytes each
>
> As far as I could test, in my environment, they're clearly not equal in terms of side effects. For the first, I can reclaim the memory after getting the results. For the second, I can't.
I easily expect a much bigger overhead in building millions of Python objects compared to building 20. Not only for the 37 bytes of overhead each string has (sys.getsizeof()), but also for the consequences for the GC of managing objects in the millions.
What I *suspect* is that you are seeing the combined effects of the OS VM management, the C allocator, the Python allocator (which runs on top of the C one), and the Python GC. I am not expert enough in any of these areas to provide a good insight into how everything works together.
-- Daniele
On Thu, Oct 31, 2013 at 12:01 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
> I easily expect a much bigger overhead in building millions of Python objects compared to building 20. Not only for the 37 bytes of overhead each string has (sys.getsizeof()), but also for the consequences for the GC of managing objects in the millions.
For the record, I've eventually settled on a solution using pgnumpy.
It's capable of handling results made of millions of rows with very little overhead as far as I could see.
As my original goal was to feed the data to Pandas down the line, pgnumpy seems spot on.
--
Damiano Albani