Discussion: Understanding memory usage


Understanding memory usage

From: Damiano Albani
Date:
Hello,

I'm trying to understand how Psycopg behaves memory-wise.
Basically, in my use case, it looks like memory is not released as I thought it would be.

I've taken the script mentioned in bug report #78 and run it on my system (Ubuntu 12.04/x64, PostgreSQL 9.3, Python 2.7.3, Psycopg 2.5.1).
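The script itself isn't reproduced here, but a minimal sketch of what it does could look like this (the DSN is a placeholder, and the VmSize helper assumes Linux's /proc):

    import gc
    import psycopg2

    def vmsize():
        # Read the process' virtual memory size from /proc (Linux only).
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return line.strip()

    conn = psycopg2.connect("dbname=test")  # placeholder DSN

    for i in range(2):
        cur = conn.cursor()
        cur.execute("SELECT pagecontent FROM test WHERE siteid = 3")
        data = cur.fetchall()
        print("loaded %d" % i)
        print(vmsize())

        del data
        cur.close()
        del cur
        gc.collect()
        print("cleared %d" % i)
        print(vmsize())

    conn.close()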

Using the original query:
SELECT pagecontent FROM test WHERE siteid = 3;

loaded 0
VmSize:  3251264 kB
cleared 0
VmSize:   112924 kB
loaded 1
VmSize:  3250496 kB
cleared 1
VmSize:   112928 kB


So far, so good -- incidentally, the lowest memory usage seems to be 112 MB, whereas it is reported at 78 MB in bug report #78. Any idea why?

Now, using this very simple query (column1 being a NUMBER, around 2 million rows):
SELECT column1 FROM table1

loaded 0
VmSize:   259392 kB
cleared 0
VmSize:   232960 kB
loaded 1
VmSize:   259392 kB
cleared 1
VmSize:   226676 kB


So why isn't the memory reclaimed as in the first case?

I've also done some tests where the fetchall() call is commented out altogether. The results are:
SELECT pagecontent FROM test WHERE siteid = 3;

loaded 0
VmSize:  2226860 kB
cleared 0
VmSize:   112920 kB
loaded 1
VmSize:  2226488 kB
cleared 1
VmSize:   112924 kB

SELECT column1 FROM table1

loaded 0
VmSize:   159772 kB
cleared 0
VmSize:   113036 kB
loaded 1
VmSize:   162356 kB
cleared 1
VmSize:   112920 kB


In that case, the memory is effectively reclaimed for the second query as well -- dropping to the 112 MB floor I mentioned before.

But why does Psycopg need memory just to run a query, without fetching the results?

Thanks for your help!

--
Damiano Albani

Re: Understanding memory usage

From: Daniele Varrazzo
Date:
On Mon, Oct 28, 2013 at 9:28 PM, Damiano Albani
<damiano.albani@gmail.com> wrote:

> But why does Psycopg need memory just to run a query, without fetching
> the results?

Because the result is returned to the client as the response to the
query and is stored inside the cursor. fetch*() only returns it to
Python.

The behaviour of actually getting the result on fetch*() is obtained
using a server-side cursor. The price you pay is more network
roundtrips and more resources used by the server.
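For illustration, a server-side cursor in psycopg2 is simply a named cursor; a minimal sketch (DSN, table and column names are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN

    # A named cursor is a server-side PostgreSQL cursor: the result set
    # stays on the server and rows are transferred only as they are fetched.
    cur = conn.cursor(name="big_result")
    cur.itersize = 10000  # rows per network roundtrip when iterating
    cur.execute("SELECT column1 FROM table1")

    for row in cur:       # streams the result in batches of itersize rows
        pass              # process each row here

    cur.close()
    conn.close()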

-- Daniele


Re: Understanding memory usage

From: Damiano Albani
Date:
Hello,

On Tue, Oct 29, 2013 at 12:23 AM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> Because the result is returned to the client as the response to the
> query and is stored inside the cursor. fetch*() only returns it to
> Python.

So why does calling fetch*() use additional memory, then? Does it copy the data returned from the database?


By the way, I've re-run my tests but focused on the VmRSS metric, which represents how much actual physical memory is used by the process.

And I got the same behavior, that is, almost no memory is reclaimed after having fetched a large number of rows.
For instance, if I fetch 2 million small rows, memory usage peaks around 500 MB and then only drops to ~450 MB after the data is freed.
On the other hand, fetching 100 large rows amounts to a 3 GB peak, which subsequently falls back to 10 MB.

So is this a problem related to Psycopg itself, or rather to how Python handles memory in general?

Regards,

--
Damiano Albani

Re: Understanding memory usage

From: Daniele Varrazzo
Date:

On Wed, Oct 30, 2013 at 5:24 PM, Damiano Albani <damiano.albani@gmail.com> wrote:
> Hello,
>
> On Tue, Oct 29, 2013 at 12:23 AM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
>>
>> Because the result is returned to the client as the response to the
>> query and is stored inside the cursor. fetch*() only returns it to
>> Python.
>
> So why does calling fetch*() use additional memory, then? Does it copy the data returned from the database?

Data in the cursor is stored in the form of a PGresult structure, which is an opaque object for which the libpq provides access functions.

Such data is converted into Python objects when fetch*() is called. This usually implies a copy, because e.g. Python strings own their data, but even returning numbers to Python generally implies creating new instances.
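In other words (a sketch with placeholder DSN and table names), both representations coexist for a while:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()

    cur.execute("SELECT column1 FROM table1")
    # The complete result is already in client memory here, held by the
    # cursor as a libpq PGresult (C allocations, not Python objects).

    rows = cur.fetchall()
    # fetchall() copies every value out of the PGresult into newly created
    # Python objects, so the PGresult and the Python list coexist until
    # one of them is released.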


> By the way, I've re-run my tests but focused on the VmRSS metric, which represents how much actual physical memory is used by the process.
>
> And I got the same behavior, that is, almost no memory is reclaimed after having fetched a large number of rows.
> For instance, if I fetch 2 million small rows, memory usage peaks around 500 MB and then only drops to ~450 MB after the data is freed.

What do you mean by "freed"? Have you deleted the cursor and made sure the gc reclaimed it? The cursor doesn't destroy its internal data until it is deleted or another query is run (because after fetchall() you can invoke scroll(0) and return the data to Python again). And of course, when the data returned by fetch*() is released depends on the client usage. After a big query you may see memory usage going down as soon as you execute "select 1 from false", because the result is replaced by a smaller one.
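Continuing the sketch above, that behaviour could be observed along these lines:

    cur.execute("SELECT column1 FROM table1")
    rows = cur.fetchall()            # convert the whole result to Python objects

    cur.scroll(0, mode="absolute")   # rewind the client-side result set...
    rows_again = cur.fetchall()      # ...and convert it to Python objects again

    cur.execute("SELECT 1")          # the stored result is replaced by a tiny one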

 
> On the other hand, fetching 100 large rows amounts to a 3 GB peak, which subsequently falls back to 10 MB.
>
> So is this a problem related to Psycopg itself, or rather to how Python handles memory in general?

The only "problem" you could attribute to Psycopg is if you find unbounded memory usage. If you run some piece of code in a loop and see memory increasing linearly, you have found a leak. Otherwise you can attribute the artefacts you see to the Python GC.
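A minimal sketch of such a check (the vmrss() helper and the DSN are assumptions; generate_series() keeps it independent of any table):

    import gc
    import psycopg2

    def vmrss():
        # Resident set size of this process (Linux only).
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return line.strip()

    conn = psycopg2.connect("dbname=test")  # placeholder DSN

    for i in range(20):
        cur = conn.cursor()
        cur.execute("SELECT repeat('x', 100) FROM generate_series(1, 100000)")
        cur.fetchall()
        cur.close()
        gc.collect()
        # If this figure keeps growing linearly with i there is a leak;
        # a plateau after the first iterations is just allocator behaviour.
        print("%d %s" % (i, vmrss()))

    conn.close()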
 

-- Daniele

Re: Understanding memory usage

From: Damiano Albani
Date:
On Wed, Oct 30, 2013 at 7:27 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> What do you mean by "freed"? Have you deleted the cursor and made sure the gc reclaimed it? The cursor doesn't destroy its internal data until it is deleted or another query is run (because after fetchall() you can invoke scroll(0) and return the data to Python again). And of course, when the data returned by fetch*() is released depends on the client usage.

By "freed", I mean doing like in the bug report #78:

    del data          # drop the Python row objects returned by fetchall()
    areadcur.close()  # close the cursor, releasing its client-side result
    acon.close()      # close the connection
    del areadcur
    del acon


> After a big query you may see memory usage going down as soon as you execute "select 1 from false", because the result is replaced by a smaller one.

That's not the result that I get: a query returning 2 million rows followed by a "SELECT 1" has no effect on RSS memory usage in my case.

> The only "problem" you could attribute to Psycopg is if you find unbounded memory usage. If you run some piece of code in a loop and see memory increasing linearly, you have found a leak. Otherwise you can attribute the artefacts you see to the Python GC.

Indeed, there's no memory leak that I can see. But don't you find it strange that Python / Psycopg memory management differs between two roughly equivalent queries:
  • a query returning 20 rows × 10 MB each
  • a query returning 2 million rows × 100 bytes each
As far as I could test, in my environment, they're clearly not equal in terms of side effects.
For the first, I can reclaim the memory after getting the results. For the second, I can't.

--
Damiano Albani

Re: Understanding memory usage

From: Daniele Varrazzo
Date:
On Wed, Oct 30, 2013 at 7:27 PM, Damiano Albani
<damiano.albani@gmail.com> wrote:
> On Wed, Oct 30, 2013 at 7:27 PM, Daniele Varrazzo
> <daniele.varrazzo@gmail.com> wrote:
>>
>>
>> What do you mean by "freed"? Have you deleted the cursor and made sure the
>> gc reclaimed it? The cursor doesn't destroy its internal data until it is
>> deleted or another query is run (because after fetchall() you can invoke
>> scroll(0) and return the data to Python again). And of course, when the data
>> returned by fetch*() is released depends on the client usage.
>
>
> By "freed", I mean doing like in the bug report #78:
>
>     del data
>     areadcur.close()
>     acon.close()
>     del areadcur
>     del acon
>

Please provide a complete and repeatable script (not as the guy in #78
did: repeatable means don't fetch from a table I don't see; use
generate_series() and repeat() to generate a synthetic dataset).
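For example, the table-specific query in the sketch at the top of the thread could be replaced by a synthetic one along these lines (the row count and string size mirror the figures discussed above):

    cur.execute("SELECT repeat('x', 100) FROM generate_series(1, 2000000)")
    data = cur.fetchall()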

>> After a big query you may see memory usage going down as soon as you
>> execute "select 1 from false" because the result is replaced by a smaller
>> one.
>
>
> That's not the result that I get: a query returning 2 million rows followed
> by a "SELECT 1" has no effect on RSS memory usage in my case.

I've run some tests and I see what you mean: after returning *to
Python* 1M of 100b strings, the VmRSS doesn't go down from its peak,
even after deleting the cursor and the objects and running gc.collect().
What I also see is:

- If the dataset remains in the cursor (execute() called, not
fetchall()), the memory shrinks as expected.
- Using 10 strings of 10M each also shows the expected memory usage.
- Repeating the test with 1M of 100b strings more than once doesn't
bring the memory usage higher than the first time.


>> The only "problem" you could attribute to Psycopg is if you find unbounded
>> memory usage. If you run some piece of code in a loop and see memory
>> increasing linearly, you have found a leak. Otherwise you can attribute the
>> artefacts you see to the Python GC.
>
>
> Indeed, there's no memory leak that I can see. But don't you find it strange
> that Python / Psycopg memory management differs between two roughly
> equivalent queries:
>
> a query returning 20 rows × 10 MB each
> a query returning 2 million rows × 100 bytes each
>
> As far as I could test, in my environment, they're clearly not equal in
> terms of side effects.
> For the first, I can reclaim the memory after getting the results. For the
> second, I can't.

I easily expect a much bigger overhead in building millions of Python
objects compared to building 20. Not only for the 37 bytes of overhead
each string has (sys.getsizeof()), but also for the consequences of the
GC having to manage objects in the millions.
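A quick back-of-the-envelope check of that overhead (figures from 64-bit CPython 2.7, the version used in this thread; they differ on other builds):

    import sys

    print(sys.getsizeof(""))         # 37 on CPython 2.7: fixed overhead of a str
    print(sys.getsizeof("x" * 100))  # 137: 100 bytes of data plus the overhead

    # So 2 million 100-byte strings cost at least 2e6 * 137 bytes, about 274 MB,
    # before counting the row tuples, the list holding them and GC bookkeeping.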

What I *suspect* is that you are seeing the combined effects of the OS
VM management, the C allocator, the Python allocator (which runs on top
of the C one), and the Python GC system. I am not expert enough in any
of these areas to provide good insight into how everything works
together.


-- Daniele


Re: Understanding memory usage

From: Damiano Albani
Date:
On Thu, Oct 31, 2013 at 12:01 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> I easily expect a much bigger overhead in building millions of Python
> objects compared to building 20. Not only for the 37 bytes of overhead
> each string has (sys.getsizeof()), but also for the consequences of the
> GC having to manage objects in the millions.

For the record, I've eventually settled on a solution using pgnumpy.
It's capable of handling results made of millions of rows with very little overhead, as far as I could see.
As my original goal was to feed the data to Pandas down the line, pgnumpy seems spot on.

--
Damiano Albani