Discussion: Caching Python modules
hello …

i have just fallen over a nasty problem (maybe a missing feature) with PL/Python (plpythonu) …
consider:

-- add a document to the corpus
CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$

from SecondCorpus import SecondCorpus
from SecondDocument import SecondDocument

i am doing some intense text mining here.
the problem is: is it possible to cache those imported modules from function to function call?
GD works nicely for variables, but can this actually be done with imported modules as well?
the import takes around 95% of the total time, so it is definitely something which should go away somehow.
i have checked the docs but i am none the wiser.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
On 17/08/11 14:09, PostgreSQL - Hans-Jürgen Schönig wrote:
> CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
>
> from SecondCorpus import SecondCorpus
> from SecondDocument import SecondDocument
>
> i am doing some intense text mining here.
> the problem is: is it possible to cache those imported modules from function to function call.
> GD works nicely for variables but can this actually be done with imported modules as well?
> the import takes around 95% of the total time so it is definitely something which should go away somehow.
> i have checked the docs but i am not more clever now.

After a module is imported in a backend, it stays in the interpreter's
sys.modules dictionary and importing it again will not cause the module
Python code to be executed.

As long as you are using the same backend you should be able to call
add_to_corpus repeatedly and the import statements should take a long
time only the first time you call them.

This simple test demonstrates it:

$ cat /tmp/slow.py
import time
time.sleep(5)

$ PYTHONPATH=/tmp/ bin/postgres -p 5433 -D data/
LOG: database system was shut down at 2011-08-17 14:16:18 CEST
LOG: database system is ready to accept connections

$ bin/psql -p 5433 postgres
Timing is on.
psql (9.2devel)
Type "help" for help.

postgres=# select slow();
 slow
------

(1 row)

Time: 5032.835 ms
postgres=# select slow();
 slow
------

(1 row)

Time: 1.051 ms

Cheers,
Jan
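The sys.modules behaviour Jan describes can also be checked outside the database. The following is a minimal sketch in plain Python 3, not PL/Python code: the module name slow_demo_module and the log file are hypothetical, created only for this illustration. The module body records each time it is actually executed, so a second import that is served from sys.modules leaves no second record.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for an expensive module: its body appends a line to a
# log file each time the module code is actually executed.
MODULE_NAME = "slow_demo_module"

workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "executions.log")
with open(os.path.join(workdir, MODULE_NAME + ".py"), "w") as f:
    f.write(
        "with open({!r}, 'a') as log:\n"
        "    log.write('module body executed\\n')\n".format(log_path)
    )

# Make the temporary directory importable, similar in spirit to PYTHONPATH.
sys.path.insert(0, workdir)
importlib.invalidate_caches()

first = importlib.import_module(MODULE_NAME)   # executes the module body
second = importlib.import_module(MODULE_NAME)  # served from sys.modules

with open(log_path) as log:
    executions = len(log.readlines())

print("module body executions:", executions)
print("same object:", first is second is sys.modules[MODULE_NAME])
```

Both imports return the same module object, and the module body runs once. This only illustrates the interpreter-level cache; it says nothing about where the time in the original functions actually goes.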
On 17/08/11 14:19, Jan Urbański wrote:
> This simple test demonstrates it:
>
> [example missing the slow() function code]

Oops, forgot to show the CREATE statement of the test function:

postgres=# create or replace function slow() returns void language plpythonu as $$
import slow
$$;

Jan
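The "same backend" caveat matters because each PostgreSQL backend is a separate process with its own Python interpreter. The sketch below is a rough analogy in plain Python 3, under the assumption that a fresh interpreter process behaves like a new backend in this respect: every new process starts with an empty cache for the module and has to import it again. The module name is hypothetical, and PYTHONPATH is used the same way as in Jan's test.

```python
import os
import subprocess
import sys
import tempfile

MODULE_NAME = "slow_demo_backend_module"  # hypothetical module name

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, MODULE_NAME + ".py"), "w") as f:
    f.write("VALUE = 1\n")

# Code run in each fresh interpreter: report whether the module is already
# cached before and after importing it.
child_code = (
    "import sys\n"
    "before = {name!r} in sys.modules\n"
    "import {name}\n"
    "after = {name!r} in sys.modules\n"
    "print(before, after)\n"
).format(name=MODULE_NAME)

# Point the child interpreters at the temporary module directory.
env = dict(os.environ, PYTHONPATH=workdir)


def run_fresh_interpreter():
    """Start a new Python process, loosely analogous to a new backend."""
    out = subprocess.check_output([sys.executable, "-c", child_code], env=env)
    return out.decode().strip()


results = [run_fresh_interpreter() for _ in range(2)]
print(results)
```

Each process reports that the module was not cached before its own import, so nothing carries over from one process to the next. If connections are short-lived, the first-call import cost is therefore paid once per connection.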
On Aug 17, 2011, at 2:19 PM, Jan Urbański wrote:
> As long as you are using the same backend you should be able to call
> add_to_corpus repeatedly and the import statements should take a long
> time only the first time you call them.

hello jan …

the code is actually like this …
the first function is called once per backend. it compiles some fairly fat in-memory stuff …
this takes around 2 secs or so … but this is fine and not an issue.

-- setup the environment
CREATE OR REPLACE FUNCTION textprocess.setup_sentiment(pypath text, lang text) RETURNS void AS $$
import sys
sys.path.append(pypath)
sys.path.append(pypath + "/external")

from SecondCorpus import SecondCorpus
import const

GD['path_to_classes'] = pypath
GD['corpus'] = SecondCorpus(lang)
GD['lang'] = lang

return
$$ LANGUAGE 'plpythonu' STABLE;

this is called more frequently ...

-- add a document to the corpus
CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
from SecondCorpus import SecondCorpus
from SecondDocument import SecondDocument

doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
doc1.create_sentences()
GD['corpus'].add_document(doc1)
GD['corpus'].process()
return doc1.total_score
$$ LANGUAGE 'plpythonu' STABLE;

the point here actually is: if i use the classes in a normal python command line program, this routine does not look like an issue.
creating the document object and doing the magic in there is not a problem actually …
on the SQL side this is already fairly heavy for some reason ...

 funcid | schemaname  |    funcname     | calls | total_time | self_time | ?column?
--------+-------------+-----------------+-------+------------+-----------+----------
 235287 | textprocess | setup_sentiment |    54 |     100166 |    100166 |     1854
 235288 | textprocess | add_to_corpus   |   996 |     438909 |    438909 |      440

looks like some afternoon with some more low level tools :(.

many thanks,

hans
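Assuming the figures above come from pg_stat_user_functions, where times are reported in milliseconds, they work out to roughly 1855 ms per setup_sentiment call (100166 / 54) and about 441 ms per add_to_corpus call (438909 / 996), which matches the unnamed last column. Before reaching for lower-level tools, one option is to time each step of the function body separately. The sketch below is plain Python 3 with hypothetical stand-ins (FakeDocument, FakeCorpus); the real SecondCorpus and SecondDocument classes are not shown in the thread, so only the method names are mirrored and their behaviour here is invented for illustration.

```python
import time
from collections import defaultdict


# Hypothetical stand-ins: they only mimic the method names used in the thread.
class FakeDocument:
    def __init__(self, text):
        self.text = text
        self.sentences = []
        self.total_score = 0.0

    def create_sentences(self):
        self.sentences = [s for s in self.text.split(".") if s.strip()]


class FakeCorpus:
    def __init__(self):
        self.documents = []

    def add_document(self, doc):
        self.documents.append(doc)

    def process(self):
        # Assumption for illustration only: work proportional to corpus size.
        for doc in self.documents:
            doc.total_score = float(len(doc.sentences))


# Accumulated wall-clock seconds per labelled step.
timings = defaultdict(float)


def timed(label, func, *args):
    """Call func(*args) and add the elapsed wall-clock time to timings[label]."""
    start = time.perf_counter()
    result = func(*args)
    timings[label] += time.perf_counter() - start
    return result


def add_to_corpus(corpus, text):
    # Same sequence of steps as the PL/Python function, each timed separately.
    doc = timed("construct", FakeDocument, text)
    timed("create_sentences", doc.create_sentences)
    timed("add_document", corpus.add_document, doc)
    timed("process", corpus.process)
    return doc.total_score


corpus = FakeCorpus()
scores = [add_to_corpus(corpus, "One. Two. Three.") for _ in range(100)]

# Print the steps from most to least expensive.
for label, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print("{:<18}{:.6f} s".format(label, seconds))
```

Inside plpythonu, which runs Python 2, time.time() would have to replace time.perf_counter(), and the totals could be reported with plpy.notice(). The import statements can be wrapped in the same way to check whether they really account for most of the time after the first call in a backend. The stand-in process() walks every stored document on each call only to make the breakdown visible; whether the real process() behaves similarly cannot be told from the thread.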