Обсуждение: Running update in chunks?
I have seen a lot of slow update questions asked both here and on stack overflow but they usually involve large tables. In my case the dataset is kind of small. I have an app in which I import data and then merge the records with an existing table. Currently I do most of the heavy lifting with code and it works well enough but as the imports get bigger I thought I would rewrite the code to speed it up using postgres. Basically I get the data which I consider to be dirty and I put it into a table using hstore to store the data. I then run a series of update queries to locate the "real" records in the various tables. The import data looks like this https://gist.github.com/4584366 and has about 98K records in it. The lookup table is very small only a couple of hundred records in it. This is the query I am running update cars.imports i set make_id = md.make_id from cars.models md where i.model_id = md.id; Here is the analyse "Update on imports i (cost=2.46..49720.34 rows=138858 width=526) (actual time=51968.553..51968.553 rows=0 loops=1)" " -> Hash Join (cost=2.46..49720.34 rows=138858 width=526) (actual time=0.044..408.170 rows=98834 loops=1)" " Hash Cond: (i.model_id = md.id)" " -> Seq Scan on imports i (cost=0.00..47808.58 rows=138858 width=516) (actual time=0.010..323.616 rows=98834 loops=1)" " -> Hash (cost=1.65..1.65 rows=65 width=14) (actual time=0.026..0.026 rows=65 loops=1)" " Buckets: 1024 Batches: 1 Memory Usage: 4kB" " -> Seq Scan on models md (cost=0.00..1.65 rows=65 width=14) (actual time=0.002..0.012 rows=65 loops=1)" "Total runtime: 51968.602 ms" This query takes fifty seconds on a macbook air with i7 processor and eight gigs of RAM and SSD hard drive. I am using postgres 9.2 installed with homebrew using the standard conf file. So it seems to me that this query is running as fast as it could but it's still much slower than doing things with code one record at a time (using some memoization). Anyway... Presuming I can't really do anything to speed up this query does it make sense to try and do this in chunks and if so what is the best technique for doing that.
On 21/01/13 08:04, Tim Uckun wrote: > This is the query I am running > > update cars.imports i > set make_id = md.make_id > from cars.models md where i.model_id = md.id; > > > Here is the analyse Looks like it's the actual update that's taking all the time. > This query takes fifty seconds on a macbook air with i7 processor and > eight gigs of RAM and SSD hard drive. I am using postgres 9.2 > installed with homebrew using the standard conf file. Can you try a couple of things just to check timings. Probably worth EXPLAIN ANALYSE. SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id = md.id; CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md ON i.model_id = md.id; Now the first one should take half a second judging by your previous explain. If the second one takes 50 seconds too then that's just the limit of your SSD's write. If it's much faster then something else is happening. -- Richard Huxton Archonet Ltd
> Can you try a couple of things just to check timings. Probably worth EXPLAIN > ANALYSE. > > SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id = > md.id; Takes about 300 ms > > CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md > ON i.model_id = md.id; Takes about 300 ms > > Now the first one should take half a second judging by your previous > explain. If the second one takes 50 seconds too then that's just the limit > of your SSD's write. If it's much faster then something else is happening.
On 21/01/13 10:30, Tim Uckun wrote: >> Can you try a couple of things just to check timings. Probably worth EXPLAIN >> ANALYSE. >> >> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id = >> md.id; > > Takes about 300 ms > >> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md >> ON i.model_id = md.id; > Takes about 300 ms OK - so writing all the data takes very under one second but updating the same amount takes 50 seconds. The only differences I can think of are WAL logging (transaction log) and index updates (the temp table has no indexes). 1. Try CREATE TABLE rather than CREATE TEMP TABLE - if that's still quick then it's not the time taken to write WAL. 2. Run the update query against your new tt table and see how long that takes. 3. Add indexes and repeat (in particular I'd be suspicious of the gin index on "data") My guess is that it's the time taken to update the "data" index - gin indexes can be slow to rebuild (although 50 seconds seems *very* slow). If so there are a few options: 1. Split the table and put whatever this "data" is into an import_data table - assuming it doesn't change often. 2. Try a fill-factor of 50% or less - keeping the updates on the same data page as the original might help 3. Drop the gin index before doing your bulk update and rebuild it at the end. This is a common approach with bulk-loading / updates. Oh - I'm assuming you're only updating those rows whose id has changed - that seemed to be the suggestion in your first message. If not, simply adding "AND make_id <> md.make_id" should help. Also (and you may well have considered this) - for a normalised setup you'd just have the model-id in "imports" and look up the make-id through the "models" table. -- Richard Huxton Archonet Ltd
Richard Huxton wrote: > The only differences I can think of are WAL logging (transaction > log) and index updates (the temp table has no indexes). What about foreign keys? Are there any tables which reference the updated column in a foreign key declaration? Do they have indexes on that column? -Kevin
Just to close this up and give some guidance to future googlers... There are no foreign keys. The table is empty when I start. I copy the data into it from a text file. Removing indexes didn't help much (made a very slight difference). running a query CREATE TEMP TABLE tt AS SELECT .... using a massive join takes about 8 seconds. I presume that's the baseline for the disk and RAM given my current postgres configuration. Note that this is not a satisfactory option for me because I can't do what I want in one step (the update I specified is one of many). running a very simple update "UPDATE imports set make_id = null" takes over 50 seconds so that's the minimum amount of time any update is going to take. Running a complex update where I join all the tables together and update all the fields takes about 106 seconds. Just running a complex select with the joins takes no time at all. I tried chunking the updates using chunks of 100 records and 1000 records (where ID between X and Y repeatedly) and it was even slower. Conclusion. Updates on postgres are slow (given the default postgresql.conf). I presume this is due to MVCC or the WAL or something and there are probably some things I can do to tweak the conf file to make them go faster but out of the box running an update on a table with lots of rows is going to cost you a lot. Removing the indexes doesn't help that much. Suggestion for the PG team. Deliver a more realistic postgres.conf by default. The default one seems to be aimed at ten year old PCs with very little RAM and disk space. At least deliver additional conf files for small, medium, large, huge setups.
On 21/01/13 20:09, Tim Uckun wrote: > Just to close this up and give some guidance to future googlers... Careful, future googlers. > Conclusion. Updates on postgres are slow Nope. > (given the default > postgresql.conf). I presume this is due to MVCC or the WAL or > something and there are probably some things I can do to tweak the > conf file to make them go faster but out of the box running an update > on a table with lots of rows is going to cost you a lot. Unlikely. Do you really think that a PostgreSQL installation typically runs 100 times slower on updates than inserts and every other user has just said "oh, that's ok then"? Or is it more likely that something peculiar is broken on your setup. > Removing the indexes doesn't help that much. > > Suggestion for the PG team. Deliver a more realistic postgres.conf by > default. The default one seems to be aimed at ten year old PCs with > very little RAM and disk space. At least deliver additional conf files > for small, medium, large, huge setups. -- Richard Huxton
On Mon, Jan 21, 2013 at 1:23 PM, Richard Huxton <dev@archonet.com> wrote: > On 21/01/13 20:09, Tim Uckun wrote: >> >> Just to close this up and give some guidance to future googlers... > > Careful, future googlers. > >> Conclusion. Updates on postgres are slow > > Nope. > > >> (given the default >> postgresql.conf). I presume this is due to MVCC or the WAL or >> something and there are probably some things I can do to tweak the >> conf file to make them go faster but out of the box running an update >> on a table with lots of rows is going to cost you a lot. > > Unlikely. Do you really think that a PostgreSQL installation typically runs > 100 times slower on updates than inserts and every other user has just said > "oh, that's ok then"? Or is it more likely that something peculiar is broken > on your setup. > > >> Removing the indexes doesn't help that much. >> >> Suggestion for the PG team. Deliver a more realistic postgres.conf by >> default. The default one seems to be aimed at ten year old PCs with >> very little RAM and disk space. At least deliver additional conf files >> for small, medium, large, huge setups. I'd be curious to see results of the same "update" on a standard HDD vs the SSD, and maybe on a more typical database deployment hardware vs a macbook air. --patrick
> > Nope. If you have any suggestions I am all ears. For the purposes of this discussion we can narrow down the problem this update statement. Update imports set make_id = null. There are 98K records in the table. There is no index on the make_id field. Standard untouched postgresql.conf from the brew install of postgres. > Unlikely. Do you really think that a PostgreSQL installation typically runs > 100 times slower on updates than inserts and every other user has just said > "oh, that's ok then"? Or is it more likely that something peculiar is broken > on your setup. I really don't know. That's why I am here asking. I don't think anything particular is broken with my system. As mentioned above the setup is really simple. Standard postgres install, the default conf file, update one field on one table. It takes fifty plus seconds. I concede that if I was to go into the postgres.conf and make some changes it will probably run faster (maybe much faster) but I wanted to exhaust other factors before I went messing with the default install.
> I'd be curious to see results of the same "update" on a standard HDD > vs the SSD, and maybe on a more typical database deployment hardware > vs a macbook air. > I haven't tried it on any other machine yet. CREATE TEMP TABLE tt as SELECT ... takes eight seconds so presumably the disk is not the choke point.
Richard Huxton wrote: > On 21/01/13 20:09, Tim Uckun wrote: >> Just to close this up and give some guidance to future >> googlers... > Careful, future googlers. +1 >> Conclusion. Updates on postgres are slow > Nope. Agreed. >> (given the default postgresql.conf). I presume this is due to >> MVCC or the WAL or something and there are probably some things >> I can do to tweak the conf file to make them go faster but out >> of the box running an update on a table with lots of rows is >> going to cost you a lot. > Unlikely. Do you really think that a PostgreSQL installation > typically runs 100 times slower on updates than inserts and every > other user has just said "oh, that's ok then"? Or is it more > likely that something peculiar is broken on your setup. As someone who has managed hundreds of databases ranging up to over 3TB without seeing this without some very specific cause, I agree that there is something wonky on Tim's setup which he hasn't told us about. Then again, I'm not sure we've pushed hard enough for the relevant details. Tim, if you're still interested in resolving this, please post the results from running the SQL code on this page: http://wiki.postgresql.org/wiki/Server_Configuration It might be worthwhile to read through this page: http://wiki.postgresql.org/wiki/Slow_Query_Questions ... and try some of the ideas there. Base disk perfromance numbers would help put the results in perspective. The cause could be anything from table bloat due to inadequate vacuuming to hardware problems. -Kevin
Tim Uckun wrote: > If you have any suggestions I am all ears. For the purposes of this > discussion we can narrow down the problem this update statement. > > Update imports set make_id = null. Well, that simplifies things. First off, what does it say for rows affected? (Hint, if you really are using a default configuration and it doesn't say 0 rows affected, please show us the actual query used.) Second, try connecting to the database as a superuser and running: VACUUM ANALYZE imports; -- (show us the results) VACUUM FULL imports; VACUUM FREEZE ANALYZE; -- (don't specify a table) Then try your query and see whether performance is any different. -Kevin
On Mon, Jan 21, 2013 at 1:48 PM, Tim Uckun <timuckun@gmail.com> wrote: >> I'd be curious to see results of the same "update" on a standard HDD >> vs the SSD, and maybe on a more typical database deployment hardware >> vs a macbook air. >> > > > I haven't tried it on any other machine yet. CREATE TEMP TABLE tt as > SELECT ... takes eight seconds so presumably the disk is not the > choke point. you are making an assumption that a fresh write is the same as a re-write. try the test. --patrick
Kevin Grittner wrote: > First off, what does it say for rows affected? (Hint, if you really > are using a default configuration and it doesn't say 0 rows > affected, please show us the actual query used.) Never mind that bit -- I got myself confused. Sorry for the noise. -Kevin
> First off, what does it say for rows affected? (Hint, if you really > are using a default configuration and it doesn't say 0 rows > affected, please show us the actual query used.) update imports set make_id = null Query returned successfully: 98834 rows affected, 49673 ms execution time. vacuum analyze imports Query returned successfully with no result in 4138 ms. VACUUM FULL imports; Query returned successfully with no result in 38106 ms. VACUUM FREEZE ANALYZE; Query returned successfully with no result in 184635 ms update imports set make_id = 0 Query returned successfully: 98834 rows affected, 45860 ms execution time. So all the vacuuming saved about four seconds of execution time. here is the postgresql.conf completely untouched from the default install https://gist.github.com/4590590
Oh I forgot SELECT version(); "PostgreSQL 9.2.2 on x86_64-apple-darwin12.2.1, compiled by Apple clang version 4.1 (tags/Apple/clang-421.11.65) (based on LLVM 3.1svn), 64-bit" SELECT name, current_setting(name), source FROM pg_settings WHERE source NOT IN ('default', 'override'); "application_name";"pgAdmin III - Query Tool";"client" "bytea_output";"escape";"session" "client_encoding";"UNICODE";"session" "client_min_messages";"notice";"session" "DateStyle";"ISO, DMY";"session" "default_text_search_config";"pg_catalog.english";"configuration file" "lc_messages";"en_NZ.UTF-8";"configuration file" "lc_monetary";"en_NZ.UTF-8";"configuration file" "lc_numeric";"en_NZ.UTF-8";"configuration file" "lc_time";"en_NZ.UTF-8";"configuration file" "log_timezone";"NZ";"configuration file" "max_connections";"20";"configuration file" "max_stack_depth";"2MB";"environment variable" "search_path";"chrysler, public";"session" "shared_buffers";"1600kB";"configuration file" "TimeZone";"NZ";"configuration file"
On 01/21/2013 03:45 PM, Tim Uckun wrote: > Oh I forgot > ... > "shared_buffers";"1600kB";"configuration file" You *reduced* shared buffers to 1.6MB? IIRC the typical default is 32MB and the most common adjustment is to *increase* shared buffers. Most of my servers are set to 2GB. Try bumping that up to a reasonable value (http://www.postgresql.org/docs/9.2/static/runtime-config-resource.html) and share the results. Don't forget to restart PG after changing that setting. Cheers, Steve
On 01/21/2013 04:15 PM, Steve Crawford wrote: > On 01/21/2013 03:45 PM, Tim Uckun wrote: >> Oh I forgot >> ... Me, too. I forgot to ask for the table definition. If there are variable-length fields like "text" or "varchar", what is the typical size of the data. Also, what is the physical size of the table (\dt+ yourtable)? Perhaps even the output of select * from pg_stat_user_tables where relname='yourtable'; might be useful. Cheers, Steve
I already posted the schema earlier. It's a handful of integer fields with one hstore field. On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford <scrawford@pinpointresearch.com> wrote: > select * from pg_stat_user_tables where relname='yourtable'; Messy output "relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count" 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24
Kevin Grittner wrote: > update imports set make_id = 0 > > Query returned successfully: 98834 rows affected, 45860 ms execution time. For difficult problems, there is nothing like a self-contained test case, that someone else can run to see the issue. Here's a starting point: create extension if not exists hstore; create schema cars; drop table if exists cars.imports; CREATE TABLE cars.imports ( id serial NOT NULL, target_id integer, batch_id integer, make_id integer, model_id integer, date timestamp without time zone, division_id integer, dealer_id integer, data hstore, created_at timestamp without time zone NOT NULL, updated_at timestamp without time zone NOT NULL, CONSTRAINT imports_pkey PRIMARY KEY (id) ); CREATE INDEX index_imports_on_data ON cars.imports USING gin (data); CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id); CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id); CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id); CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id); insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x; vacuum freeze analyze; \timing on update cars.imports set make_id = 0; vacuum analyze; update cars.imports set make_id = 0; ... and here's what I get when I run it on my desktop computer with ordinary disk drives and a completely default configuration: test=# create extension if not exists hstore; CREATE EXTENSION Time: 48.032 ms test=# create schema cars; CREATE SCHEMA Time: 8.150 ms test=# drop table if exists cars.imports; NOTICE: table "imports" does not exist, skipping DROP TABLE Time: 0.205 ms test=# CREATE TABLE cars.imports test-# ( test(# id serial NOT NULL, test(# target_id integer, test(# batch_id integer, test(# make_id integer, test(# model_id integer, test(# date timestamp without time zone, test(# division_id integer, test(# dealer_id integer, test(# data hstore, test(# created_at timestamp without time zone NOT NULL, test(# updated_at timestamp without time zone NOT NULL, test(# CONSTRAINT imports_pkey PRIMARY KEY (id) test(# ); NOTICE: CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports" CREATE TABLE Time: 152.677 ms test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data); CREATE INDEX Time: 6.391 ms test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id); CREATE INDEX Time: 64.668 ms test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id); CREATE INDEX Time: 65.573 ms test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id); CREATE INDEX Time: 64.959 ms test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id); CREATE INDEX Time: 64.906 ms test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x; INSERT 0 100000 Time: 2516.559 ms test=# vacuum freeze analyze; VACUUM Time: 3357.778 ms test=# \timing on Timing is on. test=# update cars.imports set make_id = 0; UPDATE 100000 Time: 2937.241 ms test=# vacuum analyze; VACUUM Time: 2097.426 ms test=# update cars.imports set make_id = 0; UPDATE 100000 Time: 3935.939 ms Ubuntu 12.10 i7-3770 CPU @ 3.40GHz with 16GB RAM Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1. PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit test=# SELECT name, current_setting(name), source test-# FROM pg_settings test-# WHERE source NOT IN ('default', 'override'); name | current_setting | source ----------------------------+--------------------+---------------------- application_name | psql | client client_encoding | UTF8 | client DateStyle | ISO, MDY | configuration file default_text_search_config | pg_catalog.english | configuration file lc_messages | en_US.UTF-8 | configuration file lc_monetary | en_US.UTF-8 | configuration file lc_numeric | en_US.UTF-8 | configuration file lc_time | en_US.UTF-8 | configuration file log_timezone | US/Central | configuration file max_connections | 100 | configuration file max_stack_depth | 2MB | environment variable shared_buffers | 32MB | configuration file TimeZone | US/Central | configuration file (13 rows) I did absolutely no tuning from the default configuration. So, what timings do you get if you run the identical script? Is there something you can do to the above script (maybe in terms of populating data) which will cause the performance you see? -Kevin
On Monday, January 21, 2013, Tim Uckun wrote:
> First off, what does it say for rows affected? (Hint, if you really
> are using a default configuration and it doesn't say 0 rows
> affected, please show us the actual query used.)
update imports set make_id = null
Query returned successfully: 98834 rows affected, 49673 ms execution time.
vacuum analyze imports
Query returned successfully with no result in 4138 ms.
VACUUM FULL imports;
What if you do:
alter table cars.imports set (fillfactor=50);
Before the vacuum full, and then try the update again?
Cheers,
Jeff
On Monday, January 21, 2013, Tim Uckun wrote:
I already posted the schema earlier. It's a handful of integer fields
with one hstore field.
one hstore field can easily be equivalent to 50 text fields with an index on each one.
I'm pretty sure that that is your bottleneck.
what does \di+ show?
Cheers,
Jeff
On 01/21/2013 05:02 PM, Tim Uckun wrote: > I already posted the schema earlier. It's a handful of integer fields > with one hstore field. Oh well. I can't find it but maybe it got lost in shipping or eaten by a spam filter. > > > On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford > <scrawford@pinpointresearch.com> wrote: >> select * from pg_stat_user_tables where relname='yourtable'; > > > Messy output Don't know if you are in psql but if so, expanded display works for this. I.e.: steve@[local] => \x Expanded display is on. steve@[local] => select * from pg_stat_user_tables where relname='footest'; -[ RECORD 1 ]-----+------------------------------ relid | 781691 schemaname | public relname | footest seq_scan | 3 seq_tup_read | 609 idx_scan | idx_tup_fetch | n_tup_ins | 609 n_tup_upd | 0 n_tup_del | 0 n_tup_hot_upd | 0 n_live_tup | 301 n_dead_tup | 0 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | 2012-12-19 08:42:23.347368-08 vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 2 > > "relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count" > 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22 > 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22 > 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24 > > So at least autovacuum is running (and some manual vacuum and analyze as well). Cheers, Steve
On 01/21/2013 06:21 PM, Kevin Grittner wrote: > Kevin Grittner wrote: > ... > shared_buffers | 32MB | configuration file > ... > > I did absolutely no tuning from the default configuration. > But Tim has his shared_buffers set to 1600kB ("shared_buffers";"1600kB";"configuration file") or roughly 1/20 of the typical default value, which is a very low starting value anyway, on a machine populated with 8GB RAM. I'd like to see how it runs with a more reasonable shared_buffers setting. At a very minimum the 32MB default. Cheers, Steve
Steve Crawford escribió: > On 01/21/2013 05:02 PM, Tim Uckun wrote: > >I already posted the schema earlier. It's a handful of integer fields > >with one hstore field. > > Oh well. I can't find it but maybe it got lost in shipping or eaten > by a spam filter. This is what we have the archives are for: http://www.postgresql.org/message-id/flat/50FEDF66.7050100@pinpointresearch.com#50FEDF66.7050100@pinpointresearch.com The original message is at the top of the page (obviously). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 01/22/2013 10:59 AM, Alvaro Herrera wrote: > Steve Crawford escribió: >> On 01/21/2013 05:02 PM, Tim Uckun wrote: >>> I already posted the schema earlier. It's a handful of integer fields >>> with one hstore field. >> Oh well. I can't find it but maybe it got lost in shipping or eaten >> by a spam filter. > This is what we have the archives are for: > > http://www.postgresql.org/message-id/flat/50FEDF66.7050100@pinpointresearch.com#50FEDF66.7050100@pinpointresearch.com > > The original message is at the top of the page (obviously). > Didn't notice that the information was over on the github site (which, of course, also makes it impossible to search for in my email and unavailable to the mail archives for those wishing to view it in the future). Cheers, Steve
Jeff Janes wrote: > one hstore field can easily be equivalent to 50 text fields with > an index on each one. > > I'm pretty sure that that is your bottleneck. I agree that seems like the most likely cause. Each update to the row holding the hstore column requires adding new index entries for all the hstore elements, and autovacuum will need to clean up the old ones in the background. The best solution would be to either normalize the data instead of using hstore, or move the hstore to a separate table which is referenced by some sort of ID from the frequently-updated table. -Kevin
Sorry I haven't been responsive for a little while. I ran your script but creating a new schema instead of my existing schema. My timings were similar to yours (more or less) except fo the vacuums which took roughly 147891 ms. On Tue, Jan 22, 2013 at 3:21 PM, Kevin Grittner <kgrittn@mail.com> wrote: > Kevin Grittner wrote: > >> update imports set make_id = 0 >> >> Query returned successfully: 98834 rows affected, 45860 ms execution time. > > For difficult problems, there is nothing like a self-contained test > case, that someone else can run to see the issue. Here's a starting > point: > > create extension if not exists hstore; > create schema cars; > drop table if exists cars.imports; > CREATE TABLE cars.imports > ( > id serial NOT NULL, > target_id integer, > batch_id integer, > make_id integer, > model_id integer, > date timestamp without time zone, > division_id integer, > dealer_id integer, > data hstore, > created_at timestamp without time zone NOT NULL, > updated_at timestamp without time zone NOT NULL, > CONSTRAINT imports_pkey PRIMARY KEY (id) > ); > CREATE INDEX index_imports_on_data ON cars.imports USING gin (data); > CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id); > CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id); > CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id); > CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id); > insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x; > vacuum freeze analyze; > \timing on > update cars.imports set make_id = 0; > vacuum analyze; > update cars.imports set make_id = 0; > > ... and here's what I get when I run it on my desktop computer with > ordinary disk drives and a completely default configuration: > > test=# create extension if not exists hstore; > CREATE EXTENSION > Time: 48.032 ms > test=# create schema cars; > CREATE SCHEMA > Time: 8.150 ms > test=# drop table if exists cars.imports; > NOTICE: table "imports" does not exist, skipping > DROP TABLE > Time: 0.205 ms > test=# CREATE TABLE cars.imports > test-# ( > test(# id serial NOT NULL, > test(# target_id integer, > test(# batch_id integer, > test(# make_id integer, > test(# model_id integer, > test(# date timestamp without time zone, > test(# division_id integer, > test(# dealer_id integer, > test(# data hstore, > test(# created_at timestamp without time zone NOT NULL, > test(# updated_at timestamp without time zone NOT NULL, > test(# CONSTRAINT imports_pkey PRIMARY KEY (id) > test(# ); > NOTICE: CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id" > NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports" > CREATE TABLE > Time: 152.677 ms > test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data); > CREATE INDEX > Time: 6.391 ms > test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id); > CREATE INDEX > Time: 64.668 ms > test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id); > CREATE INDEX > Time: 65.573 ms > test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id); > CREATE INDEX > Time: 64.959 ms > test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id); > CREATE INDEX > Time: 64.906 ms > test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x; > INSERT 0 100000 > Time: 2516.559 ms > test=# vacuum freeze analyze; > VACUUM > Time: 3357.778 ms > test=# \timing on > Timing is on. > test=# update cars.imports set make_id = 0; > UPDATE 100000 > Time: 2937.241 ms > test=# vacuum analyze; > VACUUM > Time: 2097.426 ms > test=# update cars.imports set make_id = 0; > UPDATE 100000 > Time: 3935.939 ms > > Ubuntu 12.10 > i7-3770 CPU @ 3.40GHz with 16GB RAM > Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux > A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1. > > PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit > > test=# SELECT name, current_setting(name), source > test-# FROM pg_settings > test-# WHERE source NOT IN ('default', 'override'); > name | current_setting | source > ----------------------------+--------------------+---------------------- > application_name | psql | client > client_encoding | UTF8 | client > DateStyle | ISO, MDY | configuration file > default_text_search_config | pg_catalog.english | configuration file > lc_messages | en_US.UTF-8 | configuration file > lc_monetary | en_US.UTF-8 | configuration file > lc_numeric | en_US.UTF-8 | configuration file > lc_time | en_US.UTF-8 | configuration file > log_timezone | US/Central | configuration file > max_connections | 100 | configuration file > max_stack_depth | 2MB | environment variable > shared_buffers | 32MB | configuration file > TimeZone | US/Central | configuration file > (13 rows) > > I did absolutely no tuning from the default configuration. > > So, what timings do you get if you run the identical script? Is > there something you can do to the above script (maybe in terms of > populating data) which will cause the performance you see? > > -Kevin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> > What if you do: > alter table cars.imports set (fillfactor=50); > Before the vacuum full, and then try the update again? This makes a dramatic difference when combined with a vacuum. UPDATE 98834 Time: 3408.210 ms Ten times faster! -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
relid | 26710 schemaname | cars relname | imports seq_scan | 280 seq_tup_read | 25873543 idx_scan | 4 idx_tup_fetch | 2749 n_tup_ins | 98926 n_tup_upd | 6350466 n_tup_del | 92 n_tup_hot_upd | 625286 n_live_tup | 98834 n_dead_tup | 0 last_vacuum | 2013-01-25 21:55:36.078614+13 last_autovacuum | 2013-01-25 21:58:40.850546+13 last_analyze | 2013-01-25 21:55:36.305967+13 last_autoanalyze | 2013-01-25 21:51:54.307639+13 vacuum_count | 6 autovacuum_count | 32 analyze_count | 6 autoanalyze_count | 25 On Wed, Jan 23, 2013 at 7:50 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote: > On 01/21/2013 05:02 PM, Tim Uckun wrote: >> >> I already posted the schema earlier. It's a handful of integer fields >> with one hstore field. > > > Oh well. I can't find it but maybe it got lost in shipping or eaten by a > spam filter. > > >> >> >> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford >> <scrawford@pinpointresearch.com> wrote: >>> >>> select * from pg_stat_user_tables where relname='yourtable'; >> >> >> >> Messy output > > Don't know if you are in psql but if so, expanded display works for this. > I.e.: > steve@[local] => \x > Expanded display is on. > steve@[local] => select * from pg_stat_user_tables where relname='footest'; > -[ RECORD 1 ]-----+------------------------------ > relid | 781691 > schemaname | public > relname | footest > seq_scan | 3 > seq_tup_read | 609 > idx_scan | > idx_tup_fetch | > n_tup_ins | 609 > n_tup_upd | 0 > n_tup_del | 0 > n_tup_hot_upd | 0 > n_live_tup | 301 > n_dead_tup | 0 > last_vacuum | > last_autovacuum | > last_analyze | > last_autoanalyze | 2012-12-19 08:42:23.347368-08 > vacuum_count | 0 > autovacuum_count | 0 > analyze_count | 0 > autoanalyze_count | 2 > > > >> >> >> "relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count" >> >> 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22 >> 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22 >> 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24 >> >> > So at least autovacuum is running (and some manual vacuum and analyze as > well). > > Cheers, > Steve > -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> I agree that seems like the most likely cause. Each update to the > row holding the hstore column requires adding new index entries for > all the hstore elements, and autovacuum will need to clean up the > old ones in the background. The best solution would be to either > normalize the data instead of using hstore, or move the hstore to a > separate table which is referenced by some sort of ID from the > frequently-updated table. That's very interesting. I can certainly split up the table, no big deal there. So would the index be redone even if I am not updating the hstore field itself? -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 25/01/13 08:57, Tim Uckun wrote: >> What if you do: >> alter table cars.imports set (fillfactor=50); >> Before the vacuum full, and then try the update again? > > This makes a dramatic difference when combined with a vacuum. > > UPDATE 98834 > Time: 3408.210 ms > > Ten times faster! That suggests (to me, at least) that it is related to index updating. Again, your GIN index seems primary candidate. A fillfactor of 50% means row updates probably stay on the same disk-block as their previous version. This implies less index updates. Try running iostat (I think that's available on a Mac) with/without the fillfactor and with/without the GIN index while you do the updates. It's possible your SSD is just behaving oddly under stress. -- Richard Huxton Archonet Ltd -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> > That suggests (to me, at least) that it is related to index updating. Again, > your GIN index seems primary candidate. > > Try running iostat (I think that's available on a Mac) with/without the > fillfactor and with/without the GIN index while you do the updates. It's > possible your SSD is just behaving oddly under stress. > I dropped the index and the numbers shot up tenfold or more. I don't know why postgres feels the need to update the GIN index on the hstore field when I am only updating an integer field but it looks like I need to split the hstore into a different table. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 25/01/13 11:38, Tim Uckun wrote: >> That suggests (to me, at least) that it is related to index updating. Again, >> your GIN index seems primary candidate. >> >> Try running iostat (I think that's available on a Mac) with/without the >> fillfactor and with/without the GIN index while you do the updates. It's >> possible your SSD is just behaving oddly under stress. >> > > I dropped the index and the numbers shot up tenfold or more. I don't > know why postgres feels the need to update the GIN index on the hstore > field when I am only updating an integer field but it looks like I > need to split the hstore into a different table. If the row moves to a different block, then it has no choice. The old index entry will point to an invalid block. There are some optimisations (HOT - http://pgsql.tapoueh.org/site/html/misc/hot.html) but that relies on (iirc) the update staying on the same block and also not updating any indexed fields (and you were, I think). A GIN index is very expensive to update compared to btree too. -- Richard Huxton Archonet Ltd -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Tim Uckun wrote: > I dropped the index and the numbers shot up tenfold or more. I don't > know why postgres feels the need to update the GIN index on the hstore > field when I am only updating an integer field but it looks like I > need to split the hstore into a different table. Every UPDATE that is not HOT will create a row version with a new "row id". That means that all indexes referencing that row will have to get updated. That is consistent with better performance with low fillfactor (which makes HOT more likely). Yours, Laurenz Albe -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Fri, Jan 25, 2013 at 3:38 AM, Tim Uckun <timuckun@gmail.com> wrote: >> >> That suggests (to me, at least) that it is related to index updating. Again, >> your GIN index seems primary candidate. >> >> Try running iostat (I think that's available on a Mac) with/without the >> fillfactor and with/without the GIN index while you do the updates. It's >> possible your SSD is just behaving oddly under stress. >> > > > I dropped the index and the numbers shot up tenfold or more. I don't > know why postgres feels the need to update the GIN index on the hstore > field when I am only updating an integer field When the row gets updated, it might move to some place else. An index maps data values to row locations. So if the location changes, all indexes need to be updated, even if the data value for that index did not change. (Well, I shouldn't say they *need* to change. The database could have been designed, with considerable difficulty and consequences, to leave behind permanent redirect pointers to the new location. But it wasn't) There is a mechanism called HOT update (Heap-Only Tuple) which can prevent this under certain conditions. 1) Either none of the fields being updated are indexed, or any that are both updated and indexed are updated to the value they already have. 2) There is room for a new copy of the tuple on the same page as the old one. lowering the fillfactor helps with requirement 2, especially since your tuples are probably wide (because of the hstore column) and so not many fit on a page. Note that if you update a field to have the same value as it already does, it still makes a new copy of the entire tuple anyway. (It detects that the :old = :new for HOT-eligibility purposes if the field is indexed, but not for suppression of copying purposes. And if the tuple needs to be copied but there is no room on that page, then it isn't eligible for HOT after all). So you should add a where clause to the UPDATE to filter out things that are unchanged. > but it looks like I > need to split the hstore into a different table. That would be one solution, but I think a better one would be to not store "make_id" in "imports" in the first place, but instead to always fetch it by joining "imports" to "models" at query time. Cheers, Jeff
> > That would be one solution, but I think a better one would be to not > store "make_id" in "imports" in the first place, but instead to always > fetch it by joining "imports" to "models" at query time. > My problem here is that the incoming data is quite messy so the join conditions become weird (lots of ORs and such). A multi pass approach seems to work better.
> I agree that seems like the most likely cause. Each update to the > row holding the hstore column requires adding new index entries for > all the hstore elements, and autovacuum will need to clean up the > old ones in the background. The best solution would be to either > normalize the data instead of using hstore, or move the hstore to a > separate table which is referenced by some sort of ID from the > frequently-updated table. That's very interesting. I can certainly split up the table, no big deal there. So would the index be redone even if I am not updating the hstore field itself? -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Sorry I haven't been responsive for a little while. I ran your script but creating a new schema instead of my existing schema. My timings were similar to yours (more or less) except fo the vacuums which took roughly 147891 ms. On Tue, Jan 22, 2013 at 3:21 PM, Kevin Grittner <kgrittn@mail.com> wrote: > Kevin Grittner wrote: > >> update imports set make_id = 0 >> >> Query returned successfully: 98834 rows affected, 45860 ms execution time. > > For difficult problems, there is nothing like a self-contained test > case, that someone else can run to see the issue. Here's a starting > point: > > create extension if not exists hstore; > create schema cars; > drop table if exists cars.imports; > CREATE TABLE cars.imports > ( > id serial NOT NULL, > target_id integer, > batch_id integer, > make_id integer, > model_id integer, > date timestamp without time zone, > division_id integer, > dealer_id integer, > data hstore, > created_at timestamp without time zone NOT NULL, > updated_at timestamp without time zone NOT NULL, > CONSTRAINT imports_pkey PRIMARY KEY (id) > ); > CREATE INDEX index_imports_on_data ON cars.imports USING gin (data); > CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id); > CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id); > CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id); > CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id); > insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x; > vacuum freeze analyze; > \timing on > update cars.imports set make_id = 0; > vacuum analyze; > update cars.imports set make_id = 0; > > ... and here's what I get when I run it on my desktop computer with > ordinary disk drives and a completely default configuration: > > test=# create extension if not exists hstore; > CREATE EXTENSION > Time: 48.032 ms > test=# create schema cars; > CREATE SCHEMA > Time: 8.150 ms > test=# drop table if exists cars.imports; > NOTICE: table "imports" does not exist, skipping > DROP TABLE > Time: 0.205 ms > test=# CREATE TABLE cars.imports > test-# ( > test(# id serial NOT NULL, > test(# target_id integer, > test(# batch_id integer, > test(# make_id integer, > test(# model_id integer, > test(# date timestamp without time zone, > test(# division_id integer, > test(# dealer_id integer, > test(# data hstore, > test(# created_at timestamp without time zone NOT NULL, > test(# updated_at timestamp without time zone NOT NULL, > test(# CONSTRAINT imports_pkey PRIMARY KEY (id) > test(# ); > NOTICE: CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id" > NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports" > CREATE TABLE > Time: 152.677 ms > test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data); > CREATE INDEX > Time: 6.391 ms > test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id); > CREATE INDEX > Time: 64.668 ms > test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id); > CREATE INDEX > Time: 65.573 ms > test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id); > CREATE INDEX > Time: 64.959 ms > test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id); > CREATE INDEX > Time: 64.906 ms > test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x; > INSERT 0 100000 > Time: 2516.559 ms > test=# vacuum freeze analyze; > VACUUM > Time: 3357.778 ms > test=# \timing on > Timing is on. > test=# update cars.imports set make_id = 0; > UPDATE 100000 > Time: 2937.241 ms > test=# vacuum analyze; > VACUUM > Time: 2097.426 ms > test=# update cars.imports set make_id = 0; > UPDATE 100000 > Time: 3935.939 ms > > Ubuntu 12.10 > i7-3770 CPU @ 3.40GHz with 16GB RAM > Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux > A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1. > > PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit > > test=# SELECT name, current_setting(name), source > test-# FROM pg_settings > test-# WHERE source NOT IN ('default', 'override'); > name | current_setting | source > ----------------------------+--------------------+---------------------- > application_name | psql | client > client_encoding | UTF8 | client > DateStyle | ISO, MDY | configuration file > default_text_search_config | pg_catalog.english | configuration file > lc_messages | en_US.UTF-8 | configuration file > lc_monetary | en_US.UTF-8 | configuration file > lc_numeric | en_US.UTF-8 | configuration file > lc_time | en_US.UTF-8 | configuration file > log_timezone | US/Central | configuration file > max_connections | 100 | configuration file > max_stack_depth | 2MB | environment variable > shared_buffers | 32MB | configuration file > TimeZone | US/Central | configuration file > (13 rows) > > I did absolutely no tuning from the default configuration. > > So, what timings do you get if you run the identical script? Is > there something you can do to the above script (maybe in terms of > populating data) which will cause the performance you see? > > -Kevin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 25/01/13 08:57, Tim Uckun wrote: >> What if you do: >> alter table cars.imports set (fillfactor=50); >> Before the vacuum full, and then try the update again? > > This makes a dramatic difference when combined with a vacuum. > > UPDATE 98834 > Time: 3408.210 ms > > Ten times faster! That suggests (to me, at least) that it is related to index updating. Again, your GIN index seems primary candidate. A fillfactor of 50% means row updates probably stay on the same disk-block as their previous version. This implies less index updates. Try running iostat (I think that's available on a Mac) with/without the fillfactor and with/without the GIN index while you do the updates. It's possible your SSD is just behaving oddly under stress. -- Richard Huxton Archonet Ltd -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> > That suggests (to me, at least) that it is related to index updating. Again, > your GIN index seems primary candidate. > > Try running iostat (I think that's available on a Mac) with/without the > fillfactor and with/without the GIN index while you do the updates. It's > possible your SSD is just behaving oddly under stress. > I dropped the index and the numbers shot up tenfold or more. I don't know why postgres feels the need to update the GIN index on the hstore field when I am only updating an integer field but it looks like I need to split the hstore into a different table. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> > What if you do: > alter table cars.imports set (fillfactor=50); > Before the vacuum full, and then try the update again? This makes a dramatic difference when combined with a vacuum. UPDATE 98834 Time: 3408.210 ms Ten times faster! -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Tim Uckun wrote: > I dropped the index and the numbers shot up tenfold or more. I don't > know why postgres feels the need to update the GIN index on the hstore > field when I am only updating an integer field but it looks like I > need to split the hstore into a different table. Every UPDATE that is not HOT will create a row version with a new "row id". That means that all indexes referencing that row will have to get updated. That is consistent with better performance with low fillfactor (which makes HOT more likely). Yours, Laurenz Albe -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 25/01/13 11:38, Tim Uckun wrote: >> That suggests (to me, at least) that it is related to index updating. Again, >> your GIN index seems primary candidate. >> >> Try running iostat (I think that's available on a Mac) with/without the >> fillfactor and with/without the GIN index while you do the updates. It's >> possible your SSD is just behaving oddly under stress. >> > > I dropped the index and the numbers shot up tenfold or more. I don't > know why postgres feels the need to update the GIN index on the hstore > field when I am only updating an integer field but it looks like I > need to split the hstore into a different table. If the row moves to a different block, then it has no choice. The old index entry will point to an invalid block. There are some optimisations (HOT - http://pgsql.tapoueh.org/site/html/misc/hot.html) but that relies on (iirc) the update staying on the same block and also not updating any indexed fields (and you were, I think). A GIN index is very expensive to update compared to btree too. -- Richard Huxton Archonet Ltd -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
relid | 26710 schemaname | cars relname | imports seq_scan | 280 seq_tup_read | 25873543 idx_scan | 4 idx_tup_fetch | 2749 n_tup_ins | 98926 n_tup_upd | 6350466 n_tup_del | 92 n_tup_hot_upd | 625286 n_live_tup | 98834 n_dead_tup | 0 last_vacuum | 2013-01-25 21:55:36.078614+13 last_autovacuum | 2013-01-25 21:58:40.850546+13 last_analyze | 2013-01-25 21:55:36.305967+13 last_autoanalyze | 2013-01-25 21:51:54.307639+13 vacuum_count | 6 autovacuum_count | 32 analyze_count | 6 autoanalyze_count | 25 On Wed, Jan 23, 2013 at 7:50 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote: > On 01/21/2013 05:02 PM, Tim Uckun wrote: >> >> I already posted the schema earlier. It's a handful of integer fields >> with one hstore field. > > > Oh well. I can't find it but maybe it got lost in shipping or eaten by a > spam filter. > > >> >> >> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford >> <scrawford@pinpointresearch.com> wrote: >>> >>> select * from pg_stat_user_tables where relname='yourtable'; >> >> >> >> Messy output > > Don't know if you are in psql but if so, expanded display works for this. > I.e.: > steve@[local] => \x > Expanded display is on. > steve@[local] => select * from pg_stat_user_tables where relname='footest'; > -[ RECORD 1 ]-----+------------------------------ > relid | 781691 > schemaname | public > relname | footest > seq_scan | 3 > seq_tup_read | 609 > idx_scan | > idx_tup_fetch | > n_tup_ins | 609 > n_tup_upd | 0 > n_tup_del | 0 > n_tup_hot_upd | 0 > n_live_tup | 301 > n_dead_tup | 0 > last_vacuum | > last_autovacuum | > last_analyze | > last_autoanalyze | 2012-12-19 08:42:23.347368-08 > vacuum_count | 0 > autovacuum_count | 0 > analyze_count | 0 > autoanalyze_count | 2 > > > >> >> >> "relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count" >> >> 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22 >> 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22 >> 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24 >> >> > So at least autovacuum is running (and some manual vacuum and analyze as > well). > > Cheers, > Steve > -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 2013-01-25, Tim Uckun <timuckun@gmail.com> wrote: >> I agree that seems like the most likely cause. Each update to the >> row holding the hstore column requires adding new index entries for >> all the hstore elements, and autovacuum will need to clean up the >> old ones in the background. The best solution would be to either >> normalize the data instead of using hstore, or move the hstore to a >> separate table which is referenced by some sort of ID from the >> frequently-updated table. > > > That's very interesting. I can certainly split up the table, no big > deal there. So would the index be redone even if I am not updating > the hstore field itself? Absolutely! see MVCC. http://www.postgresql.org/docs/current/static/mvcc-intro.html -- ⚂⚃ 100% natural