Обсуждение: Re: Update table with random values from another table
On 12/02/09, Rory Campbell-Lange (rory@campbell-lange.net) wrote: > I have a test system for which I need to replace actual user's data (in > 'users') with anonymised data from another table ('testnames') on > postgres 8.3. > > The tricky aspect is that each row from testnames must be randomised to > avoid each row in users being updated with the same value. > > I've been trying to avoid a correlated subquery for each column I am trying > to update, and instead trying the following, which never returns. There are > 2000 records in testnames and about 200 in users. I'm obviously doing something badly wrong because: UPDATE users SET t_firstname = (select firstname from testnames order by random() limit 1), t_surname = (select lastname from testnames order by random() limit 1) WHERE n_role IN (2,3); Doesn't return either after 60 seconds on a 8 core machine with 8GB of RAM and 15K disks in R10 (no swap in use). Rory
On Thu, Feb 12, 2009 at 1:10 PM, Rory Campbell-Lange <rory@campbell-lange.net> wrote: > UPDATE > users > SET t_firstname = > (select firstname from testnames order by random() limit 1), > t_surname = > (select lastname from testnames order by random() limit 1) > WHERE > n_role IN (2,3); > > Doesn't return either after 60 seconds on a 8 core machine with 8GB of > RAM and 15K disks in R10 (no swap in use). That would be because, for every row in users table, postgres has to run two subselects, with order by random() (which in it self is quite expensive). How many roles it should update ? or otherwise: what's the return of select count(*) from users where n_role IN (2,3) ? try that sort of approach (modify it for your use): test2=# create table foo(a int, b int); .. insert some test data to foo(), and ziew(a) ... test2=# update foo set a=n1.a , b=n2.a from (select generate_series(1,100) id, a from ziew order by random() limit 100) n1, (select generate_series(1,100) id, a from ziew order by random() limit 100) n2; -- GJ
actually forget about that generate_series() in sub queries, I just realized that it won't do. I don't have too much time to analyse and find solution, but essentially you need to do it like in my example. See, that's where normalization would help a lot. Ie, having a separate table for name, and surname - and than third one to connect them into full name.
=?UTF-8?Q?Grzegorz_Ja=C5=9Bkiewicz?= <gryzman@gmail.com> writes: > On Thu, Feb 12, 2009 at 1:10 PM, Rory Campbell-Lange > <rory@campbell-lange.net> wrote: >> UPDATE >> users >> SET t_firstname = >> (select firstname from testnames order by random() limit 1), >> t_surname = >> (select lastname from testnames order by random() limit 1) >> WHERE >> n_role IN (2,3); >> >> Doesn't return either after 60 seconds on a 8 core machine with 8GB of >> RAM and 15K disks in R10 (no swap in use). > That would be because, for every row in users table, postgres has to > run two subselects, with order by random() (which in it self is quite > expensive). Well, no, because those subselects are independent of the parent query; I'd expect PG to do them just once. Do they show up as "SubPlans" or "InitPlans" in EXPLAIN? regards, tom lane
Hi Grzegorz Many thanks for your reply. On 12/02/09, Grzegorz Ja??kiewicz (gryzman@gmail.com) wrote: > actually forget about that generate_series() in sub queries, I just > realized that it won't do. > I don't have too much time to analyse and find solution, but > essentially you need to do it like in my example. > > See, that's where normalization would help a lot. Ie, having a > separate table for name, and surname - and than third one to connect > them into full name. I realise that for every row in my users table (which has a unique integer field) I can update it if I construct a matching id field against a random row from the testnames table. Something like this: UPDATE users SET .... FROM (SELECT dynamic_id, firstname, lastname FROM testnames ORDER BY random() ) x WHERE users.id = x.id; However I'm not sure how to generate a dynamic_id for testnames. If I use generate_series() I get a full join, rather than 1 firstname1 lastname1 2 firstname2 lastname2 Rory
On 12/02/09, Tom Lane (tgl@sss.pgh.pa.us) wrote: > =?UTF-8?Q?Grzegorz_Ja=C5=9Bkiewicz?= <gryzman@gmail.com> writes: > > On Thu, Feb 12, 2009 at 1:10 PM, Rory Campbell-Lange > > <rory@campbell-lange.net> wrote: > >> UPDATE > >> users > >> SET t_firstname = > >> (select firstname from testnames order by random() limit 1), > >> t_surname = > >> (select lastname from testnames order by random() limit 1) > >> WHERE > >> n_role IN (2,3); > >> > >> Doesn't return either after 60 seconds on a 8 core machine with 8GB of > >> RAM and 15K disks in R10 (no swap in use). > > > That would be because, for every row in users table, postgres has to > > run two subselects, with order by random() (which in it self is quite > > expensive). > > Well, no, because those subselects are independent of the parent query; > I'd expect PG to do them just once. Do they show up as "SubPlans" or > "InitPlans" in EXPLAIN? Hi Tom I don't know what the problem was, but I restarted my psql session and the query runs in 2.181 ms. The plan is an InitPlan. Apologies for the specious post. I'm still unable to work out how to update some columns in my user's table each with a unique record from my testnames table :). Rory
On 12/02/09, Rory Campbell-Lange (rory@campbell-lange.net) wrote: > I realise that for every row in my users table (which has a unique > integer field) I can update it if I construct a matching id field > against a random row from the testnames table. I can make my join table pretty well by using the ranking procedures outlined here: http://www.barik.net/archive/2006/04/30/162447/ CREATE TEMPORARY SEQUENCE rank_seq; select nextval('rank_seq') AS id, firstname, lastname from testnames; or SELECT firstname, lastname, (SELECT count(*) FROM testnames t2 WHERE t2.firstname < t1.firstname) + 2 AS id FROM testnames t1 ORDER BY id; The second method skips some ids (probably because I haven't got an integer column in testnames)? It looks like I will have to go for the first procedure or write a function with a loop and counter. Any other ideas? Rory
On Thu, Feb 12, 2009 at 05:39:49PM +0000, Rory Campbell-Lange wrote: > On 12/02/09, Rory Campbell-Lange (rory@campbell-lange.net) wrote: > > I realise that for every row in my users table (which has a unique > > integer field) I can update it if I construct a matching id field > > against a random row from the testnames table. > > I can make my join table pretty well by using the ranking procedures > outlined here: http://www.barik.net/archive/2006/04/30/162447/ > > CREATE TEMPORARY SEQUENCE rank_seq; > select nextval('rank_seq') AS id, firstname, lastname from testnames; [...] > Any other ideas? The first is similar to the best I could come up with as well. Your problem is difficult to express in SQL because what you're trying to do doesn't seem very relational in nature. I'd do something like: BEGIN; ALTER TABLE users ADD COLUMN num SERIAL; CREATE TEMP SEQUENCE s1; UPDATE users u SET name = x.name FROM ( SELECT name, nextval('s1') AS id FROM ( SELECT name FROM testnames ORDER BY random() OFFSET 0) x) x WHERE u.id = x.id; ALTER TABLE users DROP COLUMN num; COMMIT; If your existing unique integer field runs from 1 to a number less than the number of testuser names then you won't need to add the "num" column first. The inner selects are about making sure that things are ordered randomly before we assign a sequence value to them, not sure if it's strictly needed but shouldn't hurt. -- Sam http://samason.me.uk/