Discussion: How to insert on duplicate key?
Hi all,

I have a table like this:

CREATE TABLE mytable (
    a varchar(40),
    b text,
    c text,
    PRIMARY KEY (a, b)
);

What I want to do is: insert a record into the table, and when the record already exists (according to the primary key), update it. I know that MySQL has an ON DUPLICATE KEY clause, so I'm wondering: is there a quick and clean way to do this in PostgreSQL? I have googled, and currently the only way I can find is to do a query first and then update or insert.

Thanks a lot.

Regards,
On Dec 24, 2007, at 22:03, fdu.xiaojf@gmail.com wrote:

> I have googled and currently the only way I can find is do query
> first and then update or insert.

Or alternatively, UPDATE and see if you've affected any rows. If not, INSERT.

Michael Glaesemann
grzm seespotcode net
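A minimal sketch of that pattern, using the table from the original question (the key and data values are placeholders; how the affected-row count is read back depends on your client driver):

```sql
-- Try the update first; the client driver reports how many rows were affected.
UPDATE mytable SET c = 'new value' WHERE a = 'key_a' AND b = 'key_b';

-- If the UPDATE affected 0 rows, the record doesn't exist yet, so insert it.
INSERT INTO mytable (a, b, c) VALUES ('key_a', 'key_b', 'new value');
```

Note that there is a race here: between the UPDATE and the INSERT, another session can insert the same key, so the INSERT can still fail with a duplicate-key error unless you handle that case (or serialize access to the table).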
Michael Glaesemann wrote:
> On Dec 24, 2007, at 22:03, fdu.xiaojf@gmail.com wrote:
>> I have googled and currently the only way I can find is do query first
>> and then update or insert.
>
> Or alternatively, UPDATE and see if you've affected any rows. If not,
> insert.

Thanks for your quick reply!

Is the update-then-insert method quick? I have tried the query-then-update/insert way, and it was very slow once more than 1 million records had been inserted. (I have more than 20 million records to insert.)

Thanks again!

Xiao Jianfeng
On Tue, 25 Dec 2007, fdu.xiaojf@gmail.com wrote:

> insert a record into a table, and when the record already
> exists (according to the primary key), update it.

There is an example that does exactly that, Example 37-1, in the documentation at http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html

It actually does the update first, and only if that fails does the insert, which avoids the whole duplicate-key issue altogether.

> I have tried the query and update/insert way, and it was very slow when
> more than 1 million records have been inserted. (I have more than 20
> million records to insert.)

This may be better because it isn't doing the query first. You may discover that you need to aggressively run one of the VACUUM processes (I'd guess regular VACUUM and ANALYZE, but not FULL) in order to keep performance steady as the number of records grows. Any time you update a row, the old version becomes a dead row that still takes up space, and if you accumulate a lot of those they get in the way of finding the rows that are still live. Take a look at http://www.postgresql.org/docs/current/interactive/routine-vacuuming.html to get an idea of the process.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
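The documented example is along these lines (a sketch reproduced from the manual page linked above; the table and function names `db` and `merge_db` are the documentation's own toy names, so check the linked page for the exact text):

```sql
CREATE TABLE db (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- First try to update an existing row with this key.
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- Not there: try to insert. If another session inserted the
        -- same key concurrently, we get a unique_violation and loop
        -- around to retry the UPDATE.
        BEGIN
            INSERT INTO db (a, b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing; loop and try the UPDATE again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
```

Unlike the bare UPDATE-then-INSERT sequence, the retry loop makes this safe against a concurrent insert of the same key.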
On Dec 24, 2007, at 11:15 PM, Greg Smith wrote:

> This may be better because it isn't doing the query first. You may
> discover that you need to aggressively run one of the VACUUM
> processes (I'd guess regular and ANALYZE but not FULL) in order to
> keep performance steady as the number of records grows. Anytime you
> update a row, that becomes a dead row that's still taking up space,
> and if you do a lot of those they get in the way of finding the rows
> that are still live. Take a look at
> http://www.postgresql.org/docs/current/interactive/routine-vacuuming.html
> to get an idea of the process.

Whoa. I am going to have to dig into the implementation. What is wrong with update in place: concurrency issues? The dead row presumably is no longer indexed, right? Since it is known to be dead, is it automatically removed when there are no live transactions that reference (or may reference) it, with its data-page space marked available for new rows? If not, why not? I'm dredging my mind for stuff from my RDBMS-implementation grad course a very long time ago.

I would imagine that vacuuming often during a huge insert/update run would be a pretty poor performer, depending on the implementation. How is this implemented? Why would it be heavy I/O if a list of pointers to the dead rows is effectively being kept, simply to be added to the free list? What else is vacuum doing: removal from indexes, something else?

- samantha
On Thursday 27 December 2007 12:23, Samantha Atkins wrote:
> Whoa. I am going to have to dig into the implementation. What is
> wrong with update in place, concurrency issues? The dead row
> presumably is no longer indexed, right?

At the time your transaction commits, it cannot update in place, since someone else may be looking at the old version of the row in the middle of their transaction, so you need two copies. Even after the update, you still need a pointer in the index to the old version of the row, in case it is referenced again.

> Since it is known to be dead
> is it automatically removed when there are no live transaction that
> reference or may reference it and its data page space marked available
> for new rows? If not, why not? I'm dredging my mind for stuff from
> my RDBMS implementation grad course a very long time ago.

The problem is that you have decided in your mind that a row is "known dead" without explaining how that would actually be determined. A given transaction doesn't have a way to determine whether there are live transactions looking at the row; that would require quite a bit of knowledge about what else is occurring in the system. That level of knowledge/complexity is what vacuum takes care of.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On 12/28/07, Robert Treat <xzilla@users.sourceforge.net> wrote:
> On Thursday 27 December 2007 12:23, Samantha Atkins wrote:
>> Since it is known to be dead
>> is it automatically removed when there are no live transaction that
>> reference or may reference it and its data page space marked available
>> for new rows? If not, why not?
>
> The problem is you have determined in your mind that a row is "known dead"
> without explination of how that would actually be determined. A given
> transaction doesn't have a way to determine if there are live transaction
> looking at the row. That level of knowledge/complexity is what vacuum
> takes care of.

If you're familiar with the intricacies of the garbage-collection vs. malloc/free debate surrounding programming languages, especially in multithreaded environments, this is basically the same thing applied to disk storage. PostgreSQL implements garbage collection. Its methods for doing it automatically are still being refined, hence the advice on manual tuning for specific workloads.
On Fri, 28 Dec 2007, Robert Treat wrote:

> A given transaction doesn't have a way to determine if there are live
> transaction looking at the row, that would require quite a bit of
> knowledge about what else is occuring in the system to be able to
> determine that. That level of knowledge/complexity is what vacuum takes
> care of.

One reason it doesn't happen automatically is that it would slow down the client doing the update. Sorting out which rows can be seen by which clients is better done later, for a number of reasons; some examples:

- It's more likely there won't be any transactions still referencing the original row as time moves forward. If you do it the instant the row is dead, the odds are higher that somebody is still using the original version and you can't prune it yet anyway.

- It's more efficient to sort through a bunch of these in bulk than to do them one at a time.

- You need a similar vacuum process happening regularly anyway to analyze your tables and keep statistics about them up to date, so you might as well do both of those things at once.

The downside is that vacuum can have a relatively high impact on the system, but the answer there is to do it more often, so that any individual vacuum is less difficult.

It's also worth mentioning that some of this update-row reuse happens more automatically in 8.3 with a new feature called HOT, so in some cases this particular issue already has a workaround that everyone can get in the near future.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
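In practice, the routine maintenance described above boils down to something like the following (using the table name from the original question; how often to run it depends entirely on the update rate):

```sql
-- Reclaim dead-row space and refresh planner statistics in one pass.
-- Plain VACUUM does not lock out concurrent readers and writers;
-- VACUUM FULL does, and is usually unnecessary for steady churn.
VACUUM ANALYZE mytable;
```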