Discussion: Vacuum, Freeze and Analyze: the big picture
Folks,

There's currently some great ideas bouncing around about eliminating the overhead associated with FREEZE. However, I wanted to take a step back and take a look at the big picture for VACUUM, FREEZE and ANALYZE. Otherwise, we're liable to repeat the 8.4 problem of making one operation better (background vacuum) while making another one worse (freezing).

The big, big picture is this:

90% of our users need to think about VACUUM/ANALYZE at least 10% of the time, and 10% of our users need to think about it almost 90% of the time.

That's considerably better than was the case 5 years ago, when vacuum management was a daily or weekly responsibility for nearly 100% of our users, but it's still not good enough. Our target should be that only those with really unusual setups should have to *ever* think about vacuum and analyze.

So I've set up a wiki page to document the various problems that force users to think about vacuum and analyze and try to troubleshoot it:

https://wiki.postgresql.org/wiki/VacuumHeadaches

We can also collect suggested solutions here. I'm looking to create a long-term development target which removes most of these vacuum headaches over the next 3 or 4 releases, without making the unremoved headaches significantly worse.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 30 May 2013 19:48, Josh Berkus <josh@agliodbs.com> wrote: > Folks, > > There's currently some great ideas bouncing around about eliminating the > overhead associated with FREEZE. However, I wanted to take a step back > and take a look at the big picture for VACUUM, FREEZE and ANALYZE. > Otherwise, we're liable to repeat the 8.4 problem of making one > operation better (background vacuum) while making another one worse > (freezing). > > The big, big picture is this: > > 90% of our users need to think about VACUUM/ANALYZE > at least 10% of the time > and 10% of our users need to think about it > almost 90% of the time. > > That's considerably better than was the case 5 years ago, when vacuum > management was a daily or weekly responsibility for nearly 100% of our > users, but it's still not good enough. Our target should be that only > those with really unusual setups should have to *ever* think about > vacuum and analyze. > > So I've set up a wiki page to document the various problems that force > users to think about vacuum and analyze and try to troubleshoot it: > > https://wiki.postgresql.org/wiki/VacuumHeadaches > > We can also collect suggested solutions here. I'm looking to create a > long-term development target which removes most of these vacuum > headaches over the next 3 or 4 releases, without making the unremoved > headaches significantly worse. Great collection of issues. I'm not sure I understand this: "Problem: As of 9.3, there's a significant benefit to vacuum freezing tables early so that index-only scan is enabled, since freezing also updates the visibility map. However, with default settings, such freezing only happens for data which is very old. This means that index-only scan is less effective than it could be for tables which have relatively infrequent updates and deletes." Why specifically VACUUM FREEZE rather than regular VACUUM? I thought regular VACUUM updated the visibility map too? And why as of 9.3 instead of 9.2? -- Thom
Hi, On 2013-05-30 11:48:12 -0700, Josh Berkus wrote: > There's currently some great ideas bouncing around about eliminating the > overhead associated with FREEZE. However, I wanted to take a step back > and take a look at the big picture for VACUUM, FREEZE and ANALYZE. > Otherwise, we're liable to repeat the 8.4 problem of making one > operation better (background vacuum) while making another one worse > (freezing). In what way did 8.4 make freezing worse? I can't remember any new problems there. I agree that we need to be careful not to make things worse... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> "Problem: As of 9.3, there's a significant benefit to vacuum freezing > tables early so that index-only scan is enabled, since freezing also > updates the visibility map. However, with default settings, such > freezing only happens for data which is very old. This means that > index-only scan is less effective than it could be for tables which > have relatively infrequent updates and deletes." > > Why specifically VACUUM FREEZE rather than regular VACUUM? I thought > regular VACUUM updated the visibility map too? And why as of 9.3 > instead of 9.2? As of 9.2, that was a typo. Allvisible only gets set if there was some reason for VACUUM to visit the page anyway, no? Which means that an insert-only or insert-mostly table doesn't get set allvisible until FREEZE. And insert-only tables are usually very large, and thus really *need* index-only scan. Hmmm. I should rewrite that item entirely. It has nothing to do with FREEZE, really. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
> In what way did 8.4 make freezing worse? I can't remember any new problems > there. Before the Visibility Map, we always vacuumed all pages in a relation when it was vacuumed at all. This means that we froze tuples at vacuum_freeze_min_age, not at autovacuum_freeze_max_age, which is when we do it post-8.4. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-05-30 12:11:23 -0700, Josh Berkus wrote: > > > In what way did 8.4 make freezing worse? I can't remember any new problems > > there. > > Before the Visibility Map, we always vacuumed all pages in a relation > when it was vacuumed at all. This means that we froze tuples at > vacuum_freeze_min_age, not at autovacuum_freeze_max_age, which is when > we do it post-8.4. If we have reason to vacuum the relation we do it at vacuum_freeze_table_age, not only at autovacuum_freeze_max_age. The difference is that the latter triggers a vacuum, while the former only changes a partial vacuum into a full one. Calling that behaviour unconditionally worse is, err, interesting... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
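The interaction Andres describes between the two thresholds can be sketched roughly like this. This is a simplified model for illustration, not PostgreSQL source; the constants are the documented 9.x defaults for vacuum_freeze_table_age and autovacuum_freeze_max_age:

```python
# Simplified model of when a relation gets which kind of vacuum, based
# on the age of its oldest unfrozen XID. Not PostgreSQL source code.
VACUUM_FREEZE_TABLE_AGE = 150_000_000    # documented 9.x default
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # documented 9.x default

def vacuum_kind(rel_xid_age, vacuum_needed_anyway):
    """rel_xid_age: age(relfrozenxid); vacuum_needed_anyway: True if
    dead-tuple thresholds would trigger a vacuum regardless."""
    if rel_xid_age > AUTOVACUUM_FREEZE_MAX_AGE:
        # autovacuum forces a vacuum even on an otherwise idle table
        return "anti-wraparound full-table vacuum"
    if vacuum_needed_anyway:
        if rel_xid_age > VACUUM_FREEZE_TABLE_AGE:
            # the vacuum that would run anyway scans the whole table
            return "partial vacuum upgraded to full-table scan"
        return "partial vacuum (skips all-visible pages)"
    return "no vacuum"
```

The middle case is the crux of the objection: the freeze work piggybacks on a vacuum that was happening anyway, which is much cheaper than the forced anti-wraparound case.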
On 2013-05-30 20:01:01 +0100, Thom Brown wrote: > "Problem: As of 9.3, there's a significant benefit to vacuum freezing > tables early so that index-only scan is enabled, since freezing also > updates the visibility map. However, with default settings, such > freezing only happens for data which is very old. This means that > index-only scan is less effective than it could be for tables which > have relatively infrequent updates and deletes." > Why specifically VACUUM FREEZE rather than regular VACUUM? I thought > regular VACUUM updated the visibility map too? It does. It's after all what it uses to decide which parts of the table to scan if not doing a full table vacuum. > And why as of 9.3 instead of 9.2? Maybe because 9.3 updates the VM quicker than earlier versions by checking whether all tuples are visible after we've actually removed the dead tuples. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> If we have reason to vacuum the relation we do it at > vacuum_freeze_table_age, not only at autovacuum_freeze_max_age. The > difference is that the latter triggers a vacuum, while the former only > changes a partial vacuum into a full one. > > Calling that behaviour unconditionally worse is, err, interesting... *overall* it's better. But as far as FREEZE itself is concerned, it's worse. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-05-30 12:18:29 -0700, Josh Berkus wrote: > > > If we have reason to vacuum the relation we do it at > > vacuum_freeze_table_age, not only at autovacuum_freeze_max_age. The > > difference is that the latter triggers a vacuum, while the former only > > changes a partial vacuum into a full one. > > > > Calling that behaviour unconditionally worse is, err, interesting... > > *overall* it's better. But as far as FREEZE itself is concerned, it's > worse. I am not trying to give you a hard time, but I really can't follow. In <8.3 we only froze tuples that were older than vacuum_freeze_min_age, just as today (although the default was higher then than today). 100 million transactions is long enough that you will almost certainly be in a different checkpoint cycle when freezing than when initially writing the tuple's buffer. So independent of the time the buffer is frozen (be it a) we always scan the whole relation, b) we have a partial vacuum upgraded to a full one due to vacuum_freeze_table_age, or c) an anti-wraparound vacuum) we will usually write a buffer multiple times. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
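The per-tuple rule Andres refers to (freeze anything older than vacuum_freeze_min_age whenever vacuum visits its page) amounts to a cutoff comparison. A sketch for illustration only; real PostgreSQL compares XIDs with wraparound-aware modulo-2^32 arithmetic, which this ignores:

```python
# Illustration of which tuples get frozen when vacuum visits a page.
# Not the real code: actual XID comparisons use mod-2^32 math.
VACUUM_FREEZE_MIN_AGE = 50_000_000  # documented default

def should_freeze(tuple_xmin, current_xid, min_age=VACUUM_FREEZE_MIN_AGE):
    # tuples older than the cutoff get rewritten with a frozen xmin
    cutoff = current_xid - min_age
    return tuple_xmin < cutoff
```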
On 30 May 2013 19:48, Josh Berkus <josh@agliodbs.com> wrote: > There's currently some great ideas bouncing around about eliminating the > overhead associated with FREEZE. However, I wanted to take a step back > and take a look at the big picture for VACUUM, FREEZE and ANALYZE. That is a very commendable approach. We should do that more often. > The big, big picture is this: > > 90% of our users need to think about VACUUM/ANALYZE > at least 10% of the time > and 10% of our users need to think about it > almost 90% of the time. When you say stuff like that, you should add "speculating from my personal experience". People might get the impression you'd measured this somehow and it could confuse the issue if you try to assemble a high level viewpoint and then add in factoids that are just opinions. We should strive to measure such things. > That's considerably better than was the case 5 years ago, when vacuum > management was a daily or weekly responsibility for nearly 100% of our > users, but it's still not good enough. Our target should be that only > those with really unusual setups should have to *ever* think about > vacuum and analyze. I think that's where we already are given that 1000s of users have quite small databases. The problem increases with scale. Larger databases have bigger problems and make it easier to notice things are happening. I think you should mention that the evidence for these issues is anecdotal and take careful notes of the backgrounds in which they occurred. Saying things occur in all cases wouldn't be accurate or helpful to their resolution. We should be seeking to contrast this against other databases to see if we are better or worse than other systems. For example, recording the moans of someone who is currently managing a 1 TB database, but yet hasn't ever managed anything else that big is less valuable than a balanced, experienced viewpoint (if such exists). 
Anyway, I support this approach, just wanted to make sure we do it in sufficient detail to be useful. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 1, 2013 at 1:49 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 30 May 2013 19:48, Josh Berkus <josh@agliodbs.com> wrote: > >> There's currently some great ideas bouncing around about eliminating the >> overhead associated with FREEZE. However, I wanted to take a step back >> and take a look at the big picture for VACUUM, FREEZE and ANALYZE. > > That is a very commendable approach. We should do that more often. > >> The big, big picture is this: >> >> 90% of our users need to think about VACUUM/ANALYZE >> at least 10% of the time >> and 10% of our users need to think about it >> almost 90% of the time. > > When you say stuff like that, you should add "speculating from my > personal experience". People might get the impression you'd measured > this somehow and it could confuse the issue if you try to assemble a > high level viewpoint and then add in factoids that are just opinions. > > We should strive to measure such things. > >> That's considerably better than was the case 5 years ago, when vacuum >> management was a daily or weekly responsibility for nearly 100% of our >> users, but it's still not good enough. Our target should be that only >> those with really unusual setups should have to *ever* think about >> vacuum and analyze. > > I think that's where we already are given that 1000s of users have > quite small databases. > > The problem increases with scale. Larger databases have bigger > problems and make it easier to notice things are happening. > > I think you should mention that the evidence for these issues is > anecdotal and take careful notes of the backgrounds in which they > occurred. Saying things occur in all cases wouldn't be accurate or > helpful to their resolution. > > We should be seeking to contrast this against other databases to see > if we are better or worse than other systems. 
For example, recording > the moans of someone who is currently managing a 1 TB database, but > yet hasn't ever managed anything else that big is less valuable than a > balanced, experienced viewpoint (if such exists). > > Anyway, I support this approach, just wanted to make sure we do it in > sufficient detail to be useful. I agree with all that. I don't have any data either, but I agree that AFAICT it seems to mostly be a problem for large (terabyte-scale) databases, or ones that are dreadfully short of I/O bandwidth. AWS, I'm looking at you. It would be interesting to make a list of what other issues people have seen using PostgreSQL on very large data sets. Complaints I've heard include: 1. Inexplicable failure of the planner to use indexes on very large tables, preferring an obviously-stupid sequential scan. This might be fixed by the latest index-size fudge factor work. 2. Lack of concurrent DDL. On VACUUM and ANALYZE specifically, I'd have to say that the most common problems I encounter are (a) anti-wraparound vacuums kicking in at inconvenient times and eating up too many resources and (b) users making ridiculous settings changes to avoid the problems caused by anti-wraparound vacuums kicking in at inconvenient times and eating up too many resources. The changes we've been discussing elsewhere may not completely solve this problem, because we'll still have to read all pages that aren't yet all-visible... but they should surely help. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
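One way to see Robert's complaint (a) coming before it hits is to watch XID age directly. A standard monitoring query along these lines (shown as a sketch) reports how far each database is from the autovacuum_freeze_max_age trigger:

```sql
-- How close each database's oldest XID is to forcing an
-- anti-wraparound vacuum (default trigger: 200 million).
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```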
On Saturday, June 1, 2013, Robert Haas wrote:
> I agree with all that. I don't have any data either, but I agree that
> AFAICT it seems to mostly be a problem for large (terabyte-scale)
> databases, or ones that are dreadfully short of I/O bandwidth. AWS,
> I'm looking at you.
>
> It would be interesting to make a list of what other issues people
> have seen using PostgreSQL on very large data sets. Complaints I've
> heard include:
>
> 1. Inexplicable failure of the planner to use indexes on very large
> tables, preferring an obviously-stupid sequential scan. This might be
> fixed by the latest index-size fudge factor work.
>
> 2. Lack of concurrent DDL.
>
> On VACUUM and ANALYZE specifically, I'd have to say that the most
> common problems I encounter are (a) anti-wraparound vacuums kicking in
> at inconvenient times and eating up too many resources and (b) users
> making ridiculous settings changes to avoid the problems caused by
> anti-wraparound vacuums kicking in at inconvenient times and eating up
> too many resources.
Do we know why anti-wraparound uses so many resources in the first place? The default settings seem to be quite conservative to me, even for a system that has only a single 5400 rpm hdd (and even more so for any real production system that would be used for a many-GB database).
I wonder if there is something simple but currently unknown going on which is causing it to damage performance out of all proportion to the resources it ought to be using.
Cheers,
Jeff
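For context on Jeff's question: vacuum's cost-based throttling charges each page access against a budget and sleeps when the budget is spent. A rough model, not PostgreSQL source; the per-page charges are the documented defaults for vacuum_cost_page_hit/_miss/_dirty:

```python
# Rough model of vacuum's cost-based throttling. When the accumulated
# cost reaches the limit, the real code naps for
# (autovacuum_)vacuum_cost_delay and resets the balance.
PAGE_COST = {"hit": 1, "miss": 10, "dirty": 20}  # documented defaults

def count_sleeps(page_outcomes, cost_limit=200):
    """page_outcomes: sequence of 'hit' | 'miss' | 'dirty'."""
    balance, sleeps = 0, 0
    for outcome in page_outcomes:
        balance += PAGE_COST[outcome]
        if balance >= cost_limit:
            sleeps += 1          # real code sleeps vacuum_cost_delay here
            balance = 0
    return sleeps
```

With autovacuum's default 20ms delay and a 200-point limit, a stream of dirtied pages is throttled to roughly 10 pages per sleep, i.e. about 4MB/s of dirtied 8kB pages, which is why the defaults read as conservative.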
On Sun, Jun 2, 2013 at 2:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > Do we know why anti-wraparound uses so many resources in the first place? > The default settings seem to be quite conservative to me, even for a system > that has only a single 5400 rpm hdd (and even more so for any real > production system that would be used for a many-GB database). > > I wonder if there is something simple but currently unknown going on which > is causing it to damage performance out of all proportion to the resources > it ought to be using. I can't rule that out. Personally, I've always attributed it to the fact that it's (a) long and (b) I/O-intensive. But it's not impossible there could also be bugs lurking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 3, 2013 at 1:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jun 2, 2013 at 2:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> Do we know why anti-wraparound uses so many resources in the first place? >> The default settings seem to be quite conservative to me, even for a system >> that has only a single 5400 rpm hdd (and even more so for any real >> production system that would be used for a many-GB database). >> >> I wonder if there is something simple but currently unknown going on which >> is causing it to damage performance out of all proportion to the resources >> it ought to be using. > > I can't rule that out. Personally, I've always attributed it to the > fact that it's (a) long and (b) I/O-intensive. But it's not > impossible there could also be bugs lurking. It could be related to the OS. I have no evidence for or against, but it's possible that OS write-out routines defeat the careful cost based throttling that PostgreSQL does by periodically dumping a large portion of dirty pages into the write queue at once. That does nasty things to query latencies as evidenced by the work on checkpoint spreading. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 06/02/2013 05:56 AM, Robert Haas wrote: > On Sat, Jun 1, 2013 at 1:49 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 30 May 2013 19:48, Josh Berkus <josh@agliodbs.com> wrote: >> >>> There's currently some great ideas bouncing around about eliminating the >>> overhead associated with FREEZE. However, I wanted to take a step back >>> and take a look at the big picture for VACUUM, FREEZE and ANALYZE. >> >> That is a very commendable approach. We should do that more often. >> >>> The big, big picture is this: >>> >>> 90% of our users need to think about VACUUM/ANALYZE >>> at least 10% of the time >>> and 10% of our users need to think about it >>> almost 90% of the time. >> >> When you say stuff like that, you should add "speculating from my >> personal experience". People might get the impression you'd measured >> this somehow and it could confuse the issue if you try to assemble a >> high level viewpoint and then add in factoids that are just opinions. >> >> We should strive to measure such things. >> >>> That's considerably better than was the case 5 years ago, when vacuum >>> management was a daily or weekly responsibility for nearly 100% of our >>> users, but it's still not good enough. Our target should be that only >>> those with really unusual setups should have to *ever* think about >>> vacuum and analyze. >> >> I think that's where we already are given that 1000s of users have >> quite small databases. >> >> The problem increases with scale. Larger databases have bigger >> problems and make it easier to notice things are happening. >> >> I think you should mention that the evidence for these issues is >> anecdotal and take careful notes of the backgrounds in which they >> occurred. Saying things occur in all cases wouldn't be accurate or >> helpful to their resolution. >> >> We should be seeking to contrast this against other databases to see >> if we are better or worse than other systems. 
For example, recording >> the moans of someone who is currently managing a 1 TB database, but >> yet hasn't ever managed anything else that big is less valuable than a >> balanced, experienced viewpoint (if such exists). >> >> Anyway, I support this approach, just wanted to make sure we do it in >> sufficient detail to be useful. > > I agree with all that. I don't have any data either, but I agree that > AFAICT it seems to mostly be a problem for large (terabyte-scale) > databases, or ones that are dreadfully short of I/O bandwidth. AWS, > I'm looking at you. > > It would be interesting to make a list of what other issues people > have seen using PostgreSQL on very large data sets. Complaints I've > heard include: > > 1. Inexplicable failure of the planner to use indexes on very large > tables, preferring an obviously-stupid sequential scan. This might be > fixed by the latest index-size fudge factor work. I've seen cases on Stack Overflow and elsewhere in which disk merge sorts perform vastly better than in-memory quicksort, so the user benefited from greatly *lowering* work_mem. > (b) users > making ridiculous settings changes to avoid the problems caused by > anti-wraparound vacuums kicking in at inconvenient times and eating up > too many resources. Some recent experiences I've had have also brought home to me that vacuum problems are often of the user's own making. "My database is slow" -> "This autovacuum thing is using up lots of I/O and CPU, I'll increase this delay setting here" -> "My database is slower" -> "Maybe I didn't solve the autovacuum thing, I'll just turn it off" -> "My database is barely working" -> "I'll whack in some manual VACUUM cron jobs during low load maintenance hours and hope that keeps the worst of the problem away, that's what random forum posts on the Internet say to do". -> "oh my, why did my DB just do an emergency shutdown?" 
Vacuum being more able to operate in a feedback loop driven by bloat statistics might be quite valuable, but I'm also wondering if there's any remotely feasible way to more usefully alert users when they're having table bloat issues and vacuum isn't coping. Particularly for cases where autovacuum is working but being impaired by locking. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
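The kind of alerting Craig asks for can be roughly approximated today from the statistics collector. A sketch only; the 20% and 10,000-row thresholds below are arbitrary, purely illustrative:

```sql
-- Hypothetical bloat alert: tables where dead tuples have piled up
-- and autovacuum apparently isn't keeping pace.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
  AND n_dead_tup > 0.2 * greatest(n_live_tup, 1)
ORDER BY n_dead_tup DESC;
```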
Craig Ringer <craig@2ndquadrant.com> wrote: > On 06/02/2013 05:56 AM, Robert Haas wrote: >> I agree with all that. I don't have any data either, but I agree that >> AFAICT it seems to mostly be a problem for large (terabyte-scale) >> databases, or ones that are dreadfully short of I/O bandwidth. AWS, >> I'm looking at you. > I've seen cases on Stack Overflow and elsewhere in which disk merge > sorts perform vastly better than in-memory quicksort, so the user > benefited from greatly *lowering* work_mem. I have seen this a few times, too. It would be interesting to characterize the conditions under which this is the case. >> (b) users >> making ridiculous settings changes to avoid the problems caused by >> anti-wraparound vacuums kicking in at inconvenient times and eating up >> too many resources. Where I hit a nightmare scenario with an anti-wraparound autovacuum, personally, was after an upgrade using pg_dump piped to psql. At a high OLTP transaction load time (obviously the most likely time for it to kick in, because it is triggered by xid consumption), it started to READ AND REWRITE every heap page of every table. This overwhelmed the battery-backed write cache, causing a series of "freezes" for a few minutes at a time, raising a very large number of end-user complaints. This is when I started insisting on a VACUUM FREEZE ANALYZE after any bulk load before it was considered complete and the database brought online for production use. > Some recent experiences I've had have also brought home to me that vacuum > problems are often of the user's own making. 
> > "My database is slow" > -> > "This autovacuum thing is using up lots of I/O and CPU, I'll increase > this delay setting here" > -> > "My database is slower" > -> > "Maybe I didn't solve the autovacuum thing, I'll just turn it > off" > -> > "My database is barely working" > -> > "I'll whack in some manual VACUUM cron jobs during low load maintenance > hours and hope that keeps the worst of the problem away, that's what > random forum posts on the Internet say to do". > -> "oh my, why did my DB just do an emergency shutdown?" Yeah, I've seen exactly that sequence, and some variations on it quite often. In fact, when I was first using PostgreSQL I got as far as "Maybe I didn't solve the autovacuum thing" but instead of "I'll just turn it off" my next step was "I wonder what would happen if I tried making it *more* aggressive so that it didn't have so much work to do each time it fired?" Of course, that vastly improved things. I have found it surprisingly difficult to convince other people to try that, though. I have seen people so convinced that vacuum (and particularly autovacuum) are *so* evil that they turn off autovacuum and monitor the freeze status of their tables and databases so that they can run VACUUM "just in time" to prevent the emergency shutdown. Obviously, this isn't great for their performance. :-( -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-02 11:44:04 -0700, Jeff Janes wrote: > Do we know why anti-wraparound uses so many resources in the first place? > The default settings seem to be quite conservative to me, even for a > system that has only a single 5400 rpm hdd (and even more so for any real > production system that would be used for a many-GB database). I guess the point is that nobody can actually run a bigger OLTP database successfully with the default settings. Usually that will end up with a) huge amounts of bloat in the tables autovac doesn't scan first b) forced shutdowns because autovac doesn't freeze quickly enough. The default suggestion that frequently seems to be made is just to disable autovac cost limitations because of that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
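For reference, the cost-limit settings Andres mentions live in postgresql.conf; the values shown here are the 9.x defaults, and setting autovacuum_vacuum_cost_delay to 0 is the "just disable the cost limitations" advice:

```ini
# postgresql.conf — cost-throttling defaults (9.x era)
vacuum_cost_delay = 0                 # manual VACUUM: unthrottled
vacuum_cost_limit = 200
autovacuum_vacuum_cost_delay = 20ms   # autovacuum: throttled
autovacuum_vacuum_cost_limit = -1     # -1 means use vacuum_cost_limit
```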
> I agree with all that. I don't have any data either, but I agree that > AFAICT it seems to mostly be a problem for large (terabyte-scale) > databases, or ones that are dreadfully short of I/O bandwidth. AWS, > I'm looking at you. Well, at this point, numerically I'd bet that more than 50% of our users are on AWS, some other cloud, or some kind of iSCSI storage ... some place where IO sucks. It's How Things Are Done Now. Speaking for my own clientele, people run into issues, or think they have issues, with autovacuum at databases as small as 100GB, as long as they have sufficient write throughput. One really pathological case I had to troubleshoot was a database which was only 200MB in size! (this database contained counts of things, and was updated 10,000 times per second). Anyway, my goal with that wiki page -- which is on the wiki so that others can add to it -- is to get all of the common chronic issues on the table so that we don't inadvertently make one problem worse while making another one better. Some of the solutions to FREEZE being bandied around seemed likely to do that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Jeff, > Do we know why anti-wraparound uses so many resources in the first place? > The default settings seem to be quite conservative to me, even for a > system that has only a single 5400 rpm hdd (and even more so for any real > production system that would be used for a many-GB database). > > I wonder if there is something simple but currently unknown going on which > is causing it to damage performance out of all proportion to the resources > it ought to be using. Does anti-wraparound vacuum (AWAV) write synchronously? If so, there's a potential whole world of hurt there. Otherwise, the effect you're seeing is just blowing out various caches: the CPU cache, storage cache, and filesystem cache. While we can (and do) prevent vacuum from blowing out shared_buffers, we can't do much about the others. Also, locking while it does its work. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 3, 2013 at 1:28 PM, Josh Berkus <josh@agliodbs.com> wrote: > Does anti-wraparound vacuum (AWAV) write synchronously? If so, there's > a potential whole world of hurt there. Not any more so than anything else ... although it probably does a very high percentage of FPIs (full-page WAL images), which might lead to lots of checkpointing. > Also, locking while it does its work. Eh? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> Also, locking while it does its work. > > Eh? Even if we're doing lazy vacuum, we have to lock a few pages at a time of each table. This does result in response time delays on the current workload, which can be quite bad if it's a highly contended table already. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-06-03 11:00:38 -0700, Josh Berkus wrote: > > >> Also, locking while it does its work. > > > > Eh? > > Even if we're doing lazy vacuum, we have to lock a few pages at a time > of each table. This does result in response time delays on the current > workload, which can be quite bad if it's a highly contended table already. We don't really lock more pages at a time than normal DML does: 1 heap page at a time, possibly several index pages at once. There's a related issue that can cause problems, though: we require a cleanup lock on the page to be able to repair fragmentation, which makes *vacuum* wait for all clients to release their page pins. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06/03/2013 11:12 AM, Andres Freund wrote: > On 2013-06-03 11:00:38 -0700, Josh Berkus wrote: >> >>>> Also, locking while it does its work. >>> >>> Eh? >> >> Even if we're doing lazy vacuum, we have to lock a few pages at a time >> of each table. This does result in response time delays on the current >> workload, which can be quite bad if it's a highly contended table already. > > We don't really lock more pages at a time than normal DML does. 1 heap > page at a time, possibly several index pages at once. Really? I thought vacuum held onto its locks until it reached its vacuum_cost_limit. If it doesn't, then maybe we should adjust the default for vacuum_cost_limit upwards. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 03, 2013 at 11:27:57AM +0300, Ants Aasma wrote: > > I can't rule that out. Personally, I've always attributed it to the > > fact that it's (a) long and (b) I/O-intensive. But it's not > > impossible there could also be bugs lurking. > > It could be related to the OS. I have no evidence for or against, but > it's possible that OS write-out routines defeat the careful cost based > throttling that PostgreSQL does by periodically dumping a large > portion of dirty pages into the write queue at once. That does nasty > things to query latencies as evidenced by the work on checkpoint > spreading. In other contexts I've run into issues relating to large continuous writes stalling. The issue is basically that the Linux kernel allows (by default) writes to pile up until they reach 5% of physical memory before deciding that the sucker who wrote the last block becomes responsible for writing the whole lot out. At full speed of course. Depending on the amount of memory and the I/O speed of your disks this could take a while, and interfere with other processes. This leads to extremely bursty I/O behaviour. The solution, as usual, is to make it more aggressive, so the kernel background writer triggers at 1% memory. I'm not saying that's the problem here, but it is an example of a situation where the write queue can become very large very quickly. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
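The knobs Martijn describes are the Linux dirty-writeback sysctls. A sketch of the "make it more aggressive" tuning he suggests; the exact values are illustrative, not a recommendation:

```ini
# /etc/sysctl.conf — start background write-out much earlier
vm.dirty_background_ratio = 1   # kernel writer kicks in at 1% of RAM
vm.dirty_ratio = 5              # hard limit where writers must block
```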
On Mon, Jun 3, 2013 at 2:56 PM, Josh Berkus <josh@agliodbs.com> wrote: > Really? I though vacuum held onto its locks until it reached > vacuum_cost. If it doesn't, then maybe we should adjust the default for > vacuum_cost_limit upwards. That would be completely insane. Or in other words, no, it doesn't do anything like that. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > I've seen cases on Stack Overflow and elsewhere in which disk merge > sorts perform vastly better than in-memory quicksort, so the user > benefited from greatly *lowering* work_mem. I've heard of that happening on Oracle, when the external sort is capable of taking advantage of I/O parallelism, but I have a pretty hard time believing that it could happen with Postgres under any circumstances. Maybe if someone was extraordinarily unlucky and happened to hit quicksort's O(n ^ 2) worst case it could happen, but we take various measures that make that very unlikely. It might also have something to do with our "check for pre-sorted input" [1], but I'm still skeptical. [1] http://www.postgresql.org/message-id/CAEYLb_Xn4-6f1ofsf2qduf24dDCVHbQidt7JPpdL_RiT1zBJ6A@mail.gmail.com -- Peter Geoghegan
On Mon, Jun 3, 2013 at 3:48 PM, Martijn van Oosterhout <kleptog@svana.org> wrote: > On Mon, Jun 03, 2013 at 11:27:57AM +0300, Ants Aasma wrote: >> > I can't rule that out. Personally, I've always attributed it to the >> > fact that it's (a) long and (b) I/O-intensive. But it's not >> > impossible there could also be bugs lurking. >> >> It could be related to the OS. I have no evidence for or against, but >> it's possible that OS write-out routines defeat the careful cost based >> throttling that PostgreSQL does by periodically dumping a large >> portion of dirty pages into the write queue at once. That does nasty >> things to query latencies as evidenced by the work on checkpoint >> spreading. > > In other contexts I've run into issues relating to large continuous > writes stalling. The issue is basically that the Linux kernel allows > (by default) writes to pile up until they reach 5% of physical memory > before deciding that the sucker who wrote the last block becomes > responsible for writing the whole lot out. At full speed of course. > Depending on the amount of memory and the I/O speed of your disks this > could take a while, and interfere with other processes. > > This leads to extremely bursty I/O behaviour. > > The solution, as usual, is to make it more aggressive, so the > kernel background writer triggers at 1% memory. > > I'm not saying that's the problem here, but it is an example of a > situation where the write queue can become very large very quickly. Yeah. IMHO, the Linux kernel's behavior around the write queue is flagrantly insane. The threshold for background writing really seems like it ought to be zero. I can see why it makes sense to postpone writing back dirty data if we're otherwise starved for I/O. But it seems like the kernel is disposed to cache large amounts of dirty data for an unbounded period of time even if the I/O system is completely idle, and it's difficult to imagine what class of user would find that behavior desirable. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > Martijn van Oosterhout <kleptog@svana.org> wrote: >> On Mon, Jun 03, 2013 at 11:27:57AM +0300, Ants Aasma wrote: >>> It could be related to the OS. I have no evidence for or against, but >>> it's possible that OS write-out routines defeat the careful cost based >>> throttling that PostgreSQL does by periodically dumping a large >>> portion of dirty pages into the write queue at once. That does nasty >>> things to query latencies as evidenced by the work on checkpoint >>> spreading. >> >> In other contexts I've run into issues relating to large continuous >> writes stalling. The issue is basically that the Linux kernel allows >> (by default) writes to pile up until they reach 5% of physical memory >> before deciding that the sucker who wrote the last block becomes >> responsible for writing the whole lot out. At full speed of course. >> Depending on the amount of memory and the I/O speed of your disks this >> could take a while, and interfere with other processes. >> >> This leads to extremely bursty I/O behaviour. >> >> The solution, as usual, is to make it more aggressive, so the >> kernel background writer triggers at 1% memory. >> >> I'm not saying that's the problem here, but it is an example of a >> situation where the write queue can become very large very quickly. > > Yeah. IMHO, the Linux kernel's behavior around the write queue is > flagrantly insane. The threshold for background writing really seems > like it ought to be zero. I can see why it makes sense to postpone > writing back dirty data if we're otherwise starved for I/O. I imagine the reason the OS guys would give for holding up on disk writes for as long as possible would sound an awful lot like the reason PostgreSQL developers give for doing it. Keep in mind that the OS doesn't know whether there might or might not be another layer of caching (on a battery-backed RAID controller or SSD). 
It's trying to minimize disk writes by waiting, to improve throughput by collapsing duplicate writes and allowing the writes to be performed in a more efficient order based on physical layout. > But it seems like the kernel is disposed to cache large amounts > of dirty data for an unbounded period of time even if the I/O > system is completely idle, It's not unbounded time. Last I heard, the default was 30 seconds. > and it's difficult to imagine what class of user would find that > behavior desirable. Well, certainly not a user of a database that keeps dirty pages lingering for five minutes by default, and often increases that to minimize full page writes. IMO, our defaults for bgwriter are far too passive. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 3, 2013 at 3:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> But it seems like the kernel is disposed to cache large amounts >> of dirty data for an unbounded period of time even if the I/O >> system is completely idle, > > It's not unbounded time. Last I heard, the default was 30 seconds. I'm pretty sure it is unbounded. The VM documentation is a bit vague on what dirty_expire_centisecs actually means, which is I presume where this number comes from. It says: "This tunable is used to define when dirty data is old enough to be eligible for writeout by the pdflush daemons. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a pdflush daemon wakes up." So I think a pdflush daemon won't necessarily wake up until dirty_background_bytes or dirty_background_ratio have been exceeded, regardless of this threshold. Am I mistaken? https://www.kernel.org/doc/Documentation/sysctl/vm.txt -- Peter Geoghegan
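The interaction being asked about can be sketched as a toy model: dirty data becomes *eligible* for writeout once it is older than dirty_expire_centisecs, but it is only actually written when the flusher thread next wakes, which also happens on a periodic timer (dirty_writeback_centisecs, 500 centisecs by default). This is a simplified model of the documented tunables, not kernel code:

```python
# Toy model of the two vm tunables: expiry makes a dirty page eligible
# for writeout; the periodic flusher wakeup is what actually writes it.

DIRTY_EXPIRE_CENTISECS = 3000     # default: dirty data expires after 30 s
DIRTY_WRITEBACK_CENTISECS = 500   # default: flusher wakes every 5 s

def worst_case_dirty_age_seconds(expire_cs, writeback_cs):
    """Upper bound on dirty-page age in this model: the page must
    first expire, then wait for the next periodic flusher wakeup."""
    return (expire_cs + writeback_cs) / 100.0

print(worst_case_dirty_age_seconds(DIRTY_EXPIRE_CENTISECS,
                                   DIRTY_WRITEBACK_CENTISECS))  # 35.0
```

Under this reading the age is bounded (roughly 35 s with defaults) as long as the periodic wakeup fires, rather than waiting on dirty_background_ratio.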
On 06/04/2013 05:27 AM, Peter Geoghegan wrote: > On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote: >> I've seen cases on Stack Overflow and elsewhere in which disk merge >> sorts perform vastly better than in-memory quicksort, so the user >> benefited from greatly *lowering* work_mem. > I've heard of that happening on Oracle, when the external sort is > capable of taking advantage of I/O parallelism, but I have a pretty > hard time believing that it could happen with Postgres under any > circumstances. IIRC it's usually occurred with very expensive comparison operations. I'll see if I can find one of the SO cases. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-03 16:41:32 -0700, Peter Geoghegan wrote: > On Mon, Jun 3, 2013 at 3:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > >> But it seems like the kernel is disposed to cache large amounts > >> of dirty data for an unbounded period of time even if the I/O > >> system is completely idle, > > > > It's not unbounded time. Last I heard, the default was 30 seconds. > > I'm pretty sure it is unbounded. The VM documentation is a bit vague > on what dirty_expire_centisecs actually means, which is I presume > where this number comes from. It says: > > "This tunable is used to define when dirty data is old enough to be eligible > for writeout by the pdflush daemons. It is expressed in 100'ths of a second. > Data which has been dirty in-memory for longer than this interval will be > written out next time a pdflush daemon wakes up." > > So I think a pdflush daemon won't necessarily wake up until > dirty_background_bytes or dirty_background_ratio have been exceeded, > regardless of this threshold. Am I mistaken? Without having checked it again, AFAIR it should wake up every dirty_writeback_centisecs, which is something like 5 seconds. All that has pretty significantly changed - and imo improved! - in the last year or so of kernel development. Unfortunately it will take a while till we commonly see those kernels being used :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> All that has pretty significantly changed - and imo improved! - in the > last year or so of kernel development. Unfortunately it will take a > while till we commonly see those kernels being used :( ... after being completely broken for 3.2 through 3.5. We're actually using 3.9 in production on some machines, because we couldn't take the IO disaster that is 3.4. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, May 30, 2013 at 7:48 PM, Josh Berkus <josh@agliodbs.com> wrote: > The big, big picture is this: > > 90% of our users need to think about VACUUM/ANALYZE > at least 10% of the time > and 10% of our users need to think about it > almost 90% of the time. > > That's considerably better than was the case 5 years ago, when vacuum > management was a daily or weekly responsibility for nearly 100% of our > users, Fwiw I think this is not the right picture. I think the current situation is an accurate description of the way things are and have always been. It's an arms race. We've raised the bar of how large and busy your database has to be before vacuum becomes a pain, and users scale their databases up. As long as we stay one step ahead of the users, 90% of users won't have to think about vacuum/analyze much. There will always be outliers. When the visibility map went in, the argument was that wraparound was so rare that it wasn't worth doubling the size of the visibility map to have a second bit. If the table gets even a low amount of traffic, nearly all blocks will need to be frozen anyways by that time. To do something like the visibility map for freezing we would need something like a map that stores the high 8 bits of the oldest unfrozen xid in the block. That would be a lot more complex and take a lot more space. -- greg
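The per-block map Greg describes could be sketched roughly as below. Everything here is hypothetical illustration, not PostgreSQL code, and it ignores the circular xid arithmetic the server actually uses:

```python
# Sketch of a "freeze map" holding, per heap block, only the high 8 bits
# of the oldest unfrozen xid in that block (one byte per block). A vacuum
# could then skip blocks whose oldest possible unfrozen xid is already
# newer than the freeze cutoff. Ignores xid wraparound for simplicity.

XID_BITS = 32

def high_byte(xid):
    """High 8 bits of a 32-bit xid - the map entry for a block."""
    return (xid >> (XID_BITS - 8)) & 0xFF

def block_needs_freeze(map_entry, freeze_cutoff_xid):
    """Conservative test: the block must be frozen unless even the
    *oldest* xid it could contain is newer than the cutoff."""
    # Reconstruct the smallest xid that could have produced this entry.
    oldest_possible = map_entry << (XID_BITS - 8)
    return oldest_possible < freeze_cutoff_xid

entry = high_byte(0x12345678)                 # 0x12
print(block_needs_freeze(entry, 0x20000000))  # True: may need freezing
print(block_needs_freeze(entry, 0x10000000))  # False: safely newer
```

The lossy 8-bit bucketing is what makes it a one-byte-per-block map rather than a full xid array, at the cost of occasionally freezing a block that didn't strictly need it.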
On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 06/02/2013 05:56 AM, Robert Haas wrote:
>> (b) users making ridiculous settings changes to avoid the problems caused by
>> anti-wraparound vacuums kicking in at inconvenient times and eating up
>> too many resources.
>
> Some recent experiences I've had have also brought home to me that vacuum
> problems are often of the user's own making.
>
> "My database is slow"
> ->
> "This autovacuum thing is using up lots of I/O and CPU, I'll increase
> this delay setting here"

Do you think this was the correct diagnosis but with the wrong action taken, or was the diagnosis incorrect in the first place (i.e. it may be using some IO and CPU, but that isn't what was causing the initial problem)? And if the diagnosis was correct, was it causing problems under default settings, or only because they had already turned off the cost delay?

Cheers,

Jeff
On Mon, Jun 3, 2013 at 6:34 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Where I hit a nightmare scenario with an anti-wraparound
> autovacuum, personally, was after an upgrade using pg_dump piped to
> psql. At a high OLTP transaction load time (obviously the most
> likely time for it to kick in, because it is triggered by xid
> consumption), it started to READ AND REWRITE every heap page of
> every table. This overwhelmed the battery-backed write cache,
> causing a series of "freezes" for a few minutes at a time, raising
> a very large number of end-user complaints.

But this is only after autovacuum_vacuum_cost_delay was already changed to zero, right? It is hard to imagine the write cache being overwhelmed by the default setting, or even by a setting substantially more aggressive than the default but still not zero. Anti-wraparound vacuums should generate almost purely sequential writes (at least if only btree indexes exist), so they should clear very quickly.

>> "I'll whack in some manual VACUUM cron jobs during low load maintenance
>> hours and hope that keeps the worst of the problem away, that's what
>> random forum posts on the Internet say to do".
>> -> "oh my, why did my DB just do an emergency shutdown?"
>
> Yeah, I've seen exactly that sequence, and some variations on it
> quite often. In fact, when I was first using PostgreSQL I got as
> far as "Maybe I didn't solve the autovacuum thing" but instead of
> "I'll just turn it off" my next step was "I wonder what would
> happen if I tried making it *more* aggressive so that it didn't
> have so much work to do each time it fired?" Of course, that
> vastly improved things. I have found it surprisingly difficult to
> convince other people to try that, though.

What is it you changed? Either an anti-wraparound happens, or it does not, so I'm not sure what you mean about making it more aggressive so there is less to do. It always has to do the whole thing. Was it autovacuum_vacuum_scale_factor that you changed?

Cheers,

Jeff
On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> ->
> "I'll whack in some manual VACUUM cron jobs during low load maintenance
> hours and hope that keeps the worst of the problem away, that's what
> random forum posts on the Internet say to do".
> -> "oh my, why did my DB just do an emergency shutdown?"

This one doesn't make much sense to me, unless they mucked around with autovacuum_freeze_max_age as well as turning autovacuum itself off (common practice?). With the default setting of autovacuum_freeze_max_age, if it can't complete the anti-wraparound before emergency shutdown with autovac off, it probably would not have completed it with autovac on, either.

Cheers,

Jeff
On 06/07/2013 04:38 AM, Jeff Janes wrote: > On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote: >> "My database is slow" >> -> >> "This autovacuum thing is using up lots of I/O and CPU, I'll increase >> this delay setting here" > > Do you think this was the correct diagnosis but with the wrong action > taken, or was the diagnosis incorrect in the first place (i.e. it may be > using some IO and CPU, but that isn't what was causing the initial > problem)? And if the diagnosis was correct, was it causing problems under > default settings, or only because they already turned off the cost delay? The problem is that vacuum running too slow tends to result in table and index bloat. Which results in less efficient cache use, slower scans, and generally worsening performance. I've repeatedly seen the user attribute the resulting high I/O to autovacuum (which is, after all, always working away trying to keep up) - and "solving" the problem by further slowing autovacuum. It is very counter-intuitive that to fix the problem the user needs to make the background process that's doing the I/O take up *more* resources, so that other queries take *even less*. >> -> >> "I'll whack in some manual VACUUM cron jobs during low load maintenance >> hours and hope that keeps the worst of the problem away, that's what >> random forum posts on the Internet say to do". >> -> "oh my, why did my DB just do an emergency shutdown?" > > This one doesn't make much sense to me, unless they mucked around with > autovacuum_freeze_max_age as well as turning autovacuum itself off > (common practice?). Unfortunately, yes, as an extension of the above reasoning people seem to apply around autovacuum. The now horrifyingly bloated DB is being kept vaguely functional by regular cron'd vacuum runs, but then autovacuum kicks back in and starts thrashing the system. 
It's already performing really badly because of all the bloat so this is more than it can take and performance tanks critically. Particularly since it probably has 1000 or more backends thrashing away if it's anything like many of the systems I've been seeing in the wild. The operator's response: Panic and find out how to make it stop. Once autovacuum quits doing its thing the system returns to staggering along and they go back to planning a hardware upgrade someday, then suddenly it's emergency wraparound prevention time. I suspect vacuum, autovacuum, autovacuum tuning, table and index bloat, etc is just too complicated for a lot of people running Pg installs to really understand. I'd really, really love to see some feedback-based auto-tuning of vacuum. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> wrote: > On 06/07/2013 04:38 AM, Jeff Janes wrote: >> Craig Ringer <craig@2ndquadrant.com> > The problem is that vacuum running too slow tends to result in > table and index bloat. Which results in less efficient cache use, > slower scans, and generally worsening performance. > > I've repeatedly seen the user attribute the resulting high I/O to > autovacuum (which is, after all, always working away trying to > keep up) - and "solving" the problem by further slowing > autovacuum. > > It is very counter-intuitive that to fix the problem the user > needs to make the background process that's doing the I/O take up > *more* resources, so that other queries take *even less*. Exactly. It can be very hard to convince someone to make autovacuum more aggressive when they associate its default configuration with slowness. >>> -> "I'll whack in some manual VACUUM cron jobs during low load >>> maintenance hours and hope that keeps the worst of the problem >>> away, that's what random forum posts on the Internet say to >>> do". >>> -> "oh my, why did my DB just do an emergency shutdown?" >> >> This one doesn't make much sense to me, unless they mucked >> around with autovacuum_freeze_max_age as well as turning >> autovacuum itself off (common practice?). > > Unfortunately, yes, as an extension of the above reasoning people > seem to apply around autovacuum. The now horrifyingly bloated DB > is being kept vaguely functional by regular cron'd vacuum runs, > but then autovacuum kicks back in and starts thrashing the > system. It's already performing really badly because of all the > bloat so this is more than it can take and performance tanks > critically. Particularly since it probably has 1000 or more > backends thrashing away if it's anything like many of the systems > I've been seeing in the wild. > > The operator's response: Panic and find out how to make it stop. 
> Once autovacuum quits doing its thing the system returns to > staggering along and they go back to planning a hardware upgrade > someday, then suddenly it's emergency wraparound prevention time. I have seen exactly this pattern multiple times. They sometimes completely ignore all advice about turning on and tuning autovacuum and instead want to know the exact formula for when the wraparound prevention autovacuum will trigger, so they can run a vacuum "just in time" to prevent it -- since they believe this will minimize disk access and thus give them best performance. They often take this opportunity to run VACUUM FULL on the table and don't see the point of following that with any other form of VACUUM, so they wipe out their visibility map in the process. > I suspect vacuum, autovacuum, autovacuum tuning, table and index > bloat, etc is just too complicated for a lot of people running Pg > installs to really understand. The ones who suffer most are those who learn just enough to think they know how to tune better than the defaults, but not enough to really understand the full impact of the changes they are making. I have no particular ideas on what to do about that observation, unfortunately. > I'd really, really love to see some feedback-based auto-tuning of > vacuum. +1 -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
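The "just in time" formula those users ask about is, to a first approximation, simple: an anti-wraparound vacuum is forced on a table once the age of its relfrozenxid exceeds autovacuum_freeze_max_age (200 million by default), even with autovacuum disabled. A sketch of that trigger condition, ignoring the circular xid arithmetic the server actually performs:

```python
# Simplified model of the anti-wraparound trigger: a vacuum is forced
# when age(relfrozenxid) exceeds autovacuum_freeze_max_age. Treats xids
# as plain integers (no wraparound arithmetic) for clarity.

AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000   # default setting

def wraparound_vacuum_forced(current_xid, relfrozenxid,
                             freeze_max_age=AUTOVACUUM_FREEZE_MAX_AGE):
    """True once the table's oldest unfrozen xid is too old to wait."""
    return (current_xid - relfrozenxid) > freeze_max_age

print(wraparound_vacuum_forced(250_000_000, 40_000_000))   # True
print(wraparound_vacuum_forced(250_000_000, 100_000_000))  # False
```

Of course, knowing the formula doesn't make the "just in time" strategy a good one, for exactly the reasons given above.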
> I suspect vacuum, autovacuum, autovacuum tuning, table and index bloat, > etc is just too complicated for a lot of people running Pg installs to > really understand. I'd really, really love to see some feedback-based > auto-tuning of vacuum. Heck, it's hard for *me* to understand, and I helped design it. I think there's no question that it could be vastly improved. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 3, 2013 at 6:42 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-02 11:44:04 -0700, Jeff Janes wrote:
>> Do we know why anti-wraparound uses so many resources in the first place?
>> The default settings seem to be quite conservative to me, even for a
>> system that has only a single 5400 rpm hdd (and even more so for any real
>> production system that would be used for a many-GB database).
>
> I guess the point is that nobody can actually run a bigger OLTP database
> successfully with the default settings. Usually that will end up with a)
> huge amounts of bloat in the tables autovac doesn't scan first b) forced
> shutdowns because autovac doesn't freeze quickly enough.

I think that Greg Smith posted elsewhere that 4MB/sec of dirtying (which is the default) was about right for some of his very busy systems, which seem like they had pretty impressive IO subsystems. I was surprised it was so low. Are there other anecdotes about what settings work well in practice, assuming people ever find ones that work well?

Which raises the question: is the primary problem that there are no settings that work well for those systems, or that there usually are such sweet-spot settings but mere mortals cannot find them?

> The default suggestion that frequently seems to be made is just to
> disable autovac cost limitations because of that.

Is there general agreement that this suggestion is bad? Setting autovacuum_vacuum_cost_delay to zero is basically saying "I dare you to do your best to destroy my IO performance." So it is not surprising that this just moves one from the frying pan to the fire, or maybe the reverse. (The small ring buffer used by vacuums might save your bacon if your fsyncs actually need to hit disk, as the constant stream of fsync requests to the WAL will act as a secondary throttle.)

How about recommending that if autovacuum is not keeping up, the cost delay be set to the default divided by the number of spindles? That may be overly aggressive, but infinitely less aggressive than setting it to zero would be.

Cheers,

Jeff
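The ~4 MB/s figure mentioned above falls directly out of the default cost-delay settings. A back-of-envelope sketch of the throttling arithmetic (not the server's implementation), assuming the worst case where every page touched is dirtied:

```python
# Derivation of the default autovacuum dirtying rate from the
# cost-based vacuum delay settings (defaults as of this era).

BLCKSZ = 8192                      # default PostgreSQL page size, bytes
vacuum_cost_limit = 200            # cost accumulated before napping
autovacuum_vacuum_cost_delay = 20  # ms to sleep once the limit is hit
vacuum_cost_page_dirty = 20        # cost charged per page dirtied

# Worst case: every page processed is dirtied, so each throttle cycle
# dirties vacuum_cost_limit / vacuum_cost_page_dirty pages.
pages_per_cycle = vacuum_cost_limit / vacuum_cost_page_dirty   # 10 pages
cycles_per_second = 1000 / autovacuum_vacuum_cost_delay        # 50/s
dirty_bytes_per_second = pages_per_cycle * cycles_per_second * BLCKSZ

print(dirty_bytes_per_second / 1024 / 1024)  # ~3.9 MiB/s
```

Halving the delay (or dividing it by the number of spindles, as suggested) scales this rate up linearly, which is far gentler than removing the limit entirely.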
On 2013-06-12 14:43:53 -0700, Jeff Janes wrote: > > The default suggestion that frequently seems to be made is just to > > disable autovac cost limitations because of that. > Is there general agreement that this suggestion is bad? Setting > autovacuum_vacuum_cost_delay to zero is basically saying "I dare you to do > your best to destroy my IO performance." So it is not surprising that this > just moves one from the frying pan to the fire, or maybe the reverse. It sure as heck is better than an anti wraparound shutdown every week because autovacuum doesn't finish all relations in time. More often than not a large part of the relations has already been frozen so it won't dirty all that much. I think it's actually a quite sensible setting in many situations given the state of the current controls. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Jeff Janes <jeff.janes@gmail.com> wrote: > Are there other anecdotes about what settings work well in > practice, assuming people ever find ones that work well? Putting WAL on its own RAID with its own battery-backed cache can help a lot more than I would have thought -- even with read-only transactions. http://www.postgresql.org/message-id/4B71358E020000250002F0E4@gw.wicourts.gov -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun 3, 2013, at 6:45 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 06/04/2013 05:27 AM, Peter Geoghegan wrote: >> On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote: >>> I've seen cases on Stack Overflow and elsewhere in which disk merge >>> sorts perform vastly better than in-memory quicksort, so the user >>> benefited from greatly *lowering* work_mem. >> I've heard of that happening on Oracle, when the external sort is >> capable of taking advantage of I/O parallelism, but I have a pretty >> hard time believing that it could happen with Postgres under any >> circumstances. > IIRC it's usually occurred with very expensive comparison operations. > > I'll see if I can find one of the SO cases. FWIW, I've definitely seen this behavior in the past, on really old versions (certainly pre-9, possibly pre-8). IIRC there's some kind of compression or something used with on-disk sorts. If that's correct then I think what's happening is that the "on-disk" sort that fits into cache is actually using less memory than quicksort. Or perhaps it was just a matter of memory locality within each tape. It's been too long since I looked at it. :(
On Tue, Jun 18, 2013 at 6:11 PM, Jim Nasby <jim@nasby.net> wrote: > IIRC there's some kind of compression or something used with on-disk sorts. I think you're mistaken. > If that's correct then I think what's happening is that the "on-disk" sort that fits into cache > is actually using less memory than quicksort. Or perhaps it was just a matter of memory > locality within each tape. It's been too long since I looked at it. :( External sorts do of course use less memory, but quicksort is particularly good at taking advantage of memory locality. I think it's possible that what you recall is the days when we used the OS qsort(), and we were at the mercy of the implementation that the OS provided. When we effectively began to vendor our own sort routine in 2006, we chose a high-quality one with various protections against quadratic behaviors. Implementations that lacked these protections were prevalent at a surprisingly late stage. -- Peter Geoghegan