Обсуждение: [HACKERS] WARM and indirect indexes
Two options are on the table to attack the problem of updates causing write amplification: WARM and indirect indexes. They are completely different approaches but have overlapping effects on what scenarios are improved. Here's a recap of both features, with the intent that we make a well-considered decision about each. The main effect of both features is that an updated tuple doesn't require updating indexes that are on unmodified columns. Indirect indexes are a completely new server feature which may enable other refinements later on; WARM is a targeted optimization on top of the HOT optimization. The big advantage of WARM is that it works automatically, like HOT: the user doesn't need to do anything different than today to get the benefit. With indirect indexes, the user needs to create the index as indirect explicitely. There are two big disadvantages to WARM (as to HOT): it cannot be applied when the heap page is full; and concurrent long-running transactions can spoil reclaimability of recently dead tuples in heap pages. There's a further disadvantage: currently, there can be only one WARM update in an update chain. (Pavan believes it is possible to allow multiple ones.) All those cases can benefit from indirect indexes. Another interesting case is a table with a primary key and a JSON object, on which you have a GIN index (or an int[] array, or text search). What happens if you modify the JSON? With WARM, this is just a normal index update. With indirect indexes, you may be able to skip inserting index entries for all the JSON elements except those which changed. (However, this is not implemented yet.) - When scanning a WARM-updated block starting from an index, you may need to do more work to walk the update chain until you find the visible tuple. Most of the time, HOT/WARM chains are very short thanks to HOT pruning, so this shouldn't be a problem. - Indirect indexes require a primary key to be present. If the PK is dropped, the IndIx must be dropped too. - Indirect indexes become larger if the primary key is wide. - Indirect indexes are not fully implemented yet (need to remove restriction of PK value being 6 bytes; also wholesale vacuuming, though there's no universal agreement that this is strictly necessary.) - An indirect index takes longer to read, since it needs to descend both the IndIx itself and the primary key index. -- Álvaro Herrera
On Tue, Jan 10, 2017 at 04:24:42PM -0300, Alvaro Herrera wrote: > Two options are on the table to attack the problem of updates causing > write amplification: WARM and indirect indexes. They are completely > different approaches but have overlapping effects on what scenarios are > improved. Here's a recap of both features, with the intent that we make > a well-considered decision about each. > > The main effect of both features is that an updated tuple doesn't > require updating indexes that are on unmodified columns. Indirect > indexes are a completely new server feature which may enable other > refinements later on; WARM is a targeted optimization on top of the HOT > optimization. > > The big advantage of WARM is that it works automatically, like HOT: the > user doesn't need to do anything different than today to get the > benefit. With indirect indexes, the user needs to create the index as > indirect explicitely. Thank you for the summary. I think we have to consider two things with indirect indexes: 1. What percentage speedup is the _average_ user going to get? You have to consider people who will use indirect indexes who get no benefit or a net slowdown, and users who will get a benefit. 2. What percentage of users are going to use indirect indexes? So, for #1 you might have users who are getting +1%, +50%, and -20%, so maybe +10% average, and for #2 you might have 0.1%. When you multiply them out, you get 0.01% average improvement per installation, which is very small. Obviously, these are just wild guesses, but this is just to make a point. If you assume WARM has been optimized, #1 gets even lower. I am not saying we shouldn't do it, but I am afraid that the complexity in figuring out when to use indirect indexes, combined with the number of users who will try them, really hurts its inclusion. FYI, we have a similar issue in adding GUC variables, which I outlined in this blog post: http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009 -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian wrote: > 1. What percentage speedup is the _average_ user going to get? You > have to consider people who will use indirect indexes who get no benefit > or a net slowdown, and users who will get a benefit. > > 2. What percentage of users are going to use indirect indexes? > > So, for #1 you might have users who are getting +1%, +50%, and -20%, so > maybe +10% average, and for #2 you might have 0.1%. When you multiply > them out, you get 0.01% average improvement per installation, which is > very small. Obviously, these are just wild guesses, but this is just to > make a point. Perhaps not many users will require indirect indexes; but for those that do, the feature might be invaluable. We don't do only things that benefit everybody -- some features are there to satisfy small populations ("snapshot too old" is a recent example). We should of course do, and perhaps even favor doing things that benefit everybody, but should also do the other things. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 10, 2017 at 11:36:24PM -0300, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > 1. What percentage speedup is the _average_ user going to get? You > > have to consider people who will use indirect indexes who get no benefit > > or a net slowdown, and users who will get a benefit. > > > > 2. What percentage of users are going to use indirect indexes? > > > > So, for #1 you might have users who are getting +1%, +50%, and -20%, so > > maybe +10% average, and for #2 you might have 0.1%. When you multiply > > them out, you get 0.01% average improvement per installation, which is > > very small. Obviously, these are just wild guesses, but this is just to > > make a point. > > Perhaps not many users will require indirect indexes; but for those that > do, the feature might be invaluable. We don't do only things that > benefit everybody -- some features are there to satisfy small > populations ("snapshot too old" is a recent example). We should of > course do, and perhaps even favor doing things that benefit everybody, > but should also do the other things. I never said "We should do only things that benefit everybody," so why are you saying that? You are arguing against something I didn't say. I am trying to make a balanced analysis, and you arguing against an extreme position. My point is that anything you add must be weighed against the value it gives to users who use it, and the percentage of users who will use it. Against that benefit, you have to look at the cost of exposing that API to users, code complexity, maintenance, etc. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian wrote: > My point is that anything you add must be weighed against the value it > gives to users who use it, and the percentage of users who will use it. > Against that benefit, you have to look at the cost of exposing that API > to users, code complexity, maintenance, etc. I agree. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 10, 2017 at 09:25:05PM -0500, Bruce Momjian wrote: > Thank you for the summary. I think we have to consider two things with > indirect indexes: > > 1. What percentage speedup is the _average_ user going to get? You > have to consider people who will use indirect indexes who get no benefit > or a net slowdown, and users who will get a benefit. > > 2. What percentage of users are going to use indirect indexes? > > So, for #1 you might have users who are getting +1%, +50%, and -20%, so > maybe +10% average, and for #2 you might have 0.1%. When you multiply > them out, you get 0.01% average improvement per installation, which is > very small. Obviously, these are just wild guesses, but this is just to > make a point. If you assume WARM has been optimized, #1 gets even > lower. Sorry to have to reply to my own email but I need some of the text above. Basically, with WARM, the adoption rate (#2) is 100%. I am asking what instructions we will give users for #1 to prevent people from using indirect indexes and getting worse performance. Are we going to say, "Use indirect indexes on columns that are updated frequently?" Actually, that seems pretty clear and would be easy for users to follow. I think the big question is that we will not know the benefits of indirect indexes over WARM until we have implemented WARM, and if the benefits of indirect indexes over WARM are small, and considering #2, we might decide that it isn't worth adding it, for the reasons I already outlined. Therefore, I think we need WARM done first, then we can test indirect indexes to see if they are a sufficient win to add it for the small percentage of users who will use it. In general, Postgres doesn't support ever possible performance tuning option, and I think we are better for that because Postgres is simpler to use. Going back to my blog post, if you add a feature, every user who is considering tuning Postgres has to understand the feature and decide if they should use it, so even for people who don't user the feature, there is a cost, however small. In summary, I love WARM, and might love indirect indexes too, but I need to feel that indirect indexes are a clear win for the added complexity, both in our code, and for the user API. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian wrote: > Therefore, I think we need WARM done first, then we can test indirect > indexes to see if they are a sufficient win to add it for the small > percentage of users who will use it. Agreed -- that's my plan. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 11, 2017 at 12:24:55PM -0300, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > Therefore, I think we need WARM done first, then we can test indirect > > indexes to see if they are a sufficient win to add it for the small > > percentage of users who will use it. > > Agreed -- that's my plan. Thanks. I hate to pour cold water on a feature, honestly, but I don't want us to over-react to the write amplification problem either. We don't want to have X features to improve it when a single feature is sufficient. (Amit might be right that the real win for indirect indexes will be some type of clustered index, where the win might be larger.) I know Uber dumped us "allegedly" over this issue (among other complaints), but I am concerned we are overreacting if we change Postgres too much to address this concern. Hence, I am arguing we don't add both features at the same time without evaluating the need for the second feature after the first feature is done. Let me give an example of us not over-reacting. When we implemented HOT (thanks Pavan), we considered the problem that our default fill factor for heap is 100%, so there is no room for HOT updates on a full page. Should we reduce the default fill factor when adding HOT? We decided not to, on the hope that the first update to a row on a full page would put the new row on a page with sufficient free space for future HOT updates, and that has proven to be the case. We do document lower full factors for heap to improve HOT updates, but I think everyone feels that is mostly for good performance after the initial table load, and that over time the frequently-updated rows will naturally migrate to pages with sufficient free space. My point is that we didn't over-react in that case, and the result was fine --- read-only rows got dense storage, and frequently-updated rows got sufficient free space for HOT, but we had to push HOT into production to confirm we were in good shape. I am thinking we need to complete WARM to figure out what it doesn't do well in production so we can fairly evaluate indirect indexes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +