Discussion: BRIN minmax multi - incorrect distance for infinite timestamp/date
Hi,

Ashutosh Bapat reported to me off-list a possible issue in how BRIN minmax-multi calculates the distance for infinite timestamp/date values.

The current code does this:

    if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
        PG_RETURN_FLOAT8(0);

which means infinite values are "very close" to any other value, and thus likely to be merged into a summary range. That's exactly the opposite of what we want to do, possibly resulting in inefficient indexes. Consider this example:

    create table test (a timestamptz) with (fillfactor=50);

    insert into test
    select (now() + ((10000 * random())::int || ' seconds')::interval)
      from generate_series(1,1000000) s(i);

    update test set a = '-infinity'::timestamptz where random() < 0.01;
    update test set a = 'infinity'::timestamptz where random() < 0.01;

    explain (analyze, timing off, costs off)
    select * from test where a = '2024-01-01'::timestamptz;

                                    QUERY PLAN
    ------------------------------------------------------------------------------
     Bitmap Heap Scan on test (actual rows=0 loops=1)
       Recheck Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone)
       Rows Removed by Index Recheck: 680662
       Heap Blocks: lossy=6024
       ->  Bitmap Index Scan on test_a_idx (actual rows=60240 loops=1)
             Index Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone)
     Planning Time: 0.075 ms
     Execution Time: 106.871 ms
    (8 rows)

Clearly, a large part of the table gets scanned - this happens because when building the index we end up with ranges like this:

    [-infinity,a,b,c,...,x,y,z,infinity]

and we conclude that the distance for [-infinity,a] is 0, so we combine these values into a range. The same happens for [z,infinity]. But we should do exactly the opposite thing - never merge those.
Attached is a patch fixing this, with which the plan looks like this:

                                    QUERY PLAN
    ------------------------------------------------------------------------------
     Bitmap Heap Scan on test (actual rows=0 loops=1)
       Recheck Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone)
       ->  Bitmap Index Scan on test_a_idx (actual rows=0 loops=1)
             Index Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone)
     Planning Time: 0.289 ms
     Execution Time: 9.432 ms
    (6 rows)

Which seems much better.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
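The patch itself is attached rather than inlined, but the direction of the fix can be sketched in standalone C: treat any non-finite endpoint as maximally distant, so ranges touching it never look like good merge candidates. The macros below mirror PostgreSQL's internal timestamp representation (microseconds as int64, with the infinities stored as INT64_MIN/INT64_MAX); this is an illustration of the idea, not the actual patch.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

typedef int64_t Timestamp;

/* PostgreSQL stores -infinity/+infinity as INT64_MIN/INT64_MAX */
#define TIMESTAMP_MINUS_INFINITY INT64_MIN
#define TIMESTAMP_INFINITY       INT64_MAX
#define TIMESTAMP_NOT_FINITE(ts) \
    ((ts) == TIMESTAMP_MINUS_INFINITY || (ts) == TIMESTAMP_INFINITY)

/*
 * Sketch of a fixed distance function: a non-finite endpoint is
 * infinitely far from everything, so it is never merged into a range
 * (the buggy code returned 0 here, i.e. "very close").
 */
static double
timestamp_distance(Timestamp dt1, Timestamp dt2)
{
    if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
        return (double) INFINITY;

    return (double) dt2 - (double) dt1;
}
```

With this shape of distance function, the merge step always prefers combining two finite values over absorbing an infinity into a range.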
On Thu, 12 Oct 2023 at 23:43, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Ashutosh Bapat reported to me off-list a possible issue in how BRIN minmax-multi calculates the distance for infinite timestamp/date values.
>
> The current code does this:
>
>     if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
>         PG_RETURN_FLOAT8(0);
>

Yes indeed, that looks wrong. I noticed the same thing while reviewing the infinite interval patch.

> which means infinite values are "very close" to any other value, and thus likely to be merged into a summary range. That's exactly the opposite of what we want to do, possibly resulting in inefficient indexes.
>

Is this only inefficient? Or can it also lead to wrong query results?

> Attached is a patch fixing this
>

I wonder if it's actually necessary to give infinity any special handling at all for dates and timestamps. For those types, "infinity" is actually just INT_MIN/MAX, which compares correctly with any finite value, and will be much larger/smaller than any common value, so it seems like it isn't necessary to give "infinite" values any special treatment. That would be consistent with date_cmp() and timestamp_cmp().

Something else that looks wrong about that BRIN code is that the integer subtraction might lead to overflow -- it is subtracting two integer values, then casting the result to float8. It should cast each input before subtracting, more like brin_minmax_multi_distance_int8().

IOW, I think brin_minmax_multi_distance_date/timestamp could be made basically identical to brin_minmax_multi_distance_int4/8.

Regards,
Dean
On 10/13/23 11:21, Dean Rasheed wrote:
> On Thu, 12 Oct 2023 at 23:43, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Ashutosh Bapat reported to me off-list a possible issue in how BRIN minmax-multi calculates the distance for infinite timestamp/date values.
>>
>> The current code does this:
>>
>>     if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
>>         PG_RETURN_FLOAT8(0);
>>
>
> Yes indeed, that looks wrong. I noticed the same thing while reviewing the infinite interval patch.
>
>> which means infinite values are "very close" to any other value, and thus likely to be merged into a summary range. That's exactly the opposite of what we want to do, possibly resulting in inefficient indexes.
>>
>
> Is this only inefficient? Or can it also lead to wrong query results?
>

I don't think it can produce incorrect results. It only affects which values we "merge" into an interval when building the summaries.

>> Attached is a patch fixing this
>>
>
> I wonder if it's actually necessary to give infinity any special handling at all for dates and timestamps. For those types, "infinity" is actually just INT_MIN/MAX, which compares correctly with any finite value, and will be much larger/smaller than any common value, so it seems like it isn't necessary to give "infinite" values any special treatment. That would be consistent with date_cmp() and timestamp_cmp().
>

Right, but ...

> Something else that looks wrong about that BRIN code is that the integer subtraction might lead to overflow -- it is subtracting two integer values, then casting the result to float8. It should cast each input before subtracting, more like brin_minmax_multi_distance_int8().
>

... it also needs to fix this, otherwise it overflows. Consider

    delta = dt2 - dt1;

and assume dt1 is INT64_MIN, or that dt2 is INT64_MAX.

> IOW, I think brin_minmax_multi_distance_date/timestamp could be made basically identical to brin_minmax_multi_distance_int4/8.
>

Right.
Attached is a patch doing it this way.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
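Dean's two observations combine into one fix: since date/timestamp infinities are just INT_MIN/MAX internally, no special case is needed at all, and casting each operand to double before subtracting removes the overflow. A minimal standalone sketch of a distance function shaped like brin_minmax_multi_distance_int8() (an illustration, not the committed code):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/*
 * Cast-first distance: the subtraction happens in double, so operands
 * like INT64_MIN / INT64_MAX cannot overflow (the int64 subtraction
 * would be undefined behaviour there).  Because -infinity/+infinity
 * are stored as INT64_MIN/INT64_MAX, they also come out as maximally
 * distant, with no special-case branch at all.
 */
static double
distance_int8(int64_t dt1, int64_t dt2)
{
    return (double) dt2 - (double) dt1;
}
```

This is why the date/timestamp distance functions can end up "basically identical" to the int4/int8 ones.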
On Fri, 13 Oct 2023 at 11:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 10/13/23 11:21, Dean Rasheed wrote:
> >
> > Is this only inefficient? Or can it also lead to wrong query results?
>
> I don't think it can produce incorrect results. It only affects which values we "merge" into an interval when building the summaries.
>

Ah, I get it now. These "distance" support functions are only used to see how far apart 2 ranges are, for the purposes of the algorithm that merges the 2 closest ranges. So if it gets it wrong, it only leads to a poor choice of ranges to merge, making the query inefficient, but still correct.

Presumably, that also makes this kind of change safe to back-patch (not sure if you were planning to do that?), since it will only affect range merging choices when inserting new values into existing indexes.

Regards,
Dean
On 10/13/23 14:04, Dean Rasheed wrote:
> On Fri, 13 Oct 2023 at 11:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 10/13/23 11:21, Dean Rasheed wrote:
>>>
>>> Is this only inefficient? Or can it also lead to wrong query results?
>>
>> I don't think it can produce incorrect results. It only affects which values we "merge" into an interval when building the summaries.
>>
>
> Ah, I get it now. These "distance" support functions are only used to see how far apart 2 ranges are, for the purposes of the algorithm that merges the 2 closest ranges. So if it gets it wrong, it only leads to a poor choice of ranges to merge, making the query inefficient, but still correct.
>

Right.

> Presumably, that also makes this kind of change safe to back-patch (not sure if you were planning to do that?), since it will only affect range merging choices when inserting new values into existing indexes.
>

I do plan to backpatch this, yes. I don't think there are many people affected by this (few people are using infinite dates/timestamps, but maybe the overflow could be more common).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 13 Oct 2023 at 13:17, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I do plan to backpatch this, yes. I don't think there are many people affected by this (few people are using infinite dates/timestamps, but maybe the overflow could be more common).
>

OK, though I doubt that such values are common in practice.

There's also an overflow problem in brin_minmax_multi_distance_interval() though, and I think that's worse because overflows there throw "interval out of range" errors, which can prevent index creation or inserts.

There's a patch (0009 in [1]) as part of the infinite interval work, but that just uses interval_mi(), and so doesn't fix the interval-out-of-range errors, except for infinite intervals, which are treated as special cases, which I don't think is really necessary.

I think this should be rewritten to compute delta from ia and ib without going via an intermediate Interval value. I.e., instead of computing "result", just do something like

    dayfraction = (ib->time % USECS_PER_DAY) - (ia->time % USECS_PER_DAY);
    days = (ib->time / USECS_PER_DAY) - (ia->time / USECS_PER_DAY);
    days += (int64) ib->day - (int64) ia->day;
    days += ((int64) ib->month - (int64) ia->month) * INT64CONST(30);

then convert to double precision as it does now:

    delta = (double) days + dayfraction / (double) USECS_PER_DAY;

So the first part is exact 64-bit integer arithmetic, with no chance of overflow, and it'll handle "infinite" intervals just fine, when that feature gets added.

Regards,
Dean

[1] https://www.postgresql.org/message-id/CAExHW5u1JE7dxK=WLzqhCszNToxQzJdieRmhREpW6r8w6kcRGQ@mail.gmail.com
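Dean's suggested computation can be tried as a self-contained sketch. The Interval struct below is a simplified stand-in for PostgreSQL's (same time/day/month fields), and INT64CONST is spelled with the standard INT64_C macro; the arithmetic is exactly the snippet above.

```c
#include <assert.h>
#include <stdint.h>

#define USECS_PER_DAY INT64_C(86400000000)

/* simplified stand-in for PostgreSQL's Interval */
typedef struct Interval
{
    int64_t time;   /* all time units other than days, months, years (usec) */
    int32_t day;    /* days */
    int32_t month;  /* months and years, after time for alignment */
} Interval;

/*
 * Dean's proposal: do all the subtraction in exact 64-bit integer
 * arithmetic (months counted as 30 days), and convert to double only
 * at the end, so no intermediate Interval can overflow.
 */
static double
interval_distance(const Interval *ia, const Interval *ib)
{
    int64_t dayfraction;
    int64_t days;

    dayfraction = (ib->time % USECS_PER_DAY) - (ia->time % USECS_PER_DAY);
    days = (ib->time / USECS_PER_DAY) - (ia->time / USECS_PER_DAY);
    days += (int64_t) ib->day - (int64_t) ia->day;
    days += ((int64_t) ib->month - (int64_t) ia->month) * INT64_C(30);

    return (double) days + (double) dayfraction / (double) USECS_PER_DAY;
}
```

For example, the distance between a zero interval and "1 month 1 day 12 hours" comes out as 31.5 days, and "1 month" vs "30 days" is distance 0, as expected from the 30-days-per-month convention.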
Thanks Tomas for bringing this discussion to hackers.

On Fri, Oct 13, 2023 at 8:58 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Fri, 13 Oct 2023 at 13:17, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >
> > I do plan to backpatch this, yes. I don't think there are many people affected by this (few people are using infinite dates/timestamps, but maybe the overflow could be more common).
> >

The example you gave is missing the CREATE INDEX command. Is it "create index test_idx_a on test using brin(a);"? Do already created indexes have this issue? Do they need to be rebuilt after upgrading?

> OK, though I doubt that such values are common in practice.
>
> There's also an overflow problem in brin_minmax_multi_distance_interval() though, and I think that's worse because overflows there throw "interval out of range" errors, which can prevent index creation or inserts.
>
> There's a patch (0009 in [1]) as part of the infinite interval work, but that just uses interval_mi(), and so doesn't fix the interval-out-of-range errors, except for infinite intervals, which are treated as special cases, which I don't think is really necessary.
>

Right. I used interval_mi() to preserve the finite value behaviour as is. But ...

> I think this should be rewritten to compute delta from ia and ib without going via an intermediate Interval value. I.e., instead of computing "result", just do something like
>
>     dayfraction = (ib->time % USECS_PER_DAY) - (ia->time % USECS_PER_DAY);
>     days = (ib->time / USECS_PER_DAY) - (ia->time / USECS_PER_DAY);
>     days += (int64) ib->day - (int64) ia->day;
>     days += ((int64) ib->month - (int64) ia->month) * INT64CONST(30);
>
> then convert to double precision as it does now:
>
>     delta = (double) days + dayfraction / (double) USECS_PER_DAY;
>

Given Tomas's explanation of how these functions are supposed to work, I think your suggestion is better.
I was worried that the above calculations may not produce the same result as the current code when there is no error, because modulo and integer division are not distributive over subtraction. But it looks like together they behave as normal division, which is distributive over subtraction. I couldn't find an example where that is not true.

Tomas, you may want to incorporate this in your patch. If not, I will incorporate it in my infinite interval patchset in [1].

[1] https://www.postgresql.org/message-id/CAExHW5u1JE7dxK=WLzqhCszNToxQzJdieRmhREpW6r8w6kcRGQ@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
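The distributivity worry can in fact be settled algebraically: C's truncating division guarantees x == (x/D)*D + x%D for every x, so the split form telescopes back to the direct difference, (b/D - a/D)*D + (b%D - a%D) = b - a, regardless of signs. A small illustrative check (not from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define D INT64_C(86400000000)  /* USECS_PER_DAY */

/*
 * C guarantees (x/D)*D + x%D == x (division truncates toward zero),
 * so:
 *   (b/D - a/D)*D + (b%D - a%D)
 * = ((b/D)*D + b%D) - ((a/D)*D + a%D)
 * = b - a
 * i.e. the quotient/remainder split is exactly the direct difference.
 */
static int
split_matches_direct(int64_t a, int64_t b)
{
    int64_t split = (b / D - a / D) * D + (b % D - a % D);

    return split == b - a;
}
```

The individual pieces (quotients, remainders) do differ from "normal" division for negative operands, but the recombined sum is identical, which is why no counterexample exists.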
On 10/16/23 11:25, Ashutosh Bapat wrote:
> Thanks Tomas for bringing this discussion to hackers.
>
> On Fri, Oct 13, 2023 at 8:58 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> On Fri, 13 Oct 2023 at 13:17, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I do plan to backpatch this, yes. I don't think there are many people affected by this (few people are using infinite dates/timestamps, but maybe the overflow could be more common).
>>>
>
> The example you gave is missing the CREATE INDEX command. Is it "create index test_idx_a on test using brin(a);"?

Ah, you're right - apologies. FWIW when testing I usually use 1-page ranges, to possibly hit more combinations / problems. More importantly, it needs to specify the opclass, otherwise it'll use the default minmax opclass which does not use distance at all:

    create index test_idx_a on test
    using brin (a timestamptz_minmax_multi_ops) with (pages_per_range=1);

>
> Do already created indexes have this issue? Do they need to be rebuilt after upgrading?
>

Yes, existing indexes will have inefficient ranges. I'm not sure we want to push people to reindex everything, the issue seems somewhat unlikely in practice.

>> OK, though I doubt that such values are common in practice.
>>
>> There's also an overflow problem in brin_minmax_multi_distance_interval() though, and I think that's worse because overflows there throw "interval out of range" errors, which can prevent index creation or inserts.
>>
>> There's a patch (0009 in [1]) as part of the infinite interval work, but that just uses interval_mi(), and so doesn't fix the interval-out-of-range errors, except for infinite intervals, which are treated as special cases, which I don't think is really necessary.
>>
>
> Right. I used interval_mi() to preserve the finite value behaviour as is. But ...
>
>> I think this should be rewritten to compute delta from ia and ib without going via an intermediate Interval value.
>> I.e., instead of computing "result", just do something like
>>
>>     dayfraction = (ib->time % USECS_PER_DAY) - (ia->time % USECS_PER_DAY);
>>     days = (ib->time / USECS_PER_DAY) - (ia->time / USECS_PER_DAY);
>>     days += (int64) ib->day - (int64) ia->day;
>>     days += ((int64) ib->month - (int64) ia->month) * INT64CONST(30);
>>
>> then convert to double precision as it does now:
>>
>>     delta = (double) days + dayfraction / (double) USECS_PER_DAY;
>>
>
> Given Tomas's explanation of how these functions are supposed to work, I think your suggestion is better.
>
> I was worried that the above calculations may not produce the same result as the current code when there is no error, because modulo and integer division are not distributive over subtraction. But it looks like together they behave as normal division, which is distributive over subtraction. I couldn't find an example where that is not true.
>
> Tomas, you may want to incorporate this in your patch. If not, I will incorporate it in my infinite interval patchset in [1].
>

I'd rather keep it as a separate patch, although maybe let's deal with it separately from the larger patches. It's a bug, and having it in a patch set that adds a feature does not seem like a good idea (or maybe I don't understand what the other thread does, I haven't looked very closely).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 16, 2023 at 7:33 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 10/16/23 11:25, Ashutosh Bapat wrote:
> > Thanks Tomas for bringing this discussion to hackers.
> >
> > On Fri, Oct 13, 2023 at 8:58 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> >>
> >> On Fri, 13 Oct 2023 at 13:17, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> I do plan to backpatch this, yes. I don't think there are many people affected by this (few people are using infinite dates/timestamps, but maybe the overflow could be more common).
> >>>
> >
> > The example you gave is missing the CREATE INDEX command. Is it "create index test_idx_a on test using brin(a);"?
>
> Ah, you're right - apologies. FWIW when testing I usually use 1-page ranges, to possibly hit more combinations / problems. More importantly, it needs to specify the opclass, otherwise it'll use the default minmax opclass which does not use distance at all:
>
>     create index test_idx_a on test
>     using brin (a timestamptz_minmax_multi_ops) with (pages_per_range=1);
>

Thanks.

> >
> > Do already created indexes have this issue? Do they need to be rebuilt after upgrading?
> >
>
> Yes, existing indexes will have inefficient ranges. I'm not sure we want to push people to reindex everything, the issue seems somewhat unlikely in practice.
>

They need to rebuild the index only if the column has infinity values. Such users may notice this bug fix in the release notes and decide to rebuild the index themselves.

> >> I think this should be rewritten to compute delta from ia and ib without going via an intermediate Interval value.
> >> I.e., instead of computing "result", just do something like
> >>
> >>     dayfraction = (ib->time % USECS_PER_DAY) - (ia->time % USECS_PER_DAY);
> >>     days = (ib->time / USECS_PER_DAY) - (ia->time / USECS_PER_DAY);
> >>     days += (int64) ib->day - (int64) ia->day;
> >>     days += ((int64) ib->month - (int64) ia->month) * INT64CONST(30);
> >>
> >> then convert to double precision as it does now:
> >>
> >>     delta = (double) days + dayfraction / (double) USECS_PER_DAY;
> >>
> >
> > Given Tomas's explanation of how these functions are supposed to work, I think your suggestion is better.
> >
> > I was worried that the above calculations may not produce the same result as the current code when there is no error, because modulo and integer division are not distributive over subtraction. But it looks like together they behave as normal division, which is distributive over subtraction. I couldn't find an example where that is not true.
> >
> > Tomas, you may want to incorporate this in your patch. If not, I will incorporate it in my infinite interval patchset in [1].
> >
>
> I'd rather keep it as a separate patch, although maybe let's deal with it separately from the larger patches. It's a bug, and having it in a patch set that adds a feature does not seem like a good idea (or maybe I don't understand what the other thread does, I haven't looked very closely).
>

If you incorporate these changes, I will need to remove 0009, which mostly rewrites that function, from my patchset. If you don't, my patch rewrites it anyway. Either way is fine with me.

-- 
Best Wishes,
Ashutosh Bapat
Hi,

Here's a couple of cleaned-up patches fixing the various issues discussed here. I've tried to always add a regression test demonstrating the issue first, and then fix it in the next patch.

In particular, this deals with these issues:

1) overflows in distance calculation for large timestamp values (0002)

2) incorrect subtraction in distance for date values (0003)

3) incorrect distance for infinite date/timestamp values (0005)

4) failing distance for extreme interval values (0007)

All the problems except "2" have been discussed earlier, but this seems a bit more serious than the other issues, as it's easier to hit. It subtracts the values in the opposite order (smaller - larger), so the distances are negated. Which means we actually merge the most distant values first, and thus are "guaranteed" to build a very inefficient summary. People with minmax-multi indexes on "date" columns probably will need to reindex.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
- 0001-Tests-for-overflows-with-dates-and-timestamps-in-BRI.patch
- 0002-Fix-overflow-in-brin_minmax_multi_distance_timestamp.patch
- 0003-Fix-calculation-in-brin_minmax_multi_distance_date.patch
- 0004-Add-tests-for-inifite-date-timestamp-values.patch
- 0005-Fix-handling-of-infinity-date-timestamp-values.patch
- 0006-Add-test-for-BRIN-minmax-multi-with-extreme-interval.patch
- 0007-Fix-distance-calculation-for-extreme-interval-values.patch
- 0008-Add-more-tests-for-BRIN-on-interval-values.patch
On 10/17/23 22:25, Tomas Vondra wrote:
> Hi,
>
> Here's a couple of cleaned-up patches fixing the various issues discussed here. I've tried to always add a regression test demonstrating the issue first, and then fix it in the next patch.
>
> In particular, this deals with these issues:
>
> 1) overflows in distance calculation for large timestamp values (0002)
>
> 2) incorrect subtraction in distance for date values (0003)
>
> 3) incorrect distance for infinite date/timestamp values (0005)
>
> 4) failing distance for extreme interval values (0007)
>
> All the problems except "2" have been discussed earlier, but this seems a bit more serious than the other issues, as it's easier to hit. It subtracts the values in the opposite order (smaller - larger), so the distances are negated. Which means we actually merge the most distant values first, and thus are "guaranteed" to build a very inefficient summary. People with minmax-multi indexes on "date" columns probably will need to reindex.
>

BTW when adding the tests with extreme values, I noticed this:

    test=# select '5874897-01-01'::date;
         date
    ---------------
     5874897-01-01
    (1 row)

    test=# select '5874897-01-01'::date + '1 second'::interval;
    ERROR:  date out of range for timestamp

IIUC this happens because the first thing date_pl_interval does is date2timestamp, ignoring the fact that the ranges of those data types are different - dates allow values up to '5874897 AD' while timestamps only allow values up to '294276 AD'.

This seems to be a long-standing behavior, added by a9e08392dd6f in 2004. Not sure how serious it is, I just noticed it when I tried to do arithmetics on the extreme values in tests.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 17 Oct 2023 at 21:25, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Here's a couple of cleaned-up patches fixing the various issues discussed here. I've tried to always add a regression test demonstrating the issue first, and then fix it in the next patch.
>

This looks good to me.

> 2) incorrect subtraction in distance for date values (0003)
>
> All the problems except "2" have been discussed earlier, but this seems a bit more serious than the other issues, as it's easier to hit. It subtracts the values in the opposite order (smaller - larger), so the distances are negated. Which means we actually merge the most distant values first, and thus are "guaranteed" to build a very inefficient summary.
>

Yeah, that's not good. Amusingly this accidentally made infinite dates behave correctly, since they were distance 0 away from anything else, which was larger than all the other negative distances! But yes, that needed fixing properly. Thanks for taking care of this.

Regards,
Dean
On Wed, Oct 18, 2023 at 1:55 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Here's a couple of cleaned-up patches fixing the various issues discussed here. I've tried to always add a regression test demonstrating the issue first, and then fix it in the next patch.

It will be good to commit the test changes as well.

>
> In particular, this deals with these issues:
>
> 1) overflows in distance calculation for large timestamp values (0002)

I could reduce the SQL for the timestamp overflow test to just

    -- test overflows during CREATE INDEX with extreme timestamp values
    CREATE TEMPORARY TABLE brin_timestamp_test(a TIMESTAMPTZ);

    SET datestyle TO iso;

    INSERT INTO brin_timestamp_test VALUES
    ('4713-01-01 00:00:30 BC'),
    ('294276-12-01 00:00:01');

    CREATE INDEX ON brin_timestamp_test
    USING brin (a timestamptz_minmax_multi_ops) WITH (pages_per_range=1);

I didn't understand the purpose of adding 60 odd values to the table. It didn't tell which of those values triggers the overflow. The minimal set above is much easier to understand IMO. Using a temporary table just avoids the DROP TABLE statement. But I am ok if you want to use a non-temporary table with DROP.

Code changes in 0002 look fine. Do we want to add a comment "cast to a wider datatype to avoid overflow"? Or is that too explicit?

The code changes fix the timestamp issue but there's a diff in case of

>
> 2) incorrect subtraction in distance for date values (0003)

The test case for the date brin index didn't crash though, even after applying the 0003 patch. The reason why date subtraction can't overflow is a bit obscure. PostgreSQL doesn't allow dates beyond 4714-12-31 BC because of the code below

    #define IS_VALID_DATE(d) \
        ((DATETIME_MIN_JULIAN - POSTGRES_EPOCH_JDATE) <= (d) && \
         (d) < (DATE_END_JULIAN - POSTGRES_EPOCH_JDATE))

This keeps the lower bound well away from the negative int32 overflow threshold, and we always subtract the lower value from the higher one. Maybe it would be good to elaborate this?
A later patch does use float8 casting, eliminating "any" possibility of overflow, so the comment may not be necessary after squashing all the changes.

>
> 3) incorrect distance for infinite date/timestamp values (0005)

The tests could use a minimal set of rows here too.

The code changes look fine and fix the problem seen with the tests alone.

>
> 4) failing distance for extreme interval values (0007)

I could reproduce the issue with a minimal set of values

    -- test handling of overflow for interval values
    CREATE TABLE brin_interval_test(a INTERVAL);

    INSERT INTO brin_interval_test VALUES
    ('177999985 years'),
    ('-178000000 years');

    CREATE INDEX ON brin_interval_test
    USING brin (a interval_minmax_multi_ops) WITH (pages_per_range=1);
    DROP TABLE brin_interval_test;

The code looks fine and fixes the issue seen with the test.

We may want to combine various test cases though. Like, the tests adding infinity and extreme values could be combined. Also the number of values they insert in the table, for the reasons stated above.

>
> All the problems except "2" have been discussed earlier, but this seems a bit more serious than the other issues, as it's easier to hit. It subtracts the values in the opposite order (smaller - larger), so the distances are negated. Which means we actually merge the most distant values first, and thus are "guaranteed" to build a very inefficient summary. People with minmax-multi indexes on "date" columns probably will need to reindex.
>

Right. Do we highlight that in the commit message so that the person writing release notes picks it up from there?

-- 
Best Wishes,
Ashutosh Bapat
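The "a bit obscure" reason can be made concrete with PostgreSQL's own date2j() algorithm (reproduced below; the epoch 2000-01-01, the lower limit date2j(-4713, 11, 24), and JULIAN_MAXYEAR = 5874898 are taken from the backend's datetime.h, so treat those constants as assumptions of this sketch). The valid DateADT range is very asymmetric: the minimum is only -2451545 days, while the maximum sits just under INT32_MAX, and even the full span (max - min) still fits in int32.

```c
#include <assert.h>
#include <stdint.h>

/*
 * PostgreSQL's date2j(): Julian day number for a year/month/day,
 * as in src/backend/utils/adt/datetime.c.
 */
static int
date2j(int y, int m, int d)
{
    int julian;
    int century;

    if (m > 2)
    {
        m += 1;
        y += 4800;
    }
    else
    {
        m += 13;
        y += 4799;
    }

    century = y / 100;
    julian = y * 365 - 32167;
    julian += y / 4 - century + century / 4;
    julian += 7834 * m / 256 + d;

    return julian;
}
```

Evaluating this gives POSTGRES_EPOCH_JDATE = date2j(2000, 1, 1) = 2451545 and DATETIME_MIN_JULIAN = date2j(-4713, 11, 24) = 0, so the smallest DateADT is -2451545; with DATE_END_JULIAN = date2j(5874898, 1, 1) the largest is 2145031948, and (max - min) = 2147483493, which is 154 short of INT32_MAX. That is why the (larger - smaller) date subtraction cannot overflow.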
On Wed, 18 Oct 2023 at 09:16, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> BTW when adding the tests with extreme values, I noticed this:
>
>     test=# select '5874897-01-01'::date;
>          date
>     ---------------
>      5874897-01-01
>     (1 row)
>
>     test=# select '5874897-01-01'::date + '1 second'::interval;
>     ERROR:  date out of range for timestamp
>

That's correct, because date + interval returns timestamp, and the value is out of range for a timestamp. This is equivalent to:

    select '5874897-01-01'::date::timestamp + '1 second'::interval;
    ERROR:  date out of range for timestamp

and I think it's good that it gives a different error from this:

    select '294276-01-01'::date::timestamp + '1 year'::interval;
    ERROR:  timestamp out of range

so you can tell that the overflow in the first case happens before the addition.

Of course a side effect of internally casting first is that you can't do things like this:

    select '5874897-01-01'::date - '5872897 years'::interval;
    ERROR:  date out of range for timestamp

which arguably ought to return '2000-01-01 00:00:00'. In practice though, I think it would be far more trouble than it's worth trying to change that.

Regards,
Dean
On 10/18/23 12:13, Dean Rasheed wrote:
> On Tue, 17 Oct 2023 at 21:25, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Here's a couple of cleaned-up patches fixing the various issues discussed here. I've tried to always add a regression test demonstrating the issue first, and then fix it in the next patch.
>>
>
> This looks good to me.
>
>> 2) incorrect subtraction in distance for date values (0003)
>>
>> All the problems except "2" have been discussed earlier, but this seems a bit more serious than the other issues, as it's easier to hit. It subtracts the values in the opposite order (smaller - larger), so the distances are negated. Which means we actually merge the most distant values first, and thus are "guaranteed" to build a very inefficient summary.
>>
>
> Yeah, that's not good. Amusingly this accidentally made infinite dates behave correctly, since they were distance 0 away from anything else, which was larger than all the other negative distances! But yes, that needed fixing properly.
>

Right. Apparently two wrongs can make a right, after all ;-)

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/18/23 12:47, Ashutosh Bapat wrote:
> On Wed, Oct 18, 2023 at 1:55 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Here's a couple of cleaned-up patches fixing the various issues discussed here. I've tried to always add a regression test demonstrating the issue first, and then fix it in the next patch.
>
> It will be good to commit the test changes as well.
>

I do plan to commit them, ofc. I was just explaining why I'm adding the tests first, and then fixing the issue in a separate commit.

>>
>> In particular, this deals with these issues:
>>
>> 1) overflows in distance calculation for large timestamp values (0002)
>
> I could reduce the SQL for the timestamp overflow test to just
>
>     -- test overflows during CREATE INDEX with extreme timestamp values
>     CREATE TEMPORARY TABLE brin_timestamp_test(a TIMESTAMPTZ);
>
>     SET datestyle TO iso;
>
>     INSERT INTO brin_timestamp_test VALUES
>     ('4713-01-01 00:00:30 BC'),
>     ('294276-12-01 00:00:01');
>
>     CREATE INDEX ON brin_timestamp_test
>     USING brin (a timestamptz_minmax_multi_ops) WITH (pages_per_range=1);
>
> I didn't understand the purpose of adding 60 odd values to the table. It didn't tell which of those values triggers the overflow. The minimal set above is much easier to understand IMO. Using a temporary table just avoids the DROP TABLE statement. But I am ok if you want to use a non-temporary table with DROP.
>
> Code changes in 0002 look fine. Do we want to add a comment "cast to a wider datatype to avoid overflow"? Or is that too explicit?
>
> The code changes fix the timestamp issue but there's a diff in case of
>

I did use that many values to actually force "compaction" and merging of points into ranges. We only keep 32 values per page range, so with 2 values we'll not build a range. You're right it may still trigger the overflow (we probably still calculate distances, I didn't realize that), but without the compaction we can't check the query plans.
However, I agree 60 values may be a bit too much. And I realized we can reduce the count quite a bit by using the values_per_range option. We could set it to 8 (which is the minimum).

>>
>> 2) incorrect subtraction in distance for date values (0003)
>
> The test case for the date brin index didn't crash though, even after applying the 0003 patch. The reason why date subtraction can't overflow is a bit obscure. PostgreSQL doesn't allow dates beyond 4714-12-31 BC because of the code below
>
>     #define IS_VALID_DATE(d) \
>         ((DATETIME_MIN_JULIAN - POSTGRES_EPOCH_JDATE) <= (d) && \
>          (d) < (DATE_END_JULIAN - POSTGRES_EPOCH_JDATE))
>
> This keeps the lower bound well away from the negative int32 overflow threshold, and we always subtract the lower value from the higher one. Maybe it would be good to elaborate this? A later patch does use float8 casting, eliminating "any" possibility of overflow, so the comment may not be necessary after squashing all the changes.
>

Not sure what you mean by "crash". Yes, it doesn't hit an assert, because there's none when calculating distance for date. It however should fail in the query plan check due to the incorrect order of subtractions.

Also, the commit message does not claim to fix overflow. In fact, it says it can't overflow ...

>>
>> 3) incorrect distance for infinite date/timestamp values (0005)
>
> The tests could use a minimal set of rows here too.
>
> The code changes look fine and fix the problem seen with the tests alone.
>

OK

>>
>> 4) failing distance for extreme interval values (0007)
>
> I could reproduce the issue with a minimal set of values
>
>     -- test handling of overflow for interval values
>     CREATE TABLE brin_interval_test(a INTERVAL);
>
>     INSERT INTO brin_interval_test VALUES
>     ('177999985 years'),
>     ('-178000000 years');
>
>     CREATE INDEX ON brin_interval_test
>     USING brin (a interval_minmax_multi_ops) WITH (pages_per_range=1);
>     DROP TABLE brin_interval_test;
>
> The code looks fine and fixes the issue seen with the test.
> > We may want to combine various test cases though. Like the test adding > infinity and extreme values could be combined. Also the number of > values it inserts in the table for the reasons stated above. > I prefer not to do that. I find it more comprehensible to keep the tests separate / testing different things. If the tests were expensive to setup or something like that, that'd be a different situation. >> >> All the problems except "2" have been discussed earlier, but this seems >> a bit more serious than the other issues, as it's easier to hit. It >> subtracts the values in the opposite order (smaller - larger), so the >> distances are negated. Which means we actually merge the values from the >> most distant ones, and thus are "guaranteed" to build very a very >> inefficient summary. People with multi-minmax indexes on "date" columns >> probably will need to reindex. >> > > Right. Do we highlight that in the commit message so that the person > writing release notes picks it up from there? > Yes, I think I'll mention what impact each issue can have on indexes. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
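The four fixes discussed above share a common shape: a distance function must return a non-negative value, must not overflow for extreme inputs, and must treat infinite values as infinitely far apart so they are never merged into a range. The following is a minimal Python sketch of that shape - it is illustrative only, not the actual C code from the patches; the sentinel constants mirror the `DT_NOBEGIN`/`DT_NOEND` definitions in PostgreSQL's timestamp.h, and values are modeled as raw int64 microsecond counts.

```python
# Illustrative model of the BRIN minmax-multi distance fixes discussed in
# this thread (a sketch, not the actual PostgreSQL C code). Timestamps are
# int64 microsecond counts; the infinity sentinels mirror timestamp.h.

DT_NOBEGIN = -(2**63)      # '-infinity' sentinel (INT64_MIN)
DT_NOEND = 2**63 - 1       # 'infinity' sentinel (INT64_MAX)

def timestamp_distance(dt1, dt2):
    # issue 3: infinite values must be "infinitely far" from everything,
    # so they are merged into a range last, not first
    if dt1 in (DT_NOBEGIN, DT_NOEND) or dt2 in (DT_NOBEGIN, DT_NOEND):
        return float('inf')
    # issue 1: widen before subtracting; in the C code this is a cast of
    # the int64 operands to float8, so extreme values cannot overflow
    # (Python ints never overflow - the cast matters in C)
    return float(dt2) - float(dt1)

def date_distance(d1, d2):
    # issue 2: subtract the smaller (d1) from the larger (d2), not the
    # other way around, so distances are positive and the greedy merge
    # combines the *closest* values instead of the most distant ones
    return float(d2) - float(d1)
```

In both functions the caller is assumed to pass the pair in sorted order (d1 <= d2), which is how the BRIN code invokes the distance support procedures on consecutive values of a sorted summary.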
On Wed, Oct 18, 2023 at 8:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I did use that many values to actually force "compaction" and merging of
> points into ranges. We only keep 32 values per page range, so with 2
> values we'll not build a range. You're right it may still trigger the
> overflow (we probably still calculate distances, I didn't realize that),
> but without the compaction we can't check the query plans.
>
> However, I agree 60 values may be a bit too much. And I realized we can
> reduce the count quite a bit by using the values_per_range option. We
> could set it to 8 (which is the minimum).
>

I haven't read the BRIN code, except the comments at the beginning of the
file. From what you describe it seems we will store the first 32 values
as is, but later as the number of values grows, create ranges from those?
Please point me to the relevant source code/documentation. Anyway, if
we can reduce the number of rows we insert, that will be good.

> Not sure what you mean by "crash". Yes, it doesn't hit an assert,
> because there's none when calculating distance for date. It however
> should fail in the query plan check due to the incorrect order of
> subtractions.
>
> Also, the commit message does not claim to fix overflow. In fact, it
> says it can't overflow ...
>

Reading the commit message

"Tests for overflows with dates and timestamps in BRIN ...
...
The new regression tests check this for date and timestamp data types.
It adds tables with data close to the allowed min/max values, and builds
a minmax-multi index on it."

I expected the CREATE INDEX statement to throw an error or fail the
"Assert(delta >= 0);" in brin_minmax_multi_distance_date(). But a
later commit mentions that the overflow is not possible.

> > We may want to combine various test cases though. Like the tests adding
> > infinity and extreme values could be combined. Also the number of
> > values it inserts in the table, for the reasons stated above.
> >
>
> I prefer not to do that. I find it more comprehensible to keep the tests
> separate / testing different things. If the tests were expensive to set
> up or something like that, that'd be a different situation.

Fair enough.

> > Right. Do we highlight that in the commit message so that the person
> > writing release notes picks it up from there?
> >
>
> Yes, I think I'll mention what impact each issue can have on indexes.

Thanks.

--
Best Wishes,
Ashutosh Bapat
On Thu, 19 Oct 2023, 05:32 Ashutosh Bapat, <ashutosh.bapat.oss@gmail.com> wrote:
On Wed, Oct 18, 2023 at 8:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I did use that many values to actually force "compaction" and merging of
> points into ranges. We only keep 32 values per page range, so with 2
> values we'll not build a range. You're right it may still trigger the
> overflow (we probably still calculate distances, I didn't realize that),
> but without the compaction we can't check the query plans.
>
> However, I agree 60 values may be a bit too much. And I realized we can
> reduce the count quite a bit by using the values_per_range option. We
> could set it to 8 (which is the minimum).
>
> I haven't read the BRIN code, except the comments at the beginning of the
> file. From what you describe it seems we will store the first 32 values
> as is, but later as the number of values grows, create ranges from those?
> Please point me to the relevant source code/documentation. Anyway, if
> we can reduce the number of rows we insert, that will be good.
I don't think 60 values is excessive, but instead of listing them out by hand, perhaps use generate_series().
Regards,
Dean
On 10/19/23 06:32, Ashutosh Bapat wrote:
> On Wed, Oct 18, 2023 at 8:23 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>> I did use that many values to actually force "compaction" and merging of
>> points into ranges. We only keep 32 values per page range, so with 2
>> values we'll not build a range. You're right it may still trigger the
>> overflow (we probably still calculate distances, I didn't realize that),
>> but without the compaction we can't check the query plans.
>>
>> However, I agree 60 values may be a bit too much. And I realized we can
>> reduce the count quite a bit by using the values_per_range option. We
>> could set it to 8 (which is the minimum).
>>
>
> I haven't read the BRIN code, except the comments at the beginning of the
> file. From what you describe it seems we will store the first 32 values
> as is, but later as the number of values grows, create ranges from those?
> Please point me to the relevant source code/documentation. Anyway, if
> we can reduce the number of rows we insert, that will be good.
>

I don't think we have documentation other than what's at the beginning
of the file. What the comment tries to explain is that the summary has a
maximum size (32 values by default), and each value can be either a
point or a range. A point requires one value, a range requires two.

So we accumulate values one by one - until 32 values, that's fine. Once
we get 33, we have to merge some of the points into ranges, and we do
that in a greedy way by distance. For example, this may happen:

33 values -> 31 values + 1 range [requires 33]
          -> 30 values + 1 range [requires 32]
          ...

The exact steps depend on which values/ranges are picked for the merge,
of course. In any case, there's no difference between the initial 32
values and the values added later.

Does that explain the algorithm? I'm not against clarifying the comment,
of course.

>>>
>>
>> Not sure what you mean by "crash". Yes, it doesn't hit an assert,
>> because there's none when calculating distance for date. It however
>> should fail in the query plan check due to the incorrect order of
>> subtractions.
>>
>> Also, the commit message does not claim to fix overflow. In fact, it
>> says it can't overflow ...
>>
>
> Reading the commit message
>
> "Tests for overflows with dates and timestamps in BRIN ...
> ...
> The new regression tests check this for date and timestamp data types.
> It adds tables with data close to the allowed min/max values, and builds
> a minmax-multi index on it."
>
> I expected the CREATE INDEX statement to throw an error or fail the
> "Assert(delta >= 0);" in brin_minmax_multi_distance_date(). But a
> later commit mentions that the overflow is not possible.
>

Hmmm, yeah. The comment should mention that date doesn't have issues
with overflows, but other bugs.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
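The greedy compaction Tomas describes can be sketched roughly as follows. This is a simplified Python model under stated assumptions - items are kept sorted, a point costs one slot and a range two, and the adjacent pair with the smallest gap is merged until the summary fits the budget. The real code in brin_minmax_multi.c differs in many details (deduplication, batched merges, per-type distance support procedures), so treat this only as an illustration of the idea.

```python
# Simplified sketch of BRIN minmax-multi compaction: not the PostgreSQL
# implementation, just the greedy merge-by-distance idea from the thread.

MAX_VALUES = 8  # summary budget in slots; 32 by default in PostgreSQL

def cost(item):
    lo, hi = item
    return 1 if lo == hi else 2  # a point uses one slot, a range two

def distance(prev_hi, next_lo):
    # stand-in for the per-type distance support procedure
    return next_lo - prev_hi

def compact(values, budget=MAX_VALUES):
    # items are (min, max) pairs in sorted order; points have min == max
    items = [(v, v) for v in sorted(values)]
    while sum(cost(it) for it in items) > budget and len(items) > 1:
        # greedily merge the adjacent pair with the smallest gap
        i = min(range(len(items) - 1),
                key=lambda j: distance(items[j][1], items[j + 1][0]))
        items[i:i + 2] = [(items[i][0], items[i + 1][1])]
    return items
```

For example, `compact([1, 2, 3, 100, 200, 201, 202, 300], budget=4)` collapses the two tight clusters into the ranges `(1, 100)` and `(200, 300)`. This also shows why the infinity bug matters: if the distance function returns 0 for infinite endpoints, the greedy step merges them first, producing ranges like [-infinity, a] that match almost everything.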
On 10/19/23 09:04, Dean Rasheed wrote:
> On Thu, 19 Oct 2023, 05:32 Ashutosh Bapat, <ashutosh.bapat.oss@gmail.com>
> wrote:
>
>     On Wed, Oct 18, 2023 at 8:23 PM Tomas Vondra
>     <tomas.vondra@enterprisedb.com> wrote:
>     >
>     > I did use that many values to actually force "compaction" and
>     > merging of points into ranges. We only keep 32 values per page
>     > range, so with 2 values we'll not build a range. You're right it
>     > may still trigger the overflow (we probably still calculate
>     > distances, I didn't realize that), but without the compaction we
>     > can't check the query plans.
>     >
>     > However, I agree 60 values may be a bit too much. And I realized
>     > we can reduce the count quite a bit by using the values_per_range
>     > option. We could set it to 8 (which is the minimum).
>     >
>
>     I haven't read the BRIN code, except the comments at the beginning
>     of the file. From what you describe it seems we will store the
>     first 32 values as is, but later as the number of values grows,
>     create ranges from those? Please point me to the relevant source
>     code/documentation. Anyway, if we can reduce the number of rows we
>     insert, that will be good.
>
> I don't think 60 values is excessive, but instead of listing them out by
> hand, perhaps use generate_series().
>

I tried to do that, but I ran into trouble with the "date" tests. I
needed to build values close to the min/max values, so I did
something like

SELECT '4713-01-01 BC'::date + (i || ' days')::interval FROM
generate_series(1,10) s(i);

And then the same for the max date, but that fails because of the
date/timestamp conversion in the date plus operator.

However, maybe two simple generate_series() would work ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 19, 2023 at 2:31 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Does that explain the algorithm? I'm not against clarifying the comment,
> of course.

Thanks a lot for this explanation. It's clear now.

> I tried to do that, but I ran into trouble with the "date" tests. I
> needed to build values close to the min/max values, so I did
> something like
>
> SELECT '4713-01-01 BC'::date + (i || ' days')::interval FROM
> generate_series(1,10) s(i);
>
> And then the same for the max date, but that fails because of the
> date/timestamp conversion in the date plus operator.
>
> However, maybe two simple generate_series() would work ...
>

Something like this?

select i::date from generate_series('4713-02-01 BC'::date,
                                    '4713-01-01 BC'::date,
                                    '-1 day'::interval) i;
       i
---------------
 4713-02-01 BC
 4713-01-31 BC
 4713-01-30 BC
 4713-01-29 BC
 4713-01-28 BC
 4713-01-27 BC
 4713-01-26 BC
 4713-01-25 BC
 4713-01-24 BC
 4713-01-23 BC
 4713-01-22 BC
 4713-01-21 BC
 4713-01-20 BC
 4713-01-19 BC
 4713-01-18 BC
 4713-01-17 BC
 4713-01-16 BC
 4713-01-15 BC
 4713-01-14 BC
 4713-01-13 BC
 4713-01-12 BC
 4713-01-11 BC
 4713-01-10 BC
 4713-01-09 BC
 4713-01-08 BC
 4713-01-07 BC
 4713-01-06 BC
 4713-01-05 BC
 4713-01-04 BC
 4713-01-03 BC
 4713-01-02 BC
 4713-01-01 BC
(32 rows)

--
Best Wishes,
Ashutosh Bapat
On 10/19/23 11:22, Ashutosh Bapat wrote:
> On Thu, Oct 19, 2023 at 2:31 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>> Does that explain the algorithm? I'm not against clarifying the comment,
>> of course.
>
> Thanks a lot for this explanation. It's clear now.
>
>> I tried to do that, but I ran into trouble with the "date" tests. I
>> needed to build values close to the min/max values, so I did
>> something like
>>
>> SELECT '4713-01-01 BC'::date + (i || ' days')::interval FROM
>> generate_series(1,10) s(i);
>>
>> And then the same for the max date, but that fails because of the
>> date/timestamp conversion in the date plus operator.
>>
>> However, maybe two simple generate_series() would work ...
>
> Something like this? select i::date from generate_series('4713-02-01
> BC'::date, '4713-01-01 BC'::date, '-1 day'::interval) i;

That works, but if you try the same thing with the largest date, that'll
fail:

select i::date from generate_series('5874896-12-01'::date,
                                    '5874897-01-01'::date,
                                    '1 day'::interval) i;

ERROR:  date out of range for timestamp

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 19, 2023 at 4:51 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 10/19/23 11:22, Ashutosh Bapat wrote:
> > On Thu, Oct 19, 2023 at 2:31 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >
> >> Does that explain the algorithm? I'm not against clarifying the
> >> comment, of course.
> >
> > Thanks a lot for this explanation. It's clear now.
> >
> >> I tried to do that, but I ran into trouble with the "date" tests. I
> >> needed to build values close to the min/max values, so I did
> >> something like
> >>
> >> SELECT '4713-01-01 BC'::date + (i || ' days')::interval FROM
> >> generate_series(1,10) s(i);
> >>
> >> And then the same for the max date, but that fails because of the
> >> date/timestamp conversion in the date plus operator.
> >>
> >> However, maybe two simple generate_series() would work ...
> >
> > Something like this? select i::date from generate_series('4713-02-01
> > BC'::date, '4713-01-01 BC'::date, '-1 day'::interval) i;
>
> That works, but if you try the same thing with the largest date, that'll
> fail:
>
> select i::date from generate_series('5874896-12-01'::date,
>                                     '5874897-01-01'::date,
>                                     '1 day'::interval) i;
>
> ERROR:  date out of range for timestamp

Hmm, I see. This uses the generate_series(timestamp, timestamp,
interval) version. date + integer -> date, though, so the following
works. It's also an example at
https://www.postgresql.org/docs/16/functions-srf.html.

# SELECT '5874896-12-01'::date + i FROM generate_series(1,10) s(i);

I think we should provide generate_series(date, date, integer) which
will use date + integer -> date.

--
Best Wishes,
Ashutosh Bapat
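The semantics of the hypothetical generate_series(date, date, integer) variant Ashutosh suggests can be sketched in a few lines. This is an illustrative model, not PostgreSQL code: dates are represented as plain day numbers (ints), so stepping stays entirely in the date domain and never passes through the narrower timestamp representation (PostgreSQL dates extend to 5874897 AD, while timestamps stop at 294276 AD - which is exactly why the interval-based version fails above).

```python
# Sketch of a hypothetical generate_series(date, date, integer).
# Dates are modeled as integer day numbers; stepping by whole days
# (date + integer -> date) never needs a date-to-timestamp conversion.

def generate_series_date(start_day, stop_day, step_days):
    if step_days == 0:
        raise ValueError("step size cannot equal zero")
    d = start_day
    # descending series are allowed when the step is negative
    while (d <= stop_day) if step_days > 0 else (d >= stop_day):
        yield d
        d += step_days
```

For instance, stepping from day 31 down to day 0 with step -1 yields the same 32 values as the descending date series in the earlier example.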
On Thu, Oct 19, 2023 at 6:11 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I think we should provide generate_series(date, date, integer) which
> will use date + integer -> date.

Just to be clear, I don't mean that this patch should add it.

--
Best Wishes,
Ashutosh Bapat
On 10/20/23 11:52, Ashutosh Bapat wrote:
> On Thu, Oct 19, 2023 at 6:11 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> I think we should provide generate_series(date, date, integer) which
>> will use date + integer -> date.
>
> Just to be clear, I don't mean that this patch should add it.
>

I'm not against adding such a generate_series() variant. For this patch
I'll use something like the query you proposed, I think.

I was thinking about the (date + interval) failure a bit more, and while
I think it's confusing, it's not quite wrong. The problem is that the
interval may have hours/minutes, so it makes sense that the operator
returns a timestamp. That's not what most operators do, where the data
type does not change. So a bit unexpected, but it seems correct.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
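The point about date + interval returning a timestamp can be illustrated with a small Python sketch (illustrative only; the function name is hypothetical). An interval may carry sub-day parts that a plain date cannot represent, so the date is first promoted to a timestamp-like value before the addition - which mirrors what the PostgreSQL operator does, and why its result type differs from its left operand.

```python
from datetime import date, datetime, time, timedelta

# Sketch of why PostgreSQL's date + interval returns a timestamp: the
# interval may have hours/minutes, so the date must first be promoted
# to a timestamp (here: a datetime at midnight) before adding.
def date_plus_interval(d: date, iv: timedelta) -> datetime:
    return datetime.combine(d, time.min) + iv
```

For example, adding a 5-hour interval to 2023-10-20 yields the timestamp 2023-10-20 05:00, a value no plain date can hold. The promotion is also the step that fails for dates beyond the timestamp range, since the date type covers a wider span of years.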
FWIW I've cleaned up and pushed all the patches we came up with in this
thread. And I've backpatched all of them to 14+.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 27, 2023 at 10:32 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> FWIW I've cleaned up and pushed all the patches we came up with in this
> thread. And I've backpatched all of them to 14+.
>

Thanks a lot Tomas.

--
Best Wishes,
Ashutosh Bapat