Обсуждение: Bug in ginRedoRecompress that causes opaque data on page to beoverrun
Hi,
We recently encountered an issue where the opaque data flags on a gin data leaf page was corrupted while replaying a gin insert WAL record. Upon further examination of the redo code, we found a bug in ginRedoRecompress code, which extracts the WAL information and updates the page.
Specifically, when a new segment is inserted in the middle of a page, a memmove operation is performed [1] at the current point in the page to make room for the new segment. If this segment insertion is followed by delete segment actions that are yet to be processed and the total data size is very close to GinDataPageMaxDataSize, then we may move the data portion beyond the boundary causing the opaque data to be corrupted.
One way of solving this problem is to perform the replay work on a scratch space, perform sanity check on the total size of the data portion before copying it back to the actual page. While it involves additional memory allocation and memcpy operations, it is safer and similar to the 'do' code path where we ensure to make a copy of all segment past the first modified segment before placing them back on the page [2].
I have attached a patch for that approach here. Please let us know any comments or feedback.
Thanks!
Best
Siva
References:
[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/gin/ginxlog.c;h=7515f8bc167c2eafceced5d6ad5d74f7ec09e0a5;hb=refs/heads/REL9_6_STABLE#l278
[2] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/gin/gindatapage.c;h=cd3b9dfb784b084dd27a37146a4909fa1109ee81;hb=refs/heads/REL9_6_STABLE#l1726
Вложения
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
Hi, Siva! On Tue, Sep 4, 2018 at 11:01 PM R, Siva <sivasubr@amazon.com> wrote: > We recently encountered an issue where the opaque data flags on a gin data leaf page was corrupted while replaying a gininsert WAL record. Upon further examination of the redo code, we found a bug in ginRedoRecompress code, which extractsthe WAL information and updates the page. > > Specifically, when a new segment is inserted in the middle of a page, a memmove operation is performed [1] at the currentpoint in the page to make room for the new segment. If this segment insertion is followed by delete segment actionsthat are yet to be processed and the total data size is very close to GinDataPageMaxDataSize, then we may move thedata portion beyond the boundary causing the opaque data to be corrupted. > > One way of solving this problem is to perform the replay work on a scratch space, perform sanity check on the total sizeof the data portion before copying it back to the actual page. While it involves additional memory allocation and memcpyoperations, it is safer and similar to the 'do' code path where we ensure to make a copy of all segment past the firstmodified segment before placing them back on the page [2]. > > I have attached a patch for that approach here. Please let us know any comments or feedback. Do you have a test scenario for reproduction of this issue? We need it to ensure that fix is correct. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Sep 4, 2018 at 12:59 PM, R, Siva <sivasubr@amazon.com> wrote: > We recently encountered an issue where the opaque data flags on a gin data > leaf page was corrupted while replaying a gin insert WAL record. Upon > further examination of the redo code, we found a bug in ginRedoRecompress > code, which extracts the WAL information and updates the page. I wonder how you managed to detect it in the first place. Were you using something like wal_consistency_checking=all, with a custom stress-test? -- Peter Geoghegan
Hi Alexander! On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Do you have a test scenario for reproduction of this issue? We need > it to ensure that fix is correct. Unfortunately, I do not have a way of reproducing this issue. So far I have tried a workload consisting of inserts (of the same attribute value that is indexed), batch deletes of rows and vacuum interleaved with engine crash/restarts. Hi Peter! On Tue, Sep 4, 2018 at 09:55 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I wonder how you managed to detect it in the first place. Were you > using something like wal_consistency_checking=all, with a custom > stress-test? We observed this corruption during stress testing and were able to isolate the corrupted page and WAL record changes leading up to the corruption using some internal diagnostic tools. Best Siva
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
On Wed, Sep 5, 2018 at 1:45 AM R, Siva <sivasubr@amazon.com> wrote: > On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > Do you have a test scenario for reproduction of this issue? We need > > it to ensure that fix is correct. > > Unfortunately, I do not have a way of reproducing this issue. > So far I have tried a workload consisting of inserts (of the > same attribute value that is indexed), batch deletes of rows > and vacuum interleaved with engine crash/restarts. Issue reproduction and testing is essential for bug fix. Remember last time you reported GIN bug [1], after issue reproduction it appears that we have more things to fix. I's quite clear for me that if segment list contains GIN_SEGMENT_INSERT before GIN_SEGMENT_DELETE, then it might lead to wrong behavior in ginRedoRecompress(). But it's not yet clear to understand what code patch could lead to such segment list... I'll explore code more and probably will come with some idea. Links [1] https://www.postgresql.org/message-id/flat/1531867212836.63354%40amazon.com ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
On Wed, Sep 5, 2018 at 12:26 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Sep 5, 2018 at 1:45 AM R, Siva <sivasubr@amazon.com> wrote: > > On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > > Do you have a test scenario for reproduction of this issue? We need > > > it to ensure that fix is correct. > > > > Unfortunately, I do not have a way of reproducing this issue. > > So far I have tried a workload consisting of inserts (of the > > same attribute value that is indexed), batch deletes of rows > > and vacuum interleaved with engine crash/restarts. > > Issue reproduction and testing is essential for bug fix. Remember > last time you reported GIN bug [1], after issue reproduction it > appears that we have more things to fix. I's quite clear for me that > if segment list contains GIN_SEGMENT_INSERT before GIN_SEGMENT_DELETE, > then it might lead to wrong behavior in ginRedoRecompress(). But it's > not yet clear to understand what code patch could lead to such segment > list... I'll explore code more and probably will come with some idea. Aha, I've managed to reproduce this. 1. Apply ginRedoRecompress-asserts.patch, which add assertions to ginRedoRecompress() detecting past opaque writes. 2. Setup streaming replication. 3. Execute following on the master. create or replace function test () returns void as $$ declare i int; begin FOR i IN 1..1000 LOOP insert into test values ('{1}'); end loop; end $$ language plpgsql; create table test (a int[]); insert into test (select '{}'::int[] from generate_series(1,10000)); insert into test (select '{1}'::int[] from generate_series(1,100000)); create index test_idx on test using gin(a) with (fastupdate = off); delete from test where a = '{}'::int[]; vacuum test; select test(); So, since we managed to reproduce this, I'm going to test and review your fix. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Вложения
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
On Wed, Sep 5, 2018 at 2:39 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Sep 5, 2018 at 12:26 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > On Wed, Sep 5, 2018 at 1:45 AM R, Siva <sivasubr@amazon.com> wrote: > > > On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > > > Do you have a test scenario for reproduction of this issue? We need > > > > it to ensure that fix is correct. > > > > > > Unfortunately, I do not have a way of reproducing this issue. > > > So far I have tried a workload consisting of inserts (of the > > > same attribute value that is indexed), batch deletes of rows > > > and vacuum interleaved with engine crash/restarts. > > > > Issue reproduction and testing is essential for bug fix. Remember > > last time you reported GIN bug [1], after issue reproduction it > > appears that we have more things to fix. I's quite clear for me that > > if segment list contains GIN_SEGMENT_INSERT before GIN_SEGMENT_DELETE, > > then it might lead to wrong behavior in ginRedoRecompress(). But it's > > not yet clear to understand what code patch could lead to such segment > > list... I'll explore code more and probably will come with some idea. > > Aha, I've managed to reproduce this. > 1. Apply ginRedoRecompress-asserts.patch, which add assertions to > ginRedoRecompress() detecting past opaque writes. It was wrong, sorry. It appears that I put strict inequality into asserts, while there should be loose inequality. Correct version of patch is attached. And scenario I've posted isn't really reproducing the bug... ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Вложения
On Wed, Sep 5, 2018 at 4:59 AM, R, Siva <sivasubr@amazon.com> wrote: > Hi, > > We recently encountered an issue where the opaque data flags on a gin data > leaf page was corrupted while replaying a gin insert WAL record. Upon > further examination of the redo code, we found a bug in ginRedoRecompress > code, which extracts the WAL information and updates the page. > > Specifically, when a new segment is inserted in the middle of a page, a > memmove operation is performed [1] at the current point in the page to make > room for the new segment. If this segment insertion is followed by delete > segment actions that are yet to be processed and the total data size is very > close to GinDataPageMaxDataSize, then we may move the data portion beyond > the boundary causing the opaque data to be corrupted. > > One way of solving this problem is to perform the replay work on a scratch > space, perform sanity check on the total size of the data portion before > copying it back to the actual page. While it involves additional memory > allocation and memcpy operations, it is safer and similar to the 'do' code > path where we ensure to make a copy of all segment past the first modified > segment before placing them back on the page [2]. > Hmm, could you share the sequence of what kind of WAL has applied to the broken page? I suspect the segment list contains GIN_SEGMENT_REPLACE before GIN_SEGMENT_INSERT. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
On Wed, Sep 5, 2018 at 5:05 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Wed, Sep 5, 2018 at 2:39 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > On Wed, Sep 5, 2018 at 12:26 PM Alexander Korotkov > > <a.korotkov@postgrespro.ru> wrote: > > > On Wed, Sep 5, 2018 at 1:45 AM R, Siva <sivasubr@amazon.com> wrote: > > > > On Tue, Sep 4, 2018 at 09:16 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > > > > Do you have a test scenario for reproduction of this issue? We need > > > > > it to ensure that fix is correct. > > > > > > > > Unfortunately, I do not have a way of reproducing this issue. > > > > So far I have tried a workload consisting of inserts (of the > > > > same attribute value that is indexed), batch deletes of rows > > > > and vacuum interleaved with engine crash/restarts. > > > > > > Issue reproduction and testing is essential for bug fix. Remember > > > last time you reported GIN bug [1], after issue reproduction it > > > appears that we have more things to fix. I's quite clear for me that > > > if segment list contains GIN_SEGMENT_INSERT before GIN_SEGMENT_DELETE, > > > then it might lead to wrong behavior in ginRedoRecompress(). But it's > > > not yet clear to understand what code patch could lead to such segment > > > list... I'll explore code more and probably will come with some idea. > > > > Aha, I've managed to reproduce this. > > 1. Apply ginRedoRecompress-asserts.patch, which add assertions to > > ginRedoRecompress() detecting past opaque writes. > > It was wrong, sorry. It appears that I put strict inequality into > asserts, while there should be loose inequality. Correct version of > patch is attached. And scenario I've posted isn't really reproducing > the bug... Finally I managed to reproduce the bug. The scenario is following. Underlying idea is that when insertion of multiple tuples goes to the beginning of the page and this insertion succeed only thanks to collapse of some short segments together, then this insertion wouldn't fit to the page if given alone. create table test (i integer primary key, a int[]) with (fillfactor = 50); insert into test (select id, array[id%2]::int[] from generate_series(1,15376) id); create index test_idx on test using gin(a) with (fastupdate = off); update test set a = '{1}' where i % 4 = 0 or i % 16 = 2 or i % 64 in (6, 46, 36) or i % 256 = 54; vacuum test; insert into test (select id + 16376, '{0}' from generate_series(1,5180) id); update test set a = '{1}' where i <= 15376 and i % 256 in (182, 198); vacuum test; alter index test_idx set (fastupdate = on); delete from test where i <= 134 and a = '{1}'; vacuum test; insert into test (select id+30000, '{0}' from generate_series(1,110) id); vacuum test; With ginRedoRecompress-asserts-v2.patch following assertion is fired. TRAP: FailedAssertion("!(segptr + newsegsize + (szleft - segsize) <= ( ((void) ((_Bool) (! (!(PageValidateSpecialPointer(page))) || (ExceptionalCondition("!(PageValidateSpecialPointer(page))", ("FailedAssertion"), "ginxlog.c", 289), 0)))), (char *) ((char *) (page) + ((PageHeader) (page))->pd_special) ))", File: "ginxlog.c", Line: 289) ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Sep 5, 2018 at 08:55 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Finally I managed to reproduce the bug. The scenario is following. > Underlying idea is that when insertion of multiple tuples goes to the > beginning of the page and this insertion succeed only thanks to > collapse of some short segments together, then this insertion wouldn't > fit to the page if given alone. Sorry for the late reply. Thank you so much for working on this and reproducing the issue! Like you mentioned, the WAL record where we detected this problem has future segments deleted due to compaction and written out as an insert segment. > alter index test_idx set (fastupdate = on); Just curious why does this help with the repro? This is related to only using the Gin pending list vs the posting tree. I will try to reproduce the issue with the above workload and also test the fix with the same and report back. On Wed, Sep 5, 2018 at 5:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Hmm, could you share the sequence of what kind of WAL has applied to > the broken page? I suspect the segment list contains > GIN_SEGMENT_REPLACE before GIN_SEGMENT_INSERT. These are the segment operations on the WAL in sequence: - 1 Replace action on segment N - 5 Insert actions after segment N - the 5th insert action is essentially replacing the last 3 remaining segments with a new one. - 3 delete actions on the remaining segments. Best Siva
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
On Thu, Sep 6, 2018 at 12:53 AM R, Siva <sivasubr@amazon.com> wrote: > On Tue, Sep 5, 2018 at 08:55 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > Finally I managed to reproduce the bug. The scenario is following. > > Underlying idea is that when insertion of multiple tuples goes to the > > beginning of the page and this insertion succeed only thanks to > > collapse of some short segments together, then this insertion wouldn't > > fit to the page if given alone. > > Sorry for the late reply. > Thank you so much for working on this and reproducing the issue! > Like you mentioned, the WAL record where we detected this problem > has future segments deleted due to compaction and written out > as an insert segment. > > > alter index test_idx set (fastupdate = on); > Just curious why does this help with the repro? This is related to only > using the Gin pending list vs the posting tree. With (fastupdate = on) GIN performs bulk update of posting lists, inserting multiple tuples at once if possible. With (fastupdate = off) GIN always inserts tuples one-by-one. It might be still possible to reproduce the issue with (fastupdate = off), but it seems even harder. BTW, I've tried the patch you've posted. On my test case it fails with following assertion. TRAP: FailedAssertion("!(a_action == 2)", File: "ginxlog.c", Line: 243) I thought about fixing this issue more, and I decided we can fix it in less invasive way. Once modification is started we can copy tail of the page into separately allocated chunk of memory, and the use it as the source of original segments. See the patch attached. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Вложения
On Tue, Sep 6, 2018 at 09:53 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > With (fastupdate = on) GIN performs bulk update of posting lists, > inserting multiple tuples at once if possible. With (fastupdate = > off) GIN always inserts tuples one-by-one. It might be still possible > to reproduce the issue with (fastupdate = off), but it seems even > harder. Ah I see. This is cool, I will keep in mind for future testing. Thanks! > BTW, I've tried the patch you've posted. On my test case it fails > with following assertion. > TRAP: FailedAssertion("!(a_action == 2)", File: "ginxlog.c", Line: 243) > I thought about fixing this issue more, and I decided we can fix it in > less invasive way. Once modification is started we can copy tail of > the page into separately allocated chunk of memory, and the use it as > the source of original segments. See the patch attached. I'm also running into this assert with the workload, I think my patch is not handling the case where the action is add items on the last segment of the page correctly. I'm still investigating the issue further to find the source of the bug. Meanwhile I reviewed your patch and it looks good to me. I agree that copying out the entire tail out to the scratch space in one shot vs copying out every segment reduces the number of memcpy calls and simplifies the solution overall. Let us go ahead with this patch. Best Siva
Re: Bug in ginRedoRecompress that causes opaque data on page to be overrun
I'm also running into this assert with the workload, I think my patch is
not handling the case where the action is add items on the last segment
of the page correctly. I'm still investigating the issue further to find the
source of the bug.
Meanwhile I reviewed your patch and it looks good to me. I agree that
copying out the entire tail out to the scratch space in one shot vs copying
out every segment reduces the number of memcpy calls and simplifies
the solution overall. Let us go ahead with this patch.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company