Discussion: logical changeset generation v4
Hi everyone,

Here is the newest version of logical changeset generation. Changes since last time round:

* loads and loads of bugfixes
* crash-safe persistence of in-memory structures across crash/restart
* very large transaction support (spooling to disk)
* rebased onto the newest version of xlogreader

Overview of the patches:

Xlogreader (separate patch):
[01] Centralize Assert* macros into c.h so it's common between backend/frontend
[02] Provide common malloc wrappers and palloc et al. emulation for frontend'ish environs
[03] Split out xlog reading into its own module called xlogreader
[04] Add pg_xlogdump contrib module

Those seem to be ready barring some infrastructure work around common backend/frontend code for xlogdump.

Add capability to map from (tablespace, relfilenode) to pg_class.oid:
[05] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
[06] Add RelationMapFilenodeToOid function to relmapper.c
[07] Add pg_relation_by_filenode to look up a relation by (tablespace, filenode)

Imo those are pretty solid, although there were some doubts about the correctness of [05], which I think are all fixed in this version: the fundamental problem with adding a (tablespace, relfilenode) syscache is that no unique index exists in pg_class over (relfilenode, reltablespace), because relfilenode is set to '0' (aka InvalidOid) when the table is either a shared table or a nailed table. This cannot really be changed, as pg_class.relfilenode is not authoritative for those and possibly cannot even be accessed (different table, early startup). We also don't want to rely on the null bitmap, so we can't set it to NULL. The reason I think it is safe to use the added RELFILENODE syscache as I have in these patches is that when looking up a (tablespace, filenode) pair, none of those duplicate '0' values will get looked up, as there is no point in looking up an invalid relfilenode.
Instead, the shared/nailed relfilenodes will have to get mapped via RelationMapFilenodeToOid. The alternative here seems to be to invent an attoptcache-style cache of our own, but given that the above syscache is fairly performance critical and should do invalidations in a sensible manner, that seems to be an unnecessary amount of code. Any opinions here?

[08] wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

It's useful to represent values that are not a valid CommandId. Add such a representation. Imo this is straightforward and easy.

[09] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader

For timetravel access to the catalog we need to be able to look up (cmin, cmax) pairs of catalog rows when we're 'inside' that TX. This patch just adapts the signature of the *Satisfies routines to expect a HeapTuple instead of a HeapTupleHeader. The amount of changes for that is fairly low as the HeapTupleSatisfiesVisibility macro already expected the former. It also makes sure the HeapTuple fields are set up in the few places that didn't already do so.

[10] wal_decoding: Allow walsenders to connect to a specific database

For logical decoding we need to be able to access the schema of a database - for that we need to be connected to a database. Thus allow recovery connections to connect to a specific database. This patch currently has the disadvantage that it's no longer possible to connect to a database that's actually named "replication", as the decision whether a connection goes to a database or not is made based upon dbname != replication. Better ideas?

[11] wal_decoding: Add alreadyLocked parameter to GetOldestXminNoLock

Pretty boring preparatory patch for being able to nail a certain xid as the global horizon. I don't think there is much to be said about this anymore; it has already been somewhat discussed.
[12] wal_decoding: Log xl_running_xacts at a higher frequency than checkpoints are done

Make the bgwriter emit an xl_running_xacts record every 15s if there is xlog activity in the system. Imo this isn't too complicated and is already beneficial for HS, so it could be applied separately.

[13] copydir: make fsync_fname public

This should probably go to some other file, fd.[ch]? Otherwise it's pretty trivial.

[14] wal decoding: Add information about a table's primary key to struct RelationData

Back when discussing the first prototype of BDR, Heikki was concerned about doing a search for the primary key during heap_delete. I agree that that isn't really a good idea. So remember the primary key (or a candidate key) when looking through the available indexes in RelationGetIndexList(). I don't really like the name rd_primary as it also contains candidate keys (i.e. indimmediate, indisunique, no expression, not null) - better suggestions? I don't think there is too much debatable in here, but there is no independent benefit of applying it separately.

[15] wal decoding: Introduce wal decoding via catalog timetravel

The heart of changeset generation. Built out of several parts:
* snapshot building infrastructure
* transaction reassembly
* shared memory state for replication slots
* new wal_level=logical that catches more data
* (local) output plugin interface
* (external) walsender interface

[16] wal decoding: Add a simple decoding module in contrib named 'test_decoding'

An example output plugin, also used in regression tests.

[17] wal decoding: Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes

An application to receive changes over the walsender/replication=1 interface.

[18] wal_decoding: Add test_logical_replication extension for easier testing of logical decoding

An extension that allows the use of logical decoding from SQL.
This isn't really suitable for production, high-performance use, but it's useful for development and, more importantly, it makes it relatively easy to write regression tests without new infrastructure.

I am starting to be happy about the state of this!

Open issues & questions:
1) testing infrastructure
2) determination of replication slots
3) options for output plugins
4) the general walsender interface
5) additional catalog tables

1) Currently all the tests are in patch [18], which is a contrib module. There are two reasons for putting them there: First, the tests need wal_level=logical, which isn't the case with the current regression tests. Second, I don't think the test_logical_replication functions should live in core, as they shouldn't be used for a production replication scenario (causes long-running transactions, requires polling), but I have failed to find a neat way to include a contrib extension in the plain regression tests.

2) Currently the logical replication infrastructure assigns a 'slot-id' when a new replica is set up. That slot id isn't really nice (e.g. "id-321578-3"). It also requires that [18] keeps state in a global variable to make writing regression tests easy. I think it would be better to make the user specify those replication slot ids, but I am not really sure about it.

3) Currently no options can be passed to an output plugin. I am thinking about making "INIT_LOGICAL_REPLICATION 'plugin'" accept the now widely used ('option' ['value'], ...) syntax and pass that to the output plugin's initialization function.
4) Does anybody object to:

-- allocate a permanent replication slot
INIT_LOGICAL_REPLICATION 'plugin' 'slotname' (options);

-- stream data
START_LOGICAL_REPLICATION 'slotname' 'recptr';

-- deallocate a permanent replication slot
FREE_LOGICAL_REPLICATION 'slotname';

5) Currently only catalog tables may be accessed; it's fairly trivial to extend this to additional tables if you can accept some (noticeable but not too big) overhead for modifications on those tables. I was thinking of making that an option for tables; that would be useful for replication solutions' configuration tables.

Further todo:
* don't reread so much WAL after a restart/crash
* better TOAST support - the current one can fail after a DROP TABLE
* only peg a new "catalog xmin" instead of the global xmin
* more docs about the internals
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside. There have been improvements vs 3.1 but not enough
* abort too-old replication slots

Puh. The current git tree is at:
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4

The xlogreader development happens in xlogreader_4.

Input?

Greetings,

Andres Freund

PS: Thanks for the input & help so far!

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
- 0001-Centralize-Assert-macros-into-c.h-so-its-common-betw.patch
- 0002-Provide-a-common-malloc-wrappers-and-palloc-et-al.-e.patch.gz
- 0003-Split-out-xlog-reading-into-its-own-module-called-xl.patch.gz
- 0004-Add-pg_xlogdump-contrib-module.patch
- 0005-wal_decoding-Add-a-new-RELFILENODE-syscache-to-fetch.patch
- 0006-wal_decoding-Add-RelationMapFilenodeToOid-function-t.patch
- 0007-wal-decoding-Add-pg_relation_by_filenode-to-lookup-u.patch
- 0008-wal_decoding-Introduce-InvalidCommandId-and-declare-.patch
- 0009-wal_decoding-Adjust-all-Satisfies-routines-to-take-a.patch
- 0010-wal_decoding-Allow-walsender-s-to-connect-to-a-speci.patch
- 0011-wal_decoding-Add-alreadyLocked-parameter-to-GetOldes.patch
- 0012-wal_decodign-Log-xl_running_xact-s-at-a-higher-frequ.patch
- 0013-wal_decoding-copydir-make-fsync_fname-public.patch
- 0014-wal-decoding-Add-information-about-a-tables-primary-.patch
- 0015-wal-decoding-Introduce-wal-decoding-via-catalog-time.patch.gz
- 0017-wal-decoding-Introduce-pg_receivellog-the-pg_receive.patch
- 0018-wal_decoding-Add-test_logical_replication-extension-.patch
- 0019-wal-decoding-design-document-v2.3-and-snapshot-build.patch.gz
- 0016-wal-decoding-Add-a-simple-decoding-module-in-contrib.patch
Andreas,

Is there a git fork for logical replication somewhere?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> wrote:
>Andreas,
>
>Is there a git fork for logical replication somewhere?

Check the bottom of the email ;)

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
At 2013-01-14 18:15:39 -0800, josh@agliodbs.com wrote:
>
> Is there a git fork for logical replication somewhere?

git://git.postgresql.org/git/users/andresfreund/postgres.git, branch xlog-decoding-rebasing-cf4 (and xlogreader_v4).

-- Abhijit
At 2013-01-15 02:38:45 +0100, andres@2ndquadrant.com wrote:
>
> 2) Currently the logical replication infrastructure assigns a
> 'slot-id' when a new replica is setup. That slot id isn't really
> nice (e.g. "id-321578-3"). It also requires that [18] keeps state
> in a global variable to make writing regression tests easy.
>
> I think it would be better to make the user specify those replication
> slot ids, but I am not really sure about it.

I agree, it would be better to let the user name the slot (and report an error if the given name is already in use).

> 3) Currently no options can be passed to an output plugin. I am
> thinking about making "INIT_LOGICAL_REPLICATION 'plugin'" accept the
> now widely used ('option' ['value'], ...) syntax and pass that to the
> output plugin's initialization function.

Sounds good.

> 4) Does anybody object to:
> -- allocate a permanent replication slot
> INIT_LOGICAL_REPLICATION 'plugin' 'slotname' (options);
>
> -- stream data
> START_LOGICAL_REPLICATION 'slotname' 'recptr';
>
> -- deallocate a permanent replication slot
> FREE_LOGICAL_REPLICATION 'slotname';

That looks fine, but I think it should be:

INIT_LOGICAL_REPLICATION 'slotname' 'plugin' (options);

i.e., swap 'plugin' and 'slotname' in your proposal to make the slotname come first for all three commands. Not important, but a wee bit nicer.

-- Abhijit
Andres Freund wrote:

I've been giving a couple of these parts a look. In particular:

> [03] Split out xlog reading into its own module called xlogreader

Cleaned this one up a bit last week. I will polish it some more, publish for some final comments, and commit.

> [08] wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

This seems reasonable. Mainly it has the effect that a transaction can have exactly one less command than before. I don't think this is a problem for anyone in practice.

> [09] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader

Seemed okay when I looked at it.

> Second, I don't think the test_logical_replication functions should live
> in core as they shouldn't be used for a production replication scenario
> (causes longrunning transactions, requires polling), but I have failed
> to find a neat way to include a contrib extension in the plain
> regression tests.

I think this would work if you make a "stamp" file in the contrib module, similar to how doc/src/sgml uses those.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15/01/13 14:38, Andres Freund wrote:
> Hi everyone,
>
> Here is the newest version of logical changeset generation.

I'm quite interested in this feature - so tried applying the 19 patches to the latest 9.3 checkout. Patch and compile are good.

However portals seem busted:

bench=# BEGIN;
BEGIN
bench=# DECLARE c1 CURSOR FOR SELECT * FROM pgbench_accounts;
DECLARE CURSOR
bench=# FETCH 2 FROM c1;
 aid | bid | abalance | filler
-----+-----+----------+--------------------------------------
   1 |   1 |        0 |
   2 |   1 |        0 |
(2 rows)

bench=# DELETE FROM pgbench_accounts WHERE CURRENT OF c1;
The connection to the server was lost. Attempting reset: Failed.
On 15/01/13 17:37, Mark Kirkwood wrote:
> On 15/01/13 14:38, Andres Freund wrote:
>> Hi everyone,
>>
>> Here is the newest version of logical changeset generation.
>
> I'm quite interested in this feature - so tried applying the 19
> patches to the latest 9.3 checkout. Patch and compile are good.
>
> However portals seem busted:
>
> bench=# BEGIN;
> BEGIN
> bench=# DECLARE c1 CURSOR FOR SELECT * FROM pgbench_accounts;
> DECLARE CURSOR
> bench=# FETCH 2 FROM c1;
>  aid | bid | abalance | filler
> -----+-----+----------+--------
>    1 |   1 |        0 |
>    2 |   1 |        0 |
> (2 rows)
>
> bench=# DELETE FROM pgbench_accounts WHERE CURRENT OF c1;
> The connection to the server was lost. Attempting reset: Failed.

Sorry - forgot to add: assert and debug build, and it is an assertion failure that is being picked up:

TRAP: FailedAssertion("!(htup->t_tableOid != ((Oid) 0))", File: "tqual.c", Line: 940)
On 2013-01-15 17:41:50 +1300, Mark Kirkwood wrote:
> On 15/01/13 17:37, Mark Kirkwood wrote:
>> I'm quite interested in this feature - so tried applying the 19 patches to
>> the latest 9.3 checkout. Patch and compile are good.

Thanks! Any input welcome. The git tree might make it easier to follow development ;)

>> However portals seem busted:
>>
>> bench=# DELETE FROM pgbench_accounts WHERE CURRENT OF c1;
>> The connection to the server was lost. Attempting reset: Failed.
>
> Sorry - forgot to add: assert and debug build, and it is an assertion
> failure that is being picked up:
>
> TRAP: FailedAssertion("!(htup->t_tableOid != ((Oid) 0))", File: "tqual.c",
> Line: 940)

I unfortunately don't see the error here, I guess it's related to how stack is reused. But I think I found the error - check the attached patch, which I also pushed to the git repository.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On 2013-01-15 01:00:00 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> I've been giving a couple of these parts a look. In particular
>
> > [03] Split out xlog reading into its own module called xlogreader
>
> Cleaned this one up a bit last week. I will polish it some more,
> publish for some final comments, and commit.

I have some smaller bugfixes in my current version that you probably don't have yet (on grounds of being fixed this weekend)... So we need to be a bit careful not to lose those.

> > Second, I don't think the test_logical_replication functions should live
> > in core as they shouldn't be used for a production replication scenario
> > (causes longrunning transactions, requires polling), but I have failed
> > to find a neat way to include a contrib extension in the plain
> > regression tests.
>
> I think this would work if you make a "stamp" file in the contrib
> module, similar to how doc/src/sgml uses those.

I tried that; the problem is not the building itself but getting it installed into the temporary installation... But anyway, testing decoding requires a different wal_level, so I was hesitant to continue on grounds of that alone. Should we just up that? It's probably problematic for tests around some WAL optimizations and such?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On 2013-01-15 01:00:00 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > I've been giving a couple of these parts a look. In particular
> > > > [03] Split out xlog reading into its own module called xlogreader
> > >
> > > Cleaned this one up a bit last week. I will polish it some more,
> > > publish for some final comments, and commit.
>
> I have some smaller bugfixes in my current version that you probably
> don't have yet (on grounds of being fixed this weekend)... So we need to
> be a bit careful not to lose those.

Sure. Do you have them as individual commits? I'm assuming you rebased the tree. Maybe in your reflog? IIRC I also have at least one minor bug fix.

> > > Second, I don't think the test_logical_replication functions should live
> > > in core as they shouldn't be used for a production replication scenario
> > > (causes longrunning transactions, requires polling), but I have failed
> > > to find a neat way to include a contrib extension in the plain
> > > regression tests.
> >
> > I think this would work if you make a "stamp" file in the contrib
> > module, similar to how doc/src/sgml uses those.
>
> I tried that; the problem is not the building itself but getting it
> installed into the temporary installation...

Oh, hm. Maybe the contrib module's make installcheck, then?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-15 09:56:41 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > I have some smaller bugfixes in my current version that you probably
> > don't have yet (on grounds of being fixed this weekend)... So we need to
> > be a bit careful not to lose those.
>
> Sure. Do you have them as individual commits? I'm assuming you rebased
> the tree. Maybe in your reflog? IIRC I also have at least one minor
> bug fix.

I can check - which commit did you base your modifications on?

> > I tried that; the problem is not the building itself but getting it
> > installed into the temporary installation...
>
> Oh, hm. Maybe the contrib module's make installcheck, then?

That's what I do right now, but I really would prefer to have it checked during normal make checks; installchecks aren't run all that commonly :(

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-15 09:56:41 -0300, Alvaro Herrera wrote:
>> Oh, hm. Maybe the contrib module's make installcheck, then?

> That's what I do right now, but I really would prefer to have it checked
> during normal make checks; installchecks aren't run all that commonly :(

Sure they are, in every buildfarm cycle. I don't see the problem.

(In the case of contrib, make installcheck is a whole lot faster than make check, as well as being older. So I don't really see why you think it's less used.)

			regards, tom lane
On 2013-01-15 10:28:28 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-01-15 09:56:41 -0300, Alvaro Herrera wrote:
> >> Oh, hm. Maybe the contrib module's make installcheck, then?
>
> > That's what I do right now, but I really would prefer to have it checked
> > during normal make checks; installchecks aren't run all that commonly :(
>
> Sure they are, in every buildfarm cycle. I don't see the problem.
>
> (In the case of contrib, make installcheck is a whole lot faster than
> make check, as well as being older. So I don't really see why you
> think it's less used.)

Oh. Because I was being dumb ;). And I admittedly have never run a buildfarm animal so far.

But the other part of the problem is hiding in the unfortunately removed part of the problem description - the tests require the non-default options wal_level=logical and max_logical_slots=3+. Is there a problem with making those the default in the buildfarm-created config?

I guess there would need to be an alternative output file for wal_level < logical? Afaics there is no way to make a test conditional? I shortly thought something like

DO $$
BEGIN
    IF current_setting('wal_level') != 'logical' THEN
        RAISE FATAL 'wal_level needs to be logical';
    END IF;
END
$$;

could be used to avoid creating a huge second output file, but we can't raise FATAL errors from plpgsql.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> But the other part of the problem is hiding in the unfortunately removed
> part of the problem description - the tests require the non-default
> options wal_level=logical and max_logical_slots=3+.

Oh. Well, that's not going to work.

> Is there a problem with making those the default in the buildfarm-created
> config?

Even if we hacked the buildfarm script to do so, it'd be a nonstarter because it would cause ordinary manual "make installcheck" to fail.

I think the only reasonable way to handle this would be to (1) make "make installcheck" a no-op in this contrib module, and (2) make "make check" work, being careful to start the test postmaster with the necessary options.

			regards, tom lane
On 2013-01-15 11:10:22 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > But the other part of the problem is hiding in the unfortunately removed
> > part of the problem description - the tests require the non-default
> > options wal_level=logical and max_logical_slots=3+.
>
> Oh. Well, that's not going to work.

An alternative would be to have max_logical_slots default to a low value and make the amount of logged information a wal_level-independent GUC that can be changed on the fly. ISTM that that would result in too-complicated code to deal with other backends not having the same notion of that value and such, but it's possible.

> > Is there a problem with making those the default in the buildfarm-created
> > config?
>
> Even if we hacked the buildfarm script to do so, it'd be a nonstarter
> because it would cause ordinary manual "make installcheck" to fail.

I thought we could have a second expected file for that case. Not nice :(

> I think the only reasonable way to handle this would be to (1) make
> "make installcheck" a no-op in this contrib module, and (2) make
> "make check" work, being careful to start the test postmaster with
> the necessary options.

You're talking about adding a contrib-module-specific make check, or changing the normal make check's wal_level?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-15 11:10:22 -0500, Tom Lane wrote:
>> I think the only reasonable way to handle this would be to (1) make
>> "make installcheck" a no-op in this contrib module, and (2) make
>> "make check" work, being careful to start the test postmaster with
>> the necessary options.

> You're talking about adding a contrib-module-specific make check, or
> changing the normal make check's wal_level?

This contrib module's "make check" would change the wal_level. Global change no good for any number of reasons, the most obvious being that we want to be able to test other wal_levels too.

I'm not sure whether the "make check" infrastructure currently supports passing arguments through to the test postmaster's command line, but it shouldn't be terribly hard to add if not.

			regards, tom lane
Andres Freund wrote:
> > > I have some smaller bugfixes in my current version that you probably
> > > don't have yet (on grounds of being fixed this weekend)... So we need to
> > > be a bit careful not to lose those.
> >
> > Sure. Do you have them as individual commits? I'm assuming you rebased
> > the tree. Maybe in your reflog? IIRC I also have at least one minor
> > bug fix.
>
> I can check - which commit did you base your modifications on?

Dunno. It's probably easier to reverse-apply the version you submitted to see what changed, and then forward-apply again.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-15 15:16:44 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > I can check - which commit did you base your modifications on?
>
> Dunno. It's probably easier to reverse-apply the version you submitted
> to see what changed, and then forward-apply again.

There's at least the two attached patches...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
Andres Freund wrote:
> [09] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
>
> For timetravel access to the catalog we need to be able to look up (cmin,
> cmax) pairs of catalog rows when we're 'inside' that TX. This patch just
> adapts the signature of the *Satisfies routines to expect a HeapTuple
> instead of a HeapTupleHeader. The amount of changes for that is fairly
> low as the HeapTupleSatisfiesVisibility macro already expected the
> former.
>
> It also makes sure the HeapTuple fields are set up in the few places that
> didn't already do so.

I had a look at this part. Running the regression tests unveiled a case where the tableOid wasn't being set (and thus caused an assertion to fail), so I added that. I also noticed that the additions to pruneheap.c are sometimes filling a tuple before it's strictly necessary, leading to wasted work. Moved those too.

Looks good to me as attached.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
Alvaro Herrera wrote:
> I had a look at this part. Running the regression tests unveiled a case
> where the tableOid wasn't being set (and thus caused an assertion to
> fail), so I added that. I also noticed that the additions to
> pruneheap.c are sometimes filling a tuple before it's strictly
> necessary, leading to wasted work. Moved those too.

Actually I missed that downthread there are some fixes to this part; I had fixed one of these independently, but there's one I missed. Added that one too now (not attaching a new version).

(Also, it seems pointless to commit this unless we know for sure that the downstream change that requires it is good; so I'm holding commit until we've discussed the other stuff more thoroughly. Per discussion with Andres.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 18, 2013 at 11:33 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Andres Freund wrote:
>
>> [09] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
>>
>> For timetravel access to the catalog we need to be able to look up (cmin,
>> cmax) pairs of catalog rows when we're 'inside' that TX. This patch just
>> adapts the signature of the *Satisfies routines to expect a HeapTuple
>> instead of a HeapTupleHeader. The amount of changes for that is fairly
>> low as the HeapTupleSatisfiesVisibility macro already expected the
>> former.
>>
>> It also makes sure the HeapTuple fields are set up in the few places that
>> didn't already do so.
>
> I had a look at this part. Running the regression tests unveiled a case
> where the tableOid wasn't being set (and thus caused an assertion to
> fail), so I added that. I also noticed that the additions to
> pruneheap.c are sometimes filling a tuple before it's strictly
> necessary, leading to wasted work. Moved those too.
>
> Looks good to me as attached.

I took a quick look at this and am just curious why we're adding the requirement that t_tableOid has to be initialized?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-01-18 11:48:43 -0500, Robert Haas wrote:
> I took a quick look at this and am just curious why we're adding the
> requirement that t_tableOid has to be initialized?

It's a stepping stone for catalog timetravel. I separated it into a different patch because that seems to make the real patch easier to review, without having to deal with all those unrelated hunks.

The reason why we need t_tableOid and a valid ItemPointer is that during catalog timetravel (so we can decode the heaptuples in WAL) we need to see tuples in the catalog that have been changed in the transaction we travelled to. That means we need to look up cmin/cmax values, which aren't stored separately anymore. My first approach was to build support for logging allocated combocids (only for catalog tables) and use the existing combocid infrastructure to look them up.
Turns out that's not a correct solution; consider this:

* T100: INSERT (xmin: 100, xmax: Invalid, (cmin|cmax): 3)
* T101: UPDATE (xmin: 100, xmax: 101, (cmin|cmax): 10)

If you now travel to T100 and want to decide whether that tuple is visible at CommandId = 5, you have the problem that the original cmin value has been overwritten by the cmax from T101. Note that in this scenario no ComboCids have been generated! The problematic part is that the information about what happened is only available in T101.

I resolved to do something similar to what the heap rewrite code uses to track update chains. Every time a catalog tuple is inserted/updated/deleted, (filenode, ctid, cmin, cmax) is WAL-logged (if wal_level=logical), and while traveling to a transaction all those are put into a hash table so they can be looked up if we need the respective cmin/cmax values. As we do that for all modifications of catalog tuples in that transaction, we only ever need that mapping when inspecting that specific transaction. Seems to work very nicely; I have run quite a few tests with it and know of no failure cases.

To be able to make that lookup we need the relfilenode & item pointer of the tuple we're just looking up. That's why I changed the signature to pass a HeapTuple instead of a HeapTupleHeader. We get the relfilenode from the buffer that has been passed, *not* from the passed table oid. So a valid table oid isn't strictly required as long as the item pointer is valid, but it has made debugging noticeably easier.

Makes sense?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
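The (filenode, ctid) → (cmin, cmax) mapping Andres describes can be sketched in miniature. Everything below is illustrative stand-in code under assumed, simplified type names; the actual patch builds this from WAL records during decoding and would use the backend's dynahash rather than a fixed array:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the backend's types. */
typedef uint32_t Oid;
typedef uint32_t CommandId;

typedef struct ItemPointerData { uint32_t blkno; uint16_t offnum; } ItemPointerData;

/* One WAL-logged mapping entry: which catalog tuple, and its cmin/cmax. */
typedef struct CidMapping {
    Oid             relfilenode;
    ItemPointerData tid;
    CommandId       cmin;
    CommandId       cmax;
} CidMapping;

/* A toy fixed-size table; the real code would use a proper hash table. */
#define MAX_MAPPINGS 64
static CidMapping mappings[MAX_MAPPINGS];
static int n_mappings = 0;

/* Collect one mapping while replaying the transaction's WAL. */
static void record_cid_mapping(Oid relfilenode, ItemPointerData tid,
                               CommandId cmin, CommandId cmax)
{
    assert(n_mappings < MAX_MAPPINGS);
    mappings[n_mappings++] = (CidMapping){ relfilenode, tid, cmin, cmax };
}

/* Look up the cmin/cmax pair for a catalog tuple; returns 0 if not found. */
static int lookup_cid_mapping(Oid relfilenode, ItemPointerData tid,
                              CommandId *cmin, CommandId *cmax)
{
    for (int i = 0; i < n_mappings; i++)
    {
        if (mappings[i].relfilenode == relfilenode &&
            mappings[i].tid.blkno == tid.blkno &&
            mappings[i].tid.offnum == tid.offnum)
        {
            *cmin = mappings[i].cmin;
            *cmax = mappings[i].cmax;
            return 1;
        }
    }
    return 0;
}
```

Because the relfilenode comes from the buffer and the tid from the tuple itself, the *Satisfies routines only need the full HeapTuple, which is exactly the signature change the patch makes.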
Robert Haas <robertmhaas@gmail.com> writes: > I took a quick look at this and am just curious why we're adding the > requirement that t_tableOid has to be initialized? I assume he meant it had been left at a random value, which is surely bad practice even if a specific usage doesn't fall over today. regards, tom lane
On 13-01-14 08:38 PM, Andres Freund wrote:
> Hi everyone,
>
> Here is the newest version of logical changeset generation.
>
> 2) Currently the logical replication infrastructure assigns a 'slot-id'
> when a new replica is setup. That slot id isn't really nice
> (e.g. "id-321578-3"). It also requires that [18] keeps state in a global
> variable to make writing regression tests easy.
>
> I think it would be better to make the user specify those replication
> slot ids, but I am not really sure about it.

Shortly after trying out the latest version I hit the following scenario:
1. I started pg_receivellog but mistyped the name of my plugin
2. It looped and used up all of my logical replication slots

I killed pg_receivellog and restarted it with the correct plugin name, but it won't do anything because I have no free slots. I can't free the slots with -F because I have no clue what the names of the slots are. I can figure the names out by looking in pg_llog, but my replication program can't do that, so it won't be able to clean up from a failed attempt.

I agree with you that we should make the user program specify a slot; we eventually might want to provide a view that shows the currently allocated slots. For a logical based slony I would just generate the slot name based on the remote node id. If walsender generates the slot name then I would need to store a mapping between slot names and slons so when a slon restarted it would know which slot to resume using. I'd have to use a table in the slony schema on the remote database for this. There would always be a risk of losing track of a slot id if the slon crashed after getting the slot number but before committing the mapping on the remote database.

> 3) Currently no options can be passed to an output plugin. I am thinking
> about making "INIT_LOGICAL_REPLICATION 'plugin'" accept the now widely
> used ('option' ['value'], ...) syntax and pass that to the output
> plugin's initialization function.
I think we discussed this last CF; I like this idea.

> 4) Does anybody object to:
> -- allocate a permanent replication slot
> INIT_LOGICAL_REPLICATION 'plugin' 'slotname' (options);
>
> -- stream data
> START_LOGICAL_REPLICATION 'slotname' 'recptr';
>
> -- deallocate a permanent replication slot
> FREE_LOGICAL_REPLICATION 'slotname';

+1

> 5) Currently it's only allowed to access catalog tables; it's fairly
> trivial to extend this to additional tables if you can accept some
> (noticeable but not too big) overhead for modifications on those tables.
>
> I was thinking of making that an option for tables, that would be useful
> for replication solutions' configuration tables.

I think this will make the life of anyone developing a new replication system easier. Slony has a lot of infrastructure for allowing slonik scripts to wait for configuration changes to propagate everywhere before making other configuration changes, because you can get race conditions. If I were designing a new replication system and I had this feature then I would try to use it to come up with a simpler model of propagating configuration changes.

> Andres Freund

Steve
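Steve's point above, that a consumer should derive its slot name deterministically from its own identity instead of persisting a walsender-chosen one, can be sketched trivially. The `slony_node_%d` naming scheme below is purely hypothetical, not anything Slony or the patch actually defines:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a deterministic slot name from the remote node id.  A crashed
 * consumer can always recompute which slot belongs to it and clean up or
 * resume, with no extra mapping table on the remote database and no
 * window where the mapping could be lost.  The name pattern is purely
 * illustrative. */
static void slot_name_for_node(int remote_node_id, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "slony_node_%d", remote_node_id);
}
```

With walsender-generated names, by contrast, the consumer must commit the returned name somewhere durable before using it, which is exactly the crash window Steve describes.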
On Fri, Jan 18, 2013 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Makes sense? Yes. The catalog timetravel stuff still gives me heartburn. The idea of treating system catalogs in a special way has never sat well with me and still doesn't - not that I am sure what I'd like better. The complexity of the whole system is also somewhat daunting. But my question with relation to this specific patch was mostly whether setting the table OID everywhere was worth worrying about from a performance standpoint, or whether any of the other adjustments this patch makes could have negative consequences there, since the Satisfies functions can get very hot on some workloads. It seems like the consensus is "no, that's not worth worrying about", at least as far as the table OIDs are concerned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-01-20 21:45:11 -0500, Robert Haas wrote:
> On Fri, Jan 18, 2013 at 12:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Makes sense?
>
> Yes. The catalog timetravel stuff still gives me heartburn. The idea
> of treating system catalogs in a special way has never sat well with
> me and still doesn't - not that I am sure what I'd like better. The
> complexity of the whole system is also somewhat daunting.

Understandable :( Although it seems to me most parts of it have already existed someplace else in the pg code, and the actual timetravel code is relatively small.

> But my question with relation to this specific patch was mostly
> whether setting the table OID everywhere was worth worrying about from
> a performance standpoint, or whether any of the other adjustments this
> patch makes could have negative consequences there, since the
> Satisfies functions can get very hot on some workloads. It seems like
> the consensus is "no, that's not worth worrying about", at least as
> far as the table OIDs are concerned.

I agree; I don't really see any such potential here. The effectively changed amount of code is very minor since the interface mostly stayed the same due to the HeapTupleSatisfiesVisibility wrapper.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

I pushed a new rebased version (the xlogreader commit made it annoying to merge). The main improvements are:
* much more coherent internal code for initializing logical rep
* explicit control over slots
* options for logical replication

On 2013-01-19 23:42:02 -0500, Steve Singer wrote:
> On 13-01-14 08:38 PM, Andres Freund wrote:
> > 2) Currently the logical replication infrastructure assigns a 'slot-id'
> > when a new replica is setup. That slot id isn't really nice
> > (e.g. "id-321578-3"). It also requires that [18] keeps state in a global
> > variable to make writing regression tests easy.
> >
> > I think it would be better to make the user specify those replication
> > slot ids, but I am not really sure about it.
>
> Shortly after trying out the latest version I hit the following scenario:
> 1. I started pg_receivellog but mistyped the name of my plugin
> 2. It looped and used up all of my logical replication slots
>
> I killed pg_receivellog and restarted it with the correct plugin name, but it
> won't do anything because I have no free slots. I can't free the slots with
> -F because I have no clue what the names of the slots are. I can figure
> the names out by looking in pg_llog, but my replication program can't do
> that, so it won't be able to clean up from a failed attempt.
>
> I agree with you that we should make the user program specify a slot; we
> eventually might want to provide a view that shows the currently allocated
> slots. For a logical based slony I would just generate the slot name based
> on the remote node id. If walsender generates the slot name then I would
> need to store a mapping between slot names and slons so when a slon
> restarted it would know which slot to resume using. I'd have to use a
> table in the slony schema on the remote database for this. There would
> always be a risk of losing track of a slot id if the slon crashed after
> getting the slot number but before committing the mapping on the remote
> database.
This is changed now; slot names need to be provided and there also is a pg_stat_logical_replication view (thanks Abhijit!).

> > 3) Currently no options can be passed to an output plugin. I am thinking
> > about making "INIT_LOGICAL_REPLICATION 'plugin'" accept the now widely
> > used ('option' ['value'], ...) syntax and pass that to the output
> > plugin's initialization function.
>
> I think we discussed this last CF, I like this idea.

Added to the extension and walsender interface. It's used in the few tests we have to specify that xids should not be included in the tests, for reproducibility, so it's even tested ;) I haven't added code for setting up options via pg_receivellog yet.

> > 5) Currently it's only allowed to access catalog tables; it's fairly
> > trivial to extend this to additional tables if you can accept some
> > (noticeable but not too big) overhead for modifications on those tables.
> >
> > I was thinking of making that an option for tables, that would be useful
> > for replication solutions' configuration tables.
>
> I think this will make the life of anyone developing a new replication
> system easier. Slony has a lot of infrastructure for allowing slonik
> scripts to wait for configuration changes to propagate everywhere before
> making other configuration changes because you can get race conditions. If
> I were designing a new replication system and I had this feature then I
> would try to use it to come up with a simpler model of propagating
> configuration changes.

Working on it now.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-19 23:42:02 -0500, Steve Singer wrote:
> > 5) Currently it's only allowed to access catalog tables; it's fairly
> > trivial to extend this to additional tables if you can accept some
> > (noticeable but not too big) overhead for modifications on those tables.
> >
> > I was thinking of making that an option for tables, that would be useful
> > for replication solutions' configuration tables.
>
> I think this will make the life of anyone developing a new replication
> system easier. Slony has a lot of infrastructure for allowing slonik
> scripts to wait for configuration changes to propagate everywhere before
> making other configuration changes because you can get race conditions. If
> I were designing a new replication system and I had this feature then I
> would try to use it to come up with a simpler model of propagating
> configuration changes.

I pushed support for this; it turned out to be a rather moderate patch (after a cleanup patch that was required anyway):

 src/backend/access/common/reloptions.c | 10 ++++++++++
 src/backend/utils/cache/relcache.c     |  9 ++++++++-
 src/include/utils/rel.h                |  9 +++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)

With the (attached for convenience) patch applied you can do

# ALTER TABLE replication_metadata SET (treat_as_catalog_table = true);

to enable this. What I wonder about is:
* does anybody have a better name for the reloption?
* Currently this can be set mid-transaction, but it only takes effect in the next transaction; it doesn't error out when set mid-transaction. I personally find that acceptable; other opinions?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On Wed, Jan 23, 2013 at 7:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > With the (attached for convenience) patch applied you can do > # ALTER TABLE replication_metadata SET (treat_as_catalog_table = true); > > to enable this. > What I wonder about is: > * does anybody have a better name for the reloption? IMHO, it should somehow involve the words "logical" and "replication". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-01-23 10:18:50 -0500, Robert Haas wrote:
> On Wed, Jan 23, 2013 at 7:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > With the (attached for convenience) patch applied you can do
> > # ALTER TABLE replication_metadata SET (treat_as_catalog_table = true);
> >
> > to enable this.
> > What I wonder about is:
> > * does anybody have a better name for the reloption?
>
> IMHO, it should somehow involve the words "logical" and "replication".

Not a bad point. In the back of my mind I was thinking of reusing it to do error checking when accessing the heap via index methods, as a way of making sure index support writers are aware of the complexities of doing so (c.f. ALTER TYPE .. ADD VALUE only being usable outside transactions). But that's probably over the top.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From: Andres Freund
Hi,

I decided to reply on the patches thread to be able to find this later.

On 2013-01-23 22:48:50 +0200, Heikki Linnakangas wrote:
> "logical changeset generation v4"
> This is a boatload of infrastructure for supporting logical replication, yet
> we have no code actually implementing logical replication that would go with
> this. The premise of logical replication over trigger-based was that it'd be
> faster, yet we cannot assess that without a working implementation. I don't
> think this can be committed in this state.

It's a fair point that this is a huge amount of code without a user in itself in-core. But the reason it got no user included is that several people explicitly didn't want a user in-core for now and said the first part of this would be to implement the changeset generation as a separate piece. Didn't you actually prefer not to have any users of this in-core yourself?

Also, while the apply side surely isn't benchmarkable without any being submitted, the changeset generation can very well be benchmarked. A very, very adhoc benchmark:

-c max_wal_senders=10
-c max_logical_slots=10      -- disabled for anything but logical
-c wal_level=logical         -- hot_standby for anything but logical
-c checkpoint_segments=100
-c log_checkpoints=on
-c shared_buffers=512MB
-c autovacuum=on
-c log_min_messages=notice
-c log_line_prefix='[%p %t] '
-c wal_keep_segments=100
-c fsync=off
-c synchronous_commit=off

pgbench -p 5440 -h /tmp -n -M prepared -c 16 -j 16 -T 30 pgbench

pgbench upstream:
tps: 22275.941409
space overhead: 0%

pgbench logical-submitted:
tps: 16274.603046
space overhead: 2.1%

pgbench logical-HEAD (will submit updated version tomorrow or so):
tps: 20853.341551
space overhead: 2.3%

pgbench single plpgsql trigger (INSERT INTO log(data) VALUES(NEW::text)):
tps: 14101.349535
space overhead: 369%

Note that in the single trigger case nobody consumed the queue, while the logical version streamed the changes out and stored them to disk.
Adding a default NOW() or similar to the tables immediately makes logical decoding faster by a factor of about 3 in comparison to the above trivial trigger. The only reason the submitted version of logical decoding is comparatively slow is that its xmin update policy is braindamaged, working on that right now. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 23, 2013 at 5:30 PM, Andres Freund <andres@2ndquadrant.com> wrote: > pgbench upstream: > tps: 22275.941409 > space overhead: 0% > pgbench logical-submitted > tps: 16274.603046 > space overhead: 2.1% > pgbench logical-HEAD (will submit updated version tomorrow or so): > tps: 20853.341551 > space overhead: 2.3% > pgbench single plpgsql trigger (INSERT INTO log(data) VALUES(NEW::text)) > tps: 14101.349535 > space overhead: 369% > > Note that in the single trigger case nobody consumed the queue while the > logical version streamed the changes out and stored them to disk. > > Adding a default NOW() or similar to the tables immediately makes > logical decoding faster by a factor of about 3 in comparison to the > above trivial trigger. > > The only reason the submitted version of logical decoding is > comparatively slow is that its xmin update policy is braindamaged, > working on that right now. I agree. The thing that scares me about the logical replication stuff is not that it might be slow (and if your numbers are to be believed, it isn't), but that I suspect it's riddled with bugs and possibly some questionable design decisions. If we commit it and release it, then we're going to be stuck maintaining it for a very, very long time. If it turns out to have serious bugs that can't be fixed without a new major release, it's going to be a serious black eye for the project. Of course, I have no evidence that that will happen. But it is a really big piece of code, and therefore unless you are superman, it's probably got a really large number of bugs. The scary thing is that it is not as if we can say, well, this is a big hunk of code, but it doesn't really touch the core of the system, so if it's broken, it'll be broken itself, but it won't break anything else. Rather, this code is deeply in bed with WAL, with MVCC, and with the on-disk format of tuples, and makes fundamental changes to the first two of those. 
You agreed with Tom that 9.2 is the buggiest release in recent memory, but I think logical replication could easily be an order of magnitude worse. I also have serious concerns about checksums and foreign key locks. Any single one of those three patches could really inflict unprecedented damage on our community's reputation for stability and reliability if they turn out to be seriously buggy, and unfortunately I don't consider that an unlikely outcome. I don't know what to do about it, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From: "Joshua D. Drake"
On 01/23/2013 05:17 PM, Robert Haas wrote:
> Of course, I have no evidence that that will happen. But it is a
> really big piece of code, and therefore unless you are superman, it's
> probably got a really large number of bugs. The scary thing is that
> it is not as if we can say, well, this is a big hunk of code, but it
> doesn't really touch the core of the system, so if it's broken, it'll
> be broken itself, but it won't break anything else. Rather, this code
> is deeply in bed with WAL, with MVCC, and with the on-disk format of
> tuples, and makes fundamental changes to the first two of those. You
> agreed with Tom that 9.2 is the buggiest release in recent memory, but
> I think logical replication could easily be an order of magnitude
> worse.

Command Prompt worked for YEARS to get logical replication right and we never got it to the point where I would have been happy submitting it to -core. It behooves .Org to be extremely conservative about this feature. Granted, it is a feature we should have had years ago, but still. It is not a simple thing, it is not an easy thing. It is complicated and complex to get correct.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579
On 24 January 2013 01:17, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree. The thing that scares me about the logical replication stuff
> is not that it might be slow (and if your numbers are to be believed,
> it isn't), but that I suspect it's riddled with bugs and possibly some
> questionable design decisions. If we commit it and release it, then
> we're going to be stuck maintaining it for a very, very long time. If
> it turns out to have serious bugs that can't be fixed without a new
> major release, it's going to be a serious black eye for the project.
>
> Of course, I have no evidence that that will happen.

This is a generic argument against applying any invasive patch. I agree 9.2 had major bugs on release, though that was because of the invasive nature of some of the changes, even in seemingly minor patches. The most invasive and therefore risky changes in this release are already committed - changes to the way WAL reading and timelines work. If we don't apply a single additional patch in this CF, we will still, in my opinion, have a major requirement for beta testing prior to release.

The code executed here is isolated to users of the new feature and is therefore low risk to non-users. Of course there will be bugs. Everybody understands what new feature means, and we as a project aren't exposed to risks from this. New feature also means groundbreaking new capabilities, so the balance of high reward and low risk means this gets my vote to apply. I'm just about to spend some days giving a final review of it to confirm/refute that opinion in technical detail.

Code using these features is available and marked clearly as a full copyright transfer to PGDG, TPL licensed. That code is external not by the author's choice, but at the specific request of the project to make it that way. I personally will be looking to add code to core over time.
It was useful for everybody that replication solutions started out of core, but replication is now a core requirement for databases and we must fully deliver on that thought. I agree with your concern re: checksums and foreign key locks. FK locks has had considerable review and support, so I expect that to be a manageable issue. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From: Heikki Linnakangas
On 24.01.2013 00:30, Andres Freund wrote:
> Hi,
>
> I decided to reply on the patches thread to be able to find this later.
>
> On 2013-01-23 22:48:50 +0200, Heikki Linnakangas wrote:
> > "logical changeset generation v4"
> > This is a boatload of infrastructure for supporting logical replication, yet
> > we have no code actually implementing logical replication that would go with
> > this. The premise of logical replication over trigger-based was that it'd be
> > faster, yet we cannot assess that without a working implementation. I don't
> > think this can be committed in this state.
>
> It's a fair point that this is a huge amount of code without a user in
> itself in-core.
> But the reason it got no user included is that several people
> explicitly didn't want a user in-core for now and said the first part of
> this would be to implement the changeset generation as a separate
> piece. Didn't you actually prefer not to have any users of this in-core
> yourself?

Yes, I certainly did. But we still need to see the other piece of the puzzle to see how this fits with it.

BTW, why does all the transaction reordering stuff have to be in core?

How much of this infrastructure is to support replicating DDL changes? IOW, if we drop that requirement, how much code can we slash? Any other features or requirements that could be dropped?

I think it's clear at this stage that this patch is not going to be committed as it is. If you can reduce it to a fraction of what it is now, that fraction might have a chance. Otherwise, it's just going to be pushed to the next commitfest as a whole, and we're going to be having the same doubts and discussions then.

- Heikki
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From: Andres Freund
Hi Robert, Hi all,

On 2013-01-23 20:17:04 -0500, Robert Haas wrote:
> On Wed, Jan 23, 2013 at 5:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The only reason the submitted version of logical decoding is
> > comparatively slow is that its xmin update policy is braindamaged,
> > working on that right now.
>
> I agree. The thing that scares me about the logical replication stuff
> is not that it might be slow (and if your numbers are to be believed,
> it isn't), but that I suspect it's riddled with bugs and possibly some
> questionable design decisions. If we commit it and release it, then
> we're going to be stuck maintaining it for a very, very long time. If
> it turns out to have serious bugs that can't be fixed without a new
> major release, it's going to be a serious black eye for the project.

That's much more along the lines of what I am afraid of than the performance stuff - but Heikki cited those, so I replied to that. Note that I didn't say this must, must go in - I just don't think Heikki's reasoning about why not hits the nail on the head.

> Of course, I have no evidence that that will happen. But it is a
> really big piece of code, and therefore unless you are superman, it's
> probably got a really large number of bugs. The scary thing is that
> it is not as if we can say, well, this is a big hunk of code, but it
> doesn't really touch the core of the system, so if it's broken, it'll
> be broken itself, but it won't break anything else. Rather, this code
> is deeply in bed with WAL, with MVCC, and with the on-disk format of
> tuples, and makes fundamental changes to the first two of those. You
> agreed with Tom that 9.2 is the buggiest release in recent memory, but
> I think logical replication could easily be an order of magnitude
> worse.

I tried very, very hard to get the basics of the design & interface solid. Which obviously doesn't mean I am succeeding - not being superhuman after all ;)
And I think that's very much where input is desperately needed and where I failed to raise enough attention. The "output plugin" interface, followed by the walsender interface, is what needs to be most closely vetted. Those are the permanent, user/developer-exposed UI and the ones we should try to keep as stable as possible.

The output plugin callbacks are defined here:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/include/replication/output_plugin.h;hb=xlog-decoding-rebasing-cf4

To make it more agnostic of the technology used to implement changeset extraction, we possibly should replace the ReorderBuffer(TXN|Change) structs being passed by something more implementation-agnostic.

walsender interface:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/backend/replication/repl_gram.y;hb=xlog-decoding-rebasing-cf4

The interesting new commands are:
1) K_INIT_LOGICAL_REPLICATION NAME NAME
2) K_START_LOGICAL_REPLICATION NAME RECPTR plugin_options
3) K_FREE_LOGICAL_REPLICATION NAME

1 & 3 allocate (respectively free) the permanent state associated with one changeset consumer, whereas START_LOGICAL_REPLICATION streams out changes starting at RECPTR.

Btw, there are currently *no* changes to the wal format at all if wal_level < logical, except that xl_running_xacts are logged more frequently, which obviously could easily be made conditional. Barring bugs, of course.

The changes with wal_level >= logical aren't that big either imo:
* heap_insert, heap_update prevent full page writes from removing their normal record by using a separate XLogRecData block for the buffer and the record
* heap_delete adds more data (the pkey of the tuple) after the unchanged xl_heap_delete struct
* on changes to catalog tables, (relfilenode, tid, cmin, cmax) are logged

No changes to mvcc for normal backends at all, unless you count the very slightly changed *Satisfies interface (getting passed a HeapTuple instead of a HeapTupleHeader).
I am not sure what you're concerned about WRT the on-disk format of the tuples? We are pretty much nailed down on that due to pg_upgrade'ability anyway, and it could be changed from this patch's POV without a problem; the output plugin just sees normal HeapTuples. Or are you concerned about the code extracting them from the xlog records?

So I think the "won't break anything else" argument can be made rather fairly if the heapam.c changes, which aren't that complex, are vetted closely. Now, the discussion about all the code that's active *during* decoding is something else entirely :/

> You
> agreed with Tom that 9.2 is the buggiest release in recent memory, but
> I think logical replication could easily be an order of magnitude
> worse.

I unfortunately think that not providing more builtin capabilities in this area also has significant dangers. Imo this is one of the weakest, or even the weakest, area of postgres. I personally have the impression that just about nobody did actual beta testing of the latest releases, especially 9.2, and that is the reason why it's the buggiest recent release.

> I also have serious concerns about checksums and foreign key locks.
> Any single one of those three patches could really inflict
> unprecedented damage on our community's reputation for stability and
> reliability if they turn out to be seriously buggy, and unfortunately
> I don't consider that an unlikely outcome. I don't know what to do
> about it, either.

I see your point, although I would attest that both have far more danger of collateral damage than logical decoding itself. Especially fklocks, which pretty much has to be active by default; there's not much that can reasonably be done about that.
I tried to give fklocks a very thorough look, but unfortunately I didn't know anything beforehand about most areas of the code it touches, which obviously limits the amount of dangers one is seeing (FWIW I am still mostly concerned with multixact logging and the locking changes in heapam.c itself).

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
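The begin/change/commit callback shape Andres points at can be mocked in a few lines. The real definitions live in src/include/replication/output_plugin.h on the linked branch; the names and signatures below are simplified, assumed stand-ins for illustration, not the patch's actual API:

```c
#include <assert.h>
#include <string.h>

/* Illustrative mock of an output-plugin callback table.  Real plugins
 * receive ReorderBufferTXN/ReorderBufferChange structs; here a change is
 * reduced to a tag string. */
typedef struct DecodedChange { const char *kind; } DecodedChange;

typedef struct OutputPluginCallbacks {
    void (*begin_txn)(void *private_data);
    void (*change)(void *private_data, const DecodedChange *change);
    void (*commit_txn)(void *private_data);
} OutputPluginCallbacks;

/* A toy plugin that just appends markers to a caller-supplied buffer. */
static void toy_begin(void *p)  { strcat((char *) p, "B"); }
static void toy_change(void *p, const DecodedChange *c)
{
    (void) c;                      /* a real plugin would format the tuple */
    strcat((char *) p, "+");
}
static void toy_commit(void *p) { strcat((char *) p, "C"); }

/* What the reorder buffer conceptually does when replaying one committed
 * transaction to a plugin: begin, each change in commit order, commit. */
static void replay_txn(const OutputPluginCallbacks *cb, void *priv,
                       const DecodedChange *changes, int nchanges)
{
    cb->begin_txn(priv);
    for (int i = 0; i < nchanges; i++)
        cb->change(priv, &changes[i]);
    cb->commit_txn(priv);
}
```

This is also why the transaction reordering lives below the plugin: a plugin only ever sees whole transactions in commit order, which is what makes writing one as simple as the contrib/test_decoding module mentioned later in the thread.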
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From: Heikki Linnakangas
One random thing that caught my eye in the patch; I thought I'd mention it while I still remember: in heap_delete, you call heap_form_tuple() in a critical section. That's a bad idea, because if it runs out of memory -> PANIC.

- Heikki
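Heikki's point is the general backend rule that an ERROR raised inside a critical section (such as an out-of-memory error from the palloc inside heap_form_tuple) is escalated to PANIC, so allocations must happen before the critical section starts. A toy model of the correct ordering, with entirely illustrative names rather than backend code:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the backend's critical-section state. */
static int in_critical_section = 0;

static void start_crit_section(void) { in_critical_section = 1; }
static void end_crit_section(void)   { in_critical_section = 0; }

/* An allocation that must never happen under a critical section: in the
 * backend, failure here would be escalated from ERROR to PANIC. */
static void *safe_alloc(size_t n)
{
    assert(!in_critical_section);
    return malloc(n);
}

/* Correct ordering for a delete-like operation: form the tuple
 * (allocate) first, only then enter the critical section for the
 * unrecoverable buffer and WAL modifications. */
static void *do_delete_step(size_t tuplen)
{
    void *old_tuple = safe_alloc(tuplen);  /* before the critical section */
    start_crit_section();
    /* ... modify the buffer, insert the WAL record ... */
    end_crit_section();
    return old_tuple;
}
```

Swapping the first two statements of do_delete_step would trip the assertion, which is the toy analogue of the PANIC Heikki describes.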
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From: Andres Freund
On 2013-01-24 12:38:25 +0200, Heikki Linnakangas wrote:
> On 24.01.2013 00:30, Andres Freund wrote:
> > Hi,
> >
> > I decided to reply on the patches thread to be able to find this later.
> >
> > On 2013-01-23 22:48:50 +0200, Heikki Linnakangas wrote:
> > > "logical changeset generation v4"
> > > This is a boatload of infrastructure for supporting logical replication, yet
> > > we have no code actually implementing logical replication that would go with
> > > this. The premise of logical replication over trigger-based was that it'd be
> > > faster, yet we cannot assess that without a working implementation. I don't
> > > think this can be committed in this state.
> >
> > It's a fair point that this is a huge amount of code without a user in
> > itself in-core.
> > But the reason it got no user included is that several people
> > explicitly didn't want a user in-core for now and said the first part of
> > this would be to implement the changeset generation as a separate
> > piece. Didn't you actually prefer not to have any users of this in-core
> > yourself?
>
> Yes, I certainly did. But we still need to see the other piece of the puzzle
> to see how this fits with it.

Fair enough. I am also working on a user of this infrastructure, but that doesn't help you very much. Steve Singer seemed to make some stabs at writing an output plugin as well. Steve, how far did you get there?

> BTW, why does all the transaction reordering stuff have to be in core?

It didn't use to, but people argued pretty damned hard that no undecoded data should ever be allowed to leave the postgres cluster. And to be fair, it makes writing an output plugin *way* easier. Check
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=contrib/test_decoding/test_decoding.c;hb=xlog-decoding-rebasing-cf4

If you skip over tuple_to_stringinfo(), which is just pretty generic scaffolding for converting a whole tuple to a string, writing out the changes in some format by now is pretty damn simple.
> How much of this infrastructure is to support replicating DDL changes? IOW,
> if we drop that requirement, how much code can we slash?

Unfortunately I don't think too much, unless we add in other code that
allows us to check whether the current definition of a table is still the
same as it was back when the tuple was logged.

> Any other features or requirements that could be dropped? I think it's
> clear at this stage that this patch is not going to be committed as it
> is. If you can reduce it to a fraction of what it is now, that fraction
> might have a chance. Otherwise, it's just going to be pushed to the next
> commitfest as a whole, and we're going to be having the same doubts and
> discussions then.

One thing that reduces complexity is to declare the following as
unsupported:
- CREATE TABLE foo(data text);
- DECODE UP TO HERE;
- INSERT INTO foo(data) VALUES(very-long-to-be-externally-toasted-tuple);
- DROP TABLE foo;
- DECODE UP TO HERE;
but that's just a minor thing.

I think what we can do more realistically than to chop off required parts
of changeset extraction is to start applying some of the preliminary
patches independently:
- the relmapper/relfilenode changes + pg_relation_by_filenode(spc,
  relnode) should be independently committable, if a bit boring
- allowing walsenders to connect to a database possibly needs an
  interface change but otherwise it should be fine to go in
  independently. It also has other potential use-cases, so I think
  that's fair.
- logging xl_running_xact's more frequently could also be committed
  independently and makes sense independently as it allows a standby to
  enter HS faster if the master is busy
- Introducing InvalidCommandId should be relatively uncontroversial.
  The fact that no invalid value for command ids exists is imo an
  oversight
- the *Satisfies changes could be applied and they are imo ready, but
  there's no use-case for them without the rest, so I am not sure whether
  there's a point
- currently not separately available, but we could add wal_level=logical
  independently. There would be no user of it, but it would be partial
  work. That includes the relcache support for keeping track of the
  primary key, which already is available separately.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
On 2013-01-24 13:28:18 +0200, Heikki Linnakangas wrote:
> One random thing that caught my eye in the patch, I thought I'd mention it
> while I still remember: In heap_delete, you call heap_form_tuple() in a
> critical section. That's a bad idea, because if it runs out of memory ->
> PANIC.

Good point, will fix.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13-01-24 06:40 AM, Andres Freund wrote:
>
> Fair enough. I am also working on a user of this infrastructure but that
> doesn't help you very much. Steve Singer seemed to make some stabs at
> writing an output plugin as well. Steve, how far did you get there?

I was able to get something that generated output for INSERT statements
in a format similar to what a modified slony apply trigger would want.
This was with the list of tables to replicate hard-coded in the plugin,
and with the patchset from the last commitfest. I had gotten a bit hung
up on the UPDATE and DELETE support because slony allows you to use an
arbitrary user-specified unique index as your key. It looks like better
support for tables with a unique non-primary key is in the most recent
patch set. I am hoping to have time this weekend to update my plugin to
use parameters passed in on init and the other updates in the most
recent version. If I make some progress I will post a link to my
progress at the end of the weekend. My big issue is that I have limited
time to spend on this.

> > BTW, why does all the transaction reordering stuff have to be in core?
>
> It didn't use to, but people argued pretty damned hard that no undecoded
> data should ever be allowed to leave the postgres cluster. And to be fair
> it makes writing an output plugin *way* easier. Check
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=contrib/test_decoding/test_decoding.c;hb=xlog-decoding-rebasing-cf4
> If you skip over tuple_to_stringinfo(), which is just pretty generic
> scaffolding for converting a whole tuple to a string, writing out the
> changes in some format by now is pretty damn simple.

I think we will find that replication systems won't be the only users of
this feature. I have often seen systems that have a logging requirement
for auditing purposes, or that log and then reconstruct the sequence of
changes made to a set of tables in order to feed a downstream
application.
Triggers and a journaling table are the traditional way of doing this,
but it should be pretty easy to write a plugin that accomplishes the
same thing with better performance. If the reordering stuff weren't in
core, this would be much harder.

> > How much of this infrastructure is to support replicating DDL changes? IOW,
> > if we drop that requirement, how much code can we slash?
>
> Unfortunately I don't think too much, unless we add in other code that
> allows us to check whether the current definition of a table is still
> the same as it was back when the tuple was logged.
>
> > Any other features or requirements that could be dropped? I think it's
> > clear at this stage that this patch is not going to be committed as it
> > is. If you can reduce it to a fraction of what it is now, that
> > fraction might have a chance. Otherwise, it's just going to be pushed
> > to the next commitfest as a whole, and we're going to be having the
> > same doubts and discussions then.
>
> One thing that reduces complexity is to declare the following as
> unsupported:
> - CREATE TABLE foo(data text);
> - DECODE UP TO HERE;
> - INSERT INTO foo(data) VALUES(very-long-to-be-externally-toasted-tuple);
> - DROP TABLE foo;
> - DECODE UP TO HERE;
>
> but that's just a minor thing.
>
> I think what we can do more realistically than to chop off required parts
> of changeset extraction is to start applying some of the preliminary
> patches independently:
> - the relmapper/relfilenode changes + pg_relation_by_filenode(spc,
>   relnode) should be independently committable, if a bit boring
> - allowing walsenders to connect to a database possibly needs an
>   interface change but otherwise it should be fine to go in
>   independently. It also has other potential use-cases, so I think
>   that's fair.
> - logging xl_running_xact's more frequently could also be committed
>   independently and makes sense independently as it allows a standby to
>   enter HS faster if the master is busy
> - Introducing InvalidCommandId should be relatively uncontroversial. The
>   fact that no invalid value for command ids exists is imo an oversight
> - the *Satisfies changes could be applied and they are imo ready, but
>   there's no use-case for them without the rest, so I am not sure
>   whether there's a point
> - currently not separately available, but we could add wal_level=logical
>   independently. There would be no user of it, but it would be partial
>   work. That includes the relcache support for keeping track of the
>   primary key, which already is available separately.
>
> Greetings,
>
> Andres Freund
On Thu, Jan 24, 2013 at 6:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> That's way more along the lines of what I am afraid of than the
> performance stuff - but Heikki cited those, so I replied to that.
>
> Note that I didn't say this must, must go in - I just don't think
> Heikki's reasoning about why not hit the nail on the head.

Fair enough, no argument. Before getting bogged down in technical
commentary, let me say this very clearly: I am enormously grateful for
your work on this project. Logical replication based on WAL decoding is
a feature of enormous value that PostgreSQL has needed for a long time,
and your work has made that look like an achievable goal. Furthermore,
it seems to me that you have pursued the community process with all the
vigor and sincerity for which anyone could ask. Serious design concerns
were raised early in the process and you made radical changes to the
design which I believe have improved it tremendously, and you've
continued to display an outstanding attitude at every phase of this
process about which I can't say enough good things. There is no question
in my mind that this work is going to be the beginning of a process that
revolutionizes the way people think about replication and PostgreSQL,
and you deserve our sincere thanks for that.

Now, the bad news is, I don't think it's very reasonable to try to
commit this to 9.3. I think it is just too much stuff too late in the
cycle. I've reviewed some of the patches from time to time, but there is
a lot more stuff and it's big and complicated and it's not really clear
that we have the interface quite right yet, even though I think it's
also clear that we are a lot closer than we were. I don't want to be
fixing that during beta, much less after release.

> I tried very, very hard to get the basics of the design & interface
> solid. Which obviously doesn't mean I am succeeding - luckily not being
> superhuman after all ;).
> And I think that's very much where input is desperately needed and
> where I failed to raise enough attention. The "output plugin"
> interface, followed by the walsender interface, is what needs to be
> most closely vetted.
> Those are the permanent, user/developer-exposed UI and the ones we
> should try to keep as stable as possible.
>
> The output plugin callbacks are defined here:
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/include/replication/output_plugin.h;hb=xlog-decoding-rebasing-cf4
> To make it more agnostic of the technology used to implement changeset
> extraction we possibly should replace the ReorderBuffer(TXN|Change)
> structs being passed by something more implementation-agnostic.
>
> walsender interface:
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/backend/replication/repl_gram.y;hb=xlog-decoding-rebasing-cf4
> The interesting new commands are:
> 1) K_INIT_LOGICAL_REPLICATION NAME NAME
> 2) K_START_LOGICAL_REPLICATION NAME RECPTR plugin_options
> 3) K_FREE_LOGICAL_REPLICATION NAME
>
> 1 & 3 allocate (respectively free) the permanent state associated with
> one changeset consumer, whereas START_LOGICAL_REPLICATION streams out
> changes starting at RECPTR.

Forgive me for not having looked at the patch, but to what extent is
all this, ah, documented?

> Btw, there are currently *no* changes to the WAL format at all if
> wal_level < logical, except that xl_running_xacts are logged more
> frequently, which obviously could easily be made conditional. Barring
> bugs of course.
> The changes with wal_level >= logical aren't that big either imo:
> * heap_insert, heap_update prevent full-page writes from removing their
>   normal record by using a separate XLogRecData block for the buffer
>   and the record
> * heap_delete adds more data (the pkey of the tuple) after the
>   unchanged xl_heap_delete struct
> * On changes to catalog tables, (relfilenode, tid, cmin, cmax) are logged.
>
> No changes to MVCC for normal backends at all, unless you count the
> very slightly changed *Satisfies interface (getting passed a HeapTuple
> instead of a HeapTupleHeader).
>
> I am not sure what you're concerned about WRT the on-disk format of the
> tuples? We are pretty much nailed down on that due to pg_upgrade'ability
> anyway, and it could be changed from this patch's POV without a problem;
> the output plugin just sees normal HeapTuples. Or are you concerned
> about the code extracting them from the xlog records?

Mostly, my concern is that you've accidentally broken something, or
that your code will turn out to be flaky in ways we can't now predict.
My only really specific concern at this point is about the special
treatment of catalog tables. We've never done anything like that
before, and it feels like a bad idea. In particular, the fact that you
have to WAL-log new information about cmin/cmax really suggests that
we're committing ourselves to the MVCC infrastructure in a way that we
weren't previously. There's some category of stuff that our MVCC
implementation didn't previously require us to persist on disk which,
after this, it will. I don't understand exactly where the boundaries of
that are in terms of future changes we might want to make - but I don't
like moving the goalposts in that area.

> So I think the "won't break anything else" argument can be made rather
> fairly if the heapam.c changes, which aren't that complex, are vetted
> closely.
>
> Now, the discussion about all the code that's active *during* decoding
> is something else entirely :/
>
> > You agreed with Tom that 9.2 is the buggiest release in recent
> > memory, but I think logical replication could easily be an order of
> > magnitude worse.
>
> I unfortunately think that not providing more builtin capabilities in
> this area also has significant dangers. Imo this is one of the weakest,
> or even the weakest, areas of postgres.
>
> I personally have the impression that just about nobody did actual beta
> testing of the latest releases, especially 9.2, and that is the reason
> why it's the buggiest recent release.
>
> > I also have serious concerns about checksums and foreign key locks.
> > Any single one of those three patches could really inflict
> > unprecedented damage on our community's reputation for stability and
> > reliability if they turn out to be seriously buggy, and unfortunately
> > I don't consider that an unlikely outcome. I don't know what to do
> > about it, either.
>
> I see your point, although I would attest both as having far more
> danger of collateral damage than logical decoding itself. Especially
> fklocks, which pretty much has to be active by default, and there's not
> much that can be reasonably done about that.
> I tried to give fklocks a very thorough look, but unfortunately I
> didn't know anything beforehand about most areas of the code it
> touches, which obviously limits the amount of danger one is seeing
> (FWIW I am still mostly concerned with multixact logging and the
> locking changes in heapam.c itself).

Yeah, I don't disagree with any of that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> Now, the bad news is, I don't think it's very reasonable to try to
> commit this to 9.3. I think it is just too much stuff too late in the
> cycle. I've reviewed some of the patches from time to time, but there
> is a lot more stuff and it's big and complicated and it's not really
> clear that we have the interface quite right yet, even though I think
> it's also clear that we are a lot closer than we were. I don't want
> to be fixing that during beta, much less after release.

The only way to avoid this happening again and again, imv, is to get it
committed early in whatever cycle it's slated to release in. We've got
some serious challenges there though, because we want to encourage
everyone to focus on beta testing and going through the release process,
plus we don't want to tag/branch too early or we create more work for
ourselves. It would have been nice to get this into 9.3, and I can
certainly understand needing to move it back, but can we get a slightly
more specific plan around getting it in then?

	Thanks,

		Stephen
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Heikki Linnakangas
Date:
On 24.01.2013 20:27, Robert Haas wrote:
> Before getting bogged down in technical commentary, let me say this
> very clearly: I am enormously grateful for your work on this project.
> Logical replication based on WAL decoding is a feature of enormous
> value that PostgreSQL has needed for a long time, and your work has
> made that look like an achievable goal. Furthermore, it seems to me
> that you have pursued the community process with all the vigor and
> sincerity for which anyone could ask. Serious design concerns were
> raised early in the process and you made radical changes to the design
> which I believe have improved it tremendously, and you've continued to
> display an outstanding attitude at every phase of this process about
> which I can't say enough good things.

+1. I really appreciate all the work you, Andres, have put into this.

I've argued in the past myself that there should be a little tool that
scrapes the WAL to do logical replication. Essentially, just what you've
implemented. That said (hah, you knew there would be a "but" ;-)), now
that I see what that looks like, I'm feeling that maybe it wasn't such a
good idea after all. It sounded like a fairly small patch that greatly
reduces the overhead on the master with existing replication systems
like slony, but it turned out to be a huge patch with a lot of new
concepts and interfaces.

- Heikki
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
Hi!

On 2013-01-24 13:27:00 -0500, Robert Haas wrote:
> On Thu, Jan 24, 2013 at 6:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Before getting bogged down in technical commentary, let me say this
> very clearly: I am enormously grateful for your work on this project.
> Logical replication based on WAL decoding is a feature of enormous
> value that PostgreSQL has needed for a long time, and your work has
> made that look like an achievable goal. Furthermore, it seems to me
> that you have pursued the community process with all the vigor and
> sincerity for which anyone could ask. Serious design concerns were
> raised early in the process and you made radical changes to the design
> which I believe have improved it tremendously, and you've continued to
> display an outstanding attitude at every phase of this process about
> which I can't say enough good things.

Very much appreciated. Especially as I can echo your feeling of not only
having positive feelings about the process ;)

> Now, the bad news is, I don't think it's very reasonable to try to
> commit this to 9.3. I think it is just too much stuff too late in the
> cycle. I've reviewed some of the patches from time to time, but there
> is a lot more stuff and it's big and complicated and it's not really
> clear that we have the interface quite right yet, even though I think
> it's also clear that we are a lot closer than we were. I don't want
> to be fixing that during beta, much less after release.

It pains me to admit that you have a point there. What I am afraid of,
though, is that it basically goes on like this in the next commitfests:
* 9.4-CF1: no "serious" reviewer comments because they are busy doing
  release work
* 9.4-CF2: all are relieved that the release is over and a bit tired
* 9.4-CF3: first deeper review, some more complex restructuring required
* 9.4-CF4: too many changes to commit.
If you look at the development of the feature, after the first prototype
and the resulting design changes, nobody with decision power had a more
than cursory look at the proposed interfaces. That's very, very, very
understandable - you all are busy people, and the patch & the interfaces
are complex, so it takes noticeable amounts of time - but it
unfortunately doesn't help in getting an acceptable interface nailed
down.

The problem with that is not only that it sucks huge amounts of energy
out of me and others, but also that it's very hard to really build the
layers/users above changeset extraction without being able to rely on
the interface and semantics. So we never get to the actual benefits :(,
and we don't get the users people require for the feature to be
committed.

So far, the only really effective way of getting people to comment on
patches of this state & complexity is the threat of an upcoming commit
because of the last commitfest :(

I honestly don't know how to go on about this...

> > I tried very, very hard to get the basics of the design & interface
> > solid. Which obviously doesn't mean I am succeeding - luckily not
> > being superhuman after all ;). And I think that's very much where
> > input is desperately needed and where I failed to raise enough
> > attention. The "output plugin" interface, followed by the walsender
> > interface, is what needs to be most closely vetted.
> > Those are the permanent, user/developer-exposed UI and the ones we
> > should try to keep as stable as possible.
> >
> > The output plugin callbacks are defined here:
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/include/replication/output_plugin.h;hb=xlog-decoding-rebasing-cf4
> > To make it more agnostic of the technology used to implement
> > changeset extraction we possibly should replace the
> > ReorderBuffer(TXN|Change) structs being passed by something more
> > implementation-agnostic.
> >
> > walsender interface:
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/backend/replication/repl_gram.y;hb=xlog-decoding-rebasing-cf4
> > The interesting new commands are:
> > 1) K_INIT_LOGICAL_REPLICATION NAME NAME
> > 2) K_START_LOGICAL_REPLICATION NAME RECPTR plugin_options
> > 3) K_FREE_LOGICAL_REPLICATION NAME
> >
> > 1 & 3 allocate (respectively free) the permanent state associated with
> > one changeset consumer, whereas START_LOGICAL_REPLICATION streams out
> > changes starting at RECPTR.
>
> Forgive me for not having looked at the patch, but to what extent is
> all this, ah, documented?

There are several mails on -hackers where I ask for input on whether
that interface is what people want, but all the comments have been from
non-core pg people, although mildly favorable. I couldn't convince
myself to write real low-level documentation - instead of just the
example code I needed for testing anyway and some more higher-level
docs - before I had input from that side. Perhaps that was a mistake.

So, here's a slightly less quick overview of the walsender interface:

Whenever a new replication consumer wants to stream data we need to make
sure on the primary that the data can be provided without gaps, even
across disconnects, crashes et al. The permanent state associated with
this is currently called a "replication slot".

$ psql "port=5440 dbname=postgres replication=1"
postgres=# INIT_LOGICAL_REPLICATION 'bdr-whatever-1' 'test_decoding';
 replication_id | consistent_point | snapshot_name |    plugin
----------------+------------------+---------------+---------------
 bdr-whatever-1 | 0/3E8DFA08       | 000F54F1-1    | test_decoding
(1 row)

So now we have allocated a permanent slot identified by the name
'bdr-whatever-1'. It also automatically exported the snapshot
'000F54F1-1' that can be imported into another transaction, e.g. to
consistently dump an initial snapshot of the data.
The information returned in the 'consistent_point' column tells us that
we will be able to return all data from that LSN onwards. That
replication slot can *only* be used for replicating changes out of the
database postgres and with the plugin 'test_decoding' (a contrib
module). The slot will persist across restarts and everything else until
somebody issues a FREE_LOGICAL_REPLICATION 'bdr-whatever-1'.

To start streaming out changes, the command

postgres=# START_LOGICAL_REPLICATION 'bdr-whatever-1' 0/3E8DFA08;
WARNING: Starting logical replication
unexpected PQresultStatus: 8
Time: 76.346 ms

is used. Unfortunately psql isn't a suitable consumer, as it cannot deal
with the unrequested copy, but that's what we have pg_receivellog for:

$ ~/.../pg_receivellog -p 5440 -d postgres --slot bdr-whatever-1 -f - --start -v
pg_receivellog: starting log streaming at 0/0 (slot bdr-whatever-1)
pg_receivellog: initiated streaming

Which will start streaming out changes when we do:

$ psql -h /tmp -p 5440 -U andres postgres
postgres=# CREATE TABLE frak(id serial primary key, data int);
CREATE TABLE
postgres=# INSERT INTO frak (data) SELECT * FROM generate_series(1, 1);
INSERT 0 1

Back in pg_receivellog:

BEGIN 1004786
table "frak_id_seq": INSERT: sequence_name[name]:frak_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f
COMMIT 1004786
pg_receivellog: confirming flush up to 0/3E8F0F30 (slot bdr-whatever-1)
BEGIN 1004787
table "frak": INSERT: id[int4]:1 data[int4]:1
COMMIT 1004787
pg_receivellog: confirming flush up to 0/3E8FCDC0 (slot bdr-whatever-1)

Makes sense so far?
The actual output you see there, the

BEGIN 1004787
table "frak": INSERT: id[int4]:1 data[int4]:1
COMMIT 1004787

bit, is generated by the test_decoding plugin referenced previously,
which has functions like

extern void pg_decode_init(struct LogicalDecodingContext *ctx, bool is_init);
extern bool pg_decode_begin_txn(struct LogicalDecodingContext *ctx,
                                ReorderBufferTXN *txn);
extern bool pg_decode_commit_txn(struct LogicalDecodingContext *ctx,
                                 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
extern bool pg_decode_change(struct LogicalDecodingContext *ctx,
                             ReorderBufferTXN *txn, Oid tableoid,
                             ReorderBufferChange *change);

And e.g. begin_txn looks like:

bool
pg_decode_begin_txn(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    TestDecodingData *data = ctx->output_plugin_private;

    ctx->prepare_write(ctx, txn->lsn, txn->xid);
    if (data->include_xids)
        appendStringInfo(ctx->out, "BEGIN %u", txn->xid);
    else
        appendStringInfoString(ctx->out, "BEGIN");
    ctx->write(ctx, txn->lsn, txn->xid);
    return true;
}

As you see, it seems to have somehow gathered options from somewhere.
Those can be specified as optional arguments to
START_LOGICAL_REPLICATION.

> > Btw, there are currently *no* changes to the WAL format at all if
> > wal_level < logical, except that xl_running_xacts are logged more
> > frequently, which obviously could easily be made conditional. Barring
> > bugs of course.
> > The changes with wal_level >= logical aren't that big either imo:
> > * heap_insert, heap_update prevent full-page writes from removing
> >   their normal record by using a separate XLogRecData block for the
> >   buffer and the record
> > * heap_delete adds more data (the pkey of the tuple) after the
> >   unchanged xl_heap_delete struct
> > * On changes to catalog tables, (relfilenode, tid, cmin, cmax) are logged.
> >
> > No changes to MVCC for normal backends at all, unless you count the
> > very slightly changed *Satisfies interface (getting passed a HeapTuple
> > instead of a HeapTupleHeader).
> >
> > I am not sure what you're concerned about WRT the on-disk format of
> > the tuples? We are pretty much nailed down on that due to
> > pg_upgrade'ability anyway, and it could be changed from this patch's
> > POV without a problem; the output plugin just sees normal HeapTuples.
> > Or are you concerned about the code extracting them from the xlog
> > records?
>
> Mostly, my concern is that you've accidentally broken something, or
> that your code will turn out to be flaky in ways we can't now predict.

I really think a look or two from experienced enough people should make
the heapam parts safe enough. The changes basically are like:

 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	xl_heap_insert xlrec;
 	xl_heap_header xlhdr;
 	XLogRecPtr	recptr;
-	XLogRecData rdata[3];
+	XLogRecData rdata[4];
 	Page		page = BufferGetPage(buffer);
 	uint8		info = XLOG_HEAP_INSERT;
+	/*
+	 * For the logical replication case we need the tuple even if were
+	 * doing a full page write. We could alternatively store a pointer into
+	 * the fpw though.
+	 * For that to work we add another rdata entry for the buffer in that
+	 * case.
+	 */
+	bool		need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL
+		&& RelationGetRelid(relation) >= FirstNormalObjectId;
+
+	/* For logical decode we need combocids to properly decode the catalog */
+	if (wal_level >= WAL_LEVEL_LOGICAL && RelationGetRelid(relation) < FirstNormalObjectId)
+		log_heap_new_cid(relation, heaptup);
...
 	rdata[1].data = (char *) &xlhdr;
 	rdata[1].len = SizeOfHeapHeader;
-	rdata[1].buffer = buffer;
+	rdata[1].buffer = need_tuple_data ? InvalidBuffer : buffer;
 	rdata[1].buffer_std = true;
 	rdata[1].next = &(rdata[2]);

 	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
 	rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
 	rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
-	rdata[2].buffer = buffer;
+	rdata[2].buffer = need_tuple_data ? InvalidBuffer : buffer;
 	rdata[2].buffer_std = true;
 	rdata[2].next = NULL;

 	/*
+	 * add record for the buffer without actual content thats removed if
+	 * fpw is done for that buffer
+	 */
+	if (need_tuple_data)
+	{
+		rdata[2].next = &(rdata[3]);
+
+		rdata[3].data = NULL;
+		rdata[3].len = 0;
+		rdata[3].buffer = buffer;
+		rdata[3].buffer_std = true;
+		rdata[3].next = NULL;
+	}

Both the wal_level >= logical && XXX checks are now nicely encapsulated,
but this shows the complexity of what's being done better... That's
about all the changes that are done to heapam.c. Well, the same is done
for update, multi_insert, and delete as well, but...

> My only really specific concern at this point is about the special
> treatment of catalog tables. We've never done anything like that
> before, and it feels like a bad idea. In particular, the fact that
> you have to WAL-log new information about cmin/cmax really suggests
> that we're committing ourselves to the MVCC infrastructure in a way
> that we weren't previously.

It basically restores the pre-8.3 (?) state again where cmin/cmax were
really stored - only that it only does so temporarily instead of
permanently bloating the tables again. It imo pretty closely resembles
what the normal code is doing with combocids, just that the combocid in
this case is slightly more complex because it needs to be looked up over
a longer timeframe. I thought about simply re-adding cmin/cmax storage
for catalog tables, with some trickery that's not that hard to do (store
them similarly to oids), but the impact of that would have been far, far
greater.

And the decision to treat only some tables that way? Well, that's a
question of overhead. There simply is no need to do something like that
for tables that aren't required for converting a HeapTuple to the format
the output plugin wants. From my pov it's somewhat similar to the way we
log differently for temporary, persistent and unlogged tables.
> There's some category of stuff that our
> MVCC implementation didn't previously require us to persist on disk
> which, after this, it will. I don't understand exactly where the
> boundaries of that are in terms of future changes we might want to
> make - but I don't like moving the goalposts in that area.

I don't really see a problem there. If we decide to get rid of MVCC in a
fundamental manner, this will be the absolutely smallest problem of it
all. IMNSHO ;)

Andres

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Bruce Momjian
Date:
On Fri, Jan 25, 2013 at 02:16:09AM +0100, Andres Freund wrote:
> What I am afraid of, though, is that it basically goes on like this in
> the next commitfests:
> * 9.4-CF1: no "serious" reviewer comments because they are busy doing release work
> * 9.4-CF2: all are relieved that the release is over and a bit tired
> * 9.4-CF3: first deeper review, some more complex restructuring required
> * 9.4-CF4: too many changes to commit.
>
> If you look at the development of the feature, after the first prototype
> and the resulting design changes, nobody with decision power had a more
> than cursory look at the proposed interfaces. That's very, very, very
> understandable - you all are busy people, and the patch & the interfaces
> are complex, so it takes noticeable amounts of time - but it
> unfortunately doesn't help in getting an acceptable interface nailed
> down.
>
> The problem with that is not only that it sucks huge amounts of energy
> out of me and others, but also that it's very hard to really build the
> layers/users above changeset extraction without being able to rely on
> the interface and semantics. So we never get to the actual benefits
> :(, and we don't get the users people require for the feature to be
> committed.
>
> So far, the only really effective way of getting people to comment on
> patches of this state & complexity is the threat of an upcoming commit
> because of the last commitfest :(
>
> I honestly don't know how to go on about this...

This is very accurate, and it is the big challenge of large, invasive
patches. You almost need to hit it perfectly the first time to get it
committed in less than a year.

--
Bruce Momjian  <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
On 2013-01-24 20:53:18 +0200, Heikki Linnakangas wrote:
> That said (hah, you knew there would be a "but" ;-)), now that I see what
> that looks like, I'm feeling that maybe it wasn't such a good idea after
> all. It sounded like a fairly small patch that greatly reduces the overhead
> in the master with existing replication systems like slony, but it turned
> out to be a huge patch with a lot of new concepts and interfaces.

Heh, I know the feeling that there must be a simpler way. But after trying several approaches (more than I dare to admit) I don't really think there's any that provides the asked-for flexibility. I really think the flexibility is what's required to satisfy the very diverse aims people have for a feature like this.

And if you look at the overall diffstat, without minor changes, example code and documentation:

 src/backend/access/heap/heapam.c                |  286 ++-
 src/backend/replication/logical/decode.c        |  514 ++++++
 src/backend/replication/logical/logical.c       |  943 ++++++++++
 src/backend/replication/logical/logicalfuncs.c  |  115 ++
 src/backend/replication/logical/reorderbuffer.c | 1947 ++++++++++++++++++++
 src/backend/replication/logical/snapbuild.c     | 1596 ++++++++++++++++
 src/backend/replication/walsender.c             |  620 ++++++-
 src/bin/pg_basebackup/pg_receivellog.c          |  869 +++++++++
 src/include/replication/decode.h                |   20 +
 src/include/replication/logical.h               |  205 +++
 src/include/replication/logicalfuncs.h          |   14 +
 src/include/replication/output_plugin.h         |   73 +
 src/include/replication/reorderbuffer.h         |  296 +++
 src/include/replication/snapbuild.h             |  176 ++
 src/include/replication/walsender_private.h     |    6 +-
 src/backend/storage/ipc/procarray.c             |   63 +-
 src/backend/utils/time/tqual.c                  |  272 ++-

It's not *that* big compared to other patches that have been committed.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
On 2013-01-24 20:28:41 -0500, Bruce Momjian wrote: > On Fri, Jan 25, 2013 at 02:16:09AM +0100, Andres Freund wrote: > > What I am afraid though is that it basically goes on like this in the > > next commitfests: > > * 9.4-CF1: no "serious" reviewer comments because they are busy doing release work > > * 9.4-CF2: all are relieved that the release is over and a bit tired > > * 9.4-CF3: first deeper review, some more complex restructuring required > > * 9.4-CF4: too many changes to commit. > > > > If you look at the development of the feature, after the first prototype > > and the resulting design changes nobody with decision power had a more > > than cursory look at the proposed interfaces. Thats very, very, very > > understandable, you all are busy people and the patch & the interfaces > > are complex so it takes noticeable amounts of time, but it unfortunately > > doesn't help in getting an acceptable interface nailed down. > > > > The problem with that is not only that it sucks huge amounts of energy > > out of me and others but also that its very hard to really build the > > layers/users above changeset extraction without being able to rely on > > the interface and semantics. So we never get to the actually benefits > > :(, and we don't get the users people require for the feature to be > > committed. > > > > So far, the only really effective way of getting people to comment on > > patches in this state & complexity is the threat of an upcoming commit > > because of the last commitfest :( > > > > I honestly don't know how to go on about this... > > This is very accurate and the big challenge of large, invasive patches. > You almost need to hit it perfect the first time to get it committed in > less than a year. My primary concern really isn't to get it committed inside a year, but to be sure to get input in-time to be able to actually continue to work. And to commit it then. And I am absolutely, absolutely not sure thats going to work. 
Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Bruce Momjian
Date:
On Fri, Jan 25, 2013 at 02:40:03AM +0100, Andres Freund wrote: > > > The problem with that is not only that it sucks huge amounts of energy > > > out of me and others but also that its very hard to really build the > > > layers/users above changeset extraction without being able to rely on > > > the interface and semantics. So we never get to the actually benefits > > > :(, and we don't get the users people require for the feature to be > > > committed. > > > > > > So far, the only really effective way of getting people to comment on > > > patches in this state & complexity is the threat of an upcoming commit > > > because of the last commitfest :( > > > > > > I honestly don't know how to go on about this... > > > > This is very accurate and the big challenge of large, invasive patches. > > You almost need to hit it perfect the first time to get it committed in > > less than a year. > > My primary concern really isn't to get it committed inside a year, but > to be sure to get input in-time to be able to actually continue to > work. And to commit it then. And I am absolutely, absolutely not sure > thats going to work. I have found that if I push out improvements right after they are requested, I can sometimes get momentum for people to get excited about the patch. That is very hard to do with any other time constraints. I am not saying you didn't push out stuff quickly, only that this is hard to do. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 13-01-24 11:15 AM, Steve Singer wrote:
> On 13-01-24 06:40 AM, Andres Freund wrote:
>>
>> Fair enough. I am also working on a user of this infrastructure but that
>> doesn't help you very much. Steve Singer seemed to make some stabs at
>> writing an output plugin as well. Steve, how far did you get there?
>
> I was able to get something that generated output for INSERT
> statements in a format similar to what a modified slony apply trigger
> would want. This was with the list of tables to replicate hard-coded
> in the plugin. This was with the patchset from the last commitfest. I
> had gotten a bit hung up on the UPDATE and DELETE support because
> slony allows you to use an arbitrary user specified unique index as
> your key. It looks like better support for tables with a unique
> non-primary key is in the most recent patch set. I am hoping to have
> time this weekend to update my plugin to use parameters passed in on
> the init and other updates in the most recent version. If I make some
> progress I will post a link to my progress at the end of the weekend.
> My big issue is that I have limited time to spend on this.
>

This isn't a complete review, just a few questions I've hit so far that I thought I'd ask to see if I'm not seeing something related to updates.

*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
*************** extern bool ReindexIsProcessingHeap(Oid
*** 114,117 ****
--- 114,121 ----
  extern bool ReindexIsProcessingIndex(Oid indexOid);
  extern Oid IndexGetRelation(Oid indexId, bool missing_ok);
+ extern void relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+                                    int16 *nratts, int16 *attnums, Oid *atttypids,
+                                    Oid *opclasses);
+
  #endif   /* INDEX_H */

I don't see this defined anywhere; could it be left over from a previous version of the patch?

In decode.c, DecodeUpdate:

+
+       /*
+        * FIXME: need to get/save the old tuple as well if we want primary key
+        * changes to work.
+        */
+       change->newtuple = ReorderBufferGetTupleBuf(reorder);

I also don't see any code in heap_update to find + save the old primary key values like you added to heap_delete. You didn't list "Add ability to change the primary key on an UPDATE" in the TODO, so I'm wondering if I'm missing something. Is there another way I can get the primary key values for the old_tuple?

Also, I think the name of the test contrib module was changed but you didn't update the makefile. This fixes it:

diff --git a/contrib/Makefile b/contrib/Makefile
index 1cc30fe..36e6bfe 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,7 @@ SUBDIRS = \
 		tcn \
 		test_parser \
 		test_decoding \
-		test_logical_replication \
+		test_logical_decoding \
 		tsearch2 \
 		unaccent \
 		vacuumlo \
On 13-01-22 11:30 AM, Andres Freund wrote:
> Hi,
>
> I pushed a new rebased version (the xlogreader commit made it annoying
> to merge).
>
> The main improvements are
> * way much coherent code internally for intializing logical rep
> * explicit control over slots
> * options for logical replication

Exactly what is the syntax for using that? My reading of your changes to repl_gram.y makes me think that any of the following should work (but they don't):

START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1')
ERROR:  syntax error: unexpected character "("

START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1' 'val1')
ERROR:  syntax error: unexpected character "("

START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1','opt2')
ERROR:  syntax error: unexpected character "("

I'm also attaching a patch to pg_receivellog that allows you to specify these options on the command line. I'm not saying I think that it is appropriate to be adding more bells and whistles to the utilities two weeks into the CF, but I found this useful for testing so I'm sharing it.

> Greetings,
>
> Andres Freund
>
> --
> Andres Freund    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On 13-01-24 11:15 AM, Steve Singer wrote:
> On 13-01-24 06:40 AM, Andres Freund wrote:
>>
>> Fair enough. I am also working on a user of this infrastructure but that
>> doesn't help you very much. Steve Singer seemed to make some stabs at
>> writing an output plugin as well. Steve, how far did you get there?
>
> I was able to get something that generated output for INSERT
> statements in a format similar to what a modified slony apply trigger
> would want. This was with the list of tables to replicate hard-coded
> in the plugin. This was with the patchset from the last commitfest. I
> had gotten a bit hung up on the UPDATE and DELETE support because
> slony allows you to use an arbitrary user specified unique index as
> your key. It looks like better support for tables with a unique
> non-primary key is in the most recent patch set. I am hoping to have
> time this weekend to update my plugin to use parameters passed in on
> the init and other updates in the most recent version. If I make some
> progress I will post a link to my progress at the end of the weekend.
> My big issue is that I have limited time to spend on this.
>

A few more comments;

In decode.c DecodeDelete:

+       if (r->xl_len <= (SizeOfHeapDelete + SizeOfHeapHeader))
+       {
+               elog(DEBUG2, "huh, no primary key for a delete on wal_level = logical?");
+               return;
+       }
+

I think we should be passing deletes with candidate key data logged to the plugin. If the table isn't a replicated table then ignoring the delete is fine. If the table is a replicated table but someone has deleted the unique index from the table, then the plugin will receive INSERT changes on the table but not DELETE changes. If this happens the plugin wouldn't have any way of knowing that it is missing delete changes. If my plugin gets passed a DELETE change record but with no key data then my plugin could do any of:

1. Start screaming for help (ie log errors)
2. Drop the table from replication
3. Pass the delete (with no key values) onto the replication client and let it deal with it (see 1 and 2)

Also, 'huh' isn't one of our standard log message phrases :)

How do you plan on dealing with sequences? I don't see my plugin being called on sequence changes and I don't see XLOG_SEQ_LOG listed in DecodeRecordIntoReorderBuffer. Is there a reason why this can't be easily added?

Also, what do we want to do about TRUNCATE support? I could always leave a TRUNCATE trigger in place that logged the truncate to a sl_truncates table and have my replication daemon respond to the insert on the sl_truncates table by actually truncating the data on the replica.

I've spent some time this weekend updating my prototype plugin that generates slony 2.2 style COPY output. I have attached my progress here (also https://github.com/ssinger/slony1-engine/tree/logical_repl). I have not gotten as far as modifying slon to act as a logical log receiver, or made a version of the slony apply trigger that would process these changes. I haven't looked into the details of what is involved in setting up a subscription with the snapshot exporting.

I couldn't get the options on the START REPLICATION command to parse so I just hard coded some list building code in the init method. I do plan on passing the list of tables to replicate from the replica to the plugin (because this list comes from the replica). Passing what could be a few thousand table names as a list of arguments is a bit ugly and I admit my list processing code is rough. Does this make us want to reconsider the format of the option_list?

I guess I should provide an opinion on whether I think that the patch in this CF, if committed, could be used to act as a source for slony instead of the log trigger. The biggest missing piece, as I mentioned in my email yesterday, is that we aren't logging the old primary key on row UPDATEs. I don't see building a credible replication system where you don't allow users to update any column of a row.
The other issues I've raised (DecodeDelete hiding bad deletes, replication options not parsing for me) look like easy fixes. The lack of wal decoding support for sequences or truncate is something I could work around by doing things much like slony does today: the SYNC can still capture the sequence changes in a table (where the INSERTs would be logged) and I can have a trigger capture truncates.

I mostly did this review from the point of view of someone trying to use the feature; I haven't done a line-by-line review of the code. I suspect Andres can address these issues and get an updated patch out during this CF. I think a more detailed code review by someone more familiar with postgres internals will reveal a handful of other issues that hopefully can be fixed without a lot of effort. If this were the only patch in the commitfest I would encourage Andres to push to get these changes done.

If the standard for CF4 is that a patch needs to be basically in a committable state at the start of the CF, other than minor issues, then I don't think this patch meets that bar. In a few more weeks from now, with a handful of more updates and re-reviews, it might. If we give everyone in the CF that much time to get their patches into a committable state then I think the CF will drag on until April or even May, and we might not see 9.3 released until close to Christmas (4 patches so far have been rejected or returned with feedback, 51 need reviewer or committer attention). I'm not sure I have a huge problem with that, but I don't think it is what was agreed to in the developer meeting last May.

If this patch is going to get bumped to 9.4, I really hope that someone with good knowledge of the internals (ie a committer) can give this patch a good review sooner rather than later. If there are issues Andres has overlooked that are more serious or complicated to fix, I would like to see them raised before the next CF in June.
Steve >>> BTW, why does all the transaction reordering stuff has to be in core? >> It didn't use to, but people argued pretty damned hard that no undecoded >> data should ever allowed to leave the postgres cluster. And to be fair >> it makes writing an output plugin *way* much easier. Check >> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=contrib/test_decoding/test_decoding.c;hb=xlog-decoding-rebasing-cf4 >> >> If you skip over tuple_to_stringinfo(), which is just pretty generic >> scaffolding for converting a whole tuple to a string, writing out the >> changes in some format by now is pretty damn simple. >> > > I think we will find that the replication systems won't be the only > users of this feature. I have often seen systems that have a logging > requirement for auditing purposes or to log then reconstruct the > sequence of changes made to a set of tables in order to feed a > downstream application. Triggers and a journaling table are the > traditional way of doing this but it should be pretty easy to write a > plugin to accomplish the same thing that should give better > performance. If the reordering stuff wasn't in core this would be > much harder. > > >>> How much of this infrastructure is to support replicating DDL >>> changes? IOW, >>> if we drop that requirement, how much code can we slash? >> Unfortunately I don't think too much unless we add in other code that >> allows us to check whether the current definition of a table is still >> the same as it was back when the tuple was logged. >> >>> Any other features or requirements that could be dropped? I think >>> it's clear at this stage that >>> this patch is not going to be committed as it is. If you can reduce >>> it to a >>> fraction of what it is now, that fraction might have a chance. >>> Otherwise, >>> it's just going to be pushed to the next commitfest as whole, and we're >>> going to be having the same doubts and discussions then. 
>> One thing that reduces complexity is to declare the following as >> unsupported: >> - CREATE TABLE foo(data text); >> - DECODE UP TO HERE; >> - INSERT INTO foo(data) >> VALUES(very-long-to-be-externally-toasted-tuple); >> - DROP TABLE foo; >> - DECODE UP TO HERE; >> >> but thats just a minor thing. >> >> I think what we can do more realistically than to chop of required parts >> of changeset extraction is to start applying some of the preliminary >> patches independently: >> - the relmapper/relfilenode changes + pg_relation_by_filenode(spc, >> relnode) should be independently committable if a bit boring >> - allowing walsenders to connect to a database possibly needs an >> interface change >> but otherwise it should be fine to go in independently. It also has >> other potential use-cases, so I think thats fair. >> - logging xl_running_xact's more frequently could also be committed >> independently and makes sense independently as it allows a standby to >> enter HS faster if the master is busy >> - Introducing InvalidCommandId should be relatively uncontroversial. The >> fact that no invalid value for command ids exists is imo an oversight >> - the *Satisfies change could be applied and they are imo ready but >> there's no use-case for it without the rest, so I am not sure whether >> theres a point >> - currently not separately available, but we could add wal_level=logical >> independently. There would be no user of it, but it would be partial >> work. That includes the relcache support for keeping track of the >> primary key which already is available separately. >> >> >> Greetings, >> >> Andres Freund >> > > >
Attachments
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Heikki Linnakangas
Date:
On 24.01.2013 00:30, Andres Freund wrote:
> Also, while the apply side surely isn't benchmarkable without any being
> submitted, the changeset generation can very well be benchmarked.
>
> A very, very adhoc benchmark:
>  -c max_wal_senders=10
>  -c max_logical_slots=10 --disabled for anything but logical
>  -c wal_level=logical --hot_standby for anything but logical
>  -c checkpoint_segments=100
>  -c log_checkpoints=on
>  -c shared_buffers=512MB
>  -c autovacuum=on
>  -c log_min_messages=notice
>  -c log_line_prefix='[%p %t] '
>  -c wal_keep_segments=100
>  -c fsync=off
>  -c synchronous_commit=off
>
> pgbench -p 5440 -h /tmp -n -M prepared -c 16 -j 16 -T 30
>
> pgbench upstream:
> tps: 22275.941409
> space overhead: 0%
> pgbench logical-submitted
> tps: 16274.603046
> space overhead: 2.1%
> pgbench logical-HEAD (will submit updated version tomorrow or so):
> tps: 20853.341551
> space overhead: 2.3%
> pgbench single plpgsql trigger (INSERT INTO log(data) VALUES(NEW::text))
> tps: 14101.349535
> space overhead: 369%
>
> Note that in the single trigger case nobody consumed the queue while the
> logical version streamed the changes out and stored them to disk.

That makes the space overhead comparison completely worthless, no? I would expect the trigger-based approach to generate roughly 100% more WAL, not close to 400%. As long as the queue is drained constantly, there should be no big difference in the disk space used, except for the WAL.

> Adding a default NOW() or similar to the tables immediately makes
> logical decoding faster by a factor of about 3 in comparison to the
> above trivial trigger.

Hmm, is that because of the conversion to text? I believe slony also converts all the values to text in the trigger, because that's simple and flexible, but if we're trying to compare the performance of logical changeset generation vs. trigger-based replication in general, we should choose the most efficient trigger-based scheme to compare with. That means, don't convert to text.
And write the trigger in C. - Heikki
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
On 2013-01-26 16:20:33 -0500, Steve Singer wrote: > On 13-01-24 11:15 AM, Steve Singer wrote: > >On 13-01-24 06:40 AM, Andres Freund wrote: > >> > >>Fair enough. I am also working on a user of this infrastructure but that > >>doesn't help you very much. Steve Singer seemed to make some stabs at > >>writing an output plugin as well. Steve, how far did you get there? > > > >I was able to get something that generated output for INSERT statements in > >a format similar to what a modified slony apply trigger would want. This > >was with the list of tables to replicate hard-coded in the plugin. This > >was with the patchset from the last commitfest.I had gotten a bit hung up > >on the UPDATE and DELETE support because slony allows you to use an > >arbitrary user specified unique index as your key. It looks like better > >support for tables with a unique non-primary key is in the most recent > >patch set. I am hoping to have time this weekend to update my plugin to > >use parameters passed in on the init and other updates in the most recent > >version. If I make some progress I will post a link to my progress at the > >end of the weekend. My big issue is that I have limited time to spend on > >this. > > > > This isn't a complete review just a few questions I've hit so far that I > thought I'd ask to see if I'm not seeing something related to updates. > + extern void relationFindPrimaryKey(Relation pkrel, Oid *indexOid, > + int16 *nratts, int16 *attnums, Oid > *atttypids, > + Oid *opclasses); > + > > I don't see this defined anywhere could it be left over from a previous > version of the patch? Yes, its dead and now gone. > In decode.c > DecodeUpdate: > + > + /* > + * FIXME: need to get/save the old tuple as well if we want primary key > + * changes to work. > + */ > + change->newtuple = ReorderBufferGetTupleBuf(reorder); > > I also don't see any code in heap_update to find + save the old primary key > values like you added to heap_delete. 
> You didn't list "Add ability to
> change the primary key on an UPDATE" in the TODO so I'm wondering if I'm
> missing something. Is there another way I can get the primary key values
> for the old_tuple?

Nope, there isn't any right now. I have considered it as something not all that interesting for real-world usecases based on my experience, but adding support shouldn't be that hard anymore, so I can just bite the bullet...

> I think the name of the test contrib module was changed but you didn't
> update the make file. This fixes it

Yea, I had forgotten to add that hunk when committing. Fixed.

Thanks,

Andres

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
Hi,

On 2013-01-27 23:07:51 -0500, Steve Singer wrote:
> A few more comments;
>
> In decode.c DecodeDelete
>
> +       if (r->xl_len <= (SizeOfHeapDelete + SizeOfHeapHeader))
> +       {
> +               elog(DEBUG2, "huh, no primary key for a delete on wal_level = logical?");
> +               return;
> +       }
> +
> I think we should be passing delete's with candidate key data logged to the
> plugin. If the table isn't a replicated table then ignoring the delete is
> fine. If the table is a replicated table but someone has deleted the unique
> index from the table then the plugin will receive INSERT changes on the
> table but not DELETE changes. If this happens the plugin would have any way
> of knowing that it is missing delete changes. If my plugin gets passed a
> DELETE change record but with no key data then my plugin could do any of

I basically didn't do that because I thought people would forget to check whether oldtuple is empty. I have no problem with adding support for that though.

> 1. Start screaming for help (ie log errors)

Yes.

> 2. Drop the table from replication

No, you can't write from an output plugin, and I don't immediately see support for that coming. There are no fundamental blockers, it just makes things more complicated.

> 3. Pass the delete (with no key values) onto the replication client and let
> it deal with it (see 1 and 2)

Hm. While I agree that nicer behaviour would be good, I think the real enforcement should happen on a higher level, e.g. with event triggers or somesuch. It seems way too late to do anything about it when we're already decoding. The transaction will already have committed...

> Also, 'huh' isn't one of our standard log message phrases :)

You're right there ;). I basically wanted to remove the log message almost instantly but it was occasionally useful so I kept it around...

> How do you plan on dealing with sequences?
> I don't see my plugin being called on sequence changes and I don't see
> XLOG_SEQ_LOG listed in DecodeRecordIntoReorderBuffer.
> Is there a reason why
> this can't be easily added?

I basically was hoping for Simon's sequence-am to get in before doing anything real here. That didn't really happen yet. I am not sure whether there's a real usecase in decoding normal XLOG_SEQ_LOG records; their content isn't all that easy to interpret unless you're rather familiar with pg's innards. So, adding support wouldn't be hard from a technical pov, but it seems the semantics are a bit hard to nail down.

> Also what do we want to do about TRUNCATE support. I could always leave a
> TRUNCATE trigger in place that logged the truncate to a sl_truncates and
> have my replication daemon respond to the insert on a sl_truncates table
> by actually truncating the data on the replica.

I have planned to add some generic "table_rewrite" handling, but I have to admit I haven't thought too much about it yet. Currently if somebody rewrites a table, e.g. with an ALTER ... ADD COLUMN .. DEFAULT .. or ALTER COLUMN ... USING ..., you will see INSERTs into a temporary table. That basically seems to be a good thing, but the user needs to be told about that ;)

> I've spent some time this weekend updating my prototype plugin that
> generates slony 2.2 style COPY output. I have attached my progress here
> (also https://github.com/ssinger/slony1-engine/tree/logical_repl). I have
> not gotten as far as modifying slon to act as a logical log receiver, or
> made a version of the slony apply trigger that would process these
> changes.

I only gave it a quick look and have a couple of questions and remarks. The way you used the options, it looks like you're thinking of specifying all the tables as options? I would have thought those would get stored & queried locally and only something like the 'replication set' name or such would be set as an option.
Iterating over a list with

    for (i = 0; i < options->length; i = i + 2)
    {
        DefElem *def_schema = (DefElem *) list_nth(options, i);

is not a good idea btw, thats quadratic in complexity ;)

In the REORDER_BUFFER_CHANGE_UPDATE case I suggest using relation->rd_primary, just as in the DELETE case; that should always give you a consistent candidate key in an efficient manner.

> I haven't looked into the details of what is involved in setting up a
> subscription with the snapshot exporting.

That hopefully shouldn't be too hard... At least thats the idea :P

> I couldn't get the options on the START REPLICATION command to parse so I
> just hard coded some list building code in the init method. I do plan on
> passing the list of tables to replicate from the replica to the plugin
> (because this list comes from the replica). Passing what could be a few
> thousand table names as a list of arguments is a bit ugly and I admit my
> list processing code is rough. Does this make us want to reconsider the
> format of the option_list?

Yea, something's screwed up there, sorry. Will push a fix later today.

> I guess should provide an opinion on if I think that the patch in this CF,
> if committed could be used to act as a source for slony instead of the log
> trigger.
>
> The biggest missing piece I mentioned in my email yesterday, that we aren't
> logging the old primary key on row UPDATEs. I don't see building a credible
> replication system where you don't allow users to update any column of a
> row.

Ok, I really thought this wouldn't be that much of an issue in a first version, but if you think its important, I'll add support for it. Shouldn't be too hard.

> The other issues I've raised (DecodeDelete hiding bad deletes, replication
> options not parsing for me) look like easy fixes
>
> no wal decoding support for sequences or truncate are things that I could
> work around by doing things much like slony does today.
> The SYNC can still
> capture the sequence changes in a table (where the INSERT's would be
> logged) and I can have a trigger capture truncates.

Could you explain a bit what's being done there in slony?

> If this patch is going to get bumped to 9.4 I really hope that someone with
> good knowledge of the internals (ie a committer) can give this patch a good
> review sooner rather than later. If there are issues Andres has overlooked
> that are more serious or complicated to fix I would like to see them raised
> before the next CF in June.

Absolutely seconded. I *really* would love to see a more technical review; it's hard to see issues after spending that much time in a certain worldview...

Thanks!

Andres

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-27 12:28:21 -0500, Steve Singer wrote:
> On 13-01-22 11:30 AM, Andres Freund wrote:
>> Hi,
>>
>> I pushed a new rebased version (the xlogreader commit made it annoying
>> to merge).
>>
>> The main improvements are
>> * way much coherent code internally for intializing logical rep
>> * explicit control over slots
>> * options for logical replication
>
> Exactly what is the syntax for using that. My reading your changes to
> repl_gram.y make me think that any of the following should work (but they
> don't).
>
> START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1')
> ERROR:  syntax error: unexpected character "("
>
> START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1' 'val1')
> ERROR:  syntax error: unexpected character "("
>
> START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1','opt2')
> ERROR:  syntax error: unexpected character "("

The syntax is right, the grammar (or rather scanner) support is a bit botched; will push a new version soon.

> I'm also attaching a patch to pg_receivellog that allows you to specify
> these options on the command line. I'm not saying I think that it is
> appropriate to be adding more bells and whistles to the utilities two weeks
> into the CF but I found this useful for testing so I'm sharing it.

The CF is also there to find UI warts and such, so something like this seems perfectly fine, even more so as it doesn't look like this will get into 9.3 anyway. I wanted to add such an option, but I was too lazy^Wbusy to think about the semantics. Your current syntax doesn't really allow arguments to be specified in a nice way. I was thinking of -o name=value and allowing multiple specifications of -o to build the option string. Any arguments against that?
> 	/* Initiate the replication stream at specified location */
> -	snprintf(query, sizeof(query), "START_LOGICAL_REPLICATION '%s' %X/%X",
> -			 slot, (uint32) (startpos >> 32), (uint32) startpos);
> +	snprintf(query, sizeof(query), "START_LOGICAL_REPLICATION '%s' %X/%X (%s)",
> +			 slot, (uint32) (startpos >> 32), (uint32) startpos, plugin_opts);

ISTM that (%s) shouldn't be specified when there are no options, but as the
options need to be pre-escaped anyway, that looks like a non-problem in a
bit more complete implementation.

Greetings,

Andres Freund
Re: Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
On 2013-01-28 11:59:52 +0200, Heikki Linnakangas wrote:
> On 24.01.2013 00:30, Andres Freund wrote:
> > Also, while the apply side surely isn't benchmarkable without any being
> > submitted, the changeset generation can very well be benchmarked.
> >
> > A very, very adhoc benchmark:
> > -c max_wal_senders=10
> > -c max_logical_slots=10 --disabled for anything but logical
> > -c wal_level=logical --hot_standby for anything but logical
> > -c checkpoint_segments=100
> > -c log_checkpoints=on
> > -c shared_buffers=512MB
> > -c autovacuum=on
> > -c log_min_messages=notice
> > -c log_line_prefix='[%p %t] '
> > -c wal_keep_segments=100
> > -c fsync=off
> > -c synchronous_commit=off
> >
> > pgbench -p 5440 -h /tmp -n -M prepared -c 16 -j 16 -T 30
> >
> > pgbench upstream:
> > tps: 22275.941409
> > space overhead: 0%
> > pgbench logical-submitted
> > tps: 16274.603046
> > space overhead: 2.1%
> > pgbench logical-HEAD (will submit updated version tomorrow or so):
> > tps: 20853.341551
> > space overhead: 2.3%
> > pgbench single plpgsql trigger (INSERT INTO log(data) VALUES(NEW::text))
> > tps: 14101.349535
> > space overhead: 369%
> >
> > Note that in the single trigger case nobody consumed the queue while the
> > logical version streamed the changes out and stored them to disk.
>
> That makes the space overhead comparison completely worthless, no? I would
> expect the trigger-based approach to generate roughly 100% more WAL, not
> close to 400%. As long as the queue is drained constantly, there should be
> no big difference in the disk space used, except for the WAL.

Imo it's a valid comparison, as all such queues can only be drained in a
rather imperfect manner. I think these days all solutions use multiple (two)
queue tables, switch between those, and truncate the non-active one, as
vacuuming them works far too unreliably. And those tables have to be plain
logged once, so they matter in checkpoints et al.
> > Adding a default NOW() or similar to the tables immediately makes
> > logical decoding faster by a factor of about 3 in comparison to the
> > above trivial trigger.
>
> Hmm, is that because of the conversion to text? I believe slony also
> converts all the values to text in the trigger, because that's simple and
> flexible, but if we're trying to compare the performance of logical
> changeset generation vs. trigger-based replication in general, we should
> choose the most efficient trigger-based scheme to compare with. That means,
> don't convert to text. And write the trigger in C.

Imo it's basically impossible for the current queue-based solutions not to
convert to text, because they otherwise would need to queue all the
conversion information as well. And the test_decoding plugin also converts
everything to text, so that's a fair comparison from that POV. In fact the
test_decoding plugin does noticeably more, as it also outputs table, column
and type names.

I agree on the C argument. I really doubt it's going to make that much of a
difference but we should try it. In my experience a plpgsql trigger that
just does a straight conversion via cast is still noticeably faster than any
of the "real" replication triggers out there though, so I wouldn't expect
much there.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13-01-28 06:17 AM, Andres Freund wrote:
> Hi,
>
>> 3. Pass the delete (with no key values) onto the replication client and
>> let it deal with it (see 1 and 2)
>
> Hm.
>
> While I agree that nicer behaviour would be good I think the real
> enforcement should happen on a higher level, e.g. with event triggers or
> somesuch. It seems way too late to do anything about it when we're
> already decoding. The transaction will already have committed...

Ideally the first line of enforcement would be with event triggers. The
thing with user-level mechanisms for enforcing things is that they can
sometimes be disabled or bypassed. I don't have a lot of sympathy for people
who do this, but I like the idea of at least having the option of coding
defensively to detect the situation and whine to the user.

>> How do you plan on dealing with sequences?
>> I don't see my plugin being called on sequence changes and I don't see
>> XLOG_SEQ_LOG listed in DecodeRecordIntoReorderBuffer. Is there a reason
>> why this can't be easily added?
>
> I basically was hoping for Simon's sequence-am to get in before doing
> anything real here. That didn't really happen yet.
> I am not sure whether there's a real usecase in decoding normal
> XLOG_SEQ_LOG records; their content isn't all that easy to interpret
> unless you're rather familiar with pg's innards.
>
> So, adding support wouldn't be hard from a technical pov but it seems the
> semantics are a bit hard to nail down.
>
>> Also what do we want to do about TRUNCATE support. I could always leave a
>> TRUNCATE trigger in place that logged the truncate to a sl_truncates
>> table and have my replication daemon respond to the insert on a
>> sl_truncates table by actually truncating the data on the replica.
>
> I have planned to add some generic "table_rewrite" handling, but I have
> to admit I haven't thought too much about it yet. Currently if somebody
> rewrites a table, e.g. with an ALTER ... ADD COLUMN .. DEFAULT .. or
> ALTER COLUMN ...
> USING ..., you will see INSERTs into a temporary
> table. That basically seems to be a good thing, but the user needs to be
> told about that ;)
>
>> I've spent some time this weekend updating my prototype plugin that
>> generates slony 2.2 style COPY output. I have attached my progress here
>> (also https://github.com/ssinger/slony1-engine/tree/logical_repl). I have
>> not gotten as far as modifying slon to act as a logical log receiver, or
>> made a version of the slony apply trigger that would process these
>> changes.
>
> I only gave it a quick look and have a couple of questions and
> remarks. The way you used the options it looks like you're thinking of
> specifying all the tables as options? I would have thought those would
> get stored & queried locally and only something like the 'replication
> set' name or such would be set as an option.

The way slony works today is that the list of tables to pull for a SYNC
comes from the subscriber, because the subscriber might be behind the
provider, where a table has been removed from the set in the meantime. The
subscriber still needs to receive data from that table until it is caught up
to the point where that removal happens. Having a time-travelled version of
a user table (sl_table) might fix that problem, but I haven't yet figured
out how that needs to work with cascading (since that is a feature of slony
today I can't ignore the problem).

I'm also not sure how that will work with table renames. Today if the user
renames a table inside of an EXECUTE SCRIPT, slony will update the name of
the table in sl_table. This type of change wouldn't be visible (yet) in the
time-travelled catalog. There might be a solution to this, but I haven't
figured it out yet. Sticking with what slony does today seemed easier as a
first step.
> Iterating over a list with
> for(i = 0; i < options->length; i = i + 2)
> {
>     DefElem *def_schema = (DefElem *) list_nth(options, i);
> is not a good idea btw, that's quadratic in complexity ;)

Thanks, I'll rewrite this to walk a list of ListCell objects with next.

> In the REORDER_BUFFER_CHANGE_UPDATE I suggest using
> relation->rd_primary, just as in the DELETE case, that should always
> give you a consistent candidate key in an efficient manner.
>
>> I haven't looked into the details of what is involved in setting up a
>> subscription with the snapshot exporting.
>
> That hopefully shouldn't be too hard... At least that's the idea :P
>
>> I couldn't get the options on the START REPLICATION command to parse so I
>> just hard coded some list building code in the init method. I do plan on
>> passing the list of tables to replicate from the replica to the plugin
>> (because this list comes from the replica). Passing what could be a few
>> thousand table names as a list of arguments is a bit ugly and I admit my
>> list processing code is rough. Does this make us want to reconsider the
>> format of the option_list?
>
> Yea, something's screwed up there, sorry. Will push a fix later today.
>
>> I guess I should provide an opinion on if I think that the patch in this
>> CF, if committed, could be used to act as a source for slony instead of
>> the log trigger.
>> The biggest missing piece I mentioned in my email yesterday is that we
>> aren't logging the old primary key on row UPDATEs. I don't see building a
>> credible replication system where you don't allow users to update any
>> column of a row.
>
> Ok, I really thought this wouldn't be that much of an issue in a first
> version, but if you think it's important, I'll add support for
> it. Shouldn't be too hard.

If you're using non-surrogate/natural primary keys this tends to come up
occasionally due to data-entry errors or renames.
I'm looking at this from the point of view of what I need to use this as a
source for a production replication system with fewer sharp edges compared
to trigger-based slony. My standard is a bit higher than a 'first' version
because I intend to use it in version 3.0 of slony, not 1.0. If others feel
I'm asking for too much they should speak up; maybe I am. Also, the way
things will fail if someone were to try and update a primary key value is
pretty nasty (it will leave them with inconsistent databases). We could
install UPDATE triggers to try and detect this type of thing, but I'd rather
see us just log the old values so we can use them during replay.

>> The other issues I've raised (DecodeDelete hiding bad deletes,
>> replication options not parsing for me) look like easy fixes
>>
>> no wal decoding support for sequences or truncate are things that I could
>> work around by doing things much like slony does today. The SYNC can
>> still capture the sequence changes in a table (where the INSERT's would
>> be logged) and I can have a trigger capture truncates.
>
> Could you explain a bit what's being done there in slony?

Each time the slon connects to the local database to create a SYNC event,
which is when slony captures snapshot visibility information, it also looks
at all of the replicated sequences and finds any that have changed since the
last SYNC. The sequence values as of the last SYNC are stored in memory. Any
sequences that have changed get their new values written to the table
sl_seqlog. When slon applies row updates for a SYNC it also updates (setval)
any sequences that have changed.

For truncates, the truncate trigger just logs a single row into sl_log
indicating that the table has been truncated. When slon encounters a row of
operation 'TRUNCATE' it executes a TRUNCATE ONLY on the table.
>> If this patch is going to get bumped to 9.4 I really hope that someone
>> with good knowledge of the internals (ie a committer) can give this patch
>> a good review sooner rather than later. If there are issues Andres has
>> overlooked that are more serious or complicated to fix I would like to
>> see them raised before the next CF in June.
>
> Absolutely seconded. I *really* would love to see a more technical
> review, its hard to see issues after spending that much time in a
> certain worldview...
>
> Thanks!
>
> Andres
On 13-01-28 06:23 AM, Andres Freund wrote:
> The CF is also there to find UI warts and such, so something like this
> seems perfectly fine. Even moreso as it doesn't look this will get
> into 9.3 anyway. I wanted to add such an option, but I was too
> lazy^Wbusy to think about the sematics. Your current syntax doesn't
> really allow arguments to be specified in a nice way. I was thinking
> of -o name=value and allowing multiple specifications of -o to build
> the option string. Any arguments against that?

Multiple -o options sound fine to me.

>> 	/* Initiate the replication stream at specified location */
>> -	snprintf(query, sizeof(query), "START_LOGICAL_REPLICATION '%s' %X/%X",
>> -			 slot, (uint32) (startpos >> 32), (uint32) startpos);
>> +	snprintf(query, sizeof(query), "START_LOGICAL_REPLICATION '%s' %X/%X (%s)",
>> +			 slot, (uint32) (startpos >> 32), (uint32) startpos, plugin_opts);
>
> ISTM that (%s) shouldn't be specified when there are no options, but as
> the options need to be pre-escaped anyway, that looks like a non-problem
> in a bit more complete implementation.
>
> Greetings,
>
> Andres Freund
On 2013-01-28 12:23:02 +0100, Andres Freund wrote:
> On 2013-01-27 12:28:21 -0500, Steve Singer wrote:
> > On 13-01-22 11:30 AM, Andres Freund wrote:
> > > Hi,
> > >
> > > I pushed a new rebased version (the xlogreader commit made it annoying
> > > to merge).
> > >
> > > The main improvements are
> > > * way more coherent code internally for initializing logical rep
> > > * explicit control over slots
> > > * options for logical replication
> >
> > Exactly what is the syntax for using that? My reading of your changes to
> > repl_gram.y makes me think that any of the following should work (but
> > they don't).
> >
> > START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1')
> > ERROR: syntax error: unexpected character "("
> >
> > START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1' 'val1')
> > ERROR: syntax error: unexpected character "("
> >
> > START_LOGICAL_REPLICATION 'slon1' 0/0 ('opt1','opt2')
> > ERROR: syntax error: unexpected character "("
>
> The syntax is right, the grammar (or rather scanner) support is a bit
> botched, will push a new version soon.

Pushed and rebased some minutes ago. I changed the syntax so that slot
names, plugins, and option names are identifiers and behave just like
normal SQL identifiers. That means ' needs to be changed to ".

The new version is rebased on top of fklocks, walsender et al, which was a
bit of work but actually makes more comprehensive logging in heap_update
easier. That will come tomorrow.

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v4 - Heikki's thoughts about the patch state
From
Andres Freund
Date:
On 2013-01-28 16:55:52 -0500, Steve Singer wrote:
> If you're using non-surrogate/natural primary keys this tends to come up
> occasionally due to data-entry errors or renames. I'm looking at this from
> the point of view of what I need to use this as a source for a production
> replication system with fewer sharp edges compared to trigger-based slony.
> My standard is a bit higher than a 'first' version because I intend to use
> it in version 3.0 of slony, not 1.0. If others feel I'm asking for too
> much they should speak up; maybe I am. Also the way things will fail if
> someone were to try and update a primary key value is pretty nasty (it
> will leave them with inconsistent databases). We could install UPDATE
> triggers to try and detect this type of thing but I'd rather see us just
> log the old values so we can use them during replay.

I pushed support for this. I am not yet 100% happy with it due to three
issues:

* it increases the xlog size logged by heap_update by 2 bytes even with
  wal_level < logical, as it uses a variant of xl_heap_header that includes
  its length. Conditionally using xl_heap_header would make the code even
  harder to read. Is that acceptable?
* multi_insert should be converted to use xl_heap_header_len as well,
  instead of using xl_multi_insert_tuple; that would also reduce the amount
  of multi-insert specific code in decode.c
* both for update and delete we should denote more explicitly that
  ->oldtuple points to an index tuple, not to a full tuple

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 2, 2013 at 4:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-01-28 16:55:52 -0500, Steve Singer wrote:
> > If you're using non-surrogate/natural primary keys this tends to come up
> > occasionally due to data-entry errors or renames. I'm looking at this
> > from the point of view of what I need to use this as a source for a
> > production replication system with fewer sharp edges compared to
> > trigger-based slony. My standard is a bit higher than a 'first' version
> > because I intend to use it in version 3.0 of slony, not 1.0. If others
> > feel I'm asking for too much they should speak up; maybe I am. Also the
> > way things will fail if someone were to try and update a primary key
> > value is pretty nasty (it will leave them with inconsistent databases).
> > We could install UPDATE triggers to try and detect this type of thing
> > but I'd rather see us just log the old values so we can use them during
> > replay.
>
> I pushed support for this. I am not yet 100% happy with this due to two
> issues:
>
> * it increases the xlog size logged by heap_update by 2 bytes even with
>   wal_level < logical as it uses a variant of xl_heap_header that
>   includes its length. Conditionally using xl_heap_header would make the
>   code even harder to read. Is that acceptable?

I think it's important to avoid adding to DML WAL volume when wal_level <
logical. I am not positive that 2 bytes is noticeable, but I'm not positive
that it isn't either: heap insert/update must be our most commonly-used WAL
records. On the other hand, we also need to keep in mind that branches in
hot code paths aren't free either. I would be more concerned about the
increased run-time cost of constructing the correct WAL record than with
the related code complexity. None of that code is simple anyway.
> * multi_insert should be converted to use xl_heap_header_len as well,
>   instead of using xl_multi_insert_tuple, that would also reduce the
>   amount of multi-insert specific code in decode.c
> * both for update and delete we should denote more explicitly that
>   ->oldtuple points to an index tuple, not to a full tuple

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company