Discussion: Multimaster

Multimaster

From
Konstantin Knizhnik
Date:
Hi,

Our company, PostgresPro, is developing a multimaster extension for Postgres,
an analogue of MySQL Galera.
It is based on logical replication and the pglogical plugin from
2ndQuadrant, and it uses a distributed transaction manager.
The main differences between multimaster and standard PostgreSQL streaming
replication with hot standby are:

1. There is no time gap between master and slave (such a gap is currently
present even with synchronous streaming replication).
2. Any transaction can be executed on any node. With hot standby you can
only execute read-only queries on replicas and, moreover, you cannot
create temporary tables to store the results of complex queries. With hot
standby you have to change your application to separate read-only and
read-write transactions, or use special proxies like PgBouncer.
3. Multimaster provides high availability out of the box: it is able to
detect node failure, recover automatically, and return the node to
the cluster. Unlike HA stacks based on streaming replication,
multimaster doesn't require external tools like Corosync and Pacemaker.

Unlike bidirectional replication, multimaster provides distributed
consistency: all transactions always see consistent snapshots.
If transactions on different nodes try to update the same data,
conflicts are detected in the same way as in standalone
Postgres. No custom conflict resolution is required.
Multimaster uses two-phase commit, so a transaction is either
applied on all nodes or on none. Thus the data on all online nodes is
identical.
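
To make the two-phase commit part concrete, here is a minimal sketch of the
underlying SQL primitives (in multimaster this is driven automatically by the
transaction manager; the table and transaction identifier below are only
illustrative, and max_prepared_transactions must be non-zero on every node):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    -- phase 1: every node prepares the transaction and makes it durable
    PREPARE TRANSACTION 'mm-tx-42';
    -- phase 2: once all nodes have prepared successfully, commit everywhere
    COMMIT PREPARED 'mm-tx-42';
    -- if any node fails to prepare, the transaction is rolled back everywhere:
    -- ROLLBACK PREPARED 'mm-tx-42';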

Multimaster doesn't provide:
1. Write scalability: the changes have to be applied to all nodes, so
there cannot be any write performance gain over a standalone server. If
you need write scalability, use sharding; in that case you have to use
Postgres-XL, Greenplum, pg_shard, or some application-level solution.
2. Optimization of complex analytic queries using distributed query
execution. Multimaster doesn't build distributed execution plans and
cannot speed up execution of a single heavy query.
3. Geo-replication. Although it is possible to have multimaster nodes in
different data centers, in practice it will lead to very low write
performance. For such scenarios asynchronous BDR is a much better solution.


So multimaster may be efficient for providing high availability and load
balancing when most of your queries are read-only.

We would be very interested in users' feedback and in collecting
possible use cases and workloads for multimaster.
Performance of multimaster greatly depends on the relative amount of update
queries and on transaction size (most of the multimaster overhead is related
to transaction processing). For simple pgbench-like queries, the
performance of a three-node multimaster is about two times lower than that
of standalone Postgres on update queries and about two times
higher on read-only queries.
For complex analytic queries (like TPC-H) the results are much better
and multimaster provides almost linear scalability.
But multimaster is not intended for read-only OLAP: there are more
efficient solutions for that.
We expect the biggest performance benefits for OLTP workloads with a high
ratio of read-only queries.
This is why we are trying to find workloads for multimaster that are as close
to real life as possible. Any references to benchmarks, examples of queries,
scenarios, etc. are welcome.

We would also like to get feedback on the following questions:

1. Currently logical replication requires a primary key on each replicated
table (otherwise updates cannot be propagated); see the first sketch after this list.
How critical is this limitation for potential multimaster use cases?

2. Is it necessary to support local (not replicated) tables?

3. Right now multimaster replicates the whole instance. Alternatively we
can try to implement database-level replication, making it possible
to provide more flexible configurations with different databases connected
into different clusters. It is hard to estimate now how difficult it
will be to support such a mode, but first of all we want to know whether this
feature is really needed.

4. We plan to implement sharding support. But first we have to
understand what people primarily want from a cluster:
HA, load balancing, distributed queries, etc. In particular, what use
cases do you see for multimaster without sharding?

5. What is the preferred way of configuring the cluster: through
postgresql.conf, or using some special API that allows nodes to be dynamically
added to and removed from the cluster? (See the second sketch after this list.)
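
To make questions 1 and 5 a bit more concrete, here are two minimal sketches.
The first shows the usual logical-decoding workaround for a table without a
primary key: declaring a unique index as its replica identity. The second is
purely hypothetical and only illustrates the two configuration styles being
asked about; the parameter and function names are invented for illustration
and are not an existing API.

    -- question 1: replica identity instead of a primary key
    CREATE TABLE t (u integer NOT NULL, payload text);
    CREATE UNIQUE INDEX t_u_idx ON t (u);
    ALTER TABLE t REPLICA IDENTITY USING INDEX t_u_idx;
    -- or, at the cost of logging whole old rows:
    -- ALTER TABLE t REPLICA IDENTITY FULL;

    -- question 5: hypothetical configuration styles
    -- Style A: static settings in postgresql.conf
    --   shared_preload_libraries = 'multimaster'
    --   multimaster.node_id      = 1
    --   multimaster.conn_strings = 'host=node1 ..., host=node2 ..., host=node3 ...'
    -- Style B: dynamic membership through SQL-level functions
    SELECT mm.add_node('host=node4 dbname=postgres');
    SELECT mm.drop_node(4);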

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Multimaster

From
Simon Riggs
Date:
On 1 April 2016 at 11:33, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
Our company, PostgresPro, is developing a multimaster extension for Postgres, an analogue of MySQL Galera.
It is based on logical replication and the pglogical plugin from 2ndQuadrant, and it uses a distributed transaction manager.

Hi Konstantin,

Is this open source with The PostgreSQL Licence?

Will you be contributing those changes to the BDR project, or is this a permanent fork of that?

Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multimaster

From
Konstantin Knizhnik
Date:
Hi Simon,

Yes, we will distribute all our code under the PostgreSQL license.
Right now we are using a copy of the pglogical_output plugin plus a receiver part written based on the BDR code (just because the pglogical plugin had not been released yet when we started working on multimaster).
We plan to eliminate the copies of the pglogical plugin files from our code and refer to the pglogical plugin instead.

Right now the main problem is parallel apply: we need to apply changes concurrently to avoid unintended dependencies causing deadlocks and to provide reasonable performance.

We also need 2PC support, but that code was already sent to you by Stas, so I hope that at some point it will be included in PostgreSQL core and the pglogical plugin.

There are also some minor technical issues which led us to make a few changes in the pglogical code, but we did our best to keep the original versions unchanged, so that we can switch to the public version in the future.




On 01.04.2016 14:38, Simon Riggs wrote:
On 1 April 2016 at 11:33, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
Our company, PostgresPro, is developing a multimaster extension for Postgres, an analogue of MySQL Galera.
It is based on logical replication and the pglogical plugin from 2ndQuadrant, and it uses a distributed transaction manager.

Hi Konstantin,

Is this open source with The PostgreSQL Licence?

Will you be contributing those changes to the BDR project, or is this a permanent fork of that?

Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Multimaster

From
"Joshua D. Drake"
Date:
On 04/01/2016 04:50 AM, Konstantin Knizhnik wrote:

> There are also some minor technical issues which led us to make a few
> changes in the pglogical code, but we did our best to keep the original
> versions unchanged, so that we can switch to the public version in the future.

It seems the better solution for all of this would be for BDR and this
project to come together as a community project to get multimaster for
PostgreSQL. Otherwise we are going to end up with the same situation we
had before Pg had master-slave replication.

JD

--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.


Re: Multimaster

From
Moreno Andreo
Date:
On 01/04/2016 17:22, Joshua D. Drake wrote:
> On 04/01/2016 04:50 AM, Konstantin Knizhnik wrote:
>
>> There are also some minor technical issues which led us to make a few
>> changes in the pglogical code, but we did our best to keep the original
>> versions unchanged, so that we can switch to the public version in the future.
>
> It seems the better solution for all of this would be for BDR and this
> project to come together as a community project to get multimaster for
> PostgreSQL. Otherwise we are going to end up with the same situation
> we had before Pg had master-slave replication.
>
> JD
>
I'd like to post my use case, hoping it is of some interest.
We provide a cloud service, based on our professional application,
hosting databases on our server and replicating them to every site the
users register with.
Every user has (for now) a single database to keep in sync.
Users write to their local database and the data is replicated with the
server via a background process that runs while the user is connected.
We have to improve what our replicator is doing: it only
replicates a single user's database. The improvement is that we should be
able to put it on the "server" (in some cases there are groups of users
sharing a dedicated server) and, given a configuration of what and how
to replicate, it should replicate more than one DB at a time.
Currently it's a stand-alone program, but what we'd like is
something more integrated into PostgreSQL, so that wherever PostgreSQL can
run (everywhere!), it can too.
We were beginning to "translate" (and then improve) this program in C#
when I bumped into articles pointing to BDR, and I started taking a
look. But it seems that it is good at replicating whole servers, and still
doesn't have the granularity we need.

My 2 cents...

Cheers,
Moreno.



Re: Multimaster

From
Dorian Hoxha
Date:
@Konstantin
1. It's OK in my cases.
2. Not required in my cases.
3. Just require users to use different servers for now, I think. Sometimes (always?) users can be greedy with feature requests.
4. I want magical consistency + failover (I can instruct the client to retry all masters).

A good cluster is the only thing stopping PostgreSQL from killing many NoSQL systems, in my opinion. CitusDB is single-master (it can't even update many rows on the same shard), Postgres-XL has no high availability (and needs 3 servers per node plus a global DTM), Greenplum is OLAP (and based on an old version of Postgres), and application-side sharding sucks.

With sharding, the holy grail is automatic global range partitioning (Hypertable, HBase, CockroachDB, Bigtable, etc.) on the primary key or a special column. The downside is that the indexes of a row may reside in another region, which may mean another server, so you need a cross-server transaction to update a row.
In this case, I think you can store the indexes on the same server as the row and be fine.

To get good speed you can implement: single-shard-only queries (no cross-node transactions) and asynchronous disk changes (so the changes are made only in the memory of all replicated servers and not in the commit log, as Couchbase does).

Multi-shard transactions aren't needed as much, in my opinion.

5. The fewer places/files there are to configure any system, the better, IMHO.

Cheers and good luck!

On Fri, Apr 1, 2016 at 8:15 PM, Moreno Andreo <moreno.andreo@evolu-s.it> wrote:
On 01/04/2016 17:22, Joshua D. Drake wrote:
On 04/01/2016 04:50 AM, Konstantin Knizhnik wrote:

There are also some minor technical issues which led us to make a few
changes in the pglogical code, but we did our best to keep the original
versions unchanged, so that we can switch to the public version in the future.

It seems the better solution for all of this would be for BDR and this project to come together as a community project to get multimaster for PostgreSQL. Otherwise we are going to end up with the same situation we had before Pg had master-slave replication.

JD

I'd like to post my use case, hoping it is of some interest.
We provide a cloud service, based on our professional application, hosting databases on our server and replicating them to every site the users register with.
Every user has (for now) a single database to keep in sync.
Users write to their local database and the data is replicated with the server via a background process that runs while the user is connected.
We have to improve what our replicator is doing: it only replicates a single user's database. The improvement is that we should be able to put it on the "server" (in some cases there are groups of users sharing a dedicated server) and, given a configuration of what and how to replicate, it should replicate more than one DB at a time.
Currently it's a stand-alone program, but what we'd like is something more integrated into PostgreSQL, so that wherever PostgreSQL can run (everywhere!), it can too.
We were beginning to "translate" (and then improve) this program in C# when I bumped into articles pointing to BDR, and I started taking a look. But it seems that it is good at replicating whole servers, and still doesn't have the granularity we need.

My 2 cents...

Cheers,
Moreno.





Re: Multimaster

From
Craig Ringer
Date:
On 1 April 2016 at 19:50, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Right now the main problem is parallel apply: we need to apply changes concurrently to avoid unintended dependencies causing deadlocks and provide reasonable performance.

How do you intend to approach that?

You should be able to do parallel apply between nodes trivially, i.e. node A applies changes in parallel from nodes B, C and D.

Doing parallel apply of multiple changes from node A to node B is much harder. I wrote about parallel logical apply separately so I won't repeat it here; search the archives for the notes if interested.
 
We also need 2PC support, but that code was already sent to you by Stas, so I hope that at some point it will be included in PostgreSQL core and the pglogical plugin.

I never got a response to my suggestion that testing of upstream DDL is needed for that. I want to see more on how you plan to handle DDL on the upstream side that changes the table structure and acquires strong locks. Especially when it's combined with row changes in the same prepared xacts. 

I'd really like separate patches for the core changes and the pglogical support for them, too.
 
There are also some minor technical issues which led us to make a few changes in the pglogical code, but we did our best to keep the original versions unchanged, so that we can switch to the public version in the future.

Details?
 
Now is exactly the time to address those points.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Multimaster

From
Craig Ringer
Date:
On 2 April 2016 at 02:15, Moreno Andreo <moreno.andreo@evolu-s.it> wrote:
 
We have to improve what our replicator is doing: it only replicates a single user's database. The improvement is that we should be able to put it on the "server" (in some cases there are groups of users sharing a dedicated server) and, given a configuration of what and how to replicate, it should replicate more than one DB at a time.

That's a layer on top as far as pglogical is concerned. It's only interested in a single database at a time.

The same is true of BDR.

A tool that automatically configures newly found databases to be replicated should be pretty trivial to write, though.
 
We were beginning to "translate" (and then improve) this program in C# when I bumped into articles pointing to BDR, and I started taking a look. But it seems that it is good at replicating whole servers, and still doesn't have the granularity we need.

Huh?

BDR is configured database-by-database. The only exception is with bdr_init_copy, for initial setup using binary base backups; in that case all databases are copied.

It sounds like you actually *want* to replicate all databases at once. Presumably the reason you're not just using physical streaming replication for that  is that different hosts have a different set of writeable databases? E.g.

[Node A]   [Node B]
[DB-1]   ->  [DB-1]
[DB-2]   ->  [DB-2]
[DB-3]   <-  [DB-3]
[DB-4]   <-  [DB-4]

so each DB is written from only one node at a time, but both nodes have writeable DBs. Right?


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Multimaster

From
Moreno Andreo
Date:
On 14/04/2016 07:45, Craig Ringer wrote:
On 2 April 2016 at 02:15, Moreno Andreo <moreno.andreo@evolu-s.it> wrote:

We have to improve what our replicator is doing: it only replicates a single user's database. The improvement is that we should be able to put it on the "server" (in some cases there are groups of users sharing a dedicated server) and, given a configuration of what and how to replicate, it should replicate more than one DB at a time.

That's a layer on top as far as pglogical is concerned. It's only interested in a single database at a time.

The same is true of BDR.

I'll try to explain.
We have a "cloud server" which contains all users' DBs and to which all users connect to get replication.
Some users have more than one work site, i.e. in one they're alone (and the PostgreSQL server is on their own work PC) and in another they're teamed up and have a dedicated server (and everyone can write on teammates' databases).
In the first case it's easy: the only database is replicated with the corresponding one on the cloud server. Straightforward.
At the current state, in the team case, replication occurs only when the user connects to the application, so if a user goes to a particular site, say, only once a week, during the remaining days all data entered by teammates won't be replicated, and the user won't be aware of what's been done until he replicates again...
So we are trying to have a process, running on the server, that checks for any database to be replicated and performs the action on all of them.

Case 1
Local                  Remote
[User1:DB1]    -->    [DB1]

Case 2
[User1:DB1]    -->    [DB1]
[User2:DB2]    -->    [DB2]
...
[UserN:DBn]    -->    [DBn]

A tool that automatically configures newly found databases to be replicated should be pretty trivial to write, though.

All databases that have to be replicated have a flag in the app configuration, and the replication config is set up, for each database and each site, when the first synchronization is made, so we could spawn a process per database and replicate.
Not to mention we should be able, if necessary, to exclude one or more tables from the replication.

We were beginning to "translate" (and then improve) this program in C# when I bumped into articles pointing to BDR, and I started taking a look. But it seems that it is good at replicating whole servers, and still doesn't have the granularity we need.

Huh?

BDR is configured database-by-database. The only exception is with bdr_init_copy, for initial setup using binary base backups; in that case all databases are copied.

Hmmm... it seems I misunderstood something...

It sounds like you actually *want* to replicate all databases at once. Presumably the reason you're not just using physical streaming replication for that is that different hosts have a different set of writeable databases? E.g.

[Node A]   [Node B]
[DB-1]   ->  [DB-1]
[DB-2]   ->  [DB-2]
[DB-3]   <-  [DB-3]
[DB-4]   <-  [DB-4]

so each DB is written from only one node at a time, but both nodes have writeable DBs. Right?

Yes, I tried to explain this in the first part of the message.

Thanks
Moreno

Re: Multimaster

From
konstantin knizhnik
Date:

On Apr 14, 2016, at 8:41 AM, Craig Ringer wrote:

On 1 April 2016 at 19:50, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Right now the main problem is parallel apply: we need to apply changes concurrently to avoid unintended dependencies causing deadlocks and provide reasonable performance.

How do you intend to approach that?

Actually we already have a working implementation of multimaster...
There is a pool of pglogical executors. pglogical_receiver just reads the transaction body from the connection and appends it to a ready-for-execution queue.
Any vacant worker from this pool can dequeue this work and process it.
To provide a correct replication session context, I have to set up a replication session for each transaction. It looks clumsy but fortunately has no effect on performance.

With this approach it is certainly possible that the order of applying transactions is not the same at different nodes.
But this is not a problem if we have a DTM. The only exception is recovery of a multimaster node. In that case we have to apply transactions in exactly the same order as they were applied at the original node from which we are recovering. This is done by applying changes in recovery mode by pglogical_receiver itself.


You should be able to do parallel apply between nodes trivially, i.e. node A applies changes in parallel from nodes B C and D. 

Doing parallel apply of multiple changes from node A to node B is much harder. I wrote about parallel logical apply separately so I won't repeat it here; search the archives for the notes if interested.
 
We also need 2PC support, but that code was already sent to you by Stas, so I hope that at some point it will be included in PostgreSQL core and the pglogical plugin.

I never got a response to my suggestion that testing of upstream DDL is needed for that. I want to see more on how you plan to handle DDL on the upstream side that changes the table structure and acquires strong locks. Especially when it's combined with row changes in the same prepared xacts. 

We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@postgrespro.ru>.
The patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.
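
For illustration only, the table-based approach boils down to something like
the following sketch (the names are illustrative, not our actual code): an
event trigger captures the statement text into a replicated queue table, and
the apply worker on the destination executes whatever it finds there as part
of the same transaction.

    CREATE TABLE replicated_ddl (id bigserial PRIMARY KEY, ddl text NOT NULL);

    CREATE FUNCTION queue_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- remember the statement being executed so it can be replayed elsewhere
        INSERT INTO replicated_ddl (ddl) VALUES (current_query());
    END;
    $$;

    CREATE EVENT TRIGGER capture_ddl ON ddl_command_end
        EXECUTE PROCEDURE queue_ddl();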

Right now we are trying to run all Postgres regression tests against multimaster. Currently about 50 tests fail. There are a lot of different issues: temporary tables, GUC variables, ...
Some we have fixed; investigation of the others is in progress...



I'd really like separate patches for the core changes and the pglogical support for them, too.
 
There are also some minor technical issues which led us to make a few changes in the pglogical code, but we did our best to keep the original versions unchanged, so that we can switch to the public version in the future.

Details?
 
Now is exactly the time to address those points.


Well, recently I made an attempt to merge our code with the latest version of the pglogical plugin (because our original implementation of multimaster was based on code partly taken from BDR) but finally had to postpone most of the changes. My primary intention was to support metadata caching. But the presence of multiple apply workers makes it impossible to implement it in the same way as it is done in the pglogical plugin.

Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But this is not true in the case of multimaster, and I do not want to pay extra cost for functionality we do not need.

Frankly speaking, I do not want to spend much time on integration of multimaster with the pglogical plugin right now, because both of them are under active development. We should definitely do it, but only once we have a stable version, so that there is no need to repeat this work over and over again.

We can try to prepare our "wish list" for the pglogical plugin.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Multimaster

From
Simon Riggs
Date:
On 10 April 2016 at 22:48, Dorian Hoxha <dorian.hoxha@gmail.com> wrote:
 
Postgres-XL has no high availability

Postgres-XL 9.5 has had lots of additional work put in, HA being one of those areas.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multimaster

From
Craig Ringer
Date:
On 14 April 2016 at 17:14, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

On Apr 14, 2016, at 8:41 AM, Craig Ringer wrote:

On 1 April 2016 at 19:50, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Right now the main problem is parallel apply: we need to apply changes concurrently to avoid unintended dependencies causing deadlocks and provide reasonable performance.

How do you intend to approach that?

Actually we already have a working implementation of multimaster...
There is a pool of pglogical executors. pglogical_receiver just reads the transaction body from the connection and appends it to a ready-for-execution queue.

I intend to make the same split in pglogical itself - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.

Any vacant worker from this pool can dequeue this work and process it.

How do you handle correctness of ordering though? A naïve approach will suffer from a variety of anomalies when subject to insert/delete/insert write patterns, among other things. You can also get lost updates, rows deleted upstream that don't get deleted downstream and various other exciting ordering issues.

At absolute minimum you'd have to commit on the downstream in the same commit order as the upstream. This can deadlock. So when you get a deadlock you'd abort the xacts of the deadlocked worker and all xacts with later commit timestamps, then retry the lot.

BDR has enough trouble with this when applying transactions from multiple peer nodes. To a degree it just throws its hands up and gives up - in particular, it can't tell the difference between an insert/update conflict and an update/delete conflict. But that's between loosely coupled nodes where we explicitly document that some kinds of anomalies are permitted. I can't imagine it being OK to have an even more complex set of possible anomalies occur when simply replaying transactions from a single peer...

With this approach it is certainly possible that the order of applying transactions is not the same at different nodes.

Well, it can produce downright wrong results, and the results even in a single-master case will be all over the place.

But this is not a problem if we have a DTM.

How does that follow?
 
The only exception is recovery of a multimaster node. In that case we have to apply transactions in exactly the same order as they were applied at the original node from which we are recovering. This is done by applying changes in recovery mode by pglogical_receiver itself.

I'm not sure I understand what you are saying here.
  
We also need 2PC support, but that code was already sent to you by Stas, so I hope that at some point it will be included in PostgreSQL core and the pglogical plugin.

I never got a response to my suggestion that testing of upstream DDL is needed for that. I want to see more on how you plan to handle DDL on the upstream side that changes the table structure and acquires strong locks. Especially when it's combined with row changes in the same prepared xacts. 

We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@postgrespro.ru>.
The patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.

How does that really improve anything over using a table?

This doesn't address what I asked above though, which is whether you have tried doing ALTER TABLE in a 2PC xact with your 2PC replication patch, especially one that also makes row changes.

Well, recently I made an attempt to merge our code with the latest version of the pglogical plugin (because our original implementation of multimaster was based on code partly taken from BDR) but finally had to postpone most of the changes. My primary intention was to support metadata caching. But the presence of multiple apply workers makes it impossible to implement it in the same way as it is done in the pglogical plugin.

Not with a simplistic implementation of multiple workers that just round-robin process transactions, no. Your receiver will have to be smart enough to read the protocol stream and write the metadata changes to a separate stream all the workers read. Which is awkward.

I think you'll probably need your receiver to act as a metadata broker for the apply workers in the end.

Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But this is not true in the case of multimaster, and I do not want to pay extra cost for functionality we do not need.

All it's really doing is mapping upstream to downstream tables by name, since the oids will be different.

Are you attempting to force table oids to be the same on all nodes, so you can rely on direct 1:1 table oid mappings? 'cos that seems fragile...
 
We can try to prepare our "wish list" for the pglogical plugin.
 
That would be useful. 



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Multimaster

From
Craig Ringer
Date:
On 14 April 2016 at 17:11, Moreno Andreo <moreno.andreo@evolu-s.it> wrote:
 
All databases that have to be replicated have a flag in the app configuration, and the replication config is set up, for each database and each site, when the first synchronization is made, so we could spawn a process per database and replicate.
Not to mention we should be able, if necessary, to exclude one or more tables from the replication.

That should all be pretty simple with what's already there and supported in BDR using replication sets.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Multimaster

From
Konstantin Knizhnik
Date:
Hi,
Thank you for your response.

On 17.04.2016 15:30, Craig Ringer wrote:
I intend to make the same split in pglogical itself - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.


In the case of multimaster the "too far behind" scenario can never happen. So here is yet another difference between the asynchronous and synchronous replication approaches. For asynchronous replication, a situation where the replica is far behind the master is quite normal and has to be addressed without blocking the master. For synchronous replication it is not possible, and all this "spill to disk" just adds extra overhead.

It seems to me that the pglogical plugin is becoming too universal, trying to address a lot of different issues and play different roles.
Here are some use cases for logical replication which I see (I am quite sure that you know more):
1. Asynchronous replication (including geo-replication) - this is actually BDR.
2. Logical backup: transferring data to a different database (including a new version of Postgres).
3. Change notification: there are many different subscribers which can be interested in receiving notifications about database changes.
As far as I know the new JDBC driver is going to use logical replication to receive update streams. It can also be used for update/invalidation of caches in ORMs.
4. Synchronous replication: multimaster

Any vacant worker from this pool can dequeue this work and process it.

How do you handle correctness of ordering though? A naïve approach will suffer from a variety of anomalies when subject to insert/delete/insert write patterns, among other things. You can also get lost updates, rows deleted upstream that don't get deleted downstream and various other exciting ordering issues.

At absolute minimum you'd have to commit on the downstream in the same commit order as the upstream.. This can deadlock. So when you get a deadlock you'd abort the xacts of the deadlocked worker and all xacts with later commit timestamps, then retry the lot.


We are not enforcing the order of commits as Galera does. Consistency is enforced by the DTM, which ensures that transactions at all nodes are given consistent snapshots and assigned the same CSNs. We also have a global deadlock detection algorithm which builds a global lock graph (but false positives are still possible, because this graph is built incrementally and so does not correspond to a single global snapshot).


BDR has enough trouble with this when applying transactions from multiple peer nodes. To a degree it just throws its hands up and gives up - in particular, it can't tell the difference between an insert/update conflict and an update/delete conflict. But that's between loosely coupled nodes where we explicitly document that some kinds of anomalies are permitted. I can't imagine it being OK to have an even more complex set of possible anomalies occur when simply replaying transactions from a single peer...

We should definitely perform more testing here, but right now we do not have any tests that cause synchronization anomalies.


With this approach it is certainly possible that the order of applying transactions is not the same at different nodes.

Well, it can produce downright wrong results, and the results even in a single-master case will be all over the place.

But this is not a problem if we have a DTM.

How does that follow?

Multimaster is just a particular (and the simplest) case of distributed transactions. What is specific to multimaster is that the same transaction has to be applied at all nodes and that selects can be executed at any node. The goal of the DTM is to provide consistent execution of distributed transactions. If it is able to do that for arbitrary transactions then, definitely, it can do it for multimaster.
I cannot give you a formal proof here that our DTM is able to solve all these problems. Certainly there may be bugs in the implementation,
and this is why we need to perform more testing. But we are not "reinventing the wheel": our DTM is based on existing approaches.

 
The only exception is recovery of a multimaster node. In that case we have to apply transactions in exactly the same order as they were applied at the original node from which we are recovering. This is done by applying changes in recovery mode by pglogical_receiver itself.

I'm not sure I understand what you are saying here.

Sorry for being unclear.
I just said that normally transactions are applied concurrently by multiple workers and the DTM is used to enforce consistency.
But in the case of recovery (when some node has crashed and then reconnects to the cluster), we perform recovery of this node sequentially, by a single worker. In this case the DTM is not used (because the other nodes are far ahead) and, to restore the same state of the node, we need to apply changes in exactly the same order as at the source node. In this case the content of the target (recovered) node should be the same as that of the source node.

  
We also need 2PC support, but that code was already sent to you by Stas, so I hope that at some point it will be included in PostgreSQL core and the pglogical plugin.

I never got a response to my suggestion that testing of upstream DDL is needed for that. I want to see more on how you plan to handle DDL on the upstream side that changes the table structure and acquires strong locks. Especially when it's combined with row changes in the same prepared xacts. 

We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@postgrespro.ru>.
The patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.

How does that really improve anything over using a table?

It is a more straightforward approach, isn't it? You can either try to restore DDL from the low-level sequence of updates of the system catalog.
But that is difficult and not always possible. Or you need to somehow add the original DDL statements to the log.
That can be done using some special table, or by storing this information directly in the log (if custom WAL records are supported).
Certainly in the latter case the logical protocol has to be extended to support playback of user-defined WAL records.
But it seems to be a universal mechanism which can be used not only for DDL.
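
For reference, the 9.6 facility mentioned above is also exposed at the SQL
level; a minimal sketch (the 'ddl' prefix is arbitrary, and the output plugin
still has to be taught to replay such messages):

    -- write an arbitrary transactional payload into WAL for logical decoding
    SELECT pg_logical_emit_message(true, 'ddl',
                                   'ALTER TABLE t ADD COLUMN c integer');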

I agree that custom WAL records add no performance or functionality advantages over using a table.
This is why we still haven't switched to them. But IMHO the approach of inserting DDL (or any other user-defined information) into a special table looks like a hack.


This doesn't address what I asked above though, which is whether you have tried doing ALTER TABLE in a 2PC xact with your 2PC replication patch, especially one that also makes row changes.

Well, recently I made an attempt to merge our code with the latest version of the pglogical plugin (because our original implementation of multimaster was based on code partly taken from BDR) but finally had to postpone most of the changes. My primary intention was to support metadata caching. But the presence of multiple apply workers makes it impossible to implement it in the same way as it is done in the pglogical plugin.

Not with a simplistic implementation of multiple workers that just round-robin process transactions, no. Your receiver will have to be smart enough to read the protocol stream and write the metadata changes to a separate stream all the workers read. Which is awkward.

I think you'll probably need your receiver to act as a metadata broker for the apply workers in the end.

Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But this is not true in the case of multimaster, and I do not want to pay extra cost for functionality we do not need.

All it's really doing is mapping upstream to downstream tables by name, since the oids will be different.

Really?
Why then do you send all the table metadata (information about attributes) and handle invalidation messages?
What is the purpose of the "mapping to local relation, filled as needed" fields in PGLogicalRelation if you are not going to perform such mapping?

Multimaster only really needs to map local to remote OIDs. We do not need to provide any attribute mapping or handle catalog invalidations.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Multimaster

From
Craig Ringer
Date:
On 18 April 2016 at 16:28, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
I intend to make the same split in pglogical itself - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.


In the case of multimaster the "too far behind" scenario can never happen.

I disagree. In the case of tightly coupled synchronous multi-master it can't happen, sure. But that's hardly the only case of multi-master out there.

I expect you'll want the ability to weaken synchronous guarantees for some commits anyway, like we have with physical replication's synchronous_commit = remote_write, synchronous_commit = local, etc. In that case lag becomes relevant again.

You might also want to be able to spool a big tx to temporary storage even as you apply it, if you're running over a WAN or something. That way if you crash during apply you don't have to transfer the data over the WAN again. Like we do with physical replication, where we write the WAL to disk then replay from disk.

I agree that spilling to disk isn't needed for the simplest cases of synchronous logical MM. But it's far from useless.
 
It seems to me that the pglogical plugin is becoming too universal, trying to address a lot of different issues and play different roles.

I'm not convinced. They're all closely related, overlapping, and require much of the same functionality. While some use cases don't need certain pieces of functionality, they can still be _useful_. Asynchronous MM replication doesn't need table mapping and transforms, for example ... except that in reality lots of the flexibility offered by replication sets, table mapping, etc is actually really handy in MM too.

We may well want to move much of that into core and have much thinner plugins, but the direction Andres, Robert etc are talking about seems to be more along the lines of a fully in-core logical replication subsystem. It'll need to (eventually) meet all these sorts of needs.

Before you start cutting or assuming you need something very separate I suggest taking a closer look at why each piece is there,  whether there's truly any significant performance impact, and whether it can be avoided without just cutting out the functionality entirely.

1. Asynchronous replication (including geo-replication) - this is actually BDR.

Well, BDR is asynchronous MM. There's also the single-master case and related ones for non-overlapping multimaster where any given set of tables are only written on one node.
 
2. Logical backup: transferring data to a different database (including a new version of Postgres).

I think that's more HA than logical backup. Needs to be able to be synchronous or asynchronous, much like our current phys.rep.

Closely related but not quite the same is logical read replicas/standbys.
 
3. Change notification: there are many different subscribers which can be interested in receiving notifications about database changes.

Yep. I suspect we'll want a json output plugin for this, separate to pglogical etc, but we'll need to move a bunch of functionality from pglogical into core so it can be shared rather than duplicated.
 
4. Synchronous replication: multimaster

"Synchronous multimaster". Not all multimastrer is synchronous, not all synchronous replication is multimaster. 

We are not enforcing the order of commits as Galera does. Consistency is enforced by the DTM, which ensures that transactions at all nodes are given consistent snapshots and assigned the same CSNs. We also have a global deadlock detection algorithm which builds a global lock graph (but false positives are still possible, because this graph is built incrementally and so does not correspond to a single global snapshot).

OK, so you're relying on a GTM to determine safe, conflict-free apply orderings.

I'm ... curious ... about how you do that. Do you have a global lock manager too? How do you determine ordering for things that in a single-master case are addressed via unique b-tree indexes, not (just) heavyweight locking?

Multimaster is just a particular (and the simplest) case of distributed transactions. What is specific to multimaster is that the same transaction has to be applied at all nodes and that selects can be executed at any node.

The specification of your symmetric, synchronous tightly-coupled multimaster design, yes. Which sounds like it's intended to be transparent or near-transparent multi-master clustering.
 

The only exception is recovery of a multimaster node. In that case we have to apply transactions in exactly the same order as they were applied at the original node from which we are recovering. This is done by applying changes in recovery mode by pglogical_receiver itself.

I'm not sure I understand what you are saying here.

Sorry for being unclear.
I just said that normally transactions are applied concurrently by multiple workers and the DTM is used to enforce consistency.
But in the case of recovery (when some node has crashed and then reconnects to the cluster), we perform recovery of this node sequentially, by a single worker. In this case the DTM is not used (because the other nodes are far ahead) and, to restore the same state of the node, we need to apply changes in exactly the same order as at the source node. In this case the content of the target (recovered) node should be the same as that of the source node.

OK, that makes perfect sense.

Presumably in this case you could save a local snapshot of the DTM's knowledge of the correct apply ordering of those tx's as you apply, so when you crash you can consult that saved ordering information to still parallelize apply. Later.
 

 
We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@postgrespro.ru>.
The patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.

How does that really improve anything over using a table?

It is a more straightforward approach, isn't it? You can either try to restore DDL from the low-level sequence of updates of the system catalog.
But that is difficult and not always possible.

Understatement of the century ;) 
 
Or you need to somehow add the original DDL statements to the log.

Actually you need to be able to add normalized statements to the xlog. The original DDL text isn't quite good enough due to issues with search_path among other things. Hence DDL deparse.
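
A small illustration of why the raw statement text is not enough (schema and
table names are arbitrary):

    SET search_path = app, public;
    CREATE TABLE t (id integer);   -- actually creates app.t
    -- replaying the literal text on a node with a different search_path would
    -- create the table in the wrong schema; the deparsed, normalized form is
    -- schema-qualified, e.g. CREATE TABLE app.t (id integer)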
 
I agree that custom WAL records add no performance or functionality advantages over using a table.
This is why we still haven't switched to them. But IMHO the approach of inserting DDL (or any other user-defined information) into a special table looks like a hack.

Yeah, it is a hack. Logical WAL messages do provide a cleaner way to do it, though with the minor downside that they're opaque to the user, who can't see what DDL is being done / due to be done anymore. I'd rather do it with generic logical WAL messages in future, now that they're in core. 
 
Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But this is not true in the case of multimaster, and I do not want to pay extra cost for functionality we do not need.

All it's really doing is mapping upstream to downstream tables by name, since the oids will be different.

Really?
Why then do you send all the table metadata (information about attributes) and handle invalidation messages?

Right, you meant columns, not tables.

See DESIGN.md.

We can't just use attno since column drops on one node will cause attno to differ even if the user-visible table schema is the same.
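
A quick illustration (table name is arbitrary): after a drop/add cycle the
user-visible schema matches a freshly created table, but the attribute numbers
do not.

    CREATE TABLE t (a integer, b integer);
    ALTER TABLE t DROP COLUMN a;
    ALTER TABLE t ADD COLUMN c integer;
    -- here "b" has attnum 2 and "c" has attnum 3; on a node where the table
    -- was created directly as (b integer, c integer) they would be 1 and 2
    SELECT attname, attnum
    FROM pg_attribute
    WHERE attrelid = 't'::regclass AND attnum > 0 AND NOT attisdropped;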

BDR solves this (now) by either initializing nodes from a physical pg_basebackup of another node, including dropped cols etc, or using pg_dump's binary upgrade mode to preserve dropped columns when bringing a node up from a logical copy.

That's not viable for general purpose logical replication like pglogical, so we send a table attribute mapping.

I agree that this can be avoided if the system can guarantee that the upstream and downstream tables have exactly the same structure, including dropped columns. Which it can only guarantee when it has DDL replication and all DDL is either replicated or blocked from being run. That's the approach BDR tries to take, and it works - with problems. One of those problems you won't have, because it's caused by the need to sync up the otherwise asynchronous cluster so there are no outstanding committed-but-not-replayed changes for the old table structure on any node before we change the structure on all nodes. But others - coverage of DDL replication, problems with full table rewrites, etc. - you will have.

I think it would be reasonable for pglogical to offer the option of sending a minimal table metadata message that simply says that it expects the downstream to deal with the upstream attnos exactly as-is, either by having them exactly the same or managing its own translations. In this case column mapping etc can be omitted. Feel free to send a patch.
 
Multimaster only really needs to map local to remote OIDs. We do not need to provide any attribute mapping or handle catalog invalidations.

For synchronous tightly-coupled multi-master with a GTM and GLM that doesn't allow non-replicated DDL, yes, I agree.



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Multimaster

From
Konstantin Knizhnik
Date:


On 19.04.2016 15:56, Craig Ringer wrote:
On 18 April 2016 at 16:28, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 
I intend to make the same split in pglogical itself - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.


In the case of multimaster the "too far behind" scenario can never happen.

I disagree. In the case of tightly coupled synchronous multi-master it can't happen, sure. But that's hardly the only case of multi-master out there.

Sorry, it is just a matter of terminology. By multimaster I really mean "synchronous multimaster", because from my point of view the main characteristic of multimaster is symmetric access to all nodes. If there is no guarantee that all cluster nodes have the same state, then, from my point of view, it is not a multimaster at all. But I have not registered "multimaster" as a trademark, so I cannot insist on such a treatment of this term :)




 
2. Logical backup: transferring data to a different database (including a new version of Postgres).

I think that's more HA than logical backup. Needs to be able to be synchronous or asynchronous, much like our current phys.rep.

Closely related but not quite the same is logical read replicas/standbys.

This is a use case from a real production system (a Galera use case). If a customer wants to migrate data to a new data center, then multimaster is one of the possible (and safest) ways to do it. You can redirect the workload to the new data center step by step and transparently for users.

 
3. Change notification: there are many different subscribers which can be interested in receiving notifications about database changes.

Yep. I suspect we'll want a json output plugin for this, separate to pglogical etc, but we'll need to move a bunch of functionality from pglogical into core so it can be shared rather than duplicated.

JSON is not an efficient format for this, and here performance may be critical.

 
4. Synchronous replication: multimaster

"Synchronous multimaster". Not all multimastrer is synchronous, not all synchronous replication is multimaster. 

We are not enforcing the order of commits as Galera does. Consistency is enforced by the DTM, which ensures that transactions at all nodes are given consistent snapshots and assigned the same CSNs. We also have a global deadlock detection algorithm which builds a global lock graph (but false positives are still possible, because this graph is built incrementally and so does not correspond to a single global snapshot).

OK, so you're relying on a GTM to determine safe, conflict-free apply orderings.

I'm ... curious ... about how you do that. Do you have a global lock manager too? How do you determine ordering for things that in a single-master case are addressed via unique b-tree indexes, not (just) heavyweight locking?


We have tried both a DTM with a global arbiter (an analogue of the XL GTM) and a DTM based on timestamps. In the latter case there is no centralized arbiter, but we are using "raftable" - yet another plugin of ours, which provides consistent distributed storage based on the Raft protocol.
Using this raftable we build a global deadlock graph from local subgraphs.



Or you need to somehow add the original DDL statements to the log.

Actually you need to be able to add normalized statements to the xlog. The original DDL text isn't quite good enough due to issues with search_path among other things. Hence DDL deparse.

Yes, for general-purpose use we need some DBMS-independent representation of DDL.
But for multimaster's needs the original SQL statement will be enough.


 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company