Re: Synchronizing slots from primary to standby
От | shveta malik |
---|---|
Тема | Re: Synchronizing slots from primary to standby |
Дата | |
Msg-id | CAJpy0uC2gYybBcvYCW68wvR-3akWnFbr7x66nZg7XTyjELW9iw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: Synchronizing slots from primary to standby
("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
RE: Synchronizing slots from primary to standby ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) Re: Synchronizing slots from primary to standby (Ajin Cherian <itsajin@gmail.com>) |
Список | pgsql-hackers |
On Thu, Nov 9, 2023 at 8:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > Unrelated to above, if there is a user slot on standby with the same > > > > name which the slot-sync worker is trying to create, then shall it > > > > emit a warning and skip the sync of that slot or shall it throw an > > > > error? > > > > > > > > > > I'd vote for emit a warning and move on to the next slot if any. > > > > > > > But then it could take time for users to know the actual problem and > > they probably notice it after failover. OTOH, if we throw an error > > then probably they will come to know earlier because the slot sync > > mechanism would be stopped. Do you have reasons to prefer giving a > > WARNING and skipping creating such slots? I expect this WARNING to > > keep getting repeated in LOGs because the consecutive sync tries will > > again generate a WARNING. > > > > Apart from the above, I would like to discuss the slot sync work > distribution strategy of this patch. The current implementation as > explained in the commit message [1] works well if the slots belong to > multiple databases. It is clear from the data in emails [2][3][4] that > having more workers really helps if the slots belong to multiple > databases. But I think if all the slots belong to one or very few > databases then such a strategy won't be as good. Now, on one hand, we > get very good numbers for a particular workload with the strategy used > in the patch but OTOH it may not be adaptable to various different > kinds of workloads. So, I have a question whether we should try to > optimize this strategy for various kinds of workloads or for the first > version let's use a single-slot sync-worker and then we can enhance > the functionality in later patches either in PG17 itself or in PG18 or > later versions. One thing to note is that a lot of the complexity of > the patch is attributed to the multi-worker strategy which may still > not be efficient, so there is an argument to go with a simpler > single-slot sync-worker strategy and then enhance it in future > versions as we learn more about various workloads. It will also help > to develop this feature incrementally instead of doing all the things > in one go and taking a much longer time than it should. > > Thoughts? > > [1] - "The replication launcher on the physical standby queries > primary to get the list of dbids for failover logical slots. Once it > gets the dbids, if dbids < max_slotsync_workers, it starts only that > many workers, and if dbids > max_slotsync_workers, it starts > max_slotsync_workers and divides the work equally among them. Each > worker is then responsible to keep on syncing the logical slots > belonging to the DBs assigned to it. > > Each slot-sync worker will have its own dbids list. Since the upper > limit of this dbid-count is not known, it needs to be handled using > dsa. We initially allocated memory to hold 100 dbids for each worker. > If this limit is exhausted, we reallocate this memory with size > incremented again by 100." > > [2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com > [3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com > [4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com > > -- > With Regards, > Amit Kapila. PFA v32 patches which has below changes: 1) Changed how standby_slot_names is handled. On reanalyzing, logical decoding context might not be the best place to cache the standby slot list, because not all the callers(1. user backend. 2. walsender 3. slotsync worker) can access the logical decoding ctx. To make the access of the list consistent, cache the list in a global variable instead. Also, to avoid the trouble of allocating and freeing the list at various places, we [re]initialize the list in the GUC assign hook, it would be easier for caller to use the list. 2) Changed 'bool synced' in ReplicationSlotPersistentData to 'char sync_state'. Values are: 'n': none for user slots, 'i': sync initiated for the slot but waiting for primary to catch up. 'r': ready for periodic syncs. 3) Improved slot-creation logic in slot sync worker. Now any active slot's sync is not blocked by inactive slot's creation. The worker attempts pinging the primary server a fixed number of times and waits for it to catch-up with local-slot's lsn, after that it moves to the next slot. The worker reattempts the wait for pending ones in the next sync-cycle. Meanwhile any such slot (waiting for primary to catch-up) is not dropped but sync_status is marked as 'i'. Once the worker finishes initialization for such a slot (in any of the sync-cycles), sync_state of slot is changed to 'r'. 4) The slots with state 'i' are dropped by the slot-sync worker when it finds out that it is no longer in standby mode and then it exits. 5) Cascading standby does not sync slots with 'sync_state' = 'i' from the first standby. 6) Changed the naptime computation logic. Now during each sync-cycle, if any of the received slots is updated, we retain default-naptime else we increase the naptime provided inactivity time reaches threshold. 7) Added warning for cases where a user-slot with the same name is already present which slot-sync worker is trying to create. Sync for such slots is skipped. Changes for 1 are in patch001 and patch003. Changes for 2-7 are in patch002. Thank You Hou-san for working on 1. Open Question: 1) Currently I have put drop slot logic for slots with 'sync_state=i' in slot-sync worker. Do we need to put it somewhere in promotion-logic as well? Perhaps in WaitForWALToBecomeAvailable() where we call XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts? thanks Shveta
Вложения
В списке pgsql-hackers по дате отправления: