Обсуждение: logical replication snapshots
Hi everyone, I'm playing with logical replication in postgres-10 and I got it to hang on startup with > 2018-07-23 13:15:06.577 CDT [18624] FATAL: the database system is starting up > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 on_shmem_exit callbacks to make > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(1): 1 callbacks to make > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: exit(1) > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(-1): 0 callbacks to make ... forever ... After poking around: # find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c 12727975 It looks like ZFS simply can't handle this many files: `ls` etc. hang forever. So my question is, does it have to be that many? And if not, is there a knob to keep the number down? TIA -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
Hi, On 2018-07-23 15:23:04 -0500, Dimitri Maziuk wrote: > I'm playing with logical replication in postgres-10 and I got it to hang > on startup with > > > 2018-07-23 13:15:06.577 CDT [18624] FATAL: the database system is starting up > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 on_shmem_exit callbacks to make > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(1): 1 callbacks to make > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: exit(1) > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make > > 2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(-1): 0 callbacks to make > ... forever ... What precedes that "loop"? Could you describe what exactly you did into that situation? > After poking around: > > # find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c > 12727975 Ugh, something absolutely clearly isn't right here. > It looks like ZFS simply can't handle this many files: `ls` etc. hang > forever. ls probably works in general, it's just the sorting that makes it hang. Should work with -f or such. Greetings, Andres Freund
On 7/23/2018 6:51 PM, Andres Freund wrote: > What precedes that "loop"? systemctl start postgresql-10 > Could you describe what exactly you did into that situation? Created a database, few publications, as per TFM, and a few subscriptions on another host. Went on vacation for 3 weeks. The problem host is centos 7.current w/ postgres rpms from PGDG repo. Postgres is on port 5434 because there is a stock one on 5432. > ls probably works in general, it's just the sorting that makes it > hang. Should work with -f or such. I don't think this means what you think this means. There are *13 million* files in /var/lib/pgsql/10/data/pg_logical/snapshots and postgres hangs forever on start with FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up ... I think the two are related. ICBW and all that. What do I do no not get it into that state again? Thx
On 2018-Jul-24, Dimitri Maziuk wrote: > On 7/23/2018 6:51 PM, Andres Freund wrote: > > Could you describe what exactly you did into that situation? > > Created a database, few publications, as per TFM, and a few subscriptions on > another host. Went on vacation for 3 weeks. The problem host is centos > 7.current w/ postgres rpms from PGDG repo. Postgres is on port 5434 because > there is a stock one on 5432. Are you serious about us trying to diagnose a bug with this description? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/24/2018 12:21 PM, Alvaro Herrera wrote: > > Are you serious about us trying to diagnose a bug with this description? What do you want to know, exactly? -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote: > On 07/24/2018 12:21 PM, Alvaro Herrera wrote: > > > > Are you serious about us trying to diagnose a bug with this description? > > What do you want to know, exactly? A recipe that we can follow and reproduce the issue. Greetings, Andres Freund
On 07/24/2018 01:43 PM, Andres Freund wrote: > On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote: >> On 07/24/2018 12:21 PM, Alvaro Herrera wrote: >>> >>> Are you serious about us trying to diagnose a bug with this description? >> >> What do you want to know, exactly? > > A recipe that we can follow and reproduce the issue. The nutshell version is as I said: I pretty much followed the manual to create a test publication/subscription setup and let it run for a few weeks. I had to increase a few limits but otherwise everything's at the default settings as shipped in the rpms. I can send you postgres config files from both nodes and the entire database dump -- it's all public domain. I can `zfs send` you the snapshot of the entire /var/lib/pgsql/10 as is, too. If you are asking for something other than those two extremes, please let me know what it is and I'll be happy to try and provide it. -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 2018-07-24 14:02:26 -0500, Dimitri Maziuk wrote: > On 07/24/2018 01:43 PM, Andres Freund wrote: > > On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote: > >> On 07/24/2018 12:21 PM, Alvaro Herrera wrote: > >>> > >>> Are you serious about us trying to diagnose a bug with this description? > >> > >> What do you want to know, exactly? > > > > A recipe that we can follow and reproduce the issue. > > The nutshell version is as I said: I pretty much followed the manual to > create a test publication/subscription setup and let it run for a few > weeks. I had to increase a few limits but otherwise everything's at the > default settings as shipped in the rpms. Are you really expecting us to be able to reproduce the problem based on the above description? Our test suites do setup plain replications setups, and the problem doesn't occur there. > If you are asking for something other than those two extremes, please > let me know what it is and I'll be happy to try and provide it. A command by command reproducer would be good. > I can send you postgres config files from both nodes and the entire > database dump -- it's all public domain. I can `zfs send` you the > snapshot of the entire /var/lib/pgsql/10 as is, too. Without the genesis, that's probably not as helpful, unfortunately. Greetings, Andres Freund
On 7/25/2018 10:28 AM, Andres Freund wrote: > > Are you really expecting us to be able to reproduce the problem based on > the above description? Our test suites do setup plain replications > setups, and the problem doesn't occur there. I don't, by definition, have a reproducible case: it only happened once so far. If nobody knows what limits the number of files created in $PGDATA/pg_logical/snapshots then we'll all have to wait until this happens again. (To somebody else as I'm obviously not turning logical replication back on until I know it won't kill my server again.) Dima
On 07/25/2018 08:40 AM, Dimitri Maziuk wrote: > On 7/25/2018 10:28 AM, Andres Freund wrote: >> >> Are you really expecting us to be able to reproduce the problem based on >> the above description? Our test suites do setup plain replications >> setups, and the problem doesn't occur there. > > I don't, by definition, have a reproducible case: it only happened once > so far. Where you using pg_export_snapshot() by any chance?: https://www.postgresql.org/docs/10/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION Where there any relevant error messages in the log before the database hung? > > If nobody knows what limits the number of files created in > $PGDATA/pg_logical/snapshots then we'll all have to wait until this > happens again. > > (To somebody else as I'm obviously not turning logical replication back > on until I know it won't kill my server again.) Given that it took 3 weeks to manifest itself before, I would say give it a try and monitor $PGDATA/pg_logical/snapshots. That would help provide information for getting at the source of the problem. You can always disable the replication if it looks like it running away. > > Dima > -- Adrian Klaver adrian.klaver@aklaver.com
On 07/25/2018 11:16 AM, Adrian Klaver wrote: > Where you using pg_export_snapshot() by any chance?: If it's not in chapter 31 of TFM then I have not touched it. The only notable thing I changed from 31.9 quickstart is that replica identity is full on all tables (the data's a mess). > Where there any relevant error messages in the log before the database > hung? Do you seriously expect me to know what "relevant" is? ;) -- It's 7 days' worth of > 2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT > 2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408 > 2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions. > 2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules > 2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT > 2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440 > 2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions. > 2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules > 2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT > 2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478 > 2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions. > 2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules > 2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT that ends with > 2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT > 2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up > 2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up ... ad infinitum ... Would "permission denied" be relevant? (Not sure where that is coming from: I did initial sync as postgres then altered it to replication user. Can't track it down now since the database is dead.) -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote: > On 07/25/2018 11:16 AM, Adrian Klaver wrote: > >> Where you using pg_export_snapshot() by any chance?: > > If it's not in chapter 31 of TFM then I have not touched it. The only > notable thing I changed from 31.9 quickstart is that replica identity is > full on all tables (the data's a mess). So no. > >> Where there any relevant error messages in the log before the database >> hung? > > Do you seriously expect me to know what "relevant" is? ;) -- It's 7 > days' worth of Generally speaking relevancy increases the closer you get to the fall over event. So we are most interested in entries close to the point where the database stopped/failed. > >> 2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT >> 2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408 >> 2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions. >> 2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules >> 2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT >> 2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440 >> 2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions. >> 2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules >> 2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT >> 2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478 >> 2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions. >> 2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules >> 2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT > > that ends with > >> 2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT >> 2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up >> 2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up > ... ad infinitum ... > > Would "permission denied" be relevant? Logical decoding is something I am still learning. The "permission denied" would to me be relevant only to the extent that it seems to be provoking: "LOG: logical decoding found consistent point at 19/E6942440" DETAIL: There are no running transactions." Others with more experience in this area would need to fill whether that might account for the 13 million files in the ~snapshot/ directory. > > (Not sure where that is coming from: I did initial sync as postgres then > altered it to replication user. Can't track it down now since the > database is dead.) > In your previous post you said: "(To somebody else as I'm obviously not turning logical replication back on until I know it won't kill my server again.) " So what are you working off now? -- Adrian Klaver adrian.klaver@aklaver.com
On 07/25/2018 02:31 PM, Adrian Klaver wrote: > In your previous post you said: > > "(To somebody else as I'm obviously not turning logical replication back > on until I know it won't kill my server again.) " > > So what are you working off now? I have that database rebuilt with no publications defined, and I have the original $PGDATA on a ZFS volume. I have not looked at the subscriber host yet, I've more urgent fires to douse first. -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote: > On 07/25/2018 11:16 AM, Adrian Klaver wrote: > >> Where you using pg_export_snapshot() by any chance?: > > If it's not in chapter 31 of TFM then I have not touched it. The only > notable thing I changed from 31.9 quickstart is that replica identity is > full on all tables (the data's a mess). > >> Where there any relevant error messages in the log before the database >> hung? > > Do you seriously expect me to know what "relevant" is? ;) -- It's 7 > days' worth of > >> 2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT >> 2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408 >> 2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions. >> 2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules >> 2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT >> 2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440 >> 2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions. >> 2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules >> 2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT >> 2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478 >> 2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions. >> 2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules >> 2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT A little slow, but I realized that the above looks like the initial sync for a table. Do these errors stretch back all the way back to your original sync? > > that ends with > >> 2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT >> 2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up >> 2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up > ... ad infinitum ... > > Would "permission denied" be relevant? > > (Not sure where that is coming from: I did initial sync as postgres then > altered it to replication user. Can't track it down now since the > database is dead.) > -- Adrian Klaver adrian.klaver@aklaver.com
On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote: > On 07/25/2018 11:10 AM, Dimitri Maziuk wrote: > > On 07/25/2018 11:16 AM, Adrian Klaver wrote: > > > > > Where you using pg_export_snapshot() by any chance?: > > > > If it's not in chapter 31 of TFM then I have not touched it. The only > > notable thing I changed from 31.9 quickstart is that replica identity is > > full on all tables (the data's a mess). > > So no. > > > > > > Where there any relevant error messages in the log before the database > > > hung? > > > > Do you seriously expect me to know what "relevant" is? ;) -- It's 7 > > days' worth of > > Generally speaking relevancy increases the closer you get to the fall over > event. So we are most interested in entries close to the point where the > database stopped/failed. > > > > > > 2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT > > > 2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408 > > > 2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions. > > > 2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules > > > 2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT > > > 2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440 > > > 2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions. > > > 2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules > > > 2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT > > > 2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478 > > > 2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions. > > > 2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules > > > 2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT > > > > that ends with > > > > > 2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT > > > 2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up > > > 2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up > > ... ad infinitum ... > > > > Would "permission denied" be relevant? > > Logical decoding is something I am still learning. The "permission denied" > would to me be relevant only to the extent that it seems to be provoking: Yes, it looks related. Looks like logical rep was trying to perform the intiial sync of those tables, and couldn't due to permission errors. > "LOG: logical decoding found consistent point at 19/E6942440" > DETAIL: There are no running transactions." > > Others with more experience in this area would need to fill whether that > might account for the 13 million files in the ~snapshot/ directory. That indicates there's some error handling issues to be resolved. Petr? Greetings, Andres Freund
On 07/25/2018 07:57 PM, Andres Freund wrote: > On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote: >>> Would "permission denied" be relevant? >> >> Logical decoding is something I am still learning. The "permission denied" >> would to me be relevant only to the extent that it seems to be provoking: > > Yes, it looks related. Looks like logical rep was trying to perform the > intiial sync of those tables, and couldn't due to permission errors. I'm not sure what happened, I remember the initial sync of that particular schema failing on one table only, but looking at it now, all tables are empty on the subscriber. >> "LOG: logical decoding found consistent point at 19/E6942440" >> DETAIL: There are no running transactions." >> >> Others with more experience in this area would need to fill whether that >> might account for the 13 million files in the ~snapshot/ directory. > > That indicates there's some error handling issues to be resolved. Petr? It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)? Would it be safe to delete every file in ~snapshot/ older than some X-number of minutes? (And what a good X might be?) -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/26/2018 10:54 AM, Dimitri Maziuk wrote: > On 07/25/2018 07:57 PM, Andres Freund wrote: >> On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote: > >>>> Would "permission denied" be relevant? >>> >>> Logical decoding is something I am still learning. The "permission denied" >>> would to me be relevant only to the extent that it seems to be provoking: >> >> Yes, it looks related. Looks like logical rep was trying to perform the >> intiial sync of those tables, and couldn't due to permission errors. > > I'm not sure what happened, I remember the initial sync of that > particular schema failing on one table only, but looking at it now, all > tables are empty on the subscriber. To me that indicates all the syncs failed. > >>> "LOG: logical decoding found consistent point at 19/E6942440" >>> DETAIL: There are no running transactions." >>> >>> Others with more experience in this area would need to fill whether that >>> might account for the 13 million files in the ~snapshot/ directory. >> >> That indicates there's some error handling issues to be resolved. Petr? > > It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)? That is getting deeper into the code then I have been, so I don't know. FYI the code is below and the part that triggered the error is around line 762 (copy_table(Relation rel)). https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/logical/tablesync.c;h=acc6498567d07c3e93d0adf5039a10c693e38f64;hb=HEAD > > Would it be safe to delete every file in ~snapshot/ older than some > X-number of minutes? (And what a good X might be?) > Again I don't know the answer to this. Are you trying this on a test setup or production one? -- Adrian Klaver adrian.klaver@aklaver.com
On 2018-07-26 12:54:19 -0500, Dimitri Maziuk wrote: > >> "LOG: logical decoding found consistent point at 19/E6942440" > >> DETAIL: There are no running transactions." > >> > >> Others with more experience in this area would need to fill whether that > >> might account for the 13 million files in the ~snapshot/ directory. > > > > That indicates there's some error handling issues to be resolved. Petr? > > It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)? No, it doesn't. The snapshots are internal to logical replication. > Would it be safe to delete every file in ~snapshot/ older than some > X-number of minutes? (And what a good X might be?) You shouldn't manually delete them. But an explicit 'checkpoint;' (and all automatically triggered ones), should do so for all that are older. Is there a chance that you have some old leftover replication slots around? Because that'd prevent cleanup? What does SELECT * FROM pg_replication_slots ; return? Greetings, Andres Freund
On 07/26/2018 02:54 PM, Adrian Klaver wrote: > On 07/26/2018 10:54 AM, Dimitri Maziuk wrote: >> I'm not sure what happened, I remember the initial sync of that >> particular schema failing on one table only, but looking at it now, all >> tables are empty on the subscriber. > > To me that indicates all the syncs failed. Yeah, well... the error message said one table failed and I went off to find out why (a co-worker added a column behind everyone's back) and never checked 'count(*)' on the other tables. ... deleting files in ~snapshots > Again I don't know the answer to this. Are you trying this on a test > setup or production one? I could fire up another instance on a different port off the now-broken $PGDATA easily enough and test. However if whatever uses those files needs to start from the very first file and "replay" them in sequence, that won't work. The files are named like 19_E6942440.snap which presumably corresponds to "LOG: logical decoding found consistent point at 19/E6942440" and they seem to get progressively larger. That suggest that maybe just one (the newest one) could be good enough... -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/26/2018 02:06 PM, Dimitri Maziuk wrote: > On 07/26/2018 02:54 PM, Adrian Klaver wrote: >> On 07/26/2018 10:54 AM, Dimitri Maziuk wrote: > >>> I'm not sure what happened, I remember the initial sync of that >>> particular schema failing on one table only, but looking at it now, all >>> tables are empty on the subscriber. >> >> To me that indicates all the syncs failed. > > Yeah, well... the error message said one table failed and I went off to > find out why (a co-worker added a column behind everyone's back) and > never checked 'count(*)' on the other tables. The thing that has me somewhat confused is: ERROR: permission denied for schema macromolecules I would thought the replication user could access that. Is there something special about that schema? Did the user role you set up for replication have the REPLICATION attribute or was it a superuser? > > ... deleting files in ~snapshots >> Again I don't know the answer to this. Are you trying this on a test >> setup or production one? > > I could fire up another instance on a different port off the now-broken > $PGDATA easily enough and test. However if whatever uses those files > needs to start from the very first file and "replay" them in sequence, > that won't work. If I remember back correctly this is because the server chokes on the 13 million files in the ~/pg_logical/snapshot directly. I wonder what happens if you change(on test setup)?: wal_level = logical to wal_level = replica > > The files are named like 19_E6942440.snap which presumably corresponds > to "LOG: logical decoding found consistent point at 19/E6942440" and > they seem to get progressively larger. That suggest that maybe just one > (the newest one) could be good enough... > -- Adrian Klaver adrian.klaver@aklaver.com
On 07/26/2018 04:39 PM, Adrian Klaver wrote: > The thing that has me somewhat confused is: > > ERROR: permission denied for schema macromolecules > > I would thought the replication user could access that. The more I look at the errors, the less sense it all makes. Let me ask a different question: if I drop and re-create a published table on the publisher without doing anything to the publication and subscription, what happens? Do I need to re-create the publication when I re-create the table? -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/26/2018 03:01 PM, Dimitri Maziuk wrote: > On 07/26/2018 04:39 PM, Adrian Klaver wrote: > >> The thing that has me somewhat confused is: >> >> ERROR: permission denied for schema macromolecules >> >> I would thought the replication user could access that. > > The more I look at the errors, the less sense it all makes. Sounds like my problems:) > > Let me ask a different question: if I drop and re-create a published > table on the publisher without doing anything to the publication and > subscription, what happens? Take a look at: https://www.postgresql.org/docs/10/static/sql-alterpublication.html > > Do I need to re-create the publication when I re-create the table? > -- Adrian Klaver adrian.klaver@aklaver.com
On 07/26/2018 05:34 PM, Adrian Klaver wrote: > On 07/26/2018 03:01 PM, Dimitri Maziuk wrote: >> Let me ask a different question: if I drop and re-create a published >> table on the publisher without doing anything to the publication and >> subscription, what happens? > > Take a look at: > > https://www.postgresql.org/docs/10/static/sql-alterpublication.html Thanks but what I ask (and what I suspect happened here while I wasn't looking) is - create schema foo with tables bar and baz and some rows in 'em - create publication foopub with foo.bar and foo.baz - create subscription to foopub on the other node, initial sync and all - drop schema foo on the publisher - create schema foo with tables bar and baz and some rows in 'em The publication foopub is at this point fubar I take it? And needs to be re-created on the publisher and reconnected on the subscriber? Complete with with inital resync? -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/26/2018 04:48 PM, Dimitri Maziuk wrote: > On 07/26/2018 05:34 PM, Adrian Klaver wrote: >> On 07/26/2018 03:01 PM, Dimitri Maziuk wrote: > >>> Let me ask a different question: if I drop and re-create a published >>> table on the publisher without doing anything to the publication and >>> subscription, what happens? >> >> Take a look at: >> >> https://www.postgresql.org/docs/10/static/sql-alterpublication.html > > Thanks but what I ask (and what I suspect happened here while I wasn't > looking) is > > - create schema foo with tables bar and baz and some rows in 'em > - create publication foopub with foo.bar and foo.baz > - create subscription to foopub on the other node, initial sync and all > - drop schema foo on the publisher > - create schema foo with tables bar and baz and some rows in 'em > > The publication foopub is at this point fubar I take it? And needs to be > re-created on the publisher and reconnected on the subscriber? Complete > with with inital resync? Not sure. Personally I would try: 1) ALTER PUBLICATION DROP TABLE foo|bar; 2) ALTER PUBLICATION ADD TABLE foo|bar; 3) ALTER SUBSCRIPTION sub_name REFRESH PUBLICATION If you get to 3) it will re-sync the data unless you tell it otherwise. The above is probably dependent on the size of the publication. If you did a publication for ALL it would make more sense to do the above then if you did a publication for just foo or bar. -- Adrian Klaver adrian.klaver@aklaver.com
On 07/26/2018 07:11 PM, Adrian Klaver wrote: > On 07/26/2018 04:48 PM, Dimitri Maziuk wrote: ... >> The publication foopub is at this point fubar I take it? And needs to be >> re-created on the publisher and reconnected on the subscriber? Complete >> with with inital resync? > > Not sure. Personally I would try: > > 1) ALTER PUBLICATION DROP TABLE foo|bar; > > 2) ALTER PUBLICATION ADD TABLE foo|bar; > > 3) ALTER SUBSCRIPTION sub_name REFRESH PUBLICATION > > If you get to 3) it will re-sync the data unless you tell it otherwise. > > The above is probably dependent on the size of the publication. If you > did a publication for ALL it would make more sense to do the above then > if you did a publication for just foo or bar. ... but if I did the publication for ALL, I could just use streaming replication and then drop table/add table would replicate automagically ... It looks like we probably have to re-think a few of our workflows and procedures, and until/unless that happens, logical replication won't do what we want. Which means figuring out this 13-million-files problem becomes a very low priority for me. Unfortunately: it'd be nice to track it down and squash it... Thanks everyone, -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Вложения
On 07/27/2018 03:04 PM, Dimitri Maziuk wrote: > On 07/26/2018 07:11 PM, Adrian Klaver wrote: >> On 07/26/2018 04:48 PM, Dimitri Maziuk wrote: > ... >>> The publication foopub is at this point fubar I take it? And needs to be >>> re-created on the publisher and reconnected on the subscriber? Complete >>> with with inital resync? >> >> Not sure. Personally I would try: >> >> 1) ALTER PUBLICATION DROP TABLE foo|bar; >> >> 2) ALTER PUBLICATION ADD TABLE foo|bar; >> >> 3) ALTER SUBSCRIPTION sub_name REFRESH PUBLICATION >> >> If you get to 3) it will re-sync the data unless you tell it otherwise. >> >> The above is probably dependent on the size of the publication. If you >> did a publication for ALL it would make more sense to do the above then >> if you did a publication for just foo or bar. > > ... but if I did the publication for ALL, I could just use streaming > replication and then drop table/add table would replicate automagically ... Well I was just showing the extremes from a single table publication to ALL tables. You can also do subsets of ALL. Remember that binary replication(streaming) involves the whole Postgres cluster, e.g. all the databases in the cluster no choice in the matter. Also it does not allow you to shape what is replicated. In other words what form of DML you want replicated e.g. UPDATE, INSERT, DELETE. Last but not least logical replication works across major versions and different OS'es, which binary replication does not. > > It looks like we probably have to re-think a few of our workflows and > procedures, and until/unless that happens, logical replication won't do > what we want. Which means figuring out this 13-million-files problem > becomes a very low priority for me. Unfortunately: it'd be nice to track > it down and squash it... > > Thanks everyone, > -- Adrian Klaver adrian.klaver@aklaver.com
On 07/27/2018 05:20 PM, Adrian Klaver wrote: > On 07/27/2018 03:04 PM, Dimitri Maziuk wrote: >> ... but if I did the publication for ALL, I could just use streaming >> replication ... > > Well I was just showing the extremes from a single table publication to > ALL tables. You can also do subsets of ALL. Remember that binary > replication(streaming) involves the whole Postgres cluster, e.g. all the > databases in the cluster no choice in the matter. Also it does not allow > you to shape what is replicated. Yes, and that's why I was looking into logical replication. Now that I looked, it seems the limitations are too many for our use case now, and that spinning up a separate "cluster" for each "replicated dataset" is simply less work ATM. Thanks again, -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu