Discussion: PgUpgrade bumped my XIDs by ~50M?
We have a large >20TB system just pg_upgraded from 9.5 to 9.6 as per the
versions shown below.

The system does <5M transactions/day based on sum(commit + abort) from
pg_stat_database.

Autovac is running all possible threads now and upon investigating I see
that thousands of tables are now above the freeze threshold.  Same
tables all appear ~50M xids older than they did yesterday and the
upgrade was less than 24 hours ago.

I have a "safety" snap made of the system before upgrade that can be
used for inspection.

Any ideas why the age jump?

    select age(l.relfrozenxid),
           l.oid::regclass::text as relation,
           l.relkind,
           l.relpages,
           r.oid::regclass::text as "toast for"
    from pg_class l
    left join pg_class r on l.oid = r.reltoastrelid
    where l.relkind in ('r', 't')
    order by age desc, relation
    limit :limit

ii  postgresql-9.5  9.5.12-1.pgdg16.04+1  amd64  object-relational SQL database, version 9.5 server
ii  postgresql-9.6  9.6.8-1.pgdg16.04+1   amd64  object-relational SQL database, version 9.6 server

Thanks

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
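A note for readers following along: age(relfrozenxid) is 32-bit modular arithmetic, so a sudden jump in age means either the cluster's XID counter advanced or relfrozenxid fell behind. A minimal Python sketch of that arithmetic (the helper name and sample values are made up for illustration; PostgreSQL's real logic is in C and also special-cases frozen/bootstrap XIDs):

```python
# Illustrative model of PostgreSQL's 32-bit XID age arithmetic.
# xid_age is a made-up helper, not a PostgreSQL function.

XID_MASK = 0xFFFFFFFF  # XIDs are unsigned 32-bit counters

def xid_age(current_xid: int, relfrozenxid: int) -> int:
    """Transactions elapsed since relfrozenxid, with wraparound."""
    return (current_xid - relfrozenxid) & XID_MASK

# A ~50M jump in a table's age, with relfrozenxid untouched, implies
# the cluster's current XID advanced by ~50M.
before = xid_age(current_xid=600_000_000, relfrozenxid=400_000_000)
after  = xid_age(current_xid=650_000_000, relfrozenxid=400_000_000)
print(after - before)  # 50000000

# Wraparound case: current XID has wrapped past 2^32.
print(xid_age(current_xid=5, relfrozenxid=0xFFFFFFF0))  # 21
```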
On Wed, Apr 4, 2018 at 05:29:46PM -0500, Jerry Sievers wrote:
> We have a large >20TB system just pg_upgraded from 9.5 to 9.6 as per the
> versions shown below.
>
> The system does <5M transactions/day based on sum(commit + abort) from
> pg_stat_database.
>
> Autovac is running all possible threads now and upon investigating I see
> that thousands of tables are now above the freeze threshold.  Same
> tables all appear ~50M xids older than they did yesterday and the
> upgrade was less than 24 hours ago.
>
> I have a "safety" snap made of the system before upgrade that can be
> used for inspection.
>
> Any ideas why the age jump?

Uh, you can read how pg_upgrade handles frozen xids in pg_upgrade.c:

    https://doxygen.postgresql.org/pg__upgrade_8c_source.html#l00543

I am not sure what would have caused such a jump.  pg_upgrade brings
over the frozen values for each table, and sets the server's frozen
counter to match the old one.

If you run 'pg_dump --binary-upgrade' you will see the frozen xids being
transferred:

    -- For binary upgrade, set heap's relfrozenxid and relminmxid
    UPDATE pg_catalog.pg_class
    SET relfrozenxid = '558', relminmxid = '1'
    WHERE oid = 'public.test'::pg_catalog.regclass;

Is it possible that pg_upgrade used 50M xids while upgrading?

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Apr 4, 2018 at 05:29:46PM -0500, Jerry Sievers wrote:
>
>> We have a large >20TB system just pg_upgraded from 9.5 to 9.6 as per the
>> versions shown below.
>>
>> The system does <5M transactions/day based on sum(commit + abort) from
>> pg_stat_database.
>>
>> Autovac is running all possible threads now and upon investigating I see
>> that thousands of tables are now above the freeze threshold.  Same
>> tables all appear ~50M xids older than they did yesterday and the
>> upgrade was less than 24 hours ago.
>>
>> I have a "safety" snap made of the system before upgrade that can be
>> used for inspection.
>>
>> Any ideas why the age jump?
>
> Uh, you can read how pg_upgrade handles frozen xids in pg_upgrade.c:
>
>     https://doxygen.postgresql.org/pg__upgrade_8c_source.html#l00543
>
> I am not sure what would have caused such a jump.  pg_upgrade brings
> over the frozen values for each table, and sets the server's frozen
> counter to match the old one.
>
> If you run 'pg_dump --binary-upgrade' you will see the frozen xids being
> transferred:
>
>     -- For binary upgrade, set heap's relfrozenxid and relminmxid
>     UPDATE pg_catalog.pg_class
>     SET relfrozenxid = '558', relminmxid = '1'
>     WHERE oid = 'public.test'::pg_catalog.regclass;
>
> Is it possible that pg_upgrade used 50M xids while upgrading?

Hi Bruce.

Don't think so, as I did just snap the safety snap and ran another
upgrade on that.

And I also compared txid_current for the upgraded snap and our running
production instance.

The freshly upgraded snap is ~50M txids behind the prod instance.

If this is a not-too-uncommon case of users running amok, then this time
in particular they really went off the charts :-)

Will investigate...

FYI, this is the same system that a few weeks ago issued complaints
during vacuum of an XID younger than relfrozenxid, which a system
restart did mysteriously resolve.

I hope we're going to be OK here.

Thx

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
On Wed, Apr 4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is it possible that pg_upgrade used 50M xids while upgrading?
>
> Hi Bruce.
>
> Don't think so, as I did just snap the safety snap and ran another
> upgrade on that.
>
> And I also compared txid_current for the upgraded snap and our running
> production instance.
>
> The freshly upgraded snap is ~50M txids behind the prod instance.

Are the objects 50M behind or is txid_current 50M different?  Higher or
lower?

> If this is a not too uncommon case of users running amok, then this time
> in particular they really went off the charts :-)

I have never heard of this problem.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Apr 4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
>
>> Bruce Momjian <bruce@momjian.us> writes:
>> > Is it possible that pg_upgrade used 50M xids while upgrading?
>>
>> Hi Bruce.
>>
>> Don't think so, as I did just snap the safety snap and ran another
>> upgrade on that.
>>
>> And I also compared txid_current for the upgraded snap and our running
>> production instance.
>>
>> The freshly upgraded snap is ~50M txids behind the prod instance.
>
> Are the objects 50M behind or is txid_current 50M different?  Higher or
> lower?

txid_current is another 12M higher than a few hours ago.  Still waiting
to hear from my reporting team if they changed anything.

This thing is running PgLogical and has a few of our event triggers as
well.  But nothing in this regard changed with the upgrade.

What if some very frequent but trivial statements that did not get
assigned a real TXID in 9.5 on this configuration are now being treated
differently?

What's puzzling too is that when I run my TPS monitor script, it's
clicking along at what looks typical; presently that would only amount
to 700k transactions/day, but we're off-peak.

Thx

>> If this is a not too uncommon case of users running amok, then this time
>> in particular they really went off the charts :-)
>
> I have never heard of this problem.

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
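One way to separate the two rates being discussed here is to sample txid_current() alongside pg_stat_database over a known interval and compare deltas: commits only count completed transactions, while the XID counter also advances for assigned-but-uncounted work. A back-of-the-envelope sketch of that comparison (illustrative Python, not part of any PostgreSQL tooling; the sample values are made up, and a real run would take the two samples via SQL):

```python
# Rough sketch: compare XID burn rate against committed-transaction rate.
# In practice xid_delta would come from two "SELECT txid_current()" samples
# (note each call itself consumes one XID) and commit_delta from
# sum(xact_commit + xact_rollback) in pg_stat_database.

def rates(xid_delta: int, commit_delta: int, seconds: float):
    """Extrapolate per-day XID consumption vs. per-day commits."""
    per_day = 86_400 / seconds
    return xid_delta * per_day, commit_delta * per_day

xids_per_day, commits_per_day = rates(xid_delta=500_000,
                                      commit_delta=30_000,
                                      seconds=3_600)
# If XIDs/day vastly exceeds commits/day, something is consuming real
# XIDs without appearing in the commit/rollback counters.
print(xids_per_day, commits_per_day)  # 12000000.0 720000.0
```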
Jerry Sievers <gsievers19@comcast.net> writes:

> Bruce Momjian <bruce@momjian.us> writes:
>
>> On Wed, Apr 4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
>>
>>> Bruce Momjian <bruce@momjian.us> writes:
>>> > Is it possible that pg_upgrade used 50M xids while upgrading?
>>>
>>> Hi Bruce.
>>>
>>> Don't think so, as I did just snap the safety snap and ran another
>>> upgrade on that.
>>>
>>> And I also compared txid_current for the upgraded snap and our running
>>> production instance.
>>>
>>> The freshly upgraded snap is ~50M txids behind the prod instance.
>>
>> Are the objects 50M behind or is txid_current 50M different?  Higher or
>> lower?
>
> txid_current is another 12M higher than a few hours ago.  Still waiting
> to hear from my reporting team if they changed anything.

Reporting team claims nothing changed.

I still have 150 tables ready for autovac just based on freeze age
value.  Autovac is running at our as-config'd max worker count of 20
w/all threads busy as expected.

If I can assume stats such as pg_stat_database start initially cleared
after an upgrade...

Please see that pg_stat_database showing about the number of
transactions that I'd expect for this system and ~1.5 day duration.

How have some objects apparently aged by ~100M transactions (by now, at
last check) since the upgrade??

Thanks

postgres=# select sum(xact_rollback + xact_commit) from pg_stat_database;
   sum
---------
 5292417
(1 row)

postgres=# select now() - pg_postmaster_start_time();
       ?column?
-----------------------
 1 day 13:18:48.721896
(1 row)

postgres=# select version();
                                                           version
----------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.8 on x86_64-pc-linux-gnu (Ubuntu 9.6.8-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)

> This thing is running PgLogical and has a few of our event triggers as
> well.  But nothing in this regard changed with the upgrade.
>
> What if some very frequent but trivial statements that did not get
> assigned a real TXID in 9.5 on this configuration are now being treated
> differently?
>
> What's puzzling too is that when I run my TPS monitor script, it's
> clicking along at what looks typical; presently that would only amount
> to 700k transactions/day, but we're off-peak.
>
> Thx
>
>>> If this is a not too uncommon case of users running amok, then this time
>>> in particular they really went off the charts :-)
>>
>> I have never heard of this problem.

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
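The mismatch in the psql output above can be sanity-checked directly. Assuming pg_stat_database was reset at the upgrade, the counted transaction rate is an order of magnitude below the observed XID aging (values copied from the output above; the script itself is just illustrative arithmetic):

```python
# Sanity check on the figures reported above.
txns = 5_292_417                                   # sum(xact_commit + xact_rollback)
uptime_days = 1 + (13 * 3600 + 18 * 60 + 48) / 86_400   # "1 day 13:18:48"

txn_per_day = txns / uptime_days
print(round(txn_per_day))          # roughly 3.4M/day, consistent with <5M/day

# Yet the tables aged ~100M XIDs over the same window:
aged_xids = 100_000_000
print(round(aged_xids / uptime_days))   # roughly 64M XIDs/day
```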
On Wed, Apr 4, 2018 at 08:29:06PM -0400, Bruce Momjian wrote:
> On Wed, Apr 4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Is it possible that pg_upgrade used 50M xids while upgrading?
> >
> > Hi Bruce.
> >
> > Don't think so, as I did just snap the safety snap and ran another
> > upgrade on that.
> >
> > And I also compared txid_current for the upgraded snap and our running
> > production instance.
> >
> > The freshly upgraded snap is ~50M txids behind the prod instance.
>
> Are the objects 50M behind or is txid_current 50M different?  Higher or
> lower?

Uh, here is a report of a similar problem from March 6, 2018:

    https://www.postgresql.org/message-id/flat/C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601%40ebureau.com#C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601@ebureau.com

    I upgraded a very large database from 9.6 to 10.1 via pg_upgrade
    recently, and ever since, the auto vacuum has been busy on a large
    legacy table that has experienced no changes since the upgrade.  If
    the whole table had been frozen prior to the upgrade, would you
    expect it to stay frozen?

It sure smells like we have a bug here.  Could this be statistics
collection instead?

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Apr 4, 2018 at 08:29:06PM -0400, Bruce Momjian wrote:
>
>> On Wed, Apr 4, 2018 at 07:13:36PM -0500, Jerry Sievers wrote:
>> > Bruce Momjian <bruce@momjian.us> writes:
>> > > Is it possible that pg_upgrade used 50M xids while upgrading?
>> >
>> > Hi Bruce.
>> >
>> > Don't think so, as I did just snap the safety snap and ran another
>> > upgrade on that.
>> >
>> > And I also compared txid_current for the upgraded snap and our running
>> > production instance.
>> >
>> > The freshly upgraded snap is ~50M txids behind the prod instance.
>>
>> Are the objects 50M behind or is txid_current 50M different?  Higher or
>> lower?
>
> Uh, here is a report of a similar problem from March 6, 2018:
>
>     https://www.postgresql.org/message-id/flat/C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601%40ebureau.com#C44C73BC-6B3A-42E0-9E44-6CE4E5B5D601@ebureau.com
>
>     I upgraded a very large database from 9.6 to 10.1 via pg_upgrade
>     recently, and ever since, the auto vacuum has been busy on a large
>     legacy table that has experienced no changes since the upgrade.  If
>     the whole table had been frozen prior to the upgrade, would you
>     expect it to stay frozen?
>
> It sure smells like we have a bug here.  Could this be statistics
> collection instead?

No clue, but we still have the 9.5 safety snap that I can make repeated
sub-snaps of for more testing.

I did one such test yesterday and things looked $normal after the
upgrade.  Noteworthy omission was that I did *not* run the
post-analysis.

I could repeat the test more thoroughly.

We run a parallel analyzer framework to get huge DBs processed quickly.
And we generally do it in 2 passes; the first with def_stats_target
scaled down to finish quicker and hopefully be good enough to run
production.  At this point we open the DB cluster for business.
Immediately following, we again run the analyzer at full stats_target.

Any other suggestions what to also look at and I'll be glad to do and
report back.

Big Thanks!

-- 
Jerry Sievers
e: jerry.sievers@comcast.net
p: 312.241.7800
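The two-pass analyzer approach described above can be sketched in miniature: a fast pass at a reduced statistics target, then a full-target pass over the same tables. The sketch below only generates the SQL; table names and targets are made up, and a real framework would distribute the statements across parallel psql workers:

```python
# Illustrative sketch of a two-pass post-upgrade analyze: a quick pass at a
# low statistics target so the cluster can open for business, then a second
# pass at the full target. Not the poster's actual framework.

def analyze_statements(tables, stats_target):
    """Yield the session setting followed by one ANALYZE per table."""
    yield f"SET default_statistics_target = {stats_target};"
    for t in tables:
        yield f"ANALYZE {t};"

tables = ["public.big_table", "public.other_table"]   # hypothetical names
fast_pass = list(analyze_statements(tables, stats_target=10))
full_pass = list(analyze_statements(tables, stats_target=100))
print(fast_pass[0])  # SET default_statistics_target = 10;
print(fast_pass[1])  # ANALYZE public.big_table;
```

Stock tooling offers something similar: vacuumdb has an --analyze-in-stages option that runs analyze in several passes with increasing statistics targets.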