Re: Long running query causing XID limit breach

Поиск
Список
Период
Сортировка
От Laurenz Albe
Тема Re: Long running query causing XID limit breach
Дата
Msg-id abd3bc064d16bc93a2d8661a692903da97d2c154.camel@cybertec.at
обсуждение исходный текст
Ответ на Long running query causing XID limit breach  (sud <suds1434@gmail.com>)
Ответы Re: Long running query causing XID limit breach
Список pgsql-general
On Thu, 2024-05-23 at 02:46 +0530, sud wrote:
> It's RDS postgres version 15.4. We suddenly saw the "MaximumUsedTransactionIDs"
> reach to ~1.5billion and got alerted by team members who mentioned the database
> is going to be in shutdown/hung if this value reaches to ~2billion and won't be
> able to serve any incoming transactions. It was a panic situation.
>
> I have heard of it before , because of the way postgres works and the XID being
> a datatype of length 32 bit integer can only represent (2^32)/2=~2 billion
> transactions. However, as RDS performs the auto vacuum , we thought that we need
> not worry about this issue. But it seems we were wrong. And we found one adhoc
> "SELECT '' query was running on the reader instance since the last couple of
> days and when that was killed, the max xid (MaximumUsedTransactionIDs) dropped
> to 50million immediately.

This has nothing to do with autovacuum running.
PostgreSQL won't freeze any rows above the xmin horizon (see the "backend_xmin"
column in "pg_stat_activity").

> So I have few questions,
>
> 1)This system is going to be a 24/7 up and running system which will process
>   ~500million business transactions/day in future i.e. ~4-5billion rows/day
>   inserted across multiple tables each day. And as I understand each row will
>   have XID allocated. So in that case , does it mean that, we will need
>   (5billion/24)=~200million XID/hour and thus , if any such legitimate
>   application "SELECT" query keeps running for ~10 hours (and thus keep the
>   historical XID alive) , then it can saturate the "MaximumUsedTransactionIDs"
>   and make the database standstill in 2billion/200million=~10hrs. Is this
>   understanding correct? Seems we are prone to hit this limit sooner going forward.

Yes, that is correct.  You cannot run such long-running queries with a
transaction rate like that.

> 2)We have some legitimate cases where the reporting queries can run for 5-6hrs.
>   So in such cases if the start of this SELECT query happen at 100th XID on
>   table TAB1, then whatever transactions happen after that time, across all
>   other tables(table2, table3 etc) in the database won't get vacuum until that
>   SELECT query on table1 get vacuumed(as database will try to keep that same
>   100th XID image) and the XID will just keep incrementing for new transaction,
>   eventually reaching the max limit. Is my understanding correct here?

Again correct.  PostgreSQL cannot tell which tables the query will use in the
future, so VACUUM cannot clean up old row versions in any table in the database.
This is irrelevant for transaction ID wraparound, though: you'll get into trouble
even if only a single table holds an unfrozen row that is old enough.

> 3)Although RDS does the auto vacuum by default. but should we also consider
>   doing manual vacuum without impacting ongoing transactions?

That won't help.  Autovacuum is running, but cannot freeze the rows, and a manual
VACUUM cannot do it either.

> 4)Had worked in past in oracle database [...]

Oracle implements all this radically differently.

>  And in that case, It seems we have to mandatorily set "statement_timeout" to
> some value e.g. 4hrs(also i am not seeing a way to set it for any specific
> user level, so it will be set for all queries including application level)
> and also "idle_in_transaction_session_timeout" to 5minutes, even on all the
> prod and non prod databases, to restrict the long running transactions/queries
> and avoid such issues in future. Correct me if I'm wrong.

That looks right.

One thing you could consider is running the long-running queries on a standby
server.  Replication will get delayed, and you have to keep all the WAL
around for the standby to catch up once the query is done, but it should work.
You'd set "max_streaming_standby_delay" to -1 on the standby.

Yours,
Laurenz Albe



В списке pgsql-general по дате отправления:

Предыдущее
От: yudhi s
Дата:
Сообщение: Re: Long running query causing XID limit breach
Следующее
От: Jethish Jethish
Дата:
Сообщение: Backup failure Postgres