Обсуждение: why there is not VACUUM FULL CONCURRENTLY?

Поиск
Список
Период
Сортировка

why there is not VACUUM FULL CONCURRENTLY?

От
Pavel Stehule
Дата:
Hi

I have one question, what is a block of implementation of some variant of VACUUM FULL like REINDEX CONCURRENTLY? Why similar mechanism of REINDEX CONCURRENTLY cannot be used for VACUUM FULL?

Regards

Pavel

Re: why there is not VACUUM FULL CONCURRENTLY?

От
Michael Paquier
Дата:
On Tue, Jan 30, 2024 at 09:01:57AM +0100, Pavel Stehule wrote:
> I have one question, what is a block of implementation of some variant of
> VACUUM FULL like REINDEX CONCURRENTLY? Why similar mechanism of REINDEX
> CONCURRENTLY cannot be used for VACUUM FULL?

You may be interested in these threads:
https://www.postgresql.org/message-id/CAB7nPqTGmNUFi%2BW6F1iwmf7J-o6sY%2Bxxo6Yb%3DmkUVYT-CG-B5A%40mail.gmail.com
https://www.postgresql.org/message-id/CAB7nPqTys6JUQDxUczbJb0BNW0kPrW8WdZuk11KaxQq6o98PJg%40mail.gmail.com

VACUUM FULL is CLUSTER under the hoods.  One may question whether it
is still a relevant discussion these days if we assume that autovacuum
is able to keep up, because it always keeps up with the house cleanup,
right?  ;)

More seriously, we have a lot more options these days with VACUUM like
PARALLEL, so CONCURRENTLY may still have some uses, but the new toys
available may have changed things.  So, would it be worth the
complexities around heap manipulations that lower locks would require?
--
Michael

Вложения

Re: why there is not VACUUM FULL CONCURRENTLY?

От
Pavel Stehule
Дата:


út 30. 1. 2024 v 9:14 odesílatel Michael Paquier <michael@paquier.xyz> napsal:
On Tue, Jan 30, 2024 at 09:01:57AM +0100, Pavel Stehule wrote:
> I have one question, what is a block of implementation of some variant of
> VACUUM FULL like REINDEX CONCURRENTLY? Why similar mechanism of REINDEX
> CONCURRENTLY cannot be used for VACUUM FULL?

You may be interested in these threads:
https://www.postgresql.org/message-id/CAB7nPqTGmNUFi%2BW6F1iwmf7J-o6sY%2Bxxo6Yb%3DmkUVYT-CG-B5A%40mail.gmail.com
https://www.postgresql.org/message-id/CAB7nPqTys6JUQDxUczbJb0BNW0kPrW8WdZuk11KaxQq6o98PJg%40mail.gmail.com

VACUUM FULL is CLUSTER under the hoods.  One may question whether it
is still a relevant discussion these days if we assume that autovacuum
is able to keep up, because it always keeps up with the house cleanup,
right?  ;)

More seriously, we have a lot more options these days with VACUUM like
PARALLEL, so CONCURRENTLY may still have some uses, but the new toys
available may have changed things.  So, would it be worth the
complexities around heap manipulations that lower locks would require?

One of my customer today is reducing one table from 140GB to 20GB.  Now he is able to run archiving. He should play with pg_repack, and it is working well today, but I ask myself, what pg_repack does not be hard to do internally because it should be done for REINDEX CONCURRENTLY. This is not a common task, and not will be, but on the other hand, it can be nice to have feature, and maybe not too hard to implement today. But I didn't try it

I'll read the threads

Pavel


--
Michael

Re: why there is not VACUUM FULL CONCURRENTLY?

От
Alvaro Herrera
Дата:
On 2024-Jan-30, Pavel Stehule wrote:

> One of my customer today is reducing one table from 140GB to 20GB.  Now he
> is able to run archiving. He should play with pg_repack, and it is working
> well today, but I ask myself, what pg_repack does not be hard to do
> internally because it should be done for REINDEX CONCURRENTLY. This is not
> a common task, and not will be, but on the other hand, it can be nice to
> have feature, and maybe not too hard to implement today. But I didn't try it

FWIW a newer, more modern and more trustworthy alternative to pg_repack
is pg_squeeze, which I discovered almost by random chance, and soon
discovered I liked it much more.

So thinking about your question, I think it might be possible to
integrate a tool that works like pg_squeeze, such that it runs when
VACUUM is invoked -- either under some new option, or just replace the
code under FULL, not sure.  If the Cybertec people allows it, we could
just grab the pg_squeeze code and add it to the things that VACUUM can
run.

Now, pg_squeeze has some additional features, such as periodic
"squeezing" of tables.  In a first attempt, for simplicity, I would
leave that stuff out and just allow it to run from the user invoking it,
and then have the command to do a single run.  (The scheduling features
could be added later, or somehow integrated into autovacuum, or maybe
something else.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We're here to devour each other alive"            (Hobbes)



Re: why there is not VACUUM FULL CONCURRENTLY?

От
Pavel Stehule
Дата:


út 30. 1. 2024 v 11:31 odesílatel Alvaro Herrera <alvherre@alvh.no-ip.org> napsal:
On 2024-Jan-30, Pavel Stehule wrote:

> One of my customer today is reducing one table from 140GB to 20GB.  Now he
> is able to run archiving. He should play with pg_repack, and it is working
> well today, but I ask myself, what pg_repack does not be hard to do
> internally because it should be done for REINDEX CONCURRENTLY. This is not
> a common task, and not will be, but on the other hand, it can be nice to
> have feature, and maybe not too hard to implement today. But I didn't try it

FWIW a newer, more modern and more trustworthy alternative to pg_repack
is pg_squeeze, which I discovered almost by random chance, and soon
discovered I liked it much more.

So thinking about your question, I think it might be possible to
integrate a tool that works like pg_squeeze, such that it runs when
VACUUM is invoked -- either under some new option, or just replace the
code under FULL, not sure.  If the Cybertec people allows it, we could
just grab the pg_squeeze code and add it to the things that VACUUM can
run.

Now, pg_squeeze has some additional features, such as periodic
"squeezing" of tables.  In a first attempt, for simplicity, I would
leave that stuff out and just allow it to run from the user invoking it,
and then have the command to do a single run.  (The scheduling features
could be added later, or somehow integrated into autovacuum, or maybe
something else.)

some basic variant (without autovacuum support) can be good enough. We have no autovacuum support for REINDEX CONCURRENTLY and I don't see a necessity for it (sure, it can be limited by my perspective) . The necessity of reducing table size is not too common (a lot of use cases are better covered by using partitioning), but sometimes it is, and then buildin simple available solution can be helpful.





--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We're here to devour each other alive"            (Hobbes)

Re: why there is not VACUUM FULL CONCURRENTLY?

От
Alvaro Herrera
Дата:
On 2024-Jan-30, Pavel Stehule wrote:

> some basic variant (without autovacuum support) can be good enough. We have
> no autovacuum support for REINDEX CONCURRENTLY and I don't see a necessity
> for it (sure, it can be limited by my perspective) . The necessity of
> reducing table size is not too common (a lot of use cases are better
> covered by using partitioning), but sometimes it is, and then buildin
> simple available solution can be helpful.

That's my thinking as well.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: why there is not VACUUM FULL CONCURRENTLY?

От
Michael Paquier
Дата:
On Tue, Jan 30, 2024 at 12:37:12PM +0100, Alvaro Herrera wrote:
> On 2024-Jan-30, Pavel Stehule wrote:
>
> > some basic variant (without autovacuum support) can be good enough. We have
> > no autovacuum support for REINDEX CONCURRENTLY and I don't see a necessity
> > for it (sure, it can be limited by my perspective) . The necessity of
> > reducing table size is not too common (a lot of use cases are better
> > covered by using partitioning), but sometimes it is, and then buildin
> > simple available solution can be helpful.
>
> That's my thinking as well.

Or, yes, I'd agree about that.  This can make for a much better user
experience.  I'm just not sure how that stuff would be shaped and how
much ground it would need to cover.
--
Michael

Вложения

Re: why there is not VACUUM FULL CONCURRENTLY?

От
Antonin Houska
Дата:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2024-Jan-30, Pavel Stehule wrote:
>
> > One of my customer today is reducing one table from 140GB to 20GB.  Now he
> > is able to run archiving. He should play with pg_repack, and it is working
> > well today, but I ask myself, what pg_repack does not be hard to do
> > internally because it should be done for REINDEX CONCURRENTLY. This is not
> > a common task, and not will be, but on the other hand, it can be nice to
> > have feature, and maybe not too hard to implement today. But I didn't try it
>
> FWIW a newer, more modern and more trustworthy alternative to pg_repack
> is pg_squeeze, which I discovered almost by random chance, and soon
> discovered I liked it much more.
>
> So thinking about your question, I think it might be possible to
> integrate a tool that works like pg_squeeze, such that it runs when
> VACUUM is invoked -- either under some new option, or just replace the
> code under FULL, not sure.  If the Cybertec people allows it, we could
> just grab the pg_squeeze code and add it to the things that VACUUM can
> run.

There are no objections from Cybertec. Nevertheless, I don't expect much code
to be just copy & pasted. If I started to implement the extension today, I'd
do some things in a different way. (Some things might actually be simpler in
the core, i.e. a few small changes in PG core are easier than the related
workarounds in the extension.)

The core idea is that: 1) a "historic snapshot" is used to get the current
contents of the table, 2) logical decoding is used to capture the changes done
while the data is being copied to new storage, 3) the exclusive lock on the
table is only taken for very short time, to swap the storage (relfilenode) of
the table.

I think it should be coded in a way that allows use by VACUUM FULL, CLUSTER,
and possibly some subcommands of ALTER TABLE. For example, some users of
pg_squeeze requested an enhancement that allows the user to change column data
type w/o service disruption (typically when it appears that integer type is
going to overflow and change bigint is needed).

Online (re)partitioning could be another use case, although I admit that
commands that change the system catalog are a bit harder to implement than
VACUUM FULL / CLUSTER.

One thing that pg_squeeze does not handle is visibility: it uses heap_insert()
to insert the tuples into the new storage, so the problems described in [1]
can appear. The in-core implementation should rather do something like tuple
rewriting (rewriteheap.c).

Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)

[1] https://www.postgresql.org/docs/current/mvcc-caveats.html

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: why there is not VACUUM FULL CONCURRENTLY?

От
Alvaro Herrera
Дата:
This is great to hear.

On 2024-Jan-31, Antonin Houska wrote:

> Is your plan to work on it soon or should I try to write a draft patch? (I
> assume this is for PG >= 18.)

I don't have plans for it, so if you have resources, please go for it.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: why there is not VACUUM FULL CONCURRENTLY?

От
Antonin Houska
Дата:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> This is great to hear.
> 
> On 2024-Jan-31, Antonin Houska wrote:
> 
> > Is your plan to work on it soon or should I try to write a draft patch? (I
> > assume this is for PG >= 18.)
> 
> I don't have plans for it, so if you have resources, please go for it.

ok, I'm thinking how can the feature be integrated into the core.

BTW, I'm failing to understand why cluster_rel() has no argument of the
BufferAccessStrategy type. According to buffer/README, the criterion for using
specific strategy is that page "is unlikely to be needed again
soon". Specifically for cluster_rel(), the page will *definitely* not be used
again (unless the VACCUM FULL/CLUSTER command fails): BufferTag contains the
relatin file number and the old relation file is eventually dropped.

Am I missing anything?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: why there is not VACUUM FULL CONCURRENTLY?

От
Alvaro Herrera
Дата:
On 2024-Feb-16, Antonin Houska wrote:

> BTW, I'm failing to understand why cluster_rel() has no argument of the
> BufferAccessStrategy type. According to buffer/README, the criterion for using
> specific strategy is that page "is unlikely to be needed again
> soon". Specifically for cluster_rel(), the page will *definitely* not be used
> again (unless the VACCUM FULL/CLUSTER command fails): BufferTag contains the
> relatin file number and the old relation file is eventually dropped.
> 
> Am I missing anything?

No, that's just an oversight.  Access strategies are newer than that
cluster code.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."                               (Nathaniel Smith)
      https://mail.gnu.org/archive/html/monotone-devel/2007-01/msg00080.html