Обсуждение: Better understanding checkpoint logs

Поиск
Список
Период
Сортировка

Better understanding checkpoint logs

От
Wells Oliver
Дата:
Hi all. Trying to asses our WAL checkpoint during heavy ETL loads. Checkpoint timer is at 5min and it seems to be constantly checkpointing, maybe that's OK. We see an increased % of buffer percentages in the logs during the checkpoint. Here's a run:

| 2023-06-21 12:20:02.000 | 2023-06-21 12:20:02 UTC::@:[396]:LOG:  checkpoint complete: wrote 88523 buffers (4.4%); 0 WAL file(s) added, 0 removed, 29 recycled; write=268.929 s, sync=0.288 s, total=269.395 s; sync files=177, longest=0.178 s, average=0.002 s; distance=2368871 kB, estimate=4720836 kB |
| 2023-06-21 12:15:32.000 | 2023-06-21 12:15:32 UTC::@:[396]:LOG:  checkpoint starting: time |
| 2023-06-21 12:15:02.000 | 2023-06-21 12:15:02 UTC::@:[396]:LOG:  checkpoint complete: wrote 63434 buffers (3.1%); 0 WAL file(s) added, 0 removed, 43 recycled; write=269.279 s, sync=0.676 s, total=270.448 s; sync files=220, longest=0.487 s, average=0.004 s; distance=2830955 kB, estimate=4982166 kB |
| 2023-06-21 12:10:32.000 | 2023-06-21 12:10:32 UTC::@:[396]:LOG:  checkpoint starting: time |
| 2023-06-21 12:10:02.000 | 2023-06-21 12:10:02 UTC::@:[396]:LOG:  checkpoint complete: wrote 42229 buffers (2.1%); 0 WAL file(s) added, 0 removed, 45 recycled; write=269.272 s, sync=0.167 s, total=269.642 s; sync files=224, longest=0.131 s, average=0.001 s; distance=2923597 kB, estimate=5221189 kB |
| 2023-06-21 12:05:32.000 | 2023-06-21 12:05:32 UTC::@:[396]:LOG:  checkpoint starting: time |
| 2023-06-21 12:05:03.000 | 2023-06-21 12:05:03 UTC::@:[396]:LOG:  checkpoint complete: wrote 249580 buffers (12.4%); 0 WAL file(s) added, 0 removed, 79 recycled; write=269.868 s, sync=1.333 s, total=271.434 s; sync files=256, longest=0.375 s, average=0.006 s; distance=5209546 kB, estimate=5476477 kB |
| 2023-06-21 12:00:32.000 | 2023-06-21 12:00:32 UTC::@:[396]:LOG:  checkpoint starting: time |
| 2023-06-21 11:59:45.000 | 2023-06-21 11:59:45 UTC::@:[396]:LOG:  checkpoint complete: wrote 357329 buffers (17.7%); 0 WAL file(s) added, 1 removed, 83 recycled; write=249.722 s, sync=2.006 s, total=252.201 s; sync files=137, longest=0.849 s, average=0.015 s; distance=5504554 kB, estimate=5506136 kB |
| 2023-06-21 11:55:32.000 | 2023-06-21 11:55:32 UTC::@:[396]:LOG:  checkpoint starting: wal |
| 2023-06-21 11:55:13.000 | 2023-06-21 11:55:13 UTC::@:[396]:LOG:  checkpoint complete: wrote 470167 buffers (23.3%); 0 WAL file(s) added, 0 removed, 84 recycled; write=250.553 s, sync=0.026 s, total=250.705 s; sync files=193, longest=0.006 s, average=0.001 s; distance=5498807 kB, estimate=5506312 kB |
| 2023-06-21 11:51:02.000 | 2023-06-21 11:51:02 UTC::@:[396]:LOG:  checkpoint starting: wal |
| 2023-06-21 11:50:44.000 | 2023-06-21 11:50:44 UTC::@:[396]:LOG:  checkpoint complete: wrote 437633 buffers (21.7%); 0 WAL file(s) added, 0 removed, 84 recycled; write=157.932 s, sync=0.005 s, total=158.886 s; sync files=197, longest=0.004 s, average=0.001 s; distance=5507145 kB, estimate=5507145 kB |
| 2023-06-21 11:48:05.000 | 2023-06-21 11:48:05 UTC::@:[396]:LOG:  checkpoint starting: wal |
| 2023-06-21 11:47:57.000 | 2023-06-21 11:47:57 UTC::@:[396]:LOG:  checkpoint complete: wrote 338877 buffers (16.8%); 0 WAL file(s) added, 0 removed, 84 recycled; write=240.248 s, sync=0.042 s, total=240.928 s; sync files=164, longest=0.020 s, average=0.001 s; distance=5453945 kB, estimate=5453945 kB |
| 2023-06-21 11:43:56.000 | 2023-06-21 11:43:56 UTC::@:[396]:LOG:  checkpoint starting: wal |
| 2023-06-21 11:43:32.000 | 2023-06-21 11:43:32 UTC::@:[396]:LOG:  checkpoint complete: wrote 712593 buffers (35.3%); 0 WAL file(s) added, 15 removed, 42 recycled; write=269.229 s, sync=0.004 s, total=269.362 s; sync files=69, longest=0.003 s, average=0.001 s; distance=3454836 kB, estimate=3625475 kB |
| 2023-06-21 11:39:02.000 | 2023-06-21 11:39:02 UTC::@:[396]:LOG:  checkpoint starting: time |

I don't want to read too much into it but I am trying to understand it better to consider any potential changes in max WAL sizes, or increasing our checkpoint timeouts if they are causing the ETL stuff to be slower than it should. Our wal_level is currently set to logical but we plan to change this to replica soon as it feels like overkill.

--

Re: Better understanding checkpoint logs

От
Laurenz Albe
Дата:
On Wed, 2023-06-21 at 18:53 -0700, Wells Oliver wrote:
> Hi all. Trying to asses our WAL checkpoint during heavy ETL loads.
> Checkpoint timer is at 5min and it seems to be constantly checkpointing, maybe that's OK.
> We see an increased % of buffer percentages in the logs during the checkpoint. Here's a run:
>
> [...]
>
> I don't want to read too much into it

But that's how you learn...

> but I am trying to understand it better to consider any potential changes in max WAL sizes,
> or increasing our checkpoint timeouts if they are causing the ETL stuff to be slower than
> it should. Our wal_level is currently set to logical but we plan to change this to replica
> soon as it feels like overkill.

You should increase "max_wal_size" until most of your checkpoints are triggered by
timeout, once every 5 minutes.  That will reduce the amount of WAL, because you will
have fewer full-page images, and you will avoid redundant buffer writes.

The only negative side effect is that recovery can take longer if you crash while
loading.

Yours,
Laurenz Albe