Thread: Improve "select count(*)" query - takes more than 30 mins for some large tables


Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Raj kumar
Date:
Hi,

How can I improve "select count(*)" for larger tables? I'm doing a DB migration and need to validate the data count.
"select count(*)" queries are taking more than 30 minutes for some tables, which is more than the downtime we have.
Will increasing work_mem help? Or how else can I speed up this row count?

Thanks,
Raj

Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Florents Tselai
Date:

> On 11 Jul 2022, at 10:16 AM, Raj kumar <rajkumar820999@gmail.com> wrote:
>
> Hi,
>
> How can I improve "select count(*)" for larger tables? I'm doing a DB migration and need to validate the data count.
> "select count(*)" queries are taking more than 30 minutes for some tables, which is more than the downtime we have.
> Will increasing work_mem help? Or how else can I speed up this row count?

Personally, whenever I’ve had slow count(*) or count(distinct id), I’ve
eventually resorted to Unix tools.

psql -Atc "select id from my_table" | sort -u | wc -l

The convenience/performance tradeoff depends heavily on your schema.
After all, Unix streams don’t know much about your integrity requirements.

>
> Thanks,
> Raj




Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Holger Jakobs
Date:
Hi Raj,

Since PostgreSQL uses MVCC, there is no single authoritative row count.
Every COUNT(*) has to check the visibility of each row for the current
transaction, which makes it slow for large tables.

If you don't need the exact figure but just an estimate, you can get
it from the internal statistics tables. After an ANALYZE they are
close to the real value.

This shows the number of rows for all non-empty tables in all visible schemas.

SELECT nspname AS schemaname,relname,reltuples::numeric
   FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
   WHERE
     nspname IN (SELECT unnest(current_schemas(false))) AND
     relkind='r' AND reltuples > 0
   ORDER BY reltuples DESC;
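
For a single table, the same estimate can be pulled directly (a sketch; `my_table` is a placeholder name):

```sql
-- Refresh the statistics first so reltuples is reasonably current.
ANALYZE my_table;

-- reltuples is the planner's row estimate; cast to bigint for a whole number.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'my_table'::regclass;
```

For migration validation this is only an approximation, but it returns instantly regardless of table size.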

Regards,

Holger

On 11.07.22 09:16, Raj kumar wrote:
> Hi,
>
> How can I improve "select count(*)" for larger tables? I'm doing a DB
> migration and need to validate the data count.
> "select count(*)" queries are taking more than 30 minutes for some
> tables, which is more than the downtime we have.
> Will increasing work_mem help? Or how else can I speed up this row count?
>
> Thanks,
> Raj

-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012



Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Raj kumar
Date:
Thanks Florents,

I tried psql -c "select count(*)". It brought the time down from 30 minutes to 2 minutes.
Thanks a lot.

Thanks, Holger.
I'm going to try this query now.

Thanks,
Raj Kumar

On Mon, Jul 11, 2022 at 12:53 PM Florents Tselai <florents.tselai@gmail.com> wrote:


> On 11 Jul 2022, at 10:16 AM, Raj kumar <rajkumar820999@gmail.com> wrote:
>
> Hi,
>
> How can I improve "select count(*)" for larger tables? I'm doing a DB migration and need to validate the data count.
> "select count(*)" queries are taking more than 30 minutes for some tables, which is more than the downtime we have.
> Will increasing work_mem help? Or how else can I speed up this row count?

Personally, whenever I’ve had slow count(*) or count(distinct id), I’ve
eventually resorted to Unix tools.

psql -Atc "select id from my_table" | sort -u | wc -l

The convenience/performance tradeoff depends heavily on your schema.
After all, Unix streams don’t know much about your integrity requirements.

>
> Thanks,
> Raj

Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Mladen Gogala
Date:
On 7/11/22 03:23, Florents Tselai wrote:
psql -Atc "select id from my_table" | sort -u | wc -l

That will be a lot slower than just "select count(*) from my_table". You are delivering the data to a client program (psql), shipping it through a pipe, and then processing the output with "wc". Depending on the version, PostgreSQL has very reliable parallelism and can do the counting rather quickly. The speed of "select count(*) from my_table" depends on the speed of I/O. Since the table is big, it cannot be cached in the file system cache, so all you have at your disposal is raw disk speed. For smaller machines, NVMe is king. For larger rigs, you should consider something like Pure, XtremIO, or NetApp SolidFire. People frequently expect the database to do miracles with subpar hardware.
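
As a rough illustration of the parallelism mentioned above, a parallel count can be encouraged per session (a sketch; the exact values depend on your hardware, and `my_table` is a placeholder):

```sql
-- Allow more workers for a single Gather node (session-level setting).
SET max_parallel_workers_per_gather = 4;

-- Lower the table-size threshold above which parallel scans are considered.
SET min_parallel_table_scan_size = '8MB';

-- Check the plan: a parallel count shows a "Parallel Seq Scan" node.
EXPLAIN SELECT count(*) FROM my_table;
```

Whether this helps still depends on the I/O bottleneck described above; parallel workers cannot read the disk faster than the disk allows.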

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Mladen Gogala
Date:
What's wrong with parallelism? That's why it was invented. If you really need an accurate count at a moment's notice, create a trigger to maintain it.
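
A trigger-maintained counter might look roughly like this (a hedged sketch; `my_table`, `my_table_count`, and the function name are placeholders, not anything from the thread):

```sql
-- Counter table holding a single running total.
CREATE TABLE my_table_count (n bigint NOT NULL);
INSERT INTO my_table_count VALUES (0);

-- Trigger function: adjust the counter on every insert or delete.
CREATE FUNCTION my_table_count_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE my_table_count SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE my_table_count SET n = n - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger; the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_count_trg
AFTER INSERT OR DELETE ON my_table
FOR EACH ROW EXECUTE FUNCTION my_table_count_trg();
```

Note that this serializes concurrent writers on the single counter row; for write-heavy tables, a per-transaction delta table that is periodically rolled up scales better.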
Regards

On Tue, Jul 12, 2022, 10:31 AM MichaelDBA Vitale <michaeldba@sqlexec.com> wrote:
Perhaps do an ANALYZE on the table and then select reltuples from pg_class for that table. It might be faster than the select count(*).

Regards,
Michael Vitale

On 07/12/2022 8:51 AM Mladen Gogala <gogala.mladen@gmail.com> wrote:


On 7/11/22 03:23, Florents Tselai wrote:
psql -Atc "select id from my_table" | sort -u | wc -l

That will be a lot slower than just "select count(*) from my_table". You are delivering the data to a client program (psql), shipping it through a pipe, and then processing the output with "wc". Depending on the version, PostgreSQL has very reliable parallelism and can do the counting rather quickly. The speed of "select count(*) from my_table" depends on the speed of I/O. Since the table is big, it cannot be cached in the file system cache, so all you have at your disposal is raw disk speed. For smaller machines, NVMe is king. For larger rigs, you should consider something like Pure, XtremIO, or NetApp SolidFire. People frequently expect the database to do miracles with subpar hardware.

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

RE: [EXT] Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: "Pierson Patricia L (Contractor)"
Date:

Hello,

Do a count on the primary key. It will force index access, and you won't read the entire row, which may be very long.

Like: select count(ID) from my_table;

From: Mladen Gogala <gogala.mladen@gmail.com>
Sent: Tuesday, July 12, 2022 11:58 AM
To: MichaelDBA Vitale <michaeldba@sqlexec.com>
Cc: pgsql-admin@lists.postgresql.org
Subject: [EXT] Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

 

What's wrong with parallelism? That's why it was invented. If you really need an accurate count at a moment's notice, create a trigger to maintain it.

Regards

 

On Tue, Jul 12, 2022, 10:31 AM MichaelDBA Vitale <michaeldba@sqlexec.com> wrote:

Perhaps do an ANALYZE on the table and then select reltuples from pg_class for that table. It might be faster than the select count(*).

 

Regards,

Michael Vitale

 

On 07/12/2022 8:51 AM Mladen Gogala <gogala.mladen@gmail.com> wrote:

 

 

On 7/11/22 03:23, Florents Tselai wrote:

psql -Atc "select id from my_table" | sort -u | wc -l

That will be a lot slower than just "select count(*) from my_table". You are delivering the data to a client program (psql), shipping it through a pipe, and then processing the output with "wc". Depending on the version, PostgreSQL has very reliable parallelism and can do the counting rather quickly. The speed of "select count(*) from my_table" depends on the speed of I/O. Since the table is big, it cannot be cached in the file system cache, so all you have at your disposal is raw disk speed. For smaller machines, NVMe is king. For larger rigs, you should consider something like Pure, XTremIO, or NetApp SolidFire. People frequently expect the database to do miracles with subpar hardware.

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: [EXT] Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Holger Jakobs
Date:
On 12.07.22 20:13, Pierson Patricia L (Contractor) wrote:

Hello,

Do a count on the primary key. It will force index access, and you won't read the entire row, which may be very long.

Like: select count(ID) from my_table;

If you do a COUNT(*), the database won't do anything different. That's a myth.

Just compare the execution plans.


db=# explain select count(anr) from auftrag;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=11614.55..11614.56 rows=1 width=8)
   ->  Gather  (cost=11614.33..11614.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=10614.33..10614.34 rows=1 width=8)
               ->  Parallel Seq Scan on auftrag  (cost=0.00..9572.67 rows=416667 width=8)
(5 rows)

db=# explain select count(*) from auftrag;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=11614.55..11614.56 rows=1 width=8)
   ->  Gather  (cost=11614.33..11614.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=10614.33..10614.34 rows=1 width=8)
               ->  Parallel Seq Scan on auftrag  (cost=0.00..9572.67 rows=416667 width=0)
(5 rows)

-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012

Re: [EXT] Re: Improve "select count(*)" query - takes more than 30 mins for some large tables

From: Mladen Gogala
Date:
On 7/12/22 14:25, MichaelDBA Vitale wrote:
That is not true: doing the select on the primary key will still result in a table scan, not an index scan.  The heap always gets accessed for select counts.

Regards,
Michael Vitale

Well, not always:

mgogala=# explain select count(*) from emp;
                        QUERY PLAN                        
----------------------------------------------------------
 Aggregate  (cost=1.18..1.19 rows=1 width=8)
   ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=0)
(2 rows)

That is the classic "select count(*)".  Now, let's see PK:

mgogala=# \d emp
                           Table "mgogala.emp"
  Column  |            Type             | Collation | Nullable | Default
----------+-----------------------------+-----------+----------+---------
 empno    | smallint                    |           | not null |
 ename    | character varying(10)       |           |          |
 job      | character varying(9)        |           |          |
 mgr      | smallint                    |           |          |
 hiredate | timestamp without time zone |           |          |
 sal      | double precision            |           |          |
 comm     | double precision            |           |          |
 deptno   | smallint                    |           |          |
Indexes:
    "emp_pkey" PRIMARY KEY, btree (empno)
Foreign-key constraints:
    "fk_deptno" FOREIGN KEY (deptno) REFERENCES dept(deptno)


Now, let's do scan on "count(empno)":

mgogala=# explain select count(empno) from emp;
                        QUERY PLAN                        
----------------------------------------------------------
 Aggregate  (cost=1.18..1.19 rows=1 width=8)
   ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=2)
(2 rows)

So far, so good. Sequential scan, despite the existence of the NOT NULL primary key. Now comes the twist:

mgogala=# set enable_seqscan=off;
SET
mgogala=# explain select count(empno) from emp;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=12.38..12.39 rows=1 width=8)
   ->  Index Only Scan using emp_pkey on emp  (cost=0.14..12.35 rows=14 width=2)
(2 rows)

Voilà, we've got an index-only "count" scan. So, in some cases Postgres does do an index-only scan. What is surprising is that pg_hint_plan doesn't produce the desired results:

mgogala=# explain /*+ IndexScan(emp emp_pkey) */ select count(empno) from emp;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Aggregate  (cost=10000000001.17..10000000001.18 rows=1 width=8)
   ->  Seq Scan on emp  (cost=10000000000.00..10000000001.14 rows=14 width=2)
 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(5 rows)

This is strange in more ways than one:

  1. The hint didn't work; Postgres still performed a sequential scan.
  2. JIT got involved, for reasons that are completely unclear.
  3. The cost is enormous, despite the fact that the "emp" table has only 14 rows and is fully contained in a single block.

However, if we use a GUC hint, the result behaves as expected:

mgogala=# explain /*+ set(enable_seqscan off) */ select count(empno) from emp e;
                                    QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=12.38..12.39 rows=1 width=8)
   ->  Index Only Scan using emp_pkey on emp e  (cost=0.14..12.35 rows=14 width=2)
(2 rows)

So, it is possible to force an index-only scan. However, from a purely theoretical point of view, the speed difference is questionable. Indexes are stored differently than tables: they have branch blocks and data blocks, with ample free space in the data blocks to prevent block splits and index-level increases with their mandatory re-balancing. Indexes are actually quite large structures, sometimes even larger than the underlying tables. In general, going after the index-only scan doesn't look like a good strategy.
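
The table-vs-index size comparison is easy to check (a sketch using the `emp` and `emp_pkey` names from the example above):

```sql
-- Compare the on-disk sizes of the table and its primary-key index.
SELECT pg_size_pretty(pg_relation_size('emp'))      AS table_size,
       pg_size_pretty(pg_relation_size('emp_pkey')) AS index_size;
```

If the index is not substantially smaller than the heap, an index-only scan buys little for a full count.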

After all this work, I also have a question: does PostgreSQL read index blocks in batches, or does it do single-block reads? Some other databases have two events: "db file scattered read" for sequential scans (called "full table scan" in that specific vernacular) and "db file sequential read" for index scans. Those databases only do single-block reads from indexes, while tables can be read in batches of blocks. Does PostgreSQL do the same thing?

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com