Discussion: [PERFORM] Very poor read performance, query independent
I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:
*2x Intel Xeon E5550
*72GB RAM
*Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80% read/20% write) for Postgresql data only:
Logical Drive: 3
Size: 273.4 GB
Fault Tolerance: 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 65535
Strip Size: 128 KB
Full Stripe Size: 256 KB
Status: OK
Caching: Enabled
Unique Identifier: 600508B1001037383941424344450A00
Disk Name: /dev/sdc
Mount Points: /mnt/data 273.4 GB
OS Status: LOCKED
Logical Drive Label: A00A194750123456789ABCDE516F
Mirror Group 0:
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)
Mirror Group 1:
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)
physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 146 GB, OK)
Drive Type: Data
Formatted with ext4 with: sudo mkfs.ext4 -E stride=32,stripe_width=64 -v /dev/sdc1.
Mounted in /etc/fstab with this line: "UUID=99fef4ae-51dc-4365-9210-0b153b1cbbd0 /mnt/data ext4 rw,nodiratime,user_xattr,noatime,nobarrier,errors=remount-ro 0 1"
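For reference, the stride and stripe_width passed to mkfs.ext4 follow from the controller geometry listed above; a quick sketch of the arithmetic (the strip size comes from the controller output, while ext4's default 4 kB block size and 2 data-bearing disks for a 4-disk RAID10 are assumptions):

```shell
# stride = per-disk strip size / filesystem block size
# stripe_width = stride * number of data-bearing disks
STRIP_KB=128     # "Strip Size: 128 KB" from the controller
BLOCK_KB=4       # ext4 default block size (assumed)
DATA_DISKS=2     # RAID10 over 4 disks -> 2 data-bearing spindles

STRIDE=$((STRIP_KB / BLOCK_KB))
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))
echo "stride=$STRIDE stripe_width=$STRIPE_WIDTH"   # prints: stride=32 stripe_width=64
```

which matches the `mkfs.ext4 -E stride=32,stripe_width=64` invocation above.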
Postgresql is the only application running on this server.
Postgresql is used as a mini data warehouse to generate reports and do statistical analysis. It is used by at most 2 users and fresh data is added every 10 days. The database has 16 tables: one is 224GB, and the rest are between 16kB and 470MB.
My configuration is:
name | current_setting | source
---------------------------------+------------------------------------------------+----------------------
application_name | psql | client
autovacuum_vacuum_scale_factor | 0 | configuration file
autovacuum_vacuum_threshold | 2000 | configuration file
checkpoint_completion_target | 0.9 | configuration file
checkpoint_timeout | 30min | configuration file
client_encoding | UTF8 | client
client_min_messages | log | configuration file
cluster_name | 9.6/main | configuration file
cpu_index_tuple_cost | 0.001 | configuration file
cpu_operator_cost | 0.0005 | configuration file
cpu_tuple_cost | 0.003 | configuration file
DateStyle | ISO, YMD | configuration file
default_statistics_target | 100 | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
effective_cache_size | 22GB | configuration file
effective_io_concurrency | 4 | configuration file
external_pid_file | /var/run/postgresql/9.6-main.pid | configuration file
lc_messages | C | configuration file
lc_monetary | en_CA.UTF-8 | configuration file
lc_numeric | en_CA.UTF-8 | configuration file
lc_time | en_CA.UTF-8 | configuration file
listen_addresses | * | configuration file
lock_timeout | 100s | configuration file
log_autovacuum_min_duration | 0 | configuration file
log_checkpoints | on | configuration file
log_connections | on | configuration file
log_destination | csvlog | configuration file
log_directory | /mnt/bigzilla/data/toburn/hp/postgresql/pg_log | configuration file
log_disconnections | on | configuration file
log_error_verbosity | default | configuration file
log_file_mode | 0600 | configuration file
log_filename | postgresql-%Y-%m-%d_%H%M%S.log | configuration file
log_line_prefix | user=%u,db=%d,app=%aclient=%h | configuration file
log_lock_waits | on | configuration file
log_min_duration_statement | 0 | configuration file
log_min_error_statement | debug1 | configuration file
log_min_messages | debug1 | configuration file
log_rotation_size | 1GB | configuration file
log_temp_files | 0 | configuration file
log_timezone | localtime | configuration file
logging_collector | on | configuration file
maintenance_work_mem | 3GB | configuration file
max_connections | 10 | configuration file
max_locks_per_transaction | 256 | configuration file
max_parallel_workers_per_gather | 14 | configuration file
max_stack_depth | 2MB | environment variable
max_wal_size | 4GB | configuration file
max_worker_processes | 14 | configuration file
min_wal_size | 2GB | configuration file
parallel_setup_cost | 1000 | configuration file
parallel_tuple_cost | 0.012 | configuration file
port | 5432 | configuration file
random_page_cost | 22 | configuration file
seq_page_cost | 1 | configuration file
shared_buffers | 34GB | configuration file
shared_preload_libraries | pg_stat_statements | configuration file
ssl | on | configuration file
ssl_cert_file | /etc/ssl/certs/ssl-cert-snakeoil.pem | configuration file
ssl_key_file | /etc/ssl/private/ssl-cert-snakeoil.key | configuration file
statement_timeout | 1000000s | configuration file
stats_temp_directory | /var/run/postgresql/9.6-main.pg_stat_tmp | configuration file
superuser_reserved_connections | 1 | configuration file
syslog_facility | local1 | configuration file
syslog_ident | postgres | configuration file
syslog_sequence_numbers | on | configuration file
temp_file_limit | 80GB | configuration file
TimeZone | localtime | configuration file
track_activities | on | configuration file
track_counts | on | configuration file
track_functions | all | configuration file
unix_socket_directories | /var/run/postgresql | configuration file
vacuum_cost_delay | 1ms | configuration file
vacuum_cost_limit | 5000 | configuration file
vacuum_cost_page_dirty | 200 | configuration file
vacuum_cost_page_hit | 10 | configuration file
vacuum_cost_page_miss | 100 | configuration file
wal_buffers | 16MB | configuration file
wal_compression | on | configuration file
wal_sync_method | fdatasync | configuration file
work_mem | 1468006kB | configuration file
The part of /etc/sysctl.conf I modified is:
vm.swappiness = 1
vm.dirty_background_bytes = 134217728
vm.dirty_bytes = 1073741824
vm.overcommit_ratio = 100
vm.zone_reclaim_mode = 0
kernel.numa_balancing = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_migration_cost_ns = 5000000
The problem I have is very poor read performance. When I benchmark my array with fio I get random reads of about 200MB/s and 1,100 IOPS and sequential reads of about 286MB/s and 21,000 IOPS. But when I watch my queries using pg_activity, I get at best 4MB/s. Also, using dstat I can see that iowait time is at about 25%. This problem is not query-dependent.
I backed up the database, reformatted the array making sure it is well aligned, then restored the database and got the same result.
Where should I target my troubleshooting at this stage? I reformatted my drive and tuned my postgresql.conf and OS as much as I could. The hardware doesn’t seem to have any issues; I am really puzzled.
Thanks!
Charles
--
Charles Nadeau Ph.D.
On 10.07.2017 16:03, Charles Nadeau wrote:
> random_page_cost | 22
Why such a high value for random_page_cost?
Regards, Andreas
--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Charles Nadeau
Sent: Monday, July 10, 2017 11:48 AM
To: Andreas Kretschmer <andreas@a-kretschmer.de>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Andreas,
Because the ratio between the Sequential IOPS and Random IOPS is about 29. Taking into account that part of the data is in RAM, I obtained an "effective" ratio of about 22.
Thanks!
Charles
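Charles's "effective" figure can be sketched as a weighted average of the raw sequential-to-random ratio and the unit cost of a cached page; the 25% cached fraction below is an assumed number chosen to illustrate the discounting, not one Charles reported:

```shell
# Back-of-envelope for an "effective" random_page_cost: pages served from
# RAM cost roughly the same as sequential pages (cost 1), so the raw ratio
# is discounted by the assumed cache-hit fraction.
awk 'BEGIN {
  raw_ratio = 29    # measured sequential-to-random cost ratio
  cached    = 0.25  # assumed fraction of pages served from RAM
  eff = raw_ratio * (1 - cached) + 1 * cached
  printf "effective random_page_cost ~ %.0f\n", eff   # prints ~22
}'
```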
--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/
Considering your RAM size of 72GB and your database size of ~225GB, and the fact that Postgres is the only app running on the server, probably 1/3 of your database resides in memory, so random_page_cost = 22 looks extremely high; it probably precludes index usage in your queries entirely.
You should try this setting at least at its default value of random_page_cost = 4, and probably go even lower.
Also, effective_cache_size should be at least as big as your shared_buffers. With 72GB of RAM, effective_cache_size should be set to around 64GB (again, considering that Postgres is the only app running on the server).
Regards,
Igor Neyman
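A minimal way to try Igor's numbers without hand-editing postgresql.conf (a sketch, not part of his message; ALTER SYSTEM is available in 9.6, and this assumes superuser access and that both GUCs take effect on reload):

```shell
# Apply the suggested planner settings and reload the config.
psql -U postgres <<'SQL'
ALTER SYSTEM SET random_page_cost = 4;
ALTER SYSTEM SET effective_cache_size = '64GB';
SELECT pg_reload_conf();
SQL
```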
> The problem I have is very poor read performance. When I benchmark my array with fio I get random reads of about 200MB/s and 1,100 IOPS and sequential reads of about 286MB/s and 21,000 IOPS.
Although probably not the root cause, at the least I would set up hugepages (https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES), and bump effective_io_concurrency up quite a bit as well (256?).
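A rough sizing sketch for the huge-pages suggestion, using the 34GB shared_buffers from the config above (2MB huge pages and a ~5% headroom are assumptions; the docs recommend reading the exact requirement from the postmaster's VmPeak instead):

```shell
# Estimate vm.nr_hugepages for shared_buffers = 34GB with 2MB huge pages.
SHARED_BUFFERS_MB=$((34 * 1024))   # shared_buffers = 34GB
HUGEPAGE_MB=2                      # default x86-64 huge page size
PAGES=$((SHARED_BUFFERS_MB / HUGEPAGE_MB))
PAGES=$((PAGES + PAGES / 20))      # ~5% cushion for other shared memory (assumed)
echo "vm.nr_hugepages = $PAGES"    # then set huge_pages = on in postgresql.conf
```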
On Mon, Jul 10, 2017 at 7:03 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> The problem I have is very poor read performance. When I benchmark my array with fio I get random reads of about 200MB/s and 1,100 IOPS and sequential reads of about 286MB/s and 21,000 IOPS.
That doesn't seem right. Sequential is only 43% faster? What job file are you giving to fio? What do you get if you do something simpler, like:
time cat ~/$PGDATA/base/16402/* | wc -c
replacing 16402 with whatever your biggest database is.
Cheers,
Jeff
The fio script used for the benchmark:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4
And the cat timings against the largest database:
charles@hpdl380g6:~$ sudo sh -c 'time cat /mnt/data/postgresql/base/16385/* | wc -c'
[sudo] password for charles:
1.62user 179.94system 29:50.79elapsed 10%CPU (0avgtext+0avgdata 1920maxresident)k
448026264inputs+0outputs (0major+117minor)pagefaults 0swaps
241297594904
charles@hpdl380g6:~$ sudo sh -c 'time cat /mnt/data/postgresql/base/16385/* | wc -c'
[sudo] password for charles:
0.86user 130.84system 28:15.78elapsed 7%CPU (0avgtext+0avgdata 1820maxresident)k
471286792inputs+0outputs (1major+118minor)pagefaults 0swaps
241297594904
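Converting the cat timings above into bandwidth (byte count from wc -c, wall-clock from the elapsed field) gives a useful baseline for large sequential reads through the filesystem; a small sketch of the arithmetic:

```shell
# Throughput of the first 'time cat' run: bytes counted by wc -c divided
# by elapsed wall-clock seconds.
BYTES=241297594904   # wc -c output above
SECS=1791            # 29:50.79 elapsed, rounded to whole seconds
awk -v b="$BYTES" -v s="$SECS" 'BEGIN { printf "%.0f MB/s\n", b / s / 1e6 }'
# prints ~135 MB/s
```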
From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Tuesday, July 11, 2017 6:43 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Andreas Kretschmer <andreas@a-kretschmer.de>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Igor,
I reduced the value of random_page_cost to 4 but the read speed remains low.
Regarding effective_cache_size and shared_buffer, do you mean they should be both equal to 64GB?
Thanks for suggestions!
Charles
No, they should not be equal.
From the docs:
effective_cache_size (integer)
Sets the planner's assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used. When setting this parameter you should consider both PostgreSQL's shared buffers and the portion of the kernel's disk cache that will be used for PostgreSQL data files. Also, take into account the expected number of concurrent queries on different tables, since they will have to share the available space. This parameter has no effect on the size of shared memory allocated by PostgreSQL, nor does it reserve kernel disk cache; it is used only for estimation purposes. The system also does not assume data remains in the disk cache between queries. The default is 4 gigabytes (4GB).
So, I’d set shared_buffers at 24GB and effective_cache_size at 64GB.
Regards,
Igor
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Tuesday, July 11, 2017 10:34 AM
To: Charles Nadeau <charles.nadeau@gmail.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Also, maybe it’s time to look at execution plans (explain analyze) of specific slow queries, instead of trying to solve the problem “in general”.
Igor
From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Tuesday, July 11, 2017 11:25 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Igor,
The sum of effective_cache_size and shared_buffers will be higher than the physical memory I have. Is that OK?
Thanks!
Charles
Yes, that’s normal.
shared_buffers is the maximum that Postgres is allowed to allocate, while effective_cache_size is just a number the optimizer takes into account when creating an execution plan.
Igor
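Igor's distinction above can be applied directly. This is a minimal sketch, assuming a local superuser connection and the standard Ubuntu service name; the values are the ones suggested in this thread, not universal recommendations.

```shell
# effective_cache_size is a planner-only estimate: a reload is enough.
# shared_buffers allocates real shared memory: a restart is required.
psql -U postgres -c "ALTER SYSTEM SET shared_buffers = '24GB';"
psql -U postgres -c "ALTER SYSTEM SET effective_cache_size = '64GB';"
psql -U postgres -c "SELECT pg_reload_conf();"
# sudo systemctl restart postgresql@9.6-main   # needed for shared_buffers
```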
On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:
> *2x Intel Xeon E5550
> *72GB RAM
> *Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80% read/20% write) for Postgresql data only:
> The problem I have is very poor read. When I benchmark my array with fio I get random reads of about 200MB/s and 1100 IOPS and sequential reads of about 286MB/s and 21000 IOPS. But when I watch my queries using pg_activity, I get at best 4MB/s. Also using dstat I can see that iowait time is at about 25%. This problem is not query-dependent.

Stop right there. 1100 IOPS * 8kB = ~8MB/s raw, which might reasonably translate to 4MB/s to the client. 200MB/s of random reads on spinning media is simply not plausible; your benchmark is lying to you. Random reads on spinning media are absolutely going to be storage bound.

merlin
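Merlin's arithmetic is worth making explicit, since it frames the whole thread: the measured random IOPS times PostgreSQL's 8kB page size bounds the throughput an index-heavy workload can see. A quick sketch:

```shell
# Throughput ceiling implied by the measured random IOPS at
# PostgreSQL's 8kB page size.
iops=1100
page_bytes=8192
mb_per_s=$(( iops * page_bytes / 1024 / 1024 ))
echo "~${mb_per_s} MB/s raw"   # matches Merlin's ~8MB/s estimate
```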
On 07/11/2017 04:15 PM, Merlin Moncure wrote:
> On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
>> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:
>> *2x Intel Xeon E5550
>> *72GB RAM
>> *Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80% read/20% write) for Postgresql data only:
>> The problem I have is very poor read. When I benchmark my array with fio I get random reads of about 200MB/s and 1100 IOPS and sequential reads of about 286MB/s and 21000 IOPS. But when I watch my queries using pg_activity, I get at best 4MB/s. Also using dstat I can see that iowait time is at about 25%. This problem is not query-dependent.
>
> Stop right there. 1100 IOPS * 8kB = ~8MB/s raw, which might reasonably translate to 4MB/s to the client. 200MB/s of random reads on spinning media is simply not plausible;

Sure it is, if he had more than 4 disks ;)

But he also isn't going to get 1100 IOPS from four 10k disks. The average 10k disk is going to get around 130 IOPS. If he only has 4, there is no way he is getting 1100 IOPS. Using the above specs (4x146GB), the best he can reasonably hope for from the drives themselves is about 50MB/s; add in the 1GB FBWC and that is how he is getting those high IOPS numbers, but that is because of caching.

He may need to adjust his readahead as well as his kernel scheduler. At a minimum he should be able to saturate the drives without issue.

JD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:

#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4
Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:
Could you suggest another way to benchmark random reads?
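Since each PostgreSQL backend does buffered, synchronous 8kB reads one at a time, a fio run shaped like that answers the question more directly than the libaio/direct/256-depth script above. A sketch, with the filename and size as placeholders to adjust for your array:

```shell
# Buffered synchronous 8kB random reads at queue depth 1, roughly the
# IO pattern a single PostgreSQL backend generates.
fio --name=pgread --ioengine=psync --direct=0 \
    --bs=8k --iodepth=1 --numjobs=1 \
    --filename=/mnt/data/fio-test --size=4G \
    --readwrite=randread --runtime=60 --time_based
```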
Hmm - how are you measuring that sequential scan speed of 4MB/s? I'd recommend doing a very simple test, e.g. here's one on my workstation - 13 GB single table on 1 SATA drive - cold cache after reboot, sequential scan using Postgres 9.6.2:

bench=# EXPLAIN SELECT count(*) FROM pgbench_accounts;
QUERY PLAN
------------------------------------------------------------------------------------
Aggregate (cost=2889345.00..2889345.01 rows=1 width=8)
-> Seq Scan on pgbench_accounts (cost=0.00..2639345.00 rows=100000000 width=0)
(2 rows)

bench=# SELECT pg_relation_size('pgbench_accounts');
pg_relation_size
------------------
13429514240
(1 row)

bench=# SELECT count(*) FROM pgbench_accounts;
count
-----------
100000000
(1 row)

Time: 118884.277 ms

So doing the math, seq read speed is about 110MB/s (i.e. 13 GB in 120 sec). Sure enough, while I was running the query iostat showed:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 926.00 0.00 114.89 0.00 254.10 1.90 2.03 2.03 0.00 1.08 100.00

So it might be useful for us to see something like that from your system - note you need to check you really have flushed the cache, and that no other apps are using the db.

regards
Mark

On 12/07/17 00:46, Charles Nadeau wrote:
> After reducing random_page_cost to 4 and testing more, I can report that the aggregate read throughput for parallel sequential scan is about 90MB/s. However the throughput for sequential scan is still around 4MB/s.
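Mark's procedure boils down to three steps plus one division. The psql steps are shown as comments because they need a live database; the table and database names are his examples:

```shell
# 1. Flush the OS page cache (as root):
#      sync; echo 3 > /proc/sys/vm/drop_caches
# 2. Time a full sequential scan in psql:
#      \timing on
#      SELECT count(*) FROM pgbench_accounts;
# 3. Divide relation size by elapsed time:
seq_mb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1048576 }'
}
seq_mb_per_s 13429514240 118.884   # Mark's numbers give ~107.7 MB/s
```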
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Tuesday, July 11, 2017 10:34 AM
To: Charles Nadeau <charles.nadeau@gmail.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Also, maybe it’s time to look at execution plans (explain analyze) of specific slow queries, instead of trying to solve the problem “in general”.
Igor
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/read_ahead_kb
128
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
flows=# SET track_io_timing = on;
LOG: duration: 24.101 ms statement: SET track_io_timing = on;
SET
flows=# explain (analyze, timing off) SELECT DISTINCT
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport,
flows-# COUNT(*) AS conversation,
flows-# SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-# flowscompact,
flows-# mynetworks
flows-# WHERE
flows-# mynetworks.ipaddr >>= flowscompact.srcaddr
flows-# AND dstaddr IN
flows-# (
flows(# SELECT
flows(# dstaddr
flows(# FROM
flows(# dstexterne
flows(# )
flows-# GROUP BY
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport
flows-# ORDER BY
flows-# mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.3", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.4", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.5", size 639696896
LOG: duration: 2632108.352 ms statement: explain (analyze, timing off) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN (SELECT dstaddr FROM dstexterne) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=37762321.83..37762321.98 rows=50 width=52) (actual rows=50 loops=1)
-> Unique (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual rows=50 loops=1)
-> Sort (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual rows=50 loops=1)
Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
Sort Method: quicksort Memory: 563150kB
-> GroupAggregate (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual rows=4691734 loops=1)
Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
-> Sort (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual rows=81896988 loops=1)
Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
Sort Method: external merge Disk: 2721856kB
-> Gather (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual rows=81896988 loops=1)
Workers Planned: 9
Workers Launched: 9
-> Hash Semi Join (cost=19462936.00..37622883.23 rows=249324 width=20) (actual rows=8189699 loops=10)
Hash Cond: (flows.dstaddr = flows_1.dstaddr)
-> Nested Loop (cost=0.03..18159012.30 rows=249324 width=20) (actual rows=45499045 loops=10)
-> Parallel Seq Scan on flows (cost=0.00..16039759.79 rows=62330930 width=20) (actual rows=54155970 loops=10)
-> Index Only Scan using mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8) (actual rows=1 loops=541559704)
Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
Heap Fetches: 48679396
-> Hash (cost=19462896.74..19462896.74 rows=11210 width=4) (actual rows=3099798 loops=10)
Buckets: 4194304 (originally 16384) Batches: 1 (originally 1) Memory Usage: 141746kB
-> HashAggregate (cost=19462829.48..19462863.11 rows=11210 width=4) (actual rows=3099798 loops=10)
Group Key: flows_1.dstaddr
-> Nested Loop Anti Join (cost=0.12..19182620.78 rows=560417390 width=4) (actual rows=113420172 loops=10)
Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
Rows Removed by Join Filter: 453681377
-> Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1 (cost=0.12..9091067.70 rows=560978368 width=4) (actual rows=541559704 loops=10)
Heap Fetches: 91
-> Materialize (cost=0.00..1.02 rows=4 width=8) (actual rows=2 loops=5415597040)
-> Seq Scan on mynetworks mynetworks_1 (cost=0.00..1.01 rows=4 width=8) (actual rows=4 loops=10)
Planning time: 62.066 ms
Execution time: 2631923.716 ms
(33 rows)

flows=# explain (analyze, buffers) SELECT DISTINCT
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport,
flows-# COUNT(*) AS conversation,
flows-# SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-# flowscompact,
flows-# mynetworks
flows-# WHERE
flows-# mynetworks.ipaddr >>= flowscompact.srcaddr
flows-# AND dstaddr IN
flows-# (
flows(# SELECT
flows(# dstaddr
flows(# FROM
flows(# dstexterne
flows(# )
flows-# GROUP BY
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport
flows-# ORDER BY
flows-# mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG: duration: 2765020.327 ms statement: explain (analyze, buffers) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN (SELECT dstaddr FROM dstexterne) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Unique (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Sort (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
Sort Method: quicksort Memory: 563150kB
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> GroupAggregate (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Sort (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
Sort Method: external merge Disk: 2721856kB
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Gather (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
Workers Planned: 9
Workers Launched: 9
Buffers: shared hit=1116590559 read=15851133
I/O Timings: read=5323746.860
-> Hash Semi Join (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
Hash Cond: (flows.dstaddr = flows_1.dstaddr)
Buffers: shared hit=1116588309 read=15851133
I/O Timings: read=5323746.860
-> Nested Loop (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
Buffers: shared hit=996551813 read=15851133
I/O Timings: read=5323746.860
-> Parallel Seq Scan on flows (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
Buffers: shared hit=1634 read=15851133
I/O Timings: read=5323746.860
-> Index Only Scan using mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
Heap Fetches: 59971474
Buffers: shared hit=996550152
-> Hash (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
Buckets: 4194304 (originally 16384) Batches: 1 (originally 1) Memory Usage: 141746kB
Buffers: shared hit=120036496
-> HashAggregate (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
Group Key: flows_1.dstaddr
Buffers: shared hit=120036496
-> Nested Loop Anti Join (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
Rows Removed by Join Filter: 453681377
Buffers: shared hit=120036496
-> Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1 (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
Heap Fetches: 91
Buffers: shared hit=120036459
-> Materialize (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
Buffers: shared hit=10
-> Seq Scan on mynetworks mynetworks_1 (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
Buffers: shared hit=10
Planning time: 6.689 ms
Execution time: 2764860.853 ms
(58 rows)
On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> Jeff,
> I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
> #!/bin/bash
> #Random throughput
> echo "Random throughput"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
> #Random IOPS
> echo "Random IOPS"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
> #Sequential throughput
> echo "Sequential throughput"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
> #Sequential IOPS
> echo "Sequential IOPS"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

I don't think any of those are directly relevant to PostgreSQL, as it doesn't use direct IO, doesn't use libaio, and is rarely going to get anywhere near 256 iodepth. So the best they can do is put a theoretical ceiling on the performance. Also, random IO with a 4MB stride doesn't make any sense from a PostgreSQL perspective.

> Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%.

I would have expected it to do better than that. Maybe you should increase the kernel readahead setting. I've found the default to be much too small. But it doesn't make much difference to you, as you appear to be doing random IO in your queries, not sequential.

> Could you suggest another way to benchmark random reads?

Your 1100 IOPS times 8kB block size gives about 8MB/s of throughput, which is close to what you report. So I would instead focus on tuning your actual queries.

You say the problem is not query-dependent, but I think that just means all the queries you looked at are similar. If you looked at a query that can't use indexes, like count(unindexed_column) from biggest_table, you would find it doing much more IO than 4MB/s.

Can you pick the simplest query you actually care about, and post both an "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it? (Preferably turning "track_io_timing" on first.)

One other question I had: you said you had "2x Intel Xeon E5550", which should be 8 CPUs (or 16, if the hyperthreads are reported as separate CPUs). But you also said: "Also using dstat I can see that iowait time is at about 25%". Usually if there is only one thing going on on the server, then IOWAIT won't be more than the reciprocal of #CPU. Is the server busy doing other stuff at the same time you are benchmarking it?

Cheers,
Jeff
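Jeff's iowait rule of thumb is easy to check numerically. A sketch, with the core count an assumption based on "2x Intel Xeon E5550" without hyperthreading:

```shell
# With a single backend blocked on IO, iowait should top out near
# 100% divided by the CPU count.
ncpus=8
iowait_ceiling=$(( 100 / ncpus ))
echo "~${iowait_ceiling}% expected; 25% observed suggests more than one waiter"
```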
Although probably not the root cause, at the least I would set up hugepages ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256?).

On Mon, Jul 10, 2017 at 10:03 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:
From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Wednesday, July 12, 2017 3:21 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Igor,
I set shared_buffers to 24GB and effective_cache_size to 64GB, and I can see that the queries are faster because the indexes are used more often. Knowing I have 72GB of RAM and the server is exclusively dedicated to PostgreSQL, what could be the maximum value for effective_cache_size?
Thanks!
Charles
64GB for effective_cache_size should be good enough; adding a couple more GB wouldn't change much.
Igor
Rick,

Should the number of pages always be correlated to the VmPeak of the postmaster, or could it be set to reflect shared_buffers or another setting?

Thanks!

Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
> Although probably not the root cause, at the least I would set up hugepages ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256?).
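To answer the question concretely: the 9.6 docs size huge pages from the postmaster's VmPeak (which covers shared_buffers plus the other mappings), not from shared_buffers alone. A sketch of that arithmetic; the 2MB huge-page size and the sample VmPeak value are assumptions, so check Hugepagesize in /proc/meminfo on your system:

```shell
# Pages needed = ceil(VmPeak_kB / hugepage_kB). Find VmPeak with:
#   grep ^VmPeak /proc/$(head -1 $PGDATA/postmaster.pid)/status
hugepages_needed() {
    vmpeak_kb=$1
    hp_kb=2048
    echo $(( (vmpeak_kb + hp_kb - 1) / hp_kb ))
}
hugepages_needed 35000000   # hypothetical 35GB VmPeak -> 17090 pages
# then: sysctl -w vm.nr_hugepages=17090  and huge_pages=on in postgresql.conf
```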
I use noop as the scheduler because it is better to let the RAID controller re-arrange the IO operations before they reach the disks. Read ahead is set to 128:

charles@hpdl380g6:~$ cat /sys/block/sdc/queue/read_ahead_kb
128
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
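JD suggested raising readahead, and at 128kB it is well below the array's 256kB full stripe. A sketch of raising it, with the value an illustration rather than a recommendation from the thread:

```shell
# Raise readahead on the data array to 4MB (value in kB via sysfs,
# or the same thing in 512-byte sectors via blockdev).
echo 4096 | sudo tee /sys/block/sdc/queue/read_ahead_kb
sudo blockdev --setra 8192 /dev/sdc
```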
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Charles Nadeau
Sent: Wednesday, July 12, 2017 6:05 AM
To: Jeff Janes <jeff.janes@gmail.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
flows=# explain (analyze, buffers) SELECT DISTINCT
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport,
flows-# COUNT(*) AS conversation,
flows-# SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-# flowscompact,
flows-# mynetworks
flows-# WHERE
flows-# mynetworks.ipaddr >>= flowscompact.srcaddr
flows-# AND dstaddr IN
flows-# (
flows(# SELECT
flows(# dstaddr
flows(# FROM
flows(# dstexterne
flows(# )
flows-# GROUP BY
flows-# srcaddr,
flows-# dstaddr,
flows-# dstport
flows-# ORDER BY
flows-# mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG: duration: 2765020.327 ms statement: explain (analyze, buffers) SELECT DISTINCT
srcaddr,
dstaddr,
dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM
flowscompact,
mynetworks
WHERE
mynetworks.ipaddr >>= flowscompact.srcaddr
AND dstaddr IN
(
SELECT
dstaddr
FROM
dstexterne
)
GROUP BY
srcaddr,
dstaddr,
dstport
ORDER BY
mbytes DESC LIMIT 50;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Unique (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Sort (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
Sort Method: quicksort Memory: 563150kB
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> GroupAggregate (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Sort (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
Sort Method: external merge Disk: 2721856kB
Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
I/O Timings: read=5323746.860
-> Gather (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
Workers Planned: 9
Workers Launched: 9
Buffers: shared hit=1116590559 read=15851133
I/O Timings: read=5323746.860
-> Hash Semi Join (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
Hash Cond: (flows.dstaddr = flows_1.dstaddr)
Buffers: shared hit=1116588309 read=15851133
I/O Timings: read=5323746.860
-> Nested Loop (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
Buffers: shared hit=996551813 read=15851133
I/O Timings: read=5323746.860
-> Parallel Seq Scan on flows (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
Buffers: shared hit=1634 read=15851133
I/O Timings: read=5323746.860
-> Index Only Scan using mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
Heap Fetches: 59971474
Buffers: shared hit=996550152
-> Hash (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
Buckets: 4194304 (originally 16384) Batches: 1 (originally 1) Memory Usage: 141746kB
Buffers: shared hit=120036496
-> HashAggregate (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
Group Key: flows_1.dstaddr
Buffers: shared hit=120036496
-> Nested Loop Anti Join (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
Rows Removed by Join Filter: 453681377
Buffers: shared hit=120036496
-> Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1 (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
Heap Fetches: 91
Buffers: shared hit=120036459
-> Materialize (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
Buffers: shared hit=10
-> Seq Scan on mynetworks mynetworks_1 (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
Buffers: shared hit=10
Planning time: 6.689 ms
Execution time: 2764860.853 ms
(58 rows)
Regarding "Also using dstat I can see that iowait time is at about 25%", I don't think the server was doing anything else. If it is important, I can repeat the benchmarks.
Thanks!
Charles
Charles,
In your original posting I couldn’t find what value you set for temp_buffers.
Considering you have plenty of RAM, try setting temp_buffers='6GB' and then running 'explain (analyze, buffers) select ...' in the same session. This should alleviate the "disk sort" problem.
Also, could you post the structure of the flowscompact, mynetworks, and dstexterne tables, with all the indexes and the number of rows? Actually, are they all tables, or are some of them views?
Igor
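An aside on the plan above: the index-only scan on mynetworks reports 59,971,474 heap fetches over 541,559,704 index rows, which suggests a stale visibility map; a VACUUM of the underlying table usually brings that down (the later run showing "Heap Fetches: 0" is consistent with this). A quick back-of-envelope from the plan's own figures:

```shell
# Figures copied from the EXPLAIN output above.
HEAP_FETCHES=59971474
INDEX_ROWS=541559704
PCT=$(( HEAP_FETCHES * 100 / INDEX_ROWS ))
echo "${PCT}% of index-only lookups still visited the heap"
```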
Sorry, I misstated the parameter to change.
It is work_mem (not temp_buffers) you should try to increase to 6GB.
Igor
Jeff,
Here are the 2 EXPLAINs for one of my simplest queries:
flows=# \timing
Timing is on.
flows=# explain select count(*) from flows;
                                      QUERY PLAN
-----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
   ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
         Workers Planned: 1
         ->  Partial Aggregate  (cost=17213914.07..17213914.07 rows=1 width=8)
               ->  Parallel Seq Scan on flows  (cost=0.00..17019464.49 rows=388899162 width=0)
(5 rows)
Time: 171.835 ms
flows=# select pg_relation_size('flows');
 pg_relation_size
------------------
     129865867264
(1 row)
Time: 57.157 ms
flows=# select count(*) from flows;
LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
   count
-----------
 589831190
(1 row)
Time: 625546.662 ms
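From those numbers one can estimate the apparent scan rate. Note it mixes shared-buffer hits with real disk reads, so it is an upper bound on what the disks themselves delivered, not a pure disk figure:

```shell
# Figures from the psql session above.
BYTES=129865867264        # pg_relation_size('flows')
MS=625546                 # select count(*) duration in milliseconds
MIBPS=$(( BYTES * 1000 / MS / 1048576 ))
echo "apparent scan rate: ${MIBPS} MiB/s"
```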
*Using hugepages
*Increasing effective_io_concurrency to 256
*Reducing random_page_cost from 22 to 4
*Reducing min_parallel_relation_size to 512kB to have more workers when doing sequential parallel scan of my biggest table
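For reference, those changes correspond to postgresql.conf entries like the following (values as stated in the thread; min_parallel_relation_size is the 9.6 name of the setting, renamed min_parallel_table_scan_size in later releases):

```
huge_pages = try                     # also needs vm.nr_hugepages set at the OS level
effective_io_concurrency = 256
random_page_cost = 4
min_parallel_relation_size = 512kB
```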
Hmm - how are you measuring that sequential scan speed of 4MB/s? I'd recommend doing a very simple test, e.g. here's one on my workstation - a 13 GB single table on 1 SATA drive - cold cache after reboot, sequential scan using Postgres 9.6.2:
bench=# EXPLAIN SELECT count(*) FROM pgbench_accounts;
QUERY PLAN
----------------------------------------------------------------------------------
Aggregate (cost=2889345.00..2889345.01 rows=1 width=8)
-> Seq Scan on pgbench_accounts (cost=0.00..2639345.00 rows=100000000 width=0)
(2 rows)
bench=# SELECT pg_relation_size('pgbench_accounts');
pg_relation_size
------------------
13429514240
(1 row)
bench=# SELECT count(*) FROM pgbench_accounts;
count
-----------
100000000
(1 row)
Time: 118884.277 ms
So doing the math, seq read speed is about 110MB/s (i.e. 13 GB in 120 sec). Sure enough, while I was running the query iostat showed:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 926.00 0.00 114.89 0.00 254.10 1.90 2.03 2.03 0.00 1.08 100.00
So it might be useful for us to see something like that from your system - note you need to check you really have flushed the cache, and that no other apps are using the db.
regards
Mark
On 12/07/17 00:46, Charles Nadeau wrote:
After reducing random_page_cost to 4 and testing more, I can report that the aggregate read throughput for parallel sequential scan is about 90MB/s. However the throughput for sequential scan is still around 4MB/s.
http://charlesnadeau.blogspot.com/
http://charlesnadeau.blogspot.com/
From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Igor,
Initially temp_buffers was left at its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?
Here is the explain you requested with work_mem set to 6GB:
flows=# set work_mem='6GB';
SET
flows=# explain (analyze, buffers) SELECT DISTINCT
srcaddr,
dstaddr,
dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM
flowscompact,
mynetworks
WHERE
mynetworks.ipaddr >>= flowscompact.srcaddr
AND dstaddr IN
(
SELECT
dstaddr
FROM
dstexterne
)
GROUP BY
srcaddr,
dstaddr,
dstport
ORDER BY
mbytes DESC LIMIT 50;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)
Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
-> Unique (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)
Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
-> Sort (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)
Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
Sort Method: quicksort Memory: 654395kB
Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
-> GroupAggregate (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)
Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
-> Sort (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)
Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
Sort Method: external merge Disk: 3049216kB
Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
-> Gather (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)
Workers Planned: 12
Workers Launched: 12
Buffers: shared hit=728798037 read=82974833
-> Hash Semi Join (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)
Hash Cond: (flows.dstaddr = flows_1.dstaddr)
Buffers: shared hit=728795193 read=82974833
-> Nested Loop (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)
Buffers: shared hit=590692229 read=14991777
-> Parallel Seq Scan on flows (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)
Buffers: shared hit=860990 read=14991777
-> Index Only Scan using mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)
Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
Heap Fetches: 0
Buffers: shared hit=589831203
-> Hash (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)
Buckets: 4194304 (originally 16384) Batches: 1 (originally 1) Memory Usage: 166486kB
Buffers: shared hit=138102964 read=67983056
-> HashAggregate (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)
Group Key: flows_1.dstaddr
Buffers: shared hit=138102964 read=67983056
-> Nested Loop Anti Join (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)
Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
Rows Removed by Join Filter: 503353617
Buffers: shared hit=138102964 read=67983056
-> Seq Scan on flows flows_1 (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)
Buffers: shared hit=138102915 read=67983056
-> Materialize (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)
Buffers: shared hit=13
-> Seq Scan on mynetworks mynetworks_1 (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)
Buffers: shared hit=13
Planning time: 0.941 ms
Execution time: 2228345.171 ms
(48 rows)
With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query.
flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (which has about 590 million rows); it keeps only the most often used fields.
flows=# \d+ flowscompact;
View "public.flowscompact"
Column | Type | Modifiers | Storage | Description
-----------+--------------------------+-----------+---------+-------------
flow_id | bigint | | plain |
sysuptime | bigint | | plain |
exaddr | ip4 | | plain |
dpkts | integer | | plain |
doctets | bigint | | plain |
first | bigint | | plain |
last | bigint | | plain |
srcaddr | ip4 | | plain |
dstaddr | ip4 | | plain |
srcport | integer | | plain |
dstport | integer | | plain |
prot | smallint | | plain |
tos | smallint | | plain |
tcp_flags | smallint | | plain |
timestamp | timestamp with time zone | | plain |
View definition:
SELECT flowstimestamp.flow_id,
flowstimestamp.sysuptime,
flowstimestamp.exaddr,
flowstimestamp.dpkts,
flowstimestamp.doctets,
flowstimestamp.first,
flowstimestamp.last,
flowstimestamp.srcaddr,
flowstimestamp.dstaddr,
flowstimestamp.srcport,
flowstimestamp.dstport,
flowstimestamp.prot,
flowstimestamp.tos,
flowstimestamp.tcp_flags,
flowstimestamp."timestamp"
FROM flowstimestamp;
mynetworks is a table having one column and 4 rows; it contains a list of our networks:
flows=# select * from mynetworks;
ipaddr
----------------
(4 rows)
flows=# \d+ mynetworks;
Table "public.mynetworks"
Column | Type | Modifiers | Storage | Stats target | Description
--------+------+-----------+---------+--------------+-------------
ipaddr | ip4r | | plain | |
Indexes:
"mynetworks_ipaddr_idx" gist (ipaddr)
dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.
flows=# \d+ dstexterne;
View "public.dstexterne"
Column | Type | Modifiers | Storage | Description
---------+------+-----------+---------+-------------
dstaddr | ip4 | | plain |
View definition:
SELECT DISTINCT flowscompact.dstaddr
FROM flowscompact
LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r
WHERE mynetworks.ipaddr IS NULL;
Thanks!
Charles
Charles,
Don’t change temp_buffers.
Try to increase work_mem even more, say work_mem='12GB', because it's still using disk for sorting (starting around the 20th minute, as you noticed).
See if this:
"Sort Method: external merge Disk: 3049216kB"
goes away.
Igor
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Friday, July 14, 2017 3:13 PM
To: Charles Nadeau <charles.nadeau@gmail.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Charles,
Also, let’s try to simplify your query and see if it performs better.
You are grouping by srcaddr, dstaddr, and dstport, which makes DISTINCT unnecessary.
And after simplifying WHERE clause (let me know if the result is not what you want), the query looks like:
SELECT srcaddr, dstaddr, dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM flowscompact
WHERE srcaddr IN (SELECT ipaddr FROM mynetworks)
AND dstaddr NOT IN (SELECT ipaddr FROM mynetworks)
GROUP BY srcaddr, dstaddr, dstport
ORDER BY mbytes DESC
LIMIT 50;
Now, you didn't provide the definition of the flowstimestamp table.
If this table doesn't have an index on (srcaddr, dstaddr, dstport), creating one should help (I think).
Igor
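A sketch of that index suggestion (the index name is made up, and it targets the flows base table on the assumption that flowstimestamp is itself a view over it; on 9.6, CONCURRENTLY avoids blocking the every-10-days ingest):

```sql
-- Hypothetical composite index matching the GROUP BY / sort keys.
CREATE INDEX CONCURRENTLY flows_src_dst_dstport_idx
    ON flows (srcaddr, dstaddr, dstport);
```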
Thinking about this a bit more - if somewhat more blazing performance is needed, then this could be achieved via losing the RAID card and spinning disks altogether and buying 1 of the NVME or SATA solid state products: e.g
- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and 200K IOPS)
- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)
The Samsung needs an M.2 port on the mobo (but most should have 'em - and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit more expensive compared to the Samsung, and is slower but has a longer lifetime. However for your workload the Sammy is probably fine.
regards
Mark
On 15/07/17 11:09, Mark Kirkwood wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0).
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0). You should be able to get something like 5 * (single disk speed), i.e. about 500MB/s.
Might be worth increasing device read ahead (more than you have already). Some of these so-called 'smart' RAID cards need to be hit over the head before they will perform. E.g.: I believe you have it set to 128 - I'd try 4096 or even 16384 (in the past I've used those settings on some extremely stupid cards that refused to max out their disks' known speeds).
Also worth investigating is RAID stripe size - for DW work it makes sense for it to be reasonably big (256K to 1M), which again will help the speed of sequential scans.
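For reference, blockdev's read-ahead value is in 512-byte sectors, so the suggested settings translate as below (the device path in the comment is just an example):

```shell
# blockdev --setra takes 512-byte sectors; convert the suggested values to MiB.
for ra_sectors in 4096 16384; do
    echo "--setra ${ra_sectors} = $((ra_sectors * 512 / 1024 / 1024)) MiB read-ahead"
done
# To apply on the data device (example): sudo blockdev --setra 16384 /dev/sdc
```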
Cheers
Mark
P.S. I used to work for Greenplum, so this type of problem came up a lot :-). The best cards were the LSI and Areca!
On 15/07/17 02:09, Charles Nadeau wrote:
Mark,
First I must say that I changed my disks configuration from 4 disks in RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during the last ingest of data.
Here is the result test you asked. It was done with a cold cache:
flows=# \timing
Timing is on.
flows=# explain select count(*) from flows;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Finalize Aggregate (cost=17214914.09..17214914.09 rows=1 width=8)
-> Gather (cost=17214914.07..17214914.09 rows=1 width=8)
Workers Planned: 1
-> Partial Aggregate (cost=17213914.07..17213914.07 rows=1 width=8)
-> Parallel Seq Scan on flows (cost=0.00..17019464.49 rows=388899162 width=0)
(5 rows)
Time: 171.835 ms
flows=# select pg_relation_size('flows');
pg_relation_size
------------------
129865867264
(1 row)
Time: 57.157 ms
flows=# select count(*) from flows;
LOG: duration: 625546.522 ms statement: select count(*) from flows;
count
-----------
589831190
(1 row)
Time: 625546.662 ms
The throughput reported by PostgreSQL is almost 198MB/s, and the throughput as measured by dstat during the query execution was between 25 and 299MB/s. It is much better than what I had before! The I/O wait was about 12% all through the query. One thing I noticed is the discrepancy between the read throughput reported by pg_activity and the one reported by dstat: pg_activity always reports a value lower than dstat.
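The 198MB/s figure checks out against the numbers above (a quick sanity check; pg_relation_size is in bytes, and the "MB" here is really MiB):

```python
relation_bytes = 129_865_867_264   # pg_relation_size('flows') from the session above
elapsed_s = 625.546662             # \timing result for select count(*)

# Sequential-scan throughput in MiB/s: bytes scanned / seconds elapsed.
mib_per_s = relation_bytes / (1024 * 1024) / elapsed_s
print(f"{mib_per_s:.0f} MiB/s")  # -> 198 MiB/s
```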
Besides the change of disk configuration, here is what contributed the most to the improvement of the performance so far:
Using Hugepage
Increasing effective_io_concurrency to 256
Reducing random_page_cost from 22 to 4
Reducing min_parallel_relation_size to 512kB to have more workers when doing sequential parallel scan of my biggest table
Thanks for recommending this test, I now know what the real throughput should be!
Charles
On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Hmm - how are you measuring that sequential scan speed of 4MB/s?
I'd recommend doing a very simple test e.g, here's one on my
workstation - 13 GB single table on 1 SATA drive - cold cache
after reboot, sequential scan using Postgres 9.6.2:
bench=# EXPLAIN SELECT count(*) FROM pgbench_accounts;
QUERY PLAN
------------------------------------------------------------------------------------
Aggregate (cost=2889345.00..2889345.01 rows=1 width=8)
-> Seq Scan on pgbench_accounts (cost=0.00..2639345.00 rows=100000000 width=0)
(2 rows)
bench=# SELECT pg_relation_size('pgbench_accounts');
pg_relation_size
------------------
13429514240
(1 row)
bench=# SELECT count(*) FROM pgbench_accounts;
count
-----------
100000000
(1 row)
Time: 118884.277 ms
So doing the math, seq read speed is about 110MB/s (i.e. 13 GB in 120 sec). Sure enough, while I was running the query iostat showed:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  926.00    0.00   114.89     0.00   254.10     1.90    2.03    2.03    0.00   1.08 100.00
So might be useful for us to see something like that from your
system - note you need to check you really have flushed the cache,
and that no other apps are using the db.
regards
Mark
On 12/07/17 00:46, Charles Nadeau wrote:
After reducing random_page_cost to 4 and testing more, I can
report that the aggregate read throughput for parallel
sequential scan is about 90MB/s. However the throughput for
sequential scan is still around 4MB/s.
--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/
On Sat, Jul 15, 2017 at 11:53 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> Mark,
>
> I increased the read ahead to 16384 and it doesn't improve performance. My
> RAID 0 use a stripe size of 256k, the maximum size supported by the
> controller.
Are your queries still spilling to disk for sorts? If this is the case, and they're just too big to fit in memory, then you need to move your temp space, where sorts happen, onto another disk array that isn't your poor overworked raid-10 array. Contention between sorts and reads can kill performance quick, esp on spinning rust.
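One way to act on that suggestion is Postgres's temp_tablespaces setting, which redirects sort/temp files to a tablespace on a different array (a sketch; the tablespace name and mount path below are made up):

```sql
-- Hypothetical: put temp files on a separate spindle/array mounted at /mnt/temp.
CREATE TABLESPACE temp_space LOCATION '/mnt/temp/pgtemp';
ALTER SYSTEM SET temp_tablespaces = 'temp_space';
SELECT pg_reload_conf();
```

With that in place, external merge sorts no longer compete with table reads on the data array.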
Right, that is a bit of a show stopper for those SSD (the Intel needs SATA 6Gb/s and the Sammy's need PCIe 3.0 to perform to their rated specs).
regards
Mark
On 16/07/17 04:12, Charles Nadeau wrote:
> Mark,
>
> The server is a . It doesn't really work with SATA drives. And when
> you find one that is compatible, it is only used at 3Gb/s with a
> maximum of 50000 IOPS (a well known characteristic of the HP P410i SAS
> RAID controller). I am looking at getting a Kingston Digital HyperX
> Predator that I could use in one of the PCIe 2.0 x4 slots. However I am
> worried about "thermal runaway", i.e. when the server can't get a
> temperature reading from a PCIe card, it spins the fans at full speed
> to protect the server against high temperature. The machine being next
> to my desk, I worry about the deafening noise it will create.
> Thanks!
>
> Charles
From: pgsql-performance-owner@
postgresql.org [mailto:pgsql-performance- owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Friday, July 14, 2017 3:13 PM
To: Charles Nadeau <charles.nadeau@gmail.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent
Igor,
Initially temp_buffers was left at its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?
Here is the explain you requested with work_mem set to 6GB:
flows=# set work_mem='6GB';
SET
flows=# explain (analyze, buffers) SELECT DISTINCT
srcaddr,
dstaddr,
dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM
flowscompact,
mynetworks
WHERE
mynetworks.ipaddr >>= flowscompact.srcaddr
AND dstaddr IN
(
SELECT
dstaddr
FROM
dstexterne
)
GROUP BY
srcaddr,
dstaddr,
dstport
ORDER BY
mbytes DESC LIMIT 50;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)
   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)
         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 654395kB
               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 3049216kB
                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)
                                 Workers Planned: 12
                                 Workers Launched: 12
                                 Buffers: shared hit=728798037 read=82974833
                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=728795193 read=82974833
                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)
                                             Buffers: shared hit=590692229 read=14991777
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)
                                                   Buffers: shared hit=860990 read=14991777
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 0
                                                   Buffers: shared hit=589831203
                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB
                                             Buffers: shared hit=138102964 read=67983056
                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=138102964 read=67983056
                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 503353617
                                                         Buffers: shared hit=138102964 read=67983056
                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)
                                                               Buffers: shared hit=138102915 read=67983056
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)
                                                               Buffers: shared hit=13
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)
                                                                     Buffers: shared hit=13
 Planning time: 0.941 ms
 Execution time: 2228345.171 ms
(48 rows)
With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query.
flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (about 590 million rows); it only keeps the most often used fields.
flows=# \d+ flowscompact;
View "public.flowscompact"
  Column   |           Type           | Modifiers | Storage | Description
-----------+--------------------------+-----------+---------+-------------
 flow_id   | bigint                   |           | plain   |
 sysuptime | bigint                   |           | plain   |
 exaddr    | ip4                      |           | plain   |
 dpkts     | integer                  |           | plain   |
 doctets   | bigint                   |           | plain   |
 first     | bigint                   |           | plain   |
 last      | bigint                   |           | plain   |
 srcaddr   | ip4                      |           | plain   |
 dstaddr   | ip4                      |           | plain   |
 srcport   | integer                  |           | plain   |
 dstport   | integer                  |           | plain   |
 prot      | smallint                 |           | plain   |
 tos       | smallint                 |           | plain   |
 tcp_flags | smallint                 |           | plain   |
 timestamp | timestamp with time zone |           | plain   |
View definition:
SELECT flowstimestamp.flow_id,
flowstimestamp.sysuptime,
flowstimestamp.exaddr,
flowstimestamp.dpkts,
flowstimestamp.doctets,
flowstimestamp.first,
flowstimestamp.last,
flowstimestamp.srcaddr,
flowstimestamp.dstaddr,
flowstimestamp.srcport,
flowstimestamp.dstport,
flowstimestamp.prot,
flowstimestamp.tos,
flowstimestamp.tcp_flags,
flowstimestamp."timestamp"
FROM flowstimestamp;
mynetworks is a table having one column and 4 rows; it contains a list of our networks:
flows=# select * from mynetworks;
ipaddr
----------------
(4 rows)
flows=# \d+ mynetworks;
Table "public.mynetworks"
 Column | Type | Modifiers | Storage | Stats target | Description
--------+------+-----------+---------+--------------+-------------
 ipaddr | ip4r |           | plain   |              |
Indexes:
"mynetworks_ipaddr_idx" gist (ipaddr)
dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.
flows=# \d+ dstexterne;
View "public.dstexterne"
 Column  | Type | Modifiers | Storage | Description
---------+------+-----------+---------+-------------
 dstaddr | ip4  |           | plain   |
View definition:
SELECT DISTINCT flowscompact.dstaddr
FROM flowscompact
LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r
WHERE mynetworks.ipaddr IS NULL;
Thanks!
Charles
Charles,
Also, let’s try to simplify your query and see if it performs better.
You are grouping by srcaddr, dstaddr, dstport, which makes DISTINCT unnecessary.
And after simplifying WHERE clause (let me know if the result is not what you want), the query looks like:
SELECT srcaddr, dstaddr, dstport,
COUNT(*) AS conversation,
SUM(doctets) / 1024 / 1024 AS mbytes
FROM flowscompact
WHERE srcaddr IN (SELECT ipaddr FROM mynetworks)
AND dstaddr NOT IN (SELECT ipaddr FROM mynetworks)
GROUP BY srcaddr, dstaddr, dstport
ORDER BY mbytes DESC
LIMIT 50;
Now, you didn’t provide the definition of the flowstimestamp table.
If this table doesn’t have an index on (srcaddr, dstaddr, dstport), creating one should help (I think).
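If flowscompact ultimately reads from the flows table (as the plans in this thread suggest), the index Igor proposes could be sketched as follows (the index name is illustrative):

```sql
-- CONCURRENTLY avoids blocking writes on a ~590M-row table,
-- at the cost of a slower build.
CREATE INDEX CONCURRENTLY flows_srcaddr_dstaddr_dstport_idx
    ON flows (srcaddr, dstaddr, dstport);
```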
Igor
http://charlesnadeau.blogspot.com/
On Fri, Jul 14, 2017 at 12:34 PM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Workers Planned: 12
> Workers Launched: 12
> Buffers: shared hit=728798037 read=82974833
> -> Hash Semi Join
> (cost=30059688.07..47951761.31 rows=220376 width=20) (actual
> time=1268845.181..2007864.725 rows=7057357 loops=13)
> Hash Cond: (flows.dstaddr =
> flows_1.dstaddr)
> Buffers: shared hit=728795193
> read=82974833
> -> Nested Loop
> (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283
> rows=37910370 loops=13)
> Buffers: shared hit=590692229
> read=14991777
> -> Parallel Seq Scan on flows
> (cost=0.00..16018049.14 rows=55094048 width=20) (actual
> time=0.152..566179.117 rows=45371630 loops=13)
> Buffers: shared
> hit=860990 read=14991777
> -> Index Only Scan using
> mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8)
> (actual time=0.002..0.002 rows=1 loops=589831190)
> Index Cond: (ipaddr >>=
> (flows.srcaddr)::ip4r)
> Heap Fetches: 0
> Buffers: shared
> hit=589831203
12 workers on a parallel sequential scan on a RAID-10 volume of
rotating disks may not be a good idea.
Have you measured average request size and average wait times with iostat?
Run "iostat -x -m -d 60" while running the query and copy a few
relevant lines (or attach the whole thing). I suspect 12 parallel
sequential scans are degrading your array's performance to random I/O
performance, and that explains the 10MB/s very well (a rotating disk
will give you about 3-4MB/s at random I/O, and you've got 2 mirrors on
that array).
You could try setting the max_parallel_workers_per_gather to 2, which
should be the optimum allocation for your I/O layout.
You might also want to test switching to the deadline scheduler. While
the controller may get more aggregate throughput rearranging your I/O
requests, high I/O latency will severely reduce postgres' ability to
saturate the I/O system itself, and deadline tends to minimize
latency. I've had good results in the past using deadline, but take
this suggestion with a grain of salt, YMMV.
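Both suggestions can be tried without a restart; a sketch of the commands (the device name sdc is taken from this thread, and the scheduler change does not persist across reboots):

```shell
# Limit parallel workers per Gather node
psql -d flows -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 2;"
psql -d flows -c "SELECT pg_reload_conf();"
# Switch the data array's I/O scheduler to deadline
echo deadline | sudo tee /sys/block/sdc/queue/scheduler
cat /sys/block/sdc/queue/scheduler   # active scheduler is shown in brackets
```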
http://charlesnadeau.blogspot.com/
On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> Claudio,
>
> At one moment during the query, there is a write storm to the swap drive
> (a bit like this case:
> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> I can hardly explain it as there is plenty of memory on this server.

That sounds a lot like NUMA zone_reclaim issues:

https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com
On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> <charles.nadeau@gmail.com> wrote:
>> Claudio,
>>
>> At one moment during the query, there is a write storm to the swap drive
>> (a bit like this case:
>> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
>> I can hardly explain it as there is plenty of memory on this server.
>
> That sounds a lot like NUMA zone_reclaim issues:
>
> https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com

I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
eerily similar.
On Tue, Jul 18, 2017 at 3:20 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
> Claudio,
>
> Find attached the iostat measured while redoing the query above
> (iostat1.txt). sda holds my temp directory (noop i/o scheduler), sdb the
> swap partition (cfq i/o scheduler) only and sdc (5 disks RAID0, noop i/o
> scheduler) holds the data. I didn't pay attention to the load caused by 12
> parallel scans as I thought the RAID card would be smart enough to
> re-arrange the read requests optimally regardless of the load. At one moment
> during the query, there is a write storm to the swap drive (a bit like this
> case:
> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).

My experience from that case (and a few more) has led me to believe that
Linux database servers with plenty of memory should have their swap turned
off. The Linux kernel works hard to swap out little-used memory to make more
space for caching active data. The problem is that whatever decides to swap
stuff out gets stupid when presented with 512GB RAM and starts swapping out
things like SysV shared_buffers etc.

Here's the thing: either your memory is big enough to buffer your whole data
set, so nothing should get swapped out to make room for caching, OR your
dataset is much bigger than memory, in which case making more room gains
very little if it comes at the cost of waiting for stuff you need to get
read back in.

Linux servers should also have zone reclaim turned off, and THP disabled.

Try running "sudo swapoff -a" and see if it gets rid of your swap storms.
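The checks and the change suggested above, as commands (run as root; swapoff can take a while if swap is heavily used):

```shell
swapon --show                                    # what swap devices are active
vmstat 5                                         # watch the si/so columns for swap storms
swapoff -a                                       # disable all swap
cat /proc/sys/vm/zone_reclaim_mode               # should print 0
cat /sys/kernel/mm/transparent_hugepage/enabled  # [never] is the usual advice for DB hosts
```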
On Tue, Jul 18, 2017 at 02:13:58PM -0300, Claudio Freire wrote:
> On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> > <charles.nadeau@gmail.com> wrote:
> >> Claudio,
> >>
> >> At one moment
> >> during the query, there is a write storm to the swap drive (a bit like this
> >> case:
> >> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> >> I can hardly explain it as there is plenty of memory on this server.
> >
> > That sounds a lot like NUMA zone_reclaim issues:
> >
> > https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com
>
> I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
> eerily similar.

Did you look at disabling KSM and/or THP ?

sudo sh -c 'echo 2 >/sys/kernel/mm/ksm/run'
https://www.postgresql.org/message-id/20170524155855.GH31097%40telsasoft.com
https://www.postgresql.org/message-id/CANQNgOrD02f8mR3Y8Pi=zFsoL14RqNQA8hwz1r4rSnDLr1b2Cw@mail.gmail.com
https://www.postgresql.org/message-id/CAHyXU0y9hviyKWvQZxX5UWfH9M2LYvwvAOPQ_DUPva2b71t12g%40mail.gmail.com
https://www.postgresql.org/message-id/20130716195834.8fe5c79249cb2ff0d4270b3e@yahoo.es
https://www.postgresql.org/message-id/CAE_gQfW3dBiELcOppYN6v%3D8%2B%2BpEeywD7iXGw-OT3doB8SXO4_A%40mail.gmail.com
https://www.postgresql.org/message-id/flat/1436268563235-5856914.post%40n5.nabble.com#1436268563235-5856914.post@n5.nabble.com
https://www.postgresql.org/message-id/CAL_0b1tJOZCx3Lo3Eve1RqGaT%2BJJ_Q7w4pkJ87WfWwXbTugnxw@mail.gmail.com
https://www.postgresql.org/message-id/556E2068.7070007@vuole.me
https://www.postgresql.org/message-id/1415981309.90631.YahooMailNeo%40web133205.mail.ir2.yahoo.com
https://www.postgresql.org/message-id/CAHyXU0yXYpCXN4%3D81ZDRQu-oGzrcq2qNAXDpyz4oiQPPAGk4ew%40mail.gmail.com
https://www.pythian.com/blog/performance-tuning-hugepages-in-linux/
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
Justin
http://charlesnadeau.blogspot.com/
Thinking about this a bit more - if somewhat more blazing performance is needed, then this could be achieved via losing the RAID card and spinning disks altogether and buying one of the NVMe or SATA solid state products, e.g.:
- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and 200K IOPS)
- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)
The Samsung needs an M.2 port on the mobo (but most should have 'em - and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit more expensive compared to the Samsung, and is slower but has a longer lifetime. However for your workload the Sammy is probably fine.
regards
Mark
On 15/07/17 11:09, Mark Kirkwood wrote:
> Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0).
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
http://charlesnadeau.blogspot.com/
I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:
*2x Intel Xeon E5550
*72GB RAM
*Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80% read/20% write) for Postgresql data only:
Logical Drive: 3
Size: 273.4 GB
Fault Tolerance: 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 65535
Strip Size: 128 KB
Full Stripe Size: 256 KB
Status: OK
Caching: Enabled
Unique Identifier: 600508B1001037383941424344450A00
Disk Name: /dev/sdc
Mount Points: /mnt/data 273.4 GB
OS Status: LOCKED
Logical Drive Label: A00A194750123456789ABCDE516F
Mirror Group 0:
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)
Mirror Group 1:
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)
physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 146 GB, OK)
Drive Type: Data
Formatted with ext4 with: sudo mkfs.ext4 -E stride=32,stripe_width=64 -v /dev/sdc1.
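The stride/stripe_width values in that mkfs command follow from the controller geometry quoted above; a quick sketch of the arithmetic (assuming the default 4kB ext4 block size, and 2 data-bearing disks in a 4-disk RAID10):

```shell
FS_BLOCK=4096                           # ext4 block size in bytes
STRIP=$((128 * 1024))                   # controller strip size per disk (128 KB)
DATA_DISKS=2                            # 4-disk RAID10 stripes over 2 mirror pairs
stride=$((STRIP / FS_BLOCK))            # filesystem blocks per strip
stripe_width=$((stride * DATA_DISKS))   # filesystem blocks per full stripe
echo "mkfs.ext4 -E stride=${stride},stripe_width=${stripe_width} -v /dev/sdc1"
```

This reproduces the stride=32,stripe_width=64 used above.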
Mounted in /etc/fstab with this line: "UUID=99fef4ae-51dc-4365-9210-0b153b1cbbd0 /mnt/data ext4 rw,nodiratime,user_xattr,noatime,nobarrier,errors=remount-ro 0 1"
Postgresql is the only application running on this server.
Postgresql is used as a mini data warehouse to generate reports and do statistical analysis. It is used by at most 2 users and fresh data is added every 10 days. The database has 16 tables: one is 224GB big and the rest are between 16kB and 470MB big.
My configuration is:
name | current_setting | source
---------------------------------+------------------------------------------------+----------------------
application_name | psql | client
autovacuum_vacuum_scale_factor | 0 | configuration file
autovacuum_vacuum_threshold | 2000 | configuration file
checkpoint_completion_target | 0.9 | configuration file
checkpoint_timeout | 30min | configuration file
client_encoding | UTF8 | client
client_min_messages | log | configuration file
cluster_name | 9.6/main | configuration file
cpu_index_tuple_cost | 0.001 | configuration file
cpu_operator_cost | 0.0005 | configuration file
cpu_tuple_cost | 0.003 | configuration file
DateStyle | ISO, YMD | configuration file
default_statistics_target | 100 | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
effective_cache_size | 22GB | configuration file
effective_io_concurrency | 4 | configuration file
external_pid_file | /var/run/postgresql/9.6-main.pid | configuration file
lc_messages | C | configuration file
lc_monetary | en_CA.UTF-8 | configuration file
lc_numeric | en_CA.UTF-8 | configuration file
lc_time | en_CA.UTF-8 | configuration file
listen_addresses | * | configuration file
lock_timeout | 100s | configuration file
log_autovacuum_min_duration | 0 | configuration file
log_checkpoints | on | configuration file
log_connections | on | configuration file
log_destination | csvlog | configuration file
log_directory | /mnt/bigzilla/data/toburn/hp/postgresql/pg_log | configuration file
log_disconnections | on | configuration file
log_error_verbosity | default | configuration file
log_file_mode | 0600 | configuration file
log_filename | postgresql-%Y-%m-%d_%H%M%S.log | configuration file
log_line_prefix | user=%u,db=%d,app=%aclient=%h | configuration file
log_lock_waits | on | configuration file
log_min_duration_statement | 0 | configuration file
log_min_error_statement | debug1 | configuration file
log_min_messages | debug1 | configuration file
log_rotation_size | 1GB | configuration file
log_temp_files | 0 | configuration file
log_timezone | localtime | configuration file
logging_collector | on | configuration file
maintenance_work_mem | 3GB | configuration file
max_connections | 10 | configuration file
max_locks_per_transaction | 256 | configuration file
max_parallel_workers_per_gather | 14 | configuration file
max_stack_depth | 2MB | environment variable
max_wal_size | 4GB | configuration file
max_worker_processes | 14 | configuration file
min_wal_size | 2GB | configuration file
parallel_setup_cost | 1000 | configuration file
parallel_tuple_cost | 0.012 | configuration file
port | 5432 | configuration file
random_page_cost | 22 | configuration file
seq_page_cost | 1 | configuration file
shared_buffers | 34GB | configuration file
shared_preload_libraries | pg_stat_statements | configuration file
ssl | on | configuration file
ssl_cert_file | /etc/ssl/certs/ssl-cert-snakeoil.pem | configuration file
ssl_key_file | /etc/ssl/private/ssl-cert-snakeoil.key | configuration file
statement_timeout | 1000000s | configuration file
stats_temp_directory | /var/run/postgresql/9.6-main.pg_stat_tmp | configuration file
superuser_reserved_connections | 1 | configuration file
syslog_facility | local1 | configuration file
syslog_ident | postgres | configuration file
syslog_sequence_numbers | on | configuration file
temp_file_limit | 80GB | configuration file
TimeZone | localtime | configuration file
track_activities | on | configuration file
track_counts | on | configuration file
track_functions | all | configuration file
unix_socket_directories | /var/run/postgresql | configuration file
vacuum_cost_delay | 1ms | configuration file
vacuum_cost_limit | 5000 | configuration file
vacuum_cost_page_dirty | 200 | configuration file
vacuum_cost_page_hit | 10 | configuration file
vacuum_cost_page_miss | 100 | configuration file
wal_buffers | 16MB | configuration file
wal_compression | on | configuration file
wal_sync_method | fdatasync | configuration file
work_mem | 1468006kB | configuration file
The part of /etc/sysctl.conf I modified is:
vm.swappiness = 1
vm.dirty_background_bytes = 134217728
vm.dirty_bytes = 1073741824
vm.overcommit_ratio = 100
vm.zone_reclaim_mode = 0
kernel.numa_balancing = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_migration_cost_ns = 5000000
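These can be applied and verified without a reboot:

```shell
sudo sysctl -p       # reload /etc/sysctl.conf
sysctl vm.swappiness vm.dirty_bytes vm.zone_reclaim_mode
```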
The problem I have is very poor read performance. When I benchmark my array with fio I get random reads of about 200MB/s and 1100 IOPS, and sequential reads of about 286MB/s and 21000 IOPS. But when I watch my queries using pg_activity, I get at best 4MB/s. Also, using dstat I can see that iowait time is at about 25%. This problem is not query-dependent.
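A hedged back-of-envelope reading of those numbers: if heavy concurrent scanning degrades the array to pure random I/O, the throughput ceiling is roughly the fio random-read IOPS times PostgreSQL's 8kB page size, which lands in the same ballpark as the few MB/s seen in pg_activity:

```shell
# Assumed figures: the ~1100 random-read IOPS measured by fio above,
# and PostgreSQL's default 8kB block size.
IOPS=1100
KB_PER_S=$((IOPS * 8))   # each random read fetches one 8kB page
echo "ceiling: ${KB_PER_S} kB/s (~$((KB_PER_S / 1024)) MB/s at pure random I/O)"
```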
I backed up the database, reformatted the array making sure it is well aligned, then restored the database and got the same result.
Where should I target my troubleshooting at this stage? I reformatted my drive and tuned my postgresql.conf and OS as much as I could. The hardware doesn’t seem to have any issues; I am really puzzled.
Thanks!
Charles
--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/
Nice! Pleased that the general idea worked well for you!

I'm also relieved that you did not follow my recommendation exactly - I've
been trialling a Samsung 960 Evo (256GB) and Intel 600p (256GB) and I've
stumbled across the serious disadvantages of (consumer) M.2 drives using TLC
NAND - terrible sustained write performance! While these guys can happily do
~2GB/s reads, their write performance is only 'burst capable'. They have
small SLC NAND 'write caches' that do ~1GB/s for a *limited time* (10-20s)
and after that you get ~200MB/s! Ouch - my old Crucial 550 can do 350MB/s
sustained writes (so two of them in RAID0 are doing 700MB/s for hours).
Bigger-capacity drives can do better - but overall I'm not that impressed
with the current trend of using TLC NAND.

regards

Mark

On 21/07/17 00:50, Charles Nadeau wrote:
> Mark,
>
> I received yesterday a second server having 16 drive bays. Just for a
> quick trial, I used 2 old 60GB SSDs (a Kingston V300 and an ADATA SP900)
> to build a RAID0. To my surprise, my very picky RAID controller (HP
> P410i) recognised them without a fuss (although as SATA II drives at
> 3Gb/s). A quick fio benchmark gives me 22000 random 4k read IOPS, more
> than my 5 146GB 10k SAS disks in RAID0. I moved my most frequently
> used index to this array and will try to do some benchmarks.
> Knowing that SSDs based on the SandForce-2281 controller are recognised
> by my server, I may buy a pair of bigger/newer ones to put my tables on.
>
> Thanks!
>
> Charles
>
> On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>
>     Thinking about this a bit more - if somewhat more blazing
>     performance is needed, then this could be achieved via losing the
>     RAID card and spinning disks altogether and buying 1 of the NVME
>     or SATA solid state products: e.g
>
>     - Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds
>     and 200K IOPS)
>
>     - Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)
>
>     The Samsung needs an M.2 port on the mobo (but most should have
>     'em - and if not PCIe X4 adapter cards are quite cheap). The Intel
>     is a bit more expensive compared to the Samsung, and is slower but
>     has a longer lifetime. However for your workload the Sammy is
>     probably fine.
>
>     regards
>
>     Mark
>
>     On 15/07/17 11:09, Mark Kirkwood wrote:
>
>         Ah yes - that seems more sensible (but still slower than I
>         would expect for 5 disks RAID 0).
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
> --
> Charles Nadeau Ph.D.
> http://charlesnadeau.blogspot.com/