Re: Why is a hash join preferred when it does not fit in work_mem

From: Dimitrios Apostolou
Subject: Re: Why is a hash join preferred when it does not fit in work_mem
Msg-id: 85365bc8-f35b-be82-2582-1545830f976a@gmx.net
In reply to: Re: Why is a hash join preferred when it does not fit in work_mem  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicates the
slowness of the hash join I described. If you have a system with
spinning disks lying around, I would appreciate it if you could verify
the scenario. Can you also replicate it on different kinds of systems?


CREATE TABLE descriptions (description_id serial PRIMARY KEY, description text);
INSERT INTO  descriptions (description_id, description)
    SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
        FROM generate_series(0,1200300) AS s;
CREATE TABLE descriptions_in_books (description_id integer REFERENCES descriptions(description_id), book_id integer);
INSERT INTO  descriptions_in_books (description_id, book_id)
    SELECT s % 1200300, s
        FROM generate_series(0,5200300) AS s;

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;

EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
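
As a side note, on top of the per-node temp numbers that EXPLAIN
(ANALYZE, BUFFERS) reports below, a rough database-wide view of the
spill traffic is available from pg_stat_database; the counters are
cumulative, so compare snapshots taken before and after each query:

SELECT temp_files, temp_bytes
    FROM pg_stat_database
    WHERE datname = current_database();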



The first JOIN query uses a hash join with Batches: 1024 and takes 622s!
For most of that time I can see the disk writing at around 1-2 MB/s, so
I have to assume it's not writing sequentially.

The second, identical JOIN uses a merge join that completes in 14s. The
I/O happens at a much higher rate (maybe 10x), so I'm assuming it's
mostly sequential.

Another observation is that the hash join deteriorates as the length of
the TEXT column grows. In fact, if I fill it with only 32-character
strings, the hash join is split into only 128 batches, and it completes
almost as fast as the merge join. Back of the envelope: 1.2M rows at
width=520 is roughly 600 MB of hash table, which at work_mem = 1MB
rounds up to 1024 batches; shortening the strings shrinks the table and
the batch count accordingly. Could it be that the cost estimation
underestimates the I/O pattern caused by splitting into batches?
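
For reference, the short-string variant I tested was along these lines
(truncating the hash to 32 characters is just one way to do it; only
this INSERT changes, the rest of the recipe stays the same):

INSERT INTO  descriptions (description_id, description)
    SELECT s, left(encode(sha512(s::text::bytea), 'hex'), 32)
        FROM generate_series(0,1200300) AS s;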


Here are the measurements:


Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual time=344516.004..621725.562 rows=5200301 loops=1)
    Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
    Inner Unique: true
    Hash Cond: (descriptions_in_books.description_id = descriptions.description_id)
    Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
    I/O Timings: shared/local read=1402.597, temp read=229252.170 write=371508.313
    ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
          Output: descriptions_in_books.book_id, descriptions_in_books.description_id
          Buffers: shared hit=32 read=22979
          I/O Timings: shared/local read=249.910
    ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual time=343268.470..343268.471 rows=1200301 loops=1)
          Output: descriptions.description_id, descriptions.description
          Buckets: 2048  Batches: 1024  Memory Usage: 686kB
          Buffers: shared hit=15554 read=70182, temp written=78538
          I/O Timings: shared/local read=1152.687, temp write=338883.205
          ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
                Output: descriptions.description_id, descriptions.description
                Buffers: shared hit=15554 read=70182
                I/O Timings: shared/local read=1152.687
  Settings: hash_mem_multiplier = '1', work_mem = '1MB'
  Planning Time: 0.303 ms
  Execution Time: 622495.279 ms
(22 rows)


SET enable_hashjoin TO off;


Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual time=6091.932..13304.924 rows=5200301 loops=1)
    Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
    Merge Cond: (descriptions.description_id = descriptions_in_books.description_id)
    Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
    I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 write=794.483
    ->  Index Scan using descriptions_pkey on public.descriptions  (cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 rows=1200301 loops=1)
          Output: descriptions.description_id, descriptions.description
          Buffers: shared hit=3 read=89015 written=1
          I/O Timings: shared/local read=834.732 write=28.575
    ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual time=6091.892..9171.796 rows=5200301 loops=1)
          Output: descriptions_in_books.book_id, descriptions_in_books.description_id
          Buffers: shared hit=64 read=22947, temp read=45806 written=46189
          I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
          ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual time=6091.885..7392.828 rows=5200301 loops=1)
                Output: descriptions_in_books.book_id, descriptions_in_books.description_id
                Sort Key: descriptions_in_books.description_id
                Sort Method: external merge  Disk: 91632kB
                Buffers: shared hit=64 read=22947, temp read=45806 written=46189
                I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
                ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75014.01 rows=5200301 width=8) (actual time=0.031..1617.520 rows=5200301 loops=1)
                      Output: descriptions_in_books.book_id, descriptions_in_books.description_id
                      Buffers: shared hit=64 read=22947
                      I/O Timings: shared/local read=172.311
  Settings: enable_hashjoin = 'off', hash_mem_multiplier = '1', work_mem = '1MB'
  Planning:
    Buffers: shared hit=10 read=9
    I/O Timings: shared/local read=90.112
  Planning Time: 90.774 ms
  Execution Time: 13955.462 ms
(29 rows)
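
A back-of-envelope check on the first plan, using its temp page counts
(8kB pages), suggests the temp traffic alone explains the runtime:

SELECT pg_size_pretty(97829 * 8192::bigint);  -- ≈ 764 MB, written once and read back once

That matches the plan's I/O Timings line: roughly 371s writing plus 229s
reading temp data, i.e. about 600s of the 622s total.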




Regards,
Dimitris



