Discussion: BUG #15577: Query returns different results when executed multiple times

BUG #15577: Query returns different results when executed multiple times
From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15577
Logged by:          Bartosz Polnik
Email address:      bartoszpolnik@gmail.com
PostgreSQL version: 11.1
Operating system:   Windows 10
Description:

Hello,

A couple of days ago I stumbled upon a query that doesn't return all rows it
should. When we execute it multiple times, we get only a subset of data.

The query is:

SELECT ta.id AS table_a_id,
       tc.id as table_c_id,
       tba.id AS table_b_id
FROM test.table_b_active tba
     INNER JOIN test.table_c tc ON tba.target_id = tc.id
     INNER JOIN test.table_d td ON tc.table_d_id = td.id
     LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date = '2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
  AND td.group = 'A'
  AND tc.table_e_id = 4
  AND (
       (tba.target_id = tc.id AND tba.group_type = 'A')
       OR tba.source_id = tc.id
      );

We always expect 31 rows in return:

 table_a_id | table_c_id | table_b_id
------------+------------+------------
 16116185 | 328860 | 2936924
 16115788 | 348539 | 2913754
 16115788 | 348539 | 3039173
 16115788 | 348539 | 2913874
 16116256 | 293541 | 2997160
 16116256 | 293541 | 2901938
 16116256 | 293541 | 2901933
 16114789 | 292051 | 3038539
 16114813 | 342353 | 3052371
 16116069 | 351585 | 3025941
 16114814 | 331329 | 2946332
 16115861 | 350487 | 2933633
 16115106 | 350047 | 2902075
 16116066 | 351434 | 3010909
 16114811 | 298605 | 2998909
 16114811 | 298605 | 2893809
 16114811 | 298605 | 2987038
 16114811 | 298605 | 3038877
 16114811 | 298605 | 3010694
 16114811 | 298605 | 2893188
 16114811 | 298605 | 2893391
 16114811 | 298605 | 2983360
 16114811 | 298605 | 3038221
 16114811 | 298605 | 3026078
 16114811 | 298605 | 2998966
 16116260 | 290583 | 2955483
 16116260 | 290583 | 2921135
 16116260 | 290583 | 2947914
 16116260 | 290583 | 2901669
 16116241 | 295971 | 3038921
 16116249 | 296708 | 3038888

But when we run it using psql, we get different results (not all rows are returned).
Here's the query I used to get the execution plan, and example output from two consecutive executions:

explain (costs false, analyze true)
SELECT ta.id AS table_a_id,
       tc.id as table_c_id,
       tba.id AS table_b_id
FROM test.table_b_active tba
     INNER JOIN test.table_c tc ON tba.target_id = tc.id
     INNER JOIN test.table_d td ON tc.table_d_id = td.id
     LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date = '2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
  AND td.group = 'A'
  AND tc.table_e_id = 4
  AND (
       (tba.target_id = tc.id AND tba.group_type = 'A')
       OR tba.source_id = tc.id
      );

Nested Loop Left Join (actual time=157.436..329.538 rows=21 loops=1)
  ->  Gather (actual time=157.407..349.743 rows=21 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Nested Loop (actual time=167.978..281.540 rows=11 loops=2)
              ->  Hash Join (actual time=0.300..18.206 rows=945 loops=2)
                    Hash Cond: (tc.table_d_id = td.id)
                    ->  Parallel Seq Scan on table_c tc (actual time=0.107..17.326 rows=1751 loops=2)
                          Filter: (table_e_id = 4)
                          Rows Removed by Filter: 49117
                    ->  Hash (actual time=0.065..0.065 rows=8 loops=2)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Seq Scan on table_d td (actual time=0.043..0.049 rows=8 loops=2)
                                Filter: (group = 'A')
                                Rows Removed by Filter: 55
              ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.273..0.278 rows=0 loops=1890)
                    Index Cond: (target_id = tc.id)
                    Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A')) OR (source_id = tc.id)))
                    Rows Removed by Filter: 26
  ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.019..0.019 rows=1 loops=21)
        Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Planning Time: 1.218 ms
Execution Time: 350.283 ms

Above we only got 21 rows, instead of 31.
Nested Loop Left Join (actual time=164.189..358.068 rows=28 loops=1)
  ->  Gather (actual time=164.140..370.034 rows=28 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Nested Loop (actual time=188.718..306.289 rows=14 loops=2)
              ->  Hash Join (actual time=0.321..21.860 rows=945 loops=2)
                    Hash Cond: (tc.table_d_id = td.id)
                    ->  Parallel Seq Scan on table_c tc (actual time=0.113..20.784 rows=1751 loops=2)
                          Filter: (table_e_id = 4)
                          Rows Removed by Filter: 49117
                    ->  Hash (actual time=0.063..0.063 rows=8 loops=2)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Seq Scan on table_d td (actual time=0.040..0.049 rows=8 loops=2)
                                Filter: (group = 'A')
                                Rows Removed by Filter: 55
              ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.294..0.300 rows=0 loops=1890)
                    Index Cond: (target_id = tc.id)
                    Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A')) OR (source_id = tc.id)))
                    Rows Removed by Filter: 26
  ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.031..0.031 rows=1 loops=28)
        Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Planning Time: 0.971 ms
Execution Time: 371.202 ms

Here we only got 28 rows, instead of 31.

Used indices:

create index table_b_idx_target_id on test.table_b (target_id) where (t_active IS TRUE);
create unique index table_a_uq_001 on table_a (table_c_id, date);

And views:

create view test.table_b_active as
SELECT * FROM test.table_b WHERE (table_b.t_active IS TRUE);

I made a couple of tests and couldn't reproduce this problem with max_parallel_workers=0.
With this parameter set, we get the execution plan:

Nested Loop Left Join (actual time=212.010..509.274 rows=31 loops=1)
  ->  Gather (actual time=211.976..508.855 rows=31 loops=1)
        Workers Planned: 1
        Workers Launched: 0
        ->  Nested Loop (actual time=211.593..508.425 rows=31 loops=1)
              ->  Hash Join (actual time=0.063..31.628 rows=1890 loops=1)
                    Hash Cond: (tc.table_d_id = td.id)
                    ->  Parallel Seq Scan on table_c tc (actual time=0.042..30.000 rows=3501 loops=1)
                          Filter: (table_e_id = 4)
                          Rows Removed by Filter: 98233
                    ->  Hash (actual time=0.016..0.016 rows=8 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Seq Scan on table_d td (actual time=0.008..0.012 rows=8 loops=1)
                                Filter: (group = 'A')
                                Rows Removed by Filter: 55
              ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.248..0.251 rows=0 loops=1890)
                    Index Cond: (target_id = tc.id)
                    Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A')) OR (source_id = tc.id)))
                    Rows Removed by Filter: 26
  ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.014..0.014 rows=1 loops=31)
        Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Planning Time: 1.135 ms
Execution Time: 509.431 ms

The number of rows returned is correct: 31.

I tried to create a clean schema with test data, but couldn't get the same execution plan, so I can't include that. The data above comes from a db snapshot with table, index and view names changed by hand, so there might be some typos. Sorry about them.

I'm using PostgreSQL 11.1, compiled by Visual C++ build 1914, 64-bit.

Best regards,
Bartosz Polnik
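[Editor's note: not part of the original report, but the observation above suggests a minimal session-level workaround sketch while the bug is investigated, assuming only standard PostgreSQL 11 settings. Disabling parallelism for the session forces the single-process plan that returned all 31 rows; it is a workaround, not a fix.]

```sql
-- Sketch: disable parallel query for this transaction only, so other
-- sessions keep their normal settings. With no Gather node, the query
-- runs entirely in one backend, which returned the correct 31 rows.
BEGIN;
SET LOCAL max_parallel_workers_per_gather = 0;  -- planner stops choosing parallel plans

SELECT ta.id AS table_a_id,
       tc.id AS table_c_id,
       tba.id AS table_b_id
FROM test.table_b_active tba
     INNER JOIN test.table_c tc ON tba.target_id = tc.id
     INNER JOIN test.table_d td ON tc.table_d_id = td.id
     LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date = '2018-08-31'::DATE
WHERE tba.date BETWEEN '2018-08-10'::DATE AND '2018-09-01'::DATE
  AND td.group = 'A'
  AND tc.table_e_id = 4
  AND ((tba.target_id = tc.id AND tba.group_type = 'A') OR tba.source_id = tc.id);

COMMIT;  -- SET LOCAL reverts automatically at transaction end
```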
On Mon, 7 Jan 2019 at 05:46, PG Bug reporting form <noreply@postgresql.org> wrote:
> A couple of days ago I stumbled upon a query that doesn't return all rows it
> should.

> -> Nested Loop (actual time=167.978..281.540 rows=11 loops=2)
> -> Nested Loop (actual time=188.718..306.289 rows=14 loops=2)

Above is the first place where the actual row counts diverge. This makes it appear that it's the join to table_b that's causing the variation. The parallel worker is meant to be running with the same snapshot as the main process so that it properly sees the same tuples as visible. If for some reason that was not working correctly, then that might explain why you get differing results.

Can you confirm if there are concurrent changes being made to table_b perhaps while the query is running?

Do you still get the variation with an otherwise idle database with no open transactions that's just received a complete VACUUM?

I'm able to produce the same plan as you're getting with the attached script, but unable to see any sort of row variation. My row counts don't match yours exactly. If I adjust the rows in table_b too much I get a completely different plan. I tested with PostgreSQL 11.1, compiled by Visual C++ build 1916, 64-bit.

> I tried to create a clean schema with test data, but couldn't get the same
> execution plan, so I can't include that.

If you're removing columns to simplify the test script, then you may need to add some columns back to pad the tuples out a bit in order to keep the tables around the same size. Variations in the table size will have an effect on the join costs and could change the join order. Also, a smaller table_c may not receive a parallel seq scan. In my mockup, I made table_b and table_a fairly large so as to keep the joins as parameterized nested loops. I also left out indexes on the "id" columns to reduce the chances of a Merge Join.

--
David Rowley                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
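[Editor's note: the tuple-padding advice above can be sketched as follows. This is a hypothetical reduced schema, not the attached script; the column names and row counts are illustrative. The point is that a filler column keeps average tuple width, and therefore relation size and join costing, close to the production table.]

```sql
-- Sketch of padding a simplified test table so the planner sees a
-- relation of roughly the original size, even after real columns were
-- dropped. "pad" stands in for the removed columns.
drop table if exists test.table_c;
create table test.table_c (
    id         int,
    table_d_id int,
    table_e_id int,
    pad        text   -- filler to keep tuples about as wide as production rows
);

insert into test.table_c
select i,
       i % 63,              -- spread rows over the table_d ids
       i % 30,              -- only a fraction match table_e_id = 4
       repeat('x', 100)     -- ~100 bytes of filler per row (tune to match reality)
from generate_series(1, 101734) AS s(i);

analyze test.table_c;       -- refresh statistics so costs reflect the padding
```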
> Can you confirm if there are concurrent changes being made to table_b
> perhaps while the query is running?
I can confirm there are no concurrent changes being made, and I'm the only client connecting to that db.
> Do you still get the variation with an otherwise idle database with no
> open transactions that's just received a complete VACUUM?
I executed VACUUM (FULL, ANALYZE) on all tables from the query, but I'm still getting different results.
With your script, I'm always getting the same output, so it must be something else.
I thought about sharing a db dump, but I can't do that due to company policy. What I could share, though, is an anonymized db dump. Do you know of any good tool for performing anonymization?
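[Editor's note: no specific tool was named in the thread, but one low-tech approach can be sketched: copy the schema and overwrite sensitive columns before running pg_dump. Column names here are hypothetical stand-ins; the interesting property is that md5() is deterministic, so equal join keys stay equal after anonymization.]

```sql
-- Sketch: anonymize a copy of the data, then dump the copy.
create schema anon;

create table anon.table_b as select * from test.table_b;

-- Overwrite a sensitive text column ("name" is a placeholder) with its
-- md5 hash: unreadable, but identical inputs still produce identical
-- outputs, so joins and duplicates behave as in the original data.
update anon.table_b
set name = md5(name);

-- Then: pg_dump --schema=anon ...
```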
On Mon, Jan 7, 2019 at 10:29 PM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> > Can you confirm if there are concurrent changes being made to table_b
> > perhaps while the query is running?
>
> I confirm. There are no concurrent changes being made and I'm the only client connecting to that db.
>
> > Do you still get the variation with an otherwise idle database with no
> > open transactions that's just received a complete VACUUM?
>
> I executed VACUUM (FULL, ANALYZE) on all tables from the query, but I'm still getting different results.
>
> With your script, I'm always getting the same output, so it must be something else.

What do you get if you run it with SET force_parallel_mode = on?

--
Thomas Munro
http://www.enterprisedb.com
I've modified the script:
1. I added SET force_parallel_mode = on:
SET force_parallel_mode = on;
explain (costs off, analyze)
SELECT ta.id AS table_a_id,
tc.id as table_c_id,
tba.id AS table_b_id
2. I've changed group and group_type to be enums (we have them as enums in the db):
drop table if exists table_c;
drop table if exists table_d;
drop type if exists "group";
drop type if exists "group_type";
create type group_type AS ENUM ('A', 'B');
create type "group" AS ENUM ('A', 'B');
The updated version of the script is attached.
When I execute it, I get:
c:\Program Files\PostgreSQL\11\bin>psql -U postgres -f "C:\mockup_bug15577.sql" test
DROP VIEW
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TYPE
DROP TYPE
CREATE TABLE
INSERT 0 20000
CREATE TYPE
CREATE TYPE
CREATE TABLE
INSERT 0 200000
CREATE TABLE
INSERT 0 101736
CREATE TABLE
INSERT 0 8
INSERT 0 55
CREATE INDEX
CREATE INDEX
CREATE VIEW
ANALYZE
SET
SET
ALTER TABLE
DROP VIEW
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TYPE
DROP TYPE
CREATE TYPE
CREATE TYPE
CREATE TABLE
INSERT 0 20000
CREATE TABLE
INSERT 0 200000
CREATE TABLE
INSERT 0 101736
CREATE TABLE
INSERT 0 8
INSERT 0 55
CREATE INDEX
CREATE INDEX
CREATE VIEW
ANALYZE
SET
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=1.185..81.170 rows=112 loops=1)
   ->  Gather (actual time=1.179..95.573 rows=112 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Nested Loop (actual time=0.076..7.182 rows=56 loops=2)
               ->  Hash Join (actual time=0.050..4.592 rows=936 loops=2)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Parallel Seq Scan on table_c tc (actual time=0.004..4.358 rows=1755 loops=2)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 49114
                     ->  Hash (actual time=0.019..0.020 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.011..0.014 rows=8 loops=1)
                                 Filter: ("group" = 'A'::"group")
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.003..0.003 rows=0 loops=1871)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 1
   ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.003..0.003 rows=0 loops=112)
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 0.773 ms
 Execution Time: 96.037 ms
(23 rows)
The number of rows doesn't change from execution to execution.
Attachments
On Tue, Jan 8, 2019 at 1:56 AM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> I've modified the script:
> 1. I Added SET force_parallel_mode = on;:

What if you try that with your original query in the database that shows the problem? Or alternatively SET parallel_leader_participation = on. What I'm wondering is whether the parallel worker is consistently getting the same result, but it's a different result from the leader, and the variation comes from the race between leader and worker, or whether the instability comes from somewhere else.

--
Thomas Munro
http://www.enterprisedb.com
For the following SQL:
SET force_parallel_mode = on;
explain (costs false, analyze true)
SELECT ta.id AS table_a_id,
tc.id as table_c_id,
tba.id AS table_b_id
FROM test.table_b_active tba
INNER JOIN test.table_c tc ON tba.target_id = tc.id
INNER JOIN test.table_d td ON tc.table_d_id = td.id
LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date =
'2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
AND td.group = 'A'
AND tc.table_e_id = 4
AND (
(tba.target_id = tc.id AND tba.group_type = 'A')
OR tba.source_id = tc.id
);
I got:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=145.151..284.612 rows=26 loops=1)
   ->  Gather (actual time=145.092..304.166 rows=26 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Nested Loop (actual time=153.289..247.768 rows=13 loops=2)
               ->  Hash Join (actual time=0.221..15.028 rows=945 loops=2)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Parallel Seq Scan on table_c tc (actual time=0.085..14.442 rows=1751 loops=2)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 49117
                     ->  Hash (actual time=0.044..0.045 rows=8 loops=2)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.028..0.033 rows=8 loops=2)
                                 Filter: (group = 'A'::test.group)
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.242..0.246 rows=0 loops=1890)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 26
   ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.016..0.016 rows=1 loops=26)
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 5.024 ms
 Execution Time: 304.858 ms
On the second execution:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=228.405..279.383 rows=19 loops=1)
   ->  Gather (actual time=228.376..288.807 rows=19 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Nested Loop (actual time=150.785..242.440 rows=10 loops=2)
               ->  Hash Join (actual time=0.212..14.353 rows=945 loops=2)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Parallel Seq Scan on table_c tc (actual time=0.075..13.794 rows=1751 loops=2)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 49117
                     ->  Hash (actual time=0.050..0.050 rows=8 loops=2)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.031..0.037 rows=8 loops=2)
                                 Filter: (group = 'A'::test.group)
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.237..0.241 rows=0 loops=1890)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 26
   ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.021..0.021 rows=1 loops=19)
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 4.816 ms
 Execution Time: 289.503 ms
I executed it a couple of times and saw in the output 26 rows (as above), 19 rows (also above), 28 rows, 28 rows, 29 rows etc.
The parts that changed (number of rows) were the Nested Loop Left Join and the Nested Loop.
For the following SQL:
SET parallel_leader_participation = on;
explain (costs false, analyze true)
SELECT ta.id AS table_a_id,
tc.id as table_c_id,
tba.id AS table_b_id
FROM test.table_b_active tba
INNER JOIN test.table_c tc ON tba.target_id = tc.id
INNER JOIN test.table_d td ON tc.table_d_id = td.id
LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date =
'2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
AND td.group = 'A'
AND tc.table_e_id = 4
AND (
(tba.target_id = tc.id AND tba.group_type = 'A')
OR tba.source_id = tc.id
);
I received:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=143.805..279.354 rows=29 loops=1)
   ->  Gather (actual time=143.735..299.066 rows=29 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Nested Loop (actual time=149.334..242.108 rows=15 loops=2)
               ->  Hash Join (actual time=0.203..14.536 rows=945 loops=2)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Parallel Seq Scan on table_c tc (actual time=0.072..13.971 rows=1751 loops=2)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 49117
                     ->  Hash (actual time=0.043..0.043 rows=8 loops=2)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.028..0.034 rows=8 loops=2)
                                 Filter: (group = 'A'::test.group)
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.237..0.240 rows=0 loops=1890)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 26
   ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.018..0.018 rows=1 loops=29)
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 5.070 ms
 Execution Time: 299.849 ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=231.714..276.770 rows=26 loops=1)
   ->  Gather (actual time=231.654..295.971 rows=26 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Nested Loop (actual time=151.209..239.965 rows=13 loops=2)
               ->  Hash Join (actual time=0.214..14.467 rows=945 loops=2)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Parallel Seq Scan on table_c tc (actual time=0.070..13.923 rows=1751 loops=2)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 49117
                     ->  Hash (actual time=0.045..0.045 rows=8 loops=2)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.028..0.033 rows=8 loops=2)
                                 Filter: (group = 'A'::test.group)
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.235..0.238 rows=0 loops=1890)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 26
   ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.017..0.017 rows=1 loops=26)
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 4.976 ms
 Execution Time: 296.729 ms
I also made a test with:
SET parallel_leader_participation = off;
SET force_parallel_mode = on;
explain (costs false, analyze true)
SELECT ta.id AS table_a_id,
tc.id as table_c_id,
tba.id AS table_b_id
FROM test.table_b_active tba
INNER JOIN test.table_c tc ON tba.target_id = tc.id
INNER JOIN test.table_d td ON tc.table_d_id = td.id
LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date =
'2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
AND td.group = 'A'
AND tc.table_e_id = 4
AND (
(tba.target_id = tc.id AND tba.group_type = 'A')
OR tba.source_id = tc.id
);
The output is correct - 31 rows, but the plan is different:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather (actual time=263.880..507.668 rows=31 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Nested Loop Left Join (actual time=193.246..415.020 rows=31 loops=1)
         ->  Nested Loop (actual time=192.997..414.336 rows=31 loops=1)
               ->  Hash Join (actual time=0.287..23.958 rows=1890 loops=1)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Seq Scan on table_c tc (actual time=0.128..23.044 rows=3501 loops=1)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 98233
                     ->  Hash (actual time=0.064..0.069 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.045..0.052 rows=8 loops=1)
                                 Filter: (group = 'A'::test.group)
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.203..0.206 rows=0 loops=1890)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 26
         ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.021..0.021 rows=1 loops=31)
               Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 5.051 ms
 Execution Time: 507.858 ms
On Tue, Jan 8, 2019 at 10:39 AM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> Gather (actual time=263.880..507.668 rows=31 loops=1)
>   Workers Planned: 1
>   Workers Launched: 1
>   Single Copy: true
>   ->  Nested Loop Left Join (actual time=193.246..415.020 rows=31 loops=1)
>         ->  Nested Loop (actual time=192.997..414.336 rows=31 loops=1)
>               ->  Hash Join (actual time=0.287..23.958 rows=1890 loops=1)
>                     Hash Cond: (tc.table_d_id = td.id)
>                     ->  Seq Scan on table_c tc (actual time=0.128..23.044 rows=3501 loops=1)
>                           Filter: (table_e_id = 4)
>                           Rows Removed by Filter: 98233
>                     ->  Hash (actual time=0.064..0.069 rows=8 loops=1)
>                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
>                           ->  Seq Scan on table_d td (actual time=0.045..0.052 rows=8 loops=1)
>                                 Filter: (group = 'A'::test.group)
>                                 Rows Removed by Filter: 55
>               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.203..0.206 rows=0 loops=1890)
>                     Index Cond: (target_id = tc.id)
>                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
>                     Rows Removed by Filter: 26
>         ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.021..0.021 rows=1 loops=31)
>               Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))

Hmm. So if the plan runs entirely in a worker, we get 31 rows. If the plan runs entirely in the leader, we get 31 rows. But if it runs in both, we get fewer. In all cases the Hash Join produces 1890 rows, and in all cases the Nested Loop probes table_b_idx_target_id for all of them: "loops=1890". But somehow the output of the Nested Loop is not 31 rows (or 31 / 2) when divided over two processes.

As a sanity check, can you please see if the run-only-in-leader case (max_parallel_workers = 0, so that we see "Workers Launched: 0") produces the *same* 31 rows as the run-only-in-worker case (force_parallel_mode = on, so that we see "Single Copy: true")? That is, the actual values of those 31 rows, in particular the columns coming from table_b.

--
Thomas Munro
http://www.enterprisedb.com
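[Editor's note: the sanity check above can be carried out mechanically by materializing each run into a temp table and diffing with EXCEPT. This is a sketch, not part of the thread; the original SELECT is elided and goes where indicated, and the settings are the ones already discussed in the thread.]

```sql
-- Sketch: capture the run-only-in-leader result set.
SET max_parallel_workers = 0;            -- plan gets "Workers Launched: 0"
CREATE TEMP TABLE leader_rows AS
SELECT ... ;                             -- the original 3-column query goes here

-- Capture the run-only-in-worker result set.
RESET max_parallel_workers;
SET force_parallel_mode = on;            -- plan gets "Single Copy: true"
SET parallel_leader_participation = off;
CREATE TEMP TABLE worker_rows AS
SELECT ... ;                             -- same query again

-- Symmetric difference: rows present in one run but not the other.
-- An empty result means the two 31-row sets are identical.
(TABLE leader_rows EXCEPT TABLE worker_rows)
UNION ALL
(TABLE worker_rows EXCEPT TABLE leader_rows);
```

Note that EXCEPT removes duplicate rows, so if exact duplicates could legitimately appear among the 31 rows, comparing with a GROUP BY and count per row would be the stricter check.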
For:
SET max_parallel_workers = 0;
explain (costs false, analyze true)
SELECT ta.id AS table_a_id,
tc.id as table_c_id,
tba.id AS table_b_id
FROM test.table_b_active tba
INNER JOIN test.table_c tc ON tba.target_id = tc.id
INNER JOIN test.table_d td ON tc.table_d_id = td.id
LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date =
'2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
AND td.group = 'A'
AND tc.table_e_id = 4
AND (
(tba.target_id = tc.id AND tba.group_type = 'A')
OR tba.source_id = tc.id
);
SELECT ta.id AS table_a_id,
tc.id as table_c_id,
tba.id AS table_b_id
FROM test.table_b_active tba
INNER JOIN test.table_c tc ON tba.target_id = tc.id
INNER JOIN test.table_d td ON tc.table_d_id = td.id
LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date =
'2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
AND td.group = 'A'
AND tc.table_e_id = 4
AND (
(tba.target_id = tc.id AND tba.group_type = 'A')
OR tba.source_id = tc.id
);
I got:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=196.969..433.340 rows=31 loops=1)
   ->  Gather (actual time=196.897..432.757 rows=31 loops=1)
         Workers Planned: 1
         Workers Launched: 0
         ->  Nested Loop (actual time=196.384..432.197 rows=31 loops=1)
               ->  Hash Join (actual time=0.104..24.477 rows=1890 loops=1)
                     Hash Cond: (tc.table_d_id = td.id)
                     ->  Parallel Seq Scan on table_c tc (actual time=0.060..23.639 rows=3501 loops=1)
                           Filter: (table_e_id = 4)
                           Rows Removed by Filter: 98233
                     ->  Hash (actual time=0.022..0.023 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on table_d td (actual time=0.009..0.013 rows=8 loops=1)
                                 Filter: (group = 'A'::test.group)
                                 Rows Removed by Filter: 55
               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.212..0.215 rows=0 loops=1890)
                     Index Cond: (target_id = tc.id)
                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                     Rows Removed by Filter: 26
   ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.019..0.019 rows=1 loops=31)
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
 Planning Time: 5.084 ms
 Execution Time: 433.630 ms
table_a_id | table_c_id | table_b_id
------------+------------+------------
16116185 | 328860 | 2936924
16115788 | 348539 | 3039173
16115788 | 348539 | 2913874
16115788 | 348539 | 2913754
16116256 | 293541 | 2901938
16116256 | 293541 | 2901933
16116256 | 293541 | 2997160
16114789 | 292051 | 3038539
16114813 | 342353 | 3052371
16116069 | 351585 | 3025941
16114814 | 331329 | 2946332
16115861 | 350487 | 2933633
16115106 | 350047 | 2902075
16116066 | 351434 | 3010909
16114811 | 298605 | 2893809
16114811 | 298605 | 2987038
16114811 | 298605 | 3038877
16114811 | 298605 | 3010694
16114811 | 298605 | 2893188
16114811 | 298605 | 2893391
16114811 | 298605 | 2983360
16114811 | 298605 | 3038221
16114811 | 298605 | 3026078
16114811 | 298605 | 2998966
16114811 | 298605 | 2998909
16116260 | 290583 | 2955483
16116260 | 290583 | 2921135
16116260 | 290583 | 2947914
16116260 | 290583 | 2901669
16116241 | 295971 | 3038921
16116249 | 296708 | 3038888
(31 rows)
When I used force_parallel_mode = on, parallel_leader_participation = off, and removed the max_parallel_workers line from the script (so that the default applies), I got:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (actual time=271.255..530.713 rows=31 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  Single Copy: true
  ->  Nested Loop Left Join (actual time=200.434..437.321 rows=31 loops=1)
        ->  Nested Loop (actual time=200.126..436.547 rows=31 loops=1)
              ->  Hash Join (actual time=0.322..25.510 rows=1890 loops=1)
                    Hash Cond: (tc.table_d_id = td.id)
                    ->  Seq Scan on table_c tc (actual time=0.141..24.501 rows=3501 loops=1)
                          Filter: (table_e_id = 4)
                          Rows Removed by Filter: 98233
                    ->  Hash (actual time=0.082..0.084 rows=8 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Seq Scan on table_d td (actual time=0.049..0.066 rows=8 loops=1)
                                Filter: (group = 'A'::test.group)
                                Rows Removed by Filter: 55
              ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.214..0.217 rows=0 loops=1890)
                    Index Cond: (target_id = tc.id)
                    Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                    Rows Removed by Filter: 26
        ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.024..0.024 rows=1 loops=31)
              Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Planning Time: 5.641 ms
Execution Time: 530.926 ms
table_a_id | table_c_id | table_b_id
------------+------------+------------
16116185 | 328860 | 2936924
16115788 | 348539 | 3039173
16115788 | 348539 | 2913874
16115788 | 348539 | 2913754
16116256 | 293541 | 2901938
16116256 | 293541 | 2901933
16116256 | 293541 | 2997160
16114789 | 292051 | 3038539
16114813 | 342353 | 3052371
16116069 | 351585 | 3025941
16114814 | 331329 | 2946332
16115861 | 350487 | 2933633
16115106 | 350047 | 2902075
16116066 | 351434 | 3010909
16114811 | 298605 | 2893809
16114811 | 298605 | 2987038
16114811 | 298605 | 3038877
16114811 | 298605 | 3010694
16114811 | 298605 | 2893188
16114811 | 298605 | 2893391
16114811 | 298605 | 2983360
16114811 | 298605 | 3038221
16114811 | 298605 | 3026078
16114811 | 298605 | 2998966
16114811 | 298605 | 2998909
16116260 | 290583 | 2955483
16116260 | 290583 | 2921135
16116260 | 290583 | 2947914
16116260 | 290583 | 2901669
16116241 | 295971 | 3038921
16116249 | 296708 | 3038888
(31 rows)
The rows returned by the two queries are the same.
On Mon, Jan 7, 2019 at 11:23 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Tue, Jan 8, 2019 at 10:39 AM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> Gather (actual time=263.880..507.668 rows=31 loops=1)
>   Workers Planned: 1
>   Workers Launched: 1
>   Single Copy: true
>   ->  Nested Loop Left Join (actual time=193.246..415.020 rows=31 loops=1)
>         ->  Nested Loop (actual time=192.997..414.336 rows=31 loops=1)
>               ->  Hash Join (actual time=0.287..23.958 rows=1890 loops=1)
>                     Hash Cond: (tc.table_d_id = td.id)
>                     ->  Seq Scan on table_c tc (actual time=0.128..23.044 rows=3501 loops=1)
>                           Filter: (table_e_id = 4)
>                           Rows Removed by Filter: 98233
>                     ->  Hash (actual time=0.064..0.069 rows=8 loops=1)
>                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
>                           ->  Seq Scan on table_d td (actual time=0.045..0.052 rows=8 loops=1)
>                                 Filter: (group = 'A'::test.group)
>                                 Rows Removed by Filter: 55
>               ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.203..0.206 rows=0 loops=1890)
>                     Index Cond: (target_id = tc.id)
>                     Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
>                     Rows Removed by Filter: 26
>         ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.021..0.021 rows=1 loops=31)
>               Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Hmm. So if the plan runs entirely in a worker, we get 31 rows. If
the plan runs entirely in the leader, we get 31 rows. But if it runs
in both, we get fewer. In all cases the Hash Join produces 1890 rows,
and in all cases the Nested Loop probes table_b_idx_target_id for all
of them: "loops=1890". But somehow the output of the Nested Loop is
not 31 rows (or 31 / 2) when divided over two processes.
As a sanity check, can you please see if the run-only-in-leader case
(max_parallel_workers = 0, so that we see "Workers Launched: 0")
produces the *same* 31 rows as the run-only-in-worker case
(force_parallel_mode = on, so that we see "Single Copy: true")? That
is, the actual values of those 31 rows, in particular the columns
coming from table_b.
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jan 8, 2019 at 12:04 PM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> On Mon, Jan 7, 2019 at 11:23 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> As a sanity check, can you please see if the run-only-in-leader case
>> (max_parallel_workers = 0, so that we see "Workers Launched: 0")
>> produces the *same* 31 rows as the run-only-in-worker case
>> (force_parallel_mode = on, so that we see "Single Copy: true")? That
>> is, the actual values of those 31 rows, in particular the columns
>> coming from table_b.
>
> Rows returned by two queries are the same.

Can we please also see the actual output in the broken case, where it
runs in two processes and produces fewer than 31 rows?

--
Thomas Munro
http://www.enterprisedb.com
Query:
explain (costs false, analyze true)
SELECT ta.id  AS table_a_id,
       tc.id  as table_c_id,
       tba.id AS table_b_id
FROM test.table_b_active tba
  INNER JOIN test.table_c tc ON tba.target_id = tc.id
  INNER JOIN test.table_d td ON tc.table_d_id = td.id
  LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date = '2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
  AND td.group = 'A'
  AND tc.table_e_id = 4
  AND (
    (tba.target_id = tc.id AND tba.group_type = 'A')
    OR tba.source_id = tc.id
  );
SELECT ta.id  AS table_a_id,
       tc.id  as table_c_id,
       tba.id AS table_b_id
FROM test.table_b_active tba
  INNER JOIN test.table_c tc ON tba.target_id = tc.id
  INNER JOIN test.table_d td ON tc.table_d_id = td.id
  LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date = '2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
  AND td.group = 'A'
  AND tc.table_e_id = 4
  AND (
    (tba.target_id = tc.id AND tba.group_type = 'A')
    OR tba.source_id = tc.id
  );
Output:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (actual time=236.640..281.711 rows=26 loops=1)
  ->  Gather (actual time=236.578..304.194 rows=26 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Nested Loop (actual time=154.947..244.097 rows=13 loops=2)
              ->  Hash Join (actual time=0.223..14.649 rows=945 loops=2)
                    Hash Cond: (tc.table_d_id = td.id)
                    ->  Parallel Seq Scan on table_c tc (actual time=0.083..14.102 rows=1751 loops=2)
                          Filter: (table_e_id = 4)
                          Rows Removed by Filter: 49117
                    ->  Hash (actual time=0.044..0.044 rows=8 loops=2)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Seq Scan on table_d td (actual time=0.028..0.033 rows=8 loops=2)
                                Filter: (group = 'A'::test.group)
                                Rows Removed by Filter: 55
              ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.239..0.242 rows=0 loops=1890)
                    Index Cond: (target_id = tc.id)
                    Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                    Rows Removed by Filter: 26
  ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.020..0.020 rows=1 loops=26)
        Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Planning Time: 5.047 ms
Execution Time: 304.960 ms
table_a_id | table_c_id | table_b_id
------------+------------+------------
16116185 | 328860 | 2936924
16116256 | 293541 | 2901938
16115788 | 348539 | 3039173
16115788 | 348539 | 2913874
16115788 | 348539 | 2913754
16114813 | 342353 | 3052371
16114789 | 292051 | 3038539
16116069 | 351585 | 3025941
16115861 | 350487 | 2933633
16114814 | 331329 | 2946332
16115106 | 350047 | 2902075
16116260 | 290583 | 2955483
16116066 | 351434 | 3010909
16114811 | 298605 | 2893809
16114811 | 298605 | 2987038
16114811 | 298605 | 3038877
16114811 | 298605 | 3010694
16114811 | 298605 | 2893188
16114811 | 298605 | 2893391
16114811 | 298605 | 2983360
16114811 | 298605 | 3038221
16114811 | 298605 | 3026078
16114811 | 298605 | 2998966
16116249 | 296708 | 3038888
16114811 | 298605 | 2998909
16116241 | 295971 | 3038921
(26 rows)
Here's an example with only 21 rows:
table_a_id | table_c_id | table_b_id
------------+------------+------------
16116185 | 328860 | 2936924
16115788 | 348539 | 3039173
16115788 | 348539 | 2913874
16115788 | 348539 | 2913754
16116256 | 293541 | 2901938
16116256 | 293541 | 2901933
16116256 | 293541 | 2997160
16116069 | 351585 | 3025941
16114789 | 292051 | 3038539
16114813 | 342353 | 3052371
16114814 | 331329 | 2946332
16115861 | 350487 | 2933633
16116066 | 351434 | 3010909
16114811 | 298605 | 2893809
16115106 | 350047 | 2902075
16116241 | 295971 | 3038921
16116260 | 290583 | 2955483
16116249 | 296708 | 3038888
16116260 | 290583 | 2921135
16116260 | 290583 | 2947914
16116260 | 290583 | 2901669
(21 rows)
On Tue, Jan 8, 2019 at 12:39 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Tue, Jan 8, 2019 at 12:04 PM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> On Mon, Jan 7, 2019 at 11:23 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> As a sanity check, can you please see if the run-only-in-leader case
>> (max_parallel_workers = 0, so that we see "Workers Launched: 0")
>> produces the *same* 31 rows as the run-only-in-worker case
>> (force_parallel_mode = on, so that we see "Single Copy: true")? That
>> is, the actual values of those 31 rows, in particular the columns
>> coming from table_b.
>
> Rows returned by two queries are the same.
Can we please also see the actual output in the broken case, where it
runs in two processes and produces fewer than 31 rows?
--
Thomas Munro
http://www.enterprisedb.com
On Tue, 8 Jan 2019 at 13:09, Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> (26 rows)
>
> Here's an example with only 21 rows:

Interestingly, between those two results, all your 31 distinct rows
exist. I threw them into a table and did:

# select * from t1 full join t2 on t1.a=t2.a and t1.b=t2.b and t1.c=t2.c;
    a     |   b    |    c    |    a     |   b    |    c
----------+--------+---------+----------+--------+---------
 16114789 | 292051 | 3038539 | 16114789 | 292051 | 3038539
 16114811 | 298605 | 2893188 |          |        |
 16114811 | 298605 | 2893391 |          |        |
 16114811 | 298605 | 2893809 | 16114811 | 298605 | 2893809
 16114811 | 298605 | 2983360 |          |        |
 16114811 | 298605 | 2987038 |          |        |
 16114811 | 298605 | 2998909 |          |        |
 16114811 | 298605 | 2998966 |          |        |
 16114811 | 298605 | 3010694 |          |        |
 16114811 | 298605 | 3026078 |          |        |
 16114811 | 298605 | 3038221 |          |        |
 16114811 | 298605 | 3038877 |          |        |
 16114813 | 342353 | 3052371 | 16114813 | 342353 | 3052371
 16114814 | 331329 | 2946332 | 16114814 | 331329 | 2946332
 16115106 | 350047 | 2902075 | 16115106 | 350047 | 2902075
 16115788 | 348539 | 2913754 | 16115788 | 348539 | 2913754
 16115788 | 348539 | 2913874 | 16115788 | 348539 | 2913874
 16115788 | 348539 | 3039173 | 16115788 | 348539 | 3039173
 16115861 | 350487 | 2933633 | 16115861 | 350487 | 2933633
 16116066 | 351434 | 3010909 | 16116066 | 351434 | 3010909
 16116069 | 351585 | 3025941 | 16116069 | 351585 | 3025941
 16116185 | 328860 | 2936924 | 16116185 | 328860 | 2936924
 16116241 | 295971 | 3038921 | 16116241 | 295971 | 3038921
 16116249 | 296708 | 3038888 | 16116249 | 296708 | 3038888
          |        |         | 16116256 | 293541 | 2901933
 16116256 | 293541 | 2901938 | 16116256 | 293541 | 2901938
          |        |         | 16116256 | 293541 | 2997160
          |        |         | 16116260 | 290583 | 2901669
          |        |         | 16116260 | 290583 | 2921135
          |        |         | 16116260 | 290583 | 2947914
 16116260 | 290583 | 2955483 | 16116260 | 290583 | 2955483
(31 rows)

IOW, it does not seem like there's a particular pattern to what rows
are missing.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 8, 2019 at 1:31 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Tue, 8 Jan 2019 at 13:09, Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> > (26 rows)
> >
> > Here's an example with only 21 rows:
>
> Interestingly, between those two results, all your 31 distinct rows
> exist. I threw them into a table and did:
>
> # select * from t1 full join t2 on t1.a=t2.a and t1.b=t2.b and t1.c=t2.c;
> [...]
> (31 rows)
>
> IOW, it does not seem like there's a particular pattern to what rows
> are missing.

Is the pattern something like this? When the index probe of b should
find multiple matches, it's returning only one and then we move onto
the next outer row? For example we see this for table_c_id = 298605,
of which we see 11 examples in the 31-row output (with different
table_b_id values), but only one in the 21-row output, with the first
table_b_id value.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jan 8, 2019 at 1:40 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Jan 8, 2019 at 1:31 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > On Tue, 8 Jan 2019 at 13:09, Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> > > (26 rows)
> > >
> > > Here's an example with only 21 rows:
> >
> > Interestingly, between those two results, all your 31 distinct rows
> > exist. I threw them into a table and did:
> >
> > # select * from t1 full join t2 on t1.a=t2.a and t1.b=t2.b and t1.c=t2.c;
> > [...]
> > (31 rows)
> >
> > IOW, it does not seem like there's a particular pattern to what rows
> > are missing.
>
> Is the pattern something like this? When the index probe of b should
> find multiple matches, it's returning only one and then we move onto
> the next outer row? For example we see this for table_c_id = 298605,
> of which we see 11 examples in the 31-row output (with different
> table_b_id values), but only one in the 21-row output, with the first
> table_b_id value.

It's as if it thought it was executing a semi-join.

--
Thomas Munro
http://www.enterprisedb.com
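The semi-join-like behaviour being described (at most one inner match surviving per outer row) can be sketched with a toy nested loop in Python. The data here is hypothetical, with a single outer row having 11 inner matches to mirror table_c_id = 298605; it is not the reporter's schema or the real executor:

```python
# Toy nested loop: with single_match set, the join behaves like a semi-join,
# emitting at most one inner match per outer row.
def nested_loop(outer, inner, single_match=False):
    result = []
    for o in outer:
        for key, payload in inner:
            if key == o:  # join qual
                result.append((o, payload))
                if single_match:
                    break  # semi-join: move on after the first match
    return result

outer = [298605]
inner = [(298605, n) for n in range(11)]  # 11 matching inner rows

print(len(nested_loop(outer, inner)))                     # 11
print(len(nested_loop(outer, inner, single_match=True)))  # 1
```

With the full join we get all 11 matches; with semi-join semantics we get exactly one per outer row, which matches the shape of the shortened result sets.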
On Tue, 8 Jan 2019 at 13:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> It's as if it thought it was executing a semi-join.

EXPLAIN VERBOSE would show the inner_unique flag. If that's set it
would act as a semi-join.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 8, 2019 at 1:48 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Tue, 8 Jan 2019 at 13:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > It's as if it thought it was executing a semi-join.
>
> EXPLAIN VERBOSE would show the inner_unique flag. If that's set it
> would act as a semi-join.

Hmm, yes, node->js.single_match == true in one (but not both!) process
would fit these results. But I don't see a mechanism for that. I
*guess* it's the worker that is skipping duplicates, because the
leader usually has time to emit a few tuples while the worker's
warming up and we see some duplicates (348539) in the first few
results (at least it usually does on my machines for eager plans,
though maybe on Windows it's different?)

Bartosz, can we please try with force_parallel_mode = off, but also
parallel_leader_participation = off? (I meant to write that in an
earlier email but accidentally wrote "on". Gah.) Then we should get
the same plan (the forced parallel plan is a bit different as you
noted; maybe whatever is broken isn't triggered that way). You might
need to set parallel_workers to 2 on the table for it to pick a
parallel plan without leader participation.

--
Thomas Munro
http://www.enterprisedb.com
Sure.
SET force_parallel_mode = off;
SET parallel_leader_participation = off;
alter table test.table_c set (parallel_workers = 2);
explain (costs false, analyze true)
SELECT ta.id  AS table_a_id,
       tc.id  as table_c_id,
       tba.id AS table_b_id
FROM test.table_b_active tba
  INNER JOIN test.table_c tc ON tba.target_id = tc.id
  INNER JOIN test.table_d td ON tc.table_d_id = td.id
  LEFT JOIN test.table_a ta ON ta.table_c_id = tc.id AND ta.date = '2018-08-31' :: DATE
WHERE tba.date BETWEEN '2018-08-10' :: DATE AND '2018-09-01' :: DATE
  AND td.group = 'A'
  AND tc.table_e_id = 4
  AND (
    (tba.target_id = tc.id AND tba.group_type = 'A')
    OR tba.source_id = tc.id
  );
Output:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (actual time=203.910..346.758 rows=31 loops=1)
  ->  Gather (actual time=203.802..373.228 rows=31 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Nested Loop (actual time=157.358..253.888 rows=16 loops=2)
              ->  Hash Join (actual time=0.365..15.749 rows=945 loops=2)
                    Hash Cond: (tc.table_d_id = td.id)
                    ->  Parallel Seq Scan on table_c tc (actual time=0.113..15.043 rows=1751 loops=2)
                          Filter: (table_e_id = 4)
                          Rows Removed by Filter: 49117
                    ->  Hash (actual time=0.078..0.078 rows=8 loops=2)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Seq Scan on table_d td (actual time=0.058..0.065 rows=8 loops=2)
                                Filter: (group = 'A'::test.group)
                                Rows Removed by Filter: 55
              ->  Index Scan using table_b_idx_target_id on table_b (actual time=0.248..0.252 rows=0 loops=1890)
                    Index Cond: (target_id = tc.id)
                    Filter: ((date >= '2018-08-10'::date) AND (date <= '2018-09-01'::date) AND (((target_id = tc.id) AND (group_type = 'A'::test.group_type)) OR (source_id = tc.id)))
                    Rows Removed by Filter: 26
  ->  Index Scan using table_a_uq_001 on table_a ta (actual time=0.030..0.030 rows=1 loops=31)
        Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))
Planning Time: 4.923 ms
Execution Time: 374.408 ms
31 rows - correct result.
On Tue, Jan 8, 2019 at 2:30 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Tue, Jan 8, 2019 at 1:48 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On Tue, 8 Jan 2019 at 13:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > It's as if it thought it was executing a semi-join.
>
> EXPLAIN VERBOSE would show the inner_unique flag. If that's set it
> would act as a semi-join.
Hmm, yes, node->js.single_match == true in one (but not both!) process
would fit these results. But I don't see a mechanism for that. I
*guess* it's the worker that is skipping duplicates, because the
leader usually has time to emit a few tuples while the worker's
warming up and we see some duplicates (348539) in first few results
(at least it usually does on my machines for eager plans, though maybe
on Windows it's different?)
Bartosz, can we please try with force_parallel_mode = off, but also
parallel_leader_participation = off? (I meant to write that in an
earlier email but accidentally wrote "on". Gah.) Then we should get
the same plan (the forced parallel plan is a bit different as you
noted; maybe whatever is broken isn't triggered that way). You might
need to set parallel_workers to 2 on the table for it to pick a
parallel plan without leader participation.
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jan 8, 2019 at 10:45 PM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> SET force_parallel_mode = off;
> SET parallel_leader_participation = off;
>
> alter table_c set (parallel_workers=2);
> ...
>
> 31 rows - correct result.

Huh. So now we have:

1. Leader only (max_parallel_workers = 0): OK
2. Worker only (parallel_leader_participation = off, parallel_workers = 2): OK
3. Leader + worker: broken, short results
4. Worker only with slightly different plan (force_parallel_mode = on): OK

Ahh, I just reproduced this. Take mockup_bug15577.sql (Bartosz's
version) and then do this:

insert into table_c select * from table_c;

insert into table_b
select generate_series(1000000, 1000010), 100112, 'A', 99888,
'2018-09-01', true,
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';

Now the number of results from the query in the script is not stable,
varying between 235 and 246 results for me. Not sure what's going on
yet.

--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jan 9, 2019 at 2:52 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Now the number of results from the query in the script is not stable,
> varying between 235 and 246 results for me. Not sure what's going on
> yet.

When there are multiple matches found by the Index Scan, sometimes
only the first one makes it through to the Nested Loop, in the leader.
It's not because of single_match being inappropriately set as I
speculated earlier (though that'd behave similarly I guess), but
because the quals are somehow broken in between invocations, so that
ExecScan() filters the rest of the matches out, and then fixed again
on the next rescan. This happens whenever the leader takes a break to
read a tuple from parallel workers in between executing the Nested
Loop node, which it often does after reading the first of several
tuples resulting from a single rescan.

What exactly is being corrupted and how, I don't yet know, and I need
to leave this here for today, but that's what I've got so far.

--
Thomas Munro
http://www.enterprisedb.com
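The failure mode described here (a filter qual that re-checks a parameter whose value changes between fetches of the same rescan) can be modelled with a small Python generator. The param slot, the values, and the clobbering step are illustrative stand-ins for the executor's exec-param array, not real PostgreSQL code:

```python
# Toy model of the corruption: the index scan's filter re-checks a shared
# parameter slot, so if anything overwrites that slot between fetches of the
# same rescan, the remaining matches are filtered out.
param = {0: None}  # one shared exec-param slot, like a nestloop paramno

def index_scan(rows):
    for target_id, payload in rows:
        # redundant filter qual: target_id = $0 (already enforced by the index)
        if target_id == param[0]:
            yield payload

rows = [(100112, 'a'), (100112, 'b'), (100112, 'c')]

param[0] = 100112       # the nestloop sets the param for this rescan
scan = index_scan(rows)
got = [next(scan)]      # the first match makes it out
param[0] = 98662        # leader services a worker tuple, clobbering the slot
got += list(scan)       # the later matches now fail the re-check
print(got)              # only 'a' survives
```

Only the first of the three matching rows is emitted, exactly the "first tuple made it out, the rest were dropped" shape seen in the report.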
Thomas, I tried to reproduce this case with the changes you listed, but failed to do so - the number of results was the same on each run. Could you post an updated script as an attachment?
On Wed, Jan 9, 2019 at 5:01 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Wed, Jan 9, 2019 at 2:52 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Now the number of results from the query in the script is not stable,
> varying between 235 and 246 results for me. Not sure what's going on
> yet.
When there are multiple matches found by the Index Scan, sometimes
only the first one makes it through to the Nested Loop, in the leader.
It's not because of single_match being inappropriately set as I
speculated earlier (though that'd behave similarly I guess) but
because the quals are somehow broken in between invocations so that
ExecScan() filters the rest of the matches out, and then fixed again
on the next rescan. This happens whenever the leader takes a break to
read a tuple from parallel workers in between executing the Nested Loop
node, which it often does after reading the first of several tuples
resulting from a single rescan.
What exactly is being corrupted and how, I don't yet know, and I need
to leave this here for today, but that's what I've got so far.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, 9 Jan 2019 at 17:01, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Jan 9, 2019 at 2:52 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > Now the number of results from the query in the script is not stable,
> > varying between 235 and 246 results for me. Not sure what's going on
> > yet.
>
> When there are multiple matches found by the Index Scan, sometimes
> only the first one makes it through to the Nested Loop, in the leader.
> It's not because of single_match being inappropriately set as I
> speculated earlier (though that'd behave similarly I guess), but
> because the quals are somehow broken in between invocations, so that
> ExecScan() filters the rest of the matches out, and then fixed again
> on the next rescan. This happens whenever the leader takes a break to
> read a tuple from parallel workers in between executing the Nested
> Loop node, which it often does after reading the first of several
> tuples resulting from a single rescan.
>
> What exactly is being corrupted and how, I don't yet know, and I need
> to leave this here for today, but that's what I've got so far.

From what I can see, it seems to be the parameter for the index scan
that gets the wrong value. I changed int4eq to become:

Datum
int4eq(PG_FUNCTION_ARGS)
{
    int32 arg1 = PG_GETARG_INT32(0);
    int32 arg2 = PG_GETARG_INT32(1);

    if (arg1 == 100112 || arg2 == 100112)
        elog(NOTICE, "%d %d", arg1, arg2);

    PG_RETURN_BOOL(arg1 == arg2);
}

When the query returns the correct number of rows I only see NOTICEs
with:

NOTICE:  100112 100112

but when the wrong results are returned I see a few such as:

NOTICE:  100112 57772

If I put a conditional breakpoint on the elog with the condition
"arg2 != 100112", I see that the arg2 value is coming from the nested
loop's parameter.

If I modify the query to remove the redundant tba.target_id = tc.id
from the WHERE clause, then I can't trigger the issue. I.e.

    AND (
        (tba.target_id = tc.id AND tba.group_type = 'A')
        OR tba.source_id = tc.id
    );

becomes:

    AND (
        tba.group_type = 'A'
        OR tba.source_id = tc.id
    );

so the parameter does not need to have its value checked again during
the qual check. (If Bartosz is looking for a quick fix, he should be
able to alter the query to that.)

The index scan should have already ensured that tba.target_id = tc.id,
so the nested loop parameter must have had the correct value at the
start of the index scan, but somehow it must have changed sometime
after the first tuple was returned from the index scan and before we
asked the index for the next tuple (hence why it looked like a
semi-join's result). I've just not managed to find out how the
parameter value gets changed. The value does not seem to have been
stomped on, as each time I see the incorrect parameter value, it's set
to something else that occurs in the table.

At the moment I'm a bit suspicious of the following code in
ParallelQueryMain():

    paramexec_space = dsa_get_address(area, fpes->param_exec);
    RestoreParamExecParams(paramexec_space, queryDesc->estate);

On first look, it seems that the exec params are in shared memory?
... but I need more time to figure out how this all fits together.

I've attached the updated script per Thomas' alterations to make it
break.

The breakage is very old. I found it existed when parallel joins were
added, but I'm not necessarily blaming that commit.

45be99f8cd Support parallel joins, and make related improvements. -- BAD
a7de3dc5c3 Support multi-stage aggregation. -- GOOD

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On Thu, Jan 10, 2019 at 5:35 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Wed, 9 Jan 2019 at 17:01, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > What exactly is being corrupted and how, I don't yet know, and I need
> > to leave this here for today, but that's what I've got so far.
>
> From what I can see it seems to be the parameter for the index scan
> that gets the wrong value.
>
> [explanation about redundant qual tba.source_id = tc.id and corrupted parameter]

Oh, yeah, that'd do it. Here are some excerpts from my printf
debugging, all from the leader process:

[51461] nestloop plan id 2 setting paramno 0 to value 100112
[51461] IndexScan returning a tuple (att1 = 1000010, att2 = 100112)
[51461] --- NestLoop got inner att2 = 100112   <--- FIRST TUPLE MADE IT OUT
[51461] nestloop plan id 0 setting paramno 0 to value 100112
[51461] gather will read tuple from queue
[51461] nestloop plan id 0 setting paramno 0 to value 98662   <--- OOPS
[51461] IndexScan returning a tuple (att1 = 1000009, att2 = 100112)
[51461] dropping a tuple (att1 = 1000009, att2 = 100112)   <--- NEXT ONE DIDN'T
[51461] IndexScan returning a tuple (att1 = 1000008, att2 = 100112)
[51461] dropping a tuple (att1 = 1000008, att2 = 100112)   <--- DITTO
...

So it's not the reading of the tuple queue per se, but the resulting
visit to a higher part of the plan when the Gather node emits a tuple.
It can't be right that plan id 2 and plan id 0 are both using paramno
0, can it? I'm not too familiar with the planner code that allocates
those but will go and poke at it.

--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jan 9, 2019 at 8:48 PM Bartosz Polnik <bartoszpolnik@gmail.com> wrote:
> Thomas, I tried to reproduce this case with the changes you listed, but failed to do so - the number of results was the same on each run. Could you post an updated script as an attachment?

Probably due to timing of parallel worker startup. Different
computers, operating systems, alignment of the stars, etc. Debugging
parallel query problems is fun like that.

--
Thomas Munro
http://www.enterprisedb.com
>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:

 Thomas> So it's not the reading of the tuple queue per se, but the
 Thomas> resulting visit to a higher part of the plan when the Gather
 Thomas> node emits a tuple. It can't be right that plan id 2 and plan
 Thomas> id 0 are both using paramno 0, can it? I'm not too familiar
 Thomas> with the planner code that allocates those but will go and poke
 Thomas> at it.

Looks like this isn't a new bug, either; I can reproduce it on 10.x.

So I think you're right to finger the planner's allocation of params as the culprit.  Looking at replace_nestloop_params_mutator, it's keeping a list of nestloop params in root->curOuterParams (thus, globally to the whole subquery being planned), and it assumes that if equal() is true between the value of an existing nestloop param (which must be a Var) and the Var for the one it's trying to create, then it's ok to use the existing one.

But clearly this can't work if one param is referenced both inside and outside a Gather, because while they will compare equal for Vars, they won't actually have the same value thanks to rows coming in from workers.

--
Andrew (irc:RhodiumToad)
On Thu, Jan 10, 2019 at 10:04 AM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> So I think you're right to finger the planner's allocation of params as
> the culprit.  Looking at replace_nestloop_params_mutator, it's keeping a
> list of nestloop params in root->curOuterParams (thus, globally to the
> whole subquery being planned), and it assumes that if equal() is true
> between the value of an existing nestloop param (which must be a Var)
> and the Var for the one it's trying to create, then it's ok to use the
> existing one.
>
> But clearly this can't work if one param is referenced both inside and
> outside a Gather, because while they will compare equal for Vars, they
> won't actually have the same value thanks to rows coming in from
> workers.

But if they used different params, there could be different problems, no?  It's logically the same var.  This makes me wonder if we need some kind of scheme for saving and restoring affected params whenever Gather switches between executing the plan directly and emitting tuples from workers, or something like that...

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> So it's not the reading of the tuple queue per se, but the resulting
> visit to a higher part of the plan when the Gather node emits a tuple.
> It can't be right that plan id 2 and plan id 0 are both using paramno
> 0, can it?

So the plan shape, shorn of uninteresting detail, is

 Nested Loop Left Join
   ->  Gather
         ->  Nested Loop
               ->  Hash Join
                     ->  Parallel Seq Scan on table_c tc
                     ->  Hash
                           ->  Seq Scan on table_d td
               ->  Index Scan ... on table_b tba
                     Index Cond: (target_id = tc.id)
   ->  Index Scan ... on table_a ta
         Index Cond: ((table_c_id = tc.id) AND (date = '2018-08-31'::date))

The lower Nested Loop has a NestLoopParam node telling it to pass the value of tc.id down to the tba indexscan.  The upper one also has a NestLoopParam node, telling it to pass the value of tc.id down to the ta indexscan.  Since these are for the same Var, the planner believes it's safe to use the same PARAM_EXEC slot for both of them.  And in a normal plan tree, it'd be right.

I think what is happening is

(1) Leader runs the lower nestloop for a bit and gets back a successful tc/tba join row.  At this point its PARAM_EXEC slot is filled with the tc.id value from that tc row.

(2) It hands this up to the upper nestloop, which assigns that same tc.id to the PARAM_EXEC slot ... no trouble so far ... and runs the ta indexscan with that.

(3) Now we go back to the Gather, which I think must go off and look for rows coming from child workers before it resumes the leader's nestloop.  It gets some row, with a different tc.id value, and hands that back up.

(4) Upper nestloop assigns the tc.id value from the child's row to the PARAM_EXEC slot, and runs the ta indexscan with that.

(5) Return to the Gather, which now allows the lower nestloop to proceed in the leader.

(6) Lower nestloop thinks it's not done with the current outer (tc) row, so it invokes the tba indexscan to see if that can return any more rows.  But since it hasn't advanced to a new outer row, it doesn't touch the NestLoopParam or PARAM_EXEC slot.

(7) At this point we're resuming the tba indexscan with a different value for tc.id than it was started with.  The results of that are undefined, IMO, but what's apparently happening is that it fails to match some rows that it otherwise would have matched.  So the end result is some rows are missing from the output.

One way we could deal with this, perhaps, is to decide that NestLoopParams above and below a Gather can't share PARAM_EXEC slots.  I'm not sure how we'd mechanize that, other than not allowing NestLoopParams to share PARAM_EXEC slots *ever*, which might not really be that much of a cost.

But TBH the whole thing frightens me quite a lot as to what other serial-processing assumptions are getting broken by plopping Gather into the middle of a plan tree.  I wonder whether we'd not be best off disallowing this sort of plan shape, and/or forbidding the leader from also executing the parallelized subplan.

			regards, tom lane
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Thu, Jan 10, 2019 at 10:04 AM Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
>> But clearly this can't work if one param is referenced both inside and
>> outside a Gather, because while they will compare equal for Vars, they
>> won't actually have the same value thanks to rows coming in from
>> workers.

> But if they used different params, there could be different problems,
> no?  It's logically the same var.

As far as I can think at the moment, there's no problem with having multiple nestloop Params referencing the "same" Var.  It could be an impediment to optimization if it happened (much) earlier in the planner, but for the situation at hand the only code that's going to be looking at the tree is the executor and maybe ruleutils, both of which are much too stupid to be bothered by such aliasing.

> This makes me wonder if we need
> some kind of scheme for saving and restoring affected params whenever
> Gather switches between executing the plan directly and emitting
> tuples from workers, or something like that...

I don't think we need to (or should) go there if this is the only problem.  What's worrying me is what other assumptions based on serial plan execution are getting broken by injecting Gather into mid levels of a plan tree.

			regards, tom lane
On Thu, Jan 10, 2019 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > On Thu, Jan 10, 2019 at 10:04 AM Andrew Gierth
> > <andrew@tao11.riddles.org.uk> wrote:
> >> But clearly this can't work if one param is referenced both inside and
> >> outside a Gather, because while they will compare equal for Vars, they
> >> won't actually have the same value thanks to rows coming in from
> >> workers.
>
> > But if they used different params, there could be different problems,
> > no?  It's logically the same var.
>
> As far as I can think at the moment, there's no problem with having
> multiple nestloop Params referencing the "same" Var.  It could be an
> impediment to optimization if it happened (much) earlier in the planner,
> but for the situation at hand the only code that's going to be looking
> at the tree is the executor and maybe ruleutils, both of which are much
> too stupid to be bothered by such aliasing.

The index scan does actually emit all the tuples it should in your paragraph "(7)" (maybe why nobody ever noticed this problem before), but in this case there is also an extra (redundant?) qual referencing the param, so ExecScan()'s call to ExecQual() returns false after the other Nest Loop tramples on it, and the tuples are filtered out (I showed that as "dropped" in my printf-debugging excerpt up-thread).  We'd have to make sure that the qual references the param that is set by this join and not its evil twin.  I'm confused about how that and any other references to the Var would work, but as you can probably tell I don't have a great grip on the Var/param system and the relevant optimisation phases yet.

Hmm.  Why are those ExecQual() -> false cases not showing up as variation in the "Rows Removed by Filter" counter visible in EXPLAIN ANALYZE?  Then we might have arrived here a lot faster.  InstrCountFiltered1(node, 1) is executed, but somehow the count doesn't make it into the total shown by EXPLAIN.

> > This makes me wonder if we need
> > some kind of scheme for saving and restoring affected params whenever
> > Gather switches between executing the plan directly and emitting
> > tuples from workers, or something like that...
>
> I don't think we need to (or should) go there if this is the only
> problem.  What's worrying me is what other assumptions based on serial
> plan execution are getting broken by injecting Gather into mid levels
> of a plan tree.

parallel_leader_participation = on is a many-headed serpentine beast.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, 10 Jan 2019 at 10:04, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Looks like this isn't a new bug, either; I can reproduce it on 10.x.

Yeah, introduced in 9.6; I wrote:

> 45be99f8cd Support parallel joins, and make related improvements. -- BAD
> a7de3dc5c3 Support multi-stage aggregation. -- GOOD

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 10 Jan 2019 at 12:28, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Hmm.  Why are those ExecQual() -> false cases not showing up as
> variation in the "Rows Removed by Filter" counter visible in EXPLAIN
> ANALYZE?  Then we might have arrived here a lot faster.
> InstrCountFiltered1(node, 1) is executed, but somehow the count
> doesn't make it into the total shown by EXPLAIN.

Good question.  They are tracked by the instrumentation; it's just that show_instrumentation_count() divides by nloops, so the additional filtered rows are lost to the rounding.  Removing the / nloops shows you what you want to see.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 9, 2019 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (7) At this point we're resuming the tba indexscan with a different
> value for tc.id than it was started with.  The results of that are
> undefined, IMO, but what's apparently happening is that it fails to
> match some rows that it otherwise would have matched.  So the end
> result is some rows are missing from the output.

Oops.

> One way we could deal with this, perhaps, is to decide that NestLoopParams
> above and below a Gather can't share PARAM_EXEC slots.  I'm not sure how
> we'd mechanize that, other than not allowing NestLoopParams to share
> PARAM_EXEC slots *ever*, which might not really be that much of a cost.
>
> But TBH the whole thing frightens me quite a lot as to what other
> serial-processing assumptions are getting broken by plopping Gather
> into the middle of a plan tree.  I wonder whether we'd not be best
> off disallowing this sort of plan shape, and/or forbidding the leader
> from also executing the parallelized subplan.

FWIW, I would be inclined to blame this on me not really understanding the Param machinery very well rather than any more general brand of insanity.  I wouldn't be a bit shocked if this is not the last bit of Param-related dumbness that parallel query has.  I did *try* to get it right, but I find that stuff to be pretty arcane and not entirely well-documented.

As to disallowing "this plan shape", both leader participation and the ability to generate this kind of plan are there because they improve performance substantially on some queries.  I put a heck of a lot of energy into being able to parallelize only part of a query, and into allowing for leader participation, and I believe those things make a real difference.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 10 Jan 2019 at 15:06, Robert Haas <robertmhaas@gmail.com> wrote:
> As to disallowing "this plan shape", both leader participation and the
> ability to generate this kind of plan are there because they improve
> performance substantially on some queries.  I put a heck of a lot of
> energy into being able to parallelize only part of a query, and into
> allowing for leader participation, and I believe those things make a
> real difference.

I 100% agree that this shape of plan is useful.  I guess, though, that whether we keep allowing it or not depends on whether a suitable fix appears for the issue.

One idea that I've been thinking about is some sort of param context number that we bump during createplan when we encounter a Gather/GatherMerge path, and have assign_param_for_var() pay attention to that number when deciding if sharing a parameter is okay.  How exactly that context number is calculated and how we pass the value around I've currently no idea, so please consider this very handwavy.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 9, 2019 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> One way we could deal with this, perhaps, is to decide that NestLoopParams
>> above and below a Gather can't share PARAM_EXEC slots.  I'm not sure how
>> we'd mechanize that, other than not allowing NestLoopParams to share
>> PARAM_EXEC slots *ever*, which might not really be that much of a cost.
>>
>> But TBH the whole thing frightens me quite a lot as to what other
>> serial-processing assumptions are getting broken by plopping Gather
>> into the middle of a plan tree.  I wonder whether we'd not be best
>> off disallowing this sort of plan shape, and/or forbidding the leader
>> from also executing the parallelized subplan.

> FWIW, I would be inclined to blame this on me not really understanding
> the Param machinery very well rather than any more general brand of
> insanity.  I wouldn't be a bit shocked if this is not the last bit of
> Param-related dumbness that parallel query has.  I did *try* to get it
> right, but I find that stuff to be pretty arcane and not entirely
> well-documented.

Well, if we posit that the problems are limited to false Param-sharing, my inclination is to fix it by simplifying rather than adding complexity.  To wit, let's just assign a new PARAM_EXEC slot for every NestLoopParam and never try to share them.  The argument that sharing them would work for NLPs representing the same Var hinged on the assumption that nested NestLoops would necessarily execute "in sync", which we now see is false at least when Gathers are in between.

This'll cost some more PARAM_EXEC slots, of course, but AFAICS the cost of that is trivial, just a few more bytes per slot.  It's arguable that the cost of trying to de-duplicate them at plan time is more than we can save at execution :-(

I'm about fried for tonight, but I can take a look at this tomorrow.

			regards, tom lane
I wrote:
> Well, if we posit that the problems are limited to false Param-sharing,
> my inclination is to fix it by simplifying rather than adding complexity.
> To wit, let's just assign a new PARAM_EXEC slot for every NestLoopParam
> and never try to share them.  The argument that sharing them would work
> for NLPs representing the same Var hinged on the assumption that nested
> NestLoops would necessarily execute "in sync", which we now see is false
> at least when Gathers are in between.

Here's a POC patch for that.  It's seriously ugly, but it passes check-world and I can no longer reproduce the problem using mockup_bug15577_v2.sql.

At this point assign_nestloop_param_var and assign_nestloop_param_placeholdervar are dead code, and there's a bunch of incorrect comments in subselect.c, and I really need to refactor the division of labor between createplan.c and subselect.c (for one thing, this is an abuse of the documented purpose of SS_make_initplan_output_param).  But functionally I think it does the right thing.  Please test and verify that you no longer see the race condition.

			regards, tom lane

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 066685c..6ce5938 100644
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** replace_nestloop_params_mutator(Node *no
*** 4360,4379 ****
  		/* If not to be replaced, we can just return the Var unmodified */
  		if (!bms_is_member(var->varno, root->curOuterRels))
  			return node;
! 		/* Create a Param representing the Var */
! 		param = assign_nestloop_param_var(root, var);
! 		/* Is this param already listed in root->curOuterParams? */
  		foreach(lc, root->curOuterParams)
  		{
  			nlp = (NestLoopParam *) lfirst(lc);
! 			if (nlp->paramno == param->paramid)
  			{
! 				Assert(equal(var, nlp->paramval));
! 				/* Present, so we can just return the Param */
  				return (Node *) param;
  			}
  		}
! 		/* No, so add it */
  		nlp = makeNode(NestLoopParam);
  		nlp->paramno = param->paramid;
  		nlp->paramval = var;
--- 4360,4389 ----
  		/* If not to be replaced, we can just return the Var unmodified */
  		if (!bms_is_member(var->varno, root->curOuterRels))
  			return node;
! 		/* Is this Var already listed in root->curOuterParams? */
  		foreach(lc, root->curOuterParams)
  		{
  			nlp = (NestLoopParam *) lfirst(lc);
! 			if (equal(var, nlp->paramval))
  			{
! 				/* Yes, so just make a Param referencing this NLP's slot */
! 				param = makeNode(Param);
! 				param->paramkind = PARAM_EXEC;
! 				param->paramid = nlp->paramno;
! 				param->paramtype = var->vartype;
! 				param->paramtypmod = var->vartypmod;
! 				param->paramcollid = var->varcollid;
! 				param->location = var->location;
  				return (Node *) param;
  			}
  		}
! 		/* No, so assign a PARAM_EXEC slot for a new NLP */
! 		param = SS_make_initplan_output_param(root,
! 											  var->vartype,
! 											  var->vartypmod,
! 											  var->varcollid);
! 		param->location = var->location;
! 		/* Add it to the list of NLPs */
  		nlp = makeNode(NestLoopParam);
  		nlp->paramno = param->paramid;
  		nlp->paramval = var;
*************** replace_nestloop_params_mutator(Node *no
*** 4423,4442 ****
  											root);
  			return (Node *) newphv;
  		}
! 		/* Create a Param representing the PlaceHolderVar */
! 		param = assign_nestloop_param_placeholdervar(root, phv);
! 		/* Is this param already listed in root->curOuterParams? */
  		foreach(lc, root->curOuterParams)
  		{
  			nlp = (NestLoopParam *) lfirst(lc);
! 			if (nlp->paramno == param->paramid)
  			{
! 				Assert(equal(phv, nlp->paramval));
! 				/* Present, so we can just return the Param */
  				return (Node *) param;
  			}
  		}
! 		/* No, so add it */
  		nlp = makeNode(NestLoopParam);
  		nlp->paramno = param->paramid;
  		nlp->paramval = (Var *) phv;
--- 4433,4461 ----
  											root);
  			return (Node *) newphv;
  		}
! 		/* Is this PHV already listed in root->curOuterParams? */
  		foreach(lc, root->curOuterParams)
  		{
  			nlp = (NestLoopParam *) lfirst(lc);
! 			if (equal(phv, nlp->paramval))
  			{
! 				/* Yes, so just make a Param referencing this NLP's slot */
! 				param = makeNode(Param);
! 				param->paramkind = PARAM_EXEC;
! 				param->paramid = nlp->paramno;
! 				param->paramtype = exprType((Node *) phv->phexpr);
! 				param->paramtypmod = exprTypmod((Node *) phv->phexpr);
! 				param->paramcollid = exprCollation((Node *) phv->phexpr);
! 				param->location = -1;
  				return (Node *) param;
  			}
  		}
! 		/* No, so assign a PARAM_EXEC slot for a new NLP */
! 		param = SS_make_initplan_output_param(root,
! 											  exprType((Node *) phv->phexpr),
! 											  exprTypmod((Node *) phv->phexpr),
! 											  exprCollation((Node *) phv->phexpr));
! 		/* Add it to the list of NLPs */
  		nlp = makeNode(NestLoopParam);
  		nlp->paramno = param->paramid;
  		nlp->paramval = (Var *) phv;
On Fri, Jan 11, 2019 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Well, if we posit that the problems are limited to false Param-sharing,
> > my inclination is to fix it by simplifying rather than adding complexity.
> > To wit, let's just assign a new PARAM_EXEC slot for every NestLoopParam
> > and never try to share them.  The argument that sharing them would work
> > for NLPs representing the same Var hinged on the assumption that nested
> > NestLoops would necessarily execute "in sync", which we now see is false
> > at least when Gathers are in between.
>
> Here's a POC patch for that.  It's seriously ugly, but it passes
> check-world and I can no longer reproduce the problem using
> mockup_bug15577_v2.sql.
>
> At this point assign_nestloop_param_var and
> assign_nestloop_param_placeholdervar are dead code, and there's a bunch
> of incorrect comments in subselect.c, and I really need to refactor
> the division of labor between createplan.c and subselect.c (for one
> thing, this is an abuse of the documented purpose of
> SS_make_initplan_output_param).  But functionally I think it does the
> right thing.  Please test and verify that you no longer see the race
> condition.

I no longer see it here.  I can also see that the Nested Loop with plan ID 0 is using paramno 2, and the Nested Loop with plan ID 2 is using paramno 0.

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Fri, Jan 11, 2019 at 6:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> At this point assign_nestloop_param_var and
>> assign_nestloop_param_placeholdervar are dead code, and there's a bunch
>> of incorrect comments in subselect.c, and I really need to refactor
>> the division of labor between createplan.c and subselect.c (for one
>> thing, this is an abuse of the documented purpose of
>> SS_make_initplan_output_param).  But functionally I think it does the
>> right thing.  Please test and verify that you no longer see the race
>> condition.

> I no longer see it here.

Thanks for checking!

I'm having some difficulty choosing what to do refactoring-wise.  There are a couple of conflicting considerations:

* Currently, subselect.c is in charge of assigning PARAM_EXEC slots; in particular, nothing else touches root->glob->paramExecTypes.  I'm kind of loath to give that up.

* On the other hand, root->curOuterParams is currently only manipulated by createplan.c, and if we could keep that as a local data structure, that'd be nice too.

* However, if we stick to both of those constraints then we're forced into more or less what the POC patch does.  We could provide another subselect.c function, say SS_make_nestloop_param, which'd just wrap generate_new_param the same as SS_make_initplan_output_param, but we still have a pretty weird division of labor IMO.

The fundamental issue here is that it's now going to be the state of the curOuterParams list that determines whether a new PARAM_EXEC slot is needed.  Really that list serves the same sort of purpose as the root->plan_params list, but its entries have very different lifespans than the entries in plan_params.  So there's a reasonable case to be made that we should put management of curOuterParams into subselect.c, except that (a) it's a bit far afield from sub-selects, and (b) I'm not sure how completely it could be decoupled from createplan.c.

If this were a HEAD-only patch I'd be tempted to do like Alvaro just did and move all the PARAM_EXEC assignment logic and root->plan_params and root->curOuterParams manipulation into a new file, say optimizer/util/paramassign.c.  But that would be a little invasive for a back-patch.

Thoughts?

			regards, tom lane
I wrote:
> I'm having some difficulty choosing what to do refactoring-wise.
> ...
> If this were a HEAD-only patch I'd be tempted to do like Alvaro just did
> and move all the PARAM_EXEC assignment logic and root->plan_params
> and root->curOuterParams manipulation into a new file, say
> optimizer/util/paramassign.c.  But that would be a little invasive
> for a back-patch.

Actually, maybe it wouldn't be that bad.  The functions I'd want to move into a new file are currently mostly static in subselect.c.  Moving them elsewhere (and consequently making them not-static) would therefore not break any existing API.  The exceptions are:

SS_assign_special_param

We could keep a wrapper by this name in the back branches, and thereby avoid any API/ABI break for extension code using it (which it seems somewhat likely that there might be).

SS_make_initplan_output_param

Although this'd just be a one-line wrapper for functionality exported by the hypothetical new file, it's somewhat closely related to SS_make_initplan_from_plan.  So keeping it, with its current name, isn't outlandish.

assign_nestloop_param_var
assign_nestloop_param_placeholdervar

These are going to go away, or else change API substantially, in any case.  Hopefully there's no extension code using them.

So what I'm now contemplating is moving these existing functions to a new file:

int assign_param_for_var(PlannerInfo *root, Var *var)
Param *replace_outer_var(PlannerInfo *root, Var *var)
int assign_param_for_placeholdervar(PlannerInfo *root, PlaceHolderVar *phv)
Param *replace_outer_placeholdervar(PlannerInfo *root, PlaceHolderVar *phv)
Param *replace_outer_agg(PlannerInfo *root, Aggref *agg)
Param *replace_outer_grouping(PlannerInfo *root, GroupingFunc *grp)
Param *generate_new_param(PlannerInfo *root, Oid paramtype, int32 paramtypmod, Oid paramcollation)
int assign_special_param(PlannerInfo *root)  (rename of SS_assign_special_param)

along with some new functions for nestloop parameters (extracted from createplan.c logic).  assign_param_for_var and assign_param_for_placeholdervar could remain "static" since they aren't called from anywhere else.  (Actually I guess they'd end up getting merged back into the replace_outer_ functions, since they'll now have just one caller apiece.)

A variant idea would involve also moving replace_correlation_vars_mutator, which would allow the individual replace_outer_xxx functions to remain static.

A different idea is to just move the assign_param_for_xxx functionality, which'd require breaking replace_outer_agg and replace_outer_grouping down into two functions apiece, since they currently combine creation of plan_params items with creation of the referencing Param nodes.

Anybody have a preference?

			regards, tom lane