I am having a problem with the performance of a
view that will be a critical part of a database system
I am working on, and would appreciate some advice.
Apologies for the length of this posting!
I have a parent table P and to child tables A and B,:
CREATE TABLE p ( id INT NOT NULL PRIMARY KEY);
CREATE TABLE a ( id INT NOT NULL PRIMARY KEY, p INT REFERENCES p(id)); CREATE INDEX ON a(p);
CREATE TABLE b ( id INT NOT NULL PRIMARY KEY, p INT REFERENCES p(id)); CREATE INDEX ON b(p);
Each "p" row has between 1 and 5 (or so)
child rows in the "a" table, and between 0 and 4
rows in the "b" table.
Now for most p's the a's and the b's are independent,
and all combinations of a's and b's for that p are ok.
But for a small percentage of p's (<5%) there are some
combinations of a's and b's are distinguished (I will call
them "invalid").
So I created a table to record these "invalid" combinations:
CREATE TABLE x ( a INT NOT NULL REFERENCES a(id), b INT NOT NULL REFERENCES b(id), PRIMARY KEY (a,b)); CREATE
INDEXON x(a); CREATE INDEX ON x(b);
Here is some sample data with a single p:
# Create one parent item...
INSERT INTO p VALUES(1)
# Create 4 a-items for that parent...
INSERT INTO a VALUES(1,1)
INSERT INTO a VALUES(2,1)
INSERT INTO a VALUES(3,1)
INSERT INTO a VALUES(4,1)
# Create 3 b-items for that parent...
INSERT INTO b VALUES(11,1)
INSERT INTO b VALUES(12,1)
INSERT INTO b VALUES(13,1)
So for parent p=1, there are 12 combinations
of a and b items (each of the 4 a items can be
paired with any of the 3 b items).
Now, make some combinations of a items
and b items "invalid"...
# For a=2, make b=13 invalid, i.e only b=11 and b=12 are valid.
INSERT INTO x VALUES(2,13) # For a=3, only b=11 is valid.
INSERT INTO x VALUES(3,12)
INSERT INTO x VALUES(3,13) # For a=4, no b's are valid.
INSERT INTO x VALUES(4,11)
INSERT INTO x VALUES(4,12)
INSERT INTO p VALUES(4,13)
Now I need a view that will display, for each p, its
a's and for each a, only the valid b's, that is, the
combinations of a and b that are *not* in table x.
OK, no problem...
(#1)
SELECT p.id AS pid, a.id AS aid, b.id AS bid FROM p JOIN a ON a.p=p.id LEFT JOIN b ON b.p=a.p LEFT
JOINx ON x.a=a.id AND x.b=b.id WHERE x.a IS NULL AND p.id=1;
Here is the result on the data given above:pid | aid | bid
-----+-----+----- 1 | 1 | 11 1 | 1 | 12 1 | 1 | 13 1 | 2 | 11 1 | 2 | 13 1 | 3 | 11 1 | 3 |
12
Ok, but I want all a's in the output, even when they
have no valid b's. aid=4 is not in the output.
So I did what I thought was the obvious answer, a
left join between a, and the above query...
(#2)
SELECT p.id AS pid, a.id AS aid, sub.bid AS bid FROM p JOIN a ON a.p=p.id LEFT JOIN ( SELECT
a.idAS aid, b.id as bid FROM a LEFT JOIN b ON b.p=a.p LEFT JOIN x ON
x.a=a.idAND x.b=b.id WHERE x.a IS NULL ) AS sub ON sub.aid=a.id WHERE p.id=1;
Results:pid | aid | bid
-----+-----+----- 1 | 1 | 11 1 | 1 | 12 1 | 1 | 13 1 | 2 | 11 1 | 2 | 13 1 | 3 | 11 1 | 3 |
12 1 | 4 |
Exactly what I want.
The problem is that when there are ~100K parent entries
the above query (#2) takes ~10 seconds to run but the first
query (#1) runs in a few tens of milliseconds.
Is there any way I can get postgresql to better optimize
query #2, or rewrite it to that is is more "postgresql friendly"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If it helps, here is a python script I used to generate enough
pseudo-data to show the time difference...
#!/usr/bin/python
import psycopg2, random
def main(): cn = psycopg2.connect (database="test", user="postgres",password="") c
=cn.cursor() pkp = 1; pka = 1; pkb = 1; while pkp < 30000: c.execute ("INSERT INTO p
VALUES(%s)",(pkp,)) na = random.randint(1,5) for a in range(na): c.execute ("INSERT
INTOa VALUES(%s,%s)", (pka+a,pkp)) nb = random.randint(0,4) for b in range(nb):
c.execute("INSERT INTO b VALUES(%s,%s)", (pkb+b,pkp)) if na*nb > 1 and random.randint (0,99) < 10:
zlst = [(a,b) for a in range(pka,pka+na) for b in range(pkb,pkb+nb)] for
zin random.sample (zlst, random.randint(1,na*nb-1)): c.execute ("INSERT INTO x VALUES(%s,%s)", z)
pkp += 1; pka += na; pkb += nb cn.commit()
if __name__ == '__main__': main ()