Обсуждение: Cannot do time-aligned listing with FULL JOIN, get ERROR: FULL JOIN is only supported with merge-joinable join conditions

Поиск
Список
Период
Сортировка
Hello,

FULL JOIN requiring merge-joinable conditions has been covered previously
in http://archives.postgresql.org/pgsql-sql/2006-01/msg00080.php for a
different application; here is a new context in which FULL JOIN in
non-merge-joinable conditions might be useful.

In this example, events have a start_time and an end_time.  I want a
time-aligned listing of all events from two tables A and B, joined up
where there is any time overlap between A's events with B's events.  Here
are the two tables:

CREATE TABLE public.events_a AS
      SELECT 'A1' as event_a, '2009-01-03'::date as start_a,
'2009-01-03'::date as end_a
UNION SELECT 'A2'           , '2009-01-07'::date           ,
'2009-01-08'::date
UNION SELECT 'A3'           , '2009-01-14'::date           ,
'2009-01-15'::date
;
CREATE TABLE public.events_b AS
      SELECT 'B1' as event_b, '2009-01-04'::date as start_b,
'2009-01-06'::date as end_b
UNION SELECT 'B2'           , '2009-01-07'::date           ,
'2009-01-12'::date
UNION SELECT 'B3'           , '2009-01-13'::date           ,
'2009-01-13'::date
;
A FULL JOIN would be good because presumably it would require only two
sequential scans, but it appears the OVERLAPS operator creates a
non-merge-joinable condition; this query returns ERROR: FULL JOIN is only
supported with merge-joinable join conditions / SQL state: 0A000:

SELECT *
  FROM public.events_a A
  FULL JOIN public.events_b B
    ON (start_a, end_a) OVERLAPS (start_b, end_b)
 ORDER BY coalesce(start_a, start_b), start_b  ;

As noted in other posts to the archive listing above, the answer I am
looking for can be had by UNIONing a LEFT join and a RIGHT JOIN:

SELECT *
FROM
(
SELECT *
  FROM public.events_a A
  LEFT JOIN public.events_b B
    ON (start_a, end_a) OVERLAPS (start_b, end_b)
 UNION SELECT *
  FROM public.events_a A
  RIGHT OUTER JOIN public.events_b B
    ON (start_a, end_a) OVERLAPS (start_b, end_b)
) U
ORDER BY COALESCE(start_a, start_b), start_b;

but this requires 4 sequential scans:

EXPLAIN above query:
QUERY PLAN
Sort  (cost=298586.19..300828.86 rows=897066 width=80)
  Sort Key: (COALESCE(u.start_a, u.start_b)), u.start_b
  ->  Subquery Scan u  (cost=185220.13..209889.44 rows=897066 width=80)
        ->  Unique  (cost=185220.13..200918.78 rows=897066 width=80)
              ->  Sort  (cost=185220.13..187462.79 rows=897066 width=80)
                    Sort Key: a.event_a, a.start_a, a.end_a, b.event_b,
b.start_b, b.end_b
                    ->  Append  (cost=22.76..96523.38 rows=897066
width=80)
                          ->  Nested Loop Left Join  (cost=22.76..43776.36
rows=448533 width=80)
                                Join Filter:
"overlaps"((a.start_a)::timestamp with time zone, (a.end_a)::timestamp
with time zone, (b.start_b)::timestamp with time zone,
(b.end_b)::timestamp with time zone)
                                ->  Seq Scan on events_a a
(cost=0.00..21.60 rows=1160 width=40)
                                ->  Materialize  (cost=22.76..34.36
rows=1160 width=40)
                                      ->  Seq Scan on events_b b
(cost=0.00..21.60 rows=1160 width=40)
                          ->  Nested Loop Left Join  (cost=22.76..43776.36
rows=448533 width=80)
                                Join Filter:
"overlaps"((a.start_a)::timestamp with time zone, (a.end_a)::timestamp
with time zone, (b.start_b)::timestamp with time zone,
(b.end_b)::timestamp with time zone)
                                ->  Seq Scan on events_b b
(cost=0.00..21.60 rows=1160 width=40)
                                ->  Materialize  (cost=22.76..34.36
rows=1160 width=40)
                                      ->  Seq Scan on events_a a
(cost=0.00..21.60 rows=1160 width=40)

In practice, is there:
* A different structuring the time spans of the 'events' such that a
merge-joinable condition can be found?
* A possibility of teaching Hash Join to do the FULL JOIN on
non-merge-joinable conditions?

Thanks,

John Koerber
Senior Systems Developer / Analyst


"John Koerber" <johnk@musicreports.com> writes:
> SELECT *
>   FROM public.events_a A
>   FULL JOIN public.events_b B
>     ON (start_a, end_a) OVERLAPS (start_b, end_b)
>  ORDER BY coalesce(start_a, start_b), start_b  ;
>  [ doesn't work ]

> In practice, is there:
> * A different structuring the time spans of the 'events' such that a
> merge-joinable condition can be found?
> * A possibility of teaching Hash Join to do the FULL JOIN on
> non-merge-joinable conditions?

Even if we could do full joins by hashing, that wouldn't help you since
OVERLAPS is no more hashable than it is mergeable.  The only possible
join plan would be nestloop, with a work table the size of the inner
input to keep track of which inner rows hadn't been joined to anything
:-(

            regards, tom lane