Обсуждение: unexpected results with NOT IN query

Поиск
Список
Период
Сортировка

unexpected results with NOT IN query

От
"Mason Hale"
Дата:
Hello --

I'm getting some unexpected results with a NOT IN query -- this is on 8.2.5.

This is the query in question:

prod_2=> select id from feed_download_task where id in (111102466,141701504) and id not in (select last_feed_download_task_id from subscription);
 id
----
(0 rows)


This query returns zero rows, but I expect it to return 1 row, because I know that 111102466 *is not* in (select last_feed_download_task_id from subscription) and I know that 1471701504 *is* in that set, as demonstrated below:

Verify that both id values are in the feed_download_task table:

prod_2=> select id from feed_download_task where id in (111102466,141701504);
    id    
-----------
 141701504
 111102466
(2 rows)


Verify that 111102466 is NOT in the set of last_feed_download_task_id's, and that 141701504 is in this set:

prod_2=> select last_feed_download_task_id from subscription where last_feed_download_task_id in (111102466,141701504);
 last_feed_download_task_id
----------------------------
                  141701504
(1 row)


Here's the problem query again, with explain analyze.

prod_2=> select id from feed_download_task where id in (111102466,141701504) and id not in (select last_feed_download_task_id from subscription);
 id
----
(0 rows)

prod_2=> explain analyze select id from feed_download_task where id in (111102466,141701504) and id not in (select last_feed_download_task_id from subscription);
                              
                             QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on feed_download_task  (cost=45077.24..45083.27 rows=1 width=4) (actual time=601.229..601.229 rows=0 loops=1)
   Recheck Cond: (id = ANY ('{111102466,141701504}'::integer[]))
   Filter: (NOT (hashed subplan))
   ->  Bitmap Index Scan on feed_download_task_pkey  (cost=0.00..30.52 rows=2 width=0) (actual time=0.095..0.095 rows=2 loops=1)
         Index Cond: (id = ANY ('{111102466,141701504}'::integer[]))
   SubPlan
     ->  Seq Scan on subscription  (cost=0.00..44097.78 rows=379578 width=4) (actual time=0.032..488.193 rows=162365 loops=1)
 Total runtime: 601.281 ms
(8 rows)


I've tried re-analyzing and re-indexing the tables involved in this query, but I still left scratching my head.

I am also aware that I can use a left join instead of a NOT IN query -- but in this case I need to use this in a DELETE statement, which eliminates the possibility of the left join (I think).

Here is a version using a left outer join, it returns the expected result:

prod_2=# select feed_download_task.id from feed_download_task left join subscription on (subscription.last_feed_download_task_id = feed_download_task.id) where feed_download_task.id in (111102466,141701504) and subscription.id IS NOT NULL;
    id    
-----------
 141701504
(1 row)

Here is the explain analyze output for the above query:

prod_2=# explain analyze select feed_download_task.id from feed_download_task left join subscription on (subscription.last_feed_download_task_id = feed_download_task.id) where feed_download_task.id in (111102466,141701504) and subscription.id IS NOT NULL;
                                                                             QUERY PLAN                                                                             
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=31.19..51.69 rows=1 width=4) (actual time=0.158..0.210 rows=1 loops=1)
   ->  Bitmap Heap Scan on feed_download_task  (cost=31.19..37.21 rows=2 width=4) (actual time=0.120..0.134 rows=2 loops=1)
         Recheck Cond: (id = ANY ('{111102466,141701504}'::integer[]))
         ->  Bitmap Index Scan on feed_download_task_pkey  (cost=0.00..31.19 rows=2 width=0) (actual time=0.102..0.102 rows=2 loops=1)
               Index Cond: (id = ANY ('{111102466,141701504}'::integer[]))
   ->  Index Scan using index_subscription_on_last_feed_download_task_id on subscription  (cost=0.00..7.23 rows=1 width=4) (actual time=0.036..0.037 rows=0 loops=2)
         Index Cond: (subscription.last_feed_download_task_id = feed_download_task.id)
         Filter: (id IS NOT NULL)

I feel like I must be missing something obvious.

Thanks in advance for the assistance.

cheers,
Mason

Re: unexpected results with NOT IN query

От
Tom Lane
Дата:
"Mason Hale" <masonhale@gmail.com> writes:
> I'm getting some unexpected results with a NOT IN query -- this is on 8.2.5.

If there are any NULLs in subscription.last_feed_download_task_id, that
NOT IN will not behave the way you are expecting.  You might want to
filter the nulls out of the subselect result ...

            regards, tom lane

Re: unexpected results with NOT IN query

От
Stephan Szabo
Дата:
On Thu, 20 Mar 2008, Mason Hale wrote:

> Hello --
>
> I'm getting some unexpected results with a NOT IN query -- this is on 8.2.5.
>
> This is the query in question:
>
> prod_2=> select id from feed_download_task where id in (111102466,141701504)
> and id not in (select last_feed_download_task_id from subscription);

Is it possible for last_feed_download_task_id be NULL? If so, then then id
not in (...)  will not ever return true due to the way comparisons with
NULLs work -- basically, it can't tell if the id is in the other table
because id = NULL is unknown, so it thus can't tell that it's not in the
other table either, so you could end up with neither in nor not in
returning the row.


Re: unexpected results with NOT IN query

От
"Mason Hale"
Дата:
Thanks -- that was it -- last_feed_download_task_id can indeed be null.

- Mason

On Thu, Mar 20, 2008 at 10:17 AM, Stephan Szabo <sszabo@megazone.bigpanda.com> wrote:
On Thu, 20 Mar 2008, Mason Hale wrote:

> Hello --
>
> I'm getting some unexpected results with a NOT IN query -- this is on 8.2.5.
>
> This is the query in question:
>
> prod_2=> select id from feed_download_task where id in (111102466,141701504)
> and id not in (select last_feed_download_task_id from subscription);

Is it possible for last_feed_download_task_id be NULL? If so, then then id
not in (...)  will not ever return true due to the way comparisons with
NULLs work -- basically, it can't tell if the id is in the other table
because id = NULL is unknown, so it thus can't tell that it's not in the
other table either, so you could end up with neither in nor not in
returning the row.