When you want to perform a difference operation between two tables, you have a choice:
NOT EXISTS
with a correlated subquery, or NOT IN
. The latter
is arguably simpler to write and makes the intent of the query more obvious. And modern
database systems will optimize the two queries into similar execution plans, handling
the correlation between outer and inner queries (I say “modern” because when
I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).
There is one key difference between the two constructs: if the subquery returns a NULL
in its results then the NOT IN
condition will fail, because null is neither equal-to
nor not-equal-to any other value. But if you guard against that, they should be equivalent —
indeed,
some sources
will tell you that NOT IN
is faster and therefore preferred.
This post is about one case where it's dramatically slower, and nulls are to blame.
Consider the following two tables, which might be used to track clickstream data. Since we
track both anonymous and registered users, EVENTS.USER_ID
is nullable. However,
when the user is not null, the secondary index has high cardinality.
create table USERS
(
ID integer auto_increment primary key,
...
)
create table EVENTS
(
ID integer auto_increment primary key,
TYPE smallint not null,
USER_ID integer
...
)
create index EVENTS_USER_IDX on EVENTS(USER_ID);
OK, now let's use these tables: from a small set of users, we want to find the ones that
haven't had a particular event. Using a NOT IN
, and ensuring that
nulls don't appear in the inner results, the query looks like this:
select ID
from USERS
where ID in (1, 7, 2431, 87142, 32768)
and ID not in
(
select USER_ID
from EVENTS
where TYPE = 7
and USER_ID is not null
);
For my test dataset, the USERS
table has 100,000 rows and the EVENTS
table has 10,000,000, of which approximately 75% have a null USER_ID
. I'm running
on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD.
And I consistently get runtimes of around 2 minutes, which is … wow.
Let's replace that NOT IN
with a NOT EXISTS
and correlated subuery:
select ID
from USERS
where ID in (1, 7, 2431, 87142, 32768)
and not exists
(
select 1
from EVENTS
where USER_ID = USERS.ID
and TYPE = 7
);
This version runs in 0.01 seconds, which is more what I expected.
Time to compare execution plans. The first plan is from the NOT IN
query, the
second is from the NOT EXISTS
.
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| 1 | PRIMARY | USERS | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 5 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | EVENTS | NULL | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5 | func | 195 | 10.00 | Using where |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| 1 | PRIMARY | USERS | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 5 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | EVENTS | NULL | ref | EVENTS_USER_IDX | EVENTS_USER_IDX | 5 | example.USERS.ID | 97 | 10.00 | Using where |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
Almost identical: both select rows from the USERS
table, and then use a nested-loops
join (“dependent subquery&rquo;) to retrieve rows from the EVENTS
table. Both
claim to use EVENTS_USER_IDX
to select rows in the subquery. And they estimate
similar numbers of rows at each step.
But look more closely at the join types. The NOT IN
version uses
index_subquery,
while the NOT EXISTS
version uses
ref.
Also look at the ref
column: the NOT EXISTS
version uses an explicit reference to
the outer column, while the NOT IN
uses a function. What's going on here?
The index_subquery
join type indicates that MySQL will scan the index to find relevant rows
for the subquery. Could that be the problem? I don't think so, because the EVENTS_USER_IDX
is
“narrow”: it only has one column, so the engine should not have to read a lot of blocks to
find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise
this index, and all run in a few hundredths of a second).
For more information, I turned to the “extended” execution plan. To see this plan, prefix
the query with explain extended
, and follow it with show warnings
. Here's
what you get from the NOT IN
query (reformatted for clarity):
/* select#1 */ select `example`.`USERS`.`ID` AS `ID`
from `example`.`USERS`
where ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768))
and (not((`example`.`USERS`.`ID`,
(((`example`.`USERS`.`ID`) in EVENTS
on EVENTS_USER_IDX checking NULL
where ((`example`.`EVENTS`.`TYPE` = 7)
and (`example`.`EVENTS`.`USER_ID` is not null))
having (`example`.`EVENTS`.`USER_ID`)))))))
I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL
”
but here's what I think is happening: the optimizer believes that it is executing an IN
query that can have NULL
in the results; it does not consider the null check in the
where
clause when making this decision. As a result, it will examine the 7.5 million
rows where USER_ID
is null, along with the few dozen rows where it matches the
values from the outer query. And by “examine,” I mean that it will read the table row
and only then apply the is not null
condition. Moreover, based on the time taken to
run the query, I think it's doing this for every one of the candidate values in the outer query.
So, bottom line: whenever you're thinking of using an IN
or NOT IN
subquery on a nullable column, rethink and use EXISTS
or NOT EXISTS
instead.