When you want to perform a difference operation between two tables, you have a choice:
    NOT EXISTS with a correlated subquery, or NOT IN. The latter
    is arguably simpler to write and makes the intent of the query more obvious. And modern
    database systems will optimize the two queries into similar execution plans, handling
    the correlation between outer and inner queries (I say “modern” because when
    I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).
 There is one key difference between the two constructs: if the subquery returns a NULL
    in its results then the NOT IN condition will fail, because null is neither equal-to
    nor not-equal-to any other value.  But if you guard against that, they should be equivalent —
    indeed,
    some sources
    will tell you that NOT IN is faster and therefore preferred.
 This post is about one case where it's dramatically slower, and nulls are to blame.
 Consider the following two tables, which might be used to track clickstream data. Since we
    track both anonymous and registered users, EVENTS.USER_ID is nullable. However,
    when the user is not null, the secondary index has high cardinality.
create table USERS
(
  ID    integer auto_increment primary key,
  ...
)
create table EVENTS
(
  ID      integer auto_increment primary key,
  TYPE    smallint not null,
  USER_ID integer
  ...
)
create index EVENTS_USER_IDX on EVENTS(USER_ID);
 OK, now let's use these tables: from a small set of users, we want to find the ones that
    haven't had a particular event. Using a NOT IN, and ensuring that
    nulls don't appear in the inner results, the query looks like this:
select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     ID not in
        (
        select  USER_ID
        from    EVENTS
        where   TYPE = 7
        and     USER_ID is not null
        );
 For my test dataset, the USERS table has 100,000 rows and the EVENTS 
    table has 10,000,000, of which approximately 75% have a null USER_ID. I'm running
    on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD.
 And I consistently get runtimes of around 2 minutes, which is … wow.
 Let's replace that NOT IN with a NOT EXISTS and correlated subuery:
select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     not exists
        (
        select  1
        from    EVENTS
        where   USER_ID = USERS.ID
        and     TYPE = 7
        );
 This version runs in 0.01 seconds, which is more what I expected.
 Time to compare execution plans. The first plan is from the NOT IN query, the
    second is from the NOT EXISTS.
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| id | select_type        | table  | partitions | type           | possible_keys   | key             | key_len | ref  | rows | filtered | Extra                    |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
|  1 | PRIMARY            | USERS  | NULL       | range          | PRIMARY         | PRIMARY         | 4       | NULL |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | func |  195 |    10.00 | Using where              |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| id | select_type        | table  | partitions | type  | possible_keys   | key             | key_len | ref              | rows | filtered | Extra                    |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
|  1 | PRIMARY            | USERS  | NULL       | range | PRIMARY         | PRIMARY         | 4       | NULL             |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | ref   | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | example.USERS.ID |   97 |    10.00 | Using where              |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
 Almost identical: both select rows from the USERS table, and then use a nested-loops
    join (“dependent subquery&rquo;) to retrieve rows from the EVENTS table. Both
    claim to use EVENTS_USER_IDX to select rows in the subquery. And they estimate
    similar numbers of rows at each step.
 But look more closely at the join types. The NOT IN version uses
    index_subquery,
    while the NOT EXISTS version uses 
    ref.
    Also look at the ref column: the NOT EXISTS version uses an explicit reference to
    the outer column, while the NOT IN uses a function. What's going on here?
 The index_subquery join type indicates that MySQL will scan the index to find relevant rows
    for the subquery. Could that be the problem? I don't think so, because the EVENTS_USER_IDX is
    “narrow”: it only has one column, so  the engine should not have to read a lot of blocks to
    find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise
    this index, and all run in a few hundredths of a second).
 For more information, I turned to the “extended” execution plan. To see this plan, prefix
    the query with explain extended, and follow it with show warnings.  Here's
    what you get from the NOT IN query (reformatted for clarity):
/* select#1 */  select `example`.`USERS`.`ID` AS `ID` 
                from    `example`.`USERS` 
                where   ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768)) 
                        and (not((`example`.`USERS`.`ID`,
                            (((`example`.`USERS`.`ID`) in EVENTS 
                            on      EVENTS_USER_IDX checking NULL
                            where   ((`example`.`EVENTS`.`TYPE` = 7) 
                                    and (`example`.`EVENTS`.`USER_ID` is not null))
                            having (`example`.`EVENTS`.`USER_ID`)))))))
 I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL”
    but here's what I think is happening: the optimizer believes that it is executing an IN
    query that can have NULL in the results; it does not consider the null check in the
    where clause when making this decision. As a result, it will examine the 7.5 million
    rows where USER_ID is null, along with the few dozen rows where it matches the
    values from the outer query. And by “examine,” I mean that it will read the table row
    and only then apply the is not null condition. Moreover, based on the time taken to
    run the query, I think it's doing this for every one of the candidate values in the outer query.
 So, bottom line: whenever you're thinking of using an IN or NOT IN
    subquery on a nullable column, rethink and use EXISTS or NOT EXISTS
    instead.