Sunday, February 26, 2017

Warming up RDS Snapshots

I learned something last week: when you start an RDS database instance from a snapshot, the instance's volume isn't copied all at once. Instead, it's lazily paged in from the snapshot, which is stored in S3. I suppose that shouldn't surprise me; after all, it's documented behavior (albeit for EC2, not RDS).

What did surprise me was just how much of a performance hit it causes. For example, a table scan of one of our large, rarely accessed (so not in cache) tables took about 10 minutes on an instance that had been running for several months. The same query, on a new but otherwise identical replica took nearly an hour.

Read replicas are one of the places where this behavior is particularly painful: one of the primary reasons for using a replica is to move long-running queries away from your primary machine. But with lazy-loading, for the first few hours (or days!) of a replica's life, it's significantly slower than running against the master. Worse, if you ever need to restore from a snapshot: your whole system will run slower while the blocks are loaded.

So what can you do?

You need to “warm up” the replica, by reading data from it before you put it into active use. And there are several ways to do this: one is to write queries that touch every table, preferably with a table-scan. An alternative is to take a backup of the database, and immediately throw it away:

mysqldump --all-databases --host HOST --user USER --password > /dev/null

I'm not convinced this is sufficient: it will touch all of the data blocks, but may leave the index blocks behind. I welcome any better approaches.

Saturday, February 11, 2017

MySQL: when NOT IN is not equal to NOT EXISTS

When you want to perform a difference operation between two tables, you have a choice: NOT EXISTS with a correlated subquery, or NOT IN. The latter is arguably simpler to write and makes the intent of the query more obvious. And modern database systems will optimize the two queries into similar execution plans, handling the correlation between outer and inner queries (I say “modern” because when I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).

There is one key difference between the two constructs: if the subquery returns a NULL in its results then the NOT IN condition will fail, because null is neither equal-to nor not-equal-to any other value. But if you guard against that, they should be equivalent — indeed, some sources will tell you that NOT IN is faster and therefore preferred.

This post is about one case where it's dramatically slower, and nulls are to blame.

Consider the following two tables, which might be used to track clickstream data. Since we track both anonymous and registered users, EVENTS.USER_ID is nullable. However, when the user is not null, the secondary index has high cardinality.

create table USERS
  ID    integer auto_increment primary key,

create table EVENTS
  ID      integer auto_increment primary key,
  TYPE    smallint not null,
  USER_ID integer


OK, now let's use these tables: from a small set of users, we want to find the ones that haven't had a particular event. Using a NOT IN, and ensuring that nulls don't appear in the inner results, the query looks like this:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     ID not in
        select  USER_ID
        from    EVENTS
        where   TYPE = 7
        and     USER_ID is not null

For my test dataset, the USERS table has 100,000 rows and the EVENTS table has 10,000,000, of which approximately 75% have a null USER_ID. I'm running on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD.

And I consistently get runtimes of around 2 minutes, which is … wow.

Let's replace that NOT IN with a NOT EXISTS and correlated subuery:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     not exists
        select  1
        from    EVENTS
        where   USER_ID = USERS.ID
        and     TYPE = 7

This version runs in 0.01 seconds, which is more what I expected.

Time to compare execution plans. The first plan is from the NOT IN query, the second is from the NOT EXISTS.

| id | select_type        | table  | partitions | type           | possible_keys   | key             | key_len | ref  | rows | filtered | Extra                    |
|  1 | PRIMARY            | USERS  | NULL       | range          | PRIMARY         | PRIMARY         | 4       | NULL |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | func |  195 |    10.00 | Using where              |
| id | select_type        | table  | partitions | type  | possible_keys   | key             | key_len | ref              | rows | filtered | Extra                    |
|  1 | PRIMARY            | USERS  | NULL       | range | PRIMARY         | PRIMARY         | 4       | NULL             |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | ref   | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | example.USERS.ID |   97 |    10.00 | Using where              |

Almost identical: both select rows from the USERS table, and then use a nested-loops join (“dependent subquery&rquo;) to retrieve rows from the EVENTS table. Both claim to use EVENTS_USER_IDX to select rows in the subquery. And they estimate similar numbers of rows at each step.

But look more closely at the join types. The NOT IN version uses index_subquery, while the NOT EXISTS version uses ref. Also look at the ref column: the NOT EXISTS version uses an explicit reference to the outer column, while the NOT IN uses a function. What's going on here?

The index_subquery join type indicates that MySQL will scan the index to find relevant rows for the subquery. Could that be the problem? I don't think so, because the EVENTS_USER_IDX is “narrow”: it only has one column, so the engine should not have to read a lot of blocks to find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise this index, and all run in a few hundredths of a second).

For more information, I turned to the “extended” execution plan. To see this plan, prefix the query with explain extended, and follow it with show warnings. Here's what you get from the NOT IN query (reformatted for clarity):

/* select#1 */  select `example`.`USERS`.`ID` AS `ID` 
                from    `example`.`USERS` 
                where   ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768)) 
                        and (not((`example`.`USERS`.`ID`,
                            (((`example`.`USERS`.`ID`) in EVENTS 
                            on      EVENTS_USER_IDX checking NULL
                            where   ((`example`.`EVENTS`.`TYPE` = 7) 
                                    and (`example`.`EVENTS`.`USER_ID` is not null))
                            having (`example`.`EVENTS`.`USER_ID`)))))))

I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL” but here's what I think is happening: the optimizer believes that it is executing an IN query that can have NULL in the results; it does not consider the null check in the where clause when making this decision. As a result, it will examine the 7.5 million rows where USER_ID is null, along with the few dozen rows where it matches the values from the outer query. And by “examine,” I mean that it will read the table row and only then apply the is not null condition. Moreover, based on the time taken to run the query, I think it's doing this for every one of the candidate values in the outer query.

So, bottom line: whenever you're thinking of using an IN or NOT IN subquery on a nullable column, rethink and use EXISTS or NOT EXISTS instead.