Saturday, May 20, 2017


I recently sold three calculators on eBay: an HP15C, an HP16C, and a TI59. Got slightly over $300 for them, which was less than I paid 30-plus years ago (especially adjusted for inflation), but they had no value sitting on a shelf in my office. Along the way, I realized two things.

First, I was, and may still be, a calculator geek.

I was never one of those people that walked around with his calculator on his belt (indeed, only one of the three even had a case with a belt loop). But, let's be honest, owning a dozen calculators just ain't normal (no, I didn't sell all of my old calculators; sue me). Nor is it normal for a 15-year-old to save his allowance and odd-job money so that he could buy a programmable calculator (a TI-55, which I still have; again, sue me).

The second thing I realized is just how irrelevant calculators are today.

These three calculators were the pinnacle of late-70s/early-80s calculator design. All three were programmable. The HP15 was designed for engineers: it could work with complex numbers, and could compute integrals and roots based on your programmed function. The TI-59, with nearly 1 KB (!) of memory and a magnetic card reader, was a viable competitor for then-current home computers, at 1/10th the price (and was touted as such in contemporary reviews). Certainly I used it as such: looking over my personal program card library, I saw one card that I used to balance my checkbook, tracking credits and debits.

But as desktop computers became more prevalent, nobody needed programmable calculators. In the early 90s I bought what I still consider to be HP's best calculator, the HP27S. It wasn't programmable, but you could still find the root of a function: you simply entered it as a formula and the calculator was smart enough to either solve it algebraically or numerically. It also combined functionality from the financial and computing domains with the basic scientific functions. And it used algebraic notation, which I'm sure upset a lot of HP purists.

I first got a hint of how irrelevant the physical devices had become a few years ago, when I was doing some bit manipulation operations and took out my 16C. When I learned that a new set of batteries would cost $15, I looked for something similar that I could run on my phone. And found apps that were based on the actual HP ROMs (they weren't copyrighted), running on an emulator. For less than the cost of batteries.

So the actual calculator remained on a shelf, unused. And really, I don't use the apps that much either (I also downloaded an HP11 emulation, so plural). When sitting at the computer it's easier to start up a Python shell or a spreadsheet. Really, the only place that I use the calculator is when I'm at Home Depot trying to figure out how many bags of fertilizer I need.

Perhaps high school and college students still use calculators, but I suspect not. Laptops are pervasive and inexpensive; why carry two devices. Which makes me wonder: will the calculators I sold sit unused on the shelf of someone who's more of a calculator geek than me?

Sunday, March 5, 2017

MySQL: The problem with synthetic primary keys

One of the standard practices in database design is to use a “synthetic” primary key — often an auto-increment integer — rather than a “natural” key relevant to the business domain. This is a Good Thing: natural keys are often not unique or ever-present, even when you think they are. Social security number is a perfect example: aside from the fact that you're not legally allowed to use it for non-tax-related identification, not everybody has one, and some people (either maliciously or accidentally) give you the wrong one.

So this post is not against synthetic primary keys per se. Rather, it's about the performance penalty that you'll pay if you create a single-column auto-increment primary key for every table in your database.

As an example, I'm going to use the USERS and EVENTS tables from my previous post. The former has 100,000 rows, the latter 10,000,000. As before, only 25% of events have associated users, so it's not a strict parent-child relationship, but we'll get to that.

create table USERS
    ID          integer not null auto_increment,
    CREATED_AT  timestamp not null default CURRENT_TIMESTAMP,
    VALUE       varchar(255),

    primary key (ID),
    key         (CREATED_AT)

create table EVENTS
    ID          integer not null auto_increment,
    CREATED_AT  timestamp not null default CURRENT_TIMESTAMP,
    TYPE        smallint not null,
    USER_ID     integer null,
    VALUE       varchar(255),

    primary key (ID),
    key         EVENTS_USER_IDX (USER_ID)

Both of these tables have synthetic primary keys, an auto-increment ID column. The EVENTS table has a foreign key relationship to USERS via USER_ID, and a secondary index to (we hope!) improve join times. Here's a typical query, with its execution plan:

select  count(*)
from    USERS U
join    EVENTS E on E.USER_ID = U.ID
where   U.CREATED_AT > subdate(CURRENT_TIMESTAMP(), 7)
and     E.TYPE = 7;
| id | select_type | table | partitions | type  | possible_keys      | key             | key_len | ref          | rows | filtered | Extra                    |
|  1 | SIMPLE      | U     | NULL       | range | PRIMARY,CREATED_AT | CREATED_AT      | 4       | NULL         | 3891 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | E     | NULL       | ref   | EVENTS_USER_IDX    | EVENTS_USER_IDX | 5       | example.U.ID |   98 |    10.00 | Using where              |

That looks perfectly reasonable: the query selects rows from USERS using the index on CREATED_AT, and then joins to EVENTS using the index on USER_ID. With my test database it returns slightly under 10,000 rows.

But why does it take 12 seconds the first time I run it, and only 1 second for subsequent runs?

The problem is that MySQL manages primary keys as a clustered index: it orders the physical rows in the table according to the primary key value. What this means for the join between USERS and EVENTS is that the events for a given user will be sprinkled all over the disk: each new row will be assigned a primary key value that's physically located at the end of the current data. As a result, if you select all of the rows for a particular user, you may have to access a large number of disk blocks. And to access a block you have to read it from disk.

That's not so bad if you have enough RAM to cache your entire database — which is why my subsequent queries ran so quickly. But if you don't, or queries against the EVENTS table are infrequent, you'll be hitting the disk constantly. Worse, you'll be evicting pages from the disk cache, so all your other queries will slow down because they also have to read data from the disk.

So what's the alternative?

create table NEW_EVENTS
  ID          int         not null auto_increment,
  CREATED_AT  timestamp   not null default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  TYPE        smallint    not null,
  USER_ID     int         not null,
  VALUE       varchar(255),


With this new table, my example query has the same execution plan. But the primary key on the new table organizes its data such that rows for the same user are collocated: assuming that your conditions don't select all users, it should read fewer blocks. On my laptop — after restarting MySQL and flushing the disk cache — it took just under half a second for the first query, and 0.04 seconds for subsequent queries.

So clearly the theory is borne out in practice, but can we gain more insight into what's happening? The answer is yes, provided that you have the performance schema enabled for your database (and you really should, even in production!). You can retrieve information about the number of disk operations and the time they take by querying TABLE_IO_WAITS_SUMMARY_BY_INDEX_USAGE:

from    performance_schema.table_io_waits_summary_by_index_usage
where   OBJECT_NAME in ('USERS', 'EVENTS');

Like other tables in the performance schema, the counters are not tied to a query; you need to ensure that nothing else is happening and truncate the table between your test queries. Arguably better is to restart your server and flush the disk caches, as this will show you the worst-case timings.

Here are the stats for the first query:

| example       | EVENTS      | PRIMARY         |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | EVENTS      | EVENTS_USER_IDX |      97669 | 12350283888004 |           0 |               0 |       97669 |  12350283888004 |
| example       | USERS       | PRIMARY         |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | USERS       | CREATED_AT      |       3892 |     5101149804 |           0 |               0 |        3892 |      5101149804 |

And for the second:

| example       | NEW_EVENTS  | PRIMARY           |      97669 |   455075854140 |           0 |               0 |       97669 |    455075854140 |
| example       | NEW_EVENTS  | NEW_EVENTS_ID_KEY |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | USERS       | PRIMARY           |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | USERS       | CREATED_AT        |       3892 |     2637773460 |           0 |               0 |        3892 |      2637773460 |

The number of reads are identical, but look at the difference in the read timer. This value is measured in trillionths of a second, so after adding a decimal point and dropping digits from the end we see that accesses to the EVENTS table took 12.350 seconds for the first query and 0.455 seconds for the second, with the same number of reads in each. The difference is that, to find all of the relevant events for a given user the first query had to read (and wait for) far more blocks to come from the disk.

You will also notice that the time taken to access the USERS table via the CREATED_AT index has decreased as well. This can't be explained by data clustering — indeed, in creating the test data for this post, I ensured that users would not be clustered by creation date. However, I believe that these numbers show the overall benefit of reducing disk operations: by reading fewer blocks in the NEW_EVENTS table, there's a higher likelihood that blocks from USERS will remain in the disk cache.

There are, of course, tradeoffs to reindexing your tables. The most obvious one is that EVENTS.USER_ID can no longer be nullable, because it's part of the primary key. This will cause some pain when dealing with anonymous users: you must insert an explicit value rather than leave the column NULL (I will note that some authorities, including C. J. Date, will say you shouldn't use nulls anyway, so this isn't necessarily a hardship).

There's also a price to pay with inserts: in order to keep the rows ordered by primary key, MySQL may need to split blocks on insert, in order to make room for the new rows. I don't think this is a major issue: for active users, you'll be inserting new blocks into the table rather than splitting existing blocks, and for inactive users you won't often pay the price of a split.

And finally, beware that this technique this isn't appropriate everywhere. It works best with large tables, where physical row data has to be read to perform a query (versus a covering index). As with all performance optimizations, there's no substitute for actual measurement.

Sunday, February 26, 2017

Warming up RDS Snapshots

I learned something last week: when you start an RDS database instance from a snapshot, the instance's volume isn't copied all at once. Instead, it's lazily paged in from the snapshot, which is stored in S3. I suppose that shouldn't surprise me; after all, it's documented behavior (albeit for EC2, not RDS).

What did surprise me was just how much of a performance hit it causes. For example, a table scan of one of our large, rarely accessed (so not in cache) tables took about 10 minutes on an instance that had been running for several months. The same query, on a new but otherwise identical replica took nearly an hour.

Read replicas are one of the places where this behavior is particularly painful: one of the primary reasons for using a replica is to move long-running queries away from your primary machine. But with lazy-loading, for the first few hours (or days!) of a replica's life, it's significantly slower than running against the master. Worse, if you ever need to restore from a snapshot: your whole system will run slower while the blocks are loaded.

So what can you do?

You need to “warm up” the replica, by reading data from it before you put it into active use. And there are several ways to do this: one is to write queries that touch every table, preferably with a table-scan. An alternative is to take a backup of the database, and immediately throw it away:

mysqldump --all-databases --host HOST --user USER --password > /dev/null

I'm not convinced this is sufficient: it will touch all of the data blocks, but may leave the index blocks behind. I welcome any better approaches.