Monday, July 15, 2019

Stupid Database Tricks: Accessing the filesystem from Postgres

A while back, one of my younger co-workers asked me why Amazon RDS doesn't support a real Postgres superuser. I replied by showing him this:

create table passwords (entry text);

copy passwords from '/etc/passwd';

He ran off to try running the same code against our production database using Splunk (I didn't ask; I didn't want to know, and still don't). But this example, while dramatic, doesn't really answer the question; /etc/passwd is intended to be read by anyone.

What I should have done was show him this example (warning: don't try this at home unless you enjoy restoring from backups!):

copy passwords to '/usr/local/pgsql/data/postgresql.conf';

You may be wondering, like my colleague was, why Postgres would have such a huge security hole. The answer is that it doesn't: only superusers can copy to or from files. If an ordinary user tries either of these statements, this is the response:

ERROR:  must be superuser to COPY to or from a file
HINT:  Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.
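
The hint points at the safe alternative: psql's \copy performs the file access on the client side, with the permissions of whoever is running psql, so any user can do it (against whatever files the client machine can read):

\copy passwords from '/etc/passwd'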

The real security hole was that we were using the superuser as our standard login, because it made managing the database easier. And while the ability to overwrite files is one reason not to do that, there's a bigger one: we were one mis-qualified table name away from corrupting the data in the database, because the superuser also bypasses any checks on schema or database permissions.

Sunday, December 16, 2018

Database Connection Pools Need To Evolve

I never thought very deeply about connection pools, other than as a good example of how to use phantom references. Most of the projects that I worked on had already picked a pool, or defaulted their choice to whatever framework they were using. It is, after all, a rather mundane component, usually hidden behind the object-relational mapper that you use to interact with your database.

Then I answered this Stack Overflow question. And, after playing around a bit and writing a much longer answer, I realized that connection pools — even the newest — have been completely left behind by the current best practices in database management. Two of those practices, in particular:

  • Database credentials should be handled with care

    All of the pools that I've used expect you to provide database credentials in their configuration. Which means that you, as a security-conscious developer, need to retrieve those credentials from somewhere and manually configure the pool. Or, as a not-so-security-conscious developer, store them in plain text in a configuration file. In either case, you're doing this once, when the application starts. If there's ever a need to change credentials, you restart your application.

    That makes it difficult to practice credential rotation, where your database credentials change on a frequent basis to minimize the impact of losing those credentials. At the extreme, Amazon's RDS databases support generation of credentials that last only 15 minutes. But even if you rotate credentials on a monthly basis, the need to restart all of your applications turns this simple practice into a major undertaking, almost certainly manual, and one that may require application downtime.

  • Failover isn't a big deal

    Failover from primary database to replica has traditionally been a Big Event, performed manually, and often involving several people on a conference call. At a minimum you need to bring up a new replica, and with asynchronous, log-based replication there is always the chance of lost data. But with modern cluster-based database servers like Amazon Aurora, failover might happen during unattended maintenance. If the application can't recognize that it's in the middle of a short-duration failover, that means it's still a Big Deal.

One solution to both of these problems is for the pool to provide hooks into its behavior: points where it calls out to user code to get the information it needs. For example, rather than read a static configuration variable for username and password, it would call a user-defined function for these values.
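
To make that concrete, here's a minimal sketch of what such a hook might look like; the interface and class names are my own invention, not part of any existing pool's API:

// Hypothetical hook: the pool calls this each time it opens a new physical
// connection, so rotated credentials are picked up without a restart.
public interface CredentialsProvider
{
    Credentials getCredentials();
}

// Simple value holder for whatever the provider returns.
public class Credentials
{
    public final String username;
    public final String password;

    public Credentials(String username, String password)
    {
        this.username = username;
        this.password = password;
    }
}

An implementation might read environment variables, generate a short-lived RDS token, or pull from a secrets manager; the pool doesn't need to care.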

And that's fine, but it made me realize something: the real problem is that current connection pools try to do two things. The first is managing the pool: creating connections, holding them until needed, handing them out, collecting them again when the application code forgets to close them, and trying to do all of this with the minimal amount of overhead. The second task is establishing a connection, which has subtle variations depending on the database in use.

I think that the next evolution in connection pools is to separate those behaviors, and turn the connection-establishment code into a composable stack of behaviors that can be mixed and matched as needed. One person might need a MySQL connection factory that uses an IAM credentials provider and a post-connection check that throws if the database is read-only. Another person might want a Postgres connection factory that uses an environment-based credentials provider and is fine with a basic connection check.
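
Here's a rough sketch of what those pieces might look like, building on the hypothetical CredentialsProvider above (again, these interfaces are illustrative, not an existing pool's API):

import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical: knows how to open a physical connection to one particular
// database engine, asking the credentials provider on every attempt.
public interface ConnectionFactory
{
    Connection newConnection(CredentialsProvider credentials) throws SQLException;
}

// Hypothetical: validates a connection before the pool hands it out;
// one implementation might throw if the database reports itself read-only.
public interface ConnectionCheck
{
    void check(Connection cnxn) throws SQLException;
}

The pool itself would then be reduced to lifecycle management: it asks the factory for connections, runs the check before handing them out, and otherwise stays out of the way. Swapping MySQL for Postgres, or environment-based credentials for IAM credentials, becomes a matter of composing different implementations rather than choosing a different pool.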

The tricky part is deciding on the right interfaces. I don't know that the three-part stack that I just described is the right way to go. I'm pretty sure that overriding javax.sql.DataSource isn't. And after creating an API, there's the hard work of implementing it, although I think that you'll find a lot of people willing to contribute small pieces.

The real question: is there a connection pool maintainer out there who thinks the same way and is willing to break her pool into pieces?

Sunday, February 26, 2017

Warming up RDS Snapshots

I learned something last week: when you start an RDS database instance from a snapshot, the instance's volume isn't copied all at once. Instead, it's lazily paged in from the snapshot, which is stored in S3. I suppose that shouldn't surprise me; after all, it's documented behavior (albeit for EC2, not RDS).

What did surprise me was just how much of a performance hit it causes. For example, a table scan of one of our large, rarely accessed (so not in cache) tables took about 10 minutes on an instance that had been running for several months. The same query, on a new but otherwise identical replica, took nearly an hour.

Read replicas are one of the places where this behavior is particularly painful: one of the primary reasons for using a replica is to move long-running queries away from your primary machine. But with lazy loading, for the first few hours (or days!) of a replica's life, it's significantly slower than running against the master. Worse, if you ever need to restore from a snapshot, your whole system will run slower while the blocks are loaded.

So what can you do?

You need to “warm up” the replica, by reading data from it before you put it into active use. And there are several ways to do this: one is to write queries that touch every table, preferably with a table-scan. An alternative is to take a backup of the database, and immediately throw it away:

mysqldump --all-databases --host HOST --user USER --password > /dev/null

I'm not convinced this is sufficient: it will touch all of the data blocks, but may leave the index blocks behind. I welcome any better approaches.
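
One thing that might help (untested, and the names below are placeholders, not real schema objects): force a covering scan of each secondary index, so the engine has to page in the index blocks as well.

-- SOME_TABLE, INDEXED_COLUMN, and SOME_INDEX are placeholders for your own schema
select count(INDEXED_COLUMN)
from   SOME_TABLE force index (SOME_INDEX);

Since the count can be satisfied entirely from the index, this should read every block of SOME_INDEX without touching the table itself.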

Saturday, February 11, 2017

MySQL: when NOT IN is not equal to NOT EXISTS

When you want to perform a difference operation between two tables, you have a choice: NOT EXISTS with a correlated subquery, or NOT IN. The latter is arguably simpler to write and makes the intent of the query more obvious. And modern database systems will optimize the two queries into similar execution plans, handling the correlation between outer and inner queries (I say “modern” because when I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).

There is one key difference between the two constructs: if the subquery returns a NULL in its results then the NOT IN condition will fail, because null is neither equal-to nor not-equal-to any other value. But if you guard against that, they should be equivalent — indeed, some sources will tell you that NOT IN is faster and therefore preferred.
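
You can see this behavior without involving any tables:

select 1 not in (2, 3);      -- 1 (true)
select 1 not in (2, null);   -- NULL, which a WHERE clause treats the same as false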

This post is about one case where it's dramatically slower, and nulls are to blame.

Consider the following two tables, which might be used to track clickstream data. Since we track both anonymous and registered users, EVENTS.USER_ID is nullable. However, for the rows where it is not null, the secondary index on that column has high cardinality.

create table USERS
(
  ID    integer auto_increment primary key,
  ...
)

create table EVENTS
(
  ID      integer auto_increment primary key,
  TYPE    smallint not null,
  USER_ID integer
  ...
)

create index EVENTS_USER_IDX on EVENTS(USER_ID);

OK, now let's use these tables: from a small set of users, we want to find the ones that haven't had a particular event. Using a NOT IN, and ensuring that nulls don't appear in the inner results, the query looks like this:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     ID not in
        (
        select  USER_ID
        from    EVENTS
        where   TYPE = 7
        and     USER_ID is not null
        );

For my test dataset, the USERS table has 100,000 rows and the EVENTS table has 10,000,000, of which approximately 75% have a null USER_ID. I'm running on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD.

And I consistently get runtimes of around 2 minutes, which is … wow.

Let's replace that NOT IN with a NOT EXISTS and correlated subquery:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     not exists
        (
        select  1
        from    EVENTS
        where   USER_ID = USERS.ID
        and     TYPE = 7
        );

This version runs in 0.01 seconds, which is more what I expected.

Time to compare execution plans. The first plan is from the NOT IN query, the second is from the NOT EXISTS.

+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
| id | select_type        | table  | partitions | type           | possible_keys   | key             | key_len | ref  | rows | filtered | Extra                    |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
|  1 | PRIMARY            | USERS  | NULL       | range          | PRIMARY         | PRIMARY         | 4       | NULL |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | func |  195 |    10.00 | Using where              |
+----+--------------------+--------+------------+----------------+-----------------+-----------------+---------+------+------+----------+--------------------------+
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
| id | select_type        | table  | partitions | type  | possible_keys   | key             | key_len | ref              | rows | filtered | Extra                    |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+
|  1 | PRIMARY            | USERS  | NULL       | range | PRIMARY         | PRIMARY         | 4       | NULL             |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | ref   | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | example.USERS.ID |   97 |    10.00 | Using where              |
+----+--------------------+--------+------------+-------+-----------------+-----------------+---------+------------------+------+----------+--------------------------+

Almost identical: both select rows from the USERS table, and then use a nested-loops join (“dependent subquery”) to retrieve rows from the EVENTS table. Both claim to use EVENTS_USER_IDX to select rows in the subquery. And they estimate similar numbers of rows at each step.

But look more closely at the join types. The NOT IN version uses index_subquery, while the NOT EXISTS version uses ref. Also look at the ref column: the NOT EXISTS version uses an explicit reference to the outer column, while the NOT IN uses a function. What's going on here?

The index_subquery join type indicates that MySQL will scan the index to find relevant rows for the subquery. Could that be the problem? I don't think so, because the EVENTS_USER_IDX is “narrow”: it only has one column, so the engine should not have to read a lot of blocks to find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise this index, and all run in a few hundredths of a second).

For more information, I turned to the “extended” execution plan. To see this plan, prefix the query with explain extended, and follow it with show warnings. Here's what you get from the NOT IN query (reformatted for clarity):

/* select#1 */  select `example`.`USERS`.`ID` AS `ID` 
                from    `example`.`USERS` 
                where   ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768)) 
                        and (not((`example`.`USERS`.`ID`,
                            (((`example`.`USERS`.`ID`) in EVENTS 
                            on      EVENTS_USER_IDX checking NULL
                            where   ((`example`.`EVENTS`.`TYPE` = 7) 
                                    and (`example`.`EVENTS`.`USER_ID` is not null))
                            having (`example`.`EVENTS`.`USER_ID`)))))))

I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL” but here's what I think is happening: the optimizer believes that it is executing an IN query that can have NULL in the results; it does not consider the null check in the where clause when making this decision. As a result, it will examine the 7.5 million rows where USER_ID is null, along with the few dozen rows where it matches the values from the outer query. And by “examine,” I mean that it will read the table row and only then apply the is not null condition. Moreover, based on the time taken to run the query, I think it's doing this for every one of the candidate values in the outer query.

So, bottom line: whenever you're thinking of using an IN or NOT IN subquery on a nullable column, rethink and use EXISTS or NOT EXISTS instead.